Add memory usage information #2

Open
nemequ opened this Issue Mar 30, 2015 · 10 comments

nemequ commented Mar 30, 2015

The obvious route for measuring heap usage (fork() and wait3()) also has some issues: preexisting freelists in malloc implementations, fragmentation, and malloc requesting more memory than it needs (e.g., the next highest power of two, or a multiple of the page size).

I think the only way to do this accurately would be to override malloc/realloc/free/new/delete/mmap, but I still need to find a reliable solution for measuring the stack size.
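Something like this, sketched with the linker's --wrap feature (squash_peak_heap_bytes is an illustrative accessor name, not existing API; realloc, calloc, and new would need the same treatment, and it isn't thread-safe as written):

```c
/* Sketch only: heap high-water tracking via ld's --wrap feature.
 * Build with: gcc ... -Wl,--wrap=malloc,--wrap=free
 * Not thread-safe; realloc/calloc/new need the same treatment. */
#include <malloc.h>   /* malloc_usable_size (glibc) */
#include <stddef.h>

void *__real_malloc(size_t size);   /* provided by the linker */
void  __real_free(void *ptr);

static size_t current_bytes;
static size_t peak_bytes;

void *__wrap_malloc(size_t size) {
  void *ptr = __real_malloc(size);
  if (ptr != NULL) {
    /* Count what malloc actually reserved, not just what was asked for. */
    current_bytes += malloc_usable_size(ptr);
    if (current_bytes > peak_bytes)
      peak_bytes = current_bytes;
  }
  return ptr;
}

void __wrap_free(void *ptr) {
  if (ptr != NULL)
    current_bytes -= malloc_usable_size(ptr);
  __real_free(ptr);
}

/* Hypothetical accessor the benchmark could call after each run. */
size_t squash_peak_heap_bytes(void) {
  return peak_bytes;
}
```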


clbr commented Sep 23, 2015

I'm attaching a sample implementation for Linux. It covers stack, heap, and mmap. To avoid page-size alignment overhead you would need valgrind's massif, but I'd argue that overhead should be included.

Almost everyone uses 4 KiB pages, and if a codec makes many small allocations, it will incur that overhead in real usage as well.

http://pastebin.ca/3171788
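For reference, a minimal sketch of the kind of /proc/self/statm read being described (this is an illustration, not the code from the paste):

```c
/* Illustration only -- not the code from the paste above. Reads this
 * process's current memory usage from /proc/self/statm; the first two
 * fields are total program size and resident set size, in pages. */
#include <stdio.h>
#include <unistd.h>

int print_mem_usage(void) {
  unsigned long size_pages, resident_pages;
  long page_size = sysconf(_SC_PAGESIZE);
  FILE *f = fopen("/proc/self/statm", "r");

  if (f == NULL)
    return -1;
  if (fscanf(f, "%lu %lu", &size_pages, &resident_pages) != 2) {
    fclose(f);
    return -1;
  }
  fclose(f);

  printf("virtual: %lu KiB, resident: %lu KiB\n",
         (size_pages * page_size) / 1024,
         (resident_pages * page_size) / 1024);
  return 0;
}
```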


nemequ commented Sep 23, 2015

Thanks, but AFAIK statm provides an instantaneous measurement. We can hardly ask each codec to call this code at the exact point when it happens to be using the most memory; we need a high-water measurement. It also suffers from all the problems I mentioned in the original report.

clbr commented Sep 23, 2015


nemequ commented Sep 23, 2015

Using /proc/$PID/status also suffers from all the problems I mentioned in the original report. I think it is much better to provide no number than a wildly inaccurate one. An inaccurate number could lead people to the wrong conclusions, when in reality they could and should just test the codecs they are interested in within their own software to see whether they perform as needed. Squash even makes this trivial; switching codecs typically just requires changing a single string.

Massif would kill performance, which is far more important to most people than memory usage.
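For context, the field under discussion: the kernel tracks a per-process peak RSS as VmHWM in /proc/$PID/status. A minimal sketch of reading it (illustrative only; as argued above, the number still includes allocator pooling, alignment overhead, and everything else mapped into the process):

```c
/* Reads the VmHWM line (peak resident set size) from /proc/self/status.
 * Returns the value in KiB, or -1 on error. */
#include <stdio.h>
#include <string.h>

long read_vmhwm_kib(void) {
  char line[256];
  long kib = -1;
  FILE *f = fopen("/proc/self/status", "r");

  if (f == NULL)
    return -1;
  while (fgets(line, sizeof line, f) != NULL) {
    if (strncmp(line, "VmHWM:", 6) == 0) {
      sscanf(line + 6, "%ld", &kib);
      break;
    }
  }
  fclose(f);
  return kib;
}
```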

clbr commented Sep 23, 2015


nemequ commented Sep 23, 2015

AFAIK it would require a significant rewrite of the benchmark, since it would have to fork() and exec() valgrind's massif tool, and a second executable would need to be created to actually run the benchmark. That's a pretty big effort for a non-default option.

Also, the data wouldn't be included in the web interface, since it would simply be too slow for me to keep running the benchmark. On the fastest computer it already takes almost 24 hours, and the slowest computer takes a few hours shy of a week. IIRC massif usually slows things down by about an order of magnitude… I can't give up the computers I actually use for two weeks, and I can't wait two months for results from the slower machines.
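Roughly what that rewrite would involve, as a sketch (./squash-benchmark-runner stands in for the hypothetical second executable):

```c
/* Sketch of the hypothetical massif-based setup: fork and exec
 * valgrind's massif tool around a separate single-run executable.
 * "./squash-benchmark-runner" does not exist; it stands in for the
 * second executable described above. */
#include <sys/wait.h>
#include <unistd.h>

int run_under_massif(const char *codec) {
  int status;
  pid_t pid = fork();

  if (pid == 0) {
    execlp("valgrind", "valgrind", "--tool=massif",
           "./squash-benchmark-runner", codec, (char *) NULL);
    _exit(127);  /* exec failed */
  }
  if (pid < 0 || waitpid(pid, &status, 0) < 0)
    return -1;
  return status;  /* massif's output lands in massif.out.<pid> */
}
```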


r-lyeh-archived commented Sep 23, 2015

Are benchmarks run on Linux? If so, an LD_PRELOAD'd dlmalloc with a few tweaks could capture total RAM consumption and its peaks.
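The mechanics of that approach, sketched with plain interposition rather than dlmalloc (build as a shared object, then point LD_PRELOAD at it; a robust version must also wrap free/realloc/calloc and cope with dlsym allocating during bootstrap):

```c
/* Sketch of LD_PRELOAD interposition (dlmalloc itself not shown).
 * Build: gcc -shared -fPIC -o track.so track.c -ldl
 * Run:   LD_PRELOAD=./track.so ./benchmark ...
 * Only malloc is wrapped here; free/realloc/calloc need the same
 * treatment, and dlsym's own early allocations need special care. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <malloc.h>
#include <stddef.h>

static void *(*real_malloc)(size_t);
static size_t current_bytes, peak_bytes;

void *malloc(size_t size) {
  void *ptr;

  if (real_malloc == NULL)
    real_malloc = (void *(*)(size_t)) dlsym(RTLD_NEXT, "malloc");

  ptr = real_malloc(size);
  if (ptr != NULL) {
    current_bytes += malloc_usable_size(ptr);
    if (current_bytes > peak_bytes)
      peak_bytes = current_bytes;
  }
  return ptr;
}
```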


nemequ commented Sep 23, 2015

Yes, they are currently run exclusively on Linux. I don't think LD_PRELOAD would be necessary; you could get the same effect from a glibc malloc hook.

Unfortunately it would miss memory allocated by C++'s new keyword (several plugins use it). It also wouldn't take into account plugins which use buffers on the stack. Finally, it would miss anonymous mappings from mmap and other allocators, but honestly I don't think that is a problem; I haven't done an exhaustive search but I'm not aware of any plugins which use either.

To be viable I think we need to be able to measure the high-water mark for:

  • malloc/realloc/calloc
  • new in C++
  • stack size

Without having a significant effect on performance.
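A sketch of the malloc-hook idea for the first bullet (__malloc_hook and __free_hook existed in glibc at the time, though already deprecated and not thread-safe; realloc and calloc hooks follow the same pattern, and the stack remains uncovered):

```c
/* Sketch of heap high-water tracking via glibc's (deprecated, since
 * removed) malloc hooks. Not thread-safe; realloc/calloc hooks would
 * follow the same save/restore pattern. */
#include <malloc.h>

static void *(*old_malloc_hook)(size_t, const void *);
static void (*old_free_hook)(void *, const void *);
static size_t current_bytes, peak_bytes;

static void *counting_malloc(size_t size, const void *caller);
static void counting_free(void *ptr, const void *caller);

static void hooks_off(void) {
  __malloc_hook = old_malloc_hook;
  __free_hook = old_free_hook;
}

static void hooks_on(void) {
  __malloc_hook = counting_malloc;
  __free_hook = counting_free;
}

static void *counting_malloc(size_t size, const void *caller) {
  void *ptr;
  (void) caller;
  hooks_off();  /* avoid recursing into our own hook */
  ptr = malloc(size);
  if (ptr != NULL) {
    current_bytes += malloc_usable_size(ptr);
    if (current_bytes > peak_bytes)
      peak_bytes = current_bytes;
  }
  hooks_on();
  return ptr;
}

static void counting_free(void *ptr, const void *caller) {
  (void) caller;
  hooks_off();
  if (ptr != NULL)
    current_bytes -= malloc_usable_size(ptr);
  free(ptr);
  hooks_on();
}

static void install_hooks(void) {
  old_malloc_hook = __malloc_hook;
  old_free_hook = __free_hook;
  hooks_on();
}
```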


travisdowns commented Sep 26, 2015

I feel like launching a process per codec run and using the OS high-water counters is probably the most complete and promising approach. As you point out, though, that is a lot of work, although the process-per-run model may have other advantages too, such as being able to read the /proc numbers to learn other interesting things.
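The simplest form of those OS counters, sketched below: wait4() hands back the child's rusage, whose ru_maxrss field is the kernel's peak-RSS figure (run_one_codec() is a hypothetical stand-in for a single benchmark run):

```c
/* Sketch of the process-per-run idea: fork a child for one codec run,
 * then read the kernel's high-water RSS counter via wait4()'s rusage.
 * run_one_codec() is a hypothetical stand-in for the real benchmark. */
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

extern void run_one_codec(void);  /* hypothetical single-run entry point */

long measure_codec_peak_rss_kib(void) {
  struct rusage usage;
  int status;
  pid_t pid = fork();

  if (pid == 0) {           /* child: do the run, then exit immediately */
    run_one_codec();
    _exit(EXIT_SUCCESS);
  }
  if (pid < 0 || wait4(pid, &status, 0, &usage) < 0)
    return -1;
  return usage.ru_maxrss;   /* peak resident set size, in KiB on Linux */
}
```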


nemequ commented Sep 26, 2015

My main concern with that is that memory malloc already has sitting in a pool when the process starts wouldn't be counted. For codecs which require a lot of memory that wouldn't be a big deal, but for codecs which require very little, the pool could account for everything they use.
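The pooling effect is easy to demonstrate with glibc's mallinfo(); in this sketch, roughly a megabyte of freed memory remains on the allocator's free lists rather than going back to the kernel:

```c
/* Illustrating the pooling concern with glibc's mallinfo(): after a
 * burst of small allocations is freed, the memory typically stays in
 * the allocator's free lists (fordblks) instead of returning to the
 * kernel, so later allocations can be served from it "for free". */
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  void *blocks[1024];
  int i;

  for (i = 0; i < 1024; i++)
    blocks[i] = malloc(1024);     /* ~1 MiB in small blocks */
  for (i = 0; i < 1024; i++)
    free(blocks[i]);

  /* fordblks: bytes in free blocks still held by the allocator. */
  printf("pooled free bytes: %d\n", mallinfo().fordblks);
  return 0;
}
```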
