Loading a 100Mb .mat file produces peak RSS of 20Gb #55

Closed
vadimkantorov opened this issue Mar 4, 2017 · 10 comments

@vadimkantorov

vadimkantorov commented Mar 4, 2017

I'm using matio 1.5.10 and matio-ffi.torch. I have a 100 MB file that makes matio allocate a suspiciously large amount of memory:

du -h SelectiveSearchVOC2007trainval.mat.edgeboxes.mat
# 93M     SelectiveSearchVOC2007trainval.mat.edgeboxes.mat

/usr/bin/time -f %M th -e '(require "matio").load("SelectiveSearchVOC2007trainval.mat.edgeboxes.mat")' 
# 20407944 KiB

Probably I'm missing something obvious, but such memory consumption seems a little fishy to me. Doing the same with matdump gives:

/usr/bin/time -f %M matdump SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 5102308

Does this discrepancy of 5 GB vs 20 GB mean that matio-ffi.torch is using matio sub-optimally?

log.txt contains the substring Empty many times:

grep Empty log.txt | wc -l
# 50210218

wc -l log.txt
# 50230273 log.txt

File uploaded to my OneDrive: https://1drv.ms/u/s!Apx8USiTtrYmprRlRQmgSbPJNcWzEw

@tbeu
Owner

tbeu commented Mar 4, 2017

There are 2 variables with 5011x5011 cells each. Every cell allocates 80 bytes for the matvar_t struct, whether or not the cell is empty. Thus 2*(5011*5011 + 1)*80 bytes = roughly 3.74 GiB get allocated just for the data structures (not yet including the cell data). It is this per-cell overhead that causes the high memory consumption.

As a hack, the empty cells can be freed by

diff --git a/src/mat5.c b/src/mat5.c
index 075d3d2..cd281f6 100644
--- a/src/mat5.c
+++ b/src/mat5.c
@@ -1792,6 +1792,8 @@ ReadNextCell( mat_t *mat, matvar_t *matvar )
             nbytes = uncomp_buf[1];
             if ( !nbytes ) {
                 /* empty cell */
+                Mat_VarFree(cells[i]);
+                cells[i] = NULL;
                 continue;
             } else if ( uncomp_buf[0] != MAT_T_MATRIX ) {
                 Mat_VarFree(cells[i]);

which helps in your case. But technically the result is no longer the same, and I do not feel comfortable committing this change. I'd rather recommend that you get rid of the high-dimensional cell array.

@tbeu
Owner

tbeu commented Mar 4, 2017

Just noticed that when MATLAB reads such a cell array with empty cells, it does not allocate the usual array header (overhead). Hence, I will think about the hack and its consequences for, e.g., Mat_VarSize or Mat_VarPrint.

@vadimkantorov
Author

vadimkantorov commented Mar 5, 2017

Thanks for the very fast response!

Now I see. I will strip the extra empty dimensions. Somehow boxes{1} still returns the cell value, while boxes{1, 1} returns an empty cell (boxes is one of the two variables).

The issue might still be troublesome in an adversarial / DoS setting.

tbeu added a commit that referenced this issue Mar 5, 2017
* Memory optimization: Only allocate one empty field or cell per struct or cell array, respectively
* Use reference counter num_empty of internal structure to keep track of number of referenced empty fields or cells
* As reported by #55
@tbeu
Owner

tbeu commented Mar 5, 2017

Can you please test and confirm that 464de5c also solves the issue for matio-ffi.torch? Thanks.

@vadimkantorov
Author

matio-ffi.torch still eats 4x more memory (matdump peaks at 400 MB, matio-ffi.torch at 1600 MB), but it's no longer a problem for my case. I guess fixing it would require digging deeper into matio-ffi.torch. Thanks for the quick patch!

@tbeu
Owner

tbeu commented Mar 6, 2017

Do you know whether the load function of matio-ffi.torch loads only the variable info, or the complete variable including its data? In the latter case you would need to compare with matdump -d, which also retrieves the data (and doubles the memory consumption for your file).

/usr/bin/time -f %M ./tools/matdump SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 391452
/usr/bin/time -f %M ./tools/matdump -d SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 779612

@tbeu
Owner

tbeu commented Mar 6, 2017

I also noticed that the test suite currently lacks struct/cell arrays with empty fields or cells. I'll add these cases.

@vadimkantorov
Author

vadimkantorov commented Mar 6, 2017

It definitely loads the data. What's more, from what I can see, matio-ffi.torch copies the data, hence potentially one more doubling.

@tbeu
Owner

tbeu commented Mar 6, 2017

Finally, this explains the doubled memory consumption of matio-ffi.torch w.r.t. matdump -d. Thanks.

tbeu added a commit that referenced this issue Oct 21, 2017
…s from v5 MAT file

* Memory optimization: Free internal struct member, which is unused for empty variables
* As reported by #55
@tbeu
Owner

tbeu commented Oct 21, 2017

Performance comparison

1. Matio v1.5.10

7751748 (last commit before dd1d2cd), so basically matio v1.5.10 (and earlier)

/usr/bin/time -f "%es %MK" ./tools/matdump SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 474.62s 3639144K
/usr/bin/time -f "%es %MK" ./tools/matdump -d SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 752.46s 3629448K

2. Upcoming matio v1.5.11

dd1d2cd as part of upcoming matio v1.5.11

/usr/bin/time -f "%es %MK" ./tools/matdump SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 34.50s 2744772K
/usr/bin/time -f "%es %MK" ./tools/matdump -d SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 56.12s 3132860K

For memory usage, this is simply a compromise between performance and backward compatibility. But I was surprised by the observed speed improvement of more than one order of magnitude.

3. Even better

The best values are obtained if empty cells are freed again, as in the hack mentioned in #55 (comment). However, such a patch would not be backward-compatible: API functions like Mat_VarPrint or Mat_VarWrite would then produce different output.

/usr/bin/time -f "%es %MK" ./tools/matdump SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 18.17s 391072K
/usr/bin/time -f "%es %MK" ./tools/matdump -d SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 40.91s 779360K

tbeu closed this as completed Oct 21, 2017
papadop pushed a commit to papadop/matio that referenced this issue Nov 29, 2017
…s from v5 MAT file

* Memory optimization: Free internal struct member, which is unused for empty variables
* As reported by tbeu#55