hledger uses too much memory with large files [1mo] #275

Open
MarkusTeufelberger opened this Issue Jul 21, 2015 · 18 comments

@MarkusTeufelberger

MarkusTeufelberger commented Jul 21, 2015

I have a ledger of about 40 MB worth of files, and hledger 0.26 now suddenly chokes on them; the previous version worked fine (well, it also took waaay too long to calculate balances, but at least it worked):

hledger bal -f ./output/full.ledger
hledger: out of memory

full.ledger just imports a couple of other files (I have a collection of scripts that generate them from CSV files) that total about 37 MB. My machine has 16 GB of RAM, so available memory is unlikely to be the issue.



@simonmichael

simonmichael commented Jul 21, 2015

Very interesting, thanks for the report. (I'd be interested to see as much of the stats output as you'd care to show for that..)

@MarkusTeufelberger

MarkusTeufelberger commented Jul 22, 2015

Well, the stats command runs out of memory too... maybe I'll find a way to build a 64-bit binary on Windows.

On Linux I get the following error (same for stats and bal):
[me@mycomputer foldername]$ ~/.cabal/bin/hledger bal -f ./output/full.ledger
hledger: /path/to/foldername/./output/paypal_com.ledger: hGetContents: invalid argument (invalid byte sequence)

I suspect there is some Unicode issue on Linux (the file contains some Japanese characters as a PayPal username), not necessarily related to the OOM issue. Maybe it is my script that is not properly flagging the file encoding?

After manually saving the file as UTF-8 in a text editor it worked (both stats and bal), even though it takes a LONG time (a couple seconds at least) to produce the output:

[me@mycomputer foldername]$ ~/.cabal/bin/hledger stats -f ./output/full.ledger 
Main journal file        : /path/to/foldername/./output/full.ledger
Included journal files   : /path/to/foldername/./output/XXX.ledger           
                           /path/to/foldername/./output/YYY.ledger
                           etc. etc. etc.
Transactions span        : 2000-01-01 to 2015-07-22 (5681 days)
Last transaction         : 2015-07-21 (1 days ago)
Transactions             : 295923 (52.1 per day)
Transactions last 30 days: 176881 (5896.0 per day)
Transactions last 7 days : 38687 (5526.7 per day)
Payees/descriptions      : 411
Accounts                 : 218 (depth 5)
Commodities              : 8 (AAA, BBB, CCC, DDD, EEE, FFF, GGG, HHH)

A few things are censored, but the numbers are accurate. Real transactions don't really start in 2000, more like in 2010 or so - the few in 2000 just move "Equity" amounts into place to set correct opening balances. Most of this stuff is auto-generated from CSV files, and some of them are relatively verbose/microtransaction-heavy.

@simonmichael

simonmichael commented Jul 22, 2015

A quarter million transactions - very cool! Now I'm motivated to test with some larger data sets.

The Unicode thing can probably be fixed by setting LANG to an appropriate locale (cf. http://hledger.org/manual.html#run-time-problems). GHC-compiled software tends to fail this way when it encounters non-ASCII data and the system locale is not set. We should give a better error message (#73).
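
For illustration (an assumption about the environment - the exact locale name varies by system): running the command with a UTF-8 locale, e.g. LANG=en_US.UTF-8 hledger bal -f ./output/full.ledger, should avoid the "invalid byte sequence" error.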

@simonmichael

simonmichael commented Jul 22, 2015

Hmm. I didn't see this in pre-release testing, and I don't see it now with the simple test journals. Time is the same:

$ time tools/simplebench -n1 hledger-0.24 hledger-0.25 hledger-0.26
Using bench.tests
Running 6 tests 1 times with 3 executables at 2015-07-22 13:53:31.42249 UTC:

Summary (best iteration):

+------------------------------------++--------------+--------------+--------------+
|                                    || hledger-0.24 | hledger-0.25 | hledger-0.26 |
+====================================++==============+==============+==============+
| bal -f data/100x100x10.journal     ||         0.04 |         0.04 |         0.04 |
| bal -f data/1000x1000x10.journal   ||         0.21 |         0.25 |         0.24 |
| bal -f data/1000x10000x10.journal  ||         0.31 |         0.32 |         0.35 |
| bal -f data/10000x1000x10.journal  ||         1.76 |         1.72 |         1.85 |
| bal -f data/10000x10000x10.journal ||         3.18 |         3.38 |         3.58 |
| bal -f data/100000x1000x10.journal ||        18.69 |        20.74 |        19.79 |
+------------------------------------++--------------+--------------+--------------+


real    1m16.674s
user    1m10.822s
sys 0m3.867s

and more importantly, the heap profile looks the same:

0.25.1: [heap profile image: hledgerprof-0.25.1]

0.26: [heap profile image: hledgerprof-0.26]

What special features does your data have? Maybe an unusually high number of aliases, accounts, includes, or something else? Would it be possible to simplify it until you see the problem go away?

@simonmichael

simonmichael commented Jul 22, 2015

Actually, I meant to show profiles with 100k transactions, not 10k. But the main point is I see no change since 0.25, so I'm not sure why you would see it suddenly break with 0.26, unless it's borderline.

Here's one with 100k transactions:
[heap profile image: hledgerprof-0.26-100k]
I'm still learning about heap profiles - this one goes up to 600M, but I observed the process using 2.1G in activity monitor. So indeed, it might be that your 32-bit build is hitting its 2G limit.

@MarkusTeufelberger

MarkusTeufelberger commented Jul 22, 2015

Maybe it was/is borderline? After all, most transactions were added recently, and that might have tipped it over the edge no matter what. hledger.exe uses about 2 GiB of RAM and then dies OOM. I strongly suspect it is compiled as a 64-bit binary on Linux...

There are about a dozen files altogether (so about a dozen includes), and nearly all transactions are very simple (I don't duplicate amounts, though, so I generally have only one line with the amount and the second line has to auto-balance). As you can see there are a few hundred accounts, and a few of the transactions (e.g. PayPal ones) have payment IDs added as tags ("date (payment_id) description").

While it might be possible to simplify it (e.g. process data for one day, write it out, let (h)ledger/beancount process it and only store a single total transaction instead of one transaction per line), this would make the process much slower, the resulting file potentially more error prone and also potentially more incorrect. I'd prefer if it were possible to just take a CSV file, run my parser and have a text file ready for *ledger without adding "intelligence".

By the way, using >2 GB of memory on a <40 MB input consisting only of dates, amounts, descriptions and accounts seems a bit excessive...?

@simonmichael

simonmichael commented Jul 22, 2015

Yes, that seems to be it. hledger installed with the 32-bit Windows version of stack dies with "out of memory" as soon as it gets near 2G. hledger installed with the 64-bit Windows stack works, though (at least it has been using 3G on this 4G machine for 20 minutes on a million-transaction file, and I assume it would finish eventually). I recommend installing that way for now.

And yes, it's using much more memory than it should. We haven't paid attention to that until now, thanks for highlighting it.

@simonmichael

simonmichael commented Jul 23, 2015

I tried strictifying and unpacking fields in the main types (at last), just to see if there's any cheap improvement, but not really. (It's a small improvement, worth keeping I think even though I don't know what other effects this change could have):

Strictified: [heap profile image: hledgerprof-0.26-strictified]

Unpacked: [heap profile image: hledgerprof-0.26-unpacked]
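
As a rough illustration of what "strictifying and unpacking" means here, a minimal sketch with made-up field names and a simplified Double amount (not hledger's actual record definitions):

import Data.Text (Text)
import Data.Time.Calendar (Day)

data Posting = Posting
  { paccount :: !Text                  -- '!' makes the field strict: it is evaluated when the record is built
  , pamount  :: {-# UNPACK #-} !Double -- UNPACK stores the raw Double inside the constructor
  }

data Transaction = Transaction
  { tdate        :: !Day
  , tdescription :: !Text
  , tpostings    :: [Posting]          -- lists cannot be unpacked, so this stays an ordinary field
  }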

@simonmichael simonmichael changed the title from hledger 0.26 runs out of memory with large(?) ledger file(s) to hledger 0.26 runs out of memory with large(?) ledger file(s) (32 bit windows build exceeds 2G limit) Jul 23, 2015

@simonmichael simonmichael added this to the 1.0 milestone Sep 2, 2015

@simonmichael simonmichael added the journal label Sep 2, 2015

@simonmichael

simonmichael commented Oct 12, 2015

I have my suspicions about this loop, if anyone would care to look at this.

We parse the journal in a strange way, I think just for historical reasons - as a sequence of journal updates which we then apply one after the other. I think instead we should parse to some kind of AST, which among other things should make debugging the parser much easier.
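
To make the "sequence of journal updates" style concrete, here is a toy sketch (Journal, JournalUpdate and combineUpdates are illustrative names, not hledger's exact API): each parsed entry becomes a Journal -> Journal closure, and nothing is applied until the whole file has been read.

import Data.List (foldl')

newtype Journal = Journal { jtxns :: [String] } deriving Show

type JournalUpdate = Journal -> Journal

addTxn :: String -> JournalUpdate
addTxn desc j = j { jtxns = desc : jtxns j }

-- For a large file this builds a long chain of unapplied closures on the heap
-- before any journal is actually constructed.
combineUpdates :: [JournalUpdate] -> JournalUpdate
combineUpdates = foldl' (\acc u -> u . acc) id

main :: IO ()
main = print (combineUpdates (map addTxn ["lunch", "rent", "salary"]) (Journal []))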

@simonmichael

simonmichael commented Oct 12, 2015

Notes from #haskell:

<sm> I think I have a space leak of some kind here, can you tell ? http://lpaste.net/142800
<sm> heap profile at https://github.com/simonmichael/hledger/issues/275#issuecomment-147257314
<Gurkenglas> sm, would you try "ala Endo foldMap" instead of "foldl' (\acc new x -> new (acc x)) id" and see if that's better?
<sm> Gurkenglas: how's that ?                                   [17:12]
<Gurkenglas> "ala Endo foldMap" means "runEndo . foldMap Endo"  [17:13]
<sm> hmm I'll look at that, thanks
<lispy> sm: foldl' can only be strict if the function you give it is strict
<lispy> sm: And I suspect that's not the case here
<sm> lispy: probably not                                        [17:15]
<lispy> looking at the github issue now                         [17:16]
<Gurkenglas> (What I want to say is 'also consider "combineJournalUpdates = execStateT . traverse modifyT" with strict StateT', but modifyT isn't in the libraries.)
<Gurkenglas> I hereby submit a pull request request to get modifyT and getsT added to them, which I haven't done myself yet because there are two libraries that we might want to add it to in some order and I$
<Gurkenglas> (Afterwards, isos for modifyT and getsT and modify and gets so we can have "combineJournalUpdates = au _ModifiedT traverse")
<Gurkenglas> Although I guess the one for modify is already provided by Endo
<lispy> sm: have you tried making combineJournalUpdates lazier?
<lispy> sm: I don't know if sequence returns results lazily or not in this case, but it might be building the whole list ( [Journal -> Journal] ) and feeding that to foldl' at once.
<lispy> sm: also, can you upload your profiling data here: http://heap.ezyang.com/
<sm> lispy: I haven't, yet. All of these comments will be quite helpful when I next work on it. I'll do that and report back
<sm> lispy, Gurkenglas: thanks!
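
For reference, a minimal sketch of the "ala Endo foldMap" suggestion from the log, reusing the toy Journal/JournalUpdate types from the sketch above (again, not hledger's actual code); note that the composition order is the reverse of the foldl' version:

import Data.Monoid (Endo(..))

newtype Journal = Journal { jtxns :: [String] } deriving Show

type JournalUpdate = Journal -> Journal

-- foldMap wraps each update in Endo, the Endo monoid composes them with (.),
-- and appEndo unwraps the single composed function (the log's "runEndo").
combineUpdates :: [JournalUpdate] -> JournalUpdate
combineUpdates = appEndo . foldMap Endo
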
@MarkusTeufelberger

MarkusTeufelberger commented Mar 11, 2016

Any news on this? 64-bit hledger 0.26 meanwhile also fails to parse a meager ~65 MiB ledger file with 16 GB of RAM...

@simonmichael

simonmichael commented Mar 11, 2016

No news. How many transactions are in that file?

@MarkusTeufelberger

MarkusTeufelberger commented Mar 11, 2016

A bit more than half a million (about 520k).

@simonmichael

simonmichael commented Mar 11, 2016

Thanks for the data point, and I'm sorry I don't have more time to work on it right now - all help welcome.

@MarkusTeufelberger

MarkusTeufelberger commented Mar 11, 2016

I'd give it a try myself, but Haskell is unfortunately not a language I have any experience in. :-(

It is a mystery to me why it should scale so poorly or waste that much memory. Isn't that an issue/concern with smaller input files too (getting the balance of ~1100 transactions already takes a noticeable amount of time - less than a second though) or are typical computers just too fast for that to register?

@simonmichael

simonmichael commented Mar 11, 2016

It's simply using the non-optimal data types and algorithms that were easiest for me to get working. (I won't say specifically what's to blame, since I don't know for sure.) Nowadays we have Text instead of String, newer parser libraries (there's an unfinished port to megaparsec in the PR queue), better profiling tools, etc., but fixing it is a non-trivial time investment.

It hasn't been at the top of my queue because I split my files by year, so my typical queries are against fewer than 1k transactions and run in less than a second on a MacBook. I spend far more time entering data, creating reports, staring at the results, searching in the editor, etc. than waiting for reports to complete. I wasn't even particularly aware of it until somebody (you) reported it as a problem (thanks!). It is definitely a goal to see how much more efficient we can get on large data sets.

simonmichael added a commit that referenced this issue May 23, 2016

lib: simplify parsers; cleanups (#275)
The journal/timeclock/timedot parsers, instead of constructing (opaque)
journal update functions which are later applied to build the journal,
now construct the journal directly (by modifying the parser state). This
is easier to understand and debug. It also removes any possibility of
the journal updates being a space leak. (They weren't; in fact memory
usage is now slightly higher, but that will be addressed in other ways.)

Also:

Journal data and journal parse info have been merged into one type (for
now), and field names are more consistent.

The ParsedJournal type alias has been added to distinguish being-parsed
and finalised journals.

Journal is now a monoid.

stats: fixed an issue with ordering of include files

journal: fixed an issue with ordering of included same-date transactions

timeclock: sessions can no longer span file boundaries (unclocked-out
sessions will be auto-closed at the end of the file).

expandPath now throws a proper IO error (and requires the IO monad).
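
A toy illustration of the new approach described in this commit message - the parser adds entries to a journal carried in its state instead of returning update functions (names simplified, and State stands in for the real parser monad):

import Control.Monad.State.Strict

newtype Journal = Journal { jtxns :: [String] } deriving Show

type JournalParser a = State Journal a   -- stand-in for the real parser type

addTxn :: String -> JournalParser ()
addTxn desc = modify' (\j -> j { jtxns = desc : jtxns j })  -- strict modify: the journal is updated as we go

main :: IO ()
main = print (execState (mapM_ addTxn ["lunch", "rent", "salary"]) (Journal []))
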
@simonmichael

simonmichael commented May 25, 2016

hledger now mostly uses Text instead of String, which has reduced memory usage by ~30% with large files. For the record I think there is no space leak here, rather it's that our nice intuitive in-memory representation of the journal and transactions is quite large and redundant.
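
Rough intuition for the Text win (illustrative, not a measurement): a String is a lazy cons list holding a boxed Char per character - several machine words each - while Text keeps the characters in one packed buffer.

import qualified Data.Text as T

descriptionAsText :: T.Text
descriptionAsText = T.pack "2015-07-21 PayPal purchase"   -- one contiguous buffer

descriptionAsString :: String
descriptionAsString = "2015-07-21 PayPal purchase"        -- a cons cell plus a boxed Char per character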

@simonmichael

simonmichael commented May 27, 2016

Latest heap profile. More allocation, but lower maximum residency. (And slower according to this chart, but benchmarks say it's actually a little faster. Odd.)
[heap profile image: hledger-f3bf98b-201605262149]

@simonmichael simonmichael changed the title from hledger 0.26 runs out of memory with large(?) ledger file(s) (32 bit windows build exceeds 2G limit) to hledger uses too much memory with large files May 27, 2016

@simonmichael simonmichael changed the title from hledger uses too much memory with large files to hledger uses too much memory with large files [1mo] Jul 18, 2016

@simonmichael simonmichael modified the milestone: post 1.0 Oct 31, 2016

@simonmichael simonmichael added this to the 1.1 milestone Dec 29, 2016
