hledger uses too much memory with large files [1mo] #275

Open
MarkusTeufelberger opened this Issue Jul 21, 2015 · 18 comments

Projects

None yet

2 participants

@MarkusTeufelberger
MarkusTeufelberger commented Jul 21, 2015 edited

I have a ledger of about 40 MB worth of files, hledger 0.26 now suddenly chokes on them, the previous version worked fine (well it also took waaay to long to calculate balances, but at least it worked):

hledger bal -f ./output/full.ledger
hledger: out of memory

full.ledger just imports a couple of other files (I have a collection of scripts that generate them from CSV files) that total to 37 MB. My machine has 16 GB of RAM, so this is not very likely an issue.

@simonmichael
Owner

Very interesting, thanks for the report. (I'd be interested to see as much of the stats output as you'd care to show for that..)

@MarkusTeufelberger

Well, the stats command runs out of memory too... maybe I'll find a way to build a 64 bit binary on Windows.

On Linux I get the following error (same for stats and bal):
[me@mycomputer foldername]$ ~/.cabal/bin/hledger bal -f ./output/full.ledger
hledger: /path/to/foldername/./output/paypal_com.ledger: hGetContents: invalid argument (invalid byte sequence)

I suspect there is some issue with Unicode in Linux (the file contains some japanese charaters as PayPal username), not necessarily related to the OOM issue. Maybe it could be my script that is not properly flagging the file encoding?

After manually saving the file as UTF-8 in a text editor it worked (both stats and bal), even though it takes a LONG time (a couple seconds at least) to produce the output:

[me@mycomputer foldername]$ ~/.cabal/bin/hledger stats -f ./output/full.ledger 
Main journal file        : /path/to/foldername/./output/full.ledger
Included journal files   : /path/to/foldername/./output/XXX.ledger           
                           /path/to/foldername/./output/YYY.ledger
                           etc. etc. etc.
Transactions span        : 2000-01-01 to 2015-07-22 (5681 days)
Last transaction         : 2015-07-21 (1 days ago)
Transactions             : 295923 (52.1 per day)
Transactions last 30 days: 176881 (5896.0 per day)
Transactions last 7 days : 38687 (5526.7 per day)
Payees/descriptions      : 411
Accounts                 : 218 (depth 5)
Commodities              : 8 (AAA, BBB, CCC, DDD, EEE, FFF, GGG, HHH)

A few things are censored, the numbers are accurate though. Real transactions don't really start in 2000, more like in 2010 or so - the few ones in 2000 are moving "Equity" amounts into place for having correct opening balances. Most of this stuff is auto-generated from CSV files and some of them are relatively verbose/microtransaction heavy.

@simonmichael
Owner

A quarter million transactions - very cool! Now I'm motivated to test with some larger data sets.

The unicode thing can probably be fixed by setting LANG to an appropriate locale (cf http://hledger.org/manual.html#run-time-problems). GHC-compiled software tends to fail this way when it encounters non-ascii data and the system locale is not set. We should give a better error message (#73).

@simonmichael
Owner

Hmm. I didn't see this in pre-release testing, and I don't see it now with the simple test journals. Time is the same:

$ time tools/simplebench -n1 hledger-0.24 hledger-0.25 hledger-0.26
Using bench.tests
Running 6 tests 1 times with 3 executables at 2015-07-22 13:53:31.42249 UTC:

Summary (best iteration):

+------------------------------------++--------------+--------------+--------------+
|                                    || hledger-0.24 | hledger-0.25 | hledger-0.26 |
+====================================++==============+==============+==============+
| bal -f data/100x100x10.journal     ||         0.04 |         0.04 |         0.04 |
| bal -f data/1000x1000x10.journal   ||         0.21 |         0.25 |         0.24 |
| bal -f data/1000x10000x10.journal  ||         0.31 |         0.32 |         0.35 |
| bal -f data/10000x1000x10.journal  ||         1.76 |         1.72 |         1.85 |
| bal -f data/10000x10000x10.journal ||         3.18 |         3.38 |         3.58 |
| bal -f data/100000x1000x10.journal ||        18.69 |        20.74 |        19.79 |
+------------------------------------++--------------+--------------+--------------+


real    1m16.674s
user    1m10.822s
sys 0m3.867s

and more importantly, the heap profile looks the same:

0.25.1:
hledgerprof 0 25 1

0.26:
hledgerprof 0 26

What special features does your data have ? Maybe an unusually high number of aliases, accounts, includes, or something else ? Would it be possible to simplify it until you see the problem go away ?

@simonmichael
Owner

Actually, I meant to show profiles with 100k transactions, not 10k. But the main point is I see no change since 0.25, so I'm not sure why you would see it suddenly break with 0.26, unless it's borderline.

Here's one with 100k transactions:
hledgerprof-0 26-100k
I'm still learning about heap profiles - this one goes up to 600M, but I observed the process using 2.1G in activity monitor. So indeed, it might be that your 32-bit build is hitting its 2G limit.

@MarkusTeufelberger

Maybe it was/is borderline? After all most transactions were added recently and that might have tipped it over the edge no matter what? hledger.exe uses about 2 GiB of RAM and then dies OOM. I highly suspect that it is compiled to a 64bit binary in Linux...

There are about a dozen files alltogether (so about a dozen includes), nearly all transactions are very simple (I don't duplicate amounts though, so I generally only have one line with the amount and the second line has to auto-balance). As you can see there are a few hundred accounts and a few of the transactions (e.g. PayPal ones) have payment IDs added as tags ("date (payment_id) description").

While it might be possible to simplify it (e.g. process data for one day, write it out, let (h)ledger/beancount process it and only store a single total transaction instead of one transaction per line), this would make the process much slower, the resulting file potentially more error prone and also potentially more incorrect. I'd prefer if it were possible to just take a CSV file, run my parser and have a text file ready for *ledger without adding "intelligence".

By the way, using >2 GB of memory on a <40 MB input consisting only of dates, amounts, descriptions and accounts seems a bit excessive...?

@simonmichael
Owner

Yes that seems to be it. hledger installed with the 32-bit windows version of stack dies with "out of memory" as soon as it gets near 2G. hledger installed with the 64-bit windows stack works though (at least it has been using 3G on this 4G machine for 20m on a million-transaction file and I assume it would finish eventually). I recommend installing that way for now.

And yes, it's using much more memory than it should. We haven't paid attention to that until now, thanks for highlighting it.

@simonmichael
Owner

I tried strictifying and unpacking fields in the main types (at last), just to see if there's any cheap improvement, but not really. (It's a small improvement, worth keeping I think even though I don't know what other effects this change could have):

Strictified:
hledgerprof-0 26-strictified
Unpacked:
hledgerprof-0 26-unpacked

@simonmichael simonmichael changed the title from hledger 0.26 runs out of memory with large(?) ledger file(s) to hledger 0.26 runs out of memory with large(?) ledger file(s) (32 bit windows build exceeds 2G limit) Jul 23, 2015
@simonmichael simonmichael added this to the 1.0 milestone Sep 2, 2015
@simonmichael simonmichael added the journal label Sep 2, 2015
@simonmichael
Owner

I have my suspicions about this loop, if anyone would care to look at this.

We parse the journal in a strange way, I think just for historical reasons - as a sequence of journal updates which we then apply one after the other. I think instead we should parse to some kind of AST, which among other things should make debugging the parser much easier.

@simonmichael
Owner

Notes from #haskell:

<sm> I think I have a space leak of some kind here, can you tell ? http://lpaste.net/142800
<sm> heap profile at https://github.com/simonmichael/hledger/issues/275#issuecomment-147257314
<Gurkenglas> sm, would you try "ala Endo foldMap" instead of "foldl' (\acc new x -> new (acc x)) id" and see if that's better?
<sm> Gurkenglas: how's that ?                                   [17:12]
<Gurkenglas> "ala Endo foldMap" means "runEndo . foldMap Endo"  [17:13]
<sm> hmm I'll look at that, thanks
<lispy> sm: foldl' can only be strict if the function you give it is strict
<lispy> sm: And I suspect that's not the case here
<sm> lispy: probably not                                        [17:15]
<lispy> looking at the github issue now                         [17:16]
<Gurkenglas> (What I want to say is 'also consider "combineJournalUpdates = execStateT . traverse modifyT" with strict StateT', but modifyT isn't in the libraries.)
<Gurkenglas> I hereby submit a pull request request to get modifyT and getsT added to them, which I haven't done myself yet because there are two libraries that we might want to add it to in some order and I$
<Gurkenglas> (Afterwards, isos for modifyT and getsT and modify and gets so we can have "combineJournalUpdates = au _ModifiedT traverse")
<Gurkenglas> Although I guess the one for modify is already provided by Endo
<lispy> sm: have you tried making combineJournalUpdates lazier?
<lispy> sm: I don't know if sequence returns results lazily or not in this case, but it might be building the whole list ( [Journal -> Journal] ) and feeding that to foldl' at once.
<lispy> sm: also, can you upload your profiling data here: http://heap.ezyang.com/
<sm> lispy: I haven't, yet. All of these comments will be quite helpful when I next work on it. I'll do that and report back
<sm> lispy, Gurkenglas: thanks!
@MarkusTeufelberger

Any news on this? 64-bit hledger 0.26 meanwhile also fails to parse a meager ~65 MiB ledger file with 16 GB of RAM...

@simonmichael
Owner

No news. How many transactions in that file ?

@MarkusTeufelberger

A bit more than half a million (about 520k).

@simonmichael
Owner

Thanks for the data point, and I'm sorry I don't have more time to work on it right now - all help welcome.

@MarkusTeufelberger

I'd give it a try myself, but Haskell is unfortunately not a language I have any experience in. :-(

It is a mystery to me why it should scale so poorly or waste that much memory. Isn't that an issue/concern with smaller input files too (getting the balance of ~1100 transactions already takes a noticeable amount of time - less than a second though) or are typical computers just too fast for that to register?

@simonmichael
Owner

It's simply using the easy but non-optimal data types and algorithms that were easiest for me to get working. (I won't say specifically what's to blame, since I don't know for sure.) Nowadays we have Text instead of String, newer parser libraries (we have an unfinished port to megaparsec in the PR queue), better profiling tools etc., but fixing it is a non-trivial time investment.

It hasn't been top of my queue because I split files by year, so my typical queries are against < 1k transactions and run in less than a second on a macbook. I spend far more time entering data, creating reports, staring at the results, searching in the editor etc. than waiting for reports to complete. I wasn't even particularly aware of it until somebody (you) reported it as a problem (thanks!). It is definitely a goal to see how much more efficient we can get on large data sets.

@simonmichael simonmichael added a commit that referenced this issue May 23, 2016
@simonmichael lib: simplify parsers; cleanups (#275)
The journal/timeclock/timedot parsers, instead of constructing (opaque)
journal update functions which are later applied to build the journal,
now construct the journal directly (by modifying the parser state). This
is easier to understand and debug. It also removes any possibility of
the journal updates being a space leak. (They weren't, in fact memory
usage is now slightly higher, but that will be addressed in other ways.)

Also:

Journal data and journal parse info have been merged into one type (for
now), and field names are more consistent.

The ParsedJournal type alias has been added to distinguish being-parsed
and finalised journals.

Journal is now a monoid.

stats: fixed an issue with ordering of include files

journal: fixed an issue with ordering of included same-date transactions

timeclock: sessions can no longer span file boundaries (unclocked-out
sessions will be auto-closed at the end of the file).

expandPath now throws a proper IO error (and requires the IO monad).
0f5ee15
@simonmichael
Owner

hledger now mostly uses Text instead of String, which has reduced memory usage by ~30% with large files. For the record I think there is no space leak here, rather it's that our nice intuitive in-memory representation of the journal and transactions is quite large and redundant.

@simonmichael
Owner
simonmichael commented May 27, 2016 edited

Latest heap profile. More allocation, but lower maximum residency. (And slower according to this chart, but benchmarks say it's actually a little faster. Odd.)
hledger-f3bf98b-201605262149

@simonmichael simonmichael changed the title from hledger 0.26 runs out of memory with large(?) ledger file(s) (32 bit windows build exceeds 2G limit) to hledger uses too much memory with large files May 27, 2016
@simonmichael simonmichael changed the title from hledger uses too much memory with large files to hledger uses too much memory with large files [1mo] Jul 18, 2016
@simonmichael simonmichael modified the milestone: post 1.0 Oct 31, 2016
@simonmichael simonmichael modified the milestone: 1.1 Dec 29, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment