Pausing the parser? #67
Comments
Hi, the parser has two APIs: a callback-based one, which holds the records in memory to supply them to the final callback, and a stream-based one, which doesn't hold anything in memory (unless I made a stupid mistake somewhere). You seem to be using the stream approach, so this isn't normal. This is the first time (after many years, this module is maybe 5 years old) that I've heard of a memory issue. Could you make sure this comes from the parser, and if so, could you provide me with a sample file that I could download and test to reproduce?
Hi David, I am pretty sure it's happening in the parser. At first I did not doubt the parser. In my complex app, here is the CPU/memory usage when parsing starts (I know parsing starts based on a log message). I am running `top | grep node`. From the output below you can see the climb; the code being executed is the code I posted above. Memory usage shoots up from 800-odd MB (which is also big) to 1.9 GB. At that stage node is still running at 100% CPU but the parsing is done; I am assuming it is the garbage collector, and after a few minutes it drops memory usage to 1.2 GB, as shown below:

```
8781 chandva 20 0 1817m 889m 9168 S 87.1 5.6 2:26.06 node
```

This is OK in this scenario, but in some cases it touches 2.0 GB before crashing with no memory. The most interesting bit is that this always happens when the parser is on. I do some memory-intensive ops after the parser is done (when memory is at 1.2 GB), and my operation makes memory climb to 2.3 GB and then clears up after the op is done. So my question is: why does my memory-intensive op allow memory to climb to 2.3 GB, while the parser craps out by 1.9 GB? The only difference is CPU usage: during my memory-intensive ops the CPU is around 20%, during parsing it is a constant 100%. So I suspect it has something to do with that.

Now, you could clearly doubt my app on this, so I wrote some simple code to test the above (the snippet came through truncated):

```js
var destination = 'tmp/myfile.csv'
// The listen port for the API
var api_certs = {
function parse_csv_internal_csv_stream(parseFile, onComplete){
}
function parse_csv(csv_data) {
https.createServer(api_certs, function (req, res) {
var parseData = undefined;
parse_csv(destination)
```

This is a very simple simulation, and here is what I found on my laptop (OSX):

```
4812 node 0.0 01:01.71 11 0 43 1500M 0B 6680K 4812 344 sleeping *0[1]
```

Could it be that all the size is merely `parseData`?
OK, then I must be doing something stupid that wasn't there before.
Do you want my CSV file? I have to sanitize it before giving it to you, or all the powers that rule me will fire me from my job :-)
I'll try to save you this effort; give me a chance to replicate this before I confirm.
Can you try this or something similar? From the CoffeeScript REPL, in the parent directory where the csv modules are git cloned:

```coffee
fs = require 'fs'
stringify = require './csv-stringify'
parse = require './csv-parse'
generate = require './csv-generate'
generate(duration: 3600*1000).pipe(parse()).pipe(stringify()).pipe(fs.createWriteStream('./test.csv'))
```

I've produced a 2 GB file with constant memory usage: real memory size of 100 MB, virtual memory size of 3 GB.
How are you calculating the memory? Does it have to be the coffee REPL? And can I use my own CSV instead of generating a new one?
to calculate memory, you can use
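The comment above is cut off in the thread, so the tool it recommended is unknown. One built-in option for watching memory from inside the process is `process.memoryUsage()`:

```javascript
// Snapshot heap usage before and after a sizable allocation.
const before = process.memoryUsage().heapUsed;
const big = new Array(1e6).fill('x');        // allocate something large enough to see
const after = process.memoryUsage().heapUsed;
console.log('heap grew by ~' + Math.round((after - before) / 1024) + ' KiB');
```

External tools like `top` (as used earlier in the thread) report the whole process's resident size, while `heapUsed` isolates the JavaScript heap, which is what the record array actually lives in.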
I had some time to play around, and I noticed that if I comment out `finalData.push(record);` in my code, the parser seems to behave as you described: it stays well under 60 MB while parsing. I think we can conclude that the output we are storing in the array is the memory eater. Now, for my question: shouldn't we have some way to pause the parsing, flush the array, and continue? Wouldn't this help for massive files?
About the pause/resume mechanism: this parser is a good Node.js citizen and implements the official Stream API, which itself includes those methods.
@wdavidw csv-parse does a great job of synchronously parsing the data record by record. But whenever the parser triggers an error, records already handed to the transformer continue to be processed. I'm looking for something related to pausing and resuming the parser stream, but I couldn't find anything; maybe my bad. I expect something like the following:
And suppose a 3-line CSV file where the 3rd line is an error; I expect the log to be:
But I got the logs as follows:
So I want to know: is there any possibility to pause and resume the parser? I need to parse line by line, so that when an error occurs I can simply handle it. I'm expecting something like asynchronous processing, line by line.
The parser implements the Node.js Stream API, which has pause and resume functions, but I doubt they will provide what you wish. The problem with CSV is that we can't process it line by line easily, because a linefeed (which we call rowDelimiter) can appear inside a quoted cell, and that makes it unclear how we should interpret a rowDelimiter in case of an error.
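The quoted-linefeed problem can be shown concretely. This toy quote-aware scanner (an illustration only, not the parser's real algorithm; it ignores `""` escapes) counts rows correctly where a naive line split does not:

```javascript
// A linefeed inside a quoted cell is data, not a rowDelimiter.
function countCsvRows(text) {
  let rows = 1, inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === '"') inQuotes = !inQuotes;     // naive: ignores "" escapes
    else if (ch === '\n' && !inQuotes) rows++;
  }
  return rows;
}

const csv = 'name,note\nalice,"line one\nline two"\nbob,ok';
console.log(countCsvRows(csv));              // 3 rows (quoted newline is data)
console.log(csv.split('\n').length);         // 4 "lines" from a naive split
```

This is why resuming "at the next line" after an error is ambiguous: the parser cannot know whether a given linefeed ends a record without tracking quote state from earlier in the stream.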
If you are using the stream-transform package, then you should be able to set the "parallel" option to "1" to avoid parallel execution. It seems like this could solve your issue.
But is it possible to emit an event just after a complete record is received? And to continue further only if the parser is not in
Yes, I am using that, but whenever an error is emitted from the parser, the transformer continues to process pending records.
CSV parsing is sequential, so no problem there. Following the Stream API, throttling is handled by the consumer which receives the data after the parser (e.g. stream-transform).
When an error occurs during parsing, no further records are emitted by the parser.
Suppose a 10-line file: if the 6th line contains an error, the parser will emit 5 records along with an error event, right?
The error event emitted by the stream writer (the parser in our case) will prevent new transform functions from being called in the stream reader, so disabling parallel execution (in case you have it enabled) should achieve what you wish. I don't see the need to pause/resume the parser (even if you can do this with the Stream API).
OK @wdavidw, I made changes to my code. Thanks for your prompt support and great module :)
I love your parser, and I usually parse CSV files of around 500K lines. While this works well, I often notice that during parsing I run out of memory. The original code stored all records in an array. I was wondering if I could just pause after, say, 500 entries in the array and sleep for a while before proceeding? This would give the garbage collector a chance to clean up some older objects before continuing. Is this possible?
```js
function parse_csv_internal_csv_stream(parseFile, onComplete){
}
```