
Records list now accepts any iterable, not only an array #11

Merged
merged 3 commits into from
Aug 22, 2018
Conversation

@pineapplemachine
Contributor

pineapplemachine commented Aug 21, 2018

Previously there were several calls to `input.map` in `AbstractCsvStringifier.stringifyRecords`, which work for arrays but not for generators or most other iterable objects. The code has been changed to use a `for (x of y)` loop instead of `map` calls.

Also added two new unit tests to verify this behavior.

As a side-effect, the new code is also more performant and memory-efficient! (edit: Probably? I haven't actually benchmarked. Depends on how string concatenation compares to building an array and then calling join, I guess)
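For concreteness, a minimal sketch of the shape of the change (using the `_getRecordAsArray`, `_getCsvLine`, and `RECORD_DELIMITER` names that appear elsewhere in this thread; the exact merged code may differ):

    stringifyRecords(records) {
        // Old shape: records.map(...).map(...) assumed an array input.
        // New shape: for...of accepts any iterable, including generators.
        let csvString = '';
        for (const record of records) {
            csvString += this._getCsvLine(this._getRecordAsArray(record)) + RECORD_DELIMITER;
        }
        return csvString;
    }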

@coveralls

coveralls commented Aug 21, 2018

Pull Request Test Coverage Report for Build 47

  • 5 of 5 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals (Coverage Status)
  Change from base Build 43: 0.0%
  Covered Lines: 86
  Relevant Lines: 86

💛 - Coveralls

    @@ -16,10 +16,12 @@ class AbstractCsvStringifier {
        }

        stringifyRecords(records) {
            const csvLines = records
Owner

Doesn't it also work if we just change `records` to `Array.from(records)`?

Contributor Author

It would be functional, but it would be very inefficient.

@ryu1kn
Owner

ryu1kn commented Aug 21, 2018

Hello @pineapplemachine, thank you for your PR! The code looks good to me except for one trivial point I asked about in the review.

Do you want to share the context of why you want it to accept any iterable instead of just an array? When we introduce new functionality, I want to understand how it helps you now (not how it may help you in the future). This is something I try to ask of people, including myself 😉

It could be that you use generators heavily and without this you would have to call `Array.from` everywhere, or something like that.

@pineapplemachine
Contributor Author

pineapplemachine commented Aug 21, 2018

I am working on adding functionality to a web backend to export certain data in CSV format, potentially rather a lot of data. When I installed your package and used it for the CSV export functionality, I was surprised to receive an error when I passed in my list of records. When I looked at the source, I saw that this was because you assume the input has a `map` method, which is true of arrays but not of many other collection types.

There are many advantages to accepting any iterable, not only arrays, but the one that I need is that the function can accept generators. (Generators are iterators that do not contain the entire result; they compute the output one element at a time, as each element is needed.)
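For illustration, a hypothetical generator producing records lazily (the name and record shape here are invented for the example):

    // Yields one record at a time; the full dataset never has to
    // exist in memory at once.
    function* generateRecords(rowCount) {
        for (let i = 0; i < rowCount; i++) {
            yield ['row' + i, String(i * 2)];
        }
    }

    // With this PR, such a generator can be passed directly:
    // stringifier.stringifyRecords(generateRecords(1000000));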

Note also that each call to `Array.map` constructs a new array. The current implementation, which calls `map` twice, implies that, during the invocation of this function:

  • The input must be fully stored in memory
  • The result of `_getRecordAsArray(record)` for every element in the input must simultaneously be stored in memory
  • The result of `_getCsvLine(_getRecordAsArray(record))` for every element in the input must simultaneously be stored in memory
  • The output CSV string must simultaneously be stored in memory

Accepting a generator and not using `map` for intermediate results means that, during the invocation of the function:

  • Only the output CSV string must be fully stored in memory, plus the single record being appended to it at any given time

Since the application I am working on needs to be able to export large datasets, it is advantageous to pass a generator instead of an array. If the function effectively stores the dataset in memory four times over, I run a real risk of servers running out of RAM.
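To make that point concrete, the original two-`map` shape was roughly the following (reconstructed from the description above, not copied from the source):

    stringifyRecords(records) {
        // Each .map call allocates a full intermediate array, so the
        // data exists several times over during this invocation.
        const csvLines = records
            .map(record => this._getRecordAsArray(record))
            .map(recordAsArray => this._getCsvLine(recordAsArray));
        return csvLines.join(RECORD_DELIMITER) + RECORD_DELIMITER;
    }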

@pineapplemachine
Contributor Author

I wrote before,

Probably? I haven't actually benchmarked. Depends on how string concatenation compares to building an array and then calling join, I guess

I realize now that I chose my words poorly and this may have led to confusion.

My uncertainty was not about this implementation being more efficient than what was there before, but about the performance and memory characteristics of string concatenation vs. joining. It may be worth investigating later how concatenation (what's in the PR) compares to eagerly building an array of row strings and then joining it, and how both compare to using `Array.prototype.join.call` with a generator as input to transform the input records into string rows. (That is, if `Array.prototype.join` will in fact accept an arbitrary iterable, which I don't know off the top of my head.)

@pineapplemachine
Contributor Author

pineapplemachine commented Aug 21, 2018

To find out without a doubt, I wrote a script to determine how the performance characteristics actually compare.

I have determined that V8 JIT works in mysterious ways. Any time I think I know about the performance characteristics of code, V8 turns out to be doing something strange.

Using `Array.prototype.join` with a generator sadly did not work.
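That failure makes sense in hindsight: `Array.prototype.join` is generic over array-likes (it reads a `length` property and indexed elements) rather than over iterables, so a generator object looks to it like an empty array. A quick demonstration:

    function* gen() { yield 'a'; yield 'b'; }
    // join reads `length` and indexed properties; a generator has neither.
    console.log(Array.prototype.join.call(gen(), ',')); // "" (length treated as 0)
    console.log([...gen()].join(','));                  // "a,b" (spread into an array first)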

Simply adding `Array.from` as you suggested consistently had the worst run time of the variants I tested. Somehow, it turned out to have the lowest peak memory usage when the CSV test was repeated many times, yet the highest memory usage when the test was run only once for a high number of rows. I cannot explain this, but I suspect GC shenanigans.

Overall, this implementation of the method seems to come out on top, though concatenation outperformed it for smaller inputs (up to at least 5000 rows):

    stringifyRecords(records) {
        const array = [];
        for (const record of records) {
            array.push(this._getCsvLine(this._getRecordAsArray(record)));
        }
        // The trailing empty entry makes join() emit a final RECORD_DELIMITER.
        array.push('');
        return array.join(RECORD_DELIMITER);
    }

I also tested this against `Array.from(records).map(...).join(...) + delim`. It performed better than the version with two calls to `map`, but worse than constructing the array in a loop.

It's also worth noting that in V8, loops of the form `for (let i = 0; i < array.length; i++) {...}` perform better than loops of the form `for (let element of array) {...}`. (This is not true for SpiderMonkey, and I'm not sure how much longer it will remain the case for V8, either.) I won't worry about it for this PR, but you may want to investigate using the former loop structure for arrays and array-like objects, and `for...of` only for iterables that are not arrays or array-like, at least until V8's loop performance characteristics become more similar to SpiderMonkey's.

Here are the full results of the tests on my machine, using node v10.7.0.

Here's the code to reproduce the results: https://gist.github.com/pineapplemachine/ec5f2356b6470729084f022441d0954c

Testing: abstract-original.js
Number of rows in CSV: 500 rows.
Number of iterations: 100 times.
Time taken: 52 ms.
Peak heap total: 11304960 bytes.
Peak heap used: 6545264 bytes.
Peak external: 16464 bytes.
Peak RSS: 24956928 bytes.
CSV output length: 5963.

Testing: abstract-concat.js
Number of rows in CSV: 500 rows.
Number of iterations: 100 times.
Time taken: 38 ms.
Peak heap total: 15499264 bytes.
Peak heap used: 8612944 bytes.
Peak external: 32848 bytes.
Peak RSS: 30158848 bytes.
CSV output length: 5963.

Testing: abstract-join-eager.js
Number of rows in CSV: 500 rows.
Number of iterations: 100 times.
Time taken: 43 ms.
Peak heap total: 15499264 bytes.
Peak heap used: 8891736 bytes.
Peak external: 49232 bytes.
Peak RSS: 30859264 bytes.
CSV output length: 5963.



Testing: abstract-original.js
Number of rows in CSV: 5000 rows.
Number of iterations: 100 times.
Time taken: 388 ms.
Peak heap total: 41713664 bytes.
Peak heap used: 20937696 bytes.
Peak external: 16464 bytes.
Peak RSS: 56340480 bytes.
CSV output length: 74629.

Testing: abstract-concat.js
Number of rows in CSV: 5000 rows.
Number of iterations: 100 times.
Time taken: 291 ms.
Peak heap total: 42237952 bytes.
Peak heap used: 20685032 bytes.
Peak external: 32848 bytes.
Peak RSS: 57131008 bytes.
CSV output length: 74629.

Testing: abstract-join-eager.js
Number of rows in CSV: 5000 rows.
Number of iterations: 100 times.
Time taken: 301 ms.
Peak heap total: 42237952 bytes.
Peak heap used: 21924168 bytes.
Peak external: 49232 bytes.
Peak RSS: 57720832 bytes.
CSV output length: 74629.



Testing: abstract-original.js
Number of rows in CSV: 50000 rows.
Number of iterations: 100 times.
Time taken: 6030 ms.
Peak heap total: 90570752 bytes.
Peak heap used: 65372992 bytes.
Peak external: 16464 bytes.
Peak RSS: 111431680 bytes.
CSV output length: 896295.

Testing: abstract-concat.js
Number of rows in CSV: 50000 rows.
Number of iterations: 100 times.
Time taken: 5755 ms.
Peak heap total: 98861056 bytes.
Peak heap used: 71590312 bytes.
Peak external: 16464 bytes.
Peak RSS: 128045056 bytes.
CSV output length: 896295.

Testing: abstract-join-eager.js
Number of rows in CSV: 50000 rows.
Number of iterations: 100 times.
Time taken: 5159 ms.
Peak heap total: 88014848 bytes.
Peak heap used: 57504424 bytes.
Peak external: 16464 bytes.
Peak RSS: 117350400 bytes.
CSV output length: 896295.



Testing: abstract-original.js
Number of rows in CSV: 500000 rows.
Number of iterations: 100 times.
Time taken: 90217 ms.
Peak heap total: 238682112 bytes.
Peak heap used: 207717896 bytes.
Peak external: 8272 bytes.
Peak RSS: 289943552 bytes.
CSV output length: 10462961.

Testing: abstract-concat.js
Number of rows in CSV: 500000 rows.
Number of iterations: 100 times.
Time taken: 66311 ms.
Peak heap total: 321683456 bytes.
Peak heap used: 291250760 bytes.
Peak external: 16464 bytes.
Peak RSS: 373735424 bytes.
CSV output length: 10462961.

Testing: abstract-join-eager.js
Number of rows in CSV: 500000 rows.
Number of iterations: 100 times.
Time taken: 56886 ms.
Peak heap total: 309051392 bytes.
Peak heap used: 283792256 bytes.
Peak external: 16464 bytes.
Peak RSS: 361304064 bytes.
CSV output length: 10462961.



Testing: abstract-original.js
Number of rows in CSV: 500000 rows.
Number of iterations: 1 times.
Time taken: 1011 ms.
Peak heap total: 184778752 bytes.
Peak heap used: 153772192 bytes.
Peak external: 8272 bytes.
Peak RSS: 215330816 bytes.
CSV output length: 10462961.

Testing: abstract-concat.js
Number of rows in CSV: 500000 rows.
Number of iterations: 1 times.
Time taken: 658 ms.
Peak heap total: 130842624 bytes.
Peak heap used: 88640968 bytes.
Peak external: 8272 bytes.
Peak RSS: 165765120 bytes.
CSV output length: 10462961.

Testing: abstract-join-eager.js
Number of rows in CSV: 500000 rows.
Number of iterations: 1 times.
Time taken: 581 ms.
Peak heap total: 118448128 bytes.
Peak heap used: 94447512 bytes.
Peak external: 8272 bytes.
Peak RSS: 154337280 bytes.
CSV output length: 10462961.



Testing: abstract-original.js
Number of rows in CSV: 5000000 rows.
Number of iterations: 1 times.
Time taken: 10935 ms.
Peak heap total: 1155989504 bytes.
Peak heap used: 1116954792 bytes.
Peak external: 8272 bytes.
Peak RSS: 1221926912 bytes.
CSV output length: 119629627.

Testing: abstract-concat.js
Number of rows in CSV: 5000000 rows.
Number of iterations: 1 times.
Time taken: 7809 ms.
Peak heap total: 952401920 bytes.
Peak heap used: 912804464 bytes.
Peak external: 8272 bytes.
Peak RSS: 1020747776 bytes.
CSV output length: 119629627.

Testing: abstract-join-eager.js
Number of rows in CSV: 5000000 rows.
Number of iterations: 1 times.
Time taken: 6427 ms.
Peak heap total: 866758656 bytes.
Peak heap used: 792540008 bytes.
Peak external: 8272 bytes.
Peak RSS: 938725376 bytes.
CSV output length: 119629627.

Anyway, that was an interesting exercise. I expect I will update the PR to use the join implementation.

@ryu1kn
Owner

ryu1kn commented Aug 22, 2018

Thanks for the detailed explanations and the performance benchmark! A few thoughts.

  1. I prefer to write code in an immutable style and try to avoid variable reassignment or updates to object state where possible and appropriate, even when it is less performant (so how much less performant it is matters, not just "it's worse"). But if I find it to be an issue, I'll change it to the better-performing way, and I don't mind as long as the section where mutation happens is contained (as in your change).

  2. I myself don't see why I wrote two `map`s instead of just one. Even one still creates one more array than your updated code.

  3. As written in the README, I was expecting people to use Node's streams when they work with large data, rather than passing large data to a single `stringifyRecords` call. If the data is really big, like a dump of entire DB contents, building it up in one string takes a lot of space anyway.

    However, if you need to keep writing large data to a certain file, you would want to create node's transform stream and use CsvStringifier, which is explained later, inside it, and pipe the stream into a file write stream.
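    (A sketch of that streaming approach, for illustration; the `createArrayCsvStringifier` factory name and header values here are assumed, not taken from this PR:)

        const {Transform} = require('stream');
        const {createArrayCsvStringifier} = require('csv-writer'); // assumed factory name

        const stringifier = createArrayCsvStringifier({header: ['NAME', 'AGE']});

        // A transform stream that turns each record into a CSV line, so a
        // large dataset is piped to a file without building one huge string.
        const toCsvLine = new Transform({
            objectMode: true,
            transform(record, _encoding, callback) {
                callback(null, stringifier.stringifyRecords([record]));
            }
        });

        // recordStream.pipe(toCsvLine).pipe(require('fs').createWriteStream('out.csv'));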

So, in short: I'm not too convinced of the necessity of the performance improvements, but I'm happy with your change. (Though without a continuous performance test proving the benefit, someday someone might, by accident, change it back to immutable code...)

@pineapplemachine
Contributor Author

pineapplemachine commented Aug 22, 2018

Unfortunately, JavaScript does not have real immutability, which means that writing code as you would for a language that does have it will get you poor performance. Languages with good immutability features are able to back them up with good optimizations; JavaScript is not.

As you can see in the benchmarks I ran, the implementation in the PR runs in roughly half the time of your `Array.from` suggestion for larger numbers of rows, with about 75% of the peak heap memory usage (almost 300 megabytes less for 5,000,000 rows). That's a big difference.

Since I'm not writing to a file, it had not occurred to me to use streams. I'll have to look into it more to see whether it's possible in my case. A little bit of googling conveys a resounding "maybe".

Regardless, I can't think of any reason not to use the implementation in this PR. If you're concerned about continuous performance tests, I encourage you to add them. All the tools you might need are in the gist linked in my previous post.

@ryu1kn
Owner

ryu1kn commented Aug 22, 2018

I'm not trying to achieve true immutability, as you can see in my code; e.g. private methods just rely on the underscore naming convention and don't even use closures. But avoiding variable reassignment and object updates makes it easy to reason about the code, and that's what I'm after.

Whether you write large data in a single `stringifyRecords` call has an impact on this discussion, and that's probably where the difference between us is coming from.

But as I wrote in my previous comment, I'm going to merge the change. Thanks for your contribution 👍

ryu1kn merged commit 0ab40be into ryu1kn:master Aug 22, 2018
@ryu1kn
Owner

ryu1kn commented Aug 22, 2018

I found that I can pass an optional map function to `Array.from`; this way, the intermediate array is not created.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/from#Syntax
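For reference, the `abstract-array-from.js` variant presumably looks something like this (a sketch based on the description above; the actual committed code may differ):

    stringifyRecords(records) {
        // Array.from's second argument maps each element while the array is
        // built, so no separate intermediate array is allocated for the map.
        const csvLines = Array.from(records, record => this._getCsvLine(this._getRecordAsArray(record)));
        return csvLines.join(RECORD_DELIMITER) + RECORD_DELIMITER;
    }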

Below is the benchmark with your snippet; the new version is `abstract-array-from.js`. All runs used node v10.9. Among the four variants, its time taken is the second shortest, about 10% behind the best one. Memory usage is harder to judge, but it is first or second, using about 5% more or 5% less.

So it's good enough for me to keep the functional way of writing it; I decided to go with this.

Testing: abstract-original.js
Number of rows in CSV: 50000 rows.
Number of iterations: 100 times.
Time taken: 6858 ms.
Peak heap total: 91570176 bytes.
Peak heap used: 65432552 bytes.
Peak external: 16464 bytes.
Peak RSS: 108494848 bytes.
CSV output length: 896295.

Testing: abstract-array-from.js
Number of rows in CSV: 50000 rows.
Number of iterations: 100 times.
Time taken: 6146 ms.
Peak heap total: 80265216 bytes.
Peak heap used: 59225520 bytes.
Peak external: 16464 bytes.
Peak RSS: 102912000 bytes.
CSV output length: 896295.

Testing: abstract-concat.js
Number of rows in CSV: 50000 rows.
Number of iterations: 100 times.
Time taken: 6179 ms.
Peak heap total: 99336192 bytes.
Peak heap used: 71823232 bytes.
Peak external: 16464 bytes.
Peak RSS: 122437632 bytes.
CSV output length: 896295.

Testing: abstract-join-eager.js
Number of rows in CSV: 50000 rows.
Number of iterations: 100 times.
Time taken: 5422 ms.
Peak heap total: 92454912 bytes.
Peak heap used: 57797296 bytes.
Peak external: 16464 bytes.
Peak RSS: 115048448 bytes.
CSV output length: 896295.

Testing: abstract-original.js
Number of rows in CSV: 5000000 rows.
Number of iterations: 1 times.
Time taken: 12114 ms.
Peak heap total: 1156464640 bytes.
Peak heap used: 1119708136 bytes.
Peak external: 8272 bytes.
Peak RSS: 1223794688 bytes.
CSV output length: 119629627.

Testing: abstract-array-from.js
Number of rows in CSV: 5000000 rows.
Number of iterations: 1 times.
Time taken: 7718 ms.
Peak heap total: 868990976 bytes.
Peak heap used: 762845304 bytes.
Peak external: 8272 bytes.
Peak RSS: 921296896 bytes.
CSV output length: 119629627.

Testing: abstract-concat.js
Number of rows in CSV: 5000000 rows.
Number of iterations: 1 times.
Time taken: 9201 ms.
Peak heap total: 904118272 bytes.
Peak heap used: 843999552 bytes.
Peak external: 8272 bytes.
Peak RSS: 948568064 bytes.
CSV output length: 119629627.

Testing: abstract-join-eager.js
Number of rows in CSV: 5000000 rows.
Number of iterations: 1 times.
Time taken: 7093 ms.
Peak heap total: 834105344 bytes.
Peak heap used: 740571664 bytes.
Peak external: 8272 bytes.
Peak RSS: 895705088 bytes.
CSV output length: 119629627.

ryu1kn added a commit that referenced this pull request Aug 22, 2018
@ryu1kn
Owner

ryu1kn commented Aug 22, 2018

Released as v1.2.0
