Extended bin/loader to be more memory efficient for large datasets #27

I split the Soulmate::Loader#load method into two parts: cleanup and load.
By doing this we can keep calling Loader#load without having to blow away the
old data.

I then extended the bin/soulmate file to take filename and batch_size
arguments. By not waiting for all the data to arrive on STDIN, we can tune
the batch_size to reach a consistent memory usage, which also speeds up
loading.
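
Roughly, the new loading loop looks like this (a sketch only: Loader#cleanup and Loader#load are the methods touched by this pull, while the surrounding plumbing and names are illustrative):

    require 'json'
    require 'soulmate'

    def load_in_batches(type, io, batch_size = 1_000)
      loader = Soulmate::Loader.new(type)
      loader.cleanup                      # blow away the old data once, up front

      total = 0
      io.each_line.each_slice(batch_size) do |lines|
        items = lines.map { |line| JSON.parse(line) }
        loader.load(items)                # index just this batch
        total += items.size
      end
      total
    end

    # e.g. load_in_batches('venues', STDIN, 1_000)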

I ran some tests using time soulmate load sample.json with various batch_size
parameters. I inspected memory using Activity Monitor.app and took some
averages over 3 runs each. sample.json was a 100,000 line file.

batch_size  time (m:ss)  memory_used (MB)
   100,000         1:37             219.8
    10,000         1:15              58.2
     1,000         1:13              46.0
       100         1:14              46.4

As you can see, memory stays stable when the batch size is smaller, and loading
data is faster, too.

I haven't touched the other commands in bin/soulmate, since I don't have a
need to use them yet.

Tests have been modified to reflect the new Loader#cleanup method (they were
using Loader#load as a hack anyway), and they all pass.

Fixes #26

Here are the numbers for the original loader:

time: 1:23
memory used: 269.8 MB

I tried to run another test that had 1,000,000 lines, but the original version never finished.

Here's a run of the working dataset I use on my laptop:

    Loaded a total of 245152 items in 244 second(s)
    real    4m4.474s
    memory used: ~220MB

erwaller commented Jun 26, 2012

Thanks a ton for this, our data sets have been much smaller, so this was never on my radar.

One piece of feedback: I'd prefer if batch_size were either just fixed at 1000 (based on your tests), or set by an option flag, rather than passed as an argument. Do you mind updating the pull?

Thanks again.

No problem! Thanks for writing this gem :)

I'll set it as an option flag, since I'm on an SSD and others might get better performance if they tune the input size.
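
Something along these lines should do it (just a sketch of the flag parsing, not the exact change to bin/soulmate):

    require 'optparse'

    options = { batch_size: 1_000 }   # default based on the timings above
    OptionParser.new do |opts|
      opts.banner = 'Usage: soulmate load TYPE [--batch-size N] < items.json'
      opts.on('--batch-size N', Integer, 'items per batch (default: 1000)') do |n|
        options[:batch_size] = n
      end
    end.parse!

    # ARGV now holds whatever is left, e.g. ["load", "venues"]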

Tonight I loaded 5 different files in parallel. Here are the times:

Loaded a total of 200996 items in 591 second(s)
Loaded a total of 131918 items in 479 second(s)
Loaded a total of 69406 items in 229 second(s)
Loaded a total of 166788 items in 475 second(s)
Loaded a total of 110289 items in 481 second(s)

That's 680,000 lines in ~10 minutes, totaling 1.11 GB of data in redis. Redis says there are 507,591 keys, which is interesting given the 680,000 lines, but the names might not be unique (I'll have to look into that). Each ruby process used ~30MB. Pretty good, I'd say, but there's a lot of room to improve it (what can I say? I'm really impatient). If I have time I'll make it even faster by forking within each process and parallelizing the load, emulating what I did here by splitting the data sets.
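
The forking idea would look something like this (purely a sketch of the approach, not code from this branch; 'venues' and the file-per-child split are illustrative):

    require 'json'
    require 'soulmate'

    files = ARGV  # e.g. the five JSON files, one child process per file

    pids = files.map do |file|
      Process.fork do
        # Each child should use its own redis connection after the fork.
        loader = Soulmate::Loader.new('venues')
        File.foreach(file).each_slice(1_000) do |lines|
          loader.load(lines.map { |line| JSON.parse(line) })
        end
      end
    end

    pids.each { |pid| Process.wait(pid) }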

From my tests, I'm somewhat CPU-bound and I'm barely hitting 85 KB/s reads into redis, so I think I can shove more data into redis if I reduce the number of ruby objects being instantiated.

Edit: clarity and formatting and added redis key count info

Just an update: I've refactored my code to use the fastest way possible to get data into redis: using the raw protocol itself. I'll probably make the necessary changes to this branch and open a separate pull request for that branch, since mass-insert might not be for everyone.

The way it works: I generate the raw redis commands, store them in a tempfile, then shove that into redis via redis-cli --pipe < tempfile.redis. The interesting bit about this method is that the time spent removing keys and then re-adding them is minimized, and you could generate the file on one server and then transfer it to your redis machine to minimize insertion time.
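
Generating the commands boils down to writing the redis wire protocol (RESP) by hand. A minimal sketch, assuming an HSET per item (the command and key names here are illustrative, not Soulmate's actual key layout):

    require 'json'
    require 'tempfile'

    # Encode one command in the raw redis protocol (RESP).
    def resp(*args)
      cmd = "*#{args.size}\r\n"
      args.each do |arg|
        arg = arg.to_s
        cmd << "$#{arg.bytesize}\r\n#{arg}\r\n"
      end
      cmd
    end

    items = [{ 'id' => 1, 'term' => 'Example Venue' }]  # stand-in data

    file = Tempfile.new(['mass-insert', '.redis'])
    items.each do |item|
      file.write(resp('HSET', 'soulmate-data:example', item['id'], item.to_json))
    end
    file.close

    # Then pipe the whole file into redis in one go:
    #   redis-cli --pipe < /path/to/the/tempfile.redis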

I've got to clean it up a bit – which I may get to tonight – but here are some rough numbers:

Lines of JSON  Time generating  Time Removing/Inserting
166788         39s              14s
110289         60s              24s
200996         90s              36s

Ruby memory usage: 12MB (constant)

I'm just code golfing (performance golfing?) at this point, but it's been very interesting to learn more about the internals of redis. It's also interesting to note that inserting data into redis can't get any faster unless I either change my hardware or shard the data across several redis-server instances. The only thing I can make faster is generating the redis commands, which is a ruby exercise. I've taken care to reuse the existing code as much as possible, but there are still a lot of performance gains to be had.


Insertion speed is very important to me since I'm dealing with 10+ million records, and being able to quickly blow it all away and start from scratch in development is critical.

It's interesting to note that 8.25 million lines of JSON equates to almost 11GB of redis commands (which took 36 minutes to generate). I'd test that on my laptop, but I only have 4GB of ram and redis performance collapses when the dataset can't fit into memory. I'll have to test it on EC2 someday.

Tom Clark added some commits Jul 3, 2012

Extended bin/loader to be more memory efficient for large datasets

sethherr added a commit to sethherr/soulheart that referenced this pull request Jun 12, 2015

Merge pull request from soulmate (seatgeek/soulmate#27)

Merge branch 'memory-efficient-loads' of github.com:Whitespace/soulmate

josegonzalez commented Jun 29, 2017

Closing since this is unmaintained. Please use soulheart instead!
