Extended bin/loader to be more memory efficient for large datasets #27

Open
@Whitespace wants to merge 3 commits

2 participants

@Whitespace

I split the `Soulmate::Loader#load` method into two parts: cleanup and load.
By doing this we can keep calling `Loader#load` without having to blow away the
old data.

I then extended the `bin/soulmate` file to take filename and batch_size
arguments. By not waiting for all the data to arrive on STDIN, we can tune
the batch_size to reach a consistent memory usage, which also speeds up
loading.
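
Condensed, the batched loop that `bin/soulmate` ends up running looks roughly like this (the full version is in the diff at the bottom; the "venues" type and `sample.json` filename are just the examples used elsewhere in this PR):

    require 'soulmate'
    require 'multi_json'

    BATCH_SIZE = 1_000                       # tunable via --batch-size
    loader = Soulmate::Loader.new('venues')
    loader.cleanup                           # wipe the old index once, up front

    File.open('sample.json') do |f|
      until f.eof?
        lines = []
        BATCH_SIZE.times do
          break if f.eof?
          lines << MultiJson.decode(f.gets)
        end
        loader.load(lines)                   # only BATCH_SIZE items are in memory at a time
      end
    end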

I ran some tests using `time soulmate load sample.json` with various batch_size
parameters. I inspected memory using Activity Monitor.app and took some
averages over 3 runs each. `sample.json` was a 100,000-line file.

batch_size   time (m:ss)   memory used (MB)
   100,000          1:37              219.8
    10,000          1:15               58.2
     1,000          1:13               46.0
       100          1:14               46.4

As you can see, memory stays stable when the batch size is smaller, and loading
data is faster, too.

I haven't touched the other commands in `bin/soulmate`, since I don't have a
need to use them yet.

Tests have been modified to reflect the new `Loader#cleanup` method (they were
using `Loader#load` as a hack anyway), and they all pass.

Fixes #26

@Whitespace

For comparison, here are the original loader's numbers:

time: 1:23
memory used: 269.8 MB

I tried to run another test that had 1,000,000 lines, but the original version never finished.

Here's a run of the working dataset I use on my laptop:

    Loaded a total of 245152 items in 244 second(s)
    real    4m4.474s
    memory used: ~220MB
@erwaller
Owner

Thanks a ton for this; our data sets have been much smaller, so this was never on my radar.

One piece of feedback: I'd prefer if batch_size were either just fixed at 1000 (based on your tests), or set by an option flag, rather than passed as an argument. Do you mind updating the pull?

Thanks again.

@Whitespace

No problem! Thanks for writing this gem :)

I'll set it as an option flag, since I'm on an SSD and others might get better performance if they tune the input size.

Tonight I loaded 5 different files in parallel. Here are the times:

Loaded a total of 200996 items in 591 second(s)
Loaded a total of 131918 items in 479 second(s)
Loaded a total of 69406 items in 229 second(s)
Loaded a total of 166788 items in 475 second(s)
Loaded a total of 110289 items in 481 second(s)

680,000 lines in ~10 minutes, totaling 1.11 GB of data in redis. Redis says there are 507,591 keys, which is interesting because there are 680,000 lines; the names might not be unique (I have to look into that). Each ruby process used ~30MB. Pretty good, I'd say, but there's a lot of room to improve it (what can I say? I'm really impatient). If I have time I'll make it even faster by forking within each process and parallelizing the load, emulating what I had done by splitting the data sets.
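
To sketch the forking idea (purely hypothetical; nothing in this branch does this, and the file names are placeholders):

    require 'soulmate'
    require 'multi_json'

    # Fork one worker per input file instead of running separate processes by hand.
    %w[venues_1.json venues_2.json venues_3.json].each do |path|
      fork do
        # each child should establish its own redis connection after the fork
        loader = Soulmate::Loader.new(File.basename(path, '.json'))
        File.foreach(path).each_slice(1_000) do |lines|
          loader.load(lines.map { |l| MultiJson.decode(l) })
        end
      end
    end
    Process.waitall   # wait for every worker to finish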

From my tests, I'm somewhat CPU-bound and I'm barely hitting 85 KB/s reads into redis, so I think I can shove more data into redis if I reduce the number of ruby objects being instantiated.

Edit: clarity and formatting; added redis key count info.

@Whitespace

Just an update: I've refactored my code to use the fastest way possible to get data into redis: using the raw protocol itself. I'll probably make the necessary changes to this branch and open a separate pull request for that branch, since mass-insert might not be for everyone.

The way it works: I generate the raw redis commands, store them in a tempfile, then shove that into redis via `redis-cli --pipe < tempfile.redis`. The interesting bit about this method is that the time spent removing keys and then re-adding them is minimized, and you could generate the file on one server and then transfer it to your redis machine to minimize insertion time.
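
To make the "raw protocol" part concrete, here's what a single command looks like once encoded (same logic as the `gen_redis_proto` helper in the `bin/soulmate` diff below; the key and payload are just examples):

    # Each command is encoded as *<arg count>\r\n, then $<byte length>\r\n<arg>\r\n per argument.
    def gen_redis_proto(*cmd)
      proto = "*#{cmd.length}\r\n"
      cmd.each { |arg| proto << "$#{arg.bytesize}\r\n#{arg}\r\n" }
      proto
    end

    gen_redis_proto("HSET", "soulmate-data:venues", "1", '{"id":1}')
    # => "*4\r\n$4\r\nHSET\r\n$20\r\nsoulmate-data:venues\r\n$1\r\n1\r\n$8\r\n{\"id\":1}\r\n"
    # Appending thousands of these to a tempfile and replaying it with redis-cli --pipe
    # avoids a network round trip per command.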

I've got to clean it up a bit – which I may get to tonight – but here are some rough numbers:

Lines of JSON  Time generating  Time Removing/Inserting
166788         39s              14s
110289         60s              24s
200996         90s              36s

Ruby memory usage: 12MB (constant)

I'm just code golfing (performance golfing?) at this point, but it's been very interesting to learn more about the internals of redis. It's also interesting to note that inserting data into redis cannot get any faster unless I either change my hardware or shard the data across several redis-server instances. The only thing I can make faster is generating the redis commands, which is a ruby exercise. I've taken care to reuse the existing code as much as possible, but there are still a lot of performance gains to be had.


Insertion speed is very important to me since I'm dealing with 10+ million records, and being able to quickly blow it all away and start from scratch in development is critical.

It's interesting to note that 8.25 million lines of JSON equate to almost 11GB of redis commands (they took 36 minutes to generate). I'd test that on my laptop, but I only have 4GB of ram and redis performance collapses when the dataset can't fit into memory. I'll have to test that on EC2 someday.

Tom Clark added some commits
Tom Clark Allow cache-busting via query params 5e3efd1
Tom Clark Extended bin/loader to be more memory efficient for large datasets 2057aaf
Tom Clark Use raw redis commands for fastest possible insertion ac43311
@sethherr referenced this pull request from a commit in sethherr/soulheart:
Merge pull request from soulmate
seatgeek/soulmate#27

Merge branch 'memory-efficient-loads' of github.com:Whitespace/soulmate
614826f
Commits on Jul 3, 2012
  1. Allow cache-busting via query params

    Tom Clark authored
  2. Extended bin/loader to be more memory efficient for large datasets

    Tom Clark authored
4 .gitignore
@@ -16,4 +16,6 @@ pkg
test/db/*.rdb
-Gemfile.lock
+Gemfile.lock
+
+.rvmrc
112 bin/soulmate
@@ -9,6 +9,7 @@ rescue LoadError
end
require 'soulmate'
require 'optparse'
+require 'tempfile'
parser = OptionParser.new do |opts|
opts.banner = "Usage: soulmate [options] COMMAND"
@@ -31,19 +32,98 @@ parser = OptionParser.new do |opts|
exit
end
+ opts.on("-b", "--batch-size", "Number of lines to read at a time") do |size|
+ BATCH_SIZE = size
+ end
+
opts.separator ""
opts.separator "Commands:"
- opts.separator " load TYPE Replaces collection specified by TYPE with items read from stdin in the JSON lines format."
- opts.separator " add TYPE Adds items to collection specified by TYPE read from stdin in the JSON lines format."
- opts.separator " remove TYPE Removes items from collection specified by TYPE read from stdin in the JSON lines format. Items only require an 'id', all other fields are ignored."
- opts.separator " query TYPE QUERY Queries for items from collection specified by TYPE."
+ opts.separator " load TYPE FILE Replaces collection specified by TYPE with items read from FILE in the JSON lines format."
+ opts.separator " add TYPE Adds items to collection specified by TYPE read from stdin in the JSON lines format."
+ opts.separator " remove TYPE Removes items from collection specified by TYPE read from stdin in the JSON lines format. Items only require an 'id', all other fields are ignored."
+ opts.separator " query TYPE QUERY Queries for items from collection specified by TYPE."
end
-def load(type)
- puts "Loading items of type #{type}..."
- items = $stdin.read.split("\n").map { |l| MultiJson.decode(l) }
- loaded = Soulmate::Loader.new(type).load(items)
- puts "Loaded a total of #{loaded.size} items"
+def generate(type, file)
+ include Soulmate::Helpers
+
+ begin
+ temp = Tempfile.new("soulmate")
+
+ if File.exists?(file)
+ start_time = Time.now.to_i
+ base = "soulmate-index:#{type}"
+ database = "soulmate-data:#{type}"
+ # hset = "*4\r\n$4\r\nHSET\r\n$#{database.length}\r\n#{database}\r\n$"
+ # del = "*2\r\n$3\r\nDEL\r\n$"
+ begin
+ f = File.open(file)
+ # cleanup
+ phrases = Soulmate.redis.smembers(base)
+ phrases.each do |phrase|
+ temp << gen_redis_proto("DEL", phrase)
+ # temp << del + phrase.length.to_s + "\r\n" + phrase + "\r\n"
+ end
+ temp << gen_redis_proto("DEL", base)
+ # temp << del + base.length.to_s + "\r\n" + base + "\r\n"
+ while !f.eof?
+ line = f.gets.chomp
+ line =~ /"id":(\d+)/
+ id = $1
+ line =~ /"score":(\d+)/
+ score = $1
+ json = MultiJson.decode(line)
+ temp << gen_redis_proto("HSET", database, id, line)
+ # temp << hset + $1.length.to_s + "\r\n" + $1 + "\r\n$" + line.length.to_s + "\r\n" + line + "\r\n"
+ phrase = json.key?("aliases") ? json["term"] + " " + json["aliases"] : json["term"]
+ prefixes_for_phrase(phrase).each do |p|
+ temp << gen_redis_proto("SADD", base, p)
+ temp << gen_redis_proto("ZADD", base + ":" + p, score, id)
+ end
+ end
+ ensure
+ f.close
+ end
+ puts "Converted in #{Time.now.to_i - start_time} second(s)"
+ puts "Importing into redis ..."
+ `time redis-cli --pipe < #{temp.path}`
+ else
+ puts "Couldn't open file: #{file}"
+ end
+ ensure
+ temp.close
+ end
+end
+
+def load(type, file)
+ if File.exists?(file)
+ start_time = Time.now.to_i
+
+ puts "Purging existing items of type #{type} ..."
+ loader = Soulmate::Loader.new(type)
+ loader.cleanup
+
+ puts "Loading items of type #{type} in batches of #{BATCH_SIZE} ..."
+ count = 0
+ begin
+ f = File.open(file)
+ while !f.eof?
+ lines = []
+ BATCH_SIZE.times do
+ break if f.eof?
+ lines << MultiJson.decode(f.gets)
+ count += 1
+ end
+ loader.load(lines)
+ puts "Loaded #{count} items ..." unless f.eof?
+ end
+ ensure
+ f.close
+ end
+ puts "Loaded a total of #{count} items in #{Time.now.to_i - start_time} second(s)"
+ else
+ puts "Couldn't open file: #{file}"
+ end
end
def add(type)
@@ -76,11 +156,23 @@ def query(type, query)
puts "> Found #{results.size} matches"
end
+def gen_redis_proto(*cmd)
+ proto = "*"+cmd.length.to_s+"\r\n"
+ cmd.each{|arg|
+ proto << "$"+arg.bytesize.to_s+"\r\n"
+ proto << arg+"\r\n"
+ }
+ proto
+end
+
parser.parse!
+BATCH_SIZE ||= 1000
case ARGV[0]
+when 'generate'
+ generate ARGV[1], ARGV[2]
when 'load'
- load ARGV[1]
+ load ARGV[1], ARGV[2]
when 'add'
add ARGV[1]
when 'remove'
6 lib/soulmate/loader.rb
@@ -2,7 +2,7 @@ module Soulmate
class Loader < Base
- def load(items)
+ def cleanup
# delete the sorted sets for this type
phrases = Soulmate.redis.smembers(base)
Soulmate.redis.pipelined do
@@ -19,8 +19,10 @@ def load(items)
# delete the data stored for this type
Soulmate.redis.del(database)
+ end
- items.each_with_index do |item, i|
+ def load(items)
+ items.each do |item|
add(item, :skip_duplicate_check => true)
end
end
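
In other words, callers are now responsible for the wipe; a minimal sketch of the new usage (the item hashes are just example data, shaped like the ones in the tests):

    loader = Soulmate::Loader.new('venues')
    loader.cleanup    # the old `loader.load([])` hack, made explicit
    loader.load([{ 'id' => 1, 'term' => 'Testing this', 'score' => 10 }])
    loader.load([{ 'id' => 2, 'term' => 'Testing that', 'score' => 5 }])
    # successive load calls now add to the collection instead of replacing it
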
5 lib/soulmate/server.rb
@@ -23,11 +23,12 @@ class Server < Sinatra::Base
limit = (params[:limit] || 5).to_i
types = params[:types].map { |t| normalize(t) }
term = params[:term]
-
+ cache = params[:cache] != "false"
+
results = {}
types.each do |type|
matcher = Matcher.new(type)
- results[type] = matcher.matches_for_term(term, :limit => limit)
+ results[type] = matcher.matches_for_term(term, limit: limit, cache: cache)
end
MultiJson.encode({
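
As a usage note for the new `cache` parameter, something like the following should work; the `/search` route and local port are assumptions about a typical setup, not part of this diff:

    require 'net/http'
    require 'uri'

    # Assumes the Sinatra app is running locally via rackup on port 9292.
    uri = URI('http://localhost:9292/search')
    uri.query = URI.encode_www_form('term' => 'te', 'types[]' => 'venues', 'cache' => 'false')
    puts Net::HTTP.get(uri)   # cache=false skips the matcher's cached results
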
4 test/test_soulmate.rb
@@ -49,7 +49,7 @@ def test_can_remove_items
matcher = Soulmate::Matcher.new('venues')
# empty the collection
- loader.load([])
+ loader.cleanup
results = matcher.matches_for_term("te", :cache => false)
assert_equal 0, results.size
@@ -69,7 +69,7 @@ def test_can_update_items
matcher = Soulmate::Matcher.new('venues')
# empty the collection
- loader.load([])
+ loader.cleanup
# initial data
loader.add("id" => 1, "term" => "Testing this", "score" => 10)