
Fixes #22876 - message/source digests are bigint #5319

Closed
wants to merge 1 commit

Conversation

lzap
Member

@lzap lzap commented Mar 13, 2018

This patch converts digest columns from SHA1 HEX (40 characters) to
BIGINT (8 bytes). Instead of using the secure SHA1 hash, it now uses the
very fast XXHash, which aims for speed and delivers high-quality entropy.

Although the benchmark included in the patch shows only a small performance
improvement on sqlite3, the main reason for this change is saved storage
on the PostgreSQL server - instead of storing 40 characters we only need 8
bytes (64 bits). Searching a BIGINT via indexes is faster than a VARCHAR in
every aspect, so a performance boost should be visible on deployments
with millions of records in those tables.

Please do not merge until this performance boost is confirmed. To test
this patch, please try it on a production instance (PostgreSQL/MySQL) with
tens of millions of records in the message and/or source tables. The
benchmark test can be used to compare before/after.
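
For illustration only (not part of the patch), a minimal irb sketch of the storage difference, assuming the xxhash gem is installed:

require 'digest/sha1'
require 'xxhash'

value = "Some report message"

# Old format: SHA1 hex digest stored in a VARCHAR(255) column - 40 characters per row.
Digest::SHA1.hexdigest(value).bytesize   # => 40

# New format: 64-bit XXHash stored in a BIGINT column - always 8 bytes per row.
XXhash.xxh64(value)                      # => unsigned 64-bit integer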

@theforeman-bot
Member

Issues: #22876

@@ -8,8 +8,17 @@ def to_s
    value
  end

  def self.make_digest(val)
    # convert from unsigned to signed int64
    XXhash.xxh64(val) - 9223372036854775808


Use underscores(_) as decimal mark and separate every 3 digits with them.


Member Author


Better now? 9_223_372_036_854_775_808 ;-)


Member

@stbenjam stbenjam Mar 13, 2018

What's this doing? I'm not super familiar with the magic Ruby does behind the scenes for handling integers.

So this returns a hash > 2**63, so you'd need an unsigned int64 type to represent it:

2.4.2 :008 > XXhash.xxh64("foox")
 => 18147147302980099409

Ruby seems to do the right thing internally, right? It uses 8 bytes:

2.4.2 :007 > XXhash.xxh64("foox").size
 => 8

Member

Oh is the :bigint data type in the database always signed, so when it sees > 2**63 it freaks out?

Member

@stbenjam stbenjam Mar 13, 2018

(Although bigint unsigned seems to be available: https://www.philipcurtis.com/bigint-unsigned-ids-in-rails5-0/, at least on MySQL. Anyway, the hack here is fine but a better comment would help explain what's going on)

Member Author

Yeah, PostgreSQL only supports signed, so I made this trick, which seems to be the fastest way of doing this in Ruby. If you know better, let me know how to convert it. I think some bit shifting also does the job, but this looks pretty straightforward.

Member

@stbenjam stbenjam Mar 20, 2018

I thought about it for a while. Ruby is a PITA. I thought this would correctly convert from unsigned to signed, but Ruby doesn't always seem to do the right thing.

-1*(~(value+1) & 0xFFFF_FFFF_FFFF_FFFF)

But the subtraction should be faster, and be good enough.

C is easier: (int64_t) value, done.
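
For reference, a small irb sketch (not from the PR, assuming the xxhash gem) contrasting the offset subtraction used in this patch with an exact two's-complement reinterpretation via pack/unpack; both land inside the signed int64 range that PostgreSQL's bigint accepts, they just map the value differently:

require 'xxhash'

v = XXhash.xxh64("foox")                  # => 18147147302980099409 (unsigned 64-bit)

# The patch's approach: shift the whole unsigned range down by 2**63.
# [0, 2**64) maps one-to-one onto [-2**63, 2**63), so uniqueness is preserved.
shifted = v - 9_223_372_036_854_775_808   # => 8923775266125323601

# Exact reinterpretation of the same bits as signed (what a C cast to int64_t does).
cast = [v].pack('Q').unpack1('q')         # => -299596770729452207

# Both representations fit a signed 64-bit column.
[shifted, cast].all? { |i| i.between?(-2**63, 2**63 - 1) }   # => true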

    # messages digest
    remove_index :messages, :digest
    remove_column :messages, :digest
    add_column :messages, :digest, :bigint, :limit => 8
Member

Would it be wiser to add a new column instead? It's interesting to start thinking about this in the context of continuous deployments and deployments in clusters.

Member Author

Do you mean basically renaming it to something else? I can do that, sure.
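
A hedged sketch of what the "new column" variant could look like (the digest_xxh64 column name and migration class name are made up for illustration, not part of this PR):

class AddBigintDigestToMessages < ActiveRecord::Migration[5.1]
  def up
    # Add the bigint digest next to the old VARCHAR column instead of replacing it,
    # so nodes still running old code keep working during a rolling deployment.
    add_column :messages, :digest_xxh64, :bigint
    add_index :messages, :digest_xxh64

    # Backfill in batches to keep transactions and memory bounded.
    Message.unscoped.in_batches(of: 1000) do |batch|
      batch.each do |rec|
        rec.update_column(:digest_xxh64, Message.make_digest(rec.value))
      end
    end
  end

  def down
    remove_index :messages, :digest_xxh64
    remove_column :messages, :digest_xxh64
  end
end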

class ChangeDigestsToBigint < ActiveRecord::Migration[5.1]
  def up
    # messages digest
    remove_index :messages, :digest
Member

How long would it take to recreate the index?

Member Author

Faster than doing an index on strings? ;-) I don't have any numbers at the moment; I was thinking this will need an upgrade step of exporting old reports prior to upgrading to speed up the procedure. But I get your point - if this does not show a radical boost in performance with millions of records, then maybe the patch is not worth it. I still believe it will be a dramatic improvement.

    remove_index :messages, :digest
    remove_column :messages, :digest
    add_column :messages, :digest, :bigint, :limit => 8
    Message.unscoped.all.find_each {|rec| rec.update_attribute :digest, Message.make_digest(rec.value)}
Member

please in batches!!!

Member Author

Sorry, what? It's using find_each, which defaults to a batch size of 1000 records.
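
For clarity, the same backfill with the batch size spelled out explicitly (a sketch, not a requested change):

# find_each loads records in batches of 1000 by default; making it explicit
# documents the intent without changing behaviour.
Message.unscoped.find_each(batch_size: 1000) do |rec|
  rec.update_attribute :digest, Message.make_digest(rec.value)
end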

@ekohl
Member

ekohl commented Mar 13, 2018

I'm also hesitant to add a dependency with a C extension, so there should be some good performance benefits to justify this, especially because the database migration could be very expensive.

@@ -3,8 +3,8 @@
require File.expand_path('../../../config/environment', __FILE__)

unless Rails.env.production? && !Rails.configuration.database_configuration["production"]["migrate"]
  puts "Rais must be in production and database must have migrations turned off!"
  puts "Please add similar configuration to your config/database.yaml:"
  puts "Rais must be in production (set RAILS_ENV) and database must have migrations turned off!"
Member

Typo: Rais should be Rails

@lzap
Member Author

lzap commented Mar 20, 2018

There are pure Ruby implementations of the hash algorithm; we don't need to use xxhash, just anything that is fast enough. But I considered both CRC64 and XXHash, and both pure Ruby versions are quite slow.

I agree this needs to be tested on a database with many records, and we want this only if it proves to perform better. That's my next step - to test this and provide some numbers.
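
A minimal micro-benchmark sketch for comparing the two digest functions in isolation; it assumes the benchmark-ips and xxhash gems and is not the benchmark shipped with this patch:

require 'benchmark/ips'
require 'digest/sha1'
require 'xxhash'

SAMPLE = "Sample report message line " * 10

Benchmark.ips do |x|
  x.report("SHA1 hex digest") { Digest::SHA1.hexdigest(SAMPLE) }
  x.report("XXhash 64-bit")   { XXhash.xxh64(SAMPLE) }
  x.compare!
end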

@lzap
Member Author

lzap commented Mar 21, 2018

I am testing on PostgreSQL now. This is the schema before the change:

                                                     Table "public.sources"
 Column |          Type          |                      Modifiers                       | Storage  | Stats target | Description
 
--------+------------------------+------------------------------------------------------+----------+--------------+------------
-
 id     | integer                | not null default nextval('sources_id_seq'::regclass) | plain    |              | 
 value  | text                   |                                                      | extended |              | 
 digest | character varying(255) |                                                      | extended |              | 
Indexes:
    "sources_pkey" PRIMARY KEY, btree (id)
    "index_sources_on_digest" btree (digest)
Has OIDs: no

And after:

 Column |  Type   |                      Modifiers                       | Storage  | Stats target | Description 
--------+---------+------------------------------------------------------+----------+--------------+-------------
 id     | integer | not null default nextval('sources_id_seq'::regclass) | plain    |              | 
 value  | text    |                                                      | extended |              | 
 digest | bigint  |                                                      | plain    |              | 
Indexes:
    "sources_pkey" PRIMARY KEY, btree (id)
    "index_sources_on_digest" btree (digest)
Has OIDs: no

Results with an empty database, filling up to 10k records, before:

Warming up --------------------------------------
      create message     9.000  i/100ms
       create source     9.000  i/100ms
        find message    17.000  i/100ms
         find source    10.000  i/100ms
Calculating -------------------------------------
      create message    109.917  (±20.0%) i/s -      6.021k in  60.060677s
       create source    108.108  (±23.1%) i/s -      5.805k in  60.051191s
        find message    465.180  (±103.4%) i/s -      8.925k in  60.009204s
         find source    123.538  (±98.8%) i/s -      6.100k in  60.023494s
Memory stats
Total objects allocated: 37945219
Total heap pages allocated: 3728
No. of messages in db after test: 17088
Example message digest: 9344f05adc194ebd64417537f9465678aa45705a
No. of sources in db after test: 17584
Example source digest: 3804c0a96e68be5212545be4f6d03bc6446c40ae

And after:

Warming up --------------------------------------
      create message    10.000  i/100ms
       create source     9.000  i/100ms
        find message    19.000  i/100ms
         find source    10.000  i/100ms
Calculating -------------------------------------
      create message    109.233  (±21.1%) i/s -      5.930k in  60.043792s
       create source    110.342  (±19.9%) i/s -      6.075k in  60.070752s
        find message    458.648  (±106.4%) i/s -      8.626k in  60.054695s
         find source    197.765  (±168.4%) i/s -      6.340k in  60.058138s
Memory stats
Total objects allocated: 36995309
Total heap pages allocated: 3041
No. of messages in db after test: 17476
Example message digest: 4492894814423649894
No. of sources in db after test: 17862
Example source digest: -5599136425389434189

As you can see, I am not able to saturate the database with a single thread at all, therefore there is no measurable speedup. This needs a proper test on a loaded server with multiple concurrent report uploads, where the faster index will pay off.

Migration test of 100k records: 75 seconds per table, so 100k + 100k records in both tables should be done in less than 3 minutes. Only users with an uncleaned database (where the report expiration cron job does not run) and tens of millions of records would be affected by the migration. I can implement a check so that if there are more than 1000k records, the migration errors out and asks for the report expiration task to be executed first. This could be overridable via an ENV variable.
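
A hedged sketch of that guard (the 1_000_000 threshold and the FORCE_DIGEST_MIGRATION variable name are illustrative, not part of this PR):

class ChangeDigestsToBigint < ActiveRecord::Migration[5.1]
  def up
    # Refuse to migrate very large, uncleaned tables unless explicitly forced,
    # so admins run the report expiration task first.
    if Message.unscoped.count > 1_000_000 && ENV['FORCE_DIGEST_MIGRATION'].nil?
      raise "messages table is too large; run the report expiration task first " \
            "or set FORCE_DIGEST_MIGRATION=1 to migrate anyway"
    end

    # ... column and index changes follow as in this patch ...
  end
end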

@lzap
Member Author

lzap commented Jun 26, 2018

I failed to prove this improves performance, closing.

@lzap lzap closed this Jun 26, 2018