
Fixes #22876 - message/source digests are bigint #5319

Closed
wants to merge 1 commit

Conversation

lzap
Member

@lzap lzap commented Mar 13, 2018

This patch converts digest columns from SHA1 HEX (40 characters) to
BIGINT (8 bytes). Instead of using the secure SHA1 hash, it now uses the
very fast XXHash, which aims for speed and delivers high-quality entropy.

Although the benchmark included in the patch shows only a small performance
improvement on sqlite3, the main reason for this change is saved storage
on the PostgreSQL server - instead of storing 40 characters we only need 8
bytes (64 bits). Searching a BIGINT via indexes is faster than a VARCHAR in
every aspect, so a performance boost should be visible on deployments
with millions of records in those tables.

Please do not merge until this performance boost is confirmed. To test
this patch, please try it on a production instance (PostgreSQL/MySQL) with
tens of millions of records in the message and/or source tables. The
benchmark test can be used to compare before/after.
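
For illustration only (not part of the patch), a minimal irb sketch of the storage difference, assuming the xxhash gem is installed:

require 'digest/sha1'
require 'xxhash'

value = "Some report message"

# Old format: SHA1 hex digest stored in a VARCHAR(255) column - 40 characters per row.
Digest::SHA1.hexdigest(value).bytesize   # => 40

# New format: 64-bit XXHash stored in a BIGINT column - always 8 bytes per row.
XXhash.xxh64(value)                      # => unsigned 64-bit integer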

@theforeman-bot
Member

Issues: #22876

@@ -8,8 +8,17 @@ def to_s
    value
  end

  def self.make_digest(val)
    # convert from unsigned to signed int64
    XXhash.xxh64(val) - 9223372036854775808


Use underscores(_) as decimal mark and separate every 3 digits with them.


Member Author


Better now? 9_223_372_036_854_775_808 ;-)


Member

@stbenjam stbenjam Mar 13, 2018

What's this doing? I'm not super familiar with the magic Ruby does behind the scenes for handling integers.

So this returns a hash > 2**63, so you'd need an unsigned int64 type to represent it:

2.4.2 :008 > XXhash.xxh64("foox")
 => 18147147302980099409

Ruby seems to do the right thing internally, right? It uses 8 bytes:

2.4.2 :007 > XXhash.xxh64("foox").size
 => 8

Member

Oh is the :bigint data type in the database always signed, so when it sees > 2**63 it freaks out?

Member

@stbenjam stbenjam Mar 13, 2018

(Although bigint unsigned seems to be available: https://www.philipcurtis.com/bigint-unsigned-ids-in-rails5-0/, at least on MySQL. Anyway, the hack here is fine but a better comment would help explain what's going on)

Member Author

Yeah, PostgreSQL only supports signed, so I made this trick, which seems to be the fastest way of doing this in Ruby. If you know better, let me know how to convert it. I think some bit shifting also does the job, but this looks pretty straightforward.

Member

@stbenjam stbenjam Mar 20, 2018

I thought about it for a while. Ruby is a PITA. I thought this would correctly convert from unsigned to signed, but Ruby doesn't always seem to do the right thing.

-1*(~(value+1) & 0xFFFF_FFFF_FFFF_FFFF)

But the subtraction should be faster, and be good enough.

C is easier: (int64_t) value, done.
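
For reference, a small irb sketch (not from the PR, assuming the xxhash gem) contrasting the offset subtraction used in this patch with an exact two's-complement reinterpretation via pack/unpack; both land inside the signed int64 range that PostgreSQL's bigint accepts, they just map the value differently:

require 'xxhash'

v = XXhash.xxh64("foox")                  # => 18147147302980099409 (unsigned 64-bit)

# The patch's approach: shift the whole unsigned range down by 2**63.
# [0, 2**64) maps one-to-one onto [-2**63, 2**63), so uniqueness is preserved.
shifted = v - 9_223_372_036_854_775_808   # => 8923775266125323601

# Exact reinterpretation of the same bits as signed (what a C cast to int64_t does).
cast = [v].pack('Q').unpack1('q')         # => -299596770729452207

# Both representations fit a signed 64-bit column.
[shifted, cast].all? { |i| i.between?(-2**63, 2**63 - 1) }   # => true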

    # messages digest
    remove_index :messages, :digest
    remove_column :messages, :digest
    add_column :messages, :digest, :bigint, :limit => 8
Member

Would it be wiser to add a new column instead? It's interesting to start thinking about this in the context of continuous deployments and deployments in clusters.

Member Author

Do you mean basically renaming it to something else? I can do that, sure.
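
A hedged sketch of what the "new column" variant could look like (the digest_xxh64 column name and migration class name are made up for illustration, not part of this PR):

class AddBigintDigestToMessages < ActiveRecord::Migration[5.1]
  def up
    # Add the bigint digest next to the old VARCHAR column instead of replacing it,
    # so nodes still running old code keep working during a rolling deployment.
    add_column :messages, :digest_xxh64, :bigint
    add_index :messages, :digest_xxh64

    # Backfill in batches to keep transactions and memory bounded.
    Message.unscoped.in_batches(of: 1000) do |batch|
      batch.each do |rec|
        rec.update_column(:digest_xxh64, Message.make_digest(rec.value))
      end
    end
  end

  def down
    remove_index :messages, :digest_xxh64
    remove_column :messages, :digest_xxh64
  end
end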

class ChangeDigestsToBigint < ActiveRecord::Migration[5.1]
  def up
    # messages digest
    remove_index :messages, :digest
Member

How long would it take to recreate the index?

Member Author

Faster than doing an index on strings? ;-) I don't have any numbers at the moment; I was thinking this will need an upgrade step of exporting old reports prior to upgrading to speed up the procedure. But I get your point - if this does not show a radical boost in performance with millions of records, then maybe the patch is not worth it. I still believe it will be a dramatic improvement.

    remove_index :messages, :digest
    remove_column :messages, :digest
    add_column :messages, :digest, :bigint, :limit => 8
    Message.unscoped.all.find_each {|rec| rec.update_attribute :digest, Message.make_digest(rec.value)}
Member

please in batches!!!

Member Author

Sorry, what? It's using find_each, which defaults to a batch size of 1000 records.
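
For clarity, the same backfill with the batch size spelled out explicitly (a sketch, not a requested change):

# find_each loads records in batches of 1000 by default; making it explicit
# documents the intent without changing behaviour.
Message.unscoped.find_each(batch_size: 1000) do |rec|
  rec.update_attribute :digest, Message.make_digest(rec.value)
end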

@ekohl
Member

ekohl commented Mar 13, 2018

I'm also hesitant to add a dependency with a C extension, so there should be some good performance benefits to justify this, especially because the database migration could be very expensive.

@@ -3,8 +3,8 @@
require File.expand_path('../../../config/environment', __FILE__)

unless Rails.env.production? && !Rails.configuration.database_configuration["production"]["migrate"]
  puts "Rais must be in production and database must have migrations turned off!"
  puts "Please add similar configuration to your config/database.yaml:"
  puts "Rais must be in production (set RAILS_ENV) and database must have migrations turned off!"
Member

Typo: Rais should be Rails

@lzap
Member Author

lzap commented Mar 20, 2018

There are pure Ruby implementations of the hash algorithm; we don't need to use xxhash, just anything that is fast enough. But I considered both CRC64 and XXHash, and both pure Ruby versions are quite slow.

I agree this needs to be tested on a database with many records, and we want this only if it proves to perform better. That's my next step - to test this and provide some numbers.
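
A minimal micro-benchmark sketch for comparing the two digest functions in isolation; it assumes the benchmark-ips and xxhash gems and is not the benchmark shipped with this patch:

require 'benchmark/ips'
require 'digest/sha1'
require 'xxhash'

SAMPLE = "Sample report message line " * 10

Benchmark.ips do |x|
  x.report("SHA1 hex digest") { Digest::SHA1.hexdigest(SAMPLE) }
  x.report("XXhash 64-bit")   { XXhash.xxh64(SAMPLE) }
  x.compare!
end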

@lzap
Member Author

lzap commented Mar 21, 2018

I am testing on PostgreSQL now. This is the schema before the change:

                                                     Table "public.sources"
 Column |          Type          |                      Modifiers                       | Storage  | Stats target | Description
 
--------+------------------------+------------------------------------------------------+----------+--------------+------------
-
 id     | integer                | not null default nextval('sources_id_seq'::regclass) | plain    |              | 
 value  | text                   |                                                      | extended |              | 
 digest | character varying(255) |                                                      | extended |              | 
Indexes:
    "sources_pkey" PRIMARY KEY, btree (id)
    "index_sources_on_digest" btree (digest)
Has OIDs: no

And after:

 Column |  Type   |                      Modifiers                       | Storage  | Stats target | Description 
--------+---------+------------------------------------------------------+----------+--------------+-------------
 id     | integer | not null default nextval('sources_id_seq'::regclass) | plain    |              | 
 value  | text    |                                                      | extended |              | 
 digest | bigint  |                                                      | plain    |              | 
Indexes:
    "sources_pkey" PRIMARY KEY, btree (id)
    "index_sources_on_digest" btree (digest)
Has OIDs: no

Results with an empty database, filling up to 10k records, before:

Warming up --------------------------------------
      create message     9.000  i/100ms
       create source     9.000  i/100ms
        find message    17.000  i/100ms
         find source    10.000  i/100ms
Calculating -------------------------------------
      create message    109.917  (±20.0%) i/s -      6.021k in  60.060677s
       create source    108.108  (±23.1%) i/s -      5.805k in  60.051191s
        find message    465.180  (±103.4%) i/s -      8.925k in  60.009204s
         find source    123.538  (±98.8%) i/s -      6.100k in  60.023494s
Memory stats
Total objects allocated: 37945219
Total heap pages allocated: 3728
No. of messages in db after test: 17088
Example message digest: 9344f05adc194ebd64417537f9465678aa45705a
No. of sources in db after test: 17584
Example source digest: 3804c0a96e68be5212545be4f6d03bc6446c40ae

And after:

Warming up --------------------------------------
      create message    10.000  i/100ms
       create source     9.000  i/100ms
        find message    19.000  i/100ms
         find source    10.000  i/100ms
Calculating -------------------------------------
      create message    109.233  (±21.1%) i/s -      5.930k in  60.043792s
       create source    110.342  (±19.9%) i/s -      6.075k in  60.070752s
        find message    458.648  (±106.4%) i/s -      8.626k in  60.054695s
         find source    197.765  (±168.4%) i/s -      6.340k in  60.058138s
Memory stats
Total objects allocated: 36995309
Total heap pages allocated: 3041
No. of messages in db after test: 17476
Example message digest: 4492894814423649894
No. of sources in db after test: 17862
Example source digest: -5599136425389434189

As you can see, I am not able to saturate the database with a single thread at all, therefore there is no measurable speedup. This needs a proper test on a loaded server with multiple concurrent report uploads, where the faster index will pay off.

Migration test of 100k records: 75 seconds per table, so 100k + 100k records in both tables should be done in less than 3 minutes. Only users with an uncleaned database (where the report expiration cron job does not run) and tens of millions of records would be affected by the migration. I can implement a check so that if there are more than 1000k records, the migration errors out and asks for the report expiration task to be executed first. This could be overridable via an ENV variable.
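
A hedged sketch of that guard (the 1_000_000 threshold and the FORCE_DIGEST_MIGRATION variable name are illustrative, not part of this PR):

class ChangeDigestsToBigint < ActiveRecord::Migration[5.1]
  def up
    # Refuse to migrate very large, uncleaned tables unless explicitly forced,
    # so admins run the report expiration task first.
    if Message.unscoped.count > 1_000_000 && ENV['FORCE_DIGEST_MIGRATION'].nil?
      raise "messages table is too large; run the report expiration task first " \
            "or set FORCE_DIGEST_MIGRATION=1 to migrate anyway"
    end

    # ... column and index changes follow as in this patch ...
  end
end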

@lzap
Member Author

lzap commented Jun 26, 2018

I failed to prove this improves performance, closing.

@lzap lzap closed this Jun 26, 2018