
Calculating row digest is slow #2884

Closed
duarten opened this issue Oct 10, 2017 · 20 comments


@duarten duarten commented Oct 10, 2017

Installation details
Scylla version (or git commit hash): >= 2.0, master

The stall detector reported stack traces pointing to digest calculation taking more than 6ms:

md5_hasher::update(char const*, unsigned long) at /usr/src/debug/scylla-enterprise-2017.1.2.rc2/md5_hasher.hh:24
 (inlined by) void appending_hash<std::experimental::fundamentals_v1::basic_string_view<signed char, std::char_traits<signed char> >, void>::operator()<md5_hasher>(md5_hasher&, std::experimental::fundamentals_v1::basic_string_view<signed char, std::char_traits<signed char> >) const at /usr/src/debug/scylla-enterprise-2017.1.2.rc2/./bytes.hh:64
 (inlined by) void feed_hash<md5_hasher, std::experimental::fundamentals_v1::basic_string_view<signed char, std::char_traits<signed char> > >(md5_hasher&, std::experimental::fundamentals_v1::basic_string_view<signed char, std::char_traits<signed char> > const&) at /usr/src/debug/scylla-enterprise-2017.1.2.rc2/./hashing.hh:43
 (inlined by) void appending_hash<atomic_cell_view, void>::operator()<md5_hasher>(md5_hasher&, atomic_cell_view) const at /usr/src/debug/scylla-enterprise-2017.1.2.rc2/atomic_cell_hash.hh:43
hash_row_slice(md5_hasher&, schema const&, column_kind, row const&, std::vector<unsigned int, std::allocator<unsigned int> > const&) [clone .constprop.1215] at mutation_partition.cc:?
mutation_querier::consume(clustering_row&&, tombstone) at ??:?
future<bool_class<stop_iteration_tag> > do_consume_streamed_mutation_flattened<stable_flattened_mutations_consumer<compact_for_query<(emit_only_live_rows)1, query_result_builder> > >(streamed_mutation&, stable_flattened_mutations_consumer<compact_for_query<(emit_only_live_rows)1, query_result_builder> >&) at ??:?
_ZN8futurizeI6futureII10bool_classI18stop_iteration_tagEEEE5applyIZZZ17consume_flattenedI35stable_flattened_mutations_consumerI17compact_for_queryIL19emit_only_live_rows1E20query_result_builderEEEDa15mutation_readerOT_bENKUlRSF_RT0_RT1_E_clISE_SD_NSt12experimental15fundamentals_v18optionalI17streamed_mutationEEEEDaSH_SJ_SL_ENKUlvE_clEvEUlSF_E_I18optimized_optionalISR_EEEES4_SG_OSt5tupleIIDpT0_EE at ??:?
_ZN6futureII18optimized_optionalI17streamed_mutationEEE4thenIZZZ17consume_flattenedI35stable_flattened_mutations_consumerI17compact_for_queryIL19emit_only_live_rows1E20query_result_builderEEEDa15mutation_readerOT_bENKUlRSD_RT0_RT1_E_clISC_SB_NSt12experimental15fundamentals_v18optionalIS1_EEEEDaSF_SH_SJ_ENKUlvE_clEvEUlSD_E_S_II10bool_classI18stop_iteration_tagEEEEESG_SE_ at ??:?
_Z6repeatIZZ17consume_flattenedI35stable_flattened_mutations_consumerI17compact_for_queryIL19emit_only_live_rows1E20query_result_builderEEEDa15mutation_readerOT_bENKUlRS8_RT0_RT1_E_clIS7_S6_NSt12experimental15fundamentals_v18optionalI17streamed_mutationEEEEDaSA_SC_SE_EUlvE_E6futureIJEES9_ at mutation_partition.cc:?
data_query(lw_shared_ptr<schema const>, mutation_source const&, nonwrapping_range<dht::ring_position> const&, query::partition_slice const&, unsigned int, unsigned int, std::chrono::time_point<gc_clock, std::chrono::duration<int, std::ratio<1l, 1l> > >, query::result::builder&) at ??:?
column_family::query(lw_shared_ptr<schema const>, query::read_command const&, query::result_request, std::vector<nonwrapping_range<dht::ring_position>, std::allocator<nonwrapping_range<dht::ring_position> > > const&, lw_shared_ptr<tracing::trace_state>, query::result_memory_limiter&, unsigned long)::{lambda(query::result_memory_accounter)#1}::operator()(query::result_memory_accounter)::{lambda()#1}::operator()() const at /usr/src/debug/scylla-enterprise-2017.1.2.rc2/database.cc:2564
future<lw_shared_ptr<query::result> > futurize<future<lw_shared_ptr<query::result> > >::apply<column_family::query(lw_shared_ptr<schema const>, query::read_command const&, query::result_request, std::vector<nonwrapping_range<dht::ring_position>, std::allocator<nonwrapping_range<dht::ring_position> > > const&, lw_shared_ptr<tracing::trace_state>, query::result_memory_limiter&, unsigned long)::{lambda(query::result_memory_accounter)#1}, query::result_memory_accounter>(column_family::query(lw_shared_ptr<schema const>, query::read_command const&, query::result_request, std::vector<nonwrapping_range<dht::ring_position>, std::allocator<nonwrapping_range<dht::ring_position> > > const&, lw_shared_ptr<tracing::trace_state>, query::result_memory_limiter&, unsigned long)::{lambda(query::result_memory_accounter)#1}&&, std::tuple<query::result_memory_accounter>&&) at database.cc:?
column_family::query(lw_shared_ptr<schema const>, query::read_command const&, query::result_request, std::vector<nonwrapping_range<dht::ring_position>, std::allocator<nonwrapping_range<dht::ring_position> > > const&, lw_shared_ptr<tracing::trace_state>, query::result_memory_limiter&, unsigned long) at ??:?
database::query(lw_shared_ptr<schema const>, query::read_command const&, query::result_request, std::vector<nonwrapping_range<dht::ring_position>, std::allocator<nonwrapping_range<dht::ring_position> > > const&, lw_shared_ptr<tracing::trace_state>, unsigned long) at ??:?

I benchmarked it and got the following results:

size = 256; elapsed = 4us
size = 512; elapsed = 8us
size = 1024; elapsed = 14us
size = 2048; elapsed = 21us
size = 4096; elapsed = 33us
size = 8192; elapsed = 51us
size = 16384; elapsed = 86us
size = 32768; elapsed = 150us
size = 65536; elapsed = 278us
size = 131072; elapsed = 531us
size = 262144; elapsed = 1032us
size = 524288; elapsed = 2026us
size = 1048576; elapsed = 4004us
size = 2097152; elapsed = 7943us
size = 4194304; elapsed = 15800us
size = 8388608; elapsed = 31731us
size = 16777216; elapsed = 64681us
size = 33554432; elapsed = 130752us
size = 67108864; elapsed = 263154us

Machine:

Core i7-5960X @ 3GHz (C-states disabled, XMP and Turbo enabled)
64GB DDR4 2800MHz

While 4us is fine and dandy for 256 bytes, we can see that 8MB already takes 15.8msec. 64MB takes 263msec.

Here is a flamegraph:

[flamegraph screenshot, 2017-10-10]

@duarten duarten commented Oct 10, 2017

For reference, this is the benchmark:

return seastar::async([] {
    auto builder = schema_builder("ks", "cf")
            .with_column("pk", bytes_type, column_kind::partition_key)
            .with_column("ck", bytes_type, column_kind::clustering_key)
            .with_column("v", bytes_type, column_kind::regular_column);
    auto make_blob = [] (unsigned blob_size, int8_t prefix) {
        bytes b(blob_size, int8_t(0));
        std::copy_n(&prefix, sizeof(prefix), b.begin());
        return b;
    };
    auto schema = builder.build();
    auto pkey = partition_key::from_single_value(*schema, make_blob(128, 0));
    auto dk = dht::global_partitioner().decorate_key(*schema, pkey);
    auto prange = dht::partition_range::make_singular(dk);
    auto ckey = clustering_key::from_exploded(*schema, { make_blob(128, 1) });
    auto slice = partition_slice_builder(*schema).build();

    auto size = 256;
    while (size < (1 << 27)) {
        mutation m(pkey, schema);
        deletable_row& row = m.partition().clustered_row(*schema, ckey);
        auto cell = atomic_cell::make_live(1, make_blob(size, 4));
        row.cells().apply(schema->column_at(column_kind::regular_column, 0), std::move(cell));
        row.marker() = row_marker(1);
        auto source = mutation_source([m = std::move(m)] (
                        schema_ptr,
                        const dht::partition_range&,
                        const query::partition_slice&,
                        const io_priority_class&,
                        tracing::trace_state_ptr,
                        streamed_mutation::forwarding,
                        mutation_reader::forwarding) {
            return make_reader_returning(m);
        });

        auto ops = 100;
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < ops; ++i) {
            query::result::builder b(slice, query::result_request::only_digest, { });
            data_query(schema, source, prange, slice, 10000, query::max_partitions, gc_clock::now(), b, nullptr).get();
        }
        auto total = std::chrono::duration<double>(std::chrono::high_resolution_clock::now() - start);
        auto total_op = std::chrono::duration_cast<std::chrono::microseconds>(total).count() / ops;

        std::cout << "size = " << size << "; elapsed = " << total_op << "us" << std::endl;
        size = size << 1;
    }
});
@gleb-cloudius gleb-cloudius commented Oct 11, 2017

@avikivity avikivity commented Oct 11, 2017

Perhaps we can switch to a faster hash (one with hardware support).

@slivne slivne commented Oct 11, 2017

@duarten duarten commented Oct 11, 2017

We should only be calculating the digest for the columns the user selected.

@slivne slivne commented Oct 11, 2017

Adding information from the daily call - if a user selects a 1MB blob and requests only metadata of that blob (its ttl or writetime), we currently have no capability to compute the digest only over the metadata, and are forced to compute the digest over the blob and its metadata.

We have discussed the following options:

  • Switching to a HW-accelerated digest, and making sure the complete cluster supports it
  • Pushing down query information - in which case we would be able to compute the digest over only the needed info
  • Adding "yield" spots in the computation path - that may help in some cases, yet for a large blob it's not clear it will help

One other option that may help in such a case: instead of computing the digest, return the information that was requested - @gleb-cloudius - is there a way a user can force a query on select ttl(X) ... to return the query_result instead of computing the digest?

@duarten duarten commented Oct 11, 2017

The part about the metadata is not clear to me. If we do this in the context of read repair, wouldn't it be wrong to send the digest of the metadata instead of the value? Two different values may have equal metadata.

Yielding also wouldn't help much, as it would further increase the latency of the query. We would also need to add defer points to the algorithm itself, for which we currently use a third-party library.

@slivne slivne commented Oct 11, 2017

We need to compute the digest of the relevant part of the returned query data - if the user requested only the ttl of a cell, we could compute the digest for just that part and verify it is correct / fix only that part if needed.

Consider the following case:
Two nodes A and B both store a row with element X, where:
Node A: X != Node B: X
Node A: ttl(X) = Node B: ttl(X)

A query with CL=QUORUM of select ttl(X) ... on current master would "fix" the data (although the ttl(X) on both nodes is the same), and that was the only part the user requested.

@duarten duarten commented Oct 11, 2017

The ttl is part of the cell, so it will always appear in the digest. However, if we have:

Node A: X != Node B: X
Node A: ttl(X) = Node B: ttl(x)

Then we still want to repair.

@slivne slivne commented Oct 11, 2017

@gleb-cloudius gleb-cloudius commented Oct 12, 2017

@gleb-cloudius gleb-cloudius commented Oct 12, 2017

@duarten duarten commented Oct 12, 2017

@avikivity avikivity commented Oct 12, 2017

1ms of compute on a 1MB blob -> 1GB/s; we will never get that from a single core. Sub-millisecond latencies can only be provided for significantly smaller queries.

Our blobs are already fragmented into smaller chunks (currently we linearize them for processing, but in the future we won't). Fragmentation gives natural yield points.

@duarten duarten commented Oct 12, 2017

Maybe we could add some metadata to each row in the cache, and cache a cell's hash in that metadata, which would be lazily populated.

@avikivity avikivity commented Oct 13, 2017

Interesting idea. Would increase footprint, perhaps do it for large cells only.

@avikivity avikivity commented Oct 13, 2017

Perhaps xxhash. It's not cryptographic, but that might not matter.

@duarten duarten commented Oct 13, 2017

Better results with xxhash:

size = 256; elapsed = 2us
size = 512; elapsed = 1us
size = 1024; elapsed = 1us
size = 2048; elapsed = 2us
size = 4096; elapsed = 2us
size = 8192; elapsed = 3us
size = 16384; elapsed = 5us
size = 32768; elapsed = 8us
size = 65536; elapsed = 14us
size = 131072; elapsed = 28us
size = 262144; elapsed = 59us
size = 524288; elapsed = 116us
size = 1048576; elapsed = 226us
size = 2097152; elapsed = 456us
size = 4194304; elapsed = 935us
size = 8388608; elapsed = 1848us
size = 16777216; elapsed = 4723us
size = 33554432; elapsed = 10507us
size = 67108864; elapsed = 21622us

It produces a 64-bit digest instead of MD5's 128 bits, though.

@duarten duarten self-assigned this Oct 17, 2017
@slivne slivne commented Nov 26, 2017

@duarten - are you really coding this - if not I'll move it to someone else

@slivne slivne added this to the 2.x milestone Nov 26, 2017
@duarten duarten commented Nov 26, 2017

@slivne I do have a branch moving us to xxhash, yes. I'll finish it this week!

duarten added a commit to duarten/scylla that referenced this issue Nov 30, 2017
We add a cluster feature that informs whether the xxHash algorithm is
supported, and allow nodes to switch to it when possible. We use a
cluster feature because older versions are not ready to receive a
different digest algorithm than MD5 when answering a data request.

Fixes scylladb#2884

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
duarten added a commit to duarten/scylla that referenced this issue Jan 26, 2018
We add a cluster feature that informs whether the xxHash algorithm is
supported, and allow nodes to switch to it. We use a cluster feature
because older versions are not ready to receive a different digest
algorithm than MD5 when answering a data request.

If we ever should add a new hash algorithm, we would also need to
add a new cluster feature for that algorithm. The alternative would be
to add code so a coordinator could negotiate what digest algorithm to
use with the set of replicas it is contacting.

Fixes scylladb#2884

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
@slivne slivne modified the milestones: 2.x, 2.2 Jan 31, 2018
@avikivity avikivity closed this in 0bab3e5 Feb 8, 2018
avikivity added a commit that referenced this issue Feb 8, 2018
"This series changes digest calculation to use a faster algorithm
(xxHash) and to also cache calculated cell hashes that can be kept in
memory to speed up subsequent digest requests.

The MD5 hash function has proved to be slow for large cell values:

size = 256; elapsed = 4us
size = 512; elapsed = 8us
size = 1024; elapsed = 14us
size = 2048; elapsed = 21us
size = 4096; elapsed = 33us
size = 8192; elapsed = 51us
size = 16384; elapsed = 86us
size = 32768; elapsed = 150us
size = 65536; elapsed = 278us
size = 131072; elapsed = 531us
size = 262144; elapsed = 1032us
size = 524288; elapsed = 2026us
size = 1048576; elapsed = 4004us
size = 2097152; elapsed = 7943us
size = 4194304; elapsed = 15800us
size = 8388608; elapsed = 31731us
size = 16777216; elapsed = 64681us
size = 33554432; elapsed = 130752us
size = 67108864; elapsed = 263154us

xxHash is a non-cryptographic 64-bit hash (there's work in progress on a 128-bit version) that can be used to replace MD5. It performs much better:

size = 256; elapsed = 2us
size = 512; elapsed = 1us
size = 1024; elapsed = 1us
size = 2048; elapsed = 2us
size = 4096; elapsed = 2us
size = 8192; elapsed = 3us
size = 16384; elapsed = 5us
size = 32768; elapsed = 8us
size = 65536; elapsed = 14us
size = 131072; elapsed = 28us
size = 262144; elapsed = 59us
size = 524288; elapsed = 116us
size = 1048576; elapsed = 226us
size = 2097152; elapsed = 456us
size = 4194304; elapsed = 935us
size = 8388608; elapsed = 1848us
size = 16777216; elapsed = 4723us
size = 33554432; elapsed = 10507us
size = 67108864; elapsed = 21622us

Performance was tested using a 3-node cluster (1 cpu and 8GB of memory each), with the following cassandra-stress loaders. Measurements are for the read workload.

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 32699 [READ:32699]
partition rate            : 32699 [READ:32699]
row rate                  : 32699 [READ:32699]
latency mean              : 3.0 [READ:3.0]
latency median            : 3.0 [READ:3.0]
latency 95th percentile   : 3.9 [READ:3.9]
latency 99th percentile   : 4.5 [READ:4.5]
latency 99.9th percentile : 6.6 [READ:6.6]
latency max               : 24.0 [READ:24.0]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:05
END

md5:

Results:
op rate                   : 25241 [READ:25241]
partition rate            : 25241 [READ:25241]
row rate                  : 25241 [READ:25241]
latency mean              : 3.9 [READ:3.9]
latency median            : 3.9 [READ:3.9]
latency 95th percentile   : 5.1 [READ:5.1]
latency 99th percentile   : 5.8 [READ:5.8]
latency 99.9th percentile : 8.0 [READ:8.0]
latency max               : 24.8 [READ:24.8]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:06:36
END

This translates into a 21% improvement for this workload.

Bigger cell values were also tested:

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 19964 [READ:19964]
partition rate            : 19964 [READ:19964]
row rate                  : 19964 [READ:19964]
latency mean              : 4.9 [READ:4.9]
latency median            : 4.6 [READ:4.6]
latency 95th percentile   : 7.2 [READ:7.2]
latency 99th percentile   : 11.5 [READ:11.5]
latency 99.9th percentile : 13.6 [READ:13.6]
latency max               : 29.2 [READ:29.2]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:08:20
END

md5:

Results:
op rate                   : 12773 [READ:12773]
partition rate            : 12773 [READ:12773]
row rate                  : 12773 [READ:12773]
latency mean              : 7.7 [READ:7.7]
latency median            : 7.3 [READ:7.3]
latency 95th percentile   : 10.2 [READ:10.2]
latency 99th percentile   : 16.8 [READ:16.8]
latency 99.9th percentile : 19.2 [READ:19.2]
latency max               : 71.5 [READ:71.5]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:13:02
END

This translates into a 37% improvement for this workload.

Fixes #2884

Tests: unit-tests (release), dtests (smp=2)

Note: dtests are kinda broken in master (> 30 failures), so take the
tests tag with a grain of himalayan salt."

* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits)
  tests/row_cache_test: Test hash caching
  tests/memtable_test: Test hash caching
  tests/mutation_test: Use xxHash instead of MD5 for some tests
  tests/mutation_test: Test xx_hasher alongside md5_hasher
  schema: Remove unneeded include
  service/storage_proxy: Enable hash caching
  service/storage_service: Add and use xxhash feature
  message/messaging_service: Specify algorithm when requesting digest
  storage_proxy: Extract decision about digest algorithm to use
  cache_flat_mutation_reader: Pre-calculate cell hash
  partition_snapshot_reader: Pre-calculate cell hash
  query::partition_slice: Add option to specify when digest is requested
  row: Use cached hash for hash calculation
  mutation_partition: Replace hash_row_slice with appending_hash
  mutation_partition: Allow caching cell hashes
  mutation_partition: Force vector_storage internal storage size
  test.py: Increase memory for row_cache_stress_test
  atomic_cell_hash: Add specialization for atomic_cell_or_collection
  query-result: Use digester instead of md5_hasher
  range_tombstone: Replace feed_hash() member function with appending_hash
  ...