nodetool tablestats: Partition keys number (estimate) in Scylla post migration from C* differs by 20% less up to 120% more than the original amount in C* #2545

Open
tomer-sandler opened this Issue Jul 4, 2017 · 2 comments

tomer-sandler commented Jul 4, 2017

Installation details
Scylla version (or git commit hash): 1.7.1
Cluster size: 3
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 16.04
C* version: 3.10 (3 node cluster)

I migrated 3 keyspaces in parallel, each with 1 table of ~10M partitions, using 3 intermediate nodes. Each intermediate node has an NFS mount point to one of the C* nodes, for one of the keyspaces.
Each intermediate node ran sstableloader and loaded the files to a different Scylla node.

Metrics in Grafana here: http://104.196.52.52:3000/dashboard/db/scylla-per-server-metrics-1-7?from=1499084205742&to=1499086585000

After all sstable files were loaded and compactions completed, the number of partitions is much bigger than the 9.8M we had in C*. So far in my tests, the partition keys estimate after migration, completed compactions, and nodetool flush ranges from 20% less to 120% more than the original amount in C* 3.10.

tomer@ubuntu16-scylla171-migration-1:~$ nodetool tablestats migration3 | grep keys
                Number of keys (estimate): 19510809
tomer@ubuntu16-scylla171-migration-1:~$ nodetool tablestats migration4 | grep keys
                Number of keys (estimate): 12313292
tomer@ubuntu16-scylla171-migration-1:~$ nodetool tablestats migration5 | grep keys
                Number of keys (estimate): 19668921
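
To put the three estimates above in perspective, here is a minimal sketch that computes how far each one is from the ~9.8M partitions reported for C*. The class and method names are made up for illustration, and the 9.8M baseline is the approximate figure quoted above, not an exact count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of nodetool or Scylla.
class EstimateDeviation {
    // Signed deviation of an estimate from the baseline, in percent.
    static double deviationPercent(long estimate, long baseline) {
        return 100.0 * (estimate - baseline) / baseline;
    }

    public static void main(String[] args) {
        long baseline = 9_800_000L; // approximate C* partition count from the report

        // "Number of keys (estimate)" values from the nodetool output above.
        Map<String, Long> estimates = new LinkedHashMap<>();
        estimates.put("migration3", 19_510_809L);
        estimates.put("migration4", 12_313_292L);
        estimates.put("migration5", 19_668_921L);

        for (Map.Entry<String, Long> e : estimates.entrySet()) {
            System.out.printf("%s: %+.1f%%%n",
                    e.getKey(), deviationPercent(e.getValue(), baseline));
        }
    }
}
```

For this particular run the three tables come out roughly 99%, 26% and 101% above the C* figure, so all of them over-report.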

@glommer wrote about the estimate:
In Nodetool, it is exported as estimatedPartitionCount, which is calculated as

    Object estimatedPartitionCount = probe.getColumnFamilyMetric(keyspaceName, tableName, "EstimatedPartitionCount");

Now let's look at what that metric really is in TableMetrics.java. It is basically:

    long memtablePartitions = 0;
    for (Memtable memtable : cfs.getTracker().getView().getAllMemtables())
        memtablePartitions += memtable.partitionCount();
    return SSTableReader.getApproximateKeyCount(cfs.getSSTables(SSTableSet.CANONICAL)) + memtablePartitions;

It is also defined as an alias for EstimatedRowCount. The latter is what scylla-jmx responds to, and it translates to /column_family/metrics/estimated_row_count/

Looking at our implementation, we do not include memtables. Also, when
getting the sstable set, they pass that flag "CANONICAL". The comment
on top of that definition says:

    // returns the "canonical" version of any current sstable, i.e. if an sstable is being replaced and is only partially
    // visible to reads, this sstable will be returned as its original entirety, and its replacement will not be returned
    // (even if it completely replaces it)
    CANONICAL,

So my conclusion here is that Scylla is misreporting this. The fact
that we don't include memtables should lead us to underreport. Shared
sstables and sstables being compacted will lead us to overreport.
Those, I think, we should fix.
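
The two opposite effects described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the SSTable class, its timesCounted field, and both methods are hypothetical and are not Scylla or Cassandra code. It assumes that over-reporting comes from some sstables being summed more than once, and under-reporting from leaving memtables out.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only; none of these names exist in Scylla or Cassandra.
class EstimateModel {
    // Hypothetical stand-in for an sstable.
    static final class SSTable {
        final long estimatedKeys; // key estimate taken from the sstable itself
        final int timesCounted;   // how often the sum visits it (assumed > 1 if shared)

        SSTable(long estimatedKeys, int timesCounted) {
            this.estimatedKeys = estimatedKeys;
            this.timesCounted = timesCounted;
        }
    }

    // Mirrors the shape of the C* snippet: each sstable once, plus memtables.
    static long referenceEstimate(List<SSTable> sstables, long memtablePartitions) {
        long total = memtablePartitions;
        for (SSTable s : sstables) {
            total += s.estimatedKeys;
        }
        return total;
    }

    // Assumed skewed behaviour: memtables ignored, some sstables counted repeatedly.
    static long skewedEstimate(List<SSTable> sstables) {
        long total = 0;
        for (SSTable s : sstables) {
            total += s.estimatedKeys * s.timesCounted;
        }
        return total;
    }

    public static void main(String[] args) {
        List<SSTable> sstables = Arrays.asList(
                new SSTable(1000, 1),
                new SSTable(2000, 2), // counted twice in the skewed sum
                new SSTable(500, 1));
        long memtablePartitions = 300;

        System.out.println("reference: " + referenceEstimate(sstables, memtablePartitions));
        System.out.println("skewed:    " + skewedEstimate(sstables));
    }
}
```

Which effect dominates depends on the data: after a flush the memtable term is small, so any double counting would push the result upward, which matches the direction seen in most of the numbers reported here.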

Another potential interesting difference is the calculation of the
estimate itself. Those estimates come from the Statistics.db file if
available, with a fallback to a simple index-based calculation. It is
entirely possible that C*3 has a newer version of that file.

slivne added this to the 2.x milestone Jul 24, 2017


tzach commented Mar 20, 2018

@glommer if I understand correctly, the issue is:
Scylla does not include memtables in nodetool tablestats, while Apache Cassandra does. Is that the case?


glommer commented Mar 20, 2018

No, I never said that. I am bringing this back from my cache, but reading the above, my statements concentrate on the fact that we calculate size estimates differently (by double counting some SSTables, for instance).
