New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hinted handoff discards hints during sending for no good reason #4122

Closed
vladzcloudius opened this Issue Jan 22, 2019 · 7 comments

Comments

Projects
None yet
5 participants
@vladzcloudius
Copy link
Contributor

vladzcloudius commented Jan 22, 2019

Installation details
Scylla version: dbc1894
Releases affected: any release that includes HH

Description
Hints are being discarded during sending because column family mapping is not found.

How to reproduce:

  1. Create a 3 nodes ccm cluster: ccm create scylla-3 --scylla --vnodes -n 3:3 --install-dir=<directory to the local scylla repo>
  2. Start it: ccm start --jvm_arg="--memory" --jvm_arg="3G" --jvm_arg="--collectd-address" --jvm_arg="127.0.0.1:25826" --jvm_arg="--collectd" --jvm_arg="1" --jvm_arg="--default-log-level" --jvm_arg="info" --jvm_arg="--smp" --jvm_arg="3" --jvm_arg="--logger-log-level" --jvm_arg="hints_manager=trace"
  3. Run a cassandra-stress against it: cassandra-stress write n=10000000 cl=ONE no-warmup -rate threads=300 -mode native cql3 -schema "keyspace=ks1 replication(factor=3)"

Grep logs of node1 as follows:

$ tail -f ~/.ccm/scylla-3/node1/logs/system.log | grep "is missing" 

Observe a non-empty output after some time - when hints will begin to be sent.
This is not expected because there is no schema change (which could have caused such errors).

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

vladzcloudius commented Jan 22, 2019

This is caused by a logic in out HH implementation that would send the segment not from the beginning but rather from the place it left off in the previous send attempt (in the example above this would be caused by the timeout because the receiving node is overloaded).

In this code we cache the last replay position but do not cache the columns mappings we learned so far and would start with an empty column mappings cache the next time we try to re-transmit the segment. However commitlog framework doesn't save the corresponding column mapping in each entry and tries to optimize it and saves it once in each segment. Therefore when we start re-transmitting from the middle of a segment we may need mappings that were written in entries we have already sent.

TODO:

  1. Cache column mappings together with the last replay position.
  2. Add hints.discarded metric to account hints that are discarded during sending.
@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

vladzcloudius commented Jan 22, 2019

@slivne I already have the patch that addresses both items above. I'm going to send them tomorrow after I finish testing.

@tzach tzach added the bug label Jan 22, 2019

@tzach

This comment has been minimized.

Copy link
Contributor

tzach commented Jan 22, 2019

@vladzcloudius can you explain the user impact of this bug?
My understanding: some hints may be lost.
Is it the case? Can we scope the effect further, to a particular condition?

@slivne slivne added this to the 3.1 milestone Jan 22, 2019

@slivne slivne added the Backport 3.0 label Jan 22, 2019

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

vladzcloudius commented Jan 22, 2019

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

vladzcloudius commented Jan 22, 2019

@elcallio I would like to ask you something: I noticed that commitlog resets column family mappings every "cycle". This means that if a segment has undergo a few "cycles" it would save information of the same CF a few times. In the use case I described above (default c-s schema) this is the case and I saw that some segments have the same cf mapping is saved ~250 times!!!
I wonder what's the reason to save it more than once per-segment?

duarten added a commit that referenced this issue Jan 23, 2019

Merge 'hinted handoff: cache cf mappings' from Vlad
"
Cache cf mappings when breaking in the middle of a segment sending so
that the sender has them the next time it wants to send this segment
for where it left off before.

Also add the "discard" metric so that we can track hints that are being
discarded in the send flow.
"

Fixes #4122

* 'hinted_handoff_cache_cf_mappings-v1' of https://github.com/vladzcloudius/scylla:
  hinted handoff: cache column family mappings for segments that were not sent out in full
  hinted handoff: add a "discarded" metric

avikivity added a commit that referenced this issue Jan 23, 2019

Merge 'hinted handoff: cache cf mappings' from Vlad
"
Cache cf mappings when breaking in the middle of a segment sending so
that the sender has them the next time it wants to send this segment
for where it left off before.

Also add the "discard" metric so that we can track hints that are being
discarded in the send flow.
"

Fixes #4122

* 'hinted_handoff_cache_cf_mappings-v1' of https://github.com/vladzcloudius/scylla:
  hinted handoff: cache column family mappings for segments that were not sent out in full
  hinted handoff: add a "discarded" metric

(cherry picked from commit 88c7c1e)
@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

vladzcloudius commented Feb 12, 2019

@glommer @slivne
@elcallio ping. Could you, please, address my question above?

@glommer

This comment has been minimized.

Copy link
Contributor

glommer commented Feb 13, 2019

I have no idea, therefore deferring to mr @elcallio

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment