Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

commitlog files are corrupted: Checksum error in segment chunk at XXXX #4231

Closed
vladzcloudius opened this issue Feb 16, 2019 · 28 comments
Closed
Labels
bug

Comments

@vladzcloudius
Copy link
Contributor

@vladzcloudius vladzcloudius commented Feb 16, 2019

Installation details
Scylla version: 9cd14f2
Cluster size: 3 nodes ccm cluster
OS: Ubuntu 18.04

Hardware details
Platform: physical
Disks: a single NVMe card

Description

  1. Create a 3 nodes ccm cluster:
ccm create scylla-3 --scylla --vnodes -n 3 --install-dir=<path to scylla repo>
  1. Start scylla:
ccm start --jvm_arg="--memory" --jvm_arg="2G" --jvm_arg="--collectd-address" --jvm_arg="127.0.0.1:25826" --jvm_arg="--collectd" --jvm_arg="0" --jvm_arg="--default-log-level" --jvm_arg="info" --jvm_arg="--smp" --jvm_arg="2" --jvm_arg="--io-properties-file" --jvm_arg="/home/vladz/.ccm/scylla-1/node1/conf/io_properties.yaml" --jvm_arg="--developer-mode" --jvm_arg="false" --jvm_arg="--enable-sstables-mc-format" --jvm_arg="true" --jvm_arg="--hinted-handoff-enabled" --jvm_arg="false" --jvm_arg="--logger-log-level" --jvm_arg="commitlog=info"

Contents of /home/vladz/.ccm/scylla-1/node1/conf/io_properties.yaml:

disks:
- mountpoint: '/home/vladz/.ccm/scylla-1/node1/data'
  read_iops: 4178
  read_bandwidth: 1062918272
  write_iops: 1587
  write_bandwidth: 435770624
  1. Run c-s with the following user profile:
#
# Keyspace name and create CQL
#
keyspace: stressexample
keyspace_definition: |
  CREATE KEYSPACE stressexample WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '3'};
#
# Table name and create CQL
#
table: eventsrawtest
table_definition: |
  CREATE TABLE eventsrawtest (
        key int,
        cluster_key int,
        data blob,
        PRIMARY KEY (key, cluster_key)
  )

#
# Meta information for generating data
#
columnspec:
  - name: key
    #size: fixed(32) #In chars, no. of chars of UUID
    population: uniform(1..6000000000)  # We have about 600 hosts with equal events per host
  - name: cluster_key
    population: uniform(1..100) # 100
    cluster: fixed(100)
  - name: data
    size: fixed(100000)

#
# Specs for insert queries
#
insert:
  partitions: fixed(1)      # 1 partition per batch
  batchtype: UNLOGGED       # use unlogged batches
  select: fixed(20)/100      # no chance of skipping a row when generating inserts

#
# Read queries to run against the schema
#
queries:
   get-part:
      cql: select * from bulk where key = ?
      #fields: samerow             # pick selection values from same row in partition
cassandra-stress user profile=/home/vladz/work/scripts/stress_tools/cassandra-stress/user_profile.yaml no-warmup ops\(insert=1\) n=1000000 -rate threads=50 throttle=30/s
  1. Run pkill -9 scylla after there are about 20 commitlog files in some of commitlog directories, e.g. at ~/.ccm/scylla-3/node1/commitlog.
  2. Start scylla again but this time with a commitlog logger in a debug mode:
ccm start --jvm_arg="--memory" --jvm_arg="2G" --jvm_arg="--collectd-address" --jvm_arg="127.0.0.1:25826" --jvm_arg="--collectd" --jvm_arg="0" --jvm_arg="--default-log-level" --jvm_arg="info" --jvm_arg="--smp" --jvm_arg="2" --jvm_arg="--io-properties-file" --jvm_arg="/home/vladz/.ccm/scylla-1/node1/conf/io_properties.yaml" --jvm_arg="--developer-mode" --jvm_arg="false" --jvm_arg="--enable-sstables-mc-format" --jvm_arg="true" --jvm_arg="--hinted-handoff-enabled" --jvm_arg="false" --jvm_arg="--logger-log-level" --jvm_arg="commitlog=debug"
  1. Watch the scylla log:
<-- snip -->
INFO  2019-02-15 21:28:22,767 [shard 0] commitlog_replayer - Replaying /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705169.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187154.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187158.log, /home/vladz/.ccm/scylla
-3/node1/commitlogs/CommitLog-1-3705173.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187159.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705174.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705175.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-1801
4398513187160.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705176.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187161.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705177.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187162.log, /home/v
ladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187163.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705178.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187164.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705179.log, /home/vladz/.ccm/scylla-3/node1/c
ommitlogs/CommitLog-1-18014398513187165.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705180.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/Recycled-CommitLog-1-3705182.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/Recycled-CommitLog-1-3705183.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/Recycled
-CommitLog-1-18014398513187168.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187169.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705184.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187170.log
DEBUG 2019-02-15 21:28:22,773 [shard 1] commitlog - Checksum error in segment chunk at 1302536.
DEBUG 2019-02-15 21:28:22,775 [shard 0] commitlog - Checksum error in segment chunk at 1605640.
WARN  2019-02-15 21:28:22,775 [shard 1] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187169.log. 32251896 bytes skipped.
WARN  2019-02-15 21:28:22,776 [shard 0] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705180.log. 31948792 bytes skipped.
DEBUG 2019-02-15 21:28:22,781 [shard 0] commitlog - Checksum error in segment chunk at 1904648.
WARN  2019-02-15 21:28:22,784 [shard 0] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705179.log. 31649784 bytes skipped.
DEBUG 2019-02-15 21:28:22,785 [shard 1] commitlog - Checksum error in segment chunk at 4509704.
WARN  2019-02-15 21:28:22,788 [shard 1] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187165.log. 29044728 bytes skipped.
DEBUG 2019-02-15 21:28:22,795 [shard 0] commitlog - Checksum error in segment chunk at 5115912.
DEBUG 2019-02-15 21:28:22,796 [shard 1] commitlog - Checksum error in segment chunk at 1101832.
WARN  2019-02-15 21:28:22,797 [shard 0] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705178.log. 28438520 bytes skipped.
WARN  2019-02-15 21:28:22,798 [shard 1] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187163.log. 32452600 bytes skipped.
DEBUG 2019-02-15 21:28:22,800 [shard 1] commitlog - Checksum error in segment chunk at 1003528.
WARN  2019-02-15 21:28:22,802 [shard 1] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187162.log. 32550904 bytes skipped.
DEBUG 2019-02-15 21:28:22,806 [shard 0] commitlog - Checksum error in segment chunk at 4210696.
DEBUG 2019-02-15 21:28:22,807 [shard 1] commitlog - Checksum error in segment chunk at 2007048.
WARN  2019-02-15 21:28:22,808 [shard 0] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705177.log. 29343736 bytes skipped.
WARN  2019-02-15 21:28:22,808 [shard 1] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187161.log. 31547384 bytes skipped.
DEBUG 2019-02-15 21:28:22,810 [shard 1] commitlog - Checksum error in segment chunk at 704520.
DEBUG 2019-02-15 21:28:22,812 [shard 0] commitlog - Checksum error in segment chunk at 1503240.
WARN  2019-02-15 21:28:22,812 [shard 1] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398513187160.log. 32849912 bytes skipped.
WARN  2019-02-15 21:28:22,814 [shard 0] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-3705176.log. 32051192 bytes skipped.
DEBUG 2019-02-15 21:28:22,816 [shard 1] commitlog - Checksum error in segment chunk at 2105352.
DEBUG 2019-02-15 21:28:22,817 [shard 0] commitlog - Checksum error in segment chunk at 1605640.
<-- snip -->
@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 16, 2019

This issue has a very unpleasant side effect: it hits HH as well (that's how I found it at the first place). As a result, HH send path gets stuck since it's unable to send these corrupted segments out. I have a patch that addresses that and I'll send it on Monday.

Regardless this issue I think HH should still be ready to this kind of error because it may happen for a "good" reason file FS error or disk HW issue. Whatever the reason - this file is not recoverable and we may not rely on its contents anymore therefore it's safer to drop it.

@avikivity

This comment has been minimized.

Copy link
Contributor

@avikivity avikivity commented Feb 16, 2019

Checksum errors are expected after a crash. The tail of the commitlog was partially written, and the way we detect this partial record is via the checksum mismatch.

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 18, 2019

@avikivity I see exactly the same errors with HH segments that use the same (commitlog) framework but do not involve "crash".

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 19, 2019

@elcallio In order to make HH ready for these errors I need to patch it. Since the "smart skipping" logic is hidden inside the commitlog_replayer::recover() in order to keep it simple I can simply drop the whole HH segment if I detect a checksum error (it should not happen under normal conditions). However in order to keep the "discarded" metric consistent I need to increment it by the number of entries in the segment in question. Is there a way to read to amount of entries in the commitlog segment?

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 19, 2019

@elcallio From the comment in the code I see that we should not trust the remainder of the file as soon as we detected the checksum error and it makes sense. I guess I need to add another metric that would indicate that we have discarded a portion of the file.

@duarten You may need to take this new statistics into an account in the MV somehow.

@elcallio

This comment has been minimized.

Copy link
Contributor

@elcallio elcallio commented Feb 19, 2019

Note that since the adding of "recycled" elements you can potentially have files that end with a checksum error even in a clean case (though they should mostly hit the zero termination block at the end).
Also, while it is likely that once you hit a broken checksum, the rest of the file is junk, it is not always so. The reader should iterate everything that can be read.
There is no info on number of entries in a segment.
If you want to ensure that you never have any unreadable data except for actual broken files (crash), you might want to disable segment recycling in the commitlog options.

@avikivity

This comment has been minimized.

Copy link
Contributor

@avikivity avikivity commented Feb 19, 2019

@elcallio let's add a termination record or marker so we only see errors in case there really are errors.

@elcallio

This comment has been minimized.

Copy link
Contributor

@elcallio elcallio commented Feb 19, 2019

@avikivity

This comment has been minimized.

Copy link
Contributor

@avikivity avikivity commented Feb 19, 2019

According to @vladzcloudius this happens even without a "crash"

@elcallio

This comment has been minimized.

Copy link
Contributor

@elcallio elcallio commented Feb 19, 2019

It might be that he is picking up files to replay without having shut down the commitlog prior?

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 19, 2019

@elcallio The corresponding HH code didn't change for a long time: it's in manager::end_point_hints_manager::add_store().
So, what we do here is we create a new instance of a commitlog object which operates in its own directory and there is only one such object alive (for each directory) at any given time. Right after that we call get_segments_to_replay() and append them to the list of segments we are going to send.

If this new "recycling" feature may cause the commitlog to "touch" old segments then this indeed may explain what I see. Please, confirm that I'm getting it right.

I'll see how I can disable it.

@elcallio

This comment has been minimized.

Copy link
Contributor

@elcallio elcallio commented Feb 19, 2019

Is the dir that you read from always (unless crashed) from a previous commitlog instance, terminated by "shutdown()"? There are two places calling shutdown in the code, but I am not sure if it covers all terminations of a particular dir?

Yes, recycling will mean that we have files that are non-zero contents, but possibly not actually in play. However, the replay mechanism should just skip these, since they will not pass even initial parse (name does not match file header), and iff the file is actually in play, it will either be terminated gracefully, with a zero data page (triggering parse stop, same as an unused part of a segment), or possibly terminate in a chunk that does not pass checksum verification (i.e. benign failure).

You can turn recycling off by setting cfg.reuse_segments = false in config.

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 19, 2019

@elcallio Yes, a per-shard end_point_manager is going to use its own directory by periorically gracefully shutting down a commitlog instance and creating a new one. According to what you just said I should not see checksum errors in this case - but I do, which leaves me confused.

I'll check if disabling "recycling" helps and will update here.

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 19, 2019

@elcallio I see these errors even when I set reuse_segments to false.
I re-tried the original "pure" commitlog use case described in this issue with "commitlog" logger set to "trace" and this is what I found:

$ grep CommitLog-1-18014398840595760.log ~/.ccm/scylla-3/node1/logs/system.log
<-- snip -->
DEBUG 2019-02-19 16:24:56,896 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 3641344/5607424 bytes
DEBUG 2019-02-19 16:24:56,901 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 3772416/5607424 bytes
DEBUG 2019-02-19 16:24:56,904 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 3903488/5607424 bytes
DEBUG 2019-02-19 16:24:56,908 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 4034560/5607424 bytes
DEBUG 2019-02-19 16:24:56,912 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 4165632/5607424 bytes
DEBUG 2019-02-19 16:24:56,919 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 4296704/5607424 bytes
DEBUG 2019-02-19 16:24:56,920 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 4427776/5607424 bytes
DEBUG 2019-02-19 16:24:56,922 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 4558848/5607424 bytes
DEBUG 2019-02-19 16:24:56,924 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 4689920/5607424 bytes
DEBUG 2019-02-19 16:24:56,928 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 4820992/5607424 bytes
DEBUG 2019-02-19 16:24:56,934 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 4952064/5607424 bytes
DEBUG 2019-02-19 16:24:56,937 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 5083136/5607424 bytes
DEBUG 2019-02-19 16:24:56,938 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 5214208/5607424 bytes
DEBUG 2019-02-19 16:24:56,943 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 5345280/5607424 bytes
DEBUG 2019-02-19 16:24:56,946 [shard 1] commitlog - Partial write CommitLog-1-18014398840595760.log: 5476352/5607424 bytes
TRACE 2019-02-19 16:24:57,659 [shard 1] commitlog - Syncing CommitLog-1-18014398840595760.log 0 -> 31346688
TRACE 2019-02-19 16:24:57,681 [shard 1] commitlog - CommitLog-1-18014398840595760.log synced to 31346688
TRACE 2019-02-19 16:24:57,681 [shard 1] commitlog - CommitLog-1-18014398840595760.log is closed but not terminated.
TRACE 2019-02-19 16:24:57,681 [shard 1] commitlog - Terminating CommitLog-1-18014398840595760.log at pos 31350784
TRACE 2019-02-19 16:24:57,709 [shard 1] commitlog - CommitLog-1-18014398840595760.log synced to 31350784
INFO  2019-02-19 16:25:22,168 [shard 0] commitlog_replayer - Replaying /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-331113753.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595738.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595749.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595750.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595755.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-331113764.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595756.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-331113765.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595757.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-331113766.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-331113767.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-331113768.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595758.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-331113773.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-331113774.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595759.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-331113775.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595760.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595761.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-331113776.log, /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595762.log
WARN  2019-02-19 16:25:22,245 [shard 1] commitlog_replayer - Corrupted file: /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595760.log. 7815160 bytes skipped.
DEBUG 2019-02-19 16:25:25,261 [shard 0] commitlog - Recycling segment file /home/vladz/.ccm/scylla-3/node1/commitlogs/CommitLog-1-18014398840595760.log

So, despite the fact that this segment has been successfully terminated it's still corrupted. I think HH do not work due to the same problem.

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 20, 2019

@elcallio This is what we see with a tracing logging with HH:

$ grep HintsLog-1-18014398847994217.log  ~/.ccm/scylla-3/node1/logs/system.log
<-- snip -->
DEBUG 2019-02-19 20:18:46,238 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 4136960/5709824 bytes
DEBUG 2019-02-19 20:18:46,261 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 4268032/5709824 bytes
DEBUG 2019-02-19 20:18:46,296 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 4399104/5709824 bytes
DEBUG 2019-02-19 20:18:46,333 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 4530176/5709824 bytes
DEBUG 2019-02-19 20:18:46,364 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 4661248/5709824 bytes
DEBUG 2019-02-19 20:18:46,389 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 4792320/5709824 bytes
DEBUG 2019-02-19 20:18:46,418 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 4923392/5709824 bytes
DEBUG 2019-02-19 20:18:46,442 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 5054464/5709824 bytes
DEBUG 2019-02-19 20:18:46,479 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 5185536/5709824 bytes
DEBUG 2019-02-19 20:18:46,501 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 5316608/5709824 bytes
DEBUG 2019-02-19 20:18:46,527 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 5447680/5709824 bytes
DEBUG 2019-02-19 20:18:46,557 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 5578752/5709824 bytes
TRACE 2019-02-19 20:18:47,458 [shard 1] commitlog - Syncing HintsLog-1-18014398847994217.log 0 -> 31948800
TRACE 2019-02-19 20:18:47,470 [shard 1] commitlog - HintsLog-1-18014398847994217.log synced to 31948800
TRACE 2019-02-19 20:18:47,470 [shard 1] commitlog - HintsLog-1-18014398847994217.log is closed but not terminated.
TRACE 2019-02-19 20:18:47,470 [shard 1] commitlog - Terminating HintsLog-1-18014398847994217.log at pos 31952896
TRACE 2019-02-19 20:18:47,509 [shard 1] commitlog - HintsLog-1-18014398847994217.log synced to 31952896
TRACE 2019-02-19 20:18:51,004 [shard 1] commitlog - Syncing HintsLog-1-18014398847994217.log 31952896 -> 31952896
TRACE 2019-02-19 20:18:51,004 [shard 1] commitlog - HintsLog-1-18014398847994217.log already synced! (31952896 < 31952896)
DEBUG 2019-02-19 20:18:51,004 [shard 1] commitlog - Synced segment HintsLog-1-18014398847994217.log
TRACE 2019-02-19 20:18:51,199 [shard 1] commitlog - Syncing HintsLog-1-18014398847994217.log 31952896 -> 31952896
TRACE 2019-02-19 20:18:51,199 [shard 1] commitlog - HintsLog-1-18014398847994217.log already synced! (31952896 < 31952896)
DEBUG 2019-02-19 20:18:51,199 [shard 1] commitlog - Synced segment HintsLog-1-18014398847994217.log
DEBUG 2019-02-19 20:18:51,230 [shard 1] commitlog - Not safe to delete segment HintsLog-1-18014398847994217.log; dirty is bc3b27d0-348c-11e9-80a0-000000000000
TRACE 2019-02-19 20:18:51,775 [shard 1] commitlog - Syncing HintsLog-1-18014398847994217.log 32653312 -> 32653312
TRACE 2019-02-19 20:18:51,775 [shard 1] commitlog - HintsLog-1-18014398847994217.log already synced! (32653312 < 32653312)
DEBUG 2019-02-19 20:18:51,775 [shard 1] commitlog - Synced segment HintsLog-1-18014398847994217.log
TRACE 2019-02-19 20:18:51,787 [shard 1] commitlog - Syncing HintsLog-1-18014398847994217.log 32653312 -> 32653312
TRACE 2019-02-19 20:18:51,787 [shard 1] commitlog - HintsLog-1-18014398847994217.log already synced! (32653312 < 32653312)
DEBUG 2019-02-19 20:18:51,787 [shard 1] commitlog - Synced segment HintsLog-1-18014398847994217.log
DEBUG 2019-02-19 20:18:51,812 [shard 1] commitlog - Not safe to delete segment HintsLog-1-18014398847994217.log; dirty is bc3b27d0-348c-11e9-80a0-000000000000
ERROR 2019-02-19 20:19:03,152 [shard 1] hints_manager - sending of /home/vladz/.ccm/scylla-3/node1/hints/1/127.0.0.3/HintsLog-1-18014398847994217.log failed: Data corruption
ERROR 2019-02-19 20:19:03,152 [shard 1] hints_manager - going to discard /home/vladz/.ccm/scylla-3/node1/hints/1/127.0.0.3/HintsLog-1-18014398847994217.log
@elcallio

This comment has been minimized.

Copy link
Contributor

@elcallio elcallio commented Feb 20, 2019

What did you run this on? Why is there so many partial writes?
You might want to check the whole log (and possibly run it on smp=1), since the replayer will trace-write where (file positions) it finds broken checksums. This might help give some clue as to where the file breaks. The logs above look perfectly valid (sans all the partial writes) up until the stated corruption...

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 21, 2019

@elcallio It's on my laptop with NVMe driver + xfs:

/dev/nvme0n1p5 on /home type xfs (rw,relatime,attr2,inode64,noquota)

HEAD is at 497367f plus this patch:

diff --git a/db/hints/manager.cc b/db/hints/manager.cc
index 5ec502219..6aa0cb934 100644
--- a/db/hints/manager.cc
+++ b/db/hints/manager.cc
@@ -315,6 +315,7 @@ future<db::commitlog> manager::end_point_hints_manager::add_store() noexcept {
             cfg.commitlog_total_space_in_mb = resource_manager::max_hints_per_ep_size_mb;
             cfg.fname_prefix = manager::FILENAME_PREFIX;
             cfg.extensions = &_shard_manager.local_db().extensions();
+            cfg.reuse_segments = false;
 
             return commitlog::create_commitlog(std::move(cfg)).then([this] (commitlog l) {
                 // add_store() is triggered every time hint files are forcefully flushed to I/O (every hints_flush_period).
@@ -725,6 +726,10 @@ bool manager::end_point_hints_manager::sender::send_one_file(const sstring& fnam
         }, _last_not_complete_rp.pos, &_db.extensions()).get0();
 
         s->done().get();
+    } catch (db::commitlog::segment_data_corruption_error& e){
+        manager_logger.error("sending of {} failed: {}", fname, e.what());
+        manager_logger.error("going to discard {}", fname);
+        ctx_ptr->state.remove(send_state::segment_replay_failed);
     } catch (...) {
         manager_logger.trace("sending of {} failed: {}", fname, std::current_exception());
         ctx_ptr->state.set(send_state::segment_replay_failed);

Running with smp=1 as follows:

ccm start --jvm_arg="--memory" --jvm_arg="2G" --jvm_arg="--collectd-address" --jvm_arg="127.0.0.1:25826" --jvm_arg="--collectd" --jvm_arg="0" --jvm_arg="--default-log-level" --jvm_arg="info" --jvm_arg="--smp" --jvm_arg="1" --jvm_arg="--io-properties-file" --jvm_arg="/home/vladz/.ccm/scylla-1/node1/conf/io_properties.yaml" --jvm_arg="--developer-mode" --jvm_arg="false" --jvm_arg="--enable-sstables-mc-format" --jvm_arg="true" --jvm_arg="--hinted-handoff-enabled" --jvm_arg="true" --jvm_arg="--logger-log-level" --jvm_arg="commitlog=trace"  --jvm_arg="--logger-log-level" --jvm_arg="hints_manager=trace"

I get exactly the same result.
As for partial writes - I see a lot of them (it feels like there isn't non-partial writes).
And I see the same picture both for regular commitlogs and for hints.

Here I'm attaching the full log from node1. I patched the code a little to make logs a little more verbose and I also patches HH to drop broken segments.

system.zip

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 21, 2019

@elcallio You say that the above log is ok but there is something I don't understand: how come the file is terminated at the position before the the last one that has been written:

DEBUG 2019-02-19 20:18:46,557 [shard 1] commitlog - Partial write HintsLog-1-18014398847994217.log: 5578752/5709824 bytes <---- Last write
TRACE 2019-02-19 20:18:47,458 [shard 1] commitlog - Syncing HintsLog-1-18014398847994217.log 0 -> 31948800
TRACE 2019-02-19 20:18:47,470 [shard 1] commitlog - HintsLog-1-18014398847994217.log synced to 31948800
TRACE 2019-02-19 20:18:47,470 [shard 1] commitlog - HintsLog-1-18014398847994217.log is closed but not terminated.
TRACE 2019-02-19 20:18:47,470 [shard 1] commitlog - Terminating HintsLog-1-18014398847994217.log at pos 31952896 <---- Termination at the position way before 5578752. 

This looks very suspicious to me. At least according to logs this means that the whole tail from 31948801 till 5578752 has not been synced or even worse - possibly trimmed!

Please, comment.

@slivne

This comment has been minimized.

Copy link
Contributor

@slivne slivne commented Feb 21, 2019

@elcallio ping ..

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 21, 2019

@elcallio I think the problem may be in a fact that after a "commitlog: switch to fragmented buffers" a single operation in _pending_ops is not a single continuation.

The main difference this change introduces is that now it's almost always going to be a few continuations (and this use case is increasing this probability since mutations are big).

There are also a few "strangenesnesses" there like cycle_count being incremented for each fragment in a fragmented buffer.

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 21, 2019

Reverting the following patch makes the issue go away:

commit 4469f76e7c95041702de6a22f4260dcec03225fd
Author: Paweł Dziepak <pdziepak@scylladb.com>
Date:   Tue Aug 14 16:08:01 2018 +0100

    commitlog: switch to fragmented buffers

    So far commitlog was using contiguous buffers for storing the data that
    is about to be written to disk. It was able to coalesce small writes so
    that multiple small mutations would use the same buffer, but if a
    muation was large the commitlog would attempt to allocate a single,
    appropriately large buffer. This excessively stresses the memory
    allocator and may cause memory fragmentation to become an issue. The
    solution is to use fixed-size buffers of 128 kB, which is the standard
    buffer size in Scylla and keep large values fragmented.

@slivne @pdziepak FYI

@elcallio

This comment has been minimized.

Copy link
Contributor

@elcallio elcallio commented Feb 25, 2019

@elcallio

This comment has been minimized.

Copy link
Contributor

@elcallio elcallio commented Feb 25, 2019

Whether or not the continuation into run_with_ordered... is a single call or not should make no difference, since only the end result future counts.
Also, I take back what I said above. The write loop only writes on partial writes, so it does not actually say when it is done. Confusing, but...

Is there any reasonably small repro case for this?

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 25, 2019

@elcallio Please, read the opening message. The reproducer is about 61 seconds run - so, this is pretty short.
How to run it is up to you. You ultimately need a 3 nodes cluster. I used ccm which runs completely on your workstation. Alternatively you may create a real cluster: AWS, GCE, etc.

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 25, 2019

@elcallio If you want to concentrate on a classic commitlog path you don't really need 3 nodes - one would do just fine. If you want to work work with HH (which would require less instrumentation) you do need 3 nodes cluster.

@vladzcloudius

This comment has been minimized.

Copy link
Contributor Author

@vladzcloudius vladzcloudius commented Feb 27, 2019

@pdziepak @elcallio Any updates on this?

@glommer

This comment has been minimized.

Copy link
Contributor

@glommer glommer commented Feb 27, 2019

@pdziepak see above. This was broken by your fragmented buffers patch

@glommer glommer added the bug label Feb 27, 2019
@pdziepak

This comment has been minimized.

Copy link
Contributor

@pdziepak pdziepak commented Feb 28, 2019

The problem is in this part:

return _pending_ops.run_with_ordered_post_op(rp, [this, size, off, buf = std::move(buf)]() mutable { ///////////////////////////////////////////////////
auto view = fragmented_temporary_buffer::view(buf);
return do_with(off, view, [&] (uint64_t& off, fragmented_temporary_buffer::view& view) {
if (view.empty()) {
return make_ready_future<>();
}
return repeat([this, size, &off, &view] {
auto&& priority_class = service::get_local_commitlog_priority();
auto current = *view.begin();
return _file.dma_write(off, current.data(), current.size(), priority_class).then_wrapped([this, size, &off, &view](future<size_t>&& f) {

The commitlog code has two notions of the buffer size at that point, variable size which is the size on-disk (and is used to increment the file position) as well as buf.size_bytes() which is the size of the buffer in memory. Those needn't be the same. In other words, an assertion assert(size == view.size_bytes()) after line 724 would fail.

If on-disk and in-memory sizes are different, we will, incorrectly, write more than needed to the file. If we have more than one write happening at the same point and they are reordered it may happen that the too large part of the previous buffer will override the beginning of the next one, thus corrupting it.

Moreover, we probably should use vectored IO, but that'd be an enhancement, not a bug fix. There are some minor issue with logging, as well, (e.g. inconsistency in reporting offsets of chunks between the writer and the replayer) which make debugging more time-consuming than it should be.

I'll send patches soon.

avikivity pushed a commit that referenced this issue Mar 4, 2019
This commits adds some more information to the logs. Motivated, by
experiences with investigating #4231.

 * size of each write
 * position of each write
 * log message for final write
avikivity added a commit that referenced this issue Mar 4, 2019
"
This series contains minor improvements to commitlog log messages that
have helped investigating #4231, but are not specific to that bug.
"

* tag 'improve-commitlog-logs/v1' of https://github.com/pdziepak/scylla:
  commitlog: use consistent chunk offsets in logs
  commitlog: provide more information in logs
  commitlog: remove unnecessary comment
avikivity added a commit that referenced this issue Mar 4, 2019
"
This series fixes a problem in the commitlog cycle() function that
confused in-memory and on-disk size of chunks it wrote to disk. The
former was used to decide how much data needs to be actually written,
and the latter was used to compute the offset of the next chunk. If two
chunk writes happened concurrently one the one positioned earlier in
the file could corrupt the header of the next one.

Fixes #4231.

Tests: unit(dev), dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup,test_commitlog_replay_with_alter_table)
"

* tag 'fix-commitlog-cycle/v1' of https://github.com/pdziepak/scylla:
  commitlog: write the correct buffer size
  utils/fragmented_temporary_buffer_view: add remove suffix

(cherry picked from commit d95dec2)
tgrabiec added a commit that referenced this issue Apr 16, 2019
…ments" from Vlad

This series addresses two issues in the hinted handoff that should
complete fixing the infamous #4231.

In particular the second patch removes the requirement to manually
delete hints files after upgrading to 3.0.4.

Tested with manual unit testing.

* https://github.com/vladzcloudius/scylla.git hinted_handoff_drop_broken_segments-v3:
  hinted handoff: disable "reuse_segments"
  commitlog: introduce a segment_error
  hinted handoff: discard corrupted segments
tgrabiec added a commit that referenced this issue Sep 25, 2019
…ments" from Vlad

This series addresses two issues in the hinted handoff that should
complete fixing the infamous #4231.

In particular the second patch removes the requirement to manually
delete hints files after upgrading to 3.0.4.

Tested with manual unit testing.

* https://github.com/vladzcloudius/scylla.git hinted_handoff_drop_broken_segments-v3:
  hinted handoff: disable "reuse_segments"
  commitlog: introduce a segment_error
  hinted handoff: discard corrupted segments

(cherry picked from commit ac0d435)
avikivity added a commit that referenced this issue Sep 28, 2019
…ments" from Vlad

This series addresses two issues in the hinted handoff that should
complete fixing the infamous #4231.

In particular the second patch removes the requirement to manually
delete hints files after upgrading to 3.0.4.

Tested with manual unit testing.

* https://github.com/vladzcloudius/scylla.git hinted_handoff_drop_broken_segments-v3:
  hinted handoff: disable "reuse_segments"
  commitlog: introduce a segment_error
  hinted handoff: discard corrupted segments

(cherry picked from commit ac0d435)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.