Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Constant crashes on RPi4 when syncing #6116

Closed
canton7 opened this issue Oct 31, 2019 · 16 comments

Comments

@canton7
Copy link
Member

@canton7 canton7 commented Oct 31, 2019

Version: syncthing v1.3.0 "Fermium Flea" (go1.13.1 linux-arm) builduser@leming 2019-10-19 17:26:46 UTC
Running on a Raspberry Pi 4.

This device has been seemingly happily churning away, until now. Connected a new computer, started syncing many gigs of data. Device started crashing (and being restarted by systemd) every few minutes. One of the crash logs is attached.

Looks like it also managed to corrupt a bunch of files which were synced to it (some time ago - I'm afraid I don't have records or logs on when), and then sync those corrupted versions back to the source device. I don't know whether it was experiencing the same crash at the time, or whether that's a symptom of something else, such as a faulty disk.

I'm afraid I'm not well-versed enough in Go crash logs to say whether this is a duplicate - please close if so.

Crash Log

@AudriusButkevicius

This comment has been minimized.

Copy link
Member

@AudriusButkevicius AudriusButkevicius commented Oct 31, 2019

This does not have the panic message, so not very useful.

@AudriusButkevicius

This comment has been minimized.

Copy link
Member

@AudriusButkevicius AudriusButkevicius commented Oct 31, 2019

Closing, if you do get the panic message, let's start off on the forum before we move to the issue tracker.

@canton7

This comment has been minimized.

Copy link
Member Author

@canton7 canton7 commented Oct 31, 2019

That's everything that I had - the messages either side were the last normal log message, and systemd saying it had stopped. Can you give an example if what I should search for?

I asked whether I should post to the forum but calmh said I should open on here, so I did.

I should also correct something: the device was corrupting files as it was crashing, and syncing those corruptions to other devices.

@imsodin

This comment has been minimized.

Copy link
Member

@imsodin imsodin commented Oct 31, 2019

It looks like the beginning of the panic message was cut off. A typical panic looks like this: https://crash.syncthing.net/report/4f021c9649efe1a281821df949e933b1d72aebf41a106c0652eb9e5ae25ac92b

The important part is the beginning with the panic reason followed by the goroutine that caused it, e.g.

Panic at 2019-10-31T16:57:52-04:00
panic: handling index for nonexistent folder

goroutine 151509 [running]:
github.com/syncthing/syncthing/lib/model.(*model).handleIndex(0x2bc4000, 0xd0a9fc8f, 0xa4b6e523, 0x34c77e4, 0x6ca02b70, 0x57c9cb57, 0x7479cc1e, 0x280e3ba7, 0x41135365, 0x7c1f590, ...)
	/opt/tcagent/syncthing-1-work/174e136266f8a219/lib/model/model.go:1001 +0xc78
github.com/syncthing/syncthing/lib/model.(*model).Index(0x2bc4000, 0xd0a9fc8f, 0xa4b6e523, 0x34c77e4, 0x6ca02b70, 0x57c9cb57, 0x7479cc1e, 0x280e3ba7, 0x41135365, 0x7c1f590, ...)
	/opt/tcagent/syncthing-1-work/174e136266f8a219/lib/model/model.go:969 +0x68
github.com/syncthing/syncthing/lib/protocol.(*rawConnection).handleIndex(0x8affa70, 0x7c1f590, 0x5, 0xcbdc000, 0x3e8, 0x400)
	/opt/tcagent/syncthing-1-work/174e136266f8a219/lib/protocol/protocol.go:564 +0x18c
github.com/syncthing/syncthing/lib/protocol.(*rawConnection).dispatcherLoop(0x8affa70, 0x0, 0x0)
	/opt/tcagent/syncthing-1-work/174e136266f8a219/lib/protocol/protocol.go:420 +0x958
github.com/syncthing/syncthing/lib/protocol.(*rawConnection).Start.func1(0x8affa70)
	/opt/tcagent/syncthing-1-work/174e136266f8a219/lib/protocol/protocol.go:257 +0x1c
created by github.com/syncthing/syncthing/lib/protocol.(*rawConnection).Start
	/opt/tcagent/syncthing-1-work/174e136266f8a219/lib/protocol/protocol.go:256 +0x48
@canton7

This comment has been minimized.

Copy link
Member Author

@canton7 canton7 commented Oct 31, 2019

This is the closest I got (copied from the log I posted):

stack: frame={sp:0x66b4d968, fp:0x0} stack=[0x6634e214,0x66b4de14)
runtime: unknown pc 0x76dfe104
goroutine 0 [idle]:
PC=0x76dfe104 m=2 sigcode=4294967290
SIGABRT: abort
runtime/cgo: pthread_create failed: Resource temporarily unavailable

Looks like a runtime problem, rather than a panic?

(it looks like the newest entries are on top, oldest at the bottom. My bad)

@AudriusButkevicius

This comment has been minimized.

Copy link
Member

@AudriusButkevicius AudriusButkevicius commented Oct 31, 2019

Looks like "large" db config fallout, or running out of fd's/threads due to ulimit.

@AudriusButkevicius

This comment has been minimized.

Copy link
Member

@AudriusButkevicius AudriusButkevicius commented Oct 31, 2019

I guess corruption is not ideal. Do we know how it was corrupting them?

@canton7

This comment has been minimized.

Copy link
Member Author

@canton7 canton7 commented Oct 31, 2019

I'll dig into the corrupted files a bit further tomorrow. From first observations, jpegs were no longer recognised as valid images, and videos got added artifacts, freezing for short periods, audio dropped out. I had a quick look inside one of the images and there wasn't anything obvious - no blocks of repeated bytes, etc.

@canton7

This comment has been minimized.

Copy link
Member Author

@canton7 canton7 commented Nov 2, 2019

Looking at the corruption:

  1. Sometimes the corrupted file starts with the same content as the original, sometimes not.
  2. Once the corrupted file starts to differ from the original, they don't appear to have the same content at the same position after that (although there is a corrupted video which picks up again after some stuttering, which implies that the corrupted files can contain content from the original file, but at a different position).
  3. The corrupted files appear to be exactly the same lengths as the originals.
  4. The corrupted files sometimes have long sequences of 0x00 / 0xFF / 0xAA, but more often than not they contain what appears to be genuine data - just not data from the original file at that position.
  5. Some of the corrupted files have data which appears to come from other files in the same (Syncthing) folder (but a different directory), sometimes repeated multiple times in different places.
  6. The files on the device that was crashing were corrupted.

I've got some examples I can share, but I'd rather not do so publicly. Let me know if you want them, and how I should get them to you.

@calmh

This comment has been minimized.

Copy link
Member

@calmh calmh commented Nov 2, 2019

Are the breaks between good and bad data on some kind of discernible offsets (4k, 1M, etc)?

@canton7

This comment has been minimized.

Copy link
Member Author

@canton7 canton7 commented Nov 2, 2019

Good point. Out of 6 sample files, these are the offsets at which content first changes:

0000 5000
0000 0000
0000 0000
0001 E000
0001 E000
0017 8000

For the cases where the corrupted files seem to contain data from another file in the same Syncthing folder, this can appear at any offset.

@calmh

This comment has been minimized.

Copy link
Member

@calmh calmh commented Nov 2, 2019

0000 5000 = 20480 = 5 * 4096
0000 0000 = 0
0001 E000 = 122880 = 15 * 8192
0017 8000 = 1540096 = 47 * 32768

That's the largest power-of-two multipler for each. It's not proof in either way, but if this was a Syncthing bug I would expect breaks at our block offsets, 128KiB or larger. The smallest common multiplier is 4k which is a common disk cluster size. My spider sense wants to consider this hardware failure, either RAM or disk...

@canton7

This comment has been minimized.

Copy link
Member Author

@canton7 canton7 commented Nov 2, 2019

All of those files are less than 500kb, except the one which has offset 0017 8000, which is about 5mb.

The disk seems healthy (at least according to fsck and s.m.a.r.t.) and the RAM's pretty new. Maybe the database didn't like Syncthing crashing out every few minutes?

Currently trying to reproduce the crashes (although it'll take a while, as I need to get that device back in sync without the corrupted files first).

@canton7

This comment has been minimized.

Copy link
Member Author

@canton7 canton7 commented Nov 2, 2019

The crashing appears to be a problem with the Arch Linux Arm build (https://archlinuxarm.org/packages/arm/syncthing/files/PKGBUILD) -- I can't reproduce it with an official build.

Not sure where the corruption was coming from, but can't reproduce that without the crashes.

Happy to close this.

@AudriusButkevicius

This comment has been minimized.

Copy link
Member

@AudriusButkevicius AudriusButkevicius commented Nov 2, 2019

Looks like it's built with gccgo and has funky patches:

 # go-pie with -race segfaults
  cd "${srcdir}/src/github.com/syncthing/${pkgbase}"
  patch -Np1 -i "${srcdir}/gcc-pie-segfaults-race.patch"
@canton7

This comment has been minimized.

Copy link
Member Author

@canton7 canton7 commented Nov 2, 2019

That patch just removes the race detection (https://archlinuxarm.org/packages/arm/syncthing/files/gcc-pie-segfaults-race.patch) -- nothing too funky. Looks like gccgo / PIE could be suspect, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.