Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic error when running siegfried #172

Closed
hmiguim opened this issue Dec 7, 2021 · 9 comments
Closed

Panic error when running siegfried #172

hmiguim opened this issue Dec 7, 2021 · 9 comments
Assignees
Labels

Comments

@hmiguim
Copy link
Contributor

hmiguim commented Dec 7, 2021

Siegfried version: 1.9.1 in server mode

siegfried-stderr.log:

2021/12/03 22:12:13 Starting server at localhost:5138. Use CTRL-C to quit.
panic: sync: negative WaitGroup counter

goroutine 401476 [running]:
sync.(*WaitGroup).Add(0xc0069c8690, 0xffffffffffffffff)
        /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
        /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.identifyFile.func1(0xc005ec6500, 0xc00033f4a0, 0xc005d42740)
        /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:205 +0x75
created by main.identifyFile
        /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:202 +0xe5
@richardlehane
Copy link
Owner

thanks for reporting this Miguel, I'll investigate

@luis100
Copy link

luis100 commented Dec 20, 2021

@richardlehane any updates on this? Currently, to workaround this issue, we've downgraded Siegfried to 1.7.13 where the issue doesn't seem to affect us.

@ross-spencer
Copy link
Collaborator

ross-spencer commented Dec 21, 2021

hi @luis100 @hmiguim I am wondering if you have any more information about what's triggering the error? I have reinstalled 1.9.1 on Ubuntu 21.04, Go 1.16.4 and I am able to return through GET and POST requests on the command line and through the web UI.

Before running I ran sf -update to make sure the signature file was up to date as well.

Example GET:

  • http://localhost:5138/identify/%2Ftmp%2Ftest?format=json

Example POST:

  • curl "http://localhost:5138/identify?format=json&hash=crc" -F file=@/tmp/test

Perhaps your command, operating environment, or build are having an impact here?

NB. It looks like the error begins here. We can see what happens when we step through it with more info.

NB++. I have a windows installation available, running 1.9.1 from the releases, I can also run the following in the browser (I haven't access to cURL on this machine to try that):

  • C:\Users\RS>sf -serve localhost:5381
  • http://127.0.0.1:5381/identify/C%3A%5Ctemp%5Cgifs%5C1880412.png?format=json

@luis100
Copy link

luis100 commented Dec 30, 2021

We are using a docker image based on Debian where we install siegfried and use it in server mode.

docker run -it --rm keeps/roda:v4.1.1 bash
# lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

# sf -version
siegfried 1.9.1
/usr/share/siegfried/default.sig (2020-10-06T19:13:40+02:00)
identifiers: 
  - pronom: DROID_SignatureFile_V97.xml; container-signature-20201001.xml
config: 
  - multi: 32

# supervisorctl status siegfried
siegfried                        RUNNING   pid 696055, uptime 0:41:37

# ps aux | grep sf
root      696055  8.7  1.1 1886728 187856 ?      Sl   14:03   3:39 sf -serve localhost:5138

The issue only occurs when we run many threads against siegfried server using RODA Web application. I tried to reproduce the issue using command-line tools, but could not make it fail in the same way. We use base64 encoding of path, but I tried both just in case.

# apt-get update; apt-get install -y parallel libany-uri-escape-perl
# find / -type f -not -path '/proc/*' -not -path '/sys/*' -not -empty | tee /tmp/filelist | wc -l
find: ‘/proc/10/map_files’: Permission denied
19195


# parallel --progress --linebuffer 'wget -q  --server-response -O /dev/null "http://localhost:5138/identify/{= use URI::Escape; $_=uri_escape($_) =}?base64=false&format=json" 2>&1 | awk "/^  HTTP/{print $2}" ' :::: /tmp/filelist | sort | uniq -c

Computers / CPU cores / Max jobs to run
1:local / 8 / 8

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/19195/100%/0.0s 
  19195   HTTP/1.1 200 OK


# cat /tmp/filelist | parallel --progress --linebuffer 'echo "http://localhost:5138/identify/{= use MIME::Base64; $_=encode_base64($_, ""); =}?base64=true&format=json" | sed s/\\\\//g | xargs -I% wget -q  --server-response -O /dev/null "%" 2>&1 | awk "/^  HTTP/{print $2}"' | sort | uniq -c

Computers / CPU cores / Max jobs to run
1:local / 8 / 8

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/19195/100%/0.0s 
  19195   HTTP/1.1 200 OK


# cat /var/log/supervisor/siegfried-stderr-*.log
No relevant errors

But the error occurs consistently in production servers when using siegfried 1.9.1. There is also some correlation between panics and a higher amount of scans in empty files, but I am not sure that is relevant.

# grep panic -A 10  /var/log/supervisor/siegfried-stderr---supervisor-70990jtu.log 
panic: sync: negative WaitGroup counter

goroutine 20 [running]:
sync.(*WaitGroup).Add(0xc005f55ac0, 0xffffffffffffffff)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00088d440, 0xc00010f7c0)
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter

goroutine 5 [running]:
sync.(*WaitGroup).Add(0xc0058859b0, 0xffffffffffffffff)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00080d860, 0xc0001017c0)
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter

goroutine 5 [running]:
sync.(*WaitGroup).Add(0xc005f7f970, 0xffffffffffffffff)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00018b4a0, 0xc0001e0f00)
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter

goroutine 8 [running]:
sync.(*WaitGroup).Add(0xc0063ff600, 0xffffffffffffffff)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc000830c00, 0xc000101770)
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb

# cat /var/log/supervisor/siegfried-stderr-*.log | grep "empty source" | wc -l
1065

We will continue to try to reproduce the issue in a way we can provide it to you.
Thank you

@luis100
Copy link

luis100 commented Dec 30, 2021

Expanding testing into a production server using a much larger ammount of files, errors start to creep in.
There are some other failures in the log, all complaining about empty files, and there are some 404 errors that come from files with a character '~' on them, not sure why, but many failures are due to panic errors. On a panic error, the service shuts down, but supervisord restarts it.

The panic errors come in pairs, or tupples, and seems to be this is a unprobable race condition.

# find /roda/data/storage/aip/ -type f | parallel --progress --linebuffer 'wget -q  --server-response -O /dev/null "http://localhost:5138/identify/{= use URI::Escape; $_=uri_escape($_) =}?base64=false&format=json" 2>&1 | awk "/^  HTTP/{print $2}"' | sort | uniq -c

Computers / CPU cores / Max jobs to run
1:local / 4 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/704393/100%/0.0s 
 702492   HTTP/1.1 200 OK
   1901   HTTP/1.1 404 Not Found

# grep panic   /var/log/supervisor/siegfried-stderr---supervisor-70990jtu.log | wc -l
35

# grep panic -A 10  /var/log/supervisor/siegfried-stderr---supervisor-70990jtu.log
panic: sync: negative WaitGroup counter

goroutine 20 [running]:
sync.(*WaitGroup).Add(0xc005f55ac0, 0xffffffffffffffff)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00088d440, 0xc00010f7c0)
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter

goroutine 5 [running]:
sync.(*WaitGroup).Add(0xc0058859b0, 0xffffffffffffffff)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
	/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00080d860, 0xc0001017c0)
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
	/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--

@richardlehane richardlehane self-assigned this Jan 17, 2022
@richardlehane
Copy link
Owner

Sorry for my delay in resolving this, it is quite a tricky one!

I have made some progress:

  • I can reliably reproduce this locally now with this client program: https://github.com/richardlehane/stresssf
  • I haven't been able to find the race condition yet but do have an idea for a general change to the server that may fix it (currently there is a universal printer goroutine that all output is sent through, with http requests it may be safer assign each request its own printer in case of dropped requests etc). The stresssf program times each run so I can also see impact of this change on performance.

Hopefully good news soon :)

@richardlehane
Copy link
Owner

I've found a quick fix for you: rather than run with -multi 32, run your server with -multi 1. I.e. sf -multi 1 -serve localhost:8080
You can make this the default with sf -multi 1 -setconf.
I found in this mode I wasn't able to crash the server with my script, even with a large corpus size.

Changing this setting should only impact performance if you are giving siegfried directories to scan. If each of your http requests sends a single filename then it shouldn't make any difference to the speed & you get concurrency anyway because you can be sending multiple requests at a time.

@luis100
Copy link

luis100 commented Jan 31, 2022

Thank you for your update, we will try the quick fix for now, let us know when a there's a final solution to the issue.

@richardlehane
Copy link
Owner

I found and fixed a race condition in the code over the weekend that in my testing has resolved this bug. Please re-open if you see it again.

You can access the fix in the latest release (v1.9.2). I'll be updating the debian package repository this evening. You download the new release here: https://github.com/richardlehane/siegfried/releases/tag/v1.9.2

Thanks again for taking the time to report this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants