
warcio.bufferedreaders.BufferedReader.readline can get stuck in an infinite loop #122

Open · ThomasA opened this issue Dec 3, 2020 · 12 comments

ThomasA commented Dec 3, 2020

I am currently processing a large-ish (on the order of 600 GB) batch of WARC files containing a number of dumped homepages.
I am sifting through all of these files for image content, which I then extract and process further. I use a warcio.ArchiveIterator to iterate through all of the records in all of the files. Once in a while, I come across WARC files that seem to cause the warcio.ArchiveIterator to get stuck in an infinite loop.
I suspect it happens in this loop in warcio.bufferedreaders.BufferedReader.readline():

while not linebuff.endswith(b'\n'):

This is where I catch it hanging every time I encounter this problem and stop it with the debugger.
I found this problem in warcio 1.7.4, but I cannot identify this version in your repo, as the releases do not seem to be tagged.
I suspect this could be related to #121. My speculation is that #121 may occur when the declared content length is too short compared to the actual content. Conversely, if the declared content length is too long compared to the actual content, reading the record might miss the newline character that readline() is looking for, causing the loop to never terminate.
I notice that when this occurs,

line = self.reader.readline()

has called readline() without a length argument. Adding a sensible length argument here would probably be a good "emergency exit" mechanism?
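
Something along these lines is what I have in mind (a simplified sketch, not warcio's actual implementation; bounded_readline, read and max_length are placeholder names):

# Read from read(size) until b'\n', giving up when the stream stops producing
# data or when a maximum number of bytes has been accumulated.
def bounded_readline(read, length=None, max_length=2 ** 20):
    limit = length if length is not None else max_length
    linebuff = b''
    while not linebuff.endswith(b'\n'):
        if len(linebuff) >= limit:
            break  # cap reached: abort instead of spinning forever
        chunk = read(min(4096, limit - len(linebuff)))
        if not chunk:
            break  # nothing more to read: give up
        linebuff += chunk
    return linebuff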

wumpus (Collaborator) commented Dec 3, 2020

I see an endless-loop possibility involving a truncated file. Are you reading the last record when the problem happens?


ThomasA commented Dec 3, 2020

I don't know, actually. I will have to investigate.
Does warcio have a way to tell up front how many records a WARC file contains? I have only used the iterator so far.

ikreymer (Member) commented Dec 5, 2020

Do you know if the WARC records are gzipped or not? If they are (correctly) gzipped per record, I think the content-length should matter less, as reading would hit the gzip boundary and end.
If they're not gzipped, invalid lengths are trickier. Yes, perhaps adding a limit to readline() makes sense; if it reaches that limit, it should just abort reading, as it won't be able to recover. For a too-small content-length, the goal was to at least be able to continue by finding the next record boundary.
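
One quick way to check whether they are gzipped per record is to look for the gzip magic bytes at a record offset, roughly like this (the path and offset are placeholders):

# Per-record gzip members start with the gzip magic bytes 0x1f 0x8b.
def looks_gzipped(path, offset=0):
    with open(path, 'rb') as f:
        f.seek(offset)
        return f.read(2) == b'\x1f\x8b'

print(looks_gzipped('my.warc.gz', 0))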

Can you run the WARCs through warcio index ..., so you can see the last valid record offset that it prints out?
That should help find the record that's causing issues.
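
If the CLI is awkward on files this large, a rough stand-in is to print each record's offset and length yourself, so the last line printed before the hang points near the problematic record (the path is a placeholder; the offset/length helpers are the same ones shown in the transcript further down):

from warcio.archiveiterator import ArchiveIterator

with open('my.warc', 'rb') as stream:
    arch_iter = ArchiveIterator(stream)
    for record in arch_iter:
        print(arch_iter.get_record_offset(),
              arch_iter.get_record_length(),
              record.rec_headers.get_header('WARC-Type'),
              record.rec_headers.get_header('WARC-Target-URI'))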

wumpus (Collaborator) commented Dec 30, 2020

I'm happiest to debug this if you send us the warc with the problem. I could explore the possible bug I found, but really, solving your actual bug is probably the best path forward. Maybe it's the one I guessed, maybe it isn't.


ThomasA commented Jan 8, 2021

I had to put this project on the back burner over Christmas and New Year. Now I can hopefully get back to working on it again.
@ikreymer I do not know if the records are gzipped. I assume that is transparent when reading them with an ArchiveIterator; at least I have not noticed. I can investigate...
I was getting started investigating via warcio index in December and will have to get back to it now.
@wumpus I would love to send the WARC, but I am not allowed to, since my files archive web sites that may have contained accidentally leaked sensitive information at the time of ingestion. Furthermore, many of these files are many gigabytes in size.
I will get back to you once I hopefully get permission to take the laptop with the data on it to my home office so I can work on it. We are quite locked down here (again) for the time being.


ThomasA commented Jan 8, 2021

I have access to my data again now, and I can see that both my script, which uses an ArchiveIterator to iterate through the WARC looking for images, and warcio index get stuck at approximately the same record in the WARC.
Now that I know from warcio index approximately which offset this happens at, I am going to try tomorrow to experiment manually with reading records around there using warcio.extractor.Extractor. Since I can see that Extractor also uses ArchiveIterator to read the record, I expect it to fail as well once I find the problematic record, and then I hopefully have a better chance of debugging it from there.
If I can narrow down the problematic part of the WARC, I can try sending you that part as a smaller WARC.


ThomasA commented Jan 8, 2021

@wumpus so far I can at least say that this does not seem to happen at the last record in the file. The file is approximately 53 GB, and warcio index hangs after a record at offset 32624067591, which appears to be record number 546990 in that file.


ThomasA commented Jan 11, 2021

I have identified the problematic record in my WARC now. I can open an ArchiveIterator on the file positioned at the preceding record and get the next item:

In[76]: file.seek(32624067591)
Out[76]: 32624067591

In[77]: arch_iter = warcio.ArchiveIterator(file)

In[78]: record = next(arch_iter)

In[79]: arch_iter.get_record_length()
Out[79]: 242221

In[80]: arch_iter.get_record_offset()
Out[80]: 32624067591

In[81]: record.rec_headers
Out[81]: StatusAndHeaders(protocol = 'WARC/1.0', statusline = '', headers = [('WARC-Type', 'response'), ('WARC-Target-URI', 'http://ft.arkena.tv/flash/ftplayer.swf?xml=http%3A%2F%2Fft.arkena.tv%2Fxml%2Fcore_player_clip_data_v2_REAL.php%3Fxid%3D1614%26wtve%3D298%26wtvl%3D2%26as%3D0%26pi%3D0%26as%3D0%26i%3D2010-11-30T19%253A12%253A11%26o%3D2010-11-30T19%253A13%253A45&progress_callback=secondPassedEvent&playstate_callback=playStateEvent'), ('WARC-Date', '2013-10-11T10:20:16Z'), ('WARC-Payload-Digest', 'sha1:BHAX6KZ3QPRGWHYHMZGX27OEGWDBD2DC'), ('WARC-IP-Address', '87.48.153.91'), ('WARC-Record-ID', '<urn:uuid:ee46c606-16a3-441c-8fdb-92c19dc42f8f>'), ('Content-Type', 'application/http; msgtype=response'), ('Content-Length', '241594')])

In[82]: record.format
Out[82]: 'warc'

In[83]: record.length
Out[83]: 241594

In[84]: record.payload_length
Out[84]: 241312

So far, so good. I can also read another item from the iterator:

In[89]: record = next(arch_iter)

In[90]: record.rec_headers
Out[90]: StatusAndHeaders(protocol = 'WARC/1.0', statusline = '', headers = [('WARC-Type', 'response'), ('WARC-Target-URI', 'http://www.ft.dk/samling/20101/lovforslag/l67/beh1/forhandling.htm?startItem=66&showscroll=0'), ('WARC-Date', '2014-02-05T11:55:30Z'), ('WARC-Payload-Digest', 'sha1:7GKTP5T5STNM4WKOQ5S5RKIAAHY2TCRV'), ('WARC-IP-Address', '194.255.119.21'), ('WARC-Record-ID', '<urn:uuid:b2e58be0-8555-4d79-9fe5-2d8dd394d1ad>'), ('Content-Type', 'application/http; msgtype=response'), ('Content-Length', '40575')])

In[91]: record.format
Out[91]: 'warc'

In[92]: record.length
Out[92]: 40575

In[93]: record.payload_length
Out[93]: 40220

This is where it starts acting up. If I attempt to call arch_iter.get_record_length() or arch_iter.get_record_offset() now, it hangs in the described infinite loop. The same happens if I attempt next(arch_iter) once more.
Can you glean anything useful from the above quoted lengths or offsets?
Do you have any suggestions on how to debug this further? My idea is that, after reading the first item above, I would like to trick the ArchiveIterator into reading somewhat less than it apparently tries to read for the next item, to see whether that makes the iterator successfully stumble on the ending of the record that currently seems to be missing.
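
One thing I can also try is to peek at the raw bytes where the next record ought to begin, bypassing the ArchiveIterator entirely, for example like this (the path is a placeholder; the offset of the second record should be the first record's offset plus its record length):

# Print the first bytes at a given offset and check whether it is all zeros.
def peek_raw(path, offset, length=4096):
    with open(path, 'rb') as f:
        f.seek(offset)
        data = f.read(length)
    print(repr(data[:200]))
    print('all zero bytes:', data.count(0) == len(data))

peek_raw('my.warc', 32624067591 + 242221)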

wumpus (Collaborator) commented Jan 11, 2021

Can you prepare and attach a cut-down warc that starts at 32624067591 and is long enough to have 2 complete records plus a fair bit of the 3rd? Compressed the same as the original file. So I'd guess 242,221 + 241,594 bytes (the sum of the uncompressed record lengths) would be enough. You can then review the webpages inside to make sure they aren't sensitive.
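
Cutting that range out should only take a few lines, something like this (the file names and the extra margin at the end are placeholders):

# Copy a byte range out of the big WARC into a smaller file for sharing.
def cut_slice(src_path, dst_path, offset, size):
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        src.seek(offset)
        dst.write(src.read(size))

cut_slice('full.warc', 'cutdown.warc', 32624067591, 242221 + 241594 + 65536)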


ThomasA commented Jan 11, 2021

I am trying to do that now.
What is the difference between arch_iter.get_record_length() and record.length above? I assume they refer to the same record, but that arch_iter.get_record_length() includes some metadata overhead of the WARC format that is not visible at record.length's level? But based on your suggestion to sum the two lengths, I must be wrong.


ThomasA commented Jan 11, 2021

I have now read the problematic record directly from the file object. It is clearly a truncated file. I don't know what has happened here, but from somewhere within the record, the remaining contents are simply zeros. I have browsed manually through the file, and this seems to go on, gigabyte after gigabyte. I have stored the contents in this gist; it appears to be a page about a bill being discussed in the Danish parliament.
So clearly this is not warcio's fault, but since this can happen, could it be an idea to add an emergency brake in the loop, as I suggested in the original post?

wumpus (Collaborator) commented Jan 13, 2021

This is not an unusual situation in the face of a hardware or software bug. Either the file ends in a hole, which means the filesystem metadata is corrupt, or your bits were written elsewhere on the disk, or they were not written anywhere. We ought to be able to trigger it by simply overwriting the end of a test WARC with nulls; we have plenty of such files in the test data already. I think there are six interesting cases: the problem can start in the payload, in the HTTP headers, or in the WARC headers, each with and without WARC-record-level compression. And I suppose all nulls vs. random bytes doubles that, so that's 12 cases.
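
A quick sketch of the simplest of those cases, assuming a small known-good test WARC is at hand (the file names and the cut point are placeholders):

import os
import shutil

from warcio.archiveiterator import ArchiveIterator

# Copy a known-good WARC and overwrite its tail with null bytes.
def make_nulled_copy(src, dst, keep_bytes):
    shutil.copyfile(src, dst)
    size = os.path.getsize(dst)
    with open(dst, 'r+b') as f:
        f.seek(keep_bytes)
        f.write(b'\x00' * (size - keep_bytes))

make_nulled_copy('example.warc', 'nulled.warc', keep_bytes=100_000)

with open('nulled.warc', 'rb') as stream:
    for record in ArchiveIterator(stream):  # may hang here without a readline cap
        print(record.rec_headers.get_header('WARC-Record-ID'))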

This is related to another recently filed issue.
