read(write(record)) != record #57

Closed
PromyLOPh opened this issue Dec 4, 2018 · 1 comment · Fixed by #106

@PromyLOPh

Records read back from a file just written should be equal to the Python object written. This is something I discovered while writing tests for an application using warcio. Test case:

import pytest

from io import BytesIO
from tempfile import NamedTemporaryFile

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

def test_identity ():
    """ read(write(record)) should yield record """
    with NamedTemporaryFile () as fd:
        payload = b'foobar'
        writer = WARCWriter (fd, gzip=True)
        httpHeaders = StatusAndHeaders('GET / HTTP/1.1', {}, is_http_request=True)
        warcHeaders = {'Foo': 'Bar'}
        record = writer.create_warc_record ('http://example.com/', 'request',
                payload=BytesIO(payload),
                warc_headers_dict=warcHeaders, http_headers=httpHeaders)
        writer.write_record (record)

        fd.seek (0)
        rut = next (ArchiveIterator (fd))
        golden = record
        assert rut.rec_type == golden.rec_type
        assert rut.rec_headers == golden.rec_headers
        assert rut.content_type == golden.content_type
        assert rut.length == golden.length
        assert rut.http_headers == golden.http_headers
        assert rut.raw_stream.read() == payload

results in the following assertion failure:

E           AssertionError: assert StatusAndHead...ngth', '24')]) == StatusAndHeade...ngth', '24')])
E             Full diff:
E             - StatusAndHeaders(protocol = 'WARC/1.0', statusline = '', headers = [('Foo', 'Bar'), ('WARC-Type', 'request'), ('WARC-Record-ID', '<urn:uuid:2eb39603-c759-4865-9e5b-2a3cd9c81c92>'), ('WARC-Target-URI', 'http://example.com/'), ('WARC-Date', '2018-12-04T15:27:45Z'), ('WARC-Payload-Digest', 'sha1:RBB5P6JECYQR32PLXFR76THCQESZGKDY'), ('WARC-Block-Digest', 'sha1:HVUJ5SESVATOLVXZZTFORJY44V5BW7YB'), ('Content-Type', 'application/http; msgtype=request'), ('Content-Length', '24')])
E             ?                                         ^^^^^^^^^^   --------------
E             + StatusAndHeaders(protocol = '', statusline = 'WARC/1.0', headers = [('Foo', 'Bar'), ('WARC-Type', 'request'), ('WARC-Record-ID', '<urn:uuid:2eb39603-c759-4865-9e5b-2a3cd9c81c92>'), ('WARC-Target-URI', 'http://example.com/'), ('WARC-Date', '2018-12-04T15:27:45Z'), ('WARC-Payload-Digest', 'sha1:RBB5P6JECYQR32PLXFR76THCQESZGKDY'), ('WARC-Block-Digest', 'sha1:HVUJ5SESVATOLVXZZTFORJY44V5BW7YB'), ('Content-Type', 'application/http; msgtype=request'), ('Content-Length', '24')])
E             ?                            +++++++++++++++++             ^^^^^^^
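The diff above shows two header objects whose fields are identical except that 'WARC/1.0' sits in `protocol` on one side and in `statusline` on the other. The following is a minimal, stdlib-only sketch of how that can happen. `Headers`, `first_line` and `parse_first_line` are hypothetical stand-ins and not warcio's actual classes or parsing code, and the assumption that the in-memory record is the side holding the version string in `statusline` is inferred from the diff rather than confirmed by it.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed header block; NOT warcio's StatusAndHeaders.
Headers = namedtuple("Headers", ["protocol", "statusline", "headers"])


def first_line(h):
    """Join the non-empty parts of (protocol, statusline) into one first line."""
    return " ".join(part for part in (h.protocol, h.statusline) if part)


def parse_first_line(line):
    """Split a first line at the first space; a line without a space is all protocol."""
    protocol, _, statusline = line.partition(" ")
    return protocol, statusline


fields = [("WARC-Type", "request"), ("Content-Length", "24")]

# Assumption: the record built in memory carries the version string in the
# statusline slot and an empty protocol, matching one side of the diff.
created = Headers(protocol="", statusline="WARC/1.0", headers=fields)

# Rebuild an object by parsing the serialized first line, as a reader would.
protocol, statusline = parse_first_line(first_line(created))
read_back = Headers(protocol=protocol, statusline=statusline, headers=fields)

# Both objects serialize to the same first line...
print(first_line(created) == first_line(read_back))
# ...but they do not compare equal field by field.
print(created == read_back)
```

In this sketch the serialized form cannot distinguish the two objects, so a round-trip test only passes if the in-memory record is constructed with the version string in the same field the parser would put it in.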
ikreymer added a commit that referenced this issue Feb 20, 2020
- Fix issue #104 where utf-8 canonicalization caused record to be written with incorrect content-length
- Fix issue #57 where protocol is set as statusline incorrectly
- fill payload_length when reading warc records
- always support tell() to LimitReader and BufferedReader()
ikreymer added a commit that referenced this issue Mar 1, 2020
* Fixes related to reading record and writing same record back out:
- Fix issue #104 where utf-8 canonicalization caused record to be written with incorrect content-length
- Fix issue #57 where protocol is set as statusline incorrectly
- fill payload_length when reading warc records
- always support tell() to LimitReader and BufferedReader()
- bump version to 1.7.2

* fix tests for py27, add py38
@ikreymer
Member

ikreymer commented Mar 1, 2020

This should now be fixed. I added a unit test based on your example above:
https://github.com/webrecorder/warcio/blob/develop/test/test_writer.py#L826

Released in 1.7.2
