Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

latest version 0.9.0 does not allow compression of empty string #27

Closed
bmoscon opened this issue May 12, 2017 · 8 comments
Closed

latest version 0.9.0 does not allow compression of empty string #27

bmoscon opened this issue May 12, 2017 · 8 comments

Comments

@bmoscon
Copy link

bmoscon commented May 12, 2017

Previous versions of python-lz4 allowed compression of the empty string:

In [1]: lz4.compress('')
Out[1]: b'\x00\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x88/\xc9\x0b'

the latest version fails:

In [2]: lz4.compress('')
lz4/__init__.py:18: DeprecationWarning: Call to deprecated function or method compress (use lz4.block.compress instead).
  def compress(*args, **kwargs):
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last) in <module>()
----> 1 lz4.compress('')

lz4/deprecated.py in new_func(*args, **kwargs)
     37             warnings.warn_explicit(msg, category=DeprecationWarning, filename=filename, lineno=lineno)
     38             warnings.simplefilter('default', DeprecationWarning)  # reset filter
---> 39             return cls_or_func(*args, **kwargs)
     40 
     41         return new_func

lz4/__init__.py in compress(*args, **kwargs)
     17 @deprecated('use lz4.block.compress instead')
     18 def compress(*args, **kwargs):
---> 19     return lz4.block.compress(*args, **kwargs)
     20 
     21 @deprecated('use lz4.block.decompress instead')

ValueError: Input source data size invalid: 0 bytes

Was there a reason for this change? We use lz4 for compression of data we write to a database. Some of the records might contain empty strings. If I were to upgrade to the latest version of lz4 I'd no longer be able to support datasets with empty strings, and I'd also no longer be able to decompress the compressed data that's already been written.

@jonathanunderwood
Copy link
Member

This wasn't an intentional change - we should support the previous behaviour. Will look into it when I get avbit of time. PRs always welcome too :)

@jonathanunderwood
Copy link
Member

Looking at the code, I realize it was an intentional change, as I thought it would be a helpful check - I couldn't imagine a use case for compressing an empty string. I suppose you have provided one. I'll add an option to check for empty strings, which defaults to off, to retain the old behaviour.

@bmoscon
Copy link
Author

bmoscon commented May 12, 2017

@jonathanunderwood more than happy to supply a PR, I just wanted to understand the reason for the change, and learn if a PR would be accepted or not. Will try and produce one this weekend.

@jonathanunderwood
Copy link
Member

Ok, I think I've fixed this, will cut a new release. Please re-open if you think there's still a problem.

@bmoscon
Copy link
Author

bmoscon commented May 13, 2017

looks good, thanks. I'll test it out once a new release is on pypi.

jonathanunderwood added a commit that referenced this issue May 16, 2017
Previously we returned a string (bytes) object.
@spectral54
Copy link

As far as I can tell, this lets you compress and decompress the empty string, but doesn't allow you to decompress empty strings compressed with <v0.9.0.

From bmoscon's first comment opening this issue, lz4.compress('') produced this on v0.8.2 (this actually looks corrupt, the stuff after the first four bytes is just random memory... yikes; in my testing it's always 20 bytes long (I assume 4 size bytes and 16 bytes of random memory)):

b'\x00\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x88/\xc9\x0b'

However, the decompression isn't backwards compatible; on v0.8.2, this decompresses to the empty string:

>>> lz4.decompress(b'\x00\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x88/\xc9\x0b')
''

But on v0.9.1 (and newer, as far as I can tell):

>>> block.decompress(b'\x00\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x88/\xc9\x0b')
ValueError (the actual error changes a bit, I believe).

Would it make sense to support the old (v0.8.2) "corrupt" compressed empty strings? As far as I know, this would mean checking that the size information is zero, and accepting any garbage after that.

@jonathanunderwood
Copy link
Member

jonathanunderwood commented Feb 16, 2019

Would it make sense to support the old (v0.8.2) "corrupt" compressed empty strings? As far as I know, this would mean checking that the size information is zero, and accepting any garbage after that.

I'm in two minds over this. On the face of it, this seems like a reasonable idea. But, it would special case the zero length case. At the moment, we always error if there is junk in the passed in bytes, irrespective of uncompressed size. My inclination is to leave it as simple as possible.

@bmoscon
Copy link
Author

bmoscon commented Feb 16, 2019

Simple as possible always best!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants