Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ZIP file decompression and TestCompression. #12175

Closed
wants to merge 1 commit into from

Conversation

lababidi
Copy link

Closes #11413

@jreback jreback added IO Data IO issues that don't fit into a more specific label IO CSV read_csv, to_csv labels Jan 29, 2016
klass = FixedWidthFieldParser
else: #default to engine == 'python':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you modify this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified this to default to the Python engine. If this is not the wanted functionality. I can remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just change code that is relevant.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do. Does this need to a separate PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why you are changing this in the first place.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the use case that engine is neither 'python' nor 'python-fwf'.

Is it possible for this to happen?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, that would be an exception and should raise a ValueError (do this in another issue/PR)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks for the clarification.

@lababidi
Copy link
Author

@jreback The idea of the Mixin makes sense. Took me a bit to wrap my head around it. I had to add self.engine to a couple of the tests because bzip decompression will not raise an exception and so the Compression Mixin needs to check what engine it is using to make sure the test runs correctly.

@lababidi
Copy link
Author

@jreback I think this is ready for your review.

f = zip_file.open(file_name)
else:
raise ValueError('ZIP file contains multiple files {}',
zip_file.filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need a .format here

@max-sixty
Copy link
Contributor

Jeff is going to ask you to squash your commits into one, as per the contributing docs.
Nice job overall!

@TomAugspurger
Copy link
Contributor

@MaximilianR no need for squashing anymore thanks to https://github.com/pydata/pandas/blob/master/scripts/merge-py.py ;)

EDIT: which reminds me, CONTRIBUTING.md needs to be updated. Will do this weekend unless someone beats me to it.

@lababidi
Copy link
Author

@MaximilianR @TomAugspurger Thanks for the comments guys!

@max-sixty
Copy link
Contributor

@TomAugspurger Cool!

@jreback
Copy link
Contributor

jreback commented Jan 29, 2016

pls squash

even though I can do it - it's much cleaner from a future reader perspective on a smaller change

@lababidi
Copy link
Author

@jreback Squashed. Thanks.

result = self.read_csv(path, compression='zip')
tm.assert_frame_equal(result, expected)

result = self.read_csv(open(path, 'rb'), compression='zip')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this in a with block to make sure the file is closed

@jreback
Copy link
Contributor

jreback commented Jan 30, 2016

looks pretty good. just some minor stylistic comments.

@jreback
Copy link
Contributor

jreback commented Jan 30, 2016

======================================================================
ERROR: test_to_csv_compression_value_error (pandas.tests.frame.test_to_csv.TestDataFrameToCSV)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pydata/pandas/pandas/tests/frame/test_to_csv.py", line 998, in test_to_csv_compression_value_error
    filename, compression="zip")
  File "/home/travis/build/pydata/pandas/pandas/util/testing.py", line 1952, in assertRaises
    _callable(*args, **kwargs)
  File "/home/travis/build/pydata/pandas/pandas/core/frame.py", line 1338, in to_csv
    formatter.save()
  File "/home/travis/build/pydata/pandas/pandas/core/format.py", line 1524, in save
    compression=self.compression)
  File "/home/travis/build/pydata/pandas/pandas/io/common.py", line 346, in _get_handle
    zip_file = zipfile.ZipFile(path)
  File "/home/travis/miniconda/envs/pandas/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/home/travis/miniconda/envs/pandas/lib/python2.7/zipfile.py", line 811, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file

put a test in for this error as well (e.g. try to open a non-zipfile); I don't think you need any code changes though, you can just let it raise.

@lababidi
Copy link
Author

lababidi commented Feb 1, 2016

@jreback Good ideas. I think I covered all your requests.

@jreback
Copy link
Contributor

jreback commented Feb 1, 2016

@lababidi ok, this lgtm.

just need a whatsnew note (you can put kind of what you did in the addition to the doc-string). pls add, ping when green.

@jreback jreback added this to the 0.18.0 milestone Feb 1, 2016
@lababidi lababidi force-pushed the feature_zip branch 2 times, most recently from ccfc4a7 to 032362c Compare February 2, 2016 15:39

file_name = 'test_file.zip'
with tm.ensure_clean(file_name) as path:
tmp = zipfile.ZipFile(path, mode='w')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you didn't change anything here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback what line are you referring to? zipfile.ZipFile?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, your context managers are nested, but don't need to be

 with open(self.csv1, 'rb') as data_file:
        data = data_file.read()

data can be used here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback thanks for clarifying. I was using the previous convention. I'll clean this up now.

compression : {'gzip', 'bz2', 'zip', 'infer', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
bz2 or zip if filepath_or_buffer is a string ending in '.gz', '.bz2' or
'.zip', respectively, and no decompression otherwise. New in 0.18.0: ZIP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so add in a versionadded tag and put the 0.18.0 stuff there

@lababidi
Copy link
Author

@jreback previous tests passed. The most recent push only changed the version in the docstring
https://travis-ci.org/pydata/pandas/builds/117001384

compression : {'gzip', 'bz2', 'zip', 'infer', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
bz2 or zip if filepath_or_buffer is a string ending in '.gz', '.bz2' or
'.zip', respectively, and no decompression otherwise. New in 0.18.1: ZIP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a versionadded tag here

@jreback
Copy link
Contributor

jreback commented Mar 18, 2016

@lababidi ok looks pretty good. only changes are to use assertRaisesRegexp when you assert for zip in order to make sure the correct messages are raised (as there are multiple possibilites).

pls make that and ping when green.

@lababidi lababidi force-pushed the feature_zip branch 2 times, most recently from f57a7ee to daae2aa Compare March 21, 2016 18:33
@lababidi
Copy link
Author

@jreback could you help me? the test only failed on the following:

--------------------------------------------------------------------------------------------------------------
#176 nose.failure.Failure.runTest: direct creation of extension dtype datetime64[ns, UTC] is not supported ATM

@jreback
Copy link
Contributor

jreback commented Mar 21, 2016

not sure where you are seeing this. your 2.7 tests is failing because of a linting issue (line too long)

git diff master | flake8 --diff

@lababidi
Copy link
Author

@jreback it's in the Travis results

@jreback
Copy link
Contributor

jreback commented Mar 21, 2016

I restarted that job, though you need to repush anyhow (lint error). Never saw that one before; I think its a crash in something else, so let's see if it recurs.

@jreback
Copy link
Contributor

jreback commented Mar 22, 2016

@lababidi ok this passed, except for the lint check. pls fix and repush, ping when green.

Fix PEP8 issues. Change Compression to be a Mixin. Add Compression Mixin correctly with current Tests. Add .format, Rename Compression, with-block, empty zip, bad-zip
@lababidi
Copy link
Author

Thank you @jreback for your help and patience with this. I'll help out on the other issues soon.

@jreback
Copy link
Contributor

jreback commented Mar 22, 2016

@lababidi no thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants