
v0.15.0 Can't read csv.gz from url #8685

Closed
olgabot opened this issue Oct 30, 2014 · 6 comments
Labels: Enhancement, IO CSV (read_csv, to_csv)
Milestone: Someday

Comments


olgabot commented Oct 30, 2014

import pandas as pd
pd.read_csv('https://raw.githubusercontent.com/YeoLab/shalek2013/master/expression.csv.gz', compression='gzip', index_col=0)

---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-17-2e5c16b1e504> in <module>()
----> 1 pd.read_csv('https://raw.githubusercontent.com/YeoLab/shalek2013/master/expression.csv.gz', compression='gzip', index_col=0)

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    461                     skip_blank_lines=skip_blank_lines)
    462 
--> 463         return _read(filepath_or_buffer, kwds)
    464 
    465     parser_f.__name__ = name

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    237 
    238     # Create the parser.
--> 239     parser = TextFileReader(filepath_or_buffer, **kwds)
    240 
    241     if (nrows is not None) and (chunksize is not None):

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    551             self.options['has_index_names'] = kwds['has_index_names']
    552 
--> 553         self._make_engine(self.engine)
    554 
    555     def _get_options_with_defaults(self, engine):

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
    688     def _make_engine(self, engine='c'):
    689         if engine == 'c':
--> 690             self._engine = CParserWrapper(self.f, **self.options)
    691         else:
    692             if engine == 'python':

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, src, **kwds)
   1050         kwds['allow_leading_cols'] = self.index_col is not False
   1051 
-> 1052         self._reader = _parser.TextReader(src, **kwds)
   1053 
   1054         # XXX

/usr/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4693)()

/usr/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._get_header (pandas/parser.c:6091)()

/usr/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8119)()

/usr/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.raise_parser_error (pandas/parser.c:20349)()

CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

But the local file works:

[screenshot: reading the local copy of expression.csv.gz with read_csv succeeds]

I'm on pandas v0.15.0.


rockg commented Oct 30, 2014

Is this documented behavior that's supposed to work? From a quick glance at the code, it doesn't look like reading a compressed file from a URL is currently handled.


rockg commented Oct 30, 2014

We would have to add something similar to this SO answer.


jreback commented Oct 31, 2014

yep, this doesn't look supported at the moment

a pull request to fix it would be welcome

jreback added the Enhancement and IO CSV (read_csv, to_csv) labels on Oct 31, 2014
jreback added this to the Someday milestone on Oct 31, 2014

dhimmel commented May 26, 2015

+1, this is an important feature for the modern workflow

Until someday, I've been using the following workaround in Python 3.4 and Pandas 0.16.0:

import gzip, io
import pandas, requests

url = 'https://raw.githubusercontent.com/YeoLab/shalek2013/master/expression.csv.gz'
response = requests.get(url)
bytes_io = io.BytesIO(response.content)
with gzip.open(bytes_io, 'rt') as read_file:  # 'rt' gives a text-mode stream
    df = pandas.read_csv(read_file)

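Here response.content holds the full download in memory, io.BytesIO wraps those bytes in a file-like object, and gzip.open(..., 'rt') layers text-mode decompression on top, producing the kind of text stream read_csv accepts, with no temporary file needed.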

jreback commented May 26, 2015

@dhimmel pull-requests are welcome to add this feature.


jreback commented Aug 20, 2015

closed by #10649

jreback closed this as completed on Aug 20, 2015
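The commit note below pins the new behavior to pandas 0.17.0. As a minimal sketch, assuming pandas >= 0.17.0, the original failing call should now work directly (recent pandas versions can also infer compression from the .gz extension):

import pandas as pd

url = 'https://raw.githubusercontent.com/YeoLab/shalek2013/master/expression.csv.gz'
# The call from this issue, now supported for compressed files over URLs
df = pd.read_csv(url, compression='gzip', index_col=0)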
dhimmel added a commit to dhimmel/bindingdb that referenced this issue Nov 19, 2015
For `process.ipynb`:

+ Improve documentation with markdown cells.
+ Switch to commit specific links for dhimmel/uniprot.
+ Adopt pandas 0.17.0 gzipped URL support. See pandas-dev/pandas#8685
+ Exclude rows 192304-192473 (one-indexed) where `BindingDB Reactant_set_id`
  was missing.
+ Handle affinities that cannot be converted to floats.

For `collapse.Rmd`:

+ Use readr for TSV I/O.
+ Retain pubmed_ids and sources when collapsing.