Feature request: specify chunksize for read_sql #2908

Closed
davidstackio opened this issue Feb 21, 2013 · 10 comments · Fixed by #8330
Labels
Enhancement · IO Data (IO issues that don't fit into a more specific label) · IO SQL (to_sql, read_sql, read_sql_query)
Milestone
0.15.0

Comments

@davidstackio

It would be helpful to iterate through the rows returned from a SQL query (SQLite specifically) chunk by chunk, just as can be done for read_csv and text files as described here: http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk

The return value should be an iterable object. This would prevent queries from returning an amount of data so large that it (possibly) exceeds system memory.
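For illustration, a minimal sketch of the kind of usage being requested, mirroring read_csv's chunksize. The chunksize keyword for read_sql shown here is the proposed behaviour rather than something that exists yet, and the database and table names are placeholders:

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("output.db")  # placeholder database

    # With chunksize set, read_sql would return an iterator of DataFrames
    # holding at most `chunksize` rows each, instead of materializing the
    # whole result set in memory at once.
    for chunk in pd.read_sql("SELECT * FROM output", conn, chunksize=10000):
        print(len(chunk))  # process each chunk independently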

@davidstackio
Author

The exact error I got was this on pandas version 0.10.1:

  runData = psql.read_frame("SELECT * FROM output", conn)
  File "C:\Python27\lib\site-packages\pandas\io\sql.py", line 151, in read_frame
    coerce_float=coerce_float)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1014, in from_records
    coerce_float=coerce_float)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 5468, in _to_arrays
    if len(data) == 0:
TypeError: object of type 'NoneType' has no len()

The TypeError is a little confusing; it took me a while to figure out that it was happening because I was hitting the memory limit. Maybe a clearer error message would be enough ("max query size reached" or something), perhaps suggesting that the user add a SQL LIMIT clause to prevent this problem (see http://php.about.com/od/mysqlcommands/g/Limit_sql.htm).

@davidstackio
Author

I just ran:

runData = psql.read_frame("SELECT * FROM output LIMIT 10", conn)

with no problem.

@hayd hayd mentioned this issue Jul 8, 2013
@jreback
Contributor

jreback commented Jul 10, 2013

So, to be consistent, this would need iterator and chunksize keywords.
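For context, this is roughly how those keywords already behave for read_csv, which the SQL reader would presumably mirror (the file name is made up):

    import pandas as pd

    # chunksize=N: read_csv returns an iterator of DataFrames with at most
    # N rows each rather than a single DataFrame.
    for chunk in pd.read_csv("output.csv", chunksize=10000):
        print(len(chunk))

    # iterator=True: read_csv returns a reader object from which chunks of
    # an arbitrary size can be pulled on demand.
    reader = pd.read_csv("output.csv", iterator=True)
    first_rows = reader.get_chunk(5)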

@lebedov

lebedov commented Oct 4, 2013

For the time being, here is a simple implementation of the requested functionality: https://gist.github.com/lebedov/6831387
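For anyone wanting a self-contained workaround in the meantime, a rough sketch of one way to do it by paging the query with LIMIT/OFFSET (this is not necessarily what the gist above does; the function name is made up, and read_frame is the pandas 0.10-era API, later renamed read_sql_query):

    import pandas.io.sql as psql

    def read_frame_chunked(query, conn, chunksize=10000):
        """Yield DataFrames of at most `chunksize` rows by paging the query.

        Assumes the backend supports LIMIT/OFFSET (e.g. SQLite) and that the
        query has a stable row order; each page re-executes the query.
        """
        offset = 0
        while True:
            page = psql.read_frame(
                "%s LIMIT %d OFFSET %d" % (query, chunksize, offset), conn
            )
            if len(page) == 0:
                break
            yield page
            offset += chunksize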

@jreback
Contributor

jreback commented Mar 22, 2014

@jorisvandenbossche @hayd does this need to go on the new SQL issues list?

@jorisvandenbossche
Member

Hmm, I'd prefer to keep the list in #6292 to the important todos that should ideally be finished before releasing it. This is a nice feature request, but not a blocker for the basic functionality. Just keep it as a separate issue?

@jreback
Contributor

jreback commented Mar 23, 2014

OK... how about you create another issue (marked for 0.15) that will include items that are not in #6292 but are marked as SQL; that way it's easy to move an issue out of the current release to the next one (and to track all the SQL ones). Make checkboxes and such.

I think #3745, #5008, and #2754 should go on one of these as well (or, if they are already satisfied by another issue, go ahead and close them).

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 25, 2014
@jorisvandenbossche jorisvandenbossche changed the title Feature Request: specify chunksize for sql.read_frame ENH: specify chunksize for read_sql Jun 3, 2014
@jorisvandenbossche jorisvandenbossche changed the title ENH: specify chunksize for read_sql Feature request: specify chunksize for read_sql Jun 3, 2014
@hayd
Contributor

hayd commented Sep 3, 2014

This came up again here: http://stackoverflow.com/q/25633830/1240268

@mariusbutuc

I take full responsibility for asking the question that @hayd just referenced and answered in such good detail on SO, about how to pull large amounts of data from a remote server into a DataFrame. Thank you for that!

I've updated the SO question with more context, but if I can help or contribute in any way here, I'd be more than happy to.

@jorisvandenbossche
Member

@mariusbutuc if you want to try to implement it and send a pull request, that would be very welcome!

I think this could be done inside the read_sql function (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L870) by using fetchmany instead of fetchall. Would that work?
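A minimal sketch of that fetchmany idea, assuming a DB-API cursor that has already executed the query; the helper name and signature are illustrative, not actual pandas internals:

    import pandas as pd

    def _chunked_frames(cursor, chunksize, columns, coerce_float=True):
        # Pull `chunksize` rows at a time with fetchmany() and build one
        # DataFrame per batch, instead of loading everything with fetchall().
        while True:
            rows = cursor.fetchmany(chunksize)
            if not rows:
                break
            yield pd.DataFrame.from_records(
                rows, columns=columns, coerce_float=coerce_float
            )

read_sql could then return this generator when a chunksize is passed and keep the existing fetchall() path otherwise.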
