DataFrame - Add chunksize to from_records() #13818

Open
achapkowski opened this issue Jul 27, 2016 · 17 comments
Labels
Constructors, Enhancement

Comments

@achapkowski

Code Sample, a copy-pastable example if possible

Currently, from_records does not take a chunksize argument.

import pandas as pd
df = pd.DataFrame.from_records(data, index=None, exclude=None, 
                                     columns=None, 
                                     coerce_float=False, 
                                     nrows=None)

Enhancement

I would like to see a chunksize option like in read_csv().

import pandas as pd
dfs = pd.DataFrame.from_records(data, index=None, exclude=None, 
                                     columns=None, 
                                     coerce_float=False, 
                                     nrows=None, chunksize=10000)
res = []
for df in dfs:
    df['col'] = 'blah'
    res.append(df)

df = pd.concat(res)

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 32
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.1
nose: 1.3.7
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.1
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext)

@TomAugspurger
Contributor

Can you say a bit more about why this would be useful to you? Typically chunksize is used to manage larger-than-memory datasets that are on disk. The data argument to from_records is presumably already in memory.

@achapkowski
Author

@TomAugspurger - I'll do my best. The idea of using a cursor-like object with __next__() implemented is that it lets you go line by line through a dataset (table, file, whatever). If you have a dataset bigger than memory, you need to limit the number of rows a user works with at one time. You may be able to work row by row, but then you lose the advantages of the DataFrame object, whereas if we can chunk it, it will load better.

read_csv has chunksize, but from_records does not. It should, because sometimes you need to limit the number of rows that are processed at once.

Thanks

@TomAugspurger
Contributor

TomAugspurger commented Jul 27, 2016

Thanks, that's perfectly reasonable. The other option is to chunk your dataset before passing it to from_records, using e.g. toolz.partition_all

stream = ...
dfs = (pd.DataFrame(chunk, ...) for chunk in toolz.partition_all(chunksize, stream))
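
A runnable sketch of that approach, with a hypothetical generator standing in for the record stream, might look like this:

import pandas as pd
import toolz

# Hypothetical generator standing in for a record stream too large to hold at once.
stream = ((i, i ** 2) for i in range(1_000_000))

chunksize = 10_000
# partition_all yields tuples of at most `chunksize` records from the stream.
dfs = (pd.DataFrame.from_records(list(chunk), columns=["a", "b"])
       for chunk in toolz.partition_all(chunksize, stream))

for df in dfs:
    df["col"] = "blah"
    # work on each 10,000-row frame without materializing the full stream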

Is adding the parameter worth the extra complexity / maintenance?

@TomAugspurger added the IO label on Jul 27, 2016
@achapkowski
Author

@TomAugspurger - I think it would be worth it in the long run. All the supported datasets should probably allow for chunking, not just a couple.

@jreback
Contributor

jreback commented Jul 27, 2016

what is the actual use case for this? if you have a rec-array, by definition it's in memory, so I'm not sure how this helps at all.

@achapkowski
Author

There are other third-party objects out there that implement the iterator and generator protocols.

For example, for spatial data there is arcpy.da.SearchCursor. This object is a standard iterator over a table, but I may not want to load the whole table into memory at once to do my work; rather, I want to piece it together, much like with CSV files.

The concept here is to process large generators/iterators in smaller pieces to be more efficient with memory consumption.
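
A minimal sketch of that pattern, using itertools.islice over any row iterator (the arcpy cursor, table name, and column names below are only illustrative):

from itertools import islice

import pandas as pd

def frames_from_cursor(cursor, columns, chunksize=10_000):
    """Yield DataFrames of at most `chunksize` rows from any row iterator."""
    while True:
        chunk = list(islice(cursor, chunksize))
        if not chunk:
            break
        yield pd.DataFrame.from_records(chunk, columns=columns)

# cursor = arcpy.da.SearchCursor("roads", ["OID@", "SHAPE@LENGTH"])  # illustrative
# for df in frames_from_cursor(cursor, columns=["oid", "length"]):
#     work_on(df)  # hypothetical per-chunk processing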

@jreback
Contributor

jreback commented Jul 28, 2016

and you can easily do that

why should pandas add this very narrow case?

@achapkowski
Author

@jreback because iterators and generators are common ways to get data, and it allows a more generic way to load data into DataFrames.

Can you provide guidance on how to implement it if it is very easy?

@jreback
Contributor

jreback commented Jul 28, 2016

@achapkowski this would be extremely inefficient, but I suppose a use case exists.

I said it's easy to do externally:

pd.concat(list(generator))

@achapkowski
Author

@jreback but aren't you just loading the whole dataset into memory then?

@jreback
Contributor

jreback commented Jul 28, 2016

of course. a DataFrame is a fixed size. Expanding it requires reallocation and copying, which is quite expensive.
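
That cost is easy to see by comparing growth-by-concatenation against a single concat over collected chunks (a small illustrative sketch):

import pandas as pd

chunks = [pd.DataFrame({"x": range(i, i + 1_000)}) for i in range(0, 100_000, 1_000)]

# Growing a frame chunk-by-chunk re-copies all accumulated rows on every step.
grown = pd.DataFrame()
for chunk in chunks:
    grown = pd.concat([grown, chunk], ignore_index=True)

# Collecting the chunks and concatenating once copies the data a single time.
once = pd.concat(chunks, ignore_index=True)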

@jreback
Contributor

jreback commented Jul 28, 2016

you can see #5902 if you want to see the discussion

@achapkowski
Author

But chunksize, as with the CSV reader, returns an iterator of DataFrames of size x; why couldn't that be done for an iterator object?

@TomAugspurger
Contributor

@achapkowski one issue I have with adding it is that all the DataFrame.from_* methods return a DataFrame. IIUC, your DataFrame.from_records(..., chunksize=10) wouldn't actually return a DataFrame, but rather an iterator similar to read_csv's TextFileReader. I feel like this breaks the convention that from_* methods are alternative constructors that return an instance of the object itself.

That, plus the fact that something like toolz.partition_all or this recipe from the standard lib isn't too bad to write, means I'm slightly against adding the chunksize kwarg to from_records. I'd prefer from_* to always return a DataFrame, and require the user to chunk the data before using the constructor.
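
For reference, a standard-library-only sketch of that kind of chunking (adapted from the itertools grouper recipe; the None filter only drops padding in the final chunk):

from itertools import zip_longest

import pandas as pd

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks (itertools docs recipe)."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

records = ((i, i % 3) for i in range(25_000))
dfs = (pd.DataFrame.from_records([r for r in chunk if r is not None], columns=["a", "b"])
       for chunk in grouper(records, 10_000))
result = pd.concat(dfs, ignore_index=True)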

@tinproject
Contributor

@TomAugspurger @achapkowski from_records already has a chunksize-like attribute: it's called nrows. I believe its name should be changed to count, because its purpose is to indicate how many records are going to be taken from an iterator. chunksize is a bad name because it implies that there are chunks in pandas, and there are not.

API matters aside, the way pandas creates a DataFrame from an iterator/generator is by putting the iterator contents into a list, and then building the DataFrame from that list. As jreback points out, you could read the discussion in #5902.
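
A quick illustration of what nrows does today (assuming it is only applied when the data argument is an iterator):

import pandas as pd

records = iter([(1, "a"), (2, "b"), (3, "c"), (4, "d")])

# nrows reads the first N records from the iterator and then stops; it does not
# return further chunks or an iterator of DataFrames.
df = pd.DataFrame.from_records(records, columns=["id", "val"], nrows=2)
print(df)  # only the rows (1, "a") and (2, "b")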

@achapkowski
Author

@tinproject - the nrows parameter seems like it will only take, say, 5000 rows from the data source and then stop reading it. Is that not correct?

@mroeschke added the Enhancement and Constructors labels and removed the IO label on May 2, 2020
@bhavaniravi
Contributor

I just came across a use case where this feature would be useful. A chunksize or equivalent attribute should be added to from_records (it might not be applicable to from_dict), because the function takes an iterator as input.

In my case, I am using ijson to read a large JSON file and want to convert it into a DataFrame to be written to a DB. At no point in my workflow do I want the whole dataset held in memory.

objects = ijson.items(file_object, "item")
df = pd.DataFrame.from_records(objects, nrows=CHUNK_SIZE)
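
In the meantime, a possible workaround sketch for this workflow (the file name, SQLite engine, and table name below are illustrative placeholders):

from itertools import islice

import ijson
import pandas as pd
import sqlalchemy

CHUNK_SIZE = 10_000
engine = sqlalchemy.create_engine("sqlite:///out.db")  # illustrative target DB

with open("large.json", "rb") as file_object:
    objects = ijson.items(file_object, "item")
    while True:
        chunk = list(islice(objects, CHUNK_SIZE))
        if not chunk:
            break
        # Only CHUNK_SIZE parsed records are held in memory at any point.
        pd.DataFrame.from_records(chunk).to_sql(
            "target_table", engine, if_exists="append", index=False
        )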

A few solutions I could think of:

  1. Add a chunksize (or similarly named) attribute to from_records that generates DataFrames in chunks
  2. Update the existing code to return a DataFrame iterator based on nrows (read_csv now returns a DataFrame iterator if chunksize is passed; similar behavior)
  3. Add an iterator parameter (same as read_csv)
