DataFrame - Add chunksize to from_records() #13818

Open
achapkowski opened this issue Jul 27, 2016 · 17 comments
Labels
Constructors, Enhancement

Comments

@achapkowski

Code Sample, a copy-pastable example if possible

Currently, from_records does not take a chunksize argument.

import pandas as pd
df = pd.DataFrame.from_records(data, index=None, exclude=None, 
                                     columns=None, 
                                     coerce_float=False, 
                                     nrows=None)

Enhancement

I would like to see a chunksize option like in read_csv().

import pandas as pd
dfs = pd.DataFrame.from_records(data, index=None, exclude=None, 
                                     columns=None, 
                                     coerce_float=False, 
                                     nrows=None, chunksize=10000)
res = []
for df in dfs:
    df['col'] = 'blah'
    res.append(df)

df = pd.concat(res)

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 32
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.1
nose: 1.3.7
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.1
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext)

@TomAugspurger
Contributor

Can you say a bit more about why this would be useful to you? Typically chunksize is used to manage larger-than-memory datasets that are on disk. The data argument to from_records is presumably already in memory.

@achapkowski
Author

@TomAugspurger - I'll do my best. The idea of using a cursor-like object with __next__() implemented is that it lets you go line by line through a dataset (table, file, whatever). If you have a dataset bigger than memory, you need to limit the number of rows a user works with at one time. You may be able to work row by row, but then you lose the advantages of the DataFrame object, whereas if we can chunk it, it will load better.

read_csv has chunksize, but from_records does not. It should, because sometimes you need to limit the number of rows that are processed at once.

Thanks

@TomAugspurger
Contributor

TomAugspurger commented Jul 27, 2016

Thanks, that's perfectly reasonable. The other option is to chunk your dataset before passing it to from_records, using e.g. toolz.partition_all

stream = ...
dfs = (pd.DataFrame(chunk, ...) for chunk in toolz.partition_all(chunksize, stream))
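
A runnable sketch of that approach, with a hypothetical generator standing in for the record stream, might look like this:

import pandas as pd
import toolz

# Hypothetical generator standing in for a record stream too large to hold at once.
stream = ((i, i ** 2) for i in range(1_000_000))

chunksize = 10_000
# partition_all yields tuples of at most `chunksize` records from the stream.
dfs = (pd.DataFrame.from_records(list(chunk), columns=["a", "b"])
       for chunk in toolz.partition_all(chunksize, stream))

for df in dfs:
    df["col"] = "blah"
    # work on each 10,000-row frame without materializing the full stream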

Is adding the parameter worth the extra complexity / maintenance?

@TomAugspurger added the IO label on Jul 27, 2016
@achapkowski
Author

@TomAugspurger - I think it would be worth it in the long run. All the supported datasets should probably allow for chunking, not just a couple.

@jreback
Contributor

jreback commented Jul 27, 2016

what is the actual use case for this? if you have a rec-array, by definition it's in memory, so I'm not sure how this helps at all.

@achapkowski
Author

There are other third-party objects out there that implement the iterator and generator protocols.

For example, for spatial data there is arcpy.da.SearchCursor. This object is a standard iterator over a table, but I may not want to load the whole table into memory at once to do my work; rather, I want to piece it together, much like with CSV files.

The concept here is to process large generators/iterators in smaller pieces to be more efficient with memory consumption.
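
A minimal sketch of that pattern, using itertools.islice over any row iterator (the arcpy cursor, table name, and column names below are only illustrative):

from itertools import islice

import pandas as pd

def frames_from_cursor(cursor, columns, chunksize=10_000):
    """Yield DataFrames of at most `chunksize` rows from any row iterator."""
    while True:
        chunk = list(islice(cursor, chunksize))
        if not chunk:
            break
        yield pd.DataFrame.from_records(chunk, columns=columns)

# cursor = arcpy.da.SearchCursor("roads", ["OID@", "SHAPE@LENGTH"])  # illustrative
# for df in frames_from_cursor(cursor, columns=["oid", "length"]):
#     work_on(df)  # hypothetical per-chunk processing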

@jreback
Contributor

jreback commented Jul 28, 2016

and you can easily do that

why should pandas add this very narrow case?

@achapkowski
Author

@jreback because iterators and generators are common ways to get data, and it allows a more generic way to load data into DataFrames.

Can you provide guidance on how to implement it if it is very easy?

@jreback
Contributor

jreback commented Jul 28, 2016

@achapkowski this would be extremely inefficient, but I suppose a use case exists.

I said it's easy to do externally:

pd.concat(list(generator))

@achapkowski
Author

@jreback but aren't you just loading the whole dataset into memory then?

@jreback
Contributor

jreback commented Jul 28, 2016

of course. a DataFrame is a fixed size. Expanding it requires reallocation and copying, which is quite expensive.
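
That cost is easy to see by comparing growth-by-concatenation against a single concat over collected chunks (a small illustrative sketch):

import pandas as pd

chunks = [pd.DataFrame({"x": range(i, i + 1_000)}) for i in range(0, 100_000, 1_000)]

# Growing a frame chunk-by-chunk re-copies all accumulated rows on every step.
grown = pd.DataFrame()
for chunk in chunks:
    grown = pd.concat([grown, chunk], ignore_index=True)

# Collecting the chunks and concatenating once copies the data a single time.
once = pd.concat(chunks, ignore_index=True)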

@jreback
Contributor

jreback commented Jul 28, 2016

you can see #5902 if you want to see the discussion

@achapkowski
Author

But chunksize, as with the CSV reader, returns an iterator of DataFrames of size x; why couldn't that be done for an iterator object?

@TomAugspurger
Contributor

@achapkowski one issue I have with adding it is that all the DataFrame.from_* methods return a DataFrame. IIUC, your DataFrame.from_records(..., chunksize=10) wouldn't actually return a DataFrame, but rather an iterator similar to read_csv's TextFileReader. I feel like this breaks the convention that from_* methods are alternative constructors that return an instance of the object itself.

That, plus the fact that something like toolz.partition_all or this recipe from the standard lib isn't too bad to write, means I'm slightly against adding the chunksize kwarg to from_records. I'd prefer from_* to always return a DataFrame, and require the user to chunk the data before using the constructor.
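
For reference, a standard-library-only sketch of that kind of chunking (adapted from the itertools grouper recipe; the None filter only drops padding in the final chunk):

from itertools import zip_longest

import pandas as pd

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks (itertools docs recipe)."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

records = ((i, i % 3) for i in range(25_000))
dfs = (pd.DataFrame.from_records([r for r in chunk if r is not None], columns=["a", "b"])
       for chunk in grouper(records, 10_000))
result = pd.concat(dfs, ignore_index=True)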

@tinproject
Contributor

@TomAugspurger @achapkowski from_records already has a chunksize-like attribute: it's called nrows. I believe its name should be changed to count, because its purpose is to indicate how many records are going to be taken from an iterator. chunksize is a bad name because it implies that there are chunks in pandas, and there are not.

API matters aside, the way pandas creates a DataFrame from an iterator/generator is by putting the iterator contents into a list, and then building the DataFrame from that list. As jreback points out, you could read the discussion in #5902.
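
A quick illustration of what nrows does today (assuming it is only applied when the data argument is an iterator):

import pandas as pd

records = iter([(1, "a"), (2, "b"), (3, "c"), (4, "d")])

# nrows reads the first N records from the iterator and then stops; it does not
# return further chunks or an iterator of DataFrames.
df = pd.DataFrame.from_records(records, columns=["id", "val"], nrows=2)
print(df)  # only the rows (1, "a") and (2, "b")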

@achapkowski
Author

@tinproject - the nrows parameter seems like it will only take, say, 5000 rows from the data source and then stop reading it. Is that not correct?

@mroeschke added the Enhancement and Constructors labels and removed the IO label on May 2, 2020
@bhavaniravi
Contributor

I just came across a use case where this feature would be useful. A chunksize or equivalent attribute should be added to from_records (it might not be applicable to from_dict), because the function takes an iterator as input.

In my case, I am using ijson to read a large JSON file and want to convert it into a DataFrame to be written to a DB. At no point in my workflow do I want the whole dataset held in memory.

objects = ijson.items(file_object, "item")
df = pd.DataFrame.from_records(objects, nrows=CHUNK_SIZE)
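
In the meantime, a possible workaround sketch for this workflow (the file name, SQLite engine, and table name below are illustrative placeholders):

from itertools import islice

import ijson
import pandas as pd
import sqlalchemy

CHUNK_SIZE = 10_000
engine = sqlalchemy.create_engine("sqlite:///out.db")  # illustrative target DB

with open("large.json", "rb") as file_object:
    objects = ijson.items(file_object, "item")
    while True:
        chunk = list(islice(objects, CHUNK_SIZE))
        if not chunk:
            break
        # Only CHUNK_SIZE parsed records are held in memory at any point.
        pd.DataFrame.from_records(chunk).to_sql(
            "target_table", engine, if_exists="append", index=False
        )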

A few solutions I could think of:

  1. Add a chunksize (or similarly named) attribute to from_records that generates DataFrames in chunks
  2. Update the existing code to return a DataFrame iterator based on nrows (read_csv now returns a DataFrame iterator if chunksize is passed; similar behavior)
  3. Add an iterator parameter (same as read_csv)
