
Support ndjson -- newline delimited json -- for streaming data. #9180

Closed
okdistribute opened this issue Jan 1, 2015 · 22 comments
Labels: Enhancement, IO JSON

@okdistribute

Hey all,

I'm a developer on the dat project (git for data), and we are building a Python library to interact with the data store.

Everything in dat is streaming, and we use newline-delimited JSON as the official transfer format between processes.

Take a look at the specification for newline-delimited JSON here: http://ndjson.org/

Does pandas support this yet, and if not, would you consider adding a to_ndjson function alongside the existing output formats?

For example, the following table:

> df
    a        b          key                     version
0  True   False  ci4diwru70000x6xmmn19nba1        1
1  False  True   ci4diww6j0001x6xmbyp5o2f0        1

would be converted to:

> df.to_ndjson()
'{"a": true, "b": false, "key": "ci4diwru70000x6xmmn19nba1", "version": 1}\n{"a": false, "b": true, "key": "ci4diww6j0001x6xmbyp5o2f0", "version": 1}'

For general streaming use cases, it might also be nice to consider other ways of supporting this format, like a generator function that yields ndjson-able objects.

@okdistribute
Author

something like this, perhaps:

def iterndjson(df):
    # collect each row as a plain dict, ready for JSON serialization
    return [row.to_dict() for _, row in df.iterrows()]
> df = pd.DataFrame({'one': [1,2,3,4], 'two': [3,4,5,6]})
> iterndjson(df)
[{'one': 1, 'two': 3},
 {'one': 2, 'two': 4},
 {'one': 3, 'two': 5},
 {'one': 4, 'two': 6}]
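Serializing those dicts would then give the ndjson text; a sketch using the stdlib json module:

import json

ndjson = '\n'.join(json.dumps(record) for record in iterndjson(df))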

@jreback jreback added the IO JSON read_json, to_json, json_normalize label Jan 2, 2015
@jreback
Contributor

jreback commented Jan 2, 2015

@jtratner
Contributor

this would also be great for things like BigQuery, which outputs JSON files as newline-delimited JSON. The issue is that such a file is not actually valid JSON as a whole (it is multiple top-level objects rather than one).

@Karissa maybe you could hack around this by reading the file row by row, using ujson/json to parse each row into a Python dictionary, and then passing the whole list to the DataFrame constructor?
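A minimal sketch of that workaround (the file name is hypothetical):

# parse each line into a dict with the stdlib json module,
# then hand the list of dicts to the DataFrame constructor
import json

import pandas as pd

with open('records.ndjson') as fp:
    rows = [json.loads(line) for line in fp]

df = pd.DataFrame(rows)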

@aterrel
Contributor

aterrel commented Jun 2, 2016

@mrocklin and I looked into this. The simplest solution we came up with was loading the file into a buffer, adding the appropriate commas and brackets, and then passing the result back to read_json.
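Roughly like this (a sketch of the buffering idea, with a hypothetical file name):

# join the lines with commas and wrap them in brackets to form one
# JSON array, then hand the resulting string to read_json
import pandas as pd

with open('records.ndjson') as fp:
    buf = '[' + ','.join(line.strip() for line in fp if line.strip()) + ']'

df = pd.read_json(buf)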

Below are a few timings for this approach. The current implementation of read_json is already a bit slower than plain ujson, so we felt the simplicity of this approach didn't slow anything down too much.

In [2]: import json

In [3]: def parse():
   ...:     with open("reviews_Clothing_Shoes_and_Jewelry_5.json") as fp:
   ...:         list(map(json.loads, fp))
   ...:

In [4]: %timeit parse()
1 loop, best of 3: 4.05 s per loop

In [18]: import ujson as json

In [19]: def parse():
   ....:     with open("reviews_Clothing_Shoes_and_Jewelry_5.json") as fp:
   ....:         list(map(json.loads, fp))
   ....:

In [20]: %timeit parse()
1 loop, best of 3: 1.43 s per loop

In [22]: %time _ = pd.read_json('reviews_Clothing_Shoes_and_Jewelry_5_comma.json', convert_dates=False)
CPU times: user 4.75 s, sys: 520 ms, total: 5.27 s
Wall time: 5.49 s

I'll try to get a patch together unless someone thinks there is a better solution. The notion would be to add a lines=True flag to the reader.

@aterrel aterrel self-assigned this Jun 2, 2016
@jreback jreback modified the milestones: 0.18.2, Next Major Release Jun 3, 2016
@aterrel
Contributor

aterrel commented Jun 3, 2016

@Karissa is there any difference between ndjson and jsonlines (http://jsonlines.org/)? I've never heard of ndjson, but it seems to be the same thing.

@okdistribute
Author

Looks like jsonlines includes a few more rules, including a specification of UTF-8 encoding. Compare http://jsonlines.org/ vs http://ndjson.org/.

@MtDersvan

MtDersvan commented Apr 3, 2017

Hi, is there a way to efficiently json_normalize newline-delimited JSON that was read with read_json?
For example, I have a big dataset of nested newline-delimited JSON, and I'd like to read it and flatten it at the same time, so the cells hold actual values instead of dicts.

Right now I just json.loads line by line and then normalize the whole list of dicts, but it takes a lot of time and is not elegant in any way.
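(A sketch of this line-by-line approach; the file name is hypothetical.)

# read the nested ndjson line by line with the stdlib json module,
# then flatten the resulting dicts with json_normalize
import json

import pandas as pd
from pandas.io.json import json_normalize

with open('data.ndjson') as fp:
    records = [json.loads(line) for line in fp]

flat = json_normalize(records)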

Thanks!

@jreback
Contributor

jreback commented Apr 3, 2017

you would have to show an example

@MtDersvan

Thanks for the quick reply!
Here is an example with toy data:
(screenshot of the toy data omitted)

I'd like to efficiently apply json_normalize to the data read with read_json.

@jreback
Contributor

jreback commented Apr 3, 2017

can you cat toy.csv here?

@MtDersvan

Here.
toy.csv.zip

@jreback
Contributor

jreback commented Apr 3, 2017

This won't be that efficient, but it's reasonably idiomatic.

In [18]: pd.read_json('toy.csv', lines=True).stack().apply(Series)['string'].unstack()
Out[18]: 
    age                  description price
0  None              Very Importrant   500
1     5                          NaN   150
2    15  Very Importrant Random Text  None

@jreback
Contributor

jreback commented Apr 3, 2017

I don't think there is an easy way to do this ATM. In pandas2 this will be more built-in.

cc @wesm

@wesm
Member

wesm commented Apr 3, 2017

It would be much more sustainable to do this kind of parsing with RapidJSON in C++; I think we should be able to deliver this en route to pandas2.

@MtDersvan

MtDersvan commented Apr 4, 2017

Just checked on my production dataset:

> This won't be that efficient, but it's reasonably idiomatic.

First, fortunately it actually gets the job done better than using json_normalize (no annoying *.string or *.nULL in the column names). So thanks a lot.

Second, unfortunately it is definitely less efficient than using json_normalize as I did previously. When I increased the amount of data to 10000 entries (still just a fraction of the data I need to process), it took ≈ 20x more time.

@jreback
Contributor

jreback commented Apr 4, 2017

@iFantastic yes, unfortunately our handling of newline-delimited JSON is not great right now; json_normalize and the solution I gave are all in Python (only the core JSON parser is in C), so these are not very efficient.

stay tuned to wesm/pandas2#71

@MtDersvan

@jreback thanks for your time, very helpful!

@MtDersvan

@jreback I have one more question: is it possible to achieve formatting like yours

In [18]: pd.read_json('toy.csv', lines=True).stack().apply(Series)['string'].unstack()
Out[18]:
    age                  description price
0  None              Very Importrant   500
1     5                          NaN   150
2    15  Very Importrant Random Text  None

but using json_normalize?

I mean, to get nicely formatted column names like age, description, price instead of age.string, description.nULL, description.string, price.string in the columns.

Thanks!

@jreback
Contributor

jreback commented Apr 4, 2017

@iFantastic you can just post-process the columns

In [12]: i = pd.Index(['price.string', 'description.nuLL', 'age.string'])

In [13]: i.str.extract('(.*)\\.', expand=False)
Out[13]: Index(['price', 'description', 'age'], dtype='object')
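Applied to a normalized frame, that might look like the following sketch (it assumes every column name contains a dot; names without one would extract to NaN):

# rename the columns by stripping everything from the last '.' onward
df.columns = df.columns.str.extract('(.*)\\.', expand=False)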

@MtDersvan

@jreback Thanks!

@KristianHolsheimer
Contributor

For ordinary ndjson files (like those output by BigQuery), I use the DataFrame.from_records constructor together with ujson:

import ujson as json
import pandas as pd

# map is lazy in Python 3, so consume it while the file is still open
with open('/path/to/records.ndjson') as fp:
    records = map(json.loads, fp)
    df = pd.DataFrame.from_records(records)

In Python 2, you'd probably want itertools.imap instead of the built-in map.

@jfthuong

FYI: ndjson is now supported out of the box with the argument lines=True (note that to_json also requires orient='records' when lines=True):

import pandas as pd

df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True)
