
Support ndjson -- newline delimited json -- for streaming data. #9180

Closed
okdistribute opened this issue Jan 1, 2015 · 22 comments
Labels: Enhancement, IO JSON

@okdistribute

Hey all,

I'm a developer on the dat project (git for data), and we are building a Python library to interact with the data store.

Everything in dat is streaming, and we use newline-delimited JSON as the official transfer format between processes.

Take a look at the specification for newline-delimited JSON here: http://ndjson.org/

Does pandas support this yet, and if not, would you consider adding a to_ndjson function alongside the existing output formats?

For example, the following table:

> df
    a        b          key                     version
0  True   False  ci4diwru70000x6xmmn19nba1        1
1  False  True   ci4diww6j0001x6xmbyp5o2f0        1

would be converted to:

> df.to_ndjson()
'{"a": true, "b": false, "key": "ci4diwru70000x6xmmn19nba1", "version": 1}\n{"a": false, "b": true, "key": "ci4diww6j0001x6xmbyp5o2f0", "version": 1}'

For general streaming use cases, it might also be nice to consider other ways of supporting this format, like a generator function that yields ndjson-able objects.

@okdistribute
Author

something like this, perhaps:

def iterndjson(df):
    # collect each row as a plain dict, ready for JSON serialization
    return [row.to_dict() for _, row in df.iterrows()]
> df = pd.DataFrame({'one': [1,2,3,4], 'two': [3,4,5,6]})
> iterndjson(df)
[{'one': 1, 'two': 3},
 {'one': 2, 'two': 4},
 {'one': 3, 'two': 5},
 {'one': 4, 'two': 6}]
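Serializing those dicts would then give the ndjson text; a sketch using the stdlib json module:

import json

ndjson = '\n'.join(json.dumps(record) for record in iterndjson(df))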

@jreback jreback added the IO JSON read_json, to_json, json_normalize label Jan 2, 2015
@jreback
Contributor

jreback commented Jan 2, 2015

@jtratner
Contributor

this would also be great for things like BigQuery, which outputs JSON files as newline-delimited JSON. The issue is that such a file is not actually valid JSON as a whole (it is multiple top-level objects rather than one).

@Karissa maybe you could hack around this by reading the file row by row, using ujson/json to parse each row into a Python dictionary, and then passing the whole list to the DataFrame constructor?
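A minimal sketch of that workaround (the file name is hypothetical):

# parse each line into a dict with the stdlib json module,
# then hand the list of dicts to the DataFrame constructor
import json

import pandas as pd

with open('records.ndjson') as fp:
    rows = [json.loads(line) for line in fp]

df = pd.DataFrame(rows)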

@aterrel
Contributor

aterrel commented Jun 2, 2016

@mrocklin and I looked into this. The simplest solution we came up with was loading the file into a buffer, adding the appropriate commas and brackets, and then passing the result back to read_json.
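Roughly like this (a sketch of the buffering idea, with a hypothetical file name):

# join the lines with commas and wrap them in brackets to form one
# JSON array, then hand the resulting string to read_json
import pandas as pd

with open('records.ndjson') as fp:
    buf = '[' + ','.join(line.strip() for line in fp if line.strip()) + ']'

df = pd.read_json(buf)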

Below are a few timings for this approach. The current implementation of read_json is already a bit slower than plain ujson, so we felt the simplicity of this approach didn't slow anything down too much.

In [2]: import json

In [3]: def parse():
   ...:     with open("reviews_Clothing_Shoes_and_Jewelry_5.json") as fp:
   ...:         list(map(json.loads, fp))
   ...:

In [4]: %timeit parse()
1 loop, best of 3: 4.05 s per loop

In [18]: import ujson as json

In [19]: def parse():
   ....:     with open("reviews_Clothing_Shoes_and_Jewelry_5.json") as fp:
   ....:         list(map(json.loads, fp))
   ....:

In [20]: %timeit parse()
1 loop, best of 3: 1.43 s per loop

In [22]: %time _ = pd.read_json('reviews_Clothing_Shoes_and_Jewelry_5_comma.json', convert_dates=False)
CPU times: user 4.75 s, sys: 520 ms, total: 5.27 s
Wall time: 5.49 s

I'll try to get a patch together unless someone thinks there is a better solution. The notion would be to add a lines=True flag to the reader.

@aterrel aterrel self-assigned this Jun 2, 2016
@jreback jreback modified the milestones: 0.18.2, Next Major Release Jun 3, 2016
@aterrel
Contributor

aterrel commented Jun 3, 2016

@Karissa is there any difference between ndjson and jsonlines (http://jsonlines.org/)? I've never heard of ndjson, but it seems to be the same thing.

@okdistribute
Author

Looks like jsonlines includes a few more rules, including a specification of UTF-8 encoding. Compare http://jsonlines.org/ vs http://ndjson.org/.

@MtDersvan

MtDersvan commented Apr 3, 2017

Hi, is there a way to efficiently json_normalize newline-delimited JSON that was read with read_json?
For example, I have a big dataset of nested newline-delimited JSON, and I'd like to read it and flatten it at the same time, so the cells hold actual values instead of dicts.

Right now I just json.loads line by line and then normalize the whole list of dicts, but it takes a lot of time and is not elegant in any way.
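(A sketch of this line-by-line approach; the file name is hypothetical.)

# read the nested ndjson line by line with the stdlib json module,
# then flatten the resulting dicts with json_normalize
import json

import pandas as pd
from pandas.io.json import json_normalize

with open('data.ndjson') as fp:
    records = [json.loads(line) for line in fp]

flat = json_normalize(records)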

Thanks!

@jreback
Contributor

jreback commented Apr 3, 2017

you would have to show an example

@MtDersvan

Thanks for the quick reply!
Here is an example with toy data:
(screenshot of the toy data omitted)

I'd like to efficiently apply json_normalize to the data read with read_json.

@jreback
Contributor

jreback commented Apr 3, 2017

can you cat toy.csv here?

@MtDersvan

Here.
toy.csv.zip

@jreback
Contributor

jreback commented Apr 3, 2017

This won't be that efficient, but it's reasonably idiomatic.

In [18]: pd.read_json('toy.csv', lines=True).stack().apply(Series)['string'].unstack()
Out[18]: 
    age                  description price
0  None              Very Importrant   500
1     5                          NaN   150
2    15  Very Importrant Random Text  None

@jreback
Contributor

jreback commented Apr 3, 2017

I don't think there is an easy way to do this ATM. In pandas2 this will be more built-in.

cc @wesm

@wesm
Member

wesm commented Apr 3, 2017

It would be much more sustainable to do this kind of parsing with RapidJSON in C++; I think we should be able to deliver this en route to pandas2.

@MtDersvan

MtDersvan commented Apr 4, 2017

Just checked on my production dataset:

> This won't be that efficient, but it's reasonably idiomatic.

First, fortunately it actually gets the job done better than using json_normalize (no annoying *.string or *.nULL in the column names). So thanks a lot.

Second, unfortunately it is definitely less efficient than using json_normalize as I did previously. When I increased the amount of data to 10000 entries (still just a fraction of the data I need to process), it took ≈ 20x more time.

@jreback
Contributor

jreback commented Apr 4, 2017

@iFantastic yes, unfortunately our handling of newline-delimited JSON is not great right now; json_normalize and the solution I gave are all in Python (only the core JSON parser is in C), so these are not very efficient.

stay tuned to wesm/pandas2#71

@MtDersvan

@jreback thanks for your time, very helpful!

@MtDersvan

@jreback I have one more question: is it possible to achieve formatting like yours

In [18]: pd.read_json('toy.csv', lines=True).stack().apply(Series)['string'].unstack()
Out[18]:
    age                  description price
0  None              Very Importrant   500
1     5                          NaN   150
2    15  Very Importrant Random Text  None

but using json_normalize?

I mean, to get nicely formatted column names like age, description, price instead of age.string, description.nULL, description.string, price.string in the columns.

Thanks!

@jreback
Contributor

jreback commented Apr 4, 2017

@iFantastic you can just post-process the columns

In [12]: i = pd.Index(['price.string', 'description.nuLL', 'age.string'])

In [13]: i.str.extract('(.*)\\.', expand=False)
Out[13]: Index(['price', 'description', 'age'], dtype='object')
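Applied to a normalized frame, that might look like the following sketch (it assumes every column name contains a dot; names without one would extract to NaN):

# rename the columns by stripping everything from the last '.' onward
df.columns = df.columns.str.extract('(.*)\\.', expand=False)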

@MtDersvan

@jreback Thanks!

@KristianHolsheimer
Contributor

For ordinary ndjson files (like those output by BigQuery), I use the DataFrame.from_records constructor together with ujson:

import ujson as json
import pandas as pd

# map is lazy in Python 3, so consume it while the file is still open
with open('/path/to/records.ndjson') as fp:
    records = map(json.loads, fp)
    df = pd.DataFrame.from_records(records)

In Python 2, you'd probably want itertools.imap instead of the built-in map.

@jfthuong

FYI: ndjson is now supported out of the box with the argument lines=True (note that to_json also requires orient='records' when lines=True):

import pandas as pd

df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True)
