
ENH: Add JSON export option for DataFrame #631 #1226

Closed
wants to merge 114 commits
@Komnomnomnom

No description provided.

@Komnomnomnom Komnomnomnom ENH: Add JSON export option for DataFrame #631
Bundle custom ujson lib for DataFrame and Series JSON export & import.
cb7c6ae
@takluyver
Python for Data member

I don't think we should be bundling a JSON encoder. There's been a json module in the standard library since Python 2.6, and it's simple enough to install other implementations if the user needs e.g. more speed. Let's just have a little shim module that tries to import JSON APIs in order of preference.
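Something along these lines, for illustration (the module name and layout are hypothetical, not an existing pandas module):

    # json_shim.py -- hypothetical sketch of the shim idea:
    # try faster third-party implementations first, then fall
    # back to the stdlib json module (available since 2.6).
    try:
        import ujson as _json
    except ImportError:
        try:
            import simplejson as _json
        except ImportError:
            import json as _json

    dumps = _json.dumps
    loads = _json.loads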

@Komnomnomnom

@takluyver there's a bit of a discussion already at #631; not sure if you're aware of it. I should have added more info in the description though, sorry. The main motivation for including this fork of ujson in pandas is that it works with pandas datatypes at a very low level (it is pure C), so it wouldn't be of any benefit to non-pandas users. If a user wants to use their own favourite JSON decoder they are of course still free to do so.

However, I'll admit that high-performance JSON serialisation is probably a minor requirement for most people, so I'm happy either way.

@takluyver
Python for Data member

Thanks, I wasn't aware of that. I'm still not wild about the approach -- it seems like it will make for a heavier library and a bigger codebase to maintain. But Wes seems to be happy with the idea, so you don't have to worry about my objections ;-)

A couple of practical questions:

Your README has a lot of benchmarks, but I haven't taken the time to work out what they all mean. Can you summarise: what sort of improvement do we see from forking ujson, versus the best we could do with a stock build?

What sort of workloads do we envisage - is the bottleneck when you have one huge dataframe, or thousands of smaller ones?

Assuming ujson is still actively developed, how important and how easy will it be to get updates from upstream in the future?

@Komnomnomnom

When working with numpy types:

  • encoding : no real advantage, sometimes even a disadvantage, since the numpy-to-list conversion is very efficient.
  • decoding : about 1.5x to 2x the speed (when working with numeric types).

DataFrames:

  • encoding : depending on the desired format and the nature (shape, size) of the input, a speedup of about 2x to 10x, although there are cases where it's about 20x (e.g. 200x4 zeros).
  • decoding : again depending on the encoded format, a speedup of about 2x to 3x is typical, but it can be up to 20x.
  • for time series data, encoding & decoding are usually better than, or on a par with, encoding the corresponding basic Python type (i.e. a dict). For time series data with datetime indices I'm seeing about a 7x speedup for encoding DataFrames and about 3x for decoding. In the best case, where a transpose would otherwise be necessary, the speedup is about 15x to 20x.

And this is on top of ujson already being one of the speediest JSON libraries.
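If you want to sanity-check the numbers on your own machine, a rough timing comparison looks something like this (a sketch only: it pits the to_json method added in this branch against a stdlib json baseline, and the exact figures will of course vary with shape and dtype):

    import json, timeit
    import numpy as np
    from pandas import DataFrame

    # the 200x4 zeros case mentioned above; string labels keep the
    # stdlib encoder happy with the dict keys
    df = DataFrame(np.zeros((200, 4)), columns=list('abcd'),
                   index=map(str, range(200)))

    # baseline: convert to nested dicts, then encode with stdlib json
    stdlib = timeit.timeit(lambda: json.dumps(df.to_dict()), number=1000)

    # this branch: encode straight from the numpy data in C
    bundled = timeit.timeit(lambda: df.to_json(), number=1000)

    print 'stdlib %.3fs, bundled %.3fs, speedup %.1fx' % (
        stdlib, bundled, stdlib / bundled)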

My specific use case is the need to share lots of DataFrames, of mixed sizes, between Python processes (and other languages). JSON was the natural choice for us because of portability, and we wanted to get the best performance out of it.

ujson is a relatively small and stable library. There have only been a few minor patches in the last few months, and the author seems pretty open to pull requests etc. I'll be merging any applicable upstream changes into my fork, and I'd be happy to do the same for pandas if it ends up being integrated. I'm pretty familiar with the ujson code now (it's really only four files) and I'd likewise be happy to deal with any bugs / enhancements coming from pandas usage too.

It's worth noting that the library is split into two parts: the language-agnostic JSON encoder / decoder, and the Python bindings. I managed to keep the bulk of my changes limited to the Python bindings, and even then they are new functions / new code rather than changes to existing functions. My point being: upstream changes should be easy enough to merge.

@takluyver
Python for Data member

Thanks, that all sounds pretty reasonable, and I'm satisfied that this is worth doing.

@wesm
Python for Data member
wesm commented May 12, 2012

This is really excellent work, thanks so much for doing this. Yeah, I was initially a bit hesitant to bundle ujson, but given that more and more people want to do JS<->pandas integration, getting the best possible encoding/decoding performance and being able to access the NumPy arrays directly in the C encoder makes a lot of sense. We'll have to periodically pull in upstream changes from ujson, I guess.

@vgoklani

just curious, how would this handle nested JSON? i.e.

j = {'person' : {'first_name' : 'Albert', 'last_name' : 'Einstein', 'occupation': {'job_title': 'Theoretical Physicist', 'institution' : 'Princeton University', 'accomplishments':['Brownian motion', 'Special Relativity', 'General Relativity']}}}

df = pandas.DataFrame(j)

df = ?

@Komnomnomnom

From a performance standpoint, not very well I'm afraid: the numpy-with-labels handling bombs out if it detects more than two levels of nesting. It could probably be tweaked to deal with this better, but when decoding complex types (i.e. objects and strings) a Python list is needed as an intermediary anyway, so I'm not sure there'd be any advantage.

The good news is that the DataFrame and Series methods fall back to standard decoding if the numpy version fails, so it should still work as expected, albeit without the performance improvements.
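The fallback is roughly this shape (paraphrased for illustration only; the function name, keyword and exception type here are assumptions rather than the branch's literal code):

    import ujson  # stands in for the bundled encoder/decoder

    def frame_from_json(frame_cls, json_str):
        try:
            # fast path: decode straight into numpy arrays in C
            return frame_cls(ujson.decode(json_str, numpy=True))
        except ValueError:
            # nested / object-heavy input: plain decode instead
            return frame_cls(ujson.decode(json_str))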

Just tested it out to make sure:

In [1]: from pandas import DataFrame
In [2]: j = {'person' : {'first_name' : 'Albert', 'last_name' : 'Einstein', 'occupation': {'job_title': 'Theoretical Physicist', 'institution' : 'Princeton University', 'accomplishments':['Brownian motion', 'Special Relativity', 'General Relativity']}}}

In [3]: df = DataFrame(j)

In [4]: df
Out[4]: 
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, first_name to occupation
Data columns:
person    3  non-null values
dtypes: object(1)

In [5]: df['person']['occupation']
Out[5]: 
{'accomplishments': ['Brownian motion',
  'Special Relativity',
  'General Relativity'],
 'institution': 'Princeton University',
 'job_title': 'Theoretical Physicist'}

In [6]: df.to_json()
Out[6]: '{"person":{"first_name":"Albert","last_name":"Einstein","occupation":{"accomplishments":["Brownian motion","Special Relativity","General Relativity"],"institution":"Princeton University","job_title":"Theoretical Physicist"}}}'

In [7]: json = df.to_json()

In [8]: DataFrame.from_json(json)
Out[8]: 
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, first_name to occupation
Data columns:
person    3  non-null values
dtypes: object(1)

In [9]: DataFrame.from_json(json)['person']['occupation']
Out[9]: 
{u'accomplishments': [u'Brownian motion',
  u'Special Relativity',
  u'General Relativity'],
 u'institution': u'Princeton University',
 u'job_title': u'Theoretical Physicist'}

Edit: I should have mentioned that the comments above relate to decoding only. Encoding does not suffer the same issues, and the performance improvements still apply.

@wesm
Python for Data member
wesm commented May 19, 2012

Hey @Komnomnomnom I started to see if I can merge this and am getting a segfault on my system (Python 2.7.2, NumPy 1.6.1, 64-bit Ubuntu).

The object returned by series.to_json(orient='columns') in _check_orient(series, "columns", dtype=dtype) from test_series.py, line 341, appears to be NULL (the gdb backtrace showed the segfault in from_json, but the data returned by to_json is malformed):

test_from_json_to_json (__main__.TestSeries) ... > /home/wesm/code/pandas/pandas/tests/test_series.py(324)_check_orient()
-> foo
(Pdb) type(series.to_json(orient=orient))
Segmentation fault

I can probably track down the problem, but since you wrote the C code I figure you'd be better placed to, if you can reproduce the error.

@Komnomnomnom

Hi Wes, I just tried with my local clone of my fork and had no segmentation fault (all tests passed when I made my commit / pull request). I'll merge in the latest from pandas master and see what happens.

For the record I'm using Python 2.7.2 and numpy 1.6.1 on 64-bit OSX.

@wesm
Python for Data member
wesm commented May 19, 2012

I put in print statements

  printf("%s\n", ret);

  printf("length: %d\n", strlen(ret));

and here's the output

{"2F4SMHsw4I":-1.4303216796,"nMi4KBCmg7":-1.32552412,"Molf5Ue3kF":-1.2705465829,"9kkHHlfXPA":-0.8877964843,"6E3ma1UHv7":-0.850191537,"2F5JdoFIqQ":-0.8013936673,"VzJclGGLsr":-0.7985248155,"cI4bkkV9MH":-0.7000873004,"TxS6mJ8UuP":-0.6864885751,"2jGSZe0rmF":-0.6708315768,"oHooxHeHqu":-0.6482430589,"HuqOm1mf57":-0.624890804,"bEWcPipOk9":-0.5669391204,"zpy7FQCGgp":-0.3383151716,"nYIL8VPVT3":-0.2663003599,"x0YmXOvJ49":-0.1767082308,"bJm3Pbjx14":-0.1510545428,"E51nrgW9Yt":0.0101299091,"QycwIANnTx":0.1575097137,"8wVdQ8RIdQ":0.2073634038,"90c5KPKyeS":0.2539122603,"eERFnAAd8k":0.3728367,"tZLEG6seKV":0.4332938883,"ehdTUcPK7A":0.457039038,"biYpVDeFiz":0.5021518808,"JlVXVA62Zz":0.5918523437,"2UTfjHGMEy":0.6413052158,"5VOyIV1TYs":0.6828158342,"WyNfVlEOK3":1.1809723971,"YrW1NS7fCX":1.3862224711}
length: 790

pandas/src/ujson/python/objToJSON.c: MARK(1490)
Segmentation fault

Somehow the result of PyString_FromString is malformed; it seems like maybe ret is not null-terminated? I suspect this is a red herring, though.
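One way to take the null-termination theory out of the equation would be to build the Python string with an explicit length (a debugging sketch, not code from the branch):

    /* Debugging sketch: bypass the NUL-terminator assumption in
       PyString_FromString. Note strlen itself assumes termination,
       so a bogus length here would also implicate the buffer rather
       than the string conversion. */
    size_t len = strlen(ret);
    printf("length: %zu\n", len);
    PyObject *result = PyString_FromStringAndSize(ret, (Py_ssize_t) len);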

@wesm
Python for Data member
wesm commented May 19, 2012

It looks like something is getting corrupted:

14:09 ~/code/pandas  (json-export)$ python pandas/tests/test_ujson.py 
nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
testArrayNumpyExcept (__main__.NumpyJSONTests) ... ok
testArrayNumpyLabelled (__main__.NumpyJSONTests) ... ok
testArrays (__main__.NumpyJSONTests) ... ok
testBool (__main__.NumpyJSONTests) ... ok
testBoolArray (__main__.NumpyJSONTests) ... ok
testFloat (__main__.NumpyJSONTests) ... ok
testFloatArray (__main__.NumpyJSONTests) ... ok
testFloatMax (__main__.NumpyJSONTests) ... ok
testInt (__main__.NumpyJSONTests) ... ok
testIntArray (__main__.NumpyJSONTests) ... ok
testIntMax (__main__.NumpyJSONTests) ... ok
testDataFrame (__main__.PandasJSONTests) ... > /home/wesm/code/pandas/pandas/tests/test_ujson.py(943)testDataFrame()
-> foo
(Pdb) u
> /home/wesm/epd/lib/python2.7/unittest/case.py(327)run()
-> testMethod()
(Pdb) d
> /home/wesm/code/pandas/pandas/tests/test_ujson.py(943)testDataFrame()
-> foo
(Pdb) l
938     class PandasJSONTests(TestCase):
939     
940         def testDataFrame(self):
941             df = DataFrame([[1,2,3], [4,5,6]], index=['a', 'b'], columns=['x', 'y', 'z'])
942     
943  ->         foo
944             # column indexed
945             outp = DataFrame(ujson.decode(ujson.encode(df)))
946             self.assertTrue((df == outp).values.all())
947             assert_array_equal(df.columns, outp.columns)
948             assert_array_equal(df.index, outp.index)
(Pdb) ujson.encode(df)
'{"x":{"a":1,"b":4},"y":{"a":2,"b":5},"z":{"a":3,"b":6}}'
(Pdb) print df
Segmentation fault
@wesm
Python for Data member
wesm commented May 19, 2012

It looks like the culprit must be NpyArr_encodeLabels. I'm not enough of a C guru to see what might be going wrong -- everything works here except encoding Series/DataFrame, and inside there is plenty of twiddling of bytes. Let me know if you manage to figure it out =/

@Komnomnomnom

Hmm, I've merged in the latest from pandas master; I'm seeing some failed tests, but still no segmentation faults, no corruption, and those print statements work fine. I'm going to try an Ubuntu VM and see if I can get to the bottom of it.

wesm and others added some commits May 10, 2012
@wesm wesm REF: working toward #1150, broke apart Cython module into generated _algos extension 3af585e
@wesm wesm REF: have got things mostly working for #1150 11f2c0d
@wesm wesm BUG: more bug fixes, have to fix intraday frequencies still e9dee69
@wesm wesm BUG: more intraday unit fixes 69d0baa
@wesm wesm BUG: test suite passes, though negative ordinals broken 5485c2d
@wesm wesm BUG: weekly and business daily unit support #1150 879779d
@wesm wesm REF: remove period multipliers, close #1199 85fcd69
@mwiebe mwiebe Remove dependencies on details of experimental numpy datetime64 ABI
Pandas was using some of the enums and structures exposed by its headers.
By creating its own local copies of these, it is possible to allow the
numpy ABI to be improved while in its experimental state.
b457ff8
@wesm wesm ENH: move _ensure_{dtype} functions to Cython for speedup, close #1221 075f05e
@wesm wesm DOC: doc fixes ee73df1
@wesm wesm ENH: handle dict return values and vbench, close #823 9e88e0c
@wesm wesm ENH: add is_full method to PeriodIndex close #1114 a31ed38
@adamklein adamklein ENH: #1020 implementation. needs tests and adding to API b98e4e0
@mwiebe mwiebe Use datetime64 with a 'us' unit explicitly, for 1.6 and 1.7 compatibility 3d83387
@mwiebe mwiebe Use an explicit unit for the 1.7 datetime64 scalar constructor c53e093
@mwiebe mwiebe Use assert_equal instead of assert, to see the actual values 89bd898
@mwiebe mwiebe Microseconds (us) not milliseconds (ms) 4e6720f
@wesm wesm TST: use NaT value a7bccd8
@wesm wesm ENH: add docs and add match function to API, close #502 1ecb5c4
@wesm wesm ENH: add Cython nth/last functions, vbenchmarks. close #1043 4ac9abb
@wesm wesm BUG: fix improper quarter parsing for frequencies other than Q-DEC, close #1228 b246ae1
@wesm wesm BUG: implement Series.repeat to get expected results, close #1229 4d052f9
@wesm wesm ENH: anchor resampling frequencies like 5minute that evenly subdivide one day in resampling to always get regular intervals. a bit more testing needed, but close #1165 74a6be0
Kelsey Jordahl ENH: Allow different number of rows & columns in a histogram plot 0cf9e3d
@wesm wesm BUG: support resampling of period data to, e.g. 5minute though with timestamped result, close #1231 e043862
@wesm wesm BUG: remove restriction in lib.Reducer that index by object dtype. close #1214 996b964
@wesm wesm TST: vbenchmark for #561, push more work til 0.9 7baa84c
@wesm wesm BUG: don't print exception in reducer 8b972a1
@wesm wesm BUG: rogue foo 93b5221
@wesm wesm ENH: reimplement groupby_indices using better algorithmic tricks, associated vbenchmark. close #609 eb460c0
@wesm wesm BLD: fix npy_* -> pandas_*, compiler warnings 197a7f6
@wesm wesm TST: remove one skip test aca4c43
@wesm wesm ENH: store pytz time zones as zone strings in HDFStore, close #1232 c1260e3
@ruidc ruidc treat XLRD.XL_CELL_ERROR as NaN 8d27185
@ruidc ruidc replace tabs with spaces 1e6aea5
Chang She ENH: convert multiple text file columns to a single date column #1186 349bccb
Chang She Stop storing class reference in HDFStore #1235 4c32ab8
Chang She removed extraneous IntIndex instance test e057ad5
@wesm wesm BUG: fix rebase conflict from #1236 0cdfe75
@wesm wesm RLS: release note 63952a8
Chang She Merged extra keyword with parse_dates 52492dd
Chang She TST: VB for multiple date columns 9c01e77
Chang She A few related bug fixes 1febe66
@wesm wesm TST: test with headers 3fdf18a
@lbeltrame lbeltrame ENH: Add support for converting DataFrames to R data.frames and matrices, close #350 c9af5c5
@lbeltrame lbeltrame BUG: Properly handle the case of matrices d17f1d5
Chang She ENH: maybe upcast masked arrays passed to DataFrame constructor a89e7b9
@wesm wesm RLS: release notes ea7f4e1
@wesm wesm ENH: optimize join/merge on integer keys, close #682 4c1eb1b
@wesm wesm RLS: release notes for #1081 8572d54
@wesm wesm ENH: efficiently box datetime64 -> Timestamp inside Series.__getitem__. close #1058 8ecb31b
@wesm wesm BLD: add modified numpy Cython header 4b56332
@wesm wesm BLD: fix datetime.pxd d2b947b
@wesm wesm ENH: can pass multiple columns to GroupBy.__getitem__, close #383 67a98ff
@tkf tkf ENH: treat complex number in maybe_convert_objects 48a073a
@tkf tkf ENH: treat complex number in maybe_convert_objects a3e538f
@wesm wesm ENH: accept list of tuples, preserving function order in SeriesGroupBy.aggregate 2e9de0e
@wesm wesm ENH: more flexible multiple function application in DataFrameGroupBy, close #642 92d050b
@wesm wesm DOC: release notes b07f097
@tkf tkf TST: Add complex number in test_constructor_scalar_inference ca6558c
@tkf tkf ENH: treat complex number in internals.form_blocks 3f3b900
@tkf tkf ENH: add internals.ComplexBlock dc43a1e
@tkf tkf BUG: fix max recursion error in test_reindex_items
It looks like sorting by dtype itself does not work.
To see that, try this snippet:

>>> from numpy import dtype
>>> sorted([dtype('bool'), dtype('float64'), dtype('complex64'),
...         dtype('float64'), dtype('object')])
[dtype('bool'),
 dtype('float64'),
 dtype('complex64'),
 dtype('float64'),
 dtype('object')]
c280d22
@wesm wesm BLD: fix platform int issues a7698da
@wesm wesm TST: verify consistently set group name, close #184 0782990
@wesm wesm ENH: don't populate hash table in index engine if > 1e6 elements, to save memory and speed. close #1160 d66ac45
@wesm wesm ENH: support different 'bases' when resampling regular intervals like 5 minute, close #1119 be5b5a4
Chang She VB: more convenience auto-updates 8d581c8
Chang She VB: get from and to email addresses from config file 6e09dda
Chang She VB: removing cruft; getting config from user folders 31fefba
@wesm wesm BUG: floor division for Python 3 d5b6b93
Chang She DOC: function for auto docs build e275d76
Chang She DOC: removed lingering sourceforge references 18d9a13
Chang She DOC: removed lingering timeRule keyword use 545e917
@wesm wesm ENH: very basic ordered_merge with forward filling, not with multiple groups yet 40d9a3b
@wesm wesm ENH: add group-wise merge capability to ordered_merge, unit tests, close #813 69229e7
@wesm wesm BUG: ensure_platform_int actually makes lots of copies 9e2142b
@wesm wesm RLS: release notes, close #1239 5891ad5
@wesm wesm BLD: 32-bit compat fixes per #1242 42d1c90
@wesm wesm ENH: add keys() method to DataFrame, close #1240 f1c6c89
@wesm wesm DOC: release notes 6e8bbed
Chang She TST: test cases for replace method. #929 e50c7d8
Chang She ENH: Series.replace #929 b0e13c1
Chang She ENH: DataFrame.replace and cython replace. Only works for floats and ints. Need to generate datetime64 and object versions. b7546b2
Chang She ENH: finishing up DataFrame.replace need to revisit 45773c9
Chang She removed bottleneck calls from replace 2f5319d
Chang She moved mask_missing to common 245c126
Chang She TST: extra test case for Series.replace 35220b4
Chang She removed remaining references to replace code generation 40a0cb1
@wesm wesm DOC: release note re: #929 76355d0
@invisibleroads invisibleroads Removed erroneous reference to iterating over a Series, which iterates over values and not keys 927d370
Chang She TST: rephrased .keys call for py3compat 49ad7e2
@invisibleroads invisibleroads Fixed a few typos b60c0d3
@wesm wesm REF: microsecond -> nanosecond migration, most of the way there #1238 d4407a9
@wesm wesm BUG: more nano fixes 4f15d54
Chang She DOC: put back doc regarding inplace in rename in anticipation of feature 421f5d3
Chang She DOC: reworded description for MultiIndex 181f945
Chang She DOC: started on timeseries.rst for 0.8 fb1e662
@wesm wesm REF: more nanosecond support fixes, test suite passes #1238 9bc3814
@wesm wesm ENH: more nanosecond support #1238 b026566
@orbitfold orbitfold Changes to plotting scatter matrix diagonals c360391
@orbitfold orbitfold Changed xtick, ytick labels cf74512
@orbitfold orbitfold Added simple test cases d7d6a0f
@orbitfold orbitfold Updated plotting.py scatter_matrix docstring to describe all the parameters cd8222c
@orbitfold orbitfold Added scatter_matrix examples to visualization.rst 8e2f3f9
@wesm wesm DOC: release notes da1b234
Chang She BUG: DataFrame.drop_duplicates with NA values a6e32b8
Chang She use fast zip with a placeholder value just for np.nan 2a6fc11
Chang She TST: vbench for drop_duplicate with skipna set to False d95a254
Chang She optimized a little bit for speed 7953ae8
Chang She ENH: inplace option to DataFrame.drop_duplicates #805 with vbench 916be1d
@tkf tkf BUG: replace complex64 with complex128
As mentioned in #1098.
ba6a9c8
@wesm wesm ENH: add KDE plot from #1059 1cacb6c
@Komnomnomnom

Ugh, I did not know merging into my fork would flood this pull request. It might be best to delete my current fork and submit a new pull request once this issue is sorted.

The good news is that, after a bit of setup, I was able to reproduce the memory corruption you are seeing in my Ubuntu VM. It appears to happen even when NpyArr_encodeLabels is not involved. There is also some weirdness with timestamp conversion, but I think that is a separate issue.

@Komnomnomnom

I believe I've found the problem: the reference count of the object being encoded was mistakenly being decremented twice. I presume it was just chance that the memory layout or garbage-collection timing on my laptop meant the object wasn't actually being freed.
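In CPython terms it was the classic double-decref pattern, something like this (an illustration of the bug class, not the literal objToJSON.c code):

    PyObject *values = PyObject_GetAttrString(obj, "values"); /* new ref */
    /* ... encode from the underlying array ... */
    Py_DECREF(values);  /* correct: release our reference */
    Py_DECREF(values);  /* BUG: frees the object while it is still in
                           use; whether this crashes immediately depends
                           on allocator and GC timing, hence the
                           difference between Linux and OSX */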

There are a few more things I've noticed (like build clean deleting the C files, and datetime conversion now not working) which I'll fix before submitting a new pull request. I'll close this one for now and create a feature branch on a new fork to avoid this mess happening again.

@wesm
Python for Data member
wesm commented May 20, 2012

That will teach you not to develop in master ;) BTW, you don't need to re-fork -- you can git reset --hard upstream/master and force-push that to GitHub. Just make sure you make a branch of your current master with the JSON work first.
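i.e. something along these lines (branch and remote names assumed):

    git checkout -b json-export        # keep the JSON work on a branch
    git checkout master
    git fetch upstream                 # assumes 'upstream' -> pydata/pandas
    git reset --hard upstream/master
    git push --force origin master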

@Komnomnomnom

Oops, too late, I re-forked a few minutes ago... hope this doesn't cause further problems... :-/

BTW, if you want to test the fix on your machine, the offending line was 278 in NpyArr_iterEnd:
cb7c6ae#L6R279
(That line should be removed.)

Also, I'm still noticing some timestamp weirdness. I'm guessing there were recent changes in master regarding datetime64? Is this work still ongoing?

@wesm
Python for Data member
wesm commented May 20, 2012

Yes, the work is still ongoing. Are the test failures in JSON encoding/decoding or elsewhere (the pydata/master test suite passes cleanly for me)? I should be able to fix them myself.

@jreback
jreback commented Jun 11, 2013

implemented via #3804
