to_json raises 'Unhandled numpy dtype 15' with complex values. #12554

Closed
gregory-marton opened this issue Mar 7, 2016 · 34 comments
Labels
Complex (Complex Numbers) · Dtype Conversions (Unexpected or buggy dtype conversions) · IO JSON (read_json, to_json, json_normalize) · Needs Discussion (Requires discussion from core team before further action)
Comments

@gregory-marton

Code Sample, a copy-pastable example if possible

import pandas
df = pandas.DataFrame({'a': [complex(3, -1)]})
df.to_json()

Expected Output

'{"a":{"0":3-1j}}'

Actual Output

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-b53888fd1707> in <module>()
----> 1 df.to_json()

.../lib/python2.7/site-packages/pandas/core/generic.pyc in to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler)
    892             force_ascii=force_ascii,
    893             date_unit=date_unit,
--> 894             default_handler=default_handler)
    895 
    896     def to_hdf(self, path_or_buf, key, **kwargs):

.../lib/python2.7/site-packages/pandas/io/json.pyc in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler)
     33             obj, orient=orient, date_format=date_format,
     34             double_precision=double_precision, ensure_ascii=force_ascii,
---> 35             date_unit=date_unit, default_handler=default_handler).write()
     36     else:
     37         raise NotImplementedError("'obj' should be a Series or a DataFrame")

.../lib/python2.7/site-packages/pandas/io/json.pyc in write(self)
     76             date_unit=self.date_unit,
     77             iso_dates=self.date_format == 'iso',
---> 78             default_handler=self.default_handler)
     79 
     80 

RuntimeError: Unhandled numpy dtype 15

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.0.3
setuptools: 19.7
Cython: 0.23.4
numpy: 1.8.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.3.3
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
Jinja2: None
@shoyer (Member) commented Mar 7, 2016

I think the fundamental problem here is that JSON has no native way to represent complex values. I suppose we could encode complex values in some way here (e.g., by converting to strings or dictionaries with keys real/imaginary), but I'm not sure how useful that would be.
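For concreteness, here is a minimal sketch (plain Python, not pandas API) of those two encodings; the 're'/'im' key names are only an assumption for illustration:

import json

z = complex(3, -1)

json.dumps({"a": {"0": str(z)}})
# '{"a": {"0": "(3-1j)"}}'

json.dumps({"a": {"0": {"re": z.real, "im": z.imag}}})
# '{"a": {"0": {"re": 3.0, "im": -1.0}}}'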

@jreback (Contributor) commented Mar 7, 2016

This is not tested, and AFAIK not in the JSON standard. What would the JSON output look like for this?

jreback added the IO JSON, Complex, and Needs Discussion labels Mar 7, 2016
@gregory-marton (Author)

Given that JSON is, after all, JavaScript Object Notation, one choice might be to implicitly rely on a JavaScript library that handles complex numbers. Three candidates are math.js, numbers.js, and numericjs.

math.js and numbers.js are Apache-licensed; numericjs is MIT-licensed.

numbers.js depends on node.js. I did not find a list of dependencies for math.js. I did find a statement that numericjs has no dependencies.

math.js names complex values as Complex objects, but with a lowercase math.complex as a factory (documentation):

var a = math.complex(2, 3);     // Complex 2 + 3i
a.re;                           // Number 2
a.im;                           // Number 3

var b = math.complex('4 - 2i'); // Complex 4 - 2i
b.re = 5;                       // Number 5
b;                              // Complex 5 - 2i

numbers.js names complex values as Complex objects (test file):

    var A = new Complex(3, 4);

The documentation is sparse.

numericjs names complex values as x,y coordinates (documentation):

IN> z = new numeric.T(3,4);
OUT> {x: 3, y: 4}

On cursory examination of the repos, math.js seems the most mature.

So my proposal would be to actually expect the output:
'{"a":{"0":math.complex(3, -1)}}'

Once math.js has been imported, that JSON becomes interpretable in JavaScript, but that's a side concern for pandas.

It also seems relatively clear for human consumption, and relatively easy to implement in Python deserialization. The major downside I see is that any naive interpreter (e.g. a JSON validator) will fail.

Given that there is a downside, perhaps this could be implemented as an option to to_json, with a default of turning these values into strings, perhaps with a warning? E.g.

'{"a":{"0":"math.complex(3, -1)"}}'
Warning: Serializing numeric complex value as a string. Consider using the complex=True option.

or

'{"a":{"0":"3-1j"}}'

The first of those would make it easier to DTRT (do the right thing) and robustly deserialize back into a complex value anyway.
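As a rough sketch of that claim, a reader-side parser for the tagged string form might look like this (the regex and helper name are illustrative assumptions, not a proposed spec):

import re

TAGGED = re.compile(r"^math\.complex\((-?[\d.]+),\s*(-?[\d.]+)\)$")

def maybe_complex(s):
    # Strings like 'math.complex(3, -1)' become complex; anything else passes through.
    m = TAGGED.match(s)
    return complex(float(m.group(1)), float(m.group(2))) if m else s

maybe_complex("math.complex(3, -1)")  # (3-1j)
maybe_complex("3-1j")                 # left alone as a plain string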

@gregory-marton (Author)

(grumble... why does github put "Close and comment" next to "Comment", instead of e.g. putting "Preview" there, and putting "Close and Comment" on the other side or whatever?) Sorry.

@jreback (Contributor) commented Mar 7, 2016

@gregory-marton I think you are confusing the JSON spec and JavaScript. These are not the same thing. JSON is a platform-independent way of representing things. See the spec here.

@jreback (Contributor) commented Mar 7, 2016

see #12213 for some related commentary.

@gregory-marton (Author)

@jreback not confusing; intentionally confounding as a way of getting to a readable and sensible standard proposal.

@gregory-marton (Author)

In particular, especially in the case of stringifying the value, notating it as "math.complex(3, -1)" makes the intent very clear, and confusion with something else (e.g. a version string) unlikely. I look to JavaScript on a historical basis, and because using an existing library's notation might aid buy-in.

@shoyer (Member) commented Mar 7, 2016

@gregory-marton Even in JavaScript, it's a bad idea to use eval rather than JSON.parse. No matter how we handle complex values, DataFrame.to_json needs to always generate valid JSON (or error).

@shoyer (Member) commented Mar 7, 2016

@gregory-marton Now that I read your reply more carefully, I see that you do propose using strings. That seems fine to me. I would definitely prefer "3 - 1j" to math.complex(3, -1) because the former can be more easily parsed in different languages (e.g., in Python).
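As a quick check on that claim, Python's built-in complex() parses the compact form directly, though (relevant to the whitespace point below) it rejects embedded spaces:

complex("3-1j")    # (3-1j)
complex("(3-1j)")  # the parenthesized form produced by str() also works
try:
    complex("3 - 1j")
except ValueError:
    pass  # complex() rejects the embedded spaces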

@gregory-marton (Author)

@shoyer fair enough. Non-string proposal withdrawn.

@gregory-marton (Author)

@shoyer The downside I see with using just '3-1j' or '3 - 1j' is that those are not crazy strings to have in other contexts. Consider a product code or version string. That gets worse when you allow '3.2-1.j' etc. It would be safer to automatically make the decision to parse "math.complex(3, -1)" as a complex value in Python than it would be to parse "3-1j" that way.

As for requiring spaces, I know that python is the language of significant whitespace, but that's not a sort of whitespace that I would imagine standing out to a casual reader as required.

@jreback (Contributor) commented Mar 7, 2016

@gregory-marton Well, that's the issue with JSON: it's type-less and you have to infer things. You could look at whether there are already any standards for representing complex numbers (for example, dates have two competing representations: an integer epoch and an ISO string).

@gregory-marton (Author)

@jreback Good point! I should have looked. math.js's complex-number documentation has a JSON section. They encode complex values as

{mathjs: 'Complex', re: number, im: number}

That certainly seems like an option, as dictionaries are not allowed in DataFrames (right?)

So then the desired output would be

'{"a":{"0":{"mathjs":"Complex","re":3,"im":-1}}}'

@jreback (Contributor) commented Mar 7, 2016

In [6]: df = pandas.DataFrame({'a': [complex(3, -1)]})

In [11]: df['a'] = df['a'].apply(lambda x: {'mathjs' : 'Complex', 're' : x.real, 'im' : x.imag})

In [12]: df.to_json()
Out[12]: '{"a":{"0":{"mathjs":"Complex","im":-1.0,"re":3.0}}}'
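To complete the round trip under that ad-hoc encoding, the dicts come back from read_json as plain Python dicts and can be converted by hand (a sketch, not a pandas API):

import pandas
df2 = pandas.read_json('{"a":{"0":{"mathjs":"Complex","im":-1.0,"re":3.0}}}')
df2['a'] = df2['a'].apply(
    lambda v: complex(v['re'], v['im'])
    if isinstance(v, dict) and v.get('mathjs') == 'Complex' else v)
df2['a'][0]  # (3-1j)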

@gregory-marton (Author)

@jreback well, there goes that idea, then. Other suggestions?

Or is that a special case pandas would be willing to live with?

@jreback (Contributor) commented Mar 7, 2016

Well, you can certainly do this.

I think the default should be to stringify (like we do for everything else that cannot be handled). The issue here is that .to_json() knows it's a numpy array but doesn't know what to do with it. E.g., it's sensible to do:

In [13]: str(complex(3, -1))
Out[13]: '(3-1j)'

for each element.
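A by-hand approximation of that stringify-by-default behavior, as a minimal sketch (applymap(str) stringifies each cell before serializing):

import pandas
df = pandas.DataFrame({'a': [complex(3, -1)]})
df.applymap(str).to_json()
# '{"a":{"0":"(3-1j)"}}'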

jreback added the Dtype Conversions label Mar 7, 2016
@gregory-marton (Author)

So for json.dumps, I can provide cls=... to specify how I want objects converted if there is no default. That's analogous to the default_handler option to .to_json.

So I'm trying to write an encoder that will DTRT without my having to apply the above transformation separately. My first guess is:

import json
class MyEncoder(json.JSONEncoder):
  # disable method-hidden because https://github.com/PyCQA/pylint/issues/414
  def default(self, obj): # pylint: disable=method-hidden
    try:
      if hasattr(obj, 'to_json'):
        if 'default_handler' in obj.to_json.__code__.co_varnames:
          return obj.to_json(default_handler=MyEncoder)
        else:
          return obj.to_json()
      if hasattr(obj, 'toJSON'):
        return obj.toJSON()
      if hasattr(obj,'to_dict'):
        return obj.to_dict()
      if isinstance(obj, complex):
        return {'mathjs' : 'Complex', 're' : obj.real, 'im' : obj.imag}
        # See discussion at https://github.com/pydata/pandas/issues/12554
      return json.JSONEncoder.default(self, obj)
    except Exception as e:
      return repr(e)

from pandas import DataFrame
df = DataFrame({'a': [1, 2.3, complex(4, -5)],
                'b': [float('nan'), None, 'N/A']})

assert ('{"a":{"0":1.0,"1":2.3,"2":{"mathjs":"Complex","re":4,"im":-5}},'
        '"b":{"0":null,"1":null,"2":"N\/A"}}') == df.to_json(default_handler=MyEncoder)

But I still get

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-17-5adab7ea3faf> in <module>()
      1 assert ('{"a":{"0":1.0,"1":2.3,"2":{"mathjs":"Complex","re":4,"im":-5}},'
----> 2         '"b":{"0":null,"1":null,"2":"N\/A"}}') == df.to_json(default_handler=MyEncoder)

.../lib/python2.7/site-packages/pandas/core/generic.pyc in to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler)
    892             force_ascii=force_ascii,
    893             date_unit=date_unit,
--> 894             default_handler=default_handler)
    895 
    896     def to_hdf(self, path_or_buf, key, **kwargs):

.../lib/python2.7/site-packages/pandas/io/json.pyc in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler)
     33             obj, orient=orient, date_format=date_format,
     34             double_precision=double_precision, ensure_ascii=force_ascii,
---> 35             date_unit=date_unit, default_handler=default_handler).write()
     36     else:
     37         raise NotImplementedError("'obj' should be a Series or a DataFrame")

.../lib/python2.7/site-packages/pandas/io/json.pyc in write(self)
     76             date_unit=self.date_unit,
     77             iso_dates=self.date_format == 'iso',
---> 78             default_handler=self.default_handler)
     79 
     80 

RuntimeError: Unhandled numpy dtype 15

Thoughts on next steps? I understand applying the function manually is a workaround. But it would be nice if default_handler could apply when a numpy dtype happens to be unhandled.
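Another untested guess at a workaround (the assumption being that object-dtype columns go down the PyObject path, where the default handler is consulted):

import pandas
df = pandas.DataFrame({'a': [complex(3, -1)]})
df.astype(object).to_json(default_handler=str)  # untested guess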

But also, I'm somewhat new to this, and may not have gotten my guess right.
Thanks in advance!

@jreback (Contributor) commented Mar 8, 2016

This is a bit tricky. The problem is that since this is a numpy dtype, it goes down a certain path (in the C code), then can't figure out what to do and raises an error (and the default_handler is not called). So that's a bug; when fixed, it will allow the default_handler to succeed.

@gregory-marton (Author)

Should I file that bug separately, or will this one serve?

@gregory-marton (Author)

Perhaps I should file that with numpy, rather than pandas?

@jreback (Contributor) commented Mar 8, 2016

This has nothing to do with numpy; the code is here.

@gregory-marton (Author)

Perhaps it should not raise the RuntimeError and should just return JT_INVALID; it would then be write's responsibility to check the return code and invoke the default_handler if one is available (which can raise its own error), or to raise a RuntimeError if there is no default_handler. I think I could write that patch.

@jreback (Contributor) commented Mar 8, 2016

Yep, exactly. Rather, it should return a code that allows it to fall out of that loop and be treated like an iterable (not 100% sure that would work, but give it a try). Once the iterable has it, it should call the defaultHandler.

I misread a bit of what you pointed to. This should ALL be in the C code. That's where everything is handled.

@gregory-marton (Author)

Thank you for misunderstanding me well.

Just removing the RuntimeError creates interesting json:

'"{\\"a\\":{\\"0\\":,\\"1\\":,\\"2\\":},\\"b\\":{\\"0\\":null,\\"1\\":null,\\"2\\":\\"N\\\\/A\\"}}"'

That's a JSON string representing a Python string that contains a stringified dictionary, with the first column's values replaced by something very strange.

But of course one needs to do more. This is as far as I got today:

--- a/pandas/io/tests/test_json/test_pandas.py
+++ b/pandas/io/tests/test_json/test_pandas.py
@@ -815,11 +815,38 @@ DataFrame\\.index values are different \\(100\\.0 %\\)

     def test_default_handler(self):
         value = object()
-        frame = DataFrame({'a': ['a', value]})
-        expected = frame.applymap(str)
+        frame = DataFrame({'a': [7, value]})
+        expected = DataFrame({'a': [7, str(value)]})
         result = pd.read_json(frame.to_json(default_handler=str))
         assert_frame_equal(expected, result, check_index_type=False)

+    def test_default_handler_with_json_dumps(self):
+        import json
+        class PlausibleHandler(json.JSONEncoder):
+            # disable method-hidden because
+            # https://github.com/PyCQA/pylint/issues/414
+            def default(self, obj): # pylint: disable=method-hidden
+                try:
+                    if hasattr(obj, 'to_json'):
+                        optargs = obj.to_json.__code__.co_varnames
+                        if 'default_handler' in optargs:
+                            return obj.to_json(default_handler=PlausibleHandler)
+                        else:
+                            return obj.to_json()
+                    if isinstance(obj, complex):
+                        return {'mathjs' : 'Complex',
+                                're' : obj.real, 'im' : obj.imag}
+                        # https://github.com/pydata/pandas/issues/12554
+                    return json.JSONEncoder.default(self, obj)
+                except Exception as e:
+                    return repr(e)
+        df_list = [ 9, DataFrame({'a': [1, 2.3, complex(4, -5)],
+                                  'b': [float('nan'), None, 'N/A']})]
+        expected = ('[9,{"a":{"0":1.0,"1":2.3,'
+                    '"2":{"mathjs":"Complex","re":4,"im":-5}},'
+                    '"b":{"0":null,"1":null,"2":"N\/A"}}]')
+        self.assertEqual(expected, json.dumps(df_list, cls=PlausibleHandler))
+
     def test_default_handler_raises(self):
         def my_handler_raises(obj):
             raise TypeError("raisin")
diff --git a/pandas/src/ujson/python/objToJSON.c b/pandas/src/ujson/python/objToJSON.c
index dcb509b..cbabb11 100644
--- a/pandas/src/ujson/python/objToJSON.c
+++ b/pandas/src/ujson/python/objToJSON.c
@@ -544,12 +544,8 @@ static int NpyTypeToJSONType(PyObject* obj, JSONTypeContext* tc, int npyType, vo
     return *((npy_bool *) value) == NPY_TRUE ? JT_TRUE : JT_FALSE;
   }

-  PRINTMARK();
-  PyErr_Format (
-      PyExc_RuntimeError,
-      "Unhandled numpy dtype %d",
-      npyType);
-  return JT_INVALID;
+  PRINTMARK();  // GREMIO
+  return JT_OBJECT;
 }


@@ -1834,11 +1830,15 @@ void Object_beginTypeContext (JSOBJ _obj, JSONTypeContext *tc)

   if (enc->npyType >= 0)
   {
-    PRINTMARK();
-    tc->prv = &(enc->basicTypeContext);
-    tc->type = NpyTypeToJSONType(obj, tc, enc->npyType, enc->npyValue);
-    enc->npyType = -1;
-    return;
+    int jt_type = NpyTypeToJSONType(obj, tc, enc->npyType, enc->npyValue);
+    if (jt_type != JT_INVALID) // GREMIO
+    {
+      PRINTMARK();
+      tc->prv = &(enc->basicTypeContext);
+      tc->type = jt_type;
+      enc->npyType = -1;
+      return;
+    }
   }

   if (PyBool_Check(obj))

I verified that the test fails the way it should without changes.

With the shown set of changes, it gives me a segfault, which I would have started debugging, but El Capitan is stupid. Tomorrow I'll try on a different machine where debugging isn't quite such a pita. In the meantime, I'd love some early feedback.

@jreback (Contributor) commented Mar 11, 2016

cc @Komnomnomnom

@gregory-marton (Author)

So where I got with this today is that I have the default handler getting called for each element of the bad-dtype column. Unfortunately it's getting called with the whole DataFrame, instead of with the individual cell value. Even then, I'd like the dispatch only to go to the defaultHandler if the type is not one of the ones I can handle internally, so that remains to be done too. If one of you knows how I can get the individual value as a PyObject instead of getting the whole df, that'd be a great hint.

patch.txt

@gregory-marton (Author)

I feel like what I want is:

PyObject *PyArray_GETITEM(PyArrayObject* arr, void* itemptr)
Get a Python object from the ndarray, arr, at the location pointed to by itemptr. Return NULL on failure.

but I'm not sure where to get my hands on the relevant PyArrayObject instead of the whole dataframe.

@gregory-marton (Author)

I left in a huff last time after having the following conversation with gdb:

(gdb) p enc->outputFormat
$19 = 4 
(gdb) n
2282          pc->iterNext = PdBlock_iterNext;
(gdb) p enc->outputFormat
$20 = 4 
(gdb) n
2283          pc->iterGetName = PdBlock_iterGetName;
(gdb) p enc->outputFormat
$21 = 1207959552
(gdb) p pc
$24 = (TypeContext *) 0x7ffffffe99a0
(gdb) p &enc
$25 = (PyObjectEncoder **) 0x7ffffffe99b0
(gdb) p &(pc->iterNext)
$28 = (JSPFN_ITERNEXT *) 0x7ffffffe99b0

I have otherwise made no significant further progress. I'm going to stop working on this for a while. If someone wants to jump in with some explanation of the right way to approach the desired behavior, I might take another shot, but stuff like the above is just ... discouraging.

Out of curiosity, how current and how strong are the use cases that drive the decision to write this in C? In what client contexts are serialization and deserialization the most performance-sensitive tasks, more important than flexibility (and perhaps correctness, if the above is an indication)? Is there a considered design behind the structure of the C code (long functions, use of goto, non-use of boost), and if so, where can I read about it?

@jreback (Contributor) commented Mar 14, 2016

@gregory-marton that's why I cc'd @Komnomnomnom. He is the original author. The style is typical C style for code this low-level; this is how numpy does lots of things, and the gotos are almost all for error handling. Further, you can't use much newer tooling (e.g. boost). These files all compile under C99, I believe.

As for perf, that is of course why it's written in C. JSON serialization is in fact a pretty big bottleneck in general.

As for correctness, this is simply an unsupported feature (as are a couple of other things). The default_handler is supposed to work for this, and that's the bug.

@gregory-marton (Author)

Correctness: the surprising overlap of memory addresses is not something I've diagnosed, and it may in fact be related to one of my attempts to implement the requested feature, and it may even be working as intended. But cryptic behavior of that sort may make it very easy indeed to introduce bugs, if in fact the interaction doesn't point to one already.

Some of the gotos appear to be for ITERABLES processing, which I suspect could be delineated more clearly, and which is precisely where my understanding fell down. One could clarify the code without sacrificing performance by breaking up large functions into more digestible pieces with clearer flow and more naming to guide a new developer regarding intent.

Re: perf and language, I didn't realize that C99 support (and perhaps beyond) was an explicit pandas goal, sorry. I don't want to derail this ticket with further discussion. Thanks.

Thank you for cc'ing @Komnomnomnom. I look forward to any input.

@Komnomnomnom (Contributor)

@gregory-marton the C code is the ujson library, highly customised/subverted to iterate through numpy and pandas objects. Because it's trying to deal directly with C datatypes whilst sticking to the ujson logic, things get a little... tricky.

Dealing with the numpy C data types directly at the top of that function is deliberate, and it's unwise to let them be handled further down the function. That said, it should be straightforward to dispatch to the default handler for unsupported dtypes. Happy to take a look if you're fed up with it; mind if I nab your test code?

@gregory-marton (Author)

Please do! I put my work-in-progress, including tests, into the patchfile at the bottom of comment #12554 (comment)

@Komnomnomnom (Contributor)

@gregory-marton I just added some JSON fixes (#12802, #12878) which should take care of this once merged.

Also, FYI, have a look at pandas.io.json.dumps. It should serialise most standard, numpy, or pandas objects, and lets you avoid having to hack around with the standard json lib, e.g.

In [31]: lst = [ 9, DataFrame({'a': [1, 2.3, (4, -5)], 'b': [float('nan'), None, 'N/A']}), np.array([1,2,3])]

In [32]: pandas.io.json.dumps(lst)
Out[32]: '[9,{"a":{"0":1,"1":2.3,"2":[4,-5]},"b":{"0":null,"1":null,"2":"N\\/A"}},[1,2,3]]'

Hopefully this will be exposed at the top level at some point (#9147)
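Assuming those fixes behave as described, the original example should then serialize through the handler instead of raising (not verified here):

import pandas
df = pandas.DataFrame({'a': [complex(3, -1)]})
df.to_json(default_handler=str)
# expected: '{"a":{"0":"(3-1j)"}}'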

jreback added this to the 0.18.1 milestone Apr 26, 2016