DataFrame constructor acts differently with lists and Numpy arrays #9131

Closed
scls19fr opened this issue Dec 22, 2014 · 5 comments

@scls19fr
Contributor

Hello,

I noticed a strange behavior when a NumPy array is passed to the pandas DataFrame constructor.
I don't really know whether this should be considered an issue (or a feature)...
but either way, a tip for working around it would be nice.

import pandas as pd
import numpy as np
lst = [{'a':1, 'b':2}, {'a':3, 'b':2, 'c':3}]

In []: pd.DataFrame(lst)
Out[]:
   a  b   c
0  1  2 NaN
1  3  2   3

but with a NumPy array:

In []: arr=np.array(lst)
In []: arr
Out[]: array([{'a': 1, 'b': 2}, {'a': 3, 'c': 3, 'b': 2}], dtype=object)

In []: pd.DataFrame(arr)
Out[]:
                             0
0           {u'a': 1, u'b': 2}
1  {u'a': 3, u'c': 3, u'b': 2}

I was expecting the same result in both cases: a DataFrame with columns named 'a', 'b', 'c',
like the one I get when I feed DataFrame a standard list.

I can "fix" this using

In []: pd.DataFrame(list(arr))
Out[]:
   a  b   c
0  1  2 NaN
1  3  2   3

I don't think that pd.DataFrame(list(arr)) is a nice solution... (with a big array it will probably be very slow)
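
On the cost of that workaround: a minimal sketch, assuming a recent pandas and NumPy, which suggests that converting an object array back to a list only copies references to the existing dicts, not the dicts themselves. `arr.tolist()` is used here as an alternative spelling of `list(arr)`; whether either is fast enough for a given array is best checked by timing it.

```python
import numpy as np
import pandas as pd

lst = [{'a': 1, 'b': 2}, {'a': 3, 'b': 2, 'c': 3}]
arr = np.array(lst)  # 1-D array with dtype=object

# Passing the array directly gives one column holding the dicts.
df_from_array = pd.DataFrame(arr)

# Converting back to a list restores the dict-key inference.
as_list = arr.tolist()
df_from_list = pd.DataFrame(as_list)

# The conversion copies references only; the dict objects are shared.
same_objects = all(x is y for x, y in zip(as_list, lst))

print(df_from_array.shape)
print(list(df_from_list.columns))
print(same_objects)
```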

Any idea ?

Kind regards

@jreback
Contributor

jreback commented Dec 23, 2014

If you are starting with a Python structure (a list), I am not sure what the issue is here. Why would you convert to a NumPy array first? What are you trying to do?

@scls19fr
Contributor Author

I'm getting data from a JSON response. It is a list of nested dictionaries, which first need to be flattened. I could do this:

data = [flatten_dict(d) for d in data]

but I think it is better for performance to work with NumPy arrays:

f_flatten_dict = np.vectorize(flatten_dict)
a_data = np.array(data)
a_data = f_flatten_dict(a_data)

and then I build a DataFrame.

Here is my flatten_dict function:

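# collections.abc is the Python 3 home of MutableMapping
import collections.abc

def flatten_dict(d, parent_key=''):
    """Recursively flatten a nested dict, joining nested keys with '_'."""
    items = []
    for k, v in d.items():
        new_key = parent_key + '_' + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            # Nested dict: recurse, prefixing its keys with new_key
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            # List: flatten each element under an indexed key
            # (elements are assumed to be dicts)
            for n, elem in enumerate(v):
                mykey = "%s_%d" % (new_key, n)
                items.extend(flatten_dict(elem, mykey).items())
        else:
            items.append((new_key, v))
    return dict(items)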

@shoyer
Member

shoyer commented Dec 23, 2014

@scls19fr I would encourage you to profile your code to test your theories about performance (IPython makes this easy with the %timeit magic). In this case, I am pretty sure that a numpy array would not be faster than a Python list -- usually using non-native types in your array or np.vectorize are signs that numpy will not speed things up.

To give a little more context on the design here: pandas only runs its inference steps about how to lay out the data when the data is provided as a list. That is deliberate, because it's usually not a good idea to nest dictionaries in numpy arrays.
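
A minimal sketch of the kind of comparison suggested above, using the standard-library `timeit` module instead of the `%timeit` magic so that it runs outside IPython. `flatten_one_level` and the sample `data` are hypothetical stand-ins, not the code from this thread. The NumPy documentation describes `np.vectorize` as a convenience that is essentially a for loop, so a speedup should not be assumed.

```python
import timeit
import numpy as np

def flatten_one_level(d):
    """Hypothetical stand-in for flatten_dict: flattens one level of nesting."""
    out = {}
    for k, v in d.items():
        if isinstance(v, dict):
            for sub_k, sub_v in v.items():
                out["%s_%s" % (k, sub_k)] = sub_v
        else:
            out[k] = v
    return out

# Made-up sample records
data = [{'a': i, 'b': {'x': i, 'y': 2 * i}} for i in range(2000)]

def with_list_comprehension():
    return [flatten_one_level(d) for d in data]

# otypes is given explicitly so np.vectorize does not have to infer it.
vectorized = np.vectorize(flatten_one_level, otypes=[object])

def with_vectorize():
    return vectorized(np.array(data, dtype=object))

# Both approaches produce the same flattened dicts.
assert with_list_comprehension() == list(with_vectorize())

# Timings depend on the machine, so no expected numbers are shown.
print("list comprehension:", timeit.timeit(with_list_comprehension, number=10))
print("np.vectorize:      ", timeit.timeit(with_vectorize, number=10))
```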

@scls19fr
Contributor Author

Thanks, I understand your point of view. But is there a way to create columns from dict keys automatically?

@jreback
Contributor

jreback commented Dec 23, 2014

@scls19fr you might want to have a look here: http://pandas.pydata.org/pandas-docs/stable/io.html#normalization
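
The linked section covers JSON normalization. A minimal sketch of what that can look like, assuming pandas 1.0 or later (where the function is available as `pd.json_normalize`; older versions exposed it as `pandas.io.json.json_normalize`) and made-up sample records:

```python
import pandas as pd

# Made-up records standing in for a parsed JSON response.
data = [
    {'a': 1, 'b': {'x': 2}},
    {'a': 3, 'b': {'x': 4, 'y': 5}},
]

# Nested dict keys become column names joined with sep.
df = pd.json_normalize(data, sep='_')

print(sorted(df.columns))  # ['a', 'b_x', 'b_y']
print(df)
```

As far as I know, values that are lists are left as-is by this call, so the indexed `key_0`, `key_1` style columns produced by flatten_dict would still need custom handling.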

closing as a usage issue
