
Dealing with Panel deprecation with Pandas version >= 0.20 #297

Closed
femtotrader opened this issue Apr 23, 2017 · 20 comments

Comments

@femtotrader (Contributor) commented Apr 23, 2017

@bashtage (Contributor) commented Apr 23, 2017

MultiIndex - xarray's API is probably too unstable for internal use

@jreback (Contributor) commented Apr 24, 2017

for sure return a MI here.

@femtotrader (Contributor, Author) commented Apr 24, 2017

Panel was a nice structure for data alignment.

How can we build a DataFrame with MultiIndex from a dictionary of DataFrames?

@bashtage (Contributor) commented Apr 24, 2017

One way would be to treat each key in the dictionary as a variable; then you could do something like

df = pd.concat([dfdict[key].stack() for key in dfdict], axis=1)

@femtotrader (Contributor, Author) commented Apr 24, 2017

I'm doing some experiments

import pandas as pd
import pandas_datareader.data as web

panel = web.DataReader(["GOOG", "IBM"], "yahoo")

Panel looks like

<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1838 (major_axis) x 2 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2010-01-04 00:00:00 to 2017-04-21 00:00:00
Minor_axis axis: GOOG to IBM

so we used indexing like this

panel["Open"]["GOOG"]["2010-01-04"]
panel["Open", "2010-01-04", "GOOG"]

Let's build a dictionary of DataFrames

from collections import OrderedDict
d = OrderedDict()
d["GOOG"] = web.DataReader("GOOG", "yahoo")
d["IBM"] = web.DataReader("IBM", "yahoo")
df = pd.concat([d[key].stack() for key in d], axis=1)

but there are two problems (a possible fix is sketched below). The result looks like this:

                                 0             1
Date
2010-01-04 Open       6.269511e+02  1.311800e+02
           High       6.295111e+02  1.329700e+02
           Low        6.242411e+02  1.308500e+02
           Close      6.267511e+02  1.324500e+02
           Volume     3.927000e+06  6.155300e+06
           Adj Close  3.130625e+02  1.114050e+02
2010-01-05 Open       6.271811e+02  1.316800e+02
...                            ...           ...
2017-04-20 Adj Close  8.416500e+02  1.623000e+02
2017-04-21 Open       8.428800e+02  1.620500e+02
           High       8.438800e+02  1.621100e+02
           Low        8.406000e+02  1.603800e+02
           Close      8.431900e+02  1.603800e+02
           Volume     1.319300e+06  5.600600e+06
           Adj Close  8.431900e+02  1.603800e+02

[11028 rows x 2 columns]

1. Columns are not named correctly.

2. The indexing order df[0]["2010-01-04"]["Open"] is not the same as with the Panel.
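
A sketch of a possible fix (the level names "Symbol" and "Attribute" below are just illustrative): passing keys= makes pd.concat label the columns with the dict keys, and naming the index levels makes the lookups read more like the Panel ones.

import pandas as pd

# Label the stacked columns with the dict keys ("GOOG", "IBM").
df = pd.concat([d[key].stack() for key in d], axis=1, keys=list(d))
df.columns.name = "Symbol"
df.index.names = ["Date", "Attribute"]

# Panel-like lookups:
df.loc[(pd.Timestamp("2010-01-04"), "Open"), "GOOG"]   # scalar value
df["GOOG"].unstack()                                   # per-symbol OHLCV frame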

Pinging @shoyer about possible API instability of xarray and also to get his opinion about this migration.

da = panel.to_xarray()
da = da.rename({"major_axis": "Date", "minor_axis": "Symbol"})

DataArray looks like:

<xarray.DataArray (items: 6, major_axis: 1838, minor_axis: 2)>
array([[[  6.269511e+02,   1.311800e+02],
        [  6.271811e+02,   1.316800e+02],
        ...,
        [  8.414400e+02,   1.613200e+02],
        [  8.428800e+02,   1.620500e+02]],

       [[  6.295111e+02,   1.329700e+02],
        [  6.278411e+02,   1.318500e+02],
        ...,
        [  8.452000e+02,   1.624000e+02],
        [  8.438800e+02,   1.621100e+02]],

       ...,
       [[  3.927000e+06,   6.155300e+06],
        [  6.031900e+06,   6.841400e+06],
        ...,
        [  9.573000e+05,   6.486800e+06],
        [  1.319300e+06,   5.600600e+06]],

       [[  3.130625e+02,   1.114050e+02],
        [  3.116838e+02,   1.100592e+02],
        ...,
        [  8.416500e+02,   1.623000e+02],
        [  8.431900e+02,   1.603800e+02]]])
Coordinates:
  * items    (items) object 'Open' 'High' 'Low' 'Close' 'Volume' 'Adj Close'
  * Date     (Date) datetime64[ns] 2010-01-04 2010-01-05 2010-01-06 ...
  * Symbol   (Symbol) object 'GOOG' 'IBM'

let's use indexing with this DataArray

In [87]: da.loc["Open", "2010-01-04", "GOOG"]
Out[87]:
<xarray.DataArray ()>
array(626.951088)
Coordinates:
    items       <U4 'Open'
    major_axis  datetime64[ns] 2010-01-04
    minor_axis  <U4 'GOOG'

Indexing a DataArray to get a DataFrame is still odd for me.

Try:

da.loc[:, :, "GOOG"].to_dataframe()

it raises

ValueError: cannot convert an unnamed DataArray to a DataFrame: use the ``name`` parameter

A DataArray should have a default name for its axes...

I did something wrong:

In [134]: da.loc[:, :, "GOOG"].to_dataframe(name="value").unstack().transpose()
Out[134]:
items                 Adj Close   Close    High     Low    Open      Volume
           major_axis
minor_axis 2010-01-04      GOOG    GOOG    GOOG    GOOG    GOOG        GOOG
           2010-01-05      GOOG    GOOG    GOOG    GOOG    GOOG        GOOG
           2010-01-06      GOOG    GOOG    GOOG    GOOG    GOOG        GOOG
           2010-01-07      GOOG    GOOG    GOOG    GOOG    GOOG        GOOG
           2010-01-08      GOOG    GOOG    GOOG    GOOG    GOOG        GOOG
           2010-01-11      GOOG    GOOG    GOOG    GOOG    GOOG        GOOG
           2010-01-12      GOOG    GOOG    GOOG    GOOG    GOOG        GOOG
...                         ...     ...     ...     ...     ...         ...
value      2017-04-12    824.32  824.32  826.66  821.02  821.93      894600
           2017-04-13    823.56  823.56  826.38  821.44  822.14  1.1125e+06
           2017-04-17    837.17  837.17  837.75  824.47  825.01      892300
           2017-04-18    836.82  836.82  838.93  832.71  834.22      835300
           2017-04-19    838.21  838.21  842.22  836.29  839.79      954200
           2017-04-20    841.65  841.65   845.2  839.32  841.44      957300
           2017-04-21    843.19  843.19  843.88   840.6  842.88  1.3193e+06

[3676 rows x 6 columns]

@shoyer (Member) commented Apr 24, 2017

> Indexing a DataArray to get a DataFrame is still odd for me.

Try da.loc[:, :, "GOOG"].to_pandas()

The to_dataframe() method is designed for tidy data. On a DataArray, it essentially calls da.to_series().to_frame(). You want the API for preserving multi-dimensional data.
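
A minimal sketch of that, continuing from the da built above:

goog = da.loc[:, :, "GOOG"].to_pandas()   # 2-D DataArray -> DataFrame (fields x dates)
goog = goog.T                             # dates x fields, the familiar OHLCV layout
goog.loc["2010-01-04", "Open"]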

@femtotrader (Contributor, Author) commented Apr 24, 2017

Thanks @shoyer
How can we build a DataArray from a dict (OrderedDict in fact) of DataFrames?

@shoyer (Member) commented Apr 24, 2017

> How can we build a DataArray from a dict (OrderedDict in fact) of DataFrames?

Make a Dataset first (i.e., xarray.Dataset(my_dict)) and then convert into a DataArray with .to_array(). But note that you may really want a Dataset, not a DataArray.

@femtotrader (Contributor, Author) commented Apr 24, 2017

OK, so with

import xarray as xr
ds = xr.Dataset(d)

Dataset looks like

<xarray.Dataset>
Dimensions:  (Date: 1838, dim_1: 6)
Coordinates:
  * Date     (Date) datetime64[ns] 2010-01-04 2010-01-05 2010-01-06 ...
  * dim_1    (dim_1) object 'Open' 'High' 'Low' 'Close' 'Volume' 'Adj Close'
Data variables:
    GOOG     (Date, dim_1) float64 627.0 629.5 624.2 626.8 3.927e+06 313.1 ...
    IBM      (Date, dim_1) float64 131.2 133.0 130.9 132.4 6.155e+06 111.4 ...

but

ds.loc["Open", "2010-01-04", "GOOG"]

raises TypeError: can only lookup dictionaries from Dataset.loc

Is there an easier way of indexing than

  • ds["GOOG"].loc["2010-01-04", "Open"] or
  • ds.to_array().loc["GOOG", "2010-01-04", "Open"] ?

@shoyer (Member) commented Apr 24, 2017

I would probably organize this data with dimensions ['stock', 'date'] and data variables ['open', 'high', 'low', 'close', 'volume', 'adj close']. Usually, different units are a clue that you should be using different variables.

But indeed, if you want to make it work in a single .loc call, you'll need to put it in a DataArray.

> MultiIndex - xarray's API is probably too unstable for internal use

I think it should be mostly fleshed out now and working (as of v0.9). It's certainly possible some issues will come up that force breaking changes, but I don't think that's especially likely.
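
A sketch of that layout, assuming d is the OrderedDict of per-symbol DataFrames built earlier (the dimension name "Symbol" is just a choice):

import pandas as pd
import xarray as xr

# One Dataset: dims (Symbol, Date), one data variable per price field.
ds = xr.concat(
    [d[sym].to_xarray() for sym in d],        # per-symbol Datasets with dim Date
    dim=pd.Index(list(d), name="Symbol"),     # new Symbol dimension
)
ds["Open"].sel(Symbol="GOOG", Date="2010-01-04")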

@jreback (Contributor) commented Apr 24, 2017

@femtotrader the best method for this type of data is a MultiIndexed DataFrame. It is logical and quite consistent. Further, adding a dependency on xarray is a bit odd IMHO. I would simply explore this first. You can directly create a MI rather than the Panel.
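
For instance, a minimal sketch of that, reusing the OrderedDict d of per-symbol frames (the level name "Symbol" is just a choice):

import pandas as pd

# The dict keys become the outer index level; swap so Date comes first.
df = pd.concat(d, names=["Symbol"])
df = df.swaplevel("Symbol", "Date").sort_index()
# essentially the frame built by hand in the next comment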

@femtotrader (Contributor, Author) commented Apr 25, 2017

OK @jreback about the MI, but I think we should have Open, High, Low, Close, Volume, Adj Close as columns, because Volume is int64 and prices (OHLC) are float64.

@femtotrader (Contributor, Author) commented Apr 25, 2017

I'm still experimenting with MultiIndex (I haven't used it much previously):

from collections import OrderedDict
d = OrderedDict()

for symb in ["GOOG", "IBM"]:
    d[symb] = web.DataReader(symb, "yahoo")
    d[symb]["Symbol"] = symb
    d[symb] = d[symb].reset_index().set_index(["Date", "Symbol"])
    
df = pd.concat([d[key] for key in d], axis=0)
df = df.sort_index(axis=0, level=0)

DataFrame looks like:

                         Open        High         Low       Close    Volume  \
Date       Symbol
2010-01-04 GOOG    626.951088  629.511067  624.241073  626.751061   3927000
           IBM     131.179993  132.970001  130.850006  132.449997   6155300
2010-01-05 GOOG    627.181073  627.841071  621.541045  623.991055   6031900
           IBM     131.679993  131.850006  130.100006  130.850006   6841400
2010-01-06 GOOG    625.861078  625.861078  606.361042  608.261023   7987100
           IBM     130.679993  131.490005  129.809998  130.000000   5605300
2010-01-07 GOOG    609.401025  610.001045  592.651008  594.101005  12876600
           IBM     129.869995  130.250000  128.910004  129.550003   5840600
...                       ...         ...         ...         ...       ...
2017-04-19 GOOG    839.789978  842.219971  836.289978  838.210022    954200
           IBM     161.759995  162.639999  159.600006  161.690002  19282800
2017-04-20 GOOG    841.440002  845.200012  839.320007  841.650024    957300
           IBM     161.320007  162.399994  160.500000  162.300003   6486800
2017-04-21 GOOG    842.880005  843.880005  840.599976  843.190002   1319300
           IBM     162.050003  162.110001  160.380005  160.380005   5600600
2017-04-24 GOOG    851.200012  863.450012  849.859985  862.760010   1370500
           IBM     161.289993  161.570007  160.419998  160.750000   5175700

                    Adj Close
Date       Symbol
2010-01-04 GOOG    313.062468
           IBM     111.405000
2010-01-05 GOOG    311.683844
           IBM     110.059232
2010-01-06 GOOG    303.826685
           IBM     109.344283
2010-01-07 GOOG    296.753749
           IBM     108.965786
...                       ...
2017-04-19 GOOG    838.210022
           IBM     161.690002
2017-04-20 GOOG    841.650024
           IBM     162.300003
2017-04-21 GOOG    843.190002
           IBM     160.380005
2017-04-24 GOOG    862.760010
           IBM     160.750000

[3678 rows x 6 columns]

and dtypes

In [123]: df.dtypes
Out[123]:
Open         float64
High         float64
Low          float64
Close        float64
Volume         int64
Adj Close    float64
dtype: object

but indexing is still odd for me:

In [124]: df.loc[("2010-01-04", "GOOG"), "Open"]
Out[124]:
Date        Symbol
2010-01-04  GOOG      626.951088
Name: Open, dtype: float64

I was expecting a float64 scalar, not a pandas.core.series.Series.

Another strange behavior is:

In [132]: df.loc[["2010-01-04", "GOOG"], "Open"]
Out[132]:
Date        Symbol
2010-01-04  GOOG      626.951088
            IBM       131.179993
Name: Open, dtype: float64

@bashtage (Contributor) commented Apr 25, 2017

The last example makes "sense" in that I think pandas thinks you are asking for any element in the level 0 index that matches "2010-01-04" or "GOOG". The second isn't found but the first is.
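
A sketch of how to get a scalar back from the frame above: the date string goes through partial-string matching (which returns a slice, hence the Series), so pass an exact key instead.

import pandas as pd

df.loc[(pd.Timestamp("2010-01-04"), "GOOG"), "Open"]   # 626.951088
# or label-based scalar access:
df.at[(pd.Timestamp("2010-01-04"), "GOOG"), "Open"]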

@drafter250 commented Jun 30, 2017

Can a MultiIndex DataFrame be created from a dict of single-index DataFrames as simply as a Panel can be?

@bashtage (Contributor) commented Jun 30, 2017

I don't think so, but maybe this constructor would be useful for transitioning away from Panel.

@drafter250 commented Jul 26, 2017

So I'm answering my own question: as @jreback posted above, you can pass a dict of DataFrames to pd.concat, or a list of DataFrames plus a corresponding list of labels for the primary index. Maybe this should be mentioned/advertised in the docs for deprecating the Panel, if it isn't already.

My only concern would be ensuring that no joining/merging occurs. Would this be done with ignore_index=True?

@davidastephens mentioned this issue May 21, 2018

@davidastephens (Member) commented May 21, 2018

I made an attempt at fixing this in #525.

I made the columns a MultiIndex, rather than the rows as in @femtotrader's example. This allows some behavior we had with the Panel (and that was in the tests) to be maintained (less code breakage).

For example (from code in #525):

from pandas_datareader import data as web
data = web.get_data_yahoo(['AAPL', 'F'])
data.head()

Attributes       High               Low              Open             Close  \
Symbols          AAPL      F       AAPL      F       AAPL      F       AAPL   
Date                                                                          
2010-01-04  30.642857  10.28  30.340000  10.05  30.490000  10.17  30.572857   
2010-01-05  30.798571  11.24  30.464285  10.40  30.657143  10.45  30.625713   
2010-01-06  30.747143  11.46  30.107143  11.13  30.625713  11.21  30.138571   
2010-01-07  30.285715  11.69  29.864286  11.32  30.250000  11.46  30.082857   
2010-01-08  30.285715  11.74  29.865715  11.46  30.042856  11.67  30.282858   

Attributes              Volume               Adj Close            
Symbols         F         AAPL            F       AAPL         F  
Date                                                              
2010-01-04  10.28  123432400.0   60855800.0  20.616993  8.089488  
2010-01-05  10.96  150476200.0  215620200.0  20.652637  8.624592  
2010-01-06  11.37  138040000.0  200070600.0  20.324135  8.947227  
2010-01-07  11.66  119282800.0  130201700.0  20.286560  9.175431  
2010-01-08  11.69  111902700.0  130463000.0  20.421427  9.199037 

data.Close.head()

Symbols          AAPL      F
Date                        
2010-01-04  30.572857  10.28
2010-01-05  30.625713  10.96
2010-01-06  30.138571  11.37
2010-01-07  30.082857  11.66
2010-01-08  30.282858  11.69

The tricky part is getting the data for individual symbols (unless you swap the MultiIndex levels; a sketch of that is below). The best way I've found is this:

data.xs('F', axis=1, level='Symbols').head()

Attributes   High    Low   Open  Close       Volume  Adj Close
Date                                                          
2010-01-04  10.28  10.05  10.17  10.28   60855800.0   8.089488
2010-01-05  11.24  10.40  10.45  10.96  215620200.0   8.624592
2010-01-06  11.46  11.13  11.21  11.37  200070600.0   8.947227
2010-01-07  11.69  11.32  11.46  11.66  130201700.0   9.175431
2010-01-08  11.74  11.46  11.67  11.69  130463000.0   9.199037
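
For completeness, a sketch of the level-swapping alternative mentioned above:

swapped = data.swaplevel(axis=1).sort_index(axis=1)
swapped["F"].head()   # the same per-symbol data as the xs() call, column order aside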

Please let me know if anyone disagrees with this implementation.

@davidastephens (Member) commented May 26, 2018

Closed by #525
