# Data Formats for Panel Data Analysis

There are two primary methods to express data:

  * MultiIndex DataFrames where the outer index is the entity and the inner is the time index.  This requires using pandas.
  * 3D structures where dimension 0 (outer) is the variable, dimension 1 is the time index and dimension 2 is the entity index.  It is also possible to use a 2D data structure with dimensions (t, n), which is treated as a 3D data structure with dimensions (1, t, n). These 3D data structures can be pandas, NumPy or xarray.
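
The relationship between the 2D and 3D layouts can be illustrated with plain NumPy. This is a minimal sketch using made-up numbers; the names `x2d`, `x3d` and `both` are hypothetical and only show how a (t, n) array maps to (1, t, n) and how several variables stack along dimension 0.

```python
import numpy as np

# Hypothetical panel: 4 time periods (t) observed for 3 entities (n)
t, n = 4, 3
x2d = np.arange(t * n, dtype=float).reshape(t, n)

# Add a leading "variable" axis so the shape becomes (1, t, n)
x3d = x2d[np.newaxis, :, :]

print(x2d.shape)  # (4, 3)
print(x3d.shape)  # (1, 4, 3)

# Two variables with the same (t, n) layout stack along dimension 0
y2d = x2d ** 2
both = np.stack([x2d, y2d], axis=0)
print(both.shape)  # (2, 4, 3)
```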

## MultiIndex DataFrames
The most precise data format to use is a MultiIndex `DataFrame`.  It is the most precise because only a single column can preserve every type that may appear in a panel.  For example, it is not possible to span a single Categorical variable across multiple columns when using a pandas `Panel`.

This example uses the job training data to construct a MultiIndex `DataFrame` using the `set_index` command. The entity index is `fcode` and the time index is `year`.

In [None]:
from linearmodels.datasets import jobtraining
data = jobtraining.load()
print(data.head())

Here `set_index` is used to set the multi index using the firm code (entity) and year (time).

In [None]:
mi_data = data.set_index(['fcode', 'year'])
print(mi_data.head())

The `MultiIndex` `DataFrame` can be used to initialize the model.  When only referencing a single series, the `MultiIndex` `Series` representation can be used.

In [None]:
from linearmodels import PanelOLS
mod = PanelOLS(mi_data.lscrap, mi_data.hrsemp, entity_effects=True)
print(mod.fit())
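
The same pattern works for any long-format data. The sketch below uses a small hypothetical data set (the column names `firm`, `year`, `y` and `x` are made up) to show that both the `Series` and the single-column `DataFrame` selections keep the (entity, time) index and the variable name.

```python
import pandas as pd

# Hypothetical long-format data: one row per (firm, year) observation
long_df = pd.DataFrame({
    "firm": ["a", "a", "b", "b", "c", "c"],
    "year": [1987, 1988, 1987, 1988, 1987, 1988],
    "y": [1.0, 1.5, 0.2, 0.4, 2.0, 2.1],
    "x": [10.0, 12.0, 3.0, 4.0, 8.0, 9.0],
})

# Entity must be the outer level and time the inner level
mi = long_df.set_index(["firm", "year"])

y_series = mi.y       # Series: keeps the MultiIndex and the name "y"
y_frame = mi[["y"]]   # DataFrame: keeps the MultiIndex and the column "y"

print(mi.index.names)
print(y_series.name, y_frame.shape)
```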

## pandas Panels and DataFrames
An alternative format is to use pandas Panels and DataFrames.  Panels should be formatted with `items` containing distinct variables, `major_axis` holding the time dimension and `minor_axis` holding the entity index.  Here we transform the MultiIndex DataFrame to a panel to demonstrate this format.

A single index DataFrame can also be used and is treated as a single-item slice of a Panel, so the index should contain the time series dimension and the columns should contain the entities. Note that using the `DataFrame` version loses information about variable names, which is not usually desirable.

In [None]:
panel = mi_data[['lscrap','hrsemp']].to_panel().swapaxes(1,2)
lscrap = panel['lscrap']
hrsemp = panel['hrsemp']
panel
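
The (variable, time, entity) layout does not depend on `Panel` itself. As a rough sketch, assuming a pandas version where `Panel` is not available and a balanced panel, the same arrangement can be built with `unstack` and `np.stack`. The data and the names `mi`, `y_wide`, `x_wide` and `arr` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical balanced panel with entity (outer) and time (inner) levels
mi = pd.DataFrame(
    {"y": [1.0, 1.5, 0.2, 0.4, 2.0, 2.1],
     "x": [10.0, 12.0, 3.0, 4.0, 8.0, 9.0]},
    index=pd.MultiIndex.from_product(
        [["a", "b", "c"], [1987, 1988]], names=["firm", "year"]
    ),
)

# unstack moves the entity level into the columns: rows = time, columns = entity
y_wide = mi["y"].unstack("firm")
x_wide = mi["x"].unstack("firm")

# Stack the (t, n) slices along a new leading axis: (variable, time, entity)
arr = np.stack([y_wide.to_numpy(), x_wide.to_numpy()], axis=0)
print(y_wide)
print(arr.shape)  # (2, 2, 3)
```

Each `*_wide` object is a single index `DataFrame` in the (t, n) form described above, and `arr` is the corresponding unlabeled 3D array.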

When using panels, it is best to pass the panel itself, which requires selecting with `[[`_var_`]]` to ensure that the selected variable(s) still have 3 dimensions.  This retains information about the variable name.

In [None]:
res = PanelOLS(panel[['lscrap']], panel[['hrsemp']], entity_effects=True).fit()
print(res)

Using DataFrames removes this information and so the generic _Dep_ and _Exog_ are used.

In [None]:
res = PanelOLS(lscrap, hrsemp, entity_effects=True).fit()
print(res)

## NumPy arrays
NumPy arrays are treated identically to pandas Panel and single index DataFrames.  In particular, using `panel.values` and `df.values` will produce identical results.  The main difference between NumPy and pandas is that NumPy loses all label information.

In [None]:
res = PanelOLS(lscrap.values, hrsemp.values, entity_effects=True).fit()
print(res)
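
What "loses all label information" means can be seen directly. This is a small sketch with a hypothetical (t, n) `DataFrame`; the numbers and labels are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical (t, n) DataFrame: rows are years, columns are firms
df = pd.DataFrame(
    [[1.0, 0.2, 2.0], [1.5, 0.4, 2.1]],
    index=pd.Index([1987, 1988], name="year"),
    columns=pd.Index(["a", "b", "c"], name="firm"),
)

values = df.values  # plain ndarray with the same shape and numbers

print(type(values).__name__, values.shape)  # ndarray (2, 3)
# The labels exist only on the pandas object
print(hasattr(values, "index"), hasattr(values, "columns"))  # False False
```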

## xarray DataArrays

xarray is a relatively new entrant into the set of packages used for data structures.  The data structures provided by ``xarray`` are relevant in the context of panel models since pandas `Panel` is scheduled for removal in the future, and so the only 3D data format that will remain viable is an `xarray` `DataArray`. `DataArray`s are similar to pandas `Panel`s, although `DataArray`s use somewhat different notation.  In principle, it is possible to express the same information in a `DataArray` as in a `Panel`.

In [None]:
da = panel.to_xarray()
da

In [None]:
res = PanelOLS(da.loc[['lscrap']], da.loc[['hrsemp']], entity_effects=True).fit()
print(res)

## Conversion of Categorical and Strings to Dummies
Categorical or string variables are treated as factors and so are converted to dummies. The first category is always dropped.  If this is not desirable, you should manually convert the data to dummies before estimating a model.

In [None]:
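import pandas as pd
# Convert year to strings, keeping a single-column DataFrame
year_str = mi_data.reset_index()[['year']].astype('str')
# Build a Categorical from the same labels
year_cat = pd.Categorical(year_str.iloc[:,0])
# Align the string column with the (entity, time) index before assigning
year_str.index = mi_data.index
# A Categorical carries no index, so it is assigned positionally
mi_data['year_str'] = year_str
mi_data['year_cat'] = year_cat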

Here year has been converted to a string which is then used in the model to produce year dummies.

In [None]:
print('Exogenous variables')
print(mi_data[['hrsemp','year_str']].head())
print(mi_data[['hrsemp','year_str']].dtypes)

res = PanelOLS(mi_data[['lscrap']], mi_data[['hrsemp','year_str']], entity_effects=True).fit()
print(res)

Using a `Categorical` has the same effect.

In [None]:
print('Exogenous variables')
print(mi_data[['hrsemp','year_cat']].head())
print(mi_data[['hrsemp','year_cat']].dtypes)

res = PanelOLS(mi_data[['lscrap']], mi_data[['hrsemp','year_cat']], entity_effects=True).fit()
print(res)
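
If dropping the first category is not desirable, the dummies can be built by hand before estimation. The following is a minimal sketch using `pd.get_dummies` on hypothetical year labels; the choice of 1988 as the reference category is arbitrary and only for illustration.

```python
import pandas as pd

# Hypothetical year labels stored as strings
year = pd.Series(["1987", "1988", "1989", "1987", "1988", "1989"], name="year")

# Full set of dummies, one column per category
full = pd.get_dummies(year, prefix="year", dtype=float)

# Drop a reference category of your choosing (here 1988 rather than the first)
dummies = full.drop(columns="year_1988")

print(list(full.columns))     # ['year_1987', 'year_1988', 'year_1989']
print(list(dummies.columns))  # ['year_1987', 'year_1989']
```

The resulting numeric columns can then be joined to the other exogenous variables, so that no automatic conversion takes place.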