# Multidimensional data in pandas

Files needed = ('nipa.xlsx', 'CPS_March_2016.csv')

We have covered some pandas basics and learned how to plot. Now let's sort out how to deal with more complex data. We will often find ourselves with data in which the unit of observation is defined by more than one variable. Pandas helps us deal with this by allowing an index with several levels. So far, we have only used a single index, but that is about to change.

Some examples that could use a multiIndex:

1. State and county
2. Team and player
3. Industry and firm
4. Country (or person, firm,...) and time

That last one is important, and it shows up a lot in economics. We call this *panel data* (sometimes called longitudinal data): it follows the same firm/person/country over time. Recall the example we saw last week with the PM2.5 exposure data.

MultiIndexes are important. Here are a few applications.

1. When the multiIndex is set correctly, we can use methods such as `.loc[]` in more powerful ways to retrieve subsets of data. 
2. The multiIndex is important when we want to "reshape" DataFrames to create DataFrames with observations as rows and variables as columns. 
3. The multiIndex helps us keep our data neat and organized. 

In [1]:
import pandas as pd                 # load pandas and shorten it to pd
import matplotlib.pyplot as plt     # for making figures

In [2]:
soccer = {'team' : ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'], 
          'player' : ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
          'pos' : ['D', 'D', 'M', 'F', 'D', 'F'],
          'goals' : [1, 0, 0, 1, 0, 3],
          'assists': [0,0,0,0,0,0]
         }

prem = pd.DataFrame(soccer)
prem

,team,player,pos,goals,assists
0,Man City,Walker,D,1,0
1,Man City,Stones,D,0,0
2,Man City,Foden,M,0,0
3,Man City,Jesus,F,1,0
4,Chelsea,Cahill,D,0,0
5,Chelsea,Pedro,F,3,0


### Multiple indexing
The key to working with more complex datasets is getting the index right. So far, we have considered a single index, but pandas allows for multiple indexes nested within one another. 

**Key concept:** Hierarchical indexing takes multiple *levels* of indexes. 

Let's set up the DataFrame to take team and position as the indexes. 

In [3]:
prem.set_index(['team', 'pos'], inplace=True)
prem

team,pos,player,goals,assists
Man City,D,Walker,1,0
Man City,D,Stones,0,0
Man City,M,Foden,0,0
Man City,F,Jesus,1,0
Chelsea,D,Cahill,0,0
Chelsea,F,Pedro,3,0


Wow. 

Notice that the `set_index()` method is the same one we used earlier with single indexes. In this case, we passed it a list of variables to make the indexes:
```python
prem.set_index(['team', 'pos'], inplace=True)
```

In the output, the highest level of the index is team (we passed it 'team' first in the list) and the second level is position. The output does not repeat the team name for each observation. The 'missing' team name just means that the team is the same as above. \[A very Tufte-esque removal of unnecessary ink.\] 

Let's take a look under the hood. What's our index? A new kind of object: the MultiIndex.

In [4]:
print(prem.index)

MultiIndex([('Man City', 'D'),
            ('Man City', 'D'),
            ('Man City', 'M'),
            ('Man City', 'F'),
            ( 'Chelsea', 'D'),
            ( 'Chelsea', 'F')],
           names=['team', 'pos'])
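
The pieces of a MultiIndex can be inspected directly. Here is a minimal sketch, assuming a fresh copy of the soccer data (the name `demo` is ours, not part of the notebook), that pulls out the level names, the number of levels, and the labels stored in one level.

```python
import pandas as pd

# Rebuild the small example so this block runs on its own.
soccer = {'team': ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
          'player': ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
          'pos': ['D', 'D', 'M', 'F', 'D', 'F'],
          'goals': [1, 0, 0, 1, 0, 3],
          'assists': [0, 0, 0, 0, 0, 0]}
demo = pd.DataFrame(soccer).set_index(['team', 'pos'])

print(demo.index.names)      # level names, outermost first
print(demo.index.nlevels)    # how many levels the index has

# get_level_values() returns one label per row for the chosen level.
team_labels = demo.index.get_level_values('team')
print(team_labels.unique())
```

Level names are what we pass to arguments such as `level='team'` later on, so it is worth checking them after setting an index.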


### Subsetting with multiple indexes
With a multiIndex, we need two values, one for each level, to reference observations. Notice that I am using a **tuple** to pass the two values of the multiIndex.

In [5]:
# All the defenders on Man City
prem.loc[('Man City', 'D'),:] 

PerformanceWarning: indexing past lexsort depth may impact performance.
  prem.loc[('Man City', 'D'),:]


team,pos,player,goals,assists
Man City,D,Walker,1,0
Man City,D,Stones,0,0


It's always a good idea to pay attention to warnings, particularly a `PerformanceWarning`. Pandas is telling us that we are asking for something in the second level of the index, but the index is not sorted. If the index were big, this could slow down our program. Let's fix that with `sort_index()`.

**Important:** Sort your multiIndex. 
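
One way to check whether a multiIndex is already sorted is the `is_monotonic_increasing` attribute of the index. The sketch below uses a stand-alone copy of the soccer data (the names `unsorted_df` and `sorted_df` are ours) and checks the attribute before and after `sort_index()`.

```python
import pandas as pd

# Rebuild the small example so this block runs on its own.
soccer = {'team': ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
          'player': ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
          'pos': ['D', 'D', 'M', 'F', 'D', 'F'],
          'goals': [1, 0, 0, 1, 0, 3],
          'assists': [0, 0, 0, 0, 0, 0]}

unsorted_df = pd.DataFrame(soccer).set_index(['team', 'pos'])
before = unsorted_df.index.is_monotonic_increasing   # rows are in data-entry order

sorted_df = unsorted_df.sort_index()                  # returns a new, sorted DataFrame
after = sorted_df.index.is_monotonic_increasing

print('sorted before:', before)
print('sorted after: ', after)
```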

In [6]:
# Tell pandas which axis to sort. We could sort the columns, too...
# This returns a new DataFrame unless we use inplace=True.
prem = prem.sort_index(axis=0)   
                                 
prem

team,pos,player,goals,assists
Chelsea,D,Cahill,0,0
Chelsea,F,Pedro,3,0
Man City,D,Walker,1,0
Man City,D,Stones,0,0
Man City,F,Jesus,1,0
Man City,M,Foden,0,0


In [7]:
# Now let's ask for all the defenders on Man City again.
prem.loc[('Man City', 'D'), :]

team,pos,player,goals,assists
Man City,D,Walker,1,0
Man City,D,Stones,0,0


No warnings. 

### Partial indexing
With the indexes set, we can easily subset the data using only one of the indexes. In pandas, this is called *partial indexing* because we are only using part of the index to identify the data we want. 


We use the `xs()` (for "cross-section") method of DataFrame. Here we specify which level we are looking into. Note that I can reference a level either by an integer or by its name.

In [8]:
# Get all of the 'Chelsea' observations.
prem.xs('Chelsea', level = 0)

pos,player,goals,assists
D,Cahill,0,0
F,Pedro,3,0


In [9]:
# Get all of the 'Man City' observations.
prem.xs('Man City', level = 'team')

pos,player,goals,assists
D,Walker,1,0
D,Stones,0,0
F,Jesus,1,0
M,Foden,0,0


Notice that `.xs()` dropped the index level that we chose from. If we want to keep that level we use the `drop_level` option. 

In [10]:
prem.xs('Man City', level = 'team', drop_level=False)

team,pos,player,goals,assists
Man City,D,Walker,1,0
Man City,D,Stones,0,0
Man City,F,Jesus,1,0
Man City,M,Foden,0,0


With `xs()`, we can partially index on the 'inner index' as well. Suppose we want all the defenders, regardless of team.

In [11]:
prem.xs('D', level='pos')

team,player,goals,assists
Chelsea,Cahill,0,0
Man City,Walker,1,0
Man City,Stones,0,0


As with a single index, we can get rid of the multiIndex and replace it with a generic list of integers. This adds the index levels back into the DataFrame as columns. 

In [12]:
prem.reset_index(inplace=True)    # this moves the indexes back to columns
prem

,team,pos,player,goals,assists
0,Chelsea,D,Cahill,0,0
1,Chelsea,F,Pedro,3,0
2,Man City,D,Walker,1,0
3,Man City,D,Stones,0,0
4,Man City,F,Jesus,1,0
5,Man City,M,Foden,0,0
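
We do not have to move every level back at once. As a small sketch on a stand-alone copy of the data (the names `demo` and `only_team` are ours), `reset_index()` also takes a `level` argument that moves just the named level back into the columns and leaves the rest of the index alone.

```python
import pandas as pd

# Rebuild the small example so this block runs on its own.
soccer = {'team': ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
          'player': ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
          'pos': ['D', 'D', 'M', 'F', 'D', 'F'],
          'goals': [1, 0, 0, 1, 0, 3],
          'assists': [0, 0, 0, 0, 0, 0]}
demo = pd.DataFrame(soccer).set_index(['team', 'pos']).sort_index()

# Move only 'pos' back to a column; 'team' stays as the (single) index.
only_team = demo.reset_index(level='pos')
print(only_team)
```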


Who says two indexes are enough...let's try three levels of indexes!

In [13]:
prem.set_index(['team', 'pos', 'player'], inplace=True)
prem

team,pos,player,goals,assists
Chelsea,D,Cahill,0,0
Chelsea,F,Pedro,3,0
Man City,D,Walker,1,0
Man City,D,Stones,0,0
Man City,F,Jesus,1,0
Man City,M,Foden,0,0


In [14]:
prem = prem.sort_index(axis=0)
prem

team,pos,player,goals,assists
Chelsea,D,Cahill,0,0
Chelsea,F,Pedro,3,0
Man City,D,Stones,0,0
Man City,D,Walker,1,0
Man City,F,Jesus,1,0
Man City,M,Foden,0,0


## Saving multiIndex DataFrames

Saving a multiIndexed DataFrame works like before. Pandas fills in all the repeated labels, so the output is ready to go. Run the following code and then open the CSV file.

In [15]:
# Multiple indexes on rows
prem.to_csv('prem.csv')
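
To see what gets written without opening the file, we can call `to_csv()` with no path, in which case it returns the CSV text as a string. This is a sketch on a stand-alone copy of the data (`demo` and `csv_text` are our names).

```python
import pandas as pd

# Rebuild the three-level example so this block runs on its own.
soccer = {'team': ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
          'player': ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
          'pos': ['D', 'D', 'M', 'F', 'D', 'F'],
          'goals': [1, 0, 0, 1, 0, 3],
          'assists': [0, 0, 0, 0, 0, 0]}
demo = pd.DataFrame(soccer).set_index(['team', 'pos', 'player']).sort_index()

# With no file name, to_csv() returns the text instead of writing a file.
csv_text = demo.to_csv()
print(csv_text)
```

Each index level becomes an ordinary column in the file, and the team name is written out on every row.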

## Reading multiIndex DataFrames

We can set up the multiIndex as we read in a file, too. 

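If the multiIndex is on the **rows**, pass `index_col` a list of column names. The order of the list sets the order of the index levels, so it does not have to match the column order in the file.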

In [16]:
prem_readin = pd.read_csv('prem.csv', index_col=['team', 'player', 'pos'])
prem_readin

team,player,pos,goals,assists
Chelsea,Cahill,D,0,0
Chelsea,Pedro,F,3,0
Man City,Stones,D,0,0
Man City,Walker,D,1,0
Man City,Jesus,F,1,0
Man City,Foden,M,0,0
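
`index_col` also accepts integer column positions instead of names. The sketch below reads the same layout from an in-memory string rather than from 'prem.csv', so it runs on its own; `csv_text` and `by_position` are our names.

```python
import io
import pandas as pd

# The same layout that prem.to_csv() produces, typed in by hand for this sketch.
csv_text = """team,pos,player,goals,assists
Chelsea,D,Cahill,0,0
Chelsea,F,Pedro,3,0
Man City,D,Stones,0,0
Man City,D,Walker,1,0
Man City,F,Jesus,1,0
Man City,M,Foden,0,0
"""

# Use the first three columns, in file order, as the index levels.
by_position = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1, 2])
print(by_position.index.names)
print(by_position.loc[('Man City', 'D', 'Walker'), 'goals'])
```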


## Top Hat Practice Exercise: multiIndexing

Use the prem_readin DataFrame from above to answer these questions. 

1. How many goals has Man City player Walker scored? Use `.loc[]` to return his number of goals.

In [17]:
prem_readin.loc[('Man City', 'Walker', 'D'), 'goals']

1

2. How many goals has Man City player Walker scored? Use `.xs()` to return his number of goals. Hint: Search directly in the 'player' level.

3. Why might this approach cause problems?

In [18]:
prem_readin.xs('Walker', level = 'player')['goals']

team      pos
Man City  D      1
Name: goals, dtype: int64
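
One possible answer to question 3, sketched with made-up data: if two teams each have a player named Walker, searching only the 'player' level returns both of them. The second Walker and his goal count below are hypothetical, and `dupes`, `hits`, and `exact` are our names.

```python
import pandas as pd

# Hypothetical data: a second 'Walker' who plays for Chelsea.
rows = {'team': ['Man City', 'Chelsea'],
        'player': ['Walker', 'Walker'],
        'pos': ['D', 'F'],
        'goals': [1, 2]}
dupes = pd.DataFrame(rows).set_index(['team', 'player', 'pos']).sort_index()

# Searching one level matches every Walker, whatever the team.
hits = dupes.xs('Walker', level='player')['goals']
print(hits)

# Giving the full key with .loc[] pins down a single player.
exact = dupes.loc[('Man City', 'Walker', 'D'), 'goals']
print(exact)
```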

4. Return a DataFrame containing only the Chelsea players. 

In [19]:
prem_readin.xs('Chelsea', level = 'team')

player,pos,goals,assists
Cahill,D,0,0
Pedro,F,3,0


## A multiIndex in columns
There is nothing that says you can't have multiple indexes in the `axis=1` dimension. It can be a bit more confusing, especially when we are reading in a file. 

Open up "nipa.xlsx" in Excel and take a look. 

If the multiIndex is on the **columns**, pass `header` a list of row numbers (integers) that hold the column labels. We also need to set the (row) index at the same time.

In [20]:
# Do not set the index. What is the name of the first column?
nipa = pd.read_excel('nipa.xlsx', header=[0,1])
nipa

,Unit,Nominal,Nominal,Real,Real
,Var,GDP,INV,GDP,INV
0,1990,5963.1445,993.449,9371.468,1223.03525
1,2000,10250.952,2038.408,13138.03525,2346.73125
2,2010,15048.97,2165.47275,15648.991,2216.47775
3,2020,21060.47425,3642.92525,18509.14275,3306.47325


In [21]:
nipa.columns

MultiIndex([(   'Unit', 'Var'),
            ('Nominal', 'GDP'),
            ('Nominal', 'INV'),
            (   'Real', 'GDP'),
            (   'Real', 'INV')],
           )

In [22]:
nipa = pd.read_excel('nipa.xlsx', header=[0,1], index_col=0)
nipa

Unit,Nominal,Nominal,Real,Real
Var,GDP,INV,GDP,INV
1990,5963.1445,993.449,9371.468,1223.03525
2000,10250.952,2038.408,13138.03525,2346.73125
2010,15048.97,2165.47275,15648.991,2216.47775
2020,21060.47425,3642.92525,18509.14275,3306.47325


In [23]:
nipa.columns

MultiIndex([('Nominal', 'GDP'),
            ('Nominal', 'INV'),
            (   'Real', 'GDP'),
            (   'Real', 'INV')],
           names=['Unit', 'Var'])
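
Once the columns carry a multiIndex, the same selection ideas apply along `axis=1`. Here is a minimal sketch that rebuilds the table by hand instead of reading 'nipa.xlsx' (the values are copied from the output above; `nipa_demo` and the other variable names are ours).

```python
import pandas as pd

# Build the two-level column index: Unit on top, Var underneath.
cols = pd.MultiIndex.from_product([['Nominal', 'Real'], ['GDP', 'INV']],
                                  names=['Unit', 'Var'])
data = [[5963.1445, 993.449, 9371.468, 1223.03525],
        [10250.952, 2038.408, 13138.03525, 2346.73125],
        [15048.97, 2165.47275, 15648.991, 2216.47775],
        [21060.47425, 3642.92525, 18509.14275, 3306.47325]]
nipa_demo = pd.DataFrame(data, index=[1990, 2000, 2010, 2020], columns=cols)

nominal = nipa_demo['Nominal']             # outer level: both nominal columns
real_gdp = nipa_demo[('Real', 'GDP')]      # full tuple: a single Series
gdp_both = nipa_demo.xs('GDP', axis=1, level='Var')   # inner level, across units

print(nominal)
print(real_gdp)
print(gdp_both)
```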

### Summary statistics by level
MultiIndexes provide a quick way to summarize data. We will see many different ways to do this &mdash; getting statistics by groups &mdash; and not all will involve a multiIndex. 

In [24]:
# Compute average goals for Chelsea players.
# We need to subset by the uppermost level.
print('Chelsea avg. goals:', prem.xs('Chelsea', level='team')['goals'].mean())   

Chelsea avg. goals: 1.5


In [25]:
# Compute average goals for defense. 
# We need to subset on the 'pos' level.
print('Defender avg. goals: {0:.2f}.'.format(prem.xs('D', level='pos')['goals'].mean()))  

Defender avg. goals: 0.33.


Notice the syntax with `.xs()`.
```python
prem.xs('Chelsea', level='team')['goals']
```

The `prem.xs('Chelsea', level='team')` is returning a DataFrame with all the columns. \[Try it!\]

We then use the usual square-bracket syntax to pick off just the column 'goals' and then apply `.mean()`.


In [26]:
prem.xs('Chelsea', level='team')

pos,player,goals,assists
D,Cahill,0,0
F,Pedro,3,0
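
One of those other ways to get statistics by group is `groupby()`, which accepts a `level` argument and computes the statistic for every label in that level at once. This is only a sketch on a stand-alone copy of the data (`demo`, `team_avg`, and `pos_avg` are our names).

```python
import pandas as pd

# Rebuild the three-level example so this block runs on its own.
soccer = {'team': ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
          'player': ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
          'pos': ['D', 'D', 'M', 'F', 'D', 'F'],
          'goals': [1, 0, 0, 1, 0, 3],
          'assists': [0, 0, 0, 0, 0, 0]}
demo = pd.DataFrame(soccer).set_index(['team', 'pos', 'player']).sort_index()

# Average goals for every team, then for every position.
team_avg = demo.groupby(level='team')['goals'].mean()
pos_avg = demo.groupby(level='pos')['goals'].mean()

print(team_avg)
print(pos_avg)
```

The Chelsea and defender averages match the `.xs()` results above.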


### Limits to .xs()

`.xs()` can only search for one value in a level. This code, which looks reasonable,

```python
prem.xs(['Chelsea', 'Man City'], level='team')
```

will raise an error. I'm not sure why this is not allowed. ¯\\_(ツ)_/¯ To select several values from the outermost level, we can pass a list to `.loc[]` instead.

In [27]:
prem.loc[['Chelsea', 'Man City'],'goals']

team      pos  player
Chelsea   D    Cahill    0
          F    Pedro     3
Man City  D    Stones    0
               Walker    1
          F    Jesus     1
          M    Foden     0
Name: goals, dtype: int64
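
For several values in an *inner* level, two approaches that should work are a boolean mask built from `get_level_values()` and the `pd.IndexSlice` helper. The sketch below uses a stand-alone, sorted copy of the data (`demo`, `mask`, `d_and_f`, and `same_rows` are our names) to select the defenders and forwards on every team.

```python
import pandas as pd

# Rebuild the three-level example so this block runs on its own.
soccer = {'team': ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
          'player': ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
          'pos': ['D', 'D', 'M', 'F', 'D', 'F'],
          'goals': [1, 0, 0, 1, 0, 3],
          'assists': [0, 0, 0, 0, 0, 0]}
demo = pd.DataFrame(soccer).set_index(['team', 'pos', 'player']).sort_index()

# Option 1: a boolean mask on the labels of the 'pos' level.
mask = demo.index.get_level_values('pos').isin(['D', 'F'])
d_and_f = demo[mask]

# Option 2: slicers. One entry per level; ':' means "everything in this level".
# Slicers need a sorted index, which is one more reason to call sort_index().
idx = pd.IndexSlice
same_rows = demo.loc[idx[:, ['D', 'F'], :], :]

print(d_and_f)
```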

## Partial indexing with .loc[]

We can use `.loc[]` to partially index, but **only from the outermost level of the row index.** Unlike `.xs()`, we can pass it lists. 

In [28]:
prem.loc['Chelsea']

pos,player,goals,assists
D,Cahill,0,0
F,Pedro,3,0


In [29]:
prem.loc[['Chelsea', 'Man City']]

team,pos,player,goals,assists
Chelsea,D,Cahill,0,0
Chelsea,F,Pedro,3,0
Man City,D,Stones,0,0
Man City,D,Walker,1,0
Man City,F,Jesus,1,0
Man City,M,Foden,0,0
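
Working inward from the outermost level, `.loc[]` also takes a tuple that names the first two levels, or all three. A minimal sketch on a stand-alone copy of the data (`demo`, `city_defenders`, and `walker_goals` are our names):

```python
import pandas as pd

# Rebuild the three-level example so this block runs on its own.
soccer = {'team': ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
          'player': ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
          'pos': ['D', 'D', 'M', 'F', 'D', 'F'],
          'goals': [1, 0, 0, 1, 0, 3],
          'assists': [0, 0, 0, 0, 0, 0]}
demo = pd.DataFrame(soccer).set_index(['team', 'pos', 'player']).sort_index()

# First two levels: all Man City defenders.
city_defenders = demo.loc[('Man City', 'D'), :]

# All three levels plus a column name: a single value.
walker_goals = demo.loc[('Man City', 'D', 'Walker'), 'goals']

print(city_defenders)
print(walker_goals)
```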


## Bonus Practice Exercise to Try at Home

Let's take data from the [Current Population Survey](https://www.census.gov/programs-surveys/cps.html), which surveys about 60,000 households each month. This is the survey used to produce the official unemployment rate for the United States and many other labor-market indicators. We will compute some average wages. 

We will need to clean up a bit, then work with a multiIndex. Think of this as a mini-project!

The unit of observation is a person. The variables are:

* `hrwage`: hourly wage
* `educ`: education level
* `female`: 1 if female, 0 if not
* `fulltimely`: 1 if worked full time, 0 if not

1. Load the March CPS data, 'CPS_March_2016.csv'. Note: missing values are coded as '.'.

In [30]:
cps = pd.read_csv('CPS_March_2016.csv', na_values='.')
cps.head(2)

,hrwage,educ,female,fulltimely
0,20.961538,Some college,0,1.0
1,20.192308,HS diploma/GED,1,1.0


2. Keep only those with `fulltimely == 1`
3. Keep only those with `5 <= hrwage <= 200`

In [31]:
cps = cps[cps['fulltimely'] == 1]
cps = cps[(cps['hrwage'] >=5) & (cps['hrwage'] <=200)]
cps.describe()

,hrwage,female,fulltimely
count,67771.0,67771.0,67771.0
mean,25.851054,0.442593,1.0
std,19.860233,0.496697,0.0
min,5.0,0.0,1.0
25%,13.461538,0.0,1.0
50%,20.096153,0.0,1.0
75%,31.428572,1.0,1.0
max,200.0,1.0,1.0
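
The two-sided wage condition can also be written with the Series method `between()`, which includes both endpoints by default. The sketch below uses a tiny made-up table, not the CPS file, so that it runs on its own; `toy`, `keep`, and `trimmed` are our names.

```python
import pandas as pd

# Hypothetical mini data set with the same column names as the CPS extract.
toy = pd.DataFrame({'hrwage': [3.0, 5.0, 20.0, 200.0, 250.0, 15.0],
                    'fulltimely': [1, 1, 1, 1, 1, 0]})

# Full-time workers with 5 <= hrwage <= 200 (both endpoints are kept).
keep = (toy['fulltimely'] == 1) & toy['hrwage'].between(5, 200)
trimmed = toy[keep]
print(trimmed)
```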


4. Set the index to 'female' and 'educ', in that order.
5. Sort the index. 

In [32]:
cps.set_index(['female','educ'], inplace=True)
cps.sort_index(inplace=True)
cps

6. Report the average wage for `HS diploma/GED` and for `College degree`, regardless of gender. 

In [33]:
print('The average wage for high school/GED graduates is ${0:.2f} and for college graduates is ${1:.2f}.'.format(cps.xs('HS diploma/GED', level='educ')['hrwage'].mean(), cps.xs('College degree', level='educ')['hrwage'].mean()))

The average wage for high school/GED graduates is $19.11 and for college graduates is $31.96.
