# The Index in Pandas

## Understanding Data Alignment and Indexing

## Learning Outcomes

- Understand how the index is used to align data
- Know how to set and reset the index
- Understand how to select subsets of data by slicing on index and columns
- Understand that for DataFrames, the column names also align data

## Setup

In [2]:
import pandas as pd
import numpy as np

## What is the Index?

- Every Series or DataFrame has an index
- It is more than just "row labels"
- **Key feature**: data alignment is intrinsic, so operations match values by label, not by position
- The link between a label and its data is never broken unless you break it explicitly
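
To make "alignment by label" concrete, here is a minimal sketch using two hypothetical Series (`a` and `b` are toy data, not part of the WDI dataset). They hold the same labels in different orders, and addition still pairs values by label:

```python
import pandas as pd

# Hypothetical toy data: same labels, stored in different orders
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([30.0, 10.0, 20.0], index=["z", "x", "y"])

# Addition pairs values by label, not by position
total = a + b
print(total)

# A purely positional sum would have paired 1.0 with 30.0 instead
print(total["x"])  # 1.0 + 10.0
```

If you want positional arithmetic instead, one option is to work on the underlying arrays (for example `a.to_numpy() + b.to_numpy()`), which discards the labels.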

## Loading Sample Data

World Bank's World Development Indicators Dataset

In [3]:
url = "https://datascience.quantecon.org/assets/data/wdi_data.csv"
df = pd.read_csv(url)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   country      72 non-null     object 
 1   year         72 non-null     int64  
 2   GovExpend    72 non-null     float64
 3   Consumption  72 non-null     float64
 4   Exports      72 non-null     float64
 5   Imports      72 non-null     float64
 6   GDP          72 non-null     float64
dtypes: float64(5), int64(1), object(1)
memory usage: 4.1+ KB


In [None]:
df.head()

## Creating Sample DataFrames

In [4]:
df_small = df.head(5)
df_small

,country,year,GovExpend,Consumption,Exports,Imports,GDP
0,Canada,2017,0.372665,1.095475,0.582831,0.600031,1.868164
1,Canada,2016,0.364899,1.058426,0.576394,0.575775,1.814016
2,Canada,2015,0.358303,1.035208,0.568859,0.575793,1.79427
3,Canada,2014,0.353485,1.011988,0.550323,0.572344,1.782252
4,Canada,2013,0.351541,0.9864,0.51804,0.558636,1.732714


In [15]:
df_tiny = df.iloc[[0, 3, 2, 4], :]
df_tiny

,country,year,GovExpend,Consumption,Exports,Imports,GDP
0,Canada,2017,0.372665,1.095475,0.582831,0.600031,1.868164
3,Canada,2014,0.353485,1.011988,0.550323,0.572344,1.782252
2,Canada,2015,0.358303,1.035208,0.568859,0.575793,1.79427
4,Canada,2013,0.351541,0.9864,0.51804,0.558636,1.732714


In [None]:
im_ex = df_small[["Imports", "Exports"]]
im_ex_copy = im_ex.copy()
im_ex_copy

## Element-wise Operations

When indices and columns match perfectly:

In [16]:
im_ex + im_ex_copy

,Imports,Exports
0,1.200063,1.165661
1,1.15155,1.152787
2,1.151585,1.137718
3,1.144688,1.100646
4,1.117272,1.036081


## Automatic Alignment in Action

What happens with mismatched indices?

In [17]:
df_tiny

,country,year,GovExpend,Consumption,Exports,Imports,GDP
0,Canada,2017,0.372665,1.095475,0.582831,0.600031,1.868164
3,Canada,2014,0.353485,1.011988,0.550323,0.572344,1.782252
2,Canada,2015,0.358303,1.035208,0.568859,0.575793,1.79427
4,Canada,2013,0.351541,0.9864,0.51804,0.558636,1.732714


In [18]:
im_ex_tiny = df_tiny + im_ex
im_ex_tiny

,Consumption,Exports,GDP,GovExpend,Imports,country,year
0,,1.165661,,,1.200063,,
1,,,,,,,
2,,1.137718,,,1.151585,,
3,,1.100646,,,1.144688,,
4,,1.036081,,,1.117272,,


## What Just Happened?

### Automatic Alignment
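- Pandas aligned the data by matching both row labels and column labels
- This works even when the rows or columns appear in different orders

### Handling Missing Data
- When a row or column label exists in only one of the DataFrames, the result for that cell is `NaN`
- In the example, row `1` exists only in `im_ex`, and columns such as `GDP` exist only in `df_tiny`
- `NaN` = "Not a Number" (pandas' marker for missing data)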
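
The same behavior can be reproduced on a smaller scale. This sketch uses two hypothetical DataFrames, `left` and `right`, that overlap in only one row label and one column label:

```python
import pandas as pd

# Hypothetical toy frames that share only row 1 and column "B"
left = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=[0, 1])
right = pd.DataFrame({"B": [10.0, 20.0], "C": [30.0, 40.0]}, index=[1, 2])

# The result covers the union of row labels and the union of column labels
result = left + right
print(result)

# Only the (row 1, column "B") cell exists in both frames: 4.0 + 10.0
print(result.loc[1, "B"])

# Every other cell is missing
print(result.isna().sum().sum())
```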

## Setting the Index

Use one or more DataFrame columns as the index with `set_index()`:

In [19]:
df_year = df.set_index(["year"])
df_year.head()

year,country,GovExpend,Consumption,Exports,Imports,GDP
2017,Canada,0.372665,1.095475,0.582831,0.600031,1.868164
2016,Canada,0.364899,1.058426,0.576394,0.575775,1.814016
2015,Canada,0.358303,1.035208,0.568859,0.575793,1.79427
2014,Canada,0.353485,1.011988,0.550323,0.572344,1.782252
2013,Canada,0.351541,0.9864,0.51804,0.558636,1.732714


## Using the Index with .loc

Extract all data for a specific year:

In [20]:
df_year.loc[2010]

year,country,GovExpend,Consumption,Exports,Imports,GDP
2010,Canada,0.347332,0.921952,0.469949,0.500341,1.613543
2010,Germany,0.653386,1.915481,1.443735,1.266126,3.417095
2010,United Kingdom,0.521146,1.598563,0.690824,0.745065,2.4529
2010,United States,2.510143,10.185836,1.84628,2.360183,14.992053
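
Because `year` is not unique, `.loc` returns every row carrying that label. The sketch below, on a hypothetical three-row frame named `toy`, shows how the shape of the result depends on how often the label occurs. It also shows that `set_index` returns a new object rather than modifying the original:

```python
import pandas as pd

# Hypothetical toy data: the label 2010 occurs twice, 2011 once
toy = pd.DataFrame(
    {
        "country": ["A", "B", "A"],
        "year": [2010, 2010, 2011],
        "GDP": [1.0, 2.0, 1.5],
    }
)
by_year = toy.set_index("year")

rows_2010 = by_year.loc[2010]  # repeated label: a DataFrame with two rows
row_2011 = by_year.loc[2011]   # label occurs once: a Series

print(type(rows_2010).__name__, type(row_2011).__name__)

# set_index returned a new DataFrame; toy still has "year" as a column
print("year" in toy.columns)
```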


## Example: Year-over-Year Changes

In [21]:
df_year.loc[2009].mean(numeric_only=True) - df_year.loc[2008].mean(numeric_only=True)

GovExpend      0.033317
Consumption   -0.042998
Exports       -0.121425
Imports       -0.140042
GDP           -0.182610
dtype: float64

Notice: each mean is a Series labeled by column name, and pandas aligned the two Series on those labels before subtracting!
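
To check that the subtraction really matches on labels, this sketch repeats the calculation on a hypothetical toy frame and deliberately reverses the label order of one operand:

```python
import pandas as pd

# Hypothetical toy data with two observations per year
toy = pd.DataFrame(
    {
        "year": [2008, 2008, 2009, 2009],
        "GDP": [10.0, 20.0, 9.0, 19.0],
        "Exports": [2.0, 4.0, 1.0, 2.0],
    }
).set_index("year")

mean_2008 = toy.loc[2008].mean()                      # labels: GDP, Exports
mean_2009 = toy.loc[2009].mean()[["Exports", "GDP"]]  # same labels, reversed order

# Each label is subtracted from its own counterpart despite the reordering
change = mean_2009 - mean_2008
print(change)
```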

## Problem: Single Index Limitations

Query: "What was the GDP in the US in 2010?"

In [22]:
df_year.loc[df_year["country"] == "United States", "GDP"].loc[2010]

np.float64(14.992052727)

That's a lot of work! And it gets worse with multiple countries...

## Hierarchical Index (MultiIndex)

Use multiple columns as the index:

In [23]:
wdi = df.set_index(["country", "year"])
wdi.head(20)

country,year,GovExpend,Consumption,Exports,Imports,GDP
Canada,2017,0.372665,1.095475,0.582831,0.600031,1.868164
Canada,2016,0.364899,1.058426,0.576394,0.575775,1.814016
Canada,2015,0.358303,1.035208,0.568859,0.575793,1.79427
Canada,2014,0.353485,1.011988,0.550323,0.572344,1.782252
Canada,2013,0.351541,0.9864,0.51804,0.558636,1.732714
Canada,2012,0.354342,0.961226,0.505969,0.547756,1.693428
Canada,2011,0.351887,0.943145,0.492349,0.528227,1.66424
Canada,2010,0.347332,0.921952,0.469949,0.500341,1.613543
Canada,2009,0.339686,0.890078,0.440692,0.439796,1.565291
Canada,2008,0.330766,0.889602,0.50635,0.502281,1.612862


## Slicing with MultiIndex

Now our queries are much simpler!

In [24]:
wdi.loc[("United States", 2010), "GDP"]

np.float64(14.992052727)

In [27]:
wdi.loc[(["United Kingdom", "Germany"], 2010), ["GDP"]]

country,year,GDP
United Kingdom,2010,2.4529
Germany,2010,3.417095


## Slicing Rules

### Key Distinction
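- **`list`** in row slicing → "OR" operation: select rows matching any of the listed labels
- **`tuple`** in row slicing → a single hierarchical key, one element per index level in order (here `(country, year)`); each element may itself be a list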
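
The distinction can be sketched on a hypothetical four-row frame (`toy` is illustrative data, indexed the same way as `wdi`):

```python
import pandas as pd

# Hypothetical toy data indexed by (country, year)
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [2010, 2011, 2010, 2011],
        "GDP": [1.0, 1.1, 2.0, 2.2],
    }
).set_index(["country", "year"])

# Tuple: one key spanning both levels
one_cell = toy.loc[("B", 2011), "GDP"]

# List: country "A" OR country "B", all years
both = toy.loc[["A", "B"]]

# List inside a tuple: either country, but only year 2010
in_2010 = toy.loc[(["A", "B"], 2010), :]

print(one_cell)
print(len(both))
print(in_2010)
```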

## Row Slicing Examples

1. All rows for United States:

In [28]:
wdi.loc["United States"]

year,GovExpend,Consumption,Exports,Imports,GDP
2017,2.405743,12.019266,2.287071,3.069954,17.348627
2016,2.407981,11.722133,2.219937,2.936004,16.972348
2015,2.37313,11.4098,2.222228,2.881337,16.710459
2014,2.334071,11.000619,2.209555,2.732228,16.242526
2013,2.353381,10.687214,2.118639,2.600198,15.853796
2012,2.398873,10.534042,2.045509,2.560677,15.567038
2011,2.434378,10.37806,1.978083,2.493194,15.224555
2010,2.510143,10.185836,1.84628,2.360183,14.992053
2009,2.50739,10.010687,1.646432,2.086299,14.617299
2008,2.407771,10.137847,1.797347,2.400349,14.997756


## Row Slicing Examples (cont.)

2. Specific country and year:

In [29]:
wdi.loc[("United States", 2010)]

GovExpend       2.510143
Consumption    10.185836
Exports         1.846280
Imports         2.360183
GDP            14.992053
Name: (United States, 2010), dtype: float64

## Row Slicing Examples (cont.)

3. Multiple countries:

In [30]:
wdi.loc[["United States", "Canada"]]

country,year,GovExpend,Consumption,Exports,Imports,GDP
United States,2017,2.405743,12.019266,2.287071,3.069954,17.348627
United States,2016,2.407981,11.722133,2.219937,2.936004,16.972348
United States,2015,2.37313,11.4098,2.222228,2.881337,16.710459
United States,2014,2.334071,11.000619,2.209555,2.732228,16.242526
United States,2013,2.353381,10.687214,2.118639,2.600198,15.853796
United States,2012,2.398873,10.534042,2.045509,2.560677,15.567038
United States,2011,2.434378,10.37806,1.978083,2.493194,15.224555
United States,2010,2.510143,10.185836,1.84628,2.360183,14.992053
United States,2009,2.50739,10.010687,1.646432,2.086299,14.617299
United States,2008,2.407771,10.137847,1.797347,2.400349,14.997756


## Row Slicing Examples (cont.)

4. Multiple countries AND multiple years:

In [32]:
wdi.loc[(["United States", "Canada"], [2010, 2011, 2012]), :]


country,year,GovExpend,Consumption,Exports,Imports,GDP
United States,2010,2.510143,10.185836,1.84628,2.360183,14.992053
United States,2011,2.434378,10.37806,1.978083,2.493194,15.224555
United States,2012,2.398873,10.534042,2.045509,2.560677,15.567038
Canada,2010,0.347332,0.921952,0.469949,0.500341,1.613543
Canada,2011,0.351887,0.943145,0.492349,0.528227,1.66424
Canada,2012,0.354342,0.961226,0.505969,0.547756,1.693428


## pd.IndexSlice

For more flexible slicing, especially with inner index levels:

In [33]:
# All countries, only years 2005, 2007, 2009
wdi.loc[pd.IndexSlice[:, [2005, 2007, 2009]], :]



country,year,GovExpend,Consumption,Exports,Imports,GDP
Canada,2005,0.303043,0.79439,0.51995,0.447222,1.524608
Germany,2005,0.591184,1.866253,1.1752,1.028094,3.213777
United Kingdom,2005,0.490806,1.578914,0.640088,0.715951,2.403352
United States,2005,2.287022,9.643098,1.431205,2.246246,14.3325
Canada,2007,0.318777,0.864012,0.530453,0.498002,1.596876
Germany,2007,0.605624,1.894219,1.442436,1.213835,3.441356
United Kingdom,2007,0.504549,1.644789,0.7102,0.767699,2.527327
United States,2007,2.351987,10.159387,1.701096,2.455016,15.018268
Canada,2009,0.339686,0.890078,0.440692,0.439796,1.565291
Germany,2009,0.645023,1.908393,1.260525,1.121914,3.283144
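
`pd.IndexSlice` also accepts label ranges. The sketch below uses a hypothetical toy frame and sorts the index first, since range slicing on a MultiIndex generally expects a sorted index. The alias `idx` is just a local shorthand:

```python
import pandas as pd

# Hypothetical toy data, sorted so that label ranges can be sliced
toy = (
    pd.DataFrame(
        {
            "country": ["A", "A", "A", "B", "B", "B"],
            "year": [2009, 2010, 2011, 2009, 2010, 2011],
            "GDP": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
        }
    )
    .set_index(["country", "year"])
    .sort_index()
)

idx = pd.IndexSlice

# Every country, two listed years
picked = toy.loc[idx[:, [2009, 2011]], :]

# One country, a range of years (label slices include both endpoints)
ranged = toy.loc[idx["B", 2010:2011], :]

print(picked)
print(ranged)
```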


## Multi-index Columns

Hierarchical indexing also works for columns!

In [34]:
wdiT = wdi.T  # Transpose: swap rows and columns
wdiT

country,Canada,Canada,Canada,Canada,Canada,Canada,Canada,Canada,Canada,Canada,...,United States,United States,United States,United States,United States,United States,United States,United States,United States,United States
year,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
GovExpend,0.372665,0.364899,0.358303,0.353485,0.351541,0.354342,0.351887,0.347332,0.339686,0.330766,...,2.50739,2.407771,2.351987,2.314957,2.287022,2.267999,2.233519,2.193188,2.112038,2.0405
Consumption,1.095475,1.058426,1.035208,1.011988,0.9864,0.961226,0.943145,0.921952,0.890078,0.889602,...,10.010687,10.137847,10.159387,9.938503,9.643098,9.311431,8.974708,8.698306,8.480461,8.272097
Exports,0.582831,0.576394,0.568859,0.550323,0.51804,0.505969,0.492349,0.469949,0.440692,0.50635,...,1.646432,1.797347,1.701096,1.56492,1.431205,1.335978,1.218199,1.19218,1.213253,1.287739
Imports,0.600031,0.575775,0.575793,0.572344,0.558636,0.547756,0.528227,0.500341,0.439796,0.502281,...,2.086299,2.400349,2.455016,2.395189,2.246246,2.108585,1.892825,1.804105,1.740797,1.790995
GDP,1.868164,1.814016,1.79427,1.782252,1.732714,1.693428,1.66424,1.613543,1.565291,1.612862,...,14.617299,14.997756,15.018268,14.741688,14.3325,13.846058,13.339312,12.968263,12.746262,12.620268


## Slicing Multi-index Columns

In [35]:
wdiT.loc[:, "United States"]

year,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
GovExpend,2.405743,2.407981,2.37313,2.334071,2.353381,2.398873,2.434378,2.510143,2.50739,2.407771,2.351987,2.314957,2.287022,2.267999,2.233519,2.193188,2.112038,2.0405
Consumption,12.019266,11.722133,11.4098,11.000619,10.687214,10.534042,10.37806,10.185836,10.010687,10.137847,10.159387,9.938503,9.643098,9.311431,8.974708,8.698306,8.480461,8.272097
Exports,2.287071,2.219937,2.222228,2.209555,2.118639,2.045509,1.978083,1.84628,1.646432,1.797347,1.701096,1.56492,1.431205,1.335978,1.218199,1.19218,1.213253,1.287739
Imports,3.069954,2.936004,2.881337,2.732228,2.600198,2.560677,2.493194,2.360183,2.086299,2.400349,2.455016,2.395189,2.246246,2.108585,1.892825,1.804105,1.740797,1.790995
GDP,17.348627,16.972348,16.710459,16.242526,15.853796,15.567038,15.224555,14.992053,14.617299,14.997756,15.018268,14.741688,14.3325,13.846058,13.339312,12.968263,12.746262,12.620268


In [36]:
wdiT.loc[:, (["United States", "Canada"], 2010)]

country,United States,Canada
year,2010,2010
GovExpend,2.510143,0.347332
Consumption,10.185836,0.921952
Exports,1.84628,0.469949
Imports,2.360183,0.500341
GDP,14.992053,1.613543
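
The same selections can be tried on a small frame built directly with hierarchical columns. This is a sketch with hypothetical values; `wide` and `cols` are illustrative names:

```python
import pandas as pd

# Hypothetical toy frame with (country, year) column labels
cols = pd.MultiIndex.from_product(
    [["A", "B"], [2010, 2011]], names=["country", "year"]
)
wide = pd.DataFrame(
    [[1.0, 1.1, 2.0, 2.2], [0.5, 0.6, 0.9, 1.0]],
    index=["GDP", "Exports"],
    columns=cols,
)

# Outer label only: the country level is dropped, leaving years as columns
b_only = wide.loc[:, "B"]

# Tuple with a list: both countries, year 2010, both levels kept
y2010 = wide.loc[:, (["A", "B"], 2010)]

# Row label plus a full column tuple: a single value
gdp_b_2011 = wide.loc["GDP", ("B", 2011)]

print(b_only)
print(y2010)
print(gdp_b_2011)
```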


## Resetting the Index

Move index levels back to regular columns:

In [37]:
wdi.reset_index()

,country,year,GovExpend,Consumption,Exports,Imports,GDP
0,Canada,2017,0.372665,1.095475,0.582831,0.600031,1.868164
1,Canada,2016,0.364899,1.058426,0.576394,0.575775,1.814016
2,Canada,2015,0.358303,1.035208,0.568859,0.575793,1.794270
3,Canada,2014,0.353485,1.011988,0.550323,0.572344,1.782252
4,Canada,2013,0.351541,0.986400,0.518040,0.558636,1.732714
...,...,...,...,...,...,...,...
67,United States,2004,2.267999,9.311431,1.335978,2.108585,13.846058
68,United States,2003,2.233519,8.974708,1.218199,1.892825,13.339312
69,United States,2002,2.193188,8.698306,1.192180,1.804105,12.968263
70,United States,2001,2.112038,8.480461,1.213253,1.740797,12.746262
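
`reset_index` can also move just one level or discard the index entirely. A minimal sketch on hypothetical toy data, using the method's `level` and `drop` parameters:

```python
import pandas as pd

# Hypothetical toy data indexed by (country, year)
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [2010, 2011, 2010, 2011],
        "GDP": [1.0, 1.1, 2.0, 2.2],
    }
).set_index(["country", "year"])

flat = toy.reset_index()                 # both levels become columns
partial = toy.reset_index(level="year")  # only "year" moves; "country" stays the index
dropped = toy.reset_index(drop=True)     # index discarded, replaced by 0..n-1

print(list(flat.columns))
print(partial.index.name, list(partial.columns))
print(list(dropped.columns))
```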


## Choosing the Index: Tidy Data Principles

### Guidelines:
1. Each column should have one variable
2. Each row should have one observation

### Index Selection:
- **Row labels (index)**: unique identifier for an observation
- **Column names**: identify one variable
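
One way to check whether a candidate index uniquely identifies observations is the `is_unique` attribute. The sketch below uses hypothetical toy data in which `year` alone repeats but the `(country, year)` pair does not:

```python
import pandas as pd

# Hypothetical toy data: one observation per (country, year) pair
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [2010, 2011, 2010, 2011],
        "GDP": [1.0, 1.1, 2.0, 2.2],
    }
)

by_year = toy.set_index("year")
by_both = toy.set_index(["country", "year"])

print(by_year.index.is_unique)  # year alone repeats across countries
print(by_both.index.is_unique)  # the pair identifies each row
```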

## Context Matters!

### Different goals → Different indices

**Goal 1**: Study GDP/consumption evolution over time
- Index: `[year, country]`

**Goal 2**: Compare countries and variables within a year
- Index: `[country, variable]`
- Columns: `years`
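
As a rough illustration of the two layouts, the sketch below reshapes a hypothetical toy frame. The use of `melt` followed by `pivot` is an assumption made for this example, not something the material above prescribes, and the names `long`, `variable`, and `value` are illustrative:

```python
import pandas as pd

# Hypothetical toy data with two variables
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [2010, 2011, 2010, 2011],
        "GDP": [1.0, 1.1, 2.0, 2.2],
        "Exports": [0.5, 0.6, 0.9, 1.0],
    }
)

# Goal 1 layout: one row per (year, country), one column per variable
goal1 = toy.set_index(["year", "country"]).sort_index()

# Goal 2 layout: one row per (country, variable), one column per year
long = toy.melt(id_vars=["country", "year"], var_name="variable", value_name="value")
goal2 = long.pivot(index=["country", "variable"], columns="year", values="value")

print(goal1)
print(goal2)
```

Both frames contain the same numbers; only the placement of labels on rows versus columns differs.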

## Key Takeaways

- Index enables automatic data alignment
- MultiIndex allows hierarchical row/column structure
- Use `.loc` with tuples and lists for flexible slicing
- `pd.IndexSlice` for advanced inner-level selection
- Choose index based on your analysis goals

## Practice Exercises

1. Experiment with alignment by creating subsets and performing operations
2. Try different `.loc` slicing combinations
3. Practice with `pd.IndexSlice`
4. Reset the index with different parameters, such as `level` and `drop`
5. Set different columns as indices for different analysis goals

## Thank You!

### Questions?