# Lag Variables in Pandas Tutorial

This notebook explains how to create lag variables in `pandas`.

The examples use gold and silver price data from `rdatasets`.

### Packages

The documentation for each package used in this tutorial is linked below:
* [pandas](https://pandas.pydata.org/docs/)
* [statsmodels](https://www.statsmodels.org/stable/index.html)
    * [statsmodels.api](https://www.statsmodels.org/stable/api.html#statsmodels-api)

In [1]:
import statsmodels.api as sm
import pandas as pd

## Create initial dataset

The data comes from `rdatasets` and is loaded through the Python package `statsmodels`.

In [2]:
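# Download the GoldSilver dataset (R package AER); this needs network access.
# The dates arrive as the index, so reset_index moves them into a 'date' column.
df = sm.datasets.get_rdataset('GoldSilver', 'AER').data.reset_index().rename(columns={'index': 'date'})
df.info()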

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9132 entries, 0 to 9131
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    9132 non-null   object 
 1   gold    9132 non-null   float64
 2   silver  9132 non-null   float64
dtypes: float64(2), object(1)
memory usage: 214.2+ KB


In [3]:
# 'date' was loaded as strings (dtype object); convert it to datetime values.
df['date'] = pd.to_datetime(df['date'])
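The conversion matters because later steps sort by date and shift by time offsets. As a minimal sketch, the toy frame below (hypothetical rows, deliberately out of order) shows the column becoming a datetime type and then sorting chronologically. The name `toy` is illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data: date strings in the same format as the dataset.
toy = pd.DataFrame({
    "date": ["1978-01-03", "1977-12-30", "1978-01-02"],
    "silver": [229.84, 223.42, 223.42],
})

# Before conversion the column holds plain strings, not datetimes.
was_datetime = pd.api.types.is_datetime64_any_dtype(toy["date"])

toy["date"] = pd.to_datetime(toy["date"])
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["date"])

# With a datetime column, sorting is chronological.
toy_sorted = toy.sort_values("date").reset_index(drop=True)
print(was_datetime, is_datetime)
print(toy_sorted)
```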

## Lag Variables

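Create lag variables with the `shift` method. `shift(1)` lags a column by one record and `shift(5)` lags it by five, so each row holds the value observed that many rows earlier. The data must be sorted by date first so that "earlier row" means "earlier in time". The first rows have no earlier record and are filled with `NaN`.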

In [4]:
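# Sort chronologically so that shifting by rows means shifting back in time.
df.sort_values('date', inplace=True)
df['silver_lag_1'] = df['silver'].shift(1)  # value from 1 row earlier
df['silver_lag_5'] = df['silver'].shift(5)  # value from 5 rows earlier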

In [5]:
df.head(10)

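|   | date | gold | silver | silver_lag_1 | silver_lag_5 |
|---|------|------|--------|--------------|--------------|
| 0 | 1977-12-30 | 100.0 | 223.42 | NaN | NaN |
| 1 | 1978-01-02 | 100.0 | 223.42 | 223.42 | NaN |
| 2 | 1978-01-03 | 100.0 | 229.84 | 223.42 | NaN |
| 3 | 1978-01-04 | 100.0 | 224.58 | 229.84 | NaN |
| 4 | 1978-01-05 | 100.0 | 227.99 | 224.58 | NaN |
| 5 | 1978-01-06 | 100.0 | 227.19 | 227.99 | 223.42 |
| 6 | 1978-01-09 | 101.23 | 229.62 | 227.19 | 223.42 |
| 7 | 1978-01-10 | 100.95 | 228.97 | 229.62 | 229.84 |
| 8 | 1978-01-11 | 102.25 | 231.22 | 228.97 | 224.58 |
| 9 | 1978-01-12 | 100.88 | 227.89 | 231.22 | 227.99 |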
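To make the row-based behaviour easier to see, here is a minimal sketch on a toy series (hypothetical values, not taken from the dataset). It lags the series by one and two rows and then drops the rows that have no complete history, which is one common way to handle the leading `NaN` values before modelling.

```python
import pandas as pd

# Hypothetical toy prices, already in chronological order.
s = pd.Series([10.0, 11.0, 12.0, 13.0], name="price")

lagged = pd.DataFrame({
    "price": s,
    "lag_1": s.shift(1),  # value from the previous row
    "lag_2": s.shift(2),  # value from two rows back
})
print(lagged)

# The first n rows of an n-row lag are NaN; dropna keeps only complete rows.
complete = lagged.dropna()
print(complete)
```

Whether to drop or keep the incomplete rows depends on the analysis. The sketch only shows where the gaps appear.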


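This creates a lag based on the preceding rows, regardless of how much time separates them. `shift` can also take a time offset through its `freq` argument, which lags by elapsed time instead of by row count. For example, **1D** and **5D** lag by 1 and 5 calendar days respectively. A row receives `NaN` when no observation exists exactly that many days earlier, which happens around weekends in this dataset.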

Shifting by a time offset requires a datetime index, so first set **date** as the index.

In [6]:
df.set_index('date', inplace=True)
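Before shifting by time, it can help to confirm that the index has the expected properties. The sketch below uses a hypothetical toy frame (the name `toy` and its values are illustrative) and checks three things: that the index is a `DatetimeIndex`, that it is sorted, and that it has no duplicate dates. Duplicates would make the later alignment on date ambiguous.

```python
import pandas as pd

# Hypothetical toy frame with a datetime column.
toy = pd.DataFrame({
    "date": pd.to_datetime(["1978-01-02", "1978-01-03", "1978-01-04"]),
    "silver": [223.42, 229.84, 224.58],
})
toy = toy.set_index("date")

# Properties worth checking before a time-based shift.
has_datetime_index = isinstance(toy.index, pd.DatetimeIndex)
is_sorted = toy.index.is_monotonic_increasing
is_unique = toy.index.is_unique
print(has_datetime_index, is_sorted, is_unique)
```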

In [7]:
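# shift(freq=...) moves the index forward by the offset; the assignment then aligns
# on date, so each row gets the value recorded exactly 1 (or 5) calendar days earlier.
df['silver_lag_1d'] = df['silver'].shift(freq='1D')
df['silver_lag_5d'] = df['silver'].shift(freq='5D')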

In [8]:
df.head(10)

| date | gold | silver | silver_lag_1 | silver_lag_5 | silver_lag_1d | silver_lag_5d |
|------|------|--------|--------------|--------------|---------------|---------------|
| 1977-12-30 | 100.0 | 223.42 | NaN | NaN | NaN | NaN |
| 1978-01-02 | 100.0 | 223.42 | 223.42 | NaN | NaN | NaN |
| 1978-01-03 | 100.0 | 229.84 | 223.42 | NaN | 223.42 | NaN |
| 1978-01-04 | 100.0 | 224.58 | 229.84 | NaN | 229.84 | 223.42 |
| 1978-01-05 | 100.0 | 227.99 | 224.58 | NaN | 224.58 | NaN |
| 1978-01-06 | 100.0 | 227.19 | 227.99 | 223.42 | 227.99 | NaN |
| 1978-01-09 | 101.23 | 229.62 | 227.19 | 223.42 | NaN | 224.58 |
| 1978-01-10 | 100.95 | 228.97 | 229.62 | 229.84 | 229.62 | 227.99 |
| 1978-01-11 | 102.25 | 231.22 | 228.97 | 224.58 | 228.97 | 227.19 |
| 1978-01-12 | 100.88 | 227.89 | 231.22 | 227.99 | 231.22 | NaN |
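The Monday row (1978-01-09) has no one-day lag because no observation exists for the preceding Sunday. The sketch below reproduces this on a toy series (hypothetical values) that skips a weekend, comparing a row-based lag with a calendar-day lag. It also shows a business-day offset, `freq='B'`, as one possible alternative. That choice is an assumption for illustration and is not used in the original notebook.

```python
import pandas as pd

# Hypothetical prices for Thu, Fri, Mon, Tue; the weekend is absent.
idx = pd.to_datetime(["1978-01-05", "1978-01-06", "1978-01-09", "1978-01-10"])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx, name="price")

out = pd.DataFrame({"price": s})
out["lag_1"] = s.shift(1)            # previous row, whatever its date
out["lag_1d"] = s.shift(freq="1D")   # exactly one calendar day earlier
out["lag_1b"] = s.shift(freq="B")    # one business day earlier (assumed alternative)
print(out)

monday = pd.Timestamp("1978-01-09")
print(out.loc[monday])
```

On Monday the calendar-day lag is `NaN`, while the row-based and business-day lags both pick up Friday's value. Note that the business-day offset skips weekends only. It does not know about market holidays.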
