# ARIMA Lab

One of the most common applications of ARIMA models is inventory planning. In this lab, you will analyze weekly Walmart sales data from 2010 to 2012.

```python
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller

import warnings # necessary b/c pandas & statsmodels datetime issue
warnings.simplefilter(action="ignore")
```

## Importing the data

### Data checks: 
After importing, check the following:
- Is there any missing data?
- What are the column datatypes?
- How many observations are there?
- How many unique stores are there?
- How many unique departments are there?
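
As a rough illustration of these checks, the sketch below runs them on a tiny made-up dataframe. The values are invented, and the column names are assumed to match those in `train.csv`; the same calls should carry over to `walmart` once it is loaded.

```python
import pandas as pd

# Toy stand-in for the real file (invented values, assumed column names).
toy = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Dept": [1, 2, 1, 2],
    "Date": ["2010-02-05", "2010-02-05", "2010-02-05", "2010-02-12"],
    "Weekly_Sales": [100.0, 200.0, 300.0, None],
    "IsHoliday": [False, False, False, True],
})

missing_per_column = toy.isnull().sum()  # number of missing values in each column
column_dtypes = toy.dtypes               # datatype of each column
n_observations = toy.shape[0]            # number of rows
n_stores = toy["Store"].nunique()        # number of distinct stores
n_depts = toy["Dept"].nunique()          # number of distinct departments

print(missing_per_column, column_dtypes, n_observations, n_stores, n_depts, sep="\n\n")
```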

```python
walmart = pd.read_csv('datasets/train.csv')
walmart.head(3)
```

### Creating a datetime index:

Convert the `Date` column to datetime, and set it as the index for the dataframe.
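
One possible way to do this, shown on a small made-up dataframe (the name `toy` and its values are purely illustrative):

```python
import pandas as pd

# Toy stand-in with dates stored as strings, as they would be after read_csv.
toy = pd.DataFrame({
    "Date": ["2010-02-05", "2010-02-12", "2010-02-19"],
    "Weekly_Sales": [100.0, 120.0, 90.0],
})

toy["Date"] = pd.to_datetime(toy["Date"])  # convert strings to datetime64 values
toy = toy.set_index("Date")                # use the dates as the row index

print(toy.index)
```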

### Getting weekly sales for store 1:

To simplify our work somewhat, we'll consider **only** sales from store 1, and we'll aggregate sales from all departments in store 1.

Create a new dataframe that contains weekly sales for store 1.

> **Note**: You might break this up into multiple steps, or you might do this in one line.
>
> To aggregate, use **groupby**. We're tallying up all sales that have the same date, _not_ aggregating many dates based on year or month.
>
> The only column you'll need to keep is `Weekly_Sales`. You should get rid of the `Store`, `Dept`, and `IsHoliday` columns in your new dataframe.
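
The filter-then-aggregate pattern described above might look like the following sketch. It uses an invented five-row dataframe with the assumed column names, so the numbers are only there to make the result easy to check by hand.

```python
import pandas as pd

# Toy stand-in: two departments in store 1 over two weeks, plus one row from store 2.
toy = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2],
    "Dept": [1, 2, 1, 2, 1],
    "Date": pd.to_datetime(
        ["2010-02-05", "2010-02-05", "2010-02-12", "2010-02-12", "2010-02-05"]
    ),
    "Weekly_Sales": [100.0, 50.0, 120.0, 30.0, 999.0],
    "IsHoliday": [False, False, True, True, False],
}).set_index("Date")

store1 = toy[toy["Store"] == 1]  # keep only store 1

# Sum sales across departments for each date; selecting only Weekly_Sales
# drops the Store, Dept, and IsHoliday columns.
weekly_sales = store1.groupby(level="Date")[["Weekly_Sales"]].sum()

print(weekly_sales)
```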

---

## Plotting and interpreting a time series

### Generate a line plot:

Generate a line plot for weekly sales, with time on the $x$-axis and sales on the $y$-axis. Make sure the plot has a title, and make sure axes are labeled where appropriate.

### Plotting rolling means:

Smoothing can help us see trends in the data. On one graph, plot the following:

- Weekly sales
- The 4-week rolling mean of weekly sales
- The 13-week rolling mean of weekly sales
    - (This is included because there are 13 weeks in a business quarter!)

Make sure the plot has a title, axis labels where appropriate, and a legend.
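
A minimal sketch of such a plot, using a synthetic weekly series rather than the real data (the series, its scale, and the variable names are assumptions made for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic weekly series standing in for store 1 sales.
rng = np.random.default_rng(0)
dates = pd.date_range("2010-02-05", periods=143, freq="W-FRI")
sales = pd.Series(1_500_000 + 100_000 * rng.standard_normal(143), index=dates)

roll_4 = sales.rolling(4).mean()    # 4-week rolling mean (first 3 values are NaN)
roll_13 = sales.rolling(13).mean()  # 13-week rolling mean (first 12 values are NaN)

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(sales, label="Weekly sales", alpha=0.5)
ax.plot(roll_4, label="4-week rolling mean")
ax.plot(roll_13, label="13-week rolling mean")
ax.set_title("Weekly sales with rolling means (synthetic data)")
ax.set_xlabel("Date")
ax.set_ylabel("Sales")
ax.legend()
```

Each rolling mean starts with missing values because a full window is not available until the 4th or 13th observation.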

> **(BONUS)**: Add vertical lines on the plot to indicate January 2011 and January 2012.

### (Short answer) Describe any trends that you notice:

(Your answer here.)

---

## Autocorrelation and partial autocorrelation

Recall that autocorrelation and partial autocorrelation tell us about how a variable is related to itself at previous lags.

### Plot and interpret the autocorrelation:

Use statsmodels to plot the ACF and PACF. Look at up to **52 lags**. What do you notice? (Your answer can be given in bullet points; full sentences are not required.)

(Your interpretation here.)

---

## Modeling

### Train-test splitting:

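Before fitting a model, we should train-test split. Use the first 90% of observations as training data, and use the remaining 10% as testing data. Remember that we **should not shuffle the data**: the order of observations is what a time series model learns from, so the test set must come after the training set in time.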
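
One way to make an ordered split is by position, as in this sketch on a synthetic series (the names `sales`, `train`, and `test` are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic weekly series standing in for store 1 sales.
rng = np.random.default_rng(0)
dates = pd.date_range("2010-02-05", periods=143, freq="W-FRI")
sales = pd.Series(rng.standard_normal(143), index=dates)

split = int(len(sales) * 0.9)  # position of the first test observation
train = sales.iloc[:split]     # first 90% of observations, in time order
test = sales.iloc[split:]      # remaining 10%, all later than the training data

print(len(train), len(test))
```

Calling `train_test_split` with `shuffle=False` and `test_size=0.1` is an alternative that also preserves the order.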

### Evaluating stationarity:

Use the augmented Dickey-Fuller test to evaluate the weekly sales data, and interpret the result.

---

### Fit and evaluate an AR(1) model:

We'll start with a simple autoregressive model with order 1. In statsmodels, an autoregressive model with order $p=1$ can be implemented by instantiating and fitting an ARIMA model with order $(1,0,0)$.

Instantiate and fit your model on the training data:

### Evaluating the model:

#### Store predictions:

Remember that statsmodels ARIMA models generate predictions for the range between a `start` and an `end` date.

Generate and store predictions for the training and testing data:

#### Mean squared error:

Use the `mean_squared_error` function to identify the MSE on the testing data:
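
The function takes the true values first and the predictions second. A small example with invented numbers, chosen so the result can be checked by hand:

```python
from sklearn.metrics import mean_squared_error

# Invented values: the errors are 1, 0, and -2, so the squared errors are 1, 0, and 4.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
rmse = mse ** 0.5                         # back in the original units

print(mse, rmse)
```

In the lab, the testing data and the stored test predictions would take the place of `y_true` and `y_pred`. Taking the square root is optional, but it expresses the error in sales units rather than squared sales units.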

#### Generate a plot of the training data, testing data, train preds, and test preds:

Create a plot showing the training data, testing data, train preds, and test preds. Make sure there are labels and legends.

> **Note**: You'll be making more similar plots. You might consider writing a function to generate your plots!
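
A plotting helper along these lines could be sketched as follows. The function name `plot_forecast` and its signature are hypothetical, and the "predictions" here are naive placeholders rather than model output, so that the example stays self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_forecast(train, test, train_preds, test_preds, title):
    """Plot actual values and predictions for both splits on one set of axes."""
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.plot(train, label="Train")
    ax.plot(test, label="Test")
    ax.plot(train_preds, label="Train predictions", linestyle="--")
    ax.plot(test_preds, label="Test predictions", linestyle="--")
    ax.set_title(title)
    ax.set_xlabel("Date")
    ax.set_ylabel("Weekly sales")
    ax.legend()
    return ax


# Synthetic data and placeholder predictions, only to exercise the helper.
rng = np.random.default_rng(0)
dates = pd.date_range("2010-02-05", periods=143, freq="W-FRI")
sales = pd.Series(100 + rng.standard_normal(143), index=dates)
train, test = sales.iloc[:128], sales.iloc[128:]
train_preds = train.shift(1)                            # previous week's value
test_preds = pd.Series(train.iloc[-1], index=test.index)  # last training value, repeated

ax = plot_forecast(train, test, train_preds, test_preds, "Placeholder forecast")
```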

#### Interpretation:

How well or how poorly does the model seem to perform? Provide a brief interpretation.

(Your answer here.)

---

### Fit and evaluate an ARMA(2,2) model:

In statsmodels, an autoregressive moving average model with order $p=2$, $q=2$ can be implemented by instantiating and fitting an ARIMA model with order $(2,0,2)$.

Instantiate and fit your model on the training data:

### Evaluating the model:

#### Store predictions:

Generate and store predictions for train and test:

#### Mean squared error:

Find the MSE of the testing data:

#### Generate a plot of the training data, testing data, train preds, and test preds:

As before, plot your data and predictions.

#### Interpretation:

How well or how poorly does the model seem to perform? Provide a brief interpretation.

(Your answer here.)

---

### Fit and evaluate an ARIMA(2,1,2) model:

Instantiate and fit an ARIMA model with order $(2,1,2)$:

### Evaluating the model:

#### Store predictions:

Generate and store predictions for train and test:

#### Mean squared error:

Find the MSE on your testing data:

#### Generate a plot of the training data, testing data, train preds, and test preds:

As before, plot your data and predictions.

#### Interpretation:

How well or how poorly does the model seem to perform? Provide a brief interpretation.

(Your answer here.)

---

### Fit and evaluate an ARIMA(52,0,1) model:

The models above use only a few autoregressive terms, so they cannot look back far enough to capture the yearly patterns that we know exist in weekly data.

Instantiate and fit an ARIMA of order $(52,0,1)$:

### Evaluating the model:

#### Store predictions:

Generate and store predictions for train and test:

#### Mean squared error:

Find the MSE on your testing data:

#### Generate a plot of the training data, testing data, train preds, and test preds:

As before, plot your data and predictions.

#### Interpretation:

How well or how poorly does the model seem to perform? Provide a brief interpretation.

(Your answer here.)

---

## (BONUS) SARIMA Modeling

Because this data is seasonal, a **seasonal** ARIMA model is likely to perform better than the models above.

A SARIMA model has an ARIMA portion which behaves as we expect it to. The S part of a SARIMA model allows us to use seasonal terms. The seasonal part of a SARIMA has order $(P, D, Q)_{m}$. $m$ is the **seasonal period** -- the number of observations per season. $P$, $D$, and $Q$ play the same roles as the $p$, $d$, and $q$ terms in an ARIMA model, but they backshift by $m$: they act on lags $m$, $2m$, and so on, rather than on lags 1, 2, and so on.
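
To make the backshift idea concrete, here is the general form written with the backshift operator $B$ (where $B y_t = y_{t-1}$), following the notation of the Forecasting: Principles and Practice reference listed below. Sign conventions for the moving average terms vary between texts, and the constant term is omitted, so treat this as a sketch rather than the exact parameterization used by statsmodels.

$$\phi_p(B)\,\Phi_P(B^m)\,(1-B)^d\,(1-B^m)^D\,y_t = \theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t$$

For order $(2,0,2)$ with seasonal order $(1,1,1)_{52}$, this becomes:

$$(1-\phi_1 B-\phi_2 B^2)\,(1-\Phi_1 B^{52})\,(1-B^{52})\,y_t = (1+\theta_1 B+\theta_2 B^2)\,(1+\Theta_1 B^{52})\,\varepsilon_t$$

Here $(1-B^{52})\,y_t = y_t - y_{t-52}$ is the seasonal difference: each week is compared with the same week one year earlier.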

You can read more about SARIMA models here:
- [PennState SARIMA notes](https://online.stat.psu.edu/stat510/lesson/4/4.1)
- [Forecasting: Principles and Practice 3rd ed.](https://otexts.com/fpp3/seasonal-arima.html)

Fit and evaluate a SARIMA model with order $(2,0,2)$ and seasonal order $(1,1,1,52)$. How well does it perform?

> **Note**: SARIMA models are implemented in statsmodels as SARIMAX - the 'X' part allows exogenous data to be passed in as well, though we won't specify any.
>
> The seasonal order argument is `seasonal_order`.