# Demystifying SARIMA Models

Last week I finished a project where I identified the five zip codes in the city of Baltimore, Maryland where an individual could expect the highest return on investment from buying a house, living in it, and reselling it after two years. You can check out my GitHub repository for the project [here](https://github.com/sethchart/Baltimore-Real-Estate-Investment). In that project, I used Seasonal AutoRegressive Integrated Moving Average (SARIMA) models to forecast house prices based on historical data that I collected from Zillow.

In this post I want to take a closer look at these powerful time series models and try to get some intuition for how they work. A SARIMA model is a fairly complex type of time series model that combines several ideas to provide a flexible framework for modeling time series data. I want to build up the full model step by step so we can see how each additional piece contributes to the whole.

## What is a time series?

First things first: what kind of data are we trying to model? Time series data. What does that mean? Let's look at a few examples.

### Example 1 

Suppose that you are using a workout app and every morning you weigh yourself and record your weight in the app. The list of all of your weights is a time series. The list of weights represents the history of your weight over a period of time, measured at regular time intervals. 

### Example 2 

Suppose that you record the total amount of money that you spend at the end of each month. The list of expenditures is a time series. The list represents the history of your spending over a period of time, measured at regular time intervals. 

### Definition 

Given one measurable quantity, a time series is a list of measurements of that quantity taken at regular time intervals.

## Recurrence Relations

A time series is fundamentally a list of numbers. The list below contains weights from my own fitness tracker app, collected every morning from 2020-11-16 through 2020-11-22.

In [2]:
weights = [172.6, 172.6, 172.1, 171.7, 172.6, 171.0, 172.8]
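Since the measurements were taken once per day, each entry can be matched to its date. The snippet below is a minimal sketch of that pairing using only the standard library; the names `start` and `dated_weights` are made up for illustration and are not part of the original notebook.

```python
from datetime import date, timedelta

weights = [172.6, 172.6, 172.1, 171.7, 172.6, 171.0, 172.8]
start = date(2020, 11, 16)  # first morning in the list

# One measurement per day, so entry k was recorded k days after the start.
dated_weights = [(start + timedelta(days=k), w) for k, w in enumerate(weights)]

for day, w in dated_weights:
    print(day.isoformat(), w)
```

The last pair lands on 2020-11-22, which matches the date range described above.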

Mathematicians like to use compact notation for lists like the one above. A mathematician might call the list above $w$. Then, they would refer to the very first number in the list by the name $w_0$, so $w_0 = 172.6$. This is pretty similar to the way we work with this list in Python. If we want the first number in the list, we type `weights[0]`, which returns 172.6. The function below lists all of the numbers in the weights list using both mathematical notation and Python syntax.

In [10]:
from IPython.display import display, Markdown
def print_weights():
    for k in range(len(weights)):
        display(Markdown(f"$w_{k}$ = weights[{k}] = {weights[k]}"))
        
print_weights()

$w_0$ = weights[0] = 172.6

$w_1$ = weights[1] = 172.6

$w_2$ = weights[2] = 172.1

$w_3$ = weights[3] = 171.7

$w_4$ = weights[4] = 172.6

$w_5$ = weights[5] = 171.0

$w_6$ = weights[6] = 172.8

One common way to describe a list of numbers is by providing an equation that allows you to compute a value of the list using previous values of the list. Usually we need to provide the equation and a few values to get the ball rolling. Let's look at a couple of classic examples.

### Example 3
$$a_n = 2a_{n-1} \text{ for } n \ge 1, a_0 = 1.$$

This says that for any whole number $n$ that is greater than zero, the $n$th value in my list is just two times the previous ($(n-1)$st) value in the list. Also we are given a value ($a_0 = 1$) to get the ball rolling. We can compute the first couple of values by hand:

$n = 1$ is greater than zero, so our equation tells us that:
$$a_1 = 2a_0$$
But we were given $a_0 = 1$, so we see that:
$$a_1 = 2(1) = 2$$
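Repeating the same step a couple more times, each new value comes from doubling the one just computed:

$$a_2 = 2a_1 = 2(2) = 4$$
$$a_3 = 2a_2 = 2(4) = 8$$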

But why do math by hand in a Python notebook? Let's compute the values $a_0$ through $a_{10}$ of this list $a$.

In [16]:
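import numpy as np
def compute_sequence(coefs, initial_conditions, n_values):
    """Return the values a_0, ..., a_{n_values} of a linear recurrence."""
    order = len(coefs)
    # Copy so the caller's list of initial conditions is not modified.
    sequence = list(initial_conditions)
    while len(sequence) <= n_values:
        # Most recent value first, so that coefs[0] multiplies a_{n-1}.
        tail = sequence[-order:][::-1]
        new_value = np.dot(tail, coefs)
        sequence.append(new_value)
    return sequence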

compute_sequence(coefs=[2], initial_conditions=[1], n_values=10)

[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

As you can see, the equation above describes the powers of two.
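One way to double-check that observation is to rebuild the list with a plain loop and compare each entry with $2^n$. This is only a small sketch, written so it does not depend on the helper function above; the variable names are chosen here for illustration.

```python
# Rebuild a_0 .. a_10 directly from the rule a_n = 2 * a_{n-1}, a_0 = 1.
a = [1]
for n in range(1, 11):
    a.append(2 * a[n - 1])

# Compare each value with the closed form 2**n.
matches = all(a[n] == 2 ** n for n in range(len(a)))
print(a)
print(matches)
```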

### Example 4
See if you can compute the values $f_0$ through $f_{10}$ of the list described by the equation below.
$$ f_n = f_{n-1} + f_{n-2} \text{ for } n\ge 2, f_0 = 1\; \text{ and } f_1 = 1.$$
Now, check your answer by running the cell below.

In [17]:
compute_sequence(coefs=[1, 1], initial_conditions=[1, 1], n_values=10)

[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

This is a pretty famous example, the Fibonacci numbers. You can find out more [here](https://en.wikipedia.org/wiki/Fibonacci_number).
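If your answer did not match, it may help to see the first few steps written out. Each new value is the sum of the two values before it:

$$f_2 = f_1 + f_0 = 1 + 1 = 2$$
$$f_3 = f_2 + f_1 = 2 + 1 = 3$$
$$f_4 = f_3 + f_2 = 3 + 2 = 5$$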

The equations that we used in the examples above are called **Linear Recurrence Relations**. 

### Definition
A **Linear Recurrence Relation** of *order* $k$ with coefficients $c_1, c_2, \dots, c_{k}$ and *initial conditions* $i_0, i_1, \dots, i_{k-1}$ is an equation of the form

$$ a_n = c_1 a_{n-1} + c_2 a_{n-2} + \dots + c_{k} a_{n-k} \text{ for } n \ge k, \quad a_0 = i_0,\; a_1 = i_1,\; \dots,\; a_{k-1} = i_{k-1}.$$

The equation describes a list of values, commonly called a *sequence*.
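To see how the pieces of the definition line up, here is a minimal pure-Python sketch that follows it term by term. The function name `linear_recurrence` and the example recurrence $a_n = a_{n-1} + 2a_{n-2}$ with $a_0 = 0$, $a_1 = 1$ are my own illustrative choices; the coefficients are deliberately different from each other so that it matters which coefficient multiplies which previous value.

```python
def linear_recurrence(coefs, initial_conditions, n_max):
    """Return [a_0, ..., a_n_max] for a_n = c_1*a_{n-1} + ... + c_k*a_{n-k}."""
    k = len(coefs)
    if len(initial_conditions) != k:
        raise ValueError("need exactly one initial condition per coefficient")
    seq = list(initial_conditions)  # copy, so the input list is left alone
    for n in range(k, n_max + 1):
        # coefs[j] is c_{j+1}, which multiplies a_{n-(j+1)}.
        seq.append(sum(coefs[j] * seq[n - 1 - j] for j in range(k)))
    return seq[: n_max + 1]


# Order 2, coefficients c_1 = 1 and c_2 = 2, initial conditions a_0 = 0, a_1 = 1.
example = linear_recurrence(coefs=[1, 2], initial_conditions=[0, 1], n_max=7)
print(example)  # [0, 1, 1, 3, 5, 11, 21, 43]
```

Calling it with `coefs=[2], initial_conditions=[1]` or `coefs=[1, 1], initial_conditions=[1, 1]` reproduces Examples 3 and 4.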