# 9 ARIMA Models

- Exponential Smoothing and ARIMA models are the two most widely used approaches to time series forecasting.
- They provide complementary approaches to the problem:
   - Exponential Smoothing models are based on description of the trend and seasonality in the data
   - ARIMA models aim to describe the autocorrelations in the data
   
- ARIMA: AutoRegressive Integrated Moving Average

**Contents**

- Stationarity and differencing
- Backshift notation
- Autoregressive models
- Moving average models
- Non-seasonal ARIMA models
- Estimation and order selection
- Forecasting
- Seasonal ARIMA models
- ARIMA vs ETS
   

## 9.1 Stationarity and Differencing

### 9.1.1 Stationarity

- A time series is said to be stationary **if its statistical properties (mean, variance, autocorrelation) remain constant over time**. 
- In other words, if the data looks the same at any point in time, it is considered stationary.

**Types of Stationarity**:

- **Strict Stationarity**: A time series is strictly stationary if all its statistical properties remain constant over time, including the joint distribution of any subset of observations.
- **Weak Stationarity**: A time series is weakly stationary if its **first two moments (mean and variance) remain constant over time**, and its autocovariance function depends only on the time lag between observations, not on the specific time points.


**Why Stationarity Matters**:

- Many statistical methods used in time series analysis assume stationarity.
- Stationarity makes it easier to model and forecast time series data.
- Non-stationary data can be made stationary through differencing.

**Remarks**

- **Time series with trends, or with seasonality, are not stationary**: the trend and seasonality will affect the value of the time series at different times.
- On the other hand, a white noise series is stationary.
- A time series with cyclic behaviour (but with no trend or seasonality) is stationary.
   - This is because the cycles are not of a fixed length, so before we observe the series we cannot be sure where the peaks and troughs of the cycles will be.

### 9.1.2 Differencing

- Differencing is **a transformation applied to a time series to make it stationary**. 
- **It involves taking the difference between consecutive observations**.

**Types of Differencing**:

- **First-order differencing**: Taking the difference between consecutive observations.
- Second-order differencing: Taking the difference between consecutive **first differences**.
- Higher-order differencing: Taking the difference between consecutive **differences of the previous order**.

**When to Use Differencing**:

- When the time series shows a clear trend or seasonality.
- When the autocorrelation function decays slowly or has a significant positive or negative value at large lags.


**Example**:

- Consider a time series that shows a clear upward trend. First-order differencing can be applied to remove the trend, making the series stationary.
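
The effect of first- and second-order differencing can be sketched in a few lines of Python. The two series below are hypothetical (a linear trend and a quadratic trend), and `difference` is an illustrative helper name, not a library function.

```python
def difference(series, order=1):
    """Apply first-order differencing `order` times."""
    result = list(series)
    for _ in range(order):
        # Each pass replaces the series by the changes between neighbours.
        result = [b - a for a, b in zip(result, result[1:])]
    return result

# Hypothetical series with a linear upward trend: y_t = 10 + 3t
trend = [10 + 3 * t for t in range(8)]
first_diff = difference(trend, order=1)

# Hypothetical series with a quadratic trend: y_t = t^2
quadratic = [t ** 2 for t in range(8)]
second_diff = difference(quadratic, order=2)

print(first_diff)   # constant differences: the linear trend is gone
print(second_diff)  # second differences of a quadratic trend are constant
```

Each round of differencing shortens the series by one observation, which is one reason to difference no more often than needed.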

**Remarks**
- Transformations such as logarithms can help to stabilise the variance of a time series. 
- Differencing can help stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.
- As well as the time plot of the data, the **ACF plot is also useful for identifying non-stationary time series**.

  - Key indicators in ACF plots:
    - **Decay rate**: How quickly the correlation between observations diminishes as the lag increases.
    - **Pattern**: The shape or pattern of the ACF plot.
  - **Stationary time series**:
    - **Rapid decay**: The ACF plot typically shows a rapid decay, meaning the correlation between observations decreases quickly as the lag increases.
    - **No clear pattern**: There is no discernible pattern and there are no significant spikes at higher lags.
  - **Non-stationary time series**:
    - **Slow decay**: The ACF plot often shows a slow decay, indicating a persistent correlation between observations and past values.
    - **Clear pattern**: There may be a clear pattern, such as a repeating (seasonal) pattern.
- **A large and positive $r_1$ value** indicates a strong positive correlation between an observation and its immediate past, which is a common characteristic of **non-stationary time series**.

  - The first-order autocorrelation coefficient, $r_1$, measures the correlation between a time series observation and its immediate past observation. It is calculated using the following formula:

$$r_1 = \frac{\sum_{t=2}^{T} (y_t - \bar{y})(y_{t-1} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}$$

where:

  - $y_t$: the value of the time series at time $t$
  - $\bar{y}$: the mean of the time series
  - $y_{t-1}$: the value of the time series at time $t-1$
  - $T$: the number of observations
  - $\sum$: the summation operator
  - Example:
    - Consider a time series with the following values: 1, 2, 3, 4, 5.
    - Mean: (1+2+3+4+5)/5 = 3
    - Deviations from the mean: -2, -1, 0, 1, 2
    - Products of consecutive deviations: (-2)(-1), (-1)(0), (0)(1), (1)(2) = 2, 0, 0, 2
    - Sum of products: 2 + 0 + 0 + 2 = 4
    - Sum of squared deviations: (-2)² + (-1)² + 0² + 1² + 2² = 10
    - $r_1$: 4 / 10 = 0.4
- **Note**: A value of $r_1$ close to 1 or -1 indicates a strong positive or negative correlation, respectively. A value close to 0 indicates a weak or no correlation.
- **In practice, it is almost never necessary to go beyond second-order differences.**
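
The worked lag-1 autocorrelation example above can be checked with a short Python sketch. The function name `lag1_autocorrelation` is an illustrative choice; the calculation follows the formula given in this section.

```python
def lag1_autocorrelation(series):
    """Lag-1 autocorrelation: consecutive deviation products over squared deviations."""
    n = len(series)
    mean = sum(series) / n
    deviations = [y - mean for y in series]
    # Numerator: products of each deviation with the previous one (t = 2..T).
    numerator = sum(prev * curr for prev, curr in zip(deviations, deviations[1:]))
    # Denominator: sum of squared deviations over the whole series (t = 1..T).
    denominator = sum(d * d for d in deviations)
    return numerator / denominator

r1 = lag1_autocorrelation([1, 2, 3, 4, 5])
print(r1)  # 0.4
```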


### 9.1.3 Seasonal Differencing

- Seasonal differencing is a technique used in time series analysis to remove seasonal patterns from the data. 
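- It involves subtracting from each observation the observation from the same season in the previous cycle, typically one year earlier: $y'_t = y_t - y_{t-m}$, where $m$ is the number of periods in a season (for example, $m = 12$ for monthly data with a yearly pattern).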

**How Seasonal Differencing Helps Make Time Series Stationary**:

- Removes Seasonal Patterns: Seasonal differencing directly removes the seasonal component from the time series. This is particularly useful when there are recurring patterns that occur at regular intervals (e.g., monthly, quarterly, yearly).
- Makes Data More Stationary: By eliminating the seasonal component, the remaining series is often more likely to exhibit stationarity. This is because the seasonal patterns can introduce non-stationarity, as the mean and variance of the series may vary over time due to these recurring patterns.
- Improves Forecasting: Stationarity is a desirable property for many time series forecasting models. By making the data more stationary, seasonal differencing can improve the accuracy of forecasts.

**Example**

- If you have monthly data and want to remove yearly seasonality, you would subtract the value for this month last year from the value for this month this year. This would create a new series that is seasonally differenced.

**Note**: In some cases, multiple levels of differencing may be necessary to achieve stationarity. For example, if a time series has both a trend and seasonality, first differencing can remove the trend, and then seasonal differencing can remove the remaining seasonal pattern.
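
As a minimal sketch of seasonal differencing, the snippet below uses hypothetical quarterly data (period 4) to keep the output short; for the monthly example above the period would be 12. The helper name `seasonal_difference` is an assumption made for illustration.

```python
def seasonal_difference(series, period):
    """Return y_t - y_{t-period} for every t where both values exist."""
    return [series[t] - series[t - period] for t in range(period, len(series))]

# Hypothetical quarterly data: a fixed seasonal pattern repeated over 3 years,
# with the overall level rising by 4 each year.
pattern = [10, 20, 15, 5]
quarterly = [value + 4 * year for year in range(3) for value in pattern]

seasonally_differenced = seasonal_difference(quarterly, period=4)
print(seasonally_differenced)  # [4, 4, 4, 4, 4, 4, 4, 4]
```

The seasonal pattern cancels out and only the year-on-year change remains. The first `period` observations are lost because they have no earlier counterpart.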

### 9.1.4 Unit Root Tests

- Unit root tests are statistical hypothesis tests used to determine whether a time series is non-stationary, and therefore whether differencing is required.
- **A unit root in a time series indicates that the series is non-stationary**. This means that the statistical properties of the series (mean, variance, autocorrelation) change over time.

**Common Unit Root Tests**:
  - **Augmented Dickey-Fuller (ADF) test**: tests for the presence of a unit root.
    - The null hypothesis is that the time series has a unit root (is non-stationary), so a small p-value is evidence of stationarity.
  - **Phillips-Perron (PP) test**: also takes a unit root as the null hypothesis.
  - **KPSS test**: the null hypothesis is that the series is stationary, so a small p-value suggests that differencing is required.
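
To make the idea of a unit root test concrete, here is a rough sketch of the simple (non-augmented) Dickey-Fuller regression, written with the standard library only. It regresses the change in the series on its lagged level and reports the t-statistic of the slope. The critical value used is the approximate large-sample 5% value for a regression with a constant and no trend. This is an illustration, not a replacement for a library implementation of the ADF test, which also adds lagged differences and computes p-values.

```python
import math
import random

def dickey_fuller_statistic(series):
    """t-statistic for gamma in: (y_t - y_{t-1}) = alpha + gamma * y_{t-1} + e_t."""
    x = series[:-1]                                   # lagged level y_{t-1}
    dy = [b - a for a, b in zip(series, series[1:])]  # change y_t - y_{t-1}
    n = len(x)
    x_mean, dy_mean = sum(x) / n, sum(dy) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (di - dy_mean) for xi, di in zip(x, dy))
    gamma = sxy / sxx
    alpha = dy_mean - gamma * x_mean
    residuals = [di - alpha - gamma * xi for xi, di in zip(x, dy)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)
    return gamma / math.sqrt(sigma2 / sxx)

rng = random.Random(0)
white_noise = [rng.gauss(0, 1) for _ in range(500)]
random_walk, level = [], 0.0
for shock in white_noise:
    level += shock               # a random walk accumulates the shocks
    random_walk.append(level)

CRITICAL_5PCT = -2.86  # approximate value; the statistic is not t-distributed
for name, data in [("white noise", white_noise), ("random walk", random_walk)]:
    stat = dickey_fuller_statistic(data)
    verdict = "reject unit root" if stat < CRITICAL_5PCT else "cannot reject unit root"
    print(f"{name}: t = {stat:.2f} -> {verdict}")
```

A strongly negative statistic is evidence against a unit root; a statistic above the critical value means the unit root cannot be ruled out.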
    

## Not here -- -- -- -- --

**Null Hypothesis and Alternative Hypothesis**


**Null Hypothesis (H0)**:

- A statement of "no effect" or "no difference."
- It is the **default assumption that there is no significant relationship or difference between variables**.
- **In statistical testing, the goal is to gather evidence to reject or fail to reject this null hypothesis**.


**Alternative Hypothesis (H1)**:

- A statement that contradicts the null hypothesis.
- It is the claim that there is a significant relationship or difference between variables.
- It is typically the claim that the researcher is seeking evidence for.


**Example:**

- H0: There is no difference in average test scores between students who study for 5 hours and those who study for 10 hours.
- H1: Students who study for 10 hours have a higher average test score than those who study for 5 hours.

**In statistical testing**:

- We assume the null hypothesis is true until there is sufficient evidence to reject it.
- If the evidence is strong enough, we reject the null hypothesis in favor of the alternative hypothesis.
- If the evidence is not strong enough, we fail to reject the null hypothesis.


**Key Points**:

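- The null hypothesis is usually a statement of "no effect" or "no difference." In unit root testing it is the specific statement being tested, such as "the series has a unit root" (ADF) or "the series is stationary" (KPSS).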
- The alternative hypothesis is the opposite of the null hypothesis.
- The goal of statistical testing is to gather evidence to either reject or fail to reject the null hypothesis.   

## -- -- -- -- --


## 9.3 Autoregressive (AR) Models

- **Autoregressive (AR) models are a class of statistical models used to describe the relationship between a variable and its own past values**. 

- **They assume that the current value of a variable is a linear function of its past values, plus a random error term.**


**AR(p) Model**:

An AR(p) model is an autoregressive model of order p. It is defined as follows:

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \varepsilon_t$$

where:

- $Y_t$: The current value of the variable at time t
- $c$: A constant term
- $\phi_1, \phi_2, \dots, \phi_p$: The model coefficients
- $Y_{t-1}, Y_{t-2}, \dots, Y_{t-p}$: The past values of the variable
- $\varepsilon_t$: A random error term


**Key Points**:

- The **order p determines the number of lagged values used to predict the current value**.
- The coefficients φ1, φ2, ..., φp measure the strength and direction of the relationship between the current value and its past values.
- The random error term εt captures the unexplained variation in the data.


**Example**:

- An AR(1) model would be:

$$Y_t = c + \phi_1 Y_{t-1} + \varepsilon_t$$

- This means that the current value $Y_t$ is a constant plus a multiple of the previous value $Y_{t-1}$ plus a random error term.
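
The AR(1) equation can be explored by simulating it and then recovering the parameters. The sketch below uses hypothetical values (c = 2.0, φ1 = 0.6) and estimates them by ordinary least squares on the pairs (previous value, current value); the function names are illustrative, and least squares is used here only because it is easy to write by hand.

```python
import random

def simulate_ar1(c, phi, n, sigma=1.0, seed=0):
    """Generate n values of Y_t = c + phi * Y_{t-1} + e_t with Gaussian errors."""
    rng = random.Random(seed)
    y = c / (1 - phi)  # start at the long-run mean (assumes |phi| < 1)
    series = []
    for _ in range(n):
        y = c + phi * y + rng.gauss(0, sigma)
        series.append(y)
    return series

def fit_ar1_ols(series):
    """Least-squares estimates of c and phi from consecutive pairs."""
    x, y = series[:-1], series[1:]
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    phi_hat = sxy / sxx
    c_hat = y_mean - phi_hat * x_mean
    return c_hat, phi_hat

series = simulate_ar1(c=2.0, phi=0.6, n=5000)
c_hat, phi_hat = fit_ar1_ols(series)
print(f"c_hat = {c_hat:.2f}, phi_hat = {phi_hat:.2f}")
```

With a long simulated series the estimates should land close to the values used to generate the data.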

**Choosing the Order (p)**:

- ACF and PACF Plots: Analyzing the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots can help determine the appropriate order for an AR model.

- Information Criteria: Metrics like Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) can be used to compare models with different orders and select the best-fitting one.


**Applications of AR Models**:

- **Time Series Forecasting**: AR models are commonly used to forecast future values of a time series.
- **Financial Modeling**: They are used in financial modeling to analyze stock prices, exchange rates, and other financial time series.
- **Econometrics**: AR models are applied in econometrics to study economic variables and their relationships.

### 9.3.1 Using ACF and PACF Plots to Determine AR Model Order

**Understanding ACF and PACF Plots**

- **Autocorrelation Function (ACF)**: Measures the correlation between a time series observation and its lagged values.

- **Partial Autocorrelation Function (PACF)**: Measures the direct correlation between a time series observation and a lagged value, controlling for the effects of intervening lags.

- **Must watch** https://www.youtube.com/watch?v=MxIfXbeP_Yw

**Interpreting ACF and PACF Plots for AR Models**:

- **Finding the order p for AR models**:
  - ACF: Typically decays gradually toward zero, usually starting from a significant spike at lag 1, rather than cutting off. The decay can take several forms:
    - it can decrease toward zero from a positive first value;
    - it can rise toward zero from a large negative first value;
    - it can alternate in sign (zigzag) while shrinking toward zero.
  - PACF: Shows significant spikes up to lag p and then cuts off, with no significant spikes beyond lag p.

- **Must watch** https://www.youtube.com/watch?v=_nSvoCkodS8&t=11s 
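
The ACF and PACF patterns for an AR model can be reproduced numerically. The sketch below computes the sample ACF directly and obtains the PACF from it with the Durbin-Levinson recursion, then applies both to a simulated AR(1) series with a hypothetical coefficient of 0.7. Function names and simulation settings are assumptions made for illustration.

```python
import random

def acf(series, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

def pacf(series, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    rho = acf(series, max_lag)
    result = [1.0]
    phi_prev = []  # AR coefficients of the order k-1 fit
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_curr = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_curr.append(phi_kk)
        result.append(phi_kk)
        phi_prev = phi_curr
    return result

# Simulated AR(1) series: Y_t = 0.7 * Y_{t-1} + e_t (hypothetical coefficient)
rng = random.Random(1)
y, series = 0.0, []
for _ in range(5000):
    y = 0.7 * y + rng.gauss(0, 1)
    series.append(y)

sample_acf = acf(series, 5)
sample_pacf = pacf(series, 5)
print("ACF :", [round(v, 2) for v in sample_acf])
print("PACF:", [round(v, 2) for v in sample_pacf])
```

For this series the ACF should shrink gradually with the lag, while the PACF should be large at lag 1 and close to zero afterwards, which points to p = 1.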

## -- -- -- -- --
## 9.4 Moving Average Models

- A moving average model is used for forecasting future values, while moving average smoothing is used for estimating the trend-cycle of past values.

**Moving Average (MA) models** are another class of time series models that assume the current value of a variable is a **linear combination of past error terms**. 
- They are often used in conjunction with Autoregressive (AR) models to form ARMA models.

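  - MA(q) Model: An MA(q) model is a Moving Average model of order q. It is defined as follows:

$$Y_t = \theta_0 + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}$$

where:

  - $Y_t$: The current value of the variable at time t
  - $\theta_0$: A constant term
  - $\varepsilon_t, \varepsilon_{t-1}, \dots, \varepsilon_{t-q}$: The current and past error terms
  - $\theta_1, \theta_2, \dots, \theta_q$: The model coefficients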


**Key Points**:

- The order q determines the number of past error terms used to predict the current value.
- The coefficients θ1, θ2, ..., θq measure the influence of past errors on the current value.

**Moving Average Smoothing**:

Moving Average smoothing is a technique used to create a smoother version of a time series. 
  - It involves calculating the average of a specified number of consecutive observations.
  - It is used for estimating the trend-cycle of past values.

**Difference between MA Models and Moving Average Smoothing**:

- **Purpose**: MA models are used to model the relationship between a variable and its past errors, while moving average smoothing is used to smooth out noise and identify trends in the data.
- **Calculation**: MA models use specific coefficients to weight past errors, while moving average smoothing simply takes the average of a fixed number of observations.
- **Interpretation**: MA models provide a statistical model for the underlying process, while moving average smoothing is a data-driven technique for visualization and analysis.


**Invertible Nature of MA Models**:

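- An MA model is said to be invertible if it can be expressed as an equivalent AR model of infinite order, AR(∞), whose coefficients shrink toward zero.
- This means that the current error term can be recovered from the current and past values of the series.

  - **Invertibility**: If an MA model is invertible, it can be rewritten as an AR model. For an MA(1) model the condition is $|\theta_1| < 1$, so that recent observations carry more weight than distant ones.
  - **Non-Invertibility**: If an MA model is not invertible, the AR(∞) weights do not shrink and it cannot be expressed as a convergent AR model.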
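
Invertibility can be illustrated with a small experiment, sketched below for an MA(1) model with no constant term (an assumption made to keep the code short). The errors are rebuilt from the observed series using e_t = y_t - θ e_{t-1}, starting from a guess of zero for the unobserved initial error. The helper names are hypothetical.

```python
import random

def recover_errors(series, theta):
    """Rebuild MA(1) errors with e_t = y_t - theta * e_{t-1}, starting from 0."""
    recovered, previous = [], 0.0
    for y in series:
        previous = y - theta * previous
        recovered.append(previous)
    return recovered

def final_recovery_error(theta, n=60, seed=0):
    """Gap between the true and the rebuilt error at the last time point."""
    rng = random.Random(seed)
    errors = [rng.gauss(0, 1) for _ in range(n + 1)]  # errors[0] is never observed
    series = [errors[t] + theta * errors[t - 1] for t in range(1, n + 1)]
    recovered = recover_errors(series, theta)
    return abs(recovered[-1] - errors[-1])

print(final_recovery_error(theta=0.5))  # invertible: the start-up guess is forgotten
print(final_recovery_error(theta=2.0))  # non-invertible: the start-up error grows
```

The mistake made by guessing the initial error is multiplied by -θ at every step, so it dies out when |θ| < 1 and grows without bound when |θ| > 1.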

### 9.4.1 Determining the Order (q) of a Moving Average (MA) Model

To find the appropriate value of q, you can use the following methods:
- **1. ACF and PACF Plots**
   - ACF: For an MA(q) model, the ACF cuts off after lag q, with no significant spikes beyond lag q.
   - PACF: For an MA(q) model, the PACF decays exponentially or in a damped sinusoidal pattern.
   - **Must watch** https://www.youtube.com/watch?v=a0BVTH86JrI
   - Steps:

      - **Examine the ACF plot: Identify the lag at which the ACF cuts off or becomes negligible**.
      - **Examine the PACF plot: Look for a clear exponential decay in the PACF**.
      - Compare ACF and PACF: If the ACF cuts off after lag q and the PACF decays exponentially, then an MA(q) model is likely appropriate.

- **2. Information Criteria**
  - **AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion)**: These metrics can be used to compare models with different orders and select the best-fitting one.
  - **Lower AIC or BIC**: Generally, a lower AIC or BIC value indicates a better-fitting model.
- **3. Subject Matter Knowledge**
  - Consider the underlying process and any known characteristics of the data when selecting the order.
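
The ACF cut-off used in the first method can be seen on simulated data. The sketch below generates an MA(1) series with a hypothetical coefficient θ1 = 0.8 and no constant, then compares the sample ACF with the theoretical lag-1 value θ1 / (1 + θ1²). Names and settings are illustrative assumptions.

```python
import random

def acf(series, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

theta = 0.8
rng = random.Random(42)
errors = [rng.gauss(0, 1) for _ in range(20001)]
# MA(1): each value is the current error plus theta times the previous error.
series = [errors[t] + theta * errors[t - 1] for t in range(1, len(errors))]

sample_acf = acf(series, max_lag=4)
theoretical_r1 = theta / (1 + theta ** 2)  # lag-1 autocorrelation of an MA(1)

print("theoretical r1:", round(theoretical_r1, 3))
print("sample ACF    :", [round(v, 3) for v in sample_acf])
```

The sample ACF should be close to the theoretical value at lag 1 and close to zero from lag 2 onward, which is the cut-off after lag q = 1.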

## 9.5 Non-Seasonal ARIMA Models

**ARIMA (AutoRegressive Integrated Moving Average)** models are a class of statistical models used to represent time series data. 
- They **combine Autoregressive (AR) and Moving Average (MA) components to capture both autoregressive and moving average effects in the data**.


**ARIMA(p,d,q) Model**:

  - p: The order of the autoregressive part, representing the number of lagged values of the dependent variable used as predictors.
  - d: The degree of differencing, indicating the number of times the data needs to be differenced to make it stationary.
  - q: The order of the moving average part, representing the number of lagged error terms used as predictors.


**Non-Seasonal ARIMA Model**:

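A non-seasonal ARIMA model assumes that there is no seasonal component in the data. Writing $Y'_t$ for the series after it has been differenced $d$ times, the model is defined as:

$$Y'_t - \phi_1 Y'_{t-1} - \dots - \phi_p Y'_{t-p} = \theta_0 + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$$

where:

   - $Y'_t$: The differenced series at time t (equal to $Y_t$ when d = 0)
   - $\phi_1, \phi_2, \dots, \phi_p$: The AR coefficients
   - $\theta_0$: A constant term
   - $\theta_1, \dots, \theta_q$: The MA coefficients
   - $\varepsilon_t$: The error term at time t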
   
 

  
**Steps to Build a Non-Seasonal ARIMA Model**:

- **Stationarity**: Ensure the time series is stationary. If not, apply differencing (d) to make it stationary.
- **Model Identification**: Use ACF and PACF plots to identify the appropriate values for p and q.
- **Model Estimation**: Estimate the model parameters using a suitable method, such as maximum likelihood estimation.
- **Model Diagnostics**: Assess the model's fit using diagnostic tests, such as the Ljung-Box test for autocorrelation in the residuals.
- **Forecasting**: Use the estimated model to forecast future values of the time series.
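
The diagnostics step mentions the Ljung-Box test. As a rough sketch, the statistic is Q = n(n+2) Σ r_k² / (n-k), summed over the first h lags, and it is compared with a chi-square distribution. The code below computes Q for two simulated series and compares it with the 95th percentile of a chi-square distribution with 10 degrees of freedom. When the test is applied to the residuals of a fitted ARIMA model, the degrees of freedom are usually reduced by the number of estimated AR and MA parameters; that adjustment is left out here.

```python
import random

def ljung_box_q(residuals, lags):
    """Ljung-Box Q statistic over the first `lags` autocorrelations."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, lags + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total

rng = random.Random(7)
white_noise = [rng.gauss(0, 1) for _ in range(500)]

# Hypothetical "bad residuals": an AR(1) series with leftover autocorrelation.
autocorrelated, previous = [], 0.0
for shock in white_noise:
    previous = 0.6 * previous + shock
    autocorrelated.append(previous)

CHI2_95_DF10 = 18.31  # 95th percentile of chi-square with 10 degrees of freedom
for name, data in [("white noise", white_noise), ("autocorrelated", autocorrelated)]:
    q = ljung_box_q(data, lags=10)
    print(f"{name}: Q = {q:.1f} (compare with {CHI2_95_DF10})")
```

A Q value well above the critical value indicates that the residuals still contain autocorrelation, so the model is missing structure.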


**Example**:

- ARIMA(1,1,1): This model combines a
  - first-order autoregressive component,
  - first-order differencing, and a
  - first-order moving average component.

**Key Points**:

- Non-seasonal ARIMA models are suitable for time series data without a clear seasonal pattern.
- The values of p, d, and q are chosen based on the characteristics of the data and the model's performance.
- Model diagnostics are crucial to ensure the model's validity and accuracy.

**AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion)** are commonly used metrics to compare the fit of statistical models, including ARIMA models.

**Lower is better**.

- AIC and BIC penalize models with more parameters. 
  - This helps to **prevent overfitting**, where a model becomes too complex and fits the training data too closely, potentially leading to poor performance on new data.
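- A lower AIC or BIC value indicates a better balance between model fit and complexity.
- These criteria should only be used to compare models with the same order of differencing d, because differencing changes the data on which the likelihood is computed.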
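
The trade-off between fit and complexity can be made concrete with the usual definitions AIC = -2 log L + 2k and BIC = -2 log L + k log n, where k is the number of estimated parameters and n the number of observations. The sketch below compares a mean-only model with an AR(1) model on simulated data, assuming Gaussian errors and least-squares fits. The data, names and parameter counts (which include the error variance) are illustrative assumptions.

```python
import math
import random

def gaussian_log_likelihood(residuals):
    """Gaussian log-likelihood with the error variance set to its ML estimate."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(log_lik, k):
    return -2 * log_lik + 2 * k

def bic(log_lik, k, n):
    return -2 * log_lik + k * math.log(n)

# Simulated AR(1) data with a hypothetical coefficient of 0.6.
rng = random.Random(3)
y, series = 0.0, []
for _ in range(500):
    y = 0.6 * y + rng.gauss(0, 1)
    series.append(y)

x, target = series[:-1], series[1:]
n = len(target)
x_mean, t_mean = sum(x) / n, sum(target) / n
phi = (sum((a - x_mean) * (b - t_mean) for a, b in zip(x, target))
       / sum((a - x_mean) ** 2 for a in x))
c = t_mean - phi * x_mean

candidates = {
    "mean only": ([b - t_mean for b in target], 2),                 # mean, variance
    "AR(1)": ([b - c - phi * a for a, b in zip(x, target)], 3),     # c, phi, variance
}
scores = {}
for name, (residuals, k) in candidates.items():
    log_lik = gaussian_log_likelihood(residuals)
    scores[name] = (aic(log_lik, k), bic(log_lik, k, n))
    print(f"{name}: AIC = {scores[name][0]:.1f}, BIC = {scores[name][1]:.1f}")
```

Both models are scored on the same observations, which is what makes the two values comparable.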

**Important Notes**
- ACF and PACF are useful for identifying p in ARIMA(p, d, 0) models and q in ARIMA(0, d, q) models
- **When both p and q are positive, the behavior of the ACF and PACF plots can become more complex and intertwined**.
   - Neither the ACF nor the PACF will have a clear cutoff, as seen in pure AR or MA processes.
   - In such cases, it’s harder to directly determine the values of p and q using just the ACF and PACF plots.
   
- Instead, you may need to rely on:
  - **Grid search or information criteria (like AIC, BIC)** to find the best combination of p and q.
  - **Auto ARIMA** (automated selection methods) can help to determine both p and q when both terms are present.

#### Important Notes (Revision)

- The data may follow an ARIMA(p,d,0) model if the ACF and PACF plots of the differenced data show the following patterns:
  - the ACF is exponentially decaying or **sinusoidal**;
  - there is a significant spike at lag p in the PACF, but none beyond lag p.
- The data may follow an ARIMA(0,d,q) model if the ACF and PACF plots of the differenced data show the following patterns:
  - the PACF is exponentially decaying or sinusoidal;
  - there is a significant spike at lag q in the ACF, but none beyond lag q.


| Model Type                | ARIMA Representation                |
|---------------------------|-------------------------------------|
| White noise               | ARIMA(0,0,0) with no constant       |
| Random walk               | ARIMA(0,1,0) with no constant       |
| Random walk with drift     | ARIMA(0,1,0) with a constant        |
| Autoregression            | ARIMA(p,0,0)                       |
| Moving average            | ARIMA(0,0,q)                       |


- Integration in ARIMA models refers to the **reverse process of differencing**. It essentially involves "undoing" the differencing operations to obtain the original time series.
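
A minimal sketch of this "undoing" step: first differences are turned back into the original series by cumulative summation, starting from the first observed value. The data and helper names are hypothetical.

```python
from itertools import accumulate

def difference(series):
    """First-order differences y_t - y_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]

def integrate(differences, first_value):
    """Undo first-order differencing by cumulative summation."""
    return list(accumulate(differences, initial=first_value))

original = [5, 7, 6, 10, 13]
diffs = difference(original)
restored = integrate(diffs, original[0])

print(diffs)     # [2, -1, 4, 3]
print(restored)  # [5, 7, 6, 10, 13]
```

The same idea is used when forecasting: forecasts of the differenced series are accumulated from the last observed value to obtain forecasts on the original scale.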

#### -- -- -- -- --

**ARIMA Model Components: AR, I, and MA**

ARIMA (AutoRegressive Integrated Moving Average) models are a combination of three components: Autoregressive (AR), Integrated (I), and Moving Average (MA). Let's break down what each part does:

- **AR (Autoregressive) Component**
   - Purpose: Models the relationship between the current value of a time series and its own past values.
   - How it works: Uses a linear combination of previous values to predict the current value.
   - Example: An AR(1) model would use the previous value to predict the current value.

- **I (Integrated) Component**
   - Purpose: Makes a non-stationary time series stationary by differencing.
   - How it works: The order of integration (d) indicates how many times the data needs to be differenced to become stationary.
   - Example: If d=1, the first difference is taken. If d=2, the first differences are differenced again.
   
   
- **MA (Moving Average) Component**
  - Purpose: Models the relationship between the current value of a time series and past error terms.
  - How it works: Uses a linear combination of past error terms to predict the current value.
  - Example: An MA(1) model would use the previous error term to predict the current value.

**In summary**:

  - AR: Models the dependence on past values of the time series itself.
  - I: Makes the series stationary by differencing.
  - MA: Models the dependence on past errors.
  - Combining these components gives **ARIMA(p,d,q)**:
    - p: Order of the AR component
    - d: Order of the integrated component
    - q: Order of the MA component
  - By combining these components, ARIMA models can capture a wide range of patterns in time series data, including trends and autocorrelation. Seasonal patterns require the seasonal extension of the model (seasonal ARIMA).
  
#### -- -- -- -- --

**Maximum Likelihood Estimation (MLE)**

- Maximum likelihood estimation (MLE) is a widely used method for estimating the parameters of time series models.
- It seeks to find the set of parameters that maximizes the likelihood of observing the given data.


**Key Steps**:

- **Specify the Model**: Choose a suitable time series model (e.g., ARIMA, SARIMA, GARCH) based on the characteristics of the data.
- **Write the Likelihood Function**: Derive the likelihood function for the chosen model. This function expresses the probability of observing the data given the model parameters.
- **Maximize the Likelihood**: Find the values of the model parameters that maximize the likelihood function. This is usually done with a numerical optimization algorithm.
- **Evaluate the Model**: Assess the goodness of fit of the estimated model using diagnostic tests and information criteria.
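
The steps above can be sketched for the simplest case, a zero-mean AR(1) model with Gaussian errors. The code uses the conditional likelihood (conditioning on the first observation) and a plain grid search in place of a real optimizer; both are simplifying assumptions, and the simulated data and names are hypothetical.

```python
import math
import random

def conditional_log_likelihood(series, phi):
    """Gaussian log-likelihood of a zero-mean AR(1), conditional on the first value.

    The error variance is replaced by its ML estimate for the given phi.
    """
    residuals = [y - phi * y_prev for y_prev, y in zip(series, series[1:])]
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

# Step 1: specify the model and simulate data from it (true phi = 0.6).
rng = random.Random(11)
y, series = 0.0, []
for _ in range(2000):
    y = 0.6 * y + rng.gauss(0, 1)
    series.append(y)

# Steps 2 and 3: evaluate the likelihood on a grid of phi values and keep the best.
grid = [i / 100 for i in range(-99, 100)]
phi_mle = max(grid, key=lambda phi: conditional_log_likelihood(series, phi))

print("estimated phi:", phi_mle)
```

A grid search only works for one or two parameters. For larger models the likelihood is maximized with iterative optimizers, which is where the sensitivity to initial values comes from.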


**Advantages of MLE**:

- **Efficiency**: In large samples, and when the model is correctly specified, MLE gives estimates with the smallest possible variance.
- **Flexibility**: It can be applied to a wide range of time series models.
- **Statistical Inference**: MLE allows for statistical inference, such as hypothesis testing and confidence interval construction.


**Disadvantages of MLE**:

- **Computational Complexity**: MLE can be computationally intensive for complex models or large datasets.
- **Sensitivity to Initial Values**: The optimization algorithm may converge to different solutions depending on the initial parameter values.

