-----
# Baseline Forecasting
-----

### Overview

In this notebook, I’ll explore four different types of baseline forecasting methods:

**1. Mean Method**

Predicts that future values will be the mean of all past observed values.

**2. Naive Method**

Predicts that the next value will be the same as the last observed value.

**3. Seasonal Naive Method**

Predicts that the next value will be the same as the last observed value from the same seasonal period. This method is effective when there are clear seasonal patterns, as in this dataset, which shows a clear 7-day seasonal pattern.

**4. Drift Method**

Predicts future values by extending the trend line between the first and last observed data points.

These methods will serve as benchmarks for time series forecasting models. They can be used to assess whether more advanced forecasting models, such as ARIMA, actually forecast better or are only as good as the simplest approaches.



## Set Up
---

In [473]:
import numpy as np
import pandas as pd

# plotting
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns

# stats
from statsmodels.api import tsa # time series analysis
import statsmodels.api as sm

# evaluate
from sklearn.metrics import mean_squared_error, mean_absolute_error

## Utility Functions
-----

In [474]:
def plt_forecast(predictions, fc_method):
    """
    Description:
    Plots the training data, validation data (actual), and baseline predictions on a single graph.

    Parameters:
    - predictions : A Series containing the predicted values with date indices.
    - fc_method: A string describing the forecasting method used.

    Output:
    The function creates a plot using Plotly to visualise:
        - Training data
        - Validation data 
        - Baseline forecast predictions

    """
    
    # Plot to visualise the training data, test data and baseline prediction
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=train_df.index, y=train_df['Adj Close'], mode='lines', name="Train"))
    fig.add_trace(go.Scatter(x=val_df.index, y=val_df['Adj Close'], mode='lines', name="Validation"))
    fig.add_trace(go.Scatter(x=predictions.index, y=predictions, mode='lines', name="Baseline Forecast"))

    fig.update_layout(
        yaxis_title='Adjusted Close', 
        xaxis_title='Date',
        title=f'Baseline Forecasting using {fc_method}'
    )
    fig.show()

In [475]:
def fcast_evaluation(predicted, actual):
    """
    Description:
    To evaluate forecasting performance using multiple metrics.

    Parameters:
    predicted: Forecasted values.
    actual: Actual observed values.

    Output:
    A dictionary containing the evaluation metrics:
        - 'MSE': Mean Squared Error
        - 'MAE': Mean Absolute Error
        - 'RMSE': Root Mean Squared Error
        - 'MAPE': Mean Absolute Percentage Error
    """

    err = actual - predicted

    # Calculating MSE
    mse = mean_squared_error(actual, predicted)

    # Calculating MAE
    mae = mean_absolute_error(actual, predicted)

    # Calculating RMSE
    rmse = np.sqrt(mse)

    # Calculating MAPE
    abs_percent_err = np.abs(err/actual)
    mape = abs_percent_err.mean() * 100

    return {'MSE': mse,
            'MAE': mae,
            'RMSE': rmse,
            'MAPE': mape
            }

## Data Loading
---

**Note:** Raw data is used in baseline forecasting to evaluate how well simple methods perform with real-world data.


In [476]:
raw_data = pd.read_csv('../../data/daily_data_clean.csv', index_col=0)

In [477]:
# Filter the data range to the past year for a closer inspection of the prediction
raw_data_filtered = raw_data.loc[(raw_data.index >= '2023-07-29'), ['Adj Close']]

In [478]:
raw_data_filtered.index

Index(['2023-07-29', '2023-07-30', '2023-07-31', '2023-08-01', '2023-08-02',
       '2023-08-03', '2023-08-04', '2023-08-05', '2023-08-06', '2023-08-07',
       ...
       '2024-07-20', '2024-07-21', '2024-07-22', '2024-07-23', '2024-07-24',
       '2024-07-25', '2024-07-26', '2024-07-27', '2024-07-28', '2024-07-29'],
      dtype='object', length=367)

In [479]:
# Splitting of the data into train/val datasets
train_df = raw_data_filtered.loc[raw_data_filtered.index <= "2024-04-29"]
val_df = raw_data_filtered.loc[raw_data_filtered.index > "2024-04-29"]
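
The string comparisons above work because the index holds ISO-formatted `YYYY-MM-DD` strings, which sort in the same order as the dates they represent. A minimal sketch of the same split on a parsed `DatetimeIndex` is shown below. The `toy` frame and its values are hypothetical stand-ins for the CSV, and only the cutoff date is taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: ISO-formatted date strings as the index
dates = pd.date_range("2024-04-25", periods=10, freq="D").strftime("%Y-%m-%d")
toy = pd.DataFrame({"Adj Close": range(100, 110)}, index=dates)

# Parse the string index into a DatetimeIndex
toy.index = pd.to_datetime(toy.index)

# Chronological split on the same cutoff date used in the notebook
cutoff = pd.Timestamp("2024-04-29")
toy_train = toy.loc[toy.index <= cutoff]
toy_val = toy.loc[toy.index > cutoff]

print(len(toy_train), len(toy_val))
```

Parsing the index is optional for this notebook, but it guards against date formats that do not sort correctly as plain strings.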

In [None]:
train_df.to_csv('../../data/train_fcast.csv')

## Mean Forecasting 
----

In [480]:
# Creating baseline predictions, filling array with mean of training set
# Assuming future predictions are equal to the mean of the training set
baseline_pred = np.full(val_df.shape[0], np.mean(train_df['Adj Close']))

In [481]:
# Convert baseline predictions into a pandas Series indexed by the validation dates
mean_predictions = pd.Series(data=baseline_pred, index=val_df.index)

In [482]:
plt_forecast(mean_predictions, fc_method='Mean')

----
**Plot Description:**

Plot shows the baseline forecast where the prediction for every future value is the mean of the training set. Here the assumption is that future values are equal to the average of the historical observations.



## Naive Forecasting
-----

In [483]:
# Creating baseline predictions, filling array with last observed value
# Assuming future predictions are equal to the last observed value in the training set
baseline = np.full(val_df.shape[0], train_df['Adj Close'].iloc[-1])

In [484]:
naive_predictions = pd.Series(data=baseline, index=val_df.index)

In [485]:
plt_forecast(naive_predictions, fc_method='Naive')

----
**Plot Description:**

Plot shows the baseline forecast where the prediction for every future value is the last value of the training set.

Here the assumption is that future values are equal to the last historical observation. 

## Seasonal Naive Forecasting
----

In [486]:
# Empty list to store seasonal naive forecasts
snaive_fcasts = []

In [487]:
# Get the last 7 day values from the training set
# Established seasonality of data to be 7 days (eda notebook)
last_7_days_values = train_df['Adj Close'].iloc[-7:].values

In [488]:
for i in range(len(val_df)):
    # forecasts using values from the last 7 days of the training data
    forecast_value = last_7_days_values[i % 7]
    snaive_fcasts.append(forecast_value)

In [489]:
# The loop above generates forecasts for the validation set from the last 7 days of training data.
# The 7 values are repeated cyclically (i % 7), so each forecast equals the value
# at the same position in the final training week; no rolling window is involved.

In [490]:
snaive_predictions = pd.Series(snaive_fcasts, index=val_df.index, name='Predicted')


In [491]:
plt_forecast(snaive_predictions, 'Seasonal Naive')

----
**Plot Description:**

Plot shows baseline forecasting using the seasonal naive method. Future values are predicted by repeating the seasonality seen in the last 7 days of the training data.
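
The loop used above can also be written without explicit iteration. The sketch below is one possible vectorised form, not the notebook's own code: the helper name `seasonal_naive` and the toy series are hypothetical, and it assumes the training series holds at least one full season.

```python
import numpy as np
import pandas as pd

def seasonal_naive(train, horizon, season_length=7):
    """Repeat the last `season_length` training values across `horizon` steps."""
    last_season = np.asarray(train)[-season_length:]
    # np.resize fills the output with repeated copies of the input array
    return np.resize(last_season, horizon)

# Hypothetical two-week toy series
toy_train = pd.Series([10, 12, 11, 13, 15, 14, 16, 11, 13, 12, 14, 16, 15, 17])
toy_forecast = seasonal_naive(toy_train, horizon=10)
print(toy_forecast)
```

This gives the same values as indexing the last week with `i % 7`, since both cycle through the final season in order.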


## Drift Model
----

In [492]:
# Calculate constant: average change per time step, in other words the slope of the trend in the training data
const = (train_df['Adj Close'].iloc[-1] - train_df['Adj Close'].iloc[0])/(train_df.shape[0] -1)

In [493]:
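# Define the forecast steps as a NumPy array so the multiplication by const is explicit
# Note: steps start at 0, so the first forecast equals the last training value
fcast_range = np.arange(len(val_df))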

In [494]:
# Get last observed value in training set
# Use const to predict future values for the range of val_df (fcast_range) from last observed value in training
drift_pred = train_df['Adj Close'].iloc[-1] + (fcast_range*const)

In [495]:
drift_predictions = pd.Series(drift_pred, index=val_df.index)

In [496]:
plt_forecast(drift_predictions, 'Drift')

----
**Plot Description:**

Plot shows forecasting using the drift method. Future values are predicted by extending the average trend of the training data (the line between its first and last points) forward from the last training data point.
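
The forecast steps in the cells above start at 0, so the first prediction equals the last training value. The usual textbook form of the drift method counts the horizon from 1 instead. The sketch below illustrates that variant on hypothetical toy data. The helper name `drift_forecast` is an assumption, and the metrics reported in this notebook were not computed with it.

```python
import numpy as np

def drift_forecast(train, horizon):
    """Drift forecast: last value plus h times the average historical change, h = 1..horizon."""
    train = np.asarray(train, dtype=float)
    # Slope of the line joining the first and last training observations
    slope = (train[-1] - train[0]) / (len(train) - 1)
    steps = np.arange(1, horizon + 1)  # horizons 1, 2, ..., horizon
    return train[-1] + steps * slope

# Hypothetical toy series: rises by 6 over 3 steps, so the slope is 2 per step
toy_train = [100.0, 102.0, 101.0, 106.0]
toy_drift = drift_forecast(toy_train, horizon=3)
print(toy_drift)
```

With this convention the first forecast already includes one step of drift, which matters most when the slope is steep or the horizon is short.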

## Evaluation of Baseline Forecasting Methods
----

To evaluate each of the baseline models, I will be using the following metrics:

**1. MSE (Mean Squared Error)**

MSE measures the average of the squares of the errors. Errors are defined as the differences between predicted and actual values.

This metric gives more weight to larger errors and outliers due to the squaring.

**2. MAE (Mean Absolute Error)**

MAE measures the average magnitude of errors in predictions regardless of direction (over/under estimation). It is the average of the absolute differences between the predicted and actual values.

**3. RMSE (Root Mean Squared Error)**

RMSE measures how much the predictions differ from the actual values. It is the square root of the average squared error, so it is on the same scale as the original data, making it more interpretable than MSE.

**4. MAPE (Mean Absolute Percentage Error)**

MAPE measures the average percentage error between predicted and actual values.
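
To make the four definitions concrete, here is a small worked example on hypothetical values (the arrays are invented for illustration and are not taken from the dataset). It uses plain NumPy and follows the same formulas as `fcast_evaluation`.

```python
import numpy as np

# Hypothetical actual and predicted values
actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 180.0, 400.0])

err = actual - predicted                      # [-10, 20, 0]
mse = np.mean(err ** 2)                       # (100 + 400 + 0) / 3
mae = np.mean(np.abs(err))                    # (10 + 20 + 0) / 3
rmse = np.sqrt(mse)                           # back on the scale of the data
mape = np.mean(np.abs(err / actual)) * 100    # (0.10 + 0.10 + 0) / 3 * 100

print(mse, mae, rmse, mape)
```

Note that MAPE divides by the actual values, so it is undefined whenever an actual value is zero. That is not a concern for stock prices, but it is worth checking on other data.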

### Mean Forecasting
---

In [497]:
fcast_evaluation(mean_predictions.values, val_df['Adj Close'].values)

{'MSE': 4425.285263648947,
 'MAE': 63.94831237537823,
 'RMSE': 66.52281761658136,
 'MAPE': 14.581810963102226}

----
**Comment:**

**MSE = 4425.29** 

Relatively high, suggesting a significant difference between actual and predicted values. Since MSE squares the errors, large errors are magnified, leading to a very high result. This makes sense: the data has an upward trend, and a constant forecast at the training-set mean cannot capture it.

**MAE = 63.95** 

On average, the mean forecast predictions deviate from the actual values by approximately 64 units. 

**RMSE = 66.52** 

Essentially the MSE expressed in the original units of the data. It highlights the same issue: there are large differences between predicted and actual values.

**MAPE = 14.58%** 

On average, predictions differ from the actual values by 14.58%. While this may seem low relative to the metrics above, we need to keep in mind the purpose of the forecast. For stock prices, precision is crucial, and a 14.58% error is still too high.

Overall, these metrics suggest that while the mean baseline method offers a simple forecasting approach, its performance is limited and does not adequately account for trends or patterns in the data. Improvements are needed for more accurate forecasting.


### Naive Forecasting
---

In [498]:
fcast_evaluation(naive_predictions.values, val_df['Adj Close'].values)

{'MSE': 1389.8122977654198,
 'MAE': 33.0046011098901,
 'RMSE': 37.28018639660242,
 'MAPE': 7.452143350939089}

----
**Comment:**

**MSE = 1389.81** 

The result is much better than for mean forecasting but is still relatively high, suggesting a significant difference between actual and predicted values remains.

**MAE = 33.00** 

On average, the naive forecast predictions deviate from the actual values by approximately 33 units. 

**RMSE = 37.28** 

RMSE result shows that the typical prediction error is around 37.28 units. This highlights the issue that there are considerable differences between the predicted and actual values.

**MAPE = 7.45%** 

On average, predictions differ from the actual values by 7.45%. 

The naive forecasting method performs better than the mean method, as it captures more of the trend in the data by using the last observed value for its predictions. However, the results show there is still a substantial difference between the predicted and actual values. Other baseline methods may perform better.

### Seasonal Naive Forecasting
---

In [499]:
fcast_evaluation(snaive_predictions.values, val_df['Adj Close'].values)

{'MSE': 1246.0521258158774,
 'MAE': 30.76753530769231,
 'RMSE': 35.299463534392096,
 'MAPE': 6.9411303231108485}

-----
**Comment:**

**MSE = 1246.05**

**MAE = 30.77**

**RMSE = 35.30**

**MAPE = 6.94%**

Overall, the seasonal naive method shows improved performance compared to both the mean and naive methods. 

By predicting future values based on the same day from the last week of training data, it captures the weekly seasonal pattern. This results in slightly lower error metrics than the naive method (for example, MAPE of 6.94% versus 7.45%), indicating that it better captures recurring patterns in the data.

### Drift Forecasting
-----

In [500]:
fcast_evaluation(drift_predictions.values, val_df['Adj Close'].values)

{'MSE': 673.8007342117846,
 'MAE': 22.210449356550125,
 'RMSE': 25.957671972112305,
 'MAPE': 5.006824887134923}

------
**Comment:**

**MSE = 673.80**

**MAE = 22.21**

**RMSE = 25.96**

**MAPE = 5.01%**

The drift forecasting model shows the best performance compared to other baseline methods. By extending the trend line from the training data, it provides forecasts that closely follow the underlying trend.

This approach results in the lowest error metrics, indicating that it effectively captures the trend in the data. This suggests the drift forecasting model is better at predicting future values with greater accuracy compared to the mean, naive, and seasonal naive methods.


### Best Baseline: Drift Forecasting Model
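
As a quick way to see the ranking at a glance, the metrics reported above can be collected into a single table. The sketch below copies the reported values, rounded to two decimals. The names `results`, `ranked` and `best_method` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Metrics copied from the evaluation outputs above, rounded to two decimals
results = pd.DataFrame(
    {
        "Mean": {"MSE": 4425.29, "MAE": 63.95, "RMSE": 66.52, "MAPE": 14.58},
        "Naive": {"MSE": 1389.81, "MAE": 33.00, "RMSE": 37.28, "MAPE": 7.45},
        "Seasonal Naive": {"MSE": 1246.05, "MAE": 30.77, "RMSE": 35.30, "MAPE": 6.94},
        "Drift": {"MSE": 673.80, "MAE": 22.21, "RMSE": 25.96, "MAPE": 5.01},
    }
).T  # transpose so each row is one method

# Sort so the method with the lowest RMSE comes first
ranked = results.sort_values("RMSE")
best_method = ranked.index[0]

print(ranked)
print(best_method)
```

Sorting by any of the four metrics gives the same order here, since the drift method has the lowest value on every one of them.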

## Conclusion
----

These baseline models serve as a benchmark for assessing the performance of the more advanced forecasting methods that I will move on to next.

By comparing advanced models against these simple methods, we can evaluate whether they offer a significant improvement in prediction accuracy.

After comparing the evaluation metrics of all four methods, the drift model showed the best performance. Therefore, I will use the drift model as the baseline for comparing more advanced forecasting methods moving forward.