<a href="https://colab.research.google.com/github/waelrash1/forecastingmodelsPY/blob/main/ch02/ch02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This chapter covers
- Defining a baseline model
- Setting a baseline using the mean
- Building a baseline using the mean of the previous window of time
- Creating a baseline using the previous timestep
- Implementing the naive seasonal forecast

In [None]:
# change the cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))


In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
pd.set_option("display.width", 120)

In [None]:
# To avoid SettingWithCopyWarning
pd.options.mode.chained_assignment = None

In [None]:

# Loading data into a dataframe
data_url='https://raw.githubusercontent.com/waelrash1/timeSeriesPy/master/data/jj.csv'
df = pd.read_csv(data_url)

In [None]:
# Showing top records in the dataframe
df.head()

In [None]:
# Showing bottom records in the dataframe
df.tail()

# Plot data with train/test split 

In [None]:

# Plotting data with train/test split
fig, ax = plt.subplots(figsize=(12, 8))

# Plotting date on x-axis and data on y-axis
ax.plot(df['date'], df['data'])

# Setting x-axis label
ax.set_xlabel('Date')

# Setting y-axis label
ax.set_ylabel('Earnings per share (USD)')

# Highlighting the test data in the plot
ax.axvspan(80, 83, color='#808080', alpha=0.2)
# Highlight training data
ax.axvspan(0, 80, color='#e9a296', alpha=0.1)
# Adding x-axis ticks at specified intervals
plt.xticks(np.arange(0, 81, 8), [1960, 1962, 1964, 1966, 1968, 1970, 1972, 1974, 1976, 1978, 1980])

# Formatting x-axis labels
fig.autofmt_xdate()

# Adjusting layout
plt.tight_layout()

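# Saving the plot to a file (create the output directory first if it is missing,
# e.g. on a fresh Colab runtime, otherwise savefig raises FileNotFoundError)
import os
os.makedirs('figures', exist_ok=True)
plt.savefig('figures/CH02_F01_peixeiro.png', dpi=300)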

Quarterly earnings per share of Johnson & Johnson in US dollars (USD)
between 1960 and 1980. We will use the data from 1960 to the last quarter of 1979
to build a baseline model that will forecast the earnings per share for the quarters of
1980 (as illustrated by the gray area).

# Baseline model
> A baseline model is a trivial solution to your forecasting problem. It relies on heuristics
or simple statistics and is usually the simplest solution. It does not require
model fitting, and it is easy to implement.

# Predict historical mean 

> Our goal is to use the data from 1960 to the end of 1979 to predict the four quarters of 1980. The first baseline we’ll discuss uses the historical mean, which is the arithmetic mean of past values. Its implementation is straightforward: calculate the mean of the training set,
and it will be our prediction for the four quarters of 1980. First, though, we need to do
some preliminary work that we’ll use in all of our baseline implementations.

# Split to train/test 

In [None]:
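# Splitting data into train and test sets
# .copy() makes each set an independent DataFrame, so columns can be added to test safely
train = df[:-4].copy()
test = df[-4:].copy()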

## Implementing the historical mean baseline

In [None]:
# Predicting historical mean
historical_mean = np.mean(train['data'])

In [None]:

# Adding predicted values to test set
test.loc[:,'pred_mean'] = historical_mean

# Showing test set with predicted values
test

## Mean Absolute Percentage Error (MAPE)

$$
\Large \mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{A_i - F_i}{A_i} \right| \times 100
$$
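
To make the formula concrete, here is a minimal sketch on made-up numbers. The `actual` and `forecast` arrays are illustrative assumptions, not values from the Johnson & Johnson dataset. Each term is the absolute error expressed as a fraction of the actual value $A_i$; averaging those fractions and multiplying by 100 gives the MAPE.

```python
import numpy as np

# Illustrative values only (not taken from the dataset)
actual = np.array([10.0, 20.0, 40.0, 50.0])    # A_i
forecast = np.array([12.0, 18.0, 40.0, 45.0])  # F_i

# Absolute percentage error of each point: |(A_i - F_i) / A_i|
abs_pct_errors = np.abs((actual - forecast) / actual)

# Average over the n points and scale to a percentage
mape_example = abs_pct_errors.mean() * 100

print(abs_pct_errors)
print(round(mape_example, 2))
```

The per-point errors are 20%, 10%, 0% and 10%, so the MAPE is 10%. Because each error is divided by $A_i$, the metric is undefined whenever an actual value is zero.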

In [None]:
# Function to calculate mean absolute percentage error (MAPE)
def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [None]:
# Calculating MAPE for historical mean prediction
mape_hist_mean = mape(test['data'], test['pred_mean'])
mape_hist_mean

In [None]:

# Plotting train and test data with historical mean prediction
fig, ax = plt.subplots(figsize=(12, 8))

# Plotting train data
ax.plot(train['date'], train['data'], 'g-.', label='Train')

# Plotting test data
ax.plot(test['date'], test['data'], 'b-', label='Test')

# Plotting predicted historical mean
ax.plot(test['date'], test['pred_mean'], 'r--', label='Predicted')

# Setting x-axis label
ax.set_xlabel('Date')

# Setting y-axis label
ax.set_ylabel('Earnings per share (USD)')

# Highlighting the test data in the plot
ax.axvspan(80, 83, color='#808080', alpha=0.2)
ax.legend(loc=2)

plt.xticks(np.arange(0, 85, 8), [1960, 1962, 1964, 1966, 1968, 1970, 1972, 1974, 1976, 1978, 1980])

fig.autofmt_xdate()
plt.tight_layout()

plt.savefig('figures/CH02_F06_peixeiro.png', dpi=300)


# Predict last year mean

In [None]:
# Calculate the mean of the last 4 values of the 'data' column in the 'train' DataFrame
last_year_mean = np.mean(train['data'][-4:])
# Print the mean value
last_year_mean

In [None]:
# Add a new column named 'pred__last_yr_mean' to the 'test' DataFrame and fill it with the value of 'last_year_mean'
test.loc[:, 'pred__last_yr_mean'] = last_year_mean
# Print the 'test' DataFrame to verify the new column
test


In [None]:
# Calculate the MAPE between the 'data' and 'pred__last_yr_mean' columns of the 'test' DataFrame
mape_last_year_mean = mape(test['data'], test['pred__last_yr_mean'])
# Print the MAPE value
mape_last_year_mean

In [None]:

# Create a figure and a subplot (ax)
fig, ax = plt.subplots(figsize=(12, 8))

# Plot the 'data' column in the 'train' DataFrame with a green dash-dot line, labeled as 'Train'
ax.plot(train['date'], train['data'], 'g-.', label='Train')
# Plot the 'data' column in the 'test' DataFrame with a blue line, labeled as 'Test'
ax.plot(test['date'], test['data'], 'b-', label='Test')
# Plot the 'pred__last_yr_mean' column in the 'test' DataFrame with a red dashed line, labeled as 'Predicted'
ax.plot(test['date'], test['pred__last_yr_mean'], 'r--', label='Predicted')
# Set the label for the x-axis as 'Date'
ax.set_xlabel('Date')
# Set the label for the y-axis as 'Earnings per share (USD)'
ax.set_ylabel('Earnings per share (USD)')
# Shade the test period in gray (x positions 80 to 83) with an alpha of 0.2
ax.axvspan(80, 83, color='#808080', alpha=0.2)
# Add a legend to the plot at location 2
ax.legend(loc=2)

# Set the x-axis ticks at every 8th value from 0 to 85, with labels at [1960, 1962, 1964, 1966, 1968, 1970, 1972, 1974, 1976, 1978, 1980]
plt.xticks(np.arange(0, 85, 8), [1960, 1962, 1964, 1966, 1968, 1970, 1972, 1974, 1976, 1978, 1980])

# Automatically format the x-axis labels
fig.autofmt_xdate()
# Adjust the layout of the plot to reduce the whitespace
plt.tight_layout()

# Save the plot as a PNG image with a resolution of 300 dpi, to the 'figures' directory with the file name 'CH02_F07_peixeiro.png'
plt.savefig('figures/CH02_F07_peixeiro.png', dpi=300)


# Predict last known value

In [None]:
# Predict last known value

# Get the last value from the training set 'data' column
last = train['data'].iloc[-1]
# Print the last value
print(last)


In [None]:

# Add a new column 'pred_last' to the test set with the last value
test['pred_last'] = last
# Print the updated test set
print(test)

In [None]:

# Calculate the mean absolute percentage error (MAPE) between the test set 'data' and 'pred_last' columns
mape_last = mape(test['data'], test['pred_last'])
# Print the MAPE
print(mape_last)

In [None]:

# Plot the train set 'data', the test set 'data', and the predicted values from 'pred_last'
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(train['date'], train['data'], 'g-.', label='Train')
ax.plot(test['date'], test['data'], 'b-', label='Test')
ax.plot(test['date'], test['pred_last'], 'r--', label='Predicted')
ax.set_xlabel('Date')
ax.set_ylabel('Earnings per share (USD)')
ax.axvspan(80, 83, color='#808080', alpha=0.2)
ax.legend(loc=2)
plt.xticks(np.arange(0, 85, 8), [1960, 1962, 1964, 1966, 1968, 1970, 1972, 1974, 1976, 1978, 1980])
fig.autofmt_xdate()
plt.tight_layout()
# Save the plot to a file
plt.savefig('figures/CH02_F08_peixeiro.png', dpi=300)

# Naive seasonal forecast 

In [None]:
# Add a new column 'pred_last_season' to the test set with the last 4 values from the train set 'data' column
test['pred_last_season'] = train['data'][-4:].values
# Print the updated test set
print(test)


In [None]:
# Calculate the mean absolute percentage error (MAPE) between the test set 'data' and 'pred_last_season' columns
mape_naive_seasonal = mape(test['data'], test['pred_last_season'])
# Print the MAPE
print(mape_naive_seasonal)

In [None]:


# Initialize the plot
fig, ax = plt.subplots(figsize=(12, 8))

# Plot the train set 'data' in green dash-dot line style
ax.plot(train['date'], train['data'], 'g-.', label='Train')

# Plot the test set 'data' in blue line style
ax.plot(test['date'], test['data'], 'b-', label='Test')

# Plot the predicted values from 'pred_last_season' in red dashed line style
ax.plot(test['date'], test['pred_last_season'], 'r--', label='Predicted')

# Set x-axis label as 'Date'
ax.set_xlabel('Date')

# Set y-axis label as 'Earnings per share (USD)'
ax.set_ylabel('Earnings per share (USD)')

# Add a gray shaded region with 20% opacity (alpha=0.2) from x-value 80 to 83
ax.axvspan(80, 83, color='#808080', alpha=0.2)

# Add a legend to the plot located at position 2
ax.legend(loc=2)

# Set x-axis tick marks every 8 units with labels representing the year
plt.xticks(np.arange(0, 85, 8), [1960, 1962, 1964, 1966, 1968, 1970, 1972, 1974, 1976, 1978, 1980])

# Automatically format x-axis dates
fig.autofmt_xdate()

# Adjust the layout of the plot to ensure tight fit of all elements
plt.tight_layout()

# Save the plot to a file named 'figures/CH02_F09_peixeiro.png' with a resolution of 300 dpi
plt.savefig('figures/CH02_F09_peixeiro.png', dpi=300)



# Compare all Models

In [None]:

# Initialize the figure and axis
fig, ax = plt.subplots(figsize=(12, 8))


models = ['hist mean', 'last year mean', 'last', 'naive seasonal']
MAPE = [70.00, 15.60, 30.46, 11.56]
bar_colors = ['tab:red', 'tab:blue', 'tab:green', 'tab:orange']

# Draw one bar per baseline (the list defined above is named 'models')
ax.bar(models, MAPE, width=0.8, alpha=.6, color=bar_colors)
ax.set_xlabel('Baselines')
ax.set_ylabel('MAPE (%)')
ax.set_ylim(0, 75)

for index, value in enumerate(MAPE):
    plt.text(x=index, y=value+2 , s=str(value), ha='center', fontsize='x-large', alpha=.8)


# Save before plt.show(); saving after the figure has been shown can write a blank image
plt.savefig('figures/CH02_F10_peixeiro.png', dpi=300)

plt.show()



### Four Baselines Developed

- Arithmetic mean of entire training set
- Mean of last year in training set
- Last known value of training set
- Naive seasonal forecast (selected as benchmark with lowest MAPE)
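
The four baselines can also be gathered into one helper. The following is a minimal sketch, not code from the chapter: the function name `baseline_forecasts`, its parameters `horizon` and `season_length`, and the `toy` series are all hypothetical names introduced here for illustration. It assumes the "last year" window has the same length as one season, which holds for quarterly data with a season of four quarters.

```python
import numpy as np

def baseline_forecasts(train_values, horizon, season_length):
    """Return the four baseline forecasts as arrays of length `horizon`."""
    train_values = np.asarray(train_values, dtype=float)
    last_season = train_values[-season_length:]
    # Repeat the last season enough times to cover the horizon, then trim
    reps = int(np.ceil(horizon / season_length))
    return {
        'hist_mean': np.full(horizon, train_values.mean()),
        'last_window_mean': np.full(horizon, last_season.mean()),
        'last_value': np.full(horizon, train_values[-1]),
        'naive_seasonal': np.tile(last_season, reps)[:horizon],
    }

# Toy series with a season of 4 observations (illustrative values only)
toy = [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0]
forecasts = baseline_forecasts(toy, horizon=4, season_length=4)

for name, values in forecasts.items():
    print(name, values)
```

On the toy series, the historical mean is 3.0, the mean of the last window is 3.5, the last value is 5.0, and the naive seasonal forecast repeats the last four observations.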

### Evaluation 
- Baselines were evaluated using MAPE metric on a test set.


### Comparison with Complex Models
- Complex models will be compared against the benchmark (naive seasonal forecast MAPE).
- If the complex model has a lower MAPE, it will be considered a better performing model.
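
The selection rule above can be sketched in a few lines. The MAPE values are the ones shown in the comparison chart of this chapter; the names `baseline_mape`, `benchmark_name`, `benchmark_mape` and the helper `beats_benchmark` are hypothetical and only serve the illustration.

```python
# MAPE values (%) as reported in the comparison chart of this chapter
baseline_mape = {
    'hist mean': 70.00,
    'last year mean': 15.60,
    'last': 30.46,
    'naive seasonal': 11.56,
}

# The benchmark is the baseline with the lowest MAPE
benchmark_name = min(baseline_mape, key=baseline_mape.get)
benchmark_mape = baseline_mape[benchmark_name]

def beats_benchmark(candidate_mape):
    """Return True if a candidate model's MAPE is lower than the benchmark's."""
    return candidate_mape < benchmark_mape

print(benchmark_name, benchmark_mape)
```

A more complex model would then be worth keeping only if `beats_benchmark` returns `True` for its MAPE on the same test set.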

### Special Cases
- There may be cases where time series can only be forecast using naive methods, such as when the process moves at random and can't be predicted using statistical learning methods (random walk).
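
As a rough illustration of the random walk case, the sketch below simulates one and compares two baselines on the last 50 points. Everything here is an assumption made for the example: the seed, the starting level of 100 (chosen so the series stays away from zero and MAPE remains well defined), the step distribution, and the use of a one-step-ahead forecast where each prediction is simply the previous observed value.

```python
import numpy as np

def mape(y_true, y_pred):
    # Same definition as the mape function used earlier in this chapter
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Simulated random walk: each value is the previous one plus random noise
rng = np.random.default_rng(42)
steps = rng.normal(loc=0.0, scale=1.0, size=500)
walk = 100 + np.cumsum(steps)

# Hold out the last 50 points as the test set
train_values, test_values = walk[:-50], walk[-50:]

# One-step-ahead naive forecast: predict each point with the previous observation
naive_pred = walk[-51:-1]
# Historical mean forecast: a constant equal to the mean of the training set
mean_pred = np.full(test_values.shape, train_values.mean())

mape_naive = mape(test_values, naive_pred)
mape_mean = mape(test_values, mean_pred)

print(f'naive (previous value): {mape_naive:.2f}%')
print(f'historical mean:        {mape_mean:.2f}%')
```

Since each step of a random walk is pure noise, the previous value already contains all the usable information, which is why the naive forecast is hard to beat in this setting.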


# Summary
* Time series forecasting starts with a baseline model that serves as a benchmark for comparison with more complex models.
* A baseline model is a trivial solution to our forecasting problem because it only uses heuristics, or simple statistics, such as the mean.
* MAPE stands for mean absolute percentage error, and it is an intuitive measure of how much a predicted value deviates from the actual value.
* There are many ways to develop a baseline. In this chapter, you saw how to use the mean, the last known value, or the last season.