## Air Pollution Forecasting - Part 2: Forecasting using Prophet

Prophet is an open-source forecasting library developed by Facebook for time series analysis. It models a single target series, and additional features can be supplied as extra regressors, so it can be used for both univariate and multivariate problems. Its API is very similar to that of sklearn models: after creating an instance of the `Prophet` class, we just call the *fit()* and *predict()* functions.

The input to a Prophet model is a dataframe with two columns: `ds` (date) and `y` (numeric target). For multivariate problems, we keep the additional features as extra columns and register each of them using the *add_regressor()* function.
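
As a minimal sketch of that input layout, the snippet below builds a toy hourly frame with pandas. The values and the choice of `temp` as the extra column are made up for illustration; only the `ds` and `y` column names come from Prophet's requirements.

```python
import pandas as pd

# Toy hourly data; the numbers are illustrative, not taken from the dataset.
toy = pd.DataFrame(
    {
        "date": pd.date_range("2014-01-01", periods=4, freq="h"),
        "pollution": [129.0, 148.0, 159.0, 181.0],
        "temp": [-4.0, -4.0, -5.0, -5.0],
    }
)

# Prophet expects the timestamp column to be called 'ds' and the target 'y'.
# Any further numeric column (here 'temp') stays in the frame and would be
# registered with add_regressor("temp") before fitting.
prophet_input = toy.rename(columns={"date": "ds", "pollution": "y"})
print(prophet_input.dtypes)
```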

Prophet decomposes a time series into its trend, seasonality, holiday effects and residuals, and then builds an additive model that sums up all these components, as shown below.

![Prophet Equation](https://hands-on.cloud/wp-content/uploads/2022/05/implementation-of-facebook-prophet-algorithm-equation.png?ezimgfmt=ng:webp/ngcb1)
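
In case the image does not render, the additive form can be written out as follows, where $g(t)$ is the trend, $s(t)$ the seasonality, $h(t)$ the holiday effect and $\epsilon_t$ the residual. The symbol names follow the Prophet paper rather than anything defined in this notebook.

$$
y(t) = g(t) + s(t) + h(t) + \epsilon_t
$$

Features registered with *add_regressor()* enter, by default, as a further additive term alongside these components.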


**Drawbacks of Traditional Time Series Methods:**
- Cannot handle trend and seasonality well and data needs to be preprocessed before passing to the model
- Parameter tuning needs experts
- NA values are not handled
- Data needs to be at the same frequency

**Prophet Advantages:**
- A machine learning approach
- Easier to implement and tune
- Handles data with seasonality, trend and outliers
- Works best when there is a lot of historical data to train on

Prophet also provides a function *make_future_dataframe()* that creates a dataframe of future dates, which can then be used to predict the target values of a univariate time series. However, we won't be using it for this project because we have divided the data into train and test sets for better interpretation.
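
To make the idea concrete, here is a rough pandas-only sketch of the kind of frame *make_future_dataframe()* is meant to return: a single `ds` column that continues after the last observed timestamp. The `history` frame and its dates are hypothetical, and this is an illustration rather than Prophet's own code.

```python
import pandas as pd

# Hypothetical history: 48 hourly timestamps ending at 2014-12-30 23:00.
history = pd.DataFrame({"ds": pd.date_range("2014-12-29", periods=48, freq="h")})

# Build the next 24 hourly timestamps after the last observed one,
# similar in spirit to make_future_dataframe(periods=24, freq="h",
# include_history=False).
last_seen = history["ds"].max()
future = pd.DataFrame({"ds": pd.date_range(last_seen, periods=25, freq="h")[1:]})
print(future.head())
```

Note that once extra regressors are registered, the frame passed to *predict()* must also contain those regressor columns, which is another reason a held-out test set is convenient here.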

**Flow of modelling using prophet:**
- Install and Import prophet
- Convert the date column to 'datetime' datatype
- Split the data into train and test sets
- Rename the date column to ds and target variable, pollution, to y
- Create a Prophet model by assigning parameter values
- Fit your data and predict using the test data
- Compare the predictions
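
The data preparation steps in this flow (datetime conversion, split, rename) can be sketched on toy data as shown here. The helper name `prepare_split` and the toy values are assumptions made for this example; the model creation, fitting and prediction steps need Prophet itself and are left out.

```python
import pandas as pd


def prepare_split(df, split_date, date_col="date", target_col="pollution"):
    """Parse dates, rename columns for Prophet, and split at split_date."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    out = out.rename(columns={date_col: "ds", target_col: "y"})
    train = out.loc[out["ds"] < split_date]
    valid = out.loc[out["ds"] >= split_date]
    return train, valid


# Toy data: 72 hourly rows stored as strings; the last 24 hours form the test set.
toy = pd.DataFrame(
    {
        "date": pd.date_range("2014-12-29", periods=72, freq="h").astype(str),
        "pollution": range(72),
    }
)
train, valid = prepare_split(toy, pd.Timestamp(2014, 12, 31))
print(len(train), len(valid))
```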

**References for prophet:**
- **[Facebook Research Blog](https://research.facebook.com/blog/2017/02/prophet-forecasting-at-scale/)**
- **[Prophet Documentation](https://facebook.github.io/prophet/)**

#### Importing the necessary Libraries

In [None]:
# Libraries for reading the data and preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ML Model
from prophet import Prophet

# pep8 formatting
import jupyter_black

jupyter_black.load()

import warnings
warnings.filterwarnings("ignore")

#### Read the data and plot correlation matrix

In [None]:
air_quality_data = pd.read_csv("data/air_quality_data.csv")

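# Correlate numeric columns only (the date column is still a string at this point)
corr = air_quality_data.corr(numeric_only=True)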

# Create a mask
mask = np.triu(np.ones_like(corr, dtype=bool))

# Create a custom diverging palette
cmap = sns.diverging_palette(100, 7, s=75, l=40, n=5, center="light", as_cmap=True)

plt.figure(figsize=(12, 10))
sns.heatmap(corr, mask=mask, center=0, annot=True, fmt=".2f", square=True, cmap=cmap)
plt.show();
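
The mask passed to the heatmap hides every cell where it is `True`, so only the lower triangle of the symmetric correlation matrix is drawn. A small sketch with a made-up 3x3 matrix shows what `np.triu` produces:

```python
import numpy as np

# A small symmetric, correlation-like matrix (values are illustrative).
corr = np.array(
    [
        [1.0, 0.3, -0.2],
        [0.3, 1.0, 0.5],
        [-0.2, 0.5, 1.0],
    ]
)

# True marks the cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)
```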

#### Convert date to datetime and split the data into train and test sets
- The test data contains the values for the last 24 hours, i.e. one day
- The model will be trained on the rest of the data

In [None]:
air_quality_data["date"] = pd.to_datetime(air_quality_data["date"])
# pd.datetime has been removed from pandas; pd.Timestamp is the supported equivalent
split_date = pd.Timestamp(2014, 12, 31)
train = air_quality_data.loc[air_quality_data.date < split_date]
valid = air_quality_data.loc[air_quality_data.date >= split_date]

air_quality_data.columns

### Creating a Prophet Model - Using combination of features and lags

#### 1. Picking features and modifying the train and test data

In [None]:
train_1 = train.drop(
    ["month", "quarter", "day", "hour", "wnd_dir_NE", "wnd_dir_SE"],
    axis=1,
)

valid_1 = valid.drop(
    ["month", "quarter", "day", "hour", "wnd_dir_NE", "wnd_dir_SE"],
    axis=1,
)

# The prophet model needs date to be set as 'ds' and target as 'y'
train_1.rename(columns={"date": "ds", "pollution": "y"}, inplace=True)
valid_1.rename(columns={"date": "ds", "pollution": "y"}, inplace=True)

#### 2. Creating a model, fitting and predicting

In [None]:
model1 = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    changepoint_prior_scale=0.001,
)
model1.add_regressor("dew")
model1.add_regressor("temp")
model1.add_regressor("press")
model1.add_regressor("wnd_spd")
model1.add_regressor("snow")
model1.add_regressor("rain")
model1.add_regressor("year")
model1.add_regressor("Lag1_pollution")
model1.add_regressor("Daily_Avg_Pollution")
model1.add_regressor("Lag1_daily_avg_pollution")
model1.add_regressor("wnd_dir_NW")
model1.add_regressor("wnd_dir_cv")

# Fitting the model
model1.fit(train_1)

# predicting the outputs
forecast_multi_1 = model1.predict(valid_1.drop(columns="y"))
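
The repeated *add_regressor()* calls could also be written as a loop. The sketch below uses a hypothetical helper, `add_regressors`, and a stand-in class, `RecordingModel`, that only records names so the example runs without Prophet installed; neither is part of the Prophet library.

```python
def add_regressors(model, names):
    """Register each name with model.add_regressor and return the model."""
    for name in names:
        model.add_regressor(name)
    return model


class RecordingModel:
    """Hypothetical stand-in for Prophet that only records regressor names."""

    def __init__(self):
        self.registered = []

    def add_regressor(self, name):
        self.registered.append(name)


weather = ["dew", "temp", "press", "wnd_spd", "snow", "rain"]
demo = add_regressors(RecordingModel(), weather)
print(demo.registered)
```

With the real class, the same helper would be called on a `Prophet(...)` instance before *fit()*.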

#### 3. Visualizing the outputs

In [None]:
valid_1 = valid_1.set_index("ds")
forecast_multi_1 = forecast_multi_1.set_index("ds")

plt.figure(figsize=(10, 5))
plt.ylabel("pm2.5")
plt.xlabel("date")
plt.plot(valid_1.y, c="lightgreen", label="Actual Pollution", linewidth=2.5)
plt.plot(forecast_multi_1.yhat, c="darkblue", label="Predicted pollution")
plt.title("Comparison graph")
plt.legend()
plt.show()

#### 4. Creating a dataframe with actual and predicted values

In [None]:
def create_comparison_df(valid, forecast):
    pollution = valid.y.values
    predicted_pollution = forecast.yhat.values
    zipped = list(zip(pollution, predicted_pollution))
    columns = ["Pollution", "Predicted_Pollution"]
    df = pd.DataFrame(zipped, columns=columns)
    return df

In [None]:
df_model1 = create_comparison_df(valid=valid_1, forecast=forecast_multi_1)
df_model1

### Model with only pollution based features

#### 1. Creating a model

In [None]:
model2 = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    changepoint_prior_scale=0.001,
)
model2.add_regressor("Lag1_pollution")
model2.add_regressor("Daily_Avg_Pollution")
model2.add_regressor("Lag1_daily_avg_pollution")

In [None]:
train_2 = train[
    [
        "date",
        "pollution",
        "Lag1_pollution",
        "Daily_Avg_Pollution",
        "Lag1_daily_avg_pollution",
    ]
]
# Assign the result instead of renaming a slice in place
train_2 = train_2.rename(columns={"date": "ds", "pollution": "y"})

valid_2 = valid[
    [
        "date",
        "pollution",
        "Lag1_pollution",
        "Daily_Avg_Pollution",
        "Lag1_daily_avg_pollution",
    ]
]
valid_2 = valid_2.rename(columns={"date": "ds", "pollution": "y"})

#### 2. Training and predicting

In [None]:
model2.fit(train_2)
forecast_multi_2 = model2.predict(valid_2.drop(columns="y"))
valid_2 = valid_2.set_index("ds")
forecast_multi_2 = forecast_multi_2.set_index("ds")

#### 3. Visualizing the outputs

In [None]:
plt.figure(figsize=(10, 5))
plt.ylabel("pm2.5")
plt.xlabel("date")
plt.plot(valid_2.y, c="lightgreen", label="Actual Pollution", linewidth=2.5)
plt.plot(forecast_multi_2.yhat, c="darkblue", label="Predicted Pollution")
plt.title("Comparison graph")
plt.legend()
plt.show()

#### 4. Creating a dataframe with actual and predicted values

In [None]:
df_model2 = create_comparison_df(valid=valid_2, forecast=forecast_multi_2)
df_model2

### Diagnostics

The RMSE and MAE of the models can be used to compare their performance. Both models seem to work well, and model 2 works slightly better than model 1.
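
As a small worked example of the two metrics, the sketch below computes RMSE and MAE on three made-up values. The helper name `rmse_mae` is an assumption for this illustration. RMSE squares the errors before averaging, so it penalizes large misses more heavily than MAE does.

```python
import numpy as np


def rmse_mae(actual, predicted):
    """Return (RMSE, MAE) for two equal-length sequences."""
    errors = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    rmse = float(np.sqrt(np.mean(errors**2)))
    mae = float(np.mean(np.abs(errors)))
    return rmse, mae


# Toy values: the errors are +2, -2 and +3.
rmse, mae = rmse_mae(actual=[10, 20, 30], predicted=[12, 18, 33])
print(round(rmse, 3), round(mae, 3))
```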

In [None]:
def diagnostics(pred, valid):
    mse = np.mean(np.square(pred["yhat"] - valid["y"]))
    rmse = np.sqrt(mse)
    print("The RMSE is: ", rmse)
    mae = np.mean(np.abs(pred["yhat"] - valid["y"]))
    print("The MAE is: ", mae)

In [None]:
diagnostics(pred=forecast_multi_1, valid=valid_1)

In [None]:
diagnostics(pred=forecast_multi_2, valid=valid_2)

### Mean pollution for the day compared with the mean predicted pollution
The average prediction for the day is very close to the actual average for the model with fewer features.

In [None]:
df_model1.Pollution.mean(), df_model1.Predicted_Pollution.mean()

In [None]:
df_model2.Pollution.mean(), df_model2.Predicted_Pollution.mean()

### Predicting for a different day

In [None]:
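# Train only on data before the day being predicted (12 Feb 2014);
# pd.datetime has been removed from pandas, so pd.Timestamp is used instead
split_date = pd.Timestamp(2014, 2, 12)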
train1 = air_quality_data.loc[air_quality_data.date < split_date]
valid1 = air_quality_data[
    (air_quality_data.day == 12)
    & (air_quality_data.month == 2)
    & (air_quality_data.year == 2014)
]

train_3 = train1[
    [
        "date",
        "pollution",
        "Lag1_pollution",
        "Daily_Avg_Pollution",
        "Lag1_daily_avg_pollution",
    ]
]
train_3 = train_3.rename(columns={"date": "ds", "pollution": "y"})

valid_3 = valid1[
    [
        "date",
        "pollution",
        "Lag1_pollution",
        "Daily_Avg_Pollution",
        "Lag1_daily_avg_pollution",
    ]
]
valid_3 = valid_3.rename(columns={"date": "ds", "pollution": "y"})

model3 = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    changepoint_prior_scale=0.001,
)

model3.add_regressor("Lag1_pollution")
model3.add_regressor("Daily_Avg_Pollution")
model3.add_regressor("Lag1_daily_avg_pollution")

model3.fit(train_3)
forecast_multi_3 = model3.predict(valid_3.drop(columns="y"))
valid_3 = valid_3.set_index("ds")
forecast_multi_3 = forecast_multi_3.set_index("ds")

In [None]:
plt.figure(figsize=(10, 5))
plt.ylabel("pm2.5")
plt.xlabel("date")
plt.plot(valid_3.y, c="lightgreen", label="Actual Pollution", linewidth=2.5)
plt.plot(forecast_multi_3.yhat, c="darkblue", label="Predicted Pollution")
plt.title("Comparison graph for 12th Feb 2014")
plt.legend()
plt.show()

In [None]:
df_model3 = create_comparison_df(valid=valid_3, forecast=forecast_multi_3)
df_model3.Pollution.mean(), df_model3.Predicted_Pollution.mean()
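
To repeat this experiment for other days, the split could be wrapped in a helper. The sketch below is one possible version under stated assumptions: the name `holdout_day` and the toy data are hypothetical, and the function trains on everything strictly before the chosen day and validates on that calendar day only.

```python
import pandas as pd


def holdout_day(df, day, date_col="date"):
    """Train on rows strictly before `day`; validate on that calendar day."""
    day = pd.Timestamp(day).normalize()
    dates = df[date_col]
    train = df.loc[dates < day]
    valid = df.loc[(dates >= day) & (dates < day + pd.Timedelta(days=1))]
    return train, valid


# Toy hourly data covering 10-14 Feb 2014 (5 days, 120 rows).
toy = pd.DataFrame({"date": pd.date_range("2014-02-10", periods=120, freq="h")})
train, valid = holdout_day(toy, "2014-02-12")
print(len(train), len(valid))
```

Keeping the training rows strictly earlier than the validation day avoids evaluating the model on data it has already seen.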

### Conclusion
- Prophet works well for this dataset because we have a lot of historical data and seasonal variation.
- Both models perform about equally well; in fact, the model with fewer features performs slightly better when judged by average pollution per day.