# Air Pollution Forecasting using LSTM

## What is LSTM?
- Long Short-Term Memory (LSTM) is an advanced recurrent neural network (RNN), a sequential network that allows information to persist. It is designed to handle the vanishing gradient problem faced by plain RNNs.
- Unlike a feedforward network, an LSTM has feedback connections.
- Therefore, it can process not only single data points but also entire sequences of data.
- It has internal mechanisms called gates that regulate the flow of information.
- In other words, the prediction for the nth sample in a sequence can be influenced by an input that was given many time steps earlier.
- The four key components of an LSTM are:
    - Forget gate: Decides what is relevant to keep from prior steps
    - Input gate: Decides what information is relevant to add from the current step
    - Output gate: Determines what the next hidden state should be
    - Cell state: Transport highway that carries relevant information all the way down the sequence chain
- Gates use a sigmoid function that squashes values between 0 and 1. This is helpful for updating or forgetting data: any number multiplied by 0 is 0, so the value disappears or is "forgotten", while any number multiplied by 1 is unchanged, so the value is "kept".
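
The gate mechanics described above can be sketched as a single LSTM time step in plain NumPy. This is a minimal illustration, not the Keras implementation: the function name `lstm_step`, the stacked weight layout (forget, input, candidate, output) and the random weights are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative layout, not the Keras one).

    W, U and b hold the stacked weights of the four transforms in the
    order: forget gate, input gate, candidate values, output gate.
    """
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b       # shape (4 * n,)
    f = sigmoid(z[0 * n:1 * n])        # forget gate: what to keep from c_prev
    i = sigmoid(z[1 * n:2 * n])        # input gate: what to add from this step
    g = np.tanh(z[2 * n:3 * n])        # candidate values for the cell state
    o = sigmoid(z[3 * n:4 * n])        # output gate: what to expose as h_t
    c_t = f * c_prev + i * g           # cell state: the "transport highway"
    h_t = o * np.tanh(c_t)             # next hidden state
    return h_t, c_t


# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
n_features, n_units = 3, 4
W = rng.normal(size=(4 * n_units, n_features))
U = rng.normal(size=(4 * n_units, n_units))
b = np.zeros(4 * n_units)

h, c = np.zeros(n_units), np.zeros(n_units)
for x_t in rng.normal(size=(5, n_features)):   # a sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Because the forget and input gates multiply element-wise by values between 0 and 1, each entry of the cell state is partly kept and partly overwritten at every step, which is how information from early time steps can still influence `h` at the end of the loop.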


In [None]:
from IPython.display import Image
Image(url="Images/LSTM.png", width=500, height=100)

## Why LSTM?
- Backpropagates the error more reliably over long sequences than a plain RNN.
- Maintains information in memory for long periods of time.
- Has the capacity to learn from a larger number of parameters and features.
- Can be used in complex domains like machine translation, time series forecasting and speech recognition.
- Can provide greater accuracy for demand forecasters, which results in better decision making for the business.

## Steps for building an LSTM model:
1. Importing the Required Libraries
2. Basic Summary Statistics
3. Modelling
4. Prediction using the trained model
5. Evaluation

## Prerequisites:
- The data is available at this **[link](https://www.kaggle.com/datasets/rupakroy/lstm-datasets-multivariate-univariate/code)**.
- The CSV obtained from the preprocessing and EDA analysis has to be kept under the data folder.

### 1. Importing the Libraries

In [None]:
# Libraries for reading the data and preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Deep Learning Libraries
import tensorflow as tf
import warnings

from sklearn.preprocessing import MinMaxScaler
warnings.filterwarnings("ignore")

In [None]:
# Loading Jupyter Black to format the code cells to PEP8 standards
import jupyter_black
jupyter_black.load()

In [None]:
# Reading the air quality dataset received from the EDA analysis
air_quality_data = pd.read_csv("data/air_quality_data.csv")
air_quality_data

### 2. Basic Summary Statistics

In [None]:
air_quality_data.shape

In [None]:
# Converting the date column into datetime type
# (infer_datetime_format is deprecated in recent pandas and expects a boolean, so it is dropped)
air_quality_data["date"] = pd.to_datetime(air_quality_data["date"])
air_quality_data.dtypes

In [None]:
# Setting the date as the index for the dataframe
air_quality_data.set_index("date", inplace=True)
air_quality_data.head()

### 3. Modelling

In [None]:
# Library for the evaluation metrics (MinMaxScaler and tensorflow are already imported above)
from sklearn import metrics

air_quality_data.columns

In [None]:
# Splitting the dataframe into training and testing sets by date
# (pd.datetime has been removed from pandas; pd.Timestamp is the replacement)
split_date = pd.Timestamp(2014, 12, 31)
train = air_quality_data.loc[air_quality_data.index < split_date]
test = air_quality_data.loc[air_quality_data.index >= split_date]
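
The date-based split can be checked on a small frame to confirm that the boundary timestamp falls into the test set and that the two parts do not overlap. This is a sketch on hypothetical hourly data; the frame `toy_frame` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical hourly data covering 30 and 31 December 2014
idx = pd.date_range("2014-12-30", periods=48, freq=pd.Timedelta(hours=1))
toy_frame = pd.DataFrame({"pollution": range(48)}, index=idx)

split_date = pd.Timestamp(2014, 12, 31)
toy_train = toy_frame.loc[toy_frame.index < split_date]
toy_test = toy_frame.loc[toy_frame.index >= split_date]

# Every training timestamp precedes every test timestamp
print(len(toy_train), len(toy_test))
print(toy_train.index.max(), toy_test.index.min())
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for forecasting.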

In [None]:
# Using the MinMaxScaler for scaling the data of all columns
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train = scaler.fit_transform(train)
scaled_test = scaler.transform(test)

In [None]:
# Splitting the training and testing data into features and target (pollution is the first column)
x_train = scaled_train[:, 1:]
y_train = scaled_train[:, 0]
x_test = scaled_test[:, 1:]
y_test = scaled_test[:, 0]

In [None]:
# Reshaping the data to (samples, time steps, features) with a window size of 1, i.e. one step ahead (lag = 1)
x_train = x_train.reshape((x_train.shape[0], 1, x_train.shape[1]))
x_test = x_test.reshape((x_test.shape[0], 1, x_test.shape[1]))
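
With a window size of 1, each sample holds a single time step. For longer windows, the same 3D layout can be built by stacking consecutive rows. The helper below is a sketch: the name `make_windows` and the toy arrays are hypothetical and not part of this notebook.

```python
import numpy as np


def make_windows(features, target, window):
    """Stack `window` consecutive rows of `features` into one sample and
    pair it with the target at the window's last row."""
    n_samples = features.shape[0] - window + 1
    x = np.stack([features[i:i + window] for i in range(n_samples)])
    y = target[window - 1:]
    return x, y


# Toy data: 6 time steps, 2 features
toy_features = np.arange(12, dtype=float).reshape(6, 2)
toy_target = np.arange(6, dtype=float)

x_w1, y_w1 = make_windows(toy_features, toy_target, window=1)
x_w3, y_w3 = make_windows(toy_features, toy_target, window=3)
print(x_w1.shape, x_w3.shape)
```

For `window=1` the result matches the plain reshape used above, while `window=3` yields fewer samples, each carrying three time steps of history.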

In [None]:
# Building the Keras model using LSTM and dropout layers
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM

deep_lstm_model = Sequential()
# Only the first layer needs the input shape: (time steps, features)
deep_lstm_model.add(
    LSTM(128, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2]))
)
# Adding a Dropout layer to reduce overfitting
deep_lstm_model.add(Dropout(0.2))
deep_lstm_model.add(LSTM(64, return_sequences=True))
deep_lstm_model.add(Dropout(0.2))
deep_lstm_model.add(LSTM(32, return_sequences=True))
deep_lstm_model.add(Dropout(0.2))
# One output unit per time step: the scaled pollution value
deep_lstm_model.add(Dense(1))
# Mean absolute error (MAE) is used as the loss
deep_lstm_model.compile(optimizer="adam", loss="mae")
deep_lstm_model.summary()

In [None]:
from IPython.display import Image

Image(url="Images/LSTM_Model.png.jpg", width=200, height=100)

In [None]:
# Fitting the LSTM model for epochs = 50 and batch size=5
history = deep_lstm_model.fit(
    x_train,
    y_train,
    epochs=50,
    batch_size=5,
    validation_split=0.2,
    verbose=2,
    shuffle=False,
)

In [None]:
# Saving the LSTM model for Future predictions
deep_lstm_model.save("Models/Time_Series_Forecasting_LSTM_model.h5")

In [None]:
# Plotting the Training and Validation loss of the trained model.
plt.plot(history.history["loss"], label="Training loss")
plt.plot(history.history["val_loss"], label="Validation loss")
plt.legend();

### 4. Prediction using the Trained Model

In [None]:
# Predicting the PM2.5 concentration for the next 24 hours using the trained model
import keras

# Loading the trained model for prediction
reconstructed_LSTM_Model = keras.models.load_model(
    "Models/Time_Series_Forecasting_LSTM_model.h5"
)
y_pred = reconstructed_LSTM_Model.predict(x_test)

In [None]:
# Flattening the testing dataset back to 2D so as to prepare it for the inverse transformation
x_test = x_test.reshape((x_test.shape[0], -1))
y_pred1 = y_pred.reshape((y_pred.shape[0], 1))

In [None]:
# The scaler was fitted on all columns, so the predictions are placed in the
# first (pollution) column next to the scaled features before inverting
inv_yhat = np.concatenate((y_pred1, x_test), axis=1)
# Inverse transforming using the scaler used for training the model
inv_yhat_final = scaler.inverse_transform(inv_yhat)
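
Only the first column of the inverse-transformed array is used afterwards. Since min-max scaling works column by column, the target alone could also be restored from its own training minimum and maximum, which a fitted `MinMaxScaler` exposes as `data_min_` and `data_max_`. The sketch below shows the arithmetic in plain NumPy, assuming a feature range of (0, 1); the helper `inverse_min_max` and the sample values are hypothetical.

```python
import numpy as np


def inverse_min_max(scaled, data_min, data_max):
    """Undo min-max scaling to the range (0, 1) for a single column."""
    return np.asarray(scaled, dtype=float) * (data_max - data_min) + data_min


# Hypothetical training column and its (0, 1) scaling
train_col = np.array([10.0, 20.0, 50.0])
col_min, col_max = train_col.min(), train_col.max()
scaled_col = (train_col - col_min) / (col_max - col_min)

restored = inverse_min_max(scaled_col, col_min, col_max)
print(restored)
```

In this notebook the two values would correspond to `scaler.data_min_[0]` and `scaler.data_max_[0]`, because pollution is the first column of the scaled data.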

In [None]:
# Rounding the forecasted pollution values (first column) to whole numbers
list1_pred = np.round(inv_yhat_final[:, 0]).tolist()

In [None]:
test1_no_index = test.reset_index()
list1_actual = test1_no_index["pollution"]
list1_actual = list1_actual.tolist()

In [None]:
# Building a dataframe to depict the actual and the forecasted pollution for the next 24 hours
df_final = pd.DataFrame(
    {"Forecasted Pollution": list1_pred, "Actual Pollution": list1_actual}
)
df_final

In [None]:
# Building a graph to visualize the actual and the predicted pollution level for the next 24 hours
plt.figure(figsize=(10, 6))
plt.plot(
    df_final["Forecasted Pollution"],
    color="Darkblue",
    label="Predicted Pollution level",
)
plt.plot(df_final["Actual Pollution"], color="green", label="Actual Pollution level")
plt.title("Air Pollution Prediction (Multivariate)")
plt.xlabel("Hours")
plt.ylabel("Pollution level")
plt.legend()
# Saving before showing: plt.show() clears the figure, so saving afterwards writes a blank image
plt.savefig("graph.png")
plt.show()

### 5. Evaluation

In [None]:
# Function to retrieve the Mean Squared Error, Root Mean Squared Error and the Mean Absolute Error
def diagnostics(y_pred, y_valid):
    mse = np.mean(np.square(y_pred - y_valid))
    print("The MSE is: ", mse)
    rmse = np.sqrt(mse)
    print("The RMSE is: ", rmse)
    mae = np.mean(np.abs(y_pred - y_valid))
    print("The MAE is: ", mae)

In [None]:
diagnostics(df_final["Forecasted Pollution"], df_final["Actual Pollution"])
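
The three metrics can be verified by hand on a few values. The variant below returns the numbers instead of printing them, which makes them easier to reuse. It is a sketch: `diagnostics_values` and the sample numbers are hypothetical and are not results from this notebook.

```python
import numpy as np


def diagnostics_values(y_pred, y_valid):
    """Return MSE, RMSE and MAE as a dictionary instead of printing them."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_valid, dtype=float)
    mse = float(np.mean(np.square(errors)))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(errors)))
    return {"mse": mse, "rmse": rmse, "mae": mae}


# Hypothetical forecasts and actual values: the errors are 1, 0 and -2
result = diagnostics_values([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])
print(result)
```

Here the MSE is (1 + 0 + 4) / 3 and the MAE is (1 + 0 + 2) / 3 = 1. The RMSE is in the same units as the pollution values, which makes it easier to compare against the mean pollution level.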

In [None]:
# Printing the overall actual mean of the pollution (PM2.5 concentration) for the test data
test.pollution.mean()

In [None]:
# Printing the overall forecasted mean of the pollution (PM2.5 concentration) for the test data
df_final["Forecasted Pollution"].mean()

## Observations/Insights:
- The forecasted PM2.5 concentration (pollution level) is close to the actual values.
- Considered hour by hour, the predictions are good.
- The model has learnt the trend across the hours well and can be used to forecast future hours.
- The overall means of the forecast and the actual values are close.
- Therefore, LSTM can be considered as a model for forecasting time series data.