# DSCI 619: Deep Learning
## Project 2
Symphony Hopkins

## Introduction

As data scientists at the ClimateChange company, we were given the Beijing Multi-Site Air Quality datasets.

Each dataset contains the following features:
* No: row number
* date: date of the observation in the format of year-month-day
* hour: hour of data in this row
* PM10: PM10 concentration (ug/m^3)
* SO2: SO2 concentration (ug/m^3)
* NO2: NO2 concentration (ug/m^3)
* CO: CO concentration (ug/m^3)
* O3: O3 concentration (ug/m^3)
* TEMP: temperature (degree Celsius)
* PRES: pressure (hPa)
* DEWP: dew point temperature (degree Celsius)
* RAIN: precipitation (mm)
* wd: wind direction
* WSPM: wind speed (m/s)

Each dataset only contains one target:
* PM2.5: PM2.5 concentration (ug/m^3)


Our objective is to build a deep learning model, with tuned hyperparameters, that forecasts the PM2.5 concentration.

Data Source: [Beijing Multi-Site Air-Quality Data Data Set](https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data)



## Data Preparation

**1. Load the dataset, PRSA_Data.csv, into memory.**

First, we will import the necessary libraries to load the dataset.

In [1]:
#connecting to google drive
from google.colab import drive 
drive.mount('/content/gdrive')


Mounted at /content/gdrive


In [2]:
#importing libraries
import pandas as pd
import os
import glob

There are 12 sites in Beijing, each with their own dataset in the PRSA_Data folder. Because of this, we need to load each dataset and then combine them into a single dataframe.

In [None]:
# use glob to retrieve all 12 csv files in the PRSA_Data folder
csv_files = glob.glob(os.path.join('gdrive/My Drive/Colab Notebooks/Topic 2/PRSA_Data','*.csv'))

# reading the csv files into dataframes and concatenating them into a single dataframe
df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

#displaying df
display(df.head())

#checking dataframe information
print(f'''
Dataframe Shape: {df.shape}
Number of Sites: {len(df['station'].unique())}
Site Names: {df['station'].unique()}''')

**2. Clean and check missing values for this dataset.**

Now that we have created a dataframe, let's clean and check it for missing values.

In [None]:
#checking for missing values
df.isnull().sum(axis = 0)

Our target, PM2.5, contains null values. Because PM2.5 is what the model must learn to forecast, we will not impute its missing values; training on estimated targets would teach the model our guesses rather than real measurements. Instead, we will delete those rows.

In [35]:
#deleting rows with missing PM2.5 values
df.dropna(axis=0, subset = ['PM2.5'], inplace = True)

Let's see if it worked.

In [None]:
#checking for missing values
df.isnull().sum(axis = 0)

Now, we will use Multiple Imputation by Chained Equations (MICE) to deal with the missing values in the other columns. Before we do this, let's make all of the variables numerical.

In [None]:
#checking data types
df.dtypes

As we can see, *wd* and *station* are not numerical values, so we must convert them.

In [None]:
#converting categorical features to numerical values
cat_features = ['wd', 'station']
factors = pd.get_dummies(df[cat_features],drop_first=True)
factors.head()
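To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy dataframe. The values are made up for illustration; only the column name `wd` comes from the dataset. It shows that one category per feature is dropped and becomes the implicit baseline.

```python
import pandas as pd

# toy frame with hypothetical wind directions (not taken from the real data)
toy = pd.DataFrame({'wd': ['N', 'E', 'S', 'N']})

# full one-hot encoding: one column per category
full = pd.get_dummies(toy[['wd']])

# drop_first=True removes the first category in sorted order ('E'),
# so a row with zeros in every remaining column means 'E'
reduced = pd.get_dummies(toy[['wd']], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level avoids perfectly redundant columns, since the dropped category can always be inferred from the others.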

Now, we will drop the original categorical variables and concatenate the new dummy variables to the dataframe.



In [None]:
#concatenating converted values to dataframe
df = df.drop(cat_features,axis=1)
df = pd.concat([df,factors],axis=1)
df.head()

Now let's impute the missing values.

In [40]:
#importing libraries
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [41]:
#initializing imputer
mice_imputer = IterativeImputer()

#imputing values
df.iloc[:, :] = mice_imputer.fit_transform(df)


Let's check the dataframe for missing values to see if the values were imputed.

In [None]:
#checking for missing values
df.isnull().sum(axis = 0)
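As a small illustration of what the imputer does, the sketch below runs `IterativeImputer` on a toy dataframe. The column names `a` and `b` and all values are hypothetical. Each column with missing values is modeled as a function of the other columns, so the gap in `b` is filled from its relationship with `a`.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# hypothetical toy data: b is exactly 2 * a, with one value missing
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'b': [2.0, 4.0, np.nan, 8.0, 10.0]})

# random_state is fixed only to make the sketch repeatable
imputer = IterativeImputer(random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

The imputed value should land close to the value implied by the linear relationship, which is the behavior we rely on for the pollutant and weather columns.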

**4. Split the data into 80% of training and 20% of the test dataset.**


Now, we will split the data into training and test datasets, with 80% going into the training dataset and 20% going into the test dataset.



In [45]:
#assigning variables
X = df.drop('PM2.5',axis=1)
y = df['PM2.5']

#splitting data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 2021)

**5. Preprocess the data using the normalization method to convert all features into the range of [0,1]**


Next, we will normalize the data to convert all of the features into the range of [0,1].

In [46]:
#importing library
from sklearn.preprocessing import MinMaxScaler

#creating a scaler so we can transform the data to fit within the range of [0,1]
scaler = MinMaxScaler()

#normalizing the data
X_train= scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
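The sketch below shows, on made-up numbers, why the scaler is fitted on the training data only. Min-max scaling computes (x - min) / (max - min) with the minimum and maximum learned from the training set, so a test value outside the training range can fall outside [0,1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical single-feature data
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[2.0], [7.0]])

scaler = MinMaxScaler()
# learns min=1 and max=5 from the training data
train_scaled = scaler.fit_transform(train)
# reuses the training min and max: (x - 1) / (5 - 1)
test_scaled = scaler.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.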

## Building the Neural Network

**6. Build a neural network with two hidden layers of 20 and 10 neurons to forecast PM2.5 using all other features and TensorFlow. Does it overfit or underfit the data? Please justify your answer.**

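After cleaning and preparing the data, we can finally build the neural network with the following layers:
* First Dense Layer: 50 neurons (a trainable layer that receives the input features, so it effectively acts as an additional hidden layer)
* First Hidden Layer: 20 neurons
* Second Hidden Layer: 10 neurons
* Output Layer: 1 neuron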


In [47]:
#importing libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

#creating empty model
model = keras.Sequential()

#first dense layer has 50 neurons and receives the input features
model.add(layers.Dense(50, activation='relu'))

#first hidden layer has 20 neurons
model.add(layers.Dense(20, activation='relu'))

#second hidden layer has 10 neurons
model.add(layers.Dense(10, activation='relu'))

#output layer has 1 neuron
model.add(layers.Dense(1, activation='relu'))

#configuring the model
model.compile(optimizer='adam',loss='mse')

Now that we have set up the model, we can fit it to the data. While fitting, we will keep the history of the training and validation loss to later determine whether the model is under-fitting or over-fitting the data.

In [None]:
%%time
#fixing the seed 
tf.random.set_seed(1)

#fitting the model and saving the history of the training and validation losses
history = model.fit(x=X_train,y=y_train,batch_size=64,epochs=100,
          validation_data=(X_test,y_test), verbose=0)

Next, let's look at the model's Mean Squared Error (MSE).

In [49]:
#importing library
from sklearn.metrics import mean_squared_error,mean_absolute_error

In [50]:
#predicting the target values for the test dataset
y_pred = model.predict(X_test)
print(f'Mean Squared Error: {mean_squared_error(y_test,y_pred)}')

Mean Squared Error: 327.6766379404551
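Because MSE is expressed in squared units, it can be easier to interpret its square root, the Root Mean Squared Error (RMSE), which is in the same units as PM2.5 (ug/m^3). A quick calculation from the value reported above:

```python
import math

# test MSE reported by the cell above
mse = 327.6766379404551

# RMSE is in the same units as the target
rmse = math.sqrt(mse)
print(f'Root Mean Squared Error: {rmse:.2f}')
```

This works out to roughly 18.1 ug/m^3, which can be read as a typical size of the model's forecast error on the test data.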


Now that we know the MSE, let's visualize the history of the training and validation losses by creating a line plot.

In [51]:
#converting the history into a dataframe
trainhist = pd.DataFrame(history.history)
#adding epoch column to dataframe
trainhist['epoch'] = history.epoch

In [54]:
#importing library
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
#plotting training loss vs epoch
sns.lineplot(x='epoch', y ='loss', data =trainhist)
#plotting validation loss vs epoch
sns.lineplot(x='epoch', y ='val_loss', data =trainhist)
#adding legends
plt.legend(labels=['train_loss', 'val_loss'])
plt.show()

The model shows mild over-fitting: toward the final epochs, the validation loss sits slightly above the training loss instead of tracking it. Let's see if we can improve the model by tuning the hyperparameters.
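Reading the gap off a plot is subjective, so it can help to put a number on it. The sketch below defines a hypothetical helper, `loss_gap`, that averages the difference between validation and training loss over the last few epochs. The demo values are made up and are not results from this notebook; in practice we would pass in the `trainhist` dataframe.

```python
import pandas as pd

def loss_gap(history_df, last_n=10):
    """Mean validation loss minus mean training loss over the final epochs."""
    tail = history_df.tail(last_n)
    return tail['val_loss'].mean() - tail['loss'].mean()

# made-up losses for illustration only
demo = pd.DataFrame({'loss': [400.0, 350.0, 320.0, 300.0],
                     'val_loss': [410.0, 365.0, 340.0, 330.0]})

gap = loss_gap(demo, last_n=2)
print(f'Average gap over the last 2 epochs: {gap}')
```

A gap that is positive and growing suggests over-fitting, while a gap near zero with both losses still high would point to under-fitting.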

## Hyperparameter Tuning

**7. Tune the model using the following hyperparameters using keras-tuner:**
* **First hidden layer with units between 20 and 50 with a step size of 5**
* **Second hidden layer with units between 5 and 10 with a step size of 1**
* **The dropout rate for both hidden layers is between 0.2 and 0.8 with a step size of 0.1**



We want to create an optimal model. In order to achieve this, we will tune our model using Keras Tuner to find the best hyperparameters. 
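To get a feel for the size of this search space, we can count the combinations. The sketch below assumes that both endpoints of each range are included and that a single dropout rate is shared by both hidden layers; tuning two independent dropout rates would multiply the total by another factor of 7.

```python
# candidate values for each hyperparameter, endpoints assumed inclusive
units_1 = list(range(20, 51, 5))                        # 20, 25, ..., 50
units_2 = list(range(5, 11, 1))                         # 5, 6, ..., 10
dropout = [round(0.2 + 0.1 * i, 1) for i in range(7)]   # 0.2, 0.3, ..., 0.8

# total number of distinct configurations
n_configs = len(units_1) * len(units_2) * len(dropout)
print(f'{len(units_1)} x {len(units_2)} x {len(dropout)} = {n_configs} configurations')
```

With a few hundred configurations, training each one for the full number of epochs would be expensive, which is why a search strategy that stops weak candidates early is attractive.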

In [57]:
#installing keras tuner
!pip install -q -U keras-tuner

In [58]:
#importing library
import keras_tuner as kt

In [None]:
%%time
# we need to import Dropout layers from TF
from tensorflow.keras.layers import Dropout
# we may use the constraints for the weights for the optimization algorithms
from tensorflow.keras.constraints import max_norm


#building our model inside of a function 
#the various hyperparameters will be specified with hp
def model_builder(hp):
  #creating empty model
  model = keras.Sequential()

  #like the model we first created, the first dense layer will contain 50 neurons
  model.add(layers.Dense(50, activation='relu'))

  #specifying the minimum value, maximum value, and step size for the first hidden layer
  hp_units1 = hp.Int('units_1', min_value = 20, max_value = 50, step = 5)
  #adding first hidden layer
  model.add(layers.Dense(units = hp_units1, activation = 'relu'))
  #specifying the drop out rate for the first hidden layer
  hp_dropout = hp.Float('dropout_rate', min_value = 0.2, max_value = 0.8, step = 0.1)
  #adding dropout layer for the first hidden layer
  model.add(Dropout(rate = hp_dropout))

  #specifying the minimum value, maximum value, and step size for the second hidden layer
  hp_units2 = hp.Int('units_2', min_value = 5, max_value = 10, step = 1)
  #adding second hidden layer
  model.add(layers.Dense(units = hp_units2, activation = 'relu'))
  #since the dropout rate for the second hidden layer is the same as the first hidden layer we will use the same variable
  #adding dropout layer for the second hidden layer
  model.add(Dropout(rate = hp_dropout))

  #like the model we first created, the output layer will contain 1 neuron
  model.add(layers.Dense(1))

  #configuring the model to specify the optimizer, loss function, and metrics
  model.compile(optimizer = keras.optimizers.Adam(), loss = 'mse', metrics = [tf.keras.metrics.MeanSquaredError()]) 
  
  #returning the model
  return model

Before we can tune the model, we need to instantiate the tuner. In this case, we will use Hyperband. We also want to save the search results, so we will give the tuner a directory and a project name under which to store them.

In [66]:
#instantiating the tuner with Hyperband
tuner = kt.Hyperband(model_builder, 
                     objective = 'val_loss',
                     seed = 1,
                     max_epochs = 100, 
                     directory = 'gdrive/My Drive/Colab Notebooks/Topic 2', 
                     project_name = 'Project2_HopkinsSymphony_TuningRegression')
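Hyperband is built around successive halving: many configurations are trained briefly, and only the best fraction is given a larger training budget. The sketch below is a simplified, self-contained illustration of that idea. It is not Keras Tuner's implementation, and the function `successive_halving`, the toy loss `toy_evaluate`, and all parameter values are hypothetical.

```python
def successive_halving(configs, evaluate, min_budget=1, max_budget=27, factor=3):
    """Keep the best 1/factor of the configurations each round while
    multiplying the training budget by factor."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1 and budget <= max_budget:
        # rank the remaining configurations by loss at the current budget
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        # keep only the top fraction, but never fewer than one
        keep = max(1, len(ranked) // factor)
        survivors = ranked[:keep]
        budget *= factor
    return survivors[0]

def toy_evaluate(config, budget):
    # toy loss: lowest at config 0.5, and it improves with a larger budget
    return abs(config - 0.5) + 1.0 / budget

candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
best = successive_halving(candidates, toy_evaluate)
print(f'Best configuration: {best}')
```

In this toy run, nine candidates shrink to three and then to one, so most of the budget goes to the most promising configurations.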

Because the tuner prints a lot of output, we will define a callback that clears the output each time a training run ends.

In [61]:
#importing library
import IPython
#defining a callback subclass that clears the notebook output
class ClearTrainingOutput(tf.keras.callbacks.Callback):
  #clearing the output when a training run ends
  def on_train_end(self, logs=None):
    IPython.display.clear_output(wait = True)

Next, we will perform a search on the defined hyperparameter space.

In [None]:
#performing search
tuner.search(X_train,
             y_train,
             epochs = 100,
             validation_data = (X_test,y_test),
             callbacks = [ClearTrainingOutput()])

Now, let's see what the optimal hyperparameters for the model are.

In [None]:
#searching for optimal hyperparameters
best_hps = tuner.get_best_hyperparameters(num_trials = 1)[0]
#printing outputs
print(f"""
Optimal Number of Neurons for First Hidden Layer: {best_hps.get('units_1')}. 
Optimal Number of Neurons for Second Hidden Layer: {best_hps.get('units_2')}. 
Optimal Dropout Rate for First and Second Hidden Layer: {best_hps.get('dropout_rate')}
""")

Now that we know the optimal hyperparameters, we will retrain the model using them.

In [None]:
# building the model with the optimal hyperparameters
model = tuner.hypermodel.build(best_hps)
#fitting the data
history = model.fit(x=X_train,y=y_train,batch_size=64,epochs=100,
          validation_data=(X_test,y_test), verbose=0)

Next, let's graph the history of the model.

In [None]:
#converting the history into a dataframe
trainhist = pd.DataFrame(history.history)
#adding epoch column to dataframe
trainhist['epoch'] = history.epoch

#plotting training loss vs epoch
sns.lineplot(x='epoch', y ='loss', data =trainhist)
#plotting validation loss vs epoch
sns.lineplot(x='epoch', y ='val_loss', data =trainhist)
#adding legends
plt.legend(labels=['train_loss', 'val_loss'])

Let's evaluate the model by finding the MSE.

In [None]:
#predicting the target values for the test dataset
y_pred = model.predict(X_test)
print(f'Mean Squared Error: {mean_squared_error(y_test,y_pred)}')