So far, we have become familiar with several ML techniques for numeric prediction and classification.

In this activity we will use Artificial Neural Networks (ANN) for prediction and compare them with one of the techniques we used earlier for numeric prediction (linear regression).

# Numeric prediction using regression

We will use the 50_startups dataset you are familiar with from activity 1.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
#import dataset
import pandas as pd
df = pd.read_csv('50_Startups.csv')
#defining input and outcome variables
y = df[['Profit']]  #profit
#X = dataset.drop(labels=['Profit','State'], axis=1) #other variables
X = df.drop('Profit',axis=1) #other variables
#add the binary encoded State variables to our X variable
X=pd.concat([X,pd.get_dummies(X['State'])],axis=1)
# drop the State column, since we now have the binary encoded vars
X.drop(['State'],axis=1,inplace=True)
df.head(3)

In [None]:
y.head(2)

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)  #using same random_state value for replicability


## Training a linear regression model with the train/test split

In [None]:
#### Fitting Multiple Linear Regression to the Training set  ####
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train,Y_train)

### Predictive performance of regression model

In [None]:
# Predicting the Test set results
y_pred_lin = lin_reg.predict(X_test)

In [None]:
#model evaluation
from sklearn import metrics
import math
print("Mean squared error: %.2f" % (metrics.mean_squared_error(Y_test, y_pred_lin))) #MSE
print("Root Mean squared error: %.2f" % math.sqrt(metrics.mean_squared_error(Y_test, y_pred_lin))) #RMSE
print("Mean absolute error: %.2f" % metrics.mean_absolute_error(Y_test, y_pred_lin)) # mean absolute error

## Training the Regression model using k-fold CV

In [None]:
#training the regression model with k-fold Cross validation
from sklearn import model_selection
kfold = model_selection.KFold(n_splits=5) #our data has only 50 observations
#we will output RMSE and MAE
scoremetrics=('neg_root_mean_squared_error','neg_mean_absolute_error')
for score in scoremetrics:
  cv_results = model_selection.cross_val_score(lin_reg, X, y, cv=kfold, scoring=score)
  msg = "%s: %f (%f)" % (score, -cv_results.mean(), cv_results.std())
  print(msg)

In [None]:
#predict profit for a new startup (same dataset as in activity 2_numeric prediction)
X_new=pd.read_csv('50_Startups_newinput.csv')
print(X_new)
lin_reg.predict(X_new)

# Using a Neural Network to predict numeric outcome (e.g., profit)

Let's use an Artificial Neural Network (ANN) for the same problem

In [None]:
#install some more things we need
!pip install -q git+https://github.com/tensorflow/docs

In [None]:
#if you have not loaded these yet...
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
#%tensorflow_version 2.x
import pathlib
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

In [None]:
import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

## Setting up the NN architecture

In [None]:
# Just to make sure we have the right datasets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)
#Our predictor/input variables
X_train.keys()

In the next cell we define the following architecture for our ANN and wrap it in a function:
* 1 input layer with 6 nodes, equal to the number of predictor/input variables (take a look at X_train again)
* 1 (or 2) fully-connected (dense) hidden layer(s) with 64 (or 32) nodes; we use the Rectified Linear Unit (ReLU) activation function for the nodes in the hidden layer
* 1 output layer with 1 node, since we have 1 continuous outcome variable

Later on we will add another hidden layer.

In [None]:
def build_model():
  model = keras.Sequential([ #keras.Sequential creates a stack of layers (like a placeholder for layers we want to add)
    layers.Dense(64, activation='relu', input_shape=[len(X_train.keys())]),
    #layers.Dense(64, activation='relu'), #uncomment this line to add a 2nd hidden layer later
    layers.Dense(1) # notice our output node does not have an activation function
  ])

  optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.005)

  model.compile(loss='mse', #mean-square-error (MSE) is our loss function
                optimizer=optimizer,
                metrics=['mae', 'mse']) #keep both MAE and MSE
  return model


In [None]:
#NOTE:run this cell if you want to delete the model and clear tf session
#tf.keras.backend.clear_session()
#del model

In [None]:
#call the above function once to create our model (just initializing it)
model=build_model()

In [None]:
# show model architecture (see below for details)
model.summary()
#Note: in dense_X  the number X merely keeps track of used layers

For the above NN architecture (the one with one dense layer with 64 nodes), the number of trainable parameters is 513. How so?
We have 1 parameter for each weight and 1 bias parameter for each node (except for the input nodes):
*   6x64 = 384 (weights connecting each input node to each node in the hidden layer) +
*   64 (bias parameters for the nodes in the hidden layer)
*   = 448 parameters for the hidden layer
*   64x1 = 64 (weights connecting each node in the hidden layer to the output node) +
*   1 (bias parameter for the output node)
*   = 65 parameters for the output layer
*   448 + 65 = 513 trainable parameters in total
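
The counting rule above (one weight per connection plus one bias per node) can be checked with a few lines of plain Python. This is a minimal sketch; the helper names `dense_params` and `total_params` are made up for illustration and are not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of one fully-connected layer: weights plus biases."""
    weights = n_inputs * n_units   # one weight per input-to-node connection
    biases = n_units               # one bias per node in the layer
    return weights + biases

def total_params(layer_sizes):
    """layer_sizes lists node counts from input to output, e.g. [6, 64, 1]."""
    return sum(dense_params(n_in, n_out)
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

hidden = dense_params(6, 64)      # 6*64 + 64 = 448
output = dense_params(64, 1)      # 64*1 + 1  = 65
total = total_params([6, 64, 1])  # 448 + 65  = 513
print(hidden, output, total)
```

The per-layer numbers should match the "Param #" column that `model.summary()` prints for the two dense layers. Passing a different list of layer sizes to `total_params` is one way to check other architectures.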


### Question 1
If we would use 32 nodes in the hidden layer (instead of 64), how many trainable parameters would the network have?

You don't have to, but you can change the network structure to have 32 nodes in the above cells and check the model architecture. If you do, make sure to change it back to 64 before proceeding.

ans

In [None]:
#let's take another look at our X-train dataset
X_train.head(1)

In [None]:
#making sure we can feed our data to the NN
model.predict((X_train[:10]))

## Training the NN model on unscaled data

In [None]:
EPOCHS = 100
history = model.fit(
  X_train,Y_train,
  epochs=EPOCHS, validation_split = 0.1, verbose=0, #in each epoch, 10% of the data is used as the validation set after the NN trains on the other 90%
  callbacks=[tfdocs.modeling.EpochDots()])

In [None]:
#take a look at the loss values for the first and last 5 epochs
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.head(5), hist.tail(5)

The MAE values are high!

In [None]:
#using a plotter from tfdocs to visualize our NN's training
plotter = tfdocs.plots.HistoryPlotter(smoothing_std=2)

In [None]:
#plotting the error values per epoch; shows us how MAE goes down as network trains
plotter.plot({'Basic': history}, metric = "mae")
plt.ylim([0, 50000])
#change the upper limit to roughly the MAE of your first epoch (from above cell)
plt.ylabel('MAE [Profit]')

In [None]:
#similar plot for MSE
plotter.plot({'Basic': history}, metric = "mse")
plt.ylim([0, 2e9])
#change the upper limit to roughly the MSE of your first epoch (from above cell)
plt.ylabel('MSE [Profit^2]')

In [None]:
#another way to plot the Loss values per epoch (without using tensorflowDocs plotter)
#no smoothing
fig, axs = plt.subplots(ncols=2)
fig.set_size_inches(15,4.5, forward=True)
axs[0].plot(hist['epoch'], hist["mae"])
axs[0].plot(hist['epoch'], hist["val_mae"])
axs[0].legend(['Training','Validation']) #val_* values come from the 10% validation split, not from X_test
axs[0].set_title('MAE [Profit]')
axs[0].grid()
axs[1].plot(hist['epoch'], hist["mse"])
axs[1].plot(hist['epoch'], hist["val_mse"])
axs[1].legend(['Training','Validation'])
axs[1].set_title('MSE [Profit^2]')
axs[1].grid()
plt.show()

Notice the values in both plots are high, but converge after epoch 20

https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

### NN model performance (trained on unscaled data)

In [None]:
# let's evaluate the trained NN on our test data (X_test)
y_pred=model.predict(X_test)
from sklearn import metrics
import math
#print('Coefficients: \n', lin_reg.coef_) #regression coefficients
print("Root Mean squared error: %.2f" % math.sqrt(metrics.mean_squared_error(Y_test, y_pred))) #RMSE
print("Mean absolute error: %.2f" % metrics.mean_absolute_error(Y_test, y_pred))

### Question 2
Does the NN model have a better predictive performance than the regression model?
...

## Train the NN model on scaled data (this is the right way)
Now we will train the model on the scaled data

In [None]:
#NOTE:this cell deletes the model and clears tf session
tf.keras.backend.clear_session()
del model
#delete the model to not keep any of the learned weights

In [None]:
#build the same model again and show its architecture
model=build_model()
model.summary()

In [None]:
# Feature scaling: we standardize with StandardScaler (mean 0, standard deviation 1)
from sklearn.preprocessing import StandardScaler,MinMaxScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()

In [None]:
#training the NN on scaled data (NOTE: this is the right way!)
EPOCHS = 100
history = model.fit(
  sc_X.fit_transform(X_train),sc_Y.fit_transform(Y_train),
  epochs=EPOCHS, validation_split = 0.1, verbose=0, #in each epoch, 10% of the data is used as the validation set after the NN trains on the other 90%
  callbacks=[tfdocs.modeling.EpochDots()])

In [None]:
#take a look at the loss values for the first and last 5 epochs
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.head(5), hist.tail(5)

In [None]:
#plotting the error values per epoch; shows us how MAE goes down as network trains
plotter = tfdocs.plots.HistoryPlotter(smoothing_std=2)
plotter.plot({'Basic': history}, metric = "mae")
#plt.ylim([0, 0.3])
#change the upper limit to roughly the MAE of your first epoch (from above cell)
plt.ylabel('MAE [scaled Profit]')

In [None]:
#similar plot for MSE
plotter.plot({'Basic': history}, metric = "mse")
plt.ylim([0, 0.15])
#change the upper limit to roughly the MSE of your first epoch (from above cell)
plt.ylabel('MSE [scaled Profit^2]')

In [None]:
#another way to plot the Loss values per epoch (without using tensorflowDocs plotter)
#no smoothing
fig, axs = plt.subplots(ncols=2)
fig.set_size_inches(15,4.5, forward=True)
axs[0].plot(hist['epoch'], hist["mae"])
axs[0].plot(hist['epoch'], hist["val_mae"])
axs[0].legend(['Training','Validation']) #val_* values come from the 10% validation split, not from X_test
axs[0].set_title('MAE [scaled Profit]')
axs[0].grid()
axs[1].plot(hist['epoch'], hist["mse"])
axs[1].plot(hist['epoch'], hist["val_mse"])
axs[1].legend(['Training','Validation'])
axs[1].set_title('MSE [scaled Profit^2]')
axs[1].grid()
plt.show()

### Question 3
At around which epoch (approximately) does the model not improve anymore?

### NN model performance (trained on scaled data)

In [None]:
from sklearn import metrics
import math

Since we have trained the NN on the scaled X_train, we need to scale the test data (with the scaler fitted on the training data) before feeding it to the trained NN model for predictions.

Similarly, the resulting NN model predictions (*y_pred*) are scaled, and we need to transform them back to the original scale (*sc_Y.inverse_transform(y_pred)*) for a meaningful comparison with Y_test (profit values on the original scale in our test data).
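
To make the scale / inverse-scale round trip concrete, here is a minimal plain-Python sketch of the standardization that `StandardScaler` applies, z = (x - mean) / std, where the mean and the (population) standard deviation are learned from the training data only. The numbers are hypothetical and the helper functions are only stand-ins for `sc_Y.transform` and `sc_Y.inverse_transform`.

```python
from statistics import mean, pstdev

# Hypothetical profit values, for illustration only (not from the dataset)
y_train = [100.0, 200.0, 300.0, 400.0]
y_test = [150.0, 350.0]

# "fit": learn mean and population standard deviation from the training data only
mu = mean(y_train)
sigma = pstdev(y_train)

def transform(values):
    """Standardize values with the training mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

def inverse_transform(values):
    """Map standardized values back to the original scale."""
    return [v * sigma + mu for v in values]

y_test_scaled = transform(y_test)               # what the NN works with
y_test_back = inverse_transform(y_test_scaled)  # back on the original scale
print(y_test_scaled)
print(y_test_back)
```

Note that the test values are transformed with the training statistics (`mu`, `sigma`); nothing is re-fitted on the test data. This mirrors calling `transform` rather than `fit_transform` on X_test.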

In [None]:
# let's evaluate the trained NN model
# we need to scale the test data since we have trained the NN on scaled X
y_pred=model.predict(sc_X.transform(X_test))

print("RMSE: %.2f" % math.sqrt(metrics.mean_squared_error(Y_test, sc_Y.inverse_transform(y_pred)))) #RMSE
print("MAE: %.2f" % metrics.mean_absolute_error(Y_test, sc_Y.inverse_transform(y_pred))) #MAE

In [None]:
sc_Y.inverse_transform(y_pred)

Notice that the NN model's performance improves when we (correctly) train it on the scaled data.

In [None]:
# let's look at the metrics using the scaled outcome variable (Profit)
print("RMSE: %.2f" % math.sqrt(metrics.mean_squared_error(sc_Y.transform(Y_test), y_pred))) #RMSE
print("MAE: %.2f" % metrics.mean_absolute_error(sc_Y.transform(Y_test), y_pred))


Remember that RMSE and MAE are in the same unit as the outcome variable. So when we evaluate the model on the scaled Profit (i.e., profit standardized to mean 0 and standard deviation 1), the RMSE and MAE values are naturally much smaller. But we can't really interpret them in terms of dollar values.
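
The two sets of metrics are related by a constant factor when the outcome is standardized: standardization divides every value by the training standard deviation, so every error is divided by it too. The sketch below illustrates this with hypothetical numbers in plain Python; in the notebook, the corresponding factor would presumably be the standard deviation stored by the fitted `sc_Y` (its `scale_` attribute).

```python
import math
from statistics import mean, pstdev

# Hypothetical values, for illustration only (not from the dataset)
y_train = [100.0, 200.0, 300.0, 400.0]
y_true = [150.0, 350.0]
y_pred = [170.0, 320.0]

mu, sigma = mean(y_train), pstdev(y_train)

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))

def scale(values):
    """Standardize with the training mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

rmse_dollars = rmse(y_true, y_pred)               # error on the original scale
rmse_scaled = rmse(scale(y_true), scale(y_pred))  # error on the standardized scale
print(rmse_dollars, rmse_scaled, rmse_scaled * sigma)
```

Multiplying the scaled RMSE by `sigma` recovers the RMSE on the original scale, which is why the scaled values look small without the model being any better.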

## Comparing performance against Regression model


In [None]:
print("RMSE NN model: %.2f" % math.sqrt(metrics.mean_squared_error(Y_test, sc_Y.inverse_transform(y_pred)))) #RMSE
print("MAE NN model: %.2f" % metrics.mean_absolute_error(Y_test, sc_Y.inverse_transform(y_pred)) +"\n") #MAE

print("RMSE linear regression: %.2f" % math.sqrt(metrics.mean_squared_error(Y_test, y_pred_lin))) #RMSE
print("MAE linear regression: %.2f" % metrics.mean_absolute_error(Y_test, y_pred_lin))

### Question 4
As you can see, in our example the NN model performs worse than the Regression model. Why?

This is because we have a very small dataset: 50 observations, of which we use 40 for training. NNs perform well on large datasets; here our NN is "learning" (adjusting all those parameters) based on a very limited number of observations.
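
A rough way to see the imbalance is to compare the number of trainable parameters with the number of training observations. The sketch below assumes the 6-64-1 architecture used above and a linear regression with 6 coefficients plus an intercept; it is back-of-the-envelope arithmetic, not a formal capacity measure.

```python
n_observations = 50
test_size = 0.2
n_train = round(n_observations * (1 - test_size))   # 40 training observations

# NN with one hidden layer of 64 nodes: weights + biases per layer
nn_params = (6 * 64 + 64) + (64 * 1 + 1)   # 448 + 65 = 513

# Linear regression: one coefficient per input variable plus an intercept
linreg_params = 6 + 1                      # 7

print("NN parameters per training observation: %.2f" % (nn_params / n_train))
print("Regression parameters per training observation: %.3f" % (linreg_params / n_train))
```

The NN has to fit more than ten parameters per training observation, while the regression model fits far fewer parameters than it has observations.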

# Using a Neural network like a regression?!

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

In [None]:
tf.keras.backend.clear_session()
del model

In [None]:
def build_model():
  model = keras.Sequential([ #keras.Sequential creates a stack of layers (like a placeholder for layers we want to add)
    layers.Dense(units=1,activation=None, input_shape=[len(X_train.keys())])
  ])

  optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.005)

  model.compile(loss='mse', #mean-square-error (MSE) is our loss function
                optimizer=optimizer,
                metrics=['mae', 'mse']) #keep both MAE and MSE
  return model

In [None]:
model=build_model()
model.summary()

In [None]:
#training on unscaled data
EPOCHS = 100
history = model.fit(
  X_train,Y_train,
  epochs=EPOCHS, validation_split = 0.1, verbose=0, #in each epoch, 10% of the data is used as the validation set after the NN trains on the other 90%
  callbacks=[tfdocs.modeling.EpochDots()])

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch

fig, axs = plt.subplots(ncols=2)
fig.set_size_inches(15,4.5, forward=True)
axs[0].plot(hist['epoch'], hist["mae"])
axs[0].plot(hist['epoch'], hist["val_mae"])
axs[0].legend(['Training','Validation']) #val_* values come from the 10% validation split, not from X_test
axs[0].set_title('MAE [Profit]')
axs[0].grid()
axs[1].plot(hist['epoch'], hist["mse"])
axs[1].plot(hist['epoch'], hist["val_mse"])
axs[1].legend(['Training','Validation'])
axs[1].set_title('MSE [Profit^2]')
axs[1].grid()
plt.show()

In [None]:
y_pred=model.predict(X_test)
print("Root Mean squared error: %.2f" % math.sqrt(metrics.mean_squared_error(Y_test, y_pred))) #RMSE
print("Mean absolute error: %.2f" % metrics.mean_absolute_error(Y_test, y_pred))

In [None]:
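#training on scaled data
#rebuild the model first so we do not keep the weights learned on the unscaled data
tf.keras.backend.clear_session()
model=build_model()
EPOCHS = 300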
history = model.fit(
  sc_X.fit_transform(X_train),sc_Y.fit_transform(Y_train),
  epochs=EPOCHS, validation_split = 0.1, verbose=0, #in each epoch, 10% of the data is used as the validation set after the NN trains on the other 90%
  callbacks=[tfdocs.modeling.EpochDots()])

In [None]:
y_pred=model.predict(sc_X.transform(X_test))

print("RMSE: %.2f" % math.sqrt(metrics.mean_squared_error(Y_test, sc_Y.inverse_transform(y_pred)))) #RMSE
print("MAE: %.2f" % metrics.mean_absolute_error(Y_test, sc_Y.inverse_transform(y_pred))) #MAE