# Tuning Neural Networks with Normalization - Lab

## Introduction

For this lab on initialization and optimization, you'll build a neural network to perform a regression task.

It is worth noting that getting regression to work with neural networks can be difficult because the output is unbounded ($\hat y$ can technically range from $-\infty$ to $+\infty$), and the models are especially prone to exploding gradients. This makes a regression exercise a good learning case for tinkering with normalization and optimization strategies to ensure proper convergence!

## Objectives
You will be able to:
* Build a neural network using Keras
* Normalize your data to assist algorithm convergence
* Implement and observe the impact of various initialization techniques

In [1]:
import numpy as np
import pandas as pd
# Use the public tensorflow.keras API rather than the private tensorflow.python path
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from sklearn.model_selection import train_test_split

## Loading the data

The data comes from posts published during 2014 on the Facebook page of a renowned cosmetics brand. It includes 7 features known prior to post publication and 12 features for evaluating the post impact. The goal is to predict the number of "likes" a post receives, using only the 7 features known before posting.

First, let's import the data set, `dataset_Facebook.csv`, and delete any rows with missing data. Afterwards, briefly preview the data.

In [2]:
data = pd.read_csv("dataset_Facebook.csv", sep= ";", header=0)
data = data.dropna()
print(np.shape(data))
data.head()

(495, 19)


| | Page total likes | Type | Category | Post Month | Post Weekday | Post Hour | Paid | Lifetime Post Total Reach | Lifetime Post Total Impressions | Lifetime Engaged Users | Lifetime Post Consumers | Lifetime Post Consumptions | Lifetime Post Impressions by people who have liked your Page | Lifetime Post reach by people who like your Page | Lifetime People who have liked your Page and engaged with your post | comment | like | share | Total Interactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 139441 | Photo | 2 | 12 | 4 | 3 | 0.0 | 2752 | 5091 | 178 | 109 | 159 | 3078 | 1640 | 119 | 4 | 79.0 | 17.0 | 100 |
| 1 | 139441 | Status | 2 | 12 | 3 | 10 | 0.0 | 10460 | 19057 | 1457 | 1361 | 1674 | 11710 | 6112 | 1108 | 5 | 130.0 | 29.0 | 164 |
| 2 | 139441 | Photo | 3 | 12 | 3 | 3 | 0.0 | 2413 | 4373 | 177 | 113 | 154 | 2812 | 1503 | 132 | 0 | 66.0 | 14.0 | 80 |
| 3 | 139441 | Photo | 2 | 12 | 2 | 10 | 1.0 | 50128 | 87991 | 2211 | 790 | 1119 | 61027 | 32048 | 1386 | 58 | 1572.0 | 147.0 | 1777 |
| 4 | 139441 | Photo | 2 | 12 | 2 | 3 | 0.0 | 7244 | 13594 | 671 | 410 | 580 | 6228 | 3200 | 396 | 19 | 325.0 | 49.0 | 393 |


## Defining the Problem

Define X and Y and perform a train-validation-test split.

X will be:
* Page total likes
* Post Month
* Post Weekday
* Post Hour
* Paid

along with dummy variables for:

* Type
* Category

Y will be the `like` column.
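
As a rough illustration of why the network later uses `input_dim=10`, the sketch below builds dummy variables on a tiny, made-up frame. The toy values are hypothetical; the only assumption carried over from the real data is that `Type` has four levels and `Category` has three, so that `drop_first=True` leaves three and two dummy columns respectively.

```python
import pandas as pd

# Hypothetical miniature of the two categorical columns (values are illustrative only).
toy = pd.DataFrame({
    "Type": ["Photo", "Status", "Link", "Video", "Photo"],
    "Category": [1, 2, 3, 2, 1],
})

# drop_first=True drops the first level (in sorted order) of each variable,
# which avoids a redundant, perfectly collinear column.
type_dummies = pd.get_dummies(toy["Type"], drop_first=True)
category_dummies = pd.get_dummies(toy["Category"], drop_first=True)

# Five numeric features: Page total likes, Post Month, Post Weekday, Post Hour, Paid.
n_numeric = 5
n_features = n_numeric + type_dummies.shape[1] + category_dummies.shape[1]

print(list(type_dummies.columns), list(category_dummies.columns), n_features)
```

With four `Type` levels and three `Category` levels, this gives 5 + 3 + 2 = 10 input columns.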

In [3]:
X0 = data['Page total likes']
X1 = data["Type"]
X2 = data["Category"]
X3 = data["Post Month"]
X4 = data["Post Weekday"]
X5 = data["Post Hour"]
X6 = data["Paid"]

dummy_X1 = pd.get_dummies(X1, drop_first=True)
dummy_X2 = pd.get_dummies(X2, drop_first=True)

X = pd.concat([X0, dummy_X1, dummy_X2, X3, X4, X5, X6], axis=1)
Y = data["like"]

data_clean = pd.concat([X, Y], axis=1)
np.random.seed(123)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=123)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=123)
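
To see how many rows each split receives, the arithmetic can be sketched as follows. The helper `split_sizes` is hypothetical, and it assumes that the held-out count is rounded up, which is how `train_test_split` is understood to behave for fractional `test_size` values.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out), rounding the held-out count up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_rows = 495                                       # rows left after dropna()
n_train_val, n_test = split_sizes(n_rows, 0.1)     # first split: 10% test
n_train, n_val = split_sizes(n_train_val, 0.2)     # second split: 20% validation

print(n_train, n_val, n_test)
```

Under these assumptions the 495 rows divide into 356 training, 89 validation, and 50 test rows.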


## Building a Baseline Model

Next, build a naive baseline model to serve as a reference point for comparing performance. From there, you can observe the impact of various tuning procedures that iteratively improve your model.

In [4]:
# Simply run this code block; later you'll modify this model to tune its performance
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose=0)



### Evaluating the Baseline

Evaluate the baseline model for the training and validation sets.

In [5]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train - Y_train)**2)
MSE_val = np.mean((pred_val - Y_val)**2)

print("MSE_train:", MSE_train)
print("MSE_val:", MSE_val)

MSE_train: nan
MSE_val: nan


In [6]:
hist.history['loss'][:10]

[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

> Notice this extremely problematic behavior: all the values for training and validation loss are "nan". This indicates that the algorithm did not converge. The first solution to this is to normalize the input. From there, if convergence is not achieved, normalizing the output may also be required.
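
One way such `nan` losses can arise is illustrated below with plain gradient descent on a linear model in NumPy. This is a simplified stand-in, not the Keras model above: the data is synthetic, and the function name `gd_linear`, the learning rate, and the feature range (chosen to resemble `Page total likes`) are all assumptions made for the sketch.

```python
import numpy as np

def gd_linear(X, y, lr=0.1, epochs=100):
    """Full-batch gradient descent on MSE for a linear model; returns the loss per epoch."""
    w = np.zeros(X.shape[1])
    b = 0.0
    losses = []
    for _ in range(epochs):
        err = X @ w + b - y
        losses.append(float(np.mean(err ** 2)))
        grad_w = 2 * X.T @ err / len(y)
        grad_b = 2 * err.mean()
        w = w - lr * grad_w
        b = b - lr * grad_b
    return losses

rng = np.random.default_rng(0)
raw = rng.uniform(80_000, 140_000, size=(200, 1))   # synthetic large-scale feature
y = 0.001 * raw[:, 0] + rng.normal(0, 1, size=200)

# With the raw feature, each update overshoots by a huge factor until values overflow.
with np.errstate(all="ignore"):
    raw_losses = gd_linear(raw, y)

# Standardizing the feature keeps the updates well scaled.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
scaled_losses = gd_linear(scaled, y)

print("raw feature, final loss:   ", raw_losses[-1])
print("scaled feature, final loss:", scaled_losses[-1])
```

The raw-feature run ends with a non-finite loss, while the standardized run settles at a small finite value.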

## Normalize the Input Data

Normalize the numeric input features by subtracting each feature's mean and dividing by its standard deviation, so that each has mean 0 and standard deviation 1. (This rescales the features; it does not change the shape of their distributions.) Then recreate the train-validation-test sets with the transformed input data.
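
The cell below standardizes using statistics from the full data set before splitting. A common alternative, sketched here on synthetic data, is to compute the mean and standard deviation on the training split only and reuse them for the other splits, so that no information from held-out rows leaks into training. The helper names `fit_standardizer` and `apply_standardizer` are hypothetical and are not part of this lab's code.

```python
import numpy as np

def fit_standardizer(train):
    """Compute per-column mean and standard deviation from the training rows only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # guard against constant columns
    return mean, std

def apply_standardizer(values, mean, std):
    """Standardize any split with statistics learned from the training split."""
    return (values - mean) / std

rng = np.random.default_rng(123)
# Synthetic stand-ins for two numeric features on very different scales.
train = rng.normal(loc=[100_000, 6], scale=[15_000, 3], size=(300, 2))
val = rng.normal(loc=[100_000, 6], scale=[15_000, 3], size=(80, 2))

mean, std = fit_standardizer(train)
train_z = apply_standardizer(train, mean, std)
val_z = apply_standardizer(val, mean, std)

print(train_z.mean(axis=0).round(6), train_z.std(axis=0).round(6))
print(val_z.mean(axis=0).round(3), val_z.std(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 exactly, while the validation columns are only approximately standardized, which is expected.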

In [9]:
## standardize/categorize
X0= (X0-np.mean(X0))/(np.std(X0))
X3= (X3-np.mean(X3))/(np.std(X3))
X4= (X4-np.mean(X4))/(np.std(X4))
X5= (X5-np.mean(X5))/(np.std(X5))
X6= (X6-np.mean(X6))/(np.std(X6))

X = pd.concat([X0, dummy_X1, dummy_X2, X3, X4, X5, X6], axis=1)

data_clean = pd.concat([X, Y], axis=1)
np.random.seed(123)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=123)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=123)

## Refit the Model and Reevaluate

Great! Now refit the model and once again assess its performance on the training and validation sets.

In [10]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation='linear'))

model.compile(optimizer='sgd', loss='mse', metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                epochs=100, validation_data=(X_val, Y_val), verbose=0)

In [11]:
hist.history['loss'][:10]

[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

> Note that you still haven't achieved convergence! From here, it's time to normalize the output data.

## Normalizing the output

Normalize Y as you did X by subtracting the mean and dividing by the standard deviation. Then, resplit the data into training and validation sets as we demonstrated above, and retrain a new model using your normalized X and Y data.
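
Once Y is standardized, errors are measured in units of standard deviations rather than likes. The sketch below, using synthetic targets and a made-up stand-in for model predictions, shows one way to map predictions back to the original scale and confirms that the two MSE values differ by a factor of the squared standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
likes = rng.gamma(shape=1.0, scale=180.0, size=495)   # hypothetical skewed targets

y_mean, y_std = likes.mean(), likes.std()
likes_z = (likes - y_mean) / y_std                     # standardized targets

# Stand-in for model output on the standardized scale (not a real model).
pred_z = likes_z + rng.normal(0, 0.5, size=likes_z.shape)

# Undo the standardization to get predictions in likes.
pred_likes = pred_z * y_std + y_mean

mse_z = np.mean((pred_z - likes_z) ** 2)
mse_likes = np.mean((pred_likes - likes) ** 2)

print(mse_likes, mse_z * y_std ** 2)
```

The two printed numbers agree, so an MSE reported on standardized Y can be converted to squared likes by multiplying by the squared standard deviation of the original target.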

In [12]:
Y = (data["like"]-np.mean(data["like"])) / (np.std(data["like"]))

In [13]:
data_clean = pd.concat([X, Y], axis=1)
np.random.seed(123)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=123)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=123)

In [14]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation='linear'))

model.compile(optimizer='sgd', loss='mse', metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                epochs=100, validation_data= (X_val, Y_val), verbose=0)



Again, reevaluate the updated model.

In [15]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train - Y_train)**2)
MSE_val = np.mean((pred_val - Y_val)**2)

print("MSE_train: ", MSE_train)
print("MSE_val: ", MSE_val)

MSE_train:  1.0400182702899643
MSE_val:  0.950539861498632
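
To put these numbers in context: a standardized target has variance 1, so a model that always predicts 0 (the standardized mean) scores an MSE of about 1 on the full data. The sketch below checks this on synthetic targets; the exact baseline on an individual split will differ somewhat, since each split has its own mean and variance.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.gamma(shape=1.0, scale=180.0, size=495)   # hypothetical skewed target
y_z = (y - y.mean()) / y.std()

# Constant predictor: always output the standardized mean, which is 0.
baseline_pred = np.zeros_like(y_z)
baseline_mse = np.mean((baseline_pred - y_z) ** 2)

print(baseline_mse)
```

An MSE close to 1 on standardized Y therefore suggests that a model explains little more of the variance than this constant predictor would.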


In [16]:
hist.history['loss'][:10]

[1.3167034337359869,
 1.2431428392281694,
 1.2041585827141665,
 1.1760507788550987,
 1.1595090379875697,
 1.1470280905787864,
 1.1392238903581426,
 1.1302399009131314,
 1.1256035938021842,
 1.1207207330826963]

Great! Now that you have a converged model, you can also experiment with alternative optimizers and initialization strategies to see if you can reach a lower loss. (After all, the current model may have converged to a local minimum.)

## Using Weight Initializers

Below, take a look at the code provided to see how to modify the neural network to use alternative initialization and optimization strategies. At the end, you'll then be asked to select the model which you believe is the strongest.
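
For reference, the two initializers used in the cells that follow differ mainly in the spread of the initial weights: He normal uses a standard deviation of $\sqrt{2 / \text{fan\_in}}$ and LeCun normal uses $\sqrt{1 / \text{fan\_in}}$. The sketch below computes both for this network's first layer. It draws from a plain normal distribution for simplicity, whereas Keras draws from a truncated normal, and the helper names are hypothetical.

```python
import numpy as np

def he_normal_std(fan_in):
    """Standard deviation used by He normal initialization."""
    return np.sqrt(2.0 / fan_in)

def lecun_normal_std(fan_in):
    """Standard deviation used by LeCun normal initialization."""
    return np.sqrt(1.0 / fan_in)

fan_in, units = 10, 8   # first Dense layer: 10 inputs, 8 hidden units
rng = np.random.default_rng(123)

# Plain normal draws as an approximation; Keras uses a truncated normal instead.
w_he = rng.normal(0.0, he_normal_std(fan_in), size=(fan_in, units))
w_lecun = rng.normal(0.0, lecun_normal_std(fan_in), size=(fan_in, units))

print(he_normal_std(fan_in), lecun_normal_std(fan_in))
print(w_he.shape, w_lecun.shape)
```

With 10 inputs, the He standard deviation is about 0.447 and the LeCun standard deviation about 0.316, so the starting weights differ by a factor of $\sqrt{2}$.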

## He Initialization

In [17]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, kernel_initializer= "he_normal",
                activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val),verbose=0)



In [18]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)



In [19]:
print(MSE_train)
print(MSE_val)

1.0209460681003533
0.9727789032911952


## LeCun Initialization

In [20]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, 
                kernel_initializer= "lecun_normal", activation='tanh'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose=0)



In [21]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)



In [22]:
print(MSE_train)
print(MSE_val)

1.045334180121194
0.9371290789467398


There is not much of a difference here, but initialization is still worth keeping in mind when tuning your network. Next, let's investigate the impact of various optimization algorithms.

## RMSprop

In [23]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "rmsprop" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose = 0)



In [24]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)



In [25]:
print(MSE_train)
print(MSE_val)

1.039178066931461
0.9706028250495943


## Adam

In [26]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "Adam" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose = 0)



In [27]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)



In [28]:
print(MSE_train)
print(MSE_val)

1.0484436819046854
0.9750867774333121


## Learning Rate Decay with Momentum
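
As a minimal sketch of the idea, the function below applies momentum together with time-based learning rate decay to a one-dimensional quadratic. The schedule `lr / (1 + decay * t)` is assumed to mirror what the `decay` argument of the older Keras SGD optimizer does; treat that as an assumption rather than a statement about any particular Keras version. The function name and the test objective are hypothetical.

```python
def sgd_momentum_decay(grad_fn, w0, lr=0.03, decay=0.0001, momentum=0.9, steps=300):
    """Gradient descent with momentum and time-based learning rate decay."""
    w, velocity = w0, 0.0
    for t in range(steps):
        lr_t = lr / (1.0 + decay * t)                        # learning rate shrinks over time
        velocity = momentum * velocity - lr_t * grad_fn(w)   # accumulate past gradients
        w = w + velocity
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w_final = sgd_momentum_decay(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

Momentum lets each step reuse part of the previous update, which can speed up progress along consistent gradient directions, while the decay gradually reduces the step size.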


In [29]:
np.random.seed(123)
sgd = optimizers.SGD(lr=0.03, decay=0.0001, momentum=0.9)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= sgd ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose = 0)

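ValueError: Tried to convert 'y' to a tensor and failed. Error: None values not supported.

> Note that because this cell raised an error, the model was most likely never trained, so the metrics below probably reflect randomly initialized weights rather than the effect of momentum and learning rate decay.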

In [30]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

In [31]:
print(MSE_train)
print(MSE_val)

1.9611747118763914
1.5528834438316168


## Selecting a Final Model

Now, select the model with the best performance based on the training and validation sets. Evaluate this top model using the test set!

In [32]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, kernel_initializer= "he_normal",
                activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val),verbose=0)

pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

pred_test = model.predict(X_test).reshape(-1)
MSE_test = np.mean((pred_test-Y_test)**2)

print(MSE_train)
print(MSE_val)
print(MSE_test)

1.013561232658164
0.9433780760138251
0.18028308734515974


## Summary  

In this lab, you worked to ensure your model converged properly by normalizing both the inputs and the output. You also investigated the impact of varying initialization and optimization routines.