# Tuning Neural Networks with Normalization - Lab

## Introduction

For this lab on initialization and optimization, you'll build a neural network to perform a regression task.

It is worth noting that getting regression to work with neural networks can be difficult because the output is unbounded ($\hat y$ can technically range from $-\infty$ to $+\infty$), and the models are especially prone to exploding gradients. This makes a regression exercise a good test case for experimenting with normalization and optimization strategies that ensure proper convergence.

## Objectives
You will be able to:
* Build a neural network using Keras
* Normalize your data to assist algorithm convergence
* Implement and observe the impact of various initialization techniques

In [1]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras import initializers
from keras import layers
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from keras import optimizers
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


## Loading the data

The dataset describes Facebook posts published during 2014 on the page of a renowned cosmetics brand. It includes 7 features known prior to post publication and 12 features for evaluating the post's impact. The goal is to predict the number of "likes" a post receives from the 7 features known before posting.

First, let's import the data set, `dataset_Facebook.csv`, and delete any rows with missing data. Afterwards, briefly preview the data.

In [2]:
#Your code here; load the dataset and drop rows with missing values. Then preview the data.
df = pd.read_csv('./dataset_Facebook.csv', sep=';')
df.tail()

Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,comment,like,share,Total Interactions
495,85093,Photo,3,1,7,2,0.0,4684,7536,733,708,985,4750,2876,392,5,53.0,26.0,84
496,81370,Photo,2,1,5,8,0.0,3480,6229,537,508,687,3961,2104,301,0,53.0,22.0,75
497,81370,Photo,1,1,5,2,0.0,3778,7216,625,572,795,4742,2388,363,4,93.0,18.0,115
498,81370,Photo,3,1,4,11,0.0,4156,7564,626,574,832,4534,2452,370,7,91.0,38.0,136
499,81370,Photo,2,1,4,4,,4188,7292,564,524,743,3861,2200,316,0,91.0,28.0,119


In [3]:
df.isnull().any().any()

True

In [4]:
df.isnull().sum()

Page total likes                                                       0
Type                                                                   0
Category                                                               0
Post Month                                                             0
Post Weekday                                                           0
Post Hour                                                              0
Paid                                                                   1
Lifetime Post Total Reach                                              0
Lifetime Post Total Impressions                                        0
Lifetime Engaged Users                                                 0
Lifetime Post Consumers                                                0
Lifetime Post Consumptions                                             0
Lifetime Post Impressions by people who have liked your Page           0
Lifetime Post reach by people who like your Page   

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
Page total likes                                                       500 non-null int64
Type                                                                   500 non-null object
Category                                                               500 non-null int64
Post Month                                                             500 non-null int64
Post Weekday                                                           500 non-null int64
Post Hour                                                              500 non-null int64
Paid                                                                   499 non-null float64
Lifetime Post Total Reach                                              500 non-null int64
Lifetime Post Total Impressions                                        500 non-null int64
Lifetime Engaged Users                                                 500 non-nul

In [6]:
df.shape

(500, 19)

In [7]:
df = df.dropna()
df.shape

(495, 19)

## Defining the Problem

Define X and Y and perform a train-validation-test split.

X will be:
* Page total likes
* Post Month
* Post Weekday
* Post Hour
* Paid
along with dummy variables for:
* Type
* Category

Y will be the `like` column.
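
To make the dummy-variable step concrete, here is a minimal sketch on a small made-up frame (the `toy` data and the `Category_` prefix are assumptions for illustration, not part of the lab data). It shows how `drop_first=True` removes one level per categorical column, so 4 post types become 3 columns and 3 categories become 2.

```python
import pandas as pd

# Hypothetical toy frame that mimics the two categorical columns in the lab
toy = pd.DataFrame({
    "Type": ["Photo", "Status", "Link", "Video", "Photo"],
    "Category": [1, 2, 3, 2, 1],
})

# drop_first=True drops the first level (alphabetically "Link", numerically 1)
type_dummies = pd.get_dummies(toy["Type"], drop_first=True)
cat_dummies = pd.get_dummies(toy["Category"], prefix="Category", drop_first=True)

encoded = pd.concat([type_dummies, cat_dummies], axis=1)
print(list(encoded.columns))
```

The dropped level becomes the implicit reference: a row with zeros in `Photo`, `Status`, and `Video` is a `Link` post.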

In [8]:
for idx, col in enumerate(df.columns):
    print(idx,col)

0 Page total likes
1 Type
2 Category
3 Post Month
4 Post Weekday
5 Post Hour
6 Paid
7 Lifetime Post Total Reach
8 Lifetime Post Total Impressions
9 Lifetime Engaged Users
10 Lifetime Post Consumers
11 Lifetime Post Consumptions
12 Lifetime Post Impressions by people who have liked your Page
13 Lifetime Post reach by people who like your Page
14 Lifetime People who have liked your Page and engaged with your post
15 comment
16 like
17 share
18 Total Interactions


In [9]:
#Your code here; define the problem.
X = df.iloc[:,[0, 1, 2, 3, 4, 5, 6]]
X[:10]

Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid
0,139441,Photo,2,12,4,3,0.0
1,139441,Status,2,12,3,10,0.0
2,139441,Photo,3,12,3,3,0.0
3,139441,Photo,2,12,2,10,1.0
4,139441,Photo,2,12,2,3,0.0
5,139441,Status,2,12,1,9,0.0
6,139441,Photo,3,12,1,3,1.0
7,139441,Photo,3,12,7,9,1.0
8,139441,Status,2,12,7,3,0.0
9,139441,Photo,3,12,6,10,0.0


In [10]:
df['Type'].unique()

array(['Photo', 'Status', 'Link', 'Video'], dtype=object)

In [11]:
np.unique(df['Category'])

array([1, 2, 3], dtype=int64)

In [12]:
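# Caution: index 17 is the `share` column (see the Series name in the output below);
# the intended target `like` sits at index 16, i.e. df.iloc[:, 16] or df['like'].
y = df.iloc[:, 17]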
y[:10]

0     17.0
1     29.0
2     14.0
3    147.0
4     49.0
5     33.0
6     27.0
7     14.0
8     31.0
9     26.0
Name: share, dtype: float64

In [13]:
dummy_X1= pd.get_dummies(X['Type'], drop_first=True)
dummy_X2= pd.get_dummies(X['Category'], drop_first=True)

X = pd.concat([X, dummy_X1, dummy_X2], axis=1)
X[:10]


Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid,Photo,Status,Video,2,3
0,139441,Photo,2,12,4,3,0.0,1,0,0,1,0
1,139441,Status,2,12,3,10,0.0,0,1,0,1,0
2,139441,Photo,3,12,3,3,0.0,1,0,0,0,1
3,139441,Photo,2,12,2,10,1.0,1,0,0,1,0
4,139441,Photo,2,12,2,3,0.0,1,0,0,1,0
5,139441,Status,2,12,1,9,0.0,0,1,0,1,0
6,139441,Photo,3,12,1,3,1.0,1,0,0,0,1
7,139441,Photo,3,12,7,9,1.0,1,0,0,0,1
8,139441,Status,2,12,7,3,0.0,0,1,0,1,0
9,139441,Photo,3,12,6,10,0.0,1,0,0,0,1


In [14]:
X = X.drop(['Type','Category'], axis=1)
X[:10]

Unnamed: 0,Page total likes,Post Month,Post Weekday,Post Hour,Paid,Photo,Status,Video,2,3
0,139441,12,4,3,0.0,1,0,0,1,0
1,139441,12,3,10,0.0,0,1,0,1,0
2,139441,12,3,3,0.0,1,0,0,0,1
3,139441,12,2,10,1.0,1,0,0,1,0
4,139441,12,2,3,0.0,1,0,0,1,0
5,139441,12,1,9,0.0,0,1,0,1,0
6,139441,12,1,3,1.0,1,0,0,0,1
7,139441,12,7,9,1.0,1,0,0,0,1
8,139441,12,7,3,0.0,0,1,0,1,0
9,139441,12,6,10,0.0,1,0,0,0,1


In [15]:
X.shape

(495, 10)

In [16]:
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)  
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=123) 
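
The two-stage split above holds out 10% of the rows for testing and then 20% of the remainder for validation, which works out to roughly 72% / 18% / 10%. The sketch below reproduces that arithmetic with plain NumPy index shuffling. It is not scikit-learn's implementation, and rounding fractional sizes up is an assumption made here.

```python
import math
import numpy as np

n_rows = 495  # rows left after dropping missing values
rng = np.random.default_rng(123)
indices = rng.permutation(n_rows)

# Stage 1: hold out 10% of all rows as the test set (rounded up, by assumption)
n_test = math.ceil(0.1 * n_rows)
test_idx, rest_idx = indices[:n_test], indices[n_test:]

# Stage 2: hold out 20% of the remaining rows as the validation set
n_val = math.ceil(0.2 * len(rest_idx))
val_idx, train_idx = rest_idx[:n_val], rest_idx[n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 356 89 50
```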

## Building a Baseline Model

Next, build a naive baseline model. A baseline is a helpful reference point: once you have one, you can observe how various tuning procedures iteratively improve your model.

In [17]:
# Simply run this code block; later you'll modify this model to tune its performance
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, y_val), verbose=0)

### Evaluating the Baseline

Evaluate the baseline model for the training and validation sets.

In [18]:
#Your code here; evaluate the model with MSE
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)  

MSE_train = np.mean((pred_train-y_train)**2)
MSE_val = np.mean((pred_val-y_val)**2)

print("MSE_train:", MSE_train)
print("MSE_val:", MSE_val)

MSE_train: nan
MSE_val: nan


In [19]:
#Your code here; inspect the loss function through the history object
hist.history['loss'][:10]

[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

> Notice this extremely problematic behavior: all the values for training and validation loss are "nan". This indicates that the algorithm did not converge. The first solution to this is to normalize the input. From there, if convergence is not achieved, normalizing the output may also be required.
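
To see why unscaled inputs can produce a `nan` loss, here is a minimal sketch that fits a one-feature linear model with plain gradient descent. The toy data, the learning rate, and the helper `fit_linear_gd` are all assumptions for illustration; this is not the Keras training loop. With a feature on the scale of "Page total likes", each update overshoots by many orders of magnitude until the weights overflow, while the standardized version of the same feature converges.

```python
import numpy as np

rng = np.random.default_rng(123)

# Hypothetical toy data: one large-scale feature, similar in size to "Page total likes"
x_raw = rng.uniform(80_000, 140_000, size=200)
y = 0.001 * x_raw + rng.normal(0.0, 1.0, size=200)

def fit_linear_gd(x, y, lr=0.1, epochs=100):
    """Full-batch gradient descent on MSE for y ~ w * x + b; returns the final loss."""
    w, b = 0.0, 0.0
    # Silence overflow warnings so the divergence shows up as inf/nan in the result
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(epochs):
            err = w * x + b - y
            w -= lr * 2.0 * np.mean(err * x)
            b -= lr * 2.0 * np.mean(err)
        return float(np.mean((w * x + b - y) ** 2))

loss_raw = fit_linear_gd(x_raw, y)

# Same data, but with the feature standardized to mean 0 and standard deviation 1
x_std = (x_raw - x_raw.mean()) / x_raw.std()
loss_std = fit_linear_gd(x_std, y)

print("raw feature loss:", loss_raw)
print("standardized feature loss:", loss_std)
```

The raw run ends with a non-finite loss, while the standardized run settles near the noise variance of the toy target.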

## Normalize the Input Data

Normalize the input features by subtracting each feature's mean and dividing by its standard deviation, so that every feature has mean 0 and standard deviation 1. (This rescales the features; it does not change the shape of their distributions.) Then recreate the train-validation-test sets with the transformed input data.
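
As a quick illustration of the transformation, the sketch below standardizes the columns of a small made-up matrix (the values in `features` are hypothetical) and checks that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical feature matrix: rows are posts, columns are numeric features
features = np.array([
    [139441.0, 12.0, 3.0],
    [85093.0, 1.0, 7.0],
    [81370.0, 1.0, 5.0],
    [81370.0, 1.0, 4.0],
])

mean = features.mean(axis=0)
std = features.std(axis=0)  # population standard deviation (ddof=0), the np.std default

standardized = (features - mean) / std

print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
```

One common variation, not used in this lab, computes `mean` and `std` on the training rows only and reuses them for the validation and test rows, so that no information from held-out data leaks into preprocessing.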

In [20]:
## standardize/categorize

In [21]:
X.columns

Index(['Page total likes',       'Post Month',     'Post Weekday',
              'Post Hour',             'Paid',            'Photo',
                 'Status',            'Video',                  2,
                        3],
      dtype='object')

In [22]:
# X0 = X["Page total likes"]
# #X1 = data["Type"]
# #X2 = data["Category"]
# X1 = X["Post Month"]
# X2 = X["Post Weekday"]
# X3 = X["Post Hour"]
# X4 = X["Paid"]

In [23]:
## standardize/categorize
# X0_norm= (X0-np.mean(X0))/(np.std(X0))

# X1_norm = (X1-np.mean(X1))/(np.std(X1))
# X2_norm = (X2-np.mean(X2))/(np.std(X2))
# X3_norm = (X3-np.mean(X3))/(np.std(X3))
# X4_norm = (X4-np.mean(X4))/(np.std(X4))

# X = pd.concat([X0_norm, X1_norm, X2_norm, X3_norm, X4_norm, dummy_X1, dummy_X2], axis=1)
# X[:10]

In [24]:
# from sklearn.preprocessing import StandardScaler
# sc = StandardScaler()
# X_scaled = sc.fit_transform(X)
# X_scaled[:10]

In [25]:
# np.random.seed(123)
# X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.1, random_state=123)  
# X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=123) 

### Solution File

In [26]:
X0 = df["Page total likes"]
X1 = df["Type"]
X2 = df["Category"]
X3 = df["Post Month"]
X4 = df["Post Weekday"]
X5 = df["Post Hour"]
X6 = df["Paid"]

## Even for a baseline model some preprocessing may be required (all inputs must be numerical features)
dummy_X1= pd.get_dummies(X1, drop_first=True)
dummy_X2= pd.get_dummies(X2, drop_first=True)

X = pd.concat([X0, dummy_X1, dummy_X2, X3, X4, X5, X6], axis=1)

Y = df["like"]


data_clean = pd.concat([X, Y], axis=1)
np.random.seed(123)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=123)  
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=123)  

np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose=0)

#Evaluate the baseline model for the training and validation sets.

pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)  

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

print("MSE_train:", MSE_train)
print("MSE_val:", MSE_val)
# MSE_train: nan
# MSE_val: nan
hist.history['loss'][:10]

# Normalize the input features by subtracting each feature's mean and dividing by its
# standard deviation, so that every feature has mean 0 and standard deviation 1.
# Then recreate the train-validation-test sets with the transformed input data.

## standardize/categorize
X0= (X0-np.mean(X0))/(np.std(X0))

X3= (X3-np.mean(X3))/(np.std(X3))
X4= (X4-np.mean(X4))/(np.std(X4))
X5= (X5-np.mean(X5))/(np.std(X5))
X6= (X6-np.mean(X6))/(np.std(X6))

X = pd.concat([X0, dummy_X1, dummy_X2, X3, X4, X5, X6], axis=1)
display(X[:10])

#Code provided; defining training and validation sets
data_clean = pd.concat([X, Y], axis=1)
np.random.seed(123)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=123)  
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=123)  

np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose=0)


display(hist.history['loss'][:10])


MSE_train: nan
MSE_val: nan


Unnamed: 0,Page total likes,Photo,Status,Video,2,3,Post Month,Post Weekday,Post Hour,Paid
0,1.00496,1,0,0,1,0,1.506154,-0.065724,-1.105878,-0.62486
1,1.00496,0,1,0,1,0,1.506154,-0.558655,0.492065,-0.62486
2,1.00496,1,0,0,0,1,1.506154,-0.558655,-1.105878,-0.62486
3,1.00496,1,0,0,1,0,1.506154,-1.051585,0.492065,1.60036
4,1.00496,1,0,0,1,0,1.506154,-1.051585,-1.105878,-0.62486
5,1.00496,0,1,0,1,0,1.506154,-1.544516,0.263787,-0.62486
6,1.00496,1,0,0,0,1,1.506154,-1.544516,-1.105878,1.60036
7,1.00496,1,0,0,0,1,1.506154,1.413068,0.263787,1.60036
8,1.00496,0,1,0,1,0,1.506154,1.413068,-1.105878,-0.62486
9,1.00496,1,0,0,0,1,1.506154,0.920137,0.492065,-0.62486


[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

## Refit the Model and Reevaluate

Great! Now refit the model and once again assess its performance on the training and validation sets.

In [27]:
# Your code here; refit a model as shown above
# np.random.seed(123)
# model = Sequential()
# model.add(layers.Dense(8, input_dim=10, activation='relu'))
# model.add(layers.Dense(1, activation = 'linear'))

# model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
# hist = model.fit(X_train, y_train, batch_size=32, 
#                  epochs=100, validation_data = (X_val, y_val), verbose=0)


In [28]:
# Reexamine the loss function
# hist.history['loss'][:10]

> Note that you still haven't achieved convergence! From here, it's time to normalize the output data.

## Normalizing the output

Normalize Y as you did X by subtracting the mean and dividing by the standard deviation. Then, resplit the data into training and validation sets as we demonstrated above, and retrain a new model using your normalized X and Y data.
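
Once Y is standardized, the MSE is reported in units of squared standard deviations rather than squared likes. The sketch below, using made-up like counts and a hypothetical predictor that always outputs the mean, shows how to map predictions back to the original scale and how the two MSE values relate.

```python
import numpy as np

# Hypothetical like counts
likes = np.array([79.0, 130.0, 66.0, 1572.0, 325.0, 152.0])

y_mean, y_std = likes.mean(), likes.std()
y_norm = (likes - y_mean) / y_std

# Hypothetical predictions in normalized space: always predict the mean (0 after scaling)
pred_norm = np.zeros_like(y_norm)

# Undo the scaling to express predictions as like counts again
pred_likes = pred_norm * y_std + y_mean

mse_norm = np.mean((pred_norm - y_norm) ** 2)
mse_likes = np.mean((pred_likes - likes) ** 2)

print(mse_norm)                          # 1.0 up to floating-point error
print(mse_likes, mse_norm * y_std ** 2)  # equal up to floating-point error
```

This gives a useful yardstick: on data standardized this way, a normalized MSE close to 1 means the model is doing about as well as always predicting the mean.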

In [29]:
#Your code here: redefine Y after normalizing the data.
Y = (df["like"]-np.mean(df["like"]))/(np.std(df["like"]))

In [30]:
#Your code here; create training and validation sets as before. Use random seed 123.
data_clean = pd.concat([X, Y], axis=1)
np.random.seed(123)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=123)  
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=123)  
# train, validation = train_test_split(data_clean, test_size=0.2)

In [31]:
#Your code here; rebuild a simple model using a relu layer followed by a linear layer. (See our code snippet above!)
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose = 0)

Again, reevaluate the updated model.

In [32]:
#Your code here; MSE
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)  

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

print("MSE_train:", MSE_train)
print("MSE_val:", MSE_val)

MSE_train: 1.0424935946307536
MSE_val: 0.9346535852929211


In [33]:
#Your code here; loss function
hist.history['loss'][:10]

[1.254382147547904,
 1.2010118355242054,
 1.175830115093274,
 1.1630225714003102,
 1.1524048089311365,
 1.1436721401268177,
 1.1365511114342828,
 1.1307969508546123,
 1.1269671114977826,
 1.1240440822216902]

Great! Now that you have a converged model, you can also experiment with alternative optimizers and initialization strategies to see if you can reach a lower loss. (After all, the current model may have converged to a local minimum rather than the global one.)

## Using Weight Initializers

Below, take a look at the code provided to see how to modify the neural network to use alternative initialization and optimization strategies. At the end, you'll then be asked to select the model which you believe is the strongest.

## He Initialization

> He normal initialization draws samples from a truncated normal distribution centered on 0 with `stddev = sqrt(2 / fan_in)`, where `fan_in` is the number of input units in the weight tensor.
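
The description above can be sketched in NumPy as follows. The helper `truncated_normal` is hypothetical and simply redraws any value that falls more than two standard deviations from 0. Keras's own sampler may truncate or rescale differently, so treat this as an illustration of the idea rather than a reproduction of `he_normal`.

```python
import numpy as np

def truncated_normal(rng, shape, stddev):
    """Draw from N(0, stddev^2), redrawing any value beyond 2 standard deviations."""
    samples = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(samples) > 2.0 * stddev
    while out_of_range.any():
        samples[out_of_range] = rng.normal(0.0, stddev, size=int(out_of_range.sum()))
        out_of_range = np.abs(samples) > 2.0 * stddev
    return samples

fan_in, units = 10, 8               # the first Dense layer in this lab: 10 inputs, 8 units
he_stddev = np.sqrt(2.0 / fan_in)   # about 0.447

rng = np.random.default_rng(123)
weights = truncated_normal(rng, (fan_in, units), he_stddev)

print(weights.shape, he_stddev)
```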

In [34]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, kernel_initializer= "he_normal",
                activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val),verbose=0)

In [35]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

In [36]:
print(MSE_train)
print(MSE_val)

1.0447634995798012
0.9795106781012153


## LeCun Initialization

> LeCun normal initialization draws samples from a truncated normal distribution centered on 0 with `stddev = sqrt(1 / fan_in)`, where `fan_in` is the number of input units in the weight tensor.
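
One way to build intuition for the `1 / fan_in` scaling: for independent, zero-mean inputs with unit variance, the variance of a unit's pre-activation is `fan_in * Var(w)`. The sketch below checks this numerically with plain (untruncated) normal weights and a deliberately wide hypothetical layer of 1000 units so the estimate is stable. These choices are assumptions for illustration, not the lab's 8-unit layer.

```python
import numpy as np

rng = np.random.default_rng(123)
fan_in, units, n_samples = 10, 1000, 2000  # wide hypothetical layer for a stable estimate

# Standardized inputs: zero mean, unit variance
x = rng.normal(0.0, 1.0, size=(n_samples, fan_in))

# Plain normal weights (no truncation) with the two scalings discussed in this lab
lecun_w = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, units))
he_w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, units))

var_lecun = (x @ lecun_w).var()  # expected near fan_in * (1 / fan_in) = 1
var_he = (x @ he_w).var()        # expected near fan_in * (2 / fan_in) = 2

print(var_lecun, var_he)
```

The extra factor of 2 in He initialization is usually motivated by ReLU zeroing out roughly half of the pre-activations, which is why the lab pairs `he_normal` with `relu` and `lecun_normal` with `tanh`.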

In [37]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, 
                kernel_initializer= "lecun_normal", activation='tanh'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose=0)

In [38]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

In [39]:
print(MSE_train)
print(MSE_val)

1.0029716093424133
1.005522075030141


The two initializers give similar errors here, but initialization is still worth keeping in mind when tuning your network. Next, let's investigate the impact of various optimization algorithms.

## RMSprop

In [40]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "rmsprop" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose = 0)
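
For intuition about what `rmsprop` does differently from plain SGD, here is a minimal sketch of the update rule on a one-dimensional quadratic. The function `rmsprop_step`, the loss `f(w) = w**2`, and the hyperparameter values are assumptions for illustration and are not taken from the Keras defaults or source.

```python
import numpy as np

def rmsprop_step(param, cache, grad, lr=0.01, rho=0.9, eps=1e-7):
    """One RMSprop update: scale the step by a running average of squared gradients."""
    cache = rho * cache + (1.0 - rho) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Minimize the hypothetical loss f(w) = w**2, whose gradient is 2 * w
param, cache = 5.0, 0.0
for _ in range(2000):
    grad = 2.0 * param
    param, cache = rmsprop_step(param, cache, grad)

print(param)
```

Because each step is divided by the recent gradient magnitude, the step size stays roughly proportional to `lr` regardless of how large the raw gradient is, which makes the method less sensitive to the scale of the loss.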

In [41]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

In [42]:
print(MSE_train)
print(MSE_val)

1.0255967123225014
0.9312651802299948


## Adam

In [43]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "Adam" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose = 0)

In [44]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

In [45]:
print(MSE_train)
print(MSE_val)

1.0306934074828158
0.9324454899423249


## Learning Rate Decay with Momentum


In [46]:
np.random.seed(123)
sgd = optimizers.SGD(lr=0.03, decay=0.0001, momentum=0.9)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= sgd ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose = 0)
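
The `decay` and `momentum` arguments can be sketched as follows. The schedule `lr / (1 + decay * iteration)` is an assumption about how the legacy Keras `decay` argument behaves (time-based decay applied per batch update), and the helper names and the toy loss `f(w) = w**2` are hypothetical.

```python
def decayed_lr(initial_lr, decay, iteration):
    """Time-based decay: the learning rate shrinks as the update count grows."""
    return initial_lr / (1.0 + decay * iteration)

def momentum_step(param, velocity, grad, lr, momentum):
    """Classical momentum: the velocity accumulates a decaying sum of past gradients."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

# Minimize the hypothetical loss f(w) = w**2 with the settings used in the cell above
param, velocity = 5.0, 0.0
for iteration in range(200):
    lr = decayed_lr(0.03, 0.0001, iteration)
    grad = 2.0 * param
    param, velocity = momentum_step(param, velocity, grad, lr, 0.9)

print(param)
print(decayed_lr(0.03, 0.0001, 10000))  # half the initial rate after 10,000 updates
```

With `decay=0.0001`, the learning rate changes slowly: it takes 10,000 batch updates to halve, so over a short training run most of the difference from plain SGD comes from the momentum term and the larger initial learning rate.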

In [47]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

In [48]:
print(MSE_train)
print(MSE_val)

0.9139455623316945
1.067607749494609


## Selecting a Final Model

Now, select the model with the best performance based on the training and validation sets. Evaluate this top model using the test set!
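
One simple way to make the selection explicit is to collect the validation MSE values printed in the cells above and pick the smallest. The dictionary keys below are labels invented for this sketch; the numbers are copied from the recorded outputs.

```python
# Validation MSE values recorded earlier in this lab (normalized Y)
val_mse = {
    "sgd (default init)": 0.9346535852929211,
    "sgd + he_normal": 0.9795106781012153,
    "sgd + lecun_normal/tanh": 1.005522075030141,
    "rmsprop": 0.9312651802299948,
    "adam": 0.9324454899423249,
    "sgd + momentum + decay": 1.067607749494609,
}

# Rank configurations from lowest to highest validation error
for name, mse in sorted(val_mse.items(), key=lambda item: item[1]):
    print(f"{name:26s} {mse:.4f}")

best_name = min(val_mse, key=val_mse.get)
print("lowest validation MSE:", best_name)
```

The top three configurations differ only in the third decimal place, so the ranking could plausibly change with a different random seed or split. Selecting on validation error, rather than on test error, keeps the test set as an unbiased final check.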

In [49]:
#Your code here
# kernel_initializer= "he_normal", activation='relu'
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, kernel_initializer= "he_normal",
                activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val),verbose=0)

pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

pred_test = model.predict(X_test).reshape(-1)
MSE_test = np.mean((pred_test-Y_test)**2)

print(MSE_train)
print(MSE_val)
print(MSE_test)

1.0448634390884701
0.979294761322587
0.19403534833244024


In [50]:
#Your code here
# kernel_initializer= "lecun_normal", activation='tanh'
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=10, 
                kernel_initializer= "lecun_normal", activation='tanh'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose=0)

pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

pred_test = model.predict(X_test).reshape(-1)
MSE_test = np.mean((pred_test-Y_test)**2)

print(MSE_train)
print(MSE_val)
print(MSE_test)

1.0029712065504928
1.005524916550631
0.1834271810010904


## Summary  

In this lab, you worked to ensure your model converged properly by normalizing both the inputs and the output. You also investigated the impact of varying initialization and optimization routines.