# TensorFlow - Unit 07 - Regression

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Fit a deep learning neural network for a regression task
* Save and load TensorFlow models (.h5 extension)



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Unit 07 - Regression

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Workflow

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png">
 We will follow the familiar supervised learning process, now with a few tweaks:

* Split the dataset into train, validation and test set
* Create a pipeline to handle data cleaning, feature engineering and feature scaling (as we covered, it is highly recommended the data be scaled, so we wrap up in one pipeline)
* Create the neural network
* Fit the pipeline to the train set and apply the learned transformations to the other sets
* Fit the model to the train set, using the validation set to monitor the training
* Evaluate the model
* Predict on new data

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load and split the data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's first load the data. We are using the Boston dataset from sklearn.
* It has house price records and house characteristics, like the average number of rooms per dwelling and the per capita crime rate in Boston. The target variable is the house price.

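# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run
from sklearn.datasets import load_boston
data = load_boston()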
df = pd.DataFrame(data.data,columns=data.feature_names)
df['price'] = pd.Series(data.target)

print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As part of our workflow, we split the data, but now we will split it into train, validation, test sets.
* First, we split into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['price'],axis=1),
                                    df['price'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Then, we split a validation set off from the train set. We set the validation set as 20% of the train set.
* Have a look at the print statement, which shows the amount of data we have in each set (train, validation and test)

X_train, X_val,y_train, y_val = train_test_split(
                                    X_train,
                                    y_train,
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape)
print("* Validation set:",  X_val.shape, y_val.shape)
print("* Test set:",   X_test.shape, y_test.shape)
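To make the resulting proportions concrete, the sketch below reproduces the arithmetic of the two splits. It assumes the full dataset has 506 rows and that the held-out share is rounded up when `test_size` is a fraction; the helper name `split_sizes` is made up for illustration.

```python
import math

def split_sizes(n_rows, held_out_fraction):
    """Return (kept, held_out) row counts, rounding the held-out part up."""
    held_out = math.ceil(n_rows * held_out_fraction)
    return n_rows - held_out, held_out

n_rows = 506  # assumed number of rows in the full dataset

# First split: 20% of all rows go to the test set
n_train_full, n_test = split_sizes(n_rows, 0.2)

# Second split: 20% of the remaining rows go to the validation set
n_train, n_val = split_sizes(n_train_full, 0.2)

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_rows:.0%} of the data)")
```

Under these assumptions the final split is roughly 64% train, 16% validation and 20% test.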

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pipeline for data processing

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We first create a pipeline for preprocessing the data. In theory, it could handle data cleaning, feature engineering and feature scaling.
* In this case, it only does feature scaling.
* We could have also added a step for removing correlated features, but let's keep it simple.

from sklearn.pipeline import Pipeline
### Feature Scaling
from sklearn.preprocessing import StandardScaler

def pipeline_pre_processing():
  pipeline_base = Pipeline([
      
      ( "feat_scaling",StandardScaler() )

    ])

  return pipeline_base


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we fit the pipeline to the train set and apply the learned transformations to the validation and test sets
* This way the pipeline learns the transformations (in this case, only feature scaling) from the train set alone, and applies the same transformations to the other sets.

pipeline = pipeline_pre_processing()
X_train = pipeline.fit_transform(X_train)
X_val= pipeline.transform(X_val)
X_test = pipeline.transform(X_test)
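The fit/transform distinction above can be sketched with plain NumPy. This is only an illustration of the idea behind standard scaling, not the scikit-learn implementation; the data is randomly generated and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 2 features on very different scales
train = rng.normal(loc=[5.0, 300.0], scale=[1.0, 50.0], size=(100, 2))
val = rng.normal(loc=[5.0, 300.0], scale=[1.0, 50.0], size=(20, 2))

# "fit": learn the statistics from the train set only
mean = train.mean(axis=0)
std = train.std(axis=0)

# "transform": apply the same train statistics to every set
train_scaled = (train - mean) / std
val_scaled = (val - mean) / std

print("train mean/std:", train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print("val mean/std:  ", val_scaled.mean(axis=0).round(3), val_scaled.std(axis=0).round(3))
```

The scaled train set has mean 0 and standard deviation 1 by construction, while the validation set is only approximately so, because its statistics were never used.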

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Create Deep Learning Network

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We will create a tensorflow model.
* We create a function that creates a sequential model, compiles it and returns it. The function takes the number of features in the data, which is used as the number of neurons in the first layer
* Let's define the network architecture (a deep learning neural network since it has 2 or more hidden layers - jargon alert! )
  * We noted the data has 13 features. First, we will create a simple network just for a learning experience. 
  * The network is built using Dense layers - fully connected layers
  * The input layer has the same number of neurons as the number of columns in the data. The activation function is relu. We pass the input_shape as a tuple: the first value is the number of columns in the data, and you don't need to pass a second value since each row is one-dimensional (an image, for example, wouldn't be one-dimensional)
  * We are using 2 hidden layers, the first with 8 neurons and the next with 4 neurons. Both will use relu as an activation function.
  * After each hidden layer, we have a dropout layer with a 25% rate; so we can reduce the chance of overfitting
  * The output layer should contain only 1 neuron since the ML task is Regression. 
  * We compile the model with mse as loss/cost function and optimizer as adam
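As a rough illustration of what a dropout layer with a 25% rate does, here is a NumPy sketch of the usual "inverted dropout" scheme: during training each unit is zeroed with probability `rate` and the surviving units are scaled up by `1 / (1 - rate)`; at inference nothing is dropped. The activations are hypothetical and the code is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.25                      # same rate as in the text
activations = np.ones(10_000)    # hypothetical layer outputs

# Training: drop each unit with probability `rate`, scale the rest up
keep_mask = rng.random(activations.shape) >= rate
train_output = activations * keep_mask / (1 - rate)

# Inference: dropout leaves the activations untouched
inference_output = activations

print("share of units dropped:", 1 - keep_mask.mean())
print("mean output while training:", train_output.mean())
```

The scaling keeps the expected output of the layer about the same in training and inference, even though a quarter of the units are silenced in each training step.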

import os
# Set the log level before importing TensorFlow, otherwise it has no effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def create_tf_model(n_features):

  model = Sequential()
  model.add(Dense(units=n_features, activation='relu', input_shape=(n_features,)))

  model.add(Dense(units=8,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=4,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=1))
  model.compile(loss='mse', optimizer='adam')
  
  return model


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's visualize the network structure
* Note the number of parameters the network has. That looks like a reasonable amount compared to the number of rows in the train set
* An unreasonable amount would be something like 100 thousand parameters for a dataset with 1k rows. Your dataset may be tiny yet complex enough to need more parameters, but the rule of thumb is to start simple and add complexity only if the performance is not good.
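The parameter count reported by the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers add no parameters. The sketch below assumes 13 input features and the layer sizes described earlier; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 13                      # assumed number of input columns
layer_units = [n_features, 8, 4, 1]  # Dense layers, in order

total = 0
n_inputs = n_features
for n_units in layer_units:
    params = dense_params(n_inputs, n_units)
    print(f"Dense({n_units}): {params} parameters")
    total += params
    n_inputs = n_units  # the next layer receives this layer's outputs

print("Total:", total)
```

Under these assumptions the layers contribute 182, 112, 36 and 5 parameters, for a total of 335.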

model = create_tf_model(n_features=X_train.shape[1])
model.summary()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We can also use `plot_model()` from `tensorflow.keras.utils` for a more graphical approach
* Note the input and output shape each layer has. That is how your data is "travelling" from the input to the prediction.

from tensorflow.keras.utils import plot_model
plot_model(model, show_shapes=True)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit the model

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As we mentioned in a previous notebook, early stopping stops training when a monitored metric has stopped improving; this is useful to avoid overfitting the model to the data. The function documentation is [here](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping)
* We will monitor the validation loss 
  * Remember that we now pass both the train and validation data. After each epoch finishes, the network calculates the error on both. The training process stops if the validation error doesn't improve for a given number of consecutive epochs.
  * We set patience to 15, which is the number of consecutive epochs with no improvement after which training is stopped. There is no fixed rule for setting patience; if you feel the model was still learning when training stopped, you may increase the value and train again.
  * We set the mode to min. According to TensorFlow documentation, in min mode, training will stop when the quantity monitored has stopped decreasing.

from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=15)
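The patience logic described above can be sketched in a few lines of plain Python. This is a simplified illustration in "min" mode, not the Keras implementation; the function name `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never stops."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch
val_losses = [10.0, 8.0, 7.5, 7.6, 7.7, 7.4, 7.5, 7.6, 7.8]

print(early_stopping_epoch(val_losses, patience=2))
print(early_stopping_epoch(val_losses, patience=3))
print(early_stopping_epoch(val_losses, patience=15))
```

With a small patience the run stops at the first short plateau and misses the later improvement at epoch 6, which is why a larger value is often worth trying when the model still seems to be learning.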

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Finally, we fit the model
* We create the model object and use .fit(), as usual
  * We pass the train set
  * The number of epochs is set to 100. In theory, you may set a high value since we add early stopping, which stops the training process when the validation loss stops improving.
  * We pass the validation data in a tuple.
  * Verbose is set to 1 so we can see the current epoch and the training and validation loss.
  * Finally, we pass the early_stop object we created earlier as a callback. We pass it in a list since you may pass more than 1 type of callback. In this course, we will cover only early stopping

* For each epoch, note the training and validation loss. Are they increasing? Decreasing? Staying flat?
  * Ideally, they should decrease as the epochs progress, a practical sign that the network is learning


model = create_tf_model(n_features=X_train.shape[1])

model.fit(x=X_train, 
          y=y_train, 
          epochs=100,
          validation_data=(X_val, y_val),
          verbose=1,
          callbacks=[early_stop]
          )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Model evaluation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Now we will evaluate the model performance by analyzing the train and validation losses that happened during the training process. 
* In deep learning we use the model history to assess if the model learned, using the train and validation sets. We also evaluate separately how the model generalizes to unseen data (the test set)
* The model training history information is stored in a `.history.history` attribute. Note it shows a loss (training loss) and val_loss (validation_loss)

losses = pd.DataFrame(model.history.history)
losses

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are plotting each loss in a line plot, where the y-axis has the loss value, the x-axis is the epoch number and the lines are colored by train or validation
* We use `.plot(style='.-')` for this task
* Note the loss plots for training and validation data follow a similar path and are close to each other. It looks like the network learned the patterns.

losses = pd.DataFrame(model.history.history)

sns.set_style("whitegrid")
losses[['loss','val_loss']].plot(style='.-')
plt.title("Loss")
plt.show()
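Besides looking at the plot, the same history can be inspected numerically. The sketch below uses hypothetical loss values shaped like `model.history.history` and reports the epoch with the lowest validation loss and the gap between the two curves; a gap that keeps growing is one common sign of overfitting.

```python
import numpy as np

# Hypothetical history, shaped like `model.history.history`
history = {
    "loss":     [90.0, 40.0, 25.0, 18.0, 14.0, 11.0],
    "val_loss": [85.0, 42.0, 27.0, 21.0, 22.0, 24.0],
}

loss = np.array(history["loss"])
val_loss = np.array(history["val_loss"])

best_epoch = int(np.argmin(val_loss)) + 1   # epochs counted from 1
gap = val_loss - loss                       # positive gap: validation is worse

print("Best epoch by validation loss:", best_epoch)
print("Train/validation gap per epoch:", gap)
```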

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we will evaluate the model performance on the test set, using `.evaluate()` and passing the test set. Note the value is not much different from the losses in the train and validation set.
* Note the model learned the relationship between the features and the target, considering all features. Conventional ML often uses a feature selection step to remove features that wouldn't contribute to the model learning, thus reducing the chance of overfitting the model.
* But in Deep Learning, the neural network largely handles this by itself. The connections related to less important features tend to become weaker during training, so these features typically don't harm the overall performance.

model.evaluate(X_test,y_test)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> When evaluating a deep learning model you typically study the loss plot and evaluate on the test set. However, **as an optional additional step**, you can run an evaluation similar to the one we did in conventional ML.
* In regression, you would analyze the performance metrics and the actual vs. predictions plot, using the custom functions we have seen over the course.
* One difference is that we adapted the functions to evaluate the validation set as well, but that is a minor change in the code; the overall logic is the same

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 
import numpy as np

def regression_performance(X_train, y_train,
                           X_val, y_val,
                           X_test, y_test,pipeline):

  print("Model Evaluation \n")
  print("* Train Set")
  regression_evaluation(X_train,y_train,pipeline)
  print("* Validation Set")
  regression_evaluation(X_val, y_val,pipeline)
  print("* Test Set")
  regression_evaluation(X_test,y_test,pipeline)



def regression_evaluation(X,y,pipeline):
  """
  # Gets features and target (either from train or test set) and pipeline
  - it predicts using the pipeline and the features
  - calculates performance metrics comparing the prediction to the target
  """
  prediction = pipeline.predict(X)
  print('R2 Score:', r2_score(y, prediction).round(3))  
  print('Mean Absolute Error:', mean_absolute_error(y, prediction).round(3))  
  print('Mean Squared Error:', mean_squared_error(y, prediction).round(3))  
  print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y, prediction)).round(3))
  print("\n")

  

def regression_evaluation_plots(X_train, y_train,
                                X_val, y_val,
                                X_test, y_test,
                                pipeline, alpha_scatter=0.5):

  pred_train = pipeline.predict(X_train).reshape(-1) 
  # we reshape the prediction arrays to be in the format (n_rows,), so we can plot it after
  pred_val = pipeline.predict(X_val).reshape(-1)
  pred_test = pipeline.predict(X_test).reshape(-1)

  fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15,6))
  sns.scatterplot(x=y_train , y=pred_train, alpha=alpha_scatter, ax=axes[0])
  sns.lineplot(x=y_train , y=y_train, color='red', ax=axes[0])
  axes[0].set_xlabel("Actual")
  axes[0].set_ylabel("Predictions")
  axes[0].set_title("Train Set")

  sns.scatterplot(x=y_val , y=pred_val, alpha=alpha_scatter, ax=axes[1])
  sns.lineplot(x=y_val , y=y_val, color='red', ax=axes[1])
  axes[1].set_xlabel("Actual")
  axes[1].set_ylabel("Predictions")
  axes[1].set_title("Validation Set")

  sns.scatterplot(x=y_test , y=pred_test, alpha=alpha_scatter, ax=axes[2])
  sns.lineplot(x=y_test , y=y_test, color='red', ax=axes[2])
  axes[2].set_xlabel("Actual")
  axes[2].set_ylabel("Predictions")
  axes[2].set_title("Test Set")

  plt.show()
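For reference, the four metrics printed by `regression_evaluation` can be computed by hand. The sketch below uses a tiny set of hypothetical actual and predicted values so that each formula is easy to follow; it is meant as a worked example, not a replacement for the scikit-learn functions.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical actual values
y_pred = np.array([2.0, 5.0, 8.0, 9.0])   # hypothetical predictions

errors = y_true - y_pred
mae = np.mean(np.abs(errors))             # Mean Absolute Error
mse = np.mean(errors ** 2)                # Mean Squared Error
rmse = np.sqrt(mse)                       # Root Mean Squared Error
# R2: 1 minus (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("R2 Score:", round(r2, 3))
print("Mean Absolute Error:", round(mae, 3))
print("Mean Squared Error:", round(mse, 3))
print("Root Mean Squared Error:", round(rmse, 3))
```

Since the model was compiled with mse as the loss, the value returned by `.evaluate()` corresponds to the Mean Squared Error; its square root is expressed in the same units as the target.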

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's pass the values as usual.
* Note that here we don't pass a pipeline; we pass the TensorFlow model
* Note the predictions tend to follow the red diagonal line. However, it seems the test set metrics are quite different from the train/validation set. You could add more complexity to the model, or increase the number of epochs etc until you reach a metric that can satisfy you. For learning purposes, we will be happy with this performance. 
* In general, we would expect the performance to be better in the train set, then validation, then test set. However, there may be cases where this doesn't happen.

regression_performance(X_train, y_train,X_val, y_val, X_test, y_test,model)
regression_evaluation_plots(X_train, y_train, X_val, y_val,X_test, y_test, 
                            model, alpha_scatter=0.5)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Prediction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's take a sample from the test set and use it as if it were live data. We will consider 2 houses rather than just 1

live_data = X_test[:2,:]
live_data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use `.predict()` and pass the data. In this case, we are predicting the price of 2 houses
* Since the X_test data is scaled and is an array, it is difficult to make sense of the content, but we are assuming here the data passed through the pre_processing pipeline already.



model.predict(live_data)
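With a single output neuron, the prediction comes back as a 2D array with one row per house and one column. The sketch below uses a hypothetical stand-in array (not real model output) to show the shape and how `.reshape(-1)`, which the plotting function above relies on, flattens it.

```python
import numpy as np

# Hypothetical stand-in for the output of `model.predict()` on 2 rows
predictions = np.array([[24.1], [31.7]])
print(predictions.shape)      # (2, 1): one row per house, one output neuron

flat = predictions.reshape(-1)
print(flat.shape)             # (2,)
```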

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Save and Load the model

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In case you want to save your model, you may use `.save()` and pass the directory and the model's file name. The extension is `.h5`
* Remember that in this notebook we used a pipeline to pre-process the data, so in a project using tabular data you would also want to save this pipeline as a pkl file (similar to what we saw in the scikit-learn lesson)

model.save('my_model.h5')
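Saving the pre-processing step alongside the model could look like the sketch below. To keep it self-contained it pickles a plain dictionary as a stand-in for the fitted pipeline, and the file name is hypothetical; the course may use a different tool (for example joblib) for the same purpose.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted pre-processing pipeline (hypothetical values);
# a fitted scikit-learn Pipeline object can be pickled in the same way
fitted_preprocessing = {"mean": [3.6, 11.4], "std": [8.6, 23.3]}

path = os.path.join(tempfile.mkdtemp(), "pre_processing.pkl")  # hypothetical file name

# Save to disk
with open(path, "wb") as f:
    pickle.dump(fitted_preprocessing, f)

# Load it back, as you would before predicting on live data
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == fitted_preprocessing)
```

Only load pickle files you created yourself or otherwise trust, since unpickling can execute arbitrary code.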

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You can load the model using `load_model()` from the Keras module
* Let's load the model as model_2

from tensorflow.keras.models import load_model
model_2 = load_model('my_model.h5')

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> NOTE: the training history is not stored in the saved file, so a loaded model has no history information. The recommendation is to fit the model, then generate and save the training history plots straight away

# The loaded model has no training history, so this line is expected to raise an AttributeError
model_2.history.history
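One possible way to keep the numbers as well as the plots is to write the history dictionary to a JSON file right after training. The sketch below uses hypothetical loss values and a hypothetical file name, and assumes the history values are plain Python floats (convert them first if they are not).

```python
import json
import os
import tempfile

# Hypothetical values, shaped like `model.history.history` right after `.fit()`
history = {"loss": [90.0, 40.0, 25.0], "val_loss": [85.0, 42.0, 27.0]}

path = os.path.join(tempfile.mkdtemp(), "training_history.json")  # hypothetical file name

# Save the history next to the model file
with open(path, "w") as f:
    json.dump(history, f)

# Later, load it back to rebuild the loss plot
with open(path) as f:
    restored_history = json.load(f)

print(restored_history["val_loss"])
```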

---