# TensorFlow - Unit 08 - Binary Classification

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Fit a deep learning neural network for a binary classification task



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Unit 08 - Binary Classification

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Workflow

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png"> We will follow the familiar supervised learning process, now with a few tweaks:

* Split the dataset into train, validation and test set
* Create a pipeline to handle data cleaning, feature engineering and feature scaling
* Create the neural network
* Fit the pipeline to the train set and apply its transformations to the validation and test sets
* Fit the model to the train set, using the validation set to monitor training
* Evaluate the model
* Prediction

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load and split the data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's first load the data. We are using the breast cancer dataset from sklearn.
* Each record describes a breast mass sample, together with a diagnosis stating whether the mass is malignant or benign. The target variable is the diagnosis; in sklearn's encoding of this dataset, 0 is malignant and 1 is benign.

from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data,columns=data.feature_names)
df['diagnosis'] = pd.Series(data.target)
print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As part of our workflow, we split the data, but now we will split it into train, validation and test sets.
* First, we split into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['diagnosis'],axis=1),
                                    df['diagnosis'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

X_train

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Then, we split off a validation set from the train set. We set the validation set as 20% of the train set
* Have a look at the print statements, which show the number of rows and columns we have in each set (train, validation and test)

X_train, X_val,y_train, y_val = train_test_split(
                                    X_train,
                                    y_train,
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape)
print("* Validation set:",  X_val.shape, y_val.shape)
print("* Test set:",   X_test.shape, y_test.shape)

X_train
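As a quick sanity check on the two-stage split above, the sketch below works out the share of the original data that ends up in each set. It is plain arithmetic on the two `test_size=0.2` values used above; the variable names are illustrative only.

```python
# fractions used in the two calls to train_test_split above
test_size = 0.2   # first split: test set taken from the full dataset
val_size = 0.2    # second split: validation set taken from the remaining train set

test_share = test_size
val_share = (1 - test_size) * val_size
train_share = (1 - test_size) * (1 - val_size)

print(f"train: {train_share:.0%}, validation: {val_share:.0%}, test: {test_share:.0%}")
```

The row counts printed by the cell above may deviate slightly from these percentages, since each split has to produce a whole number of rows.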

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pipeline for data processing

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We first create a pipeline for preprocessing the data. 
* In this case, it is only feature scaling.
* We could have also added a step for removing correlated features, but let's keep it simple.

from sklearn.pipeline import Pipeline
# Feature scaling
from sklearn.preprocessing import StandardScaler

# in this case, we don't need data cleaning or feat eng
def pipeline_pre_processing():
  pipeline_base = Pipeline([
      
      ( "feat_scaling",StandardScaler() )

    ])

  return pipeline_base


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Next, we fit the pipeline to the train set and apply its transformations to the validation and test sets
* The pipeline learns the transformations (in this case, only feature scaling) from the train set and applies them to the other sets, so no information from the validation or test set leaks into the scaling.
* Let's visualize the first rows of the scaled data. Note it is a 2D NumPy array

pipeline = pipeline_pre_processing()
X_train = pipeline.fit_transform(X_train)
X_val= pipeline.transform(X_val)
X_test = pipeline.transform(X_test)

X_train[:2,]
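To illustrate what "fit on the train set, transform the other sets" means for feature scaling, here is a minimal NumPy sketch. It uses made-up random data rather than the breast cancer dataset and reproduces by hand the standardisation that `StandardScaler` performs (subtract the mean, divide by the population standard deviation); all variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data standing in for the train and validation sets (2 features)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
val = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

# "fit": learn the statistics from the train set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same train statistics to every set
train_scaled = (train - mean) / std
val_scaled = (val - mean) / std

print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print(val_scaled.mean(axis=0).round(3), val_scaled.std(axis=0).round(3))
```

The scaled train set has mean 0 and standard deviation 1 per feature by construction, while the validation set is only approximately centred because it is scaled with the train statistics.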

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Create Deep Learning Network

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We will create a tensorflow model
* We create a function that creates a sequential model, compiles the model and returns the model. The function requires the number of features the data has to be used as the number of neurons for the first layer
* Let's define the network architecture
  * We noted the data has 30 features. We will create a simple network just for a learning experience. 
  * The network is built using Dense layers - fully connected layers
  * The input layer has the same number of neurons as the number of columns in the data. The activation function is relu. Finally, we pass the input_shape as a tuple.
  * We are using 3 hidden layers, the first with 20 neurons, the next with 10 neurons and the last with 5 neurons. All of them use relu as the activation function.
  * After the input layer and each hidden layer, we have a dropout layer with a 25% rate to reduce the chance of overfitting. In the previous notebook, we didn't add a dropout layer to the input layer. In this notebook, we are adding one to demonstrate it is possible. We covered the dropout layer in a previous notebook in case you want to refresh the concept.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The output layer should reflect binary classification.
  * You may recall there are 2 ways to define an output layer for binary classification:
    * Either with 1 neuron with sigmoid as activation function 
    * Or 2 neurons with softmax as activation function
  * We will code both, so you can choose which one you would like to use
* We compile the model depending on the output layer choice
  * If it is 1 neuron with sigmoid as activation function: optimizer='adam', loss='binary_crossentropy'
  * If it is 2 neurons with softmax as activation function: optimizer='adam', loss='categorical_crossentropy'
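The two output-layer options are closely related. As an illustration (not part of the notebook's model code), the NumPy sketch below shows that a softmax over two logits, with the class-0 logit fixed at 0, gives the same class-1 probability as a sigmoid applied to a single logit. The helper functions and example values are assumptions made for the demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])                        # raw outputs of a single neuron
two_logits = np.column_stack([np.zeros_like(z), z])   # logits for class 0 and class 1

p_sigmoid = sigmoid(z)             # one value per sample: probability of class 1
p_softmax = softmax(two_logits)    # two values per sample: probabilities of class 0 and 1

print(p_sigmoid.round(3))
print(p_softmax.round(3))
```

The second column of `p_softmax` matches `p_sigmoid`, and each softmax row sums to 1. The practical difference is the shape of the output and of the target the loss function expects.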



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  In classification tasks, we can use an additional metric when compiling: 'accuracy'. We will still monitor the loss (like we did in Regression), but now we can monitor the accuracy while training. 
* Note: in regression, we don't add this argument since accuracy doesn't suit the context of regression
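To make the difference between the loss and the accuracy metric concrete, the sketch below computes both by hand for a handful of made-up sigmoid outputs. It is a simplified illustration of binary cross-entropy and accuracy, not the exact Keras implementation; the values and the clipping constant `eps` are assumptions.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1])   # made-up sigmoid outputs

# binary cross-entropy: penalises confident wrong predictions heavily
eps = 1e-7
p = np.clip(y_prob, eps, 1 - eps)
bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# accuracy: share of correct predictions after thresholding at 0.5
y_pred = (y_prob >= 0.5).astype(int)
accuracy = np.mean(y_pred == y_true)

print("loss:", round(float(bce), 4))
print("accuracy:", accuracy)
```

The loss changes whenever a predicted probability moves, whereas the accuracy only changes when a prediction crosses the threshold, which is why the two curves can behave differently during training.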

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Below we find the model where the output layer has sigmoid as an activation function

import os
# set the log level before importing TensorFlow so the setting takes effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout

def create_tf_model_sigmoid(n_features):

  model = Sequential()
  model.add(Dense(units=n_features,activation='relu', input_shape=(n_features,)))
  model.add(Dropout(0.25))

  model.add(Dense(units=20,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=10,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=5,activation='relu'))
  model.add(Dropout(0.25))

  # note we use 1 neuron and sigmoid
  model.add(Dense(units=1,activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  
  return model


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Below we find the model where the output layer has softmax as an activation function

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> In this exercise, we will move on with the model that has the activation function as sigmoid in the output layer, since in the next unit notebook, we will handle a network that has softmax as an activation function in the output layer 
* Even if you try to fit the model with softmax as an activation function in the output layer, it will not work since it needs an additional step. We need to one-hot-encode the target variable. We will do that in the next unit notebook, which covers multi-class classification.


def create_tf_model_softmax(n_features):

  model = Sequential()
  model.add(Dense(units=n_features,activation='relu', input_shape=(n_features,)))
  model.add(Dropout(0.25))

  model.add(Dense(units=20,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=10,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=5,activation='relu'))
  model.add(Dropout(0.25))

  # note we use 2 neurons and softmax
  model.add(Dense(2, activation='softmax'))
  model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
  
  return model
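The warning above mentions that the softmax version needs a one-hot-encoded target. As a preview, here is a minimal NumPy sketch of what that encoding looks like for binary labels. The label values are made up; Keras also provides `to_categorical` in `tensorflow.keras.utils` for the same purpose.

```python
import numpy as np

y = np.array([0, 1, 1, 0])   # made-up binary labels
y_one_hot = np.eye(2)[y]     # row i is the one-hot vector for label y[i]

print(y_one_hot)
```

Each label becomes a row with a 1 in the column of its class, which matches the 2-neuron softmax output.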


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's visualize the network structure
* Note the number of parameters the network has. Let's first use `create_tf_model_sigmoid()`

model = create_tf_model_sigmoid(n_features=X_train.shape[1])
model.summary()
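The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers add no parameters. The sketch below assumes 30 input features, as noted earlier; the helper `dense_params` is hypothetical and only used for this illustration.

```python
def dense_params(n_inputs, n_units):
    # one weight per input-unit pair, plus one bias per unit
    return n_inputs * n_units + n_units

n_features = 30
layer_units = [n_features, 20, 10, 5, 1]   # Dense layers of create_tf_model_sigmoid

total = 0
n_inputs = n_features
for n_units in layer_units:
    params = dense_params(n_inputs, n_units)
    print(f"Dense({n_units}): {params} parameters")
    total += params
    n_inputs = n_units

print("Total:", total)
```

Under these assumptions, the total is 1821 trainable parameters for the sigmoid version.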

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Once again, we can use `plot_model()` from `tensorflow.keras.utils` for a more graphical approach

from tensorflow.keras.utils import plot_model
plot_model(model, show_shapes=True)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's check the difference between the structure of `create_tf_model_sigmoid()` and `create_tf_model_softmax()`
* Below we plot the model for `create_tf_model_softmax()`. As you may expect, the only difference is in the output layer

model = create_tf_model_softmax(n_features=X_train.shape[1])
plot_model(model, show_shapes=True)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit the model

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Early stopping allows you to stop training when a monitored metric has stopped improving; this is useful to avoid overfitting the model to the data.
* We will monitor the validation accuracy now 
  * We set patience as 10, which is the number of epochs with no improvement, after which the training will be stopped. Although there is no fixed rule to set patience, if you feel that your model was still learning when you stopped, you may increase the patience value and train again.
  * We set the mode to max since we want training to stop when the validation accuracy stops improving, and for accuracy, improving means increasing.
  * When you monitor loss instead, you expect it to decrease over the training process, so you would set the mode to min; accuracy is expected to increase over training, so we use max.

from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_accuracy', mode='max', verbose=1, patience=10)
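To make the role of `patience` and `mode` concrete, here is a simplified pure-Python sketch of the early stopping logic. It is an illustration under stated assumptions, not the Keras implementation: it ignores options such as `min_delta` and `restore_best_weights`, and the function name and example scores are made up.

```python
def early_stopping_epoch(val_scores, patience, mode="max"):
    """Return the (1-based) epoch at which training would stop."""
    best = None
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores, start=1):
        if best is None:
            improved = True
        elif mode == "max":
            improved = score > best
        else:
            improved = score < best

        if improved:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # no early stop: training runs for all epochs
    return len(val_scores)

val_accuracy = [0.80, 0.85, 0.90, 0.90, 0.89, 0.90, 0.88]
stop_epoch = early_stopping_epoch(val_accuracy, patience=3, mode="max")
print(stop_epoch)
```

In this example the best validation accuracy is reached at epoch 3, and the next three epochs do not beat it, so training stops at epoch 6. Note that matching the best value does not count as an improvement.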

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We finally will fit the model
* We create the model object and use .fit(), as usual
  * We pass the train set
  * Epochs are set to 75. In theory, you may set a high value since we add early stopping, which halts the training process when the monitored metric (here, the validation accuracy) stops improving.
  * We pass the validation data as a tuple.
  * Verbose is set to 1 so we can see the current epoch along with the training and validation loss and accuracy.
  * Finally, we pass our callback, the early_stop object we created earlier.

* For each epoch, note the training and validation loss; are they increasing? Decreasing? Static?
  * Ideally, the loss should decrease as the epochs progress, which is a practical sign that the network is learning


model = create_tf_model_sigmoid(n_features=X_train.shape[1])
model.fit(x=X_train, 
          y=y_train, 
          epochs=75,
          validation_data=(X_val, y_val),
          verbose=1,
          callbacks=[early_stop]
          )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Model evaluation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Now we will evaluate the model performance by analyzing the train and validation losses and accuracy that happened during the training process. 
* In deep learning, we use the model history to assess if the model learned, using the train and validation sets. We also evaluate separately how the model generalizes to unseen data (the test set)
* The model training history information is stored in a `.history.history` attribute from the model. 
* **Note it shows loss and accuracy for train and validation**

history = pd.DataFrame(model.history.history)
history.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are plotting each loss and accuracy in a line plot, where the y-axis has the loss/accuracy value, the x-axis is the epoch number and the lines are colored by train or validation
* We use `.plot(style='.-')` for this task
  * Note the loss curves for the training and validation data follow a similar path and are close to each other. It looks like the network learned the patterns.
  * Note in the accuracy plot that both train and validation accuracies keep increasing. When the validation performance "saturates", the training stops, as we set in the early stopping object.

sns.set_style("whitegrid")
history[['loss','val_loss']].plot(style='.-')
plt.title("Loss")
plt.show()

print("\n")
history[['accuracy','val_accuracy']].plot(style='.-')
plt.title("Accuracy")
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we will evaluate the model performance on the test set, using `.evaluate()` and passing the test set. It returns the loss and the accuracy. Note the values are not much different from the losses and accuracy in the train and validation sets.
* Note the loss is low and accuracy is high. It looks like the model learned the relationship between the features and the target, considering all features.

model.evaluate(X_test,y_test)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> When evaluating a deep learning model, you typically inspect the loss plot and evaluate on the test set; however, **as an optional additional step**, you can run an evaluation similar to the one we did in conventional ML.
* In classification, you would analyze the confusion matrix and classification report, using the custom function we have seen over the course.
* One difference is that we adapted the function to also evaluate the validation set, but that is a minor change in the code; the overall logic is the same

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> The adapted custom function below will work for the model made with a sigmoid as an activation function.
* The prediction format differs between the sigmoid and the softmax output layers: the sigmoid layer outputs a single probability (between 0 and 1) per sample. In the next unit, we will cover this evaluation for a model with softmax
* If your model was trained with a softmax output layer, the code below may not work as expected

from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):
  prediction = pipeline.predict(X).reshape(-1)
  prediction = np.where(prediction<0.5,0,1) 
  # the prediction using sigmoid as activation function is a probability, between 0 and 1
  # we convert it to 0 or 1: if it is lower than 0.5, the predicted class is 0, otherwise it is 1
  # you could change the threshold if you want.

  print('---  Confusion Matrix  ---')
  # the arguments are swapped on purpose, so rows show predictions and columns show actual values
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")



def clf_performance(X_train,y_train,X_test,y_test,X_val, y_val,pipeline,label_map):

  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Validation Set #### \n")
  confusion_matrix_and_report(X_val,y_val,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's pass the values as usual.
* Note the model is capable of separating the classes, including in the test set

clf_performance(X_train, y_train,
                X_test,y_test,
                X_val, y_val,
                model,
                label_map= ['malignant', 'benign']
                )
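To make the numbers in the report concrete, the sketch below builds a small confusion matrix by hand and derives precision and recall for class 0 from it. The labels are made up, and the layout (rows as predictions, columns as actual values) follows the custom function above; this is an illustration, not a replacement for the sklearn functions.

```python
import numpy as np

# made-up actual and predicted classes (0 or 1)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# matrix[i, j] = number of samples predicted as class i whose actual class is j
matrix = np.zeros((2, 2), dtype=int)
for pred, actual in zip(y_pred, y_true):
    matrix[pred, actual] += 1

# precision: of the samples predicted as class 0, how many are actually class 0
precision_0 = matrix[0, 0] / matrix[0, :].sum()
# recall: of the samples that are actually class 0, how many were predicted as class 0
recall_0 = matrix[0, 0] / matrix[:, 0].sum()

print(matrix)
print(round(precision_0, 3), round(recall_0, 3))
```

With this layout, precision is read along a row (a predicted class) and recall along a column (an actual class).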

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Prediction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's take a sample from the test set and use it as if it were live data. We will consider 1 sample

index = 1
live_data = X_test[index-1:index,]
live_data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use `.predict()` and pass the data. Note the result is not a direct 0 or 1, but instead a probabilistic result, between 0 and 1

prediction_proba = model.predict(live_data)
prediction_proba

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You must decide a threshold when stating if the given probabilistic result is a 0 or 1. In this case, we set the threshold as 0.5
* We converted using a NumPy function `np.where()`, where you make a condition (prediction_proba < 0.5), if that is true, it converts to 0; otherwise, it is 1.

prediction_class = np.where(prediction_proba<0.5,0,1) 
prediction_class
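The choice of threshold changes which class a given probability maps to. As a small illustration with made-up sigmoid outputs (the values and thresholds are assumptions for the example), the sketch below applies the same `np.where()` conversion with three different thresholds.

```python
import numpy as np

proba = np.array([[0.15], [0.45], [0.55], [0.92]])   # made-up sigmoid outputs

for threshold in (0.3, 0.5, 0.7):
    # below the threshold the sample is assigned class 0, otherwise class 1
    classes = np.where(proba < threshold, 0, 1).reshape(-1)
    print(threshold, classes)
```

A lower threshold assigns more samples to class 1 and a higher threshold assigns more to class 0, so the threshold is a way to trade one type of error for the other.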

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's plot the probabilistic result, so you can check the predictions in a more visual fashion
* Read the comments in the code below
* The code takes prediction_proba and derives the associated probability for each of the 2 classes, 0 and 1. Then we plot the result in a bar plot using Plotly

# define how you map the classes and the meaning of each
# where the dict key is the class number
# in sklearn's breast cancer dataset, 0 is malignant and 1 is benign
target_map = {0:'Malignant', 1:'Benign'}

# create a dataframe that will show the probability per class
# we initialise the probabilities as 0.0 (float), and update them below
prob_per_class= pd.DataFrame(
        data=[0.0, 0.0],
        index=target_map.keys(),
        columns=['Probability']
    )


# the predicted probabilities of both classes sum to 1
# the sigmoid output is the probability of class 1, so for binary classification
#    === if prediction_proba is, say, 0.01, the predicted class is 0:
#    the probability for class 1 is 0.01 and for class 0 is 0.99
#    === if prediction_proba is, say, 0.99, the predicted class is 1:
#    the probability for class 1 is 0.99 and for class 0 is 0.01
# we keep the value as a float so the probability is not truncated to an integer
proba_class_1 = float(prediction_proba[0, 0])
prob_per_class.iloc[1,0] = proba_class_1
prob_per_class.iloc[0,0] = 1 - proba_class_1


# we round the values to 3 decimal points, for better visualization
prob_per_class = prob_per_class.round(3)

# we add a column to prob_per_class that shows the meaning of each class
# in this case, malignant or benign
prob_per_class['Result'] = list(target_map.values())

# take a look at the data we generated
prob_per_class


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use a bar plot, where the x-axis shows the Result and the y-axis the associated probability for a given Result.
* I encourage you to go to the first cell of the Prediction section and change the `index` variable to take a different sample. Then run all cells of this section, up to the plot in the cell below
* `index` must be an integer between 1 and the number of rows in `X_test`

import plotly.express as px
fig = px.bar(
        prob_per_class,
        x = 'Result',
        y = 'Probability',
        range_y=[0,1],
        width=400, height=400,template='seaborn')
fig.show()

---