# TensorFlow - Unit 01 - Introduction

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

* **The TensorFlow Lesson is made of 10 units.**
* By the end of this lesson, you should be able to:
  * Understand and apply the concepts of Neural Networks in TensorFlow, such as layers (Dense, Convolution, Dropout), activation function, loss function, optimizer, backpropagation and the aspects considered when fitting a TensorFlow model 
  * Use TensorFlow models for ML tasks, such as Regression and Classification, considering tabular and image datasets



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand which ML tasks we will explore in this course using TensorFlow
* Understand the basic terminologies used in Neural Networks and TensorFlow



---

TensorFlow is a Python library, released in late 2015, for fast numerical computing and is heavily used for deep learning.


 <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png">
 **Why do we study TensorFlow?**
  * Because you can build and train deep learning models relatively easily using TensorFlow's high-level interface. 
  * Also, it has strong community support with extensive documentation and many project use cases.
  * In addition, deep learning models built with TensorFlow can solve ML problems where conventional ML algorithms, like the ones we saw from Scikit-learn, tend to struggle, especially when the data is complex or very large. In applications such as computer vision, natural language processing or speech recognition, deep learning is often the effective solution for learning the complex patterns in your dataset.




## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Learning Context

* We encourage you to:
  * Add **code cells and try out** other possibilities, i.e., play around with parameter values in a function/method, or consider additional function parameters etc.
  * Also, **add your comments** in the cells. It can help you to consolidate your learning. 


* Parameters in a given function/method
  * As you may expect, a given function in a package may contain multiple parameters. 
  * Some of them are mandatory to declare; some have pre-defined values, and some are optional. We will cover the most common parameters used/employed in Data Science for a particular function/method. 
  * However, you may seek additional information in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For TensorFlow, the link is [here](https://www.tensorflow.org/)**.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Unit 01 - Introduction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We have a series of theory lessons to gradually introduce Neural Networks Models and TensorFlow.
* That is a practical approach to onboard you to the topics, so when we reach a unit where we conduct an end-to-end workflow, you will be familiar with the terminologies and concepts.



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> In addition, **don't worry if you don't understand everything at first; it is expected to take some time to absorb the use cases**. 
* We will also not dive deep into the mathematical concepts and explanations, so we can focus more on usability and how it can serve our workflow process to fit a model that will produce reliable predictions.

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  ML Tasks

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In this lesson, we will explore business cases that involve the following ML tasks
* Regression
* Classification (Binary and Multi-class)


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png">
We will use:
* Structured and **tabular datasets** from ML libraries like Seaborn, Plotly, Scikit-learn and Yellowbrick
* **Image datasets** from TensorFlow
* Note: you will learn how to fetch a real image dataset, using the Kaggle API, in Walkthrough Project 01

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Neural Networks

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> A Neural Network is an approach to mathematically model biological neuron systems. This approach is applied to solve tasks that many other types of algorithms can't.
* A classical artificial neural network is made of **layers**
  * An input layer, one or more hidden layers and an output layer. When a network has two or more hidden layers, we call it a deep neural network; working with such networks is known as **deep learning**.
* Each layer has a set of **nodes** (or **neurons**, by analogy with the neurons in the human brain).
  * In a classical network, each neuron is connected to every neuron in the next layer. This is known as a fully connected layer.
  * We will see later this type of Layer is called a `Dense` layer in TensorFlow
* The network architecture will depend on the dataset and the ML task you are dealing with; we will cover this in more detail in the upcoming units.
  
  <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png">  We also understand that each neuron is associated with a value of weight (w) and bias (b), **which are parameters that will learn the relationship between the inputs and output**. In essence, these are **floating numbers**, either positive or negative, or even zero.
* In the above network, there is a unique weight value for each connection. There is also a single bias for each neuron. Reflect for a moment on how many parameters a network has!!
  * The input values, weights and bias (floating numbers) are combined and passed to an activation function that performs a mathematical operation and outputs a number. We will study this in more detail later on.
* Once the training is finished, the weights and biases from all neurons should have values that together are capable of learning the patterns from the data and should be able to generalize on unseen data.





<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png"> In the end, this network has:
  * A set of **layers**, where each layer has a set of **neurons**, where each neuron has a set of inputs where we apply a weight, add a bias term and pass them through a function that generates a single output. 
  * This output is fed into the neurons of the next layer, unless the neuron belongs to the output layer.
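
The description above can be sketched in plain Python, without TensorFlow. This is a minimal illustration only: the function names `neuron_output`, `relu` and `count_dense_parameters`, the input values and the layer sizes are assumptions made for this example and are not part of the course material. It shows how a single neuron combines its inputs, and how quickly the number of weights and biases grows in a fully connected network.

```python
def relu(z):
    """A simple activation function: negative values become zero."""
    return max(0.0, z)


def neuron_output(inputs, weights, bias, activation):
    """Multiply each input by its weight, add the bias, then apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of neurons per layer, starting with the inputs.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the two layers
        total += n_out         # one bias per neuron in the receiving layer
    return total


# One neuron with two inputs (hypothetical values)
output = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1, activation=relu)
print(output)

# Hypothetical network: 8 inputs, one hidden layer of 20 neurons, 1 output neuron
n_params = count_dense_parameters([8, 20, 1])
print(n_params)  # 8*20 + 20 + 20*1 + 1 = 201
```

Even this tiny network has a couple of hundred parameters, which is why the values are learned during training rather than set by hand.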

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> TensorFlow

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png">  TensorFlow is a Python library, released in late 2015, for fast numerical computing and is heavily used for deep learning.
* It is one of the most popular Deep Learning frameworks.

* We use Python to provide a high-level programming front end that interacts with TensorFlow, but the actual math operations are computed in C++. 

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Due to its effectiveness and syntax simplicity, another neural network library, known as Keras, was adopted as the interface for TensorFlow from version 2.0. 
* That is why we will notice in the upcoming notebooks many imports made with Keras notation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> Training TensorFlow models will likely require extensive processing capacity. The models can be trained on either conventional computation processing units (or CPUs) or in higher-performance graphics processing units (or GPUs). 
* As an illustrative example, depending on the training set and model complexity, training a TensorFlow model on CPUs may take up to 40 hours, whereas training on a GPU may take 3 or 4 hours.



# TensorFlow - Unit 02 - Sequential model and Layers

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn what a Sequential model is and which layers we will learn and use over the course (Dense, Dropout, Convolution, Pooling, and Flatten)



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Unit 02 - Sequential model and Layers

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> TensorFlow uses the `Sequential` model class to build these Neural Networks **by arranging a set of layers in sequence**, creating a network.  
* As we saw in the lesson video, there are multiple layer examples. In the course, we will cover **Dense, Convolution, Pooling, Flatten, and Dropout layers**

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Sequential Model

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> We import the Sequential model from the TensorFlow API. 
* It **arranges a sequence of layers in your neural network**. The documentation is [here](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential).
* We will use and code many Sequential models over this TensorFlow lesson, and in the first Walkthrough Project. For now, we are just explaining the layers and doing simple imports.
* Typically, you instantiate `Sequential` and assign the resulting object to a variable, usually named `model`. Once instantiated, you can add layers to the model by passing each layer, with its configuration, to `.add()`

import os
# Set the log level before importing TensorFlow, otherwise it has no effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from tensorflow.keras.models import Sequential

model = Sequential()
# you would add layers like:   model.add()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Layers

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">The idea is to stack and arrange layers to create a neural network. 
* The layer selection depends on the ML task and business problem you are trying to solve. We will investigate that in upcoming notebooks.
*  You can find [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers) a list of layers that TensorFlow manages. We will use a few of them over the course.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png"> To import a given layer, simply type `from tensorflow.keras.layers import  ...` and choose your layer. 




---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Dense Layer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A Dense layer is a **fully-connected neural network layer**. The documentation is found [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)
* You will pass the number of neurons (units) that the layer has and an activation function. We will learn more about activation functions in the next unit. 
* If this layer is the first layer of the model, you can pass the number of features your data has to `input_shape`. By convention it is a tuple, so, for example, if your data is tabular and has eight features, you would pass `(8,)` to `input_shape`
  * In the example below, we add two Dense layers, assuming the data has eight features. The first is the input layer, which has 'relu' as its activation function (don't worry, we will cover more on that soon), `input_shape` set to (8,) and `units` set to 8. In this course, we match the units of the first layer to the number of features, although Keras itself does not require it
  * The next layer has 20 neurons and 'relu' as its activation function. You don't need to set `input_shape`, since once the layer is added to the Sequential model, it takes the output shape of the previous layer as its input shape

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Note this arrangement of 2 layers will not produce a working neural network, since we only defined the input layer and one hidden layer. 
* We would still need to evaluate more hidden layers and define the output layer, but for the moment, we just want to show how to add Dense layers to your model. But don't worry, we will do that soon.

from tensorflow.keras.layers import Dense
model = Sequential()
model.add(Dense(units=8, activation='relu', input_shape=(8,)))
model.add(Dense(units=20, activation='relu'))

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Dropout Layer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  A dropout layer is a regularization layer and is used to reduce the chance of overfitting the neural network. The documentation is found [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout).
* Regularization is jargon which means reducing complexity. In this case, we reduce network complexity by switching off certain random neurons in your network layers during a given training iteration. 
  * You can use an analogy where the neurons that "survived" (that were not switched off) in that iteration have to step in and handle the learning of the "switched-off" neurons.
  * Although, in practical terms, this technique is effective for reducing overfitting, the reasons for being so effective are not yet well elaborated on a theoretical level.

* You add a Dropout layer right after the layer it should act on, for example, after a Dense layer; the dropout is then applied to the outputs of that Dense layer.
* You will pass the fraction of neurons that will be switched off, a value between 0 and 1. Typically, a dropout rate between 0.2 and 0.5 is used. If you notice the model still overfits, you may increase the rate
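
To make the "switching off" idea concrete, here is a minimal NumPy sketch of what a dropout step does to a layer's outputs during training. It is a simplified illustration, not the Keras implementation: the helper name `apply_dropout` and the example values are assumptions. It follows the behaviour described in the Keras `Dropout` documentation, where each value is dropped with a probability equal to the rate and the surviving values are scaled up by `1 / (1 - rate)`.

```python
import numpy as np


def apply_dropout(activations, rate, rng):
    """Randomly zero roughly a fraction `rate` of the activations (training time only)."""
    keep_mask = rng.random(activations.shape) >= rate  # True means the neuron survives
    # Survivors are scaled up so the expected total output stays the same
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(seed=0)
layer_output = np.ones(10)  # pretend outputs of a Dense layer with 10 neurons
dropped = apply_dropout(layer_output, rate=0.3, rng=rng)
print(dropped)  # a mix of zeros (switched off) and scaled-up survivors
```

A different random set of neurons is switched off in every training iteration, and dropout is not applied when the model is used for predictions.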




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Let's use the previous network and add Dropout to it. You may add Dropout to the input and hidden layers.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Again, this arrangement will not produce a working neural network, since we only defined the input layer and one hidden layer. 
* We would still need to evaluate more hidden layers and define the output layer, but for the moment, we just want to show how to add Dropout layers to your model. But don't worry, we will do that soon.

from tensorflow.keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(units=8, activation='relu', input_shape=(8,)))
model.add(Dropout(0.25))  # 25% of the previous layer's neurons will be randomly switched off in every training iteration

model.add(Dense(units=20, activation='relu'))
model.add(Dropout(0.3)) # 30% of the previous layer's neurons will be randomly switched off in every training iteration

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Convolution Layer, Pooling Layer and Flatten Layer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  In this course, we will handle datasets that are images. A convolutional neural network is a type of neural network which is commonly used for image processing and computer vision.
* You may create a Neural Network with Dense layers only, or use conventional ML, when dealing with image datasets. However, experience shows that a combination of Convolution, Pooling and Flatten layers, known as a Convolutional Neural Network (CNN), tends to perform better, since it recognizes the patterns in the image more effectively by extracting its features.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png"> Let's cover what each layer does in a broad sense and how to arrange them to create a CNN. Then we will explain in detail the mechanism of each layer
  * We are not focusing on the mathematical concepts of convolutions, but instead, we want to explore the core ideas of how it works.

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png"> Assume our dataset is a collection of images with two classes of different images: say, dogs (class 0) and cats (class 1). In the end, a convolutional network (CNN or ConvNet) should be able to handle a binary classification task

* To achieve that we could arrange the layers in this format:
  * 1 - **A set of pairs of Convolutional  + Pooling layers**. 
    * Often pairs of Convolutional  + Pooling are fed into another pair of Convolutional  + Pooling. 
    * This allows the networks to discover patterns within patterns, usually with more complexity for later convolutional layers.
  * 2 - **A Flatten layer** 
  * 3 - A **neural network using fully connected Dense Layers**.
    *  It is a conventional feed-forward neural network, where the output layer, in this case, is a single neuron, which will tell us if the image belongs to class 0 or class 1.
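
As a rough sketch of the arrangement listed above, the layers could be stacked in a Sequential model as follows. Everything specific here is an assumption chosen for illustration: the 64x64 RGB image size, the number of filters and units, and the `sigmoid` activation on the single output neuron. The course covers how to choose and arrange these in a later notebook.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Input(shape=(64, 64, 3)))  # assumed image size: 64x64 pixels, 3 colour channels

# 1 - pairs of Convolution + Pooling layers
model.add(Conv2D(filters=16, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(filters=32, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# 2 - Flatten layer: turns the feature maps into a single vector
model.add(Flatten())

# 3 - fully connected Dense layers, ending in a single output neuron
model.add(Dense(units=32, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))

model.summary()
```

The summary printed at the end lists each layer with its output shape, which is a convenient way to see how the image shrinks through the Convolution and Pooling pairs before it is flattened.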

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> What does each layer do, and why is it important?
* **Convolution layers** use filters to separate the dominant pixel values from the non-dominant ones in an image. This allows the network to find patterns (or features) in the image
* **Pooling layers** reduce the image size by extracting only the dominant pixels (or features) from the image. The combination of these two layers removes the nonessential part of the image and reduces complexity
* The **Flatten layer** is used to flatten the matrix into a vector, which means a single list of all values. That is fed into a dense layer.
* A **fully connected neural network** is used for learning non-linear combinations of the high-level features as represented by the output of the convolutional layers and for running predictions.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> We are not focusing on the mathematical explanation in the subsequent sections, but rather on a practical demonstration of how these layers work.

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%208-%20Challenge.png"> **Context**: A challenge in computer vision is that an image input can be huge. Imagine an image with a size of 50x50 with three colours (50x50x3). That represents 7500 features as input to your neural network. 
  * Now imagine bigger image sizes. You may end up with a network that needs a lot of processing and memory requirements for training.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> At the end of the day, you don't need all pixels from the image; you would only need to be able to recognise the patterns (or the features) and eventually, you can learn from an image that is a fraction of the original size. 
  * That is when the Convolution and Pooling layers become interesting, since they can capture the relevant image information through filters while using fewer parameters than a network made only of Dense layers.
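
A quick back-of-the-envelope calculation illustrates the difference in parameter count. The 50x50x3 image comes from the context above; the layer sizes (a Dense layer with 64 neurons and a convolution layer with 32 filters of size 3x3) are assumptions picked only for this comparison.

```python
# Image from the context above: 50 x 50 pixels with 3 colour channels
height, width, channels = 50, 50, 3
n_inputs = height * width * channels  # 7500 input features

# Option A: feed all pixels to one Dense layer (assumed 64 neurons)
dense_units = 64
dense_params = n_inputs * dense_units + dense_units  # weights + biases

# Option B: one convolution layer (assumed 32 filters of size 3x3)
n_filters, kernel_h, kernel_w = 32, 3, 3
conv_params = kernel_h * kernel_w * channels * n_filters + n_filters  # weights + biases

print(f"Dense layer parameters:       {dense_params}")   # 480064
print(f"Convolution layer parameters: {conv_params}")    # 896
```

The convolution layer needs far fewer parameters because each small filter is reused across the whole image, instead of learning a separate weight for every pixel.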

---

##### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Convolution Layer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> **Convolution is about applying filters to get another image.** Simple as that.
* We will use `Conv2D()` in the course to represent the layer; the documentation is found [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D).
* Ultimately, we create another image when we apply a filter to the input image. The convolution layer will not have a single filter; it will have a set of filters that will capture distinct features (or patterns), like edges, vertical lines, horizontal lines etc.
* By filtering, you keep the information (aka pixels) that matter and drop the rest that is not useful.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We understand that an image can be represented as a NumPy array, so let's imagine our image is 6 pixels wide by 5 pixels high, in grayscale.
* We are not worried about the colours of this fictitious image; let's just focus on its pixel representation (or its NumPy array representation). We just use 1 and 0 as the image values to facilitate the learning experience.
* We are imagining here a grayscale image, which has only one colour channel. That will simplify the learning process. However, the concept extends to coloured images, where the difference is that the filters are applied across all channels

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A **filter** is a matrix of values. It is also known as a kernel. Its values will define the type of filter we want. 
* The example below shows a 3x3 filter used for extracting vertical lines from an image
* A filter is used for detecting or transforming an image feature, for example, detecting horizontal/vertical lines or edges, or blurring. You can check additional types of filters [here](https://en.wikipedia.org/wiki/Kernel_(image_processing))

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  You will "convolve" the image and the filter. Or in simpler terms, **scan the filter over the image**.
* Let's first visualise the filter scanning the image. It is as if you overlay the filter grid on the image grid, one position at a time.


* Each value from the filter grid will multiply the respective value from the image grid covered by the filter. 
* The products are summed, and this value becomes one pixel in the new convoluted image

* When you visualise the filter scanning the whole image, you will notice the convoluted image is not the same size as the original image. In this case, with a 3x3 filter, the convoluted image has 2 pixels fewer in each direction. 
  * The convoluted image is then 4 x 3 
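
The scanning procedure described above can be sketched with NumPy. This is a minimal illustration under stated assumptions: the helper `convolve2d_valid`, the pixel values of the image and the filter values are made up for this example (the filter is a commonly used vertical-edge kernel). Note that a NumPy shape is written as (rows, columns), that is (height, width), so the 4 x 3 convoluted image appears as shape `(3, 4)`.

```python
import numpy as np


def convolve2d_valid(image, kernel):
    """Scan the kernel over the image; at each position multiply element-wise and sum."""
    kernel_h, kernel_w = kernel.shape
    out_h = image.shape[0] - kernel_h + 1
    out_w = image.shape[1] - kernel_w + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + kernel_h, col:col + kernel_w]
            output[row, col] = np.sum(patch * kernel)
    return output


# Hypothetical image, 6 wide x 5 high: bright left half (1), dark right half (0)
image = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
])

# Assumed 3x3 filter that responds to vertical edges
vertical_filter = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])

convoluted = convolve2d_valid(image, vertical_filter)
print(convoluted.shape)  # (3, 4)
print(convoluted)        # every row is [0. 3. 3. 0.]
```

The non-zero values sit exactly where the image changes from bright to dark, which is the vertical edge the filter is looking for; flat regions produce zero.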

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Wait, you may be asking yourself: **will that really work?** And the answer is: yes, and a lot.
* Let's take a real picture of a flower, it is **580 width x 555 height**.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are applying two filters, one that captures vertical lines and the other that captures horizontal lines. 
* We are not showing the code since the focus here is on the concept
* Note the convoluted images show the image's features, in this case, the vertical and horizontal lines that constitute the patterns of the image.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Next, we apply two edge filters, each capturing the edges (or the features) from the image in a different fashion.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Do I need to state which filters to use in my Convolution layer?
* No. You specify the number of filters you want. 
* The filters are initialized with small random values. During the training process, the filters are updated in a manner that minimises the loss. During this process, the filters will learn to detect certain features, like edges or lines. 


---

##### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pooling

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Like a Convolutional Layer, a Pooling layer can reduce the spatial size of the image.
* That allows you to reduce the processing effort while maintaining the core features of the image
* It is based on the assumption that when the input is changed by a small amount, the overall outcome does not change. 
* When pooling you will set a kernel size (a matrix size) that will scan the image. 
* The image values that are covered by the Kernel are analysed. Then a certain value is calculated, and this new value is the new pixel from the pooled image
  * For example, you can do Max Pooling, Min Pooling or Average Pooling. 
  * Max Pooling returns the maximum value from the portion of the image covered by the kernel. Experience shows this method tends to work better than the others.
  * Min Pooling returns the minimum value from the portion of the image covered by the kernel.
  * Average Pooling returns the average of all the values from the portion of the image covered by the Kernel.

* We are using the `MaxPool2D()` layer (an alias of `MaxPooling2D()`) in the course for pooling. The documentation is found [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool2D).

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Let's show an example. Imagine if your image is 4 x 4


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  You set your kernel to be 2x2
* The kernel covers non-overlapping groups of the grid; for example, the upper left group is 45, 37, 5 and 2, and in this case, it extracts the max value: 45
* The extracted value is a new pixel in the pooled image
* Note we halved each dimension of the image! Previously it was 4x4, whereas now it is 2x2
  * In addition, if your kernel were 3x3, each dimension of your pooled image would be a third of the original size. Typically, we tend to use 2x2
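
The max pooling example can be reproduced with a short NumPy sketch. Only the upper left group (45, 37, 5 and 2) comes from the text above; the other pixel values and the helper name `max_pool` are assumptions made for illustration.

```python
import numpy as np


def max_pool(image, pool_size=2):
    """Split the image into non-overlapping pool_size x pool_size groups and keep each maximum."""
    out_h = image.shape[0] // pool_size
    out_w = image.shape[1] // pool_size
    # Drop any leftover rows/columns that do not fill a complete group
    trimmed = image[:out_h * pool_size, :out_w * pool_size]
    groups = trimmed.reshape(out_h, pool_size, out_w, pool_size)
    return groups.max(axis=(1, 3))


# 4x4 image: the upper left group matches the text, the rest is hypothetical
image = np.array([
    [45, 37, 12,  8],
    [ 5,  2, 30, 21],
    [17, 60,  9,  4],
    [23, 11, 14, 33],
])

pooled = max_pool(image, pool_size=2)
print(pooled)
# [[45 30]
#  [60 33]]
```

Swapping `max` for `min` or `mean` in the last line of `max_pool` would give Min Pooling or Average Pooling respectively.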



  <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's apply a 2x2 pool size to the flower picture's convoluted edge filter.
* Note the image on the right is the pooled image and has the same features as the convoluted image (on the left).
* The pooled image is half the size of the convoluted image in each dimension and keeps the same feature information. 
  * In fact, the edges look more pronounced after the pooling.

---

##### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Flatten

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The flatten layer is much simpler: it is used to flatten the image matrix into a vector, which means a single list of all values. The `Flatten()` layer documentation is found [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten)
* Once the data passes through the Flatten layer, it is fed into a neural network with fully connected layers. This is an effective way to keep learning the patterns from the features extracted by the convolution and pooling layers.
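
Flattening is easy to picture with NumPy. The sketch below uses a hypothetical 2x2 pooled image; the Keras `Flatten()` layer does the same thing for every image in a batch.

```python
import numpy as np

# Hypothetical 2x2 pooled image
pooled_image = np.array([
    [45, 30],
    [60, 33],
])

flattened = pooled_image.flatten()  # rows are laid out one after another
print(flattened)        # [45 30 60 33]
print(flattened.shape)  # (4,)
```

Each value in the flattened vector then becomes one input to the first Dense layer.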

---

##### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> How do we arrange these layers then?

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will study in an upcoming notebook how to arrange all the layers together and study the inputs and outputs of each layer

---

# TensorFlow - Unit 03 - Activation Function, Loss function, and Optimiser

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn the theory and how to apply the concepts of Activation Function, Loss Function and Optimiser; and understand why they are important in Neural Networks



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Unit 03 - Activation Function, Loss function, and Optimiser

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Activation Function

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> When designing your neural network, the `activation function` is used to set boundaries to the output values from the neuron.

* Remember, a neuron processes its inputs to generate an output. This processing is done by an activation function.
* In the plot below, the inputs (X1 and X2), weights (w1 and w2), bias and output (y) are floating-point numbers. 
  * Therefore the activation function (which is a mathematical function) takes a number as its input, the weighted sum **`( X1*w1 + X2*w2 + b )`**, and generates a numerical output. 
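
As a minimal sketch of the weighted sum described above, the snippet below computes the number a neuron hands to its activation function. The input, weight and bias values are hypothetical and chosen only for illustration.

```python
# Hypothetical values, chosen only for illustration
x1, x2 = 0.5, -1.0   # inputs
w1, w2 = 0.8, 0.3    # weights
b = 0.1              # bias

# Weighted sum that is passed to the activation function
z = x1 * w1 + x2 * w2 + b
print(z)
```

Whatever activation function the neuron uses, this single number `z` is its input.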


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> When designing your network, you should consider that different activation functions may be used in different layers (input, hidden, output) of your network. We will cover use cases here in the notebook
* The hidden layers typically tend to use the same activation function
* The activation function for the output layer depends on the ML task

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Step function


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The simplest activation function is a step function. 
* If the input is greater than zero, the neuron output is 1; otherwise, it is 0.
* As a reminder from the previous neuron plot, the input to the activation function is **`( X1*w1 + X2*w2 + b )`**. 
* If the result of this calculation is greater than zero, the neuron output is 1. This output is then fed to the next layer (unless the neuron is in the output layer).
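
The rule above can be sketched in a few lines of plain Python. The function names `step` and `neuron_output` are hypothetical helpers for this illustration, not part of TensorFlow.

```python
def step(z):
    """Step activation: 1 if the input is greater than zero, otherwise 0."""
    return 1 if z > 0 else 0


def neuron_output(x1, x2, w1, w2, b):
    """Weighted sum of two inputs plus bias, passed through the step function."""
    return step(x1 * w1 + x2 * w2 + b)


# Hypothetical inputs and weights, for illustration only
print(neuron_output(0.5, -1.0, 0.8, 0.3, 0.1))   # weighted sum is positive
print(neuron_output(0.5, -1.0, 0.8, 0.3, -0.5))  # weighted sum is negative
```

Note that `step(0.2)` and `step(0.9)` return the same value: the output only changes when the input crosses zero.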


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A step function can potentially be useful for the output layer in a binary classification task where your target variable is 0 or 1. 
* At the same time, note this activation function is "harsh": the neuron output is either 0 or 1, so small changes to the input are typically not reflected in the output.
  * In practical terms, we tend not to use this activation function. Because it is intuitive, it mainly helps us understand how an activation function interacts with the neuron (inputs - process - output).

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Sigmoid function


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Another activation function is sigmoid. 
* It follows the logistic function, the same one used in the Logistic Regression algorithm. 
* Note the min and max values are still 0 and 1 (like in the step function), but the transition is now smooth, so slight variations in the input affect the output. Beyond a certain range of input values (large positive or large negative), the output barely changes; it saturates close to 1 or 0.
* **This is commonly used as an activation function for the output layer in a binary classification** since it represents well the idea of predicting two classes - 0 or 1. It outputs a value between 0 and 1, which can be read as the probability of class 1.
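
To make the smooth transition and the saturation visible, here is a small sketch of the logistic function in plain Python. The helper name `sigmoid` and the sample inputs are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs ranging from very negative to very positive
for z in [-10, -1, 0, 1, 10]:
    print(z, round(sigmoid(z), 4))
```

Inputs near zero move the output noticeably, while inputs far from zero leave it very close to 0 or 1.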



---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  ReLu function


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The Rectified Linear Unit (ReLU) is another activation function commonly used in the industry.
* If the input is negative, then the output is 0; otherwise, the output is equal to the input.
* It is simple to implement and effective in practice. **It is commonly used for input and hidden layers.**
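
The ReLU rule fits in one line of Python. The helper name `relu` is a hypothetical one used only for this sketch.

```python
def relu(z):
    """ReLU: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, z)


print(relu(-2.5))  # negative input
print(relu(3.0))   # positive input
```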

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Softmax function

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Finally, there is another common activation function called Softmax.

* This function outputs a set of values that sum to 1.0, which can be mapped to the probabilities of multiple classes.

* Imagine your target variable has three classes, like a cat, dog and parrot. The output layer then needs one neuron per class, so that each neuron maps to exactly one class.
  * Softmax will calculate the probability for each class, and the sum of all individual class probabilities will equal one. Thus, the chosen class is the one that has the highest probability.


* **Softmax is useful for an output layer in multiclass or binary classification.**
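
A minimal sketch of softmax for the cat, dog and parrot example, in plain Python. The raw scores are hypothetical, and the helper `softmax` is written for illustration rather than taken from TensorFlow.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the highest score keeps the exponentials numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["cat", "dog", "parrot"]
scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs of the three output neurons

probs = softmax(scores)
predicted = classes[probs.index(max(probs))]
print(probs, predicted)
```

The predicted class is simply the one with the highest probability.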

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Other Activation Functions

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> There are other activation functions you can consider in your project, like  `tanh` (Hyperbolic Tangent).
* You may check the TensorFlow [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/activations) for the available options. In addition, you may go to this link to learn more about a wide set of [activation functions](https://en.wikipedia.org/wiki/Activation_function).

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Loss Function and Optimiser

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We now understand that neural networks take in inputs, multiply them by weights and add biases. The result then passes through an activation function, and at the end of all the layers this leads to some output - your prediction.

* Once the network makes a prediction, how can we evaluate its performance? Is it good or not? 
* And after that, how can we optimise the network weights? Keep in mind these evaluation and optimisation processes happen during the model training.
  * These processes are done with a loss function and an optimizer, which are mathematical functions that mimic human behaviour.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In essence, the human behaviour of **learning from mistakes** inspired the creation of optimizer and loss function mechanisms in neural networks. 
 
* As we try to reduce our mistakes and learn from our experiences, scientists have used this principle and mathematically created this algorithm to reduce errors by using optimizers and loss functions. 
* This process takes several iterations to make the model learn from its experience.
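
The "learning from mistakes" loop can be sketched with a single weight. This is a toy illustration under stated assumptions: the data, the starting weight and the learning rate are all hypothetical, and the plain gradient step stands in for what a real optimizer does.

```python
# Hypothetical toy data: the "true" relationship is y = 2 * x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]


def predict(w, x):
    return w * x


def mse_loss(w):
    """Loss function: average squared mistake of the predictions."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w):
    """Slope of the loss with respect to the weight."""
    return sum(2 * (predict(w, x) - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0               # initial guess for the weight
learning_rate = 0.05  # size of each correction step (an assumed value)
losses = []

for iteration in range(20):
    losses.append(mse_loss(w))               # evaluate the mistakes
    w = w - learning_rate * mse_gradient(w)  # correct the weight

print(round(w, 3), losses[0], losses[-1])
```

The loss function measures how wrong the predictions are, and each step moves the weight in the direction that reduces that loss.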




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> When you create a neural network, you will **compile your model after you set your last layer**.
* Compiling the model simply means defining the optimizer and loss function for your network. Their values depend on the ML task (for example, Regression, Binary or Multiclass Classification).


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **We will not get deep into the mathematical explanation of how optimizer algorithms and loss functions work**. 
* It involves linear algebra concepts, and we would need more time to cover that. We will focus on explaining why these functions are important to your model and how to use them.

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 Here you can find the TensorFlow documentation with a series of functions for the [loss function](https://www.tensorflow.org/api_docs/python/tf/keras/losses) and [optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers). We will cover the most common.

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Regression

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Assume you created a neural network model using a `Sequential()` model; therefore, you can stack layers.
* The last layer, or the output layer, should be designed to reflect what you want to predict.
* If your ML task is Regression, your last layer will have one neuron. In Regression you are predicting a number, and a single neuron can output a continuous range of numbers related to your numerical target variable.




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Once you set your output layer, you will compile the model and set the optimizer and the loss function.
  * Potential loss functions to consider: 
    * `mse`: The mean squared error loss is the standard loss used for regression tasks. As we studied previously, it is calculated as the average of the squared differences between the predicted and actual values
    * `mean_absolute_error`: The mean absolute error loss can be considered when the distribution of the target variable is mostly Gaussian but may contain outliers, since it penalises large errors less heavily than `mse`. You may consider checking other loss functions in the TensorFlow documentation
  * Potential optimizers are:
      * `SGD` (Stochastic Gradient Descent): it iteratively adjusts the network weights in the direction that reduces the loss function, estimating that direction from small batches of the training data.
      * `Adam` (Adaptive Moment Estimation): an extension of stochastic gradient descent that adapts the step size for each weight while iteratively updating the network weights based on the training data. In practice, Adam has proved effective and became very popular since it often achieves good results in short training times compared to other options.
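
To see why outliers matter when choosing between the two losses, here is a minimal sketch in plain Python. The target values and predictions are hypothetical, and the helpers `mse` and `mae` are written out by hand for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: the last actual value is an outlier
y_true = [10.0, 12.0, 11.0, 50.0]
y_pred = [11.0, 11.0, 12.0, 12.0]

print("mse:", mse(y_true, y_pred))
print("mae:", mae(y_true, y_pred))
```

Three of the four errors are equal to 1, yet the single outlier dominates the mean squared error far more than the mean absolute error.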

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> We will cover in the upcoming lessons how to programmatically define a neural network and fit it afterwards.
* For the moment, let's instantiate a model using `Sequential()`. Next, as we mentioned earlier, the idea is to add layers to the model
* For the next couple of examples, we will instantiate the model, define the output layer, and compile the model just to understand how to set the loss function and optimizer in code.
* If you run the cell below, the code will not break. However, the model we create here is not functional yet. We are just showing how to compile the model for a Regression task.

import os
# set the log level before importing tensorflow, otherwise it has no effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# instantiate an empty sequential model
model = Sequential()

########### input layer and hidden layers
# you would add here a set of layers (Dense, Convolution, Dropout etc)
###########

########### output layer
model.add(Dense(1))  # here you added the output layer to your network, it is a dense layer with 1 neuron
###########

# compile the model: that means, set your optimizer and loss function
model.compile(optimizer='adam', loss='mse')


---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Binary Classification

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Our output layer should reflect a prediction of 2 classes.
* In binary classification, there are two ways to define the output layer: using **sigmoid or softmax** as the activation function. In practice, sigmoid is the more common choice; however, either works.
* In this case, you can set 
  * **One neuron with sigmoid as the activation function** where the prediction is a probability between 0 and 1. You will define a threshold (default would be 0.5); if the probability is lower than the threshold, the prediction is class 0; otherwise, the prediction is class 1
  * Or **two neurons with softmax as activation function**, where the prediction is a probability for each class and these two probabilities sum to 100%
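
The two options can be compared with a small sketch in plain Python. It assumes a hypothetical raw score `z` for class 1 and, in the two-neuron case, a raw score of 0 for class 0. The helpers `sigmoid` and `softmax` are written by hand for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


z = 0.8  # hypothetical raw score for class 1

# Option 1: one neuron with sigmoid, then apply a threshold
threshold = 0.5
p_class1 = sigmoid(z)
label_sigmoid = 0 if p_class1 < threshold else 1

# Option 2: two neurons with softmax, then pick the most probable class
probs = softmax([0.0, z])
label_softmax = probs.index(max(probs))

print(p_class1, probs, label_sigmoid, label_softmax)
```

Under these assumptions both options give the same probability for class 1 and therefore the same predicted label.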




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Once you set your output layer, you will compile the model and set the optimizer and the loss function.
  * Potential loss functions are: 
    * `binary_crossentropy`: A cross-entropy loss is the standard loss function for binary classification problems. The idea is to use it when the target values are 0 and 1. A cross-entropy loss increases as the predicted probability diverges from the actual class. It calculates a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1.
    * `hinge`: The Hinge loss is an alternative to the previous loss function. The idea is to use it when the target values are encoded as -1 and 1.


  * Potential optimizers are:
      * `SGD` (Stochastic Gradient Descent): it iteratively adjusts the network weights in the direction that reduces the loss function, estimating that direction from small batches of the training data.
      * `Adam` (Adaptive Moment Estimation): an extension of stochastic gradient descent that adapts the step size for each weight while iteratively updating the network weights based on the training data. In practice, Adam has proved effective and became very popular since it often achieves good results in short training times compared to other options.
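
A minimal sketch of binary cross-entropy for a single prediction, in plain Python. The helper name and the clipping value `eps` are assumptions for this illustration; the real TensorFlow implementation works on batches of data.

```python
import math


def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Cross-entropy for one sample; p_pred is the predicted probability of class 1."""
    # Clip the probability so that log(0) never happens
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# The actual class is 1; the predictions go from confident-and-right to confident-and-wrong
for p in [0.9, 0.5, 0.1]:
    print(p, round(binary_crossentropy(1, p), 4))
```

The further the predicted probability moves away from the actual class, the larger the loss becomes.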

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 Let's see how you could set the last layer and compile the model for a binary classification task

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

########### Option 1: one neuron with sigmoid
# instantiate an empty sequential model
model = Sequential() 

########### input layer and hidden layers
# you would add here a set of layers (Dense, Convolution, Dropout etc)
###########

########### output layer and compile the model
# the prediction is a probability between 0 and 1. You will define a threshold
# (default would be 0.5); if the probability is lower than the threshold,
# the prediction is class 0, otherwise the prediction is class 1
model.add(Dense(units=1, activation='sigmoid')) # here you added the output layer to your network
model.compile(optimizer='adam', loss='binary_crossentropy')

########### Option 2: two neurons with softmax
# start from a new model, otherwise this layer would be stacked on top of the sigmoid layer
model = Sequential()
# (input layer and hidden layers would be added here, as above)
# the prediction is a probability for each class, and these 2 probabilities sum to 100%
# categorical_crossentropy expects a one hot encoded target
model.add(Dense(units=2, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
###########



---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Multiclass Classification

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Our output layer should reflect a prediction for a multiclass classification task, which means our target variable has three or more classes.
* In this case, you can set the output layer with **X neurons (where X is the number of classes your target variable has) with softmax as the activation function**, where the prediction is a probability for each class, and they will sum to 100%

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Once you set your output layer, you will compile the model and set the optimizer and the loss function.

 * Potential loss functions are: 
    * `categorical_crossentropy`: A cross-entropy loss is the standard loss function for multiclass classification problems. The classes are labelled as integers from 0 to n-1, where n is the number of classes in your target variable, and the target variable must then be one hot encoded before using this loss function. We will cover it in a future notebook.
    * `sparse_categorical_crossentropy`: it computes the same cross-entropy loss but takes the target as integer class labels (0 to n-1), so you don't need to one hot encode your target before training. It is convenient when your target variable has many classes (like hundreds), since one hot encoding would create a very wide target matrix.

  * Potential optimizers are:
      * `SGD` (Stochastic Gradient Descent): it iteratively adjusts the network weights in the direction that reduces the loss function, estimating that direction from small batches of the training data.
      * `Adam` (Adaptive Moment Estimation): an extension of stochastic gradient descent that adapts the step size for each weight while iteratively updating the network weights based on the training data. In practice, Adam has proved effective and became very popular since it often achieves good results in short training times compared to other options.
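
The difference between the two losses lies in how the target is encoded, which a short sketch in plain Python can show. The helper functions and the predicted probabilities are hypothetical and written by hand; they handle a single sample rather than a batch.

```python
import math


def one_hot(label, n_classes):
    """Encode an integer label as a list with a 1.0 at the label position."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def categorical_crossentropy(y_onehot, probs):
    """Cross-entropy when the target is one hot encoded."""
    return -sum(t * math.log(p) for t, p in zip(y_onehot, probs))


def sparse_categorical_crossentropy(label, probs):
    """Cross-entropy when the target is an integer class label."""
    return -math.log(probs[label])


probs = [0.7, 0.2, 0.1]  # hypothetical softmax output for 3 classes
label = 1                # the actual class, as an integer

loss_onehot = categorical_crossentropy(one_hot(label, 3), probs)
loss_sparse = sparse_categorical_crossentropy(label, probs)
print(loss_onehot, loss_sparse)
```

Both functions return the same loss value for the same sample; only the format of the target differs.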

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 Let's see how you could set the last layer and compile the model for a multiclass classification task

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# instantiate an empty sequential model
model = Sequential() 

########### input layer and hidden layers
# you would add here a set of layers (Dense, Convolution, Dropout etc)
###########

########### output layer
n_classes = 3 # in this case your target variable has 3 classes
model.add(Dense(n_classes, activation='softmax'))
###########

# compile the model: that means set your optimizer and loss function
model.compile(optimizer='adam', loss='categorical_crossentropy')

---




# TensorFlow - Unit 04 - Backpropagation

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand Backpropagation and why it is important when studying neural networks.



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Backpropagation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We understand that a network is arranged in layers where each layer is made up of a set of neurons.
* Each neuron has a set of weights and a bias and receives a set of inputs processed by an activation function, and its output is fed forward to the next layer until we reach the end of the network. The "final output" is the model prediction.
* At the same time, a loss function and an optimiser evaluate the prediction performance and update the network weights to values that will reduce the loss function (or better map the relationship between your features and target).
  * But how is it possible to update the weights?
  * Note the number of connections in the simple network below, where each connection is a weight. How do we know which "connection" to update, in which direction and by how much?
  * Imagine now that your dataset has ten features and your target variable has four classes. You design a network with an input layer, four hidden layers of eight neurons each, and an output layer. Even this small network has a few hundred weights, and wider or deeper networks can quickly reach hundreds of thousands.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%208-%20Challenge.png"> **Backpropagation is responsible for updating the network's weights**
* As a reminder: the real "learning" happens when the weights and biases reach values that enable the network to map the features to the target correctly.




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> There is extensive mathematical content (calculus and linear algebra) in backpropagation which we will not cover in this course. Instead, we will cover the central idea and why it is important when studying neural networks.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In essence, **backpropagation takes a neural network's output error and propagates this error backwards through the network, determining which paths influence the output most**. 

* It identifies which paths are more influential in the final output and increases or decreases the corresponding "connections" (weights) to get closer to the desired prediction
* The main idea here is that we can go back through the network and adjust our weights to minimise the error measured at the output layer.
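
The central idea can be sketched on a tiny hypothetical network: one input, one hidden neuron with ReLU, one output neuron and a squared-error loss. All values are made up, and this is only an illustration of the error flowing backwards through the chain rule, not a description of how TensorFlow implements it internally.

```python
def forward(x, w1, b1, w2, b2):
    """Forward pass: input -> hidden neuron (ReLU) -> output neuron."""
    z1 = w1 * x + b1
    h = max(0.0, z1)
    y_hat = w2 * h + b2
    return z1, h, y_hat


def loss_fn(y_hat, y):
    return (y_hat - y) ** 2


def backward(x, y, w1, b1, w2, b2):
    """Propagate the output error backwards to get one gradient per parameter."""
    z1, h, y_hat = forward(x, w1, b1, w2, b2)
    d_y_hat = 2 * (y_hat - y)                    # error at the output
    grads = {"w2": d_y_hat * h, "b2": d_y_hat}   # output-layer parameters
    d_h = d_y_hat * w2                           # error passed back to the hidden neuron
    d_z1 = d_h * (1.0 if z1 > 0 else 0.0)        # through the ReLU
    grads["w1"] = d_z1 * x                       # hidden-layer parameters
    grads["b1"] = d_z1
    return grads


# Hypothetical sample and parameters
x, y = 1.5, 2.0
w1, b1, w2, b2 = 0.5, 0.1, -0.4, 0.2

grads = backward(x, y, w1, b1, w2, b2)
print(grads)
```

Each gradient tells the optimizer in which direction, and how strongly, to adjust the corresponding weight or bias.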




---

# TensorFlow - Unit 05 - Fitting a TensorFlow Model

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Study considerations for your datasets (Train, Validation, Test Sets) when training TensorFlow models
* Understand how to model your network architecture
* Make sense of the data flow inside your deep neural network
* Understand the training hyperparameters



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import os
# set the log level before importing tensorflow, otherwise it has no effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Unit 05 - Fitting a TensorFlow Model

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Considerations when fitting your model

When we train a model in TensorFlow, we should consider the following aspects
* Datasets: Train, Validation, Test Sets
* Model Architecture
* Data flow in a network
* Training Hyperparameters
* Evaluate in the Loss Plot if the model learned normally, overfitted or underfitted

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Train, Validation and Test Sets

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In conventional ML, we split the data into a train and test set. However, when we used GridSearchCV for hyperparameter optimization, it automatically set aside a share of the train set to use as the validation set.
* Here, you shall have three distinct sets before the training process. You will pass the train and validation sets to the training process. Their usage is the same as in conventional ML: one is used to train the algorithm, and the other to validate performance while the model is being trained. The test set is used afterwards to evaluate the model's ability to generalise to unseen data
* In image datasets, you may have a situation where there is a folder called Train and, when training, you split part of it off as a validation set. However, for simplification, in this course, for both tabular data and image data, we will have three distinct DataFrames (if the data is tabular) or three distinct folders (if the data is made of images).
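
As a rough sketch of producing three distinct sets, the snippet below shuffles a list of rows and slices it. The helper name `train_val_test_split` and the 70/15/15 proportions are assumptions for illustration; it uses only the standard library, whereas with DataFrames you would typically rely on a library function instead.

```python
import random


def train_val_test_split(rows, val_share=0.15, test_share=0.15, seed=42):
    """Shuffle the rows and split them into train, validation and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is reproducible
    n_test = round(len(rows) * test_share)
    n_val = round(len(rows) * val_share)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


# Hypothetical dataset: 100 row identifiers
train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

No row appears in more than one set, which is what keeps the test set truly unseen during training.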

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Model Architecture

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Unfortunately, you can't know in advance the optimal model architecture (number of layers, number of neurons per layer, best optimizer or loss function, dropout level etc.) that fits your data.
* It will be a trial-and-error approach. 
* However, there are some best practices you can consider  

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> The guidelines we will study below are applied to the following context
* In this course, we consider layers that will help us solve ML tasks (Regression and Classification) on tabular and image datasets. We present layers, such as the fully connected layer - `Dense()` - and convolution layer - `Conv2D()` - that help us to solve these tasks

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> A good practice is starting with a simple model first and then improving. A baseline architecture will help to tell if the model is learning properly, underfitting or overfitting.
  * If the model is underfitting, you may add more network complexity, for example by adding layers or adding more neurons to the existing layers.
  * If the model is overfitting, you may reduce the network complexity or add regularisation, for example with Dropout and EarlyStopping.

* You may also try different activation functions in your network. A "first choice" option is typically ReLU, but you may consider another for your particular problem. The same applies to the optimizer and loss function when compiling the model. Popular choices for optimizers are Adam (typically the first choice), SGD, and RMSProp.
  
<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> You may notice it will be more of a trial-and-error approach to find the best hyperparameter combination for your data.

---

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  To define a baseline model architecture for **tabular data**, you may consider 
* The input layer would have as many nodes as there are features.
* The output layer will depend on the ML task. We covered this in Unit 03 - Activation Function, Loss function, and Optimizer
* For the hidden layers, it is more like a set of refined guesses in terms of the quantity of layers and quantity of neurons per layer
  * Quantity of layers: you may start with two hidden layers. Then evaluate your dataset complexity (you probably made an EDA already and know how complex your data is up to this point, like whether it shows significant correlations between the features and the target etc.). If the data is complex, it may be worth adding a few more layers so that, in theory, the network can learn the patterns. 
  * Neurons per layer: You may consider an "expansive-shrink" approach or a "shrink" approach
    * "expansive-shrink": the number of neurons increases and then decreases, for example, an input layer with 10 neurons, then hidden layers with 20, 8 and 2 neurons.
    * "shrink": the number of neurons only decreases, for example, an input layer with 10 neurons, then hidden layers with 5, 3 and 2 neurons.
    * The potential number of neurons you consider may follow a geometric progression with ratio 2, such as 2, 4, 8, 16, 32. Again, it is just a first guess, and you will refine it as you check the model performance.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  To define a baseline model architecture for an **image dataset**, you will typically combine two layers: Conv2D, which selects the relevant features from the image (like shape, shadow, colour etc.), and MaxPool2D, which summarises those features while downsampling (it reduces the image size by keeping only the dominant pixels within the pool size).
* Therefore, you may decide on how many pairs of these layers to use and reasonable values for filters and kernel size for Conv2D and pool size for MaxPool2D
* You may consider values for filters in Conv2D that are powers of two, like 16, 32, 64 etc.
* Kernel sizes are typically 2x2, 3x3 or 4x4
* You may start your CNN with 1 or 2 pairs of Conv2D/MaxPool2D
* When arranging multiple pairs, you may increase the number of filters as you add pairs, for example 16, then 32, then 64.
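
To get a feel for how pairs of these layers shrink an image, here is a sketch in plain Python. It assumes a hypothetical 28x28 input, 3x3 kernels and 2x2 pools, and it assumes the Keras default settings for both layers (no padding, stride 1 for the convolution, pool stride equal to the pool size). The helper names are made up for this illustration.

```python
def conv2d_output_size(size, kernel_size):
    """Output height/width of a convolution with no padding and stride 1."""
    return size - kernel_size + 1


def maxpool2d_output_size(size, pool_size):
    """Output height/width of max pooling with stride equal to the pool size."""
    return size // pool_size


size = 28    # hypothetical square input image of 28x28 pixels
shapes = []  # (height, width, filters) after each layer

for filters in [16, 32]:  # two Conv2D/MaxPool2D pairs with increasing filters
    size = conv2d_output_size(size, kernel_size=3)
    shapes.append((size, size, filters))
    size = maxpool2d_output_size(size, pool_size=2)
    shapes.append((size, size, filters))

print(shapes)
```

Each pair reduces the height and width while the number of filters grows, so the network trades spatial detail for a richer set of features.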


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">  Once again: these are just rough ideas and references on how to start approaching your design of the network.

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Data flow in a network

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Once you have defined your model architecture, **you can understand how your data flows across the network layers** 
* Let's check a simple hypothetical neural network where the ML task is Regression; we have four features. 
* You also want two hidden layers so that you can get a deep learning network (jargon alert!). 
* You also decided that the first hidden layer has five neurons and the second has three neurons. 
* You will use `relu` as an activation function, the loss function is mse, and the optimizer is adam.
* The network schema is represented below

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Steps.png
"> The code to translate that network is pretty simple. 
* You instantiate a sequential model (typically, you call the object `model`) and then, with the method `.add()`, you add layers to your network.
* We will add `Dense()` layers, passing the number of neurons (units) and the activation function where needed. 
* The last action is to compile the model. In practical terms, that means defining the loss function and optimizer. The loss function depends on the ML task; in this case, we set `mse` (mean squared error) for the Regression task. The optimizer we will use in the course is `adam`.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Note we added an argument called `name` so that you can name your layer. 
* It is not mandatory to name/label your layer; it is up to you to do so or not in the course or over your career. We are doing this now just to explain the network summary and network plot.

model = Sequential() # instantiate a sequential model
model.add(Dense(units=4, activation='relu', input_shape=(4,), name='InputLayer'))  # input layer

model.add(Dense(units=5,activation='relu', name='1stHidden'))  # 1st hidden layer
model.add(Dense(units=3,activation='relu', name='2ndHidden'))  # 2nd hidden layer

model.add(Dense(units=1, name='OutputLayer')) # output layer
model.compile(loss='mse', optimizer='adam') # inform loss function and optimizer

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You can check the model summary with `.summary()`
* Note that the network has a total of 67 parameters. These parameters will be learned (calculated) over the training process.
* For a given `Dense()` layer, we calculate the number of parameters using the simple formula
  * **(Current_Layer_Neurons x Previous_Layer_Neurons) + (Current_Layer_Neurons)**

* Let's apply this to our network layers
  * Input layer: ( 4 x 4 ) + 4  = 20    *(in this case, the previous layer neurons are the input shape, which is 4)*
  * 1st hidden layer: (5 x 4) + 5 = 25
  * 2nd hidden layer: (3 x 5) + 3 = 18
  * Output layer: (1 x 3) + 1 = 4
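
As a quick check of the arithmetic above, the following sketch applies the formula to each layer in plain Python. The helper name `dense_params` is an assumption made for illustration; it is not part of TensorFlow.

```python
def dense_params(current_neurons, previous_neurons):
    """Weights (current x previous) plus one bias per current neuron."""
    return current_neurons * previous_neurons + current_neurons

# (layer name, neurons in this layer, neurons or features feeding into it)
layers = [
    ("InputLayer", 4, 4),   # the "previous neurons" are the 4 input features
    ("1stHidden", 5, 4),
    ("2ndHidden", 3, 5),
    ("OutputLayer", 1, 3),
]

params_per_layer = {name: dense_params(cur, prev) for name, cur, prev in layers}
total_params = sum(params_per_layer.values())

for name, n_params in params_per_layer.items():
    print(f"{name}: {n_params}")
print("Total:", total_params)
```

The total should match the number of parameters reported by `.summary()` for this network.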

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Note also the output shape of the last layer: it is 1. That makes sense for our ML task, which is Regression (predicting a continuous number).

model.summary()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Alternatively, you can use the `plot_model()` function from the `tensorflow.keras.utils` module to display the model data flow more visually.
* Note again the input and output shape on each layer. That is how the data flows across your network.

from tensorflow.keras.utils import plot_model
plot_model(model, show_shapes=True)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Warning, depending on how many layers your model has, the previous summary and graph may look long and odd. <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png">

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> If you have a CNN (Convolutional Neural Network), you will use `Conv2D()` layers, which also have parameters to learn. 
* Let's assume a CNN where the input is 32x32x3 (imagine an RGB image 32x32), with two convolutions (the first convolution has 16 filters and kernel 4x4; the second convolution has 32 filters and kernel 4x4).
* After each convolution, we have a MaxPooling (2x2)
* Then we Flatten after the second max pooling and chain a Dense layer (128 neurons). The output layer has one neuron with a sigmoid activation (indicating it is a binary classification). 
* The model is compiled with `binary_crossentropy` as loss and `adam` as optimizer.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, MaxPool2D, Flatten, Dropout

model = Sequential()
#### 1st convolution and max pooling
model.add(Conv2D(filters=16, kernel_size=(4,4),input_shape=(32,32,3), activation='relu',))
model.add(MaxPool2D(pool_size=(2, 2)))
####

#### 2nd convolution and max pooling
model.add(Conv2D(filters=32, kernel_size=(4,4), activation='relu',))
model.add(MaxPool2D(pool_size=(2, 2)))
####

model.add(Flatten())
model.add(Dense(128, activation='relu'))

### output layer: it is a binary classification model 
# why? activation function is sigmoid and loss is binary_crossentropy
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We can calculate the number of parameters in a convolution layer by **multiplying the width and height of the filter (kernel) of the current layer by the number of filters of the previous layer** (if it is the first convolution layer, you use the third dimension of the input shape instead, which is the number of channels). **We then add 1 to this product (the bias) and multiply the result by the number of filters of the current layer** to get the number of parameters. 

* Let's calculate the convolutions
  * First convolution: (4 x 4 x 3 + 1 ) x 16 = 784
  * Second convolution: (4 x 4 x 16 + 1) x 32 = 8224

*  Max pooling has no learnable parameters since all it does is take the maximum value within each pooling window

* Flatten converts the data to a 1D array for the next layer; therefore, it has no learnable parameters either


* Let's calculate for the Dense layers (like we learned in the previous section)
  * (128 x 800) + 128 = 102528
  * (1 x 128) + 1 = 129


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In total, we have more than 111 thousand parameters that the network will learn. That is much more than the previous network. Reflect now on the amount of data (in this case, images showing their features and patterns) you need to provide to such a network so it can learn the relationships. 
* That is why you need a relevant amount of data when predicting images using CNNs.
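
The calculation above can be reproduced with a short sketch that also tracks the image size after each layer. It assumes the convolutions use a stride of 1 with no padding and that the pooling halves the size, rounding down. The helper names are assumptions made for illustration.

```python
def conv_output_size(size, kernel):
    """Spatial size after a convolution with stride 1 and no padding."""
    return size - kernel + 1

def conv_params(kernel_h, kernel_w, in_channels, filters):
    """(kernel weights per filter + 1 bias) x number of filters."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(current_neurons, previous_neurons):
    """Weights (current x previous) plus one bias per current neuron."""
    return current_neurons * previous_neurons + current_neurons

size, channels = 32, 3  # 32x32 RGB input

# 1st convolution (16 filters, kernel 4x4), then 2x2 max pooling: 32 -> 29 -> 14
conv1 = conv_params(4, 4, channels, 16)
size, channels = conv_output_size(size, 4) // 2, 16

# 2nd convolution (32 filters, kernel 4x4), then 2x2 max pooling: 14 -> 11 -> 5
conv2 = conv_params(4, 4, channels, 32)
size, channels = conv_output_size(size, 4) // 2, 32

flattened = size * size * channels       # values handed to the Dense layer
dense1 = dense_params(128, flattened)
output = dense_params(1, 128)
total = conv1 + conv2 + dense1 + output

print("Conv 1:", conv1, "| Conv 2:", conv2)
print("Flattened size:", flattened)
print("Dense:", dense1, "| Output:", output)
print("Total:", total)
```

The flattened size explains where the 800 in the Dense layer calculation comes from: 5 x 5 x 32.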

model.summary()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Alternatively, you can use the `plot_model()` function from the `tensorflow.keras.utils` module to display the model data flow more visually.
* Note the image shape again after each layer. **This is the data flow from your image (in this case, 32x32x3) to a prediction (a probability between 0 and 1)!**

from tensorflow.keras.utils import plot_model
plot_model(model, show_shapes=True)

---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Training Hyperparameters

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> When training a model in TensorFlow, it is advisable to scale your data (whether your data is tabular or not), since experience shows that the model learns the relationships faster and better when the features are on a similar scale, typically in a range of -1 to +1. As a result, you should consider scaling your data before fitting a TensorFlow model.

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> Let's start with the definition of a sample: it is a piece of data, like a row in tabular data or an image 

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> There are a few hyperparameters we may pass when training neural networks.
* **Batch size**: the number of samples in each set (batch) that you feed to the model when training.
  * The samples in a batch are processed independently, in parallel. During training, a batch results in only one update to the model. A batch generally approximates the distribution of the input data better than a single input. The larger the batch, the better the approximation; however, it is also true that the batch will take longer to process and will still result in only one update.
  * The batch size influences the training time. There is no rule of thumb as to which batch size would work best. You may try a few options and check performance. However, be careful with very large batch sizes, since they are often reported to make the model generalise less well. Common batch sizes are 32, 64, 128, 256, 512, 1024, 2048. The default is 32.


* **Epoch**: It is one iteration over the entire training data. 
  * In the training phase, the model will "see" (or iterate over) the data a certain number of times. This number is known as the number of epochs. The more epochs you set, the longer the training process. You may try an arbitrary number (the default is 1) and check the performance to decide if you need more. We will also see a strategy to limit the number of epochs.
  * This strategy uses a callback when training the model. A callback is executed at a specific moment when fitting a TensorFlow model, typically at the end of every epoch.
  * You will then use a callback called early stopping. According to TensorFlow [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping), early stopping allows stopping training when a monitored metric has stopped improving. This is useful to avoid overfitting the model to the data.


* **Steps per Epoch**: According to TensorFlow [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential), it is the total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch.
  * Therefore, you can define how many batches of samples to use in one epoch; this is useful when you perform data augmentation when training, like loading data (e.g., images) on the fly into memory to fit your model. We will cover that in the Walkthrough Project 01
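
To make the relationship between samples, batch size and epochs concrete, here is a minimal sketch in plain Python. It assumes that every sample is used once per epoch and that the last, smaller batch is still processed. The sample count and the function name `updates_per_epoch` are hypothetical.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of batches (and therefore model updates) in one epoch."""
    return math.ceil(n_samples / batch_size)

n_samples = 1000  # hypothetical dataset size
epochs = 10       # hypothetical number of epochs

for batch_size in (32, 128, 512):
    steps = updates_per_epoch(n_samples, batch_size)
    print(f"batch_size={batch_size}: {steps} updates per epoch, "
          f"{steps * epochs} updates over {epochs} epochs")
```

A smaller batch size means more updates per epoch, which is one reason why the batch size influences the training time.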


---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Has the model learned properly?

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Result.png
">
 When fitting an ML model, you will be interested in analyzing the performance of the model over a validation set of data that the ML model has not seen at the time of training. 
* If we get the desired generalized performance,  we take these models further for the deployment; otherwise, we go for the optimization process.


<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> You will probably remember from other modules that a model can underfit, overfit, or learn the patterns in the data correctly.
* When the model **underfits**, it means the performance on both the train and test sets is low; this is because the model didn't learn the patterns correctly.
* When the model **overfits**, it means the performance on the train set is noticeably better than on other data (test set or actual data). That means the model can't generalize on unseen data
* When the model learns the patterns, the performance on the train and test sets is satisfactory and similar. That means the model can generalize on unseen data.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  In TensorFlow, you can assess if the model learned right after it finished training. 
* The model will train over a given number of epochs, and the expectation is that the error will reduce over the epochs. You can access a **learning curve that shows the loss achieved by the model over the training and validation data per epoch**. 
* In other words, the plot shows how the model performed when training and validating the data.

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
"> There are three possibilities that you can identify in the learning curve: 
  * 1- the model learned normally,
  * 2 - the model overfitted or 
  * 3 - the model underfitted.


---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Normal Learning or Good fit

 <img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> If the model learns normally, the loss plots for training and validation data follow a similar path and are close to each other, like in the plot below. 
* Note the model trained for 20 epochs, and the loss gradually decreased similarly  in the train (blue line) and validation data (orange line)


---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Overfit

 <img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> If the model were overfitting, the learning curve would show the training loss decreasing while the validation loss stops following it: it flattens or shoots up. 
As a result, we would see a gap between the training and validation lines. 
* This is a common behaviour when training deep learning models, and we can reduce it by tuning our model hyperparameters.


---

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Underfit

 <img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Tips.png
"> On the other hand, an underfit model could be identified from the learning curve of the training loss. 
* It would either show a flat line or noisy values of relatively high loss, indicating that the model could not learn the training dataset at all. In addition, you will notice the training and validation lines don't follow the same trajectory. In this example, the model trained for 100 epochs, and the two losses decreased differently.


# TensorFlow - Unit 06 - Hyperparameter Optimization

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Understand the considerations when optimizing neural network's hyperparameters



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Unit 06 - Hyperparameter Optimization

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In the Scikit-learn lesson, we covered in detail the approach and techniques for Hyperparameter Optimization using Conventional ML by fitting multiple algorithms with multiple hyperparameter arrangements. 
* That allowed us to find the most suitable algorithm and its hyperparameter configuration to meet the performance criteria defined in our business problem.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> What should I consider when optimizing my neural network hyperparameters?
* As you have seen, neural networks involve parameters, which are learned during training, and hyperparameters, which you need to set yourself when designing and training your network.
* The truth is that you don't know in advance which arrangement will be the best for your dataset, considering the technical performance you want to meet and the time it takes to train your model. 
* This process of finding the best hyperparameters will typically involve a lot of trial and error. We presented in the previous unit a few guidelines and rules of thumb for designing a neural network that can also guide your criteria when optimizing your neural network's hyperparameters. 
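
To get a feeling for how quickly the trial and error grows, the sketch below lists every combination from a small, hypothetical search space in plain Python. The hyperparameter names and values are assumptions chosen for illustration, not recommendations.

```python
from itertools import product

# Hypothetical search space: each key is a hyperparameter, each list its candidates
search_space = {
    "hidden_units": [(8, 4), (16, 8)],
    "dropout_rate": [0.0, 0.25, 0.5],
    "batch_size": [32, 64],
}

names = list(search_space)
combinations = [dict(zip(names, values))
                for values in product(*search_space.values())]

print("Number of arrangements to train:", len(combinations))
print("First arrangement:", combinations[0])
```

Each arrangement would require training and evaluating a separate model, so even a small search space can become expensive.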



---

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
"> Examples of hyperparameters we covered in this lesson:
* Loss function, optimizer, metrics, activation function, model architecture (number of hidden layers and the respective amount of neurons in each layer), dropout rate, CNN filter etc.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> We are not covering Hyperparameter Optimization in TensorFlow in this course due to its scope. However, we want to give you a brief overview of the topic to encourage you to explore it yourself over your data practitioner career. 
* We suggest this [documentation](https://www.tensorflow.org/tutorials/keras/keras_tuner), which has an introduction to hyperparameter optimization in TensorFlow. We encourage you to read it when you find some free time.

---

# TensorFlow - Unit 07 - Regression

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Fit a deep learning neural network for a Regression task
* Save and load TensorFlow models (.h5 extension)



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Unit 07 - Regression

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Workflow

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png">
 We will follow the typical process for supervised learning, which we are familiar with, but now with a few tweaks:

* Split the dataset into train, validation and test set
* Create a pipeline to handle data cleaning, feature engineering and feature scaling (as we covered, it is highly recommended the data be scaled, so we wrap up in one pipeline)
* Create the neural network
* Fit the pipeline to the train set and apply the transformations to the other sets
* Fit the model to the train and validation set
* Evaluate the model
* Predict on new data

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load and split the data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's first load the data. We are using the Boston dataset from sklearn.
* It has house price records and characteristics, like the average number of rooms per dwelling and the per capita crime rate in Boston. The target variable is the house price.

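# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run
from sklearn.datasets import load_boston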
data = load_boston()
df = pd.DataFrame(data.data,columns=data.feature_names)
df['price'] = pd.Series(data.target)

print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As part of our workflow, we split the data, but now we will split it into the train, validation, and test sets. 
* First, we split into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['price'],axis=1),
                                    df['price'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Then, from the train set, we split off a validation set. We set the validation set as 20% of the train set
* Have a look at the print statement, which shows the amount of data we have in each set (train, validation and test)
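
Before running the split, you can estimate the size of each set by hand. The sketch below assumes the dataset has 506 rows and that the held-out set size is rounded up; both are assumptions, so compare the result with the print statement output. The function name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (rows kept, rows held out), rounding the held-out part up."""
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

n_rows = 506  # assumed number of rows in the dataset

# First split: train + test
n_train_full, n_test = split_sizes(n_rows, 0.2)

# Second split: validation is taken from the train set
n_train, n_val = split_sizes(n_train_full, 0.2)

for name, n in [("Train", n_train), ("Validation", n_val), ("Test", n_test)]:
    print(f"* {name} set: {n} rows ({n / n_rows:.1%} of the data)")
```

Note that taking 20% of the train set means the validation set is roughly 16% of the full dataset, not 20%.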

X_train, X_val,y_train, y_val = train_test_split(
                                    X_train,
                                    y_train,
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape)
print("* Validation set:",  X_val.shape, y_val.shape)
print("* Test set:",   X_test.shape, y_test.shape)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pipeline for data processing

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We first create a pipeline for preprocessing the data. In theory, it could handle the processes of data cleaning, feature engineering and feature scaling
* In this case, it's only feature scaling.
* We could have also added a step for removing correlated features, but let's keep it simple.

from sklearn.pipeline import Pipeline
### Feature Scaling
from sklearn.preprocessing import StandardScaler

def pipeline_pre_processing():
  pipeline_base = Pipeline([
      
      ( "feat_scaling",StandardScaler() )

    ])

  return pipeline_base


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we fit the pipeline to the train set and apply the transformations to the validation and test sets
* So the pipeline can learn the transformations (in this case, it is only feature scaling) from the train set and apply the transformation to the other sets. 
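
The idea of learning the scaling from the train set only can be illustrated with a tiny sketch in plain Python. The values are made up, and the code standardises a single feature by hand (subtract the mean, divide by the population standard deviation), which is the kind of calculation a standard scaler performs per feature.

```python
from statistics import fmean, pstdev

train_values = [1.0, 2.0, 3.0]  # hypothetical feature values in the train set
val_values = [4.0]              # hypothetical feature value in the validation set

# "Fit": learn the mean and standard deviation from the train set only
mean = fmean(train_values)
std = pstdev(train_values)

def scale(values):
    """'Transform': apply the statistics learned from the train set."""
    return [(v - mean) / std for v in values]

scaled_train = scale(train_values)
scaled_val = scale(val_values)

print("Scaled train:", [round(v, 3) for v in scaled_train])
print("Scaled validation:", [round(v, 3) for v in scaled_val])
```

The validation value is scaled with the train statistics, so no information from the validation or test sets leaks into the preprocessing.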

pipeline = pipeline_pre_processing()
X_train = pipeline.fit_transform(X_train)
X_val= pipeline.transform(X_val)
X_test = pipeline.transform(X_test)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Create Deep Learning Network

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We will create a tensorflow model.
* We create a function that creates a sequential model, compiles the model and returns the model. The function requires the number of features in the data, which is used as the number of neurons in the first layer
* Let's define the network architecture (a deep learning neural network since it has two or more hidden layers - jargon alert! )
  * We noted the data has 13 features. First, we will create a simple network just for a learning experience. 
  * The network is built using Dense layers - fully connected layers
  * The input layer has the same number of neurons as the number of columns in the data. The activation function is `relu`. We pass the `input_shape` as a tuple: the first value is the number of columns in the data, and the second is left empty since each sample is one-dimensional (an image, for example, wouldn't be one-dimensional)
  * We are using two hidden layers, the first with 8 neurons and the next with 4 neurons. Both will use `relu` as an activation function.
  * After each hidden layer, we have a dropout layer with a 25% rate to reduce the chance of overfitting
  * The output layer should contain only 1 neuron since the ML task is Regression. 
  * We compile the model with `mse` as the loss/cost function and `adam` as the optimizer

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # set before importing TensorFlow so it takes effect
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def create_tf_model(n_features):

  model = Sequential()
  model.add(Dense(units=n_features, activation='relu', input_shape=(n_features,)))

  model.add(Dense(units=8,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=4,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=1))
  model.compile(loss='mse', optimizer='adam')
  
  return model


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's visualize the network structure
* Note the number of parameters the network has. That looks to be a reasonable amount compared to the number of rows the train set has
* An unreasonable amount would be something like 100 thousand parameters for a dataset with 1k rows. Or maybe your dataset is tiny but complex, and you need more parameters; still, the rule of thumb suggests starting simple and adding more complexity if the performance is not good.

model = create_tf_model(n_features=X_train.shape[1])
model.summary()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We can also use `plot_model()` from `tensorflow.keras.utils` for a more graphical approach
* Note the input and output shape each layer has. That is how your data is "travelling" from the input to the prediction.

from tensorflow.keras.utils import plot_model
plot_model(model, show_shapes=True)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit the model

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As we mentioned in a previous notebook, early stopping stops training when a monitored metric has stopped improving; this is useful to avoid overfitting the model to the data. The documentation for this callback is [here](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping)
* We will monitor the validation loss 
  * Remember that now we pass the train and validation data. After a given epoch finishes, the network calculates the error. The training process stops if the validation error doesn't improve for a given number of consecutive epochs. 
  * We set patience as 15, which is the number of epochs with no improvement; after that, training will be stopped. There is no fixed rule to set patience; if you feel that your model was still learning when training stopped, you may increase the value and train again.
  * We set the mode to min. According to TensorFlow documentation, in min mode, training will stop when the quantity monitored has stopped decreasing.
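
To build intuition for how patience works in min mode, here is a simplified sketch in plain Python. It is not the TensorFlow implementation and ignores the callback's other options; the function name and the validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch (1-based) at which training would stop, or None.

    Simplified logic: keep the best (lowest) validation loss seen so far and
    count the consecutive epochs without improvement.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None           # patience was never exhausted

# Hypothetical validation losses: the best value happens at epoch 3
val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print("Training would stop at epoch:", stop_epoch)
```

With a larger patience, the same losses would not trigger a stop, which is why a higher value gives the model more time to recover from a temporary plateau.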

from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=15)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will finally fit the model
* We create the model object and use `.fit()`, as usual
  * We pass the train set
  * The number of epochs is set to 100. In theory, you may set a high value since we will add early stopping, which stops the training process when there is no improvement in the validation loss. 
  * We pass the validation data in a tuple.
  * Verbose is set to 1 to see which epoch we are in, along with the training and validation loss.
  * Finally, we pass our callback, the `early_stop` object we created earlier. We pass it in a list since you may pass more than one callback. In this course, we will cover only early stopping

* For each epoch, note the training and validation loss. Are they increasing? Decreasing? Flat?
  * Ideally, they should decrease as the epochs progress, which is a practical sign that the network is learning


model = create_tf_model(n_features=X_train.shape[1])

model.fit(x=X_train, 
          y=y_train, 
          epochs=100,
          validation_data=(X_val, y_val),
          verbose=1,
          callbacks=[early_stop]
          )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Model evaluation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Now we will evaluate the model performance by analyzing the training and validation losses that happened during the training process. 
* In deep learning, we use the model history to assess if the model learned, using the train and validation sets. We also evaluate separately how the model generalises on unseen data (on the test set)
* The model training history information is stored in a `.history.history` attribute. Note it shows a loss (training loss) and val_loss (validation_loss)

losses = pd.DataFrame(model.history.history)
losses

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are plotting each loss in a line plot, where the y-axis has the loss value, the x-axis is the epoch number, and the lines are coloured by train or validation
* We use `.plot(style='.-')` for this task
* Note the loss plots for training and validation data follow a similar path and are close to each other. It looks like the network learned the patterns.

losses = pd.DataFrame(model.history.history)

sns.set_style("whitegrid")
losses[['loss','val_loss']].plot(style='.-')
plt.title("Loss")
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we will evaluate the model performance on the test set, using `.evaluate()` and passing the test set. Note the value is not much different from the losses in the train and validation set.
* Note the model learned the relationship between the features and the target, considering all features. Conventional ML often uses a feature selection step to remove features that don't contribute to the model's learning and that would otherwise increase the chance of overfitting the model.
* But in Deep Learning, the neural network largely handles this by itself. The connections related to less important features tend to become weaker during the training process; therefore, they typically don't harm the overall performance.

model.evaluate(X_test,y_test)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> When evaluating a deep learning model, you typically check the loss plot and evaluate on the test set; however, **as an additional step, you can also do** an evaluation similar to the one we did in conventional ML.
* In regression, you would analyse the performance metrics and the actual vs. predictions plot using the custom functions we have seen over the course.
* One difference is that we adapted the functions to also evaluate the validation set, but that is a minor change in the code; the overall logic is the same
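
As a reminder of what these performance metrics measure, the sketch below computes them by hand in plain Python for a few made-up values. The function name `regression_metrics` and the numbers are hypothetical; in the notebook, the scikit-learn functions do this work.

```python
def regression_metrics(y_true, y_pred):
    """Compute R2, MAE, MSE and RMSE from actual and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r ** 2 for r in residuals) / n
    rmse = mse ** 0.5

    mean_true = sum(y_true) / n
    ss_res = sum(r ** 2 for r in residuals)              # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "mse": mse, "rmse": rmse}

# Hypothetical actual and predicted values
metrics = regression_metrics(y_true=[3.0, 5.0, 7.0], y_pred=[2.0, 5.0, 9.0])
for name, value in metrics.items():
    print(f"{name}: {round(value, 3)}")
```

Since the model was compiled with `mse` as the loss, the Mean Squared Error on a given set is the same quantity that the loss reports for that set.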

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 
import numpy as np

def regression_performance(X_train, y_train,
                           X_val, y_val,
                           X_test, y_test,pipeline):

  print("Model Evaluation \n")
  print("* Train Set")
  regression_evaluation(X_train,y_train,pipeline)
  print("* Validation Set")
  regression_evaluation(X_val, y_val,pipeline)
  print("* Test Set")
  regression_evaluation(X_test,y_test,pipeline)



def regression_evaluation(X,y,pipeline):
  """
  # Gets features and target (either from train or test set) and pipeline
  - it predicts using the pipeline and the features
  - calculates performance metrics comparing the prediction to the target
  """
  prediction = pipeline.predict(X)
  # built-in round() works whether the metric returns a Python float or a NumPy float
  print('R2 Score:', round(r2_score(y, prediction), 3))
  print('Mean Absolute Error:', round(mean_absolute_error(y, prediction), 3))
  print('Mean Squared Error:', round(mean_squared_error(y, prediction), 3))
  print('Root Mean Squared Error:', round(np.sqrt(mean_squared_error(y, prediction)), 3))
  print("\n")

  

def regression_evaluation_plots(X_train, y_train,
                                X_val, y_val,
                                X_test, y_test,
                                pipeline, alpha_scatter=0.5):

  pred_train = pipeline.predict(X_train).reshape(-1) 
  # we reshape the prediction arrays to be in the format (n_rows,), so we can plot it after
  pred_val = pipeline.predict(X_val).reshape(-1)
  pred_test = pipeline.predict(X_test).reshape(-1)

  fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15,6))
  sns.scatterplot(x=y_train , y=pred_train, alpha=alpha_scatter, ax=axes[0])
  sns.lineplot(x=y_train , y=y_train, color='red', ax=axes[0])
  axes[0].set_xlabel("Actual")
  axes[0].set_ylabel("Predictions")
  axes[0].set_title("Train Set")

  sns.scatterplot(x=y_val , y=pred_val, alpha=alpha_scatter, ax=axes[1])
  sns.lineplot(x=y_val , y=y_val, color='red', ax=axes[1])
  axes[1].set_xlabel("Actual")
  axes[1].set_ylabel("Predictions")
  axes[1].set_title("Validation Set")

  sns.scatterplot(x=y_test , y=pred_test, alpha=alpha_scatter, ax=axes[2])
  sns.lineplot(x=y_test , y=y_test, color='red', ax=axes[2])
  axes[2].set_xlabel("Actual")
  axes[2].set_ylabel("Predictions")
  axes[2].set_title("Test Set")

  plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's pass the values as usual.
* Note here we don't pass a pipeline; we pass the TensorFlow model
* Note the predictions tend to follow the red diagonal line. However, it seems the test set metrics are quite different from the train/validation set. You could add more complexity to the model, increase the number of epochs, etc., until you reach a metric that satisfies you. For learning purposes, we will be happy with this performance. 
* In general, we would expect the performance to be best on the train set, followed by the validation set and then the test set. However, there may be cases where this doesn't happen.

regression_performance(X_train, y_train,X_val, y_val, X_test, y_test,model)
regression_evaluation_plots(X_train, y_train, X_val, y_val,X_test, y_test, 
                            model, alpha_scatter=0.5)
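
To make the four metrics printed by `regression_evaluation()` less abstract, here is a minimal sketch that computes them by hand on made-up numbers. It uses plain Python instead of scikit-learn; `regression_metrics`, `actual` and `predicted` are hypothetical names used only for this illustration.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE and R2 from two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                  # variation the model did not explain
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation in the target
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# made-up values, not taken from the housing data
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]
metrics = regression_metrics(actual, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

RMSE is in the same unit as the target, which usually makes it easier to interpret than MSE.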

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Prediction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's take a sample from the test set and use it as if it were live data. We will consider two houses (not only one)

live_data = X_test[:2,:]
live_data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use `.predict()` and pass the data. In this case, we are predicting the price of 2 houses
* Since the X_test data is scaled and is an array, it is difficult to make sense of the content, but we are assuming here the data has already passed through the pre-processing pipeline.



model.predict(live_data)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Save and Load the model

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> In case you want to save your model, you may use `.save()` and pass the file path, including the model's name. The extension is `.h5`
* Remember, in this notebook, we used a pipeline to pre-process the data, so in a project using tabular data, you would also want to save this pipeline as a pkl file (similar to what we saw in the scikit-learn lesson)

model.save('my_model.h5')
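
The note above also mentions saving the pre-processing pipeline as a pkl file. A minimal sketch of that save-and-load round trip with Python's `pickle` module is shown here; `fitted_pipeline` is a hypothetical stand-in (a plain dictionary), and the file name `pipeline.pkl` is an assumption. A fitted scikit-learn pipeline object can be pickled following the same pattern.

```python
import os
import pickle
import tempfile

# hypothetical stand-in for a fitted pre-processing pipeline
fitted_pipeline = {"step": "feat_scaling", "mean": [2.0], "std": [0.8]}

# we write to a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "pipeline.pkl")

    # save the object to disk
    with open(file_path, "wb") as f:
        pickle.dump(fitted_pipeline, f)

    # load it back, as you would do before handling live data
    with open(file_path, "rb") as f:
        loaded_pipeline = pickle.load(f)

print(loaded_pipeline == fitted_pipeline)
```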

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You can load the model using `load_model()` from the Keras module
* Let's load the model as model_2

from tensorflow.keras.models import load_model
model_2 = load_model('my_model.h5')

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> NOTE: the training history is not stored with the saved model, so it is lost when you save the model and load it afterwards. The recommendation is to fit the model, then generate and save the training history plots immediately after

model_2.history.history

---




# TensorFlow - Unit 08 - Binary Classification

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Fit a deep learning neural network for the Binary Classification task



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Unit 08 - Binary Classification

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Workflow

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 We will follow the process used for supervised learning, which we are familiar with, but now with a few tweaks:

* Split the dataset into train, validation and test set
* Create a pipeline to handle data cleaning, feature engineering and feature scaling
* Create the neural network
* Fit the pipeline to the train set and transform the other sets
* Fit the model to the train and validation set
* Evaluate the model
* Prediction

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load and split the data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's first load the data. We are using the breast cancer dataset from sklearn.
* It shows breast mass sample records and a diagnosis informing whether the mass is malignant or benign. The target variable is the diagnosis, where 0 is malignant and 1 is benign (this is the order of the classes in `data.target_names`).

from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data,columns=data.feature_names)
df['diagnosis'] = pd.Series(data.target)
print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As part of our workflow, we split the data, but now we will split it into train, validation, and test sets. 
* First, we split into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['diagnosis'],axis=1),
                                    df['diagnosis'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Then, from the train set, we split a validation set. We set the validation set as 20% of the train set
* Have a look at the print statement, which shows the amount of data we have in each set (train, validation and test)

X_train, X_val,y_train, y_val = train_test_split(
                                    X_train,
                                    y_train,
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape)
print("* Validation set:",  X_val.shape, y_val.shape)
print("* Test set:",   X_test.shape, y_test.shape)
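
Taking 20% for the test set and then 20% of the remaining train set for validation means the final proportions are not 60/20/20. The sketch below does the arithmetic for the 569 rows of the breast cancer dataset. It assumes the held-out part is rounded up to a whole number of rows, which is how we understand `train_test_split()` treats a fractional `test_size`; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_remaining, n_held_out); the held-out part is rounded up (assumption)."""
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

n_rows = 569  # rows in the breast cancer dataset

# first split: train + test
n_train_full, n_test = split_sizes(n_rows, test_size=0.2)
# second split: validation is taken from the train set
n_train, n_val = split_sizes(n_train_full, test_size=0.2)

print("Train, validation, test rows:", n_train, n_val, n_test)
print("Proportions:",
      round(n_train / n_rows, 2),
      round(n_val / n_rows, 2),
      round(n_test / n_rows, 2))
```

You can compare these numbers with the shapes printed by the cell above; the train set ends up with roughly 64% of the data, validation with 16% and test with 20%.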

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pipeline for data processing

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We first create a pipeline for preprocessing the data. 
* In this case, it is only feature scaling.
* We could have also added a step for removing correlated features, but let's keep it simple.

from sklearn.pipeline import Pipeline
# Feature scaling
from sklearn.preprocessing import StandardScaler

# in this case, we don't need data cleaning or feat eng
def pipeline_pre_processing():
  pipeline_base = Pipeline([
      
      ( "feat_scaling",StandardScaler() )

    ])

  return pipeline_base


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Next, we fit the pipeline to the train set and apply the transformations to the validation and test sets
* This way, the pipeline learns the transformations (in this case, only feature scaling) from the train set and applies them to the other sets. 
* Let's visualise the first rows from the scaled data. Note it is a 2D NumPy array

pipeline = pipeline_pre_processing()
X_train = pipeline.fit_transform(X_train)
X_val= pipeline.transform(X_val)
X_test = pipeline.transform(X_test)

X_train[:2,]
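
To illustrate why we call `.fit_transform()` on the train set but only `.transform()` on the other sets, here is a minimal sketch of standard scaling in plain Python. The mean and standard deviation are learned from the train values only and then reused for the validation values. The names `fit_scaler` and `transform`, and the numbers, are made up for the illustration; this is not the scikit-learn implementation.

```python
import statistics

def fit_scaler(train_values):
    """Learn the mean and standard deviation from the train set only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population std (divides by n)
    return mean, std

def transform(values, mean, std):
    """Scale values using statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]

train_feature = [1.0, 2.0, 3.0]  # made-up values for one feature
val_feature = [4.0]

mean, std = fit_scaler(train_feature)
scaled_train = transform(train_feature, mean, std)
scaled_val = transform(val_feature, mean, std)  # reuses the train statistics

print([round(v, 3) for v in scaled_train])
print([round(v, 3) for v in scaled_val])
```

Because the validation and test sets never influence the learned statistics, they behave like unseen data when we evaluate the model.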

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Create Deep Learning Network

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We will create a TensorFlow model
* We create a function that creates a sequential model, compiles it and returns it. The function takes the number of features in the data, which is used as the number of neurons for the first layer
* Let's define the network architecture
  * We noted the data has 30 features. We will create a simple network just for a learning experience. 
  * The network is built using Dense layers - fully connected layers
  * The input layer has the same number of neurons as the number of columns in the data. The activation function is `relu`. Finally, we pass the input_shape using a tuple.
  * We are using three hidden layers, the first with 20 neurons, the next with 10 neurons and the last with 5 neurons. All three use `relu` as an activation function.
  * After the input layer and each hidden layer, we have a dropout layer with a 25% rate to reduce the chance of overfitting. In the previous notebook, we didn't add a dropout layer to the input layer. In this notebook, we are adding one to demonstrate it is possible. We covered the dropout layer in a previous notebook in case you want to refresh the concept.


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The output layer should reflect binary classification.
  * You may recall there are two ways to define an output layer for binary classification:
    * Either with one neuron with sigmoid as activation function 
    * Or two neurons with softmax as an activation function
  * We will code both, so you can choose which one you would like to use
* We compile the model depending on the output layer choice
  * If it is 1 neuron with sigmoid as activation function: optimizer='adam', loss='binary_crossentropy'
  * If it is 2 neurons with softmax as activation function: optimizer='adam', loss='categorical_crossentropy'
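
The two output layer options describe the same kind of result: a probability for each class. The sketch below, in plain Python with made-up numbers, shows that a sigmoid over one raw output matches a softmax over two raw outputs when the first one is fixed at 0, and how binary cross-entropy penalises a confident wrong probability. The function names are hypothetical and only mirror the usual formulas; they are not the Keras implementations.

```python
import math

def sigmoid(z):
    """Squash one raw output into a probability for class 1."""
    return 1 / (1 + math.exp(-z))

def softmax(raw_outputs):
    """Turn a list of raw outputs into probabilities that sum to 1."""
    exps = [math.exp(v) for v in raw_outputs]
    total = sum(exps)
    return [e / total for e in exps]

def binary_crossentropy(y_true, p_class_1):
    """Loss for one row: small when the probability agrees with the label."""
    return -(y_true * math.log(p_class_1) + (1 - y_true) * math.log(1 - p_class_1))

z = 2.0  # made-up raw output of the last neuron
p_sigmoid = sigmoid(z)
p_softmax = softmax([0.0, z])  # two neurons; the class 0 output is fixed at 0 for comparison

print(round(p_sigmoid, 4), [round(p, 4) for p in p_softmax])

# the actual class is 1: a probability of 0.9 gives a low loss, 0.1 gives a high loss
loss_good = binary_crossentropy(1, 0.9)
loss_bad = binary_crossentropy(1, 0.1)
print(round(loss_good, 4), round(loss_bad, 4))
```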



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  In classification tasks, we can use an additional metric when compiling: 'accuracy'. We will still monitor the loss (like we did in Regression), but now we can monitor the accuracy while training. 
* Note: in regression, we can't add this argument since accuracy doesn't suit the context of regression

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Below we find the model where the output layer uses sigmoid as an activation function

import os
# set the log level before importing TensorFlow; otherwise, it has no effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout

def create_tf_model_sigmoid(n_features):

  model = Sequential()
  model.add(Dense(units=n_features,activation='relu', input_shape=(n_features,)))
  model.add(Dropout(0.25))

  model.add(Dense(units=20,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=10,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=5,activation='relu'))
  model.add(Dropout(0.25))

  # note we use 1 neuron and sigmoid
  model.add(Dense(units=1,activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  
  return model


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Below we find the model where the output layer has softmax as an activation function

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> In this exercise, we will move on with the model that has the activation function as sigmoid in the output layer since, in the next unit notebook, we will handle a network that has softmax as an activation function in the output layer 
* Even if you try to fit the model with softmax as an activation function in the output layer, it will not work since it needs an additional step. We need to one-hot-encode the target variable. We will do that in the next unit notebook, which covers multi-class classification.


def create_tf_model_softmax(n_features):

  model = Sequential()
  model.add(Dense(units=n_features,activation='relu', input_shape=(n_features,)))
  model.add(Dropout(0.25))

  model.add(Dense(units=20,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=10,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=5,activation='relu'))
  model.add(Dropout(0.25))

  # note we use 2 neurons and softmax
  model.add(Dense(2, activation='softmax'))
  model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
  
  return model
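
The warning above says the softmax model needs a one-hot-encoded target. As a preview only, here is a minimal sketch of what that encoding looks like in plain Python; `one_hot` and `y_example` are hypothetical names. Keras offers a `to_categorical()` utility for the same purpose, and the next unit covers the approach used in this course.

```python
def one_hot(labels, n_classes):
    """Turn integer class labels into rows that contain a single 1."""
    encoded = []
    for label in labels:
        row = [0] * n_classes
        row[label] = 1  # the position of the 1 tells the class
        encoded.append(row)
    return encoded

y_example = [0, 1, 1, 0]  # made-up binary target
y_encoded = one_hot(y_example, n_classes=2)
print(y_encoded)
```

Each row now has one column per class, which matches the two neurons in the softmax output layer.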


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's visualize the network structure
* Note the number of parameters the network has; let's first use `create_tf_model_sigmoid()`

model = create_tf_model_sigmoid(n_features=X_train.shape[1])
model.summary()
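
You can check the parameter counts from the summary by hand. A Dense layer has one weight per connection to the previous layer plus one bias per neuron, and Dropout layers add no parameters. The sketch below assumes 30 features and the layer sizes from `create_tf_model_sigmoid()`; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 30
layer_units = [n_features, 20, 10, 5, 1]  # Dense layers, in order

total = 0
n_inputs = n_features
for n_units in layer_units:
    params = dense_params(n_inputs, n_units)
    print(f"Dense({n_units}): {params} parameters")
    total += params
    n_inputs = n_units  # the next layer receives this layer's outputs

print("Total:", total)
```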

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Once again, we can use `plot_model()`, also from `keras.utils`, for a more graphical approach

from tensorflow.keras.utils import plot_model
plot_model(model, show_shapes=True)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's check the difference between the structure of `create_tf_model_sigmoid()` and `create_tf_model_softmax()`
* Below we plotted the model for `create_tf_model_softmax()`. As you may expect, we defined the difference to be in the output layer

model = create_tf_model_softmax(n_features=X_train.shape[1])
plot_model(model, show_shapes=True)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit the model

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Early stopping allows you to stop training when a monitored metric has stopped improving; this is useful to avoid overfitting the model to the data.
* We will monitor the validation accuracy now 
  * We set patience as 10, which is the number of epochs with no improvement after which the training will be stopped. Although there is no fixed rule to set patience, if you feel that your model was still learning when it stopped, you may increase the patience value and train again.
  * We set the mode to max since we want the training to stop when the accuracy no longer improves, and for accuracy, improvement means an increase.
  * When you monitor loss, you expect it to decrease over the training process, so you set the mode to min. Accuracy is the opposite: you expect it to increase over the training time, so you set the mode to max.

from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_accuracy', mode='max', verbose=1, patience=10)
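
To build intuition for patience and mode, here is a simplified sketch of the early stopping rule in plain Python. It is not the Keras implementation; it only mimics the idea of counting epochs without improvement when the monitored metric should increase (mode max). The function name and the accuracy values are made up.

```python
def early_stopping_epoch(val_accuracy, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    wait = 0  # epochs in a row without improvement
    for epoch, value in enumerate(val_accuracy):
        if value > best:  # mode max: improvement means a higher value
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# made-up validation accuracies, one per epoch
val_accuracy = [0.80, 0.85, 0.84, 0.85, 0.83, 0.90]
stop_epoch = early_stopping_epoch(val_accuracy, patience=3)
print(stop_epoch)
```

With a patience of 3, training stops before reaching the last value (0.90), which illustrates why a patience that is too low can stop a model that was still able to learn.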

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will finally fit the model
* We create the model object and use .fit(), as usual
  * We pass the train set
  * Epochs are set to 75. In theory, you may set a high value since we add an early stop, which stops the training process when the monitored metric no longer improves. 
  * We pass the validation data in a tuple.
  * Verbose is set to 1, so we can see which epoch we are in, along with the training and validation loss and accuracy.
  * Finally, we pass the early_stop object we created earlier in the callbacks list.

* For each epoch, note the training and validation loss; are they increasing? Decreasing? Static?
  * Ideally, the loss should decrease as the epochs progress, which is a practical sign the network is learning


model = create_tf_model_sigmoid(n_features=X_train.shape[1])
model.fit(x=X_train, 
          y=y_train, 
          epochs=75,
          validation_data=(X_val, y_val),
          verbose=1,
          callbacks=[early_stop]
          )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Model evaluation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Now we will evaluate the model performance by analysing the training and validation losses and accuracy recorded during the training process. 
* In deep learning, we use the model history to assess if the model learned, using the train and validation sets. We also evaluate separately how the model generalises on unseen data (on the test set)
* The model training history information is stored in a `.history.history` attribute from the model. 
* **Note it shows loss and accuracy for train and validation**

history = pd.DataFrame(model.history.history)
history.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are plotting each loss and accuracy in a line plot, where the y-axis has the loss/accuracy value, the x-axis is the epoch number, and the lines are coloured by train or validation
* We use `.plot(style='.-')` for this task
  * Note the loss plot for training and validation data follow a similar path and are close to each other. It looks like the network learned the patterns.
  * Note in the accuracy plot that both train and validation accuracies keep increasing; when the performance "saturates" for validation, the training stops, as we set in the early stopping object.

sns.set_style("whitegrid")
history[['loss','val_loss']].plot(style='.-')
plt.title("Loss")
plt.show()

print("\n")
history[['accuracy','val_accuracy']].plot(style='.-')
plt.title("Accuracy")
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we will evaluate the model performance on the test set, using `.evaluate()` and passing the test set. Note the values are not much different from the losses and accuracy in the train and validation sets.
* Note the loss is low and accuracy is high. It looks like the model learned the relationship between the features and the target, considering all features.

model.evaluate(X_test,y_test)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> When evaluating a deep learning model, you typically cover the loss plot and evaluate the test set; however, **as an additional step, you can, if you want**, run an evaluation similar to the one we did in conventional ML.
* In classification, you would analyse the confusion matrix and classification report using the custom function we have seen over the course.
* One difference is that we adapted the function to also evaluate the validation set, but that is a minor change in the code; the overall logic is the same

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> The adapted custom function below will work for the model made with a sigmoid as an activation function.
* There is a difference in the prediction format between the sigmoid output layer and the softmax output layer; the first returns a single probability (between 0 and 1) per row, whereas the second returns one probability per class. In the next unit, we will cover this evaluation for a model with softmax
* In case your model was trained with a softmax activation function, the code below may not work as expected

from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):
  prediction = pipeline.predict(X).reshape(-1)
  prediction = np.where(prediction<0.5,0,1) 
  # the prediction using sigmoid as an activation function is a probability number, between 0 and 1
  # we convert it to 0 or 1, if it is lower than 0.5, the predicted class is 0, otherwise it is 1
  # you could change the threshold if you want.

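  print('---  Confusion Matrix  ---')
  # we deliberately pass the prediction as y_true, so the rows are the predictions
  # and the columns are the actual values, matching the index and columns labels below
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))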
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")



def clf_performance(X_train,y_train,X_test,y_test,X_val, y_val,pipeline,label_map):

  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Validation Set #### \n")
  confusion_matrix_and_report(X_val,y_val,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's pass the values as usual.
* Note the model is capable of separating the classes, including in the test set

clf_performance(X_train, y_train,
                X_test,y_test,
                X_val, y_val,
                model,
                label_map= ['malignant', 'benign']
                )
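
The numbers in the confusion matrix and the classification report come from four simple counts. The sketch below reproduces them in plain Python on made-up labels, treating class 1 as the class of interest; `binary_confusion_counts` is a hypothetical helper, not a scikit-learn function.

```python
def binary_confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

# made-up actual and predicted classes
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = binary_confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # of the rows predicted as 1, how many are really 1
recall = tp / (tp + fn)      # of the rows that are really 1, how many we found
accuracy = (tp + tn) / len(y_true)

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```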

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Prediction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's take a sample from the test set and use it as if it were live data. We will consider one sample

index = 1
live_data = X_test[index-1:index,]
live_data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use `.predict()` and pass the data. Note the result is not a direct 0 or 1 but instead a probabilistic result between 0 and 1

prediction_proba = model.predict(live_data)
prediction_proba

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You must decide a threshold when stating if the given probabilistic result is a 0 or 1. In this case, we set the threshold as 0.5
* We convert it using the NumPy function `np.where()`, where you set a condition (prediction_proba < 0.5); if the condition is true, the result is 0; otherwise, it is 1.

prediction_class = np.where(prediction_proba<0.5,0,1) 
prediction_class
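
The threshold is a choice, and changing it changes the predicted classes. Here is a minimal sketch in plain Python that applies the same rule as `np.where(prediction_proba<0.5,0,1)` with three different thresholds; `to_class` and the probabilities are made up for the illustration.

```python
def to_class(probabilities, threshold=0.5):
    """Return 0 when the probability is below the threshold, otherwise 1."""
    return [0 if p < threshold else 1 for p in probabilities]

probabilities = [0.2, 0.45, 0.55, 0.9]  # made-up sigmoid outputs

for threshold in (0.3, 0.5, 0.7):
    print(threshold, to_class(probabilities, threshold))
```

A lower threshold assigns more samples to class 1, and a higher threshold assigns fewer; which one suits your project depends on the cost of each type of error.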

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's plot the probabilistic result so that you can check the predictions in a more visual fashion
* Read the comments in the code
* In the end, you use prediction_proba to define the associated probability for each of the two classes: 0 and 1. Then you plot it in a bar plot using Plotly 

# define how you map the classes and the meaning of each
# where the dict key is the class number
# in sklearn's breast cancer dataset, 0 is malignant and 1 is benign
target_map = {0:'Malignant', 1:'Benign'}

# create an empty dataframe that will show the probability per class
# we set that the probabilities will be 0, but we will update soon
prob_per_class= pd.DataFrame(
        data=[0.0, 0.0],  # floats, since we will store probabilities
        index=target_map.keys(),
        columns=['Probability']
    )


# the prediction probabilities from both classes sum to 1
# for a binary classification case, we can say that
#    === if prediction_proba is, say, 0.01, the predicted class is 0
#    so the prediction probability for class 1 is 0.01 and for class 0 is 0.99
#    === if prediction_proba is, say, 0.99, the predicted class is 1
#    so the prediction probability for class 1 is 0.99 and for class 0 is 0.01
# prediction_proba has shape (1, 1); we keep it as a float, since int() would truncate it to 0
prob_per_class.iloc[1,0] = float(prediction_proba[0, 0])
prob_per_class.iloc[0,0] = 1 - float(prediction_proba[0, 0])


# we round the values to 3 decimal points, for better visualization
prob_per_class = prob_per_class.round(3)

# we add a column to prob_per_class that shows the meaning of each class
# in this case, malignant or benign
prob_per_class['Result'] = list(target_map.values())

# take a look at the data we generated
prob_per_class


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use a bar plot, where the x-axis shows the Result and the y-axis the associated probability for a given Result.
* I encourage you to go to the first cell of the Prediction section and change the index variable to take a different sample. Then run all the cells from there down to the plot in the cell below
* You may change the index to any positive integer up to the number of rows in the test set

import plotly.express as px
fig = px.bar(
        prob_per_class,
        x = 'Result',
        y = 'Probability',
        range_y=[0,1],
        width=400, height=400,template='seaborn')
fig.show()

---


# TensorFlow - Unit 09 - Multiclass Classification

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Fit a deep learning neural network for the Multiclass Classification task




---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Unit 09 - Multiclass Classification

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Workflow

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 We will follow the typical process used for supervised learning that we are familiar with, but now with a few tweaks:

* Split the dataset into train, validation and test set
* Create a pipeline to handle data cleaning, feature engineering and feature scaling
* Create the neural network
* Fit the pipeline to the train set and transform the other sets
* Fit the model to the train and validation set
* Evaluate the model
* Prediction

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load and split the data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's first load the data. We are using the penguins dataset from seaborn. It has records for three different species of penguins collected from three islands in the Palmer Archipelago, Antarctica
* Here, we are interested in predicting the penguin species based on the penguin's characteristics

df = sns.load_dataset('penguins')
print(df.shape)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  When you want to create a TensorFlow model for multiclass classification, your target variable needs to be encoded as numerical since TensorFlow handles numbers.
* As a result, we create a dictionary that maps the target classes to numbers and then use it to replace the class names in the target variable.
* It is better to do that before splitting the data; otherwise, you would have to do that three times, once for each target set (y_train, y_val, y_test)

target_map = {'Adelie':0, 'Chinstrap':1, 'Gentoo':2}
df['species'] = df['species'].replace(target_map)
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As part of our workflow, we split the data, but now we will split it into the train, validation, and test sets. 
* First, we split into the train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['species'],axis=1),
                                    df['species'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Then, from the train set, we split a validation set. We set the validation set as 20% of the train set
* Have a look at the print statement, showing the amount of data we have in each set (train, validation and test)

X_train, X_val,y_train, y_val = train_test_split(
                                    X_train,
                                    y_train,
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape)
print("* Validation set:",  X_val.shape, y_val.shape)
print("* Test set:",   X_test.shape, y_test.shape)
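
The two consecutive splits can be checked with a little arithmetic: taking 20% for the test set and then 20% of the remainder for validation leaves roughly 64% / 16% / 20% of the original rows. The sketch below is only an illustration with NumPy on a made-up number of rows; `three_way_split` is a hypothetical helper, and it does not reproduce the exact rounding or shuffling that `train_test_split` uses.

```python
import numpy as np

def three_way_split(n_samples, test_size=0.2, val_size=0.2, seed=0):
    """Hypothetical helper: shuffle row indices and cut them into three groups."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)

    # first cut: hold out the test rows
    n_test = int(round(n_samples * test_size))
    test_idx, rest = idx[:n_test], idx[n_test:]

    # second cut: hold out the validation rows from what is left
    n_val = int(round(len(rest) * val_size))
    val_idx, train_idx = rest[:n_val], rest[n_val:]
    return train_idx, val_idx, test_idx

# made-up dataset size, not the penguins dataset
train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 640 160 200
```

Note that the validation share is 20% of the remaining train rows, not 20% of the full dataset.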

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  When you want to create a TensorFlow model for multiclass classification, you will choose the loss function when compiling the model. The "first go to" option is `categorical_crossentropy`; we will use that over the course.
* The target variable should be one hot encoded when using this loss function.
* We are converting each categorical level into new binary columns and assigning a binary value of 1 or 0. Each binary column is a category level from the variable. The number of binary columns is the same as the number of classes from that target variable.
* The binary column is 1 when the original categorical variable represents the associated binary column. Let's see the example and learn from that. 


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> First, let's inspect the first five rows from y_train

y_train.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We first get the number of unique values in the target variable; we will use it here and when creating the model

* We use the `to_categorical()` function to one hot encode the target in the format we require. We pass the data to `to_categorical()` and set the number of classes.
* Let's again inspect the first five items from y_train. Note we had three classes (0, 1 and 2)
* Three binary columns were created, each representing one of the possible classes (0, 1 or 2). The first row was 1, and when one hot encoded, the second binary column (that represents class 1) has the value 1, whereas the other binary columns are zero.

import os
# set the log level before importing TensorFlow so the setting takes effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
n_labels = y_train.nunique()

y_train = to_categorical(y=y_train, num_classes=n_labels)
y_val = to_categorical(y=y_val, num_classes=n_labels)
y_test = to_categorical(y=y_test, num_classes=n_labels)

y_train[:5,]
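
To see what one hot encoding does without TensorFlow, here is a minimal NumPy sketch. `one_hot` is a hypothetical helper written for this illustration, and the labels are made up; it is meant to mimic the idea behind `to_categorical()`, not its exact implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a 2D array with one binary column per class."""
    labels = np.asarray(labels, dtype=int)
    # row i of the identity matrix is the one hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]

sample_labels = [1, 0, 2]  # made-up labels, not rows from y_train
encoded = one_hot(sample_labels, num_classes=3)
print(encoded)

# argmax along the columns recovers the original labels
print(np.argmax(encoded, axis=1))
```

The first made-up label is 1, so the first row has a 1 in the second column and zeros elsewhere.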

---


### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pipeline for data processing

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We first create a pipeline for preprocessing the data. We will list here the steps, but in an actual project, you would have used your knowledge to explore the data and look for data cleaning and feature engineering steps. In this case, the steps are: 
* Impute missing data with the median for all numerical variables (when you don't pass the `variables` argument, the imputer selects all numerical variables. This trick saves you time)
* Impute the most frequent level in the categorical variables. We again didn't pass the `variables` argument, so it includes all categorical variables, and you don't have to pass ['island', 'sex']
  * Note: you shouldn't do this for all datasets. We had studied the dataset before and concluded we could use this imputer for island and sex, which happen to be categorical variables.
* Encode all categorical variables (same rationale as the previous bullet for not passing the `variables` argument explicitly) 
* Feature scaling

from sklearn.pipeline import Pipeline
### Feature Engineering
from feature_engine.imputation import MeanMedianImputer
from feature_engine.imputation import CategoricalImputer
from feature_engine.encoding import OrdinalEncoder
### Feat Scaling
from sklearn.preprocessing import StandardScaler

def pipeline_pre_processing():
  pipeline_base = Pipeline([
                            
      ( 'median',  MeanMedianImputer(imputation_method='median') ),

      ( 'categorical_imputer', CategoricalImputer(imputation_method='frequent')),

      ( "ordinal",OrdinalEncoder(encoding_method='arbitrary' )),    
      
      ( "feat_scaling",StandardScaler() )

    ])

  return pipeline_base
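
The first three pipeline steps can be sketched by hand on a couple of made-up columns. This is only an illustration of the ideas (median imputation, most frequent level imputation, ordinal encoding) using NumPy and the standard library; it is not how feature_engine implements them, and the integer order chosen here (order of first appearance) is an assumption of this sketch.

```python
import numpy as np
from collections import Counter

# made-up columns with missing values (np.nan / None), not real dataset rows
body_mass = np.array([3700.0, np.nan, 4200.0, 5000.0])
sex = ['Male', None, 'Female', 'Male']

# 1) median imputation for a numerical column
median_value = np.nanmedian(body_mass)
body_mass_imputed = np.where(np.isnan(body_mass), median_value, body_mass)

# 2) most frequent level imputation for a categorical column
most_frequent = Counter(v for v in sex if v is not None).most_common(1)[0][0]
sex_imputed = [most_frequent if v is None else v for v in sex]

# 3) ordinal encoding: one integer per level (order of first appearance here)
levels = list(dict.fromkeys(sex_imputed))
level_to_int = {level: i for i, level in enumerate(levels)}
sex_encoded = [level_to_int[v] for v in sex_imputed]

print(body_mass_imputed)
print(sex_imputed)
print(sex_encoded)
```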



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Next, we fit the pipeline to the train set and use it to transform the train, validation and test sets
* This way, the pipeline learns the transformation parameters (the medians, the most frequent levels, the category encodings and the scaling statistics) from the train set only and applies them to the other sets
* Let's visualise the first rows from the scaled data. Note it is a 2D NumPy array

pipeline = pipeline_pre_processing()
X_train = pipeline.fit_transform(X_train)
X_val= pipeline.transform(X_val)
X_test = pipeline.transform(X_test)

X_train[:2,]
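
The "fit on train, transform the others" idea can be illustrated with feature scaling alone. The sketch below uses NumPy on synthetic data; the `_demo` arrays are made up for the illustration, and the mean and standard deviation play the role of what a scaler learns during `fit`.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data standing in for the train and validation features
X_train_demo = rng.normal(loc=50, scale=10, size=(100, 2))
X_val_demo = rng.normal(loc=50, scale=10, size=(20, 2))

# "fit": learn the statistics from the train set only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)

# "transform": apply the same statistics to every set
X_train_scaled = (X_train_demo - mean) / std
X_val_scaled = (X_val_demo - mean) / std

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
print(X_val_scaled.mean(axis=0).round(3), X_val_scaled.std(axis=0).round(3))
```

The scaled train set has mean 0 and standard deviation 1 by construction, while the validation set is only approximately so, since its statistics were not used in the fit.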

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Create Deep Learning Network

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We will create a tensorflow model
* We create a function that creates a sequential model, compiles the model and returns the model. The function needs the number of features the data has and  the number of neurons in the last layer
* Let's define the network architecture
  * We noted the data has six features. First, we will create a simple network just for a learning experience. 
  * The network is built using Dense layers - fully connected layers
  * The input layer has the same number of neurons as the number of columns from the data. The activation function is `relu`. We pass the input_shape as a tuple.
  * We are using three hidden layers, the first with 20 neurons, then 10 neurons, and the last with 5 neurons. All three will use `relu` as an activation function. This approach is the "expansive-shrink" option we mentioned in a previous notebook related to model architecture.
  * After the input layer and each hidden layer, we have a dropout layer with a rate of 25% to reduce the chance of overfitting. 
* The output layer should reflect a multiclass classification.
  * We set a dense layer where the number of neurons used is the same as the number of classes in the target variable. This information is stored in a previously created variable - `n_labels`. 
  * For multiclass classification, we set the activation function as softmax, and we compile the model with adam as the optimizer and the loss function as categorical_crossentropy. We also monitor accuracy as a metric.




from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout

def create_tf_model(n_features, n_labels):

  model = Sequential()
  model.add(Dense(units=n_features,activation='relu', input_shape=(n_features,)))
  model.add(Dropout(0.25))

  model.add(Dense(units=20,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=10,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(units=5,activation='relu'))
  model.add(Dropout(0.25))

  model.add(Dense(n_labels, activation='softmax'))
  model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy'])

  return model
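
To build some intuition for the output layer and the loss, here is a minimal NumPy sketch of softmax and categorical cross-entropy for a single sample. The functions and the numbers are made up for illustration and simplified; this is not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    # subtract the row maximum for numerical stability
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    # clip to avoid log(0); only the probability of the true class contributes
    y_prob = np.clip(y_prob, eps, 1.0)
    return -np.sum(y_true * np.log(y_prob), axis=1)

logits = np.array([[2.0, 1.0, 0.1]])   # made-up raw outputs for 3 classes
y_true = np.array([[1.0, 0.0, 0.0]])   # one hot encoded actual class (class 0)

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs, probs.sum())
print(loss)
```

The probabilities sum to 1, and the loss gets smaller as the probability assigned to the actual class gets closer to 1.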


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's visualise the network structure

model = create_tf_model(n_features=X_train.shape[1], n_labels=n_labels )
model.summary()
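
When reading the summary, it helps to know where the parameter counts come from: a Dense layer has one weight per input-neuron pair plus one bias per neuron, and Dropout layers add no parameters. The sketch below reproduces the arithmetic for the architecture described above, assuming six input features and three classes; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_inputs = 6                      # assumed number of features
layer_units = [6, 20, 10, 5, 3]   # input layer, three hidden layers, output layer

total = 0
for units in layer_units:
    params = dense_params(n_inputs, units)
    print(f"Dense({units}) with {n_inputs} inputs: {params} parameters")
    total += params
    n_inputs = units              # this layer's output feeds the next layer

print("Total:", total)  # 465
```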

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Once again, we can use `plot_model()` also from Keras.utils for a more graphical approach

from tensorflow.keras.utils import plot_model
plot_model(model, show_shapes=True)

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit the model

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Early stopping allows us to stop training when a monitored metric has stopped improving; this is useful to avoid overfitting the model to the data.
* We will monitor the validation loss now 
  * We set patience to 10, which is the number of epochs with no improvement after which training will be stopped. Although there is no fixed rule to set patience, if you feel that your model was still learning when it stopped, you may increase the value and train again.
  * We set the mode to min since we want the model to stop training when the validation loss stops improving, and for a loss, improving means decreasing

from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)
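
The patience logic can be sketched in plain Python. This is a simplified illustration of the idea with `mode='min'`, not the Keras implementation; `early_stopping_epoch` and the list of validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the (0-based) epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # improvement: remember the new best value and reset the counter
            best = loss
            wait = 0
        else:
            # no improvement: count one more epoch without progress
            wait += 1
            if wait >= patience:
                return epoch
    return None

val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63]  # made-up values
print(early_stopping_epoch(val_losses, patience=3))   # 5
print(early_stopping_epoch(val_losses, patience=10))  # None
```

With a patience of 3, training stops after three consecutive epochs that fail to beat the best loss of 0.6; with a patience of 10, this short run never triggers the stop.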

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will finally fit the model
* We create the model object and use `.fit()`, as usual
  * We pass the train set
  * The epochs are set to 100. In theory, you may set a high value since we add an early stop, which stops the training process when the validation loss no longer improves. 
  * We pass the validation data using a tuple.
  * Verbose is set to 1, so we can see which epoch we are in and the training and validation loss.
  * Finally, we pass our callback as a list containing the early_stop object we created earlier.

* For each epoch, note the training and validation loss and accuracy. Are they increasing? Decreasing? Static?
  * Ideally, the loss should decrease as the epochs progress, which is a practical sign that the network is learning. The accuracy should increase over the epochs.

model = create_tf_model(n_features=X_train.shape[1],  n_labels=n_labels)

model.fit(x=X_train, 
          y=y_train, 
          epochs=100,
          validation_data=(X_val, y_val),
          verbose=1,
          callbacks=[early_stop]
          )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Model evaluation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Now we will evaluate the model performance by analysing the training and validation losses and accuracy that happened during the training process. 
* In deep learning, we use the model history to assess if the model learned, using the train and validation sets. We also evaluate separately how the model generalises on unseen data (on the test set)
* The model training history information is stored in a `.history.history` attribute from the model. 
* **Note it shows loss and accuracy for train and validation**

history = pd.DataFrame(model.history.history)
history.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are plotting each loss and accuracy in a line plot, where the y-axis has the loss/accuracy value, the x-axis is the epoch number, and the lines are coloured by train or validation
* We use `.plot(style='.-')` for this task
  * Note the loss plot for training and validation data follow a similar path and are close to each other. It looks like the network learned the patterns.
  * Note in the accuracy plot that both train and validation accuracies keep increasing. When the performance "saturates" for validation, the training stops, as we set the early stopping object.

sns.set_style("whitegrid")
history[['loss','val_loss']].plot(style='.-')
plt.title("Loss")
plt.show()

print("\n")
history[['accuracy','val_accuracy']].plot(style='.-')
plt.title("Accuracy")
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we will evaluate the model performance on the test set, using `.evaluate()` and passing the test set. Note the values are not much different from the losses and accuracy in the train and validation sets.
* Note the loss is low and accuracy is high. It looks like the model learned the relationship between the features and the target, considering all features.

model.evaluate(X_test,y_test)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> When evaluating a deep learning model, you typically cover the loss plot and evaluate the test set; however, **as an additional step, you can also run an evaluation** similar to the one we did in conventional ML.
* In classification, you would analyse the confusion matrix and classification report using the custom function we have seen over the course.
* One difference is that we adapted the function to also evaluate the validation set, but that is a minor change in the code; the overall logic is the same

from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):
  # the prediction comes as one probability per class for each row
  prediction = pipeline.predict(X)
  # so we take the index from the highest probability, which is the "winner" or predicted class
  prediction = np.argmax(prediction, axis=1)
  
  # we also take the index from the highest probability from the actual values
  y = np.argmax(y, axis=1)
  
  print('---  Confusion Matrix  ---')
  # y_true and y_pred are swapped here, so rows show predictions and columns show actual values
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))
  print("\n")

  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")


def clf_performance(X_train,y_train,X_test,y_test,X_val, y_val,pipeline,label_map):

  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Validation Set #### \n")
  confusion_matrix_and_report(X_val,y_val,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)
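
The core of the function above is the `argmax` step followed by counting. The sketch below shows that logic with NumPy on made-up probabilities and one hot encoded actual values, using the same layout as the custom function (rows are predictions, columns are actual values). It is an illustration only and does not replace scikit-learn's `confusion_matrix`.

```python
import numpy as np

# made-up predicted probabilities for 4 samples and 3 classes
proba = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
# made-up one hot encoded actual values
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1],
                     [0, 1, 0]])

predicted = np.argmax(proba, axis=1)   # index of the highest probability
actual = np.argmax(y_onehot, axis=1)   # index of the 1 in each row

n_classes = proba.shape[1]
matrix = np.zeros((n_classes, n_classes), dtype=int)
for p, a in zip(predicted, actual):
    matrix[p, a] += 1                  # rows: prediction, columns: actual

accuracy = np.trace(matrix) / matrix.sum()
print(matrix)
print(accuracy)  # 0.75
```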

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's pass the values as usual.
* Note the model is capable of separating the classes, including in the test set

clf_performance(X_train, y_train,
                X_test,y_test,
                X_val, y_val,
                model,
                label_map= target_map.keys()
                )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Prediction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's take a sample from the test set and use it as if it were live data. We will consider one sample

index = 1
live_data = X_test[index-1:index,]
live_data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use `.predict()` and pass the data. Note the result is not a direct 0, 1 or 2 but instead a probabilistic result for each class. 

prediction_proba = model.predict(live_data)
prediction_proba

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> So we take the index from the highest probability, which is the "winner" or predicted class

prediction_class = np.argmax(prediction_proba, axis=1) 
prediction_class
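
If you want the species name rather than the class number, you can invert the mapping dictionary. The sketch below assumes the same `target_map` used earlier and a made-up probability array; `class_names`, `demo_proba` and `demo_class` are hypothetical names for this illustration.

```python
import numpy as np

target_map = {'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2}
# invert the dictionary: number -> species name
class_names = {number: name for name, number in target_map.items()}

# made-up probabilities for one sample, not a real model output
demo_proba = np.array([[0.05, 0.15, 0.80]])

demo_class = int(np.argmax(demo_proba, axis=1)[0])
print(demo_class, class_names[demo_class])  # 2 Gentoo
```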

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's plot the probabilistic result, so you can check the predictions in a more visual fashion
* Read the comments in the code below
* In the end, you use `prediction_proba` to get the associated probability for each class. Then you plot it in a bar plot using Plotly 

# create a dataframe that shows the probability per class
# we set the probabilities as the first (and only) row of prediction_proba
prob_per_class= pd.DataFrame(data=prediction_proba[0],
                             columns=['Probability']
                             )

# we round the values to 3 decimal points, for better visualization
prob_per_class = prob_per_class.round(3)

# we add a column to prob_per_class that shows the meaning of each class
# in this case, the species name are mapped in the target_map dict keys
prob_per_class['Results'] = list(target_map.keys())

prob_per_class

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use a bar plot, where the x-axis shows the Result and the y-axis the associated probability for a given Result
* I encourage you to go to the first cell of the Prediction section and change the index variable to take a different sample. Then run all the cells from there down to the plot in the cell below
* You may change the index to any positive integer up to the number of rows in the test set

import plotly.express as px
fig = px.bar(
        prob_per_class,
        x = 'Results',
        y = 'Probability',
        range_y=[0,1],
        width=400, height=400,template='seaborn')
fig.show()

---

# TensorFlow - Unit 10 - Image Classification

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Fit a convolutional neural network for a classification task using an image dataset



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('white')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Unit 10 - Image Classification: Toy Datasets

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Workflow

<img width="3%" height="3%" align="top"  src=" https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Challenge%20test.png
">
 We will follow the typical process used for supervised learning, which we are familiar with, but now with a few tweaks:

* Split the dataset into train, validation and test set
* Preprocess the image data
* Create the neural network
* Fit the model to the train and validation set
* Evaluate the model
* Prediction

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Load and split the data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's first load the data. We are using the MNIST dataset from TensorFlow. It has handwritten digits from 0 to 9.
* Here, we are interested in predicting numbers based on the handwritten digit image
* This is a toy dataset, where all images are provided in a single and standardized format and arranged in a NumPy array.
  * This is useful for learning purposes. However, actual image datasets rarely have the characteristic of having all images of the same size. In the first walkthrough project, we will handle a dataset where its images have different sizes. For now, we are focused on the workflow for managing the image dataset


import os
# set the log level before importing TensorFlow so the setting takes effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf
from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape, X_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We have already explored this dataset in a previous unit notebook; as you may remember, you choose an index to reveal a number
* We are using `plt.imshow()` to display the NumPy array as an image

pointer = 88

print(f"array pointer = {pointer}")
print(f"X_train[{pointer}] shape: {X_train[pointer].shape}")
print(f"label: {y_train[pointer]}")

plt.imshow(X_train[pointer],cmap='gray')
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We already loaded the data in train and test sets. This is because TensorFlow provides the data in this format.
* From the train set, we split a validation set. We set the validation set as 20% of the train set
* Have a look at the print statement, showing the amount of data we have in each set (train, validation and test)

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
                                    X_train,
                                    y_train,
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape)
print("* Validation set:",  X_val.shape, y_val.shape)
print("* Test set:",   X_test.shape, y_test.shape)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We first get the unique values from the target variables, and we will use them when evaluating the model performance

target_classes= np.unique(y_train)
target_classes

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we get the number of unique values from the target variables; we will use them here and when creating the model

n_labels = len(np.unique(y_train))
n_labels

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, let's inspect the first five rows from y_train

y_train[:5,]

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Similarly to the previous notebook, we use the `to_categorical()` function to one hot encode in the required format. We parse the data to `to_categorical()` and assign the number of classes.
* Let's inspect the first five items again from `y_train` after transformation.

from tensorflow.keras.utils import to_categorical
y_train = to_categorical(y=y_train, num_classes=n_labels)
y_val = to_categorical(y=y_val, num_classes=n_labels)
y_test = to_categorical(y=y_test, num_classes=n_labels)

y_train[:5,]

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Data processing

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We first need to preprocess the data. 
* This exercise will check if scaling the data and reshaping the array size is required.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Let's evaluate if we need to scale the data.
* Scaling is important since the algorithm learns best when the data is in a small range; for images, we aim for a range of 0 to 1
* Since X_train is a NumPy array, we can get the max() value; if it is greater than 1, we would need to scale.
  *  We note the max value is 255. A pixel value of 255 means maximum light (or white), whereas 0 means minimum light (or black)

`X_train.max() = 255`

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> As a result, we will scale the data by dividing the NumPy arrays (X_train, X_val and X_test) by 255.

`X_train = X_train / 255`
`X_val = X_val / 255`
`X_test = X_test / 255`
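
The effect of dividing by 255 can be checked on a tiny made-up image. The sketch below uses a hypothetical 2x2 array of `uint8` pixel values, the integer type typically used for 0 to 255 pixel data; note that the division also turns the integers into floats.

```python
import numpy as np

# one made-up 2x2 greyscale image with pixel values from 0 to 255
images = np.array([[[0, 128],
                    [255, 64]]], dtype=np.uint8)

scaled = images / 255   # element-wise division, result is float

print(images.dtype, images.min(), images.max())
print(scaled.dtype, scaled.min(), scaled.max())
```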

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's again check the max value. The data is now in the proper format to feed the neural network

`X_train.max() = 1.0`

However, we do not apply this scaling step to this dataset in this notebook; the snippets above show how you would do it

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we will look at the image shape
* We note it has 3 dimensions
  * The first refers to the number of images; in this case, the X_train has 48k samples (or images)
  * The next two are the image size: 28x28
  * However, it is missing one last dimension, the channel. In this case, the image is greyscale; as a result, there is one channel. If the image were coloured, it would have three channels (RGB)

X_train.shape

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will simply reshape the array, where we essentially add a **1** to the last dimension. In that way, we add the channel dimension to the data
* We reshape all sets (X_train, X_val and X_test)
* We reshape with its current 1st, 2nd and 3rd dimension, and we force the last dimension to be 1

X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1)
X_val = X_val.reshape(X_val.shape[0], X_val.shape[1], X_val.shape[2], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1)

X_train.shape
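
The reshape only changes how the array is indexed, not the pixel values. Here is a minimal sketch on a made-up batch of five blank images; it also shows `np.expand_dims`, a NumPy function that gives the same result and is mentioned here as an alternative, not as the approach used in this notebook.

```python
import numpy as np

# made-up batch: 5 blank greyscale images of 28x28 pixels
batch = np.zeros((5, 28, 28))

# same approach as in the notebook: keep the three dimensions and append a 1
reshaped = batch.reshape(batch.shape[0], batch.shape[1], batch.shape[2], 1)

# alternative: add a new axis at the end
expanded = np.expand_dims(batch, axis=-1)

print(batch.shape, reshaped.shape, expanded.shape)
```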

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Note: we are taking these steps for scaling the data and reshaping to include the channel dimension since the data was provided in such format and is in a NumPy array format
* When you get image datasets in a NumPy format, you will recheck these items, and if required, you will need to process them.
* However, when dealing with real images, the preprocessing tasks are done in another way, which we will cover in the walkthrough of project 1

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Create Deep Learning Network

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  We will create a tensorflow model
* We create a function that creates a sequential model, compiles the model and returns the model. The function needs the input shape (image size) as well as the number of neurons in the last layer 


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The network has two pairs of Convolution + Pooling layers. We know in advance that one pair would be enough for this dataset; however, we want to showcase multiple pairs of convolution + pooling layers.
* Quick recap: convolution layers slide filters over the image to detect local patterns, such as edges. Pooling layers reduce the image size by keeping only the dominant values in each region
* The first pair has a convolution layer with 16 filters and a kernel size of 3x3. We pass the input shape and set `relu` as the activation function. The MaxPool has a pool size of 2x2
* The next pair has the same setup as the previous pair
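
To make the recap concrete, here is a minimal NumPy sketch of one convolution + pooling pair applied to a tiny made-up image with a single hand-written filter. `conv2d_valid` and `max_pool` are hypothetical helpers for this illustration; a real `Conv2D` layer learns its filter values during training and applies many filters at once.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep the maximum value of each non-overlapping size x size block."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # made-up 6x6 image
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)          # hand-written 3x3 filter

feature_map = np.maximum(conv2d_valid(image, kernel), 0)  # relu activation
pooled = max_pool(feature_map, size=2)

print(feature_map.shape, pooled.shape)  # (4, 4) (2, 2)
```

A 3x3 filter shrinks each side by 2 pixels, and a 2x2 pooling halves each side.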



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, there is a Flatten layer
* The Flatten layer is used to flatten the matrix into a vector, which means a single list of all values. Then that is fed into a dense layer.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, there is a Dense layer with 128 neurons.
* Typically, here, you arrange the dense layers in multiples of 2, and the number of layers depends on the data complexity after the Flatten layer.
  * We will check in `.summary()` or `plot_model()` that the data shape after the Flatten layer is 400, so it makes sense to reduce the number of neurons from 400 to 128. Naturally, you will only know that the output from the Flatten layer is 400 after creating a model and checking the summary or the plot.
  * If the output from the Flatten layer were much higher, like 5k, you would consider two or more dense layers to reduce the number of connections progressively.
  * The value 128 is a good starting point. If you notice the CNN is not learning, you may add more dense layers and adjust the number of neurons in them
* After, we have a dropout layer with a rate of 25% to reduce the chance of overfitting. 
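To make the effect of the dropout layer more concrete, here is a minimal NumPy sketch of what a dropout rate of 25% does during training. It is an illustration under assumptions, not the Keras implementation: the function name `dropout_training` and the toy `activations` vector are hypothetical. Roughly a quarter of the values are set to zero, and the surviving values are scaled up by `1 / (1 - rate)` so the expected sum stays the same. At prediction time, dropout is not applied.

```python
import numpy as np

def dropout_training(x, rate, rng):
    # Keep each value with probability (1 - rate), drop it otherwise
    keep_mask = rng.random(x.shape) >= rate
    # Scale the surviving values so the expected total stays unchanged
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(8)  # hypothetical output of the Dense layer
dropped = dropout_training(activations, rate=0.25, rng=rng)
print(dropped)  # each value is either 0 or 1 / 0.75
```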


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The output layer should reflect a multiclass classification.
  * We set a dense layer where the number of neurons equals the number of classes in the target variable. This information is stored in a previously created variable - `n_labels`. 
  * For multiclass classification, we set the activation function to `softmax`, and we compile the model with `adam` as the optimizer and `categorical_crossentropy` as the loss function. We also track accuracy as a metric.


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, MaxPool2D, Flatten, Dropout

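def create_tf_model(input_shape, n_labels):
  model = Sequential()
  
  # first Convolution + Pooling pair; only the first layer needs the input shape
  model.add(Conv2D(filters=16, kernel_size=(3, 3), input_shape=input_shape, activation='relu'))
  model.add(MaxPool2D(pool_size=(2, 2)))

  # second Convolution + Pooling pair
  model.add(Conv2D(filters=16, kernel_size=(3, 3), activation='relu'))
  model.add(MaxPool2D(pool_size=(2, 2)))

  # flatten the feature maps into a single vector
  model.add(Flatten())
  
  # fully connected layer, followed by dropout to reduce overfitting
  model.add(Dense(128, activation='relu'))
  model.add(Dropout(0.25))
  
  # output layer: one neuron per class, softmax for multiclass classification
  model.add(Dense(n_labels, activation='softmax'))
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

  return model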
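As a side note on the output layer, the following plain Python sketch shows what softmax and categorical cross-entropy compute for a single sample. The helper names and the example `logits` are hypothetical and only serve as an illustration; Keras does this internally in a vectorised way.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true_one_hot, y_prob, eps=1e-7):
    # Only the true class contributes, since the other one hot entries are 0
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true_one_hot, y_prob))

logits = [2.0, 1.0, 0.1]   # hypothetical raw scores for 3 classes
probs = softmax(logits)    # probabilities that add up to 1
y_true = [1, 0, 0]         # the actual class is the first one
loss = categorical_crossentropy(y_true, probs)
print(probs, loss)
```

The loss gets smaller as the probability assigned to the actual class gets closer to 1.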


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's visualize the network structure
* The network has just over 55k parameters. We will study each layer's input/output in the next cell

model = create_tf_model(input_shape=X_train.shape[1:], n_labels=n_labels )
model.summary()
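If you want to see where the parameter count comes from, the sketch below reproduces it by hand. It assumes a 28x28 input with 1 channel and 10 classes, as in this dataset; the helper names are hypothetical. Each filter or neuron has one weight per input plus one bias; pooling, Flatten and Dropout layers have no parameters.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # one weight per kernel cell and input channel, plus one bias, per filter
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(n_inputs, n_units):
    # one weight per input, plus one bias, per neuron
    return (n_inputs + 1) * n_units

conv1 = conv2d_params(3, 3, in_channels=1, filters=16)    # 160
conv2 = conv2d_params(3, 3, in_channels=16, filters=16)   # 2320
dense1 = dense_params(n_inputs=400, n_units=128)          # 51328
output = dense_params(n_inputs=128, n_units=10)           # 1290

total = conv1 + conv2 + dense1 + output
print(total)  # 55098
```

Note that most of the parameters sit in the first dense layer, which is why the size of the Flatten output matters.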

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Once again, we can use `plot_model()`, also from `keras.utils`, for a more graphical approach
* Note in the first convolution, the input is 28x28, but then it reduces to 26x26 due to the convolution dynamic (where you lose 2 pixels in each direction, one at each edge of the image). Then it goes to a pooling layer, and the image is halved (since the pool size is 2x2): 13 x 13
* In the second convolution, the same dynamic happens: 2 pixels in each direction are lost due to the convolution (13x13 becomes 11x11), and the pooling layer halves the image, rounding down, to 5x5
* The Flatten layer transforms the pooled image into a single vector. Its length is the product of all dimensions of the pooled image: 5 x 5 x 16 = 400
* Next, there is a dense layer of 128, and finally, an output layer with ten neurons (where each represents a number from 0 to 9)

from tensorflow.keras.utils import plot_model
plot_model(model, show_shapes=True)
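The shapes shown in the plot can also be traced by hand. The sketch below assumes the Keras defaults used in this model (no padding, a convolution stride of 1, and a pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    # without padding, a 3x3 kernel removes 2 pixels in each direction
    return size - kernel + 1

def pool_out(size, pool=2):
    # a 2x2 pool halves the size; any remainder is dropped
    return size // pool

size = 28
trace = [size]
for _ in range(2):            # two Convolution + Pooling pairs
    size = conv_out(size)
    trace.append(size)
    size = pool_out(size)
    trace.append(size)

flatten_size = size * size * 16   # 16 filters in the last convolution
print(trace)         # [28, 26, 13, 11, 5]
print(flatten_size)  # 400
```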

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Fit the model

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Early stopping allows us to stop the training when a monitored metric has stopped improving. This is useful to avoid overfitting the model to the data.
* We will monitor the validation loss now 
  * We set patience to 1; patience is the number of epochs with no improvement after which training will be stopped. There is no fixed rule for setting patience; if you feel that your model was still learning when the training stopped, you may increase the value and train again. Here we keep patience at 1 because we want the training process to be quick, since the idea is to provide you with a "look and feel" learning experience.
  * We set the mode to min since we want the model to stop training when the validation loss stops improving, and for a loss, improving means decreasing

from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=1)
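To build intuition for how patience works, here is a simplified plain Python sketch of the stopping rule. It is not the Keras implementation (it ignores options such as `min_delta` and `restore_best_weights`), and the function name and the validation losses are hypothetical.

```python
def epochs_run_with_early_stopping(val_losses, patience=1):
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # the monitored value improved, so reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch   # training stops after this epoch
    return len(val_losses)     # all epochs were completed

val_losses = [0.30, 0.12, 0.13, 0.10]   # hypothetical validation losses
stopped_p1 = epochs_run_with_early_stopping(val_losses, patience=1)
stopped_p2 = epochs_run_with_early_stopping(val_losses, patience=2)
print(stopped_p1)  # 3
print(stopped_p2)  # 4
```

With patience set to 1, training stops at the third epoch and never reaches the better fourth epoch, which illustrates why a low patience trades some performance for speed.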

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will finally fit the model
* We create the model object and use `.fit()`, as usual
  * We pass the Train set
  * The epochs are set to 4. We know in advance that this amount is enough to learn the patterns, considering the dataset and the network structure
  * We pass the validation data in a tuple.
  * Verbose is set to 1, so we can see which epoch we are in, as well as the training and validation loss.
  * Finally, we pass the `early_stop` object we created earlier as a callback.

* For each epoch, note the training and validation loss and accuracy. Is it increasing? Decreasing? Static?
  * Ideally, the loss should decrease as the epochs progress, which is a practical sign that the network is learning. The accuracy should increase over the epochs.
  * Note the model will take a bit longer now to train

model = create_tf_model(input_shape= X_train.shape[1:], n_labels=n_labels )

model.fit(x=X_train, 
          y=y_train, 
          epochs=4,
          validation_data=(X_val, y_val),
          verbose=1,
          callbacks=[early_stop]
          )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Model evaluation

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png">  Now we will evaluate the model performance by analysing the training and the validation losses and accuracy that happened during the training process. 
* In deep learning, we use the model history to assess if the model learned, using the train and validation sets. We also evaluate separately how the model generalises on unseen data (on the test set)
* The model training history is stored in the model's `.history.history` attribute.
* **Note it shows loss and accuracy for train and validation**

history = pd.DataFrame(model.history.history)
history.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We are plotting each loss and accuracy in a line plot, where the y-axis has the loss/accuracy value, the x-axis is the epoch number, and the lines are coloured by train or validation
* We use `.plot(style='.-')` for this task
  * Note the loss plot for training and validation data follow a similar path and are close to each other. So it looks like the network learned the patterns.
  * Note in the accuracy plot that both train and validation accuracies keep increasing. The training stops when the performance "saturates" for validation, as we set in the early stopping object.

sns.set_style("whitegrid")
history[['loss','val_loss']].plot(style='.-')
plt.title("Loss")
plt.show()

print("\n")
history[['accuracy','val_accuracy']].plot(style='.-')
plt.title("Accuracy")
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Next, we will evaluate the model performance on the test set, using `.evaluate()` and passing the test set. Note the values are not much different from the loss and accuracy on the train and validation sets.
* Note the loss is low and accuracy is high. It looks like the model has learned the relationships between the features and the target.

model.evaluate(X_test,y_test)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> When evaluating a deep learning model, you typically cover the loss plot and evaluate the test set; however, **as an optional additional step**, you can run an evaluation similar to the one we did in conventional ML.
* In classification, you would analyse the confusion matrix and classification report using the custom function we have seen over the course.
* One difference is that we adapted the function to also evaluate the validation set, but that is a minor change in the code; the overall logic is the same

from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):
  # the prediction comes as one probability per class
  prediction = pipeline.predict(X)
  # so we take the index of the highest probability, which is the "winner" or predicted class
  prediction = np.argmax(prediction, axis=1)
  
  # the actual values are one hot encoded, so argmax recovers the class index
  y = np.argmax(y, axis=1)

  print('---  Confusion Matrix  ---')
  # scikit-learn places y_true on the rows; we pass the predictions there on purpose,
  # so that the rows show the predictions and the columns show the actual values
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))
  print("\n")

  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")


def clf_performance(X_train,y_train,X_test,y_test,X_val, y_val,pipeline,label_map):

  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Validation Set #### \n")
  confusion_matrix_and_report(X_val,y_val,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)
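The layout of the confusion matrix printed by the function above (predictions on the rows, actual values on the columns) can be illustrated with a small plain Python sketch. The helper names and the toy data with 3 classes are hypothetical; this is only meant to show the logic, not to replace scikit-learn.

```python
def argmax(values):
    # index of the largest value in a list
    return max(range(len(values)), key=lambda i: values[i])

def confusion_counts(y_one_hot, y_proba, n_classes):
    # rows = predicted class, columns = actual class
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for truth, proba in zip(y_one_hot, y_proba):
        matrix[argmax(proba)][argmax(truth)] += 1
    return matrix

# hypothetical one hot encoded targets and predicted probabilities
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_proba = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.6, 0.3], [0.3, 0.6, 0.1]]

matrix = confusion_counts(y_true, y_proba, n_classes=3)
for row in matrix:
    print(row)
# [1, 0, 0]
# [0, 2, 1]
# [0, 0, 0]
```

The value 1 in the second row, third column means that one sample whose actual class is 2 was predicted as class 1.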

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We need `label_map` to be a list with the meaning of each class in a string format.
* We have `target_classes`, which is a list that represents the meaning of each class; however, it is a list of integers
* We will convert the list of integers to a list of strings using a list comprehension.

[str(x) for x in target_classes]

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's pass the values as usual.
* Note the model is capable of separating the classes, including in the test set

clf_performance(X_train, y_train,
                X_test,y_test,
                X_val, y_val,
                model,
                label_map= [str(x) for x in target_classes]
                )

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Prediction

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's take a sample from the test set and use it as if it were live data. We will consider a single sample

index = 102
my_number = X_test[index]
print(my_number.shape)
print(y_test[index])

sns.set_style('white')
plt.imshow(my_number.reshape(28,28), cmap='gray')
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We first investigate the shape of our live data. It has three dimensions, as we would expect from an image. In this case, it shows the image size (28 x 28) and the channel information (it is 1 since it is a greyscale image)

my_number.shape

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> However, when interacting with the model, we need the data in 4 dimensions, where the first dimension is the number of images in the data, the next 2 are the image size and the last is the number of colour channels
* In our case, we need to add the first dimension, and the value will be 1, so the final shape is (**1** ,28 ,28 ,1 )
* We use the command `np.expand_dims()` for this task. The documentation link is [here](https://numpy.org/doc/stable/reference/generated/numpy.expand_dims.html).

live_data = np.expand_dims(my_number, axis=0)
print(live_data.shape)
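As a standalone illustration of the same idea, the sketch below applies `np.expand_dims()` to a hypothetical blank image (an array of zeros) instead of the real sample. It also shows that indexing with `np.newaxis` gives the same result.

```python
import numpy as np

# hypothetical greyscale image: 28 x 28 pixels, 1 channel
image = np.zeros((28, 28, 1))

# add a new first dimension, which represents the number of images
batch = np.expand_dims(image, axis=0)

# equivalent alternative using np.newaxis
same = image[np.newaxis, ...]

print(image.shape)  # (28, 28, 1)
print(batch.shape)  # (1, 28, 28, 1)
```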

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use `.predict()` and pass the data. Note the result is a probability for each class.

prediction_proba = model.predict(live_data)
prediction_proba

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> So we take the index from the highest probability, which is the "winner" or predicted class

prediction_class = np.argmax(prediction_proba, axis=1) 
prediction_class

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's plot the probabilistic result so that you can check the predictions in a more visual fashion
* Read the code comments
* In short, you use `prediction_proba` to get the probability associated with each class. Then you plot it in a bar plot using Plotly

# create a dataframe that will show the probability per class
# we set the probabilities as the prediction_proba
prob_per_class= pd.DataFrame(data=prediction_proba[0],
                             columns=['Probability']
                             )

# we round the values to 3 decimal points, for better visualization
prob_per_class = prob_per_class.round(3)

# we add a column to prob_per_class that shows the meaning of each class
# in this case, the digit that is mapped in the target_classes
prob_per_class['Results'] = target_classes

prob_per_class

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will use a bar plot, where the x-axis shows the result and the y-axis shows the associated probability for a given Result
* I encourage you to go to the first cell of the Prediction section and change the `index` variable to take a different sample. Then run all cells from there down to the plot in the cell below
* You may change the index to any non-negative integer smaller than the number of images in the test set

import plotly.express as px
fig = px.bar(
        prob_per_class,
        x = 'Results',
        y = 'Probability',
        range_y=[0,1],
        width=600, height=400,template='seaborn')
fig.update_xaxes(type='category')
fig.show()

---