### Advanced learning algorithms

What will be learned:
* Neural Networks 
* Inferencing (prediction)
* Training
* Practical advice for ML systems
* Decision trees

### Neural Networks

*Deep learning* (DL) is the term usually used for machine learning with Neural Networks (NNs).

Speech recognition is an example of DL. Computer vision, NLP, and many other applications also use Neural Networks. Although NNs were originally inspired by how the brain works, today they do not really mimic the brain.

**Each neuron in a NN gets an input number, does some computation on the number, and then outputs the newly computed number.**

The ideas behind NNs have been around for many years, so why are they gaining so much traction now?  

The amount of digitized data has exploded. Traditional algorithms like linear and logistic regression cannot take full advantage of this: their performance 'plateaus' eventually, even when more data is added.  

Training NNs takes better advantage of this data. **The performance of a NN tends to keep improving with the amount of data used to train it, and larger NNs benefit more from additional data than smaller ones.**

### How a NN acts and works

Let's use an example: you sell t-shirts and want to predict whether a product will be a top seller:  
* You collected data on different t-shirts and which ones became top sellers
* The input $x$ is the price of the t-shirt, and you can apply logistic regression by fitting a *sigmoid* function to the data
* The output $f(x)$, now called $a$, is also called the *activation*
* Now you have a basic single neuron that takes an input, does a calculation (logistic regression), and outputs a value
* Building a NN requires wiring together a whole bunch of these *neurons*
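
The single neuron described above can be sketched in NumPy. This is a minimal illustration: the weight `w`, the bias `b`, and the prices are made-up values, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """One logistic-regression neuron: a = sigmoid(w * x + b)."""
    return sigmoid(w * x + b)

# Hypothetical parameters: a negative weight means cheaper shirts score higher.
w, b = -0.2, 4.0
prices = np.array([10.0, 20.0, 30.0])

# The activation a is the estimated probability of being a top seller.
a = neuron(prices, w, b)
print(a)
```

With these assumed parameters, the activation drops as the price rises, and a price of 20 lands exactly on the 0.5 boundary.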

A more complex example takes four input features: price of the t-shirt, shipping costs, amount of marketing, and material quality:
* A few factors that influence whether the t-shirt becomes a top seller:
    * Affordability of the t-shirt
    * Degree of awareness of the t-shirt
    * Degree of quality
* How to build the NN:
    * Single neuron for each factor
    * A logistic regression neuron inputs the price of the t-shirt and the shipping costs and outputs *affordability*
    * Another neuron estimates awareness from the *amount of marketing*
    * A final neuron estimates the degree of quality. This will be a function of the price of the t-shirt and the material quality
* These 3 neurons (**the layer**) are fed into a final neuron (**output layer**) that does logistic regression and outputs the probability of the t-shirt being a top seller
* The outputs of the first layer (affordability, awareness and quality) are also referred to as *activations*
* Finally, the NN takes 4 values (**input layer**), calculates three new values, and then uses these three to calculate the last number, which is the probability.

In the example above we manually decided which neuron takes which features. In practice, each neuron has access to every feature from the previous layer, and it learns to ignore the irrelevant ones.

Each layer inputs a vector of features and outputs a vector of features. The NN has input layer, hidden layer(s), and output layer.  
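
The t-shirt network (4 inputs, 3 hidden neurons, 1 output neuron) can be sketched as follows. The feature values and the randomly drawn weights are assumptions for illustration only; a real network would learn its weights during training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Input layer: price, shipping cost, marketing, material quality (made-up, pre-scaled values).
x = np.array([0.5, 0.1, 0.8, 0.9])

# Hidden layer: 3 neurons, each connected to all 4 inputs.
W1 = rng.normal(size=(4, 3))   # column j holds the weights of hidden neuron j
b1 = np.zeros(3)
a1 = sigmoid(x @ W1 + b1)      # vector of 3 activations

# Output layer: 1 neuron that sees all 3 hidden activations.
W2 = rng.normal(size=(3, 1))
b2 = np.zeros(1)
a2 = sigmoid(a1 @ W2 + b2)     # probability of being a top seller

print(a1.shape, a2.shape)
```

Each layer maps a vector to a vector: 4 values in, 3 values out, then 3 values in, 1 value out.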

One way to think of a NN is that it is just logistic regression, but a version of logistic regression that can learn its own features, which makes it easier to make accurate predictions. In other words, the NN 'engineers' its own features.

**NOTE**: <u>You do not need to explicitly decide which features the NN should compute. Instead, it figures out by itself which features it wants to use in the hidden layers.</u>

When building a NN you need to make a few decisions:
* How many hidden layers will you use
* How many neurons will you use  

These are questions of the **architecture** of the NNs. Choosing the appropriate numbers for these will have a great impact on the performance of the NN.

### Example of NN: Recognizing images

In a face recognition example you want the NN to take the image of a person as input and output the identity. Say the picture is 1000px by 1000px. It will be represented in the computer as a matrix of 1000 rows by 1000 columns. Unrolling this into a single vector gives 1 million (1000x1000) values. The NN then takes these values and tries to predict the identity.
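
Unrolling an image into a vector can be sketched in NumPy. The all-zero array below is only a placeholder for a real grayscale image.

```python
import numpy as np

# Placeholder for a 1000x1000 grayscale image: one intensity value per pixel.
image = np.zeros((1000, 1000))

# Unroll the rows one after another into a single feature vector.
x = image.reshape(-1)

print(image.shape, x.shape)
```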

The input $\vec{x}$ is fed into the first hidden layer, which extracts some features. The output of this first layer is fed to a second layer, and so on, until the output layer predicts person $xyz$.

**How it works**:  
1) The first layers might look for short lines or edges.
2) Further along, the layers start to look at combinations of such lines and edges, for example the corner of a nose or an eye.
3) Later layers check larger, coarser face shapes.
4) Finally, the output layer tries to determine the identity.

**Again the NN figures out all on its own how to set itself up and figure out the identities by looking at these features in different ways.**

### Neural Network Layer

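Each neuron in a layer has its own parameters: the first neuron has $\vec{w}_1$ and $b_1$ and outputs $a_1$, the second has $\vec{w}_2$ and $b_2$ and outputs $a_2$, and so on. After the first layer you are left with a vector of *activation* values, denoted $\vec{a}$. This is passed to the output layer to do the final computations.  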

The output of the first layer is written as follows:  
$$\vec{a}^{[1]}$$

This is the vector of activation values of the first layer. You can also write the activation value of a specific neuron, here the first neuron in the first layer, which is a scalar:
$$a^{[1]}_1$$

The $w$ and $b$ values are denoted in the same way:  
$$\vec{w}^{[1]}_1, b^{[1]}_1$$

These are the $w$ and $b$ values for the first neuron in the first layer. The **superscript** "[1]" denotes the number of the layer and the subscript denotes the number of the neuron in the layer, counted from top to bottom.

### Neural Network Model (More complex Neural Networks)

The example has:
* Four layers in total
* One input layer (layer 0), **NOTE**: *By convention we do not count this layer when referring to the NN*
* Three hidden layers (layer 1, 2, 3)
* One output layer (layer 4)

Example of the computations of layer 3:  
$$a^{[3]}_1 = g(\vec{w}_1^{[3]}\cdot\vec{a}^{[2]}+b_1^{[3]})$$
$$a^{[3]}_2 = g(\vec{w}_2^{[3]}\cdot\vec{a}^{[2]}+b_2^{[3]})$$
$$a^{[3]}_3 = g(\vec{w}_3^{[3]}\cdot\vec{a}^{[2]}+b_3^{[3]})$$

Note that every neuron takes in the whole vector $\vec{a}^{[2]}$ to do its calculation.

This can be written more generally, for neuron $j$ in layer $l$, as:
$$a_j^{[l]} = g(\vec{w}_j^{[l]}\cdot\vec{a}^{[l-1]}+b_j^{[l]})$$

$g$ is referred to in general as the *activation function*. The input vector is referred to as $\vec{x} = \vec{a}^{[0]}$.
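
The general formula can be sketched as a loop over the neurons of one layer. This is a minimal illustration, assuming sigmoid for $g$; the function name `dense_layer` and all numeric values are hypothetical.

```python
import numpy as np

def g(z):
    """Activation function; sigmoid is assumed here."""
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(a_prev, W, b):
    """Compute a_j = g(w_j . a_prev + b_j) for every neuron j of one layer.

    W holds one column per neuron and b holds one bias per neuron.
    """
    units = W.shape[1]
    a = np.zeros(units)
    for j in range(units):
        w_j = W[:, j]                          # weights of neuron j
        a[j] = g(np.dot(w_j, a_prev) + b[j])   # every neuron sees all of a_prev
    return a

a2 = np.array([0.2, 0.7])                # hypothetical activations of layer 2
W3 = np.array([[1.0, -1.0, 0.0],
               [2.0,  0.5, 0.0]])        # hypothetical weights of layer 3 (3 neurons)
b3 = np.array([0.0, 0.0, 0.0])

a3 = dense_layer(a2, W3, b3)
print(a3)
```

The third neuron has all-zero parameters, so its activation is $g(0) = 0.5$, which is a quick sanity check on the implementation.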

### Inference: Making Predictions (Forward Propagation)

In the example we will:
* Input an 8x8 image (64 pixel values)
* Try to distinguish between the digits 0 and 1
* Use a three-layer NN: Layer 1 = 25 units, Layer 2 = 15 units, and Layer 3 = 1 unit

Sequence of computations: $\vec{x} \rightarrow \vec{a}^{[1]} \rightarrow \vec{a}^{[2]} \rightarrow a^{[3]}$. Here $a^{[3]}$ is a single scalar value, and the prediction is 1 if $a^{[3]} \geq 0.5$ and 0 otherwise.

Because this algorithm goes from left to right, this is also called **Forward Propagation**. This is because you are *propagating* the activations of the neurons forward. The architecture here is fairly typical for NNs with the number of neurons decreasing from left to right.
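
Forward propagation through the 64, 25, 15, 1 architecture can be sketched in NumPy. The weights and the input below are random stand-ins, so the prediction itself is meaningless; the point is only the sequence of computations and the shapes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Propagate activations left to right through every layer."""
    a = x
    for W, b in params:
        a = sigmoid(a @ W + b)
    return a

rng = np.random.default_rng(1)
sizes = [64, 25, 15, 1]   # input size, then units in layers 1, 2 and 3

# One (W, b) pair per layer; random values stand in for trained parameters.
params = [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.random(64)                  # stand-in for a flattened 8x8 image
a3 = forward(x, params)             # output of layer 3, shape (1,)
yhat = 1 if a3[0] >= 0.5 else 0     # threshold at 0.5
print(a3, yhat)
```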

### TensorFlow Implementation: Inference in Code

TensorFlow is one of the leading frameworks for building deep learning algorithms.  

Example of roasting coffee:
* *Two input features*: temperature (Celsius) and duration (minutes)
* Dataset will have different temperatures and durations, and the *label* showing whether the coffee roasted is good tasting or not (y = 1 is good and y = 0 is bad)

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Sequential
from tensorflow.keras.losses import MeanSquaredError, BinaryCrossentropy
from tensorflow.keras.activations import sigmoid

# One example as a 1x2 matrix: [temperature (Celsius), duration (minutes)]
x = np.array([[200.0, 17.0]])

# Dense is a fully connected layer; here 3 hidden units with sigmoid activation
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)  # apply layer_1 to x
layer_2 = Dense(units=1, activation='sigmoid')
a2 = layer_2(a1)

# a2 has shape (1, 1), so extract the scalar before thresholding
yhat = 1 if a2.numpy()[0, 0] >= 0.5 else 0
# Note: w and b were not loaded here, so the layers use their initial random weights
```


### Data in TensorFlow: How TensorFlow handles data

How is data represented in NumPy? Why are there double square brackets in 'x = np.array([[200.0, 17.0]])'? 

To store a 2x3 matrix: x = np.array([[1, 2, 3],[4, 5, 6]]). Each inner pair of brackets is one row.  

**When we set x = np.array([[200.0, 17.0]]) we created a 1x2 matrix (one row, two columns) as follows: [200 17].**

When you want a 2x1 matrix (two rows, one column) it will look like this: x = np.array([[200],[17]]).  

**So why does it have double square brackets?** *Because we want a 2D matrix: the outer brackets hold the rows and each inner pair of brackets is one row. With single brackets, x = np.array([200, 17]) would be a 1D "vector", a linear array with no rows or columns.*  

We used this 1D vector to represent the features $x$ in course 1. But with TensorFlow the convention is to use matrices to represent the data!
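
The difference between these representations shows up in the `shape` attribute. A short sketch (the variable names `row`, `col` and `vec` are chosen for illustration):

```python
import numpy as np

row = np.array([[200.0, 17.0]])      # 2D: 1 row, 2 columns
col = np.array([[200.0], [17.0]])    # 2D: 2 rows, 1 column
vec = np.array([200.0, 17.0])        # 1D: just 2 elements, no rows or columns

print(row.shape, col.shape, vec.shape)
print(row.ndim, col.ndim, vec.ndim)
```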

If you print a1, you will get:  
a1 = layer_1(x)  
**result**: tf.Tensor([[0.2 0.7 0.3]], shape=(1, 3), dtype=float32)

a1.numpy()  
**result**: array([[0.2, 0.7, 0.3]], dtype=float32)

If we look at a2:  
a2 = layer_2(a1)  
**result**: tf.Tensor([[0.8]], shape=(1, 1), dtype=float32)

a2.numpy()  
**result**: array([[0.8]], dtype=float32)

### Building a Neural Network in TensorFlow

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# First we create the two layers
layer_1 = Dense(units=3, activation='sigmoid')  # fully connected layer: 3 hidden units, sigmoid activation
layer_2 = Dense(units=1, activation='sigmoid')
# Then we tell TensorFlow to take these layers and string them together into a NN
model = Sequential([layer_1, layer_2])
```

By convention, the layers are usually not assigned to separate variables first. They are created directly inside Sequential:

```python
model = Sequential([Dense(units=3, activation="sigmoid"),
                    Dense(units=1, activation="sigmoid")])
```

### Implementing NN efficiently (Vectorization)

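Vectorized implementation:

```python
import numpy as np

def g(z):
    """Activation function; sigmoid is assumed here."""
    return 1 / (1 + np.exp(-z))

X = np.array([[200, 17]])        # 1x2 matrix: one example, two features
W = np.array([[1, -3, 5],
              [-2, 4, -6]])      # 2x3 matrix: one column of weights per neuron
B = np.array([[-1, 1, 2]])       # 1x3 matrix: one bias per neuron

def dense(A_in, W, B):
    Z = np.matmul(A_in, W) + B   # full matrix multiplication, then add the biases
    A_out = g(Z)
    return A_out
```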
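
To check that the vectorized version computes the same thing as a neuron-by-neuron loop, the two can be compared on the values above. This sketch assumes sigmoid for `g`, which the text does not specify.

```python
import numpy as np

def g(z):
    """Activation function; sigmoid is assumed here."""
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[200, 17]])
W = np.array([[1, -3, 5],
              [-2, 4, -6]])
B = np.array([[-1, 1, 2]])

# Vectorized: one matrix multiplication for the whole layer.
Z = np.matmul(X, W) + B          # Z is [[165, -531, 900]]
A_vectorized = g(Z)

# Loop: one dot product per neuron (column of W).
A_loop = np.zeros((1, 3))
for j in range(W.shape[1]):
    A_loop[0, j] = g(np.dot(X[0], W[:, j]) + B[0, j])

print(np.allclose(A_vectorized, A_loop))
```

Both approaches give the same activations; the vectorized form replaces the Python loop with a single call that NumPy can run much more efficiently.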

### Matrix Multiplication

To take the dot product of two vectors, you multiply the first elements with each other, then the second, and so on, and then add the products up, e.g.  

$\begin{bmatrix} 1 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 4 \end{bmatrix} = (1 \times 3)+(2 \times 4) = 3+8 = 11$  

Transpose: takes a column vector and turns it into a row vector.

$\vec{a} = \begin{bmatrix} 1\\2 \end{bmatrix}, \quad \vec{a}^T = \begin{bmatrix} 1&2 \end{bmatrix}$

e.g. multiplying $\vec{a}^T =\begin{bmatrix} 1&2 \end{bmatrix}$ by $\vec{w} = \begin{bmatrix} 2 \\ 4 \end{bmatrix}$ gives $\vec{a}^T\vec{w} = (1 \times 2)+(2 \times 4) = 10$.  

***This is the same as taking the dot product $\vec{a} \cdot \vec{w}$.***
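
This equivalence can be checked in NumPy. A small sketch, storing both vectors as 2x1 column matrices:

```python
import numpy as np

a = np.array([[1], [2]])   # column vector, shape (2, 1)
w = np.array([[2], [4]])   # column vector, shape (2, 1)

# Dot product of the two vectors (using their 1D contents).
dot = np.dot(a[:, 0], w[:, 0])   # 1*2 + 2*4

# Transpose a into a row vector, then matrix-multiply by w.
prod = a.T @ w                   # shape (1, 1)

print(dot, prod)
```

The matrix product returns a 1x1 matrix rather than a plain number, but its single entry equals the dot product.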

**Vector Matrix multiplication**:  

$\vec{a}^T = \begin{bmatrix} 1&2 \end{bmatrix}$  

$W = \begin{bmatrix} 3&5 \\ 4&6 \end{bmatrix}$  

Then calculating: $Z = \vec{a}^T W$

$Z = \begin{bmatrix}\vec{a}^T \vec{w}_1 & \vec{a}^T \vec{w}_2\end{bmatrix}$, where $\vec{w}_1$ and $\vec{w}_2$ are the columns of $W$, which is $Z = [(1 \times 3)+(2 \times 4)\; (1 \times 5)+(2 \times 6)] = [11 \; 17]$

**Matrix Matrix Multiplication**:

$A = \begin{bmatrix}1&-1 \\ 2&-2\end{bmatrix}$  

$A^T = \begin{bmatrix}1&2 \\ -1&-2\end{bmatrix}$  

**How to make the transpose**? The first column becomes the first row, and the second column becomes the second row.

We also have: $W = \begin{bmatrix} 3&5 \\ 4&6 \end{bmatrix}$  

$Z = A^TW = \begin{bmatrix} \text{row}_1(A^T)\cdot\text{col}_1(W)&\text{row}_1(A^T)\cdot\text{col}_2(W) \\ \text{row}_2(A^T)\cdot\text{col}_1(W)&\text{row}_2(A^T)\cdot\text{col}_2(W) \end{bmatrix}$  

$Z = \begin{bmatrix} 11&17 \\ -11&-17 \end{bmatrix}$
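
Both worked examples above can be reproduced with NumPy's `@` operator:

```python
import numpy as np

a_T = np.array([[1, 2]])          # row vector, shape (1, 2)
A = np.array([[1, -1],
              [2, -2]])
W = np.array([[3, 5],
              [4, 6]])

z = a_T @ W      # vector-matrix product: one dot product per column of W
Z = A.T @ W      # matrix-matrix product: each row of A^T with each column of W

print(z)
print(Z)
```

The first row of `Z` equals `z`, because the first row of $A^T$ is the same row vector $\vec{a}^T$.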

### Matrix Multiplication Rules

Think of the columns of each matrix as vectors.  

Take $Z = A^TW$.  

Each row of $A^T$ corresponds to a row of the result, and each column of $W$ corresponds to a column of the result.  

**A requirement**: a 3x2 matrix can only be multiplied with a 2xN matrix. The result will be a 3xN matrix.  

**Why?** Because you can only take dot products of vectors that are the same length! Each entry of the result is the dot product of a row of $A^T$ with a column of $W$, so the number of columns of $A^T$ must equal the number of rows of $W$.
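
The shape rule can be seen directly in NumPy. In this sketch the matrices are filled with ones because only the shapes matter; N = 4 is an arbitrary choice.

```python
import numpy as np

A_T = np.ones((3, 2))
W = np.ones((2, 4))

Z = A_T @ W          # inner dimensions match (2 and 2)
print(Z.shape)       # rows come from A_T, columns from W

# Mismatched inner dimensions (2 vs 3): NumPy refuses to multiply.
try:
    np.ones((3, 2)) @ np.ones((3, 4))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(mismatch_raised)
```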