# Deep Learning

<div>
<img src="files/deep_learning_brain.jpg" width="95%" source='https://www.futura-sciences.com/tech/definitions/intelligence-artificielle-deep-learning-17262/' align='center'/>
</div>

Nowadays, deep learning is everywhere:

- Autonomous cars
- Unlocking phones
- Person tracking
- Detection of multiple sources in sounds
- Captioning (picture to text)
- Generating text (Large Language Models)
- Generating images
- Generating videos

### Basic architecture

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Input X = one single observation, 4 features (x1, x2, x3, x4)
# (e.g. eye color, ear length, ...)
X = [1., -3.1, -7.2, 2.1]

# Target y (classification task 0/1, e.g. cat/dog)
y = 1

Imagine you have a **linear regression** with some weights:

In [None]:
def linreg_1(X):
    return -3 + 2.1*X[0] - 1.2*X[1] + 0.3*X[2] + 1.3*X[3]

out_1 = linreg_1(X)

Then you transform its output with an **activation function**, which turns the linear function into a non-linear one:

In [None]:
def activation(value):
    if value > 0:
        return value
    else:
        return 0

out_1 = activation(out_1)

## Neuron (perceptron)

### Weights

This **neuron**, also called **perceptron**, is made of the **sum of weighted features** combined with an **activation function**.

We took our 4 features ($x_1$, $x_2$, $x_3$, and $x_4$), applied some weights, and summed everything (plus an intercept). If the result is positive, the neuron returns it; otherwise, it returns 0.

<div>
<img src="files/neuron_1.png" width="55%" source='https://www.researchgate.net/figure/12-Schema-de-structure-dun-seul-neurone-De-facon-simplifiee-le-reseau-de-neurones_fig3_324474500' align='center'/>
</div>

Given an input $X = (x_1, x_2, \ldots, x_n)$, a neuron is the composition of:

1. A **linear combination** of the input with the weights $w_k$ plus a bias $b$, that outputs $\sum_{k=1}^n w_k x_k + b$.

2. A **non-linear modification** $f$ of that sum.

Therefore, the output of a neuron is $\text{output} = f\left(\sum_{k=1}^n w_k x_k + b\right)$.
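
As a minimal sketch, the formula above can be written with NumPy. The weights and bias below simply reuse the coefficients of `linreg_1`, and `neuron` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np

def activation(value):
    # Same rule as the activation coded earlier: keep positive values, else 0
    return value if value > 0 else 0.0

def neuron(x, w, b, f):
    """Return f(sum_k w_k * x_k + b) for one observation x."""
    return f(np.dot(w, x) + b)

x = np.array([1., -3.1, -7.2, 2.1])   # the observation X used above
w = np.array([2.1, -1.2, 0.3, 1.3])   # weights of linreg_1
b = -3.0                              # bias (intercept) of linreg_1

output = neuron(x, w, b, activation)
print(output)
```

Passing `f` as an argument keeps the linear part and the non-linear part separate, which mirrors the two steps listed above.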


### Activation functions

There are many different activation functions.
<div>
<img src="files/activation_functions.png" width="55%" source='https://www.researchgate.net/figure/12-Schema-de-structure-dun-seul-neurone-De-facon-simplifiee-le-reseau-de-neurones_fig3_324474500' align='center'/>
</div>

Note: $\tanh(x)$ is the hyperbolic tangent.
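
To get a feel for how activation functions differ, here is a short sketch of three common ones in NumPy. It is not meant to match the figure exactly: `sigmoid` and `step` are helper names chosen for this illustration, while `np.tanh` is NumPy's built-in hyperbolic tangent.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def step(x):
    # Binary step: 1 for strictly positive inputs, 0 otherwise
    return np.where(x > 0, 1.0, 0.0)

values = np.array([-2.0, 0.0, 2.0])

sig = sigmoid(values)   # values in (0, 1), equal to 0.5 at x = 0
tan = np.tanh(values)   # values in (-1, 1), equal to 0 at x = 0
stp = step(values)      # only 0s and 1s

print(sig, tan, stp)
```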

❓ **Which of these activations did we just code?**

### Layers

Now imagine that you produce another output by:

- Applying **another linear regression** to the same input X.
- Followed by the same activation function.

In [None]:
# Second neuron
def linreg_2(X):
    return -5 + (-0.1*X[0]) + 1.2*X[1] + 4.9*X[2] - 3.1*X[3] # Same X but different weights

out_2 = activation(linreg_2(X)) # Same activation function

In [None]:
# Third neuron
def linreg_3(X):
    return -8 + 0.4*X[0] + 2.6*X[1] + (-2.5*X[2]) + 3.8*X[3] # Same X but different weights

out_3 = activation(linreg_3(X))

In [None]:
# Fourth neuron
def linreg_4(X):
    return 10 + 0.8*X[0] + 1.7*X[1] + 3.5*X[2] + 1.67*X[3]

out_4 = activation(linreg_4(X))

In [None]:
# Fifth neuron
def linreg_5(X):
    return -7.3 + 0.42*X[0] + 3.89*X[1] + (-0.675*X[2]) + (-13.8*X[3])

out_5 = activation(linreg_5(X))

We have now created 5 neurons: ```out_1```, ```out_2```, ```out_3```, ```out_4``` and ```out_5```. Each one takes the same input (X), but the linear regressions have different coefficients. They all share the same activation function.

(**Note**: Each neuron can have a different activation function but in practice, each layer only uses one type.)

**We just wrote a layer of neurons !**
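
A layer like this one can also be computed in a single step. The sketch below stacks the coefficients of the five linear regressions into a matrix `W` (one row per neuron) and their intercepts into a vector `b`; these two names are introduced here for illustration only.

```python
import numpy as np

# One row per neuron: the coefficients of linreg_1 ... linreg_5
W = np.array([
    [ 2.1,   -1.2,   0.3,     1.3 ],
    [-0.1,    1.2,   4.9,    -3.1 ],
    [ 0.4,    2.6,  -2.5,     3.8 ],
    [ 0.8,    1.7,   3.5,     1.67],
    [ 0.42,   3.89, -0.675, -13.8 ],
])
b = np.array([-3., -5., -8., 10., -7.3])   # one intercept per neuron
x = np.array([1., -3.1, -7.2, 2.1])        # the same observation X

# W @ x + b computes the 5 weighted sums at once;
# np.maximum(..., 0) applies the same activation to each of them
layer_output = np.maximum(W @ x + b, 0)
print(layer_output)
```

Writing the layer as a matrix product is the form deep learning libraries rely on, because it scales to many neurons without writing one function per neuron.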

There are 5 neurons on this graph, but we just created 3 more.


<div>
<img src="files/layer.png" width="55%" source='https://www.researchgate.net/figure/12-Schema-de-structure-dun-seul-neurone-De-facon-simplifiee-le-reseau-de-neurones_fig3_324474500' align='center'/>
</div>

### What if we use the 5 outputs of this layer as the input of another layer?

In [None]:
def linreg_second_layer(X):
    return 5.1 + 1.1*X[0] - 4.1*X[1] - 0.7*X[2] + 1.7*X[3] + (-8.91*X[4]) # Now we have 5 inputs!

def activation_second_layer(value):
    # sigmoid activation (for classification task for example)!
    return 1. / (1 + np.exp(-value))

def neural_net_predictor(X):
    
    # First layer
    out_1 = activation(linreg_1(X))
    out_2 = activation(linreg_2(X))
    out_3 = activation(linreg_3(X))
    out_4 = activation(linreg_4(X))
    out_5 = activation(linreg_5(X))
    
    outs = [out_1, out_2, out_3, out_4, out_5] # All outputs from layer 1
    
    # Second layer and prediction
    y_pred = activation_second_layer(linreg_second_layer(outs))
    
    return y_pred

In [None]:
neural_net_predictor(X)
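
The same two-layer prediction can be sketched as a loop over layers, where each layer is a `(weights, bias, activation)` triple. The `forward` helper and the variable names are assumptions made for this example; the numbers are the coefficients used in the functions above.

```python
import numpy as np

def activation(z):
    # Keep positive values, replace the rest with 0
    return np.maximum(z, 0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Apply each (W, b, f) layer in turn: out = f(W @ out + b)."""
    out = x
    for W, b, f in layers:
        out = f(W @ out + b)
    return out

W1 = np.array([[2.1, -1.2, 0.3, 1.3],
               [-0.1, 1.2, 4.9, -3.1],
               [0.4, 2.6, -2.5, 3.8],
               [0.8, 1.7, 3.5, 1.67],
               [0.42, 3.89, -0.675, -13.8]])
b1 = np.array([-3., -5., -8., 10., -7.3])
W2 = np.array([[1.1, -4.1, -0.7, 1.7, -8.91]])  # 5 inputs -> 1 output
b2 = np.array([5.1])

x = np.array([1., -3.1, -7.2, 2.1])
y_pred = forward(x, [(W1, b1, activation), (W2, b2, sigmoid)])
print(y_pred)
```

Adding a third layer would only mean appending one more triple to the list, which is the idea behind a sequential model.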

## So, what is a Neural Network?

Nothing more than a fancy function $f_{\theta}$ that computes $\hat{y} = f_{\theta}(x)$, where $\theta$ are the weights of all the linear regressions that take place within the neurons.

Usually:

- $\theta$ denotes all the weights, including $b$.
- $w$ denotes the weights without $b$.
- $b$ denotes the intercepts, also called the biases (called $\beta_0$ in a regression).

<div>
<img src="files/neuralnet.png" width="55%" source='https://www.researchgate.net/figure/12-Schema-de-structure-dun-seul-neurone-De-facon-simplifiee-le-reseau-de-neurones_fig3_324474500' align='center'/>
</div>

In the above graph, how many weights are in the hidden layer and in the output layer?

- Hidden layer: there are 5 inputs $x$ and 4 neurons, so $5 \times 4 = 20$ weights $w$. Each neuron also has an intercept ($b$), which adds 4 more. So the first layer has 24 parameters.
- Output layer: it receives the 4 outputs of the hidden layer, plus one bias ($b$), so 5 parameters.
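
The counting rule used here (one weight per input per neuron, plus one bias per neuron) can be written as a small helper. `dense_param_count` is a hypothetical name, and the layer sizes are the ones stated in the text above.

```python
def dense_param_count(n_inputs, n_neurons):
    """Parameters of a fully connected layer: weights plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

hidden = dense_param_count(n_inputs=5, n_neurons=4)   # 5*4 weights + 4 biases
output = dense_param_count(n_inputs=4, n_neurons=1)   # 4 weights + 1 bias
total = hidden + output

print(hidden, output, total)
```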

## Deep learning and neural networks

"**Deep learning**" means we're using a neural network that has many layers.

Of course the output can be **one value** or **several values**. For example you can output each pixel of a new image. Or if we have 10 classes, we can predict 10 probabilities (and the sum will be 1).

When every neuron of a layer is connected to every neuron of the previous layer, we call the layer "dense" (or fully connected).

<div>
<img src="files/neuralnet_2.png" width="55%" source='https://www.researchgate.net/figure/12-Schema-de-structure-dun-seul-neurone-De-facon-simplifiee-le-reseau-de-neurones_fig3_324474500' align='center'/>
</div>

## Why do we use activation functions?

To introduce non-linearities! Without them, our Neural Network would be a simple linear model.

$A(a_1x_1 + a_2x_2) + B(b_1x_1 + b_2x_2) = (Aa_1 + Bb_1)x_1 + (Aa_2 + Bb_2)x_2$
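
The identity above can be checked numerically. In this sketch, two stacked layers without any activation are compared with a single linear layer whose weights are `W2 @ W1` and whose bias is `W2 @ b1 + b2`. The layer sizes and the random values are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with NO activation function: 2 inputs -> 3 neurons -> 1 output
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x = rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2

# The same computation collapsed into one linear model
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

same = np.allclose(two_layers, one_layer)
print(same)
```

However many linear layers are stacked, the result stays a single linear model, which is why a non-linear activation is needed between layers.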

### Tensorflow Playground

[Tensor Flow Playground](https://playground.tensorflow.org/)

With the default parameters:

- Orange means a negative value and blue means a positive value.
- The 4 neurons of the first hidden layer are computed from the 2 initial features, with a Tanh activation and a bias.

If we try with a linear activation, the network can't find a way to separate the data points.

## Keras

**Keras** started out as an independent library and was later backed by Google. It now ships with **TensorFlow**, which is itself a deep learning library, as the `tensorflow.keras` module (it also exists as a separate `keras` package).


<div>
<img src="files/keras_and_tf.png" width="55%" source='https://www.researchgate.net/figure/12-Schema-de-structure-dun-seul-neurone-De-facon-simplifiee-le-reseau-de-neurones_fig3_324474500' align='center'/>
</div>

## Installation

To install keras use:
    
```pip install tensorflow```

And to import:
    
```from tensorflow.keras import *```

In [None]:
from tensorflow.keras import Sequential, layers

# Basically, it will look like a sequence of layers 
model = Sequential()

# First layer: 10 neurons and ReLU as the activation function
model.add(layers.Dense(10, activation='relu')) 

# The standard layers are called Fully Connected (Dense in Keras)

In [None]:
# You can go for two fully connected layers
model = Sequential()
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(20, activation='tanh'))

In [None]:
# You can also go for many, many, many more ...

model = Sequential()
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(20, activation='tanh'))
model.add(layers.Dense(10, activation='linear'))
model.add(layers.Dense(100, activation='sigmoid'))
model.add(layers.Dense(40, activation='softmax'))
model.add(layers.Dense(10, activation='tanh'))
model.add(layers.Dense(3, activation='relu'))
model.add(layers.Dense(9, activation='tanh'))
model.add(layers.Dense(8900, activation='relu'))
model.add(layers.Dense(1000, activation='tanh'))

## Decision rules

How do I know how many layers, how many neurons per layer, and which activation functions to use?

Well, it takes a lot of time and effort but you can start with these simple rules:

### Rule N°1 : The first layer

The input of your first layer must match the size of your input, i.e. the number of features of one observation.

In [None]:
from tensorflow.keras import Sequential, layers, Input

# Imagine each observation has 4 features (x1, x2, x3, x4)
model = Sequential()
#model.add(layers.Dense(10, input_dim=4, activation='relu')) # Old syntax
model.add(Input(shape=(4,)))  # Define the input shape explicitly
model.add(layers.Dense(10, activation='relu'))  # Add a dense layer with 10 neurons

### Rule N°2 : The last layer

#### Regression

You need to create the right last layer for your task. If it's a regression problem, use a linear activation function, so the output can take any value.

In [None]:
# Option A: only 1 output
model.add(layers.Dense(1, activation='linear'))

# Option B (pick one option, do not stack both): 13 outputs.
# You can predict more than one value per observation, like a color for each pixel.
model.add(layers.Dense(13, activation='linear'))

#### Classification

In [None]:
# Option A: 2 classes (binary)
model.add(layers.Dense(1, activation='sigmoid'))

# Option B (pick one option, do not stack both): 13 classes
model.add(layers.Dense(13, activation='softmax')) # The probabilities sum to one, e.g. [0.78, 0.02, etc.]

### Softmax

The softmax activation function transforms the raw outputs of the neural network into a vector of probabilities, essentially a probability distribution over the output classes. It's a bit like the sigmoid, but for more than two classes.

<div>
<img src="files/softmax.jpg" width="45%" source='https://towardsdatascience.com/softmax-activation-function-explained-a7e1bc3ad60' align='center'/>
</div>
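
A minimal NumPy sketch of softmax is shown below. The raw scores are made-up values, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # shift by the max to avoid overflow
    return e / e.sum()        # normalize so the outputs sum to 1

scores = [2.0, 1.0, 0.1]      # made-up raw outputs for 3 classes
probs = softmax(scores)
print(probs, probs.sum())

# With two classes and scores [z, 0], softmax gives the same value as sigmoid(z)
two_class = softmax([1.5, 0.0])[0]
sigmoid_value = 1.0 / (1.0 + np.exp(-1.5))
print(two_class, sigmoid_value)
```

The largest raw score always receives the largest probability, so the predicted class is simply the position of the maximum.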



### Rule N°3 : Experiment

In practice, apart from the input size and the last layer, you have to choose:

- The number of neurons.
- The number of layers.
- The activation functions.

In [None]:
# Small exercise: how many parameters in this simple regression task?

model = Sequential()
model.add(Input(shape=(4,)))
model.add(layers.Dense(10,activation='relu'))
model.add(layers.Dense(1, activation='linear'))

In [None]:
# Code here !


### Exercise:

How many parameters in this model?

In [None]:
model = Sequential()
model.add(Input(shape=(784,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(64, activation='tanh'))
model.add(layers.Dense(10, activation='softmax'))

In [None]:
# Code here!


## Training : Loss and Optimization

### Compiling


In [None]:
model.compile(loss='mse', optimizer='adam') # That's the solver, in DL the default is adam.

### Fitting

In [None]:
X = np.array([[1., -3.1, -7.2, 2.1]]) # Same X as before, only one sample
y = np.array([[1]])  # Only one y

model = Sequential()
model.add(Input(shape=(4,)))
model.add(layers.Dense(8, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='mse', optimizer='adam') # That's the solver, in DL the default is adam.

model.fit(X, y, batch_size=32, epochs=2);

- ```batch_size``` is the number of observations the network processes before updating the parameters $\theta$. After each batch, the model computes new weights.
- One epoch is one full pass over all the training data.
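
To make the relationship between the two concrete, the sketch below counts how many weight updates a training run performs. It assumes the last, smaller batch of an epoch still triggers an update; `n_weight_updates` and the sample counts are illustrative, not taken from the notebook.

```python
import math

def n_weight_updates(n_samples, batch_size, epochs):
    # Each epoch is split into batches; every batch triggers one update
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * epochs

# 1000 samples in batches of 32 -> 32 batches per epoch (the last one is smaller)
updates = n_weight_updates(n_samples=1000, batch_size=32, epochs=2)

# A single sample, as in the cell above: one batch per epoch
single = n_weight_updates(n_samples=1, batch_size=32, epochs=2)

print(updates, single)
```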


## Example : Face recognition



In [None]:
# Load data
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=200, resize=0.25)

# 766 images of 31 * 23 pixel black & white
print(faces.images.shape)

In [None]:
# 2 different target classes
np.unique(faces.target)

In [None]:
fig = plt.figure(figsize=(13,10))
for i in range(15):
    plt.subplot(5, 5, i + 1)
    plt.title(faces.target_names[faces.target[i]], size=12)
    plt.imshow(faces.images[i], cmap=plt.cm.gray)
    plt.xticks(()); plt.yticks(())

In [None]:
# Flatten our 766 images
X = faces.images.reshape(766, 31*23) # Each image is now a flat array of 713 pixel values (check X.min() and X.max() for their range).
X.shape

In [None]:
y = faces.target
y.shape

In [None]:
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3)

In [None]:
# Standardize
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
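
Standardizing means subtracting the mean and dividing by the standard deviation of each feature, using statistics computed on the training set only. The sketch below reproduces that idea by hand with NumPy on made-up numbers; it is an illustration of the principle, not a replacement for `StandardScaler`.

```python
import numpy as np

train = np.array([[0., 10.], [2., 20.], [4., 30.]])   # made-up training data
test = np.array([[2., 40.]])                          # made-up test data

# Statistics are learned on the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
# The test set reuses the training mean and std (no refitting)
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Reusing the training statistics on the test set is why the evaluation step calls `scaler.transform` rather than `scaler.fit_transform`.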

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers, Input

# Model definition
model = Sequential()
model.add(Input(shape=(713,)))
model.add(layers.Dense(20, activation='relu'))
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()

In [None]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy', # The loss used in logistic regression (log loss)
    metrics=['accuracy'])

model.fit(X_train, y_train, batch_size=16, epochs=8);
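
For intuition about the loss used here, the following sketch computes binary cross-entropy by hand with NumPy. The labels and probabilities are made up, and the small `eps` clipping is an assumption added to avoid `log(0)`.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log() never receives exactly 0 or 1
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true = [1, 0, 1]
good = binary_crossentropy(y_true, [0.9, 0.1, 0.8])   # confident and correct
bad = binary_crossentropy(y_true, [0.1, 0.9, 0.2])    # confident and wrong

print(good, bad)
```

Confident wrong predictions are penalized much more heavily than confident correct ones, which is what pushes the predicted probabilities toward the true labels during training.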

In [None]:
model.evaluate(scaler.transform(X_test), y_test)
# returns [loss, accuracy]

In [None]:
pd.Series(y).value_counts()

In [None]:
# Baseline score (always predict the majority class)
530 / (530+236)

In [None]:
# Predicted probabilities
# model.predict(scaler.transform(X_test))
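
With a sigmoid output, the model returns one probability per observation. A common way to turn these into class labels is to threshold at 0.5, as in this sketch. The probabilities below are made-up values shaped like the output of a one-neuron sigmoid layer, and 0.5 is a conventional choice rather than a requirement.

```python
import numpy as np

# Made-up probabilities with shape (n_samples, 1)
probs = np.array([[0.92], [0.35], [0.50], [0.07]])

# Class 1 when the probability is strictly above the threshold, else class 0
threshold = 0.5
classes = (probs > threshold).astype(int).ravel()

print(classes)
```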

## Conclusion

Deep Learning is nothing more than :
    
- **Multiple linear regressions** stacked together.
- **Non-linear functions**: the activation functions.