# <span style="color:blue">Building an ML project</span>


<div style="text-align: right">Dr. Sishir Kalita</div>
<div style="text-align: right">Data Scientist</div>
<div style="text-align: right">Armsoftech.air</div>
<div style="text-align: right">Chennai</div>

### Outline

1. Key steps in an ML project

2. How to train an ML model?

3. Bias/variance problem

4. Regularizations and batch norm

5. Evaluation metrics

6. ML case study

import IPython
IPython.display.Audio("test.mp3")

### Different ML projects

1. Voice bot (ASR, TTS, ASV, Speech analytics)
2. Chat bot
3. Health diagnosis bot
4. Self driving car
5. AI-powered visual inspection platform for quality control

### Key steps in an ML project

1. Data collection
2. Data pre-processing
3. Model training / testing 
    * Iterate until you get good results on the test set
    * Add more labeled data
4. Model deployment

### What is data?

1. Structured data 

| House size (sq. feet) | PIN code | # bedrooms | Price (INR) |
| :-: | :-: | :-: | :-: |
| <font size="3">1320</font> | <font size="3">781001</font> | <font size="3">3</font> | <font size="3">4300</font> |
| <font size="3">1110</font> | <font size="3">781039</font> | <font size="3">3</font> | <font size="3">3500</font> |
| <font size="3">990</font> | <font size="3">881034</font> | <font size="3">2</font> | <font size="3">3910</font> |

2. Unstructured data
    * Image, Speech, Text, Video
    
    
<p float="right">
<img src="index.png" width="700" align="center"/>
</p>

### Data is messy

**Data may have-**
1. Mislabeled examples
2. Missing values
3. Noisy (speech/images)
    
| House size (sq. feet) | PIN code | # bedrooms | Price (INR) |
| :-: | :-: | :-: | :-: |
| 1320 | 781001 | 3 | 4300 |
| NaN | 781039 | 3 | 3500 |
| 990 | 781034 | 2 | 0 |
    
**Data pre-processing**
1. Data pre-processing is the process of cleaning the collected data
2. Ignoring (dropping) examples with missing values
3. Filling in the missing values
4. Outlier detection
5. Data scaling: normalization
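
The pre-processing steps above can be sketched on the messy table from this slide. This is only an illustration: filling with the median, treating a price of 0 as an invalid label, and the variable names are all assumptions made here, not rules from the slides.

```python
import math
import statistics

# Rows from the messy table above: (house size, PIN code, bedrooms, price).
rows = [
    (1320.0, 781001, 3, 4300.0),
    (math.nan, 781039, 3, 3500.0),
    (990.0, 781034, 2, 0.0),
]

# 1. Fill the missing house size with the median of the observed sizes.
observed_sizes = [r[0] for r in rows if not math.isnan(r[0])]
median_size = statistics.median(observed_sizes)
filled = [(median_size if math.isnan(r[0]) else r[0],) + r[1:] for r in rows]

# 2. Treat a price of 0 as an invalid label and drop that row (assumed rule).
cleaned = [r for r in filled if r[3] > 0]

# 3. Scale the house size to zero mean and unit variance.
sizes = [r[0] for r in cleaned]
mu = statistics.mean(sizes)
sigma = statistics.pstdev(sizes)
scaled_sizes = [(s - mu) / sigma for s in sizes]

print(median_size, len(cleaned), scaled_sizes)
```

Whether to drop or fill a missing value depends on how much data you can afford to lose; with only three rows, as here, filling is usually the safer choice.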

### Prepare your train, development (dev) / validation (val) and test (evaluation (eval)) sets
1. How to divide:
    * Before the deep learning era: <span style="color:blue">80% (train: 8000), 10% (dev: 1000), and 10% (test: 1000)</span> (**if you have 10,000 examples**)
    * Present day: <span style="color:blue">98% (train), 1% (dev), and 1% (test)</span> (**if you have 10,00,000 examples**)
2. The <span style="color:blue">train</span> set should be as diverse as possible
3. The <span style="color:blue">dev</span> and <span style="color:blue">test</span> sets should come from the same distribution
4. Clean up mislabeled examples in the <span style="color:blue">dev</span> and <span style="color:blue">test</span> sets
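
A minimal sketch of the split itself, assuming the examples are addressed by index and shuffled once before cutting. The function name `split_indices` and its parameters are hypothetical; the percentages are the ones quoted above.

```python
import random

def split_indices(n_examples, train_pct, dev_pct, seed=0):
    """Shuffle example indices and cut them into train/dev/test lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding in the set sizes.
    n_train = n_examples * train_pct // 100
    n_dev = n_examples * dev_pct // 100
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_indices(10_000, train_pct=80, dev_pct=10)
print(len(train), len(dev), len(test))  # 8000 1000 1000
```

Because dev and test are cut from the same shuffled pool, they follow the same distribution, which is the property the list above asks for.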

### Steps to train a model

1. Decide the model and define its structure
2. Normalizing the inputs
3. Initialize the parameters of the model
4. Learn the parameters for the model by minimizing the cost
    * Calculate current loss (forward propagation)
    * Calculate current gradient (backward propagation)
    * Update parameters (gradient descent)
5. Use the learned parameters to make predictions (on the test set)

**Let's consider a logistic regression model**
* Binary classification model
* Training set: $ \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ... (x^{(m)}, y^{(m)})\} $

\begin{align}
\hat y = h_{\text{w}}(x) = \frac{1}{1 + e^{-\text{w}^Tx}}
\end{align}
\begin{align}
    {\text{w}}^Tx &= [\text{w}_{1},\text{w}_{2},\text{w}_{3}]
          \begin{bmatrix}
           x_1 \\
           x_2 \\
           x_{3}
         \end{bmatrix}
  \end{align}
  
<p float="right">
<img src="sigmoid.png" width="300" align="left"/>
<img src="sigmoid1.png" width="400" align="left"/> 
</p>



**Compute the cost function ($m$ is the number of training examples):**
\begin{align}
J(\text{w}, b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat y^{(i)}, y^{(i)})
= -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat y^{(i)}) + (1 - y^{(i)})\log(1- \hat y^{(i)})\right]
\end{align}
When $y = 1$, only the first term contributes; when $y = 0$, only the second term contributes.

Now, we need to find the **w** and **b** that minimize $J(\text{w}, b)$ (this requires an optimization algorithm).

<img src="costng.png" width="500" /> 

Image source: <a href="https://cs230.stanford.edu/files/C1M2.pdf" target="_blank">cs230.stanford.edu</a>

### Optimization algorithms (I hope it is already discussed)

**Gradient descent**
1. Compute the gradient of $J$ w.r.t. **w** and **b**
2. Update **w** and **b** ($\alpha$ is the learning rate, $j$ is the iteration index)
\begin{align}
w_{j+1} = w_{j} - \alpha \frac{\partial J(w,b)}{\partial w}\\
b_{j+1} = b_{j} - \alpha \frac{\partial J(w,b)}{\partial b}
\end{align}
3. Iterate until you reach a (local) minimum
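
The three steps can be sketched for logistic regression with NumPy. For the cross-entropy cost, the gradients work out to $\frac{1}{m}X^T(\hat y - y)$ for **w** and the mean of $(\hat y - y)$ for **b**. The function names, the synthetic data, and the values of `alpha` and `n_iters` are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """Average cross-entropy loss over the training set."""
    y_hat = sigmoid(X @ w + b)
    eps = 1e-12  # avoids log(0)
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def gradient_descent(X, y, alpha=0.1, n_iters=500):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = []
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)   # forward propagation
        error = y_hat - y
        dw = X.T @ error / m         # gradient w.r.t. w
        db = error.mean()            # gradient w.r.t. b
        w -= alpha * dw              # parameter update
        b -= alpha * db
        history.append(cost(w, b, X, y))
    return w, b, history

# Synthetic, linearly separable toy data (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, history = gradient_descent(X, y)
predictions = sigmoid(X @ w + b) >= 0.5
accuracy = np.mean(predictions == (y == 1.0))
print(history[0], history[-1], accuracy)
```

Plotting `history` against the iteration number gives the kind of decreasing cost curve shown in the figure that follows.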

<p float="right">
<img src="loss.png" width="300" align="left"/> 
<img src="sgd.gif" width="400" align="left"/> 
</p>

Image source: <a href="http://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/" target="_blank">rasbt.github.io</a>


**Stochastic gradient descent** vs **mini batch gradient descent** vs **Batch gradient descent**
1. Let's say your train set size is 'm'
2. Stochastic gradient descent: compute the error for a single example and update the model after every example
3. Mini-batch gradient descent: take a mini-batch of size M (M < m), compute the error, and update the model after every mini-batch
4. Batch gradient descent: compute the error over all examples in the training set and update the model only once per pass

**Other advanced optimization algorithms**
1. Gradient descent with momentum
2. Adam
3. RMSprop

### What is an epoch?
1. One cycle through the entire training dataset is called a training epoch
2. One iteration is one forward pass plus one backward pass on one batch
3. Let m = 1000 and M = 10
4. For SGD: there will be 1000 iterations/epoch
5. For Mini batch gradient descent: there will be 1000/10 (100) iterations/epoch
6. For Batch gradient descent: there will be 1 iteration/epoch
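
The iteration counts above follow from dividing the train set size by the batch size. A small sketch, with a hypothetical helper name, that also rounds up when the batch size does not divide m evenly:

```python
import math

def iterations_per_epoch(m, batch_size):
    """Number of parameter updates needed to see all m examples once."""
    return math.ceil(m / batch_size)

m, M = 1000, 10
print(iterations_per_epoch(m, 1))  # SGD: 1000
print(iterations_per_epoch(m, M))  # mini-batch: 100
print(iterations_per_epoch(m, m))  # batch: 1
```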

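```python
import torch

# Assumes model, loss_fn, optimizer, train_loader, eval_loader, device and
# n_epochs are already defined.
losses, eval_losses = [], []

for epoch in range(n_epochs):
    # Sets model to TRAIN mode
    model.train()
    for x_batch, y_batch in train_loader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        # Clears the gradients accumulated in the previous iteration
        optimizer.zero_grad()

        # Forward pass: makes predictions and computes the loss
        # (PyTorch loss functions take the prediction first, then the target)
        yhat = model(x_batch)
        loss = loss_fn(yhat, y_batch)

        # Backward pass: computes gradients
        loss.backward()

        # Updates parameters
        optimizer.step()

        # Stores a plain float so the computation graph is not kept alive
        losses.append(loss.item())

    # Sets model to EVAL mode (affects dropout and batch norm)
    model.eval()
    with torch.no_grad():
        for x_eval, y_eval in eval_loader:
            x_eval = x_eval.to(device)
            y_eval = y_eval.to(device)

            yhat = model(x_eval)
            eval_loss = loss_fn(yhat, y_eval)
            eval_losses.append(eval_loss.item())
```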

Quiz (1 Min)

1. $\alpha$ in the gradient descent algorithm is a learnable parameter.

Ans: **TRUE** OR **FALSE**

### Normalization of the inputs
<p float="right">
<img src="norminput1.png" width="400" align="left"/>
<img src="norminput2.png" width="250" align="left"/> 
</p>

Image source: <a href="deeplearning.ai" target="_blank">Andrew Ng</a>

For train set:
\begin{align}
X^{train}_{norm} = \frac{X^{train} - \mu^{train}}{\sigma^{train}}
\end{align}

For test set:
\begin{align}
X^{test}_{norm} = \frac{X^{test} - \mu^{train}}{\sigma^{train}}
\end{align}
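
A minimal NumPy sketch of the two formulas above. The point to notice is that the test set is scaled with the mean and standard deviation of the train set, never with its own statistics. The random data is an assumption used only to make the example runnable.

```python
import numpy as np

# Toy feature matrices (assumed data): rows are examples, columns are features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Statistics are computed on the train set only.
mu_train = X_train.mean(axis=0)
sigma_train = X_train.std(axis=0)

# The same train statistics are applied to both sets.
X_train_norm = (X_train - mu_train) / sigma_train
X_test_norm = (X_test - mu_train) / sigma_train

print(X_train_norm.mean(axis=0), X_train_norm.std(axis=0))
```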

### Why should we normalize the inputs?

<p float="right">
<img src="norminput3.png" width="1000" align="left"/> 
</p>

Image source: <a href="deeplearning.ai" target="_blank">Andrew Ng</a>


**Neural net may suffer from the vanishing and exploding gradient problems**

### Weight initialization
1. Zero initialization
    * Every neuron in each layer will learn the same thing
2. Random initialization
    * np.random.randn($N_{L}$,$N_{L-1}$)
3. He initialization
    \begin{align}
    w_{init_{L}} = \text{np.random.randn}(N_{L},N_{L-1}) \cdot \sqrt{\frac{2}{N_{L-1}}}, \quad N_{L}\text{: number of neurons in layer } L
    \end{align}
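
He initialization for a single layer can be sketched as follows. The helper name `he_init` and the layer sizes are assumptions; the sketch uses NumPy's `Generator.standard_normal` in place of `np.random.randn`, which draws from the same distribution.

```python
import numpy as np

def he_init(n_curr, n_prev, rng):
    """Weights for a layer with n_prev inputs and n_curr neurons."""
    return rng.standard_normal((n_curr, n_prev)) * np.sqrt(2.0 / n_prev)

rng = np.random.default_rng(0)
W = he_init(256, 512, rng)

# The empirical standard deviation should be close to sqrt(2 / N_{L-1}).
print(W.shape, W.std(), np.sqrt(2.0 / 512))
```

Scaling by $\sqrt{2/N_{L-1}}$ keeps the variance of the activations roughly constant from layer to layer, which is what helps against vanishing and exploding gradients.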

Demo: <a href="https://qmsvpvzwwppdeofafmdkgv.coursera-apps.org/notebooks/week5/Initialization/Initialization.ipynb" target="_blank">initialization</a>

### Bias/Variance problem

1. We can get some idea for improving the model by knowing the bias/variance problem

|  | Train set error (%) | Test set error (%) | Conclusion | 
| :-: | :-: | :-: | :-: | 
| High bias problem | 25 | 40 | Underfitting | 
| High variance problem | 1 | 12 | Overfitting | 
| Good model | 1 | 2 | Low bias and low variance |

<img src="bias_var.png" width="800">
Image source: <a href="deeplearning.ai" target="_blank">deeplearning.ai</a>
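
The reading of the table can be turned into a rough rule of thumb. The thresholds below are arbitrary assumptions, and the function name is hypothetical; in practice the train error should be compared against human-level (or otherwise achievable) error rather than a fixed number.

```python
def diagnose(train_error, test_error, bias_threshold=5.0, gap_threshold=5.0):
    """Flag bias/variance problems from train and test error (in %)."""
    issues = []
    if train_error > bias_threshold:
        issues.append("high bias (underfitting)")
    if test_error - train_error > gap_threshold:
        issues.append("high variance (overfitting)")
    return issues or ["good fit"]

for train_error, test_error in [(25, 40), (1, 12), (1, 2)]:
    print(train_error, test_error, diagnose(train_error, test_error))
```

With these assumed thresholds the first row (25% / 40%) is flagged for both problems, since its train error is high and the train/test gap is also large.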

### Addressing bias/variance problem
1. **Reducing the high variance**

    * Add more training data
    * Add regularization 
        * L2 regularization
        * Dropout
    * Add early stopping
    * Decrease the model size

2. **Reducing the high bias**

    * Increase the model size (# of neurons/layers)
    * Try to run it longer
    * Different (advanced) optimization algorithms
    * Reduce or eliminate regularization
    * Modify model architecture

**<span style="color:blue">Try until you get better results on both train and test sets</span>**

<img src="bias-variance-tradeoff.png" width="300">

Image source: <a href="https://dziganto.github.io/cross-validation/data%20science/machine%20learning/model%20tuning/python/Model-Tuning-with-Validation-and-Cross-Validation/" target="_blank">dziganto.github.io</a>


#### L2 Regularization
\begin{align}
J(\text{w}) = \frac{1}{m}\sum_{i=1}^{m}L(\hat y^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{j=1}^{n}\text{w}_{j}^{2}
\end{align}
1. Here, $\lambda$ is the regularization parameter (a hyperparameter), and $n$ is the number of weights
2. In practice this penalizes large weights and effectively limits the freedom in your model
3. It causes each weight to decay in proportion to its size
4. If $\lambda$ is too large, many of the w's will be close to zero, which makes the NN simpler (you can think of it as behaving more like logistic regression)
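
A short derivation, not part of the original slides, of why the L2 penalty makes the weights decay. Here $J_0$ denotes the unregularized cost (a symbol introduced only for this sketch) and $\alpha$ is the learning rate from the gradient descent update:

\begin{align}
\frac{\partial J}{\partial \text{w}} &= \frac{\partial J_0}{\partial \text{w}} + \frac{\lambda}{m}\text{w} \\
\text{w} &\leftarrow \text{w} - \alpha\left(\frac{\partial J_0}{\partial \text{w}} + \frac{\lambda}{m}\text{w}\right)
= \left(1 - \frac{\alpha\lambda}{m}\right)\text{w} - \alpha\frac{\partial J_0}{\partial \text{w}}
\end{align}

Each update first shrinks the weights by the factor $\left(1 - \frac{\alpha\lambda}{m}\right)$ and then applies the usual gradient step, so larger weights are pulled towards zero more strongly.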

<img src="reg_strengths.jpeg" width="400">

Image source: <a href="https://cs231n.github.io/neural-networks-1/" target="_blank">cs231n.github.io</a>

#### Dropout

<img src="drop1.png" width="600">
Image source: <a href="https://cs231n.github.io/neural-networks-2/" target="_blank">cs231n.github.io</a>

1. The dropout regularization eliminates some neurons/weights on each iteration based on a probability
2. Can’t rely on any one feature, so have to spread out weights [Andrew Ng]
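
A minimal sketch of dropout on one layer's activations, using the "inverted dropout" variant in which the surviving activations are divided by the keep probability so that their expected value is unchanged. The function name, the keep probability of 0.8, and the toy activations are assumptions.

```python
import random

def inverted_dropout(activations, keep_prob, rng):
    """Keep each activation with probability keep_prob and rescale the survivors."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
activations = [1.0] * 10_000
dropped = inverted_dropout(activations, keep_prob=0.8, rng=rng)

kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(kept_fraction, mean_activation)
```

Dropout is applied only during training; at test time all neurons are kept and no rescaling is needed with this variant.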

Demo: <a href="https://qmsvpvzwwppdeofafmdkgv.coursera-apps.org/notebooks/week5/Regularization/Regularization_v2a.ipynb" target="_blank">L2 & Dropout</a>


### Other regularization methods
1. Data augmentation
    * Useful when there is a data mismatch between the train and test sets
    * Add noise to the speech signal or perturb the speech signal
    * Use speech synthesis
    * Distort the image (scale, rotate)
    * Create images using computer graphics
2. Early stopping
    * Check the train and validation set errors

### Parameters and Hyperparameters
1. Weights (w) and biases (b) are learnable parameters
2. Hyperparameters (parameters that control the learning algorithm)
    * Learning rate
    * Number of iterations
    * Number of hidden layers
    * Number of hidden units
    * Choice of activation functions
    * Mini-batch size

### Normalizing activations in a network
**Batch normalization**
1. Batch normalization allows each layer of a network to learn a little more independently of the other layers

<p float="right">
<img src="batch-normalization.jpg" width="400" align="left">
<img src="val.png" width="300" align="left">
</p>

Image source: <a href="https://www.learnopencv.com/batch-normalization-in-deep-networks/" target="_blank">learnopencv</a>


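```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential()
model.add(Dense(32))
model.add(BatchNormalization())  # normalize before the non-linearity
model.add(Activation('relu'))
```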
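
What the batch normalization layer computes during training can be sketched in NumPy: each feature is normalized over the mini-batch and then rescaled with a learnable scale and shift. The names `gamma`, `beta` and `eps` are the conventional ones and are assumptions here; the running statistics used at inference time are left out of this sketch.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Training-time batch norm for z of shape (batch_size, n_features)."""
    mu = z.mean(axis=0)                     # per-feature mean over the batch
    var = z.var(axis=0)                     # per-feature variance over the batch
    z_hat = (z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * z_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=4.0, size=(64, 32))
out = batch_norm_forward(z, gamma=np.ones(32), beta=np.zeros(32))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])
```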

### Some ML strategies -
1. Carrying out error analysis
    * It can give you insights into what to do next
    * Cleaning up incorrectly labeled data
2. Transfer learning
3. Multi-task learning
4. <a href="https://www.linkedin.com/posts/andrewyng_i-talk-about-aimanufacturing-in-todays-activity-6732424258905108480-UWga/" target="_blank">Improving human level performance </a>


### Transfer learning
1. You have a model A that was trained on a very large dataset
2. Learn from task A and then transfer that knowledge to task B (task B has very little data)
3. Sequential process

<p float="right">
<img src="cnn.png" width="600" align="left">
<img src="cnn1.png" width="400" align="left">
</p>

Image source: <a href="https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a" target="_blank">towardsdatascience</a>


### Multi-task learning
1. One neural network learns several tasks at the same time
2. Each of these tasks helps all of the other tasks

<p float="right">
<img src="mul1.png" width="500" align="left">
<img src="mul.png" width="500" align="left">
</p>

Image source: <a href="https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2594.pdf" target="_blank">isca</a>

### Evaluation metric:

**Highly depends on your application and what you are trying to optimize for**

1. Confusion matrix
2. Accuracy
3. Precision / Recall / F1 score
4. Word error rate in ASR
5. Bilingual Evaluation Understudy (BLEU) Score in machine translation
6. Equal error rate in speaker verification


**Basic metrics you should know**

| Actual / Predicted | + class | - class | 
| :-:| :-: | :-: | 
| **+ class** | True positive | False negative | 
| **- class** | False positive | True negative | 

**When do FP and FN matter?**
1. FN should be as close to zero as possible: cancer diagnosis
2. FP should be as close to zero as possible: speaker verification

Accuracy = $\frac{TP + TN}{TP + FP + FN + TN}$

Precision (proportion of positive identifications that were actually correct) = $\frac{TP}{TP + FP}$

Recall (proportion of actual positives that were identified correctly) = $\frac{TP}{TP + FN}$

F1 score = HM(Precision, Recall) = $\frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
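
The four formulas above can be computed directly from the confusion-matrix counts. The function name and the example counts are assumptions; the sketch also assumes at least one predicted positive and one actual positive, otherwise the divisions are undefined.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = classification_metrics(tp=8, fp=2, fn=2, tn=88)
print(metrics)
```

In this example the accuracy is 96% while precision and recall are both 80%: on imbalanced data, accuracy alone can hide how often the positive class is missed.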

Quiz (1 min)

Which model should I prefer?

| Model | Precision (%) | Recall (%) | F1-score | 
| :-:| :-: | :-: | :-: | 
| A | 90 | 86 | 87.9  | 
| B | 87 | 93 | 89.9 | 

Ans: **A** OR **B**

### Case study:

**Problem formulation**

Let's assume you are a Research Scientist in one of the AI labs in Assam. The Forest Department (FD) has found that the number of man-elephant conflicts is increasing in several parts of Assam. They want to deploy smart cameras in some places, so that a camera sends a warning signal to the nearby villagers whenever it detects an elephant near it. Now, the Forest Department wants you to execute this project.

The Forest Department gives you a dataset of 1,00,000 images captured in different parts of Assam using their security cameras.


The images were labelled as:
1. y = 0: There is no elephant in the image
2. y = 1: There is an elephant in the image

Your task is to build an ML model that should be able to classify new images taken by security cameras after deployment.

The requirements from the forest department:
1. Accuracy should be as high as possible
2. Latency should be at most 10 msec (i.e., the camera should classify an image within 10 msec)
3. The memory footprint of the model should be very low; their requirement is at most 10 MB

How will you choose the evaluation metric?

How will you split the train/dev/test sets?


| Situation | Train | Dev | Test | 
| :-:| :-: | :-: | :-: | 
| A | 90000 | 5000  | 5000  | 

Which model would you pick?

| Model | Latency (msec) | Size (MB) | Accuracy (%) | 
| :-:| :-: | :-: | :-: | 
| A | 30 | 4  | 98  | 
| B | 8.5 | 13 | 97 | 
| C | 8.9 | 6.8 | 95 | 

Here, 
1. Accuracy is an optimizing metric; latency and memory size are the satisficing metrics.
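
The selection rule can be sketched as: first discard every model that violates a satisficing constraint, then take the most accurate of the remaining ones. The numbers are those of the table above; treating the 10 msec and 10 MB requirements as upper bounds is an assumption, and the dictionary keys are hypothetical names.

```python
models = [
    {"name": "A", "latency_ms": 30.0, "size_mb": 4.0, "accuracy": 98.0},
    {"name": "B", "latency_ms": 8.5, "size_mb": 13.0, "accuracy": 97.0},
    {"name": "C", "latency_ms": 8.9, "size_mb": 6.8, "accuracy": 95.0},
]

MAX_LATENCY_MS = 10.0  # satisficing metric: only has to be good enough
MAX_SIZE_MB = 10.0     # satisficing metric: only has to be good enough

# Keep the models that satisfy both constraints.
feasible = [m for m in models
            if m["latency_ms"] <= MAX_LATENCY_MS and m["size_mb"] <= MAX_SIZE_MB]

# Among those, maximize the optimizing metric.
best = max(feasible, key=lambda m: m["accuracy"])
print(best["name"])  # C
```

Model A is the most accurate but too slow, and model B is too large, so model C is picked even though its accuracy is the lowest of the three.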

Another example: Accuracy on dev/test

A: 94%
B: 96%
    
But in the deployed scenario, you got better results using Model A. This may be because the dev/test sets contain very good quality pictures, while the deployed cameras produce very low quality images. In that case, you have to create a dev/test set that contains low quality images and optimize your model against it.

The people of Assam showed much interest in this project to help reduce the man-elephant conflict. Therefore, they voluntarily captured photos of elephants using their mobile phones and DSLR cameras and gave them to the forest department (FD). In this way, the FD collected an additional 1,00,000 elephant and non-elephant images in its database.

Where should we include this data: train, dev, test, or all of them? And why?

We should include it only in the train set. We should not include it in the dev or test sets: the deployed system will see security-camera images, not mobile phone or DSLR photos, so adding these images would make the dev/test distribution differ from the deployment distribution.

What is the problem of your model? (Human level performance: 99.5%)

| Model | Train set accuracy (%) | Dev set accuracy (%) |
| :-:| :-: |   :-: | 
| A | 95  |  94   | 

1. Try decreasing regularization, if you have applied
2. Train a bigger model to try to do better on the training set

What is the problem of your model? (Human level performance: 99.5%)

| Model | Train set accuracy (%) | Dev set accuracy (%) |  Test set accuracy (%) |
| :-:| :-: |   :-: |  :-: | 
| A | 99  |  99   | 96   | 

1. Overfitting to the dev set
2. Increase the dev set size

Although your algorithm gives very good overall accuracy, it produces many false negatives (FN): most of the time the camera fails to detect a real elephant and hence does not send the warning signal. A high FN rate is dangerous to the people. What should you do now?

1. You should redefine your metric
2. Work on reducing the false negatives (i.e., improve recall)

When you did the error analysis on the images from the security cameras (deployed condition), you found that your algorithm performs poorly on images where:
1. the elephant is covered by trees, and 
2. the weather is foggy

Most of the training data consists of very clear elephant images. What will you do in this situation? 

1. Data augmentation
2. Insert such images into your dev and test sets

1. Wow, congratulations! Now, the FD is very happy with your work, and they are thinking of giving you another project: estimating the density of the one-horned rhino, the pride of Assam, in the Kaziranga National Park. 
2. In that case, your first step is to detect the rhino in an image. However, it was found that the FD has only 10,000 good quality images of rhinos. 
3. How will you proceed with this project? FD wants to see your first model within one month :(

Some useful links:
1. Visualization of ML techniques : <a href="https://gfycat.com/gifs/search/gradient+descent" target="_blank">gfycat</a>
2. CNN materials: <a href="http://cs231n.stanford.edu/" target="_blank">cs231n.stanford.edu</a> 
3. Andrew Ng DL notes: <a href="https://cs230.stanford.edu" target="_blank">cs230.stanford.edu</a>
4. MOOCs: Coursera & edX 
5. Read ML articles in https://medium.com
6. <a href="https://d2wvfoqc9gyqzf.cloudfront.net/content/uploads/2018/09/Ng-MLY01-13.pdf" target="_blank">Machine Learning Yearning</a>

### You can reach me @

1. email: sisiitg@gmail.com
2. mobile no.: 9435379331 
3. [Linkedin](https://www.linkedin.com/in/sishir-kalita-3a006120/)


### THANK YOU!
### Stay safe