# <span style="color:blue">An overview on ML/DL model development</span>
<span style="color:blue">AICTE Training and Learning (ATAL) Academy Sponsored FDP on **DL for Audio and Speech Processing**
</span>

<div style="text-align: right">Dr. Sishir Kalita</div>
<div style="text-align: right">Data Scientist</div>
<div style="text-align: right">Armsoftech.air</div>
<div style="text-align: right">Chennai</div>

### Outline

2. Building a DL model: steps

3. Bias/variance problem

4. Regularizations and batch norm

5. ML Strategies

6. Evaluation metrics


### Different Speech Tech projects

1. Voice bot (Speech enhancement, ASR, TTS, Speech analytics, SLU)
2. Speaker recognition
3. Spoken Language Identification
4. Speech to speech translation

### Key steps in building an ML/DL model

1. Problem definition
2. Gathering data
3. Data pre-processing
4. Define the evaluation metric
5. Model training / testing 
    * Iterate till you get good results on the test set
    * Add more labeled data if needed
6. Model deployment

<img src="idea.png" width="700" /> 

Image source: <a href="https://www.coursera.org/learn/machine-learning-projects" target="_blank">Coursera | Deep Learning Specialization</a>

### Where will I get the speech data?

* Your own customized data
* <a href="https://datasetsearch.research.google.com/" target="_blank">Google dataset search</a> 
* <a href="https://www.kaggle.com/datasets" target="_blank">Kaggle</a> 
* <a href="https://commonvoice.mozilla.org/en/datasets" target="_blank">Commonvoice Mozilla</a> 
* <a href="https://openslr.org/resources.php" target="_blank">Openslr</a> 
* <a href="https://nplt.in/demo/resources/speech-corpus" target="_blank">National Platform for Language Technology</a> 
* <a href="https://catalog.ldc.upenn.ed" target="_blank">Linguistic Data Consortium</a> 
    

### Prepare your train, development (dev) / validation (val) and test (evaluation (eval)) sets
1. How to divide:
    * Before deep learning era: <span style="color:blue">80% (train: 8000), 10% (Dev: 1000), and 10% (Test: 1000)</span>  (**if you have 10,000 examples**)
    * Present days: <span style="color:blue">98% (train), 1% (Dev), and 1% (Test)</span> (**if you have 1,000,000 examples**)
2. The <span style="color:blue">train</span> set should be as diverse as possible
3. The <span style="color:blue">dev</span> and <span style="color:blue">test</span> sets should come from the same distribution
4. Clean up mislabeled examples in the <span style="color:blue">dev</span> and <span style="color:blue">test</span> sets
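
A minimal sketch of the split described above, using only the Python standard library. The function name `split_dataset` and its parameters are illustrative assumptions, not part of the original slides. Shuffling once and cutting dev and test from the same shuffled pool keeps those two sets on the same distribution.

```python
import random

def split_dataset(examples, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle once, then cut the data into train / dev / test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)      # fixed seed for a reproducible split
    n_dev = round(len(items) * dev_frac)
    n_test = round(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]          # everything left over is training data
    return train, dev, test

# 10,000 examples with the older 80/10/10 split
train, dev, test = split_dataset(range(10_000), dev_frac=0.10, test_frac=0.10)
print(len(train), len(dev), len(test))
```

For a very large corpus, the default `dev_frac=0.01, test_frac=0.01` gives the 98/1/1 split.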

### Steps to train a model

1. Decide the model and define its structure
2. Normalizing the inputs
3. Initialize the parameters of the model
4. Learn the parameters for the model by minimizing the cost
    * Calculate current loss (forward propagation)
    * Calculate current gradient (backward propagation)
    * Update parameters (gradient descent)
5. Use the learned parameters to make predictions (on the test set)

**Let's consider a logistic regression model**
* Binary classification model
* Training set: $ \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\} $

\begin{align}
\hat y = h_{\text{w}}(x) = \frac{1}{1 + e^{-\text{w}^Tx}}
\end{align}
\begin{align}
    {\text{w}}^Tx &= [\text{w}_{1},\text{w}_{2},\text{w}_{3}]
          \begin{bmatrix}
           x_1 \\
           x_2 \\
           x_{3}
         \end{bmatrix}
  \end{align}
  
<p float="right">
<img src="sigmoid.png" width="300" align="left"/>
<img src="sigmoid1.png" width="400" align="left"/> 
</p>



**Compute the cost function ('m' is the number of training examples)**
- **Binary cross-entropy loss**
\begin{align}
J(\text{w}, b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat y^{(i)}, y^{(i)})
= -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log(\hat y^{(i)}) + (1 - y^{(i)})\log(1- \hat y^{(i)})]
\end{align}
When $y^{(i)} = 1$ only the first term contributes; when $y^{(i)} = 0$ only the second term contributes.

Now, we need to find the **w** and **b** that minimize $J(\text{w}, b)$ [this requires an optimization algorithm]

<img src="costng.png" width="500" /> 

Image source: <a href="https://cs230.stanford.edu/files/C1M2.pdf" target="_blank">cs230.stanford.edu</a>
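
The sigmoid prediction and the binary cross-entropy cost above can be sketched in NumPy as follows. The toy data, the clipping constant `eps`, and the function names are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Mean binary cross-entropy over m examples."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# Toy data (made up): 4 examples with 3 features each
X = np.array([[0.5, 1.2, -0.3],
              [-1.0, 0.4, 0.8],
              [1.5, -0.7, 0.2],
              [-0.2, -1.1, -0.9]])
y = np.array([1, 0, 1, 0])

w = np.zeros(3)   # untrained weights
b = 0.0           # untrained bias
y_hat = sigmoid(X @ w + b)
loss = binary_cross_entropy(y_hat, y)
print(round(loss, 4))   # 0.6931, i.e. log(2), since every prediction is 0.5
```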

### Optimization algorithms (I hope it is already discussed)

**Gradient descent**
1. Compute the gradient w.r.t **w** and **b**
2. Update the **w** and **b**
\begin{align}
w_{j+1} = w_{j} - \alpha \frac{dJ(w,b)}{dw}\\
b_{j+1} = b_{j} - \alpha \frac{dJ(w,b)}{db}
\end{align}
3. Iterate till you reach a (local) minimum

<p float="right">
<img src="loss.png" width="300" align="left"/> 
<img src="sgd.gif" width="400" align="left"/> 
</p>

Image source: <a href="http://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/" target="_blank">rasbt.github.io</a>
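
A minimal sketch of batch gradient descent for the logistic regression model, following the update rule above. The synthetic data, the learning rate `alpha=0.1`, and the iteration count are arbitrary assumptions chosen for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent on the binary cross-entropy cost."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    losses = []
    eps = 1e-12                               # keeps log() away from zero
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)            # forward propagation
        loss = -np.mean(y * np.log(y_hat + eps)
                        + (1 - y) * np.log(1 - y_hat + eps))
        losses.append(float(loss))
        dw = X.T @ (y_hat - y) / m            # dJ/dw
        db = np.mean(y_hat - y)               # dJ/db
        w -= alpha * dw                       # parameter update
        b -= alpha * db
    return w, b, losses

# Synthetic, linearly separable data (assumption for the demo)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, losses = train_logistic_regression(X, y)
print(losses[0], losses[-1])   # the cost should go down
```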


### Other loss functions
- **Focal loss:** reshapes the cross-entropy loss so that it down-weights the loss assigned to well-classified examples
- **Negative log likelihood loss (NLLL):** can take per-class weights as input (useful for imbalanced classes)
- **Contrastive loss**
- **Connectionist Temporal Classification loss (CTC loss):** used for sequence tasks where the alignment between the input and output sequences is not known


[Focal Loss](https://medium.com/adventures-with-deep-learning/focal-loss-demystified-c529277052de)
[CTC Loss](https://distill.pub/2017/ctc/)


**Stochastic gradient descent** vs **mini batch gradient descent** vs **Batch gradient descent**
1. Let's say your training set has 'm' examples
2. Stochastic gradient descent: calculate error for each example and update the model for each example
3. Mini batch gradient descent: take a mini batch (M < m), compute the error, and update the model for each mini-batch
4. Batch gradient descent: calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated

**Other advanced optimization algorithms**
1. Gradient descent with momentum
2. Adam
3. RMSprop

### What is an epoch?
1. One cycle through the entire training dataset is called a training epoch
2. An iteration is one pass over one batch (one forward pass + one backward pass)
3. Let m = 1000, and M = 10
4. For SGD: there will be 1000 iterations/epoch
5. For Mini batch gradient descent: there will be 1000/10 (100) iterations/epoch
6. For Batch gradient descent: there will be 1 iteration/epoch
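
The iteration counts above can be checked by actually slicing a dataset into batches. This is a small sketch; `iterate_minibatches` is a hypothetical helper, not a library function.

```python
def iterate_minibatches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

m = 1000
examples = list(range(m))

counts = {}
for name, M in [("SGD", 1), ("mini-batch", 10), ("batch", m)]:
    counts[name] = sum(1 for _ in iterate_minibatches(examples, M))
    print(name, counts[name], "iterations/epoch")
```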

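```python
import torch

# Assumes model, loss_fn, optimizer, train_loader, eval_loader,
# device and n_epochs are already defined.
eval_losses = []

for epoch in range(n_epochs):
    # Sets model to TRAIN mode
    model.train()
    for x_batch, y_batch in train_loader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        # Clears gradients left over from the previous step
        optimizer.zero_grad()

        ##################
        # Forward pass
        # Makes predictions
        yhat = model(x_batch)
        # Computes loss (PyTorch losses expect the prediction first, then the target)
        loss = loss_fn(yhat, y_batch)

        ##################
        # Backward pass
        # Computes gradients
        loss.backward()

        ##################
        # Updates parameters
        optimizer.step()

    # Sets model to EVAL mode and disables gradient tracking
    model.eval()
    with torch.no_grad():
        for x_eval, y_eval in eval_loader:
            x_eval = x_eval.to(device)
            y_eval = y_eval.to(device)

            yhat = model(x_eval)
            eval_loss = loss_fn(yhat, y_eval)
            eval_losses.append(eval_loss.item())
```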

### Normalization of the inputs
<p float="right">
<img src="norminput1.png" width="400" align="left"/>
<img src="norminput2.png" width="250" align="left"/> 
</p>

Image source: <a href="https://www.deeplearning.ai" target="_blank">Andrew Ng</a>

For train set:
\begin{align}
X^{train}_{norm} = \frac{X^{train} - \mu^{train}}{\sigma^{train}}
\end{align}

For test set:
\begin{align}
X^{test}_{norm} = \frac{X^{test} - \mu^{train}}{\sigma^{train}}
\end{align}
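
A short NumPy sketch of the two formulas above. The key point is that the mean and standard deviation are estimated on the train set only and then reused for the test set. The random data and the small `eps` guard against division by zero are assumptions for the demo.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Estimate per-feature mean and standard deviation on the train set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps   # eps guards against constant features
    return mu, sigma

def apply_normalizer(X, mu, sigma):
    """Normalize any split with the statistics of the train set."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(500, 4))
X_test = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

mu, sigma = fit_normalizer(X_train)
X_train_norm = apply_normalizer(X_train, mu, sigma)
X_test_norm = apply_normalizer(X_test, mu, sigma)   # train statistics reused
```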

### Why we should normalize the input

<p float="right">
<img src="norminput3.png" width="1000" align="left"/> 
</p>

Image source: <a href="https://www.deeplearning.ai" target="_blank">Andrew Ng</a>


### Bias/Variance problem

1. Diagnosing whether the model suffers from high bias or high variance tells us what to try next to improve it

|  | Train set error (%) | Test set error (%) | Conclusion | 
| :-: | :-: | :-: | :-: | 
| High bias problem | 25 | 40 | Underfitting | 
| High variance problem | 1 | 12 | Overfitting | 
| Good model | 1 | 2 | Good fit |

<img src="bias_var.png" width="800">
Image source: <a href="https://www.deeplearning.ai" target="_blank">deeplearning.ai</a>

### Addressing bias/variance problem
1. **Reducing the high variance**

    * Add more training data
    * Add regularization 
        * L2 regularization
        * Dropout
    * Add early stopping
    * Decrease the model size

2. **Reducing the high bias**

    * Increase the model size (# of neurons/layers)
    * Train for longer
    * Different (advanced) optimization algorithms
    * Reduce or eliminate regularization
    * Modify model architecture


**<span style="color:blue">Try until you get better results on both train and test sets</span>**

<img src="bias-variance-tradeoff.png" width="300">

Image source: <a href="https://dziganto.github.io/cross-validation/data%20science/machine%20learning/model%20tuning/python/Model-Tuning-with-Validation-and-Cross-Validation/" target="_blank">dziganto.github.io</a>


#### L2 Regularization
\begin{align}
J(\text{w}) = \frac{1}{m}\sum_{i=1}^{m}L(\hat y^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{j=1}^{n}\text{w}_{j}^2
\end{align}
1. Here, $\lambda$ is the regularization parameter (a hyperparameter) and $n$ is the number of weights
2. Penalizes large weights and effectively limits the freedom of the model
3. Causes each weight to decay in proportion to its size
4. If $\lambda$ is too large, many of the **w**'s will be close to zero, which makes the NN simpler (you can think of it as behaving closer to logistic regression)

<img src="reg_strengths.jpeg" width="400">

Image source: <a href="https://cs231n.github.io/neural-networks-1/" target="_blank">cs231n.github.io</a>
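
A small sketch of the L2 penalty term and its gradient, to show where the "weight decay" behaviour comes from. The weight values, `lam`, and `m` are made-up numbers for illustration.

```python
import numpy as np

def l2_penalty(w, lam, m):
    """Regularization term added to the cost: lambda / (2m) * sum(w^2)."""
    return lam / (2 * m) * float(np.sum(w ** 2))

def l2_gradient(w, lam, m):
    """Gradient of the penalty w.r.t. w: (lambda / m) * w."""
    return lam / m * w

w = np.array([0.5, -2.0, 1.0])   # made-up weights
m = 100                          # number of training examples
lam = 0.1                        # regularization strength

print(l2_penalty(w, lam, m))     # 0.1 / 200 * 5.25
print(l2_gradient(w, lam, m))    # larger weights get a larger pull towards zero
```

Because the gradient of the penalty is proportional to `w`, each update shrinks a weight by an amount proportional to its current size.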

#### Dropout

<img src="drop1.png" width="600">
Image source: <a href="https://cs231n.github.io/neural-networks-2/" target="_blank">cs231n.github.io</a>

1. Dropout regularization randomly switches off (zeroes out) some neurons on each iteration, each with a given probability
2. A unit can't rely on any one input feature, so it has to spread out its weights [Andrew Ng]

Demo: <a href="https://qmsvpvzwwppdeofafmdkgv.coursera-apps.org/notebooks/week5/Regularization/Regularization_v2a.ipynb" target="_blank">L2 & Dropout</a>
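
A minimal sketch of dropout on one layer's activations, using the common "inverted dropout" variant, in which the kept activations are divided by `keep_prob` so the expected value stays the same. The variant, the function name, and `keep_prob=0.8` are assumptions; the slides do not specify them.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Inverted dropout: zero out units at random and rescale the rest."""
    if not training:
        return a                              # no dropout at test time
    mask = rng.random(a.shape) < keep_prob    # True for units that are kept
    return a * mask / keep_prob               # rescale to keep the expected value

rng = np.random.default_rng(0)
a = np.ones((4, 1000))                        # dummy activations
a_drop = dropout_forward(a, keep_prob=0.8, rng=rng)

print(np.mean(a_drop == 0))                   # roughly 0.2 of the units are dropped
print(a_drop.mean())                          # stays close to 1.0
```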


### Other regularization methods
1. Data augmentation
    * Useful when there is a data mismatch between the train and test sets
    * Add noise to the speech signal or perturb the speech signal
    * Use speech synthesis
    * Distort the image (scale, rotate)
    * Create images using computer graphics
2. Early stopping
    * Monitor the train and validation set errors, and stop when the validation error starts to rise
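
One way to "add noise to the speech signal" is to mix in white Gaussian noise at a chosen signal-to-noise ratio (SNR). The sketch below uses a synthetic sine wave in place of real speech; the function name, the 16 kHz sampling rate, and the 10 dB SNR are all assumptions for illustration.

```python
import numpy as np

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise so that the result has the requested SNR (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)   # 1 second at 16 kHz
clean = np.sin(2 * np.pi * 220 * t)            # stand-in for a speech waveform
noisy = add_noise(clean, snr_db=10, rng=rng)
```

In practice recorded background noise (babble, street, music) is often mixed in the same way, with the noise scaled to reach the target SNR.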

### Parameters and Hyperparameters
1. Weights (w) and biases (b) are learnable parameters
2. Hyperparameters (parameters that control the algorithm)
    * Learning rate
    * Number of iterations
    * Number of hidden layers
    * Number of hidden units
    * Choice of activation functions
    * Mini-batch size

### Normalizing activations in a network
**Batch normalization (BN)**
1. BN allows each layer of a network to learn a little more independently of the other layers
2. Reduces the problem of the distribution of each layer's inputs shifting during training

<p float="right">
<img src="batch-normalization.jpg" width="400" align="left">
<img src="val.png" width="300" align="left">
</p>

Image source: <a href="https://www.learnopencv.com/batch-normalization-in-deep-networks/" target="_blank">learnopencv</a>


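```python
from tensorflow.keras.models import Sequential  # assumes TensorFlow's bundled Keras
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

model = Sequential()
model.add(Dense(32))                # linear layer, no activation yet
model.add(BatchNormalization())     # normalize the pre-activations
model.add(Activation('relu'))       # apply the non-linearity afterwards
```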
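
To make the idea concrete, here is a rough NumPy sketch of what a batch normalization layer computes in training mode for one mini-batch. The symbols `gamma` (learned scale), `beta` (learned shift), and `eps` do not appear in the slides and are named here as assumptions; running statistics for inference are left out.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = z.mean(axis=0)                    # per-feature mean over the batch
    var = z.var(axis=0)                    # per-feature variance over the batch
    z_hat = (z - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * z_hat + beta            # learned scale and shift

rng = np.random.default_rng(0)
z = rng.normal(loc=4.0, scale=3.0, size=(64, 32))   # batch of 64, 32 units
out = batch_norm_forward(z, gamma=np.ones(32), beta=np.zeros(32))
```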

### Some strategies while developing an ML/DL model
1. Carrying out error analysis
2. Pretraining
    * Transfer learning
    * Self-supervised learning
3. Multi-task learning
4. End-to-end modeling
5. Domain adaptation
6. Self-training

### Carrying out error analysis

  * Error analysis - process of manually examining mistakes that your algorithm is making.
  * It can give you insights into what to do next
  * Cleaning up incorrectly labeled data
    


### Pretraining - Transfer Learning (TL) and Self-supervised Learning (SSL)

- Pretraining has become a standard technique in CV and NLP
- Transfer learning (TL) uses labeled data to learn a good representation network, in a supervised fashion
- Self-supervised learning (SSL) does not require annotated labels

<img src="pretrain.png" width="600">

Image source: <a href="https://arxiv.org/pdf/2007.04234.pdf" target="_blank"> A Tale of Two Pretraining Paradigms</a>

### Transfer learning

- Let's say we have an ASR model for Tamil.
- Can we use that model to train an ASR model for Telugu?

<p float="right">
<img src="cnn1.png" width="400" align="left">
</p>

Image source: <a href="https://arxiv.org/pdf/2007.04234.pdf" target="_blank">towardsdatascience</a>


### Self-supervised based pretraining model
- What is self-supervised learning?
  * Self-supervised learning obtains supervisory signals from the data itself 
  * Labels are naturally part of the input data
  * Learn general data representations from unlabeled examples
  * Fine tuning for your downstream tasks
  
<p float="right">
<img src="ssl.png" width="500" align="center">
</p>  
  
[Self-supervised-learning](https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/)

### Wav2vec2

* Wav2Vec2 learns powerful speech representations from large amounts of unlabeled speech
* It learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network
* **Pretraining** and **Finetuning**
* [Facebook Pretrained model](https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md)
  
<p float="right">
<img src="wav2vec21.png" width="500" align="center">
</p>  
  
[wav2vec2](https://arxiv.org/abs/2006.11477)

### Wav2vec2 for ASR

* Wav2Vec2 is fine-tuned using CTC loss with transcribed data


<p float="right">
<img src="w2v_results.png" width="500" align="center">
</p> 

[wav2vec2](https://arxiv.org/abs/2006.11477)

### Wav2vec2 learned speech embeddings for other downstream tasks

- **[Speaker verification and language identification](https://arxiv.org/pdf/2012.06185.pdf)**
- **[Emotion recognition](https://arxiv.org/abs/2104.03502)**
- **[ASR model development for low-resource language](https://arxiv.org/abs/2012.12121)**

### ASR Development using wav2vec2 for Indian languages

1. Fine-tuned on three different databases provided by IITM
  - **Pretrained model: [XLSR-53](https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md) | Trained on 56000 hours of speech data of 53 different languages**

| Language | Train (Hours)  | Eval (Hours)  | WER | LM (kenLM)
| :-: | :-: | :-: | :-: |  :-: |
| Indian English |  179.5 | 5.4 | 4.91 % | 6 Gram |
| Hindi | 178.4 |  4.9 | 4.55 %| 5 Gram |
| Tamil | 104.5  | 3.8 | 5.84 %| 4 Gram|

2. Kaldi based TDNN model

| Language | Train (Hours)  | Eval (Hours)  | WER  | LM (SRILM)
| :-: | :-: | :-: | :-: | :-: | 
|  Indian English |  179.5 | 5.4 | 4.97 % | RNNLM Rescore |
| Hindi | 178.4 |  4.9 | 3.73 %| 5 Gram |
| Tamil | 104.5  | 3.8 | 5.21 %| 5 Gram |


### Wav2vec2 pretrained models for Indic languages

- [EkStep Models](https://github.com/Open-Speech-EkStep/vakyansh-models)
- [Paper](https://arxiv.org/pdf/2107.07402.pdf)

### Multi-task learning
1. One neural network learns several tasks at the same time
2. Each of these tasks helps all of the other tasks

<p float="right">
<img src="mul.png" width="500" align="left">
</p>

Image source: <a href="https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2594.pdf" target="_blank">isca</a>

### End-to-end modelling
* No feature engineering and no intermediate stages
* Needs a large amount of data


### Evaluation metric:

**Highly depends on your application and what you are trying to optimize for**

1. Confusion matrix
2. Accuracy
3. Precision / Recall / F1 score
4. Word error rate in ASR
5. Bilingual Evaluation Understudy (BLEU) Score in machine translation
6. Equal error rate in speaker verification
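
Word error rate (item 4 above) is commonly computed as the word-level edit distance (substitutions + deletions + insertions) between the reference and the hypothesis, divided by the number of reference words. The slides do not spell this out, so the sketch below rests on that standard definition; the function name and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match or substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 4))   # 0.3333
```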


**Basic metrics you should know**

| Actual (rows) / Predicted (columns) | + class | - class | 
| :-:| :-: | :-: | 
| **+ class** | True positive | False negative | 
| **- class** | False positive | True negative | 

**[When do FP and FN matter?]**
1. FN should be as close to zero as possible: cancer diagnosis
2. FP should be as close to zero as possible: speaker verification

Accuracy = $\frac{TP + TN}{TP + FP + FN + TN}$

Precision (proportion of positive identifications that were actually correct) = $\frac{TP}{TP + FP}$

Recall (proportion of actual positives that were identified correctly) = $\frac{TP}{TP + FN}$

F1 score = harmonic mean of Precision and Recall = $\frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
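
The formulas above can be sketched directly from the four confusion-matrix counts. The counts in the example are made up, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for illustration
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 4))
```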

References:
1. Structuring Machine Learning Projects : <a href="https://www.coursera.org/learn/machine-learning-projects" target="_blank">Andrew Ng</a> 
2. Visualization of ML techniques : <a href="https://gfycat.com/gifs/search/gradient+descent" target="_blank">gfycat</a>
3. CNN materials: <a href="http://cs231n.stanford.edu/" target="_blank">cs231n.stanford.edu</a> 
4. Andrew Ng DL notes: <a href="https://cs230.stanford.edu" target="_blank">cs230.stanford.edu</a>
5. MOOCs: Coursera & edX 
6. Read ML articles in https://medium.com
7. <a href="https://d2wvfoqc9gyqzf.cloudfront.net/content/uploads/2018/09/Ng-MLY01-13.pdf" target="_blank">Machine Learning Yearning</a>

### You can reach me @

1. email: sisiitg@gmail.com
2. mobile no.: 9435379331 
3. [Linkedin](https://www.linkedin.com/in/sishir-kalita-3a006120/)


### THANK YOU!
### Stay safe