# Modelling and Deployment using MLOps 

Now that the audio inputs and their corresponding labels are in array form, they are easier to consume and to process with natural language processing techniques. The labels of the audio files can be converted to integers with label encoding, or to one-hot vectors, so that a model can learn from them. The encoded labels define the targets of the neural network's output layer, and together with the audio features they form the n-dimensional arrays used for training and validation.
At this stage we apply the remaining pre-processing steps, such as dropping unused columns and normalization, to arrive at the final training data. The dataset is then split into training, validation, and test sets, as we have done for other models.
We can use deep neural architectures such as CNNs, RNNs, and LSTMs, together with CTC (connectionist temporal classification), to build and train models for speech applications like speech recognition. A model trained on fixed-length audio chunks of a few seconds, each transformed into an n-dimensional array and paired with its label, predicts output labels for test audio input. Because the output labels go beyond two classes, this is a multi-class classification problem.
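
The label encoding, one-hot encoding, and three-way split described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in this notebook: the helper names (`encode_labels`, `one_hot`, `split_indices`), the toy labels, and the split fractions are all hypothetical.

```python
import random

def encode_labels(labels):
    """Map each distinct label to an integer id (sorted for determinism)."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes

def one_hot(ids, num_classes):
    """Turn integer ids into one-hot vectors of length num_classes."""
    return [[1 if i == k else 0 for k in range(num_classes)] for i in ids]

def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle row indices and cut them into train / validation / test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

# Hypothetical toy labels, one per audio clip
labels = ["yes", "no", "stop", "yes", "go", "no", "stop", "go", "yes", "no"]
ids, classes = encode_labels(labels)
vectors = one_hot(ids, len(classes))
train_idx, val_idx, test_idx = split_indices(len(labels), val_frac=0.2, test_frac=0.2)
```

In practice `LabelEncoder` and `train_test_split` from scikit-learn, both imported in the first cell, cover the same steps.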


In [1]:
import pandas as pd
import numpy as np
import os,sys
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder,StandardScaler
sys.path.append(os.path.abspath(os.path.join('../scripts')))
import tensorflow as tf
from clean import Clean
from utils import vocab
from deep_learner import DeepLearn
from modeling import Modeler
from evaluator import CallbackEval

The vocabulary is: ['', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'] (size =27)

In [2]:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

In [3]:
AM_ALPHABET='ሀለሐመሠረሰቀበግዕዝተኀነአከወዐዘየደገጠጰጸፀፈፐቈኈጐኰፙፘፚauiāeəo'
EN_ALPHABET='abcdefghijklmnopqrstuvwxyz'

In [4]:
cleaner = Clean()
char_to_num,num_to_char=vocab(AM_ALPHABET)

2022-06-04 09:04:36,224:logger:Successfully initialized clean class


The vocabulary is: ['', 'ሀ', 'ለ', 'ሐ', 'መ', 'ሠ', 'ረ', 'ሰ', 'ቀ', 'በ', 'ግ', 'ዕ', 'ዝ', 'ተ', 'ኀ', 'ነ', 'አ', 'ከ', 'ወ', 'ዐ', 'ዘ', 'የ', 'ደ', 'ገ', 'ጠ', 'ጰ', 'ጸ', 'ፀ', 'ፈ', 'ፐ', 'ቈ', 'ኈ', 'ጐ', 'ኰ', 'ፙ', 'ፘ', 'ፚ', 'a', 'u', 'i', 'ā', 'e', 'ə', 'o'] (size =44)
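
The source of the `vocab` helper is not shown here. As a rough sketch of the kind of mapping it returns, the following plain-Python version reserves index 0 for the empty string, mirroring the printed vocabulary, and maps unknown characters to that index. The names `build_vocab`, `encode`, and `decode` are hypothetical, and the out-of-vocabulary behaviour is an assumption.

```python
def build_vocab(alphabet):
    """Return the vocabulary list and a char -> id map; id 0 is reserved for ''."""
    vocabulary = [""] + list(alphabet)
    char_to_id = {ch: i for i, ch in enumerate(vocabulary) if ch}
    return vocabulary, char_to_id

def encode(text, char_to_id):
    """Convert text to ids; characters outside the alphabet map to 0 (assumed)."""
    return [char_to_id.get(ch, 0) for ch in text]

def decode(ids, vocabulary):
    """Convert ids back to text; id 0 decodes to the empty string."""
    return "".join(vocabulary[i] for i in ids)

vocabulary, char_to_id = build_vocab("abcdefghijklmnopqrstuvwxyz")
encoded = encode("a b", char_to_id)     # the space is not in the alphabet
restored = decode(encoded, vocabulary)
```

With the English alphabet this gives a vocabulary of size 27, the same size reported in the first cell's output.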


# Deep Learning Model

**objective**: Build a deep learning model that converts speech to text.

In [5]:
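swahili_df = pd.read_csv('../data/swahili.csv')
amharic_df = pd.read_csv("../data/amharic.csv")
# Tag each row with its language before combining the two datasets
lang = swahili_df.copy()
lang['type']='swahili'
amharic_df['type']='amharic'
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
language_df = pd.concat([lang, amharic_df], ignore_index=True)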

In [6]:
pre_model = Modeler()

In [7]:
swahili_preprocessed = pre_model.preprocessing_learn(swahili_df,'key','file')

In [8]:
amharic_preprocessed = pre_model.preprocessing_learn(amharic_df,'key','file')

In [9]:
train_df,val_df,test_df = amharic_preprocessed

In [10]:
batch_size = 3
# Define the training dataset
train_dataset = tf.data.Dataset.from_tensor_slices(
    (list(train_df["file"]), list(train_df["text"]))
)
train_dataset = (
    train_dataset.map(cleaner.encode_single_sample, num_parallel_calls=tf.data.AUTOTUNE)
    .padded_batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)

# Define the validation dataset
validation_dataset = tf.data.Dataset.from_tensor_slices(
    (list(val_df["file"]), list(val_df["text"]))
)
validation_dataset = (
    validation_dataset.map(cleaner.encode_single_sample, num_parallel_calls=tf.data.AUTOTUNE)
    .padded_batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)
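
Clips and transcripts differ in length, which is why the cell above uses `padded_batch` rather than `batch`: by default it pads every element of a batch with zeros up to the longest element in that batch. The plain-Python sketch below imitates that behaviour on toy integer sequences. The helper `padded_batches` and the sample data are hypothetical and only illustrate the idea.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Yield batches in which every sequence is right-padded to the batch maximum."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        yield [list(seq) + [pad_value] * (width - len(seq)) for seq in batch]

# Toy encoded transcripts of different lengths
encoded = [[5, 2], [7, 1, 4, 9], [3], [8, 8, 6]]
batches = list(padded_batches(encoded, batch_size=3))
```

Here the first batch is padded to length 4 and the last, smaller batch keeps its own length of 3. Padding with 0 fits the vocabulary printed earlier, where index 0 is the empty string.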


## Deep Learning Architecture - CNN - RNN - LSTM & CTC

In [11]:
learn = DeepLearn(input_width=1, label_width=1, shift=1,epochs=5,
                 train_df=train_df, val_df=val_df, test_df=test_df,
                 label_columns=['mfcc-0'])
fft_length = 2
model = learn.build_asr_model(
    input_dim=fft_length // 2 + 1,
    output_dim=char_to_num.vocabulary_size(),
    rnn_units=1,
)
model.summary(line_length=110)

Model: "DeepSpeech_2"
______________________________________________________________________________________________________________
 Layer (type)                                    Output Shape                                Param #          
 input (InputLayer)                              [(None, None, 2)]                           0                
                                                                                                              
 expand_dim (Reshape)                            (None, None, 2, 1)                          0                
                                                                                                              
 conv_1 (Conv2D)                                 (None, None, 1, 2)                          4                
                                                                                                              
 conv_1_bn (BatchNormalization)                  (None, None, 1, 2)                       
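
A model trained with CTC emits one score vector per audio frame, so its output must be collapsed into a character sequence. A minimal sketch of greedy CTC decoding follows: take the best class per frame, merge consecutive repeats, then drop the blank symbol. The function name `ctc_greedy_decode` and the toy scores are hypothetical, and the blank index is passed in explicitly because it depends on the CTC implementation in use.

```python
def ctc_greedy_decode(frame_scores, blank):
    """Greedy CTC decoding: argmax per frame, merge repeats, remove blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    decoded = []
    previous = None
    for idx in best:
        # Keep a symbol only when it starts a new run and is not the blank
        if idx != previous and idx != blank:
            decoded.append(idx)
        previous = idx
    return decoded

# Toy scores for 7 frames over 4 classes; class 3 plays the role of the blank
frames = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.0, 0.8, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
decoded = ctc_greedy_decode(frames, blank=3)
```

The per-frame best classes are 1, 1, blank, 1, 2, 2, blank. The blank between the two runs of class 1 is what lets the decoder output the same character twice in a row.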

# Evaluation

**objective**: Evaluate your model. 

In [13]:
epochs = 1
# Callback function to check transcription on the val set.
validation_callback = CallbackEval(model,validation_dataset)
# Train the model
history = model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=epochs,
    callbacks=[validation_callback],
)

The vocabulary is: ['', 'ሀ', 'ለ', 'ሐ', 'መ', 'ሠ', 'ረ', 'ሰ', 'ቀ', 'በ', 'ግ', 'ዕ', 'ዝ', 'ተ', 'ኀ', 'ነ', 'አ', 'ከ', 'ወ', 'ዐ', 'ዘ', 'የ', 'ደ', 'ገ', 'ጠ', 'ጰ', 'ጸ', 'ፀ', 'ፈ', 'ፐ', 'ቈ', 'ኈ', 'ጐ', 'ኰ', 'ፙ', 'ፘ', 'ፚ', 'a', 'u', 'i', 'ā', 'e', 'ə', 'o'] (size =44)


ValueError: one or more groundtruths are empty strings
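
The error above is most likely raised by the word error rate (WER) computation inside `CallbackEval`, whose source is not shown: WER divides by the number of reference words, so an empty ground-truth transcript cannot be scored. One way around it, sketched below under that assumption, is to skip pairs with an empty reference before scoring. The functions `edit_distance` and `corpus_wer` and the sample sentences are hypothetical, not part of the project code.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def corpus_wer(references, hypotheses):
    """Return (WER, number of skipped pairs), ignoring empty references."""
    pairs = [(r.split(), h.split())
             for r, h in zip(references, hypotheses) if r.strip()]
    skipped = len(references) - len(pairs)
    total_words = sum(len(r) for r, _ in pairs)
    if total_words == 0:
        raise ValueError("no non-empty reference transcripts to score")
    errors = sum(edit_distance(r, h) for r, h in pairs)
    return errors / total_words, skipped

references = ["habari ya asubuhi", "", "asante sana"]
hypotheses = ["habari za asubuhi", "karibu", "asante sana"]
wer, skipped = corpus_wer(references, hypotheses)
```

Reporting the number of skipped pairs keeps the gap visible. It is also worth checking why a transcript is empty, for instance whether cleaning stripped every character that is not in the alphabet.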

## Model Space Exploration
We explore the model space through hyperparameter optimization: the architecture is modified slightly, for example by increasing or decreasing the number of layers, to find the best model.
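
Varying the number of layers can be sketched by generating one candidate configuration per depth from per-layer hyperparameter lists. This is an illustrative assumption about how such a search might be organized: the helper `candidate_architectures` is hypothetical, and each candidate would still need to be built, trained, and compared on the validation set.

```python
def candidate_architectures(filters, kernels, pool_sizes, strides, min_layers=1):
    """List one configuration per depth by truncating the per-layer lists."""
    candidates = []
    for depth in range(min_layers, len(filters) + 1):
        candidates.append({
            "depth": depth,
            "filters": filters[:depth],
            "kernels": kernels[:depth],
            "pool_sizes": pool_sizes[:depth],
            "strides": strides[:depth],
        })
    return candidates

candidates = candidate_architectures(
    filters=[16, 32, 64],
    kernels=[7, 5, 3],
    pool_sizes=[3, 3, 3],
    strides=[1, 1, 2],
)
```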

In [13]:
filters = [16, 32, 64]
kernels = [7, 5, 3] 
pool_sizes = [3, 3, 3]  
mx_stride = [1, 1, 2]
cnn_stride = 1 
input_dim = 128

CONV_RNN_Model = learn.build_model(input_dim, filters, kernels, pool_sizes, mx_stride, cnn_stride)

Model: "CONV_RNN"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 the_input (InputLayer)      [(None, None, 128)]       0         
                                                                 
 reshape_3 (Reshape)         (None, None, 128, 1)      0         
                                                                 
 cnn_0 (Conv2D)              (None, None, 122, 16)     800       
                                                                 
 leaky_re_lu_7 (LeakyReLU)   (None, None, 122, 16)     0         
                                                                 
 max_pooling2d_3 (MaxPooling  (None, None, 60, 16)     0         
 2D)                                                             
                                                                 
 bn_cnn_0 (BatchNormalizatio  (None, None, 60, 16)     64        
 n)                                                       
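
The shapes and parameter counts in the summary above can be checked by hand. The sketch below assumes "valid" (no) padding and a square 7x7 kernel for `cnn_0`. It also assumes a pooling stride of 2 along the feature axis, which is inferred from the printed shapes, not from the `build_model` source. The helper names are hypothetical.

```python
def valid_output_size(size, window, stride=1):
    """Output length of a convolution or pooling window with no padding."""
    return (size - window) // stride + 1

def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

feature_after_conv = valid_output_size(128, 7, stride=1)   # cnn_0 row
feature_after_pool = valid_output_size(122, 3, stride=2)   # max_pooling2d_3 row
conv_params = conv2d_params(7, 7, in_channels=1, filters=16)
# Batch normalization keeps gamma, beta, moving mean, and moving variance per channel
bn_params = 4 * 16
```

These values match the 122, 60, 800, and 64 shown in the summary, which supports the assumptions about kernel size and padding.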