# BahuBhashi

## What is BahuBhashi?

Many multinational companies face a recurring problem: customers in one region call in and wait a long time because every representative there is busy, while call centers in other regions have spare capacity. One way to mitigate this is a system that takes audio input in one language and translates it into another. When call centers in one region face high call volumes, calls could be routed to regions with lower volumes, and BahuBhashi would let the people on both ends converse in their own languages while they resolve the issue.

## Some examples of translation: English-To-German 

<img src="Results.JPG" alt="Drawing" style="width: 1000px;"> 

## What are the requirements?

•	Operating System: Windows 10/ macOS High Sierra

•	Dataset: European Parliament Proceedings Parallel Corpus (http://www.statmt.org/europarl/v7/de-en.tgz)

•	Python: v3.6.2

•	TensorFlow: v1.8

•	Natural Language Toolkit (NLTK)


## How does translation take place?

A regular (non-recurrent) neural network is a generic machine learning algorithm that takes in a list of numbers and calculates a result (based on previous training). Neural networks can be used as a black box to solve lots of problems.

A recurrent neural network (or RNN for short) is a slightly tweaked version of a neural network where the previous state of the neural network is one of the inputs to the next calculation. This means that previous calculations change the results of future calculations!

This trick allows neural networks to learn patterns in a sequence of data. For example, you can use it to predict the next most likely word in a sentence based on the first few words.

<img src="RNN1.gif" alt="Drawing" style="width: 1000px;"> 

Because the RNN has a “memory” of each word that passed through it, the final encoding that it calculates represents all the words in the sentence.

To generate this encoding, the sentence is fed into the RNN one word at a time. The state after the last word has been processed is the set of values that represents the entire sentence.
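
The word-by-word encoding described above can be sketched in plain Python. This is only a toy illustration with random, untrained weights; the names `rnn_step` and `encode`, the vector sizes, and the tiny vocabulary are assumptions made for the example and are not part of this project's code.

```python
import math
import random


def rnn_step(state, x, w_state, w_input):
    """One recurrent step: the new state mixes the previous state with the current input."""
    size = len(state)
    return [
        math.tanh(
            sum(w_state[i][j] * state[j] for j in range(size))
            + sum(w_input[i][k] * x[k] for k in range(len(x)))
        )
        for i in range(size)
    ]


def encode(sentence, embeddings, w_state, w_input):
    """Feed the sentence one word at a time and return the final state."""
    state = [0.0] * len(w_state)
    for word in sentence.split():
        state = rnn_step(state, embeddings[word], w_state, w_input)
    return state


# Toy setup: random (untrained) weights and 3-dimensional word vectors.
rng = random.Random(0)
state_size, embed_size = 4, 3
vocab = ["the", "cat", "sat", "down"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(embed_size)] for w in vocab}
w_state = [[rng.uniform(-1, 1) for _ in range(state_size)] for _ in range(state_size)]
w_input = [[rng.uniform(-1, 1) for _ in range(embed_size)] for _ in range(state_size)]

encoding_a = encode("the cat sat down", embeddings, w_state, w_input)
encoding_b = encode("down sat the cat", embeddings, w_state, w_input)
encoding_short = encode("the cat", embeddings, w_state, w_input)

print(len(encoding_a), len(encoding_short))  # same size for any sentence length
print(encoding_a != encoding_b)              # word order changes the encoding
```

The final state has a fixed size no matter how long the sentence is, and because each step depends on the previous state, the same words in a different order produce a different encoding.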

This gives a way to represent an entire sentence as a set of unique numbers. It is not known what each number in the encoding means, but that does not really matter: as long as each sentence is uniquely identified by its own set of numbers, there is no need to know exactly how those numbers were generated.

So an RNN can encode a sentence into a set of unique numbers. How does that help? Suppose two RNNs are hooked end-to-end. The first RNN generates the encoding that represents a sentence. The second RNN then takes that encoding and applies the same logic in reverse to decode the original sentence again.

<img src="RNN2.png" alt="Drawing" style="width: 1000px;"> 

Being able to encode and then decode the original sentence again isn’t very useful on its own. Instead, the second RNN is trained to decode the sentence into German rather than English, using parallel corpora as training data.

<img src="RNN3.png" alt="Drawing" style="width: 1000px;"> 

## Import necessary libraries and dependencies:

In [2]:
!pip install nltk
from collections import Counter
import nltk
import tensorflow as tf
import NMT_Model
import nmt_data_utils
import nmt_model_utils
nltk.download('punkt')



## Loading the datasets and displaying the datasets 

In [None]:
# load the English texts
with open('europarl-v7.de-en.en', 'r', encoding = 'utf-8') as f:
    en = f.readlines()

In [None]:
# load the german texts
with open('europarl-v7.de-en.de','r',encoding = 'utf-8') as f:
    de = f.readlines()

In [None]:
len(en), len(de)

In [None]:
# first 5 sentence pairs. 
for line in zip(en[:5], de[:5]):
    print(line, '\n')

## Cleaning the datasets

In [None]:
# remove unnecessary new lines. 
de = [line.strip() for line in de]
en = [line.strip() for line in en]

In [None]:
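# we will only use sentences of similar lengths in order to make training easier.
# note: len(sent) on a string counts characters, so the filter keeps sentences of 21 to 49 characters.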
len_en = [len(sent) for sent in en if 20 < len(sent) < 50]
len_dist = Counter(len_en).most_common()
len_dist

In [None]:
# 158238 sentences contain between 20 and 50 characters (len() counts characters, not words).
len(len_en)

In [None]:
_de = []
_en = []
for sent_de, sent_en in zip(de, en):
    if 20 < len(sent_en) < 50:
        _de.append(sent_de)
        _en.append(sent_en)
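
Because `en` holds raw strings, `len(sent_en)` in the filter above measures characters. If filtering by word count is wanted instead, a minimal sketch could look like the following. The helper name `filter_pairs_by_word_count`, the whitespace tokenization, and the small sample sentences are assumptions for illustration only.

```python
def filter_pairs_by_word_count(source, target, min_words=20, max_words=50):
    """Keep aligned pairs whose source sentence has min_words < n < max_words tokens."""
    kept_source, kept_target = [], []
    for src, tgt in zip(source, target):
        n_words = len(src.split())  # whitespace tokens, not characters
        if min_words < n_words < max_words:
            kept_source.append(src)
            kept_target.append(tgt)
    return kept_source, kept_target


# Illustrative sample pairs (not loaded from the corpus files).
sample_en = ["Resumption of the session", "Thank you"]
sample_de = ["Wiederaufnahme der Sitzungsperiode", "Vielen Dank"]

print(len(sample_en[0]), len(sample_en[0].split()))  # characters vs. words

kept_en, kept_de = filter_pairs_by_word_count(sample_en, sample_de, min_words=3, max_words=10)
print(kept_en, kept_de)
```

Filtering both languages in the same loop keeps the two lists aligned, which is what the cell above relies on as well.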

In [None]:
# we will not use all of the roughly 150,000 sentences, only the first 5000 to begin with.
en_preprocessed, en_most_common = nmt_data_utils.preprocess(_en[:5000])
de_preprocessed, de_most_common = nmt_data_utils.preprocess(_de[:5000], language = 'german')

In [None]:
len(en_preprocessed), len(de_preprocessed)

## Further cleaning of the data

In [None]:
# for some of the sentences there is no German or English counterpart, i.e. only an empty list []
# therefore we will remove those sentence pairs.
en_preprocessed_clean, de_preprocessed_clean = [], []

for sent_en, sent_de in zip(en_preprocessed, de_preprocessed):
    if sent_en != [] and sent_de != []:
        en_preprocessed_clean.append(sent_en)
        de_preprocessed_clean.append(sent_de)
    else:
        continue

In [None]:
len(en_preprocessed_clean), len(de_preprocessed_clean)

## Displaying the clean sentences 

In [None]:
for e, d in zip(en_preprocessed_clean[:5], de_preprocessed_clean[:5]):
    print('English:\n', e)
    print('German:\n', d, '\n'*3)

In [None]:
en_most_common[:15], len(en_most_common), len(de_most_common)

## Creating the vocab

In [None]:
# now we can create our lookup dicts for english and german, i.e. our vocab. 
# we will also include special tokens, later on used in the model. 
specials = ["<unk>", "<s>", "</s>", '<pad>']

en_word2ind, en_ind2word, en_vocab_size = nmt_data_utils.create_vocab(en_most_common, specials)
de_word2ind, de_ind2word, de_vocab_size = nmt_data_utils.create_vocab(de_most_common, specials)

In [None]:
# Displaying the size 
en_vocab_size, de_vocab_size
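
The source of `nmt_data_utils.create_vocab` is not shown in this README, so the following is only a sketch of what such a lookup builder might do. It assumes that `most_common` is a list of `(word, count)` pairs as returned by `Counter.most_common()`, and that special tokens receive the lowest indices; the name `build_vocab` is hypothetical.

```python
from collections import Counter


def build_vocab(most_common, specials):
    """Map special tokens first, then words in frequency order, to integer ids."""
    words = list(specials) + [word for word, _count in most_common]
    word2ind = {word: ind for ind, word in enumerate(words)}
    ind2word = {ind: word for word, ind in word2ind.items()}
    return word2ind, ind2word, len(word2ind)


# Toy tokenized corpus, for illustration only.
tokenized = [["we", "thank", "you"], ["we", "agree"]]
most_common = Counter(tok for sent in tokenized for tok in sent).most_common()
specials = ["<unk>", "<s>", "</s>", "<pad>"]

word2ind, ind2word, vocab_size = build_vocab(most_common, specials)
print(word2ind)
print(vocab_size)
```

Keeping both directions of the mapping makes it possible to turn model output (indices) back into readable words later on.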

## Converting to Indices 

In [None]:
# in order to feed the sentences to the network, we have to convert them to ints, corresponding to their indices
# in the lookup dicts. 
# we reverse the source language sentences, i.e. the English sentences, as this makes learning easier for the seq2seq 
# model. Apart from this we also include EndOfSentence and StartOfSentence tags, which are needed as well. 
en_inds, en_unknowns = nmt_data_utils.convert_to_inds(en_preprocessed_clean, en_word2ind, reverse = True, eos = True)
de_inds, de_unknowns = nmt_data_utils.convert_to_inds(de_preprocessed_clean, de_word2ind, sos = True, eos = True)

In [None]:
[nmt_data_utils.convert_to_words(sentence, en_ind2word) for sentence in  en_inds[:2]]
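
To make the conversion step more concrete, here is a rough sketch of how reversing, start/end tags, and unknown words could be handled. It is not the implementation of `nmt_data_utils.convert_to_inds`, which is not shown here; the function name `sentences_to_inds` and the choice to return a count of unknown tokens are assumptions.

```python
def sentences_to_inds(sentences, word2ind, reverse=False, sos=False, eos=False):
    """Convert tokenized sentences to id lists; also count out-of-vocabulary tokens."""
    unk = word2ind["<unk>"]
    all_inds, n_unknown = [], 0
    for tokens in sentences:
        inds = []
        for tok in tokens:
            if tok not in word2ind:
                n_unknown += 1
            inds.append(word2ind.get(tok, unk))
        if reverse:
            inds.reverse()  # reverse the words before adding any tags
        if sos:
            inds.insert(0, word2ind["<s>"])
        if eos:
            inds.append(word2ind["</s>"])
        all_inds.append(inds)
    return all_inds, n_unknown


# Toy lookup table, for illustration only.
toy_word2ind = {"<unk>": 0, "<s>": 1, "</s>": 2, "<pad>": 3, "we": 4, "thank": 5, "you": 6}

src_inds, src_unknown = sentences_to_inds([["we", "thank", "you"]], toy_word2ind, reverse=True, eos=True)
tgt_inds, tgt_unknown = sentences_to_inds([["we", "thank", "everyone"]], toy_word2ind, sos=True, eos=True)
print(src_inds, src_unknown)
print(tgt_inds, tgt_unknown)
```

In this sketch the source sentence is reversed and only closed with `</s>`, while the target sentence is wrapped in `<s>` and `</s>`, mirroring the arguments passed in the cell above.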

## Tuning the Hyperparameters 

In [None]:
# hyperparams. 
# those are probably not perfect, but work fine for now. 
num_layers_encoder = 4
num_layers_decoder = 4
rnn_size_encoder = 128
rnn_size_decoder = 128
embedding_dim = 300

batch_size = 64
epochs = 5 
clip = 5
keep_probability = 0.8
learning_rate = 0.01
learning_rate_decay_steps = 1000
learning_rate_decay = 0.9
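
How `learning_rate_decay_steps` and `learning_rate_decay` are used inside `NMT_Model` is not shown here. Assuming they feed an exponential schedule of the kind computed by TensorFlow 1.x's `tf.train.exponential_decay`, the effective learning rate would evolve roughly as in this sketch (the function name `decayed_learning_rate` is hypothetical):

```python
def decayed_learning_rate(step, base_rate=0.01, decay_steps=1000, decay_rate=0.9, staircase=False):
    """Exponential decay: base_rate * decay_rate ** (step / decay_steps)."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_rate * decay_rate ** exponent


for step in (0, 1000, 2000, 5000):
    print(step, round(decayed_learning_rate(step), 6))
```

Under that assumption the rate shrinks by 10% every 1000 training steps, so with the values above it falls from 0.01 to about 0.0059 after 5000 steps.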

## Training the Model

In [None]:
# create the graph and train the model. 
nmt_model_utils.reset_graph()

nmt = NMT_Model.NMT(en_word2ind,
                    en_ind2word,
                    de_word2ind,
                    de_ind2word,
                    './models/local_one/my_model',
                    'TRAIN',
                    embedding_dim = embedding_dim,
                    num_layers_encoder = num_layers_encoder,
                    num_layers_decoder = num_layers_decoder,
                    batch_size = batch_size,
                    clip = clip,
                    keep_probability = keep_probability,
                    learning_rate = learning_rate,
                    epochs = epochs,
                    rnn_size_encoder = rnn_size_encoder,
                    rnn_size_decoder = rnn_size_decoder, 
                    learning_rate_decay_steps = learning_rate_decay_steps,
                    learning_rate_decay = learning_rate_decay)
  
nmt.build_graph()
nmt.train(en_inds, de_inds)
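
The batching logic inside `nmt.train` is not shown in this README. Sequence models usually need every sequence in a mini-batch to have the same length, which is presumably what the `<pad>` token is for. A minimal sketch of padding and batching, with hypothetical helper names `pad_batch` and `iter_batches`, might look like this:

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    lengths = [len(seq) for seq in batch]
    return padded, lengths


def iter_batches(source, target, batch_size, src_pad_id, tgt_pad_id):
    """Yield padded (source, target) mini-batches; drops the final partial batch."""
    for start in range(0, len(source) - batch_size + 1, batch_size):
        src = source[start:start + batch_size]
        tgt = target[start:start + batch_size]
        yield pad_batch(src, src_pad_id), pad_batch(tgt, tgt_pad_id)


# Toy index sequences, for illustration only (3 is the assumed <pad> id).
toy_source = [[6, 5, 4, 2], [4, 2], [5, 2]]
toy_target = [[1, 4, 2], [1, 4, 5, 6, 2], [1, 2]]

batches = list(iter_batches(toy_source, toy_target, batch_size=2, src_pad_id=3, tgt_pad_id=3))
(src_padded, src_lengths), (tgt_padded, tgt_lengths) = batches[0]
print(src_padded, src_lengths)
print(tgt_padded, tgt_lengths)
```

Returning the original lengths alongside the padded sequences lets a model ignore the padding positions when it computes the loss.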

## Testing the Model

In [None]:
_de_inds, _de_unknowns = nmt_data_utils.convert_to_inds(de_preprocessed_clean, de_word2ind, sos = True,  eos = True)

In [None]:
# the inference model does not necessarily need to get input batches. we can just give it the whole input
# data, but the batch size has to be specified as the length of the input data.
nmt_model_utils.reset_graph()

nmt = NMT_Model.NMT(en_word2ind,
                    en_ind2word,
                    de_word2ind,
                    de_ind2word,
                    './models/local_one/my_model',
                    'INFER',
                    num_layers_encoder = num_layers_encoder,
                    num_layers_decoder = num_layers_decoder,
                    batch_size = len(en_inds[:50]),
                    keep_probability = 1.0,
                    learning_rate = 0.0,
                    beam_width = 0,
                    rnn_size_encoder = rnn_size_encoder,
                    rnn_size_decoder = rnn_size_decoder)

nmt.build_graph()
preds = nmt.infer(en_inds[:50], restore_path =  './models/local_one/my_model', targets = _de_inds[:50])

## Displaying the results 

In [None]:
# show some of the created translations
# Note: the way the BLEU score is computed here is probably not ideal.
# The Bilingual Evaluation Understudy score, or BLEU for short, is a metric for comparing a generated sentence
# to a reference sentence.
# A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0.
nmt_model_utils.sample_results(preds, en_ind2word, de_ind2word, en_word2ind, de_word2ind, _de_inds[:50], en_inds[:50])
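
To give a feel for how a BLEU-style score behaves, here is a simplified sentence-level version in plain Python. It uses clipped n-gram precision up to bigrams, a brevity penalty, and no smoothing. It is a sketch for illustration and is not necessarily how `nmt_model_utils.sample_results` computes its score; the function names and example sentences are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simple_bleu(reference, candidate, max_n=2):
    """Sentence-level BLEU with uniform weights up to max_n-grams and no smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty punishes candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "das ist ein kleiner test".split()
perfect = simple_bleu(reference, reference)
partial = simple_bleu(reference, "das ist ein test".split())
mismatch = simple_bleu(reference, "wir danken ihnen".split())
print(perfect, partial, mismatch)
```

An identical sentence scores 1.0, a sentence with no shared words scores 0.0, and a partially correct translation lands in between.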

<img src="Results.JPG" alt="Drawing" style="width: 1000px;"> 

## References

1. https://github.com/thomasschmied/Neural_Machine_Translation_Tensorflow
2. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.472.1019&rep=rep1&type=pdf
3. https://pdfs.semanticscholar.org/0fa1/911622a6c0a3dd43fefbdf2695ebdb7e10fa.pdf
4. https://www.cs.cmu.edu/~awb/papers/eurospeech2003/speechalator.pdf
5. https://github.com/tensorflow/nmt