<h2>Abdul Wahab</h2>
<h3>Natural Language Identification (Embedded Devices) - Using Deep Neural Network</h3>

<p>
In this project, I pulled text data from TED Talks in 63 languages.
I converted each character of the text into a 32-bit (4-byte) binary representation of its Unicode code point.
Using TensorFlow, I trained a simple deep neural network to classify the input language. I achieved 91% accuracy on the 17 most widely spoken languages and 80% accuracy on all 56 languages.
</p>

<p> 
Dataset: https://www.kaggle.com/wahabjawed/text-dataset-for-63-langauges
</p>

In [9]:

# Required libraries

%config IPCompleter.greedy=True
import tensorflow.compat.v2 as tf
import numpy as np
import re


<h2>Configuration</h2>

In [37]:

# Map language index to natural language

labels_extended = { 
          0: ['Vietnamese','vi'], 1:['Albanian','sq'], 2:['Arabic','ar'],
          3: ['Armenian','hy'], 4: ['Azerbaijani','az'], 
          5: ['Belarusian','be'],6: ['Bengali','bn'], 
          7: ['Bosnian','bs'], 8: ['Bulgarian','bg'], 
          9: ['Burmese','my'], 10: ['Catalan', 'ca'],
          11: ['Chinese Simplified','zh-cn'], 12: ['Chinese Traditional','zh-tw'],
          13: ['Chinese Yue','zh'], 14: ['Croatian','hr'],
          15: ['Czech','cs'], 16: ['Danish','da'],
          17: ['Dutch','nl'], 18: ['English','en'],
          19: ['Esperanto','eo'], 20: ['Estonian','et'],
          21: ['Finnish','fi'], 22:['French','fr'],
          23: ['Galician','gl'], 24: ['Georgian','ka'], 
          25: ['German','de'],26: ['Urdu','ur'],
          27: ['Gujarati','gu'], 28: ['Hebrew','he'], 
          29: ['Hindi','hi'], 30: ['Hungarian', 'hu'],
          31: ['Indonesian','id'], 32: ['Italian','it'],
          33: ['Japanese','ja'], 34: ['Korean','ko'],
          35: ['Latvian','lv'], 36: ['Lithuanian','lt'],
          37: ['Macedonian','mk'], 38: ['Malay','ms'],
          39: ['Marathi','mr'], 40: ['Mongolian','mn'],
          41: ['Norwegian','nb'], 42: ['Persian','fa'],
          43: ['Polish','pl'], 44: ['Portuguese','pt'],
          45: ['Romanian','ro'],46: ['Russian','ru'], 
          47: ['Serbian','sr'], 48: ['Slovak','sk'], 
          49: ['Slovenian','sl'], 50: ['Spanish', 'es'],
          51: ['Swedish','sv'], 52: ['Tamil','ta'],
          53: ['Thai','th'], 54: ['Turkish','tr'],
          55: ['Ukrainian','uk']
          }



labels_standard = { 
        0: ['Indonesian','id'], 1:['English','en'], 2:['German','de'],
        3: ['Turkish','tr'],4:['Hindi','hi'],
        5: ['Spanish','es'],6: ['Bengali','bn'], 
        7: ['French','fr'], 8: ['Italian','it'], 
        9: ['Dutch','nl'], 10: ['Portuguese', 'pt'],
        11: ['Swedish','sv'], 12: ['Russian','ru'],
        13: ['Czech','cs'], 14: ['Arabic','ar'],
        15: ['Chinese Traditional','zh-tw'],16: ['Persian','fa']
}


# TYPE is one of ['STANDARD', 'EXTENDED']
# STANDARD supports 17 languages
# EXTENDED supports 56 languages

TYPE = 'EXTENDED'



# select the label map that matches the chosen model type

if TYPE == 'STANDARD':
    LABEL = labels_standard
else:
    LABEL = labels_extended


# regular expression matching the punctuation, symbols, and digits to strip from the input

pattern = r'[^\w\s]+|[0-9]'

# maximum number of characters in the input text
MAX_INPUT_LENGTH = 13

# maximum data length for each language, used to balance the dataset
MAX_LENGTH_DATA = 300000
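
<p>
As a minimal sketch of what the filter pattern removes, the snippet below applies the same regular expression to two sample strings. The helper name strip_noise and the sample inputs are assumptions for illustration, not part of the original notebook. Python's \w does not match combining marks, so Bengali vowel signs are stripped along with punctuation and digits.
</p>

```python
import re

# Same pattern as in the configuration cell above.
pattern = r'[^\w\s]+|[0-9]'

def strip_noise(text):
    """Hypothetical helper: remove every match of `pattern` from a string."""
    return re.sub(pattern, '', text)

# Punctuation and digits are removed; whitespace and letters are kept.
print(repr(strip_noise('hello, world 42!')))  # -> 'hello world '

# Bengali vowel signs are combining marks, so they are removed too.
print(strip_noise('মানবতা'))
```

<p>
This behaviour explains why the test output later in the notebook echoes a shortened form of the Bengali input.
</p>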

<h2>Helper Functions</h2>

In [49]:
# Helper Functions

def clean_sentences(sentences):
    '''
    Goal: Strip non-predictive characters (punctuation, symbols, digits) using the regular expression pattern
    
    @param sentences: (str) a single string of text to clean.
                       Note: The raw transcription should be tokenized by sentence, and each
                       sentence passed into this function separately.
    @return: (str) the input with every match of the pattern removed.
    '''
    return re.sub(pattern,'',sentences)

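def convertTextToBinary(word):
    '''
    Goal: Encode text as a fixed-length list of bits for the network input

    @param word: (str) text of at most MAX_INPUT_LENGTH characters.
    @return: (list) 32*MAX_INPUT_LENGTH ints (0 or 1); shorter inputs are left-padded with zeros.
    '''
    # each character becomes the 32-bit binary form of its Unicode code point
    vec = ''.join(bin(ord(letter))[2:].zfill(32) for letter in word)
    # left-pad with zeros up to the fixed input width
    vec = vec.zfill(32*MAX_INPUT_LENGTH)
    return [int(digit) for digit in vec]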
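
<p>
To make the bit layout concrete, here is a small sketch that encodes a word the same way and then decodes it again. The names encode_text and decode_bits are hypothetical and exist only for this check; the decoder assumes that an all-zero 32-bit slot is padding.
</p>

```python
MAX_INPUT_LENGTH = 13   # same value as in the configuration cell
BITS_PER_CHAR = 32      # one 32-bit slot per character

def encode_text(text, max_len=MAX_INPUT_LENGTH):
    """Hypothetical restatement of convertTextToBinary: left-padded bit list."""
    bits = ''.join(format(ord(ch), '032b') for ch in text)
    bits = bits.zfill(BITS_PER_CHAR * max_len)
    return [int(b) for b in bits]

def decode_bits(bit_list):
    """Hypothetical inverse, for checking only: skips all-zero padding slots."""
    chars = []
    for start in range(0, len(bit_list), BITS_PER_CHAR):
        chunk = bit_list[start:start + BITS_PER_CHAR]
        code_point = int(''.join(map(str, chunk)), 2)
        if code_point:
            chars.append(chr(code_point))
    return ''.join(chars)

vec = encode_text('father')
print(len(vec))          # 32 * 13 = 416 input features
print(decode_bits(vec))  # father
```

<p>
Because the padding is added on the left, the characters of a short word always occupy the last slots of the vector.
</p>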
    
    
    

<h2>Deep Neural Network - Helper Function</h2>


In [58]:
def loadTfLiteModel():
    # load TfLite model
    interpreter = tf.lite.Interpreter(model_path=f'tflite-model/LanguageDetect-{TYPE}.tflite')
    interpreter.allocate_tensors()
    var_input = interpreter.get_input_details()
    var_output = interpreter.get_output_details()
    return interpreter,var_input,var_output

def detectLanguage(text):
    # load TfLite model
    interpreter,var_input,var_output = loadTfLiteModel()

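    # truncate to the maximum input length, then strip punctuation and digits

    if len(text) > MAX_INPUT_LENGTH:
        text = text[:MAX_INPUT_LENGTH]

    text = clean_sentences(text)
    # encode as a (1, 32*MAX_INPUT_LENGTH) float32 batch for the interpreter
    word_vec = convertTextToBinary(text)
    word_vec = np.array(word_vec,dtype='float32')
    word_vec = np.reshape(word_vec, (1,word_vec.shape[0]))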

    interpreter.set_tensor(var_input[0]['index'],word_vec)
    interpreter.invoke()
    output = interpreter.get_tensor(var_output[0]['index'])
    
    digit = np.argmax(output[0])
    
   

    print(f"the language for input {text}: {LABEL[digit][0]}")
    
    for i in range(len(LABEL)):
        lang = LABEL[i][0]
        score = output[0][i]
        print(lang + ': ' + str(round(100*score, 2)) + '%')
    print('\n')
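
<p>
The loop above prints a score for every language, which is long for the 56-language model. As a possible alternative, the sketch below reports only the k highest-scoring languages. The function top_k_predictions, the demo label map, and the demo scores are all made up for illustration; in the notebook the scores would come from output[0] and the labels from LABEL.
</p>

```python
def top_k_predictions(scores, labels, k=3):
    """Hypothetical helper: return the k best (language, percent) pairs."""
    # Rank class indices from the highest to the lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(labels[i][0], round(100 * float(scores[i]), 2)) for i in ranked[:k]]

# Made-up label map and scores, in the same shape as LABEL and output[0].
demo_labels = {0: ['Indonesian', 'id'], 1: ['English', 'en'], 2: ['German', 'de']}
demo_scores = [0.125, 0.625, 0.25]

for name, percent in top_k_predictions(demo_scores, demo_labels, k=2):
    print(f'{name}: {percent}%')
```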


<h2>Deep Neural Network - Test Tflite model</h2>

In [59]:
#test for results

text_arr = ['father','মানবতা','بچے','الأطفال','إنسانية','mänskligheten']

for text in text_arr:
    detectLanguage(text)


the language for input father: English
Vietnamese: 0.01%
Albanian: 0.36%
Arabic: 0.0%
Armenian: 0.0%
Azerbaijani: 0.0%
Belarusian: 0.0%
Bengali: 0.0%
Bosnian: 0.01%
Bulgarian: 0.0%
Burmese: 0.0%
Catalan: 0.01%
Chinese Simplified: 0.0%
Chinese Traditional: 0.0%
Chinese Yue: 0.0%
Croatian: 0.01%
Czech: 0.0%
Danish: 0.0%
Dutch: 0.01%
English: 99.51%
Esperanto: 0.0%
Estonian: 0.0%
Finnish: 0.0%
French: 0.01%
Galician: 0.0%
Georgian: 0.0%
German: 0.0%
Urdu: 0.0%
Gujarati: 0.0%
Hebrew: 0.0%
Hindi: 0.0%
Hungarian: 0.0%
Indonesian: 0.0%
Italian: 0.0%
Japanese: 0.0%
Korean: 0.0%
Latvian: 0.0%
Lithuanian: 0.0%
Macedonian: 0.0%
Malay: 0.0%
Marathi: 0.0%
Mongolian: 0.0%
Norwegian: 0.01%
Persian: 0.0%
Polish: 0.0%
Portuguese: 0.05%
Romanian: 0.0%
Russian: 0.0%
Serbian: 0.01%
Slovak: 0.0%
Slovenian: 0.0%
Spanish: 0.0%
Swedish: 0.01%
Tamil: 0.0%
Thai: 0.0%
Turkish: 0.0%
Ukrainian: 0.0%


the language for input মনবত: Bengali
Vietnamese: 0.0%
Albanian: 0.0%
Arabic: 0.0%
Armenian: 0.0%
Azerbaijani: 0.0%