<h2>Abdul Wahab</h2>
<h3>Natural Language Identification (Embedded Devices) - Using a Deep Neural Network</h3>

<p>
In this project, I pulled text data from TED Talks in 63 languages.
I converted each character of the text into a 32-bit (4-byte) binary representation of its Unicode code point.
Using TensorFlow, I trained a simple deep neural network to classify the input language. I achieved 91% accuracy with the 17 most widely spoken languages and 80% accuracy with all 56 languages.
</p>
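
<p>
As a rough illustration of the encoding described above, the sketch below turns one character into 32 bits and computes the resulting input size for a 13-character window. The names <code>MAX_CHARS</code>, <code>BITS_PER_CHAR</code> and <code>char_to_bits</code> are chosen for this example only and are not identifiers from the notebook.
</p>

```python
# Illustrative sketch: these names are assumptions made for this example.
MAX_CHARS = 13        # characters kept per input
BITS_PER_CHAR = 32    # 4 bytes per character

def char_to_bits(ch):
    """Return the Unicode code point of ch as a list of 32 bits, most significant bit first."""
    return [int(b) for b in format(ord(ch), "032b")]

bits = char_to_bits("a")                  # ord("a") == 97 == 0b1100001
input_size = MAX_CHARS * BITS_PER_CHAR    # width of the network input
print(len(bits), sum(bits), input_size)   # 32 3 416
```

<p>
The value 416 matches the <code>input_dim</code> used by the models further down.
</p>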

<p> 
Dataset: https://www.kaggle.com/wahabjawed/text-dataset-for-63-langauges
</p>

In [1]:

# Required libraries

%config IPCompleter.greedy=True
import tensorflow.compat.v2 as tf
tf.enable_v2_behavior()
tf.get_logger().setLevel('ERROR')
from tensorflow.compat.v2.keras.models import Sequential
from tensorflow.compat.v2.keras.layers import Dense,Dropout
from tensorflow.compat.v2.keras import initializers, optimizers


import numpy as np
import pandas as pd
import re
from unidecode import unidecode
from array import array
from nltk.tokenize import sent_tokenize
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score
import os
import matplotlib.pyplot as plt


<h2>Configuration</h2>

In [14]:

# Map language index to natural language

labels_extended = { 
          0: ['Vietnamese','vi'], 1:['Albanian','sq'], 2:['Arabic','ar'],
          3: ['Armenian','hy'], 4: ['Azerbaijani','az'], 
          5: ['Belarusian','be'],6: ['Bengali','bn'], 
          7: ['Bosnian','bs'], 8: ['Bulgarian','bg'], 
          9: ['Burmese','my'], 10: ['Catalan', 'ca'],
          11: ['Chinese Simplified','zh-cn'], 12: ['Chinese Traditional','zh-tw'],
          13: ['Chinese Yue','zh'], 14: ['Croatian','hr'],
          15: ['Czech','cs'], 16: ['Danish','da'],
          17: ['Dutch','nl'], 18: ['English','en'],
          19: ['Esperanto','eo'], 20: ['Estonian','et'],
          21: ['Finnish','fi'], 22:['French','fr'],
          23: ['Galician','gl'], 24: ['Georgian','ka'], 
          25: ['German','de'],26: ['Urdu','ur'],
          27: ['Gujarati','gu'], 28: ['Hebrew','he'], 
          29: ['Hindi','hi'], 30: ['Hungarian', 'hu'],
          31: ['Indonesian','id'], 32: ['Italian','it'],
          33: ['Japanese','ja'], 34: ['Korean','ko'],
          35: ['Latvian','lv'], 36: ['Lithuanian','lt'],
          37: ['Macedonian','mk'], 38: ['Malay','ms'],
          39: ['Marathi','mr'], 40: ['Mongolian','mn'],
          41: ['Norwegian','nb'], 42: ['Persian','fa'],
          43: ['Polish','pl'], 44: ['Portuguese','pt'],
          45: ['Romanian','ro'],46: ['Russian','ru'], 
          47: ['Serbian','sr'], 48: ['Slovak','sk'], 
          49: ['Slovenian','sl'], 50: ['Spanish', 'es'],
          51: ['Swedish','sv'], 52: ['Tamil','ta'],
          53: ['Thai','th'], 54: ['Turkish','tr'],
          55: ['Ukrainian','uk']
          }



labels_standard = { 
        0: ['Indonesian','id'], 1:['English','en'], 2:['German','de'],
        3: ['Turkish','tr'],4:['Hindi','hi'],
        5: ['Spanish','es'],6: ['Bengali','bn'], 
        7: ['French','fr'], 8: ['Italian','it'], 
        9: ['Dutch','nl'], 10: ['Portuguese', 'pt'],
        11: ['Swedish','sv'], 12: ['Russian','ru'],
        13: ['Czech','cs'], 14: ['Arabic','ar'],
        15: ['Chinese Traditional','zh-cn'],16: ['Persian','fa']
}


# TYPE is one of ['STANDARD', 'EXTENDED']
# STANDARD supports 17 languages
# EXTENDED supports 56 languages

TYPE = 'STANDARD'



# Select the label map that matches the chosen configuration

if TYPE == 'STANDARD':
    LABEL = labels_standard
else:
    LABEL = labels_extended


# regular expression pattern used to filter out data

pattern = r'[^\w\s]+|[0-9]'

# Max length of input text
MAX_INPUT_LENGTH = 13

# Max data length for each language, to balance the dataset
MAX_LENGTH_DATA = 300000
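
<p>
To make the effect of the filter pattern concrete, here is a small sketch that applies it to a few made-up strings. The sample strings are illustrative only. The pattern strips punctuation and the ASCII digits 0-9, while letters, whitespace and underscores are kept.
</p>

```python
import re

# Same pattern as in the configuration cell above.
pattern = r'[^\w\s]+|[0-9]'

# Made-up sample strings, for illustration only.
samples = ["Hello, world! 123", "it's 2 late", "snake_case"]
cleaned = [re.sub(pattern, '', s) for s in samples]
print(cleaned)   # ['Hello world ', 'its  late', 'snake_case']
```

<p>
Whitespace is never removed, so gaps left by deleted digits or punctuation can produce double or trailing spaces.
</p>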

<h2>Helper Functions</h2>

In [15]:
# Helper Functions

def clean_sentences(sentences):
    '''
    Goal: Remove punctuation and ASCII digits from text using the regular expression pattern.

    @param sentences: (str) a single string of text. Despite the parameter name,
                       re.sub operates on one string, not on a list of sentences.
    @return: (str) the text with every match of `pattern` removed.
    '''
    return re.sub(pattern,'',sentences)

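def convertTextToBinary(word):
    '''
    Goal: Encode text as a fixed-length list of bits.

    Each character becomes the 32-bit binary form of its Unicode code point.
    The result is left-padded with zeros to 32 * MAX_INPUT_LENGTH bits, so
    callers must truncate the text to MAX_INPUT_LENGTH characters beforehand.
    '''
    vec = ''.join(bin(ord(letter))[2:].zfill(32) for letter in word)
    vec = vec.zfill(32 * MAX_INPUT_LENGTH)
    return [int(digit) for digit in vec]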
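
<p>
One way to sanity-check the encoding is a round trip: encode a short string, then rebuild it from the bits. The sketch below assumes the same scheme as <code>convertTextToBinary</code>. The functions <code>encode</code> and <code>decode</code> are hypothetical helpers written for this example and are not part of the notebook.
</p>

```python
MAX_INPUT_LENGTH = 13  # same value as in the configuration cell

def encode(text):
    """Mirror of convertTextToBinary: 32 bits per character, left-padded with zeros."""
    bits = ''.join(format(ord(ch), '032b') for ch in text)
    return [int(b) for b in bits.zfill(32 * MAX_INPUT_LENGTH)]

def decode(vec):
    """Hypothetical inverse: rebuild the text, skipping all-zero padding slots."""
    chars = []
    for i in range(0, len(vec), 32):
        code = int(''.join(str(b) for b in vec[i:i + 32]), 2)
        if code:                      # a slot of 32 zeros is padding
            chars.append(chr(code))
    return ''.join(chars)

vec = encode('بچے')
print(len(vec), decode(vec))          # 416 بچے
```

<p>
Because the padding is added on the left, a three-character input occupies only the last 96 positions of the vector and the first 320 are zeros.
</p>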
    
    
    

<h2>Deep Neural Network - Helper Function</h2>


In [18]:
def createModelStandard():
    initializer = initializers.he_uniform()
    model = Sequential()
    model.add(Dense(416, activation='relu', kernel_initializer=initializer, input_dim=416))
    model.add(Dense(512, activation='relu', kernel_initializer=initializer))
    model.add(Dense(128, activation='relu', kernel_initializer=initializer))
    model.add(Dropout(0.15))
    model.add(Dense(len(LABEL), activation='softmax'))
    model.summary()
    
    model.compile(loss='sparse_categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(1e-3), metrics=['accuracy'])
    
    return model


def createModelExtended():
    initializer = initializers.he_uniform()
    model = Sequential()
    model.add(Dense(416, activation='relu', kernel_initializer=initializer, input_dim=416))
    model.add(Dense(1024, activation='relu', kernel_initializer=initializer))
    model.add(Dense(256, activation='relu', kernel_initializer=initializer))
    model.add(Dropout(0.15))
    model.add(Dense(len(LABEL), activation='softmax'))
    model.summary()
    
    model.compile(loss='sparse_categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(1e-3), metrics=['accuracy'])
    
    return model

def loadWeights():
    # Relies on the global `model`, which must be created before this is called
    model.load_weights(f'weights/{TYPE}/weights_{TYPE}.chk')
    
def detectLanguage(text, model):
    # Predict the language of `text` and print the probability of every language

    if len(text) > MAX_INPUT_LENGTH:
        text = text[:MAX_INPUT_LENGTH]

    text = clean_sentences(text)
    word_vec = convertTextToBinary(text)
    word_vec =np.array(word_vec,dtype='float32')
    word_vec = np.reshape(word_vec, (1,word_vec.shape[0]))


    output = model.predict(word_vec)
    
    digit = np.argmax(output[0])
    
   

    print(f"the language for input {text}: {LABEL[digit][0]}")
    
    for i in range(len(LABEL)):
        lang = LABEL[i][0]
        score = output[0][i]
        print(lang + ': ' + str(round(100*score, 2)) + '%')
    print('\n')
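
<p>
With 56 languages, printing every score gets long. A possible alternative is to report only the highest-scoring languages. The helper <code>top_k_languages</code> below is a hypothetical addition, not part of the notebook, and the probabilities are toy values rather than real model output.
</p>

```python
def top_k_languages(probs, labels, k=3):
    """Return the k (language name, probability) pairs with the highest scores."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(labels[i][0], float(p)) for i, p in ranked[:k]]

# Toy values for illustration only; not real model output.
toy_labels = {0: ['Indonesian', 'id'], 1: ['English', 'en'], 2: ['German', 'de']}
toy_probs = [0.05, 0.80, 0.15]

for name, p in top_k_languages(toy_probs, toy_labels, k=2):
    print(f"{name}: {100 * p:.2f}%")
# English: 80.00%
# German: 15.00%
```

<p>
In <code>detectLanguage</code>, the same helper could be called with <code>output[0]</code> and <code>LABEL</code> in place of the toy values.
</p>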



<h2>Deep Neural Network - Load Weights From Disk</h2>

In [19]:
# create model

if(TYPE =='STANDARD'):
    model = createModelStandard()
else:
    model = createModelExtended()

# load weights

loadWeights()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_8 (Dense)              (None, 416)               173472    
_________________________________________________________________
dense_9 (Dense)              (None, 512)               213504    
_________________________________________________________________
dense_10 (Dense)             (None, 128)               65664     
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 17)                2193      
Total params: 454,833
Trainable params: 454,833
Non-trainable params: 0
_________________________________________________________________
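
<p>
The parameter counts in the summary can be checked by hand: a fully connected layer with n_in inputs and n_out outputs has n_in * n_out weights plus n_out biases. The short sketch below reproduces the numbers for the STANDARD model. The function name <code>dense_params</code> is chosen for this example.
</p>

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Input width, three hidden layers, and 17 output classes (STANDARD model).
layer_sizes = [416, 416, 512, 128, 17]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))   # [173472, 213504, 65664, 2193] 454833
```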


In [13]:
#test for results

text_arr = ['father','মানবতা','بچے','الأطفال','إنسانية','mänskligheten']

for text in text_arr:
    detectLanguage(text, model)


the language for input father: Armenian
Vietnamese: 0.81%
Albanian: 1.44%
Arabic: 1.35%
Armenian: 3.13%
Azerbaijani: 1.36%
Belarusian: 2.37%
Bengali: 2.02%
Bosnian: 2.44%
Bulgarian: 1.46%
Burmese: 2.23%
Catalan: 1.98%
Chinese Simplified: 1.59%
Chinese Traditional: 1.51%
Chinese Yue: 2.18%
Croatian: 2.51%
Czech: 2.12%
Danish: 2.3%
Dutch: 1.7%
English: 1.76%
Esperanto: 1.58%
Estonian: 1.41%
Finnish: 1.77%
French: 2.07%
Galician: 3.04%
Georgian: 1.5%
German: 1.98%
Urdu: 1.82%
Gujarati: 1.74%
Hebrew: 1.38%
Hindi: 1.38%
Hungarian: 1.15%
Indonesian: 1.12%
Italian: 1.6%
Japanese: 1.84%
Korean: 1.69%
Latvian: 1.94%
Lithuanian: 2.4%
Macedonian: 1.64%
Malay: 1.26%
Marathi: 1.96%
Mongolian: 2.61%
Norwegian: 1.81%
Persian: 1.68%
Polish: 1.36%
Portuguese: 1.66%
Romanian: 1.16%
Russian: