## Language Detection

**Problem Statement:** The [European Parliament Proceedings Parallel Corpus](http://www.statmt.org/europarl/) is a text dataset used for evaluating language detection engines. The 1.5GB corpus includes 21 languages spoken in the EU. Create a machine learning model trained on this dataset to predict the language of each sentence in the following test set.



Historically, language classification was done using statistical methods. Every language has characteristic letters or words that distinguish it from the others. For this approach to work, however, we have to maintain dictionaries (or an equivalent) for every language. This is cumbersome and does not scale to other languages, other dialects, or even newer vocabulary.

In the last decade or so, people tried to solve this problem using Machine Learning and succeeded! Language Detection is now subsumed in the bigger problem domain of text classification, in which models are trained to assign categories to a given text document.

To solve this problem I decided to use neural networks, specifically recurrent neural networks (RNNs). Other contenders outside of neural networks are [N-Grams](http://cloudmark.github.io/Language-Detection-Implemenation/) and [Naive Bayes](https://burakkanber.com/blog/machine-learning-naive-bayes-1/). Within neural networks, an RNN is usually a better fit than a CNN for sequential text. Besides, RNNs are what [power Google Translate](https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html). See also this [reference](http://cs229.stanford.edu/proj2015/324_report.pdf).

Language detection as performed by [Google](https://cloud.google.com/translate/docs/detecting-language) has ~99% accuracy. The neural network architecture they use is most likely more complex and probably impossible to run on a local machine. Still, I'll give it a try.

Starting with imports. I decided to use the Keras framework running on a TensorFlow backend.

In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM
from keras.callbacks import ModelCheckpoint   
from keras import utils
from re import sub
from string import punctuation
from os import listdir
from os.path import isfile, join
from sklearn.metrics import confusion_matrix
import pandas as pd


Using TensorFlow backend.


A lot of input text preprocessing is needed. For starters, we need to remove the HTML tags in the documents and strip punctuation and numbers (as they don't really help distinguish between European languages).
Since the test data consists of sentences that we have to label, the training documents are also split into arrays of sentences.


It is also necessary to convert the string representation into integers. I decided to simply use the Unicode code point of each character.

Next, it is crucial to make sure that all sentences have equal length (the model requires us to know the input dimensions). If a sentence has 100 characters or more, it is truncated to 100. If it is shorter, it is padded with NULL characters at the end.

In [2]:
def preprocess(txt):
    txt = sub(" *<[^>]+> *"," ", txt)
    txt = sub(" *\n *","\n",txt)
    not_allowed = punctuation + '0123456789'
    txt = ''.join([i for i in txt if i not in not_allowed])
    
    sentences = txt.split("\n")
    sentences = [s for s in sentences if len(s)>1]
    return sentences

def char_to_int(st):
    return [ord(s) for s in st]
    
def cut_or_pad(st,maxlen):    
    if len(st)>=maxlen:
        return char_to_int(st[:maxlen])
    else:
        n_spaces = maxlen - len(st)
        return char_to_int(st+'\x00'*n_spaces )
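
As a quick sanity check of the two encoding helpers above, the sketch below re-declares them so that it runs on its own and exercises both branches (padding and truncation). The sample strings and the length of 5 are chosen only for illustration.

```python
def char_to_int(st):
    # Map each character to its Unicode code point.
    return [ord(s) for s in st]

def cut_or_pad(st, maxlen):
    # Truncate long strings; right-pad short ones with NUL (code point 0).
    if len(st) >= maxlen:
        return char_to_int(st[:maxlen])
    return char_to_int(st + '\x00' * (maxlen - len(st)))

short = cut_or_pad("été", 5)        # shorter than maxlen: padded at the end
long_ = cut_or_pad("language", 5)   # longer than maxlen: cut to 5 characters
print(short)   # [233, 116, 233, 0, 0]
print(long_)   # [108, 97, 110, 103, 117]
```

Both results have exactly `maxlen` entries, which is what the model's fixed input length relies on.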

Reading the language files and constructing the training set. I put a limit of 200 files per language due to system constraints. All the text is preprocessed. Sentences are split and padded/cut (at most 100 sentences are taken from each file). Labels are integer encoded: instead of labeling a sentence as 'fr', we label it with the position of 'fr' in the languages array (this is just one way to convert the string label into an int value).

In [3]:
languages = ['fr', 'sl', 'sk', 'da', 'es', 'ro', 'pl', 'de', 'et', 'sv', 'fi', 'lv', 'el', 'nl', 'hu', 'pt', 'lt', 'it', 'bg', 'en', 'cs']
train_sentences, train_labels = [],[]
num_files = 200
maxlen = 100
for idx,l in enumerate(languages):
    lang_path = "./txt/" + l
    print("Fetching and processing",l)
    all_files = listdir(lang_path)
    for f in all_files[:num_files]:
        file_path = join(lang_path,f)
        with open(file_path, 'r', encoding='utf-8') as txt:
            lang_sentences = preprocess(txt.read())
            for s in lang_sentences[:100]:
                train_labels.append(idx)
                train_sentences.append(cut_or_pad(s.lower(), maxlen))


Fetching and processing fr
Fetching and processing sl
Fetching and processing sk
Fetching and processing da
Fetching and processing es
Fetching and processing ro
Fetching and processing pl
Fetching and processing de
Fetching and processing et
Fetching and processing sv
Fetching and processing fi
Fetching and processing lv
Fetching and processing el
Fetching and processing nl
Fetching and processing hu
Fetching and processing pt
Fetching and processing lt
Fetching and processing it
Fetching and processing bg
Fetching and processing en
Fetching and processing cs


The same is done with the test data, where each line holds a language code and a sentence separated by a tab.

In [4]:
test_sentences, test_labels = [],[]
with open("./europarl.test", 'r', encoding='utf-8') as f:
    sentences = preprocess(f.read())
    for sen in sentences:
        s = sen.split("\t")
        if len(s)!=2:
            continue
        test_labels.append(languages.index(s[0]))
        test_sentences.append(cut_or_pad(s[1].lower(), maxlen))        
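
The per-line parsing in the cell above can be sketched in isolation. This is a minimal illustration, not the notebook's code: `parse_labeled_line` is a hypothetical helper, the shortened language list is made up for the example, and the extra check for unknown language codes is an added assumption (the original would raise a `ValueError` from `languages.index` in that case).

```python
def parse_labeled_line(line, languages):
    """Return (label_index, sentence), or None if the line is malformed."""
    parts = line.split("\t")
    # Expect exactly "<code>\t<sentence>" with a known language code.
    if len(parts) != 2 or parts[0] not in languages:
        return None
    return languages.index(parts[0]), parts[1]

langs = ['fr', 'en', 'de']  # shortened list, for illustration only
ok = parse_labeled_line("en\thello world", langs)
no_tab = parse_labeled_line("no tab here", langs)
unknown = parse_labeled_line("xx\tunknown code", langs)
print(ok)       # (1, 'hello world')
print(no_tab)   # None
print(unknown)  # None
```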

In [5]:
print(train_sentences[0])
print(test_sentences[0])

[109, 105, 108, 108, 233, 110, 97, 105, 114, 101, 32, 112, 111, 117, 114, 32, 108, 101, 32, 100, 233, 118, 101, 108, 111, 112, 112, 101, 109, 101, 110, 116, 32, 32, 111, 98, 106, 101, 99, 116, 105, 102, 32, 32, 97, 109, 233, 108, 105, 111, 114, 101, 114, 32, 108, 97, 32, 115, 97, 110, 116, 233, 32, 109, 97, 116, 101, 114, 110, 101, 108, 108, 101, 32, 100, 233, 98, 97, 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1077, 1074, 1088, 1086, 1087, 1072, 32, 32, 1085, 1077, 32, 1090, 1088, 1103, 1073, 1074, 1072, 32, 1076, 1072, 32, 1089, 1090, 1072, 1088, 1090, 1080, 1088, 1072, 32, 1085, 1086, 1074, 32, 1082, 1086, 1085, 1082, 1091, 1088, 1077, 1085, 1090, 1077, 1085, 32, 1084, 1072, 1088, 1072, 1090, 1086, 1085, 32, 1080, 32, 1080, 1079, 1093, 1086, 1076, 32, 1089, 32, 1087, 1088, 1080, 1074, 1072, 1090, 1080, 1079, 1072, 1094, 1080, 1103, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


We need to know the maximum value in the integer-encoded sentences, since it determines the vocabulary size of the embedding layer.

In [6]:
max_value=0
for s in train_sentences:
    max_value = max(max_value, max(s))

for s in test_sentences:
    max_value = max(max_value, max(s))

max_value = max_value+100
max_value

65633
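
The `+100` in the cell above is a safety margin: the printed 65633 implies that the largest code point seen was 65533. The strict requirement of an embedding table is only that it has one row per possible index, that is, at least the largest index plus one. A minimal sketch with toy data follows; `min_embedding_input_dim` is a hypothetical helper, not part of the notebook.

```python
def min_embedding_input_dim(encoded_sentences):
    # Embedding indices run from 0 to input_dim - 1, so the table needs
    # at least (largest index + 1) rows.
    return max(max(s) for s in encoded_sentences) + 1

# Toy data: "hi" and Cyrillic "да", each padded with two NUL values.
toy = [[104, 105, 0, 0], [1076, 1072, 0, 0]]
needed = min_embedding_input_dim(toy)
print(needed)  # 1077
```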

In [7]:
print("Number of train samples ", len(train_sentences))
print("Number of test samples ", len(test_sentences))


Number of train samples  83658
Number of test samples  21000


Now converting the data into numpy arrays, and converting the integer labels into a one-hot binary matrix so they can be used for multiclass classification.

In [8]:
num_classes = len(languages)
x_train = np.asarray(train_sentences)
x_test = np.asarray(test_sentences)
y_train = utils.to_categorical(train_labels, num_classes)
y_test = utils.to_categorical(test_labels, num_classes)
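
To make the label conversion concrete, here is a small NumPy sketch of what a one-hot conversion such as `to_categorical` produces. The `one_hot` function is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    # Row i gets a 1.0 in column labels[i] and 0.0 everywhere else.
    out = np.zeros((len(labels), num_classes), dtype="float32")
    out[np.arange(len(labels)), labels] = 1.0
    return out

encoded = one_hot([0, 2, 1], 3)
print(encoded)
# argmax along each row recovers the original integer labels.
decoded = np.argmax(encoded, axis=1)
print(decoded)  # [0 2 1]
```

The `argmax` round trip is the same operation the notebook later uses to turn predictions back into language codes.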


The data needs to be shuffled as all the languages were grouped together. 

In [9]:
shuf = np.arange(x_train.shape[0])
np.random.shuffle(shuf)
x_train = x_train[shuf]
y_train = y_train[shuf]

shuf = np.arange(x_test.shape[0])
np.random.shuffle(shuf)
x_test = x_test[shuf]
y_test = y_test[shuf]
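
The key point of the shuffle above is that the same permutation is applied to inputs and labels, so every sentence keeps its label. A minimal sketch on toy arrays follows; the fixed seed and the toy data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed, only for reproducibility here
x = np.arange(10).reshape(5, 2)      # 5 toy samples: [0, 1], [2, 3], ...
y = np.arange(5)                     # label i belongs to row i

perm = rng.permutation(len(x))       # one permutation shared by x and y
x_shuf, y_shuf = x[perm], y[perm]

# Row i of x started with the value 2 * i, so the pairing can be checked.
still_aligned = bool((x_shuf[:, 0] == 2 * y_shuf).all())
print(still_aligned)  # True
```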

Using a portion of the training data as a validation set. This will be used to validate the model and the weights. The model won't be trained on it. 

In [10]:
(x_train, x_valid) = x_train[10000:], x_train[:10000]
(y_train, y_valid) = y_train[10000:], y_train[:10000]

print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
print(x_valid.shape[0], 'validation samples')


x_train shape: (73658, 100)
x_test shape: (21000, 100)
73658 train samples
21000 test samples
10000 validation samples
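
The split sizes above follow directly from slicing. The sketch below reproduces the arithmetic on an index array instead of the real data; the sizes are the ones printed in the notebook, and the variable names are illustrative.

```python
import numpy as np

n_total, n_valid = 83658, 10000      # sizes printed in the notebook
idx = np.arange(n_total)             # stand-in for the shuffled sample indices

# The first n_valid samples are held out; the rest remain for training.
valid_idx, train_idx = idx[:n_valid], idx[n_valid:]
overlap = np.intersect1d(valid_idx, train_idx)
print(len(train_idx), len(valid_idx), len(overlap))  # 73658 10000 0
```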


In [11]:
print(x_train.shape)
print(y_train.shape)

print(x_test.shape)
print(y_test.shape)


(73658, 100)
(73658, 21)
(21000, 100)
(21000, 21)


The model architecture starts with an embedding layer, as our integer data needs to become a 3D tensor in order to be consumed by the LSTM layer. LSTM stands for Long Short-Term Memory and is a type of RNN layer. It preserves context across the length of the sentence. After the LSTM comes a Dropout layer to reduce overfitting. We finish with a fully connected layer with a softmax activation that classifies the input into one of the languages.
The model is compiled with the loss function and optimizer commonly used in multiclass problems.

In [12]:
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(max_value, embedding_vector_length, input_length=maxlen))
model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 32)           2100256   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               82432     
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 21)                2709      
Total params: 2,185,397.0
Trainable params: 2,185,397.0
Non-trainable params: 0.0
_________________________________________________________________
None
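
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic using the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the variable names are illustrative.

```python
vocab_size = 65633    # max_value printed earlier
embed_dim = 32        # embedding_vector_length
lstm_units = 128
num_classes = 21

# One embed_dim-sized vector per possible character index.
embedding_params = vocab_size * embed_dim
# Four gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# Weight matrix plus one bias per output class.
dense_params = lstm_units * num_classes + num_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 2100256 82432 2709 2185397
```

Almost all of the parameters sit in the embedding table, which is a direct consequence of indexing characters by raw code point.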


The model is trained for 10 epochs in batches of 64. We can specify the validation data on which the model will be evaluated after each epoch. A checkpoint callback is added to make sure that only the weights that performed best against the validation set are stored. This ensures that we don't use weights that overfit to the training data.

In [13]:
# train the model
checkpointer = ModelCheckpoint(filepath='model.weights.best.hdf5', verbose=1, 
                               save_best_only=True)
model.fit(x_train, y_train, batch_size=64, epochs=10, validation_data=(x_valid, y_valid), callbacks=[checkpointer])


Train on 73658 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fa8ee0a0978>
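
The effect of `save_best_only=True` can be illustrated without Keras. With its default settings the callback monitors the validation loss and overwrites the weights file only when that loss improves. The sketch below mimics that bookkeeping; `best_epoch` is a hypothetical helper and the loss values are made up for illustration, not taken from the training run.

```python
def best_epoch(val_losses):
    """Return (epoch with the lowest loss, epochs at which a save would occur)."""
    best_loss, best_idx = float("inf"), None
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:          # improvement: overwrite the stored weights
            best_loss, best_idx = loss, epoch
            saved.append(epoch)       # stands in for writing the weights file
    return best_idx, saved

val_losses = [0.90, 0.55, 0.40, 0.42, 0.38, 0.41]  # made-up values
best, saved = best_epoch(val_losses)
print(best, saved)  # 5 [1, 2, 3, 5]
```

Loading the checkpoint afterwards therefore restores the weights from the best epoch, not from the last one.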

Finally we evaluate the model on test data. 

In [14]:
model.load_weights('model.weights.best.hdf5')
score = model.evaluate(x_test, y_test)
print("\nAccuracy: %.2f%%" % (score[1]*100))


Accuracy: 96.78%


We can also view the confusion matrix of the predicted language classes. The rows represent the expected (true) labels and the columns are the predicted labels. When both match, on the diagonal, the classification was correct.

In [15]:
y_pred = model.predict(x_test, verbose=1)

y_pred_labels = np.asarray([ languages[i] for i in np.argmax(y_pred.astype(float), axis=1)])
y_test_labels = np.asarray([ languages[i] for i in np.argmax(y_test.astype(float), axis=1)])
conf = confusion_matrix(y_test_labels, y_pred_labels, labels=languages)

pd.set_option('display.max_columns', 22)
pd.DataFrame(conf, index=languages, columns=languages)




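| | fr | sl | sk | da | es | ro | pl | de | et | sv | fi | lv | el | nl | hu | pt | lt | it | bg | en | cs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fr | 946 | 2 | 0 | 8 | 3 | 1 | 0 | 2 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 2 | 0 | 28 | 0 |
| sl | 1 | 971 | 3 | 3 | 1 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 9 | 0 | 1 | 3 | 3 | 0 | 1 | 0 |
| sk | 1 | 8 | 965 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 1 | 5 | 0 | 0 | 0 | 10 |
| da | 1 | 0 | 0 | 981 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 0 | 0 | 9 | 0 | 2 | 0 | 0 | 0 | 2 | 0 |
| es | 4 | 2 | 4 | 1 | 925 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 39 | 1 | 12 | 0 | 9 | 0 |
| ro | 2 | 1 | 0 | 0 | 2 | 980 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 5 | 0 |
| pl | 0 | 1 | 0 | 1 | 0 | 0 | 994 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| de | 0 | 0 | 0 | 11 | 0 | 0 | 1 | 934 | 6 | 7 | 3 | 0 | 0 | 33 | 0 | 0 | 0 | 0 | 0 | 4 | 1 |
| et | 4 | 4 | 0 | 4 | 0 | 0 | 0 | 2 | 939 | 11 | 30 | 0 | 0 | 1 | 0 | 2 | 0 | 3 | 0 | 0 | 0 |
| sv | 0 | 0 | 0 | 27 | 0 | 0 | 0 | 2 | 5 | 961 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |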
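
One way to read a row of this matrix is sketched below, using the `fr` and `es` rows copied from the output above. Each of these rows sums to 1000, which is consistent with rows holding the true labels. `recall_and_top_confusion` is a hypothetical helper written for this illustration.

```python
languages = ['fr', 'sl', 'sk', 'da', 'es', 'ro', 'pl', 'de', 'et', 'sv', 'fi',
             'lv', 'el', 'nl', 'hu', 'pt', 'lt', 'it', 'bg', 'en', 'cs']

# Two rows copied from the confusion matrix output.
rows = {
    'fr': [946, 2, 0, 8, 3, 1, 0, 2, 3, 0, 0, 1, 0, 1, 0, 2, 1, 2, 0, 28, 0],
    'es': [4, 2, 4, 1, 925, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 39, 1, 12, 0, 9, 0],
}

def recall_and_top_confusion(lang, row):
    i = languages.index(lang)
    recall = row[i] / sum(row)   # diagonal entry over all true samples
    # Largest off-diagonal entry: the language it is most often mistaken for.
    j = max((k for k in range(len(row)) if k != i), key=lambda k: row[k])
    return recall, languages[j], row[j]

fr_stats = recall_and_top_confusion('fr', rows['fr'])
es_stats = recall_and_top_confusion('es', rows['es'])
print(fr_stats)  # (0.946, 'en', 28)
print(es_stats)  # (0.925, 'pt', 39)
```

The most frequent mistake for Spanish is Portuguese, which matches the intuition that closely related languages are the hardest to separate.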


This model can definitely be improved in many ways.
1. Use all the language files. I put a limit due to the local system's memory constraints.
2. Add more LSTM layers or more units per layer.
3. Tune the hyperparameters, and perhaps train for longer.
