## Language Detection

**Problem Statement:** [European Parliament Proceedings Parallel Corpus](http://www.statmt.org/europarl/) is a text dataset used for evaluating language detection engines. The 1.5GB corpus includes 21 languages spoken in the EU. Create a machine learning model trained on this dataset to predict the language of each sentence in the following test set.



Historically, language classification was done using statistical methods. Every language has certain characters or words that can be used to tell it apart from others. But this approach required maintaining dictionaries, or some equivalent, for every language we would like to detect. That was cumbersome and did not scale to other languages, other dialects, or even newer vocabulary.

After that people tried to solve this problem using Machine Learning and succeeded! Language Detection is now subsumed in the bigger problem domain of text classification, which is all about assigning categories to a given text document.

To solve this problem I decided to use neural networks, specifically Recurrent Neural Networks (RNNs). Other contenders outside of neural networks would have been character n-gram models or a Naive Bayes classifier. Among neural networks, RNNs are usually a more natural fit than CNNs for sequential data such as text. Besides, RNNs are what [power Google Translate](https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html).

Language detection as performed by [Google](https://cloud.google.com/translate/docs/detecting-language) has ~90% accuracy. Undoubtedly the Neural Network architecture they use would be more complex and probably impossible to run on a local machine. Still I'll give it a try.

Starting with imports. I decided to use the Keras framework running on a TensorFlow backend.

In [None]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Embedding  # Embedding is exported from keras.layers directly
from keras.callbacks import ModelCheckpoint
from keras import utils
from re import sub
from string import punctuation
from os import listdir
from os.path import isfile, join

A lot of input text preprocessing is needed. For starters, we need to remove the HTML tags in the documents and strip punctuation and digits (as they don't really help distinguish between European languages).
Since the test data consists of individual sentences that we have to label, the training documents are also split into arrays of sentences.


It is also necessary to convert the string representation into integers. I decided to simply use the Unicode code point of each character.

Next, it is crucial to make sure that all sentences have equal length (the model requires the input dimensions to be known). If a sentence is 200 characters or longer, it is truncated to 200. If it is shorter, it is padded at the end with NULL characters (code point 0).

In [2]:
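def preprocess(txt):
    # Replace HTML/XML tags (and the spaces around them) with a single space
    txt = sub(" *<[^>]+> *"," ", txt)
    # Trim spaces around line breaks
    txt = sub(" *\n *","\n",txt)
    # Drop punctuation and digits
    not_allowed = punctuation + '0123456789'
    txt = ''.join([i for i in txt if i not in not_allowed])
    
    # One sentence per line; skip empty or single-character lines
    sentences = txt.split("\n")
    sentences = [s for s in sentences if len(s)>1]
    return sentences

def char_to_int(st):
    # Encode each character as its Unicode code point
    return [ord(s) for s in st]
    
def cut_or_pad(st,maxlen):    
    # Truncate to maxlen, or right-pad with NUL characters (code point 0)
    if len(st)>=maxlen:
        return char_to_int(st[:maxlen])
    else:
        n_pad = maxlen - len(st)
        return char_to_int(st+'\x00'*n_pad)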
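As a quick sanity check, the preprocessing steps can be exercised on a tiny made-up input. The sketch below restates the helpers compactly so that it runs on its own; the sample string and the `maxlen` of 14 are assumptions chosen for illustration, not values from the corpus.

```python
from re import sub
from string import punctuation

def preprocess(txt):
    # Drop tags, trim spaces around newlines, strip punctuation and digits.
    txt = sub(" *<[^>]+> *", " ", txt)
    txt = sub(" *\n *", "\n", txt)
    not_allowed = set(punctuation + "0123456789")
    txt = "".join(c for c in txt if c not in not_allowed)
    return [s for s in txt.split("\n") if len(s) > 1]

def cut_or_pad(st, maxlen):
    # Truncate long sentences; right-pad short ones with NUL (code point 0).
    st = st[:maxlen]
    return [ord(c) for c in st + "\x00" * (maxlen - len(st))]

sample = "<P>\nHello, world! 42\nBonjour le monde.\n"   # made-up input
sentences = preprocess(sample)
encoded = [cut_or_pad(s.lower(), 14) for s in sentences]
print(sentences)
print(encoded[0])   # short sentence: ends in zeros
print(encoded[1])   # long sentence: cut to 14 values
```

Note that removing the digits leaves a trailing space in the first sentence; for language detection that is harmless, but it shows the cleanup is deliberately rough.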

Next, the language files are read and the training set is constructed. I put a limit of 200 files per language due to system constraints. All the text is preprocessed, and the sentences are split and padded/cut. Labels are integer encoded: instead of labeling a sentence as 'fr', we label it with the position of 'fr' in the languages array (this is just one way to convert the string label into an integer value).

In [35]:
languages = ['fr', 'sl', 'sk', 'da', 'es', 'ro', 'pl', 'de', 'et', 'sv', 'fi', 'lv', 'el', 'nl', 'hu', 'pt', 'lt', 'it', 'bg', 'en', 'cs']
train_sentences, train_labels = [],[]
num_files = 200
maxlen = 200
for idx,l in enumerate(languages):
    lang_path = "./txt/" + l
    print("Fetching and processing",l)
    all_files = listdir(lang_path)
    for f in all_files[:num_files]:
        file_path = join(lang_path,f)
        with open(file_path, 'r', encoding='utf-8') as txt:  # explicit encoding, so decoding does not depend on the system locale
            lang_sentences = preprocess(txt.read())
            for s in lang_sentences:
                train_labels.append(idx)
                train_sentences.append(cut_or_pad(s.lower(), maxlen))


Fetching and processing fr
Fetching and processing sl
Fetching and processing sk
Fetching and processing da
Fetching and processing es
Fetching and processing ro
Fetching and processing pl
Fetching and processing de
Fetching and processing et
Fetching and processing sv
Fetching and processing fi
Fetching and processing lv
Fetching and processing el
Fetching and processing nl
Fetching and processing hu
Fetching and processing pt
Fetching and processing lt
Fetching and processing it
Fetching and processing bg
Fetching and processing en
Fetching and processing cs


The same thing is done with the test data.

In [36]:
test_sentences, test_labels = [],[]
with open("./europarl.test", 'r', encoding='utf-8') as f:  # explicit encoding, so decoding does not depend on the system locale
    sentences = preprocess(f.read())
    for sen in sentences:
        s = sen.split("\t")
        if len(s)!=2:
            continue
        test_labels.append(languages.index(s[0]))
        test_sentences.append(cut_or_pad(s[1].lower(), maxlen))        

In [37]:
print(train_sentences[0])
print(test_sentences[0])

[109, 105, 108, 108, 233, 110, 97, 105, 114, 101, 32, 112, 111, 117, 114, 32, 108, 101, 32, 100, 233, 118, 101, 108, 111, 112, 112, 101, 109, 101, 110, 116, 32, 32, 111, 98, 106, 101, 99, 116, 105, 102, 32, 32, 97, 109, 233, 108, 105, 111, 114, 101, 114, 32, 108, 97, 32, 115, 97, 110, 116, 233, 32, 109, 97, 116, 101, 114, 110, 101, 108, 108, 101, 32, 100, 233, 98, 97, 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1077, 1074, 1088, 1086, 1087, 1072, 32, 32, 1085, 1077, 32, 1090, 1088, 1103, 1073, 1074, 1072, 32, 1076, 1072, 32, 1089, 1090, 1072, 1088, 1090, 1080, 1088, 1072, 32, 1085, 1086, 1074, 32, 1082, 1086, 1085, 1082, 1091, 1088, 1077, 1085, 1090, 1077, 1085, 32, 

We need to know the maximum value across the integer-encoded sentences, since it determines the size of the embedding layer's input vocabulary.

In [38]:
max_value=0
for s in train_sentences:
    max_value = max(max_value, max(s))

for s in test_sentences:
    max_value = max(max_value, max(s))

max_value = max_value+100
max_value

65633

In [39]:
print("Number of train samples ", len(train_sentences))
print("Number of test samples ", len(test_sentences))


Number of train samples  239930
Number of test samples  21000


Now the data is converted into NumPy arrays. The integer labels are also converted into a one-hot binary matrix so they can be used for multiclass classification.

In [40]:
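limit = 70000 # This limit is necessary as numpy arrays of larger size consume more memory. The total array length was >120000
num_classes = len(languages)
# Pick the subset at random: the sentences are grouped by language, so a plain
# [:limit] slice would keep only the first few languages. The indices are sorted
# to preserve the original order.
keep = np.sort(np.random.choice(len(train_sentences), limit, replace=False))
x_train = np.asarray([train_sentences[i] for i in keep])
x_test = np.asarray(test_sentences)
y_train = utils.to_categorical([train_labels[i] for i in keep], num_classes)
y_test = utils.to_categorical(test_labels, num_classes)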
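To make the label handling concrete, the sketch below shows in plain Python what the conversion amounts to: each language code becomes its index in `languages`, and each index becomes a row with a single 1. The helper `one_hot` is a hypothetical stand-in for illustration; the notebook itself relies on `utils.to_categorical`.

```python
languages = ['fr', 'sl', 'sk', 'da']   # shortened list, for illustration only

def one_hot(labels, num_classes):
    # One row per label, with a 1.0 in the column given by the label index.
    return [[1.0 if col == lab else 0.0 for col in range(num_classes)]
            for lab in labels]

labels = [languages.index(code) for code in ['sk', 'fr']]
matrix = one_hot(labels, len(languages))
print(labels)
print(matrix)
```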


The data needs to be shuffled, as all the languages were grouped together.

In [None]:
shuf = np.arange(x_train.shape[0])
np.random.shuffle(shuf)
x_train = x_train[shuf]
y_train = y_train[shuf]


shuf = np.arange(x_test.shape[0])
np.random.shuffle(shuf)
x_test = x_test[shuf]
y_test = y_test[shuf]
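The key detail in the cell above is that one permutation indexes both arrays, so every sentence keeps its label. A tiny standard-library sketch with made-up data illustrates the invariant; the fixed seed is only there to make the demo repeatable.

```python
import random

xs = ["bonjour", "hello", "hallo", "hola"]   # made-up sentences
ys = ["fr", "en", "de", "es"]                # their labels

order = list(range(len(xs)))
random.Random(0).shuffle(order)   # one permutation, reused for both lists

xs_shuf = [xs[i] for i in order]
ys_shuf = [ys[i] for i in order]

# The set of (sentence, label) pairs is unchanged by the shuffle.
print(set(zip(xs_shuf, ys_shuf)) == set(zip(xs, ys)))
```

Shuffling the two arrays with two independent permutations would break this pairing and silently scramble the labels.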

A portion of the training data is set aside as a validation set. It is used to validate the model and the weights; the model won't be trained on it.

In [41]:
(x_train, x_valid) = x_train[5000:], x_train[:5000]
(y_train, y_valid) = y_train[5000:], y_train[:5000]

print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
print(x_valid.shape[0], 'validation samples')


x_train shape: (65000, 200)
65000 train samples
21000 test samples
5000 validation samples


In [42]:
print(x_train.shape)
print(y_train.shape)

print(x_test.shape)
print(y_test.shape)


(65000, 200)
(65000, 21)
(21000, 200)
(21000, 21)


The model architecture starts with an embedding layer, as our integer data needs to become a 3D tensor in order to be consumed by the LSTM layer. LSTM stands for Long Short-Term Memory and is a type of RNN layer. It preserves context across the length of the sentence. After the LSTM comes a Dropout layer to reduce overfitting. We finish with a fully connected layer with a softmax activation, which classifies the input into one of the languages.  
The model is compiled with the loss function and optimizer commonly used in multiclass problems.

In [44]:
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(max_value, embedding_vector_length, input_length=maxlen))
model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 200, 32)           2100256   
_________________________________________________________________
lstm_8 (LSTM)                (None, 128)               82432     
_________________________________________________________________
dropout_8 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 21)                2709      
Total params: 2,185,397.0
Trainable params: 2,185,397.0
Non-trainable params: 0.0
_________________________________________________________________
None
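The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The short calculation below uses the sizes from this notebook; the variable names are only for illustration.

```python
vocab_size, embed_dim = 65633, 32
lstm_units, num_classes = 128, 21

# Embedding: one vector of length embed_dim per possible input integer.
embedding_params = vocab_size * embed_dim
# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# Dense: a weight per (unit, class) pair plus one bias per class.
dense_params = lstm_units * num_classes + num_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
print("embedding share: %.1f%%" % (100 * embedding_params / total))
```

The embedding table accounts for the vast majority of the parameters, which follows directly from using raw Unicode code points as the input vocabulary.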


The model is trained for 8 epochs in batches of 128. We can specify validation data on which the model is evaluated after each epoch. A checkpoint callback is added so that only the weights that performed best against the validation set are stored. This makes sure that we don't use weights that overfit the training data.

In [None]:
# train the model
checkpointer = ModelCheckpoint(filepath='model.weights.best.hdf5', verbose=1, 
                               save_best_only=True)
model.fit(x_train, y_train, batch_size=128, epochs=8, validation_data=(x_valid, y_valid), callbacks=[checkpointer])


Train on 65000 samples, validate on 5000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
 9600/65000 [===>..........................] - ETA: 483s - loss: 0.0871 - acc: 0.9769
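Conceptually, `save_best_only=True` keeps a running best of the monitored quantity (validation loss by default) and overwrites the weights file only when an epoch improves on it. The loop below is a plain-Python sketch of that idea with made-up per-epoch losses; it is not Keras's actual implementation.

```python
val_losses = [0.92, 0.41, 0.23, 0.25, 0.19, 0.22]   # made-up values, one per epoch

best = float("inf")
saved_epochs = []
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best:
        best = loss
        saved_epochs.append(epoch)   # the weights file would be overwritten here

print(saved_epochs, best)
```

In this made-up run the file on disk at the end would hold the weights from epoch 5, even though training continued to epoch 6.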

Finally we evaluate the model on test data. 

In [None]:
model.load_weights('model.weights.best.hdf5')
scores = model.evaluate(x_test, y_test)
print("Accuracy: %.2f%%" % (scores[1]*100))
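A single accuracy number hides which languages get confused with each other. Assuming class probabilities were obtained with something like `model.predict(x_test)`, a per-language breakdown could be sketched as follows. The probabilities and labels here are made up, and `per_class_accuracy` is a hypothetical helper rather than part of Keras.

```python
from collections import defaultdict

def per_class_accuracy(probs, true_labels, names):
    # The predicted class is the index of the largest probability in each row.
    hits, totals = defaultdict(int), defaultdict(int)
    for row, truth in zip(probs, true_labels):
        pred = max(range(len(row)), key=row.__getitem__)
        totals[names[truth]] += 1
        hits[names[truth]] += int(pred == truth)
    return {name: hits[name] / totals[name] for name in totals}

names = ['sk', 'cs', 'en']                      # shortened list, for illustration
probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1],      # made-up model outputs
         [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]
truth = [0, 1, 1, 2]
result = per_class_accuracy(probs, truth, names)
print(result)
```

With the real one-hot `y_test`, the true label of each row would first be recovered by taking the index of its 1 entry.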


This model can definitely be improved in many ways. 
1. Use all the language files. I put a limit due to the local system's memory constraints.
2. Add more LSTM layers or more units per layer.
3. Change the hyperparameters. Train for longer, maybe.
