# Language Detection Challenge Problem
Prompt:
The European Parliament Proceedings Parallel Corpus is a text dataset used for evaluating language detection engines. The 1.5GB corpus includes 21 languages spoken in the EU.  
Create a machine learning model trained on this dataset to predict the following test set.

## Introduction
This notebook provides several solutions used to solve the prompt, highlighting the solution with the highest accuracy.
<br><br>
Language detection can be approached in many ways, and a large corpus leaves a lot of freedom in feature engineering. One major challenge is deciding how much weight each word or character sequence in a text should carry, and then how to compute that weight.
<br><br>
As the works in the References section report, an accuracy of 99% can be achieved given a large enough dataset. The best solutions used a combination of N-grams and Recurrent Neural Networks (RNN). My solution tries a similar approach.

## Workflow
The solution focuses on preparing and preprocessing the dataset, then creating features that a classifier can use to predict the language of each line in the test set.

1. Load the training dataset
2. Load the test set
3. Create a feature set using N-grams (trigrams specifically)
4. Weight trigram importance in the corpus with TF-IDF
5. Model, predict, and solve

In [1]:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

# data analysis and wrangling
import os
import glob as g
import math
from string import punctuation,digits
import pandas as pd
import numpy as np

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Preprocessing
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.util import trigrams
from collections import Counter
from collections import defaultdict

# Testing
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from keras.losses import categorical_crossentropy
from keras.optimizers import Adam

# Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from keras.models import Model, Sequential
from keras.layers import Embedding
from keras.layers import LSTM, GRU, Input, Dense, Dropout

Using TensorFlow backend.


In [2]:
# Loading datasets
directory = '~/mynotebooks/Language-Detection/'
sub = 'txt/'
txt = '/*.txt'
DS = '.DS_Store'
testfile = 'europarl.test'
encoding = 'utf-8'

# Magic numbers
left_angle_bracket = '<'
left_parenthesis = '('
tab = '\t'
num_files_per_dir = 999
most_common = 25

# Initializations
language_lists = []
label_list = []
for _ in range(21):
    language_lists.append([])
    label_list.append([])

# Helper Functions
The following cells define helper functions used for preprocessing and computing TF-IDF. Details are given in their docstrings.

In [3]:
def flatten(list_of_lists):
    """
    Convert list of lists into list of strings
    :param list_of_lists: input sequence to be flattened
    :return: converted list of strings
    """
    return [val for sublist in list_of_lists for val in sublist]

def to_words(text):
    """
    Remove ASCII punctuation and digits and set all characters lowercase
    :param text: string
    :return: preprocessed text
    """
    # lowercase once, then delete punctuation and digits in a single pass
    return text.lower().translate(str.maketrans('', '', punctuation + digits))

def to_chars(words):
    """
    Convert text to a list of preprocessed characters
    :param words: string of text
    :return: list of characters
    """
    # to_words lowercases the text and removes punctuation and digits
    return list(to_words(words))
    
def create_trigrams(sequence):
    """
    Create character trigrams
    :param sequence: list of characters
    :return: list of character trigrams
    """
    return list(trigrams(to_chars(sequence), pad_right=True, right_pad_symbol=' '))

def trigram_dictionary(sequence):
    """
    Find frequency of character trigram per sequence
    :param sequence: list of preprocessed character trigrams taken from training set
    :return: list of most common character trigrams as tuple ((trigram), frequency)
    """
    gram = dict()
    
    # Populate trigram dictionary
    for list_of_trigrams in sequence:
        for trigram in list_of_trigrams:
            if trigram != (' ', ' ', ' ') \
              and trigram[0:2] != (' ', ' ') \
              and trigram[1:3] != (' ', ' ') \
              and trigram[::2] != (' ', ' ') \
              and trigram[0] != ' ':
                if trigram in gram:
                    gram[trigram] += 1
                else:
                    gram[trigram] = 1

    # Turn into a list of (trigram, count) sorted by count from most to least
    gram = Counter(gram)
    return gram.most_common(most_common)

In [4]:
def idf(documents, tokens):
    """
    Calculate IDF (inverse document frequency)
    :param documents: list of documents for all languages
    :param tokens: list of all trigrams
    :return: dictionary of trigrams with value IDF
    """
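    idf_values = {}
    for tkn in tokens:
        # number of documents that contain the trigram at least once
        doc_freq = sum(tkn in doc for doc in documents)
        # a trigram found in no document would divide by zero, so give it weight zero
        idf_values[tkn] = 1 + math.log(len(documents) / doc_freq) if doc_freq else 0.0
    return idf_values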

def tfidf(documents, tokens):
    """
    Calculate TF-IDF (term frequency-inverse document frequency)
    :param documents: list of documents for all languages
    :param tokens: list of all trigrams
    :return: dictionary of trigrams with value TF-IDF
    """
    idf_values = idf(documents, tokens)
    tfidf_values = defaultdict(list)
    for document in documents:
        for key in idf_values:
            # count occurrences of the trigram in the document
            c = document.count(key)
            # log-scaled term frequency; an absent trigram gets weight zero
            tf = 1 + math.log(c) if c else 0
            tfidf_values[key].append((tf * idf_values[key])/(len(tokens)+1))
    return tfidf_values

# Dataset
As stated in the prompt, the European Parliament Proceedings Parallel Corpus is a text dataset used for evaluating language detection engines. The corpus itself is divided by language in separate subdirectories. Within each subdirectory, there are thousands of documents including texts and conversations in their respective languages.

## Load Data
The data is located in `/txt` with all 21 languages divided by subdirectory (i.e. English is in subdirectory `/txt/en`). Both the training and test sets are loaded in the cells below.

## Preprocessing
The data is preprocessed as it is loaded. There are a couple of preprocessing steps to note:
1. Remove any lines that start with '<' or '('. These lines are usually names or English words, which would confuse every dataset except the English one.
2. Set all characters to lowercase and remove punctuation and digits. Lowercasing simplifies processing, and punctuation and digits are not needed to detect the language.
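
The two steps above can be sketched on their own as follows. This is a minimal illustration, not the notebook's loader: `keep_line` and `clean_line` are hypothetical names, and the sample lines are made up rather than taken from the corpus.

```python
from string import punctuation, digits

# Translation table that deletes ASCII punctuation and digits in one pass.
_STRIP = str.maketrans('', '', punctuation + digits)

def keep_line(line):
    """Return True for non-empty lines that do not start with '<' or '('."""
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith(('<', '('))

def clean_line(line):
    """Lowercase a line and remove ASCII punctuation and digits."""
    return line.strip().lower().translate(_STRIP)

# Hypothetical input lines (assumption: not real corpus content).
sample = [
    '<SPEAKER ID=1 NAME="Example">',
    '(The sitting was opened at 9.05 a.m.)',
    'Resumption of the session, 17 December!',
]
cleaned = [clean_line(line) for line in sample if keep_line(line)]
print(cleaned)
```

Only the third line survives the filter. Removing the digits leaves a double space behind, which the trigram step later sees as ordinary whitespace.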

# N-grams
For language detection, character trigrams are used. Trigrams were chosen rather than bigrams or larger N-grams mainly for the sake of processing: larger N-grams would considerably increase both processing time and memory use.
These character trigrams form the feature set, with each feature recording how often the trigram appears in each document.
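
As a small illustration of what a character trigram is, the sketch below slides a window of three characters over a string and counts the results. It uses plain Python instead of `nltk.util.trigrams` and, unlike the helper in this notebook, does not pad the end of the sequence; `char_trigrams` is a hypothetical name.

```python
from collections import Counter

def char_trigrams(text):
    """Return every run of three consecutive characters in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

grams = char_trigrams('the theme')
counts = Counter(grams)
print(grams)
print(counts['the'])
```

Trigrams that span a space, such as `'e t'`, are kept, which is how word boundaries end up in the feature set.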

## Features
The feature set contains trigrams from all languages. However, using every trigram would be inefficient and invite overfitting. To reduce the feature set, the following steps were taken:
1. Obtain all character trigrams from each document
2. For each language, take the top 25 most common trigrams
3. Recombine into a single set of features
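
A compact way to picture these three steps is shown below. It is a sketch under simplifying assumptions: the trigrams are already grouped by language, `select_features` is a hypothetical helper, and the toy data is invented.

```python
from collections import Counter

def select_features(trigrams_by_language, top_k=25):
    """Union of the top_k most frequent trigrams of each language."""
    features = set()
    for trigram_list in trigrams_by_language.values():
        top = Counter(trigram_list).most_common(top_k)
        features.update(trigram for trigram, _ in top)
    return features

# Hypothetical trigram lists (assumption: not real corpus counts).
toy = {
    'en': ['the', 'the', 'and', 'ing'],
    'de': ['der', 'der', 'der', 'und'],
}
feature_set = select_features(toy, top_k=1)
print(sorted(feature_set))
```

Because the per-language lists are merged into a set, a trigram that is common in several languages appears only once, so the final feature count can be smaller than 21 × 25.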

In addition, each trigram is weighted per document using TF-IDF. The weights indicate how strongly a trigram is associated with a document, and therefore with its language. Trigrams that carry large weights for particular languages help the model decide cases where a trigram occurs in more than one language.
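
To make the effect of the weighting concrete, the sketch below computes the weight of a single trigram in a single document using the same formulas as the `idf` and `tfidf` helpers. The function name `tfidf_weight` and all of the numbers are hypothetical.

```python
import math

def tfidf_weight(count, n_docs, doc_freq, n_features):
    """Weight of one trigram in one document, mirroring the helper functions."""
    tf = 1 + math.log(count) if count else 0.0   # log-scaled term frequency
    idf = 1 + math.log(n_docs / doc_freq)        # rarer trigrams score higher
    return tf * idf / (n_features + 1)           # normalize by feature count

# Hypothetical numbers: 21 documents, 3 features, trigram seen 10 times.
common = tfidf_weight(count=10, n_docs=21, doc_freq=21, n_features=3)
rare = tfidf_weight(count=10, n_docs=21, doc_freq=1, n_features=3)
absent = tfidf_weight(count=0, n_docs=21, doc_freq=1, n_features=3)
print(common, rare, absent)
```

With the same count, the trigram that occurs in only one document receives a larger weight than the one found in every document, which is what lets language-specific trigrams stand out.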

## Test set
The test set is located in `europarl.test` and contains 21,000 lines. Each line starts with a two-character language code, followed by a tab and the corresponding line of text. The test labels are taken from the language code and the test data from the rest of the line.
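
Splitting one such line could look like the sketch below. `parse_test_line` is a hypothetical helper and the example line is invented; the notebook itself does this inside `load`.

```python
def parse_test_line(line):
    """Split 'xx<TAB>sentence' into (language_code, sentence)."""
    # maxsplit=1 keeps any further tabs inside the sentence intact
    code, text = line.rstrip('\n').split('\t', 1)
    return code, text

# Hypothetical line in the tab-delimited format described above.
code, text = parse_test_line('en\tThis is a hypothetical test sentence.\n')
print(code, text)
```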

In [5]:
def load(s,t):
  """
    Load training and test sets
    :param s: /txt directory where the training dataset is located
    :param t: the file for the test set, europarl.test
    :return: (x train data, y train labels), (x test data, y test labels), trigrams for training data, 
             and the dictionary of labels such that key = language, value = int
  """
  # Create the dictionary of labels from the subdirectory names
  labels = {key:value for value, key in enumerate(os.listdir(s)) if key != '.DS_Store'}
  # Initialize train, test, and trigram as lists
  x_train, y_train = [], []
  x_test, y_test = [], []
  trigrams = []

  # Load training set
  for lang_dir, lang_val in labels.items():
    file = s+lang_dir+txt
    for file_num, filename in enumerate(g.glob(file)):
      doc, grams = [], []
      with open(filename, "r", encoding=encoding, errors='ignore') as f:
         # entire document as one string
         for line in f:
             # preprocessing
             if line.strip() != '' \
               and not line.strip().startswith(left_angle_bracket) \
               and not line.strip().startswith(left_parenthesis):
                l = line.strip().lower()
                # strip punctuation and digits from the line
                doc.append(to_words(l))
                # create character trigrams
                grams.append(create_trigrams(l))
      y_train.append(lang_val)
      x_train.append(doc)
      trigrams.append(grams)
      # for pre-training with smaller samples
      if file_num == num_files_per_dir:
         break
    print("%d %s files loaded as training data..." % (num_files_per_dir+1,lang_dir))
  print("Training data loaded.\n")

  print("Loading %s as test data..." % t)
    
  # Load test set
  with open(t, "r", encoding=encoding) as f:
        for line in f.readlines():
            temp = line.strip().split(tab)
            y_test.append(labels[temp[0]])
            x_test.append(to_words(temp[1].lower()))
  print("Test data loaded.\n")

  return (x_train, y_train), (x_test, y_test), trigrams, labels

In [6]:
(data_train, label_train), (data_test, label_test), trigrams, labels = load(sub, testfile)

print('%d train sequences' % len(data_train))
print('%d test sequences' % len(data_test))

print('Average train sequence length: {}'.format(np.mean(list(map(len, data_train)), dtype=int)))
print('Average test sequence length: {}'.format(np.mean(list(map(len, data_test)), dtype=int)))

1000 cs files loaded as training data...
1000 pt files loaded as training data...
1000 ro files loaded as training data...
1000 sk files loaded as training data...
1000 fi files loaded as training data...
1000 fr files loaded as training data...
1000 it files loaded as training data...
1000 et files loaded as training data...
1000 de files loaded as training data...
1000 pl files loaded as training data...
1000 es files loaded as training data...
1000 en files loaded as training data...
1000 nl files loaded as training data...
1000 bg files loaded as training data...
1000 sl files loaded as training data...
1000 sv files loaded as training data...
1000 hu files loaded as training data...
1000 da files loaded as training data...
1000 el files loaded as training data...
1000 lv files loaded as training data...
1000 lt files loaded as training data...
Training data loaded.

Loading europarl.test as test data...
Test data loaded.

21000 train sequences
21000 test sequences
Average train se

In [7]:
# list of most common trigrams for training data
train_dict = [trigram_dictionary(trigram) for trigram in trigrams]

# combine all training data with labeled data
training_data = list(zip(train_dict, label_train))

# turn list of lists into list of strings, to be used for TF-IDF calculations
d_train_mapped = list(map(''.join, data_train))

In [8]:
# create a different list per language to keep labels consistent
for _,val in labels.items():
    for item in training_data:
        if item[1] == val:
            language_lists[val].append(chars[0] for chars in item[0])

# recombine list into one feature set
for i in range(len(language_lists)):
    language_lists[i] = Counter(flatten(language_lists[i])).most_common(most_common)
    label_list[i] += [i] * most_common

# all features in a single list with label included
training_features = list(zip(flatten(language_lists), flatten(label_list)))

# drop trigram count and create dict with key = trigram, val = label (int)
features = dict((''.join(key[0]),val) for key, val in training_features)
tokens = set(''.join(key[0]) for key,_ in training_features)

In [9]:
# training dictionary for each feature frequency per document
train_tfidf = tfidf(d_train_mapped, tokens)
# test dictionary for each feature frequency per line
test_tfidf = tfidf(data_test, tokens)

In [10]:
# create DataFrame for visual of data
train_tfidf_df = pd.DataFrame(train_tfidf)
train_tfidf_df['languages'] = label_train

# create DataFrame for visual of data
test_tfidf_df = pd.DataFrame(test_tfidf)
test_tfidf_df['languages'] = label_test

### Visualization
The relative weights are shown in the tables below. Each row corresponds to a body of text and each column to a character trigram. Note that spaces are included in the trigrams, which helps the model recognize the beginnings and ends of words.
Each cell holds a normalized weight reflecting how often the trigram occurs in the document. The weight is the product of the log-scaled trigram count and the IDF, divided by the number of trigram features plus one.
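
Written out, the weight computed by the `tfidf` and `idf` helpers is as follows. The symbols are my own notation: $c_{t,d}$ is the count of trigram $t$ in document $d$, $N$ the number of documents, $n_t$ the number of documents containing $t$, and $|T|$ the number of trigram features.

$$
w_{t,d} = \frac{\mathrm{tf}_{t,d}\,\mathrm{idf}_t}{|T| + 1},
\qquad
\mathrm{tf}_{t,d} =
\begin{cases}
1 + \ln c_{t,d} & \text{if } c_{t,d} > 0 \\
0 & \text{if } c_{t,d} = 0
\end{cases},
\qquad
\mathrm{idf}_t = 1 + \ln\frac{N}{n_t}
$$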

In [20]:
train_tfidf_df.head()

Unnamed: 0,a d,a p,aan,aci,af,ai,ak,als,an,ana,...,не,ни,но,пре,про,ред,та,те,то,languages
0,0.006878,0.017474,0.0,0.0,0.0,0.0,0.0,0.0,0.006899,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [19]:
test_tfidf_df.head()

Unnamed: 0,a d,a p,aan,aci,af,ai,ak,als,an,ana,...,не,ни,но,пре,про,ред,та,те,то,languages
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.016005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.015497,0.015686,0.0,0.01611,0.024476,0.031302,0.0,14
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.015686,0.016548,0.01611,0.014456,0.0,0.014266,14
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.015686,0.0,0.01611,0.014456,0.0,0.014266,14
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.016005,0.016005,0.0,0.015686,0.034727,0.01611,0.037722,0.0,0.014266,14


### Observations
From the tables above, it can be seen that certain trigrams appear only in certain documents or lines of text. A value of zero means the trigram does not occur in that text, which tells the model that the languages associated with that trigram are less likely.

In [21]:
# Training data
label_train_df = train_tfidf_df['languages']
data_train_df = train_tfidf_df.drop(['languages'], axis=1)

# Test data to predict
label_test_df = test_tfidf_df['languages']
data_test_df = test_tfidf_df.drop(['languages'], axis=1)

# Modeling
The problem is a multi-class classification: each line of text belongs to one of 21 classes. The models below were therefore chosen because they support multi-class classification.

1. Logistic Regression
2. KNN
3. Stochastic Gradient Descent
4. SVM
5. Gaussian Naive Bayes
6. Decision Tree
7. Random Forest
8. RNN

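Several of these models (for example Logistic Regression, Naive Bayes, and Random Forest) can report a probability for each language, while the others produce per-class scores. In cases where a trigram occurs in more than one language, having such probabilities or scores to weigh the candidates improves the chance of a correct prediction.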
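
The idea of turning per-language scores into probabilities can be sketched with a softmax, which is also the output activation of the Keras models later in this notebook. The scores below are hypothetical and do not come from any trained model.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - shift) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

# Hypothetical scores for a line whose trigrams occur in both es and pt.
probs = softmax({'es': 2.0, 'pt': 1.5, 'fi': -1.0})
best = max(probs, key=probs.get)
print(best, probs)
```

Even when two languages share trigrams, the one with the higher score still receives the larger share of the probability mass.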

In [22]:
# Logistic Regression

lr = LogisticRegression()
lr.fit(data_train_df, label_train_df)
Y_pred = lr.predict(data_test_df)
acc_log = round(lr.score(data_test_df, label_test_df) * 100, 2)
acc_log

90.819999999999993

In [23]:
# KNN

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(data_train_df, label_train_df)
Y_pred = knn.predict(data_test_df)
acc_knn = round(knn.score(data_test_df, label_test_df) * 100, 2)
acc_knn

84.599999999999994

In [24]:
# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(data_train_df, label_train_df)
Y_pred = sgd.predict(data_test_df)
acc_sgd = round(sgd.score(data_test_df, label_test_df) * 100, 2)
acc_sgd

81.680000000000007

In [25]:
# Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(data_train_df, label_train_df)
Y_pred = linear_svc.predict(data_test_df)
acc_linear_svc = round(linear_svc.score(data_test_df, label_test_df) * 100, 2)
acc_linear_svc

90.109999999999999

In [26]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(data_train_df, label_train_df)
Y_pred = gaussian.predict(data_test_df)
acc_gaussian = round(gaussian.score(data_test_df, label_test_df) * 100, 2)
acc_gaussian

52.75

In [27]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(data_train_df, label_train_df)
Y_pred = decision_tree.predict(data_test_df)
acc_decision_tree = round(decision_tree.score(data_test_df, label_test_df) * 100, 2)
acc_decision_tree

67.099999999999994

In [28]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(data_train_df, label_train_df)
Y_pred = random_forest.predict(data_test_df)
acc_random_forest = round(random_forest.score(data_test_df, label_test_df) * 100, 2)
acc_random_forest

86.459999999999994

In [29]:
# Models Summary

models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes',
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)

Unnamed: 0,Model,Score
1,Logistic Regression,90.82
5,Linear SVC,90.11
2,Random Forest,86.46
0,KNN,84.6
4,Stochastic Gradient Descent,81.68
6,Decision Tree,67.1
3,Naive Bayes,52.75


### Observations
<b>Logistic Regression</b> achieved the best accuracy at 90.82%. Logistic Regression extends naturally to multiple classes and tends to cope well with sparse features that are close to linearly separable, which may explain why it came out on top. 

<b>SVM (Linear SVC)</b>: Performed almost as well as Logistic Regression (90.11%). This could be because the languages are mostly distinguishable from each other. If many trigrams were shared between languages, it might be harder to separate the classes with linear boundaries.<br>
<b>Random Forest</b> and <b>Decision Tree</b>: Random Forest also performed well (86.46%), while the single Decision Tree scored much lower (67.1%). This suggests that the single tree overfits the training data, which averaging over many trees in the forest helps to counteract.<br>
<b>KNN</b>: Might be improved by tuning <i>k</i>, for example by trying values higher than the 3 used here.<br>
<b>Stochastic Gradient Descent</b>: Tuning <i>alpha</i> might also improve the SGD score.<br>
<b>Gaussian Naive Bayes</b>: Scored lowest (52.75%), possibly because the model assumes normally distributed features, which fits sparse, mostly zero TF-IDF weights poorly. As a result, the most probable class may not have stood out clearly during prediction.

### Simple RNN and Logistic Regression Models using Keras

In [42]:
def RNN_model(input_shape, output_size):
    """
    RNN Model
    :param input_shape: trigram feature set
    :param output_size: Length of output sequence (21 languages)
    :return: Keras model built, but not trained
    """
    model = Sequential()
    model.add(Embedding(input_shape[0], 128, input_length=input_shape[1]))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    model.add(Dense(output_size[1], activation='softmax'))
    
    return model

def LR_model(input_shape, output_size):
    """
    Logistic Regression Model
    :param input_shape: trigram feature set
    :param output_size: Length of output sequence (21 languages)
    :return: Keras model built, but not trained
    """
    model = Sequential()
    # the hidden layers have no activation, so the network stays linear up to the softmax
    model.add(Dense(256, input_dim=input_shape[1]))
    model.add(Dense(128))
    model.add(Dense(output_size[1], activation='softmax')) 
    
    return model

print("Models Loaded.")

Models Loaded.


In [None]:
num_epochs = 10
validation_split=0.2

# one-hot encode the integer labels; the discarded reshape calls were no-ops
y_train_update = np.ravel(label_train_df)
y_train_one_hot = to_categorical(y_train_update)

y_test_update = np.ravel(label_test_df)
y_test_one_hot = to_categorical(y_test_update)

In [None]:
# .values replaces .as_matrix(), which was removed in pandas 1.0
xx = data_train_df.values
yy = label_train_df.values

In [35]:
learning_rate = 0.0001

model = RNN_model(xx.shape, y_train_one_hot.shape)
model.compile(loss=categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 314, 128)          2688000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 22)                2838      
Total params: 2,822,422
Trainable params: 2,822,422
Non-trainable params: 0
_________________________________________________________________


In [36]:
batch_size = 256

model.fit(xx,
          y_train_one_hot,
          batch_size=batch_size,
          epochs=num_epochs,
          verbose=1,
          validation_split=validation_split)

Train on 16800 samples, validate on 4200 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd5992bfc50>

In [37]:
score, acc = model.evaluate(data_test_df.values, 
                            y_test_one_hot, 
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 3.51678667359
Test accuracy: 0.047619047619


### Observations
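The test accuracy of 0.0476 equals 1/21, which is what always predicting a single language would score on 21 equally sized classes, so the model appears to have learned nothing. A likely cause is the input rather than the depth of the network: the Embedding layer expects integer token indices, but it is fed fractional TF-IDF weights. Feeding the network sequences of trigram indices, and then possibly stacking multiple LSTMs, could help increase the accuracy of this simple model. 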
<br><br>
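
The reported test accuracy of 0.047619047619 is worth a closer look. The sketch below shows what happens if fractional weights are used as lookup indices, which, as far as I can tell, is what the Embedding layer does after casting its input to integers. The weights are copied from the first row of the training table above; the assumption that the test set holds equal shares of all 21 languages is mine.

```python
# TF-IDF weights copied from the first row of the training table.
weights = [0.006878, 0.017474, 0.0, 0.006899]

# An embedding layer looks rows up by integer index, so fractional inputs
# are truncated first; every weight below 1 becomes index 0.
indices = [int(w) for w in weights]
print(indices)             # [0, 0, 0, 0]

# With 21 equally sized classes, always predicting one class scores 1/21.
chance = 1 / 21
print(round(chance, 12))   # 0.047619047619
```

If every input collapses to the same index, every sample looks identical to the LSTM, which would explain an accuracy that matches chance exactly.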



In [43]:
learning_rate = 0.0001
model = LR_model(xx.shape, y_train_one_hot.shape)
model.compile(loss=categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 256)               80640     
_________________________________________________________________
dense_5 (Dense)              (None, 128)               32896     
_________________________________________________________________
dense_6 (Dense)              (None, 22)                2838      
Total params: 116,374
Trainable params: 116,374
Non-trainable params: 0
_________________________________________________________________


In [44]:
batch_size = 128

model.fit(xx,
          y_train_one_hot,
          batch_size=batch_size,
          epochs=num_epochs,
          verbose=1,
          validation_split=validation_split)

Train on 16800 samples, validate on 4200 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd3a4695a20>

In [45]:
score, acc = model.evaluate(data_test_df.values, 
                            y_test_one_hot, 
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc*100)

Test accuracy: 72.5666666667


### Observations
The simple Keras Logistic Regression model performed much better than the simple RNN model. Although it is still not as accurate as the earlier scikit-learn Logistic Regression model (90.82%), 72.57% is a reasonable result.

# Conclusion
Logistic Regression achieved the highest accuracy in this challenge. The European languages in the corpus appear to differ enough in their character trigrams that the classes separate clearly, which made prediction much easier for a linear multi-class classifier.
<br><br>
In the future, I would like to improve the accuracy of the RNN model combined with N-grams. With proper tuning, a larger dataset, and a multi-layer RNN model, the accuracy should be at least comparable to the roughly 90% achieved by Logistic Regression.
<br><br>
Aside from tuning hyperparameters, using higher-order N-grams would give more specific information about the data, at the cost of more processing time and memory. 
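
The trade-off behind higher-order N-grams can be sketched by counting how many distinct N-grams a text produces as N grows. `distinct_ngrams` is a hypothetical helper, the sample sentence is invented, and the 27-symbol alphabet (26 letters plus space) is a simplifying assumption.

```python
def distinct_ngrams(text, n):
    """Set of distinct character n-grams of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Hypothetical sample text (assumption: not taken from the corpus).
sample = 'the european parliament resumed the session'
for n in (2, 3, 4):
    print(n, len(distinct_ngrams(sample, n)))

# Upper bound on possible n-grams over a 27-symbol alphabet.
print(27 ** 3, 27 ** 4)   # 19683 531441
```

Longer N-grams are more specific to a language, but the space of possible features grows by a factor of the alphabet size with every extra character.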

# References
[1] Christian Buck, Kenneth Heafield, Bas van Ooyen. N-gram Counts and Language Models from the Common Crawl.<br>
[2] Alberto Simões, José João Almeida, and Simon D. Byers. 2014. Language Identification: a Neural Network Approach.<br>
[3] Manav Sehgal. 2017. Titanic Data Science Solutions. <https://www.kaggle.com/startupsci/titanic-data-science-solutions>.