## Dataset

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

We will use this dataset to detect which of its 21 languages a given text is written in. This can be framed as a multilingual text classification problem, and we will use character-level features for the task.

## Data Preprocessing

For each data unit, we have taken the following pre-processing steps:

 1. Remove <> tags and the content of () brackets
 2. Split the text into sentences on \n

The result is saved to the dataframe lang_df for further processing and modelling.
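
To make the two steps concrete, here is a minimal standalone sketch of the same cleaning logic applied to a made-up snippet. The sample text and the helper name `clean_text` are illustrative assumptions, not taken from the corpus or the notebook.

```python
import re

def clean_text(raw):
    """Drop (...) spans and <...> tags, then split on newlines (hypothetical helper)."""
    no_parens = re.sub(r"\([^)]*\)", "", raw)
    no_tags = re.sub(r"<[^>]*>", "", no_parens)
    # Keep only non-empty lines, one sentence per line
    return [line for line in no_tags.split("\n") if line != ""]

# Made-up input in the style of a tagged transcript
sample = '<SPEAKER ID=1 NAME="Example">\nResumption of the session (applause)\n<P>\nThank you.'
print(clean_text(sample))
```

Note that whitespace left behind by a removed span (here the space before the dropped bracket) is not stripped in this sketch.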

In [1]:
# Data Preprocessing

import os
import pandas as pd
import glob
import re
import numpy as np


# Sentence to clean sentence or multiple sentences
def clean_sentence(sent, multiple=1):
    
    # Remove () part from sent and <> tags
    
    sent = re.sub(r"\([^)]*\)", '', sent)
    sent = re.sub(r"<[^>]*>", '', sent)
    
    # Split into multiple sentences if more than one sentence
    
    if multiple == 1:
        return [x for x in sent.split('\n') if x != '']
    elif multiple == 0:
        return sent

# Preprocess whole folder of europarl data
def preprocess_folder(input_org_data, preprocessing_text):
    # Process each file by language folder and collect sentences into a dataframe with labels

    if preprocessing_text == 0:
        invalid_files = []
        frames = []

        for i in output_labels:
            print("Current Folder: ", i)
            for j in glob.glob(input_org_data+'/'+i+'/*.txt'):
                try:
                    with open(j) as f:
                        sentences = clean_sentence(f.read())
                    frames.append(pd.DataFrame({'label': [i]*len(sentences), 'sentence': sentences}))
                except Exception:
                    # Keep track of files that could not be read or decoded
                    invalid_files.append(j)
                    print(j)

        # Combine all files and write to file
        if frames:
            lang_df = pd.concat(frames, ignore_index=True)
        else:
            lang_df = pd.DataFrame(columns=['label', 'sentence'])
        lang_df.to_csv('lang_df.csv', index=False)
        return lang_df

    elif preprocessing_text == 1:
        # Reuse the previously saved dataframe
        return pd.read_csv('lang_df.csv')

    else: # if dataset not required, features already generated
        return True

## Feature Extraction

Since the task is language detection across many languages, we use character-level features. We do not need the local, domain-specific information that matters for tasks such as sub-level or category classification of text, so character-level features are a better fit and yield a single, compact vocabulary shared by all languages.

We use sklearn's train_test_split to split the data into training and test sets for model validation, CountVectorizer with the char analyzer for the character-level features (X), and LabelEncoder for the language labels (y).

CountVectorizer builds a character-level vocabulary from the training text and then uses it to represent each sentence, or unit of pre-processed data, as a vector of character counts.
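
As a rough illustration of what character-level counting produces, the sketch below builds a vocabulary and a count vector using only the standard library. It is a simplified stand-in for the idea, not scikit-learn's implementation; the helper names and toy strings are assumptions, and the lowercasing mirrors CountVectorizer's default `lowercase=True`.

```python
from collections import Counter

def fit_char_vocabulary(sentences):
    """Map every character seen in the (lowercased) sentences to a column index."""
    chars = sorted({ch for s in sentences for ch in s.lower()})
    return {ch: idx for idx, ch in enumerate(chars)}

def char_count_vector(sentence, vocabulary):
    """Count each vocabulary character in one sentence; unseen characters are ignored."""
    counts = Counter(sentence.lower())
    vector = [0] * len(vocabulary)
    for ch, n in counts.items():
        if ch in vocabulary:
            vector[vocabulary[ch]] = n
    return vector

train = ["abba", "cab"]                   # toy training sentences
vocab = fit_char_vocabulary(train)
print(vocab)                              # {'a': 0, 'b': 1, 'c': 2}
print(char_count_vector("Abc!", vocab))   # [1, 1, 1]
```

Characters that never appeared in the training text (the `!` above) have no column, which is why the vocabulary is fitted on the training split and only applied to the test split.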

In [13]:
# Feature extraction

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
import pickle

def extract_features(lang_df, feature_extraction_text):

    if feature_extraction_text == 0:
        # Split into training and validation dataset

        X_train, X_test, y_train, y_test = train_test_split(list(lang_df.sentence), list(lang_df.label), test_size=0.33, random_state=42)

        # Generate character level features for X and Label Encoder for y

        # Train and Test

        # Sentence X
        count_vectorizer = CountVectorizer(analyzer='char')
        X_train_features = count_vectorizer.fit_transform(X_train)
        X_test_features = count_vectorizer.transform(X_test)

        # Label Y
        label_encoder = preprocessing.LabelEncoder()
        y_train_features = label_encoder.fit_transform(y_train)
        y_test_features = label_encoder.transform(y_test)

        # Pickle Save Features

        # Saving the features
        with open('features.pkl', 'wb') as f:  
            pickle.dump([X_train_features, X_test_features, y_train_features, y_test_features, count_vectorizer, label_encoder], f)

    elif feature_extraction_text == 1:
        # Restoring features and variables from the file written by the branch above
        with open('features.pkl', 'rb') as f:
            X_train_features, X_test_features, y_train_features, y_test_features, count_vectorizer, label_encoder = pickle.load(f)
        
    return X_train_features, X_test_features, y_train_features, y_test_features, count_vectorizer, label_encoder



## Model-1 Sklearn Multinomial Naive Bayes

The Multinomial Naive Bayes classifier is designed for classification with discrete features such as counts, which makes it well suited to text features.

We fit the model on the generated training features, with no hyperparameter tuning.
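
For intuition about what the classifier does with count features, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It is not scikit-learn's code; the function names, the toy count vectors and the two labels are made up for illustration.

```python
import math

def train_mnb(X, y, alpha=1.0):
    """Fit class log-priors and smoothed per-feature log-likelihoods from count vectors."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        totals = [sum(col) for col in zip(*rows)]      # per-feature counts in class c
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_likelihood

def predict_mnb(x, log_prior, log_likelihood):
    """Return the class with the highest posterior log-score for one count vector."""
    def score(c):
        return log_prior[c] + sum(n * ll for n, ll in zip(x, log_likelihood[c]))
    return max(log_prior, key=score)

# Toy count vectors over a 3-character vocabulary (made-up data)
X = [[3, 0, 1], [2, 1, 0], [0, 3, 2], [1, 4, 1]]
y = ["en", "en", "fi", "fi"]
log_prior, log_likelihood = train_mnb(X, y)
print(predict_mnb([4, 0, 1], log_prior, log_likelihood))  # en
print(predict_mnb([0, 5, 1], log_prior, log_likelihood))  # fi
```

The smoothing term `alpha` keeps a character that never occurred in a class from driving that class's probability to zero.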

In [3]:
# Model and Evaluation

# Multinomial Naive Bayes

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# Train and test on MNB, input only features
def run_mnb(X, y, clf=None):
    
    # Train mode, train and return model
    if clf is None:
        
        X_train_features = X
        y_train_features = y
        
        clf = MultinomialNB()

        clf.fit(X_train_features, y_train_features)
        
        return clf
    
    else: # Test mode, return the macro-averaged F1 score
        
        X_test_features = X
        y_test_features = y
        
        return f1_score(y_test_features, clf.predict(X_test_features), average='macro')  
        

## Run Pipeline

We use the pre-processing functions to pre-process the data, then generate features, train the model and test it on the test set.

In [4]:
# Run Model Pipeline

# Original Data
input_org_data = 'txt' # Path of data folder
output_labels = os.listdir(input_org_data) # Labels for Languages in Dataset

# Dataframe for processed data, columns: label, sentence
lang_df = pd.DataFrame(columns=['label', 'sentence'])

# Config and Execute

# Preprocessing
preprocessing_text = 2

lang_df = preprocess_folder(input_org_data, preprocessing_text)

if lang_df is True:
    feature_extraction_text = 1
else:
    feature_extraction_text = 0

# Feature Extraction
X_train_features, X_test_features, y_train_features, y_test_features, count_vectorizer, label_encoder = extract_features(lang_df, feature_extraction_text)

# Model

# Train
mnb_clf = run_mnb(X_train_features, y_train_features)

# Test

#f1_test = run_mnb(X_test_features, y_test_features, mnb_clf)

## Keras Bi-RNN Classification

Using Bi-RNN with LSTM Cell for language detection.

### Model Information
    * 75x2 units LSTM cell, bi-directional, with concatenation of outputs
    * Softmax Activation
    * Crossentropy Loss
    * Adam Optimizer
    * Manual Batch Training
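
Manual batch training here means feeding the model one large slice of the feature matrix at a time, presumably so that only that slice has to be converted to a dense array. The index bookkeeping can be sketched as follows; `iter_chunks` is a hypothetical helper, not part of the notebook, and unlike a loop that starts at the chunk size it also yields the final partial chunk.

```python
def iter_chunks(n_rows, chunk_size):
    """Yield (start, stop) index pairs that cover range(n_rows), including a short final chunk."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

print(list(iter_chunks(250, 100)))  # [(0, 100), (100, 200), (200, 250)]
```

Each `(start, stop)` pair would then be used to slice the features and labels before calling the model's fit step on that slice.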

In [39]:
# Keras Bi-RNN Classification

import keras
from keras.models import Sequential, load_model
from keras.layers import Activation, Dense, Dropout, LSTM, Flatten, Bidirectional
from keras.utils import to_categorical

def run_keras_birnn(X, y, model_path=None):
    
    if model_path is None: # Train
        model = Sequential()
        model.add(Bidirectional(LSTM(75), input_shape=(1, X.shape[1]), merge_mode='concat'))
        #model.add(LSTM(75, input_shape=(1, X_train_features.shape[1])))
        model.add(Dense(len(output_labels), activation='softmax')) 
        model.summary()
        model.compile(loss='categorical_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])

        start_idx = 0
        idx_step = 100000

        X_train_features, y_train_features = shuffle(X, y, random_state=0)

        for i in range(idx_step, X_train_features.shape[0], idx_step):
            #print(X_train_features[start_idx:i].toarray().reshape(X_train_features[start_idx:i].shape[0], 1, X_train_features[start_idx:i].shape[1]).shape)
            model_ins = model.fit(X_train_features[start_idx:i].toarray().reshape(X_train_features[start_idx:i].shape[0], 1, X_train_features[start_idx:i].shape[1]), to_categorical(y_train_features[start_idx:i]), 
                                  epochs=2,
                                 verbose=1,
                                 validation_split=0.1)
            start_idx = i

        # Save Keras Model
        model.save('keras_model.h5')
        return model

    else: # Load a saved model for testing and evaluation
        model = load_model(model_path)
        return model

## Tensorflow Bi-RNN Classification

### Model Information
    * 75x2 units LSTM cell, bi-directional, with concatenation of outputs
    * Softmax Activation
    * Crossentropy Loss
    * Adam Optimizer
    * Manual Batch Training
    
### Model Params
    * learning_rate = 0.01
    * n_epoch = 10

### Layer Params
    * vocab_size = 322
    * num_classes = 21
    * hidden_dim = 75
    * timesteps = 1
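
As a sanity check on model size, the layer parameters above imply the following number of trainable weights. This is a back-of-the-envelope sketch that assumes a standard LSTM cell (four gates, with bias, no peepholes or projection) and a single dense output layer on the concatenated 2 x 75 outputs; `lstm_param_count` is a hypothetical helper.

```python
def lstm_param_count(input_dim, hidden_dim):
    """Weights of one LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

vocab_size, num_classes, hidden_dim = 322, 21, 75

bilstm_params = 2 * lstm_param_count(vocab_size, hidden_dim)  # forward + backward cells
dense_params = 2 * hidden_dim * num_classes + num_classes      # output weights and bias
print(bilstm_params, dense_params, bilstm_params + dense_params)
# 238800 3171 241971
```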

In [6]:
# TF BiRNN Classification

import tensorflow as tf
from tensorflow.contrib import rnn
from keras.utils import to_categorical
tf.reset_default_graph()


# Model Params
learning_rate = 0.01
n_epoch = 10

# Layer Params
vocab_size = X_train_features.shape[1]
num_classes = len(output_labels)
hidden_dim = 75
timesteps = 1

# Placeholder
X = tf.placeholder(tf.float32, shape=[None, timesteps, vocab_size])
Y = tf.placeholder(tf.float32, shape=[None, num_classes])

# Weights
W_h = tf.Variable(tf.random_normal([2*hidden_dim, num_classes]))
C_h = tf.Variable(tf.random_normal([num_classes]))

# Bi-LSTM Cell
with tf.name_scope("BiLSTM"):
    with tf.variable_scope('forward'):
        lstm_fw = tf.nn.rnn_cell.LSTMCell(hidden_dim, forget_bias=1.0, state_is_tuple=True)
    with tf.variable_scope('backward'):
        lstm_bw = tf.nn.rnn_cell.LSTMCell(hidden_dim, forget_bias=1.0, state_is_tuple=True)
    
    (output_fw, output_bw), states = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw,
                                                                     cell_bw=lstm_bw,
                                                                     inputs=X,
                                                                     dtype=tf.float32,
                                                                     scope="BiLSTM")



outputs = tf.concat([output_fw, output_bw], axis=2)
outputs_flat = tf.reshape(outputs, [-1, 2 * hidden_dim])
logits = tf.matmul(outputs_flat, W_h) + C_h
#scores = tf.reshape(pred, [-1, batch_seq_len, num_classes])

prediction = tf.nn.softmax(logits)

# Define loss and optimizer
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
    logits=logits, labels=Y))
#optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)

# Evaluate model
correct_pred = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Saver
saver = tf.train.Saver()

In [13]:
# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()
X_train_features, y_train_features = shuffle(X_train_features, y_train_features, random_state=0)
test_X, test_Y = shuffle(X_test_features, y_test_features, random_state=0)
test_X, test_Y = test_X[0:100000], test_Y[0:100000]

# Start training
with tf.Session() as sess:
    
    #saver.restore(sess, "./model_tf/model_tf")
    
    # Run the initializer
    sess.run(init)
    
    # Batch Vars
    start_idx = 0
    idx_step = 100000

    # Batch Loop

    for i in range(idx_step, X_train_features.shape[0], idx_step):
        
        #if (i/idx_step >= 506):
        
        print("---------- Batch ", i/idx_step, " -------------")

        batch_x, batch_y = shuffle( X_train_features[start_idx:i].toarray().reshape(X_train_features[start_idx:i].shape[0], 1, X_train_features[start_idx:i].shape[1]), to_categorical(y_train_features[start_idx:i]), random_state=0)

        # Epoch Loop
        for step in range(1, n_epoch+1):

            # Reshuffle each epoch
            batch_x, batch_y = shuffle(batch_x, batch_y, random_state=0)

            # Feed Batch
            sess.run(train_op, feed_dict={X: batch_x, Y: batch_y})

            if step % 5 == 0 or step == 1:
                # Calculate batch loss and accuracy
                loss, acc = sess.run([loss_op, accuracy], feed_dict={X: batch_x,
                                                                     Y: batch_y})
                print("Epoch " + str(step) + ",  Loss= " + \
                      "{:.4f}".format(loss) + ", Training Accuracy= " + \
                      "{:.3f}".format(acc))
        
        # Move Index
        start_idx = i
        
        # Mem Fix
        del batch_x
        del batch_y

        # Test Accuracy
        print("Test Accuracy:", \
            sess.run(accuracy, feed_dict={X: test_X.toarray().reshape(test_X.shape[0], 1, test_X.shape[1]), Y: to_categorical(test_Y)}))

        saver.save(sess, './model_tf/model_tf_adam')

---------- Batch  1.0  -------------
Epoch 1,  Loss= 4.8993, Training Accuracy= 0.228
Epoch 5,  Loss= 1.7260, Training Accuracy= 0.597
Epoch 10,  Loss= 0.4982, Training Accuracy= 0.867
Test Accuracy: 0.86764
---------- Batch  2.0  -------------
Epoch 1,  Loss= 0.3548, Training Accuracy= 0.916
Epoch 5,  Loss= 0.1911, Training Accuracy= 0.959
Epoch 10,  Loss= 0.1213, Training Accuracy= 0.970
Test Accuracy: 0.96967
---------- Batch  3.0  -------------
Epoch 1,  Loss= 0.1149, Training Accuracy= 0.972
Epoch 5,  Loss= 0.0968, Training Accuracy= 0.975
Epoch 10,  Loss= 0.0829, Training Accuracy= 0.978
Test Accuracy: 0.97661
---------- Batch  4.0  -------------
Epoch 1,  Loss= 0.0848, Training Accuracy= 0.978
Epoch 5,  Loss= 0.0783, Training Accuracy= 0.979
Epoch 10,  Loss= 0.0711, Training Accuracy= 0.981
Test Accuracy: 0.98006
---------- Batch  5.0  -------------
Epoch 1,  Loss= 0.0707, Training Accuracy= 0.981
Epoch 5,  Loss= 0.0664, Training Accuracy= 0.981
Epoch 10,  Loss= 0.0622, Training

Epoch 5,  Loss= 0.0341, Training Accuracy= 0.990
Epoch 10,  Loss= 0.0320, Training Accuracy= 0.991
Test Accuracy: 0.98961
---------- Batch  41.0  -------------
Epoch 1,  Loss= 0.0375, Training Accuracy= 0.990
Epoch 5,  Loss= 0.0360, Training Accuracy= 0.990
Epoch 10,  Loss= 0.0339, Training Accuracy= 0.991
Test Accuracy: 0.98987
---------- Batch  42.0  -------------
Epoch 1,  Loss= 0.0374, Training Accuracy= 0.989
Epoch 5,  Loss= 0.0361, Training Accuracy= 0.990
Epoch 10,  Loss= 0.0341, Training Accuracy= 0.990
Test Accuracy: 0.9898
---------- Batch  43.0  -------------
Epoch 1,  Loss= 0.0368, Training Accuracy= 0.990
Epoch 5,  Loss= 0.0353, Training Accuracy= 0.990
Epoch 10,  Loss= 0.0332, Training Accuracy= 0.990
Test Accuracy: 0.98985
---------- Batch  44.0  -------------
Epoch 1,  Loss= 0.0367, Training Accuracy= 0.990
Epoch 5,  Loss= 0.0350, Training Accuracy= 0.990
Epoch 10,  Loss= 0.0330, Training Accuracy= 0.991
Test Accuracy: 0.99018
---------- Batch  45.0  -------------
Epoch

## Evaluation

All three models are evaluated on the Fellowship.ai custom dataset.

Evaluation uses the accuracy score, the standard classification accuracy metric, for all the models.
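
For reference, classification accuracy is simply the fraction of predictions that match the true label. A minimal sketch with made-up labels (the `accuracy` helper is illustrative, not the sklearn function used in the notebook):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy(["de", "fr", "fr", "it"], ["de", "fr", "it", "it"]))  # 0.75
```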

### Classification Accuracy

In [41]:
# Custom Test File, Fellowship.AI

test_fs = pd.read_csv('europarl.test', sep="\t", header=None)

test_fs[1] = [clean_sentence(i, 0) for i in test_fs[1]]

# Sklearn Evaluate
mb_acc = accuracy_score(label_encoder.transform(test_fs[0]), mnb_clf.predict(count_vectorizer.transform(test_fs[1])))

# Keras Evaluate
test_fs_X = count_vectorizer.transform(test_fs[1]).toarray()
keras_acc = run_keras_birnn(None, None, model_path='/media/abhishek/55840FA23CCD9EE9/FellowshipAI_Challenge/keras_model_2epoch.h5').evaluate(test_fs_X.reshape(test_fs_X.shape[0], 1, test_fs_X.shape[1]), to_categorical(label_encoder.transform(test_fs[0])))

print("MNB(Sklearn) Accuracy: ", mb_acc, ", Keras BI-RNN Accuracy: ", keras_acc[1])

with tf.Session() as sess:
    saver.restore(sess, "./model_tf/model_tf_adam")
    
    print("TF BI-RNN Accuracy:", \
            sess.run(accuracy, feed_dict={X: test_fs_X.reshape(test_fs_X.shape[0], 1, test_fs_X.shape[1]), Y: to_categorical(label_encoder.transform(test_fs[0]))}))

MNB(Sklearn) ACC:  0.9704724409448819 , Keras BI-RNN ACC:  0.9718167850969849
INFO:tensorflow:Restoring parameters from ./model_tf/model_tf_adam
TF BI-RNN Accuracy: 0.9703284


### F1 Score

The F1 score for each model conveys the balance between precision and recall.
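
The cells in this section use macro averaging, which computes F1 separately for every class and takes the unweighted mean, so each language counts equally regardless of how many test sentences it has. A minimal pure-Python sketch with made-up labels (`macro_f1` is a hypothetical helper, not sklearn's implementation):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["de", "fr", "fr", "it"]
y_pred = ["de", "fr", "it", "it"]
print(macro_f1(y_true, y_pred))  # per-class F1 is 1, 2/3 and 2/3, so the mean is 7/9
```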

In [47]:
# Custom Test File, Fellowship.AI

test_fs = pd.read_csv('europarl.test', sep="\t", header=None)

test_fs[1] = [clean_sentence(i, 0) for i in test_fs[1]]

# Sklearn Evaluate
mb_acc = f1_score(label_encoder.transform(test_fs[0]), mnb_clf.predict(count_vectorizer.transform(test_fs[1])), average='macro')

# Keras Evaluate
test_fs_X = count_vectorizer.transform(test_fs[1]).toarray()
keras_acc = f1_score(label_encoder.transform(test_fs[0]), run_keras_birnn(None, None, model_path='/media/abhishek/55840FA23CCD9EE9/FellowshipAI_Challenge/keras_model_2epoch.h5').predict(test_fs_X.reshape(test_fs_X.shape[0], 1, test_fs_X.shape[1])).argmax(axis=-1), average='macro')

print("MNB(Sklearn) F1 Score: ", mb_acc, ", Keras BI-RNN F1 Score: ", keras_acc)

with tf.Session() as sess:
    saver.restore(sess, "./model_tf/model_tf_adam")
    
    y_pred = np.array(sess.run(tf.argmax(prediction, 1), feed_dict={X: test_fs_X.reshape(test_fs_X.shape[0], 1, test_fs_X.shape[1]), Y: to_categorical(label_encoder.transform(test_fs[0]))}))
    
    print("TF BI-RNN F1 Score:", \
            f1_score(label_encoder.transform(test_fs[0]), y_pred, average='macro'))

MNB(Sklearn) F1 Score:  0.9707093359216391 , Keras BI-RNN F1 Score:  0.9720171125440312
INFO:tensorflow:Restoring parameters from ./model_tf/model_tf_adam
TF BI-RNN F1 Score: 0.9705131168450581
