# Cross-lingual processing and transfer learning using multilingual embeddings

In this notebook, we will work on a multilingual dataset containing sentences in six languages: English, Dutch, Spanish, Russian, Arabic and Turkish. Every sentence in every language comes with a sentiment label indicating *positive* or *negative* content. There is no sentence overlap between languages.

Working with the LASER multilingual representation, we directly provide the sentence embeddings for all languages. Every sentence is represented by a 1024-dimensional vector giving its position in the LASER embedding space.

# Loading data from GitHub

In [None]:
# Let's download the dataset (if not done already) and define the path
import os
workdir = './CrossLingual-NLP-AMLD2020/'
# Clone only when the repository is not already present locally
if not os.path.isdir(workdir):
    !git clone https://github.com/ioannispartalas/CrossLingual-NLP-AMLD2020.git
os.environ["WORKDIR"] = workdir
# Please check that the listing below shows the .npy files; otherwise correct workdir
!ls $WORKDIR/data/laser

The dataset is made of numpy files:
```
'en_laser_train.npy'
'en_laser_test.npy'
'nl_laser_test.npy'
...
```
containing the training and test sets, respectively, for every language.

The corresponding labels are stored in
```
en_train_labels_adan.txt
en_test_labels_adan.txt
...
```
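As a quick sanity check, these files can be read directly with numpy. The helper below is only a sketch: `load_language` and `data_dir` are hypothetical names, and it assumes that the features and labels sit in the same directory and that the labels file holds one label per line, neither of which is stated here.

```python
import os
import numpy as np


def load_language(lang, split, data_dir):
    """Load the LASER embeddings and labels of one language and one split.

    Hypothetical helper: assumes both files live in `data_dir` and that the
    labels file contains one label per line (kept as strings).
    """
    x = np.load(os.path.join(data_dir, f"{lang}_laser_{split}.npy"))
    with open(os.path.join(data_dir, f"{lang}_{split}_labels_adan.txt")) as f:
        y = np.array([line.strip() for line in f if line.strip()])
    # We expect exactly one label per embedded sentence.
    assert x.shape[0] == y.shape[0], "features and labels are misaligned"
    return x, y
```

Calling `load_language("en", "train", data_dir)` should then return a feature matrix whose second dimension is 1024, one row per sentence.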


# Importing functions

In [None]:
import sys
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
sys.path.insert(1, workdir)

from src.utils import load_training_languages, model_evaluation, get_statistics


The following 3 utility functions will be used in this notebook:

- ```model_evaluation(model, [languages])```: evaluates the ```model``` on the list of ```languages```. Returns the [F1](https://en.wikipedia.org/wiki/F1_score) score, which is better suited to imbalanced datasets than accuracy, and the [Confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix), to analyse the model outputs in detail.
- ```x_train, y_train = load_training_languages([languages])```: returns the concatenated features and labels for the languages specified in ```languages```.
- ```get_statistics([languages])```: prints out the class population for the languages specified in ```languages```.
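To see why F1 is more informative than accuracy on imbalanced data, here is a small worked example in plain Python. It is a sketch on toy labels, not the code of `model_evaluation`; the helper names are hypothetical, and we assume a binary problem where class `1` is the one scored.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four cells of a binary confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy imbalanced example: 8 sentences of class 0, only 2 of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]  # the model misses one of the two class-1 sentences

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
f1 = f1_from_counts(tp, fp, fn)
print(f"accuracy={accuracy:.2f}  f1={f1:.2f}")  # accuracy=0.90  f1=0.67
```

Accuracy looks comfortable because the majority class dominates, while F1 exposes that half of the minority class was missed.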

# Dataset statistics

The multilingual dataset consists of 6 different languages: English (```en```), Spanish (`es`), Dutch (`nl`), Russian (`ru`), Arabic (`ar`) and Turkish (`tr`).

In [None]:
all_languages = ['en','es','nl','ru','ar','tr']

In [None]:
get_statistics(all_languages)

# Few Shot Learning
While learning a classification model for a given language generally requires abundant training material, some languages are frequently underrepresented, which leads to poor prediction performance.

In that situation, a common language representation such as LASER makes it possible to enlarge the training data by adding (possibly larger) datasets from other languages to the initial (small) set.

As shown in the figure below, populating the training space increases the chances of accurately determining the decision function.

![Few Shot Learning](https://upload.wikimedia.org/wikipedia/commons/d/d0/Example_of_unlabeled_data_in_semisupervised_learning.png)
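Because all languages share the same embedding space, pooling training data amounts to stacking rows. The sketch below illustrates this with random placeholder arrays (not the real dataset); the variable names and set sizes are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders for per-language LASER features: a small target set and two
# larger source sets, all in the same 1024-dimensional space.
x_tr, y_tr = rng.normal(size=(20, 1024)), rng.integers(0, 2, size=20)
x_en, y_en = rng.normal(size=(200, 1024)), rng.integers(0, 2, size=200)
x_es, y_es = rng.normal(size=(150, 1024)), rng.integers(0, 2, size=150)

# Pooling is a plain row-wise concatenation of features and labels.
x_pool = np.concatenate([x_tr, x_en, x_es], axis=0)
y_pool = np.concatenate([y_tr, y_en, y_es], axis=0)

print(x_pool.shape, y_pool.shape)  # (370, 1024) (370,)
```

This would not be possible with language-specific features (for example one vocabulary per language), where the columns of each matrix would mean different things.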

In the following, we are going to experiment with Few Shot Learning by training and testing a classifier on different combinations of languages.

Let's train a [Logistic Regression](https://fr.wikipedia.org/wiki/R%C3%A9gression_logistique) (a linear classifier) on Russian, and look at the model's F1 score.

In [None]:
x_train,y_train = load_training_languages(['ru'])
lr = LogisticRegression(C = 10,max_iter = 200,random_state = 1).fit(x_train,y_train)
_ = model_evaluation(lr, ['ru'])

The overall performance is not fantastic. Could we do better? Let's add more languages to the training data.

In [None]:
x_train,y_train = load_training_languages(all_languages)
lr = LogisticRegression(C = 10,max_iter = 200,random_state = 1).fit(x_train,y_train)
_ = model_evaluation(lr, ['ru'])


The F1 score has improved by 0.1! Quite impressive.

Let's repeat the same operation with Turkish.

In [None]:
x_train,y_train = load_training_languages(['tr'])
lr = LogisticRegression(C = 10,random_state = 1).fit(x_train,y_train)
_ = model_evaluation(lr, ['tr'])

The F1 score is quite low here. A small dataset, data quality or language complexity may explain the poor performance.

Fair enough, let's use all available languages to improve our model.

In [None]:
x_train,y_train = load_training_languages(all_languages)
lr = LogisticRegression(C = 10,max_iter = 200,random_state = 1).fit(x_train,y_train)
_ = model_evaluation(lr, ['tr'])

No improvement... Maybe another combination of languages leads to different results. What happens if we remove Spanish and Russian from the training set?

In [None]:
x_train,y_train = load_training_languages(['ar','tr','nl','en'])
lr = LogisticRegression(C = 10,max_iter = 200,random_state = 1).fit(x_train,y_train)
_ = model_evaluation(lr, ['tr'])

Better! Apparently Spanish and Russian were hurting the model's performance on Turkish.

Could we imagine a more systematic source language selection to optimize performance on a specific target language? (Beware that the test set of the target language cannot be used to perform this selection)
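One possible answer is to cross-validate on the target language's *training* set: for each candidate subset of source languages, train on the sources plus some target folds and score on the held-out target fold. The sketch below tries every subset on synthetic data; `select_sources` and `toy_language` are hypothetical helpers, and the toy data only mimics a helpful and a harmful source.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def select_sources(target, sources, n_splits=3, seed=1):
    """Return the subset of source languages with the best cross-validated F1.

    `target` is an (x, y) pair; `sources` maps a language code to an (x, y)
    pair. Only the target training data is split, so its test set is untouched.
    """
    x_t, y_t = target
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    best_subset, best_score = (), -1.0
    for k in range(len(sources) + 1):
        for subset in combinations(sorted(sources), k):
            scores = []
            for fit_idx, val_idx in folds.split(x_t, y_t):
                # Train on the chosen sources plus the target's training folds.
                xs = [x_t[fit_idx]] + [sources[s][0] for s in subset]
                ys = [y_t[fit_idx]] + [sources[s][1] for s in subset]
                clf = LogisticRegression(C=10, max_iter=200, random_state=1)
                clf.fit(np.concatenate(xs), np.concatenate(ys))
                # Validate on the held-out target fold only.
                scores.append(f1_score(y_t[val_idx], clf.predict(x_t[val_idx])))
            mean_score = float(np.mean(scores))
            if mean_score > best_score:
                best_subset, best_score = subset, mean_score
    return best_subset, best_score


rng = np.random.default_rng(0)


def toy_language(n, direction, dim=8):
    """Synthetic stand-in: the label is the sign of x @ direction."""
    x = rng.normal(size=(n, dim))
    return x, (x @ direction > 0).astype(int)


w = rng.normal(size=8)
target = toy_language(30, w)
sources = {
    "helpful": toy_language(300, w),   # same decision rule as the target
    "harmful": toy_language(300, -w),  # opposite decision rule
}
subset, score = select_sources(target, sources)
print(subset, round(score, 3))
```

With five candidate source languages the exhaustive search is only 32 subsets; for many more languages, a greedy forward selection would be a cheaper alternative.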

# Non linear model
Until now we have used Logistic Regression. However, more complex models can also be used, such as a [multi layer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (MLP):

In [None]:
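from sklearn.neural_network import MLPClassifier

# Note: if the cells are run in order, x_train and y_train still hold the
# ['ar','tr','nl','en'] training set loaded above.
mlp = MLPClassifier(solver='lbfgs',
                    hidden_layer_sizes=(16,),  # one hidden layer of 16 units
                    activation='relu',
                    alpha=1e-3,
                    max_iter=50,
                    # early_stopping and validation_fraction only take effect
                    # with solver='sgd' or 'adam'; 'lbfgs' ignores them
                    early_stopping=True,
                    validation_fraction=0.2,
                    random_state=1)

_ = model_evaluation(mlp.fit(x_train, y_train), ['ru'])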

[Extreme gradient boosting](https://en.wikipedia.org/wiki/XGBoost) (xgboost) is another possible option.

In [None]:
import xgboost as xgb
boost = xgb.XGBClassifier(objective="binary:logistic", max_depth=5, random_state=42)
_ = model_evaluation(boost.fit(x_train,y_train),['ru'])

What can we conclude from the above results?