# Language Detection using the Europarl Corpus

## Introduction
This notebook explores a machine learning approach to the task of language identification on the [Europarl corpus](http://www.statmt.org/europarl/). The first section examines the dataset and related studies. The baseline approach and its performance are then demonstrated, and the last section presents my own approach to the problem.

## Dataset Preparation
The Europarl corpus is a collection of transcripts in 21 languages from the proceedings of the European parliament. Transcripts appear to be organized by date, and stored in folders for each respective language. Each file contains the transcript text and XML tags representing additional information on the text. There is also an [existing toolkit](http://www.statmt.org/europarl/v7/tools.tgz) for transforming the text in the dataset. This corpus will be used for training a machine learning algorithm in this task.

The test data for this task is provided by [startup.ml](https://fellowship.ai/challenge/). It is a single text file of label-sentence pairs in different languages, with one sample per line. For this task, I will train an algorithm and classify the language of each sentence in the test data. It would be interesting to observe how the algorithm performs on such short test samples.
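
A minimal sketch of reading such a file is shown below. The exact file format is not described here, so the tab separator and the helper name `parse_labeled_lines` are assumptions that may need adjusting to the real data.

```python
def parse_labeled_lines(lines, sep="\t"):
    """Split 'label<sep>sentence' lines into two parallel lists.

    The tab separator is an assumption; change `sep` to match the actual file.
    """
    labels, sentences = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # partition splits on the first separator only, so the sentence
        # may itself contain further separator characters.
        label, _, sentence = line.partition(sep)
        labels.append(label.strip())
        sentences.append(sentence.strip())
    return labels, sentences


# Illustrative lines, not taken from the real test file.
sample = ["en\tThis is a short sentence.\n", "fr\tCeci est une phrase courte.\n"]
labels, sentences = parse_labeled_lines(sample)
```

With a real file, the same function could be called on an open file handle, for example `parse_labeled_lines(open(path, encoding="utf-8"))`, where `path` is a placeholder.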

### Data Cleaning
Since the training data contains XML tags, we need to remove those tags from the documents.
We can achieve this by running the following command on each document; it deletes every line that begins with an XML tag:
```
sed '/^<.*>/d' input_file > output_file
```
The shell script `de-XML-corpus.sh` runs this command over all the files in the corpus and writes the output files into another directory. The cleaned-up dataset keeps the same directory structure as the original.
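
For readers who prefer to stay in Python, the same line filter can be sketched as follows. `de-XML-corpus.sh` itself is not reproduced here, so the function name `strip_tag_lines` and the sample lines are illustrative assumptions; only the regular expression is taken from the `sed` command.

```python
import re

# Same pattern as the sed command: a line that starts with '<' and
# contains a later '>' is treated as a tag line.
TAG_LINE = re.compile(r"^<.*>")


def strip_tag_lines(lines):
    """Drop every line that begins with an XML tag and keep the rest."""
    return [line for line in lines if not TAG_LINE.match(line)]


# Illustrative input in the style of a transcript file.
raw = [
    '<CHAPTER ID=1>\n',
    'Resumption of the session\n',
    '<SPEAKER ID=1 NAME="President">\n',
    'I declare resumed the session of the European Parliament.\n',
]
cleaned = strip_tag_lines(raw)
```

As with the `sed` command, a line is dropped as a whole, so any text that follows a tag on the same line is removed too.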


### Note on potential overlaps between Train and Test Set
The source of the test file is unknown, so it is unclear whether the data in the test file also appears in the Europarl corpus used for training. This task assumes that the training data and test data do not overlap. Because the datasets are large, I did not check for overlaps in this task; this is left for future work.
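
One way such a check could be approached is sketched below. This is not something that was run for this notebook: the helper names `fingerprint` and `find_overlaps` are hypothetical, and the sketch only detects test sentences that match a whole training line exactly (after whitespace normalisation). Storing short digests instead of full lines keeps memory use lower on a large corpus, at the cost of a small chance of false positives from hash collisions.

```python
import hashlib


def fingerprint(sentence):
    """Hash a whitespace-normalised sentence to an 8-byte digest."""
    normalised = " ".join(sentence.split()).encode("utf-8")
    return hashlib.blake2b(normalised, digest_size=8).digest()


def find_overlaps(train_lines, test_sentences):
    """Return the test sentences whose fingerprint also occurs in the training lines."""
    seen = {fingerprint(line) for line in train_lines}
    return [s for s in test_sentences if fingerprint(s) in seen]


# Illustrative data only.
train = ["Resumption of the session\n", "The sitting is closed.\n"]
test = ["The  sitting is closed.", "An unseen sentence."]
overlaps = find_overlaps(train, test)
```

A test sentence that is only a fragment of a longer training line would not be caught by this exact-match approach.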


## Related Work
The best performance on language identification over this dataset appears to be reported by [Baldwin and Lui](http://www.aclweb.org/anthology/P12-3005). They achieved an accuracy of 0.993 using the toolkit [language-detection](https://code.google.com/archive/p/language-detection/), and an accuracy of 0.992 with [langid.py](https://github.com/saffsd/langid.py), which they developed. `language-detection` uses a Naive Bayes classifier trained on character $n$-grams, while `langid.py` uses a multinomial Naive Bayes classifier trained on byte $n$-grams.
Neither of the toolkits is trained purely on the Europarl corpus, and it is unclear whether the test data used is the same as the test set in this task.

## Method
For this task, I chose to train a multinomial Naive Bayes classifier on byte $n$-grams, following [Baldwin and Lui's approach](http://www.aclweb.org/anthology/P12-3005), because byte-level features are language-agnostic.
I use `scikit-learn`'s [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to convert documents into feature vectors, and its [Multinomial Naive Bayes Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) as the learning algorithm.

The feature extraction function is as follows:
```python
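def get_byte_n_grams(text, n_gram_range=(1, 1)):
    """Return all byte n-grams of `text` for every n in the inclusive range."""
    text_bytes = bytes(text, 'utf-8')
    ngrams = []
    beg, end = n_gram_range
    for n in range(beg, end + 1):
        ngrams.extend(get_n_grams(text_bytes, n))
    return ngrams


def get_n_grams(seq, n):
    """Return an iterator over each run of n consecutive items in `seq`, as tuples."""
    # Zipping seq[0:], seq[1:], ..., seq[n-1:] pairs each item with its n-1 successors.
    return zip(*[seq[i:] for i in range(n)])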
```
`get_byte_n_grams` encodes the text string as UTF-8 bytes, then collects the $n$-grams for every $n$ in `n_gram_range` (inclusive at both ends).
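
To make the output concrete, the small example below repeats the two functions so that it runs on its own and applies them to two short strings chosen purely for illustration. Each $n$-gram is a tuple of integer byte values, and a non-ASCII character such as `é` contributes two bytes.

```python
def get_n_grams(seq, n):
    return zip(*[seq[i:] for i in range(n)])


def get_byte_n_grams(text, n_gram_range=(1, 1)):
    text_bytes = bytes(text, "utf-8")
    ngrams = []
    beg, end = n_gram_range
    for n in range(beg, end + 1):
        ngrams.extend(get_n_grams(text_bytes, n))
    return ngrams


# Two ASCII characters: two 1-grams and one 2-gram.
ascii_grams = get_byte_n_grams("ab", n_gram_range=(1, 2))
print(ascii_grams)     # [(97,), (98,), (97, 98)]

# One accented character: encoded as the two bytes 0xC3 0xA9.
accented_grams = get_byte_n_grams("é", n_gram_range=(1, 2))
print(accented_grams)  # [(195,), (169,), (195, 169)]
```

Because the features are raw bytes, no knowledge of the script or alphabet is needed to extract them.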

The function `get_byte_n_grams` can be passed to `scikit-learn`'s `CountVectorizer` to initialize a byte $n$-gram vectorizer ($1\leq n\leq 4$) in the following way:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer=lambda x: get_byte_n_grams(x, n_gram_range=(1, 4)))
```
Once fitted, `vectorizer` can convert a list of strings into a sparse matrix of byte $n$-gram counts, with one row per string.
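
Putting the pieces together, a minimal end-to-end sketch might look like the following. The four training sentences and their labels are toy data made up for illustration, not samples from the Europarl corpus, and a model fitted on so little text says nothing about real accuracy; the point is only to show how the vectorizer and `MultinomialNB` connect.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def get_n_grams(seq, n):
    return zip(*[seq[i:] for i in range(n)])


def get_byte_n_grams(text, n_gram_range=(1, 1)):
    text_bytes = bytes(text, "utf-8")
    ngrams = []
    beg, end = n_gram_range
    for n in range(beg, end + 1):
        ngrams.extend(get_n_grams(text_bytes, n))
    return ngrams


# Toy training data (assumption: real training uses the cleaned corpus).
train_texts = [
    "The sitting is closed.",
    "I declare the session resumed.",
    "La séance est levée.",
    "Je déclare reprise la session.",
]
train_labels = ["en", "en", "fr", "fr"]

vectorizer = CountVectorizer(
    analyzer=lambda x: get_byte_n_grams(x, n_gram_range=(1, 4))
)
# fit_transform learns the n-gram vocabulary and returns a sparse count matrix.
X_train = vectorizer.fit_transform(train_texts)

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# New text must go through the same fitted vectorizer before prediction.
X_new = vectorizer.transform(["The debate is closed."])
predicted = classifier.predict(X_new)
print(predicted[0])
```

Byte $n$-grams that were never seen during fitting are ignored by `transform`, so the new matrix always has the same number of columns as the training matrix.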