
# Language Detection Model using fastText
> "Standing on the shoulders of giants" :-) 

**Description**:<br>
This notebook presents my "naive" solution to the Language Detection Model task of the [Startup.ML challenge](http://startup.ml/challenge).
The 21 target languages are those of the [European Parliament Proceedings Parallel Corpus](http://www.statmt.org/europarl/).


Since I am not really building a model from scratch (I mainly prepare the data and then feed it to the classifier of the fastText library),
I will first explain a bit about the idea behind fastText, a recent and awesome library for text classification and for learning word representations that extends word2vec's functionality.

**Why fastText? Why not, for example, scikit-learn or similar?**

**Short answer**:<br>
Because fastText is an embedding-based approach. Embedding methods for text representation often work better than "traditional" count-based approaches such as TF-IDF and bag of words.

**Not-so-short answer**:<br>
Since 2013 (the year of the word2vec paper), most language modeling systems have shifted to neural language models (NLMs), i.e. neural word embeddings, also called distributed representations of words.
NLMs were designed [2] to overcome the curse of dimensionality.<br>
In other words, they rely on dense, fixed-size vector representations of text features rather than the "conventional" sparse representation, where each feature (e.g. a word or a POS tag) is represented by its own vector (the so-called one-hot representation).<br>
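
To make the contrast concrete, here is a minimal sketch of the two representations. The five-word vocabulary and the dense values are made up for illustration; real embeddings are learned from data.

```python
# Toy vocabulary (made up for illustration).
vocab = ["the", "parliament", "session", "is", "resumed"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: a vector as long as the vocabulary with a single 1."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

# Dense representation: a short fixed-size vector per word (values are invented).
dense = {"session": [0.12, -0.40, 0.33]}

print(one_hot("session"))     # length grows with the vocabulary
print(dense["session"])       # length stays fixed (3 here)
```

With a vocabulary of 1.8 million words, each one-hot vector would have 1.8 million entries, while the dense vector keeps whatever small dimension was chosen.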

For example, we can use a large corpus to learn the embeddings (word vectors) of a large vocabulary.
Then, one way to represent a sentence is to take the average of its word vectors.
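
A minimal sketch of that averaging, assuming a tiny hand-written embedding table (the words and numbers are invented, and unknown words are simply skipped):

```python
# Toy 3-dimensional embeddings; the values are made up for illustration.
embeddings = {
    "this": [0.1, 0.3, -0.2],
    "is":   [0.0, 0.5, 0.1],
    "cool": [0.8, -0.1, 0.4],
}

def sentence_vector(sentence, embeddings):
    """Average the vectors of the known words in a sentence."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    # Component-wise mean over all word vectors.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

vec = sentence_vector("This is cool", embeddings)
print(vec)
```

The result has the same dimension as a single word vector, no matter how long the sentence is.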


There are several ways to learn word embeddings; [word2vec](https://code.google.com/archive/p/word2vec/) is a popular example.<br>
fastText is another one, and it even extends word2vec's functionality to include supervised sentence classification,
where the classifier is trained while the word representations are being learned.
<br>
That means the sentence averaging that I mentioned earlier can happen right in the learning process.
Here is an excerpt from the fastText paper [1] that describes this better:
    
    The first weight matrix can be seen as a look-up table over the words of a sentence. The word representations are averaged into a text representation, which is in turn fed to a linear classifier.
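
The excerpt can be read as a very small forward pass. The sketch below is my own simplified illustration with made-up weights, not fastText's actual code: it leaves out training, n-gram features and the hierarchical softmax option.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(word_ids, lookup, classifier):
    """lookup: one embedding row per word; classifier: one weight row per label."""
    dim = len(lookup[0])
    # 1) Look up the word vectors and average them into a text representation.
    text_vec = [sum(lookup[i][d] for i in word_ids) / len(word_ids) for d in range(dim)]
    # 2) Feed the text representation to a linear classifier.
    scores = [sum(w * x for w, x in zip(row, text_vec)) for row in classifier]
    return softmax(scores)

# Made-up weights: 3 words with 2-dimensional vectors, 2 labels.
lookup = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
classifier = [[2.0, -1.0], [-1.0, 2.0]]

probs = predict_proba([0, 2], lookup, classifier)
print(probs)
```

During training both `lookup` and `classifier` would be updated together, which is why the word vectors and the classifier are learned in the same process.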


**So what does that have to do with our problem?**

Basically, language detection can be treated as a multi-class (21 classes in our case) supervised classification problem.
All it takes is a set of sample sentences, each paired with a label, to optimize the parameters. And that's exactly what fastText can do.


<hr>
References:

[1] Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016, July 6). Bag of Tricks for Efficient Text Classification. arXiv.org.<br>
[2] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research.

<hr>

## Workflow in this notebook

The notebook is divided into 5 parts as follows:

1. Build the dataset from the europarl corpus
2. Preprocess (prepare) the data
3. Train with fastText
4. Use the learned model to predict the europarl test set from Startup.ML
5. Show an example of using the learned model for further predictions


The code is a mixture of bash and Python.

> I've already included the built dataset from step 1 with this repo, so one may uncompress it and start from step 2.


> It was tested on macOS (I assume Linux works too, as long as it has clang or gcc to [compile fastText](https://github.com/facebookresearch/fastText#requirements))



## Language labels

| language   | label |
|------------|-------|
| Bulgarian  | bg    |
| Czech      | cs    |
| Danish     | da    |
| German     | de    |
| Greek      | el    |
| English    | en    |
| Spanish    | es    |
| Estonian   | et    |
| Finnish    | fi    |
| French     | fr    |
| Hungarian  | hu    |
| Italian    | it    |
| Lithuanian | lt    |
| Latvian    | lv    |
| Dutch      | nl    |
| Polish     | pl    |
| Portuguese | pt    |
| Romanian   | ro    |
| Slovak     | sk    |
| Slovene    | sl    |
| Swedish    | sv    |

<hr>

# 1. Download and build the `europarl.csv` dataset

Download and uncompress the (1.5GB) europarl.tgz data

In [2]:
%%bash

mkdir -p downloaded
target=downloaded

if [ ! -f "${target}"/europarl.tgz ]; then
    wget http://www.statmt.org/europarl/v7/europarl.tgz -O "${target}"/europarl.tgz
fi

FOLDER=txt
if [ ! -d "${FOLDER}" ]; then
    tar xzf "${target}"/europarl.tgz
fi


Merge the individual text files into one large file (corpus) per language. To free space, the individual files are removed after merging.

In [2]:
%%bash

# Loop over the per-language folders (txt/bg/, txt/cs/, ...)
for dir in txt/*/; do
    lang=$(basename "$dir")
    find "$dir" -name "*.txt" -print0 | xargs -0 cat > "txt/$lang.txt"
    rm -rf "$dir"
done

Next, clean the text of each corpus by removing the markup tags (e.g. `<SPEAKER ID=...>`, `<P>`, etc.) and lowercasing the text.
Then generate a csv file from each language corpus.

In [4]:
import re
import pandas as pd
import os

In [5]:
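def corpus2df(name, lang=None):
    """Read a cleaned corpus (one sample per line) into a DataFrame with lang and text columns."""
    # on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
    df = pd.read_table(name, on_bad_lines='skip', header=None, names=['text'])
    df['lang'] = lang
    df = df[['lang', 'text']]
    return df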

In [7]:
def clean(f):
    """Strip markup tags, lowercase the text, write it to a new file and return the new file name."""
    with open(f) as infile:
        in_text = infile.read()
    cleaned = re.sub(r'<.*?>', '', in_text).lower().strip()
    outfile = f.replace('.txt', '-cleaned.txt')
    with open(outfile, 'w') as out:
        out.write(cleaned)
    os.remove(f)  # delete the raw corpus to free space
    return outfile
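
To see what the tag-stripping step does, here is a small standalone sketch that applies the same regular expression to an in-memory string. The sample text is invented in the style of the europarl files, and `strip_tags` is just a hypothetical helper name.

```python
import re

def strip_tags(text):
    """Remove <...> markup, then lowercase and trim, like clean() does on a file."""
    return re.sub(r'<.*?>', '', text).lower().strip()

# Made-up sample in the style of a europarl file.
sample = (
    '<SPEAKER ID=1 NAME="President">\n'
    'Resumption of the session\n'
    '<P>\n'
    'I declare resumed the session.'
)

cleaned = strip_tags(sample)
print(cleaned)
```

Note that a line that contained only a tag becomes an empty line, and that `.` does not match newlines, so a tag spanning several lines would not be removed by this pattern.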

In [6]:
corpora = [c for c in os.listdir('txt/') if c.endswith('.txt')]

In [8]:
%%time
for corpus in corpora:
    lang = corpus.replace('.txt', '')
    corpus = 'txt/' + corpus
    print('{} .. cleaning and converting to a csv .. '.format(corpus), end='\t')
    f = clean(corpus)
    df = corpus2df(f, lang)
    df.to_csv(corpus.replace('.txt', '.csv'), index=False, header=False)
    print('finished.')


txt/bg.txt .. cleaning and converting to a csv .. 	finished.
txt/cs.txt .. cleaning and converting to a csv .. 	finished.
txt/da.txt .. cleaning and converting to a csv .. 	finished.
txt/de.txt .. cleaning and converting to a csv .. 	finished.
txt/el.txt .. cleaning and converting to a csv .. 	finished.
txt/en.txt .. cleaning and converting to a csv .. 	finished.
txt/es.txt .. cleaning and converting to a csv .. 	finished.
txt/et.txt .. cleaning and converting to a csv .. 	finished.
txt/fi.txt .. cleaning and converting to a csv .. 	finished.
txt/fr.txt .. cleaning and converting to a csv .. 	finished.
txt/hu.txt .. cleaning and converting to a csv .. 	finished.
txt/it.txt .. cleaning and converting to a csv .. 	finished.
txt/lt.txt .. cleaning and converting to a csv .. 	finished.
txt/lv.txt .. cleaning and converting to a csv .. 	finished.
txt/nl.txt .. cleaning and converting to a csv .. 	finished.
txt/pl.txt .. cleaning and converting to a csv .. 	finished.
txt/pt.txt .. cleaning a

Let's move the csv files to a separate folder and delete the text folder to free space

In [9]:
%%bash
mkdir -p data/csv/
mv txt/*.csv data/csv/
rm -rf txt

Let's see the number of samples for each language

In [10]:
%%bash
for i in $(ls data/csv); do
    wc -l data/csv/$i
done

  143787 data/csv/bg.csv
  237938 data/csv/cs.csv
  737946 data/csv/da.csv
  719202 data/csv/de.csv
  606104 data/csv/el.csv
  787720 data/csv/en.csv
  748640 data/csv/es.csv
  240591 data/csv/et.csv
  708033 data/csv/fi.csv
  753397 data/csv/fr.csv
  233216 data/csv/hu.csv
  755340 data/csv/it.csv
  235407 data/csv/lt.csv
  237013 data/csv/lv.csv
  744521 data/csv/nl.csv
  236097 data/csv/pl.csv
  745199 data/csv/pt.csv
  142853 data/csv/ro.csv
  235604 data/csv/sk.csv
  229135 data/csv/sl.csv
  720552 data/csv/sv.csv


We will use these csv files to create a single (multi-class) dataset that we can use for supervised training.


**NOTE**: since there is a large number of text samples,
I will not include all samples from each language (that would be too large to handle).
So, let's just take the first 40,000 samples from each language, which sounds fair :) and keeps the dataset balanced.

In [11]:
%%bash

for i in $(ls data/csv/); do
    head -n 40000 "data/csv/$i" >> data/europarl.csv
done
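
Note that `head` takes the first 40,000 lines of each file rather than a random sample. If a random sample is preferred, one option (not what this notebook does) is reservoir sampling, which picks `k` lines uniformly from a file in a single pass without loading it into memory. A minimal sketch, where the function name and the seed are my own choices:

```python
import random

def reservoir_sample(lines, k, seed=0):
    """Return k items chosen uniformly at random from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for n, line in enumerate(lines):
        if n < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep the new item with probability k / (n + 1).
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample

# Demo on in-memory lines; an open file object works the same way.
lines = ['line %d\n' % i for i in range(1000)]
picked = reservoir_sample(lines, 5)
print(picked)
```

For a real corpus one would pass the open csv file instead of the list, e.g. `reservoir_sample(open('data/csv/bg.csv'), 40000)`.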


Now our dataset should contain 840,000 samples, i.e. lines (40,000 x 21 languages).

Let's verify that:

In [12]:
!wc -l data/europarl.csv

  840000 data/europarl.csv


Also delete the other csv files to free space

In [13]:
!rm -rf data/csv

Compress the dataset to easily upload it to GitHub

In [14]:
!tar cvzf europarl-dataset.tar.gz data/europarl.csv

a data/europarl.csv


# 2. Preprocessing

Before training a classifier, we will do three steps to prepare the dataset:

1. Shuffle the rows, since the samples are stacked by language, which is not ideal for supervised training.
2. Format the samples to conform to the labeled format that [fastText](https://github.com/facebookresearch/fastText#text-classification) expects.
3. Split the dataset into train and test sets.



In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('data/europarl.csv', names=['lang', 'text'])

**1) Shuffle**

In [3]:
df = df.reindex(np.random.permutation(df.index)).reset_index(drop=True)

In [4]:
df[:3]

|   | lang | text |
|---|------|------|
| 0 | sk   | avšak vzhľadom na to, že komisia pripravuje na... |
| 1 | lt   | ar komisija galėtų pakomentuoti dabartinę proc... |
| 2 | lt   | sesijos pertrauka |


**2) Prepare:** put each label and its sentence on one line, to be saved to a text file

The normalization and label formatting are based on the [fastText classification](https://github.com/facebookresearch/fastText/blob/master/classification-example.sh) example.

In [5]:
def normalize_text(row):
    """Return '__label__<lang> , <text>' with runs of whitespace collapsed to single spaces."""
    label = '__label__' + str(row['lang'])
    txt = str(row['text'])

    return ' '.join((label + ' , ' + txt).split())
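
To show what a formatted line looks like, here is a standalone sketch of the same formatting on plain strings instead of a DataFrame row. `to_fasttext_line` is a hypothetical helper name used only for this illustration.

```python
def to_fasttext_line(lang, text):
    """Build one training line: the label, a comma token, then the text."""
    # split() + join collapses runs of whitespace (including newlines) to single spaces.
    return ' '.join(('__label__' + str(lang) + ' , ' + str(text)).split())

line = to_fasttext_line('lt', 'sesijos   pertrauka\n')
print(line)  # __label__lt , sesijos pertrauka
```

The `__label__` prefix is what lets fastText tell the label apart from the ordinary words on the line.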

Add a new column as a normalized string of a label and a sentence

In [6]:
df['normalized'] = df.apply(normalize_text, axis=1)

**3) Split data**<br>
- **75% train** (0.75 x 840k = 630k) and **25% test** (0.25 x 840k = 210k)<br>
[scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html) can do the split too, but I felt too lazy to import the library.

In [7]:
SPLIT = 630000
train = df['normalized'][:SPLIT].copy()
test = df['normalized'][SPLIT:].copy()
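
The same shuffle-then-slice idea can be sketched without pandas, which also makes it easy to fix the random seed and to check that both parts keep a reasonable label balance. This is only an illustration on made-up lines; the function names and the seed value are my own assumptions.

```python
import random
from collections import Counter

def shuffle_and_split(lines, train_fraction=0.75, seed=42):
    """Shuffle a copy of the lines with a fixed seed, then cut it into train/test."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def label_counts(lines):
    """Count how many lines start with each __label__ token."""
    return Counter(line.split(' ', 1)[0] for line in lines)

# Made-up formatted lines for two languages.
lines = ['__label__en , text %d' % i for i in range(8)] + \
        ['__label__fr , texte %d' % i for i in range(8)]

train_lines, test_lines = shuffle_and_split(lines)
print(len(train_lines), len(test_lines))
print(label_counts(train_lines), label_counts(test_lines))
```

With 840,000 lines and `train_fraction=0.75` the cut falls at 630,000, the same value as `SPLIT` above.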

Finally, let's save the train and test files

In [8]:
np.savetxt('data/europarl.train', train.values, fmt="%s")
np.savetxt('data/europarl.eval', test.values, fmt="%s")

# 3. Training classifier

> NOTE: The main repo of [fastText](https://github.com/facebookresearch/fastText) is written primarily in `C++`.<br>
There is a very recent [Python interface](https://github.com/salestock/fastText.py) which still seems unstable.
I tried it, but for some reason the generated model learns only around 204k vocabulary entries, compared to 1.8 million with the `C++` implementation on the same dataset with the same learning parameters.
So I will not use the Python interface for training in this demo and will stick to the official library for now.

First, let's download and build it:

In [10]:
!git clone https://github.com/facebookresearch/fastText.git

Cloning into 'fastText'...
remote: Counting objects: 465, done.
remote: Total 465 (delta 0), reused 0 (delta 0), pack-reused 465
Receiving objects: 100% (465/465), 93.20 KiB | 0 bytes/s, done.
Resolving deltas: 100% (318/318), done.
Checking connectivity... done.


In [11]:
%%bash
cd fastText/
make

c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/matrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/vector.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/model.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/utils.cc
c++ -pthread -std=c++0x -O3 -funroll-loops args.o dictionary.o matrix.o vector.o model.o utils.o src/fasttext.cc -o fasttext


Next, train a classification model on the prepared training corpus:

In [12]:
%%bash
mkdir -p model

TRAIN=data/europarl.train
RESULT=model/europarl

./fastText/fasttext supervised -input $TRAIN -output $RESULT

Read 1M wordsRead 2M wordsRead 3M wordsRead 4M wordsRead 5M wordsRead 6M wordsRead 7M wordsRead 8M wordsRead 9M wordsRead 10M wordsRead 11M wordsRead 12M wordsRead 13M wordsRead 14M wordsRead 15M wordsRead 16M wordsRead 17M wordsRead 18M wordsRead 19M wordsRead 20M wordsRead 21M wordsRead 22M wordsRead 23M wordsRead 24M wordsRead 25M wordsRead 26M wordsRead 27M wordsRead 28M wordsRead 29M wordsRead 30M wordsRead 31M wordsRead 32M wordsRead 33M wordsRead 34M wordsRead 35M wordsRead 36M wordsRead 37M wordsRead 38M wordsRead 39M wordsRead 40M wordsRead 40M words
Progress: 0.0%  words/sec/thread: 3744  lr: 0.050000  loss: 3.060270  eta: 1h15m Progress: 0.0%  words/sec/thread: 13617  lr: 0.050000  loss: 3.060270  eta: 0h20m Progress: 0.0%  words/sec/thread: 24227  lr: 0.050000  loss: 3.060270  eta: 0h11m Progress: 0.0%  words/sec/thread: 28760  lr: 0.049999  loss: 3.060270  eta: 0h9m Progress: 0.0%  words/sec/thread: 38529  lr: 0.049999  loss: 3

real fast :)

That will generate two files:
- `europarl.bin`: the learned model, which contains the optimized parameters for predicting the language label of a given text.
- `europarl.vec`: a text file that contains the learned vocabulary (around 1.8 million entries) and their embeddings.
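
The `.vec` file is plain text: a header line with the vocabulary size and the vector dimension, followed by one line per word with the word and its vector values. A minimal sketch of a reader, demonstrated on made-up lines rather than the real file (`read_vec` is a hypothetical helper name):

```python
def read_vec(lines):
    """Parse .vec-style lines into (vocab_size, dim, {word: vector})."""
    it = iter(lines)
    vocab_size, dim = (int(x) for x in next(it).split())
    vectors = {}
    for line in it:
        parts = line.split()  # also drops a trailing space or newline
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors

# Made-up content in the same layout.
sample = [
    "2 3\n",
    "the 0.1 0.2 0.3 \n",
    "de -0.5 0.0 0.25 \n",
]

vocab_size, dim, vectors = read_vec(sample)
print(vocab_size, dim, vectors["de"])
```

For the real file one would pass an open file object, e.g. `read_vec(open('model/europarl.vec'))`, keeping in mind that 1.8 million vectors take a fair amount of memory.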

Next, test the accuracy of the trained model 

In [13]:
%%bash
MODEL=model/europarl.bin
TEST=data/europarl.eval

./fastText/fasttext test $MODEL $TEST

P@1: 0.989
Number of examples: 202305


A P@1 of 0.989 (about 98.9% accuracy) is actually pretty amazing, given that it uses the default hyper-parameters!

#### Save predictions on the evaluation dataset

In [14]:
%%bash
mkdir -p prediction

MODEL=model/europarl.bin
TEST=data/europarl.eval
OUTPUT=prediction/europarl.eval.predict

./fastText/fasttext predict $MODEL $TEST > $OUTPUT
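
With the predictions saved, the overall score can be broken down per language. The sketch below is my own helper, not part of fastText. It assumes Python 3 and that the prediction file has exactly one label per evaluation line, in the same order. It is shown on a few made-up lines.

```python
from collections import Counter

def per_label_accuracy(gold_lines, predicted_lines):
    """Compare the first token of each gold line with the predicted label."""
    total, correct = Counter(), Counter()
    for gold_line, pred_line in zip(gold_lines, predicted_lines):
        tokens = gold_line.split()
        if not tokens:
            continue  # skip empty evaluation lines
        gold = tokens[0]
        total[gold] += 1
        if pred_line.strip() == gold:
            correct[gold] += 1
    return {label: correct[label] / total[label] for label in total}

# Made-up evaluation lines and predictions.
gold = ['__label__en , hello', '__label__en , hi there', '__label__fr , bonjour']
pred = ['__label__en\n', '__label__fr\n', '__label__fr\n']

accuracy = per_label_accuracy(gold, pred)
print(accuracy)
```

For the real files, the two arguments would be the open `data/europarl.eval` and `prediction/europarl.eval.predict` files. A low score for a single language would point to confusions with a closely related one.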

# 4. Predict the europarl test set from [Startup.ML](http://startup.ml/challenge) challenge

Download the data

In [15]:
%%bash

mkdir -p downloaded

downloaded=downloaded/europarl-test.zip
TESTSET=data/europarl.test

if [ ! -f "${TESTSET}" ]; then
    wget -O "${downloaded}" "https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/language-detection/europarl-test.zip"
    unzip "${downloaded}" -d data
fi

Use the trained model to predict the labels of the test set and output result to a new file "`prediction/europarl.test.predict`".

In [16]:
%%bash

MODEL=model/europarl.bin
TEST=data/europarl.test
RESULT=prediction/europarl.test.predict

./fastText/fasttext predict $MODEL $TEST > $RESULT

Each line in the output file is a label that corresponds to the line at the same position in the `europarl.test` file.
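
If plain language codes are more convenient than `__label__xx` strings, the prefix can be stripped in a couple of lines. A minimal sketch on made-up prediction lines (`label_to_code` is a hypothetical helper name):

```python
from collections import Counter

PREFIX = '__label__'

def label_to_code(line):
    """Turn '__label__bg\\n' into 'bg'; leave anything without the prefix unchanged."""
    label = line.strip()
    return label[len(PREFIX):] if label.startswith(PREFIX) else label

# Made-up prediction lines; the real ones come from the prediction file.
predicted = ['__label__bg\n', '__label__bg\n', '__label__sv\n']

codes = [label_to_code(p) for p in predicted]
print(codes)
print(Counter(codes).most_common())
```

Counting the codes gives a quick sanity check on how the predictions are distributed over the 21 languages.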

**DONE!**
<hr>
Sample of the predicted results

In [17]:
%%bash

RESULT=prediction/europarl.test.predict

echo 'top: '    && head $RESULT
echo 'bottom: ' && tail $RESULT

top: 
__label__bg
__label__bg
__label__bg
__label__bg
__label__bg
__label__bg
__label__bg
__label__bg
__label__bg
__label__bg
bottom: 
__label__sv
__label__sv
__label__sv
__label__sv
__label__sv
__label__sv
__label__sv
__label__sv
__label__sv
__label__sv


<hr>

# 5. Use the model for further predictions

We can use the learned model to predict the language of any text written in one of the 21 languages.

**Demo on other sentences from**:

    English
    Spanish
    Portuguese
    Bulgarian
    Danish

In [19]:
%%writefile prediction/sample-sentences.txt
This is very cool.
esto es genial.
isso é legal.
това е готино.
det her er sejt.

Overwriting prediction/sample-sentences.txt


In [20]:
%%bash

MODEL=model/europarl.bin
TEST=prediction/sample-sentences.txt

./fastText/fasttext predict $MODEL $TEST

__label__en
__label__es
__label__pt
__label__bg
__label__da


<hr>

## Alternatively: use the Python interface to load and use the learned model

In [2]:
import fasttext as ft

In [3]:
model = ft.load_model('model/europarl.bin')

In [4]:
text = [
    "This is very cool.",  # English
    "esto es genial.",     # Spanish
    "isso é legal.",       # Portuguese
    "това е готино.",      # Bulgarian
    "det her er sejt."     # Danish
]

In [8]:
model.predict(text)

[u'__label__en',
 u'__label__es',
 u'__label__pt',
 u'__label__bg',
 u'__label__da']

**Or read the text from a file:**

In [9]:
with open('data/europarl.test') as f:
    text = f.readlines()

In [10]:
predictions = model.predict(text)

In [13]:
predictions[:10]

[u'__label__bg',
 u'__label__bg',
 u'__label__bg',
 u'__label__bg',
 u'__label__bg',
 u'__label__bg',
 u'__label__bg',
 u'__label__bg',
 u'__label__bg',
 u'__label__bg']

In [14]:
predictions[-20:]

[u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv',
 u'__label__sv']