# FastText
<h4 style="font-size:14px; font-family:Calibry" align="left"> Andrii Kruchko </h4>

<a id='0'> </a>

<hr style="height: 1px; background-color: #808080">
## Table of Contents

<ol>
    <li style="font-size:20px; font-family:Verdana">[FastText overview](#1)</li>
    <li style="font-size:20px; font-family:Verdana">[Data preprocessing for fastText](#2)</li>
    <li style="font-size:20px; font-family:Verdana">[Parameters overview and the model training](#3)</li>
</ol>

<a id='1'></a>

<hr style="height: 1px; background-color: #808080">
## 1. FastText overview
### What is it?

FastText is a linear model with a rank constraint and a fast loss approximation.<br>
It achieves accuracy comparable to deep learning classifiers.<br>

But it is much faster:
- FastText can train on more than 200M words in less than five minutes using a standard multicore CPU
- It can classify nearly 150K reviews in less than a minute

### Architecture

<img src="https://raw.githubusercontent.com/akruchko/test/master/1_model_architecture_of_fastText.PNG">
The model architecture of fastText for a sentence with $N$ n-gram features $x_1, \ldots, x_N$.<br> The features are embedded and averaged to form the hidden variable.$^1$

<hr style="height: 1px; width: 100px; background-color: #808080"; align="left"> <br>
$^1$ https://arxiv.org/pdf/1607.01759.pdf
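
The caption above can be made concrete with a minimal pure-Python sketch of the forward pass: look up an embedding for each feature, average them into the hidden variable, apply a linear output layer, and normalize with softmax. The toy matrices, their sizes, and the function names are assumptions chosen for illustration; they are not fastText's internal implementation.

```python
import math

def average_embeddings(feature_ids, embeddings):
    """Average the embedding rows of a document's features (the hidden variable)."""
    dim = len(embeddings[0])
    hidden = [0.0] * dim
    for fid in feature_ids:
        for j in range(dim):
            hidden[j] += embeddings[fid][j]
    return [h / len(feature_ids) for h in hidden]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(feature_ids, embeddings, output_weights):
    """Averaged embeddings -> linear layer -> softmax."""
    hidden = average_embeddings(feature_ids, embeddings)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return softmax(scores)

# Hypothetical toy values: 4 features, embedding dimension 2, 2 classes.
embeddings = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.8], [0.0, 0.1]]
output_weights = [[1.0, -1.0], [-1.0, 1.0]]

# A document made of features 0, 2 and 3.
probs = predict_proba([0, 2, 3], embeddings, output_weights)
print(probs)
```

Because the hidden variable is a plain average, the cost of a prediction grows only linearly with the number of features in the document.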

### Algorithm

FastText uses the softmax function $f$ to compute the probability distribution over the predefined classes. For a set of $N$ documents, this leads to minimizing the negative log-likelihood over the classes:


\begin{align}
-\frac{1}{N} \sum_{n=1}^{N} y_n \log(f(BAx_n))
\end{align}
$x_n$ - the normalized bag of features of the n-th document, <br>
$y_n$ - the label, <br>
$A, B$ - weight matrices

Optimization is performed using stochastic gradient descent and a linearly decaying learning rate.
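
As a rough illustration of the two ideas above, the sketch below computes the mean negative log-likelihood from already-computed class probabilities, and a learning rate that decays linearly with the training step. It assumes $y_n$ is a one-hot label, so the sum reduces to the log-probability of the true class. The function names and the exact decay formula are assumptions for illustration, not fastText's code.

```python
import math

def negative_log_likelihood(probabilities, labels):
    """Mean negative log-probability assigned to the true class of each document."""
    total = 0.0
    for probs, label in zip(probabilities, labels):
        total -= math.log(probs[label])
    return total / len(labels)

def decayed_learning_rate(initial_lr, step, total_steps):
    """Decay the learning rate linearly from initial_lr at step 0 to 0 at the last step."""
    return initial_lr * (1.0 - step / total_steps)

# Hypothetical softmax outputs for three documents and their true classes.
probabilities = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

loss = negative_log_likelihood(probabilities, labels)
print(round(loss, 4))

# Learning rate at the start, middle and end of training.
print([decayed_learning_rate(0.1, step, 100) for step in (0, 50, 100)])
```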

<h3 style="font-size:16px; font-family:Verdana">[To the table of contents](#0)</h3>

<a id='2'></a>

<hr style="height: 1px; background-color: #808080">
## 2. Data preprocessing for fastText
- remove nonprintable characters
- fix `n't`, `'re`, `'s` and other contractions
- remove punctuation and digits
- Porter stemming

In [1]:
import pandas as pd

from itertools import product

from string import punctuation, digits
from nltk.stem import PorterStemmer

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report

In [2]:
from fasttext import supervised

In [3]:
df_train = pd.read_csv('../data/movie_reviews.csv')
df_test = pd.read_csv('../data/test.csv')

In [4]:
# translation table that deletes punctuation and digits
translator = str.maketrans('', '', punctuation + digits)

# basic preprocessing: writes the cleaned text to a new `clean_text` column
def clean_data(df, col):
    # drop newline, carriage-return and tab characters
    df['clean_text'] = df[col].str.replace('\n', '').str.replace('\r', '').str.replace('\t', '')
    # expand or split off common contractions
    df.clean_text = df.clean_text.str.replace("n't", " not").str.replace("'re", " are").str.replace("'s", " s")
    df.clean_text = df.clean_text.str.replace("'ve", " have").str.replace("'ll", " will").str.replace("'d", " d")
    # remove punctuation and digits, trim surrounding whitespace, lowercase
    df.clean_text = df.clean_text.str.translate(translator).str.strip().str.lower()
    return df

In [5]:
df_train = clean_data(df_train, 'text')
df_test = clean_data(df_test, 'text')

In [6]:
print('before:\n', df_train.text[0])
print('after:\n', df_train.clean_text[0])

before:
 To an entire generation of filmgoers, it just might represent the most significant leap in storytelling that they will ever see...
after:
 to an entire generation of filmgoers it just might represent the most significant leap in storytelling that they will ever see


In [None]:
%%time
# Porter stemming
stemmer = PorterStemmer()
df_train['porter_text'] = df_train['clean_text'].apply(lambda x: ' '.join([stemmer.stem(w) for w in x.split()]))
df_train = df_train[df_train.porter_text.apply(len) != 0]
df_test['porter_text'] = df_test['clean_text'].apply(lambda x: ' '.join([stemmer.stem(w) for w in x.split()]))

In [None]:
print('before:\n', df_train.clean_text[0])
print('after:\n', df_train.porter_text[0])

In [None]:
# split into training and validation sets
df_train2, df_val = train_test_split(df_train[['label', 'porter_text']], test_size=0.2, random_state=42)

In [None]:
# fastText is trained from a text file, so every line must start with a label marked by a prefix.
# The default prefix is `__label__`, but a custom one can be used.
df_train2['ft_label'] = df_train2['label'].apply(lambda x: '__label__1 ' if x == 1 else '__label__0 ')
df_train2[['ft_label', 'porter_text']].to_csv('../data/train_fastText.csv', index=False, header=False)
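
Note that `to_csv` with its default separator puts a comma between the two columns, so each line of the file above reads `__label__1 ,<text>`. As an alternative, the minimal sketch below writes the plain `<prefix><label> <text>` layout, one example per line, using only the standard library. The helper names, the toy data, and the temporary output path are hypothetical.

```python
import os
import tempfile

def to_fasttext_line(label, text, prefix="__label__"):
    """Format one example as '<prefix><label> <text>'."""
    return f"{prefix}{label} {text}"

def write_fasttext_file(path, labels, texts, prefix="__label__"):
    """Write one labelled example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in zip(labels, texts):
            handle.write(to_fasttext_line(label, text, prefix) + "\n")

# Hypothetical toy data standing in for the `label` and `porter_text` columns.
labels = [1, 0]
texts = ["great movi", "bore plot"]

path = os.path.join(tempfile.mkdtemp(), "train_fastText.txt")
write_fasttext_file(path, labels, texts)

with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```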

### Observations and yet another preprocessing variant

What did not help:
- Removing default stop words makes the model perform worse. The stop word list should be revised or created from scratch.
- Removing named entities hurts the model's results.

spaCy lemmatization is less aggressive than the Porter stemmer.

<h3 style="font-size:16px; font-family:Verdana">[To the table of contents](#0)</h3>

<a id='3'></a>

<hr style="height: 1px; background-color: #808080">
## 3. Parameters overview and the model training

### Parameters overview

- `lr` - learning rate. Default: **0.1**.
- `dim` - size of the word vectors (the hidden layer). Default: **100**. Use a smaller value for small datasets and a small number of labels.
- `epoch` - number of epochs. Default: **5**. Use a higher value for small learning rates.
- `min_count` - minimal number of word occurrences. Default: **1**. Use 5 or higher to reduce overfitting.
- `word_ngrams` - max length of word n-gram. Default: **1**. Higher-order n-grams lead to overfitting on small datasets. If the value is greater than 1, the learning rate and the number of epochs should be revisited.
- `bucket` - number of buckets. Default: **2000000**. The developers recommend using lower values for small datasets (e.g. 100K).
- `minn` - min length of char n-gram. Default: **0**.
- `maxn` - max length of char n-gram. Default: **0**.
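
To see how `word_ngrams` and `bucket` interact, the sketch below builds the word n-grams of a tokenized sentence and maps each one to a bucket index with the hashing trick. It is only an illustration: `zlib.crc32` is a stand-in hash chosen for reproducibility, not the hash function fastText uses, and the helper names are hypothetical.

```python
import zlib

def word_ngrams(tokens, max_n):
    """Return all word n-grams of length 2..max_n (unigrams are the tokens themselves)."""
    grams = []
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

def bucket_index(ngram, bucket):
    """Map an n-gram to one of `bucket` slots (stand-in hash, not fastText's)."""
    return zlib.crc32(ngram.encode("utf-8")) % bucket

tokens = "the movi wa great".split()

bigrams = word_ngrams(tokens, max_n=2)
print(bigrams)

# Each n-gram shares an embedding row with every other n-gram in the same bucket.
indices = [bucket_index(gram, bucket=100_000) for gram in bigrams]
print(indices)
```

A smaller `bucket` value means fewer rows to train but more collisions between unrelated n-grams, which is the trade-off behind the recommendation for small datasets.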

### The model training

In [11]:
def get_score(df, clf, label='label', text='porter_text', model_name='../data/fastText_porter'):
    """Return [accuracy, F1 for class 0] of `clf` on `df`, rounded to 4 decimals."""
    # take the most probable label of each document and convert it to int
    prediction = clf.predict_proba(list(df[text]))
    prediction = [int(item[0][0]) for item in prediction]

    return [round(accuracy_score(list(df[label]), prediction), 4), 
            round(f1_score(list(df[label]), prediction, pos_label=0), 4)]
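
The two numbers returned by `get_score` can be reproduced by hand. The sketch below is a plain-Python version of accuracy and of the F1 score for a chosen positive label, to show what `pos_label=0` changes. The toy labels and function names are assumptions for illustration; the notebook itself relies on scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f1_for_label(y_true, y_pred, pos_label):
    """F1 score when `pos_label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: the classifier over-predicts class 1.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1]

print(round(accuracy(y_true, y_pred), 4))
print(f1_for_label(y_true, y_pred, pos_label=0))
print(f1_for_label(y_true, y_pred, pos_label=1))
```

The same predictions give a different F1 depending on which class is treated as positive, so the reported `f1` here describes how well class 0 is recovered.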

In [12]:
# prepare parameters for the grid search
lrates = [0.5, 0.1, 0.01]
epochs = [5, 10]
min_c = [1, 5]

In [13]:
%%time
# Train one model per parameter combination and evaluate it on the validation dataset.
# `supervised` requires at least two arguments: a training file path and an output file path,
# here '../data/train_fastText.csv' and '../data/fastText_porter' respectively.
results = []
for lr, epoch, min_count in product(lrates, epochs, min_c):
    clf = supervised('../data/train_fastText.csv', '../data/fastText_porter', label_prefix='__label__', 
                     lr=lr, epoch=epoch, min_count=min_count)
    results.append([lr, epoch, min_count] + get_score(df_val, clf))

CPU times: user 6min 48s, sys: 12.1 s, total: 7min
Wall time: 3min 59s


In [14]:
results = pd.DataFrame(results, columns=['lr', 'epoch', 'min_count', 'accuracy', 'f1']).sort_values(by='f1', ascending=False)
results.reset_index(drop=True, inplace=True)
results

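|   | lr | epoch | min_count | accuracy | f1 |
|---|---|---|---|---|---|
| 0 | 0.5 | 5 | 5 | 0.8169 | 0.7733 |
| 1 | 0.1 | 10 | 5 | 0.8158 | 0.7722 |
| 2 | 0.5 | 10 | 5 | 0.8154 | 0.7713 |
| 3 | 0.1 | 5 | 5 | 0.8169 | 0.7712 |
| 4 | 0.1 | 10 | 1 | 0.8158 | 0.7709 |
| 5 | 0.1 | 5 | 1 | 0.8166 | 0.77 |
| 6 | 0.5 | 5 | 1 | 0.8165 | 0.7699 |
| 7 | 0.5 | 10 | 1 | 0.816 | 0.7682 |
| 8 | 0.01 | 10 | 5 | 0.8144 | 0.7659 |
| 9 | 0.01 | 10 | 1 | 0.8137 | 0.7633 |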


In [15]:
# Let's train the model with the best parameters
clf = supervised('../data/train_fastText.csv', '../data/fastText_porter', label_prefix='__label__', 
                 lr=results.lr[0], epoch=results.epoch[0], min_count=results.min_count[0])

In [16]:
# and evaluate it on the test dataset
acc, f1 = get_score(df_test, clf)
print('test accuracy:', round(acc, 4), 'test f1:', round(f1, 4))

test accuracy: 0.7887 test f1: 0.7721


### Exercise 1
Train models with the predefined parameters below and explain the results you observe.

In [None]:
%%time
# A
clf = supervised('../data/train_fastText.csv', '../data/fastText_porter', label_prefix='__label__', 
                 epoch=40)

In [None]:
acc, f1 = get_score(df_val, clf)
print('validation accuracy:', round(acc, 4), 'validation f1:', round(f1, 4))
acc, f1 = get_score(df_test, clf)
print('test accuracy:', round(acc, 4), 'test f1:', round(f1, 4))

<details>
  <summary>Click to see answer</summary>
  <p align='left'>A high number of epochs leads to overfitting on the training dataset, which lowers performance on the validation and test datasets. </p>
</details>

In [None]:
%%time
# B
clf = supervised('../data/train_fastText.csv', '../data/fastText_porter', label_prefix='__label__', 
                 word_ngrams=3, 
                 bucket=2000000)

In [None]:
acc, f1 = get_score(df_val, clf)
print('validation accuracy:', round(acc, 4), 'validation f1:', round(f1, 4))
acc, f1 = get_score(df_test, clf)
print('test accuracy:', round(acc, 4), 'test f1:', round(f1, 4))

<details>
  <summary>Click to see answer</summary>
  <p align='left'>A bag of n-grams represents reviews much better than a bag of words, which leads to higher performance on the validation and test datasets. </p>
</details>

### Exercise 2
Try to improve the previous results using different parameter values.

In [None]:
%%time
clf = supervised('../data/train_fastText.csv', '../data/fastText_porter', label_prefix='__label__',
                 lr=0.1, 
                 dim=100, 
                 epoch=5, 
                 min_count=1, 
                 word_ngrams=1, 
                 bucket=2000000, 
                 minn=0, 
                 maxn=0
                )

In [None]:
acc, f1 = get_score(df_val, clf)
print('validation accuracy:', round(acc, 4), 'validation f1:', round(f1, 4))
acc, f1 = get_score(df_test, clf)
print('test accuracy:', round(acc, 4), 'test f1:', round(f1, 4))

<details>
  <summary>Click to see answer</summary>
  <p align='left'>Based on the previous examples, we can try the following parameter values:</p>
    <pre>
      <code>
     min_count=5, 
     word_ngrams=2, 
     bucket=2000000
      </code>
    </pre>
</details>

<h3 style="font-size:16px; font-family:Verdana">[To the table of contents](#0)</h3>

## Conclusions

- Really fast: it represents sentences as a bag of words and a bag of n-grams with the hashing trick, and uses a hierarchical softmax.
- It was developed mainly for large datasets (e.g. 1 billion words). For small datasets, tune the hyperparameters carefully to avoid overfitting, or get more data.