**Train a [fastText](https://github.com/facebookresearch/fastText) classification model**  
<img src=https://fasttext.cc/img/fasttext-logo-color-web.png width=500>

**Documentation**  
* our first classifier: https://fasttext.cc/docs/en/supervised-tutorial.html#our-first-classifier
* autotune: https://fasttext.cc/docs/en/autotune.html

**Install project requirements**

In [1]:
# !pip install -r requirements.txt

**Import libraries**

In [2]:
import pandas as pd
import fasttext as ft

import os

**Define input data parameters**

In [3]:
input_path = os.path.join("data", "processed")

train_data = os.path.join(input_path, "train.txt")
validation_data = os.path.join(input_path, "validation.txt")
test_data = os.path.join(input_path, "test.txt")
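
Before training, it can help to confirm that the three files follow the layout fastText's supervised mode expects: one example per line, with each label marked by the `__label__` prefix and followed by the text. The sketch below is a minimal sanity check under that assumption; `find_malformed_lines` is a hypothetical helper, not part of fastText, and it also flags blank lines.

```python
import os

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def find_malformed_lines(path, prefix=LABEL_PREFIX, limit=5):
    """Return up to `limit` (line_number, line) pairs lacking a label or text."""
    malformed = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            tokens = line.split()
            has_label = any(token.startswith(prefix) for token in tokens)
            has_text = any(not token.startswith(prefix) for token in tokens)
            if not (has_label and has_text):
                malformed.append((number, line.rstrip("\n")))
                if len(malformed) >= limit:
                    break
    return malformed


# Hypothetical usage with the paths defined above:
# for path in (train_data, validation_data, test_data):
#     if os.path.exists(path):
#         print(path, find_malformed_lines(path))
```

An empty list means every inspected line carries at least one label and some text.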

**Define model parameters**  
* `auto_tune`: if True, performs automatic hyperparameter tuning using the validation data
* `quantize`: if True, quantizes the model, reducing its size and memory footprint

In [4]:
auto_tune = True
quantize = False

**Train fastText supervised model**  
`train_supervised` parameters:
*    **input**: training file path (required)
*    **lr**: learning rate [0.1]
*    **dim**: size of word vectors [100]
*    **ws**: size of the context window [5]
*    **epoch**: number of epochs [5]
*    **minCount**: minimal number of word occurrences [1]
*    **minCountLabel**: minimal number of label occurrences [1]
*    **minn**: min length of char ngram [0]
*    **maxn**: max length of char ngram [0]
*    **neg**: number of negatives sampled [5]
*    **wordNgrams**: max length of word ngram [1]
*    **loss**: loss function {ns, hs, softmax, ova} [softmax]
*    **bucket**: number of buckets [2000000]
*    **thread**: number of threads [number of cpus]
*    **lrUpdateRate**: change the rate of updates for the learning rate [100]
*    **t**: sampling threshold [0.0001]
*    **label**: label prefix ['__label__']
*    **verbose**: verbose [2]
*    **pretrainedVectors**: pretrained word vectors (.vec file) for supervised learning []

In [5]:
if auto_tune:
    model = ft.train_supervised(input=train_data, autotuneValidationFile=validation_data)
else:
    model = ft.train_supervised(input=train_data)
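
Autotune accepts a few more options than the validation file: `autotuneDuration` (search budget in seconds), `autotuneMetric` (the metric to optimize) and `autotuneModelSize` (a size constraint such as `"2M"`, which yields a quantized model). The sketch below assembles those keyword arguments in one place; `build_training_kwargs` and its default values are illustrative assumptions, not part of the notebook.

```python
def build_training_kwargs(train_path, validation_path=None, duration_s=300,
                          metric="f1", model_size=None):
    """Assemble keyword arguments for fasttext.train_supervised."""
    kwargs = {"input": train_path}
    if validation_path is not None:
        # Passing a validation file is what switches autotune on.
        kwargs["autotuneValidationFile"] = validation_path
        kwargs["autotuneDuration"] = duration_s  # search budget in seconds
        kwargs["autotuneMetric"] = metric
        if model_size is not None:
            kwargs["autotuneModelSize"] = model_size  # e.g. "2M"
    return kwargs


# Hypothetical usage with the names defined above:
# kwargs = build_training_kwargs(train_data, validation_data if auto_tune else None)
# model = ft.train_supervised(**kwargs)
```

Keeping the arguments in a dictionary makes it easy to log exactly what was passed to `train_supervised`.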

**Get best model parameters**  
The `train_supervised`, `train_unsupervised` and `load_model` functions return an instance of the `_FastText` class, generally named `model`.

This object exposes the training arguments as properties: `lr`, `dim`, `ws`, `epoch`, `minCount`, `minCountLabel`, `minn`, `maxn`, `neg`, `wordNgrams`, `loss`, `bucket`, `thread`, `lrUpdateRate`, `t`, `label`, `verbose`, `pretrainedVectors`. So `model.wordNgrams` will give you the max length of word ngram used for training this model.

In addition, the object exposes several functions:  
*    **get_dimension**: Get the dimension (size) of a lookup vector (hidden layer). This is equivalent to `dim` property.
*    **get_input_vector**: Given an index, get the corresponding vector of the Input Matrix.
*    **get_input_matrix**: Get a copy of the full input matrix of a Model.
*    **get_labels**: Get the entire list of labels of the dictionary. This is equivalent to `labels` property.
*    **get_line**: Split a line of text into words and labels.
*    **get_output_matrix**: Get a copy of the full output matrix of a Model.
*    **get_sentence_vector**: Given a string, get a single vector representation. This function expects a single line of text. Words are split on whitespace (space, newline, tab, vertical tab) and on the control characters carriage return, formfeed and the null character.
*    **get_subword_id**: Given a subword, return the index (within input matrix) it hashes to.
*    **get_subwords**: Given a word, get the subwords and their indices.
*    **get_word_id**: Given a word, get the word id within the dictionary.
*    **get_word_vector**: Get the vector representation of word.
*    **get_words**: Get the entire list of words of the dictionary. This is equivalent to `words` property.
*    **is_quantized**: Return whether the model has been quantized.
*    **predict**: Given a string, get a list of labels and a list of corresponding probabilities.
*    **quantize**: Quantize the model, reducing its size and memory footprint.
*    **save_model**: Save the model to the given path.
*    **test**: Evaluate a supervised model using the file given by path.
*    **test_label**: Return the precision and recall score for each label.
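
As an illustration of `predict`, the sketch below turns its return value (a sequence of labels and a matching sequence of probabilities) into readable pairs with the label prefix stripped. `format_prediction` is a hypothetical helper, and the example sentence in the usage comment is made up.

```python
def format_prediction(labels, probabilities, prefix="__label__"):
    """Pair each label (prefix stripped) with its probability as a float."""
    return [
        (label[len(prefix):] if label.startswith(prefix) else label, float(prob))
        for label, prob in zip(labels, probabilities)
    ]


# Hypothetical usage with a trained model; predict expects a single line,
# so newline characters should be removed from the text first:
# labels, probabilities = model.predict("i feel wonderful today", k=2)
# format_prediction(labels, probabilities)
```

The `k` argument of `predict` controls how many of the top labels are returned.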

In [6]:
model_params = {}
for attribute in dir(model):
    if not attribute.startswith("_") and "get_" not in attribute and attribute!="words":
        model_params[attribute] = getattr(model, attribute)
model_params

{'bucket': 0,
 'dim': 991,
 'epoch': 4,
 'f': <fasttext_pybind.fasttext at 0x1d8f3a7f270>,
 'is_quantized': <bound method _FastText.is_quantized of <fasttext.FastText._FastText object at 0x000001D8F3A75490>>,
 'label': '__label__',
 'labels': ['__label__joy',
  '__label__sadness',
  '__label__anger',
  '__label__fear',
  '__label__love',
  '__label__surprise'],
 'loss': <loss_name.softmax: 3>,
 'lr': 0.47095012446288465,
 'lrUpdateRate': 100,
 'maxn': 0,
 'minCount': 1,
 'minCountLabel': 0,
 'minn': 0,
 'neg': 5,
 'predict': <bound method _FastText.predict of <fasttext.FastText._FastText object at 0x000001D8F3A75490>>,
 'pretrainedVectors': '',
 'quantize': <bound method _FastText.quantize of <fasttext.FastText._FastText object at 0x000001D8F3A75490>>,
 'save_model': <bound method _FastText.save_model of <fasttext.FastText._FastText object at 0x000001D8F3A75490>>,
 'set_args': <bound method _FastText.set_args of <fasttext.FastText._FastText object at 0x000001D8F3A75490>>,
 'set_matrice
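
The listing above mixes hyperparameters with bound methods because the loop copies every public attribute. If only the values are of interest, one option is to skip callables, as in this sketch. `collect_hyperparameters` is a hypothetical helper, and the choice to skip `f`, `words` and `labels` is an assumption about what counts as noise here.

```python
def collect_hyperparameters(model, skip=("f", "words", "labels")):
    """Return public, non-callable attributes of `model` as a dict."""
    params = {}
    for name in dir(model):
        # Check the skip list before getattr so large properties are never read.
        if name.startswith("_") or name in skip:
            continue
        value = getattr(model, name)
        if not callable(value):
            params[name] = value
    return params


# Hypothetical usage with the trained model:
# collect_hyperparameters(model)
```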

**Quantize model (optional)**

In [7]:
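if quantize:
    # quantize() modifies the model in place and returns None,
    # so the result must not be assigned back to `model`.
    model.quantize()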
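
`quantize()` works in place and returns `None`, and it also accepts `input`, `retrain`, `cutoff` and `qnorm` to prune the vocabulary and fine-tune on the training file afterwards. The wrapper below is a sketch under those assumptions; `quantize_in_place` is a hypothetical name and the default `cutoff` of 100000 is only an illustrative value.

```python
def quantize_in_place(model, train_path=None, cutoff=100000):
    """Quantize `model` in place and return the same object for convenience."""
    if train_path is None:
        # Plain quantization, no pruning or retraining.
        model.quantize()
    else:
        # Prune to `cutoff` words and n-grams, then retrain on the training file.
        model.quantize(input=train_path, retrain=True, cutoff=cutoff, qnorm=True)
    return model


# Hypothetical usage with the names defined above:
# if quantize:
#     model = quantize_in_place(model, train_data)
```

Returning the model keeps an assignment such as `model = quantize_in_place(model)` safe.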

**Evaluate model performance on test data**

In [8]:
model.test(path=test_data)

(2000, 0.8865, 0.8865)
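
The tuple returned by `test` holds the number of examples, precision at 1 and recall at 1, in that order. When every example carries exactly one label the two scores coincide, which is consistent with the identical values above. The sketch below names the fields and derives F1 from them; `summarize_test_result` is a hypothetical helper.

```python
def summarize_test_result(result):
    """Name the fields of model.test() output and add the F1 score."""
    n_examples, precision, recall = result
    total = precision + recall
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "examples": n_examples,
        "precision@1": precision,
        "recall@1": recall,
        "f1@1": f1,
    }


# Hypothetical usage with the names defined above:
# summarize_test_result(model.test(path=test_data))
```

For a per-class breakdown, `model.test_label(path=test_data)` returns precision and recall for each label.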

**Serialize trained model**

In [9]:
output_model = os.path.join("models", "emotion_model.bin")
model.save_model(path=output_model)
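
`save_model` does not create missing directories, so the call fails if `models` does not exist yet. The sketch below creates the folder first and picks the file extension from the quantization flag, following the fastText convention of `.bin` for full models and `.ftz` for quantized ones. `prepare_model_path` is a hypothetical helper.

```python
import os


def prepare_model_path(directory, stem, quantized=False):
    """Create `directory` if needed and return the model file path."""
    os.makedirs(directory, exist_ok=True)
    extension = ".ftz" if quantized else ".bin"
    return os.path.join(directory, stem + extension)


# Hypothetical usage with the names defined above:
# output_model = prepare_model_path("models", "emotion_model", quantized=quantize)
# model.save_model(path=output_model)
# reloaded = ft.load_model(output_model)
```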