# FastText

Authors:
* Aurelien ROUXEL
* Ethan MACHAVOINE
* Jonathan POELGER

In [1]:
import datasets as ds
import fasttext
import numpy as np
import string
import random
from sklearn.model_selection import train_test_split
random.seed(42)

In [2]:
ds_train = ds.load_dataset('imdb', split='train')
ds_test = ds.load_dataset('imdb', split='test')

Found cached dataset imdb (/home/ethan/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
Found cached dataset imdb (/home/ethan/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


### 1. Preprocessing

In [3]:
def preprocessing(base_text: str) -> str:
    """
    Preprocess the text before classification
    Args:
        base_text: the string to preprocess
    Return:
        The preprocessed text
    """
    base_text = base_text.lower()
    # Remove the HTML line breaks first: '<', '/' and '>' are punctuation too
    base_text = base_text.replace("<br />", ' ')
    # Replace every punctuation character with a space
    return "".join(' ' if char in string.punctuation else char for char in base_text)
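
As a quick check of what this step produces, here is a minimal sketch applied to a made-up review string. It uses `str.translate` rather than a character loop, which should give the same result; the name `preprocessing_sketch` and the sample text are assumptions for illustration only.

```python
import string

# Map every ASCII punctuation character to a single space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocessing_sketch(base_text: str) -> str:
    """Lowercase, drop HTML line breaks, and replace punctuation with spaces."""
    text = base_text.lower().replace("<br />", " ")
    return text.translate(PUNCT_TO_SPACE)

sample = "Great movie!<br />Loved it."  # made-up example, not taken from IMDB
cleaned = preprocessing_sketch(sample)
print(repr(cleaned))
```

The `<br />` replacement has to come before the punctuation pass: `<`, `/` and `>` are themselves punctuation, so otherwise a stray `br` token would be left in the text.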

In [4]:
def text_label(label):
    if label == 0:
        return "negative"
    return "positive"

In [5]:
train_set = [f"__label__{text_label(text['label'])} {preprocessing(text['text'])}\n" for text in ds_train]
test_set = [f"__label__{text_label(text['label'])} {preprocessing(text['text'])}\n" for text in ds_test]
random.shuffle(train_set)
random.shuffle(test_set)

In [6]:
with open("imdb.train", "w") as f:
    f.writelines(train_set)
with open("imdb.test", "w") as f:
    f.writelines(test_set)
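
Before training, it can be worth checking that the written files have the expected `__label__` prefix and class balance. The sketch below counts the leading label of each line; `label_counts` and the toy file are hypothetical stand-ins, so that the example runs without the real `imdb.train`.

```python
import os
import tempfile
from collections import Counter

def label_counts(path: str) -> Counter:
    """Count the leading __label__ token of every line in a fastText file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, _, _ = line.partition(" ")
            counts[label] += 1
    return counts

# Toy file standing in for imdb.train (made-up lines).
toy_lines = [
    "__label__positive a great film\n",
    "__label__negative a dull film\n",
    "__label__positive loved it\n",
]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.train")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.writelines(toy_lines)
    counts = label_counts(toy_path)

print(counts)
```

Run on the real `imdb.train`, the two counts should be equal, since the IMDB training split is balanced.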

### 2. Train a FastText classifier

In [14]:
model_first = fasttext.train_supervised(input="imdb.train")

Read 6M words
Number of words:  75900
Number of labels: 2
Progress: 100.0% words/sec/thread: 4538137 lr:  0.000000 avg.loss:  0.381982 ETA:   0h 0m 0s


Results:
* Read 5M words
* Number of words:  75900
* Number of labels: 2
* Progress: 100.0% words/sec/thread: 4541135 lr:  0.000000 avg.loss:  0.388679 ETA:   0h 0m 0s

In [15]:
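def get_true_values(model, test_set):
    """Count the test lines whose predicted label matches the true label."""
    values = 0
    for text in test_set:
        # "__label__negative" and "__label__positive" are both 17 characters long
        label = text[:17]
        # Drop the trailing newline: predict() only accepts a single line
        predict = model.predict(text[:-1])[0][0]
        if label == predict:
            values += 1
    return values

def compute_accuracy(model, test_set):
    """Accuracy = correct predictions / number of test samples."""
    correct = get_true_values(model, test_set)
    return correct / len(test_set)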
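
The accuracy computation above relies on both labels being exactly 17 characters long. As a hedged alternative, the sketch below splits each line on its first space instead, and takes any callable as the predictor. `split_line`, `accuracy` and `toy_predict` are hypothetical names; with a real model one would pass something like `lambda t: model.predict(t)[0][0]`.

```python
def split_line(line: str):
    """Split a fastText line into (label, text) without fixed offsets."""
    label, _, text = line.rstrip("\n").partition(" ")
    return label, text

def accuracy(predict_label, lines):
    """Fraction of lines whose predicted label matches the stored label.

    `predict_label` is any callable mapping a text to a label string.
    """
    correct = sum(predict_label(text) == label
                  for label, text in map(split_line, lines))
    return correct / len(lines)

# Toy predictor: "positive" whenever the word "great" appears.
def toy_predict(text: str) -> str:
    return "__label__positive" if "great" in text.split() else "__label__negative"

toy_lines = [
    "__label__positive a great film\n",
    "__label__negative not great at all\n",
    "__label__negative boring\n",
    "__label__positive wonderful\n",
]
toy_accuracy = accuracy(toy_predict, toy_lines)
print(toy_accuracy)
```

The toy predictor gets the second and fourth lines wrong, so half of the predictions are correct. With a single label per review, the precision at 1 returned by `model.test` should match the figure computed this way.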

In [16]:
accuracy = compute_accuracy(model_first, test_set)
print(f"Accuracy: {accuracy}")

Accuracy: 0.87852


Result:
* Accuracy: 0.879

### 3. Use the hyperparameter search functionality

In [10]:
training_set, validation_set = train_test_split(train_set, test_size=0.2, random_state=42)
random.shuffle(training_set)
random.shuffle(validation_set)
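
To make the effect of `test_size=0.2` concrete, here is a small sketch of the resulting split sizes for the 25,000 IMDB training reviews. It assumes the held-out count is rounded up, which is how scikit-learn handles a fractional `test_size`; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples: int, test_size: float):
    """Sizes (train, validation) for a fractional hold-out split.

    Assumes the held-out count is rounded up and the rest goes to training.
    """
    n_val = math.ceil(n_samples * test_size)
    return n_samples - n_val, n_val

n_train, n_val = split_sizes(25_000, 0.2)
print(n_train, n_val)
```

Training on 20,000 reviews instead of 25,000 is consistent with the smaller word count and vocabulary that fastText reports for the tuned model.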

In [11]:
with open("imdb.training.hyperparameter", "w") as f:
    f.writelines(training_set)
with open("imdb.validation.hyperparameter", "w") as f:
    f.writelines(validation_set)

In [12]:
model = fasttext.train_supervised(input='imdb.training.hyperparameter',
                                  autotuneValidationFile='imdb.validation.hyperparameter')

Progress: 100.0% Trials:    9 Best score:  0.899400 ETA:   0h 0m 0s
Training again with best arguments
Read 4M words
Number of words:  69077
Number of labels: 2
Progress: 100.0% words/sec/thread: 1574213 lr:  0.000000 avg.loss:  0.045578 ETA:   0h 0m 0s


Results:
* Progress: 100.0% Trials:   11 Best score:  0.899000 ETA:   0h 0m 0s
* Training again with best arguments
* Read 4M words
* Number of words:  69077
* Number of labels: 2
* Progress: 100.0% words/sec/thread: 1881838 lr:  0.000000 avg.loss:  0.043658 ETA:   0h 0m 0s

In [13]:
accuracy = compute_accuracy(model, test_set)
print(f"Accuracy: {accuracy}")

Accuracy: 0.89616


Result:
* Accuracy: 0.89588

### 4. Look at the differences between the 2 models

print(f"First model attributes:\n\t-learning rate: {model_first.lr},\n\t-dimension of word vectors: {model_first.dim},\n\t-epoch: {model_first.epoch}\n")
print(f"Hyperparameters trained model attributes:\n\t-learning rate: {model.lr},\n\t-dimension of word vectors: {model.dim},\n\t-epoch: {model.epoch}")

First model attributes:
* learning rate: 0.1,
* dimension of word vectors: 100,
* epoch: 5

Hyperparameters trained model attributes:
* learning rate: 0.08499425639667486,
* dimension of word vectors: 92,
* epoch: 100

#### The tuned model was trained for far longer (100 epochs instead of 5), with a slightly lower learning rate (about 0.085 instead of 0.1) and slightly smaller word vectors (92 dimensions instead of 100).


### 5. Two wrongly classified examples from the tuned model

In [42]:
falses = [(text, model.predict(text[:-1])[0][0]) for text in test_set if model.predict(text[:-1])[0][0] != text[:17]]
first = falses[1]
second = falses[-1]
print(f"First:\n\t-text: {first[0][18:-1]}, \n\t-true label: {first[0][9:17]},\n\t-predicted label: {first[1][9:17]}\n")
print(f"Second:\n\t-text: {second[0][18:-1]}, \n\t-true label: {second[0][9:17]},\n\t-predicted label: {second[1][9:17]}\n")

First:
* text: i just got back from seeing   comedian   it was   alright  it kept me looking at the screen  its just not the type of thing i like to go pay  7 to see   now don t get me wrong  it d make a great hbo feature  if this were something i was watching on tv  i d be hooked right in  it gives an amazing look at what comics go through before and after getting on stage  it will interest anyone who likes watching comics   but when i go to the movies  i like to be entertained  i m not there to be educated  now i know what its like for jerry seinfeld before he goes out on stage    great  but truthfully  i d rather just laugh at his jokes than worry about any of that   one more thing  with the bad attitude onry adams has  i d expect to see him taking my order from burger king before i see his hbo special  he wasn t funny  he s the kind of person that you love to hate, 
* true label: negative,
* predicted label: positive

Second:
* text: this has to be one of  if not the greatest mob crime films of all time  every thing about this movie is great  the acting in this film is of true quality  master p s acting skills make you actually believe he is italian  the cinematography is excellent too  probably the best ever  this movie was great  and i have the brain capacity of an earth worm , 
* true label: negative,
* predicted label: positive

#### For the first one, the writer actually liked the film but would have preferred to watch it on TV rather than pay to see it at the cinema, which is what makes the review negative. The model probably did not pick up on this nuance, which led it to misclassify the review.

#### For the second one, the reason for the misclassification is fairly clear: the review reads as positive until the very end, where it turns out the writer was being sarcastic, which the model probably did not recognise.
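
To illustrate why surface wording can mislead a classifier on the second review, the sketch below counts hand-picked cue words in it. This is only a toy: the cue lists are assumptions made for this example, and fastText learns weighted word vectors rather than counting words from a fixed lexicon.

```python
# Hypothetical cue lists, chosen by hand for this illustration only.
POSITIVE_CUES = {"great", "greatest", "excellent", "best", "quality"}
NEGATIVE_CUES = {"bad", "worst", "boring", "awful", "terrible"}

def cue_counts(text: str):
    """Count hand-picked positive and negative cue words in a text."""
    tokens = text.split()
    pos = sum(token in POSITIVE_CUES for token in tokens)
    neg = sum(token in NEGATIVE_CUES for token in tokens)
    return pos, neg

# The second misclassified review, as preprocessed above.
review = (
    "this has to be one of  if not the greatest mob crime films of all time  "
    "every thing about this movie is great  the acting in this film is of true "
    "quality  master p s acting skills make you actually believe he is italian  "
    "the cinematography is excellent too  probably the best ever  this movie "
    "was great  and i have the brain capacity of an earth worm"
)
pos, neg = cue_counts(review)
print(pos, neg)
```

Every cue in the review is positive, and the sarcastic final clause contains none of the listed negative words, so any approach driven mainly by word-level evidence would lean towards "positive".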