# 05. Document Classification with FastText

After building a classifier with Bag-of-Words, we will try N-gram embeddings in this section. The model we use is [FastText](https://github.com/facebookresearch/fastText) from Facebook Research. Again, we will perform multi-class classification and see how much progress we can make.

## 1. Dataset

We will use the same dataset as in the previous notebook (04.bag_of_words). The data is described at:

http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset

In [1]:
from sklearn.datasets import fetch_20newsgroups
import spacy
import string

# To be used for pre-processing of data
tokenizer = spacy.load('en_core_web_sm')
punctuations = string.punctuation

### 1.1 Data Loading
First, let's load the dataset from sklearn. 

In [2]:
newsgroup_train = fetch_20newsgroups(subset='train')
newsgroup_test = fetch_20newsgroups(subset='test') # we will use it later

In [3]:
train_split = 10000
train_data = newsgroup_train.data[:train_split]
train_targets = newsgroup_train.target[:train_split]

val_data = newsgroup_train.data[train_split:]
val_targets = newsgroup_train.target[train_split:]

test_data = newsgroup_test.data
test_targets = newsgroup_test.target

print("Train dataset size is {}".format(len(train_data)))
print("Val dataset size is {}".format(len(val_data)))
print("Test dataset size is {}".format(len(test_data)))

Train dataset size is 10000
Val dataset size is 1314
Test dataset size is 7532


### 1.2 Preprocessing
The fastText library takes a file as input and learns a classification model from it.  
Each line of the input file should have this format: `__label__[class] [Text]`  
We will prepare the train, validation, and test files in this format.

In [4]:
import os

def create_newsgroup_file(data, targets, outfile_name):
    # Write one "__label__<class> <text>" line per document
    with open(outfile_name, 'w') as fout:
        for sent, target in zip(data, targets):
            # fastText reads one example per line, so flatten line breaks
            line = "__label__" + str(target) + " " + sent.replace('\n', ' ') + "\n"
            fout.write(line)


os.makedirs('data', exist_ok=True)  # make sure the output directory exists
create_newsgroup_file(train_data, train_targets, 'data/newsgroups.train')
create_newsgroup_file(val_data, val_targets, 'data/newsgroups.val')
create_newsgroup_file(test_data, test_targets, 'data/newsgroups.test')

Let's check what the file we created looks like:

In [5]:
!head -2 data/newsgroups.train

__label__7 From: lerxst@wam.umd.edu (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15   I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is  all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail.  Thanks, - IL    ---- brought to you by your neighborhood Lerxst ----     
__label__4 From: guykuo@carson.u.washington.edu (Guy Kuo) Subject: SI Clock Poll - Final Call Summary: Final call for SI clock reports Keywords: SI,acceleration,clock,upgrade Article-I.D.: shelley.1qvfo9INNc3s Organization: University of Washington Lines: 11 NNTP-Pos
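
As a sanity check on the format, a line can be split back into its label and its text. The helper below is a minimal sketch: `parse_fasttext_line` is a hypothetical name, not part of fastText, and it assumes that labels only appear at the start of the line, as in the files created above.

```python
def parse_fasttext_line(line, label_prefix="__label__"):
    """Split one line in the "__label__<class> <text>" format into (labels, text).

    Assumption: all labels come first, followed by the document text.
    """
    tokens = line.strip().split(" ")
    labels = []
    idx = 0
    # Leading tokens that start with the prefix are labels; the rest is text.
    while idx < len(tokens) and tokens[idx].startswith(label_prefix):
        labels.append(tokens[idx][len(label_prefix):])
        idx += 1
    return labels, " ".join(tokens[idx:])


example = "__label__7 From: lerxst@wam.umd.edu (where's my thing) Subject: WHAT car is this!?"
labels, text = parse_fasttext_line(example)
print(labels)  # ['7']
```

Running this over a whole file and counting the labels is a quick way to confirm that all 20 classes are present before training.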

## 2. FastText 
**Install FastText if you haven't!**

Use the following commands to install fasttext.
```
wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
unzip v0.1.0.zip
cd fastText-0.1.0
make
```

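Note that the upstream Makefile names the compiled binary `fasttext` (lowercase), so on a case-sensitive file system the commands below may need `./fastText-0.1.0/fasttext` instead.

Let's start training the fastText classifier and check its performance on the validation set.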

In [6]:
# Train fasttext
!./fastText-0.1.0/fastText supervised -input data/newsgroups.train -output data/model_newsgroup

Read 2M words
Number of words:  258366
Number of labels: 20
Progress: 100.0%  words/sec/thread: 2751034  lr: 0.000000  loss: 2.999579  eta: 0h0m 


In [7]:
# Evaluate it on validation set
!./fastText-0.1.0/fastText test data/model_newsgroup.bin data/newsgroups.val

N	1314
P@1	0.115
R@1	0.115
Number of examples: 1314


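Note that fastText reports precision and recall at 1 (P@1 and R@1) rather than accuracy.  
The **precision** is the fraction of the labels predicted by fastText that are correct.  
The **recall** is the fraction of the real labels that were successfully predicted.  
Since every document here has exactly one real label and we predict exactly one label per document, P@1 and R@1 are equal, and both coincide with accuracy.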
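
To make the two definitions concrete, here is a minimal sketch that computes them by counting over all examples. The function name `precision_recall_at_k` and the toy labels are illustrative assumptions, not fastText's own code or data.

```python
def precision_recall_at_k(predicted, gold):
    """Compute precision and recall over a set of examples.

    predicted: list of lists, the labels predicted for each example
    gold:      list of sets, the real labels of each example
    """
    n_correct = 0    # predicted labels that are also real labels
    n_predicted = 0  # all predicted labels
    n_gold = 0       # all real labels
    for preds, truth in zip(predicted, gold):
        n_correct += sum(1 for p in preds if p in truth)
        n_predicted += len(preds)
        n_gold += len(truth)
    return n_correct / n_predicted, n_correct / n_gold


# Toy example (made up for illustration): one prediction and one real label each
predicted = [["7"], ["4"], ["1"], ["2"]]
gold = [{"7"}, {"4"}, {"3"}, {"2"}]
precision, recall = precision_recall_at_k(predicted, gold)
print(precision, recall)  # 0.75 0.75
```

With one predicted label and one real label per example, both denominators equal the number of examples, which is why the two numbers always match in this notebook.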

## 3. Improving the model
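The model above performs poorly: a P@1 of 0.115 is not far above the roughly 0.05 we would expect from guessing at random among 20 classes. Let's do some data processing before training.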

### 3.1 Text Clean-up

In [8]:
def preprocess_sent(sent):
    # Remove line breaks, since fastText reads each sample as a single line
    temp_sent = ' '.join(sent.split('\n'))
    tokens = tokenizer(temp_sent)
    # Lowercase every token and drop punctuation tokens
    processed_toks = [tok.text.lower() for tok in tokens if tok.text not in punctuations]

    return ' '.join(processed_toks).strip()


temp = preprocess_sent(train_data[0])
temp

"from lerxst@wam.umd.edu where 's my thing subject what car is this nntp posting host rac3.wam.umd.edu organization university of maryland college park lines 15    i was wondering if anyone out there could enlighten me on this car i saw the other day it was a 2-door sports car looked to be from the late 60s/ early 70s it was called a bricklin the doors were really small in addition the front bumper was separate from the rest of the body this is   all i know if anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please e mail   thanks il     ---- brought to you by your neighborhood lerxst ----"

In [9]:
def create_newsgroup_file(data, targets, outfile_name):
    with open(outfile_name, 'w') as fout:
        for i, sent in enumerate(data):
            proc_sent = preprocess_sent(sent)
            line = "__label__" + str(targets[i]) + " " + proc_sent + "\n"
            fout.write(line)

# This will take a long time: every document goes through the spaCy pipeline
create_newsgroup_file(train_data, train_targets, 'data/newsgroups.proc.train') 
create_newsgroup_file(val_data, val_targets, 'data/newsgroups.proc.val') 
create_newsgroup_file(test_data, test_targets, 'data/newsgroups.proc.test') 

In [10]:
!./fastText-0.1.0/fastText supervised -input data/newsgroups.proc.train -output data/model_newsgroup_proc

Read 2M words
Number of words:  134132
Number of labels: 20
Progress: 100.0%  words/sec/thread: 3037630  lr: 0.000000  loss: 2.974690  eta: 0h0m 


In [11]:
!./fastText-0.1.0/fastText test data/model_newsgroup_proc.bin data/newsgroups.proc.val

N	1314
P@1	0.121
R@1	0.121
Number of examples: 1314


### 3.2 Train More Epochs

We see a tiny improvement, but the model is still bad. Let's adjust the hyperparameters of the model.
The fastText library uses 5 training epochs by default, which is not enough to learn our data.
Let's try increasing the number of epochs to 30.

**It is important to note that the two models above aren't strictly comparable.**

Each model is randomly initialized at the beginning of the training. So, every time you re-train the model, you will notice that the precision and recall are different.
In practice, it's a good idea to train the model with different initializations at least 5 times, and report the min, max, mean, and median stats.
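
A minimal sketch of that reporting step is shown below. The helper `summarize_runs` is a hypothetical name, and the scores are placeholder values made up for illustration, not results measured in this notebook.

```python
import statistics


def summarize_runs(scores):
    """Return min, max, mean, and median of a metric over repeated runs."""
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
    }


# Placeholder P@1 values from five hypothetical re-trainings (NOT real results)
example_scores = [0.10, 0.12, 0.11, 0.13, 0.12]
summary = summarize_runs(example_scores)
for name, value in summary.items():
    print("{}: {:.3f}".format(name, value))
```

In practice, each score would come from re-running the train and test commands and reading the reported P@1.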

In [12]:
!./fastText-0.1.0/fastText supervised -input data/newsgroups.proc.train -output data/model_newsgroup_ep30 -epoch 30
!./fastText-0.1.0/fastText test data/model_newsgroup_ep30.bin data/newsgroups.proc.val

Read 2M words
Number of words:  134132
Number of labels: 20
Progress: 100.0%  words/sec/thread: 3080700  lr: 0.000000  loss: 1.170081  eta: 0h0m 
N	1314
P@1	0.791
R@1	0.791
Number of examples: 1314


Great! A huge improvement.  
The learning rate dictates how fast a model learns. For supervised training, fastText's default is 0.1. A model converges faster with a larger learning rate, though a larger learning rate doesn't always mean a better model.
Let's adjust it as well.

In [13]:
!./fastText-0.1.0/fastText supervised -input data/newsgroups.proc.train -output data/model_newsgroup_biglr -epoch 30 -lr 0.5
!./fastText-0.1.0/fastText test data/model_newsgroup_biglr.bin data/newsgroups.proc.val

Read 2M words
Number of words:  134132
Number of labels: 20
Progress: 100.0%  words/sec/thread: 3033714  lr: 0.000000  loss: 0.223415  eta: 0h0m 
N	1314
P@1	0.87
R@1	0.87
Number of examples: 1314


Nice, the results improve!

### 3.3 Using Bag-of-Ngrams

Now, instead of using **bags of words**, let's try using **bags of N-grams**. We'll use **Bigrams (N=2)** here.  
N-grams provide a sense of word order. 

Sentence: `"I love eating pizza"`  
Bigrams for the above sentence: `"I love"`, `"love eating"`, `"eating pizza"`.  
Because neighbouring N-grams overlap, they preserve local word order, which a bag of single words throws away.
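
The bigram example can be reproduced with a few lines of Python. This is only an illustrative sketch with a hypothetical helper `word_ngrams` and plain whitespace tokenization; it does not reproduce how fastText handles N-grams internally.

```python
def word_ngrams(sentence, n):
    """Return the list of word N-grams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    # Slide a window of n tokens over the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams("I love eating pizza", 2))
# ['I love', 'love eating', 'eating pizza']
```

A sentence shorter than `n` words simply yields an empty list.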

In [14]:
!./fastText-0.1.0/fastText supervised -input data/newsgroups.proc.train -output data/model_newsgroup_ngram \
-epoch 30 -lr 0.5 -wordNgrams 2
!./fastText-0.1.0/fastText test data/model_newsgroup_ngram.bin data/newsgroups.proc.val

Read 2M words
Number of words:  134132
Number of labels: 20
Progress: 100.0%  words/sec/thread: 1370155  lr: 0.000000  loss: 0.420135  eta: 0h0m 
N	1314
P@1	0.868
R@1	0.868
Number of examples: 1314


Using N-grams doesn't necessarily improve prediction performance, though: here the bigram model scores 0.868, slightly below the 0.87 we got without them. One more experiment: let's try 4-grams.

In [15]:
!./fastText-0.1.0/fastText supervised -input data/newsgroups.proc.train -output data/model_newsgroup_4gram \
-epoch 30 -lr 0.5 -wordNgrams 4
!./fastText-0.1.0/fastText test data/model_newsgroup_4gram.bin data/newsgroups.proc.val

Read 2M words
Number of words:  134132
Number of labels: 20
Progress: 100.0%  words/sec/thread: 592167  lr: 0.000000  loss: 0.710420  eta: 0h0m 
N	1314
P@1	0.842
R@1	0.842
Number of examples: 1314


You can find the other hyperparameters you can adjust in the fastText repo: https://github.com/facebookresearch/fastText/blob/master/README.md

After we have chosen the best model based on validation performance, we can test how it performs on the actual test set.  
Be aware: ***never*** tune your model on the test set!
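
The selection step can be sketched as follows, using the validation P@1 values reported in the cells above. The dictionary is typed in by hand for illustration; keys are the model names passed to `-output`.

```python
# Validation P@1 of each model trained in this notebook (copied from the outputs above)
val_p_at_1 = {
    "model_newsgroup": 0.115,
    "model_newsgroup_proc": 0.121,
    "model_newsgroup_ep30": 0.791,
    "model_newsgroup_biglr": 0.87,
    "model_newsgroup_ngram": 0.868,
    "model_newsgroup_4gram": 0.842,
}

# Pick the model with the highest validation score; the test set plays no role here
best_model = max(val_p_at_1, key=val_p_at_1.get)
print(best_model)  # model_newsgroup_biglr
```

Given the run-to-run variation mentioned earlier, the gap between 0.87 and 0.868 is small enough that the ranking of the top two models could change on a re-run.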

In [21]:
!./fastText-0.1.0/fastText test data/model_newsgroup_biglr.bin data/newsgroups.proc.test

N	7532
P@1	0.766
R@1	0.766
Number of examples: 7532
