Download data from: http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset


In [1]:
from sklearn.datasets import fetch_20newsgroups
import spacy
import string

# To be used for pre-processing of data
tokenizer = spacy.load('en_core_web_sm')
punctuations = string.punctuation

# Load data
First, let's load the dataset from sklearn. 

In [2]:
newsgroup_train = fetch_20newsgroups(subset='train')
newsgroup_test = fetch_20newsgroups(subset='test') # we will use it later

In [3]:
train_split = 10000
train_data = newsgroup_train.data[:train_split]
train_targets = newsgroup_train.target[:train_split]

val_data = newsgroup_train.data[train_split:]
val_targets = newsgroup_train.target[train_split:]

test_data = newsgroup_test.data
test_targets = newsgroup_test.target

print("Train dataset size is {}".format(len(train_data)))
print("Val dataset size is {}".format(len(val_data)))
print("Test dataset size is {}".format(len(test_data)))

Train dataset size is 10000
Val dataset size is 1314
Test dataset size is 7532


The fastText library takes a file as input and learns a classification model from it.
Each line of the input file should have this format: `__label__[class] [Text]`  
We will prepare the train, validation, and test files in this format.

In [4]:
def create_newsgroup_file(data, targets, outfile_name):
    with open(outfile_name, 'w') as fout:
        for i, sent in enumerate(data):
            line = "__label__" + str(targets[i]) + " " + sent.replace('\n', ' ') + "\n"
            fout.write(line)
            

create_newsgroup_file(train_data, train_targets, 'newsgroups.train') 
create_newsgroup_file(val_data, val_targets, 'newsgroups.val') 
create_newsgroup_file(test_data, test_targets, 'newsgroups.test') 

### Let's check what the file we created looks like

In [6]:
!head -2 newsgroups.train

__label__7 From: lerxst@wam.umd.edu (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15   I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is  all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail.  Thanks, - IL    ---- brought to you by your neighborhood Lerxst ----     
__label__4 From: guykuo@carson.u.washington.edu (Guy Kuo) Subject: SI Clock Poll - Final Call Summary: Final call for SI clock reports Keywords: SI,acceleration,clock,upgrade Article-I.D.: shelley.1qvfo9INNc3s Organization: University of Washington Lines: 11 NNTP-Po
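
As a further sanity check, the labels can be read back from a file in this format and counted. The sketch below is not part of the original notebook: `count_labels` is a hypothetical helper, and it is demonstrated on a tiny made-up sample file rather than on `newsgroups.train`.

```python
from collections import Counter
import os
import tempfile


def count_labels(path, prefix="__label__"):
    """Count how often each label occurs in a fastText-formatted file."""
    counts = Counter()
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            parts = line.split(maxsplit=1)
            if not parts:
                continue  # skip empty lines
            first_token = parts[0]
            if first_token.startswith(prefix):
                counts[first_token[len(prefix):]] += 1
    return counts


# Tiny made-up sample, for illustration only
sample_lines = [
    "__label__7 what car is this\n",
    "__label__4 final call for clock reports\n",
    "__label__7 engine specs wanted\n",
]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.writelines(sample_lines)
    sample_path = tmp.name

label_counts = count_labels(sample_path)
os.remove(sample_path)
print(label_counts.most_common())
```

Pointing `count_labels` at the real training file would show whether all 20 classes are present and roughly how balanced they are.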

### Install FastText if you haven't! 
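Use the following commands to download and build fastText v0.1.0.
Note that `make` names the resulting binary `fasttext` (lowercase), so on a case-sensitive file system use that spelling in the commands below.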
```
wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
unzip v0.1.0.zip
cd fastText-0.1.0
make
```

### Let's start training the fastText classifier and check its performance on the validation set.

In [10]:
# Train fasttext
!./fastText-0.1.0/fastText supervised -input newsgroups.train -output model_newsgroup

Read 2M words
Number of words:  258366
Number of labels: 20
Progress: 100.0%  words/sec/thread: 3067690  lr: 0.000000  loss: 3.000534  eta: 0h0m


In [11]:
# Evaluate it on validation set
!./fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.val

N	1314
P@1	0.101
R@1	0.101
Number of examples: 1314


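Note that fastText reports precision and recall at 1 (P@1 and R@1), not accuracy!  
The **precision** is the fraction of the labels predicted by fastText that are correct.  
The **recall** is the fraction of the real labels that were successfully predicted.  
In this dataset every document has exactly one label and fastText predicts one label per document, so the two numbers are equal (and equal to the accuracy).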
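
To make the two definitions concrete, here is a minimal sketch that computes precision and recall at k from gold and predicted labels. The function name `precision_recall_at_k` and the toy labels are made up for illustration; this is not fastText's own code, only the same counting idea.

```python
def precision_recall_at_k(true_labels, predicted_labels, k=1):
    """Precision@k and recall@k, pooled over all examples.

    true_labels: list of sets with the gold labels of each example.
    predicted_labels: list of lists with ranked predictions per example.
    """
    n_correct = 0
    n_predicted = 0
    n_gold = 0
    for gold, ranked in zip(true_labels, predicted_labels):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in gold)
        n_predicted += len(top_k)
        n_gold += len(gold)
    return n_correct / n_predicted, n_correct / n_gold


# Made-up toy example: four documents, one gold label each
gold = [{"7"}, {"4"}, {"1"}, {"14"}]
pred = [["7"], ["4"], ["2"], ["14"]]

p_at_1, r_at_1 = precision_recall_at_k(gold, pred, k=1)
print(p_at_1, r_at_1)  # 0.75 0.75
```

With one gold label and one prediction per document, both denominators equal the number of documents, which is why the two values match.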

## What a horrible model! Do some preprocessing to make it better!

In [16]:
def preprocess_sent(sent):
    # Replace line breaks with spaces, since fastText reads each sample as one line
    temp_sent = ' '.join(sent.split('\n'))
    tokens = tokenizer(temp_sent)
    # Lowercase tokens; drop those whose text appears in string.punctuation
    processed_toks = [tok.text.lower() for tok in tokens if (tok.text not in punctuations)]

    return ' '.join(processed_toks).strip()
    
    
temp = preprocess_sent(train_data[0])
temp

"from lerxst@wam.umd.edu where 's my thing subject what car is this nntp posting host rac3.wam.umd.edu organization university of maryland college park lines 15    i was wondering if anyone out there could enlighten me on this car i saw the other day it was a 2-door sports car looked to be from the late 60s/ early 70s it was called a bricklin the doors were really small in addition the front bumper was separate from the rest of the body this is   all i know if anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please e mail   thanks il     ---- brought to you by your neighborhood lerxst ----"

In [17]:
def create_newsgroup_file(data, targets, outfile_name):
    with open(outfile_name, 'w') as fout:
        for i, sent in enumerate(data):
            proc_sent = preprocess_sent(sent)
            line = "__label__" + str(targets[i]) + " " + proc_sent + "\n"
            fout.write(line)
            
create_newsgroup_file(train_data, train_targets, 'newsgroups.proc.train') 
create_newsgroup_file(val_data, val_targets, 'newsgroups.proc.val') 
create_newsgroup_file(test_data, test_targets, 'newsgroups.proc.test') 

In [18]:
!./fastText-0.1.0/fastText supervised -input newsgroups.proc.train -output model_newsgroup

Read 2M words
Number of words:  134300
Number of labels: 20
Progress: 100.0%  words/sec/thread: 3659460  lr: 0.000000  loss: 2.972613  eta: 0h0m


In [19]:
!./fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.proc.val

N	1314
P@1	0.117
R@1	0.117
Number of examples: 1314


We see a tiny improvement, but it is still a bad model. Let's adjust the hyperparameters of the model.
The fastText library uses 5 training epochs by default, which is not enough to learn from our data. 
Let's try increasing the number of epochs to 30.

#### It is important to note that the two models above aren't strictly comparable.
Each model is randomly initialized at the beginning of training, so every time you re-train the model, the precision and recall will differ.
In practice, it's a good idea to train the model at least 5 times with different initializations and report the min, max, mean, and median of the results.
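
A minimal sketch of how such a summary could be computed once the metric of each run has been collected. The helper `summarize_runs` is hypothetical, and the scores are placeholder values, not results from this notebook.

```python
import statistics


def summarize_runs(scores):
    """Summarize a metric (e.g. P@1) collected from repeated training runs."""
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
    }


# Placeholder values for illustration only, NOT results from this notebook
example_scores = [0.10, 0.12, 0.11, 0.13, 0.09]

summary = summarize_runs(example_scores)
for name, value in summary.items():
    print("{}: {:.3f}".format(name, value))
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two configurations is larger than the run-to-run noise.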

In [20]:
!./fastText-0.1.0/fastText supervised -input newsgroups.proc.train -output model_newsgroup -epoch 30
!./fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.proc.val

Read 2M words
Number of words:  134300
Number of labels: 20
Progress: 100.0%  words/sec/thread: 3587300  lr: 0.000000  loss: 1.237170  eta: 0h0m

Great! A huge improvement. 
The learning rate dictates how fast a model learns. For supervised training, fastText's default is 0.1. A larger learning rate makes the model converge faster, but larger is not always better: if the rate is too high, training can become unstable.
Let's adjust it as well.
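
The `lr` column in the training logs shrinks towards zero as training progresses, which is consistent with a linear decay schedule. Below is a minimal sketch of such a schedule, assuming the rate is simply scaled by the remaining fraction of training. The function name `decayed_lr` is hypothetical and this is an illustration, not fastText's code.

```python
def decayed_lr(initial_lr, progress):
    """Linearly decay the learning rate.

    progress is the fraction of training already done, between 0 and 1.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    return initial_lr * (1.0 - progress)


# The value passed via -lr is therefore only the starting point
for progress in (0.0, 0.25, 0.5, 1.0):
    print("progress {:.0%}: lr = {:.4f}".format(progress, decayed_lr(0.5, progress)))
```

Under this assumption, raising `-lr` increases the step size throughout training, not just at the start.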

In [25]:
!./fastText-0.1.0/fastText supervised -input newsgroups.proc.train -output model_newsgroup -epoch 30 -lr 0.5
!./fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.proc.val

Read 2M words
Number of words:  134300
Number of labels: 20
Progress: 100.0%  words/sec/thread: 3610860  lr: 0.000000  loss: 0.239296  eta: 0h0m

Nice, the results improve! 

Now, instead of using **bags of words**, let's try using **bags of N-grams**. We'll use **bigrams (N=2)** here.  
N-grams capture some local word order, which a plain bag of words discards. 

Sentence: "I love eating pizza"  
Bigrams for the above sentence: "I love", "love eating", "eating pizza".  
Because consecutive bigrams overlap, they retain much of the information needed to piece the sentence back together.
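
The bigrams above can be generated with a few lines of Python. This is only an illustration of which word pairs are formed. `word_ngrams` is a hypothetical helper, not a fastText function, and it splits on whitespace for simplicity.

```python
def word_ngrams(sentence, n=2):
    """Return the word n-grams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("I love eating pizza", n=2)
print(bigrams)  # ['I love', 'love eating', 'eating pizza']
```

A sentence with fewer than `n` words yields an empty list, so very short texts contribute no n-gram features of that order.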

In [35]:
!./fastText-0.1.0/fastText supervised -input newsgroups.proc.train -output model_newsgroup \
-epoch 30 -lr 0.5 -wordNgrams 2
!./fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.proc.val

Read 2M words
Number of words:  134300
Number of labels: 20
Progress: 100.0%  words/sec/thread: 1577546  lr: 0.000000  loss: 0.394258  eta: 0h0m

You can find the other hyperparameters you can adjust in the fastText repo: https://github.com/facebookresearch/fastText/blob/master/README.md

After we have chosen the best model based on validation performance, we can test how it performs on the actual test set.  
Remember the lecture? ***Never*** tune your model on the test set!

In [36]:
!./fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.proc.test

N	7532
P@1	0.764
R@1	0.764
Number of examples: 7532


## Exercise
Try training fastText on the IMDB Large Movie Review Dataset and tune the hyperparameters.