In [12]:
#!pip3 install --upgrade pip
#!pip3 install --upgrade torch
#!pip3 install --upgrade torchtext

#some handy modules that I use often
import datetime
import os
import shutil

These magics reload edited modules automatically and let plots display inline in the Jupyter notebook.

In [13]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

Remember, if you encounter a `ModuleNotFoundError`, you can install the missing module from inside the Jupyter notebook using `!pip install yourmodule` (you may need to restart your Jupyter kernel before the install takes effect).

* [dill package index](https://pypi.python.org/pypi/dill)
* [spacy package index](https://pypi.python.org/pypi/spacy)


In [14]:
import dill as dill
import spacy

* [fastai.learner on github](https://github.com/fastai/fastai/blob/master/fastai/learner.py)
* [fastai.rnn_reg on github](https://github.com/fastai/fastai/blob/master/fastai/rnn_reg.py)
* [fastai.rnn_train on github](https://github.com/fastai/fastai/blob/master/fastai/rnn_train.py)
* [fastai.nlp on github](https://github.com/fastai/fastai/blob/master/fastai/nlp.py) - depends on the spacy module. [Solution for OSError: Can't find model 'en'](http://forums.fast.ai/t/lesson-4-oserror-cant-find-model-en/11252)
* [fastai.lm_rnn on github](https://github.com/fastai/fastai/blob/master/fastai/lm_rnn.py)

In [15]:
from fastai.learner import *
from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

* [pytorch/text (torchtext) on github](https://github.com/pytorch/text)
    * [torchtext vocab on github](https://github.com/pytorch/text/blob/master/torchtext/vocab.py)
    * [torchtext.datasets language_modeling on github](https://github.com/pytorch/text/blob/master/torchtext/datasets/language_modeling.py)

In [16]:
import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling


## Language modeling

### IMDB Data description

The [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains a collection of 50,000 reviews from IMDB. The dataset contains an even number of positive and negative reviews. The authors considered only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets, each containing 25,000 labeled reviews.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

However, before we try to classify *sentiment*, we will simply try to create a *language model*; that is, a model that can predict the next word in a sentence. Why? Because our model first needs to understand the structure of English, before we can expect it to recognize positive vs negative sentiment.

So our plan of attack is the same as we used for Dogs v Cats: pretrain a model to do one thing (predict the next word), and fine tune it to do something else (classify sentiment).

Unfortunately, there are no good pretrained language models available to download, so we need to create our own. To follow along with this notebook, we suggest downloading the dataset from [this location](http://files.fast.ai/data/aclImdb.tgz) on files.fast.ai.

First, let's set the project level parameters.

In [38]:
PATH='data/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
MOD_PATH = 'models/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'
MODELS=f'{PATH}{MOD_PATH}'

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH) #TEST AND VALIDATION ARE THE SAME PATH...WHY?

spacy_tok = spacy.load('en')
TEXT = data.Field(lower=True, tokenize="spacy")

def print_project_variables():
    print('PATH',PATH)
    print('TRN_PATH',TRN_PATH)
    print('VAL_PATH',VAL_PATH)
    print('TRN',TRN)
    print('VAL',VAL)
    print('MODELS',MODELS)
    print('FILES',FILES)
    print('spacy_tok',spacy_tok)
    print('TEXT', TEXT)
    
print_project_variables()
    

PATH data/aclImdb/
TRN_PATH train/all/
VAL_PATH test/all/
TRN data/aclImdb/train/all/
VAL data/aclImdb/test/all/
MODELS data/aclImdb/models/
FILES {'train': 'train/all/', 'validation': 'test/all/', 'test': 'test/all/'}
spacy_tok <spacy.lang.en.English object at 0x000001F9F2B28F28>
TEXT <torchtext.data.field.Field object at 0x000001F9D5CFB9B0>


We might find it handy to have a function that accepts a file path and a sample size, and returns a dictionary mapping file names to review text.

In [41]:
def get_words_from_path(sourcepath:str,samplecount:int)->dict:
    #YOUR CODE HERE
            
    return #example_reviews

reviews=get_words_from_path(TRN,10)
if reviews is not None:
    for review in reviews:
        print(review,':',reviews[review],'\n')
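
One possible way to approach this exercise is sketched below, using only the standard library. The name `sample_reviews` is hypothetical, and the sketch assumes each review is a UTF-8 `.txt` file sitting directly inside the given directory.

```python
from pathlib import Path

def sample_reviews(sourcepath: str, samplecount: int) -> dict:
    """Return {file name: review text} for the first `samplecount` .txt files."""
    # Sort so that the sample is the same on every run.
    files = sorted(Path(sourcepath).glob('*.txt'))[:samplecount]
    return {f.name: f.read_text(encoding='utf-8') for f in files}
```

Calling `sample_reviews(TRN, 10)` would then return up to ten reviews keyed by file name, which the loop above can print.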

Now we'll check how many words are in the dataset. We're not returning anything, just looking, so you might want to use shell commands via `!`.

If you are running on Windows or are having trouble with the exclamation point calls, you can [make a python function](https://stackoverflow.com/questions/28153549/creating-a-dictionary-of-word-count-of-multiple-text-files-in-a-directory) instead. Expected VAL value: 5686719 (I get a slightly higher value of 5712879 from my python code).

In [44]:
def word_count(filename:str)->int:
    fcount=0
    #YOUR CODE HERE
    
    return fcount

def directory_word_count(directory:str)->int:
    dcount=0
    #YOUR CODE HERE
    # expected VAL value: 5686719
    # expected TRN value: 17486581
        
    return dcount
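
A minimal pure-Python sketch of the counting step follows. The function names are hypothetical, and it assumes UTF-8 text files and a "word" defined as any whitespace-separated chunk, so the totals may differ slightly from what a shell word count reports.

```python
from pathlib import Path

def count_words_in_file(filename: str) -> int:
    """Count whitespace-separated words in one text file."""
    with open(filename, encoding='utf-8') as fh:
        return sum(len(line.split()) for line in fh)

def count_words_in_directory(directory: str) -> int:
    """Sum the word counts of every .txt file directly inside `directory`."""
    return sum(count_words_in_file(str(p))
               for p in Path(directory).glob('*.txt'))
```

Reading line by line keeps memory use low even with tens of thousands of review files.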

Before we can analyze text, we must first *tokenize* it. This refers to the process of splitting a sentence into an array of words (or more generally, into an array of *tokens*).
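
To make the idea concrete, here is a deliberately simple regex tokenizer. It is only a sketch (`simple_tokenize` is a made-up helper, not part of spacy or torchtext) and it handles contractions and other edge cases far more crudely than a real tokenizer would.

```python
import re

def simple_tokenize(sentence: str) -> list:
    """Lowercase the text, then split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

tokens = simple_tokenize("I really liked it anyway!")
print(tokens)  # ['i', 'really', 'liked', 'it', 'anyway', '!']
```

Note that the punctuation mark becomes its own token, which is what lets a language model learn where sentences end.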

fastai works closely with torchtext. We create a ModelData object for language modeling by taking advantage of `LanguageModelData`, passing it our torchtext field object, and the paths to our training, test, and validation sets. In this case, we don't have a separate test set, so we'll just use `VAL_PATH` for that too.

As well as the usual `bs` (batch size) parameter, we also now have `bptt`; this defines how many words are processed at a time in each row of the mini-batch. More importantly, it defines how many 'layers' we will backprop through. Making this number higher will increase time and memory requirements, but will improve the model's ability to handle long sentences.

We use Pytorch's [torchtext](https://github.com/pytorch/text) library to preprocess our data, telling it to use the wonderful [spacy](https://spacy.io/) library to handle tokenization.

First, we create a torchtext *field*, which describes how to preprocess a piece of text - in this case, we tell torchtext to make everything lowercase, and tokenize it with spacy.

After building our `ModelData` object, it automatically fills the `TEXT` object with a very important attribute: `TEXT.vocab`. This is a *vocabulary*, which stores which words (or *tokens*) have been seen in the text, and how each word will be mapped to a unique integer id. We'll need to use this information again later, so we save it.

*(Technical note: python's standard `pickle` library can't handle this correctly, so at the top of this notebook we imported the `dill` library to use instead)*.
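
The vocabulary idea can be sketched in plain Python. This is an illustration under assumptions rather than torchtext's implementation: `build_vocab` and `numericalize` are hypothetical helpers, and the `itos`/`stoi` names simply mirror the attributes that `TEXT.vocab` exposes.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=('<unk>', '<pad>')):
    """Map each sufficiently frequent token to an integer id."""
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically so ids are reproducible.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept                   # id -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> id
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace tokens by ids; tokens outside the vocabulary become <unk>."""
    unk = stoi['<unk>']
    return [stoi.get(t, unk) for t in tokens]

tokens = 'the movie was good and the ending was great'.split()
itos, stoi = build_vocab(tokens, min_freq=2)
ids = numericalize(tokens, stoi)
print(itos)  # ['<unk>', '<pad>', 'the', 'was']
print(ids)   # [2, 0, 3, 0, 0, 2, 0, 3, 0]
```

With `min_freq=2`, every word seen only once collapses to the `<unk>` id, which is why the same saved vocabulary has to be reused later: a different mapping would silently change the meaning of every id.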

Let's create a function that accepts a `data.Field` object and returns a `LanguageModelData` object together with the `data.Field` object updated with a vocabulary. The function should also accept the path that we'll need to generate the `LanguageModelData` object, and the file to save the updated vocabulary to (so we don't have to regenerate it).

In [49]:
def generate_vocab(path:str,
                   text:data.Field,
                   files:dict,
                   vocabsavepath:str=None,
                   printsamples=True)->(LanguageModelData, data.Field):
    bs=64
    bptt=70
    min_freq=10
    
    md=None
    
    #YOUR CODE HERE
        
    return md,text

In [50]:
vocabpath=MODELS+'TEXT.pkl'
md,TEXT=generate_vocab(path=PATH,text=TEXT,files=FILES,vocabsavepath=vocabpath)

Our `LanguageModelData` object will create batches with 64 columns (that's our batch size), and varying sequence lengths of around 70 tokens (that's our `bptt` parameter - *backprop through time*).

Each batch also contains the exact same data as labels, but one word later in the text - since we're trying to always predict the next word. The labels are flattened into a 1d array.
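
The batch layout can be sketched with plain lists. This is a simplified illustration and not fastai's loader: `batchify` and `get_batch` are hypothetical names, and the window length here is fixed at `bptt` instead of varying around it.

```python
def batchify(ids, bs):
    """Split a flat token-id stream into `bs` parallel columns (remainder dropped)."""
    n = len(ids) // bs
    cols = [ids[i * n:(i + 1) * n] for i in range(bs)]
    # rows[t][b] is the token at time step t in column b
    return [[col[t] for col in cols] for t in range(n)]

def get_batch(rows, start, bptt):
    """Return (inputs, flattened targets) for a window beginning at row `start`."""
    seq_len = min(bptt, len(rows) - 1 - start)
    x = rows[start:start + seq_len]
    # Targets are the same window shifted one step later, flattened to 1d.
    y = [tok for row in rows[start + 1:start + 1 + seq_len] for tok in row]
    return x, y

rows = batchify(list(range(20)), bs=2)
x, y = get_batch(rows, start=0, bptt=3)
print(x)  # [[0, 10], [1, 11], [2, 12]]
print(y)  # [1, 11, 2, 12, 3, 13]
```

Each column of `x` is a contiguous stretch of the text, and every entry of `y` is the token that follows the corresponding entry of `x`.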

In [51]:
#print(next(iter(md.trn_dl)))

### Train

We have a number of parameters to set - we'll learn more about these later, but you should find these values suitable for many problems. Researchers have found that large amounts of *momentum* (which we'll learn about later) don't work well with these kinds of *RNN* models, so we'll want to create a version of the *Adam* optimizer with less momentum than the default of `0.9`.
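
To get a feel for what the first `betas` value controls, the sketch below computes Adam's running average of the gradient for two settings. It is a simplification (no bias correction and no second-moment term), and `first_moment` is a made-up helper name.

```python
def first_moment(grads, beta1):
    """Exponential moving average of gradients, as in Adam's first moment."""
    m, history = 0.0, []
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        history.append(m)
    return history

# The gradient points one way for five steps, then flips sign.
grads = [1.0] * 5 + [-1.0] * 5
slow = first_moment(grads, beta1=0.9)  # default momentum
fast = first_moment(grads, beta1=0.7)  # reduced momentum
print(round(slow[-1], 3), round(fast[-1], 3))
```

With `beta1=0.7` the average follows the sign flip much more closely, whereas `beta1=0.9` is still dominated by the older gradients.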

fastai uses a variant of the state of the art [AWD LSTM Language Model](https://arxiv.org/abs/1708.02182) developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through [Dropout](https://en.wikipedia.org/wiki/Convolutional_neural_network#Dropout). There is no simple way known (yet!) to find the best values of the dropout parameters below - you just have to experiment...

However, the other parameters (`alpha`, `beta`, and `clip`) shouldn't generally need tuning.

Let's create functions to load, train and save our model. Add whatever parameters you find helpful. The `get_learner` function stores the values we use to build our learner from a `LanguageModelData` object. The `train_learner` function should offer the option to either load or train a model using the passed-in values. Suggested values are set for you.

In [52]:
def get_learner(source_model:LanguageModelData):
    learner=None
    
    #dropout will likely need tuning
    dropouti=0.05
    dropout=0.05
    wdrop=0.1
    dropoute=0.02
    dropouth=0.05
    
    
    em_sz = 200  # size of each embedding vector
    nh = 500     # number of hidden activations per layer
    nl = 3       # number of layers
    betas=(0.7, 0.99) #used to create a version of the Adam optimizer with less momentum than its default of 0.9.
    opt_fn = partial(optim.Adam, betas=betas)
    
    #YOUR CODE HERE
    
    return learner

def train_learner(source_model:LanguageModelData):
    
    learner=get_learner(source_model)
    
    #these shouldn't need tuning
    alpha=2
    beta=1
    clip=0.3
    
    #YOUR CODE HERE
        
    return learner
        
    

In [53]:
learner=train_learner(source_model=md)

As you can see below, I gradually tuned the language model in a few stages. I possibly could have trained it further (it wasn't yet overfitting), but I didn't have time to experiment more. Maybe you can see if you can train it to a better accuracy! (I used `lr_find` to find a good learning rate, but didn't save the output in this notebook. Feel free to try running it yourself now.)

In [129]:
#Takes about 30 minutes to run:
#learner.lr_find()

 96%|█████████▌| 4398/4583 [22:58<00:57,  3.19it/s, loss=nan]   

In [55]:
learner=train_learner(source_model=md)
print(type(learner))

<class 'NoneType'>


In the sentiment analysis section, we'll just need half of the language model - the *encoder*, so we save that part.

In [57]:
#YOUR CODE HERE

Language modeling accuracy is generally measured using the metric *perplexity*, which is simply `exp()` of the loss function we used.

In [145]:
math.exp(4.165)

64.3926824434624
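
As a worked example of that relationship, the sketch below assumes the loss is the mean cross-entropy (negative log-likelihood) per token; `perplexity` is a hypothetical helper that takes the probability the model assigned to each correct token.

```python
import math

def perplexity(probs_of_correct_tokens):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in probs_of_correct_tokens]
    mean_loss = sum(nll) / len(nll)
    return math.exp(mean_loss)

# A model that always spreads its guess evenly over 50 words
uniform = perplexity([1 / 50] * 4)
print(round(uniform, 6))  # 50.0
```

Read this way, a loss of 4.165 (perplexity of about 64) means the model is roughly as uncertain as if it were choosing evenly among 64 words at each step.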

### Test

We can play around with our language model a bit to check it seems to be working OK. First, let's create a short bit of text to 'prime' a set of predictions. We'll use our torchtext field to numericalize it so we can feed it to our language model.

In [58]:
def primethemodel(learner):
    ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best part"""
    textpath=''#path to text pkl file
    
    text=None
    s = None
    t=None
    
    #YOUR CODE HERE
    
    return s,t,text

In [62]:
encodername=''#the name of the text sentiment encoder
m=None
original_bs=None

if learner is not None:
    learner.load_encoder(encodername)
    m=learner.model
    original_bs=m[0].bs

s,t,TEXT=primethemodel(learner=learner)


In [64]:
if s is not None and len(s)>0:
    print(s[0])

In [68]:
def get_testing_predictions(m):
    res=None
    # YOUR CODE HERE
    return res

Let's see what the top 10 predictions were for the next word after our short text:

In [69]:
res=get_testing_predictions(m)
if res is not None:
    nexts = torch.topk(res[-1], 10)[1]
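        # print explicitly: an expression nested inside an `if` block is not displayed by the notebook
        print([TEXT.vocab.itos[o] for o in to_np(nexts)])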
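
What the top-k lookup does can be shown without torch. The scores and the tiny vocabulary below are invented for illustration, and `top_k` is a hypothetical stand-in for `torch.topk(...)[1]`.

```python
def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

itos = ['<unk>', 'the', 'was', 'is', 'of']  # id -> token
scores = [0.1, 2.3, 1.7, 1.9, 0.4]          # one score per vocabulary entry
best = [itos[i] for i in top_k(scores, 3)]
print(best)  # ['the', 'is', 'was']
```

The indices returned by the top-k step are vocabulary ids, so mapping them back through `itos` is what turns them into readable words.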

...and let's see if our model can generate a bit more text all by itself!

In [79]:

def generate_text(res,ss,samplesize):
    print(ss,"\n")
    for i in range(samplesize):
        if res is not None:
            print(res)
        #YOUR CODE HERE
    print('...')

In [80]:
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best part"""
generate_text(res,ss,50)


. So, it wasn't quite was I was expecting, but I really liked it anyway! The best part 

...
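
The generation loop itself is simple: predict a next token, append it, and feed the result back in. The sketch below uses an invented table of next-word probabilities in place of the trained model, and it samples from those probabilities; taking the single most likely word each time would be an equally valid choice.

```python
import random

def generate(next_token_probs, start, n_tokens, seed=0):
    """Repeatedly sample a next token and feed it back in as the new context."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n_tokens):
        candidates = next_token_probs.get(out[-1])
        if not candidates:  # no known continuation: stop early
            break
        words, weights = zip(*candidates.items())
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out

# Toy stand-in for a language model: P(next word | current word)
toy_model = {
    'the': {'best': 0.5, 'film': 0.5},
    'best': {'part': 1.0},
    'part': {'was': 1.0},
    'film': {'was': 1.0},
    'was': {'the': 1.0},
}
sample = generate(toy_model, start='the', n_tokens=6)
print(' '.join(sample))
```

A real model conditions on the whole preceding text through its hidden state, not just on the previous word, but the outer loop has the same shape.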


### Sentiment

We'll need to use the saved vocab from the language model, since we need to ensure the same words map to the same IDs.

In [166]:
TEXT = dill.load(open(f'{PATH}models/TEXT_SENTIMENT.pkl','rb'))  # dill, since standard pickle can't handle this object

`sequential=False` tells torchtext that a text field should not be tokenized (in this case, we just want to store the 'positive' or 'negative' single label).

`splits` is a torchtext method that creates train, test, and validation sets. The IMDB dataset is built into torchtext, so we can take advantage of that. Take a look at `lang_model-arxiv.ipynb` to see how to define your own fastai/torchtext datasets.

In [167]:
IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

downloading aclImdb_v1.tar.gz


In [168]:
t = splits[0].examples[0]

In [169]:
t.label, ' '.join(t.text[:16])

('pos',
 'the shining is a weird example of adaptation : it has very little in common with')

fastai can create a ModelData object directly from torchtext splits.

In [171]:
md2 = TextData.from_splits(PATH, splits, m[0].bs)

In [75]:
def getsentimentmodel():
    dropouti=0.05
    dropout=0.05
    wdrop=0.1
    dropoute=0.02
    dropouth=0.05
    clip=25

    em_sz = 200  # size of each embedding vector
    nh = 500     # number of hidden activations per layer
    nl = 3       # number of layers
    betas=(0.7, 0.99) #used to create a version of the Adam optimizer with less momentum than its default of 0.9.
    opt_fn = partial(optim.Adam, betas=betas)
    bs=64
    bptt=70
    min_freq=10    
    
    m3 = None
    #YOUR CODE HERE
    
    return m3

Because we're fine-tuning a pretrained model, we'll use differential learning rates, and also increase the max gradient for clipping, to allow the SGDR to work better.
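
Both ideas can be sketched in a few lines of plain Python. This assumes clipping is done on the overall gradient norm and uses plain SGD updates for brevity; `clip_grad_norm` and `sgd_step_per_group` are hypothetical helpers and the numbers are invented.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most `max_norm`."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]

def sgd_step_per_group(param_groups, grad_groups, lrs):
    """One SGD step where each layer group has its own learning rate."""
    return [[p - lr * g for p, g in zip(params, grads)]
            for params, grads, lr in zip(param_groups, grad_groups, lrs)]

clipped = clip_grad_norm([3.0, 4.0], max_norm=1.0)  # norm 5 is scaled down to 1
# Early (pretrained) group moves slowly, the new final group moves faster.
updated = sgd_step_per_group([[1.0], [1.0]], [[1.0], [1.0]], lrs=[1e-4, 1e-2])
print([round(g, 3) for g in clipped])  # [0.6, 0.8]
```

A larger clipping threshold leaves more gradients untouched, while the smaller learning rates on the early groups protect what the pretrained encoder already knows.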

In [175]:
def finetune(lrs,m3):
    #YOUR CODE HERE
    return m3

In [176]:
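# lrs is not defined elsewhere in this notebook; these per-layer-group rates are assumed example values
lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])
m3=finetune(lrs,getsentimentmodel())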

epoch      trn_loss   val_loss   accuracy                    
    0      0.420445   0.292881   0.875881  



epoch      trn_loss   val_loss   accuracy                    
    0      0.372781   0.257887   0.894271  



[0.2578866, 0.8942708333333333]

Fit our sentiment model using the saved imdb2 cycle.

In [76]:
#YOUR CODE HERE

Load the imdb2 cycle.

In [77]:
#YOUR CODE HERE

Show the accuracy of the model

In [78]:
#YOUR CODE HERE

A recent paper from McCann et al., [Learned in translation: contextualized word vectors](https://einstein.ai/research/learned-in-translation-contextualized-word-vectors), has a handy summary of the latest academic research in solving this IMDB sentiment analysis problem. Many of the latest algorithms shown are tuned for this specific problem.

![image.png](attachment:image.png)

As you see, we just got a new state of the art result in sentiment analysis, decreasing the error from 5.9% to 5.5%! You should be able to get similarly world-class results on other NLP classification problems using the same basic steps.

There are many opportunities to further improve this, although we won't be able to get to them until part 2 of this course...

### End