# NLP 1: Text Preprocessing and Text Classifier

This notebook is based on fastai's **[Chapter 10](https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb)**.

Please read that chapter before looking at this review.

*I suggest opening this notebook in Colab (where it can be easier to use a GPU).*
*If you want to run it locally, set up the **deep-learning** environment in your terminal with `conda env create -f environment.yml` and activate it in your preferred IDE.*

In [1]:
### FOR COLAB USERS ###
! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

Mounted at /content/gdrive


In [2]:
# ### FOR LOCAL USERS ###
# import fastai
# print(fastai.__version__)

# ! pip install -Uqq fastbook
# import fastbook
# fastbook.setup_book()

In [3]:
from fastbook import *
from IPython.display import display,HTML

from fastai.text.all import *

# Data

In [4]:
path = untar_data(URLs.IMDB)

In [5]:
# get text files
files = get_text_files(path, folders = ['train', 'test', 'unsup'])
len(files)

100000

In [6]:
# sample -- one file (first 75 characters for length)
sample_text = files[0].open().read();
sample_text[:75]

'I wanted to like Magnolia. The plot reminded me of Grand Canyon (which I li'

In [7]:
# first 2000 texts
txts = L(o.open().read() for o in files[:2000])
print(len(txts))

2000


In [8]:
# same sample as above
txts[0]

"I wanted to like Magnolia. The plot reminded me of Grand Canyon (which I liked). 4 different lives/stories that come together at the end but Magnolia took a wrong turn halfway through the movie and I was lost. I almost turned it off right then and there but I felt I should hang in there until the end, little did I know it would be another torturous 1 1/2 hours. Thank god I rented instead of seeing it in the theatre. I almost screamed out in frustration after 2 hours. The biggest kick in the pants was the ending frog scene. My DVD player still hasn't forgiven me and I don't blame it one bit. It was a unique movie, but a bad, boring, and pointless movie."

# Text Preprocessing

## 1. Tokenization
- Token: one element of a list created by the tokenization process. It could be a word, part of a word (a subword), or a single character.
- Tokenization: Convert the text into a list of tokens
- 3 main approaches: word-based, subword-based, character-based


### Word-based tokenization
- Split a sentence on spaces, as well as applying language-specific rules to try to separate parts of meaning even when there are no spaces (such as turning "don't" into "do n't")
- Generally, punctuation marks are also split into separate tokens

#### SpaCy

In [9]:
# spacy tokenizer (implemented in FastAI)
spacy = WordTokenizer()

In [10]:
# sample sentence -- get tokens
tokens = first(spacy(['The U.S. dollar $1 is $1.00.']))
print(tokens)

# turn tokens into a collection
coll_all = coll_repr(tokens)
print(coll_all)

# print first 3 tokens
coll_3 = coll_repr(tokens, 3)
print(coll_3)

['The', 'U.S.', 'dollar', '$', '1', 'is', '$', '1.00', '.']
(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']
(#9) ['The','U.S.','dollar'...]


In [11]:
# get tokens from a sample document in our corpus
# note: you need to pass the sample text as a list
tokens = first(spacy([sample_text]))
print(tokens)

# turn tokens into a collection
coll_all = coll_repr(tokens)
print(coll_all)

# print first 3 tokens
coll_3 = coll_repr(tokens, 3)
print(coll_3)

['I', 'wanted', 'to', 'like', 'Magnolia', '.', 'The', 'plot', 'reminded', 'me', 'of', 'Grand', 'Canyon', '(', 'which', 'I', 'liked', ')', '.', '4', 'different', 'lives', '/', 'stories', 'that', 'come', 'together', 'at', 'the', 'end', 'but', 'Magnolia', 'took', 'a', 'wrong', 'turn', 'halfway', 'through', 'the', 'movie', 'and', 'I', 'was', 'lost', '.', 'I', 'almost', 'turned', 'it', 'off', 'right', 'then', 'and', 'there', 'but', 'I', 'felt', 'I', 'should', 'hang', 'in', 'there', 'until', 'the', 'end', ',', 'little', 'did', 'I', 'know', 'it', 'would', 'be', 'another', 'torturous', '1', '1/2', 'hours', '.', 'Thank', 'god', 'I', 'rented', 'instead', 'of', 'seeing', 'it', 'in', 'the', 'theatre', '.', 'I', 'almost', 'screamed', 'out', 'in', 'frustration', 'after', '2', 'hours', '.', 'The', 'biggest', 'kick', 'in', 'the', 'pants', 'was', 'the', 'ending', 'frog', 'scene', '.', 'My', 'DVD', 'player', 'still', 'has', "n't", 'forgiven', 'me', 'and', 'I', 'do', "n't", 'blame', 'it', 'one', 'bit', '

#### FastAI tokenizer

Special tokens:
- `xxbos` = beginning of stream (marks the start of a new text/document, not of each sentence)
- `xxmaj` = next token starts with a capital letter
- `xxunk` = indicates the word is unknown

Default rules:
- `fix_html` = replaces special HTML characters with a readable version (IMDb reviews have quite a few of these)
- `replace_rep` = replaces any character repeated three times or more with a special token for repetition (`xxrep`), the number of times it's repeated, then the character
- `replace_wrep` = replaces any word repeated three times or more with a special token for word repetition (`xxwrep`), the number of times it's repeated, then the word
- `spec_add_spaces` = adds spaces around / and #
- `rm_useless_spaces` = removes all repetitions of the space character
- `replace_all_caps` = lowercases a word written in all caps and adds a special token for all caps (`xxup`) in front of it
- `replace_maj` = lowercases a capitalized word and adds a special token for capitalized (`xxmaj`) in front of it
- `lowercase` = lowercases all text and adds a special token at the beginning (`xxbos`) and/or the end (`xxeos`)
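
To make the repetition rules above more concrete, here is a minimal sketch of what a rule like `replace_rep` does, written with the standard `re` module. It is a simplified stand-in rather than fastai's source code, and the name `replace_rep_sketch` is hypothetical.

```python
import re

# Hypothetical, simplified stand-in for a rule like fastai's `replace_rep`.
# Matches any non-space character followed by 2+ copies of itself (3+ in total).
_REP = re.compile(r"(\S)(\1{2,})")

def replace_rep_sketch(text):
    def _sub(match):
        char, rest = match.groups()
        # special token, then the repeat count, then the repeated character
        return f" xxrep {len(rest) + 1} {char} "
    return _REP.sub(_sub, text)

print(replace_rep_sketch("this movie was sooooo bad!!!").split())
```

Characters repeated only twice (as in "good") are left alone, since the rule only fires at three or more repetitions.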


In [12]:
# fastai tokenizer -- wraps the spacy tokenizer we already defined and adds special tokens
tkn = Tokenizer(spacy)

In [13]:
# see the default rules of the tokenizer
defaults.text_proc_rules

# replace_rep??

[<function fastai.text.core.fix_html(x)>,
 <function fastai.text.core.replace_rep(t)>,
 <function fastai.text.core.replace_wrep(t)>,
 <function fastai.text.core.spec_add_spaces(t)>,
 <function fastai.text.core.rm_useless_spaces(t)>,
 <function fastai.text.core.replace_all_caps(t)>,
 <function fastai.text.core.replace_maj(t)>,
 <function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]

In [14]:
# get tokens from the sample text
# notice that you don't pass the sample text as a list here
tokens = tkn(sample_text)
print(tokens)

# turn tokens into a collection
coll_all = coll_repr(tokens)
print(coll_all)

# print first 3 tokens
coll_3 = coll_repr(tokens, 3)
print(coll_3)

['xxbos', 'i', 'wanted', 'to', 'like', 'xxmaj', 'magnolia', '.', 'xxmaj', 'the', 'plot', 'reminded', 'me', 'of', 'xxmaj', 'grand', 'xxmaj', 'canyon', '(', 'which', 'i', 'liked', ')', '.', '4', 'different', 'lives', '/', 'stories', 'that', 'come', 'together', 'at', 'the', 'end', 'but', 'xxmaj', 'magnolia', 'took', 'a', 'wrong', 'turn', 'halfway', 'through', 'the', 'movie', 'and', 'i', 'was', 'lost', '.', 'i', 'almost', 'turned', 'it', 'off', 'right', 'then', 'and', 'there', 'but', 'i', 'felt', 'i', 'should', 'hang', 'in', 'there', 'until', 'the', 'end', ',', 'little', 'did', 'i', 'know', 'it', 'would', 'be', 'another', 'torturous', '1', '1', '/', '2', 'hours', '.', 'xxmaj', 'thank', 'god', 'i', 'rented', 'instead', 'of', 'seeing', 'it', 'in', 'the', 'theatre', '.', 'i', 'almost', 'screamed', 'out', 'in', 'frustration', 'after', '2', 'hours', '.', 'xxmaj', 'the', 'biggest', 'kick', 'in', 'the', 'pants', 'was', 'the', 'ending', 'frog', 'scene', '.', 'xxmaj', 'my', 'xxup', 'dvd', 'player',

In [15]:
# another example
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"

In [16]:
# tokenize 200 documents in the corpus
toks200 = txts[:200].map(tkn)

print(toks200[0])
print(toks200[1])

['xxbos', 'i', 'wanted', 'to', 'like', 'xxmaj', 'magnolia', '.', 'xxmaj', 'the', 'plot', 'reminded', 'me', 'of', 'xxmaj', 'grand', 'xxmaj', 'canyon', '(', 'which', 'i', 'liked', ')', '.', '4', 'different', 'lives', '/', 'stories', 'that', 'come', 'together', 'at', 'the', 'end', 'but', 'xxmaj', 'magnolia', 'took', 'a', 'wrong', 'turn', 'halfway', 'through', 'the', 'movie', 'and', 'i', 'was', 'lost', '.', 'i', 'almost', 'turned', 'it', 'off', 'right', 'then', 'and', 'there', 'but', 'i', 'felt', 'i', 'should', 'hang', 'in', 'there', 'until', 'the', 'end', ',', 'little', 'did', 'i', 'know', 'it', 'would', 'be', 'another', 'torturous', '1', '1', '/', '2', 'hours', '.', 'xxmaj', 'thank', 'god', 'i', 'rented', 'instead', 'of', 'seeing', 'it', 'in', 'the', 'theatre', '.', 'i', 'almost', 'screamed', 'out', 'in', 'frustration', 'after', '2', 'hours', '.', 'xxmaj', 'the', 'biggest', 'kick', 'in', 'the', 'pants', 'was', 'the', 'ending', 'frog', 'scene', '.', 'xxmaj', 'my', 'xxup', 'dvd', 'player',

### Subword-based tokenization
- Split words into smaller parts, based on the most commonly occurring substrings. For instance, "occasion" might be tokenized as "o c ca sion."
- Provides a way to easily scale between character tokenization (i.e., using a small subword vocab) and word tokenization (i.e., using a large subword vocab), and handles every human language without needing language-specific algorithms to be developed
- It can even handle other "languages" such as genomic sequences or MIDI music notation
- Picking a subword vocab size represents a compromise: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn
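
The vocab-size trade-off can be illustrated with a toy example. The sketch below uses a greedy longest-match split over a hand-written vocab; this is an assumption made for illustration and is not the algorithm SentencePiece uses (which learns its vocab from the corpus). The function name `greedy_subwords` and both vocabs are hypothetical.

```python
def greedy_subwords(word, vocab):
    """Split `word` into the longest pieces found in `vocab`, left to right.
    Single characters are always allowed as a fallback."""
    pieces, i = [], 0
    while i < len(word):
        # try the longest candidate first, shrinking down to one character
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

small_vocab = {"re", "ed", "in"}                     # toy "small" subword vocab
large_vocab = small_vocab | {"remind", "reminded"}   # toy "large" subword vocab

print(greedy_subwords("reminded", small_vocab))  # more, shorter pieces
print(greedy_subwords("reminded", large_vocab))  # fewer, longer pieces
```

With the larger vocab the same word needs fewer tokens, at the cost of more vocab entries (and so more embedding rows) to learn.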

In [17]:
def subword(sz, txt):
  sp = SubwordTokenizer(vocab_sz=sz) #sz = size of vocab we want to create
  sp.setup(txts) # read the documents and find the common sequences to make the vocab
  return ' '.join(first(sp([txt]))[:40])

In [18]:
# get vocab
  # ▁ represents a space in the original text
subword(1000, sample_text)

'▁I ▁want ed ▁to ▁like ▁Ma g no li a . ▁The ▁plot ▁re min d ed ▁me ▁of ▁G r and ▁Ca n y on ▁( w hi ch ▁I ▁like d ). ▁ 4 ▁different ▁live s /'

In [19]:
# smaller vocab
  # If we use a smaller vocab, then each token will represent fewer characters,
  # and it will take more tokens to represent a sentence
subword(200, sample_text)

'▁I ▁w an t ed ▁to ▁ li k e ▁ M a g n o li a . ▁The ▁p lo t ▁ re m in d ed ▁me ▁of ▁ G r an d ▁ C an y'

In [20]:
# larger vocab
  # most common English words will end up in the vocab themselves,
  # and we will not need as many to represent a sentence
subword(10000, sample_text)

'▁I ▁wanted ▁to ▁like ▁Magnolia . ▁The ▁plot ▁remind ed ▁me ▁of ▁Grand ▁Canyon ▁( which ▁I ▁liked ). ▁4 ▁different ▁lives / s tor ies ▁that ▁come ▁together ▁at ▁the ▁end ▁but ▁Magnolia ▁to ok ▁a ▁wrong ▁turn ▁half'

### Character-based tokenization
- Split a sentence into its individual characters

In [21]:
# not sure how to do this in fastai
# subword(26, sample_text)
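
A character-level split does not need a library at all. The snippet below is a plain-Python sketch (not a fastai API); the string `sample` is a stand-in for the first sentence of the review used above.

```python
sample = "I wanted to like Magnolia."   # stand-in for the start of sample_text

# every character (including spaces and punctuation) becomes one token
char_tokens = list(sample)
print(char_tokens[:8])

# the vocab is just the set of distinct characters
char_vocab = sorted(set(char_tokens))
print(len(char_tokens), "tokens,", len(char_vocab), "distinct characters")
```

The vocab stays tiny, but the sequence gets much longer than with word or subword tokens.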

## 2. Numericalization
- Goal: mapping tokens to integers
- Make a list of all of the unique tokens that appear (the vocab)
- Convert each token into a number, by replacing the token with its index in the vocab


In [22]:
# recall we already tokenized these above
tkn = Tokenizer(spacy)
toks200 = txts[:200].map(tkn)

In [23]:
# initialize numericalize
num = Numericalize()

### Set up the vocab
- Special tokens appear first, and then every word appears once, in frequency order.
- Defaults: `min_freq`=3, `max_vocab`=60000.
- `max_vocab`=60000 results in fastai replacing all words other than the most common 60,000 with a special unknown word token, `xxunk`
- `min_freq`=3 means that any word appearing fewer than three times is replaced with `xxunk`
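
The vocab rules above can be sketched in plain Python. This is a simplified illustration, not fastai's `Numericalize` implementation: the helper names `build_vocab` and `numericalize` are hypothetical, only four of the special tokens are included, and `min_freq=2` is used so the toy corpus produces a non-trivial vocab.

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos"]  # subset of the special tokens

def build_vocab(token_lists, min_freq=3, max_vocab=60000):
    counts = Counter(tok for toks in token_lists for tok in toks)
    # most frequent first; drop rare tokens and anything already in SPECIALS
    frequent = [tok for tok, n in counts.most_common(max_vocab)
                if n >= min_freq and tok not in SPECIALS]
    return SPECIALS + frequent

def numericalize(tokens, vocab):
    index = {tok: i for i, tok in enumerate(vocab)}
    # anything outside the vocab maps to index 0, i.e. xxunk
    return [index.get(tok, 0) for tok in tokens]

docs = [["xxbos", "the", "movie", "was", "bad"],
        ["xxbos", "the", "movie", "was", "good"],
        ["xxbos", "the", "plot", "was", "bad"]]
vocab = build_vocab(docs, min_freq=2)
nums = numericalize(["xxbos", "the", "plot", "was", "bad"], vocab)
print(vocab)
print(nums, [vocab[i] for i in nums])
```

Here "plot" appears only once, so it falls below `min_freq` and comes back as `xxunk` when the indices are mapped back to the vocab.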

In [24]:
# setup the vocab for the 200 text documents
num.setup(toks200)

In [25]:
# print out the first 20 tokens in the vocab
coll_repr(num.vocab,20)

"(#2000) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','and','a','of','to','it','is','i','in'...]"

### Numericalize the tokens
- We can pass the tensor of integers, our numericalized tokens, into the model

In [26]:
# show the first 10 tokens in the sample_text
print("original text:", tokens[:10])

# numericalization
nums = num(tokens)[:10]
print("numericalized version:", nums)

# map the indices back to their vocab entries
# note that "magnolia" has been replaced by the unknown token "xxunk"
print("converting back to vocab:", ' '.join(num.vocab[o] for o in nums))

original text: ['xxbos', 'i', 'wanted', 'to', 'like', 'xxmaj', 'magnolia', '.', 'xxmaj', 'the']
numericalized version: TensorText([  2,  18, 501,  15,  57,   8,   0,  10,   8,   9])
converting back to vocab: xxbos i wanted to like xxmaj xxunk . xxmaj the


## Putting texts into batches
- Batches = break a stream of tokens up into contiguous sections
  - **_batch size_**: the number of contiguous sections (rows) the stream is split into; the sections do not overlap
  - **_sequence length_**: the number of tokens in each row of a batch
- Sequence of words matters
  - We want our language model to read text in order, so that it can efficiently predict what the next word is
  - This means that each new batch should begin precisely where the previous one left off
  - The model needs to maintain a "**_state_**" so that it remembers what it read previously when predicting what comes next

### Single batch

In [27]:
# sample
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
print("number of characters:", len(stream))
print(stream)

number of characters: 408
In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.
Then we will study how we build a language model and train it for a while.


In [28]:
# word-based tokens -- separated by spaces
tokens = tkn(stream)
print("number of tokens:", len(tokens))
print(tokens)

number of tokens: 90
['xxbos', 'xxmaj', 'in', 'this', 'chapter', ',', 'we', 'will', 'go', 'back', 'over', 'the', 'example', 'of', 'classifying', 'movie', 'reviews', 'we', 'studied', 'in', 'chapter', '1', 'and', 'dig', 'deeper', 'under', 'the', 'surface', '.', 'xxmaj', 'first', 'we', 'will', 'look', 'at', 'the', 'processing', 'steps', 'necessary', 'to', 'convert', 'text', 'into', 'numbers', 'and', 'how', 'to', 'customize', 'it', '.', 'xxmaj', 'by', 'doing', 'this', ',', 'we', "'ll", 'have', 'another', 'example', 'of', 'the', 'preprocessor', 'used', 'in', 'the', 'data', 'block', 'xxup', 'api', '.', '\n', 'xxmaj', 'then', 'we', 'will', 'study', 'how', 'we', 'build', 'a', 'language', 'model', 'and', 'train', 'it', 'for', 'a', 'while', '.']


In [29]:
# separate into batches
bs = 6 # break the 90 tokens into 6 contiguous parts (6 rows)
seq_len = 15 # each row should be 15 tokens long

d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
df
# display(HTML(df.to_html(index=False,header=None)))

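|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xxbos | xxmaj | in | this | chapter | , | we | will | go | back | over | the | example | of | classifying |
| 1 | movie | reviews | we | studied | in | chapter | 1 | and | dig | deeper | under | the | surface | . | xxmaj |
| 2 | first | we | will | look | at | the | processing | steps | necessary | to | convert | text | into | numbers | and |
| 3 | how | to | customize | it | . | xxmaj | by | doing | this | , | we | 'll | have | another | example |
| 4 | of | the | preprocessor | used | in | the | data | block | xxup | api | . | \n | xxmaj | then | we |
| 5 | will | study | how | we | build | a | language | model | and | train | it | for | a | while | . |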


### Multiple Batches
- The corpus will be too big to only use one batch
- You need multiple batches
- Note: the order of the text in the mini-batches is preserved

In [30]:
row_len = 15
bs,seq_len = 6,5

#### Batch 1
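- row 1 = tokens 1-5
  - skip tokens 6-15 (they are in row 1 of batches 2 and 3)
- row 2 = tokens 16-20
  - skip tokens 21-30 (they are in row 2 of batches 2 and 3)
- row 3 = tokens 31-35
  - skip tokens 36-45 (they are in row 3 of batches 2 and 3)
- row 4 = tokens 46-50
  - skip tokens 51-60 (they are in row 4 of batches 2 and 3)
- row 5 = tokens 61-65
  - skip tokens 66-75 (they are in row 5 of batches 2 and 3)
- row 6 = tokens 76-80
  - skip tokens 81-90 (they are in row 6 of batches 2 and 3)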


In [31]:
d_tokens = np.array([tokens[i*row_len:i*row_len+seq_len] for i in range(bs)])
pd.DataFrame(d_tokens)

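|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | xxbos | xxmaj | in | this | chapter |
| 1 | movie | reviews | we | studied | in |
| 2 | first | we | will | look | at |
| 3 | how | to | customize | it | . |
| 4 | of | the | preprocessor | used | in |
| 5 | will | study | how | we | build |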


#### Batch 2
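- skip tokens 1-5 (row 1 of batch 1)
- row 1 = tokens 6-10
  - skip tokens 11-20 (row 1 of batch 3 and row 2 of batch 1)
- row 2 = tokens 21-25
  - skip tokens 26-35 (row 2 of batch 3 and row 3 of batch 1)
- row 3 = tokens 36-40
  - skip tokens 41-50 (row 3 of batch 3 and row 4 of batch 1)
- row 4 = tokens 51-55
  - skip tokens 56-65 (row 4 of batch 3 and row 5 of batch 1)
- row 5 = tokens 66-70
  - skip tokens 71-80 (row 5 of batch 3 and row 6 of batch 1)
- row 6 = tokens 81-85
  - skip tokens 86-90 (row 6 of batch 3)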

In [32]:
d_tokens = np.array([tokens[i*row_len+seq_len:i*row_len+2*seq_len] for i in range(bs)])
pd.DataFrame(d_tokens)

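|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | , | we | will | go | back |
| 1 | chapter | 1 | and | dig | deeper |
| 2 | the | processing | steps | necessary | to |
| 3 | xxmaj | by | doing | this | , |
| 4 | the | data | block | xxup | api |
| 5 | a | language | model | and | train |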


#### Batch 3
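- skip tokens 1-10 (row 1 of batches 1 and 2)
- row 1 = tokens 11-15
  - skip tokens 16-25 (row 2 of batches 1 and 2)
- row 2 = tokens 26-30
  - skip tokens 31-40 (row 3 of batches 1 and 2)
- row 3 = tokens 41-45
  - skip tokens 46-55 (row 4 of batches 1 and 2)
- row 4 = tokens 56-60
  - skip tokens 61-70 (row 5 of batches 1 and 2)
- row 5 = tokens 71-75
  - skip tokens 76-85 (row 6 of batches 1 and 2)
- row 6 = tokens 86-90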

In [33]:
# third slice of each row: start 2*seq_len into the row, end 3*seq_len into it
d_tokens = np.array([tokens[i*row_len+2*seq_len:i*row_len+3*seq_len] for i in range(bs)])
pd.DataFrame(d_tokens)

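|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | over | the | example | of | classifying |
| 1 | under | the | surface | . | xxmaj |
| 2 | convert | text | into | numbers | and |
| 3 | we | 'll | have | another | example |
| 4 | . | \n | xxmaj | then | we |
| 5 | it | for | a | while | . |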
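
The three slices above follow one pattern, which can be captured in a single helper. This is a plain-Python sketch under simplifying assumptions (leftover tokens are dropped, and integers 1-90 stand in for the 90 tokens); the name `lm_batches` is hypothetical and not part of fastai.

```python
def lm_batches(tokens, bs, seq_len):
    """Split `tokens` into `bs` contiguous rows, then cut the rows into
    mini-batches of `seq_len` columns. Leftover tokens are dropped."""
    row_len = len(tokens) // bs
    rows = [tokens[i * row_len:(i + 1) * row_len] for i in range(bs)]
    return [[row[start:start + seq_len] for row in rows]
            for start in range(0, row_len - seq_len + 1, seq_len)]

stream = list(range(1, 91))            # stand-in for the 90 tokens, numbered 1-90
batches = lm_batches(stream, bs=6, seq_len=5)

print(len(batches))                    # number of mini-batches
print(batches[0][0], batches[1][0], batches[2][0])  # row 1 of batches 1, 2, 3
```

Reading row 1 across batches 1, 2 and 3 gives tokens 1-15 in order, which is what lets the model carry its state from one batch to the next.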


**Note:** Later, we will use embeddings to convert the integer word indices into activations that we can use for our neural network. Then we'll feed those embeddings into a recurrent neural network (RNN).

## 3. Create a Language Dataloader
- Fastai provides an LMDataLoader class which automatically handles creating a dependent variable that is offset from the independent variable by one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required
- At every epoch, we shuffle our collection of documents and concatenate them into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length we picked
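
The "offset by one token" relationship between the independent and dependent variables can be shown in a few lines. This is a plain-Python sketch, not the `LMDataLoader` implementation; the helper name `lm_xy` is hypothetical, and the ids are the first numericalized tokens of the sample review.

```python
def lm_xy(token_ids):
    x = token_ids[:-1]   # everything except the last token
    y = token_ids[1:]    # the same stream shifted left by one token
    return x, y

ids = [2, 18, 501, 15, 57, 8, 0, 10]   # xxbos i wanted to like xxmaj xxunk .
x, y = lm_xy(ids)

# at every position, the target is simply the next input token
for xi, yi in zip(x, y):
    print(xi, "->", yi)
```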

### Repeat the text preprocessing steps we've already done (for review)

In [34]:
# read first 2000 text documents
txts = L(o.open().read() for o in files[:2000])

In [35]:
# tokenize
tokenizer = WordTokenizer()
tkn = Tokenizer(tokenizer)
toks200 = txts[:200].map(tkn)

In [36]:
# numericalize
num = Numericalize()
num.setup(toks200)
nums200 = toks200.map(num)

### Create the dataloader

In [37]:
# create the dataloader
  # unlike the dl we see later, this requires the already tokenized/numericalized data
  # the one later does those steps automatically
dl = LMDataLoader(nums200)

### Look at a sample

In [38]:
# sample (first batch)
x,y = first(dl)
print(x.shape,y.shape)

torch.Size([64, 72]) torch.Size([64, 72])


In [39]:
# x sample
sample_x = x[0]

# print the vocab indices of the first 20 tokens
print(sample_x[:20])

# print the vocab entries that correspond to those indices
print(' '.join(num.vocab[o] for o in sample_x[:20]))

LMTensorText([   2,   18,  501,   15,   57,    8,    0,   10,    8,    9,  111, 1467,   86,   14,    8, 1166,    8,    0,   35,  103])
xxbos i wanted to like xxmaj xxunk . xxmaj the plot reminded me of xxmaj grand xxmaj xxunk ( which


In [40]:
# y sample -- same document as x but offset by one token
sample_y = y[0]

# print the vocab indices of the first 20 tokens
print(sample_y[:20])

# print the vocab entries that correspond to those indices
print(' '.join(num.vocab[o] for o in sample_y[:20]))

TensorText([  18,  501,   15,   57,    8,    0,   10,    8,    9,  111, 1467,   86,   14,    8, 1166,    8,    0,   35,  103,   18])
i wanted to like xxmaj xxunk . xxmaj the plot reminded me of xxmaj grand xxmaj xxunk ( which i


### First 3 samples

In [41]:
# look at first 3 samples
print(' '.join(num.vocab[o] for o in x[0][:20]))
print()
print(' '.join(num.vocab[o] for o in x[1][:20]))
print()
print(' '.join(num.vocab[o] for o in x[2][:20]))

xxbos i wanted to like xxmaj xxunk . xxmaj the plot reminded me of xxmaj grand xxmaj xxunk ( which

people in my business life , i can say without exception , that xxmaj i 've never xxunk any xxunk

. xxmaj what xxup should have happened in the last 1 / 2 of this movie is xxmaj michael being


# Training a Text Classifier
- Step 1. Get pre-trained language model
- Step 2. Fine-tune language model on our specific corpus
- Step 3. Create classification model

## Step 1. Get a Pre-trained Language Model
- Goal
  - Predict the next word in English-language text
  - This is self-supervised learning -- we don't need to give it (external) labels, we just give it lots and lots of text (aka the labels are embedded in the independent variable)
- Independent variable
  - The sequence of tokens starting with the first token in the text stream and ending with the second-to-last token
- Dependent variable
  - The sequence of tokens starting with the second token in the text stream and ending with the last token
- Architecture
  - RNN
- Steps
  - Make a list of all possible tokens -- the vocab
  - Replace each token with its index in the vocab
  - Create an embedding matrix for this containing a row for each token
  - Use this embedding matrix as the first layer of a neural network
  - Note: A dedicated embedding matrix can take the raw vocab indices as inputs; this is equivalent to, but faster and more efficient than, a matrix that takes as input one-hot-encoded vectors representing the indices
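
The equivalence mentioned in the note can be checked on a toy example. This is a plain-Python sketch with a made-up 3-token, 2-dimensional embedding matrix; the names `lookup` and `one_hot_matmul` are hypothetical.

```python
# toy embedding matrix: one row per token in the vocab, 2 dimensions each
emb = [[0.1, 0.2],
       [0.3, 0.4],
       [0.5, 0.6]]

def lookup(idx):
    # what an embedding layer does: index straight into the matrix
    return emb[idx]

def one_hot_matmul(idx):
    # the equivalent (but slower) route: one-hot vector times the matrix
    one_hot = [1.0 if i == idx else 0.0 for i in range(len(emb))]
    n_dims = len(emb[0])
    return [sum(one_hot[i] * emb[i][d] for i in range(len(emb)))
            for d in range(n_dims)]

for idx in range(len(emb)):
    print(idx, lookup(idx), one_hot_matmul(idx))
```

Both routes return the same row; the lookup simply skips all the multiplications by zero.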

### Get data

In [42]:
# get imdb reviews from the data we already downloaded
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

### Datablock / Dataloaders
- Input = text
- Output = same text offset by 1
- Note: these data loaders automatically tokenize/numericalize the data (unlike the DL above)

In [43]:
# TextBlock
  # is_lm = True; says it is a language model (just need to feed in texts once)
  # from_folder; tells it how to tokenize/numericalize
dls_lm = DataBlock(
  blocks=TextBlock.from_folder(path, is_lm=True),
  get_items=get_imdb, # how to get the data
  splitter=RandomSplitter(0.1)
).dataloaders(
  path,
  path=path,
  bs=128,
  seq_len=80
)

In [44]:
# notice text (the x values) and text_ (the y values) are offset by n=1
dls_lm.show_batch(max_n=3)

Unnamed: 0,text,text_
0,"xxbos xxmaj this is without a doubt the worst gay - themed film i have ever seen . xxmaj it 's an embarrassment on every level . xxmaj the script is hackneyed and unfunny . xxmaj the direction -- well , what passes for the direction , is inept . xxmaj the actors appear to have been handed the orders "" ham it up ! "" xxmaj yet , the ones who are n't over - the - top are","xxmaj this is without a doubt the worst gay - themed film i have ever seen . xxmaj it 's an embarrassment on every level . xxmaj the script is hackneyed and unfunny . xxmaj the direction -- well , what passes for the direction , is inept . xxmaj the actors appear to have been handed the orders "" ham it up ! "" xxmaj yet , the ones who are n't over - the - top are wooden"
1,"disappearance of a teen couple , xxmaj ted and xxmaj angie at the riverside in north xxmaj california a group of the nearby - town teenagers goes having ghost story in the local cemetery . xxmaj however the unnaturally sudden rain forces them to movie into an old wooden house in the cemetery . xxmaj then the not - well - motivated murderer begins to kill them one by one , and the cemetery becomes their own … i think","of a teen couple , xxmaj ted and xxmaj angie at the riverside in north xxmaj california a group of the nearby - town teenagers goes having ghost story in the local cemetery . xxmaj however the unnaturally sudden rain forces them to movie into an old wooden house in the cemetery . xxmaj then the not - well - motivated murderer begins to kill them one by one , and the cemetery becomes their own … i think the"
2,"two out of ten ; but to some extent this film probably provides exactly what its sponsors intended , and judged on this basis a quality rating of four out of ten would be reasonable . xxmaj being in a charitable mood , and wanting to make it clear that i am not blaming xxmaj pia for my disappointment , i will give an imdb rating of four . xxbos xxmaj considering that "" drive "" is a low -","out of ten ; but to some extent this film probably provides exactly what its sponsors intended , and judged on this basis a quality rating of four out of ten would be reasonable . xxmaj being in a charitable mood , and wanting to make it clear that i am not blaming xxmaj pia for my disappointment , i will give an imdb rating of four . xxbos xxmaj considering that "" drive "" is a low - budget"


### Look at sample

In [45]:
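# sample -- note: this still draws from `dl` (the LMDataLoader built from 200 reviews above), hence the same shapes and tokens as before
# to inspect the new data loaders instead, use `dls_lm.one_batch()`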
x,y = first(dl)
print(x.shape,y.shape)

torch.Size([64, 72]) torch.Size([64, 72])


In [46]:
# x sample: vocab indices of the first 20 tokens, then the full row mapped back to vocab entries
sample_x = x[0]
print(sample_x[:20])
print(' '.join(num.vocab[o] for o in sample_x))

LMTensorText([   2,   18,  501,   15,   57,    8,    0,   10,    8,    9,  111, 1467,   86,   14,    8, 1166,    8,    0,   35,  103])
xxbos i wanted to like xxmaj xxunk . xxmaj the plot reminded me of xxmaj grand xxmaj xxunk ( which i liked ) . 4 different lives / stories that come together at the end but xxmaj xxunk took a wrong turn halfway through the movie and i was lost . i almost turned it off right then and there but i felt i should xxunk in there until the end ,


In [47]:
# y sample: vocab indices of the first 20 tokens, then the full row mapped back to vocab entries
sample_y = y[0]
print(sample_y[:20])
print(' '.join(num.vocab[o] for o in sample_y))

TensorText([  18,  501,   15,   57,    8,    0,   10,    8,    9,  111, 1467,   86,   14,    8, 1166,    8,    0,   35,  103,   18])
i wanted to like xxmaj xxunk . xxmaj the plot reminded me of xxmaj grand xxmaj xxunk ( which i liked ) . 4 different lives / stories that come together at the end but xxmaj xxunk took a wrong turn halfway through the movie and i was lost . i almost turned it off right then and there but i felt i should xxunk in there until the end , little


### Get pre-trained model (from Wikipedia)
- Goal: Predict the next word in this specific corpus (i.e., predict the next word of a movie review)
- This fastai learner automatically combines the embeddings in the pretrained (Wikitext) language model with random embeddings added for words that weren't in the pretrained model
- The learner automatically calls `freeze` when a pretrained model is used, so the first round of training only updates the embeddings (the only part of the model with randomly initialized weights)

In [48]:
# learner
learn = language_model_learner(
  dls_lm, # data loader from above (automatically tokenizes/numericalizes)
  AWD_LSTM, # RNN architecture
  drop_mult=0.3, # don't worry about this for now
  # loss = default -- cross-entropy loss, since this is a classification problem
  metrics=[accuracy, Perplexity()] # accuracy = how many times model is right when predicting next word; perplexity = exponential of the loss, has to do with model confidence
).to_fp16()

## Step 2. Fine-tune the language model with the specific imdb corpus
- Helps understand the "language" of this specific corpus
- Our vocab will consist of a mix of common words that are already in the vocabulary of our pretrained model and new words specific to our corpus (cinematographic terms or actors' names, for instance)
- To convert the integer word indices into activations that we can use for our neural network, we will use embeddings
  - For words that are in the vocabulary of our pretrained model, we will take the corresponding row in the embedding matrix of the pretrained model
  - For new words we won't have anything, so we will just initialize the corresponding row with a random vector
- The embeddings are read into an RNN (architecture = AWD-LSTM)
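
The embedding merge described above can be sketched in plain Python. This follows the description in the bullets (reuse pretrained rows, random rows for new words) and is not fastai's code; the library's own initialization of new rows may differ. The function name `merge_embeddings`, the toy vocabs, and the uniform range are all assumptions.

```python
import random

def merge_embeddings(pretrained_vocab, pretrained_emb, new_vocab, n_dims, seed=0):
    rng = random.Random(seed)
    old_index = {tok: i for i, tok in enumerate(pretrained_vocab)}
    merged = []
    for tok in new_vocab:
        if tok in old_index:
            # word known to the pretrained model: reuse its row
            merged.append(list(pretrained_emb[old_index[tok]]))
        else:
            # word new to our corpus: start from a random vector
            merged.append([rng.uniform(-0.1, 0.1) for _ in range(n_dims)])
    return merged

pretrained_vocab = ["the", "movie", "plot"]
pretrained_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
imdb_vocab = ["the", "plot", "cinematography"]   # last word is new

merged = merge_embeddings(pretrained_vocab, pretrained_emb, imdb_vocab, n_dims=2)
print(merged)
```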

In [49]:
# initial fine-tuning
  # the language_model_learner automatically calls `freeze` when using a pretrained model
  # so this will only train the embeddings (the only part of the model with randomly initialized weights, i.e. the rows for words new to the IMDb corpus)
learn.fit_one_cycle(1, 2e-2)

| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|---|---|---|---|---|---|
| 0 | 4.013569 | 3.907824 | 0.300493 | 49.790466 | 20:54 |
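
The perplexity column is just the exponential of the validation loss, which can be checked by hand. A quick sketch using the `valid_loss` value reported above:

```python
import math

valid_loss = 3.907824          # validation loss from the epoch above
perplexity = math.exp(valid_loss)

# should come out close to the perplexity reported in the table (about 49.79)
print(round(perplexity, 2))
```

Loosely speaking, a perplexity of about 50 means the model is as uncertain as if it were choosing uniformly among roughly 50 tokens at each step.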


In [50]:
# can save model as is (saves to learn.path/models/1epoch.pth)
learn.save('1epoch')

# load later
tmp = learn.load('1epoch')

In [51]:
# once the initial training has completed, can unfreeze the pre-trained layers
learn.unfreeze()

# and continue fine-tuning, now updating all layers (the pre-trained weights as well as the new IMDb embeddings)
learn.fit_one_cycle(10, 2e-3)

| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|---|---|---|---|---|---|
| 0 | 3.757411 | 3.765701 | 0.316886 | 43.193966 | 23:03 |
| 1 | 3.705809 | 3.706454 | 0.323315 | 40.709198 | 22:46 |
| 2 | 3.63531 | 3.658247 | 0.32885 | 38.793259 | 22:16 |
| 3 | 3.564897 | 3.625551 | 0.332808 | 37.545395 | 22:08 |
| 4 | 3.499094 | 3.604402 | 0.3353 | 36.759705 | 21:57 |
| 5 | 3.429217 | 3.589293 | 0.337457 | 36.20845 | 21:55 |
| 6 | 3.363263 | 3.5792 | 0.339213 | 35.844856 | 22:08 |
| 7 | 3.293472 | 3.577188 | 0.340156 | 35.772816 | 22:07 |
| 8 | 3.258166 | 3.580547 | 0.3405 | 35.893166 | 22:20 |
| 9 | 3.209875 | 3.585246 | 0.340195 | 36.062229 | 22:16 |


In [52]:
# save the encoder (the model not including the final layer)
  # the final layer is task specific: converts activations to probabilities of picking each token in our vocabulary
learn.save_encoder('finetuned')

## Step 3. Fine-tune a CLASSIFICATION model (sentiment labels)
- A language model predicts the next word of a document, so it doesn't need any external labels
- A classifier predicts some external label—in the case of IMDb, it's the sentiment of a document
- We pass the vocab of the language model to the classifier model (to make sure we use the same correspondence of token to index)

### Pad the texts to make them the same length
- Notice the original texts all have different lengths (different numbers of tokens)
- Need to pad them to be the same size -- expand the shorter texts to make them the same length as the longest text. PyTorch needs to collate all the items in a batch into a single tensor, and a single tensor has a fixed shape
- Can also crop long texts to a shorter, fixed length
- To do padding
  - batch together texts that are roughly the same length (with some shuffling for the training set) -- do this by sorting the documents by length prior to each epoch
  - then, pad documents by batch (instead of all at once) -- we won't pad every batch to the same size, but will instead use the size of the largest document in each batch as the target size
- The sorting and padding are automatically done by the data block API for us when using a TextBlock, with is_lm=False
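
The sort-then-pad idea can be sketched in plain Python. This is a simplified illustration and not what the data block API runs internally: it ignores the shuffling, pads at the end of each document (the padding side is an arbitrary choice here), and uses `pad_id=1` on the assumption that `xxpad` sits at index 1 as in the vocab printed earlier. The name `padded_batches` is hypothetical.

```python
def padded_batches(docs, bs, pad_id=1):
    """Sort documents by length, group them into batches of `bs`, and pad each
    batch only up to the longest document in that batch."""
    ordered = sorted(docs, key=len)
    batches = []
    for start in range(0, len(ordered), bs):
        group = ordered[start:start + bs]
        width = max(len(d) for d in group)
        batches.append([d + [pad_id] * (width - len(d)) for d in group])
    return batches

# dummy documents with the same lengths as the 10 sample reviews
lengths = [159, 253, 219, 194, 1152, 101, 406, 428, 247, 68]
docs = [[5] * n for n in lengths]

batches = padded_batches(docs, bs=5)
for batch in batches:
    print(len(batch), "docs padded to", len(batch[0]), "tokens")
```

Because the short reviews end up in the same batch, they are only padded to the longest of that group rather than to the 1152-token review.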

In [53]:
# quick sample of 10 reviews -- note all have different lengths (numbers of tokens)
nums_samp = toks200[:10].map(num)
print(nums_samp.map(len))

[159, 253, 219, 194, 1152, 101, 406, 428, 247, 68]


In [54]:
# datablock / dataloaders
dls_clas = DataBlock(
  blocks=(
    TextBlock.from_folder(path, vocab=dls_lm.vocab), # x values are texts numericalized with the vocab from the language model we used earlier. Note! now is_lm=False (default)
    CategoryBlock # y values are categories (sentiment labels)
  ),
  get_y = parent_label,
  get_items=partial(get_text_files, folders=['train', 'test']),
  splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

#### Sample x and y

In [55]:
dls_clas.show_batch(max_n=3)

,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos * * attention xxmaj spoilers * * \n\n xxmaj first of all , let me say that xxmaj rob xxmaj roy is one of the best films of the 90 's . xxmaj it was an amazing achievement for all those involved , especially the acting of xxmaj liam xxmaj neeson , xxmaj jessica xxmaj lange , xxmaj john xxmaj hurt , xxmaj brian xxmaj cox , and xxmaj tim xxmaj roth . xxmaj michael xxmaj canton xxmaj jones painted a wonderful portrait of the honor and dishonor that men can represent in themselves . xxmaj but alas … \n\n it constantly , and unfairly gets compared to "" braveheart "" . xxmaj these are two entirely different films , probably only similar in the fact that they are both about xxmaj scots in historical xxmaj scotland . xxmaj yet , this comparison frequently bothers me because it seems",pos
2,"xxbos * ! ! - xxup spoilers - ! ! * \n\n xxmaj before i begin this , let me say that i have had both the advantages of seeing this movie on the big screen and of having seen the "" authorized xxmaj version "" of this movie , remade by xxmaj stephen xxmaj king , himself , in 1997 . \n\n xxmaj both advantages made me appreciate this version of "" the xxmaj shining , "" all the more . \n\n xxmaj also , let me say that xxmaj i 've read xxmaj mr . xxmaj king 's book , "" the xxmaj shining "" on many occasions over the years , and while i love the book and am a huge fan of his work , xxmaj stanley xxmaj kubrick 's retelling of this story is far more compelling … and xxup scary . \n\n xxmaj kubrick",pos


### Fine-tune a classifier using the sentiment labels
- Use discriminative learning rates and gradual unfreezing
- For NLP classifiers, unfreezing a few layers at a time (rather than the whole model at once) makes a real difference
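The cells below pass learning rates as `slice(lo, hi)`, which gives the earliest parameter group the lowest rate and the last group the highest. The sketch below shows one way such a range can be spread geometrically across groups. `spread_lrs` is a hypothetical helper written for this illustration, and `N_GROUPS = 5` is an assumption about how many parameter groups the classifier has (it makes each step a factor of about 2.6, matching the `2.6**4` divisor).

```python
def spread_lrs(lo, hi, n_groups):
    """Geometrically spaced learning rates from lo (earliest group) to hi (last group)."""
    if n_groups == 1:
        return [hi]
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

N_GROUPS = 5  # assumption: number of parameter groups in the classifier
hi = 1e-2
lrs = spread_lrs(hi / 2.6**4, hi, N_GROUPS)

for i, lr in enumerate(lrs):
    print(f"group {i}: lr = {lr:.2e}")
```

Earlier groups hold more general features from pretraining, so they get smaller updates than the task-specific head.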


In [56]:
#classification learner
learn = text_classifier_learner(
  dls_clas,
  AWD_LSTM,
  drop_mult=0.5,
  metrics=accuracy
).to_fp16()

# make sure to add the encoder from the fine-tuned language model!!
learn = learn.load_encoder('finetuned')

In [57]:
# fine-tune: one epoch with only the last parameter group trainable (the rest of the model is frozen by default)
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.246305,0.175227,0.93312,00:49


In [58]:
# fine tune again: freeze all except last 2 parameter groups
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.215631,0.159536,0.93924,00:55


In [59]:
# fine tune again: unfreeze a little bit more (freeze all except last 3 parameter groups)
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.185593,0.14801,0.94412,01:14


In [60]:
# fine tune again: finally the whole model
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.16465,0.14294,0.94732,01:34
1,0.147156,0.144155,0.94696,01:34
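The gradual unfreezing schedule used in the cells above can be summarised with a small simulation. This is a sketch only: `trainable_after_freeze_to` is a hypothetical helper that mimics the slicing behaviour of `freeze_to`, and the five parameter groups are an assumption.

```python
def trainable_after_freeze_to(n_groups, n):
    """Return one trainable flag per group: groups before index n are frozen."""
    flags = [False] * n_groups
    for i in range(n_groups)[n:]:  # same slice semantics as a Python list
        flags[i] = True
    return flags

N_GROUPS = 5  # assumption: number of parameter groups in the classifier

# (freeze_to argument, description) for each training stage in this notebook
schedule = [
    (-1, "default: only the last group"),
    (-2, "freeze_to(-2)"),
    (-3, "freeze_to(-3)"),
    (0, "unfreeze(): everything"),
]

for n, label in schedule:
    print(f"{label:32s} {trainable_after_freeze_to(N_GROUPS, n)}")
```

Each stage trains one more group than the previous one, ending with the whole model.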


### Note: training a second model on all the texts read backwards and averaging the predictions of the two models would improve the accuracy further
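A minimal sketch of the averaging step, assuming we already have per-class probabilities from a forward model and a backward model. The numbers are made up for illustration, and `average_predictions` and `predicted_classes` are hypothetical helpers, not fastai functions.

```python
# Made-up class probabilities [neg, pos] for three reviews
probs_fwd = [[0.40, 0.60], [0.55, 0.45], [0.10, 0.90]]  # forward model
probs_bwd = [[0.30, 0.70], [0.35, 0.65], [0.20, 0.80]]  # backward model

def average_predictions(a, b):
    """Element-wise mean of two lists of probability rows."""
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def predicted_classes(probs):
    """Index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

avg = average_predictions(probs_fwd, probs_bwd)

print("forward :", predicted_classes(probs_fwd))
print("backward:", predicted_classes(probs_bwd))
print("ensemble:", predicted_classes(avg))
```

In this toy case the two models disagree on the second review, and the averaged probabilities side with the more confident one.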