In [2]:
!pip install -U fastbook torchtext==0.8.1

Requirement already up-to-date: fastbook in /usr/local/lib/python3.7/dist-packages (0.0.16)
Requirement already up-to-date: torchtext==0.8.1 in /usr/local/lib/python3.7/dist-packages (0.8.1)


# NLP Deep Dive: RNNs

*Self-supervised learning*: no need to give labels to the model, just feed it a ton of texts. The model automatically gets labels from the data.

We can download a pretrained English language model, fine-tune it to predict words in a movie review, and then apply it to classification. Thus, there are 3 stages to this transfer learning in NLP:

1. Download the pretrained language model, trained on Wikipedia articles.
2. Fine-tune this model on the IMDb reviews data.
3. Use this to classify reviews as positive or negative.

## Text Preprocessing

Steps for processing text:

1. Tokenization: convert text into a list of tokens (words, subwords, characters, etc.)
2. Numericalization: create a list of all unique words/tokens that appear (this is the vocab). Then, take the tokenized text and convert each token into a number (the index of that token in the vocab).
3. Create model data loader: use fastai's `LMDataLoader` class to handle text data.
4. Create language model: use a recurrent neural network (RNN) to handle input lists which could be arbitrarily large or small.

The independent variable is the sequence of words from first to second-to-last, and the dependent variable is the sequence of words from second word to last.
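
As a minimal sketch of that offset, using a made-up token list rather than real fastai output, the input and target sequences are the same tokens shifted by one position:

```python
# Toy token sequence (hypothetical); a real pipeline would use numericalized tokens.
tokens = ["xxbos", "xxmaj", "tenshu", "is", "imprisoned", "."]

x = tokens[:-1]  # independent variable: first token through second-to-last
y = tokens[1:]   # dependent variable: second token through last

# Each input token is paired with the token that follows it.
for inp, target in zip(x, y):
    print(f"{inp!r} -> {target!r}")
```

At every position the model sees `x[i]` (plus everything before it) and is asked to predict `y[i]`.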

The vocab will consist of words already in the vocabulary of the pretrained model and new words specific to the corpus (documents).


### Tokenization

Three approaches to tokenization:

* Word-based: split a sentence into its individual words (split on the spaces). Use language-specific rules to separate parts of meaning even if there is no space (e.g. turning "don't" into "do n't"). Punctuation marks are usually split into individual tokens.
* Subword-based: split words into smaller parts, based on the most commonly occurring substrings (e.g. "occasion" => "o c ca sion").
* Character-based: split a sentence into individual characters.
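
To make the word-based and character-based approaches concrete, here is a rough sketch using only the standard library. The regular expression is an assumption chosen for illustration; it is far cruder than a real word tokenizer (for instance, it does not split "don't" into "do n't").

```python
import re

sentence = "I don't like it."

# Word-based (very rough): keep runs of word characters/apostrophes together
# and split each punctuation mark off as its own token.
word_tokens = re.findall(r"[\w']+|[^\w\s]", sentence)

# Character-based: every character, including spaces, becomes a token.
char_tokens = list(sentence)

print(word_tokens)
print(char_tokens)
```

Subword tokenization sits between these two extremes and is covered in its own section.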

### Word Tokenization with fastai

In [1]:
from fastai.text.all import *
path = untar_data(URLs.IMDB, force_download=True)

Grab all text files from `path`, using `folders` to restrict the search to just the provided subfolders.

In [30]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

Get the first 75 characters of the first file:

In [31]:
txt = files[0].open().read()
txt[:75]

'Tenshu is imprisoned and sentenced to death. When he survives electrocution'

The default English word tokenizer in fastai uses a library called *spaCy*. We use `WordTokenizer` below, since that refers to fastai's current default word tokenizer (which may or may not be *spaCy*).

We tokenize the file we got above, then print the first 30 tokens with `coll_repr`. Since fastai's tokenizers take a collection/list of documents to tokenize, `txt` is wrapped in a list.

In [32]:
spacy = WordTokenizer()
tokens = first(spacy([txt]))
print(coll_repr(tokens, 30))

(#260) ['Tenshu','is','imprisoned','and','sentenced','to','death','.','When','he','survives','electrocution','the','government','officials','give','him','a','choice','to','either','be','electrocute','at','a','greater','degree','or','agree','to'...]


spaCy is smart about separating punctuation; it knows to separate "." when it ends a sentence and not to separate it when "." is contained within a word/abbreviation.

In [33]:
first(spacy(['The U.S. dollar $1 is 1.00.']))

(#8) ['The','U.S.','dollar','$','1','is','1.00','.']

Use fastai's `Tokenizer` class to add *special tokens* to the token list. These tokens start with 'xx'. The main special tokens are:

* xxbos: Indicates the beginning of a stream/text (BOS = "beginning of stream")
* xxmaj: Indicates the next word begins with a capital (since we lowercased everything)
* xxunk: Indicates the next word is unknown

These special tokens are used to make it easier for the model to store and process the text data (e.g. with xxmaj, every word can be lowercase, saving compute and memory resources).

In [34]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#302) ['xxbos','xxmaj','tenshu','is','imprisoned','and','sentenced','to','death','.','xxmaj','when','he','survives','electrocution','the','government','officials','give','him','a','choice','to','either','be','electrocute','at','a','greater','degree','or'...]


fastai's default text-processing rules are listed in the next cell. Here is a brief summary of what each does:

* fix_html: Replaces special HTML characters with a readable version (IMDb reviews have quite a few of these)
* replace_rep: Replaces any character repeated three times or more with a special token for repetition (xxrep), the number of times it's repeated, then the character
* replace_wrep: Replaces any word repeated three times or more with a special token for word repetition (xxwrep), the number of times it's repeated, then the word
* spec_add_spaces: Adds spaces around / and #
* rm_useless_spaces: Removes all repetitions of the space character
* replace_all_caps: Lowercases a word written in all caps and adds a special token for all caps (xxup) in front of it
* replace_maj: Lowercases a capitalized word and adds a special token for capitalized (xxmaj) in front of it
* lowercase: Lowercases all text and adds a special token at the beginning (xxbos) and/or the end (xxeos)

In [35]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html>,
 <function fastai.text.core.replace_rep>,
 <function fastai.text.core.replace_wrep>,
 <function fastai.text.core.spec_add_spaces>,
 <function fastai.text.core.rm_useless_spaces>,
 <function fastai.text.core.replace_all_caps>,
 <function fastai.text.core.replace_maj>,
 <function fastai.text.core.lowercase>]

Checking out some more rules in action:

In [36]:
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"
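
The `xxrep 3 w` in that output comes from the character-repetition rule. The following is a simplified sketch of how such a rule could be written with a regular expression; `replace_rep_sketch` is a hypothetical name, and this is not claimed to be fastai's exact implementation.

```python
import re

TK_REP = "xxrep"  # same marker text that appears in the output above

# One non-space character followed by at least two more copies of itself,
# i.e. a run of three or more identical characters.
_rep_pattern = re.compile(r"(\S)\1{2,}")

def replace_rep_sketch(text):
    """Replace runs of 3+ identical characters with ' xxrep <count> <char> '."""
    def _sub(match):
        char = match.group(1)
        count = len(match.group(0))  # total length of the run
        return f" {TK_REP} {count} {char} "
    return _rep_pattern.sub(_sub, text)

result = replace_rep_sketch("www.fast.ai is sooooo good")
print(result.split())
```

Note that "good" is left alone, since its two o's fall below the three-character threshold.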

### Subword Tokenization

Word tokenizers use spaces to separate words/meanings; however, in a language like Chinese, there aren't any spaces to do this. Subword tokenization has two steps:

1. Scan the corpus of documents to find the most commonly occurring groups of letters. These become the vocab.

2. Tokenize the corpus using this vocab of *subword units*.

Read first 100 files (movie reviews) into an L object.

In [37]:
txts = L(o.open().read() for o in files[:100])

Instantiate a tokenizer with vocab size `sz`. `setup` reads all documents and finds the common sequences of characters to create the vocab.

In [38]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])

Call our function `subword`, creating vocab of size 1000:

In [39]:
subword(1000)

'▁T en sh u ▁is ▁im p ris on ed ▁and ▁s ent ence d ▁to ▁de a th . ▁Whe n ▁he ▁surviv es ▁ el ect ro c ut ion ▁the ▁go ver n ment ▁of f ic'

If we use a smaller vocab, each token represents fewer characters (tokens are simpler). Thus, it takes more tokens to represent a sentence.

In [40]:
subword(200)

'▁T en sh u ▁is ▁ i m p r i s on ed ▁and ▁s ent en ce d ▁to ▁de a th . ▁W h en ▁he ▁s ur v i ve s ▁ el ect ro c'

If we use a larger vocab, most common English words will end up in the vocab themselves, and we will not need as many tokens to represent a sentence.

In [41]:
subword(10000)

'▁T en s hu ▁is ▁im prisone d ▁and ▁s ent ence d ▁to ▁death . ▁Whe n ▁he ▁survive s ▁electrocution ▁the ▁ govern ment ▁ official s ▁give ▁him ▁a ▁cho ice ▁to ▁either ▁be ▁electrocut e ▁at'

Larger vocab --> fewer tokens per sentence, which means:
* faster training
* less memory
* less state for the model to remember

The downside: a larger vocab requires larger embedding matrices, which require more data to learn.

Subword tokenization scales easily between character tokenization (small subword vocab) and word tokenization (large subword vocab). It can also handle every human language without needing language-specific algorithms.
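
For intuition about the vocab-size trade-off, here is a toy byte-pair-style merge loop. This is a different and much simpler technique than whatever `SubwordTokenizer` runs internally; the function names and the tiny corpus are made up for illustration. More merges play the role of a larger vocab.

```python
from collections import Counter

def apply_merge(symbols, a, b):
    """Replace every adjacent pair (a, b) with the single symbol a+b."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_merges(text, n_merges):
    """Learn up to n_merges merges, always taking the most frequent adjacent pair."""
    symbols, merges = list(text), []
    for _ in range(n_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so further merges would not help
        merges.append((a, b))
        symbols = apply_merge(symbols, a, b)
    return merges

def tokenize(text, merges):
    """Start from characters and replay the learned merges in order."""
    symbols = list(text)
    for a, b in merges:
        symbols = apply_merge(symbols, a, b)
    return symbols

corpus = "the cat sat on the mat and the cat ate the rat"
small = tokenize(corpus, learn_merges(corpus, 3))    # few merges: small vocab
large = tokenize(corpus, learn_merges(corpus, 30))   # many merges: large vocab
print(len(corpus), len(small), len(large))
```

With few merges the text stays close to character-level; with many merges common words collapse into single tokens, so the same text needs fewer tokens.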

### Numericalization with fastai

Numericalization: mapping tokens to integers.

Use the word tokenizer to convert the text into tokens:

In [42]:
tokens = tkn(txt)
print(coll_repr(tokens, 31))

(#302) ['xxbos','xxmaj','tenshu','is','imprisoned','and','sentenced','to','death','.','xxmaj','when','he','survives','electrocution','the','government','officials','give','him','a','choice','to','either','be','electrocute','at','a','greater','degree','or'...]


Word-tokenize the texts. The slice asks for the first 200, but `txts` holds only the 100 files read above, so all 100 are tokenized:

In [43]:
tokens200 = txts[:200].map(tkn)
tokens200[0]

(#302) ['xxbos','xxmaj','tenshu','is','imprisoned','and','sentenced','to','death','.'...]

Pass the list of tokens to `setup` to generate the vocab:

In [44]:
num = Numericalize()
num.setup(tokens200)
coll_repr(num.vocab, 20)

"(#1272) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','and','a','of','to','is','in','it','i'...]"

Special tokens appear first, then each word appears once, in frequency order. `Numericalize` defaults to `min_freq=3` and `max_vocab=60000`. With `max_vocab=60000`, fastai replaces all words other than the 60,000 most common with the special unknown-word token, xxunk. With `min_freq=3`, any word that appears fewer than 3 times is also replaced with xxunk. Both settings avoid an overly large embedding matrix, which slows down training and takes up a lot of memory. `min_freq` additionally drops words that occur too rarely for the model to learn useful representations of them.

We can call our `Numericalize` object like a function, passing it the tokens we generated from the first file. It returns a tensor of integers that represent their index in the vocab.

In [45]:
nums = num(tokens)[:20]
nums

TensorText([  2,   8,   0,  16,   0,  12,   0,  15, 376,  10,   8,  58,  38,   0,
          0,   9,   0,   0, 206,  87])

Use the indexes in `nums` to look the corresponding words up in the vocab, and concatenate them together:

In [46]:
' '.join(num.vocab[o] for o in nums)

'xxbos xxmaj xxunk is xxunk and xxunk to death . xxmaj when he xxunk xxunk the xxunk xxunk give him'
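
The vocab-building and lookup steps can be sketched in plain Python. The helper names below are hypothetical and the special-token list is shortened; the sketch only mirrors the `min_freq`/`max_vocab` behavior described above and is not fastai's `Numericalize`.

```python
from collections import Counter

SPECIAL = ["xxunk", "xxpad", "xxbos"]  # shortened list of special tokens

def build_vocab(token_lists, min_freq=3, max_vocab=60000):
    """Special tokens first, then sufficiently frequent tokens by frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, c in counts.most_common(max_vocab)
                if c >= min_freq and tok not in SPECIAL]
    return SPECIAL + frequent

def numericalize(tokens, vocab):
    """Map each token to its vocab index; anything missing becomes xxunk (0)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

docs = [["xxbos", "the", "film", "is", "good"],
        ["xxbos", "the", "film", "is", "bad"],
        ["xxbos", "the", "plot", "is", "thin"]]

vocab = build_vocab(docs, min_freq=3)
ids = numericalize(docs[2], vocab)
decoded = [vocab[i] for i in ids]
print(vocab, ids, decoded)
```

Words that appear fewer than three times ("plot", "thin") come back as xxunk after the round trip, just as in the decoded review above.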

### Putting Our Texts into Batches for a Language Model

At the beginning of each epoch:

1. Concatenate all the texts together into one long stream. To induce randomization, we shuffle the order of the documents (not the words inside of them!).

2. Divide this stream into batches (e.g. if the stream has 50,000 tokens and we want a batch size of 10, this gives us 10 mini-streams of 5,000 tokens each). We need to preserve the order of the tokens within each mini-stream.

`LMDataLoader` does all this for us.
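
A rough sketch of the idea, under simplifying assumptions (no shuffling, leftover tokens dropped, a hypothetical `lm_batches` helper). The real `LMDataLoader` may differ in its details.

```python
def lm_batches(stream, bs, seq_len):
    """Yield (x, y) batches; row i of every batch continues mini-stream i."""
    # Split the stream into bs equal mini-streams. Each keeps one extra
    # trailing token so the last input position still has a target.
    n_per_stream = (len(stream) - 1) // bs
    rows = [stream[i * n_per_stream:(i + 1) * n_per_stream + 1] for i in range(bs)]
    for start in range(0, n_per_stream, seq_len):
        end = min(start + seq_len, n_per_stream)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]  # same tokens, offset by one
        yield x, y

stream = list(range(25))  # stand-in for a concatenated, numericalized token stream
batches = list(lm_batches(stream, bs=3, seq_len=4))

for x, y in batches:
    print(x, y)
```

The first row of the second batch picks up exactly where the first row of the first batch stopped, which is what lets the model carry its state from one batch to the next.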

Numericalize every document in `tokens200`.

In [47]:
nums200 = tokens200.map(num)

We use `LMDataLoader`, which (1) shuffles the collection of documents, (2) concatenates them into a stream of tokens, and (3) cuts the stream into a batch of fixed-size consecutive mini-streams.

In [48]:
dl = LMDataLoader(nums200)

We can grab the first batch. `x` contains 64 mini-streams (the default batch size), each with 72 integers (the default sequence length); each integer is an index used to look up a word in the vocab.

In [49]:
x,y = first(dl)
x.shape, y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

We examine the first row of the independent variable (`x`), which is the start of the first text:

In [50]:
' '.join(num.vocab[o] for o in x[0][:20])

'xxbos xxmaj xxunk is xxunk and xxunk to death . xxmaj when he xxunk xxunk the xxunk xxunk give him'

The dependent variable is the same thing, but offset by one token:

In [51]:
' '.join(num.vocab[o] for o in y[0][:20])

'xxmaj xxunk is xxunk and xxunk to death . xxmaj when he xxunk xxunk the xxunk xxunk give him a'

## Training a Text Classifier

Below, we will fine-tune our language model (which was pretrained on Wikipedia) to the corpus of IMDb reviews, and then we can use that model to train a classifier.

### Language Model Using DataBlock

When `TextBlock` is passed to `DataBlock`, tokenization and numericalization happen automatically. 

In [4]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

If the *counter.pkl* file is not created when you untar the `URLs.IMDB` data (you receive a `FileNotFoundError`), run the cell below:

In [None]:
import shutil
shutil.rmtree('') # Put the path to imdb_tok within the quotes

In [27]:
path = untar_data(URLs.IMDB, force_download=True)

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

A new version of this dataset is available, downloading...


(#7) [Path('/root/.fastai/data/imdb_tok/train'),Path('/root/.fastai/data/imdb_tok/lengths.pkl'),Path('/root/.fastai/data/imdb_tok/tmp_lm'),Path('/root/.fastai/data/imdb_tok/counter.pkl'),Path('/root/.fastai/data/imdb_tok/test'),Path('/root/.fastai/data/imdb_tok/unsup'),Path('/root/.fastai/data/imdb_tok/tmp_clas')]

In [29]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos i watched this in xxmaj july and even with the xxmaj christmas theme , found it touching and sensitive . xxmaj it is not for someone with a reality - mind as it is full of fantasy and lovely moments that sometimes do n't make sense . xxmaj william xxmaj russ did a grand job as xxmaj hank . i have only seen him in the remake of xxmaj the xxmaj long , xxmaj hot xxmaj summer where he","i watched this in xxmaj july and even with the xxmaj christmas theme , found it touching and sensitive . xxmaj it is not for someone with a reality - mind as it is full of fantasy and lovely moments that sometimes do n't make sense . xxmaj william xxmaj russ did a grand job as xxmaj hank . i have only seen him in the remake of xxmaj the xxmaj long , xxmaj hot xxmaj summer where he played"
1,"of xxmaj scream but these new xxmaj saw knockoffs are beginning to make those films look like classics . xxmaj they still pander to the same demographic that those other movies were so successful at doing , but now they add a new level of degeneracy that make the twelve to fourteen year old girls they 're aimed at feel like they 're hardcore xxup and hip . \n\n xxmaj this movie is a load of boring crap ! xxmaj","xxmaj scream but these new xxmaj saw knockoffs are beginning to make those films look like classics . xxmaj they still pander to the same demographic that those other movies were so successful at doing , but now they add a new level of degeneracy that make the twelve to fourteen year old girls they 're aimed at feel like they 're hardcore xxup and hip . \n\n xxmaj this movie is a load of boring crap ! xxmaj what"


### Fine-Tuning the Language Model

The terms used in the next cell:

*   `AWD_LSTM`: architecture that feeds embeddings into a recurrent neural network (RNN)
*   `Perplexity`: metric that is the exponential of the loss
*   `to_fp16`: `Learner` method that converts the learner to 16-bit floating point values



In [52]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()

`language_model_learner` automatically calls `freeze` when using a pretrained model, so `fit_one_cycle` will only train the embeddings, the only part of the model with randomly initialized weights (the embeddings for words that are in our IMDb vocab but not in the pretrained model's vocab). Words already in the pretrained vocab start from their pretrained embeddings.

In [53]:
learn.fit_one_cycle(n_epoch=1, lr_max=2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.007678,3.89484,0.300166,49.148201,22:21
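
Since perplexity is the exponential of the loss, the reported value can be checked by hand from the validation loss in the table above:

```python
import math

valid_loss = 3.89484               # validation loss reported in the table above
perplexity = math.exp(valid_loss)  # perplexity is exp(loss)
print(f"{perplexity:.2f}")
```

The result agrees with the `perplexity` column up to rounding.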


### Saving and Loading Models

Creates a file in `learn.path/models/` named *1epoch.pth*:

In [54]:
learn.save('1epoch')

Path('/root/.fastai/data/imdb/models/1epoch.pth')

Load the model later, after creating the `Learner` the same way (for example, to resume training):

In [55]:
learn = learn.load('1epoch')

After the initial training has finished, we can continue fine-tuning the model after unfreezing:

In [56]:
learn.unfreeze()
learn.fit_one_cycle(1, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.704372,3.666091,0.327556,39.09877,23:39


The below line saves the model *except* for the final layer (which converts activations to probabilities of picking each token in the vocab). The model not including the final layer is called the **encoder**.

In [57]:
learn.save_encoder('finetuned')

In [61]:
Path(path/'models').ls()

(#2) [Path('/root/.fastai/data/imdb/models/finetuned.pth'),Path('/root/.fastai/data/imdb/models/1epoch.pth')]

### Text Generation

We can use our model to generate random reviews. In `preds`, we store an array of sentences which the model generates.

In [62]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
          for _ in range(N_SENTENCES)]

There is some randomness in which word is chosen (based on the probabilities returned by the model) so that we don't get exactly the same review:

In [63]:
print("\n".join(preds))

i liked this movie because of the Australian English accents . It was a Aussie movie . 

 It was somewhat fun to see Australian director Bryan Kirkwood at work in the Aussie film industry as
i liked this movie because the film is really funny . The film was excellent , and i just love it . This movie was perfect . My Rating : 10 / 10 This film is the perfect example
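
One common way to implement this kind of controlled randomness is temperature sampling. The sketch below is a generic illustration with a hypothetical three-word vocab and made-up scores; it is not taken from fastai's `predict` code.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(vocab, logits, temperature=1.0, rng=random):
    """Pick one word at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["good", "bad", "funny"]  # hypothetical vocab
logits = [2.0, 1.0, 0.5]          # hypothetical model scores for the next word

sharp = softmax_with_temperature(logits, temperature=0.75)
flat = softmax_with_temperature(logits, temperature=1.5)
word = sample(vocab, logits, temperature=0.75, rng=random.Random(0))
print(sharp, flat, word)
```

A temperature below 1 concentrates probability on the most likely words, giving safer and more repetitive text; a temperature above 1 flattens the distribution and makes the output more varied.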


### Creating the Classifier DataLoaders

Here we create an NLP classifier data block. We pass our language model's `vocab` into this `DataBlock`, and we use `CategoryBlock` to create the classifier. We pass the `vocab` in because we want to use the same correspondence of token to index. Otherwise, the embeddings learned in our fine-tuned language model won't make sense to this model.

We do not specify `is_lm=True`, since this is regular labeled data: the labels come from the parent folder names (`parent_label`) rather than from the next token.

In [64]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

We show the batch, with the independent variable (text) and dependent variable (sentiment):

In [65]:
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos * ! ! - xxup spoilers - ! ! * \n\n xxmaj before i begin this , let me say that i have had both the advantages of seeing this movie on the big screen and of having seen the "" authorized xxmaj version "" of this movie , remade by xxmaj stephen xxmaj king , himself , in 1997 . \n\n xxmaj both advantages made me appreciate this version of "" the xxmaj shining , "" all the more . \n\n xxmaj also , let me say that xxmaj i 've read xxmaj mr . xxmaj king 's book , "" the xxmaj shining "" on many occasions over the years , and while i love the book and am a huge fan of his work , xxmaj stanley xxmaj kubrick 's retelling of this story is far more compelling … and xxup scary . \n\n xxmaj kubrick",pos
2,"xxbos xxrep 3 * xxmaj warning - this review contains "" plot spoilers , "" though nothing could "" spoil "" this movie any more than it already is . xxmaj it really xxup is that bad . xxrep 3 * \n\n xxmaj before i begin , xxmaj i 'd like to let everyone know that this definitely is one of those so - incredibly - bad - that - you - fall - over - laughing movies . xxmaj if you 're in a lighthearted mood and need a very hearty laugh , this is the movie for you . xxmaj now without further ado , my review : \n\n xxmaj this movie was found in a bargain bin at wal - mart . xxmaj that should be the first clue as to how good of a movie it is . xxmaj secondly , it stars the lame action",neg


Create a mini-batch with the first 10 documents, numericalizing all of them:

In [67]:
nums_samp = tokens200[:10].map(num)

Show the length of each tokenized and numericalized document. We see that the documents are all different lengths. With images, we were able to correct this by cropping, padding, and/or squishing to make the inputs the same size.

For text, we use padding. We first batch together texts that are roughly the same length. We then expand the shorter texts to the size of the longest text in the batch, using a special padding token that is ignored by the model.

In [68]:
nums_samp.map(len)

(#10) [302,225,184,279,157,145,179,130,423,492]

The `TextBlock` class does the text batching/padding for us when `is_lm=False`.
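
The sort-then-pad idea can be sketched in a few lines. The `pad_batch` helper and the toy documents are hypothetical, and padding is appended on the right purely for simplicity; no claim is made here about where fastai places the padding tokens.

```python
PAD_IDX = 1  # index of 'xxpad' in the vocab printed earlier

def pad_batch(docs, pad_idx=PAD_IDX):
    """Right-pad every document to the length of the longest one in the batch."""
    max_len = max(len(d) for d in docs)
    return [d + [pad_idx] * (max_len - len(d)) for d in docs]

# Toy numericalized documents of different lengths
docs = [[2, 8, 16], [2, 9], [2, 8, 12, 15, 10], [2, 13, 14, 10]]

# Group documents of similar length so that little padding is needed
by_length = sorted(docs, key=len)
batches = [pad_batch(by_length[i:i + 2]) for i in range(0, len(by_length), 2)]

for batch in batches:
    print(batch)
```

Because similar lengths are grouped together, each batch needs only one padding token here, instead of padding every document up to the longest one in the whole dataset.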

In [69]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()

We will now load the encoder from the fine-tuned language model:

In [70]:
learn = learn.load_encoder('finetuned')

### Fine-Tuning the Classifier

In [71]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.2889,0.225328,0.90848,01:13


In [72]:
learn.freeze_to(-2)
learn.fit_one_cycle( 1, slice(1e-2/2.6**4, 1e-2) )

epoch,train_loss,valid_loss,accuracy,time
0,0.248256,0.19044,0.92612,01:18


In [73]:
learn.freeze_to(-3)
learn.fit_one_cycle( 1, slice(5e-3/(2.6**4), 5e-3) )

epoch,train_loss,valid_loss,accuracy,time
0,0.206388,0.170632,0.93572,01:39


In [74]:
learn.unfreeze()
learn.fit_one_cycle( 1, slice(1e-3/2.6**4, 1e-3) )

epoch,train_loss,valid_loss,accuracy,time
0,0.16576,0.16605,0.9382,02:00


## Disinformation and Language Models

The vast improvement of language models has led to a rise in disinformation from AI-generated text online. Language models have learned English well enough that posts written by AI are almost indistinguishable from a human's writing. Auto-generated profiles/bots abound on social media, and it is an open challenge whether we can develop models good enough to detect that something has been created by AI.

## Conclusion

Two types of models in this chapter:

* Language models that can generate text
* A classifier that determines if a review is positive or negative

To build the classifier, we:

* Started from a pretrained language model
* Fine-tuned the model on the IMDb data
* Used the model's body (the encoder) with a new head to do classification


## Questionnaire

1. What is "self-supervised learning"?
1. What is a "language model"?
1. Why is a language model considered self-supervised?
1. What are self-supervised models usually used for?
1. Why do we fine-tune language models?
1. What are the three steps to create a state-of-the-art text classifier?
1. How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?
1. What are the three steps to prepare your data for a language model?
1. What is "tokenization"? Why do we need it?
1. Name three different approaches to tokenization.
1. What is `xxbos`?
1. List four rules that fastai applies to text during tokenization.
1. Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?
1. What is "numericalization"?
1. Why might there be words that are replaced with the "unknown word" token?
1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book's website.)
1. Why do we need padding for text classification? Why don't we need it for language modeling?
1. What does an embedding matrix for NLP contain? What is its shape?
1. What is "perplexity"?
1. Why do we have to pass the vocabulary of the language model to the classifier data block?
1. What is "gradual unfreezing"?
1. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?

### Further Research

1. See what you can learn about language models and disinformation. What are the best language models today? Take a look at some of their outputs. Do you find them convincing? How could a bad actor best use such a model to create conflict and uncertainty?
1. Given the limitation that models are unlikely to be able to consistently recognize machine-generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leverage deep learning?