In [1]:
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [2]:
from fastbook import *
from IPython.display import display, HTML

In [3]:
gdrive

Path('/content/gdrive/My Drive')

In [4]:
course_path = gdrive/'Colab Notebooks/fastbook/clean'
course_path

Path('/content/gdrive/My Drive/Colab Notebooks/fastbook/clean')

# NLP Deep Dive: RNNs

## Text Preprocessing

### Word Tokenization with fastai

In [5]:
from fastai.text.all import *

In [6]:
URLs.IMDB

'https://s3.amazonaws.com/fast-ai-nlp/imdb.tgz'

In [7]:
path = untar_data(URLs.IMDB)
path

Path('/root/.fastai/data/imdb')

In [8]:
files = get_text_files(path, folders=['train', 'test', 'unsup'])
files

(#100000) [Path('/root/.fastai/data/imdb/train/pos/5493_10.txt'),Path('/root/.fastai/data/imdb/train/pos/946_8.txt'),Path('/root/.fastai/data/imdb/train/pos/2033_8.txt'),Path('/root/.fastai/data/imdb/train/pos/496_10.txt'),Path('/root/.fastai/data/imdb/train/pos/7362_10.txt'),Path('/root/.fastai/data/imdb/train/pos/12169_7.txt'),Path('/root/.fastai/data/imdb/train/pos/9501_10.txt'),Path('/root/.fastai/data/imdb/train/pos/11909_7.txt'),Path('/root/.fastai/data/imdb/train/pos/4661_10.txt'),Path('/root/.fastai/data/imdb/train/pos/8842_9.txt')...]

In [9]:
with open(files[0], 'r') as f:
    txt = f.read()
txt[:75]

"It's really rare that you get an inside view at a media deception that has "

---

Demo of `coll_repr`:

In [10]:
x = list(range(10))
x

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [11]:
coll_repr(x, 3)

'(#10) [0,1,2...]'

In [12]:
coll_repr(x, 5)

'(#10) [0,1,2,3,4...]'

In [13]:
type(coll_repr(x, 5))

str

In [14]:
print(coll_repr(x, 5))

(#10) [0,1,2,3,4...]


---

In [15]:
spacy = WordTokenizer()
type(spacy)

In [16]:
type(spacy([txt]))

generator

**Note:** The generator generates a list of tokens for each string. If we pass a list with more than one string, then we'll get back more than one list.
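
As a rough illustration of that behaviour, here is a minimal sketch in plain Python. `toy_tokenizer` is a hypothetical stand-in that splits on whitespace. It does not reproduce how `WordTokenizer` works internally; it only mimics the interface of a generator that yields one token list per input string.

```python
def toy_tokenizer(texts):
    """Hypothetical stand-in: yield one list of tokens per input string."""
    for text in texts:
        # Whitespace splitting is an assumption made for brevity;
        # a real word tokenizer also separates punctuation, contractions, etc.
        yield text.split()

gen = toy_tokenizer(['a short review', 'another one'])
first_doc = next(gen)    # tokens of the first string only
remaining = list(gen)    # one more list for the second string
print(first_doc, remaining)
```

Because the result is lazy, nothing is tokenized until the generator is consumed, which is why the notebook wraps the call in `first(...)`.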

In [17]:
help(first)

Help on function first in module fastcore.basics:

first(x, f=None, negate=False, **kwargs)
    First element of `x`, optionally filtered by `f`, or None if missing



In [18]:
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#464) ['It',"'s",'really','rare','that','you','get','an','inside','view','at','a','media','deception','that','has','been','so','widely','reported','as','official','"','truth','"','and','caught','so','many','"'...]


In [19]:
first(spacy(['The U.S. dollar $1 is $1.00.']))

(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']

In [20]:
tkn = Tokenizer(spacy)
type(tkn)

In [21]:
type(tkn(txt))

Notice two differences between `spacy` and `tkn`:

1. `spacy` accepts a collection of strings, whereas `tkn` accepts a single string.
2. `spacy` returns a generator, whereas `tkn` returns an `L` object.

In [22]:
print(coll_repr(tkn(txt), 31))

(#521) ['xxbos','xxmaj','it',"'s",'really','rare','that','you','get','an','inside','view','at','a','media','deception','that','has','been','so','widely','reported','as','official','"','truth','"','and','caught','so','many'...]


In [23]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html(x)>,
 <function fastai.text.core.replace_rep(t)>,
 <function fastai.text.core.replace_wrep(t)>,
 <function fastai.text.core.spec_add_spaces(t)>,
 <function fastai.text.core.rm_useless_spaces(t)>,
 <function fastai.text.core.replace_all_caps(t)>,
 <function fastai.text.core.replace_maj(t)>,
 <function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]

In [24]:
??replace_rep

In [25]:
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"
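
The `xxrep 3 w` tokens in the output above come from the repeated-character rule. A minimal sketch of such a rule follows. The function name `replace_rep_sketch` and the exact regular expression are assumptions chosen to match the behaviour seen in the output, not a verbatim copy of the library source.

```python
import re

# A non-space character followed by at least two more copies of itself.
_re_rep = re.compile(r'(\S)(\1{2,})')

def replace_rep_sketch(text):
    """Replace runs of 3+ identical characters with ' xxrep <n> <char> '."""
    def _sub(match):
        char, rest = match.groups()
        return f' xxrep {len(rest) + 1} {char} '
    return _re_rep.sub(_sub, text)

print(replace_rep_sketch('www.fast.ai').split())
# ['xxrep', '3', 'w', '.fast.ai']
```

Encoding the repetition as a count plus the character keeps the vocabulary small: the model sees one `xxrep` token rather than a separate entry for every possible run length.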

### Subword Tokenization

In [26]:
txts = []
for o in files[:2000]:
    with open(o, 'r') as f:
        txts.append(f.read())
txts = L(txts)
txts

(#2000) ['It\'s really rare that you get an inside view at a media deception that has been so widely reported as official "truth" and caught so many "news" agencies with their pants down. This movie, in my view, deserves every price there is in journalism - it\'s objective (yes!), courageous and a real "scoop". It can do without comment, fake scenes or leading questions - everyone, including Chavez equally gets to make fools of themselves in their own words. The filmmakers "only" had to keep track of events and keep their cameras rolling.<br /><br />The Venezuelan elite teaches us "How to depose of a President and sell it as a victory of democracy". It\'s amazing that they lost in the end - so far. From what I know, the biggest TV station involved only got its terrestrial license revoked, they\'re still broadcasting via cable and satellite. I highly doubt whether George W. or Barack Obama would be that tolerant after an attempted coup. But then, they don\'t have to worry.<br /><br />Th

In [27]:
sp = SubwordTokenizer(vocab_sz=1000)
type(sp)

Recall that SentencePiece is a common choice in multilingual models (such as XLM-RoBERTa).

In [28]:
sp.setup(txts)

{'sp_model': Path('tmp/spm.model')}

**Note:** We're passing in a list of strings to `setup`.

In [29]:
type(sp([txt]))

generator

In [30]:
toks = first(sp([txt]))
print(coll_repr(toks, 40))

(#875) ['▁It',"'",'s','▁really','▁','ra','re','▁that','▁you','▁get','▁an','▁in','side','▁view','▁at','▁a','▁me','d','ia','▁de','ce','p','tion','▁that','▁has','▁been','▁so','▁w','id','e','ly','▁re','port','ed','▁as','▁off','ic','i','al','▁"'...]


In [31]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])

In [32]:
subword(1000)

'▁It \' s ▁really ▁ ra re ▁that ▁you ▁get ▁an ▁in side ▁view ▁at ▁a ▁me d ia ▁de ce p tion ▁that ▁has ▁been ▁so ▁w id e ly ▁re port ed ▁as ▁off ic i al ▁"'

In [33]:
subword(200)

"▁I t ' s ▁re al ly ▁ ra re ▁that ▁you ▁ g e t ▁ an ▁in s i d e ▁ v i e w ▁a t ▁a ▁ m ed i a ▁ d e ce"

In [34]:
subword(10000)

'▁It \' s ▁really ▁rare ▁that ▁you ▁get ▁an ▁inside ▁view ▁at ▁a ▁media ▁de ce p tion ▁that ▁has ▁been ▁so ▁widely ▁report ed ▁as ▁official ▁" t ru th " ▁and ▁caught ▁so ▁many ▁" new s "'
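
The three outputs show the trade-off: a larger vocabulary yields fewer, longer tokens per text, while a smaller one breaks words into more pieces. The sketch below imitates that effect with a greedy longest-match segmenter. This is a deliberately simplified stand-in: SentencePiece learns its pieces from data and does not segment this way, and both `greedy_segment` and the toy vocabularies are hypothetical.

```python
def greedy_segment(word, vocab):
    """Split `word` into the longest pieces found in `vocab`, left to right."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

small_vocab = {'▁re', 'port', 'ed'}
large_vocab = small_vocab | {'▁reported'}

print(greedy_segment('▁reported', small_vocab))  # ['▁re', 'port', 'ed']
print(greedy_segment('▁reported', large_vocab))  # ['▁reported']
```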

### Numericalization with fastai

In [35]:
toks = tkn(txt)
print(coll_repr(toks, 31))

(#521) ['xxbos','xxmaj','it',"'s",'really','rare','that','you','get','an','inside','view','at','a','media','deception','that','has','been','so','widely','reported','as','official','"','truth','"','and','caught','so','many'...]


In [36]:
# Reminder:
type(txts), len(txts)

(fastcore.foundation.L, 2000)

In [37]:
toks200 = txts[:200].map(tkn)
toks200[0]

(#521) ['xxbos','xxmaj','it',"'s",'really','rare','that','you','get','an'...]

In [38]:
num = Numericalize()
type(num)

**Note:** When instantiating `Numericalize`, we can specify `min_freq` and `max_vocab`. These are needed because the spaCy word tokenizer neither fixes the number of distinct tokens nor imposes a minimum token frequency, so the vocab has to be capped at numericalization time. SentencePiece, by contrast, is trained with a fixed vocabulary size (`vocab_sz`).

In [39]:
num.setup(toks200)

**Note:** We're passing in a list of lists of tokens to `setup`.

In [40]:
type(num.vocab), len(num.vocab)

(list, 2320)

The vocab is small (2,320 entries) because we used only 200 texts to set it up.

In [41]:
coll_repr(num.vocab, 20)

"(#2320) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the',',','.','and','a','of','to','is','in','it','that'...]"

In [42]:
nums = num(toks)[:20]
nums

TensorText([   2,    8,   18,   22,  100,  908,   19,   38,  114,   47,  787,  389,   48,   13, 1072,    0,   19,   54,  117,   52])

In [43]:
' '.join(num.vocab[o] for o in nums)

"xxbos xxmaj it 's really rare that you get an inside view at a media xxunk that has been so"
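
The round trip above (tokens to IDs and back, with rare words collapsing to `xxunk`) can be sketched in plain Python. `build_vocab` and `numericalize` are hypothetical helpers, and the default values used for `min_freq` and `max_vocab` are assumptions; only the order of the special tokens is taken from the vocab printed above.

```python
from collections import Counter

# Special tokens in the order shown in `num.vocab` above.
SPECIALS = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld',
            'xxrep', 'xxwrep', 'xxup', 'xxmaj']

def build_vocab(token_lists, min_freq=3, max_vocab=60000):
    """Specials first, then the most frequent tokens seen at least `min_freq` times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    common = [tok for tok, c in counts.most_common(max_vocab)
              if c >= min_freq and tok not in SPECIALS]
    return SPECIALS + common

def numericalize(tokens, vocab):
    """Map tokens to indices; anything outside the vocab becomes 0 (xxunk)."""
    o2i = {tok: i for i, tok in enumerate(vocab)}
    return [o2i.get(tok, 0) for tok in tokens]

docs = [['xxbos', 'the', 'movie', 'the'], ['xxbos', 'the', 'end']]
vocab = build_vocab(docs, min_freq=2)
ids = numericalize(['xxbos', 'the', 'movie'], vocab)
print(ids)  # [2, 9, 0]
```

Here `movie` appears only once, falls below `min_freq`, and is therefore mapped to index 0, just as `deception` became `xxunk` in the notebook output.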

### Putting Our Texts into Batches for a Language Model

In [44]:
# Just run:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs,seq_len = 6,15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
will,study,how,we,build,a,language,model,and,train,it,for,a,while,.


In [45]:
# Just run:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
xxbos,xxmaj,in,this,chapter
movie,reviews,we,studied,in
first,we,will,look,at
how,to,customize,it,.
of,the,preprocessor,used,in
will,study,how,we,build


In [46]:
# Just run:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
",",we,will,go,back
chapter,1,and,dig,deeper
the,processing,steps,necessary,to
xxmaj,by,doing,this,","
the,data,block,xxup,api
a,language,model,and,train


In [47]:
# Just run:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
over,the,example,of,classifying
under,the,surface,.,xxmaj
convert,text,into,numbers,and
we,'ll,have,another,example
.,\n,xxmaj,then,we
it,for,a,while,.
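
The four tables above show the idea: the token stream is cut into `bs` contiguous rows, and each mini-batch takes the next `seq_len` columns from every row. A minimal sketch of that layout, including targets shifted by one token, is given below. `lm_batches` is a hypothetical helper; it ignores shuffling and other details a real language-model data loader handles.

```python
def lm_batches(ids, bs, seq_len):
    """Yield (x, y) batches where y is x shifted one token to the right."""
    # Leave one token spare so every input position has a target.
    n_per_row = (len(ids) - 1) // bs
    rows_x = [ids[i * n_per_row:(i + 1) * n_per_row] for i in range(bs)]
    rows_y = [ids[i * n_per_row + 1:(i + 1) * n_per_row + 1] for i in range(bs)]
    for start in range(0, n_per_row, seq_len):
        x = [row[start:start + seq_len] for row in rows_x]
        y = [row[start:start + seq_len] for row in rows_y]
        yield x, y

batches = list(lm_batches(list(range(25)), bs=2, seq_len=4))
print(batches[0])  # ([[0, 1, 2, 3], [12, 13, 14, 15]], [[1, 2, 3, 4], [13, 14, 15, 16]])
print(batches[1][0])  # [[4, 5, 6, 7], [16, 17, 18, 19]]
```

Note that row 0 of the second batch continues exactly where row 0 of the first batch stopped, which is what lets a recurrent model carry its hidden state from one batch to the next.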


In [48]:
nums200 = toks200.map(num)
type(nums200)

In [49]:
len(nums200)

200

In [50]:
nums200[0]

TensorText([   2,    8,   18,   22,  100,  908,   19,   38,  114,   47,  787,  389,   48,   13, 1072,    0,   19,   54,  117,   52,    0,    0,   24,    0,   23,  909,   23,   12, 1664,   52,  109,
              23, 1312,   23,    0,   29,   74,    0,  247,   11,    8,   21,   37,   10,   17,   73,  389,   10, 1073,  196, 1665,   59,   16,   17,    0,   26,   18,   22, 1666,   34,  412,   62,
              32,   10,    0,   12,   13,  197,   23,    0,   23,   11,    8,   18,   87,   72,  185, 1313,   10, 1667,  135,   61,  619, 1668,   26,  390,   10,  472,    8, 1314,  788,  321,   15,
             118,    0,   14,  566,   17,   74,  148,  620,   11,    8,    9,    0,   23,   82,   23,   92,   15,  391, 1315,   14,  413,   12,  391,   74, 1669,    0,   11,   25,    8,    9,    8,
               0,    0,    0,  179,   23,   85,   15,    0,   14,   13,    8,    0,   12,    0,   18,   24,   13,    0,   14,    0,   23,   11,    8,   18,   22,  278,   19,   51,  293,   17,    9,
          

In [51]:
nums200[0].shape

torch.Size([521])

In [52]:
nums200[1]

TensorText([   2,    8,   24,    9,    0,   14,    8,  793,   54,    0,  189,    9,  218,   10,  371,   40,    0,   36,    9, 1677,   61,    9,    0,    0,   10,   18,   16,   13,   64,   55,  102,
              39,  911,   15,    0,   13,  142,  149,   15,  280,  120,   11,    8,   19,   54, 1324,    0,  511,   40,    8,    0,    8,  568,   10,    0,   12,  474,    0,   17,    0,  512,    8,
            1325,   11,   25,    8,   24,   27,    0,  569,   56,   15,   27,  691,   10,   31,   89,   45,  391,    8,  793,   17,  133,  103,  129,    8,  568,  692,    9,  241,  219,   16,    0,
              11,    8,    9, 1326,   17,    8,  568,   28, 1678,   56,  224,  794,   10,   31,   22,   13,  334,  256,   10,   13,  241,  225,   58,   27, 1679,    8,  512,    0,   10,    8,  393,
              11,    8,    0,   14,    8,    0,    8,  621,   10,    8,  180,    8,  912,   11,   25,    8,   33,   24,    8,  795,    8,    0,  913,   21,  622,   10,    0,   16,   41,  115,  570,
          

In [53]:
nums200[1].shape

torch.Size([480])

In [54]:
dl = LMDataLoader(nums200)
type(dl)

In [55]:
x, y = first(dl)
x.shape, y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

In [56]:
' '.join(num.vocab[o] for o in x[0][:20])

"xxbos xxmaj it 's really rare that you get an inside view at a media xxunk that has been so"

In [57]:
' '.join(num.vocab[o] for o in y[0][:20])

"xxmaj it 's really rare that you get an inside view at a media xxunk that has been so xxunk"

## Training a Text Classifier

### Language Model Using DataBlock

In [58]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)



In [59]:
type(dls_lm)

In [60]:
dls_lm.show_batch(max_n=2)

,text,text_
0,"xxbos i and my friends had a great time watching this collection of dismal performances , tacky set design , a confused patchwork script , and completely unjustified invocation of serious themes like xxmaj vietnam and racism in support of xxunk trash . xxmaj the things we could n't figure out were little things like : why does a black female physics student do graduate work with a white male medical doctor ? xxmaj why this same woman had one","i and my friends had a great time watching this collection of dismal performances , tacky set design , a confused patchwork script , and completely unjustified invocation of serious themes like xxmaj vietnam and racism in support of xxunk trash . xxmaj the things we could n't figure out were little things like : why does a black female physics student do graduate work with a white male medical doctor ? xxmaj why this same woman had one hairdo"
1,"xxmaj dell xxmaj henderson , ' little xxmaj billy ' xxmaj rhodes , xxmaj billy xxmaj gilbert , and xxmaj ruth xxmaj xxunk xxmaj the acting by these actors are good especially by xxmaj jamison and xxmaj roberts . xxmaj there are many funny scenes here that i think most xxmaj three xxmaj stooges fans will love ! xxmaj in xxmaj my opinion this one of the most different xxmaj three xxmaj stooges shorts . i recommend this one to","dell xxmaj henderson , ' little xxmaj billy ' xxmaj rhodes , xxmaj billy xxmaj gilbert , and xxmaj ruth xxmaj xxunk xxmaj the acting by these actors are good especially by xxmaj jamison and xxmaj roberts . xxmaj there are many funny scenes here that i think most xxmaj three xxmaj stooges fans will love ! xxmaj in xxmaj my opinion this one of the most different xxmaj three xxmaj stooges shorts . i recommend this one to all"


Let's explore an actual minibatch comprising input IDs.

In [61]:
dls_lm[0]

<fastai.text.data.LMDataLoader at 0x7988d5973b20>

This is the training data loader.

In [62]:
x_batch, y_batch = next(iter(dls_lm[0]))
x_batch.shape, y_batch.shape

(torch.Size([128, 80]), torch.Size([128, 80]))

In [63]:
x_batch[0]

LMTensorText([    2,    19,    74,  1236,   564,    13,   359,    14,     8,  1992,     8, 17803,   135,   282,   374,    13,   656,    14,    41,  2948,    12,   449,   112,    45,    11,    31,
                496,    19,   133,   208,   209,    15,   900,     8, 10155,    52,     8,    19,   169,   340,    28,    75,  1092,    31,    15,   998,    20,    32,    35,    17,    23,   222,
               5209,   270,    94,  4307,    17,    15,     9,   310,    21,    19,    42,    36,   374,    10,    19,   434,   104,    20,  1043,   160,   399,    11,    12,   155,    19,    89,
                 36,   630], device='cuda:0')

In [64]:
y_batch[0]

TensorText([   19,    74,  1236,   564,    13,   359,    14,     8,  1992,     8, 17803,   135,   282,   374,    13,   656,    14,    41,  2948,    12,   449,   112,    45,    11,    31,   496,    19,
              133,   208,   209,    15,   900,     8, 10155,    52,     8,    19,   169,   340,    28,    75,  1092,    31,    15,   998,    20,    32,    35,    17,    23,   222,  5209,   270,    94,
             4307,    17,    15,     9,   310,    21,    19,    42,    36,   374,    10,    19,   434,   104,    20,  1043,   160,   399,    11,    12,   155,    19,    89,    36,   630,    98],
           device='cuda:0')

In [65]:
type(dls_lm.vocab), len(dls_lm.vocab)

(list, 60008)

In [66]:
print(dls_lm.vocab[:30])

['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj', 'the', '.', ',', 'and', 'a', 'of', 'to', 'is', 'it', 'in', 'i', 'this', 'that', '"', "'s", '-', 'was', '\n\n', 'as', 'with', 'for']


### Fine-Tuning the Language Model

In [67]:
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()

In [68]:
type(learn)

In [69]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.00889,3.907601,0.299679,49.779408,13:30
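
The `perplexity` column is the exponential of the validation loss (cross-entropy), so it can be checked by hand. A small sketch, where `perplexity` is a hypothetical helper name:

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp(loss) for a loss measured in nats."""
    return math.exp(cross_entropy_loss)

# Validation loss reported for the first epoch above.
print(round(perplexity(3.907601), 2))
```

The result agrees with the reported value of about 49.78. Lower is better, and a perplexity of 1 would mean the model always assigns probability 1 to the correct next token.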


### Saving and Loading Models

In [70]:
learn.save('1epoch')

Path('/root/.fastai/data/imdb/models/1epoch.pth')

In [71]:
learn = learn.load('1epoch')

In [72]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.753312,3.767698,0.315585,43.2803,14:24
1,3.705464,3.707013,0.322892,40.731964,14:45
2,3.643721,3.659286,0.32783,38.833588,14:27
3,3.555379,3.625676,0.332236,37.550114,14:25
4,3.499944,3.604378,0.335106,36.75882,14:27
5,3.428487,3.589561,0.337112,36.218174,14:23
6,3.369237,3.581008,0.338364,35.90971,14:20
7,3.305494,3.578224,0.339425,35.809875,13:39
8,3.25001,3.580995,0.339758,35.909237,13:32
9,3.219402,3.585987,0.339525,36.088963,13:28


In [73]:
learn.save_encoder('finetuned')

### Text Generation

In [74]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)]

In [75]:
preds

["i liked this movie because it dealt with real issues which i do n't know , but somehow the film was very realistic , and very hard to watch . i have seen Gary Oldman 's movies , and i always loved him",
 'i liked this movie because it was a good love story . It was a story about a married couple who decide to decide one way to go to a Florida beach . i liked the ending so much , i thought it']
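
The `temperature` argument controls how random the sampled words are. The sketch below shows generic temperature sampling over a vector of scores. It is an assumption-laden illustration, not the library's exact implementation: `sample_with_temperature` and the example logits are hypothetical.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.75, rng=random):
    """Return (sampled index, probabilities) after scaling logits by 1/temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs

logits = [2.0, 1.0, 0.1]
_, sharp = sample_with_temperature(logits, temperature=0.5, rng=random.Random(0))
_, flat = sample_with_temperature(logits, temperature=1.5, rng=random.Random(0))
print(sharp[0] > flat[0])  # True
```

Temperatures below 1 concentrate probability on the most likely token, giving safer but more repetitive text; temperatures above 1 flatten the distribution and make the output more varied.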

### Creating the Classifier DataLoaders

In [76]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
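
In this data block, both the label and the train/validation split are read from the file path: the parent folder is the label and the grandparent folder is the split. A minimal sketch of that idea using only the standard library follows. `label_of` and `split_of` are hypothetical helpers, not the fastai functions themselves.

```python
from pathlib import PurePosixPath

def label_of(path):
    """Label is the name of the folder containing the file (e.g. 'pos')."""
    return PurePosixPath(path).parent.name

def split_of(path):
    """Split is the name of the folder one level higher (e.g. 'train')."""
    return PurePosixPath(path).parent.parent.name

p = '/root/.fastai/data/imdb/train/pos/5493_10.txt'
print(label_of(p), split_of(p))  # pos train
```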



In [77]:
dls_clas.show_batch(max_n=3)

,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos xxmaj some have praised xxunk xxmaj lost xxmaj xxunk as a xxmaj disney adventure for adults . i do n't think so -- at least not for thinking adults . \n\n xxmaj this script suggests a beginning as a live - action movie , that struck someone as the type of crap you can not sell to adults anymore . xxmaj the "" crack staff "" of many older adventure movies has been done well before , ( think xxmaj the xxmaj dirty xxmaj dozen ) but xxunk represents one of the worse films in that motif . xxmaj the characters are weak . xxmaj even the background that each member trots out seems stock and awkward at best . xxmaj an xxup md / xxmaj medicine xxmaj man , a tomboy mechanic whose father always wanted sons , if we have not at least seen these before ,",neg
2,"xxbos xxmaj warning : xxmaj does contain spoilers . \n\n xxmaj open xxmaj your xxmaj eyes \n\n xxmaj if you have not seen this film and plan on doing so , just stop reading here and take my word for it . xxmaj you have to see this film . i have seen it four times so far and i still have n't made up my mind as to what exactly happened in the film . xxmaj that is all i am going to say because if you have not seen this film , then stop reading right now . \n\n xxmaj if you are still reading then i am going to pose some questions to you and maybe if anyone has any answers you can email me and let me know what you think . \n\n i remember my xxmaj grade 11 xxmaj english teacher quite well . xxmaj",pos


In [78]:
dls_clas[0]

<fastai.text.data.SortedDL at 0x7988c07d3d00>

This is the training data loader.

In [79]:
x_batch, y_batch = next(iter(dls_clas[0]))
x_batch.shape, y_batch.shape

(torch.Size([128, 3345]), torch.Size([128]))

In [80]:
nums_samp = toks200[:10].map(num)

In [81]:
nums_samp.map(len)

(#10) [521,480,161,67,204,334,200,237,398,943]
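
Reviews have very different lengths, but a classifier batch must be a rectangular tensor, so shorter reviews are padded up to the longest one in the batch (hence the width of 3345 above). The sketch below shows the idea. `pad_batch` is a hypothetical helper, padding on the right is a simplification chosen here, and `pad_id=1` follows the position of `xxpad` in the vocab shown earlier.

```python
def pad_batch(seqs, pad_id=1):
    """Pad every sequence with `pad_id` up to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_id] * (max_len - len(s)) for s in seqs]

padded = pad_batch([[5, 6, 7], [8]])
print(padded)  # [[5, 6, 7], [8, 1, 1]]
```

Grouping texts of similar length into the same batch keeps the amount of padding small. Language-model batches need no padding because all texts are concatenated into one stream first.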

In [82]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()

In [83]:
learn = learn.load_encoder('finetuned')

### Fine-Tuning the Classifier

In [84]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.249031,0.178546,0.93196,01:08


In [85]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.216906,0.164325,0.93896,01:04
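
Passing `slice(1e-2/(2.6**4), 1e-2)` asks for discriminative learning rates: the earliest layers get the smallest rate and the last layers the largest, with the rates spaced geometrically in between. The sketch below shows the arithmetic. `geometric_lrs` is a hypothetical helper, and the use of five parameter groups is an assumption suggested by the exponent 4 (four steps of a factor 2.6).

```python
def geometric_lrs(start, stop, n_groups):
    """Return `n_groups` learning rates spaced by a constant ratio from start to stop."""
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]

lrs = geometric_lrs(1e-2 / (2.6 ** 4), 1e-2, n_groups=5)
for group, lr in enumerate(lrs):
    print(group, f'{lr:.6f}')
```

Under that assumption each group trains 2.6 times faster than the one before it, so the pretrained early layers change slowly while the newly added head adapts quickly.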


In [86]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.19092,0.148512,0.94464,01:11


In [87]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.157399,0.146628,0.94608,01:13
1,0.151411,0.147501,0.9468,01:12



## Disinformation and Language Models

## Conclusion

## Questionnaire

1. What is "self-supervised learning"?
1. What is a "language model"?
1. Why is a language model considered self-supervised?
1. What are self-supervised models usually used for?
1. Why do we fine-tune language models?
1. What are the three steps to create a state-of-the-art text classifier?
1. How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?
1. What are the three steps to prepare your data for a language model?
1. What is "tokenization"? Why do we need it?
1. Name three different approaches to tokenization.
1. What is `xxbos`?
1. List four rules that fastai applies to text during tokenization.
1. Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?
1. What is "numericalization"?
1. Why might there be words that are replaced with the "unknown word" token?
1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book's website.)
1. Why do we need padding for text classification? Why don't we need it for language modeling?
1. What does an embedding matrix for NLP contain? What is its shape?
1. What is "perplexity"?
1. Why do we have to pass the vocabulary of the language model to the classifier data block?
1. What is "gradual unfreezing"?
1. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?

### Further Research

1. See what you can learn about language models and disinformation. What are the best language models today? Take a look at some of their outputs. Do you find them convincing? How could a bad actor best use such a model to create conflict and uncertainty?
1. Given the limitation that models are unlikely to be able to consistently recognize machine-generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leverage deep learning?