<a href="https://colab.research.google.com/github/variable-jain/Deep-learning-experiments-with-pytorch-and-fastai/blob/main/NLP_in_pytorch_fastai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
# Mount at /content/gdrive so the save/load paths used later in this notebook resolve
drive.mount('/content/gdrive')

In [None]:
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [3]:
from fastbook import *
from IPython.display import display,HTML

# NLP in fastai

In [4]:
from fastai.text.all import *

## Text Preprocessing

### Text Tokenization

> Word Tokenization

In [5]:
path = untar_data(URLs.IMDB)

In [6]:
files = get_text_files(path, folders=['train', 'test', 'unsup'])

In [7]:
txt = files[0].open().read()
txt[:75]

'Once again, ITV force upon us another hash and unwanted remake of a perfect'

There are two ways to create a tokenizer:
* Use the *WordTokenizer* class directly, which wraps the __spacy__ tokenizer by default.
* Wrap that tokenizer in fastai's *Tokenizer* class, which adds special tokens such as `xxbos`, `xxmaj`, and `xxup` on top of the word tokens.

In [8]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#778) ['Once','again',',','ITV','force','upon','us','another','hash','and','unwanted','remake','of','a','perfectly','good','predecessor','.','This','episode','may','possibly','be','worse','than','the','previous',',','The','Body'...]


In [9]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#846) ['xxbos','xxmaj','once','again',',','xxup','itv','force','upon','us','another','hash','and','unwanted','remake','of','a','perfectly','good','predecessor','.','xxmaj','this','episode','may','possibly','be','worse','than','the','previous'...]
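
The extra tokens in the wrapped output record information that lowercasing would otherwise lose: `xxbos` marks the start of a text, `xxmaj` a capitalized word, and `xxup` a fully upper-case word. The following is a minimal pure-Python sketch of that idea. It is not fastai's implementation, and `mark_case` is a hypothetical helper introduced only for illustration.

```python
def mark_case(tokens):
    """Lowercase tokens, recording the original casing with marker tokens."""
    out = ['xxbos']  # beginning-of-stream marker
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += ['xxup', tok.lower()]   # fully upper-case word
        elif tok[:1].isupper():
            out += ['xxmaj', tok.lower()]  # capitalized word
        else:
            out.append(tok)                # punctuation and lower-case words pass through
    return out

marked = mark_case(['Once', 'again', ',', 'ITV', 'force', 'upon', 'us'])
print(marked)
```

On the first seven word tokens of the review, this reproduces the first ten tokens printed above.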


> Subword Tokenization

In [11]:
txts = L(o.open().read() for o in files[:2000])

In [15]:
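def subword(sz):
    # Train a subword tokenizer with a vocabulary of `sz` pieces on the 2,000-review sample
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    # Tokenize the first review and show its first 40 subword pieces
    return ' '.join(first(sp([txt]))[:40])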

In [16]:
subword(1000)

'▁O n ce ▁again , ▁I T V ▁for ce ▁up on ▁us ▁another ▁has h ▁and ▁un w ant ed ▁re m ake ▁of ▁a ▁perfect ly ▁good ▁pre de ce s s or . ▁This ▁episode ▁may ▁po'

In [17]:
subword(200)

'▁ O n ce ▁a g a in , ▁I T V ▁for ce ▁ u p on ▁ us ▁ an o th er ▁h a s h ▁and ▁ un w an t ed ▁re m a k'

### Numericalization

In [19]:
toks = tkn(txt)
print(coll_repr(toks,31))

(#846) ['xxbos','xxmaj','once','again',',','xxup','itv','force','upon','us','another','hash','and','unwanted','remake','of','a','perfectly','good','predecessor','.','xxmaj','this','episode','may','possibly','be','worse','than','the','previous'...]


In [20]:
toks200 = txts[:200].map(tkn)
toks200[0]

(#846) ['xxbos','xxmaj','once','again',',','xxup','itv','force','upon','us'...]

In [22]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

"(#2024) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','is','in','it','i'...]"

In [26]:
nums = num(toks)[:20]
nums

TensorText([  2,   8, 268, 174,  11,   7,   0, 647, 962, 175, 193,   0,  13,   0, 963,  14,  12, 723,  65,   0])

In [27]:
' '.join(num.vocab[o] for o in nums)

'xxbos xxmaj once again , xxup xxunk force upon us another xxunk and xxunk remake of a perfectly good xxunk'
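
The `xxunk` entries appear because the vocab was built from only 200 reviews: any token that does not occur often enough in that sample is mapped to index 0. The sketch below shows the mechanism with plain Python. `build_vocab` and `numericalize` are hypothetical helpers, and the `min_freq=3` threshold is an assumption chosen to mirror what is, as far as I know, the `Numericalize` default.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=3, specials=('xxunk', 'xxpad', 'xxbos')):
    """Special tokens first, then tokens seen at least `min_freq` times, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, c in counts.most_common()
                if c >= min_freq and tok not in specials]
    return list(specials) + frequent

def numericalize(tokens, vocab):
    """Map each token to its vocab index; unknown tokens map to 0 ('xxunk')."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

docs = [['xxbos', 'a', 'good', 'film'],
        ['xxbos', 'a', 'good', 'remake'],
        ['xxbos', 'a', 'good', 'hash']]
vocab = build_vocab(docs, min_freq=3)
ids = numericalize(['xxbos', 'a', 'perfectly', 'good', 'remake'], vocab)
print(vocab, ids)
```

Here `perfectly` was never seen and `remake` was seen only once, so both fall back to index 0.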

## Training a Language Model

### Batches and DataLoaders

In [28]:
nums200 = toks200.map(num)

In [29]:
dl = LMDataLoader(nums200)

In [31]:
x, y = first(dl)
x.shape, y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

In [32]:
' '.join(num.vocab[o] for o in x[0][:20])

'xxbos xxmaj once again , xxup xxunk force upon us another xxunk and xxunk remake of a perfectly good xxunk'

In [34]:
' '.join(num.vocab[o] for o in y[0][:20])

'xxmaj once again , xxup xxunk force upon us another xxunk and xxunk remake of a perfectly good xxunk .'
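
Comparing the two outputs shows that `y` is `x` shifted by one token: at every position the model is asked to predict the next token. The sketch below illustrates this batching scheme on a toy stream of integers. `lm_batches` is a hypothetical helper, and it deliberately ignores shuffling and the way `LMDataLoader` handles leftover tokens.

```python
def lm_batches(stream, bs, seq_len):
    """Yield (x, y) batches where y is x shifted one token to the right."""
    row_len = (len(stream) - 1) // bs  # tokens available per row
    # Each row is a contiguous chunk of the stream, plus one extra token for the last target
    rows = [stream[i * row_len:(i + 1) * row_len + 1] for i in range(bs)]
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        x = [r[start:end] for r in rows]
        y = [r[start + 1:end + 1] for r in rows]
        yield x, y

stream = list(range(21))  # stand-in for the concatenated token ids
batches = list(lm_batches(stream, bs=2, seq_len=5))
x0, y0 = batches[0]
print(x0, y0)
```

Because each row continues where it left off in the previous batch, a recurrent model can carry its hidden state from one batch to the next.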

In [35]:
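# Use every review, including the unlabeled `unsup` folder: a language model needs no labels
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks = TextBlock.from_folder(path, is_lm = True),  # is_lm=True: targets are the next tokens
    get_items = get_imdb,
    splitter = RandomSplitter(0.1),  # hold out 10% of the reviews for validation
).dataloaders(path, path=path, bs=128, seq_len=80)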

In [36]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj fassbinder 's most lavish production sacrifices little of his talent for identifying and deconstructing a xxunk of suffering in long , mobile takes that somehow also act as social xxunk ; here , it 's much more overt , since the story takes place in war - torn xxmaj germany at the end of xxup wwii , and the central character is a woman ( hanna xxmaj schygulla as xxmaj maria ) who capitalizes on vulnerabilities ( both","xxmaj fassbinder 's most lavish production sacrifices little of his talent for identifying and deconstructing a xxunk of suffering in long , mobile takes that somehow also act as social xxunk ; here , it 's much more overt , since the story takes place in war - torn xxmaj germany at the end of xxup wwii , and the central character is a woman ( hanna xxmaj schygulla as xxmaj maria ) who capitalizes on vulnerabilities ( both economic"
1,"magnificent , playing just right the rather confused , unlucky thirty - five year old young woman a bit given to alcohol , not overplaying her part ; xxmaj maría xxmaj galiana as her mother shows even at her age that she has come from good theatre , as no less does xxmaj carlos xxunk - novoa as the lonely neighbour . xxmaj benito xxmaj xxunk - as xxmaj almodóvar - , not only directs his film but is also",", playing just right the rather confused , unlucky thirty - five year old young woman a bit given to alcohol , not overplaying her part ; xxmaj maría xxmaj galiana as her mother shows even at her age that she has come from good theatre , as no less does xxmaj carlos xxunk - novoa as the lonely neighbour . xxmaj benito xxmaj xxunk - as xxmaj almodóvar - , not only directs his film but is also responsible"


### Fine-tuning the Language Model

In [None]:
# Start from the pretrained AWD-LSTM; to_fp16() enables mixed-precision training
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()

In [38]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.012404,3.903874,0.299967,49.5942,21:33


In [39]:
learn.save("/content/gdrive/MyDrive/Colab Notebooks/Experiments Part1 fastai/1epoch")

Path('/content/gdrive/MyDrive/Colab Notebooks/Experiments Part1 fastai/1epoch.pth')

In [40]:
learn.load("/content/gdrive/MyDrive/Colab Notebooks/Experiments Part1 fastai/1epoch")

<fastai.text.learner.LMLearner at 0x7f75580af610>

In [41]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.76284,3.764196,0.316087,43.129032,23:05
1,3.714295,3.713881,0.322069,41.012669,22:57
2,3.631323,3.661057,0.327593,38.902458,22:58
3,3.571931,3.628157,0.331559,37.643383,22:57
4,3.496409,3.604099,0.334429,36.748569,23:01
5,3.442703,3.586575,0.337345,36.110195,23:03
6,3.366307,3.574365,0.339214,35.671959,23:08
7,3.28678,3.573454,0.34013,35.639477,23:39
8,3.232843,3.577133,0.340604,35.770828,23:09
9,3.216559,3.580853,0.34038,35.904137,23:02
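
The `perplexity` column is the exponential of the validation loss, so the two columns carry the same information on different scales. A quick check against three rows from the training logs in this section (the values are copied from the tables; nothing else is assumed):

```python
import math

valid_losses = [3.903874, 3.573454, 3.580853]  # frozen epoch, then epochs 7 and 9 after unfreezing
reported = [49.5942, 35.639477, 35.904137]     # perplexity logged for the same rows

for loss, ppl in zip(valid_losses, reported):
    print(f"exp({loss}) = {math.exp(loss):.4f}  (logged: {ppl})")
```

The logs also show that validation loss reaches its minimum at epoch 7 and rises slightly afterwards, while training loss keeps falling.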


In [42]:
# Save only the encoder (the model without its final layer) for reuse in the classifier
learn.save_encoder('/content/gdrive/MyDrive/Colab Notebooks/Experiments Part1 fastai/finetuned')

> Text generation using the language model

In [45]:
TEXT = "I like the movies because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature = 0.75) for _ in range(N_SENTENCES)]

In [46]:
print('\n'.join(preds))

i like the movies because they are great , and i have seen The Lion King , Lion King and Lion 's Gate , and i thought that this one was better , and the story was nice
i like the movies because of their performances , but this film was just too much . Its an unrealistic story and pretty uninteresting at the same time . The acting did n't really sound any good and the actors were just okay
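
The `temperature` argument controls how adventurous the sampling is: values below 1 sharpen the predicted distribution toward the most likely tokens, and values above 1 flatten it. The sketch below shows the generic technique of softmax with temperature on made-up logits. It is a conceptual illustration rather than fastai's exact code, and `softmax_with_temperature` is a hypothetical helper.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by `temperature`."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # illustrative scores for three candidate tokens
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.75)
print(p_default, p_sharp)
```

With `temperature=0.75`, as in the call above, the top candidate gets a larger share of the probability mass than it would at the default of 1.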


## Training the Classification Model

In [48]:
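dls_clas = DataBlock(
    # Reuse the language model's vocab so token indices match the fine-tuned encoder
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y = parent_label,  # the label is the parent folder name: pos or neg
    get_items = partial(get_text_files, folders=['train', 'test']),
    splitter = GrandparentSplitter(valid_name='test')  # use the test folder as the validation set
).dataloaders(path, path=path, bs=128, seq_len=72)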

In [49]:
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos xxmaj by now you 've probably heard a bit about the new xxmaj disney dub of xxmaj miyazaki 's classic film , xxmaj laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky . xxmaj during late summer of 1998 , xxmaj disney released "" kiki 's xxmaj delivery xxmaj service "" on video which included a preview of the xxmaj laputa dub saying it was due out in "" 1 xxrep 3 9 "" . xxmaj it 's obviously way past that year now , but the dub has been finally completed . xxmaj and it 's not "" laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky "" , just "" castle xxmaj in xxmaj the xxmaj sky "" for the dub , since xxmaj laputa is not such a nice word in xxmaj spanish ( even though they use the word xxmaj laputa many times",pos
2,"xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , steaming bowl of oatmeal . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n't quite feel right . xxmaj victor xxmaj vargas suffers from a certain overconfidence on the director 's part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an idyllic storyline would make the film critic proof . xxmaj he was right , but it did n't fool me . xxmaj raising xxmaj victor xxmaj vargas is",neg


In [50]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()

In [52]:
learn.load_encoder("/content/gdrive/MyDrive/Colab Notebooks/Experiments Part1 fastai/finetuned")

<fastai.text.learner.TextLearner at 0x7f754b6cf690>

In [53]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.253113,0.179493,0.92948,01:07


In [54]:
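# Gradual unfreezing: train only the last three parameter groups, with discriminative learning rates
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))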

epoch,train_loss,valid_loss,accuracy,time
0,0.212495,0.154234,0.9422,01:30
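
Passing `slice(lo, hi)` as the learning rate gives the earliest layers the smallest rate and the final layers the largest, with the groups in between spaced multiplicatively. The sketch below reproduces that spacing in plain Python. `geometric_lrs` is a hypothetical helper, and the count of five parameter groups is an assumption about this model; it is what makes the `2.6**4` divisor correspond to a factor of 2.6 between neighbouring groups.

```python
def geometric_lrs(lo, hi, n_groups):
    """Learning rates from `lo` to `hi`, each a constant factor larger than the previous one."""
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

# n_groups=5 is an assumption about the number of parameter groups in the classifier
lrs = geometric_lrs(5e-3 / 2.6**4, 5e-3, n_groups=5)
for i, lr in enumerate(lrs):
    print(f"group {i}: {lr:.6f}")
```

The same reasoning applies to the final `slice(1e-3/(2.6**4), 1e-3)` call after `unfreeze()`, just with a lower top rate.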


In [55]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.178305,0.148207,0.9456,01:54
1,0.166674,0.147651,0.94556,01:54
