<a href="https://colab.research.google.com/github/vikramkrishnan9885/MyColab/blob/master/fastai/FastAITextProc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import fastai
fastai.__version__

'1.0.61'

In [2]:
# validate access to GPU

!nvidia-smi

Sat Jan 22 08:04:24 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Training a deep learning language model with a curated IMDb text dataset

In [4]:
!pip install -Uqq fastbook


In [5]:
import fastbook
fastbook.setup_book()

Mounted at /content/gdrive


In [6]:
from fastbook import *
from fastai.text.all import *

In [7]:
# define timestamp string for saving models
modifier = "aug13_2021"

In [8]:
%%time
# create path object
path = untar_data(URLs.IMDB)
path.ls()

CPU times: user 16.3 s, sys: 8.13 s, total: 24.4 s
Wall time: 28.9 s


In [9]:
%%time
# create TextDataLoaders object
# note: get_imdb is defined here but is not passed to the from_folder call below
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls = TextDataLoaders.from_folder(path, valid='test', is_lm=True, bs=8)

CPU times: user 43.9 s, sys: 10.6 s, total: 54.5 s
Wall time: 4min 28s


Here are the arguments for the definition of the TextDataLoaders object:
* **path:** The path object (associated with the IMDb curated dataset) that you defined earlier in the notebook.
* **valid:** Identifies the folder in the dataset's directory structure that will be used to assess the performance of the model: imdb/test.
* **is_lm:** Set to True to indicate that this object will be used for a language model (as opposed to a text classifier).
* **bs:** Specifies the batch size.

**Note:**

When you train a language model on a large dataset such as IMDb, lowering bs below the default batch size of 64 is essential for avoiding memory errors, which is why it is set to 8 in this TextDataLoaders definition.
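
To get a rough sense of why the batch size matters, the sketch below estimates the memory needed for the model's output logits alone for one batch. It assumes 32-bit floats and reuses the sequence length (72) and vocabulary size (60008) that learn.summary() reports later in this notebook. The function name logits_mib is hypothetical, and real training memory is considerably higher because of intermediate activations, gradients, and optimizer state.

```python
def logits_mib(bs, seq_len=72, vocab=60008, bytes_per_value=4):
    """Approximate size in MiB of a (bs x seq_len x vocab) float tensor."""
    return bs * seq_len * vocab * bytes_per_value / 2**20

# compare the batch size used here (8) with larger candidates
for bs in (8, 16, 64):
    print(f"bs={bs:>2}: about {logits_mib(bs):,.0f} MiB of logits per batch")
```

The estimate grows linearly with bs, so the default of 64 needs eight times the logits memory of the value used in this notebook.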

Run the following cell to show a couple of items from a sample batch.
The max_n argument specifies the number of sample batch items to show.

In [10]:
dls.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj it holds very true to the original manga of the same name , aka ( tramps xxmaj like xxmaj us in the xxup u.s ) but it can still be enjoyed even if you have n't read the manga . xxmaj it 's a different kind of tail , showing a strong and independent woman who hurts just like everyone else . xxmaj however , because of her outward strength","xxmaj it holds very true to the original manga of the same name , aka ( tramps xxmaj like xxmaj us in the xxup u.s ) but it can still be enjoyed even if you have n't read the manga . xxmaj it 's a different kind of tail , showing a strong and independent woman who hurts just like everyone else . xxmaj however , because of her outward strength ,"
1,"the historical dust . xxmaj the terrain looks like an interior tribal reservation of no particular importance to the coastal xxunk where industry dwells . xxmaj yet , it 's also a region most likely to survive anything like a destructive last wave . xxmaj perhaps there 's something about past and future to think about here . \n\n xxmaj anyway , this is a really good movie that will probably stay","historical dust . xxmaj the terrain looks like an interior tribal reservation of no particular importance to the coastal xxunk where industry dwells . xxmaj yet , it 's also a region most likely to survive anything like a destructive last wave . xxmaj perhaps there 's something about past and future to think about here . \n\n xxmaj anyway , this is a really good movie that will probably stay with"
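
In the sample batch above, the text_ column is the text column shifted forward by one token: at every position the target is the next token. The plain-Python sketch below illustrates that pairing on a short token list. The helper lm_pairs is hypothetical and greatly simplified. It is not how fastai builds its batches, which also involves concatenating documents and arranging them into batches.

```python
def lm_pairs(tokens, seq_len):
    """Split a token stream into (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for start in range(0, len(tokens) - seq_len, seq_len):
        x = tokens[start:start + seq_len]
        y = tokens[start + 1:start + seq_len + 1]
        pairs.append((x, y))
    return pairs

# tokens taken from the beginning of the first sample shown above
tokens = "xxbos xxmaj it holds very true to the original manga".split()
for x, y in lm_pairs(tokens, seq_len=4):
    print(x, "->", y)
```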


In [29]:
%%time
# define and train model
learn = language_model_learner(dls,AWD_LSTM,metrics=accuracy).to_fp32()
learn.fine_tune(1, 1e-1)

epoch,train_loss,valid_loss,accuracy,time
0,6.321066,5.738266,0.158222,29:33


epoch,train_loss,valid_loss,accuracy,time
0,4.554787,4.394859,0.270794,33:39


CPU times: user 1h 2min 35s, sys: 37.2 s, total: 1h 3min 12s
Wall time: 1h 3min 14s


Run the following cell to exercise the language model you just trained.

Here are the arguments for this invocation of the language model:
1. The input text sample "what comes next" (first argument) is the phrase that the model will complete. The language model predicts the words that follow this phrase.
2. n_words: The number of words that the language model generates to complete the input phrase.

In [30]:
learn.predict("what comes next", n_words=20)

'what comes next other in the world eating mutiny without a mental hospital would hit it on the Wall film - i'
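
Conceptually, this kind of completion is produced one word at a time: the model yields a probability for each candidate next word, one word is sampled, and the extended phrase is fed back in. The toy sketch below shows that loop with a hand-written probability table standing in for the trained model. The names generate and toy_probs and the table contents are hypothetical and are not part of fastai.

```python
import random

def generate(prompt, n_words, next_word_probs, seed=0):
    """Extend prompt by n_words, sampling one word at a time from next_word_probs."""
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(n_words):
        candidates = next_word_probs(words)            # dict: word -> probability
        choices, weights = zip(*candidates.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)

def toy_probs(words):
    """Hypothetical stand-in for the trained model: looks only at the last word."""
    table = {
        "next": {"is": 0.6, "in": 0.4},
        "is": {"the": 1.0},
        "in": {"the": 1.0},
        "the": {"film": 0.5, "world": 0.5},
    }
    return table.get(words[-1], {"next": 1.0})

print(generate("what comes next", n_words=5, next_word_probs=toy_probs))
```

Because each word is sampled rather than chosen deterministically, repeated calls to a real language model can return different completions for the same prompt.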

In [31]:
learn.summary()

SequentialRNN (Input shape: 8 x 72)
Layer (type)         Output Shape         Param #    Trainable 
                     8 x 72 x 1152       
LSTM                                                           
LSTM                                                           
____________________________________________________________________________
                     8 x 72 x 400        
LSTM                                                           
RNNDropout                                                     
RNNDropout                                                     
RNNDropout                                                     
____________________________________________________________________________
                     8 x 72 x 60008      
Linear                                    24063208   True      
RNNDropout                                                     
____________________________________________________________________________

Total params: 24,063,208
Total

In [32]:
learn.model

SequentialRNN(
  (0): AWD_LSTM(
    (encoder): Embedding(60008, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(60008, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1152, batch_first=True)
      )
      (1): WeightDropout(
        (module): LSTM(1152, 1152, batch_first=True)
      )
      (2): WeightDropout(
        (module): LSTM(1152, 400, batch_first=True)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=400, out_features=60008, bias=True)
    (output_dp): RNNDropout()
  )
)
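
As a sanity check on the layer sizes printed above, the sketch below recomputes parameter counts from the dimensions alone. The decoder count matches the Linear row in the learn.summary() output. The LSTM counts assume the standard single-layer, unidirectional torch.nn.LSTM layout (four gates, two bias vectors) and ignore any extra weight copies the WeightDropout wrapper may keep, so treat them as approximate. The helper names are hypothetical.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

def lstm_params(input_size, hidden_size):
    """Single-layer, unidirectional LSTM: 4 gates, input and hidden weights, two bias vectors."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

# decoder dimensions from learn.model: Linear(in_features=400, out_features=60008)
print("decoder:", linear_params(400, 60008))  # 24063208, as in the Linear row of learn.summary()

# LSTM dimensions from learn.model
for i, (n_in, n_hid) in enumerate([(400, 1152), (1152, 1152), (1152, 400)]):
    print(f"lstm {i}:", lstm_params(n_in, n_hid))
```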

In [33]:
keep_path = learn.path

# workaround to make path writeable
learn.path = Path('/notebooks/temp')

learn.path

Path('/notebooks/temp')

In [34]:
learn.model_dir

'models'

In [35]:
learn.save('lm_'+modifier)

Path('/notebooks/temp/models/lm_aug13_2021.pth')
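
The path returned above appears to be assembled from learn.path, learn.model_dir, and the name passed to learn.save, with a .pth suffix. The sketch below reproduces that composition with pathlib only. checkpoint_path is a hypothetical helper for illustration, not a fastai function.

```python
from pathlib import PurePosixPath

def checkpoint_path(base, model_dir, name):
    """Compose <base>/<model_dir>/<name>.pth, mirroring the path shown in the output above."""
    return PurePosixPath(base) / model_dir / f"{name}.pth"

modifier = "aug13_2021"
print(checkpoint_path("/notebooks/temp", "models", "lm_" + modifier))
```

This also shows why learn.path is redirected first: the checkpoint is written under whatever directory learn.path points to, so that directory has to be writeable.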

In [36]:
# workaround to save encoder - need to do this to later load encoder for classifier
learn.save_encoder('ft_'+modifier)

In [39]:
learn.export('lm_model_'+modifier+'.pkl')