# Training a text classifier model on a standalone dataset with fastai
- This notebook ingests the fastai curated IMDB_SAMPLE dataset
- It assumes you have already run the text_train_lm.ipynb notebook to create a language model
- The encoder from that language model is used to create the text classifier

In [1]:
#hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

Mounted at /content/gdrive


In [2]:
#hide
from fastbook import *
from fastai.text.all import *

In [3]:
# ensure that the value of modifier matches the value of modifier in the text_standalone_dataset_lm notebook
modifier = 'ga_apr10'
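
The modifier is concatenated directly into the file names used later in this notebook, so a mismatch with the language model notebook means the encoder file will not be found. A minimal sketch of the two names built from the expressions used in the cells that load the encoder and save the classifier; the variable names `encoder_name` and `classifier_name` are illustrative only:

```python
modifier = 'ga_apr10'

# Same expressions as the load_encoder and save cells in this notebook.
# Note that no separator is inserted between the prefix and the modifier.
encoder_name = 'ft_standalone' + modifier
classifier_name = 'classifier_single_epoch_' + modifier + 'd'

print(encoder_name)
print(classifier_name)
```

Printing the names before training is a cheap way to confirm they match the files written by the language model notebook.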

# Ingest the dataset
- define the source of the dataset
- create a dataframe for the training dataset

In [4]:
%%time
# download and extract the IMDB_SAMPLE dataset, then list its contents
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()

CPU times: user 55.7 ms, sys: 10.9 ms, total: 66.6 ms
Wall time: 844 ms


In [5]:
# read the training CSV into a dataframe; pass an encoding argument to read_csv only if a decode error occurs
df_train = pd.read_csv(path/'texts.csv')

# Define the text classifier

In [7]:
%%time
# create TextDataLoaders object
dls = TextDataLoaders.from_df(df_train, path=path, text_col='text', label_col='label')

CPU times: user 1min 7s, sys: 1.78 s, total: 1min 9s
Wall time: 1min 23s


In [8]:
dls.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n't quite feel right . xxmaj victor xxmaj vargas suffers from a certain xxunk on the director 's part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an xxunk storyline would make the film critic proof . xxmaj he was right , but it did n't fool me . xxmaj raising xxmaj victor xxmaj vargas is",negative
1,"xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with the xxunk possible scenarios to get the two protagonists together in the end . xxmaj in fact , all its charm is xxunk , contained within the characters and the setting and the plot … which is highly believable to xxunk . xxmaj it 's easy to think that such a love story , as beautiful as any other ever told , * could * happen to you … a feeling you do n't often get from other romantic comedies",positive
2,"xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of "" at xxmaj the xxmaj movies "" in taking xxmaj steven xxmaj soderbergh to task . \n\n xxmaj it 's usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh 's most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh 's main challenge . xxmaj strange , after xxunk years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside "" edgy "" projects . \n\n xxmaj none of this excuses him this present , almost diabolical",negative
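
The batch above contains special tokens: `xxbos` marks the start of a document, `xxmaj` flags a capitalized word, `xxup` flags an all-caps word, and `xxunk` replaces a word that is not in the vocabulary. The following is a deliberately simplified sketch of that idea, not fastai's tokenizer (which is more elaborate); `toy_tokenize` and its `vocab` parameter are hypothetical names introduced for illustration:

```python
# Simplified illustration only; this is NOT fastai's implementation.
BOS, MAJ, UP, UNK = "xxbos", "xxmaj", "xxup", "xxunk"

def toy_tokenize(text, vocab=None):
    """Split on whitespace, lower-case each word, and emit marker tokens."""
    tokens = [BOS]
    for word in text.split():
        if word.isupper() and len(word) > 1:
            tokens.append(UP)    # all-caps word, e.g. "THE"
        elif word[0].isupper() and len(word) > 1 and word[1:].islower():
            tokens.append(MAJ)   # capitalized word, e.g. "Shop"
        word = word.lower()
        if vocab is not None and word not in vocab:
            word = UNK           # out-of-vocabulary word
        tokens.append(word)
    return tokens

print(toy_tokenize("THE Shop Around the corner"))
print(toy_tokenize("a gooey bowl", vocab={"a", "bowl"}))
```

Encoding capitalization as separate tokens lets the vocabulary hold a single lower-case entry per word.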


In [9]:
dls.path

Path('/root/.fastai/data/imdb_sample')

In [10]:
# save the current path
keep_path = path
print("keep_path is: ",str(keep_path))

keep_path is:  /root/.fastai/data/imdb_sample


In [11]:
%%time
# define a text_classifier_learner object
learn_clas = text_classifier_learner(dls, AWD_LSTM,
                                     metrics=accuracy).to_fp16()

CPU times: user 2.92 s, sys: 724 ms, total: 3.65 s
Wall time: 7.12 s


# Fine-tune the text classifier
Use the encoder created as part of training the language model to fine-tune the text classifier.

In [12]:
# check the learner's current path (in another environment this was Path('/storage/data/imdb'))
learn_clas.path

Path('/root/.fastai/data/imdb_sample')

In [14]:
%%time
# set the path to the location of the encoder
learn_clas.path = Path('/content/gdrive/MyDrive/ga_nlp_test')

CPU times: user 58 µs, sys: 3 µs, total: 61 µs
Wall time: 67.9 µs


In [15]:
# load the encoder that was saved when the language model was trained
learn_clas = learn_clas.load_encoder('ft_standalone'+modifier)

In [16]:
learn_clas.path

Path('/content/gdrive/MyDrive/ga_nlp_test')

In [17]:
# set the path back to the original path
learn_clas.path = keep_path

In [18]:
# confirm that the path has been restored (the ch 10 style value was Path('/storage/data/imdb'))
learn_clas.path

Path('/root/.fastai/data/imdb_sample')
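
The cells above point the learner at the encoder's directory, load the encoder, and then restore the original path by hand. The same pattern can be sketched as a context manager that restores the path even if loading fails. This is a sketch under assumptions: `temporary_path` is a hypothetical helper, a `SimpleNamespace` stands in for the learner so the example runs without fastai, and the `models` subdirectory and `.pth` extension reflect fastai's default `model_dir` behaviour as far as I know.

```python
from contextlib import contextmanager
from pathlib import Path
from types import SimpleNamespace

@contextmanager
def temporary_path(learner, new_path):
    """Temporarily set learner.path, restoring the original value on exit."""
    original = learner.path
    learner.path = Path(new_path)
    try:
        yield learner
    finally:
        learner.path = original

# Stand-in for learn_clas (assumption: only .path and .model_dir matter here).
learner = SimpleNamespace(path=Path('/root/.fastai/data/imdb_sample'),
                          model_dir='models')

with temporary_path(learner, '/content/gdrive/MyDrive/ga_nlp_test'):
    # With the real learner, the load_encoder call would go here.
    encoder_file = learner.path / learner.model_dir / 'ft_standalonega_apr10.pth'
    print(encoder_file)

print(learner.path)  # back to the original dataset path
```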

In [19]:
%%time
# fine tune the model
learn_clas.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.610701,0.656558,0.635,00:15


CPU times: user 14.5 s, sys: 251 ms, total: 14.7 s
Wall time: 15.6 s


In [20]:
x, y = first(dls.train)
x.shape, y.shape, len(dls.train)

(torch.Size([64, 1696]), torch.Size([64]), 12)
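
The shapes above can be checked with a little arithmetic. The sketch assumes that IMDB_SAMPLE holds 1,000 reviews, that `TextDataLoaders.from_df` used its default 20% random validation split and batch size of 64, and that the training loader drops the final partial batch; none of these values is set explicitly in this notebook.

```python
n_rows = 1000      # assumption: number of reviews in IMDB_SAMPLE
valid_pct = 0.2    # assumption: default random validation split
batch_size = 64    # matches the first dimension of x.shape

n_valid = int(n_rows * valid_pct)
n_train = n_rows - n_valid

# Full batches per epoch and the rows left over in a partial batch.
full_batches, leftover = divmod(n_train, batch_size)
print(n_train, full_batches, leftover)

# Consistency check: the reported accuracy of 0.635 on the validation set
# would correspond to a whole number of correct predictions.
n_correct = round(0.635 * n_valid)
print(n_correct, "of", n_valid)
```

Under these assumptions the 12 full batches agree with `len(dls.train)`. The second dimension of `x.shape`, 1696, would be the token length to which every review in that batch is padded.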

In [21]:
learn_clas.summary()

SequentialRNN (Input shape: 64 x 1696)
Layer (type)         Output Shape         Param #    Trainable 
                     64 x 40 x 1152      
LSTM                                                           
LSTM                                                           
____________________________________________________________________________
                     64 x 40 x 400       
LSTM                                                           
RNNDropout                                                     
RNNDropout                                                     
RNNDropout                                                     
BatchNorm1d                               2400       True      
Dropout                                                        
____________________________________________________________________________
                     64 x 50             
Linear                                    60000      True      
ReLU                                     

# Exercise the text classifier
Apply the fine-tuned text classifier to some text samples.

In [22]:
preds = learn_clas.predict("this film is horrible")

In [23]:
preds

('negative', TensorText(0), TensorText([0.6204, 0.3796]))

In [24]:
preds = learn_clas.predict("what a great movie")

In [25]:
preds

('negative', TensorText(0), TensorText([0.6234, 0.3766]))

In [26]:
preds = learn_clas.predict("worst horror movie of the decade")

In [27]:
preds

('negative', TensorText(0), TensorText([0.6372, 0.3628]))

In [28]:
preds = learn_clas.predict("another triumph for Hoffman!")

In [29]:
preds

('positive', TensorText(1), TensorText([0.4874, 0.5126]))
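
In the recorded outputs, each result from `predict` appears to hold the decoded label, the class index, and the per-class probabilities. The sketch below re-derives the label from the probabilities using plain Python lists and reports how far apart the two probabilities are. The `vocab` order and the `decode` helper are assumptions made for illustration; the numbers are copied from the outputs above.

```python
vocab = ["negative", "positive"]  # assumption: order implied by the outputs above

def decode(probs, vocab=vocab):
    """Return (label, index, probability) for the most likely class."""
    idx = max(range(len(probs)), key=probs.__getitem__)
    return vocab[idx], idx, probs[idx]

# Probabilities copied from the recorded predictions.
recorded = {
    "this film is horrible": [0.6204, 0.3796],
    "what a great movie": [0.6234, 0.3766],
    "worst horror movie of the decade": [0.6372, 0.3628],
    "another triumph for Hoffman!": [0.4874, 0.5126],
}

for text, probs in recorded.items():
    label, idx, p = decode(probs)
    margin = abs(probs[0] - probs[1])
    print(f"{text!r}: {label} (p={p:.4f}, margin={margin:.4f})")
```

All four margins are small, and "what a great movie" is labelled negative, which is consistent with a classifier fine-tuned for only one epoch.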

In [None]:
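# save the classifier model; the recorded output below does not match the current modifier, so it is presumably from an earlier run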
learn_clas.path = Path('/notebooks/temp')
learn_clas.save('classifier_single_epoch_'+modifier+'d')

Path('/notebooks/temp/models/classifier_single_epoch_standalone_mar20d.pth')