Importing the libraries

In [None]:
import numpy as np
import pandas as pd
from fastai.text import *
from fastai.callbacks import *

In [None]:
path = Path('../input/nlp-getting-started')
path.ls()

In [None]:
train = pd.read_csv(path/'train.csv')
test = pd.read_csv(path/'test.csv')

We will use both the training and the test text for our language model. Note that we are not using any labels here: we are not building a classifier yet, but a language model. We start from an architecture pretrained on Wikitext-103 and then fine-tune it on our tweet data.
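
To see why no labels are needed at this stage, here is a minimal sketch of how a language model derives its own targets from raw text: each token is used to predict the token that follows it. The whitespace tokenizer and the sample sentence are made up for illustration; fastai applies its own tokenization rules.

```python
# Hypothetical whitespace tokenizer; fastai's real tokenizer is more elaborate.
def next_token_pairs(text):
    tokens = text.lower().split()
    # Inputs are all tokens but the last; targets are the same stream shifted by one.
    return list(zip(tokens[:-1], tokens[1:]))

# Made-up sentence, not taken from the competition data.
pairs = next_token_pairs("Storm warning issued for the coast")
for context_token, target_token in pairs:
    print(f"{context_token!r} -> {target_token!r}")
```

Because the targets come from the text itself, the unlabeled test tweets are just as useful as the training tweets for this step.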

The paragraph given below is taken from [Understanding building blocks of ULMFIT](https://medium.com/mlreview/understanding-building-blocks-of-ulmfit-818d3775325b#:~:text=High%20level%20idea%20of%20ULMFIT,learning%20rates%20in%20multiple%20stages)
> High level idea of ULMFIT is to train a language model using a very large corpus like Wikitext-103 (103M tokens), then to take this pretrained model’s encoder and combine it with a custom head model, e.g. for classification, and to do the good old fine tuning using discriminative learning rates in multiple stages carefully.
The architecture that ULMFiT uses for its language modeling task is an [AWD-LSTM](https://arxiv.org/pdf/1708.02182.pdf). The name is an abbreviation of ASGD Weight-Dropped LSTM.

Refer to this paper if you want to read more about ULMFiT: https://arxiv.org/abs/1801.06146

ULMFiT brought transfer learning, already well established in computer vision, to NLP.

In [None]:
data_lm = (TextList.from_df(pd.concat([train[['text']], test[['text']]], ignore_index=True, axis=0))
           .split_by_rand_pct(0.15)
           .label_for_lm()
           .databunch(bs=128))

In [None]:
data_lm.show_batch()

In [None]:
## create lm learner with pre-trained model
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)

In [None]:
learn.lr_find()
learn.recorder.plot(suggestion=True)

In [None]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
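
The `fit_one_cycle` call above follows the one-cycle policy: the learning rate warms up to its maximum and then decays, while momentum moves in the opposite direction between the two values in `moms`. The sketch below is a rough pure-Python approximation of that idea, not fastai's implementation. The `div_factor=25` and `pct_start=0.3` defaults and the cosine shape are assumptions about the library's behaviour, and `cosine_anneal` and `one_cycle` are hypothetical helper names.

```python
import math

def cosine_anneal(start, end, pct):
    """Move from start to end along half a cosine wave as pct goes from 0 to 1."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle(step, total_steps, lr_max=1e-2, moms=(0.8, 0.7),
              div_factor=25.0, pct_start=0.3):
    """Return (learning_rate, momentum) for one step. Defaults are assumptions."""
    warmup_steps = int(total_steps * pct_start)
    lr_start = lr_max / div_factor
    lr_end = lr_start / 1e4
    if step < warmup_steps:
        # Warm-up: learning rate rises while momentum falls.
        pct = step / warmup_steps
        return cosine_anneal(lr_start, lr_max, pct), cosine_anneal(moms[0], moms[1], pct)
    # Annealing: learning rate falls while momentum rises back.
    pct = (step - warmup_steps) / (total_steps - warmup_steps)
    return cosine_anneal(lr_max, lr_end, pct), cosine_anneal(moms[1], moms[0], pct)

schedule = [one_cycle(step, 100) for step in range(100)]
for step in (0, 30, 99):
    lr, mom = schedule[step]
    print(f"step {step:3d}: lr={lr:.6f} momentum={mom:.3f}")
```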

In [None]:
callback = SaveModelCallback(learn,monitor="accuracy", mode="max", name="best_lang_model")

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7), callbacks=[callback])

In [None]:
learn.save('fine_tuned')
learn.save_encoder('fine_tuned_enc')

The cell above saves the fine-tuned language model and, separately, its encoder, which the classifier loads later.

In [None]:
df = train[['text', 'target']]

In [None]:
df_test = test[['text']]

Create a text DataBunch for our classifier. We hold out 10% of the training data as the validation set and reuse the vocabulary of the language model DataBunch.
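
Reusing the vocabulary matters because the encoder's embedding rows are indexed by token id: the same token must map to the same id in both stages. The toy sketch below illustrates the mismatch that a separately built vocabulary would cause. The `build_vocab` and `numericalize` helpers and the sample texts are hypothetical, and ids are assigned alphabetically here for simplicity, which is not how fastai orders its vocabulary.

```python
def build_vocab(texts):
    """Map each distinct token to an integer id; id 0 is reserved for unknown tokens."""
    itos = ["xxunk"] + sorted({tok for text in texts for tok in text.split()})
    return {tok: idx for idx, tok in enumerate(itos)}

def numericalize(text, stoi):
    """Convert a text into token ids, falling back to 0 for unknown tokens."""
    return [stoi.get(tok, 0) for tok in text.split()]

lm_texts = ["flood in the city", "fire in the hills"]  # made-up language model corpus
clas_texts = ["fire in the city"]                      # made-up classifier corpus

lm_vocab = build_vocab(lm_texts)
own_vocab = build_vocab(clas_texts)

shared_ids = numericalize("fire in the city", lm_vocab)
own_ids = numericalize("fire in the city", own_vocab)
print("ids with the language model vocab:", shared_ids)
print("ids with a separately built vocab:", own_ids)
```

With a separately built vocabulary the ids differ, so the classifier would look up the wrong embedding rows in the fine-tuned encoder.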

In [None]:
data_clas = (TextList.from_df(df, vocab=data_lm.vocab)
             #.split_none()
             .split_by_rand_pct(0.1)
             .label_from_df('target')
             .add_test(TextList.from_df(df_test, vocab=data_lm.vocab))
             .databunch(bs=128))

In [None]:
## check test set looks ok
data_clas.show_batch(ds_type=DatasetType.Test)

Build the classifier on top of the encoder that was fine-tuned with the language model.

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5, metrics=[accuracy, FBeta(beta=1)])
learn.load_encoder('fine_tuned_enc')
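
The `FBeta(beta=1)` metric above is the F1 score, the harmonic mean of precision and recall for the positive class. As a rough illustration of the formula (a hand-written sketch with made-up labels, not fastai's implementation; `fbeta_score` is a hypothetical helper):

```python
def fbeta_score(y_true, y_pred, beta=1.0, pos_label=1):
    """F-beta for one positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = fbeta_score(y_true, y_pred)
print(f"F1 = {score:.3f}")
```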

In [None]:
learn.lr_find()
learn.recorder.plot(suggestion=True)

In [None]:
learn.fit_one_cycle(1, 1e-3, moms=(0.8,0.7))

In [None]:
## unfreeze the last 2 layer groups and train for 1 cycle
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-3/(2.6**4),1e-2), moms=(0.8,0.7))

In [None]:
## unfreeze the last 3 layer groups and train for 1 cycle
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))
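
The `slice(lr/(2.6**4), lr)` pattern in the cells above assigns discriminative learning rates: earlier layer groups get smaller rates than later ones. The sketch below reimplements the geometric spacing in plain Python to show where the `2.6**4` comes from. It assumes the classifier has five layer groups and that the rates are spread in even multiples between the two ends of the slice; both are assumptions about fastai's behaviour rather than something shown in this notebook.

```python
def even_mults(start, stop, n):
    """Return n values from start to stop, each a constant multiple of the previous one."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

lr_max = 5e-3
n_groups = 5  # assumed number of layer groups in the AWD-LSTM classifier
lrs = even_mults(lr_max / 2.6 ** 4, lr_max, n_groups)
ratios = [later / earlier for earlier, later in zip(lrs, lrs[1:])]

for group, lr in enumerate(lrs):
    print(f"layer group {group}: lr={lr:.6f}")
```

Under these assumptions each group trains with a rate 2.6 times larger than the group before it, which matches the factor suggested in the ULMFiT paper.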

Use a callback to keep the best classification model across the training epochs.

In [None]:
callbacks = SaveModelCallback(learn,monitor="accuracy", mode="max", name="best_classification_model")
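
Conceptually, a save-best callback with `mode="max"` compares the monitored metric after each epoch with the best value seen so far and keeps a checkpoint only on improvement. A minimal sketch of that bookkeeping, with a hypothetical `BestModelTracker` class and made-up accuracy values (this is not fastai's `SaveModelCallback` code):

```python
class BestModelTracker:
    """Keep the state from the epoch with the best monitored value."""

    def __init__(self, mode="max"):
        self.mode = mode
        self.best_value = None
        self.best_state = None
        self.best_epoch = None

    def on_epoch_end(self, epoch, value, state):
        """Store a copy of state if value beats the best so far; return True if it did."""
        if self.best_value is None:
            improved = True
        elif self.mode == "max":
            improved = value > self.best_value
        else:
            improved = value < self.best_value
        if improved:
            self.best_value, self.best_state, self.best_epoch = value, dict(state), epoch
        return improved

tracker = BestModelTracker(mode="max")
accuracies = [0.78, 0.81, 0.80, 0.83, 0.82]  # made-up validation accuracies
for epoch, acc in enumerate(accuracies):
    tracker.on_epoch_end(epoch, acc, state={"epoch": epoch})

print(f"best epoch: {tracker.best_epoch}, accuracy: {tracker.best_value}")
```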

In [None]:
## unfreeze all layer groups and train for 15 epochs, keeping the best model
learn.unfreeze()
learn.fit_one_cycle(15, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7), callbacks=[callbacks])

In [None]:
preds, _ = learn.get_preds(ds_type=DatasetType.Test,  ordered=True)
preds = preds.argmax(dim=-1)

test_ids = test['id']  # avoid shadowing the built-in id()

In [None]:
my_submission = pd.DataFrame({'id': test_ids, 'target': preds.numpy()})
my_submission.to_csv('submission.csv', index=False)
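
The `argmax` step in the prediction cell above turns per-class scores into a single label per tweet by picking the index of the largest score. A plain-Python sketch with made-up probabilities (`argmax_labels` is a hypothetical helper; Python's `max` returns the first index on ties, which may differ from the tensor version):

```python
def argmax_labels(probabilities):
    """Return the index of the largest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]

# Made-up class probabilities for three tweets: [not disaster, disaster].
probs = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]
labels = argmax_labels(probs)
print(labels)
```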