# A fastai implementation for the CommonLit Readability Contest

In this knowledge contest, the goal is to predict the readability of a passage of text.

**Version History**
* *Version 1* - Set up the notebook, imported basic modules, read the dataset, built a simple model, evaluated results, and submitted using the pretrained language model only
* *Version 2* - Internet off :)
* *Version 3* - Integrated [my own dataset](https://www.kaggle.com/abee82/fastai-wikitext-wt103-pretrained-model) to pull in the WikiText-103 pretrained model for `fastai`
* *Version 4* - Fixed submission file
* *Version 5* - Added fine-tuned model submission

## Background

@TODO: update

### For beginners forking this notebook

I highly recommend that anyone interested in solving problems like this look into the course by Sylvain Gugger and Jeremy Howard. They have spent a lot of time building `fastai` into a great library for accessing the latest deep learning techniques.

[Practical Deep Learning for Coders](https://course.fast.ai/)

# Setup

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
%%time
from fastai.text.all import *

In [None]:
INPUT_DIR = Path("../input/commonlitreadabilityprize")
INPUT_DIR.ls()

In [None]:
WORKING_DIR = Path("./")

In [None]:
train_df = pd.read_csv(INPUT_DIR / 'train.csv')
test_df = pd.read_csv(INPUT_DIR / 'test.csv')

In [None]:
train_df.head()

In [None]:
# # Making pretrained weights work without needing to find the default filename
# if not os.path.exists('/root/.cache/torch/hub/checkpoints/'):
#         os.makedirs('/root/.cache/torch/hub/checkpoints/')
# !cp '../input/resnet50/resnet50.pth' '/root/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth'

In [None]:
!mkdir -p ~/.fastai/models/

!cp -R ../input/fastai-wikitext-wt103-pretrained-model/* ~/.fastai/models/

# Use the out-of-the-box language model from `fastai`

In [None]:
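# TextBlock.from_df tokenizes the 'excerpt' column and stores the tokens in a new
# 'text' column, which is why get_x reads 'text' rather than 'excerpt'.
dls_class = DataBlock(blocks=(TextBlock.from_df('excerpt', seq_len=256), RegressionBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader('target'),
                      splitter=RandomSplitter())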
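
`RandomSplitter()` holds out a random share of the rows for validation; to my knowledge its default is `valid_pct=0.2`. A minimal pure-Python sketch of the same idea follows. `random_split` and its `seed` argument are illustrative names, not the fastai API.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Return (train_idxs, valid_idxs) from a random permutation of range(n_items)."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * n_items)  # number of rows held out for validation
    return idxs[cut:], idxs[:cut]

train_idxs, valid_idxs = random_split(10, valid_pct=0.2, seed=42)
print(len(train_idxs), len(valid_idxs))  # 8 2
```

Because the split is random, validation scores will vary from run to run unless a seed is fixed.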

In [None]:
dls = dls_class.dataloaders(train_df, bs=64)

In [None]:
dls.show_batch()

In [None]:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=rmse)
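
The `rmse` metric passed to the learner is root mean squared error, which, as far as I know, is also the metric this competition is scored on. A small standalone sketch of the calculation, using a hypothetical helper named `rmse_score` so it does not shadow the fastai metric:

```python
import math

def rmse_score(preds, targets):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(preds, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predictions and targets, only to show the call.
print(rmse_score([0.0, 1.0, 2.0], [0.0, 1.0, 4.0]))
```

Lower is better, which is why the callbacks further down compare values with `np.less`.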

In [None]:
lrs = learn.lr_find()

In [None]:
lrs
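
In the fastai version this notebook appears to target, `lr_find()` returns suggestions that include `lr_min`, which to my understanding is the learning rate at the lowest recorded loss divided by 10. The sketch below reproduces that rule on made-up numbers. `suggest_lr_min` is a hypothetical helper, and it skips the loss smoothing that fastai applies before picking the minimum.

```python
def suggest_lr_min(lrs, losses, divisor=10.0):
    """Return the learning rate at the lowest loss, scaled down by `divisor`."""
    if len(lrs) != len(losses) or not lrs:
        raise ValueError("lrs and losses must be non-empty and the same length")
    best_idx = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best_idx] / divisor

# Made-up sweep: the learning rate grows tenfold each step,
# and the loss dips before diverging.
sweep_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
sweep_losses = [1.10, 1.05, 0.90, 0.70, 0.95, 3.20]
print(suggest_lr_min(sweep_lrs, sweep_losses))
```

Newer fastai releases may name the suggestions differently, so `lrs.lr_min` may need adjusting there.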

In [None]:
cbs = [
#     ShowGraphCallback(),
#     GradientAccumulation(),
    MixedPrecision(),
    SaveModelCallback(monitor='_rmse', comp=np.less, min_delta=0.01),
#     ReduceLROnPlateau(monitor='rmse', comp=np.less, min_delta=0.001, patience=2),
#     MixUp(0.4),
    EarlyStoppingCallback(monitor='_rmse', comp=np.less, min_delta=0.01, patience=3),
#     GradientClip(0.1),
      ]
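
As I understand fastai's tracker callbacks, `min_delta=0.01` means an epoch counts as an improvement only if the monitored value beats the best so far by more than 0.01, and `patience=3` stops training after three consecutive epochs without such an improvement. A simplified sketch of that rule for a metric where lower is better (`early_stop_epoch` is a hypothetical helper, not part of fastai):

```python
def early_stop_epoch(values, min_delta=0.01, patience=3):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, value in enumerate(values):
        if value < best - min_delta:  # improved by more than min_delta
            best, wait = value, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation RMSE per epoch: the gains after epoch 1 are smaller than min_delta.
history = [0.90, 0.80, 0.798, 0.796, 0.794, 0.70]
print(early_stop_epoch(history))  # 4
```

With a `min_delta` this large, a slow but steady improvement can trigger an early stop, so a smaller value may be worth trying.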

In [None]:
learn.fine_tune(20, lrs.lr_min/2, cbs=cbs)

In [None]:
learn.show_results()

In [None]:
learn.save('pretrained_LM')

## Pretrained LM Submission

In [None]:
SAMPLE_SUBMISSION = pd.read_csv(INPUT_DIR / 'sample_submission.csv')

In [None]:
SAMPLE_SUBMISSION.head()

In [None]:
test_dl = learn.dls.test_dl(test_df.excerpt)

In [None]:
preds = learn.get_preds(dl=test_dl, with_decoded=True)
preds = preds[2].tolist()
preds = [x for l in preds for x in l]
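
With `with_decoded=True`, the third element returned by `get_preds` holds the decoded predictions, which here appear to come back as one single-value row per excerpt. The flattening step can be sketched on a plain nested list; the values below are made up.

```python
from itertools import chain

nested_preds = [[-0.31], [0.12], [-1.87]]  # made-up values, one row per excerpt

# Equivalent to the nested list comprehension: unwrap each one-element row.
flat_preds = list(chain.from_iterable(nested_preds))
print(flat_preds)  # [-0.31, 0.12, -1.87]
```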

In [None]:
preds

In [None]:
SAMPLE_SUBMISSION['target'] = preds

In [None]:
SAMPLE_SUBMISSION.head()

In [None]:
SAMPLE_SUBMISSION.to_csv('submission.csv', index=False)
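
Before submitting, it can help to confirm that the file has the `id` and `target` columns from `sample_submission.csv` and one finite prediction per test row. The sketch below runs that check on an in-memory CSV using only the standard library. `check_submission` and the sample ids are hypothetical.

```python
import csv
import io
import math

def check_submission(csv_text, expected_rows):
    """Raise ValueError if the CSV does not look like a valid id/target submission."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "target"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = list(reader)
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    for row in rows:
        if not math.isfinite(float(row["target"])):
            raise ValueError(f"non-finite target for id {row['id']}")
    return True

sample = "id,target\nabc123,-0.51\ndef456,0.27\n"
print(check_submission(sample, expected_rows=2))  # True
```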

# Create Custom Language Model

In [None]:
train_df.head()

In [None]:
dls_lm = TextDataLoaders.from_df(train_df, text_col='excerpt', is_lm=True)

In [None]:
dls_lm.show_batch(max_n=3)

In [None]:
learn = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], wd=0.1).to_fp16()
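
The `Perplexity()` metric is the exponential of the mean cross-entropy loss, so it can be read as the effective number of tokens the model is choosing between at each step. A minimal sketch of that relationship, with a hypothetical `perplexity` helper and made-up per-token losses:

```python
import math

def perplexity(token_losses):
    """Exponential of the mean per-token cross-entropy (natural log)."""
    return math.exp(sum(token_losses) / len(token_losses))

# A model that assigns probability 1/4 to every correct token has perplexity 4.
uniform_losses = [-math.log(0.25)] * 8
print(round(perplexity(uniform_losses), 6))  # 4.0
```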

In [None]:
lrs = learn.lr_find()

In [None]:
lrs

In [None]:
cbs = [
#     ShowGraphCallback(),
#     GradientAccumulation(),
#     MixedPrecision(),
    SaveModelCallback(monitor='valid_loss', comp=np.less, min_delta=0.01),
#     ReduceLROnPlateau(monitor='rmse', comp=np.less, min_delta=0.001, patience=2),
#     MixUp(0.4),
    EarlyStoppingCallback(monitor='valid_loss', comp=np.less, min_delta=0.01, patience=3),
#     GradientClip(0.1),
      ]

In [None]:
learn.fine_tune(20, lrs.lr_min, cbs=cbs)

In [None]:
learn.save_encoder('finetuned_lm')

In [None]:
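# `text_vocab` is an argument of `TextDataLoaders.from_df`; use it so the regressor shares the language model's vocab
dls = TextDataLoaders.from_df(train_df, text_col='excerpt', label_col='target', y_block=RegressionBlock, text_vocab=dls_lm.vocab, seq_len=256, bs=64)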

In [None]:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=rmse)

In [None]:
lrs = learn.lr_find()

In [None]:
lrs

In [None]:
cbs = [
    MixedPrecision(),
    SaveModelCallback(monitor='valid_loss', comp=np.less, min_delta=0.01),
    EarlyStoppingCallback(monitor='valid_loss', comp=np.less, min_delta=0.01, patience=3),
      ]

In [None]:
learn = learn.load_encoder('finetuned_lm')

In [None]:
learn.fit_one_cycle(1, lrs.lr_min, cbs=cbs)

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(lrs.lr_min/2/2.5**4, lrs.lr_min/2), cbs=cbs)

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(lrs.lr_min/2/2/2.5**4, lrs.lr_min/2/2), cbs=cbs)

In [None]:
learn.unfreeze()
learn.fit_one_cycle(20, slice(lrs.lr_min/2/2/2/2.5**4, lrs.lr_min/2/2/2), cbs=cbs)
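
Passing `slice(low, high)` as the learning rate asks fastai to spread rates geometrically across the model's parameter groups, from `low` for the earliest layers to `high` for the head. Dividing by `2.5**4` follows the ULMFiT convention: with five groups, each group trains 2.5 times faster than the one before it. The sketch below mirrors that spacing in plain Python. The group count of five and the base learning rate are assumptions for illustration.

```python
def even_mults(start, stop, n):
    """Return n values from start to stop, each a constant multiple of the previous one."""
    if n == 1:
        return [stop]
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

base_lr = 1e-2  # hypothetical stand-in for the suggested learning rate
n_groups = 5    # assumed number of parameter groups in the AWD-LSTM regressor

group_lrs = even_mults(base_lr / 2 / 2.5 ** 4, base_lr / 2, n_groups)
for i, lr in enumerate(group_lrs):
    print(f"group {i}: {lr:.2e}")
```

Each unfreezing stage halves the top rate again, so later stages make smaller updates while more of the network is trainable.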

# Fine-tuned LM submission

In [None]:
SAMPLE_SUBMISSION = pd.read_csv(INPUT_DIR / 'sample_submission.csv')

In [None]:
SAMPLE_SUBMISSION.head()

In [None]:
test_dl = learn.dls.test_dl(test_df.excerpt)

In [None]:
preds = learn.get_preds(dl=test_dl, with_decoded=True)
preds = preds[2].tolist()
preds = [x for l in preds for x in l]

In [None]:

preds

In [None]:
SAMPLE_SUBMISSION['target'] = preds

In [None]:
SAMPLE_SUBMISSION.head()

In [None]:

SAMPLE_SUBMISSION.to_csv('submission.csv', index=False)