In [None]:
'''
Useful links:
Data preprocessing: https://www.kaggle.com/parthsharma5795/comprehensive-twitter-airline-sentiment-analysis
Train ULMFit in IMDB: https://course.fast.ai/videos/?lesson=8
'''

In [None]:
# If you are running the notebook in COLAB run the following lines of code
!pip install torch==1.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install fastai==2.1.4

In [None]:
from fastai.text.all import *
import torch
import re
# import os
# from os import listdir
# from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display,HTML

In [None]:
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Training on GPU!')
else: 
    print('No GPU available, training on CPU; consider making n_epochs very small.')

In [None]:
url = 'https://raw.githubusercontent.com/arnaujc91/ULMFit/master/data/Tweets.csv'
tweets = pd.read_csv(url)

## Analyse the data

In [None]:
tweets.head()

In [None]:
display(HTML(tweets.to_html(columns=['text'], index=False,header=None, max_rows=20)))

In [None]:
tweets.airline_sentiment_confidence.hist(bins=40)

In [None]:
tweets.airline_sentiment.value_counts()/len(tweets)

Later on we can also use this same model to predict the reason for the complaint:

In [None]:
tweets.negativereason.value_counts()

In [None]:
'''If you downloaded the file and have it in a "data" directory,
        uncomment and run the lines below'''
# data_directory = Path(os.getcwd())/'data'
# assert data_directory.is_dir(), 'Data directory not found'
# data_files = listdir(data_directory)
# print(data_files)

# csv_datafile = data_directory/'Tweets.csv'
# print('Data file:', csv_datafile)
# tweets = pd.read_csv(csv_datafile)
# tweets

Every tweet starts with a [Twitter handle](https://sproutsocial.com/glossary/twitter-handle/) that refers to the airline the message is addressed to. E.g.:

In [None]:
list(tweets.text[:5]), list(tweets.airline[:5])

We want to remove this information. In order to do so we add a new rule to the [default rules](https://github.com/fastai/fastai/blob/a8ed5a64f93df9be02eef907ddbc355f3ad130d1/fastai/text/core.py#L96) for preprocessing text:

In [None]:
def rm_first_handle(t):
    return re.sub(r'^@\w* ', '', t)

rules = defaults.text_proc_rules
rules.insert(0, rm_first_handle)

**IMPORTANT**: If you run the previous line of code the rules will be modified for the whole Notebook. So later on this will also affect the DataLoaders!
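
To see what the rule does in isolation, here is a small sketch that only needs `re`. The sample tweets are made up for illustration and are not taken from the dataset:

```python
import re

def rm_first_handle(t):
    # Drop "@handle " only when it is the very first thing in the tweet.
    return re.sub(r'^@\w* ', '', t)

# Made-up tweets, for illustration only.
samples = [
    "@united thanks for the quick reply",     # leading handle -> removed
    "thanks @united for the quick reply",     # handle in the middle -> kept
    "@united @delta both flights were late",  # only the first handle goes
]
cleaned = [rm_first_handle(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")

# Applying the rule a second time also strips the second handle,
# so the rule should be registered only once.
twice = rm_first_handle(cleaned[2])
print(repr(twice))
```

Note that the pattern requires a space right after the handle, so a tweet starting with something like `@united:` would be left untouched.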

Now let's tokenize the tweets:

In [None]:
tokenized_df, vocab_count = tokenize_df(tweets,  text_cols='text', tok=SpacyTokenizer())
vocab = list(vocab_count.keys())
print(len(vocab))
print(vocab[:5])

In [None]:
tokenized_df.text[:5]

*tokenize_df* returns the tokenized dataframe and also the counts of the tokenized words in the dataset. For example, to get the number of words that appear at least 3 times, we can write the following code:

In [None]:
len([key for key in vocab_count if vocab_count[key]>2])

So 4683 words appear **at least** 3 times in the entire dataset.

We can now compare the tokenized VS the original tweets:



In [None]:
# Show the first six tweets before and after tokenization
for a, b in zip(tokenized_df.text[:6], tweets.text[:6]):
    print('before: ', b)
    print('after: ', ' '.join(a))
    print('\n')

## Create the DataLoaders

In [None]:
# TODO: Random splitting is not a good idea; we have to check the distribution of labels of the tweets.
# For example, if we have 80% negative and 20% positive tweets in training and the other way around in validation,
# this might not be optimal. In particular we should have a homogeneous distribution of tweets in the training set
# in order not to bias the model towards a particular classification. 80% neg => bias towards neg tweet classification.
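
One way to address this TODO is a stratified split, where every label keeps roughly the same share in training and validation. The sketch below is pure Python with toy labels; `stratified_split` is a hypothetical helper (not part of fastai), and the label counts are invented:

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, valid_pct=0.2, seed=42):
    """Return (train_idx, valid_idx) with roughly the same label mix in both."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)                       # shuffle within each label
        n_valid = round(len(idxs) * valid_pct)  # same fraction for each label
        valid_idx.extend(idxs[:n_valid])
        train_idx.extend(idxs[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Toy labels, skewed towards "negative" (invented numbers).
labels = ['negative'] * 60 + ['neutral'] * 25 + ['positive'] * 15
train_idx, valid_idx = stratified_split(labels)
valid_counts = Counter(labels[i] for i in valid_idx)
print(valid_counts)
```

If I am not mistaken, the resulting `valid_idx` could then be handed to fastai's `IndexSplitter` inside a `DataBlock`, but that wiring is not tested here.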

In [None]:
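# For an explanation of this function go back to "Analyse the data" section.
def rm_first_handle(t):
    return re.sub(r'^@\w* ', '', t)

rules = defaults.text_proc_rules
# Only register the rule if it is not already there: inserting it a second
# time would also strip a second leading handle.
if not any(getattr(r, '__name__', '') == 'rm_first_handle' for r in rules):
    rules.insert(0, rm_first_handle)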

We will now create a [TextDataLoaders](https://docs.fast.ai/text.data#TextDataLoaders) object, which is a wrapper around the [DataLoaders](https://docs.fast.ai/data.core.html#DataLoaders) class. The DataLoaders holds our dataset split between training and validation. The TextDataLoaders adds more functionality specific to NLP problems, like the vocabulary of the data.

Now a few things:
- Remember that by default `TextDataLoaders.from_df` only considers a word part of the vocab if it appears **at least 3 times** in the entire dataset.
- We also need `is_lm=True` because we will first train a **Language Model**.

In [None]:
lm_dls = TextDataLoaders.from_df(tweets, text_col='text',  is_lm=True)
lm_dls.show_batch(max_n=5)

Let's take a quick look at the structure of this dataloader:

In [None]:
lm_dls._docs

The previous DataLoaders was for training or fine-tuning the language model. The following one will be used for **classification**.

In [None]:
tc_dls = TextDataLoaders.from_df(tweets, text_col='text', label_col='airline_sentiment')
tc_dls.show_batch(max_n=8)

From the code of `TextDataLoaders.from_df` we can see that the vocab is created with words that appear **at least 3 times** in the entire dataset. Any words that appear with a lower frequency will be automatically tokenized as `xxunk`, which stands for *unknown*.
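
To make the `min_freq` idea concrete, here is a toy sketch in plain Python. `build_vocab` and `replace_rare` are hypothetical helpers written for this example, not fastai functions, and the three sentences are invented:

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=3):
    # Keep only tokens that appear at least `min_freq` times overall.
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    return {tok for tok, c in counts.items() if c >= min_freq}

def replace_rare(tokens, vocab, unk='xxunk'):
    # Anything outside the vocab becomes the "unknown" token.
    return [tok if tok in vocab else unk for tok in tokens]

texts = [
    "the flight was late".split(),
    "the flight was cancelled".split(),
    "the crew was kind".split(),
]
toy_vocab = build_vocab(texts, min_freq=3)
encoded = replace_rare(texts[0], toy_vocab)
print(sorted(toy_vocab), encoded)
```

Here "flight" appears only twice, so with `min_freq=3` it is replaced by `xxunk` just like the words that appear once.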

So far the vocab contains words that appear at least 3 times in the entire dataset. If we want to change that, we cannot do it directly from the high-level API that fastai offers; we need a couple more lines of code. The following lines do just what `TextDataLoaders.from_df` does, but with the parameter **min_freq** set to one.

In [None]:
'''
If you want a TextDataLoaders with words that appear fewer than 3 times in the dataset:
'''

# we set min_freq to ONE to allow any words that appear at least once.
min_freq=1

dblock = DataBlock(blocks=[TextBlock.from_df(text_cols='text', is_lm=True, min_freq=min_freq) ],
                           get_x=ColReader("text"),
                           splitter = RandomSplitter(valid_pct=0.2))

tweets_f1 = TextDataLoaders.from_dblock(dblock, tweets)

Another issue is that even though `TextDataLoaders.from_df` calls `tokenize_df`, the former adds some extra special tokens that are not provided by `tokenize_df`. Let's see what this means:

In [None]:
print('Length of vocabulary obtained from: ')
print(f'\n   - tokenize_df: {len(set(vocab))}')
print(f'\n   - TextDataLoaders.from_df (min_freq=3): {len(lm_dls.vocab)}')
print(f'\n   - TextDataLoaders.from_dblock (min_freq=1): {len(tweets_f1.vocab)}')


A priori you would expect the vocabs from `tokenize_df` and `TextDataLoaders.from_dblock` to be the same size, as both take every word that appears in the dataset. Despite this, one has 7 more items than the other. Which are those items?

In [None]:
def vocab_diff(vocab1, vocab2):
    if len(vocab1)>len(vocab2):
        b_vocab = set(vocab1)
        s_vocab = set(vocab2)
    else:
        b_vocab = set(vocab2)
        s_vocab = set(vocab1)
        
    return list(b_vocab-s_vocab.intersection(b_vocab))
        

The special tokens added in `tweets_f1` are:

In [None]:
vocab_diff(tweets_f1.vocab, vocab)

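Besides, the special token `xxfake` appears twice in the vocab from `tweets_f1`. This is probably not a bug: as far as I can tell, fastai pads the vocab with `xxfake` entries so that its length is a multiple of 8. The duplicates can be listed with: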

In [None]:
import collections
print({item:count for item, count in collections.Counter(tweets_f1.vocab).items() if count > 1})
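
Repeated `xxfake` entries are consistent with a vocab that is padded up to a multiple of 8. The sketch below imitates that behaviour in plain Python; the formula is my reading of fastai's vocab construction and should be treated as an assumption, and `pad_vocab` is a hypothetical helper:

```python
def pad_vocab(vocab, multiple=8, fake='xxfake'):
    # Assumed padding rule: add `multiple - len % multiple` fake entries.
    # Under this formula a length already divisible by 8 gets 8 extra entries.
    n_fake = multiple - len(vocab) % multiple
    return list(vocab) + [fake] * n_fake

toy = [f'tok{i}' for i in range(14)]  # 14 % 8 == 6, so two fake entries
padded = pad_vocab(toy)
print(len(padded), padded.count('xxfake'))
```

With 14 real tokens the padded vocab has 16 entries, two of which are `xxfake`, which matches the kind of duplicate seen above.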

In [None]:
vocab_diff([key for key in vocab_count if vocab_count[key]>2], lm_dls.vocab)

## Some tools for debugging the data

A useful tool for debugging is to search for a word or phrase in the original texts, for example:

In [None]:
tweets.text[tweets.text.str.contains('think the US site allows that ', regex=False)]

The same kind of search for the word "*explaining*" finds it twice in the dataset, once in row 2279 and once in row 14225.

Another useful tool is to decode the numericalized datasets:

In [None]:
decoded = lm_dls.train_ds.decode(lm_dls.train_ds)

In [None]:
lm_dls.train_ds[0], decoded[0]

The dictionary between tokenized words and integers is found inside the [Numericalize](https://docs.fast.ai/text.data#Numericalize) class:

In [None]:
lm_dls.train_ds.numericalize.o2i

In [None]:
# [lm_dls.train_ds.numericalize.o2i[word] for word in decoded[0][0].split()]
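
The idea behind `o2i` can be shown with a tiny vocab in plain Python. This is only a sketch: `numericalize` and `decode` here are toy functions written for this example, not the fastai methods, and the vocab is invented:

```python
toy_vocab = ['xxunk', 'xxpad', 'xxbos', 'the', 'flight', 'was', 'late']
o2i = {tok: i for i, tok in enumerate(toy_vocab)}  # token -> integer

def numericalize(tokens, o2i, unk='xxunk'):
    # Unknown tokens fall back to the index of the "unknown" token.
    return [o2i.get(tok, o2i[unk]) for tok in tokens]

def decode(ids, vocab):
    # The vocab list itself is the integer -> token mapping.
    return [vocab[i] for i in ids]

ids = numericalize(['xxbos', 'the', 'flight', 'was', 'delayed'], o2i)
roundtrip = decode(ids, toy_vocab)
print(ids, roundtrip)
```

The round trip is lossy for out-of-vocab words: "delayed" comes back as `xxunk`, which is the same effect you see when decoding the real datasets.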

Also useful:

In [None]:
for text in decoded:
        if 'xxeos' in text[0]:
            print(text[0])

## Training

Make sure that the vocabs of the classifier and the Language Model are the same (as a crosscheck):

In [None]:
assert set(lm_dls.vocab) == set(tc_dls.vocab[0]), 'Vocabs are not equal!'

Now let's define the [Callbacks](https://docs.fast.ai/callback.tracker) we are going to use:
- [ActivationStats](https://docs.fast.ai/callback.hook#ActivationStats): Callback that records the mean and std of activations.
- [ShowGraphCallback](https://docs.fast.ai/callback.progress#ShowGraphCallback): Updates a graph of training and validation loss.
- [ParamScheduler](https://docs.fast.ai/callback.schedule#ParamScheduler): We are not going to use it in this guide, but it is definitely interesting to play with. It allows changing the learning rate at different stages of the training and also having a different learning rate scheduler for every parameter group.

In [None]:
cbs = [
       ShowGraphCallback,
       ActivationStats(with_hist=True),
       SaveModelCallback
#        ParamScheduler(sched)
      ]

Now let's create the learner. The [Learner](https://docs.fast.ai/learner) class is the class that contains everything necessary for training. It contains:
- DataLoaders
- Model
- Loss function
- Optimizer 
- Splitter to split the model in several parameter groups
- Callbacks for the training.
- etc.

In the following line we pass these arguments to the function [language_model_learner](https://docs.fast.ai/text.learner#language_model_learner):
- The DataLoaders
- The model architecture: [AWD_LSTM](https://docs.fast.ai/text.models.awdlstm)
- The Callbacks
- Optionally, the path where we want to save the trained model or part of the model, e.g. just the encoder (not set below, so the default is used)

In [None]:
learner = language_model_learner(lm_dls, AWD_LSTM, cbs=cbs, metrics=[accuracy])

We can see that there are already certain Callbacks which are set up by default:

In [None]:
list(learner.cbs)

The following code shows the layers of the language model:

In [None]:
modules = [m for m in flatten_model(learner.model) if has_params(m)]; modules

And we will do the same for the Learner of the Text Classifier:

In [None]:
learn = text_classifier_learner(tc_dls, AWD_LSTM, drop_mult=0.5, cbs=cbs, metrics=[accuracy, F1Score(average='micro')]).to_fp16()

In [None]:
list(learn.cbs)

By default, when we load a pretrained model with `language_model_learner`, not all the layers are trainable. To see which layers are currently trainable, we use the following code:

In [None]:
def requires_grad_bool(m:nn.Module)->Optional[bool]:
    ps = list(m.parameters())
    return ps[0].requires_grad

def trainable_layers(learn):
  modules = [m for m in flatten_model(learn.model) if has_params(m)]
  for it in modules:
    print(f"{requires_grad_bool(it)}  -- ",it)

In [None]:
trainable_layers(learner)

Now we are going to try to find a good learning rate for our models. To do so, I recommend checking the following Stack Overflow question: [choosing-the-learning-rate-using-fastais-learn-lr-find](https://stackoverflow.com/questions/61172627/choosing-the-learning-rate-using-fastais-learn-lr-find)


In [None]:
learner.lr_find()

In [None]:
learn.lr_find()

Now let's follow the recipe from Jeremy Howard and Sebastian Ruder's [paper](https://arxiv.org/abs/1801.06146). It is important to understand the difference between `fit` and `fit_one_cycle` (for that, take a look [here](https://iconof.com/1cycle-learning-rate-policy/)). As you will see, we will also progressively unfreeze the layers during training, which the paper reports to work better than fine-tuning all the layers at once.
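
To get a feeling for what `fit_one_cycle` does to the learning rate, here is a rough sketch of a one-cycle schedule in plain Python. The values `div=25`, `div_final=1e5` and `pct_start=0.25` are what I understand the fastai defaults to be, so treat them as assumptions; `one_cycle_lr` is a hypothetical helper, not the fastai implementation:

```python
import math

def cos_anneal(start, end, pct):
    # Cosine interpolation from `start` (pct=0) to `end` (pct=1).
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

def one_cycle_lr(pct, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    # Warm up from lr_max/div to lr_max, then anneal down to lr_max/div_final.
    if pct < pct_start:
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    return cos_anneal(lr_max, lr_max / div_final,
                      (pct - pct_start) / (1 - pct_start))

# Learning rate at 0%, 1%, ..., 100% of training for lr_max=2e-2.
schedule = [one_cycle_lr(i / 100, lr_max=2e-2) for i in range(101)]
print(schedule[0], schedule[25], schedule[100])
```

Plain `fit`, in contrast, keeps the learning rate you pass constant unless you attach a scheduler yourself.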

In [None]:
??learner.fit_one_cycle

In [None]:
learner.fit_one_cycle(10, 2e-2)
trainable_layers(learner)
learner.save('language_model')
learner.save_encoder('finetuned')

In [None]:
learn = learn.load_encoder('finetuned')
learn.fit_one_cycle(12, 2e-3)
print('\n')
trainable_layers(learn)

In [None]:
learn.lr_find()

In [None]:
# REFINING 1
learn.load('model')
learn.freeze_to(-2)
learn.fit_one_cycle(5, slice(1e-2/(2.6**4),1e-2))
print('\n')
trainable_layers(learn)

In [None]:
# REFINING 2
learn.load('model')
learn.freeze_to(-3)
learn.fit_one_cycle(3, slice(5e-3/(2.6**4),5e-3))
print('\n')
trainable_layers(learn)

In [None]:
# REFINING 3
learn.load('model')
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
print('\n')
trainable_layers(learn)
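
The `slice(lr/(2.6**4), lr)` arguments in the refinement cells above give each parameter group its own learning rate. As I understand it, fastai spreads the rates geometrically between the two ends of the slice. The sketch below reproduces that idea in plain Python; `geometric_lrs` is a hypothetical helper, and the assumption of five parameter groups (embedding, three LSTM layers, classifier head) is mine:

```python
def geometric_lrs(start, stop, n):
    # n values from `start` to `stop`, each a constant multiple of the previous.
    if n == 1:
        return [stop]
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

lr_max = 1e-2
n_groups = 5  # assumption: embedding, three LSTM layers, classifier head
group_lrs = geometric_lrs(lr_max / 2.6 ** 4, lr_max, n_groups)
ratios = [b / a for a, b in zip(group_lrs, group_lrs[1:])]
print(group_lrs)
print(ratios)
```

Under that assumption each group trains 2.6 times faster than the one below it, which is the factor suggested in the ULMFiT paper for discriminative fine-tuning.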

We reach an accuracy of 83% for tweet classification, not bad!

You can now test the language model: use it to generate invented sentences and see if they make sense. The more the invented sentences look like they were written by a person, the better trained the language model.

In [None]:
learner = learner.load('language_model')

In [None]:
learner.predict("Never", 15, temperature=0.75) 
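
The `temperature` argument controls how adventurous the sampling is. In the usual formulation the scores are divided by the temperature before the softmax, so values below 1 make the most likely word even more likely. Here is a small sketch of that effect with invented scores (it is not the fastai code):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide the scores by the temperature, then apply a stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # invented scores for three candidate words
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.75)
print(p_default)
print(p_sharp)
```

With `temperature=0.75` the first candidate gets a larger share of the probability than with `temperature=1.0`, so the generated text becomes more conservative.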

## Statistics of the training

In [None]:
list(learner.activation_stats.stats)

In [None]:
learner.activation_stats.plot_layer_stats(8)

In [None]:
learner.activation_stats.color_dim(8)

## Assess model performance

In [None]:
n= 1392
learn.predict(tweets.text[n]), tweets.airline_sentiment[n]

In [None]:
tweets.text[n]

In [None]:
clas_int = ClassificationInterpretation.from_learner(learn)

Here you can see that there is a bias towards negative classification. This is because most of the tweets in the dataset are negative!

In [None]:
clas_int.plot_confusion_matrix()

In [None]:
clas_int.print_classification_report()
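
To read the classification report it helps to compute precision and recall per class by hand once. The sketch below does that in plain Python on invented labels, where the toy predictions lean towards "negative"; `per_class_report` is a hypothetical helper:

```python
from collections import Counter

def per_class_report(truth, predicted):
    # Returns {label: (precision, recall)} for every label present in `truth`.
    tp = Counter(t for t, p in zip(truth, predicted) if t == p)
    n_true = Counter(truth)
    n_pred = Counter(predicted)
    report = {}
    for label in sorted(n_true):
        precision = tp[label] / n_pred[label] if n_pred[label] else 0.0
        recall = tp[label] / n_true[label]
        report[label] = (precision, recall)
    return report

# Invented example: one neutral and one positive tweet are called negative.
truth = ['negative'] * 6 + ['neutral'] * 2 + ['positive'] * 2
predicted = ['negative'] * 6 + ['negative', 'neutral'] + ['negative', 'positive']
report = per_class_report(truth, predicted)
print(report)
```

A bias towards the majority class shows up as high recall but lower precision for "negative", and the opposite pattern for the minority classes.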

In [None]:
clas_int.top_losses()

In [None]:
tc_dls.vocab[1]

In [None]:
preds = learn.get_preds(dl=tc_dls[1], with_input=True, with_loss=True, with_decoded=True, act=None)

The variable `preds` contains the following information:

0. inputs
1. predictions
2. targets
3. decoded
4. losses

In [None]:
torch.sum(preds[3] == preds[2]).item()/len(preds[3])

In [None]:
matches = preds[3] == preds[2]

In [None]:
decoded_valid = L(zip(lm_dls.valid_ds.decode(preds[0]),
                      list(map(lambda x: tc_dls.vocab[1][x], preds[2].tolist())),
                      list(map(lambda x: tc_dls.vocab[1][x], preds[3].tolist()))))

In [None]:
df = pd.DataFrame(list(decoded_valid[~matches]), columns =['tweet', 'Truth', 'Computed']) 

In [None]:
display(HTML(df.to_html(index=False)))

In order to know if we are getting a good performance, we can compare our model training to a benchmark. We can use the IMDB dataset from fastai and see if we get a similar performance or not. 

In [None]:
path = untar_data(URLs.IMDB_SAMPLE)
imdb = pd.read_csv(path/'texts.csv')

In [None]:
imdb.head()

In [None]:
imdb_cls  = TextDataLoaders.from_df(imdb, text_col='text', label_col='label')
imdb_lm = TextDataLoaders.from_df(imdb, text_col='text', is_lm=True)

We can compare the features of both datasets; for example, we can check how many words a review contains compared to how many words a tweet contains.

In [None]:
pd.concat([tweets.text[:1000].apply(lambda s: len(s.split())),
           imdb.text.apply(lambda s: len(s.split()))],
           axis=1,
           keys=['tweets', 'imdb']).plot.hist(alpha=0.4, bins = 500, xlim=(0,400)) 

So we can check the words per tweet or the words per review:

In [None]:
tweets.text.apply(lambda s: len(s.split())).mean(), imdb.text.apply(lambda s: len(s.split())).mean()

Even though there are far fewer words per tweet, the total number of words is almost the same:

In [None]:
tweets.text.apply(lambda s: len(s.split())).sum(), imdb.text.apply(lambda s: len(s.split())).sum()

In [None]:
258446/247797

In [None]:
tweets.text.apply(lambda s: len(s.split())).hist(bins=60)

In [None]:
imdb.text.apply(lambda s: len(s.split())).hist(bins=60)

As we can see, the reviews are in general much longer than the tweets, which is expected but can influence the performance of the training.

Another thing to analyze is the vocab: is the vocab from `imdb` much bigger than the one from the `tweets`?

In [None]:
len(lm_dls.vocab), len(imdb_lm.vocab)

As we can see, the vocab of `imdb` is almost double the vocab of the `tweets`, so it could be that the language model for `imdb` performs better.

In order to simplify things, I will just define a function that does everything we have done so far:

In [None]:
def complete_training(lm_dls, cl_dls, cbs=None):

    if cbs is None:
        cbs = [
              ShowGraphCallback,
              ActivationStats,
              SaveModelCallback
            ]

    learner = language_model_learner(lm_dls, AWD_LSTM, cbs = cbs,  metrics=[accuracy])
    # TRY CHANGING drop_mult, to see if there is an effect in training small datasets
    learn = text_classifier_learner(cl_dls, AWD_LSTM, drop_mult=0.5, cbs=cbs, metrics=accuracy).to_fp16()

    # ----  TRAIN THE LANGUAGE MODEL  ----
    learner.fit_one_cycle(10, 2e-2)
    # learner.save('language_model')
    learner.save_encoder('finetuned')

    # ----  TRAIN THE CLASSIFIER  ----
    learn = learn.load_encoder('finetuned')
    learn.fit_one_cycle(12,  2e-3)

    # REFINING 1
    learn.freeze_to(-2)
    learn.fit_one_cycle(5, slice(1e-2/(2.6**4),1e-2))

    # REFINING 2
    learn.freeze_to(-3)
    learn.fit_one_cycle(3, slice(5e-3/(2.6**4),5e-3))

    # REFINING 3
    learn.unfreeze()
    learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

    learner.remove_cb(ShowGraphCallback)
    learn.remove_cb(ShowGraphCallback)

    return learner, learn

In [None]:
learner_imdb, learn_imdb = complete_training(imdb_lm, imdb_cls)