# <center>Universal Language Model Fine-tuning for Text Classification</center>
## <center>NLP Sentiment Analysis on Twitter US Airlines Dataset</center>
### <center>By Ahmad Abboud</center>




1- Abstract
========

Inductive transfer learning has significantly influenced computer vision, but current NLP methods often still need task-specific modifications and training from scratch. This notebook discusses Universal Language Model Fine-tuning (ULMFiT) [1], an efficient transfer learning approach that can be applied to any NLP task, and implements the techniques that are essential for fine-tuning a language model. It then presents empirical results obtained by applying ULMFiT to sentiment analysis on the Twitter US Airline Sentiment dataset.

2- Introduction
============

Pretraining lets us do better than randomly initializing the parameters of our models. However, inductive transfer via fine-tuning has so far been largely ineffective for NLP: earlier approaches to language model (LM) fine-tuning required millions of in-domain documents to achieve good results, which severely limits their applicability. Universal Language Model Fine-tuning (ULMFiT) addresses these issues and enables robust inductive transfer learning for any NLP task.

3- Model Description
=================

The model used is the state-of-the-art language model AWD-LSTM \[2\], a
standard LSTM with various tuned dropout hyper-parameters (with no
attention, short-cut connections, or other sophisticated additions).

ULMFiT consists of three stages (Figure 1):

> a\) The LM is trained on a **general-domain** corpus to capture
> general features of the language in different layers.
>
> b\) The full LM is **fine-tuned** on target task data using
> discriminative fine-tuning ('Discr') and slanted triangular learning
> rates (STLR) to learn task-specific features.
>
> c\) The classifier is fine-tuned on the target task using **gradual
> unfreezing**, 'Discr', and STLR to preserve low-level representations
> and adapt high-level ones (shaded: unfreezing stages; black: frozen).

<img src="../input/images/ULMFit.png" alt="Figure 1. ULMFiT Model Structure (source [1])" title="ULMFiT Model Structure" />
<i> Figure 1. ULMFiT Model Structure (source [1])</i>
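
The gradual unfreezing used in stage (c) can be illustrated with a small sketch. It is not fastai code: `trainable_groups` and the choice of five layer groups are hypothetical and only show which groups would be trained at each step when one more group is unfrozen at a time, starting from the last one.

```python
def trainable_groups(n_groups, n_unfrozen):
    """Flags per layer group (lowest group first): True means the group is trained."""
    # Only the last `n_unfrozen` groups are trainable; all lower groups stay frozen.
    return [i >= n_groups - n_unfrozen for i in range(n_groups)]


n_groups = 5  # hypothetical number of layer groups, chosen for illustration
for stage in range(1, n_groups + 1):
    print(f"stage {stage}: {trainable_groups(n_groups, stage)}")
```

In the first stage only the classifier head is trained, and in the last stage every group is trained, which is the same progression as calling `freeze_to(-2)`, `freeze_to(-3)` and finally `unfreeze()` on a learner.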


   a) General-domain LM pretraining 
-----------------------------

The model was pre-trained on Wikitext-103, which consists of
28,595 preprocessed Wikipedia articles and 103 million words [2].



   b) Target task LM fine-tuning 
--------------------------

No matter how diverse the general-domain data used for pre-training is, the target task data will typically come from a different distribution. Therefore, we fine-tune the language model on the target task data. Given a pre-trained general-domain LM, this stage converges faster, as it only needs to adapt to the idiosyncrasies of the target data, and it allows us to train a robust language model even on small datasets.


### i- Discriminative fine-tuning

Because different layers capture different types of information, they should be fine-tuned to different extents. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning applies a separate learning rate to each layer.
It was found empirically [1] to work well to first choose the learning rate $\alpha^{L}$ of the last layer by fine-tuning only the last layer, and then to use $\alpha^{l-1} = \alpha^{l}/2.6$ as the learning rate for each lower layer, where $l$ is the index of the layer in the model and $L$ is the index of the last layer.
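
As a worked example of this rule, the sketch below computes one learning rate per layer by repeatedly dividing by 2.6. The function name `discriminative_lrs`, the top learning rate of `1e-2`, and the five layer groups are assumptions made for illustration.

```python
def discriminative_lrs(top_lr, n_layers, factor=2.6):
    """Return one learning rate per layer, lowest layer first."""
    # The last layer uses top_lr; each layer below it is divided by `factor` once more.
    return [top_lr / factor ** (n_layers - 1 - i) for i in range(n_layers)]


lrs = discriminative_lrs(top_lr=1e-2, n_layers=5)
for i, lr in enumerate(lrs):
    print(f"layer group {i}: lr = {lr:.2e}")
```

With five groups the lowest group ends up with `top_lr / 2.6**4`, which is the lower bound that appears in the `slice(...)` arguments used later in this notebook.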


### ii- Slanted triangular learning rates

To adapt its parameters to task-specific features, the model should converge quickly to a suitable region of the parameter space at the beginning of training and then refine its parameters. Slanted triangular learning rates (STLR) achieve this by first increasing the learning rate linearly and then decaying it linearly. Finally, the learning rate at the steepest slope of the loss curve is chosen as the peak value.
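
The schedule can be sketched as follows, using the formula and default values given in [1] (`cut_frac=0.1`, `ratio=32`, `lr_max=0.01`). The function name `stlr` and the 1,000 training steps are assumptions for illustration; fastai's `fit_one_cycle`, used later in this notebook, follows a similar but not identical schedule.

```python
import math


def stlr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t of a slanted triangular schedule."""
    # Step at which the schedule switches from increasing to decreasing
    # (clamped to at least 1 so that very short runs do not divide by zero).
    cut = max(1, math.floor(total_steps * cut_frac))
    if t < cut:
        p = t / cut                                      # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # long linear decay
    # ratio controls how much smaller the lowest rate is than lr_max
    return lr_max * (1 + p * (ratio - 1)) / ratio


schedule = [stlr(t, total_steps=1000) for t in range(1000)]
print(f"start={schedule[0]:.6f} peak={max(schedule):.6f} end={schedule[-1]:.6f}")
```

The rate starts at `lr_max / ratio`, peaks after the first 10% of the steps, and then decays back towards `lr_max / ratio` over the remaining 90%.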

   c) Target task classifier fine-tuning
----------------------------------

To fine-tune the classifier, the pre-trained language model is augmented with two additional linear blocks. Following common practice for Computer Vision (CV) classifiers, each block uses batch normalization and dropout, with a ReLU activation for the intermediate layer and a softmax activation at the last layer that outputs a probability distribution over the target classes. Note that the parameters in these task-specific classifier layers are the only ones learned from scratch. The first linear layer takes the pooled hidden states of the last layer as its input.
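
The pooling step is described in [1] as concat pooling: the hidden state of the last time step is concatenated with a max-pooled and a mean-pooled representation of the hidden states over all time steps. A minimal pure-Python sketch, with a hypothetical `concat_pool` helper and made-up two-dimensional hidden states:

```python
def concat_pool(hidden_states):
    """hidden_states: list of T vectors (one per time step), each a list of floats."""
    last = hidden_states[-1]
    dims = range(len(last))
    # Element-wise maximum and mean over all time steps
    max_pool = [max(h[d] for h in hidden_states) for d in dims]
    mean_pool = [sum(h[d] for h in hidden_states) / len(hidden_states) for d in dims]
    # The classifier input is three times as wide as a single hidden state
    return last + max_pool + mean_pool


H = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]  # made-up hidden states for 3 time steps
pooled = concat_pool(H)
print(pooled)
```

Pooling over all time steps helps because the words that decide the sentiment may appear anywhere in the text, not only at the end.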


4-  Application
===========

   a.  Dataset Description
-------------------

The dataset is published on [Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) [3]. It captures how travellers expressed their feelings about U.S. airlines on Twitter in February 2015. It contains 14,640 tweets, each of which was labelled with its sentiment.

   b. Problem Description
-------------------

The task is sentiment analysis of the problems of each major U.S. airline. Twitter data was scraped in February 2015, and contributors were asked to first classify each tweet as positive, negative, or neutral. Our goal is to build a classification model that identifies the sentiment of a text written by a customer as positive, negative, or neutral.

   c.  Exploratory Analysis
--------------------





In [None]:
# prepare the notebook
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
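# import necessary libraries
import numpy as np
import pandas as pd               # used below for pd.read_csv
import matplotlib.pyplot as plt   # used below for plt.figure
from fastai.text import *
from pathlib import Path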

#### Preparing and Downloading the Data

First, let's download the dataset we are going to study.

In [None]:
# View current working directory
print(f"Current directory: {Path.cwd()}")
print(f"Home directory: {Path.home()}")
path = Path.cwd()

In [None]:
path

In [None]:
# Download Dataset
#import kaggle
#kaggle.api.authenticate()
#kaggle.api.dataset_download_files('crowdflower/twitter-airline-sentiment', path=path, unzip=True)

It contains only one CSV file; let's have a look at it.

In [None]:
# Prepare the DataFrame
df_org = pd.read_csv('../input/twitter-airline-sentiment/Tweets.csv')
df_org.rename(columns={'airline_sentiment':'label'}, inplace=True)
df = df_org[['label','text']].copy()  # copy to avoid pandas' SettingWithCopyWarning
np.random.seed(2020)
# Flag roughly 10% of the rows as validation data (True is drawn with probability 0.1)
df['is_valid'] = np.random.choice([True, False], len(df_org), p=[0.1, 0.9])
df.head()

In [None]:
df.info()

In [None]:
# Save a cleaned version of the CSV
df[['label','text','is_valid']].to_csv(path/'Tweets.csv',index=False)

In [None]:
# Show a sample tweet (the row with index 1)
df['text'][1]

### Data Visualization

To get an idea of how the sentiment labels are distributed, we plot their frequencies as a pie chart and as count plots.

In [None]:
import seaborn as sns
df.label.value_counts().plot(kind='pie',autopct='%1.0f')

In [None]:
# Sentiment distribution
sns.countplot(x='label',data=df,palette='viridis')

In [None]:
# Sentiment distribution per airline
plt.figure(figsize=(12,7))
sns.countplot(x='airline',hue='label',data=df_org,palette='rainbow')

   d.  Feature Selection
-----------------

Feature selection is the process of choosing what we consider worthwhile in our documents and what can be ignored. Rejected features are those that act like noise: when they are fed to the model with the training set, the classification accuracy decreases.
In most of the NLP literature, stop words, punctuation, and informal vocabulary are removed from the training set. Furthermore, most work applies stemming or lemmatization to reduce words to their base form. However, we believe that all the words in the text field are important and that there is no reason to remove them. In addition, this notebook considers only the text field as the input feature and leaves exploring the usefulness of other features for future optimization.


   e.  Data Preprocessing
------------------


The first processing step is to split the raw sentences into words, or more precisely tokens. The easiest way to do this would be to split the string on spaces, but we can go further:

- we need to take care of punctuation
- some words are contractions of two different words, like isn't or don't
- we may need to clean some parts of our texts, if there's HTML code for instance
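
The three points above can be sketched with a small regular-expression tokenizer. This is not the tokenizer fastai uses (fastai relies on spaCy plus its own rules); `simple_tokenize` and the sample tweet are hypothetical and only illustrate the idea.

```python
import html
import re


def simple_tokenize(text):
    """Lower-case a text and split it into word, contraction, and punctuation tokens."""
    text = html.unescape(text)             # decode HTML entities such as "&amp;"
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags
    text = re.sub(r"n't\b", " n't", text)  # split contractions: "isn't" -> "is n't"
    # Keep "n't" whole, then words, then every punctuation mark as its own token
    return re.findall(r"n't|\w+|[^\w\s]", text.lower())


tokens = simple_tokenize("@united isn't this <b>great</b>? Thanks &amp; bye!")
print(tokens)
```

Each punctuation mark becomes its own token, the contraction is split into two tokens, and the HTML markup disappears.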




###    i- Creating Data Bunches

A text is composed of words, and we can't apply mathematical functions to them directly. We first have to convert them to numbers. This is done in two different steps: tokenization and numericalization. A `TextDataBunch` does all of that behind the scenes for you.


In [None]:
# the batch size to select depends on the amount of memory available on your machine
bs=48

The language model does not need the labels, so we can use all the tweet texts to fine-tune it. Let's create our DataBunch object.

In [None]:
data_lm = (TextList.from_csv(path, 'Tweets.csv', cols='text') 
            .split_by_rand_pct(0.1,seed=2020)
           #We randomly split and keep 10% of the tweets (about 1,464 of 14,640) for validation
            .label_for_lm()           
           #We want to do a language model so we label accordingly
            .databunch(bs=bs))

In [None]:
#Save DataBunch object
data_lm.save('data_lm.pkl')

In [None]:
data_lm = load_data(path, 'data_lm.pkl', bs=bs)

In [None]:
# Lets have a look at the first item of the training set
data_lm.train_ds[0][0]

But the underlying data is all numbers:

In [None]:
data_lm.train_ds[0][0].data[:10]

To see what the tokenizer has done behind the scenes, let's have a look at a few texts in a batch.

In [None]:
data_lm.show_batch()

### Tokenization

We have to use a special kind of `TextDataBunch` for the language model: it ignores the labels, shuffles the texts at each epoch before concatenating them all together (only for training; the validation set is not shuffled), and sends batches that read that text in order, with targets that are the next word in the sentence.
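
The relationship between inputs and targets can be sketched in a few lines. The helper `lm_batches`, the window length `bptt=3`, and the token ids are made up for illustration; fastai additionally arranges the stream into `bs` parallel rows, which is left out here.

```python
def lm_batches(token_ids, bptt):
    """Yield (inputs, targets) windows where each target is the next token."""
    for start in range(0, len(token_ids) - 1, bptt):
        x = token_ids[start:start + bptt]
        y = token_ids[start + 1:start + 1 + bptt]  # same window shifted by one token
        yield x[:len(y)], y                        # trim the last window if needed


stream = [2, 15, 8, 42, 7, 3, 15, 9]  # made-up ids of the concatenated texts
for x, y in lm_batches(stream, bptt=3):
    print(x, "->", y)
```

Every target sequence is the input sequence shifted by one position, so the model is trained to predict the next token at each step.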

Creating this DataBunch takes a while, which is why it was saved and reloaded above. The following cell shows the first 20 tokens of the vocabulary.

In [None]:
data_lm.vocab.itos[:20]

### Numericalization

Once we have extracted tokens from our texts, we convert them to integers by creating a list of all the words used. We only keep the ones that appear at least twice, with a maximum vocabulary size of 60,000 (by default), and replace the ones that don't make the cut with the unknown token `UNK` (written `xxunk` in fastai).

The correspondence from ids to tokens is stored in the `vocab` attribute of our datasets, in a list called `itos` (for int to string).
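
A minimal sketch of this step is shown below. It is not fastai's implementation: `build_vocab` and `numericalize` are hypothetical helpers, the unknown token is placed at id 0 for simplicity, and fastai's other special tokens are left out.

```python
from collections import Counter


def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2, unk="xxunk"):
    """Build the id-to-token list (itos) and the token-to-id dict (stoi)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Keep the most frequent tokens that appear at least min_freq times
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq and tok != unk]
    itos = [unk] + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to ids; tokens missing from the vocabulary get id 0 (unknown)."""
    return [stoi.get(tok, 0) for tok in tokens]


texts = [["the", "flight", "was", "late"], ["the", "flight", "was", "great"]]
itos, stoi = build_vocab(texts)
ids = numericalize(["the", "flight", "was", "cancelled"], stoi)
print(itos)
print(ids, "->", [itos[i] for i in ids])
```

Words seen only once ("late", "great") and words never seen ("cancelled") all map to the unknown token, while `itos` turns the ids back into tokens.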

If we look at what is in our datasets, we see the tokenized text as the representation, as displayed above for the first item of the training set.

We can then put this in a learner object very easily with a model loaded with the pretrained weights. They'll be downloaded the first time the learner is created and stored in `~/.fastai/models/` (or elsewhere if you specified different paths in your config file).

   f.  Modelling
--------

###    i-   Pre-trained Learning



We're not going to train a model that classifies the tweets from scratch. Like in computer vision, we'll use a model pretrained on a bigger dataset (a cleaned subset of Wikipedia called [wikitext-103](https://einstein.ai/research/blog/the-wikitext-long-term-dependency-language-modeling-dataset)). That model has been trained to guess what the next word is, its input being all the previous words. It has a recurrent structure and a hidden state that is updated each time it sees a new word. This hidden state thus contains information about the sentence up to that point.

We are going to use that 'knowledge' of the English language to build our classifier, but first, like for computer vision, we need to fine-tune the pretrained model to our particular dataset. Because the English of tweets written to airlines isn't the same as the English of Wikipedia, we'll need to adjust the parameters of our model a little. Plus, there might be some words that are extremely common in the tweets dataset but barely present in Wikipedia, and therefore might not be part of the vocabulary the model was trained on.

In [None]:
# Create a language model learner from the pretrained AWD-LSTM; drop_mult scales all of its dropout rates
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)


###     ii- Learning rate selection

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot(skip_end=15)

###    iii-  Fine-tuning

In [None]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7)) # lr at the steepest slope of the lr finder plot should be about 4e-2
# moms sets the momentum, i.e. the first beta in Adam (or the momentum in SGD/RMSProp).
# Passing (0.8,0.7) means going from 0.8 to 0.7 during the warm-up, then from 0.7 back to 0.8 during the annealing.
# Only Adam's beta_1 (decay rate of the first moment) is changed; beta_2 (second moment) is left as it is.

In [None]:
learn.save('fit_head')

In [None]:
learn.load('fit_head');

To complete the fine-tuning, we can then unfreeze the model and launch a new training run.

In [None]:
learn.unfreeze()

In [None]:
learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))

In [None]:
learn.save('fine_tuned')

How good is our model? Let's see what it predicts after a few given words.

In [None]:
learn.load('fine_tuned');

In [None]:
TEXT = "I liked this airline because"
N_WORDS = 40
N_SENTENCES = 2

In [None]:
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
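
The `temperature` argument controls how adventurous the generated text is. The sketch below shows one common way to apply it, by raising the predicted probabilities to the power `1/temperature` and renormalizing before sampling. Treat it as an assumption about how such sampling can work rather than a description of fastai's internals; `sample_with_temperature` and the three probabilities are made up.

```python
import random


def sample_with_temperature(probs, temperature=0.75, rng=random):
    """Sample an index after sharpening (T < 1) or flattening (T > 1) the distribution."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    scaled = [s / total for s in scaled]  # renormalize so the values sum to 1
    idx = rng.choices(range(len(scaled)), weights=scaled, k=1)[0]
    return idx, scaled


next_word_probs = [0.5, 0.3, 0.2]  # made-up probabilities for three candidate words
idx, sharpened = sample_with_temperature(next_word_probs, temperature=0.75,
                                         rng=random.Random(2020))
print(idx, [round(p, 3) for p in sharpened])
```

With a temperature below 1 the most likely word becomes even more likely, so the generated sentences are less random; a temperature of 1 leaves the distribution unchanged.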

We have to save not only the model, but also its encoder, the part that's responsible for creating and updating the hidden state. For the next part, we don't care about the part that tries to guess the next word.

In [None]:
learn.save_encoder('fine_tuned_enc')

###    iv-  Transfer Learning Classifier Model

Now, we'll create a new data object that only grabs the labelled data and keeps those labels. Again, this line takes a bit of time.

In [None]:
data_clas=TextClasDataBunch.from_csv(path, 'Tweets.csv',vocab=data_lm.vocab)


In [None]:
data_clas.save('data_clas.pkl')

In [None]:
data_clas = load_data(path, 'data_clas.pkl', bs=bs)

In [None]:
data_clas.show_batch()

We can then create a model to classify those tweets and load the encoder we saved before.

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)


In [None]:
#Show the learner structure 
learn

In [None]:
# Transfer learned encoder from previous language model
learn.load_encoder('fine_tuned_enc')

###    v-   Learning rate selection

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

###    vi-  Fine-tuning

In [None]:
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7)) 

In [None]:
learn.save('first')

In [None]:
learn.load('first')

In [None]:
learn.freeze_to(-2)
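learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))   # discriminative learning rates: 1e-2 for the last layer group, divided by 2.6 for each group below (2.6**4 assumes five layer groups)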

In [None]:
learn.save('second')

In [None]:
learn.load('second');

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))

In [None]:
learn.save('third')

In [None]:
learn.load('third');

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))

In [None]:
learn.save('Fourth')

In [None]:
learn.load('Fourth');

###    vii-  Prediction and Results

The losses dropped sharply just after the second epoch, and the accuracy reached 0.824.

In [None]:
learn.recorder.plot_losses()

In [None]:
learn.recorder.plot_metrics()

In [None]:
learn.predict("I really loved that airline, it was awesome!")

In [None]:
# Prepare Interpreter
interp = ClassificationInterpretation.from_learner(learn)

In [None]:
# Confusion Matrix
interp.plot_confusion_matrix()
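
To make the plot easier to read, the sketch below shows how a confusion matrix and the accuracy can be computed from true and predicted labels. The helpers `confusion_matrix` and `accuracy` are hypothetical, and the six labels are made up for illustration; they are not results from this notebook.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, classes):
    """Rows are the true classes, columns are the predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


classes = ["negative", "neutral", "positive"]
# Made-up labels for illustration only
y_true = ["negative", "negative", "neutral", "positive", "positive", "neutral"]
y_pred = ["negative", "neutral", "neutral", "positive", "negative", "neutral"]

cm = confusion_matrix(y_true, y_pred, classes)
for name, row in zip(classes, cm):
    print(f"{name:>8}: {row}")
print(f"accuracy = {accuracy(cm):.3f}")
```

Off-diagonal cells show which sentiments are confused with each other, which a single accuracy figure cannot reveal.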

5-  Conclusion
==========

In conclusion, we applied ULMFiT to a classification task, sentiment analysis of tweets about US airlines, and the results obtained are promising. Using transfer learning and a pre-trained AWD-LSTM network, we reach an accuracy of more than 82% with only a few training epochs, which is good compared with results reported in the literature. Moreover, the results could be improved by pre-training the model on text from social networks such as Twitter or Facebook, whose informal language may fit this dataset better than Wikitext-103, which mostly contains formal and scientific language. In addition, exploring the effect of other independent variables in the dataset, especially the field “negativereason”, could also improve the results.

6-  References
==========

\[1\] J. Howard and S. Ruder, "Universal language model fine-tuning for
text classification," in *ACL 2018 - 56th Annual Meeting of the
Association for Computational Linguistics, Proceedings of the Conference
(Long Papers)*, 2018. <br>
\[2\] S. Merity, N. S. Keskar, and R. Socher,
"Regularizing and optimizing LSTM language models," in *6th
International Conference on Learning Representations, ICLR 2018 -
Conference Track Proceedings*, 2018. <br>
\[3\] "Twitter US Airline Sentiment
\| Kaggle." \[Online\]. Available:
https://www.kaggle.com/crowdflower/twitter-airline-sentiment.
\[Accessed: 13-Jul-2020\].
