**Fellowship.AI challenge: Twitter US Airline Sentiment Analysis Using ULMFiT**


*Section 1. Introduction to Transfer Learning and ULMFiT*

> Transfer learning has had a significant impact on computer vision, but for a long time much less on natural language processing (NLP). Earlier transfer learning approaches for NLP needed large amounts of in-domain data and took days to converge. For instance, Dai and Le (2015) fine-tuned a language model but required millions of in-domain documents to achieve good performance, which is impractical for most applications. The problem lies not in language-model fine-tuning itself but in how to do it effectively, so that a small amount of data yields performance comparable to that obtained with a large corpus. At the same time, fine-tuning on a small dataset can cause the language model to overfit severely. A fine-tuning method is therefore needed that trains effectively on little data while preventing severe overfitting. To this end, Howard and Ruder (2018) proposed Universal Language Model Fine-tuning (ULMFiT).

*Section 2. The Process of ULMFiT*

> ULMFiT consists of the following 3 stages:
> 1. LM pre-training: The language model (LM) is trained on a general-domain corpus to capture general features of the language in its layers. For instance, Howard and Ruder (2018) used the Wikitext-103 corpus to pretrain the LM for English.
> 2. LM fine-tuning: The entire LM is fine-tuned on the target data using discriminative fine-tuning and slanted triangular learning rates to learn task-specific features.
>     * Discriminative fine-tuning: Different layers of the LM capture different types of information, so they should be fine-tuned to different extents. To achieve this, each layer is given its own learning rate.
>     In the case of stochastic gradient descent, where one learning rate is used for all layers, the model's parameters θ at time step t are updated as follows:
> ![sgd_formula](https://i.paste.pics/77J2C.png) 
> 
>     In discriminative fine-tuning, each layer has its own parameters and its own learning rate. We therefore write θ = {θ_1, ..., θ_L}, where θ_l denotes the parameters of the l-th layer and L is the number of layers. Likewise, we have one learning rate per layer, {η_1, ..., η_L}, where η_l is the learning rate of the l-th layer. The stochastic gradient update in discriminative fine-tuning is then:
> ![](https://i.paste.pics/77J31.png)
> Through experimentation, Howard and Ruder found that discriminative fine-tuning works well when the learning rate of the final layer, η_L, is chosen first and the learning rates of the lower layers are then set as follows: **η_(l−1) = η_l / 2.6**
>     * Slanted triangular learning rates (STLR): To adapt its parameters to task-specific features, the LM should converge quickly to a suitable region of the parameter space and then refine its parameters. STLR therefore first increases the learning rate linearly over a short warm-up and then decreases it linearly over a much longer period, as shown below:
> ![](https://i.paste.pics/77J55.png)    
> 
> 3. Classifier fine-tuning: The fine-tuned language model is used on the target task with gradual unfreezing and STLR to preserve low-level representations and adjust high-level representations.
>     *     Gradual unfreezing: Instead of fine-tuning all layers at once, the last layer is unfrozen and fine-tuned first, as it is considered to contain the least general knowledge (Yosinski et al., 2014). Gradual unfreezing proceeds as follows:

>             a) The last layer is unfrozen and fine-tuned for one epoch.

>             b) The second-to-last layer is unfrozen, and both layers (last and second-to-last) are fine-tuned for one epoch.

>             c) This is repeated for each lower layer. Once all the layers have been unfrozen, the entire network is fine-tuned until convergence.
>         
> To wrap up, the techniques of discriminative fine-tuning, slanted triangular learning rates and gradual unfreezing work together to enable ULMFiT to perform well on different datasets.
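>
> For reference, the update rules and the learning-rate relation described in stage 2 can be written out as below. This is a transcription in this notebook's subscript notation, not an addition to the method; J(θ) stands for the training objective, a symbol that is assumed here and not used elsewhere in the notebook.

```latex
\begin{align}
\theta_{t} &= \theta_{t-1} - \eta \, \nabla_{\theta} J(\theta)
  && \text{(plain SGD: one learning rate for all layers)} \\
\theta_{l,t} &= \theta_{l,t-1} - \eta_{l} \, \nabla_{\theta_{l}} J(\theta)
  && \text{(discriminative fine-tuning: layer } l \text{ uses } \eta_{l}\text{)} \\
\eta_{l-1} &= \frac{\eta_{l}}{2.6}
  && \text{(each lower layer gets a smaller learning rate)}
\end{align}
```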
> 
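> The two schedules described in stage 2 can be sketched in a few lines of plain Python. This is a minimal illustration and not the fast.ai implementation: the function names are made up for this example, the STLR formula and its default values (cut_frac=0.1, ratio=32, lr_max=0.01) follow Howard and Ruder (2018), and the sketch assumes that total_steps * cut_frac is at least 1.

```python
import math

def discriminative_lrs(top_lr, n_layers, factor=2.6):
    """Return per-layer learning rates, lowest layer first.

    The last layer gets top_lr; every layer below gets the rate of the
    layer above it divided by `factor`.
    """
    return [top_lr / factor ** (n_layers - 1 - i) for i in range(n_layers)]

def stlr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at training step t (0-based)."""
    cut = math.floor(total_steps * cut_frac)   # step at which the rate peaks
    if t < cut:
        p = t / cut                            # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio

layer_lrs = discriminative_lrs(top_lr=0.01, n_layers=4)
schedule = [stlr(t, total_steps=1000) for t in range(1000)]
print(layer_lrs)
print(schedule[0], max(schedule), schedule[-1])
```

> The schedule starts at lr_max / ratio, peaks at lr_max after the first 10% of the steps and then decays back towards lr_max / ratio, which gives the slanted triangle in the figure.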

    


*Section 3. Task and EDA*

> Our aim is to classify tweets about the major American airlines into 3 classes, namely, neutral, negative and positive, using ULMFiT.
> The dataset contains 14427 labelled tweets.
> We will carry out an exploratory data analysis (EDA) on the dataset to understand more about the tweets directed at each airline.

In [None]:
# import the libraries used in this notebook
import numpy as np
import pandas as pd
import seaborn as sns
import re
from pathlib import Path
from sklearn.model_selection import train_test_split
# fast.ai (v1 API) provides ready-made methods for the techniques involved in ULMFiT
from fastai.text import *

In [None]:
#loading data from Kaggle

data_path=Path('../input/twitter-airline-sentiment/')
input_name='Tweets.csv'
input_path=data_path/input_name
df_input=pd.read_csv(input_path)
print("The shape of the input csv file containing all tweets is",df_input.shape)


In [None]:
# some example of entries in the input data table
df_input.sample(10,random_state=0)

In [None]:
# we only need the airline_sentiment (target) and text (input) columns, so we select just those
# .copy() gives an independent DataFrame, so later edits to it do not raise SettingWithCopyWarning
df_input_crop=df_input[['airline_sentiment','text']].copy()
df_input_crop.sample(10)

In [None]:
# before we divide the dataset into train/validation/test sets, we need to know whether the classes are imbalanced
# an imbalance can occur across sentiment classes or across airlines

print('The distribution of data according to the different sentiment classes is:\n',df_input_crop['airline_sentiment'].value_counts())


In [None]:
print('The distribution of data according to the different airlines is:\n',df_input['airline'].value_counts())


In [None]:
# to visualise the imbalance across both sentiment and airline, we use a grouped bar chart
sns.countplot(y='airline', hue='airline_sentiment', data=df_input)

Section 3.1. The Problem of Class Imbalance

> Our EDA shows that there is a class imbalance with respect to the airlines. The model could therefore learn to associate a certain sentiment with a certain airline, which is not our aim: we want the model to map the content of a tweet to a sentiment. For this reason, I replace every airline handle that follows the '@' symbol with the generic token '@airline', so that the airline name can no longer serve as a shortcut for the sentiment.

In [None]:




regex=r"@(VirginAmerica|united|SouthwestAir|Delta|USAirways|AmericanAir|JetBlue)"
def replace(text):
    return re.sub(regex, '@airline',text, flags=re.IGNORECASE)
df_input_crop['text']=df_input_crop['text'].apply(replace)
df_input_crop['text'].sample(10)

# all airline handles have now been replaced with '@airline'
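> The effect of the substitution is easy to check on a few strings. The sketch below reuses the pattern from the cell above on made-up tweets (they are not rows from the dataset); the names AIRLINE_HANDLES and mask_airline are chosen for this example only.

```python
import re

# Handles of the airlines, as listed in the notebook's regular expression.
AIRLINE_HANDLES = r"@(VirginAmerica|united|SouthwestAir|Delta|USAirways|AmericanAir|JetBlue)"

def mask_airline(text):
    """Replace any known airline handle with the generic token '@airline'."""
    return re.sub(AIRLINE_HANDLES, "@airline", text, flags=re.IGNORECASE)

# Made-up example tweets for illustration.
examples = [
    "@VirginAmerica thanks for the upgrade!",
    "@UNITED my bag is lost again",
    "@JetBlue and @SouthwestAir both delayed today",
    "@someone_else no airline mentioned here",
]
masked = [mask_airline(tweet) for tweet in examples]
for before, after in zip(examples, masked):
    print(before, "->", after)
```

> Note that the pattern has no trailing word boundary, so a longer handle such as '@DeltaAssist' would become '@airlineAssist'. Appending \b after the group would restrict the match to the exact handles, if that matters for the data at hand.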





*Section 4. Setting up Data*

> We now split the dataset into training (80%) and validation (20%) sets. This split is used for fine-tuning the language model.

In [None]:
train_data,valid_data=train_test_split(df_input_crop, test_size=0.2)

*Section 5. ULMFiT model development*

Section 5.1. ULMFiT step 1: LM Pretraining

>In this step we start from a language model that has already been pretrained on a general-domain corpus and therefore captures the general features of the language.
>First, we fix the momentum and weight decay values that are used throughout the model development process.

> moms=(0.8,0.7)

> wd=0.1

>Weight decay is a coefficient that shrinks the weights slightly towards zero at every update. This prevents the weights from growing too large.
>Simply put, momentum accumulates the gradients of the previous steps to determine the direction of the next update, which helps training converge quickly. The two values in moms are the bounds between which the momentum is cycled during one-cycle training.
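
> To make the two terms concrete, here is a minimal sketch of a single parameter update with momentum and weight decay. It uses the generic textbook form with decoupled weight decay and is not necessarily the exact optimizer that fast.ai runs; the function name and the numbers are made up for illustration.

```python
def sgd_step(weight, grad, velocity, lr, momentum, wd):
    """One SGD update with momentum and decoupled weight decay."""
    weight = weight - lr * wd * weight       # weight decay: shrink the weight slightly
    velocity = momentum * velocity + grad    # momentum: accumulate past gradients
    weight = weight - lr * velocity          # step along the accumulated direction
    return weight, velocity

# Two consecutive updates with the same gradient.
w1, v1 = sgd_step(1.0, grad=0.5, velocity=0.0, lr=0.1, momentum=0.8, wd=0.1)
w2, v2 = sgd_step(w1, grad=0.5, velocity=v1, lr=0.1, momentum=0.8, wd=0.1)
print(w1, v1)
print(w2, v2)
```

> The second step is larger than the first even though the gradient is unchanged, because the velocity has accumulated the gradient from the first step.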


In [None]:
moms=(0.8,0.7)
wd=0.1

In [None]:
#fast.ai needs a path to work on
fastai_path=Path('./').resolve()
fastai_path

In [None]:
# TextLMDataBunch tokenises and numericalises the tweets and batches them, so the data can be used directly for language-model training
input_data=TextLMDataBunch.from_df(fastai_path, train_data,valid_data)
# input_data.save('input_data.pkl')

In [None]:
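# learn_process is our language model
learn_process=language_model_learner(input_data, AWD_LSTM, drop_mult=0.3)
# the learner starts from AWD_LSTM weights pretrained on Wikitext-103 (Wikipedia articles)
# freeze everything except the last layer group for the first fine-tuning epoch
learn_process.freeze()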

Section 5.2. Finetuning LM

> In this stage, the entire LM is fine-tuned on the target data using discriminative fine-tuning and slanted triangular learning rates to learn task-specific features.
> We use the 1cycle policy (Leslie Smith) to train the neural network quickly.
> When we plot the loss across a range of learning rates, we pick a learning rate slightly below the one with the minimum loss, where the loss is still decreasing.
> Leslie Smith recommends a schedule in which the learning rate increases from a lower value to the maximum learning rate during the first half of the iterations and gradually decreases back to the lower value during the second half.
> As suggested in the fast.ai docs, we set the maximum learning rate to one tenth of the learning rate with the minimum loss in the plot below.
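
> The schedule described above can be sketched as a simple function of the training step. This is a linear version written only to illustrate the description; fast.ai's own fit_one_cycle shapes the curve differently, and the name one_cycle_lr and the div_factor parameter are assumptions made for this example.

```python
def one_cycle_lr(step, total_steps, lr_max, div_factor=10.0):
    """Linear one-cycle schedule: lr_max/div_factor -> lr_max -> lr_max/div_factor."""
    lr_min = lr_max / div_factor
    half = total_steps / 2
    if step <= half:
        frac = step / half                   # first half: ramp up
    else:
        frac = (total_steps - step) / half   # second half: ramp down
    return lr_min + frac * (lr_max - lr_min)

total = 100
lrs = [one_cycle_lr(step, total, lr_max=5e-2) for step in range(total + 1)]
print(lrs[0], lrs[50], lrs[100])
```

> The learning rate starts and ends at the lower value and reaches lr_max exactly halfway through training.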

In [None]:
#lr_find helps to find a good learning rate
learn_process.lr_find()
#plots the losses over a range of learning rates
learn_process.recorder.plot()

In [None]:
# from the plot above we can see that 5e-01 is the learning rate with minimum loss
# we take our max learning rate in 1cycle to be a tenth of that: 5e-02
learn_process.fit_one_cycle(1,5e-02,moms=moms,wd=wd)
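
> The rule of thumb used in the cell above (take the learning rate with the lowest loss and divide it by ten) can also be written as a small helper. The sweep values below are made up for illustration and are not the output of lr_find; pick_max_lr is a hypothetical name, not a fast.ai function.

```python
def pick_max_lr(lrs, losses, divisor=10.0):
    """Return the learning rate at the minimum recorded loss, divided by `divisor`."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best_index] / divisor

# Made-up learning-rate sweep: the loss falls, bottoms out at 5e-1, then diverges.
sweep_lrs = [1e-4, 1e-3, 1e-2, 1e-1, 5e-1, 1.0]
sweep_losses = [5.2, 5.0, 4.6, 4.1, 3.9, 7.5]
max_lr = pick_max_lr(sweep_lrs, sweep_losses)
print(max_lr)
```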

In [None]:
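# unfreeze all layers of the LM
# note: no further training is run after this call, so the encoder saved below reflects only the single epoch above;
# fine-tuning the whole LM would need another fit_one_cycle call here
learn_process.unfreeze()

In [None]:
# save the encoder (the LM without its final prediction layer) for reuse in the classifier
learn_process.save_encoder('encoder')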

Section 5.3 Classifier finetuning

> The fine-tuned language model is used on the target task with gradual unfreezing and STLR to preserve low-level representations and adjust high-level representations.
> In this final stage, we split the dataset into training, validation and test sets.
> The breakdown of the data is as follows:
> 
> * test data: 20%
> * training data: 64%
> * validation data: 16%
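
> These percentages follow from applying two 20% splits in sequence: 20% of the data is held out for testing, and 20% of the remaining 80% (that is, 16% of the total) is held out for validation. The sketch below reproduces the arithmetic for a hypothetical dataset of 10,000 rows; the exact counts produced by train_test_split may differ by a row because of rounding.

```python
def split_sizes(n_rows, test_frac=0.2, valid_frac=0.2):
    """Row counts after holding out a test set, then a validation set from the rest."""
    n_test = round(n_rows * test_frac)
    n_valid = round((n_rows - n_test) * valid_frac)
    n_train = n_rows - n_test - n_valid
    return n_train, n_valid, n_test

# 10,000 is a round number chosen for illustration, not the size of the dataset.
n_train, n_valid, n_test = split_sizes(10_000)
print(n_train, n_valid, n_test)
```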

In [None]:
trainNvalid, test=train_test_split(df_input_crop, test_size=0.2)

In [None]:
train, valid=train_test_split(trainNvalid, test_size=0.2)

In [None]:
# formatting data such that it can be easily fed into model
data_classified= TextClasDataBunch.from_df(fastai_path, train, valid, test_df=test, vocab= input_data.train_ds.vocab, text_cols='text', label_cols='airline_sentiment', bs=32)

In [None]:
data_classified.show_batch()

In [None]:
# learn_process now refers to the classifier, built on the same AWD_LSTM architecture
learn_process=text_classifier_learner(data_classified, AWD_LSTM, drop_mult=0.5)

In [None]:
learn_process.load_encoder('encoder')

In [None]:
# freeze all layers before unfreezing gradually in the following step
learn_process.freeze()
learn_process.lr_find()
learn_process.recorder.plot()

In [None]:
# one tenth of the learning rate with minimum loss in the plot above: 3e-01/10
lr=3.0E-02
learn_process.fit_one_cycle(1,lr,moms=moms, wd=wd)

In [None]:
# here we begin gradual unfreezing, where we unfreeze one more layer group at a time, from the top down
learn_process.freeze_to(-2)
lr/=2
# slice(low, high) gives lower layer groups smaller learning rates (discriminative fine-tuning)
# 2.6 was found to be a good factor between layers through experimentation (Howard and Ruder, 2018)
learn_process.fit_one_cycle(1, slice(lr/(2.6**4),lr),moms=moms, wd=wd)

In [None]:
learn_process.freeze_to(-3)
lr/=2
learn_process.fit_one_cycle(1, slice(lr/(2.6**4),lr),moms=moms, wd=wd)

In [None]:
#unfreeze all layers 
learn_process.unfreeze()
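
> The freeze and unfreeze calls in the cells above follow Python's slicing convention: freeze_to(n) freezes every layer group before index n and leaves the rest trainable. The sketch below lists which groups are trainable at each stage. It is plain Python, not fast.ai code, and the number of layer groups (5) is an assumption made for illustration.

```python
def trainable_groups(n_groups, freeze_to):
    """Indices of the layer groups that stay trainable when groups before `freeze_to` are frozen.

    Mirrors Python slicing: -1 keeps only the last group trainable,
    0 leaves every group trainable.
    """
    return list(range(n_groups))[freeze_to:]

N_GROUPS = 5                # assumed number of layer groups
stages = [-1, -2, -3, 0]    # freeze(), freeze_to(-2), freeze_to(-3), unfreeze()
for stage in stages:
    print(stage, trainable_groups(N_GROUPS, stage))
```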

In [None]:
lr/=5
learn_process.fit_one_cycle(3, slice(lr/(2.6**4),lr), moms=moms, wd=wd)
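
> The expression slice(lr/(2.6**4), lr) used in the training cells above asks for learning rates that range from lr/(2.6**4) for the lowest layer group to lr for the top one. Assuming the values are spread geometrically over five layer groups (both the geometric spacing and the group count are assumptions about fast.ai's behaviour here), neighbouring groups differ by exactly the factor 2.6 from the paper. A minimal sketch with a hypothetical helper name:

```python
def spread_lrs(lowest, highest, n_groups):
    """Geometrically spaced learning rates from `lowest` (first group) to `highest` (last group)."""
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * step ** i for i in range(n_groups)]

lr = 3.0e-2 / 2                                   # value used in the first unfreezing step
group_lrs = spread_lrs(lr / 2.6 ** 4, lr, n_groups=5)
print(group_lrs)
```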

In [None]:
# an example of the model predicting the sentiment of a tweet
learn_process.predict('quite a good experience, not perfect tho')

# end of model development

*Section 6. Results*
> With the prediction example, we have wrapped up model development and can now look at the results.
> Accuracy on held-out data is the headline figure by which models are usually compared.
> Here, however, we go a step further to understand the different aspects of our model's performance.
> We first compute the accuracy and then plot a confusion matrix.


In [None]:
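# accuracy on held-out data
# note: from_learner evaluates the validation set by default; fastai v1 treats test_df as unlabelled
vals=TextClassificationInterpretation.from_learner(learn_process)
val_acc=accuracy(vals.preds,vals.y_true)
print('Validation accuracy is-',val_acc)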



In [None]:
# confusion matrix, normalised per row to show the fraction of each actual class assigned to each predicted class
vals.plot_confusion_matrix(normalize=True)

> The confusion matrix compares the actual and predicted classes. Because it is normalised, the values in each row sum to 1.
> As shown above, the model correctly identifies 89% of negative tweets, 81% of positive tweets and 61% of neutral tweets.
> The model therefore has the most room for improvement on neutral tweets: it classifies a significant share of them, 28%, as negative.
> One reason for this could be the sentiment class imbalance identified at the beginning.
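
> To see what row normalisation does, the sketch below divides each row of a small count matrix by its row total. The counts are made up for illustration and are not the results of this notebook; rows are actual classes and columns are predicted classes.

```python
def normalize_rows(matrix):
    """Divide each row by its sum so that entries become per-class fractions."""
    return [[count / sum(row) for count in row] for row in matrix]

labels = ["negative", "neutral", "positive"]
# Made-up counts: row = actual class, column = predicted class.
counts = [
    [90, 7, 3],
    [30, 60, 10],
    [10, 10, 80],
]
fractions = normalize_rows(counts)
for label, row in zip(labels, fractions):
    print(label, [round(value, 2) for value in row])
```

> The diagonal of the normalised matrix is the fraction of each actual class that was predicted correctly, which is how the per-class figures quoted above are read off the plot.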

*Section 7. Future Improvements*

> This model could be improved in the following three ways:
>     1. The sentiment class imbalance could be addressed by using the same number of examples for each sentiment. We could then check whether the accuracy on the neutral class improves.
>     2. Google has published a paper on BERT, a bidirectional Transformer for language understanding. A key feature of BERT is that it uses both the left and the right context of each word: during pretraining, it predicts masked words from the words that precede and follow them. That way, context can be captured better than by a model that reads in one direction only.
>     3. Last but not least, we could try training three separate models:
>             *     positive and negative data
>             *     positive and neutral data
>             *     neutral and negative data
>         We could then create an ensemble of the three models. That way, we may be able to achieve higher accuracy on each sentiment class and distinguish among the three classes more reliably.
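
> As a starting point for the first suggestion, one simple way to equalise the classes is random undersampling: keep as many rows of each sentiment as the rarest sentiment has. The sketch below is one possible approach in plain Python on made-up rows; the function name is hypothetical, and oversampling or class weights would be alternatives.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, label_of, seed=0):
    """Randomly keep the same number of rows per label (the size of the rarest label)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    n_keep = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_keep))
    rng.shuffle(balanced)
    return balanced

# Made-up (label, text) pairs for illustration.
rows = (
    [("negative", f"neg {i}") for i in range(6)]
    + [("neutral", f"neu {i}") for i in range(3)]
    + [("positive", f"pos {i}") for i in range(2)]
)
balanced = undersample(rows, label_of=lambda row: row[0])
label_counts = Counter(label for label, _ in balanced)
print(label_counts)
```

> Undersampling discards data from the larger classes, so it trades some accuracy on those classes for balance.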


 

*Section 8. Conclusion*

>  My aim for this notebook was to introduce the different aspects of developing a ULMFiT model for classifying airline tweets. I hope that beginners in deep learning and NLP find it helpful and that it inspires them to improve this model by delving into more detail. I highly encourage readers to read further NLP papers, such as those on BERT and XLNet, which both give a good introduction to the capabilities of NLP and lead into more advanced NLP concepts. A possible next step is to apply ULMFiT to another problem space such as question answering. The resources I used to create this notebook are listed in the following section. Please refer to them to delve deeper into ULMFiT.
> 

*Section 9. References*

> I referred to the following resources, which you may find useful:
> 1. The fast.ai documentation for the fastai library: https://docs.fast.ai/training.html
> 2. The ULMFiT paper by Howard and Ruder: https://arxiv.org/pdf/1801.06146.pdf
> 3. A thorough analysis of the current state of transfer learning in NLP by Ruder: https://ruder.io/state-of-transfer-learning-in-nlp/
> 4. A deep dive into the 1cycle policy by Gugger: https://sgugger.github.io/the-1cycle-policy.html
> 5. A notebook on using ULMFiT for Russian language modelling: https://github.com/mamamot/Russian-ULMFit
> 
> Thank you!


                                                  THE END

*Appendix A: Post Credits*

> In this post-credits section, just like in a Marvel movie, I dive a little deeper into an idea I teased in the main notebook.
> Here, I explore the possibility of training ULMFiT on text that runs backwards. In other words, the words of each tweet are fed in reverse order.
> Please note that this section is an extension, meant to inspire readers of this notebook to delve deeper into the suggested improvements.

In [None]:
# reverse the order of the words in every tweet
# work on a copy so that the forward-order tweets in df_input_crop stay untouched
input_data_backward=df_input_crop.copy()
input_data_backward['text']=input_data_backward['text'].apply(lambda tweet: ' '.join(tweet.split(' ')[::-1]))
input_data_backward.sample(30)


In [None]:
train_data_backward,valid_data_backward=train_test_split(input_data_backward, test_size=0.2)

In [None]:
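#stage 1: language model fine-tuning on the reversed tweets
# build the LM data from the reversed tweets, reusing the forward vocabulary so the encoder stays compatible with the classifier data
lm_data_backward=TextLMDataBunch.from_df(fastai_path, train_data_backward, valid_data_backward, vocab=input_data.train_ds.vocab)
learn_process_backward=language_model_learner(lm_data_backward, AWD_LSTM, drop_mult=0.3)
# the learner starts from weights pretrained on Wikitext-103 (Wikipedia articles)
learn_process_backward.freeze()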

In [None]:
learn_process_backward.lr_find()
learn_process_backward.recorder.plot()

In [None]:
learn_process_backward.fit_one_cycle(1,5.0E-02,moms=moms,wd=wd)

In [None]:
learn_process_backward.unfreeze()

In [None]:
# fine-tune the whole backward language model, then save its encoder for the backward classifier
learn_process_backward.fit_one_cycle(3,5.0E-03,moms=moms,wd=wd)

In [None]:
learn_process_backward.save_encoder('backward_encoder')

In [None]:
trainNvalid_backward, test_backward=train_test_split(input_data_backward, test_size=0.2)

In [None]:
train_backward, valid_backward=train_test_split(trainNvalid_backward, test_size=0.2)

In [None]:
data_classified_backward= TextClasDataBunch.from_df(fastai_path, train_backward, valid_backward, test_df=test_backward, vocab= input_data.train_ds.vocab, text_cols='text', label_cols='airline_sentiment', bs=32)

In [None]:
data_classified_backward.show_batch()

In [None]:
learn_process_backward=text_classifier_learner(data_classified_backward, AWD_LSTM, drop_mult=0.5)

In [None]:
learn_process_backward.load_encoder('backward_encoder')

In [None]:
learn_process_backward.freeze()
learn_process_backward.lr_find()
learn_process_backward.recorder.plot()

In [None]:
lr=3.0E-02
learn_process_backward.fit_one_cycle(1,lr,moms=moms, wd=wd)

In [None]:
learn_process_backward.freeze_to(-2)
lr/=2
learn_process_backward.fit_one_cycle(1, slice(lr/(2.6**4),lr),moms=moms, wd=wd)

In [None]:
learn_process_backward.freeze_to(-3)
lr/=2
learn_process_backward.fit_one_cycle(1, slice(lr/(2.6**4),lr),moms=moms, wd=wd)

In [None]:
learn_process_backward.unfreeze()

In [None]:
lr/=5
learn_process_backward.fit_one_cycle(15, slice(lr/(2.6**4),lr), moms=moms, wd=wd)

In [None]:
learn_process_backward.predict('quite a good experience, not perfect tho')

Appendix A.1. Ensemble of the forward and backward models

In [None]:
forward_preds, forward_targets=learn_process.get_preds(ordered=True)

print('Forward classifier results (validation set): \nValidation accuracy: {:.2f}, Validation error rate: {:.2f}'.format(accuracy(forward_preds, forward_targets), error_rate(forward_preds, forward_targets)))

In [None]:
backward_preds, backward_targets=learn_process_backward.get_preds(ordered=True)

print('Backward classifier results (validation set): \nValidation accuracy: {:.2f}, Validation error rate: {:.2f}'.format(accuracy(backward_preds, backward_targets), error_rate(backward_preds, backward_targets)))

In [None]:
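# average the class probabilities predicted by the two models
# note: this is only meaningful if both prediction tensors cover the same validation rows in the same order,
# i.e. the forward and backward splits must select identical rows (for example via a shared random_state)
ensemble_preds =(forward_preds + backward_preds)/2
# accuracy of the averaged predictions on the validation set
print('Ensemble classifier results (validation set): \nValidation accuracy: {:.2f}, Validation error rate: {:.2f}'.format(accuracy(ensemble_preds, forward_targets), error_rate(ensemble_preds, forward_targets)))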
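
> One thing worth checking whenever an ensemble scores worse than both of its members is whether the two sets of predictions refer to the same samples in the same order. The sketch below uses made-up probabilities for four samples and three classes to show how much row alignment matters when averaging; it is an illustration of the general point, not a diagnosis of the numbers in this notebook, and the function names are hypothetical.

```python
def average_probs(probs_a, probs_b):
    """Element-wise mean of two lists of per-class probability rows."""
    return [
        [(a + b) / 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(probs_a, probs_b)
    ]

def accuracy_of(probs, targets):
    """Fraction of rows whose highest-probability class equals the target."""
    hits = sum(row.index(max(row)) == target for row, target in zip(probs, targets))
    return hits / len(targets)

# Made-up predictions: both models are correct on every sample.
targets = [0, 1, 2, 1]
model_a = [[0.6, 0.3, 0.1], [0.3, 0.6, 0.1], [0.1, 0.3, 0.6], [0.3, 0.6, 0.1]]
model_b = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9], [0.05, 0.9, 0.05]]

aligned = accuracy_of(average_probs(model_a, model_b), targets)
# Same predictions from model_b, but listed in a different row order than `targets`.
misaligned = accuracy_of(average_probs(model_a, model_b[::-1]), targets)
print(aligned, misaligned)
```

> With aligned rows the ensemble keeps the accuracy of its members; with the rows of one model in a different order, the averaged predictions no longer match the targets even though each model is individually correct.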

> As seen above, the backward model performs slightly better than the forward model we built in the main notebook. However, when we create an ensemble of both models, the accuracy drops significantly to 60%.
> 
> What could be the reason? Is there anything I missed out on?
> 
> Please comment below to take this discussion further and learn together!