# Twitter Airline Sentiment using ULMFiT

# 1. Introduction

## 1.1. Problem Statement

We will be tackling the **ULMFiT Sentiment** problem from Fellowship.ai which states:

> Apply a supervised or semi-supervised ULMFiT model to Twitter US Airlines Sentiment

The Twitter US Airlines Sentiment is this dataset: https://www.kaggle.com/crowdflower/twitter-airline-sentiment#Tweets.csv and the ULMFiT model was introduced here: http://nlp.fast.ai/classification/2018/05/15/introducing-ulmfit.html.

## 1.2. Our Approach

### 1.2.1 Data exploration and processing

After doing some data exploration, we see that the sentiments are heavily biased towards negative and that the distribution of airline sentiments depends on the specific airline. For example, tweets about Virgin America tend to be more positive than the others.

The imbalanced data suggests that accuracy should not be our only performance metric. To this end, we will also perform an ROC and AUC analysis on the resulting model.
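
To see why accuracy alone can mislead on skewed labels, here is a minimal sketch of a majority-class baseline. The label counts are made up for illustration and are not the dataset's actual counts.

```python
from collections import Counter

# Hypothetical, illustrative labels (NOT the real dataset counts)
labels = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A "model" that always predicts the majority class reaches this accuracy
baseline_accuracy = majority_count / len(labels)
print(majority_class, baseline_accuracy)
```

Any accuracy figure is best read against this kind of baseline: the more skewed the labels, the higher the accuracy a trivial predictor obtains.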

Since the distribution of airline sentiments depends on the specific airline and each tweet contains `@{airline}`, we will do a bit of pre-processing. We will substitute `@{airline}` with `@airline`; for example, `@united` becomes `@airline`. The goal of the substitution is to keep the model from using the identity of the airline being addressed as a shortcut for predicting sentiment.

### 1.2.2 Model training

We follow the ULMFiT approach of Howard and Ruder found here: https://arxiv.org/pdf/1801.06146.pdf. We will also make extensive use of the `fastai` package, as the methods described in the paper are implemented in it. The paper discusses applying ULMFiT to an IMDB sentiment problem, and an example notebook is provided (https://github.com/fastai/fastai/blob/master/examples/ULMFit.ipynb). That notebook was followed in the creation of this notebook.

Figure 1 of the Howard and Ruder paper outlines the 3 steps of language model transfer learning.

1. **LM pre-training**: This step was done in 3.2 simply by using the built-in `language_model_learner` of `fastai`. This fetches a language model built by Howard and Ruder using the `wiki103` dataset (derived from Wikipedia articles). This step had a large computational cost. The goal of transfer learning and the ULMFiT approach is to take this pre-trained model and fine-tune it to our problem.

2. **LM fine-tuning**: The language of Wikipedia is different from that of Twitter, so we need to fine-tune the language model to the dataset we're interested in. This was done in section 3.3.

3. **Classifier fine-tuning**: A language model predicts the next word given the beginning of a sentence. This is not what we want, so we replace the last layers with some layers for sentiment classification. This was done in section 3.4.

The above describes a fairly generic overview of language model transfer learning. If applied naively, overfitting on the smaller dataset is prone to happen and ''catastrophic forgetting'' occurs, as described in the paper. The Howard and Ruder paper proposes methods to avoid this: 'discriminative fine-tuning' (Discr), 'slanted triangular learning rates' (STLR), and gradual unfreezing. All of these techniques have been implemented and somewhat abstracted away in the `fastai` package. We only have to provide a few parameters.
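
To give a feel for one of these techniques, here is a minimal sketch of the slanted triangular learning rate schedule as defined in the Howard and Ruder paper, using the paper's suggested defaults (`cut_frac=0.1`, `ratio=32`, `eta_max=0.01`). The function name `stlr` is our own, and this is not what `fastai` runs internally: `fit_one_cycle` uses a related but different one-cycle schedule.

```python
import math

def stlr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t out of T total iterations (paper's STLR formula).

    Assumes T * cut_frac >= 1 so that the warm-up phase is non-empty.
    """
    cut = math.floor(T * cut_frac)           # iteration where the rate peaks
    if t < cut:
        p = t / cut                           # short linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return eta_max * (1 + p * (ratio - 1)) / ratio

# Rates for a hypothetical run of 100 iterations
schedule = [stlr(t, T=100) for t in range(101)]
print(schedule[0], schedule[10], schedule[100])
```

The rate starts at `eta_max / ratio`, climbs quickly to `eta_max` after 10% of the iterations, then decays slowly back down, which is the "slanted triangle" shape.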

## 1.3. Results

This is a summary of section 4.

We were able to obtain a final accuracy of 83.7%. But since the data was not balanced, we consider other performance metrics. 

In 4.3, we treat our model as a scorer rather than a classifier which enables us to compute ROC curves by considering different thresholds. We plot the one vs. rest ROC curves and also compute the ROC-AUC. We obtain values of .941, .911, .960 when negative, neutral, and positive were taken as the one in one vs. rest, respectively.

Moreover, in 4.4, we consider adapting our model to a binary classifier where the Positive class is the negative sentiment and the Negative class is the combination of the positive and neutral sentiment. We do this by taking the Positive class whenever a threshold is surpassed. We will determine the optimal threshold given costs assigned to False Positives and False Negatives.

# 2. Data Exploration and Preparation

In [None]:
# Import fastai to use their ULMFiT implementation
from fastai.text import * 

# fastai needs us to specify a path sometimes
from pathlib import Path

# Import usual data science libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt  # used explicitly for the ROC plots in 4.3

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

The Twitter US Airlines Sentiment dataset is provided on Kaggle and was originally sourced here: https://www.figure-eight.com/data-for-everyone/. The description reads:

> A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as “late flight” or “rude service”).

## 2.1. Load and sample data

We will load the entire csv into `df_full`. We'll reserve `df` for when we drop all columns besides `airline_sentiment` and `text`.

In [None]:
path = Path('../input/twitter-airline-sentiment/')
file_name = 'Tweets.csv'

In [None]:
file_path = path / file_name
df_full = pd.read_csv(file_path)
df_full.shape  # (number of rows, number of columns)

In [None]:
df_full.sample(10, random_state=0)

In [None]:
pd.set_option('display.max_colwidth', 0) # tweets aren't too long so let's just print it all

In [None]:
# .copy() so that later column assignments modify df itself and not a view of df_full
df = df_full[['airline_sentiment', 'text']].copy()
df.sample(10)

## 2.2 Data counts

We see that the tweet sentiments are heavily skewed towards negative sentiments.

In [None]:
df[['airline_sentiment', 'text']].isna().sum()

In [None]:
df['airline_sentiment'].value_counts()

In [None]:
df['airline_sentiment'].value_counts(normalize=True)

## 2.3. Data preprocessing

### 2.3.1. Sentiment by airline

As we see in the sample, `@{airline}` often appears in the text. If the sentiments depend on the airlines (for example, if everyone just loves Virgin America), then we should consider doing some preprocessing of the text so that the output of our model is independent of the airlines.

In [None]:
sns.countplot(y='airline', hue='airline_sentiment', data=df_full)

### 2.3.2 Text substitution

In light of the dependence of sentiment on the airline, we will substitute each instance of `@{airline}` with `@airline`. For example, we will replace `@united` with `@airline`. This does not remove all hints about the airlines from the text (for example, any tweet talking about purple lights is probably Virgin America), but it's still a good first step.

In [None]:
import re
regex = r"@(VirginAmerica|united|SouthwestAir|Delta|USAirways|AmericanAir)"
def text_replace(s):
    return re.sub(regex, '@airline', s, flags=re.IGNORECASE)

In [None]:
df['text'] = df['text'].apply(text_replace)

In [None]:
df['text'].sample(5)
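
As a quick sanity check of the substitution, the sketch below applies the same regular expression to a few made-up tweets (the example strings are hypothetical, not taken from the dataset).

```python
import re

# Same pattern as above; matching is case-insensitive
AIRLINE_HANDLES = r"@(VirginAmerica|united|SouthwestAir|Delta|USAirways|AmericanAir)"

def text_replace(s):
    return re.sub(AIRLINE_HANDLES, "@airline", s, flags=re.IGNORECASE)

# Hypothetical example tweets
examples = [
    "@united my bag is lost",
    "Thanks @VIRGINAMERICA, great flight!",
    "@SouthwestAir and @Delta both delayed today",
]
cleaned = [text_replace(s) for s in examples]
for before, after in zip(examples, cleaned):
    print(before, "->", after)

# The pattern has no word boundary, so a longer handle is only partly replaced
partial = text_replace("@DeltaAssist please help")
print(partial)
```

The last line shows a limitation: a handle that merely starts with an airline name keeps its suffix. Appending `\b` to the pattern would be one way to leave such handles untouched, if that is preferred.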

# 3. Model training

## 3.1. Training - validation split

In [None]:
train, valid = train_test_split(df, test_size=0.2)

In [None]:
# Hyperparameters reused by every fit_one_cycle call below
moms = (0.8, 0.7)  # momentum range
wd = 0.1           # weight decay

## 3.2 Get pre-trained model

Transfer learning is done in the 3 steps shown in Figure 1 here: https://arxiv.org/pdf/1801.06146.pdf. This section covers the first step, loading the pre-trained language model.

In [None]:
working_path = Path('./').resolve() # fastai needs a working path

In [None]:
data_lm = TextLMDataBunch.from_df(working_path, train, valid) # form the data bunch

In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3) # this fetches the wiki103 model
learn.freeze()

## 3.3. Fine-tune language model

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(1, 5.0E-02, moms=moms, wd=wd) # 5.0E-02 is LR with the steepest slope above

In [None]:
learn.unfreeze()

In [None]:
learn.fit_one_cycle(3, 5.0E-03, moms=moms, wd=wd)

In [None]:
learn.predict('My flight is great!', n_words=20)

In [None]:
learn.save_encoder('ft_enc')

## 3.4. Classifier fine-tuning

In [None]:
train_valid, test = train_test_split(df, test_size=0.2)
train, valid = train_test_split(train_valid, test_size=0.2)

In [None]:
data_clas = TextClasDataBunch.from_df(working_path, train, valid, test_df=test, vocab=data_lm.train_ds.vocab, text_cols='text', label_cols='airline_sentiment', bs=32)

In [None]:
data_clas.show_batch()

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')
learn.freeze()

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
lr = 3.0E-02
learn.fit_one_cycle(1, lr, moms=moms, wd=wd)

In [None]:
learn.freeze_to(-2)
lr /= 2
learn.fit_one_cycle(1, slice(lr/(2.6**4), lr), moms=moms, wd=wd)

In [None]:
learn.freeze_to(-3)
lr /= 2
learn.fit_one_cycle(1, slice(lr/(2.6**4), lr), moms=moms, wd=wd)

In [None]:
learn.unfreeze()
lr /= 5
learn.fit_one_cycle(3, slice(lr/(2.6**4), lr), moms=moms, wd=wd)
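
The `slice(lr/(2.6**4), lr)` arguments above are how discriminative learning rates are requested: the earliest layers train with the smallest rate and the last layers with the largest. The sketch below shows the resulting per-group rates under two assumptions that should be checked against the installed `fastai` version: that the classifier has 5 layer groups, and that the rates are spaced geometrically between the two ends of the slice. The helper `spread_lrs` is our own, not a `fastai` function.

```python
def spread_lrs(lo, hi, n_groups):
    """Geometrically spaced rates from lo (earliest group) to hi (last group)."""
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

lr = 3.0e-2 / 2   # the rate used in the freeze_to(-2) stage above
n_groups = 5      # ASSUMPTION: number of layer groups in the classifier

lrs = spread_lrs(lr / 2.6 ** 4, lr, n_groups)
ratios = [b / a for a, b in zip(lrs, lrs[1:])]
print(lrs)
print(ratios)
```

Under these assumptions each group's rate is 2.6 times that of the group before it, which matches the factor the Howard and Ruder paper suggests for discriminative fine-tuning.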

In [None]:
learn.predict('I love flying')

In [None]:
learn.predict('My flight was delayed')

In [None]:
learn.predict("Safe flight!")

# 4. Summary of results

## 4.1 Accuracy

In [None]:
interp = TextClassificationInterpretation.from_learner(learn)
acc = accuracy(interp.preds, interp.y_true)
print('Accuracy: {0:.3f}'.format(acc))

## 4.2 Confusion matrix

In [None]:
interp.plot_confusion_matrix()

## 4.3 ROC and AUC

Now we plot the one vs. rest ROC curves and give the corresponding AUC values. The one vs. rest ROC curves are a generalization where we choose one class to be the Positive class and combine the rest.

For ease of exposition, we'll consider the ROC curve where the Positive class is the neutral sentiment and the Negative class is the combination of the positive and negative sentiments. The ROC curve gives the True Positive Rate (TPR) as a function of the False Positive Rate (FPR). We are able to vary the FPR by choosing different thresholds for a neutral sentiment to be classified as Positive. In this way, varying the threshold gives a family of classifiers deciding between Positive (neutral) and Negative (positive+negative). In the next section, we'll assign some hypothetical costs that will inform which classifier out of this family is the ''best''.

In [None]:
scores = pd.DataFrame(interp.preds)
plt.figure(figsize=(12, 12))
fpr = dict()
tpr = dict()
thresh = dict()
for i, cls in zip(range(3), ['negative', 'neutral', 'positive']):
    score = scores[i].apply(lambda x: x.item())
    y_true = [x.item() == i for x in interp.y_true]
    fpr[i], tpr[i], thresh[i] = roc_curve(y_true, score, pos_label=True)
    auc = roc_auc_score(y_true, score)
    leg = "AUC: {0:.3f} -- {1}".format(float(auc), cls)
    plt.plot(fpr[i], tpr[i], label=leg)
    
plt.legend(loc="lower right", prop={'size': 28})
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
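
To make the threshold sweep concrete, here is a small pure-Python sketch that computes ROC points and the AUC for toy scores. The labels and scores are made up, and the helpers `roc_points` and `auc_trapezoid` are our own illustrations of the idea, not the `sklearn` implementation.

```python
def roc_points(y_true, scores):
    """(threshold, FPR, TPR) for each distinct score, highest threshold first.

    An item is predicted Positive when its score is >= the threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and not y)
        points.append((t, fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under the curve through (0, 0) and the given points."""
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for _, fpr, tpr in points:
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area

# Toy data: 1 marks the "one" class, 0 marks the "rest"
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc = auc_trapezoid(points)
print(points)
print(auc)
```

Lowering the threshold can only add predicted Positives, so both FPR and TPR are non-decreasing along the sweep; the AUC summarizes the whole family of thresholded classifiers in one number.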

## 4.4 Cost-informed threshold choosing

As we saw in the last section, there is a trade-off between $TPR$ and $FPR$. In this section, we'll consider a hypothetical business situation that provides a cost function, and we will choose the best threshold given that cost function.

Suppose an airline is interested in possibly taking action upon seeing a negative sentiment tweet to preserve their company image. We will assign negative sentiment tweets the Positive class and positive and neutral sentiment tweets the Negative class. In this situation, a False Positive wastes a small amount of an employee's time and a False Negative costs the company a hit to its image. Let $C_{FP}$ and $C_{FN}$ be the costs associated with each False Positive and False Negative, respectively. Then the cost as a function of $FPR$ and $TPR$ is proportional to

$$C_{FP} \cdot FPR \cdot N + C_{FN} \cdot (1-TPR) \cdot P, $$

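where $P, N$ are the numbers of Positive and Negative class elements, respectively, so that $FPR \cdot N$ is the number of False Positives and $(1-TPR) \cdot P$ is the number of False Negatives. Letting $C = C_{FN}/C_{FP}$ and dividing by $C_{FP}$, we see that the cost function is proportional to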

$$FPR \cdot N + C \cdot (1-TPR) \cdot P$$

**Caveats:** 
1. If we were actually only interested in whether something is negative sentiment or not, it would make sense to relabel our data accordingly and retrain the model.

2. The ROC computation done above was computed with plotting in mind, so we only have a sampling of ~500 points though our training data has many more points. If this analysis were high priority, it would be worthwhile to investigate computing the ROC from scratch.


In [None]:
vc = df['airline_sentiment'].value_counts()
T = vc.sum() # total number of tweets
P = vc['negative'] # number of Positive class elements (negative sentiment)
N = T - P # number of Negative class elements (neutral + positive sentiment)

# The following were computed from the ROC section. zip(Nfpr, Ntpr) gives a list of coordinates to the negative class ROC curve and Nthresh gives the corresponding thresholds.
Nfpr = fpr[0]
Ntpr = tpr[0]
Nthresh = thresh[0]

num_pts = len(Nfpr)

In [None]:
# This computes the cost as defined above where the FPR and TPR is given by Nfpr[i] and Ntpr[i]
def cost(C, i):
    return N*Nfpr[i] + C*P*(1-Ntpr[i]), Nthresh[i]

Suppose the cost of an $FP$ is the same as that of an $FN$, that is, $C = 1$. Then this next computation shows that the threshold to pick is about $.5$.

In [None]:
min(cost(1, i) for i in range(num_pts))[1]

Suppose instead that an $FN$ costs twice as much as an $FP$, that is, $C = 2$. Then this next computation shows that the threshold to pick is about $.2$.

In [None]:
min(cost(2, i) for i in range(num_pts))[1]
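
The direction of this effect can be checked on a toy example. The sketch below uses made-up ROC points and class counts (not values from the model above) and a hypothetical helper `best_threshold` that picks the threshold minimizing $FPR \cdot N + C \cdot (1-TPR) \cdot P$.

```python
def total_cost(fpr, tpr, n_pos, n_neg, c):
    """Cost in units of one False Positive: FPR*N + C*(1-TPR)*P."""
    return fpr * n_neg + c * (1 - tpr) * n_pos

def best_threshold(points, n_pos, n_neg, c):
    """Threshold of the (threshold, FPR, TPR) point with the lowest cost."""
    best = min(points, key=lambda pt: total_cost(pt[1], pt[2], n_pos, n_neg, c))
    return best[0]

# Hypothetical ROC points as (threshold, FPR, TPR) and hypothetical class counts
toy_points = [
    (0.9, 0.00, 0.50),
    (0.7, 0.05, 0.70),
    (0.5, 0.10, 0.85),
    (0.3, 0.30, 0.95),
    (0.1, 0.60, 1.00),
]
n_pos, n_neg = 60, 40

thr_equal = best_threshold(toy_points, n_pos, n_neg, c=1)   # FP and FN cost the same
thr_fn_x3 = best_threshold(toy_points, n_pos, n_neg, c=3)   # an FN costs 3x an FP
print(thr_equal, thr_fn_x3)
```

As False Negatives become more expensive relative to False Positives, the cheapest operating point moves to a lower threshold, so more tweets get flagged as negative sentiment.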