This notebook is an introduction to building a headline classifier using transfer learning (more specifically, ULMFiT).

In this notebook, we will go over the following topics:
1. What is imbalanced data and why is it bad?
2. What is transfer learning?
3. How to use fast.ai to train a headline classifier model quickly and easily with the help of transfer learning?

Begin by importing the libraries that we will use for preliminary data analysis.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Import the CSV file as a pandas dataframe, and check what the data looks like.

In [None]:
df = pd.read_csv('../input/ireland-historical-news/irishtimes-date-text.csv')
df.head()

publish_date doesn't seem like it would be very useful in classifying the headline_text as headline_category, so we'll drop that column and see what our new dataframe looks like. Besides, the goal of this kernel is to classify headlines based on the text only.

In [None]:
# errors='ignore' makes this cell safe to re-run after the column is already dropped
df = df.drop(columns='publish_date', errors='ignore')

df.head()

Let's see how many unique categories there are, and how many samples each of them has, using value_counts (which returns the values and their counts) of a pandas Series.

In [None]:
df['headline_category'].value_counts()

Seems like a lot. Let's find out how many exactly...

In [None]:
print(len(df['headline_category'].value_counts()))

That's certainly a lot, but most of them probably have very low frequencies. Let's see how many categories occur more than 10000 times in the sample.

In [None]:
filtered_df = df[df.groupby('headline_category').headline_category.transform(len)>10000]
print(len(filtered_df['headline_category'].value_counts()))

21 seems a lot more reasonable for news classification. Let's check what they are.

In [None]:
filtered_df['headline_category'].value_counts()

In [None]:
plt.figure(figsize=(20,20))
# pass the column as x= so the call does not depend on positional-argument handling
f = sns.countplot(x=filtered_df['headline_category'])
f.set_xticklabels(f.get_xticklabels(), rotation=20, ha="right");

# Unbalanced Data

Looks like there are a lot of overlapping categories: 5 business categories, lots of news categories and 3 sports categories. You can merge them, train the model and compare the results (you will probably get a higher accuracy, since the similar classes that our model will often confuse are merged into one), but here I have kept them separate.

That issue aside, we have a big problem here that is clearly shown by the count plot. There are way too many headlines classified as news (and to some extent, sports and business), compared to the others. 

**Why is this a problem?**

Well, suppose we guess the class of a headline at random, without looking at any data: the chance of being right is 1/21, i.e. 4.76%. Pretty low. Your model can predict a random class and it'll only get 4.76% accuracy.

Now, suppose we feed this data to our model with a validation set that is just as skewed. Most models will start to predict news or sports or business for every headline.
Why is that? To answer that, let's see what percentage of our headlines is 'news'. According to the csv, we have 1.43 million rows. And from the above value_counts, we have found that news accounts for 574774 of the headlines.

574774/1430000 = 40.19%
So even if the model were to predict that every single headline belongs to the 'news' category, it would have 40.19% accuracy. Now, if it learns the difference between only the major 3 categories -- 'news', 'sports' and 'business', it can reach about 60% accuracy.

What this means is that our model potentially won't learn anything about other classes and if that happens, it will not be able to classify the less frequent categories. 
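
The baseline figures above are simple ratios. Here is a minimal sketch of that arithmetic, assuming the approximate row count of 1.43 million quoted above (the variable names are only illustrative):

```python
# Baseline accuracies for a 21-class problem, using the counts quoted in the text.
n_classes = 21
total_rows = 1_430_000  # approximate number of rows in the csv (rounded figure)
news_rows = 574_774     # headlines labelled 'news'

random_guess_acc = 1 / n_classes       # guessing one of 21 classes at random
majority_acc = news_rows / total_rows  # always predicting 'news'

print(f"random guess:  {random_guess_acc:.2%}")
print(f"always 'news': {majority_acc:.2%}")
```

Any model trained on the skewed data should be compared against the second number rather than the first.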

There are many ways of dealing with a class imbalance problem - 
* Using metrics like AUC (Area Under the Curve) instead of accuracy to evaluate the model.
* Downsampling the classes that have a lot of samples (also called majority classes)
* Upsampling the classes that have very few samples (also called minority classes)
* Using an algorithm that tends to be more robust to class imbalance (such as decision trees)
* Generating synthetic samples
* Penalizing the model if the performance on minority classes is low.
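
As a rough illustration of the last option, one common heuristic (an assumption on my part, it is not used anywhere in this notebook) is to give each class a loss weight that is inversely proportional to its frequency. The helper name and the toy labels below are made up for the example:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels (made up for illustration): 6 'news', 3 'sport', 1 'health'
toy_labels = ["news"] * 6 + ["sport"] * 3 + ["health"]
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

Rare classes get the largest weights, so mistakes on them cost more during training.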

Here, we will downsample all the categories to 10000. 

In [None]:
def sampling_k_elements(group, k=10000):
    return group.sample(k)

#Apply the function to all groups
balanced_df = filtered_df.groupby('headline_category').apply(sampling_k_elements).reset_index(drop=True)
balanced_df['headline_category'].value_counts()

We now see that each category has exactly 10000 samples, giving us a perfectly balanced dataset.

Let us numericalize the categories (instead of 'news', 'culture', 'health', we convert them to some corresponding numbers such as '9', '5', '2', etc). It can be easily done by converting the column to a category and finding out the category codes.

In [None]:
balanced_df['category'] = balanced_df['headline_category'].astype("category").cat.codes
balanced_df.head()

Now, there is a new column with numerical categories.
We can now proceed to split our dataset into two - train and validation. We do this so that we can detect whether our model is overfitting the dataset. Comparing train loss and accuracy with validation loss and accuracy tells us how well the model performs on seen data (train) vs unseen data (validation).

We split 80% of dataset in train and 20% of dataset in valid in this case. We also drop the headline_category column, which is no longer required since we have converted the categories to numbers.

In [None]:
np.random.seed(123)
# shuffle the rows before splitting
balanced_df = balanced_df.iloc[np.random.permutation(len(balanced_df))]
cut1 = int(0.8 * len(balanced_df)) + 1
# errors='ignore' keeps the cell re-runnable; balanced_df itself keeps the column for later use
dropped_balanced_df = balanced_df.drop(columns='headline_category', errors='ignore')

df_train, df_valid = dropped_balanced_df[:cut1], dropped_balanced_df[cut1:]

Let's check the shapes and heads of the new dataframes. There are 210000 samples in total (21 categories * 10000 headlines per category).

In [None]:
print(df_train.shape)
df_train.head()

In [None]:
print(df_valid.shape)
df_valid.head()

In [None]:
df_valid['category'].value_counts()

Now we have a relatively balanced validation dataframe (all categories are close to 2000).
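
The sizes involved can be checked by hand. A small sketch of the arithmetic, using the same cut formula as the split cell above (the variable names are only illustrative):

```python
# Expected sizes of the 80/20 split for 21 categories of 10000 headlines each.
n_classes = 21
samples_per_class = 10000
total = n_classes * samples_per_class

cut = int(0.8 * total) + 1            # same formula as cut1 in the split cell
n_train, n_valid = cut, total - cut
expected_per_class = n_valid / n_classes  # average validation rows per category

print(total, n_train, n_valid)
print(round(expected_per_class, 1))
```

Because the rows are shuffled before the cut, each category lands near this average in the validation set rather than exactly on it.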

We create a dictionary for future use, so we know which number corresponds to which category.

In [None]:
category_numbers = dict(enumerate(balanced_df['headline_category'].astype("category").cat.categories))
print(category_numbers)
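
To see why this dictionary lines up with the codes in the category column, here is a plain-Python sketch of the same idea. It relies on the fact that pandas orders string categories alphabetically; the toy labels are made up and are not the real data:

```python
# Toy labels standing in for the headline_category column.
labels = ["news", "culture", "health", "news", "sport"]

# Categories are the sorted unique labels; a code is a position in that list.
categories = sorted(set(labels))
label_to_code = {name: code for code, name in enumerate(categories)}
code_to_label = dict(enumerate(categories))  # same shape as category_numbers

codes = [label_to_code[name] for name in labels]
print(codes)
print(code_to_label)
```

Since both mappings come from the same sorted list, looking a code up in the dictionary always returns the label it was made from.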

# Transfer Learning

Now that our data is ready, we will dive into modelling our classifier. 

From Wikipedia, 
> Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks. 

Most machine learning models are very specific. They are very good at handling one problem, and only that one problem. Transfer learning allows us to use the knowledge that a model has learned in domain A (say, cars) when training a model for a similar problem in domain B (say, trucks).

What we essentially do is take a pretrained model that was used for a task T1 in certain domain A, and fine tune it to use it on a similar task T2 in domain B.

fastai features ULMFiT (Universal Language Model Fine-tuning for Text Classification), which, as the name suggests, is a method for fine-tuning a general-purpose language model for text classification. The language model it starts from has been pretrained on the WikiText-103 dataset (about 103 million tokens), which is drawn from a curated subset of Wikipedia articles. There are also other pretrained models like this, such as Google's BERT and OpenAI's GPT-2 (Generative Pre-trained Transformer 2).

Import fast ai's text API

In [None]:
from fastai.text import *

### Tokenization 
Next, we need to create a TextLMDataBunch, which fastai uses to feed text to a language model. It tokenizes the data while trying to retain most of the meaning of the sentence (we will discuss this further down). Tokenization refers to breaking text down into tokens - for example, "This is a sentence" is split into 4 tokens and forced into lower case - 'this', 'is', 'a', 'sentence'.

DataBunch is what is fed to the neural network. It has 5 components (for supervised learning) - 

1. Training Dataset
2. Training DataLoader
3. Validation Dataset
4. Validation DataLoader
5. (Optional) Test set, which we are not going to use in this kernel.

Dataset and DataLoader are PyTorch classes, so you can read more about them in their docs if you want to know more in detail.

The combination of all these 5 is bunched in the fast ai DataBunch.

TextDataBunch is just a type of DataBunch. However, it has a big limitation: it only works directly if ALL texts have the same length. Our headlines are of uneven lengths, so we use TextLMDataBunch, which concatenates the texts and ignores their target labels. This TextLMDataBunch is essentially the data that we can feed into the language model later.

This might take some time...

In [None]:
data_lm = TextLMDataBunch.from_df(path="", train_df=df_train, valid_df = df_valid, text_cols="headline_text", label_cols="category")

Let's save the databunch, so we can load it directly the next time.

In [None]:
data_lm.save('irish.pkl')

Let's check how data_lm looks now.

In [None]:
data_lm.show_batch(5)

### What does Language Modelling mean and what are all those xx terms?

Language modelling, in simple terms, means learning the structure of a language, from its individual words (word representations) to the meaning of whole sentences and paragraphs. In practice, the language model used here is trained to predict the next token in a text.

We use the xx tokens so that the model can better differentiate between two similar texts. For example, "I need a hotdog now" vs "I NEED a hotdog now". The second one implies more urgency, more hunger than the first. However, to have fewer features and reduce complexity, all text is converted to lower case, so both of them end up as "i need a hotdog now", and we have definitely lost some meaning after tokenization. Likewise, "China" is a country, while "china" is porcelain. So you can see, proper tokenization is a very important prerequisite for language modelling.

Using xx tokens we retain a lot of the original meaning despite converting everything to lowercase.
Here are all of the xx tokens and what they mean - 

1. xxbos = Beginning Of Sentence = This xx token marks the beginning of a text (here, each headline).

2. xxpad = Padding = Used for padding if needed to regroup several texts of different lengths into one batch.

3. xxmaj = Denotes that the following word begins with uppercase. So "Hello World" becomes "xxbos xxmaj hello xxmaj world"

4. xxfld = Separates different fields of text (when we have more than one text column in the dataframe; e.g., if we had both headlines and article text in two separate columns, we would have 2 fields).

5. xxup = Denotes that the following word is entirely uppercase. "... NEED... " becomes "...xxup need..."

6. xxrep = Repetitions = Denotes how many times a CHARACTER has been repeated consecutively, if it has been repeated consecutively. "...oooo ..." becomes "...xxrep 4 o ..."

7. xxwrep = Same thing as xxrep but for words. "...muda muda muda muda muda..." becomes "...xxwrep 5 muda..."

8. xxunk = Denotes an unknown word, that is not in our vocabulary. 
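
To make a few of these rules concrete, here is a toy tokenizer that imitates xxbos, xxmaj, xxup and xxrep. It is a simplified sketch written for this explanation, not fastai's actual tokenizer (which handles punctuation, word repetitions, fields and more), and the function name toy_tokenize is made up:

```python
import re

def toy_tokenize(text):
    """Simplified imitation of a few xx-token rules (not the library's code)."""
    # 4 or more repeats of one non-space character -> " xxrep <count> <char> "
    text = re.sub(r"(\S)\1{3,}",
                  lambda m: f" xxrep {len(m.group(0))} {m.group(1)} ", text)
    tokens = ["xxbos"]  # every text starts with xxbos
    for word in text.split():
        if word.isupper() and len(word) > 1:
            tokens += ["xxup", word.lower()]   # entirely upper case word
        elif word[0].isupper() and len(word) > 1 and word[1:].islower():
            tokens += ["xxmaj", word.lower()]  # capitalised word
        else:
            tokens.append(word.lower())
    return tokens

print(toy_tokenize("Hello World"))
print(toy_tokenize("I NEED a hotdog now"))
print(toy_tokenize("yessss"))
```

The second example shows how the emphasis in "NEED" survives lowercasing as an extra xxup token.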

Now that we are done with the tokenization, let's check the first 40 tokens in the vocabulary, which are the special tokens followed by the most frequent words in the headlines.

The default size of the vocabulary is 60000. A word has to occur at least 2 times to be added to the vocabulary.
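
The vocabulary rule can be sketched in plain Python as follows. This is only an illustration of the idea (keep the most frequent tokens, drop the rare ones, map everything else to xxunk); the function names and the toy texts are assumptions, and the real vocabulary also contains the other special tokens:

```python
from collections import Counter

def build_vocab(tokenized_texts, max_vocab=60000, min_freq=2):
    """Keep at most max_vocab tokens that occur at least min_freq times."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    frequent = [tok for tok, cnt in counts.most_common(max_vocab) if cnt >= min_freq]
    return ["xxunk"] + frequent  # index 0 is reserved for unknown words here

def numericalize(tokens, itos):
    """Replace each token by its index, falling back to 0 (xxunk)."""
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return [stoi.get(tok, 0) for tok in tokens]

toy_texts = [["irish", "news", "today"], ["irish", "sport"], ["irish", "news"]]
itos = build_vocab(toy_texts)
print(itos)
print(numericalize(["irish", "weather", "news"], itos))
```

Words that occur only once ("today", "sport") or never ("weather") all end up as xxunk.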

In [None]:
data_lm.vocab.itos[:40]

Mostly the tokens, and the common words.

The nouns are Ireland, Irish and Dublin. There's also a €.
Until recently, many believed that removing stopwords was an essential step before modelling, but views have changed. In this case, I am not going to remove them, because stopwords can contribute a lot to the meaning of a sentence.

For those who are not familiar with stopwords, they are words that occur in almost all texts, regardless of the topic, e.g. "an", "a", "in", "the", etc. Compare that with a word like "vindicate", which probably appears mostly in legal matters.

Now we define our language model learner, in which we pass our DataBunch and specify the pretrained network that we want to use - AWD_LSTM that was used on WT-103 (Wikitext-103). 

drop_mult is a hyper-parameter that defines by how much we want to multiply the dropout probabilities of our network. Dropout refers to randomly dropping out units in the network (both hidden and visible) during training to avoid overfitting.

drop_mult = 0.3 means every dropout layer's probability will be multiplied by 0.3. drop_mult = 1.5 means every dropout layer's probability will be multiplied by 1.5. AWD_LSTM has a dropout in every layer, so this parameter is quite important.

To put it simply, if the model is overfitting, increase drop_mult.
If the model is underfitting, decrease drop_mult.
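
A small sketch of what that multiplication does. The base probabilities below are placeholders made up for illustration, not values read from fastai, and scale_dropouts is a hypothetical helper:

```python
def scale_dropouts(base_dropouts, drop_mult):
    """Multiply every dropout probability by drop_mult."""
    return {name: p * drop_mult for name, p in base_dropouts.items()}

# Placeholder base probabilities (illustrative only)
base = {"input": 0.25, "hidden": 0.15, "output": 0.10}

light = scale_dropouts(base, 0.3)  # less regularisation
heavy = scale_dropouts(base, 1.5)  # more regularisation
print(light)
print(heavy)
```

A single number therefore moves all the dropout layers up or down together, which is why it is a convenient knob to turn.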

In [None]:
learner = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.1)

In [None]:
learner.save_encoder('irish_encoder')

We save the learner's encoder, so we can use it in our classifier. The encoder is the part of the model that understands the text. Note that the language model has not been fine-tuned on the headlines here, so the saved encoder still holds the pretrained weights.

Now that we have set up our language model and saved an encoder that can understand text, we need to create a text classifier that separates our 21 classes. For that, we first need to create a TextClasDataBunch, which is another type of DataBunch.

Here we specify vocab = data_lm.train_ds.vocab to ensure that the vocabulary that we had in language modelling is the same as what we will have for the text classifier.

In [None]:
data_clas = TextClasDataBunch.from_df(path="", train_df=df_train, valid_df = df_valid, vocab=data_lm.train_ds.vocab, text_cols="headline_text",label_cols="category")

Here we create the text classifier, using the pretrained AWD_LSTM model that was used for WT-103.

In [None]:
clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.1)

Load the encoder that was obtained from language model learner...
You can check out the details of the entire network if you wish. 

In [None]:
clas.load_encoder('irish_encoder')

fastai provides an easy way of finding a good learning rate for the model, which works most of the time.

In [None]:
clas.lr_find()
clas.recorder.plot()

The y axis is the loss, and the x axis is the learning rate. If you're thinking that we should pick the learning rate that corresponds to the lowest loss, then you are mistaken. At that point the loss has stopped decreasing and is about to blow up. We want our model to learn as fast as possible, so we want a learning rate at which the loss is still falling steeply.

Most loss vs learning rate graphs look quite similar - they go down slowly to a certain point, stay flat for a few values, and then shoot up very rapidly because the learning rate has become too large and training diverges.

A good rule of thumb is to pick the learning rate as LR(Loss_min)/10, where LR(Loss_min) refers to the learning rate at which the loss is minimum, but this does not always work. In such cases, you might have to experiment a bit.
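
The rule of thumb can be written down in a few lines. This is a sketch on made-up numbers, not the output of lr_find, and lr_from_min_loss is a hypothetical helper:

```python
def lr_from_min_loss(lrs, losses, divisor=10.0):
    """Return the learning rate at the minimum loss, divided by `divisor`."""
    best = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best] / divisor

# Made-up curve: the loss falls slowly, bottoms out at lr=0.1, then blows up
toy_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
toy_losses = [3.05, 3.00, 2.80, 2.40, 2.10, 6.50]

suggested = lr_from_min_loss(toy_lrs, toy_losses)
print(suggested)
```

On a real plot it is still worth looking at the curve, since a noisy minimum can make this rule pick a poor value.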

We specify the number of epochs (5 here) and the maximum learning rate, and train the classifier.

In [None]:
clas.fit_one_cycle(5, 1e-02)

### One cycle fit 

fastai uses 1 cycle fit, which works very well. You can check out the original paper  by Leslie Smith [here](https://arxiv.org/abs/1803.09820). 

![image.png](attachment:image.png)
Image from [Sylvain Gugger's post](https://sgugger.github.io/the-1cycle-policy.html) on 1cycle fit, which you can refer to if you want to understand 1cycle in detail.

Essentially, we start slow (low learning rate), then keep raising the learning rate until it hits maximum, and then slow down again. When all of this is done once, one cycle is complete.
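
The shape of such a schedule can be sketched as below. This is a simplified, piecewise-linear version written for illustration; pct_start and div_factor are assumed parameters, and the schedule fastai actually applies may use a different annealing shape:

```python
def one_cycle_lr(step, total_steps, lr_max, pct_start=0.3, div_factor=25.0):
    """Linear warm-up from lr_max/div_factor to lr_max, then linear decay back."""
    lr_min = lr_max / div_factor
    warmup_steps = max(1, int(total_steps * pct_start))
    if step < warmup_steps:
        frac = step / warmup_steps
        return lr_min + frac * (lr_max - lr_min)
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_max - frac * (lr_max - lr_min)

schedule = [one_cycle_lr(s, 100, 1e-2) for s in range(100)]
peak_step = schedule.index(max(schedule))
print(schedule[0], peak_step, schedule[-1])
```

The learning rate given to fit_one_cycle plays the role of lr_max here, i.e. the top of the cycle.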

# Fine-tuning

After 5 epochs on a pretrained model, we already have nearly a one-in-two chance of predicting the correct class amongst 21 (almost 10 times better than picking a class at random). In less than 5 minutes, too.

But we can surely improve the accuracy further. Let us keep the pretrained model frozen (as it is) except for the last two layer groups, and train again.

In [None]:
clas.freeze_to(-2)
clas.lr_find()
clas.recorder.plot(suggestion=True)
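
What freeze_to(n) means can be shown with a toy version that works on a list of names. This is a simplified sketch: the group names are placeholders rather than the model's real layer groups, and details such as the special handling of batch-norm layers are left out:

```python
def toy_freeze_to(layer_groups, n):
    """Return {group: is_trainable}; groups before index n are frozen."""
    n_groups = len(layer_groups)
    first_trainable = n if n >= 0 else n_groups + n
    return {g: i >= first_trainable for i, g in enumerate(layer_groups)}

# Placeholder group names, ordered from input to output
groups = ["embedding", "rnn_1", "rnn_2", "rnn_3", "classifier_head"]

print(toy_freeze_to(groups, -2))  # only the last two groups train
print(toy_freeze_to(groups, -3))  # the last three groups train
```

Unfreezing from the output side first lets the task-specific layers adapt before the more general early layers are touched.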

Here we use another parameter, moms, which Leslie Smith also advises tuning. It stands for momentum. Since I'm picking an aggressive learning rate here, the momentum will help dampen the overshoot if there is any.

Intuitively, we want to have

1) a higher momentum with a low learning rate

2) a low momentum with a high learning rate

> To accompany the movement toward larger learning rates, Leslie found in his experiments that decreasing the momentum led to better results. This supports the intuition that in that part of the training, we want the SGD to quickly go in new directions to find a flatter area, so the new gradients need to be given more weight. In practice, he recommends to pick two values likes 0.85 and 0.95, and decrease from the higher one to the lower one when we increase the learning rate, then go back to the higher momentum as the learning rate goes down.

Cited from [Sylvain Gugger's post](https://sgugger.github.io/the-1cycle-policy.html) 
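
The momentum schedule described in the quote mirrors the learning rate: it falls while the learning rate rises and climbs back afterwards. A simplified linear sketch, where pct_start is an assumed parameter and the function name is made up:

```python
def one_cycle_momentum(step, total_steps, moms=(0.95, 0.85), pct_start=0.3):
    """Linear decrease from moms[0] to moms[1] during warm-up, then back up."""
    mom_high, mom_low = moms
    warmup_steps = max(1, int(total_steps * pct_start))
    if step < warmup_steps:
        frac = step / warmup_steps
        return mom_high - frac * (mom_high - mom_low)
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return mom_low + frac * (mom_high - mom_low)

start = one_cycle_momentum(0, 100)
lowest = one_cycle_momentum(30, 100)
end = one_cycle_momentum(100, 100)
print(start, lowest, end)
```

The pair passed as moms gives the two ends of this range, highest value first.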

In [None]:
clas.fit_one_cycle(5, 5e-04, moms=(0.9,0.8))

Save the encoder of the classifier so we don't have to train it again next time. We can simply call clas.load_encoder('freeze_2_encoder').

You can see that the training loss actually went up in the last epoch, so we will need to unfreeze more layers to get better results.

In [None]:
clas.save_encoder('freeze_2_encoder')

We'll keep the model frozen except for the last three layer groups and train again.

In [None]:
clas.freeze_to(-3)
clas.lr_find()
clas.recorder.plot(suggestion=True)

In [None]:
clas.fit_one_cycle(5, 3.2e-05, moms=(0.95,0.85))

In [None]:
clas.save_encoder('freeze_3_encoder')

Finally, we will unfreeze the entire model and train.

In [None]:
clas.unfreeze()
clas.lr_find()
clas.recorder.plot(suggestion=True)

In [None]:
clas.fit_one_cycle(3, 5e-05, moms=(0.95, 0.85))

In [None]:
clas.save_encoder('final_encoder')

And we're done. Close to 60% accuracy is pretty good for less than 20 minutes of training on a 21-class problem with quite a few similar classes. This is the power of transfer learning. I'm sure the accuracy can be improved by merging similar classes, unfreezing more gradually, training for more epochs or simply getting more data for each class, but I'm quite satisfied with this.

Let's predict some stuff!

In [None]:
clas.predict("Artist A's latest album is soaring through the charts")

There are two tensors here - The first tensor (7) tells us which class it probably belongs to, and the long, second tensor is a list of probabilities.

But well, what category did 7 belong to, again?
Remember the dictionary that we made when we numericalized the categories? We can use that to find out.

In [None]:
print(category_numbers.get(7))

We can check that the sum of all of the probabilities is 1.

In [None]:
clas.predict("Beatles' latest album is soaring through the charts")[2].sum()
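
The probabilities sum to 1 by construction, assuming they come from a softmax over the model's raw scores, and the predicted class is simply the position of the largest one. A plain-Python sketch on made-up scores:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [0.2, 2.5, -1.0, 0.7]  # made-up scores for 4 classes
probs = softmax(toy_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print(predicted)
print(sum(probs))
```

In the notebook, the first tensor returned by predict corresponds to `predicted` and the second to `probs`.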

Some more testing and fun...

In [None]:
clas.predict("An underdog wins the worldcup 2-0")

In [None]:
category_numbers.get(18) #To know what category 18 belongs to, let us see...

It is pretty inconvenient to predict and look up the category manually each time. Let's make a function that does both, and try predicting something not related to Ireland. I'll take the NIFTY stock index, which has nothing to do with Ireland.

In [None]:
def pred_classes(text):
    print(category_numbers.get(int(clas.predict(text)[1])))    

In [None]:
pred_classes("NIFTY falls down by 100 rupees")

In [None]:
pred_classes("Eggs and cholestrol - is eating many eggs really unhealthy for your heart?")

In [None]:
pred_classes("We need to do something now, or else our planet is doomed")

In [None]:
pred_classes("10 ways to improve your kitchen")

In [None]:
pred_classes("A couple gets 10 years for killing a teddy bear")

> "Transfer learning will be the next drive of ML success" - Andrew Ng

References - 
1. [fastai text docs](https://docs.fast.ai/text.html)
2. [Towards Data Science - Machine Learning — Text Classification, Language Modelling using fast.ai by Javaid Nabi](https://towardsdatascience.com/machine-learning-text-classification-language-modelling-using-fast-ai-b1b334f2872d)
3. [Towards Data Science - Transfer Learning in NLP for Tweet Stance Classification by Prashanth Rao](https://towardsdatascience.com/transfer-learning-in-nlp-for-tweet-stance-classification-8ab014da8dde)
4. [Sylvain Gugger's excellent post on one cycle fit](https://sgugger.github.io/the-1cycle-policy.html)



