# **A Brief Introduction**
> Neural Networks are a prominent field within Artificial Intelligence today. They are widely used in various services to address both regression and classification problems. In classification tasks, the goal is to categorize observations into predefined classes. However, training models from scratch can be time-consuming and requires a significant amount of data.

To reduce the data requirements and potentially shorten training time, advanced techniques like transfer learning have been developed. ULMFiT [[1]](https://arxiv.org/abs/1801.06146) is one such inductive transfer learning method, introduced by Jeremy Howard and Sebastian Ruder. This project builds a classifier from a pretrained language model that is fine-tuned with the ULMFiT method.



## **Problem Statement**



> The given dataset contains Twitter posts from airline customers. The posts are labeled as 'positive', 'neutral', or 'negative', representing three possible classes. The goal of this project is to build a neural network model to classify future customer posts into these original classes. Since the posts in the dataset have ground-truth labels, supervised learning methods will be employed to address this problem.



## **General Knowledge**

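> Recurrent Neural Networks (RNNs) are a class of neural networks designed for sequential data. Because they carry a hidden state from one time step to the next, they are generally better suited to sequence analysis than classic Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs), whose relevant properties are listed below.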

### **Multi-Layer Perceptrons (MLPs)**
1. Each input is considered independently of past and future inputs.
2. They require fixed-size inputs.
3. They process the entire sequence as a single input, making it difficult to model the internal ordering of the data.

### **Convolutional Neural Networks (CNNs)**
1. CNNs are suitable for training on spatially structured data, such as images.
2. CNNs learn to derive semantically meaningful data representations by encoding spatial dependencies.

> As a result, RNNs, which are designed for processing sequential data, are better suited for analyzing Twitter posts generated by customers. Plain RNNs, however, suffer from exploding or vanishing gradients, so this project uses AWD-LSTM, the regularized LSTM described in [[2]](https://arxiv.org/abs/1708.02182). The gating mechanism of the LSTM mitigates the vanishing-gradient problem, and the regularization in AWD-LSTM (notably DropConnect applied to the hidden-to-hidden weights) reduces overfitting, which helps training produce a more robust model.
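
To make the role of ordering concrete, here is a minimal sketch in pure Python. It compares a plain sum of the inputs, which discards ordering, with a one-unit recurrent update. The weights `w_in` and `w_rec` are hypothetical toy values chosen for illustration and are not part of this project.

```python
import math


def order_insensitive_summary(sequence):
    """Summarize a sequence by summing its inputs, which ignores their order."""
    return sum(sequence)


def rnn_final_state(sequence, w_in=0.5, w_rec=0.9):
    """Run a one-unit recurrent update and return the final hidden state.

    w_in and w_rec are hypothetical toy weights (assumptions for illustration).
    """
    h = 0.0
    for x in sequence:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_in * x + w_rec * h)
    return h


forward = [1.0, 0.0, -1.0]
backward = list(reversed(forward))

# The sum cannot tell the two orderings apart, but the recurrent state can.
print(order_insensitive_summary(forward), order_insensitive_summary(backward))
print(rnn_final_state(forward), rnn_final_state(backward))
```

The two orderings produce the same sum but different final hidden states, which is the property that makes recurrent models a natural fit for text.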



## **Transfer Learning and the ULMFiT Approach**
> **Universal Language Model Fine-tuning for Text Classification**

>> ULMFiT consists of three stages, as illustrated in the original paper [[1]](https://arxiv.org/abs/1801.06146):
 <img src="https://humboldt-wi.github.io/blog/img/seminar/group11_peer_reviews/ulmfit.jpeg" width="70%" height="70%" />

>> The figure above shows the three stages of ULMFiT:

### 1. General-domain LM Pretraining
>>> The ULMFiT language model is pretrained on Merity’s Wikitext-103 dataset, a large preprocessed subset of English Wikipedia comprising 28,595 articles and 103 million words. The pretrained model is an AWD-LSTM language model with an embedding size of 400, **3 layers**, and 1150 hidden activations per layer. Since this project focuses on transfer learning, we load pretrained weights instead of training a language model from scratch.

### 2. LM Fine-tuning
>>> After obtaining the language model, transfer learning methods are applied to adapt it to the target data. In earlier approaches, only a single layer of weights (embeddings) was used for transfer learning, which barely impacted the deeper layers of the neural network. Modern neural networks typically have multiple layers, so the transfer of information across all layers is essential for accurate predictions.

### 3. Classifier Fine-tuning
>>> Fine-tuning the classifier for the target task is a critical step. Just as the language model is fine-tuned, the classifier must also undergo non-aggressive fine-tuning. Overly aggressive fine-tuning can lead to catastrophic forgetting, while being too cautious may result in slow convergence and potential overfitting of the model.

---

> According to the ULMFiT paper [[1]](https://arxiv.org/abs/1801.06146), fine-tuning relies on the following three techniques:

### 1. Discriminative Fine-tuning
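>>> The layers of a neural network capture different types of information, so they should be fine-tuned to different extents. ULMFiT therefore gives each layer its own learning rate: the last layer uses the base rate, and each lower layer uses the rate of the layer above it divided by 2.6.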
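
As a rough illustration of this rule, the sketch below computes one learning rate per layer. The function name and the example values (`top_lr=0.01`, four layers) are assumptions made for illustration. The divisor 2.6 is the value suggested in [1].

```python
def discriminative_learning_rates(top_lr, n_layers, decay=2.6):
    """Return one learning rate per layer, ordered from lowest to highest layer.

    The top layer gets top_lr, and each layer below it gets the rate of the
    layer above divided by `decay`.
    """
    return [top_lr / decay ** (n_layers - 1 - layer) for layer in range(n_layers)]


# Hypothetical example: four layers with a base rate of 0.01 for the top layer.
rates = discriminative_learning_rates(top_lr=0.01, n_layers=4)
for layer, lr in enumerate(rates):
    print(f"layer {layer}: lr = {lr:.6f}")
```

In fastai v1, per-layer-group rates are usually requested by passing a `slice(lower, upper)` as the learning rate to `fit_one_cycle` rather than by building such a list by hand.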

### 2. Slanted Triangular Learning Rates
>>> The goal is for the model to quickly converge to an appropriate region of the parameter space early in training and then refine its parameters. Keeping the learning rate constant throughout training is not the best way to achieve this, so Howard and Ruder introduce **slanted triangular learning rates**. This schedule linearly increases the learning rate over a short initial phase and then linearly decreases it over a much longer one.
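
The schedule can be sketched as a small function. This is a minimal sketch of the formula given in [1], using the default values suggested there (`cut_frac=0.1`, `ratio=32`, `lr_max=0.01`). The function name and the choice of 1000 training iterations are assumptions for illustration.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t under a slanted triangular schedule.

    Assumes total_steps * cut_frac >= 1, otherwise the warm-up phase is empty.
    """
    cut = math.floor(total_steps * cut_frac)  # Iteration at which the rate peaks
    if t < cut:
        p = t / cut  # Short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # Long linear decay
    # The rate moves between lr_max / ratio and lr_max.
    return lr_max * (1 + p * (ratio - 1)) / ratio


total_steps = 1000  # Hypothetical number of training iterations
schedule = [slanted_triangular_lr(t, total_steps) for t in range(total_steps)]
peak_step = max(range(total_steps), key=schedule.__getitem__)
print(peak_step, schedule[0], schedule[peak_step], schedule[-1])
```

The notebook itself calls fastai's `fit_one_cycle`, which follows a related warm-up-then-anneal policy rather than this exact formula.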

### 3. Gradual Unfreezing
>>> Fine-tuning all layers simultaneously is risky and could lead to catastrophic forgetting. To address this, Howard and Ruder propose **gradual unfreezing**. In this approach, the last layer is unfrozen first, because it contains the least general knowledge, and all unfrozen layers are fine-tuned for one epoch. The next lower frozen layer is then unfrozen, and the process is repeated until all layers are being fine-tuned and training converges in the final iteration.
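
The unfreezing order can be sketched as follows. The layer names are hypothetical placeholders, and the function only lists which layers would be trainable in each epoch. It does not train anything.

```python
def gradual_unfreezing_schedule(layer_names):
    """Yield, for each epoch, the list of layers that are trainable.

    Layers are ordered from lowest (closest to the input) to highest.
    """
    for n_unfrozen in range(1, len(layer_names) + 1):
        # Unfreeze one more layer from the top in every epoch.
        yield layer_names[-n_unfrozen:]


# Hypothetical layer names, used only for illustration.
layers = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier_head"]
for epoch, trainable in enumerate(gradual_unfreezing_schedule(layers), start=1):
    print(f"epoch {epoch}: train {trainable}")
```

In the implementation part of this notebook, the same idea is expressed with `learn.freeze_to(-1)`, `learn.freeze_to(-2)`, and finally `learn.unfreeze()`.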


# **Implementation**


## **Import Libraries**



### **Updates and Imports**

In [1]:
!curl -s https://course.fast.ai/setup/colab | bash #Update the fast.ai version



In [2]:
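import math

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

# fastai v1 star imports expose TextLMDataBunch, language_model_learner, AWD_LSTM, and related names.
from fastai import *
from fastai.text import *
from fastai.callbacks import *  # Provides SaveModelCallback.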



## **Import Data**


In [None]:
from google.colab import drive  # Mount Google Drive in order to load the data.
drive.mount('/content/gdrive', force_remount=True)

In [None]:
tweets = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/data/Tweets.csv')  # Read the data from the CSV file.
tweets.head(5)  # Show the first 5 rows of the data.

In [None]:
print(tweets.shape)  # Shape of the data: 14640 observations with 15 columns each.

In [None]:
# Keep only the label column (airline_sentiment) and the tweet text.
df = pd.DataFrame({'airline_sentiment': tweets['airline_sentiment'], 'text': tweets['text']})
df = df.reset_index(drop=True)
df.shape

In [None]:
df['airline_sentiment'].value_counts()  # Number of tweets per sentiment class.

In [None]:
df.head(5) #Print 5 lines of the data

## **Preprocessing Data**

In [None]:
#df['text'] = df['text'].str.replace("[^a-zA-Z]", " ", regex=True) #Keep only letters (disabled, because emojis can carry a message of their own).
#df.head(5) #Preview the data after cleaning the unstructured text.

[**Influence of Stop-Words Removal on Sequence Patterns Identification within Comparable Corpora**](https://link.springer.com/chapter/10.1007/978-3-319-01466-1_6)

---

> **Stopwords** can occupy unnecessary space in our dataset and consume valuable processing time. Additionally, search engines are often programmed to ignore them.



In [None]:
nltk.download('stopwords')  # NLTK ships the English stop-word list that is filtered out below.
stop_words = set(stopwords.words('english'))  # A set makes the membership test fast.

# Split each tweet on whitespace, drop the stop words, and join the remaining tokens again.
# The comparison is case-sensitive, so capitalized stop words such as "The" are kept.
df['text'] = df['text'].apply(
    lambda text: ' '.join(word for word in text.split() if word not in stop_words)
)

In [None]:
# Split the dataset into training (70%) and validation (30%) sets, stratified by label so that
# both sets keep the same class proportions. There is no proven rule for the split ratio.
df_trn, df_val = train_test_split(df, stratify=df['airline_sentiment'], test_size=0.3, random_state=12)
print('This is the shape of the training data: ', df_trn.shape)
print('This is the shape of the validation data: ', df_val.shape)

In [None]:
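# TextLMDataBunch automatically applies preprocessing steps such as tokenization and numericalization.
data_lm = TextLMDataBunch.from_df(train_df=df_trn, valid_df=df_val, path="")
# The classifier reuses the language model's vocabulary so that token ids match the fine-tuned encoder.
data_clas = TextClasDataBunch.from_df(path="", train_df=df_trn, valid_df=df_val, vocab=data_lm.train_ds.vocab, bs=64)
labels = data_clas.classes  # Class names, indexed by the classifier's output index.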

## **Language Model Training**


### [**Language Models**](https://en.wikipedia.org/wiki/Language_model)
---

> **Language modeling** plays a pivotal role in modern NLP applications. A language model assigns probabilities to sequences of words, typically by predicting the next word from the preceding ones, which turns qualitative text into quantitative information that a machine can process. This is what allows humans to communicate with machines in a way that somewhat resembles natural human interaction, albeit within certain limitations.
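
As a minimal illustration of next-word prediction, the sketch below builds a count-based bigram model. This is a deliberately simple stand-in and not the neural AWD-LSTM model used in this project. The toy corpus and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows another across the corpus."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for previous, current in zip(tokens, tokens[1:]):
            counts[previous][current] += 1
    return counts


def next_word_probabilities(counts, previous):
    """Estimate P(next word | previous word) from the bigram counts."""
    total = sum(counts[previous].values())
    return {word: n / total for word, n in counts[previous].items()}


# Hypothetical toy corpus, not taken from the airline dataset.
corpus = [
    "the flight was late",
    "the flight was great",
    "the crew was great",
]
model = train_bigram_model(corpus)
print(next_word_probabilities(model, "was"))
```

A neural language model plays the same role, but it conditions on the whole preceding context and shares information between similar words through learned embeddings.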


In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=True)  # Initialize the language model with pretrained weights.
learn.lr_find(start_lr=1e-8, end_lr=1e2)  # Sweep learning rates to find one that gives stable training and fast convergence.
learn.recorder.plot(suggestion=True)  # Plot the loss against the learning rate and mark the suggested value.

learning_rate = learn.recorder.min_grad_lr  # Store the suggested learning rate.
learning_rate = learning_rate + learning_rate / 2  # Increase the suggested rate by 50%.

In [None]:
learn.freeze_to(-1)  # Freeze all weights except those of the last layer group.
# Fine-tune for 10 epochs with the chosen learning rate and save the best model as "best_lm".
learn.fit_one_cycle(10, learning_rate, callbacks=[SaveModelCallback(learn, name="best_lm")], moms=(0.8, 0.7))

In [None]:
learn.load('best_lm') #Load the best model

In [None]:
learn.unfreeze()  # Make sure that all weights are trainable.
learn.freeze_to(-2)  # Freeze all weights except those of the last 2 layer groups.
learn.lr_find()
learn.recorder.plot(suggestion=True)
learning_rate = learn.recorder.min_grad_lr
learning_rate = learning_rate + learning_rate / 2

In [None]:
learn.unfreeze()  # Unfreeze the weights of all layers.
learn.fit_one_cycle(10, learning_rate, callbacks=[SaveModelCallback(learn, name="best_lm")], moms=(0.8, 0.7))

In [None]:
learn.load("best_lm")  # Load the best model from the previous training run.

In [None]:
learn.save_encoder('fine_tuned_enc')  # Save the fine-tuned encoder so that the classifier can reuse it.

## **Initialization and Training of the Classifier**

In [None]:
data_clas.show_batch()  # Preview the data that will be fed to the classifier.

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)  # Initialize the classifier.
learn.load_encoder('fine_tuned_enc')  # Load the encoder of the language model that was fine-tuned above.

In [None]:
learn.lr_find()
learn.recorder.plot(suggestion=True)
learning_rate = learn.recorder.min_grad_lr
learning_rate = learning_rate + learning_rate/2

In [None]:
learn.freeze_to(-1)  # Gradual unfreezing: train only the last layer group of the classifier.
learn.fit_one_cycle(1, learning_rate,moms=(0.8,0.7))
learn.recorder.plot_losses()

In [None]:
learn.freeze_to(-2)  # Gradual unfreezing: train the last 2 layer groups of the classifier.
learn.fit_one_cycle(2,learning_rate, moms=(0.8, 0.7))
learn.recorder.plot_losses()

In [None]:
learn.unfreeze()  # Unfreeze all the weights of the network.
learn.lr_find()  # Search for a suitable learning rate.
learn.recorder.plot(suggestion=True)
learning_rate = learn.recorder.min_grad_lr
learning_rate = learning_rate + learning_rate / 2  # Increase the suggested rate by 50%.

In [None]:
learn.fit_one_cycle(3, learning_rate)

In [None]:
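learn.show_results(rows=1)  # Show one example with its predicted label (the first positional argument is the dataset type, not the row count).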


# **Inference**

In [None]:
pred_clas, pred_idx, out = learn.predict('This flight was not good, I want my money back!')
labels[try_int(pred_idx)]

In [None]:
pred_clas, pred_idx, out = learn.predict('Great flight, see you again, I would love to flight again with you!')
labels[try_int(pred_idx)]

In [None]:
pred_clas, pred_idx, out = learn.predict('Thank you so much <3 love you!')
labels[try_int(pred_idx)]

In [None]:
pred_clas, pred_idx, out = learn.predict('NO NO NO')
labels[try_int(pred_idx)]

In [None]:
pred_clas, pred_idx, out = learn.predict('I love you!')
labels[try_int(pred_idx)]

In [None]:
pred_clas, pred_idx, out = learn.predict('I <3 you!')
labels[try_int(pred_idx)]

In [None]:
pred_clas, pred_idx, out = learn.predict('I hate you!')
labels[try_int(pred_idx)]

In [None]:
pred_clas, pred_idx, out = learn.predict('Such a bad flight')
labels[try_int(pred_idx)]

In [None]:
pred_clas, pred_idx, out = learn.predict('What does the fox say?')
labels[try_int(pred_idx)]