# **A brief introduction**
> Neural networks are currently one of the most active fields of artificial intelligence, and many services deploy them to solve regression and classification problems. In a classification problem, the objective is to assign each observation to one of a set of possible classes. Training a model from scratch is time-consuming and requires a lot of data.
To reduce the amount of data needed, and possibly the training time, techniques such as transfer learning have been developed. ULMFiT[[1]](https://arxiv.org/abs/1801.06146) is an inductive transfer learning method developed by Jeremy Howard and Sebastian Ruder. This project aims to build a classifier from a pre-trained model that is fine-tuned with the ULMFiT transfer learning method.


## **Problem Statement**



> The given dataset contains Twitter posts from airline customers. Each post is labeled as 'positive', 'neutral' or 'negative', so there are three possible classes. This project aims to build a neural network model that classifies future customer posts into these classes. Because the posts come with ground-truth labels, supervised learning methods are used for this problem.


## **General Knowledge**

> A recurrent neural network (RNN) is a class of neural network that, by design, is generally better suited to sequence analysis than the classic multilayer perceptron (MLP) or convolutional neural network (CNN).

> **Multi-Layer Perceptrons (MLPs)**
> 1.   Each input is considered independently of past and future inputs.
> 2.   It requires fixed-size inputs.
> 3.   It accepts the entire sequence as input; modeling the internal ordering of the data is cumbersome.

> **Convolutional Neural Networks (CNNs)**
> 1.   CNNs are suitable for training on spatially structured data, e.g.,  images.
> 2.   CNNs learn to derive semantically meaningful data representations encoding spatial dependencies.

> Therefore, RNNs, which are designed for processing sequential data, are more suitable for the Twitter posts written by customers. However, a plain RNN suffers from problems such as exploding gradients. This project therefore uses the AWD-LSTM approach described in [[2]](https://arxiv.org/abs/1708.02182), whose regularization techniques help prevent overfitting, so that training yields a more robust model.
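
> To make the role of the recurrence concrete, the following minimal sketch runs a single-unit recurrent cell over a short sequence. The function name `rnn_forward` and the weight values are illustrative assumptions and are not part of this project. The only point is that the hidden state carries information from earlier inputs, so the same values in a different order give a different result.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit recurrent cell over a sequence of numbers.

    The weights are arbitrary illustrative values, not trained parameters.
    """
    hidden = 0.0
    for x in inputs:
        # The new state depends on the current input AND the previous state,
        # so the order of the inputs influences the final state.
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
    return hidden


order_a = rnn_forward([1.0, 2.0])
order_b = rnn_forward([2.0, 1.0])
print(order_a, order_b)  # the two final states differ
```

> An MLP that sums or concatenates the inputs without such a state has no built-in notion of which input came first.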


## **Transfer Learning and ULMfit approach**
> **Universal Language Model Fine-tuning for Text Classification**

>> ULMFiT consists of three stages (the image can be found in the original paper[[1]](https://arxiv.org/abs/1801.06146)):
 <img src="https://humboldt-wi.github.io/blog/img/seminar/group11_peer_reviews/ulmfit.jpeg" width="70%" height="70%" />

>> The figure above describes the stages of the ULMFiT method, which are:

>> 1. General-domain LM pretraining.
>>> The ULMFiT model is pretrained on Merity's WikiText-103 dataset, a large pre-processed subset of English Wikipedia consisting of 28,595 articles and 103 million words. The language model is an AWD-LSTM with an embedding size of 400, **3 layers** and 1150 hidden activations per layer. Since this project relies on transfer learning, we do not train a language model from scratch.

>> 2. LM fine-tuning.
>>> After acquiring the language model, we apply transfer learning in order to use it on the target data. In the past, transfer learning in NLP usually meant transferring only a single layer of weights (the embeddings). However, these weights only reach the first layer of the network. In practice, neural networks usually contain more than one layer, so the information has to be transferred to the other layers as well in order to make accurate predictions.

>> 3. Classifier fine-tuning. 
>>> Fine-tuning the target task classifier is a crucial part of the method. As with the language model, the classifier must be fine-tuned carefully: overly aggressive fine-tuning causes catastrophic forgetting, while overly cautious fine-tuning leads to slow convergence and, as a result, overfitting.



> According to the ULMFiT paper, fine-tuning relies on three techniques:
> 1. Discriminative fine-tuning.
>> The layers of a neural network differ, and each layer captures a different kind of information. ULMFiT therefore fine-tunes each layer with a different learning rate.
> 2. Slanted triangular learning rates.
>> According to the paper, the model should quickly converge to a suitable region of the parameter space at the beginning of training and then refine its parameters. Thus, instead of keeping the same learning rate throughout training, Howard and Ruder use slanted triangular learning rates, which first increase the learning rate linearly and then decay it linearly.
> 3. Gradual unfreezing.
>> Fine-tuning all the layers at once is risky and might cause catastrophic forgetting. Hence, Howard and Ruder use gradual unfreezing. According to the ULMFiT method: We first unfreeze the last layer and fine-tune all unfrozen layers for one epoch. We then unfreeze the next lower frozen layer and repeat, until we finetune all layers until convergence at the last iteration.
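
> The first two techniques can be written down in a few lines. The sketch below follows the formulas given in the ULMFiT paper, with its reported defaults (`cut_frac=0.1`, `ratio=32`, a maximum learning rate of `0.01`, and a factor of `2.6` between neighbouring layers). The function names are illustrative assumptions; this is not the code that fastai runs internally.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t under a slanted triangular schedule."""
    cut = math.floor(total_steps * cut_frac)  # iteration where the increase ends
    if cut < 1:
        raise ValueError("total_steps * cut_frac must be at least 1")
    if t < cut:
        p = t / cut  # short linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(top_lr, n_layers, factor=2.6):
    """Per-layer learning rates, ordered from the lowest to the top layer.

    The top layer gets top_lr; each lower layer gets the rate of the layer
    above it divided by factor.
    """
    return [top_lr / factor ** (n_layers - 1 - i) for i in range(n_layers)]


for step in (0, 10, 50, 100):
    print(step, slanted_triangular_lr(step, total_steps=100))
print(discriminative_lrs(0.01, n_layers=3))
```

> With 100 iterations the rate rises from `lr_max / ratio` to `lr_max` during the first 10 iterations and then decays back to `lr_max / ratio` over the remaining 90.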

# **Implementation**


## **Import Libraries**



### **Updates and Imports**

In [None]:
!curl -s https://course.fast.ai/setup/colab | bash #Update the fast.ai version


Updating fastai...
ERROR: Operation cancelled by user
Done.


In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from fastai import *
from fastai.text import *
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
import math
from fastai.callbacks import *



## **Import data.**


In [None]:
from google.colab import drive  # Mount Google Drive in order to load the data.
drive.mount('/content/gdrive', force_remount=True)

In [None]:
tweets = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/data/Tweets.csv') #Reading data from csv
tweets.head(5) #Print the 5 first rows of the data.

In [None]:
print(tweets.shape)  # Shape of the data: 14640 observations with 15 columns each.

In [None]:
df = pd.DataFrame({'airline_sentiment': tweets['airline_sentiment'], 'text': tweets['text']})  # Keep only the airline sentiment (the label) and the tweet text.
df = df.reset_index(drop = True)
df.shape

In [None]:
df['airline_sentiment'].value_counts() #Observing the data

In [None]:
df.head(5) #Print 5 lines of the data

## **Preprocessing Data.**

In [None]:
#df['text'] = df['text'].str.replace("[^a-zA-Z]", " ") #Keep only letters; disabled because emojis can carry a message of their own.
#df.head(5) #Preview the data after cleaning the unstructured text.

[**Influence of Stop-Words Removal on Sequence Patterns Identification within Comparable Corpora**](https://link.springer.com/chapter/10.1007/978-3-319-01466-1_6)

---




> **Stopwords** take up space in the dataset and cost valuable processing time; search engines are also programmed to ignore them.



In [None]:
nltk.download('stopwords')  # The NLTK package provides the stop-word list used for filtering.
stop_words = set(stopwords.words('english'))  # A set makes membership tests fast.

def remove_stop_words(text):
    """Drop stop words from a tweet; tokens are lowercased for the comparison because the NLTK list is lowercase."""
    return ' '.join(token for token in text.split() if token.lower() not in stop_words)

df['text'] = df['text'].apply(remove_stop_words)
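
> The effect of the filtering step is easiest to see on a single sentence. The sketch below uses a tiny hand-written stop-word set instead of the NLTK list, so it runs without any download. The set `SAMPLE_STOP_WORDS`, the function `filter_stop_words` and the example sentence are assumptions made for illustration. The sketch compares lowercased tokens, so capitalized stop words such as "The" are dropped as well.

```python
# A tiny illustrative stop-word set; the real NLTK list is much longer.
SAMPLE_STOP_WORDS = {"the", "was", "a", "to", "and", "i"}


def filter_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Split on whitespace, drop stop words, and join the rest back together."""
    kept = [token for token in text.split() if token.lower() not in stop_words]
    return " ".join(kept)


cleaned = filter_stop_words("The flight was delayed and I missed a connection")
print(cleaned)  # flight delayed missed connection
```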

In [None]:
df_trn, df_val = train_test_split(df, stratify=df['airline_sentiment'], test_size=0.3, random_state=12)  # Split the dataset into train (70%) and validation (30%) sets. There is no proven rule for choosing the split ratio.
print('This is the shape of the training data: ',df_trn.shape)
print('This is the shape of the validation data: ',df_val.shape)
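
> The `stratify` argument keeps the class proportions the same in both parts of the split. The sketch below is a simplified pure-Python stand-in that illustrates the idea on made-up labels. It is not scikit-learn's implementation, and the function name `stratified_split` is an assumption.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.3, seed=12):
    """Return (train_indices, valid_indices) with class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, valid = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the validation set.
        n_valid = round(len(indices) * test_size)
        valid.extend(indices[:n_valid])
        train.extend(indices[n_valid:])
    return sorted(train), sorted(valid)


# Made-up, imbalanced labels for illustration only.
toy_labels = ["negative"] * 60 + ["neutral"] * 20 + ["positive"] * 20
train_idx, valid_idx = stratified_split(toy_labels)
valid_counts = Counter(toy_labels[i] for i in valid_idx)
print(valid_counts)
```

> Each class contributes 30% of its examples to the validation set, so an imbalanced label distribution is reflected in both parts.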

In [None]:
data_lm = TextLMDataBunch.from_df(train_df = df_trn, valid_df = df_val, path = "") #TextLMDataBunch automatically performs some preprocessing steps, such as tokenization.
data_clas = TextClasDataBunch.from_df(path = "", train_df = df_trn, valid_df = df_val, vocab=data_lm.train_ds.vocab, bs=64)
labels = data_clas.classes

## **Language Model Training.**


### [**Language Models**](https://en.wikipedia.org/wiki/Language_model)
---


>  **Language modeling** is crucial in modern NLP applications. A language model assigns probabilities to sequences of words, which turns qualitative information (text) into quantitative information that a machine can process. This allows people to communicate with machines, to a limited extent, as they do with each other.
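
> As a minimal illustration of what a language model computes, the sketch below builds a count-based bigram model that estimates the probability of the next word given the previous one. It is far simpler than the AWD-LSTM used in this project; the three-sentence corpus and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()  # <s> marks the sentence start
        for previous, current in zip(tokens, tokens[1:]):
            counts[previous][current] += 1
    return counts


def next_word_probabilities(counts, previous):
    """Turn the raw counts for one context word into probabilities."""
    followers = counts.get(previous, Counter())
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}


toy_corpus = ["the flight was late", "the flight was great", "the crew was great"]
bigram_model = train_bigram_model(toy_corpus)
after_was = next_word_probabilities(bigram_model, "was")
print(after_was)
```

> In this toy corpus "was" is followed by "great" twice and by "late" once, so the model assigns them probabilities of 2/3 and 1/3.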


In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=True)  # Initialization of the language model.
learn.lr_find(start_lr=1e-8, end_lr=1e2)  # Search for a learning rate that gives stable training and fast convergence.
learn.recorder.plot(suggestion=True)  # Plot the loss against the learning rate.

learning_rate = learn.recorder.min_grad_lr  # Store the suggested learning rate.
learning_rate = learning_rate + learning_rate/2  # Increase the suggested learning rate by 50%.
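
> The suggested value is assumed here to be the learning rate at which the recorded loss falls most steeply. The sketch below illustrates that selection rule on made-up numbers; it is not fastai's internal code, and the function name and the loss values are assumptions.

```python
def steepest_descent_lr(learning_rates, losses):
    """Return the learning rate at which the loss decreases fastest."""
    # Finite-difference slope between consecutive recorded losses.
    slopes = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    steepest = min(range(len(slopes)), key=lambda i: slopes[i])
    return learning_rates[steepest]


# Synthetic learning-rate sweep for illustration only.
sweep_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
sweep_losses = [5.0, 4.9, 4.5, 3.6, 3.4, 6.0]

suggested = steepest_descent_lr(sweep_lrs, sweep_losses)
adjusted = suggested + suggested / 2  # the same 50% increase as in the cell above
print(suggested, adjusted)
```

> In this synthetic sweep the loss drops fastest between `1e-3` and `1e-2`, so `1e-3` is selected before the 50% increase is applied.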

In [None]:
learn.freeze_to(-1)  # Freeze all weights except those of the last layer group.
learn.fit_one_cycle(10, learning_rate, callbacks=[SaveModelCallback(learn, name="best_lm")], moms=(0.8, 0.7))  # Fine-tune with the chosen learning rate and save the best model.

In [None]:
learn.load('best_lm') #Load the best model

In [None]:
learn.unfreeze()  # Make sure that the weights are adjustable.
learn.freeze_to(-2)  # Freeze all weights except those of the last 2 layer groups.
learn.lr_find()
learn.recorder.plot(suggestion=True)
learning_rate = learn.recorder.min_grad_lr
learning_rate = learning_rate + learning_rate/2

In [None]:
learn.unfreeze()  # Unfreeze the weights of all layers.
learn.fit_one_cycle(10, learning_rate,callbacks=[SaveModelCallback(learn, name="best_lm")],moms=(0.8,0.7))

In [None]:
learn.load("best_lm")  # Load the best model from the previous training run.

In [None]:
learn.save_encoder('fine_tuned_enc')  # Save the encoder of the fine-tuned language model so the classifier can reuse it.

## **Initialization and Training of the classifier.**

In [None]:
data_clas.show_batch()  # Preview the data that will be fed to the classifier.

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)#Initialization of the classifier.
learn.load_encoder('fine_tuned_enc')#Load the Language Model that we trained above.

In [None]:
learn.lr_find()
learn.recorder.plot(suggestion=True)
learning_rate = learn.recorder.min_grad_lr
learning_rate = learning_rate + learning_rate/2

In [None]:
learn.freeze_to(-1)  # Gradual unfreezing: train only the last layer group of the classifier.
learn.fit_one_cycle(1, learning_rate,moms=(0.8,0.7))
learn.recorder.plot_losses()

In [None]:
learn.freeze_to(-2)  # Gradual unfreezing: train the last 2 layer groups of the classifier.
learn.fit_one_cycle(2,learning_rate, moms=(0.8, 0.7))
learn.recorder.plot_losses()
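
> The sketch below shows which layer groups remain trainable at each stage of the gradual unfreezing used in these cells. It assumes that `freeze_to(n)` freezes every layer group before index `n` and that unfreezing everything corresponds to index `0`. The group names are hypothetical and only stand in for the real layer groups of the classifier.

```python
# Hypothetical layer-group names, ordered from input to output.
LAYER_GROUPS = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier_head"]


def trainable_after_freeze_to(layer_groups, n):
    """Return the groups left trainable when every group before index n is frozen."""
    return layer_groups[n:]


for n in (-1, -2, 0):
    print(n, trainable_after_freeze_to(LAYER_GROUPS, n))
```

> Each stage adds one more group from the top of the network, and the final stage trains all groups together.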

In [None]:
learn.unfreeze()#Unfreeze all the weights of the network
learn.lr_find()#Find a possible optimal learning rate
learn.recorder.plot(suggestion=True)
learning_rate = learn.recorder.min_grad_lr
learning_rate = learning_rate + learning_rate/2 #Slightly increase the learning rate

In [None]:
learn.fit_one_cycle(3, learning_rate)

In [None]:
learn.show_results(rows=1)  # Show one example with its predicted and actual label.


# **Inference**

In [None]:
pred_clas, pred_idx, out = learn.predict('This flight was not good, I want my money back!')
labels[try_int(pred_idx)]

In [None]:
pred_clas, pred_idx, out = learn.predict('Great flight, see you again, I would love to flight again with you!')
labels[try_int(pred_idx)]

In [None]:
pred_clas,pred_idx,out=learn.predict('Thank you so much <3 love you!')
labels[try_int(pred_idx)]

In [None]:
pred_clas,pred_idx,out= learn.predict('NO NO NO')
labels[try_int(pred_idx)]

In [None]:
pred_clas,pred_idx,out=learn.predict('I love you!')
labels[try_int(pred_idx)]

In [None]:
pred_clas,pred_idx,out=learn.predict('I <3 you!')
labels[try_int(pred_idx)]

In [None]:
pred_clas,pred_idx,out=learn.predict('I hate you!')
labels[try_int(pred_idx)]

In [None]:
pred_clas,pred_idx,out=learn.predict('Such a bad flight')
labels[try_int(pred_idx)]

In [None]:
pred_clas,pred_idx,out=learn.predict('What does the fox say?')
labels[try_int(pred_idx)]
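
> The cells above report only the predicted label. The third value returned by `learn.predict` (`out` above) is assumed here to hold one probability per class, in the same order as `labels`. Under that assumption, the following sketch shows how a label can be reported together with its probability. The probabilities below are made up for illustration, and the function name is an assumption.

```python
def label_with_confidence(probabilities, class_names):
    """Return the most probable class name together with its probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best_index], probabilities[best_index]


# Made-up values; real ones would come from `out` and `labels`.
example_classes = ["negative", "neutral", "positive"]
example_probabilities = [0.81, 0.14, 0.05]

result = label_with_confidence(example_probabilities, example_classes)
print(result)  # ('negative', 0.81)
```

> Reporting the probability makes it easier to spot inputs, such as off-topic sentences, for which the model is not confident in any class.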