# Suggestion Classification Using ULMFiT Transfer Learning Approach
### Let us see if this classifies "MIDAS should hire me as a research intern!" as a suggestion!

In [70]:
#Import libraries

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import text_to_word_sequence
from nltk.tokenize import word_tokenize, WhitespaceTokenizer, TweetTokenizer, sent_tokenize
import nltk
from fastai import *
from fastai.text import *
import string
import tensorflow as tf
from datetime import datetime
import os
import time  #imported once, as a module (the duplicate and conflicting imports were removed)
import re
from sklearn.metrics import accuracy_score, f1_score

#### Stop-words are NOT removed as words like "should", "could", etc. are key suggestion indicators

In [2]:
#import nltk.data
#sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [3]:
#nltk.download('stopwords')
#from nltk.corpus import stopwords 
#stop_words = stopwords.words('english')

In [4]:
#stop_words #STOP WORDS SHOULD NOT BE REMOVED IN THIS PROBLEM. WHY? For example, 'should' is an important word 
            #which indicates "suggestion"
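
To make the point concrete, here is a minimal sketch of what stop-word removal would do to this task. The stop-word list is a small hand-written one for illustration only (it is not NLTK's list), and `remove_stop_words` is a hypothetical helper.

```python
# Hypothetical stop-word list for illustration only (not NLTK's list).
STOP_WORDS = {"the", "a", "should", "could", "would", "be", "is", "to"}

def remove_stop_words(sentence):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

suggestion = "The app should be faster"   # a suggestion
statement = "The app is faster"           # a plain statement

filtered_suggestion = remove_stop_words(suggestion)
filtered_statement = remove_stop_words(statement)
print(filtered_suggestion)
print(filtered_statement)
```

After filtering, both sentences reduce to the same tokens, so the cue that separated the suggestion from the statement is gone.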

#### Load and Explore Training Data, downloaded originally as "V1.4_Training.csv"

In [5]:
df = pd.read_csv('suggestion_training_data.csv').iloc[:,:3]

In [6]:
df.head()

Unnamed: 0,id,suggestion,text
0,663_3,1,"""Please enable removing language code from the..."
1,663_4,0,"""Note: in your .csproj file, there is a Suppor..."
2,664_1,0,"""Wich means the new version not fully replaced..."
3,664_2,0,"""Some of my users will still receive the old x..."
4,664_3,0,"""The store randomly gives the old xap or the n..."


In [7]:
len(df)

8499

In [8]:
df = df.dropna()

In [9]:
#data has no nan rows
len(df)

8499

In [10]:
df = df.drop(['id'],axis=1)
df = df.rename(index=str, columns={"suggestion": "label"})

In [11]:
df.head().style

Unnamed: 0,label,text
0,1,"""Please enable removing language code from the Dev Center ""language history"" For example if you ever selected ""ru"" and ""ru-ru"" laguages and you published this xap to the Store then it causes Tile localization to show the en-us(default) tile localization which is bad."""
1,0,"""Note: in your .csproj file, there is a SupportedCultures entry like this: de-DE;ru;ru-RU When I removed the ""ru"" language code and published my new xap version, the old xap version still remains in the Store with ""Replaced and unpublished""."""
2,0,"""Wich means the new version not fully replaced the old version and this causes me very serious problems: 1."""
3,0,"""Some of my users will still receive the old xap version of my app."""
4,0,"""The store randomly gives the old xap or the new xap version of my app."""


#### The dataset is clearly imbalanced

In [13]:
df['label'].value_counts()

0    6414
1    2085
Name: label, dtype: int64
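
A quick way to quantify the imbalance is to compute the share of the majority class, which is also the accuracy of a classifier that always predicts "not a suggestion". The sketch below only reuses the two counts printed above.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 6414, 1: 2085}

total = sum(counts.values())
majority_share = max(counts.values()) / total   # accuracy of always predicting the majority class
imbalance_ratio = counts[0] / counts[1]         # non-suggestions per suggestion

print(f"total rows: {total}")
print(f"majority-class baseline accuracy: {majority_share:.3f}")
print(f"imbalance ratio (0:1): {imbalance_ratio:.2f}")
```

Any validation accuracy measured on data with this class mix should be read against that baseline.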

## ULMFiT Transfer Learning Paper Overview 

> [Paper Link](https://arxiv.org/abs/1801.06146)
> Authors : **Jeremy Howard, Sebastian Ruder**

![ULMFiT training stages](https://raw.githubusercontent.com/ritzdevp/MachineLearning/master/arch.png)

The model uses the AWD-LSTM architecture ([Merity et al., 2017a](https://arxiv.org/abs/1708.02182)), a multi-layer LSTM regularized with tuned dropout. For classification, the final layer is a dense layer with a softmax output. 

ULMFiT consists of three stages.

 1. The Language Model (LM) is trained on a general domain corpus to capture general features of the language in different layers.
 2. The full LM is fine-tuned on target task data.
 3. The classifier is fine-tuned on the target task using techniques proposed by the authors: *gradual unfreezing*, *discriminative fine-tuning* and *slanted triangular learning rates*. 

> In the figure above, shaded layers are being unfrozen and black layers are frozen.
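
Two of the techniques named in stage 3 are simple formulas, sketched below in plain Python. The formulas and default values (`cut_frac=0.1`, `ratio=32`, `eta_max=0.01`, a per-layer factor of 2.6) follow the ULMFiT paper; the function names are hypothetical, and this is not how fastai implements its schedules. The sketch assumes `total_steps * cut_frac >= 1`.

```python
import math

def slanted_triangular_lr(t, total_steps, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t, following the STLR formula in the ULMFiT paper."""
    cut = math.floor(total_steps * cut_frac)            # step at which the rate peaks
    if t < cut:
        p = t / cut                                     # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return eta_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(eta_last, n_layers, factor=2.6):
    """Per-layer rates: each lower layer uses the rate of the layer above divided by `factor`."""
    return [eta_last / factor ** (n_layers - 1 - i) for i in range(n_layers)]

# The schedule starts at eta_max / ratio, peaks at eta_max, then decays back.
for step in (0, 5, 10, 55, 100):
    print(step, slanted_triangular_lr(step, total_steps=100))

# Lowest layer first, last layer (closest to the output) last.
print(discriminative_lrs(0.01, n_layers=3))
```

The lower layers get smaller rates because they hold the most general language features, which should change the least during fine-tuning.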



### Let us go for two approaches :
#### 1. Train and predict for the dataset as it is.
#### 2. Train and predict after downsampling class '0'.

In [14]:
#Splitting training data into training and validation set

valid_pct = 0.05 #fraction of rows held out for validation

#splitting: the first `cut` rows become the validation set (no shuffling)
cut = int(valid_pct * len(df)) + 1
train_df, valid_df = df[cut:], df[:cut]

#converting data into DataBunch, a data format compatible with fastai.text.data
#NOTE : data_lm is for language model learner
data_lm = TextLMDataBunch.from_df('data', train_df, valid_df, text_cols='text')
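
The split above takes the first `cut` rows as the validation set. If the file is ordered in some way (the ids suggest sentences from the same post sit next to each other), those rows may not be representative. A possible alternative, sketched here with a hypothetical helper `shuffled_split_indices`, is to shuffle row positions with a fixed seed before cutting.

```python
import random

def shuffled_split_indices(n_rows, valid_pct=0.05, seed=42):
    """Return (train_idx, valid_idx) lists of row positions after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(valid_pct * n_rows) + 1      # same cut rule as the cell above
    return indices[cut:], indices[:cut]

train_idx, valid_idx = shuffled_split_indices(8499)   # 8499 = len(df) in this notebook
print(len(train_idx), len(valid_idx))
```

With pandas, the two frames would then be `df.iloc[train_idx]` and `df.iloc[valid_idx]`.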

In [15]:
data_lm.show_batch()

idx,text
0,"xxbos "" xxmaj this capability was essential for certain low - level tasks like native crash investigation , debugging stack xxunk scenarios , deep optimization with inline assembly ( e. g. for xxup xxunk ) and such . "" xxbos "" xxmaj add integration with xxmaj xxunk "" xxbos "" xxmaj default tags would also be helpful -- even better , configurable content keyword xxunk would be great ! """
1,"and xxmaj silverlight but do n't realize it right away , and end up frustrating myself until i do . i see no reason to have 3 variations that are so similar , especially with the current efforts to make things universal . xxbos "" xxmaj the "" trial "" is a fantastic concept , but it is xxunk by this xxunk "" xxbos "" xxmaj consider the following situation"
2,""" xxbos "" xxmaj hyper - v is a powerful xxup vm management tool , and it includes the ability to create new images and manage snapshots within them , allowing developers to save the current state of the xxup vm ( e.g. , to manage particular testing scenarios such as upgrading the app when a good amount of custom data is in it , either for one app or"
3,"script > < ! "" xxbos "" xxmaj seriously to be xxunk we xxup lose a lot of features with transition of xxup wpf - > xxup uwp ... xxunk xxmaj xxunk etc ... "" xxbos "" xxmaj or at least make the browser extension useful ( xxmaj firefox add - on has no use ... ) "" "" xxbos "" xxmaj one can read more here : http :"
4,"adding a new item "" ) . "" xxbos "" xxmaj possible solution : xxmaj route class xxunk xxunk . "" xxbos i suggest to create another option that xxunk "" "" lost of signal "" "" and "" "" bad signal "" "" . xxmaj create an infinite loop that xxunk the connection in a random time . xxmaj from 1 to 10 sec , for example . xxbos"
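
The `xx`-prefixed tokens in the batch above are markers added by fastai's tokenizer: `xxbos` marks the beginning of a text, `xxmaj` means the next word started with a capital letter, `xxup` means it was written in all capitals, and `xxunk` replaces an out-of-vocabulary word. The sketch below is a deliberately simplified imitation of the case-marking rules, not fastai's implementation; `mark_case` is a hypothetical helper.

```python
def mark_case(tokens):
    """Lower-case tokens, inserting xxup / xxmaj markers so case information is kept."""
    out = ["xxbos"]                          # beginning-of-text marker
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += ["xxup", tok.lower()]     # all-caps word
        elif tok[:1].isupper():
            out += ["xxmaj", tok.lower()]    # capitalized word
        else:
            out.append(tok)
    return out

marked = mark_case(["Hyper", "-", "v", "is", "a", "powerful", "VM", "management", "tool"])
print(" ".join(marked))
```

Lower-casing with markers keeps the vocabulary small while still telling the model that a word was capitalized.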


In [16]:
# Classifier model data
#'bs' = 'batchsize'
data_clas = TextClasDataBunch.from_df('data', train_df = train_df, valid_df = valid_df, vocab=data_lm.train_ds.vocab, bs=32)

In [17]:
data_clas.show_batch()

text,target
"xxbos "" xxmaj thanks , xxmaj an xxmaj xxunk xxrep 16 - xxmaj log of infinite loop on xxmaj buffering : xxmaj state : xxmaj playing xxmaj change position to : xxunk xxrep 4 0 xxmaj state : xxmaj buffering xxmaj buffering : 100.0 % xxmaj state : xxmaj buffering xxmaj state : xxmaj buffering xxmaj buffering : 50.0 % xxmaj buffering : 50.0 % xxmaj buffering : 51.0 %",0
"xxbos xxmaj please do xxunk this post i did about the "" "" xxmaj brazilian xxmaj win 10 xxup ip xxunk still w / 3 xxmaj xxunk xxmaj xxunk xxmaj translations ! "" "" http : / / answers.microsoft.com / en - us / insider / forum / insider_wintp - xxunk / brazilian - xxunk - xxunk - xxunk / xxunk - xxunk - xxunk xxmaj section 1 ) xxmaj",0
xxbos xxmaj currently xxmaj windows xxup os decides the best xxmaj internet connection and uses it automatically ( xxmaj if i have more than one connection like wifi and xxmaj network cable ) . xxmaj need user setting option to override the same and assign one or more app to use specific network connection ( wifi ) and others to use other possible network connection ( 3 g / 4,1
"xxbos "" xxmaj it would be nice if xxmaj runs / xxmaj xxunk inside a xxmaj richtextblock / xxmaj textblock could support more formatting properties like xxup css - i.e. setting background color or a word , xxmaj it would be nice if xxmaj runs / xxmaj xxunk inside a xxmaj richtextblock / xxmaj textblock could support more formatting properties like xxup css - i.e. setting background color or a",1
"xxbos "" https : / / www.microsoft.com / en - us / store / apps / gps - navigator - recorder / 9nblggh2vvj3 https : / / www.microsoft.com / it - it / store / apps / gps - navigator - recorder / 9nblggh2vvj3 https : / / www.microsoft.com / es - es / store / apps / gps - navigator - recorder / 9nblggh2vvj3 https : / / www.microsoft.com",0


#### Fine-tuning the pretrained ULMFiT language model with all layers unfrozen

In [18]:
#learn is the 'language_model_learner', initialised from the WikiText-103 pretrained weights
#drop_mult is a multiplier applied to the model's default dropout probabilities
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.4) 
#make all layers trainable
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)  # one epoch with the one-cycle policy, max learning rate 1e-3
wd=1e-7 #weight decay regularization, used later when training the classifier
lr=0.001
lrs = lr

epoch,train_loss,valid_loss,accuracy
1,5.156251,4.665867,0.257321
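
A language-model loss is often easier to read as perplexity. Assuming the reported `valid_loss` is the mean per-token cross-entropy in nats (the usual PyTorch convention, not verified here for this fastai version), perplexity is its exponential.

```python
import math

valid_loss = 4.665867                  # validation loss reported for the single epoch above
perplexity = math.exp(valid_loss)      # exp of mean cross-entropy (assumed to be in nats)
print(f"validation perplexity: {perplexity:.1f}")
```

The `accuracy` column in the same table is next-token accuracy: the fraction of positions where the model's top guess matches the actual next token.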


In [19]:
#test language model learner
learn.predict("this product should", n_words=10)

'this product should be free before launch . 33 copies of the code'

In [20]:
#save this language model encoder
learn.save_encoder('ft_enc1')

In [21]:
#load classifier and encoder
clf = text_classifier_learner(data_clas, drop_mult=0.4)
clf.load_encoder('ft_enc1')

In [22]:
#train classifier for 1+10 epochs
clf.fit_one_cycle(1, 1e-3)
clf.fit(10,lrs,wd)

epoch,train_loss,valid_loss,accuracy
1,0.477236,0.391633,0.825882


epoch,train_loss,valid_loss,accuracy
1,0.473858,0.373281,0.816471
2,0.414071,0.354965,0.837647
3,0.433779,0.348081,0.835294
4,0.436459,0.331488,0.840000
5,0.428729,0.339323,0.840000
6,0.405324,0.335680,0.832941
7,0.430015,0.336909,0.837647
8,0.424907,0.334662,0.842353
9,0.457630,0.338293,0.840000
10,0.416570,0.319682,0.851765


In [62]:
#let us test our classifier
clf.predict("I would like to have a feature that connects me to other users.")

(Category 1, tensor(1), tensor([0.1879, 0.8121]))
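
The returned tuple holds the predicted category, its class index, and the class probabilities in the order `[P(class 0), P(class 1)]`. As a small sketch using only the numbers printed above, the predicted index is the position of the largest probability.

```python
# Probabilities copied from the prediction above: [P(class 0), P(class 1)]
probs = [0.1879, 0.8121]

predicted_class = max(range(len(probs)), key=probs.__getitem__)  # argmax over class indices
confidence = probs[predicted_class]

print(predicted_class, confidence)
```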

#### Testing on unseen data, downloaded originally as "SubtaskA_Trial_Test_Labeled.csv"

In [24]:
#let us try this classifier on unseen data
labeled_data = pd.read_csv('labeled_test_data.csv', encoding = "ISO-8859-1")

In [25]:
# The test data is perfectly balanced: 296 rows per class
print(labeled_data.groupby('label').count())
print('\ndata length:',len(labeled_data))

        id  sentence
label               
0      296       296
1      296       296

data length: 592


In [26]:
labeled_data.head()

Unnamed: 0,id,sentence,label
0,1310_1,I'm not asking Microsoft to Gives permission l...,1
1,1312_1,somewhere between Android and iPhone.,0
2,1313_1,And in the Windows Store you can flag the App ...,0
3,1313_2,"Many thanks Sameh Hi, As we know, there is a l...",0
4,1313_3,The idea is that we can develop a regular app ...,1


In [27]:
predictions = []

In [28]:
#let us classify each sentence
for sentence in labeled_data['sentence']:
    prediction = clf.predict(sentence)
    #prediction looks like (Category 1, tensor(1), tensor([0.3325, 0.6675]));
    #its second element is the predicted class index as a 0-dim tensor
    predictions.append(int(prediction[1])) #convert to int and append to the predictions list

In [29]:
#add predictions to our labeled data
labeled_data['predictions'] = predictions

In [30]:
labeled_data.head()

Unnamed: 0,id,sentence,label,predictions
0,1310_1,I'm not asking Microsoft to Gives permission l...,1,1
1,1312_1,somewhere between Android and iPhone.,0,0
2,1313_1,And in the Windows Store you can flag the App ...,0,0
3,1313_2,"Many thanks Sameh Hi, As we know, there is a l...",0,0
4,1313_3,The idea is that we can develop a regular app ...,1,0


#### Let us check the accuracy and F1 score

In [31]:
print('acc: ',accuracy_score(labeled_data['label'], labeled_data['predictions']).round(4))
print('F1:  ',f1_score(labeled_data['label'], labeled_data['predictions'],average='binary').round(4))

acc:  0.6926
F1:   0.6128
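
F1 can sit well below accuracy when a classifier misses many positives. The sketch below computes both from scratch on made-up toy labels (not the notebook's data) to show the effect; `binary_scores` is a hypothetical helper that is meant to mirror what `accuracy_score` and `f1_score(average='binary')` measure.

```python
def binary_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy labels (made up): a cautious classifier that misses half of the suggestions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
accuracy, precision, recall, f1 = binary_scores(y_true, y_pred)
print(accuracy, precision, recall, round(f1, 4))
```

Here every predicted suggestion is correct, yet half of the true suggestions are missed, so F1 falls below accuracy.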


### Downsampling the data set

#### Let us try the second approach, i.e. downsampling class 0

In [32]:
#downsample 0 class
#pick row indexes having label = 0
zero_label_indexes = df[df['label']==0].index

In [33]:
zero_label_indexes

Index(['1', '2', '3', '4', '5', '6', '7', '9', '10', '11',
       ...
       '8489', '8490', '8491', '8492', '8493', '8494', '8495', '8496', '8497',
       '8498'],
      dtype='object', length=6414)

In [34]:
np.random.seed(10)

remove_n = 4300 #drop 4300 random rows labeled 0, leaving 6414 - 4300 = 2114
drop_indices = np.random.choice(zero_label_indexes, remove_n, replace=False)

In [35]:
len(drop_indices)

4300

In [36]:
df_downsampled = df.drop(drop_indices)

In [37]:
#Quite balanced!
df_downsampled['label'].value_counts()

0    2114
1    2085
Name: label, dtype: int64
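
The notebook removes a fixed 4300 rows, which leaves the classes almost but not exactly equal. A variant that derives the number from the class counts is sketched below in plain Python; `downsample_majority` is a hypothetical helper that returns the row positions to keep.

```python
import random

def downsample_majority(labels, seed=10):
    """Return sorted row positions to keep so that every class has the same count."""
    rng = random.Random(seed)
    by_class = {}
    for pos, label in enumerate(labels):
        by_class.setdefault(label, []).append(pos)
    n_keep = min(len(rows) for rows in by_class.values())   # size of the smallest class
    keep = []
    for rows in by_class.values():
        keep.extend(rng.sample(rows, n_keep))                # random subset of each class
    return sorted(keep)

labels = [0] * 6 + [1] * 2          # toy labels with a 3:1 imbalance
kept = downsample_majority(labels)
print(kept)
```

With pandas, `df.iloc[kept]` would give the balanced frame when `labels` is `df['label'].tolist()`.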

In [38]:
#Same process of splitting data into validation and training set 
valid_pct = 0.05 #validation percent
cut = int(valid_pct * len(df_downsampled)) + 1
train_df_ds, valid_df_ds = df_downsampled[cut:], df_downsampled[:cut] #'ds' = 'downsampled'
data_lm_new = TextLMDataBunch.from_df('data', train_df_ds, valid_df_ds, text_cols='text')

In [39]:
# new Classifier model data
data_clas_new = TextClasDataBunch.from_df('data', train_df = train_df_ds, valid_df = valid_df_ds, vocab=data_lm_new.train_ds.vocab, bs=32)

In [40]:
#new language model learner based on our downsampled training data
learn_new = language_model_learner(data_lm_new, pretrained_model=URLs.WT103, drop_mult=0.4)
learn_new.unfreeze()
learn_new.fit_one_cycle(1, 1e-3)  
wd=1e-7
lr=0.001
lrs = lr

epoch,train_loss,valid_loss,accuracy
1,5.323115,4.641438,0.271205


In [41]:
learn_new.predict('This feature is', n_words=10)

'This feature is known as the " great fire of the state "'

In [42]:
#save new encoder
learn_new.save_encoder('ft_enc_new')

In [43]:
#load classifier
clf_new = text_classifier_learner(data_clas_new, drop_mult=0.4)
clf_new.load_encoder('ft_enc_new')

In [44]:
#train classifier for 1 + 10 epochs
clf_new.fit_one_cycle(1, 1e-3)
clf_new.fit(10,lrs,wd)

epoch,train_loss,valid_loss,accuracy
1,0.587366,0.542312,0.733333


epoch,train_loss,valid_loss,accuracy
1,0.569483,0.542971,0.704762
2,0.560170,0.527375,0.738095
3,0.556609,0.548961,0.728571
4,0.520111,0.526778,0.742857
5,0.544389,0.556852,0.742857
6,0.530961,0.602505,0.738095
7,0.508075,0.540611,0.728571
8,0.522072,0.483270,0.771429
9,0.502235,0.502481,0.747619
10,0.527123,0.558401,0.733333


In [45]:
#let us try the classifier
clf_new.predict("It would be nice to have this feature incorporated soon.")

(Category 1, tensor(1), tensor([0.1172, 0.8828]))

#### Let us try to predict on our unseen data

In [46]:
new_predictions = []

In [47]:
for sentence in labeled_data['sentence']:
    prediction = clf_new.predict(sentence)
    new_predictions.append(int(prediction[1])) #predicted class index as int

In [48]:
labeled_data['new_predictions'] = new_predictions

In [49]:
labeled_data.head()

Unnamed: 0,id,sentence,label,predictions,new_predictions
0,1310_1,I'm not asking Microsoft to Gives permission l...,1,1,1
1,1312_1,somewhere between Android and iPhone.,0,0,0
2,1313_1,And in the Windows Store you can flag the App ...,0,0,1
3,1313_2,"Many thanks Sameh Hi, As we know, there is a l...",0,0,1
4,1313_3,The idea is that we can develop a regular app ...,1,0,0


In [50]:
print('acc: ',accuracy_score(labeled_data['label'], labeled_data['new_predictions']))
print('F1: ', f1_score(labeled_data['label'], labeled_data['new_predictions'],average='binary'))

acc:  0.7331081081081081
F1:  0.7591463414634146


### Let us compare

In [51]:
print('acc:',accuracy_score(labeled_data['label'], labeled_data['predictions']).round(4))
print('F1: ', f1_score(labeled_data['label'], labeled_data['predictions'],average='binary').round(4))
print('\nAfter downsampling...')
print('\nacc:',accuracy_score(labeled_data['label'], labeled_data['new_predictions']).round(4))
print('F1: ', f1_score(labeled_data['label'], labeled_data['new_predictions'],average='binary').round(4))

acc: 0.6926
F1:  0.6128

After downsampling...

acc: 0.7331
F1:  0.7591
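
The notebook reports only accuracy and F1, but because the labeled test set holds 296 rows per class, the confusion counts behind those scores can be searched for. The sketch below is an illustration rather than part of the original analysis: `recover_counts` and its tolerance are hypothetical, and the inputs are the rounded scores printed above.

```python
def recover_counts(acc, f1, n_pos=296, n_neg=296, tol=1e-4):
    """List every (tp, fp, fn, tn) consistent with the reported accuracy and F1."""
    total = n_pos + n_neg
    matches = []
    for tp in range(n_pos + 1):
        for tn in range(n_neg + 1):
            fn, fp = n_pos - tp, n_neg - tn
            denom = 2 * tp + fp + fn
            f = 2 * tp / denom if denom else 0.0
            if abs((tp + tn) / total - acc) <= tol and abs(f - f1) <= tol:
                matches.append((tp, fp, fn, tn))
    return matches

original = recover_counts(0.6926, 0.6128)      # trained on the full, imbalanced data
downsampled = recover_counts(0.7331, 0.7591)   # trained on the downsampled data
print(original)
print(downsampled)
```

Under these assumptions each model yields exactly one candidate. The first classifier finds 144 of the 296 suggestions with 30 false alarms, while the downsampled one finds 249 with 111 false alarms, so the F1 gain comes from much higher recall at some cost in precision.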


### Let us generate our submission csv file
#### Unlabeled test data was originally downloaded as "SubtaskA_EvaluationData.csv"

In [52]:
#load test data
test_data = pd.read_csv('test_data_nlp.csv')
test_data.head()

Unnamed: 0,id,text,prediction
0,9566,This would enable live traffic aware apps.,X
1,9569,Please try other formatting like bold italics ...,X
2,9576,Since computers were invented to save time I s...,X
3,9577,Allow rearranging if the user wants to change ...,X
4,9579,Add SIMD instructions for better use of ARM NE...,X


In [53]:
#drop predictions column
test_data = test_data.drop(['prediction'],axis=1)

In [54]:
final_predictions=[]

In [55]:
for text in test_data['text']:
    prediction = clf_new.predict(text)
    final_predictions.append(int(prediction[1])) #predicted class index as int

In [56]:
test_data['prediction'] = final_predictions

In [58]:
test_data.to_csv('Rituraj_Singh.csv', header=False, index=False)

### And finally,

In [57]:
print(clf.predict("MIDAS should hire me as a research intern!"))
print(clf_new.predict("MIDAS should hire me as a research intern!")) 
#The second classifier seems to have a stronger opinion :D

(Category 0, tensor(0), tensor([0.5303, 0.4697]))
(Category 1, tensor(1), tensor([0.2712, 0.7288]))
