# Suggestion Classification Using ULMFiT Transfer Learning Approach
### Let us see if this classifies "MIDAS should hire me as a research intern!" as a suggestion!

In [1]:
#Import libraries

import os
import re
import string
import time
from datetime import datetime

import numpy as np
import pandas as pd
import nltk
import tensorflow as tf
from keras.preprocessing.text import text_to_word_sequence
from nltk.tokenize import word_tokenize, WhitespaceTokenizer, TweetTokenizer, sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from fastai import *
from fastai.text import *

Using TensorFlow backend.


#### Stop-words are NOT removed as words like "should", "could", etc. are key suggestion indicators

In [2]:
#import nltk.data
#sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [3]:
#nltk.download('stopwords')
#from nltk.corpus import stopwords 
#stop_words = stopwords.words('english')

In [4]:
#stop_words #STOP WORDS SHOULD NOT BE REMOVED IN THIS PROBLEM. WHY? For example, 'should' is an important word
            #which indicates a "suggestion"
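
To make the point concrete, here is a minimal sketch of what stop-word filtering would do to a typical suggestion. The stop list is a tiny hand-written set chosen for illustration (it is not NLTK's list), and `remove_stop_words` and both example sentences are hypothetical.

```python
# Illustrative only: a tiny hand-picked stop list, NOT the NLTK English list.
ILLUSTRATIVE_STOP_WORDS = {"the", "a", "to", "should", "could", "would", "be"}

def remove_stop_words(sentence, stop_words=ILLUSTRATIVE_STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    return [tok for tok in sentence.lower().split() if tok not in stop_words]

suggestion = "The app should allow offline sync"
statement = "The app allows offline sync"

print(remove_stop_words(suggestion))  # the modal verb "should" is gone
print(remove_stop_words(statement))
```

After filtering, the suggestion and the plain statement differ only by a verb inflection, so the strongest cue for the classifier has been thrown away.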

#### Load and Explore Training Data, downloaded originally as "V1.4_Training.csv"

In [5]:
df = pd.read_csv('suggestion_training_data.csv').iloc[:,:3]

In [6]:
df.head()

Unnamed: 0,id,suggestion,text
0,663_3,1,"""Please enable removing language code from the..."
1,663_4,0,"""Note: in your .csproj file, there is a Suppor..."
2,664_1,0,"""Wich means the new version not fully replaced..."
3,664_2,0,"""Some of my users will still receive the old x..."
4,664_3,0,"""The store randomly gives the old xap or the n..."


In [7]:
len(df)

8499

In [8]:
df = df.dropna()

In [9]:
#data has no nan rows
len(df)

8499

In [10]:
df = df.drop(['id'],axis=1)
df = df.rename(index=str, columns={"suggestion": "label"})

In [11]:
df.head().style

Unnamed: 0,label,text
0,1,"""Please enable removing language code from the Dev Center ""language history"" For example if you ever selected ""ru"" and ""ru-ru"" laguages and you published this xap to the Store then it causes Tile localization to show the en-us(default) tile localization which is bad."""
1,0,"""Note: in your .csproj file, there is a SupportedCultures entry like this: de-DE;ru;ru-RU When I removed the ""ru"" language code and published my new xap version, the old xap version still remains in the Store with ""Replaced and unpublished""."""
2,0,"""Wich means the new version not fully replaced the old version and this causes me very serious problems: 1."""
3,0,"""Some of my users will still receive the old xap version of my app."""
4,0,"""The store randomly gives the old xap or the new xap version of my app."""


#### The dataset is clearly imbalanced

In [13]:
df['label'].value_counts()

0    6414
1    2085
Name: label, dtype: int64
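
A quick back-of-the-envelope check of how skewed the labels are, using only the counts printed above. A model that always predicts class 0 would already be right on roughly three quarters of the training rows, which is one reason to look at F1 as well as accuracy.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 6414, 1: 2085}

total = sum(counts.values())
majority_share = max(counts.values()) / total   # accuracy of always predicting the majority class
imbalance_ratio = counts[0] / counts[1]         # non-suggestions per suggestion

print(f"total rows: {total}")
print(f"majority-class baseline accuracy: {majority_share:.4f}")
print(f"class 0 : class 1 ratio: {imbalance_ratio:.2f}")
```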

## ULMFiT Transfer Learning Paper Overview 

> [Paper Link](https://arxiv.org/abs/1801.06146)
> Authors : **Jeremy Howard, Sebastian Ruder**

![](https://raw.githubusercontent.com/ritzdevp/MachineLearning/master/arch.png)

The model uses the AWD-LSTM architecture ([Merity et al., 2017a](https://arxiv.org/abs/1708.02182)), a multi-layer LSTM with carefully tuned dropout. For classification, the final layer is a dense layer with a softmax output.

ULMFiT consists of three stages.

 1. The Language Model (LM) is pretrained on a general-domain corpus to capture general features of the language in its different layers.
 2. The full LM is fine-tuned on the target task data.
 3. The classifier is fine-tuned on the target task using techniques introduced by the authors: *gradual unfreezing*, *discriminative fine-tuning* and *slanted triangular learning rates*.
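
Two of the techniques named in stage 3 reduce to short formulas. The sketch below writes them out in plain Python, using the formulas and default values given in the ULMFiT paper (`cut_frac=0.1`, `ratio=32`, peak rate `0.01`, and a per-layer factor of `2.6`). The function names are hypothetical, and this is not how fastai implements its schedules.

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t under the slanted triangular schedule.

    Assumes total_steps * cut_frac >= 1 so that the warm-up phase is non-empty.
    """
    cut = math.floor(total_steps * cut_frac)             # iteration where the rate peaks
    if t < cut:
        p = t / cut                                      # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(lr_last, n_layers, factor=2.6):
    """Per-layer rates, first layer first: each lower layer gets the rate above it divided by factor."""
    return [lr_last / factor ** (n_layers - 1 - i) for i in range(n_layers)]

T = 1000
for step in (0, 100, 550, 1000):
    print(step, round(slanted_triangular_lr(step, T), 6))
print(discriminative_lrs(0.01, 3))
```

The rate starts at `lr_max / ratio`, climbs to `lr_max` after the first 10% of iterations, and then decays back. Lower layers, which hold more general features, are updated more gently than the top layer.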

> In the figure above, shaded layers are being unfrozen and black layers are frozen.



### Let us try two approaches:
#### 1. Train and predict for the dataset as it is.
#### 2. Train and predict after downsampling class '0'.

In [14]:
#Splitting training data into training and validation set

valid_pct = 0.05 #validation set size 

#splitting 
cut = int(valid_pct * len(df)) + 1
train_df, valid_df = df[cut:], df[:cut]

#converting data into DataBunch, a data format compatible with fastai.text.data
#NOTE : data_lm is for language model learner
data_lm = TextLMDataBunch.from_df('data', train_df, valid_df, text_cols='text')
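
Note that the split above is a plain slice: the first `cut` rows in file order become the validation set and nothing is shuffled. The small helper below (a hypothetical name, written only to check the arithmetic) reproduces the resulting set sizes.

```python
def split_sizes(n_rows, valid_pct):
    """Return (n_train, n_valid) for the slice-based split df[cut:], df[:cut]."""
    cut = int(valid_pct * n_rows) + 1
    return n_rows - cut, cut

n_train, n_valid = split_sizes(8499, 0.05)   # 8499 rows in the training data
print(n_train, n_valid)
```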

In [15]:
data_lm.show_batch()

idx,text
0,"xxbos xxmaj it would be great to use xxmaj xxunk to install scripts once into the shared project rather than installing two copies of the same script into the xxmaj windows project and the xxmaj windows xxmaj phone project . xxbos "" xxmaj this tester changed the criteria . "" xxbos "" sending an xxup udp xxmaj broadcast ( send to xxunk ) is not supported . "" xxbos """
1,"designer problem ; but the attached properties are not working ( the binding ) and i have to do set binding by code xxbos "" i 'd like to have xxmaj feedly automatically detect youtube videos in feeds , and if i save a post with a youtube embed , automatically add that specific video to my ' xxmaj watch xxmaj later ' feed in youtube . "" xxbos """
2,"to feedly.com or make it possible to download an update from within the extension ? "" xxbos "" please let me know when you will add more payment options . "" xxbos "" xxmaj the new xxmaj notification access is awesome but is it possible to add xxmaj overlay xxup ui and xxmaj services ( you will find it under accessibility settings in xxmaj android ) to xxup xxunk ."
3,"post directly to xxmaj kindle or / and other e - book readers . "" xxbos "" i 've noticed that when you create a datapackage containing both data provided directly ( let 's say text using xxunk ) and xxunk formats like a bitmap ( using xxunk ) when closing the application the whole clipboard content is cleared . "" xxbos xxmaj in the xxmaj windows . xxmaj data"
4,"existing xxunk types ( i.e. "" xxbos xxmaj the .net core frame work is a xxunk idea , but incorporate ideas that have caused me to avoid linux for the most part . xxbos "" xxmaj in html if you align text to the right , then its size is reduced . "" xxbos xxmaj but the xxup uwp textboxes are not that xxunk when it comes to numeric input"
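
The `xx…` markers in these batches are special tokens added by fastai's tokenizer: `xxbos` marks the start of a text, `xxmaj` means the next word was capitalized, `xxup` means it was written in all caps, `xxunk` replaces out-of-vocabulary words, and `xxrep n c` stands for character `c` repeated `n` times. The sketch below imitates only the two capitalization rules on whitespace-split tokens. It is a simplified illustration, not fastai's implementation, and `mark_caps` is a hypothetical name.

```python
def mark_caps(text):
    """Lowercase whitespace-split tokens, inserting xxup / xxmaj markers before them."""
    out = ["xxbos"]
    for tok in text.split():
        if len(tok) > 1 and tok.isupper():
            out.append("xxup")            # whole word was upper case
        elif len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            out.append("xxmaj")           # only the first letter was upper case
        out.append(tok.lower())
    return out

print(" ".join(mark_caps("The UWP textboxes accept numeric input")))
```

Encoding case as separate tokens lets the vocabulary stay lowercase while the model still sees where capitals occurred.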


In [16]:
# Classifier model data
#'bs' = 'batchsize'
data_clas = TextClasDataBunch.from_df('data', train_df = train_df, valid_df = valid_df, vocab=data_lm.train_ds.vocab, bs=32)

In [17]:
data_clas.show_batch()

text,target
"xxbos "" xxmaj thanks , xxmaj an xxmaj xxunk xxrep 16 - xxmaj log of infinite loop on xxmaj buffering : xxmaj state : xxmaj playing xxmaj change position to : xxunk xxrep 4 0 xxmaj state : xxmaj buffering xxmaj buffering : 100.0 % xxmaj state : xxmaj buffering xxmaj state : xxmaj buffering xxmaj buffering : 50.0 % xxmaj buffering : 50.0 % xxmaj buffering : 51.0 %",0
"xxbos "" xxmaj windows phone store should show total number of app downloads ( xxmaj all xxmaj markets ) as well as ratings and reviews of xxmaj all xxmaj markets instead of showing xxunk and reviews based on region because it does n't give a xxunk view and if by chance in any country there are no reviews on your app and someone want to download your app he might",1
xxbos xxup ms xxmaj edge is not bad however people have different requirements for web browsers . xxmaj the current xxunk are too limited for developers to create a new browser . xxmaj you should give people more freedom and more apis to create new browsers . xxmaj and please do n't always use security risk as xxunk for not allowing people to create more creative apps than xxmaj edge,1
"xxbos "" xxrep 47 _ xxup itin xxrep 12 _ xxup ein xxrep xxunk _ xxrep 9 _ xxup xxunk xxrep 6 _ xxup xxunk xxrep 69 _ xxmaj instructions xxrep 90 _ xxrep 51 _ xxup xxunk xxrep 30 _ xxup ms xxrep xxunk _ xxrep 39 _ xxrep 15 _ xxmaj web xxrep 21 _ xxup ein xxrep 47 _ xxup itin xxrep 39 _ xxrep 41 _",0
"xxbos "" .wma files copied / moved to xxmaj music library using medialibraryextensions . savesong ( ) xxup api from a windows phone app are stored in xxmaj music library with .mp3 extension , though the audio file itself will play properly from xxmaj music library xxmaj when you share it via email - you can notice that file attached has a .mp3 extension xxmaj or connect phone to xxup",0


#### Fine-tuning the pretrained ULMFiT language model with all layers unfrozen

In [18]:
#learn is the language model learner, initialised from the pretrained WikiText-103 (WT103) weights
#drop_mult is a multiplier that scales all the dropout values of the model
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.4)
#make all layers trainable
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)  # one-cycle policy: 1 epoch, max learning rate 1e-3
#hyperparameters reused below when training the classifier
wd=1e-7 #weight decay regularization
lr=0.001
lrs = lr

epoch,train_loss,valid_loss,accuracy
1,5.158728,4.664846,0.258914
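
Language models are often compared by perplexity rather than raw loss. Assuming the validation loss reported above is the mean per-token cross-entropy in nats (the usual setup, but an assumption here), perplexity is simply its exponential:

```python
import math

valid_loss = 4.664846                  # validation loss from the epoch above
perplexity = math.exp(valid_loss)      # perplexity = e^(cross-entropy)
print(f"validation perplexity: {perplexity:.1f}")
```

The `accuracy` column for a language model is next-token accuracy, so a value near 0.26 after a single epoch is not comparable to the classifier accuracies reported later.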


In [19]:
#test language model learner
learn.predict("this product should", n_words=10)

'this product should be used as a basis for be assuming that it'

In [20]:
#save this language model encoder
learn.save_encoder('ft_enc1')

In [21]:
#load classifier and encoder
clf = text_classifier_learner(data_clas, drop_mult=0.4)
clf.load_encoder('ft_enc1')

In [22]:
#train classifier for 1+10 epochs
clf.fit_one_cycle(1, 1e-3)
clf.fit(10,lrs,wd)

epoch,train_loss,valid_loss,accuracy
1,0.435861,0.365275,0.832941


epoch,train_loss,valid_loss,accuracy
1,0.430285,0.407623,0.783529
2,0.447232,0.347651,0.830588
3,0.440356,0.336098,0.830588
4,0.408999,0.338077,0.825882
5,0.418153,0.341397,0.830588
6,0.431059,0.320035,0.851765
7,0.445743,0.339636,0.832941
8,0.431257,0.332531,0.840000
9,0.408301,0.312848,0.849412
10,0.440342,0.320567,0.844706


In [23]:
#let us test our classifier
clf.predict("I would like to have a feature that connects me to other users.")

(Category 1, tensor(1), tensor([0.2356, 0.7644]))
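
The returned tuple holds the predicted category, its class index, and the per-class probabilities. The predicted class is the index of the largest probability, which the plain-Python sketch below reproduces from the numbers above. The `classify` helper and its `threshold` parameter are hypothetical additions, shown only to illustrate how a stricter decision rule could be applied.

```python
# Probabilities copied from the prediction above: index 0 = not a suggestion, index 1 = suggestion.
probs = [0.2356, 0.7644]

predicted_class = max(range(len(probs)), key=probs.__getitem__)  # argmax over the two classes
confidence = probs[predicted_class]
print(predicted_class, confidence)

def classify(probs, threshold=0.5):
    """Label as suggestion (1) only if P(class 1) reaches the threshold."""
    return int(probs[1] >= threshold)

print(classify(probs), classify(probs, threshold=0.8))
```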

#### Testing on unseen data, downloaded originally as "SubtaskA_Trial_Test_Labeled.csv"

In [24]:
#let us try this classifier on unseen data
labeled_data = pd.read_csv('labeled_test_data.csv', encoding = "ISO-8859-1")

In [25]:
# The test data is perfectly balanced
print(labeled_data.groupby('label').count())
print('\ndata length:',len(labeled_data))

        id  sentence
label               
0      296       296
1      296       296

data length: 592


In [26]:
labeled_data.head()

Unnamed: 0,id,sentence,label
0,1310_1,I'm not asking Microsoft to Gives permission l...,1
1,1312_1,somewhere between Android and iPhone.,0
2,1313_1,And in the Windows Store you can flag the App ...,0
3,1313_2,"Many thanks Sameh Hi, As we know, there is a l...",0
4,1313_3,The idea is that we can develop a regular app ...,1


In [27]:
predictions = []

In [28]:
#let us classify each sentence
for i in range(0,len(labeled_data)):
    prediction = clf.predict(labeled_data['sentence'][i])
    #prediction looks like (Category 1, tensor(1), tensor([0.3325, 0.6675]));
    #its second element is the predicted class index as a 0-dim tensor
    val = int(prediction[1]) #convert the tensor directly to a Python int (0 or 1)
    predictions.append(val) #append result to predictions list

In [29]:
#add predictions to our labeled data
labeled_data['predictions'] = predictions

In [30]:
labeled_data.head()

Unnamed: 0,id,sentence,label,predictions
0,1310_1,I'm not asking Microsoft to Gives permission l...,1,0
1,1312_1,somewhere between Android and iPhone.,0,0
2,1313_1,And in the Windows Store you can flag the App ...,0,0
3,1313_2,"Many thanks Sameh Hi, As we know, there is a l...",0,0
4,1313_3,The idea is that we can develop a regular app ...,1,0


#### Let us check the accuracy and F1 score

In [31]:
print('acc: ',accuracy_score(labeled_data['label'], labeled_data['predictions']).round(4))
print('F1:  ',f1_score(labeled_data['label'], labeled_data['predictions'],average='binary').round(4))

acc:  0.7027
F1:   0.6562


### Downsampling the data set

#### Let us try the second approach, i.e. downsampling class 0

In [32]:
#downsample 0 class
#pick row indexes having label = 0
zero_label_indexes = df[df['label']==0].index

In [33]:
zero_label_indexes

Index(['1', '2', '3', '4', '5', '6', '7', '9', '10', '11',
       ...
       '8489', '8490', '8491', '8492', '8493', '8494', '8495', '8496', '8497',
       '8498'],
      dtype='object', length=6414)

In [34]:
np.random.seed(10)

remove_n = 4300 #we pick 4300 random rows which are labeled as 0
drop_indices = np.random.choice(zero_label_indexes, remove_n, replace=False)

In [35]:
len(drop_indices)

4300

In [36]:
df_downsampled = df.drop(drop_indices)

In [37]:
#Quite balanced!
df_downsampled['label'].value_counts()

0    2114
1    2085
Name: label, dtype: int64
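
The notebook removes a fixed 4300 rows, which leaves the two classes close but not identical in size. A variant is to derive the number from the data so that both classes match exactly. The sketch below does this in plain Python on a toy label list with the same class counts as the training data. `downsample_majority` is a hypothetical helper, not part of pandas or fastai.

```python
import random
from collections import Counter

def downsample_majority(labels, seed=10):
    """Return sorted row positions to keep so that all classes end up with equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for pos, label in enumerate(labels):
        by_class.setdefault(label, []).append(pos)
    n_keep = min(len(rows) for rows in by_class.values())   # size of the smallest class
    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, n_keep))               # sample without replacement
    return sorted(kept)

# Toy labels with the same 6414 / 2085 class counts as the training data.
labels = [0] * 6414 + [1] * 2085
kept = downsample_majority(labels)
balanced_counts = Counter(labels[i] for i in kept)
print(balanced_counts)
```

Either way, downsampling discards about two thirds of the class 0 rows, so the language model and the classifier both see less text.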

In [38]:
#Same process of splitting data into validation and training set 
valid_pct = 0.05 #validation percent
cut = int(valid_pct * len(df_downsampled)) + 1
train_df_ds, valid_df_ds = df_downsampled[cut:], df_downsampled[:cut] #'ds' = 'downsampled'
data_lm_new = TextLMDataBunch.from_df('data', train_df_ds, valid_df_ds, text_cols='text')

In [39]:
# new Classifier model data
data_clas_new = TextClasDataBunch.from_df('data', train_df = train_df_ds, valid_df = valid_df_ds, vocab=data_lm_new.train_ds.vocab, bs=32)

In [40]:
#new language model learner based on our downsampled training data
learn_new = language_model_learner(data_lm_new, pretrained_model=URLs.WT103, drop_mult=0.4)
learn_new.unfreeze()
learn_new.fit_one_cycle(1, 1e-3)  
wd=1e-7
lr=0.001
lrs = lr

epoch,train_loss,valid_loss,accuracy
1,5.318812,4.636693,0.271763


In [41]:
learn_new.predict('This feature is', n_words=10)

'This feature is being changed by the location of the bottom . "'

In [42]:
#save new encoder
learn_new.save_encoder('ft_enc_new')

In [43]:
#load classifier
clf_new = text_classifier_learner(data_clas_new, drop_mult=0.4)
clf_new.load_encoder('ft_enc_new')

In [44]:
#train classifier for 1 + 10 epochs
clf_new.fit_one_cycle(1, 1e-3)
clf_new.fit(10,lrs,wd)

epoch,train_loss,valid_loss,accuracy
1,0.595074,0.531674,0.757143


epoch,train_loss,valid_loss,accuracy
1,0.567137,0.510671,0.742857
2,0.561927,0.495377,0.747619
3,0.550490,0.486710,0.747619
4,0.527414,0.538360,0.738095
5,0.541849,0.503917,0.757143
6,0.527251,0.499117,0.761905
7,0.519581,0.482363,0.761905
8,0.515662,0.483087,0.761905
9,0.489052,0.518943,0.761905
10,0.516822,0.452856,0.776190


In [45]:
#let us try the classifier
clf_new.predict("It would be nice to have this feature incorporated soon.")

(Category 1, tensor(1), tensor([0.0278, 0.9722]))

#### Let us try to predict on our unseen data

In [46]:
new_predictions = []

In [47]:
for i in range(0,len(labeled_data)):
    prediction = clf_new.predict(labeled_data['sentence'][i])
    val = int(prediction[1]) #predicted class index as a Python int
    new_predictions.append(val)

In [48]:
labeled_data['new_predictions'] = new_predictions

In [49]:
labeled_data.head()

Unnamed: 0,id,sentence,label,predictions,new_predictions
0,1310_1,I'm not asking Microsoft to Gives permission l...,1,0,1
1,1312_1,somewhere between Android and iPhone.,0,0,1
2,1313_1,And in the Windows Store you can flag the App ...,0,0,1
3,1313_2,"Many thanks Sameh Hi, As we know, there is a l...",0,0,1
4,1313_3,The idea is that we can develop a regular app ...,1,0,0


In [50]:
print('acc: ',accuracy_score(labeled_data['label'], labeled_data['new_predictions']))
print('F1: ', f1_score(labeled_data['label'], labeled_data['new_predictions'],average='binary'))

acc:  0.7077702702702703
F1:  0.7358778625954198


### Let us compare

In [51]:
print('acc:',accuracy_score(labeled_data['label'], labeled_data['predictions']).round(4))
print('F1: ', f1_score(labeled_data['label'], labeled_data['predictions'],average='binary').round(4))
print('\nAfter downsampling...')
print('\nacc:',accuracy_score(labeled_data['label'], labeled_data['new_predictions']).round(4))
print('F1: ', f1_score(labeled_data['label'], labeled_data['new_predictions'],average='binary').round(4))

acc: 0.7027
F1:  0.6562

After downsampling...

acc: 0.7078
F1:  0.7359
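
Accuracy barely moves between the two runs while F1 jumps, so it helps to look at precision and recall. The notebook does not print a confusion matrix, so the counts below are back-calculated: the sketch searches for the integer counts that best match the reported accuracy and F1, given the 296/296 class balance of the labeled test set. Treat the result as a derived estimate, not an output of the original run. `confusion_from_metrics` is a hypothetical helper.

```python
def confusion_from_metrics(acc, f1, n_pos, n_neg):
    """Find the integer (tp, fn, fp, tn) whose accuracy and F1 best match the given values."""
    n = n_pos + n_neg
    best, best_err = None, float("inf")
    for tp in range(n_pos + 1):
        fn = n_pos - tp
        for fp in range(n_neg + 1):
            tn = n_neg - fp
            denom = 2 * tp + fp + fn
            f1_here = 2 * tp / denom if denom else 0.0
            err = abs((tp + tn) / n - acc) + abs(f1_here - f1)
            if err < best_err:
                best, best_err = (tp, fn, fp, tn), err
    return best

for name, acc, f1 in [("original", 0.7027, 0.6562), ("downsampled", 0.7078, 0.7359)]:
    tp, fn, fp, tn = confusion_from_metrics(acc, f1, n_pos=296, n_neg=296)
    print(f"{name}: precision={tp / (tp + fp):.3f} recall={tp / (tp + fn):.3f}")
```

Under these assumptions, the classifier trained on downsampled data finds many more of the true suggestions (higher recall) at the cost of more false alarms (lower precision), which is consistent with the higher F1 at nearly unchanged accuracy.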


### Let us generate our submission csv file
#### Unlabeled test data was originally downloaded as "SubtaskA_EvaluationData.csv"

In [52]:
#load test data
test_data = pd.read_csv('test_data_nlp.csv')
test_data.head()

Unnamed: 0,id,text,prediction
0,9566,This would enable live traffic aware apps.,X
1,9569,Please try other formatting like bold italics ...,X
2,9576,Since computers were invented to save time I s...,X
3,9577,Allow rearranging if the user wants to change ...,X
4,9579,Add SIMD instructions for better use of ARM NE...,X


In [53]:
#drop predictions column
test_data = test_data.drop(['prediction'],axis=1)

In [54]:
final_predictions=[]

In [55]:
for i in range(0,len(test_data)):
    prediction = clf_new.predict(test_data['text'][i])
    val = int(prediction[1]) #predicted class index as a Python int
    final_predictions.append(val)

In [56]:
test_data['prediction'] = final_predictions

In [57]:
test_data.to_csv('Rituraj_Singh.csv', header=False, index=False)

### And finally,

In [60]:
print(clf.predict("MIDAS should hire me as a research intern!"))
print(clf_new.predict("MIDAS should hire me as a research intern!")) 
#The first classifier seems to have a stronger opinion :D

(Category 1, tensor(1), tensor([0.3356, 0.6644]))
(Category 1, tensor(1), tensor([0.4229, 0.5771]))
