## Notebook 3: `3_Advanced_Modelling`

In the previous notebook we transformed the data so that it could be fed into machine learning models, and then trained several models on the transformed data. Training on the whole data set took too long, and some of the algorithms accept only dense arrays. My laptop ran out of memory when converting the sparse matrix to a dense array. Therefore, we will train some additional models using only a sample of the data.   

In this notebook, we take a 10 percent sample of the data and train some additional, computationally intensive models on that sample. 

Following is the table of contents for this notebook:


<br>

### Table of Contents 

<br>   

1. [Data Transformation](#transformation)                  
2. [Modelling](#Modelling)        
    2.1 [SVC](#SVC)    
    2.2 [Naive Bayes](#NB)                    
3. [Conclusion](#conclusion)       




Import required libraries  

In [1]:
import numpy as np
import pandas as pd

import re

import matplotlib.pyplot as plt
import seaborn as sns


#text processing 
import nltk
from nltk.corpus import stopwords
from  nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [2]:
df_clean_whole = pd.read_csv('../data/cleaned_data.csv')

In [3]:
df_clean_whole.head()

Unnamed: 0.1,Unnamed: 0,ids,date,user,text,target
0,0,1467810369,2009-04-06 22:19:45,_TheSpecialOne_,Awww thats a bummer You shoulda got David Carr...,0
1,1,1467810672,2009-04-06 22:19:49,scotthamilton,is upset that he cant update his Facebook by t...,0
2,2,1467810917,2009-04-06 22:19:53,mattycus,I dived many times for the ball Managed to sav...,0
3,3,1467811184,2009-04-06 22:19:57,ElleCTF,my whole body feels itchy and like its on fire,0
4,4,1467811193,2009-04-06 22:19:57,Karoli,no its not behaving at all im mad why am i her...,0


Keep just 10 percent of the data.

In [4]:
#take a sample from this data set 
df_clean = df_clean_whole.sample(frac=0.1, random_state=1).copy()

In [5]:
df_clean.shape

(153270, 6)
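
As a small illustration of the sampling step, the sketch below draws a 10 percent sample from a made-up data frame (`toy` is a hypothetical stand-in for the tweet data, not part of the notebook). It only shows how `frac` and `random_state` behave.

```python
import pandas as pd

# Hypothetical toy frame standing in for the full tweet data set.
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "target": [i % 2 for i in range(100)],
})

# frac=0.1 keeps 10% of the rows; a fixed random_state makes the draw repeatable.
sample_a = toy.sample(frac=0.1, random_state=1)
sample_b = toy.sample(frac=0.1, random_state=1)

print(len(sample_a))                          # number of sampled rows
print(sample_a.index.equals(sample_b.index))  # both draws pick the same rows
```

The sampled rows keep their original index labels, which is why the index of `df_clean` is no longer sequential after sampling.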

## Data transformation <a name="transformation"></a> 

Let's first tokenize the data. 

In [6]:
#tokenize the data 
df_clean['cleaned_text'] = df_clean['text'].apply(lambda x: word_tokenize(x))

Our goal is to predict the sentiment of a given tweet. Stop words such as "the", "a" and "an" contribute little to the sentiment of a sentence, so we remove them from the tokenized data. 

In [7]:
#create a set of stop words (a set makes the membership test faster than a list)
stop_words = set(stopwords.words('english'))

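Create the stemmer and lemmatizer objects. The stemmer is applied after the stop words have been removed. 

In [8]:
snowball_stemmer = SnowballStemmer(language='english')
wordnet_lemmatizer = WordNetLemmatizer() #lemmatizer, an alternative to stemming (not used below)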

In [9]:
#remove stop words from tokenized data
df_clean['cleaned_text'] = df_clean['cleaned_text'].apply(lambda x: [word for word in x if word.casefold() not in stop_words])
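
The filtering step can be sketched on a single token list. The stop-word set here is a hypothetical miniature one chosen for illustration; the notebook uses NLTK's English list.

```python
# Hypothetical miniature stop-word set; the notebook uses NLTK's English list.
toy_stop_words = {"the", "a", "an", "is"}

tokens = ["The", "movie", "is", "a", "bummer"]

# casefold() lowercases each token so that "The" matches "the".
filtered = [word for word in tokens if word.casefold() not in toy_stop_words]

print(filtered)  # ['movie', 'bummer']
```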

Words derived from a root word have much the same effect on sentiment as the root word itself. For example, "work" and "working" would affect the sentiment of a tweet in the same way. To reduce the number of features, we can keep only the root word in our data. This process is called stemming. Let's perform stemming on our data. 

In [10]:
#perform stemming with the Snowball stemmer
df_clean['cleaned_text'] = df_clean['cleaned_text'].apply(lambda x: [snowball_stemmer.stem(word) for word in x])  
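
A minimal sketch of what the stemmer does to a few inflected forms of one word. The word list is made up for illustration.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer(language="english")

# Illustrative word list: several inflections of the same root.
words = ["work", "working", "worked", "works"]
stems = [stemmer.stem(word) for word in words]

print(stems)  # all four forms collapse to the same stem
```

Stems are not always dictionary words. They only need to be consistent, so that every form of a word maps to the same feature.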

Now let's have a look at the tokenized data after removing stop words and stemming. 

In [11]:

df_clean.head()

Unnamed: 0.1,Unnamed: 0,ids,date,user,text,target,cleaned_text
303618,313515,2001769841,2009-06-02 02:00:48,AiramR,I hear ya babe I totally hear ya,0,"[hear, ya, babe, total, hear, ya]"
1319435,1374121,2051395536,2009-06-05 21:54:41,RoxieRavenclaw,Thanks Happy followfriday,1,"[thank, happi, followfriday]"
40331,41150,1573979187,2009-04-21 03:44:44,darkeyeskai,but it hits me at 2 tender spots ice cream n f...,0,"[hit, 2, tender, spot, ice, cream, n, free, n,..."
17502,17751,1556383673,2009-04-18 22:37:41,canoamnery,is this the real keri hilson can u call me my ...,0,"[real, keri, hilson, u, call, phone, number, 5..."
1004836,1044133,1957404245,2009-05-29 00:23:46,organizerlady,Thats so sweet My secret is chocolate lots of ...,1,"[that, sweet, secret, chocol, lot, chocol]"


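After tokenizing the text, let's create the features and the target. The tokens are joined back into a single string per tweet because the vectorizer expects strings. 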

In [12]:
X = df_clean['cleaned_text'].apply(lambda x: ' '.join(x))
y = df_clean['target']

In [13]:
from sklearn.model_selection import train_test_split

Split the data set into training and validation sets, stratified on the target so that both sets keep the same class balance. 

In [14]:
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.8, stratify=y)
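
To see what `stratify` does, here is a small sketch on made-up labels with a 70/30 class imbalance. The names `toy_X` and `toy_y` are hypothetical, and `random_state=0` is added only to make the toy split repeatable.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 70 negative and 30 positive examples.
toy_X = [f"tweet {i}" for i in range(100)]
toy_y = [0] * 70 + [1] * 30

X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, train_size=0.8, stratify=toy_y, random_state=0
)

# Both splits keep the 70/30 class ratio of the full toy data.
print(Counter(y_tr))
print(Counter(y_val))
```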

In [15]:
#let's vectorize the data
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
bagofwords = CountVectorizer(min_df=20)

In [17]:
bagofwords.fit(X_train)

CountVectorizer(min_df=20)

In [18]:
X_train_transformed = bagofwords.transform(X_train)
X_train_transformed

<122616x3801 sparse matrix of type '<class 'numpy.int64'>'
	with 751724 stored elements in Compressed Sparse Row format>
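
A rough back-of-the-envelope estimate shows why converting this matrix to a dense array is expensive. The shape and number of stored elements are taken from the output above; the 4-byte index size for the CSR arrays is an assumption.

```python
n_rows, n_cols = 122616, 3801   # shape reported above
n_stored = 751724               # stored (non-zero) elements reported above
bytes_per_value = 8             # int64

# Dense: every cell is stored, including all the zeros.
dense_bytes = n_rows * n_cols * bytes_per_value

# CSR: values + column indices + row pointers (assuming 4-byte int32 indices).
sparse_bytes = n_stored * bytes_per_value + n_stored * 4 + (n_rows + 1) * 4

print(f"dense : {dense_bytes / 1e9:.2f} GB")   # dense : 3.73 GB
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")  # sparse: 9.51 MB
```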

In [19]:
X_validation_transformed = bagofwords.transform(X_validation)

## Modelling <a name="Modelling"></a> 

### SVC <a name="SVC"></a> 

In [20]:
#let's train a support vector machine on this data
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [22]:
#define estimators for the pipeline
estimators = [('count_vectorizer', CountVectorizer(min_df=20)), #vectorize data
                ('model', SVC())] #fit a support vector classifier

pipe = Pipeline(estimators) #make a pipeline
pipe #checkout the pipe

Pipeline(steps=[('count_vectorizer', CountVectorizer(min_df=20)),
                ('model', SVC())])

In [23]:
#define the parameter grid
param_grid = {
        'model': [SVC()], 
        'model__C': [0.001, 0.01, 0.1, 1, 10],
    }

#initialize the grid search for the pipe
grid = GridSearchCV(pipe, param_grid, cv=5)  

#check out the grid 
grid

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('count_vectorizer',
                                        CountVectorizer(min_df=20)),
                                       ('model', SVC())]),
             param_grid={'model': [SVC()],
                         'model__C': [0.001, 0.01, 0.1, 1, 10]})

<span style="color:red">Warning, the cell below takes more than 2000 minutes to run </span>. 

In [24]:
#fit the grid search on the full sample (training and validation data);
#cross-validation creates its own held-out folds
fitted_grid = grid.fit(X, y)

KeyboardInterrupt: 
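
Since the full search above was interrupted, here is a minimal sketch of the same pipeline-plus-grid-search pattern on a tiny made-up corpus, to show what a completed search exposes. The texts, labels, `cv=2` and the three `C` values are all hypothetical choices for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 4 positive and 4 negative "tweets".
toy_texts = ["good great love", "love great day", "great good fun", "fun love good",
             "bad sad hate", "hate sad day", "sad bad awful", "awful hate bad"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([("count_vectorizer", CountVectorizer()),
                     ("model", SVC())])

# Each candidate C is fitted once per fold, then the best one is refit on all data.
toy_grid = GridSearchCV(toy_pipe, {"model__C": [0.1, 1, 10]}, cv=2)
toy_grid.fit(toy_texts, toy_labels)

print(toy_grid.best_params_)  # best value of C found by cross-validation
print(toy_grid.best_score_)   # mean cross-validated accuracy for that value
```

`GridSearchCV` also accepts an `n_jobs` argument to run the fits in parallel, which may shorten the wall-clock time of a search like the one above, although each individual SVC fit stays expensive on a large sample.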

### Naive Bayes <a name="NB"></a> 

Naive Bayes classifiers are a common baseline for text classification. They assume that the features are conditionally independent given the class. This strong assumption gives them high bias but low variance, so they can perform reasonably well even with limited training data. 


Let's apply the naive bayes algorithm to our sample data set and see how it performs. 

In [31]:
#fit a Gaussian Naive Bayes model 
from sklearn.naive_bayes import GaussianNB

In [33]:
#initialize Gaussian Naive Bayes classifier 
GNB_model = GaussianNB()

#fit the classifier on training data
GNB_model.fit(X_train_transformed.toarray(), y_train)

#evaluate the model on the validation set
GNB_model.score(X_validation_transformed.toarray(), y_validation)

0.6712328767123288

The Gaussian Naive Bayes model reaches a validation accuracy of about 67%, which is lower than the accuracy of the logistic regression model. In general, Naive Bayes tends to outperform logistic regression when little training data is available, while logistic regression tends to outperform Naive Bayes as the training set grows. 

More information on this topic can be found in this [paper](http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf).
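
Gaussian Naive Bayes needs a dense array, which is where the memory cost comes from. One variant that could be tried instead is `MultinomialNB`, which models word counts and accepts sparse input directly. This is a sketch on a made-up corpus, not something evaluated in this notebook, so it says nothing about how the variant would score on the tweet data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels (1 = positive, 0 = negative).
toy_train = ["good great love", "love great day", "bad sad hate", "hate sad day"]
toy_train_labels = [1, 1, 0, 0]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_train)  # stays a sparse matrix

# MultinomialNB is fitted on the sparse counts, no .toarray() needed.
mnb_model = MultinomialNB()
mnb_model.fit(toy_counts, toy_train_labels)

toy_pred = mnb_model.predict(toy_vec.transform(["love great", "sad hate"]))
print(toy_pred)  # [1 0]
```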

## Conclusion  <a name="conclusion"></a> 

In this notebook we tried some computationally and memory intensive algorithms on a 10% sample of the data set. Gaussian Naive Bayes performed worse than the logistic regression model, and the support vector classifier grid search had not finished when it was interrupted, so no accuracy is reported for it here. 

Support vector machines turned out to be very computationally intensive for this study. Even on the 10% sample, the SVC grid search ran for more than 24 hours without completing. 

Rather than waiting for SVC() to finish in this notebook, we will do some additional modelling with deep learning methods in the next notebook.    