<h1 align="center"><font size="5">Survey on Machine Learning Algorithms for IMDB Review Sentiment Classification</font></h1>

## Workflow

#### Source Data Access
    - Acquire the IMDB movie review dataset with sentiment polarity labels from Kaggle in CSV format
#### Data Preprocessing
    - Remove HTML Tags using HTML parser
    - Remove contents inside square brackets using regex
    - Remove special characters using regex
    - Remove stopwords using nltk
    - Stemming of words using nltk
#### Sampling of dataset
    - 5k positive reviews and 5k negative reviews (total 10k reviews) for training. Remaining 40k reviews for testing.
    - 10k positive reviews and 10k negative reviews (total 20k reviews) for training. Remaining 30k reviews for testing.
    - 15k positive reviews and 15k negative reviews (total 30k reviews) for training. Remaining 20k reviews for testing.
#### Word Vectorisation
    - Count vectoriser - Bag of Words model
    - Tfidf Vectoriser - Term Frequency - Inverse Document Frequency model
#### Label Binarizer
    - Converting the sentiment labels to 1s and 0s (Positive and Negative)
#### Modelling and testing 
    - Logistic Regression
    - Multinomial Naive Bayes
    - Support Vector Machines (poly)
    - Support Vector Machines (linear)
    - Support Vector Machines (rbf)
#### Final Report 
    - Accuracy Table
    - Classification report using sklearn.metrics
#### Inference
    - Algorithms with maximum accuracy
    - Precision score for those algorithms
    - A word on runtime and on which algorithm is most cost- and time-effective

In [1]:
import os
import random
import pandas as pd
import numpy as np
import sklearn
import nltk

In [2]:
import re

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize,sent_tokenize
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report,accuracy_score

In [4]:
import warnings
warnings.filterwarnings('ignore')

## Source Data

In [5]:
df = pd.read_csv(os.path.join(str(os.getcwd()),'Desktop','IMDB Dataset.csv'))

In [6]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. `<br /><br />`The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


## Data Preprocessing

In [7]:
df.review[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [8]:
df.review[1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

#### Remove html tags

In [9]:
from bs4 import BeautifulSoup

In [10]:
def strip_html(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

In [11]:
df.review = df.review.apply(strip_html)
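
Because `get_text()` joins text nodes with no separator, the words on either side of a removed `<br />` end up fused (`me.<br /><br />The` becomes `me.The`). The following is a minimal standard-library sketch of one way to avoid that, assuming it is acceptable to put a space at every tag boundary. `TextExtractor` and `strip_html_spaced` are hypothetical names that do not appear in the notebook.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes so they can be joined with a separator."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html_spaced(text):
    """Remove tags and separate the surrounding text with a single space."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    # Join the text nodes with spaces, then collapse repeated whitespace.
    return ' '.join(' '.join(parser.parts).split())


sample = 'happened with me.<br /><br />The first thing'
print(strip_html_spaced(sample))
```

With BeautifulSoup itself, `soup.get_text(separator=' ')` has a similar effect. Either way, inline tags inside a word would also be split, so this is a trade-off rather than a strict improvement.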

In [12]:
df.review[0]


"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

In [13]:
df.review.head(20)

0     One of the other reviewers has mentioned that ...
1     A wonderful little production. The filming tec...
2     I thought this was a wonderful way to spend ti...
3     Basically there's a family where a little boy ...
4     Petter Mattei's "Love in the Time of Money" is...
5     Probably my all-time favorite movie, a story o...
6     I sure would like to see a resurrection of a u...
7     This show was an amazing, fresh & innovative i...
8     Encouraged by the positive comments about this...
9     If you like original gut wrenching laughter yo...
10    Phil the Alien is one of those quirky films wh...
11    I saw this movie when I was about 12 when it c...
12    So im not a big fan of Boll's work but then ag...
13    The cast played Shakespeare.Shakespeare lost.I...
14    This a fantastic movie of three prisoners who ...
15    Kind of drawn in by the erotic scenes, only to...
16    Some films just simply should not be remade. T...
17    This movie made it into one of my top 10 m

#### Remove contents inside a square bracket

In [14]:
def find_square_brackets(text):
    pattern = re.compile(r'\[[^]]*\]')
    matches = pattern.finditer(text)
    for match in matches:
        print(match)

In [15]:
df.review.apply(find_square_brackets)

<_sre.SRE_Match object; span=(3108, 3131), match='[SPOILER . . . I guess]'>
<_sre.SRE_Match object; span=(3796, 4076), match="[DVD tip: as with the simultaneously released Vis>
<_sre.SRE_Match object; span=(179, 185), match='[100%]'>
<_sre.SRE_Match object; span=(427, 432), match='[70%]'>
<_sre.SRE_Match object; span=(535, 540), match='[90%]'>
<_sre.SRE_Match object; span=(677, 683), match='[100%]'>
<_sre.SRE_Match object; span=(756, 761), match='[91%]'>
<_sre.SRE_Match object; span=(0, 93), match="[I saw this movie once late on a public tv statio>
<_sre.SRE_Match object; span=(492, 503), match='[soon Mimi]'>
<_sre.SRE_Match object; span=(16, 22), match='[1986]'>
<_sre.SRE_Match object; span=(71, 86), match='[James Belushi]'>
<_sre.SRE_Match object; span=(188, 205), match='[Christine Tucci]'>
<_sre.SRE_Match object; span=(671, 675), match='[es]'>
<_sre.SRE_Match object; span=(2784, 2874), match='[actress Annabelle Weenick, who also served as th>
<_sre.SRE_Match object; span=(3370, 3422

0        None
1        None
2        None
3        None
4        None
5        None
6        None
7        None
8        None
9        None
10       None
11       None
12       None
13       None
14       None
15       None
16       None
17       None
18       None
19       None
20       None
21       None
22       None
23       None
24       None
25       None
26       None
27       None
28       None
29       None
         ... 
49970    None
49971    None
49972    None
49973    None
49974    None
49975    None
49976    None
49977    None
49978    None
49979    None
49980    None
49981    None
49982    None
49983    None
49984    None
49985    None
49986    None
49987    None
49988    None
49989    None
49990    None
49991    None
49992    None
49993    None
49994    None
49995    None
49996    None
49997    None
49998    None
49999    None
Name: review, Length: 50000, dtype: object

In [16]:
# Remove the square brackets content
def remove_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

In [17]:
df.review = df.review.apply(remove_square_brackets)
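
The pattern `\[[^]]*\]` reads as: a literal `[`, any run of characters other than `]`, then a literal `]`. Since the middle part cannot cross a closing bracket, two bracketed spans in the same review are matched separately. A small sketch on a made-up sentence (the sample text is an assumption, not taken from the dataset):

```python
import re

# Literal '[', any run of characters other than ']', then a literal ']'.
BRACKETED = re.compile(r'\[[^]]*\]')

sample = 'Great cast [James Belushi] and a score of [91%] overall.'

print(BRACKETED.findall(sample))
cleaned = BRACKETED.sub('', sample)
print(cleaned)
```

Note that the removal leaves the surrounding spaces in place, so double spaces can remain where a bracketed span used to be.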

In [18]:
# Verify again
df.review.apply(find_square_brackets)

0        None
1        None
2        None
3        None
4        None
5        None
6        None
7        None
8        None
9        None
10       None
11       None
12       None
13       None
14       None
15       None
16       None
17       None
18       None
19       None
20       None
21       None
22       None
23       None
24       None
25       None
26       None
27       None
28       None
29       None
         ... 
49970    None
49971    None
49972    None
49973    None
49974    None
49975    None
49976    None
49977    None
49978    None
49979    None
49980    None
49981    None
49982    None
49983    None
49984    None
49985    None
49986    None
49987    None
49988    None
49989    None
49990    None
49991    None
49992    None
49993    None
49994    None
49995    None
49996    None
49997    None
49998    None
49999    None
Name: review, Length: 50000, dtype: object

#### Remove special characters

In [19]:
def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

In [20]:
df.review = df.review.apply(remove_special_characters)

In [21]:
df.review[0]

'One of the other reviewers has mentioned that after watching just 1 Oz episode youll be hooked They are right as this is exactly what happened with meThe first thing that struck me about Oz was its brutality and unflinching scenes of violence which set in right from the word GO Trust me this is not a show for the faint hearted or timid This show pulls no punches with regards to drugs sex or violence Its is hardcore in the classic use of the wordIt is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary It focuses mainly on Emerald City an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda Em City is home to manyAryans Muslims gangstas Latinos Christians Italians Irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayI would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare Forget pretty pictur
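
Deleting punctuation outright fuses neighbouring words wherever the reviewer left no space, as in `manyAryans` and `moreso` above. A possible alternative, sketched here under the assumption that a space is an acceptable replacement, substitutes a space and then collapses whitespace. `remove_special_characters_spaced` is a hypothetical helper, not part of the notebook.

```python
import re


def remove_special_characters_spaced(text):
    """Replace each run of non-alphanumeric, non-space characters with a space."""
    spaced = re.sub(r'[^a-zA-Z0-9\s]+', ' ', text)
    # Collapse the extra whitespace the substitution leaves behind.
    return re.sub(r'\s+', ' ', spaced).strip()


sample = "home to many..Aryans, Muslims and more....so scuffles"
print(remove_special_characters_spaced(sample))
```

The trade-off is that contractions are split as well (`you'll` becomes `you ll`), so which variant is preferable depends on the tokens the later steps should see.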

#### Remove Stopwords

Import the necessary nltk modules first

In [22]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

In [23]:
# A set gives constant-time membership tests. The NLTK list is all lowercase,
# so capitalised stopwords such as 'They' or 'The' are not removed by this check.
stopword_list = set(stopwords.words('english'))
def remove_stopwords(text):
    tokens = word_tokenize(text)
    return ' '.join(word for word in tokens if word not in stopword_list)

In [24]:
df.review = df.review.apply(remove_stopwords)
df.review[0]

'One reviewers mentioned watching 1 Oz episode youll hooked They right exactly happened meThe first thing struck Oz brutality unflinching scenes violence set right word GO Trust show faint hearted timid This show pulls punches regards drugs sex violence Its hardcore classic use wordIt called OZ nickname given Oswald Maximum Security State Penitentary It focuses mainly Emerald City experimental section prison cells glass fronts face inwards privacy high agenda Em City home manyAryans Muslims gangstas Latinos Christians Italians Irish moreso scuffles death stares dodgy dealings shady agreements never far awayI would say main appeal show due fact goes shows wouldnt dare Forget pretty pictures painted mainstream audiences forget charm forget romanceOZ doesnt mess around The first episode I ever saw struck nasty surreal I couldnt say I ready I watched I developed taste Oz got accustomed high levels graphic violence Not violence injustice crooked guards wholl sold nickel inmates wholl kill o
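
The output above still contains `They`, `This`, `Its` and `The`, because the comparison against the lowercase stopword list is case-sensitive. A minimal sketch of a case-insensitive variant follows. `STOPWORDS` is a small hypothetical stand-in for `stopwords.words('english')` so that the example runs without NLTK data, and `remove_stopwords_ci` is likewise an illustrative name.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is longer.
STOPWORDS = {'the', 'they', 'are', 'this', 'is', 'a', 'of', 'it', 'its'}


def remove_stopwords_ci(text):
    """Drop tokens whose lowercase form is a stopword; keep the rest unchanged."""
    return ' '.join(w for w in text.split() if w.lower() not in STOPWORDS)


sample = 'They are right The first thing Its is hardcore'
print(remove_stopwords_ci(sample))
```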

#### Stemming of words

In [25]:
pst = PorterStemmer()  # create the stemmer once instead of once per review

def stem_words(text):
    return ' '.join(pst.stem(word) for word in text.split())

In [26]:
df.review = df.review.apply(stem_words)

In [27]:
df.review[0]

'one review mention watch 1 Oz episod youll hook they right exactli happen meth first thing struck Oz brutal unflinch scene violenc set right word GO trust show faint heart timid thi show pull punch regard drug sex violenc it hardcor classic use wordit call OZ nicknam given oswald maximum secur state penitentari It focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda Em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show wouldnt dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doesnt mess around the first episod I ever saw struck nasti surreal I couldnt say I readi I watch I develop tast Oz got accustom high level graphic violenc not violenc injustic crook guard wholl sold nickel inmat wholl kill order get away well manner middl class inmat turn prison bitch due lack street skill p

In [28]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | one review mention watch 1 Oz episod youll hoo... | positive |
| 1 | A wonder littl product the film techniqu unass... | positive |
| 2 | I thought wonder way spend time hot summer wee... | positive |
| 3 | basic there famili littl boy jake think there ... | negative |
| 4 | petter mattei love time money visual stun film... | positive |


In [29]:
df_pos = df[df['sentiment']=='positive']
df_neg = df[df['sentiment']=='negative']

#### Sampling the dataset - 5k pos and 5k negative

In [30]:
sample5_pos = df_pos.sample(5000, random_state=0)
sample5_neg = df_neg.sample(5000, random_state=0)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
sample5_df = pd.concat([sample5_pos, sample5_neg])

In [31]:
# test data preparation: every row that was not sampled for training
# (dropping by index label avoids appending 40k rows one at a time)
test5_df = df.drop(sample5_df.index)

In [32]:
sample5_df = sample5_df.reset_index().drop('index',axis=1)
test5_df = test5_df.reset_index().drop('index',axis=1)

In [33]:
print(sample5_df.shape)
print(test5_df.shape)

(10000, 2)
(40000, 2)
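
The split logic is: draw a fixed number of rows per class with a fixed seed, then treat everything that was not drawn as test data. A standard-library sketch of the same idea on toy labels is shown below. `balanced_split` is a hypothetical helper, and the seed handling mirrors `random_state=0` only in spirit, since it will not reproduce the rows pandas selects.

```python
import random


def balanced_split(labels, per_class, seed=0):
    """Return (train_idx, test_idx): per_class indices of each label for training, the rest for testing."""
    rng = random.Random(seed)
    train = []
    for label in sorted(set(labels)):
        candidates = [i for i, y in enumerate(labels) if y == label]
        train.extend(rng.sample(candidates, per_class))
    train_set = set(train)  # set membership keeps the complement step linear
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


labels = ['positive', 'negative'] * 10   # toy data: 10 reviews of each class
train_idx, test_idx = balanced_split(labels, per_class=3)
print(len(train_idx), len(test_idx))
```

The training part is balanced by construction. The test part is balanced here only because the full dataset is.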


#### Vectorizers

In [34]:
s5norm_train_reviews = sample5_df.review
s5norm_test_reviews = test5_df.review

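#Count vectorizer for bag of words
#NOTE: an integer max_df is an absolute document count, so max_df=1 discards every
#n-gram that appears in more than one training review; max_df=1.0 (a proportion)
#would keep them. The shapes and scores recorded below were produced with max_df=1.
s5cv=CountVectorizer(min_df=0,max_df=1,binary=False,ngram_range=(1,3))

#transformed reviews
s5cv_train_reviews=s5cv.fit_transform(s5norm_train_reviews)
s5cv_test_reviews=s5cv.transform(s5norm_test_reviews)

print('BOW train CV:',s5cv_train_reviews.shape)
print('BOW test CV:',s5cv_test_reviews.shape)

#Tfidf vectorizer (same integer max_df=1 setting as the count vectorizer)
s5tv=TfidfVectorizer(min_df=0,max_df=1,use_idf=True,ngram_range=(1,3))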

#transformed reviews
s5tv_train_reviews=s5tv.fit_transform(s5norm_train_reviews)
s5tv_test_reviews=s5tv.transform(s5norm_test_reviews)

print('Tfidf train:',s5tv_train_reviews.shape)
print('Tfidf test:',s5tv_test_reviews.shape)

BOW train CV: (10000, 1860522)
BOW test CV: (40000, 1860522)
Tfidf train: (10000, 1860522)
Tfidf test: (40000, 1860522)
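
In scikit-learn, an integer `max_df` is an absolute document count while a float is a proportion of documents. With `max_df=1` only n-grams that occur in exactly one training review survive. The sketch below is a plain-Python illustration of that document-frequency filter on a toy corpus. It is not scikit-learn's implementation, and `filter_vocab` is a hypothetical name.

```python
from collections import Counter


def filter_vocab(docs, max_df):
    """Keep terms whose document frequency is at most max_df.

    An int is an absolute document count; a float is a proportion of documents.
    """
    df = Counter(term for doc in docs for term in set(doc.split()))
    limit = max_df if isinstance(max_df, int) else max_df * len(docs)
    return sorted(term for term, count in df.items() if count <= limit)


docs = ['good movie', 'bad movie', 'good plot']
print(filter_vocab(docs, max_df=1))    # int: only terms found in a single document
print(filter_vocab(docs, max_df=1.0))  # float: every term is kept
```

In the toy corpus the shared words `good` and `movie`, which are exactly the ones that carry across documents, disappear under the integer setting.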


#### Label Binarizer

In [35]:
#labeling the sentiment data
lb=LabelBinarizer()
#fit the label mapping on the training labels, then reuse it for the test labels
s5sentiment_train_data=lb.fit_transform(sample5_df['sentiment'])
s5sentiment_test_data=lb.transform(test5_df['sentiment'])
print(s5sentiment_train_data.shape)
print(s5sentiment_test_data.shape)

(10000, 1)
(40000, 1)
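
For two classes, `LabelBinarizer` assigns 0 and 1 by sorted label order, so `negative` maps to 0 and `positive` to 1, and the result is a single column. The sketch below imitates that mapping in plain Python to show why the mapping is learned once and then reused. `fit_label_map` and `transform_labels` are hypothetical helpers.

```python
def fit_label_map(labels):
    """Map each distinct label to an integer by sorted order (two classes -> 0 and 1)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def transform_labels(labels, label_map):
    """Encode labels as a column of integers, one row per label."""
    return [[label_map[label]] for label in labels]


train_labels = ['positive', 'negative', 'positive']
test_labels = ['negative', 'negative']          # a split could lack one class

label_map = fit_label_map(train_labels)         # learned from training labels only
print(label_map)
print(transform_labels(test_labels, label_map))
```

Refitting on the test labels happens to give the same mapping in this notebook because both classes are present in every split. Reusing the fitted mapping removes that dependency.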


#### Modelling

#### Logistic Regression

In [36]:
#training the model
s5lr=LogisticRegression(penalty='l2',max_iter=500,C=10,random_state=42)
#Fitting the model for Bag of words
s5lr_bow=s5lr.fit(s5cv_train_reviews, s5sentiment_train_data)
print(s5lr_bow)
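#Fitting the model for tfidf features
#NOTE: fit() returns the estimator itself, so this refit replaces the BOW fit;
#s5lr, s5lr_bow and s5lr_tfidf all refer to the same TF-IDF-trained model,
#and both predictions below are made by that model.
s5lr_tfidf=s5lr.fit(s5tv_train_reviews, s5sentiment_train_data)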
print(s5lr_tfidf)

##Predicting the model for BOW features
s5lr_bow_predict=s5lr.predict(s5cv_test_reviews)
print(s5lr_bow_predict)
##Predicting the model for tfidf features
s5lr_tfidf_predict=s5lr.predict(s5tv_test_reviews)
print(s5lr_tfidf_predict)

#Accuracy score for bag of words
s5lr_bow_score=accuracy_score(s5sentiment_test_data, s5lr_bow_predict)
print("lr_bow_score :",s5lr_bow_score)
#Accuracy score for tfidf features
s5lr_tfidf_score=accuracy_score(s5sentiment_test_data, s5lr_tfidf_predict)
print("lr_tfidf_score :",s5lr_tfidf_score)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=42, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=42, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
[0 1 1 ... 0 0 1]
[0 1 1 ... 0 0 1]
lr_bow_score : 0.7184
lr_tfidf_score : 0.71895
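
The two scores are nearly identical partly because both predictions come from one model: scikit-learn's `fit()` returns the estimator itself, so fitting the same object twice leaves only the second fit. The toy class below is a hypothetical stand-in (not a scikit-learn estimator) that reproduces just that behaviour.

```python
class MajorityClassifier:
    """Toy estimator: predicts the most common training label. fit() returns self."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


clf = MajorityClassifier()
model_a = clf.fit([[0], [1], [2]], [1, 1, 0])   # majority label is 1
model_b = clf.fit([[0], [1], [2]], [0, 0, 1])   # refit: majority label is now 0

print(model_a is model_b)          # both names refer to the same object
print(model_a.predict([[5]]))      # reflects the second fit, not the first

# Separate instances keep the two fits independent.
first = MajorityClassifier().fit([[0], [1], [2]], [1, 1, 0])
second = MajorityClassifier().fit([[0], [1], [2]], [0, 0, 1])
print(first.predict([[5]]), second.predict([[5]]))
```

Applied to the notebook, the analogous change would be one estimator instance per feature set, or predicting with the BOW model before refitting on TF-IDF features.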


#### Multinomial Naive Bayes

In [37]:
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
# MultinomialNB
s5MNB = MultinomialNB()

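s5MNB_bow = s5MNB.fit(s5cv_train_reviews, s5sentiment_train_data)
# NOTE: refitting the same estimator replaces the BOW fit; both predictions below use the TF-IDF-trained model
s5MNB_tfidf = s5MNB.fit(s5tv_train_reviews, s5sentiment_train_data)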

s5MNB_bow_predict = s5MNB.predict(s5cv_test_reviews)
s5MNB_bow_score = accuracy_score(s5sentiment_test_data, s5MNB_bow_predict)

s5MNB_tfidf_predict = s5MNB.predict(s5tv_test_reviews)
s5MNB_tfidf_score = accuracy_score(s5sentiment_test_data, s5MNB_tfidf_predict)

print("Multinomial NB BOW Score: ", s5MNB_bow_score)
print("Multinomial NB TF-IDF Score: ", s5MNB_tfidf_score)

Multinomial NB BOW Score:  0.719
Multinomial NB TF-IDF Score:  0.719


#### SVM Poly

In [38]:
from sklearn import svm
classifier5_poly = svm.SVC(kernel='poly')

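classifier5_poly.fit(s5cv_train_reviews, s5sentiment_train_data)
# NOTE: refitting the same estimator replaces the BOW fit; both predictions below use the TF-IDF-trained model
classifier5_poly.fit(s5tv_train_reviews, s5sentiment_train_data)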

svc5_bow_pred = classifier5_poly.predict(s5cv_test_reviews)
svc5_tfidf_pred = classifier5_poly.predict(s5tv_test_reviews)

svc5_bow_score = accuracy_score(s5sentiment_test_data, svc5_bow_pred)
svc5_tfidf_score = accuracy_score(s5sentiment_test_data, svc5_tfidf_pred)

print("The accuracy score for poly SVC for BOW model is: ", svc5_bow_score)
print("The accuracy score for poly SVC for TFIDF model is: ", svc5_tfidf_score)

The accuracy score for poly SVC for BOW model is:  0.68145
The accuracy score for poly SVC for TFIDF model is:  0.68145


#### SVM Linear

In [39]:
classifier5_linear = svm.SVC(kernel='linear')

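classifier5_linear.fit(s5cv_train_reviews, s5sentiment_train_data)
# NOTE: refitting the same estimator replaces the BOW fit; both predictions below use the TF-IDF-trained model
classifier5_linear.fit(s5tv_train_reviews, s5sentiment_train_data)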

svc5_bow_pred1 = classifier5_linear.predict(s5cv_test_reviews)
svc5_tfidf_pred1 = classifier5_linear.predict(s5tv_test_reviews)

svc5_bow_score1 = accuracy_score(s5sentiment_test_data, svc5_bow_pred1)
svc5_tfidf_score1 = accuracy_score(s5sentiment_test_data, svc5_tfidf_pred1)

print("The accuracy score for linear SVC for BOW model is: ", svc5_bow_score1)
print("The accuracy score for linear SVC for TFIDF model is: ", svc5_tfidf_score1)

The accuracy score for linear SVC for BOW model is:  0.718175
The accuracy score for linear SVC for TFIDF model is:  0.718175


#### SVM rbf

In [40]:
classifier5_rbf = svm.SVC(kernel='rbf')

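classifier5_rbf.fit(s5cv_train_reviews, s5sentiment_train_data)
# NOTE: refitting the same estimator replaces the BOW fit; both predictions below use the TF-IDF-trained model
classifier5_rbf.fit(s5tv_train_reviews, s5sentiment_train_data)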

svc5_rbf_bow_pred = classifier5_rbf.predict(s5cv_test_reviews)
svc5_rbf_tfidf_pred = classifier5_rbf.predict(s5tv_test_reviews)

svc5_rbf_bow_score = accuracy_score(s5sentiment_test_data, svc5_rbf_bow_pred)
svc5_rbf_tfidf_score = accuracy_score(s5sentiment_test_data, svc5_rbf_tfidf_pred)

print("The accuracy score for rbf SVC for BOW model is: ", svc5_rbf_bow_score)
print("The accuracy score for rbf SVC for TFIDF model is: ", svc5_rbf_tfidf_score)

The accuracy score for rbf SVC for BOW model is:  0.71825
The accuracy score for rbf SVC for TFIDF model is:  0.71735


#### Sampling the dataset - 10k pos and 10k negative

In [41]:
sample10_pos = df_pos.sample(10000, random_state=0)
sample10_neg = df_neg.sample(10000, random_state=0)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
sample10_df = pd.concat([sample10_pos, sample10_neg])

In [42]:
# test data preparation: every row that was not sampled for training
test10_df = df.drop(sample10_df.index)

In [43]:
sample10_df = sample10_df.reset_index().drop('index',axis=1)
test10_df = test10_df.reset_index().drop('index',axis=1)

In [44]:
print(sample10_df.shape)
print(test10_df.shape)

(20000, 2)
(30000, 2)


#### Vectorizer

In [45]:
s10norm_train_reviews = sample10_df.review
s10norm_test_reviews = test10_df.review

#Count vectorizer for bag of words
#NOTE: integer max_df=1 keeps only n-grams found in a single training review
#(max_df=1.0 would keep all); the outputs below were produced with max_df=1.
s10cv=CountVectorizer(min_df=0,max_df=1,binary=False,ngram_range=(1,3))

#transformed reviews
s10cv_train_reviews=s10cv.fit_transform(s10norm_train_reviews)
s10cv_test_reviews=s10cv.transform(s10norm_test_reviews)

print('BOW train CV:',s10cv_train_reviews.shape)
print('BOW test CV:',s10cv_test_reviews.shape)

#Tfidf vectorizer (same integer max_df=1 setting as the count vectorizer)
s10tv=TfidfVectorizer(min_df=0,max_df=1,use_idf=True,ngram_range=(1,3))

#transformed reviews
s10tv_train_reviews=s10tv.fit_transform(s10norm_train_reviews)
s10tv_test_reviews=s10tv.transform(s10norm_test_reviews)

print('Tfidf train:',s10tv_train_reviews.shape)
print('Tfidf test:',s10tv_test_reviews.shape)

BOW train CV: (20000, 3447884)
BOW test CV: (30000, 3447884)
Tfidf train: (20000, 3447884)
Tfidf test: (30000, 3447884)
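
For reference, the TF-IDF weighting can be sketched in a few lines. The version below uses the smoothed form `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalisation of each row, which matches scikit-learn's documented defaults. The function itself is a plain-Python illustration on a toy corpus, with `tfidf` as a hypothetical name, and it assumes every document contains at least one token.

```python
import math
from collections import Counter


def tfidf(docs):
    """Smoothed TF-IDF with L2-normalised rows; returns one {term: weight} dict per document."""
    tokenised = [doc.split() for doc in docs]
    n = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf(['good movie', 'bad movie', 'good plot'])
print(rows[0])
print(rows[1])
```

In the second row the rarer term `bad` receives a larger weight than `movie`, which is the effect the inverse document frequency is meant to have.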


#### Label Binarizer

In [46]:
#labeling the sentiment data
lb=LabelBinarizer()
#fit the label mapping on the training labels, then reuse it for the test labels
s10sentiment_train_data=lb.fit_transform(sample10_df['sentiment'])
s10sentiment_test_data=lb.transform(test10_df['sentiment'])
print(s10sentiment_train_data.shape)
print(s10sentiment_test_data.shape)

(20000, 1)
(30000, 1)


#### Modelling

#### Logistic Regression

In [47]:
#training the model
s10lr=LogisticRegression(penalty='l2',max_iter=500,C=10,random_state=42)
#Fitting the model for Bag of words
s10lr_bow=s10lr.fit(s10cv_train_reviews, s10sentiment_train_data)
print(s10lr_bow)
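#Fitting the model for tfidf features
#NOTE: fit() returns the estimator itself, so this refit replaces the BOW fit;
#both predictions below are made by the TF-IDF-trained model.
s10lr_tfidf=s10lr.fit(s10tv_train_reviews, s10sentiment_train_data)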
print(s10lr_tfidf)

##Predicting the model for BOW features
s10lr_bow_predict=s10lr.predict(s10cv_test_reviews)
print(s10lr_bow_predict)
##Predicting the model for tfidf features
s10lr_tfidf_predict=s10lr.predict(s10tv_test_reviews)
print(s10lr_tfidf_predict)

#Accuracy score for bag of words
s10lr_bow_score=accuracy_score(s10sentiment_test_data, s10lr_bow_predict)
print("lr_bow_score :",s10lr_bow_score)
#Accuracy score for tfidf features
s10lr_tfidf_score=accuracy_score(s10sentiment_test_data, s10lr_tfidf_predict)
print("lr_tfidf_score :",s10lr_tfidf_score)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=42, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=42, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
[0 1 0 ... 1 1 0]
[0 1 0 ... 1 1 0]
lr_bow_score : 0.7347
lr_tfidf_score : 0.7342


#### Multinomial Naive Bayes

In [48]:
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
# MultinomialNB
s10MNB = MultinomialNB()

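s10MNB_bow = s10MNB.fit(s10cv_train_reviews, s10sentiment_train_data)
# NOTE: refitting the same estimator replaces the BOW fit; both predictions below use the TF-IDF-trained model
s10MNB_tfidf = s10MNB.fit(s10tv_train_reviews, s10sentiment_train_data)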

s10MNB_bow_predict = s10MNB.predict(s10cv_test_reviews)
s10MNB_bow_score = accuracy_score(s10sentiment_test_data, s10MNB_bow_predict)

s10MNB_tfidf_predict = s10MNB.predict(s10tv_test_reviews)
s10MNB_tfidf_score = accuracy_score(s10sentiment_test_data, s10MNB_tfidf_predict)

print("Multinomial NB BOW Score: ", s10MNB_bow_score)
print("Multinomial NB TF-IDF Score: ", s10MNB_tfidf_score)

Multinomial NB BOW Score:  0.7353333333333333
Multinomial NB TF-IDF Score:  0.7353333333333333


#### SVM Poly

In [49]:
from sklearn import svm
classifier10_poly = svm.SVC(kernel='poly')

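classifier10_poly.fit(s10cv_train_reviews, s10sentiment_train_data)
# NOTE: refitting the same estimator replaces the BOW fit; both predictions below use the TF-IDF-trained model
classifier10_poly.fit(s10tv_train_reviews, s10sentiment_train_data)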

svc10_bow_pred = classifier10_poly.predict(s10cv_test_reviews)
svc10_tfidf_pred = classifier10_poly.predict(s10tv_test_reviews)

svc10_bow_score = accuracy_score(s10sentiment_test_data, svc10_bow_pred)
svc10_tfidf_score = accuracy_score(s10sentiment_test_data, svc10_tfidf_pred)

print("The accuracy score for poly SVC for BOW model is: ", svc10_bow_score)
print("The accuracy score for poly SVC for TFIDF model is: ", svc10_tfidf_score)

The accuracy score for poly SVC for BOW model is:  0.6944333333333333
The accuracy score for poly SVC for TFIDF model is:  0.6944333333333333


#### SVM Linear

In [50]:
classifier10_linear = svm.SVC(kernel='linear')

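classifier10_linear.fit(s10cv_train_reviews, s10sentiment_train_data)
# NOTE: refitting the same estimator replaces the BOW fit; both predictions below use the TF-IDF-trained model
classifier10_linear.fit(s10tv_train_reviews, s10sentiment_train_data)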

svc10_bow_pred1 = classifier10_linear.predict(s10cv_test_reviews)
svc10_tfidf_pred1 = classifier10_linear.predict(s10tv_test_reviews)

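# Score the 10k predictions against the 10k test labels. The output recorded
# below (0.718175) equals the 5k linear SVC score because it was computed from
# s5sentiment_test_data and the svc5_*_pred1 predictions.
svc10_bow_score1 = accuracy_score(s10sentiment_test_data, svc10_bow_pred1)
svc10_tfidf_score1 = accuracy_score(s10sentiment_test_data, svc10_tfidf_pred1)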

print("The accuracy score for linear SVC for BOW model is: ", svc10_bow_score1)
print("The accuracy score for linear SVC for TFIDF model is: ", svc10_tfidf_score1)

The accuracy score for linear SVC for BOW model is:  0.718175
The accuracy score for linear SVC for TFIDF model is:  0.718175


#### SVM rbf

In [51]:
classifier10_rbf = svm.SVC(kernel='rbf')

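classifier10_rbf.fit(s10cv_train_reviews, s10sentiment_train_data)
# NOTE: refitting the same estimator replaces the BOW fit; both predictions below use the TF-IDF-trained model
classifier10_rbf.fit(s10tv_train_reviews, s10sentiment_train_data)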

svc10_rbf_bow_pred = classifier10_rbf.predict(s10cv_test_reviews)
svc10_rbf_tfidf_pred = classifier10_rbf.predict(s10tv_test_reviews)

svc10_rbf_bow_score = accuracy_score(s10sentiment_test_data, svc10_rbf_bow_pred)
svc10_rbf_tfidf_score = accuracy_score(s10sentiment_test_data, svc10_rbf_tfidf_pred)

print("The accuracy score for rbf SVC for BOW model is: ", svc10_rbf_bow_score)
print("The accuracy score for rbf SVC for TFIDF model is: ", svc10_rbf_tfidf_score)

The accuracy score for rbf SVC for BOW model is:  0.5418333333333333
The accuracy score for rbf SVC for TFIDF model is:  0.5047666666666667


#### Sampling the dataset - 15k pos and 15k negative

In [52]:
sample15_pos = df_pos.sample(15000, random_state=0)
sample15_neg = df_neg.sample(15000, random_state=0)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
sample15_df = pd.concat([sample15_pos, sample15_neg])

In [53]:
# test data preparation: every row that was not sampled for training
test15_df = df.drop(sample15_df.index)

In [54]:
sample15_df = sample15_df.reset_index().drop('index',axis=1)
test15_df = test15_df.reset_index().drop('index',axis=1)

In [55]:
print(sample15_df.shape)
print(test15_df.shape)

(30000, 2)
(20000, 2)


#### Vectorizer

In [56]:
s15norm_train_reviews = sample15_df.review
s15norm_test_reviews = test15_df.review

#Count vectorizer for bag of words
#NOTE: integer max_df=1 keeps only n-grams found in a single training review
#(max_df=1.0 would keep all); the outputs below were produced with max_df=1.
s15cv=CountVectorizer(min_df=0,max_df=1,binary=False,ngram_range=(1,3))

#transformed reviews
s15cv_train_reviews=s15cv.fit_transform(s15norm_train_reviews)
s15cv_test_reviews=s15cv.transform(s15norm_test_reviews)

print('BOW train CV:',s15cv_train_reviews.shape)
print('BOW test CV:',s15cv_test_reviews.shape)

#Tfidf vectorizer (same integer max_df=1 setting as the count vectorizer)
s15tv=TfidfVectorizer(min_df=0,max_df=1,use_idf=True,ngram_range=(1,3))

#transformed reviews
s15tv_train_reviews=s15tv.fit_transform(s15norm_train_reviews)
s15tv_test_reviews=s15tv.transform(s15norm_test_reviews)

print('Tfidf train:',s15tv_train_reviews.shape)
print('Tfidf test:',s15tv_test_reviews.shape)

BOW train CV: (30000, 4906126)
BOW test CV: (20000, 4906126)
Tfidf train: (30000, 4906126)
Tfidf test: (20000, 4906126)


#### Label Binarizer

In [57]:
#labeling the sentiment data
lb=LabelBinarizer()
#fit the label mapping on the training labels, then reuse it for the test labels
s15sentiment_train_data=lb.fit_transform(sample15_df['sentiment'])
s15sentiment_test_data=lb.transform(test15_df['sentiment'])
print(s15sentiment_train_data.shape)
print(s15sentiment_test_data.shape)

(30000, 1)
(20000, 1)


#### Modelling

#### Logistic Regression

In [58]:
#training the model
s15lr=LogisticRegression(penalty='l2',max_iter=500,C=10,random_state=42)
#Fitting the model for Bag of words
s15lr_bow=s15lr.fit(s15cv_train_reviews, s15sentiment_train_data)
print(s15lr_bow)
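#Fitting the model for tfidf features
#NOTE: fit() returns the estimator itself, so this refit replaces the BOW fit;
#both predictions below are made by the TF-IDF-trained model.
s15lr_tfidf=s15lr.fit(s15tv_train_reviews, s15sentiment_train_data)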
print(s15lr_tfidf)

##Predicting the model for BOW features
s15lr_bow_predict=s15lr.predict(s15cv_test_reviews)
print(s15lr_bow_predict)
##Predicting the model for tfidf features
s15lr_tfidf_predict=s15lr.predict(s15tv_test_reviews)
print(s15lr_tfidf_predict)

#Accuracy score for bag of words
s15lr_bow_score=accuracy_score(s15sentiment_test_data, s15lr_bow_predict)
print("lr_bow_score :",s15lr_bow_score)
#Accuracy score for tfidf features
s15lr_tfidf_score=accuracy_score(s15sentiment_test_data, s15lr_tfidf_predict)
print("lr_tfidf_score :",s15lr_tfidf_score)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=42, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=500, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=42, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
[1 0 1 ... 1 0 0]
[1 0 1 ... 1 0 0]
lr_bow_score : 0.73845
lr_tfidf_score : 0.7359


#### Multinomial Naive Bayes

In [59]:
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
# MultinomialNB
s15MNB = MultinomialNB()

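s15MNB_bow = s15MNB.fit(s15cv_train_reviews, s15sentiment_train_data)
# NOTE: refitting the same estimator replaces the BOW fit; both predictions below use the TF-IDF-trained model
s15MNB_tfidf = s15MNB.fit(s15tv_train_reviews, s15sentiment_train_data)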

s15MNB_bow_predict = s15MNB.predict(s15cv_test_reviews)
s15MNB_bow_score = accuracy_score(s15sentiment_test_data, s15MNB_bow_predict)

s15MNB_tfidf_predict = s15MNB.predict(s15tv_test_reviews)
s15MNB_tfidf_score = accuracy_score(s15sentiment_test_data, s15MNB_tfidf_predict)

print("Multinomial NB BOW Score: ", s15MNB_bow_score)
print("Multinomial NB TF-IDF Score: ", s15MNB_tfidf_score)

Multinomial NB BOW Score:  0.7388
Multinomial NB TF-IDF Score:  0.7388


#### SVM Poly

In [60]:
from sklearn import svm
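# One classifier per feature set, so the TF-IDF fit does not overwrite the BOW fit
classifier15_poly_bow = svm.SVC(kernel='poly')
classifier15_poly_tfidf = svm.SVC(kernel='poly')

classifier15_poly_bow.fit(s15cv_train_reviews, s15sentiment_train_data)
classifier15_poly_tfidf.fit(s15tv_train_reviews, s15sentiment_train_data)

svc15_bow_pred = classifier15_poly_bow.predict(s15cv_test_reviews)
svc15_tfidf_pred = classifier15_poly_tfidf.predict(s15tv_test_reviews)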

svc15_bow_score = accuracy_score(s15sentiment_test_data, svc15_bow_pred)
svc15_tfidf_score = accuracy_score(s15sentiment_test_data, svc15_tfidf_pred)

print("The accuracy score for poly SVC for BOW model is: ", svc15_bow_score)
print("The accuracy score for poly SVC for TFIDF model is: ", svc15_tfidf_score)

The accuracy score for poly SVC for BOW model is:  0.7023
The accuracy score for poly SVC for TFIDF model is:  0.7023


#### SVM Linear

In [61]:
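# One classifier per feature set, so the TF-IDF fit does not overwrite the BOW fit
classifier15_linear_bow = svm.SVC(kernel='linear')
classifier15_linear_tfidf = svm.SVC(kernel='linear')

classifier15_linear_bow.fit(s15cv_train_reviews, s15sentiment_train_data)
classifier15_linear_tfidf.fit(s15tv_train_reviews, s15sentiment_train_data)

svc15_bow_pred1 = classifier15_linear_bow.predict(s15cv_test_reviews)
svc15_tfidf_pred1 = classifier15_linear_tfidf.predict(s15tv_test_reviews)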

svc15_bow_score1 = accuracy_score(s15sentiment_test_data, svc15_bow_pred1)
svc15_tfidf_score1 = accuracy_score(s15sentiment_test_data, svc15_tfidf_pred1)

print("The accuracy score for linear SVC for BOW model is: ", svc15_bow_score1)
print("The accuracy score for linear SVC for TFIDF model is: ", svc15_tfidf_score1)

The accuracy score for linear SVC for BOW model is:  0.73915
The accuracy score for linear SVC for TFIDF model is:  0.73915


#### SVM rbf

In [62]:
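# One classifier per feature set, so the TF-IDF fit does not overwrite the BOW fit
classifier15_rbf_bow = svm.SVC(kernel='rbf')
classifier15_rbf_tfidf = svm.SVC(kernel='rbf')

classifier15_rbf_bow.fit(s15cv_train_reviews, s15sentiment_train_data)
classifier15_rbf_tfidf.fit(s15tv_train_reviews, s15sentiment_train_data)

svc15_rbf_bow_pred = classifier15_rbf_bow.predict(s15cv_test_reviews)
svc15_rbf_tfidf_pred = classifier15_rbf_tfidf.predict(s15tv_test_reviews)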

svc15_rbf_bow_score = accuracy_score(s15sentiment_test_data, svc15_rbf_bow_pred)
svc15_rbf_tfidf_score = accuracy_score(s15sentiment_test_data, svc15_rbf_tfidf_pred)

print("The accuracy score for rbf SVC for BOW model is: ", svc15_rbf_bow_score)
print("The accuracy score for rbf SVC for TFIDF model is: ", svc15_rbf_tfidf_score)

The accuracy score for rbf SVC for BOW model is:  0.51035
The accuracy score for rbf SVC for TFIDF model is:  0.5
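
The models above differ in how long they take to train. One way to check this on a given machine is to time `fit` for each model. The following is a rough sketch on a tiny made-up corpus (the reviews and labels are placeholders, not the IMDB data, and the `fit_seconds` dictionary is only illustrative). Timings depend on hardware, and the gap between models only becomes visible at realistic data sizes.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Placeholder corpus, invented for illustration; substitute the real training split
reviews = ["great movie loved it", "terrible plot bad acting",
           "wonderful and moving", "boring waste of time"] * 50
labels = [1, 0, 1, 0] * 50

features = TfidfVectorizer().fit_transform(reviews)

models = {
    "Multinomial NB": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=500),
    "SVM - Linear": SVC(kernel="linear"),
}

# Record the wall-clock time spent in fit() for each model
fit_seconds = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(features, labels)
    fit_seconds[name] = time.perf_counter() - start
    print(f"{name}: {fit_seconds[name]:.4f} s")
```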


## Final Report

#### 1. Accuracy (%)
#### Bag of Words

| Algorithm           | 10k training reviews | 20k training reviews | 30k training reviews |
|---------------------|----------------------|----------------------|----------------------|
| Logistic Regression | 71.84                | 73.47                | 73.84                |
| Multinomial NB      | 71.90                | 73.53                | 73.88                |
| SVM - Linear        | 71.81                | 71.81                | 73.91                |
| SVM - Poly          | 68.14                | 69.40                | 70.23                |
| SVM - RBF           | 71.82                | 54.18                | 51.03                |

#### TFIDF

| Algorithm           | 10k training reviews | 20k training reviews | 30k training reviews |
|---------------------|----------------------|----------------------|----------------------|
| Logistic Regression | 71.89                | 73.42                | 73.59                |
| Multinomial NB      | 71.90                | 73.53                | 73.88                |
| SVM - Linear        | 71.81                | 71.81                | 73.91                |
| SVM - Poly          | 68.14                | 69.40                | 70.23                |
| SVM - RBF           | 71.73                | 50.47                | 50.00                |
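
The effect of training size can be quantified directly from the table values. The following is a small sketch using the Bag of Words accuracies at 10k and 30k training reviews, copied from the first table (the names `bow_accuracy` and `gains` are illustrative).

```python
# Bag of Words accuracy (%) at 10k and 30k training reviews, copied from the table
bow_accuracy = {
    "Logistic Regression": (71.84, 73.84),
    "Multinomial NB": (71.90, 73.88),
    "SVM - Linear": (71.81, 73.91),
    "SVM - Poly": (68.14, 70.23),
    "SVM - RBF": (71.82, 51.03),
}

# Change in percentage points when moving from 10k to 30k training reviews
gains = {
    name: round(at_30k - at_10k, 2)
    for name, (at_10k, at_30k) in bow_accuracy.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f} percentage points")
```

Four of the five models gain about two percentage points. The RBF kernel is the exception: its accuracy falls as the training set grows.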


#### 2. Classification Report - Precision, Recall and F1 Score

#### 10K Sample

In [63]:
#Logistic Regression
lr5_bow_report=classification_report(s5sentiment_test_data,s5lr_bow_predict,target_names=['Positive','Negative'])
lr5_tfidf_report=classification_report(s5sentiment_test_data,s5lr_tfidf_predict,target_names=['Positive','Negative'])
print("Logistic Regression - Bag of Words: \n",lr5_bow_report)
print("Logistic Regression - TFIDF: \n",lr5_tfidf_report)

#Multinomial Naives Bayes
mnb5_bow_report=classification_report(s5sentiment_test_data,s5MNB_bow_predict,target_names=['Positive','Negative'])
mnb5_tfidf_report=classification_report(s5sentiment_test_data,s5MNB_tfidf_predict,target_names=['Positive','Negative'])
print("Multinomial Naives Bayes - Bag of Words: \n",mnb5_bow_report)
print("Multinomial Naives Bayes - TFIDF: \n",mnb5_tfidf_report)


#Support Vector Machines - Poly
svc5_poly_bow_report=classification_report(s5sentiment_test_data,svc5_bow_pred,target_names=['Positive','Negative'])
svc5_poly_tfidf_report=classification_report(s5sentiment_test_data,svc5_tfidf_pred,target_names=['Positive','Negative'])
print("SVM kernel='poly' - Bag of Words: \n",svc5_poly_bow_report)
print("SVM kernel='poly' - TFIDF: \n",svc5_poly_tfidf_report)

#Support Vector Machines - Linear
svc5_linear_bow_report=classification_report(s5sentiment_test_data,svc5_bow_pred1,target_names=['Positive','Negative'])
svc5_linear_tfidf_report=classification_report(s5sentiment_test_data,svc5_tfidf_pred1,target_names=['Positive','Negative'])
print("SVM kernel='linear' - Bag of Words: \n",svc5_linear_bow_report)
print("SVM kernel='linear' - TFIDF: \n",svc5_linear_tfidf_report)

#Support Vector Machines - rbf
svc5_rbf_bow_report=classification_report(s5sentiment_test_data,svc5_rbf_bow_pred,target_names=['Positive','Negative'])
svc5_rbf_tfidf_report=classification_report(s5sentiment_test_data,svc5_rbf_tfidf_pred,target_names=['Positive','Negative'])
print("SVM kernel='rbf' - Bag of Words: \n",svc5_rbf_bow_report)
print("SVM kernel='rbf' - TFIDF: \n",svc5_rbf_tfidf_report)


Logistic Regression - Bag of Words: 
              precision    recall  f1-score   support

   Positive       0.72      0.72      0.72     20000
   Negative       0.72      0.72      0.72     20000

avg / total       0.72      0.72      0.72     40000

Logistic Regression - TFIDF: 
              precision    recall  f1-score   support

   Positive       0.72      0.72      0.72     20000
   Negative       0.72      0.71      0.72     20000

avg / total       0.72      0.72      0.72     40000

Multinomial Naives Bayes - Bag of Words: 
              precision    recall  f1-score   support

   Positive       0.72      0.72      0.72     20000
   Negative       0.72      0.71      0.72     20000

avg / total       0.72      0.72      0.72     40000

Multinomial Naives Bayes - TFIDF: 
              precision    recall  f1-score   support

   Positive       0.72      0.72      0.72     20000
   Negative       0.72      0.71      0.72     20000

avg / total       0.72      0.72      0.72     40000

#### 20K Sample

In [64]:
#Logistic Regression
lr10_bow_report=classification_report(s10sentiment_test_data,s10lr_bow_predict,target_names=['Positive','Negative'])
lr10_tfidf_report=classification_report(s10sentiment_test_data,s10lr_tfidf_predict,target_names=['Positive','Negative'])
print("Logistic Regression - Bag of Words: \n",lr10_bow_report)
print("Logistic Regression - TFIDF: \n",lr10_tfidf_report)

#Multinomial Naives Bayes
mnb10_bow_report=classification_report(s10sentiment_test_data,s10MNB_bow_predict,target_names=['Positive','Negative'])
mnb10_tfidf_report=classification_report(s10sentiment_test_data,s10MNB_tfidf_predict,target_names=['Positive','Negative'])
print("Multinomial Naives Bayes - Bag of Words: \n",mnb10_bow_report)
print("Multinomial Naives Bayes - TFIDF: \n",mnb10_tfidf_report)

#Support Vector Machines - Poly
svc10_poly_bow_report=classification_report(s10sentiment_test_data,svc10_bow_pred,target_names=['Positive','Negative'])
svc10_poly_tfidf_report=classification_report(s10sentiment_test_data,svc10_tfidf_pred,target_names=['Positive','Negative'])
print("SVM kernel='poly' - Bag of Words: \n",svc10_poly_bow_report)
print("SVM kernel='poly' - TFIDF: \n",svc10_poly_tfidf_report)

#Support Vector Machines - Linear
svc10_linear_bow_report=classification_report(s10sentiment_test_data,svc10_bow_pred1,target_names=['Positive','Negative'])
svc10_linear_tfidf_report=classification_report(s10sentiment_test_data,svc10_tfidf_pred1,target_names=['Positive','Negative'])
print("SVM kernel='linear' - Bag of Words: \n",svc10_linear_bow_report)
print("SVM kernel='linear' - TFIDF: \n",svc10_linear_tfidf_report)

#Support Vector Machines - rbf
svc10_rbf_bow_report=classification_report(s10sentiment_test_data,svc10_rbf_bow_pred,target_names=['Positive','Negative'])
svc10_rbf_tfidf_report=classification_report(s10sentiment_test_data,svc10_rbf_tfidf_pred,target_names=['Positive','Negative'])
print("SVM kernel='rbf' - Bag of Words: \n",svc10_rbf_bow_report)
print("SVM kernel='rbf' - TFIDF: \n",svc10_rbf_tfidf_report)



Logistic Regression - Bag of Words: 
              precision    recall  f1-score   support

   Positive       0.74      0.73      0.73     15000
   Negative       0.73      0.74      0.74     15000

avg / total       0.73      0.73      0.73     30000

Logistic Regression - TFIDF: 
              precision    recall  f1-score   support

   Positive       0.72      0.76      0.74     15000
   Negative       0.75      0.71      0.73     15000

avg / total       0.73      0.73      0.73     30000

Multinomial Naives Bayes - Bag of Words: 
              precision    recall  f1-score   support

   Positive       0.73      0.74      0.74     15000
   Negative       0.74      0.73      0.73     15000

avg / total       0.74      0.74      0.74     30000

Multinomial Naives Bayes - TFIDF: 
              precision    recall  f1-score   support

   Positive       0.73      0.74      0.74     15000
   Negative       0.74      0.73      0.73     15000

avg / total       0.74      0.74      0.74     30000

#### 30K Sample

In [65]:
#Logistic Regression
lr15_bow_report=classification_report(s15sentiment_test_data,s15lr_bow_predict,target_names=['Positive','Negative'])
lr15_tfidf_report=classification_report(s15sentiment_test_data,s15lr_tfidf_predict,target_names=['Positive','Negative'])
print("Logistic Regression - Bag of Words: \n",lr15_bow_report)
print("Logistic Regression - TFIDF: \n",lr15_tfidf_report)

#Multinomial Naives Bayes
mnb15_bow_report=classification_report(s15sentiment_test_data,s15MNB_bow_predict,target_names=['Positive','Negative'])
mnb15_tfidf_report=classification_report(s15sentiment_test_data,s15MNB_tfidf_predict,target_names=['Positive','Negative'])
print("Multinomial Naives Bayes - Bag of Words: \n",mnb15_bow_report)
print("Multinomial Naives Bayes - TFIDF: \n",mnb15_tfidf_report)

#Support Vector Machines - Poly
svc15_poly_bow_report=classification_report(s15sentiment_test_data,svc15_bow_pred,target_names=['Positive','Negative'])
svc15_poly_tfidf_report=classification_report(s15sentiment_test_data,svc15_tfidf_pred,target_names=['Positive','Negative'])
print("SVM kernel='poly' - Bag of Words: \n",svc15_poly_bow_report)
print("SVM kernel='poly' - TFIDF: \n",svc15_poly_tfidf_report)

#Support Vector Machines - Linear
svc15_linear_bow_report=classification_report(s15sentiment_test_data,svc15_bow_pred1,target_names=['Positive','Negative'])
svc15_linear_tfidf_report=classification_report(s15sentiment_test_data,svc15_tfidf_pred1,target_names=['Positive','Negative'])
print("SVM kernel='linear' - Bag of Words: \n",svc15_linear_bow_report)
print("SVM kernel='linear' - TFIDF: \n",svc15_linear_tfidf_report)

#Support Vector Machines - rbf
svc15_rbf_bow_report=classification_report(s15sentiment_test_data,svc15_rbf_bow_pred,target_names=['Positive','Negative'])
svc15_rbf_tfidf_report=classification_report(s15sentiment_test_data,svc15_rbf_tfidf_pred,target_names=['Positive','Negative'])
print("SVM kernel='rbf' - Bag of Words: \n",svc15_rbf_bow_report)
print("SVM kernel='rbf' - TFIDF: \n",svc15_rbf_tfidf_report)


Logistic Regression - Bag of Words: 
              precision    recall  f1-score   support

   Positive       0.74      0.74      0.74     10000
   Negative       0.74      0.74      0.74     10000

avg / total       0.74      0.74      0.74     20000

Logistic Regression - TFIDF: 
              precision    recall  f1-score   support

   Positive       0.72      0.77      0.75     10000
   Negative       0.75      0.70      0.73     10000

avg / total       0.74      0.74      0.74     20000

Multinomial Naives Bayes - Bag of Words: 
              precision    recall  f1-score   support

   Positive       0.73      0.75      0.74     10000
   Negative       0.74      0.73      0.74     10000

avg / total       0.74      0.74      0.74     20000

Multinomial Naives Bayes - TFIDF: 
              precision    recall  f1-score   support

   Positive       0.73      0.75      0.74     10000
   Negative       0.74      0.73      0.74     10000

avg / total       0.74      0.74      0.74     20000
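
When reading these reports, note that `classification_report` assigns `target_names` to the labels in sorted order, so with 0/1 labels the first name describes class 0. The following is a minimal sketch with made-up labels. It assumes, as the Label Binarizer step in the workflow suggests, that 1 encodes positive and 0 encodes negative. Under that assumption the names would be passed as `['Negative', 'Positive']`.

```python
from sklearn.metrics import classification_report

# Toy labels, invented for illustration: 1 = positive, 0 = negative (assumed encoding)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# target_names follow the sorted label order: index 0 -> label 0, index 1 -> label 1
report = classification_report(
    y_true, y_pred, target_names=['Negative', 'Positive'], output_dict=True
)

# Class 0 is recovered 2 times out of 3, so its recall is 2/3
print(report['Negative']['recall'])
# Class 1 is predicted 4 times, 3 of them correctly, so its precision is 3/4
print(report['Positive']['precision'])
```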

#### Inference

- When the training set grows from 10k to 30k reviews, accuracy rises by about two percentage points for the following algorithms (accuracy at 30k shown, as in the Final Report tables):
    - Bag of Words Vectoriser
        - Logistic Regression - 73.84%
        - Multinomial Naive Bayes - 73.88%
        - Support Vector Machines: Linear Kernel - 73.91%
        - Support Vector Machines: Polynomial Kernel - 70.23%
    - TF-IDF Vectoriser
        - Logistic Regression - 73.59%
        - Multinomial Naive Bayes - 73.88%
        - Support Vector Machines: Linear Kernel - 73.91%
        - Support Vector Machines: Polynomial Kernel - 70.23%



- Looking at the classification reports (precision, recall and F1 score) for the algorithms above, we can conclude that:
    - For the Bag of Words vectorised model, Support Vector Machines (linear kernel) reaches the highest accuracy (73.91%) at the largest training size, with a precision of 74%.
    - For the TF-IDF vectorised model, Support Vector Machines (linear kernel) also reaches the highest accuracy (73.91%) at the largest training size, with a precision of 74%.



- On the other hand, SVM takes considerably longer to compute than Logistic Regression or Naive Bayes, so on constrained hardware it would be advisable to choose either Multinomial Naive Bayes or Logistic Regression.


- Also, with a powerful GPU, more advanced approaches such as ensemble learning could be applied to improve accuracy without greatly increasing computing time.
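
As one possible form of ensemble learning, scikit-learn's `VotingClassifier` can combine the two cheaper models. The following is a minimal sketch on a made-up corpus, not a tuned or benchmarked configuration. The choice of estimators, the soft-voting setting and the toy reviews are all assumptions made for illustration.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Placeholder reviews and labels (1 = positive, 0 = negative), invented for illustration
train_reviews = ["great movie loved it", "terrible plot bad acting",
                 "wonderful and moving", "boring waste of time"]
train_labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(train_reviews)

# Soft voting averages the class probabilities predicted by each estimator
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=500, random_state=42)),
        ("mnb", MultinomialNB()),
    ],
    voting="soft",
)
ensemble.fit(train_features, train_labels)

# Reuse the fitted vocabulary when transforming unseen reviews
test_features = vectorizer.transform(["loved it", "bad acting"])
predictions = ensemble.predict(test_features)
print(predictions)
```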