# Text Sentiment Analysis Using Amazon Review Data

Predicting a review's numeric rating from its text is a quintessential multiclass text classification problem and an active research topic in natural language processing (NLP). Advances such as GloVe and Word2Vec have widened the range of approaches available for this task, and insights gained here can generalize to sentiment analysis and other multiclass NLP classification problems. We use the large corpus of labeled Amazon review data to apply sentiment analysis.

```Objective:``` Given the text of a book review, predict one of the three sentiment classes ```(positive, neutral, negative)```.



### Importing the required libraries

In [1]:
# importing the libraries related to the data manipulation.
import numpy as np
import pandas as pd
import string

# importing the libraries related to the data_visualization.
import seaborn as sns
import matplotlib.pyplot as plt

# TextBlob is a Python library for processing textual data.
from textblob import TextBlob

# To avoid warnings
import warnings
warnings.filterwarnings("ignore")

### Loading the Book Review Dataset

In [2]:
# datapath = E:\DOWNLOADS\book_review.csv
Book_review = pd.read_csv(r'E:\DOWNLOADS\book_review.csv')
Book_review

Unnamed: 0,unique_id,asin,product_name,product_type,helpful,rating,title,date,reviewer,reviewer_location,review_text
0,1884956068:a_mainstay_reference_for_spanish-sp...,1884956068,Manual pediátrico para los dueños del nuevo be...,books,7 of 8,5.0,A mainstay reference for Spanish-speaking home...,"April 6, 2000",Midwest Book Review,"Oregon, WI USA",This all-Spanish handbook for parents with new...
1,0679728740:you_want_the_necrophiliac_to_escape...,0679728740,Child of God: Books: Cormac Mccarthy,books,0 of 1,5.0,You want the necrophiliac to escape?,"October 20, 2006",Brian Asquith,,McCarthy's writing and portrayal of Lester Bal...
2,0679728740:low_brow_and_juvenile:bruce_miller,0679728740,Child of God: Books: Cormac Mccarthy,books,2 of 5,2.0,Low brow and juvenile,"September 26, 2006",Bruce Miller,"Shippensburg, PA United States",Do you giggle uncontrollably when poking corps...
3,"0679728740:mccarthy,_a_brave_writer_with_an_in...",0679728740,Child of God: Books: Cormac Mccarthy,books,4 of 5,5.0,"McCarthy, a brave writer with an incredible co...","July 24, 2005","Christopher Davis ""Christopher E.D.""","Cleveland, MS",I was initiated into the world of Cormac McCar...
4,0679728740:sevierville_in_child_of_god:alex_jo...,0679728740,Child of God: Books: Cormac Mccarthy,books,11 of 14,4.0,SEVIERVILLE in Child of God,"November 20, 2002",Alex Johnson,"Sevierville, Tennessee",I cannot speak to the literary points in the n...
...,...,...,...,...,...,...,...,...,...,...,...
41995,0972894403:that's_it!:terrie_cash,0972894403,"Open Your Mind, Open Your Heart: Books: W. Mar...",books,,5.0,That's it!,"May 10, 2004",Terrie Cash,USA,"In reading this book, I traveled on a wonderfu..."
41996,"0972894403:open_your_heart,_open_your_mind:hel...",0972894403,"Open Your Mind, Open Your Heart: Books: W. Mar...",books,,5.0,"Open Your Heart, Open Your Mind","January 13, 2004",helen bellanger,"apo, ap United States",This wonderful book touches the inner soul.You...
41997,0972894403:living_god's_will:patricia_newsome,0972894403,"Open Your Mind, Open Your Heart: Books: W. Mar...",books,1 of 1,5.0,Living God's Will,"January 8, 2004",Patricia Newsome,USA,I am a 45 year old woman who has been on a que...
41998,"0972894403:it_is_a_blessing:michael_abney,_sr.",0972894403,"Open Your Mind, Open Your Heart: Books: W. Mar...",books,,5.0,It is a Blessing,"January 7, 2004","Michael Abney, Sr.",,It is written that you should hide the Word of...


In [3]:
# shape of the actual dataframe
Book_review.shape

(42000, 11)

The dataset has ```42000 rows and 11 columns```.

In [4]:
# display the names of various columns present in the dataframe
Book_review.columns

Index(['unique_id', 'asin', 'product_name', 'product_type', 'helpful',
       'rating', 'title', 'date', 'reviewer', 'reviewer_location',
       'review_text'],
      dtype='object')

In [5]:
# Find the information about the given dataFrame including the index dtype and column dtypes, non-null values and memory usage.
Book_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42000 entries, 0 to 41999
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   unique_id          42000 non-null  object 
 1   asin               42000 non-null  object 
 2   product_name       42000 non-null  object 
 3   product_type       42000 non-null  object 
 4   helpful            34398 non-null  object 
 5   rating             42000 non-null  float64
 6   title              41993 non-null  object 
 7   date               41993 non-null  object 
 8   reviewer           42000 non-null  object 
 9   reviewer_location  36876 non-null  object 
 10  review_text        42000 non-null  object 
dtypes: float64(1), object(10)
memory usage: 3.5+ MB


In [6]:
# Finding the total number of null values, if present
Book_review.isnull().sum()

unique_id               0
asin                    0
product_name            0
product_type            0
helpful              7602
rating                  0
title                   7
date                    7
reviewer                0
reviewer_location    5124
review_text             0
dtype: int64

#### ```Considering only the "review_text" feature of the dataframe for our further analysis:```

In [7]:
# take an explicit copy so that adding columns does not trigger SettingWithCopyWarning
df = Book_review[["review_text"]].copy()
df["review_text"] = df["review_text"].astype(str)
df.head()

Unnamed: 0,review_text
0,This all-Spanish handbook for parents with new...
1,McCarthy's writing and portrayal of Lester Bal...
2,Do you giggle uncontrollably when poking corps...
3,I was initiated into the world of Cormac McCar...
4,I cannot speak to the literary points in the n...


In [8]:
# shape of the review data
df.shape

(42000, 1)

### 1. Lowercasing the text of review_text:

In [9]:
df["lowercase_text"] = df["review_text"].str.lower()
df.head()

Unnamed: 0,review_text,lowercase_text
0,This all-Spanish handbook for parents with new...,this all-spanish handbook for parents with new...
1,McCarthy's writing and portrayal of Lester Bal...,mccarthy's writing and portrayal of lester bal...
2,Do you giggle uncontrollably when poking corps...,do you giggle uncontrollably when poking corps...
3,I was initiated into the world of Cormac McCar...,i was initiated into the world of cormac mccar...
4,I cannot speak to the literary points in the n...,i cannot speak to the literary points in the n...


### 2. Removal of punctuation from the text:

In [10]:
# string.punctuation holds the ASCII punctuation characters to delete
Punctuation_remove = string.punctuation
def remove_punctuation(lowercase_text):
    """custom function to remove the punctuation"""
    return lowercase_text.translate(str.maketrans('', '', Punctuation_remove))

df["NoPunctuations_text"] = df["lowercase_text"].apply(remove_punctuation)
df.head()

Unnamed: 0,review_text,lowercase_text,NoPunctuations_text
0,This all-Spanish handbook for parents with new...,this all-spanish handbook for parents with new...,this allspanish handbook for parents with new ...
1,McCarthy's writing and portrayal of Lester Bal...,mccarthy's writing and portrayal of lester bal...,mccarthys writing and portrayal of lester ball...
2,Do you giggle uncontrollably when poking corps...,do you giggle uncontrollably when poking corps...,do you giggle uncontrollably when poking corps...
3,I was initiated into the world of Cormac McCar...,i was initiated into the world of cormac mccar...,i was initiated into the world of cormac mccar...
4,I cannot speak to the literary points in the n...,i cannot speak to the literary points in the n...,i cannot speak to the literary points in the n...
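
The effect of the translation table is easiest to see on a few short strings. The following is a minimal sketch on made-up sample strings (the names `PUNCT_TABLE`, `strip_punctuation`, and `samples` are illustrative and not part of the notebook). Hyphenated words fuse and contractions lose their apostrophe.

```python
import string

# The third argument of str.maketrans lists the characters to delete.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character from text."""
    return text.translate(PUNCT_TABLE)

# Hypothetical sample strings, already lowercased.
samples = [
    "this all-spanish handbook",
    "mccarthy's writing",
    "you're right, aren't you?",
]
cleaned = [strip_punctuation(s) for s in samples]
print(cleaned)
# ['this allspanish handbook', 'mccarthys writing', 'youre right arent you']
```

Only the ASCII characters in `string.punctuation` are removed, so typographic quotes or dashes would pass through unchanged.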


### 3. Removal of stopwords:

In [11]:
# importing the NLP library
import nltk
from nltk.corpus import stopwords
stopwords

<WordListCorpusReader in 'C:\\Users\\SAMARTH P SHET\\AppData\\Roaming\\nltk_data\\corpora\\stopwords'>

In [12]:
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to C:\Users\SAMARTH P
[nltk_data]     SHET\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in stopwords])

df["Nostopwords_text"] = df["NoPunctuations_text"].apply(lambda text: remove_stopwords(text))
df.head()

Unnamed: 0,review_text,lowercase_text,NoPunctuations_text,Nostopwords_text
0,This all-Spanish handbook for parents with new...,this all-spanish handbook for parents with new...,this allspanish handbook for parents with new ...,allspanish handbook parents new babies prove e...
1,McCarthy's writing and portrayal of Lester Bal...,mccarthy's writing and portrayal of lester bal...,mccarthys writing and portrayal of lester ball...,mccarthys writing portrayal lester ballard nec...
2,Do you giggle uncontrollably when poking corps...,do you giggle uncontrollably when poking corps...,do you giggle uncontrollably when poking corps...,giggle uncontrollably poking corpses stick loo...
3,I was initiated into the world of Cormac McCar...,i was initiated into the world of cormac mccar...,i was initiated into the world of cormac mccar...,initiated world cormac mccarthy novel southern...
4,I cannot speak to the literary points in the n...,i cannot speak to the literary points in the n...,i cannot speak to the literary points in the n...,cannot speak literary points novel though say ...
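
One consequence of stripping punctuation before this step is that contractions such as `you're` no longer match their entry in the stopword list. The sketch below illustrates this with a small hand-picked stopword set, which is a hypothetical stand-in for the NLTK list, and uses a `frozenset` so that each membership test is a constant-time lookup.

```python
import string

# Hypothetical subset of an English stopword list, for illustration only.
STOPWORDS = frozenset({"i", "you", "you're", "the", "is", "a", "this", "for"})

def remove_stopwords(text: str, stopwords=STOPWORDS) -> str:
    """Keep only the whitespace-separated tokens that are not stopwords."""
    return " ".join(word for word in text.split() if word not in stopwords)

raw = "you're the reader this book is for"
no_punct = raw.translate(str.maketrans("", "", string.punctuation))

kept_raw = remove_stopwords(raw)            # contraction matches and is removed
kept_no_punct = remove_stopwords(no_punct)  # "youre" no longer matches "you're"
print(kept_raw)       # reader book
print(kept_no_punct)  # youre reader book
```

If such leftovers matter, one option is to strip punctuation from the stopword list as well, so both sides are normalized the same way.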


### 4. Removal of frequent words:

In [14]:
from collections import Counter
cnt = Counter()
for text in df["Nostopwords_text"].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)

[('book', 79443),
 ('read', 28623),
 ('one', 27132),
 ('like', 16226),
 ('would', 15183),
 ('time', 13181),
 ('great', 12677),
 ('story', 12663),
 ('good', 12635),
 ('books', 12467)]

In [15]:
FREQWORDS = {w for (w, wc) in cnt.most_common(10)}
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

df["NoFreqwords_text"] = df["Nostopwords_text"].apply(lambda text: remove_freqwords(text))
df.head()

Unnamed: 0,review_text,lowercase_text,NoPunctuations_text,Nostopwords_text,NoFreqwords_text
0,This all-Spanish handbook for parents with new...,this all-spanish handbook for parents with new...,this allspanish handbook for parents with new ...,allspanish handbook parents new babies prove e...,allspanish handbook parents new babies prove e...
1,McCarthy's writing and portrayal of Lester Bal...,mccarthy's writing and portrayal of lester bal...,mccarthys writing and portrayal of lester ball...,mccarthys writing portrayal lester ballard nec...,mccarthys writing portrayal lester ballard nec...
2,Do you giggle uncontrollably when poking corps...,do you giggle uncontrollably when poking corps...,do you giggle uncontrollably when poking corps...,giggle uncontrollably poking corpses stick loo...,giggle uncontrollably poking corpses stick loo...
3,I was initiated into the world of Cormac McCar...,i was initiated into the world of cormac mccar...,i was initiated into the world of cormac mccar...,initiated world cormac mccarthy novel southern...,initiated world cormac mccarthy novel southern...
4,I cannot speak to the literary points in the n...,i cannot speak to the literary points in the n...,i cannot speak to the literary points in the n...,cannot speak literary points novel though say ...,cannot speak literary points novel though say ...
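
The top-10 list above includes `great` and `good`, which carry sentiment, so removing them may discard useful signal. The following sketch runs the same counting and filtering on a toy corpus and adds a `protected` set of words to keep. The corpus, the cut-off `top_k`, and `protected` are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter

# Toy corpus standing in for the cleaned review column (hypothetical data).
docs = ["book great story", "book boring plot", "great book read twice"]

# Counter.update accepts any iterable of tokens, so no inner loop is needed.
counts = Counter()
for doc in docs:
    counts.update(doc.split())

top_k = 2              # hypothetical cut-off; the notebook uses 10
protected = {"great"}  # hypothetical sentiment-bearing words to keep
frequent = {word for word, _ in counts.most_common(top_k)} - protected

filtered = [" ".join(w for w in doc.split() if w not in frequent) for doc in docs]
print(counts.most_common(top_k))  # [('book', 3), ('great', 2)]
print(filtered)                   # ['great story', 'boring plot', 'great read twice']
```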


### 5. Removal of rare words:

In [16]:
n_rare_words = 10
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
def remove_rarewords(NoFreqwords_text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(NoFreqwords_text).split() if word not in RAREWORDS])

df["NoRareFreqwords_text"] = df["NoFreqwords_text"].apply(lambda NoFreqwords_text: remove_rarewords(NoFreqwords_text))
df.head()

Unnamed: 0,review_text,lowercase_text,NoPunctuations_text,Nostopwords_text,NoFreqwords_text,NoRareFreqwords_text
0,This all-Spanish handbook for parents with new...,this all-spanish handbook for parents with new...,this allspanish handbook for parents with new ...,allspanish handbook parents new babies prove e...,allspanish handbook parents new babies prove e...,allspanish handbook parents new babies prove e...
1,McCarthy's writing and portrayal of Lester Bal...,mccarthy's writing and portrayal of lester bal...,mccarthys writing and portrayal of lester ball...,mccarthys writing portrayal lester ballard nec...,mccarthys writing portrayal lester ballard nec...,mccarthys writing portrayal lester ballard nec...
2,Do you giggle uncontrollably when poking corps...,do you giggle uncontrollably when poking corps...,do you giggle uncontrollably when poking corps...,giggle uncontrollably poking corpses stick loo...,giggle uncontrollably poking corpses stick loo...,giggle uncontrollably poking corpses stick loo...
3,I was initiated into the world of Cormac McCar...,i was initiated into the world of cormac mccar...,i was initiated into the world of cormac mccar...,initiated world cormac mccarthy novel southern...,initiated world cormac mccarthy novel southern...,initiated world cormac mccarthy novel southern...
4,I cannot speak to the literary points in the n...,i cannot speak to the literary points in the n...,i cannot speak to the literary points in the n...,cannot speak literary points novel though say ...,cannot speak literary points novel though say ...,cannot speak literary points novel though say ...
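
The slice `[:-n_rare_words-1:-1]` walks backwards from the end of the sorted list and stops after `n_rare_words` items, so it removes a fixed number of words regardless of how many are equally rare. The sketch below compares it, on a toy `Counter`, with a count threshold. The threshold variant and the names `min_count` and `rare_by_threshold` are assumptions for illustration, not the notebook's approach.

```python
from collections import Counter

# Hypothetical word counts with three words that occur only once.
counts = Counter({"book": 5, "plot": 3, "prose": 2, "zzyzx": 1, "qwop": 1, "xyzzy": 1})

n_rare = 2
# most_common() sorts by descending count; the negative-step slice takes
# the last n_rare entries of that list.
rare_by_slice = {w for w, _ in counts.most_common()[:-n_rare - 1:-1]}

# Alternative: drop every word seen fewer than min_count times.
min_count = 2
rare_by_threshold = {w for w, c in counts.items() if c < min_count}

print(len(rare_by_slice))         # 2
print(sorted(rare_by_threshold))  # ['qwop', 'xyzzy', 'zzyzx']
```

With the slice, one of the three single-occurrence words survives; with the threshold, all of them are removed.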


### 6. Stemming and lemmatization:

In [17]:
from nltk.stem.porter import PorterStemmer

# Drop the four columns 
df.drop(["NoFreqwords_text", "lowercase_text", "NoPunctuations_text","Nostopwords_text"], axis=1, inplace=True) 

stemmer = PorterStemmer()
def stem_words(NoRareFreqwords_text):
    return " ".join([stemmer.stem(word) for word in NoRareFreqwords_text.split()])

df["Stemmed_text"] = df["NoRareFreqwords_text"].apply(lambda NoRareFreqwords_text: stem_words(NoRareFreqwords_text))
df.head()

Unnamed: 0,review_text,NoRareFreqwords_text,Stemmed_text
0,This all-Spanish handbook for parents with new...,allspanish handbook parents new babies prove e...,allspanish handbook parent new babi prove esse...
1,McCarthy's writing and portrayal of Lester Bal...,mccarthys writing portrayal lester ballard nec...,mccarthi write portray lester ballard necrophi...
2,Do you giggle uncontrollably when poking corps...,giggle uncontrollably poking corpses stick loo...,giggl uncontrol poke corps stick look youi und...
3,I was initiated into the world of Cormac McCar...,initiated world cormac mccarthy novel southern...,initi world cormac mccarthi novel southern lit...
4,I cannot speak to the literary points in the n...,cannot speak literary points novel though say ...,cannot speak literari point novel though say e...


In [18]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\SAMARTH P
[nltk_data]     SHET\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [19]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
def lemmatize_words(Stemmed_text):
    return " ".join([lemmatizer.lemmatize(word) for word in Stemmed_text.split()])

df["Lemmatized_text"] = df["Stemmed_text"].apply(lambda Stemmed_text: lemmatize_words(Stemmed_text))
df.head()

Unnamed: 0,review_text,NoRareFreqwords_text,Stemmed_text,Lemmatized_text
0,This all-Spanish handbook for parents with new...,allspanish handbook parents new babies prove e...,allspanish handbook parent new babi prove esse...,allspanish handbook parent new babi prove esse...
1,McCarthy's writing and portrayal of Lester Bal...,mccarthys writing portrayal lester ballard nec...,mccarthi write portray lester ballard necrophi...,mccarthi write portray lester ballard necrophi...
2,Do you giggle uncontrollably when poking corps...,giggle uncontrollably poking corpses stick loo...,giggl uncontrol poke corps stick look youi und...,giggl uncontrol poke corp stick look youi unde...
3,I was initiated into the world of Cormac McCar...,initiated world cormac mccarthy novel southern...,initi world cormac mccarthi novel southern lit...,initi world cormac mccarthi novel southern lit...
4,I cannot speak to the literary points in the n...,cannot speak literary points novel though say ...,cannot speak literari point novel though say e...,cannot speak literari point novel though say e...


### 7. Removal of URLs (assuming the text contains no emojis or emoticons):


In [20]:
# importing regular expression
import re
def remove_urls(Lemmatized_text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', Lemmatized_text)

df["URL_removed_text"] = df["Lemmatized_text"].apply(lambda Lemmatized_text: remove_urls(Lemmatized_text))
df.head()

Unnamed: 0,review_text,NoRareFreqwords_text,Stemmed_text,Lemmatized_text,URL_removed_text
0,This all-Spanish handbook for parents with new...,allspanish handbook parents new babies prove e...,allspanish handbook parent new babi prove esse...,allspanish handbook parent new babi prove esse...,allspanish handbook parent new babi prove esse...
1,McCarthy's writing and portrayal of Lester Bal...,mccarthys writing portrayal lester ballard nec...,mccarthi write portray lester ballard necrophi...,mccarthi write portray lester ballard necrophi...,mccarthi write portray lester ballard necrophi...
2,Do you giggle uncontrollably when poking corps...,giggle uncontrollably poking corpses stick loo...,giggl uncontrol poke corps stick look youi und...,giggl uncontrol poke corp stick look youi unde...,giggl uncontrol poke corp stick look youi unde...
3,I was initiated into the world of Cormac McCar...,initiated world cormac mccarthy novel southern...,initi world cormac mccarthi novel southern lit...,initi world cormac mccarthi novel southern lit...,initi world cormac mccarthi novel southern lit...
4,I cannot speak to the literary points in the n...,cannot speak literary points novel though say ...,cannot speak literari point novel though say e...,cannot speak literari point novel though say e...,cannot speak literari point novel though say e...


### 8. Removal of HTML tags:

In [21]:
def remove_html(URL_removed_text):
    html_pattern = re.compile(r'<.*?>')
    return html_pattern.sub(r'', URL_removed_text)

df["Tags_Removed_text"] = df["URL_removed_text"].apply(lambda URL_removed_text: remove_html(URL_removed_text))
df.head()

Unnamed: 0,review_text,NoRareFreqwords_text,Stemmed_text,Lemmatized_text,URL_removed_text,Tags_Removed_text
0,This all-Spanish handbook for parents with new...,allspanish handbook parents new babies prove e...,allspanish handbook parent new babi prove esse...,allspanish handbook parent new babi prove esse...,allspanish handbook parent new babi prove esse...,allspanish handbook parent new babi prove esse...
1,McCarthy's writing and portrayal of Lester Bal...,mccarthys writing portrayal lester ballard nec...,mccarthi write portray lester ballard necrophi...,mccarthi write portray lester ballard necrophi...,mccarthi write portray lester ballard necrophi...,mccarthi write portray lester ballard necrophi...
2,Do you giggle uncontrollably when poking corps...,giggle uncontrollably poking corpses stick loo...,giggl uncontrol poke corps stick look youi und...,giggl uncontrol poke corp stick look youi unde...,giggl uncontrol poke corp stick look youi unde...,giggl uncontrol poke corp stick look youi unde...
3,I was initiated into the world of Cormac McCar...,initiated world cormac mccarthy novel southern...,initi world cormac mccarthi novel southern lit...,initi world cormac mccarthi novel southern lit...,initi world cormac mccarthi novel southern lit...,initi world cormac mccarthi novel southern lit...
4,I cannot speak to the literary points in the n...,cannot speak literary points novel though say ...,cannot speak literari point novel though say e...,cannot speak literari point novel though say e...,cannot speak literari point novel though say e...,cannot speak literari point novel though say e...
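
Both patterns rely on delimiters (`://`, `.`, `<`, `>`) that are part of `string.punctuation`, so they can only match if they run before punctuation is stripped. The sketch below compares the two orderings on a made-up string. The function names and the sample text are hypothetical, and the regular expressions are the ones used in steps 7 and 8.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<.*?>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_markup_then_punctuation(text: str) -> str:
    """Remove URLs and tags while their delimiters still exist, then punctuation."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())  # collapse the whitespace left behind

def strip_punctuation_then_markup(text: str) -> str:
    """Punctuation first, as in the steps above, for comparison."""
    text = text.translate(PUNCT_TABLE)
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())

raw = "great read<br />see http://example.com/review for more"
markup_first = strip_markup_then_punctuation(raw)
punct_first = strip_punctuation_then_markup(raw)
print(markup_first)  # great read see for more
print(punct_first)   # great readbr see httpexamplecomreview for more
```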


In [22]:
df.drop(["NoRareFreqwords_text", "Stemmed_text", "Lemmatized_text", "URL_removed_text"], axis=1, inplace=True)
df.head()

Unnamed: 0,review_text,Tags_Removed_text
0,This all-Spanish handbook for parents with new...,allspanish handbook parent new babi prove esse...
1,McCarthy's writing and portrayal of Lester Bal...,mccarthi write portray lester ballard necrophi...
2,Do you giggle uncontrollably when poking corps...,giggl uncontrol poke corp stick look youi unde...
3,I was initiated into the world of Cormac McCar...,initi world cormac mccarthi novel southern lit...
4,I cannot speak to the literary points in the n...,cannot speak literari point novel though say e...


### 9. Labeling sentiment with TextBlob:

The class labels are not taken from the `rating` column. They are derived from the TextBlob polarity of the cleaned text: a negative polarity maps to ```Negative```, a polarity of exactly zero to ```Neutral```, and a positive polarity to ```Positive```.


In [23]:
from textblob import TextBlob

def getSubjectivity(Tags_Removed_text):
    return TextBlob(Tags_Removed_text).sentiment.subjectivity
    
def getPolarity(Tags_Removed_text):
    return TextBlob(Tags_Removed_text).sentiment.polarity

df['polarity'] = df['Tags_Removed_text'].apply(getPolarity)
df['subjectivity'] = df['Tags_Removed_text'].apply(getSubjectivity)

def getAnalysis(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'
    
df['Analysis_labels'] = df['polarity'].apply(lambda x: getAnalysis(x))
        

In [24]:
df

Unnamed: 0,review_text,Tags_Removed_text,polarity,subjectivity,Analysis_labels
0,This all-Spanish handbook for parents with new...,allspanish handbook parent new babi prove esse...,0.096591,0.196970,Positive
1,McCarthy's writing and portrayal of Lester Bal...,mccarthi write portray lester ballard necrophi...,-0.008095,0.150952,Negative
2,Do you giggle uncontrollably when poking corps...,giggl uncontrol poke corp stick look youi unde...,0.049762,0.289643,Positive
3,I was initiated into the world of Cormac McCar...,initi world cormac mccarthi novel southern lit...,-0.046591,0.361827,Negative
4,I cannot speak to the literary points in the n...,cannot speak literari point novel though say e...,0.089226,0.425926,Positive
...,...,...,...,...,...
41995,"In reading this book, I traveled on a wonderfu...",read travel wonder thought inspir journey reaf...,0.283333,0.500000,Positive
41996,This wonderful book touches the inner soul.You...,wonder touch inner soulyou identifi passag dis...,0.066667,0.233333,Positive
41997,I am a 45 year old woman who has been on a que...,45 year old woman quest posit selfmotiv last t...,-0.175397,0.475298,Negative
41998,It is written that you should hide the Word of...,written hide word god heart speak live heart o...,0.199650,0.557692,Positive
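
Because `getAnalysis` treats only an exact zero as neutral, a score such as -0.008095 is labeled ```Negative```. The sketch below shows one possible variant with a tolerance band around zero. The function `label_polarity` and the value `eps=0.05` are assumptions for illustration, and the scores are the first polarity values from the table above plus an exact zero.

```python
def label_polarity(score: float, eps: float = 0.05) -> str:
    """Map a polarity in [-1, 1] to a class, treating |score| <= eps as neutral.

    eps is a hypothetical tolerance; eps=0.0 reproduces the exact-zero rule.
    """
    if score < -eps:
        return "Negative"
    if score > eps:
        return "Positive"
    return "Neutral"

scores = [0.096591, -0.008095, 0.049762, -0.046591, 0.0]
exact = [label_polarity(s, eps=0.0) for s in scores]
banded = [label_polarity(s) for s in scores]
print(exact)   # ['Positive', 'Negative', 'Positive', 'Negative', 'Neutral']
print(banded)  # ['Positive', 'Neutral', 'Neutral', 'Neutral', 'Neutral']
```

Any such threshold changes the class balance, so it would need to be chosen and validated against the data rather than taken from this sketch.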


## Model Building:

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
X_train,X_test,y_train,y_test = train_test_split(df['Tags_Removed_text'],
                                                 df['Analysis_labels'],
                                                 test_size = 0.2,random_state = 324)

In [27]:
X_train.shape

(33600,)

In [28]:
X_test.shape

(8400,)

In [29]:
df['Analysis_labels'].value_counts()

Positive    30296
Negative     8659
Neutral      3045
Name: Analysis_labels, dtype: int64
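
The classes are imbalanced, so a purely random split may leave the test set with a different class mix than the training set. The sketch below passes the `stratify` argument of `train_test_split` on toy labels so that each split keeps the class proportions. The data and variable names are hypothetical, and the notebook's own split does not stratify.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels, imbalanced in the same direction as the notebook's classes.
labels = ["Positive"] * 70 + ["Negative"] * 20 + ["Neutral"] * 10
texts = [f"doc {i}" for i in range(len(labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=324, stratify=labels
)

test_counts = Counter(y_te)
print(test_counts["Positive"], test_counts["Negative"], test_counts["Neutral"])
# 14 4 2
```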

### Feature Extraction:

Feature extraction uses two different methods:
> `CountVectorizer`

> `TfidfVectorizer`

In [30]:
# importing the library
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 

### 1. CountVectorizer

In [31]:
vect1 = CountVectorizer()
cv_train = vect1.fit_transform(X_train)
cv_test = vect1.transform(X_test)

In [32]:
print(vect1.vocabulary_)



In [33]:
cv_train.shape

(33600, 34513)

### 2. TfidfVectorizer

In [34]:
vect2 = TfidfVectorizer()
TF_train = vect2.fit_transform(X_train)
TF_test = vect2.transform(X_test)

In [35]:
TF_train.shape

(33600, 34513)
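
Both vectorizers build the same vocabulary from the training text, which is why the two matrices have the same shape; they differ only in the values stored. The sketch below shows this on a toy corpus. The documents and variable names are hypothetical, and both vectorizers use their default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_docs = ["great plot great prose", "dull plot", "great characters"]
test_docs = ["great unseen plot"]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

counts_train = count_vec.fit_transform(train_docs)
tfidf_train = tfidf_vec.fit_transform(train_docs)

# transform() reuses the training vocabulary; "unseen" is silently dropped.
counts_test = count_vec.transform(test_docs)

vocab = sorted(count_vec.vocabulary_)
print(vocab)                      # ['characters', 'dull', 'great', 'plot', 'prose']
print(counts_train.toarray()[0])  # [0 0 2 1 1]
print(counts_test.toarray()[0])   # [0 0 1 1 0]

# TF-IDF down-weights words that occur in many documents: in the first
# document "plot" and "prose" both occur once, but "prose" is rarer overall.
row = tfidf_train.toarray()[0]
plot_w = row[tfidf_vec.vocabulary_["plot"]]
prose_w = row[tfidf_vec.vocabulary_["prose"]]
```

Fitting on the training split only, as the notebook does, keeps test vocabulary out of the features.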

In [36]:
print(vect2.vocabulary_)



## KNN Classifier:

### KNN Classifier for CountVectorizer:

In [37]:
# import the KNN classifier and fit it on the training dataset
from sklearn.neighbors import KNeighborsClassifier
model1 = KNeighborsClassifier()
model1.fit(cv_train,y_train)

KNeighborsClassifier()

In [38]:
# Accuracy score on training dataset
model1.score(cv_train,y_train)

0.9995535714285714

In [39]:
# Accuracy on Test dataset
model1.score(cv_test,y_test)

0.9942857142857143

In [40]:
# Performing prediction on Test dataset
expected = y_test
predicted = model1.predict(cv_test)

In [41]:
# import the helper for plotting the confusion matrix of the test dataset
from mlxtend.plotting import plot_confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

In [42]:
# Calculating F1 score
from sklearn import metrics
from sklearn.metrics import f1_score
f1_score(expected, predicted, average='macro')

0.9871397964692025
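
The macro average gives every class the same weight, so it reacts to mistakes on the small ```Neutral``` class more strongly than accuracy does. The sketch below computes a confusion matrix, per-class F1, and macro F1 with scikit-learn on six made-up labels. The data are hypothetical and unrelated to the scores reported above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

labels = ["Negative", "Neutral", "Positive"]
expected = ["Positive", "Positive", "Positive", "Negative", "Negative", "Neutral"]
predicted = ["Positive", "Positive", "Negative", "Negative", "Negative", "Positive"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(expected, predicted, labels=labels)

# zero_division=0 scores a never-predicted class as 0 without a warning.
per_class = f1_score(expected, predicted, labels=labels, average=None, zero_division=0)
macro = f1_score(expected, predicted, labels=labels, average="macro", zero_division=0)
accuracy = accuracy_score(expected, predicted)

print(cm)
print(per_class)        # F1 for Negative, Neutral, Positive
print(accuracy, macro)  # accuracy is 4/6; macro F1 is the mean of per_class
```

Here the single ```Neutral``` review is misclassified, which pulls macro F1 well below accuracy.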

### KNN Classifier for TfidfVectorizer:

In [43]:
model2 = KNeighborsClassifier()
model2.fit(TF_train,y_train)

KNeighborsClassifier()

In [44]:
# Accuracy score on training dataset
model2.score(TF_train,y_train)

0.9996726190476191

In [45]:
# Accuracy on Test dataset
model2.score(TF_test,y_test)

0.9954761904761905

In [46]:
# Performing prediction on Test dataset
expected = y_test
predicted = model2.predict(TF_test)

In [47]:
# import the helper for plotting the confusion matrix of the test dataset
from mlxtend.plotting import plot_confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

In [48]:
# Calculating F1 score
from sklearn import metrics
from sklearn.metrics import f1_score
f1_score(expected, predicted, average='macro')

0.9934321552836466

In [51]:
import pickle
# save the TF-IDF KNN model; the context manager closes the file handle
with open('model2.pkl', 'wb') as f:
    pickle.dump(model2, f)
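
The saved classifier expects TF-IDF vectors, so scoring new text later also requires the fitted vectorizer. One option, sketched below on a toy corpus, is to bundle both in a scikit-learn pipeline and pickle that single object. The data, the use of `make_pipeline`, and `n_neighbors=1` are assumptions for illustration, not what the notebook saves.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical training data, already cleaned.
train_docs = ["great plot", "great prose", "dull plot", "dull prose"]
train_labels = ["Positive", "Positive", "Negative", "Negative"]

# n_neighbors=1 only because this toy corpus has four documents.
pipeline = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
pipeline.fit(train_docs, train_labels)

blob = pickle.dumps(pipeline)  # in-memory stand-in for writing a .pkl file
restored = pickle.loads(blob)

prediction = restored.predict(["great characters"])
print(prediction)  # ['Positive']
```

Pickle files should only be loaded from trusted sources, and a pickled model is generally expected to be loaded with the same scikit-learn version that created it.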