##### Now that we have our cleaned dataset, we will apply Natural Language Processing techniques such as
- BoW (Bag of Words)
- TF-IDF
- Word2Vec

##### to convert the text into vectors, so that we can later apply linear algebra techniques to predict the sentiment
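To make "text to vectors" concrete, here is a minimal bag-of-words sketch in plain Python. The two-document corpus and the helper name `bow_vector` are made up for illustration; the notebook itself uses scikit-learn's `CountVectorizer`, which also handles tokenization and sparse storage.

```python
from collections import Counter

# Toy corpus (hypothetical, written in the same stemmed style as the reviews)
docs = ["great taffi great price", "product arriv late"]

# Vocabulary: every distinct token, sorted so each word gets a fixed column
vocab = sorted({tok for doc in docs for tok in doc.split()})

def bow_vector(doc, vocab):
    """Return the count of each vocabulary word in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each review becomes one row of word counts, and word order is discarded.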

In [1]:
# Let's reload the data from the SQLite file

import sqlite3
import pandas as pd
import numpy as np

con = sqlite3.connect('final_cleaned.sqlite')
cleaned_data = pd.read_sql_query('SELECT * FROM Reviews', con)
cleaned_data.head()

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,sentiment
0,0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,bought sever vital can dog food product found ...,postive
1,1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,product arriv label jumbo salt peanutsth peanu...,negative
2,2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",confect around centuri light pillowi citru gel...,postive
3,3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,look secret ingredi robitussin believ found go...,negative
4,4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,great taffi great price wide assort yummi taff...,postive


### Bag of Words

In [2]:
positive_data = cleaned_data[cleaned_data.sentiment == 'postive'].sample(2000)
negative_data = cleaned_data[cleaned_data.sentiment == 'negative'].sample(2000)

data_4k = pd.concat([positive_data,negative_data])
sentiment_4k = data_4k['sentiment']
data_4k = data_4k.drop(['sentiment'],axis = 1)

In [3]:
data_4k.shape

(4000, 11)

In [4]:
sentiment_4k.shape

(4000,)

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# count_vectorizer = CountVectorizer().fit(data_4k.Text.values)
# final_counts = count_vectorizer.transform(data_4k.Text.values)
# final_counts.get_shape()

In [6]:
# Plain BoW creates many new columns, and it often performs poorly
# because it ignores word order (sequence information).

# N-grams help because they also capture runs of consecutive words
# (pairs, in the case of bigrams). With plain bag of words after
# stop-word removal, a negation such as "not" can be dropped, which
# leads to misclassification in some cases, so n-grams are usually
# a good idea.
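The point about negation can be illustrated with a small plain-Python sketch. The helper `word_ngrams` is hypothetical and only approximates what `CountVectorizer(ngram_range=(1, 2))` extracts (the real vectorizer also lowercases and applies its own token pattern).

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

negated = word_ngrams("food not good")
plain = word_ngrams("food good")

print(negated)  # includes the bigram "not good"
print(plain)

# If "not" were removed as a stop word, both sentences would reduce to
# the same unigrams; keeping it lets the bigram "not good" tell them apart.
print("not good" in negated, "not good" in plain)
```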

#### N-Grams

In [7]:
# Avoid removing stop words such as "not" before building n-grams

count_vectorizer_ng = CountVectorizer(ngram_range=(1,2)).fit(data_4k.Text.values)
final_ng_count = count_vectorizer_ng.transform(data_4k.Text.values)

In [8]:
final_ng_count.shape

(4000, 127113)

### Train and Test

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train , y_test = train_test_split(final_ng_count , sentiment_4k.values, test_size = 0.3)

In [10]:
import xgboost

xgb_model = xgboost.XGBClassifier(max_depth=50,n_estimators=60,learning_rate=0.1)
xgb_model.fit(X_train,y_train)
predictions = xgb_model.predict(X_test)

In [11]:
from sklearn.metrics import f1_score,classification_report,accuracy_score
# print('F1 score: {}'.format(f1_score(y_test, predictions, pos_label='postive')))

print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

    negative       0.76      0.80      0.78       559
     postive       0.82      0.77      0.79       641

   micro avg       0.79      0.79      0.79      1200
   macro avg       0.79      0.79      0.79      1200
weighted avg       0.79      0.79      0.79      1200



In [12]:
print('F1_Score: {:.3f} %'.format((f1_score(y_test,predictions,pos_label='postive'))*100))

F1_Score: 79.487 %


In [13]:
from sklearn.metrics import confusion_matrix 
print(confusion_matrix(y_test,predictions))

[[448 111]
 [145 496]]
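The reported scores can be cross-checked by hand from the confusion matrix above. This sketch assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`'negative'`, then `'postive'`), which agrees with the supports of 559 and 641 in the classification report.

```python
# Confusion matrix printed above: rows = true label, columns = prediction,
# label order ['negative', 'postive']
cm = [[448, 111],
      [145, 496]]

tp = cm[1][1]  # true 'postive' predicted as 'postive'
fp = cm[0][1]  # true 'negative' predicted as 'postive'
fn = cm[1][0]  # true 'postive' predicted as 'negative'

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('precision: {:.2f}, recall: {:.2f}'.format(precision, recall))
print('F1_Score: {:.3f} %'.format(f1 * 100))
```

The result agrees with the 0.82 precision, 0.77 recall, and 79.487 % F1 reported for the `'postive'` class.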


#### N-gram Level TF-IDF + XGBoost


In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer='word',ngram_range=(2,3))
tfidf_vectorizer.fit(data_4k.Text)
final_tfidf_ng = tfidf_vectorizer.transform(data_4k.Text)
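For intuition about what TF-IDF weights look like, here is a hand-rolled sketch on a toy corpus. It uses single words rather than the 2- and 3-grams configured above, and it assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is what scikit-learn documents as the `TfidfVectorizer` defaults. The corpus and function names are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenizable on whitespace
docs = ["great taffi great price", "great product", "product arriv late"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for toks in tokenized for term in set(toks))

def idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

weights = tfidf(tokenized[0])
print(weights)
```

A term that appears in fewer documents ("taffi") gets a higher IDF than one that appears in many ("great"), although a high in-document count can still outweigh that.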

In [26]:
X_train, X_test, y_train , y_test = train_test_split(final_tfidf_ng , sentiment_4k, test_size = 0.3)

In [27]:
xgb_model = xgboost.XGBClassifier(max_depth=50,n_estimators=60,learning_rate=0.01)
xgb_model.fit(X_train,y_train)
predictions = xgb_model.predict(X_test)

In [28]:
print('F1_Score: {:.3f} %'.format((f1_score(y_test,predictions,pos_label='postive'))*100))

F1_Score: 47.351 %


In [29]:
print(confusion_matrix(y_test,predictions))

[[523  78]
 [389 210]]


In [30]:
# INFERENCE
print(xgb_model.predict(tfidf_vectorizer.transform(['The food is not at all tasty. I won\'t come again'])))

['negative']


In [31]:
print(xgb_model.predict(tfidf_vectorizer.transform(['Very Tasty'])))

['negative']


In [32]:
print(xgb_model.predict(tfidf_vectorizer.transform(['Hate the food'])))

['negative']
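One possible reason all three sentences come back as `'negative'`: with `ngram_range=(2,3)` only bigrams and trigrams are features, and the training text was stemmed ("tasti", "yummi") while these inputs are raw. A short raw sentence may therefore match nothing in the vocabulary and produce an all-zero vector. The sketch below illustrates that effect with a made-up vocabulary; it is an assumption about the cause, not something verified against the fitted vectorizer.

```python
def word_ngrams(text, n_min, n_max):
    """Return lowercase word n-grams of length n_min..n_max."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

# Hypothetical vocabulary, as if learned from stemmed training reviews
vocab = {"great taffi", "tast great", "not tasti"}

def matched_features(text):
    """Return the 2- and 3-grams of `text` that exist in the vocabulary."""
    return [g for g in word_ngrams(text, 2, 3) if g in vocab]

print(word_ngrams("Very Tasty", 2, 3))    # a single candidate feature
print(matched_features("Very Tasty"))     # nothing matches
print(matched_features("Hate the food"))  # nothing matches
```

If this is what happens, applying the same cleaning and stemming to the input before calling `transform` would be a natural thing to try.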


#### Character Level TF-IDF + XGBoost

In [33]:
tfidf_vectorizer = TfidfVectorizer(analyzer='char',ngram_range=(2,3))
tfidf_vectorizer.fit(data_4k.Text)
final_tfidf_ng = tfidf_vectorizer.transform(data_4k.Text)

In [34]:
X_train, X_test, y_train , y_test = train_test_split(final_tfidf_ng , sentiment_4k, test_size = 0.3)

In [35]:
xgb_model = xgboost.XGBClassifier(max_depth=50,n_estimators=60,learning_rate=0.1)
xgb_model.fit(X_train,y_train)
predictions = xgb_model.predict(X_test)

In [36]:
print('F1_Score: {:.3f} %'.format((f1_score(y_test,predictions,pos_label='postive'))*100))

F1_Score: 76.105 %


In [37]:
print(xgb_model.predict(tfidf_vectorizer.transform(['The food was good nice and Tasty'])))

['postive']


In [38]:
print(xgb_model.predict(tfidf_vectorizer.transform(['Unlike the food and was Not Tasty'])))

['negative']


##### Finally, with character-level TF-IDF we see some correct classifications
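A plausible reason character n-grams cope better with raw input is that a word and its stemmed form share most of their short character sequences. The sketch below is a simplified stand-in for `analyzer='char'` with `ngram_range=(2,3)`; the real analyzer works on the whole string including spaces, and the helper name is hypothetical.

```python
def char_ngrams(text, n_min=2, n_max=3):
    """Return the set of character n-grams of length n_min..n_max."""
    text = text.lower()
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}

raw = char_ngrams("Tasty")      # as typed at inference time
stemmed = char_ngrams("tasti")  # as it appears in the cleaned reviews
shared = raw & stemmed

print(sorted(shared))
print(len(shared), "of", len(raw), "n-grams shared")
```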

In [41]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

    negative       0.76      0.74      0.75       596
     postive       0.75      0.77      0.76       604

   micro avg       0.76      0.76      0.76      1200
   macro avg       0.76      0.76      0.76      1200
weighted avg       0.76      0.76      0.76      1200



As a next step, I would implement Word2Vec embeddings to measure the similarity (or dissimilarity) between words.
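Similarity between word embeddings is commonly measured with cosine similarity. The sketch below shows only that comparison step; the three-dimensional vectors are invented for illustration and do not come from any trained Word2Vec model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings, for illustration only
embeddings = {
    "tasti": [0.9, 0.1, 0.3],
    "yummi": [0.8, 0.2, 0.4],
    "robitussin": [-0.5, 0.9, 0.1],
}

sim_close = cosine_similarity(embeddings["tasti"], embeddings["yummi"])
sim_far = cosine_similarity(embeddings["tasti"], embeddings["robitussin"])
print(sim_close, sim_far)
```

Values near 1 indicate similar directions, and lower values indicate more dissimilar words.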