# HW8: Lie Detection and Sentiment Classification with Text Mining
## IST 707 Applied Machine Learning

**Scenario**: Some people claim that machine learning algorithms can figure out whether a person is lying or not. Do you believe that? To test this claim, we have assembled a collection of customer reviews (some true and some false) on which you are going to test how well Naïve Bayes and SVMs perform at **fake review detection**. This data set also has a sentiment label for each review, so you will also compare the performance of NB and SVMs in **sentiment classification**.

For both tasks, try different tuning parameters and report the results in a table like the following.

| Model | Parameter Setting | Accuracy (Lie Detection) | Precision (Lie Detection) | Recall (Lie Detection) | Accuracy (Sentiment) | Precision (Sentiment) | Recall (Sentiment) |
|---|---|---|---|---|---|---|---|

1.	Explain the data and the pre-processing steps you took to prepare for each classification task.

2.	Explain your initial parameter tuning strategy — which parameter to tune, to what option, and the theoretical foundation for your choice. Does your strategy help you get better results?

3.	Compare performance differences in sentiment classification and lie detection and tell us which task is harder. Try to explain why one may be harder than the other.

Provide your code in a separate script.
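
One way to fill in the results table is to collect one row per model and task in a pandas DataFrame. The sketch below is only an illustration: the helper `summarize`, the toy labels, and the column names are assumptions, not part of the assignment.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

def summarize(model_name, setting, y_true, y_pred, pos_label):
    """Return one row of the results table for a single task (hypothetical helper)."""
    return {
        "Model": model_name,
        "Parameter Setting": setting,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, pos_label=pos_label),
        "Recall": recall_score(y_true, y_pred, pos_label=pos_label),
    }

# made-up labels and predictions, for illustration only
y_true = ["positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative"]

results = pd.DataFrame(
    [summarize("MNB", "alpha=1.0", y_true, y_pred, pos_label="positive")]
)
print(results)
```

Calling `summarize` once per model, parameter setting, and task and appending the rows gives a table in the requested layout.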

In [396]:
# mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Read in the data

In [397]:
import numpy as np
import pandas as p
df = p.read_csv("/content/drive/My Drive/Colab Notebooks/deception-data-converted.csv")
# y = df['sentiment'].values
# z = df['lie'].values
# X = df['review'].values

In [398]:
df.shape

(92, 3)

### Data Cleanup

In [399]:
# records that were blank in all columns were removed beforehand
# drop records whose review is blank (stored as the literal string '')
df = df[df['review'] != "''"]
print(df.shape)

(90, 3)


In [400]:
# get unique values in lie and sentiment column
print(df['lie'].unique())
print(df['sentiment'].unique())

['fake' 'TRUE']
['negative' 'positive']


In [401]:
# relabel the lie column: 'fake' stays 'fake' and 'TRUE' becomes 'real'
df.loc[df["lie"] == "TRUE", "lie"] = "real"
print(df['lie'].unique())
# p.factorize returns (codes, uniques) and does not modify df; the result is not
# assigned here, so the sentiment column keeps its string labels
p.factorize(df.sentiment)
print(df['sentiment'].unique())

['fake' 'real']
['negative' 'positive']


In [402]:
# define the targets and the input: y = sentiment labels, z = lie labels, X = review text
y = df['sentiment'].values
z = df['lie'].values
X = df['review'].values

## Sentiment Classification
We will first create models for sentiment classification based on the column "sentiment" in the data set.

### Split training and testing data
Split the data so that 80% of it goes to the training set and 20% to the testing/validation set.

In [403]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2022)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
print(X_train[0])
print(y_train[0])
print(X_test[0])
print(y_test[0])

(72,) (72,) (18,) (18,)
'This place was one of the best restaurant I have been. The price is little expensive, but the food and the service is best around the area. I went here with my family, and we ordered 4 dishes. They were all well cooked, and their taste were nicely balanced. Waiters came when we needed them without having to call for them. I would definitely recommend it to everyone visiting this area. '
positive
'Pastablities is a locally owned restaurant in Syracuse. The food is simple and homey and comforting. Their famous bread is baked daily and the bakery is right next door. The bread is soft and chewy and amazing with their homemade spicy tomato sauce. The paste and cheese that I had was cream and cooked to perfection. '
positive


### Data Checking

In [404]:
unique, counts = np.unique(y_train, return_counts = True)
print(np.asarray((unique, counts)))

[['negative' 'positive']
 [38 34]]


The training data is reasonably well balanced: 38 negative and 34 positive reviews.

In [405]:
uniqueTest, countsTest = np.unique(y_test, return_counts = True)
print(np.asarray((uniqueTest, countsTest)))

[['negative' 'positive']
 [8 10]]


Like the training set, the test set is reasonably well balanced: 8 negative and 10 positive reviews.
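
The balance above comes from a purely random split, which can drift on a data set this small. A minimal sketch of a stratified split follows, assuming a made-up corpus of 45 negative and 45 positive labels; `stratify` is a standard `train_test_split` argument, but the toy arrays are not the assignment data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy corpus: 45 negative and 45 positive labels
labels = np.array(["negative"] * 45 + ["positive"] * 45)
texts = np.array([f"review {i}" for i in range(len(labels))])

# stratify=labels keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=2022, stratify=labels
)

train_counts = dict(zip(*np.unique(y_tr, return_counts=True)))
test_counts = dict(zip(*np.unique(y_te, return_counts=True)))
print(train_counts, test_counts)
```

With stratification the class ratio in the test set no longer depends on the random seed.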

### Vectorization

In [406]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# define several commonly used vectorizer settings

#  unigram boolean vectorizer
unigram_bool_vectorizer = CountVectorizer(encoding = 'latin-1', binary = True, stop_words = 'english')

#  unigram term frequency vectorizer
unigram_count_vectorizer = CountVectorizer(encoding = 'latin-1', binary = False, stop_words = 'english')

#  unigram and bigram term frequency vectorizer
gram12_count_vectorizer = CountVectorizer(encoding = 'latin-1', ngram_range = (1,2), stop_words = 'english')

#  unigram tfidf vectorizer
unigram_tfidf_vectorizer = TfidfVectorizer(encoding = 'latin-1', use_idf = True, stop_words = 'english')

#  unigram boolean vectorizer, set minimum document frequency to 5
unigram_df_bool_vectorizer = CountVectorizer(encoding = 'latin-1', binary = True, min_df = 5, stop_words = 'english')

#  unigram term frequency vectorizer, set minimum document frequency to 5
unigram_df_count_vectorizer = CountVectorizer(encoding = 'latin-1', binary = False, min_df = 5, stop_words = 'english')

#  unigram and bigram term frequency vectorizer, set minimum document frequency to 5
gram12_df_count_vectorizer = CountVectorizer(encoding = 'latin-1', ngram_range = (1,2), min_df = 5, stop_words = 'english')

#  unigram tfidf vectorizer, set minimum document frequency to 5
unigram_df_tfidf_vectorizer = TfidfVectorizer(encoding = 'latin-1', use_idf = True, min_df = 5, stop_words = 'english')

### Naive Bayes

#### Vectorize the training data
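We'll use the TF-IDF vectorizer: it weights each term by how often it occurs in a review (term frequency) and discounts terms that occur in many reviews (inverse document frequency), so words that are common everywhere contribute less than words that are distinctive for a few reviews.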

In [407]:
# fit vocabulary in training documents and transform the training documents into vectors
X_train_vec = unigram_tfidf_vectorizer.fit_transform(X_train)

# check the content of a document vector
print(X_train_vec.shape)
print(X_train_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_tfidf_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_tfidf_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_tfidf_vectorizer.vocabulary_.get('cooked'))

(72, 1092)
[[0. 0. 0. ... 0. 0. 0.]]
1092
[('place', 693), ('best', 88), ('restaurant', 778), ('price', 721), ('little', 553), ('expensive', 348), ('food', 392), ('service', 833), ('area', 41), ('went', 1060)]
223


#### Vectorize the test data

In [408]:
X_test_vec = unigram_tfidf_vectorizer.transform(X_test)

# print out #examples and #features in the test set
print(X_test_vec.shape)

(18, 1092)


#### Create base MNB model

In [409]:
# import the MNB module
from sklearn.naive_bayes import MultinomialNB

# initialize the MNB model
nb_clf = MultinomialNB()

# use the training data to train the MNB model
nb_clf.fit(X_train_vec, y_train)

MultinomialNB()

#### Interpret a trained MNB model - conditional probs - MNB

In [410]:
print(nb_clf.feature_log_prob_.shape)

print(unigram_tfidf_vectorizer.vocabulary_.get('bad'))

# for i in range(0,1):
print(nb_clf.feature_log_prob_[0][unigram_tfidf_vectorizer.vocabulary_.get('bad')])
print(nb_clf.feature_log_prob_[1][unigram_tfidf_vectorizer.vocabulary_.get('bad')])

(2, 1092)
64
-6.387523180454758
-7.048689820019229


#### Log ratios of conditional probabilities

In [411]:
log_ratios = []
features = unigram_tfidf_vectorizer.get_feature_names_out()
neg_cond_prob = nb_clf.feature_log_prob_[0]
pos_cond_prob = nb_clf.feature_log_prob_[1]

for i in range(0, len(features)):
  log_ratio = pos_cond_prob[i] - neg_cond_prob[i]
  log_ratios.append(log_ratio)

ranks = sorted(zip(log_ratios, features))
print(ranks[:10])
print(ranks[-10:])

[(-0.8102168010676278, 'pizza'), (-0.7834353342873994, 'terrible'), (-0.7652845567614532, 'minutes'), (-0.7063783539213455, 'come'), (-0.6611666395644713, 'bad'), (-0.6599031764518237, 'worst'), (-0.6422900232267983, 'hour'), (-0.6279548103806665, 'asked'), (-0.6173596015153997, 'took'), (-0.6035462154058928, 'salad')]
[(0.5483913054291198, 'restaurants'), (0.5484244569231089, 'japanese'), (0.5808766506126357, 'home'), (0.5903823983260255, 'love'), (0.6044640969303776, 'ate'), (0.6087655778538377, 'friendly'), (0.7144463702695942, 'noodle'), (0.8249334902406913, 'great'), (0.8972803526833353, 'amazing'), (1.1267789246586348, 'best')]


The terms with the largest positive log ratios appear at the end of the list, so it makes sense to see "best", "amazing", and "great" there. Some terms in this set, such as "japanese", are not obviously positive.

Most of the terms with the most negative log ratios also make sense, for example "terrible", "bad", and "worst". Others, such as "pizza" and "salad", do not.

#### Test the MNB Classifier

In [412]:
# test the classifier on the test data set, print accuracy score

nb_clf.score(X_test_vec, y_test)

0.8333333333333334

So the MNB model classifies about 83.33% of the 18 test reviews correctly.

In [413]:
# print confusion matrix (row: ground truth; col: prediction)

from sklearn.metrics import confusion_matrix
# the model is already fitted, so predict directly instead of refitting on the same data
y_pred_mnb = nb_clf.predict(X_test_vec)
cm_mnb = confusion_matrix(y_test, y_pred_mnb, labels = ['negative', 'positive'])
print(cm_mnb)

[[8 0]
 [3 7]]


As the confusion matrix shows, the model classified all 8 negative reviews correctly, but it misclassified 3 of the 10 positive reviews as negative.
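
As a check on the arithmetic, accuracy, precision, and recall can be derived directly from the confusion matrix above. The sketch hard-codes the printed matrix; the variable names are illustrative.

```python
import numpy as np

# confusion matrix printed above (rows: ground truth, columns: prediction)
cm = np.array([[8, 0],
               [3, 7]])
labels = ["negative", "positive"]

accuracy = np.trace(cm) / cm.sum()
# precision: correct predictions / everything predicted as that class (column sums)
precision = np.diag(cm) / cm.sum(axis=0)
# recall: correct predictions / everything truly in that class (row sums)
recall = np.diag(cm) / cm.sum(axis=1)

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The perfect recall on the negative class and the perfect precision on the positive class both follow from the zero in the top-right cell.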

In [414]:
# print classification report

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
print(precision_score(y_test, y_pred_mnb, average = None))
print(recall_score(y_test, y_pred_mnb, average = None))

from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_mnb, target_names = target_names))

[0.72727273 1.        ]
[1.  0.7]
              precision    recall  f1-score   support

    negative       0.73      1.00      0.84         8
    positive       1.00      0.70      0.82        10

    accuracy                           0.83        18
   macro avg       0.86      0.85      0.83        18
weighted avg       0.88      0.83      0.83        18



#### Interpret the prediction result

In [415]:
## find the calculated posterior probability
posterior_probs_mnb = nb_clf.predict_proba(X_test_vec)

## find the posterior probabilities for the first test example
print(posterior_probs_mnb[0])

# find the category prediction for the first test example
y_pred_mnb = nb_clf.predict(X_test_vec)
print(y_pred_mnb[0])

# check the actual label for the first test example
print(y_test[0])

[0.45957137 0.54042863]
positive
positive


The posterior probability for 'positive' (0.5404) is larger than the one for 'negative' (0.4596), so the model predicts "positive". The actual label is also "positive", so this prediction is correct, although the margin is narrow.
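
The step from posterior probabilities to a label is an argmax over the class order. The sketch below replays it with the two probabilities printed above, assuming the classes are ordered as in `classes_` (sorted labels, so 'negative' comes first).

```python
import numpy as np

# class order as stored in a fitted classifier's classes_ attribute (sorted labels)
classes = np.array(["negative", "positive"])
# posterior probabilities printed above for the first test example
posterior = np.array([0.45957137, 0.54042863])

# the predicted label is the class with the largest posterior probability
predicted = classes[np.argmax(posterior)]
# distance between the two posteriors: a rough indicator of confidence
margin = abs(posterior[1] - posterior[0])
print(predicted, round(margin, 4))
```

A margin this small indicates that the example sits close to the decision boundary.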

#### Error Analysis

In [416]:
# print out specific type of error for further analysis

# print out the positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 3 such examples
print("MNB error analysis")
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_mnb[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

# print out the negative examples that are mistakenly predicted as positive
# according to the confusion matrix, there should be 0 such examples
print()
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_mnb[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

MNB error analysis
positive examples that are mistakenly predicted as negative -

'I like the Pizza at Dominoes for their special ingredients and the prices, they give out one of the best pizza\'s in town. I ordered Special Pepperoni Pizza (large) from the online portal which gives me a discount of 50\% for being a regular customer, they always provide a good quantity of cheese over the pizza, we generally do not need to add a topping of pizza cheese for cheese lovers compared to other places. The place has a nice ambient environment, it is generally empty as it has more delivery orders than dine in. Orders are processed very quickly and has a special taste to it. Cost friendly prices makes me pick it up as my best buy. Overall Rating : 5/5'
'Ruby Tuesday is my favorite America Style Restaurant. The salad is awesome. And I like the baby pork ribs so much . So does the coconut shrimp.'
'Carlo\'s Plate Shack was amazing! The waitress was friendly, attentive, and helpful in answering any 

"pizza" and "salad" have negative log ratios in this model, meaning they are more likely under the negative class than under the positive class. That seems to be the driving force behind these positive reviews being classified as negative.

#### Hyperparameter tuning for this model

In [417]:
from sklearn.model_selection import GridSearchCV

In [418]:
grid_params = {
  'alpha': np.linspace(0.5, 1.5, 6),
  'fit_prior': [True, False],
}
mnb_grid_clf = GridSearchCV(nb_clf, grid_params)
mnb_grid_clf.fit(X_train_vec, y_train)
print("Best Score: ", mnb_grid_clf.best_score_)
print("Best Params: ", mnb_grid_clf.best_params_)

Best Score:  0.8476190476190476
Best Params:  {'alpha': 1.3, 'fit_prior': True}


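The grid search reports a best score of 84.76% with `alpha = 1.3` and `fit_prior = True`. This score is the mean cross-validated accuracy on the 72 training reviews (`GridSearchCV` uses 5-fold cross-validation by default), so it is not directly comparable to the 83.33% measured on the 18 held-out reviews. We do not build a separate tuned MNB model here: on an 18-review test set each prediction is worth 5.56 percentage points, so even one additional correct prediction would raise test accuracy to 16/18 = 88.89%, and a difference as small as 83.33% versus 84.76% cannot show up on this test set.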
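
A common way to tune the vectorizer and the classifier together is to wrap both in a `Pipeline`, so that the vocabulary is rebuilt inside each cross-validation fold. The sketch below is a minimal illustration under stated assumptions: the toy corpus, the step names `tfidf` and `mnb`, and the extra `min_df` grid are not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# hypothetical toy corpus: 10 positive and 10 negative reviews
pos = [f"great food amazing service visit {i}" for i in range(10)]
neg = [f"terrible food worst service visit {i}" for i in range(10)]
docs = np.array(pos + neg)
labels = np.array(["positive"] * 10 + ["negative"] * 10)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("mnb", MultinomialNB()),
])

# parameters are addressed as <step name>__<parameter name>
param_grid = {
    "tfidf__min_df": [1, 2],
    "mnb__alpha": np.linspace(0.5, 1.5, 6),
    "mnb__fit_prior": [True, False],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(docs, labels)

# best_score_ is the mean cross-validated accuracy, not a test-set accuracy
print(search.best_score_, search.best_params_)
```

After fitting, `search.best_estimator_` is the pipeline refit on all training data with the best parameters and can be scored on the held-out test set.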

#### MNB model with TFIDF vectorizer with min_df = 5

In [419]:
# fit vocabulary in training documents and transform the training documents into vectors
X_train_vec = unigram_df_tfidf_vectorizer.fit_transform(X_train)

# check the content of a document vector
print(X_train_vec.shape)
print(X_train_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_df_tfidf_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_df_tfidf_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_df_tfidf_vectorizer.vocabulary_.get('cooked'))

(72, 78)
[[0.         0.         0.         0.         0.         0.
  0.         0.39617393 0.         0.         0.26051415 0.
  0.         0.         0.30122147 0.27858724 0.         0.
  0.         0.         0.         0.26909811 0.         0.
  0.         0.         0.13363579 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.28919518 0.         0.
  0.         0.         0.         0.         0.         0.23258055
  0.         0.         0.         0.1830415  0.         0.
  0.         0.3151048  0.13773029 0.         0.         0.
  0.         0.19025042 0.         0.         0.26051415 0.
  0.         0.         0.         0.         0.         0.30122147
  0.         0.         0.         0.17964248 0.         0.        ]]
78
[('place', 51), ('best', 7), ('restaurant', 56), ('little', 39), ('food', 26), ('service', 61), ('went', 75), ('ordered', 47), ('dishes', 21), ('cooked', 14)]
14


In [420]:
X_test_vec = unigram_df_tfidf_vectorizer.transform(X_test)

# print out #examples and #features in the test set
print(X_test_vec.shape)

(18, 78)


In [421]:
# initialize the MNB model
mnb_clf = MultinomialNB()

# use the training data to train the MNB model
mnb_clf.fit(X_train_vec, y_train)

MultinomialNB()

In [422]:
log_ratios = []
features = unigram_df_tfidf_vectorizer.get_feature_names_out()
neg_cond_prob = mnb_clf.feature_log_prob_[0]
pos_cond_prob = mnb_clf.feature_log_prob_[1]

for i in range(0, len(features)):
  log_ratio = pos_cond_prob[i] - neg_cond_prob[i]
  log_ratios.append(log_ratio)

ranks = sorted(zip(log_ratios, features))
print(ranks[:10])
print(ranks[-10:])

[(-1.161079557942359, 'terrible'), (-1.0086795339362684, 'took'), (-1.0036700498032936, 'worst'), (-0.9410640400369212, 'minutes'), (-0.8762221260039338, 'come'), (-0.871066617115563, 'asked'), (-0.8381184982965362, 'bad'), (-0.8218380048796323, 'hour'), (-0.8174964767768911, 'came'), (-0.7944447687792451, 'bland')]
[(0.70306091055643, 'nice'), (0.7293222320432777, 'ambiance'), (0.7454460265742542, 'ask'), (0.7506860105633626, 'favorite'), (0.7593951820299605, 'fresh'), (0.8197803657217815, 'friendly'), (0.8555591415698571, 'delicious'), (1.2001154288604496, 'great'), (1.4564271783835734, 'amazing'), (1.686170093910997, 'best')]


In [423]:
# test the classifier on the test data set, print accuracy score

mnb_clf.score(X_test_vec, y_test)

0.8333333333333334

In [424]:
# print confusion matrix (row: ground truth; col: prediction)

# from sklearn.metrics import confusion_matrix
y_pred_mnb2 = mnb_clf.predict(X_test_vec)
cm_mnb2 = confusion_matrix(y_test, y_pred_mnb2, labels = ['negative', 'positive'])
print(cm_mnb2)

[[7 1]
 [2 8]]


In [425]:
# print classification report

# from sklearn.metrics import precision_score
# from sklearn.metrics import recall_score
print(precision_score(y_test, y_pred_mnb2, average = None))
print(recall_score(y_test, y_pred_mnb2, average = None))

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_mnb2, target_names = target_names))

[0.77777778 0.88888889]
[0.875 0.8  ]
              precision    recall  f1-score   support

    negative       0.78      0.88      0.82         8
    positive       0.89      0.80      0.84        10

    accuracy                           0.83        18
   macro avg       0.83      0.84      0.83        18
weighted avg       0.84      0.83      0.83        18



In [426]:
# print out specific type of error for further analysis

# print out the positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 2 such examples
print("MNB error analysis")
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_mnb2[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

# print out the negative examples that are mistakenly predicted as positive
# according to the confusion matrix, there should be 1 such example
print()
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_mnb2[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

MNB error analysis
positive examples that are mistakenly predicted as negative -

'Twin Trees Cicero, NY This place is very kid friendly...bring the whole family. Awesome salad bar and high quality dishes and desserts. The price and quality are right!'
'Ruby Tuesday is my favorite America Style Restaurant. The salad is awesome. And I like the baby pork ribs so much . So does the coconut shrimp.'
errors: 2

negative examples that are mistakenly predicted as positive -

'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it, and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
errors: 1


In [427]:
grid_params = {
  'alpha': np.linspace(0.5, 1.5, 6),
  'fit_prior': [True, False],
}
mnb_grid_clf2 = GridSearchCV(mnb_clf, grid_params)
mnb_grid_clf2.fit(X_train_vec, y_train)
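# report this search (mnb_grid_clf2); the output recorded below was printed from
# mnb_grid_clf, the first search, which is why it repeats those values
print("Best Score: ", mnb_grid_clf2.best_score_)
print("Best Params: ", mnb_grid_clf2.best_params_)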

Best Score:  0.8476190476190476
Best Params:  {'alpha': 1.3, 'fit_prior': True}


#### MNB with unigram count vectorizer

In [428]:
# fit vocabulary in training documents and transform the training documents into vectors
X_train_vec = unigram_count_vectorizer.fit_transform(X_train)

# check the content of a document vector
print(X_train_vec.shape)
print(X_train_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_count_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_count_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_count_vectorizer.vocabulary_.get('cooked'))

(72, 1092)
[[0 0 0 ... 0 0 0]]
1092
[('place', 693), ('best', 88), ('restaurant', 778), ('price', 721), ('little', 553), ('expensive', 348), ('food', 392), ('service', 833), ('area', 41), ('went', 1060)]
223


In [429]:
X_test_vec = unigram_count_vectorizer.transform(X_test)

# print out #examples and #features in the test set
print(X_test_vec.shape)

(18, 1092)


In [430]:
# initialize the MNB model
cmnb_clf = MultinomialNB()

# use the training data to train the MNB model
cmnb_clf.fit(X_train_vec, y_train)

MultinomialNB()

In [431]:
log_ratios = []
features = unigram_count_vectorizer.get_feature_names_out()
neg_cond_prob = cmnb_clf.feature_log_prob_[0]
pos_cond_prob = cmnb_clf.feature_log_prob_[1]

for i in range(0, len(features)):
  log_ratio = pos_cond_prob[i] - neg_cond_prob[i]
  log_ratios.append(log_ratio)

ranks = sorted(zip(log_ratios, features))
print(ranks[:10])
print(ranks[-10:])

[(-2.1884184893720144, 'terrible'), (-2.0931083095676897, 'took'), (-1.987747793909863, 'plate'), (-1.987747793909863, 'worst'), (-1.869964758253479, 'asked'), (-1.869964758253479, 'come'), (-1.869964758253479, 'pizza'), (-1.7364333656289572, 'hour'), (-1.7364333656289572, 'tofu'), (-1.636349907071974, 'minutes')]
[(1.8189146958604567, 'chocolate'), (1.8189146958604567, 'cream'), (1.8189146958604567, 'flavors'), (1.8189146958604567, 'home'), (1.8189146958604567, 'love'), (1.8189146958604567, 'makes'), (2.0012362526544107, 'noodle'), (2.0012362526544107, 'restaurants'), (2.3297403196264472, 'best'), (2.607372056224727, 'amazing')]


In [432]:
# test the classifier on the test data set, print accuracy score

cmnb_clf.score(X_test_vec, y_test)

0.7222222222222222

In [433]:
# print confusion matrix (row: ground truth; col: prediction)

# from sklearn.metrics import confusion_matrix
y_pred_mnb3 = cmnb_clf.predict(X_test_vec)
cm_mnb3 = confusion_matrix(y_test, y_pred_mnb3, labels = ['negative', 'positive'])
print(cm_mnb3)

[[6 2]
 [3 7]]


In [434]:
# print classification report

# from sklearn.metrics import precision_score
# from sklearn.metrics import recall_score
print(precision_score(y_test, y_pred_mnb3, average = None))
print(recall_score(y_test, y_pred_mnb3, average = None))

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_mnb3, target_names = target_names))

[0.66666667 0.77777778]
[0.75 0.7 ]
              precision    recall  f1-score   support

    negative       0.67      0.75      0.71         8
    positive       0.78      0.70      0.74        10

    accuracy                           0.72        18
   macro avg       0.72      0.72      0.72        18
weighted avg       0.73      0.72      0.72        18



In [435]:
# print out specific type of error for further analysis

# print out the positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 3 such examples
print("MNB error analysis")
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_mnb3[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

# print out the negative examples that are mistakenly predicted as positive
# according to the confusion matrix, there should be 2 such examples
print()
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_mnb3[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

MNB error analysis
positive examples that are mistakenly predicted as negative -

'I like the Pizza at Dominoes for their special ingredients and the prices, they give out one of the best pizza\'s in town. I ordered Special Pepperoni Pizza (large) from the online portal which gives me a discount of 50\% for being a regular customer, they always provide a good quantity of cheese over the pizza, we generally do not need to add a topping of pizza cheese for cheese lovers compared to other places. The place has a nice ambient environment, it is generally empty as it has more delivery orders than dine in. Orders are processed very quickly and has a special taste to it. Cost friendly prices makes me pick it up as my best buy. Overall Rating : 5/5'
'Ruby Tuesday is my favorite America Style Restaurant. The salad is awesome. And I like the baby pork ribs so much . So does the coconut shrimp.'
'Carlo\'s Plate Shack was amazing! The waitress was friendly, attentive, and helpful in answering any 

In [436]:
grid_params = {
  'alpha': np.linspace(0.5, 1.5, 6),
  'fit_prior': [True, False],
}
mnb_grid_clf3 = GridSearchCV(cmnb_clf, grid_params)
mnb_grid_clf3.fit(X_train_vec, y_train)
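# report this search (mnb_grid_clf3); the output recorded below was printed from
# mnb_grid_clf, the first search, which is why it repeats those values
print("Best Score: ", mnb_grid_clf3.best_score_)
print("Best Params: ", mnb_grid_clf3.best_params_)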

Best Score:  0.8476190476190476
Best Params:  {'alpha': 1.3, 'fit_prior': True}


In [437]:
# re-initialize the MNB model with the parameters printed above (alpha = 1.3, fit_prior = True)
cmnb_clf = MultinomialNB(alpha = 1.3, fit_prior = True)

# use the training data to train the MNB model
cmnb_clf.fit(X_train_vec, y_train)

MultinomialNB(alpha=1.3)

In [438]:
# print confusion matrix (row: ground truth; col: prediction)

# from sklearn.metrics import confusion_matrix
y_pred_mnb3 = cmnb_clf.predict(X_test_vec)
cm_mnb3 = confusion_matrix(y_test, y_pred_mnb3, labels = ['negative', 'positive'])
print(cm_mnb3)

[[6 2]
 [3 7]]


In [439]:
# print classification report

# from sklearn.metrics import precision_score
# from sklearn.metrics import recall_score
print(precision_score(y_test, y_pred_mnb3, average = None))
print(recall_score(y_test, y_pred_mnb3, average = None))

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_mnb3, target_names = target_names))

[0.66666667 0.77777778]
[0.75 0.7 ]
              precision    recall  f1-score   support

    negative       0.67      0.75      0.71         8
    positive       0.78      0.70      0.74        10

    accuracy                           0.72        18
   macro avg       0.72      0.72      0.72        18
weighted avg       0.73      0.72      0.72        18



In [440]:
# test the classifier on the test data set, print accuracy score

cmnb_clf.score(X_test_vec, y_test)

0.7222222222222222

#### MNB model with Count vectorizer with min_df = 5

In [441]:
# fit vocabulary in training documents and transform the training documents into vectors
X_train_vec = unigram_df_count_vectorizer.fit_transform(X_train)

# check the content of a document vector
print(X_train_vec.shape)
print(X_train_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_df_count_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_df_count_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_df_count_vectorizer.vocabulary_.get('cooked'))

(72, 78)
[[0 0 0 0 0 0 0 2 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1
  0 0 0 1 0 0]]
78
[('place', 51), ('best', 7), ('restaurant', 56), ('little', 39), ('food', 26), ('service', 61), ('went', 75), ('ordered', 47), ('dishes', 21), ('cooked', 14)]
14


In [442]:
X_test_vec = unigram_df_count_vectorizer.transform(X_test)

# print out #examples and #features in the test set
print(X_test_vec.shape)

(18, 78)


In [443]:
# initialize the MNB model
cmnb_clf2 = MultinomialNB()

# use the training data to train the MNB model
cmnb_clf2.fit(X_train_vec, y_train)

MultinomialNB()

In [444]:
log_ratios = []
features = unigram_df_count_vectorizer.get_feature_names_out()
neg_cond_prob = cmnb_clf2.feature_log_prob_[0]
pos_cond_prob = cmnb_clf2.feature_log_prob_[1]

for i in range(0, len(features)):
  log_ratio = pos_cond_prob[i] - neg_cond_prob[i]
  log_ratios.append(log_ratio)

ranks = sorted(zip(log_ratios, features))
print(ranks[:10])
print(ranks[-10:])

[(-2.086861609815842, 'terrible'), (-1.9915514300115174, 'took'), (-1.8861909143536915, 'worst'), (-1.7684078786973076, 'asked'), (-1.7684078786973076, 'come'), (-1.634876486072785, 'hour'), (-1.5347930275158017, 'minutes'), (-1.4807258062455264, 'came'), (-1.480725806245526, 'bland'), (-1.480725806245526, 'said')]
[(1.004180843542474, 'sauce'), (1.158331523369732, 'nice'), (1.2273243948566837, 'ambiance'), (1.2273243948566837, 'ask'), (1.2273243948566837, 'recommend'), (1.2918629159942538, 'delicious'), (1.4096459516506386, 'friendly'), (1.5637966314778962, 'great'), (2.431297199182619, 'best'), (2.708928935780899, 'amazing')]


In [445]:
# test the classifier on the test data set, print accuracy score

cmnb_clf2.score(X_test_vec, y_test)

0.8333333333333334

In [446]:
# print confusion matrix (row: ground truth; col: prediction)

# from sklearn.metrics import confusion_matrix
y_pred_mnb4 = cmnb_clf2.predict(X_test_vec)
cm_mnb4 = confusion_matrix(y_test, y_pred_mnb4, labels = ['negative', 'positive'])
print(cm_mnb4)

[[7 1]
 [2 8]]


In [447]:
# print classification report

# from sklearn.metrics import precision_score
# from sklearn.metrics import recall_score
print(precision_score(y_test, y_pred_mnb4, average = None))
print(recall_score(y_test, y_pred_mnb4, average = None))

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_mnb4, target_names = target_names))

[0.77777778 0.88888889]
[0.875 0.8  ]
              precision    recall  f1-score   support

    negative       0.78      0.88      0.82         8
    positive       0.89      0.80      0.84        10

    accuracy                           0.83        18
   macro avg       0.83      0.84      0.83        18
weighted avg       0.84      0.83      0.83        18



In [448]:
# print out specific type of error for further analysis

# print out the positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 2 such examples
print("MNB error analysis")
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_mnb4[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

# print out the negative examples that are mistakenly predicted as positive
# according to the confusion matrix, there should be 1 such example
print()
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_mnb4[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

MNB error analysis
positive examples that are mistakenly predicted as negative -

'Twin Trees Cicero, NY This place is very kid friendly...bring the whole family. Awesome salad bar and high quality dishes and desserts. The price and quality are right!'
'Ruby Tuesday is my favorite America Style Restaurant. The salad is awesome. And I like the baby pork ribs so much . So does the coconut shrimp.'
errors: 2

negative examples that are mistakenly predicted as positive -

'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it, and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
errors: 1
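
The same error-listing loop is repeated for every model in this notebook. One possible way to factor it out is sketched below; `misclassified` is a hypothetical helper name, and the toy lists merely stand in for `X_test`, `y_test` and a model's predictions.

```python
def misclassified(texts, y_true, y_pred, true_label, pred_label):
    """Return the texts whose true label is true_label but which were predicted as pred_label."""
    return [text for text, truth, guess in zip(texts, y_true, y_pred)
            if truth == true_label and guess == pred_label]

# toy data standing in for X_test, y_test and the model predictions
texts = ["great pizza", "cold meal", "friendly staff"]
y_true = ["positive", "negative", "positive"]
y_pred = ["positive", "positive", "negative"]

false_negatives = misclassified(texts, y_true, y_pred, "positive", "negative")
false_positives = misclassified(texts, y_true, y_pred, "negative", "positive")
print("positive predicted as negative:", false_negatives)
print("negative predicted as positive:", false_positives)
```

Iterating with `zip` avoids positional indexing, so the helper would also work if the labels were held in a pandas Series with a non-default index.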


In [449]:
grid_params = {
  'alpha': np.linspace(0.5, 1.5, 6),
  'fit_prior': [True, False],
}
mnb_grid_clf4 = GridSearchCV(cmnb_clf2, grid_params)
mnb_grid_clf4.fit(X_train_vec, y_train)
print("Best Score: ", mnb_grid_clf4.best_score_)
print("Best Params: ", mnb_grid_clf4.best_params_)

Best Score:  0.8476190476190476
Best Params:  {'alpha': 1.3, 'fit_prior': True}


### Support Vector Machine

#### Vectorize the training data

In [450]:
# fit vocabulary in training documents and transform the training documents into vectors
X_train_vec = unigram_tfidf_vectorizer.fit_transform(X_train)

# check the content of a document vector
print(X_train_vec.shape)
print(X_train_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_tfidf_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_tfidf_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_tfidf_vectorizer.vocabulary_.get('cooked'))

(72, 1092)
[[0. 0. 0. ... 0. 0. 0.]]
1092
[('place', 693), ('best', 88), ('restaurant', 778), ('price', 721), ('little', 553), ('expensive', 348), ('food', 392), ('service', 833), ('area', 41), ('went', 1060)]
223


#### Vectorize the test data

In [451]:
X_test_vec = unigram_tfidf_vectorizer.transform(X_test)

# print out #examples and #features in the test set
print(X_test_vec.shape)

(18, 1092)


#### Create base Linear SVC model

In [452]:
# import the LinearSVC module
from sklearn.svm import LinearSVC

# initialize the LinearSVC model
svm_clf = LinearSVC(C = 1)

# use the training data to train the model
svm_clf.fit(X_train_vec, y_train)

LinearSVC(C=1)

##### Interpret trained linear SVC model

In [453]:
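## a binary LinearSVC has a single weight vector, coef_[0]: negative weights point to classes_[0] ('negative'), positive weights to classes_[1] ('positive')
## pair each weight with its feature and sort the pairs in increasing order of weight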
feature_ranks = sorted(zip(svm_clf.coef_[0], unigram_tfidf_vectorizer.get_feature_names_out()))

## get the 10 features that are the best indicators of negative sentiment (the most negative weights, at the start of feature_ranks)
negative_10 = feature_ranks[:10]
print("negative words")
for i in range(0, len(negative_10)):
    print(negative_10[i])
print()

## get the 10 features that are the best indicators of positive sentiment (the largest weights, at the end of feature_ranks)
positive_10 = feature_ranks[-10:]
print("positive words")
for i in range(0, len(positive_10)):
    print(positive_10[i])
print()

negative words
(-0.6146373237528078, 'pizza')
(-0.5980084064595763, 'bad')
(-0.5621639413662503, 'minutes')
(-0.5369349518253462, 'terrible')
(-0.5235938747638061, 'went')
(-0.5228494238897976, 'place')
(-0.4545683329849489, 'hour')
(-0.44177808619691267, 'plate')
(-0.4357298312802893, 'come')
(-0.42610117095131894, 'dishes')

positive words
(0.41058569593921296, 'restaurants')
(0.42615085119897744, 'll')
(0.43257008569841726, 'ate')
(0.43316383147819953, 'noodle')
(0.43410087975570016, 'delicious')
(0.47987048462611187, 'home')
(0.5223060000269993, 'friendly')
(0.6965433988617675, 'amazing')
(0.7548398963663683, 'great')
(1.2255887757819786, 'best')



##### Test the Linear SVC classifier

In [454]:
# test the classifier on the test data set, print accuracy score

svm_clf.score(X_test_vec, y_test)

0.7777777777777778

In [455]:
# print confusion matrix and classification report

# from sklearn.metrics import confusion_matrix
y_pred_svm = svm_clf.predict(X_test_vec)
cm_svm = confusion_matrix(y_test, y_pred_svm, labels = ['negative', 'positive'])
print(cm_svm)
print()

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_svm, target_names = target_names))

[[7 1]
 [3 7]]

              precision    recall  f1-score   support

    negative       0.70      0.88      0.78         8
    positive       0.88      0.70      0.78        10

    accuracy                           0.78        18
   macro avg       0.79      0.79      0.78        18
weighted avg       0.80      0.78      0.78        18



##### Error Analysis

In [456]:
# print out specific type of error for further analysis

# print out the negative examples that are mistakenly predicted as positive
# according to the confusion matrix, there should be 1 such example
print("SVM error analysis")
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_svm[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)
print()
# print out the positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 3 such examples
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_svm[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

SVM error analysis
negative examples that are mistakenly predicted as positive -

'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it, and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
errors: 1

positive examples that are mistakenly predicted as negative -

'I like the Pizza at Dominoes for their special ingredients and the prices, they give out one of the best pizza\'s in town. I ordered Special Pepperoni Pizza (large) from the online portal which gives me a discount of 50\% for being a regular customer, they always provide a good quantity of cheese over the pizza, we generally do not need to add a topping of pizza cheese for cheese lovers compared to other places. The place has a nice ambient environment, it is generally empty as it has more delivery orders than dine in. Orders are processed very quickly and has a special taste to it. Cost friendly prices makes 

#### Train a base SVC model (default RBF kernel)

In [457]:
from sklearn.svm import SVC

# initialize the SVC model
svm_clf = SVC()

# use training data to train the model
svm_clf.fit(X_train_vec, y_train)

# get the accuracy of the base SVM classifier
# svm_clf_accuracy = svm_clf.score(X_test, y_test)
# print(svm_clf_accuracy)

SVC()

##### Test the SVC classifier

In [458]:
# test the classifier on the test data set, print accuracy score

svm_clf.score(X_test_vec, y_test)

0.7222222222222222

In [459]:
# print confusion matrix and classification report

# from sklearn.metrics import confusion_matrix
y_pred_svm = svm_clf.predict(X_test_vec)
cm_svm = confusion_matrix(y_test, y_pred_svm, labels = ['negative', 'positive'])
print(cm_svm)
print()

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_svm, target_names = target_names))

[[8 0]
 [5 5]]

              precision    recall  f1-score   support

    negative       0.62      1.00      0.76         8
    positive       1.00      0.50      0.67        10

    accuracy                           0.72        18
   macro avg       0.81      0.75      0.71        18
weighted avg       0.83      0.72      0.71        18



##### Error Analysis

In [460]:
# print out specific type of error for further analysis

# print out the negative examples that are mistakenly predicted as positive
# according to the confusion matrix, there should be 0 such examples
print("SVM error analysis")
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_svm[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)
print()
# print out the positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 5 such examples
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_svm[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

SVM error analysis
negative examples that are mistakenly predicted as positive -

errors: 0

positive examples that are mistakenly predicted as negative -

'Pastablities is a locally owned restaurant in Syracuse. The food is simple and homey and comforting. Their famous bread is baked daily and the bakery is right next door. The bread is soft and chewy and amazing with their homemade spicy tomato sauce. The paste and cheese that I had was cream and cooked to perfection. '
'I went into the restaurant, it decorated comfortably with a soft light and nice pictures, the waitress was kind and stand by my side throughout the whole dining time, asking whether I need something more and kept smiling. '
'I like the Pizza at Dominoes for their special ingredients and the prices, they give out one of the best pizza\'s in town. I ordered Special Pepperoni Pizza (large) from the online portal which gives me a discount of 50\% for being a regular customer, they always provide a good quantity of cheese

##### Tuning the SVC model

In [461]:
# Create a dictionary of possible parameters
params_grid = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000], 'gamma': [0.0001, 0.001, 0.01, 0.1, 1], 'kernel':['linear','rbf', 'poly'] }
# Create the GridSearchCV object
grid_clf = GridSearchCV(svm_clf, params_grid)
print(grid_clf)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
                         'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
                         'kernel': ['linear', 'rbf', 'poly']})


In [462]:
# run the grid search on the training data
grid_clf = grid_clf.fit(X_train_vec, y_train)
# print the best cross-validated score and its parameters
print("Best Score: ", grid_clf.best_score_)
print("Best Params: ", grid_clf.best_params_)

Best Score:  0.8619047619047618
Best Params:  {'C': 1, 'gamma': 0.0001, 'kernel': 'linear'}
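
In `SVC`, `gamma` is only used by the `rbf`, `poly` and `sigmoid` kernels, so the `gamma` value reported next to `kernel='linear'` does not influence that model. A possible refinement, sketched below, is to pass `GridSearchCV` a list of dictionaries so that `gamma` is searched only where it matters; `ParameterGrid` is used here just to count the candidates, and the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
gamma_values = [0.0001, 0.001, 0.01, 0.1, 1]

# the grid used in the notebook: every kernel is crossed with every gamma
full_grid = {'C': C_values, 'gamma': gamma_values, 'kernel': ['linear', 'rbf', 'poly']}

# alternative: search gamma only for the kernels that use it
split_grid = [
    {'kernel': ['linear'], 'C': C_values},
    {'kernel': ['rbf', 'poly'], 'C': C_values, 'gamma': gamma_values},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print("candidates in full grid: ", n_full)
print("candidates in split grid:", n_split)
```

The split grid can be passed to `GridSearchCV` in place of `params_grid`; it skips the redundant linear-kernel fits that differ only in `gamma`.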


#### Create a tuned SVC model

In [463]:
# from sklearn.svm import SVC

# initialize the SVC model with the best parameters from the grid search
# (gamma has no effect when kernel = 'linear')
svm_clf = SVC(C = 1, gamma = 0.0001, kernel = 'linear')

# use training data to train the model
svm_clf.fit(X_train_vec, y_train)

# get the accuracy of the base SVM classifier
# svm_clf_accuracy = svm_clf.score(X_test, y_test)
# print(svm_clf_accuracy)

SVC(C=1, gamma=0.0001, kernel='linear')

##### Test the tuned SVC classifier

In [464]:
# test the classifier on the test data set, print accuracy score

svm_clf.score(X_test_vec, y_test)

0.7777777777777778

In [465]:
# print confusion matrix and classification report

# from sklearn.metrics import confusion_matrix
y_pred_svm = svm_clf.predict(X_test_vec)
cm_svm = confusion_matrix(y_test, y_pred_svm, labels = ['negative', 'positive'])
print(cm_svm)
print()

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_svm, target_names = target_names))

[[7 1]
 [3 7]]

              precision    recall  f1-score   support

    negative       0.70      0.88      0.78         8
    positive       0.88      0.70      0.78        10

    accuracy                           0.78        18
   macro avg       0.79      0.79      0.78        18
weighted avg       0.80      0.78      0.78        18



##### Error Analysis

In [466]:
# print out specific type of error for further analysis

# print out the negative examples that are mistakenly predicted as positive
# according to the confusion matrix, there should be 1 such example
print("SVM error analysis")
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_svm[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)
print()
# print out the positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 3 such examples
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_svm[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

SVM error analysis
negative examples that are mistakenly predicted as positive -

'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it, and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
errors: 1

positive examples that are mistakenly predicted as negative -

'I like the Pizza at Dominoes for their special ingredients and the prices, they give out one of the best pizza\'s in town. I ordered Special Pepperoni Pizza (large) from the online portal which gives me a discount of 50\% for being a regular customer, they always provide a good quantity of cheese over the pizza, we generally do not need to add a topping of pizza cheese for cheese lovers compared to other places. The place has a nice ambient environment, it is generally empty as it has more delivery orders than dine in. Orders are processed very quickly and has a special taste to it. Cost friendly prices makes 

#### SVC model with tfidf vectorizer with min_df = 5

In [467]:
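# transform the test data; this relies on unigram_df_tfidf_vectorizer having already been fit earlier in the notebook
# (on a fresh kernel, run the fit_transform cell below first)
X_test_vec = unigram_df_tfidf_vectorizer.transform(X_test)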

# print out #examples and #features in the test set
print(X_test_vec.shape)

(18, 78)


In [468]:
# fit vocabulary in training documents and transform the training documents into vectors
X_train_vec = unigram_df_tfidf_vectorizer.fit_transform(X_train)

# check the content of a document vector
print(X_train_vec.shape)
print(X_train_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_df_tfidf_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_df_tfidf_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_df_tfidf_vectorizer.vocabulary_.get('cooked'))

(72, 78)
[[0.         0.         0.         0.         0.         0.
  0.         0.39617393 0.         0.         0.26051415 0.
  0.         0.         0.30122147 0.27858724 0.         0.
  0.         0.         0.         0.26909811 0.         0.
  0.         0.         0.13363579 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.28919518 0.         0.
  0.         0.         0.         0.         0.         0.23258055
  0.         0.         0.         0.1830415  0.         0.
  0.         0.3151048  0.13773029 0.         0.         0.
  0.         0.19025042 0.         0.         0.26051415 0.
  0.         0.         0.         0.         0.         0.30122147
  0.         0.         0.         0.17964248 0.         0.        ]]
78
[('place', 51), ('best', 7), ('restaurant', 56), ('little', 39), ('food', 26), ('service', 61), ('went', 75), ('ordered', 47), ('dishes', 21), ('cooked', 14)]
14
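
Because the vectorizer must be fit on the training documents before the test documents are transformed, it can help to tie the two steps together. The sketch below uses a scikit-learn pipeline on a made-up toy corpus; the `min_df=1` setting and the `LinearSVC` classifier are assumptions chosen for the tiny example, not the notebook's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# made-up toy corpus standing in for X_train / y_train / X_test
train_docs = ["great food", "great service", "bad food", "bad service"]
train_labels = ["positive", "positive", "negative", "negative"]
test_docs = ["great food", "bad service"]

# min_df=1 only because the toy corpus is tiny; this section of the notebook uses min_df = 5
pipe = make_pipeline(TfidfVectorizer(min_df=1), LinearSVC(C=1))

# fit() learns the vocabulary and the classifier from the training documents only
pipe.fit(train_docs, train_labels)

# predict() transforms the raw test text with the already-fitted vocabulary
predictions = pipe.predict(test_docs)
print(predictions)
```

A pipeline can also be handed to `GridSearchCV`, in which case the vocabulary is refit inside each cross-validation fold instead of being built once from all training documents.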


In [469]:
# from sklearn.svm import SVC

# initialize the SVC model
svm_clf = SVC()

# use training data to train the model
svm_clf.fit(X_train_vec, y_train)

# get the accuracy of the base SVM classifier
# svm_clf_accuracy = svm_clf.score(X_test, y_test)
# print(svm_clf_accuracy)

SVC()

In [470]:
# test the classifier on the test data set, print accuracy score

svm_clf.score(X_test_vec, y_test)

0.7777777777777778

In [471]:
# print confusion matrix and classification report

# from sklearn.metrics import confusion_matrix
y_pred_svm = svm_clf.predict(X_test_vec)
cm_svm = confusion_matrix(y_test, y_pred_svm, labels = ['negative', 'positive'])
print(cm_svm)
print()

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_svm, target_names = target_names))

[[7 1]
 [3 7]]

              precision    recall  f1-score   support

    negative       0.70      0.88      0.78         8
    positive       0.88      0.70      0.78        10

    accuracy                           0.78        18
   macro avg       0.79      0.79      0.78        18
weighted avg       0.80      0.78      0.78        18



In [472]:
# print out specific type of error for further analysis

# print out the negative examples that are mistakenly predicted as positive
# according to the confusion matrix, there should be 1 such example
print("SVM error analysis")
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_svm[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)
print()
# print out the positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 3 such examples
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_svm[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

SVM error analysis
negative examples that are mistakenly predicted as positive -

'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it, and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
errors: 1

positive examples that are mistakenly predicted as negative -

'Twin Trees Cicero, NY This place is very kid friendly...bring the whole family. Awesome salad bar and high quality dishes and desserts. The price and quality are right!'
'I went into the restaurant, it decorated comfortably with a soft light and nice pictures, the waitress was kind and stand by my side throughout the whole dining time, asking whether I need something more and kept smiling. '
'Ruby Tuesday is my favorite America Style Restaurant. The salad is awesome. And I like the baby pork ribs so much . So does the coconut shrimp.'
errors: 3


In [473]:
# Create a dictionary of possible parameters
params_grid = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000], 'gamma': [0.0001, 0.001, 0.01, 0.1, 1], 'kernel':['linear','rbf', 'poly'] }
# Create the GridSearchCV object
grid_clf = GridSearchCV(svm_clf, params_grid)
print(grid_clf)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
                         'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
                         'kernel': ['linear', 'rbf', 'poly']})


In [474]:
# run the grid search on the training data
grid_clf = grid_clf.fit(X_train_vec, y_train)
# print the best cross-validated score and its parameters
print("Best Score: ", grid_clf.best_score_)
print("Best Params: ", grid_clf.best_params_)

Best Score:  0.7942857142857143
Best Params:  {'C': 1, 'gamma': 1, 'kernel': 'rbf'}


In [475]:
# from sklearn.svm import SVC

# initialize the SVC model
svm_clf = SVC(C = 1, gamma = 1, kernel = 'rbf')

# use training data to train the model
svm_clf.fit(X_train_vec, y_train)

# get the accuracy of the base SVM classifier
# svm_clf_accuracy = svm_clf.score(X_test, y_test)
# print(svm_clf_accuracy)

SVC(C=1, gamma=1)

In [476]:
# test the classifier on the test data set, print accuracy score

svm_clf.score(X_test_vec, y_test)

0.7777777777777778

In [477]:
# print confusion matrix and classification report

# from sklearn.metrics import confusion_matrix
y_pred_svm = svm_clf.predict(X_test_vec)
cm_svm = confusion_matrix(y_test, y_pred_svm, labels = ['negative', 'positive'])
print(cm_svm)
print()

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_svm, target_names = target_names))

[[7 1]
 [3 7]]

              precision    recall  f1-score   support

    negative       0.70      0.88      0.78         8
    positive       0.88      0.70      0.78        10

    accuracy                           0.78        18
   macro avg       0.79      0.79      0.78        18
weighted avg       0.80      0.78      0.78        18



In [478]:
# print out specific type of error for further analysis

# print out the negative examples that are mistakenly predicted as positive
# according to the confusion matrix, there should be 1 such example
print("SVM error analysis")
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_svm[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)
print()
# print out the positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 3 such examples
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_svm[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

SVM error analysis
negative examples that are mistakenly predicted as positive -

'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it, and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
errors: 1

positive examples that are mistakenly predicted as negative -

'Twin Trees Cicero, NY This place is very kid friendly...bring the whole family. Awesome salad bar and high quality dishes and desserts. The price and quality are right!'
'I went into the restaurant, it decorated comfortably with a soft light and nice pictures, the waitress was kind and stand by my side throughout the whole dining time, asking whether I need something more and kept smiling. '
'Ruby Tuesday is my favorite America Style Restaurant. The salad is awesome. And I like the baby pork ribs so much . So does the coconut shrimp.'
errors: 3


#### SVC model with count vectorizer with min_df = 5

In [479]:
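# transform the test data; this relies on unigram_df_count_vectorizer having already been fit earlier in the notebook
# (on a fresh kernel, run the fit_transform cell below first)
X_test_vec = unigram_df_count_vectorizer.transform(X_test)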

# print out #examples and #features in the test set
print(X_test_vec.shape)

(18, 78)


In [480]:
# fit vocabulary in training documents and transform the training documents into vectors
X_train_vec = unigram_df_count_vectorizer.fit_transform(X_train)

# check the content of a document vector
print(X_train_vec.shape)
print(X_train_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_df_count_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_df_count_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_df_count_vectorizer.vocabulary_.get('cooked'))

(72, 78)
[[0 0 0 0 0 0 0 2 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1
  0 0 0 1 0 0]]
78
[('place', 51), ('best', 7), ('restaurant', 56), ('little', 39), ('food', 26), ('service', 61), ('went', 75), ('ordered', 47), ('dishes', 21), ('cooked', 14)]
14


In [481]:
# from sklearn.svm import SVC

# initialize the SVC model
svm_clf = SVC()

# use training data to train the model
svm_clf.fit(X_train_vec, y_train)

# get the accuracy of the base SVM classifier
# svm_clf_accuracy = svm_clf.score(X_test, y_test)
# print(svm_clf_accuracy)

SVC()

In [482]:
# test the classifier on the test data set, print accuracy score

svm_clf.score(X_test_vec, y_test)

0.6666666666666666

In [483]:
# print confusion matrix and classification report

# from sklearn.metrics import confusion_matrix
y_pred_svm = svm_clf.predict(X_test_vec)
cm_svm = confusion_matrix(y_test, y_pred_svm, labels = ['negative', 'positive'])
print(cm_svm)
print()

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_svm, target_names = target_names))

[[6 2]
 [4 6]]

              precision    recall  f1-score   support

    negative       0.60      0.75      0.67         8
    positive       0.75      0.60      0.67        10

    accuracy                           0.67        18
   macro avg       0.68      0.68      0.67        18
weighted avg       0.68      0.67      0.67        18



In [484]:
# Create a dictionary of possible parameters
params_grid = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000], 'gamma': [0.0001, 0.001, 0.01, 0.1, 1], 'kernel':['linear','rbf', 'poly'] }
# Create the GridSearchCV object
grid_clf = GridSearchCV(svm_clf, params_grid)
print(grid_clf)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
                         'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
                         'kernel': ['linear', 'rbf', 'poly']})


In [485]:
# run the grid search on the training data
grid_clf = grid_clf.fit(X_train_vec, y_train)
# print the best cross-validated score and its parameters
print("Best Score: ", grid_clf.best_score_)
print("Best Params: ", grid_clf.best_params_)

Best Score:  0.7638095238095237
Best Params:  {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}


In [486]:
# from sklearn.svm import SVC

# initialize the SVC model
svm_clf = SVC(C = 10, gamma = 0.001, kernel = 'rbf')

# use training data to train the model
svm_clf.fit(X_train_vec, y_train)

# get the accuracy of the base SVM classifier
# svm_clf_accuracy = svm_clf.score(X_test, y_test)
# print(svm_clf_accuracy)

SVC(C=10, gamma=0.001)

In [487]:
# test the classifier on the test data set, print accuracy score

svm_clf.score(X_test_vec, y_test)

0.7222222222222222

In [488]:
# print confusion matrix and classification report

# from sklearn.metrics import confusion_matrix
y_pred_svm = svm_clf.predict(X_test_vec)
cm_svm = confusion_matrix(y_test, y_pred_svm, labels = ['negative', 'positive'])
print(cm_svm)
print()

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_svm, target_names = target_names))

[[6 2]
 [3 7]]

              precision    recall  f1-score   support

    negative       0.67      0.75      0.71         8
    positive       0.78      0.70      0.74        10

    accuracy                           0.72        18
   macro avg       0.72      0.72      0.72        18
weighted avg       0.73      0.72      0.72        18



In [89]:
# print out specific type of error for further analysis

# print out the negative examples that are mistakenly predicted as positive
# according to the confusion matrix, there should be 2 such examples
print("SVM error analysis")
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_svm[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)
print()
# print out the positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 3 such examples
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_svm[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

SVM error analysis
negative examples that are mistakenly predicted as positive -

'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it, and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
'Usually, I use Yelp to find restaurant. The Yelp would give restaurant \'stars\' to divide them into different level. And it has own map to direct you to the right place. This summer, I use it to search a subway shop in New York city, and then it give me a best choice. I followed the map to that store, and then I was surprised. It is a butcher\'s which sell ham gammon. I can not find any subway in this store. When I asked the shop assistant, he said that you can bought ham to do subway by yourself.'
errors: 2

positive examples that are mistakenly predicted as negative -

'Twin Trees Cicero, NY This place is very kid friendly...bring the whole family. Awesome salad bar and high qual

#### SVM with Count Vectorizer

In [489]:
# fit vocabulary in training documents and transform the training documents into vectors
X_train_vec = unigram_count_vectorizer.fit_transform(X_train)

# check the content of a document vector
print(X_train_vec.shape)
print(X_train_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_count_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_count_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_count_vectorizer.vocabulary_.get('cooked'))

X_test_vec = unigram_count_vectorizer.transform(X_test)

# print out #examples and #features in the test set
print(X_test_vec.shape)

(72, 1092)
[[0 0 0 ... 0 0 0]]
1092
[('place', 693), ('best', 88), ('restaurant', 778), ('price', 721), ('little', 553), ('expensive', 348), ('food', 392), ('service', 833), ('area', 41), ('went', 1060)]
223
(18, 1092)


In [490]:
# from sklearn.svm import SVC

# initialize the SVC model
svm_clf = SVC()

# use training data to train the model
svm_clf.fit(X_train_vec, y_train)

# get the accuracy of the base SVM classifier
# svm_clf_accuracy = svm_clf.score(X_test, y_test)
# print(svm_clf_accuracy)

SVC()

In [491]:
# test the classifier on the test data set, print accuracy score

svm_clf.score(X_test_vec, y_test)

0.6666666666666666

In [492]:
# print confusion matrix and classification report

# from sklearn.metrics import confusion_matrix
y_pred_svm = svm_clf.predict(X_test_vec)
cm_svm = confusion_matrix(y_test, y_pred_svm, labels = ['negative', 'positive'])
print(cm_svm)
print()

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_svm, target_names = target_names))

[[4 4]
 [2 8]]

              precision    recall  f1-score   support

    negative       0.67      0.50      0.57         8
    positive       0.67      0.80      0.73        10

    accuracy                           0.67        18
   macro avg       0.67      0.65      0.65        18
weighted avg       0.67      0.67      0.66        18



In [493]:
# Create a dictionary of possible parameters
params_grid = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000], 'gamma': [0.0001, 0.001, 0.01, 0.1, 1], 'kernel':['linear','rbf', 'poly'] }
# Create the GridSearchCV object
grid_clf = GridSearchCV(svm_clf, params_grid)
print(grid_clf)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
                         'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
                         'kernel': ['linear', 'rbf', 'poly']})


In [494]:
# run the grid search on the training data (5-fold cross-validation by default)
grid_clf = grid_clf.fit(X_train_vec, y_train)
# print the best cross-validated score and the parameters that produced it
print("Best Score: ", grid_clf.best_score_)
print("Best Params: ", grid_clf.best_params_)

Best Score:  0.8209523809523811
Best Params:  {'C': 0.1, 'gamma': 0.0001, 'kernel': 'linear'}
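
The winning configuration uses the linear kernel, and `SVC` only uses `gamma` for the 'rbf', 'poly' and 'sigmoid' kernels. The reported `gamma` of 0.0001 is therefore most likely just the first of several tied values, not a tuned one. A minimal sketch, assuming the same value lists as the cell above, of a grid split into two dictionaries so that `gamma` is searched only where it has an effect; the names `flat_grid` and `split_grid` are illustrative:

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
gamma_values = [0.0001, 0.001, 0.01, 0.1, 1]

# single dict: every kernel is crossed with every gamma value
flat_grid = {'C': C_values, 'gamma': gamma_values,
             'kernel': ['linear', 'rbf', 'poly']}

# list of dicts: gamma is searched only for kernels that use it
split_grid = [
    {'kernel': ['linear'], 'C': C_values},
    {'kernel': ['rbf', 'poly'], 'C': C_values, 'gamma': gamma_values},
]

n_flat = len(ParameterGrid(flat_grid))
n_split = len(ParameterGrid(split_grid))
print("candidates in flat grid:", n_flat)
print("candidates in split grid:", n_split)
```

`GridSearchCV` accepts the list form directly as its `param_grid`, so the rest of the cell would not need to change.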


In [495]:
# from sklearn.svm import SVC

# initialize the SVC model
svm_clf = SVC(C = 0.1, gamma = 0.0001, kernel = 'linear')

# use training data to train the model
svm_clf.fit(X_train_vec, y_train)

# get the accuracy of the base SVM classifier
# svm_clf_accuracy = svm_clf.score(X_test, y_test)
# print(svm_clf_accuracy)

SVC(C=0.1, gamma=0.0001, kernel='linear')

In [496]:
# test the classifier on the test data set, print accuracy score

svm_clf.score(X_test_vec, y_test)

0.7222222222222222

In [497]:
# print confusion matrix and classification report

# from sklearn.metrics import confusion_matrix
y_pred_svm = svm_clf.predict(X_test_vec)
cm_svm = confusion_matrix(y_test, y_pred_svm, labels = ['negative', 'positive'])
print(cm_svm)
print()

# from sklearn.metrics import classification_report
target_names = ['negative','positive']
print(classification_report(y_test, y_pred_svm, target_names = target_names))

[[6 2]
 [3 7]]

              precision    recall  f1-score   support

    negative       0.67      0.75      0.71         8
    positive       0.78      0.70      0.74        10

    accuracy                           0.72        18
   macro avg       0.72      0.72      0.72        18
weighted avg       0.73      0.72      0.72        18



### Fake Review Detection

In [90]:
# from sklearn.model_selection import train_test_split
X_train, X_test, z_train, z_test = train_test_split(X, z, test_size = 0.2, random_state = 2022)

print(X_train.shape, z_train.shape, X_test.shape, z_test.shape)
print(X_train[0])
print(z_train[0])
print(X_test[0])
print(z_test[0])

(72,) (72,) (18,) (18,)
'This place was one of the best restaurant I have been. The price is little expensive, but the food and the service is best around the area. I went here with my family, and we ordered 4 dishes. They were all well cooked, and their taste were nicely balanced. Waiters came when we needed them without having to call for them. I would definitely recommend it to everyone visiting this area. '
fake
'Pastablities is a locally owned restaurant in Syracuse. The food is simple and homey and comforting. Their famous bread is baked daily and the bakery is right next door. The bread is soft and chewy and amazing with their homemade spicy tomato sauce. The paste and cheese that I had was cream and cooked to perfection. '
true


#### Data Checking

In [91]:
unique, counts = np.unique(z_train, return_counts = True)
print(np.asarray((unique, counts)))

[['fake' 'true']
 [36 36]]


The training set is balanced: it has 36 records in each category, fake and true.

In [92]:
uniqueTest, countsTest = np.unique(z_test, return_counts = True)
print(np.asarray((uniqueTest, countsTest)))

[['fake' 'true']
 [10 8]]
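
The random split leaves the test set slightly unbalanced (10 fake, 8 true) even though the training set is even. One option, sketched here on hypothetical stand-in data with the class totals implied by the counts above (46 fake, 44 true), is to pass `stratify` to `train_test_split` so that both splits keep the class ratio as closely as possible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical stand-in data: 90 placeholder reviews, 46 fake and 44 true
X_demo = np.array([f"review {i}" for i in range(90)])
z_demo = np.array(['fake'] * 46 + ['true'] * 44)

# stratify keeps the fake/true ratio (almost) identical in both splits
X_tr, X_te, z_tr, z_te = train_test_split(
    X_demo, z_demo, test_size=0.2, random_state=2022, stratify=z_demo)

test_labels, test_counts = np.unique(z_te, return_counts=True)
test_balance = dict(zip(test_labels.tolist(), test_counts.tolist()))
print(test_balance)
```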


#### Vectorize train data

In [93]:
# fit vocabulary in training documents and transform the training documents into vectors
X_train_tfidf_vec = unigram_tfidf_vectorizer.fit_transform(X_train)
X_train_tfidf_df_vec = unigram_df_tfidf_vectorizer.fit_transform(X_train)
X_train_count_vec = unigram_count_vectorizer.fit_transform(X_train)
X_train_count_df_vec = unigram_df_count_vectorizer.fit_transform(X_train)

# check the content of a document vector
print(X_train_tfidf_vec.shape)
print(X_train_tfidf_vec[0].toarray())
print(X_train_tfidf_df_vec.shape)
print(X_train_tfidf_df_vec[0].toarray())
print(X_train_count_vec.shape)
print(X_train_count_vec[0].toarray())
print(X_train_count_df_vec.shape)
print(X_train_count_df_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_tfidf_vectorizer.vocabulary_))
print(len(unigram_df_tfidf_vectorizer.vocabulary_))
print(len(unigram_count_vectorizer.vocabulary_))
print(len(unigram_df_count_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_tfidf_vectorizer.vocabulary_.items())[:10])
print(list(unigram_df_tfidf_vectorizer.vocabulary_.items())[:10])
print(list(unigram_count_vectorizer.vocabulary_.items())[:10])
print(list(unigram_df_count_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_tfidf_vectorizer.vocabulary_.get('cooked'))
print(unigram_df_tfidf_vectorizer.vocabulary_.get('cooked'))
print(unigram_count_vectorizer.vocabulary_.get('cooked'))
print(unigram_df_count_vectorizer.vocabulary_.get('cooked'))

(72, 1092)
[[0. 0. 0. ... 0. 0. 0.]]
(72, 78)
[[0.         0.         0.         0.         0.         0.
  0.         0.39617393 0.         0.         0.26051415 0.
  0.         0.         0.30122147 0.27858724 0.         0.
  0.         0.         0.         0.26909811 0.         0.
  0.         0.         0.13363579 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.28919518 0.         0.
  0.         0.         0.         0.         0.         0.23258055
  0.         0.         0.         0.1830415  0.         0.
  0.         0.3151048  0.13773029 0.         0.         0.
  0.         0.19025042 0.         0.         0.26051415 0.
  0.         0.         0.         0.         0.         0.30122147
  0.         0.         0.         0.17964248 0.         0.        ]]
(72, 1092)
[[0 0 0 ... 0 0 0]]
(72, 78)
[[0 0 0 0 0 0 0 2 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 

#### Vectorize test data

In [94]:
X_test_count_df_vec = unigram_df_count_vectorizer.transform(X_test)
X_test_count_vec = unigram_count_vectorizer.transform(X_test)
X_test_tfidf_vec = unigram_tfidf_vectorizer.transform(X_test)
X_test_tfidf_df_vec = unigram_df_tfidf_vectorizer.transform(X_test)

# print out #examples and #features in the test set
print(X_test_count_df_vec.shape)
print(X_test_count_vec.shape)
print(X_test_tfidf_vec.shape)
print(X_test_tfidf_df_vec.shape)

(18, 78)
(18, 1092)
(18, 1092)
(18, 78)


#### MNB

In [95]:
# initialize the MNB model
count_nb_clf = MultinomialNB()
tfidf_nb_clf = MultinomialNB()
count_df_nb_clf = MultinomialNB()
tfidf_df_nb_clf = MultinomialNB()

# use the training data to train the MNB model
count_nb_clf.fit(X_train_count_vec, z_train)
count_df_nb_clf.fit(X_train_count_df_vec, z_train)
tfidf_nb_clf.fit(X_train_tfidf_vec, z_train)
tfidf_df_nb_clf.fit(X_train_tfidf_df_vec, z_train)

MultinomialNB()

#### Test the MNB classifier

In [96]:
# test the classifier on the test data set, print accuracy score

print(count_nb_clf.score(X_test_count_vec, z_test))
print(count_df_nb_clf.score(X_test_count_df_vec, z_test))
print(tfidf_nb_clf.score(X_test_tfidf_vec, z_test))
print(tfidf_df_nb_clf.score(X_test_tfidf_df_vec, z_test))

0.5555555555555556
0.5555555555555556
0.5555555555555556
0.5555555555555556
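
All four variants score 10 of the 18 test reviews correctly, and with a test set this small each review moves accuracy by about 5.6 points. A rough way to gauge how uncertain the 0.556 figure is: a Wilson score interval for a binomial proportion, sketched here with a hypothetical helper `wilson_interval` and the usual 95% z-value of 1.96:

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for the proportion correct / total."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# 10 correct predictions out of 18 test reviews
low, high = wilson_interval(10, 18)
print(f"accuracy = {10 / 18:.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

The interval contains 0.5, so on this test set the MNB results cannot be distinguished from guessing.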


In [97]:
# print confusion matrix (row: ground truth; col: prediction)

# from sklearn.metrics import confusion_matrix
z_pred_count_mnb = count_nb_clf.fit(X_train_count_vec, z_train).predict(X_test_count_vec)
z_pred_tfidf_mnb = tfidf_nb_clf.fit(X_train_tfidf_vec, z_train).predict(X_test_tfidf_vec)
z_pred_count_df_mnb = count_df_nb_clf.fit(X_train_count_df_vec, z_train).predict(X_test_count_df_vec)
z_pred_tfidf_df_mnb = tfidf_df_nb_clf.fit(X_train_tfidf_df_vec, z_train).predict(X_test_tfidf_df_vec)
count_cm_mnb = confusion_matrix(z_test, z_pred_count_mnb, labels = ['fake', 'true'])
count_df_cm_mnb = confusion_matrix(z_test, z_pred_count_df_mnb, labels = ['fake', 'true'])
tfidf_cm_mnb = confusion_matrix(z_test, z_pred_tfidf_mnb, labels = ['fake', 'true'])
tfidf_df_cm_mnb = confusion_matrix(z_test, z_pred_tfidf_df_mnb, labels = ['fake', 'true'])
print(count_cm_mnb)
print(count_df_cm_mnb)
print(tfidf_cm_mnb)
print(tfidf_df_cm_mnb)

[[5 5]
 [3 5]]
[[4 6]
 [2 6]]
[[5 5]
 [3 5]]
[[5 5]
 [3 5]]


In [98]:
# print classification report

# from sklearn.metrics import precision_score
# from sklearn.metrics import recall_score
print(precision_score(z_test, z_pred_count_mnb, average = None))
print(recall_score(z_test, z_pred_count_mnb, average = None))
print()
print(precision_score(z_test, z_pred_count_df_mnb, average = None))
print(recall_score(z_test, z_pred_count_df_mnb, average = None))
print()
print(precision_score(z_test, z_pred_tfidf_mnb, average = None))
print(recall_score(z_test, z_pred_tfidf_mnb, average = None))
print()
print(precision_score(z_test, z_pred_tfidf_df_mnb, average = None))
print(recall_score(z_test, z_pred_tfidf_df_mnb, average = None))
print()
# from sklearn.metrics import classification_report
target_names = ['fake', 'true']
print(classification_report(z_test, z_pred_count_mnb, target_names = target_names))
print(classification_report(z_test, z_pred_count_df_mnb, target_names = target_names))
print(classification_report(z_test, z_pred_tfidf_df_mnb, target_names = target_names))
print(classification_report(z_test, z_pred_tfidf_mnb, target_names = target_names))

[0.625 0.5  ]
[0.5   0.625]

[0.66666667 0.5       ]
[0.4  0.75]

[0.625 0.5  ]
[0.5   0.625]

[0.625 0.5  ]
[0.5   0.625]

              precision    recall  f1-score   support

        fake       0.62      0.50      0.56        10
        true       0.50      0.62      0.56         8

    accuracy                           0.56        18
   macro avg       0.56      0.56      0.56        18
weighted avg       0.57      0.56      0.56        18

              precision    recall  f1-score   support

        fake       0.67      0.40      0.50        10
        true       0.50      0.75      0.60         8

    accuracy                           0.56        18
   macro avg       0.58      0.57      0.55        18
weighted avg       0.59      0.56      0.54        18

              precision    recall  f1-score   support

        fake       0.62      0.50      0.56        10
        true       0.50      0.62      0.56         8

    accuracy                           0.56        18
   m

##### Error Analysis

In [99]:
# print out specific type of error for further analysis

# print out the positive examples that are mistakenly predicted as negative
# note: y_test and y_pred_mnb hold the sentiment labels and predictions, not the fake/true ones
print("MNB error analysis")
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_mnb[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

# print out the negative examples that are mistakenly predicted as positive
# the count should match the corresponding off-diagonal cell of the confusion matrix
print()
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_mnb[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

MNB error analysis
positive examples that are mistakenly predicted as negative -

'I like the Pizza at Dominoes for their special ingredients and the prices, they give out one of the best pizza\'s in town. I ordered Special Pepperoni Pizza (large) from the online portal which gives me a discount of 50\% for being a regular customer, they always provide a good quantity of cheese over the pizza, we generally do not need to add a topping of pizza cheese for cheese lovers compared to other places. The place has a nice ambient environment, it is generally empty as it has more delivery orders than dine in. Orders are processed very quickly and has a special taste to it. Cost friendly prices makes me pick it up as my best buy. Overall Rating : 5/5'
'Ruby Tuesday is my favorite America Style Restaurant. The salad is awesome. And I like the baby pork ribs so much . So does the coconut shrimp.'
'Carlo\'s Plate Shack was amazing! The waitress was friendly, attentive, and helpful in answering any 
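
The cell above inspects sentiment errors. A comparable check for the fake/true task could look like the following sketch; `collect_errors` is a hypothetical helper, and the three short lists stand in for `X_test`, `z_test`, and one of the prediction arrays such as `z_pred_count_mnb`:

```python
def collect_errors(texts, y_true, y_pred, true_label, pred_label):
    """Return the texts whose true label is true_label but were predicted as pred_label."""
    return [t for t, yt, yp in zip(texts, y_true, y_pred)
            if yt == true_label and yp == pred_label]

# hypothetical stand-ins for X_test, z_test and z_pred_count_mnb
demo_texts = ["review a", "review b", "review c", "review d"]
demo_true = ["fake", "fake", "true", "true"]
demo_pred = ["fake", "true", "fake", "true"]

fake_as_true = collect_errors(demo_texts, demo_true, demo_pred, "fake", "true")
true_as_fake = collect_errors(demo_texts, demo_true, demo_pred, "true", "fake")
print("fake reviews predicted as true:", fake_as_true)
print("true reviews predicted as fake:", true_as_fake)
```

The length of each returned list should equal the matching off-diagonal cell of the confusion matrix for that model.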

##### Hyperparameter tuning

In [100]:
grid_params = {
  'alpha': np.linspace(0.5, 1.5, 6),
  'fit_prior': [True, False],
}
mnb_count_grid_clf = GridSearchCV(count_nb_clf, grid_params)
mnb_count_grid_clf.fit(X_train_count_vec, z_train)
print("Best Score Count: ", mnb_count_grid_clf.best_score_)
print("Best Params Count: ", mnb_count_grid_clf.best_params_)
print()
mnb_tfidf_grid_clf = GridSearchCV(tfidf_nb_clf, grid_params)
mnb_tfidf_grid_clf.fit(X_train_tfidf_vec, z_train)
print("Best Score tfidf: ", mnb_tfidf_grid_clf.best_score_)
print("Best Params tfidf: ", mnb_tfidf_grid_clf.best_params_)
print()
mnb_count_df_grid_clf = GridSearchCV(count_df_nb_clf, grid_params)
mnb_count_df_grid_clf.fit(X_train_count_df_vec, z_train)
print("Best Score Count df: ", mnb_count_df_grid_clf.best_score_)
print("Best Params Count df: ", mnb_count_df_grid_clf.best_params_)
print()
mnb_tfidf_df_grid_clf = GridSearchCV(tfidf_df_nb_clf, grid_params)
mnb_tfidf_df_grid_clf.fit(X_train_tfidf_df_vec, z_train)
print("Best Score tfidf df: ", mnb_tfidf_df_grid_clf.best_score_)
print("Best Params tfidf df: ", mnb_tfidf_df_grid_clf.best_params_)
# print()

Best Score Count:  0.5028571428571429
Best Params Count:  {'alpha': 1.1, 'fit_prior': True}

Best Score tfidf:  0.5028571428571429
Best Params tfidf:  {'alpha': 0.5, 'fit_prior': True}

Best Score Count df:  0.47333333333333333
Best Params Count df:  {'alpha': 1.3, 'fit_prior': True}

Best Score tfidf df:  0.4447619047619048
Best Params tfidf df:  {'alpha': 0.5, 'fit_prior': True}


##### Create and test tuned models

In [101]:
# initialize the MNB model
tuned_count_nb_clf = MultinomialNB(alpha = 1.1, fit_prior = True)
tuned_tfidf_nb_clf = MultinomialNB(alpha = 0.5, fit_prior = True)
tuned_count_df_nb_clf = MultinomialNB(alpha = 1.3, fit_prior = True)
tuned_tfidf_df_nb_clf = MultinomialNB(alpha = 0.5, fit_prior = True)

# use the training data to train the MNB model
tuned_count_nb_clf.fit(X_train_count_vec, z_train)
tuned_count_df_nb_clf.fit(X_train_count_df_vec, z_train)
tuned_tfidf_nb_clf.fit(X_train_tfidf_vec, z_train)
tuned_tfidf_df_nb_clf.fit(X_train_tfidf_df_vec, z_train)

MultinomialNB(alpha=0.5)
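
Retyping the best parameters by hand works, but `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`) and exposes it as `best_estimator_`. A minimal sketch on a hypothetical toy corpus (the texts, labels, and `cv=3` are assumptions made only so the example runs), which also wraps the vectorizer in a pipeline so the vocabulary is rebuilt inside each fold:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical toy corpus, used only to make the sketch runnable
texts = [
    "the food was great and the service was friendly",
    "best pizza in town and fair prices",
    "lovely place with amazing pasta",
    "fresh bread and tasty homemade sauce",
    "the waiter was attentive and the soup was hot",
    "great burgers and quick service",
    "the food was cold and the service was slow",
    "worst pizza i have had and overpriced",
    "dirty tables and rude staff",
    "stale bread and bland sauce",
    "the waiter ignored us and the soup was cold",
    "terrible burgers and a long wait",
]
labels = ["positive"] * 6 + ["negative"] * 6

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
param_grid = {
    "multinomialnb__alpha": np.linspace(0.5, 1.5, 6),
    "multinomialnb__fit_prior": [True, False],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(texts, labels)

# best_estimator_ is already refit on all training texts
tuned_clf = search.best_estimator_
preds = tuned_clf.predict(["great pasta and friendly service"])
print(search.best_params_, preds)
```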

In [102]:
# test the classifier on the test data set, print accuracy score

print(tuned_count_nb_clf.score(X_test_count_vec, z_test))
print(tuned_count_df_nb_clf.score(X_test_count_df_vec, z_test))
print(tuned_tfidf_nb_clf.score(X_test_tfidf_vec, z_test))
print(tuned_tfidf_df_nb_clf.score(X_test_tfidf_df_vec, z_test))

0.5555555555555556
0.5555555555555556
0.5555555555555556
0.5555555555555556


In [103]:
# print confusion matrix (row: ground truth; col: prediction)

# from sklearn.metrics import confusion_matrix
z_pred_count_mnb = tuned_count_nb_clf.fit(X_train_count_vec, z_train).predict(X_test_count_vec)
z_pred_tfidf_mnb = tuned_tfidf_nb_clf.fit(X_train_tfidf_vec, z_train).predict(X_test_tfidf_vec)
z_pred_count_df_mnb = tuned_count_df_nb_clf.fit(X_train_count_df_vec, z_train).predict(X_test_count_df_vec)
z_pred_tfidf_df_mnb = tuned_tfidf_df_nb_clf.fit(X_train_tfidf_df_vec, z_train).predict(X_test_tfidf_df_vec)
count_cm_mnb = confusion_matrix(z_test, z_pred_count_mnb, labels = ['fake', 'true'])
count_df_cm_mnb = confusion_matrix(z_test, z_pred_count_df_mnb, labels = ['fake', 'true'])
tfidf_cm_mnb = confusion_matrix(z_test, z_pred_tfidf_mnb, labels = ['fake', 'true'])
tfidf_df_cm_mnb = confusion_matrix(z_test, z_pred_tfidf_df_mnb, labels = ['fake', 'true'])
print(count_cm_mnb)
print(count_df_cm_mnb)
print(tfidf_cm_mnb)
print(tfidf_df_cm_mnb)

[[5 5]
 [3 5]]
[[4 6]
 [2 6]]
[[5 5]
 [3 5]]
[[5 5]
 [3 5]]


In [104]:
# print classification report

# from sklearn.metrics import precision_score
# from sklearn.metrics import recall_score
print(precision_score(z_test, z_pred_count_mnb, average = None))
print(recall_score(z_test, z_pred_count_mnb, average = None))
print()
print(precision_score(z_test, z_pred_count_df_mnb, average = None))
print(recall_score(z_test, z_pred_count_df_mnb, average = None))
print()
print(precision_score(z_test, z_pred_tfidf_mnb, average = None))
print(recall_score(z_test, z_pred_tfidf_mnb, average = None))
print()
print(precision_score(z_test, z_pred_tfidf_df_mnb, average = None))
print(recall_score(z_test, z_pred_tfidf_df_mnb, average = None))
print()
# from sklearn.metrics import classification_report
target_names = ['fake', 'true']
print(classification_report(z_test, z_pred_count_mnb, target_names = target_names))
print(classification_report(z_test, z_pred_count_df_mnb, target_names = target_names))
print(classification_report(z_test, z_pred_tfidf_df_mnb, target_names = target_names))
print(classification_report(z_test, z_pred_tfidf_mnb, target_names = target_names))

[0.625 0.5  ]
[0.5   0.625]

[0.66666667 0.5       ]
[0.4  0.75]

[0.625 0.5  ]
[0.5   0.625]

[0.625 0.5  ]
[0.5   0.625]

              precision    recall  f1-score   support

        fake       0.62      0.50      0.56        10
        true       0.50      0.62      0.56         8

    accuracy                           0.56        18
   macro avg       0.56      0.56      0.56        18
weighted avg       0.57      0.56      0.56        18

              precision    recall  f1-score   support

        fake       0.67      0.40      0.50        10
        true       0.50      0.75      0.60         8

    accuracy                           0.56        18
   macro avg       0.58      0.57      0.55        18
weighted avg       0.59      0.56      0.54        18

              precision    recall  f1-score   support

        fake       0.62      0.50      0.56        10
        true       0.50      0.62      0.56         8

    accuracy                           0.56        18
   m

#### SVM

In [105]:
# initialize the SVC models
count_svm_clf = SVC()
tfidf_svm_clf = SVC()
count_df_svm_clf = SVC()
tfidf_df_svm_clf = SVC()

# use the training data to train the SVM model
count_svm_clf.fit(X_train_count_vec, z_train)
count_df_svm_clf.fit(X_train_count_df_vec, z_train)
tfidf_svm_clf.fit(X_train_tfidf_vec, z_train)
tfidf_df_svm_clf.fit(X_train_tfidf_df_vec, z_train)

SVC()

##### Test the SVM classifier

In [106]:
# test the classifier on the test data set, print accuracy score

print(count_svm_clf.score(X_test_count_vec, z_test))
print(count_df_svm_clf.score(X_test_count_df_vec, z_test))
print(tfidf_svm_clf.score(X_test_tfidf_vec, z_test))
print(tfidf_df_svm_clf.score(X_test_tfidf_df_vec, z_test))

0.4444444444444444
0.3888888888888889
0.5555555555555556
0.4444444444444444


In [107]:
# print confusion matrix (row: ground truth; col: prediction)

# from sklearn.metrics import confusion_matrix
z_pred_count_svm = count_svm_clf.fit(X_train_count_vec, z_train).predict(X_test_count_vec)
z_pred_tfidf_svm = tfidf_svm_clf.fit(X_train_tfidf_vec, z_train).predict(X_test_tfidf_vec)
z_pred_count_df_svm = count_df_svm_clf.fit(X_train_count_df_vec, z_train).predict(X_test_count_df_vec)
z_pred_tfidf_df_svm = tfidf_df_svm_clf.fit(X_train_tfidf_df_vec, z_train).predict(X_test_tfidf_df_vec)
count_cm_svm = confusion_matrix(z_test, z_pred_count_svm, labels = ['fake', 'true'])
count_df_cm_svm = confusion_matrix(z_test, z_pred_count_df_svm, labels = ['fake', 'true'])
tfidf_cm_svm = confusion_matrix(z_test, z_pred_tfidf_svm, labels = ['fake', 'true'])
tfidf_df_cm_svm = confusion_matrix(z_test, z_pred_tfidf_df_svm, labels = ['fake', 'true'])
print(count_cm_svm)
print(count_df_cm_svm)
print(tfidf_cm_svm)
print(tfidf_df_cm_svm)

[[2 8]
 [2 6]]
[[2 8]
 [3 5]]
[[4 6]
 [2 6]]
[[4 6]
 [4 4]]


In [108]:
# print classification report

# from sklearn.metrics import precision_score
# from sklearn.metrics import recall_score
print(precision_score(z_test, z_pred_count_svm, average = None))
print(recall_score(z_test, z_pred_count_svm, average = None))
print()
print(precision_score(z_test, z_pred_count_df_svm, average = None))
print(recall_score(z_test, z_pred_count_df_svm, average = None))
print()
print(precision_score(z_test, z_pred_tfidf_svm, average = None))
print(recall_score(z_test, z_pred_tfidf_svm, average = None))
print()
print(precision_score(z_test, z_pred_tfidf_df_svm, average = None))
print(recall_score(z_test, z_pred_tfidf_df_svm, average = None))
print()
# from sklearn.metrics import classification_report
target_names = ['fake', 'true']
print(classification_report(z_test, z_pred_count_svm, target_names = target_names))
print(classification_report(z_test, z_pred_count_df_svm, target_names = target_names))
print(classification_report(z_test, z_pred_tfidf_df_svm, target_names = target_names))
print(classification_report(z_test, z_pred_tfidf_svm, target_names = target_names))

[0.5        0.42857143]
[0.2  0.75]

[0.4        0.38461538]
[0.2   0.625]

[0.66666667 0.5       ]
[0.4  0.75]

[0.5 0.4]
[0.4 0.5]

              precision    recall  f1-score   support

        fake       0.50      0.20      0.29        10
        true       0.43      0.75      0.55         8

    accuracy                           0.44        18
   macro avg       0.46      0.47      0.42        18
weighted avg       0.47      0.44      0.40        18

              precision    recall  f1-score   support

        fake       0.40      0.20      0.27        10
        true       0.38      0.62      0.48         8

    accuracy                           0.39        18
   macro avg       0.39      0.41      0.37        18
weighted avg       0.39      0.39      0.36        18

              precision    recall  f1-score   support

        fake       0.50      0.40      0.44        10
        true       0.40      0.50      0.44         8

    accuracy                           0.44     

##### Error Analysis

In [109]:
# print out specific type of error for further analysis

# print out the positive examples that are mistakenly predicted as negative
# note: this cell still uses y_test and y_pred_mnb, i.e. the sentiment labels and the MNB predictions
print("MNB error analysis")
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_mnb[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

# print out the negative examples that are mistakenly predicted as positive
# the count should match the corresponding off-diagonal cell of the confusion matrix
print()
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_mnb[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

MNB error analysis
positive examples that are mistakenly predicted as negative -

'I like the Pizza at Dominoes for their special ingredients and the prices, they give out one of the best pizza\'s in town. I ordered Special Pepperoni Pizza (large) from the online portal which gives me a discount of 50\% for being a regular customer, they always provide a good quantity of cheese over the pizza, we generally do not need to add a topping of pizza cheese for cheese lovers compared to other places. The place has a nice ambient environment, it is generally empty as it has more delivery orders than dine in. Orders are processed very quickly and has a special taste to it. Cost friendly prices makes me pick it up as my best buy. Overall Rating : 5/5'
'Ruby Tuesday is my favorite America Style Restaurant. The salad is awesome. And I like the baby pork ribs so much . So does the coconut shrimp.'
'Carlo\'s Plate Shack was amazing! The waitress was friendly, attentive, and helpful in answering any 

##### Hyperparameter tuning

In [110]:
# Create a dictionary of possible parameters
params_grid = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000], 'gamma': [0.0001, 0.001, 0.01, 0.1, 1], 'kernel':['linear','rbf', 'poly'] }
# Create the GridSearchCV object
svm_count_grid_clf = GridSearchCV(count_svm_clf, params_grid)
# print(grid_clf)
# mnb_count_grid_clf = GridSearchCV(count_nb_clf, grid_params)
svm_count_grid_clf.fit(X_train_count_vec, z_train)
print("Best Score Count: ", svm_count_grid_clf.best_score_)
print("Best Params Count: ", svm_count_grid_clf.best_params_)
print()
svm_tfidf_grid_clf = GridSearchCV(tfidf_svm_clf, params_grid)
svm_tfidf_grid_clf.fit(X_train_tfidf_vec, z_train)
print("Best Score tfidf: ", svm_tfidf_grid_clf.best_score_)
print("Best Params tfidf: ", svm_tfidf_grid_clf.best_params_)
print()
svm_count_df_grid_clf = GridSearchCV(count_df_svm_clf, params_grid)
svm_count_df_grid_clf.fit(X_train_count_df_vec, z_train)
print("Best Score Count df: ", svm_count_df_grid_clf.best_score_)
print("Best Params Count df: ", svm_count_df_grid_clf.best_params_)
print()
svm_tfidf_df_grid_clf = GridSearchCV(tfidf_df_svm_clf, params_grid)
svm_tfidf_df_grid_clf.fit(X_train_tfidf_df_vec, z_train)
print("Best Score tfidf df: ", svm_tfidf_df_grid_clf.best_score_)
print("Best Params tfidf df: ", svm_tfidf_df_grid_clf.best_params_)
# print()

Best Score Count:  0.5704761904761905
Best Params Count:  {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}

Best Score tfidf:  0.5285714285714286
Best Params tfidf:  {'C': 1, 'gamma': 1, 'kernel': 'rbf'}

Best Score Count df:  0.5295238095238094
Best Params Count df:  {'C': 1, 'gamma': 1, 'kernel': 'rbf'}

Best Score tfidf df:  0.4866666666666667
Best Params tfidf df:  {'C': 0.0001, 'gamma': 0.001, 'kernel': 'poly'}


##### Create and test tuned SVM models

In [111]:
# initialize the SVC models
count_svm_clf = SVC(C = 1, gamma = 0.1, kernel = 'rbf')
tfidf_svm_clf = SVC(C = 1, gamma = 1, kernel = 'rbf')
count_df_svm_clf = SVC(C = 1, gamma = 1, kernel = 'rbf')
tfidf_df_svm_clf = SVC(C = 0.0001, gamma = 0.001, kernel = 'poly')

# use the training data to train the SVM model
count_svm_clf.fit(X_train_count_vec, z_train)
count_df_svm_clf.fit(X_train_count_df_vec, z_train)
tfidf_svm_clf.fit(X_train_tfidf_vec, z_train)
tfidf_df_svm_clf.fit(X_train_tfidf_df_vec, z_train)

SVC(C=0.0001, gamma=0.001, kernel='poly')

In [112]:
# test the classifier on the test data set, print accuracy score

print(count_svm_clf.score(X_test_count_vec, z_test))
print(count_df_svm_clf.score(X_test_count_df_vec, z_test))
print(tfidf_svm_clf.score(X_test_tfidf_vec, z_test))
print(tfidf_df_svm_clf.score(X_test_tfidf_df_vec, z_test))

0.5555555555555556
0.6666666666666666
0.5555555555555556
0.3888888888888889


In [113]:
# print confusion matrix (row: ground truth; col: prediction)

# from sklearn.metrics import confusion_matrix
z_pred_count_svm = count_svm_clf.fit(X_train_count_vec, z_train).predict(X_test_count_vec)
z_pred_tfidf_svm = tfidf_svm_clf.fit(X_train_tfidf_vec, z_train).predict(X_test_tfidf_vec)
z_pred_count_df_svm = count_df_svm_clf.fit(X_train_count_df_vec, z_train).predict(X_test_count_df_vec)
z_pred_tfidf_df_svm = tfidf_df_svm_clf.fit(X_train_tfidf_df_vec, z_train).predict(X_test_tfidf_df_vec)
count_cm_svm = confusion_matrix(z_test, z_pred_count_svm, labels = ['fake', 'true'])
count_df_cm_svm = confusion_matrix(z_test, z_pred_count_df_svm, labels = ['fake', 'true'])
tfidf_cm_svm = confusion_matrix(z_test, z_pred_tfidf_svm, labels = ['fake', 'true'])
tfidf_df_cm_svm = confusion_matrix(z_test, z_pred_tfidf_df_svm, labels = ['fake', 'true'])
print(count_cm_svm)
print(count_df_cm_svm)
print(tfidf_cm_svm)
print(tfidf_df_cm_svm)

[[4 6]
 [2 6]]
[[7 3]
 [3 5]]
[[4 6]
 [2 6]]
[[4 6]
 [5 3]]
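
The figures needed for the results table can be read straight off these matrices. A small worked sketch (the helper name `metrics_from_cm` is hypothetical) that treats `fake` as the class of interest, with rows as ground truth and columns as predictions, applied to the second matrix above (the tuned `count_df` SVM):

```python
def metrics_from_cm(cm):
    """Accuracy, precision and recall for the first class of a 2x2 confusion matrix.

    Rows are ground truth, columns are predictions.
    """
    (tp, fn), (fp, tn) = cm
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of all reviews predicted fake, how many are fake
    recall = tp / (tp + fn)      # of all fake reviews, how many were caught
    return accuracy, precision, recall

# confusion matrix of the tuned count_df SVM, copied from the output above
cm_count_df_svm = [[7, 3], [3, 5]]
acc, prec, rec = metrics_from_cm(cm_count_df_svm)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

These values agree with the `fake` row and the accuracy line of the classification report for the same model.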


In [114]:
# print classification report

# from sklearn.metrics import precision_score
# from sklearn.metrics import recall_score
print(precision_score(z_test, z_pred_count_svm, average = None))
print(recall_score(z_test, z_pred_count_svm, average = None))
print()
print(precision_score(z_test, z_pred_count_df_svm, average = None))
print(recall_score(z_test, z_pred_count_df_svm, average = None))
print()
print(precision_score(z_test, z_pred_tfidf_svm, average = None))
print(recall_score(z_test, z_pred_tfidf_svm, average = None))
print()
print(precision_score(z_test, z_pred_tfidf_df_svm, average = None))
print(recall_score(z_test, z_pred_tfidf_df_svm, average = None))
print()
# from sklearn.metrics import classification_report
target_names = ['fake', 'real']
print(classification_report(z_test, z_pred_count_svm, target_names = target_names))
print(classification_report(z_test, z_pred_count_df_svm, target_names = target_names))
print(classification_report(z_test, z_pred_tfidf_df_svm, target_names = target_names))
print(classification_report(z_test, z_pred_tfidf_svm, target_names = target_names))

[0.66666667 0.5       ]
[0.4  0.75]

[0.7   0.625]
[0.7   0.625]

[0.66666667 0.5       ]
[0.4  0.75]

[0.44444444 0.33333333]
[0.4   0.375]

              precision    recall  f1-score   support

        fake       0.67      0.40      0.50        10
        true       0.50      0.75      0.60         8

    accuracy                           0.56        18
   macro avg       0.58      0.57      0.55        18
weighted avg       0.59      0.56      0.54        18

              precision    recall  f1-score   support

        fake       0.70      0.70      0.70        10
        true       0.62      0.62      0.62         8

    accuracy                           0.67        18
   macro avg       0.66      0.66      0.66        18
weighted avg       0.67      0.67      0.67        18

              precision    recall  f1-score   support

        fake       0.44      0.40      0.42        10
        true       0.33      0.38      0.35         8

    accuracy                           0

In [None]:
# print out specific type of error for further analysis

# print out the positive examples that are mistakenly predicted as negative
# note: this cell still uses y_test and y_pred_mnb, i.e. the sentiment labels and the MNB predictions
print("MNB error analysis")
print("positive examples that are mistakenly predicted as negative -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'positive' and y_pred_mnb[i] == 'negative'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)

# print out the negative examples that are mistakenly predicted as positive
# the count should match the corresponding off-diagonal cell of the confusion matrix
print()
print("negative examples that are mistakenly predicted as positive -")
print()
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i] == 'negative' and y_pred_mnb[i] == 'positive'):
        print(X_test[i])
        err_cnt = err_cnt + 1
print("errors:", err_cnt)