# Title
The notebook title should match the file name, which should follow this pattern:    
*author's initials_progressive number_title.ipynb*    
For example:    
*EF_01_Data Exploration.ipynb*

## Purpose
State the purpose of the notebook.

## Methodology
Quickly describe assumptions and processing steps.

## WIP - improvements
Use this section only if the notebook is not final.

Notable TODOs:
- todo 1;
- todo 2;
- todo 3.

## Results
Describe and comment on the most important results.

## Suggested next steps
State suggested next steps, based on results obtained in this notebook.

# Setup

## Library import
We import all the required Python libraries.

# Data manipulation
import pandas as pd
import numpy as np

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Visualizations
import plotly
import plotly.graph_objs as go
import plotly.offline as ply
plotly.offline.init_notebook_mode(connected=True)

import cufflinks as cf
cf.go_offline(connected=True)
cf.set_config_file(theme='white')

import matplotlib.pyplot as plt

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2

In [1]:
# Data manipulation
import pandas as pd
import numpy as np
import xgboost as xgb
from catboost import Pool,CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# showing multiple outputs 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from pandarallel import pandarallel

# Initialization
pandarallel.initialize(progress_bar=True)


INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## Local library import
We import all the required local libraries.

In [2]:
# Include local library paths
import sys
# sys.path.append('path/to/local/lib') # uncomment and fill to import local libraries

# Import local libraries

# Parameter definition
We set all relevant parameters for our notebook. By convention, parameters are uppercase, while all the 
other variables follow Python's guidelines.
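
As an illustration of this convention, a parameter cell for this notebook might look like the sketch below. The constant names are hypothetical; the values are the ones used later in the notebook (file name, split size, seed, n-gram range, number of folds).

```python
# Hypothetical parameter cell: the names are illustrative assumptions,
# the values are taken from the cells further down in this notebook.
DATA_PATH = "deceptive-opinion.csv"   # file read in the data import section
TEST_SIZE = 0.2                       # hold-out fraction for train_test_split
RANDOM_STATE = 0                      # seed used for the split
NGRAM_RANGE = (1, 2)                  # unigrams and bigrams for CountVectorizer

# Regular variables stay lowercase, following PEP 8.
n_folds = 5

print(DATA_PATH, TEST_SIZE, RANDOM_STATE, NGRAM_RANGE, n_folds)
```

Keeping such values in one cell makes it easier to rerun the notebook with a different split or vocabulary without hunting through the code.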


# Data import
We retrieve all the required data for the analysis.

In [3]:
df=pd.read_csv('deceptive-opinion.csv')
df

Unnamed: 0,deceptive,hotel,polarity,source,text
0,truthful,conrad,positive,TripAdvisor,We stayed for a one night getaway with family ...
1,truthful,hyatt,positive,TripAdvisor,Triple A rate with upgrade to view room was le...
2,truthful,hyatt,positive,TripAdvisor,This comes a little late as I'm finally catchi...
3,truthful,omni,positive,TripAdvisor,The Omni Chicago really delivers on all fronts...
4,truthful,hyatt,positive,TripAdvisor,I asked for a high floor away from the elevato...
...,...,...,...,...,...
1595,deceptive,intercontinental,negative,MTurk,Problems started when I booked the InterContin...
1596,deceptive,amalfi,negative,MTurk,The Amalfi Hotel has a beautiful website and i...
1597,deceptive,intercontinental,negative,MTurk,The Intercontinental Chicago Magnificent Mile ...
1598,deceptive,palmer,negative,MTurk,"The Palmer House Hilton, while it looks good i..."


# Data processing
Put the core of the notebook here. Feel free to split this section further into subsections.

In [4]:
#drop the feature hotel
df=df.drop(['hotel'],axis=1)

In [5]:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import re
import string
def text_cleaning(text):
    '''
    Lowercase the text and remove text in square brackets, links, HTML tags,
    special characters and words containing digits.
    '''
    if not isinstance(text, str):
        return str(text)

    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # links, before \W strips '://'
    text = re.sub(r'<.*?>+', '', text)                 # HTML tags
    text = re.sub(r'\W', ' ', text)                    # special chars -> space
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # leftover '_'
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)               # words containing digits
    return text
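
The order of the substitutions matters in a cleaning function like this one. If non-word characters are replaced before the link pattern runs, the `://` is already gone and the link pattern can no longer match. The sketch below shows the difference on a made-up sentence; the helper names and the sample text are assumptions for illustration only.

```python
import re

URL_PATTERN = r"https?://\S+|www\.\S+"

def strip_first(text):
    # Replace non-word characters first, then try to remove links.
    text = re.sub(r"\W", " ", text.lower())
    return re.sub(URL_PATTERN, "", text)

def urls_first(text):
    # Remove links first, then replace non-word characters.
    text = re.sub(URL_PATTERN, "", text.lower())
    return re.sub(r"\W", " ", text)

sample = "Great stay! See https://example.com/deal now"  # made-up review
first = strip_first(sample).split()
second = urls_first(sample).split()

print(first)   # link fragments such as 'https' and 'com' survive
print(second)  # the link is removed entirely
```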
    

In [6]:
df['text']=df['text'].apply(text_cleaning)
df

Unnamed: 0,deceptive,polarity,source,text
0,truthful,positive,TripAdvisor,we stayed for a one night getaway with family ...
1,truthful,positive,TripAdvisor,triple a rate with upgrade to view room was le...
2,truthful,positive,TripAdvisor,this comes a little late as i m finally catchi...
3,truthful,positive,TripAdvisor,the omni chicago really delivers on all fronts...
4,truthful,positive,TripAdvisor,i asked for a high floor away from the elevato...
...,...,...,...,...
1595,deceptive,negative,MTurk,problems started when i booked the intercontin...
1596,deceptive,negative,MTurk,the amalfi hotel has a beautiful website and i...
1597,deceptive,negative,MTurk,the intercontinental chicago magnificent mile ...
1598,deceptive,negative,MTurk,the palmer house hilton while it looks good i...


In [7]:
#[text_cleaning(df.text[1598])]

In [8]:
#cv.transform([text_cleaning(df.text[1598])]).toarray()

In [9]:
df.text[0]

'we stayed for a one night getaway with family on a thursday  triple aaa rate of  was a steal   floor room complete with  plasma tv bose stereo  voss and evian water  and gorgeous bathroom no tub but was fine for us  concierge was very helpful  you cannot beat this location    only flaw was breakfast was pricey and service was very very slow  for four kids and four adults on a friday morning  even though there were only two other tables in the restaurant  food was very good so it was worth the wait  i would return in a heartbeat  a gem in chicago     '

In [10]:
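# NOTE: the empty-string separator glues the source name onto the first word (e.g. 'TripAdvisorwe')
df['complete_text']=df['source'] + '' +df['text']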
x=df['complete_text']
y=df['deceptive']

In [11]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
le=LabelEncoder()
y=le.fit_transform(y)

**Comment:** 1 means truthful, 0 means deceptive
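
The mapping follows from how `LabelEncoder` works: it sorts the class names, so `deceptive` comes before `truthful`. A minimal check on toy labels (the label list is made up) reads the mapping from `classes_` instead of relying on memory:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["truthful", "deceptive", "deceptive", "truthful"]  # toy labels
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 is 'deceptive' and index 1 is 'truthful'.
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}

print(mapping)
print(encoded.tolist())
```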

In [12]:
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0,test_size=0.2)

In [13]:
np.unique(y_test, return_counts=True)

(array([0, 1]), array([165, 155]))

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1,2))
x_train=cv.fit_transform(x_train)
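
To make the vectorization step concrete, the sketch below fits a `CountVectorizer` with `ngram_range=(1, 2)` on two toy sentences and then transforms an unseen one. The sentences and variable names are assumptions; the point is that the vocabulary (unigrams plus bigrams) is learned from the training text only, and unseen words are simply ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the room was clean", "the staff was rude"]  # toy data
test_docs = ["the room was rude and noisy"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_matrix = vectorizer.transform(test_docs)        # reuses it, no refit

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, which is what allows a model trained on one to predict on the other.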

In [15]:
#from sklearn import cross_validation
from sklearn.model_selection import cross_val_score

# create training function
def model_training(x_train,y_train):
    models = [
        LogisticRegression(max_iter = 10000),
        SVC(),
        MultinomialNB(),
        CatBoostClassifier(iterations=100, task_type="GPU", learning_rate=0.05, l2_leaf_reg=1, depth=11, loss_function= 'Logloss', eval_metric='AUC',random_seed=42,verbose=False)
        ]

    CV = 5
    entries = []

    for model in models:
         model_name = model.__class__.__name__
         accuracies = cross_val_score(model, x_train, y_train, scoring='accuracy', cv=CV)
         for fold_idx, accuracy in enumerate(accuracies):
              entries.append((model_name, fold_idx, accuracy))
        
    cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

    #cv_df
    import seaborn as sns
    import matplotlib.pyplot as plt
    sns.boxplot(x='model_name', y='accuracy', data=cv_df)
    sns.stripplot(x='model_name', y='accuracy', data=cv_df, 
              size=8, jitter=True, edgecolor="gray", linewidth=2)
    
    plt.show()
    
    return cv_df
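
One possible refinement, shown here only as a sketch on toy data, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that each cross-validation fold learns its vocabulary from its own training part. The toy reviews and the choice of `MultinomialNB` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [  # toy reviews, three per class
    "clean room and friendly staff",
    "great location friendly staff",
    "lovely clean room great view",
    "dirty room and rude staff",
    "rude staff terrible location",
    "terrible dirty room no view",
]
labels = [1, 1, 1, 0, 0, 0]

# The vectorizer is refit inside every fold, so held-out text never shapes the vocabulary.
pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
scores = cross_val_score(pipeline, texts, labels, scoring="accuracy", cv=3)

print(scores)
```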

In [16]:
#model_training(x_train,y_train)

In [18]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef

def model_testing(x_test, y_test, y_train, x_train):
    models = [
        LogisticRegression(max_iter = 10000),
        SVC(),
        MultinomialNB(),
        CatBoostClassifier(iterations=100, task_type="GPU", learning_rate=0.05, l2_leaf_reg=1, depth=11, loss_function= 'Logloss', eval_metric='AUC',random_seed=42,verbose=False)
        ]

    for clf in models:
        model_name = clf.__class__.__name__
        clf.fit(x_train, y_train)
        print(model_name)
        # Do the prediction
        y_predict =clf.predict(cv.transform(x_test))
        print(confusion_matrix(y_test,y_predict))
        recall=recall_score(y_test,y_predict,average='macro')
        precision=precision_score(y_test,y_predict,average='macro')
        f1score=f1_score(y_test,y_predict,average='macro')
        accuracy=accuracy_score(y_test,y_predict)
        matthews = matthews_corrcoef(y_test,y_predict) 
        print('Accuracy: '+ str(accuracy))
        print('Macro Precision: '+ str(precision))
        print('Macro Recall: '+ str(recall))
        print('Macro F1 score:'+ str(f1score))
        print('MCC:'+ str(matthews))

In [19]:
model_testing(x_test, y_test, y_train, x_train)

LogisticRegression
[[152  13]
 [ 16 139]]
Accuracy: 0.909375
Macro Precision: 0.9096177944862156
Macro Recall: 0.9089931573802541
Macro F1 score:0.9092251860981502
MCC:0.818610713553282
SVC
[[144  21]
 [ 20 135]]
Accuracy: 0.871875
Macro Precision: 0.8717166979362101
Macro Recall: 0.8718475073313783
Macro F1 score:0.8717735708910368
MCC:0.7435641937614549
MultinomialNB
[[158   7]
 [ 24 131]]
Accuracy: 0.903125
Macro Precision: 0.9087036152253544
Macro Recall: 0.9013685239491691
Macro F1 score:0.9024303882129614
MCC:0.8100389293748533




CatBoostClassifier
[[141  24]
 [  1 154]]
Accuracy: 0.921875
Macro Precision: 0.9290631429023579
Macro Recall: 0.9240469208211144
Macro F1 score:0.9217458500846123
MCC:0.8530953160944553
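
The printed metrics can be reproduced by hand from a confusion matrix. As a check, the sketch below recomputes the LogisticRegression figures from its matrix, using scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
from math import sqrt

# Confusion matrix printed for LogisticRegression (rows: true, columns: predicted).
tn, fp = 152, 13   # true class 0 (deceptive)
fn, tp = 16, 139   # true class 1 (truthful)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Per-class precision and recall, then their unweighted (macro) means.
precision_0, precision_1 = tn / (tn + fn), tp / (tp + fp)
recall_0, recall_1 = tn / (tn + fp), tp / (tp + fn)
macro_precision = (precision_0 + precision_1) / 2
macro_recall = (recall_0 + recall_1) / 2

# Matthews correlation coefficient for the binary case.
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(accuracy, macro_precision, macro_recall, mcc)
```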


### Testing on Yelp Labelled Review Dataset with Sentiments and Features

In [20]:
Yelp = pd.read_excel('Yelp Labelled Review Dataset with Sentiments and Features.xlsx')

In [21]:
Yelp['clearned_Review']=Yelp['Review'].parallel_apply(text_cleaning)


In [22]:
Yelp.drop(197917,axis = 0,inplace = True)
Yelp[Yelp['clearned_Review'] == 5]

Unnamed: 0,User_id,Product_id,Rating,Date,Review,Spam(1) and Not Spam(0),Sentiment,Features,clearned_Review


In [23]:
#mapping the value of y
Yelp['SpamOrNot'] = Yelp['Spam(1) and Not Spam(0)'].map({1: 'deceptive', 0: 'truthful'})

In [24]:
#Yelp[['Review']].applymap(str)
x_yelp_test =Yelp['clearned_Review']
y_yelp_test =Yelp['SpamOrNot']

In [25]:
y_yelp_test=le.fit_transform(y_yelp_test)

In [26]:
unique_elements, counts_elements = np.unique(y_yelp_test, return_counts=True) 
unique_elements
counts_elements

array([0, 1])

array([ 36132, 319077])

In [29]:
new_x_train,new_x_test,new_y_train,new_y_test=train_test_split(x_yelp_test,y_yelp_test,random_state=0,test_size=0.2)

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1,2))
new_x_train=cv.fit_transform(new_x_train)

**Comment:** 1 means truthful, 0 means deceptive

for x in x_yelp_test:
    if type(x) == int:
        print(x)

Yelp_pred=lr.predict(cv.transform(x_yelp_test))
score_1=accuracy_score(y_yelp_test,Yelp_pred)
score_1

pred_2=svm.predict(cv.transform(x_yelp_test))
score_2=accuracy_score(y_yelp_test,pred_2)
score_2

pred_3=nb.predict(cv.transform(x_yelp_test))
score_3=accuracy_score(y_yelp_test,pred_3)
score_3

In [27]:
def testing_2_models(x_test, y_test, y_train, x_train):
    models = [
        MultinomialNB(),
        CatBoostClassifier(iterations=100, task_type="GPU", learning_rate=0.05, l2_leaf_reg=1, depth=11, loss_function= 'Logloss', eval_metric='AUC',random_seed=42,verbose=False)
        ]

    for clf in models:
        model_name = clf.__class__.__name__
        clf.fit(x_train, y_train)
        print(model_name)
        # Do the prediction
        y_predict =clf.predict(cv.transform(x_test))
        print(confusion_matrix(y_test,y_predict))
        recall=recall_score(y_test,y_predict,average='macro')
        precision=precision_score(y_test,y_predict,average='macro')
        f1score=f1_score(y_test,y_predict,average='macro')
        accuracy=accuracy_score(y_test,y_predict)
        matthews = matthews_corrcoef(y_test,y_predict) 
        print('Accuracy: '+ str(accuracy))
        print('Macro Precision: '+ str(precision))
        print('Macro Recall: '+ str(recall))
        print('Macro F1 score:'+ str(f1score))
        print('MCC:'+ str(matthews))

In [None]:
testing_2_models(new_x_test, new_y_test, new_y_train, new_x_train)

MultinomialNB
[[   34  7123]
 [   65 63820]]
Accuracy: 0.8988204160918893
Macro Precision: 0.6215148966512738
Macro Recall: 0.501866570293972
Macro F1 score:0.47802959311236803
MCC:0.030120829760439876


In [68]:
pred_5= model.predict(cv.transform(x_yelp_test))
score_5= accuracy_score(y_yelp_test,pred_5)
score_5

0.8933867103592533

In [82]:
np.unique(pred_5, return_counts=True)

(array([0, 1]), array([  2498, 352711]))

In [69]:
from sklearn.metrics import confusion_matrix
from sklearn import metrics as mt

confusion_matrix(y_yelp_test,pred_5)
mt.accuracy_score(y_yelp_test,pred_5)
mt.precision_score(y_yelp_test,pred_5)
mt.recall_score(y_yelp_test,pred_5)

array([[   380,  35752],
       [  2118, 316959]])

0.8933867103592533

0.8986365608104085

0.9933621038182007
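
Because the Yelp labels are heavily imbalanced, it helps to compare the accuracy with a trivial baseline. The sketch below uses only the confusion matrix printed above and computes the accuracy of always predicting the majority class, together with the recall on the deceptive class.

```python
# Confusion matrix printed above (rows: true label, columns: predicted label).
confusion = [[380, 35752],     # true 0 (deceptive)
             [2118, 316959]]   # true 1 (truthful)

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total

# Accuracy of a trivial model that always predicts the majority class (1).
majority_baseline = sum(confusion[1]) / total

# Share of deceptive reviews that were actually predicted as deceptive.
deceptive_recall = confusion[0][0] / sum(confusion[0])

print(accuracy, majority_baseline, deceptive_recall)
```

The baseline comes out at about 0.898, slightly above the model's 0.893, and only about 1% of the deceptive reviews are detected. Accuracy alone is therefore not a reliable measure on this dataset.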

In [74]:
Yelp["True_label"] = y_yelp_test

In [70]:
Yelp["Prediction"] = pred_5

In [79]:
pd.set_option("display.max_rows", 1000) # show up to 1000 rows; the full option name avoids ambiguity
Yelp[["Review","True_label","Prediction"]].head(1000)

Unnamed: 0,Review,True_label,Prediction
0,The food at snack is a selection of popular Gr...,0,1
1,This little place in Soho is wonderful. I had ...,0,1
2,ordered lunch for 15 from Snack last Friday. Ã...,0,1
3,This is a beautiful quaint little restaurant o...,0,1
4,Snack is great place for a Ã‚Â casual sit down...,0,1
5,A solid 4 stars for this greek food spot. Ã‚Â ...,0,1
6,Let me start with a shout-out to everyone who ...,0,1
7,Love this place! Ã‚Â Try the Chicken sandwich ...,0,1
8,My friend and I were intrigued by the nightly ...,0,1
9,Stopped in for lunch today and couldn't believ...,0,1


In [80]:
Yelp["Review"].loc[211]

'The food is average pizzeria and not cheap. Ã‚Â\xa0Add that to the fact that I puked my guts out in the bathroom during the "meal" and I\'ve decided not to go back.'

In [72]:
Yelp[["Spam(1) and Not Spam(0)"]].value_counts()

Spam(1) and Not Spam(0)
0                          319077
1                           36132
dtype: int64

In [73]:
Yelp[["Prediction"]].value_counts()

Prediction
1             352711
0               2498
dtype: int64

TODO: try a package that can compare multiple models.


In [38]:
import pickle

# open the file where you want to store the model; `with` closes it after the dump
with open('CatBoostClassifier.pkl', 'wb') as file:
    # dump the model to that file
    pickle.dump(model, file)

In [42]:
# load the pickled model back; `with` closes the file handle afterwards
with open('CatBoostClassifier.pkl', 'rb') as file:
    catboost = pickle.load(file)
y_prediction = catboost.predict(cv.transform(x_test))

In [68]:
y

array([1, 1, 1, ..., 0, 0, 0])

In [43]:
y_prediction

array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0,
       1, 0, 1, 0, 0, 0])

In [71]:
x_test

1073    Webi never write these reviews  but felt that ...
326     TripAdvisorwe stayed at the palmer house hilto...
1557    MTurkmy experience at the amalfi hotel in chic...
918     Webthis review has two parts     i advise read...
974     Webwe chose to stay at a small hotel because w...
                              ...                        
1401    MTurkif you are looking for a high end hotel o...
31      TripAdvisorthe omni is in a fabulous location ...
733     MTurki went here with the family  including ou...
655     MTurki frequently have business meetings in do...
1338    MTurkwhile staying at the swissotel chicago i ...
Name: complete_text, Length: 160, dtype: object

In [40]:
x_test.iloc[-2]

'MTurki frequently have business meetings in downtown chicago and find that the hotel monaco chicago  gives me the peace of mind to make those meetings enjoyable  while it might be more expensive than your average hotel    but there is nothing average in this hotel  exceeds any expectations   service  food  quality  atmosphere were heads above anywhere  thanks for making my trips to chicago   the best of the best  '

In [73]:
Yelp.Review[

Unnamed: 0,User_id,Product_id,Rating,Date,Review,Spam(1) and Not Spam(0),Sentiment,Features,clearned_Review,SpamOrNot
0,923,0,3,2014-01-30,The food at snack is a selection of popular Gr...,1,Positive,"['appetizer tray', 'greek salad', 'main courses']",the food at snack is a selection of popular gr...,deceptive
1,924,0,3,2011-05-05,This little place in Soho is wonderful. I had ...,1,Positive,"['little place', 'soho', 'lamb sandwich', 'soh...",this little place in soho is wonderful i had ...,deceptive
2,925,0,4,2011-12-30,ordered lunch for 15 from Snack last Friday. Ã...,1,Positive,"['snack', 'regular company lunch list']",ordered lunch for from snack last friday ã â...,deceptive
3,926,0,4,2012-10-04,This is a beautiful quaint little restaurant o...,1,Positive,"['beautiful quaint', 'pretty street', 'great p...",this is a beautiful quaint little restaurant o...,deceptive
4,927,0,4,2014-02-06,Snack is great place for a Ã‚Â casual sit down...,1,Positive,"['snack', 'great place', 'Ã¢ casual', 'cold wi...",snack is great place for a ã â casual sit down...,deceptive
...,...,...,...,...,...,...,...,...,...,...
355205,161146,349,1,2012-10-04,The aircondition makes so much noise and its ...,0,Negative,[],the aircondition makes so much noise and its ...,truthful
355206,116424,349,1,2013-05-27,Even though the pictures show very clean room...,0,Negative,"['clean rooms', 'actual room', 'o clock']",even though the pictures show very clean room...,truthful
355207,161147,349,2,2011-03-03,Backyard of the hotel is total mess shouldn t...,0,Negative,"['backyard', 'total mess shouldn t']",backyard of the hotel is total mess shouldn t...,truthful
355208,97930,349,2,2014-07-29,You When I booked with your company on line y...,0,Negative,"['s room', 'villa suite theough', 'wife s 40th...",you when i booked with your company on line y...,truthful


In [44]:
# dump information to that file
pickle.dump(cv, open('transform.pkl', 'wb'))

In [None]:
with open(filename, 'wb') as fout:
    pickle.dump((cv), fout)
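
A prediction only works if the model is loaded together with the vectorizer it was trained with, so one option is to store both in a single file. The sketch below shows the save and load round trip with plain dictionaries standing in for the fitted objects; the file name and the stand-in objects are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted vectorizer and classifier (hypothetical objects).
artifacts = {
    "vectorizer": {"ngram_range": (1, 2)},
    "model": {"name": "CatBoostClassifier"},
}

path = os.path.join(tempfile.mkdtemp(), "artifacts.pkl")  # hypothetical file name

with open(path, "wb") as fout:   # the file is closed even if dump raises
    pickle.dump(artifacts, fout)

with open(path, "rb") as fin:
    restored = pickle.load(fin)

print(restored == artifacts)
```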

In [3]:
deceptive_opinion = pd.read_csv('deceptive-opinion.csv')
Yelp_review_sentiments = pd.read_excel('Yelp Labelled Review Dataset with Sentiments and Features.xlsx')

In [4]:
Yelp_review_sentiments['deceptive'] = Yelp_review_sentiments['Spam(1) and Not Spam(0)'].map({1: 'deceptive', 0: 'truthful'})
Yelp_review_sentiments['text'] = Yelp_review_sentiments["Review"]
sub_deceptive_opinion = deceptive_opinion[["deceptive", "text"]]
sub_Yelp_review_sentiments = Yelp_review_sentiments[["deceptive", "text"]]
concat_data = pd.concat([sub_deceptive_opinion, sub_Yelp_review_sentiments],ignore_index=True)

In [6]:
concat_data

Unnamed: 0,deceptive,text
0,truthful,We stayed for a one night getaway with family ...
1,truthful,Triple A rate with upgrade to view room was le...
2,truthful,This comes a little late as I'm finally catchi...
3,truthful,The Omni Chicago really delivers on all fronts...
4,truthful,I asked for a high floor away from the elevato...
...,...,...
356805,truthful,The aircondition makes so much noise and its ...
356806,truthful,Even though the pictures show very clean room...
356807,truthful,Backyard of the hotel is total mess shouldn t...
356808,truthful,You When I booked with your company on line y...


In [7]:
concat_data['text']=concat_data['text'].parallel_apply(text_cleaning)


In [9]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

x=concat_data['text']
y=concat_data['deceptive']
le=LabelEncoder()
y=le.fit_transform(y)

x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0,test_size=0.2)

from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1,2))
x_train=cv.fit_transform(x_train)

In [10]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef

def model_testing(x_test, y_test, y_train, x_train):
    models = [
        #LogisticRegression(max_iter = 10000),
        #SVC(),
        #MultinomialNB(),
        CatBoostClassifier(iterations=100, task_type="GPU", learning_rate=0.05, l2_leaf_reg=1, depth=11, loss_function= 'Logloss', eval_metric='AUC',class_weights=[4.79, 0.56],random_seed=42,verbose=False)
        ]

    for clf in models:
        model_name = clf.__class__.__name__
        clf.fit(x_train, y_train)
        print(model_name)
        # Do the prediction
        y_predict =clf.predict(cv.transform(x_test))
        print(confusion_matrix(y_test,y_predict))
        recall=recall_score(y_test,y_predict,average='macro')
        precision=precision_score(y_test,y_predict,average='macro')
        f1score=f1_score(y_test,y_predict,average='macro')
        accuracy=accuracy_score(y_test,y_predict)
        matthews = matthews_corrcoef(y_test,y_predict) 
        print('Accuracy: '+ str(accuracy))
        print('Macro Precision: '+ str(precision))
        print('Macro Recall: '+ str(recall))
        print('Macro F1 score:'+ str(f1score))
        print('MCC:'+ str(matthews))
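
The cell hard-codes `class_weights=[4.79, 0.56]` without showing how the values were obtained. One common heuristic, the "balanced" formula `n_samples / (n_classes * class_count)`, is sketched below on toy labels. Whether the notebook's values came from this formula is an assumption, and the helper name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return n_samples / (n_classes * class_count) per class, in sorted label order."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[label]) for label in sorted(counts)]

toy_labels = [0] * 10 + [1] * 90  # toy data: 10% minority class
weights = balanced_class_weights(toy_labels)

print(weights)
```

The rare class receives the larger weight, so its errors count more during training.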

In [None]:
model_testing(x_test, y_test, y_train, x_train)

In [16]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.025, random_state=42)

# split() yields positional indices, so select rows with .iloc
for train_index, test_index in split.split(x, y):
    strat_train_full = concat_data.iloc[train_index]
    strat_test_set = concat_data.iloc[test_index]

# the second split's indices are positions within strat_train_full, not within concat_data
for train_index, validation_index in split.split(strat_train_full["text"], strat_train_full["deceptive"]):
    strat_train_set = strat_train_full.iloc[train_index]
    strat_validation_set = strat_train_full.iloc[validation_index]
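
A train/validation/test split that preserves class proportions can also be sketched with two stratified `train_test_split` calls. The toy data, the variable names and the split sizes below are assumptions for illustration, not the notebook's settings.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

toy_texts = [f"review {i}" for i in range(100)]  # toy data
toy_labels = [0] * 20 + [1] * 80                 # 20% minority class

# First carve out the test set, then split the remainder again.
rest_texts, test_texts, rest_labels, test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42)
train_texts, valid_texts, train_labels, valid_labels = train_test_split(
    rest_texts, rest_labels, test_size=0.25, stratify=rest_labels, random_state=42)

print(len(train_texts), len(valid_texts), len(test_texts))
print(Counter(train_labels), Counter(valid_labels), Counter(test_labels))
```

Each subset keeps the 20/80 class ratio, and no review appears in more than one subset.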


In [19]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1,2))
# fit the vocabulary and the label encoding on the training set only
x_train=cv.fit_transform(strat_train_set["text"])
y_train=le.fit_transform(strat_train_set["deceptive"])
# reuse the fitted vectorizer and encoder so feature columns and label codes match
x_valid=cv.transform(strat_validation_set["text"])
y_valid=le.transform(strat_validation_set["deceptive"])
# keep the test text raw: model_testing applies cv.transform itself
x_test=strat_test_set["text"]
y_test=le.transform(strat_test_set["deceptive"])

In [17]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef

def model_testing(x_test, y_test, y_train, x_train):
    models = [
        #LogisticRegression(max_iter = 10000),
        #SVC(),
        #MultinomialNB(),
        CatBoostClassifier(iterations=100, task_type="GPU", learning_rate=0.05, l2_leaf_reg=1, depth=11, loss_function= 'Logloss', eval_metric='AUC',random_seed=42, class_weights=[4.79, 0.56], verbose=False)
        ]

    for clf in models:
        model_name = clf.__class__.__name__
        clf.fit(x_train, y_train)
        print(model_name)
        # Do the prediction
        y_predict =clf.predict(cv.transform(x_test))
        print(confusion_matrix(y_test,y_predict))
        recall=recall_score(y_test,y_predict,average='macro')
        precision=precision_score(y_test,y_predict,average='macro')
        f1score=f1_score(y_test,y_predict,average='macro')
        accuracy=accuracy_score(y_test,y_predict)
        matthews = matthews_corrcoef(y_test,y_predict) 
        print('Accuracy: '+ str(accuracy))
        print('Macro Precision: '+ str(precision))
        print('Macro Recall: '+ str(recall))
        print('Macro F1 score:'+ str(f1score))
        print('MCC:'+ str(matthews))

In [None]:
model_testing(x_test, y_test, y_train, x_train)