# Metadata Modelling

This notebook starts from the processed review metadata features and the prediction scores produced by the LSTM and FFNN models in the text modelling notebook. It then fits three simple classifiers to predict whether a review is fake:
1. Logistic Regression
2. Multi-layer Perceptron (MLP) Classifier
3. Random Forest Classifier

For each model, the notebook reports the F1 score and the confusion matrix on a held-out validation set and briefly discusses performance.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# importing the necessary libraries
import pandas as pd
import numpy as np
import sklearn

**Reading and processing data**

In [4]:
df_data1 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/6862_project/yelp_with_text_preds.csv", encoding="ISO-8859-1")
df_data1.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,ID,date,restaurantID,userID,reviewText,restaurant,fakeLabel,rating,lstm_predict_probas,ffnn_predict_probas
0,0,1,0.0,2014-11-16,0,5044.0,"Drinks were bad, the hot chocolate was watered...",Toast,-1,1,0.226938,0.272508
1,1,2,1.0,2014-09-08,0,5045.0,This was the worst experience I've ever had a ...,Toast,-1,1,0.419023,0.222183
2,2,3,2.0,2013-10-06,0,5046.0,This is located on the site of the old Spruce ...,Toast,-1,3,0.318059,0.222183
3,3,4,3.0,2014-11-30,0,5047.0,I enjoyed coffee and breakfast twice at Toast ...,Toast,-1,5,0.057116,0.222183
4,4,5,4.0,2014-08-28,0,5048.0,I love Toast! The food choices are fantastic -...,Toast,-1,5,0.066041,0.214755


In [5]:
df_data2 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/6862_project/yelp_processed_metadata.csv", encoding="ISO-8859-1")
df_data2.head()

Unnamed: 0.1,Unnamed: 0,date,fakeLabel,rating,wordCount,sentimentVADER,sentimentBLOB,VaderPosNeg,BlobPosNeg
0,1,2014-11-16,1,1,36,-0.937,-0.399259,0,0
1,2,2014-09-08,1,1,248,0.8918,0.02483,1,1
2,3,2013-10-06,1,3,50,0.9244,0.296481,1,1
3,4,2014-11-30,1,5,233,0.9952,0.385227,1,1
4,5,2014-08-28,1,5,152,0.994,0.174829,1,1


Combining the results from the text modelling notebook and the review metadata.

In [6]:
# .copy() gives an independent frame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning.
df_combined = df_data1[['rating', 'lstm_predict_probas', 'ffnn_predict_probas']].copy()
df_combined['fakeLabel'] = df_data2['fakeLabel']
# rating is taken from the processed metadata file, replacing the copy from df_data1
df_combined['rating'] = df_data2['rating']
df_combined['wordCount'] = df_data2['wordCount']
df_combined['sentimentVADER'] = df_data2['sentimentVADER']
df_combined['sentimentBLOB'] = df_data2['sentimentBLOB']
df_combined.head()

Unnamed: 0,rating,lstm_predict_probas,ffnn_predict_probas,fakeLabel,wordCount,sentimentVADER,sentimentBLOB
0,1,0.226938,0.272508,1,36,-0.937,-0.399259
1,1,0.419023,0.222183,1,248,0.8918,0.02483
2,3,0.318059,0.222183,1,50,0.9244,0.296481
3,5,0.057116,0.222183,1,233,0.9952,0.385227
4,5,0.066041,0.214755,1,152,0.994,0.174829
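
The column-wise assignment above relies on both CSV files listing the same reviews in the same order. A minimal sketch of a sanity check follows, using toy frames in place of `df_data1` and `df_data2`. The helper name `rows_aligned` is hypothetical, and comparing on `date` and `rating` is an assumption about which columns the two files share.

```python
import pandas as pd

def rows_aligned(left, right, cols):
    """Return True if both frames have the same length and identical values in `cols`."""
    if len(left) != len(right):
        return False
    # Compare values position by position, ignoring the index labels.
    return all((left[c].to_numpy() == right[c].to_numpy()).all() for c in cols)

# Toy stand-ins for df_data1 and df_data2 (not the real data).
toy_text = pd.DataFrame({"date": ["2014-11-16", "2014-09-08"], "rating": [1, 1]})
toy_meta = pd.DataFrame({"date": ["2014-11-16", "2014-09-08"], "rating": [1, 1]})
toy_shuffled = toy_meta.iloc[::-1].reset_index(drop=True)

aligned = rows_aligned(toy_text, toy_meta, ["date", "rating"])
misaligned = rows_aligned(toy_text, toy_shuffled, ["date", "rating"])
```

If such a check fails on the real files, merging on a shared key would be safer than assigning columns by position.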


Splitting the data into features (`X`) and target (`y`), and then into training and validation sets (80/20).

In [7]:
X = df_combined[['rating', 'lstm_predict_probas', 'ffnn_predict_probas', 'wordCount', 'sentimentVADER', 'sentimentBLOB']]
y = df_combined['fakeLabel']

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
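
The confusion matrices reported later in this notebook show that only about 13% of the validation reviews belong to the positive class. With that imbalance, a stratified split may be worth considering so that both sets keep the same class ratio. The sketch below uses synthetic labels, not the notebook's data, and only illustrates the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the imbalance seen in this notebook (13% positive).
y_toy = np.array([0] * 870 + [1] * 130)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both splits keep the 13% positive rate.
train_rate = y_tr.mean()
val_rate = y_va.mean()
```

In the notebook itself this would amount to passing `stratify=y` to the existing call.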

**MODEL 1: Logistic Regression**

In [8]:
from sklearn.linear_model import LogisticRegression

logit_clf = LogisticRegression()
logit_clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
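
The six features sit on very different scales (`wordCount` runs into the hundreds while the model scores lie in [0, 1]), and the classes are imbalanced. One possible variant, which is not what this notebook ran, is to standardise the features and reweight the classes. The sketch below uses synthetic data from `make_classification` as a stand-in; the variable name `scaled_logit` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the six notebook features.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=6, n_informative=4,
    weights=[0.87, 0.13], random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Scale first, then fit a logistic regression that weights classes
# inversely to their frequency.
scaled_logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
scaled_logit.fit(X_tr, y_tr)
val_pred = scaled_logit.predict(X_va)
```

Whether this helps on the Yelp data is an open question; class reweighting typically trades precision for recall.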

In [9]:
from sklearn.metrics import f1_score, confusion_matrix

y_pred = logit_clf.predict(X_val)

print(f"F1 score for logistic regression model: {f1_score(y_val, y_pred)}")
print("Confusion matrix for logistic regression:")
print(confusion_matrix(y_val, y_pred))

F1 score for logistic regression model: 0.13053493729203994
Confusion matrix for logistic regression:
[[103460   2102]
 [ 14883   1275]]
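
To see where the F1 score comes from, precision and recall can be recomputed by hand from the confusion matrix printed above. This is a small worked example in plain Python; it assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, with the positive class in the second row and column.

```python
# Confusion matrix reported above for logistic regression:
# rows are true labels, columns are predicted labels.
cm = [[103460, 2102],
      [14883, 1275]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
majority_baseline = (tn + fp) / (tp + tn + fp + fn)  # always predict the majority class

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} majority_baseline={majority_baseline:.3f}")
```

Precision comes out near 0.38 and recall near 0.08, so the low F1 score is driven mainly by missed positives. Accuracy (about 0.860) is slightly below the majority-class baseline (about 0.867), which is why accuracy alone is not a useful metric here.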


**MODEL 2: MLP Classifier**

In [10]:
from sklearn.neural_network import MLPClassifier

mlp_clf = MLPClassifier(random_state=1, max_iter=10).fit(X_train, y_train)



In [11]:
y_pred_mlp = mlp_clf.predict(X_val)
from sklearn.metrics import f1_score, confusion_matrix

print(f"F1 score for MLP model: {f1_score(y_val,y_pred_mlp)}")
print(f"Confusion matrix for MLP:")
print(f"{confusion_matrix(y_val,y_pred_mlp)}")

F1 score for MLP model: 0.11400823392800594
Confusion matrix for MLP:
[[103854   1708]
 [ 15078   1080]]


**MODEL 3: Random Forests Classifier**

In [12]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(max_depth=10, random_state=0).fit(X_train, y_train)

In [13]:
from sklearn.metrics import f1_score, confusion_matrix

y_pred_rf = rf_clf.predict(X_val)

print(f"F1 score for random forest model: {f1_score(y_val, y_pred_rf)}")
print("Confusion matrix for random forest:")
print(confusion_matrix(y_val, y_pred_rf))

F1 score for random forest model: 0.12352331606217616
Confusion matrix for random forest:
[[103612   1950]
 [ 14966   1192]]
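
All three models use `predict`, which applies a default cutoff of 0.5 to the predicted probability. With an imbalanced target, a lower cutoff may give a better F1 score. The sketch below picks the F1-maximising threshold with `precision_recall_curve` on toy scores; the helper `best_f1_threshold` and the toy arrays are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return the score threshold that maximises F1, and that F1 value."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The last precision/recall pair has no matching threshold, so drop it.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)
    best = int(np.argmax(f1))
    return thresholds[best], f1[best]

# Toy scores: positives tend to score higher but mostly stay below 0.5.
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.30, 0.33, 0.40, 0.60])

threshold, f1_at_threshold = best_f1_threshold(y_toy, scores_toy)
f1_default = f1_score(y_toy, (scores_toy >= 0.5).astype(int))
f1_tuned = f1_score(y_toy, (scores_toy >= threshold).astype(int))
```

For the models in this notebook, the scores would come from `predict_proba(X_val)[:, 1]`. A threshold chosen this way should be confirmed on data that was not used to select it.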


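Overall, the three models perform similarly, with F1 scores between 0.11 and 0.13 on the validation set. The confusion matrices show that each model assigns most reviews to the majority class and recovers fewer than 8% of the positive class. This is a modest result, but an understandable one given our current limitations: we do not yet have meaningful text embeddings, and the sentiment features come from naive methods. Once BERT is finished and applied to these models, we expect the results to improve.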