# Text Modelling

This notebook starts from the raw feature of the review text and creates predictions for whether a review is fake or not only given the review text. There are three methods proposed here:
1. Using vectorised text as an input to a Bidirectional LSTM 
2. Using vectorised text as an input to a Feed Forward Neural Network
3. Using pretrained BERT (uncased base) as embeddings and input into FFNN

The first two methods have been completed. The third, and perhaps most promising method, is still running remotely as of now. The estimated time of completion for the training is past the deadline for Milestone 1, and as such hasn't been included in the codebase as yet.

After completing training and predicting using these models, we use the model's output as a feature combined with the metadata to make predictions in the next notebook (Notebook 5).

**Environment setup**

In [1]:
!pip3 install transformers==4.4.1  # This is used for BERT

Collecting transformers==4.4.1
[?25l  Downloading https://files.pythonhosted.org/packages/2c/d8/5144b0712f7f82229a8da5983a8fbb8d30cec5fbd5f8d12ffe1854dcea67/transformers-4.4.1-py3-none-any.whl (2.1MB)
[K     |████████████████████████████████| 2.1MB 5.5MB/s 
[?25hCollecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/08/cd/342e584ee544d044fb573ae697404ce22ede086c9e87ce5960772084cad0/sacremoses-0.0.44.tar.gz (862kB)
[K     |████████████████████████████████| 870kB 24.0MB/s 
Collecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
[K     |████████████████████████████████| 3.3MB 30.4MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone
  Created wheel for sacremoses: filename=sacremoses-0.0.44-cp37-none-any.whl size=886084 sha25

In [2]:
# import the necessary libraries
import os 
os.environ['TF_CPP_MIN_LOG_LEVEL']='2' #  Trying to reduce tensorflow warnings
import re
import math
import string
import time
import json
import random
import numpy as np
import pandas as pd
import nltk
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# useful structures and functions for experiments 
from time import sleep
from collections import defaultdict
from glob import glob

# specific machine learning functionality
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords 
from nltk.tokenize import RegexpTokenizer
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.utils import to_categorical
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.utils.layer_utils import count_params
from sklearn.model_selection import train_test_split
from sklearn import manifold
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import f1_score, confusion_matrix
from transformers import BertTokenizer, TFBertForSequenceClassification, BertConfig
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

**Data read and processing**

Reading in the processed data from Notebook 2.

In [3]:
df_data_all_cols = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/6862_project/yelp_processed.csv', encoding="ISO-8859-1")
df_data_all_cols.head()

Unnamed: 0.1,Unnamed: 0,ID,date,restaurantID,userID,reviewText,restaurant,fakeLabel,rating
0,1,0.0,2014-11-16,0,5044.0,"Drinks were bad, the hot chocolate was watered...",Toast,-1,1
1,2,1.0,2014-09-08,0,5045.0,This was the worst experience I've ever had a ...,Toast,-1,1
2,3,2.0,2013-10-06,0,5046.0,This is located on the site of the old Spruce ...,Toast,-1,3
3,4,3.0,2014-11-30,0,5047.0,I enjoyed coffee and breakfast twice at Toast ...,Toast,-1,5
4,5,4.0,2014-08-28,0,5048.0,I love Toast! The food choices are fantastic -...,Toast,-1,5


Filtering out only the two relevant columns for this notebook - the review text as the input and the fake label as the target.

In [4]:
df_data = df_data_all_cols[['reviewText', 'fakeLabel']]
df_data.head()

Unnamed: 0,reviewText,fakeLabel
0,"Drinks were bad, the hot chocolate was watered...",-1
1,This was the worst experience I've ever had a ...,-1
2,This is located on the site of the old Spruce ...,-1
3,I enjoyed coffee and breakfast twice at Toast ...,-1
4,I love Toast! The food choices are fantastic -...,-1


Changing the convention so that `fakeLabel` is 1 when we have a fake review and 0 otherwise.

In [5]:
def refinefakeLabel(row):
  # Changing the labels so that a fake review has label 1
  if row['fakeLabel'] == -1:
    return 1
  else:
    return 0

df_data = df_data.dropna()
df_data['fakeLabel'] = df_data.apply(refinefakeLabel, axis=1)
df_data

Unnamed: 0,reviewText,fakeLabel
0,"Drinks were bad, the hot chocolate was watered...",1
1,This was the worst experience I've ever had a ...,1
2,This is located on the site of the old Spruce ...,1
3,I enjoyed coffee and breakfast twice at Toast ...,1
4,I love Toast! The food choices are fantastic -...,1
...,...,...
608593,When I first moved to the area I must say I wa...,0
608594,Kind of pricey. I guess I expected a ridiculou...,0
608595,"Stopped by this restaurant yesterday, we just ...",0
608596,Finally checked out The Best Subs in Claremont...,0


Exploring the data set to look at the number of values for each class.

In [6]:
num_1s = 0
num_0s = 0
for label in list(df_data['fakeLabel']):
  if label == 1:
    num_1s += 1
  else:
    num_0s += 1
print(f"Number of 1s: {num_1s}")
print(f"Number of 0s: {num_0s}")

Number of 1s: 80466
Number of 0s: 528132


Imagining a naive model that predicts 0s for all reviews. This is, of course, not ideal, however, because we simply don't do the task of detetecting fake reviews in this case. 

In [7]:
print(f"Baseline accuracy: {num_0s/(num_1s + num_0s)}")

Baseline accuracy: 0.8677846460224976


**Modelling**

In [8]:
# Function for standardizing text (used in preprocessing below)
def standardize_text(input_text):
    # Convert to lowercase
    lowercase = tf.strings.lower(input_text)
    # Remove HTML tags
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
      stripped_html, "[%s]" % re.escape(string.punctuation), ""
    )

In [9]:
# Setting parameters here so that they don't need to be hard coded and we can be consistent
VOCABULARY_SIZE = 15000
SEQUENCE_SIZE = 64
EMBEDDING_SIZE = 100

In [10]:
# Initialize Text Vectorizer
text_vectorizer = TextVectorization(
    standardize=standardize_text,
    max_tokens=VOCABULARY_SIZE,
    output_mode="int",
    output_sequence_length=64,
)

In [11]:
# Create the vocabulary of entire dataset
text_data = tf.data.Dataset.from_tensor_slices(df_data['reviewText'].values)

# Generate Text Vector
start_time = time.time()
text_vectorizer.adapt(text_data.batch(64))
execution_time = (time.time() - start_time)/60.0
print("Execution time (mins)",execution_time)

# Get Vocabulary
vocabulary = text_vectorizer.get_vocabulary()
vocabulary_size = len(vocabulary)
print("Vocabulary Size:",vocabulary_size)
# Generate word index
word_index = dict(zip(vocabulary, range(vocabulary_size)))

Execution time (mins) 0.9253041982650757
Vocabulary Size: 15000


In [12]:
# Splitting the data into input, output, train, and test

X = df_data['reviewText'].values
y = df_data['fakeLabel'].values

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF Datasets
train_data = tf.data.Dataset.from_tensor_slices((X_train, y_train))
validation_data = tf.data.Dataset.from_tensor_slices((X_val, y_val))
complete_data = tf.data.Dataset.from_tensor_slices((X, y))

In [13]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 32
TRAIN_SHUFFLE_BUFFER_SIZE = len(X_train)
VALIDATION_SHUFFLE_BUFFER_SIZE = len(X_val)

# Vectorize Text
def vectorize_text(text, label=None):
    text = tf.expand_dims(text, -1)
    text = text_vectorizer(text)
    if label is None:
        return text
    else:
        return text, label

#############
# Train data
#############
train_data = train_data.shuffle(buffer_size=TRAIN_SHUFFLE_BUFFER_SIZE)
train_data = train_data.batch(BATCH_SIZE)
train_data = train_data.map(vectorize_text)
#train_data = train_data.prefetch(buffer_size=AUTOTUNE)

##################
# Validation data
##################
validation_data1 = validation_data.shuffle(buffer_size=TRAIN_SHUFFLE_BUFFER_SIZE)
validation_data1 = validation_data1.batch(BATCH_SIZE)
validation_data1 = validation_data1.map(vectorize_text, num_parallel_calls=AUTOTUNE)
validation_data1 = validation_data1.prefetch(buffer_size=VALIDATION_SHUFFLE_BUFFER_SIZE)

##################
# Complete data
##################
complete_data = complete_data.batch(BATCH_SIZE)
complete_data = complete_data.map(vectorize_text, num_parallel_calls=AUTOTUNE)
complete_data = complete_data.prefetch(buffer_size=VALIDATION_SHUFFLE_BUFFER_SIZE)


print("train_data",train_data)
print("validation_data",validation_data1)
print("complete_data",complete_data)

train_data <MapDataset shapes: ((None, 64), (None,)), types: (tf.int64, tf.int64)>
validation_data <PrefetchDataset shapes: ((None, 64), (None,)), types: (tf.int64, tf.int64)>
complete_data <PrefetchDataset shapes: ((None, 64), (None,)), types: (tf.int64, tf.int64)>


**MODEL 1: LSTM**

In [14]:
def build_lstm():

    # Set the model name as
    model_name = 'lstm_'+str(int(time.time()))

    # Create a LSTM Model
    model = tf.keras.models.Sequential(name=model_name)
    model.add(tf.keras.Input(shape=(SEQUENCE_SIZE)))
    model.add(tf.keras.layers.Embedding(input_dim=VOCABULARY_SIZE, output_dim=EMBEDDING_SIZE))
    model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)))
    model.add(tf.keras.layers.Dense(256, activation="relu"))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(1,activation="sigmoid"))

    return model


In [15]:
############################
# Training Params
############################
learning_rate = 1e-4
epochs = 3

# Free up memory
K.clear_session()

# Build the model
model2 = build_lstm()

# Print the model architecture
print(model2.summary())

# Optimizer
optimizer = keras.optimizers.Adam(lr=learning_rate)
# Loss
loss = keras.losses.binary_crossentropy

# Compile
model2.compile(loss=loss,
                  optimizer=optimizer,
                  metrics=['accuracy'])

# Train model
start_time = time.time()
training_results = model2.fit(
        train_data,
        validation_data=validation_data1,
        epochs=epochs, 
        verbose=1)
execution_time = (time.time() - start_time)/60.0
print("Training execution time (mins)",execution_time)

Model: "lstm_1617899423"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 64, 100)           1500000   
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                34048     
_________________________________________________________________
dense (Dense)                (None, 256)               16640     
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
Total params: 1,550,945
Trainable params: 1,550,945
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
Epoch 2/3
Epoch 3/3
Training execution time (mins) 53.718046895662944


**Model evaluation**

In [16]:
# Evaluate the model on the validation data
preds2 = model2.predict(validation_data1).flatten()
ytrue = np.concatenate([y for x, y in validation_data1], axis=0)   
print(f"F1 score for LSTM: {f1_score(ytrue,preds2>0.5)}")
print(f"Confusion matrix for LSTM:")
print(f"{confusion_matrix(ytrue,preds2>0.5)}")

F1 score for LSTM: 0.020225235578028043
Confusion matrix for LSTM:
[[104492   1070]
 [ 15982    176]]


In [17]:
conf = confusion_matrix(ytrue,preds2>0.5)
tn, fp, tp, fn = conf[0][0], conf[0][1], conf[1][1], conf[1][0]
print(f"Precision score for LSTM: {tp / (tp + fp)}")
print(f"Recall score for LSTM: {tp / (tp + fn)}")

Precision score for LSTM: 0.14125200642054575
Recall score for LSTM: 0.010892437182819657


In [18]:
print(f"accuracy for LSTM: {(tp + tn)/ (tp + tn + fn + fp)}")

accuracy for LSTM: 0.8599079855405849


This model is performing decently well, given that it has the limitation of not having great embeddings. BERT will definitel solve that issue. What we can say, however, is that 176 reviews were correctly classified as fake just with inofrmation from the text - which gives us hope that there is potential in this problem.

In [19]:
complete_preds2 = model2.predict(complete_data)
print(complete_preds2.shape)

(608598, 1)


In [20]:
df_data_all_cols['lstm_predict_probas'] = complete_preds2

**MODEL 2: FFNN**

In [21]:
def build_ffnn():

    # Set the model name as
    model_name = 'ffnn_'+str(int(time.time()))

    # Create a FFNN Model
    model = tf.keras.models.Sequential(name=model_name)
    model.add(tf.keras.Input(shape=(SEQUENCE_SIZE)))
    model.add(tf.keras.layers.Embedding(input_dim=VOCABULARY_SIZE, output_dim=EMBEDDING_SIZE))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(256, activation="relu"))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(1,activation="sigmoid"))

    return model


In [22]:
############################
# Training Params
############################
learning_rate = 0.003
epochs = 4

# Free up memory
K.clear_session()

# Build the model
model = build_ffnn()

# Print the model architecture
print(model.summary())

# Optimizer
optimizer = keras.optimizers.Adam(lr=learning_rate)
# Loss
loss = keras.losses.binary_crossentropy

# Compile
model.compile(loss=loss,
                  optimizer=optimizer,
                  metrics=['accuracy'])

# Train model
start_time = time.time()
training_results = model.fit(
        train_data,
        validation_data=validation_data1,
        epochs=epochs, 
        verbose=1)
execution_time = (time.time() - start_time)/60.0
print("Training execution time (mins)",execution_time)

Model: "ffnn_1617902905"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 64, 100)           1500000   
_________________________________________________________________
flatten (Flatten)            (None, 6400)              0         
_________________________________________________________________
dense (Dense)                (None, 256)               1638656   
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
Total params: 3,138,913
Trainable params: 3,138,913
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Training execution time (mins) 36.133088898658755


In [23]:
preds1 = model.predict(validation_data1).flatten()
print(f"F1 score for FFNN: {f1_score(ytrue,preds1>0.5)}")
print(f"Confusion matrix for FFNN:")
print(f"{confusion_matrix(ytrue,preds1>0.5)}")

F1 score for FFNN: 0.016900751792062477
Confusion matrix for FFNN:
[[104706    856]
 [ 16013    145]]


In [24]:
conf = confusion_matrix(ytrue,preds1>0.5)
tn, fp, tp, fn = conf[0][0], conf[0][1], conf[1][1], conf[1][0]
print(f"Precision score for FFNN: {tp / (tp + fp)}")
print(f"Recall score for FFNN: {tp / (tp + fn)}")
print(f"accuracy for LSTM: {(tp + tn)/ (tp + tn + fn + fp)}")

Precision score for FFNN: 0.14485514485514486
Recall score for FFNN: 0.008973882906300286
accuracy for LSTM: 0.861411436082813


Once again, the FFNN, also gives us hope with its results given its naive embedding scheme, and the fact that it still predicts 145 reviews as fake correctly. The False negative rate needs work, however.

In [25]:
complete_preds = model.predict(complete_data)
print(complete_preds.shape)

(608598, 1)


**Saving work**

In [26]:
df_data_all_cols['ffnn_predict_probas'] = complete_preds
df_data_all_cols

Unnamed: 0.1,Unnamed: 0,ID,date,restaurantID,userID,reviewText,restaurant,fakeLabel,rating,lstm_predict_probas,ffnn_predict_probas
0,1,0.0,2014-11-16,0,5044.0,"Drinks were bad, the hot chocolate was watered...",Toast,-1,1,0.226938,2.725078e-01
1,2,1.0,2014-09-08,0,5045.0,This was the worst experience I've ever had a ...,Toast,-1,1,0.419023,2.221830e-01
2,3,2.0,2013-10-06,0,5046.0,This is located on the site of the old Spruce ...,Toast,-1,3,0.318059,2.221830e-01
3,4,3.0,2014-11-30,0,5047.0,I enjoyed coffee and breakfast twice at Toast ...,Toast,-1,5,0.057116,2.221830e-01
4,5,4.0,2014-08-28,0,5048.0,I love Toast! The food choices are fantastic -...,Toast,-1,5,0.066041,2.147548e-01
...,...,...,...,...,...,...,...,...,...,...,...
608593,608594,608593.0,2013-01-20,5039,119664.0,When I first moved to the area I must say I wa...,Best Subs the,1,4,0.184477,1.052087e-01
608594,608595,608594.0,2012-11-12,5039,56277.0,Kind of pricey. I guess I expected a ridiculou...,Best Subs the,1,2,0.020917,2.715447e-08
608595,608596,608595.0,2012-08-22,5039,265320.0,"Stopped by this restaurant yesterday, we just ...",Best Subs the,1,1,0.146421,1.422865e-01
608596,608597,608596.0,2011-05-11,5039,161722.0,Finally checked out The Best Subs in Claremont...,Best Subs the,1,4,0.190455,2.221830e-01


In [27]:
df_data_all_cols.to_csv("/content/drive/MyDrive/Colab Notebooks/6862_project/yelp_with_text_preds.csv")