# W266 Final Project

Authors: Satheesh Joseph, Catherine Mou, Yi Zhang

*TODO*

- Introduction
- Literature review

## Downloading and loading the data

We acquired the dataset from the researchers in the form of Sqlite `.db` files.

In [1]:
import os, sys, re, json, time, unittest
import itertools, collections
from importlib import reload
from sklearn.model_selection import train_test_split

import numpy as np
from scipy import stats
import pandas as pd
import sqlite3
import unicodedata
import nltk

import tensorflow as tf
from sklearn.metrics import classification_report

2021-11-13 07:28:46.559956: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0


In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation, Embedding
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [3]:
tf.config.list_physical_devices('GPU')

2021-11-13 07:29:05.390988: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-11-13 07:29:06.050715: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-13 07:29:06.051392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-11-13 07:29:06.051446: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2021-11-13 07:29:06.077819: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2021-11-13 07:29:06.092200: I tensorflow/stream_executor/platform/default/d

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [4]:
# Download the files if they're not here
if 'data' not in os.listdir('.') or not os.listdir('data'):
    os.system('wget https://storage.googleapis.com/mids-w266-final-project-data/yelpHotelData.db -P data/')
    os.system('wget https://storage.googleapis.com/mids-w266-final-project-data/yelpResData.db -P data/')
    print('Data downloaded successfully!')
else:
    print('Already downloaded data')

Already downloaded data


In [5]:
con = sqlite3.connect('data/yelpResData.db')
cursor = con.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())

con = sqlite3.connect('data/yelpHotelData.db')
cursor = con.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())

[('review',), ('restaurant',), ('reviewer',)]
[('review',), ('sqlite_stat1',), ('sqlite_stat2',), ('reviewer',), ('hotel',)]


In [6]:
# Reading from the hotels database
hotels_db = sqlite3.connect("data/yelpHotelData.db")
hotels = pd.read_sql_query("SELECT * FROM hotel", hotels_db)
hotel_reviews = pd.read_sql_query("SELECT * FROM review WHERE flagged in ('Y', 'N')", hotels_db)
hotel_reviews_fake = pd.read_sql_query("SELECT * FROM review WHERE flagged in ('Y')", hotels_db)
hotel_reviewers = pd.read_sql_query("SELECT * FROM reviewer", hotels_db)


print(f'The data set contains {len(hotels)} hotels, {len(hotel_reviews)} reviews, and {len(hotel_reviewers)} reviewers')

The data set contains 283086 hotels, 5858 reviews, and 5123 reviewers


In [7]:
# Reading from the restaurant database
restaurant_db = sqlite3.connect("data/yelpResData.db")
restaurant_db.text_factory = lambda x: x.decode("utf-8", errors='ignore')
restaurants = pd.read_sql_query("SELECT * FROM restaurant", restaurant_db)
restaurant_reviews = pd.read_sql_query("SELECT * FROM review WHERE flagged in ('Y', 'N')", restaurant_db)
restaurant_reviewers = pd.read_sql_query("SELECT * FROM reviewer", restaurant_db)


print(f'The data set contains {len(restaurants)} restaurants, {len(restaurant_reviews)} reviews, and {len(restaurant_reviewers)} reviewers')

The data set contains 242652 restaurants, 67019 reviews, and 16941 reviewers


# Exploratory Data Analysis

## ToDo: Performan More EDA

In [8]:
hotel_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5858 entries, 0 to 5857
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   date           5858 non-null   object
 1   reviewID       5858 non-null   object
 2   reviewerID     5858 non-null   object
 3   reviewContent  5858 non-null   object
 4   rating         5858 non-null   int64 
 5   usefulCount    5858 non-null   int64 
 6   coolCount      5858 non-null   int64 
 7   funnyCount     5858 non-null   int64 
 8   flagged        5858 non-null   object
 9   hotelID        5858 non-null   object
dtypes: int64(4), object(6)
memory usage: 457.8+ KB


In [9]:
hotel_reviews.head()

Unnamed: 0,date,reviewID,reviewerID,reviewContent,rating,usefulCount,coolCount,funnyCount,flagged,hotelID
0,6/8/2011,MyNjnxzZVTPq,IFTr6_6NI4CgCVavIL9k5g,Let me begin by saying that there are two kind...,5,18,11,28,N,tQfLGoolUMu2J0igcWcoZg
1,8/30/2011,BdD7fsPqHQL73hwENEDT-Q,c_-hF15XgNhlyy_TqzmdaA,The only place inside the Loop that you can st...,3,0,3,4,N,tQfLGoolUMu2J0igcWcoZg
2,6/26/2009,BfhqiyfC,CiwZ6S5ZizAFL5gypf8tLA,I have walked by the Tokyo Hotel countless tim...,5,12,14,23,N,tQfLGoolUMu2J0igcWcoZg
3,9/16/2010,Ol,nf3q2h-kSQoZK2jBY92FOg,"If you are considering staying here, watch thi...",1,8,2,6,N,tQfLGoolUMu2J0igcWcoZg
4,2/5/2010,i4HIAcNTjabdpG1K4F5Q2g,Sb3DJGdZ4Rq__CqxPbae-g,"This place is disgusting, absolutely horrible,...",3,11,4,9,N,tQfLGoolUMu2J0igcWcoZg


In [23]:
hotel_reviews.groupby('reviewerID').agg({"rating": np.count_nonzero, 
                                   "coolCount": np.sum, 
                                   "funnyCount": np.sum,
                                   "flagged": np.count_nonzero
                                    }).sort_values(by=['rating'], ascending=False)

Unnamed: 0_level_0,rating,coolCount,funnyCount,flagged
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
OlhH_-yyVWU6jj8H2TMSiQ,42,7,2,42
BI4lPhrUpmEySIJUywjIjQ,13,5,0,13
rQrqrb5dFztAeFYwyqbygA,13,46,40,13
gOPm2yoe38_OM6x4NIwLEw,11,2,2,11
ZYZNcugF3xUEGyLOVGiZ0Q,10,36,19,10
...,...,...,...,...
LZPtLWnjGZVODlhZk_uPdg,1,0,1,1
LXYwZJY-9zf7Q9p_1K2LwA,1,0,0,1
LXWVzg77sSA3FDMG4t5IXg,1,0,0,1
LXGuXkKMFeZ4T4-I3GoGaw,1,0,0,1


In [33]:
a = hotel_reviews_fake['reviewContent']
len(a)

780

In [32]:
a = hotel_reviews_fake['reviewContent']
np.savetxt(r'/home/jupyter/hotel_fake.txt', a.values, fmt='%s')

In [8]:
restaurant_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67019 entries, 0 to 67018
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   date           67019 non-null  object
 1   reviewID       67019 non-null  object
 2   reviewerID     67019 non-null  object
 3   reviewContent  67019 non-null  object
 4   rating         67019 non-null  int64 
 5   usefulCount    67019 non-null  int64 
 6   coolCount      67019 non-null  int64 
 7   funnyCount     67019 non-null  int64 
 8   flagged        67019 non-null  object
 9   restaurantID   67019 non-null  object
dtypes: int64(4), object(6)
memory usage: 5.1+ MB


In [9]:
reviews = pd.concat([restaurant_reviews, hotel_reviews.rename(columns={'hotelID':'restaurantID'})], ignore_index=True)
reviews.groupby('reviewerID').agg({"usefulCount": np.sum, 
                                   "coolCount": np.sum, 
                                   "funnyCount": np.sum}).sort_values(by=['usefulCount'], ascending=False)
reviews[reviews['reviewerID'] == 'w-w-k-QXosIKQ8HQVwU6IQ']['reviewContent']

94       ***Alinea is truly a one-of-a-kind experience;...
29403    ***Graham Elliot serves up refined casual food...
43054    ***Longman & Eagle is a true gastropub--a casu...
71630    ***While the rooms are small, Hotel Felix is a...
Name: reviewContent, dtype: object

In [10]:
reviews.groupby('flagged').agg('sum')

Unnamed: 0_level_0,rating,usefulCount,coolCount,funnyCount
flagged,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
N,252531,65773,42690,36444
Y,34611,0,0,0


# Baseline Model

### Try a Plain LSTM model on the Hotel data set with fixed length learned embedding

In [35]:
# Some Data Cleaning
hotel_reviews['reviewContent'] = hotel_reviews['reviewContent'].apply(lambda x: unicodedata.normalize('NFKD', x))
restaurant_reviews['reviewContent'] = restaurant_reviews['reviewContent'].apply(lambda x: unicodedata.normalize('NFKD', x))

In [36]:
# Split train/test data for hotel reviews
X_train, X_test, y_train, y_test = train_test_split(hotel_reviews, hotel_reviews['flagged']=='Y')

In [54]:
y_train

782     False
1518    False
4524    False
1340    False
3416    False
        ...  
120     False
1223    False
3077    False
376     False
489     False
Name: flagged, Length: 4393, dtype: bool

In [86]:
fake_label = [True]*2988
fake_label = pd.Series(fake_label)

In [87]:
fake_label

0       True
1       True
2       True
3       True
4       True
        ... 
2983    True
2984    True
2985    True
2986    True
2987    True
Length: 2988, dtype: bool

In [41]:
X_train.groupby('flagged').agg('count')

Unnamed: 0_level_0,date,reviewID,reviewerID,reviewContent,rating,usefulCount,coolCount,funnyCount,hotelID
flagged,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
N,3803,3803,3803,3803,3803,3803,3803,3803,3803
Y,590,590,590,590,590,590,590,590,590


In [76]:
X_train['reviewContent']

782     The lobby here gives you the impression that t...
1518    The best hotel! The view was ok, but the room ...
4524    The Palmer House is a large, old, but still gr...
1340    During my stay the hotel was under renovation....
3416    Location, location, location! This nicely reno...
                              ...                        
120     For the price I have to give a solid three sta...
1223    Very nice hotel.  Clean, attentive staff.  Loc...
3077    My hubby and I came to the windy city last yea...
376     The only thing this hotel has going for it is ...
489     Spent this past weekend in Chicago and stayed ...
Name: reviewContent, Length: 4393, dtype: object

In [72]:
generated_hotel_fake = pd.read_fwf('generated_hotel_fake.txt', header=None)

In [85]:
len(generated_hotel_fake[0])

2988

In [78]:
X_train_concat = pd.concat([X_train['reviewContent'], generated_hotel_fake[0]])

In [83]:
len(X_train_concat)

7381

In [88]:
y_train_concat = pd.concat([y_train, fake_label])

In [89]:
y_train_concat

782     False
1518    False
4524    False
1340    False
3416    False
        ...  
2983     True
2984     True
2985     True
2986     True
2987     True
Length: 7381, dtype: bool

In [90]:
len(y_train_concat)

7381

In [91]:
vocabulary_size = 20000
tokenizer = Tokenizer(num_words= vocabulary_size)
tokenizer.fit_on_texts(X_train['reviewContent'])
sequences = tokenizer.texts_to_sequences(X_train['reviewContent'])
train_data = pad_sequences(sequences, maxlen=100)

tokenizer = Tokenizer(num_words= vocabulary_size)
tokenizer.fit_on_texts(X_test['reviewContent'])
sequences = tokenizer.texts_to_sequences(X_test['reviewContent'])
test_data = pad_sequences(sequences, maxlen=100)


In [92]:
model = Sequential()
model.add(Embedding(20000, 100, input_length=100))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


2021-11-13 09:50:48.586475: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz
2021-11-13 09:50:48.587017: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560be9da84e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-13 09:50:48.587043: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-11-13 09:50:48.780168: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-13 09:50:48.780834: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560bebaa3e00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-11-13 09:50:48.780860: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2021-11-1



In [93]:
model.fit(train_data, y_train, epochs=2)

Epoch 1/2


2021-11-13 09:51:02.303415: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11


Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f68dcb3ef90>

In [94]:
y_predicted = model.predict(test_data)

In [95]:
print(classification_report(y_predicted > 0.5, y_test))

              precision    recall  f1-score   support

       False       1.00      0.87      0.93      1460
        True       0.02      0.60      0.03         5

    accuracy                           0.87      1465
   macro avg       0.51      0.74      0.48      1465
weighted avg       1.00      0.87      0.93      1465



### Try a Plain LSTM model on the restaurant data set with fixed length learned embedding

In [9]:
# Split train/test data for hotel reviews
Res_X_train, Res_X_test, Res_y_train, Res_y_test = train_test_split(restaurant_reviews, restaurant_reviews['flagged']=='Y')

In [10]:
vocabulary_size = 20000
tokenizer = Tokenizer(num_words= vocabulary_size)
tokenizer.fit_on_texts(Res_X_train['reviewContent'])
sequences = tokenizer.texts_to_sequences(Res_X_train['reviewContent'])
Res_train_data = pad_sequences(sequences, maxlen=100)

tokenizer = Tokenizer(num_words= vocabulary_size)
tokenizer.fit_on_texts(Res_X_test['reviewContent'])
sequences = tokenizer.texts_to_sequences(Res_X_test['reviewContent'])
Res_test_data = pad_sequences(sequences, maxlen=100)


In [11]:
model = Sequential()
model.add(Embedding(20000, 100, input_length=100))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [12]:
model.fit(Res_train_data, Res_y_train, epochs=2)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f139eac1340>

In [14]:
Res_y_test

46591    False
48096    False
9837     False
44919    False
40378    False
         ...  
61173     True
8630     False
55466    False
27076    False
35593    False
Name: flagged, Length: 16755, dtype: bool

In [15]:
Res_y_predicted = model.predict(Res_test_data)

In [16]:
print(classification_report(Res_y_predicted > 0.5, y_test))

              precision    recall  f1-score   support

       False       1.00      0.88      0.93     16743
        True       0.00      0.00      0.00        12

    accuracy                           0.88     16755
   macro avg       0.50      0.44      0.47     16755
weighted avg       1.00      0.88      0.93     16755



# Model 2 - Data Resampling + GloVe embedding

In [17]:
# Download the GloVe embeddings
if 'embedding' not in os.listdir('.') or not os.listdir('embedding'):
    os.system('wget http://nlp.stanford.edu/data/glove.6B.zip -P embedding/')
    os.system('cd embedding && unzip glove.6B.zip')
    print('Data the GloVe embedding successfully!')
else:
    print('Already downloaded the embedding')

Already downloaded the embedding


In [23]:
!pwd

/home/catherine041616/w266-final-project


In [21]:
os.system('cd /home/catherine041616/w266-final-project/embedding && unzip glove.6B.zip')

32512

In [22]:
import zipfile
with zipfile.ZipFile("glove.6B.zip","r") as zip_ref:
    zip_ref.extractall("targetdir")

In [24]:
# Use the 100 dimensional GloVe embedding
path_to_glove_file = "/home/catherine041616/w266-final-project/targetdir/glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.


In [25]:
# Split the positive/negative samples for more balanced sampling - Hotels
X_train_positive = X_train[X_train['flagged'] == 'Y']
X_train_negative = X_train[X_train['flagged'] == 'N']

num_samples = 2000
positive_ratio = 0.4
positives = X_train_positive.sample(int(num_samples * positive_ratio), replace=True).reset_index(drop=True)
negatives = X_train_negative.sample(num_samples, replace=True).reset_index(drop=True)

X_train_balanced = pd.concat([positives, negatives], ignore_index=True).sample(frac=1)
y_train_balanced = X_train_balanced['flagged'] == 'Y'

In [31]:
# Split the positive/negative samples for more balanced sampling - Restaurant
Res_X_train_positive = Res_X_train[Res_X_train['flagged'] == 'Y']
Res_X_train_negative = Res_X_train[Res_X_train['flagged'] == 'N']

num_samples = 2000
positive_ratio = 0.4
Res_positives = Res_X_train_positive.sample(int(num_samples * positive_ratio), replace=True).reset_index(drop=True)
Res_negatives = Res_X_train_negative.sample(num_samples, replace=True).reset_index(drop=True)

Res_X_train_balanced = pd.concat([Res_positives, Res_negatives], ignore_index=True).sample(frac=1)
Res_y_train_balanced = Res_X_train_balanced['flagged'] == 'Y'

In [32]:
# First layer, vectorizing the word input - Hotels
vocabulary_size = 40000
max_tokens = 200

vectorizer = TextVectorization(max_tokens=vocabulary_size, output_sequence_length=max_tokens)
vectorizer.adapt(X_train_balanced['reviewContent'].to_numpy())

voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

print(f"Vocabulary size is {len(voc)}")


Vocabulary size is 17791


In [33]:
# First layer, vectorizing the word input - Restaurant

vectorizer.adapt(Res_X_train_balanced['reviewContent'].to_numpy())

Res_voc = vectorizer.get_vocabulary()
Res_word_index = dict(zip(Res_voc, range(len(Res_voc))))

print(f"Vocabulary size is {len(Res_voc)}")


Vocabulary size is 17368


In [28]:
# Ref: https://keras.io/examples/nlp/pretrained_word_embeddings/
# Build + Lock in the Embedding layer from GloVe - Hotels
embedding_dim = 100
hits = 0
misses = 0
num_words = len(voc) + 2

# Prepare embedding matrix
# TODO: more pre-processing to avoid ~3000 words that don't have embeddings
embedding_matrix = np.zeros((num_words, embedding_dim))
for i, word in enumerate(voc):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))


embedding_layer = Embedding(
    num_words,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)


Converted 12974 words (4817 misses)


In [34]:
# Ref: https://keras.io/examples/nlp/pretrained_word_embeddings/
# Build + Lock in the Embedding layer from GloVe - Restaurant
embedding_dim = 100
hits = 0
misses = 0
Res_num_words = len(Res_voc) + 2

# Prepare embedding matrix
# TODO: more pre-processing to avoid ~3000 words that don't have embeddings
Res_embedding_matrix = np.zeros((Res_num_words, embedding_dim))
for i, word in enumerate(Res_voc):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        Res_embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))


Res_embedding_layer = Embedding(
    Res_num_words,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(Res_embedding_matrix),
    trainable=False,
)


Converted 12556 words (4812 misses)


In [29]:
# Vectorize the input - Hotels
X_train_ready = vectorizer(X_train_balanced['reviewContent']).numpy()
X_test_ready = vectorizer(X_test['reviewContent']).numpy()

X_train_ready.shape

(2800, 200)

In [35]:
# Vectorize the input - Restaurant
Res_X_train_ready = vectorizer(Res_X_train_balanced['reviewContent']).numpy()
Res_X_test_ready = vectorizer(Res_X_test['reviewContent']).numpy()

Res_X_train_ready.shape

(2800, 200)

In [30]:
# Build and train the model - Hotels
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(50, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train_ready, y_train_balanced, epochs=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<tensorflow.python.keras.callbacks.History at 0x7f13881f6190>

In [36]:
model.evaluate(X_test_ready, y_test)
y_predicted = model.predict(X_test_ready)
print(classification_report(y_predicted > 0.5, y_test))
X_test['flagged'].value_counts() / len(X_test)

              precision    recall  f1-score   support

       False       1.00      0.88      0.93     16749
        True       0.00      0.33      0.00         6

    accuracy                           0.88     16755
   macro avg       0.50      0.61      0.47     16755
weighted avg       1.00      0.88      0.93     16755



N    0.876634
Y    0.123366
Name: flagged, dtype: float64

In [37]:
# Build and train the model - Restaurant
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(50, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(Res_X_train_ready, Res_y_train_balanced, epochs=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<tensorflow.python.keras.callbacks.History at 0x7f1383201ee0>

In [38]:
model.evaluate(X_test_ready, y_test)
y_predicted = model.predict(X_test_ready)
print(classification_report(y_predicted > 0.5, y_test))
X_test['flagged'].value_counts() / len(X_test)

              precision    recall  f1-score   support

       False       1.00      0.88      0.93     16741
        True       0.00      0.00      0.00        14

    accuracy                           0.88     16755
   macro avg       0.50      0.44      0.47     16755
weighted avg       1.00      0.88      0.93     16755



N    0.876634
Y    0.123366
Name: flagged, dtype: float64

# Fake review generation

In [29]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.12.3-py3-none-any.whl (3.1 MB)
     |████████████████████████████████| 3.1 MB 4.1 MB/s            
Collecting filelock
  Downloading filelock-3.3.2-py3-none-any.whl (9.7 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 55.1 MB/s            
[?25hCollecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 73.7 MB/s            
[?25hCollecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
     |████████████████████████████████| 59 kB 11.0 MB/s            
Installing collected packages: filelock, tokenizers, sacremoses, huggingface-hub, transformers
Successfully installed filelock-3.3.2 huggingface-hub-0.1.2 sacremoses-0.0.46 tokenizers-0.10.3 transforme

In [34]:
b = hotel_reviews['reviewContent']
np.savetxt(r'/home/catherine041616/w266-final-project/hotel.txt', a.values, fmt='%s')

In [2]:
!python gpt2_fine_tuning.py \
    --output_dir hotel \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file='./hotel_fake.txt' \
    --per_gpu_train_batch_size=1

Traceback (most recent call last):
  File "gpt2_fine_tuning.py", line 12, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'


In [24]:
!python gpt2_generation.py \
    --model_name_or_path hotel \
    --length=10 \
    --seed=3

2021-11-13 08:13:59.640144: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
^C
