# Fake Reviews

Data Source

https://www.kaggle.com/yelp-dataset/yelp-dataset

The overall goal of this notebook is to create examples that we can use to train a neural network that will be able to create fake reviews.

In [2]:
#import findspark osx
#findspark.init()
import pyspark

Create a Spark session with Apache Arrow enabled.

In [3]:
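from pyspark.sql import SparkSession

# Local session with 8 worker threads and 8 GB of driver memory.
# Note: Spark 3.x renames the Arrow key to
# 'spark.sql.execution.arrow.pyspark.enabled'; the key below is the Spark 2.x name.
spark = (
    SparkSession.builder
    .appName('big-data')
    .master("local[8]")
    .config("spark.driver.memory", "8g")
    .config('spark.sql.execution.arrow.enabled', 'true')
    .getOrCreate()
)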

## Reviews

Load the Yelp reviews, print their schema, and drop rows that contain missing values.

In [2]:
reviews_df = spark.read.json('yelp-dataset/yelp_academic_dataset_review.json')

In [3]:
reviews_df.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: double (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)



In [4]:
reviews_df.count()

6685900

In [5]:
reviews_df = reviews_df.na.drop()

In [6]:
reviews_df.count()

6685900

## Businesses

In [18]:
business_df = spark.read.json('yelp-dataset/yelp_academic_dataset_business.json')

In [19]:
business_df.count()

192609

In [20]:
business_df.printSchema()

root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: string (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: string (nullable = true)
 |    |-- BYOB: string (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: string (nullable = true)
 |    |-- BikeParking: string (nullable = true)
 |    |-- BusinessAcceptsBitcoin: string (nullable = true)
 |    |-- BusinessAcceptsCreditCards: string (nullable = true)
 |    |-- BusinessParking: string (nullable = true)
 |    |-- ByAppointmentOnly: string (nullable = true)
 |    |-- Caters: string (nullable = true)
 |    |-- CoatCheck: string (nullable = true)
 |    |-- Corkage: string (nullable = true)
 |    |-- DietaryRestrictions: string (nullable = true)
 |    |-- DogsAllowed: string (nullable = true)
 |    |-- DriveThru: string (nullable = true)
 |    |-- GoodForDancing: str

In [21]:
business_df = business_df.na.drop(subset=['city', 'categories'])

In [22]:
business_df.count()

192127

We ultimately want to create examples in the following form:

'RATING CITY CATEGORIES': 'REVIEW TEXT'

If we wanted more granular control over our fake review generation, we could also use a richer context such as 'RATING BUSINESS_NAME CITY STATE CATEGORIES'.
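As a rough sketch of the two context layouts, the following plain-Python snippet builds one example by hand. The `make_context` helper and the sample records are hypothetical and exist only for illustration; the notebook itself builds the context with Spark's `concat_ws`.

```python
def make_context(stars, city, categories, name=None, state=None):
    """Join the fields with single spaces, skipping missing values."""
    if name is None:
        parts = [stars, city, categories]               # 'RATING CITY CATEGORIES'
    else:
        parts = [stars, name, city, state, categories]  # granular variant
    return " ".join(str(p) for p in parts if p is not None)


# Hypothetical records, invented for this illustration only.
review = {"stars": 5.0, "text": "Great tacos and friendly staff."}
business = {
    "name": "Taco Shack",
    "city": "Las Vegas",
    "state": "NV",
    "categories": "Restaurants, Mexican",
}

simple = make_context(review["stars"], business["city"], business["categories"])
granular = make_context(
    review["stars"], business["city"], business["categories"],
    name=business["name"], state=business["state"],
)

# One training example: context on the left, review text on the right.
example = {simple: review["text"]}
print(simple)
print(granular)
```

The granular variant gives the model more to condition on, at the cost of a much larger set of distinct contexts.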

Let's keep it simple and create a new examples data frame:

In [23]:
reviews = reviews_df.alias('reviews')
business = business_df.alias('business')

In [24]:
from pyspark.sql.functions import concat_ws

examples = reviews.join(business, reviews.business_id == business.business_id) \
    .select(concat_ws(' ', reviews.stars, business.city, business.categories).alias('context'), \
            reviews.text.alias('review'))
    


In [25]:
examples.show()

+--------------------+--------------------+
|             context|              review|
+--------------------+--------------------+
|1.0 Las Vegas Fit...|Total bill for th...|
|5.0 Las Vegas Bea...|I *adore* Travis ...|
|5.0 Chandler Heal...|I have to say tha...|
|5.0 Calgary Bars,...|Went in for a lun...|
|1.0 Scottsdale Te...|Today was my seco...|
|4.0 Pittsburgh Re...|I'll be the first...|
|3.0 Markham Food,...|Tracy dessert had...|
|1.0 Scottsdale Sp...|This place has go...|
|2.0 Cleveland Bre...|I was really look...|
|3.0 Las Vegas Sho...|It's a giant Best...|
|4.0 Las Vegas Per...|Like walking back...|
|1.0 Mesa Restaura...|Walked in around ...|
|4.0 Pittsburgh It...|Wow. So surprised...|
|4.0 Las Vegas Hot...|Michael from Red ...|
|1.0 Toronto Asian...|I cannot believe ...|
|5.0 Toronto Sandw...|You can't really ...|
|4.0 Orange Villag...|Great lunch today...|
|3.0 Phoenix Carib...|I love chinese fo...|
|5.0 Chandler Sand...|We've been a huge...|
|3.0 Toronto Resta...|Good selec

In [26]:
examples.count()

6683763

## Tokenization

We need to tokenize our examples so that we can feed them into our neural network. Before tokenizing, we clean the reviews by replacing newlines and non-ASCII characters with spaces.

In [27]:
from pyspark.sql.functions import regexp_replace
examples = examples.withColumn('review', regexp_replace(examples.review, '[\\r\\n]', ' '))

In [28]:
# Raw string so that the \x escapes reach the regex engine unchanged.
examples = examples.withColumn('review', regexp_replace(examples.review, r'[^\x00-\x7F]+', ' '))
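To see what the two replacements do to a single string, here is a minimal sketch that applies the same patterns with Python's `re` module instead of Spark. The `clean_review` helper and the sample text are assumptions made for illustration.

```python
import re


def clean_review(text):
    """Apply the same two replacements as the Spark cells, in plain Python."""
    # Each carriage return or newline becomes one space.
    text = re.sub(r"[\r\n]", " ", text)
    # Each run of non-ASCII characters becomes one space.
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    return text


sample = "Loved the café!\nWould go again."
cleaned = clean_review(sample)
print(cleaned)
```

Note that a Windows-style `\r\n` line ending turns into two spaces, because each character is replaced separately.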

We now want to save the context and review columns as text files.

In [30]:
examples.select(examples.context).write.format('text').save('context.txt')
examples.select(examples.review).write.format('text').save('reviews.txt')

Make sure that both exports have the same line count.

In [31]:
!cat context.txt/*.txt | wc -l

6683763


In [33]:
!cat reviews.txt/*.txt | wc -l

6683763
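The same check can be sketched in Python, which is handy when the shell is not available. The `count_lines` helper is hypothetical, and the demo writes small stand-in part files to a temporary directory rather than reading the real exports.

```python
import glob
import os
import tempfile


def count_lines(output_dir):
    """Count newline characters across all *.txt part files, like `cat ... | wc -l`."""
    total = 0
    for path in sorted(glob.glob(os.path.join(output_dir, "*.txt"))):
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks so large part files are not loaded at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                total += chunk.count(b"\n")
    return total


# Stand-in data: two output directories with two part files each.
with tempfile.TemporaryDirectory() as tmp:
    for dir_name, prefix in (("context.txt", "ctx"), ("reviews.txt", "rev")):
        out_dir = os.path.join(tmp, dir_name)
        os.makedirs(out_dir)
        for part in range(2):
            part_path = os.path.join(out_dir, "part-%05d.txt" % part)
            with open(part_path, "w", encoding="utf-8") as handle:
                handle.write("%s line a\n%s line b\n" % (prefix, prefix))

    context_count = count_lines(os.path.join(tmp, "context.txt"))
    review_count = count_lines(os.path.join(tmp, "reviews.txt"))

print(context_count, review_count)
```

Equal counts only show that no review still contains a stray newline; they do not prove that line N of one export belongs to line N of the other.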


Unfortunately, the default Spark ML Tokenizer is rather simple. There is also a RegexTokenizer, but crafting hand-written tokenization rules with it is error-prone and cumbersome. We could try to use an industrial-strength tokenizer from spaCy instead. Implementing a Spark ML Transformer would require a Java counterpart on top of the Python implementation, so let's fall back to UDFs instead.

Another alternative is a pandas UDF, which exchanges data with Spark through Apache Arrow.

As it turns out, both approaches are really slow. The following cells are therefore included for illustrative purposes only. I tested the UDF approach multiple times and ran into many timeouts and OutOfMemory exceptions.

I therefore had to rely on Stanford's CoreNLP package to tokenize the examples. Please consult the tokenization notebook.
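To make the point about hand-written rules concrete, here is a small sketch using Python's `re` module. It is not Spark's RegexTokenizer, and the pattern and sample sentence are assumptions chosen for illustration; it only shows how quickly a simple rule mishandles contractions and prices.

```python
import re

# A naive rule: a token is either a run of word characters
# or a single character that is neither a word character nor whitespace.
NAIVE_PATTERN = re.compile(r"\w+|[^\w\s]")


def naive_tokenize(text):
    """Tokenize with the naive rule above."""
    return NAIVE_PATTERN.findall(text)


tokens = naive_tokenize("I can't believe it's $5.99!")
print(tokens)
```

The contraction "can't" is split into three tokens and the price into four, so every such case would need its own extra rule. A trained tokenizer handles these cases without per-case patterns.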

In [None]:
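import spacy
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Small English spaCy pipeline; only its tokenizer output is used here.
nlp = spacy.load("en_core_web_sm")

# Row-at-a-time UDF: returns the list of token strings for one input string.
@udf(ArrayType(StringType()))
def tokenize(s):
    return [token.text for token in nlp(s)]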

In [None]:
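import spacy
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import ArrayType, StringType

nlp = spacy.load("en_core_web_sm")

def spacy_tokenize(s):
    return [token.text for token in nlp(s)]

# The UDF returns one list of tokens per row, so the declared return type
# must be an array of strings rather than "string".
@pandas_udf(ArrayType(StringType()), PandasUDFType.SCALAR)
def tokenize(x):
    # x is a pandas Series holding one batch of input strings.
    return x.apply(spacy_tokenize)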

In [None]:
tokenized = examples.select(concat_ws(' ', tokenize(examples.context)).alias('context'), concat_ws(' ', tokenize(examples.review)).alias('review'))

In [None]:
train, val, test = tokenized.randomSplit([0.98, 0.016, 0.004], seed=42)
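As a quick sanity check on these weights, the expected split sizes can be worked out by hand. The sketch below assumes the 6683763 examples counted earlier. `randomSplit` assigns rows at random, so the real counts will only be close to these figures, not identical.

```python
total_examples = 6683763  # value reported by examples.count() earlier
weights = {"train": 0.98, "val": 0.016, "test": 0.004}

# randomSplit normalizes the weights, so divide by their sum as well.
weight_sum = sum(weights.values())
expected = {
    name: round(total_examples * weight / weight_sum)
    for name, weight in weights.items()
}

for name, size in expected.items():
    print(name, size)
```

With roughly 6.7 million examples, even the 0.4% test share still leaves tens of thousands of reviews for evaluation.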

In [None]:
# A notebook cell only displays its last expression, so print all three counts.
print(train.count(), val.count(), test.count())

In [None]:
def write_df(df, file_name):
    df.select(df.context).write.format('text').save(file_name + '_src.txt')
    df.select(df.review).write.format('text').save(file_name + '_tgt.txt')
    


In [None]:
train = train.cache()
val = val.cache()
test = test.cache()

In [None]:
write_df(test, 'test')

In [None]:
write_df(val, 'val')

In [None]:
write_df(train, 'train')