# Amazon Review Analysis

This notebook analyzes the Amazon pet product reviews data. The goal is to determine whether reviews written as part of the Vine program show any bias. Specifically, we check whether paid Vine reviews have a different percentage of 5-star reviews than non-Vine reviews.

### Dependencies and Data

In [14]:
# Locate Spark
import findspark
findspark.init()

# Dependencies
from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Create (or reuse) the Spark session; no extra drivers are configured here
spark = SparkSession.builder.appName('amz').getOrCreate()
spark

In [20]:
# Read in data
df = spark.read.csv('pet_product_reviews.tsv', sep='\t', header=True, inferSchema=True)
print(df.count())
df.show(1, vertical=True, truncate=False)

2643619
-RECORD 0----------------------
 marketplace | US
 customer_id | 28794885
 review_id   | REAKC26P07MDN

### Vine Reviews vs. Non-vine Reviews

1. Filter for rows where:
    - `total_votes` >= 5 (at least 5 votes)
    - `helpful_votes` / `total_votes` >= 0.2 (at least 20% of the votes are helpful)
2. Split the filtered data into `vine` reviews and non-`vine` reviews
3. Compare the percentage of 5-star reviews from the 2 groups

In [26]:
# Filter data
filtered_df = df.filter(df['total_votes'] >= 5)
filtered_df = filtered_df.filter(filtered_df['helpful_votes'] / filtered_df['total_votes'] >= 0.2)
filtered_df.count()

196299

In [29]:
# Vine reviews
vine_df = filtered_df.filter(filtered_df['vine'] == 'Y')
vine_count = vine_df.count()

# Non-vine reviews
nonvine_df = filtered_df.filter(filtered_df['vine'] == 'N')
nonvine_count = nonvine_df.count()

vine_count, nonvine_count

(752, 195547)

In [30]:
# 5-star count for vine reviews
vine5_df = vine_df.filter(vine_df['star_rating'] == 5)
vine5_count = vine5_df.count()

# 5-star count for non-vine reviews
nonvine5_df = nonvine_df.filter(nonvine_df['star_rating'] == 5)
nonvine5_count = nonvine5_df.count()

vine5_count, nonvine5_count

(277, 98785)

In [32]:
# 5-star percentages
print('Vine 5-star review %:', vine5_count / vine_count * 100)
print('Non-vine 5-star review %:', nonvine5_count / nonvine_count * 100)

Vine 5-star review %: 36.83510638297872
Non-vine 5-star review %: 50.51726694861082
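
As a rough side check, the sketch below recomputes the two percentages from the counts printed above and attaches an approximate 95% margin of error to each. The margin uses the normal approximation for a proportion, which is an assumption added here and not part of the original analysis; the helper name `pct_with_margin` is hypothetical.

```python
import math

# Counts copied from the notebook outputs above
vine_count, vine5_count = 752, 277
nonvine_count, nonvine5_count = 195547, 98785

def pct_with_margin(successes, total, z=1.96):
    """Return (percentage, margin of error in percentage points).

    Uses the normal approximation for a proportion (an assumption of
    this sketch); z=1.96 corresponds to roughly 95% coverage.
    """
    p = successes / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p * 100, margin * 100

vine_pct, vine_margin = pct_with_margin(vine5_count, vine_count)
nonvine_pct, nonvine_margin = pct_with_margin(nonvine5_count, nonvine_count)

# Gap between the two groups, in percentage points
gap = nonvine_pct - vine_pct

print(f'Vine 5-star review %: {vine_pct:.2f} +/- {vine_margin:.2f}')
print(f'Non-vine 5-star review %: {nonvine_pct:.2f} +/- {nonvine_margin:.2f}')
print(f'Gap (percentage points): {gap:.2f}')
```

Under this approximation the vine margin comes out much wider than the non-vine margin, simply because the vine group is so much smaller.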


### Comparison with Equal Sampling

The comparison above is lopsided: the non-vine group has 195,547 reviews, while the vine group has only 752 (about 0.4% of the filtered reviews). We will now use random sampling with replacement to produce two equally sized groups and make the comparison again.
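
To illustrate the idea of sampling with replacement before applying it to the Spark data, here is a minimal pure-Python sketch. It rebuilds stand-in lists of 5-star flags from the counts reported earlier and draws equal-sized samples from each group. The function name `equal_sample_pcts`, the sample size `n`, and the fixed seed are all hypothetical choices made for this illustration, not the notebook's own method.

```python
import random

def equal_sample_pcts(vine_flags, nonvine_flags, n, seed=0):
    """Draw n items with replacement from each group and return the
    5-star percentage of each sample as (vine_pct, nonvine_pct)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    vine_sample = rng.choices(vine_flags, k=n)        # with replacement
    nonvine_sample = rng.choices(nonvine_flags, k=n)  # with replacement
    return sum(vine_sample) / n * 100, sum(nonvine_sample) / n * 100

# Stand-in data rebuilt from the counts above: True means a 5-star review
vine_flags = [True] * 277 + [False] * (752 - 277)
nonvine_flags = [True] * 98785 + [False] * (195547 - 98785)

# n is an assumed sample size; here it matches the smaller (vine) group
sample_vine_pct, sample_nonvine_pct = equal_sample_pcts(
    vine_flags, nonvine_flags, n=752, seed=42
)
print('Sampled vine 5-star review %:', sample_vine_pct)
print('Sampled non-vine 5-star review %:', sample_nonvine_pct)
```

Because the draws are random, the sampled percentages vary from seed to seed around the full-group values. Repeating the draw with several seeds gives a feel for how much.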