# Amazon Reviews (Parquet) - Spark DataFrame API + Spark SQL

This notebook walks through three steps:

1. Read Parquet into a DataFrame using an explicit **StructType / StructField** schema
2. Register a view using **createOrReplaceTempView**
3. Solve the same use cases using **both**:
   - DataFrame API
   - Spark SQL



## Setup

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Amazon-Reviews-Parquet").getOrCreate()
spark.sparkContext.setLogLevel("WARN")


## Read Parquet with Explicit Schema

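Set `PARQUET_PATH` below to the directory that holds your Parquet files.

Example (AWS S3; the `s3://` scheme works on EMR and Glue, while open-source Spark builds normally use `s3a://`):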
- `s3://aws-glue-yourname/customer_review_parquet/`

Example (HDFS):
- `/user/yourname/customer_review_parquet/`


In [None]:
PARQUET_PATH = "s3://aws-glue-yourname/customer_review_parquet/"  # <-- change this

schema = StructType([
    StructField("marketplace", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("review_id", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("product_parent", StringType(), True),
    StructField("product_title", StringType(), True),
    StructField("product_category", StringType(), True),
    StructField("star_rating", IntegerType(), True),
    StructField("helpful_votes", IntegerType(), True),
    StructField("total_votes", IntegerType(), True),
    StructField("vine", StringType(), True),
    StructField("verified_purchase", StringType(), True),
    StructField("review_headline", StringType(), True),
    StructField("review_body", StringType(), True),
    StructField("review_date", StringType(), True),
    StructField("sentiment", StringType(), True),
])

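# Apply the declared schema instead of the one stored in the Parquet footers
df = spark.read.schema(schema).parquet(PARQUET_PATH)

# count() is an action: it runs a Spark job over the whole dataset
print("Rows:", df.count())
df.printSchema()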


## Create a Temporary View (Spark SQL)

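A temporary view takes the place of a Hive external table here: it gives the DataFrame a name that SQL can reference, lasts only for the current Spark session, and writes nothing to a metastore.
Once the view is created, you can query it using `spark.sql(...)`.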


In [None]:
df.createOrReplaceTempView("amazon_reviews_parquet")
print("View created: amazon_reviews_parquet")


## Use Case - Preview 10 rows

In [None]:
# DataFrame API
df.show(10, truncate=False)


In [None]:
# Spark SQL
spark.sql("""
SELECT *
FROM amazon_reviews_parquet
LIMIT 10
""").show(truncate=False)


## Use Case - List distinct sentiment values

In [None]:
# DataFrame API
df.select("sentiment").distinct().show(10, truncate=False)


In [None]:
# Spark SQL
spark.sql("""
SELECT DISTINCT sentiment
FROM amazon_reviews_parquet
LIMIT 10
""").show(truncate=False)


## Use Case - Total number of reviews

In [None]:
# DataFrame API
df.count()


In [None]:
# Spark SQL
spark.sql("""
SELECT COUNT(*) AS total_reviews
FROM amazon_reviews_parquet
""").show(truncate=False)


## Use Case - Reviews count by sentiment (sorted)

In [None]:
# DataFrame API
(df.groupBy("sentiment")
   .count()
   .withColumnRenamed("count", "total_reviews")
   .orderBy(F.desc("total_reviews"))
).show(truncate=False)


In [None]:
# Spark SQL
spark.sql("""
SELECT sentiment, COUNT(*) AS total_reviews
FROM amazon_reviews_parquet
GROUP BY sentiment
ORDER BY total_reviews DESC
""").show(truncate=False)


## Use Case - Reviews count by (star_rating, sentiment)

In [None]:
# DataFrame API
(df.groupBy("star_rating", "sentiment")
   .count()
   .withColumnRenamed("count", "total_reviews")
   .orderBy("star_rating", "sentiment")
).show(truncate=False)


In [None]:
# Spark SQL
spark.sql("""
SELECT star_rating, sentiment, COUNT(*) AS total_reviews
FROM amazon_reviews_parquet
GROUP BY star_rating, sentiment
ORDER BY star_rating, sentiment
""").show(truncate=False)


## Use Case - Show 10 high-rated review samples

Displays:
- product_title
- star_rating
- sentiment
- review_headline
- review_body

Ordered by `star_rating` descending.


In [None]:
# DataFrame API
(df.select("product_title", "star_rating", "sentiment", "review_headline", "review_body")
   .orderBy(F.desc("star_rating"))
   .limit(10)
).show(truncate=False)


In [None]:
# Spark SQL
spark.sql("""
SELECT product_title, star_rating, sentiment, review_headline, review_body
FROM amazon_reviews_parquet
ORDER BY star_rating DESC
LIMIT 10
""").show(truncate=False)


## Use Case - 5-star reviews that are NOT marked POSITIVE

A simple data-quality check on the sentiment labels. It selects rows where both conditions hold:
- `star_rating = 5`
- `sentiment != 'POSITIVE'`


In [None]:
# DataFrame API
(df.select("product_title", "star_rating", "sentiment", "review_headline", "review_body")
   .filter((F.col("star_rating") == 5) & (F.col("sentiment") != "POSITIVE"))
   .limit(10)
).show(truncate=False)


In [None]:
# Spark SQL
spark.sql("""
SELECT product_title, star_rating, sentiment, review_headline, review_body
FROM amazon_reviews_parquet
WHERE star_rating = 5
  AND sentiment != 'POSITIVE'
LIMIT 10
""").show(truncate=False)
