In [0]:
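storage_account_name = "amazonelectron1446226083"
# Read the account key from a Databricks secret scope rather than hard-coding it in the notebook.
# The scope and key names are placeholders (assumptions); substitute the ones defined in your workspace.
storage_account_key = dbutils.secrets.get(scope="<secret-scope>", key="<storage-account-key-name>")
spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    storage_account_key
)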

In [0]:
# Load the curated gold dataset
gold_path = "abfss://processed@amazonelectron1446226083.dfs.core.windows.net/features_v1/"
gold_df = spark.read.parquet(gold_path)

print(f"Total records: {gold_df.count():,}")
gold_df.printSchema()
display(gold_df.limit(5))

## Gold Layer Enrichment Opportunities

Based on the current Gold layer structure, here are meaningful enrichments:

### 1. **Temporal Features**
* **Review month/quarter**: Extract month and quarter from reviewTime for seasonal analysis
* **Days since product launch**: Calculate product age at review time
* **Review velocity**: Number of reviews per product per time period
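
The month and quarter extraction can be checked away from the cluster with plain Python. This is a minimal sketch that assumes `reviewTime` strings look like `"09 13, 2009"` (month, day, year, possibly with irregular spacing). The function names `parse_review_time` and `quarter_of` are illustrative and not part of the notebook.

```python
import re
from datetime import date, datetime


def parse_review_time(raw: str) -> date:
    """Parse a reviewTime string such as '09 13, 2009' into a date."""
    # Collapse runs of whitespace so '01  5, 2014' and '01 5, 2014' parse alike.
    normalized = re.sub(r"\s+", " ", raw.strip())
    return datetime.strptime(normalized, "%m %d, %Y").date()


def quarter_of(d: date) -> int:
    """Return the calendar quarter (1-4) that contains the date."""
    return (d.month - 1) // 3 + 1


review_date = parse_review_time("01  5, 2014")
print(review_date.isoformat(), review_date.month, quarter_of(review_date))
```

On the cluster the same derivation would normally be written with Spark column functions, so that it runs in parallel rather than row by row.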

### 2. **Rating Analytics**
* **Rating trends**: Moving averages of ratings over time per product/brand
* **Rating volatility**: Standard deviation of ratings per product
* **Rating momentum**: Change in average rating over time windows

### 3. **Text Analytics**
* **Review length**: Character/word count of reviewText
* **Sentiment score**: Analyze review text sentiment (positive/negative/neutral)
* **Key phrases**: Extract common themes or product features mentioned

### 4. **Helpfulness Metrics**
* **Helpfulness ratio**: helpful[0] / helpful[1] (helpful votes / total votes); leave it null when helpful[1] is 0 to avoid dividing by zero
* **Verified helpful reviews**: Flag reviews with high helpfulness scores
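
The helpfulness logic can be sketched in plain Python as follows. It assumes `helpful` is a two-element sequence of `[helpful_votes, total_votes]`. The thresholds `min_votes=5` and `min_ratio=0.8` are illustrative assumptions, not values taken from the dataset.

```python
from typing import Optional, Sequence


def helpfulness_ratio(helpful: Sequence[int]) -> Optional[float]:
    """Return helpful_votes / total_votes, or None when nobody voted."""
    helpful_votes, total_votes = helpful
    if total_votes == 0:
        # No votes means the ratio is undefined, which is different from 0.0.
        return None
    return helpful_votes / total_votes


def is_highly_helpful(helpful: Sequence[int],
                      min_votes: int = 5,
                      min_ratio: float = 0.8) -> bool:
    """Flag reviews with enough votes and a high share of helpful votes."""
    ratio = helpfulness_ratio(helpful)
    return ratio is not None and helpful[1] >= min_votes and ratio >= min_ratio


print(helpfulness_ratio([3, 4]), is_highly_helpful([9, 10]))
```

The minimum-vote requirement keeps a review with a single helpful vote (ratio 1.0) from outranking one with 90 of 100.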

### 5. **Product & Brand Aggregations**
* **Average rating per product/brand**
* **Review count per product/brand**
* **Price tier categorization**: Budget/Mid-range/Premium
* **Brand market share**: Review volume by brand
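
Price tier categorization comes down to a pair of cut-offs. The sketch below uses hypothetical boundaries (`budget_max=25.0`, `mid_max=100.0`) chosen only for illustration. In practice they would more likely come from the price distribution of the catalog, for example its terciles.

```python
from typing import Optional


def price_tier(price: Optional[float],
               budget_max: float = 25.0,
               mid_max: float = 100.0) -> str:
    """Map a price to Budget / Mid-range / Premium (cut-offs are assumptions)."""
    if price is None:
        # Keep missing prices visible instead of forcing them into a tier.
        return "Unknown"
    if price < budget_max:
        return "Budget"
    if price < mid_max:
        return "Mid-range"
    return "Premium"


for p in (9.99, 25.0, 249.0, None):
    print(p, price_tier(p))
```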

### 6. **Reviewer Behavior**
* **Reviewer activity**: Total reviews per reviewer
* **Reviewer average rating**: Mean rating given by each reviewer
* **Power reviewers**: Flag highly active reviewers

### 7. **Quality Indicators**
* **Review completeness score**: Presence of summary, text, helpfulness data
* **Detailed review flag**: Reviews with substantial text (>100 words)
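
One possible reading of these two indicators, sketched in plain Python. It assumes each review is a dict with `summary`, `reviewText`, and `helpful` keys, and it treats "helpfulness data present" as "at least one vote was cast". Both the equal weighting of the three checks and that interpretation are assumptions.

```python
def completeness_score(review: dict) -> float:
    """Fraction of the three checks (summary, text, votes) that the review passes."""
    helpful = review.get("helpful")
    checks = [
        bool((review.get("summary") or "").strip()),     # non-empty summary
        bool((review.get("reviewText") or "").strip()),  # non-empty body text
        bool(helpful) and helpful[1] > 0,                # at least one vote cast
    ]
    return sum(checks) / len(checks)


def is_detailed(review: dict, min_words: int = 100) -> bool:
    """True when the review text has more than min_words whitespace-separated words."""
    return len((review.get("reviewText") or "").split()) > min_words


sample = {"summary": "Solid mount", "reviewText": "Works fine.", "helpful": [0, 0]}
print(completeness_score(sample), is_detailed(sample))
```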

### 8. **Comparative Metrics**
* **Rating vs brand average**: How product compares to brand average
* **Rating vs price correlation**: Value perception indicators
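
The "rating vs brand average" metric can be illustrated with a small sketch. It assumes each review carries a product identifier under the key `asin` (the field name is an assumption) alongside `brand` and `overall`, and it computes the brand average over reviews, so heavily reviewed products pull the brand average toward themselves.

```python
from collections import defaultdict
from statistics import mean


def product_vs_brand(reviews: list) -> dict:
    """Return {asin: product average rating minus its brand's average rating}."""
    by_product = defaultdict(list)
    by_brand = defaultdict(list)
    for r in reviews:
        by_product[(r["brand"], r["asin"])].append(r["overall"])
        by_brand[r["brand"]].append(r["overall"])
    return {
        asin: mean(ratings) - mean(by_brand[brand])
        for (brand, asin), ratings in by_product.items()
    }


# Made-up reviews for illustration only.
reviews = [
    {"brand": "A", "asin": "p1", "overall": 5.0},
    {"brand": "A", "asin": "p1", "overall": 5.0},
    {"brand": "A", "asin": "p2", "overall": 3.0},
    {"brand": "A", "asin": "p2", "overall": 3.0},
]
print(product_vs_brand(reviews))
```

A positive value means the product is rated above its brand's average, and a negative value means below.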

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from pyspark.sql import functions as F

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)

In [0]:
# Convert reviewTime to proper date format and extract temporal features
from pyspark.sql.functions import to_date, year, month, avg, count, col, regexp_replace

# Normalize the date format (handle inconsistent spacing) and parse
gold_enriched = gold_df.withColumn(
    "reviewTime_normalized",
    regexp_replace(col("reviewTime"), r"\s+", " ")
).withColumn(
    "review_date", 
    to_date(col("reviewTime_normalized"), "MM d, yyyy")
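).withColumn(
    "review_year",  # selected in the preview below, so it must be created here
    year("review_date")
).withColumn(
    "review_month",
    month("review_date")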
).withColumn(
    "review_year_month",
    F.date_format("review_date", "yyyy-MM")
)

print("Data prepared with temporal features")
display(gold_enriched.select("reviewTime", "review_date", "review_year", "review_month", "overall").limit(5))

In [0]:
# Question: How do ratings change over time?
# Aggregate ratings by year-month
rating_trends = gold_enriched.groupBy("review_year_month").agg(
    avg("overall").alias("avg_rating"),
    count("*").alias("review_count")
).orderBy("review_year_month").toPandas()

# Create figure with two subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10))

# Plot 1: Average Rating Over Time
ax1.plot(rating_trends['review_year_month'], rating_trends['avg_rating'], 
         marker='o', linewidth=2, markersize=6, color='#2E86AB')
ax1.axhline(y=rating_trends['avg_rating'].mean(), color='red', 
            linestyle='--', label=f'Overall Average: {rating_trends["avg_rating"].mean():.2f}')
ax1.set_xlabel('Time Period (Year-Month)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Average Rating', fontsize=12, fontweight='bold')
ax1.set_title('Average Product Ratings Over Time', fontsize=14, fontweight='bold', pad=20)
ax1.grid(True, alpha=0.3)
ax1.legend(fontsize=10)
ax1.set_ylim([rating_trends['avg_rating'].min() - 0.2, 5.0])

# Rotate x-axis labels for better readability
for label in ax1.get_xticklabels():
    label.set_rotation(45)
    label.set_ha('right')

# Plot 2: Review Volume Over Time
ax2.bar(rating_trends['review_year_month'], rating_trends['review_count'], 
        color='#A23B72', alpha=0.7)
ax2.set_xlabel('Time Period (Year-Month)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Number of Reviews', fontsize=12, fontweight='bold')
ax2.set_title('Review Volume Over Time', fontsize=14, fontweight='bold', pad=20)
ax2.grid(True, alpha=0.3, axis='y')

# Rotate x-axis labels
for label in ax2.get_xticklabels():
    label.set_rotation(45)
    label.set_ha('right')

plt.tight_layout()
plt.show()

print("\n📊 VISUALIZATION 1 INSIGHTS:")
print("=" * 60)
print(f"• Time Range: {rating_trends['review_year_month'].min()} to {rating_trends['review_year_month'].max()}")
print(f"• Overall Average Rating: {rating_trends['avg_rating'].mean():.2f} stars")
print(f"• Highest Average Rating: {rating_trends['avg_rating'].max():.2f} stars ({rating_trends.loc[rating_trends['avg_rating'].idxmax(), 'review_year_month']})")
print(f"• Lowest Average Rating: {rating_trends['avg_rating'].min():.2f} stars ({rating_trends.loc[rating_trends['avg_rating'].idxmin(), 'review_year_month']})")
print(f"• Peak Review Volume: {rating_trends['review_count'].max():,} reviews ({rating_trends.loc[rating_trends['review_count'].idxmax(), 'review_year_month']})")
print(f"• Rating Volatility (Std Dev): {rating_trends['avg_rating'].std():.3f}")

In [0]:
# Question: How do different brands compare in ratings and popularity?
# Get top 10 brands by review count
top_brands = gold_enriched.groupBy("brand").agg(
    avg("overall").alias("avg_rating"),
    count("*").alias("review_count"),
    avg("price").alias("avg_price")
).filter(col("brand").isNotNull()).orderBy(col("review_count").desc()).limit(10).toPandas()

# Create figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Average Rating by Brand
colors = ['#06A77D' if rating >= 4.0 else '#F18F01' if rating >= 3.0 else '#C73E1D' 
          for rating in top_brands['avg_rating']]
ax1.barh(top_brands['brand'], top_brands['avg_rating'], color=colors, alpha=0.8)
ax1.axvline(x=top_brands['avg_rating'].mean(), color='black', 
            linestyle='--', linewidth=2, label=f'Average: {top_brands["avg_rating"].mean():.2f}')
ax1.set_xlabel('Average Rating (Stars)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Brand', fontsize=12, fontweight='bold')
ax1.set_title('Top 10 Brands: Average Rating Comparison', fontsize=14, fontweight='bold', pad=20)
ax1.set_xlim([0, 5])
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3, axis='x')

# Add value labels on bars
for i, v in enumerate(top_brands['avg_rating']):
    ax1.text(v + 0.05, i, f'{v:.2f}', va='center', fontsize=9, fontweight='bold')

# Plot 2: Review Volume vs Average Price (Bubble Chart)
scatter = ax2.scatter(top_brands['review_count'], top_brands['avg_rating'], 
                     s=top_brands['avg_price']*10, alpha=0.6, 
                     c=range(len(top_brands)), cmap='viridis', edgecolors='black', linewidth=1.5)
ax2.set_xlabel('Number of Reviews', fontsize=12, fontweight='bold')
ax2.set_ylabel('Average Rating (Stars)', fontsize=12, fontweight='bold')
ax2.set_title('Brand Popularity vs Rating (Bubble Size = Avg Price)', fontsize=14, fontweight='bold', pad=20)
ax2.grid(True, alpha=0.3)
ax2.set_ylim([0, 5])

# Add brand labels
for idx, row in top_brands.iterrows():
    ax2.annotate(row['brand'], (row['review_count'], row['avg_rating']),
                xytext=(5, 5), textcoords='offset points', fontsize=8, 
                bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.3))

plt.tight_layout()
plt.show()

print("\n📊 VISUALIZATION 2 INSIGHTS:")
print("=" * 60)
print(f"• Most Reviewed Brand: {top_brands.iloc[0]['brand']} ({top_brands.iloc[0]['review_count']:,.0f} reviews)")
print(f"• Highest Rated Brand: {top_brands.loc[top_brands['avg_rating'].idxmax(), 'brand']} ({top_brands['avg_rating'].max():.2f} stars)")
print(f"• Lowest Rated Brand: {top_brands.loc[top_brands['avg_rating'].idxmin(), 'brand']} ({top_brands['avg_rating'].min():.2f} stars)")
print(f"• Most Expensive Brand (Avg): {top_brands.loc[top_brands['avg_price'].idxmax(), 'brand']} (${top_brands['avg_price'].max():.2f})")
print(f"• Most Affordable Brand (Avg): {top_brands.loc[top_brands['avg_price'].idxmin(), 'brand']} (${top_brands['avg_price'].min():.2f})")
print(f"\n• Rating Range Across Top Brands: {top_brands['avg_rating'].min():.2f} - {top_brands['avg_rating'].max():.2f} stars")
print(f"• Average Price Across Top Brands: ${top_brands['avg_price'].mean():.2f}")

In [0]:
# Additional visualization: Overall rating distribution
rating_dist = gold_enriched.groupBy("overall").count().orderBy("overall").toPandas()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Rating Distribution (Bar Chart)
colors_dist = ['#C73E1D', '#F18F01', '#F4B942', '#A7C957', '#06A77D']
ax1.bar(rating_dist['overall'], rating_dist['count'], color=colors_dist, alpha=0.8, edgecolor='black')
ax1.set_xlabel('Rating (Stars)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Number of Reviews', fontsize=12, fontweight='bold')
ax1.set_title('Distribution of Product Ratings', fontsize=14, fontweight='bold', pad=20)
ax1.set_xticks([1, 2, 3, 4, 5])
ax1.grid(True, alpha=0.3, axis='y')

# Add percentage labels
# (the loop variable is named n so it does not shadow pyspark's count() imported earlier)
total_reviews = rating_dist['count'].sum()
for rating, n in zip(rating_dist['overall'], rating_dist['count']):
    percentage = (n / total_reviews) * 100
    ax1.text(rating, n + 500, f'{percentage:.1f}%', ha='center', fontsize=10, fontweight='bold')

# Plot 2: Cumulative Distribution
rating_dist['cumulative_pct'] = (rating_dist['count'].cumsum() / total_reviews) * 100
ax2.plot(rating_dist['overall'], rating_dist['cumulative_pct'], 
         marker='o', linewidth=3, markersize=10, color='#2E86AB')
ax2.fill_between(rating_dist['overall'], rating_dist['cumulative_pct'], alpha=0.3, color='#2E86AB')
ax2.set_xlabel('Rating (Stars)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Cumulative Percentage (%)', fontsize=12, fontweight='bold')
ax2.set_title('Cumulative Rating Distribution', fontsize=14, fontweight='bold', pad=20)
ax2.set_xticks([1, 2, 3, 4, 5])
ax2.set_ylim([0, 100])
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 RATING DISTRIBUTION INSIGHTS:")
print("=" * 60)
for _, row in rating_dist.iterrows():
    pct = (row['count'] / total_reviews) * 100
    # iterrows() upcasts mixed int/float rows to float, so cast the count back to int before formatting
    print(f"• {int(row['overall'])} Stars: {int(row['count']):,} reviews ({pct:.1f}%)")

# Compute each share directly from its own rating band
positive_pct = rating_dist.loc[rating_dist['overall'] >= 4, 'count'].sum() / total_reviews * 100
negative_pct = rating_dist.loc[rating_dist['overall'] <= 2, 'count'].sum() / total_reviews * 100
print(f"\n• Positive Reviews (4-5 stars): {positive_pct:.1f}%")
print(f"• Negative Reviews (1-2 stars): {negative_pct:.1f}%")

## 📊 Visualization Analysis Summary

### **Visualization 1: Rating Trends Over Time**
**What it shows:**
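* **Top chart**: Average product ratings plotted monthly from August 2010 to July 2014, with a reference line at 4.23 stars. This line is the unweighted mean of the monthly averages, so a month with a handful of reviews counts as much as a month with thousands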
* **Bottom chart**: Review volume (number of reviews) per month over the same period

**Key Insights:**
* Ratings show **high volatility** in early periods (2010-2011) with extreme fluctuations between 1.0 and 5.0 stars, likely due to low sample sizes
* From 2012 onwards, ratings **stabilize around 4.0-5.0 stars** as review volume increases
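* **Peak review activity** occurred in February 2013 (6,984 reviews), possibly reflecting a product launch or promotional period; the data alone cannot confirm the cause
* The monthly average **steadies as review volume grows**. Because the early swings rest on very few reviews, this pattern is better evidence of shrinking sampling noise than of a genuine rise in customer satisfaction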
* Recent periods (2013-2014) maintain consistently high ratings above the 4.23 average
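
The notebook computes the reference line as `rating_trends['avg_rating'].mean()`, which weights every month equally. A volume-weighted mean can differ noticeably when early months hold only a few reviews. The sketch below shows the arithmetic on made-up numbers, which are not taken from the dataset.

```python
def unweighted_mean(monthly):
    """Mean of monthly averages; every month counts the same."""
    return sum(avg for avg, _ in monthly) / len(monthly)


def weighted_mean(monthly):
    """Mean over all reviews; each month is weighted by its review count."""
    total_reviews = sum(n for _, n in monthly)
    return sum(avg * n for avg, n in monthly) / total_reviews


# (avg_rating, review_count) per month: two sparse months, one busy month.
monthly = [(1.0, 1), (5.0, 1), (4.5, 1000)]

print(f"unweighted: {unweighted_mean(monthly):.2f}")
print(f"weighted:   {weighted_mean(monthly):.2f}")
```

Here the single 1-star month drags the unweighted figure to 3.50, while the volume-weighted figure stays close to 4.50 because almost all reviews fall in the busy month.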

---

### **Visualization 2: Brand Comparison Analysis**
**What it shows:**
* **Left chart**: Horizontal bar chart comparing average ratings of the top 10 brands by review volume
  * Green bars (≥4.0 stars) indicate strong performance
  * Orange bars (3.0-3.99) would indicate moderate performance
  * Red bars (<3.0) would indicate poor performance
* **Right chart**: Bubble scatter plot showing the relationship between review volume (x-axis), average rating (y-axis), and average price (bubble size)

**Key Insights:**
* **VideoSecu dominates review volume** with 63,632 reviews (82% of all reviews) and also has the highest average rating (4.39 stars)
* All top 10 brands maintain **strong ratings between 4.00 and 4.39 stars**, indicating overall quality in the electronics category
* **Price-quality relationship**: Among the top 10 brands, VideoSecu has both the highest average price ($29.99) and the highest rating, which may suggest premium positioning
* Barnes & Noble has the lowest rating (4.00) and the lowest average price ($19.65), which may indicate a budget segment
* The tight rating range (0.39 stars) suggests **competitive parity** among major brands

---

### **Bonus Visualization: Rating Distribution**
**What it shows:**
* **Left chart**: Bar chart showing the count and percentage of reviews for each star rating (1-5)
* **Right chart**: Cumulative distribution curve showing the percentage of reviews at or below each rating level

**Key Insights:**
* **Heavily skewed toward high ratings**: 83% of reviews are 4-5 stars, with 5-star reviews alone comprising 53% of all feedback
* Only **7% negative reviews** (1-2 stars), consistent with strong overall product quality
* The distribution suggests either genuine customer satisfaction or potential **review bias** (satisfied customers more likely to review)
* 3-star reviews (10%) represent the "neutral" segment, often the most informative for improvement opportunities

---

## 🎯 Recommended Gold Layer Enrichments

Based on this analysis, prioritize these enrichments:

1. **Temporal smoothing**: Add 3-month and 6-month rolling averages to reduce volatility
2. **Helpfulness scoring**: Calculate helpfulness ratios to identify high-quality reviews
3. **Text analytics**: Add review length and sentiment scores to correlate with ratings
4. **Comparative metrics**: Add "rating vs brand average" and "rating vs category average" columns
5. **Reviewer segmentation**: Flag power reviewers and calculate reviewer credibility scores
6. **Price tier categorization**: Group products into Budget/Mid-range/Premium segments for better analysis
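
As an illustration of the temporal smoothing in item 1, here is a minimal trailing rolling mean in plain Python. It assumes the monthly values are already sorted and cover consecutive months with no gaps; missing months would need to be filled first. The function name `rolling_mean` is illustrative.

```python
def rolling_mean(values, window):
    """Trailing rolling mean; positions without a full window yield None."""
    smoothed = []
    for i in range(len(values)):
        if i + 1 < window:
            smoothed.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            smoothed.append(sum(chunk) / window)
    return smoothed


# Made-up monthly average ratings for illustration.
monthly_avg = [1.0, 5.0, 3.0, 4.0, 5.0]
print(rolling_mean(monthly_avg, window=3))
```

This version weights each month equally. Weighting by review count would damp the low-volume months further, at the cost of a slightly more involved calculation.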