
# Yelp Data Modeling for Rising Star Businesses

## Objective: Identify "Rising Star" Businesses
- Focus on businesses with significant growth in engagement, positive reviews, and popularity.
- A data warehouse enables quick insights:
  1. Discover trending businesses.
  2. Guide marketing strategies.
  3. Provide data-driven insights.


## Design Overview
- **Star Schema**: Fact tables (reviews, check-ins, tips) joined to dimension tables (business, user, date).
- **Slowly Changing Dimensions (SCD)**: Preserve the history of attribute changes instead of overwriting them.
- **Data Mart**: Surfaces Rising Star metrics such as year-over-year growth.


The warehouse is meant to serve three goals:

1. Identify businesses with growing customer interest.
2. Support marketing and partnership strategies for trending businesses.
3. Give business owners insight into the factors contributing to their success.

<a href="https://ibb.co/JwrZrsR8"><img src="https://i.ibb.co/39SKSCyV/star-schema.png" alt="star-schema" border="0" /></a>


To achieve this, the data model follows a **Star Schema** built from **Fact and Dimension Tables**, with **Slowly Changing Dimensions (SCDs)** to track historical changes. The warehouse is stored in **Delta Lake** for reliability and scalability.



## Why Delta Over Parquet for BI?
- **ACID & Real-Time**: Delta adds transactional updates, deletes, and merges on top of Parquet files; plain Parquet offers no transactions.
- **Schema Evolution**: Tables can absorb new columns as the data model changes.
- **Time Travel**: Historical snapshots aid auditing and SCD logic.
- **High Concurrency**: Readers see a consistent snapshot while writers commit updates, so frequent incremental loads can run alongside analytical queries.
![](https://miro.medium.com/v2/resize:fit:1242/format:webp/1*ZQn7kYvHzw_B5qwbCGtE0w.png)

## **Yelp Analytics Data Warehouse Setup**

### **Creating the Database**

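1. Drops any existing `yelp_analytics` database, including all of its tables, so the notebook starts from a clean workspace.
2. Creates a dedicated database so that every table lives in one structured, queryable namespace.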


In [0]:
%sql

-- Drop any existing database together with all of its tables (destructive)
DROP DATABASE IF EXISTS yelp_analytics CASCADE;
-- Create the database and set it as current
CREATE DATABASE IF NOT EXISTS yelp_analytics;
USE yelp_analytics;



## **Dimension Tables (Descriptive Data)**

### **Business Dimension (SCD Type 2)**

1. Stores business information.
2. **SCD Type 2** ensures historical tracking (effective & expiry dates).
3. Helps in trend analysis over time.

### **User Dimension (SCD Type 2)**

1. Tracks user engagement.
2. Enables analyzing interactions of active and influential users.


In [0]:
%sql
--------------------------------------------------------
-- Dimension Tables
--------------------------------------------------------

-- Business Dimension (SCD Type 2)
CREATE TABLE IF NOT EXISTS dim_business (
    business_id STRING,
    name STRING,
    address STRING,
    city STRING,
    state STRING,
    postal_code STRING,
    latitude DOUBLE,
    longitude DOUBLE,
    is_open BOOLEAN,
    categories STRING,
    review_count INT,
    stars DOUBLE,
    -- SCD Type 2 columns for tracking history
    effective_date DATE,
    expiry_date DATE,
    current_flag BOOLEAN
)
USING DELTA;


-- User Dimension (SCD Type 2)
CREATE TABLE IF NOT EXISTS dim_user (
    user_id STRING,
    name STRING,
    yelping_since DATE,
    review_count INT,
    average_stars DOUBLE,
    fans INT,
    cool INT,
    funny INT,
    useful INT,
    elite STRING,
    -- SCD Type 2 columns
    effective_date DATE,
    expiry_date DATE,
    current_flag BOOLEAN
)
USING DELTA;
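
The SCD Type 2 columns above imply a close-and-append update pattern: when a tracked attribute changes, the current row is expired and a new current row is added. The plain-Python sketch below illustrates that logic on a list of dictionaries. It is not the notebook's load code: the helper name `apply_scd2`, the set of tracked attributes, the `9999-12-31` open-ended expiry date, and the sample business are all assumptions made for illustration. In the warehouse itself the same step would more likely be expressed as a Delta `MERGE INTO`.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)                   # assumed sentinel for "not expired yet"
TRACKED = ("name", "city", "is_open", "stars")  # assumed subset of tracked attributes


def apply_scd2(rows, incoming, load_date):
    """Return a new row list after applying one incoming business record."""
    current = next(
        (r for r in rows
         if r["business_id"] == incoming["business_id"] and r["current_flag"]),
        None,
    )
    # No tracked attribute changed: leave the history untouched.
    if current and all(current[k] == incoming[k] for k in TRACKED):
        return list(rows)

    result = []
    for r in rows:
        if r is current:
            # Expire the old version instead of overwriting it.
            r = {**r, "expiry_date": load_date, "current_flag": False}
        result.append(r)

    # Append the new version as the current row.
    result.append({**incoming, "effective_date": load_date,
                   "expiry_date": OPEN_END, "current_flag": True})
    return result


# Made-up sample data: the same business is loaded twice with a changed rating.
v1 = {"business_id": "b1", "name": "Cafe A", "city": "Reno",
      "is_open": True, "stars": 3.5}
v2 = {**v1, "stars": 4.5}

history = apply_scd2([], v1, date(2023, 1, 1))
history = apply_scd2(history, v2, date(2024, 1, 1))
for row in history:
    print(row["stars"], row["effective_date"], row["expiry_date"], row["current_flag"])
```

With this convention each version is valid on the half-open interval from `effective_date` up to, but not including, `expiry_date`.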




### **Date Dimension**

1. Standardizes date-based analytics.
2. Supports time-series comparisons.
3. Is populated by a PySpark script that generates one row per calendar day.

In [0]:
%sql
-- Date Dimension (Static)
CREATE TABLE IF NOT EXISTS dim_date (
    full_date DATE,
    year INT,
    quarter INT,
    month INT,
    week INT,
    day INT,
    day_of_week INT
)
USING DELTA;

In [0]:
from pyspark.sql import SparkSession
# explode, sequence, and to_date are used inside the SQL string below, so they need no Python import
from pyspark.sql.functions import col, year, quarter, month, weekofyear, dayofmonth, dayofweek

# Initialize Spark session
spark = SparkSession.builder.getOrCreate()

# Define start and end dates for the date dimension
start_date = "2000-01-01"
end_date = "2030-12-31"

# Create a DataFrame with a sequence of dates
date_df = spark.sql(f"""
    SELECT explode(sequence(to_date('{start_date}'), to_date('{end_date}'), interval 1 day)) as full_date
""")

# Add date parts. weekofyear is the ISO 8601 week number;
# dayofweek runs from 1 (Sunday) to 7 (Saturday).
date_dim = date_df.withColumn("year", year(col("full_date"))) \
    .withColumn("quarter", quarter(col("full_date"))) \
    .withColumn("month", month(col("full_date"))) \
    .withColumn("week", weekofyear(col("full_date"))) \
    .withColumn("day", dayofmonth(col("full_date"))) \
    .withColumn("day_of_week", dayofweek(col("full_date")))

# Write the DataFrame as a Delta table
date_dim.write.format("delta").mode("overwrite").saveAsTable("dim_date")
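
To sanity-check the generated table without a cluster, the same columns can be reproduced with the Python standard library. This is only a sketch: the helper name `date_row` is hypothetical, and it relies on Spark's `dayofweek` numbering days from 1 (Sunday) to 7 (Saturday) and `weekofyear` following ISO 8601 week numbering.

```python
from datetime import date, timedelta


def date_row(d):
    """Build one dim_date row using the same column names as the table."""
    return {
        "full_date": d,
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "week": d.isocalendar()[1],              # ISO 8601 week number
        "day": d.day,
        "day_of_week": d.isoweekday() % 7 + 1,   # 1 = Sunday ... 7 = Saturday
    }


start, end = date(2000, 1, 1), date(2030, 12, 31)
rows = [date_row(start + timedelta(days=i)) for i in range((end - start).days + 1)]

print(len(rows))  # 11323 (31 years, 8 of them leap years)
print(rows[0])    # 2000-01-01 was a Saturday in ISO week 52 of 1999
```

Comparing this row count with `SELECT COUNT(*) FROM dim_date` is a quick way to confirm that the date range was generated without gaps.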


## **Fact Tables (Transactional Data)**

### **Reviews Fact Table**

1. Captures user reviews.
2. Links to `dim_user` and `dim_business`.

### **Check-ins Fact Table**

1. Tracks customer visits to businesses.
2. Helps identify high-traffic businesses.

### **Tips Fact Table**

1. Stores user tips and recommendations.

---

In [0]:
%sql

--------------------------------------------------------
-- Fact Tables
--------------------------------------------------------

-- Review Fact Table
CREATE TABLE IF NOT EXISTS fact_review (
    review_id STRING,
    business_id STRING,
    user_id STRING,
    review_date DATE,
    stars DOUBLE,
    cool INT,
    funny INT,
    useful INT
)
USING DELTA;

-- Check-in Fact Table
CREATE TABLE IF NOT EXISTS fact_checkin (
    business_id STRING,
    checkin_date DATE
)
USING DELTA;

-- Tip Fact Table
CREATE TABLE IF NOT EXISTS fact_tip (
    business_id STRING,
    user_id STRING,
    tip_date DATE,
    text STRING,
    compliment_count INT
)
USING DELTA;
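
The fact tables carry only the natural keys (`business_id`, `user_id`), while the SCD Type 2 dimensions can hold several versions per key. A query that needs the attributes as they were on the event date therefore has to pick the version whose validity window contains that date. Below is a minimal plain-Python sketch of that lookup, assuming a half-open window (`effective_date <= d < expiry_date`); the helper name `version_as_of` and the sample rows are made up for the example.

```python
from datetime import date


def version_as_of(versions, business_id, on_date):
    """Return the dimension row that was valid for business_id on on_date."""
    for v in versions:
        if (v["business_id"] == business_id
                and v["effective_date"] <= on_date < v["expiry_date"]):
            return v
    return None  # no version covers that date


# Made-up dimension history: the rating changed on 2024-01-01.
dim_business = [
    {"business_id": "b1", "stars": 3.5, "effective_date": date(2023, 1, 1),
     "expiry_date": date(2024, 1, 1), "current_flag": False},
    {"business_id": "b1", "stars": 4.5, "effective_date": date(2024, 1, 1),
     "expiry_date": date(9999, 12, 31), "current_flag": True},
]

review = {"review_id": "r1", "business_id": "b1", "review_date": date(2023, 6, 15)}
match = version_as_of(dim_business, review["business_id"], review["review_date"])
print(match["stars"])  # 3.5, the rating in effect when the review was written
```

For reports that only need today's attributes, filtering the dimension on `current_flag` is the simpler join.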


## **Data Mart: Rising Star Businesses**
### **Rising Star Businesses Data Mart (SCD Type 3)**

1. **Tracks business growth over time** (`review_growth`).
2. **Year-over-year (YoY) comparison** enables early identification of trending businesses.
3. Uses **SCD Type 3** to retain the prior period's metrics alongside the current ones.
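
As a rough illustration of the year-over-year comparison behind the data mart, the sketch below aggregates review rows per business and year, then derives review growth and rating change. It is plain Python rather than the notebook's Spark code; the function names `yearly_stats` and `yoy`, the metric definitions, and the sample rows are assumptions for the example.

```python
from collections import defaultdict
from datetime import date


def yearly_stats(reviews):
    """Map (business_id, year) -> (review_count, avg_stars)."""
    acc = defaultdict(lambda: [0, 0.0])
    for r in reviews:
        key = (r["business_id"], r["review_date"].year)
        acc[key][0] += 1
        acc[key][1] += r["stars"]
    return {k: (n, total / n) for k, (n, total) in acc.items()}


def yoy(stats, business_id, year):
    """Compare one year with the previous one; None if either year has no reviews."""
    cur = stats.get((business_id, year))
    prev = stats.get((business_id, year - 1))
    if cur is None or prev is None:
        return None
    return {
        "review_growth": (cur[0] - prev[0]) / prev[0],  # relative change in volume
        "rating_improvement": cur[1] - prev[1],         # change in average stars
    }


# Made-up review rows shaped like fact_review.
reviews = [
    {"business_id": "b1", "review_date": date(2023, 3, 1), "stars": 3.0},
    {"business_id": "b1", "review_date": date(2023, 9, 1), "stars": 4.0},
    {"business_id": "b1", "review_date": date(2024, 2, 1), "stars": 4.0},
    {"business_id": "b1", "review_date": date(2024, 5, 1), "stars": 5.0},
    {"business_id": "b1", "review_date": date(2024, 8, 1), "stars": 4.5},
    {"business_id": "b1", "review_date": date(2024, 11, 1), "stars": 4.5},
]

result = yoy(yearly_stats(reviews), "b1", 2024)
print(result)  # {'review_growth': 1.0, 'rating_improvement': 1.0}
```

Businesses with no reviews in the previous year return `None` here; whether such newcomers should count as rising stars is a separate design decision.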



In [0]:
%sql
USE yelp_analytics;

CREATE TABLE IF NOT EXISTS datamart_rising_star_businesses (
    business_id STRING,
    name STRING,

    -- Period type for comparisons, now only YoY is considered
    period_type STRING,  -- Expected value: 'YoY'
    
    -- Period columns for comparisons
    current_period DATE,      
    previous_period DATE,    
    
    -- Label column for storing rising star labels (YoY only)
    rising_star_labels ARRAY<STRING>,  
    
    -- Current metrics from the latest load:
    current_review_count INT,   
    current_avg_stars DOUBLE,   
    current_rating_improvement DOUBLE,  
    
    -- Prior (historical) metrics (SCD Type 3)
    prior_review_count INT,     
    prior_avg_stars DOUBLE,     
    prior_rating_improvement DOUBLE,
    
    last_update_date DATE       
)
USING DELTA;
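
The paired `current_*` and `prior_*` columns suggest the usual SCD Type 3 refresh: on each load, the existing current values are shifted into the prior columns before the new values are written. A plain-Python sketch of that step follows. The helper name `refresh_row`, the label names, and the thresholds are assumptions for illustration, not values taken from the notebook.

```python
from datetime import date


def refresh_row(row, new_period, new_count, new_avg, load_date):
    """Shift current metrics into the prior columns, then store the new ones."""
    improvement = new_avg - row["current_avg_stars"]

    labels = []
    if new_count >= 1.5 * row["current_review_count"]:  # assumed threshold
        labels.append("review_growth")
    if improvement >= 0.5:                               # assumed threshold
        labels.append("rating_improvement")

    return {
        **row,
        # SCD Type 3: keep exactly one previous version, in dedicated columns.
        "prior_review_count": row["current_review_count"],
        "prior_avg_stars": row["current_avg_stars"],
        "prior_rating_improvement": row["current_rating_improvement"],
        "previous_period": row["current_period"],
        # New values from the latest load.
        "current_period": new_period,
        "current_review_count": new_count,
        "current_avg_stars": new_avg,
        "current_rating_improvement": improvement,
        "rising_star_labels": labels,
        "last_update_date": load_date,
    }


# Made-up data mart row before the refresh.
row = {
    "business_id": "b1", "name": "Cafe A", "period_type": "YoY",
    "current_period": date(2023, 1, 1), "previous_period": date(2022, 1, 1),
    "rising_star_labels": [],
    "current_review_count": 2, "current_avg_stars": 3.5,
    "current_rating_improvement": 0.0,
    "prior_review_count": None, "prior_avg_stars": None,
    "prior_rating_improvement": None,
    "last_update_date": date(2024, 1, 5),
}

updated = refresh_row(row, date(2024, 1, 1), 4, 4.5, date(2025, 1, 5))
print(updated["prior_review_count"], updated["current_review_count"])
print(updated["rising_star_labels"])
```

Unlike the Type 2 dimensions, this keeps only one step of history per business, which is enough for a single YoY comparison but not for longer trend lines.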
