# Yelp Dataset - Kaggle Initialization & Configuration

## Overview
This notebook sets up the environment to download and manage the **Yelp Dataset** from Kaggle, using **Apache Spark, Delta Lake, and SQL metadata tables** for structured ingestion.

## Step 1: Install Required Packages
```python
!pip install kagglehub[pandas-datasets]
!pip install great-expectations
```


## Step 2: Import Libraries and Download Dataset
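To work with the **Yelp Dataset**, import the required libraries, download the dataset files with `kagglehub`, and initialize a Spark session.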

```python
import kagglehub
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from delta.tables import DeltaTable

# Download Yelp dataset from Kaggle
path = kagglehub.dataset_download("yelp-dataset/yelp-dataset")
print("Path to dataset files:", path)

# Initialize Spark session
spark = SparkSession.builder.appName("YelpDataIngestion").getOrCreate()
```


```text
Downloading from https://www.kaggle.com/api/v1/datasets/download/yelp-dataset/yelp-dataset?dataset_version_number=4...
Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/yelp-dataset/yelp-dataset/versions/4
```
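
Before ingesting, it can help to confirm which files the download actually produced. The following is a minimal sketch, not part of the original notebook: `list_dataset_files` is a hypothetical helper (it is not a `kagglehub` function), and it assumes the download path is a local directory like the one printed above.

```python
import os


def list_dataset_files(directory, suffix=".json"):
    """Return (file_name, size_in_bytes) pairs for files ending in `suffix`, sorted by name.

    Hypothetical helper for illustration; not provided by kagglehub.
    """
    entries = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        # Skip sub-directories and files with other extensions
        if os.path.isfile(full_path) and name.lower().endswith(suffix):
            entries.append((name, os.path.getsize(full_path)))
    return entries


# Example usage (assumes `path` was returned by kagglehub.dataset_download):
# for name, size in list_dataset_files(path):
#     print(f"{name}: {size / 1e6:.1f} MB")
```

Sorting the listing keeps the result stable between runs, which matters if the order is later used to assign identifiers.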


## Step 3: Create Config Database & Metadata Tables
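To manage and track the **Yelp Dataset**, we create a **database (`config_db`)** along with four **Delta tables**: `source_metadata` (source files), `elt_process_log` (ingestion logs), `table_metadata` (target table properties), and `data_quality_log` (data quality check results). Note that the cell first drops `config_db` with `CASCADE`, so any existing configuration and log data is removed.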



```sql
%sql
DROP DATABASE IF EXISTS config_db CASCADE;

CREATE DATABASE IF NOT EXISTS config_db;
USE config_db;

CREATE TABLE IF NOT EXISTS source_metadata (
    source_id INT,
    source_name STRING,
    file_path STRING,
    file_format STRING,
    ingestion_type STRING,
    schedule STRING
) USING DELTA;

CREATE TABLE IF NOT EXISTS elt_process_log (
    log_id INT,
    process_name STRING,
    target_table STRING,
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    execution_time_seconds DOUBLE,
    size STRING,
    rows_affected INT,
    method_used STRING,
    status STRING,
    error_message STRING
) USING DELTA;

CREATE TABLE IF NOT EXISTS table_metadata (
    table_id INT,
    table_name STRING,
    table_type STRING,
    description STRING,
    storage_format STRING,
    partition_columns STRING,
    created_at TIMESTAMP
) USING DELTA;

CREATE TABLE IF NOT EXISTS data_quality_log (
    dq_log_id INT,
    table_name STRING,
    quality_check STRING,
    description STRING,
    total_rows INT,
    passed_rows INT,
    failed_rows INT,
    check_time TIMESTAMP,
    execution_time_seconds DOUBLE,
    status STRING,
    error_message STRING
) USING DELTA;
```
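
To illustrate how a row for `elt_process_log` might be assembled, here is a hedged sketch in plain Python. `build_process_log_row` is a hypothetical helper, and the `"SUCCESS"` / `"FAILED"` status values are assumptions; the notebook does not specify which status strings it uses. The tuple follows the column order of the table definition above.

```python
from datetime import datetime


def build_process_log_row(log_id, process_name, target_table, start_time, end_time,
                          rows_affected, method_used, size=None, error_message=None):
    """Build a tuple in elt_process_log column order (hypothetical helper)."""
    # Assumed convention: any error message marks the run as failed
    status = "FAILED" if error_message else "SUCCESS"
    execution_time_seconds = (end_time - start_time).total_seconds()
    return (log_id, process_name, target_table, start_time, end_time,
            execution_time_seconds, size, rows_affected, method_used,
            status, error_message)


# Illustrative values only; they are not taken from a real run
start = datetime(2024, 1, 1, 0, 0, 0)
end = datetime(2024, 1, 1, 0, 0, 42)
row = build_process_log_row(1, "Initial Load - dim_business", "dim_business",
                            start, end, rows_affected=100, method_used="overwrite")
print(row[5], row[9])  # 42.0 SUCCESS
```

On a cluster, such a tuple could be turned into a DataFrame with `spark.createDataFrame` and appended with `saveAsTable("config_db.elt_process_log")`, in the same way the source metadata is written later in this notebook.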




## Step 4: Initialize Source Metadata from JSON Files
- Initializes Spark and scans the dataset directory for `.json` files.
- Derives one metadata row per file (name, path, format, ingestion type, schedule) and appends it to `config_db.source_metadata`.

```python
import os
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Reuse the active Spark session (or create one when running standalone)
spark = SparkSession.builder.appName("AutoDetectSourceMetadata").getOrCreate()

# Directory printed by kagglehub.dataset_download in Step 2
directory_path = '/root/.cache/kagglehub/datasets/yelp-dataset/yelp-dataset/versions/4/'

# Sort the listing so that source_id assignment is stable across runs
files = sorted(os.listdir(directory_path))
metadata_list = []
source_id = 1

for file_name in files:
    if file_name.lower().endswith('.json'):
        file_path = os.path.join(directory_path, file_name)
        # e.g. "yelp_academic_dataset_business.json" -> "business"
        source_name = file_name.replace("yelp_academic_dataset_", "").replace(".json", "").lower()
        metadata_list.append((source_id, source_name, file_path, "JSON", "Batch", "Daily 00:00"))
        source_id += 1

# Column names and order match config_db.source_metadata
schema = StructType([
    StructField("source_id", IntegerType(), False),
    StructField("source_name", StringType(), False),
    StructField("file_path", StringType(), False),
    StructField("file_format", StringType(), False),
    StructField("ingestion_type", StringType(), False),
    StructField("schedule", StringType(), False)
])

metadata_df = spark.createDataFrame(metadata_list, schema=schema)

# Append to the Delta table; mergeSchema allows schema evolution if columns are added later
metadata_df.write.format("delta").mode("append").option("mergeSchema", "true").saveAsTable("config_db.source_metadata")

print("Auto-detected metadata inserted successfully!")
```
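
The file-name handling in the cell above can also be isolated into pure functions, which makes it checkable without a Spark session. The sketch below is an illustration under stated assumptions: `derive_source_name` and `build_source_rows` are hypothetical names, and the constant values (`"JSON"`, `"Batch"`, `"Daily 00:00"`) are copied from the cell above.

```python
import os

# Prefix used by the Yelp dataset file names, as seen in the cell above
DATASET_PREFIX = "yelp_academic_dataset_"


def derive_source_name(file_name):
    """Strip the extension and dataset prefix, e.g. 'yelp_academic_dataset_user.json' -> 'user'."""
    base, _extension = os.path.splitext(file_name)
    if base.startswith(DATASET_PREFIX):
        base = base[len(DATASET_PREFIX):]
    return base.lower()


def build_source_rows(file_names, directory):
    """Return tuples in source_metadata column order for every JSON file name."""
    json_files = sorted(f for f in file_names if f.lower().endswith(".json"))
    rows = []
    for source_id, file_name in enumerate(json_files, start=1):
        rows.append((source_id, derive_source_name(file_name),
                     os.path.join(directory, file_name),
                     "JSON", "Batch", "Daily 00:00"))
    return rows


# Example usage (assumes `directory_path` from the cell above):
# rows = build_source_rows(os.listdir(directory_path), directory_path)
```

Using `enumerate` over a sorted list gives each file the same `source_id` on every run, which `os.listdir` alone does not guarantee.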


## Step 5: Initialize Table Metadata
- Registers each target table (dimensions, facts, and the data mart) in `config_db.table_metadata`.

```sql
%sql
INSERT INTO config_db.table_metadata (table_id, table_name, table_type, description, storage_format, partition_columns) VALUES
(1, 'dim_business', 'dimension', 'SCD Type 2', 'DELTA', 'business_id'),
(2, 'dim_business_attributes', 'dimension', 'SCD Type 2', 'DELTA', 'business_id'),
(3, 'dim_business_hours', 'dimension', 'SCD Type 1', 'DELTA', 'business_id'),
(4, 'dim_user', 'dimension', 'SCD Type 2', 'DELTA', 'user_id'),
(5, 'dim_date', 'dimension', 'Stores calendar date-related data', 'DELTA', 'full_date'),
(6, 'fact_review', 'fact', 'Stores user reviews including ratings and feedback', 'DELTA', 'review_date'),
(7, 'fact_checkin', 'fact', 'Stores check-in data for businesses', 'DELTA', 'checkin_date'),
(8, 'fact_tip', 'fact', 'Stores user tips for businesses', 'DELTA', 'tip_date'),
(9, 'datamart_rising_star_businesses', 'datamart', 'SCD Type 3', 'DELTA', 'mom_current_period');
```

```text
num_affected_rows,num_inserted_rows
9,9
```
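
Because these rows are typed by hand, a small sanity check before insertion can catch duplicate identifiers or misspelled table types. This is a sketch only: `validate_table_metadata` is a hypothetical helper, and the set of allowed types is an assumption inferred from the values used in the statement above.

```python
from collections import Counter

# Assumed set of valid types, inferred from the INSERT statement above
ALLOWED_TABLE_TYPES = {"dimension", "fact", "datamart"}


def validate_table_metadata(rows):
    """Return a list of problems found in (table_id, table_name, table_type, ...) tuples."""
    rows = list(rows)
    problems = []
    # Flag identifiers and names that appear more than once
    for table_id, count in Counter(r[0] for r in rows).items():
        if count > 1:
            problems.append(f"duplicate table_id: {table_id}")
    for table_name, count in Counter(r[1] for r in rows).items():
        if count > 1:
            problems.append(f"duplicate table_name: {table_name}")
    # Flag table types outside the assumed vocabulary
    for r in rows:
        if r[2] not in ALLOWED_TABLE_TYPES:
            problems.append(f"unknown table_type for {r[1]}: {r[2]}")
    return problems


sample = [
    (1, "dim_business", "dimension"),
    (6, "fact_review", "fact"),
    (9, "datamart_rising_star_businesses", "datamart"),
]
print(validate_table_metadata(sample))  # []
```

An empty list means no problems were detected; the function reports issues rather than raising, so all of them can be reviewed at once.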


## Step 6: Clean Up and Backfill If Needed

```python
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("YelpDataCleanup").getOrCreate()

# Count log entries written by a previous initial load
initial_load_check = spark.sql("""
    SELECT COUNT(*) AS count
    FROM config_db.elt_process_log
    WHERE process_name LIKE 'Initial Load%'
""").collect()[0]["count"]

if initial_load_check == 0:
    print("Initial load not detected. Performing cleanup...")

    # Build the warehouse path of every table registered in table_metadata
    table_metadata_df = spark.sql("SELECT table_name FROM config_db.table_metadata")
    table_names = [row.table_name.lower() for row in table_metadata_df.collect()]
    table_paths = [f"dbfs:/user/hive/warehouse/yelp_analytics.db/{name}" for name in table_names]

    # Use table_path (not path) to avoid overwriting the dataset path from Step 2
    for table_path in table_paths:
        # dbutils is provided by the Databricks runtime; True enables recursive deletion
        dbutils.fs.rm(table_path, True)
        print(f"🗑 Removed directory: {table_path}")

    print("✅ Initial load: Existing tables and directories cleared.")
else:
    print("✅ Initial load already performed. Skipping cleanup.")
```


```text
Initial load not detected. Performing cleanup...
🗑 Removed directory: dbfs:/user/hive/warehouse/yelp_analytics.db/dim_business
🗑 Removed directory: dbfs:/user/hive/warehouse/yelp_analytics.db/dim_business_attributes
🗑 Removed directory: dbfs:/user/hive/warehouse/yelp_analytics.db/dim_business_hours
🗑 Removed directory: dbfs:/user/hive/warehouse/yelp_analytics.db/dim_user
🗑 Removed directory: dbfs:/user/hive/warehouse/yelp_analytics.db/dim_date
🗑 Removed directory: dbfs:/user/hive/warehouse/yelp_analytics.db/fact_review
🗑 Removed directory: dbfs:/user/hive/warehouse/yelp_analytics.db/fact_checkin
🗑 Removed directory: dbfs:/user/hive/warehouse/yelp_analytics.db/fact_tip
🗑 Removed directory: dbfs:/user/hive/warehouse/yelp_analytics.db/datamart_rising_star_businesses
✅ Initial load: Existing tables and directories cleared.
```
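
Since the cleanup deletes directories recursively, it may be worth previewing the target paths before anything is removed. The sketch below separates path construction from deletion so the list can be inspected first. `warehouse_paths` and `needs_cleanup` are hypothetical helpers; the default root and database name are taken from the paths shown in the output above.

```python
def warehouse_paths(table_names, database="yelp_analytics",
                    root="dbfs:/user/hive/warehouse"):
    """Return the warehouse directory of each table (names are lower-cased)."""
    return [f"{root}/{database}.db/{name.lower()}" for name in table_names]


def needs_cleanup(initial_load_count):
    """Cleanup runs only when no 'Initial Load' entry has been logged yet."""
    return initial_load_count == 0


# Dry run: print the paths that would be removed, without deleting anything
if needs_cleanup(0):
    for table_path in warehouse_paths(["dim_business", "fact_review"]):
        print(f"Would remove: {table_path}")
```

In the notebook itself, the `print` call in the loop would be replaced by `dbutils.fs.rm(table_path, True)` once the previewed list looks correct.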
