# Yelp Data Ingestion


<a href="https://ibb.co/WWk6nPXQ"><img src="https://i.ibb.co/XZ7bX3P9/ingestion.png" alt="Ingestion flow diagram" border="0"></a><br />Ingestion Flow Diagram<br />



## Step 1: Convert to **Parquet**
- **Problem**: The raw JSON files must be added to the Spark context before they can be read, JSON is inefficient for OLAP queries, and the data sees minimal writes.
- **Solution**: Convert to **Parquet** for efficient querying and storage.


![](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*ORHjAuLiy0ksWcQ65Fzepw.png)
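
To make the OLAP point concrete, here is a toy, pure-Python illustration of why a column-oriented layout suits analytical queries. It is not how Parquet is implemented; it only shows that, once data is pivoted into columns, an aggregate over one field reads that field alone, whereas row-oriented JSON lines force every record to be parsed in full. The sample records and the helper names `to_columns` and `average` are made up for this sketch, and it assumes every record has the same keys.

```python
import json

# Hypothetical sample in JSON Lines form (one object per line): a row-oriented layout.
json_lines = [
    '{"business_id": "b1", "city": "Tucson", "stars": 4.0}',
    '{"business_id": "b2", "city": "Reno", "stars": 3.0}',
    '{"business_id": "b3", "city": "Tucson", "stars": 5.0}',
]


def to_columns(lines):
    """Pivot row-oriented JSON lines into a dict of column name -> list of values."""
    columns = {}
    for line in lines:
        record = json.loads(line)  # the whole record is parsed, even unused fields
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def average(values):
    """Mean of a non-empty list of numbers."""
    return sum(values) / len(values)


columns = to_columns(json_lines)

# An analytical query such as "average stars" now reads a single column.
avg_stars = average(columns["stars"])
print(avg_stars)
```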


## Step 2: Log and Monitor After Ingestion
- Each run logs the execution time, file size, and number of records ingested to `config_db.elt_process_log`.

```python
# Record one row per ingested table in the process log
log_sql = f"""
    INSERT INTO config_db.elt_process_log 
    (log_id, process_name, target_table, start_time, end_time, execution_time_seconds, size, rows_affected, status) 
    VALUES ({last_log_id}, 'Data Ingestion', '{table_name}', '{start_time}', '{end_time}', {exec_time}, '{file_size_mb:.2f} MB', {record_count}, 'Success')
"""
spark.sql(log_sql)
```
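
The values interpolated into the statement above are all derived from a few measurements taken during the run. As a minimal sketch of how they could be computed in one place, the hypothetical helper `build_log_record` below (not part of the original notebook) assembles them into a dictionary keyed by the column names from the snippet. The timestamps and sizes in the usage lines are illustrative only.

```python
from datetime import datetime


def build_log_record(log_id, table_name, start_time, end_time, file_size_mb, record_count):
    """Derive the logged values for one ingestion run (hypothetical helper)."""
    return {
        "log_id": log_id,
        "process_name": "Data Ingestion",
        "target_table": table_name,
        "start_time": start_time,
        "end_time": end_time,
        # Elapsed wall-clock time, including fractional seconds
        "execution_time_seconds": (end_time - start_time).total_seconds(),
        # Size is stored as text with two decimals, matching the snippet above
        "size": f"{file_size_mb:.2f} MB",
        "rows_affected": record_count,
        "status": "Success",
    }


# Illustrative values only
started = datetime(2024, 1, 1, 12, 0, 0)
finished = datetime(2024, 1, 1, 12, 0, 30, 500000)
record = build_log_record(1, "business", started, finished, 113.3649, 1000)
print(record["execution_time_seconds"], record["size"])
```

Keeping the derivation separate from the SQL string makes it possible to check the numbers without a Spark session.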

## Key Learning
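- Source files are added to the Spark context dynamically (`sc.addFile`) before ingestion.
- Parquet improves query performance and storage efficiency compared with raw JSON.
- Peak mode raises parallelism for large loads; off-peak mode keeps resource usage low for smaller ones.
- Logs support monitoring and optimization.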

- **Handle Peak vs. Off-Peak Mode**:
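  - **Peak Mode**: If the file size exceeds `PEAK_THRESHOLD_MB` (1000 MB, roughly 1 GB), `spark.sql.shuffle.partitions` is raised to 500 for faster processing.
  - **Off-Peak Mode**: For smaller files, `spark.sql.shuffle.partitions` is set to 100 to keep resource usage low.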
  
```python

# Excerpt from the per-table loop; row, df and table_name are set earlier in the loop
file_size_mb = os.path.getsize(row.file_path) / (1024 * 1024)  # bytes -> MB
load_condition = "peak" if file_size_mb > PEAK_THRESHOLD_MB else "off-peak"

if load_condition == "peak":
    spark.conf.set("spark.sql.shuffle.partitions", "500")
    print(f"⚡ Peak Load for {table_name} ({file_size_mb:.2f} MB). High parallelism applied.")
else:
    spark.conf.set("spark.sql.shuffle.partitions", "100")
    print(f"🌙 Off-Peak Load for {table_name} ({file_size_mb:.2f} MB). Optimized resource usage.")

# Tag each row with the ingestion date and write Parquet partitioned by that date
(
    df.withColumn("ingestion_date", lit(ingestion_date))
    .write.mode("overwrite")
    .partitionBy("ingestion_date")
    .parquet(os.path.join(parquet_base_path, table_name))
)
```
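
The peak/off-peak decision in the snippet above can also be isolated into a pure function, which makes the threshold behaviour easy to check without a Spark session. This is a minimal sketch under that assumption: `choose_load_settings` is a hypothetical name, while the 1000 MB threshold and the 500/100 partition counts are the ones used in the snippet.

```python
PEAK_THRESHOLD_MB = 1000  # same threshold as the ingestion code


def choose_load_settings(file_size_bytes, threshold_mb=PEAK_THRESHOLD_MB):
    """Return (load_condition, shuffle_partitions, file_size_mb) for a size in bytes."""
    file_size_mb = file_size_bytes / (1024 * 1024)
    if file_size_mb > threshold_mb:  # strictly greater, as in the snippet
        return "peak", 500, file_size_mb
    return "off-peak", 100, file_size_mb


# Illustrative sizes: a 200 MB file and a 2048 MB file
small = choose_load_settings(200 * 1024 * 1024)
large = choose_load_settings(2 * 1024 * 1024 * 1024)
print(small)
print(large)
```

The returned partition count would then be applied with `spark.conf.set("spark.sql.shuffle.partitions", str(partitions))`, as in the snippet. Note that a file of exactly 1000 MB falls into off-peak mode because the comparison is strict.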

Together, Parquet conversion, size-based tuning, and run logging keep Yelp data processing **efficient, scalable, and observable**.

The full code, as run in the notebook:


```python
import os
from pyspark.sql import SparkSession
from pyspark import SparkFiles
from datetime import datetime
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("YelpDataIngestion").getOrCreate()

last_log_id_result = spark.sql("SELECT MAX(log_id) FROM config_db.elt_process_log").collect()
last_log_id = last_log_id_result[0][0] if last_log_id_result and last_log_id_result[0][0] is not None else 0

source_metadata_df = spark.sql("SELECT source_name, file_path, file_format, ingestion_type FROM config_db.source_metadata")

parquet_base_path = "/data/yelp/parquet/"
ingestion_date = datetime.now().strftime('%Y-%m-%d')

# Function to get file size in MB
def get_file_size(path):
    try:
        file_size = os.path.getsize(path) / (1024 * 1024)  # Convert bytes to MB
        return file_size
    except Exception as e:
        print(f"⚠️ Warning: Unable to retrieve size for {path}. {e}")
        return 0  # Default to 0MB if error occurs

# Define threshold (1GB per table) for determining peak vs. off-peak load
PEAK_THRESHOLD_MB = 1000

# Process each table dynamically
for row in source_metadata_df.collect():
    table_name = row.source_name.lower().replace(" ", "_")
    file_format = row.file_format.lower()
    start_time = datetime.now()

    sc = spark.sparkContext
    sc.addFile(row.file_path)

    local_path = SparkFiles.get(row.file_path.split('/')[-1])
    spark_read_path = f"file://{local_path}"

    file_size_mb = get_file_size(local_path)

    load_condition = "peak" if file_size_mb > PEAK_THRESHOLD_MB else "off-peak"

    if load_condition == "peak":
        spark.conf.set("spark.sql.shuffle.partitions", "500")  # Allowed runtime change
        print(f"⚡ Peak Load for {table_name} ({file_size_mb:.2f} MB). High parallelism applied.")
        print("⚠️ NOTE: Adjust `spark.executor.memory` via Spark submit: `--conf spark.executor.memory=16g`")
    else:
        spark.conf.set("spark.sql.shuffle.partitions", "100")  # Allowed runtime change
        print(f"🌙 Off-Peak Load for {table_name} ({file_size_mb:.2f} MB). Optimized resource usage.")

    # Read data based on format
    if file_format == "json":
        df = spark.read.json(spark_read_path)
    elif file_format == "csv":
        df = spark.read.option("header", "true").csv(spark_read_path)
    elif file_format == "parquet":
        df = spark.read.parquet(spark_read_path)
    else:
        print(f"❌ Unsupported file format: {file_format} for {table_name}")
        continue

    df = df.withColumn("ingestion_date", lit(ingestion_date))

    df.printSchema()
    record_count = df.count()
    print(f"Total {table_name} records: {record_count}")
    print(f"✅ Loaded {table_name} with {record_count} records from {row.file_path}")

    # Save as Parquet
    parquet_save_path = os.path.join(parquet_base_path, table_name)
    df.write.mode("overwrite").partitionBy("ingestion_date").parquet(parquet_save_path)
    print(f"✅ Saved {table_name} as Parquet at {parquet_save_path}")

    end_time = datetime.now()
    exec_time = (end_time - start_time).total_seconds()

    # Increment log_id for this process
    last_log_id += 1

    # Log with Peak/Off-Peak Mode
    log_sql = f"""
        INSERT INTO config_db.elt_process_log 
        (log_id, process_name, target_table, start_time, end_time, execution_time_seconds, size, rows_affected, method_used, status, error_message) 
        VALUES ({last_log_id}, 'Refresh Data Ingestion', '{table_name}', '{start_time}', '{end_time}', {exec_time}, '{file_size_mb:.2f} MB', {record_count}, 'DataFrame ({load_condition.capitalize()})', 'Success', NULL)
    """
    spark.sql(log_sql)

print("🎉 Data ingestion and Parquet saving completed successfully!")
```


🌙 Off-Peak Load for business (113.36 MB). Optimized resource usage.
root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: string (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: string (nullable = true)
 |    |-- BYOB: string (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: string (nullable = true)
 |    |-- BikeParking: string (nullable = true)
 |    |-- BusinessAcceptsBitcoin: string (nullable = true)
 |    |-- BusinessAcceptsCreditCards: string (nullable = true)
 |    |-- BusinessParking: string (nullable = true)
 |    |-- ByAppointmentOnly: string (nullable = true)
 |    |-- Caters: string (nullable = true)
 |    |-- CoatCheck: string (nullable = true)
 |    |-- Corkage: string (nullable = true)
 |    |-- DietaryRestrictions: string (nullable = true)
 |    |-- DogsAllowed: string (nullable = true)
 |    |