# Bronze Layer: Ingest JSON to Bronze Delta Tables

## Purpose
This notebook ingests TrackMania 2020 replay data from JSON files stored in the Bronze Lakehouse Files area and writes it to a Bronze Delta table with minimal transformation.

## Inputs
- **Source**: JSON files in the `tm2020_bronze` Lakehouse Files area (`Files/replays/*.json`, mounted locally at `/lakehouse/default/Files/replays/*.json`)
- **Format**: One JSON file per replay, structure based on `tm_gbx` parser output
- **Schema**: two top-level fields, `metadata` (player, map, and race info) and `ghost_samples` (array of telemetry samples)
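
For orientation, a replay file with this shape might look like the sketch below. The field names follow the schema defined later in this notebook; every value is invented for illustration, and `has_expected_shape` is a hypothetical helper, not part of `tm_gbx`.

```python
import json

# Hypothetical replay document: field names follow the notebook's schema,
# all values are invented for illustration only.
sample_replay = {
    "metadata": {
        "player_login": "example_login",
        "player_nickname": "ExamplePlayer",
        "map_name": "Example Map",
        "map_uid": "EXAMPLE_UID",
        "map_author": "example_author",
        "race_time_ms": 45678,
        "checkpoints": [12000, 30000, 45678],
        "num_respawns": 0,
        "game_version": "example-version",
        "title_id": "example_title",
    },
    "ghost_samples": [
        {
            "time_ms": 0,
            "position": {"x": 0.0, "y": 10.0, "z": 0.0},
            "velocity": {"x": 0.0, "y": 0.0, "z": 0.0},
            "speed": 0.0,
        }
    ],
}


def has_expected_shape(doc: dict) -> bool:
    """Return True if doc has both top-level fields with the expected container types."""
    return isinstance(doc.get("metadata"), dict) and isinstance(doc.get("ghost_samples"), list)


# Round-trip through JSON text, as a file on disk would be read
parsed = json.loads(json.dumps(sample_replay))
print(has_expected_shape(parsed))
```

A check like this only confirms the top-level layout; the Spark schema applied at read time is what enforces the field types.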

## Outputs
- **Target**: `bronze_replays_raw` Delta table in `tm2020_bronze` Lakehouse
- **Write mode**: Append; existing rows are never overwritten, so re-running the notebook on the same files adds duplicate rows
- **Columns**: replay_id, ingestion_timestamp, metadata, ghost_samples, source_file

## Processing Logic
1. Connect to Bronze Lakehouse
2. Read JSON files from Files area
3. Add ingestion metadata (timestamp, source file)
4. Generate unique replay_id from filename or hash
5. Write to Delta table (append mode)
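
Steps 3 and 4 can be sketched in plain Python, independent of Spark. The helper names `replay_id_from_path`, `replay_id_from_content`, and `to_bronze_row` are hypothetical, and the choice of SHA-256 is an assumption; the notebook itself leaves the ID strategy open.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import PurePosixPath


def replay_id_from_path(source_file: str) -> str:
    """Use the file name without its extension as the ID (assumes unique file names)."""
    return PurePosixPath(source_file).stem


def replay_id_from_content(raw_json: str) -> str:
    """Use a SHA-256 hex digest of the file content as the ID."""
    return hashlib.sha256(raw_json.encode("utf-8")).hexdigest()


def to_bronze_row(raw_json: str, source_file: str) -> dict:
    """Wrap one parsed replay document with the Bronze ingestion columns."""
    doc = json.loads(raw_json)
    return {
        "replay_id": replay_id_from_path(source_file),
        "ingestion_timestamp": datetime.now(timezone.utc).isoformat(),
        "metadata": doc.get("metadata"),
        "ghost_samples": doc.get("ghost_samples"),
        "source_file": source_file,
    }


# Hypothetical file path and content, for illustration only
row = to_bronze_row(
    '{"metadata": {"map_uid": "EXAMPLE_UID"}, "ghost_samples": []}',
    "Files/replays/example_replay.json",
)
print(row["replay_id"])  # example_replay
```

The path-based ID is readable but changes if a file is renamed; the content hash stays stable across renames at the cost of reading the whole file.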

In [None]:
# Lakehouse connection setup
# TODO: Configure Lakehouse connection for 'tm2020_bronze'
# In Microsoft Fabric, the default lakehouse is automatically attached
# Ensure this notebook is configured to use tm2020_bronze as default lakehouse

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name, lit
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, DoubleType

# Initialize Spark session (pre-configured in Fabric)
spark = SparkSession.builder.appName("Bronze_JSON_Ingestion").getOrCreate()

print("Spark session initialized")
print(f"Spark version: {spark.version}")

In [None]:
# Define paths
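# TODO: Update paths based on actual Lakehouse Files structure
# Spark resolves "Files/..." relative to the default lakehouse; the
# "/lakehouse/default/Files/..." mount path is meant for local file APIs (pandas, os)
input_path = "Files/replays/*.json"  # Path to JSON files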
output_table = "bronze_replays_raw"  # Bronze Delta table name

print(f"Input path: {input_path}")
print(f"Output table: {output_table}")

In [None]:
# Define schema for JSON files based on tm_gbx output structure
# This schema matches the output from tm_gbx/models.py

# Vec3 schema (position/velocity)
vec3_schema = StructType([
    StructField("x", DoubleType(), True),
    StructField("y", DoubleType(), True),
    StructField("z", DoubleType(), True)
])

# Ghost sample schema
ghost_sample_schema = StructType([
    StructField("time_ms", IntegerType(), True),
    StructField("position", vec3_schema, True),
    StructField("velocity", vec3_schema, True),
    StructField("speed", DoubleType(), True)
])

# Metadata schema
metadata_schema = StructType([
    StructField("player_login", StringType(), True),
    StructField("player_nickname", StringType(), True),
    StructField("map_name", StringType(), True),
    StructField("map_uid", StringType(), True),
    StructField("map_author", StringType(), True),
    StructField("race_time_ms", IntegerType(), True),
    StructField("checkpoints", ArrayType(IntegerType()), True),
    StructField("num_respawns", IntegerType(), True),
    StructField("game_version", StringType(), True),
    StructField("title_id", StringType(), True)
])

# Complete JSON schema
json_schema = StructType([
    StructField("metadata", metadata_schema, True),
    StructField("ghost_samples", ArrayType(ghost_sample_schema), True)
])

print("Schema defined successfully")

In [None]:
# Read JSON files from Lakehouse Files
# TODO: Confirm the input path once replay files have been uploaded

from pyspark.sql.functions import col, regexp_extract

try:
    # Each file holds a single JSON document, so parse it whole (multiLine)
    # instead of treating every line as a separate record
    df_json = spark.read.schema(json_schema).option("multiLine", "true").json(input_path)

    # Add ingestion metadata
    # Assumption: file names are unique per replay, so the name without its
    # ".json" extension serves as replay_id; switch to a content hash if not
    df_bronze = df_json \
        .withColumn("ingestion_timestamp", current_timestamp()) \
        .withColumn("source_file", input_file_name()) \
        .withColumn("replay_id", regexp_extract(col("source_file"), r"([^/]+)\.json$", 1))

    # Show sample data
    print(f"Records read: {df_bronze.count()}")
    df_bronze.select("replay_id", "ingestion_timestamp", "source_file", "metadata.player_nickname", "metadata.race_time_ms").show(5, truncate=False)

except Exception as e:
    print(f"Error reading JSON files: {e}")
    print("This is expected if no JSON files exist yet in the Lakehouse Files area")

In [None]:
# Write to Bronze Delta table
# TODO: Implement actual write logic after JSON files are available

try:
    # Write to Delta table in append mode
    df_bronze.write \
        .format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .saveAsTable(output_table)
    
    print(f"Successfully wrote data to {output_table}")
    
    # Verify write
    df_verify = spark.table(output_table)
    print(f"Total records in {output_table}: {df_verify.count()}")
    
except Exception as e:
    print(f"Error writing to Delta table: {e}")
    print("Ensure df_bronze is defined (run previous cell successfully first)")

In [None]:
# Data quality checks (basic)
# TODO: Implement data quality validation

try:
    df_check = spark.table(output_table)
    
    print("=== Data Quality Report ===")
    print(f"Total records: {df_check.count()}")
    print(f"Null metadata: {df_check.filter('metadata IS NULL').count()}")
    print(f"Null ghost_samples: {df_check.filter('ghost_samples IS NULL').count()}")
    
    # Show schema
    print("\nTable Schema:")
    df_check.printSchema()
    
except Exception as e:
    print(f"Error running data quality checks: {e}")