# Data Lakehouse

This notebook demonstrates how to set up a basic data lakehouse architecture in Databricks on AWS for the MLB Tech Summit. It covers catalog, schema, and volume creation, as well as directory setup for raw data ingestion.

## Notebook Structure Overview

- **Step 1:** Import PySpark libraries.
- **Step 2:** Define key variables.
- **Step 3:** Create and use catalog.
- **Step 4:** Create schemas (bronze, silver, gold).
- **Step 5:** Create raw data volume.
- **Step 6:** Set up games data directory.
- **Step 7:** Define games table structure.

### Step 1: Import Required Libraries

This cell imports the PySpark SQL functions and types modules under the aliases `F` and `T`. They provide the column functions and data types used for data manipulation and transformation.

In [0]:
import pyspark.sql.functions as F
import pyspark.sql.types as T

### Step 2: Define Key Variables

This cell sets up variables for catalog, schemas, tables, and directories. These variables help organize where data is stored and how it is referenced in later steps.

In [0]:
CATALOG = 'aa_mlb_tech_summit'
LANDING_ZONE = 'raw'
BRONZE_SCHEMA = 'bronze'
SILVER_SCHEMA = 'silver'
GOLD_SCHEMA = 'gold'
# SEMANTIC_SCHEMA = 'semantic'


RAW_VOL = f'/Volumes/{CATALOG}/{BRONZE_SCHEMA}/{LANDING_ZONE}'
BRONZE_GAMES_TABLE = 'all_games'
GAMES_DIRECTORY = f'{RAW_VOL}/all_games'


SILVER_GAMES_TABLE = 'all_games_clean'
SILVER_GAMES_TABLE_ENRICHED = 'all_games_enriched'
SILVER_PROMOTIONS_VIEW = 'promotions_exploded'

GOLD_ATTENDANCE_BASELINE = 'attendance_baseline'
GOLD_ATTENDANCE_BY_PROMO = 'attendance_by_promo'
GOLD_ATTENDANCE_BY_TEAM_AND_PROMO_TYPE = 'attendance_by_team_promo_type'

# SEMANTIC_ATTENDANCE_METRICS = "attendance_metrics"
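
The variables above hold bare schema and table names, so any statement that uses them depends on the active catalog and schema. As a minimal sketch, a helper can join them into a three-level `catalog.schema.object` name. The helper `qualified_name` and the variable `bronze_games_fqn` are hypothetical and not part of the original notebook.

```python
def qualified_name(catalog: str, schema: str, obj: str) -> str:
    """Join three identifier parts into a backtick-quoted three-level name."""
    parts = (catalog, schema, obj)
    for part in parts:
        # Reject empty parts and embedded backticks instead of trying to escape them.
        if not part or "`" in part:
            raise ValueError(f"Invalid identifier part: {part!r}")
    return ".".join(f"`{part}`" for part in parts)


# Values copied from the variable definitions in this step.
CATALOG = 'aa_mlb_tech_summit'
BRONZE_SCHEMA = 'bronze'
BRONZE_GAMES_TABLE = 'all_games'

bronze_games_fqn = qualified_name(CATALOG, BRONZE_SCHEMA, BRONZE_GAMES_TABLE)
print(bronze_games_fqn)
```

A fully qualified name like this can be interpolated into a SQL statement without relying on `USE CATALOG` or `USE SCHEMA` having run first.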

### Step 3: Create and Use Catalog

This cell creates a catalog, which is a top-level container for organizing all MLB Tech Summit data assets. It then switches the active context to this catalog.

In [0]:
spark.sql(f"""
          CREATE CATALOG IF NOT EXISTS {CATALOG}
          COMMENT 'Catalog for storing and processing all MLB Tech Summit Training data'
          """)
spark.sql(f"""
          USE CATALOG {CATALOG}
          """)

### Step 4: Create Schemas

This cell creates three schemas: bronze (raw data), silver (clean data), and gold (aggregated data). Schemas separate data by processing stage. The final statement sets bronze as the active schema, so later unqualified names resolve there.

In [0]:
spark.sql(f"""
          CREATE SCHEMA IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}
          COMMENT 'Bronze database landing all Tech Summit training data';
          """)

spark.sql(f"""
          CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SILVER_SCHEMA}
          COMMENT 'Silver database for storing clean data for analysis, with quality constraints met';
          """)

spark.sql(f"""
          CREATE SCHEMA IF NOT EXISTS {CATALOG}.{GOLD_SCHEMA}
          COMMENT 'Gold database for serving aggregated data, KPIs to consumption layer';
          """)
spark.sql(f"""
          USE SCHEMA {BRONZE_SCHEMA};
          """)

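The three `CREATE SCHEMA` statements differ only in schema name and comment, so they could also be generated from a mapping. The following is a sketch under that assumption. `SCHEMA_COMMENTS` and `build_schema_ddl` are hypothetical names, and the statements are only printed here; in the notebook each one would be passed to `spark.sql`.

```python
CATALOG = 'aa_mlb_tech_summit'

# Schema name -> comment, copied from the statements in this step.
SCHEMA_COMMENTS = {
    'bronze': 'Bronze database landing all Tech Summit training data',
    'silver': 'Silver database for storing clean data for analysis, with quality constraints met',
    'gold': 'Gold database for serving aggregated data, KPIs to consumption layer',
}


def build_schema_ddl(catalog: str, schema_comments: dict) -> list:
    """Return one CREATE SCHEMA statement per entry, in insertion order."""
    return [
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema} COMMENT '{comment}'"
        for schema, comment in schema_comments.items()
    ]


statements = build_schema_ddl(CATALOG, SCHEMA_COMMENTS)
for statement in statements:
    # In the notebook this line would be: spark.sql(statement)
    print(statement)
```
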
### Step 5: Create Raw Data Volume

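This cell creates a volume, a storage location for files, in the bronze schema. Because the volume name is unqualified, it resolves against the active catalog and schema set in Steps 3 and 4. Volumes are used to store raw data files before processing.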

In [0]:
spark.sql(f"""
          CREATE VOLUME IF NOT EXISTS {LANDING_ZONE}
          COMMENT 'Raw data volume for bronze schema';
          """)

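Files in a volume are addressed with paths of the form `/Volumes/<catalog>/<schema>/<volume>/...`, which is the pattern `RAW_VOL` follows in Step 2. As a small sketch, the same paths can be assembled with `pathlib` instead of f-strings. The helper `volume_path` is a hypothetical name introduced for illustration.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *subdirs: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path as a string."""
    return str(PurePosixPath('/Volumes', catalog, schema, volume, *subdirs))


# Values copied from Step 2.
RAW_VOL = volume_path('aa_mlb_tech_summit', 'bronze', 'raw')
GAMES_DIRECTORY = volume_path('aa_mlb_tech_summit', 'bronze', 'raw', 'all_games')

print(RAW_VOL)
print(GAMES_DIRECTORY)
```
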
### Step 6: Create Directory for Raw Games Data

This cell creates a directory in the raw volume for storing all games data files. Directories help organize files for easy access and management.

In [0]:
dbutils.fs.mkdirs(GAMES_DIRECTORY)
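
Directory creation of this kind is idempotent: running it again on an existing directory changes nothing. The sketch below illustrates that behavior with the standard library, using a temporary directory as a stand-in for the raw volume so that it runs anywhere. The helper `ensure_directory` is hypothetical, and whether `os.makedirs` can be used directly on a `/Volumes/...` path in a given workspace is an assumption to verify.

```python
import os
import tempfile


def ensure_directory(path: str) -> bool:
    """Create path (and any missing parents); return True if it is a directory afterwards."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


# A temporary directory stands in for the raw volume in this sketch.
with tempfile.TemporaryDirectory() as fake_volume:
    games_dir = os.path.join(fake_volume, 'all_games')
    first = ensure_directory(games_dir)   # creates the directory
    second = ensure_directory(games_dir)  # already exists, so nothing changes

print(first, second)
```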

### Step 7: Define Main Games Table Structure

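This cell creates the main games table as a Delta table, defining all relevant columns with per-column comments and a table-level description. The table name is unqualified, so it is created in the active bronze schema. The table is used for comprehensive analysis of MLB games, attendance, promotions, and uniforms.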

In [0]:
spark.sql(
    f"""
    CREATE TABLE IF NOT EXISTS {BRONZE_GAMES_TABLE} (
      gamePk INT COMMENT 'Unique game identifier assigned by the MLB Stats API.',
      date DATE COMMENT 'Official calendar date on which the game was played.',
      gameDate TIMESTAMP COMMENT 'Timestamp of when the game was played.',
      season INT COMMENT 'MLB season year (e.g., 2024).',
      gameType STRING COMMENT 'Type of game: Regular (R), Postseason (P), Spring Training (S), or Exhibition (E).',
      status STRING COMMENT 'Game status description.',
      statusCode STRING COMMENT 'Game status code.',
      home_team_id INT COMMENT 'Home team identifier.',
      home_team_name STRING COMMENT 'Full name of the home team.',
      home_score DOUBLE COMMENT 'Total number of runs scored by the home team.',
      home_wins INT COMMENT 'Number of wins for the home team before the game.',
      home_losses INT COMMENT 'Number of losses for the home team before the game.',
      home_pct DOUBLE COMMENT 'Winning percentage of the home team before the game.',
      home_is_winner BOOLEAN COMMENT 'Indicates whether the home team won the game.',
      away_team_id INT COMMENT 'Away team identifier.',
      away_team_name STRING COMMENT 'Full name of the visiting (away) team.',
      away_score DOUBLE COMMENT 'Total number of runs scored by the away team.',
      away_wins INT COMMENT 'Number of wins for the away team before the game.',
      away_losses INT COMMENT 'Number of losses for the away team before the game.',
      away_pct DOUBLE COMMENT 'Winning percentage of the away team before the game.',
      away_is_winner BOOLEAN COMMENT 'Indicates whether the away team won the game.',
      venue_id INT COMMENT 'Venue identifier.',
      venue_name STRING COMMENT 'Name of the ballpark or stadium where the game took place.',
      doubleHeader STRING COMMENT 'Marks whether the game was part of a doubleheader (Y or N).',
      gameNumber INT COMMENT 'Identifies which game in a doubleheader this record represents.',
      dayNight STRING COMMENT 'Indicates whether the game was played during the day or at night.',
      description STRING COMMENT 'Text description of the game.',
      scheduledInnings INT COMMENT 'Number of scheduled innings for the game.',
      seriesGameNumber INT COMMENT 'Sequence number of this game within a multi-game series.',
      gamesInSeries INT COMMENT 'Total number of games scheduled in the series.',
      attendance DOUBLE COMMENT 'Official attendance count for the game.',
      attendance_high_for_date DOUBLE COMMENT 'Highest attendance for games on this date.',
      attendance_low_for_date DOUBLE COMMENT 'Lowest attendance for games on this date.',
      is_doubleheader_date BOOLEAN COMMENT 'Indicates if multiple games were played between the same teams on this date.',
      home_jersey STRING COMMENT 'Description of the home team’s jersey style or color.',
      home_pants STRING COMMENT 'Description of the home team’s pants style or color.',
      home_cap STRING COMMENT 'Description of the home team’s cap worn during the game.',
      home_jersey_code STRING COMMENT 'Internal or vendor code representing the home team’s jersey design.',
      away_jersey STRING COMMENT 'Description of the away team’s jersey style or color.',
      away_pants STRING COMMENT 'Description of the away team’s pants style or color.',
      away_cap STRING COMMENT 'Description of the away team’s cap worn during the game.',
      away_jersey_code STRING COMMENT 'Internal or vendor code representing the away team’s jersey design.',
      offer_names ARRAY<STRING> COMMENT 'Array of promotional offer names tied to the game.',
      promotion_types_array ARRAY<STRING> COMMENT 'Array of promotion categories.',
      offer_types_array ARRAY<STRING> COMMENT 'Array of specific promotional type labels.',
      num_promotions BIGINT COMMENT 'Total number of unique promotions linked to the game.',
      has_promotion BOOLEAN COMMENT 'Boolean flag indicating whether the game featured any promotion or giveaway.'
    )
    USING DELTA
    COMMENT 'Comprehensive game-level dataset integrating on-field results, attendance metrics, promotional details, and uniform metadata for Major League Baseball (MLB) games from the 2024 and 2025 seasons. Each record represents a single scheduled or completed MLB game, linking competitive outcomes (scores, winners, series context) with fan-engagement attributes such as giveaways, theme nights, and attendance. The dataset also includes uniform configurations for both teams and supports analysis across seasons, venues, and promotional effectiveness.'
    """
)
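
With this many columns, one alternative is to keep the column definitions as data and generate the statement from them. The sketch below does that for three of the columns above. `COLUMNS`, `column_ddl`, and `create_table_ddl` are hypothetical names, and the helper rejects comments containing a straight single quote rather than attempting to escape them.

```python
# (name, type, comment) tuples copied from the first columns of the table definition.
COLUMNS = [
    ('gamePk', 'INT', 'Unique game identifier assigned by the MLB Stats API.'),
    ('date', 'DATE', 'Official calendar date on which the game was played.'),
    ('season', 'INT', 'MLB season year (e.g., 2024).'),
]


def column_ddl(columns) -> str:
    """Render column definitions, one per line, each with its COMMENT clause."""
    lines = []
    for name, dtype, comment in columns:
        # A straight single quote would end the SQL string literal early.
        if "'" in comment:
            raise ValueError(f"Comment for {name} contains a single quote")
        lines.append(f"{name} {dtype} COMMENT '{comment}'")
    return ",\n  ".join(lines)


def create_table_ddl(table: str, columns) -> str:
    """Assemble a CREATE TABLE ... USING DELTA statement from column tuples."""
    return f"CREATE TABLE IF NOT EXISTS {table} (\n  {column_ddl(columns)}\n)\nUSING DELTA"


ddl = create_table_ddl('all_games', COLUMNS)
print(ddl)  # in the notebook this string would be passed to spark.sql
```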