# 01_Environment_Setup & Ingestion
This notebook initializes the **Governance Layer** (Unity Catalog) and simulates a data ingestion stream by downloading the **UCI Drug Review Dataset** into a raw Volume.

## Architecture Mapping
* **Layer:** Foundation / Ingestion
* **Governance:** Creates `safety_signal_catalog` and `raw_data` schema.
* **Storage:** Downloads 215k raw reviews to `/Volumes/.../landing_zone`.

## Inputs & Outputs
* **Input:** URL (UCI Machine Learning Repository).
* **Output:** Raw TSV files (`train_data.tsv`, `test_data.tsv`) stored in Unity Catalog.
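
The output locations follow directly from the catalog, schema, and volume names. As a minimal sketch, the snippet below derives both target paths from those three names. The helpers `volume_root` and `expected_outputs` are hypothetical names introduced for illustration and are not part of the notebook.

```python
from pathlib import PurePosixPath


def volume_root(catalog: str, schema: str, volume: str) -> str:
    """Build the /Volumes/<catalog>/<schema>/<volume> path for a Unity Catalog Volume."""
    return str(PurePosixPath("/Volumes") / catalog / schema / volume)


def expected_outputs(catalog: str, schema: str, volume: str) -> list:
    """Return the two TSV paths this notebook is expected to produce (hypothetical helper)."""
    root = volume_root(catalog, schema, volume)
    return [f"{root}/train_data.tsv", f"{root}/test_data.tsv"]


# Names taken from the configuration used in this notebook.
outputs = expected_outputs("safety_signal_catalog", "raw_data", "landing_zone")
for path in outputs:
    print(path)
```

Deriving the paths in one place keeps later notebooks from hard-coding the Volume location.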

#### 1. CONFIGURATION

In [0]:
# Define global variables to ensure consistency across the pipeline.
# Using a dedicated catalog avoids name conflicts in shared workspaces.
CATALOG_NAME = "safety_signal_catalog"
SCHEMA_NAME  = "raw_data"
VOLUME_NAME  = "landing_zone"

#### 2. GOVERNANCE SETUP (Unity Catalog)

In [0]:
# Using 'IF NOT EXISTS' to make this notebook safe to re-run.
print(f"Setting up Governance Layer: {CATALOG_NAME}.{SCHEMA_NAME}...")

# Create Catalog
spark.sql(f"CREATE CATALOG IF NOT EXISTS {CATALOG_NAME}")
spark.sql(f"USE CATALOG {CATALOG_NAME}")

# Create Schema (The database)
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {SCHEMA_NAME}")

# Create Volume
# A Volume is Unity Catalog's governed location for files rather than tables (e.g. ZIP, CSV, TSV).
spark.sql(f"CREATE VOLUME IF NOT EXISTS {SCHEMA_NAME}.{VOLUME_NAME}")

print(f"Governance Ready: /Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/{VOLUME_NAME}")

#### 3. DOWNLOAD DATA
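
The shell cell in this step could also be written in pure Python, which may help where `%sh` is unavailable. The following is only a sketch using the standard library. The helper names `download` and `extract_tsv_members` are hypothetical, and extracting only `.tsv` members is an assumption about what the archive should contribute.

```python
import urllib.request
import zipfile
from pathlib import Path


def download(url: str, dest: str) -> Path:
    """Download url to dest on the local filesystem and return the path."""
    dest_path = Path(dest)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest_path))
    return dest_path


def extract_tsv_members(zip_path: str, out_dir: str) -> list:
    """Extract only the .tsv members of the archive and return their names, sorted."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        members = [name for name in zf.namelist() if name.endswith(".tsv")]
        zf.extractall(out, members=members)
    return sorted(members)


# Example usage with the URL and paths from the shell cell (not executed here):
# zip_file = download(
#     "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip",
#     "/tmp/drugs_data.zip",
# )
# print(extract_tsv_members(str(zip_file), "/tmp/drugs_data"))
```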

In [0]:
%sh
# Download the Zip file to the driver's temporary folder
wget -q https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip -O /tmp/drugs_data.zip

# Unzip the file
unzip -o -q /tmp/drugs_data.zip -d /tmp/drugs_data/

# List the files to confirm success
ls -lh /tmp/drugs_data/

In [0]:
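# Confirm the download: both extracted TSV files must exist before reporting success.
import os
assert all(os.path.exists(f"/tmp/drugs_data/{name}") for name in ("drugsComTrain_raw.tsv", "drugsComTest_raw.tsv")), "Download or unzip failed."
print("Successfully downloaded dataset from UCI Machine Learning Repository!")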

#### 4. SAVE DATASET TO VOLUME

#### Technical Decision: Using `shutil` vs `dbutils`
* **Context:** I initially attempted to use `dbutils.fs.cp` to move files from the driver's `/tmp` directory to the Volume.
* **The Challenge:** On Databricks **Shared Clusters** (which use Unity Catalog security), accessing the local driver filesystem via `dbutils` is blocked for process isolation (Security Exception).
* **The Solution:** I switched to Python's native `shutil` library. Since Unity Catalog Volumes are exposed to the OS as standard FUSE paths, `shutil` can write to them like any local directory. The same copy step therefore runs on both **Shared** and **Single User** clusters, which keeps the pipeline reproducible across cluster modes.
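
As a minimal sketch of this approach, the copy can be paired with a size check so that a truncated write is caught immediately. The helper `copy_and_verify` is a hypothetical name, and comparing byte sizes is an assumption about what counts as sufficient verification; a checksum would be stricter.

```python
import os
import shutil


def copy_and_verify(src: str, dst: str) -> int:
    """Copy src to dst with shutil and confirm both files have the same byte size.

    Returns the size in bytes of the copied file (hypothetical helper).
    """
    if not os.path.isfile(src):
        raise FileNotFoundError(f"Source file not found: {src}")

    # The destination can be a Volume path because it behaves like a local directory.
    shutil.copy2(src, dst)

    src_size = os.path.getsize(src)
    dst_size = os.path.getsize(dst)
    if src_size != dst_size:
        raise IOError(f"Size mismatch after copy: {src_size} != {dst_size} bytes")
    return dst_size


# Example usage with the paths from this notebook (not executed here):
# copy_and_verify(
#     "/tmp/drugs_data/drugsComTrain_raw.tsv",
#     "/Volumes/safety_signal_catalog/raw_data/landing_zone/train_data.tsv",
# )
```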

In [0]:
# Original dbutils approach, kept commented out for reference (blocked on Shared Clusters).
# Copy files from the temporary driver storage to our permanent Unity Catalog Volume
#source_path = "file:/tmp/drugs_data"
#volume_path = f"/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/{VOLUME_NAME}"

# Move the Train and Test files
#dbutils.fs.cp(f"{source_path}/drugsComTrain_raw.tsv", f"{volume_path}/train_data.tsv")
#dbutils.fs.cp(f"{source_path}/drugsComTest_raw.tsv",  f"{volume_path}/test_data.tsv")

#print("Data is successfully stored in the Lakehouse Volume.")
#display(dbutils.fs.ls(volume_path))

In [0]:
import shutil
import os

# Define Paths
# Note: In Unity Catalog, Volumes are mounted as standard file paths.
source_path = "/tmp/drugs_data"
volume_path = f"/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/{VOLUME_NAME}"

print(f"Copying data from '{source_path}' to '{volume_path}'...")

# Copy Files
try:
    # Copy Train Data
    shutil.copy2(f"{source_path}/drugsComTrain_raw.tsv", f"{volume_path}/train_data.tsv")
    print(f"Training data copied to {volume_path}/train_data.tsv")

    # Copy Test Data
    shutil.copy2(f"{source_path}/drugsComTest_raw.tsv",  f"{volume_path}/test_data.tsv")
    print(f"Testing data copied to {volume_path}/test_data.tsv")

except Exception as e:
    # Report the failure, then re-raise so the notebook run stops instead of continuing silently.
    print(f"Error during copy: {e}")
    raise

# Verify Final State
print("\nFinal Volume Contents:")
files = dbutils.fs.ls(volume_path)
for f in files:
    print(f"{f.name} ({f.size / 1024 / 1024:.2f} MB)")