# **01 – Exploring UK Flood Data (Exploratory Analysis)**
#
# This notebook performs a **non-persistent exploration** of the UK Environment Agency Flood API.  
# It fetches a small live sample (≤100 records), inspects the JSON structure,
# and performs light profiling to understand schema, value ranges, and potential data quality issues.
#
# **Context**
# - Data source: [Environment Agency Flood Monitoring API](https://environment.data.gov.uk/flood-monitoring/id/floods)
# - Goal: Understand structure and variability before designing the Bronze schema
# - Output: Insights only – *no writes to storage or Unity Catalog yet*


# **1. Environment Setup**
# Connect via Databricks Connect (Databricks Runtime 13.3 LTS) or a local PySpark session.


In [None]:

import json

import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()
print(f"Spark session active: {spark.version}")



# **2. Fetch Live Data**
# Limit to 100 records for quick iteration.


In [None]:

API_URL = "https://environment.data.gov.uk/flood-monitoring/id/floods"

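def fetch_flood_data(limit=100):
    """Fetch current flood alerts and return at most `limit` records."""
    try:
        r = requests.get(API_URL, timeout=15)
        r.raise_for_status()  # surface HTTP 4xx/5xx responses as exceptions
        data = r.json()
        # Slicing happens client-side, after the full response has been downloaded.
        items = data.get("items", [])[:limit]
        print(f"Fetched {len(items)} records.")
        return items
    except (requests.RequestException, ValueError) as e:
        # Network/HTTP failures, or a response body that is not valid JSON.
        print(f"Error fetching data: {e}")
        return []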

records = fetch_flood_data(100)
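
# A minimal sketch of keeping the payload handling separate from the network call,
# so the slicing logic can be checked offline. `extract_items` is a hypothetical helper
# (it is not part of this notebook), and the sample payload is made up for illustration.

```python
def extract_items(payload, limit=100):
    """Return at most `limit` alert records from a decoded API payload."""
    # The fetch step in this notebook reads alerts from the top-level "items" key.
    items = payload.get("items", []) if isinstance(payload, dict) else []
    if not isinstance(items, list):
        # Treat an unexpected shape as "no records" rather than failing.
        return []
    return items[:limit]


# Made-up payload, used only to exercise the helper.
sample_payload = {"items": [{"floodAreaID": f"AREA{i}"} for i in range(150)]}
sample_items = extract_items(sample_payload, limit=100)
print(len(sample_items))
```

# Separating the two concerns means a malformed payload can be tested without calling the live API.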

# **3. Inspect Raw Structure**
# Examine a single record to understand keys and nesting.
# 
# **Add commentary after viewing:** note any nested objects (e.g. `floodArea`), timestamps, or missing values.


In [None]:
if records:
    print(json.dumps(records[0], indent=2))
else:
    print("No records returned.")
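
# Reading one pretty-printed record by eye works, but a flat list of key paths is easier to
# compare across records. The sketch below assumes records are plain nested dicts;
# `describe_keys` is a hypothetical helper and the example record uses made-up values.

```python
def describe_keys(record, prefix=""):
    """Yield (dotted_path, type_name) pairs for every leaf value in a nested dict."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects such as floodArea.
            yield from describe_keys(value, path)
        else:
            yield path, type(value).__name__


# Made-up record using field names that appear in this notebook.
example_record = {
    "floodAreaID": "EXAMPLE01",
    "severityLevel": 3,
    "timeRaised": "2024-01-05T09:30:00",
    "floodArea": {"notation": "EXAMPLE01", "polygon": None},
}
key_summary = dict(describe_keys(example_record))
for path, type_name in key_summary.items():
    print(f"{path}: {type_name}")
```

# With live data, `dict(describe_keys(records[0]))` would give the same kind of summary for a real alert.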


# **4. Convert to DataFrame**
# 
# In this step, we let Spark **infer the schema** from the raw JSON data returned by the API.  
# Once inferred, we capture it for reference and future reuse in ingestion pipelines.  
# This approach avoids hardcoding field names until the structure is confirmed.

In [None]:
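# Infer schema automatically from the API records.
# spark.read.json expects JSON strings, so serialise each dict first; raw dicts would be
# converted with str(), producing Python literals (False, None) that are not valid JSON.
# Note: spark.sparkContext may be unavailable on Spark Connect based sessions such as Databricks Connect.
json_rdd = spark.sparkContext.parallelize([json.dumps(r) for r in records])
df_inferred = spark.read.json(json_rdd)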

# Display inferred schema in tree format
print("=== Inferred Schema ===")
df_inferred.printSchema()

# Optionally, store schema for reproducibility (e.g., to use later in Bronze ingestion)
schema_json = df_inferred.schema.json()
# with open("schema/flood_alerts_schema.json", "w") as f:
#    f.write(schema_json)

# Recreate a DataFrame using the captured schema (enforces consistent structure)
schema = StructType.fromJson(json.loads(schema_json))
df_raw = spark.createDataFrame(records, schema=schema)

# Display counts and sample rows
print(f"Rows: {df_raw.count()}, Columns: {len(df_raw.columns)}")
df_raw.show(5, truncate=False)

# **5. Flatten Structure**
# Extract nested fields for easier profiling.
# 
# **Add commentary later:** describe which attributes look most stable across records.


In [None]:

df_flat = (
    df_raw
    .withColumn("flood_area_label", col("floodArea.label"))
    .withColumn("flood_area_notation", col("floodArea.notation"))
    .withColumn("polygon", col("floodArea.polygon"))
    .withColumn("ingest_time", current_timestamp())
    .drop("floodArea")
)

df_flat.show(5, truncate=False)
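
# The flattening step assumes every record carries the nested `floodArea` keys it selects.
# A rough way to check that assumption on the raw records, before relying on the columns,
# is sketched below. `nested_key_coverage` is a hypothetical helper and the example records
# are made up.

```python
def nested_key_coverage(records, parent, keys):
    """Count, per key, how many records hold a non-null value under `parent`."""
    coverage = {key: 0 for key in keys}
    for record in records:
        nested = record.get(parent)
        if not isinstance(nested, dict):
            # Parent missing or null: none of the nested keys are present.
            continue
        for key in keys:
            if nested.get(key) is not None:
                coverage[key] += 1
    return coverage


# Made-up records that deliberately differ in which nested keys they carry.
example_records = [
    {"floodArea": {"notation": "A1", "polygon": "p1"}},
    {"floodArea": {"label": "Area two", "notation": "A2"}},
    {"floodArea": None},
]
coverage = nested_key_coverage(example_records, "floodArea", ["label", "notation", "polygon"])
print(coverage)
```

# A key whose count is well below `len(records)` is a candidate for a nullable column rather than a required one.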


# **6. Basic Profiling**
# Quick statistics to understand coverage and possible nulls.
# 
# These will inform what constraints or expectations to enforce later in Bronze.


In [None]:
df_flat.selectExpr(
    "count(*) as total_records",
    "count(distinct floodAreaID) as unique_area_ids",
    "count(distinct eaAreaName) as unique_ea_areas",
    "count(distinct severity) as unique_severity_levels",
    "min(timeRaised) as earliest_alert",
    "max(timeRaised) as latest_alert"
).show(truncate=False)
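
# For a sample this small, the Spark aggregates can be cross-checked against the raw records
# in plain Python. This is a sketch under two assumptions: `timeRaised` values are strings in
# one consistent ISO 8601 format (so string ordering matches time ordering), and nulls are
# excluded from distinct counts, as they are in SQL. `profile_records` is a hypothetical helper.

```python
def profile_records(records):
    """Compute a few of the same summary figures directly from the raw dicts."""
    raised = [r["timeRaised"] for r in records if r.get("timeRaised")]
    area_ids = {r["floodAreaID"] for r in records if r.get("floodAreaID") is not None}
    return {
        "total_records": len(records),
        "unique_area_ids": len(area_ids),
        "missing_time_raised": len(records) - len(raised),
        "earliest_alert": min(raised, default=None),
        "latest_alert": max(raised, default=None),
    }


# Made-up alerts, used only to exercise the helper.
example_alerts = [
    {"floodAreaID": "A1", "timeRaised": "2024-01-05T09:30:00"},
    {"floodAreaID": "A1", "timeRaised": "2024-01-04T18:00:00"},
    {"floodAreaID": "A2", "timeRaised": None},
]
profile = profile_records(example_alerts)
print(profile)
```

# A mismatch between these figures and the Spark output would point at a problem in schema inference or flattening.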


# **7. Severity Distribution**
# How are severity levels distributed?  
# (Levels range from 1 to 4: 1 = Severe Flood Warning, 2 = Flood Warning, 3 = Flood Alert, 4 = Warning no longer in force.)


In [None]:

(
    df_flat.groupBy("severityLevel", "severity")
    .count()
    .orderBy("severityLevel")
    .show(truncate=False)
)
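
# The distribution can also be tallied from the raw records, which makes it easy to flag
# levels outside the expected 1–4 range. The label mapping below follows the Environment
# Agency's published severity levels; `severity_counts` and `unexpected_levels` are
# hypothetical helpers and the example alerts are made up.

```python
from collections import Counter

# Severity levels and their meanings.
SEVERITY_LABELS = {
    1: "Severe Flood Warning",
    2: "Flood Warning",
    3: "Flood Alert",
    4: "Warning no longer in force",
}


def severity_counts(records):
    """Return a count for every known level, including levels with zero alerts."""
    counts = Counter(r.get("severityLevel") for r in records)
    return {level: counts.get(level, 0) for level in SEVERITY_LABELS}


def unexpected_levels(records):
    """Return severityLevel values (including None) that are not in the known set."""
    seen = {r.get("severityLevel") for r in records}
    return sorted(seen - set(SEVERITY_LABELS), key=str)


# Made-up alerts: two valid levels, one missing level, one out-of-range level.
example_alerts = [
    {"severityLevel": 3},
    {"severityLevel": 3},
    {"severityLevel": 2},
    {},
    {"severityLevel": 7},
]
level_counts = severity_counts(example_alerts)
odd_levels = unexpected_levels(example_alerts)
print(level_counts, odd_levels)
```

# Reporting zero counts explicitly helps distinguish "no severe warnings right now" from "field missing".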


# **8. Sample Flood Areas**
# Quick look at geographic diversity of current alerts.


In [None]:
df_flat.select("flood_area_label", "eaAreaName").distinct().show(20, truncate=False)



# **9. Placeholder – Future Bronze Write**
# 
# Once the schema is validated, this block will create the Unity Catalog table:
# 
# ```python
# TARGET_TABLE = "flood_dev.bronze.alerts"
# df_flat.write.format("delta").mode("overwrite").saveAsTable(TARGET_TABLE)
# ```
# 
# For now, **do not execute**; we’re in exploration mode only.



# **10. Reflections and Notes**
# Use this section to capture your observations after running the notebook.
# 
# - Which fields appear reliable enough for Bronze ingestion?  
# - Do timestamps need parsing to `TimestampType`?  
# - Are there categorical fields worth modelling as dimensions later?  
# 
# **Next:** formalise schema → design Bronze expectations → implement first DLT pipeline.
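
# On the timestamp question, one low-cost check is to see whether the raw strings parse as
# ISO 8601 before committing to `TimestampType`. The sketch below uses only the standard
# library; `unparseable_timestamps` is a hypothetical helper, and the example values are made
# up rather than taken from the API.

```python
from datetime import datetime


def unparseable_timestamps(values):
    """Return the values that datetime.fromisoformat cannot parse."""
    bad = []
    for value in values:
        try:
            datetime.fromisoformat(value)
        except (TypeError, ValueError):
            # TypeError covers None and other non-strings; ValueError covers bad formats.
            bad.append(value)
    return bad


# Made-up values: two ISO-style strings, one day-first date, one missing value.
example_values = ["2024-01-05T09:30:00", "2024-01-05 09:30", "05/01/2024", None]
bad_values = unparseable_timestamps(example_values)
print(bad_values)
```

# Running this over `[r.get("timeRaised") for r in records]` would show whether a uniform cast is safe or whether some values need special handling first.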
