# 🥈 Silver Layer — Transformation & Validation Notebook
## Notebook: 02_silver_transformation

**What this notebook does:**
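- Reads raw data from the Bronze Delta table
- Removes duplicate readings and drops pollutants outside the target list
- Validates pollutant readings: nulls, negatives, and implausibly high values are flagged with `is_valid = false`, not deleted
- Assigns AQI categories to PM2.5 readings
- Writes the deduplicated data, including the `is_valid` flag, to the Silver Delta table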

**Run after:** `01_bronze_ingestion`
**Run before:** `03_gold_aggregation`

In [0]:
# ============================================================
# RELOAD CONFIGURATION
# ============================================================

DATABASE_NAME = "air_quality_db"

BRONZE_TABLE     = f"{DATABASE_NAME}.bronze_raw_measurements"
SILVER_TABLE     = f"{DATABASE_NAME}.silver_clean_measurements"
GOLD_TABLE_CITY  = f"{DATABASE_NAME}.gold_city_rankings"
GOLD_TABLE_TREND = f"{DATABASE_NAME}.gold_pollutant_trends"
GOLD_TABLE_AQI   = f"{DATABASE_NAME}.gold_aqi_summary"

TARGET_POLLUTANTS = ["pm25", "pm10", "no2", "o3", "co", "so2"]

AQI_CATEGORIES = {
    "Good":                  (0.0,   12.0),
    "Moderate":              (12.1,  35.4),
    "Unhealthy (Sensitive)": (35.5,  55.4),
    "Unhealthy":             (55.5,  150.4),
    "Very Unhealthy":        (150.5, 250.4),
    "Hazardous":             (250.5, 9999.0)
}

spark.sql(f"USE {DATABASE_NAME}")

print("✅ Configuration reloaded.")

✅ Configuration reloaded.
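The `AQI_CATEGORIES` ranges above are written to one decimal place, so a reading such as 12.05 sits between "Good" (up to 12.0) and "Moderate" (from 12.1). The following is a minimal sketch of one way to look up a category from that dictionary without leaving such gaps, by comparing against upper bounds only. The helper `categorise_pm25` is hypothetical and not part of the notebook, and its treatment of missing or negative readings as "N/A" is an assumption.

```python
# Hypothetical helper (not part of the notebook): look up a PM2.5 category
# from the AQI_CATEGORIES ranges using only each range's upper bound.
AQI_CATEGORIES = {
    "Good":                  (0.0,   12.0),
    "Moderate":              (12.1,  35.4),
    "Unhealthy (Sensitive)": (35.5,  55.4),
    "Unhealthy":             (55.5,  150.4),
    "Very Unhealthy":        (150.5, 250.4),
    "Hazardous":             (250.5, 9999.0),
}


def categorise_pm25(value, categories=AQI_CATEGORIES):
    """Return the category whose upper bound first covers `value`.

    Returns "N/A" for missing or negative readings (an assumption of this
    sketch; the notebook's own rule may differ).
    """
    if value is None or value < 0:
        return "N/A"
    # Sort by upper bound so the first match is the tightest range.
    for name, (_low, high) in sorted(categories.items(), key=lambda kv: kv[1][1]):
        if value <= high:
            return name
    # Anything above the last upper bound is treated as the top category.
    return "Hazardous"


print(categorise_pm25(12.0))   # Good
print(categorise_pm25(12.05))  # Moderate (inside the 12.0 to 12.1 gap)
print(categorise_pm25(-3.0))   # N/A
```

Because only the upper bounds are compared, the lower bounds in the dictionary become documentation rather than logic, which is why the 0.1-wide gaps no longer matter.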


In [0]:
# ============================================================
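# READ FROM BRONZE TABLE
# Excludes rows where city is 'TestCity'. Because NULL != 'TestCity'
# evaluates to NULL, rows with a NULL city are excluded as well.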
# ============================================================

bronze_df = spark.sql(f"""
    SELECT *
    FROM {BRONZE_TABLE}
    WHERE city != 'TestCity'
""")

bronze_count = bronze_df.count()

print(f"✅ Bronze records loaded : {bronze_count}")
print(f"\n📊 Schema:")
bronze_df.printSchema()

✅ Bronze records loaded : 7215

📊 Schema:
root
 |-- city: string (nullable = true)
 |-- country: string (nullable = true)
 |-- pollutant: string (nullable = true)
 |-- value: double (nullable = true)
 |-- unit: string (nullable = true)
 |-- location_name: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- measured_at: timestamp (nullable = true)
 |-- ingested_at: timestamp (nullable = true)
 |-- source_url: string (nullable = true)
 |-- raw_json: string (nullable = true)



In [0]:
# ============================================================
# STEP 1 — REMOVE DUPLICATES
# A duplicate is same city + location + pollutant + timestamp
# ============================================================

from pyspark.sql.functions import col

deduped_df = bronze_df.dropDuplicates([
    "city",
    "location_name",
    "pollutant",
    "measured_at"
])

# Count once and reuse the result: each .count() call triggers its own Spark job
deduped_count = deduped_df.count()
duplicates_removed = bronze_count - deduped_count

print(f"✅ Deduplication complete.")
print(f"   Records before : {bronze_count}")
print(f"   Duplicates removed : {duplicates_removed}")
print(f"   Records after  : {deduped_count}")

✅ Deduplication complete.
   Records before : 7215
   Duplicates removed : 0
   Records after  : 7215
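`dropDuplicates` keeps one row per key but does not guarantee which one. If the intent is to keep the most recently ingested reading for each key, the rule has to be stated explicitly. The sketch below shows that rule in plain Python on dictionaries. The function `keep_latest` and the sample rows are invented for illustration and are not taken from the Bronze table.

```python
from datetime import datetime

# Composite key the notebook uses to define a duplicate.
KEY_COLUMNS = ("city", "location_name", "pollutant", "measured_at")


def keep_latest(rows, key_columns=KEY_COLUMNS, order_column="ingested_at"):
    """Keep, for each key, the row with the greatest `order_column` value."""
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        current = latest.get(key)
        if current is None or row[order_column] > current[order_column]:
            latest[key] = row
    return list(latest.values())


# Invented sample rows (illustration only).
measured = datetime(2024, 1, 1, 12, 0)
rows = [
    {"city": "Delhi", "location_name": "Station A", "pollutant": "pm25",
     "measured_at": measured, "ingested_at": datetime(2024, 1, 1, 13, 0), "value": 250.0},
    {"city": "Delhi", "location_name": "Station A", "pollutant": "pm25",
     "measured_at": measured, "ingested_at": datetime(2024, 1, 1, 14, 0), "value": 252.0},
    {"city": "Delhi", "location_name": "Station A", "pollutant": "no2",
     "measured_at": measured, "ingested_at": datetime(2024, 1, 1, 13, 0), "value": 40.0},
]

deduped = keep_latest(rows)
print(len(deduped))  # 2
```

In Spark the same rule is usually written with `row_number()` over a window partitioned by the key columns and ordered by `ingested_at` descending, keeping the rows numbered 1.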


In [0]:
# ============================================================
# STEP 2 — FILTER TO TARGET POLLUTANTS ONLY
# Drop any readings for pollutants we don't track
# ============================================================

filtered_df = deduped_df.filter(
    col("pollutant").isin(TARGET_POLLUTANTS)
)

# Count once and reuse the result instead of recomputing it for the print below
filtered_count = filtered_df.count()
dropped_pollutants = deduped_df.count() - filtered_count

print(f"✅ Pollutant filter applied.")
print(f"   Records dropped (unknown pollutants) : {dropped_pollutants}")
print(f"   Records remaining : {filtered_count}")

print(f"\n📊 Pollutant breakdown:")
filtered_df.groupBy("pollutant").count().orderBy("count", ascending=False).show()

✅ Pollutant filter applied.
   Records dropped (unknown pollutants) : 0
   Records remaining : 7215

📊 Pollutant breakdown:
+---------+-----+
|pollutant|count|
+---------+-----+
|     pm25| 2152|
|       o3| 1155|
|      no2| 1152|
|     pm10| 1052|
|      so2|  852|
|       co|  852|
+---------+-----+



In [0]:
# ============================================================
# STEP 3 — VALIDATE READINGS
# Flag records as valid/invalid based on business rules:
#   - value must not be NULL
#   - value must be >= 0 (no negative concentrations)
#   - value must be < 10000 (unrealistically high = sensor error)
#   - measured_at must not be NULL
# ============================================================

from pyspark.sql.functions import when

validated_df = filtered_df.withColumn(
    "is_valid",
    when(
        col("value").isNull() |
        (col("value") < 0) |
        (col("value") >= 10000) |
        col("measured_at").isNull(),
        False
    ).otherwise(True)
)

valid_count   = validated_df.filter(col("is_valid") == True).count()
invalid_count = validated_df.filter(col("is_valid") == False).count()

print(f"✅ Validation complete.")
print(f"   Valid records   : {valid_count}")
print(f"   Invalid records : {invalid_count}")

print(f"\n📊 Invalid record breakdown:")
validated_df.filter(col("is_valid") == False) \
    .groupBy("pollutant") \
    .count() \
    .orderBy("count", ascending=False) \
    .show()

✅ Validation complete.
   Valid records   : 7033
   Invalid records : 182

📊 Invalid record breakdown:
+---------+-----+
|pollutant|count|
+---------+-----+
|     pm25|  102|
|       co|   70|
|      no2|    9|
|       o3|    1|
+---------+-----+
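The breakdown above groups invalid rows by pollutant but does not say which rule each row failed. A minimal sketch of how the Step 3 rules could be turned into a reason label follows, checking them in the same order as the validation expression. The function `invalid_reason` and its label strings are hypothetical; the 10000 limit comes from Step 3.

```python
from datetime import datetime

MAX_VALUE = 10000  # upper limit from Step 3; readings at or above it are rejected


def invalid_reason(value, measured_at):
    """Return the name of the first Step 3 rule a reading fails, or None if valid."""
    if value is None:
        return "null_value"
    if value < 0:
        return "negative_value"
    if value >= MAX_VALUE:
        return "value_too_high"
    if measured_at is None:
        return "null_timestamp"
    return None


ts = datetime(2024, 1, 1)  # invented timestamp for the example
print(invalid_reason(12.5, ts))     # None
print(invalid_reason(-1.0, ts))     # negative_value
print(invalid_reason(10000.0, ts))  # value_too_high
print(invalid_reason(5.0, None))    # null_timestamp
```

Grouping by such a label as well as by pollutant would show, for example, whether the invalid PM2.5 rows are negative readings or sensor spikes.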



In [0]:
# ============================================================
# STEP 4 — ASSIGN AQI CATEGORIES
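# Based on the US EPA PM2.5 breakpoints.
# Only PM2.5 readings get a meaningful AQI category.
# All other pollutants get 'N/A'.
# Note: rows flagged invalid in Step 3 are still categorised here
# (a negative PM2.5 value falls through to 'Moderate'), so
# downstream queries should filter on is_valid.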
# ============================================================

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def assign_aqi_category(pollutant, value):
    """
    Assigns AQI category based on PM2.5 value.
    Returns 'N/A' for non-PM2.5 pollutants.
    """
    if pollutant != "pm25" or value is None:
        return "N/A"

    if 0.0 <= value <= 12.0:
        return "Good"
    elif value <= 35.4:
        return "Moderate"
    elif value <= 55.4:
        return "Unhealthy (Sensitive)"
    elif value <= 150.4:
        return "Unhealthy"
    elif value <= 250.4:
        return "Very Unhealthy"
    else:
        return "Hazardous"

# Register as Spark UDF
aqi_udf = udf(assign_aqi_category, StringType())

categorised_df = validated_df.withColumn(
    "aqi_category",
    aqi_udf(col("pollutant"), col("value"))
)

print(f"✅ AQI categories assigned.")
print(f"\n📊 AQI Category distribution for PM2.5:")
categorised_df.filter(col("pollutant") == "pm25") \
    .groupBy("aqi_category") \
    .count() \
    .orderBy("count", ascending=False) \
    .show()

✅ AQI categories assigned.

📊 AQI Category distribution for PM2.5:
+--------------------+-----+
|        aqi_category|count|
+--------------------+-----+
|            Moderate|  777|
|                Good|  713|
|           Unhealthy|  318|
|Unhealthy (Sensit...|  233|
|           Hazardous|   81|
|      Very Unhealthy|   30|
+--------------------+-----+
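A Python UDF moves every row between the JVM and a Python worker, which Spark's built-in expressions avoid. As a sketch of an alternative, the thresholds from the UDF can be rendered into a SQL `CASE` expression. The helper `build_aqi_case` is hypothetical, and mapping negative readings to 'N/A' is an assumption of this sketch that differs from the UDF above.

```python
# Upper bounds of each PM2.5 category, copied from the UDF above.
PM25_UPPER_BOUNDS = [
    ("Good", 12.0),
    ("Moderate", 35.4),
    ("Unhealthy (Sensitive)", 55.4),
    ("Unhealthy", 150.4),
    ("Very Unhealthy", 250.4),
]


def build_aqi_case(pollutant_col="pollutant", value_col="value"):
    """Build a SQL CASE expression that mirrors the UDF's thresholds.

    Unlike the UDF, negative readings map to 'N/A' (an assumption of this sketch).
    """
    not_applicable = (
        f"{pollutant_col} IS NULL OR {pollutant_col} <> 'pm25' "
        f"OR {value_col} IS NULL OR {value_col} < 0"
    )
    lines = ["CASE", f"  WHEN {not_applicable} THEN 'N/A'"]
    # Branches are evaluated in order, so each upper bound only needs one comparison.
    for name, upper in PM25_UPPER_BOUNDS:
        lines.append(f"  WHEN {value_col} <= {upper} THEN '{name}'")
    lines.append("  ELSE 'Hazardous'")
    lines.append("END")
    return "\n".join(lines)


case_sql = build_aqi_case()
print(case_sql)
```

The resulting string could be passed to `pyspark.sql.functions.expr` in a `withColumn` call in place of `aqi_udf`. This has not been run against the notebook's data, and the category counts would differ from those shown above wherever invalid PM2.5 rows are negative.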



In [0]:
# ============================================================
# STEP 5 — SELECT FINAL SILVER COLUMNS
# Drop raw_json and source_url — not needed beyond Bronze
# Keep only what Silver consumers need
# ============================================================

from pyspark.sql.functions import col

silver_df = categorised_df.select(
    col("city"),
    col("country"),
    col("pollutant"),
    col("value"),
    col("unit"),
    col("location_name"),
    col("latitude"),
    col("longitude"),
    col("measured_at"),
    col("ingested_at"),
    col("aqi_category"),
    col("is_valid")
)

print(f"✅ Silver DataFrame ready.")
print(f"   Total records   : {silver_df.count()}")
print(f"   Valid records   : {silver_df.filter(col('is_valid') == True).count()}")
print(f"   Invalid records : {silver_df.filter(col('is_valid') == False).count()}")
print(f"\n📊 Final Silver Schema:")
silver_df.printSchema()

✅ Silver DataFrame ready.
   Total records   : 7215
   Valid records   : 7033
   Invalid records : 182

📊 Final Silver Schema:
root
 |-- city: string (nullable = true)
 |-- country: string (nullable = true)
 |-- pollutant: string (nullable = true)
 |-- value: double (nullable = true)
 |-- unit: string (nullable = true)
 |-- location_name: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- measured_at: timestamp (nullable = true)
 |-- ingested_at: timestamp (nullable = true)
 |-- aqi_category: string (nullable = true)
 |-- is_valid: boolean (nullable = false)



In [0]:
# ============================================================
# WRITE TO SILVER DELTA TABLE
# Overwrite ensures Silver always reflects latest clean state
# ============================================================

(
    silver_df
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(SILVER_TABLE)
)

# Confirm
silver_count = spark.sql(f"SELECT COUNT(*) as cnt FROM {SILVER_TABLE}").collect()[0]["cnt"]

print(f"✅ Silver table written successfully.")
print(f"   Total records in Silver : {silver_count}")

✅ Silver table written successfully.
   Total records in Silver : 7215


In [0]:
# ============================================================
# SILVER VALIDATION REPORT
# ============================================================

print("=" * 55)
print("  🥈 SILVER TRANSFORMATION — SUMMARY REPORT")
print("=" * 55)
print(f"  Bronze records ingested   : {bronze_count}")
print(f"  Duplicates removed        : {duplicates_removed}")
print(f"  Unknown pollutants dropped: {dropped_pollutants}")
print(f"  Valid records             : {valid_count}")
print(f"  Invalid records           : {invalid_count}")
print(f"  Final Silver records      : {silver_count}")
print("=" * 55)

print(f"\n📊 Average PM2.5 by City (valid only):\n")
spark.sql(f"""
    SELECT
        city,
        ROUND(AVG(value), 2)  AS avg_pm25,
        COUNT(*)              AS readings
    FROM {SILVER_TABLE}
    WHERE pollutant = 'pm25'
    AND   is_valid  = true
    GROUP BY city
    ORDER BY avg_pm25 DESC
""").show(25, truncate=False)

print(f"\n▶️  Next Step: Open and run  03_gold_aggregation")

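=======================================================
  🥈 SILVER TRANSFORMATION — SUMMARY REPORT
=======================================================
  Bronze records ingested   : 7215
  Duplicates removed        : 0
  Unknown pollutants dropped: 0
  Valid records             : 7033
  Invalid records           : 182
  Final Silver records      : 7215
=======================================================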

📊 Average PM2.5 by City (valid only):

+-----------+--------+--------+
|city       |avg_pm25|readings|
+-----------+--------+--------+
|Delhi      |267.44  |50      |
|Dhaka      |121.64  |100     |
|Lahore     |120.05  |150     |
|Cairo      |63.46   |50      |
|Karachi    |48.7    |150     |
|Lagos      |45.42   |150     |
|Mumbai     |36.14   |194     |
|Mexico City|29.98   |52      |
|Shanghai   |28.64   |150     |
|Jakarta    |25.5    |54      |
|Seoul      |25.08   |150     |
|Lima       |23.12   |150     |
|Beijing    |15.58   |100     |
|Nairobi    |10.83   |150     |
|Tokyo      |7.55    |150     |
|New York   |7.09    |100     |
|London     |6.76    |150     |
+-----------+--------+--------+


▶️  Next Step: Open and run  03_gold_aggregation
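The counts in the summary report should reconcile: Bronze minus the removed rows should equal Silver, and valid plus invalid should equal Silver. A minimal sketch of that check follows, using the figures printed above. The function `check_counts` is hypothetical and not part of the notebook.

```python
def check_counts(bronze, duplicates, dropped, valid, invalid, silver):
    """Return a list of reconciliation problems; an empty list means the counts agree."""
    problems = []
    if bronze - duplicates - dropped != silver:
        problems.append("bronze minus removed rows does not equal silver")
    if valid + invalid != silver:
        problems.append("valid plus invalid does not equal silver")
    return problems


# Figures from the summary report above.
problems = check_counts(bronze=7215, duplicates=0, dropped=0,
                        valid=7033, invalid=182, silver=7215)
print(problems)  # []
```

Inside the notebook the same function could be called with `bronze_count`, `duplicates_removed`, `dropped_pollutants`, `valid_count`, `invalid_count`, and `silver_count`, raising an error when the list is not empty so that a mismatch stops the pipeline before the Gold step.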

In [0]:
# ============================================================
# AQI CATEGORY SHARE PER CITY (valid PM2.5 readings only)
# ============================================================

spark.sql(f"""
    SELECT
        city,
        aqi_category,
        COUNT(*)                                    AS readings,
        ROUND(COUNT(*) * 100.0 /
            SUM(COUNT(*)) OVER (PARTITION BY city), 1) AS pct_of_city
    FROM {SILVER_TABLE}
    WHERE pollutant  = 'pm25'
    AND   is_valid   = true
    AND   aqi_category != 'N/A'
    GROUP BY city, aqi_category
    ORDER BY city, readings DESC
""").show(50, truncate=False)

+-----------+---------------------+--------+-----------+
|city       |aqi_category         |readings|pct_of_city|
+-----------+---------------------+--------+-----------+
|Beijing    |Good                 |49      |49.0       |
|Beijing    |Moderate             |48      |48.0       |
|Beijing    |Unhealthy (Sensitive)|2       |2.0        |
|Beijing    |Unhealthy            |1       |1.0        |
|Cairo      |Unhealthy            |24      |48.0       |
|Cairo      |Moderate             |14      |28.0       |
|Cairo      |Unhealthy (Sensitive)|12      |24.0       |
|Delhi      |Hazardous            |30      |60.0       |
|Delhi      |Very Unhealthy       |12      |24.0       |
|Delhi      |Unhealthy            |8       |16.0       |
|Dhaka      |Unhealthy            |78      |78.0       |
|Dhaka      |Very Unhealthy       |10      |10.0       |
|Dhaka      |Hazardous            |8       |8.0        |
|Dhaka      |Unhealthy (Sensitive)|4       |4.0        |
|Jakarta    |Moderate          