# HDB Resale Flat Prices: ETL Pipeline

This notebook runs the end-to-end ETL pipeline for HDB resale flat prices registered between Mar 2012 and Dec 2016.

### Pipeline Stages
| Stage | Description |
|-------|-------------|
| **0. Download** | Fetch raw CSVs from data.gov.sg via API |
| **1. Configure** | Import libraries, set paths and validation reference sets |
| **2. Load** | Read and align both CSV snapshots |
| **3. Clean** | Type casting, null drops, lease recomputation |
| **4. Deduplicate** | Remove duplicate records, save audit file |
| **5. Validate** | Apply business rules, flag violations |
| **6. Anomaly Detection** | 3-sigma price outlier detection per town/flat type |
| **6b. Finalise** | Combine failed records, write the cleaned dataset |
| **7. Profile** | Statistical summary of cleaned dataset |
| **8. Transform** | Create synthetic Resale Identifier |
| **9. Hash** | SHA-256 hash the Resale Identifier |

> **Prerequisites:** Ensure `download_hdb_data.py` is in the same folder as this notebook.
> Install dependencies: `pip install pandas requests`

## Stage 0: Download Raw Data

> **How this works:** `%run` executes `download_hdb_data.py` directly inside the notebook kernel,
> so all `print()` output appears here in real time.

> **Requirement:** `download_hdb_data.py` must be in the **same folder** as this notebook.

The script will:
1. Connect to the data.gov.sg API
2. Auto-discover matching datasets by keyword
3. Download both CSV files into the `hdb_data/` folder
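The auto-discovery step can be pictured as a keyword filter over dataset names. The sketch below is not the code in `download_hdb_data.py`, which is not shown in this notebook. `KEYWORD_SETS` and `match_datasets` are hypothetical names, and the dataset names are hard-coded (copied from this notebook's download log) rather than fetched from the API, so it only illustrates the matching idea.

```python
# Hypothetical sketch of keyword-based dataset discovery (not the real downloader).
# A dataset matches a keyword group only if its name contains every keyword in it.
KEYWORD_SETS = [
    ("Registration Date", "Mar 2012", "Dec 2014"),
    ("Registration Date", "Jan 2015", "Dec 2016"),
]

def match_datasets(names_by_id, keyword_sets=KEYWORD_SETS):
    """Return {dataset_id: name} for names that contain all keywords of any group."""
    matched = {}
    for dataset_id, name in names_by_id.items():
        lowered = name.lower()
        for keywords in keyword_sets:
            if all(k.lower() in lowered for k in keywords):
                matched[dataset_id] = name
                break  # one matching group is enough
    return matched

# Dataset names as they appear in this notebook's download log.
datasets = {
    "d_8b84c4ee58e3cfc0ece0d773c8ca6abc": "Resale flat prices based on registration date from Jan-2017 onwards",
    "d_43f493c6c50d54243cc1eab0df142d6a": "Resale Flat Prices (Based on Approval Date), 2000 - Feb 2012",
    "d_2d5ff9ea31397b66239f245f57751537": "Resale Flat Prices (Based on Registration Date), From Mar 2012 to Dec 2014",
    "d_ebc5ab87086db484f88045b47411ebc5": "Resale Flat Prices (Based on Approval Date), 1990 - 1999",
    "d_ea9ed51da2787afaf8e51f827c304208": "Resale Flat Prices (Based on Registration Date), From Jan 2015 to Dec 2016",
}

for dataset_id, name in match_datasets(datasets).items():
    print(f"{dataset_id} -> {name}")
```

Requiring every keyword in a group (rather than any single keyword) keeps the Jan-2017 dataset out, even though its name also mentions the registration date.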

In [1]:
import os

# ── Preflight check ──────────────────────────────────────────
DOWNLOADER = 'download_hdb_data.py'

if not os.path.isfile(DOWNLOADER):
    raise FileNotFoundError(
        f"'{DOWNLOADER}' not found in '{os.getcwd()}'.\n"
        f"Please copy {DOWNLOADER} into the same folder as this notebook."
    )

print(f'✓ {DOWNLOADER} found - starting download...')
print('=' * 60)

# ── Run downloader in-kernel so all output is visible ────────
%run download_hdb_data.py

✓ download_hdb_data.py found - starting download...
HDB Resale Flat Prices - Auto-discover & Download
Fetching collection metadata...
Found 5 datasets in collection. Fetching names...

  d_8b84c4ee58e3cfc0ece0d773c8ca6abc → Resale flat prices based on registration date from Jan-2017 onwards
  d_43f493c6c50d54243cc1eab0df142d6a → Resale Flat Prices (Based on Approval Date), 2000 - Feb 2012
  d_2d5ff9ea31397b66239f245f57751537 → Resale Flat Prices (Based on Registration Date), From Mar 2012 to Dec 2014
  d_ebc5ab87086db484f88045b47411ebc5 → Resale Flat Prices (Based on Approval Date), 1990 - 1999
  d_ea9ed51da2787afaf8e51f827c304208 → Resale Flat Prices (Based on Registration Date), From Jan 2015 to Dec 2016

  Matched: [Registration Date, Mar 2012, Dec 2014] → Resale Flat Prices (Based on Registration Date), From Mar 2012 to Dec 2014
  Matched: [Registration Date, Jan 2015, Dec 2016] → Resale Flat Prices (Based on Registration Date), From Jan 2015 to Dec 2016

Downlo

In [2]:
# ── Verify downloaded files exist before proceeding ──────────
EXPECTED_FILES = [
    os.path.join('hdb_data', 'Resale_Flat_Prices_Based_on_Registration_Date_From_Mar_2012_to_Dec_2014.csv'),
    os.path.join('hdb_data', 'Resale_Flat_Prices_Based_on_Registration_Date_From_Jan_2015_to_Dec_2016.csv'),
]

print('Checking downloaded files...')
all_found = True
for f in EXPECTED_FILES:
    if os.path.isfile(f):
        size_kb = os.path.getsize(f) / 1024
        print(f'  ✓ {f}  ({size_kb:.1f} KB)')
    else:
        print(f'  ❌ MISSING: {f}')
        all_found = False

if not all_found:
    raise RuntimeError('Some files are missing. Re-run the download cell above before continuing.')
else:
    print('\n✅ All files present - safe to proceed to Stage 1.')

Checking downloaded files...
  ✓ hdb_data\Resale_Flat_Prices_Based_on_Registration_Date_From_Mar_2012_to_Dec_2014.csv  (4063.3 KB)
  ✓ hdb_data\Resale_Flat_Prices_Based_on_Registration_Date_From_Jan_2015_to_Dec_2016.csv  (2998.9 KB)

✅ All files present - safe to proceed to Stage 1.


## Stage 1: Imports & Configuration

In [3]:
import os
import sys
import pandas as pd
import hashlib
import re
import time
from datetime import datetime

print('✓ Libraries imported')

✓ Libraries imported


In [4]:
# ── Output directories ───────────────────────────────────────
RAW_DIR           = 'hdb_data'
OUTPUT_DIR        = 'output'
RAW_OUT_DIR       = os.path.join(OUTPUT_DIR, 'raw')
CLEANED_OUT_DIR   = os.path.join(OUTPUT_DIR, 'cleaned')
TRANSFORM_OUT_DIR = os.path.join(OUTPUT_DIR, 'transformed')
HASHED_OUT_DIR    = os.path.join(OUTPUT_DIR, 'hashed')
FAILED_OUT_DIR    = os.path.join(OUTPUT_DIR, 'failed')
AUDIT_OUT_DIR     = os.path.join(OUTPUT_DIR, 'audit')
PROFILE_OUT_DIR   = os.path.join(OUTPUT_DIR, 'profiling')

for d in [RAW_OUT_DIR, CLEANED_OUT_DIR, TRANSFORM_OUT_DIR,
          HASHED_OUT_DIR, FAILED_OUT_DIR, AUDIT_OUT_DIR, PROFILE_OUT_DIR]:
    os.makedirs(d, exist_ok=True)

print('✓ Output directories created')

✓ Output directories created


In [5]:
# ── Input CSV files ──────────────────────────────────────────
CSV_FILES = [
    'Resale_Flat_Prices_Based_on_Registration_Date_From_Mar_2012_to_Dec_2014.csv',
    'Resale_Flat_Prices_Based_on_Registration_Date_From_Jan_2015_to_Dec_2016.csv'
]
CSV_PATHS = [os.path.join(RAW_DIR, f) for f in CSV_FILES]

EXPECTED_START = pd.Period('2012-03', freq='M')
EXPECTED_END   = pd.Period('2016-12', freq='M')

print('✓ CSV paths configured')
for p in CSV_PATHS:
    exists = '✓ Found' if os.path.isfile(p) else '❌ Missing'
    print(f'  {exists}: {p}')

✓ CSV paths configured
  ✓ Found: hdb_data\Resale_Flat_Prices_Based_on_Registration_Date_From_Mar_2012_to_Dec_2014.csv
  ✓ Found: hdb_data\Resale_Flat_Prices_Based_on_Registration_Date_From_Jan_2015_to_Dec_2016.csv


In [6]:
# ── Validation reference sets ────────────────────────────────
VALID_TOWNS = {
    'ANG MO KIO','BEDOK','BISHAN','BUKIT BATOK','BUKIT MERAH',
    'BUKIT PANJANG','BUKIT TIMAH','CENTRAL AREA','CHOA CHU KANG','CLEMENTI',
    'GEYLANG','HOUGANG','JURONG EAST','JURONG WEST','KALLANG/WHAMPOA',
    'MARINE PARADE','PASIR RIS','PUNGGOL','QUEENSTOWN','SEMBAWANG',
    'SENGKANG','SERANGOON','TAMPINES','TOA PAYOH','WOODLANDS','YISHUN'
}

VALID_FLAT_TYPES = {'1 ROOM','2 ROOM','3 ROOM','4 ROOM','5 ROOM','EXECUTIVE','MULTI-GENERATION'}

VALID_FLAT_MODELS = {
    '2-room','Adjoined flat','Apartment','DBSS','Improved','Improved-Maisonette',
    'Maisonette','Model A','Model A2','Model A-Maisonette','Multi Generation',
    'New Generation','Premium Apartment','Premium Apartment Loft','Premium Maisonette',
    'Simplified','Standard','Terrace','Type S1','Type S2'
}

VALID_STOREY_FORMAT = r'^\d{2} TO \d{2}$'

print(f'✓ Validation sets loaded: {len(VALID_TOWNS)} towns, {len(VALID_FLAT_TYPES)} flat types, {len(VALID_FLAT_MODELS)} flat models')

✓ Validation sets loaded: 26 towns, 7 flat types, 20 flat models


## Stage 2: Load & Align Raw Snapshots

Reads both CSV files and aligns their columns (via `reindex`) before concatenating into one master DataFrame.

In [7]:
def timed_step(name, func, *args, **kwargs):
    start = time.time()
    result = func(*args, **kwargs)
    elapsed = time.time() - start
    print(f'⏱ {name} executed in {elapsed:.2f}s')
    return result

def load_and_align_snapshots():
    print('📦 Loading and aligning CSV snapshots...')
    dfs = []
    for path in CSV_PATHS:
        print(f'   - {path}')
        df = pd.read_csv(path)
        print(f'     → {len(df):,} rows, {len(df.columns)} columns')
        dfs.append(df)
    all_columns = sorted(set().union(*(df.columns for df in dfs)))
    aligned_dfs = [df.reindex(columns=all_columns) for df in dfs]
    master_df = pd.concat(aligned_dfs, ignore_index=True)
    return master_df

df = timed_step('Load & Align CSVs', load_and_align_snapshots)

raw_file = os.path.join(RAW_OUT_DIR, 'hdb_resale_raw.csv')
df.to_csv(raw_file, index=False)
print(f'üíæ Raw dataset saved: {raw_file}')
print(f'\nShape: {df.shape[0]:,} rows √ó {df.shape[1]} columns')
df.head(3)

üì¶ Loading and aligning CSV snapshots...
   - hdb_data\Resale_Flat_Prices_Based_on_Registration_Date_From_Mar_2012_to_Dec_2014.csv
     ‚Üí 52,203 rows, 10 columns
   - hdb_data\Resale_Flat_Prices_Based_on_Registration_Date_From_Jan_2015_to_Dec_2016.csv
     ‚Üí 37,153 rows, 11 columns
‚è± Load & Align CSVs executed in 0.15s
üíæ Raw dataset saved: output\raw\hdb_resale_raw.csv

Shape: 89,356 rows √ó 11 columns


| | block | flat_model | flat_type | floor_area_sqm | lease_commence_date | month | remaining_lease | resale_price | storey_range | street_name | town |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 172 | Improved | 2 ROOM | 45.0 | 1986 | 2012-03 | NaN | 250000.0 | 06 TO 10 | ANG MO KIO AVE 4 | ANG MO KIO |
| 1 | 510 | Improved | 2 ROOM | 44.0 | 1980 | 2012-03 | NaN | 265000.0 | 01 TO 05 | ANG MO KIO AVE 8 | ANG MO KIO |
| 2 | 610 | New Generation | 3 ROOM | 68.0 | 1980 | 2012-03 | NaN | 315000.0 | 06 TO 10 | ANG MO KIO AVE 4 | ANG MO KIO |


## Stage 3: Data Cleaning

- Cast `month`, `resale_price`, `floor_area_sqm` to correct types
- Drop rows with nulls in critical fields
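- Recompute `remaining_lease` from `lease_commence_date`, assuming a 99-year lease that starts on 1 January of the commencement year and measuring from the date the notebook is run (not from the transaction month)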
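The lease arithmetic behind the last step can be checked by hand. Below is a minimal sketch using only the standard library. `remaining_lease` and `LEASE_YEARS` are illustrative names, and the reference date is passed in explicitly (a hypothetical 15 Feb 2026) so the result is reproducible.

```python
from datetime import date

LEASE_YEARS = 99  # assumption: every lease runs 99 years from 1 Jan of the commencement year

def remaining_lease(lease_commence_year, as_of):
    """Whole calendar months left on the lease at `as_of`, formatted as 'Y years M months'."""
    end_year, end_month = lease_commence_year + LEASE_YEARS, 1
    months = (end_year - as_of.year) * 12 + (end_month - as_of.month)
    months = max(months, 0)  # an expired lease reports zero, never a negative value
    return f"{months // 12} years {months % 12} months"

# 1986 + 99 = 2085; (2085 - 2026) * 12 + (1 - 2) = 707 months
print(remaining_lease(1986, date(2026, 2, 15)))  # 58 years 11 months
```

Passing the reference date as a parameter, instead of reading the system clock inside the function, makes it straightforward to measure the lease from each transaction month if that is the quantity of interest.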

In [8]:
# Type casting
df['month']          = pd.to_datetime(df['month'], format='%Y-%m', errors='coerce')
df['resale_price']   = pd.to_numeric(df['resale_price'], errors='coerce')
df['floor_area_sqm'] = pd.to_numeric(df['floor_area_sqm'], errors='coerce')

before = len(df)
df = df.dropna(subset=['month', 'resale_price', 'floor_area_sqm'])
after = len(df)
print('✓ Type casting complete')
print(f'  Rows dropped (null critical fields): {before - after:,}')
print(f'  Rows remaining: {after:,}')

✓ Type casting complete
  Rows dropped (null critical fields): 0
  Rows remaining: 89,356


In [9]:
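def recompute_remaining_lease(df):
    """Overwrite remaining_lease, assuming a 99-year lease from 1 Jan of lease_commence_date.

    The result is measured from the date the notebook runs, so it changes between runs.
    """
    today = pd.Timestamp.today()
    def compute(row):
        lease_start = row.get('lease_commence_date')
        if pd.isna(lease_start):
            return None
        # Lease end = 1 January of the commencement year + 99 years
        end = pd.Timestamp(year=int(lease_start), month=1, day=1) + pd.DateOffset(years=99)
        if end < today:
            return '0 years 0 months'
        # Whole calendar months between today and the lease end (days are ignored)
        months_remaining = (end.year - today.year) * 12 + (end.month - today.month)
        return f'{months_remaining//12} years {months_remaining%12} months'
    df['remaining_lease'] = df.apply(compute, axis=1)
    return df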

df = recompute_remaining_lease(df)
print('✓ remaining_lease recomputed')
print('  Sample values:')
print(df[['lease_commence_date','remaining_lease']].drop_duplicates().head(5).to_string(index=False))

✓ remaining_lease recomputed
  Sample values:
 lease_commence_date    remaining_lease
                1986 58 years 11 months
                1980 52 years 11 months
                1984 56 years 11 months
                1981 53 years 11 months
                1978 50 years 11 months


## Stage 4: Deduplication

Identifies duplicate records where all fields except `resale_price` are identical.
Keeps the row with the **highest** price in each duplicate group and saves the removed rows to the audit folder.

In [10]:
def deduplicate_dataset(df):
    key_cols = [c for c in df.columns if c != 'resale_price']
    df_sorted  = df.sort_values('resale_price', ascending=False)
    df_cleaned = df_sorted.drop_duplicates(subset=key_cols, keep='first')
    df_dupes   = df_sorted.loc[~df_sorted.index.isin(df_cleaned.index)]
    return df_cleaned, df_dupes

df_cleaned, df_duplicates = deduplicate_dataset(df)

print('✓ Deduplication complete')
print(f'  Original rows : {len(df):,}')
print(f'  Duplicates    : {len(df_duplicates):,}')
print(f'  Clean rows    : {len(df_cleaned):,}')

if not df_duplicates.empty:
    path = os.path.join(AUDIT_OUT_DIR, 'duplicates.csv')
    df_duplicates.to_csv(path, index=False)
    print(f'  ⚠️  Duplicates saved: {path}')
else:
    print('  ✓ No duplicates found')

✓ Deduplication complete
  Original rows : 89,356
  Duplicates    : 1,560
  Clean rows    : 87,796
  ⚠️  Duplicates saved: output\audit\duplicates.csv


## Stage 5: Business Rule Validation

Applies 7 domain-specific rules row by row. Any failing row is captured with a `comments` column that lists every rule it violated.
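To make the row-level check concrete, here is a minimal sketch that applies three of the rules to a single record held in a plain dict. `check_record`, `STOREY_PATTERN`, `START` and `END` are illustrative names, not part of the notebook. The sketch uses `fullmatch`, which is slightly stricter than `re.match` with `^...$` anchors because `$` also matches just before a trailing newline.

```python
import re
from datetime import date

STOREY_PATTERN = re.compile(r"\d{2} TO \d{2}")    # e.g. "01 TO 03"
START, END = date(2012, 3, 1), date(2016, 12, 1)  # expected month window

def check_record(record):
    """Return the list of rule violations for one record (a plain dict)."""
    issues = []
    if record.get("resale_price", 0) <= 0:
        issues.append("invalid resale_price")
    if not STOREY_PATTERN.fullmatch(str(record.get("storey_range"))):
        issues.append("invalid storey_range")
    month = record.get("month")
    if month is None:
        issues.append("missing month")
    elif not (START <= month <= END):
        issues.append("month out of range")
    return issues

good = {"resale_price": 250000.0, "storey_range": "06 TO 10", "month": date(2012, 3, 1)}
bad = {"resale_price": 0, "storey_range": "6 TO 10", "month": date(2017, 1, 1)}

print(check_record(good))            # []
print("; ".join(check_record(bad)))  # invalid resale_price; invalid storey_range; month out of range
```

Collecting all issues for a record, rather than stopping at the first, is what allows a single `comments` value to describe every failed rule.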

In [11]:
def extra_validation(df):
    rows = []
    for _, r in df.iterrows():
        issues = []
       
        # 1. Validate resale_price
        if r.get('resale_price',0) <= 0:
            issues.append("invalid resale_price")

        # 2. Validate floor_area_sqm
        if r.get('floor_area_sqm',0) <= 0 or r.get('floor_area_sqm',0) > 500:
            issues.append("invalid floor_area_sqm")

        # 3. Validate town
        if r.get('town') not in VALID_TOWNS:
            issues.append("invalid town")

        # 4. Validate flat_type
        if r.get('flat_type') not in VALID_FLAT_TYPES:
            issues.append("invalid flat_type")

        # 5. Validate flat_model
        if r.get('flat_model') not in VALID_FLAT_MODELS:
            issues.append("invalid flat_model")

        # 6. Validate storey_range format (e.g. "01 TO 03")
        if not re.match(VALID_STOREY_FORMAT, str(r.get('storey_range'))):
            issues.append("invalid storey_range")

        # 7. Validate month: must be within Mar 2012 to Dec 2016 (year + month check)
        month_val = r.get('month')
        if pd.isna(month_val):
            issues.append("missing month")
        else:
            month_period = pd.Period(month_val, freq="M")
            if month_period < EXPECTED_START or month_period > EXPECTED_END:
                issues.append(f"month out of range: {month_period} (expected {EXPECTED_START} to {EXPECTED_END})")
                
        if issues:
            row_copy = r.copy()
            row_copy['comments'] = '; '.join(issues)
            rows.append(row_copy)
    return pd.DataFrame(rows)

df_rules_fail = extra_validation(df_cleaned)

print('✓ Validation complete')
print(f'  Rule violations : {len(df_rules_fail):,} rows')

if not df_rules_fail.empty:
    path = os.path.join(AUDIT_OUT_DIR, 'rule_violations.csv')
    df_rules_fail.to_csv(path, index=False)
    print(f'  ⚠️  Violations saved: {path}')
    print(df_rules_fail['comments'].value_counts().head(10))
else:
    print('  ✓ No rule violations found')

✓ Validation complete
  Rule violations : 0 rows
  ✓ No rule violations found


## Stage 6: Anomaly Detection (3-Sigma)

Detects statistically unusual resale prices using the **3-sigma (Z-score) method**.

### Why 3-Sigma?
Based on the Empirical Rule (68-95-99.7 Rule):

| Threshold | Data Coverage | Points Outside | Decision |
|-----------|---------------|----------------|----------|
| 1σ | ~68% | ~1 in 3 | Too sensitive |
| 2σ | ~95% | ~1 in 20 | Too many false positives |
| **3σ** | **~99.7%** | **~1 in 370** | **Selected threshold ✓** |

### Why localised grouping (per Town + Flat Type)?
Each flat is compared only against its own peer group (same town, same flat type),
so a \$900k Executive flat is never unfairly penalised for being more expensive than a 3-room flat next door.

> **Assumption:** Prices within each Town + Flat Type group are approximately normally distributed.
> If heavily skewed, consider Median Absolute Deviation (MAD) as a more robust alternative.
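As a rough illustration of the MAD alternative mentioned above, the sketch below flags outliers in a single peer group with the modified Z-score. It is not part of the pipeline. `mad_outliers` is a hypothetical helper, the prices are made up, and the 0.6745 scaling constant and 3.5 cutoff are a commonly used convention that should be treated as tunable assumptions.

```python
from statistics import median

def mad_outliers(prices, threshold=3.5):
    """Return the prices whose modified Z-score exceeds `threshold`.

    Modified Z-score = 0.6745 * (x - median) / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    med = median(prices)
    mad = median(abs(p - med) for p in prices)
    if mad == 0:
        return []  # at least half the values equal the median; the score is undefined
    return [p for p in prices if abs(0.6745 * (p - med) / mad) > threshold]

# Hypothetical peer group: five typical prices and one extreme value.
group = [400_000, 410_000, 420_000, 430_000, 440_000, 1_200_000]
print(mad_outliers(group))  # [1200000]
```

Because the median and MAD barely move when one extreme value is added, the outlier cannot inflate its own threshold the way it inflates a mean and standard deviation.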

In [12]:
def detect_anomalous_prices(df):
    """Return rows whose resale_price lies more than 3 standard deviations
    from the mean of their (town, flat_type) group."""
    df = df.copy()
    df['price_anomaly'] = False

    for (town, flat_type), group in df.groupby(['town', 'flat_type']):
        if len(group) < 3:  # Skip groups too small for a meaningful std
            continue
        mean  = group['resale_price'].mean()
        std   = group['resale_price'].std()  # sample standard deviation (ddof=1)
        lower = mean - 3 * std
        upper = mean + 3 * std
        outside = (group['resale_price'] < lower) | (group['resale_price'] > upper)
        df.loc[group.index[outside], 'price_anomaly'] = True

    # Select from df rather than from the per-group slices, so that
    # price_anomaly is True on every returned row
    return df[df['price_anomaly']]

df_anomalies = detect_anomalous_prices(df_cleaned)

print('✓ Anomaly detection complete')
print(f'  Anomalous rows: {len(df_anomalies):,}')

if not df_anomalies.empty:
    path = os.path.join(AUDIT_OUT_DIR, 'anomalies.csv')
    df_anomalies.to_csv(path, index=False)
    print(f'  ⚠️  Anomalies saved: {path}')
    print('\n  Top anomalous towns:')
    print(df_anomalies['town'].value_counts().head(5))
else:
    print('  ✓ No anomalies detected')

✓ Anomaly detection complete
  Anomalous rows: 450
  ⚠️  Anomalies saved: output\audit\anomalies.csv

  Top anomalous towns:
town
BEDOK              72
TAMPINES           46
KALLANG/WHAMPOA    42
ANG MO KIO         31
HOUGANG            31
Name: count, dtype: int64


## Stage 6b: Combine Failed Records & Finalise Cleaned Dataset

In [13]:
# Merge all failed records
failed_records = pd.concat([df_duplicates, df_rules_fail, df_anomalies]).drop_duplicates()
print(f'Total failed records (dupes + violations + anomalies): {len(failed_records):,}')

if not failed_records.empty:
    path = os.path.join(FAILED_OUT_DIR, 'hdb_resale_failed.csv')
    failed_records.to_csv(path, index=False)
    print(f'  ⚠️  Failed records saved: {path}')

# Remove failed from cleaned
df_cleaned_final = df_cleaned.loc[~df_cleaned.index.isin(failed_records.index)]
cleaned_file = os.path.join(CLEANED_OUT_DIR, 'hdb_resale_cleaned.csv')
df_cleaned_final.to_csv(cleaned_file, index=False)

print(f'\n💾 Cleaned dataset saved: {cleaned_file}')
print(f'   Final rows: {len(df_cleaned_final):,}')
df_cleaned_final.head(3)

Total failed records (dupes + violations + anomalies): 2,008
  ⚠️  Failed records saved: output\failed\hdb_resale_failed.csv

💾 Cleaned dataset saved: output\cleaned\hdb_resale_cleaned.csv
   Final rows: 87,346


| | block | flat_model | flat_type | floor_area_sqm | lease_commence_date | month | remaining_lease | resale_price | storey_range | street_name | town |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 83511 | 1G | Type S2 | 5 ROOM | 106.0 | 2011 | 2016-09-01 | 83 years 11 months | 1120000.0 | 43 TO 45 | CANTONMENT RD | CENTRAL AREA |
| 86801 | 1D | Type S2 | 5 ROOM | 107.0 | 2011 | 2016-11-01 | 83 years 11 months | 1100000.0 | 46 TO 48 | CANTONMENT RD | CENTRAL AREA |
| 86800 | 1B | Type S2 | 5 ROOM | 106.0 | 2011 | 2016-11-01 | 83 years 11 months | 1100000.0 | 31 TO 33 | CANTONMENT RD | CENTRAL AREA |


## Stage 7: Data Profiling

Generates a statistical summary of the cleaned dataset including null counts, numeric distributions, and duplicate counts.

In [14]:
def profile_dataset(df):
    profile = {}
    profile['total_rows']    = len(df)
    profile['total_columns'] = len(df.columns)
    profile.update({f'null_count_{col}': df[col].isna().sum() for col in df.columns})
    for col in ['resale_price', 'floor_area_sqm']:
        if col in df.columns:
            profile[f'{col}_min']    = df[col].min()
            profile[f'{col}_max']    = df[col].max()
            profile[f'{col}_mean']   = df[col].mean()
            profile[f'{col}_median'] = df[col].median()
    profile['duplicate_rows'] = df.duplicated().sum()
    return profile

profile = timed_step('Profile Cleaned Dataset', profile_dataset, df_cleaned_final)
profile_df = pd.DataFrame([profile])

path = os.path.join(PROFILE_OUT_DIR, 'profile_cleaned.csv')
profile_df.to_csv(path, index=False)
print(f'üìä Profiling report saved: {path}')

# Display key stats
print('\n‚îÄ‚îÄ Key Statistics ‚îÄ‚îÄ')
print(f"  Rows          : {profile['total_rows']:,}")
print(f"  Columns       : {profile['total_columns']}")
print(f"  Duplicates    : {profile['duplicate_rows']}")
print(f"  Price min     : ${profile['resale_price_min']:,.0f}")
print(f"  Price max     : ${profile['resale_price_max']:,.0f}")
print(f"  Price mean    : ${profile['resale_price_mean']:,.0f}")
print(f"  Price median  : ${profile['resale_price_median']:,.0f}")

⏱ Profile Cleaned Dataset executed in 0.12s
📊 Profiling report saved: output\profiling\profile_cleaned.csv

── Key Statistics ──
  Rows          : 87,346
  Columns       : 11
  Duplicates    : 0
  Price min     : $192,000
  Price max     : $1,120,000
  Price mean    : $450,913
  Price median  : $428,000


## Stage 8: Transformation: Resale Identifier

Creates a synthetic `Resale Identifier` field encoding location, price context, and timing:

```
Format:  S + block_numeric(3) + avg_price_prefix(2) + month_num(2) + town_initial(1)
Example: S0011009C  (S | 001 | 10 | 09 | C)
```
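The format can be traced for a single record with the sketch below. `build_identifier` is a hypothetical helper that mirrors the format string rather than the notebook's vectorised implementation, and the group average of 1,098,000 is a made-up value chosen so that its first two digits are `10`.

```python
import re

def build_identifier(block, group_avg_price, month, town):
    """Assemble S + block_numeric(3) + avg_price_prefix(2) + month_num(2) + town_initial(1)."""
    digits = re.search(r"\d+", str(block))           # first run of digits in the block, if any
    block_numeric = (digits.group(0) if digits else "000").zfill(3)
    price_prefix = str(int(group_avg_price))[:2]      # two leading digits of the group average
    return f"S{block_numeric}{price_prefix}{month:02d}{town[0]}"

print(build_identifier("1G", 1_098_000, 9, "CENTRAL AREA"))  # S0011009C
print(build_identifier("172", 254_300, 3, "ANG MO KIO"))     # S1722503A
```

Two properties follow from the format: the price prefix keeps only the leading digits, so averages of 1,098,000 and 109,800 would yield the same prefix, and records that share a block number, month, town initial and prefix receive the same identifier.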

In [15]:
def create_resale_identifier(df):
    df_copy = df.copy()
    if 'block' not in df_copy.columns:
        df_copy['block'] = '000'
    df_copy['block_numeric'] = (
        df_copy['block'].astype(str)
        .str.extract(r'(\d+)')[0]
        .fillna('000')
        .str.zfill(3)
    )
    df_copy['year_month'] = df_copy['month'].dt.to_period('M')
    avg_price = df_copy.groupby(['year_month','town','flat_type'])['resale_price'].transform('mean')
    df_copy['Resale Identifier'] = (
        'S' +
        df_copy['block_numeric'] +
        avg_price.astype(int).astype(str).str[:2] +
        df_copy['month'].dt.month.astype(str).str.zfill(2) +
        df_copy['town'].str[0]
    )
    return df_copy

df_transformed = create_resale_identifier(df_cleaned_final)

path = os.path.join(TRANSFORM_OUT_DIR, 'hdb_resale_transformed.csv')
df_transformed.to_csv(path, index=False)
print(f'üíæ Transformed dataset saved: {path}')
print('\nSample Resale Identifiers:')
print(df_transformed[['town','flat_type','resale_price','Resale Identifier']].head(5).to_string(index=False))

üíæ Transformed dataset saved: output\transformed\hdb_resale_transformed.csv

Sample Resale Identifiers:
        town flat_type  resale_price Resale Identifier
CENTRAL AREA    5 ROOM     1120000.0         S0011009C
CENTRAL AREA    5 ROOM     1100000.0         S0011011C
CENTRAL AREA    5 ROOM     1100000.0         S0011011C
CENTRAL AREA    5 ROOM     1088000.0         S0019311C
CENTRAL AREA    5 ROOM     1070000.0         S0011008C


## Stage 9: Hashing

Applies **SHA-256** hashing to the `Resale Identifier` field, producing `Resale Identifier Hashed`.

- Obscures the synthetic key while preserving equality: identical identifiers always map to identical hashes
- Deterministic: same input always produces the same 64-character hex output
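The determinism and fixed output length can be checked with a few lines of standard-library Python. `sha256_hex` is an illustrative wrapper around the same `hashlib` call used in this stage, and the identifiers are taken from the sample output of the transformation stage.

```python
import hashlib

def sha256_hex(identifier):
    """Hex-encoded SHA-256 digest of the identifier's UTF-8 bytes."""
    return hashlib.sha256(str(identifier).encode()).hexdigest()

a = sha256_hex("S0011009C")
b = sha256_hex("S0011009C")
c = sha256_hex("S0011011C")

print(len(a))   # 64 hex characters (256 bits)
print(a == b)   # True: same input, same digest
print(a == c)   # False: different input, different digest
```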

In [16]:
def hash_resale_identifier(df):
    df = df.copy()  # avoid mutating the caller's DataFrame
    df['Resale Identifier Hashed'] = df['Resale Identifier'].apply(
        lambda x: hashlib.sha256(str(x).encode()).hexdigest()
    )
    return df

df_hashed = hash_resale_identifier(df_transformed)

path = os.path.join(HASHED_OUT_DIR, 'hdb_resale_hashed.csv')
df_hashed.to_csv(path, index=False)
print(f'üíæ Hashed dataset saved: {path}')
print('\nSample hashes:')
print(df_hashed[['Resale Identifier','Resale Identifier Hashed']].head(3).to_string(index=False))

üíæ Hashed dataset saved: output\hashed\hdb_resale_hashed.csv

Sample hashes:
Resale Identifier                                         Resale Identifier Hashed
        S0011009C dc3cb88029cf26ce1e87b42904bcd0b6eb033a6ebc5f806d36f88a45aa38cb6c
        S0011011C 4dc0cb821bca135bb0aefdb588f218ea152babc0536de3e6d577519d3fccc2c2
        S0011011C 4dc0cb821bca135bb0aefdb588f218ea152babc0536de3e6d577519d3fccc2c2


## ✅ ETL Complete: Final Summary

In [17]:
print('=' * 60)
print('✅ HDB RESALE ETL PIPELINE COMPLETE')
print('=' * 60)
print(f'  Final rows     : {len(df_hashed):,}')
print(f'  Final columns  : {len(df_hashed.columns)}')
print(f'  Date range     : {df_hashed["month"].min().date()} → {df_hashed["month"].max().date()}')
print()
print('Output files:')
outputs = [
    (RAW_OUT_DIR,       'hdb_resale_raw.csv'),
    (CLEANED_OUT_DIR,   'hdb_resale_cleaned.csv'),
    (TRANSFORM_OUT_DIR, 'hdb_resale_transformed.csv'),
    (HASHED_OUT_DIR,    'hdb_resale_hashed.csv'),
    (FAILED_OUT_DIR,    'hdb_resale_failed.csv'),
    (AUDIT_OUT_DIR,     'duplicates.csv'),
    (AUDIT_OUT_DIR,     'rule_violations.csv'),
    (AUDIT_OUT_DIR,     'anomalies.csv'),
    (PROFILE_OUT_DIR,   'profile_cleaned.csv'),
]
for folder, fname in outputs:
    path = os.path.join(folder, fname)
    status = '✓' if os.path.isfile(path) else '⚠️ '
    print(f'  {status} {path}')

✅ HDB RESALE ETL PIPELINE COMPLETE
  Final rows     : 87,346
  Final columns  : 15
  Date range     : 2012-03-01 → 2016-12-01

Output files:
  ✓ output\raw\hdb_resale_raw.csv
  ✓ output\cleaned\hdb_resale_cleaned.csv
  ✓ output\transformed\hdb_resale_transformed.csv
  ✓ output\hashed\hdb_resale_hashed.csv
  ✓ output\failed\hdb_resale_failed.csv
  ✓ output\audit\duplicates.csv
  ⚠️  output\audit\rule_violations.csv
  ✓ output\audit\anomalies.csv
  ✓ output\profiling\profile_cleaned.csv
