# 🏙️ Urban Growth GPU Pipeline (Simple Version)
This notebook loads the Austin Issued Construction Permits CSV (~2 GB) onto the GPU with cuDF, flags major project types,
and saves a clean dataset for downstream models (RAG, clustering, predictions, etc.).

In [1]:
import cudf
import cupy as cp
import numpy as np
import re
from cudf.core.column import as_column

data_path = "/home/asus/atx-hackathon/data/Issued_Construction_Permits_20251213.csv"
print("Loading file:", data_path)

Loading file: /home/asus/atx-hackathon/data/Issued_Construction_Permits_20251213.csv


## 🔹 Step 1: Load CSV into cuDF (GPU)

In [5]:
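import os
import cudf

chunk_size = 256 * 1024 * 1024  # 256 MB per read
file_size = os.path.getsize(data_path)

dfs = []
col_names = None
# Walk the file in fixed-size byte ranges instead of looping until read_csv raises,
# which would also hide real errors such as running out of GPU memory.
for offset in range(0, file_size, chunk_size):
    if col_names is None:
        # The first range contains the header row.
        gdf_chunk = cudf.read_csv(data_path, byte_range=(offset, chunk_size))
        col_names = list(gdf_chunk.columns)
    else:
        # Later ranges have no header row, so reuse the names from the first one.
        gdf_chunk = cudf.read_csv(
            data_path,
            byte_range=(offset, chunk_size),
            names=col_names,
            header=None,
        )
    dfs.append(gdf_chunk)

# dtypes are inferred per chunk; pass dtype= to read_csv if chunks disagree.
df = cudf.concat(dfs, ignore_index=True)
df.head()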

Unnamed: 0,Permit Type,Permit Type Desc,Permit Num,Permit Class Mapped,Permit Class,Work Class,Condominium,Project Name,Description,TCAD ID,...,Contractor Zip,Applicant Full Name,Applicant Organization,Applicant Phone,Applicant Address 1,Applicant Address 2,Applicant City,Applicant Zip,Certificate Of Occupancy,Total Lot SQFT
0,PP,Plumbing Permit,2025-142580 PP,Residential,Residential,Irrigation,,13109 TAMAR CT,New install,262201310,...,,,,,,,,,,
1,PP,Plumbing Permit,2025-125697 PP,Residential,Residential,Irrigation,,9013 HAMADRYAS DR,Irrigation install,340130503,...,,,,,,,,,,
2,PP,Plumbing Permit,2025-125699 PP,Residential,Residential,Irrigation,,9015 HAMADRYAS DR,Irrigation install,340130506,...,,,,,,,,,,
3,PP,Plumbing Permit,2025-133647 PP,Residential,Residential,Irrigation,,1405 IBERVILLE DR,Install sprinkler system,274281601,...,,,,,,,,,,
4,PP,Plumbing Permit,2025-125469 PP,Residential,Residential,Irrigation,,5511 FORKS RD,irrigation install,438011501,...,,,,,,,,,,
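
Step 1 reads the file in fixed-size byte ranges, each passed to `cudf.read_csv` as `byte_range=(offset, size)`. As a rough sketch of how those ranges could be computed up front from the file size, the helper below returns every `(offset, size)` pair needed to cover a file. `byte_ranges` is a hypothetical name introduced here for illustration, not a cuDF function, and the sizes in the example are made up to keep the output short.

```python
def byte_ranges(file_size, chunk_size):
    """Return (offset, size) pairs that together cover file_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ranges = []
    for offset in range(0, file_size, chunk_size):
        # Trim the last range so it never extends past the end of the file.
        ranges.append((offset, min(chunk_size, file_size - offset)))
    return ranges


# Hypothetical sizes, chosen only to keep the example small.
print(byte_ranges(file_size=1000, chunk_size=256))
# [(0, 256), (256, 256), (512, 256), (768, 232)]
```

Knowing the number of ranges in advance makes the loop bounded, so a read error surfaces as an error rather than ending the loop.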


## 🔹 Step 2: Basic Cleanup
- Normalize text fields
- Fill missing values
- Create lower-case searchable versions

In [6]:
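# CSV headers are title-cased ("Description"), so map each to the prefix of its *_clean column.
# "Issued Date" is an assumed header; it is elided in the preview above.
text_cols = {"Description": "DESCRIPTION", "Work Class": "WORK_CLASS", "Issued Date": "ISSUED_DATE"}

for src, dst in text_cols.items():
    if src in df.columns:
        df[src] = df[src].fillna("")
        df[dst + "_clean"] = df[src].str.lower()
    else:
        print("Skipping missing column:", src)  # surface the mismatch instead of hiding it

df.head()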

Unnamed: 0,Permit Type,Permit Type Desc,Permit Num,Permit Class Mapped,Permit Class,Work Class,Condominium,Project Name,Description,TCAD ID,...,Contractor Zip,Applicant Full Name,Applicant Organization,Applicant Phone,Applicant Address 1,Applicant Address 2,Applicant City,Applicant Zip,Certificate Of Occupancy,Total Lot SQFT
0,PP,Plumbing Permit,2025-142580 PP,Residential,Residential,Irrigation,,13109 TAMAR CT,New install,262201310,...,,,,,,,,,,
1,PP,Plumbing Permit,2025-125697 PP,Residential,Residential,Irrigation,,9013 HAMADRYAS DR,Irrigation install,340130503,...,,,,,,,,,,
2,PP,Plumbing Permit,2025-125699 PP,Residential,Residential,Irrigation,,9015 HAMADRYAS DR,Irrigation install,340130506,...,,,,,,,,,,
3,PP,Plumbing Permit,2025-133647 PP,Residential,Residential,Irrigation,,1405 IBERVILLE DR,Install sprinkler system,274281601,...,,,,,,,,,,
4,PP,Plumbing Permit,2025-125469 PP,Residential,Residential,Irrigation,,5511 FORKS RD,irrigation install,438011501,...,,,,,,,,,,
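
The headers in the `df.head()` output are title-cased with spaces (`Description`, `Work Class`), while the derived columns use upper snake case (`DESCRIPTION_clean`). One possible way to bridge the two conventions without hard-coding every header is a normalized lookup. The sketch below is plain Python and does not touch cuDF; `normalize` and `resolve_columns` are hypothetical helpers introduced here for illustration.

```python
def normalize(name):
    """Upper-case a header and replace spaces with underscores."""
    return name.strip().upper().replace(" ", "_")


def resolve_columns(available, wanted):
    """Map each wanted name to the actual header that normalizes to it.

    Raises KeyError naming the unmatched entries instead of skipping them.
    """
    by_normalized = {normalize(col): col for col in available}
    missing = [name for name in wanted if normalize(name) not in by_normalized]
    if missing:
        raise KeyError(f"no matching column for: {missing}")
    return {name: by_normalized[normalize(name)] for name in wanted}


# A few headers copied from the df.head() output above.
headers = ["Permit Type", "Work Class", "Description", "TCAD ID"]
print(resolve_columns(headers, ["DESCRIPTION", "WORK_CLASS"]))
# {'DESCRIPTION': 'Description', 'WORK_CLASS': 'Work Class'}
```

Raising on an unmatched name is a deliberate choice in this sketch: a silent skip leaves later cells to fail on a column that was never created.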


## 🔹 Step 3: Feature Extraction (Simple Patterns)
We create quick flags for:
- 🏠 ADUs (Accessory Dwelling Units)
- 🔌 EV Chargers
- 🏢 Multi-family projects
- ☀️ Solar installs
- 🛠️ Renovations

In [7]:
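# Step 2 only creates DESCRIPTION_clean when it finds the source column, and the
# CSV header is "Description"; build the column here if it is still missing.
if "DESCRIPTION_clean" not in df.columns:
    df["DESCRIPTION_clean"] = df["Description"].fillna("").str.lower()

def make_flag(col, pattern):
    """Return a boolean column that is True where `pattern` matches df[col]."""
    return df[col].str.contains(pattern, regex=True)

# \b keeps short tokens such as "adu", "ev", "apt" and "pv" from matching inside longer words.
df["flag_adu"]   = make_flag("DESCRIPTION_clean", r"\badu\b|accessory dwelling")
df["flag_ev"]    = make_flag("DESCRIPTION_clean", r"\bev charger|electric vehicle")
df["flag_multi"] = make_flag("DESCRIPTION_clean", r"multi[- ]family|\bapts?\b|apartment")
df["flag_solar"] = make_flag("DESCRIPTION_clean", r"solar|\bpv system")
df["flag_reno"]  = make_flag("DESCRIPTION_clean", r"renovation|remodel|repair")

df[["flag_adu", "flag_ev", "flag_multi", "flag_solar", "flag_reno"]].head()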
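
Substring patterns are easy to over-match: a bare `apt`, for instance, also hits "adapter". Before running a pattern over the full column, it can help to try it on a handful of strings on the CPU. The sketch below uses Python's `re` module, whose syntax is close to but not identical to the regex engine behind cuDF's `str.contains`, so treat it as a sanity check only. `check_pattern` is a hypothetical helper and the sample descriptions are invented for the example.

```python
import re


def check_pattern(pattern, should_match, should_not_match):
    """Return the sample strings for which the pattern gives the wrong answer."""
    regex = re.compile(pattern)
    wrong = [s for s in should_match if not regex.search(s)]
    wrong += [s for s in should_not_match if regex.search(s)]
    return wrong


# Hypothetical lower-cased descriptions (not taken from the dataset).
hits = ["new apartment building", "multi-family complex", "apt 4 remodel"]
misses = ["replace power adapter", "irrigation install"]

loose = r"multi[- ]family|apt|apartment"
strict = r"multi[- ]family|\bapts?\b|apartment"

print(check_pattern(loose, hits, misses))   # ['replace power adapter']
print(check_pattern(strict, hits, misses))  # []
```

An empty list means the pattern behaved as expected on every sample; anything else names the strings worth a closer look.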

## üîπ Step 4 ‚Äî Quick Insights
Counts of major project types.

In [None]:
counts = {
    "ADUs": int(df.flag_adu.sum()),
    "EV Chargers": int(df.flag_ev.sum()),
    "Multi-Family": int(df.flag_multi.sum()),
    "Solar": int(df.flag_solar.sum()),
    "Renovations": int(df.flag_reno.sum())
}

counts
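
The flags are independent regex matches, so one permit can set several of them (a solar install during a remodel, say) and many permits set none. The per-flag counts therefore need not add up to the number of rows. The plain-Python sketch below illustrates this on a few made-up rows; `summarize_flags` is a hypothetical helper and the rows are invented, not drawn from the permits data.

```python
def summarize_flags(rows, flag_names):
    """Count rows per flag, rows with no flag set, and rows with several set."""
    counts = {name: sum(1 for row in rows if row[name]) for name in flag_names}
    set_per_row = [sum(bool(row[name]) for name in flag_names) for row in rows]
    unflagged = sum(1 for n in set_per_row if n == 0)
    overlapping = sum(1 for n in set_per_row if n > 1)
    return counts, unflagged, overlapping


# Hypothetical rows, invented for illustration.
rows = [
    {"flag_solar": True, "flag_reno": True},
    {"flag_solar": True, "flag_reno": True},
    {"flag_solar": False, "flag_reno": True},
    {"flag_solar": False, "flag_reno": False},
]

counts, unflagged, overlapping = summarize_flags(rows, ["flag_solar", "flag_reno"])
print(counts)       # {'flag_solar': 2, 'flag_reno': 3}
print(unflagged)    # 1
print(overlapping)  # 2
```

Here the two counts sum to 5 although there are only 4 rows, which is why the totals in `counts` are best read as separate tallies rather than shares of one whole.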

## 🔹 Step 5: Save a Clean Parquet File
This output can be used directly by your RAG/chatbot or your ML forecasting pipeline.

In [None]:
output_path = "/home/asus/atx-hackathon/data/permits_clean.parquet"
df.to_parquet(output_path)
print("Saved:", output_path)

## ✔️ Complete!
You now have:
- GPU-loaded 2M row dataset
- Cleaned text
- Flags for major urban trends
- Summary stats
- Parquet file ready for RAG, analysis, or clustering