# **Aadhaar Sentinel**
### **Decision-Support Analytics for Aadhaar Capacity Planning & Integrity Review**

**Team:** Team Sentinel

**Date:** 19 January, 2026

This notebook analyzes Aadhaar enrolment and update activity at the **PINCODE level** to identify:
- Areas with unusually high operational load
- Pincodes whose update patterns deviate from regional norms

The goal is to support **capacity planning, audits, and targeted interventions**.

In [285]:
# Environment Setup
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [286]:
#!pip install pandas numpy matplotlib scikit-learn folium streamlit

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
plt.style.use("default")

In [287]:
# Suppress Warnings
import warnings

warnings.filterwarnings('ignore')
pd.options.mode.chained_assignment = None
print("✓ Warnings suppressed for clean output.")



In [288]:
def load_and_concat_csvs(folder_path):
    """Load all CSV files from a folder and concatenate them"""
    csv_files = [
        os.path.join(folder_path, f)
        for f in os.listdir(folder_path)
        if f.endswith('.csv')
    ]

    print(f"Found {len(csv_files)} files in {folder_path}")
    df_list = [pd.read_csv(f) for f in csv_files]
    return pd.concat(df_list, ignore_index=True)

# Load Datasets
demo_df = load_and_concat_csvs("/content/drive/MyDrive/UIDAI/UIDAI_Datasets/aadhaar_demographic_updates")
bio_df = load_and_concat_csvs("/content/drive/MyDrive/UIDAI/UIDAI_Datasets/aadhaar_biometric_update_pincode")
enr_df = load_and_concat_csvs("/content/drive/MyDrive/UIDAI/UIDAI_Datasets/aadhaar_enrolment_pincode")

print("Demographic:", demo_df.shape)
print("Biometric:", bio_df.shape)
print("Enrolment:", enr_df.shape)

Found 5 files in /content/drive/MyDrive/UIDAI/UIDAI_Datasets/aadhaar_demographic_updates
Found 4 files in /content/drive/MyDrive/UIDAI/UIDAI_Datasets/aadhaar_biometric_update_pincode
Found 3 files in /content/drive/MyDrive/UIDAI/UIDAI_Datasets/aadhaar_enrolment_pincode
Demographic: (2071700, 6)
Biometric: (1861108, 6)
Enrolment: (1006029, 7)


# Computing Total Activity per Record

Each dataset reports activity by age group.
We first compute **total activity per row** before aggregating to PINCODE level.

In [289]:
# CLEANING STEP: Standardize text columns immediately to prevent split rows
for df in [demo_df, bio_df, enr_df]:
    # Ensure they are strings first to avoid errors
    df["district"] = df["district"].astype(str).str.upper().str.strip()
    df["state"] = df["state"].astype(str).str.upper().str.strip()

print("✓ Standardized District and State names to UPPERCASE.")

demo_df["demo_total"] = (
    demo_df["demo_age_5_17"].astype(int) +
    demo_df["demo_age_17_"].astype(int)
)

bio_df["bio_total"] = (
    bio_df["bio_age_5_17"].astype(int) +
    bio_df["bio_age_17_"].astype(int)
)

enr_df["enrol_total"] = (
    enr_df["age_0_5"].astype(int) +
    enr_df["age_5_17"].astype(int) +
    enr_df["age_18_greater"].astype(int)
)

✓ Standardized District and State names to UPPERCASE.


# Aggregation to PINCODE Level

All further analysis is performed at the PINCODE level to ensure:
- privacy preservation
- operational relevance
- stability of signals

In [290]:
demo_pin = (
    demo_df
    .groupby(["pincode", "district", "state"], as_index=False)["demo_total"]
    .sum()
)

bio_pin = (
    bio_df
    .groupby(["pincode", "district", "state"], as_index=False)["bio_total"]
    .sum()
)

enr_pin = (
    enr_df
    .groupby(["pincode", "district", "state"], as_index=False)["enrol_total"]
    .sum()
)

print(demo_pin.shape, bio_pin.shape, enr_pin.shape)

(31391, 4) (31198, 4) (28913, 4)


### Sentinel Metrics Definition
We derive three key risk indicators for every pincode:

1.  **TAI (Total Activity Index):**
    * `Demo + Bio + Enrol`
    * *Interpretation:* Measures pure operational load. High TAI = Busy Center.
2.  **DPR (Demographic Pressure Ratio):**
    * `Demo Updates / (Bio Updates + 1)`
    * *Interpretation:* Measures integrity risk. High DPR (>1.5) implies disjointed updates (e.g., bulk address changes without biometrics).
3.  **PNA (Population-Normalized Activity):**
    * `TAI / Estimated Population`
    * *Interpretation:* Measures capacity anomalies. High PNA (>3.0) indicates a "Ghost Village" effect or massive cross-border footfall.
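
The three definitions above can be checked by hand on a couple of rows. The sketch below uses made-up numbers and placeholder pincode labels (`P1`, `P2`); none of the values come from the UIDAI data, and the column names simply mirror the ones this notebook uses.

```python
import pandas as pd

# Hypothetical toy values for illustration only
toy = pd.DataFrame({
    "pincode": ["P1", "P2"],
    "demo_total": [300, 50],
    "bio_total": [99, 199],
    "enrol_total": [100, 50],
    "est_pincode_pop": [1000, 3000],
})

# TAI: total operational load
toy["TAI"] = toy["demo_total"] + toy["bio_total"] + toy["enrol_total"]
# DPR: demographic updates relative to biometric updates (+1 avoids division by zero)
toy["DPR"] = toy["demo_total"] / (toy["bio_total"] + 1)
# PNA: activity per estimated resident
toy["PNA"] = toy["TAI"] / toy["est_pincode_pop"]

print(toy[["pincode", "TAI", "DPR", "PNA"]])
```

Here `P1` has a DPR of 3.0 (300 demographic updates against 99 biometric updates), while `P2` has a DPR of 0.25.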

# Sentinel Metrics

The next cell constructs the first two indicators; PNA is added later, once population estimates are merged in:
- **Total Activity Index (TAI)** – overall Aadhaar operational load
- **Demographic Pressure Ratio (DPR)** – intensity of demographic corrections relative to biometric updates

In [291]:
sentinel_df = (
    demo_pin
    .merge(bio_pin, on=["pincode", "district", "state"], how="outer")
    .merge(enr_pin, on=["pincode", "district", "state"], how="outer")
    .fillna(0)
)

# Forcing Pincodes to be 6-digit strings (e.g., 11001 becomes "011001")
sentinel_df["pincode"] = sentinel_df["pincode"].astype(str).str.zfill(6)

sentinel_df["TAI"] = (
    sentinel_df["demo_total"] +
    sentinel_df["bio_total"] +
    sentinel_df["enrol_total"]
)

sentinel_df["DPR"] = sentinel_df["demo_total"] / (sentinel_df["bio_total"] + 1)

print("Dataframe Shape:", sentinel_df.shape)

# We use include='all' so it shows stats for Pincodes (Unique count) AND Numbers (Mean/Max)
print("\nSummary Statistics:")
display(sentinel_df.describe(include='all'))

print("\nFirst 5 Rows (Check Leading Zeros):")
display(sentinel_df[["pincode", "district", "state", "TAI", "DPR"]].head())

Dataframe Shape: (32898, 8)

Summary Statistics:


Unnamed: 0,pincode,district,state,demo_total,bio_total,enrol_total,TAI,DPR
count,32898.0,32898,32898,32898.0,32898.0,32898.0,32898.0,32898.0
unique,19815.0,1002,60,,,,,
top,509339.0,BARDDHAMAN,ANDHRA PRADESH,,,,,
freq,10.0,174,3209,,,,,
mean,,,,1498.425041,2120.587726,165.22895,3784.241717,1.194439
std,,,,3537.930376,3736.695054,442.583647,7318.430284,3.405612
min,,,,0.0,0.0,0.0,0.0,0.0
25%,,,,63.0,49.0,7.0,155.0,0.361308
50%,,,,414.0,831.0,45.0,1376.0,0.614305
75%,,,,1479.0,2544.75,147.0,4300.0,1.057483



First 5 Rows (Check Leading Zeros):


Unnamed: 0,pincode,district,state,TAI,DPR
0,100000,100000,100000,220.0,2.0
1,110001,CENTRAL DELHI,DELHI,471.0,0.810484
2,110001,NEW DELHI,DELHI,5121.0,1.010839
3,110002,CENTRAL DELHI,DELHI,11061.0,0.552567
4,110003,CENTRAL DELHI,DELHI,8365.0,0.425505
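
The summary above reports 32,898 rows but only 19,815 unique pincodes, and the first rows show pincode 110001 under both CENTRAL DELHI and NEW DELHI. A small diagnostic such as the one below could list pincodes that appear under more than one district/state label. The helper name `pincodes_with_multiple_labels` is hypothetical, and the toy frame only mimics the rows shown above.

```python
import pandas as pd

def pincodes_with_multiple_labels(df):
    """Count distinct (district, state) labels per pincode and keep those with more than one."""
    unique_labels = df.drop_duplicates(subset=["pincode", "district", "state"])
    counts = unique_labels.groupby("pincode").size().rename("n_labels")
    return counts[counts > 1].sort_values(ascending=False)

# Toy rows modelled on the head() output above
toy = pd.DataFrame({
    "pincode": ["110001", "110001", "110002"],
    "district": ["CENTRAL DELHI", "NEW DELHI", "CENTRAL DELHI"],
    "state": ["DELHI", "DELHI", "DELHI"],
})

dupes = pincodes_with_multiple_labels(toy)
print(dupes)
```

Running the same helper on `sentinel_df` would show whether the repeats come from genuine boundary overlaps or from spelling variants of district names.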


In [292]:
# Population Normalization (PNA)
pin_counts = (
    sentinel_df
    .groupby("district")["pincode"]
    .nunique()
    .reset_index(name="pin_count")
)
district_pop = pd.read_csv(
    "/content/drive/MyDrive/UIDAI/UIDAI_Datasets/Census_2011.csv"
)
# Align the Census column name with the Aadhaar data ('District' -> 'district')
district_pop = district_pop.rename(columns={"District": "district"})
# Normalize district names the same way on both sides before joining
district_pop["district"] = district_pop["district"].astype(str).str.upper().str.strip()
pin_counts["district"] = pin_counts["district"].str.upper().str.strip()
sentinel_df["district"] = sentinel_df["district"].str.upper().str.strip()

district_stats = pin_counts.merge(
    district_pop,
    on="district",
    how="left"
)

# TOT_P is the Census 2011 total-population column
district_stats["est_pincode_pop"] = (
    district_stats["TOT_P"] /
    district_stats["pin_count"]
)

district_stats["est_pincode_pop"] = district_stats["est_pincode_pop"].fillna(30000)

sentinel_df = sentinel_df.merge(
    district_stats[["district", "est_pincode_pop"]],
    on="district",
    how="left"
)

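# Assign the result back; in-place fillna on a column selection may not modify the frame
sentinel_df["est_pincode_pop"] = sentinel_df["est_pincode_pop"].fillna(30000)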

# Compute PNA

sentinel_df["PNA"] = sentinel_df["TAI"] / sentinel_df["est_pincode_pop"]
sentinel_df["PNA"].describe()

Unnamed: 0,PNA
count,32898.0
mean,0.126141
std,0.243948
min,0.0
25%,0.005167
50%,0.045867
75%,0.143333
max,6.317767


# Identifying Operational Outliers

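A pincode is flagged as an outlier if it falls at or above the 98th percentile on any one of TAI, DPR, or PNA.
Percentile cut-offs are used because they stay robust under the right-skewed distributions of these metrics and are easy to interpret.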
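
As a rough illustration of this rule, the sketch below applies a 98th-percentile cut-off to synthetic data and flags a row when any metric reaches its threshold. The helper `flag_outliers` and the toy values are assumptions made for this example, not part of the notebook.

```python
import pandas as pd

def flag_outliers(df, metrics=("TAI", "DPR", "PNA"), q=0.98):
    """Flag rows at or above the q-th quantile on ANY of the given metrics."""
    thresholds = {m: df[m].quantile(q) for m in metrics}
    mask = pd.Series(False, index=df.index)
    for metric, threshold in thresholds.items():
        mask |= df[metric] >= threshold
    return df[mask].copy(), thresholds

# Synthetic data: 100 rows, with TAI/PNA rising and DPR falling along the index
toy = pd.DataFrame({
    "TAI": list(range(1, 101)),
    "DPR": [i / 100 for i in range(100, 0, -1)],
    "PNA": [i / 1000 for i in range(1, 101)],
})

flagged, thresholds = flag_outliers(toy)
print(len(flagged), thresholds)
```

Because the conditions are combined with OR, the flagged set is the union of the per-metric tails, so it can be larger than 2% of the rows. Note also that with `>=`, a metric with many tied values at the threshold would flag every tied row.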

In [293]:
# OUTLIER LOGIC (Percentile-Based Thresholding)
tai_threshold = sentinel_df["TAI"].quantile(0.98)
dpr_threshold = sentinel_df["DPR"].quantile(0.98)
pna_threshold = sentinel_df["PNA"].quantile(0.98)

outliers_df = sentinel_df[
    (sentinel_df["TAI"] >= tai_threshold) |
    (sentinel_df["DPR"] >= dpr_threshold) |
    (sentinel_df["PNA"] >= pna_threshold)
].copy()

print("Final flagged pincodes:", outliers_df.shape[0])

# Sample Outliers
outliers_df[["district", "state", "TAI", "DPR", "PNA"]].head(10)

Final flagged pincodes: 1317


Unnamed: 0,district,state,TAI,DPR,PNA
9,NORTH DELHI,DELHI,39176.0,0.972499,1.305867
11,NORTH DELHI,DELHI,28965.0,1.118757,0.9655
12,CENTRAL DELHI,DELHI,36727.0,1.47495,1.224233
20,WEST DELHI,DELHI,24726.0,1.058813,0.8242
24,WEST DELHI,DELHI,40958.0,0.799838,1.365267
25,SOUTH DELHI,DELHI,32230.0,1.371749,1.074333
33,SOUTH DELHI,DELHI,44308.0,1.473491,1.476933
36,WEST DELHI,DELHI,28680.0,0.910654,0.956
42,EAST DELHI,DELHI,34195.0,1.560654,1.139833
46,NORTH WEST DELHI,DELHI,71366.0,1.503434,2.378867


In [294]:
# Export both files to the Colab working directory
sentinel_df.to_csv('sentinel_final.csv', index=False)
outliers_df.to_csv('outliers_final.csv', index=False)

# Save directly to your Drive folder
sentinel_df.to_csv('/content/drive/MyDrive/UIDAI/UIDAI_Datasets/sentinel_final.csv', index=False)
outliers_df.to_csv('/content/drive/MyDrive/UIDAI/UIDAI_Datasets/outliers_final.csv', index=False)

print("✓ Exported sentinel_final.csv (all pincodes)")
print("✓ Exported outliers_final.csv (flagged pincodes only)")
print("✓ Saved directly to Google Drive!")

✓ Exported sentinel_final.csv (all pincodes)
✓ Exported outliers_final.csv (flagged pincodes only)
✓ Saved directly to Google Drive!


# Population Estimation Methodology

District-level population (Census 2011) is divided evenly across the pincodes observed in each district to estimate **PINCODE-level population**. Districts with no Census match fall back to a default of 30,000 residents per pincode.

This is suitable for:

*   Prototyping
*   Policy simulation
*   Capacity analysis

In production, this can be replaced with **official UIDAI / Census / State Registry APIs**.
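
The estimation step can be sketched on toy data as follows. The district names and population figure are placeholders; only the 30,000 fallback and the `TOT_P` column name are taken from this notebook. The extra `used_fallback` column is an assumption added here to make the default visible.

```python
import pandas as pd

DEFAULT_PINCODE_POP = 30000  # same fallback value the notebook uses

# Hypothetical inputs: pincodes observed per district, and a Census table missing one district
pin_counts = pd.DataFrame({
    "district": ["DISTRICT A", "DISTRICT B"],
    "pin_count": [4, 10],
})
census = pd.DataFrame({"district": ["DISTRICT A"], "TOT_P": [200000]})

stats = pin_counts.merge(census, on="district", how="left")
# Track which districts had no Census match before filling the gap
stats["used_fallback"] = stats["TOT_P"].isna()
stats["est_pincode_pop"] = (
    stats["TOT_P"] / stats["pin_count"]
).fillna(DEFAULT_PINCODE_POP)

print(stats)
```

Keeping a flag like `used_fallback` helps when reading PNA values. In the outlier sample above, for instance, NORTH DELHI shows a TAI of 39,176 and a PNA of 1.305867, which equals 39,176 / 30,000, suggesting that row relied on the default rather than a Census figure.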

In [295]:
# 98th Percentile PNA Threshold

sentinel_df["PNA"].quantile(0.98)

np.float64(0.8128446666666637)

In [296]:
# Load District Latitude–Longitude Reference
import folium
import pandas as pd

latlon_df = pd.read_csv(
    "https://raw.githubusercontent.com/SaravananSuriya/Phonepe-Pulse-Data-Visualization-and-Exploration/main/lat-%26-lon-india-district.csv"
)

latlon_df.columns = latlon_df.columns.str.lower()
latlon_df["district_clean"] = latlon_df["district"].str.upper().str.strip()

latlon_df = latlon_df[["district_clean", "latitude", "longitude"]]

In [297]:
# Map Outlier PINCODES → Districts
map_df = outliers_df.copy()

# 1. Standardize Names
map_df["district_clean"] = (
    map_df["district"]
    .astype(str)
    .str.upper()
    .str.strip()
)

# 2. Apply Name Corrections (Crucial for National Map coverage)
name_corrections = {
    "CHHATRAPATI SAMBHAJINAGAR": "AURANGABAD",
    "AHILYANAGAR": "AHMEDNAGAR",
    "DHARASHIV": "OSMANABAD",
    "SRIBHUMI": "KAMRUP",  # Sribhumi is the renamed Karimganj district; verify this mapping
    "BENGALURU SOUTH": "BANGALORE",
    "KOTPUTLI-BEHROR": "JAIPUR",
    "DIDWANA-KUCHAMAN": "NAGAUR",
    "PASCHIM BARDHAMAN": "BARDDHAMAN",
    "KOCH BIHAR": "COOCH BEHAR"
}

map_df["district_clean"] = map_df["district_clean"].replace(name_corrections)

# 3. Merge with Lat/Lon
map_df = map_df.merge(
    latlon_df,
    on="district_clean",
    how="left"
)

# 4. Drop only if we truly can't find coordinates
map_df = map_df.dropna(subset=["latitude", "longitude"])

print(f"Mapped {len(map_df)} outlier pincodes to districts (Name Corrected)")

Mapped 944 outlier pincodes to districts (Name Corrected)
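
With 1,317 flagged rows and 944 mapped, roughly 373 rows lost their coordinates in the merge (assuming the reference file lists each district once). A sketch like the one below could show which district names failed to match, using the `indicator` option of `DataFrame.merge`. The helper name and toy frames are hypothetical, and the coordinates are placeholders.

```python
import pandas as pd

def unmatched_districts(outliers, latlon, key="district_clean"):
    """Count outlier rows per district name that has no entry in the lat/lon reference."""
    merged = outliers.merge(
        latlon[[key]].drop_duplicates(),
        on=key,
        how="left",
        indicator=True,  # adds a _merge column: 'both' or 'left_only'
    )
    missing = merged.loc[merged["_merge"] == "left_only", key]
    return missing.value_counts()

# Toy data: one district matches, one does not
toy_outliers = pd.DataFrame({
    "district_clean": ["DISTRICT A", "DISTRICT X", "DISTRICT X"],
})
toy_latlon = pd.DataFrame({
    "district_clean": ["DISTRICT A"],
    "latitude": [10.0],   # placeholder coordinates
    "longitude": [20.0],
})

missing = unmatched_districts(toy_outliers, toy_latlon)
print(missing)
```

The resulting counts indicate which names would be worth adding to `name_corrections` first.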


In [298]:
def risk_color(row):
    # 1. Check for DUAL Risk first (Both High)
    if (row["DPR"] >= dpr_threshold) and (row["PNA"] >= pna_threshold):
        return "#800080"  # Purple – Dual Critical

    # 2. Check for Single Risks
    if row["DPR"] >= dpr_threshold:
        return "#8B0000"  # Dark Red – Integrity Risk
    if row["PNA"] >= pna_threshold:
        return "#003366"  # Dark Blue – Capacity Risk

    # 3. Default
    return "#F4D03F"      # Yellow – Moderate / Edge Case
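
For larger frames, the same colour rules could be applied to the whole table at once rather than row by row. The sketch below is one possible vectorized variant using `numpy.select`; it takes the thresholds as explicit arguments instead of reading globals, and the threshold values in the usage lines are made up.

```python
import numpy as np
import pandas as pd

def risk_colors(df, dpr_threshold, pna_threshold):
    """Vectorized version of the colour rules: dual risk first, then single risks."""
    high_dpr = df["DPR"] >= dpr_threshold
    high_pna = df["PNA"] >= pna_threshold
    # np.select picks the first matching condition, so order matters
    conditions = [high_dpr & high_pna, high_dpr, high_pna]
    choices = ["#800080", "#8B0000", "#003366"]
    colors = np.select(conditions, choices, default="#F4D03F")
    return pd.Series(colors, index=df.index)

# Toy rows covering all four cases, with hypothetical thresholds
toy = pd.DataFrame({
    "DPR": [5.0, 5.0, 0.5, 0.5],
    "PNA": [2.0, 0.1, 2.0, 0.1],
})
colors = risk_colors(toy, dpr_threshold=3.0, pna_threshold=1.0)
print(colors.tolist())
```

As in `risk_color`, rows flagged only for high TAI fall through to the yellow default.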

# **National Surveillance Map (PINCODE-Level)**

In [299]:
surveillance_map = folium.Map(
    location=[22.5937, 78.9629],
    zoom_start=5,
    tiles="CartoDB positron"
)

legend_html = """
<div style="position: fixed;
     bottom: 50px; left: 50px; width: 260px; height: 170px;
     border:2px solid grey; z-index:9999; font-size:14px;
     background-color:white; padding:10px; opacity:0.95;">
<b> Surveillance Legend</b><br>
<i style="background:#800080;width:12px;height:12px;display:inline-block;border-radius:50%;"></i>
<b>CRITICAL DUAL RISK</b><br>
<i style="background:#8B0000;width:12px;height:12px;display:inline-block;border-radius:50%;"></i>
<b>Integrity Risk</b><br>
<i style="background:#003366;width:12px;height:12px;display:inline-block;border-radius:50%;"></i>
<b>Capacity Risk</b><br>
<i style="background:#F4D03F;width:12px;height:12px;display:inline-block;border-radius:50%;"></i>
<b>Moderate Risk</b>
</div>
"""
surveillance_map.get_root().html.add_child(folium.Element(legend_html))

for _, row in map_df.iterrows():
    folium.CircleMarker(
        location=[row["latitude"], row["longitude"]],
        radius=4,
        color=risk_color(row),
        fill=True,
        fill_opacity=0.7,
        popup=(
            f"<b>{row['district']}</b><br>"
            f"DPR: {row['DPR']:.2f}<br>"
            f"PNA: {row['PNA']:.2f}"
        )
    ).add_to(surveillance_map)

surveillance_map.save("Aadhaar_Sentinel_National.html")
surveillance_map

In [300]:
# 12.1 Compute District Risk Summary (Balanced Scoring)
district_risk = (
    outliers_df
    .groupby(["district", "state"])
    .agg(
        outlier_pincode_count=("pincode", "nunique"),
        max_dpr=("DPR", "max"),
        max_pna=("PNA", "max")
    )
    .reset_index()
)

# --- BALANCED RANKING LOGIC ---
# Goal: Keep "Massive Fraud" (High DPR) AND "System Failures" (High PNA)
# Formula:
# 1. We Count the dots.
# 2. We add DPR Score (1 point per unit).
# 3. We add PNA Score (50 points per unit).
#    (Why 50? Because PNA is usually small < 1.0. A PNA of 1.0 is catastrophic,
#     so we scale it up to match the severity of a DPR of 50.)

district_risk["audit_priority_score"] = (
    (district_risk["outlier_pincode_count"] * 1.0) +
    (district_risk["max_dpr"] * 1.0) +
    (district_risk["max_pna"] * 50.0)  # The "Pune Saver" Bonus
)

# Filter for Top 20 to ensure we catch all edge cases
district_risk = (
    district_risk
    .sort_values("audit_priority_score", ascending=False)
    .head(20)
)

# 12.2 District Name Canonicalization
name_corrections = {
    "CHHATRAPATI SAMBHAJINAGAR": "AURANGABAD",
    "AHILYANAGAR": "AHMEDNAGAR",
    "DHARASHIV": "OSMANABAD",
    "SRIBHUMI": "KAMRUP",  # Sribhumi is the renamed Karimganj district; verify this mapping
    "BENGALURU SOUTH": "BANGALORE",
    "KOTPUTLI-BEHROR": "JAIPUR",
    "DIDWANA-KUCHAMAN": "NAGAUR",
    "PASCHIM BARDHAMAN": "BARDDHAMAN",
    "KOCH BIHAR": "COOCH BEHAR",
    "MAHABUBABAD": "WARANGAL",    # Proxy
    "HANUMANGARH": "HANUMANGARH",
}

district_risk["merge_key"] = (
    district_risk["district"]
    .astype(str)
    .str.upper()
    .str.strip()
    .replace(name_corrections)
)

# 12.3 Merge with Geo Coordinates
top_map_df = district_risk.merge(
    latlon_df,
    left_on="merge_key",
    right_on="district_clean",
    how="left"
)

top_map_df = top_map_df.dropna(subset=["latitude", "longitude"])

print(f"✅ Mapped {len(top_map_df)} targets.")
display(top_map_df[["district", "max_dpr", "max_pna", "audit_priority_score"]])

✅ Mapped 20 targets.


Unnamed: 0,district,max_dpr,max_pna,audit_priority_score
0,WEST DELHI,12.230769,6.105633,324.512436
1,MORADABAD,1.261478,6.317767,319.149812
2,NORTH EAST DELHI,1.637388,5.376267,273.450721
3,AHILYANAGAR,193.0,0.0067,262.335
4,ALIGARH,1.387115,5.082933,257.533782
5,SAHARANPUR,1.24471,4.915867,250.038043
6,FIROZABAD,1.75695,4.8645,246.98195
7,RAMPUR,1.780274,4.2711,220.335274
8,NORTH WEST DELHI,2.112933,4.156467,218.936267
9,THANE,2.176837,3.971367,207.74517
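
To see how the three terms combine, the scores in the table can be reproduced by hand. The sketch below wraps the formula in a hypothetical helper, `audit_priority_score`. The outlier counts (7 for WEST DELHI, 69 for AHILYANAGAR) are not displayed in the table; they are back-solved here from the printed scores and should be treated as an assumption.

```python
def audit_priority_score(outlier_pincode_count, max_dpr, max_pna,
                         count_weight=1.0, dpr_weight=1.0, pna_weight=50.0):
    """Weighted sum used for ranking: count + max DPR + 50 x max PNA."""
    return (
        outlier_pincode_count * count_weight
        + max_dpr * dpr_weight
        + max_pna * pna_weight
    )

# Inputs copied from the table above; counts are inferred, not displayed
west_delhi = audit_priority_score(7, 12.230769, 6.105633)
ahilyanagar = audit_priority_score(69, 193.0, 0.0067)

print(round(west_delhi, 2), round(ahilyanagar, 3))
```

Under these assumptions, WEST DELHI's score comes almost entirely from the PNA term (about 305 of 324.5 points), whereas AHILYANAGAR's comes from its maximum DPR and its outlier count, with PNA contributing only 0.335.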


# **Top-20 District Audit Focus Map (Dual-Risk)**

In [301]:
# 13. Top-20 District Audit Focus Map (Severity Weighted)
focus_map = folium.Map(
    location=[22.5937, 78.9629],
    zoom_start=5,
    tiles="CartoDB positron"
)

# Legend
target_legend_html = """
<div style="position: fixed;
     bottom: 50px; left: 50px; width: 240px; height: 160px;
     border:2px solid grey; z-index:9999; font-size:14px;
     background-color:white; padding:10px; opacity:0.95; box-shadow: 2px 2px 5px grey;">
<b>🎯 Audit Priority Legend</b><br>
<i style="background:#800080;width:12px;height:12px;display:inline-block;border-radius:50%;"></i>
<b>CRITICAL DUAL RISK</b><br>
<i style="background:#8B0000;width:12px;height:12px;display:inline-block;border-radius:50%;"></i>
<b>Integrity Target</b><br>
<i style="background:#003366;width:12px;height:12px;display:inline-block;border-radius:50%;"></i>
<b>Capacity Target</b><br>
<i style="background:#F4D03F;width:12px;height:12px;display:inline-block;border-radius:50%;"></i>
<b>Moderate Risk</b><br>
<small><i>Target size scaled by audit priority score</i></small>
</div>
"""
focus_map.get_root().html.add_child(folium.Element(target_legend_html))

# Plot
for _, row in top_map_df.iterrows():
    high_dpr = row["max_dpr"] >= dpr_threshold
    high_pna = row["max_pna"] >= pna_threshold

    if high_dpr and high_pna:
        color = "#800080"
        risk_type = "DUAL CRITICAL"
    elif high_dpr:
        color = "#8B0000"
        risk_type = "Integrity Risk"
    elif high_pna:
        color = "#003366"
        risk_type = "Capacity Risk"
    else:
        color = "#F4D03F"
        risk_type = "Moderate"

    # Size based on score, not just count; capped at 25 so markers don't cover the map
    radius_size = min(8 + (row["audit_priority_score"] * 0.05), 25)

    folium.CircleMarker(
        location=[row["latitude"], row["longitude"]],
        radius=radius_size,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.8,
        popup=(
            f"<b>TARGET: {row['district']}</b><br>"
            f"Type: {risk_type}<br>"
            f"Score: {row['audit_priority_score']:.0f}<br>"
            f"Max DPR: {row['max_dpr']:.2f}<br>"
            f"Max PNA: {row['max_pna']:.2f}"
        )
    ).add_to(focus_map)

focus_map.save("Aadhaar_Sentinel_Top20_Targets.html")
focus_map

In [302]:
'''
from google.colab import files
import time

print(" Packaging Submission Artifacts...")

# List of files to download
artifacts = [
    "sentinel_final.csv",
    "outliers_final.csv",
    "Aadhaar_Sentinel_National.html",
    "Aadhaar_Sentinel_Top20_Targets.html"
]

for artifact in artifacts:
    try:
        files.download(artifact)
        print(f"⬇ Downloading: {artifact}")
        time.sleep(1) # Pause to prevent browser blocking multiple downloads
    except Exception as e:
        print(f"⚠️ Could not download {artifact}: {e}")

print("\n✅ All files ready for presentation!")
'''
