# Aadhaar Demographic Data Analysis
## UIDAI Hackathon Submission
---

This notebook presents a comprehensive analysis of anonymized Aadhaar demographic update data. The pipeline processes approximately 2 million records to extract meaningful insights about demographic trends across different age groups, states, and time periods.

### Key Highlights
- **Automated Data Pipeline**: Programmatic extraction and processing of large-scale CSV datasets
- **Data Engineering**: Merging multiple data partitions into a unified DataFrame
- **Visual Analytics**: High-resolution visualizations showcasing demographic patterns

### Step 1: Import Required Libraries

In [None]:
# Installing dependencies
!pip install -q seaborn matplotlib pandas numpy

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import os

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print('✅ Libraries imported successfully!')

### Step 2: Upload Dataset

The dataset (`aadhaar-hackathon.zip`) contains anonymized Aadhaar demographic update records. The cell below uses Google Colab's file-upload widget, so it prompts for the archive and runs only inside Colab.

In [None]:
from google.colab import files

uploaded = files.upload()
zip_filename = list(uploaded.keys())[0]
print(f"\n✅ Dataset uploaded: {zip_filename}")

### Step 3: Extract Dataset

Extracting the compressed archive to access the CSV data files.
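
Before extracting, it can help to check the archive and pull out only the data files. The following is a minimal sketch using only the standard library; the archive and its member names (`part_0.csv`, `README.txt`, and so on) are made up for illustration and do not reflect the real dataset layout.

```python
import os
import tempfile
import zipfile

# Build a tiny throwaway archive so the sketch is self-contained.
# All member names here are hypothetical.
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, 'demo.zip')
with zipfile.ZipFile(demo_zip, 'w') as zf:
    zf.writestr('part_0.csv', 'state,count\nA,1\n')
    zf.writestr('part_1.csv', 'state,count\nB,2\n')
    zf.writestr('README.txt', 'not a data file')

with zipfile.ZipFile(demo_zip, 'r') as zf:
    # testzip() returns the name of the first corrupt member, or None if all pass
    bad_member = zf.testzip()
    # List the CSV members without extracting anything yet
    csv_members = [name for name in zf.namelist() if name.lower().endswith('.csv')]
    # Extract only those members into a target directory
    target = os.path.join(work_dir, 'data')
    zf.extractall(target, members=csv_members)

extracted = sorted(os.listdir(target))
print('Corrupt member:', bad_member)
print('Extracted:', extracted)
```

Extracting only the CSV members keeps stray files out of the data directory, which matters once a later step walks that directory looking for partitions.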

In [None]:
extract_path = '/content/data'
os.makedirs(extract_path, exist_ok=True)

with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print(f'✅ Dataset extracted to: {extract_path}')

# Display extracted files
print('\n📂 Extracted files:')
for root, dirs, files_list in os.walk(extract_path):
    for file in files_list:
        print(f'  - {os.path.join(root, file)}')

### Step 4: Load and Merge CSV Files

The dataset is partitioned across multiple CSV files. This step consolidates all partitions into a single DataFrame for unified analysis.
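
One way to make the merge easier to audit is to tag each row with the file it came from and to check that all partitions share the same columns. This is a sketch under assumptions: the two partitions, their columns (`state`, `count`), and the `source_file` column are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Two throwaway partitions with made-up columns, written to a temp directory
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({'state': ['A', 'B'], 'count': [1, 2]}).to_csv(
    os.path.join(tmp_dir, 'part_0.csv'), index=False)
pd.DataFrame({'state': ['C'], 'count': [3]}).to_csv(
    os.path.join(tmp_dir, 'part_1.csv'), index=False)

# Sort the paths so the row order of the merged frame is reproducible
paths = sorted(
    os.path.join(tmp_dir, name)
    for name in os.listdir(tmp_dir)
    if name.endswith('.csv')
)

frames = []
for path in paths:
    part = pd.read_csv(path)
    # Tag each row with its source file so problems can be traced back later
    part['source_file'] = os.path.basename(path)
    frames.append(part)

# Fail early if partitions disagree on their columns; pd.concat would
# otherwise fill the missing cells with NaN without raising an error
column_sets = {tuple(frame.columns) for frame in frames}
if len(column_sets) != 1:
    raise ValueError(f'Partitions have differing columns: {column_sets}')

merged = pd.concat(frames, ignore_index=True)
print(merged)
```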

In [None]:
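# Collect CSV paths in sorted order so the merge is reproducible
csv_files = sorted(
    os.path.join(root, f)
    for root, dirs, filenames in os.walk(extract_path)
    for f in filenames
    if f.lower().endswith('.csv')
)

print(f'Found {len(csv_files)} CSV files')
if not csv_files:
    raise FileNotFoundError(f'No CSV files found under {extract_path}')

# Read each partition and stack them into a single DataFrame
df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

print(f'\n✅ Loaded {len(df):,} rows of data')
print(f'\nColumns: {list(df.columns)}')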

### Step 5: Data Exploration

Examining the structure and statistical properties of the dataset.
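
Beyond `head()`, `info()`, and `describe()`, two quick checks often worth running on merged data are missing values per column and exact duplicate rows. The sketch below uses a small made-up frame; the column names are assumptions, not the dataset's confirmed schema.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing state, one missing count,
# and one exact duplicate row
sample = pd.DataFrame({
    'state': ['A', 'A', 'B', None],
    'demo_age_5_17': [10, 10, np.nan, 4],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print(f'Duplicate rows: {duplicate_rows}')
```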

In [None]:
# Preview of the dataset
df.head()

In [None]:
# Dataset structure and memory usage
df.info()

In [None]:
# Statistical summary
df.describe()

### Step 6: Age Group Distribution Analysis

Visualizing the distribution of demographic updates across different age groups.
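
The percentages shown on a pie chart are each column's total divided by the grand total. A minimal sketch of that arithmetic follows, with hypothetical column names and made-up counts.

```python
import pandas as pd

# Hypothetical age-group count columns with made-up values
sample = pd.DataFrame({
    'demo_age_5_17': [10, 20, 30],
    'demo_age_17_': [40, 50, 50],
})

# Column totals: one number per age group
totals = sample.sum()

# Each group's share of all updates, as a percentage rounded to one decimal
share_pct = (totals / totals.sum() * 100).round(1)

print(share_pct)
```

Having the shares as a Series makes it possible to report them in a table alongside the chart.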

In [None]:
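# Identify age-group count columns: numeric columns whose name contains 'age' or 'demo'
# (restricting to numeric columns avoids summing text columns that happen to match)
numeric_cols = df.select_dtypes(include='number').columns
age_cols = [col for col in numeric_cols if 'age' in col.lower() or 'demo' in col.lower()]
print(f'Age-related columns found: {age_cols}')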

# Calculate totals for each age group
age_totals = df[age_cols].sum()

# Pie chart visualization
plt.figure(figsize=(10, 10))
colors = sns.color_palette('husl', len(age_totals))
plt.pie(age_totals, labels=age_totals.index, autopct='%1.1f%%', startangle=90, colors=colors)
plt.title('Age Group Distribution of Demographic Updates', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('age_distribution_pie.png', dpi=150, bbox_inches='tight')
plt.show()

print('\n✅ Age distribution pie chart generated')

In [None]:
# Bar chart for age group comparison
plt.figure(figsize=(12, 6))
age_totals.plot(kind='bar', color=colors, edgecolor='black', linewidth=0.5)
plt.title('Total Update Count by Age Group', fontsize=14, fontweight='bold')
plt.xlabel('Age Group')
plt.ylabel('Total Updates')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('age_group_updates.png', dpi=150, bbox_inches='tight')
plt.show()

print('\n✅ Age group bar chart generated')

### Step 7: Time Series Analysis

Analyzing the temporal trends in demographic updates to identify patterns over time.
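
Daily counts can be noisy, and rows whose dates fail to parse drop out of a date-based aggregation. The sketch below, on made-up data with hypothetical column names (`date`, `updates`), counts unparsed rows and adds a weekly roll-up alongside the daily totals.

```python
import pandas as pd

# Hypothetical sample: day-month-year strings plus one unparseable value
sample = pd.DataFrame({
    'date': ['01-03-2025', '01-03-2025', '02-03-2025', '09-03-2025', 'not a date'],
    'updates': [5, 7, 3, 4, 100],
})

# errors='coerce' turns unparseable strings into NaT instead of raising
sample['date'] = pd.to_datetime(sample['date'], format='%d-%m-%Y', errors='coerce')

# Count rows that failed to parse; groupby drops these NaT keys by default
unparsed_rows = int(sample['date'].isna().sum())

# Daily totals, then a weekly roll-up to smooth day-to-day noise
daily = sample.groupby('date')['updates'].sum()
weekly = daily.resample('W').sum()

print(f'Unparsed rows: {unparsed_rows}')
print(daily)
print(weekly)
```

Reporting the unparsed count makes it visible how many records the time series leaves out.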

In [None]:
# Identify date column
date_cols = [col for col in df.columns if 'date' in col.lower()]
date_col = date_cols[0] if date_cols else None

if date_col:
    # Convert to datetime
    df[date_col] = pd.to_datetime(df[date_col], format='%d-%m-%Y', errors='coerce')
    
    # Aggregate daily updates
    daily_updates = df.groupby(date_col)[age_cols].sum()
    
    # Time series plot
    plt.figure(figsize=(14, 6))
    for col in age_cols:
        plt.plot(daily_updates.index, daily_updates[col], label=col, linewidth=2)
    
    plt.title('Daily Demographic Updates Over Time', fontsize=14, fontweight='bold')
    plt.xlabel('Date')
    plt.ylabel('Number of Updates')
    plt.legend(title='Age Group', loc='upper right')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('time_series.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print('\n✅ Time series analysis completed')

### Step 8: State-wise Geographic Analysis

Examining the distribution of demographic updates across different states.
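
If state names are spelled inconsistently (different casing or stray whitespace), a group-by splits one state into several rows. Whether this dataset has that problem is an assumption to verify. The sketch below normalizes a made-up `state` column before aggregating.

```python
import pandas as pd

# Hypothetical sample with casing and whitespace variants of the same state
sample = pd.DataFrame({
    'state': ['Kerala', 'kerala ', 'Goa', 'GOA', 'Goa'],
    'updates': [10, 5, 1, 2, 3],
})

# Strip surrounding whitespace and apply title case so variants collapse
sample['state_clean'] = sample['state'].str.strip().str.title()

# Aggregate on the cleaned name and rank states by total updates
state_totals = (
    sample.groupby('state_clean')['updates']
    .sum()
    .sort_values(ascending=False)
)

print(state_totals)
```

Title-casing is a blunt normalization; names that differ by spelling rather than case would still need an explicit mapping.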

In [None]:
# Identify state column
state_cols = [col for col in df.columns if 'state' in col.lower()]
state_col = state_cols[0] if state_cols else None

if state_col:
    # State-wise aggregation
    state_summary = df.groupby(state_col)[age_cols].sum()
    state_totals = state_summary.sum(axis=1).sort_values(ascending=True)
    
    # Top 15 states visualization
    top_states = state_totals.tail(15)
    
    plt.figure(figsize=(12, 8))
    bars = plt.barh(top_states.index, top_states.values, color=sns.color_palette('viridis', len(top_states)))
    plt.title('Top 15 States by Total Demographic Updates', fontsize=14, fontweight='bold')
    plt.xlabel('Total Updates')
    plt.ylabel('State')
    
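    # Add value labels, offset by 1% of the longest bar so the gap scales with the data
    label_offset = top_states.max() * 0.01
    for bar, val in zip(bars, top_states.values):
        plt.text(val + label_offset, bar.get_y() + bar.get_height()/2, f'{val:,.0f}', va='center', fontsize=9)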
    
    plt.tight_layout()
    plt.savefig('state_wise_updates.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print('\n✅ State-wise analysis completed')

### Step 9: District-level Insights

Drilling down to district-level data for granular geographic analysis.
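
District names are not always unique across states, so grouping on the district name alone can merge unrelated districts. The sketch below contrasts the two groupings on made-up data; the column names and counts are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample in which the same district name appears in two states
sample = pd.DataFrame({
    'state': ['Bihar', 'Maharashtra', 'Bihar', 'Maharashtra'],
    'district': ['Aurangabad', 'Aurangabad', 'Patna', 'Pune'],
    'updates': [4, 9, 6, 2],
})

# Grouping by district name alone merges the two same-named districts
by_district_only = sample.groupby('district')['updates'].sum()

# Grouping by (state, district) keeps them separate
by_state_district = (
    sample.groupby(['state', 'district'])['updates']
    .sum()
    .sort_values(ascending=False)
)

print(by_district_only)
print(by_state_district)
```

The second result has a two-level index, so each entry in a ranking reads as a state and district pair.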

In [None]:
# Identify district column
district_cols = [col for col in df.columns if 'district' in col.lower()]
district_col = district_cols[0] if district_cols else None

if district_col:
    # District-wise aggregation
    district_summary = df.groupby(district_col)[age_cols].sum()
    district_totals = district_summary.sum(axis=1).sort_values(ascending=False)
    
    # Top 20 districts
    top_districts = district_totals.head(20)
    
    print('Top 20 Districts by Total Updates:')
    print('-' * 40)
    for i, (district, count) in enumerate(top_districts.items(), 1):
        print(f'{i:2}. {district}: {count:,.0f}')

### Step 10: Summary Statistics

Consolidated summary of key findings from the analysis.

In [None]:
print('=' * 60)
print('ANALYSIS SUMMARY')
print('=' * 60)
print(f'\n📊 Total Records Processed: {len(df):,}')
# Each column is detected by name in an earlier step and may be None, so guard each line
if date_col:
    print(f'📅 Date Range: {df[date_col].min().strftime("%d-%m-%Y")} to {df[date_col].max().strftime("%d-%m-%Y")}')
if state_col:
    print(f'🗺️  States Covered: {df[state_col].nunique()}')
if district_col:
    print(f'📍 Districts Covered: {df[district_col].nunique()}')
print('\n📈 Total Demographic Updates:')
for col in age_cols:
    print(f'   - {col}: {df[col].sum():,}')
print(f'\n   Total: {df[age_cols].sum().sum():,}')
print('\n' + '=' * 60)

### Step 11: Download Generated Visualizations

Downloading all generated charts and outputs.
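
The packaging step can be checked outside Colab by archiving a folder and listing what ended up in the zip. This is a minimal standard-library sketch; the placeholder files and paths are made up and stand in for the real charts.

```python
import os
import shutil
import tempfile
import zipfile

# Create a throwaway output folder with two placeholder "charts"
work_dir = tempfile.mkdtemp()
out_dir = os.path.join(work_dir, 'outputs')
os.makedirs(out_dir)
for name in ['chart_a.png', 'chart_b.png']:
    with open(os.path.join(out_dir, name), 'wb') as fh:
        fh.write(b'placeholder')

# make_archive appends the '.zip' extension and returns the archive's path
archive_path = shutil.make_archive(
    os.path.join(work_dir, 'analysis_outputs'), 'zip', out_dir)

# Confirm which image files actually landed in the archive
with zipfile.ZipFile(archive_path) as zf:
    archived = sorted(n for n in zf.namelist() if n.endswith('.png'))

print(archive_path)
print(archived)
```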

In [None]:
import shutil

# Create output directory
output_dir = '/content/outputs'
os.makedirs(output_dir, exist_ok=True)

# Copy generated images to output folder
for img in ['age_distribution_pie.png', 'age_group_updates.png', 'time_series.png', 'state_wise_updates.png']:
    if os.path.exists(img):
        shutil.copy(img, output_dir)

# Create zip archive
shutil.make_archive('/content/analysis_outputs', 'zip', output_dir)

# Download
from google.colab import files
files.download('/content/analysis_outputs.zip')

print('\n✅ All visualizations downloaded successfully!')

---
## Conclusion

This pipeline uploads, extracts, and merges the partitioned Aadhaar demographic update data into a single DataFrame, then summarizes update counts by age group, date, state, and district. The resulting charts and ranked lists show how update volume is distributed across age groups, geographic regions, and time.