A comprehensive data analysis system for uncovering societal trends in Aadhaar enrolment and updates. The project identifies meaningful patterns, anomalies, and predictive indicators, with a focus on service center accessibility and geographic/demographic coverage gaps.
This analysis system processes three interconnected datasets (demographic, biometric, enrolment) to:
- Perform univariate, bivariate, and multivariate analysis
- Detect anomalies and outliers in enrolment patterns
- Identify historical trends and forecast future volumes
- Analyze service center accessibility and coverage gaps
- Generate actionable insights and policy recommendations
- Produce a comprehensive PDF report with visualizations
Prerequisites:

- Python 3.9 or higher
- pip (Python package manager)
- Virtual environment (recommended)
Installation:

- Clone or download the project
  ```bash
  cd aadhaar-analysis
  ```

- Create a virtual environment

  ```bash
  python -m venv venv
  ```

- Activate the virtual environment

  On Windows:

  ```bash
  venv\Scripts\activate
  ```

  On macOS/Linux:

  ```bash
  source venv/bin/activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

The project expects data files in the following structure:
```
data/
├── demographic/
│   ├── api_data_aadhar_demographic_0_500000.csv
│   ├── api_data_aadhar_demographic_500000_1000000.csv
│   └── ... (additional demographic files)
├── biometric/
│   ├── api_data_aadhar_biometric_0_500000.csv
│   ├── api_data_aadhar_biometric_500000_1000000.csv
│   └── ... (additional biometric files)
└── enrolment/
    ├── api_data_aadhar_enrolment_0_500000.csv
    ├── api_data_aadhar_enrolment_500000_1000000.csv
    └── ... (additional enrolment files)
```
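For orientation, a minimal sketch of loading one dataset's part files with pandas. This is illustrative only; the project's `dataloading.py` module handles validation and chunked reading, and the `load_dataset` helper shown here is an assumption, not the actual API:

```python
from pathlib import Path

import pandas as pd


def load_dataset(directory: str) -> pd.DataFrame:
    """Concatenate all CSV part files found in one dataset directory."""
    files = sorted(Path(directory).glob("*.csv"))
    frames = [pd.read_csv(f, parse_dates=["date"]) for f in files]
    return pd.concat(frames, ignore_index=True)


demographic = load_dataset("data/demographic")
```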
Demographic Data (CSV):

- `date`: Date of demographic update (YYYY-MM-DD)
- `state`: State name
- `district`: District name
- `pincode`: Postal code
- `demo_age_5_17`: Count of demographic updates for ages 5-17
- `demo_age_17_`: Count of demographic updates for ages 17+

Biometric Data (CSV):

- `date`: Date of biometric enrolment (YYYY-MM-DD)
- `state`: State name
- `district`: District name
- `pincode`: Postal code
- `bio_age_5_17`: Count of biometric enrolments for ages 5-17
- `bio_age_17_`: Count of biometric enrolments for ages 17+

Enrolment Data (CSV):

- `date`: Date of enrolment (YYYY-MM-DD)
- `state`: State name
- `district`: District name
- `pincode`: Postal code
- `age_0_5`: Enrolments for ages 0-5
- `age_5_17`: Enrolments for ages 5-17
- `age_18_greater`: Enrolments for ages 18+
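A small, hedged schema check built from the column lists above. The `validate_schema` helper is illustrative, not the project's actual validation code:

```python
import pandas as pd

EXPECTED_COLUMNS = {
    "demographic": ["date", "state", "district", "pincode", "demo_age_5_17", "demo_age_17_"],
    "biometric": ["date", "state", "district", "pincode", "bio_age_5_17", "bio_age_17_"],
    "enrolment": ["date", "state", "district", "pincode", "age_0_5", "age_5_17", "age_18_greater"],
}


def validate_schema(df: pd.DataFrame, dataset: str) -> None:
    """Raise if any expected column is missing from the loaded dataframe."""
    missing = set(EXPECTED_COLUMNS[dataset]) - set(df.columns)
    if missing:
        raise ValueError(f"{dataset}: missing columns {sorted(missing)}")
```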
Run the complete analysis pipeline with the default configuration:

```bash
python main.py
```

Run with a custom configuration and output directory:

```bash
python main.py --config config.json --output custom_output --log-level INFO
```

Command-line options:

- `--config CONFIG`: Path to configuration file (JSON format)
- `--output OUTPUT`: Output directory for results (default: `output`)
- `--log-level LEVEL`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
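These flags map naturally onto argparse. A minimal sketch of such a parser (an assumption, not necessarily main.py's exact implementation):

```python
import argparse

parser = argparse.ArgumentParser(description="Aadhaar enrolment analysis pipeline")
parser.add_argument("--config", default="config.json", help="Path to JSON configuration file")
parser.add_argument("--output", default="output", help="Output directory for results")
parser.add_argument(
    "--log-level",
    default="INFO",
    choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
    help="Logging verbosity",
)
args = parser.parse_args()  # access as args.config, args.output, args.log_level
```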
The config.json file controls all analysis parameters:

```json
{
  "data": {
    "demographic_dir": "data/demographic",
    "biometric_dir": "data/biometric",
    "enrolment_dir": "data/enrolment",
    "output_dir": "output",
    "cache_dir": ".cache"
  },
  "analysis": {
    "numeric_percentiles": [0.25, 0.5, 0.75],
    "outlier_methods": ["iqr", "zscore"],
    "iqr_multiplier": 1.5,
    "zscore_threshold": 3.0,
    "decomposition_period": 12,
    "forecast_periods": 12,
    "min_region_size": 10
  },
  "visualization": {
    "figure_size": [12, 6],
    "dpi": 300,
    "style": "seaborn-v0_8-darkgrid",
    "color_palette": "husl",
    "save_format": "png",
    "map_zoom_level": 5,
    "choropleth_colormap": "YlOrRd"
  },
  "report": {
    "report_title": "Aadhaar Enrolment & Updates Analysis Report",
    "author": "Data Analysis Team",
    "include_code_appendix": true,
    "include_methodology": true,
    "include_recommendations": true,
    "page_size": "A4",
    "margin_inches": 0.75
  },
  "log_level": "INFO",
  "log_file": "analysis.log"
}
```
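A short sketch of reading these values at runtime with the standard json module (illustrative; the project's `config.py` may wrap this in validation and defaults):

```python
import json

with open("config.json", encoding="utf-8") as fh:
    config = json.load(fh)

iqr_multiplier = config["analysis"]["iqr_multiplier"]        # 1.5
figure_size = tuple(config["visualization"]["figure_size"])  # (12, 6)
```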
After running the analysis, the following files are generated in the output directory:

```
output/
├── aadhaar_analysis_report.pdf      # Main PDF report
├── demographic_cleaned.parquet      # Cleaned demographic data
├── biometric_cleaned.parquet        # Cleaned biometric data
├── enrolment_cleaned.parquet        # Cleaned enrolment data
├── preprocessing_report.json        # Data quality report
└── visualizations/
    ├── distribution_*.png           # Distribution plots
    ├── correlation_heatmap.png      # Correlation matrix
    ├── time_series.png              # Time series plot
    ├── geographic_map.png           # Geographic visualization
    ├── comparative_analysis.png     # Comparative plots
    └── accessibility_dashboard.png  # Accessibility dashboard
```
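The cleaned Parquet files can be reloaded directly for follow-up work. A small sketch, assuming pandas with a Parquet engine such as pyarrow installed:

```python
import pandas as pd

# Reload cleaned data without re-running the preprocessing phase.
enrolment = pd.read_parquet("output/enrolment_cleaned.parquet")
print(enrolment.dtypes)
```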
The generated PDF report includes:
- Title Page: Report title, author, and generation date
- Table of Contents: Navigable section index
- Problem Statement: Overview of analysis objectives
- Datasets: Description of data sources and structure
- Methodology: Data cleaning, preprocessing, and analysis methods
- Analysis Results:
  - Univariate analysis (distributions, trends)
  - Bivariate analysis (correlations, relationships)
  - Anomaly detection findings
  - Trend analysis and forecasts
  - Service accessibility analysis
- Visualizations: All generated charts and maps
- Key Findings: Synthesized insights and patterns
- Recommendations: Policy recommendations based on findings
- Code Appendix: Complete Python source code with syntax highlighting
The pipeline executes the following analysis phases:
Data Loading:

- Loads all CSV files from the configured directories
- Validates data schema and integrity
- Reports data quality metrics
Preprocessing:

- Handles missing values (forward fill, median/mode imputation)
- Standardizes geographic data (state/district names, pincodes)
- Parses temporal data (date conversion, feature extraction)
- Removes duplicates
- Saves cleaned data in Parquet format (see the sketch below)
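A minimal sketch of the preprocessing steps above, assuming pandas and the column names documented earlier. The `clean` function is illustrative; the project's `preprocessing.py` is the authoritative implementation:

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass mirroring the steps listed above."""
    df = df.drop_duplicates().copy()
    df["state"] = df["state"].str.strip().str.title()        # standardize state names
    df["pincode"] = df["pincode"].astype(str).str.zfill(6)   # 6-digit Indian pincodes
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    numeric = df.select_dtypes("number").columns
    df[numeric] = df[numeric].fillna(df[numeric].median())   # median imputation
    df["year"], df["month"] = df["date"].dt.year, df["date"].dt.month
    return df
```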
Univariate Analysis:

- Computes descriptive statistics (mean, median, std dev, quartiles)
- Analyzes distributions for numeric and categorical variables
- Identifies temporal trends
- Analyzes geographic patterns
Bivariate Analysis:

- Computes correlation matrices (see the sketch below)
- Analyzes categorical relationships (cross-tabulations)
- Identifies temporal relationships and synchronization
- Detects geographic clustering and regional patterns
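As an illustration of the correlation and cross-tabulation steps, assuming the cleaned enrolment Parquet file and a `month` column produced during preprocessing:

```python
import pandas as pd

enrolment = pd.read_parquet("output/enrolment_cleaned.parquet")

# Pairwise Pearson correlations between the age-group counts.
corr = enrolment[["age_0_5", "age_5_17", "age_18_greater"]].corr()

# Cross-tabulation: total adult enrolments by state and month.
by_state_month = enrolment.pivot_table(
    index="state", columns="month", values="age_18_greater", aggfunc="sum"
)
```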
Anomaly Detection:

- Statistical outlier detection (IQR and z-score methods)
- Temporal anomaly detection (change points, seasonal anomalies)
- Geographic anomaly detection (regional outliers)
- Multivariate anomaly detection (Isolation Forest); see the sketch below
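A hedged sketch of the three statistical detectors named above (IQR, z-score, Isolation Forest), assuming scipy and scikit-learn; the thresholds mirror the config defaults:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import IsolationForest

enrolment = pd.read_parquet("output/enrolment_cleaned.parquet")
series = enrolment["age_18_greater"]

# IQR rule: flag values beyond 1.5 * IQR outside the quartiles.
q1, q3 = series.quantile([0.25, 0.75])
iqr_mask = (series < q1 - 1.5 * (q3 - q1)) | (series > q3 + 1.5 * (q3 - q1))

# Z-score rule: flag values more than 3 standard deviations from the mean.
z_mask = np.abs(stats.zscore(series)) > 3.0

# Multivariate detection with Isolation Forest over all age-group counts.
X = enrolment[["age_0_5", "age_5_17", "age_18_greater"]]
labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)  # -1 = anomaly
```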
Trend Analysis:

- Time series decomposition (trend, seasonal, residual)
- Trend phase identification (growth, plateau, decline)
- Future value forecasting (ARIMA, exponential smoothing); see the sketch below
- Demographic shift analysis
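A sketch of decomposition and forecasting with statsmodels, assuming a monthly aggregate of the enrolment data; the project may use ARIMA instead of, or in addition to, Holt-Winters smoothing:

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose

enrolment = pd.read_parquet("output/enrolment_cleaned.parquet")

# Monthly national enrolment totals as a time series.
monthly = enrolment.set_index("date")["age_18_greater"].resample("MS").sum()

# Decompose into trend, seasonal, and residual components (period = 12 months).
decomposition = seasonal_decompose(monthly, model="additive", period=12)

# Forecast the next 12 months with Holt-Winters exponential smoothing.
model = ExponentialSmoothing(monthly, trend="add", seasonal="add", seasonal_periods=12).fit()
forecast = model.forecast(12)
```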
Accessibility Analysis:

- Coverage gap detection (demographic vs. biometric mismatches); see the sketch below
- Service desert identification (high-population, low-enrolment areas)
- Age group coverage analysis
- Stagnant region identification
- Accessibility index computation
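One simple way to surface coverage gaps is a biometric-to-demographic ratio per district. This is an illustrative heuristic, not the project's exact accessibility index, and the 0.5 cutoff is an arbitrary example:

```python
import pandas as pd

demographic = pd.read_parquet("output/demographic_cleaned.parquet")
biometric = pd.read_parquet("output/biometric_cleaned.parquet")

# Ratio of biometric enrolments to demographic updates per district (ages 5-17).
demo = demographic.groupby("district")["demo_age_5_17"].sum()
bio = biometric.groupby("district")["bio_age_5_17"].sum()
coverage = (bio / demo).rename("bio_to_demo_ratio")

# Districts where the ratio is low are candidate coverage gaps.
gaps = coverage[coverage < 0.5].sort_values()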
Visualization:

- Distribution plots (histograms, box plots, violin plots)
- Correlation heatmaps (see the sketch below)
- Time series plots with trend lines
- Geographic maps (choropleth)
- Comparative analysis charts
- Accessibility dashboard
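A minimal matplotlib/seaborn sketch of the heatmap output, using the figure size, DPI, and colormap from the config defaults:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for batch generation
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

enrolment = pd.read_parquet("output/enrolment_cleaned.parquet")
corr = enrolment[["age_0_5", "age_5_17", "age_18_greater"]].corr()

fig, ax = plt.subplots(figsize=(12, 6), dpi=300)
sns.heatmap(corr, annot=True, cmap="YlOrRd", ax=ax)
ax.set_title("Correlation between age-group enrolment counts")
fig.savefig("output/visualizations/correlation_heatmap.png", bbox_inches="tight")
plt.close(fig)
```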
Insight Synthesis:

- Converts statistical findings to plain language
- Identifies potential causal relationships and patterns
- Highlights unexpected findings
- Generates an executive summary
Report Generation:

- Creates the comprehensive PDF report (see the sketch below)
- Embeds all visualizations
- Includes the code appendix
- Generates the table of contents and page numbers
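For orientation, a minimal reportlab sketch of how a report like this can be assembled. Illustrative only: the project's `report_generation.py` builds a far richer document, and the image step assumes the heatmap PNG already exists:

```python
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Image, Paragraph, SimpleDocTemplate

styles = getSampleStyleSheet()
story = [
    Paragraph("Aadhaar Enrolment & Updates Analysis Report", styles["Title"]),
    Paragraph("Univariate analysis of enrolment distributions.", styles["BodyText"]),
    # Assumes the visualization phase has already written this file.
    Image("output/visualizations/correlation_heatmap.png", width=480, height=240),
]
SimpleDocTemplate("output/aadhaar_analysis_report.pdf", pagesize=A4).build(story)
```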
Run all tests:

```bash
pytest
```

Run tests with coverage:

```bash
pytest --cov=src --cov-report=html
```

Run a specific test file:

```bash
pytest tests/test_dataloading.py
```

Run with verbose output:

```bash
pytest -v
```

Tests are organized by module:
```
tests/
├── test_dataloading.py             # Data loading tests
├── test_preprocessing.py           # Preprocessing tests
├── test_univariate_analysis.py     # Univariate analysis tests
├── test_bivariate_analysis.py      # Bivariate analysis tests
├── test_anomaly_detection.py       # Anomaly detection tests
├── test_trend_analysis.py          # Trend analysis tests
├── test_accessibility_analysis.py  # Accessibility analysis tests
├── test_visualization.py           # Visualization tests
├── test_insight_synthesis.py       # Insight synthesis tests
├── test_report_generation.py       # Report generation tests
├── test_config.py                  # Configuration tests
├── test_logging_config.py          # Logging tests
└── conftest.py                     # Pytest fixtures
```
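As a hedged illustration of how tests in this suite might be structured (the actual fixtures in conftest.py may differ, and `raw_frame` is a hypothetical fixture name):

```python
import pandas as pd
import pytest


@pytest.fixture
def raw_frame() -> pd.DataFrame:
    """Tiny synthetic frame with a deliberate duplicate row."""
    return pd.DataFrame({
        "date": ["2023-01-01", "2023-01-01"],
        "state": [" delhi ", " delhi "],
        "pincode": [110001, 110001],
        "age_18_greater": [5, 5],
    })


def test_duplicates_removed(raw_frame: pd.DataFrame) -> None:
    cleaned = raw_frame.drop_duplicates()
    assert len(cleaned) == 1
```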
Format code with Black:

```bash
black src/ tests/ main.py
```

Check code style with Flake8:

```bash
flake8 src/ tests/ main.py
```

Check types with Mypy:

```bash
mypy src/ main.py
```

Project structure:

```
aadhaar-analysis/
├── main.py                          # Main orchestration script
├── config.json                      # Configuration file
├── requirements.txt                 # Python dependencies
├── README.md                        # This file
├── pytest.ini                       # Pytest configuration
├── src/                             # Source code
│   ├── __init__.py
│   ├── config.py                    # Configuration management
│   ├── logging_config.py            # Logging setup
│   ├── dataloading.py               # Data loading module
│   ├── preprocessing.py             # Data preprocessing
│   ├── univariate_analysis.py       # Univariate analysis
│   ├── bivariate_analysis.py        # Bivariate analysis
│   ├── anomaly_detection.py         # Anomaly detection
│   ├── trend_analysis.py            # Trend analysis
│   ├── accessibility_analysis.py    # Accessibility analysis
│   ├── visualization.py             # Visualization generation
│   ├── insight_synthesis.py         # Insight synthesis
│   ├── report_generation.py         # PDF report generation
│   ├── baseline_model.py            # Baseline model (optional)
│   └── feature_engineering.py       # Feature engineering (optional)
├── tests/                           # Test suite
│   ├── __init__.py
│   ├── conftest.py                  # Pytest fixtures
│   └── test_*.py                    # Module-specific tests
├── data/                            # Input data directory
│   ├── demographic/                 # Demographic CSV files
│   ├── biometric/                   # Biometric CSV files
│   └── enrolment/                   # Enrolment CSV files
├── output/                          # Output directory (generated)
│   ├── aadhaar_analysis_report.pdf  # Main report
│   ├── *_cleaned.parquet            # Cleaned data
│   └── visualizations/              # Generated plots
└── .cache/                          # Cache directory (generated)
```
Issue: "No such file or directory" for data files
- Ensure data files are in the correct directory structure
- Check that file paths in config.json are correct
- Verify CSV files have the expected column names
Issue: Memory error with large datasets
- Reduce the number of files processed at once
- Use data sampling for initial testing
- Increase available system memory
Issue: Visualization not displaying
- Ensure matplotlib backend is set correctly
- Check that output directory has write permissions
- Verify image format is supported
Issue: PDF generation fails
- Ensure reportlab is installed: `pip install reportlab`
- Check that the output directory exists and is writable
- Verify all visualizations were generated successfully
For issues or questions:
- Check the log file (`analysis.log`) for detailed error messages
- Review the test files for usage examples
- Consult the docstrings in source code modules
- Check the generated PDF report for analysis details
Performance notes:

- Data Loading: Optimized for large CSV files using chunked reading
- Memory Usage: Efficient data types (int32 for counts, datetime64 for dates)
- Computation: Vectorized operations using numpy/pandas
- Visualization: Lazy loading of large maps, cached computations
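A sketch of the chunked-reading pattern with explicit dtypes, assuming pandas; the column and file names follow the data layout documented above:

```python
import pandas as pd

# Read a large CSV in chunks with memory-efficient dtypes.
dtypes = {
    "age_0_5": "int32", "age_5_17": "int32", "age_18_greater": "int32",
    "state": "category", "district": "category",
}
chunks = pd.read_csv(
    "data/enrolment/api_data_aadhar_enrolment_0_500000.csv",
    dtype=dtypes, parse_dates=["date"], chunksize=100_000,
)
total_adult = sum(chunk["age_18_greater"].sum() for chunk in chunks)
```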
Key dependencies and their purposes:
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- scikit-learn: Machine learning (anomaly detection, clustering)
- scipy: Scientific computing (statistics, signal processing)
- statsmodels: Statistical modeling (time series, ARIMA)
- matplotlib: Plotting and visualization
- seaborn: Statistical data visualization
- plotly: Interactive visualizations
- folium: Geographic mapping
- reportlab: PDF generation
- pytest: Testing framework
- hypothesis: Property-based testing
This project is provided as-is for analysis and research purposes.
For questions or feedback about this analysis system, please contact the Data Analysis Team.
Implemented components:

- Complete analysis pipeline implementation
- All analysis modules (univariate, bivariate, anomaly detection, trends, accessibility)
- Comprehensive visualization generation
- PDF report generation with code appendix
- Full test coverage
- Configuration management
- Logging and error handling