Aadhaar Enrolment & Updates Analysis

A comprehensive data analysis system for unlocking societal trends in Aadhaar enrolment and updates. This project identifies meaningful patterns, anomalies, and predictive indicators while focusing on service center accessibility and geographic/demographic coverage gaps.

Overview

This analysis system processes three interconnected datasets (demographic, biometric, enrolment) to:

  • Perform univariate, bivariate, and multivariate analysis
  • Detect anomalies and outliers in enrolment patterns
  • Identify historical trends and forecast future volumes
  • Analyze service center accessibility and coverage gaps
  • Generate actionable insights and policy recommendations
  • Produce a comprehensive PDF report with visualizations

Installation

Prerequisites

  • Python 3.9 or higher
  • pip (Python package manager)
  • Virtual environment (recommended)

Setup Steps

  1. Clone or download the project

cd aadhaar-analysis

  2. Create a virtual environment

python -m venv venv

  3. Activate the virtual environment

On Windows:

venv\Scripts\activate

On macOS/Linux:

source venv/bin/activate

  4. Install dependencies

pip install -r requirements.txt

Data Setup

Data Directory Structure

The project expects data files in the following structure:

data/
├── demographic/
│   ├── api_data_aadhar_demographic_0_500000.csv
│   ├── api_data_aadhar_demographic_500000_1000000.csv
│   └── ... (additional demographic files)
├── biometric/
│   ├── api_data_aadhar_biometric_0_500000.csv
│   ├── api_data_aadhar_biometric_500000_1000000.csv
│   └── ... (additional biometric files)
└── enrolment/
    ├── api_data_aadhar_enrolment_0_500000.csv
    ├── api_data_aadhar_enrolment_500000_1000000.csv
    └── ... (additional enrolment files)

Expected Data Formats

Demographic Data (CSV):

  • date: Date of demographic update (YYYY-MM-DD)
  • state: State name
  • district: District name
  • pincode: Postal code
  • demo_age_5_17: Count of demographic updates for age 5-17
  • demo_age_17_: Count of demographic updates for age 17+

Biometric Data (CSV):

  • date: Date of biometric enrolment (YYYY-MM-DD)
  • state: State name
  • district: District name
  • pincode: Postal code
  • bio_age_5_17: Count of biometric enrolments for age 5-17
  • bio_age_17_: Count of biometric enrolments for age 17+

Enrolment Data (CSV):

  • date: Date of enrolment (YYYY-MM-DD)
  • state: State name
  • district: District name
  • pincode: Postal code
  • age_0_5: Enrolments for age 0-5
  • age_5_17: Enrolments for age 5-17
  • age_18_greater: Enrolments for age 18+
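
As a sanity check, a single shard can be loaded against this schema with pandas. A minimal sketch (the validation step is illustrative, not part of the pipeline):

import pandas as pd

# Minimal sketch: load one demographic shard and check the documented schema.
# Nullable Int32 tolerates missing counts; pincode stays a string to keep
# leading zeros.
df = pd.read_csv(
    "data/demographic/api_data_aadhar_demographic_0_500000.csv",
    parse_dates=["date"],
    dtype={
        "state": "string",
        "district": "string",
        "pincode": "string",
        "demo_age_5_17": "Int32",
        "demo_age_17_": "Int32",
    },
)

missing = {"date", "state", "district", "pincode"} - set(df.columns)
if missing:
    raise ValueError(f"missing expected columns: {missing}")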

Running the Analysis

Basic Usage

Run the complete analysis pipeline with default configuration:

python main.py

Advanced Usage

Run with custom configuration and output directory:

python main.py --config config.json --output custom_output --log-level INFO

Command-Line Options

  • --config CONFIG: Path to configuration file (JSON format)
  • --output OUTPUT: Output directory for results (default: output)
  • --log-level LEVEL: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)

Configuration File

The config.json file controls all analysis parameters:

{
  "data": {
    "demographic_dir": "data/demographic",
    "biometric_dir": "data/biometric",
    "enrolment_dir": "data/enrolment",
    "output_dir": "output",
    "cache_dir": ".cache"
  },
  "analysis": {
    "numeric_percentiles": [0.25, 0.5, 0.75],
    "outlier_methods": ["iqr", "zscore"],
    "iqr_multiplier": 1.5,
    "zscore_threshold": 3.0,
    "decomposition_period": 12,
    "forecast_periods": 12,
    "min_region_size": 10
  },
  "visualization": {
    "figure_size": [12, 6],
    "dpi": 300,
    "style": "seaborn-v0_8-darkgrid",
    "color_palette": "husl",
    "save_format": "png",
    "map_zoom_level": 5,
    "choropleth_colormap": "YlOrRd"
  },
  "report": {
    "report_title": "Aadhaar Enrolment & Updates Analysis Report",
    "author": "Data Analysis Team",
    "include_code_appendix": true,
    "include_methodology": true,
    "include_recommendations": true,
    "page_size": "A4",
    "margin_inches": 0.75
  },
  "log_level": "INFO",
  "log_file": "analysis.log"
}
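
A minimal sketch of reading these settings with the standard library (variable names are illustrative, not the project's config API):

import json
from pathlib import Path

# Minimal sketch: read config.json and pull nested settings with defaults.
config = json.loads(Path("config.json").read_text())

iqr_multiplier = config["analysis"].get("iqr_multiplier", 1.5)
fig_w, fig_h = config["visualization"].get("figure_size", [12, 6])
print(iqr_multiplier, fig_w, fig_h)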

Output

Generated Files

After running the analysis, the following files are generated in the output directory:

output/
├── aadhaar_analysis_report.pdf          # Main PDF report
├── demographic_cleaned.parquet          # Cleaned demographic data
├── biometric_cleaned.parquet            # Cleaned biometric data
├── enrolment_cleaned.parquet            # Cleaned enrolment data
├── preprocessing_report.json            # Data quality report
└── visualizations/
    ├── distribution_*.png               # Distribution plots
    ├── correlation_heatmap.png          # Correlation matrix
    ├── time_series.png                  # Time series plot
    ├── geographic_map.png               # Geographic visualization
    ├── comparative_analysis.png         # Comparative plots
    └── accessibility_dashboard.png      # Accessibility dashboard

PDF Report Contents

The generated PDF report includes:

  1. Title Page: Report title, author, and generation date
  2. Table of Contents: Navigable section index
  3. Problem Statement: Overview of analysis objectives
  4. Datasets: Description of data sources and structure
  5. Methodology: Data cleaning, preprocessing, and analysis methods
  6. Analysis Results:
    • Univariate analysis (distributions, trends)
    • Bivariate analysis (correlations, relationships)
    • Anomaly detection findings
    • Trend analysis and forecasts
    • Service accessibility analysis
  7. Visualizations: All generated charts and maps
  8. Key Findings: Synthesized insights and patterns
  9. Recommendations: Policy recommendations based on findings
  10. Code Appendix: Complete Python source code with syntax highlighting

Analysis Phases

The pipeline executes the following analysis phases:

Phase 1: Data Loading

  • Loads all CSV files from configured directories
  • Validates data schema and integrity
  • Reports data quality metrics
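
The chunked-reading pattern noted under Performance Considerations looks roughly like this; the chunk size and concatenate-at-the-end strategy are illustrative choices, not the loader's exact implementation:

from pathlib import Path

import pandas as pd

# Sketch: stream every enrolment CSV in 100,000-row chunks, then concatenate.
chunks = []
for csv_path in sorted(Path("data/enrolment").glob("*.csv")):
    for chunk in pd.read_csv(csv_path, chunksize=100_000, parse_dates=["date"]):
        chunks.append(chunk)

enrolment = pd.concat(chunks, ignore_index=True)
print(f"loaded {len(enrolment):,} rows")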

Phase 2: Data Preprocessing

  • Handles missing values (forward fill, median/mode imputation)
  • Standardizes geographic data (state/district names, pincodes)
  • Parses temporal data (date conversion, feature extraction)
  • Removes duplicates
  • Saves cleaned data in Parquet format
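
A sketch of these steps on an enrolment frame df, assuming the documented column names; the per-district forward fill is an illustrative choice:

# Standardize geographic fields.
df = df.sort_values("date")
df["state"] = df["state"].str.strip().str.title()
df["pincode"] = df["pincode"].astype("string").str.zfill(6)

# Missing values: forward fill within each district, then median fallback.
df["age_0_5"] = df.groupby("district")["age_0_5"].ffill()
df["age_0_5"] = df["age_0_5"].fillna(df["age_0_5"].median())

# Deduplicate and persist in Parquet format.
df = df.drop_duplicates()
df.to_parquet("output/enrolment_cleaned.parquet", index=False)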

Phase 3: Univariate Analysis

  • Computes descriptive statistics (mean, median, std dev, quartiles)
  • Analyzes distributions for numeric and categorical variables
  • Identifies temporal trends
  • Analyzes geographic patterns
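
In pandas terms, reusing the enrolment frame from the Phase 1 sketch, the core of this phase reduces to:

# Descriptive statistics at the percentiles set in config.json,
# plus a simple daily trend.
stats = enrolment[["age_0_5", "age_5_17", "age_18_greater"]].describe(
    percentiles=[0.25, 0.5, 0.75]
)
daily_totals = enrolment.groupby("date")[
    ["age_0_5", "age_5_17", "age_18_greater"]
].sum()
print(stats)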

Phase 4: Bivariate & Multivariate Analysis

  • Computes correlation matrices
  • Analyzes categorical relationships (cross-tabulations)
  • Identifies temporal relationships and synchronization
  • Detects geographic clustering and regional patterns
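
A sketch of the first two steps, again on the enrolment frame (column names per the documented schema):

import pandas as pd

# Correlation matrix over the age-group counts.
corr = enrolment[["age_0_5", "age_5_17", "age_18_greater"]].corr()

# State-by-month cross-tabulation of adult enrolments.
state_month = pd.crosstab(
    enrolment["state"],
    enrolment["date"].dt.to_period("M"),
    values=enrolment["age_18_greater"],
    aggfunc="sum",
)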

Phase 5: Anomaly Detection

  • Statistical outlier detection (IQR, z-score methods)
  • Temporal anomaly detection (change points, seasonal anomalies)
  • Geographic anomaly detection (regional outliers)
  • Multivariate anomaly detection (Isolation Forest)
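
A sketch of the named methods on a single count column; thresholds match the defaults in config.json:

import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

x = enrolment["age_18_greater"].astype(float)

# IQR rule with the configured multiplier (1.5).
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Z-score rule with the configured threshold (3.0).
z_mask = np.abs(stats.zscore(x, nan_policy="omit")) > 3.0

# Multivariate anomalies via Isolation Forest over all age groups.
iso = IsolationForest(random_state=0)
iso_mask = iso.fit_predict(
    enrolment[["age_0_5", "age_5_17", "age_18_greater"]]
) == -1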

Phase 6: Trend Analysis & Forecasting

  • Time series decomposition (trend, seasonal, residual)
  • Trend phase identification (growth, plateau, decline)
  • Future value forecasting (ARIMA, exponential smoothing)
  • Demographic shift analysis
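
A sketch with statsmodels, using the period and horizon from config.json; exponential smoothing stands in here for whichever forecaster the pipeline selects:

from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose

# Monthly series: decompose into trend/seasonal/residual, then forecast.
monthly = enrolment.set_index("date")["age_18_greater"].resample("MS").sum()

decomposition = seasonal_decompose(monthly, model="additive", period=12)

fit = ExponentialSmoothing(
    monthly, trend="add", seasonal="add", seasonal_periods=12
).fit()
forecast = fit.forecast(12)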

Phase 7: Service Accessibility Analysis

  • Coverage gap detection (demographic vs biometric mismatches)
  • Service desert identification (high-population, low-enrolment areas)
  • Age group coverage analysis
  • Stagnant region identification
  • Accessibility index computation
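
For intuition only, one plausible coverage-gap measure (not the pipeline's actual index), assuming demographic and biometric frames loaded like the enrolment frame above:

# Illustrative only: districts where biometric activity lags demographic
# activity, i.e. the "coverage gap" idea described above.
demo_total = demographic.groupby("district")[
    ["demo_age_5_17", "demo_age_17_"]
].sum().sum(axis=1)
bio_total = biometric.groupby("district")[
    ["bio_age_5_17", "bio_age_17_"]
].sum().sum(axis=1)

coverage_ratio = (bio_total / demo_total).fillna(0)   # < 1 hints at a gap
gap_districts = coverage_ratio[coverage_ratio < 0.5].sort_values()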

Phase 8: Visualization Generation

  • Distribution plots (histograms, box plots, violin plots)
  • Correlation heatmaps
  • Time series plots with trend lines
  • Geographic maps (choropleth)
  • Comparative analysis charts
  • Accessibility dashboard
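
A sketch of two of these outputs with matplotlib/seaborn, using the figure size, DPI, and colormap from config.json (corr is from the Phase 4 sketch):

import matplotlib

matplotlib.use("Agg")                    # headless-safe backend for batch runs
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution plot for one age group.
fig, ax = plt.subplots(figsize=(12, 6))
sns.histplot(enrolment["age_18_greater"], ax=ax)
fig.savefig("output/visualizations/distribution_age_18_greater.png", dpi=300)

# Correlation heatmap.
fig, ax = plt.subplots(figsize=(12, 6))
sns.heatmap(corr, annot=True, cmap="YlOrRd", ax=ax)
fig.savefig("output/visualizations/correlation_heatmap.png", dpi=300)
plt.close("all")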

Phase 9: Insight Synthesis

  • Converts statistical findings to plain language
  • Flags plausible causal relationships and recurring patterns
  • Highlights unexpected findings
  • Generates executive summary

Phase 10: Report Generation

  • Creates comprehensive PDF report
  • Embeds all visualizations
  • Includes code appendix
  • Generates table of contents and page numbers
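
The underlying reportlab pattern, sketched with placeholder content:

from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch
from reportlab.platypus import Image, Paragraph, SimpleDocTemplate, Spacer

# Sketch: a flowable document mixing text and an embedded visualization.
styles = getSampleStyleSheet()
doc = SimpleDocTemplate(
    "output/aadhaar_analysis_report.pdf",
    pagesize=A4,
    topMargin=0.75 * inch,
    bottomMargin=0.75 * inch,
)
story = [
    Paragraph("Aadhaar Enrolment & Updates Analysis Report", styles["Title"]),
    Spacer(1, 12),
    Paragraph("Correlation between age-group counts:", styles["Normal"]),
    Image("output/visualizations/correlation_heatmap.png",
          width=6.5 * inch, height=3.25 * inch),
]
doc.build(story)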

Testing

Running Tests

Run all tests:

pytest

Run tests with coverage:

pytest --cov=src --cov-report=html

Run specific test file:

pytest tests/test_dataloading.py

Run with verbose output:

pytest -v

Test Structure

Tests are organized by module:

tests/
├── test_dataloading.py              # Data loading tests
├── test_preprocessing.py            # Preprocessing tests
├── test_univariate_analysis.py      # Univariate analysis tests
├── test_bivariate_analysis.py       # Bivariate analysis tests
├── test_anomaly_detection.py        # Anomaly detection tests
├── test_trend_analysis.py           # Trend analysis tests
├── test_accessibility_analysis.py   # Accessibility analysis tests
├── test_visualization.py            # Visualization tests
├── test_insight_synthesis.py        # Insight synthesis tests
├── test_report_generation.py        # Report generation tests
├── test_config.py                   # Configuration tests
├── test_logging_config.py           # Logging tests
└── conftest.py                      # Pytest fixtures
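
A new test would follow the usual pytest shape; this example is illustrative and uses a stand-in cleaner, not the project's preprocessing module:

import pandas as pd
import pytest


@pytest.fixture
def raw_enrolment():
    # Two identical rows so deduplication has something to remove.
    row = {
        "date": "2024-01-01", "state": "Kerala", "district": "Idukki",
        "pincode": "685501", "age_0_5": 3, "age_5_17": 7, "age_18_greater": 11,
    }
    return pd.DataFrame([row, row])


def test_duplicates_removed(raw_enrolment):
    cleaned = raw_enrolment.drop_duplicates()   # stand-in for the real cleaner
    assert len(cleaned) == 1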

Code Quality

Code Formatting

Format code with Black:

black src/ tests/ main.py

Linting

Check code style with Flake8:

flake8 src/ tests/ main.py

Type Checking

Check types with Mypy:

mypy src/ main.py

Project Structure

aadhaar-analysis/
├── main.py                          # Main orchestration script
├── config.json                      # Configuration file
├── requirements.txt                 # Python dependencies
├── README.md                        # This file
├── pytest.ini                       # Pytest configuration
├── src/                             # Source code
│   ├── __init__.py
│   ├── config.py                    # Configuration management
│   ├── logging_config.py            # Logging setup
│   ├── dataloading.py               # Data loading module
│   ├── preprocessing.py             # Data preprocessing
│   ├── univariate_analysis.py       # Univariate analysis
│   ├── bivariate_analysis.py        # Bivariate analysis
│   ├── anomaly_detection.py         # Anomaly detection
│   ├── trend_analysis.py            # Trend analysis
│   ├── accessibility_analysis.py    # Accessibility analysis
│   ├── visualization.py             # Visualization generation
│   ├── insight_synthesis.py         # Insight synthesis
│   ├── report_generation.py         # PDF report generation
│   ├── baseline_model.py            # Baseline model (optional)
│   └── feature_engineering.py       # Feature engineering (optional)
├── tests/                           # Test suite
│   ├── __init__.py
│   ├── conftest.py                  # Pytest fixtures
│   └── test_*.py                    # Module-specific tests
├── data/                            # Input data directory
│   ├── demographic/                 # Demographic CSV files
│   ├── biometric/                   # Biometric CSV files
│   └── enrolment/                   # Enrolment CSV files
├── output/                          # Output directory (generated)
│   ├── aadhaar_analysis_report.pdf  # Main report
│   ├── *_cleaned.parquet            # Cleaned data
│   └── visualizations/              # Generated plots
└── .cache/                          # Cache directory (generated)

Troubleshooting

Common Issues

Issue: "No such file or directory" for data files

  • Ensure data files are in the correct directory structure
  • Check that file paths in config.json are correct
  • Verify CSV files have the expected column names

Issue: Memory error with large datasets

  • Reduce the number of files processed at once
  • Use data sampling for initial testing
  • Increase available system memory

Issue: Visualization not displaying

  • Ensure matplotlib backend is set correctly
  • Check that output directory has write permissions
  • Verify image format is supported
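
On headless machines, forcing a non-interactive backend before matplotlib.pyplot is imported usually resolves this:

import matplotlib

matplotlib.use("Agg")   # must run before importing matplotlib.pyplot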

Issue: PDF generation fails

  • Ensure reportlab is installed: pip install reportlab
  • Check that output directory exists and is writable
  • Verify all visualizations were generated successfully

Getting Help

For issues or questions:

  1. Check the log file (analysis.log) for detailed error messages
  2. Review the test files for usage examples
  3. Consult the docstrings in source code modules
  4. Check the generated PDF report for analysis details

Performance Considerations

  • Data Loading: Optimized for large CSV files using chunked reading
  • Memory Usage: Efficient data types (int32 for counts, datetime64 for dates)
  • Computation: Vectorized operations using numpy/pandas
  • Visualization: Lazy loading of large maps, cached computations

Dependencies

Key dependencies and their purposes:

  • pandas: Data manipulation and analysis
  • numpy: Numerical computing
  • scikit-learn: Machine learning (anomaly detection, clustering)
  • scipy: Scientific computing (statistics, signal processing)
  • statsmodels: Statistical modeling (time series, ARIMA)
  • matplotlib: Plotting and visualization
  • seaborn: Statistical data visualization
  • plotly: Interactive visualizations
  • folium: Geographic mapping
  • reportlab: PDF generation
  • pytest: Testing framework
  • hypothesis: Property-based testing

License

This project is provided as-is for analysis and research purposes.

Contact

For questions or feedback about this analysis system, please contact the Data Analysis Team.

Changelog

Version 1.0.0 (Initial Release)

  • Complete analysis pipeline implementation
  • All analysis modules (univariate, bivariate, anomaly detection, trends, accessibility)
  • Comprehensive visualization generation
  • PDF report generation with code appendix
  • Full test coverage
  • Configuration management
  • Logging and error handling
