A comprehensive data analysis system for uncovering societal trends in Aadhaar enrolment and updates. The project identifies meaningful patterns, anomalies, and predictive indicators, with a focus on service center accessibility and geographic/demographic coverage gaps.
This analysis system processes three interconnected datasets (demographic, biometric, enrolment) to:
- Perform univariate, bivariate, and multivariate analysis
- Detect anomalies and outliers in enrolment patterns
- Identify historical trends and forecast future volumes
- Analyze service center accessibility and coverage gaps
- Generate actionable insights and policy recommendations
- Produce a comprehensive PDF report with visualizations
Prerequisites:

- Python 3.9 or higher
- pip (Python package manager)
- Virtual environment (recommended)
Installation:

- Clone or download the project
  ```bash
  cd aadhaar-analysis
  ```

- Create a virtual environment

  ```bash
  python -m venv venv
  ```

- Activate the virtual environment

  On Windows:

  ```bash
  venv\Scripts\activate
  ```

  On macOS/Linux:

  ```bash
  source venv/bin/activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

The project expects data files in the following structure:
```
data/
├── demographic/
│   ├── api_data_aadhar_demographic_0_500000.csv
│   ├── api_data_aadhar_demographic_500000_1000000.csv
│   └── ... (additional demographic files)
├── biometric/
│   ├── api_data_aadhar_biometric_0_500000.csv
│   ├── api_data_aadhar_biometric_500000_1000000.csv
│   └── ... (additional biometric files)
└── enrolment/
    ├── api_data_aadhar_enrolment_0_500000.csv
    ├── api_data_aadhar_enrolment_500000_1000000.csv
    └── ... (additional enrolment files)
```
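For orientation, a minimal sketch of loading one dataset's part files with pandas. This is illustrative only; the project's `dataloading.py` module handles validation and chunked reading, and the `load_dataset` helper shown here is an assumption, not the actual API:

```python
from pathlib import Path

import pandas as pd


def load_dataset(directory: str) -> pd.DataFrame:
    """Concatenate all CSV part files found in one dataset directory."""
    files = sorted(Path(directory).glob("*.csv"))
    frames = [pd.read_csv(f, parse_dates=["date"]) for f in files]
    return pd.concat(frames, ignore_index=True)


demographic = load_dataset("data/demographic")
```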
Demographic Data (CSV):

- `date`: Date of demographic update (YYYY-MM-DD)
- `state`: State name
- `district`: District name
- `pincode`: Postal code
- `demo_age_5_17`: Count of demographic updates for ages 5-17
- `demo_age_17_`: Count of demographic updates for ages 17+

Biometric Data (CSV):

- `date`: Date of biometric enrolment (YYYY-MM-DD)
- `state`: State name
- `district`: District name
- `pincode`: Postal code
- `bio_age_5_17`: Count of biometric enrolments for ages 5-17
- `bio_age_17_`: Count of biometric enrolments for ages 17+

Enrolment Data (CSV):

- `date`: Date of enrolment (YYYY-MM-DD)
- `state`: State name
- `district`: District name
- `pincode`: Postal code
- `age_0_5`: Enrolments for ages 0-5
- `age_5_17`: Enrolments for ages 5-17
- `age_18_greater`: Enrolments for ages 18+
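A small, hedged schema check built from the column lists above. The `validate_schema` helper is illustrative, not the project's actual validation code:

```python
import pandas as pd

EXPECTED_COLUMNS = {
    "demographic": ["date", "state", "district", "pincode", "demo_age_5_17", "demo_age_17_"],
    "biometric": ["date", "state", "district", "pincode", "bio_age_5_17", "bio_age_17_"],
    "enrolment": ["date", "state", "district", "pincode", "age_0_5", "age_5_17", "age_18_greater"],
}


def validate_schema(df: pd.DataFrame, dataset: str) -> None:
    """Raise if any expected column is missing from the loaded dataframe."""
    missing = set(EXPECTED_COLUMNS[dataset]) - set(df.columns)
    if missing:
        raise ValueError(f"{dataset}: missing columns {sorted(missing)}")
```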
Run the complete analysis pipeline with the default configuration:

```bash
python main.py
```

Run with a custom configuration and output directory:

```bash
python main.py --config config.json --output custom_output --log-level INFO
```

Command-line options:

- `--config CONFIG`: Path to configuration file (JSON format)
- `--output OUTPUT`: Output directory for results (default: `output`)
- `--log-level LEVEL`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
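These flags map naturally onto argparse. A minimal sketch of such a parser (an assumption, not necessarily main.py's exact implementation):

```python
import argparse

parser = argparse.ArgumentParser(description="Aadhaar enrolment analysis pipeline")
parser.add_argument("--config", default="config.json", help="Path to JSON configuration file")
parser.add_argument("--output", default="output", help="Output directory for results")
parser.add_argument(
    "--log-level",
    default="INFO",
    choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
    help="Logging verbosity",
)
args = parser.parse_args()  # access as args.config, args.output, args.log_level
```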
The config.json file controls all analysis parameters:

```json
{
  "data": {
    "demographic_dir": "data/demographic",
    "biometric_dir": "data/biometric",
    "enrolment_dir": "data/enrolment",
    "output_dir": "output",
    "cache_dir": ".cache"
  },
  "analysis": {
    "numeric_percentiles": [0.25, 0.5, 0.75],
    "outlier_methods": ["iqr", "zscore"],
    "iqr_multiplier": 1.5,
    "zscore_threshold": 3.0,
    "decomposition_period": 12,
    "forecast_periods": 12,
    "min_region_size": 10
  },
  "visualization": {
    "figure_size": [12, 6],
    "dpi": 300,
    "style": "seaborn-v0_8-darkgrid",
    "color_palette": "husl",
    "save_format": "png",
    "map_zoom_level": 5,
    "choropleth_colormap": "YlOrRd"
  },
  "report": {
    "report_title": "Aadhaar Enrolment & Updates Analysis Report",
    "author": "Data Analysis Team",
    "include_code_appendix": true,
    "include_methodology": true,
    "include_recommendations": true,
    "page_size": "A4",
    "margin_inches": 0.75
  },
  "log_level": "INFO",
  "log_file": "analysis.log"
}
```
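A short sketch of reading these values at runtime with the standard json module (illustrative; the project's `config.py` may wrap this in validation and defaults):

```python
import json

with open("config.json", encoding="utf-8") as fh:
    config = json.load(fh)

iqr_multiplier = config["analysis"]["iqr_multiplier"]        # 1.5
figure_size = tuple(config["visualization"]["figure_size"])  # (12, 6)
```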
After running the analysis, the following files are generated in the output directory:

```
output/
├── aadhaar_analysis_report.pdf      # Main PDF report
├── demographic_cleaned.parquet      # Cleaned demographic data
├── biometric_cleaned.parquet        # Cleaned biometric data
├── enrolment_cleaned.parquet        # Cleaned enrolment data
├── preprocessing_report.json        # Data quality report
└── visualizations/
    ├── distribution_*.png           # Distribution plots
    ├── correlation_heatmap.png      # Correlation matrix
    ├── time_series.png              # Time series plot
    ├── geographic_map.png           # Geographic visualization
    ├── comparative_analysis.png     # Comparative plots
    └── accessibility_dashboard.png  # Accessibility dashboard
```
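The cleaned Parquet files can be reloaded directly for follow-up work. A small sketch, assuming pandas with a Parquet engine such as pyarrow installed:

```python
import pandas as pd

# Reload cleaned data without re-running the preprocessing phase.
enrolment = pd.read_parquet("output/enrolment_cleaned.parquet")
print(enrolment.dtypes)
```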
The generated PDF report includes:
- Title Page: Report title, author, and generation date
- Table of Contents: Navigable section index
- Problem Statement: Overview of analysis objectives
- Datasets: Description of data sources and structure
- Methodology: Data cleaning, preprocessing, and analysis methods
- Analysis Results:
  - Univariate analysis (distributions, trends)
  - Bivariate analysis (correlations, relationships)
  - Anomaly detection findings
  - Trend analysis and forecasts
  - Service accessibility analysis
- Visualizations: All generated charts and maps
- Key Findings: Synthesized insights and patterns
- Recommendations: Policy recommendations based on findings
- Code Appendix: Complete Python source code with syntax highlighting
The pipeline executes the following analysis phases:
Data Loading:

- Loads all CSV files from the configured directories
- Validates data schema and integrity
- Reports data quality metrics
Preprocessing:

- Handles missing values (forward fill, median/mode imputation)
- Standardizes geographic data (state/district names, pincodes)
- Parses temporal data (date conversion, feature extraction)
- Removes duplicates
- Saves cleaned data in Parquet format (see the sketch below)
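A minimal sketch of the preprocessing steps above, assuming pandas and the column names documented earlier. The `clean` function is illustrative; the project's `preprocessing.py` is the authoritative implementation:

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass mirroring the steps listed above."""
    df = df.drop_duplicates().copy()
    df["state"] = df["state"].str.strip().str.title()        # standardize state names
    df["pincode"] = df["pincode"].astype(str).str.zfill(6)   # 6-digit Indian pincodes
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    numeric = df.select_dtypes("number").columns
    df[numeric] = df[numeric].fillna(df[numeric].median())   # median imputation
    df["year"], df["month"] = df["date"].dt.year, df["date"].dt.month
    return df
```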
Univariate Analysis:

- Computes descriptive statistics (mean, median, std dev, quartiles)
- Analyzes distributions for numeric and categorical variables
- Identifies temporal trends
- Analyzes geographic patterns
Bivariate Analysis:

- Computes correlation matrices (see the sketch below)
- Analyzes categorical relationships (cross-tabulations)
- Identifies temporal relationships and synchronization
- Detects geographic clustering and regional patterns
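As an illustration of the correlation and cross-tabulation steps, assuming the cleaned enrolment Parquet file and a `month` column produced during preprocessing:

```python
import pandas as pd

enrolment = pd.read_parquet("output/enrolment_cleaned.parquet")

# Pairwise Pearson correlations between the age-group counts.
corr = enrolment[["age_0_5", "age_5_17", "age_18_greater"]].corr()

# Cross-tabulation: total adult enrolments by state and month.
by_state_month = enrolment.pivot_table(
    index="state", columns="month", values="age_18_greater", aggfunc="sum"
)
```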
Anomaly Detection:

- Statistical outlier detection (IQR and z-score methods)
- Temporal anomaly detection (change points, seasonal anomalies)
- Geographic anomaly detection (regional outliers)
- Multivariate anomaly detection (Isolation Forest); see the sketch below
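A hedged sketch of the three statistical detectors named above (IQR, z-score, Isolation Forest), assuming scipy and scikit-learn; the thresholds mirror the config defaults:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import IsolationForest

enrolment = pd.read_parquet("output/enrolment_cleaned.parquet")
series = enrolment["age_18_greater"]

# IQR rule: flag values beyond 1.5 * IQR outside the quartiles.
q1, q3 = series.quantile([0.25, 0.75])
iqr_mask = (series < q1 - 1.5 * (q3 - q1)) | (series > q3 + 1.5 * (q3 - q1))

# Z-score rule: flag values more than 3 standard deviations from the mean.
z_mask = np.abs(stats.zscore(series)) > 3.0

# Multivariate detection with Isolation Forest over all age-group counts.
X = enrolment[["age_0_5", "age_5_17", "age_18_greater"]]
labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)  # -1 = anomaly
```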
Trend Analysis:

- Time series decomposition (trend, seasonal, residual)
- Trend phase identification (growth, plateau, decline)
- Future value forecasting (ARIMA, exponential smoothing); see the sketch below
- Demographic shift analysis
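A sketch of decomposition and forecasting with statsmodels, assuming a monthly aggregate of the enrolment data; the project may use ARIMA instead of, or in addition to, Holt-Winters smoothing:

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose

enrolment = pd.read_parquet("output/enrolment_cleaned.parquet")

# Monthly national enrolment totals as a time series.
monthly = enrolment.set_index("date")["age_18_greater"].resample("MS").sum()

# Decompose into trend, seasonal, and residual components (period = 12 months).
decomposition = seasonal_decompose(monthly, model="additive", period=12)

# Forecast the next 12 months with Holt-Winters exponential smoothing.
model = ExponentialSmoothing(monthly, trend="add", seasonal="add", seasonal_periods=12).fit()
forecast = model.forecast(12)
```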
Accessibility Analysis:

- Coverage gap detection (demographic vs. biometric mismatches); see the sketch below
- Service desert identification (high-population, low-enrolment areas)
- Age group coverage analysis
- Stagnant region identification
- Accessibility index computation
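One simple way to surface coverage gaps is a biometric-to-demographic ratio per district. This is an illustrative heuristic, not the project's exact accessibility index, and the 0.5 cutoff is an arbitrary example:

```python
import pandas as pd

demographic = pd.read_parquet("output/demographic_cleaned.parquet")
biometric = pd.read_parquet("output/biometric_cleaned.parquet")

# Ratio of biometric enrolments to demographic updates per district (ages 5-17).
demo = demographic.groupby("district")["demo_age_5_17"].sum()
bio = biometric.groupby("district")["bio_age_5_17"].sum()
coverage = (bio / demo).rename("bio_to_demo_ratio")

# Districts where the ratio is low are candidate coverage gaps.
gaps = coverage[coverage < 0.5].sort_values()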
Visualization:

- Distribution plots (histograms, box plots, violin plots)
- Correlation heatmaps (see the sketch below)
- Time series plots with trend lines
- Geographic maps (choropleth)
- Comparative analysis charts
- Accessibility dashboard
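A minimal matplotlib/seaborn sketch of the heatmap output, using the figure size, DPI, and colormap from the config defaults:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for batch generation
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

enrolment = pd.read_parquet("output/enrolment_cleaned.parquet")
corr = enrolment[["age_0_5", "age_5_17", "age_18_greater"]].corr()

fig, ax = plt.subplots(figsize=(12, 6), dpi=300)
sns.heatmap(corr, annot=True, cmap="YlOrRd", ax=ax)
ax.set_title("Correlation between age-group enrolment counts")
fig.savefig("output/visualizations/correlation_heatmap.png", bbox_inches="tight")
plt.close(fig)
```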
Insight Synthesis:

- Converts statistical findings to plain language
- Identifies potential causal relationships and patterns
- Highlights unexpected findings
- Generates an executive summary
Report Generation:

- Creates the comprehensive PDF report (see the sketch below)
- Embeds all visualizations
- Includes the code appendix
- Generates the table of contents and page numbers
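For orientation, a minimal reportlab sketch of how a report like this can be assembled. Illustrative only: the project's `report_generation.py` builds a far richer document, and the image step assumes the heatmap PNG already exists:

```python
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Image, Paragraph, SimpleDocTemplate

styles = getSampleStyleSheet()
story = [
    Paragraph("Aadhaar Enrolment & Updates Analysis Report", styles["Title"]),
    Paragraph("Univariate analysis of enrolment distributions.", styles["BodyText"]),
    # Assumes the visualization phase has already written this file.
    Image("output/visualizations/correlation_heatmap.png", width=480, height=240),
]
SimpleDocTemplate("output/aadhaar_analysis_report.pdf", pagesize=A4).build(story)
```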
Run all tests:

```bash
pytest
```

Run tests with coverage:

```bash
pytest --cov=src --cov-report=html
```

Run a specific test file:

```bash
pytest tests/test_dataloading.py
```

Run with verbose output:

```bash
pytest -v
```

Tests are organized by module:
```
tests/
├── test_dataloading.py             # Data loading tests
├── test_preprocessing.py           # Preprocessing tests
├── test_univariate_analysis.py     # Univariate analysis tests
├── test_bivariate_analysis.py      # Bivariate analysis tests
├── test_anomaly_detection.py       # Anomaly detection tests
├── test_trend_analysis.py          # Trend analysis tests
├── test_accessibility_analysis.py  # Accessibility analysis tests
├── test_visualization.py           # Visualization tests
├── test_insight_synthesis.py       # Insight synthesis tests
├── test_report_generation.py       # Report generation tests
├── test_config.py                  # Configuration tests
├── test_logging_config.py          # Logging tests
└── conftest.py                     # Pytest fixtures
```
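As a hedged illustration of how tests in this suite might be structured (the actual fixtures in conftest.py may differ, and `raw_frame` is a hypothetical fixture name):

```python
import pandas as pd
import pytest


@pytest.fixture
def raw_frame() -> pd.DataFrame:
    """Tiny synthetic frame with a deliberate duplicate row."""
    return pd.DataFrame({
        "date": ["2023-01-01", "2023-01-01"],
        "state": [" delhi ", " delhi "],
        "pincode": [110001, 110001],
        "age_18_greater": [5, 5],
    })


def test_duplicates_removed(raw_frame: pd.DataFrame) -> None:
    cleaned = raw_frame.drop_duplicates()
    assert len(cleaned) == 1
```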
Format code with Black:

```bash
black src/ tests/ main.py
```

Check code style with Flake8:

```bash
flake8 src/ tests/ main.py
```

Check types with Mypy:

```bash
mypy src/ main.py
```

Project structure:

```
aadhaar-analysis/
├── main.py                          # Main orchestration script
├── config.json                      # Configuration file
├── requirements.txt                 # Python dependencies
├── README.md                        # This file
├── pytest.ini                       # Pytest configuration
├── src/                             # Source code
│   ├── __init__.py
│   ├── config.py                    # Configuration management
│   ├── logging_config.py            # Logging setup
│   ├── dataloading.py               # Data loading module
│   ├── preprocessing.py             # Data preprocessing
│   ├── univariate_analysis.py       # Univariate analysis
│   ├── bivariate_analysis.py        # Bivariate analysis
│   ├── anomaly_detection.py         # Anomaly detection
│   ├── trend_analysis.py            # Trend analysis
│   ├── accessibility_analysis.py    # Accessibility analysis
│   ├── visualization.py             # Visualization generation
│   ├── insight_synthesis.py         # Insight synthesis
│   ├── report_generation.py         # PDF report generation
│   ├── baseline_model.py            # Baseline model (optional)
│   └── feature_engineering.py       # Feature engineering (optional)
├── tests/                           # Test suite
│   ├── __init__.py
│   ├── conftest.py                  # Pytest fixtures
│   └── test_*.py                    # Module-specific tests
├── data/                            # Input data directory
│   ├── demographic/                 # Demographic CSV files
│   ├── biometric/                   # Biometric CSV files
│   └── enrolment/                   # Enrolment CSV files
├── output/                          # Output directory (generated)
│   ├── aadhaar_analysis_report.pdf  # Main report
│   ├── *_cleaned.parquet            # Cleaned data
│   └── visualizations/              # Generated plots
└── .cache/                          # Cache directory (generated)
```
Issue: "No such file or directory" for data files
- Ensure data files are in the correct directory structure
- Check that file paths in config.json are correct
- Verify CSV files have the expected column names
Issue: Memory error with large datasets
- Reduce the number of files processed at once
- Use data sampling for initial testing
- Increase available system memory
Issue: Visualization not displaying
- Ensure matplotlib backend is set correctly
- Check that output directory has write permissions
- Verify image format is supported
Issue: PDF generation fails
- Ensure reportlab is installed: `pip install reportlab`
- Check that the output directory exists and is writable
- Verify all visualizations were generated successfully
For issues or questions:
- Check the log file (`analysis.log`) for detailed error messages
- Review the test files for usage examples
- Consult the docstrings in source code modules
- Check the generated PDF report for analysis details
Performance notes:

- Data Loading: Optimized for large CSV files using chunked reading
- Memory Usage: Efficient data types (int32 for counts, datetime64 for dates)
- Computation: Vectorized operations using numpy/pandas
- Visualization: Lazy loading of large maps, cached computations
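A sketch of the chunked-reading pattern with explicit dtypes, assuming pandas; the column and file names follow the data layout documented above:

```python
import pandas as pd

# Read a large CSV in chunks with memory-efficient dtypes.
dtypes = {
    "age_0_5": "int32", "age_5_17": "int32", "age_18_greater": "int32",
    "state": "category", "district": "category",
}
chunks = pd.read_csv(
    "data/enrolment/api_data_aadhar_enrolment_0_500000.csv",
    dtype=dtypes, parse_dates=["date"], chunksize=100_000,
)
total_adult = sum(chunk["age_18_greater"].sum() for chunk in chunks)
```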
Key dependencies and their purposes:
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- scikit-learn: Machine learning (anomaly detection, clustering)
- scipy: Scientific computing (statistics, signal processing)
- statsmodels: Statistical modeling (time series, ARIMA)
- matplotlib: Plotting and visualization
- seaborn: Statistical data visualization
- plotly: Interactive visualizations
- folium: Geographic mapping
- reportlab: PDF generation
- pytest: Testing framework
- hypothesis: Property-based testing
This project is provided as-is for analysis and research purposes.
For questions or feedback about this analysis system, please contact the Data Analysis Team.
Implemented components:

- Complete analysis pipeline implementation
- All analysis modules (univariate, bivariate, anomaly detection, trends, accessibility)
- Comprehensive visualization generation
- PDF report generation with code appendix
- Full test coverage
- Configuration management
- Logging and error handling