Analyze South African auto insurance data (Feb 2014–Aug 2015) to find low-risk segments, validate risk hypotheses, and build pricing/risk models for AlphaCare Insurance Solutions (ACIS).
- EDA: portfolio loss ratio (see the sketch after this list), geography/vehicle/gender splits, outliers, temporal trends.
- Hypothesis tests (risk & margin): provinces, zip codes, gender.
- Data versioning with DVC for auditability.
- Modeling:
  - Claim severity (TotalClaims | claims > 0).
  - Premium / CalculatedPremiumPerTerm (and optional claim probability).
- Interpretability with SHAP/LIME.
- Deliverables: interim + final reports with business recommendations.
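The portfolio loss ratio referenced above is simply total claims divided by total premium. A minimal sketch, assuming the data is already loaded into a pandas DataFrame `df` with the `TotalClaims`, `TotalPremium`, and `Province` fields described in the data section below (see the loading sketch there):

```python
# Loss ratio = TotalClaims / TotalPremium, at portfolio level and per province.
portfolio_loss_ratio = df["TotalClaims"].sum() / df["TotalPremium"].sum()
print(f"Portfolio loss ratio: {portfolio_loss_ratio:.3f}")

# Split by province and sort ascending so candidate low-risk segments surface first.
by_province = (
    df.groupby("Province")[["TotalClaims", "TotalPremium"]].sum()
      .assign(loss_ratio=lambda g: g["TotalClaims"] / g["TotalPremium"])
      .sort_values("loss_ratio")
)
print(by_province["loss_ratio"])
```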
Historical policy & claims data (Feb 2014–Aug 2015). Key fields (a loading sketch follows the list):
- Policy: UnderwrittenCoverID, PolicyID, TransactionMonth
- Client: IsVATRegistered, Citizenship, LegalType, MaritalStatus, Gender
- Location: Province, PostalCode, MainCrestaZone, SubCrestaZone
- Vehicle: VehicleType, Make, Model, RegistrationYear, Kilowatts, Bodytype, etc.
- Plan: SumInsured, CalculatedPremiumPerTerm, CoverType, CoverGroup, Product
- Payments/Claims: TotalPremium, TotalClaims
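A minimal loading sketch for the raw extract; the `|` separator and the numeric coercions below are assumptions about the file layout, not guaranteed properties of `MachineLearningRating_v3.txt`:

```python
import pandas as pd

RAW_PATH = "data/raw/MachineLearningRating_v3.txt"

# Assumption: pipe-delimited text export; adjust sep if the file uses another delimiter.
df = pd.read_csv(RAW_PATH, sep="|", low_memory=False)

# Parse the transaction month and coerce key monetary fields to numeric.
df["TransactionMonth"] = pd.to_datetime(df["TransactionMonth"], errors="coerce")
for col in ["TotalPremium", "TotalClaims", "SumInsured", "CalculatedPremiumPerTerm"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.shape)
print(df[["TotalPremium", "TotalClaims"]].describe())
```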
src/
  data/       # loading, cleaning, feature prep
  eda/        # profiling, visuals, outlier checks
  stats/      # hypothesis tests, power checks
  models/     # training, eval, interpretability
  viz/        # plotting utilities
notebooks/    # exploratory + reporting notebooks
scripts/      # CLI entry points (EDA, tests, train, eval)
dvc.yaml      # data & model pipelines (versioned)
Use the project virtual env for all commands:
.\.venv\Scripts\Activate.ps1
# create/activate venv
python -m venv .venv
. .\.venv\Scripts\Activate.ps1
# install deps
pip install -r requirements.txt
# optional: install extras for SHAP/LIME/XGBoost
pip install shap lime xgboost
# prepare processed data via DVC pipeline
dvc repro
# run EDA (example)
python .\scripts\run_eda.py --data data/processed/insurance_clean.csv --out outputs/eda
# run hypothesis tests
python .\scripts\run_hypothesis_tests.py --data data/processed/insurance_clean.csv --out outputs/stats
# train models
python .\scripts\train_sentiment_analysis_model.py --data data/processed/insurance_clean.csv --out outputs/models

The DVC pipeline is defined in dvc.yaml with a prepare_data stage:
stages:
  prepare_data:
    cmd: .\.venv\Scripts\python scripts/prepare_insurance_data.py
    deps:
      - data/raw/MachineLearningRating_v3.txt
      - scripts/prepare_insurance_data.py
    outs:
      - data/processed/insurance_clean.csv

Workflow (run inside the activated venv):
# initialize once (already done in repo)
dvc init
dvc remote add -d localstorage .dvc/storage
# reproduce pipeline
dvc repro
# check status
dvc status
# push tracked data to remote
dvc push

Tracked data:
- Raw: data/raw/MachineLearningRating_v3.txt (.dvc-tracked)
- Processed: data/processed/insurance_clean.csv (stage output)

Notes:
- .gitignore keeps data ignored but allows .dvc metadata to be tracked under data/.
- Always activate .venv before running dvc repro to ensure consistent deps.
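For orientation, a minimal sketch of what the prepare_data stage could look like; the real logic lives in `scripts/prepare_insurance_data.py`, so the cleaning steps below are illustrative assumptions rather than that script's exact contents:

```python
# Illustrative prepare step: read the raw extract, apply light cleaning, and write
# the processed CSV that the DVC stage declares as its output.
from pathlib import Path

import pandas as pd

RAW = Path("data/raw/MachineLearningRating_v3.txt")
OUT = Path("data/processed/insurance_clean.csv")

df = pd.read_csv(RAW, sep="|", low_memory=False)  # assumption: pipe-delimited export

# Basic hygiene: drop exact duplicates, parse dates, coerce monetary fields.
df = df.drop_duplicates()
df["TransactionMonth"] = pd.to_datetime(df["TransactionMonth"], errors="coerce")
for col in ["TotalPremium", "TotalClaims"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0)

OUT.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(OUT, index=False)
print(f"Wrote {len(df):,} rows to {OUT}")
```

Because the stage lists both the raw file and the script as deps, dvc repro reruns it whenever either changes.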
- Task 1 (EDA & Stats): summaries, dtype checks, missingness, univariate/bivariate plots, geography/vehicle/gender splits, outliers, 3+ insight plots.
- Task 2 (DVC): init DVC, set remote, add raw/processed data via pipeline stage, commit .dvc/dvc.lock, push to remote.
- Task 3 (Hypothesis Testing): tests on provinces, zip codes, margin differences, gender; report p-values and business interpretation (see the sketch after this list).
- Task 4 (Modeling): severity regression, premium prediction (and optional claim probability), compare Linear/RandomForest/XGBoost; report RMSE/R² (regression) and feature importance with SHAP/LIME.
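As an illustration of the Task 3 tests, a minimal sketch: a Welch t-test on margin (premium minus claims) between two gender groups and a chi-squared test of claim frequency across provinces. Column names follow the data section; the "Male"/"Female" labels and the choice of groupings are assumptions to keep the example short:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("data/processed/insurance_clean.csv")
df["Margin"] = df["TotalPremium"] - df["TotalClaims"]
df["HasClaim"] = (df["TotalClaims"] > 0).astype(int)

# Margin difference between two gender groups (Welch two-sample t-test).
# The "Male"/"Female" labels are assumptions; check df["Gender"].unique() first.
male = df.loc[df["Gender"] == "Male", "Margin"].dropna()
female = df.loc[df["Gender"] == "Female", "Margin"].dropna()
t_stat, p_val = stats.ttest_ind(male, female, equal_var=False)
print(f"Margin by gender: t={t_stat:.2f}, p={p_val:.4f}")

# Claim frequency across provinces (chi-squared test of independence).
contingency = pd.crosstab(df["Province"], df["HasClaim"])
chi2, p, dof, _ = stats.chi2_contingency(contingency)
print(f"Claim frequency by province: chi2={chi2:.1f}, dof={dof}, p={p:.4f}")
```

The report should pair each p-value with the effect size and a plain-language business interpretation, as Task 3 requires.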
- outputs/eda/ — profiles & plots.
- outputs/stats/ — test tables & p-values.
- outputs/models/ — artifacts, metrics, feature importances.
- reports/interim.md — covers Tasks 1–2.
- reports/final.md — Medium-style report: overview, approach, EDA, tests, modeling, recommendations, limitations.
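The artifacts and metrics under outputs/models/ come out of the Task 4 modeling flow. A minimal severity-model sketch with RMSE/R² on a held-out split and SHAP attributions; the feature list here is a simplified assumption, not the project's full feature matrix:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/processed/insurance_clean.csv")
claims = df[df["TotalClaims"] > 0].copy()  # claim severity: rows with claims only

# Simplified numeric feature set for illustration; real runs use the prepared features.
features = ["SumInsured", "CalculatedPremiumPerTerm", "RegistrationYear", "Kilowatts"]
X = claims[features].apply(pd.to_numeric, errors="coerce").fillna(0)
y = claims["TotalClaims"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"RMSE: {rmse:.1f}  R2: {r2_score(y_test, pred):.3f}")

# Feature attribution with SHAP (requires the optional `pip install shap`).
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, show=False)
```

The same split and metrics can be reused to compare the Linear/RandomForest/XGBoost variants called for in Task 4.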
Selected EDA visuals:
- Total Premium Distribution
- Total Claims Distribution
- Total Claims Outliers
- Loss Ratio by Province (Top 10)
- Loss Ratio by Vehicle Type (Top 12)
- Monthly Loss Ratio Trend
- Monthly Claim Frequency Trend
pytest -q

- Use meaningful branches (task-1, task-2, …) and PRs to merge to main.
- Follow conventional commits, PEP8, and add docstrings/type hints.
- For large categorical cardinality, prefer target/WOE encoding or hashing for tree models; one-hot for low-cardinality fields.
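A minimal sketch of that encoding guidance, assuming PostalCode as the high-cardinality field and Gender as a low-cardinality one; the smoothing value and the TotalClaims target are illustrative choices:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/processed/insurance_clean.csv")
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, valid_df = train_df.copy(), valid_df.copy()

def target_encode(train, valid, col, target, smoothing=10.0):
    """Smoothed mean (target) encoding fit on the training fold only, to avoid leakage."""
    global_mean = train[target].mean()
    grp = train.groupby(col)[target].agg(["mean", "count"])
    enc = (grp["count"] * grp["mean"] + smoothing * global_mean) / (grp["count"] + smoothing)
    return train[col].map(enc).fillna(global_mean), valid[col].map(enc).fillna(global_mean)

# High-cardinality PostalCode -> smoothed target encoding; low-cardinality Gender -> one-hot.
train_df["PostalCode_te"], valid_df["PostalCode_te"] = target_encode(
    train_df, valid_df, col="PostalCode", target="TotalClaims"
)
train_df = pd.get_dummies(train_df, columns=["Gender"], drop_first=True)
valid_df = pd.get_dummies(valid_df, columns=["Gender"], drop_first=True)
```

Fitting the encoding on the training fold and only mapping it onto the validation fold is what keeps target information from leaking into evaluation.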