**Machine Learning Data Pipeline (Short Steps)**

1. **Understand Data** – Load, inspect types, nulls, and distributions.
2. **Clean Data** – Handle missing values, outliers, duplicates, fix types.
3. **Feature Engineering** – Create new features, encode categorical variables, scale numeric ones.
4. **EDA** – Analyze correlations, patterns, target distribution.
5. **Split Data** – Split into train/validation/test sets (stratify on the target for classification).
6. **Train Model** – Fit baseline models, cross-validate.
7. **Evaluate Model** – Check metrics (e.g. accuracy for classification, RMSE for regression) and, for classifiers, the confusion matrix.
8. **Tune Model** – Hyperparameter tuning, feature selection, advanced models.
9. **Interpret Model** – Feature importance, SHAP/LIME.
10. **Deploy Model** – Save model, create API (FastAPI/Flask), test inference.
11. **Monitor Model** – Watch for drift, retrain on new data.
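
Steps 5 to 7 above can be sketched in a few lines of scikit-learn. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`), and the choice of logistic regression, ROC AUC, and 5 folds is arbitrary rather than something the steps prescribe.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary data (a placeholder for a real dataset).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.88, 0.12], random_state=0
)

# Step 5: hold out a test set, stratified so both splits keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 6: a baseline model. Scaling sits inside the pipeline, so it is
# re-fit on each training fold and never sees the held-out fold.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(baseline, X_train, y_train, cv=5, scoring="roc_auc")

# Step 7: fit on the full training set and score once on the test set.
baseline.fit(X_train, y_train)
test_accuracy = baseline.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```

Keeping the scaler inside the pipeline is a deliberate choice here: fitting it on all rows before splitting would leak information from the held-out data into training.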


## 1. Understand the data

#### Imports

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split


In [6]:
# Load the data
df = pd.read_csv('data.csv', sep=',')
df.head()


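| | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 44 | 1 | 28.0 | 0 | > 2 Years | Yes | 40454.0 | 26.0 | 217 | 1 |
| 1 | 2 | Male | 76 | 1 | 3.0 | 0 | 1-2 Year | No | 33536.0 | 26.0 | 183 | 0 |
| 2 | 3 | Male | 47 | 1 | 28.0 | 0 | > 2 Years | Yes | 38294.0 | 26.0 | 27 | 1 |
| 3 | 4 | Male | 21 | 1 | 11.0 | 1 | < 1 Year | No | 28619.0 | 152.0 | 203 | 0 |
| 4 | 5 | Female | 29 | 1 | 41.0 | 1 | < 1 Year | No | 27496.0 | 152.0 | 39 | 0 |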
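
A possible next inspection step, covering the "types and nulls" part of step 1. To keep the snippet runnable without `data.csv`, it rebuilds only the five rows shown above into a frame named `sample` (a name introduced here for illustration); on the real data the same calls would be made on `df`.

```python
import pandas as pd

# The five rows displayed by df.head() above, rebuilt by hand so the
# snippet runs without data.csv.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Male", "Male", "Female"],
    "Age": [44, 76, 47, 21, 29],
    "Driving_License": [1, 1, 1, 1, 1],
    "Region_Code": [28.0, 3.0, 28.0, 11.0, 41.0],
    "Previously_Insured": [0, 0, 0, 1, 1],
    "Vehicle_Age": ["> 2 Years", "1-2 Year", "> 2 Years", "< 1 Year", "< 1 Year"],
    "Vehicle_Damage": ["Yes", "No", "Yes", "No", "No"],
    "Annual_Premium": [40454.0, 33536.0, 38294.0, 28619.0, 27496.0],
    "Policy_Sales_Channel": [26.0, 26.0, 26.0, 152.0, 152.0],
    "Vintage": [217, 183, 27, 203, 39],
    "Response": [1, 0, 1, 0, 0],
})

# Column dtypes: text columns are candidates for encoding later on.
print(sample.dtypes)

# Missing values per column.
null_counts = sample.isna().sum()
print(null_counts)

# Non-numeric columns and how many distinct values each one holds.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(sample[categorical_cols].nunique())
```

Five rows say nothing about the full dataset's nulls or cardinality; the point is only which calls answer which question.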


In [7]:
df.describe()

| | id | Age | Driving_License | Region_Code | Previously_Insured | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|
| count | 381109.0 | 381109.0 | 381109.0 | 381109.0 | 381109.0 | 381109.0 | 381109.0 | 381109.0 | 381109.0 |
| mean | 190555.0 | 38.822584 | 0.997869 | 26.388807 | 0.45821 | 30564.389581 | 112.034295 | 154.347397 | 0.122563 |
| std | 110016.836208 | 15.511611 | 0.04611 | 13.229888 | 0.498251 | 17213.155057 | 54.203995 | 83.671304 | 0.327936 |
| min | 1.0 | 20.0 | 0.0 | 0.0 | 0.0 | 2630.0 | 1.0 | 10.0 | 0.0 |
| 25% | 95278.0 | 25.0 | 1.0 | 15.0 | 0.0 | 24405.0 | 29.0 | 82.0 | 0.0 |
| 50% | 190555.0 | 36.0 | 1.0 | 28.0 | 0.0 | 31669.0 | 133.0 | 154.0 | 0.0 |
| 75% | 285832.0 | 49.0 | 1.0 | 35.0 | 1.0 | 39400.0 | 152.0 | 227.0 | 0.0 |
| max | 381109.0 | 85.0 | 1.0 | 52.0 | 1.0 | 540165.0 | 163.0 | 299.0 | 1.0 |
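
Two things can be read off this summary with simple arithmetic. The snippet below is a sketch that uses only numbers copied from the `describe()` output above; the 1.5 × IQR outlier fence (Tukey's rule) is a common heuristic chosen here for illustration, not a method the notebook itself applies.

```python
# Summary statistics copied from the df.describe() output above.
n_rows = 381109
response_mean = 0.122563  # mean of a 0/1 column = share of positives
premium_min, premium_q1 = 2630.0, 24405.0
premium_q3, premium_max = 39400.0, 540165.0

# Class balance: roughly how many rows have Response == 1.
approx_positives = round(n_rows * response_mean)
print(f"positive share: {response_mean:.1%}, about {approx_positives} rows")

# Tukey's rule: values beyond 1.5 * IQR from the quartiles are flagged.
iqr = premium_q3 - premium_q1
lower_fence = premium_q1 - 1.5 * iqr
upper_fence = premium_q3 + 1.5 * iqr
print(f"IQR fences for Annual_Premium: [{lower_fence}, {upper_fence}]")

# Compare the observed extremes against the fences.
has_low_outliers = premium_min < lower_fence
has_high_outliers = premium_max > upper_fence
print(has_low_outliers, has_high_outliers)
```

The target is imbalanced (about 12% positives), which is the situation where a stratified split matters, and the maximum `Annual_Premium` lies far above the upper fence, so the right tail is worth inspecting during cleaning.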
