After feature engineering, I saved the cleaned dataset to a CSV file for easy reuse.
Here I load it back.

In [1]:
import pandas as pd

df=pd.read_csv('../data/cleaned.csv')
df = df.drop(columns=['Unnamed: 0'])

I dropped the auto-generated index column (`Unnamed: 0`), which appears when a DataFrame is saved to CSV together with its index, to keep the data clean and model-ready.
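
As a small illustration of where that column comes from, here is a sketch on a made-up two-row frame (the values and the `toy` name are hypothetical, not taken from the dataset). It suggests that writing the CSV with `index=False` would avoid the extra column in the first place.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned dataset (values are made up).
toy = pd.DataFrame({"age": [19, 33], "charges": [9.73, 8.45]})

# By default, to_csv writes the index as an unnamed first column,
# which read_csv then labels "Unnamed: 0".
reloaded = pd.read_csv(io.StringIO(toy.to_csv()))
cols_default = list(reloaded.columns)

# Writing with index=False leaves the index out of the file.
clean = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
cols_clean = list(clean.columns)

print(cols_default)
print(cols_clean)
```

With `index=False` at save time, the `drop` call after loading would no longer be needed.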

Next, I separated the features (X) from the target (y) and split both into training and test sets.

In [2]:
from sklearn.model_selection import train_test_split

X=df.drop('charges',axis=1)
y=df['charges']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=33)


An 80/20 split lets the model learn patterns from most of the data while still being tested on unseen data.
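
To make the split behaviour concrete, here is a minimal sketch on synthetic arrays (`X_demo` and `y_demo` are placeholder names, not the insurance data). It checks the 80/20 sizes and shows that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with 2 features each; values are just row counters.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=33
)
# Same seed, same split: useful for comparing models on identical data.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=33
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```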

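Before training, I standardized the numerical features with StandardScaler to bring them onto the same scale. Regularized and margin-based models such as Ridge and SVR are sensitive to feature scales; plain Linear Regression gives the same predictions either way, but its coefficients are easier to compare after scaling.
This is especially important for engineered features like smoker_bmi. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training.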
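
What standardization does can be sketched by hand with NumPy. The numbers below are invented for illustration. StandardScaler computes the same z-score, using the population standard deviation (`ddof=0`).

```python
import numpy as np

# Made-up "training" and "test" rows with two numeric features.
train = np.array([[20.0, 22.0], [40.0, 30.0], [60.0, 38.0]])
test = np.array([[40.0, 30.0]])

# Statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, matching StandardScaler

train_z = (train - mean) / std
# The test rows reuse the training mean and std (no refitting).
test_z = (test - mean) / std

print(train_z.mean(axis=0), train_z.std(axis=0))
print(test_z)
```

Each training column ends up with mean 0 and standard deviation 1, and the test row is expressed relative to the training distribution.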

In [3]:
from sklearn.preprocessing import StandardScaler

scalar=StandardScaler()
cols=['age','bmi','children','smoker_bmi','smoker_children']

X_train_scaled=scalar.fit_transform(X_train[cols])
X_test_scaled=scalar.transform(X_test[cols])


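After scaling the selected numerical features, I combined them with the categorical ones (like sex, smoker, and region dummies) using np.hstack(). In the combined array the scaled columns come first, followed by the remaining columns in their original order.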

In [4]:
import numpy as np

X_train_lin = np.hstack([X_train_scaled, X_train.drop(cols, axis=1).values])
X_test_lin = np.hstack([X_test_scaled, X_test.drop(cols, axis=1).values])


Not all features need scaling.

Numerical features like age, bmi, children, smoker_bmi, and smoker_children vary a lot in range, and models like Ridge and SVR are sensitive to that (for plain Linear Regression, scaling mainly makes the coefficients comparable).

Binary/categorical features (e.g., sex, smoker, and the one-hot region columns) are already in 0/1 format, so scaling them adds little and makes them harder to interpret.

So I only scaled the numerical ones.
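
The same "scale some columns, pass the rest through" step can also be written with scikit-learn's `ColumnTransformer`. This is an alternative to the manual `np.hstack()` approach, not what the notebook uses. The sketch below runs on a made-up frame (`toy` and `num_cols` are hypothetical names) and checks that both routes produce the same array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up frame with two numeric columns and one binary column.
toy = pd.DataFrame({
    "age": [19, 45, 62, 30],
    "smoker": [1, 0, 0, 1],
    "bmi": [27.9, 33.1, 24.5, 30.2],
})
num_cols = ["age", "bmi"]

# Scale only num_cols; remaining columns are appended unchanged.
ct = ColumnTransformer(
    [("num", StandardScaler(), num_cols)], remainder="passthrough"
)
combined = ct.fit_transform(toy)

# Manual equivalent: scaled block first, then the untouched columns.
manual = np.hstack([
    StandardScaler().fit_transform(toy[num_cols]),
    toy.drop(num_cols, axis=1).values,
])
print(np.allclose(combined, manual))
```

One possible advantage is that the fitted transformer can be reused on new data in a single `transform` call, which may be convenient at deployment time.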

In [5]:
print("Original X_train shape:", X_train.shape)
print("X_train_scaled shape :", X_train_scaled.shape)
print("X_train_lin shape:", X_train_lin.shape)

Original X_train shape: (1069, 10)
X_train_scaled shape : (1069, 5)
X_train_lin shape: (1069, 10)


In [6]:
from sklearn.linear_model import LinearRegression,Ridge

linear=LinearRegression()
linear.fit(X_train_lin,y_train)
linear_pred=linear.predict(X_test_lin)
linear_pred_real=np.expm1(linear_pred)

In [7]:
from sklearn.model_selection import GridSearchCV

param_grid_ridge={
    'alpha':[0.1,1.0,5,10]
}
ridge=GridSearchCV(Ridge(),param_grid=param_grid_ridge,cv=3)
ridge.fit(X_train_lin,y_train)
ridge_pred=ridge.predict(X_test_lin)
ridge_pred_real=np.expm1(ridge_pred)

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

poly_model=make_pipeline(PolynomialFeatures(),LinearRegression())
param_grid_poly={
    'polynomialfeatures__degree':[2,3,4,5]
}

poly=GridSearchCV(poly_model,param_grid=param_grid_poly,cv=3)
poly.fit(X_train_lin,y_train)
poly_pred=poly.predict(X_test_lin)
poly_pred_real=np.expm1(poly_pred)

In [9]:
from sklearn.svm import SVR

param_grid_svr={
    'epsilon':[0.1,0.5],
    'C':[0.1,1],
    'kernel':['linear']
}

scalr_y=StandardScaler()
y_train_scaled=scalr_y.fit_transform(y_train.values.reshape(-1,1)).ravel()

svr=GridSearchCV(SVR(),param_grid=param_grid_svr,cv=2,n_jobs=-1)
svr.fit(X_train_lin,y_train_scaled)
svr_pred_scaled=svr.predict(X_test_lin)
svr_pred=scalr_y.inverse_transform(svr_pred_scaled.reshape(-1,1)).ravel()
svr_pred_real=np.expm1(svr_pred)

In [10]:
from sklearn.ensemble import RandomForestRegressor

param_grid_random={
    'n_estimators':[100,200],
    'min_samples_split':[2,5],
    'max_depth':[None,5,10,20]
}

random=GridSearchCV(RandomForestRegressor(random_state=44),param_grid=param_grid_random,cv=3)
random.fit(X_train,y_train)
random_pred=random.predict(X_test)
random_pred_real=np.expm1(random_pred)

### Model Training & Hyperparameter Tuning

| Model | Key Details | Hyperparameters Searched | Post-processing |
|-------|-------------|--------------------------|-----------------|
| **Linear Regression** | Baseline linear model | None | Trained on log-transformed `charges`; predictions inverted with `np.expm1` |
| **Ridge Regression** | L2-regularized model to reduce overfitting/multicollinearity | `alpha ∈ {0.1, 1.0, 5, 10}` (GridSearchCV, cv=3) | Trained on log-transformed `charges`; predictions inverted with `np.expm1` |
| **Polynomial Regression** | Captures non-linear patterns using polynomial features | `degree ∈ {2, 3, 4, 5}` (GridSearchCV, cv=3) | Trained on log-transformed `charges`; predictions inverted with `np.expm1` |
| **Support Vector Regressor (SVR)** | Margin-based model with linear kernel and standardized target | `C ∈ {0.1, 1}`, `epsilon ∈ {0.1, 0.5}`, `kernel = linear` (GridSearchCV, cv=2) | Target standardized before training; `inverse_transform`, then `np.expm1` |
| **Random Forest Regressor** | Ensemble of decision trees (handles non-linearities well); trained on the unscaled features | `n_estimators ∈ {100, 200}`, `max_depth ∈ {None, 5, 10, 20}`, `min_samples_split ∈ {2, 5}` (GridSearchCV, cv=3) | Trained on log-transformed `charges`; predictions inverted with `np.expm1` |
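
The grid searches above pick one setting per model. A fitted `GridSearchCV` exposes which setting won. The sketch below uses synthetic data (`X_demo`, `y_demo`, and `search` are placeholder names) and the same `alpha` grid as the Ridge row, so the printed values say nothing about the insurance data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, purely illustrative.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 5, 10]}, cv=3)
search.fit(X_demo, y_demo)

# Winning setting and its mean cross-validated score
# (R² by default for regressors).
print(search.best_params_)
print(round(search.best_score_, 3))

# predict() on the search object delegates to the refitted best estimator.
same = np.allclose(
    search.predict(X_demo), search.best_estimator_.predict(X_demo)
)
print(same)
```

Printing `best_params_` for each search in the notebook (for example `ridge.best_params_`) would show which row of each grid was actually used.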




In [11]:
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

y_test_real=np.expm1(y_test)
# named evaluate rather than eval, to avoid shadowing Python's built-in eval()
def evaluate(name,ytest,ypred):
    """Print MAE, MSE and R² for one model, on the original charges scale."""
    print(name)
    print('mae',mean_absolute_error(ytest,ypred))
    print('mse',mean_squared_error(ytest,ypred))
    print('r2',r2_score(ytest,ypred),'\n')

evaluate('Linear Regression',y_test_real,linear_pred_real)
evaluate('Ridge Regression',y_test_real,ridge_pred_real)
evaluate('Polynomial Regression',y_test_real,poly_pred_real)
evaluate('svr',y_test_real,svr_pred_real)
evaluate('Random Forest',y_test_real,random_pred_real)

Linear Regression
mae 4438.7955217337385
mse 73330163.55100016
r2 0.46626006357454974 

Ridge Regression
mae 4429.93234976219
mse 73151479.18714501
r2 0.46756063316809704 

Polynomial Regression
mae 2615.2149541153804
mse 23229889.20142734
r2 0.830919242708126 

svr
mae 4818.559740056024
mse 102255651.15888683
r2 0.25572340077042044 

Random Forest
mae 2035.8426458599383
mse 18264620.832100995
r2 0.8670593778918747 



###  Model Evaluation Results

All models were evaluated using:

- **MAE** (Mean Absolute Error) — lower is better  
- **MSE** (Mean Squared Error) — lower is better  
- **R² Score** — closer to 1 means better fit
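
For reference, the three metrics can be computed by hand. The sketch below uses four invented values (not model output) and follows the standard definitions that the scikit-learn functions implement.

```python
import numpy as np

# Invented true values and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # same units as the target

# R² compares the model's squared error with that of always
# predicting the mean of y_true.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

Because MSE is in squared units, its square root (RMSE) is often easier to compare with MAE.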

| Model                | MAE    | MSE         | R² Score |
|----------------------|--------|-------------|----------|
| Linear Regression    | 4439   | 73,330,163  | 0.466    |
| Ridge Regression     | 4430   | 73,151,479  | 0.468    |
| Polynomial Regression| 2615   | 23,229,889  | 0.831    |
| SVR                  | 4819   | 102,255,651 | 0.256    |
| **Random Forest**  | **2036** | **18,264,621** | **0.867**  |

In [12]:
import joblib

joblib.dump(random,'../data/random_forest.joblib')

['../data/random_forest.joblib']

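After evaluating all models, **Random Forest** was the top performer  
with the **lowest MAE (2036)**, the lowest MSE, and the **highest R² score (0.867)**.  

So, I **saved it using `joblib`** for deployment in the Streamlit app. The saved object is the fitted `GridSearchCV` wrapper, and its predictions are on the log scale, so the app needs to apply `np.expm1` to turn them back into charges.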
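
On the loading side, a round trip might look like the sketch below. It uses a small stand-in model, synthetic data, and a temporary file (`demo_model.joblib` is a hypothetical name), and it assumes the target was transformed with `np.log1p`, which is what the `np.expm1` calls in this notebook imply.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; charges lie between 1000 and 6000.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(50, 3))
charges_demo = 1000 + 5000 * X_demo[:, 0]
y_log = np.log1p(charges_demo)  # assumed target transform

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_demo, y_log)

# Save to a temporary location and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# Predictions come back on the log scale; expm1 undoes log1p.
pred_log = loaded.predict(X_demo[:5])
pred_charges = np.expm1(pred_log)
print(pred_charges)
```

The app would also need to build its input with the same columns, in the same order, as the training features.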