In [None]:
import pandas as pd
url = "https://raw.githubusercontent.com/ogut77/DataScience/master/insurance.csv"
df = pd.read_csv(url)


In [None]:
df.head()

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


Context in Insurance Data
This dataset is often used to predict charges based on the other variables (age, sex, bmi, children, smoker, region). For example:

Input Variables (X): age, sex, bmi, children, smoker, region (features used to make predictions).

Output Variable (y): charges (what you’re trying to predict).

Description of variables
1. Age
Description: The age of the individual (the insured person).
Type: Numerical (integer).
Example Values: 19, 45, 62, etc.
Role in Insurance: Age is a key factor in determining insurance charges. Older individuals often have higher medical costs (and thus higher charges) due to increased health risks.
2. Sex
Description: The gender of the individual.
Type: Categorical (text or binary).
Example Values: "male," "female"
Role in Insurance: Gender can influence insurance charges because health risks and medical expenses may differ between males and females (e.g., pregnancy-related costs for females).
3. BMI (Body Mass Index)
Description: A measure of body fat based on height and weight (calculated as weight in kg divided by height in meters squared).
Type: Numerical (float).
Example Values: 25.3, 30.1, 18.5, etc.
Role in Insurance: Higher BMI often correlates with increased health risks (e.g., obesity-related conditions like diabetes or heart disease), leading to higher insurance charges.
4. Children
Description: The number of children (dependents) covered under the individual’s insurance plan.
Type: Numerical (integer).
Example Values: 0, 1, 3, etc.
Role in Insurance: More children can increase insurance costs slightly, as it may reflect additional healthcare needs, though the effect is often less pronounced than other factors like smoking or age.
5. Smoker
Description: Indicates whether the individual smokes tobacco.
Type: Categorical (text or binary).
Example Values: "yes," "no"
Role in Insurance: Smoking is a major factor in insurance charges. Smokers typically have much higher medical costs due to risks like lung disease or cancer, so their charges are significantly elevated.
6. Region
Description: The geographic region where the individual lives.
Type: Categorical (text).
Example Values: "northeast," "southeast," "southwest," "northwest" (common in U.S.-based datasets).
Role in Insurance: Charges can vary by region due to differences in healthcare costs, lifestyle factors, or local insurance regulations.
7. Charges
Description: The insurance charges (or premiums/costs) billed to the individual, typically in a currency like USD.
Type: Numerical (float).
Example Values: 1684.52, 11234.89, 32050.23, etc.
Role in Insurance: This is usually the target variable (output) in predictive modeling. It represents the amount the insurance company charges, influenced by all the other columns (age, sex, BMI, etc.).
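
The types listed above can be checked directly with pandas. The sketch below rebuilds the five rows shown by df.head() as an inline DataFrame (an assumption made so the example runs without downloading the CSV; the name `sample` is illustrative) and separates the numeric columns from the text columns.

```python
import pandas as pd

# The five rows printed by df.head(), typed in by hand for illustration
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Split the column names by kind: numeric columns vs. text (categorical) columns
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

The text columns (sex, smoker, region) are the ones that need encoding before they can be passed to a regression model.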



In [1]:
#1. Check whether there are any null values in the dataset df (5 pt)
import pandas as pd

url = "https://raw.githubusercontent.com/ogut77/DataScience/master/insurance.csv"
df = pd.read_csv(url)


print(df.isnull().sum())


age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
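
Every count above is zero, so the dataset has no missing values. For contrast, here is a minimal sketch of what the same check reports when a value is missing. The frame `toy` is hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing bmi value (not from the dataset)
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, np.nan, 33.0],
})

null_counts = toy.isnull().sum()        # per-column count of missing values
print(null_counts)

has_nulls = toy.isnull().values.any()   # single True/False answer for the whole frame
print("Any nulls:", has_nulls)
```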


In [2]:
#2. Assign charges to y and the other columns to X using df. y is the output variable and X holds the input variables (5 pt)

y = df["charges"]

X = df.drop(columns=["charges"])

print(X.head())
print(y.head())


   age     sex     bmi  children smoker     region
0   19  female  27.900         0    yes  southwest
1   18    male  33.770         1     no  southeast
2   28    male  33.000         3     no  southeast
3   33    male  22.705         0     no  northwest
4   32    male  28.880         0     no  northwest
0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64


In [3]:
#3. Use the get_dummies() function from the pandas library to convert the categorical variables in the DataFrame X into dummy variables.
# drop_first=True drops the first category's dummy variable to avoid multicollinearity (5 pt)

X = pd.get_dummies(X, drop_first=True)

print(X.head())


   age     bmi  children  sex_male  smoker_yes  region_northwest  \
0   19  27.900         0     False        True             False   
1   18  33.770         1      True       False             False   
2   28  33.000         3      True       False             False   
3   33  22.705         0      True       False              True   
4   32  28.880         0      True       False              True   

   region_southeast  region_southwest  
0             False              True  
1              True             False  
2              True             False  
3             False             False  
4             False             False  
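
The output above has no sex_female, smoker_no, or region_northeast column: with drop_first=True, the first category in sorted order becomes the baseline and is represented by all remaining dummies being False. A small sketch on a hypothetical frame (`toy`, containing each region once) shows which category is dropped.

```python
import pandas as pd

# Hypothetical frame containing each of the four regions once
toy = pd.DataFrame({"region": ["northeast", "northwest", "southeast", "southwest"]})

dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())

# The dropped category is the row where every remaining dummy is 0
baseline_rows = toy.loc[dummies.astype(int).sum(axis=1) == 0, "region"].tolist()
print(baseline_rows)
```

Coefficients of a linear model fitted on these dummies are then read relative to the baseline category.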


In [None]:
#Use the following function to evaluate predictions on the test and training data
def evalmetric(y, ypred):
    from scipy.stats import pearsonr
    import numpy as np
    e = y - ypred                     # residuals
    mse_f = np.mean(e**2)             # mean squared error
    rmse_f = np.sqrt(mse_f)           # root mean squared error
    mae_f = np.mean(abs(e))           # mean absolute error
    mape_f = 100*np.mean(abs(e/y))    # mean absolute percentage error (requires y != 0)
    crl, _ = pearsonr(y, ypred)
    r2_f = crl*crl                    # squared Pearson correlation, reported as R-squared
    print("MSE:", mse_f)
    print("RMSE:", rmse_f)
    print("MAE:", mae_f)
    print("MAPE:", mape_f)
    print("R-Squared:", round(r2_f, 4))
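
Note that evalmetric reports the squared Pearson correlation as R-squared, while sklearn's r2_score (used in later cells) computes the coefficient of determination, 1 - SS_res / SS_tot. The two can differ on test data. The sketch below uses made-up numbers to show an extreme case: predictions with a constant bias are perfectly correlated with the targets but still score poorly on the coefficient of determination.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = y_true + 2.0          # predictions with a constant bias of +2

# Squared Pearson correlation, as in evalmetric
r = np.corrcoef(y_true, y_pred)[0, 1]
r_squared_corr = r ** 2

# Coefficient of determination: 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared_det = 1 - ss_res / ss_tot

print(r_squared_corr, r_squared_det)
```

Here the squared correlation is 1 (up to floating-point error) while the coefficient of determination is -1, so the two numbers should not be compared across cells as if they were the same metric.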


In [4]:
#4. Get the correlation between the X variables and the y variable. (5 pt)

df_encoded = pd.concat([X, y], axis=1)

correlation = df_encoded.corr()["charges"].sort_values(ascending=False)

print(correlation)


charges             1.000000
smoker_yes          0.787251
age                 0.299008
bmi                 0.198341
region_southeast    0.073982
children            0.067998
sex_male            0.057292
region_northwest   -0.039905
region_southwest   -0.043210
Name: charges, dtype: float64


In [6]:
#5. Split the dataset into 25% test data and 75% training data ( pt)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print("Size X_train:", X_train.shape)
print("Size X_test:", X_test.shape)
print("Size y_train:", y_train.shape)
print("Size y_test:", y_test.shape)


Size X_train: (1003, 8)
Size X_test: (335, 8)
Size y_train: (1003,)
Size y_test: (335,)
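
The sizes follow from the row count: 1003 + 335 = 1338 rows in total. A sketch of the arithmetic, assuming the fractional test size is rounded up and the remainder goes to the training set (which is consistent with the shapes printed above):

```python
import math

n_samples = 1003 + 335        # total rows, from the shapes printed above
test_fraction = 0.25

n_test = math.ceil(test_fraction * n_samples)   # 334.5 rounds up
n_train = n_samples - n_test                    # the remainder is used for training

print(n_test, n_train)
```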


In [9]:
#6. Using the Decision Tree and Linear Regression methods, compare the performance results on both test and training data
#to determine which one is more likely to overfit and which is more likely to underfit.
# Do you think that Lasso and Ridge regularization are more likely to improve the results of the Linear model on test data,
# or are Random Forest or Boosting methods more likely to improve the results of the Decision Tree on test data?
#Explain your reasoning.(35 pt)

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Train the Decision Tree model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Predictions
y_train_pred_lr = lr_model.predict(X_train)
y_test_pred_lr = lr_model.predict(X_test)

y_train_pred_dt = dt_model.predict(X_train)
y_test_pred_dt = dt_model.predict(X_test)

# Redefine the evaluation function to report only MSE
# (this shadows the evalmetric helper provided earlier)
def evalmetric(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    print(f"Mean Squared Error: {mse:.4f}")

# Evaluate models
print("Linear Regression (Train):")
evalmetric(y_train, y_train_pred_lr)
print("\nLinear Regression (Test):")
evalmetric(y_test, y_test_pred_lr)

print("\nDecision Tree (Train):")
evalmetric(y_train, y_train_pred_dt)
print("\nDecision Tree (Test):")
evalmetric(y_test, y_test_pred_dt)



Linear Regression (Train):
Mean Squared Error: 37004502.1841

Linear Regression (Test):
Mean Squared Error: 35117755.7361

Decision Tree (Train):
Mean Squared Error: 182648.1711

Decision Tree (Test):
Mean Squared Error: 36580230.1640
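
One way to read these numbers is to compare each model's test error with its training error. The sketch below copies the four MSE values printed above and computes the test/train ratio. The interpretation that follows is a reasonable reading of the results, not a definitive diagnosis.

```python
# MSE values copied from the output above
mse = {
    "Linear Regression": {"train": 37004502.1841, "test": 35117755.7361},
    "Decision Tree": {"train": 182648.1711, "test": 36580230.1640},
}

ratios = {}
for name, scores in mse.items():
    # A ratio far above 1 means the model does much worse on unseen data
    ratios[name] = scores["test"] / scores["train"]
    print(f"{name}: test/train MSE ratio = {ratios[name]:.2f}")
```

The Decision Tree's training error is a small fraction of its test error, which points to overfitting (high variance). Linear Regression has similar and comparatively large errors on both sets, which is more in line with underfitting (high bias). Under that reading, Lasso or Ridge, which mainly reduce variance, would probably change the linear model's test error only a little, whereas Random Forest or Boosting methods are more likely to improve on the single tree.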


In [11]:
#7. Explain the performance of linear regression on test data
# using root mean squared error, mean absolute error, mean absolute percentage error and the R2 metric (10 pt)

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Define evaluation function
def eval_metrics(y_true, y_pred):
    rmse = mean_squared_error(y_true, y_pred) ** 0.5  # Correct RMSE calculation
    mae = mean_absolute_error(y_true, y_pred)  # MAE
    mape = (abs((y_true - y_pred) / y_true)).mean() * 100  # MAPE
    r2 = r2_score(y_true, y_pred)  # R² Score

    print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%")
    print(f"R² Score: {r2:.4f}")

# Evaluate Linear Regression on test data
print("Linear Regression Performance on Test Data:")
eval_metrics(y_test, y_test_pred_lr)


Linear Regression Performance on Test Data:
Root Mean Squared Error (RMSE): 5926.0236
Mean Absolute Error (MAE): 4243.6541
Mean Absolute Percentage Error (MAPE): 44.47%
R² Score: 0.7673
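
These metrics can be read as follows: RMSE and MAE are in the same units as charges, the MAPE of 44.47% says the typical relative error is large, and an R² of 0.7673 means the model accounts for roughly 77% of the variance in test charges. As a consistency check, the sketch below recomputes the RMSE from the test MSE printed in step 6 and compares it with the MAE. An RMSE well above the MAE suggests that a few large errors dominate.

```python
import math

mse_test_lr = 35117755.7361      # Linear Regression test MSE printed in step 6
mae_test_lr = 4243.6541          # MAE printed above

rmse_test_lr = math.sqrt(mse_test_lr)   # RMSE is the square root of MSE

print(f"RMSE: {rmse_test_lr:.4f}")
print(f"RMSE / MAE: {rmse_test_lr / mae_test_lr:.2f}")
```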


In [14]:
#8. Use Random Forest and Boosting methods (XGBoost, LightGBM, and CatBoost)
#to obtain the evaluation scores on test data.
#Which Boosting technique yielded the best performance on the test data based on the R² metric?
#Did you achieve a better result compared to Random Forest on the test data based on the R² metric?
#If Random Forest or the boosting methods improve on the decision tree, explain why (30 pt)

!pip install catboost --quiet

from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def eval_metrics(y_true, y_pred, model_name):
    rmse = mean_squared_error(y_true, y_pred) ** 0.5  # RMSE
    mae = mean_absolute_error(y_true, y_pred)  # MAE
    mape = (abs((y_true - y_pred) / y_true)).mean() * 100  # MAPE
    r2 = r2_score(y_true, y_pred)  # R² Score

    print(f"\n{model_name} Performance on Test Data:")
    print(f"RMSE: {rmse:.4f}")
    print(f"MAE: {mae:.4f}")
    print(f"MAPE: {mape:.2f}%")
    print(f"R² Score: {r2:.4f}")

    return r2

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_test_pred_rf = rf_model.predict(X_test)

xgb_model = XGBRegressor(n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)
y_test_pred_xgb = xgb_model.predict(X_test)

lgbm_model = LGBMRegressor(n_estimators=100, random_state=42)
lgbm_model.fit(X_train, y_train)
y_test_pred_lgbm = lgbm_model.predict(X_test)

catboost_model = CatBoostRegressor(iterations=100, random_state=42, verbose=0)
catboost_model.fit(X_train, y_train)
y_test_pred_catboost = catboost_model.predict(X_test)

r2_rf = eval_metrics(y_test, y_test_pred_rf, "Random Forest")
r2_xgb = eval_metrics(y_test, y_test_pred_xgb, "XGBoost")
r2_lgbm = eval_metrics(y_test, y_test_pred_lgbm, "LightGBM")
r2_catboost = eval_metrics(y_test, y_test_pred_catboost, "CatBoost")

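# Only the boosting models compete for "best boosting technique";
# Random Forest is compared against the winner separately below.
r2_scores = {
    "XGBoost": r2_xgb,
    "LightGBM": r2_lgbm,
    "CatBoost": r2_catboost
}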

best_model = max(r2_scores, key=r2_scores.get)
best_r2 = r2_scores[best_model]

print(f"\nBest Boosting Technique: {best_model} with R² Score = {best_r2:.4f}")

y_test_pred_dt = dt_model.predict(X_test)
r2_dt = r2_score(y_test, y_test_pred_dt)
print(f"\nDecision Tree R² Score: {r2_dt:.4f}")

if best_r2 > r2_dt:
    print(f"\n{best_model} outperformed Decision Tree by {best_r2 - r2_dt:.4f} in R² Score.")
else:
    print("\nNo significant improvement over Decision Tree.")

if best_r2 > r2_rf:
    print(f"{best_model} performed better than Random Forest by {best_r2 - r2_rf:.4f} in R² Score.")
else:
    print("Random Forest performed better or equally compared to Boosting models.")


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000222 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 319
[LightGBM] [Info] Number of data points in the train set: 1003, number of used features: 8
[LightGBM] [Info] Start training from score 13267.935814

Random Forest Performance on Test Data:
RMSE: 4807.7448
MAE: 2653.6148
MAPE: 30.27%
R² Score: 0.8468

XGBoost Performance on Test Data:
RMSE: 5141.3464
MAE: 2957.2133
MAPE: 34.57%
R² Score: 0.8248

LightGBM Performance on Test Data:
RMSE: 4690.9613
MAE: 2700.7204
MAPE: 33.10%
R² Score: 0.8542

CatBoost Performance on Test Data:
RMSE: 4686.6262
MAE: 2667.4459
MAPE: 31.72%
R² Score: 0.8544

Best Boosting Technique: CatBoost with R² Score = 0.8544

Decision Tree R² Score: 0.7576

CatBoost outperformed Decision Tree by 0.0969 in R² Score.
CatBoost performed better than Random Forest by 0.0076 in R² Score.
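
All four ensembles beat the single Decision Tree on test R². One common explanation for the Random Forest gain is variance reduction: averaging many noisy predictors gives a more stable prediction than any one of them. The sketch below illustrates only that averaging effect with simulated, independent noise. It is a simplified stand-in, since real trees in a forest are correlated (so the reduction is smaller) and boosting instead fits models sequentially to the remaining errors.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 10.0
n_trials, n_models = 2000, 100

# Each "model" predicts the true value plus independent noise
# (a stand-in for a single high-variance tree)
predictions = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_models))

var_single = predictions[:, 0].var()            # variance of one model's predictions
var_averaged = predictions.mean(axis=1).var()   # variance after averaging all models

print(var_single, var_averaged)
```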
