# DSC-515 Predictive Modeling for Medical Costs

### Overview 
This project focuses on predicting individual medical costs using linear regression models built from the Medical Cost Personal Dataset. This notebook will explore and prepare the data, engineer meaningful features, and apply multiple regression techniques, including baseline, Ridge, and Lasso, to evaluate model performance. Finally, this notebook will compare results, interpret feature importance, and discuss how these insights can support better decision‑making in health‑care settings.

#### Task 1. Data Processing and Feature Engineering
Load and explore the dataset, address missing values, summarize patterns with visuals and statistics, and note key observations and assumptions while identifying predictors from the medical cost distribution. Create and justify new features, and apply normalization or standardization only where needed based on model requirements.

#### Task 2. Modeling and Evaluation
Implement a baseline linear regression model using your selected features, split the data into training and testing sets, and compare it with Ridge and Lasso regression while evaluating each through residual checks, diagnostics, and performance metrics. Train the validated models, generate predictions, and assess results using measures such as RMSE, MAE, and R‑squared to guide your interpretation.

#### Task 3. Analysis and Interpretation
Compare model performance using tables and plots, noting any unexpected patterns or errors, and evaluate feature importance to determine which predictors most strongly influence medical cost estimates. Use these insights to recommend how health‑care providers can apply the predictions and key features to improve patient care and resource planning.

## Task 1. Data Processing and Feature Engineering
The dataset of interest is the Medical Cost Personal Datasets, which contains information about individuals' medical costs and various demographic and health-related features. This dataset will be the foundation for our predictive modeling efforts.

To begin, import the necessary Python packages for data exploration, visualization, and building regression models, then load the dataset into a Pandas DataFrame.

In [None]:
# Import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots
import kagglehub
from kagglehub import KaggleDatasetAdapter
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

In [None]:
# Load the dataset directly (no file path needed)
path = kagglehub.dataset_download("mirichoi0218/insurance")

# Load the CSV
df_medical = pd.read_csv(path + "/insurance.csv")

# Display the first few rows of the dataset
print(df_medical.head())

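# Check if data types are appropriate
# info() prints its summary directly, so wrapping it in print() would only add a stray "None".
df_medical.info()  # 'sex' and 'smoker' are strings and are converted to integers later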

One pair of identical rows was identified during the cleaning process, and the second copy was removed. Although the two rows could represent distinct patient records, they were treated as duplicates for the sake of this analysis.

In [None]:
# Check for duplicate rows
print(f"Before removing duplicates: {df_medical.shape[0]} rows")
print(f"Any duplicates present: {df_medical.duplicated().any()}")  # 1 pair of duplicate rows found

# Save duplicate rows if any exist
duplicate_rows = df_medical[df_medical.duplicated(keep=False)]
print(f"Duplicate rows found:\n{duplicate_rows}")

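# There appears to be one duplicate entry.
# Create a final dataframe without duplicates; .copy() makes it an independent
# frame so later column assignments do not trigger SettingWithCopyWarning.
df_medical_final = df_medical.drop_duplicates().copy()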

# Confirm removal of duplicates
print(f"After removing duplicates: {df_medical_final.shape[0]} rows")
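The difference between flagging duplicates and dropping them can be seen on a small example. The sketch below uses a hypothetical three-row frame (not the insurance data) to show that `duplicated(keep=False)` lists both members of a duplicated pair, while `drop_duplicates()` removes only the later copy.

```python
import pandas as pd

# Toy frame (hypothetical values) in which rows 0 and 2 are identical.
toy = pd.DataFrame({
    "age": [19, 40, 19],
    "bmi": [30.6, 25.0, 30.6],
    "smoker": ["no", "yes", "no"],
})

# keep=False flags every member of a duplicated group, so both copies are listed.
flagged = toy[toy.duplicated(keep=False)]
print(f"Rows flagged as duplicates: {len(flagged)}")

# drop_duplicates() keeps the first occurrence and removes only the later copies.
deduped = toy.drop_duplicates()
print(f"Rows before: {len(toy)}, rows after: {len(deduped)}")
```

This is why a single duplicated pair shows up as two flagged rows but reduces the row count by only one.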

The final cleaning steps convert the binary categorical columns ('sex' and 'smoker') to integers and create a new feature by binning the continuous age variable into categorical age‑range groups. These categories follow the World Health Organization (WHO) adult age‑classification framework as summarized by HealthXWire (2025), which defines major life‑stage groups such as young adults (25–44), middle‑aged adults (45–59), and older adults (60+); ages 18–24 form a separate youngest group. Using WHO‑based groupings provides a standardized, widely recognized structure for demographic analysis and supports clearer interpretation in predictive modeling.

In [None]:
# Convert 'sex' and 'smoker' to numeric data types.
# map() states the encoding explicitly and avoids the downcasting FutureWarning
# that replace() can emit in pandas 2.2+.
df_medical_final['sex'] = df_medical_final['sex'].map({'male': 1, 'female': 0}).astype(int)
df_medical_final['smoker'] = df_medical_final['smoker'].map({'yes': 1, 'no': 0}).astype(int)
# Check the updated data types
print(df_medical_final.dtypes[['sex', 'smoker']])

# Convert 'age' to ranges based on WHO adult age classifications.
age_bins = [18, 25, 45, 60, 100]
age_labels = ['18-24', '25-44', '45-59', '60+']

df_medical_final.loc[:, 'age_group'] = pd.cut(
    df_medical_final['age'],
    bins=age_bins,
    labels=age_labels,
    right=False
)

# Check the new 'age_group' feature.
print(df_medical_final.loc[:, ['age', 'age_group']].head(10))

# Check the distribution of the new 'age_group' feature.
print(df_medical_final.loc[:, 'age_group'].value_counts())
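To make the bin edges explicit, the following sketch applies the same `age_bins` and `age_labels` to a handful of boundary ages. The `ages` series is illustrative only; the point is that `right=False` makes each interval left-closed, so an age of 25 lands in '25-44' rather than '18-24'.

```python
import pandas as pd

age_bins = [18, 25, 45, 60, 100]
age_labels = ['18-24', '25-44', '45-59', '60+']

# Boundary ages chosen to show which side of each edge is included.
ages = pd.Series([18, 24, 25, 44, 45, 59, 60, 64])

# right=False makes each interval left-closed: [18, 25), [25, 45), [45, 60), [60, 100).
groups = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)
print(pd.DataFrame({"age": ages, "age_group": groups}))
```

With the default `right=True`, an age of 18 would fall outside the first interval and become missing, which is one reason to set `right=False` here.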

'Age', 'bmi', 'children', and 'charges' are numeric features in the dataset, and 'sex' and 'smoker' are now numeric as well after the 0/1 encoding above. Although the description of 'charges' is somewhat vague, this variable represents the individual medical costs we aim to predict. Individuals range from 18 to 64 years old, with a mean age of about 39. 'Bmi' spans roughly 16 to 53 and shows moderate right skew, while 'charges' shows substantial variability and a strong right skew due to a small number of high‑cost cases. The number of 'children' is generally low, with most individuals reporting 0 to 2 children.

In [None]:
# Summary statistics for numerical columns.
print(df_medical_final.describe())

# Do men or women have more children on average?
avg_children_by_sex = df_medical_final.groupby('sex')['children'].mean()
print(f"Average number of children by sex:\n{avg_children_by_sex}")

The remaining features are categorical: 'sex' and 'smoker' (both now encoded as 0/1) and 'region'. The dataset is nearly evenly split across the two sex categories, with 676 males and 662 females in the original 1,338 rows (the one dropped duplicate lowers a count by at most one), and both groups have a similar average number of children. Smoker status is highly imbalanced, with 1,064 non‑smokers and only 274 smokers in the original data. All four geographic regions are represented, with the Southeast appearing most frequently.

In [None]:
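# Summary statistics for categorical columns ('sex' and 'smoker' are already 0/1 encoded).
print(df_medical_final.describe(include=['object', 'category']))

# Category counts for the encoded binary columns.
print(df_medical_final['sex'].value_counts())
print(df_medical_final['smoker'].value_counts())

# What are all unique values for region?
print(df_medical_final['region'].unique())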

Next, we review the distribution of the target variable, charges, to assess its skewness and identify any potential outliers. The numerical summary shows that charges are right‑skewed, as the mean is noticeably higher than the median. This pattern indicates that a small subset of individuals incur exceptionally high medical costs, creating a long upper tail and potential outliers. The histogram below will further illustrate this skewed distribution.

In [None]:
# Plot histogram of charges
fig, ax = subplots(figsize=(8, 6))
ax.hist(df_medical_final.loc[:, 'charges'], bins=30, color='slategray', edgecolor='black')
ax.set_title('Distribution of Medical Charges')
ax.set_xlabel('Charges')
ax.set_ylabel('Frequency')
plt.show()
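The mean-versus-median check described above can be sketched numerically. The example below uses a synthetic log-normal sample as a stand-in for a right-skewed cost variable (these are not the insurance charges, and the distribution parameters are assumptions chosen only for illustration).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal); these are NOT the insurance charges.
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=5_000))

mean_value = sample.mean()
median_value = sample.median()
skewness = sample.skew()  # sample skewness; positive values indicate a longer right tail

print(f"mean={mean_value:,.0f}  median={median_value:,.0f}  skew={skewness:.2f}")
```

On the notebook's data, the equivalent check would be `df_medical_final['charges'].skew()` alongside the mean and median from `describe()`.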

The correlation matrix below shows the correlation coefficients between numerical features in the dataset. The strongest correlation with the target variable 'charges' is with 'smoker', followed at a distance by 'age', indicating that these features may be important predictors of medical charges. 'Bmi' also shows a modest positive relationship with 'charges', suggesting it may contribute to cost variation even if its influence is weaker. Other variables, such as 'sex' and 'children', exhibit only minimal correlation, implying they are unlikely to play a substantial role in predicting medical expenses.

In [None]:
# Compute correlation matrix for numerical features.
df_medical_final.info()  # Confirm which columns are numeric

# Compute correlation matrix for numeric features only
print(df_medical_final.select_dtypes(include='number').corr())

## Task 2. Modeling and Evaluation
The next task begins the modeling process by implementing a baseline linear regression model using the predictor features selected during Task 1. These predictors were chosen mainly based on their correlation with the target variable. After defining the feature matrix (X) and the target (y), the dataset is split into training and testing subsets so that model performance can be evaluated on unseen data. This split provides a fair assessment of how well the baseline model generalizes beyond the training sample.

For the baseline model, the spread of the prediction errors shows that although it captures the overall trend, its precision varies across individual cases. The baseline regression achieved an R² of about 0.80, meaning it explains a substantial portion of the variation in medical charges despite the noticeable prediction errors.

In [None]:
# Define X and y
X = df_medical_final[['age', 'bmi', 'sex', 'smoker']]
y = df_medical_final['charges']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    random_state=42   # Fixed seed for reproducibility; 42 is an arbitrary but conventional choice
) 

# Test the sizes of the splits
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Baseline Linear Regression Model
baseline_model = LinearRegression()
baseline_model.fit(X_train, y_train)

# Evaluate baseline model performance
y_pred = baseline_model.predict(X_test)
mse_baseline = mean_squared_error(y_test, y_pred)
r2_baseline = r2_score(y_test, y_pred)
print(f"Baseline Linear Regression Model - MSE: {mse_baseline}, R²: {r2_baseline}")
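Task 2 also names RMSE and MAE as evaluation measures. A minimal sketch of how they could be computed alongside MSE and R² is shown below, using small hypothetical arrays (`y_true_demo` and `y_pred_demo` are illustrative names and values, not the notebook's results).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted charges, used only to illustrate the metrics.
y_true_demo = np.array([1200.0, 8500.0, 4300.0, 39000.0, 15000.0])
y_pred_demo = np.array([2000.0, 9000.0, 3500.0, 31000.0, 16000.0])

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
rmse_demo = np.sqrt(mse_demo)  # square root puts the error back in the target's units
mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)
r2_demo = r2_score(y_true_demo, y_pred_demo)

print(f"MSE: {mse_demo:.2f}, RMSE: {rmse_demo:.2f}, MAE: {mae_demo:.2f}, R²: {r2_demo:.3f}")
```

Because RMSE squares the errors before averaging, a single large miss (such as the 39,000 case here) raises it well above MAE, which is relevant for a target with a long right tail.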

Ridge Regression (L2) is a regularization technique that adds a penalty proportional to the sum of the squared coefficients to the loss function. This approach helps prevent overfitting by shrinking all coefficients toward zero (without setting any exactly to zero), leading to a more generalized model. Ridge Regression is particularly useful when dealing with multicollinearity among predictors, as it can stabilize coefficient estimates. Like any linear model, Ridge Regression may not perform well when the true relationship between the features and the target variable is highly nonlinear.
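As an illustration of the L2 penalty, the sketch below compares scikit-learn's `Ridge` with the textbook closed-form solution on synthetic data. The data, the names `X_demo` and `y_demo`, and the true coefficients are assumptions made for this example; the comparison relies on the intercept being left unpenalized, which is handled here by centering.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (hypothetical coefficients, not the insurance data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

alpha = 1.0
ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)

# Closed-form ridge solution on centered data (the intercept is not penalized):
#   beta = (Xc^T Xc + alpha * I)^(-1) Xc^T yc
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()
beta_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)

# Ordinary least squares on the same centered data, for comparison.
beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

print("sklearn Ridge:", np.round(ridge_demo.coef_, 4))
print("closed form:  ", np.round(beta_closed, 4))
print("OLS:          ", np.round(beta_ols, 4))
```

The ridge coefficient vector has a smaller norm than the OLS one, which is the shrinkage effect described above; with a small `alpha` relative to the sample size, the two are close.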

The Ridge model produced an R² of 0.80 and an MSE of roughly 36 million, showing that it explains a substantial portion of the variance in medical charges while maintaining prediction errors comparable to the baseline model. The predictions fall within a reasonable range, suggesting that the L2 penalty controlled coefficient magnitude without noticeably decreasing accuracy. Because the scores are nearly identical to the baseline, Ridge offers a regularized alternative here rather than a clear improvement, though that stability can still be useful given the variability and skewness of medical‑cost data.

In [None]:
ridge_model = Ridge(alpha=1.0)

# Fit the model on the training data.
ridge_model.fit(X_train, y_train)

# Predictions on the test set.
y_pred = ridge_model.predict(X_test)
print(f"Ridge Regression Predictions:\n{y_pred[:10]}")

# Evaluate the Ridge Regression model
mse_ridge = mean_squared_error(y_test, y_pred)
r2_ridge = r2_score(y_test, y_pred)
print(f"Ridge Regression Mean Squared Error: {mse_ridge:.2f}")
print(f"Ridge Regression R^2 Score: {r2_ridge:.2f}")

Lasso Regression (L1) is another regularization technique that adds a penalty proportional to the sum of the absolute values of the coefficients to the loss function. This approach can lead to sparse models where some feature coefficients are exactly zero, effectively performing feature selection. Lasso Regression is particularly useful when we suspect that many features are irrelevant or when we want a more interpretable model. However, a potential drawback is that it may not perform well when the true relationship is complex and involves many small effects.
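The sparsity property can be seen on a small synthetic example. In the sketch below, only the first of five features influences the target; the data, the penalty strength `alpha=0.5`, and the names `X_demo` and `y_demo` are assumptions chosen for illustration rather than settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))

# Only the first feature drives the target; the other four are pure noise.
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=500)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))

# Count coefficients that the L1 penalty set exactly to zero.
n_zero = int(np.sum(lasso_demo.coef_ == 0))
print(f"Coefficients set exactly to zero: {n_zero}")
```

The surviving coefficient is also shrunk below its true value of 3.0, which shows the trade-off: a stronger penalty removes more noise features but biases the remaining estimates toward zero.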

In this analysis, the Lasso model achieved an MSE of 35,834,592 and an R² of 0.805, indicating that it explained about 80% of the variance in the outcome. These results show that the model generalized well to unseen data and maintained stability in its predictions, reflecting the balance Lasso strikes between model simplicity and predictive accuracy.

In [None]:
lasso_model = Lasso(alpha=0.1)

# Fit the model on the training data.
lasso_model.fit(X_train, y_train)

# Predictions on the test set.
y_pred = lasso_model.predict(X_test)
print(f"Lasso Regression Predictions:\n{y_pred[:10]}")

# Evaluate Lasso model performance
mse_lasso = mean_squared_error(y_test, y_pred)
r2_lasso = r2_score(y_test, y_pred)
print(f"Lasso Regression Model - MSE: {mse_lasso}, R²: {r2_lasso}")

All three models made almost the same predictions, which is why their residuals overlap so closely on the plot below. This indicates that regularization at these penalty strengths barely changes the fitted coefficients. The only small concern is that the residuals spread out more at higher predicted values (heteroscedasticity), which makes sense because some patients have unusually high charges. Overall, the models appear stable, and there is no need for major adjustments.

In [None]:
# Predictions
y_pred_lin = baseline_model.predict(X_test)
y_pred_ridge = ridge_model.predict(X_test)
y_pred_lasso = lasso_model.predict(X_test)

# Residuals
res_lin = y_test - y_pred_lin
res_ridge = y_test - y_pred_ridge
res_lasso = y_test - y_pred_lasso

# Diverging colors from a standard palette
colors = ["#3B4CC0", "#B5B5B5", "#B40426"]  # blue → grey → red

plt.figure(figsize=(8,6))
plt.scatter(y_pred_lin, res_lin, alpha=0.5, color=colors[0], label="Linear")
plt.scatter(y_pred_ridge, res_ridge, alpha=0.5, color=colors[1], label="Ridge")
plt.scatter(y_pred_lasso, res_lasso, alpha=0.5, color=colors[2], label="Lasso")

plt.axhline(0, color='grey')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted Values (All Models)")
plt.legend()
plt.show()

Finally, the validated models are trained on the training data and predictions are made on the test set.

In [None]:
# Final model training on the entire training set
baseline_model.fit(X_train, y_train)
ridge_model.fit(X_train, y_train)
lasso_model.fit(X_train, y_train)

# Predictions on the test set
y_pred_baseline = baseline_model.predict(X_test)
y_pred_ridge = ridge_model.predict(X_test)
y_pred_lasso = lasso_model.predict(X_test)

# Show first 10 predictions for each model
print("Baseline Model Predictions:", y_pred_baseline[:10])
print("Ridge Model Predictions:", y_pred_ridge[:10])
print("Lasso Model Predictions:", y_pred_lasso[:10])

# Evaluate all models
mse_baseline = mean_squared_error(y_test, y_pred_baseline)
r2_baseline = r2_score(y_test, y_pred_baseline)
print(f"Baseline Linear Regression Model - MSE: {mse_baseline:.2f}, R²: {r2_baseline:.2f}")

mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)
print(f"Ridge Linear Regression Model - MSE: {mse_ridge:.2f}, R²: {r2_ridge:.2f}")

mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)
print(f"Lasso Linear Regression Model - MSE: {mse_lasso:.2f}, R²: {r2_lasso:.2f}")


## Task 3. Analysis and Interpretation
The three models performed almost identically, with all MSE and R² values differing only in the fourth or fifth decimal place. This indicates that the baseline linear regression already captured the underlying relationships well, leaving little need for regularization to improve performance. Ridge regression performed slightly worse, suggesting that the coefficient penalty may have softened meaningful signal. Lasso matched the baseline model closely, showing that no major predictors were eliminated and that the overall error patterns remained stable across models. Residual behavior appeared consistent across models, with no evidence of systematic under‑ or over‑prediction at higher values, although higher charges naturally produce larger prediction errors.

In [None]:
# Table to compare model performances
model_performance = pd.DataFrame({
    'Model': ['Baseline Linear Regression', 'Ridge Regression', 'Lasso Regression'],
    'Mean Squared Error': [mse_baseline, mse_ridge, mse_lasso],
    'R^2 Score': [r2_baseline, r2_ridge, r2_lasso]
})
print(model_performance)


Earlier in the analysis, smoker status was identified as the strongest predictor of healthcare costs, and the model coefficients reinforce this conclusion. Across all three regression models, the smoker coefficient is consistently around $23,000, indicating a substantial increase in predicted charges for smokers compared to non‑smokers when the other predictors are held constant. In contrast, age and BMI have much smaller positive effects per unit, and sex contributes very little, which aligns with their weaker correlations with the target. These results make sense given known cost drivers in healthcare and support the conclusion that smoking is a significant factor in determining healthcare costs.

In [None]:
# Baseline Linear Regression coefficients
print("Baseline Coefficients:")
print(pd.Series(baseline_model.coef_, index=X.columns))

# Ridge Regression coefficients
print("\nRidge Coefficients:")
print(pd.Series(ridge_model.coef_, index=X.columns))

# Lasso Regression coefficients
print("\nLasso Coefficients:")
print(pd.Series(lasso_model.coef_, index=X.columns))
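To show how a coefficient on a 0/1 feature is read, the sketch below fits a linear regression on synthetic data and compares two otherwise identical individuals who differ only in smoker status. The data-generating values (250 per year of age, 20,000 for smoking) are hypothetical inputs for this example, not the notebook's fitted coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker": rng.integers(0, 2, size=n),
})

# Hypothetical data-generating process, not the notebook's fitted values.
demo_charges = 250 * demo["age"] + 20_000 * demo["smoker"] + rng.normal(0, 1_000, size=n)

demo_model = LinearRegression().fit(demo, demo_charges)
coefs = pd.Series(demo_model.coef_, index=demo.columns)
print(coefs)

# Two otherwise identical 40-year-olds who differ only in smoker status.
pair = pd.DataFrame({"age": [40, 40], "smoker": [0, 1]})
pred_non_smoker, pred_smoker = demo_model.predict(pair)
gap = pred_smoker - pred_non_smoker
print(f"Predicted gap between smoker and non-smoker: {gap:,.2f}")
```

In a linear model the predicted gap equals the smoker coefficient exactly, regardless of the age chosen, which is what "holding the other predictors constant" means in practice.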

## Conclusions
These predictions have real‑world value for healthcare cost management because they highlight which patient groups drive spending. Smoking status remains the strongest predictor, but age also plays a meaningful role. The dataset shows that the largest portion of patients falls into the 25–44 age group (531), followed by 45–59 (415), 18–24 (277), and 60+ (114). Understanding both the size of each age group and the model’s finding that charges increase by roughly $250 per year of age can help healthcare organizations anticipate where costs are likely to increase. This allows providers to target interventions and allocate resources in a way that reflects the actual distribution of their patient population.

## References
Choi, M. (2016). Medical Cost Personal Datasets [Data set]. Kaggle. https://www.kaggle.com/datasets/mirichoi0218/insurance  

HealthXWire. (2025). WHO categories of age: Understanding global health definitions. https://wis.it.com/who-categories-of-age-global-health-definitions

James, G., Witten, D., Hastie, T., Tibshirani, R., Taylor, J., & Guo, W. (2023). An introduction to statistical learning: With applications in Python. Springer.

Koivunen-Niemi, L., & K. (2022). Learn best practices for color use in data visualization with python and data from our world in data (2018). In Sage Research Methods: Data Visualization. SAGE Publications, Ltd. https://doi.org/10.4135/9781529605198

Microsoft. (2024). Copilot [Large language model]. Microsoft. https://copilot.microsoft.com
[^1]: Copilot was used only for debugging code and refining wording; all analysis and modeling were performed independently.
