## **Introduction**

Health insurance premiums are influenced by a multitude of factors, ranging from individual characteristics like age and gender to broader lifestyle choices such as smoking and exercise habits. Understanding these variables and their impact on insurance charges is key for creating effective pricing models and offering competitive insurance plans. This project leverages a dataset from Kaggle, "US Health Insurance Dataset," to predict health insurance charges based on various demographic and lifestyle factors.

The objective of this project is to gain practical experience in predictive modeling, focusing on regression techniques such as linear regression and more advanced models like Random Forest and XGBoost. By analyzing the dataset, which includes features such as age, sex, BMI, smoking status, and region, the relationships between these factors and insurance charges are explored. Through careful data preprocessing, feature engineering, and model tuning, the goal is to develop a model that can accurately predict insurance charges while minimizing errors.

In addition to refining predictive modeling skills, this project provides insight into the data preprocessing steps required for regression tasks, particularly in the context of healthcare data. The data is checked for missing values, categorical variables are encoded, and numerical features are scaled and combined into an interaction term before modeling. Ultimately, this project emphasizes the importance of understanding the factors influencing insurance pricing and the practical applications of regression in domains such as healthcare economics and insurance.

## **About the Data**

The dataset used in this project is the "US Health Insurance Dataset", sourced from Kaggle. It provides information on individual medical charges billed by health insurance providers in the United States. This dataset serves as an excellent foundation for exploring regression techniques in a real-world healthcare context, where understanding the impact of personal and behavioral attributes on insurance costs is both practical and insightful.

The dataset contains 1,338 entries and 7 columns: six features and the target variable `charges`. Each row represents a unique individual, capturing a range of attributes related to demographics, physical metrics, and lifestyle choices.

#### Key Columns

Below are some of the important columns examined to understand the data:

- `age`: Age of the individual
- `sex`: Gender of the individual (male or female)
- `bmi`: Body Mass Index, a standard measure of body fat based on height and weight
- `children`: Number of dependents covered by health insurance
- `smoker`: Whether the person is a smoker (yes or no)
- `region`: Geographic location within the U.S. (northeast, northwest, southeast, southwest)
- `charges`: The medical costs billed to the individual (target variable)

#### Initial Exploration and Insights

To gain an initial understanding of the dataset, the following steps were performed:

- **Data Inspection**: Summary statistics and `.info()` methods were used to understand data types, non-null counts, and the presence of missing values (none were found).
- **Categorical Distribution**: Frequency counts were used to analyze distributions of categorical features like `sex`, `smoker`, and `region`.
- **Feature Relationships**: Correlations among numerical features (e.g., `age`, `bmi`, `children`, `charges`) were evaluated using correlation matrices to identify patterns and potentially influential variables.
- **Outlier Detection**: Boxplots and interquartile range (IQR) analysis were used to detect extreme values in the `charges` column, often driven by smoker status and high BMI levels.
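
The exploration steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual exploration code: the `toy` DataFrame and its made-up `charges` formula are assumptions that stand in for `insurance.csv` so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for insurance.csv so the sketch is self-contained.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n).round(1),
    "children": rng.integers(0, 5, size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], size=n),
})
# Made-up charges: an age effect, a large smoker surcharge, and noise.
toy["charges"] = (
    250 * toy["age"]
    + np.where(toy["smoker"] == "yes", 24000, 0)
    + rng.normal(0, 1500, size=n)
).clip(lower=1000)

# Categorical distribution: frequency counts per category.
smoker_counts = toy["smoker"].value_counts()

# Feature relationships: Pearson correlations among the numeric columns.
corr = toy[["age", "bmi", "children", "charges"]].corr()

# Outlier detection: flag charges outside 1.5 * IQR beyond the quartiles.
q1, q3 = toy["charges"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["charges"] < q1 - 1.5 * iqr) | (toy["charges"] > q3 + 1.5 * iqr)

print(smoker_counts)
print(corr.round(2))
print("Flagged rows by smoker status:")
print(toy.loc[is_outlier, "smoker"].value_counts())
```

On the real dataset, the same three calls (`value_counts`, `corr`, and the quantile-based mask) would be applied to `df` directly.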

Although this section focuses on preprocessing, subsequent phases of the project will incorporate scatter plots and error analysis to visualize the effectiveness of different regression models—specifically, Linear Regression, Random Forest Regression, and XGBoost. These will help illustrate how well the models predict actual insurance charges and reveal areas where prediction accuracy could be improved.

This dataset is publicly available at: [US Health Insurance Dataset](https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset).

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from scipy import stats

# Load dataset
df = pd.read_csv('insurance.csv')

# Overview of the dataset
print("Dataset Info:")
print(df.info())

print("\nMissing Values:\n", df.isnull().sum())

print("\nPreview:")
print(df.head())

print("\nColumns:", df.columns.tolist())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
None

Missing Values:
 age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

Preview:
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4 

## **Pre-Processing**

In [2]:
# Fill missing values
for col in ['age', 'bmi', 'children', 'charges']:
    if col in df.columns:
        df[col] = df[col].fillna(df[col].median())

for col in ['sex', 'smoker', 'region']:
    if col in df.columns:
        df[col] = df[col].fillna(df[col].mode()[0])

# Encode categorical variables
if 'sex' in df.columns:
    df['sex'] = LabelEncoder().fit_transform(df['sex'])

if 'smoker' in df.columns:
    df['smoker'] = LabelEncoder().fit_transform(df['smoker'])

# One-hot encode 'region'
if 'region' in df.columns:
    df = pd.get_dummies(df, columns=['region'], drop_first=True)

# Scale numerical features
scaler = StandardScaler()
for col in ['age', 'bmi', 'children']:
    if col in df.columns:
        df[[col]] = scaler.fit_transform(df[[col]])

# Feature Engineering: Create interaction feature (product of the standardized age and bmi)
df['age_bmi'] = df['age'] * df['bmi']

# Remove rows whose 'charges' fall outside 1.5 * IQR of the quartiles
Q1 = df['charges'].quantile(0.25)
Q3 = df['charges'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['charges'] >= lower_bound) & (df['charges'] <= upper_bound)]

# Define features and target
X = df.drop('charges', axis=1)
y = df['charges']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of the training set and mean charge in the training set
print("\nX_train shape:", X_train.shape)
print("y_train mean charges:", y_train.mean())


X_train shape: (959, 9)
y_train mean charges: 9738.038385693431


Before training regression models to predict health insurance charges, several preprocessing steps were implemented to clean and prepare the dataset for effective learning. These transformations ensured the data was both consistent and well-suited for input into machine learning algorithms.

#### 1. Handling Missing Values
The inspection step found no missing values, so the imputation code acts as a safeguard and leaves this dataset unchanged. Had values been missing, they would have been filled according to data type:

- **Numerical columns** (`age`, `bmi`, `children`, `charges`) would be imputed with the **median**, which is robust to outliers and is not skewed by extreme values. Note that `charges` is the target; with real gaps, dropping rows that lack a target is usually preferable to imputing it.
- **Categorical columns** (`sex`, `smoker`, `region`) would be imputed with their **mode**, the most frequent category, which is a common technique for nominal data.
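
As a small illustration of the two imputation strategies, the sketch below applies them to a toy frame with deliberately inserted gaps. The values are made up for the example and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are invented for illustration.
toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 31.5, 27.0, 45.0],
    "smoker": ["no", "no", None, "yes", "no"],
})

# Median of the observed values: sorted [22.0, 27.0, 31.5, 45.0] -> (27.0 + 31.5) / 2.
bmi_median = toy["bmi"].median()
toy["bmi"] = toy["bmi"].fillna(bmi_median)

# mode() returns a Series because ties are possible, so take the first entry.
smoker_mode = toy["smoker"].mode()[0]
toy["smoker"] = toy["smoker"].fillna(smoker_mode)

print(toy)
print("median used:", bmi_median, "| mode used:", smoker_mode)
```

The extreme value 45.0 pulls the mean of the observed BMIs up to about 31.4, while the median stays at 29.25, which is the robustness property mentioned above.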

#### 2. Encoding Categorical Variables
Most machine learning models require numerical inputs, so the categorical variables were encoded as follows:

- **Binary columns** (`sex` and `smoker`) were encoded using **Label Encoding**. `LabelEncoder` assigns integers in sorted label order, so `female` maps to `0` and `male` to `1`, and `no` maps to `0` and `yes` to `1`.
- The **`region`** column, which has four categories, was transformed using **One-Hot Encoding**. With `drop_first=True`, this produces three indicator columns (`region_northwest`, `region_southeast`, `region_southwest`); the dropped category, `northeast`, serves as the baseline, which avoids perfect multicollinearity among the indicators.
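
The encoding behavior described above can be checked on a few hand-written rows. This is a sketch on a toy frame, not the notebook's cell; the `sex_mapping` and `smoker_mapping` dictionaries are helper names introduced here only to display the learned codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# LabelEncoder assigns integer codes in sorted order of the class labels.
sex_encoder = LabelEncoder()
toy["sex"] = sex_encoder.fit_transform(toy["sex"])
sex_mapping = {str(label): code for code, label in enumerate(sex_encoder.classes_)}

smoker_encoder = LabelEncoder()
toy["smoker"] = smoker_encoder.fit_transform(toy["smoker"])
smoker_mapping = {str(label): code for code, label in enumerate(smoker_encoder.classes_)}

# drop_first=True drops the first category in sorted order (northeast here),
# which then acts as the implicit baseline.
encoded = pd.get_dummies(toy, columns=["region"], drop_first=True)

print(sex_mapping, smoker_mapping)
print(encoded.columns.tolist())
```

Keeping the fitted encoder objects around, rather than calling `LabelEncoder()` inline, makes it possible to inspect `classes_` or to reverse the mapping later with `inverse_transform`.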

#### 3. Feature Scaling
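To remove scale discrepancies across features, the numerical variables (`age`, `bmi`, and `children`) were standardized using **StandardScaler**, which rescales each feature to a mean of 0 and a standard deviation of 1. Standardization matters mainly for scale-sensitive methods (for example, regularized linear models) and makes linear regression coefficients comparable; tree-based models such as Random Forest and XGBoost are largely insensitive to feature scale. Note that the scaler was fit on the full dataset before the train-test split, so test-set statistics influence the transformation; fitting it on the training set alone would avoid this leakage.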
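
The following sketch shows what standardization computes and one way to fit the scaler on training rows only. It is an alternative ordering to the notebook's cell, which scales before splitting, and it runs on randomly generated stand-in values rather than the real columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical age-like and BMI-like columns on different scales.
X = np.column_stack([
    rng.integers(18, 65, size=100),
    rng.normal(30, 6, size=100),
]).astype(float)

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

# StandardScaler computes z = (x - mean) / std with the population std (ddof=0).
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_train_scaled, manual))
print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The test rows will generally not have exactly zero mean and unit variance after `transform`, which is expected: they are scaled with the training statistics.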

#### 4. Feature Engineering
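A new interaction feature, **`age_bmi`**, was created by multiplying `age` and `bmi`. It is intended to capture a potential compounding effect of age and body mass on medical expenses. Because the product is computed after scaling, it is a product of standardized values rather than raw ones: it is positive when both features lie on the same side of their means and negative otherwise, and it is not itself rescaled.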
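
To make the effect of multiplying standardized columns concrete, the sketch below compares an interaction built from raw values with one built from z-scores. The five rows and the column names `age_bmi_raw` and `age_bmi_z` are assumptions made up for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "bmi": [20.0, 25.0, 30.0, 35.0, 40.0],
})

# Interaction on the raw scale: grows as age and BMI grow together.
toy["age_bmi_raw"] = toy["age"] * toy["bmi"]

# Interaction on z-scores, using the population std (ddof=0) like StandardScaler.
base = toy[["age", "bmi"]]
z = (base - base.mean()) / base.std(ddof=0)
toy["age_bmi_z"] = z["age"] * z["bmi"]

print(toy)
```

In this toy data the youngest, leanest row and the oldest, heaviest row receive the same z-score product, while the raw product keeps them far apart. Which form is more useful depends on the model, so both orderings are worth comparing during feature engineering.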

#### 5. Outlier Removal
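Outliers in the target variable `charges` were addressed using the **Interquartile Range (IQR) method**: rows with `charges` below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR were removed. This reduces the skewness of the target distribution. The trade-off is that the extreme values in `charges` are often driven by smoker status and high BMI, so removing them also removes the highest-cost cases, and a model trained on the filtered data should not be expected to predict charges in that range well.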
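
The IQR rule can be worked through on a handful of numbers. The helper `iqr_bounds` and the sample values below are hypothetical, written only to show the arithmetic.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Nine made-up charges with one extreme value.
charges = pd.Series(
    [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 7000.0, 8000.0, 50000.0]
)

# Q1 = 3000 and Q3 = 7000, so IQR = 4000 and the fences are -3000 and 13000.
lower, upper = iqr_bounds(charges)
kept = charges[(charges >= lower) & (charges <= upper)]

print(lower, upper)
print(len(charges) - len(kept), "row(s) removed")
```

Here the lower fence is negative, so only the high value is removed. The same asymmetry tends to appear with right-skewed cost data.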

#### 6. Train-Test Split
The cleaned dataset was divided into training and testing sets using an **80-20 split** with `random_state=42` for reproducibility. The model is trained on 80% of the data (959 rows with 9 features, as reported above) and the remaining 20% is reserved for final evaluation on unseen data.
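
The split sizes can be reproduced on placeholder arrays. The row count of 1,199 is an assumption rather than a figure stated in the notebook: it is the only size for which a 20% test split leaves the reported 959 training rows. The arrays contain dummy integers, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1199  # assumed post-filtering row count, inferred from X_train shape (959, 9)
X = np.arange(n_rows * 9).reshape(n_rows, 9)  # dummy feature matrix with 9 columns
y = np.arange(n_rows)                         # dummy target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state yields the same partition.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, X_train_again))
```

scikit-learn rounds the test size up and assigns the remaining rows to the training set, which is why the split is 959 and 240 rather than an exact 80-20 division.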