# Final Project: Regression Analysis
## Predicting Medical Insurance Charges

**Name:** Saratchandra Golla    
**Date:** November 24, 2025

### Introduction
The objective of this project is to apply **regression analysis** to predict the continuous numerical target variable, **insurance charges**, using the **Medical Cost Personal Datasets** (`insurance.csv`).

The workflow follows the project guidelines: inspect the data, perform exploratory data analysis (EDA), build the preprocessing and modeling steps with `ColumnTransformer` and `Pipeline`, train a baseline Linear Regression model, fit two improved models (Ridge and Polynomial Regression), and compare all models on the same metrics: $R^2$, MAE, and RMSE.
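For reference, the three comparison metrics are conventionally defined as below. The symbols are introduced here for illustration and are not part of the original notebook: $y_i$ is the observed charge for test observation $i$, $\hat{y}_i$ is the model's prediction, $\bar{y}$ is the mean of the observed charges, and $n$ is the number of test observations.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
$$

MAE and RMSE are expressed in the same units as `charges`, and RMSE weights large errors more heavily than MAE does. $R^2$ is unitless and measures the share of the target's variance that the model explains.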

## 1. Import and Inspect the Data

In [1]:
# --- Library Imports and Setup ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings

# Suppress all warnings for cleaner output (this also hides deprecation and convergence warnings)
warnings.filterwarnings('ignore')

# Define global variables based on the dataset structure
TARGET_VARIABLE = 'charges'
NUMERICAL_FEATURES = ['age', 'bmi', 'children']
CATEGORICAL_FEATURES = ['sex', 'smoker', 'region']

### 1.1 Load the dataset and display the first 10 rows.

In [4]:
# 1.1 Load the dataset and display the first 10 rows.
df = pd.read_csv("data/insurance.csv")
print("1.1 Dataset Head (First 10 Rows):")
display(df.head(10))

1.1 Dataset Head (First 10 Rows):


| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |
| 5 | 31 | female | 25.74 | 0 | no | southeast | 3756.6216 |
| 6 | 46 | female | 33.44 | 1 | no | southeast | 8240.5896 |
| 7 | 37 | female | 27.74 | 3 | no | northwest | 7281.5056 |
| 8 | 37 | male | 29.83 | 2 | no | northeast | 6406.4107 |
| 9 | 60 | female | 25.84 | 0 | no | northwest | 28923.13692 |


### 1.2 Check for missing values and display summary statistics.

In [3]:
# 1.2 Check for missing values and display summary statistics.
print("\n1.2 Data Information (df.info()):")
df.info()

print("\n1.2 Summary Statistics (df.describe(include='all')): ")
print(df.describe(include='all').T)


1.2 Data Information (df.info()):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

1.2 Summary Statistics (df.describe(include='all')): 
           count unique        top  freq          mean           std  \
age       1338.0    NaN        NaN   NaN     39.207025      14.04996   
sex         1338      2       male   676           NaN           NaN   
bmi       1338.0    NaN        NaN   NaN     30.663397      6.098187   
children  1338.0    NaN        NaN   NaN      1.094918      1.205493   
smoker      1338      2       

### Reflection 1
**What do you notice about the dataset? Are there any data issues?**    
The dataset has **no missing values**: `df.info()` reports 1338 non-null entries for each of the 7 columns, so no imputation is needed before modeling. The target variable, `charges`, is continuous and numeric. Two data issues still need attention:
1. The wide range and large standard deviation of `charges` indicate a highly dispersed target.
2. The **categorical features** (`sex`, `smoker`, `region`) are stored as `object` columns and must be encoded before they can be used in a regression model.
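The observations above can be checked explicitly instead of being read off `df.info()`. The following is a minimal sketch, not the notebook's own code. To stay runnable without `data/insurance.csv`, it rebuilds only the ten rows printed in section 1.1 under the hypothetical name `sample`. The same calls should apply unchanged to the full `df`.

```python
import pandas as pd

# Hypothetical stand-in for df: the ten rows printed in section 1.1,
# re-entered by hand so the sketch does not depend on the CSV file.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 31, 46, 37, 37, 60],
    "sex": ["female", "male", "male", "male", "male",
            "female", "female", "female", "male", "female"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88, 25.74, 33.44, 27.74, 29.83, 25.84],
    "children": [0, 1, 3, 0, 0, 0, 1, 3, 2, 0],
    "smoker": ["yes", "no", "no", "no", "no", "no", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest",
               "southeast", "southeast", "northwest", "northeast", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552,
                3756.6216, 8240.5896, 7281.5056, 6406.4107, 28923.13692],
})

# Explicit per-column count of missing values.
missing_per_column = sample.isna().sum()

# Spread of the target: range and sample standard deviation.
charges_range = sample["charges"].max() - sample["charges"].min()
charges_std = sample["charges"].std()

# Non-numeric columns are the ones that need encoding.
columns_to_encode = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(f"charges range: {charges_range:.2f}, std: {charges_std:.2f}")
print("columns to encode:", columns_to_encode)
```

Selecting the non-numeric columns programmatically is one way to cross-check the hand-written `CATEGORICAL_FEATURES` list against the data actually loaded.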