![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



The raw data needs some cleaning before it is ready for modeling. Once your model is built on the `insurance.csv` dataset, the next step is to apply it to `validation_dataset.csv`. This second dataset has the same columns as the training data except `charges`, so it tests your model's real-world utility by asking it to predict costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [238]:
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_validate

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

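|   | age  | sex    | bmi    | children | smoker | region    | charges      |
|---|------|--------|--------|----------|--------|-----------|--------------|
| 0 | 19.0 | female | 27.9   | 0.0      | yes    | southwest | 16884.924    |
| 1 | 18.0 | male   | 33.77  | 1.0      | no     | Southeast | 1725.5523    |
| 2 | 28.0 | male   | 33.0   | 3.0      | no     | southeast | $4449.462    |
| 3 | 33.0 | male   | 22.705 | 0.0      | no     | northwest | $21984.47061 |
| 4 | 32.0 | male   | 28.88  | 0.0      | no     | northwest | $3866.8552   |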


In [239]:
insurance.shape

(1338, 7)

In [240]:
insurance.describe()

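|       | age       | bmi      | children |
|-------|-----------|----------|----------|
| count | 1272.0    | 1272.0   | 1272.0   |
| mean  | 35.214623 | 30.56055 | 0.948899 |
| std   | 22.478251 | 6.095573 | 1.303532 |
| min   | -64.0     | 15.96    | -4.0     |
| 25%   | 24.75     | 26.18    | 0.0      |
| 50%   | 38.0      | 30.21    | 1.0      |
| 75%   | 51.0      | 34.485   | 2.0      |
| max   | 64.0      | 53.13    | 5.0      |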


In [241]:
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1272 non-null   float64
 1   sex       1272 non-null   object 
 2   bmi       1272 non-null   float64
 3   children  1272 non-null   float64
 4   smoker    1272 non-null   object 
 5   region    1272 non-null   object 
 6   charges   1284 non-null   object 
dtypes: float64(3), object(4)
memory usage: 73.3+ KB


In [242]:
insurance.isnull().sum()

age         66
sex         66
bmi         66
children    66
smoker      66
region      66
charges     54
dtype: int64

In [243]:
# Drop every row that has at least one missing value, then confirm none remain
insurance = insurance.dropna()
insurance.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [244]:
insurance.shape

(1208, 7)
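
The default `dropna()` removes a row if any single column is missing, which is why the row count falls from 1338 to 1208 here. The sketch below uses a small made-up frame (the values are illustrative, not taken from `insurance.csv`) to show how you might count the affected rows first, and how the `subset` argument gives a less aggressive alternative if you only need the target to be present.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; each of the last three rows has one gap.
toy = pd.DataFrame({
    "age": [19.0, np.nan, 28.0, 33.0],
    "bmi": [27.9, 33.77, np.nan, 22.7],
    "charges": ["16884.924", "1725.5523", "$4449.462", None],
})

# Rows with at least one missing value: exactly the rows dropna() removes.
rows_with_gaps = int(toy.isna().any(axis=1).sum())

# Default behaviour: keep only fully populated rows.
complete = toy.dropna()

# Alternative: only require the target column to be present.
has_target = toy.dropna(subset=["charges"])

print(f"rows with gaps: {rows_with_gaps}")
print(f"rows kept by dropna(): {len(complete)}")
print(f"rows kept by dropna(subset=['charges']): {len(has_target)}")
```

Keeping rows with a known target would then require imputing the missing features, which this notebook does not do.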

In [245]:
categorical_columns = insurance.select_dtypes(include=['object', 'category']).columns

for column in categorical_columns:
    unique_values = insurance[column].unique()
    print(f"Column: {column}")
    print(f"Categories: {unique_values}")
    print(f"Number of Categories: {len(unique_values)}\n")

Column: sex
Categories: ['female' 'male' 'woman' 'F' 'man' 'M']
Number of Categories: 6

Column: smoker
Categories: ['yes' 'no']
Number of Categories: 2

Column: region
Categories: ['southwest' 'Southeast' 'southeast' 'northwest' 'Northwest' 'Northeast'
 'northeast' 'Southwest']
Number of Categories: 8

Column: charges
Categories: ['16884.924' '1725.5523' '$4449.462' ... '$1629.8335' '2007.945'
 '29141.3603']
Number of Categories: 1207



In [246]:
# `df` is a second name for `insurance`, not a copy: the edits below are visible through both
df = insurance

# Standardizing the 'sex' column
df['sex'] = df['sex'].str.lower().map({'female': 'Female', 'f': 'Female', 
                                       'male': 'Male', 'm': 'Male', 
                                       'woman': 'Female', 'man': 'Male'})

# Standardizing the 'smoker' column
df['smoker'] = df['smoker'].str.lower()

# Standardizing the 'region' column
df['region'] = df['region'].str.lower()

# Cleaning the 'charges' column: strip '$' and ',' and convert to float
# (raw string avoids an invalid-escape warning for '\$')
df['charges'] = df['charges'].replace(r'[\$,]', '', regex=True).astype(float)
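
Since the validation file will need the same label clean-up later, it can help to collect these steps in one function. The following is a minimal sketch under stated assumptions: `clean_insurance` and `SEX_MAP` are hypothetical names introduced here, the toy rows are made up, and `pd.to_numeric(..., errors="coerce")` is swapped in for `astype(float)` so that an unparseable charge becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical helper names; the mapping mirrors the one used in the cell above.
SEX_MAP = {"female": "Female", "f": "Female", "woman": "Female",
           "male": "Male", "m": "Male", "man": "Male"}

def clean_insurance(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    out = raw.copy()
    out["sex"] = out["sex"].str.strip().str.lower().map(SEX_MAP)
    out["smoker"] = out["smoker"].str.strip().str.lower()
    out["region"] = out["region"].str.strip().str.lower()
    # The validation file has no 'charges' column, so only clean it if present.
    if "charges" in out.columns:
        stripped = out["charges"].astype(str).str.replace(r"[$,]", "", regex=True)
        out["charges"] = pd.to_numeric(stripped, errors="coerce")
    return out

# Made-up rows covering the kinds of inconsistencies listed above.
toy = pd.DataFrame({
    "sex": ["F", "man", "female"],
    "smoker": ["yes", "no", "no"],
    "region": ["Southeast", "northwest", "Northeast"],
    "charges": ["$4449.462", "1725.5523", "n/a"],
})
cleaned = clean_insurance(toy)
print(cleaned)
```

Any `sex` label that is not in the mapping becomes `NaN` after `.map`, so it is worth checking `cleaned["sex"].isna().sum()` before modeling.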


In [247]:
categorical_columns = insurance.select_dtypes(include=['object', 'category']).columns

for column in categorical_columns:
    unique_values = insurance[column].unique()
    print(f"Column: {column}")
    print(f"Categories: {unique_values}")
    print(f"Number of Categories: {len(unique_values)}\n")

Column: sex
Categories: ['Female' 'Male']
Number of Categories: 2

Column: smoker
Categories: ['yes' 'no']
Number of Categories: 2

Column: region
Categories: ['southwest' 'southeast' 'northwest' 'northeast']
Number of Categories: 4



In [248]:
# Keep only rows with a positive age (the raw data contains negative ages);
# .copy() makes the result an independent frame so the assignment below is safe
df = df[df['age'] > 0].copy()

# A negative number of children is invalid, so replace those values with 0
df.loc[df['children'] < 0, 'children'] = 0
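
Before applying range rules like these, it can be useful to count how many rows each rule touches. This is a small sketch on made-up values (only the extremes `-64` and `-4` echo the `describe()` output above); it also shows `clip(lower=0)` as an equivalent way to floor the `children` column.

```python
import pandas as pd

# Made-up rows: two non-positive ages and one negative child count.
toy = pd.DataFrame({
    "age": [19.0, -64.0, 28.0, 0.0],
    "children": [0.0, 2.0, -4.0, 1.0],
})

# Boolean masks make it easy to report how much data each rule affects.
bad_age = toy["age"] <= 0
bad_children = toy["children"] < 0
print(f"non-positive ages: {bad_age.sum()}, negative child counts: {bad_children.sum()}")

# Drop invalid ages, then floor children at 0 on an independent copy.
valid = toy.loc[~bad_age].copy()
valid["children"] = valid["children"].clip(lower=0)
print(valid)
```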

In [249]:
# Define features and target
categorical_features = ['sex', 'smoker', 'region']
numerical_features = ['age', 'bmi', 'children']
target = 'charges'

In [250]:
# Separate features (X) and target (y)
X = df[categorical_features + numerical_features]
y = df[target]

# One-hot encode categorical features
X_categorical = pd.get_dummies(X[categorical_features], drop_first=True)

# Standardize numerical features
scaler = StandardScaler()
X_numerical = pd.DataFrame(scaler.fit_transform(X[numerical_features]), 
                           columns=numerical_features, index=X.index)

# Combine processed categorical and numerical features
X = pd.concat([X_numerical, X_categorical], axis=1)
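
The cell above fits the scaler on the full dataset before any cross-validation, so every fold's held-out rows contribute to the scaling statistics. For plain linear regression this has essentially no effect on the predictions, but it matters for regularized models. One common alternative is to wrap preprocessing and model in a scikit-learn `Pipeline`, which refits the preprocessing inside each fold. The sketch below assumes synthetic stand-in data (the coefficients and noise level are invented) and uses `OneHotEncoder(drop="first")` in place of `pd.get_dummies`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_features = ["sex", "smoker", "region"]
numerical_features = ["age", "bmi", "children"]

# Synthetic stand-in for the cleaned insurance data (values are invented).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "sex": rng.choice(["Female", "Male"], n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
    "age": rng.integers(18, 65, n).astype(float),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 6, n).astype(float),
})
toy["charges"] = (250 * toy["age"]
                  + 20000 * (toy["smoker"] == "yes")
                  + rng.normal(0, 3000, n))

# Scale numeric columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numerical_features),
    ("cat", OneHotEncoder(drop="first"), categorical_features),
])

# The pipeline refits the scaler and encoder on each training fold.
pipeline = Pipeline([("prep", preprocess), ("model", LinearRegression())])

scores = cross_val_score(
    pipeline,
    toy[categorical_features + numerical_features],
    toy["charges"],
    cv=5,
    scoring="r2",
)
print(f"Mean R² on synthetic data: {scores.mean():.4f}")
```

A fitted pipeline also remembers the training categories, so it can be applied to new rows with `pipeline.predict(new_rows)` without aligning dummy columns by hand.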

In [251]:
X

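|      | age       | bmi       | children  | sex_Male | smoker_yes | region_northwest | region_southeast | region_southwest |
|------|-----------|-----------|-----------|----------|------------|------------------|------------------|------------------|
| 0    | -1.427171 | -0.439874 | -0.853770 | 0        | 1          | 0                | 0                | 1                |
| 1    | -1.497807 | 0.519066  | -0.014607 | 1        | 0          | 0                | 1                | 0                |
| 2    | -0.791445 | 0.393276  | 1.663719  | 1        | 0          | 0                | 1                | 0                |
| 3    | -0.438264 | -1.288543 | -0.853770 | 1        | 0          | 1                | 0                | 0                |
| 4    | -0.508900 | -0.279778 | -0.853770 | 1        | 0          | 1                | 0                | 0                |
| ...  | ...       | ...       | ...       | ...      | ...        | ...              | ...              | ...              |
| 1332 | 0.903824  | 2.304620  | 1.663719  | 0        | 0          | 0                | 0                | 1                |
| 1333 | 0.762551  | 0.061650  | 1.663719  | 1        | 0          | 1                | 0                | 0                |
| 1335 | -1.497807 | 1.022223  | -0.853770 | 0        | 0          | 0                | 1                | 0                |
| 1336 | -1.285898 | -0.782935 | -0.853770 | 0        | 0          | 0                | 0                | 1                |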


In [252]:
# Initialize the model
model = LinearRegression()

# Evaluate with 5-fold cross-validation, scoring each fold by R²
cv_results = cross_validate(
    model,
    X,                        # Processed features
    y,                        # Target variable
    cv=5,                     # Number of folds
    scoring='r2',             # R² score
    return_train_score=True   # Also record the scores on the training folds
)

# cross_validate fits clones of the model, so fit it on the full dataset
# to have a trained model for the predictions later on
model.fit(X, y)


In [253]:
# Show which results cross_validate returned
print(cv_results.keys())

# Mean R² across the 5 folds, on the training and held-out portions
print(f"Mean Train R²: {cv_results['train_score'].mean():.4f}")
print(f"Mean Test R²: {cv_results['test_score'].mean():.4f}")

dict_keys(['fit_time', 'score_time', 'test_score', 'train_score'])
Mean Train R²: 0.7518
Mean Test R²: 0.7451


In [254]:
# cross_val_score returns only the held-out R² of each fold,
# so its mean matches the test score reported by cross_validate above
r2_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

# Average the fold scores and print the result
mean_r2 = r2_scores.mean()
print(f"Final R² Score: {mean_r2:.4f}")

Final R² Score: 0.7451
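
R² tells you the share of variance explained, but not how large the errors are in dollars. As a sketch, `cross_validate` accepts several scorer names at once, so you could report a root mean squared error alongside R². The data below is synthetic (the coefficients and noise are invented for illustration); scikit-learn reports the error negated, so the sign is flipped before printing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression problem standing in for the processed X and y.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 3))
y_toy = (13000
         + X_toy @ np.array([3500.0, 2000.0, 500.0])
         + rng.normal(0, 6000, 300))

# With multiple scorers, results are keyed as 'test_<scorer name>'.
results = cross_validate(
    LinearRegression(),
    X_toy,
    y_toy,
    cv=5,
    scoring=("r2", "neg_root_mean_squared_error"),
)

# scikit-learn maximizes scores, so RMSE comes back negated.
rmse = -results["test_neg_root_mean_squared_error"]
print(f"Mean R²:   {results['test_r2'].mean():.4f}")
print(f"Mean RMSE: {rmse.mean():.2f}")
```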


In [255]:
# Load validation dataset
validation_data = pd.read_csv('validation_dataset.csv')

In [256]:
# One-hot encode categorical features in validation_data
validation_data_categorical = pd.get_dummies(validation_data[categorical_features], drop_first=True)

# Standardize numerical features in validation_data
validation_data_numerical = pd.DataFrame(scaler.transform(validation_data[numerical_features]), 
                                         columns=numerical_features, index=validation_data.index)

# Combine processed categorical and numerical features
validation_data_processed = pd.concat([validation_data_numerical, validation_data_categorical], axis=1)

In [257]:
# Make the validation columns match the training columns:
# add any dummy column that is missing as all zeros, then select the
# training columns in their original order (extra columns are dropped)
missing_cols = set(X.columns) - set(validation_data_processed.columns)
for col in missing_cols:
    validation_data_processed[col] = 0
validation_data_processed = validation_data_processed[X.columns]
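
One thing to watch with this alignment step: dummy column names come from the raw labels. The training data was mapped to `Female`/`Male`, which produces `sex_Male`. If the validation file uses lowercase labels, as the final table in this notebook suggests, `pd.get_dummies` produces `sex_male` instead, and the alignment would fill `sex_Male` with zeros for every row. The sketch below illustrates this on made-up rows and shows one possible remedy: normalize the labels first, encode without `drop_first`, and then `reindex` to the training columns. The `train_columns` list is copied from the processed `X` shown earlier; `sex_map` is a hypothetical name.

```python
import pandas as pd

# Dummy columns of the processed training features, as displayed for X.
train_columns = ["sex_Male", "smoker_yes", "region_northwest",
                 "region_southeast", "region_southwest"]

# Made-up new customers with lowercase labels.
new_rows = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["no", "yes", "yes"],
    "region": ["southeast", "northeast", "southeast"],
})

# Encoding the raw labels yields 'sex_male', which does not match 'sex_Male'.
naive = pd.get_dummies(new_rows, drop_first=True)
print(list(naive.columns))

# Apply the same label mapping that was used on the training data.
sex_map = {"female": "Female", "f": "Female", "woman": "Female",
           "male": "Male", "m": "Male", "man": "Male"}
normalized = new_rows.assign(sex=new_rows["sex"].str.lower().map(sex_map))

# Encode every category, then keep exactly the training columns.
# Columns absent from the new data are created and filled with 0.
encoded = (pd.get_dummies(normalized)
           .reindex(columns=train_columns, fill_value=0)
           .astype(int))
print(encoded)
```

Encoding without `drop_first` before reindexing also avoids a subtler issue: on a small sample, `drop_first=True` drops whichever category sorts first among those present, which is not necessarily the baseline category dropped during training.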

In [258]:
# Predict insurance charges for new customers
validation_data['predicted_charges'] = model.predict(validation_data_processed)

# Adjust unrealistic predictions
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000
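
A linear model can produce very low or even negative predictions, which is why a floor is applied here. As a sketch, the same adjustment can be written with `Series.clip`; the `1000` floor is the value chosen in this notebook, and the predictions below are made up.

```python
import pandas as pd

# Made-up raw predictions, including one negative and one below the floor.
raw_predictions = pd.Series([-1500.0, 850.0, 7335.4, 31135.2])

# Raise everything below the floor to the floor, then round to whole dollars.
MIN_CHARGE = 1000
adjusted = raw_predictions.clip(lower=MIN_CHARGE).round()
print(adjusted.tolist())
```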

In [259]:
# Round predictions to whole dollars and display the result
validation_data['predicted_charges'] = np.round(validation_data['predicted_charges'])
validation_data

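|   | age  | sex    | bmi       | children | smoker | region    | predicted_charges |
|---|------|--------|-----------|----------|--------|-----------|-------------------|
| 0 | 18.0 | female | 24.09     | 1.0      | no     | southeast | 1000.0            |
| 1 | 39.0 | male   | 26.41     | 0.0      | yes    | northeast | 31135.0           |
| 2 | 27.0 | male   | 29.15     | 0.0      | yes    | southeast | 28139.0           |
| 3 | 71.0 | male   | 65.502135 | 13.0     | yes    | southeast | 56479.0           |
| 4 | 28.0 | male   | 38.06     | 0.0      | no     | southeast | 7335.0            |
| 5 | 70.0 | female | 72.958351 | 11.0     | yes    | southeast | 57910.0           |
| 6 | 29.0 | female | 32.11     | 2.0      | no     | northwest | 6867.0            |
| 7 | 42.0 | female | 41.325    | 1.0      | no     | northeast | 13201.0           |
| 8 | 48.0 | female | 36.575    | 0.0      | no     | northwest | 12562.0           |
| 9 | 63.0 | male   | 33.66     | 3.0      | no     | southeast | 16198.0           |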
