![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a data scientist at a health insurance company, you will use machine learning to predict customer healthcare costs. Your predictions will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



A bit of data cleaning is needed before the dataset is ready for modeling. Once your model is built using the `insurance.csv` dataset, the next step is to apply it to `validation_dataset.csv`. This new dataset has the same columns as the training data minus `charges`, so it cannot be used to score the model directly; instead, it shows how the model behaves in practice by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [68]:
# Re-run this cell
# Import required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0.0,yes,southwest,16884.924
1,18.0,male,33.77,1.0,no,Southeast,1725.5523
2,28.0,male,33.0,3.0,no,southeast,$4449.462
3,33.0,male,22.705,0.0,no,northwest,$21984.47061
4,32.0,male,28.88,0.0,no,northwest,$3866.8552


In [69]:
# explore info/data within the csv file
insurance.info()
insurance.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1272 non-null   float64
 1   sex       1272 non-null   object 
 2   bmi       1272 non-null   float64
 3   children  1272 non-null   float64
 4   smoker    1272 non-null   object 
 5   region    1272 non-null   object 
 6   charges   1284 non-null   object 
dtypes: float64(3), object(4)
memory usage: 73.3+ KB


Unnamed: 0,age,bmi,children
count,1272.0,1272.0,1272.0
mean,35.214623,30.56055,0.948899
std,22.478251,6.095573,1.303532
min,-64.0,15.96,-4.0
25%,24.75,26.18,0.0
50%,38.0,30.21,1.0
75%,51.0,34.485,2.0
max,64.0,53.13,5.0


In [70]:
# check if there are missing values in the dataframe
missing_values = insurance.isnull().sum()
print(missing_values)

age         66
sex         66
bmi         66
children    66
smoker      66
region      66
charges     54
dtype: int64
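
The outputs above point to several problems besides missing values: negative ages and child counts, dollar signs in `charges`, and inconsistent capitalization in `region`. The sketch below shows one way to count such issues before cleaning. It uses a small hypothetical frame (`sample`) that mimics these problems, not the real file, and all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the issues visible in the outputs above
sample = pd.DataFrame({
    "age": [19.0, -18.0, 28.0, np.nan],
    "children": [0.0, 1.0, -3.0, 2.0],
    "region": ["southwest", "Southeast", "southeast", None],
    "charges": ["16884.924", "1725.5523", "$4449.462", None],
})

# Count values that cannot be valid (missing values are not counted here)
n_nonpositive_age = int((sample["age"] <= 0).sum())
n_negative_children = int((sample["children"] < 0).sum())

# Count charges that carry a currency symbol
n_dollar = int(sample["charges"].str.contains("$", regex=False, na=False).sum())

# Distinct region labels as written versus after lowercasing
raw_labels = sample["region"].nunique()
normalized_labels = sample["region"].str.lower().nunique()

print(n_nonpositive_age, n_negative_children, n_dollar, raw_labels, normalized_labels)
```

If lowercasing reduces the number of distinct labels, the column contains the same category written in more than one way.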


## Data Cleaning

The `clean_dataset` function takes the `insurance` DataFrame as input and performs several data cleaning operations on it:

1. Relabels the 'sex' column so it contains only two labels: 'M' and 'man' become 'male', and 'F' and 'woman' become 'female'.
2. Converts all values in the 'region' column to lowercase to standardize them.
3. Cleans the 'charges' column by removing the dollar sign ('$') with a regular expression and then converting the result to float.
4. Removes any rows where the 'age' column is less than or equal to zero. Rows with a missing age are dropped by the same filter, because a missing value never satisfies `age > 0`.
5. Replaces negative values in the 'children' column with zero.
6. Finally, it drops any remaining rows that contain missing values (NaN) using dropna() and returns the cleaned DataFrame.

After these steps the dataset is consistent enough for further analysis or modeling.

In [71]:
def clean_dataset(insurance):
    # work on a copy so the caller's DataFrame is left unchanged
    insurance = insurance.copy()

    # relabel the sex column (only contain male and female tags)
    insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})

    # lower case all values in region column
    insurance['region'] = insurance['region'].str.lower()

    # clean charges column; strip the dollar sign (raw string avoids an invalid escape), convert to float
    insurance['charges'] = insurance['charges'].replace({r'\$': ''}, regex=True).astype(float)

    # keep only rows with a positive age; the copy avoids a SettingWithCopyWarning below
    insurance = insurance[insurance['age'] > 0].copy()

    # replace negative values in the children column
    insurance.loc[insurance['children'] < 0, 'children'] = 0

    return insurance.dropna()

## Model Evaluation

The `create_and_evaluate_regression_model` function builds and evaluates a linear regression model on the cleaned insurance dataset. Here's a breakdown of what the function does:

Data Preprocessing:
- Splits the dataset into predictor variables (X) and the target variable (y), dropping the 'charges' column from X.
- Lists the categorical features (sex, smoker, region) and the numerical features (age, bmi, children).

Encoding Categorical Variables:
- Converts categorical variables into dummy variables using one-hot encoding (pd.get_dummies()), dropping the first category of each variable to avoid multicollinearity issues.

Combining Features:
- Combines the numerical features with the dummy variables.

Scaling Features and Building Regression Model:
- Sets up a pipeline (Pipeline) that first standardizes every feature column with StandardScaler and then applies a linear regression model.
- Scaling happens only inside the pipeline. Scaling the data once by hand and again in the pipeline would leave the fitted model expecting pre-scaled inputs, and it would then produce wildly inflated predictions on raw validation data.

Model Fitting:
- Fits the pipeline to the processed, unscaled predictor variables (X_processed) and the target variable (y).

Model Evaluation:
- Performs cross-validation using 5 folds.
- Evaluates the model's performance using mean squared error (MSE) and R-squared (R2) scores.
- Calculates the mean MSE and mean R2 scores across the folds.
- Finally, the function returns the trained pipeline model along with the mean MSE and mean R2 scores.

In [72]:
def create_and_evaluate_regression_model(insurance):
    # data preprocessing
    X = insurance.drop('charges', axis=1)
    y = insurance['charges']

    categorical_features = ['sex', 'smoker', 'region']
    numerical_features = ['age', 'bmi', 'children']

    # convert categorical variables to dummy variables (float columns)
    X_categorical = pd.get_dummies(X[categorical_features], drop_first=True, dtype=float)

    # combine numerical features with the dummy variables
    X_processed = pd.concat([X[numerical_features], X_categorical], axis=1)

    # pipeline: standardize the features, then fit a linear regression
    steps = [("scaler", StandardScaler()), ("lin_reg", LinearRegression())]
    insurance_model_pipeline = Pipeline(steps)

    # fit on the unscaled features; the pipeline's scaler does the scaling,
    # so predict() applies the same transformation to new data
    insurance_model_pipeline.fit(X_processed, y)

    # evaluate with 5-fold cross-validation (each fold refits a clone of the pipeline)
    mse_scores = -cross_val_score(insurance_model_pipeline, X_processed, y, cv=5, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(insurance_model_pipeline, X_processed, y, cv=5, scoring='r2')
    mean_mse = np.mean(mse_scores)
    mean_r2 = np.mean(r2_scores)

    return insurance_model_pipeline, mean_mse, mean_r2
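
As an alternative sketch, not the approach used in this notebook, the encoding step can also live inside the pipeline via scikit-learn's `ColumnTransformer` and `OneHotEncoder`. The pipeline then accepts raw rows at prediction time. The data below is synthetic and made up for illustration; `toy`, `preprocess`, `model`, and `new_customer` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned insurance data (all values are made up)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 65, n).astype(float),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 5, n).astype(float),
    "sex": rng.choice(["male", "female"], n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
toy["charges"] = (250 * toy["age"] + 300 * toy["bmi"]
                  + 20000 * (toy["smoker"] == "yes") + rng.normal(0, 3000, n))

# Scale the numeric columns and one-hot encode the categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "bmi", "children"]),
    ("cat", OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
])
model = Pipeline([("preprocess", preprocess), ("lin_reg", LinearRegression())])

X, y = toy.drop(columns="charges"), toy["charges"]

# 5-fold cross-validation with both metrics in a single call
scores = cross_validate(model, X, y, cv=5, scoring=["neg_mean_squared_error", "r2"])
mean_mse = -scores["test_neg_mean_squared_error"].mean()
mean_r2 = scores["test_r2"].mean()

# Fit on all rows, then predict directly from a raw, unencoded row
model.fit(X, y)
new_customer = pd.DataFrame([{"age": 40.0, "bmi": 28.0, "children": 2.0,
                              "sex": "female", "smoker": "no", "region": "southwest"}])
prediction = model.predict(new_customer)
```

Because the encoder learns its categories during `fit`, the dummy columns at prediction time always match the ones seen in training.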

In [73]:
# Usage example
cleaned_insurance = clean_dataset(insurance)
insurance_model, mean_mse, mean_r2 = create_and_evaluate_regression_model(cleaned_insurance)
print("Mean MSE:", mean_mse)
print("Mean R2:", mean_r2)

Mean MSE: 37431001.52191915
Mean R2: 0.7450511466263761
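
MSE is expressed in squared dollars, which makes it hard to read. Taking the square root of the mean MSE gives an error on the same scale as `charges`. A minimal sketch using the value printed above (note that this is the root of the mean MSE, not the mean of per-fold RMSE values):

```python
import math

# Mean cross-validated MSE reported above
mean_mse = 37431001.52191915

# Root of the mean MSE, in the same units as the charges column
rmse = math.sqrt(mean_mse)
print(f"RMSE: {rmse:,.2f}")
```

This works out to roughly 6,100 dollars of typical prediction error.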


In [74]:
# Predict on validation data
validation_data_path = 'validation_dataset.csv'
validation_data = pd.read_csv(validation_data_path)

# Ensure categorical variables are properly transformed
validation_data_processed = pd.get_dummies(validation_data, columns=['sex', 'smoker', 'region'], drop_first=True)

# Make predictions using the trained model
validation_predictions = insurance_model.predict(validation_data_processed)

# Add predicted charges to the validation data
validation_data['predicted_charges'] = validation_predictions

# Adjust predictions to ensure minimum charge is $1000
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000

# Display the updated dataframe
validation_data.head()

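Unnamed: 0,age,sex,bmi,children,smoker,region,predicted_charges
0,18.0,female,24.09,1.0,no,southeast,128624.195643
1,39.0,male,26.41,0.0,yes,northeast,220740.537449
2,27.0,male,29.15,0.0,yes,southeast,181357.588606
3,71.0,male,65.502135,13.0,yes,southeast,423490.68727
4,28.0,male,38.06,0.0,no,southeast,193247.431989

These predictions are far larger than the charges in the training preview (about 1,700 to 22,000 in the first five rows). Output on this scale is what a model fitted on standardized features produces when it receives raw, unscaled inputs, so the figures reflect a scaling mismatch rather than realistic costs. The predictions should be regenerated with a pipeline that was fitted on the unscaled features.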
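
One pitfall when encoding a validation set separately: `pd.get_dummies(..., drop_first=True)` drops whichever category sorts first among the values present, which may differ from the category dropped during training. A hedged sketch of one way around this is to encode without `drop_first` and then reindex to the training columns. Here `train_columns` is a hypothetical list standing in for the columns seen during training, and the rows are made up.

```python
import pandas as pd

# Hypothetical column order seen during training (dummies created with drop_first=True)
train_columns = ["age", "bmi", "children", "sex_male", "smoker_yes",
                 "region_northwest", "region_southeast", "region_southwest"]

# Hypothetical validation rows: only male customers and only two of the four regions
validation = pd.DataFrame({
    "age": [18.0, 39.0],
    "sex": ["male", "male"],
    "bmi": [24.09, 26.41],
    "children": [1.0, 0.0],
    "smoker": ["no", "yes"],
    "region": ["southeast", "southwest"],
})

# Encode every category that is present, without dropping any
encoded = pd.get_dummies(validation, columns=["sex", "smoker", "region"], dtype=float)

# Keep exactly the training columns, in training order; absent ones are filled with 0
aligned = encoded.reindex(columns=train_columns, fill_value=0.0)
print(aligned)
```

With `drop_first=True` on this frame, `sex_male` would have been dropped entirely; reindexing keeps it and adds the missing `region_northwest` column as zeros.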
