### Medical Cost Personal Insurance Project
Project Description
Health insurance covers medical expenses that arise from illness, such as hospitalisation costs, medicines, and doctor consultation fees. Its main purpose is to let the insured receive medical care without financial strain. Health insurance plans protect against high medical costs and typically cover hospitalisation expenses, day-care procedures, domiciliary expenses, and ambulance charges, among others. In this project, the insurance charges are predicted from input features such as age, BMI, number of dependents, smoking status, and region.
Columns
age: age of the primary beneficiary
sex: gender of the insurance contractor (female or male)
bmi: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m^2); values of 18.5 to 24.9 are considered ideal
children: number of children covered by health insurance (number of dependents)
smoker: whether the beneficiary smokes (yes or no)
region: the beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)
charges: individual medical costs billed by health insurance
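
The bmi definition above can be illustrated with a short sketch. The function names `body_mass_index` and `in_ideal_range` and the example weight and height are hypothetical and are not part of the dataset; only the formula and the 18.5 to 24.9 range come from the column description.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI as weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def in_ideal_range(bmi: float) -> bool:
    """Check a BMI value against the 18.5 to 24.9 range quoted above."""
    return 18.5 <= bmi <= 24.9


# Hypothetical person: 70 kg and 1.75 m tall (illustrative values only)
example_bmi = body_mass_index(70.0, 1.75)
print(round(example_bmi, 2), in_ideal_range(example_bmi))
```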


Task: Can you accurately predict insurance costs?


Dataset links:
https://github.com/dsrscientist/dataset4
https://github.com/dsrscientist/dataset4/blob/main/medical_cost_insurance.csv


In [1]:
import pandas as pd

# Load the dataset from the provided URL
url = "https://raw.githubusercontent.com/dsrscientist/dataset4/main/medical_cost_insurance.csv"
data = pd.read_csv(url)

# Display the first few rows of the dataset to get an overview
print(data.head())


   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
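
Before imputing anything, it can help to confirm whether values are actually missing and which columns are numeric. The sketch below re-enters the five rows printed above by hand (the name `sample` is an assumption made here) so that it runs without network access; on the full dataset the same two calls would be applied to `data`.

```python
import pandas as pd

# The five rows shown in the output above, typed in manually for an offline check
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Count missing values per column
missing_per_column = sample.isna().sum()

# List the numeric columns; only these can be mean-imputed or scaled
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

print(missing_per_column)
print(numeric_columns)
```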


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Handle missing values: fill gaps in numeric columns with the column mean.
# numeric_only=True skips the string columns (sex, smoker, region).
data.fillna(data.mean(numeric_only=True), inplace=True)

# Encoding Categorical Variables
categorical_columns = ['sex', 'smoker', 'region']
encoder = OneHotEncoder(drop='first')
encoded_features = encoder.fit_transform(data[categorical_columns]).toarray()

# Feature Scaling
scaler = StandardScaler()
numerical_columns = ['age', 'bmi', 'children']
scaled_features = scaler.fit_transform(data[numerical_columns])

# Combine encoded categorical features and scaled numerical features
preprocessed_data = pd.DataFrame(
    data=scaled_features,
    columns=numerical_columns,
).join(pd.DataFrame(data=encoded_features, columns=encoder.get_feature_names_out(categorical_columns)))

# Splitting Data
X = preprocessed_data  # Features only; the target 'charges' is kept separate
y = data['charges']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the preprocessed data
print(preprocessed_data.head())


        age       bmi  children  sex_male  smoker_yes  region_northwest  \
0 -1.438764 -0.453320 -0.908614       0.0         1.0               0.0   
1 -1.509965  0.509621 -0.078767       1.0         0.0               0.0   
2 -0.797954  0.383307  1.580926       1.0         0.0               0.0   
3 -0.441948 -1.305531 -0.908614       1.0         0.0               1.0   
4 -0.513149 -0.292556 -0.908614       1.0         0.0               1.0   

   region_southeast  region_southwest  
0               0.0               1.0  
1               1.0               0.0  
2               1.0               0.0  
3               0.0               0.0  
4               0.0               0.0  
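
The cell above fits the scaler and encoder on the full dataset before splitting, so test-set statistics influence the preprocessing. One possible alternative, sketched here under stated assumptions, is to wrap the same steps in a scikit-learn `Pipeline` with a `ColumnTransformer` and fit it on the training split only. The data below is synthetic and generated purely for illustration (it is NOT the insurance dataset), and the names `demo`, `preprocess`, and `pipeline` are made up for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in with the same column names as the real data (illustrative only)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "sex": rng.choice(["female", "male"], size=n),
    "bmi": rng.normal(30, 6, size=n).round(2),
    "children": rng.integers(0, 5, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], size=n),
})
# Made-up target so the example has something to fit
demo["charges"] = (
    250 * demo["age"]
    + 20000 * (demo["smoker"] == "yes")
    + rng.normal(0, 1000, size=n)
)

X_demo = demo.drop(columns="charges")
y_demo = demo["charges"]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scale numeric columns and one-hot encode categorical ones in a single step
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "bmi", "children"]),
    ("cat", OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
])

# fit() learns scaling statistics and categories from the training split only
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipeline.fit(X_tr, y_tr)

r2_demo = pipeline.score(X_te, y_te)
print("R-squared on the synthetic hold-out:", r2_demo)
```

Because the preprocessing lives inside the pipeline, the same object could also be passed to `GridSearchCV`, which would then refit the scaler and encoder within each cross-validation fold.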


In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create a Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Display evaluation results
print("Mean Squared Error:", mse)
print("R-squared:", r2)


Mean Squared Error: 33596915.851361476
R-squared: 0.7835929767120722
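
The mean squared error is expressed in squared charge units, which makes it hard to read. Taking the square root gives the root mean squared error (RMSE) in the same units as `charges`. The sketch below copies the MSE printed above into a variable named `mse_linear`, a name chosen here for illustration.

```python
import math

# MSE copied from the linear regression output above
mse_linear = 33596915.851361476

# RMSE is the square root of the MSE, in the same units as 'charges'
rmse_linear = math.sqrt(mse_linear)
print(f"Linear regression RMSE: {rmse_linear:.2f}")
```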


In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Create a Random Forest Regressor model
rf_model = RandomForestRegressor(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform GridSearchCV
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get best parameters and best estimator
best_params = grid_search.best_params_
best_rf_model = grid_search.best_estimator_

# Make predictions using the best model
y_pred_rf = best_rf_model.predict(X_test)

# Calculate evaluation metrics for the best model
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Display results
print("Best Hyperparameters:", best_params)
print("Random Forest Mean Squared Error:", mse_rf)
print("Random Forest R-squared:", r2_rf)


Best Hyperparameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}
Random Forest Mean Squared Error: 19084132.43349689
Random Forest R-squared: 0.8770738269477704
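
To put the two models side by side, the relative reduction in error can be computed from the printed metrics. This is a small sketch that copies the reported values by hand; the variable names are assumptions, and the comparison says nothing about how the models would behave on other splits.

```python
import math

# Metrics copied from the printed outputs of the two models
mse_linear, r2_linear = 33596915.851361476, 0.7835929767120722
mse_rf, r2_rf = 19084132.43349689, 0.8770738269477704

# Fraction of the linear model's MSE removed by the tuned random forest
mse_reduction = 1 - mse_rf / mse_linear

# Random forest RMSE, in the same units as 'charges'
rmse_rf = math.sqrt(mse_rf)

# Absolute gain in R-squared
r2_gain = r2_rf - r2_linear

print(f"MSE reduction: {mse_reduction:.1%}")
print(f"Random forest RMSE: {rmse_rf:.2f}")
print(f"R-squared gain: {r2_gain:.4f}")
```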
