<a href="https://colab.research.google.com/github/reuben-mwangi/Medical-Insurance/blob/main/medical_cost_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Medical Cost Prediction

The aim of this analysis is to predict medical expenses from patient information. The dataset used is the insurance dataset from Kaggle. It contains 1338 observations and 7 variables. The variables are as follows:

**Variable**

* Age - age of the primary beneficiary
* Sex - gender of the beneficiary (male or female)
* BMI - body mass index
* Children - number of children covered by health insurance
* Smoker - whether the beneficiary smokes (yes or no)
* Region - the beneficiary's residential area in the US
* Charges - individual medical cost billed by health insurance

In [None]:
# importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("/content/insurance.csv")
df.head()

# Data Preprocessing

In [None]:
# number of rows and columns
df.shape

In [None]:
# checking for missing values
df.info()

In [None]:
# checking descriptive statistics
df.describe()

In [None]:
# value counts for categorical variables
print(df.sex.value_counts(),"\n",df.smoker.value_counts(),"\n",df.region.value_counts())

## Replacing the categorical variables with numerical values

* Sex: 1 - male, 0 - female
* Smoker: 1 - yes, 0 - no
* Region: 0 - southwest, 1 - southeast, 2 - northwest, 3 - northeast



In [None]:
# changing categorical variables to numerical ones
df["sex"]= df["sex"].map({"male":1,"female":0})
df["smoker"]= df["smoker"].map({"yes":1,"no":0})
df["region"]= df["region"].map({"southwest":0,"southeast":1,"northwest":2,"northeast":3})
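
Integer codes such as 0 to 3 give `region` an order that the regions do not naturally have. As a hedged aside, the sketch below shows one common alternative, one-hot encoding, in plain Python on a few made-up rows. The helper `one_hot` is a hypothetical name and not part of this notebook; pandas offers `pd.get_dummies` for the same purpose.

```python
# Illustrative rows only; these values are not taken from insurance.csv.
regions = ["southwest", "southeast", "northwest", "northeast", "southeast"]

def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[int(value == category) for category in categories] for value in values]
    return categories, rows

categories, encoded = one_hot(regions)
print(categories)
for region, row in zip(regions, encoded):
    print(region, row)
```

Each row contains exactly one 1, so no region is treated as larger or smaller than another.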

In [None]:
df.head(10)

# Exploratory Data Analysis

Visualizing the data is a good way to understand it. In this section, I will plot the distribution of each variable to get an overview of their counts and distributions.

In [None]:
# age distribution
sns.histplot(df.age,bins=20,kde=False,color="green")
plt.title("Age distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

In [None]:
# gender plot
sns.countplot(x= "sex", data= df)
plt.title("Gender Distribution")

It is clear that the numbers of males and females in the dataset are almost equal.

In [None]:
# bmi distribution
sns.histplot(df.bmi, bins=20, kde= True, color = "green")
plt.title("BMI Distribution")
plt.xlabel("BMI")
plt.ylabel("Count")
plt.show()

The majority of the patients have a BMI between 25 and 40, which spans the overweight (25 to below 30) and obese (30 and above) ranges and could be a major factor in increasing the medical cost.
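
For reference, here is a small sketch of the standard adult BMI categories (WHO cutoffs: below 18.5 underweight, 18.5 to below 25 normal weight, 25 to below 30 overweight, 30 and above obese). The function name `bmi_category` and the sample values are illustrative assumptions, not part of the dataset.

```python
def bmi_category(bmi):
    """Map a BMI value to its standard adult weight category."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

# Made-up BMI values chosen to hit each category
samples = [17.0, 22.5, 27.9, 30.0, 41.2]
labels = [bmi_category(value) for value in samples]
print(list(zip(samples, labels)))
```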

In [None]:
# Child count distribution
sns.countplot(x = "children",data=df)
plt.title("Children Distribution")
plt.xlabel("Children")
plt.ylabel("Count")
plt.show()

Most patients have no children, and very few have more than 3 children.

In [None]:
#Regionwise plot

sns.countplot(x="region",data=df)
plt.title("Region Distribution")
plt.xlabel("Region")
plt.ylabel("Count")
plt.show()

The count of patients from the southeast (encoded as 1 by the mapping code above) is slightly higher than in the other regions, which have almost equal numbers of patients.

In [None]:
# Count of smokers
sns.countplot(x="smoker",data=df)
plt.title("Smoker Distribution")
plt.xlabel("Smokers")
plt.ylabel("Count")
plt.show()

Smokers are a minority in the dataset: nearly 80% of the patients are non-smokers.

Next, I will plot the smoker count with respect to the number of children.

In [None]:
sns.countplot(x=df.smoker,hue = df.children)

It is evident that the largest group of non-smokers has no children.

In [None]:
# Charges distribution
sns.histplot(df.charges,bins= 20, kde =True, color ="red")
plt.title("Charges Distribution")
plt.xlabel("Medical Expense")
plt.ylabel("Count")
plt.show()

Most medical expenses are below 20,000, with a negligible number of patients having medical expenses above 50,000.

The plots above give a clear picture of the patient counts in each category of the variables. Now I will look into the correlation between the variables.

# Correlation

In [None]:
# correlation matrix of all variables
df.corr()

In [None]:
# plot the corr map
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(),annot=True,cmap = "coolwarm")
plt.show()

The variable smoker shows a strong positive correlation with medical expenses. Now I will explore the patients' smoking habits and their relation to other factors.
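
By default, `df.corr()` computes the Pearson correlation coefficient for each pair of columns. The sketch below works through that formula by hand on made-up values; the `pearson` helper and the sample numbers are illustrative assumptions, not results from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up example: a 0/1 smoker flag and charges that are much higher for smokers
smoker = [0, 0, 0, 1, 1, 1]
charges = [2000.0, 3000.0, 2500.0, 20000.0, 25000.0, 22000.0]
r = pearson(smoker, charges)
print(round(r, 3))
```

A value close to 1 means the two columns rise together. It does not by itself show that one causes the other.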

# Plotting the smoker count with patient's gender

In [None]:
sns.catplot(x="smoker",kind="count",hue="sex",data=df)
plt.title("Smoker Count with gender")
plt.xlabel("Smoker")
plt.ylabel("Count")
plt.show()

There are more male smokers than female smokers. Given the impact of smoking on medical expenses, I will assume that medical expenses for males are higher than for females.

In [None]:
sns.violinplot(x="sex",y="charges",data=df)

In [None]:
plt.figure(figsize=(12,5))
plt.title("Box plot for charges of women")
sns.boxplot(y="smoker",x="charges",data= df[(df.sex == 0)],orient="h")

In [None]:
plt.figure(figsize=(12,5))
plt.title("Box plot for charges of men")
sns.boxplot(y="smoker",x="charges",data= df[(df.sex == 1)],orient="h")

The plots support the assumption that the medical expenses of males are higher than those of females. In addition, the medical expenses of smokers are higher than those of non-smokers.

# Smokers and age distribution

In [None]:
# smokers and age distribution
sns.catplot(x="smoker",y="age",kind="swarm",data=df)

From the graph, we can see that there is a significant number of smokers aged 19. Now I will study the medical expenses of 19-year-old patients by smoking status.

In [None]:
# smokers of age 19

plt.figure(figsize=(12,5))
plt.title("Box plot for charges of patients of age 19 by smoking status")
sns.boxplot(y="smoker",x="charges",data=df[(df.age==19)],orient="h")
plt.xlabel("Medical Expense")
plt.ylabel("Smoker")
plt.show()

Surprisingly, the medical expenses of smokers aged 19 are very high compared to those of non-smokers. Among non-smokers we can see some outliers, which may be due to illness or accidents.

It is clear that the medical expenses of smokers are higher than those of non-smokers. Now I will plot the distribution of charges against age for smokers and non-smokers.
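
As background on the outliers in the box plot above: by default, seaborn and matplotlib draw the whiskers to the furthest points within 1.5 times the interquartile range (IQR) of the quartiles and mark anything beyond as an outlier. Here is a minimal sketch of that rule on made-up values; the `iqr_outliers` helper and the numbers are illustrative assumptions.

```python
import statistics

def iqr_outliers(values, whis=1.5):
    """Return the values lying more than whis * IQR outside the quartiles."""
    # method="inclusive" uses linear interpolation between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up charges: nine ordinary values and one very large one
charges = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 100000]
print(iqr_outliers(charges))
```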

In [None]:
#Non smokers charge distribution

plt.figure(figsize=(7,5))
plt.title("scatterplot for charges of non smokers")
sns.scatterplot(x="age",y="charges",data=df[(df.smoker== 0 )])
plt.xlabel("Age")
plt.ylabel("Medical Expense")
plt.show()

The majority of the points show that medical expenses increase with age, which may be because older people are more prone to illness. But there are some outliers, which suggests that other illnesses or accidents may also increase medical expenses.

In [None]:
# smokers charge distribution

plt.figure(figsize=(7,5))
plt.title("scatterplot for charges of smokers")
sns.scatterplot(x="age",y="charges",data=df[(df.smoker== 1 )])
plt.xlabel("Age")
plt.ylabel("Medical Expense")
plt.show()

Here we can see a peculiarity in the graph: there are two segments, one with high medical expenses, which may be due to smoking-related illness, and the other with lower medical expenses, which may be due to age-related illness.

In [None]:
# Age charge distribution

sns.lmplot(x="age",y="charges",data=df, hue="smoker")
plt.xlabel("Age")
plt.ylabel("Medical Expense")
plt.show()

Now we clearly understand how charges vary with age and smoking habits. The medical expenses of smokers are higher than those of non-smokers. For non-smokers, the cost of treatment increases with age, as expected. For smokers, the cost of treatment is high even for younger patients, which suggests that smokers are paying for smoking-related illness as well as age-related illness.
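
`sns.lmplot` with `hue="smoker"` fits one regression line per group. The sketch below reproduces that per-group fit with the closed-form least-squares formulas for a single predictor. The `fit_line` helper and the data points are made up for illustration and are not taken from the dataset.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up (age, charges) points per smoker flag, placed on exact lines
groups = {
    0: [(20, 3000.0), (40, 8000.0), (60, 13000.0)],
    1: [(20, 25000.0), (40, 30000.0), (60, 35000.0)],
}
lines = {flag: fit_line(*zip(*points)) for flag, points in groups.items()}
for flag, (slope, intercept) in lines.items():
    print(flag, slope, intercept)
```

In this toy data both groups have the same slope and differ only in intercept, which is one way a plot can show charges rising with age in both groups while smokers sit at a much higher level.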

# Charges distribution for patients with a BMI of 30 or more (obese patients)

In [None]:
#bmi charges distribution for obese people
plt.figure(figsize=(7,5))
#histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df[df.bmi>=30]["charges"], kde=True, stat="density")
plt.title("Charges Distribution for Obese People")
plt.xlabel("Medical Expense")
plt.show()

# Charges distribution for patients with a BMI below 30

In [None]:
#bmi charges distribution for non-obese people
plt.figure(figsize=(7,5))
#histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df[df.bmi<30]["charges"], kde=True, stat="density")
plt.title("Charges Distribution for Non-Obese People")
plt.xlabel("Medical Expense")
plt.show()

Comparing the two distributions, patients with a BMI below 30 spend less on medical treatment than those with a BMI of 30 or more.

# Build Model to Predict the Medical Expense

## Train Test Split

In [None]:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.drop("charges", axis=1), df["charges"], test_size=0.2, random_state=42)
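
As a quick check of what `test_size=0.2` means for 1338 rows, the arithmetic can be sketched as below. It assumes, as I understand scikit-learn's behaviour, that a fractional test size is rounded up and the remainder becomes the training set.

```python
import math

n_samples = 1338   # number of observations stated in the introduction
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

Comparing these numbers with `x_train.shape` and `x_test.shape` is a cheap way to confirm that the split was not overwritten later in the notebook.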


# Model Building

## Linear Regression

In [None]:
#Linear Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr

In [None]:
#model training
lr.fit(x_train,y_train)
#R^2 score on the training data (score returns R^2 for regressors, not classification accuracy)
lr.score(x_train,y_train)

In [None]:
#model prediction
y_pred = lr.predict(x_test)

# Polynomial Regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
poly_reg
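
To give a feel for what a degree-2 expansion produces, the sketch below enumerates the polynomial terms for the six predictors by hand (a constant, each feature, and every square and pairwise product). It mirrors the idea behind `PolynomialFeatures(degree=2)` but is an illustrative re-derivation, not scikit-learn's implementation.

```python
from itertools import combinations_with_replacement
from math import comb

# The six predictor columns left after dropping "charges"
features = ["age", "sex", "bmi", "children", "smoker", "region"]
degree = 2

# One term per multiset of features of size 0..degree; the empty tuple is the constant term
terms = [combo
         for d in range(degree + 1)
         for combo in combinations_with_replacement(features, d)]

print(len(terms))
print(terms[:8])
```

The count matches the closed form `comb(n_features + degree, degree)`, so six inputs grow to 28 columns at degree 2.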

In [None]:
#transforming the training features to a higher degree
x_train_poly = poly_reg.fit_transform(x_train)
#apply the same fitted transform to the held-out test features
#(no second split here, so x_train, x_test, y_train and y_test stay intact for the other models)
x_test_poly = poly_reg.transform(x_test)

In [None]:
plr = LinearRegression()
#model training
plr.fit(x_train_poly,y_train)
#R^2 score on the training data
plr.score(x_train_poly,y_train)

In [None]:
#model prediction (stored separately so the linear regression predictions in y_pred are not overwritten)
plr_pred = plr.predict(x_test_poly)

# Decision Tree Regressor


In [None]:
#decision tree regressor
from sklearn.tree import DecisionTreeRegressor
dtree = DecisionTreeRegressor()
dtree

In [None]:
#model training
dtree.fit(x_train,y_train)
#R^2 score on the training data
dtree.score(x_train,y_train)

In [None]:
#model prediction
dtree_pred = dtree.predict(x_test)

# Random Forest Regressor

In [None]:
#random forest regressor
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100)
rf

In [None]:
#model training
rf.fit(x_train,y_train)
#R^2 score on the training data
rf.score(x_train,y_train)

In [None]:
#model prediction
rf_pred = rf.predict(x_test)

# Model Evaluation

In [None]:
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

# Linear Regression

In [None]:
#distribution of actual and predicted values
plt.figure(figsize=(7,5))
#sns.kdeplot replaces the deprecated sns.distplot(..., hist=False)
ax1 = sns.kdeplot(y_test,color='r',label='Actual Value')
sns.kdeplot(y_pred,color='b',label='Predicted Value',ax=ax1)
plt.title('Actual vs Predicted Values for Linear Regression')
plt.xlabel('Medical Expense')
plt.legend()
plt.show()

In [None]:

print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R2 Score:', r2_score(y_test, y_pred))

# Polynomial Regressor

In [None]:
#actual vs predicted values for polynomial regression
plt.figure(figsize=(7,5))
ax1 = sns.kdeplot(y_test,color='r',label='Actual Value')
sns.kdeplot(plr_pred,color='b',label='Predicted Value',ax=ax1)
plt.title('Actual vs Predicted Values for Polynomial Regression')
plt.xlabel('Medical Expense')
plt.legend()
plt.show()

In [None]:
print('MAE:', mean_absolute_error(y_test, plr_pred))
print('MSE:', mean_squared_error(y_test, plr_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, plr_pred)))
print('R2 Score:', r2_score(y_test, plr_pred))

# Decision Tree Regressor

In [None]:
#distribution plot of actual and predicted values
plt.figure(figsize=(7,5))
#sns.kdeplot replaces the deprecated sns.distplot(..., hist=False)
ax = sns.kdeplot(y_test, color="r", label="Actual Value")
sns.kdeplot(dtree_pred, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual vs Fitted Values for Decision Tree Regression')
plt.xlabel('Medical Expense')
plt.ylabel('Distribution')
plt.legend()
plt.show()

In [None]:
print('MAE:', mean_absolute_error(y_test, dtree_pred))
print('MSE:', mean_squared_error(y_test, dtree_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, dtree_pred)))
print('R2 Score:', dtree.score(x_test,y_test))

# Random Forest Regressor

In [None]:
#distribution plot of actual and predicted values
plt.figure(figsize=(7,5))
#sns.kdeplot replaces the deprecated sns.distplot(..., hist=False)
ax = sns.kdeplot(y_test, color="r", label="Actual Value")
sns.kdeplot(rf_pred, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual vs Fitted Values for Random Forest Regressor')
plt.xlabel('Medical Expense')
plt.ylabel('Distribution')
plt.legend()
plt.show()


In [None]:
print('MAE:', mean_absolute_error(y_test, rf_pred))
print('MSE:', mean_squared_error(y_test, rf_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, rf_pred)))
print('R2 Score:', rf.score(x_test,y_test))
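
The metrics printed in the evaluation cells can also be computed by hand, which makes it clearer what each one measures. The sketch below uses three made-up values; the `regression_metrics` helper is a hypothetical name, not a scikit-learn function. For scikit-learn regressors, `.score` returns R², the same quantity as `r2_score`.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 from paired actual and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Made-up actual and predicted values for illustration
metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(metrics)
```

RMSE is in the same units as the charges, which makes it the easiest of the four to compare across models.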

# Conclusion

From the above models, we can see that the Decision Tree Regressor and the Random Forest Regressor give the best results, with the Random Forest Regressor achieving the lowest RMSE. Therefore, I will use the Random Forest Regressor to predict the medical expenses of patients.

Moreover, the medical expenses of smokers are higher than those of non-smokers, patients with a BMI of 30 or more have higher medical expenses than patients with a BMI below 30, and older patients have higher medical expenses than younger patients.

Thus, from the overall analysis, we can conclude that the medical expenses of patients depend on their age, BMI, and smoking habits.