## How Does a Medical Emergency Burn a Hole in Your Pocket?

This project focuses on **storytelling** with a small dataset. The aim is to bring out the fun of data exploration: through multiple graphs and plots, I will try to present the data and the interactions between its variables in a lucid manner.

The dataset is sourced from Kaggle and appears to be an abridged version of an insurance company's database. The goal is to find how medical charges depend on factors such as BMI (body mass index), smoking habits, age and number of children, and to touch briefly on how the spending of males and females compares across these factors.

So, without further ado, let's begin by importing our dependencies. A brief description of each and why it is needed:
1. os: for file navigation on storage devices
2. NumPy: for arrays and calculations
3. Pandas: for data management, including DataFrames and Series
4. Matplotlib and its submodules pyplot, style and ticker (MaxNLocator): for plotting, styling and axis ticks
5. Seaborn: for statistical plots such as heatmaps, swarm plots and boxen plots
6. scikit-learn's StandardScaler: for data scaling
7. scikit-learn's KMeans: for K-means clustering
8. scikit-learn's silhouette_samples and silhouette_score: for calculating silhouette scores for each individual sample and for the whole dataset, respectively
9. statsmodels' ols (lowercase): for R-style ordinary least squares analysis to find the p-values for our hypothesis

In [None]:
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib import style
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler
from matplotlib.ticker import MaxNLocator
from statsmodels.formula.api import ols

### Data Insights

In [None]:
path="/kaggle/input/insurance/insurance.csv"

In [None]:
raw_data=pd.read_csv(path)
raw_data

We load the data using Pandas' read_csv. At first glance, the data has a shape of (1338, 7) and mixes categorical and continuous data types. A further overview follows:

In [None]:
raw_data.info()

So we have 3 object (categorical) columns and 4 numeric columns, and all of them are non-null.

In [None]:
raw_data_c=raw_data.drop(["sex", "smoker", "region"], axis=1).copy() #only continuous variable dataset will be used for plots

In [None]:
raw_data.describe()

From the description of the data, we find that no column other than *charges* has many **outliers**.

In [None]:
raw_data.isnull().sum()

Confirmation that there are no **NULL values in our dataset**.

### Basic Histogram

**To find the nature of data**

The basic histogram, drawn with the **hist()** function, gives a visual representation of how each variable is distributed. From this graph we can understand:
1. The type of each variable: continuous variables show a continuous spread of bars, while categorical variables show a few distinct bars.
2. Skewness, which hints at the presence of **outliers**. Here we note that **Charges** and **Age** are not bell-shaped (Charges has a long right tail), whereas **BMI** is roughly symmetric.
3. Region is categorical.
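
One way to check the outlier reading numerically is the 1.5 × IQR rule that box plots use. The sketch below applies it to a small made-up series (not the insurance data); `iqr_outliers` is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying outside the 1.5 * IQR fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Made-up costs with one very large value in the right tail
sample = pd.Series([1200, 1500, 1800, 2100, 2500, 3000, 3600, 48000])
flagged = iqr_outliers(sample)
print(flagged.tolist())
```

On the real data, the same helper could be called as `iqr_outliers(raw_data["charges"])` to count how many rows fall outside the fences.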

In [None]:
plt.figure(figsize=(14,8))
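style.use("seaborn-v0_8-dark-palette" if "seaborn-v0_8-dark-palette" in style.available else "seaborn-dark-palette") #seaborn styles were renamed in Matplotlib 3.6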
plt.subplot(2,2,1)
plt.hist(raw_data["age"])
plt.xlabel("Ages", fontsize=12)
plt.subplot(2,2,2)
plt.hist(raw_data["bmi"])
plt.xlabel("BMI", fontsize=12)
plt.subplot(2,2,3)
plt.hist(raw_data["charges"])
plt.xlabel("Charges", fontsize=12)
plt.subplot(2,2,4)
plt.hist(raw_data["region"])
plt.xlabel("Region", fontsize=12)

### Correlation

In [None]:
corr_mat=raw_data_c.corr()
corr_mat

#### Plotting of Correlation plot

In [None]:
plt.figure(figsize=(10,8))
corar=np.array(corr_mat.values)
sns.set(font_scale=1.5)
sns.heatmap(corr_mat, annot=corar,cmap="coolwarm_r")

From the correlation plot we can see that **age** and **bmi** each have a slight positive correlation with **charges**, which we will try to prove in due course.
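
For reference, `DataFrame.corr()` computes the Pearson correlation coefficient by default. A minimal sketch of that calculation on made-up numbers (the two lists are illustrative, not taken from the dataset, and `pearson_r` is a hypothetical helper):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

age = [19, 25, 33, 41, 52, 60]
charges = [1700, 3200, 4400, 6100, 9800, 12500]
r = pearson_r(age, charges)
# The hand-rolled value should agree with NumPy's correlation matrix
print(round(r, 4), round(float(np.corrcoef(age, charges)[0, 1]), 4))
```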

In [None]:
raw_data.age.describe()

### Relationship between Age and Charges.

We first convert age into bins/groups of a categorical variable (Young Adult, Adult and Old) to analyse its relationship with the medical expenses, "charges". The youngest person in the data is 18, so no Child group is needed.

In [None]:
raw_data.loc[(raw_data.age>17) & (raw_data.age<=30), "age_cat"]="Young Adult"
raw_data.loc[(raw_data.age>30) & (raw_data.age<=59), "age_cat"]="Adult"
raw_data.loc[(raw_data.age>59), "age_cat"]="Old"
raw_data
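
The same binning can also be sketched with `pd.cut`, which assigns every row in one call. The upper edge of 120 is an assumption chosen only to cover all ages; the sample ages are made up.

```python
import pandas as pd

ages = pd.Series([18, 30, 31, 59, 60, 64])
# Bins are right-inclusive: (17, 30], (30, 59], (59, 120]
age_cat = pd.cut(ages, bins=[17, 30, 59, 120],
                 labels=["Young Adult", "Adult", "Old"])
print(age_cat.tolist())
```

`pd.cut` returns an ordered categorical, so later group-bys and plots keep the Young Adult, Adult, Old order instead of sorting alphabetically.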

In [None]:
count=raw_data.age_cat.value_counts()
print(count)
labels=count.index.tolist() #labels come from value_counts so they stay aligned with the counts
count=count.values
style.use("ggplot")
plt.figure(figsize=(8,8))
explode=(0.1,0,0)
plt.pie(count, labels=labels,explode=explode, autopct="%1.1f%%", textprops={'fontsize': 20})

The value counts show that Adults (ages 31 to 59) form the largest group in our dataset, followed by Young Adults, with Old being the smallest group.

In [None]:
charge_avg_age=raw_data.groupby("age_cat")["charges"].mean()
labels_avg=charge_avg_age.keys()
charge_avg_age=charge_avg_age.tolist()

charge_sum_age=raw_data.groupby(["age_cat"])["charges"].sum()
labels_sum=charge_sum_age.keys()
charge_sum_age=charge_sum_age.tolist()

charge_std_age=raw_data.groupby(["age_cat"])["charges"].std()
labels_std=charge_std_age.keys()
charge_std_age=charge_std_age.tolist()


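style.use("seaborn-v0_8" if "seaborn-v0_8" in style.available else "seaborn") #seaborn styles were renamed in Matplotlib 3.6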
plt.figure(figsize=(16,10))
plt.subplot(2,2,1)
plt.bar(labels_avg, charge_avg_age, color="green")
plt.ylabel("Mean Charges", fontsize=16)
plt.subplot(2,2,2)
plt.bar(labels_sum, charge_sum_age, color="indigo")
plt.ylabel("Sum  (1e7)", fontsize=16)
plt.subplot(2,2,3)
plt.bar(labels_std, charge_std_age, color="black")
plt.ylabel("Charges Standard Deviation", fontsize=16)

From the above three bar plots we note the following:

1. **Adults** (ages 31 to 59) make up 58.3% of the dataset, so the sum total of their medical expenses is the highest, but the mean cost per adult patient is **less than \$15,000** with a standard deviation of about **\$12,000**. The critical age starts after 59, when many ailments crop up due to work stress and socio-environmental factors.

2. **Young Adults** are aged 18 to 30, when the human body is at its peak. They make up 33.2% of the dataset and have the lowest mean cost, **less than \$10,000**, with a standard deviation of around **\$10,000**.

3. **Old** age (60 and above) is when medical cost becomes a primary expenditure, as is evident from the mean cost being the highest, shooting **above \$20,000** with a standard deviation of about **\$13,000**.
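
The three statistics plotted above can also be gathered in a single table with `groupby(...).agg(...)`. The frame below is a made-up stand-in with the same column names, not the insurance data.

```python
import pandas as pd

df = pd.DataFrame({
    "age_cat": ["Young Adult", "Young Adult", "Adult", "Adult", "Old", "Old"],
    "charges": [2000.0, 4000.0, 6000.0, 10000.0, 12000.0, 20000.0],
})
# One row per age group, one column per statistic
summary = df.groupby("age_cat")["charges"].agg(["mean", "sum", "std", "count"])
print(summary)
```

Keeping the count next to the mean and sum makes it easier to see when a large total is driven by group size rather than by cost per person.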

### Relationship between Sex and Charges

From the hist() plot we saw that charges is heavily right-skewed, so we apply a log transformation to charges to reduce the influence of the extreme values.

In [None]:
raw_data["log_charges"]=np.log(raw_data["charges"])
raw_data 
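
To see why the log helps, the sketch below compares skewness before and after the transformation. It uses log-normal random draws as an assumed stand-in for a right-skewed cost variable, not the actual charges.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Log-normal draws stand in for a right-skewed cost variable
charges = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=2000))
skew_raw = charges.skew()
skew_log = np.log(charges).skew()
print(f"skew before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness near zero after the transform means the long right tail has been pulled in, which is what the second histogram below is meant to show for the real data.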

In [None]:
plt.figure(figsize=(16,6))

plt.subplot(1,2,1)
raw_data["charges"].hist()
plt.xlabel("Charges", fontsize=16)


plt.subplot(1,2,2)
raw_data["log_charges"].hist()
plt.xlabel("Log of Charges", fontsize=16)

We draw a **swarm** plot (a categorical scatter plot) of **charges** and a boxen plot of **log charges**, each split by **gender**, and observe the following.

In [None]:
plt.figure(figsize=(15,10))
sns.set(font_scale=1.5)
plt.subplot(1,2,1)
sns.swarmplot(x="sex", y="charges", data=raw_data, palette="seismic") #keyword arguments are required by recent seaborn versions
plt.subplot(1,2,2)
sns.boxenplot(x="sex", y="log_charges", data=raw_data, palette="seismic")

The swarm plot and the boxen plot (an extended box plot) both suggest that charges do not depend on gender, with the median lying around \$10,000 for both.

### Relationship between BMI and Gender

In [None]:
plt.figure(figsize=(14,8))
sns.set(font_scale=1.5)
sns.boxenplot(x="sex", y="bmi", data=raw_data, palette="seismic_r")

The BMI distribution is centred at around 30, with the upper quartile at about 34 to 35 and the lower quartile at about 25 for both genders. The male group also has outliers, as suggested by a few BMI values over 50.

In [None]:
#bin on the bmi column: the thresholds 19, 25 and 30 are BMI cut-offs, not ages
raw_data.loc[(raw_data.bmi<19), "bmi_cat"]="Underweight"
raw_data.loc[(raw_data.bmi>=19) & (raw_data.bmi<=25), "bmi_cat"]="Normal"
raw_data.loc[(raw_data.bmi>25) & (raw_data.bmi<=30), "bmi_cat"]="Overweight"
raw_data.loc[(raw_data.bmi>30), "bmi_cat"]="Obese"
raw_data

In [None]:
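bmi_val=raw_data["bmi_cat"].value_counts()
labels=bmi_val.index.tolist() #labels come from value_counts so bars and names stay aligned
bmi_val=bmi_val.tolist()
style.use("seaborn-v0_8-dark-palette" if "seaborn-v0_8-dark-palette" in style.available else "seaborn-dark-palette")
plt.figure(figsize=(12,5))
plt.bar(labels, bmi_val)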
plt.ylabel("Count", fontsize=16)

### Relationship between BMI and Charges

In [None]:
bmi_avg_charge=raw_data.groupby("bmi_cat")["charges"].mean()
labels_a=bmi_avg_charge.keys()
bmi_avg_charge=bmi_avg_charge.tolist()

bmi_count_charge=raw_data.groupby("bmi_cat")["charges"].count()
labels_c=bmi_count_charge.keys()
bmi_count_charge=bmi_count_charge.tolist()


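style.use("seaborn-v0_8-dark-palette" if "seaborn-v0_8-dark-palette" in style.available else "seaborn-dark-palette") #seaborn styles were renamed in Matplotlib 3.6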
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
plt.bar(labels_a, bmi_avg_charge)
plt.ylabel("Mean Charges", fontsize=16)

plt.subplot(1,2,2)
plt.bar(labels_c, bmi_count_charge)
plt.ylabel("Count", fontsize=16)


From the above plots we can see that obesity has quite an impact on medical cost. Our dataset has a high number of obese patients, and their mean cost is above \$14,000. Thus it's better to keep our weight under control.

### Relationship between Smoking and Charges

In [None]:
plt.figure(figsize=(14,8))
sns.set(font_scale=1.5)
sns.swarmplot(x="smoker", y="charges", hue="sex", data=raw_data, palette="winter")

In [None]:
plt.figure(figsize=(15,10))
style.use("ggplot")
ax=plt.subplot(2,1,1)
smk_bmi=raw_data.groupby(["smoker", "bmi_cat"])["charges"].mean().unstack()
print(smk_bmi)
smk_bmi.plot(ax=ax)

ax=plt.subplot(2,1,2)
smk_bmi=raw_data.groupby(["smoker", "bmi_cat"])["charges"].mean().plot(ax=ax)
ax.tick_params('x',labelrotation=45)

From all three graphs we note that smoking and medical cost have a direct relationship. Smokers tend to spend more on medical expenses than non-smokers, which also indirectly suggests that smokers tend to develop more medical complications than non-smokers.

### Standardization of Data

Standard scaling is required to bring all the variables onto the same scale. *BMI* and *Age* range in the tens, *Children* ranges in the ones, while *Charges* runs to five digits. To put them all on the same footing we use the standard scaler.

Standardize features by removing the mean and scaling to unit variance

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.
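
As a quick check of the formula, the sketch below standardizes a tiny made-up matrix by hand and compares it with `StandardScaler`. The rows are invented values in the column order age, bmi, children, charges.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented rows: age, bmi, children, charges
X = np.array([[20.0, 22.0, 0.0, 2000.0],
              [35.0, 28.0, 1.0, 6000.0],
              [50.0, 31.0, 2.0, 9000.0],
              [65.0, 36.0, 3.0, 40000.0]])

# Column-wise z-score using the population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```

Note that `StandardScaler` divides by the population standard deviation, which is why `X.std(axis=0)` with NumPy's default `ddof=0` matches it.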

In [None]:
raw_data_c

In [None]:
std_scl=StandardScaler()
raw_data_std=std_scl.fit_transform(raw_data_c)
print("columns are age, bmi, children, charges")
print(raw_data_std)

In [None]:
bmi_charg_c=raw_data_std[:,[1,3]]
print(bmi_charg_c)
print(bmi_charg_c.shape)

### Clustering 

Using KMeans and silhouette scores

To find the best number of clusters (n_clusters=k), we run a for-loop and compute, for each k, the WSS (within-cluster sum of squares) used by the elbow method and the silhouette score.
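
To make the selection rule concrete, here is a minimal sketch on synthetic data where the right answer is known: three well-separated blobs generated with `make_blobs` (an assumption for illustration, not the insurance data). The inertia (WSS) keeps falling as k grows, while the silhouette score peaks at the natural number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs
X, _ = make_blobs(n_samples=300, centers=[(-5, -5), (0, 5), (5, -5)],
                  cluster_std=0.6, random_state=0)

inertias = {}
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias[k] = model.inertia_                    # WSS: smaller means tighter clusters
    scores[k] = silhouette_score(X, model.labels_)  # higher means better separated

best_k = max(scores, key=scores.get)
print(best_k)
```

Because WSS alone always rewards a larger k, the silhouette score is a useful second opinion when the elbow is ambiguous.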

**BMI**

In [None]:
wss=[]
sil=[]
for k in range(2,16):
    kmeans=KMeans(n_clusters=k, random_state=1).fit(bmi_charg_c)
    wss.append(kmeans.inertia_)
    labels=kmeans.labels_
    silhoutte=silhouette_score(bmi_charg_c, labels, metric = 'euclidean')
    sil.append(silhoutte)

In [None]:
k=range(2,16)
style.use("bmh")
fig,ax=plt.subplots(figsize=(14,6))
ax.set_facecolor("white")
ax.plot(k, wss, color="green")
ax.xaxis.set_major_locator(MaxNLocator(nbins=15, integer=True))
ax.set_xlabel("No of clusters", fontsize=20)
ax.set_ylabel("WSS (within-cluster sum of squares)", fontsize=20)
ax2=ax.twinx()
ax2.plot(k, sil, color="blue")
ax2.set_ylabel("Silhouette scores", fontsize=20)
ax2.grid(True,color="silver")
plt.show()

From the plot we see the "elbow" at k=3, and the silhouette score is close to its best at that point.

In [None]:
k=3
kmeans=KMeans(n_clusters=k, random_state=1).fit(bmi_charg_c)
clusters=kmeans.labels_
centrids=kmeans.cluster_centers_
raw_data["clusters"]=clusters
raw_data

In [None]:
raw_data2=raw_data.sort_values(["clusters"]).copy()

In [None]:
for i in range(0,k+1):
    raw_data2["clusters"]=raw_data2["clusters"].replace(i, chr(i+65))
    
raw_data2

In [None]:
raw_data2["clusters"].unique()

In [None]:
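x=raw_data2[["bmi", "charges"]].values #same columns as iloc[:,[2,6]], selected by name
print(x.shape)
y=kmeans.fit_predict(x) #note: this refits KMeans on the unscaled values, so labels may differ from raw_data["clusters"]
print(y.shape)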

In [None]:
plt.figure(figsize=(14,8))
style.use("Solarize_Light2")
plt.scatter(x[y==0,0], x[y==0,1], color="red", label="A")
plt.scatter(x[y==1,0], x[y==1,1], color="blue", label="B")
plt.scatter(x[y==2,0], x[y==2,1], color="green", label="C")

plt.xlabel("BMI", fontsize=16)
plt.ylabel("Charges", fontsize=16)
plt.title("Charges depends on BMI??", fontsize=18)

As defined, we get 3 distinct clusters. BMI values from 15 to 35 are associated with expenses of \$10,000 to \$30,000, whereas higher BMIs come with much higher costs.
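
A possible alternative for the plotting step is to keep the labels fitted on the standardized data and only convert the cluster centres back to original units with `inverse_transform`. The sketch below assumes synthetic stand-ins for BMI and charges rather than the real columns.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for BMI and charges (not the insurance data)
bmi = rng.normal(30, 6, size=200)
charges = rng.lognormal(9, 0.8, size=200)
X = np.column_stack([bmi, charges])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_std)

labels = kmeans.labels_  # one label per row, in the same row order as X
# Map the centres from z-scores back to BMI and charge units
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(centers_original.round(1))
```

Since `labels` follows the row order of `X`, a scatter of `X[labels == 0]`, `X[labels == 1]` and so on shows the standardized clustering on the original axes without refitting.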

We also run the same clustering for **Age**.

In [None]:
age_charg_c=raw_data_std[:,[0,3]]
print(age_charg_c)
print(age_charg_c.shape)

In [None]:
wss=[]
sil=[]
for k in range(2,16):
    kmeans=KMeans(n_clusters=k, random_state=1).fit(age_charg_c)
    wss.append(kmeans.inertia_)
    labels=kmeans.labels_
    silhoutte=silhouette_score(age_charg_c, labels, metric = 'euclidean')
    sil.append(silhoutte)

In [None]:
k=range(2,16)
style.use("bmh")
fig,ax=plt.subplots(figsize=(14,6))
ax.set_facecolor("white")
ax.plot(k, wss, color="green")
ax.xaxis.set_major_locator(MaxNLocator(nbins=15, integer=True))
ax.set_xlabel("No of clusters", fontsize=20)
ax.set_ylabel("WSS (within-cluster sum of squares)", fontsize=20)
ax2=ax.twinx()
ax2.plot(k, sil, color="blue")
ax2.set_ylabel("Silhouette scores", fontsize=20)
ax2.grid(True,color="silver")
plt.show()

In [None]:
k=3
kmeans=KMeans(n_clusters=k, random_state=1).fit(age_charg_c)
clusters=kmeans.labels_
centrids=kmeans.cluster_centers_
raw_data["clusters"]=clusters
raw_data

In [None]:
raw_data2=raw_data.sort_values(["clusters"]).copy()

In [None]:
for i in range(0,k+1):
    raw_data2["clusters"]=raw_data2["clusters"].replace(i, chr(i+65))
    
raw_data2

In [None]:
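x=raw_data2[["age", "charges"]].values #same columns as iloc[:,[0,6]], selected by name
print(x.shape)
y=kmeans.fit_predict(x) #note: this refits KMeans on the unscaled values, so labels may differ from raw_data["clusters"]
print(y.shape)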

In [None]:
plt.figure(figsize=(14,8))
style.use("Solarize_Light2")
plt.scatter(x[y==0,0], x[y==0,1], color="red", label="A")
plt.scatter(x[y==1,0], x[y==1,1], color="blue", label="B")
plt.scatter(x[y==2,0], x[y==2,1], color="green", label="C")

plt.xlabel("Age", fontsize=16)
plt.ylabel("Charges", fontsize=16)
plt.title("Charges depends on Age??", fontsize=18)

We don't see much distinction between the groups here, as they overlap heavily. All three expense ranges contain every age group.

### P-value test

We encode the categorical variable "smoker" as a binary indicator (1 for yes, 0 for no) and run an OLS regression.

In [None]:
raw_data2["smoker"]=raw_data2["smoker"].replace(["yes", "no"],[1,0])

Dependent variable: charges. Independent variables: BMI, age, children and smoker status.

### Hypothesis

$$H_0: \text{Charges are independent of age, BMI, number of children and smoking habits}$$
$$H_1: \text{Charges depend on age, BMI, number of children and smoking habits}$$

In [None]:
pval=ols("charges~bmi+age+children+smoker", data=raw_data2).fit() #raw_data2 holds the 0/1 encoded smoker column
print(pval.summary())

We check that all 4 independent variables have a **p-value of less than 0.05**; thus we **reject the null hypothesis**
and conclude that "Charges" depend on the mentioned variables.
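
Rather than reading the p-values off the printed summary, they can be pulled from the fitted model's `pvalues` attribute. The sketch below uses synthetic data in which the outcome is built from `age` and `smoker` but not from `noise_col`; all column values here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n = 300
# Synthetic data: "charges" is built from age and smoker only
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker": rng.integers(0, 2, size=n),
    "noise_col": rng.normal(size=n),
})
df["charges"] = 250 * df["age"] + 20000 * df["smoker"] + rng.normal(0, 3000, size=n)

model = ols("charges ~ age + smoker + noise_col", data=df).fit()
pvalues = model.pvalues  # pandas Series indexed by term name
significant = pvalues[pvalues < 0.05].index.tolist()
print(significant)
```

Each p-value tests one coefficient against zero with the other variables held in the model, so a variable is judged on its own contribution rather than on the model as a whole.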