## Exploratory Data Analysis on the Insurance Cost Dataset

#### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import seaborn as sns

### Read the dataset

Columns:

age: age of primary beneficiary

sex: insurance contractor gender, female, male

bmi: body mass index, an objective measure of whether body weight is high or low relative to height,
computed as weight divided by height squared (kg / m^2); ideally 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes (yes, no)

region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance

In [None]:
data = pd.read_csv('../input/insurance/insurance.csv')
data.head()

#### Check data types and non-null values

In [None]:
data.info()

#### Encode categorical variables
##### There are three categorical variables (sex, smoker, region); two of them (sex, smoker) are binary.

In [None]:
encoder = LabelEncoder()

#sex
data.sex = encoder.fit_transform(data.sex) 
# smoker or not
data.smoker = encoder.fit_transform(data.smoker)
#region
data.region = encoder.fit_transform(data.region)
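
Because the same `LabelEncoder` instance is refit for each column, its `classes_` attribute only remembers the last column it was fitted on (`region`). The following is a minimal sketch of how the integer codes could be inspected, assuming one encoder per column. The `toy` frame is made-up data standing in for `data`, and the names `toy` and `mappings` are illustrative, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up sample with the same categorical columns as the dataset
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

mappings = {}
for col in ["sex", "smoker", "region"]:
    enc = LabelEncoder()  # a fresh encoder per column keeps each mapping
    toy[col] = enc.fit_transform(toy[col])
    # classes_ is sorted, so the position of each label is its integer code
    mappings[col] = {str(label): code for code, label in enumerate(enc.classes_)}

print(mappings)
```

Since `LabelEncoder` sorts the labels before assigning codes, the codes follow alphabetical order of the original strings.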


#### Plot correlation map 

In [None]:
plt.figure(figsize=(10,6))

sns.heatmap(data.corr());

##### There is a high correlation between the smoker and charges columns.
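
A heatmap without annotations only shows colours, so the strength of each relationship has to be guessed. As a rough sketch, the coefficients can be read directly by sorting the `charges` column of the correlation matrix. The frame below is synthetic (made-up values, built so that smoking dominates) and only stands in for the encoded `data`; `toy` and `corr_with_charges` are illustrative names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the encoded dataset (values are made up)
toy = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.integers(0, 2, n),
})
# Charges constructed so that smoking dominates, with a smaller age effect
toy["charges"] = (
    3000 + 250 * toy["age"] + 20000 * toy["smoker"] + rng.normal(0, 3000, n)
)

# Pearson correlation of every feature with charges, strongest first
corr_with_charges = (
    toy.corr()["charges"].drop("charges").sort_values(ascending=False)
)
print(corr_with_charges.round(2))
```

Passing `annot=True` to `sns.heatmap` prints the same coefficients inside the heatmap cells.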

#### Smoking visualisation
##### Smokers are encoded as 1 and non-smokers as 0.

#### Draw histograms for charges distribution

In [None]:
fig = plt.figure(figsize=(20,5))

# Left panel: smokers (encoded as 1)
ax=fig.add_subplot(121)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(data[(data.smoker == 1)]["charges"], kde=True, stat="density", color='g', ax=ax)
ax.set_title('Distribution of charges for smokers')

# Right panel: non-smokers (encoded as 0)
ax=fig.add_subplot(122)
sns.histplot(data[(data.smoker == 0)]['charges'], kde=True, stat="density", color='b', ax=ax)
ax.set_title('Distribution of charges for non-smokers')

##### Smokers have noticeably higher charges, while the majority of non-smokers are charged up to 15000 dollars. At the same time, non-smokers are expected to outnumber smokers, which the count plot can confirm.

In [None]:
sns.catplot(x="smoker", kind="count", palette="rocket", data=data)

#### Scatter plot for age correlation taking into consideration smoking status

In [None]:
# Plot data and regression model fits across a FacetGrid (a grid of subplots)
g = sns.lmplot(x="age", y="charges", hue="smoker", data=data, palette='rocket', height=7)
# lmplot returns a FacetGrid; with no row/col faceting its single Axes is available as g.ax
g.ax.set_title('Smokers and non-smokers')

##### The chart shows that charges increase slightly with age.

#### Sex visualization

In [None]:
fig = plt.figure(figsize=(20,5))

# Left panel: males (encoded as 1)
ax=fig.add_subplot(121)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(data[(data.sex == 1)]["charges"], kde=True, stat="density", color='g', ax=ax)
ax.set_title('Distribution of charges for males')

# Right panel: females (encoded as 0)
ax=fig.add_subplot(122)
sns.histplot(data[(data.sex == 0)]['charges'], kde=True, stat="density", color='b', ax=ax)
ax.set_title('Distribution of charges for females')

##### Both distributions are right skewed. Measures of central tendency can give more insight.
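
Right skew can also be checked numerically instead of by eye: the sample skewness is positive and the mean sits above the median. The sketch below assumes a log-normal sample as a made-up stand-in for the `charges` column, since log-normal draws are right skewed by construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Log-normal draws are right skewed by construction (made-up stand-in for charges)
charges = pd.Series(rng.lognormal(mean=9, sigma=0.8, size=1000), name="charges")

# Positive skewness indicates a longer right tail
skewness = charges.skew()
print("skewness:", round(skewness, 2))

# In a right-skewed sample the mean is pulled above the median
print("mean:", round(charges.mean(), 2), "median:", round(charges.median(), 2))
```

On the real data the same two lines would be applied to `female["charges"]` and `male["charges"]`.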

In [None]:

female = data[data['sex'] == 0]
male = data[data['sex'] == 1]
print("Mode charges for female:",female["charges"].mode()[0])
print("Median charges for female:",female["charges"].median())
print("Mean charges for female:",round(female["charges"].mean(),2))
print("===")
print("Mode charges for male:",male["charges"].mode()[0])
print("Median charges for male:",male["charges"].median())
print("Mean charges for male:",round(male["charges"].mean(), 2))

##### On average, a male is charged 1387 dollars more than a female.
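
The per-group statistics and the gap between the two means can also be obtained in one step with `groupby`. This is a small sketch on made-up numbers, not the dataset itself; `toy`, `summary` and `mean_gap` are illustrative names.

```python
import pandas as pd

# Made-up charges for three females (0) and three males (1)
toy = pd.DataFrame({
    "sex": [0, 0, 0, 1, 1, 1],
    "charges": [1000.0, 2000.0, 3000.0, 2000.0, 3000.0, 7000.0],
})

# Mean and median of charges for each encoded sex
summary = toy.groupby("sex")["charges"].agg(["mean", "median"])

# Difference between the male and female means
mean_gap = summary.loc[1, "mean"] - summary.loc[0, "mean"]

print(summary)
print("male minus female mean:", mean_gap)
```

Applied to `data`, the value of `mean_gap` corresponds to the difference quoted above.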

#### BMI visualization

In [None]:
plt.figure(figsize=(12,5))
plt.title("Distribution of bmi")
# histplot with kde=True replaces the deprecated distplot
ax = sns.histplot(data["bmi"], kde=True, stat="density", color='b')

##### The bmi feature is approximately normally distributed and centered around 30, which is the threshold for obesity.

In [None]:
plt.figure(figsize=(12,5))
plt.title("Distribution of charges for people with BMI of 30 or more")
# histplot with kde=True replaces the deprecated distplot
ax = sns.histplot(data[(data.bmi >= 30)]['charges'], kde=True, stat="density", color='r')

In [None]:
plt.figure(figsize=(12,5))
plt.title("Distribution of charges for people with BMI less than 30")
# histplot with kde=True replaces the deprecated distplot
ax = sns.histplot(data[(data.bmi < 30)]['charges'], kde=True, stat="density", color='b')

##### People with a BMI of 30 or more tend to have higher charges.

#### Plot a bar chart showing how many children the patients have

In [None]:
sns.catplot(data=data, x="children", kind="count", palette="rocket")

In [None]:
plt.figure(figsize=(12,5))
plt.title("Distribution of charges for people with no children")
# histplot with kde=True replaces the deprecated distplot
ax = sns.histplot(data[(data.children < 1)]['charges'], kde=True, stat="density", color='b')

In [None]:
plt.figure(figsize=(12,5))
plt.title("Distribution of charges for people with more than 3 children")
# histplot with kde=True replaces the deprecated distplot
ax = sns.histplot(data[(data.children > 3)]['charges'], kde=True, stat="density", color='r')

##### Both distributions are right skewed, and the visualization does not show whether the number of children affects the charges. An ANOVA test would clarify whether there is a significant difference between groups.
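
One possible way to run such a test is a one-way ANOVA with `scipy.stats.f_oneway`, passing one array of charges per number of children. The sketch below uses synthetic data in which every group is drawn from the same distribution, so it only illustrates the mechanics; it is an assumption about how the test could be set up, not a result from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(2)

# Made-up stand-in: charges come from the same distribution for every group,
# so no real difference between groups is built in
toy = pd.DataFrame({
    "children": rng.integers(0, 4, 400),
    "charges": rng.lognormal(mean=9, sigma=0.8, size=400),
})

# One array of charges per distinct number of children
groups = [g["charges"].to_numpy() for _, g in toy.groupby("children")]

# One-way ANOVA: tests whether all group means are equal
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

A small p-value (for example below 0.05) would suggest that at least one group mean differs. ANOVA assumes roughly normal residuals and similar variances, so with right-skewed charges a rank-based alternative such as `scipy.stats.kruskal` may be worth considering.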

##### In conclusion, a BMI of 30 or more, smoking, ageing and being male are associated with higher insurance charges.