<h1> Brief Introduction </h1>
In this project, my main aim is to show how far data storytelling can go even when the dataset is small. I will also build a model that approximates the charges billed to each patient. To do that well, we first need to understand which factors influence a patient's charges, so we will look for patterns in the data and work out what they are telling us. We will go step by step through the story behind the patients in this dataset, because only then can we tell which features will bring the model's predictions closer to the true patient charge.

<h4>Things to Notice</h4>
I will import each library where it is first needed, so that it is easy to see which library is used where and for what purpose.

## Data Exploration

In this section, I will explore the data as thoroughly as I can and look for hidden patterns. I will also try to give a plausible explanation for each step and interpret each graph as far as possible.

In [None]:
import pandas as pd
insurance = pd.read_csv("../input/insurance/insurance.csv")

In [None]:
insurance.head()

### Finding the missing values

- It seems that the data does not have any missing values. 

In [None]:
insurance.isna().sum()/len(insurance)

**Describe** helps us to find statistical information about the dataset that we are working with. 

In [None]:
insurance.describe()

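The important thing to notice is the standard deviation, which measures how widely the values of a column are spread around their mean. I believe that the standard deviation lies at the core of statistics, and once we understand the spread of each feature we have won the battle.

**Note**: The standard deviation is measured in the units of its column, so there is no single "good" value. A value around 1 is what we expect after standardising or log-transforming a feature, and features in that range are easy to compare with one another.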
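
To make the effect of scale concrete, the sketch below uses made-up numbers (not values from `insurance.csv`). The same charges expressed in dollars and in thousands of dollars give very different standard deviations, while z-score standardisation gives a standard deviation of 1. All variable names here are illustrative only.

```python
import numpy as np

# Hypothetical toy values, not taken from the dataset.
charges_usd = np.array([1200.0, 4500.0, 9800.0, 16000.0, 39000.0])

# The same quantity expressed in thousands of dollars.
charges_k = charges_usd / 1000.0

std_usd = charges_usd.std()
std_k = charges_k.std()

# Z-score standardisation: subtract the mean, divide by the standard deviation.
z = (charges_usd - charges_usd.mean()) / std_usd

print("std in dollars:          ", std_usd)
print("std in thousand dollars: ", std_k)
print("std after standardising: ", z.std())
```

The spread of the data is identical in all three cases; only the unit changes.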

### Visualisation

To get a good picture of what is happening under the hood, we need the help of visualisation tools. With matplotlib and seaborn we can do that.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# just a trial plot
sns.distplot(insurance.age)

### A note about Logs or Logarithms
- Using logarithms: taking the log of a right-skewed variable often brings its distribution closer to normal, which helps in several ways, such as outlier detection, applying statistical concepts based on the central limit theorem, and building our predictive model later on.
- After the log transform, the standard deviation of `charges` is below 1. Keep this in mind, because the model errors we measure later will be on this log scale.
- Below is an example of how the log transform pulls the distribution towards a normal shape.

In [None]:
print(np.std(np.log(insurance.charges)))
sns.distplot(np.log(insurance.charges))
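
One way to put a number on "closer to normal" is skewness, which is near 0 for a symmetric distribution. The sketch below assumes a synthetic right-skewed sample drawn from a log-normal distribution (it does not use the real `charges` column) and compares skewness before and after the log transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample: an assumption for illustration only.
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=5000))

skew_raw = sample.skew()          # strongly positive for right-skewed data
log_sample = np.log(sample)
skew_log = log_sample.skew()      # close to 0 once the tail is compressed
std_log = log_sample.std()

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
print(f"std after log:       {std_log:.2f}")
```

The same two calls, `insurance.charges.skew()` and `np.log(insurance.charges).skew()`, could be run on the real column to check how much the transform helps there.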

### Age Analysis:

Turning Age into Categorical Variables:
- Young Adult: from 18 - 35
- Senior Adult: from 36 - 55
- Elder: 56 or older
- Share of each Category: Young Adults (42.9%), Senior Adults (41%) and Elder (16.1%)
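
As an alternative to assigning each category with a separate `.loc` statement, the same bins can be sketched with `pd.cut`. The ages below are toy values chosen to sit on the bin edges, and the names `ages`, `age_cat` and `shares` are illustrative.

```python
import pandas as pd

# Toy ages on the category boundaries (not from the dataset).
ages = pd.Series([18, 35, 36, 55, 56, 64], name="age")

# Right-inclusive intervals: (17, 35], (35, 55], (55, inf)
age_cat = pd.cut(
    ages,
    bins=[17, 35, 55, float("inf")],
    labels=["Young Adult", "Senior Adult", "Elder"],
)

# Share of each category, as a fraction of all rows.
shares = age_cat.value_counts(normalize=True)

print(age_cat.tolist())
print(shares)
```

Because every age falls into exactly one interval, no row can be left without a label.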

In [None]:
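# Start from an object column so string labels can be assigned without dtype warnings
insurance['age_cat'] = pd.Series(index=insurance.index, dtype='object')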
lst = [insurance]

In [None]:
lst

In [None]:
for col in lst:
    col.loc[(col['age'] >= 18) & (col['age'] <= 35), 'age_cat'] = 'Young Adult'
    col.loc[(col['age'] > 35) & (col['age'] <= 55), 'age_cat'] = 'Senior Adult'
    col.loc[col['age'] > 55, 'age_cat'] = 'Elder'
    

In [None]:
print(lst)

In [None]:
age_cat = insurance.age_cat.map({'Young Adult':0, 
 'Senior Adult':1,
 'Elder':2})


In [None]:
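# Take labels and counts from the same value_counts() result so they stay aligned
age_counts = insurance["age_cat"].value_counts()
labels = age_counts.index.tolist()
amount = age_counts.tolist()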

In [None]:
my_circle=plt.Circle( (0,0), 0.7, color='white')

plt.figure(figsize=(10,10))
plt.pie(amount, labels=labels, colors=['red','green','blue'])

p=plt.gcf()
p.gca().add_artist(my_circle)
plt.show()

In [None]:
plt.figure(figsize=(15,10))
sns.distplot(insurance.bmi)
plt.show()

Is there a relationship between BMI and age?
- BMI frequency: most BMI values are concentrated between 28 and 32.
- Correlations: age and charges have a correlation of 0.29, while BMI and charges have a correlation of 0.19.
- Relationship between BMI and age: the correlation between these two variables is 0.10, which is weak. We can therefore rule out age having a large influence on BMI.
- Taken on their own, BMI and age also have only a modest linear relationship with charges, which means these two factors do not affect charges as much as we expected.
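
The figures quoted above are Pearson correlation coefficients, which is what `DataFrame.corr()` computes by default. As a sketch of what that number means, the helper below (a hypothetical `pearson_r`, written for illustration) computes it from the definition on a few toy values and checks it against NumPy.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()   # centre both variables
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Toy values, not rows from the dataset.
age = [19, 25, 33, 41, 52, 60]
bmi = [27.9, 22.7, 33.0, 25.7, 30.8, 29.1]

r = pearson_r(age, bmi)
print("by hand:", r)
print("numpy:  ", np.corrcoef(age, bmi)[0, 1])
```

The coefficient always lies between -1 and 1 and only captures linear association, so a low value does not rule out a non-linear relationship.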

In [None]:
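plt.figure(figsize=(10,10))
# numeric_only=True skips the text columns (sex, smoker, region, age_cat),
# which recent pandas versions no longer drop silently
sns.heatmap(insurance.corr(numeric_only=True))
plt.show()

print('*'*100)
print(insurance.corr(numeric_only=True))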

In [None]:
young_adults = insurance["bmi"].loc[insurance["age_cat"] == "Young Adult"].values
senior_adult = insurance["bmi"].loc[insurance["age_cat"] == "Senior Adult"].values
elders = insurance["bmi"].loc[insurance["age_cat"] == "Elder"].values

**Observations**: The young adults group has extreme BMI outliers, and we need to deal with them.

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(data= [young_adults, senior_adult, elders])


In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols


moore_lm = ols("bmi ~ age_cat", data=insurance).fit()
print(moore_lm.summary())

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
label = LabelEncoder()

In [None]:
#sex
label.fit(insurance.sex.drop_duplicates())
insurance.sex = label.transform(insurance.sex)
insurance.sex.head()

In [None]:
#smoker or non-smoker
insurance.smoker = label.fit_transform(insurance.smoker)
insurance.smoker.head()

In [None]:
#region

insurance.region = label.fit_transform(insurance.region)
insurance.region.head()

In [None]:
insurance.describe()

In [None]:
# age_cat is still a text column, so restrict the correlation to numeric columns
insurance.corr(numeric_only=True)

- Smoker shows a strong correlation with charges, which means that smokers tend to be charged more for treatment than non-smokers.
- A strong correlation suggests that as the independent variable changes, the dependent variable tends to change with it.

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(insurance.corr(numeric_only=True))
plt.show()

In [None]:
moore_lm = ols("charges ~ smoker", data=insurance).fit()
print(moore_lm.summary())

In [None]:
plt.figure(figsize=(10,10))
sns.distplot(insurance.charges)
plt.show()

In [None]:
insurance.loc[(insurance.smoker == 1)].charges

In [None]:
f = plt.figure(figsize=(20,10))

ax = f.add_subplot(121)
sns.distplot(insurance.loc[(insurance.smoker == 1)].charges, ax=ax)
ax.set_title('Smokers')


ax = f.add_subplot(122)
sns.distplot(insurance.loc[(insurance.smoker == 0)].charges, color='r', ax = ax)
ax.set_title('Non-Smokers')

Patients who smoke spend much more on treatment.

In [None]:
# catplot is figure-level and creates its own figure, so size it with height/aspect
sns.catplot(x='smoker', kind='count', hue='sex', palette='PuBuGn_r', data=insurance, height=8, aspect=1.5)
plt.show()

In [None]:
f = plt.figure(figsize=(20,20))

ax = f.add_subplot(211)
sns.boxenplot(x = 'age', y='charges', hue='sex', data=insurance, ax=ax)

ax = f.add_subplot(212)
sns.scatterplot(x = 'charges', y='age', hue='smoker', data=insurance, ax=ax)

plt.show()

In [None]:
plt.figure(figsize=(10,10))
sns.distplot(insurance.age, color='r')

In [None]:
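# Start from an object column so string labels can be assigned without dtype warnings
insurance["weight_condition"] = pd.Series(index=insurance.index, dtype="object")
lst = [insurance]

for col in lst:
    col.loc[col["bmi"] < 18.5, "weight_condition"] = "Underweight"
    # Each upper bound meets the next lower bound, so no BMI value is left unlabelled
    col.loc[(col["bmi"] >= 18.5) & (col["bmi"] < 25), "weight_condition"] = "Normal Weight"
    col.loc[(col["bmi"] >= 25) & (col["bmi"] < 30), "weight_condition"] = "Overweight"
    col.loc[col["bmi"] >= 30, "weight_condition"] = "Obese"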

In [None]:
f, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(18,8))

# I wonder if the cluster that is on the top is from obese people
sns.stripplot(x="age_cat", y="charges", data=insurance, ax=ax1, linewidth=1, palette="Reds")
ax1.set_title("Relationship between Charges and Age")


sns.stripplot(x="age_cat", y="charges", hue="weight_condition", data=insurance, ax=ax2, linewidth=1, palette="Set2")
ax2.set_title("Relationship of Weight Condition, Age and Charges")

sns.stripplot(x="smoker", y="charges", hue="weight_condition", data=insurance, ax=ax3, linewidth=1, palette="Set2")
ax3.legend_.remove()
ax3.set_title("Relationship between Smokers and Charges")

plt.show()

In [None]:
sns.set(style="ticks")
pal = ["#FA5858", "#58D3F7"]

sns.pairplot(insurance, hue="smoker", palette=pal)
# Title the whole grid rather than only the last subplot
plt.suptitle("Smokers", y=1.02)

In [None]:
f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18,8))
sns.scatterplot(x="bmi", y="charges", hue="weight_condition", data=insurance, palette="Set1", ax=ax1)
ax1.set_title("Relationship between Charges and BMI by Weight Condition")

sns.scatterplot(x="bmi", y="charges", hue="smoker", data=insurance, palette="Set1", ax=ax2)
ax2.set_title("Relationship between Charges and BMI by Smoking Condition")


In [None]:
sns.scatterplot(x='children', y='age', data=insurance, hue='charges')

In [None]:
insurance.children.unique()

In [None]:
plt.hist(insurance.children)

In [None]:
# Recent seaborn versions require x and y to be passed as keywords
sns.barplot(x=insurance.children, y=insurance.charges)

In [None]:
plt.figure(figsize=(15,10))
sns.violinplot(x='children', y='charges', data=insurance)

In [None]:
plt.boxplot(insurance.children)
plt.show()

In [None]:
insurance.children.std()

# Unsupervised Learning

In [None]:
from sklearn.cluster import KMeans

In [None]:
cluster = KMeans(n_clusters=3)

In [None]:
insurance.head()

In [None]:
X = insurance.drop(['age_cat', 'weight_condition' ], axis=1)
y = insurance.charges

In [None]:
cluster.fit(X)

In [None]:
cluster.cluster_centers_

In [None]:
X.values[:,0]

In [None]:
fig = plt.figure(figsize=(12,8))

plt.scatter(X.values[:,2], X.values[:,6], c=cluster.labels_, cmap="Set1_r", s=25)
plt.scatter(cluster.cluster_centers_[:,2] ,cluster.cluster_centers_[:,6], color='black', marker="o", s=250)

# Feature Engineering

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(insurance.corr(numeric_only=True))
plt.show()

In [None]:
X = insurance.drop('region', axis=1)

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(X.corr(numeric_only=True))
plt.show()

In [None]:
plt.figure(figsize=(15,10))
sns.scatterplot(x='children', y='bmi', hue='weight_condition', data=insurance)
plt.show()

In [None]:
# X still holds the text columns age_cat and weight_condition
X.std(numeric_only=True)

In [None]:
plt.figure(figsize=(10, 12))
plt.boxplot(insurance.bmi)
plt.show()

## Removing Outliers

In [None]:
new_bmi = X.bmi.values
q25, q75 = np.percentile(new_bmi, 25), np.percentile(new_bmi, 75)
print(f'Quartile 25: {q25} | Quartile 75: {q75}')
new_bmi_iqr = q75 - q25
print(f'iqr: {new_bmi_iqr}')

In [None]:
new_bmi_cutoff = new_bmi_iqr * 1.5
new_bmi_lower, new_bmi_upper = q25 - new_bmi_cutoff, q75 + new_bmi_cutoff
print('Lower: ', new_bmi_lower)
print('Upper :', new_bmi_upper)

In [None]:
outliers = [x for x in new_bmi if x<new_bmi_lower or x>new_bmi_upper]
outliers, len(outliers)

In [None]:
final_df = X.drop(X[(X.bmi>new_bmi_upper) | (X.bmi<new_bmi_lower)].index)

In [None]:
plt.figure(figsize=(10,15))
plt.boxplot(final_df.bmi)
plt.show()

In [None]:
new_age = X.age.values
q25, q75 = np.percentile(new_age, 25), np.percentile(new_age, 75)
print(f'Quartile 25: {q25} | Quartile 75: {q75}')
new_age_iqr = q75 - q25
print(f'iqr: {new_age_iqr}')

new_age_cutoff = new_age_iqr * 1.5
new_age_lower, new_age_upper = q25 - new_age_cutoff, q75 + new_age_cutoff
print('Lower: ', new_age_lower)
print('Upper :', new_age_upper)

outliers = [x for x in new_age if x<new_age_lower or x>new_age_upper]
outliers, len(outliers)

# Filter final_df (not X) so the BMI outlier removal above is kept
final_df = final_df.drop(final_df[(final_df.age>new_age_upper) | (final_df.age<new_age_lower)].index)

plt.figure(figsize=(10,15))
plt.boxplot(final_df.age)
plt.show()
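
The IQR rule applied to `bmi` and `age` above can also be wrapped in a small helper so that several columns are filtered in one pass. This is only a sketch: `iqr_bounds` and `drop_iqr_outliers` are hypothetical names, and the data frame is a toy example rather than the insurance data.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits at k * IQR beyond the quartiles."""
    q25, q75 = np.percentile(values, [25, 75])
    cutoff = k * (q75 - q25)
    return q25 - cutoff, q75 + cutoff

def drop_iqr_outliers(df, columns, k=1.5):
    """Keep only rows that fall inside the IQR limits for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        lower, upper = iqr_bounds(df[col], k)
        mask &= df[col].between(lower, upper)   # limits themselves are kept
    return df[mask]

# Toy data with one obvious BMI outlier (60.0).
toy = pd.DataFrame({
    "bmi": [22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 60.0],
    "age": [19, 23, 31, 40, 45, 52, 64],
})

cleaned = drop_iqr_outliers(toy, ["bmi", "age"])
print(iqr_bounds(toy["bmi"]))
print(cleaned)
```

Combining the per-column masks before filtering means a later column cannot undo the filtering done for an earlier one.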

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
Scale = StandardScaler()
# fit_transform returns a 2-D array, so flatten it before assigning back to the column
final_df['bmi'] = Scale.fit_transform(final_df[['bmi']]).ravel()


In [None]:
final_df.std(numeric_only=True)
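
The scaler above is fitted on the whole table before the train/test split. A common alternative, sketched here on synthetic data, is to place the scaler inside a pipeline so that its mean and standard deviation are learned from the training rows only. The column and variable names below are stand-ins, not the notebook's own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a single BMI-like feature and a noisy linear target.
rng = np.random.default_rng(23)
X_toy = rng.normal(loc=30.0, scale=6.0, size=(200, 1))
y_toy = 0.05 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=23)

# The pipeline fits the scaler on X_tr only, then reuses it when scoring on X_te.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps["standardscaler"]
print("scaler mean (training rows only):", scaler.mean_[0])
print("test R^2:", pipe.score(X_te, y_te))
```

Whether this changes the results noticeably on a dataset of this size is something to check rather than assume.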

# Modeling 
## Linear Regression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
import math
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_test), y_test),
                m.score(X_train, y_train), m.score(X_test, y_test)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

In [None]:
X = final_df.drop(['charges', 'age_cat', 'weight_condition'], axis=1)
y = np.log(final_df.charges)


X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 23, test_size=0.3)

In [None]:
model=LinearRegression()
model.fit(X_train, y_train)

In [None]:
print_score(model)
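
Because the target is `np.log(charges)`, the RMSE values printed by `print_score` are in log units rather than in currency. The sketch below, on made-up numbers, shows one way to report the error on the original scale by exponentiating the predictions first. The arrays are illustrative and are not model output.

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error between two equally sized arrays."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Hypothetical true charges and log-scale predictions with small errors.
actual_charges = np.array([2000.0, 8000.0, 15000.0, 40000.0])
log_pred = np.log(actual_charges) + np.array([0.1, -0.1, 0.2, -0.2])

rmse_log = rmse(log_pred, np.log(actual_charges))        # error in log units
rmse_currency = rmse(np.exp(log_pred), actual_charges)   # error in currency units

print("RMSE on the log scale:     ", rmse_log)
print("RMSE on the original scale:", rmse_currency)
```

A small log-scale error can still be a large amount of money for the most expensive patients, so it helps to look at both figures.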

In [None]:
model.intercept_

In [None]:
model.coef_

In [None]:
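plt.figure(figsize=(10,10))
# The model has several features, so plot a single one (age) against the target
plt.plot(X_train['age'], y_train, 'ro')
# Line for the age coefficient alone, with every other feature held at 0
age_sorted = np.sort(X_train['age'])
plt.plot(age_sorted, model.coef_[0]*age_sorted + model.intercept_)
plt.show()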

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
model=RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
model=RandomForestRegressor(n_estimators=25, n_jobs=-1, max_depth=6, max_features=0.5)
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
rmse(model.predict(X_test), y_test)

In [None]:
from sklearn.tree import export_graphviz
from IPython import display
from io import StringIO
import re

In [None]:
import graphviz
import IPython

def draw_tree(t, df, size=10, ratio=0.6, precision=0):
    """ Draws a representation of a single decision tree in IPython.
    Parameters:
    -----------
    t: The tree you wish to draw
    df: The data used to train the tree. This is used to get the names of the features.
    """
    s=export_graphviz(t, out_file=None, feature_names=df.columns, filled=True,
                      special_characters=True, rotate=True, precision=precision)
    IPython.display.display(graphviz.Source(re.sub('Tree {',
       f'Tree {{ size={size}; ratio={ratio}', s)))

In [None]:
draw_tree(model.estimators_[0], X, precision=5)

In [None]:
print_score(model)

In [None]:
X.columns

In [None]:
np.exp(model.predict([[30, 0, 0.4, 3, 0]]))

In [None]:
insurance.loc[(insurance.age == 30) & (insurance.bmi<=20)]

## Feature Importance

In [None]:
feature_importances = pd.DataFrame(model.feature_importances_,
                                   index = X.columns,
                                    columns=['importance']).sort_values('importance',  ascending=False)

In [None]:
feature_importances.plot.barh(figsize=(15,8))
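
The impurity-based `feature_importances_` used above are computed from the training data. As a complementary check, scikit-learn also offers permutation importance, which measures how much the held-out score drops when one column is shuffled. The sketch below uses a synthetic table (the columns `smoker`, `age` and `noise` and their coefficients are invented for illustration) rather than the fitted model from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends strongly on smoker, weakly on age, not on noise.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "smoker": rng.integers(0, 2, size=n),
    "age": rng.integers(18, 65, size=n),
    "noise": rng.normal(size=n),
})
target = 1.5 * toy["smoker"] + 0.02 * toy["age"] + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(toy, target, test_size=0.3, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set several times and record the score drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
perm = pd.Series(result.importances_mean, index=toy.columns).sort_values(ascending=False)
print(perm)
```

On the real data the same call would take the fitted `model` together with `X_test` and `y_test`.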