# <font color='Black'>Welcome to this Notebook </font> 
# <font color='Green'>MEDICAL COST EDA AND OLS REGRESSION</font>  

# This notebook contains:
* Data Preprocessing
* Advanced Data Visualization
* Exploratory Data Analysis (EDA)
* Checking Missing Values
* Data Engineering/Data Extraction/Data Standardization
* Outlier Detection with LOF Technique
* Model Building/OLS Regression Model 
* Advanced Statistical Analysis such as Backward Selection Technique
* Model Evaluation 
* Visualization of Residual Plot and Prediction Error of the Regression Model 


In [None]:
import numpy as np
from numpy import log, log1p

import pandas as pd

import scipy.stats as stats
from scipy.stats import shapiro,boxcox,yeojohnson

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [None]:
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.model_selection import train_test_split,ShuffleSplit,GridSearchCV,cross_val_score,cross_val_predict
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error,r2_score

In [None]:
!pip install yellowbrick
!pip install dython

In [None]:
from yellowbrick.regressor import residuals_plot
from yellowbrick.regressor import prediction_error
from dython import nominal
from mlxtend.plotting import plot_linear_regression,plot_learning_curves
import missingno as msno
import pylab

# Business Understanding

**age:** age of primary beneficiary

**sex:** insurance contractor gender, female, male

**bmi:** Body mass index, an objective measure of whether body weight is high or low relative to height,
    computed as weight divided by height squared (kg / m ^ 2), ideally 18.5 to 24.9

**children:** Number of children covered by health insurance / Number of dependents

**smoker:** whether the beneficiary smokes

**region:** the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

**charges:** Individual medical costs billed by health insurance
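
To make the **bmi** definition above concrete, here is a minimal sketch of the formula. The helper name `bmi` and the measurements are made up for illustration; the dataset itself only contains the already-computed value.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg, 1.75 m tall
value = bmi(70.0, 1.75)
print(round(value, 2))  # 22.86, inside the 18.5 to 24.9 range
```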


# <font color='blue'>Exploratory Data Analysis </font>  

In [None]:
#Let's import the dataset into the kernel by executing the code below:
data=pd.read_csv("../input/health-insurance-dataset/Health_insurance.csv")
df=data.copy()

df.head()

In [None]:
#Let's understand the data shape of our dataset:
print("row :",df.shape[0]," ","column :",df.shape[1])

In [None]:
#Let’s understand the statistical information about Numerical Columns in our dataset:
df.describe().T

In [None]:
#Let’s understand the statistical information about Object Columns in our dataset:
df.describe(include=["object"]).T

In [None]:
#Missing Data
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(25)

Great! We do not have any missing values in our dataset.

How many beneficiaries have no children covered by health insurance? Counting the zeros in each column answers this:

In [None]:
df.eq(0).sum()

In [None]:
nominal.associations(df,figsize=(20,10),mark_columns=True,cmap="rainbow");

In [None]:
plt.rcParams.update({'font.size': 12})
plt.figure(figsize=(10,5))
corr=df.corr(numeric_only=True)  #skip the object columns (needs pandas >= 1.5)

mask=np.zeros_like(corr,dtype=bool)  #np.bool was removed in NumPy 1.24
mask[np.triu_indices_from(mask)]=True
sns.heatmap(corr.abs(),annot=True,cmap="coolwarm",mask=mask);

In [None]:
plt.figure(figsize=(12,5))
plt.subplot(121)
sns.distplot(df.charges,color="b");

plt.subplot(122)
sns.distplot(log(df.charges),color="b");

**Notice:** There are two main reasons to use logarithmic scales in charts and graphs. The first is to respond to skewness towards large values; i.e., cases in which one or a few points are much larger than the bulk of the data. The second is to show percent change or multiplicative factors. 
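
The first point can be checked numerically. The sketch below uses synthetic, right-skewed values (not the insurance data) and a small hypothetical helper, `sample_skewness`, to compare skewness before and after taking the logarithm.

```python
import numpy as np

def sample_skewness(x):
    """Third standardized moment (population form)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
# Synthetic right-skewed "costs"; NOT the insurance data
costs = rng.lognormal(mean=9.0, sigma=0.9, size=5000)

skew_raw = sample_skewness(costs)
skew_log = sample_skewness(np.log(costs))
print(f"raw: {skew_raw:.2f}, log: {skew_log:.2f}")
```

On data like this the raw skewness is strongly positive, while the log-transformed values are close to symmetric.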

In [None]:
#Let's visualize the numerical columns, split by the smoker column
sns.pairplot(df,kind="reg",hue="smoker",aspect=2);

In [None]:
#Let's visualize the numerical columns, split by the sex column
sns.pairplot(df,kind="reg",hue="sex",aspect=2);

In [None]:
#Let’s draw the scatter plot of the “bmi” and “charges” column in proportion to the “smoker” column
sns.relplot(x="bmi",y="charges",hue="smoker",data=df,kind="scatter",aspect=2);

In [None]:
#Let’s draw the scatter plot of the “bmi” and “charges” column in proportion to the “children” column
sns.relplot(x="bmi",y="charges",hue="children",data=df,kind="scatter",aspect=2,palette='coolwarm');

In [None]:
sns.catplot(x="age", y="charges", hue="smoker", data=df,aspect=3,kind="point");

In [None]:
sns.lmplot(x="bmi", y="charges", hue="smoker", data=df,aspect=2);

In [None]:
#Let's draw the jointplot by executing the code below
plt.figure(figsize=(12,5));
sns.jointplot(x="bmi", y="charges" ,data=df, kind="reg");

In [None]:
plt.figure(figsize=(12,5));
sns.jointplot(x="age", y="bmi" ,data=df);

In [None]:
#Let's see the distribution of the age column by executing the code below:
plt.figure(figsize=(12,5));
sns.distplot(df.age);

In [None]:
#Let's visualize the Probability Plot for the charges column
plt.figure(figsize=(12,5));
stats.probplot(df.charges, dist="norm", plot=pylab) ;

In [None]:
#Group by smoker and plot the mean of "charges" for each group
plt.figure(figsize=(12,5));
df.groupby("smoker")["charges"].mean().plot.bar(color="r");

In [None]:
plt.figure(figsize=(12,5));
df.groupby("children")["charges"].mean().plot.bar(color="g");

In [None]:
print(sns.FacetGrid(df,hue="sex",height=5,aspect=2).map(sns.kdeplot,"charges",shade=True).add_legend());

In [None]:
print(sns.FacetGrid(df,hue="region",height=5,aspect=2).map(sns.kdeplot,"charges",shade=False).add_legend());

In [None]:
print(sns.catplot(x="sex",y="charges",hue="smoker",data=df,kind="bar",aspect=2));

In [None]:
print(sns.catplot(x="sex",y="charges",hue="region",data=df,kind="bar",aspect=2));

In [None]:
sns.catplot(x="smoker",y="charges",data=df,kind="box",aspect=2);

In [None]:
sns.catplot(x="sex",y="charges",data=df,kind="box",aspect=2);

In [None]:
sns.catplot(x="sex",y="charges",hue="smoker",data=df,kind="box",aspect=2);

In [None]:
sns.catplot(x="region",y="charges",data=df,kind="box",aspect=2);

In [None]:
sns.catplot(x="children",y="charges",data=df,kind="box",aspect=2);

### Data Discretization

In [None]:
labels=["too_weak","normal","heavy","too_heavy"]
ranges=[0,18.5,24.9,29.9,np.inf]
df["bmi"]=pd.cut(df["bmi"],bins=ranges,labels=labels)
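
One detail worth knowing about the cell above: `pd.cut` builds right-closed intervals by default, so a value sitting exactly on an edge (for example 18.5 or 24.9) lands in the lower category. A small sketch with made-up BMI values, reusing the same labels and edges:

```python
import numpy as np
import pandas as pd

labels = ["too_weak", "normal", "heavy", "too_heavy"]
ranges = [0, 18.5, 24.9, 29.9, np.inf]

# Made-up BMI values chosen to sit on and around the bin edges
sample = pd.Series([17.0, 18.5, 18.6, 24.9, 25.0, 29.9, 30.0, 41.2])
binned = pd.cut(sample, bins=ranges, labels=labels)

print(pd.DataFrame({"bmi": sample, "category": binned}))
```

If the edges should instead belong to the upper category, `pd.cut` accepts `right=False`.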

In [None]:
print(sns.FacetGrid(df,hue="bmi",height=5,aspect=2).map(sns.kdeplot,"charges",shade=False).add_legend());

In [None]:
print(sns.catplot(x="bmi",y="charges",kind="bar",data=df,aspect=2));

In [None]:
print(sns.catplot(x="bmi",y="charges",hue="children",kind="bar",data=df,aspect=3));

In [None]:
print(sns.catplot(x="bmi",y="charges",hue="smoker",data=df,kind="bar",aspect=2));

In [None]:
plt.rcParams.update({'font.size': 12})
plt.figure(figsize=(10,5))
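corr=df.corr(numeric_only=True)  #skip the object/category columns (needs pandas >= 1.5)
mask=np.zeros_like(corr,dtype=bool)  #np.bool was removed in NumPy 1.24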
mask[np.triu_indices_from(mask)]=True
sns.heatmap(corr.abs(),annot=True,cmap="coolwarm",mask=mask);

In [None]:
plt.figure(figsize=(15,5))
plt.subplot(121)
sns.boxplot(df["charges"],color="y");
plt.subplot(122)
sns.boxplot(df["age"],color="y");

In [None]:
pd.crosstab(df.age,df.children)[:10]

In [None]:
df[(df["age"]==18)&(df["sex"]=="female")&(df["children"]>0)]

In [None]:
df[(df["age"]==18) & (df["sex"]=="male") & (df["children"]>0)]

### Unsupervised Outlier Detection using Local Outlier Factor (LOF)

In [None]:
clf=LocalOutlierFactor(n_neighbors=50)
clf.fit_predict(df[["age","children"]])

In [None]:
clf_scores=clf.negative_outlier_factor_

In [None]:
np.sort(clf_scores)[0:20]

In [None]:
threshold=np.sort(clf_scores)[20]
threshold

In [None]:
df[clf_scores<threshold]
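
To see how the LOF scores behave, here is a small sketch on synthetic 2-D data (not the insurance data). `negative_outlier_factor_` is close to -1 for points in dense regions and becomes more negative as a point gets more isolated, so sorting it in ascending order puts the strongest outlier candidates first. The number of points to flag, `k`, is a judgment call.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Synthetic 2-D cluster plus three far-away points (indices 200, 201, 202)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([inliers, outliers])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
scores = lof.negative_outlier_factor_  # more negative = more anomalous

k = 3  # how many points to flag; chosen by hand here
flagged = np.argsort(scores)[:k]
print(sorted(flagged.tolist()))
```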

In [None]:
df[(df["age"]==18)&(df["children"]>1)]

In [None]:
df.drop(df[(df["age"]==18)&(df["children"]>0)].index,inplace=True)

In [None]:
df.corr(numeric_only=True)

In [None]:
print(sns.catplot(x="children",y="charges",hue="smoker",data=df,kind="bar",aspect=3));

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df_new=pd.get_dummies(data=df,columns=["sex","smoker"],drop_first=True)

In [None]:
df_new.head()

In [None]:
df_new=pd.get_dummies(data=df_new,columns=["region","bmi"])

In [None]:
df_new.head()

In [None]:
df_new.charges=log(df_new.charges)

sc=StandardScaler()
df_scaled=pd.DataFrame(sc.fit_transform(df_new),columns=df_new.columns,index=df_new.index)

df_scaled.head()

In [None]:
X=df_scaled.drop("charges",axis=1)
y=df_scaled["charges"] 
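
Note that the scaler above is fitted on the full dataset before the train/test split, so the test rows contribute to the mean and standard deviation. A common alternative, sketched here on synthetic data rather than on `df_new`, is to split first and fit the scaler on the training rows only. This is not what the notebook does; it is shown for comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features and target; stand-ins for the real columns
X = rng.normal(loc=50.0, scale=10.0, size=(500, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(axis=0).round(3))  # zero by construction
print(X_test_scaled.mean(axis=0).round(3))   # near zero, but not exactly
```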

# <font color='blue'>Train Test Split </font>  

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# <font color='blue'>OLS Regression</font>  

In [None]:
lm=sm.OLS(y_train,X_train)
model=lm.fit()
model.summary()

In [None]:
X=df_scaled.drop(["charges","region_northwest"],axis=1)
y=df_scaled["charges"] 
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
lm=sm.OLS(y_train,X_train)
model=lm.fit()
model.summary()

In [None]:
X=df_scaled.drop(["charges","region_northwest","bmi_heavy"],axis=1)
y=df_scaled["charges"] 

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
lm=sm.OLS(y_train,X_train)
model=lm.fit()
model.summary()

In [None]:
X=df_scaled.drop(["charges","region_northwest","bmi_heavy","bmi_too_weak"],axis=1)
y=df_scaled["charges"] 
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
lm=sm.OLS(y_train,X_train)
model=lm.fit()
model.summary()

In [None]:
X=df_scaled.drop(["charges","region_northwest","bmi_heavy","bmi_too_weak","bmi_normal"],axis=1)
y=df_scaled["charges"] 
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
lm=sm.OLS(y_train,X_train)
model=lm.fit()
model.summary()

In [None]:
X=df_scaled.drop(["charges","region_northwest","bmi_heavy","bmi_too_weak","bmi_normal","region_northeast"],axis=1)
y=df_scaled["charges"] 
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
lm=sm.OLS(y_train,X_train)
model=lm.fit()
model.summary()

In [None]:
model.params

In [None]:
model=LinearRegression()
lin_mo=model.fit(X_train,y_train)
y_pred=lin_mo.predict(X_test)

In [None]:
lin_mo.score(X_train,y_train)

In [None]:
lin_mo.score(X_test,y_test)

In [None]:
r2_score(y_test,y_pred)
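
Keep in mind that these scores are measured on the standardized log of `charges`, not on the charges themselves. To report an error in the original units, the predictions have to be un-standardized first and then exponentiated. The sketch below shows the round trip on synthetic values; in this notebook the corresponding mean and standard deviation would be the `charges` entries of `sc.mean_` and `sc.scale_`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic charges in currency units; NOT the insurance data
charges = rng.lognormal(mean=9.0, sigma=0.8, size=1000)

log_charges = np.log(charges)
mu, sigma = log_charges.mean(), log_charges.std()
y_scaled = (log_charges - mu) / sigma  # the scale the model is trained on

# Pretend predictions: the true scaled target plus a little noise
y_pred_scaled = y_scaled + rng.normal(scale=0.1, size=y_scaled.size)

# Invert the standardization first, then the log
charges_back = np.exp(y_scaled * sigma + mu)
charges_pred = np.exp(y_pred_scaled * sigma + mu)

rmse = np.sqrt(np.mean((charges - charges_pred) ** 2))
print(f"RMSE in original units: {rmse:.2f}")
```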

In [None]:
plt.figure(figsize=(12,5));
ax1=sns.distplot(y_test,hist=False, color='r')
sns.distplot(y_pred,ax=ax1,hist=False, color='b');

In [None]:
plt.figure(figsize=(12,8));
residuals_plot(model, X_train, y_train, X_test, y_test,line_color="red");

In [None]:
plt.figure(figsize=(12,8));
prediction_error(model, X_train, y_train, X_test, y_test);

In [None]:
model.coef_

In [None]:
model.intercept_

# Conclusion:
* We preprocessed the data
* We drew several advanced plots and conducted the Exploratory Data Analysis (EDA)
* We performed some Data Engineering/Data Extraction
* We performed Outlier Detection with the LOF Technique
* We standardized the data
* We built the OLS Regression Model
* We applied the Backward Selection Technique to refine the model
* We visualized the Residual Plot and Prediction Error of the Regression Model
### I have tried to perform some advanced statistical analysis in this notebook, and I hope you enjoyed reading it
