**Columns**

age: age of primary beneficiary

sex: gender of the insurance contractor (female, male)

bmi: body mass index, an objective measure of whether body weight is high or low relative to height,
computed as weight divided by height squared (kg / m ^ 2); the ideal range is 18.5 to 24.9
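
As a small illustration of the BMI definition above, the sketch below computes the index from weight and height and checks it against the 18.5 to 24.9 range. The function names and the example person are hypothetical and are not part of the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when the BMI falls inside the range quoted in the column description."""
    return low <= value <= high


# Hypothetical person: 70 kg and 1.75 m tall
example = bmi(70.0, 1.75)
print(round(example, 2), in_ideal_range(example))
```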

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes (yes, no)

region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

charges: Individual medical costs billed by health insurance

In [None]:
# Importing the relevant libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

In [None]:
#reading the data

raw_data = pd.read_csv('../input/insurance/insurance.csv')
raw_data.head()

In [None]:
# Count the missing values in each column
raw_data.isnull().sum()

The dataset has no missing values.

In [None]:
# Checking for missing values with a heatmap

sns.heatmap(raw_data.isnull(),cbar=True)

In [None]:
# Another tool for checking missing values
from missingno import matrix,heatmap
matrix(raw_data)

In [None]:
# Number of male and female patients
raw_data['sex'].value_counts()

In [None]:
raw_data['sex'].value_counts().plot.bar(color='y')

The numbers of male and female patients are almost equal.

In [None]:
raw_data['age'].value_counts().sort_values(ascending=False)

In [None]:
fig = plt.figure(figsize=(12,6))
# Pass the column as a keyword argument; recent seaborn versions reject positional data
sns.countplot(x=raw_data['age'],hue=raw_data['smoker'],palette=['red','green'])

In [None]:
#Check Point
df = raw_data.copy()

In [None]:
# Encoding the data with map function

df['sex'] = raw_data['sex'].map({'female':0,'male':1})
df['smoker'] = raw_data['smoker'].map({'yes':1,'no':0})
df['region'] = raw_data['region'].map({'southeast':0,'southwest':1,'northwest':2,'northeast':3})
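
One thing to keep in mind with `map`: any value that is missing from the dictionary silently becomes `NaN`. The sketch below shows a quick check for that on a made-up frame (the `sample` data and its misspelled `'Male'` entry are hypothetical, not taken from the insurance dataset).

```python
import pandas as pd

# Hypothetical mini-frame standing in for raw_data; 'Male' is deliberately inconsistent
sample = pd.DataFrame({"sex": ["female", "male", "Male"],
                       "smoker": ["yes", "no", "no"]})

sex_codes = {"female": 0, "male": 1}
encoded_sex = sample["sex"].map(sex_codes)

# Values that are not keys of the dictionary are mapped to NaN
unmapped = sample.loc[encoded_sex.isna(), "sex"].unique().tolist()
print(unmapped)
```

An empty list here means every category was covered by the mapping.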

In [None]:
raw_data['region'].value_counts()

In [None]:
df.head()

In [None]:
sns.heatmap(df.corr(),cmap='Wistia',annot=True)

Only smoker is highly correlated with charges; the other features show low or no correlation.

In [None]:
# Check Point 2
df2 = raw_data.copy()
df2.head()

In [None]:
## Label Encoding

from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
df2['region'] = lb.fit_transform(df2['region'])
df2['sex'] = lb.fit_transform(df2['sex'])
df2['smoker'] = lb.fit_transform(df2['smoker'])
df2.head()

In [None]:
# Check point encoding
df3 = raw_data.copy()
df3.head()

In [None]:
## Using OneHotEncoding

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first')
x = pd.DataFrame(ohe.fit_transform(df3[['sex','children','smoker','region']]).toarray())
x.columns = ['Male','children_1','children_2','children_3','children_4','children_5','smoker','northwest','southeast','southwest']
df4 = pd.concat([df3.drop(['sex','children','smoker','region'],axis=1),x],axis=1)
df4.head()
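
As an alternative sketch, `pd.get_dummies` can produce the same kind of encoding while deriving the column names from the data instead of hard-coding them. The frame below is made up for illustration (its values are not from the dataset), and the generated names such as `smoker_yes` differ from the manual names used above.

```python
import pandas as pd

# Made-up rows with the same column layout as the insurance data
sample = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "children": [0, 1, 2, 0],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "charges": [1000.0, 2000.0, 3000.0, 4000.0],
})

categorical = ["sex", "children", "smoker", "region"]

# drop_first=True removes the first level of each feature, like drop='first' above
encoded = pd.get_dummies(sample, columns=categorical, drop_first=True, dtype=float)
print(encoded.columns.tolist())
```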

In [None]:
plt.figure(figsize=(15,6))
sns.heatmap(df4.corr(),annot=True)

In [None]:
## Creating the check Point
new_data = df4

In [None]:
sns.scatterplot(x=raw_data['bmi'],y=raw_data['charges'],hue=raw_data['smoker'])

Smokers seem to be billed higher charges for medical treatment.

In [None]:
sns.lmplot(data=df4,x='bmi',y='charges',aspect=2,height=6,hue='smoker')

In [None]:
# Looking at the continuous variables

fig,axes = plt.subplots(nrows=1,ncols=2,figsize=(15,5))
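# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(df4['charges'],kde=True,ax=axes[0],color='purple')
sns.histplot(df4['bmi'],kde=True,ax=axes[1],color='orange')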

Average charges lie between 10000 and 20000 and the distribution is right-skewed, while BMI is approximately normally distributed.

In [None]:
# Log-transform charges to reduce the right skew
sns.histplot(np.log(df4['charges']),kde=True,color='purple')
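
The effect of the log transform can also be checked numerically with the sample skewness. The sketch below uses synthetic right-skewed values as a stand-in for `df4['charges']` (an assumption, since the real column is not reproduced here); a skewness near zero after the transform indicates a more symmetric distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for df4['charges']
charges = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=2000))

skew_raw = charges.skew()          # clearly positive for right-skewed data
skew_log = np.log(charges).skew()  # close to zero after the log transform

print(f"raw skew: {skew_raw:.2f}, log skew: {skew_log:.2f}")
```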

In [None]:
df4.columns

In [None]:
# Check point before applying linear regression

new_data = df4.copy()
new_data = new_data.reindex(['age', 'bmi', 'Male', 'children_1', 'children_2',
       'children_3', 'children_4', 'children_5', 'smoker', 'northwest',
       'southeast', 'southwest','charges'],axis=1)
new_data.head()

In [None]:
import statsmodels.api as sm
x1 = new_data.iloc[:,:-1] # independent variable
y = new_data.iloc[:,-1] #dependent variable

In [None]:
x = sm.add_constant(x1)
result = sm.OLS(y,x).fit()
result.summary()

# Adjusted R^2

Our model explains about 75% of the variability in the data.
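
For reference, adjusted R^2 penalises R^2 for the number of predictors: 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below evaluates it for hypothetical inputs; the values of `n` and `r2` are illustrative assumptions, not figures read from `result.summary()`.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical inputs: R^2 of 0.75, 1000 observations, 12 predictors
value = adjusted_r2(0.75, 1000, 12)
print(round(value, 4))
```

With many observations and few predictors the adjustment is small, so R^2 and adjusted R^2 stay close to each other.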

Looking at the p-values:
* The p-value is the smallest level of significance at which we can still reject the null hypothesis.
* The p-value measures the evidence against the null hypothesis. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.


For a significance level of 0.05:
* p-value < 0.05 --> reject the null hypothesis
* p-value > 0.05 --> cannot reject the null hypothesis

After performing backward elimination on the multiple linear regression with a significance level of 0.05,
I obtained the following predictive variables:
* Age
* BMI
* Smoker
___________________________________________________________________
___________________________________________________________________

1. **Children** features do not seem to be good predictive variables: some of the dummy levels are significant while others are not.
2. **Region** features also seem to have low predictive power.
3. **Sex** does not seem to be a good predictor, as both genders appear in almost equal numbers in the dataset.