Objective:
---------
The objective of this analysis is to determine whether smokers have a statistically higher mean individual medical cost billed by health insurance than non-smokers. A second question is whether a person's BMI is correlated with those costs.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.linear_model import LinearRegression  #Import Linear regression model
from sklearn.model_selection import train_test_split  #To split the dataset into Train and test randomly
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error, r2_score
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
df=pd.read_csv("../input/insurance.csv")

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.describe().T

In [None]:
df.isna().sum()

In [None]:
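# Only kurtosis and skew are used below; `stats` is not importable from scipy.stats in recent SciPy releases
from scipy.stats import kurtosis, skew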

Here I compute the skewness and kurtosis of the expenses column.


In [None]:
print("Summary Statistics of Medical Costs")
print(df['expenses'].describe())
print("skew:  {}".format(skew(df['expenses'])))
print("kurtosis:  {}".format(kurtosis(df['expenses'])))
print("missing expenses values: {}".format(df['expenses'].isnull().sum()))
print("missing smoker values: {}".format(df['smoker'].isnull().sum()))

Skewness
-------

The skewness is positive, which means the tail on the right side of the distribution is longer or fatter than the tail on the left. For a unimodal distribution this typically puts the mean and median above the mode.

* Negative skewness is the reverse: the tail on the left side is longer or fatter than the tail on the right, and the mean and median are typically below the mode.
* If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
* If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the data are moderately skewed.
* If the skewness is less than -1 (negatively skewed) or greater than 1 (positively skewed), the data are highly skewed.

The skew value here is 1.51, so the expenses data are highly skewed.
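
The rule-of-thumb thresholds above can be sketched as a small helper. This is only an illustration: `describe_skew` is a hypothetical name, and the sample is synthetic exponential data, not the insurance data.

```python
import numpy as np
from scipy.stats import skew

def describe_skew(values):
    """Classify sample skewness with the rule-of-thumb thresholds listed above."""
    s = skew(values)
    if abs(s) <= 0.5:
        label = "fairly symmetrical"
    elif abs(s) <= 1:
        label = "moderately skewed"
    else:
        label = "highly skewed"
    return s, label

# Synthetic right-skewed sample (NOT the insurance data), for illustration only.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=10000, size=5000)

s, label = describe_skew(sample)
print("skew: {:.2f} -> {}".format(s, label))
```

On the notebook's data the call would be `describe_skew(df['expenses'])`.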

Kurtosis
---------
Kurtosis describes the tails of a distribution, not its peakedness or flatness. It measures how much of the distribution's weight sits in extreme values, so it serves as an indicator of outliers.

Mesokurtic (kurtosis = 3): the tails are similar to those of a normal distribution.

Leptokurtic (kurtosis > 3): the tails are fatter than those of a normal distribution, which means the data are heavy-tailed and outliers are plentiful.

Platykurtic (kurtosis < 3): the tails are thinner than those of a normal distribution, which means the data are light-tailed and outliers are scarce.

These thresholds apply to Pearson's kurtosis. scipy.stats.kurtosis returns excess kurtosis by default (fisher=True), which subtracts 3, so the value printed above must be compared with 0 rather than 3: a positive value means leptokurtic and a negative value means platykurtic. Reading the printed value against 3 makes the data look platykurtic, but that comparison is only valid with fisher=False, and the outliers visible in the boxplot below are inconsistent with light tails.
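
The kurtosis reading depends on which definition the library reports. The sketch below uses a synthetic normal sample (not the insurance data) to show that `scipy.stats.kurtosis` returns excess kurtosis by default and Pearson's kurtosis only when `fisher=False` is passed.

```python
import numpy as np
from scipy.stats import kurtosis

# Synthetic normal sample (NOT the insurance data), for illustration only.
rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)

# Default (fisher=True): excess kurtosis, roughly 0 for normal data.
excess = kurtosis(normal_sample)
# fisher=False: Pearson's kurtosis, roughly 3 for normal data.
pearson = kurtosis(normal_sample, fisher=False)

print("excess (Fisher) kurtosis:", round(excess, 3))
print("Pearson kurtosis:", round(pearson, 3))
```

So a value printed by `kurtosis(df['expenses'])` should be compared with 0, while `kurtosis(df['expenses'], fisher=False)` should be compared with 3.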


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
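# Density estimate on the left, boxplot on the right
f, axes = plt.subplots(1, 2)
sns.kdeplot(x=df['expenses'], ax=axes[0])
# Pass the series as x= explicitly; newer seaborn treats the first positional argument as `data`
sns.boxplot(x=df['expenses'], ax=axes[1])
plt.show()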

Both the boxplot and the kernel density estimate show that the expenses data are right-skewed. There are some outliers, but no missing values in the expenses or smoker columns.

Objective Part 1: Do smokers have statistically higher mean individual medical costs billed by health insurance than do non-smokers?

In [None]:
#prepare our 2 groups to test
#smoker = df[df['smoker']==1]
#non_smoker = df[df['smoker']==0]
# The smoker column has not been label-encoded yet, so the axis ticks show its raw categories
ax = sns.swarmplot(x='smoker',y='expenses',data=df)
ax.set_title("Smoker vs Expenses")
plt.xlabel("Smoker")
plt.ylabel("Expenses")
# plt.show() does not take an Axes argument
plt.show()

In [None]:
#plt.title('Distribution of Medical Costs for Smokers Vs Non-Smokers')
#ax = sns.kdeplot(smoker['expenses'], bw=10000, label='smoker')
#ax = sns.kdeplot(non_smoker['expenses'], bw=10000, label='non-smoker')
#plt.show()
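
The plots compare the two groups visually, but Objective Part 1 asks for a statistical comparison of means. One common option, which is an assumption here and not something the notebook itself runs, is a one-sided Welch t-test. The sketch below uses synthetic groups with arbitrary means and spreads, not the insurance data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups (NOT the insurance data);
# the means, spreads and sizes are arbitrary illustration values.
rng = np.random.default_rng(42)
smoker_costs = rng.normal(loc=30000, scale=10000, size=250)
non_smoker_costs = rng.normal(loc=10000, scale=6000, size=1000)

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_two_sided = stats.ttest_ind(smoker_costs, non_smoker_costs, equal_var=False)

# One-sided p-value for H1: mean(smoker) > mean(non-smoker).
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print("t statistic:", round(t_stat, 2))
print("one-sided p-value:", p_one_sided)
```

With the notebook's data, the two arrays would be the expenses of each group, for example `df.loc[df['smoker'] == 'yes', 'expenses']` before label encoding, assuming the column holds 'yes'/'no' strings. Because expenses are strongly right-skewed, a rank-based test such as `scipy.stats.mannwhitneyu` may be a useful cross-check.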

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
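# Encode the smoker column by name instead of by position, so the step does not depend on column order
df['smoker'] = labelencoder.fit_transform(df['smoker'])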

In [None]:
df.head()
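
To see which integer the encoder assigns to each category, the fitted `classes_` attribute can be inspected. A minimal sketch on a toy frame, assuming the smoker column holds the strings 'yes' and 'no':

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (NOT the insurance data); the 'yes'/'no' values are an assumption.
toy = pd.DataFrame({"smoker": ["yes", "no", "no", "yes"]})

encoder = LabelEncoder()
toy["smoker_code"] = encoder.fit_transform(toy["smoker"])

# classes_ is sorted, and each class's position is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

LabelEncoder sorts the classes before numbering them, so 'no' maps to 0 and 'yes' maps to 1 in this toy example.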

In [None]:
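# Restrict to numeric columns; recent pandas raises an error if text columns are passed to corr()
df.select_dtypes(include='number').corr()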
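
The correlation matrix gives the coefficient for BMI and expenses but no significance level. For the second objective, `scipy.stats.pearsonr` returns both. The sketch below runs on synthetic data with an arbitrary linear relationship, not the insurance data.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data (NOT the insurance data): the slope and noise level are arbitrary.
rng = np.random.default_rng(1)
bmi = rng.normal(loc=30, scale=6, size=500)
expenses = 800 * bmi + rng.normal(loc=0, scale=10000, size=500)

# r is the Pearson correlation coefficient; p_value tests H0: no linear correlation.
r, p_value = pearsonr(bmi, expenses)

print("Pearson r:", round(r, 3))
print("p-value:", p_value)
```

On the notebook's data the call would be `pearsonr(df['bmi'], df['expenses'])`.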

In [None]:
x = df[['age','bmi','smoker']]
y = df['expenses']
# Split the dataset at random into a training set (70%) and a test set (30%);
# random_state fixes the split so the results are reproducible
X_train,X_test,Y_train, Y_test = train_test_split(x,y,test_size=0.3,random_state=42)
# Create a linear regression model and fit it on the training set
model = LinearRegression()
model.fit(X_train, Y_train)

In [None]:
print("Intercept value:", model.intercept_)
print("Coefficient values:", model.coef_)

In [None]:
coef_df = pd.DataFrame(list(zip(X_train.columns,model.coef_)), columns = ['Features','Predicted Coeff'])
coef_df


In [None]:
Y_train_predict = model.predict(X_train)
Y_test_predict = model.predict(X_test)

In [None]:
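# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
ax = sns.scatterplot(x=Y_train, y=Y_train_predict)
ax.set_title("Actual Expenses vs Predicted Expenses")
plt.xlabel("Actual Expenses")
plt.ylabel("Predicted Expenses")
# plt.show() does not take an Axes argument
plt.show()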

In [None]:
print("MAE")
print("train : ",mean_absolute_error(Y_train,Y_train_predict))
print("test : ",mean_absolute_error(Y_test,Y_test_predict))

In [None]:
print("MSE")
print("train : ",mean_squared_error(Y_train,Y_train_predict))
print("test : ",mean_squared_error(Y_test,Y_test_predict))

In [None]:
print("Rsquare")
print("train : ",r2_score(Y_train,Y_train_predict))
print("test : ",r2_score(Y_test,Y_test_predict))
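
As a cross-check on what the three metrics measure, they can be computed by hand with NumPy and compared against scikit-learn. The arrays below are toy values chosen for easy arithmetic, not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values (NOT model output), chosen so the arithmetic is easy to follow.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the units of y
# R-squared: 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R2:", r2)

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

RMSE is often easier to interpret than MSE because it is expressed in the same units as the target, here currency.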

In [None]:
smoker_model = LinearRegression()
smoker_model.fit(X_train[['smoker']], Y_train)
print("intercept:",smoker_model.intercept_, "coeff:", smoker_model.coef_)

#print("Train - Mean squared error:", np.mean((Y_train - model.predict(X_train)) ** 2))
smoker_df = pd.DataFrame(list(zip(Y_train, smoker_model.predict(X_train[['smoker']]))), columns = ['Actual Expenses','Predicted Expenses'])
smoker_df.head()
#X_train['smoker'].shape

In [None]:
# The square root of the MSE is the RMSE, so label the output accordingly
print("RMSE:",np.sqrt(mean_squared_error(Y_train, Y_train_predict)))
print("RMSE only for Smoker:", np.sqrt(mean_squared_error(Y_train,smoker_model.predict(X_train[['smoker']]))))

In [None]:
#R-Squared value for Train data set
print("R-squared value:",round(r2_score(Y_train, Y_train_predict),3))
# The 3 belongs inside round(); previously the score was rounded to an integer and 3 was printed separately
print("R-squared value only for smoker:", round(r2_score(Y_train,smoker_model.predict(X_train[['smoker']])),3))

In [None]:
#Mean absolute error for Train data set
print("Mean absolute error:",mean_absolute_error(Y_train, Y_train_predict))
print("Mean absolute Error only for Smoker:", mean_absolute_error(Y_train,smoker_model.predict(X_train[['smoker']])))