# Assessing the Effect of Smoking On Individuals' Health Insurance Premiums
This notebook examines how smoking affects individuals' health insurance premiums in the United States of America.

## Content
Columns

age: age of primary beneficiary

sex: Gender of the insurance contractor (female, male)

bmi: Body mass index, an objective measure of body weight relative to height, computed as weight divided by height squared (kg / m^2); it indicates whether weight is high or low for a given height, ideally 18.5 to 24.9

children: Number of children covered by health insurance (number of dependents)

smoker: Whether the beneficiary smokes (yes, no)

region: The beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

charges: Individual medical costs billed by health insurance
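
The bmi definition above can be made concrete with a short worked example. This is only a sketch: the function name `bmi` and the 70 kg / 1.75 m person are hypothetical and do not come from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
example = bmi(70.0, 1.75)
in_ideal_range = 18.5 <= example <= 24.9
print(round(example, 1), in_ideal_range)
```

Note that weight is the numerator and squared height the denominator, so a taller person of the same weight has a lower index.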

## A: Data acquisition and dataset preparation for analysis

In [None]:
import pandas as pd
import numpy as np
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import matplotlib.pyplot as plt
from scipy import stats, integrate
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection  import train_test_split

# allow plots to appear directly in the notebook
%matplotlib inline
import seaborn as sns
sns.set(color_codes=True)

In [None]:
insurance_df=pd.read_csv('../input/insurance/insurance.csv')
insurance_df.info()

In [None]:
insurance_df.head()

Fortunately, the dataset has no missing values. It contains three data types (integer, float and object) and needs only two modifications. First, the sex column is renamed to gender and the charges column to prices. Second, the gender and smoker columns are encoded as follows:
 * Gender column: male=0, female=1
 * Smoker column: no=0, yes=1

In [None]:
insurance_df.rename(columns={'sex':'gender'},inplace=True)
insurance_df.rename(columns={'charges':'prices'},inplace=True)
insurance_df.head()

In [None]:
# Work on an independent copy so later assignments do not touch insurance_df
df1 = insurance_df.copy()
print(df1['gender'].unique(), df1['smoker'].unique(), df1['region'].unique())

In [None]:
# Encode gender as male=0, female=1 and smoker as no=0, yes=1
df1['gender'] = df1['gender'].map({'male': 0, 'female': 1})
df1['smoker'] = df1['smoker'].map({'no': 0, 'yes': 1})


In [None]:
df1.head()

# B: Statistical information and variable relationships

In [None]:
df1.describe()

Beneficiaries in the dataset range in age from 18 to 64, which suits this analysis because the majority of smokers fall within this range. The minimum, maximum and quartiles suggest that gender is evenly distributed. Non-smokers outnumber smokers about 4 to 1.
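
The balance statements can be read from the summary table because the mean of a 0/1 column equals the share of ones. The sketch below illustrates this on a hypothetical five-row frame named `toy`; it does not use the insurance data.

```python
import pandas as pd

# Hypothetical toy data using the notebook's encoding (smoker: no=0, yes=1)
toy = pd.DataFrame({"gender": [0, 1, 0, 1, 1], "smoker": [0, 0, 0, 0, 1]})

# Share of each smoker value; a 4-to-1 split shows up as 0.8 versus 0.2
smoker_share = toy["smoker"].value_counts(normalize=True).sort_index()
print(smoker_share)

# The column mean reported by describe() is the same number as the share of ones
smoker_mean = toy["smoker"].mean()
print(smoker_mean)
```

On the notebook's data the same check would be `df1['smoker'].value_counts(normalize=True)`.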

In [None]:
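# Correlate numeric columns only; region is still a string column at this point
sns.heatmap(df1.select_dtypes('number').corr(), cmap='Wistia', annot=True)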

Only the smoker column is strongly correlated with prices; the other variables show little or no correlation. To analyze this further, the age column is first grouped into Young Adult, Senior Adult and Elder, and the bmi column into obese and non-obese.

In [None]:
f, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(x="region", y="prices", data=df1, dodge=False);

In [None]:
f, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(x="gender", y="prices", data=df1, dodge=False);

In [None]:
f, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(x="smoker", y="prices", data=df1, dodge=False);

In [None]:
# Flag obesity at a BMI threshold of 30
df1['bmi30'] = np.where(df1['bmi'] >= 30, 'obese', 'non_obese')

In [None]:
f, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(x="bmi30", y="prices", data=df1, dodge=False);

In [None]:
df1_gb4=df1.groupby(['bmi30'])['prices'].mean()
df1_gb4

In [None]:
# Bin age into three groups: 18 to 35, 36 to 55 and over 55
df1['age_cat'] = np.select(
    [df1['age'] <= 35, df1['age'] <= 55],
    ['Young Adult', 'Senior Adult'],
    default='Elder',
)

In [None]:
f, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(x="age_cat", y="prices", data=df1, dodge=False);

In [None]:
df1_gb1=df1.groupby(['smoker','gender'])['prices'].mean()
df1_gb1

In [None]:
sns.lmplot(x="smoker", y="prices", hue="gender", data=df1);

In [None]:
df1_gb2=df1.groupby(['smoker','age_cat'])['prices'].mean()
df1_gb2

In [None]:
sns.lmplot(x="smoker", y="prices", hue="age_cat", data=df1);

In [None]:
df1_gb3=df1.groupby(['smoker','bmi30'])['prices'].mean()
df1_gb3

In [None]:
sns.lmplot(x="smoker", y="prices", hue="bmi30", data=df1)

In [None]:
sns.lmplot(x="smoker", y="prices", hue="region", data=df1)

RESULTS:
* Prices are higher for older age groups and do not seem to be affected by gender.
* Although obese and non-obese people have the same median price, their mean prices differ by almost 5,000 US dollars.
* Region of residence has little impact on prices.
* Smokers who are also obese pay higher prices than smokers who are not obese.
* Among smokers, age and gender have almost the same effect on prices.
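
Two groups can share a median and still differ in their means when one of them has a heavier right tail. The sketch below shows this with hypothetical prices; the frame `toy` and its numbers are invented for illustration and are not the insurance data.

```python
import pandas as pd

# Hypothetical prices: both groups have the same median, one has a heavy right tail
toy = pd.DataFrame({
    "bmi30": ["non_obese"] * 5 + ["obese"] * 5,
    "prices": [2000, 5000, 9000, 12000, 15000,
               2000, 5000, 9000, 30000, 45000],
})

# Median and mean side by side for each group
summary = toy.groupby("bmi30")["prices"].agg(["median", "mean"])
print(summary)
```

On the notebook's data, `df1.groupby('bmi30')['prices'].agg(['median', 'mean'])` would put both statistics in one table.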

# Model Building


In [None]:
# Encode region as an integer code
region_codes = {'southwest': 1, 'southeast': 2, 'northwest': 3, 'northeast': 4}
df1['region'] = df1['region'].map(region_codes).astype(int)

In [None]:
# seaborn renamed the size argument to height in version 0.9
sns.pairplot(df1, x_vars=['smoker','bmi','age','region'], y_vars='prices', height=7, aspect=0.7, kind='reg')

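Based on the previous sections, a candidate model for predicting prices is a multiple linear regression with pairwise interaction terms, where x_1, x_2 and x_3 stand for smoker, bmi and age:
 * Y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_4 x_1 x_2 + β_5 x_1 x_3 + β_6 x_2 x_3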
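
The cells that follow fit main effects only, so here is a sketch of how the seven coefficients of the interaction model could be estimated by least squares. Everything in it is an assumption made for illustration: the data are synthetic, `true_beta` is chosen arbitrarily, and plain NumPy is used in place of the statsmodels calls used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical predictors on a 0-1 scale: x1 = smoker, x2 = bmi, x3 = age
x1 = rng.integers(0, 2, n).astype(float)
x2 = rng.random(n)
x3 = rng.random(n)

# Hypothetical true coefficients beta_0 .. beta_6, chosen only for this demo
true_beta = np.array([0.1, 0.3, 0.05, 0.2, 0.4, 0.0, 0.0])


def design_matrix(x1, x2, x3):
    """Columns: 1, x1, x2, x3, x1*x2, x1*x3, x2*x3 (same order as the formula)."""
    return np.column_stack([np.ones_like(x1), x1, x2, x3, x1 * x2, x1 * x3, x2 * x3])


X = design_matrix(x1, x2, x3)
y = X @ true_beta + rng.normal(0, 0.01, n)  # small synthetic noise

# Ordinary least squares estimate of all seven coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

With the formula interface already imported in this notebook, the same design should be expressible as `prices ~ (smoker + bmi + age)**2`, which expands to the main effects plus all pairwise interactions.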


### Data normalization 
Min-max scaling maps variables with different ranges onto the common range [0, 1] using x' = (x - min) / (max - min).

In [None]:
# Apply min-max scaling to every model column
for col in ['smoker', 'gender', 'age', 'bmi', 'region', 'prices']:
    df1[col] = (df1[col] - df1[col].min()) / (df1[col].max() - df1[col].min())
df1.head()

### Hypothesis Testing and p-values

In [None]:
lm1 = smf.ols(formula='prices ~ smoker', data=df1).fit()
lm1.params

In [None]:
lm1.pvalues

In [None]:
# create X and y
feature_cols = ['smoker']
X = df1[feature_cols]
y = df1.prices

# instantiate and fit
lm2 = LinearRegression()
lm2.fit(X, y)

# print the coefficients
print(lm2.intercept_)
print(lm2.coef_)


***
The p-value for smoker is far below 0.05, so we reject the null hypothesis of no association: smoking status is significantly related to prices.
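
For a single 0/1 predictor such as smoker, the fitted coefficients have a direct reading: the intercept is the mean price of the non-smoker group and the slope is the difference between the two group means. The sketch below checks this on hypothetical numbers, which are not taken from the dataset.

```python
import numpy as np

# Hypothetical normalized prices for non-smokers (0) and smokers (1)
smoker = np.array([0, 0, 0, 0, 1, 1], dtype=float)
prices = np.array([0.10, 0.12, 0.08, 0.14, 0.50, 0.46])

# Least-squares line: prices = intercept + slope * smoker
slope, intercept = np.polyfit(smoker, prices, deg=1)

mean_non_smoker = prices[smoker == 0].mean()
mean_smoker = prices[smoker == 1].mean()

print(intercept, mean_non_smoker)           # intercept matches the non-smoker mean
print(slope, mean_smoker - mean_non_smoker)  # slope matches the gap between means
```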

### STATSMODELS
### Feature Selection 

In [None]:
lm1 = smf.ols(formula='prices ~ smoker + bmi + age', data=df1).fit()
lm1.rsquared

In [None]:
lm1 = smf.ols(formula='prices ~ smoker + bmi + age + gender', data=df1).fit()
lm1.rsquared

In [None]:
lm1 = smf.ols(formula='prices ~ smoker + bmi + age + gender+ region', data=df1).fit()
lm1.rsquared

In [None]:
lm1.summary()

Smoking, bmi and age have significant p-values; the p-value for gender is not significant, and the p-value for region is acceptable.

### Model Evaluation Using Train/Test Split
A train/test split with RMSE is used to decide whether gender and region should be kept in the model. First we fit smoker, bmi and age. Adding gender leaves the error and R-squared values essentially unchanged. Finally region is added, which changes them only slightly, by an amount that can be ignored.

In [None]:
lm=LinearRegression()
x=df1[['smoker','bmi','age']]
lm.fit(x,df1['prices'])
print(lm.intercept_)
print(lm.coef_)


In [None]:
X = df1[['smoker','bmi','age']]
y = df1.prices
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
lm2.fit(X_train, y_train)
y_pred = lm2.predict(X_test)
print (np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


In [None]:
X = df1[['smoker','bmi','age','gender']]
y = df1.prices
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
lm2.fit(X_train, y_train)
y_pred = lm2.predict(X_test)
print (np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


In [None]:
X = df1[['smoker','bmi','age','region']]
y = df1.prices
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
lm2.fit(X_train, y_train)
y_pred = lm2.predict(X_test)
print (np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
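
The three cells above repeat the same split, fit and score steps for different feature sets. A sketch of how that comparison could be wrapped in one helper follows. It is deliberately written with plain NumPy on synthetic data so that it runs on its own; the helper name `holdout_rmse`, the data and the coefficients are all hypothetical, and in the notebook the same loop could wrap `train_test_split` and `LinearRegression` instead.

```python
import numpy as np


def holdout_rmse(X, y, test_fraction=0.25, seed=1):
    """Fit ordinary least squares on a random training split; return the test RMSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(round(len(y) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # Prepend an intercept column to the features
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A[train_idx], y[train_idx], rcond=None)
    residuals = y[test_idx] - A[test_idx] @ beta
    return float(np.sqrt(np.mean(residuals ** 2)))


# Hypothetical data: the target depends on the first three columns, not the fourth
rng = np.random.default_rng(0)
features = rng.random((400, 4))
target = features[:, :3] @ np.array([0.4, 0.2, 0.2]) + rng.normal(0, 0.05, 400)

feature_sets = {"three columns": [0, 1, 2], "plus irrelevant column": [0, 1, 2, 3]}
rmse = {name: holdout_rmse(features[:, cols], target) for name, cols in feature_sets.items()}
for name, value in rmse.items():
    print(f"{name}: {value:.4f}")
```

Using the same seed for every feature set keeps the split identical across models, which is what `random_state=1` does in the cells above, so any change in RMSE comes from the features rather than from the split.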


Gender and region have no appreciable effect on predicted prices.
A multiple regression model with smoker, bmi and age as input variables therefore appears to be the best choice.
The fitted three-variable model, where x_1, x_2 and x_3 are the min-max scaled smoker, bmi and age and y is the scaled price, is:
* y = -0.04753361030859585 + 0.38027509 x_1 + 0.19141071 x_2 + 0.19057399 x_3
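
Because prices were min-max scaled before fitting, the model's output lies on a 0 to 1 scale rather than in dollars. The sketch below shows how a prediction could be mapped back to dollars. The coefficients are the ones reported above, but `price_min` and `price_max` are hypothetical placeholders; the actual minimum and maximum of the raw prices column would have to be saved before scaling.

```python
def denormalize(value_norm, col_min, col_max):
    """Invert min-max scaling: map a value on the 0-1 scale back to original units."""
    return value_norm * (col_max - col_min) + col_min


# Hypothetical price range in dollars; replace with the raw column's min and max
price_min, price_max = 1000.0, 61000.0

# Coefficients reported above (inputs are min-max scaled smoker, bmi and age)
intercept = -0.04753361030859585
b_smoker, b_bmi, b_age = 0.38027509, 0.19141071, 0.19057399


def predict_norm(smoker, bmi_norm, age_norm):
    """Scaled price predicted by the three-variable model."""
    return intercept + b_smoker * smoker + b_bmi * bmi_norm + b_age * age_norm


# A smoker at the midpoint of the scaled bmi and age ranges
pred_norm = predict_norm(smoker=1.0, bmi_norm=0.5, age_norm=0.5)
pred_dollars = denormalize(pred_norm, price_min, price_max)

# Dollar effect of smoking: the scaled coefficient times the price range
smoker_effect_dollars = b_smoker * (price_max - price_min)
print(round(pred_norm, 4), round(pred_dollars, 2), round(smoker_effect_dollars, 2))
```

The RMSE values printed earlier are on the same scaled axis, so multiplying them by the price range converts them to dollars in the same way.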