## Introduction

Goal: Explore the dataset to find relationships between its features, then predict medical costs with multiple linear regression.

Props to Miri Choi, who uploaded this dataset.

The data originated from *Machine Learning with R* by Brett Lantz, a book that provides an introduction to machine learning using R.

**Columns:**
* age: age of the primary beneficiary
* sex: gender of the insurance contractor (female, male)
* bmi: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m ^ 2); it indicates weights that are relatively high or low for a given height, ideally 18.5 to 24.9
* children: number of children covered by health insurance / number of dependents
* smoker: whether the beneficiary smokes (yes/no)
* region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
* charges: individual medical costs billed by health insurance
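As a quick illustration of the bmi column, here is a minimal sketch of the kg / m ^ 2 calculation. The function name `body_mass_index` and the example person are made up for illustration and are not part of the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI in kg/m^2: weight divided by height squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = body_mass_index(70, 1.75)
print(round(example_bmi, 1))  # 22.9
```

A value of about 22.9 falls inside the 18.5 to 24.9 range mentioned above.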

In [None]:
#essentials
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.tools as tls
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
import plotly.express as px
init_notebook_mode(connected=True)

#machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

#show input file directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#read data
insurance_data = pd.read_csv('/kaggle/input/insurance/insurance.csv')

In [None]:
#basic info
insurance_data.info()

The non-null counts match the number of entries, so there are no missing values.
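The same check can be made explicit by counting nulls per column. The sketch below uses a small made-up frame named `sample` as a stand-in; in the notebook the call would be applied to `insurance_data` instead.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook this would be insurance_data
sample = pd.DataFrame({
    'age': [19, 18, 28],
    'bmi': [27.9, 33.8, 33.0],
    'smoker': ['yes', 'no', 'no'],
})

# Count missing values in each column, then across the whole frame
missing_per_column = sample.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print('Total missing:', total_missing)
```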

In [None]:
#basic stats
insurance_data.describe()

In [None]:
#check head of data
insurance_data.head()

## Exploratory Data Analysis

In [None]:
#check distribution of age
insurance_data['age'].hist(bins=20)

In [None]:
#explore relationship across dataset
sns.pairplot(insurance_data)

In [None]:
#compare charges between male and female
sns.stripplot(x='sex',y='charges',data=insurance_data)

In [None]:
#compare charges between smokers and non smokers
sns.stripplot(x='smoker',y='charges',data=insurance_data)

In [None]:
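#compare distribution of charges between smokers and non smokers
#(sns.distplot is deprecated in recent seaborn releases; histplot is its replacement)
g = sns.FacetGrid(data=insurance_data,col='smoker')
g.map(sns.histplot,'charges',bins=30)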
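A numeric summary can back up what the plots suggest. This is only a sketch: the frame `sample` and its values are invented for illustration, and in the notebook the same `groupby` would run on `insurance_data`.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the dataset)
sample = pd.DataFrame({
    'smoker': ['yes', 'yes', 'no', 'no', 'no'],
    'charges': [30000.0, 40000.0, 5000.0, 7000.0, 9000.0],
})

# Mean, median and group size of charges for smokers vs. non smokers
summary = sample.groupby('smoker')['charges'].agg(['mean', 'median', 'count'])
print(summary)
```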

In [None]:
#plot bmi vs. charges in relationship with smoker y/n
px.scatter(insurance_data,x='bmi',y='charges',color='smoker',color_discrete_sequence=['red','blue'])

In [None]:
#fit a separate regression line of bmi vs. charges for smokers and non smokers
sns.lmplot(x='bmi',y='charges',data= insurance_data, col = 'smoker')

In [None]:
#plot number of children vs. charges
sns.barplot(x='children',y='charges',data = insurance_data)

In [None]:
#plot region vs. charges
sns.barplot(x='region',y='charges',data = insurance_data)

In [None]:
#plot bmi vs. region
sns.stripplot(x='region',y='bmi',data=insurance_data)

## Machine Learning

**Linear Regression**

In [None]:
insurance_data.info()

In [None]:
#encode binary categorical features as integers (male/yes = 1, female/no = 0)

insurance_data['sex'] = insurance_data['sex'].map({'male': 1, 'female': 0})

insurance_data['smoker'] = insurance_data['smoker'].map({'yes': 1, 'no': 0})



In [None]:
#create dummy variables for region (one 0/1 indicator column per region)

for region in ['southwest', 'southeast', 'northwest', 'northeast']:
    insurance_data[region] = (insurance_data['region'] == region).astype(int)
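One thing to keep in mind: the four region indicators always sum to 1, so together with the model's intercept they are perfectly collinear (the "dummy variable trap"). Predictions are not affected, but the individual region coefficients are then not uniquely determined. A common alternative, sketched here on a made-up frame named `sample`, is `pd.get_dummies` with `drop_first=True`, which keeps one region as the baseline. This is a suggestion, not what the notebook does.

```python
import pandas as pd

# Hypothetical stand-in rows covering all four regions
sample = pd.DataFrame({
    'age': [19, 18, 28, 33],
    'region': ['southwest', 'southeast', 'northwest', 'northeast'],
})

# drop_first=True drops the first category in sorted order ('northeast'),
# which then acts as the baseline the other regions are compared against
encoded = pd.get_dummies(sample, columns=['region'], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With this encoding each remaining region coefficient reads as the difference in charges relative to the northeast baseline.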

In [None]:
insurance_data

In [None]:
#assign features and labels
X = insurance_data[['age','bmi','sex','smoker','children','southwest','southeast','northwest','northeast']]
y = insurance_data['charges']

In [None]:
#split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
#assign linear model object
lm = LinearRegression()

In [None]:
#fit linear model
lm.fit(X_train,y_train)

In [None]:
# The coefficients
print('Coefficients: \n', lm.coef_)
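To see how the coefficients turn into predictions, the sketch below fits a model on made-up, noise-free data and rebuilds the predictions by hand. The names `X_demo`, `y_demo` and `demo_lm` are hypothetical and separate from the notebook's `lm`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following y = 5 + 2*x1 + 3*x2 exactly (no noise)
X_demo = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y_demo = 5 + 2 * X_demo[:, 0] + 3 * X_demo[:, 1]

demo_lm = LinearRegression().fit(X_demo, y_demo)

# A prediction is the intercept plus the coefficient-weighted sum of features
manual_predictions = demo_lm.intercept_ + X_demo @ demo_lm.coef_
print(demo_lm.intercept_, demo_lm.coef_)
```

Each coefficient is the change in the predicted value for a one-unit increase in that feature, holding the others fixed.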

In [None]:
#call predictions
predictions = lm.predict(X_test)

In [None]:
#regression plot of the real test values versus the predicted values

plt.figure(figsize=(16,8))
sns.regplot(x=y_test, y=predictions)
plt.xlabel('Actual')
plt.ylabel('Predictions')
plt.title("Linear Model Predictions")
plt.grid(False)
plt.show()

In [None]:
#plot distribution of residuals (histplot replaces the deprecated distplot)
sns.histplot(y_test - predictions, bins=50, kde=True);

In [None]:
#show each coefficient next to its feature name
coefficients = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
coefficients

In [None]:
#calculate metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
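For reference, here is what the three metrics compute, written out in plain Python on a tiny made-up example. The lists `actual` and `predicted` are invented for illustration.

```python
import math

actual = [3.0, 5.0, 7.0]      # made-up true values
predicted = [2.0, 5.0, 9.0]   # made-up predictions

errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error
print(mae, mse, rmse)
```

MAE and RMSE are in the same units as charges, while MSE is in squared units; RMSE penalizes large errors more heavily than MAE.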

In [None]:
#calculate r squared
SS_Residual = sum((y_test-predictions)**2)
SS_Total = sum((y_test-np.mean(y_test))**2)
r_squared = 1 - (float(SS_Residual))/SS_Total
print('R Squared:', r_squared)
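The manual calculation should agree with scikit-learn's built-in `r2_score`. A minimal check on made-up arrays (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0])   # made-up true values
y_pred = np.array([2.0, 5.0, 9.0])   # made-up predictions

ss_residual = np.sum((y_true - y_pred) ** 2)        # 5.0
ss_total = np.sum((y_true - y_true.mean()) ** 2)    # 8.0
r_squared_manual = 1 - ss_residual / ss_total

r_squared_sklearn = r2_score(y_true, y_pred)
print(r_squared_manual, r_squared_sklearn)
```

In the notebook, `metrics.r2_score(y_test, predictions)` or `lm.score(X_test, y_test)` should return the same value as the manual formula.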