# Linear Regression as a Predictive Model for Healthcare Insurance

# Description:
### It is no surprise that healthcare insurance is expensive. The public perception is that rates and charges are determined by many factors, such as age, gender, smoking status, and occupation. Understanding which factors affect the charges is essential to making informed decisions.

# Project Objective:
### This project uses the linear regression model from the sklearn library to predict insurance charges. The target feature, or y-variable, is "charges".

# Process:
### The project starts with basic descriptive analysis, followed by exploratory data analysis, data visualization of the features, and data cleansing. The cleansed data is then split into train and test sets, scaled, and a linear regression model is fit. The model is evaluated with MAE, MSE, and RMSE. Last, the coefficients are briefly interpreted as a conclusion.

# Potential Impact
### For consumers, it is beneficial to understand which factors carry more weight and have more impact on the charges. This can help patients make informed decisions. For insurance companies, it helps to have a deeper understanding of their target audience and to improve their policy and coverage strategy. Knowledge of specific populations can help identify potential opportunities, such as population health programs and the promotion of preventive health measures. This can ultimately have a positive impact on profitability.

### Importing libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
data = pd.read_csv("../input/insurance/insurance.csv")

In [None]:
data.head()

In [None]:
len(data)

### This is a relatively small dataset with only ~1300 observations. Note that it contains a mixture of data types.

In [None]:
data.info()

### The following is basic yet important information on the dataset. Note that the feature "charges" will be our y-variable. Its min value is only ~1200 and its max is ~63000, with a std deviation of ~12000. The data has extreme values that could affect the model later on.

In [None]:
data.describe()
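
### One common way to put a number on "extreme values" is the 1.5 x IQR rule. The sketch below is only an illustration: the six values are made up rather than taken from insurance.csv, and the 1.5 multiplier is a convention, not something this notebook uses elsewhere.

```python
import pandas as pd

# Made-up values for illustration only, not rows from insurance.csv.
charges = pd.Series([1200, 4000, 9000, 13000, 17000, 63000])

q1 = charges.quantile(0.25)
q3 = charges.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = charges[(charges < lower_fence) | (charges > upper_fence)]

print("Upper fence:", upper_fence)
print("Flagged values:", outliers.tolist())
```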

# Exploratory Data Analysis
### A pairplot is a quick way to visualize the pairwise relationships between the numeric features.

In [None]:
# pairplot creates its own figure, so a preceding plt.figure() would only add an empty one
sns.pairplot(data)
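
### A pairplot shows the relationships visually. To get the correlations as numbers, a correlation matrix over the numeric columns is one option. The sketch below uses a made-up four-row frame (the name toy and its values are assumptions) and leaves "sex" as text to show that non-numeric columns are excluded.

```python
import pandas as pd

# Made-up rows for illustration; "sex" is left as text on purpose.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "bmi": [22.0, 31.0, 27.5, 35.0],
    "charges": [2000.0, 4000.0, 6000.0, 8000.0],
    "sex": ["male", "female", "male", "female"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes("number").corr()
print(corr.round(2))
```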

### Diving into features

In [None]:
data["sex"].value_counts(normalize=True)*100

### The gender feature is split roughly 50/50, so the counts alone do not tell us much. Note that this feature has the object data type, so it will be mapped to a binary 0/1 encoding.

In [None]:
data["sex"] = data["sex"].map({"male": 1, "female": 0})

In [None]:
data["sex"].value_counts()
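
### One thing to watch with map in a notebook: values that are not keys of the dictionary become NaN, so running the mapping cell a second time wipes out the column. A minimal sketch on a made-up three-row Series:

```python
import pandas as pd

sex = pd.Series(["male", "female", "male"])
mapping = {"male": 1, "female": 0}

encoded = sex.map(mapping)
# Mapping the already-encoded values again finds no matching keys,
# so every entry turns into NaN.
encoded_twice = encoded.map(mapping)

print(encoded.tolist())
print("All NaN after second map:", bool(encoded_twice.isna().all()))
```

### Checking value_counts() right after the mapping, as done above, is a quick way to catch this.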

### The smoker and region features also have the object data type. Smoker will likewise be mapped to a binary 0/1 encoding, and region will be handled with dummy variables later.

In [None]:
data.select_dtypes(object).columns

In [None]:
data["smoker"].value_counts()

In [None]:
data["smoker"] = data["smoker"].map({"yes": 1, "no": 0})

### About 20% of the people are smokers. 

In [None]:
data["smoker"].value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(12,7))
sns.set_context("paper", font_scale=1.5)
sns.countplot(x="sex", color="pink", data=data, hue="smoker")
plt.title("Number of Smokers in Gender")
plt.xlabel("Gender")
plt.ylabel("Number of People")

### About 23% of males are smokers, compared with about 17% of females.

In [None]:
gender_gb = data.groupby("sex")["smoker"].value_counts(normalize=True)*100
gender_gb 
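
### Because smoker is now coded as 0/1, the share of smokers in each group is simply the group mean. The sketch below shows this on a made-up eight-row frame (toy and its values are assumptions, not the insurance data).

```python
import pandas as pd

# Made-up rows: 1 of 4 people with sex=1 smokes, 2 of 4 with sex=0 smoke.
toy = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 0, 0, 0, 0],
    "smoker": [1, 0, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the proportion of 1s.
smoker_rate = toy.groupby("sex")["smoker"].mean() * 100
print(smoker_rate)
```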

In [None]:
data["region"].value_counts()

### Note that region has 4 categories. These will be converted into numeric indicator columns using the dummies technique.

In [None]:
# prefix_sep defaults to "_", so prefix="region" yields names like region_northwest
region_dummy = pd.get_dummies(data["region"], prefix="region", drop_first=True, dtype=int)
region_dummy
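
### With drop_first=True, one category is dropped and becomes the baseline, so 4 regions produce 3 columns. A minimal sketch on a made-up four-row Series (this example uses prefix="region" as its own choice):

```python
import pandas as pd

region = pd.Series(["southwest", "southeast", "northwest", "northeast"])
dummies = pd.get_dummies(region, prefix="region", drop_first=True, dtype=int)

print(dummies.columns.tolist())
# A row of all zeros stands for the dropped baseline category ("northeast" here).
print(dummies.iloc[3].tolist())
```

### Dropping one category avoids perfectly collinear columns, and each remaining coefficient is then read relative to the baseline region.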

### Adding the dummy columns and dropping the original "region" feature now that it has been encoded.

In [None]:
data = pd.concat([data, region_dummy], axis=1)
data = data.drop("region", axis=1)

In [None]:
data.head()

### It is time to dive into the "charges" feature. Recall that the mean, std, and max values are far apart. This is visualized with a simple distribution plot.

In [None]:
data["charges"].describe()

In [None]:
plt.figure(figsize=(12,7))
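# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(x=data["charges"], bins=30, kde=True, stat="density", color="seagreen")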

In [None]:
data.info()

### Preparing to scale the data using MinMaxScaler from sklearn library

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

### Creating variables for X and y. The data is then ready to be split into train and test sets.

In [None]:
X = data.drop("charges", axis=1)
y = data["charges"]

In [None]:
from sklearn.model_selection import train_test_split

### The test size is set to 30%, a common choice. The random_state is set to 42, an arbitrary value, so that the same random split is produced each time.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
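
### The effect of random_state can be checked directly: two calls with the same seed return identical splits. The sketch below uses a made-up array of 10 rows (X_toy and y_toy are assumptions, not the insurance data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same seed, same inputs: the two splits should match element for element.
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print("Identical splits:", same_split)
print("Train rows:", len(split_a[0]), "Test rows:", len(split_a[1]))
```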

### The data is scaled using the MinMaxScaler. Note that the scaler is fit on X_train only, and X_test is only transformed, to prevent leakage.

In [None]:
# MinMaxScaler ignores y, so only the features are passed
scaled_data = scaler.fit_transform(X_train)
scaled_test = scaler.transform(X_test)
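
### A consequence of fitting the scaler on the training set only is that test values can fall outside the 0 to 1 range. This is expected and is not a bug. A minimal sketch with made-up numbers (train, test, and toy_scaler are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [40.0]])

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns min=10 and max=30 from train
test_scaled = toy_scaler.transform(test)        # reuses the train min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```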

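### Importing the linear regression from the sklearn library. Note that the cells below fit the model on the unscaled X_train and predict on X_test, so scaled_data and scaled_test are not used. For ordinary least squares with an intercept, min-max scaling does not change the predictions; it only rescales the coefficients, so the coefficients reported later are per original unit of each feature.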

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linear_model = LinearRegression()

In [None]:
linear_model.fit(X_train, y_train)

In [None]:
linear_predict = linear_model.predict(X_test)
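
### The fit above uses X_train rather than scaled_data. For plain linear regression this does not affect the predictions, which the sketch below checks on synthetic data. The features are random stand-ins for age, bmi, and a 0/1 smoker flag, and the weights used to generate y_toy are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for age, bmi, and a 0/1 smoker flag.
X_toy = np.column_stack([
    rng.uniform(18, 64, 200),
    rng.uniform(16, 53, 200),
    rng.integers(0, 2, 200),
])
# Made-up weights plus noise; not estimates from the insurance data.
y_toy = 250 * X_toy[:, 0] + 300 * X_toy[:, 1] + 20000 * X_toy[:, 2] + rng.normal(0, 500, 200)

X_tr, X_te, y_tr = X_toy[:150], X_toy[150:], y_toy[:150]

# Model 1: fit and predict on the raw features.
pred_raw = LinearRegression().fit(X_tr, y_tr).predict(X_te)

# Model 2: fit and predict on min-max scaled features (scaler fit on train only).
toy_scaler = MinMaxScaler().fit(X_tr)
model_scaled = LinearRegression().fit(toy_scaler.transform(X_tr), y_tr)
pred_scaled = model_scaled.predict(toy_scaler.transform(X_te))

print("Same predictions:", np.allclose(pred_raw, pred_scaled))
```

### Scaling matters for models with regularization or distance-based learning; for ordinary least squares it mainly changes how the coefficients are read.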

### After prediction is complete, it is time to evaluate the model. 

In [None]:
from sklearn import metrics

In [None]:
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, linear_predict))
print("Mean Squared Error: ", metrics.mean_squared_error(y_test, linear_predict))
print("Root Mean Squared Error: ", np.sqrt(metrics.mean_squared_error(y_test, linear_predict)))
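
### For reference, the three metrics can be computed by hand. The sketch below uses four made-up actual and predicted values to show what each one measures.

```python
import numpy as np

# Made-up values for illustration.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # average size of the errors
mse = np.mean(errors ** 2)      # squaring penalizes large errors more heavily
rmse = np.sqrt(mse)             # back in the same units as the target

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

### RMSE is at least as large as MAE, and the gap widens when a few errors are much larger than the rest, which is relevant for data with extreme charges.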

### Our model's performance is evaluated by MAE, MSE, and RMSE. Recall that the mean of charges is ~13000 and the std deviation is ~12000. The scatterplot below shows our model's predictions against the actual charges. It shows a linear relationship with a lot of noise, likely due to outliers and extreme values in charges. From the graph, it is reasonable to conclude that the model does not do well in predicting outliers and extreme values.

In [None]:
plt.figure(figsize=(12,7))
sns.set_context("paper", font_scale=1)
sns.scatterplot(x=y_test, y=linear_predict)
plt.xlabel("Charges")
plt.ylabel("Prediction")

In [None]:
plt.figure(figsize=(15,8))
sns.set_context("paper", font_scale=1.5)
sns.kdeplot(x=y_test, y=linear_predict, fill=True, color="red")
plt.xlabel("Charges")
plt.ylabel("Prediction")

In [None]:
print("Explained Variance Score:", metrics.explained_variance_score(y_test, linear_predict))
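
### The explained variance score is close to, but not the same as, R squared: it ignores any constant offset in the predictions. The sketch below uses made-up values where every prediction is off by the same amount to show the difference.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = y_true + 50.0  # every prediction is too high by the same amount

# Explained variance only looks at the variance of the residuals,
# which is zero here, so the score is perfect despite the bias.
ev = metrics.explained_variance_score(y_true, y_pred)
# R squared uses the squared residuals themselves, so the bias lowers it.
r2 = metrics.r2_score(y_true, y_pred)

print("Explained variance:", ev)
print("R squared:", r2)
```

### When the residuals average to zero the two scores agree, so reporting both is a cheap check for systematic bias.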

In [None]:
plt.figure(figsize=(12,5))
sns.set_context("paper", font_scale=1.5)
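# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(x=y_test-linear_predict, bins=10, kde=True, stat="density", color="red")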
plt.title("Residual: Y_test - Prediction")

In [None]:
linear_model.coef_

### Setting up the features' coefficients. 

In [None]:
linear_coef = pd.DataFrame(linear_model.coef_, X.columns, columns=["Coefficient"])
linear_coef.sort_values("Coefficient", ascending=False)

### Holding the other features constant, the charges increase by ~$23600 when smoker goes from 0 to 1.
### By raw coefficient size, the next largest factor is children, followed by BMI and age. These coefficients are per unit of each feature, and the features are measured on different scales, so this ranking is only a rough guide. Gender appears to have little effect.
### As expected, the model indicates that smoking status is an important factor in determining the insurance charges.
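
### One hedged way to compare features measured in different units is to multiply each per-unit coefficient by the feature's observed range, which is also what the coefficient would be after min-max scaling. The sketch below uses synthetic data with made-up weights (names, X_toy, and y_toy are assumptions), so its numbers say nothing about the insurance dataset itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
names = ["age", "children", "smoker"]
# Synthetic features: age 18-64, children 0-5, smoker 0/1.
X_toy = np.column_stack([
    rng.integers(18, 65, 300),
    rng.integers(0, 6, 300),
    rng.integers(0, 2, 300),
]).astype(float)
# Made-up weights plus noise; not estimates from the insurance data.
y_toy = 250 * X_toy[:, 0] + 450 * X_toy[:, 1] + 20000 * X_toy[:, 2] + rng.normal(0, 500, 300)

model = LinearRegression().fit(X_toy, y_toy)

# Effect of moving each feature from its observed minimum to its maximum.
feature_range = X_toy.max(axis=0) - X_toy.min(axis=0)
effect_over_range = model.coef_ * feature_range

for name, per_unit, over_range in zip(names, model.coef_, effect_over_range):
    print(f"{name}: per unit = {per_unit:.0f}, over full range = {over_range:.0f}")
```

### In this synthetic setup, children has the larger per-unit coefficient, but age has the larger effect over its full range, which shows how a ranking by raw coefficients can change once the scale of each feature is taken into account.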