**<h1>1. Introduction</h1>**

This kernel practices simple data pre-processing and regression model building to predict medical cost from a medical cost dataset.

**<h1>2. Dataset Overview</h1>**

Let's take a glimpse at the dataset before processing it further. Then we can decide which pre-processing methods are suitable for predicting medical cost.

In [None]:
import pandas as pd

# Load dataset
data_path = '../input/insurance.csv'
data = pd.read_csv(data_path)

# See general information about the dataset & 5 sample data points
# (info() prints its report itself and returns None, so print() is not needed)
data.info()
data.head(5)

As seen above, the medical cost dataset consists of 1338 rows with 7 columns each, 3 of which are categorical (indicated by the object data type). The categorical values are found in the *sex*, *smoker*, and *region* columns. These values need to be converted to numerical values before building the prediction model. In addition, there are no missing values in the dataset, as shown by the non-null count of 1338 (the total number of rows) for every column.
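
The same two checks (which columns are categorical, and whether anything is missing) can also be done programmatically instead of being read off the `info()` output. Below is a minimal sketch on a small made-up frame named `toy` (hypothetical values, not the real dataset) that reuses the dataset's column names:

```python
import pandas as pd

# Hypothetical stand-in for the insurance data (values are made up)
toy = pd.DataFrame({
    'age': [19, 18, 28],
    'sex': ['female', 'male', 'male'],
    'bmi': [27.9, 33.8, 33.0],
    'children': [0, 1, 3],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast'],
    'charges': [16884.92, 1725.55, 4449.46],
})

# Columns holding non-numeric values (the ones that need encoding)
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(categorical_cols)

# Number of missing values per column (all zeros means nothing is missing)
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

On the real data, the same two calls could be applied to `data` directly.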

**<h1>3. Data Pre-processing</h1>**

**<h2>Label Encoding</h2>**

Label encoding transforms categorical values into numerical values. *LabelEncoder* from the scikit-learn library is used for this transformation. An integer is assigned to each distinct categorical value within a column. New columns containing the encoded values are then appended to the end of the dataset.


In [None]:
from sklearn.preprocessing import LabelEncoder

# Label encoder initialization
le_sex = LabelEncoder()
le_smoker = LabelEncoder()
le_region = LabelEncoder()

# Label encoding
data['sex_encoded'] = le_sex.fit_transform(data.sex)
data['smoker_encoded'] = le_smoker.fit_transform(data.smoker)
data['region_encoded'] = le_region.fit_transform(data.region)

# See the encoding mapping
# (each categorical value is encoded as its index in the printed list)
print('sex column encoding mapping : %s' % list(le_sex.classes_))
print('smoker column encoding mapping : %s' % list(le_smoker.classes_))
print('region column encoding mapping : %s' % list(le_region.classes_))

# See label encoding result
data.head(5)

As seen above, the numerical values assigned to the *region* column are not binary (the column has 4 categories). This doesn't look like a problem at first, but it is: a model may treat the integer codes as if they had an order or magnitude. Unless the categories have a natural order (e.g. cold, warm, hot), "binarization" of the categorical values is necessary. Therefore One Hot Encoding is applied to transform the integers assigned to the *region* column into binary values. See this neat [article](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) for more on why label encoding alone is not enough to transform categorical values.
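
To make the ordering problem concrete, here is a small sketch. It assumes the four region names listed in `regions` (the text above only states that there are 4 categories) and uses `np.eye` purely as an illustrative stand-in for a one-hot encoder. With integer codes, some regions appear "further apart" than others. With one-hot rows, every pair of regions is equally distant.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Assumed region names (the text only states that there are 4 categories)
regions = ['northeast', 'northwest', 'southeast', 'southwest']

# LabelEncoder assigns integers following the sorted order of the classes
codes = LabelEncoder().fit_transform(regions)
print(codes)

# Distances between integer codes depend on the arbitrary alphabetical order
dist_codes_near = abs(codes[0] - codes[1])  # northeast vs northwest
dist_codes_far = abs(codes[0] - codes[3])   # northeast vs southwest

# One-hot rows: pick one row of the identity matrix per integer code
one_hot = np.eye(len(regions))[codes]
dist_ohe_near = np.linalg.norm(one_hot[0] - one_hot[1])
dist_ohe_far = np.linalg.norm(one_hot[0] - one_hot[3])

print(dist_codes_near, dist_codes_far)  # unequal distances
print(dist_ohe_near, dist_ohe_far)      # equal distances
```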

**<h2>One Hot Encoding</h2>**

The One Hot Encoding (OHE) used in this kernel comes from scikit-learn's *OneHotEncoder*. It takes the integers assigned to the categorical values as input. The OHE result is then appended to the end of the dataset.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# One hot encoder initialization
ohe_region = OneHotEncoder()

# One hot encoding (OHE) to array
arr_ohe_region = ohe_region.fit_transform(data.region_encoded.values.reshape(-1,1)).toarray()

# Convert array OHE to dataframe and append to existing dataframe
dfOneHot = pd.DataFrame(arr_ohe_region, columns=['region_'+str(i) for i in range(arr_ohe_region.shape[1])])
data = pd.concat([data, dfOneHot], axis=1)

# See the preprocessing result
data.head(5)

So far, the transformed categorical columns have been appended to the end of the dataset, but not all columns will be used to build the prediction model. The unnecessary columns, namely the original categorical columns and the non-binary encoded column, are dropped.

In [None]:
# Drop the original categorical columns and the non-binary region encoding
preprocessed_data = data.drop(['sex','smoker','region',
                               'region_encoded'], axis=1)

# See the preprocessing final result
preprocessed_data.head(5)
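
As an alternative to the encode, append, and drop steps above, pandas can one-hot encode a column in a single call with `pd.get_dummies`. This is a sketch of a different route, not the method used in this kernel. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical mini dataset (values are made up)
toy = pd.DataFrame({
    'age': [19, 18, 28, 33],
    'region': ['southwest', 'southeast', 'northwest', 'northeast'],
    'charges': [16884.92, 1725.55, 4449.46, 21984.47],
})

# One-hot encode 'region': the original column is replaced by one
# indicator column per category, named '<column>_<category>'
encoded = pd.get_dummies(toy, columns=['region'], dtype=float)
print(encoded.columns.tolist())

# Each row has exactly one active region indicator
print(encoded.filter(like='region_').sum(axis=1).tolist())
```

Unlike the `region_0`, `region_1`, ... names built above, the generated column names carry the category itself, which makes the final table easier to read.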

**<h2>Data Preparation</h2>**

The pre-processed dataset is then divided into training data and testing data. Training data is used for building the prediction model, while testing data is used for evaluating it.

In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training (80%) and testing (20%) sets
train, test = train_test_split(preprocessed_data, test_size=0.2)

# Split the feature and the target
train_y = train.charges.values
train_x = train.drop(columns=['charges']).values
test_y = test.charges.values
test_x = test.drop(columns=['charges']).values

# See the size of training and testing
print('Training features : ', train_x.shape)
print('Training target : ', train_y.shape)
print('Testing features : ', test_x.shape)
print('Testing target : ', test_y.shape)
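
`train_test_split` shuffles the rows before splitting, so without a fixed seed each run produces a different split and therefore different scores. A minimal sketch on made-up data (`rows` is hypothetical) showing that passing `random_state` makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 columns
rows = np.arange(20).reshape(10, 2)

# Same seed -> same shuffle -> identical splits
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=0)

print(train_a.shape, test_a.shape)
print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))
```

The seed value itself is arbitrary. What matters is that it stays the same between runs when results need to be compared.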

**<h1>4. Building Prediction Model (Training)</h1>**

In this kernel, two methods are used to build the prediction model: Linear Regression and Random Forest. The two methods are then evaluated and compared in terms of performance.

In [None]:
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor

# Building linear regression model
lr_model = linear_model.LinearRegression()
lr_model.fit(train_x, train_y)

# Building Random Forest model
rf_model = RandomForestRegressor()
rf_model.fit(train_x, train_y)

# Make prediction
lr_predict = lr_model.predict(test_x)
rf_predict = rf_model.predict(test_x)

# Sample the prediction
sample_id = 7
print('Actual Charges : %.2f' % test_y[sample_id])
print('Linear Regression Prediction : %.2f' % lr_predict[sample_id])
print('Random Forest Prediction : %.2f' % rf_predict[sample_id])

**<h1>5. Model Evaluation (Testing)</h1>**

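The predictions made on the testing data are evaluated using *Mean Squared Error (MSE)* and the R2-Score. MSE is the average squared difference between the predicted and actual values; the code below reports its square root (RMSE), which is in the same unit as the charges. The R2-Score (coefficient of determination) measures the proportion of variance in the actual values that the model explains, with 1.0 being a perfect fit. The results show that the random forest model yields better predictions than linear regression, as indicated by both a lower MSE and a higher R2-Score.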

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import math

# Evaluate prediction model using MSE (reported as its square root, RMSE)
lr_mse = mean_squared_error(test_y, lr_predict)
print('RMSE-Linear Regression : %.2f' % math.sqrt(lr_mse))
rf_mse = mean_squared_error(test_y, rf_predict)
print('RMSE-Random Forest : %.2f' % math.sqrt(rf_mse))

# Evaluate prediction model using R2-Score
lr_r2 = r2_score(test_y, lr_predict)
print('R2-Linear Regression : %.2f' % lr_r2)
rf_r2 = r2_score(test_y, rf_predict)
print('R2-Random Forest : %.2f' % rf_r2)
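
To show what the two metrics compute, the sketch below derives MSE, RMSE, and R2 by hand with NumPy and checks them against scikit-learn. The arrays `y_true` and `y_pred` are made-up numbers, not values from the insurance dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values (not from the insurance dataset)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: mean of the squared residuals; RMSE: its square root
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('MSE  : %.4f' % mse)
print('RMSE : %.4f' % rmse)
print('R2   : %.4f' % r2)

# Cross-check against scikit-learn's implementations
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

A lower MSE (or RMSE) and an R2 closer to 1.0 both indicate predictions that sit closer to the actual values.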