This kernel explores regression and different ways you can improve its accuracy. We're going to try several regression algorithms: Linear, Lasso, Ridge, and Support Vector Regression. We'll also explore boosting and stacking methods.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor

import matplotlib.pyplot as plt
import seaborn as sns

As usual, we double-check where our input data is.

In [None]:
!ls ../input/insurance

In [None]:
insurance = pd.read_csv('../input/insurance/insurance.csv')
insurance.head()

In [None]:
insurance.info()

Notice that sex and smoker are categorical features with two values each. Let's convert them to numerical features, specifically 0/1 values. Using the map function, we'll set these values:

- Male = 0; Female = 1
- Non-smoker = 0; Smoker = 1

Then double-check that the conversion worked.

In [None]:
insurance['sex'] = insurance['sex'].map({'male': 0, 'female': 1})
insurance['smoker'] = insurance['smoker'].map({'yes': 1, 'no': 0})
insurance.head()
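
One thing worth knowing about map: any value that isn't a key in the dictionary silently becomes NaN. Here's a minimal sketch of how you could catch that. The toy dataframe and the stray 'Female' label are made up for illustration and are not from the insurance data.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({'sex': ['male', 'female', 'Female'],
                    'smoker': ['yes', 'no', 'no']})

sex_codes = {'male': 0, 'female': 1}
toy['sex_encoded'] = toy['sex'].map(sex_codes)

# map() returns NaN for any value missing from the dictionary,
# so look at the NaN rows to catch unexpected labels such as 'Female'.
unmapped = toy.loc[toy['sex_encoded'].isna(), 'sex'].unique()
print('unmapped labels:', list(unmapped))
```

If the list comes back empty, every label was covered by the dictionary.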

In [None]:
insurance.info()

Next, we check for missing values.

In [None]:
insurance.isnull().sum()

Thankfully, we don't have any missing values. If you do encounter some, you have to deal with them first. You can look at a sample in one of my kernels [here](https://www.kaggle.com/danaelisanicolas/data-cleaning) to see how I dealt with this scenario.
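
For reference, here is a rough sketch of the two most common ways to handle gaps: dropping the affected rows, or filling them with the column median. The toy dataframe is made up, and which option fits depends on your data, so treat this as an illustration rather than a recipe.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps, only to illustrate the two options.
toy = pd.DataFrame({'age': [19, 33, np.nan, 45],
                    'bmi': [27.9, np.nan, 30.1, 25.0],
                    'charges': [16884.9, 4449.5, 21984.5, 8240.6]})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill numeric gaps with the median of each column.
filled = toy.fillna(toy.median(numeric_only=True))

print(len(dropped), 'rows left after dropna')
print(filled.isnull().sum().sum(), 'missing values after fillna')
```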

Next, we check the correlation between the features.

In [None]:
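# region is still a string column here; numeric_only=True (pandas >= 1.5) skips it
sns.heatmap(insurance.corr(numeric_only=True), annot=True)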
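
A heatmap can be hard to read at a glance, so another option is to pull out just the target's column of the correlation matrix and sort it. The sketch below uses synthetic data (the coefficients are invented) purely to show the pattern.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the insurance data; the coefficients are invented.
toy = pd.DataFrame({'age': rng.integers(18, 65, n),
                    'smoker': rng.integers(0, 2, n)})
toy['charges'] = 250 * toy['age'] + 20000 * toy['smoker'] + rng.normal(0, 1000, n)

# Correlation of every feature with the target, strongest first.
corr_with_target = (toy.corr()['charges']
                    .drop('charges')
                    .sort_values(ascending=False))
print(corr_with_target)
```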

So we see that smoking is the feature with the strongest correlation with charges, followed by age and bmi. Notice that region is missing from the heatmap? That's because it's a categorical feature. We can convert it to numerical to see whether, in our case, it affects each customer's charges.

First we check for unique values. We can use pandas' unique() function for this.

In [None]:
insurance['region'].unique()

So we have 4 unique values, which we can't simply convert to a boolean. Instead we use dummy variables. Fortunately, pandas has a get_dummies function that makes this easy.

In [None]:
region = pd.get_dummies(insurance['region'])
region.head()
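
A side note on dummy variables: the four columns always add up to 1, so one of them is redundant for linear models. A common variation (not what this kernel does) is to drop the first category, as in this small sketch on made-up rows.

```python
import pandas as pd

# Made-up rows; only the column name mirrors the insurance data.
toy = pd.DataFrame({'region': ['southwest', 'southeast', 'northwest',
                               'northeast', 'southeast']})

# drop_first=True drops the first category (alphabetically 'northeast'),
# so a row with zeros everywhere means 'northeast'.
# dtype=int gives 0/1 integers instead of booleans in newer pandas.
dummies = pd.get_dummies(toy['region'], drop_first=True, dtype=int)
print(dummies)
```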

Now we see northeast, northwest, southeast, and southwest as new columns. Each is set to 1 if the customer lives in that region and 0 otherwise. Since region is in a separate dataframe, let's merge it with our original insurance dataframe.

In [None]:
# Both frames share the same index, so the dummy columns can be attached side by side
insurance = pd.concat([insurance.drop(columns=['region']), region], axis=1)
insurance.head()

And now check whether region actually affects charges.

In [None]:
_, ax = plt.subplots(figsize=(10,8))
sns.heatmap(insurance.corr(), annot=True, ax=ax)

Apparently region barely correlates with charges; the effect is negligible. So let's just drop it altogether.

In [None]:
insurance.drop(['northeast', 'northwest', 'southeast', 'southwest'], axis=1, inplace=True)

And now, let's check the distribution of our data. 

In [None]:
insurance['charges'].describe()

In [None]:
# distplot is deprecated since seaborn 0.11; histplot with kde=True draws a similar plot
sns.histplot(insurance['charges'], kde=True)

Clearly, our data is not normally distributed at all. Another thing to note is that the range of charges is very wide. That will make it hard to interpret our models' errors once we create and fit them.

Just for demonstration, let's NOT normalise or transform it yet. I'll show you what I'm talking about.

First, let's select just the highly correlated features: age, bmi, and whether the person is a smoker. We set these as x, and our target, charges (it's what we're trying to predict, after all), as y.

In [None]:
#Feature Selection
#3 features vs all

x = insurance[['age', 'bmi', 'smoker']]

y = insurance['charges']

Create our train and test variables using the train_test_split function of scikit-learn. I will be using a test_size of 0.2 because I want an 80-20 split between train and test. I will also set the random state for reproducibility.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

Then create our basic Linear Regression model (with default hyperparameters) and fit it to our train data. Once fitted, we predict y from the x test set and compare the predictions with the actual y test values. In short: fit with train, check with test.

In [None]:
lr_model = LinearRegression()
lr_model.fit(x_train, y_train)
y_pred = lr_model.predict(x_test)
print('r2 score: ' + str(metrics.r2_score(y_test, y_pred)))
print('mse: ' + str(metrics.mean_squared_error(y_test, y_pred)))

Now let's see whether using all features makes a difference. Set our x to all features (removing the target variable, of course), then repeat what we did above.

In [None]:
x = insurance.drop(['charges'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [None]:
lr_model = LinearRegression()
lr_model.fit(x_train, y_train)
y_pred = lr_model.predict(x_test)
print('r2 score: ' + str(metrics.r2_score(y_test, y_pred)))
print('mse: ' + str(metrics.mean_squared_error(y_test, y_pred)))

Now here's the thing: the MSE is waaaay high. What does that even mean, and how do we interpret it? MSE is the average of the squared differences between predictions and actual values, so it is expressed in squared units of charges. In both cases we get an MSE of about 35M, which corresponds to a root mean squared error of roughly 5,900 in the units of charges. But relative to the spread of the data, how far off are we? Scaling the values makes that easier to see.
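
Since MSE is in squared units, its square root (RMSE) and the mean absolute error (MAE) are often easier to read, because both are in the same units as the target. Here's a minimal sketch with made-up numbers, not the model's actual predictions.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted charges, only to show the arithmetic.
y_true = np.array([12000.0, 4000.0, 30000.0, 8000.0])
y_hat = np.array([10000.0, 7000.0, 24000.0, 9000.0])

mse = metrics.mean_squared_error(y_true, y_hat)    # squared units
rmse = np.sqrt(mse)                                # same units as charges
mae = metrics.mean_absolute_error(y_true, y_hat)   # same units as charges

print('mse: ', mse)
print('rmse:', rmse)
print('mae: ', mae)
```

The errors here are 2000, 3000, 6000, and 1000, so the MSE is 12.5M while the RMSE is only about 3,536.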

The idea of the transformation below is to make the distribution look more like a Gaussian curve, or what statisticians call a normal distribution: a bell shape. Compare the original charges distribution with the transformed one that I'll show next:

In [None]:
sns.histplot(insurance['charges'], kde=True)  # replaces the deprecated distplot

In [None]:
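# log transform to reduce the right skew and get closer to a gaussian curve
transformed_charges = np.log(insurance['charges'])
sns.histplot(transformed_charges, kde=True)  # replaces the deprecated distplot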

See the difference? The first one is bunched up on the left side of the graph with a long tail to the right. After the log transform, we see a bell shape. It might not be a perfect one, but it's much more symmetric. And what do the values look like? Let's look at the head of the transformed values.

In [None]:
transformed_charges.head()
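
To put a number on "more bell-shaped", you can compare the skewness before and after the log transform. The sketch below uses synthetic right-skewed values as a stand-in for charges (an assumption, not the real data), and also shows that np.exp reverses np.log.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for charges (not the real data).
fake_charges = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=1000))
logged = np.log(fake_charges)

# Skewness near 0 means a roughly symmetric distribution.
print('skew before:', round(fake_charges.skew(), 2))
print('skew after: ', round(logged.skew(), 2))

# np.exp undoes np.log, so values on the log scale can be mapped back.
restored = np.exp(logged)
```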

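So basically it's scaled down, with the plotted range running from 5-ish to 12.

In that scenario we used np.log to transform the charges. There's also StandardScaler, which rescales every column of our data to zero mean and unit variance. Note that, unlike the log transform, it changes only the scale of each column, not the shape of its distribution.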

In [None]:
scaler = StandardScaler()
scaler.fit(insurance)
insurance_normed = pd.DataFrame(scaler.transform(insurance), columns=insurance.columns)
insurance_normed.head()
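
Above, the scaler is fitted on the whole dataframe for simplicity. A common alternative, sketched here on synthetic data with made-up means and spreads, is to fit the scaler on the train split only and reuse its statistics on the test split, so nothing about the test rows leaks into preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Two synthetic columns loosely imitating age and bmi (made-up parameters).
X = rng.normal(loc=[40, 30], scale=[14, 6], size=(500, 2))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)

scaler = StandardScaler().fit(X_train)   # mean and std come from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuses the train mean and std

print(X_train_s.mean(axis=0).round(3), X_train_s.std(axis=0).round(3))
```

The train columns end up with mean 0 and standard deviation 1; the test columns will be close to that but not exact, which is expected.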

So now we can compare our original data with the normalised one.

In [None]:
# distplot expects a single 1-D variable; kdeplot draws one density curve per column
sns.kdeplot(data=insurance)

In [None]:
sns.kdeplot(data=insurance_normed)

And do the fitting.

In [None]:
x = insurance_normed.drop(['charges'], axis=1)
y = insurance_normed['charges']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [None]:
lr_model = LinearRegression()
lr_model.fit(x_train, y_train)
y_pred = lr_model.predict(x_test)
print('r2 score: ' + str(metrics.r2_score(y_test, y_pred)))
print('mse: ' + str(metrics.mean_squared_error(y_test, y_pred)))

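Now the MSE looks better, right? And much more readable. Because charges is now standardised, an MSE of 0.24 is in squared standard-deviation units: our typical error (the square root, about 0.49) is roughly half a standard deviation of charges. The closer MSE is to 0, the closer we are to the actual values. Keep in mind that for linear regression, scaling changes the units of the error, not the quality of the fit.

That's just basic Linear Regression. Let's explore other regression models: SVR, Ridge, and Lasso.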
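
To see why scaling makes MSE easier to read without changing how good the predictions are, here is a small sketch with made-up numbers. It standardises the same actual and predicted values and compares the metrics on both scales.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted charges.
y_true = np.array([12000.0, 4000.0, 30000.0, 8000.0])
y_hat = np.array([10000.0, 7000.0, 24000.0, 9000.0])

# Standardise both with the mean and std of the actual values.
mu, sigma = y_true.mean(), y_true.std()
y_true_s = (y_true - mu) / sigma
y_hat_s = (y_hat - mu) / sigma

r2_raw = metrics.r2_score(y_true, y_hat)
r2_scaled = metrics.r2_score(y_true_s, y_hat_s)
mse_raw = metrics.mean_squared_error(y_true, y_hat)
mse_scaled = metrics.mean_squared_error(y_true_s, y_hat_s)

print('r2 :', r2_raw, r2_scaled)      # identical on both scales
print('mse:', mse_raw, mse_scaled)    # scaled MSE = raw MSE / sigma**2
```

When the actual values have unit variance, the scaled MSE is simply 1 minus r2, which is why it lands between 0 and 1 for any model that beats predicting the mean.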

In [None]:
s_model = SVR(kernel='linear')
s_model.fit(x_train, y_train)
y_pred = s_model.predict(x_test)
print('r2 score: ' + str(metrics.r2_score(y_test, y_pred)))
print('mse: ' + str(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
rd_model = Ridge()
rd_model.fit(x_train, y_train)
y_pred = rd_model.predict(x_test)
print('r2 score: ' + str(metrics.r2_score(y_test, y_pred)))
print('mse: ' + str(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
ls_model = Lasso()
ls_model.fit(x_train, y_train)
y_pred = ls_model.predict(x_test)
print('r2 score: ' + str(metrics.r2_score(y_test, y_pred)))
print('mse: ' + str(metrics.mean_squared_error(y_test, y_pred)))

So even compared with the other models, Linear Regression is still the best one we have so far. We can try fine-tuning our models by trying out different hyperparameters.

But what if an algorithm could find the best hyperparameters for us? Fortunately, there is one: GridSearchCV (GSC). You list the hyperparameters and the values to try for each and feed them to GSC, along with the model and the scoring metric used to pick the best parameters.

Unfortunately, Linear Regression has few hyperparameters to tune. Let's use GSC on our SVR model instead, since its result is fairly similar to our Linear Regression model's.

In [None]:
#GridSearch
#SVR

parameters = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'tol': [0.001, 0.01, 0.1, 1]
}

s_model = SVR()
s_regressor = GridSearchCV(s_model, param_grid=parameters, scoring='neg_mean_squared_error')
grid_result = s_regressor.fit(x_train, y_train)
print(grid_result.best_params_)
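
Besides best_params_, the fitted search object also exposes the cross-validated score of the winner and a ready-to-use model. Here's a small self-contained sketch on a synthetic nonlinear target (the data and the reduced grid are made up to keep it fast).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic nonlinear target, so the kernel choice actually matters.
X = rng.uniform(-2, 2, size=(120, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.1, 120)

param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}
search = GridSearchCV(SVR(), param_grid=param_grid,
                      scoring='neg_mean_squared_error', cv=5)
search.fit(X, y)

print(search.best_params_)
print('cv mse:', -search.best_score_)   # the score is negated MSE, so flip the sign

# With the default refit=True, this model is already refitted on all of X, y.
best_svr = search.best_estimator_
```

Using best_estimator_ directly saves you from copying the winning hyperparameters by hand into a new model.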

So GSC says our best hyperparameters are
- C = 10
- kernel = 'rbf'
- tol = 0.001

So let's use those and check our r2 score and MSE.

In [None]:
s_model = SVR(C=10, gamma='scale', kernel='rbf', tol=0.001)
s_model.fit(x_train, y_train)
y_pred = s_model.predict(x_test)
print('r2 score: ' + str(metrics.r2_score(y_test, y_pred)))
print('mse: ' + str(metrics.mean_squared_error(y_test, y_pred)))

Nice. So from an r2 of about 0.70 with the earlier SVR, we gained about 0.17 by using the best hyperparameters that GSC found. MSE also went from 0.30 to 0.13.

The next thing we'll try is what we call boosting. The idea is to build an ensemble of weak models (here, small decision trees) one after another, where each new tree is fitted to the errors left by the trees before it, and the final prediction is the sum of all their contributions. I'll be using GradientBoostingRegressor for this.

In [None]:
# criterion='mse' was renamed to 'squared_error' in scikit-learn 1.0 and later removed
gbr_model = GradientBoostingRegressor(n_estimators=3, max_depth=3, learning_rate=1, criterion='squared_error', random_state=1)
gbr_model.fit(x_train, y_train)
y_pred = gbr_model.predict(x_test)
print('r2 score: ' + str(metrics.r2_score(y_test, y_pred)))
print('mse: ' + str(metrics.mean_squared_error(y_test, y_pred)))
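
To get a feel for what boosting does under the hood, here is a simplified hand-rolled version for squared error, on synthetic data. It is only a sketch of the idea (each tree is fitted to the residuals of the ensemble so far), not a reproduction of how GradientBoostingRegressor is implemented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)   # synthetic target

learning_rate = 0.5
prediction = np.full_like(y, y.mean())          # stage 0: predict the mean
mse_per_stage = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction                      # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                          # the next tree targets the residuals
    prediction += learning_rate * tree.predict(X)   # take a small corrective step
    mse_per_stage.append(np.mean((y - prediction) ** 2))

print('train mse at start:', mse_per_stage[0])
print('train mse at end:  ', mse_per_stage[-1])
```

The training error can only go down as trees are added, which is also why n_estimators and learning_rate need tuning against held-out data rather than training data.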

Okay, so SVR with the best hyperparameters is still our best bet, though not by much.

Next we'll try what we call stacking. In stacking, several different models (regressors in this scenario) are fitted to the training data. Their predictions (the y_pred of each model) then become the input features for a final model, which we call the final estimator. For this, I'll use GradientBoostingRegressor as my final estimator, with Ridge, Lasso, SVR, and Linear Regression as the base models.

In [None]:
estimators = [('ridge', Ridge()),
              ('lasso', Lasso()),
              ('svr', SVR()),
              ('lr', LinearRegression())]
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=GradientBoostingRegressor())
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)
print('r2 score: ' + str(metrics.r2_score(y_test, y_pred)))
print('mse: ' + str(metrics.mean_squared_error(y_test, y_pred)))
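
If you're curious what the final estimator actually trains on, the sketch below builds that "new data" by hand: one column of out-of-fold predictions per base model, obtained with cross_val_predict. The data is synthetic and the final estimator is a plain LinearRegression for simplicity, so treat it as an illustration of the idea rather than an exact copy of StackingRegressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 150)   # synthetic target

base_models = [Ridge(), SVR(), LinearRegression()]

# One column of out-of-fold predictions per base model:
# each row is predicted by a model that never saw that row during fitting.
meta_features = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_models])

# The final estimator learns how to combine the base models' predictions.
final_estimator = LinearRegression().fit(meta_features, y)
print(meta_features.shape)
```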

So there!

I've shown you different ways of building regression models. As a final step before reporting back to your stakeholders, you should convert the scaled predictions back to the original scale to get the actual predicted charges. I might do that in the next commit.

Otherwise, if this kernel helped you in any way, you can help others see it as well by giving an upvote! Thanks!
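
In the meantime, here is one possible way to do that conversion. It assumes the target gets its own StandardScaler (named y_scaler here, which is not how the kernel above is set up, since its scaler was fitted on all columns together) and uses synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 13000 + 9000 * X[:, 0] + rng.normal(0, 500, 100)   # synthetic "charges"

# A dedicated scaler for the target makes the round trip straightforward.
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()

model = LinearRegression().fit(X, y_scaled)
pred_scaled = model.predict(X)

# inverse_transform maps the scaled predictions back to the original units.
pred_charges = y_scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
print(pred_charges[:3])
```

With the single scaler used in this kernel, the same result could be obtained by multiplying the scaled predictions by the scale_ entry for charges and adding the matching mean_ entry.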