# <p style="text-align: center;"> A Variety of Regression Models for Medical Care Cost Prediction </p>


# Introduction

In this work, I present a variety of machine learning techniques that are used for regression problems. The list of regression models covered here is not exhaustive, but it spans several families of techniques. All of these models are trained and tested on the Medical Cost Personal Datasets.

Let's not waste a lot of time and let's start with importing the libraries we will be using for all the steps of this work. 

Specific libraries for the models and for the metrics will be imported when needed so that you can easily identify what is needed for that specific model.

In [None]:
#let's start with importing the libraries we need to:
# 1- read and manipulate data:
import pandas as pd
import numpy as np
# 2- visualize data and graphs
import matplotlib.pyplot as plt
import seaborn as sns

# I. Data Acquisition 

In [None]:
raw_data=pd.read_csv('../input/insurance/insurance.csv')
X=raw_data.drop(["charges"],axis=1)
y=raw_data["charges"]
raw_data.head()

## Description
After reading the csv file we obtain a dataframe that contains 1338 observations/subjects (rows of data) and 7 columns (or variables). Let's detail the variables. We have:
* 6 independent variables (also known as regressors or predictors):
    - The age of the subject. This is a numerical variable.
    - The gender (sex) of the subject. This is a categorical variable stored as a string.
    - The BMI (body mass index) of the subject, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m ^ 2). It indicates whether a weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9. This is a numerical variable stored as a float.
    - The number of children of the subject. This is a numerical variable.
    - A categorical variable that indicates whether the subject smokes.
    - The address (region) of the subject, which is a categorical variable stored as a string.
* A dependent variable (also known as the response or target), which is the cost of medical care.

# II. Data Processing

## Check for missing values

Most real-life datasets contain a lot of missing data. Therefore, one should always start by cleaning the data of missing values and incorrect entries. There are many imputation techniques for missing values; however, they are not discussed here.

In [None]:
raw_data.isnull().sum()

As we can see, our data does not contain any missing values.

## Visualization & first analysis

### Plots & Histograms
Let's start by plotting the dependent variable as a function of each of the independent variables.

In [None]:
plt.figure(figsize=(16,16))
i=1
for elt in X.columns:
    plt.subplot(3,2,i)
    plt.scatter(X[elt],y)
    plt.xlabel(elt)
    plt.ylabel('Medical Cost (u.m.)')
    i+=1
plt.tight_layout(pad=0.05)  # pad must be passed by keyword in recent matplotlib versions
plt.show()

From this set of plots, we cannot conclude much. One obvious observation is that age contributes directly to the cost of medical care. In fact, in the plot of medical cost as a function of age, we can easily identify three groups in the evolution of the cost (highlighted in the figure below).

In [None]:
plt.figure(figsize=(8,5))
plt.scatter(X['age'],y)
plt.plot([17,65,65,17,17],[4000,18000,10000,0,4000],color='red')
plt.plot([17,65,65,17,17],[25000,36000,20000,10000,25000],color='red')
plt.plot([17,65,65,17,17],[40000,55000,40000,30000,40000],color='red')
plt.xlabel('Age (years)')
plt.ylabel('Medical Cost (u.m.)')
plt.show()

If we combine two variables, we may observe some interesting things. Maybe we can identify the other variable that causes age to produce three classes, or evolutions, of the medical cost.

In [None]:
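from mpl_toolkits.mplot3d import Axes3D
# Figure.gca(projection=...) was removed in recent matplotlib releases; add_subplot is the supported call
threedee = plt.figure(figsize=(10,8)).add_subplot(projection='3d')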
threedee.scatter(X["age"], X['bmi'],y)
threedee.set_xlabel('Age (years)')
threedee.set_ylabel('BMI')
threedee.set_zlabel('Medical Cost (u.m.)')
plt.show()

In [None]:
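from mpl_toolkits.mplot3d import Axes3D
# Figure.gca(projection=...) was removed in recent matplotlib releases; add_subplot is the supported call
threedee = plt.figure(figsize=(10,8)).add_subplot(projection='3d')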
threedee.scatter(X["age"], X['children'],y)
threedee.set_xlabel('Age (years)')
threedee.set_ylabel('children')
threedee.set_zlabel('Medical Cost (u.m.)')
plt.show()

We can note that the number of children, combined with age, gives four possible evolutions of the medical cost.

Let's see the boxplots of the categorical variables.

In [None]:
plt.figure(figsize=(10,10))
cat_col=['smoker','sex','region','children']
i=1
for elt in cat_col:
    plt.subplot(2,2,i)
    sns.boxplot(x=X[elt],y=y)
    i+=1
plt.show()

Finally, let's see the histograms. I will leave the interpretation for you to do.

In [None]:
hist = raw_data.hist(bins=100,color='red',figsize=(16, 16))

### Data Processing
First of all, we intend to use our data to predict the potential medical cost of new patients. However, our data contain some categorical variables. So let's change these categorical variables into numerical ones.
1. Smoker:
In the boxplot of medical cost as a function of being a smoker, we can see that smoking subjects have a higher medical bill. This does not mean that if you are a smoker you will necessarily pay more for medical treatment (but chances are you will). This is not causality! Please be careful, because people often mistake correlation for causality. Since you either smoke or you don't, I propose that we change the smoker variable into a numerical one, equal to 1 for a smoker and 0 for a non-smoker.

In [None]:
X['smoker']=X['smoker'].apply(lambda x : 1 if x=="yes" else 0)
X.head(5)

Great! Now we have changed the smoker variable. Let's move on to the sex variable.
2. Sex:
As we can see in the boxplot of the sex variable, both categories have almost the same median and the same quartiles. Since there is no order relationship between the categories, we can choose to make sex = 1 for either male or female; it won't affect the results. Since in most databases sex is equal to 1 for male individuals, and for the sake of consistency with other works, I propose to set sex = 1 for male and 0 for female.

In [None]:
X['sex']=X['sex'].apply(lambda x : 1 if x=='male' else 0)
X.head(5)

Great! Now let's move to the last categorical variable.
3. Region: In this database, region has 4 possible values (southwest, southeast, northwest, and northeast). These regions do not have an order relationship, and saying for example that southwest = 2 and northwest = 4 makes no sense. Therefore, we have to create dummy variables. Each one has the name of the region it represents and is equal to 1 when the subject is from that region.

In [None]:
for region_name in ['southwest', 'southeast', 'northwest', 'northeast']:
    X[region_name]=X['region'].apply(lambda x : 1 if x==region_name else 0)
X= X.drop('region',axis=1)
X.head(5)

Now we have obtained a new input dataframe that contains 9 independent variables, all numerical (even though some of them represent categories). Before we move on to the next part, in which we will use different machine learning algorithms to predict the medical bill of subjects, let's visualize one last plot (the correlation map).

In [None]:
f = plt.figure(figsize=(19, 15))
df=pd.concat([X, y], axis=1)
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=90)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16,y=-0.08)
plt.show()

As we said before, being a smoker and the age of the subjects are the most correlated variables with the medical cost.

# III. Performance & Metrics
One commonly used metric for performance measurement in regression problems is the root mean squared error (RMSE). The formula of this metric is as follows:
$$RMSE=\sqrt{\frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{N}}$$

The RMSE is just the square root of the mean squared error (MSE). It is easily interpreted because it has the same units as the values being measured, unlike the MSE, whose units are squared. The RMSE can therefore be read as the typical distance of a data point from the fitted line, measured along the vertical axis.

Another metric used for evaluating the model is R². R² (R_squared) compares the squared error of the predictions with the squared error obtained when using the mean of the data values. It is computed as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$$

R² measures how well the model fits the data compared to simply predicting the mean of the data. This metric is usually between 0 and 1, and the bigger the value, the better the model fits the data. However, in some cases we can find negative R² values. Do not worry, this is completely normal: it happens when the model fits the data worse than the mean of the data would.
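
As a quick sanity check of the two formulas, here is a minimal sketch that computes both metrics by hand on made-up numbers and compares them with scikit-learn. The arrays `y_true`, `y_hat` and `y_bad` are illustrative values, not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, only meant to illustrate the formulas
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# RMSE: square root of the mean of the squared errors
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 - (squared error of the model) / (squared error of the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))  # both about 0.707
print(r2, r2_score(y_true, y_hat))                       # both about 0.9

# A model that does worse than the mean gives a negative R²
y_bad = y_true[::-1]
r2_bad = r2_score(y_true, y_bad)
print(r2_bad)  # -3.0
```

The last lines show the negative case: predictions that are further from the data than the mean produce an R² below zero.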


In [None]:
from sklearn.metrics import mean_squared_error,r2_score
from math import sqrt

Since we want to compare the methods against each other to find the most suitable one for this problem, we should define a data structure that will contain all this information so we can use it later. Therefore, I will declare a dictionary to contain the R² score and the RMSE, and dataframes that will contain the predictions.

In [None]:
Performance={}
Performance["Method"]=[]
Performance["R_squared (train)"]=[]
Performance["R_squared (test)"]=[]
Performance["RMSE (train)"]=[]
Performance["RMSE (test)"]=[]

In [None]:
Predictions=pd.DataFrame()
Training_Preds=pd.DataFrame()

One last step before we move to the next part. Our objective is to create a model capable of learning from a set of observations so that it can predict the medical care cost for new observations. Therefore, we want to measure the quality of fit (R²) of our model and its capacity to predict values for unknown subjects. Thus, we must divide our data into two parts: a training part (known information fed to the model so it can learn) and a testing part (information unknown to the model, used to evaluate its performance).

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.drop('southwest',axis=1).values,y.values,
                                                    test_size=0.2,random_state=0)

You might wonder why I dropped the southwest variable before creating the training and testing data. This is to avoid what is called the dummy variable trap. When a categorical variable is transformed into N dummy variables, we should always use only N-1 of them in the data, because the last one can be deduced from the others. Keeping it would add redundant information that badly affects the behavior of our model.
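
To see why one dummy column is redundant, here is a minimal sketch on a made-up `region` column (the toy dataframe and variable names are illustrative, not taken from the dataset). It uses `pd.get_dummies`, a pandas alternative to the manual loop used earlier, and shows that the N dummy columns always sum to 1, which duplicates the constant term of a linear model. Passing `drop_first=True` keeps N-1 columns; note that it drops the first category in sorted order, which is not necessarily `southwest`.

```python
import pandas as pd

# Toy column with the four region names (illustrative rows only)
toy = pd.DataFrame({"region": ["southwest", "southeast", "northwest",
                               "northeast", "southeast"]})

# N dummy columns: exactly one of them is 1 on each row
all_dummies = pd.get_dummies(toy["region"], dtype=int)
print(all_dummies.sum(axis=1).tolist())  # every row sums to 1

# N-1 dummy columns: the dropped category is encoded as "all zeros"
reduced = pd.get_dummies(toy["region"], drop_first=True, dtype=int)
print(list(reduced.columns))
```

Whichever column is dropped, the information is preserved: a row with zeros in all remaining columns belongs to the dropped region.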

# IV. Simple Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

SLR = LinearRegression()
SLR.fit(X_train,y_train)
y_pred=SLR.predict(X_test)
y_p_t =SLR.predict(X_train)
Performance["Method"].append('SLR')
Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
print('SLR',r2_score(y_test,y_pred),sqrt(mean_squared_error(y_test,y_pred)))
Predictions["Ground Truth"]=y_test
Training_Preds["Ground Truth"]=y_train
Predictions['SLR']=y_pred
Training_Preds["SLR"]=y_p_t

# V. Multiple Linear Regression

In [None]:
#we are going to eliminate variables that won't make a difference 
#we add the constant variable x_0
X_Ttrain = np.append(np.ones((len(X_train),1)).astype(int),X_train,1)
X_Ttest = np.append(np.ones((len(X_test),1)).astype(int),X_test,1)

## Step-by-Step Backward Elimination

In [None]:
import statsmodels.api as sm
#our X_optimal is initialized to X_Ttrain
X_opt=X_Ttrain[:,:]

In [None]:
#Step 1 :Fit the ALL IN model
model_MLR=sm.OLS(endog=y_train,exog=X_opt).fit()
model_MLR.summary()

In [None]:
Column_to_delete=2
columns_to_keep=[]
for elt in range(X_opt.shape[1]):
    if elt != Column_to_delete :
        columns_to_keep.append(elt)
X_opt = X_opt[:,columns_to_keep]
X_Ttest = X_Ttest[:,columns_to_keep]
X_opt.shape

In [None]:
model_MLR=sm.OLS(endog=y_train,exog=X_opt).fit()
model_MLR.summary()

In [None]:
Column_to_delete=5
columns_to_keep=[]
for elt in range(X_opt.shape[1]):
    if elt != Column_to_delete :
        columns_to_keep.append(elt)
X_opt = X_opt[:,columns_to_keep]
X_Ttest = X_Ttest[:,columns_to_keep]
X_opt.shape

In [None]:
model_MLR=sm.OLS(endog=y_train,exog=X_opt).fit()
model_MLR.summary()

In [None]:
Column_to_delete=5
columns_to_keep=[]
for elt in range(X_opt.shape[1]):
    if elt != Column_to_delete :
        columns_to_keep.append(elt)
X_opt = X_opt[:,columns_to_keep]
X_Ttest = X_Ttest[:,columns_to_keep]
X_opt.shape

In [None]:
model_MLR=sm.OLS(endog=y_train,exog=X_opt).fit()
model_MLR.summary()

In [None]:
Column_to_delete=5
columns_to_keep=[]
for elt in range(X_opt.shape[1]):
    if elt != Column_to_delete :
        columns_to_keep.append(elt)
X_opt = X_opt[:,columns_to_keep]
X_Ttest = X_Ttest[:,columns_to_keep]
X_opt.shape

In [None]:
model_MLR=sm.OLS(endog=y_train,exog=X_opt).fit()
model_MLR.summary()
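
The manual cells above repeat the same pattern: fit, inspect the summary, drop one column, refit. As a rough sketch, the loop below automates that pattern using only NumPy. The stopping rule is an assumption (the cut-off behind the manual steps is not stated): it drops the non-constant column with the smallest absolute t-statistic until all of them are at least 1.96, which roughly corresponds to p < 0.05 for large samples. The function name `backward_elimination` and the synthetic data are hypothetical.

```python
import numpy as np

def backward_elimination(X, y, t_threshold=1.96):
    """Return the indices of the columns kept.

    X is expected to have a column of ones in position 0 (the intercept),
    which is never removed.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        Xk = X[:, keep]
        # Ordinary least squares fit on the current columns
        beta, _, _, _ = np.linalg.lstsq(Xk, y, rcond=None)
        resid = y - Xk @ beta
        dof = Xk.shape[0] - Xk.shape[1]
        sigma2 = resid @ resid / dof
        # Standard errors of the coefficients, then absolute t-statistics
        cov = sigma2 * np.linalg.inv(Xk.T @ Xk)
        t_stats = np.abs(beta / np.sqrt(np.diag(cov)))
        # Weakest non-intercept column (position within `keep`)
        worst = 1 + int(np.argmin(t_stats[1:]))
        if t_stats[worst] >= t_threshold:
            break
        del keep[worst]
    return keep

# Synthetic demo: two informative columns and one pure-noise column
rng = np.random.default_rng(0)
n = 500
x1, x2, junk = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
X_demo = np.column_stack([np.ones(n), x1, x2, junk])
y_demo = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

kept = backward_elimination(X_demo, y_demo)
print(kept)
```

On the notebook's data, one would call it with `X_Ttrain` (which already has the constant column) and `y_train`, then apply the returned column indices to the test matrix as well.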

In [None]:
from sklearn.linear_model import LinearRegression
MLR=LinearRegression()
X_Ttrain = X_opt[:,1:]
X_Ttest = X_Ttest[:,1:]
MLR.fit(X_Ttrain,y_train)
y_pred = MLR.predict(X_Ttest)
y_p_t = MLR.predict(X_Ttrain)
Performance["Method"].append('MLR')
Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
print('MLR',r2_score(y_test,y_pred),sqrt(mean_squared_error(y_test,y_pred)))
Predictions['MLR']=y_pred
Training_Preds["MLR"]=y_p_t

# VI. Polynomial Regression

Polynomial regression is the same as doing a linear regression in which some of the independent variables are powers (square, cube, etc.) or products of the others.

This is literally what we are going to do. First, we create the vector of polynomial features. Then, we create a simple linear regression. We fit the model with the new input features. Finally, we apply the same transformation to the features of the testing set and we predict the outputs.

As we increase the degree of the polynomial features, the obtained model learns the training data almost perfectly. As a result, when the model faces new unknown data, it does not perform as intended. This is the overfitting trap.
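
To make this concrete, here is a small sketch of what `PolynomialFeatures` produces on a made-up row with two features, followed by a count of how many columns each degree generates. The count assumes the 8 input columns used here (9 variables minus the dropped dummy); with roughly 1070 training rows (80% of 1338), the higher degrees create more columns than there are training observations.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy observation with two features a=2 and b=3
toy = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(degree=2).fit_transform(toy)
print(expanded)  # columns: 1, a, b, a^2, a*b, b^2

# Number of polynomial columns for n features and degree d: C(n + d, d)
n_features = 8
degrees = [2, 3, 4, 5, 6]
counts = {deg: comb(n_features + deg, deg) for deg in degrees}
print(counts)
```

With more columns than training rows, a linear model has enough freedom to reproduce the training targets almost exactly, which is consistent with the overfitting described above.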

In the code below, I choose some values of the degree to see the effects of increasing the degree and how the model reacts to new data.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
degrees=[2,3,4,5,6]
for deg in degrees:
    Poly_Trans = PolynomialFeatures(degree=deg)
    X_poly_train = Poly_Trans.fit_transform(X_train)
    X_poly_test = Poly_Trans.transform(X_test)
    Poly_SLR=LinearRegression()
    Poly_SLR.fit(X_poly_train,y_train)
    y_pred = Poly_SLR.predict(X_poly_test)
    y_p_t = Poly_SLR.predict(X_poly_train)
    Performance["Method"].append('Poly_SLR_d={}'.format(deg))
    Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
    Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
    Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
    Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
    print('Poly_SLR_d={}'.format(deg),r2_score(y_test,y_pred),sqrt(mean_squared_error(y_test,y_pred)))
    Predictions['Poly_SLR_d={}'.format(deg)]=y_pred
    Training_Preds['Poly_SLR_d={}'.format(deg)]=y_p_t

# VII. Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor
DTR = DecisionTreeRegressor(random_state=0)
DTR.fit(X_train,y_train)
y_pred = DTR.predict(X_test)
y_p_t = DTR.predict(X_train)
Performance["Method"].append('DTR')
Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
print('DTR',r2_score(y_test,y_pred),sqrt(mean_squared_error(y_test,y_pred)))
Predictions['DTR']=y_pred
Training_Preds['DTR']=y_p_t

# VIII. Support Vector Regressors

## Linear SVR

In [None]:
from sklearn.svm import SVR
L_SVR = SVR(kernel='linear',epsilon=0.1,C=1e4)
L_SVR.fit(X_train,y_train)
y_pred = L_SVR.predict(X_test)
y_p_t = L_SVR.predict(X_train)
Performance["Method"].append('L_SVR')
Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
print('L_SVR',r2_score(y_test,y_pred),sqrt(mean_squared_error(y_test,y_pred)))
Predictions['L_SVR']=y_pred
Training_Preds['L_SVR']=y_p_t

## Gaussian SVR

Unlike the other SVR kernel used here, the Gaussian kernel (aka 'RBF') requires the input data to be scaled. Therefore, we start by preparing the scaled data.

In [None]:
from sklearn.preprocessing import MinMaxScaler

X_Scaler = MinMaxScaler()
X_train_s=X_Scaler.fit_transform(X_train)
X_test_s = X_Scaler.transform(X_test)

y_Scaler = MinMaxScaler()
y_train_s=y_Scaler.fit_transform(y_train.reshape(-1,1))
y_train_s=y_train_s.reshape(len(y_train),)
y_test_s=y_Scaler.transform(y_test.reshape(-1,1))
y_test_s=y_test_s.reshape(len(y_test),)
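
As a small illustration of what the scalers above do, here is a sketch on made-up numbers (the arrays `train` and `test` are illustrative). `MinMaxScaler` learns the minimum and maximum from the training data only, so test values outside that range are mapped outside [0, 1], and `inverse_transform` brings scaled values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

scaler = MinMaxScaler()
train_s = scaler.fit_transform(train)  # learns min=0 and max=10 from the training data
test_s = scaler.transform(test)        # reuses the same min and max

print(train_s.ravel())                          # scaled training values
print(test_s.ravel())                           # 20.0 lies outside the training range
print(scaler.inverse_transform(test_s).ravel()) # back to the original units
```

This is why the predictions made on scaled targets are passed through `y_Scaler.inverse_transform` before computing the metrics.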

In [None]:
from sklearn.svm import SVR
RBF_SVR = SVR(kernel='rbf',C=1e4,gamma='auto')
RBF_SVR.fit(X_train_s,y_train_s)
y_pred_s = RBF_SVR.predict(X_test_s)
y_pred = y_Scaler.inverse_transform(y_pred_s.reshape(-1,1))
y_p_t = y_Scaler.inverse_transform(RBF_SVR.predict(X_train_s).reshape(-1,1))
Performance["Method"].append('RBF_SVR')
Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
print('RBF_SVR',r2_score(y_test,y_pred),sqrt(mean_squared_error(y_test,y_pred)))
Predictions['RBF_SVR']=y_pred
Training_Preds['RBF_SVR']=y_p_t

# IX. K-Nearest Neighbors

K-nearest neighbors is a machine learning method based on a similarity measure. It consists of searching for and retrieving the k nearest neighbors (from the training set) of a new observation. The similarity is computed by means of a distance function (Euclidean distance or another one). Once the k nearest neighbors are retrieved, the method returns the mean value of their dependent variable. This value is taken as the prediction for the new observation.

Several parameters influence the k-NN performance, mainly the number of returned neighbors, the distance function, and the weights, if the input variables are weighted.
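
The procedure described above fits in a few lines of NumPy. The sketch below is a simplified illustration (Euclidean distance, no weights), not the scikit-learn implementation; the function `knn_predict` and the toy data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Predict the target of x_new as the mean target of its k nearest neighbors."""
    # Euclidean distance between x_new and every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy training set with one feature; the last point is an outlier
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([1.0, 2.0, 3.0, 100.0])
x_new = np.array([1.2])

print(knn_predict(X_toy, y_toy, x_new, k=1))  # 2.0, the target of the closest point
print(knn_predict(X_toy, y_toy, x_new, k=3))  # 2.0, the mean of 1, 2 and 3
print(knn_predict(X_toy, y_toy, x_new, k=4))  # 26.5, the mean of the whole training set
```

The last call shows what happens when k reaches the size of the training set: the prediction becomes the mean of all training targets, whatever the new observation is.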

In [None]:
from sklearn.neighbors import KNeighborsRegressor
#try several values of k
k_values =[1,10,20,50]
for k in k_values:
    k_nn=KNeighborsRegressor(n_neighbors = k)
    k_nn.fit(X_train_s, y_train_s)  
    y_pred_s=k_nn.predict(X_test_s)
    y_pred = y_Scaler.inverse_transform(y_pred_s.reshape(-1,1))
    y_p_t = y_Scaler.inverse_transform(k_nn.predict(X_train_s).reshape(-1,1))
    Performance["Method"].append('{}-NN'.format(k))
    Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
    Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
    Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
    Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
    print('{}-NN'.format(k),r2_score(y_test,y_pred),sqrt(mean_squared_error(y_test,y_pred)))
    Predictions['{}-NN'.format(k)]=y_pred
    Training_Preds['{}-NN'.format(k)]=y_p_t

# X. Radius Neighbors

Similar to k-nearest neighbors, the radius neighbors method searches the training set for the neighbors of a given new observation (from the test data). Unlike k-NN, radius neighbors only considers the neighbors whose distance from the observation is less than the provided radius. This makes the radius the most important parameter of the approach, along with the distance function and the weights. Another similarity between k-NN and radius neighbors is that the inputs should be on the same scale to obtain a coherent distance.

Choosing a large radius results in retrieving all the data points of the training set, so the predictions are equal to the average value of the training set. Choosing a very small radius causes some instances to find no neighbor within the radius, which makes the algorithm fail.
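
Both extremes can be reproduced with a simplified NumPy sketch (Euclidean distance, no weights). This is an illustration and not the scikit-learn implementation: the function `radius_predict` and the toy data are hypothetical, and returning NaN when no neighbor is found is a choice made for this sketch.

```python
import numpy as np

def radius_predict(X_train, y_train, x_new, radius):
    """Predict the target of x_new as the mean target of the neighbors within `radius`."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    inside = distances <= radius
    if not inside.any():
        # No training point is close enough: there is nothing to average
        return float("nan")
    return y_train[inside].mean()

X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([1.0, 2.0, 3.0, 100.0])
x_new = np.array([1.2])

print(radius_predict(X_toy, y_toy, x_new, radius=0.1))    # nan, no neighbor found
print(radius_predict(X_toy, y_toy, x_new, radius=1.0))    # 2.5, the mean of 2 and 3
print(radius_predict(X_toy, y_toy, x_new, radius=100.0))  # 26.5, the mean of the training set
```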

Below is the algorithm used for predicting the medical care cost of the testing set. I took the liberty of presenting several radius values so that the effects can be seen. The testing data obtained here allowed me to go down to a radius of 0.4 without any problems. In other cases, this value can cause the algorithm to crash. This method is a bit sensitive to how the testing data compare to the training set.

In [None]:
from sklearn.neighbors import RadiusNeighborsRegressor
R=[0.4,0.5,1,2,3]
for rad in R:
    RN = RadiusNeighborsRegressor(radius=rad)
    RN.fit(X_train_s,y_train_s)
    y_pred_s=RN.predict(X_test_s)
    y_pred = y_Scaler.inverse_transform(y_pred_s.reshape(-1,1))
    y_p_t = y_Scaler.inverse_transform(RN.predict(X_train_s).reshape(-1,1))
    Performance["Method"].append('{}-N'.format(rad))
    Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
    Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
    Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
    Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
    print('{}-N'.format(rad),r2_score(y_test,y_pred),sqrt(mean_squared_error(y_test,y_pred)))
    Predictions['{}-N'.format(rad)]=y_pred
    Training_Preds['{}-N'.format(rad)]=y_p_t

One can note that the R² score for radius = 3 is negative. This happens because the chosen radius is large enough that retrieving the neighbors returns all, or almost all, of the training set. Thus, the predicted value for any new observation is equal to the mean of the training set. The mean of the test set is not necessarily the same as that of the training set. Therefore, the obtained predictions are a worse fit for the test set than its own mean, which explains the negative R² value. If this is the case, then the R² for the training set will be equal to zero, since the predictions are the mean of the training set values. We will check this in the conclusion.
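
The argument above can be made explicit with a short derivation. Suppose the model predicts the same constant $c$ for every observation (here $c$ would be the training-set mean, assuming the radius covers the whole training set), and let $\bar{y}$ be the mean of the $N$ values being scored. The squared error splits as:

$$\sum_{i=1}^{N}(y_i - c)^2 = \sum_{i=1}^{N}(y_i - \bar{y})^2 + N(\bar{y} - c)^2$$

Plugging this into the definition of R² gives:

$$R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - c)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2} = -\frac{N(\bar{y} - c)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2} \le 0$$

So a constant prediction can never have a positive R². It is exactly zero when $c = \bar{y}$ (the training set scored against its own mean) and negative as soon as $c \neq \bar{y}$ (the test set, whose mean differs slightly from the training mean).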

# XI. Ensemble Regressors

Ensemble learning is a powerful concept in machine learning. As we saw before, each of the proposed regression methods has its own performance. So what happens when we combine all these methods? What happens if we use several models from the same method? This is the concept of ensemble learning: using several regression methods, several models from the same method, or both. Here we will first use Random Forest Regression, which is an ensemble learning method built from several Decision Tree Regressors. Then we will define our own ensemble regressor based on the previously defined methods. For methods where we defined more than one model, we will use the best one.

## Random Forest Regression

In [None]:
n_trees=100
from sklearn.ensemble import RandomForestRegressor
RFR = RandomForestRegressor(n_estimators=n_trees,random_state=0)
RFR.fit(X_train,y_train)
y_pred=RFR.predict(X_test)
y_p_t=RFR.predict(X_train)
Performance["Method"].append('RFR')
Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
print('RFR',r2_score(y_test,y_pred),sqrt(mean_squared_error(y_test,y_pred)))
Predictions['RFR']=y_pred
Training_Preds['RFR']=y_p_t

## Custom Ensemble Method

* In our ensemble method we will not include the polynomial support vector regressor, mainly because it takes a lot of time to train.
* We will not use the multiple linear regression model because it requires its own set of inputs.
* For polynomial linear regression we will not use any model because they require different inputs than any other method.
* For k-NN we will use the model with k=10.
* For the radius neighbors regressor we will use radius = 0.5.

One important remark: we will create two custom ensemble methods according to the type of inputs the methods need. That is, we will group together the methods that do not require scaling, and then group together all the methods that require scaling.
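
Before building the ensembles, here is a minimal sketch of how scikit-learn's `VotingRegressor` combines its members: with no weights, its prediction is the plain average of the individual predictions. The synthetic data and the two member models below are illustrative choices. Note also that `VotingRegressor.fit` trains fresh clones of the estimators it receives (stored in `estimators_`), so models fitted earlier are trained again inside the ensemble.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, only for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=50)

ensemble = VotingRegressor([("lin", LinearRegression()),
                            ("tree", DecisionTreeRegressor(random_state=0))])
ensemble.fit(X_demo, y_demo)

# Average the predictions of the fitted clones by hand
individual = np.array([est.predict(X_demo) for est in ensemble.estimators_])
manual_average = individual.mean(axis=0)

same = np.allclose(ensemble.predict(X_demo), manual_average)
print(same)
```

Because the result is an average, one very poor member can pull the ensemble down, which is a reason to pick the best configuration of each method.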

### Unscaled Features

In [None]:
from sklearn.ensemble import VotingRegressor
unscaling_Regressors=[("SLR",SLR), ("L_SVR",L_SVR), ("DTR",DTR), ("RFR",RFR)]
unscaling_Ensemble = VotingRegressor(unscaling_Regressors)
unscaling_Ensemble.fit(X_train, y_train)
y_pred = unscaling_Ensemble.predict(X_test)
y_p_t = unscaling_Ensemble.predict(X_train)
Performance["Method"].append('Unscaled_Ensemble')
Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
Predictions["Unscaled_Ensemble"]=y_pred
Training_Preds['Unscaled_Ensemble']=y_p_t

### Scaled Features

In [None]:
from sklearn.ensemble import VotingRegressor
k_nn = KNeighborsRegressor(n_neighbors = 10)
RN = RadiusNeighborsRegressor(radius=0.5)
scaling_Regressors=[('RBF_SVR',RBF_SVR), ('K-NN',k_nn), ('R-N',RN)]
scaling_Ensemble = VotingRegressor(scaling_Regressors)
scaling_Ensemble.fit(X_train_s,y_train_s)
y_pred_s = scaling_Ensemble.predict(X_test_s)
y_pred = y_Scaler.inverse_transform(y_pred_s.reshape(-1,1))
y_p_t = y_Scaler.inverse_transform(scaling_Ensemble.predict(X_train_s).reshape(-1,1))
Performance["Method"].append('scaled_Ensemble')
Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
Predictions["scaled_Ensemble"]=y_pred
Training_Preds['scaled_Ensemble']=y_p_t

## (bonus) All methods on scaled features 

Since all methods that can perform on unscaled data can also perform on scaled data, we combine here all the methods on scaled features.

In [None]:
regressors_models=[('RBF_SVR',RBF_SVR), ('K-NN',k_nn), ('R-N',RN), ("SLR",SLR), ("L_SVR",L_SVR),
                   ("DTR",DTR), ("RFR",RFR)]
All_Ensemble = VotingRegressor(regressors_models)
All_Ensemble.fit(X_train_s,y_train_s)
y_pred_s = All_Ensemble.predict(X_test_s)
y_pred = y_Scaler.inverse_transform(y_pred_s.reshape(-1,1))
y_p_t = y_Scaler.inverse_transform(All_Ensemble.predict(X_train_s).reshape(-1,1))
Performance["Method"].append('ALL_Ensemble')
Performance["R_squared (train)"].append(r2_score(y_train,y_p_t))
Performance["R_squared (test)"].append(r2_score(y_test,y_pred))
Performance["RMSE (train)"].append(sqrt(mean_squared_error(y_train,y_p_t)))
Performance["RMSE (test)"].append(sqrt(mean_squared_error(y_test,y_pred)))
Predictions["ALL_Ensemble"]=y_pred
Training_Preds['ALL_Ensemble']=y_p_t

# Conclusion

In [None]:
Performance = pd.DataFrame(Performance)
Performance

Before plotting, note that all the R² values are on the same scale except the one for polynomial regression with degree 6. This row is therefore eliminated.

In [None]:
# drop=True avoids keeping the old index as an extra column
Performance = Performance.drop(6).reset_index(drop=True)

In [None]:
plt.figure(figsize=(7,7))
Performance["R_squared (train)"].plot(color="blue",label="R² Train")
Performance["R_squared (test)"].plot(color="red",label="R² Test")
plt.xticks(Performance.index,Performance["Method"].tolist(),rotation=90)
plt.ylabel("R² Value")
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(7,7))
Performance["RMSE (train)"].plot(color="blue",label="RMSE Train")
Performance["RMSE (test)"].plot(color="red",label="RMSE Test")
plt.xticks(Performance.index,Performance["Method"].tolist(),rotation=90)
plt.ylabel("RMSE Value")
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(16,100))
i=1
Cols = Predictions.columns[1:]
for elt in Cols:
    plt.subplot(24,2,i)
    plt.scatter(Training_Preds.index,Training_Preds['Ground Truth'],color = 'blue', label='Ground Truth')
    Training_Preds[elt].plot(color='red',label="Training with {}".format(elt))
    plt.legend()
    i+=1
    plt.subplot(24,2,i)
    plt.scatter(Predictions.index,Predictions['Ground Truth'],color = 'blue', label='Ground Truth')
    Predictions[elt].plot(color='red',label="Predictions with {}".format(elt))
    plt.legend()
    i+=1
plt.show()

You can see in these plots the comparison between the fit in the training phase (the plot on the left) and the predictions in the testing phase (the plot on the right). From these plots and the table of R² and RMSE values for both phases, we can compare the methods and explain the obtained results.

* In the previous sections, we found that the Multiple Linear Regression (in which we excluded some of the variables) performs practically the same as the Simple Linear Regression: there is a very small loss of prediction accuracy, but nothing too serious. Does this mean we can use the reduced input features for all other methods without losing prediction accuracy? We do not know for sure. It will certainly help with the execution time, but one should try and see what happens. In general, reducing the dimension of the data is a good idea, since it can help us avoid the Garbage-In Garbage-Out trap.

* Polynomial regression tends to fit even the points that are far from the others (these observations are called outliers). But adapting too much to these outliers causes the model to lose its variability and its capacity to predict new unknown observations. We can see this happening when increasing the degree of the polynomial regression. At first we gain in accuracy and variability. Then at some point we start moving toward overfitting (with degree 6 we are certainly in an overfitting situation).

* Decision Tree regression and K-nearest neighbors (for k=1) seem to be able to learn outliers without losing variability in the testing phase.

* Linear Support Vector Regression is better than ordinary linear regression at learning outliers.

* For k-NN, when we start increasing the k parameter, we first tend toward better performance. But when k becomes big enough, we start to lose the variability of the method in both training and testing. Particularly in training, the model seems to move toward computing the mean of the observations.

* Radius Neighbors regression seems to have performance close to that of k-NN with k=1 and of decision tree regression. However, as we suspected, when the radius parameter takes large values, the number of observations within the radius grows and the model converges toward the mean of all observations. This mean is reached for a radius equal to 3.

* Random Forest and the other ensemble methods used here perform very well in both training and testing, in terms of variability, adaptability, capacity, and accuracy of the predictions.


These are not all the methods for regression. One can find other methods, especially among the ensemble methods; AdaBoost and GradientBoosting come to mind. The work here is not finished either. The parameters of these methods can be optimized further to obtain better results. However, one should always pay attention to the level of optimization to avoid falling into the overfitting trap. Moreover, it is also good to keep some variability in the model, just to be sure that the model will react to new data.

I hope you enjoyed it. Please let me know if you have any remarks. If you have any questions, I am willing to answer them.