# House Prices Prediction
##### Author: [Federico Sciuca](https://www.linkedin.com/in/federico-sciuca/)
  
  
<img src="https://miro.medium.com/max/1400/0*tdIkLF-rCqIbGWkn" title="Photo by Jamie Whiffen on Unsplash" height="680" width="680">

This is the second notebook I have produced since my introduction to Machine Learning in Python.
All comments and suggestions are welcome!

But let's start coding right away.

# <span id="0"></span> Contents Table

1. [Overview](#1)
1. [Importing Modules, Reading the Dataset and Defining an Evaluation Table](#2)
1. [Explore the columns and deal with missing values](#3)
    * [Handle Missing values: Mean and Frequency](#4)
1. [Data Visualization](#5)
1. [Correlation and Features selection](#6)
1. [Model development](#7)
    * [Split the dataset in train and test](#8)
    * [Simple Linear Regression](#9)
    * [Multiple Linear Regression - Top 5 Features](#10)
    * [Multiple Linear Regression - All Features](#11)
    * [Polynomial Regression - Top Feature](#12)
    * [Multiple Polynomial Regression - Top 5 Features](#13)
    * [Multiple Polynomial Regression - All Features](#14)
1. [Regularization](#15)
    * [Ridge Regression](#16)  
        * [Ridge Regression - Best Feature](#17)
        * [Ridge Regression - Top 5 Features](#18)
        * [Ridge Regression - All Features](#19)
    * [Lasso Regression](#20)
        * [Lasso Regression - Best Feature](#21)
        * [Lasso Regression - Top 5 Features](#22)
        * [Lasso Regression - All Features](#23)
1. [Decision Tree Regression](#24)
    * [Decision Tree Regression - Best Feature](#25)
    * [Decision Tree Regression - Top 5 Features](#26)
    * [Decision Tree Regression - All Features](#27)
1. [Multi-layer Perceptron Regressor](#28)
1. [Evaluation Table](#29)
1. [Conclusion](#30)

---

# <span id="1"></span> Overview
###### [Return Contents](#0)

Welcome to my Kernel. In this kernel I'll *practise* with different **Regression Models** and do my best to predict house prices with them.

To make the kernel more readable, I'll explain and comment on every model I use.
My previous studies mostly covered **Simple Linear Regression, Multiple Linear Regression and Polynomial Regression.**

To explore, evaluate and learn different methods, I'll use this kernel as an excuse to study the **scikit-learn library** in more depth.


# <span id="2"></span> Importing Modules, Reading the Dataset and Defining an Evaluation Table
###### [Return Contents](#0)
<hr>

The first thing to do is to import all the libraries and data we are going to use to explore the dataset and develop our models.
<br>
I'll also define a DataFrame dedicated to model evaluation. It will summarize the models we are going to build so that we can identify the best fit for our purpose.

In [None]:
# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
import missingno
import folium
%matplotlib inline

# Data and Statistics
import pandas as pd
import numpy as np
from scipy import stats

# Train and Test Preparation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# Preprocessing
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# Models
from sklearn.pipeline import Pipeline
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import RandomizedSearchCV

# Evaluation metrics
from sklearn.metrics import explained_variance_score
from sklearn.metrics import max_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_poisson_deviance
from sklearn.metrics import mean_gamma_deviance

df = pd.read_csv("../input/housesalesprediction/kc_house_data.csv")

evaluation = pd.DataFrame({'Model': [],
                           'Details':[],
                           'Max Error':[],
                           'Mean Absolute Error' : [],
                           'Mean Squared Error' : [],
                           'Mean Squared Log Error' : [],
                           'Median Absolute Error' : [],
                           'Mean Poisson Deviance' : [],
                           'Mean Gamma Deviance': [],
                           'Root Mean Squared Error (RMSE)':[],
                           'R-squared (training)':[],
                           'Adjusted R-squared (training)':[],
                           'R-squared (test)':[],
                           'Adjusted R-squared (test)':[],
                           '12-Fold Cross Validation':[]})

# Adjusted R-squared for a model with k predictors fitted on n observations
def adjustedR2(r2,n,k):
    return 1-(1-r2)*(n-1)/(n-k-1)

Now that we have imported the modules and libraries for this analysis and loaded the csv file, it's time to explore it!

In [None]:
df.head()

In [None]:
print("The dataset has", df.shape[0], "rows and", df.shape[1], "features.")

Let's see the data types we have:

In [None]:
df.dtypes

I'm not totally sure why some of the columns are floats instead of plain integers, but let's explore the columns and values in more depth.

> **What do 2.25 bathrooms look like?**

PS: This is not the question we want to answer through this analysis.

Let's see if there are missing values.

# <span id="3"></span> Explore the columns and deal with missing values
###### [Return Contents](#0)
<hr>

In [None]:
# Plot graphic of missing values
missingno.matrix(df, figsize = (30,5))

As we can see from the previous visualization, there are no missing values in the dataset, but let's look at other ways to check.

The first method uses the isnull() function, which converts each cell to a boolean. Summing the result shows how many null values (isnull = True = 1) each column contains.

In [None]:
# Convert each cell to a boolean with isnull() and sum the result per column.
df.isnull().sum()

A similar approach is a loop that prints, for each column in the dataset, the exact number of isnull()=True and isnull()=False values.

In [None]:
# Let's define a list of columns and the dataset in boolean version
col_list = df.columns.to_list()
df_isnull = df.isnull()

# Loop over the columns and print the counts we are looking for.
for col in col_list:
    print(col)
    print(df_isnull[col].value_counts())
    print("")

As we can see, all the columns have the same number of values and all of them are isnull()=False.
<br>
* **Can we say that there are no missing values?**
* **Is a missing value always expressed as *NULL*, or can it be expressed in other forms?**

<br>

Considering that a missing value can be expressed in multiple ways, and considering the data types, I'm going to use the pandas function .describe() to make sure there are no unusual maximum or minimum values in the columns, such as **9999999** or **0** in some essential columns.
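
As a complement to the summary statistics, placeholder values can also be counted directly. This is only a minimal sketch on a made-up toy DataFrame; the list `sentinels` is an assumption for illustration, not a property of the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up)
toy = pd.DataFrame({
    "bedrooms": [3, 0, 4, 2],
    "bathrooms": [2.25, 1.0, 0.0, 9999999],
    "price": [221900.0, 180000.0, 510000.0, 9999999.0],
})

# Hypothetical placeholder values that sometimes stand in for "missing"
sentinels = [0, 9999999]

# For each column, count how many cells are equal to one of the sentinels
sentinel_counts = toy.isin(sentinels).sum()
print(sentinel_counts)
```

A non-zero count is not a proof of missing data (a waterfront flag of 0 is perfectly valid), so each column still has to be judged on its own.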

In [None]:
df.describe()

## <span id="4"></span> Handle Missing values: Mean and Frequency
###### [Return Contents](#0)
<hr>

It is important to highlight that some records have **0** as the value for the columns *"bedrooms"* and *"bathrooms"*.
  
In my opinion it is important to analyze these records and to consider dropping these rows.

In [None]:
df_2 = df[(df['bedrooms'] == 0) | (df['bathrooms'] == 0)]
df_2

There are 16 rows that have **0** in at least one of the columns "bedrooms" and "bathrooms".  
  
There are three main types of missing data:
* **Missing completely at random (MCAR)**
* **Missing at random (MAR)**
* **Not missing at random (NMAR)**
  

Considering that ***16*** rows out of ***21613*** is just ***0.074%*** of the dataset, I think it would be better to drop these rows in order to have clean data for training the model.
  
**BUT**  
  
Which other options do we have to deal with those cells?
Here is a list of options:
1. Replace the missing values with the ***mean*** in the column
2. Replace the missing values with the ***most frequent value*** in the column
3. Imputation of the missing values using ***k-NN***

Just for learning purposes, let's implement the first two methods.
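
The third option is not implemented in this notebook. Just as a rough sketch of how it could look, assuming scikit-learn's `KNNImputer` and a small made-up DataFrame (the values and the choice `n_neighbors=2` are illustrative assumptions): the zeros are first turned into NaN, then each gap is filled with the average of the nearest rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (made up): 0 marks a missing bedroom/bathroom count
toy = pd.DataFrame({
    "bedrooms": [3, 4, 0, 2, 3],
    "bathrooms": [2.0, 2.5, 2.25, 1.0, 0.0],
    "sqft_living": [1800, 2400, 2000, 900, 1750],
})

# KNNImputer looks for NaN, so the zeros have to be converted first
cols = ["bedrooms", "bathrooms"]
toy_nan = toy.astype(float)
toy_nan[cols] = toy_nan[cols].replace(0, np.nan)

# Each NaN is replaced by the mean of that column over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy_nan), columns=toy.columns)
print(toy_imputed)
```

On the real data it would probably make sense to scale the columns first, because otherwise the distance is dominated by large-valued columns such as sqft_living.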

In [None]:
# Mean Calculation
print("The average number of bedrooms is:" , df['bedrooms'].mean(axis=0))
print("The average number of bathrooms is" , df['bathrooms'].mean(axis=0))

We are facing the same question I had before. What do 3.37 bedrooms look like? And 2.11 bathrooms?
  


In [None]:
# Most Frequent Value
print("The most frequent number of bedrooms is: " , df['bedrooms'].value_counts().idxmax())
print("The most frequent number of bathrooms is" , df['bathrooms'].value_counts().idxmax())

I'm pretty satisfied with the results: half bathrooms are quite uncommon in Italy, but they are more reasonable than 11% of a bathroom.
  
Let's replace the 0s in the dataset.

In [None]:
# Most Frequent Value
freq_bed = df['bedrooms'].value_counts().idxmax()
freq_bath = df['bathrooms'].value_counts().idxmax()

# Replace the values (assign back instead of using inplace on a column selection)
df['bedrooms'] = df['bedrooms'].replace(0, freq_bed)
df['bathrooms'] = df['bathrooms'].replace(0, freq_bath)

# Double check if there are other 0 values
# df[(df['bedrooms'] == 0) | (df['bathrooms'] == 0)].head()

## Number of bedrooms

I noticed that the dataset contains a record showing ***33 bedrooms*** and ***1.75 bathrooms***.
I use folium to identify the area where this house is located.
Comparing the size of the houses in the satellite view and on the street map, I strongly believe that this record is inaccurate.

In [None]:
df.loc[df['bedrooms'] > 12]

In [None]:
# Centre the map on the coordinates of the record; zoom level 18 is the default maximum
m = folium.Map(
    location=[47.6878, -122.331], zoom_start=18, tiles="OpenStreetMap")
m

So I decided to drop the row directly.

## Date format
  
Another thing I believe is important to change is the date format. The date would give us insight into the price trend across the years once we have divided the records into clusters.

In [None]:
# The raw dates look like 20141013T000000 (date, 'T', then hours, minutes, seconds)
df['date'] = pd.to_datetime(df['date'], format='%Y%m%dT%H%M%S')
df.head()

In [None]:
df.dtypes

In [None]:
# Drop the row where the 'bedrooms' value is 33
df.drop(df[df['bedrooms'] == 33].index, inplace=True)

Finally let's reset the index!

In [None]:
df.reset_index(drop=True, inplace=True)

Now the dataset looks clean and ready to be visualized!

<hr>

# <span id="5"></span> Data Visualization
###### [Return Contents](#0)

This is the trickiest part for me. I worked as a retoucher for the last two years, and I used to dedicate a lot of attention to colours and to how graphics look in general.  
  
Being at the start of my journey as a Data Scientist, I'm not yet good at building visualizations with ***matplotlib*** and ***seaborn***, but I'll do my best!

<hr>

First of all I want to plot some bar charts to visualize how frequently each number of bedrooms and bathrooms occurs in the dataset.

In [None]:
# Let's define a personal colour palette. This is something that I'm still working on in order to identify my graphic design as Data Scientist.
p_palette = ['#FCBB6D', "#D8737F", "#AB6C82", "#685D79", "#475C7A", "#F18C8E", "#F0B7A4", "#F1D1B5", "#568EA6", "#305F72"]
d_palette = ['#568EA6']

# Plot a bar chart with the number of bedrooms
# Sort the counts by number of bedrooms so that x and y stay aligned
bed_counts = df['bedrooms'].value_counts().sort_index()
plt.figure(figsize = (12, 6))
sns.barplot(x = bed_counts.index, y = bed_counts.values, palette = p_palette)
plt.xlabel("Number of bedrooms", fontsize = 14)
plt.ylabel("Count of Houses", fontsize = 14)
plt.title("Houses - Number of bedrooms distribution", fontsize = 18)
plt.show()

In [None]:
# Plot a bar chart with the number of Bathrooms 

# Define a new DataFrame with the number of bathrooms and the frequency
# (naming the index and the counts explicitly works across pandas versions)
bath_df = (df['bathrooms'].value_counts()
           .rename_axis('bathrooms')
           .reset_index(name='freq_b')
           .sort_values('bathrooms'))

# Plot the bar chart
plt.figure(figsize = (12, 6))

sns.barplot(x = 'bathrooms', y = 'freq_b', palette = p_palette, data = bath_df)
plt.xlabel("Number of bathrooms", fontsize = 14)
plt.ylabel("Count of Houses", fontsize = 14)
plt.title("Houses - Number of bathrooms distribution", fontsize = 18)
plt.show()

***How is the price distributed compared to the number of bedrooms?***  
***How is the price distributed compared to the number of bathrooms?***

In [None]:
# Let's plot a pairplot to show the different distributions.
# sns.pairplot(data=df, x_vars=df[['bedrooms', 'bathrooms']], y_vars = df['price'], kind='scatter')

plt.figure(figsize=(26,6))
sns.set_palette(d_palette)

# First plot - Bathrooms - Price
plt.subplot(1,2,1)
sns.scatterplot(x=df['bathrooms'], y=df['price'], data=df, palette=p_palette)
plt.xlabel('Bathrooms', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title("Price Distribution by bathrooms", fontsize=18)

# Second plot Bedrooms - Price
plt.subplot(1,2,2)
sns.scatterplot(x=df['bedrooms'], y=df['price'], data=df, palette=p_palette)
plt.xlabel('Bedrooms', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.title('Price Distribution by Bedrooms', fontsize=18)


In [None]:
# Group by number of bedrooms
df_mean_bed = df[['price','bedrooms']].groupby('bedrooms').mean()

# Reset index
df_mean_bed.reset_index(inplace=True)

# Calculate the price per bedroom
df_mean_bed['rate'] = df_mean_bed['price']/df_mean_bed['bedrooms']

# Define the figure dimensions
plt.figure(figsize=(26,6))

# First Plot
plt.subplot(1,2,1)
sns.barplot(x = df_mean_bed['bedrooms'], y = df_mean_bed['price'], palette = p_palette, data = df_mean_bed)
plt.xlabel('Number of bedrooms', fontsize = 14)
plt.ylabel('Price', fontsize = 14)
plt.title('Average price - bedrooms',fontsize = 18)

# Second Plot
plt.subplot(1,2,2)
sns.barplot(x = df_mean_bed['bedrooms'], y = df_mean_bed['rate'], palette = p_palette, data = df_mean_bed)
plt.xlabel('Number of bedrooms', fontsize = 14)
plt.ylabel('Price/bedrooms', fontsize = 14)
plt.title('Price/bedrooms rate',fontsize = 18)

In [None]:
# Group by bathrooms
df_mean_bath = df[['price','bathrooms']].groupby('bathrooms').mean()

# Reset index
df_mean_bath.reset_index(inplace=True)

# Calculate the price per bathroom
df_mean_bath['rate'] = df_mean_bath['price']/df_mean_bath['bathrooms']

# Define the figure dimensions
plt.figure(figsize=(26,6))

# Third Plot
plt.subplot(1,2,1)
sns.barplot(x = df_mean_bath['bathrooms'], y = df_mean_bath['price'], palette = p_palette, data = df_mean_bath)
plt.xlabel('Number of bathrooms', fontsize = 14)
plt.ylabel('Price', fontsize = 14)
plt.title('Average price - bathrooms',fontsize = 18)

# Fourth Plot
plt.subplot(1,2,2)
sns.barplot(x = df_mean_bath['bathrooms'], y = df_mean_bath['rate'], palette = p_palette, data = df_mean_bath)
plt.xlabel('Number of bathrooms', fontsize = 14)
plt.ylabel('Price/bathrooms', fontsize = 14)
plt.title('Price/bathrooms rate',fontsize = 18)

I think it is interesting to notice that the price per bedroom decreases as the number of bedrooms grows.  
The price per bathroom, instead, decreases between 0 and 2 bathrooms, stays almost constant at the minimum of the distribution between 2 and 3 bathrooms, and then increases. This behaviour of the market could indicate that having more than 3 bathrooms is considered non-essential, and for this reason the price rises.
We can try to verify this trend by calculating the price distribution for bathrooms and bedrooms combined.

In [None]:
# Define the dataset (copy it so the new columns are not added to a view of df)
df_bb_comb = df[['price', 'bedrooms', 'bathrooms']].copy()
df_bb_comb.reset_index(drop=True, inplace=True)

# Sum of bathrooms and bedrooms
df_bb_comb['bath_bed'] = df_bb_comb['bathrooms'] + df_bb_comb['bedrooms']
# Price rate for number of bathrooms+bedrooms
df_bb_comb['bb_rate'] = round((df_bb_comb['price']/df_bb_comb['bath_bed']), 1)

# Rate bathrooms/bedrooms
df_bb_comb['bath_bed_rate'] = round((df_bb_comb['bathrooms']/df_bb_comb['bedrooms']), 1)
# Price rate for bathrooms/bedrooms rate
df_bb_comb['price_bath_bed_rate'] = round((df_bb_comb['price']/df_bb_comb['bath_bed_rate']), 1)

df_bb_comb.head()

In [None]:
plt.figure(figsize=(26,8))

plt.subplot(2,1,1)
sns.boxplot(x = 'bath_bed', y = 'bb_rate', palette = p_palette, data = df_bb_comb)
plt.xlabel('Number of bathrooms and bedrooms', fontsize = 14)
plt.ylabel('Price rate', fontsize = 14)
plt.title('Price rate for bathrooms and bedrooms', fontsize=18)
plt.subplots_adjust(hspace = 0.5)

plt.subplot(2,1,2)
sns.boxplot(x = 'bath_bed_rate', y = 'price_bath_bed_rate', palette = p_palette, data = df_bb_comb)
plt.xlabel('Bathrooms/bedrooms rate', fontsize = 14)
plt.ylabel('Price rate', fontsize = 14)
plt.title('Price rate for bathrooms per bedrooms rate', fontsize=18)


In [None]:
# Age of the building at the time of the sale
df['age'] = df.date.dt.year - df['yr_built']
df[['date', 'yr_built', 'yr_renovated', 'age']].head()

How does the age of the building influence the price?

In [None]:
age_bins = [-2,10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 100000] 
labels = ['10-', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100', '100+']

df['age_binned'] = pd.cut(df['age'], age_bins, labels=labels, include_lowest=True)

In [None]:
# Average price for each age bin (only the price column is needed for the plot)
df_age_binned = df.groupby('age_binned', observed=False)['price'].mean().reset_index()

plt.figure(figsize=(26,6))

sns.barplot(x = df_age_binned['age_binned'], y = df_age_binned['price'], palette = p_palette, data = df_age_binned)
plt.xlabel('Age of the building', fontsize = 14)
plt.ylabel('Average price', fontsize = 14)
plt.title('Average price per building age', fontsize=18)

We could go further in visualizing and studying the data, but for now I'll stop here.

# <span id="6"></span> Correlation and Features Selection
###### [Return Contents](#0)

In [None]:
columns_name = ['price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15']

df_stand = df[columns_name]

scaler = StandardScaler()
scaler.fit(df_stand)
df_stand = scaler.transform(df_stand)
print(scaler)

df_stand = pd.DataFrame(df_stand,columns = columns_name)
df_stand.head()

In [None]:
corr = df_stand.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 16))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot = True)
plt.title('Heatmap of correlations', fontsize=18)

In [None]:
sns.set_palette(d_palette)
h = df[columns_name].hist(bins=25,figsize=(26,26), xlabelsize='10', ylabelsize='10')
sns.despine(left=True, bottom=True)
[x.title.set_size(14) for x in h.ravel()];
[x.yaxis.tick_left() for x in h.ravel()];

<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>

<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>

<p>It is important to know the difference between these two and that correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.</p>

### Pearson Correlation
<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Total positive linear correlation.</li>
    <li><b>0</b>: No linear correlation; the two variables may still be related in a non-linear way.</li>
    <li><b>-1</b>: Total negative linear correlation.</li>
</ul>
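
To make the definition concrete, here is a minimal sketch that computes the coefficient by hand as the sum of the products of the centred values divided by the square root of the product of their sums of squares. The function name `pearson_r` and the numbers are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson coefficient: sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()  # centred x
    y_c = y - y.mean()  # centred y
    return float((x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum()))

# Made-up living spaces and two exactly linear "prices"
sqft = [1000, 1500, 2000, 2500, 3000]
r_pos = pearson_r(sqft, [2 * s + 50 for s in sqft])   # increasing line
r_neg = pearson_r(sqft, [-3 * s for s in sqft])       # decreasing line
print(r_pos, r_neg)
```

An exactly increasing line gives a coefficient of 1 and an exactly decreasing line gives -1, whatever the slope is.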

<p>Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.</p>

<b>P-value</b>: 
<p>What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one measured if the two variables were actually uncorrelated. Normally, we choose a significance level of 0.05: a correlation is called statistically significant when its P-value is below 0.05, meaning that such a result would occur by chance less than 5% of the time if there were no real correlation.</p>

By convention, when the
<ul>
    <li>p-value is $<$ 0.001: we say there is strong evidence that the correlation is significant.</li>
    <li>p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>p-value is $<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>p-value is $>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>
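
The convention above can be written as a tiny helper. This is just a sketch: the function name `evidence_label` is made up, and treating a p-value of exactly 0.1 as "no evidence" is my own assumption, since the list does not cover that boundary.

```python
def evidence_label(p_value):
    """Map a p-value to the conventional strength of evidence."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"

# A few made-up p-values
for p in [0.0002, 0.03, 0.08, 0.4]:
    print(p, "->", evidence_label(p))
```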

To verify whether the correlation is statistically significant, I'm going to code a loop that passes through all the columns we need to analyze and prints the results.

In [None]:
# Collect one row per feature, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows = []

for col in columns_name[1:]:
    pearson_coef, p_value = stats.pearsonr(df_stand[col], df_stand['price'])
    rows.append({'variable': col, 'pearson_coef': pearson_coef, 'p_value': p_value})

pearson_an = pd.DataFrame(rows, columns=['variable', 'pearson_coef', 'p_value'])

pearson_an.sort_values(by='pearson_coef', ascending=False)

As we can see, some features have high Pearson coefficients and very low p-values.
In general, all the p-values confirm that the correlation is statistically significant, even where the correlation between the dependent and the independent variable is really low: with more than 21,000 records, even a weak correlation is unlikely to be due to chance.
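
To see why a low coefficient can still come with a tiny p-value, here is a small simulation on synthetic data (not the house dataset). The sample size matches the number of rows reported earlier; the 0.05 strength of the relationship and the seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(22)
n = 21613  # same order of magnitude as the house dataset

# y depends only very weakly on x: most of its variation is pure noise
x = rng.normal(size=n)
y = 0.05 * x + rng.normal(size=n)

r, p = stats.pearsonr(x, y)
print(f"Pearson coefficient = {r:.3f}, p-value = {p:.2e}")
```

The coefficient stays close to the weak value built into the simulation, while the p-value ends up far below 0.05: significance says the relationship is probably not zero, not that it is strong.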

# <span id="7"></span> Model Development
###### [Return Contents](#0)
<hr>

## <span id="8"></span> Split the dataset in train and test

First of all I need to split the dataset into train and test sets.
I'll use train_test_split from the scikit-learn library, with random_state = 22 in all my tests.
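
A quick sketch of what fixing random_state buys us, on ten made-up row labels instead of the real DataFrame: the same seed always produces the same 80/20 split, so the models are compared on identical data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(10)  # stand-in for ten records

# Two independent calls with the same seed
train_a, test_a = train_test_split(rows, train_size=0.8, random_state=22)
train_b, test_b = train_test_split(rows, train_size=0.8, random_state=22)

print("train:", train_a, "test:", test_a)
print("same split both times:", np.array_equal(train_a, train_b))
```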

In [None]:
# Train_test split using the original dataframe
train_data,test_data = train_test_split(df, train_size = 0.8, random_state = 22)

# Initialize a new LinearRegression model
lr = linear_model.LinearRegression()

# Identify the X_train and convert it to a 2D NumPy array (one column)
X_train = train_data['sqft_living'].to_numpy().reshape(-1,1)

# Identify the y_train and convert it to a NumPy array
y_train = train_data['price'].to_numpy()

# Train the model on X_train and y_train
lr.fit(X_train,y_train)

# Define X_test and y_test
X_test = test_data['sqft_living'].to_numpy().reshape(-1,1)
y_test = test_data['price'].to_numpy()

# Make a prediction on X_test
Yhat = lr.predict(X_test)

<hr>

## <span id="9"></span> Simple Linear Regression
###### [Return Contents](#0)

It's finally time to do some model development!  
  
I want to start developing a Simple Linear Regression model using the independent variable ***'sqft_living'*** because it presents the highest correlation with the price.

Simple Linear Regression is a method to help us understand the relationship between two variables:
  
* The predictor/independent variable (X)
* The response/dependent variable (that we want to predict)(Y)
  
The result of Linear Regression is a linear function that predicts the response (dependent) variable as a function of the predictor (independent) variable.

$$
 Y: Response \ Variable\\
 X: Predictor \ Variables
$$

**Linear function:**
$$
Yhat = a + b  X
$$

* $a$ refers to the intercept of the regression line, in other words: the value of Y when X is 0
* $b$ refers to the slope of the regression line, in other words: the value with which Y changes when X increases by 1 unit  
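
For a single predictor, the intercept a and the slope b of the linear function have a closed-form least-squares solution: b is the sum of the products of the centred X and Y values divided by the sum of the squared centred X values, and a is the mean of Y minus b times the mean of X. A minimal sketch with made-up numbers (not taken from the dataset):

```python
import numpy as np

# Made-up living spaces (sqft) and prices ($)
x = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([200000.0, 290000.0, 410000.0, 500000.0, 600000.0])

# Slope: how much Y changes when X increases by 1 unit
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

# Intercept: the value of Y when X is 0
a = y.mean() - b * x.mean()

print("slope b =", b)
print("intercept a =", a)
```

scikit-learn's LinearRegression solves the same least-squares problem, so on these numbers it should return the same slope and intercept.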
  
By convention in machine learning, you'll write the equation for a model slightly differently:

$$
y' = b + w_1 x_1
$$

where:

* \\( y'\\) is the predicted label (a desired output).
* \\(b\\) is the bias (the y-intercept), sometimes referred to as \\(w_0\\).
* \\(w_1\\) is the weight of feature 1. Weight is the same concept as the "slope" in the traditional equation of a line.
* \\(x_1\\) is a feature (a known input).

To infer (predict) the price \\(y'\\) for a new living-space value \\(x_1\\), just substitute the \\(x_1\\) value into this model.

Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (\\(w_1\\), \\(w_2\\), etc.). For example, a model that relies on three features might look as follows:  
  
$$
y' = b + w_1x_1 + w_2x_2 + w_3x_3
$$
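
The three-feature equation can be checked against scikit-learn directly. In this sketch the data are made up (the three columns only loosely stand for living space, bathrooms and grade): after fitting, the bias is stored in `intercept_` and the weights in `coef_`, and plugging them into the equation by hand should reproduce `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: three features per house and a price
X = np.array([
    [1000, 1, 6],
    [1500, 2, 7],
    [2000, 2, 7],
    [2500, 3, 8],
    [3000, 3, 9],
    [1800, 2, 8],
], dtype=float)
y = np.array([200000, 290000, 410000, 500000, 600000, 380000], dtype=float)

model = LinearRegression().fit(X, y)
b = model.intercept_   # bias
w = model.coef_        # one weight per feature

# y' = b + w_1 x_1 + w_2 x_2 + w_3 x_3 for a new, made-up house
x_new = np.array([2200.0, 2.0, 8.0])
manual = b + w[0] * x_new[0] + w[1] * x_new[1] + w[2] * x_new[2]
sklearn_pred = model.predict(x_new.reshape(1, -1))[0]

print(manual, sklearn_pred)
```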

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(lr.score(X_train, y_train),'.3f'))
rtesm = float(format(lr.score(X_test, y_test),'.3f'))
cv = float(format(cross_val_score(lr,df[['sqft_living']],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data: {:.3f}".format(y_test.mean()))
print('Intercept: {}'.format(lr.intercept_))
print('Coefficient: {}'.format(lr.coef_))

r = evaluation.shape[0]

evaluation.loc[r] = ['Simple Linear Regression','Best Feature', max_err, mabserr, msqerr, msqlogerr, medabserror,mpoisdev, mgamdev, rmsesm,rtrsm,'-',rtesm,'-',cv]
evaluation

In [None]:
plt.figure(figsize=(12,6))
plt.scatter(X_test,y_test,color="DarkBlue", label="Actual values", alpha=.1)
plt.plot(X_test,lr.predict(X_test),color='Coral', label="Predicted Regression Line")
plt.xlabel("Living Space (sqft)", fontsize=15)
plt.ylabel("Price ($)", fontsize=15)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.legend()

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)

<hr>

## <span id="10"></span> Multiple Linear Regression - Top 5 Features
###### [Return Contents](#0)

We want to predict the house price based on multiple variables.  
  
If we want to use more variables in our model to predict house price, we can use **Multiple Linear Regression**.  
  
Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and **two or more** independent variables.  
  
Most of the real-world regression models involve multiple predictors. We will illustrate the structure by using four predictor variables, but these results generalize to any number of predictors:

$$
Y: Response \ Variable\\
X_1 :Predictor\ Variable \ 1\\
X_2: Predictor\ Variable \ 2\\
X_3: Predictor\ Variable \ 3\\
X_4: Predictor\ Variable \ 4\\
$$

$$
a: intercept\\
b_1 :coefficients \ of\ Variable \ 1\\
b_2: coefficients \ of\ Variable \ 2\\
b_3: coefficients \ of\ Variable \ 3\\
b_4: coefficients \ of\ Variable \ 4\\
$$

The equation is given by

$$
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
$$  
  
We are going to use the top 5 independent variables selected by correlation:  

|  Variable| Pearson Coefficient |
|------|------|
|sqft_living  | 0.702035   |
|   grade  | 0.667434|
|   sqft_above  | 0.605567|
|   sqft_living15  | 0.585379|
|   bathrooms  | 0.525138|
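
A ranking like the one in the table can also be produced programmatically instead of being read off the Pearson table. The sketch below uses a synthetic DataFrame with invented column names, so that the expected ranking is known in advance; like the table, it sorts by the raw coefficient rather than by its absolute value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(22)
n = 500
signal = rng.normal(size=n)

# Synthetic data: each feature is the target plus a different amount of noise
toy = pd.DataFrame({
    "price": signal,
    "strong": signal + 0.1 * rng.normal(size=n),
    "medium": signal + 1.0 * rng.normal(size=n),
    "weak": signal + 5.0 * rng.normal(size=n),
    "noise": rng.normal(size=n),
})

# Pearson correlation of every feature with the target, highest first
corr_with_price = toy.corr()["price"].drop("price")
top_2 = corr_with_price.sort_values(ascending=False).head(2).index.tolist()

print(corr_with_price.sort_values(ascending=False))
print("Top 2 features:", top_2)
```

Sorting by the absolute value would be the safer choice when strongly negative correlations are possible.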

In [None]:
# We have train_data,test_data that include all the columns of our dataset.
top_5 = ['sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms']
# Select X_train and X_test
X_train = train_data[top_5]
X_test = test_data[top_5]

# Select y_train and y_test
y_train = train_data[['price']]
y_test = test_data[['price']]

# Initialize a LinearRegression model and fit it with the train data
mlr = linear_model.LinearRegression().fit(X_train, y_train)

# Make a prediction
Yhat = mlr.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(mlr.score(X_train, y_train),'.3f'))
artrcm = float(format(adjustedR2(mlr.score(X_train, y_train), X_train.shape[0], len(top_5)),'.3f'))
rtesm = float(format(mlr.score(X_test, y_test),'.3f'))
artecm = float(format(adjustedR2(mlr.score(X_test, y_test), X_test.shape[0], len(top_5)),'.3f'))
cv = float(format(cross_val_score(mlr, df[top_5], df['price'], cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
# Report the parameters of the model fitted in this section
print('Intercept: {}'.format(mlr.intercept_))
print('Coefficient: {}'.format(mlr.coef_))

r = evaluation.shape[0]

evaluation.loc[r] = ['Multiple Linear Regression','Top 5 Features by Pearson_coef', max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

The Multiple Linear Regression works better than the Simple Linear Regression, but there is room for improvement.  
  
We can visualise the results by plotting the distributions of Yhat and y_test.  
This, of course, is not a proof of accuracy, but it is still interesting to visualise.

In [None]:
plt.figure(figsize = (26,6))

# Kernel density estimates of the actual and the predicted prices
# (sns.distplot is deprecated, so sns.kdeplot is used instead)
ax1 = sns.kdeplot(x=np.ravel(y_test), label = 'Actual values', color = 'DarkBlue')
sns.kdeplot(x=np.ravel(Yhat), color='Orange', label = 'Predicted values', ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Multiple Linear Regression - Top 5 Features', fontsize=18)

<hr>

## <span id="11"></span> Multiple Linear Regression - All Features
###### [Return Contents](#0)

Now, let's see how much the model can improve the prediction including all the features we have previously selected.

In [None]:
# columns_name
all_features = columns_name[1:]

# Define X_train and X_test
X_train = train_data[all_features]
X_test = test_data[all_features]

# Define y_train and y_test
y_train = train_data['price']
y_test = test_data['price']

# Initiate a LinearRegression Model and Train it
aflrm = linear_model.LinearRegression().fit(X_train, y_train)

# Make a prediction
Yhat = aflrm.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(aflrm.score(train_data[all_features],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (aflrm.score
                 (train_data[all_features],
                  train_data['price']),train_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
rtesm = float(format(aflrm.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (aflrm.score
                 (test_data[all_features],
                  test_data['price']),
                 test_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
cv = float(format(cross_val_score(aflrm,df[all_features],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
print('Intercept: {}'.format(aflrm.intercept_))
print('Coefficient: {}'.format(aflrm.coef_))

r = evaluation.shape[0]

evaluation.loc[r] = ['Multiple Linear Regression','All Features from Pearson_coef table', max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

As we can see, including all the valuable features we have in the dataset increases the accuracy a lot.

In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Multiple Linear Regression - All Features', fontsize=18)

Comparing the Yhat distribution, this time, we can see that 

<hr>
## <span id="12"></span> Polynomial Regression - Top Feature
###### [Return Contents](#0)

Polynomial regression is a particular case of the general linear regression model or multiple linear regression models.

We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.

There are different orders of polynomial regression:  
  
**Quadratic - Second Order**
  
$$
Y' = a + b_1x + b_2x^2
$$
  
**Cubic - Third Order**
  
$$
Y' = a + b_1x + b_2x^2 + b_3x^3
$$  
  
**Higher Order**
$$
Y'= a + b_1x + b_2x^2 + b_3x^3 + b_4x^4...
$$
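
The orders above have one thing in common: each one is still linear in its coefficients, which is why polynomial regression counts as a special case of linear regression. The sketch below illustrates this in plain Python. The helpers `polynomial_features` and `predict` and all coefficient values are hypothetical, chosen only for this example.

```python
def polynomial_features(x, degree):
    """Return [x, x**2, ..., x**degree] for a single value x (no bias term)."""
    return [x ** k for k in range(1, degree + 1)]


def predict(intercept, coefficients, x):
    """Evaluate Y' = a + b_1*x + b_2*x**2 + ... as a plain weighted sum."""
    features = polynomial_features(x, degree=len(coefficients))
    return intercept + sum(b * f for b, f in zip(coefficients, features))


# Hypothetical coefficients, chosen only for illustration.
a = 1.0
quadratic = [2.0, 3.0]       # b_1, b_2
cubic = [2.0, 3.0, -1.0]     # b_1, b_2, b_3

print(polynomial_features(2.0, 3))  # [2.0, 4.0, 8.0]
print(predict(a, quadratic, 2.0))   # 1 + 2*2 + 3*4 = 17.0
print(predict(a, cubic, 2.0))       # 17 - 1*8 = 9.0
```

Once the powers of x are computed, the prediction is an ordinary weighted sum, so the same least-squares machinery used for the linear models can fit the coefficients.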
  
We saw earlier that a linear model did not provide the best fit while using sqft_living as the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.

Let's start defining the X_train, y_train, X_test, y_test as usual and a pipeline to process the data, define the model we are going to use and finally to train the model.

In [None]:
# Define X_train, y_train, X_test, y_test
X_train = train_data[['sqft_living']]
y_train = train_data['price']
X_test = test_data[['sqft_living']]
y_test = test_data['price']

# Define the pipeline input
Input = [('standardscaler', StandardScaler()), ('polynomial', PolynomialFeatures(degree=2, include_bias=False)), ('model', linear_model.LinearRegression())]

# Prepare the pipeline
pipe = Pipeline(Input)

# Fit the pipeline
pipe.fit(X_train, y_train)

# Make a prediction
Yhat = pipe.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(pipe.score(train_data[['sqft_living']],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (pipe.score
                 (train_data[['sqft_living']],
                  train_data['price']),train_data.shape[0],
                 len(['sqft_living'])
                ),'.3f')
              )
rtesm = float(format(pipe.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (pipe.score
                 (test_data[['sqft_living']],
                  test_data['price']),
                 test_data.shape[0],
                 len(['sqft_living'])
                ),'.3f')
              )
cv = float(format(cross_val_score(pipe,df[['sqft_living']],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
print('Intercept: {}'.format(pipe.named_steps['model'].intercept_))
print('Coefficient: {}'.format(pipe.named_steps['model'].coef_))

r = evaluation.shape[0]

evaluation.loc[r] = ['Polynomial Regression','Best Feature', max_err, mabserr, msqerr, msqlogerr, medabserror,mpoisdev, mgamdev, rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Polynomial Regression - Best Feature', fontsize=18)

<hr>
## <span id="13"></span>Multiple Polynomial Regression - Top 5 Features
###### [Return Contents](#0)

Let's train a Polynomial regression model using the top 5 features from the Pearson Coefficient.

In [None]:
# The top 5 features are stored into the top_5 list

# Train and test split
X_train = train_data[top_5]
y_train = train_data['price']
X_test = test_data[top_5]
y_test = test_data['price']

# Define the pipe's input
Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(degree = 2, include_bias = False)), ('linearRegression', linear_model.LinearRegression())]

# Define the pipe
pipe = Pipeline(Input)

# Train the model
pipe.fit(X_train, y_train)

# Make a prediction
Yhat = pipe.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(pipe.score(train_data[top_5],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (pipe.score
                 (train_data[top_5],
                  train_data['price']),train_data.shape[0],
                 len(top_5)
                ),'.3f')
              )
rtesm = float(format(pipe.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (pipe.score
                 (test_data[top_5],
                  test_data['price']),
                 test_data.shape[0],
                 len(top_5)
                ),'.3f')
              )
cv = float(format(cross_val_score(pipe,df[top_5],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
print('Intercept: {}'.format(pipe.named_steps['linearRegression'].intercept_))
print('Coefficient: {}'.format(pipe.named_steps['linearRegression'].coef_))

r = evaluation.shape[0]

evaluation.loc[r] = ['Multivariate Polynomial Regression','Top 5 Features by Pearson_coef', max_err, mabserr, msqerr, msqlogerr, medabserror,mpoisdev, mgamdev, rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Multivariate Polynomial Regression - Top 5 Features', fontsize=18)

Even though the Multiple Linear Regression is a simple model, it still performs better than the other implementations.

<hr>
## <span id="14"></span>Multiple Polynomial Regression - All Features
###### [Return Contents](#0)

This will be pretty straightforward because it follows exactly the same steps as the previous one, but I'm going to train it using a higher number of features.
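
Before training, it helps to know how many terms a degree-2 expansion produces. The sketch below counts them with the standard combinatorial formula. `n_polynomial_terms` is a hypothetical helper, and the value 18 is an assumption based on the number of variables in the Pearson table.

```python
from math import comb


def n_polynomial_terms(n_features, degree):
    """Number of polynomial terms of total degree 1..degree (bias excluded)."""
    return comb(n_features + degree, degree) - 1


# 1 = best feature, 5 = top 5 features, 18 = assumed size of all_features.
for n in (1, 5, 18):
    print(n, "input features ->", n_polynomial_terms(n, 2), "degree-2 terms")
```

With one feature the model fits 2 terms, with five it fits 20, and with 18 it fits 189. This is worth keeping in mind when reading the adjusted R-squared, which in this notebook is computed with the number of original features.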

In [None]:
# The variable that contain the list of features is all_features

# X_train, y_train, X_test, y_test
X_train = train_data[all_features]
y_train = train_data['price']
X_test = test_data[all_features]
y_test = test_data['price']

# Let's define the pipe's input
Input = [('scaler', StandardScaler()), ('polynomial', PolynomialFeatures(degree=2, include_bias=False)), ('LinearRegression', linear_model.LinearRegression())]

# Initialize the pipeline
pipe = Pipeline(Input)

# Train the pipeline
pipe.fit(X_train, y_train)

# Make a prediction
Yhat = pipe.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(pipe.score(train_data[all_features],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (pipe.score
                 (train_data[all_features],
                  train_data['price']),train_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
rtesm = float(format(pipe.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (pipe.score
                 (test_data[all_features],
                  test_data['price']),
                 test_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
cv = float(format(cross_val_score(pipe,df[all_features],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
print('Intercept: {}'.format(pipe.named_steps['LinearRegression'].intercept_))
print('Coefficient: {}'.format(pipe.named_steps['LinearRegression'].coef_))

r = evaluation.shape[0]

evaluation.loc[r] = ['Multivariate Polynomial Regression','All Features from Pearson_coef', max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Multivariate Polynomial Regression - All Features', fontsize=18)

**Now we are talking!**  
  
The Multivariate Polynomial Regression improved the 12-Fold Cross Validation score a lot.  
Another thing to notice is that the ***Root Mean Squared Error*** is becoming smaller and smaller.
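
The 12-Fold Cross Validation column used to rank the models is the mean of 12 scores, each computed on a fold of rows held out from fitting. The sketch below shows in plain Python how such folds can be laid out. `k_fold_indices` is a hypothetical helper, and the contiguous, unshuffled layout is my assumption about what `cross_val_score(..., cv=12)` does for a regressor, not something stated in this notebook.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional row each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# 25 hypothetical rows split into 12 folds.
folds = list(k_fold_indices(25, 12))
print([len(test) for _, test in folds])  # [3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
```

Every row ends up in exactly one test fold, so each model is scored on data it did not see while fitting that fold.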

<hr>
# <span id="15"></span>Regularization
###### [Return Contents](#0)

Regularization is a form of regression that constrains, regularizes, or shrinks the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
A simple relation for linear regression looks like this. Here Y represents the learned relation and β represents the coefficient estimates for the different variables or predictors (X).

$$
Y \approx \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p
$$

The fitting procedure involves a loss function, known as residual sum of squares or RSS. The coefficients are chosen, such that they minimize this loss function.

$$
RSS = \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2
$$

Now, this will adjust the coefficients based on your training data. If there is noise in the training data, then the estimated coefficients won’t generalize well to the future data. This is where regularization comes in and shrinks or regularizes these learned estimates towards zero.
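
As a small illustration of the loss function above, the sketch below computes the RSS for a one-predictor model and checks that the least-squares estimates minimise it. The data are hypothetical toy values, and `rss` and `least_squares` are helpers written only for this example.

```python
def rss(y, x, beta0, beta1):
    """Residual sum of squares for the one-predictor model y ~ beta0 + beta1*x."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))


def least_squares(x, y):
    """Closed-form least-squares estimates (intercept, slope) for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    return y_bar - beta1 * x_bar, beta1


# Hypothetical toy data, not taken from the dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)
best = rss(y, x, b0, b1)

# Moving either coefficient away from the estimate makes the RSS larger.
print(best < rss(y, x, b0, b1 + 0.1), best < rss(y, x, b0 + 0.1, b1))
```

Because the coefficients are tuned to exactly these training points, any noise in them is fitted too, which is the behaviour regularization is meant to dampen.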

<hr>
## <span id="16"></span>Ridge Regression
> Explanations by Prashant Gupta - [Regularization in Machine Learning](https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a)
###### [Return Contents](#0)

$$
\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p}\beta_j^2
$$

The expression above shows ridge regression, where the ***RSS is modified by adding the shrinkage quantity***. Now, the coefficients are estimated by minimizing this function. Here, ***λ is the tuning parameter that decides how much we want to penalize the flexibility of our model.*** The increase in flexibility of a model is represented by an increase in its coefficients, and if we want to minimize the above function, then these coefficients need to be small. This is how the Ridge regression technique prevents coefficients from rising too high. Also, notice that we shrink the estimated association of each variable with the response, except the intercept \\(\beta_0\\). This intercept is a measure of the mean value of the response when \\(x_{i1} = x_{i2} = \dots = x_{ip} = 0\\).  
  
*When λ = 0, the penalty term has no effect*, and the estimates produced by ridge regression will be equal to least squares. However, as ***λ→∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero.*** As can be seen, selecting a good value of λ is critical. Cross validation comes in handy for this purpose. The penalty used by this method is based on the ***L2 norm*** of the coefficients.  
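
The effect of λ can be seen in the simplest possible case, a single predictor, where the ridge estimate has a closed form. The sketch below uses hypothetical toy data, and `ridge_slope` is a helper written only for this illustration. In scikit-learn's `Ridge` the same tuning parameter is passed as `alpha`.

```python
def ridge_slope(x, y, lam):
    """Ridge estimate for one predictor: minimises RSS + lam * beta**2.

    The intercept is not penalised, so the data are centred first.
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / (sxx + lam)


# Hypothetical toy data, not taken from the dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# lam = 0 reproduces least squares; larger values pull the slope towards zero.
for lam in (0.0, 1.0, 10.0, 1000.0):
    print(f"lambda = {lam:>6}: beta = {ridge_slope(x, y, lam):.4f}")
```

The slope never reaches exactly zero for any finite λ; it only gets closer and closer.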
  
***The coefficients that are produced by the standard least squares method are scale equivariant,*** i.e. if we multiply each input by c then the corresponding coefficients are scaled by a factor of 1/c. Therefore, regardless of how the predictor is scaled, the multiplication of predictor and coefficient ($x_j\beta_j$) remains the same. ***However, this is not the case with ridge regression, and therefore, we need to standardize the predictors or bring the predictors to the same scale before performing ridge regression.*** The formula used to do this is given below.  
  

$$ \tilde x_{ij} = \frac{x_{ij}}{ \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar x_{j} )^2} }  $$ 
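
The sketch below illustrates why the scale matters, again with one predictor and hypothetical toy data. The same column is expressed in two different units (the conversion factor `c` is only a rough square feet to square metres value). `standardize` follows the formula above, which divides by the standard deviation without centring; scikit-learn's `StandardScaler` also subtracts the mean.

```python
from math import isclose, sqrt


def standardize(column):
    """Divide a column by its population standard deviation (formula above)."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [v / std for v in column]


def ridge_slope(x, y, lam):
    """One-predictor ridge slope on centred data (lam = 0 gives least squares)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / (sxx + lam)


# Hypothetical toy data, not taken from the dataset.
sqft = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
price = [300.0, 420.0, 560.0, 640.0, 800.0]
c = 0.0929                      # rough square feet -> square metres factor
sqm = [c * v for v in sqft]     # the same predictor in different units
lam = 10_000.0

# Least squares: the coefficient scales by 1/c, so x * beta is unchanged.
ols_same = isclose(ridge_slope(sqft, price, 0.0), c * ridge_slope(sqm, price, 0.0))

# Ridge on the raw columns: the fitted effect depends on the units.
ridge_raw_same = isclose(ridge_slope(sqft, price, lam), c * ridge_slope(sqm, price, lam))

# Ridge on standardized columns: the units no longer matter.
ridge_std_same = isclose(
    ridge_slope(standardize(sqft), price, lam),
    ridge_slope(standardize(sqm), price, lam),
)

print(ols_same, ridge_raw_same, ridge_std_same)  # prints: True False True
```

This is the reason every Ridge and Lasso pipeline in this notebook starts with a `StandardScaler` step.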
  
  
  
    
### Advantages and Disadvantages Of Ridge Regression

##### Advantages 
* Least squares regression doesn’t differentiate “important” from “less-important” predictors in a model, so it includes all of them. This leads to overfitting a model and failure to find unique solutions. Ridge regression avoids these problems.
* Ridge regression works in part because it doesn’t require unbiased estimators. Least squares produces unbiased estimates, but their variances can be so large that the estimates may be wholly inaccurate.
* Ridge regression adds just enough bias to make the estimates reasonably reliable approximations to true population values.
* One important advantage of ridge regression is that it still performs well, compared to the ordinary least squares method, in a situation where you have a large multivariate dataset with the number of predictors (p) larger than the number of observations (n).
* The ridge estimator is especially good at improving the least-squares estimate when multicollinearity is present.  

##### Disadvantages  
* Firstly ridge regression includes all the predictors in the final model, unlike the stepwise regression methods which will generally select models that involve a reduced set of variables.
* A ridge model does not perform feature selection. If greater interpretability is needed, and we want to reduce the signal in our data to a smaller subset of predictors, then a lasso model may be preferable.
* Ridge regression shrinks the coefficients towards zero, but it will not set any of them exactly to zero. The lasso regression is an alternative that overcomes this drawback.

<hr>
## <span id="17"></span>Ridge Regression - Best Feature
###### [Return Contents](#0)

Taking into consideration the advantages of this algorithm, I'm going to verify how its accuracy changes when adding multiple features, starting from the best one we have.

In [None]:
# Define train and test
X_train = train_data[['sqft_living']]
y_train = train_data['price']
X_test = test_data[['sqft_living']]
y_test = test_data['price']

# Input pipeline
Input = [('scaler', StandardScaler()), ('Ridge', linear_model.Ridge(alpha = 0.5, fit_intercept = True, random_state = 22))]

# Initialize the Pipeline
pipe = Pipeline(Input)

# Fit the pipeline
pipe.fit(X_train, y_train)

# Make a prediction
Yhat = pipe.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(pipe.score(train_data[['sqft_living']],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (pipe.score
                 (train_data[['sqft_living']],
                  train_data['price']),train_data.shape[0],
                 len(['sqft_living'])
                ),'.3f')
              )
rtesm = float(format(pipe.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (pipe.score
                 (test_data[['sqft_living']],
                  test_data['price']),
                 test_data.shape[0],
                 len(['sqft_living'])
                ),'.3f')
              )
cv = float(format(cross_val_score(pipe,df[['sqft_living']],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
print('Intercept: {}'.format(pipe.named_steps['Ridge'].intercept_))
print('Coefficient: {}'.format(pipe.named_steps['Ridge'].coef_))

r = evaluation.shape[0]

evaluation.loc[r] = ['Ridge Regression','Best Feature', max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

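At the moment the Ridge Regression is the worst model by 12-Fold Cross Validation, but I think it is interesting to notice that:  
  
* The **Mean Absolute Error** is the same as that of the current best model
* The **R-squared (training)** is the same as that of the current best model  

Identical values like these deserve a second look: they are exactly what the table would show if `Yhat` or `rtrsm` still held the previous model's results instead of this pipeline's.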


In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Ridge Regression - Best Feature', fontsize=18)

<hr>
## <span id="18"></span>Ridge Regression - Top 5 Features
###### [Return Contents](#0)

Exactly as I did for the previous models, I'm going to train the Ridge Regression model with the top 5 features we selected using the Pearson Coefficient and the P value.  
  
The top 5 features I'm about to use, are the following:
* sqft_living
* grade
* sqft_above
* sqft_living15
* bathrooms

In [None]:
# Define Train and test
X_train = train_data[top_5]
y_train = train_data['price']
X_test = test_data[top_5]
y_test = test_data['price']

# Define the Input for the pipeline
Input = [('scaler', StandardScaler()), ('Ridge_regression', linear_model.Ridge(alpha = 0.5, fit_intercept = True, random_state = 22))]

# Initialize the pipeline
pipe = Pipeline(Input)

# Train the Pipeline
pipe.fit(X_train, y_train)

# Make a prediction
Yhat = pipe.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(pipe.score(train_data[top_5],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (pipe.score
                 (train_data[top_5],
                  train_data['price']),train_data.shape[0],
                 len(top_5)
                ),'.3f')
              )
rtesm = float(format(pipe.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (pipe.score
                 (test_data[top_5],
                  test_data['price']),
                 test_data.shape[0],
                 len(top_5)
                ),'.3f')
              )
cv = float(format(cross_val_score(pipe,df[top_5],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
print('Intercept: {}'.format(pipe.named_steps['Ridge_regression'].intercept_))
print('Coefficient: {}'.format(pipe.named_steps['Ridge_regression'].coef_))

r = evaluation.shape[0]

evaluation.loc[r] = ['Ridge Regression','Top 5 Features by Pearson_coef', max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

That's curious! Adding features reduces the R-squared achieved on the training data but improves the 12-Fold Cross Validation score of the model!

In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Ridge Regression - Top 5 Features', fontsize=18)

<hr>
## <span id="19"></span>Ridge Regression - All Features
###### [Return Contents](#0)

Now it's time to include all the features we have available and see what happens! I expect a significant increase in the 12-Fold Cross Validation score, but I'm not confident it will be enough to perform better than the Multivariate Polynomial Regression I trained using the Top 5 Features.  
  
**Let's see!**  
  
The features I'm going to use are the same ones listed in the Pearson table:  
  
| Variable | Pearson Coefficient |
|------|------|
| sqft_living | 0.702035 |
| grade | 0.667434 |
| sqft_above | 0.605567 |
| sqft_living15 | 0.585379 |
| bathrooms | 0.525138 |
| view | 0.397299 |
| sqft_basement | 0.323812 |
| bedrooms | 0.316035 |
| lat | 0.306998 |
| waterfront | 0.266371 |
| floors | 0.256811 |
| yr_renovated | 0.126437 |
| sqft_lot | 0.089664 |
| sqft_lot15 | 0.082451 |
| yr_built | 0.054023 |
| condition | 0.036336 |
| long | 0.021637 |
| zipcode | -0.053209 |


In [None]:
# Train and test
X_train = train_data[all_features]
y_train = train_data['price']
X_test = test_data[all_features]
y_test = test_data['price']

# Define the Pipeline Input
Input = [('scale', StandardScaler()), ('Ridge', linear_model.Ridge(alpha = 0.5, fit_intercept = True, random_state = 22))]

# Initialize the Pipeline
pipe = Pipeline(Input)

# Fit the model
pipe.fit(X_train, y_train)

# Make a prediction
Yhat = pipe.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(pipe.score(train_data[all_features],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (pipe.score
                 (train_data[all_features],
                  train_data['price']),train_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
rtesm = float(format(pipe.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (pipe.score
                 (test_data[all_features],
                  test_data['price']),
                 test_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
cv = float(format(cross_val_score(pipe,df[all_features],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
print('Intercept: {}'.format(pipe.named_steps['Ridge'].intercept_))
print('Coefficient: {}'.format(pipe.named_steps['Ridge'].coef_))

r = evaluation.shape[0]

evaluation.loc[r] = ['Ridge Regression','All Features from Pearson_coef', max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

This is a bit surprising to be honest, but let's analyze the results!  
  
Ordering the table by the 12-Fold Cross Validation score, with values rounded to the third decimal, ranks the Ridge Regression model that uses all the available features just above the Ridge Regression that uses the Top 5 Features.

**But**  
  
* The following metrics would rank this approach to the problem as second in the list:
    * Mean Absolute Error
    * Mean Squared Error
    * Median Absolute Error
    * R-squared (training)
    * Adjusted R-squared (training)
    * R-squared (test)
    * Adjusted R-squared (test)

A near tie in cross-validation next to clearly better test metrics is worth double-checking: the cross-validation score is only comparable when it is computed on the same columns the pipeline was fitted on, here `all_features`.



In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Ridge Regression - All Features', fontsize=18)

<hr>
## <span id="20"></span>Lasso Regression
> Explanations by Prashant Gupta - [Regularization in Machine Learning](https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a)
###### [Return Contents](#0)

$$ \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p}|\beta_j| = RSS + \lambda \sum_{j=1}^{p}|\beta_{j}| $$  
 
 
Lasso is another variation, in which the above function is minimized. It's clear that ***this variation differs from ridge regression only in how it penalizes high coefficients.*** It uses \\(|\beta_j|\\) (the modulus) instead of the squares of \\(\beta_j\\) as its penalty. In statistics, this is ***known as the L1 norm.***  
  
Let's take a look at the above methods from a different perspective. *Ridge regression can be thought of as solving an equation where the sum of the squares of the coefficients is less than or equal to s. And the Lasso can be thought of as an equation where the sum of the moduli of the coefficients is less than or equal to s.* Here, s is a constant that exists for each value of the shrinkage factor \\(\lambda\\). ***These equations are also referred to as constraint functions.***  
  
***Suppose there are 2 parameters in a given problem.*** Then, according to the above formulation, ***ridge regression is expressed by \\(\beta_1^2 + \beta_2^2 \leq s \\).*** This implies that *ridge regression coefficients have the smallest RSS (loss function) for all points that lie within the circle given by \\(\beta_1^2 + \beta_2^2 \leq s \\).*  
  
Similarly, ***for lasso, the equation becomes \\(|\beta_1| + |\beta_2| \leq s \\).*** This implies that *lasso coefficients have the smallest RSS (loss function) for all points that lie within the diamond given by \\(|\beta_1| + |\beta_2| \leq s \\).*  
  
The image below describes these equations.

<img src="https://miro.medium.com/max/1400/1*XC-8tHoMxrO3ogHKylRfRA.png" title="Credit : An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani">
> Credit : An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
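
The practical difference between the circle and the diamond can be seen in a special case where both methods have a simple closed form: orthonormal predictors and no intercept. This is a simplifying assumption for illustration, not the situation in this dataset. The sketch below uses hypothetical least-squares coefficients, and `ridge_shrink` and `lasso_shrink` are helpers written only for this example.

```python
def ridge_shrink(b, lam):
    """Ridge solution for orthonormal predictors: proportional shrinkage."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """Lasso solution for orthonormal predictors: soft-thresholding at lam / 2."""
    if b > lam / 2:
        return b - lam / 2
    if b < -lam / 2:
        return b + lam / 2
    return 0.0


# Hypothetical least-squares coefficients, chosen only for illustration.
ols = [3.0, -1.5, 0.4, -0.2]
lam = 1.0

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

print("ridge:", ridge)  # [1.5, -0.75, 0.2, -0.1]: smaller, but none is zero
print("lasso:", lasso)  # [2.5, -1.0, 0.0, 0.0]: small coefficients become zero
```

Ridge scales every coefficient by the same factor, while lasso subtracts a fixed amount and sets anything smaller than that amount to exactly zero, which is how it ends up performing feature selection.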

<hr>
## <span id="21"></span>Lasso Regression - Best Feature
###### [Return Contents](#0)

As usual, I want to highlight how the model's performance changes when I add multiple features, starting from the best feature we have for explaining the price.

In [None]:
# Train and test
X_train = train_data[['sqft_living']]
y_train = train_data['price']
X_test = test_data[['sqft_living']]
y_test = test_data['price']

# Define the Input for the pipeline
Input = [('scale', StandardScaler()), ('Lasso', linear_model.Lasso(alpha = 0.5, precompute = False, random_state = 22))]

# Initialize the pipeline
pipe = Pipeline(Input)

# Train the model
pipe.fit(X_train, y_train)

# Make a prediction
Yhat = pipe.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(pipe.score(train_data[['sqft_living']],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (pipe.score
                 (train_data[['sqft_living']],
                  train_data['price']),train_data.shape[0],
                 len(['sqft_living'])
                ),'.3f')
              )
rtesm = float(format(pipe.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (pipe.score
                 (test_data[['sqft_living']],
                  test_data['price']),
                 test_data.shape[0],
                 len(['sqft_living'])
                ),'.3f')
              )
cv = float(format(cross_val_score(pipe,df[['sqft_living']],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
print('Intercept: {}'.format(pipe.named_steps['Lasso'].intercept_))
print('Coefficient: {}'.format(pipe.named_steps['Lasso'].coef_))

r = evaluation.shape[0]

evaluation.loc[r] = ['Lasso Regression','Best Feature', max_err, mabserr, msqerr, msqlogerr, medabserror,mpoisdev, mgamdev, rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

The Lasso Regression Model trained on the Best Feature achieves the same score as the Ridge Regression Model trained on the same independent variable.

In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Lasso Regression - Best Feature', fontsize=18)

<hr>
## <span id="22"></span>Lasso Regression - Top 5 Features
###### [Return Contents](#0)

In [None]:
# Train and test
X_train = train_data[top_5]
y_train = train_data['price']
X_test = test_data[top_5]
y_test = test_data['price']

# Define the Input for the pipeline
Input = [('scale', StandardScaler()), ('Lasso', linear_model.Lasso(alpha = 0.5, precompute = False, random_state = 22))]

# Initialize the pipeline
pipe = Pipeline(Input)

# Train the model
pipe.fit(X_train, y_train)

# Make a prediction
Yhat = pipe.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))#
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))#
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))#
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(pipe.score(train_data[top_5],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (pipe.score
                 (train_data[top_5],
                  train_data['price']),train_data.shape[0],
                 len(top_5)
                ),'.3f')
              )
rtesm = float(format(pipe.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (pipe.score
                 (test_data[top_5],
                  test_data['price']),
                 test_data.shape[0],
                 len(top_5)
                ),'.3f')
              )
cv = float(format(cross_val_score(pipe,df[top_5],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
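# Report the fitted Lasso step of the pipeline (coefficients refer to the scaled features)
print('Intercept: {}'.format(pipe.named_steps['Lasso'].intercept_))
print('Coefficient: {}'.format(pipe.named_steps['Lasso'].coef_))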

r = evaluation.shape[0]

evaluation.loc[r] = ['Lasso Regression','Top 5 Features by Pearson_coef', max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

Once again, the Lasso Regression Model and the Ridge Regression Model achieve the same results if trained on the same features available in this dataset.

In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Lasso Regression - Top 5 Features', fontsize=18)

<hr>
## <span id="23"></span>Lasso Regression - All Features
###### [Return Contents](#0)

Training the Lasso Regression Model on all the available features required, in this case, setting alpha = 1 and raising the maximum number of iterations to 50,000 for the solver to converge.  
  
* ***Alpha*** : Constant that multiplies the L1 term. Defaults to 1.0. <code> alpha = 0 </code> is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using <code> alpha = 0 </code> with the Lasso object is not advised. Given this, you should use the LinearRegression object.
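
To make the role of alpha more concrete, here is a minimal sketch on synthetic data (not the house prices dataset; the data, the list of alpha values and the `nonzero_by_alpha` dictionary are assumptions made only for illustration). It counts how many coefficients the Lasso keeps different from zero as alpha grows.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): only the first two of five features drive the target
rng = np.random.default_rng(22)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

nonzero_by_alpha = {}
for alpha in (0.01, 0.1, 1.0, 10.0):
    # Same pipeline layout used in this notebook: scaling followed by Lasso
    pipe = Pipeline([('scale', StandardScaler()),
                     ('Lasso', Lasso(alpha=alpha, max_iter=50000))])
    pipe.fit(X, y)
    coef = pipe.named_steps['Lasso'].coef_
    # Count the coefficients that the L1 penalty did not shrink to exactly zero
    nonzero_by_alpha[alpha] = int(np.sum(coef != 0))
    print('alpha =', alpha, 'coefficients =', np.round(coef, 3))
```

The larger the alpha, the stronger the L1 penalty: small alphas leave the model close to ordinary least squares, while a large enough alpha pushes every coefficient to zero.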

In [None]:
# Train and test
X_train = train_data[all_features]
y_train = train_data['price']
X_test = test_data[all_features]
y_test = test_data['price']

# Define the Input for the pipeline
Input = [('scale', StandardScaler()), ('Lasso', linear_model.Lasso(alpha = 1, precompute = False, max_iter = 50000, random_state = 22))]

# Initialize the pipeline
pipe = Pipeline(Input)

# Train the model
pipe.fit(X_train, y_train)

# Make a prediction
Yhat = pipe.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(pipe.score(train_data[all_features],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (pipe.score
                 (train_data[all_features],
                  train_data['price']),train_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
rtesm = float(format(pipe.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (pipe.score
                 (test_data[all_features],
                  test_data['price']),
                 test_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
cv = float(format(cross_val_score(pipe,df[all_features],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
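# Report the fitted Lasso step of the pipeline (coefficients refer to the scaled features)
print('Intercept: {}'.format(pipe.named_steps['Lasso'].intercept_))
print('Coefficient: {}'.format(pipe.named_steps['Lasso'].coef_))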

r = evaluation.shape[0]

evaluation.loc[r] = ['Lasso Regression','All Features from Pearson_coef', max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

The same observations made for the Ridge Regression Model trained on all the available features apply here.  
  
* The following metrics would rank this approach second in the list:
    * Mean Absolute Error
    * Mean Squared Error
    * Median Absolute Error
    * Root Mean Squared Error
    * R-squared (training)
    * Adjusted R-squared (training)
    * R-squared (test)
    * Adjusted R-squared (test)


In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Lasso Regression - All Features', fontsize=18)

<hr>
## <span id="24"></span>Decision Tree Regression
> Explanation by [Scikit-Learn](https://scikit-learn.org/stable/modules/tree.html#regression)
###### [Return Contents](#0)

***Decision Trees (DTs)*** are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.  
  
For instance, in the example below, decision trees learn from data to approximate a sine curve with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the fitter the model.  

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_tree_regression_0011.png" title="Decision Tree Regression">  
  
  
***Some advantages of decision trees are:***

* Simple to understand and to interpret. **Trees can be visualised.**

* **Requires little data preparation**. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed. Note however that this module **does not support missing values.**

* The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.

* **Able to handle both numerical and categorical data**. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.

* **Able to handle multi-output problems.**

* Uses a **white box model.** *If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic.* By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.

* **Possible to validate a model using statistical tests**. That makes it possible to account for the reliability of the model.

* Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
  
***The disadvantages of decision trees include:***
  
* Decision-tree learners **can create over-complex trees that do not generalise the data well**. **This is called overfitting**. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.

* Decision trees **can be unstable because small variations in the data might result in a completely different tree being generated**. *This problem is mitigated by using decision trees within an ensemble.*

* *The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts.* Consequently, **practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node.** ***Such algorithms cannot guarantee to return the globally optimal decision tree.*** This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

* There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.

* **Decision tree learners create biased trees if some classes dominate.** It is therefore recommended to balance the dataset prior to fitting with the decision tree.
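
The overfitting risk listed above can be seen on a small synthetic example. This is only a sketch under stated assumptions: the noisy sine data, the split and the two depth settings are hypothetical choices, not part of the house prices analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine curve (hypothetical data, similar in spirit to the scikit-learn example)
rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((400, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# A shallow tree versus a tree that is allowed to grow until every leaf is pure
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)
deep = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X_tr, y_tr)

for name, model in (('max_depth=3', shallow), ('unlimited depth', deep)):
    print(name,
          '| train R2 = %.3f' % model.score(X_tr, y_tr),
          '| test R2 = %.3f' % model.score(X_te, y_te))
```

The unlimited tree memorises the training set (training R-squared close to 1), and on noisy data like this its test score is typically lower than its training score. That gap between training and test scores is the symptom to look for in the depth charts of the next sections.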

<hr>
## <span id="25"></span>Decision Tree Regression - Best Feature
###### [Return Contents](#0)

Let's build a Decision Tree Regression Model and interpret the results, keeping in mind that overfitting is always around the corner!

To build the Decision Tree Regressors, I iterate over a range of depth levels to identify the one that achieves the best 12-Fold Cross Validation score.  
I'm also going to plot a chart that shows how a few key scores used to evaluate the model evolve with depth.  
  
The process identifies the best depth level automatically and uses it to train the final model.
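
As a side note, the same kind of depth search can be written more compactly with scikit-learn's `GridSearchCV`. The sketch below is an alternative under stated assumptions, not the loop used in this notebook: it runs on hypothetical synthetic data, and the variable name `search` is made up for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (hypothetical, stands in for sqft_living vs price)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X.ravel() + rng.normal(scale=2.0, size=300)

# Try every depth from 1 to 19 and score each one with 12-fold cross validation
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': list(range(1, 20))},
    cv=12,
    scoring='r2',
)
search.fit(X, y)

best_depth = search.best_params_['max_depth']
print('Best depth:', best_depth)
print('Best mean CV R2:', round(search.best_score_, 3))
```

With the default `refit=True`, `search.best_estimator_` is already a tree retrained on the whole input with the selected depth, so no separate final training step is needed in this variant.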

In [None]:
# Train and test.
X_train = train_data[['sqft_living']]
y_train = train_data['price']
X_test = test_data[['sqft_living']]
y_test = test_data['price']

# The Decision Tree Regression Model doesn't need data normalisation.

tree_depth = pd.DataFrame({'Model': [],
                           'Depth':[],
                           'Max Error':[],
                           'Mean Absolute Error' : [],
                           'Mean Squared Error' : [],
                           'Mean Squared Log Error' : [],
                           'Median Absolute Error' : [],
                           'Mean Poisson Deviance' : [],
                           'Mean Gamma Deviance': [],
                           'Root Mean Squared Error (RMSE)':[],
                           'R-squared (training)':[],
                           'Adjusted R-squared (training)':[],
                           'R-squared (test)':[],
                           'Adjusted R-squared (test)':[],
                           '12-Fold Cross Validation':[]})

# Initialize the model
for depth in range(1,20):
    tree = DecisionTreeRegressor(max_depth = depth)

    # Train the model
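    tree.fit(X_train, y_train)

    # Predict on the test set so the error metrics below describe this tree
    Yhat = tree.predict(X_test)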

    # Evaluation Metrics
    max_err = float(format(max_error(y_test, Yhat),'.3f'))
    mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
    msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
    #msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
    medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
    #mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
    #mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
    rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
    rtrsm = float(format(tree.score(train_data[['sqft_living']],train_data['price']),'.3f'))
    artrcm = float(format
                   (adjustedR2
                    (tree.score
                     (train_data[['sqft_living']],
                      train_data['price']),train_data.shape[0],
                     len(['sqft_living'])
                    ),'.3f')
                  )
    rtesm = float(format(tree.score(X_test, y_test),'.3f'))
    artecm = float(format
                   (adjustedR2
                    (tree.score
                     (test_data[['sqft_living']],
                      test_data['price']),
                     test_data.shape[0],
                     len(['sqft_living'])
                    ),'.3f')
                  )
    cv = float(format(cross_val_score(tree,df[['sqft_living']],df['price'],cv=12).mean(),'.3f'))

    r = tree_depth.shape[0]

    tree_depth.loc[r] = ['Decision Tree Regression',depth, max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]

tree_depth.sort_values(by = '12-Fold Cross Validation', ascending=False, inplace=True)
tree_depth.reset_index(drop = True, inplace = True)
tree_depth.head(3)

In [None]:
plt.figure(figsize=(10,6))

max_depth = int(max(tree_depth['Depth']))
# Cast to int: the 'Depth' column may be stored as float, and max_depth must be an integer
best_depth = int(tree_depth['Depth'][0])
max_cv_score = max(tree_depth['12-Fold Cross Validation'])


ax1 = sns.lineplot(x = tree_depth['Depth'], y = tree_depth['12-Fold Cross Validation'], color = 'Red', label="Cross Validation")
sns.lineplot(x = tree_depth['Depth'], y = tree_depth['R-squared (test)'], label='R-squared (test)', color='Green')
sns.lineplot(x = tree_depth['Depth'], y = tree_depth['R-squared (training)'], label='R-squared (training)', color="orange")

plt.xlabel('Max Depth Level', fontsize = 14)
plt.ylabel('Evaluation Score', fontsize = 14)
plt.title('Cross Validation Score per Depth Level', fontsize = 18)

In [None]:
# Train and test. Does it make sense to train a decision tree with one feature only?
X_train = train_data[['sqft_living']]
y_train = train_data['price']
X_test = test_data[['sqft_living']]
y_test = test_data['price']

# The Decision Tree Regression Model doesn't need data normalisation.
# Initialize the model
tree = DecisionTreeRegressor(max_depth = best_depth)

# Train the model
tree.fit(X_train, y_train)

# Make a prediction and keep it for the metrics below
Yhat = tree.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(tree.score(train_data[['sqft_living']],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (tree.score
                 (train_data[['sqft_living']],
                  train_data['price']),train_data.shape[0],
                 len(['sqft_living'])
                ),'.3f')
              )
rtesm = float(format(tree.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (tree.score
                 (test_data[['sqft_living']],
                  test_data['price']),
                 test_data.shape[0],
                 len(['sqft_living'])
                ),'.3f')
              )
cv = float(format(cross_val_score(tree,df[['sqft_living']],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
# A decision tree has no intercept or coefficients: report its size instead
print('Tree depth: {}'.format(tree.get_depth()))
print('Number of leaves: {}'.format(tree.get_n_leaves()))

r = evaluation.shape[0]

evaluation.loc[r] = ['Decision Tree Regression','Max Depth = {} Best Feature'.format(best_depth), max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

In [None]:
# Plot the results
plt.figure(figsize=(16,6))
plt.scatter(df[['sqft_living']], df[['price']],
            color="DarkBlue", label="Actual Values", alpha=0.1)
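# Sort by sqft_living so the predictions are drawn as one left-to-right curve
order = np.argsort(X_test['sqft_living'].values)
plt.plot(X_test['sqft_living'].values[order], np.asarray(Yhat)[order], color="Coral",
         label="Decision Tree", linewidth=1)
plt.xlabel("sqft_living")
plt.ylabel("price")
plt.title("Decision Tree Regression - Best Feature")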
plt.legend()
plt.show()


#plt.scatter(X_test,y_test,color="DarkBlue", label="Actual values", alpha=.1)
#plt.plot(X_test,lr.predict(X_test),color='Coral', label="Predicted Regression Line")

<hr>
## <span id="26"></span>Decision Tree Regression - Top 5 Features
###### [Return Contents](#0)

In [None]:
# Train and test.
X_train = train_data[top_5]
y_train = train_data['price']
X_test = test_data[top_5]
y_test = test_data['price']

# The Decision Tree Regression Model doesn't need data normalisation.

tree_depth = pd.DataFrame({'Model': [],
                           'Depth':[],
                           'Max Error':[],
                           'Mean Absolute Error' : [],
                           'Mean Squared Error' : [],
                           'Mean Squared Log Error' : [],
                           'Median Absolute Error' : [],
                           'Mean Poisson Deviance' : [],
                           'Mean Gamma Deviance': [],
                           'Root Mean Squared Error (RMSE)':[],
                           'R-squared (training)':[],
                           'Adjusted R-squared (training)':[],
                           'R-squared (test)':[],
                           'Adjusted R-squared (test)':[],
                           '12-Fold Cross Validation':[]})

# Initialize the model
for depth in range(1,20):
    tree = DecisionTreeRegressor(max_depth = depth)

    # Train the model
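    tree.fit(X_train, y_train)

    # Predict on the test set so the error metrics below describe this tree
    Yhat = tree.predict(X_test)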

    # Evaluation Metrics
    max_err = float(format(max_error(y_test, Yhat),'.3f'))
    mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
    msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
    #msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
    medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
    #mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
    #mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
    rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
    rtrsm = float(format(tree.score(train_data[top_5],train_data['price']),'.3f'))
    artrcm = float(format
                   (adjustedR2
                    (tree.score
                     (train_data[top_5],
                      train_data['price']),train_data.shape[0],
                     len(top_5)
                    ),'.3f')
                  )
    rtesm = float(format(tree.score(X_test, y_test),'.3f'))
    artecm = float(format
                   (adjustedR2
                    (tree.score
                     (test_data[top_5],
                      test_data['price']),
                     test_data.shape[0],
                     len(top_5)
                    ),'.3f')
                  )
    cv = float(format(cross_val_score(tree,df[top_5],df['price'],cv=12).mean(),'.3f'))

    r = tree_depth.shape[0]

    tree_depth.loc[r] = ['Decision Tree Regression',depth, max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]

tree_depth.sort_values(by = '12-Fold Cross Validation', ascending=False, inplace=True)
tree_depth.reset_index(drop = True, inplace = True)
tree_depth.head(3)

In [None]:
plt.figure(figsize=(10,6))

max_depth = int(max(tree_depth['Depth']))
# Cast to int: the 'Depth' column may be stored as float, and max_depth must be an integer
best_depth = int(tree_depth['Depth'][0])
max_cv_score = max(tree_depth['12-Fold Cross Validation'])


ax1 = sns.lineplot(x = tree_depth['Depth'], y = tree_depth['12-Fold Cross Validation'], color = 'Red', label="Cross Validation")
sns.lineplot(x = tree_depth['Depth'], y = tree_depth['R-squared (test)'], label='R-squared (test)', color='Green')
sns.lineplot(x = tree_depth['Depth'], y = tree_depth['R-squared (training)'], label='R-squared (training)', color="orange")

plt.xlabel('Max Depth Level', fontsize = 14)
plt.ylabel('Evaluation Score', fontsize = 14)
plt.title('Cross Validation Score per Depth Level', fontsize = 18)

In [None]:
# Train and test.
X_train = train_data[top_5]
y_train = train_data['price']
X_test = test_data[top_5]
y_test = test_data['price']

# The Decision Tree Regression Model doesn't need data normalisation.
# Initialize the model
tree = DecisionTreeRegressor(max_depth = best_depth)

# Train the model
tree.fit(X_train, y_train)

# Make a prediction and keep it for the metrics below
Yhat = tree.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(tree.score(train_data[top_5],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (tree.score
                 (train_data[top_5],
                  train_data['price']),train_data.shape[0],
                 len(top_5)
                ),'.3f')
              )
rtesm = float(format(tree.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (tree.score
                 (test_data[top_5],
                  test_data['price']),
                 test_data.shape[0],
                 len(top_5)
                ),'.3f')
              )
cv = float(format(cross_val_score(tree,df[top_5],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
# A decision tree has no intercept or coefficients: report its size instead
print('Tree depth: {}'.format(tree.get_depth()))
print('Number of leaves: {}'.format(tree.get_n_leaves()))

r = evaluation.shape[0]

evaluation.loc[r] = ['Decision Tree Regression','Depth = {} - Top 5 Features by Pearson_coef'.format(best_depth), max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Decision Tree Regression - Top 5 Features', fontsize=18)

<hr>
## <span id="27"></span>Decision Tree Regression - All Features
###### [Return Contents](#0)

In [None]:
# Train and test.
X_train = train_data[all_features]
y_train = train_data['price']
X_test = test_data[all_features]
y_test = test_data['price']

# The Decision Tree Regression Model doesn't need data normalisation.

tree_depth = pd.DataFrame({'Model': [],
                           'Depth':[],
                           'Max Error':[],
                           'Mean Absolute Error' : [],
                           'Mean Squared Error' : [],
                           'Mean Squared Log Error' : [],
                           'Median Absolute Error' : [],
                           'Mean Poisson Deviance' : [],
                           'Mean Gamma Deviance': [],
                           'Root Mean Squared Error (RMSE)':[],
                           'R-squared (training)':[],
                           'Adjusted R-squared (training)':[],
                           'R-squared (test)':[],
                           'Adjusted R-squared (test)':[],
                           '12-Fold Cross Validation':[]})

# Initialize the model
for depth in range(1,20):
    tree = DecisionTreeRegressor(max_depth = depth)

    # Train the model
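    tree.fit(X_train, y_train)

    # Predict on the test set so the error metrics below describe this tree
    Yhat = tree.predict(X_test)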

    # Evaluation Metrics
    max_err = float(format(max_error(y_test, Yhat),'.3f'))
    mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
    msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
    #msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
    medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
    #mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
    #mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
    rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
    rtrsm = float(format(tree.score(train_data[all_features],train_data['price']),'.3f'))
    artrcm = float(format
                   (adjustedR2
                    (tree.score
                     (train_data[all_features],
                      train_data['price']),train_data.shape[0],
                     len(all_features)
                    ),'.3f')
                  )
    rtesm = float(format(tree.score(X_test, y_test),'.3f'))
    artecm = float(format
                   (adjustedR2
                    (tree.score
                     (test_data[all_features],
                      test_data['price']),
                     test_data.shape[0],
                     len(all_features)
                    ),'.3f')
                  )
    cv = float(format(cross_val_score(tree,df[all_features],df['price'],cv=12).mean(),'.3f'))

    r = tree_depth.shape[0]

    tree_depth.loc[r] = ['Decision Tree Regression',depth, max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]

tree_depth.sort_values(by = '12-Fold Cross Validation', ascending=False, inplace=True)
tree_depth.reset_index(drop = True, inplace = True)
tree_depth.head(3)

In [None]:
plt.figure(figsize=(10,6))

max_depth = int(max(tree_depth['Depth']))
# Cast to int: the 'Depth' column may be stored as float, and max_depth must be an integer
best_depth = int(tree_depth['Depth'][0])
max_cv_score = max(tree_depth['12-Fold Cross Validation'])


ax1 = sns.lineplot(x = tree_depth['Depth'], y = tree_depth['12-Fold Cross Validation'], color = 'Red', label="Cross Validation")
sns.lineplot(x = tree_depth['Depth'], y = tree_depth['R-squared (test)'], label='R-squared (test)', color='Green')
sns.lineplot(x = tree_depth['Depth'], y = tree_depth['R-squared (training)'], label='R-squared (training)', color="orange")

plt.xlabel('Max Depth Level', fontsize = 14)
plt.ylabel('Evaluation Score', fontsize = 14)
plt.title('Cross Validation Score per Depth Level', fontsize = 18)

In [None]:
# Train and test.
X_train = train_data[all_features]
y_train = train_data['price']
X_test = test_data[all_features]
y_test = test_data['price']

# The Decision Tree Regression Model doesn't need data normalisation.
# Initialize the model
tree = DecisionTreeRegressor(max_depth = best_depth)

# Train the model
tree.fit(X_train, y_train)

# Make a prediction and keep it for the metrics below
Yhat = tree.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(tree.score(train_data[all_features],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (tree.score
                 (train_data[all_features],
                  train_data['price']),train_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
rtesm = float(format(tree.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (tree.score
                 (test_data[all_features],
                  test_data['price']),
                 test_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
cv = float(format(cross_val_score(tree,df[all_features],df['price'],cv=12).mean(),'.3f'))

print ("Average Price for Test Data:", y_test.mean())
# A decision tree has no intercept or coefficients: report its size instead
print('Tree depth: {}'.format(tree.get_depth()))
print('Number of leaves: {}'.format(tree.get_n_leaves()))

r = evaluation.shape[0]

evaluation.loc[r] = ['Decision Tree Regression','Depth = {} - All Features from Pearson_coef'.format(best_depth), max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Decision Tree Regression - All Features', fontsize=18)

<hr>
## <span id="28"></span>Multi-layer Perceptron Regressor
###### [Return Contents](#0)

The Multi-layer Perceptron Regressor is a class of feedforward Artificial Neural Network (ANN) that consists of at least 3 layers of nodes:
 * An input layer
 * A hidden layer
 * An output layer
  
Except for the input nodes, each hidden node is a neuron that uses a nonlinear activation function; in scikit-learn's `MLPRegressor` the output layer applies no activation (identity). This model optimizes the squared loss using LBFGS or stochastic gradient descent.
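
To illustrate the three layers described above, here is a minimal NumPy sketch of a single forward pass. The function name `mlp_forward`, the layer sizes and the random weights are assumptions for illustration only; the weights are not trained, and this is not how scikit-learn implements the model internally.

```python
import numpy as np

def mlp_forward(X, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer perceptron for regression."""
    # Hidden layer: linear combination of the inputs followed by tanh
    hidden = np.tanh(X @ W1 + b1)
    # Output layer: linear combination of the hidden units, no activation (identity)
    return hidden @ W2 + b2

# Hypothetical sizes: 4 samples, 3 input features, 5 hidden neurons, 1 output
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

pred = mlp_forward(X, W1, b1, W2, b2)
print(pred.shape)
```

Training consists of adjusting `W1`, `b1`, `W2` and `b2` so that the squared difference between `pred` and the target is as small as possible, which is what the solver (here `sgd`) does in the cell below.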

In [None]:
# train and test
X_train = train_data[all_features]
y_train = train_data['price']
X_test = test_data[all_features]
y_test = test_data['price']

# Define a pipeline
Input = [('scaler', StandardScaler()), ('MLPR', MLPRegressor(activation = 'tanh',
                                                            solver='sgd',
                                                            learning_rate = 'adaptive',
                                                            max_iter = 2000))]
pipe = Pipeline(Input)

# Train the model
pipe.fit(X_train, y_train)

# Make a prediction
Yhat = pipe.predict(X_test)

In [None]:
# Evaluation Metrics
max_err = float(format(max_error(y_test, Yhat),'.3f'))
mabserr = float(format(mean_absolute_error(y_test, Yhat),'.3f'))
msqerr = float(format(mean_squared_error(y_test, Yhat),'.3f'))
#msqlogerr = float(format(mean_squared_log_error(y_test, Yhat),'.3f'))
medabserror = float(format(median_absolute_error(y_test, Yhat),'.3f'))
#mpoisdev = float(format(mean_poisson_deviance(y_test, Yhat),'.3f'))
#mgamdev = float(format(mean_gamma_deviance(y_test, Yhat),'.3f'))
rmsesm = float(format(np.sqrt(mean_squared_error(y_test, Yhat)),'.3f'))
rtrsm = float(format(pipe.score(train_data[all_features],train_data['price']),'.3f'))
artrcm = float(format
               (adjustedR2
                (pipe.score
                 (train_data[all_features],
                  train_data['price']),train_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
rtesm = float(format(pipe.score(X_test, y_test),'.3f'))
artecm = float(format
               (adjustedR2
                (pipe.score
                 (test_data[all_features],
                  test_data['price']),
                 test_data.shape[0],
                 len(all_features)
                ),'.3f')
              )
cv = float(format(cross_val_score(pipe,df[all_features],df['price'],cv=12).mean(),'.3f'))

print("Average Price for Test Data:", y_test.mean())
# No intercept/coefficient printout here: `lr` belongs to the earlier linear
# regression sections and does not describe the MLP pipeline.

r = evaluation.shape[0]

evaluation.loc[r] = ['Multi-layer Perceptron Regressor','All Features from Pearson_coef', max_err, mabserr, msqerr, '-', medabserror,'-', '-', rmsesm,rtrsm,artrcm,rtesm,artecm,cv]
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)
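
The `adjustedR2` helper used in the cell above is defined earlier in the notebook. For reference, here is a common formulation of the adjusted R², which I assume matches that helper's `(r2, n, k)` argument order. The name `adjusted_r2_sketch` is hypothetical and chosen so that it does not overwrite the original function.

```python
def adjusted_r2_sketch(r2, n, k):
    """Adjusted R^2 for n samples and k features: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The penalty grows with the number of features k relative to the sample size n
print(adjusted_r2_sketch(0.8, 50, 10))
```

The adjusted score is never higher than the plain R², which is why it is a fairer way to compare models that use a different number of features.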


In [None]:
plt.figure(figsize = (26,6))

ax1 = sns.distplot(y_test, label = 'Actual values', color = 'DarkBlue', hist=False, bins=50)
sns.distplot(Yhat, color='Orange', label = 'Predicted values', hist=False, bins=50, ax=ax1)
plt.xlabel('Price distribution', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Yhat and y_test distribution comparison - Multi-layer Perceptron Regressor - All Features', fontsize=18)

<hr>
## <span id="29"></span>Evaluation Table
###### [Return Contents](#0)

It's finally time to select the best model!  
  
As we can see from the table below, the Multivariate Polynomial Regression works pretty well, considering not only the 12-Fold Cross Validation score but also the other metrics in the table.  
  
Second place goes to the Multi-layer Perceptron Regressor in terms of 12-Fold Cross Validation score, but I think it is important to notice that its other metrics are not that good.  
For example, its Max Error is 3.93 times bigger than the one recorded by the Multivariate Polynomial Regression.

Anyway, I'm wondering what score the Multivariate Polynomial Regression can achieve after a bit of work on the parameters.
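
One possible way to start working on the parameters is a grid search over the polynomial degree. The sketch below is only an assumption about how this could be done: it uses synthetic data instead of the house dataset, and the names `X_demo`, `y_demo` and `poly_pipe` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the house data (assumption): a quadratic target with noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

poly_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('lr', LinearRegression()),
])

# Try several polynomial degrees and keep the one with the best cross-validated R^2
search = GridSearchCV(poly_pipe, param_grid={'poly__degree': [1, 2, 3]}, cv=5, scoring='r2')
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

On the real data, the same idea would apply to `df[all_features]` and `df['price']`, with `cv=12` to stay comparable with the evaluation table.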

In [None]:
evaluation.sort_values(by = '12-Fold Cross Validation', ascending=False)

<hr>
## <span id="30"></span>Conclusion
###### [Return Contents](#0)

When we look at the evaluation table, the 2nd degree polynomial regression with all the features is the best. 
  
Studying other public Kernels on this dataset, I noticed that other Data Scientists used different methods or suggested, for example, a Polynomial Ridge Regression, but my results with that model are pretty bad.

Anyway!
  
I hope you enjoyed my Kernel and <font color="green">if you liked it, please do not forget to UPVOTE 🙂</font>

And ***Keep Coding!***

<img src="https://media.giphy.com/media/PiQejEf31116URju4V/giphy.gif">