____

<center><h1> Library Imports & User Defined Functions </h1></center>

_____

We begin by importing the Python libraries that will help us predict sale prices for homes in King County, Washington. The list may look random, but it is grouped logically: data handling, visualisation, preprocessing and metrics, and finally the models themselves.

`pandas`, `numpy` and `datetime` are foundational libraries - regardless of what we do with the data, it's hard to imagine any analytic exercise that doesn't use them in some measure. Likewise, the visualisation libraries `seaborn` and `matplotlib` are used throughout for plotting. I've seen experienced data scientists also use `plotly`, but as I'm not fluent in this library I've steered clear :).

In [None]:
import pandas as pd
import numpy as np
import datetime as dt

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.neural_network import MLPRegressor
from mlxtend.regressor import StackingCVRegressor

In addition to the libraries imported above, I typically build the following functions every time I attempt a regression problem. I have tried my hand at a few of these problems, and I find that these user defined functions are incredibly helpful with both exploratory data analysis and data cleaning.

The first function succinctly summarises the key attributes of a dataframe, split across numeric and categorical features. This isn't a 'scientific' split: categorical is defined here as anything non-numeric, so take it with a pinch of salt. Regardless, it is useful in helping us decide what we need to do with the data.

In [None]:
def df_characteristics(df):
    
    print('The shape of this dataframe is: {}'.format(df.shape), '\n')
    
    df_num = df.select_dtypes(include=[np.number])
    print('This dataframe has {} numeric features.'.format(df_num.shape[1]), '\n')
    print(df_num.columns, '\n')
    
    df_cat = df.select_dtypes(exclude=[np.number])
    print('This dataframe has {} categorical features.'.format(df_cat.shape[1]), '\n')
    print(df_cat.columns)
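
To illustrate why the numeric / non-numeric split needs a pinch of salt, here is a minimal sketch on a made-up two-row dataframe. The helper name `split_feature_types` and the toy values are assumptions for illustration only; it uses the same `select_dtypes` calls as the function above but returns lists instead of printing.

```python
import numpy as np
import pandas as pd

def split_feature_types(df):
    """Return (numeric_columns, non_numeric_columns) as plain lists."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    other_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()
    return numeric_cols, other_cols

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    'price': [221900.0, 538000.0],
    'bedrooms': [3, 3],
    'zipcode': [98178, 98125],                        # numeric dtype, arguably categorical
    'date': ['20141013T000000', '20141209T000000'],   # stored as text
})

numeric_cols, other_cols = split_feature_types(toy)
print(numeric_cols)
print(other_cols)
```

Note that `zipcode` lands in the numeric bucket purely because of its dtype, which is exactly the kind of case the split cannot judge for us.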

The other function which I always use is a `check_null_values` function, which is designed to trawl through the data and return a dataframe (which I've called the `nanframe` for hopefully obvious reasons) which shows not only the fields in the dataframe carrying blank (`NaN`) values, but also the proportion of such values relative to the size of the data. This is a very useful statistic that helps us decide how to handle null values. As a rule of thumb, I drop anything where the proportion of `NaN` records is over 50% of the population of that field, as it doesn't seem intrinsically useful to estimate data for the majority of the feature's population.

In [None]:
def check_null_values(df):
    
    nanframe = pd.DataFrame((df.isnull().sum() / len(df)) * 100)
    nanframe.columns = ['NaN(%)']
    nanframe['Blank_Record_Counts'] = pd.DataFrame(df.isnull().sum())
    nanframe = nanframe[nanframe['Blank_Record_Counts'] != 0]
    return nanframe.sort_values(by='NaN(%)', ascending=False).reset_index()
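
The 50% rule of thumb mentioned above can be sketched as follows. This is an illustration on a made-up dataframe, not part of the original kernel; the column names `a`, `b`, `c` and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different levels of missingness
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],   # 75% missing
    'b': [1.0, 2.0, np.nan, 4.0],         # 25% missing
    'c': [1.0, 2.0, 3.0, 4.0],            # complete
})

# Percentage of NaN values per column
nan_share = toy.isnull().mean() * 100

# Drop any column where more than half of the records are blank
cols_to_drop = nan_share[nan_share > 50].index.tolist()
trimmed = toy.drop(columns=cols_to_drop)

print(cols_to_drop)
print(trimmed.columns.tolist())
```

Columns below the cut-off (such as `b` here) are kept and would still need an imputation decision.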

The last one is quite specific to a regression exercise. I originally found this on Stack Overflow, have adapted it ever since, and can't imagine solving a regression problem without it. The function addresses multicollinearity by identifying feature variables that are highly correlated with each other. For this exercise, 'highly correlated' means an absolute correlation of 0.7 or more, which is a commonly used rule of thumb rather than a hard rule.

In [None]:
def remove_collinear_features(x, threshold):
    
    # Create correlation matrix:
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []
    
    # Work through the iterations setup:
    for i in iters:
        for j in range(i+1):
            items = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = items.columns
            row = items.index
            val = abs(items.values)
            
            # Compare against threshold:
            if val >= threshold:
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])
                
    # De-duplicate the flagged columns and drop them from the features:
    cols_to_drop = list(set(drop_cols))
    x = x.drop(columns=cols_to_drop)
    
    return x
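
For comparison, the same screening idea can be written without the nested loops by masking the upper triangle of the correlation matrix. This is a sketch under my own assumptions (the name `collinear_columns` and the toy data are hypothetical); like the function above, it flags a column when it correlates at or above the threshold with any column that appears before it.

```python
import numpy as np
import pandas as pd

def collinear_columns(x, threshold):
    """List columns whose absolute correlation with an earlier column meets the threshold."""
    corr = x.corr().abs()
    # Keep only entries strictly above the diagonal; everything else becomes NaN
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] >= threshold).any()]

# Hypothetical toy data: one duplicated signal and one independent column
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'sqft': base,
    'sqft_copy': base * 2 + 1,       # linear function of sqft
    'noise': rng.normal(size=200),   # unrelated to the other two
})

flagged = collinear_columns(toy, 0.7)
print(flagged)
```

Only the later column of a correlated pair is flagged, so column order decides which of the two survives.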

____

<center><h1> Data Sourcing </h1></center>

_____

In [None]:
data = pd.read_csv('../input/kc-house-data/kc_house_data.csv')

____

<center><h1> Exploratory Data Analysis </h1></center>

_____

Let's start by applying our first UDF to explore the dataframe's characteristics.

In [None]:
df_characteristics(data)

Let's now supplement this with the `pandas` `describe()` function.

In [None]:
data.describe()

As shown above, the data seems to be very much numerical at first glance. However, on a closer look, the year fields `yr_built` and `yr_renovated` are not really numeric features but temporal variables. Arguably `zipcode` should also be classed as a categorical feature, but I am probably overthinking now :).

That said, the `yr_built` and `yr_renovated` features enable us to think about a new feature - `Yrs_since_refurb`. It stands to reason that a recently renovated house tends to be worth more than one which was renovated, say, 10 years ago. We can very easily create this feature using `np.where()`: where `yr_renovated` is 0 (treated here as never renovated) we count the years from `yr_built`, otherwise from `yr_renovated`.

In [None]:
data['Yrs_since_refurb'] = np.where(data['yr_renovated'] == 0, (2020 - data['yr_built']), (2020 - data['yr_renovated']))
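
A quick sanity check of the `np.where()` logic on two made-up houses. The values and the constant name `REFERENCE_YEAR` are assumptions for illustration; the reference year of 2020 is the one hard-coded in the cell above.

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2020  # same hard-coded year as in the cell above

# Hypothetical toy data
toy = pd.DataFrame({
    'yr_built': [1955, 1990],
    'yr_renovated': [0, 2010],   # 0 is treated as "never renovated"
})

# Use the renovation year when there is one, otherwise the build year
last_work = np.where(toy['yr_renovated'] == 0, toy['yr_built'], toy['yr_renovated'])
toy['Yrs_since_refurb'] = REFERENCE_YEAR - last_work

print(toy)
```

The first house falls back to its build year, the second uses its renovation year.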

Let's now check out if we have any null values to deal with..

In [None]:
check_null_values(data)

... hardly any, which is always a good sign :)

We've got just two records in the `sqft_above` field that are blank. These are inconsequential considering the size of the dataset. I'm going to take the lazy approach and just fill up the blanks with the mean `sqft_above` value that I've seen in the output of the `data.describe()` function above.

In [None]:
data['sqft_above'] = data['sqft_above'].fillna(1788.396095)
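
A slightly less manual variant is to compute the mean rather than copy it from the `describe()` output. This is a minimal sketch on made-up numbers (the toy column values are assumptions), relying on the fact that `Series.mean()` skips `NaN` values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two blanks
toy = pd.DataFrame({'sqft_above': [1000.0, np.nan, 2000.0, np.nan]})

mean_value = toy['sqft_above'].mean()   # NaN entries are ignored
toy['sqft_above'] = toy['sqft_above'].fillna(mean_value)

print(toy['sqft_above'].tolist())
```

This keeps the fill value in sync with the data if the file is ever refreshed.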

Having introduced a new feature and fixed the blank values, let's check out a random sample of 5 records in the dataframe. I always prefer `sample` to `head` as I like the randomness of the output, which although not powerful, has helped me spot anomalies in my approach in the past.

In [None]:
data.sample(5)

There appears to be one additional item to fix - the date is stored as a text string of the form `20140623T000000` (year, month, day, then a `T` and the time) rather than as a proper date. This is again easily done using the `strptime` method of the `datetime` class. I first tried this on an example to make sure my code works, and then used a `lambda` function to convert all entries in the `date` field.

Once converted, all that remained was to change the `date` to its ordinal (an integer day count), to enable regression modelling.

In [None]:
dateObj = dt.datetime.strptime('20140623T000000', '%Y%m%dT%H%M%S')

In [None]:
print(dateObj)

In [None]:
data['date'] = data['date'].apply(lambda x: dt.datetime.strptime(x, '%Y%m%dT%H%M%S'))

In [None]:
data['date']=data['date'].map(dt.datetime.toordinal)
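
As an aside, `pandas` can do the same parsing in one vectorised call with `pd.to_datetime`. This is an alternative sketch rather than what the kernel uses; the two sample strings are made up in the same format as the `date` column.

```python
import datetime as dt
import pandas as pd

# Hypothetical sample strings in the same layout as the date column
raw = pd.Series(['20140623T000000', '20150101T000000'])

# Parse the whole column at once, then convert each timestamp to its ordinal
parsed = pd.to_datetime(raw, format='%Y%m%dT%H%M%S')
ordinals = parsed.map(lambda ts: ts.toordinal())

# Cross-check against the standard library
expected = [dt.date(2014, 6, 23).toordinal(), dt.date(2015, 1, 1).toordinal()]
print(ordinals.tolist() == expected)
```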

I can now use a `scatterplot` to visualise the relationship, if any, between the dates and sale prices.

In [None]:
# Recent seaborn versions require x and y to be passed as keywords
sns.scatterplot(x=data['date'], y=data['price'])

That said, all this was a bit of a wasted effort as there's clearly no correlation whatsoever between date and price, so it's best to drop this field.

In [None]:
data.drop(columns=['date'], inplace=True)

The next step is to assess linear relationships between price and the other variables, which we will do using a correlation heatmap.

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=+1, cmap='RdYlGn')
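
A heatmap with this many cells can be hard to read, so it may help to also rank the correlations with `price` numerically. The sketch below uses synthetic data (the column names mimic the dataset, but the numbers are made up) purely to show the pattern.

```python
import numpy as np
import pandas as pd

# Synthetic data: price is driven by sqft_living plus noise
rng = np.random.default_rng(1)
sqft = rng.normal(2000, 500, 300)
toy = pd.DataFrame({
    'sqft_living': sqft,
    'noise': rng.normal(size=300),
    'price': 200 * sqft + rng.normal(0, 20_000, 300),
})

# Absolute correlation of every feature with price, strongest first
ranked = toy.corr()['price'].drop('price').abs().sort_values(ascending=False)
print(ranked)
```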

In addition to clearly calling out features that have a strong positive (or negative) correlation with price, we can also see a lot of features that are correlated with each other. Multicollinearity is a problem, particularly for linear models, as it makes coefficient estimates unstable and can hinder the predictive power of the models we create. Therefore, now is the time to:

* Segregate the features from the dataset into a separate dataframe.
* Remove collinear features from the features collected. For this purpose, I have arbitrarily used a correlation threshold of 0.7.

In [None]:
features = data.drop('price', axis=1)

In [None]:
remove_collinear_features(features, 0.7)

The function has worked and has identified 6 features that are highly correlated. I have removed them from the final selection of features for regression modelling in the next step.

In [None]:
features = features.drop(columns = ['sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'sqft_lot15', 'yr_built'], axis=1)

In [None]:
df_characteristics(features)

This concludes the EDA phase of this exercise, and we are now ready to initiate modelling.

____

<center><h1> Regression Modelling </h1></center>

_____

In [None]:
X = features # This is the set of features identified at the end of the last section 
y = data['price'] # target variable

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

A good test at this point would be to verify that the features are in matrix format and the targets in vector format.

In [None]:
print('X_train shape: {}'.format(X_train.shape))
print('X_test shape: {}'.format(X_test.shape))
print('y_train shape: {}'.format(y_train.shape))
print('y_test shape: {}'.format(y_test.shape))
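
The same check can be made explicit with assertions instead of reading shapes by eye. This is a small sketch on made-up data (the column names `f1`, `f2` are assumptions), using the same `test_size` and `random_state` as above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy features and target
X = pd.DataFrame(np.arange(40).reshape(20, 2), columns=['f1', 'f2'])
y = pd.Series(np.arange(20), name='price')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Features should be 2-D (matrix), targets 1-D (vector), with matching row counts
assert X_train.ndim == 2 and y_train.ndim == 1
assert len(X_train) == len(y_train) and len(X_test) == len(y_test)
print(X_train.shape, X_test.shape)
```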

To help me choose the best model for the prediction, I have considered 10 different models, which are fitted below. The high-level approach in each case is:

* Setting up the model, and fitting this on the train datasets.
* Generating predictions
* Calculating 3 error metrics - RMSE, R2 and Mean Absolute Error.

The MAE is my preferred error metric for this exercise given that it's easy to interpret. In our case it is the average amount, in dollars, by which a prediction misses the actual sale price, and therefore my choice of model will be influenced by whichever model generates the lowest MAE.
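
To make the three metrics concrete, here is a small worked example on made-up numbers. The helper name `regression_report` is an assumption of mine, not something the kernel defines; it simply wraps the three `sklearn.metrics` calls used throughout this section.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return RMSE, R2 and MAE for one set of predictions as a dict."""
    return {
        'RMSE': float(np.sqrt(mean_squared_error(y_true, y_pred))),
        'R2 Score': float(r2_score(y_true, y_pred)),
        'MAE': float(mean_absolute_error(y_true, y_pred)),
    }

# Hypothetical prices and predictions: the errors are 10, -10 and 30
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

report = regression_report(y_true, y_pred)
print(report)
```

The MAE is simply (10 + 10 + 30) / 3, whereas the RMSE is pulled upwards by the single larger miss, which is why the two can rank models differently.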

**Model 1 - Multivariate Linear Regression**

In [None]:
lin_reg = LinearRegression()
model = lin_reg.fit(X_train, y_train)

y_preds = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_preds))
print('RMSE: ', rmse)
rsq = r2_score(y_test, y_preds)
print('R2 Score: ', rsq)
mae = mean_absolute_error(y_test, y_preds)
print('MAE: ', mae)

**Model 2 - Ridge Regression**

In [None]:
ridge = Ridge(random_state=42)
ridge_mod = ridge.fit(X_train, y_train)

ridge_preds = ridge_mod.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, ridge_preds))
print('RMSE: ', rmse)
rsq = r2_score(y_test, ridge_preds)
print('R2 Score: ', rsq)
mae = mean_absolute_error(y_test, ridge_preds)
print('MAE: ', mae)

**Model 3 - Random Forest Regression**

In [None]:
random_forest = RandomForestRegressor()
forest = random_forest.fit(X_train, y_train)

rf_preds = forest.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, rf_preds))
print('RMSE: ', rmse)
rsq = r2_score(y_test, rf_preds)
print('R2 Score: ', rsq)
mae = mean_absolute_error(y_test, rf_preds)
print('MAE: ', mae)

**Model 4 - Gradient Boosting Regression**

In [None]:
gbr = GradientBoostingRegressor(learning_rate=0.01, n_estimators=1000)
gbm = gbr.fit(X_train, y_train)

gbm_preds = gbm.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, gbm_preds))
print('RMSE: ', rmse)
rsq = r2_score(y_test, gbm_preds)
print('R2 Score: ', rsq)
mae = mean_absolute_error(y_test, gbm_preds)
print('MAE: ', mae)

**Model 5 - Decision Tree Regression**

In [None]:
DT = DecisionTreeRegressor()
tree = DT.fit(X_train, y_train)

tree_preds = tree.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, tree_preds))
print('RMSE: ', rmse)
rsq = r2_score(y_test, tree_preds)
print('R2 Score: ', rsq)
mae = mean_absolute_error(y_test, tree_preds)
print('MAE: ', mae)

**Model 6 - SVR Model**

In [None]:
svr = SVR(gamma='auto')
svr_model = svr.fit(X_train, y_train)

svr_preds = svr_model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, svr_preds))
print('RMSE: ', rmse)
rsq = r2_score(y_test, svr_preds)
print('R2 Score: ', rsq)
mae = mean_absolute_error(y_test, svr_preds)
print('MAE: ', mae)

**Model 7 - XGB Model**

In [None]:
xgb = XGBRegressor(n_estimators=1000, learning_rate=0.09)
xgbm = xgb.fit(X_train, y_train)

xgb_preds = xgbm.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, xgb_preds))
print('RMSE: ', rmse)
rsq = r2_score(y_test, xgb_preds)
print('R2 Score: ', rsq)
mae = mean_absolute_error(y_test, xgb_preds)
print('MAE: ', mae)

**Model 8 - LGB Model**

In [None]:
lgb = LGBMRegressor(n_estimators=1000, learning_rate=0.1)
lgbm = lgb.fit(X_train, y_train)

lgb_preds = lgbm.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, lgb_preds))
print('RMSE: ', rmse)
rsq = r2_score(y_test, lgb_preds)
print('R2 Score: ', rsq)
mae = mean_absolute_error(y_test, lgb_preds)
print('MAE: ', mae)

**Model 9 - MLP Regressor**

In [None]:
scaler = StandardScaler()
X_train_MLP = scaler.fit_transform(X_train)
# Reuse the scaling learned on the training set; refitting on the test set would leak test statistics
X_test_MLP = scaler.transform(X_test)

net = MLPRegressor(max_iter=1000, learning_rate_init=0.05, hidden_layer_sizes=(50,25,25), random_state=42)
network = net.fit(X_train_MLP, y_train)

net_preds = network.predict(X_test_MLP)

rmse = np.sqrt(mean_squared_error(y_test, net_preds))
print('RMSE: ', rmse)
rsq = r2_score(y_test, net_preds)
print('R2 Score: ', rsq)
mae = mean_absolute_error(y_test, net_preds)
print('MAE: ', mae)
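
Scaling statistics should come from the training split only, and a `Pipeline` takes care of that automatically. The sketch below uses synthetic data and default-ish `MLPRegressor` settings (both are assumptions for illustration, not the configuration used above).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a simple linear signal
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 3))
X_test = rng.normal(size=(20, 3))
y_train = X_train @ np.array([1.0, 2.0, 3.0])

# fit() scales with training statistics; predict() reuses them on new data
pipe = make_pipeline(StandardScaler(), MLPRegressor(max_iter=500, random_state=42))
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)

print(preds.shape)
```

Because the scaler lives inside the pipeline, there is no separate scaled copy of the test set to keep track of.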

**Model 10 - Stacked Gen Model**

In [None]:
stacked_gen = StackingCVRegressor(regressors=(lin_reg, ridge, random_forest, gbr, DT, svr, xgb, lgb),
                                  meta_regressor=lgb, use_features_in_secondary=True)

stacked_gen_mod = stacked_gen.fit(np.array(X_train), np.array(y_train))
stacked_gen_preds = stacked_gen_mod.predict(np.array(X_test))

rmse = np.sqrt(mean_squared_error(y_test, stacked_gen_preds))
print('RMSE: ', rmse)
rsq = r2_score(y_test, stacked_gen_preds)
print('R2 Score: ', rsq)
mae = mean_absolute_error(y_test, stacked_gen_preds)
print('MAE: ', mae)

Finally, to facilitate readability and comparability of scores, it is sensible to tabulate the error metrics for each model and save this down to one dataframe.

In [None]:
scores = {
         'Model': ['Lin_Reg', 'Ridge', 'Random_Forest', 'Gradient_Boost', 'Decision_Tree', 
                    'SVR', 'XGB', 'LGB', 'MLP','Stacked_Gen'],
          
         'RMSE': [(np.sqrt(mean_squared_error(y_test, y_preds))),
                  (np.sqrt(mean_squared_error(y_test, ridge_preds))),
                 (np.sqrt(mean_squared_error(y_test, rf_preds))),
                 (np.sqrt(mean_squared_error(y_test, gbm_preds))),
                 (np.sqrt(mean_squared_error(y_test, tree_preds))),
                 (np.sqrt(mean_squared_error(y_test, svr_preds))),
                 (np.sqrt(mean_squared_error(y_test, xgb_preds))),
                 (np.sqrt(mean_squared_error(y_test, lgb_preds))),
                  (np.sqrt(mean_squared_error(y_test, net_preds))),
                 (np.sqrt(mean_squared_error(y_test, stacked_gen_preds)))],
    
         'R2 Score': [(r2_score(y_test, y_preds)),
                     (r2_score(y_test, ridge_preds)),
                     (r2_score(y_test, rf_preds)),
                     (r2_score(y_test, gbm_preds)),
                     (r2_score(y_test, tree_preds)),
                     (r2_score(y_test, svr_preds)),
                     (r2_score(y_test, xgb_preds)),
                     (r2_score(y_test, lgb_preds)),
                     (r2_score(y_test, net_preds)),
                     (r2_score(y_test, stacked_gen_preds))],
    
        'MAE': [(mean_absolute_error(y_test, y_preds)),
                  (mean_absolute_error(y_test, ridge_preds)),
                 (mean_absolute_error(y_test, rf_preds)),
                 (mean_absolute_error(y_test, gbm_preds)),
                 (mean_absolute_error(y_test, tree_preds)),
                 (mean_absolute_error(y_test, svr_preds)),
                 (mean_absolute_error(y_test, xgb_preds)),
                 (mean_absolute_error(y_test, lgb_preds)),
                (mean_absolute_error(y_test, net_preds)),
                 (mean_absolute_error(y_test, stacked_gen_preds))]
            }

col = ['Model', 'RMSE', 'R2 Score', 'MAE']

error_matrix = pd.DataFrame(data=scores, columns=col).sort_values(by='MAE', ascending=True).reset_index(drop=True)

error_matrix
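
The same table can be built with far less repetition by looping over a dictionary of predictions. This is a sketch with two made-up models and made-up predictions (`Model_A`, `Model_B` and all numbers are hypothetical); in the kernel the dictionary would hold the ten prediction arrays computed above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only
y_test = np.array([100.0, 200.0, 300.0, 400.0])
predictions = {
    'Model_A': np.array([110.0, 190.0, 310.0, 390.0]),
    'Model_B': np.array([150.0, 150.0, 350.0, 350.0]),
}

# One row of metrics per model
rows = [
    {
        'Model': name,
        'RMSE': np.sqrt(mean_squared_error(y_test, preds)),
        'R2 Score': r2_score(y_test, preds),
        'MAE': mean_absolute_error(y_test, preds),
    }
    for name, preds in predictions.items()
]

error_matrix = pd.DataFrame(rows).sort_values(by='MAE').reset_index(drop=True)
print(error_matrix)
```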

Reviewing the table above, the LGB and XGB models are almost tied. Both are gradient-boosted tree ensembles (not neural networks) and have an MAE of about $85,000. The LGB appears (very) marginally better than XGB, and is therefore my preferred choice of model. Its R2 score of 81% is not an accuracy rate as such: it means the model explains roughly 81% of the variance in sale prices in the test set.

So that's it :). There are clearly many more sophisticated methods that can be used to achieve a higher level of predictive power. Intuitively, I would have liked a model with a maximum MAE of $25,000 for a problem on house prices, as it would generate very reliable predictions that could be used by a real-estate firm or by prospective customers thinking about buying or selling a house in the area. Whilst the gradient-boosted models get part of the way there, some fine-tuning is required to take this exercise to the next level. So please review the kernel and let me know if you have any suggestions for improvement - any constructive comments will be very gratefully received!!

Cheers...

**End of Notebook**