 # Predicting House Prices (Two-Stage Hybrid Approach)

### Description

A two-stage hybrid approach is applied to this dataset: a Linear Regression model is fitted in the first stage, and a Neural Network regressor is then trained on its residuals in the second stage. By the end of this notebook, this approach reaches an R-squared of 0.88.

This work has been done in collaboration with my colleagues Chiel Bakkeren, Remco Stam, Biljana Gvozdic, Yuan Li, Elangovan Krishnan and Michiel van Lunsen. 

### Loading and inspecting the data

In [None]:
# Let's import the needed libraries
import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from scipy.special import boxcox
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
# Load the dataset
df = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')

In [None]:
# Inspect the dataset, the first rows
df.head()

In [None]:
# Inspect the feature names
df.columns

In [None]:
# DataFrame information
df.info()

In [None]:
# Counting missing values in the dataset
df.isnull().sum()

In [None]:
# Summary statistics
df.describe()

## Visualization
### Data distribution for each variable 

In [None]:
# Create a dataset with only numeric columns
df_num = df.select_dtypes(include=['int64', 'float64'])

# Histogram of the numeric columns
df_num.hist(bins=20, figsize=(20,20))
plt.show()

## First Stage: Multiple Regression

In [None]:
# Drop the identifier and the raw date column, which are not used as features
df.drop(['id', 'date'], axis=1, inplace=True)

In [None]:
df.loc[:, df.columns != 'price'].columns

In [None]:
# Split the dataset into train and test sets
train_set, test_set = train_test_split(df, test_size=0.3, random_state=42)

In [None]:
# Instantiate the linear regressor
lr = linear_model.LinearRegression()

# Define the train and test sets
X_train = train_set.loc[:, train_set.columns != 'price']
y_train = train_set['price']
X_test = test_set.loc[:, test_set.columns != 'price']
y_test = test_set['price']

# Fit the model
lr.fit(X_train, y_train)

# Generate predictions
y_pred = lr.predict(X_test)

In [None]:
# k: number of predictors, n: number of observations
k = df.loc[:, df.columns != 'price'].shape[1]
n = df.shape[0]
# Create the result DataFrame with the baseline scores
result = pd.DataFrame({
                       'R^2(train)': lr.score(X_train, y_train), 
                       'R^2(test)': lr.score(X_test, y_test), 
                       'Adjusted R^2(train)': lr.score(X_train, y_train)-(k-1)/(n-k)*(1-lr.score(X_train, y_train)),
                       'Adjusted R^2(test)': lr.score(X_test, y_test)-(k-1)/(n-k)*(1-lr.score(X_test, y_test)), 
                       '5-Fold Cross Validation': 
                           cross_val_score(lr, df.loc[:, df.columns != 'price'], df[['price']], cv=5).mean()
                      }, index=['Multiple with all the features'])
result
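
For reference, the usual textbook definition of adjusted R-squared is 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations the score was computed on and p is the number of predictors, not counting the intercept. The helper below is a minimal sketch of that definition. `adjusted_r2` is a hypothetical name, not a scikit-learn function, and the numbers in the example are made up. The cell above uses a slightly different k and n, so its values may differ a little from what this helper would give.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 for a model with an intercept and n_features predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up example: R^2 = 0.70 measured on 1,000 rows with 18 predictors
print(adjusted_r2(0.70, 1000, 18))
```

The penalty grows with the number of predictors, which matters later in the notebook once the zipcode dummies are added.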

## Feature Engineering

### Feature Transformation

The Box-Cox transformation is applied to a subset of the columns (the price and the square-footage features) to reduce their skewness.
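
As a reminder, the one-parameter Box-Cox transform maps a positive value x to (x^λ - 1)/λ when λ ≠ 0 and to log(x) when λ = 0. The sketch below checks `scipy.special.boxcox` against that definition on made-up values. It also shows `scipy.stats.boxcox`, which can estimate λ by maximum likelihood; that is only one possible way to pick λ and is an assumption here, since this notebook sets λ by hand.

```python
import numpy as np
from scipy import stats
from scipy.special import boxcox

# Made-up positive values, roughly on the scale of house prices
x = np.array([75000.0, 250000.0, 540000.0, 7700000.0])

# lambda != 0: (x**lmbda - 1) / lmbda
lmbda = -0.2
manual = (x**lmbda - 1) / lmbda
print(np.allclose(boxcox(x, lmbda), manual))

# lambda == 0: natural logarithm
print(np.allclose(boxcox(x, 0), np.log(x)))

# Optional: let SciPy estimate lambda by maximum likelihood
transformed, lmbda_hat = stats.boxcox(x)
print(lmbda_hat)
```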

In [None]:
# Hist plot of the data
plt.subplot(121)
df['price'].hist()
plt.title('Original')

# Apply the Box-Cox transformation
df['boxcox_price'] = boxcox(df['price'], -0.2)

# Hist plot of the data after transformation
plt.subplot(122)
df['boxcox_price'].hist()
plt.title('Transformed version')
plt.show()

In [None]:
# Hist plot of the data 
plt.subplot(121)
df['sqft_living'].hist()

# Apply the Box-Cox transformation
df['boxcox_sqft_living'] = boxcox(df['sqft_living'], 0)

# Hist plot the data after transformation
plt.subplot(122)
df['boxcox_sqft_living'].hist()
plt.show()

In [None]:
# Hist plot of data 
plt.subplot(121)
df['sqft_lot'].hist()

# Apply the Box-Cox transformation
df['boxcox_sqft_lot'] = boxcox(df['sqft_lot'], -0.2)

# Hist plot of the data after transformation
plt.subplot(122)
df['boxcox_sqft_lot'].hist()
plt.show()

In [None]:
# Hist plot of the data 
plt.subplot(121)
df['sqft_above'].hist()

# Applying Box-Cox
df['boxcox_sqft_above'] = boxcox(df['sqft_above'], 0)

# Hist plot of the data after transformation
plt.subplot(122)
df['boxcox_sqft_above'].hist()
plt.show()

In [None]:
# Hist plot of the data
plt.subplot(121)
df['sqft_living15'].hist()

# Applying the Box-Cox
df['boxcox_sqft_living15'] = boxcox(df['sqft_living15'], 0.1)

# Hist plot of the data after transformation
plt.subplot(122)
df['boxcox_sqft_living15'].hist()
plt.show()

In [None]:
# Hist plot of the data
plt.subplot(121)
df['sqft_lot15'].hist()

# Apply the Box-Cox
df['boxcox_sqft_lot15'] = boxcox(df['sqft_lot15'], -0.2)

# Hist plot of the data after transformation
plt.subplot(122)
df['boxcox_sqft_lot15'].hist()
plt.show()

### Feature Selection

Let's use only informative features to build the model. For instance, since 'zipcode' already encodes location, 'lat' and 'long' are not used.

In [None]:
df_new = df[['boxcox_price', 'bedrooms', 'bathrooms', 'boxcox_sqft_living', 'boxcox_sqft_lot', 'floors', 'waterfront', 
            'view', 'condition', 'grade', 'boxcox_sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 
            'zipcode', 'boxcox_sqft_living15', 'boxcox_sqft_lot15']]

### Outlier Treatment

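Outlier removal based on z-scores is applied only to the columns that are approximately normally distributed after the Box-Cox transformation. A row is dropped if any of these columns has an absolute z-score of 3 or more.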

In [None]:
# Compute z-scores for selected columns
zscores = df_new[['boxcox_price', 'boxcox_sqft_living', 'boxcox_sqft_lot','boxcox_sqft_above', 
                 'boxcox_sqft_living15', 'boxcox_sqft_lot15']].apply(stats.zscore)
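# Keep only the rows where every selected column has |z| < 3
within_3_sigma = (zscores.abs() < 3).all(axis='columns')

# Create a dataset without outliers (copy so later column assignments are safe)
df_clean = df_new.loc[within_3_sigma, :].copy()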
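
The same kind of filter can be tried on a small made-up DataFrame to see its effect. This is only a sketch: the column names `a` and `b` and the planted outlier are hypothetical, and the threshold of 3 follows the cell above.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})
toy.loc[0, 'a'] = 25.0  # plant one obvious outlier in the first row

# Column-wise z-scores, then keep rows where every column has |z| < 3
z = toy.apply(stats.zscore)
keep = (z.abs() < 3).all(axis=1)
toy_clean = toy.loc[keep].copy()

print(toy.shape, toy_clean.shape)
```

Because the z-scores are computed from the mean and standard deviation of the full column, a few extreme values can inflate the standard deviation and mask milder outliers.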

In [None]:
print('Shape Before Cleaning:', df_new.shape)
print('Shape After Cleaning:', df_clean.shape)

In [None]:
# Hist of the updated dataset
df_clean.hist(figsize=(20, 20))
plt.show()

### Data Preprocessing

Take advantage of 'zipcode' to generate predictive features (dummy variables).

In [None]:
# Convert 'zipcode' column from numeric into string type
df_clean['zipcode'] = df_clean['zipcode'].astype(str)

# Counts of unique categories
print(df_clean['zipcode'].value_counts())


In [None]:
# Convert categorical variable into dummy variables
dummies = pd.get_dummies(df_clean['zipcode']).rename(columns=lambda x: 'zipcode_' + x)

# Display the head of the dataframe
dummies.head()

In [None]:
# Concatenate the dummy dataframe and the dataset
df_cln = pd.concat([df_clean, dummies], axis=1)

# Drop the 'zipcode' column from the dataset
df_cln.drop('zipcode', axis=1, inplace=True)

# Display the created dataset
df_cln.head()

### Normalization with StandardScaler

In [None]:
# Instantiate StandardScaler
ss = StandardScaler()

# Applying StandardScaler (not for the dummies)
for column in df_cln.iloc[:, :16]:
    df_cln[column] = ss.fit_transform(df_cln[[column]])
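
The loop above fits the scaler on all rows, before the train/test split. A common alternative, sketched here on made-up data, is to fit the scaler on the training rows only and reuse its statistics on the test rows, so that no information from the test set enters preprocessing. The column names `sqft` and `zip_98001` are hypothetical stand-ins for a continuous feature and a dummy column.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'sqft': rng.normal(2000, 500, size=100),    # continuous, will be scaled
    'zip_98001': rng.integers(0, 2, size=100),  # dummy, left untouched
})

train, test = train_test_split(toy, test_size=0.3, random_state=42)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
train[['sqft']] = scaler.fit_transform(train[['sqft']])  # learn mean/std on train
test[['sqft']] = scaler.transform(test[['sqft']])        # reuse them on test

print(train['sqft'].mean(), train['sqft'].std(ddof=0))
```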

In [None]:
# Display the shape of the new dataset
df_cln.shape

## Linear Regression

In [None]:
lr = linear_model.LinearRegression()

In [None]:
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df_cln.iloc[:,1:], df_cln['boxcox_price'], test_size=0.3, random_state=42)

# Fit the model
lr.fit(X_train, y_train)

# Generate prediction
y_pred = lr.predict(X_test)

In [None]:
k = df_cln.iloc[:, 1:].shape[1]
n = df_cln.shape[0]

# Add the new results to the result DataFrame
result_new = pd.DataFrame({ 
                       'R^2(train)': lr.score(X_train, y_train), 
                       'R^2(test)': lr.score(X_test, y_test), 
                       'Adjusted R^2(train)': lr.score(X_train, y_train)-(k-1)/(n-k)*(1-lr.score(X_train, y_train)),
                       'Adjusted R^2(test)': lr.score(X_test, y_test)-(k-1)/(n-k)*(1-lr.score(X_test, y_test)), 
                       '5-Fold Cross Validation': 
                           cross_val_score(lr, df_cln.iloc[:, 1:], df_cln['boxcox_price'], cv=5).mean()
                      }, index=['Processed data'])
result = pd.concat([result, result_new])
result

So far, with the help of data preprocessing and feature engineering, we have achieved an R-squared higher than 0.87. Now let's see whether a neural network regressor can improve on this.

## Second Stage: Neural Network Regressor

In [None]:
y_lr = lr.predict(X_train)
y_error = y_train - y_lr
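
The idea behind the second stage is that the residuals of the linear model may still contain structure that a nonlinear model can learn. The following self-contained sketch illustrates this on synthetic data. It uses scikit-learn's `MLPRegressor` as a stand-in for the Keras network built below, purely to keep the example small; the data, layer sizes, and variable names are assumptions and not the notebook's configuration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Synthetic target: a linear trend plus a nonlinear component and noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
y = 3 * X[:, 0] + np.sin(5 * X[:, 0]) + rng.normal(0, 0.05, size=500)

# Stage 1: linear fit, then compute what it failed to explain
stage1 = LinearRegression().fit(X, y)
residual = y - stage1.predict(X)

# Stage 2: a small network is trained on the residuals
stage2 = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
stage2.fit(X, residual)

# Hybrid prediction = linear prediction + predicted residual
y_hybrid_toy = stage1.predict(X) + stage2.predict(X)
print(y_hybrid_toy.shape)
```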

In [None]:
import tensorflow
tensorflow.random.set_seed(42)

In [None]:
from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

In [None]:
# create model
nnr = Sequential()
nnr.add(Dense(128, input_dim=X_train.shape[1], kernel_regularizer=regularizers.l2(0.01), activation='relu'))
nnr.add(Dense(64, activation='relu'))
nnr.add(Dropout(0.5))
nnr.add(Dense(1))
nnr.summary()
# Compile model
nnr.compile(loss='mse', optimizer='rmsprop')

In [None]:
# Model training
training = nnr.fit(X_train, y_error, validation_split=0.2, verbose=1, epochs=100, batch_size=100)

In [None]:
plt.plot(training.history['loss'])
plt.plot(training.history['val_loss'])
plt.title("Model's Training & Validation loss across epochs")
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

In [None]:
# Flatten the (n, 1) network output into a 1-D array
y_nnr = nnr.predict(X_test).ravel()

# Hybrid prediction: linear prediction plus the predicted residual
y_hybrid = lr.predict(X_test) + y_nnr
y_hybrid.shape

In [None]:
plt.scatter(y_test, y_hybrid)
plt.show()

In [None]:
from scipy.stats import pearsonr

# Squared Pearson correlation between the actual and the hybrid-predicted values
corr, _ = pearsonr(y_test, y_hybrid)
print('R^2: %.3f' % corr**2)
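
Note that the squared Pearson correlation is not, in general, the same quantity as the coefficient of determination returned by `lr.score` in the first stage. The two coincide only in special cases, such as an ordinary least squares fit with an intercept scored on its own training data; otherwise the squared correlation ignores any constant offset or scale error in the predictions. The toy comparison below uses made-up numbers to show the gap. For a figure that is directly comparable with the earlier table, `metrics.r2_score(y_test, y_hybrid)` could be used instead.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 1.0  # perfectly correlated, but every prediction is off by 1

corr_toy, _ = pearsonr(y_true, y_pred)
r2_toy = r2_score(y_true, y_pred)

print('squared Pearson r: %.3f' % corr_toy**2)  # 1.000
print('r2_score:          %.3f' % r2_toy)       # 0.200
```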