# King County House Data




The dataset contains house prices from King County, an area in the US state of Washington that includes Seattle. The <a target="_new" href="https://www.kaggle.com/shivachandel/kc-house-data">dataset</a> was obtained from <a target="_new" href="https://www.kaggle.com/shivachandel/kc-house-data">Kaggle</a>.

<ol>
    <li><a target="_self" href="#obj" id="index-obj" style="color: black;">Objective</a></li>
    <li><a target="_self" href="#load-data" id="index-load-data" style="color: black;">Load Data</a></li>
    <li><a target="_self" href="#var-types" id="index-var-types" style="color: black;">Identification of variables and data types</a></li>
    <li><a target="_self" href="#metrics" id="index-metrics" style="color: black;">Analyzing the basic metrics</a></li>
    <li><a target="_self" href="#uni-analysis" id="index-uni-analysis" style="color: black;">Non-Graphical Univariate Analysis</a></li>
    <li><a target="_self" href="#uni-analysis2" id="index-uni-analysis2" style="color: black;">Graphical Univariate Analysis</a></li> 
    <li><a target="_self" href="#bi-analysis" id="index-bi-analysis" style="color: black;">Graphical Bivariate Analysis</a></li>
    <li><a target="_self" href="#var-trans" id="index-var-trans" style="color: black;">Variable transformations</a></li>
    <li><a target="_self" href="#nan-value" id="index-nan-value" style="color: black;">Missing value treatment</a></li>
    <li>
        <a target="_self" href="#outlier" id="index-outlier" style="color: black;">Outlier treatment</a>
    </li>
    <li>
        <a target="_self" href="#feat-eng" id="index-feat-eng" style="color: black;">Feature Engineering</a>
    </li>
    <li>
        <a target="_self" href="#model" id="index-model" style="color: black;">Model - Deep Neural Network</a>
    </li>     
    <li>
        <a target="_self" href="#feat-imp" id="index-feat-imp" style="color: black;">Feature Importance</a>
    </li>    
</ol>


<h3><a target="_self" href="#index-obj" id="obj" style="color: black;">Objective</a></h3>

The objective here is to build a model that predicts house prices and to check which features play an important role in the model.

<h3><a target="_self" href="#index-load-data" id="load-data" style="color: black;">Load Data</a></h3>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

sns.set(palette='RdYlGn')

In [None]:
df = pd.read_csv('/kaggle/input/kc-house-data/kc_house_data.csv')
df.head()

<h3><a target="_self" href="#index-var-types" id="var-types" style="color: black;">Identification of variables and data types</a></h3>

In [None]:
df.info()

I'm going to drop the id and date columns, since they will not help our model much.

In [None]:
df.drop(columns=['id', 'date'], inplace=True)

<h3><a target="_self" href="#index-metrics" id="metrics" style="color: black;">Analyzing the basic metrics</a></h3>

In [None]:
df.describe().T.drop(columns=['count'])

<h3><a target="_self" href="#index-uni-analysis" id="uni-analysis" style="color: black;">Non-Graphical Univariate Analysis</a></h3>

What are the characteristics of the most expensive house?

In [None]:
df.sort_values(by='price').tail(1)

Wow! A house worth 7.7 million dollars?! I should have made an offer! :)

What are the characteristics of the oldest house?

In [None]:
df.sort_values(by='yr_built').head(1)

A house built in 1900 that has never been renovated. It is probably falling apart.

<h3><a target="_self" href="#index-uni-analysis2" id="uni-analysis2" style="color: black;">Graphical Univariate Analysis</a></h3>

Let's check the behaviour of our features.

In [None]:
num_vars = ['price', 'long', 'lat', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15', 'yr_built', 'yr_renovated', 'zipcode']
cat_vars = ['bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade']

In [None]:
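def plot_univariate_panel(vars_name, data, func_plot, n_cols=2):
    
    from math import ceil
    
    # One subplot per variable, laid out on a grid with n_cols columns
    n_rows = ceil(len(vars_name) / n_cols)
    
    plt.figure(figsize=(7 * n_cols, 4 * n_rows))
    for idx, var in enumerate(vars_name, 1):
        plt.subplot(n_rows, n_cols, idx)
        # Pass the column explicitly as x so the call works across seaborn versions
        func_plot(x=data[var])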

In [None]:
plot_univariate_panel(num_vars, df, sns.boxplot, 3)

In the boxplots above, we can see many outliers, and the distributions are skewed. We will deal with these problems later. Next, let's take a look at the discrete features.

In [None]:
plot_univariate_panel(cat_vars, df, sns.countplot, 3)

<h3><a target="_self" href="#index-bi-analysis" id="bi-analysis" style="color: black;">Graphical Bivariate Analysis</a></h3>

Now let's take a look at how each feature behaves compared to the price.

In [None]:
def plot_bivariate_panel(vars_name, var_ref, data, n_cols=3):

    from math import ceil
    
    n_rows = ceil(len(vars_name) / n_cols)
    
    plt.figure(figsize=(7 * n_cols, 4 * n_rows))
    for idx, var in enumerate(vars_name, 1):
        
        # Mean of the reference variable for each distinct value of `var`
        mean = data.groupby(by=var)[var_ref].mean().reset_index()

        plt.subplot(n_rows, n_cols, idx)        
        sns.scatterplot(x=var, y=var_ref, data=mean)
        plt.ylabel(f'Mean {var_ref}')

In [None]:
plot_bivariate_panel(df.drop(columns=['price']).columns, 'price', df)

All features seem to have some relationship with the price. Next, let's see how price is distributed across geographical coordinates.

In [None]:
bbox = (
    (df['long'].min(), df['long'].max(),
    df['lat'].min(), df['lat'].max())
)
bbox

In [None]:
plt.figure(figsize=(15, 8))
house_map = plt.imread('/kaggle/input/mapcity/map.png')

plt.imshow(house_map, zorder=0, extent=bbox, aspect='equal')
sns.scatterplot(
    x='long', 
    y='lat', 
    data=df[df['price'] < 2e6], 
    hue='price',    
    zorder=1, 
    edgecolor=None, 
    alpha=0.2,    
    palette='hot'
)

According to the map, the most expensive houses are near Lake Washington, with a heavy concentration on Mercer Island. The next map shows the distribution of houses according to the year they were built.

In [None]:
plt.figure(figsize=(15, 8))
house_map = plt.imread('/kaggle/input/mapcity/map.png')

plt.imshow(house_map, zorder=0, extent=bbox, aspect='equal')
sns.scatterplot(
    x='long', 
    y='lat', 
    data=df, 
    hue='yr_built',    
    zorder=1, 
    edgecolor=None, 
    alpha=0.2,    
    palette='hot'
)

Again, the oldest houses are near Lake Washington, on the Seattle side.

<h3><a target="_self" href="#index-var-trans" id="var-trans" style="color: black;">Variable transformations</a></h3>

Some variables are close to normally distributed but right-skewed, and we can reduce the skew by taking the logarithm of each value.

In [None]:
new_df = df.copy()

new_df['price'] = np.log(new_df['price'])
new_df['sqft_living'] = np.log(new_df['sqft_living'])
new_df['sqft_above'] = np.log(new_df['sqft_above'])
new_df['sqft_living15'] = np.log(new_df['sqft_living15'])
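
To illustrate why the logarithm helps with right-skewed variables, here is a minimal sketch on synthetic data (a log-normal sample invented for illustration, not the King County prices). It compares the sample skewness before and after the transform and shows that `np.exp` recovers the original values.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample; the parameters are arbitrary assumptions
rng = np.random.default_rng(0)
synthetic_price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

skew_before = synthetic_price.skew()
skew_after = np.log(synthetic_price).skew()

print(f'Skewness before log: {skew_before:.2f}')
print(f'Skewness after log:  {skew_after:.2f}')

# The transform is invertible: exp undoes log
recovered = np.exp(np.log(synthetic_price))
```

A skewness close to zero after the transform indicates a roughly symmetric distribution.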

<h3><a target="_self" href="#index-nan-value" id="nan-value" style="color: black;">Missing value treatment</a></h3>

Fortunately, there are no missing values.

In [None]:
print(f'Missing values: {new_df.isnull().sum().sum()}')

<h3><a target="_self" href="#index-outlier" id="outlier" style="color: black;">Outlier treatment</a></h3>

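As we saw in the earlier plots, many variables have outliers. I'm going to use Tukey's fences to remove them: any value more than 1.5 times the interquartile range below the first quartile or above the third quartile is dropped.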

In [None]:
def remove_outliers(data, var_names):
    ans = data.copy()
    for var in var_names:
        var_info = ans[var].describe()        
        iq_range = var_info['75%'] - var_info['25%']
        intv_range = (var_info['25%'] - 1.5 * iq_range, var_info['75%'] + 1.5 * iq_range)
        ans = ans[ans[var].between(*intv_range)]
    return ans

new_df = remove_outliers(new_df, num_vars)
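
As a small worked example of Tukey's fences, the sketch below applies the same 1.5 × IQR rule to a toy array (values invented for illustration). The helper name `tukey_fences` is hypothetical and not part of the notebook.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

sample = np.array([10, 12, 13, 14, 15, 16, 18, 95])  # 95 is an obvious outlier
low, high = tukey_fences(sample)

# Keep only the values inside the fences
kept = sample[(sample >= low) & (sample <= high)]
print(low, high)
print(kept)
```

Here Q1 = 12.75 and Q3 = 16.5, so the fences are 7.125 and 22.125 and only the value 95 is removed.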

In [None]:
plot_univariate_panel(num_vars, new_df, sns.boxplot, 3)

In the boxplots above, we can see that there are still outliers, but I will not remove them because we may lose information.

<h3><a target="_self" href="#index-feat-eng" id="feat-eng" style="color: black;">Feature Engineering</a></h3>

At this point, we could go ahead and build our model, but we will first try to build more features to capture more information.

The first thing I want to do is drop the zipcode variable, because figuring out how to use it properly would take extra work, and since we have the geographic coordinates, those will probably give us more accurate information.

In [None]:
new_df.drop('zipcode', axis=1, inplace=True)

The second thing I'm going to do is remap the year-renovated variable, because some houses have never been renovated since they were built. My idea is to replace the zero values with the year the house was built.

In [None]:
# Keep yr_renovated where it is non-zero, otherwise fall back to yr_built
new_df['yr_renovated'] = new_df['yr_renovated'].where(new_df['yr_renovated'] != 0, new_df['yr_built'])
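
To make the remapping concrete, here is a small sketch on a toy frame (rows invented for illustration) that uses `Series.where` to fill the zeros with the construction year.

```python
import pandas as pd

# Toy rows invented for illustration; 0 means "never renovated"
toy = pd.DataFrame({
    'yr_built': [1955, 1987, 2001],
    'yr_renovated': [0, 2005, 0],
})

# Keep yr_renovated where it is non-zero, otherwise use yr_built
toy['yr_renovated'] = toy['yr_renovated'].where(toy['yr_renovated'] != 0, toy['yr_built'])
print(toy['yr_renovated'].tolist())
```

Only the second house keeps its own renovation year; the other two take their construction year.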

The third thing to do is to create squared features.

In [None]:
def square_features(features, X, degree=2):
    
    X_cp = X.copy()
    X_features = X.columns
    
    for feature in features:
        if feature in X_features:
            X_cp[feature + f'^{degree}'] = np.power(X_cp[feature], degree)
    
    return X_cp

new_df = square_features(num_vars[1:] + cat_vars, new_df)

The fourth and final thing to do is to create a deviation variable that captures how the year a house was built compares with the average for houses of the same grade.

In [None]:
cat = new_df.groupby('grade')['yr_built']
new_df['yr_built_dev_grade'] = cat.transform(lambda x: (x - x.mean()) / x.std())
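
The deviation variable is a z-score computed within each grade. The sketch below reproduces the same `groupby` / `transform` pattern on a toy frame (values invented for illustration). Note that a grade with a single house has an undefined sample standard deviation, so its deviation comes out as NaN.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration
toy = pd.DataFrame({
    'grade':    [7, 7, 7, 9, 9, 12],
    'yr_built': [1950, 1960, 1970, 2000, 2010, 2015],
})

# Standardize yr_built within each grade: (value - group mean) / group std
toy['yr_built_dev_grade'] = (
    toy.groupby('grade')['yr_built'].transform(lambda x: (x - x.mean()) / x.std())
)
print(toy)
```

For grade 7 the mean is 1960 and the sample standard deviation is 10, so the three houses get deviations of -1, 0 and 1.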

<h3><a target="_self" href="#index-model" id="model" style="color: black;">Model - Deep Neural Network</a></h3>

To model this problem, I'll use a deep neural network with eight hidden layers, each with as many units as there are input features, followed by a single linear output unit.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow import keras
from sklearn.metrics import mean_absolute_error, r2_score


X = new_df.drop(columns=['price'])
y = new_df['price']
# Undo the log transform so the target is back in dollars
y = np.exp(y)

# Splitting the dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
scaler = MinMaxScaler().fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
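
The scaler is fitted on the training set only and then applied to both sets. The sketch below reproduces the default min-max arithmetic in plain NumPy on toy values (invented for illustration). It is not scikit-learn's implementation, only the same formula, and it shows that test values outside the training range can land outside [0, 1].

```python
import numpy as np

# Toy single-feature data invented for illustration
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[2.0], [7.0]])

# Statistics come from the training set only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value 7.0 lies above the training maximum, so it scales to 1.5 rather than being clipped.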

In [None]:
model = keras.Sequential()

n_units = X.shape[1]
n_layers = 8

for _ in range(n_layers):
    model.add(keras.layers.Dense(units=n_units, activation='relu'))

model.add(keras.layers.Dense(units=1))

model.compile(optimizer='adam', loss='mse')
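
To get a feel for the size of such a network, the sketch below counts the weights and biases of a fully connected stack like the one above. The helper `dense_param_count` and the feature count of 10 are hypothetical and chosen only for illustration; the notebook takes the real value from `X.shape[1]`.

```python
def dense_param_count(n_inputs, n_units, n_hidden_layers):
    """Count weights and biases of a fully connected regression network."""
    total = 0
    fan_in = n_inputs
    for _ in range(n_hidden_layers):
        total += fan_in * n_units + n_units  # weights + biases of one hidden layer
        fan_in = n_units
    total += fan_in * 1 + 1  # single linear output unit
    return total

n_features = 10  # hypothetical number of input features
n_params = dense_param_count(n_features, n_features, 8)
print(n_params)
```

With 10 features, each hidden layer has 110 parameters and the output layer has 11, for 891 in total.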

In [None]:
hist = model.fit(
    x=X_train, 
    y=y_train,    
    verbose=False,
    validation_data=(X_test, y_test),
    batch_size=128,
    epochs=500
)

In [None]:
y_train_pred = model.predict(x=X_train)
y_test_pred = model.predict(x=X_test)

In [None]:
# Mean absolute error in dollars (arguments in the usual y_true, y_pred order)
print(f'MAE train set: {mean_absolute_error(y_train, y_train_pred):,.0f}')
print(f'MAE test set: {mean_absolute_error(y_test, y_test_pred):,.0f}')
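
For reference, the mean absolute error is simply the average of the absolute differences between true and predicted prices. A minimal sketch on three made-up prices:

```python
import numpy as np

# Made-up prices for illustration
toy_true = np.array([200_000, 350_000, 500_000])
toy_pred = np.array([210_000, 330_000, 560_000])

# Absolute errors are 10,000, 20,000 and 60,000
mae = np.mean(np.abs(toy_true - toy_pred))
print(f'MAE: {mae:,.0f}')
```

Because the target is in dollars, the MAE reads directly as the average pricing error per house.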

In [None]:
loss_df = pd.DataFrame(hist.history)

In [None]:
plt.figure(figsize=(15, 6))
x, y = range(1, len(loss_df['loss']) + 1), loss_df['loss']
start = 2
plt.plot(x[start:], y[start:], label='Training Loss')
x, y = range(1, len(loss_df['val_loss']) + 1), loss_df['val_loss']
plt.plot(x[start:], y[start:], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('MSE')
plt.legend()

As we can see in the learning curves above, the model is neither overfitting nor underfitting the data. It has achieved a good balance. Let's compare some summary statistics of the real and predicted prices.

In [None]:
y_test_pred = model.predict(X_test)

comp_df = pd.DataFrame(
    data={
        'Test True Y': y_test,
        'Test Pred Y': y_test_pred.reshape((-1,))
    }
)

comp_df.describe().T

In [None]:
plt.figure(figsize=(15, 6))

plt.subplot(1,2,1)
plt.scatter(
    x=y_train, 
    y=y_train_pred, 
    alpha=0.3, 
    edgecolor=None
)
plt.plot(
    [y_train.min(), y_train.max()],
    [y_train.min(), y_train.max()],
    color='black', 
    linestyle='dashed', 
    label=f'Desired\nR² train set: {r2_score(y_train, y_train_pred):.2}'
)
plt.legend()
plt.xlabel('True value')
plt.ylabel('Predicted value')


plt.subplot(1,2,2)
plt.scatter(
    x=y_test, 
    y=y_test_pred, 
    alpha=0.3, 
    edgecolor=None
)
plt.plot(
    [y_test.min(), y_test.max()],
    [y_test.min(), y_test.max()],
    color='black', 
    linestyle='dashed', 
    label=f'Desired\nR² test set: {r2_score(y_test, y_test_pred):.2}'
)
plt.legend()
plt.xlabel('True value')
plt.ylabel('Predicted value')

From the graphs above, the model is doing a pretty decent job. The R² coefficient is about 0.9 on the training set and 0.88 on the test set.
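
As a reminder of what R² measures, it is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes it by hand on three made-up prices, not the notebook's data.

```python
import numpy as np

# Made-up prices for illustration
toy_true = np.array([200_000, 350_000, 500_000])
toy_pred = np.array([210_000, 330_000, 560_000])

ss_res = np.sum((toy_true - toy_pred) ** 2)          # unexplained variation
ss_tot = np.sum((toy_true - toy_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot
print(f'R²: {r2:.3f}')
```

A value of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean price.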

<h3><a target="_self" href="#index-feat-imp" id="feat-imp" style="color: black;">Feature Importance</a></h3>

Now let's get an idea of which features are most important to our model.

In [None]:
def feature_importance(model, metric, X, y, features_name, shuffles_per_column=10):
    
    n_rows, n_cols = X.shape
    ref_score = metric(y, model.predict(X)) # Reference score    
    mean_list, std_list = [], []
    X_cp = X.copy()
    
    for c in range(n_cols):
        metric_list = []
        for _ in range(shuffles_per_column):
            np.random.shuffle(X_cp[:, c])
            y_pred = model.predict(X_cp)
            score = metric(y, y_pred)
            metric_list.append(ref_score - score)
            X_cp[:, c] = X[:, c]        
        mean_list.append(np.mean(metric_list))
        std_list.append(np.std(metric_list))
    
    importance_rel = [np.round(100*np.mean(mean)/np.sum(mean_list), 2) for mean in mean_list]
            
    return pd.DataFrame({
        'Feature': features_name,
        'Importance (%)': importance_rel,
        'Mean': mean_list,
        'Std': std_list
    }).sort_values('Importance (%)', ascending=False).reset_index(drop=True)

ans = feature_importance(model, r2_score, X_test, y_test, X.columns)

In [None]:
plt.figure(figsize=(15, 8))
sns.barplot(x='Importance (%)', y='Feature', data=ans)

As we would expect, the location of the house is very important for predicting the price, which is why latitude and longitude are at the top of the importance ranking. Next comes the available area, then the year the house was built, and so on.
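
The importance scores above come from shuffling one column at a time and measuring how much R² drops. The toy sketch below shows the idea in isolation. The stand-in predictor `toy_predict` and the synthetic data are assumptions made for illustration; the predictor uses only the first column, so shuffling the second column should change nothing.

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def toy_predict(X):
    # Stand-in for model.predict: depends only on column 0
    return 3.0 * X[:, 0]

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 2))
y_toy = toy_predict(X_toy)  # noiseless target, so the reference R² is 1

ref_score = r2(y_toy, toy_predict(X_toy))
drops = []
for c in range(X_toy.shape[1]):
    X_perm = X_toy.copy()
    X_perm[:, c] = rng.permutation(X_perm[:, c])  # break the link between column c and y
    drops.append(ref_score - r2(y_toy, toy_predict(X_perm)))

print(drops)
```

The first column shows a large drop in R², while the second shows none, which is the pattern a permutation-based ranking relies on.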