#### Imports

In [1]:
# activates python modules auto reload
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('..')

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

import shap
import pandas as pd
from math import sqrt
from sklearn import metrics
from sklearn.model_selection import train_test_split

from src.features import preprocess
from src.models import build_model

#### Read Data

In [2]:
df = pd.read_csv('../data/house_sales.csv', nrows=None)
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df.shape)
df.head(3)

(18445, 16)


,price,num_bed,num_bath,size_house,size_lot,num_floors,is_waterfront,condition,size_basement,year_built,renovation_date,zip,latitude,longitude,avg_size_neighbor_houses,avg_size_neighbor_lot
0,221900,3,1.0,1180,5650,1.0,0,3,0,1955,0,98178,47.511234,-122.256775,1340,5650
1,538000,3,2.25,2570,7242,2.0,0,3,400,1951,1991,98125,47.721023,-122.318862,1690,7639
2,180000,2,1.0,770,10000,1.0,0,3,0,1933,0,98028,47.737927,-122.233196,2720,8062


## Price prediction

### Build model

Let's define the features to be used in the prediction model, separating them into categorical and numeric.

In [3]:
num_feat = [
    'num_bed',
    'num_bath',
    'size_house',
    'size_lot',
    'num_floors',
    'condition',
    'size_basement',
    'year_built',
    'renovation_date',
    'latitude',
    'longitude',
    'avg_size_neighbor_houses',
    'avg_size_neighbor_lot',
    'is_waterfront'
]

cat_feat = [
    'zip'
]

The columns that do not make it into the model as features, and thus have no effect on its predictions, are:

In [4]:
print(*(set(df.columns.tolist()) - set(cat_feat) - set(num_feat)))

price


#### Preprocess Data

The preprocessor applies transformations to our features so that they can be properly used by the model.

Numeric features have their gaps filled with the mean and their distribution standardized.

Categorical features have their gaps filled with the most frequent class and go through one-hot encoding. Encoding converts them to a numeric representation in which each class becomes a standalone boolean feature.

In [5]:
df = preprocess(df, num_feat, cat_feat, 'price')
print(df.shape)
df.head(3)

(18445, 85)


,num_bed,num_bath,size_house,size_lot,num_floors,condition,size_basement,year_built,renovation_date,latitude,...,zip_98148,zip_98155,zip_98166,zip_98168,zip_98177,zip_98178,zip_98188,zip_98198,zip_98199,price
0,-0.398985,-1.448538,-0.980964,-0.224493,-0.914436,-0.630878,-0.661827,-0.544939,-0.210967,-0.352129,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,221900
1,-0.398985,0.169771,0.52751,-0.186423,0.934687,-0.630878,0.239844,-0.681179,4.727632,1.1619,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,538000
2,-1.469736,-1.448538,-1.425909,-0.120468,-0.914436,-0.630878,-0.661827,-1.294259,-0.210967,1.283894,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,180000
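
The code of `preprocess` is not shown in this notebook. As a rough sketch of the steps described above, and assuming it is built on scikit-learn, something like the following would produce a similar table. The name `preprocess_sketch` and all of its internals are hypothetical, not the actual `src.features.preprocess` implementation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def preprocess_sketch(df, num_feat, cat_feat, target):
    """Hypothetical stand-in for src.features.preprocess (an assumption)."""
    # Numerics: fill gaps with the mean, then standardize
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    # Categoricals: fill gaps with the most frequent class, then one-hot encode
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    transformer = ColumnTransformer(
        [("num", numeric, num_feat), ("cat", categorical, cat_feat)],
        sparse_threshold=0.0,  # always return a dense array
    )
    values = transformer.fit_transform(df[num_feat + cat_feat])

    # Name the encoded columns "<feature>_<class>", e.g. zip_98178
    encoder = transformer.named_transformers_["cat"].named_steps["encode"]
    cat_names = [
        f"{feat}_{cls}"
        for feat, classes in zip(cat_feat, encoder.categories_)
        for cls in classes
    ]
    out = pd.DataFrame(values, columns=num_feat + cat_names, index=df.index)
    out[target] = df[target]  # the target is passed through untouched
    return out
```

In this notebook the equivalent call would be `preprocess_sketch(df, num_feat, cat_feat, 'price')`. Note that the sketch fits the imputers and the scaler on the whole table, as the cell above appears to do, so statistics from the future test rows leak into the training features; fitting on the training split only would avoid that.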


#### Split Training and Test sets

In [7]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=8)

print('Train:\t%d houses' % df_train.shape[0])
print('Test:\t%d houses' % df_test.shape[0])

Train:	14756 houses
Test:	3689 houses


#### Regressor
The model will be a linear regression, as our data exploration showed a good linear relation between most features and the target variable.

In [8]:
from sklearn.linear_model import LinearRegression
clf = LinearRegression()

##### Fit

In [9]:
clf.fit(df_train.drop(columns='price'), df_train['price']);

### Make prediction

In [12]:
y_test = df_test['price']
y_pred = clf.predict(df_test.drop(columns='price'))

### Measure Quality

In [13]:
from math import sqrt
from sklearn import metrics

print('LR Scores:')
print('R^2: %.4f' % (metrics.r2_score(y_test, y_pred)))
print('Sqrt Mean Squared Error: %.4f' % (sqrt(metrics.mean_squared_error(y_test, y_pred))))

LR Scores:
R^2: 0.8080
Sqrt Mean Squared Error: 159786.6970


#### R squared
R^2 is the coefficient of determination. It indicates the strength of the relation, as captured by our model, between the independent variables and the target. A value of 1 is a perfect fit and 0 is no better than always predicting the mean price; on held-out data it can even be negative.

The linear regression model obtained an R^2 of about 0.81, which is quite good and indicates that our model is doing a good job at predicting prices.
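
To make the score less of a black box, here is a minimal sketch of the usual definition, R^2 = 1 - SS_res / SS_tot, written with NumPy. The helper name `r2_by_hand` and the prices are illustrative only and are not taken from the dataset. The last call shows that predictions worse than the mean yield a negative score.

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Illustrative prices only
prices = [200_000, 300_000, 400_000, 500_000]
print(r2_by_hand(prices, prices))                                # perfect fit: 1.0
print(r2_by_hand(prices, [350_000] * 4))                         # always the mean: 0.0
print(r2_by_hand(prices, [500_000, 400_000, 300_000, 200_000]))  # worse than the mean: -3.0
```

On the notebook's data, `r2_by_hand(y_test, y_pred)` should agree with `metrics.r2_score(y_test, y_pred)`.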

#### Root Mean Squared Error
The RMSE indicates how much our predictions typically differ from the real price, in the same unit as the price itself.

The value obtained, about 160k, is a reasonable error.

This value could probably be improved with more sophisticated models. The distribution of these errors is also worth considering: if they occur mostly on the pricier outliers, they would be less impactful.
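
One way to look at the distribution of the errors is to compute the RMSE separately for each price band. The sketch below is only an illustration: `rmse_by_price_band` is a hypothetical helper, and splitting the prices into quantile bands with `pd.qcut` is an assumption about how one might slice them.

```python
import numpy as np
import pandas as pd


def rmse_by_price_band(y_true, y_pred, n_bands=4):
    """Return the overall RMSE and the RMSE within each price quantile band."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = pd.DataFrame({"price": y_true, "sq_err": (y_true - y_pred) ** 2})
    # Bands with (roughly) the same number of houses, cheapest first
    errors["band"] = pd.qcut(errors["price"], q=n_bands, duplicates="drop")
    per_band = np.sqrt(errors.groupby("band", observed=True)["sq_err"].mean())
    overall = float(np.sqrt(errors["sq_err"].mean()))
    return overall, per_band
```

Calling `overall, per_band = rmse_by_price_band(y_test, y_pred)` would show whether the last band, the most expensive houses, carries most of the error.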

### Model interpretation

Interpreting the linear model coefficients.

In [18]:
from src.visualization.visualize import lr_coeff

coeff = lr_coeff(df_test.columns, clf)
coeff.head(10)

Feature,Coefficient
size_house,240123.17024
is_waterfront,72803.522353
avg_size_neighbor_houses,33163.255787
latitude,27825.369566
num_bath,21851.646931
condition,18414.590149
size_lot,13596.527613
renovation_date,10114.287521
avg_size_neighbor_lot,-3236.398929
year_built,-5269.184571


Since the numeric features were standardized during preprocessing, their coefficients are on a comparable scale. House size and waterfront status are the most important factors for the price of a home: as expected, a larger house on the waterfront means a more expensive property.

The size of nearby homes also plays an important role. Larger nearby homes increase the property's price tag, indicating that rich neighborhoods are desired by buyers.

Higher latitudes increase the price, so the northern region is preferred.

As for the house's interior, the number of bathrooms was the feature most related to the price tag, possibly due to its inherent relation with the house size.
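
The source of `lr_coeff` is not shown here. A table like the one above can be built by pairing each feature name with the matching entry of the model's `coef_` attribute; the sketch below does that under the assumption that this is roughly what the helper does. The name `coeff_table` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def coeff_table(feature_names, model):
    """Hypothetical stand-in for lr_coeff: one coefficient per feature, largest first."""
    table = pd.DataFrame(
        {"Coefficient": model.coef_},
        index=pd.Index(feature_names, name="Feature"),
    )
    return table.sort_values("Coefficient", ascending=False)


# Toy data where the target is exactly 3*a - 2*b + 5
X = pd.DataFrame({"a": [0, 1, 2, 3], "b": [1, 0, 3, 2]})
y = 3 * X["a"] - 2 * X["b"] + 5
toy_model = LinearRegression().fit(X, y)
print(coeff_table(X.columns, toy_model))
```

The names passed in must be the columns the model was fitted on, in the same order and without the target, for example `df_test.drop(columns='price').columns` in this notebook.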