#### Imports

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import math
from matplotlib.pyplot import figure

import sys
sys.path.append('..')

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

from src.visualization.visualize import lr_coeff

#### Read Data

In [2]:
df = pd.read_csv('../house_sales.csv')
df.drop_duplicates(inplace=True)
print('%d houses' % df.shape[0])

18445 houses


## Price prediction

### Build model

Let's define the features used by the prediction model, separated into numeric and categorical.

In [3]:
num_feat = [
    'num_bed',
    'num_bath',
    'size_house',
    'size_lot',
    'num_floors',
    'condition',
    'size_basement',
    'year_built',
    'renovation_date',
    'latitude',
    'longitude',
    'avg_size_neighbor_houses',
    'avg_size_neighbor_lot',
    'is_waterfront'
]

cat_feat = [
    'zip'
]

The columns that are not used as features, and therefore have no effect on the predictions, are:

In [4]:
print(*(set(df.columns.tolist()) - set(cat_feat) - set(num_feat)))

price


Now we define the preprocessor and the estimator used in the model.

The preprocessor applies transformations to the features. Missing numeric values are filled with the mean, and the numeric features are then standardized, which puts them on a common scale and makes their coefficients comparable.

Missing categorical values are filled with the most frequent class, and the categorical features then go through one-hot encoding so the model can use them. Encoding converts categoricals to a numeric representation: a feature with 3 possible classes becomes 3 new boolean features, each indicating whether an element belongs to the corresponding class. These new 0/1 features are the ones used to fit the model.
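To make the encoding step concrete, here is a minimal sketch using `pd.get_dummies` on a toy column. The zip values are made up for illustration, and the notebook's own preprocessor may use a different encoder.

```python
import pandas as pd

# Toy data: these zip values are invented for illustration only.
toy = pd.DataFrame({'zip': ['98001', '98002', '98003', '98001']})

# One 0/1 column per class; each row has exactly one 1.
encoded = pd.get_dummies(toy['zip'], prefix='zip', dtype=int)
print(encoded)
```

Three distinct zip values produce three indicator columns, matching the 3-class example in the text.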

The estimator is linear regression (a regressor, although the code names it `clf`), as our data exploration showed a good linear relation between most features and the target variable.

In [5]:
from src.features.transformers import preprocessor
pp = preprocessor(num_feat, cat_feat)

from sklearn.linear_model import LinearRegression
clf = LinearRegression()

from src.models.build_model import build_model
model = build_model(pp, clf)
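The `preprocessor` and `build_model` helpers live in `src` and are not shown in this notebook. As a rough sketch of what they might look like, assuming they follow the description above and are built from standard scikit-learn components, the hypothetical `sketch_preprocessor` and `sketch_build_model` below combine imputation, scaling and one-hot encoding in a single pipeline. The names and internals are assumptions, not the project's actual code.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def sketch_preprocessor(num_feat, cat_feat):
    """Hypothetical stand-in for src.features.transformers.preprocessor."""
    numeric = Pipeline([
        ('impute', SimpleImputer(strategy='mean')),   # fill gaps with the mean
        ('scale', StandardScaler()),                  # zero mean, unit variance
    ])
    categorical = Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ])
    # Columns not listed here (such as price) are dropped by default.
    return ColumnTransformer([
        ('num', numeric, num_feat),
        ('cat', categorical, cat_feat),
    ])


def sketch_build_model(pp, clf):
    """Hypothetical stand-in for src.models.build_model.build_model."""
    return Pipeline([('preprocessor', pp), ('regressor', clf)])
```

With a design like this, the full DataFrame can be passed to `fit` and `predict`, because the column transformer selects only the listed features.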

### Split into training and test sets

In [6]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.2)
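Because no `random_state` is passed above, the split (and therefore the scores reported later) changes from run to run. A minimal sketch of a reproducible split on a toy frame follows; the seed value 42 is an arbitrary choice, not something the notebook uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the house data.
toy = pd.DataFrame({'size_house': range(10), 'price': range(100, 110)})

# Same seed, same rows in each partition, run after run.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(test_a.index.equals(test_b.index))
```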

### Fit model

Fit the model on the training set, using `price` as the target.

In [7]:
model.fit(df_train, df_train['price']);

### Make prediction

In [8]:
y_test = df_test['price']
y_pred = model.predict(df_test)

### Measure Quality

In [9]:
from sklearn import metrics

print('LR Scores:')
print('R^2: %.4f' % (metrics.r2_score(y_test, y_pred)))
print('Sqrt Mean Squared Error: %.4f' % (math.sqrt(metrics.mean_squared_error(y_test, y_pred))))

LR Scores:
R^2: 0.7911
Sqrt Mean Squared Error: 156920.5106


#### R squared
R^2 is the coefficient of determination. It indicates how much of the variation in the target is explained by the model from the independent variables. A value of 1 is a perfect fit and 0 is no better than always predicting the mean; on test data it can even be negative.

The linear regression model obtained an R^2 of about 0.79, which is quite good and indicates that the model does a good job of predicting prices.
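As a quick illustration of what the score measures, the sketch below computes R^2 by hand as 1 minus the ratio of residual to total sum of squares and compares it with `metrics.r2_score`. The prices are made up for the example.

```python
import numpy as np
from sklearn import metrics

# Made-up prices, only to illustrate the formula.
y_true = np.array([300_000, 450_000, 520_000, 610_000], dtype=float)
y_hat = np.array([320_000, 430_000, 500_000, 650_000], dtype=float)

ss_res = np.sum((y_true - y_hat) ** 2)          # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, metrics.r2_score(y_true, y_hat))

# Predictions worse than the mean give a negative score.
r2_bad = metrics.r2_score(y_true, y_true[::-1])
print(r2_bad)
```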

#### Root Mean Squared Error
The RMSE indicates how far our predictions typically are from the real price, in the same units as the price. Because errors are squared before averaging, large misses weigh more heavily.

The RMSE here is about 157k; a value between 150k and 170k is a reasonable error.

This value could probably be improved with more sophisticated models. The distribution of the errors should also be considered: if they occur mostly on the pricier outliers, they are less impactful.
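One way to check where the errors concentrate is to compute the RMSE separately for each price bucket. The helper below, `rmse_by_price_bucket`, is a hypothetical sketch and not part of the project; it splits houses into equal-sized price quantiles with `pd.qcut`.

```python
import numpy as np
import pandas as pd


def rmse_by_price_bucket(y_true, y_pred, n_buckets=4):
    """RMSE within each price quantile bucket (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = pd.DataFrame({'price': y_true, 'sq_err': (y_true - y_pred) ** 2})
    # Bucket 0 holds the cheapest houses, the last bucket the priciest.
    errors['bucket'] = pd.qcut(errors['price'], q=n_buckets,
                               labels=False, duplicates='drop')
    return np.sqrt(errors.groupby('bucket')['sq_err'].mean())
```

Calling `rmse_by_price_bucket(y_test, y_pred)` on the predictions above would show whether the top bucket dominates the overall error.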

### Model interpretation

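Interpreting the linear model coefficients. Because the numeric features were standardized, each coefficient is the price change associated with a one standard deviation increase in that feature, so magnitudes can be compared across features.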

In [10]:
coeff = lr_coeff(pp, clf)
coeff.head(10)

| Feature | Coefficient |
|---|---|
| size_house | 243108.54763 |
| is_waterfront | 77593.224899 |
| latitude | 33684.847182 |
| avg_size_neighbor_houses | 32225.239525 |
| num_bath | 22430.802394 |
| condition | 16761.478464 |
| size_lot | 16467.93237 |
| renovation_date | 9307.01868 |
| avg_size_neighbor_lot | -6032.87751 |
| year_built | -8205.012796 |


House size and waterfront status are the most important factors for the price of a home. As expected, a larger house or a waterfront location means a more expensive property.

The size of nearby homes also plays an important role: larger nearby homes increase the property's price tag, suggesting that buyers favor wealthier neighborhoods.

Higher latitudes are associated with higher prices, so the northern part of the region appears to be preferred.

As for the house's interior, the number of bathrooms is the feature most related to the price tag, possibly due to its inherent relation with house size.