# Iteration 0: Creating an intuition-based model

## Load data

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
df = pd.read_csv('./data/housing_iteration_0_2_classification.csv')
df.head()

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,Expensive
0,8450,65.0,856,3,0,0,2,0,0,0
1,9600,80.0,1262,3,1,0,2,298,0,0
2,11250,68.0,920,3,1,0,2,0,0,0
3,9550,60.0,756,3,1,0,3,0,0,0
4,14260,84.0,1145,4,1,0,3,192,0,0


In [3]:
y = df.pop('Expensive')

In [4]:
X = df.copy()

## Train-test split

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [12]:
y_train.value_counts()

Expensive
0    999
1    169
Name: count, dtype: int64

## Definition of intuition-based models

### Model 0: All houses are cheap

As a starting guess for our first model, we will assume that all houses are cheap.

In [19]:
model_cheap_train = pd.Series(0, index=y_train.index)  # predict 0 (cheap) for every training house

In [20]:
from sklearn.metrics import accuracy_score

train_accuracy = accuracy_score(y_true=y_train, y_pred=model_cheap_train)

round(train_accuracy, 2)

0.86

On the training data, this model has an accuracy of 86%. The next cells compute the same metric on the test data.
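The 86% can be checked directly from the class counts printed earlier: a model that always predicts 0 is right exactly on the houses labelled 0. A minimal sketch of that arithmetic, using the counts from the value_counts output (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the y_train.value_counts() output.
n_cheap = 999
n_expensive = 169

# An "always cheap" model is correct exactly on the cheap houses,
# so its accuracy equals the share of the majority class.
baseline_train_accuracy = n_cheap / (n_cheap + n_expensive)

print(round(baseline_train_accuracy, 2))
```

Because the classes are imbalanced, this majority share is the accuracy any later model has to beat.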

In [22]:
model_cheap_test = pd.Series(0, index=y_test.index)  # predict 0 (cheap) for every test house

In [44]:
test_accuracy = accuracy_score(y_true=y_test, y_pred=model_cheap_test)

round(test_accuracy, 2)

0.84

The test accuracy (0.84) is close to the training accuracy (0.86), i.e. the gap between train and test is small, which is good. Every further model must show a similarly small gap and must additionally perform better (i.e. have a higher accuracy) than this baseline model.
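The two requirements for further models can be written down as a small check. This is only a sketch: the function name `beats_baseline` and the tolerance `max_gap` are assumptions made for illustration (the notebook does not fix a threshold), and the baseline accuracy is the rounded test value reported above.

```python
def beats_baseline(train_acc, test_acc, baseline_test_acc, max_gap=0.03):
    """Return True if a model generalises and outperforms the baseline.

    max_gap is an assumed tolerance for the train-test accuracy gap;
    the notebook itself does not define one.
    """
    small_gap = abs(train_acc - test_acc) <= max_gap
    better = test_acc > baseline_test_acc
    return small_gap and better


# Rounded test accuracy reported for the baseline (Model 0).
baseline_test = 0.84

# A hypothetical candidate model with made-up accuracies.
print(beats_baseline(train_acc=0.90, test_acc=0.88, baseline_test_acc=baseline_test))

# The baseline compared with itself does not count as an improvement.
print(beats_baseline(train_acc=0.86, test_acc=0.84, baseline_test_acc=baseline_test))
```

Comparing on test accuracy rather than training accuracy keeps the check focused on how a model behaves on unseen houses.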

### Model 1: Houses with pool are expensive

An alternative intuitive model: predict a house as expensive whenever it has a pool, i.e. whenever its PoolArea is non-zero.

In [59]:
# Predict 1 (expensive) for any house with a non-zero pool area, else 0 (cheap)
model_pools_train = (X_train['PoolArea'] != 0).astype(int)

In [60]:
train_accuracy = accuracy_score(y_true=y_train, y_pred=model_pools_train)

round(train_accuracy, 2)

0.85

In [61]:
# Same pool rule applied to the test data
model_pools_test = (X_test['PoolArea'] != 0).astype(int)

In [62]:
test_accuracy = accuracy_score(y_true=y_test, y_pred=model_pools_test)

round(test_accuracy, 2)

0.83

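This more complex model does not produce a more accurate prediction than our baseline model: its accuracy is 0.85 on the training data and 0.83 on the test data, compared with 0.86 and 0.84 for the baseline.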
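One way to see how a pool rule can fail to beat the baseline: every expensive house with a pool that the rule gets right can be offset by a cheap house with a pool that it gets wrong. The sketch below uses a made-up ten-row toy table (not rows from the housing data) purely to illustrate this effect.

```python
import pandas as pd

# Made-up toy data for illustration only, NOT taken from the housing dataset.
toy = pd.DataFrame({
    "PoolArea":  [0, 0, 0, 0, 0, 0, 0, 512, 0, 648],
    "Expensive": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Model 0: every house is cheap.
pred_cheap = pd.Series(0, index=toy.index)

# Model 1: a house with a non-zero pool area is expensive.
pred_pool = (toy["PoolArea"] != 0).astype(int)

# Accuracy = share of rows where the prediction matches the label.
acc_cheap = float((pred_cheap == toy["Expensive"]).mean())
acc_pool = float((pred_pool == toy["Expensive"]).mean())

print(acc_cheap, acc_pool)
```

In this toy table the pool rule finds one expensive house but also mislabels one cheap house with a pool, so both models end up with the same accuracy.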