# House prices simplified

This model is meant to be simple. The dataset has many variables for a small number of instances, so overfitting is a serious risk. I'll therefore select, up front, a few variables that domain experts consider the most important.

A quick look at domain research suggests that the price of a house is mostly influenced by:

- Prices of similar houses in the neighborhood
- Location (proximity to schools, employment opportunities, entertainment, ...)
- Size and usable space
- Age and condition
- Number of bedrooms and bathrooms
- Garage

In [151]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce
import joblib
from scipy.stats.distributions import uniform, randint
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from sklearn.impute import SimpleImputer

In [2]:
df = pd.read_csv('train.csv')

In [51]:
X = df.drop(['SalePrice'], axis=1)
y = df['SalePrice']

In [155]:
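def write_test(mod, name):
    """Predict SalePrice for test.csv with a fitted model and save the result to `name`."""
    df_t = pd.read_csv('test.csv')
    # Pair each test Id with its prediction; both use the default RangeIndex, so rows align
    df_pred = pd.concat([df_t.Id, pd.Series(mod.predict(df_t), name='SalePrice')], axis=1)
    df_pred.to_csv(name, index=False)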

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [33]:
class Preprocessing_Select(TransformerMixin, BaseEstimator):
    """ Simple transformation and Select Variables to work with """
    # House_Age = YrSold - YrBuilt | To assign a scale
    def __init__(self, vars_keep=False):
        self.vars_keep = vars_keep
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        Xt = X.copy()
        Xt['House_Age'] = X['YrSold'] - X['YearBuilt']
        if not self.vars_keep:
            return Xt
        return Xt[self.vars_keep]
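
To check what the transformer does, it can be run on a tiny frame. The sketch below repeats a condensed copy of the class so that it runs on its own; the `toy` rows are hypothetical and not taken from `train.csv`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class Preprocessing_Select(TransformerMixin, BaseEstimator):
    """Condensed copy of the transformer above, repeated so this cell runs on its own."""

    def __init__(self, vars_keep=False):
        self.vars_keep = vars_keep

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        Xt = X.copy()
        Xt['House_Age'] = X['YrSold'] - X['YearBuilt']
        return Xt[self.vars_keep] if self.vars_keep else Xt


# Hypothetical toy rows, only for illustration
toy = pd.DataFrame({
    'YrSold': [2008, 2010],
    'YearBuilt': [2003, 1976],
    'GrLivArea': [1710, 1262],
    'Street': ['Pave', 'Pave'],
})

# With vars_keep set, only the listed columns come back
selected = Preprocessing_Select(vars_keep=['House_Age', 'GrLivArea']).fit_transform(toy)

# With the default, every original column is kept and House_Age is appended
everything = Preprocessing_Select().fit_transform(toy)

print(selected)
print(list(everything.columns))
```

Because `transform` works on a copy, the input frame is left without a `House_Age` column, which keeps the step safe to reuse inside cross-validation.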

In [162]:
simple = ['House_Age', 'GrLivArea', 'GarageCars', 'OverallQual', 'BedroomAbvGr', 'FullBath']
pipe_RF = Pipeline([
    ('Preprocess', Preprocessing_Select(vars_keep=simple)),
    ('Impute', SimpleImputer(strategy='median')),
    ('Predict', RandomForestRegressor()),
])

In [160]:
pipe_RF.fit(X, y)

Pipeline(steps=[('Preprocess',
                 Preprocessing_Select(vars_keep=['House_Age', 'GrLivArea',
                                                 'GarageCars', 'OverallQual',
                                                 'BedroomAbvGr', 'FullBath'])),
                ('Impute', SimpleImputer(strategy='median')),
                ('Predict', RandomForestRegressor())])

In [161]:
write_test(pipe_RF, 'Simple_Model.csv')

In [154]:
cross_val_score(pipe_RF, X, y, cv=10, scoring='neg_mean_squared_log_error').mean()

-0.03017758803343712
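
The score above is a negated mean squared log error, so it is easier to read after flipping the sign and taking the square root. The sketch below does that for the reported value. Note that the root of the averaged MSLE is an approximation and is not identical to averaging a per-fold RMSLE.

```python
import math

# Mean 10-fold score reported above (negated MSLE, so values closer to 0 are better)
mean_neg_msle = -0.03017758803343712

# Flip the sign to recover the mean MSLE, then take the root to get an RMSLE-style figure
rmsle = math.sqrt(-mean_neg_msle)
print(round(rmsle, 4))
```

This works out to roughly 0.17 on the log scale.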