# Detect Outliers

In our EDA in R, we determined that the houses with ids 524, 692, 1183, and 1299 (R indexing starts at 1) had very large GrLivArea, while 31, 496, 534, 917, and 969 had very low SalePrice for their size relative to the rest of the population. We want to determine which subsets of these points can be dropped to increase prediction accuracy on a validation set.

In [6]:
# Subtract 1 to convert R's 1-based row ids to Python's 0-based positions
pot_outliers = [524-1, 692-1, 1183-1, 1299-1, 31-1, 496-1, 534-1, 917-1, 969-1]
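As a small sanity check on the off-by-one conversion, the same list can be built from the R ids directly and converted back. This is only a sketch; the names `r_ids` and `positions` are introduced here for illustration and do not appear elsewhere in the notebook.

```python
# R row ids (1-based), copied from the text above.
r_ids = [524, 692, 1183, 1299, 31, 496, 534, 917, 969]

# Python positions are 0-based, so shift every id down by one.
positions = [i - 1 for i in r_ids]

# Converting back must reproduce the original R ids.
round_trip = [p + 1 for p in positions]
print(round_trip == r_ids)
```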

In [21]:
import itertools
import numpy as np
import pandas as pd
pd.set_option('display.precision',20)

from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict, KFold, cross_val_score, GridSearchCV, \
                                    ShuffleSplit

In [2]:
# Root-mean-squared error, used to compare fits on the validation set
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
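To make the metric concrete, here is a minimal worked example on made-up numbers. It uses a NumPy-only helper, `rmse_np`, which is a hypothetical stand-in written for this illustration and should agree with the `rmse` function above.

```python
import numpy as np

def rmse_np(y_true, y_pred):
    """Root-mean-squared error computed with NumPy only."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values: the errors are 0, 0 and 2, so the squared errors are 0, 0 and 4,
# their mean is 4/3, and the RMSE is sqrt(4/3).
toy_error = rmse_np([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(toy_error)
```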

We import the preprocessed data set that includes all data points.

In [4]:
df = pd.read_csv("input/train_tidy_000000000.csv")

The columns GarageAge and GarageAgeLin contain NAs, so we must either drop these two columns or drop the rows with NoGarage == 1. For now, we drop the two columns.

In [15]:
df.drop(['GarageAge', 'GarageAgeLin'], axis=1, inplace=True)
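The two options mentioned above (dropping the NA columns versus dropping the rows without a garage) can be compared on a toy frame. The values below are invented for illustration, and the sketch assumes that the NAs occur exactly in the rows where NoGarage == 1.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data: one house has no garage.
toy = pd.DataFrame({
    "GarageAge": [10.0, np.nan, 3.0],
    "NoGarage": [0, 1, 0],
    "SalePrice": [12.1, 11.5, 12.4],
})

# Columns that contain at least one NA.
na_cols = toy.columns[toy.isna().any()].tolist()

# Option 1: drop the affected columns and keep every row.
drop_cols = toy.drop(columns=na_cols)

# Option 2: keep every column and drop the rows without a garage.
drop_rows = toy[toy["NoGarage"] != 1]

print(na_cols, drop_cols.shape, drop_rows.shape)
```

Option 1 keeps all observations at the cost of two features; option 2 keeps the features but loses every house without a garage.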

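We want to split this into a training and a validation set. All potential outliers must end up in the training set so that we can later drop them from it, so we set them aside, split the remaining rows, and then add the outliers back to the training part.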

In [18]:
outlier_df = df.iloc[pot_outliers]

In [19]:
nooutlier_df = df.drop(pot_outliers)

In [22]:
ss = ShuffleSplit(n_splits=1, test_size=0.20, random_state=89)

In [23]:
X = nooutlier_df.values

In [25]:
for train_idx, validation_idx in ss.split(X):
    train_df = nooutlier_df.iloc[train_idx]
    validation_df = nooutlier_df.iloc[validation_idx]

In [29]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
train_df = pd.concat([train_df, outlier_df])
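The split procedure above (set chosen rows aside, split the rest, add the chosen rows back to the training part) can be sketched on a small synthetic frame. The frame `toy` and the positions in `forced_train` are hypothetical and only serve to show the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit

# Synthetic stand-in for the housing data.
toy = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) * 2.0})
forced_train = [3, 17]  # hypothetical rows that must end up in the training set

held = toy.iloc[forced_train]
rest = toy.drop(toy.index[forced_train])

# Split only the remaining rows, then add the held-out rows back to training.
splitter = ShuffleSplit(n_splits=1, test_size=0.20, random_state=89)
train_pos, valid_pos = next(splitter.split(rest))

toy_train = pd.concat([rest.iloc[train_pos], held])
toy_valid = rest.iloc[valid_pos]

print(len(toy_train), len(toy_valid))
```

Because `iloc` and `pd.concat` preserve the original index labels, the forced rows can still be addressed by those labels after the split.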

We'll set up the matrices we use for the validation set, since these won't change as we drop outliers. 

In [31]:
y_validation = validation_df['SalePrice'].values
x_validation = validation_df.drop(['HouseId', 'SalePrice'],axis=1).values

We will use the LassoLarsCV model with root-mean-squared error (RMSE) as our metric. As we will see, linear models do reasonably well on this problem, and LassoLarsCV selects its regularization hyperparameter automatically by cross-validation on the training data.

In [32]:
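# Cross-validation folds. shuffle defaults to False, in which case random_state has no
# effect and recent scikit-learn versions raise an error if it is set, so it is omitted.
kfold = KFold(n_splits=10)

lr = linear_model.LassoLarsCV(verbose=False, max_iter=5000, precompute='auto', cv=kfold,
                              max_n_alphas=1000, n_jobs=-1)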
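To illustrate what "automatically selected by CV" means, the following sketch fits LassoLarsCV on synthetic data and reads back the chosen regularization strength from the fitted `alpha_` attribute. The data, the fold count, and the names `X_toy`, `y_toy`, and `model` are assumptions made for this example, not part of the analysis.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import KFold

# Synthetic regression problem: only the first three features matter.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[:3] = [3.0, -2.0, 1.5]
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=200)

# The model evaluates the candidate alphas along the LARS path on each fold
# and keeps the one with the lowest mean cross-validated squared error.
model = LassoLarsCV(cv=KFold(n_splits=5), max_iter=5000)
model.fit(X_toy, y_toy)

chosen_alpha = model.alpha_
print(chosen_alpha, np.round(model.coef_[:3], 2))
```

No separate grid search is needed: the selected value is available in `alpha_` after `fit`.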

We want to set a baseline value by training on the full training set, with all potential outliers included, and then predicting on the validation set.

In [33]:
y_train = train_df['SalePrice'].values
x_train = train_df.drop(['HouseId', 'SalePrice'],axis=1).values
lr.fit(x_train, y_train)
y_pred = lr.predict(x_validation)



In [35]:
baseline = rmse(y_validation, y_pred)
baseline

0.12503266782864195

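We'll examine dropping every possible subset of the nine potential outliers. There are 2^9 = 512 subsets in total, including the empty one, which reproduces the baseline where we don't drop any points. Because train_df still carries its original index (HouseId - 1), the entries of pot_outliers can be passed to drop as labels.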
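The count of 512 can be checked by enumerating the subsets of nine items by size, in the same way the loop below iterates over them. This is a standalone sketch; `n_points` and `subsets` are names used only here.

```python
import itertools

n_points = 9  # number of potential outliers

# Enumerate subsets of every size from 0 (drop nothing) to 9 (drop everything).
subsets = [
    subset
    for size in range(n_points + 1)
    for subset in itertools.combinations(range(n_points), size)
]

n_subsets = len(subsets)
print(n_subsets, n_subsets == 2 ** n_points)
```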

In [55]:
# Fit one model per subset of potential outliers and score it on the validation set.
# drop_pts holds index labels (HouseId - 1), which train_df still carries.
features_df = train_df.drop(['HouseId', 'SalePrice'], axis=1)
results = []
for L in range(0, len(pot_outliers) + 1):
    for subset in itertools.combinations(pot_outliers, L):
        drop_pts = list(subset)
        y_train = train_df['SalePrice'].drop(drop_pts).values
        x_train = features_df.drop(drop_pts).values
        lr.fit(x_train, y_train)
        y_pred = lr.predict(x_validation)
        error = rmse(y_validation, y_pred)
        results.append({'Dropped': str([x + 1 for x in drop_pts]),  # report as 1-based ids
                        'RMSE': error,
                        'Diff from Base': error - baseline})
comb_drop_results_df = pd.DataFrame(results, columns=['Dropped', 'RMSE', 'Diff from Base'])
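The loop relies on `drop` working with index labels rather than row positions. A minimal sketch on a toy frame (all names and values here are hypothetical) shows why this matters once rows have been shuffled by `iloc`:

```python
import pandas as pd

toy = pd.DataFrame({"v": [10, 20, 30, 40, 50]})

# iloc selects by position but keeps the original labels: 4, 1, 3.
shuffled = toy.iloc[[4, 1, 3]]

# drop removes the row whose label is 1, not the row at position 1.
after_drop = shuffled.drop([1])

print(list(after_drop.index), list(after_drop["v"]))
```

Dropping by position would have removed a different row, so keeping the original index intact is what makes the HouseId - 1 labels usable.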



In [56]:
comb_drop_results_df.sort_values(['RMSE'])

| | Dropped | RMSE | Diff from Base |
|---|---|---|---|
| 312 | [524, 1299, 31, 496, 917] | 0.11751015503521701488 | -0.00752251279342493195 |
| 418 | [524, 1183, 1299, 31, 496, 917] | 0.11767344602977361512 | -0.00735922179886833172 |
| 403 | [524, 692, 1299, 31, 496, 917] | 0.11767832787658395743 | -0.00735433995205798940 |
| 487 | [524, 1183, 1299, 31, 496, 534, 917] | 0.11794212227929237735 | -0.00709054554934956949 |
| 509 | [524, 1183, 1299, 31, 496, 534, 917, 969] | 0.11794722296893855873 | -0.00708544485970338811 |
| 489 | [524, 1183, 1299, 31, 496, 917, 969] | 0.11800623920290025104 | -0.00702642862574169580 |
| 432 | [524, 1299, 31, 496, 534, 917] | 0.11801314170909987800 | -0.00701952611954206884 |
| 481 | [524, 692, 1299, 31, 496, 534, 917] | 0.11807551789471042170 | -0.00695714993393152514 |
| 434 | [524, 1299, 31, 496, 917, 969] | 0.11808353826327439018 | -0.00694912956536755666 |
| 493 | [524, 1299, 31, 496, 534, 917, 969] | 0.11808925058320160484 | -0.00694341724544034200 |


In [57]:
comb_drop_results_df.sort_values(['RMSE']).to_csv('comb_drop_results.csv', header=True)

We note that the difference from the baseline is small for all of these sets: the best one lowers the validation RMSE from 0.1250 to 0.1175, about 6%, and the ten best sets differ from one another by less than 0.0006. The (31, 496, 524, 917, 1299) set gives the lowest validation RMSE on this split, though the other top sets could be used to generate an ensemble of models.
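One way such an ensemble could look is to fit one model per drop set and average the predictions. The sketch below uses synthetic data and invented drop sets; it is an illustration of the idea under those assumptions, not a result from this analysis.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV

# Synthetic training and validation features (hypothetical stand-ins).
rng = np.random.default_rng(1)
X_tr = rng.normal(size=(120, 5))
y_tr = X_tr @ np.array([1.0, 0.5, 0.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=120)
X_va = rng.normal(size=(30, 5))

# Hypothetical drop sets, given as row positions in the training data.
drop_sets = [[0, 5], [0, 5, 9], [0, 7]]

preds = []
for drop in drop_sets:
    # Keep every training row that is not in the current drop set.
    keep = np.setdiff1d(np.arange(len(X_tr)), drop)
    member = LassoLarsCV(cv=5).fit(X_tr[keep], y_tr[keep])
    preds.append(member.predict(X_va))

# Simple unweighted average of the member predictions.
ensemble_pred = np.mean(preds, axis=0)
print(ensemble_pred.shape)
```

An unweighted mean is the simplest choice; weighting members by their validation error would be another option, but it would need a further held-out set to evaluate fairly.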