# Outlier Detection and Removal
https://machinelearningmastery.com/model-based-outlier-detection-and-removal-in-python/

## Dataset
[House Price Dataset (housing.csv)](https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv)

[House Price Dataset Description (housing.names)](https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.names)

### Load and summarize the dataset

In [1]:
from pandas import read_csv

from sklearn.model_selection import train_test_split

In [2]:
# Load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)

In [3]:
# Retrieve the array
data = df.values

In [4]:
# Split into input and output elements
X, y = data[:, :-1], data[:, -1]

In [5]:
# Summarize the shape of the dataset
X.shape, y.shape

((506, 13), (506,))

In [6]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [7]:
# Summarize the shape of the train and test sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((339, 13), (167, 13), (339,), (167,))

## Local Outlier Factor Performance
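A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space. The local outlier factor (LOF) does this by scoring each example on how isolated it is relative to its nearest neighbors; examples in much sparser regions than their neighbors are flagged as outliers.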
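As a minimal sketch of this idea on toy data, the example below places one isolated point far from a regular grid and checks how `LocalOutlierFactor` labels it. The grid, the isolated point, and `n_neighbors=5` are assumptions chosen for illustration and are not part of the housing example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy data (assumed for illustration): a 5x5 grid of evenly spaced points
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
# One point placed far away from the grid
far_point = np.array([[20.0, 20.0]])
X_toy = np.vstack([grid, far_point])

# fit_predict returns 1 for inliers and -1 for outliers
toy_lof = LocalOutlierFactor(n_neighbors=5)
toy_labels = toy_lof.fit_predict(X_toy)

# negative_outlier_factor_ holds the negated LOF score;
# the more negative the value, the more isolated the point
print(toy_labels[-1])
print(toy_lof.negative_outlier_factor_[-1])
```

The last row is the isolated point, so its label should be -1 and its score should be far more negative than the scores of the grid points.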

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import mean_absolute_error

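The model provides a `contamination` argument: the expected proportion of outliers in the dataset. It defaulted to 0.1 in older scikit-learn releases; since version 0.22 the default is `'auto'`, which applies a fixed threshold to the outlier scores instead of a fixed proportion.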
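Because the default has changed between library versions, it may be safer to set `contamination` explicitly. The following is a sketch on synthetic data, where `X_demo` and the value 0.1 are assumptions for illustration. It shows that the fraction of flagged rows tracks the requested proportion.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (assumed for illustration): 200 samples, 3 features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))

# Set contamination explicitly so the result does not depend on the library default
demo_lof = LocalOutlierFactor(contamination=0.1)
demo_labels = demo_lof.fit_predict(X_demo)

# Fraction of rows flagged as outliers (label -1); expected to be close to 0.1
flagged_fraction = np.mean(demo_labels == -1)
print(flagged_fraction)
```

With a float `contamination`, the cutoff is taken from a percentile of the fitted scores, so roughly that share of the training rows is labeled -1 whether or not they are genuinely anomalous.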

In [9]:
# Identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)

In [10]:
# Select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

In [11]:
# Summarize the shape of the updated training dataset
X_train.shape, y_train.shape

((305, 13), (305,))

In [12]:
# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [13]:
# Make predictions on the test set
yhat = model.predict(X_test)

In [14]:
# Compute the mean absolute error of the predictions
mae = mean_absolute_error(y_test, yhat)

In [15]:
print(f'MAE {mae}')

MAE 3.3559923292852263
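To judge whether removing outliers helps, this MAE needs a baseline: the same model fit on the unfiltered training set and scored on the same test set. The sketch below makes that comparison on synthetic data. The `evaluate_mae` helper and the `make_regression` data are hypothetical stand-ins, so its numbers are not comparable to the housing result above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor


def evaluate_mae(X_tr, y_tr, X_te, y_te):
    """Fit a linear regression on the training data and return the test MAE."""
    model = LinearRegression()
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_te, model.predict(X_te))


# Synthetic regression data (assumed for illustration, not the housing dataset)
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=1)

# Baseline: train on every training row
baseline_mae = evaluate_mae(X_tr, y_tr, X_te, y_te)

# Filtered: drop training rows that LOF labels as outliers, keep the test set unchanged
keep = LocalOutlierFactor().fit_predict(X_tr) != -1
filtered_mae = evaluate_mae(X_tr[keep], y_tr[keep], X_te, y_te)

print(f'Baseline MAE {baseline_mae}')
print(f'Filtered MAE {filtered_mae}')
```

Keeping the filtered rows in separate variables, rather than overwriting `X_train` and `y_train`, lets both models be evaluated from the same split.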
