# Outlier Identification and Removal

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

This tutorial covers the identification and removal of outliers in datasets. We'll explore various techniques to detect and handle outliers, which are data points that significantly differ from other observations in a dataset.

## Learning Objectives

- Understand what outliers are and why they matter in data analysis
- Learn different methods to identify outliers
- Implement outlier removal techniques

## Prerequisites

- Basic understanding of Python programming
- Familiarity with the NumPy library
- Knowledge of basic statistical concepts (mean, standard deviation, percentiles)


## Get Started

To start, we install required packages, import the necessary libraries, and define a helper function to download data using the `requests` library.

### Install required packages

In [None]:
%pip install matplotlib seaborn numpy pandas requests scikit-learn

### Import necessary libraries

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import requests
import seaborn as sns
from numpy import percentile, random
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor


### Define utility functions

Define a helper function for downloading example datasets.  

*Note!* It is not essential that you understand the following code.  It is just for getting the example data.

In [None]:
def download(url, to_file):
    """Download content from the given URL and save it to a file.

    Args:
        url (str): The URL to download the content from.
        to_file (str): The name of the file to save the downloaded content to.

    """
    response = requests.get(url, timeout=10)
    # Fail loudly on HTTP errors instead of saving an error page to disk
    response.raise_for_status()
    Path(to_file).write_bytes(response.content)
    print(f"downloaded file '{to_file}'")

## Outlier Identification and Removal

### What are Outliers?

An outlier is an observation that is unlike the other observations. Outliers are rare, distinct, or do
not fit in some way.

We will generally define outliers as samples that are exceptionally far from the
mainstream of the data.

Outliers can have many causes, such as:

- Measurement or input error.
- Data corruption.
- True outlier observation.

There is no precise way to define and identify outliers in general because of the specifics of
each dataset. Instead, you, or a domain expert, must interpret the raw observations and decide
whether a value is an outlier or not.

### Remove outliers using Standard Deviation method

#### Generate a dataset of random observations

In [None]:
# Seed the random number generator
random.seed(1)

# Generate univariate observations
data = 5 * random.randn(10000) + 50

# Summarize
print("mean=%.3f stdv=%.3f" % (np.mean(data), np.std(data)))

# Plot the data
sns.displot(data)

#### Calculate summary statistics

In [None]:
# Calculate summary statistics
data_mean, data_std = np.mean(data), np.std(data)

#### Identify outliers

In [None]:
# Define outliers
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off

# Identify outliers
outliers = [x for x in data if x < lower or x > upper]
print("Identified outliers: %d" % len(outliers))

#### Identify non-outliers

In [None]:
# Identify non outliers
non_outliers = [x for x in data if x >= lower and x <= upper]
print("Non-outlier observations: %d" % len(non_outliers))
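The cells above apply a fixed cut-off of three standard deviations around the mean; for Gaussian data, roughly 99.7% of observations fall inside that band. As a minimal sketch of the same idea in vectorized form, the hypothetical helper `split_by_std` below (the name and the `n_std` parameter are illustrative, not part of the original tutorial) uses a NumPy boolean mask instead of list comprehensions.

```python
import numpy as np


def split_by_std(values, n_std=3.0):
    """Split values into (non_outliers, outliers) using mean +/- n_std * std.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    lower, upper = mean - n_std * std, mean + n_std * std
    # Boolean mask: True where the value lies inside the cut-off band
    inside = (values >= lower) & (values <= upper)
    return values[inside], values[~inside]


# Same kind of synthetic Gaussian sample as in the cells above
rng = np.random.RandomState(1)
sample = 5 * rng.randn(10000) + 50

kept, removed = split_by_std(sample)
print("kept=%d removed=%d" % (len(kept), len(removed)))
```

Smaller samples or data that is far from Gaussian may call for a different `n_std`; values between 2 and 4 are common choices.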

### Remove outliers using Interquartile Range method

#### Generate a dataset of random observations

In [None]:
# Seed the random number generator
random.seed(1)

# Generate univariate observations
data = 5 * random.randn(10000) + 50

# Plot the data
sns.displot(data)

#### Calculate summary statistics

In [None]:
# Calculate interquartile range
q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25
print("Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f" % (q25, q75, iqr))

#### Identify outliers

In [None]:
# Calculate the outlier cutoff
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off

# Identify outliers
outliers = [x for x in data if x < lower or x > upper]
print("Identified outliers: %d" % len(outliers))

#### Identify non-outliers

In [None]:
# Identify non outliers
non_outliers = [x for x in data if x >= lower and x <= upper]
print("Non-outlier observations: %d" % len(non_outliers))
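The interquartile range rule does not assume a Gaussian distribution: values more than `k` times the IQR below the 25th percentile or above the 75th percentile are treated as outliers. The sketch below wraps the steps above in a hypothetical helper, `split_by_iqr` (the name, the `k` parameter, and the toy input are assumptions made for illustration).

```python
import numpy as np


def split_by_iqr(values, k=1.5):
    """Split values into (non_outliers, outliers) using the IQR rule.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = k * (q75 - q25)
    lower, upper = q25 - cut_off, q75 + cut_off
    # Boolean mask: True where the value lies inside the IQR fences
    inside = (values >= lower) & (values <= upper)
    return values[inside], values[~inside]


# Toy data: nine small values and one value far from the rest
kept, removed = split_by_iqr([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print("kept:", kept)
print("removed:", removed)
```

Here `k=1.5` matches the cut-off used in the cells above; a larger factor such as 3 is sometimes used to flag only extreme outliers.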

### Remove outliers using Automatic Outlier Detection method

A simple approach to identifying outliers is to locate those examples that are far from the
other examples in the multi-dimensional feature space. This can work well for feature spaces
with low dimensionality (few features), although it becomes less reliable as the number of
features increases, a problem referred to as the **curse of dimensionality**. The local outlier factor, or
LOF for short, is a technique that harnesses the idea of nearest neighbors for outlier
detection. Each example is assigned a score of how isolated it is, or how likely it is to be an outlier,
based on the size of its local neighborhood. Examples with the largest scores are more
likely to be outliers.
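To make the scoring idea concrete, here is a small sketch on synthetic two-dimensional data (the data, variable names, and `n_neighbors=20` are assumptions chosen for illustration). scikit-learn stores the negated scores in `negative_outlier_factor_`, so the sign is flipped to recover scores where larger means more isolated.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: one tight cluster plus a single point far away from it
rng = np.random.RandomState(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
far_point = np.array([[10.0, 10.0]])
points = np.vstack([cluster, far_point])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(points)  # 1 = normal, -1 = outlier

# Flip the sign so that larger scores mean "more isolated"
scores = -lof.negative_outlier_factor_

print("label of far point:", labels[-1])
print("index of largest LOF score:", int(np.argmax(scores)))
```

Scores near 1 indicate a point whose local density is similar to that of its neighbors; the isolated point receives a much larger score.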

#### Diabetes Dataset

The dataset records, for each patient, whether there was an onset of diabetes within five years.

```
Number of Instances: 768
Number of Attributes: 8 plus class 
For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as "tested positive for
   diabetes")
   Class Value  Number of instances
   0            500
   1            268
```

You can learn more about the dataset here:

- Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
- Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))

#### Download and summarize diabetes data files

In [None]:
# Download the data
download(
  url="https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv",
  to_file="pima-indians-diabetes.csv"
)
download(
  url="https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names",
  to_file="pima-indians-diabetes.names"
)

# Load and summarize the dataset

# Load the dataset
df = pd.read_csv("pima-indians-diabetes.csv", header=None)

# Retrieve the array
data = df.values

# Split into input and output elements
X, y = data[:, :-1], data[:, -1]

# Summarize the shape of the dataset
print(X.shape, y.shape)

#### Set up train and test data

In [None]:

# Split into train (70%) and test sets (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Summarize the shape of the train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

#### Evaluate model on raw dataset

In [None]:
# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
yhat = model.predict(X_test)

# Evaluate predictions using mean absolute error
mae = mean_absolute_error(y_test, yhat)
print("MAE: %.3f" % mae)

#### Remove outliers from the data using Local Outlier Factor (LOF)

Next, we can try removing outliers from the training dataset. The expectation is that the
outliers are causing the linear regression model to learn a biased or skewed understanding of the
problem, and that removing these outliers from the training set will allow a more effective model
to be learned.

The **Local Outlier Factor** (LOF) algorithm is an unsupervised anomaly detection method which computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors. 

We can achieve this by defining the **LocalOutlierFactor** model and using it to
make a prediction on the training dataset, marking each row in the training dataset as normal
(1) or an outlier (-1). We will use the default hyperparameters for the outlier detection model,
although it is a good idea to tune the configuration to the specifics of your dataset.



In [None]:
# Define the LOF outlier detector with default hyperparameters
lof = LocalOutlierFactor()

# Fit the model to the training set X and return the labels.
yhat = lof.fit_predict(X_train)

# Select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

# Summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)

# Fit the model without outliers
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
yhat = model.predict(X_test)

# Evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print("MAE: %.3f" % mae)

We can see that the MAE (mean absolute error) dropped from 0.324 to 0.317.
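The result above uses the default LOF hyperparameters. As a rough sketch of what tuning might look like, the loop below varies two real `LocalOutlierFactor` parameters, `n_neighbors` and `contamination`, on synthetic data and reports how many rows each setting flags. The data, the candidate settings, and the `flagged_counts` dictionary are assumptions made for illustration, not recommendations for the diabetes dataset.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in for a training matrix (200 rows, 3 features)
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 3))

# Hypothetical candidate settings: (n_neighbors, contamination)
settings = [(20, "auto"), (10, 0.05), (35, 0.10)]

flagged_counts = {}
for n_neighbors, contamination in settings:
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    labels = lof.fit_predict(X_demo)  # 1 = normal, -1 = outlier
    flagged_counts[(n_neighbors, contamination)] = int((labels == -1).sum())

for (n_neighbors, contamination), count in flagged_counts.items():
    print(f"n_neighbors={n_neighbors}, contamination={contamination}: "
          f"flagged {count} of {len(X_demo)} rows")
```

A numeric `contamination` fixes the approximate share of rows flagged as outliers, while `"auto"` uses a fixed score threshold. In practice, each setting would be judged by the downstream model's error on held-out data rather than by the number of rows removed.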

## Conclusion

In this tutorial, we've learned how to identify and remove outliers using the standard deviation method, the interquartile range method, and automatic detection with the Local Outlier Factor. We've seen how outliers can affect data analysis and how their removal can lead to more accurate insights. Always consider the context of your data when applying outlier removal techniques, as some apparent outliers may contain useful information.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.