# Outliers

Some data points are so extreme that they act as anomalies. These are called outliers, and there will be times you will consider removing them. Outliers may be erroneous values well outside an acceptable range, or they may simply be unhelpful for what you are trying to achieve.

While there are valid cases to remove outliers, and that is what we will learn to do, just remember to ask what outliers mean in your application. Your smart thermostat may not need to learn from an unusually cold day in May, and that is an outlier you can safely consider removing. However, a pedestrian in a chicken costume disrupting a "self-driving" car's computer vision is a very serious issue, even if it is an outlier. 

Outliers are a very difficult topic to get right and require not just an understanding of statistics, but also an understanding of the problem. Just keep that in mind! 

Once we have determined we want to remove outliers, we can use tools like standard deviation and interquartile range. We can then use those techniques to remove outliers from our sample.

To prepare, let's bring in our dependencies as well as a dataset containing a sample of golden retriever weights.

In [1]:
import pandas as pd 
import numpy as np 

df = pd.read_csv('https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/distribution/golden_retriever_weights.csv',header=None, names=['weight'])
df

Unnamed: 0,weight
0,65.8
1,67.7
2,64.1
3,69.3
4,65.1
5,60.8
6,65.5
7,64.6
8,61.7
9,64.2


## Standard Deviation Outliers

One way we can deal with outliers is by marking and removing them by how many standard deviations they fall away from the mean. 

Let's calculate the mean and standard deviation of our golden retriever dataset.

In [4]:
mean = df['weight'].mean()
std = df['weight'].std()

print(f"MEAN: {mean}  STD: {std}")

MEAN: 64.43400000000001  STD: 3.0267251784928293




So the mean is approximately 64.434 and the standard deviation is about 3.0267. Note that Pandas treats the data as a sample when calculating standard deviation, so by default it uses 1 delta degree of freedom (`ddof=1`) and divides by $N-1$, as shown in this formula.

$$
s = \sqrt{\frac{\sum{(x_i - \bar{x})^2}}{N-1}}
$$ 
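
To make the formula above concrete, here is a minimal sketch that computes the sample standard deviation by hand and compares it with the Pandas and NumPy results. The ten weights are the first ten rows shown in the output earlier, used here only as a small illustrative sample rather than the full dataset.

```python
import numpy as np
import pandas as pd

# First ten weights from the output above, used as a small illustrative sample
weights = pd.Series([65.8, 67.7, 64.1, 69.3, 65.1, 60.8, 65.5, 64.6, 61.7, 64.2])

n = len(weights)
x_bar = weights.mean()

# Sample standard deviation written out by hand: divide by N - 1
manual_sample_std = np.sqrt(((weights - x_bar) ** 2).sum() / (n - 1))

pandas_std = weights.std()                              # Pandas default: ddof=1
numpy_sample_std = np.std(weights.to_numpy(), ddof=1)   # NumPy with ddof=1 matches Pandas
numpy_default_std = np.std(weights.to_numpy())          # NumPy default: ddof=0 (divides by N)

print(manual_sample_std, pandas_std, numpy_sample_std, numpy_default_std)
```

The first three values agree, while the NumPy default is slightly smaller because it divides by $N$ instead of $N-1$.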


To get a sense of how standard deviations play a role in omitting outliers, consider the graphic below. Assuming a normal distribution, about 68% of the expected data points fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations. The fewer standard deviations you use as the cutoff, the more aggressively outliers will be removed.

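[Figure: normal distribution curve marking the share of data within 1, 2, and 3 standard deviations of the mean]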
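
The 68%, 95%, and 99.7% figures can be checked with a few lines of standard-library Python. This is a sketch that assumes a perfectly normal distribution; the helper name `coverage_within` is made up for illustration.

```python
from math import erf, sqrt

def coverage_within(k: float) -> float:
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```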

For smaller samples, a cutoff of two standard deviations is more common. This means we declare any data on the tails outside those two standard deviations to be outliers, and they become candidates for removal.

Let's inspect the outliers outside two standard deviations. Multiply the standard deviation by 2 and subtract/add from the mean respectively to get the lower and upper bounds. Then we can compose a condition to identify the outliers by checking for weights less than or greater than these lower and upper bounds respectively.

In [10]:
lower = mean - (2*std) 
upper = mean + (2*std) 
outlier_condition = (df['weight'] < lower) | (upper < df['weight'])

df[outlier_condition]

Unnamed: 0,weight
11,57.4
18,58.1
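
As a quick check on the two flagged rows, plugging the rounded mean and standard deviation printed earlier into the bounds gives roughly:

$$
\begin{aligned}
\text{lower} &= 64.434 - 2(3.0267) \approx 58.38 \\
\text{upper} &= 64.434 + 2(3.0267) \approx 70.49
\end{aligned}
$$

Both 57.4 and 58.1 fall below the lower bound of about 58.38, which is why they are flagged.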


Alright, but we want to remove the outliers. We can flip that condition to keep only the elements that fall inside two standard deviations, not outside. Below we drop both of those weights, leaving a dataframe without the outliers beyond two standard deviations.

In [13]:
outliers_removed_df = df[(lower < df['weight']) & (df['weight'] < upper)]
outliers_removed_df

Unnamed: 0,weight
0,65.8
1,67.7
2,64.1
3,69.3
4,65.1
5,60.8
6,65.5
7,64.6
8,61.7
9,64.2


Note this is only for one dimension of data. You can also think of multi-dimensional distributions if you want to account for more than one field when identifying outliers. Just be careful: the more dimensions you put into a distribution, the sparser your data becomes, and the harder it is to reason about outliers.
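
The standard deviation filter from this section can be wrapped in a small helper so it is easy to reuse on one field at a time. This is only a sketch: the function name `remove_std_outliers` is hypothetical, and the sample data is the first ten weights from the output above plus a made-up extreme value of `50.0` added for illustration.

```python
import pandas as pd

def remove_std_outliers(df: pd.DataFrame, column: str, n_std: float = 2.0) -> pd.DataFrame:
    """Keep rows whose value in `column` lies strictly within n_std sample standard deviations of the mean."""
    mean = df[column].mean()
    std = df[column].std()  # Pandas default ddof=1 (sample standard deviation)
    z_scores = (df[column] - mean) / std
    # |z| < n_std is equivalent to mean - n_std*std < value < mean + n_std*std
    return df[z_scores.abs() < n_std]

# Ten weights from the output above plus one made-up extreme value (50.0)
sample_df = pd.DataFrame(
    {"weight": [65.8, 67.7, 64.1, 69.3, 65.1, 60.8, 65.5, 64.6, 61.7, 64.2, 50.0]}
)

filtered_df = remove_std_outliers(sample_df, "weight", n_std=2.0)
print(filtered_df)
```

In this toy sample only the made-up `50.0` lies more than two standard deviations from the mean, so it is the only row dropped.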

## Interquartile Range Outliers

A lot of data does not follow the nice bell curve shape of the normal distribution. Another way you can approach outliers in these cases is the interquartile range method, or IQR. The IQR is the difference between the 75th and 25th percentiles. The percentiles at quarter intervals (0, 25, 50, 75, and 100) are referred to as quartiles. The 50th percentile is the middle-most value (the median), or the average of the two most-centered values when the number of values is even.
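
To illustrate the quartiles, the following sketch computes them with NumPy for the first ten weights shown earlier (a small illustrative sample, not the full dataset). With an even number of values, the 50th percentile is the average of the two most-centered values.

```python
import numpy as np

# First ten weights from the output above, used as a small illustrative sample
weights = np.array([65.8, 67.7, 64.1, 69.3, 65.1, 60.8, 65.5, 64.6, 61.7, 64.2])

# The quartiles are the percentiles at 0, 25, 50, 75, and 100
minimum, q25, q50, q75, maximum = np.percentile(weights, [0, 25, 50, 75, 100])

# With ten values, the median is the mean of the 5th and 6th sorted values
sorted_weights = np.sort(weights)
middle_average = (sorted_weights[4] + sorted_weights[5]) / 2

print(f"min={minimum}, q25={q25}, median={q50}, q75={q75}, max={maximum}")
print(f"average of the two middle values: {middle_average}")
```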

Using the IQR, you define cutoffs that sit $ k $ times the IQR below the 25th percentile and above the 75th percentile respectively. A common value for $ k $ is $ 1.5 $, whereas a value of $ 3.0 $ would be used for more extreme cutoffs.
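
Written as a formula, with $Q_1$ and $Q_3$ standing for the 25th and 75th percentiles (notation introduced here only for brevity), the cutoffs are:

$$
\text{IQR} = Q_3 - Q_1, \qquad \text{lower} = Q_1 - k \cdot \text{IQR}, \qquad \text{upper} = Q_3 + k \cdot \text{IQR}
$$

Any value below the lower bound or above the upper bound is treated as an outlier.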

In Python, we can use the `percentile()` function in NumPy to find a given percentile in a dataset.

In [18]:
from numpy import percentile

q25 = percentile(df, 25)
q75 = percentile(df, 75)

q25, q75

(62.225, 66.275)

Then you can calculate the difference between the 75th and 25th percentile to get the IQR. 

In [21]:
iqr = q75 - q25
iqr

4.050000000000004

Let's say we wanted to use `k = 1.5` and calculate the cutoffs like this. 

In [24]:
k = 1.5
cut_off = iqr * k
lower = q25 - cut_off
upper = q75 + cut_off

lower, upper

(56.14999999999999, 72.35000000000001)

Finally, we can remove outliers that fall outside this range. 

In [27]:
outliers_removed_df = df[(lower < df['weight']) & (df['weight'] < upper)]
outliers_removed_df

Unnamed: 0,weight
0,65.8
1,67.7
2,64.1
3,69.3
4,65.1
5,60.8
6,65.5
7,64.6
8,61.7
9,64.2


As you can see above, a `k` value of `1.5` might be too generous for this dataset if we are looking to remove outliers. Maybe the outliers in this dataset are not extreme enough, or this technique is just not warranted. But we can experiment with lowering the `k` value to see how low the threshold must be before outliers are removed. Below, I find that a `k` value of `1.1` removes one outlier, with an index of `11` and a weight of `57.4`.

In [30]:
k = 1.1
cut_off = iqr * k
lower = q25 - cut_off
upper = q75 + cut_off

lower, upper

outliers_removed_df = df[(lower < df['weight']) & (df['weight'] < upper)]
outliers_removed_df

Unnamed: 0,weight
0,65.8
1,67.7
2,64.1
3,69.3
4,65.1
5,60.8
6,65.5
7,64.6
8,61.7
9,64.2
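
The experiment of lowering `k` can also be done in a loop that reports how many rows each value would flag. The sketch below uses a twelve-value subset (the first ten weights plus the two low weights seen in the earlier outputs), so its bounds and counts will not match the full dataset. The candidate `k` values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative subset only: ten weights from the output plus the two low weights seen earlier
sample_df = pd.DataFrame(
    {"weight": [65.8, 67.7, 64.1, 69.3, 65.1, 60.8, 65.5, 64.6, 61.7, 64.2, 57.4, 58.1]}
)

q25, q75 = np.percentile(sample_df["weight"], [25, 75])
iqr = q75 - q25

flagged_counts = {}
for k in (1.5, 1.25, 1.0, 0.75, 0.5):  # arbitrary candidate values, from generous to strict
    lower = q25 - k * iqr
    upper = q75 + k * iqr
    outliers = sample_df[(sample_df["weight"] < lower) | (upper < sample_df["weight"])]
    flagged_counts[k] = len(outliers)
    print(f"k={k}: bounds=({lower:.2f}, {upper:.2f}), flagged={len(outliers)}")
```

Because the bounds tighten as `k` shrinks, the number of flagged rows can only stay the same or grow. How low to go remains a judgment call about your data, not something the loop can decide for you.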


You can also use this technique on multidimensional data by specifying an IQR policy for each field where you want to remove outliers.
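
A per-field IQR policy might look like the sketch below. Everything here is hypothetical: the `iqr_mask` helper, the `dogs_df` dataframe with a made-up `height` field, and the `k` values chosen for each field.

```python
import numpy as np
import pandas as pd

def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values strictly inside the IQR cutoffs for the given k."""
    q25, q75 = np.percentile(series, [25, 75])
    cut_off = (q75 - q25) * k
    return (q25 - cut_off < series) & (series < q75 + cut_off)

# Hypothetical two-field dataset with one extreme weight and one extreme height
dogs_df = pd.DataFrame({
    "weight": [65.8, 67.7, 64.1, 69.3, 65.1, 60.8, 65.5, 64.6, 61.7, 95.0],
    "height": [22.5, 23.0, 22.0, 23.5, 22.8, 21.5, 30.0, 22.4, 21.9, 22.6],
})

# One k value per field; each field could use a different policy
k_by_field = {"weight": 1.5, "height": 1.5}

keep = pd.Series(True, index=dogs_df.index)
for field, k in k_by_field.items():
    keep &= iqr_mask(dogs_df[field], k)  # a row survives only if every field is inside its bounds

cleaned_df = dogs_df[keep]
print(cleaned_df)
```

Here a row is dropped if any single field falls outside its bounds, so the rows with the extreme weight and the extreme height are both removed. A gentler policy could require more than one field to be out of range.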

## Using LocalOutlierFactor

From a machine learning perspective, you can treat outlier detection as a classification problem. Data points that sit far away from the rest in a multidimensional space can be flagged as outliers. However, this becomes less reliable on higher-dimensional problems due to the curse of dimensionality. The `LocalOutlierFactor` in scikit-learn applies this idea by measuring how far each data point is from its neighbors.
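
Before moving to a real dataset, a small synthetic sketch can show the idea. The data below is made up: fifty points clustered around the origin plus one point placed far away at `(10, 10)`. The `n_neighbors=5` setting is an arbitrary choice for this tiny example, not a recommendation.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Made-up data: a tight cluster plus one point far from everything else
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
far_point = np.array([[10.0, 10.0]])
X = np.vstack([cluster, far_point])

lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X)             # 1 for inliers, -1 for outliers
scores = lof.negative_outlier_factor_   # more negative means more outlier-like

print("label of the far point:", labels[-1])
print("most outlier-like index:", scores.argmin())
```

The far point is labeled `-1` and has the most negative score. Points on the edge of the cluster may also be flagged, depending on the neighborhood size and the contamination setting.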

Let's bring in a different dataset, the maintenance prediction dataset. 

In [35]:
from sklearn.neighbors import LocalOutlierFactor

df = pd.read_csv('https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/classification/maintenance_predict.csv')
df

Unnamed: 0,DAYS_SINCE_INSTALL,FLIGHT_HOURS,ENV_TEMPERATURE,REPLACEMENT_NEEDED
0,93,792,126,1
1,11,107,113,1
2,23,152,110,1
3,32,375,110,1
4,141,701,110,1
...,...,...,...,...
439,57,448,22,0
440,155,650,21,0
441,176,1697,17,1
442,33,188,12,0


Let's then create a `LocalOutlierFactor` with the default settings, which you can [explore here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html). We will get a `-1` for every record that is deemed an outlier. 

In [37]:
lof = LocalOutlierFactor()
outlier_ind = lof.fit_predict(df.iloc[:,:-1])

outlier_ind

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1

Comparing this array against `-1` yields a boolean mask. If we pass that mask back to the dataframe, we can omit the 12 records that are deemed outliers.

In [39]:
df[outlier_ind != -1]

Unnamed: 0,DAYS_SINCE_INSTALL,FLIGHT_HOURS,ENV_TEMPERATURE,REPLACEMENT_NEEDED
0,93,792,126,1
1,11,107,113,1
2,23,152,110,1
3,32,375,110,1
4,141,701,110,1
...,...,...,...,...
439,57,448,22,0
440,155,650,21,0
441,176,1697,17,1
442,33,188,12,0


## EXERCISE

Complete the code below. Take the golden retriever weight dataset and remove outliers that exceed 2.25 standard deviations. 

In [47]:
import pandas as pd 
import numpy as np 

df = pd.read_csv('https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/distribution/golden_retriever_weights.csv',header=None, names=['weight'])

# calculate mean and standard deviation
mean = df['weight'].mean()
std = df['weight'].std()

# define lower and upper bounds by cutoff factor
cutoff_factor = 2.25

lower = mean - (cutoff_factor * std) 
upper = mean + (cutoff_factor * std) 

# remove outliers 
removal_condition = (lower < df['weight']) & (df['weight'] < upper)
df[removal_condition]



Unnamed: 0,weight
0,65.8
1,67.7
2,64.1
3,69.3
4,65.1
5,60.8
6,65.5
7,64.6
8,61.7
9,64.2
