# Tukey Method (Tukey fences)

This method identifies outliers among a list of values. We first compute the first and third quartiles of the data, denoted $Q_1$ and $Q_3$.
In addition, we compute $\Delta=Q_3 - Q_1$, the interquartile range.
A value $x$ is considered an outlier if it falls outside the interval $$Q_1-1.5 \Delta \le x \le Q_3+1.5 \Delta,$$ that is, if $x<Q_1-1.5\Delta$ or $x>Q_3+1.5\Delta$.
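
As a quick illustration of the fences, here is a minimal sketch on a made-up list of ten values (the numbers are hypothetical, chosen only so the arithmetic is easy to follow). It uses `np.percentile` with its default linear interpolation.

```python
import numpy as np

# Hypothetical sample: nine small values and one obviously large one
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# First and third quartiles (default linear interpolation)
q1, q3 = np.percentile(values, [25, 75])

# Interquartile range and the two Tukey fences
delta = q3 - q1
lower = q1 - 1.5 * delta
upper = q3 + 1.5 * delta

# A value is an outlier if it lies strictly outside [lower, upper]
outliers = values[(values < lower) | (values > upper)]
print(q1, q3, delta, lower, upper, outliers)
```

Here $Q_1=3.25$ and $Q_3=7.75$, so $\Delta=4.5$ and the fences are $-3.5$ and $14.5$; only the value 100 falls outside them.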

In [8]:
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter

In [2]:
def detect_outlier(data, num, features):
    """
    Detect outliers with the Tukey method.
    For each entry, count the features in which its value is an outlier;
    if that count exceeds num, the entry is reported as an outlier.
    Returns the list of index labels of such entries.
    """
    outlier_index=[]
    
    for col in features:
        # First quartile
        Q1 = np.percentile(data[col], 25)
        # Third quartile
        Q3 = np.percentile(data[col], 75)
        # Interquartile range
        Delta = Q3 - Q1
        
        # Determine a list of indices of outliers for feature col
        outlier_col = data[(data[col] < Q1 - 1.5*Delta) | (data[col] > Q3 + 1.5*Delta )].index
        outlier_index.extend(outlier_col)
        
    # Select observations that are outliers in more than num features
    outlier_counts = Counter(outlier_index)
    final_list = [k for k, v in outlier_counts.items() if v > num]
    return final_list
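
The counting step can be seen in isolation with a small sketch. The feature names `a`, `b`, `c` and the index lists below are hypothetical, standing in for the per-feature outlier indices collected in the loop. Note that the comparison is strict, so a row flagged in exactly `num` features is not selected.

```python
from collections import Counter

# Hypothetical per-feature outlier indices (made up for illustration)
flagged = {
    "a": [0, 3, 7],
    "b": [3, 7],
    "c": [3, 5],
}
num = 2

# Count how many features flag each row index
counts = Counter(idx for rows in flagged.values() for idx in rows)

# Keep only rows flagged in strictly more than num features
selected = [idx for idx, c in counts.items() if c > num]
print(counts, selected)
```

Row 3 is flagged in three features and is selected; row 7 is flagged in only two and is kept in the data.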

In [13]:
Train_raw = pd.read_csv('Data/train.csv')
Test_raw = pd.read_csv('Data/test.csv')
features = ['YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF2',
            'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
            'LowQualFinSF', 'GarageArea', 'MiscVal']
outliers = detect_outlier(Train_raw, num=2, features=features)
print(outliers)
train = Train_raw.drop(outliers, axis = 0).reset_index(drop=True)

[440, 888, 1205, 224, 496, 178, 691, 1182, 1298]
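
To make the final `drop(...).reset_index(drop=True)` step concrete, the following sketch applies it to a hypothetical five-row frame (`toy` and `to_drop` are illustrative names, not part of the data above). Dropping by index label removes the rows, and `reset_index(drop=True)` renumbers the remaining rows from 0 without keeping the old labels as a column.

```python
import pandas as pd

# Hypothetical frame with default index labels 0..4
toy = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

# Index labels to remove, playing the role of the outlier list
to_drop = [1, 3]

# Drop the rows by label, then renumber the remaining rows from 0
cleaned = toy.drop(to_drop, axis=0).reset_index(drop=True)
print(cleaned)
```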
