# Data Quality (DQ) Outlier Detection with Interquartile Range (IQR) in Python

Reference and contact for more information:
Laurent Weichberger, Sr. Enterprise Customer Success Manager for Data Quality, Collibra: laurent (dot) weichberger (at) collibra (dot) com

### First, we import a library so that we don't have to code the median function by hand:

In [30]:
# Need this for median() function
import statistics

### We have two datasets available: one has an odd number of elements and contains no outliers; the other has an even number of elements and contains one outlier:

In [31]:
# The quartiles are the three cut points that divide the ordered data
# into four equal parts: Q1 at 0.25, Q2 at 0.5, and Q3 at 0.75
# We need a dataset first.
# The dataset with no outliers is this:
unordered_data1 = [5, 8, 15, 26, 10, 18, 3, 12, 6, 14, 11]

# The dataset with one outlier is this:
# Once the above works, use this one next:

unordered_data2 = [11, 31, 21, 19, 8, 54, 35, 26, 23, 13, 29, 17]

### After using one dataset, try the application again with the other dataset.
Sort the chosen list from lowest to highest value:

In [32]:
ordered_data = sorted(unordered_data2)

In [33]:
ordered_data

[8, 11, 13, 17, 19, 21, 23, 26, 29, 31, 35, 54]

Next, we identify the median of the entire dataset, which is Q2:

In [34]:
Q2 = statistics.median(ordered_data)

In [35]:
Q2

22.0
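
The result is `22.0` even though 22 does not appear in the list. A minimal sketch of how `statistics.median` treats odd and even lengths may help here; the two short lists are made up for illustration and are not part of the notebook:

```python
import statistics

# Odd length: the median is the middle element itself.
odd_values = [3, 5, 8]            # illustrative data only
print(statistics.median(odd_values))    # 5

# Even length: the median is the mean of the two middle elements.
even_values = [3, 5, 8, 10]       # illustrative data only
print(statistics.median(even_values))   # 6.5

# The second dataset has 12 elements; its middle pair is 21 and 23.
sample = [8, 11, 13, 17, 19, 21, 23, 26, 29, 31, 35, 54]
print(statistics.median(sample))        # 22.0
```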

### Then we need the median of the lower half of the dataset (not including the Q2 value).

- With an odd number of elements, the median is the middle element itself; with an even number of elements, the median is computed as the mean of the two middle elements. We need to treat even and odd lengths differently when splitting the dataset.
- Remember that list indexes start at 0: if the length is 12, the first half occupies indexes 0 through 5, so the slice that selects it ends at index 6:

In [36]:
# ODD:
if len(ordered_data) % 2 > 0:
    print('Odd number of elements found...\n')
    # Q2 is the middle element. Use its position directly rather than
    # ordered_data.index(Q2), which returns the first match if the value repeats.
    index = len(ordered_data) // 2
    print('index of Q2 =', index)
    # First half of the dataset (excludes Q2):
    first = ordered_data[:index]
    # Second half of the dataset (excludes Q2):
    second = ordered_data[(index + 1):]
# EVEN:
else:
    print('Even number of elements found...\n')
    length_of_dataset = len(ordered_data)
    index = length_of_dataset // 2
    print('even index first half ends at =', index)
    first = ordered_data[:index]
    second = ordered_data[index:]

Even number of elements found...

even index first half ends at = 6
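
The odd/even split can also be wrapped in a small helper so that both datasets go through the same code path. This is only a sketch; the name `split_halves` is hypothetical and not part of the notebook:

```python
def split_halves(ordered):
    """Return (lower, upper) halves of a sorted list.

    When the length is odd, the middle element (the median) is dropped
    from both halves.
    """
    middle = len(ordered) // 2
    if len(ordered) % 2:                         # odd: skip the median element
        return ordered[:middle], ordered[middle + 1:]
    return ordered[:middle], ordered[middle:]    # even: clean split

data1 = sorted([5, 8, 15, 26, 10, 18, 3, 12, 6, 14, 11])
data2 = sorted([11, 31, 21, 19, 8, 54, 35, 26, 23, 13, 29, 17])

lower1, upper1 = split_halves(data1)
lower2, upper2 = split_halves(data2)
print(lower1, upper1)   # [3, 5, 6, 8, 10] [12, 14, 15, 18, 26]
print(lower2, upper2)   # [8, 11, 13, 17, 19, 21] [23, 26, 29, 31, 35, 54]
```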


### Next we find Q1 (the median of the first half of the dataset) and Q3 (the median of the second half):

In [37]:
# Show us what we are working with:
print('first half', first) 
print('second half', second)

Q1 = statistics.median(first)
Q3 = statistics.median(second)

first half [8, 11, 13, 17, 19, 21]
second half [23, 26, 29, 31, 35, 54]
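
For comparison, Python 3.8+ also provides `statistics.quantiles`. It interpolates between data points instead of taking the medians of the two halves, so its Q1 and Q3 can differ slightly from the values computed in this notebook. The sketch below only puts the two side by side; it is not a replacement for the method used here:

```python
import statistics

ordered = [8, 11, 13, 17, 19, 21, 23, 26, 29, 31, 35, 54]

# Median-of-halves, as in the notebook (even length, so a clean split).
half = len(ordered) // 2
q1_halves = statistics.median(ordered[:half])
q3_halves = statistics.median(ordered[half:])
print(q1_halves, q3_halves)      # 15.0 30.0

# Standard-library quartiles (default method='exclusive'), for comparison.
q1_lib, q2_lib, q3_lib = statistics.quantiles(ordered, n=4)
print(q1_lib, q2_lib, q3_lib)
```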


### Once we know the values of Q1 and Q3, we can compute the Interquartile Range (IQR), which is Q3 - Q1:

In [38]:
IQR = Q3 - Q1
print('IQR value = ', IQR)

IQR value =  15.0


### Next we search for outliers in the dataset using an IQR-based formula. We create a range of values: a number inside the range is NOT an outlier, and a number outside it is called an outlier. The range is: [Q1 - (1.5 x IQR), Q3 + (1.5 x IQR)]:

In [39]:
not_outliers = [(Q1 - 1.5 * IQR),(Q3 + 1.5 * IQR)]

# Show the range:
print('Within this range of values are NOT outliers [low, high]:', not_outliers)

Within this range of values are NOT outliers [low, high]: [-7.5, 52.5]
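
The arithmetic behind this range can be checked with a small function. This is a sketch: the name `iqr_fences` and the parameter `k` are assumptions introduced for illustration, with `k=1.5` matching the multiplier in the formula:

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (low, high) bounds; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Q1 = 15.0 and Q3 = 30.0 are the medians of the two halves of the second dataset.
low, high = iqr_fences(15.0, 30.0)
print(low, high)   # -7.5 52.5
```

With IQR = 15.0, the margin on each side is 1.5 x 15.0 = 22.5, which gives 15.0 - 22.5 = -7.5 and 30.0 + 22.5 = 52.5.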


Check for outliers in the given dataset and report what we found.

In [42]:
foundOutlier = False

for value in ordered_data:
    #low_check:
    if value < not_outliers[0]:
        print('I just found a low Outlier!:', value)
        foundOutlier = True
    #high_check:
    elif value > not_outliers[1]:
        print('I just found a high Outlier!:', value)
        foundOutlier = True

I just found a high Outlier!: 54
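
If the outliers themselves are needed later, rather than only a flag, one possible variation is to collect them into lists. The names `low_outliers` and `high_outliers` are hypothetical; the data and range are copied from the cells above:

```python
ordered_data = [8, 11, 13, 17, 19, 21, 23, 26, 29, 31, 35, 54]
not_outliers = [-7.5, 52.5]    # [low, high] range printed earlier

# Keep every value that falls below or above the allowed range.
low_outliers = [v for v in ordered_data if v < not_outliers[0]]
high_outliers = [v for v in ordered_data if v > not_outliers[1]]

print(low_outliers, high_outliers)   # [] [54]
```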


### Now we can just confirm back to the user what we discovered:

In [43]:
if foundOutlier:
    print('I did find at least one Outlier today, Woo hoo!!')
else:
    print('I do not see any outliers in this Dataset...')

# We are done

I did find at least one Outlier today, Woo hoo!!
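
To try both datasets without re-running each cell, the steps in this notebook can be folded into one function. This is a minimal sketch that assumes the input has at least two elements; `find_outliers` and its `k` parameter are hypothetical names, not part of the original notebook:

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    lower = ordered[:middle]
    # Skip the median element itself when the length is odd.
    upper = ordered[middle + 1:] if len(ordered) % 2 else ordered[middle:]
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in ordered if v < low or v > high]

unordered_data1 = [5, 8, 15, 26, 10, 18, 3, 12, 6, 14, 11]
unordered_data2 = [11, 31, 21, 19, 8, 54, 35, 26, 23, 13, 29, 17]

print(find_outliers(unordered_data1))   # []
print(find_outliers(unordered_data2))   # [54]
```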
