<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/pandas/pandas_drop_outliers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Drop Outliers in Pandas Data

Numerical data that is too many standard deviations away from the mean could be considered an outlier.  Here we show how to drop the rows that contain such values.


Here we use the **Z-Score Method**.  

The z-score of a value x, given the mean μ and the standard deviation σ of its column, is:


$$z = \frac{x - \mu}{\sigma}$$
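As a minimal sketch of the formula, the snippet below computes z-scores for a small pandas Series. The sample values are hypothetical and chosen only for illustration. Note that `Series.std()` uses the sample standard deviation (`ddof=1`) by default, which differs slightly from the population σ in the formula on small samples.

```python
import pandas as pd

# Hypothetical sample: five ordinary values and one extreme value.
values = pd.Series([10.0, 12.0, 11.0, 9.0, 13.0, 50.0])

mu = values.mean()    # mean of the column
sigma = values.std()  # sample standard deviation (ddof=1, the pandas default)

# Element-wise z-score: how many standard deviations each value is from the mean.
z = (values - mu) / sigma
print(z.round(2))
```

The extreme value gets the largest z-score, and the z-scores as a whole have mean 0 and standard deviation 1.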

## Two Approaches to Eliminating Outliers

**1. Interquartile Range (IQR) Method**:
- Calculate the IQR, which is the difference between the third quartile (Q3) and the first quartile (Q1).
- Define outliers as data points that are below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
- This method is robust to outliers and is commonly used for skewed distributions.

**2. Z-Score Method**:
- Calculate the z-score for each data point, which represents how many standard deviations the data point is away from the mean.
- Define outliers as data points with z-scores beyond a certain threshold, such as |z| > 3 or |z| > 2 depending on the desired level of stringency.
- This method is itself sensitive to outliers, because extreme values inflate both the mean and the standard deviation, and it assumes the data is roughly normally distributed.
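This notebook uses the z-score method, so for comparison here is a hedged sketch of the IQR rule from the first list. The sample values are assumptions modeled on the kind of salary data generated in this notebook, and the variable names (`lower`, `upper`, `inliers`) are illustrative.

```python
import pandas as pd

# Assumed sample: seven ordinary salaries and one extreme value.
salary = pd.Series(
    [10013, 9974, 10008, 9978, 1000000, 10012, 9969, 9999], dtype="float64"
)

q1 = salary.quantile(0.25)  # first quartile
q3 = salary.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

inliers = salary[salary.between(lower, upper)]
print(inliers)
```

Because the quartiles barely move when one value is extreme, the fences stay close to the bulk of the data and the 1000000 entry falls outside them.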



In [2]:
import numpy as np
import pandas as pd

import random

def makedata():
  mean=10000
  std=25

  # this code creates random data.  It adds invalid and missing values to give us data to work with.

  cols = [("name", str), ("education", str),
     ("age", np.int8), ("city",str), ("id", np.int8), ("email", str), ("salary", np.int8),
        ("citizen", ["Y", "N"])]

  words = [np.nan, "", "abc", "def", "ghi", "jkl", "mno", "pqr"]

  records = []

  for i in range(20):

    data = {}

    for c in cols:

      if c[1] == np.int8:
        if random.randint(0,5)==5:
            data[c[0]] = np.nan
        else:
            data[c[0]] = abs(int(random.gauss(mean, std)))

      if c[0] == "citizen":
        if random.randint(0,5)==5:
            data[c[0]] = random.randint(0,10)
        else:
            data[c[0]] =c[1][random.randint(0,1)]

      if c[1] == str and c[0] != "citizen":
        data[c[0]] = words[random.randint(0,len(words)-1)]

      if c[0] == "salary" and random.randint(0,5) == 0:
            data[c[0]] = 1000000

      if (c[0] == "age"):
          data[c[0]] = random.randint(20,25)

    records.append(data)

  df=pd.DataFrame(records)

  return df

df = makedata()
df


Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,ghi,abc,22,jkl,10042.0,pqr,10013.0,0
1,ghi,ghi,24,ghi,10055.0,pqr,9974.0,5
2,abc,abc,25,jkl,,ghi,10008.0,N
3,pqr,def,20,def,9984.0,abc,9978.0,0
4,pqr,,20,,10003.0,jkl,1000000.0,Y
5,pqr,mno,20,abc,9981.0,ghi,1000000.0,N
6,pqr,jkl,20,abc,9960.0,jkl,10012.0,10
7,pqr,ghi,20,ghi,9988.0,pqr,9969.0,N
8,,pqr,22,,9954.0,ghi,9999.0,Y
9,pqr,ghi,23,def,10021.0,abc,1000000.0,5


In [3]:
df['salary']

0       10013.0
1        9974.0
2       10008.0
3        9978.0
4     1000000.0
5     1000000.0
6       10012.0
7        9969.0
8        9999.0
9     1000000.0
10      10014.0
11      10026.0
12      10044.0
13          NaN
14       9957.0
15          NaN
16       9998.0
17      10002.0
18      10020.0
19       9976.0
Name: salary, dtype: float64

In [4]:
df['salary'].mean()

174999.44444444444

In [6]:
# Drop outliers: rows whose salary is too many standard deviations from the mean.
# If |salary - mean| > threshold * std, drop the row.
# Remember to use inplace=True so that df itself is modified.


mean = df['salary'].mean()
std = df['salary'].std()

# Define threshold for outliers (e.g., values more than 2 standard deviations away from the mean)
threshold = 2

# Steps:
# 1. build the boolean condition
# 2. select the matching rows with .loc
# 3. drop them by index, with inplace=True
# Rows with a NaN salary compare as False, so they are kept.

df.drop(df.loc[abs(df['salary'] - mean) > threshold * std].index, inplace=True)

df



Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,ghi,abc,22,jkl,10042.0,pqr,10013.0,0
1,ghi,ghi,24,ghi,10055.0,pqr,9974.0,5
2,abc,abc,25,jkl,,ghi,10008.0,N
3,pqr,def,20,def,9984.0,abc,9978.0,0
6,pqr,jkl,20,abc,9960.0,jkl,10012.0,10
7,pqr,ghi,20,ghi,9988.0,pqr,9969.0,N
8,,pqr,22,,9954.0,ghi,9999.0,Y
10,,,22,,,abc,10014.0,10
11,,,24,mno,10002.0,mno,10026.0,N
12,ghi,,25,,9978.0,,10044.0,N
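The cell above modifies `df` in place. As an alternative sketch, the same filter can be written as a boolean mask that returns a new DataFrame and leaves the original untouched. The small DataFrame and the names `keep` and `cleaned` are assumptions for illustration. Rows with a missing salary are kept explicitly, which matches the behavior of the `drop` call above, where a comparison against NaN is False.

```python
import numpy as np
import pandas as pd

# Assumed sample: seven ordinary salaries, one extreme value, one missing value.
df = pd.DataFrame(
    {"salary": [10013.0, 9974.0, 10008.0, 9978.0, 1000000.0,
                10012.0, 9969.0, 9999.0, np.nan]}
)

mean = df["salary"].mean()  # NaN is skipped by default
std = df["salary"].std()
threshold = 2

# z-score of every row; the NaN row stays NaN.
z = (df["salary"] - mean) / std

# Keep rows within the threshold, plus rows whose salary is missing.
keep = (z.abs() <= threshold) | df["salary"].isna()
cleaned = df[keep]

print(len(df), len(cleaned))
```

Here `df` still has all of its rows afterwards, and `cleaned` holds the filtered copy.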


In [7]:
abs(1000000 - mean) > 2 * std

True

In [8]:
mean

174999.44444444444

In [9]:
std

379647.9257645434
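Using the mean and standard deviation printed above, we can work out the z-score of the 1000000 salary directly. This sketch shows why a threshold of 2 catches it while a stricter threshold of 3 would not: the three extreme values pull the mean and standard deviation up so far that they sit only about 2.17 standard deviations out.

```python
# Values copied from the outputs above.
mean = 174999.44444444444
std = 379647.9257645434

z = (1_000_000 - mean) / std
print(round(z, 2))  # about 2.17

flagged_at_2 = abs(z) > 2  # True: dropped with threshold = 2
flagged_at_3 = abs(z) > 3  # False: would survive with threshold = 3
print(flagged_at_2, flagged_at_3)
```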

In [10]:
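# Fill the remaining missing salaries with the mean of the rows that are left.
# The outliers are already gone, so this mean is no longer inflated by them.
sMean=df['salary'].mean()

df['salary'] = df['salary'].fillna(sMean)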

df

Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,ghi,abc,22,jkl,10042.0,pqr,10013.0,0
1,ghi,ghi,24,ghi,10055.0,pqr,9974.0,5
2,abc,abc,25,jkl,,ghi,10008.0,N
3,pqr,def,20,def,9984.0,abc,9978.0,0
6,pqr,jkl,20,abc,9960.0,jkl,10012.0,10
7,pqr,ghi,20,ghi,9988.0,pqr,9969.0,N
8,,pqr,22,,9954.0,ghi,9999.0,Y
10,,,22,,,abc,10014.0,10
11,,,24,mno,10002.0,mno,10026.0,N
12,ghi,,25,,9978.0,,10044.0,N
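The order of the two steps matters: dropping outliers first, then filling missing values. As a sketch, the snippet below repeats both steps on a standalone Series holding the 20 salary values printed earlier in this notebook, and compares the mean before and after the drop. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The 20 salary values from the df['salary'] output earlier in this notebook.
salary = pd.Series(
    [10013, 9974, 10008, 9978, 1000000, 1000000, 10012, 9969, 9999, 1000000,
     10014, 10026, 10044, np.nan, 9957, np.nan, 9998, 10002, 10020, 9976],
    dtype="float64",
)

mean_before = salary.mean()
std_before = salary.std()

# Same rule as above: more than 2 standard deviations from the mean is an outlier.
is_outlier = (salary - mean_before).abs() > 2 * std_before
kept = salary[~is_outlier]  # NaN rows compare as False, so they are kept

mean_after = kept.mean()
filled = kept.fillna(mean_after)

print(mean_before, mean_after)
```

Filling before the drop would have written the inflated mean of roughly 175000 into the missing rows. Filling afterwards uses a mean close to 10000, in line with the rest of the column.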
