# Detecting and filtering outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(1000, 4))

df.describe()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | -0.022062 | -0.045097 | 0.033339 | -0.001697 |
| std | 0.999438 | 1.00507 | 0.992009 | 1.000575 |
| min | -2.856865 | -3.602793 | -3.643771 | -3.508534 |
| 25% | -0.716333 | -0.650069 | -0.628079 | -0.653864 |
| 50% | -0.028872 | -0.04531 | 0.054448 | 0.019975 |
| 75% | 0.644004 | 0.57896 | 0.732762 | 0.694976 |
| max | 3.254549 | 2.796084 | 3.24792 | 2.989849 |


Suppose you want to find values in one of the columns whose absolute value is greater than 3:

In [2]:
col = df[1]

col[col.abs() > 3]

714   -3.602793
Name: 1, dtype: float64
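
Because the data above is random, the exact rows you see will differ from run to run. The following minimal sketch repeats the same Boolean-mask selection on a small hand-made Series (the values and the names `s`, `mask` and `outliers` are illustrative assumptions, not part of the data above), so the result is deterministic:

```python
import pandas as pd

# Hypothetical values chosen so that two of them lie outside [-3, 3]
s = pd.Series([0.5, -3.6, 2.9, 3.2, -1.0])

# Boolean Series: True where the absolute value exceeds 3
mask = s.abs() > 3

# Indexing with the mask keeps only the True positions,
# together with their original index labels
outliers = s[mask]

print(mask.tolist())            # [False, True, False, True, False]
print(outliers.index.tolist())  # [1, 3]
```

The selected values keep their original index labels, which is why the output above reports the row label 714 alongside the value.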

To select all rows that contain a value greater than 3 or less than -3 in at least one column, you can apply [pandas.DataFrame.any](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html) across each row of a Boolean DataFrame:

In [3]:
df[(df.abs() > 3).any(axis=1)]

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 174 | 3.128554 | 2.004932 | -1.475202 | -0.102378 |
| 406 | 3.254549 | 0.322807 | -0.497333 | 0.748531 |
| 714 | 0.352575 | -3.602793 | 0.84384 | 0.1184 |
| 784 | -0.998979 | 1.353545 | -1.022069 | -3.214897 |
| 821 | -0.046112 | -0.080435 | -3.643771 | 0.282301 |
| 837 | 0.527207 | 0.974202 | 0.100687 | -3.508534 |
| 855 | 0.263539 | 0.792687 | 3.24792 | 0.556589 |
| 950 | 3.08451 | -0.923346 | -0.149506 | -0.983733 |
| 965 | 0.053554 | -2.182066 | -1.390962 | -3.215706 |
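
To make the role of the axis explicit, here is a small sketch on a hand-made DataFrame (the column names `a` and `b` and all values are assumptions for illustration). It shows how reducing the Boolean DataFrame along `axis=1` yields one flag per row, and how negating that flag drops the affected rows instead of selecting them:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 each contain one value outside [-3, 3]
df_small = pd.DataFrame({"a": [0.1, 3.5, -0.2], "b": [-4.0, 0.3, 1.0]})

# Element-wise test, same shape as df_small
exceeds = df_small.abs() > 3

# axis=1 reduces across the columns: one Boolean per row
row_has_outlier = exceeds.any(axis=1)

# Rows with at least one outlier, and the complement without any
filtered = df_small[row_has_outlier]
cleaned = df_small[~row_has_outlier]

print(row_has_outlier.tolist())  # [True, True, False]
print(cleaned.index.tolist())    # [2]
```

Passing the axis by keyword is the safer habit: recent pandas versions no longer accept it as a positional argument to `any`.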


On this basis, the values can be limited to the interval from -3 to 3. The expression `np.sign(df)` returns 1 for positive values, -1 for negative values and 0 for zeros, so multiplying it by 3 gives a cap with the same sign as the original value:

In [4]:
df[df.abs() > 3] = np.sign(df) * 3

df.describe()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | -0.02253 | -0.044494 | 0.033735 | -0.000758 |
| std | 0.997978 | 1.003113 | 0.989052 | 0.997577 |
| min | -2.856865 | -3.0 | -3.0 | -3.0 |
| 25% | -0.716333 | -0.650069 | -0.628079 | -0.653864 |
| 50% | -0.028872 | -0.04531 | 0.054448 | 0.019975 |
| 75% | 0.644004 | 0.57896 | 0.732762 | 0.694976 |
| max | 3.0 | 2.796084 | 3.0 | 2.989849 |
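
The capping step can also be checked on a small example. The sketch below uses a hand-made DataFrame (names and values are assumptions for illustration) to show what `np.sign` returns and to compare the masked assignment with `DataFrame.clip`, which is an alternative way to get the same result, not the method used above:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one value above 3, one below -3 and one zero
df_small = pd.DataFrame({"a": [0.0, 3.5, -0.2], "b": [-4.0, 0.3, 1.0]})

# np.sign works element-wise: -1.0, 0.0 or 1.0
signs = np.sign(df_small)

# Same technique as above, applied to a copy so df_small stays unchanged
capped = df_small.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# Alternative: clip limits every value to the closed interval [-3, 3]
clipped = df_small.clip(lower=-3, upper=3)

print(signs["a"].tolist())     # [0.0, 1.0, -1.0]
print(capped["b"].tolist())    # [-3.0, 0.3, 1.0]
print(capped.equals(clipped))  # True
```

For a symmetric threshold like this one, `clip` is usually the shorter option. The masked assignment is more flexible when the replacement value depends on something other than a fixed bound.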
