# Outlier Detection and Removal

Outliers in a time series can distort downstream processing such as forecasting, so it is worth detecting and removing them first. Luckily, Kats makes both steps easy.

Here is how Kats’ outlier detection algorithm works:

- Decompose the time series using seasonal decomposition
- Remove trend and seasonality to generate a residual time series
- Flag points in the residual that lie more than 3 times the interquartile range below the first quartile or above the third quartile
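
The three steps above can be sketched with pandas alone. This is a simplified illustration on synthetic data, not Kats’ implementation: the function name `iqr_residual_outliers`, the weekly `period` of 7, the moving-average decomposition, and the fence rule (3 times the IQR below the first quartile or above the third) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def iqr_residual_outliers(
    series: pd.Series, period: int = 7, iqr_mult: float = 3.0
) -> pd.Series:
    """Return the points of `series` whose residual falls outside the IQR fences."""
    # Trend: centered moving average over one seasonal period
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # Seasonality: mean detrended value at each position within the period
    position = np.arange(len(series)) % period
    seasonal = detrended.groupby(position).transform("mean")
    # Residual: what is left after removing trend and seasonality
    resid = detrended - seasonal
    # Fences: iqr_mult * IQR below the first quartile and above the third
    q1, q3 = np.nanpercentile(resid, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - iqr_mult * iqr, q3 + iqr_mult * iqr
    return series[(resid < lower) | (resid > upper)]


# Synthetic daily series: linear trend + weekly cycle + noise (hypothetical data)
rng = np.random.default_rng(0)
n = 140
steps = np.arange(n)
values = 50 + 0.1 * steps + 5 * np.sin(2 * np.pi * steps / 7) + rng.normal(0, 1, n)
values[60] += 20  # inject one obvious spike
series = pd.Series(values, index=pd.date_range("2021-01-01", periods=n, freq="D"))

outliers = iqr_residual_outliers(series)
print(outliers)
```

The injected spike is flagged. Because a moving-average trend is pulled upward around a spike, points right next to it can occasionally be flagged as well, which is one reason to inspect the detected dates before removing anything.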


Let’s try out this detection algorithm using Kats’ `OutlierDetector`:

In [12]:
import pandas as pd
import numpy as np

day_df = pd.read_csv(
    "google-analytics-20210101-20210802-days.csv", parse_dates=["Day Index"]
)
day_df.head(10)

| | Day Index | Users |
|---|---|---|
| 0 | 2021-01-01 | 31 |
| 1 | 2021-01-02 | 41 |
| 2 | 2021-01-03 | 58 |
| 3 | 2021-01-04 | 56 |
| 4 | 2021-01-05 | 44 |
| 5 | 2021-01-06 | 46 |
| 6 | 2021-01-07 | 62 |
| 7 | 2021-01-08 | 39 |
| 8 | 2021-01-09 | 45 |
| 9 | 2021-01-10 | 43 |


In [20]:
from kats.consts import TimeSeriesData
from kats.detectors.outlier import OutlierDetector

# Build a Kats time series from the DataFrame loaded above;
# TimeSeriesData looks for a column named "time" by default
ts_day = TimeSeriesData(
    day_df[["Day Index", "Users"]].rename(
        columns={"Day Index": "time", "Users": "value"}
    )
)

# Decompose the series additively and flag residuals beyond the IQR threshold
outlier_detector = OutlierDetector(ts_day, "additive")
outlier_detector.detector()

# Print the outlier timestamps found for the first (and only) value column
print(outlier_detector.outliers[0])

Nice! The detector flags outliers on 2021-04-03, 2021-06-20, and 2021-06-21.

Now that we have detected these outliers, let’s remove them using the `remover` method. We will also replace the removed values with linearly interpolated ones:

In [None]:
ts_day_outliers_interpolated = outlier_detector.remover(interpolate=True)
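
To see what replacing a removed value with a linear interpolation means, here is a minimal pandas sketch on made-up numbers. It assumes `remover(interpolate=True)` behaves roughly like blanking the flagged point and filling it from its neighbors; the series values and the flagged date are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values with one obvious spike on 2021-04-03
values = pd.Series(
    [40.0, 42.0, 120.0, 46.0, 48.0],
    index=pd.date_range("2021-04-01", periods=5, freq="D"),
)
outlier_dates = [pd.Timestamp("2021-04-03")]

# Step 1: remove the flagged point by turning it into a missing value
cleaned = values.copy()
cleaned.loc[outlier_dates] = np.nan

# Step 2: fill the gap on a straight line between the neighboring points
interpolated = cleaned.interpolate(method="linear")
print(interpolated)
```

The spike of 120 on 2021-04-03 is replaced by 44.0, the midpoint of its neighbors 42 and 46, while every other value stays unchanged.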

Next, plot the original time series alongside the new one with the outliers removed:

In [None]:
from matplotlib import pyplot as plt

ax = ts_day.to_dataframe().plot(x="time", y="value")
ts_day_outliers_interpolated.to_dataframe().plot(x="time", y="y_0", ax=ax)
plt.legend(labels=["original ts", "ts with removed outliers"])
plt.show()

Cool! As the plot shows, the time series with the outliers removed (the orange line) differs from the original time series (the blue line) on 2021-04-03, 2021-06-20, and 2021-06-21.

The new time series no longer shows any obvious outliers, so you can use it with more confidence in downstream steps such as forecasting.

In [15]:
!pip install pyod
!pip install statsmodels


