# Outlier Detection and Removal

Outliers in a time series can distort downstream processing such as forecasting, so it is worth detecting and removing them first. Luckily, Kats makes it easy to detect and remove outliers.

In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the time series data into a Pandas dataframe
df = pd.read_csv("google-analytics-20210101-20210802-days.csv")

# Add a time index to the dataframe
df["Day Index"] = pd.to_datetime(df["Day Index"])
df = df.set_index("Day Index")
df.head(10)

Day Index,Users
2021-01-01,31
2021-01-02,41
2021-01-03,58
2021-01-04,56
2021-01-05,44
2021-01-06,46
2021-01-07,62
2021-01-08,39
2021-01-09,45
2021-01-10,43


In [43]:
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import STL

# Decompose the daily series with STL (period=7 assumes a weekly cycle)
decomposition = STL(df["Users"], period=7).fit()

# The residual is what remains after removing trend and seasonality
residual = decomposition.resid

# Calculate the interquartile range of the residual time series
q1, q3 = np.percentile(residual, [25, 75])
iqr = q3 - q1

# Define the lower and upper limits of the expected range of data values
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Identify any data points that fall outside of these limits as outliers
outliers = residual[(residual < lower_limit) | (residual > upper_limit)]
print(outliers.index)

# Remove the outliers from the time series
df = df.loc[~df.index.isin(outliers.index)]

# Print the updated time series data
print(df)

Nice! We were able to detect outliers on 2021-04-03, 2021-06-20, and 2021-06-21.
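
To make the detection rule concrete, here is a minimal sketch of the 1.5 × IQR fences applied directly to a small synthetic series. The helper name `iqr_outliers` and the numbers are made up for illustration; this is not the Kats detector and not the Google Analytics data.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` that fall outside the IQR fences.

    Hypothetical helper for illustration; `k` scales the fence width.
    """
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Synthetic daily series with one obvious spike (made-up numbers)
index = pd.date_range("2021-01-01", periods=10, freq="D")
users = pd.Series([40, 42, 41, 43, 200, 39, 44, 42, 41, 40], index=index)

flagged = iqr_outliers(users)
print(flagged)
```

On real data the same function would typically be applied to the residual after trend and seasonality are removed, so that regular weekly peaks are not mistaken for outliers.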

Now that we have detected these outliers, let’s remove them using the remover method. Setting interpolate=True also replaces the removed values with linearly interpolated ones:

In [None]:
ts_day_outliers_interpolated = outlier_detector.remover(interpolate=True)
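
For readers who want to see what replacing outliers with linearly interpolated values amounts to, the following is a rough pandas-only sketch. It is an assumption about the general idea, not the Kats implementation; the series, the threshold, and the `is_outlier` mask are hypothetical.

```python
import pandas as pd

# Synthetic daily series with one spike (made-up numbers)
index = pd.date_range("2021-01-01", periods=6, freq="D")
users = pd.Series([40.0, 42.0, 200.0, 46.0, 44.0, 45.0], index=index)

# Hypothetical outlier mask; in practice it would come from a detector
is_outlier = users > 100

# Blank out the flagged points, then fill the gaps by linear interpolation
cleaned = users.mask(is_outlier).interpolate(method="linear")
print(cleaned)
```

Here the spike on the third day is replaced by the midpoint of its two neighbours, while every other value is left untouched.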

Plot the original time series alongside the new time series with the outliers removed:

In [None]:
from matplotlib import pyplot as plt

ax = ts_day.to_dataframe().plot(x="time", y="value")
ts_day_outliers_interpolated.to_dataframe().plot(x="time", y="y_0", ax=ax)
plt.legend(labels=["original ts", "ts with removed outliers"])
plt.show()

Cool! As the plot above shows, the time series with the outliers removed (the orange line) differs from the original time series (the blue line) on 2021-04-03, 2021-06-20, and 2021-06-21.
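
Besides reading the difference off the plot, the changed dates can be listed programmatically. This is a small sketch on synthetic data, assuming both series share the same index; the values and variable names are made up for illustration.

```python
import pandas as pd

# Two synthetic series on the same daily index (made-up numbers)
index = pd.date_range("2021-01-01", periods=5, freq="D")
original = pd.Series([50.0, 52.0, 300.0, 51.0, 49.0], index=index)
cleaned = pd.Series([50.0, 52.0, 51.5, 51.0, 49.0], index=index)

# Keep the timestamps where the two series disagree
changed = original.index[original.ne(cleaned)]
print(changed.strftime("%Y-%m-%d").tolist())
```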

There no longer seem to be any prominent outliers in the new time series, so you can use it with more confidence in other processes such as forecasting.

In [44]:
!pip install kats

Collecting kats
  Using cached kats-0.2.0-py3-none-any.whl (612 kB)
Collecting numpy<1.22,>=1.21 (from kats)
  Using cached numpy-1.21.6-cp310-cp310-win_amd64.whl (14.0 MB)
Collecting pandas<=1.3.5,>=1.0.4 (from kats)
  Using cached pandas-1.3.5-cp310-cp310-win_amd64.whl (10.2 MB)
Collecting pystan==2.19.1.1 (from kats)
  Using cached pystan-2.19.1.1.tar.gz (16.2 MB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting fbprophet==0.7.1 (from kats)
  Using cached fbprophet-0.7.1.tar.gz (64 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting scipy<1.8.0 (from kats)
  Using cached scipy-1.7.3-cp310-cp310-win_amd64.whl (34.3 MB)
Collecting statsmodels==0.12.2 (from kats)
  Using cached statsmodels-0.12.2.tar.gz (17.5 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'error'


