# "How to smoothen noisy data and find peaks and dips in a line plot using Python"
> "In this small tutorial we will use the U.S. COVID-19 inoculation data to demonstrate the effect of the Savitzky-Golay filter and find the most prominent peaks and dips in daily vaccinations."

- toc: false
- branch: master
- badges: true
- comments: true
- categories: [python, jupyter]
- image: images/Savitzky-Golay-Filter.png
- hide: false
- search_exclude: false

Presenting peaks and dips in a noisy line plot can be a challenge, as the noise adds a lot of unnecessary visual information. The Savitzky-Golay filter can be applied to such data to bring out these points with minimal distortion and loss of precision. It was formulated by Savitzky and Golay themselves for the exact purpose of finding maxima and minima in curve data {% fn 1 %}. In this small tutorial we will use the U.S. COVID-19 inoculation data to demonstrate the effect of the filter and find the most prominent peaks and dips in daily vaccinations. We will use an interactive widget to tune the filter's parameters.

### Step 1: Install the following Python packages

In [None]:
!pip install widgetsnbextension ipywidgets jupyter-js-widgets-nbextension ipympl

### Step 2: Enable widget support in your Jupyter environment

In [21]:
!jupyter nbextension enable --py widgetsnbextension --sys-prefix

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


### Step 3: Importing the dependencies

We will use Pandas to read and manipulate the `.csv` file, Matplotlib to plot the data, the `signal` module from the SciPy package both to apply the filter and, through its `argrelextrema` function combined with NumPy comparison functions, to find the "extreme" values in the data, and finally `interactive` from ipywidgets to build the necessary sliders.

In [None]:
from ipywidgets import interactive

import pandas as pd
import matplotlib.pyplot as plt
from scipy import signal
import numpy as np

### Step 4: Import and filter the data by location; we will use a CSV file from Our World in Data.

In [None]:
df_raw = pd.read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv", usecols=["location", "date", "daily_vaccinations"], parse_dates=["date"])

df = df_raw[df_raw["location"] == "United States"].copy() # copy so the in-place call below does not act on a slice of df_raw

df.set_index("location", inplace=True, drop=True)

df

| location | date | daily_vaccinations |
| --- | --- | --- |
| United States | 2020-12-13 | NaN |
| United States | 2020-12-14 | 4545.0 |
| United States | 2020-12-15 | 27098.0 |
| United States | 2020-12-16 | 71299.0 |
| United States | 2020-12-17 | 121556.0 |
| ... | ... | ... |
| United States | 2022-02-25 | 279746.0 |
| United States | 2022-02-26 | 271896.0 |
| United States | 2022-02-27 | 260524.0 |
| United States | 2022-02-28 | 222105.0 |
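
The first row of the output has no `daily_vaccinations` value. Missing values may propagate into the smoothed curve or cause errors during filtering, so it can be worth checking for them first. The following is a minimal sketch of one way to do that, assuming a small hand-made frame (`toy`, with made-up numbers) rather than the real data:

```python
import numpy as np
import pandas as pd
from scipy import signal

# Hypothetical miniature frame with made-up numbers; only its shape
# (a date column, a value column, one missing value) mimics the real data.
toy = pd.DataFrame({
    "date": pd.date_range("2020-12-13", periods=8, freq="D"),
    "daily_vaccinations": [np.nan, 1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0],
})

# Count the missing values before doing anything else
n_missing = toy["daily_vaccinations"].isna().sum()
print("missing values:", n_missing)

# Drop the rows without a value so dates and values stay aligned
clean = toy.dropna(subset=["daily_vaccinations"])

# Smooth the remaining values (window of 5 points, quadratic fit)
smoothed = signal.savgol_filter(clean["daily_vaccinations"].to_numpy(), 5, 2)
print(len(clean), smoothed.shape)
```

Dropping rows is only one option; interpolating the gaps would be another, and which is appropriate depends on the data.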


### Step 5: Build the function and plot the data

First, we assign our X and Y values. We feed the Y values, or the first 444 consecutive days of `daily_vaccinations`, into the `signal.savgol_filter()` function. This function requires two parameters besides the data: `window_length` (named `window_size` in our code) and `polyorder`. According to the documentation at the time of writing, `window_length` is a positive *odd* integer and `polyorder` is an integer that must be less than `window_length` {% fn 2 %}. My understanding of the concept is limited, but the general goal is to "[keep] the important features and getting rid of the meaningless fluctuations" {% fn 3 %}. These parameters control the smoothness of the curve: a window that is too narrow leaves most of the noise in place, while one that is too wide flattens real detail, and raising `polyorder` lets the curve follow the data more closely. The rule of thumb is to start low and build up from that. Because the best values vary with the data, they cannot be known beforehand. Controlling them visually helps us find the optimal curve for our purpose, hence the sliders for setting the inputs more intuitively.
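
To get a feel for how the two parameters interact before touching the real data, here is a rough sketch on a synthetic noisy sine wave. The signal, the noise level and the `roughness` helper are all assumptions made for illustration; `roughness` is just one crude way to put a number on how jagged a curve is.

```python
import numpy as np
from scipy import signal

# Synthetic data (an assumption for illustration): a sine wave plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0, 4 * np.pi, 400)
noisy = np.sin(x) + rng.normal(0, 0.3, x.size)

def roughness(y):
    """Hypothetical helper: sum of absolute second differences.
    Larger values mean a more jagged curve."""
    return np.abs(np.diff(y, n=2)).sum()

print("noisy input:", round(roughness(noisy), 2))

# Try a few (window_length, polyorder) pairs and compare the jaggedness
results = {}
for window_length, polyorder in [(5, 2), (21, 2), (51, 2), (51, 9)]:
    smoothed = signal.savgol_filter(noisy, window_length, polyorder)
    results[(window_length, polyorder)] = roughness(smoothed)
    print(window_length, polyorder, round(results[(window_length, polyorder)], 2))
```

With a fixed `polyorder`, wider windows should give a lower roughness, while a higher `polyorder` at the same window width lets more of the noise back in.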

In [None]:
def make_iplot(window_size, polyorder):
  data_x = df["date"].values
  data_y = df["daily_vaccinations"].values # original
  data_y_filtered = signal.savgol_filter(data_y, window_size, polyorder) # smoothed

  # Find peaks (max)
  peak_indexes = signal.argrelextrema(data_y_filtered, np.greater)
  peak_indexes = peak_indexes[0]

  # Find valleys (min)
  valley_indexes = signal.argrelextrema(data_y_filtered, np.less)
  valley_indexes = valley_indexes[0]

  # Matplotlib plot
  plt.figure(figsize=(20, 5))
  plt.plot(data_x, data_y, color="grey") # line plot for the original data
  plt.plot(data_x, data_y_filtered, color="black") # line plot for the filtered data
  plt.plot(data_x[valley_indexes], data_y_filtered[valley_indexes], "o", label="dip", color="r")
  plt.plot(data_x[peak_indexes], data_y_filtered[peak_indexes], "o", label="peak", color="g")
  plt.show()
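
The peak and valley detection in the function above relies on `signal.argrelextrema`, which returns the indexes where a value compares as greater (or less) than both of its neighbours. A small sketch on a made-up array shows the behaviour, including the fact that a strict comparator such as `np.greater` skips flat plateaus and the end points:

```python
import numpy as np
from scipy import signal

# Made-up values: two clear peaks, two clear dips, and a flat plateau at the end
y = np.array([0, 2, 1, 3, 0, 1, 1, 0])

# argrelextrema returns a tuple with one index array per dimension
peak_indexes = signal.argrelextrema(y, np.greater)[0]
valley_indexes = signal.argrelextrema(y, np.less)[0]

print(peak_indexes)    # [1 3]
print(valley_indexes)  # [2 4]
```

The plateau at positions 5 and 6 is not reported as a peak because neither value is strictly greater than its neighbour.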

With that said, we arbitrarily set the `window_size` slider to range from 1 to 100 in steps of 2, so that it only takes odd values, and the `polyorder` slider to range from 1 to 10.

In [None]:
# this line of code makes the figure appear in the output below
%matplotlib inline

iplot = interactive(
  make_iplot,
  window_size=(1,100,2),
  polyorder=(1,10,1)
)

iplot

![](images/1.png)
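
Note that the slider ranges allow combinations in which `polyorder` is not smaller than `window_size`, which `savgol_filter` rejects with a `ValueError`. One possible guard, sketched here with a hypothetical helper named `smooth_with_clamped_order` (not part of SciPy or of this notebook), is to lower `polyorder` whenever it is too large:

```python
import numpy as np
from scipy import signal

def smooth_with_clamped_order(y, window_size, polyorder):
    """Hypothetical helper: lower polyorder so it stays below window_size.
    Assumes window_size is an odd integer of at least 3."""
    safe_polyorder = min(polyorder, window_size - 1)
    return signal.savgol_filter(y, window_size, safe_polyorder)

# Made-up values for illustration
y = np.array([1.0, 4.0, 2.0, 6.0, 3.0, 7.0, 5.0])

# Calling the filter directly with polyorder >= window_length fails
try:
    signal.savgol_filter(y, 3, 5)
    rejected = False
except ValueError as err:
    rejected = True
    print("rejected:", err)

# The guarded call quietly falls back to polyorder = 2
smoothed = smooth_with_clamped_order(y, 3, 5)
print(smoothed.shape)
```

Silently changing a parameter is a design choice; showing a message next to the plot instead would be just as reasonable.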

We start with the sliders at 3 and 1 (`polyorder` < `window_size`), which gives a result close to the original data. Red dots mark dips and green dots mark peaks.

In [None]:
# this line of code makes the figure appear in the output below
%matplotlib inline

iplot = interactive(
  make_iplot,
  window_size=(1,100,2),
  polyorder=(1,10,1)
)

iplot

![](images/2.png)

With some experimentation, a `window_size` of 23 and a `polyorder` of 3 give us a relatively smooth graph with less visual noise.
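
Even on a smooth curve, every small bump still counts as a local extremum. One option that this tutorial does not use is `scipy.signal.find_peaks`, whose `prominence` argument keeps only the peaks that stand out from their surroundings by at least a given amount. The sketch below uses a made-up array and an arbitrary threshold of 1.5; dips are found by running the same search on the negated values.

```python
import numpy as np
from scipy import signal

# Made-up values: a small peak, a large peak and a medium peak
y = np.array([0.0, 1.0, 0.0, 5.0, 0.0, 2.0, 0.0])

# Keep only peaks whose prominence is at least 1.5 (arbitrary threshold)
peaks, peak_props = signal.find_peaks(y, prominence=1.5)

# Dips are peaks of the negated signal
dips, dip_props = signal.find_peaks(-y, prominence=1.5)

print(peaks)                      # [3 5]
print(peak_props["prominences"])  # [5. 2.]
print(dips)                       # [4]
```

The small peak at index 1 is dropped because it rises only 1 unit above its surroundings. On real data the threshold would need the same kind of visual tuning as the filter parameters.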

## Conclusion

Applying the Savitzky-Golay filter helps get rid of noise and presents a clearer picture of the data. This example may not be the best showcase, as the vaccination data doesn't fluctuate as much as other series, such as digital signals. However, the same approach can be applied to find prominent features in other kinds of noisy line data.

{{ '[Savitzky–Golay filter](https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter)' | fndetail: 1 }}

{{ '[scipy.signal.savgol_filter](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html)' | fndetail: 2 }}

{{ '[Smoothing Your Data with the Savitzky-Golay Filter and Python](https://blog.finxter.com/smoothing-your-data-with-the-savitzky-golay-filter-and-python/)' | fndetail: 3 }}