In [1]:
%matplotlib widget

# Processing Raw Data

This notebook shows how to use the Nowcast Library to process raw time series data, in particular how to synchronize data sets that come from different sources.

## Imports and Config

In [2]:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
sns.set_style("ticks")
sns.set_context("notebook")

In [4]:
plt.close("all")

## The Data

To keep this notebook independent of external data, we generate our own. We will generate data from three different sources and then work with those. The sources have different sample rates, possibly irregular ones, and each is missing data over different time periods.

First, let's define a function for generating data:

In [5]:
def generate_data(
    sr_secs,
    start_date,
    end_date,
    n_cols=1,
    sr_stdev=0,
    gap_period=0,
    gap_size=0,
    gap_size_stdev=0,
):
    """
    Generates random time series data

    Parameters
    ----------
    sr_secs : int
        The number of seconds between each data point
    start_date : string or datetime.datetime or numpy.datetime64
        When the generated time series should start
    end_date : string or datetime.datetime or numpy.datetime64
        When the generated time series should end
    n_cols : int, default 1
        The number of columns the resulting dataframe should have
    sr_stdev : number, default 0
        The standard deviation in seconds the sample rate should have,
        if an irregular sample rate is desired
    gap_period : int, default 0
        Interval in seconds at which gaps occur; 0 means no gaps
    gap_size : int, default 0
        How long a gap should be in seconds
    gap_size_stdev : number, default 0
        The standard deviation of the gap size in seconds, if the gap size
        should be irregular

    Returns
    -------
    pandas.core.frame.DataFrame
        Time-indexed pandas dataframe containing the generated data
    """
    regular_tsteps = np.arange(
        start_date, end_date, step=np.timedelta64(sr_secs, "s"), dtype="datetime64[ms]"
    )
    n_frames = len(regular_tsteps)

    gap_mask = np.ones(n_frames)
    if gap_period > 0:
        assert gap_size > 0, "Can't have 0-length gaps"
        # need to convert from seconds to frames
        gap_size = int(gap_size / sr_secs)
        gap_size_stdev = gap_size_stdev / sr_secs
        gap_period = int(gap_period / sr_secs)

        gappy_section_length = n_frames - 2 * gap_period

        n_gaps = int(gappy_section_length / gap_size)

        deviations = np.random.normal(gap_size, gap_size_stdev, n_gaps) - gap_size
        for i in range(n_gaps):
            curr_index = i * (gap_period + gap_size)
            if (curr_index + gap_size + deviations[i]) < gappy_section_length:
                gap_mask[gap_period:-gap_period][
                    curr_index : curr_index + int(gap_size + deviations[i])
                ] = 0
    gap_mask = gap_mask.astype(bool)

    tsteps = regular_tsteps
    if sr_stdev != 0:
        deviations = (
            np.random.normal(sr_secs, sr_stdev, n_frames - 2) - sr_secs
        ) * 1000
        tsteps[1:-1] += deviations.astype(int)
    gen_data = pd.DataFrame(np.random.randn(n_frames, n_cols), index=tsteps)
    return gen_data[gap_mask]
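
The timestamp jitter inside `generate_data` can also be looked at in isolation. The following is a minimal sketch, not part of the Nowcast Library: the helper names `jittered_grid` and `is_sorted`, the one-day date range, and the seed are assumptions made for illustration. It builds a regular millisecond-resolution grid, adds Gaussian jitter to every timestamp except the first and last, and checks whether the result is still in chronological order.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable


def jittered_grid(sr_secs, sr_stdev, start="2020-01-01", end="2020-01-02"):
    """Hypothetical helper: regular ms-resolution grid with Gaussian jitter."""
    grid = np.arange(
        np.datetime64(start, "ms"),
        np.datetime64(end, "ms"),
        np.timedelta64(sr_secs * 1000, "ms"),
    )
    # Jitter every timestamp except the first and last, as generate_data does
    jitter_ms = (rng.normal(0, sr_stdev, len(grid) - 2) * 1000).astype("int64")
    grid[1:-1] += jitter_ms.astype("timedelta64[ms]")
    return grid


def is_sorted(tsteps):
    """True if every timestamp is strictly later than the previous one."""
    return bool(np.all(np.diff(tsteps) > np.timedelta64(0, "ms")))


small = jittered_grid(sr_secs=120, sr_stdev=5)    # jitter well below the spacing
large = jittered_grid(sr_secs=120, sr_stdev=300)  # jitter larger than the spacing
print(is_sorted(small), is_sorted(large))
```

When the jitter is larger than the sample spacing, neighbouring timestamps can swap places, so the generated index is not guaranteed to be chronological. If an order-sensitive operation follows, calling `sort_index()` on the dataframe first is one way to restore the order.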

In [6]:
df = generate_data(120, "2020-01-01", "2020-01-31", 3, 300, 12 * 3600, 6 * 3600, 1200)
# Average onto a regular 2-minute grid; bins with no samples become NaN
resampled_df = df.resample("2min").mean()
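
To see what resampling onto a 2-minute grid does to irregular, gappy data, here is a small self-contained sketch. The toy timestamps and the names `toy` and `toy_resampled` are made up for illustration and are unrelated to the generated data above. Samples that fall into the same 2-minute bin are averaged, and bins that contain no samples are kept in the index with a NaN value.

```python
import pandas as pd

# Hypothetical toy data: irregular timestamps with a gap between 00:03 and 00:10
idx = pd.to_datetime(
    [
        "2020-01-01 00:00:10",
        "2020-01-01 00:01:50",
        "2020-01-01 00:02:30",
        "2020-01-01 00:10:05",
    ]
)
toy = pd.DataFrame({"value": [1.0, 3.0, 5.0, 7.0]}, index=idx)

# Bins start at 00:00, 00:02, 00:04, ... and are labelled by their left edge
toy_resampled = toy.resample("2min").mean()
print(toy_resampled)

# Empty bins show up as NaN rather than being dropped
n_empty_bins = int(toy_resampled["value"].isna().sum())
```

The resampled frame has a regular index, which makes it easier to line up with other sources later. The NaN bins mark where the raw data had gaps.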

In [7]:
plt.figure(figsize=(10, 6))
plt.plot(resampled_df[1], linewidth=0.5, color="darkblue", label="resampled", zorder=1)
plt.scatter(df.index, df[1], s=0.5, c="red", label="raw", zorder=2)
plt.legend()
plt.tight_layout()
plt.draw()
