In [1]:
%matplotlib widget

# Handling Raw Data

This notebook showcases how to use the Nowcast Library to process raw time series data, particularly with regard to synchronizing datasets that come from different data sources.

## Imports and Config

In [2]:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
sns.set_style("ticks")
sns.set_context("notebook")

In [4]:
plt.close("all")

## The Data

To make this notebook independent of external data, let's generate our own. We will generate data from 3 different data sources, and then work with that. The data will have different sample rates, perhaps even be irregular, and will be missing data at different time periods.

First, let's define a function for generating data:

In [5]:
def generate_data(
    sr_secs,
    start_date,
    end_date,
    n_cols=1,
    sr_stdev=0,
    gap_period=0,
    gap_size=0,
    gap_size_stdev=0,
    col_names=None
):
    """
    Generates random time series data

    Parameters
    ----------
    sr_secs : int
        The number of seconds between each data point
    start_date : string or datetime.datetime or numpy.datetime64
        When the generated time series should start
    end_date : string or datetime.datetime or numpy.datetime64
        When the generated time series should end
    n_cols : int, default 1
        The number of columns the resulting dataframe should have
    sr_stdev : number, default 0
        The standard deviation in seconds the sample rate should have,
        if an irregular sample rate is desired
    gap_period : int, default 0
        How many seconds of data between gaps, 0 means we never want gaps
    gap_size : int, default 0
        How long a gap should be in seconds
    gap_size_stdev : number, default 0
        The standard deviation in gap size, if the gap size should be irregular
    col_names : list of string, default None
        List of column names the resulting dataframe should have

    Returns
    -------
    pandas.core.frame.DataFrame
        Time-indexed pandas dataframe containing the generated data
    """
    regular_tsteps = np.arange(
        start_date, end_date, step=np.timedelta64(sr_secs, "s"), dtype="datetime64[ms]"
    )
    n_frames = len(regular_tsteps)

    gap_mask = np.ones(n_frames)
    if gap_period > 0:
        assert gap_size > 0, "Can't have 0-length gaps"
        # need to convert from seconds to frames
        gap_size = int(gap_size / sr_secs)
        gap_size_stdev = gap_size_stdev / sr_secs
        gap_period = int(gap_period / sr_secs)

        gappy_section_length = n_frames - 2 * gap_period

        n_gaps = int(gappy_section_length / gap_size)

        deviations = np.random.normal(gap_size, gap_size_stdev, n_gaps) - gap_size
        for i in range(n_gaps):
            curr_index = i * (gap_period + gap_size)
            if (curr_index + gap_size + deviations[i]) < gappy_section_length:
                gap_mask[gap_period:-gap_period][
                    curr_index : curr_index + int(gap_size + deviations[i])
                ] = 0
    gap_mask = gap_mask.astype(bool)

    tsteps = regular_tsteps
    if sr_stdev != 0:
        deviations = (
            np.random.normal(sr_secs, sr_stdev, n_frames - 2) - sr_secs
        ) * 1000
        tsteps[1:-1] += deviations.astype(int)
    gen_data = pd.DataFrame(np.random.randn(n_frames, n_cols), index=tsteps, columns=col_names)
    return gen_data[gap_mask]

In [6]:
df1 = generate_data(
    120,
    "2020-01-01",
    "2020-01-31",
    n_cols=3,
    sr_stdev=30,
    gap_period=7 * 12 * 3600,
    gap_size=12 * 3600,
    gap_size_stdev=3600,
    col_names=["A", "B", "C"]
)
df2 = generate_data(
    110,
    "2020-01-04",
    "2020-02-09",
    n_cols=4,
    sr_stdev=5,
    gap_period=12 * 3600,
    gap_size=6 * 3600,
    gap_size_stdev=1800,
    col_names=["D", "E", "F", "G"]
)
df3 = generate_data(
    300,
    "2019-12-25",
    "2020-01-28",
    n_cols=2,
    sr_stdev=60,
    gap_period=3 * 3600,
    gap_size=3600,
    gap_size_stdev=900,
    col_names=["H", "I"]
)

Let's visualize the data. We plot only one column from each dataframe, since we are mostly interested in the gaps and the sampling rate, which are the same across all columns of a given dataframe. Two of the dataframes have been shifted vertically so that the series do not overlap.

In [7]:
sns.set_palette("Set1")
shift_amount = 4
plt.figure(figsize=(12, 6))
plt.scatter(
    df1.index, df1["A"] + shift_amount, s=1, label="df1 + {}".format(shift_amount)
)
plt.scatter(df2.index, df2["E"], s=1, label="df2")
plt.scatter(
    df3.index, df3["H"] - shift_amount, s=1, label="df3 - {}".format(shift_amount)
)
plt.legend()
plt.tight_layout()
plt.show()

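The plot shows the gaps and the sampling density qualitatively. A quick numeric summary can complement it. The following is a minimal sketch, not part of the original notebook: `summarize_sampling` is a hypothetical helper and `toy_df` is a small stand-in for the generated dataframes.

```python
import numpy as np
import pandas as pd


def summarize_sampling(df):
    """Return the median sample interval and the largest gap of a time-indexed dataframe."""
    # differences between consecutive timestamps (sorted first, in case of jitter)
    deltas = df.index.sort_values().to_series().diff().dropna()
    return {"median_interval": deltas.median(), "largest_gap": deltas.max()}


# toy stand-in (hypothetical): 1-minute data with a hole punched into it
index = pd.date_range("2020-01-01", periods=30, freq="1min")
toy_df = pd.DataFrame({"A": np.arange(30.0)}, index=index)
toy_df = toy_df.drop(index[10:19])  # removes 9 samples, leaving a 10-minute gap

summary = summarize_sampling(toy_df)
print(summary)
```

The same helper could be applied to `df1`, `df2` and `df3` to confirm the approximate sample rates before choosing a target rate.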

## Processing

### Data Synchronization and Sample-Rate Regularization

If we want to feed our datasets jointly into a time series model, the data often needs to be regular, i.e. sampled at a constant rate. The datasets also have to be synchronized (meaning matching sample rates and timestamps) and merged into a single dataset, so that the model can receive and reason about them as a whole.

We can achieve both as follows.

First, we pick a target sample rate. All of our dataframes will be resampled to this rate, so it is best to pick one that is close to the typical sample rate across the datasets. Two of the three dataframes we are dealing with have a sample rate of ~2 minutes, and the third one has a sample rate of ~5 minutes (I use "~" because the sample rate is not constant), so let's pick a target sample rate of 2 minutes. When resampling, we also floor the resample origin to the target sample rate, so that every dataframe is resampled onto the same grid of timestamps. This means that wherever the dataframes overlap in time, their resampled timestamps match exactly.
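To illustrate why the floored origin matters, here is a minimal sketch on two toy series (the values and timestamps are made up for illustration). It contrasts the floored origin with pandas' `origin="start"` option, which anchors the grid at the first sample instead.

```python
import pandas as pd

TARGET_SR = "2min"  # same target sample rate as in this notebook

# two toy series whose first samples fall at different, non-aligned times
s1 = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(
        ["2020-01-01 00:00:40", "2020-01-01 00:02:10", "2020-01-01 00:04:30"]
    ),
)
s2 = pd.Series(
    [10.0, 20.0],
    index=pd.to_datetime(["2020-01-01 00:03:15", "2020-01-01 00:05:50"]),
)

# flooring the first timestamp snaps each resampling grid to whole 2-minute marks
r1 = s1.resample(TARGET_SR, origin=s1.index[0].floor(TARGET_SR)).mean()
r2 = s2.resample(TARGET_SR, origin=s2.index[0].floor(TARGET_SR)).mean()
shared = r1.index.intersection(r2.index)

# for contrast: anchoring the grid at the first sample gives a shifted grid
r2_unaligned = s2.resample(TARGET_SR, origin="start").mean()

print(shared)
print(r2_unaligned.index)
```

With the floored origin, both series land on bins starting at whole multiples of 2 minutes, so the overlapping bins share identical timestamps. With `origin="start"`, the second series' bins begin at 00:03:15 and would never match the first series' bins.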

We then concatenate the resampled dataframes along the 2nd axis (axis=1) and perform an inner join on the index, such that only timestamps that are shared across all dataframes are kept. This is basically an "intersection" of the indices, if we're thinking in terms of sets.
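As a small illustration of the inner join, consider two toy dataframes (hypothetical values) that are already on the same 2-minute grid but cover different time spans:

```python
import pandas as pd

# toy resampled frames on the same 2-minute grid, covering different time spans
left = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0]},
    index=pd.date_range("2020-01-01 00:00", periods=4, freq="2min"),
)
right = pd.DataFrame(
    {"D": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2020-01-01 00:04", periods=4, freq="2min"),
)

# join="inner" keeps only the timestamps present in both indices
joined = pd.concat([left, right], axis=1, join="inner")
print(joined)
```

Only the two timestamps that appear in both frames (00:04 and 00:06) survive the join; everything outside the overlap is dropped.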

In [8]:
# config
TARGET_SR = "2min"

In [12]:
# resampling
dfs = [df1, df2, df3]
for i, df in enumerate(dfs):
    dfs[i] = df.resample(TARGET_SR, origin=df.index[0].floor(TARGET_SR)).mean()

In [13]:
# concatenating
synced_df = pd.concat(dfs, axis=1, join="inner")
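
One thing worth checking at this point: the inner join only intersects the indices, it does not drop rows that contain missing values. Bins that received no samples during resampling (inside gaps, or because a source is sampled more slowly than the target rate, like the ~5-minute dataframe) come out of `.mean()` as NaN. A minimal sketch with toy frames (hypothetical values) shows the effect:

```python
import numpy as np
import pandas as pd

# toy frames: one sampled every 2 minutes, one every 5 minutes (like df3)
fast = pd.DataFrame(
    {"A": np.arange(10.0)},
    index=pd.date_range("2020-01-01", periods=10, freq="2min"),
)
slow = pd.DataFrame(
    {"H": np.arange(4.0)},
    index=pd.date_range("2020-01-01", periods=4, freq="5min"),
)

# same resample-then-concatenate steps as above, applied to the toy frames
resampled = [
    df.resample("2min", origin=df.index[0].floor("2min")).mean()
    for df in (fast, slow)
]
toy_synced = pd.concat(resampled, axis=1, join="inner")

# fraction of missing values per column after synchronization
nan_fraction = toy_synced.isna().mean()
print(nan_fraction)
```

Running `synced_df.isna().mean()` on the real result gives the same kind of per-column summary, which is a useful starting point for deciding how to handle the gaps.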

In [26]:
sns.set_palette("Set1")
shift_amount = 4
f, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6), sharex=True)
ax1.scatter(
    df1.index, df1["A"] + shift_amount, s=1, label="df1 + {}".format(shift_amount)
)
ax1.scatter(df2.index, df2["E"], s=1, label="df2")
ax1.scatter(
    df3.index, df3["H"] - shift_amount, s=1, label="df3 - {}".format(shift_amount)
)
ax1.set_title("Raw Data")
ax1.legend()
ax2.scatter(
    synced_df.index,
    synced_df["A"] + shift_amount,
    s=1,
    label="synced_df df1 + {}".format(shift_amount),
)
ax2.scatter(
    synced_df.index,
    synced_df["E"],
    s=1,
    label="synced_df df2",
)
ax2.scatter(
    synced_df.index,
    synced_df["H"] - shift_amount,
    s=1,
    label="synced_df df3 - {}".format(shift_amount),
)
ax2.set_title("Synchronized and Regularised Dataframes")
ax2.legend()
plt.tight_layout()
plt.show()


### Finding Overlapping Data and Handling Gaps