# Processing Raw Data

This notebook shows how to use the Nowcast Library to process raw time series data, in particular how to synchronize data sets that come from different data sources.

## Imports

In [91]:
import datetime
import numpy as np
import pandas as pd

## The Data

To keep this notebook independent of external data, let's generate our own. We will generate data from three different data sources and then work with that. The sources will have different sample rates, some of them irregular, and each will be missing data over different time periods.

First, let's define a function for generating data:

In [94]:
def generate_data(
    sr_secs,
    start_date,
    end_date,
    n_cols=1,
    sr_stdev=0,
    gap_frequency=0,
    gap_size=0,
    gap_size_stdev=0,
):
    """
    Generates time series data

    Parameters
    ----------
    sr_secs : int
        The number of seconds between each data point
    start_date : string or datetime.datetime or numpy.datetime64
        When the generated time series should start
    end_date : string or datetime.datetime or numpy.datetime64
        When the generated time series should end
    n_cols : int, default 1
        The number of columns the resulting dataframe should have
    sr_stdev : number, default 0
        The standard deviation in seconds the sample rate should have,
        if an irregular sample rate is desired
    gap_frequency : int, default 0
        The number of time steps between consecutive gaps
    gap_size : int, default 0
        The length of each gap in seconds
    gap_size_stdev : number, default 0
        The standard deviation in gap size, if the gap size should be irregular

    Returns
    -------
    pandas.core.frame.DataFrame
        Time-indexed pandas dataframe containing the generated data
    """
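    # Regularly spaced timestamps at millisecond resolution (end_date is exclusive)
    regular_tsteps = np.arange(
        start_date, end_date, step=np.timedelta64(sr_secs, "s"), dtype="datetime64[ms]"
    )
    n_tsteps = len(regular_tsteps)
    tsteps = regular_tsteps
    if sr_stdev != 0:
        # Zero-mean Gaussian jitter, converted from seconds to milliseconds
        deviations = (
            np.random.normal(sr_secs, sr_stdev, n_tsteps - 2) - sr_secs
        ) * 1000
        # Jitter only the interior points so the first and last timestamps stay fixed
        tsteps[1:-1] += deviations.astype(int)
    return tsteps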

In [96]:
generate_data(7200, "2020-01-01", "2020-01-03 02:00:00", 1, 300)

array(['2020-01-01T00:00:00.000', '2020-01-01T01:59:35.834',
       '2020-01-01T03:55:23.195', '2020-01-01T05:58:09.889',
       '2020-01-01T07:57:34.317', '2020-01-01T09:57:28.770',
       '2020-01-01T12:02:14.331', '2020-01-01T14:06:14.509',
       '2020-01-01T16:02:16.319', '2020-01-01T17:52:14.415',
       '2020-01-01T19:50:19.448', '2020-01-01T22:03:58.369',
       '2020-01-01T23:54:25.553', '2020-01-02T02:01:53.683',
       '2020-01-02T04:09:56.193', '2020-01-02T06:02:48.136',
       '2020-01-02T08:00:51.973', '2020-01-02T10:06:53.448',
       '2020-01-02T11:56:26.275', '2020-01-02T13:59:26.299',
       '2020-01-02T16:11:07.617', '2020-01-02T17:52:34.546',
       '2020-01-02T19:58:12.780', '2020-01-02T22:01:53.774',
       '2020-01-03T00:00:00.000'], dtype='datetime64[ms]')
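
The function body above does not yet use `n_cols`, `gap_frequency`, or `gap_size`. The cell below is a minimal sketch of how they might be applied, based only on the docstring descriptions. The helper names `drop_gaps` and `to_dataframe`, the `seed` argument, the `col_i` column names, and the random Gaussian values are all assumptions made for illustration. None of them is part of the Nowcast Library, and `gap_size_stdev` is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd


def drop_gaps(tsteps, gap_frequency, gap_size):
    """Hypothetical helper: remove the timestamps that fall inside a gap.

    A gap lasting `gap_size` seconds is opened at every
    `gap_frequency`-th timestamp.
    """
    if gap_frequency <= 0 or gap_size <= 0:
        return tsteps
    keep = np.ones(len(tsteps), dtype=bool)
    gap_length = np.timedelta64(gap_size, "s")
    for gap_start in tsteps[gap_frequency::gap_frequency]:
        # Mark every timestamp in [gap_start, gap_start + gap_length) for removal
        in_gap = (tsteps >= gap_start) & (tsteps < gap_start + gap_length)
        keep &= ~in_gap
    return tsteps[keep]


def to_dataframe(tsteps, n_cols=1, seed=None):
    """Hypothetical helper: wrap timestamps in a time-indexed dataframe.

    The values are plain Gaussian noise (an assumption for illustration).
    """
    rng = np.random.default_rng(seed)
    values = rng.normal(size=(len(tsteps), n_cols))
    columns = [f"col_{i}" for i in range(n_cols)]
    index = pd.DatetimeIndex(tsteps, name="time")
    return pd.DataFrame(values, index=index, columns=columns)


# One day of hourly timestamps, with a three-hour gap at every 8th time step
tsteps = np.datetime64("2020-01-01", "ms") + np.arange(24) * np.timedelta64(3600, "s")
with_gaps = drop_gaps(tsteps, gap_frequency=8, gap_size=3 * 3600)
df = to_dataframe(with_gaps, n_cols=2, seed=0)
```

Here the gaps open at 08:00 and 16:00, so the hours 8 to 10 and 16 to 18 are missing from `df.index`. Filtering with a boolean mask keeps the remaining timestamps untouched, which means it also works on the jittered output of `generate_data`.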