# Handling DateTime Values

## What is 1594536602?
Let's start with the (overly optimistic) assumption that all datetime values are represented as Unix time.

*Note: This assumption fails when we have to take into account things like how the value should be presented (e.g. time zones, 1997-12-12 vs 12/12/1997, etc.).*

Now suppose we have the following dataset:

In [1]:
import pandas as pd

df = pd.DataFrame([
    {"datetime": 1577836800, "nb_customers": 1},
    {"datetime": 1578182400, "nb_customers": 10},
    {"datetime": 1578441600, "nb_customers": 2},
    {"datetime": 1578787200, "nb_customers": 8}
])

If we were trying to apply standard machine learning algorithms to, for example, predict the number of customers on a given day, our path forward would be simple. We would use some feature engineering methods to transform the datetime into more easily interpretable features for the model.

For example, in this case, if we added a `day_of_week` feature, then our model would easily learn that there are more customers on Sundays than Wednesdays.
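As a minimal sketch of that feature-engineering step, the snippet below derives a day-of-week label for each timestamp in the example dataset using only the standard library. It interprets the timestamps in UTC, which is an assumption: the dataset does not say which time zone applies, and `datetime.fromtimestamp` without a `tz` argument would use the machine's local time zone instead. The helper name `day_of_week` is made up for illustration.

```python
from datetime import datetime, timezone

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def day_of_week(timestamp):
    """Return the weekday name of a Unix timestamp, read as UTC (an assumption)."""
    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return DAY_NAMES[dt.weekday()]

rows = [
    {"datetime": 1577836800, "nb_customers": 1},
    {"datetime": 1578182400, "nb_customers": 10},
    {"datetime": 1578441600, "nb_customers": 2},
    {"datetime": 1578787200, "nb_customers": 8},
]

# Attach the engineered feature to every row.
for row in rows:
    row["day_of_week"] = day_of_week(row["datetime"])

features = [(row["day_of_week"], row["nb_customers"]) for row in rows]
print(features)
```

With the feature in place, the relationship between weekday and customer count is visible directly in the rows instead of being hidden inside a ten-digit integer.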

## Generating Timestamps
Feature extraction is a powerful tool for discriminative models... but it's not clear how to translate those ideas into a generative framework such as **DeepEcho**.

> *In fact, to the best of my knowledge, there exist no papers in the space of synthetic data generation, generative adversarial networks, autoregressive models, etc. which discuss how to generate timestamps.*

The default way to handle timestamp values appears to involve simply normalizing the Unix timestamp into a value in $[-1.0, 1.0]$, based on the minimum and maximum timestamps in the data, and treating it as a real number.
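That default normalization could look roughly like the sketch below. The function names `normalize` and `denormalize` are hypothetical, and the code assumes the maximum timestamp is strictly greater than the minimum.

```python
def normalize(timestamp, lo, hi):
    """Map a timestamp from [lo, hi] onto [-1.0, 1.0] (assumes hi > lo)."""
    return 2.0 * (timestamp - lo) / (hi - lo) - 1.0

def denormalize(value, lo, hi):
    """Inverse of normalize: map a value in [-1.0, 1.0] back to [lo, hi]."""
    return lo + (value + 1.0) / 2.0 * (hi - lo)

timestamps = [1577836800, 1578182400, 1578441600, 1578787200]
lo, hi = min(timestamps), max(timestamps)

normalized = [normalize(t, lo, hi) for t in timestamps]
restored = [denormalize(v, lo, hi) for v in normalized]
print(normalized)
```

Note that the result depends entirely on the minimum and maximum of the particular dataset, so the same calendar date maps to a different value in every dataset.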

### Modeling Distributions
Within **DeepEcho**, our models tend to output distributions over the data rather than the synthetic data itself. This is important because it lets us separate the sampling stage from the modeling stage, particularly for autoregressive models, which would otherwise be deterministic.

#### Continuous Values
For example, in the `PARModel`, when modeling continuous values, our model outputs three values:

 - `mu` - the mean of the Normal distribution
 - `var` - the variance of the Normal distribution
 - `missing` - the probability that the value is missing

which we can use to generate a value by (1) sampling whether the value is missing and (2) if the value is not missing, sampling from the Normal distribution with the given mean and variance.
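The two-step sampling procedure can be sketched as follows. This is an illustration of the idea, not the `PARModel` code itself; the function name `sample_continuous` and the use of `None` to represent a missing value are assumptions.

```python
import math
import random

def sample_continuous(mu, var, missing, rng=random):
    """Sample a value from the (mu, var, missing) triple described above.

    Returns None with probability `missing`; otherwise returns a draw
    from a Normal distribution with mean `mu` and variance `var`.
    """
    # Step 1: decide whether the value is missing.
    if rng.random() < missing:
        return None
    # Step 2: sample from the Normal distribution (gauss takes a std dev).
    return rng.gauss(mu, math.sqrt(var))

rng = random.Random(0)
draws = [sample_continuous(mu=5.0, var=4.0, missing=0.2, rng=rng) for _ in range(5)]
print(draws)
```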

#### Timestamps as Continuous Values
This works well for a wide variety of continuous-valued variables. However, it has some clear weaknesses if you try to use it to handle Unix timestamps.

For example, suppose your dataset only contains Sundays at 12am. The model would have to learn that the timestamps it generates should satisfy `x mod 604800 = 259200` (in UTC: a week has 604800 seconds, and the Unix epoch fell on a Thursday, three days before a Sunday). Normalizing your timestamps to the `[-1.0, 1.0]` range doesn't necessarily help - it might even hurt!
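To make the weekly constraint concrete, the check below verifies that a Sunday-midnight timestamp in UTC leaves a remainder of 259200 when divided by the 604800 seconds in a week. The constant and function names are made up for this sketch, and UTC is assumed throughout.

```python
from datetime import datetime, timezone

SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800
# The Unix epoch (1970-01-01) was a Thursday, so Sunday 00:00 UTC falls
# three days (259200 seconds) into each epoch-aligned week.
SUNDAY_MIDNIGHT_REMAINDER = 3 * 24 * 60 * 60  # 259200

def is_sunday_midnight_utc(timestamp):
    """True if the timestamp is exactly Sunday 00:00:00 in UTC."""
    return timestamp % SECONDS_PER_WEEK == SUNDAY_MIDNIGHT_REMAINDER

# Two Sunday-midnight timestamps and one Wednesday-midnight timestamp.
for ts in (1578182400, 1578787200, 1577836800):
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    print(ts, dt.strftime("%A %H:%M"), is_sunday_midnight_utc(ts))
```

A model that treats the timestamp as a single real number has to discover this modular structure on its own, which is the weakness the paragraph above describes.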

Therefore, we're looking for some way to specify a distribution over timestamps that:
1. Makes it easy for the model to learn cyclical and seasonal properties
2. Has a fixed scale which doesn't require additional/arbitrary normalization
3. Can be efficiently sampled from
4. Has a functional form which allows us to compute the log-likelihood of the data and then compute the gradients

## Modeling Timestamps
The **rejection sampling** framework gives us the tools to sample from an unnormalized probability distribution where (1) we know the range of possible x-values and (2) we know the maximum value of the pdf. Using rejection sampling, we can avoid having to specify a well-formed pdf/cdf at the cost of a little extra computation.
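A generic rejection sampler under those two conditions might look like the sketch below. The function name `rejection_sample` and its parameters are assumptions for illustration; `max_score` is any known upper bound on the unnormalized density over the range.

```python
import random

def rejection_sample(score, lo, hi, max_score, rng=random, max_tries=10_000):
    """Draw x from [lo, hi] with probability proportional to score(x).

    `score` is an unnormalized density and `max_score` an upper bound on it.
    """
    for _ in range(max_tries):
        x = rng.uniform(lo, hi)          # propose a candidate uniformly
        u = rng.uniform(0.0, max_score)  # uniform height under the bound
        if u < score(x):                 # accept if it falls under the curve
            return x
    raise RuntimeError("no sample accepted within max_tries")

# Example: an unnormalized density score(x) = x on [0, 1].
rng = random.Random(0)
samples = [rejection_sample(lambda x: x, 0.0, 1.0, 1.0, rng) for _ in range(2000)]
print(sum(samples) / len(samples))  # the true mean of this density is 2/3
```

The extra computation mentioned above is the rejected proposals: the looser the bound `max_score` is relative to the typical score, the more candidates are thrown away.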

### Motivating Example
This allows us to think of our "distribution" over timestamps as a scoring function such as:

$$
    P(\texttt{timestamp}) \propto
        a \cdot I[\texttt{is_winter}] + 
        b \cdot I[\texttt{is_monday}] +
        c \cdot I[\texttt{is_sunday}]
$$

where $a$, $b$, and $c$ are the parameters of the distribution and must lie in the range $[0.0, 1.0]$.

#### As Input
When providing a timestamp as input to the model, we simply set the parameters equal to the corresponding indicator variable. For example, if our value is `1578182400`, then we set $a=1.0$ since the date is in the winter, $b=0.0$ since it is not a Monday, and $c=1.0$ since it is a Sunday.

Note that in this example scoring function, all Sundays in the winter are equally likely (i.e. they all receive the same score and are therefore equally likely to be sampled).
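As a sketch of this input encoding, the function below maps a timestamp to the three indicator values. It assumes UTC and treats December through February as winter (a northern-hemisphere convention); neither choice is specified in the text, and the function name `to_indicators` is hypothetical.

```python
from datetime import datetime, timezone

def to_indicators(timestamp):
    """Return [a, b, c] = [is_winter, is_monday, is_sunday] as floats."""
    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)  # UTC is an assumption
    is_winter = dt.month in (12, 1, 2)  # assumption: Dec-Feb counts as winter
    is_monday = dt.weekday() == 0
    is_sunday = dt.weekday() == 6
    return [float(is_winter), float(is_monday), float(is_sunday)]

print(to_indicators(1578182400))  # 2020-01-05 00:00 UTC, a winter Sunday
```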

#### As Output
When these parameters are produced by the model, they specify a scoring function which takes a timestamp and returns a real-valued score in the range $[0.0, 3.0]$ which is proportional to the value of the pdf at the given timestamp. Using rejection sampling, we can sample from this pdf and obtain timestamps!

For example, suppose our model produces $a=1.0$, $b=1.0$, $c=0.0$. Intuitively, this would suggest that we are more likely to sample a timestamp that is in the winter and on a Monday.
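The sketch below evaluates the scoring function for a given $(a, b, c)$ and draws timestamps from it by rejection sampling, using 3.0 as the upper bound on the score. It assumes UTC, a December-to-February winter, and a sampling window covering the year 2020; all three are illustrative choices, as are the function names.

```python
import random
from datetime import datetime, timezone

def score(timestamp, a, b, c):
    """Unnormalized score a*I[is_winter] + b*I[is_monday] + c*I[is_sunday]."""
    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)  # UTC is an assumption
    is_winter = dt.month in (12, 1, 2)  # assumption: Dec-Feb counts as winter
    return a * is_winter + b * (dt.weekday() == 0) + c * (dt.weekday() == 6)

def sample_timestamp(a, b, c, start, end, rng=random, max_tries=10_000):
    """Rejection-sample a timestamp in [start, end); None if nothing is accepted."""
    for _ in range(max_tries):
        t = rng.randrange(start, end)
        u = rng.uniform(0.0, 3.0)  # 3.0 bounds the score since a, b, c <= 1
        if u < score(t, a, b, c):
            return t
    return None

a, b, c = 1.0, 1.0, 0.0
print(score(1578268800, a, b, c))  # winter Monday
print(score(1578182400, a, b, c))  # winter Sunday
print(score(1594080000, a, b, c))  # summer Tuesday

rng = random.Random(0)
year_2020 = (1577836800, 1609459200)  # illustrative sampling window
draws = [sample_timestamp(a, b, c, *year_2020, rng=rng) for _ in range(50)]
```

With these parameters a winter Monday scores twice as high as any other winter day or any non-winter Monday, so it is accepted twice as often, and timestamps that score zero are never returned.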

## Implementation
Here's a quick-and-dirty implementation of the scoring function, rejection sampling, and corresponding transform/sample steps. Note that it uses one parameter per day of the week rather than the three indicators from the motivating example. Don't pay too much attention to this; we can definitely do a better job of making it more readable and computationally efficient.

**Aside: Why isn't this an RDT?** *Well, the main distinction between this and an RDT is that the "reverse transform" involves sampling (and is not deterministic/"reversible"). However, if you want to package it as part of RDT, I'm fine with that. This document is intended to describe the mathematical framework for modeling timestamps as opposed to recommending some specific software design / structure.*

In [2]:
import numpy as np
from datetime import datetime

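class EchoTime():
    """Encode timestamps as day-of-week scores and sample them back."""

    def score(self, timestamp, params):
        # Score = the parameter for the timestamp's weekday (0 = Monday).
        # Note: fromtimestamp() converts using the local time zone.
        dt = datetime.fromtimestamp(timestamp)
        return params[dt.weekday()]

    def to_params(self, timestamp):
        # One-hot encode the weekday of the timestamp.
        params = [0.0] * 7
        dt = datetime.fromtimestamp(timestamp)
        params[dt.weekday()] = 1.0
        return params

    def from_params(self, params, max_tries=1000):
        # Rejection sampling: propose a timestamp uniformly from a hard-coded
        # range and accept it with probability equal to its score.
        for _ in range(max_tries):
            timestamp = np.random.randint(1535730073, 1555730073)
            u = np.random.uniform(0, 1.0) # max value = 1.0
            if u <= self.score(timestamp, params):
                return timestamp
        # Falls through (returns None) if nothing is accepted in max_tries.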

    def transform(self, df, target_column):
        """Transform the target datetime column.
        """
        new_df = []
        self.columns = df.columns
        self.target_column = target_column
        for row in df.to_dict('records'):
            # Call the method on self rather than on the global `et` instance.
            params = self.to_params(row[target_column])
            del row[target_column]
            self.dt_columns = []
            for i, param in enumerate(params):
                row["%s.%s" % (target_column, i)] = param
                self.dt_columns.append("%s.%s" % (target_column, i))
            new_df.append(row)
        return pd.DataFrame(new_df)

    def sample(self, df):
        new_df = []
        for row in df.to_dict('records'):
            params = []
            for c in self.dt_columns:
                params.append(row[c])
                del row[c]
            row[self.target_column] = self.from_params(params)
            new_df.append(row)
        return pd.DataFrame(new_df, columns=self.columns)

## Demo
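Recall our example dataset from earlier, where there are more customers on Sundays than Wednesdays. Let's try to model this dataset using Copulas. (The outputs below label these days as Saturday and Tuesday, presumably because `datetime.fromtimestamp` converts to the local time zone of the machine that ran the notebook, which shifts midnight UTC back to the previous day.)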

In [3]:
df

| | datetime | nb_customers |
|---|---|---|
| 0 | 1577836800 | 1 |
| 1 | 1578182400 | 10 |
| 2 | 1578441600 | 2 |
| 3 | 1578787200 | 8 |


First, we'll write a simple helper function which summarizes the dataset. We'll use this function to generate a human-readable printout which highlights some of the properties of the dataset.

In [4]:
def summarize(df):
    daysOftheWeek = (
        "Monday",
        "Tuesday",
        "Wednesday",
        "Thursday",
        "Friday",
        "Saturday",
        "Sunday"
    )
    df = df.copy()
    df["day_of_week"] = [daysOftheWeek[datetime.fromtimestamp(x).weekday()] for x in df["datetime"]]
    df = df.drop("datetime", axis=1).groupby("day_of_week").agg("mean")
    return df.rename({"nb_customers": "avg_nb_customers"}, axis=1)

summarize(df)

| day_of_week | avg_nb_customers |
|---|---|
| Saturday | 9.0 |
| Tuesday | 1.5 |


As expected, we see that (1) our dataset only contains Tuesdays and Saturdays and (2) on average, there are many more customers on Saturday than Tuesday.

### Direct Modeling
Let's see what happens if we directly pass this data to Copulas. Note that we could also use something like RDT to transform the datetime values; however, the current datetime transformer in RDT simply transforms it into Unix time so it's essentially a no-op in this case (since there are no missing values).

In [5]:
from copulas.multivariate import GaussianMultivariate
copula = GaussianMultivariate()
copula.fit(df)
synthetic_df = copula.sample(10)
summarize(synthetic_df)

| day_of_week | avg_nb_customers |
|---|---|
| Monday | 9.233596 |
| Saturday | 8.012633 |
| Sunday | 6.315706 |
| Wednesday | 4.472315 |


As you can see here, the synthetic data (1) contains timestamps from other days which do not belong and (2) does not capture the property that there are more customers on Saturdays than Tuesdays.

### Using EchoTime
Now let's see what happens if we first transform the column using EchoTime.

In [6]:
from copulas.multivariate import GaussianMultivariate

et = EchoTime()
copula = GaussianMultivariate()
copula.fit(et.transform(df, "datetime"))
synthetic_df = et.sample(copula.sample(10))
summarize(synthetic_df)


| day_of_week | avg_nb_customers |
|---|---|
| Saturday | 9.086956 |
| Tuesday | 1.080311 |


Note that we are now able to capture both properties! The synthetic dataset only contains Tuesdays and Saturdays and, on average, there are many more customers on Saturday than Tuesday.

## Takeaways
The EchoTime methodology - which consists of defining a distribution over timestamps using a scoring function and applying rejection sampling - is a powerful, flexible, and effective approach to modeling timestamp data.