# Getting Started

This short tutorial walks you through what DeepEcho is, what functionality it offers,
and how to use it.

## DeepEcho Overview

DeepEcho is a Synthetic Data Generation library that implements multiple models
to work on **multi-type, multivariate timeseries data**.

In the following steps we use the `PARModel` class, a Probabilistic AutoRegressive
model, as an example to demonstrate how **DeepEcho** learns from time series data and
then generates new data with the same format and statistical properties.

## Load the demo dataset

We will start by loading the demo dataset that comes with the library.

In [1]:
from deepecho.demo import load_demo

data = load_demo()
data.head(10)

Unnamed: 0,date,store_id,region,day_of_week,total_sales,nb_customers
0,2020-06-01,68608,New York,0,736.19,43
1,2020-06-02,68608,New York,1,777.31,45
2,2020-06-03,68608,New York,2,921.22,54
3,2020-06-04,68608,New York,3,1085.69,63
4,2020-06-05,68608,New York,4,1476.3,86
5,2020-06-06,68608,New York,5,2463.12,144
6,2020-06-07,68608,New York,6,1579.1,92
7,2020-06-01,47226,California,0,2750.94,161
8,2020-06-02,47226,California,1,2853.73,167
9,2020-06-03,47226,California,2,2915.41,171


The output of this call is a `pandas.DataFrame` which simulates a table with the sales
and number of customers that multiple sunglasses stores had across multiple days.

The first thing we observe is that each `store_id` value repeats multiple times.
This is because the `store_id` column acts as the `entity_id` of our dataset: the table
contains time series that belong to multiple stores, each one identified by the value in
the `store_id` column.

Let's group the table by the `store_id` value and see what the results look like:

In [2]:
entity_columns = ['store_id']

for _, group in list(data.groupby(entity_columns))[0:2]:
    display(group)

Unnamed: 0,date,store_id,region,day_of_week,total_sales,nb_customers
91,2020-06-01,10149,California,0,2750.94,161
92,2020-06-02,10149,California,1,2833.17,166
93,2020-06-03,10149,California,2,3120.99,183
94,2020-06-04,10149,California,3,3141.55,184
95,2020-06-05,10149,California,4,3449.93,202
96,2020-06-06,10149,California,5,4560.1,268
97,2020-06-07,10149,California,6,3470.49,204


Unnamed: 0,date,store_id,region,day_of_week,total_sales,nb_customers
238,2020-06-01,10279,New York,0,756.75,44
239,2020-06-02,10279,New York,1,921.22,54
240,2020-06-03,10279,New York,2,921.22,54
241,2020-06-04,10279,New York,3,1167.92,68
242,2020-06-05,10279,New York,4,1558.54,91
243,2020-06-06,10279,New York,5,2401.44,141
244,2020-06-07,10279,New York,6,1517.42,89


Here we observe a few properties of the dataset:

1. Each `store_id` is associated with a time series that is 7 rows long.
2. The `date` column works as the time series index and goes from `2020-06-01` to
   `2020-06-07` in each time series.
3. The `region` column is constant within each time series; this is because the `region` is a
   property of each store, and therefore the value does not change over time.
4. The `day_of_week` column is a categorical column that indicates the weekday, Monday to
   Sunday (encoded as `0` to `6`), associated with the corresponding date.
5. The `total_sales` column is continuous and is correlated with the number of customers.
6. The `nb_customers` column is count-valued (i.e. non-negative integers).
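
Some of these properties can be checked programmatically. The following is a minimal sketch, not part of DeepEcho: it rebuilds the two stores displayed above by hand (the `toy` name is ours) and uses plain `pandas` to check the sequence length, the date range, and the relationship between sales and customers.

```python
import pandas as pd

# Two of the stores shown above, copied by hand so the sketch runs on its own.
dates = pd.date_range('2020-06-01', periods=7, freq='D')
toy = pd.DataFrame({
    'date': list(dates) * 2,
    'store_id': [10149] * 7 + [10279] * 7,
    'region': ['California'] * 7 + ['New York'] * 7,
    'total_sales': [2750.94, 2833.17, 3120.99, 3141.55, 3449.93, 4560.10, 3470.49,
                    756.75, 921.22, 921.22, 1167.92, 1558.54, 2401.44, 1517.42],
    'nb_customers': [161, 166, 183, 184, 202, 268, 204,
                     44, 54, 54, 68, 91, 141, 89],
})

groups = toy.groupby('store_id')

# Property 1: number of rows in each store's time series.
sequence_lengths = groups.size()

# Property 2: first and last date of each time series.
date_ranges = groups['date'].agg(['min', 'max'])

# Property 5: correlation between sales and customer counts within each store.
correlations = {
    store_id: group['total_sales'].corr(group['nb_customers'])
    for store_id, group in groups
}

print(sequence_lengths)
print(date_ranges)
print(correlations)
```

On the full demo dataset the same calls can be run on `data` directly, provided the `date` column is comparable (for example after `pd.to_datetime`).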

Based on the previous points, we can think of `region` as contextual information about
each store, so let's also group the data by this column and observe how `total_sales`
and `nb_customers` evolve over the different weekdays.

In [3]:
data.groupby(['region', 'day_of_week'])[['total_sales', 'nb_customers']].mean()

region,day_of_week,total_sales,nb_customers
California,0,2835.6402,166.36
California,1,2918.2846,171.16
California,2,3003.8084,176.2
California,3,3202.4036,187.96
California,4,3516.542,206.26
California,5,4568.7304,268.32
California,6,3535.4548,207.4
New York,0,772.3762,44.94
New York,1,834.0516,48.48
New York,2,974.6712,56.86


From this table we can make some additional observations about the dataset:

1. The number of customers and the sales are significantly higher in `California` than in
   `New York`, as one would expect considering the weather differences between the two states.
2. The number of customers and the total sales always peak on Saturday, probably due to more
   people going out after a hard week of work.
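
Both observations can be read off the aggregated table with a couple of `pandas` calls. The sketch below is only illustrative: it copies the rows shown above into a hand-built frame (the `means` name is ours), finds the weekday with the highest average sales in California, and compares the two regions on the weekdays that are visible for both.

```python
import pandas as pd

# Averages copied from the output above; New York is only shown for days 0 to 2.
means = pd.DataFrame({
    'region': ['California'] * 7 + ['New York'] * 3,
    'day_of_week': [0, 1, 2, 3, 4, 5, 6, 0, 1, 2],
    'total_sales': [2835.6402, 2918.2846, 3003.8084, 3202.4036, 3516.542,
                    4568.7304, 3535.4548, 772.3762, 834.0516, 974.6712],
    'nb_customers': [166.36, 171.16, 176.2, 187.96, 206.26,
                     268.32, 207.4, 44.94, 48.48, 56.86],
}).set_index(['region', 'day_of_week'])

california = means.loc['California']
new_york = means.loc['New York']

# Weekday with the highest average sales in California (5 is Saturday).
peak_day = california['total_sales'].idxmax()

# California / New York sales ratio, restricted to weekdays present in both.
sales_ratio = (california['total_sales'] / new_york['total_sales']).dropna()

print(peak_day)
print(sales_ratio)
```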

### Context

As we observed, the `region` column is constant within each time series. This is what we call
a _context_ variable, which is what we will use later on to condition what the sampled
time series look like once we learn their conditional distribution.

In [4]:
context_columns = ['region']
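
If it is not obvious which columns are constant within each entity, they can be detected with `pandas`. This is a minimal sketch and not a DeepEcho feature: `find_context_columns` is a hypothetical helper of ours, run here on the first three rows of two stores from the tables above.

```python
import pandas as pd


def find_context_columns(df, entity_columns):
    """Return the non-entity columns whose value never changes within an entity."""
    # Number of distinct values per column inside each entity (NaN is ignored).
    per_entity_unique = df.groupby(entity_columns).nunique()
    return [
        column for column in per_entity_unique.columns
        if column not in entity_columns
        and (per_entity_unique[column] <= 1).all()
    ]


# First three rows of two stores, copied by hand from the tables above.
toy = pd.DataFrame({
    'store_id': [10149, 10149, 10149, 10279, 10279, 10279],
    'region': ['California'] * 3 + ['New York'] * 3,
    'day_of_week': [0, 1, 2, 0, 1, 2],
    'total_sales': [2750.94, 2833.17, 3120.99, 756.75, 921.22, 921.22],
})

candidates = find_context_columns(toy, ['store_id'])
print(candidates)
```

A column that happens to be constant in a small sample is not necessarily a property of the entity, so the result is best treated as a list of candidates to review.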

### Data Types

Apart from the _entity_ and _context_ columns, DeepEcho needs to be informed about
the data types of each column that it needs to model.

Let's create a dictionary with this information:

In [5]:
data_types = {
    'region': 'categorical',
    'day_of_week': 'categorical',
    'total_sales': 'continuous',
    'nb_customers': 'count',
}
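
Before fitting, it can be useful to confirm that the dictionary describes every column we expect the model to learn. The sketch below is plain Python and makes two assumptions that are ours, not DeepEcho's documented behavior: the entity column and the sequence index are described separately and left out of `data_types` (as in the dictionary above), and the only type names considered are the three used in this tutorial.

```python
# Column names as they appear in the demo table above.
columns = ['date', 'store_id', 'region', 'day_of_week', 'total_sales', 'nb_customers']
entity_columns = ['store_id']
sequence_index = 'date'
data_types = {
    'region': 'categorical',
    'day_of_week': 'categorical',
    'total_sales': 'continuous',
    'nb_customers': 'count',
}

# Assumption: only the type names used in this tutorial; DeepEcho may accept others.
TYPES_SEEN_HERE = {'categorical', 'continuous', 'count'}

# Assumption: entity columns and the sequence index are not listed in data_types.
expected = set(columns) - set(entity_columns) - {sequence_index}

missing = expected - set(data_types)        # columns without a declared type
unexpected = set(data_types) - expected     # declared columns that are not in the table
unrecognized = {
    column: dtype for column, dtype in data_types.items()
    if dtype not in TYPES_SEEN_HERE
}

print('missing:', sorted(missing))
print('unexpected:', sorted(unexpected))
print('unrecognized:', unrecognized)
```

For the demo dataset all three results are empty.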

### Sequence Index

Finally, we observed that the time index of our time series is the `date` column,
which tells us the order in which the rows of each time series occur.

In [6]:
sequence_index = 'date'
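
To make the role of the sequence index concrete, the sketch below shuffles a few rows from the tables above and restores their order with `pandas`. It only illustrates what "ordered by the sequence index within each entity" means; whether DeepEcho sorts the rows internally is not covered here, and the `shuffled` and `ordered` names are ours.

```python
import pandas as pd

# A few rows from the tables above, deliberately out of order.
shuffled = pd.DataFrame({
    'date': ['2020-06-03', '2020-06-01', '2020-06-02',
             '2020-06-02', '2020-06-01', '2020-06-03'],
    'store_id': [10149, 10279, 10149, 10279, 10149, 10279],
    'total_sales': [3120.99, 756.75, 2833.17, 921.22, 2750.94, 921.22],
})

# Parse the dates so they sort chronologically rather than as plain strings.
shuffled['date'] = pd.to_datetime(shuffled['date'])

# Order rows by entity first, then by the sequence index within each entity.
ordered = shuffled.sort_values(['store_id', 'date']).reset_index(drop=True)

print(ordered)
```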

### Using the PAR Model

Now let's apply the probabilistic autoregressive model to learn the time series distributions.

In [7]:
from deepecho import PARModel

model = PARModel(epochs=1024, cuda=False)
model.fit(
    data=data,
    entity_columns=entity_columns,
    context_columns=context_columns,
    data_types=data_types,
    sequence_index=sequence_index,
)
model.sample(5)

PARModel(epochs=1024, sample_size=1, cuda='cpu', verbose=True) instance created


Epoch 1024 | Loss 0.02675650641322136: 100%|██████████| 1024/1024 [02:05<00:00,  8.19it/s] 
100%|██████████| 5/5 [00:00<00:00, 62.06it/s]


Unnamed: 0,store_id,region,day_of_week,total_sales,nb_customers
0,0,California,0,2914.035253,168
1,0,California,1,2875.666793,168
2,0,California,2,3044.118317,170
3,0,California,3,3244.566601,230
4,0,California,4,3391.854591,198
5,0,California,5,4439.453731,252
6,0,California,6,3582.236903,215
7,1,California,0,2789.442373,175
8,1,California,0,2887.875587,207
9,1,California,2,2981.336418,197


Looking at this synthetic dataset, we see that the observations we made before still hold true.
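
Claims like this one can also be checked in code rather than by eye. The following sketch copies the ten sampled rows shown above into a hand-built frame (the `sampled` name is ours) and runs a few sanity checks with `pandas`. It is an illustration of the kind of checks one might run, not an evaluation method provided by DeepEcho.

```python
import pandas as pd

# The ten sampled rows shown above, copied by hand.
sampled = pd.DataFrame({
    'store_id': [0] * 7 + [1] * 3,
    'region': ['California'] * 10,
    'day_of_week': [0, 1, 2, 3, 4, 5, 6, 0, 0, 2],
    'total_sales': [2914.035253, 2875.666793, 3044.118317, 3244.566601,
                    3391.854591, 4439.453731, 3582.236903, 2789.442373,
                    2887.875587, 2981.336418],
    'nb_customers': [168, 168, 170, 230, 198, 252, 215, 175, 207, 197],
})

# The context column should stay constant within each sampled entity.
region_is_constant = bool((sampled.groupby('store_id')['region'].nunique() == 1).all())

# Count-valued columns should contain non-negative values.
counts_are_valid = bool((sampled['nb_customers'] >= 0).all())

# Weekday with the highest sales for the first sampled store (5 is Saturday).
store_0 = sampled[sampled['store_id'] == 0]
peak_day = store_0.loc[store_0['total_sales'].idxmax(), 'day_of_week']

# Whether any weekday appears more than once within a sampled entity.
weekday_repeats = bool(sampled.duplicated(['store_id', 'day_of_week']).any())

print(region_is_constant, counts_are_valid, peak_day, weekday_repeats)
```

In the rows shown, the second sampled store repeats weekday `0`, a reminder that synthetic sequences follow the learned distribution and are not guaranteed to respect every structural rule of the original data.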