# Getting Started

This short tutorial walks you through what DeepEcho is, what functionality it
offers, and how to use it.

## DeepEcho Overview

DeepEcho is a Synthetic Data Generation library that implements multiple models
to work on **multi-type, multivariate timeseries data**.

During the next steps we will use the `PARModel` class, which is a Probabilistic
AutoRegressive model, as an example to demonstrate how to use **DeepEcho** to learn
from timeseries data and later on generate new data that has the same format and
statistical properties.

## Load the demo dataset

We will start by loading the demo dataset that comes with the library.

In [1]:
from deepecho.demo import load_demo

data = load_demo()
data.head(10)

Unnamed: 0,id,season,day_of_week,total_sales,nb_customers
0,0,winter,0,736.192546,43
1,0,winter,1,777.309789,45
2,0,winter,2,921.220142,54
3,0,winter,3,1085.689116,63
4,0,winter,4,1476.30293,86
5,0,winter,5,2463.116775,144
6,0,winter,6,1579.096039,92
7,1,summer,0,2750.93748,161
8,1,summer,1,2853.730589,167
9,1,summer,2,2915.406454,171


The output of this call is a `pandas.DataFrame` which simulates a table with the sales
and number of customers that a business had across multiple days.

The first thing we observe is that each `id` value repeats multiple times.
This is because the `id` column acts as the `entity_id` of our dataset: the table
contains time series that belong to multiple entities, each one identified by the value in
the `id` column. We could think, for example, that each entity corresponds to a different week
of the year.

Let's group the table by the `id` value and see what the results look like:

In [2]:
entity_columns = ['id']

for _, group in list(data.groupby(entity_columns))[0:2]:
    display(group)

Unnamed: 0,id,season,day_of_week,total_sales,nb_customers
0,0,winter,0,736.192546,43
1,0,winter,1,777.309789,45
2,0,winter,2,921.220142,54
3,0,winter,3,1085.689116,63
4,0,winter,4,1476.30293,86
5,0,winter,5,2463.116775,144
6,0,winter,6,1579.096039,92


Unnamed: 0,id,season,day_of_week,total_sales,nb_customers
7,1,summer,0,2750.93748,161
8,1,summer,1,2853.730589,167
9,1,summer,2,2915.406454,171
10,1,summer,3,3120.992672,183
11,1,summer,4,3429.371998,201
12,1,summer,5,4560.096196,268
13,1,summer,6,3573.282351,210


Here we observe a few properties of the dataset:

1. Each `id` is associated with a time series that is 7 rows long.
2. The `season` column is constant within each time series; this is because in this particular dataset, each 7-day window comes from either the summer sales event or the winter sales event at a retail outlet.
3. The `day_of_week` column is categorical and represents Monday through Sunday.
4. The `total_sales` column is continuous and is correlated with the number of customers.
5. The `nb_customers` column is count-valued (i.e. non-negative integers).
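
These properties can be checked directly with pandas. The following is a minimal sketch that rebuilds the two entities displayed above by hand instead of calling `load_demo()`, so that it runs on its own. The variable names are illustrative and not part of DeepEcho.

```python
import pandas as pd

# The two entities displayed above, copied by hand so the sketch is self-contained.
sample = pd.DataFrame({
    'id': [0] * 7 + [1] * 7,
    'season': ['winter'] * 7 + ['summer'] * 7,
    'day_of_week': list(range(7)) * 2,
    'total_sales': [
        736.192546, 777.309789, 921.220142, 1085.689116,
        1476.30293, 2463.116775, 1579.096039,
        2750.93748, 2853.730589, 2915.406454, 3120.992672,
        3429.371998, 4560.096196, 3573.282351,
    ],
    'nb_customers': [43, 45, 54, 63, 86, 144, 92, 161, 167, 171, 183, 201, 268, 210],
})

# Property 1: number of rows in each time series.
rows_per_entity = sample.groupby('id').size()

# Property 2: number of distinct seasons inside each time series (1 means constant).
seasons_per_entity = sample.groupby('id')['season'].nunique()

# Property 4: Pearson correlation between sales and number of customers.
sales_customers_corr = sample['total_sales'].corr(sample['nb_customers'])

# Property 5: counts should be stored as integers and never be negative.
counts_are_valid = (
    pd.api.types.is_integer_dtype(sample['nb_customers'])
    and bool((sample['nb_customers'] >= 0).all())
)

print(rows_per_entity.tolist(), seasons_per_entity.tolist())
print(round(float(sales_customers_corr), 3), counts_are_valid)
```

On the full dataset, the same expressions can be applied to `data` instead of `sample`.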

Based on the previous points, we can think of `season` as acting as the context of each time series. Let's also try grouping the data by `season` and `day_of_week`:

In [3]:
data.groupby(['season', 'day_of_week'])[['total_sales', 'nb_customers']].mean()

season,day_of_week,total_sales,nb_customers
summer,0,2835.639002,166.36
summer,1,2918.284661,171.16
summer,2,3003.808528,176.2
summer,3,3202.404814,187.96
summer,4,3516.540555,206.26
summer,5,4568.730817,268.32
summer,6,3535.454487,207.4
winter,0,772.37572,44.94
winter,1,834.051585,48.48
winter,2,974.672558,56.86


Then, we make some additional observations about the dataset:

1. The day of the week is always ascending from 0 to 6.
2. The number of customers peaks on Saturday (`day_of_week` 5).
3. The average number of customers is higher in summer than winter.
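
The same kind of check works for these observations. The sketch below uses only the two entities displayed earlier (copied by hand, without `total_sales`), so it confirms the pattern on those two weeks rather than on the full demo dataset.

```python
import pandas as pd

# The two entities displayed earlier, copied by hand.
sample = pd.DataFrame({
    'id': [0] * 7 + [1] * 7,
    'season': ['winter'] * 7 + ['summer'] * 7,
    'day_of_week': list(range(7)) * 2,
    'nb_customers': [43, 45, 54, 63, 86, 144, 92, 161, 167, 171, 183, 201, 268, 210],
})

# Observation 1: within every entity, day_of_week runs 0, 1, ..., 6 in order.
days_in_order = all(
    group['day_of_week'].tolist() == list(range(7))
    for _, group in sample.groupby('id')
)

# Observation 2: day of the week on which each entity has the most customers.
peak_rows = sample.groupby('id')['nb_customers'].idxmax()
peak_days = sample.loc[peak_rows, 'day_of_week'].tolist()

# Observation 3: average number of customers per season.
mean_customers = sample.groupby('season')['nb_customers'].mean()

print(days_in_order, peak_days, mean_customers.to_dict())
```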

### Context

As we observed, the `season` column is constant within each time series. This is what we call
a _context_ variable. Once the model has learned the conditional distribution of the time
series, we can use the context to condition what the sampled time series look like.

In [4]:
context_columns = ['season']
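
If you are unsure which columns qualify as context, one way to find candidates is to look for columns that hold a single value inside every entity. The helper below, `find_context_candidates`, is a hypothetical function written for this tutorial and is not part of DeepEcho. The small table is copied by hand from the first two rows of each entity shown earlier.

```python
import pandas as pd


def find_context_candidates(data, entity_columns):
    """Return the non-entity columns that are constant within every entity."""
    # Count the distinct values of each column inside each entity.
    distinct_values = data.groupby(entity_columns).nunique()
    return [
        column
        for column in distinct_values.columns
        if column not in entity_columns and (distinct_values[column] == 1).all()
    ]


sample = pd.DataFrame({
    'id': [0, 0, 1, 1],
    'season': ['winter', 'winter', 'summer', 'summer'],
    'day_of_week': [0, 1, 0, 1],
    'total_sales': [736.192546, 777.309789, 2750.93748, 2853.730589],
    'nb_customers': [43, 45, 161, 167],
})

candidates = find_context_candidates(sample, entity_columns=['id'])
print(candidates)  # ['season']
```

A column that is constant in a small sample may still vary in the full data, so treat the result as a list of candidates to review rather than a final answer.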

### Data Types

Apart from the _entity_ and _context_ columns, DeepEcho needs to be informed about
the data type of each column that it needs to model.

Let's create a dictionary with this information:

In [5]:
data_types = {
    'season': 'categorical',
    'day_of_week': 'categorical',
    'total_sales': 'continuous',
    'nb_customers': 'count',
}
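
Before fitting, it can help to confirm that every column other than the entity columns has an entry in `data_types`. The helper below, `check_data_types`, is a hypothetical function and not part of DeepEcho. Its `KNOWN_TYPES` set lists only the type names used in this tutorial, which is an assumption: DeepEcho may accept other type names as well.

```python
# Assumption: only the type names that appear in this tutorial.
KNOWN_TYPES = {'categorical', 'continuous', 'count'}


def check_data_types(columns, entity_columns, data_types):
    """Return (columns with no declared type, entries whose type is not in KNOWN_TYPES)."""
    modeled = [column for column in columns if column not in entity_columns]
    missing = [column for column in modeled if column not in data_types]
    unknown = {
        column: kind
        for column, kind in data_types.items()
        if kind not in KNOWN_TYPES
    }
    return missing, unknown


columns = ['id', 'season', 'day_of_week', 'total_sales', 'nb_customers']
data_types = {
    'season': 'categorical',
    'day_of_week': 'categorical',
    'total_sales': 'continuous',
    'nb_customers': 'count',
}

missing, unknown = check_data_types(columns, ['id'], data_types)
print(missing, unknown)  # [] {}
```

With a DataFrame at hand, `columns` would simply be `list(data.columns)`.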

### Using the PAR Model

Now let's apply the probabilistic autoregressive model to learn the time series distributions.

In [6]:
from deepecho import PARModel

model = PARModel(epochs=1024, cuda=False)
model.fit(
    data=data,
    entity_columns=entity_columns,
    context_columns=context_columns,
    data_types=data_types,
)
model.sample(5)

2020-08-14 15:45:48,659 - INFO - deepecho.par - PARModel(epochs=1024, max_seq_len=100, sample_size=1, cuda='cpu', verbose=True) instance created




Epoch 1024 | Loss 0.028885547071695328: 100%|██████████| 1024/1024 [01:51<00:00,  9.16it/s]
100%|██████████| 5/5 [00:00<00:00, 81.85it/s]


Unnamed: 0,id,season,day_of_week,total_sales,nb_customers
0,0,winter,0.0,800.11223,48
1,0,winter,1.0,936.029513,48
2,0,winter,2.0,911.075088,54
3,0,winter,3.0,1173.235468,65
4,0,winter,4.0,1442.998136,86
5,0,winter,5.0,2388.721954,158
6,0,winter,6.0,1501.16062,89
0,1,winter,0.0,709.996819,46
1,1,winter,1.0,781.464579,53
2,1,winter,2.0,1075.851116,57


Looking at this synthetic dataset, we see that the three observations we made before still hold true.
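
A rough way to check this is to repeat the earlier tests on the sampled rows. The sketch below copies the first synthetic entity from the output above by hand; since sampling is random, your own values will differ. Because the rows displayed here are all `winter`, the comparison between seasons needs a larger sample and is not covered.

```python
import pandas as pd

# First synthetic entity from the output above, copied by hand.
synthetic = pd.DataFrame({
    'id': [0] * 7,
    'season': ['winter'] * 7,
    'day_of_week': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'total_sales': [
        800.11223, 936.029513, 911.075088, 1173.235468,
        1442.998136, 2388.721954, 1501.16062,
    ],
    'nb_customers': [48, 48, 54, 65, 86, 158, 89],
})

# The sampled day_of_week values are displayed as floats, so cast them back to integers.
days = synthetic['day_of_week'].astype(int)

# Observation 1: the day of the week runs from 0 to 6 in order.
days_ascending = days.tolist() == list(range(7))

# Observation 2: the day with the most customers.
peak_day = int(days[synthetic['nb_customers'].idxmax()])

print(days_ascending, peak_day)  # True 5
```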