## Getting Started

### Load the dataset
Let's start by reading the dataset. It's stored in a CSV file with 5 columns.

In [1]:
import pandas as pd

data = pd.read_csv("01_Getting_Started.csv")
data.head(10)

Unnamed: 0,id,season,day_of_week,total_sales,nb_customers
0,0,winter,0,736.192546,43
1,0,winter,1,777.309789,45
2,0,winter,2,921.220142,54
3,0,winter,3,1085.689116,63
4,0,winter,4,1476.30293,86
5,0,winter,5,2463.116775,144
6,0,winter,6,1579.096039,92
7,1,summer,0,2750.93748,161
8,1,summer,1,2853.730589,167
9,1,summer,2,2915.406454,171


### Explore the dataset

The first thing we notice is that the `id` column acts as the `entity_id` of our dataset:
the table contains several time series, each belonging to a different entity that is
identified by its value in the `id` column.

Let's group the table by the `id` value and display the first two groups:

In [2]:
entity_columns = ['id']

for _, group in list(data.groupby(entity_columns))[0:2]:
    display(group)

Unnamed: 0,id,season,day_of_week,total_sales,nb_customers
0,0,winter,0,736.192546,43
1,0,winter,1,777.309789,45
2,0,winter,2,921.220142,54
3,0,winter,3,1085.689116,63
4,0,winter,4,1476.30293,86
5,0,winter,5,2463.116775,144
6,0,winter,6,1579.096039,92


Unnamed: 0,id,season,day_of_week,total_sales,nb_customers
7,1,summer,0,2750.93748,161
8,1,summer,1,2853.730589,167
9,1,summer,2,2915.406454,171
10,1,summer,3,3120.992672,183
11,1,summer,4,3429.371998,201
12,1,summer,5,4560.096196,268
13,1,summer,6,3573.282351,210


Here we observe a few properties of the dataset:

1. Each `id` is associated with a 7-row time series.
2. The `season` column is constant within each time series; this is because in this particular dataset, each 7-day window comes from either the summer sales event or the winter sales event at a retail outlet.
3. The `day_of_week` column is categorical and encodes Monday through Sunday as 0 through 6.
4. The `total_sales` column is continuous and is correlated with the number of customers.
5. The `nb_customers` column is count-valued (i.e. non-negative integers).
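
These properties can also be checked programmatically instead of by eye. The snippet below is a minimal sketch, not part of the original notebook: it rebuilds the two entities displayed above by hand so that it runs without the CSV file, and `check_properties` is a hypothetical helper name. With the full dataset loaded, the same function could be called on `data` directly.

```python
import pandas as pd

# The first two entities, copied from the output above.
data = pd.DataFrame({
    "id": [0] * 7 + [1] * 7,
    "season": ["winter"] * 7 + ["summer"] * 7,
    "day_of_week": list(range(7)) * 2,
    "total_sales": [
        736.192546, 777.309789, 921.220142, 1085.689116,
        1476.30293, 2463.116775, 1579.096039,
        2750.93748, 2853.730589, 2915.406454, 3120.992672,
        3429.371998, 4560.096196, 3573.282351,
    ],
    "nb_customers": [43, 45, 54, 63, 86, 144, 92,
                     161, 167, 171, 183, 201, 268, 210],
})


def check_properties(df, entity_column="id"):
    """Hypothetical helper: one entry per property listed above."""
    groups = df.groupby(entity_column)
    return {
        # 1. Every entity has exactly 7 rows.
        "seven_rows_per_entity": bool((groups.size() == 7).all()),
        # 2. `season` takes a single value within each entity.
        "season_constant": bool((groups["season"].nunique() == 1).all()),
        # 3. `day_of_week` uses exactly the codes 0..6.
        "days_are_0_to_6": set(df["day_of_week"]) == set(range(7)),
        # 4. Pearson correlation between sales and customers.
        "sales_customers_corr": float(
            df["total_sales"].corr(df["nb_customers"])
        ),
        # 5. `nb_customers` holds non-negative integers.
        "customers_are_counts": bool(
            pd.api.types.is_integer_dtype(df["nb_customers"])
            and (df["nb_customers"] >= 0).all()
        ),
    }


properties = check_properties(data)
print(properties)
```

A correlation close to 1 supports the fourth property; the other entries are simple pass/fail checks.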

We can also make three further observations about the dataset:

1. Within each time series, the day of the week always ascends from 0 to 6.
2. The number of customers always peaks on Saturday (`day_of_week` 5).
3. The average number of customers is higher in summer than winter.
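
Because these observations are what we will later look for in the synthetic data, it can help to express them as code. The following is a sketch under stated assumptions: `check_observations` is a hypothetical helper, Saturday is taken to be `day_of_week` 5 (that is, 0 is Monday), and the sample frame only contains the two entities displayed above rather than the full CSV.

```python
import pandas as pd

# The first two entities from the output above (sales column omitted,
# since none of the three observations depends on it).
data = pd.DataFrame({
    "id": [0] * 7 + [1] * 7,
    "season": ["winter"] * 7 + ["summer"] * 7,
    "day_of_week": list(range(7)) * 2,
    "nb_customers": [43, 45, 54, 63, 86, 144, 92,
                     161, 167, 171, 183, 201, 268, 210],
})


def check_observations(df, entity_column="id", saturday=5):
    """Hypothetical helper: test the three observations on a DataFrame."""
    by_entity = df.groupby(entity_column)

    # 1. Within each entity, day_of_week is exactly 0, 1, ..., 6 in order.
    ascending = by_entity["day_of_week"].apply(
        lambda days: list(days) == list(range(7))
    ).all()

    # 2. The row with the most customers in each entity falls on Saturday.
    peak_rows = df.loc[by_entity["nb_customers"].idxmax()]
    saturday_peak = (peak_rows["day_of_week"] == saturday).all()

    # 3. Mean customers per season: summer should exceed winter.
    season_means = df.groupby("season")["nb_customers"].mean()
    summer_higher = season_means["summer"] > season_means["winter"]

    return {
        "days_ascending": bool(ascending),
        "saturday_peak": bool(saturday_peak),
        "summer_higher": bool(summer_higher),
    }


observations = check_observations(data)
print(observations)
```

The same function could be applied to a sampled dataset to compare real and synthetic behaviour, provided it contains complete 7-row series and both seasons.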

### Context

As we observed, the `season` column is constant within each time series. We call such a
column a _context_ variable: once the model has learned the conditional distribution of
the time series, we can use the context to condition what the sampled series look like.

In [3]:
context_columns = ['season']
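
In a wider table, context candidates can be found by looking for columns that take a single value within every entity. The snippet below is an illustrative sketch rather than part of DeepEcho: `find_constant_columns` is a hypothetical helper, and the small frame reproduces only the two entities shown earlier.

```python
import pandas as pd

# The first two entities from the output shown earlier.
data = pd.DataFrame({
    "id": [0] * 7 + [1] * 7,
    "season": ["winter"] * 7 + ["summer"] * 7,
    "day_of_week": list(range(7)) * 2,
    "nb_customers": [43, 45, 54, 63, 86, 144, 92,
                     161, 167, 171, 183, 201, 268, 210],
})


def find_constant_columns(df, entity_columns):
    """Hypothetical helper: columns with one distinct value per entity."""
    candidates = [col for col in df.columns if col not in entity_columns]
    # Number of distinct values of each candidate column, per entity.
    distinct = df.groupby(entity_columns)[candidates].nunique()
    return [col for col in candidates if (distinct[col] == 1).all()]


suggested_context = find_constant_columns(data, ["id"])
print(suggested_context)
```

A column being constant per entity is only a hint; whether it makes sense as context still depends on what the column means.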

### Data Types

In addition to the _entity_ and _context_ columns, DeepEcho needs to know the data type
of each column that it has to model.

Let's create a dictionary with this information:

In [4]:
data_types = {
    'season': 'categorical',
    'day_of_week': 'categorical',
    'total_sales': 'continuous',
    'nb_customers': 'count',
}
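
Before fitting, it can be useful to confirm that the dictionary and the table agree. The check below is a plain-Python sketch, not a DeepEcho feature: `check_data_types` is a hypothetical helper, and it assumes that every column other than the entity columns needs an entry, as is the case in this tutorial.

```python
def check_data_types(columns, entity_columns, data_types):
    """Hypothetical helper: compare table columns with the type dictionary.

    Returns (missing, extra): columns without a declared type, and
    declared names that do not appear in the table.
    """
    to_model = [col for col in columns if col not in entity_columns]
    missing = [col for col in to_model if col not in data_types]
    extra = [name for name in data_types if name not in columns]
    return missing, extra


# Column names and types as used in this tutorial.
columns = ["id", "season", "day_of_week", "total_sales", "nb_customers"]
data_types = {
    "season": "categorical",
    "day_of_week": "categorical",
    "total_sales": "continuous",
    "nb_customers": "count",
}

missing, extra = check_data_types(columns, ["id"], data_types)
print("missing:", missing, "extra:", extra)
```

With the actual DataFrame, `list(data.columns)` could be passed as `columns`; both lists should come back empty.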

### Using the PAR Model

Now let's apply the probabilistic autoregressive (PAR) model to learn the distribution of the time series.

In [5]:
from deepecho import PARModel

model = PARModel(nb_epochs=1024)
model.fit(
    df=data,
    entity_columns=entity_columns,
    context_columns=context_columns,
    dtypes=data_types,
)
model.sample(5)

Epoch 1023 | Loss 0.025627393275499344: 100%|██████████| 1024/1024 [02:58<00:00,  5.72it/s]


Unnamed: 0,id,season,day_of_week,total_sales,nb_customers
0,0,summer,0,2762.01849,156
1,0,summer,1,2895.472612,164
2,0,summer,2,2900.876148,180
3,0,summer,3,3032.070394,205
4,0,summer,4,3355.616244,225
5,0,summer,5,4535.462193,289
6,0,summer,6,3465.191048,227
7,1,winter,0,819.564511,43
8,1,winter,1,861.86668,44
9,1,winter,2,937.729433,53


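Looking at this synthetic dataset, we see that the three additional observations we made earlier still hold: the day of the week ascends from 0 to 6, the number of customers peaks on Saturday, and there are more customers in summer than in winter.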