## Getting Started

### Load the dataset
Let's start by reading the dataset, which is stored in a CSV file with four columns: `id`, `season`, `day_of_week`, and `nb_customers`.

In [1]:
import pandas as pd
df = pd.read_csv("01_Getting_Started.csv")
df.head(10)

Unnamed: 0,id,season,day_of_week,nb_customers
0,0,winter,0,-1.370776
1,0,winter,1,-1.335633
2,0,winter,2,-1.212632
3,0,winter,3,-1.072061
4,0,winter,4,-0.738203
5,0,winter,5,0.105228
6,0,winter,6,-0.650345
7,1,summer,0,0.351229
8,1,summer,1,0.439086
9,1,summer,2,0.4918


### Explore the dataset
Grouping the table by `id` yields many time series, each of length 7 (one row per day of the week). For example, let's look at the first two time series in the dataset.

In [2]:
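from IPython.display import display

# Show the first two time series; counting groups avoids relying on the id values.
for i, (_, grp) in enumerate(df.groupby("id")):
    if i == 2:
        break
    display(grp)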

Unnamed: 0,id,season,day_of_week,nb_customers
0,0,winter,0,-1.370776
1,0,winter,1,-1.335633
2,0,winter,2,-1.212632
3,0,winter,3,-1.072061
4,0,winter,4,-0.738203
5,0,winter,5,0.105228
6,0,winter,6,-0.650345


Unnamed: 0,id,season,day_of_week,nb_customers
7,1,summer,0,0.351229
8,1,summer,1,0.439086
9,1,summer,2,0.4918
10,1,summer,3,0.667515
11,1,summer,4,0.931087
12,1,summer,5,1.897518
13,1,summer,6,1.054087


First, we highlight a few properties of the dataset.
1. The `season` column is constant within each time series; this is because in this particular dataset, each 7-day window comes from either the summer sales event or the winter sales event at a retail outlet.
2. The `day_of_week` column is categorical and encodes Monday through Sunday as 0 through 6.
3. The `nb_customers` column is continuous and contains the (normalized) number of customers who visited each day.
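
The first property and the series length can be checked directly rather than read off the first two groups. The following is a minimal sketch, assuming the column names shown above; the helper name `summarize_series` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize_series(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per `id` with the series length and the number of
    distinct `season` values (hypothetical helper, for illustration)."""
    grouped = df.groupby("id")
    return pd.DataFrame({
        "length": grouped.size(),                  # rows per time series
        "n_seasons": grouped["season"].nunique(),  # 1 means season is constant
    })
```

Calling `summarize_series(df)` on the loaded table should show `length` equal to 7 and `n_seasons` equal to 1 for every `id` if the properties above hold.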

Then, we make some additional observations about the dataset:
1. The day of the week is always ascending from 0 to 6.
2. The number of customers always peaks on Saturday (`day_of_week` 5).
3. The number of customers is higher in summer than in winter.
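
These three observations can also be expressed as programmatic checks. The sketch below assumes the column names shown above, that `day_of_week` is read as an integer, and that Saturday is encoded as 5; the helper name `check_observations` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def check_observations(df: pd.DataFrame) -> dict:
    """Evaluate the three observations on a table with the tutorial's columns
    (hypothetical helper, for illustration)."""
    groups = df.groupby("id")

    # 1. Within every series, day_of_week is exactly 0, 1, ..., 6 in order.
    ascending = groups["day_of_week"].apply(
        lambda s: s.tolist() == list(range(7))
    ).all()

    # 2. The row with the largest nb_customers in each series falls on day 5.
    peak_days = df.loc[groups["nb_customers"].idxmax(), "day_of_week"]
    peaks_on_saturday = (peak_days == 5).all()

    # 3. The mean nb_customers is higher for summer than for winter.
    season_means = df.groupby("season")["nb_customers"].mean()
    summer_higher = season_means["summer"] > season_means["winter"]

    return {
        "ascending_days": bool(ascending),
        "peaks_on_saturday": bool(peaks_on_saturday),
        "summer_higher": bool(summer_higher),
    }
```

Note that the third check compares season-level means, which is a weaker statement than every summer day exceeding every winter day.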

### Using the PAR Model
Now let's apply the probabilistic autoregressive (PAR) model. From our exploration, we know that `season` is fixed within each time series, so we declare it a "context" column on which the other columns are conditioned.

In [3]:
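from deepecho import PARModel

# `id` identifies each time series; `season` is constant per series, so it is context.
model = PARModel(nb_epochs=1024)
model.fit(df, entity_columns=["id"], context_columns=["season"], dtypes={
    "season": "categorical",
    "day_of_week": "categorical",
    "nb_customers": "continuous"
})
# Draw 5 synthetic time series from the fitted model.
model.sample(5)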

Epoch 1023 | Loss -0.04456918314099312: 100%|██████████| 1024/1024 [01:54<00:00,  8.92it/s] 


Unnamed: 0,id,season,day_of_week,nb_customers
0,0,winter,0,-1.303901
1,0,winter,1,-1.268224
2,0,winter,2,-1.175574
3,0,winter,3,-0.951369
4,0,winter,4,-0.67099
5,0,winter,5,0.124725
6,0,winter,6,-0.824864
7,1,summer,0,0.46
8,1,summer,1,0.421911
9,1,summer,2,0.598174


Looking at this synthetic dataset, we see that the three observations we made before still hold true.

1. The day of the week is always ascending from 0 to 6.
2. The number of customers always peaks on Saturday (`day_of_week` 5).
3. The number of customers is higher in summer than in winter.
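
Beyond inspecting the first rows, one rough way to compare the synthetic data with the original is to look at the mean number of customers per season and day of the week. This is a minimal sketch, assuming both tables share the column names shown above; the helper names `mean_profile` and `profile_gap`, and the variable `synthetic`, are hypothetical and not part of the original notebook.

```python
import pandas as pd


def mean_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Mean nb_customers with one row per season and one column per day."""
    return df.pivot_table(
        index="season", columns="day_of_week",
        values="nb_customers", aggfunc="mean",
    )


def profile_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Absolute difference between the synthetic and real mean profiles.

    Cells are NaN when a season/day combination is missing from either table.
    """
    return (mean_profile(synthetic) - mean_profile(real)).abs()
```

For example, after `synthetic = model.sample(5)`, calling `profile_gap(df, synthetic)` gives a season-by-day table in which small values suggest the weekly pattern was reproduced. With only a handful of sampled series, a season may be absent from the sample, in which case its row contains NaN.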