## Getting Started

### Load the dataset
Let's start by reading the dataset, which is stored in a CSV file with five columns: `id`, `season`, `day_of_week`, `total_sales`, and `nb_customers`.

In [1]:
import pandas as pd
df = pd.read_csv("01_Getting_Started.csv")
df.head(10)

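| | id | season | day_of_week | total_sales | nb_customers |
|---|---|---|---|---|---|
| 0 | 0 | winter | 0 | 736.192546 | 43 |
| 1 | 0 | winter | 1 | 777.309789 | 45 |
| 2 | 0 | winter | 2 | 921.220142 | 54 |
| 3 | 0 | winter | 3 | 1085.689116 | 63 |
| 4 | 0 | winter | 4 | 1476.30293 | 86 |
| 5 | 0 | winter | 5 | 2463.116775 | 144 |
| 6 | 0 | winter | 6 | 1579.096039 | 92 |
| 7 | 1 | summer | 0 | 2750.93748 | 161 |
| 8 | 1 | summer | 1 | 2853.730589 | 167 |
| 9 | 1 | summer | 2 | 2915.406454 | 171 |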


### Explore the dataset
If we group the table by `id`, we obtain one time series per `id`, each of length 7 (one row per day of the week). For example, let's look at the first two time series in the dataset.

In [2]:
# Show the first two series (ids 0 and 1); `display` is provided by IPython/Jupyter.
for i, grp in df.groupby("id"):
    if i == 2:
        break
    display(grp)

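| | id | season | day_of_week | total_sales | nb_customers |
|---|---|---|---|---|---|
| 0 | 0 | winter | 0 | 736.192546 | 43 |
| 1 | 0 | winter | 1 | 777.309789 | 45 |
| 2 | 0 | winter | 2 | 921.220142 | 54 |
| 3 | 0 | winter | 3 | 1085.689116 | 63 |
| 4 | 0 | winter | 4 | 1476.30293 | 86 |
| 5 | 0 | winter | 5 | 2463.116775 | 144 |
| 6 | 0 | winter | 6 | 1579.096039 | 92 |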


| | id | season | day_of_week | total_sales | nb_customers |
|---|---|---|---|---|---|
| 7 | 1 | summer | 0 | 2750.93748 | 161 |
| 8 | 1 | summer | 1 | 2853.730589 | 167 |
| 9 | 1 | summer | 2 | 2915.406454 | 171 |
| 10 | 1 | summer | 3 | 3120.992672 | 183 |
| 11 | 1 | summer | 4 | 3429.371998 | 201 |
| 12 | 1 | summer | 5 | 4560.096196 | 268 |
| 13 | 1 | summer | 6 | 3573.282351 | 210 |


First, we highlight a few properties of the dataset.
1. The `season` column is constant within each time series; this is because in this particular dataset, each 7-day window comes from either the summer sales event or the winter sales event at a retail outlet.
2. The `day_of_week` column is categorical and encodes the day as 0 (Monday) through 6 (Sunday).
3. The `total_sales` column is continuous and is correlated with the number of customers.
4. The `nb_customers` column is count-valued (i.e. non-negative integers).
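
These properties can be checked directly with pandas. The following is a minimal sketch, not part of the original notebook: the helper name `check_properties` is hypothetical, and the small `example` frame simply re-types the two series displayed above so that the block runs without the CSV file.

```python
import pandas as pd

# The 14 rows displayed above (ids 0 and 1), re-typed so the sketch is self-contained.
example = pd.DataFrame({
    "id": [0] * 7 + [1] * 7,
    "season": ["winter"] * 7 + ["summer"] * 7,
    "day_of_week": list(range(7)) * 2,
    "total_sales": [736.192546, 777.309789, 921.220142, 1085.689116,
                    1476.30293, 2463.116775, 1579.096039,
                    2750.93748, 2853.730589, 2915.406454, 3120.992672,
                    3429.371998, 4560.096196, 3573.282351],
    "nb_customers": [43, 45, 54, 63, 86, 144, 92,
                     161, 167, 171, 183, 201, 268, 210],
})


def check_properties(frame):
    """Hypothetical helper: summarize the listed properties for one frame."""
    by_id = frame.groupby("id")
    return {
        # Every series should contain exactly 7 rows.
        "length_is_7": bool((by_id.size() == 7).all()),
        # `season` takes a single value inside each series.
        "season_constant": bool((by_id["season"].nunique() == 1).all()),
        # `day_of_week` only takes the values 0 through 6.
        "days_in_range": bool(frame["day_of_week"].isin(list(range(7))).all()),
        # `nb_customers` holds non-negative integers.
        "counts_valid": bool(
            pd.api.types.is_integer_dtype(frame["nb_customers"])
            and (frame["nb_customers"] >= 0).all()
        ),
        # Pearson correlation between sales and number of customers.
        "sales_customers_corr": float(
            frame["total_sales"].corr(frame["nb_customers"])
        ),
    }


properties = check_properties(example)
print(properties)
```

In the notebook itself, the same helper could be called on the full `df` instead of `example`.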

Then, we make some additional observations about the dataset:
1. The day of the week is always ascending from 0 to 6.
2. The number of customers always peaks on Saturday (`day_of_week` = 5).
3. The average number of customers is higher in summer than in winter.
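
The three observations can also be written as explicit checks. This is a minimal sketch under stated assumptions: `check_observations` is a hypothetical helper, days are assumed to be coded 0 = Monday through 6 = Sunday, and `example` re-types only the two series displayed above, so the season comparison here rests on a single series per season.

```python
import pandas as pd

# Rows for ids 0 and 1 as displayed above (only the columns needed here).
example = pd.DataFrame({
    "id": [0] * 7 + [1] * 7,
    "season": ["winter"] * 7 + ["summer"] * 7,
    "day_of_week": list(range(7)) * 2,
    "nb_customers": [43, 45, 54, 63, 86, 144, 92,
                     161, 167, 171, 183, 201, 268, 210],
})

SATURDAY = 5  # assumes 0 = Monday, ..., 6 = Sunday


def check_observations(frame):
    """Hypothetical helper: test the three observations on one frame."""
    by_id = frame.groupby("id")
    # 1. Within each series, day_of_week reads 0, 1, ..., 6 in order.
    ascending = by_id["day_of_week"].apply(
        lambda s: s.tolist() == list(range(7))
    )
    # 2. The row with the most customers in each series falls on Saturday.
    peak_rows = frame.loc[by_id["nb_customers"].idxmax()]
    # 3. Mean number of customers per season.
    season_mean = frame.groupby("season")["nb_customers"].mean()
    return {
        "days_ascending": bool(ascending.all()),
        "peak_on_saturday": bool((peak_rows["day_of_week"] == SATURDAY).all()),
        "summer_above_winter": bool(
            season_mean["summer"] > season_mean["winter"]
        ),
    }


observations = check_observations(example)
print(observations)
```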

### Using the PAR Model
Now let's apply the probabilistic autoregressive (PAR) model. From our exploration, we know that `season` is fixed within each time series, so we specify it as a "context" column on which the other columns are conditioned.

In [3]:
from deepecho import PARModel

model = PARModel(nb_epochs=1024)
model.fit(df, entity_columns=["id"], context_columns=["season"], dtypes={
    "season": "categorical",
    "day_of_week": "categorical",
    "total_sales": "continuous",
    "nb_customers": "count",
})
model.sample(5)

Epoch 1023 | Loss 0.027066027745604515: 100%|██████████| 1024/1024 [03:31<00:00,  4.84it/s]


| | id | season | day_of_week | total_sales | nb_customers |
|---|---|---|---|---|---|
| 0 | 0 | summer | 0 | 2856.411281 | 145 |
| 1 | 0 | summer | 1 | 2994.449837 | 164 |
| 2 | 0 | summer | 2 | 3094.594397 | 202 |
| 3 | 0 | summer | 3 | 3206.578351 | 200 |
| 4 | 0 | summer | 4 | 3544.135498 | 207 |
| 5 | 0 | summer | 5 | 4687.70236 | 274 |
| 6 | 0 | summer | 6 | 3569.471805 | 222 |
| 7 | 1 | winter | 0 | 706.050657 | 46 |
| 8 | 1 | winter | 1 | 845.922611 | 44 |
| 9 | 1 | winter | 2 | 1027.911791 | 53 |


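Looking at this synthetic dataset, we see that the three observations we made earlier still hold: the day of the week ascends from 0 to 6, the number of customers peaks on Saturday, and the average number of customers is higher in summer than in winter.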
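
One way to back up this visual check is to place the same summary statistic for the real and synthetic rows side by side. The sketch below is an illustration, not part of the original notebook: `real` and `synthetic` re-type only the rows printed above (so the synthetic winter series is incomplete), and `compare_by_season` is a hypothetical helper.

```python
import pandas as pd

# Real rows for ids 0 and 1, as displayed earlier.
real = pd.DataFrame({
    "season": ["winter"] * 7 + ["summer"] * 7,
    "nb_customers": [43, 45, 54, 63, 86, 144, 92,
                     161, 167, 171, 183, 201, 268, 210],
})

# The 10 synthetic rows printed above (the winter series is cut off after 3 rows).
synthetic = pd.DataFrame({
    "season": ["summer"] * 7 + ["winter"] * 3,
    "nb_customers": [145, 164, 202, 200, 207, 274, 222, 46, 44, 53],
})


def compare_by_season(real_frame, synthetic_frame):
    """Hypothetical helper: mean customers per season, real vs. synthetic."""
    real_mean = real_frame.groupby("season")["nb_customers"].mean()
    synth_mean = synthetic_frame.groupby("season")["nb_customers"].mean()
    # Both Series are indexed by season, so they align into one table.
    return pd.DataFrame({"real": real_mean, "synthetic": synth_mean})


comparison = compare_by_season(real, synthetic)
print(comparison)
```

With the notebook's own objects, the same helper could be called on `df` and on the frame returned by `model.sample(5)`, which would base the comparison on complete series rather than the excerpt used here.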