PAR Model
=========

This guide walks step by step through the main features of the `PAR`
model for timeseries data.

What is PAR?
------------

The `PAR` class is an implementation of a Probabilistic AutoRegressive
model that learns **multi-type, multivariate timeseries data** and can
then generate new synthetic data with the same format and properties as
the data it learned from.

Additionally, the `PAR` model has the ability to generate new synthetic
timeseries conditioned on the properties of the entity to which this
timeseries data would be associated.

<div class="alert alert-info">

**Note**

The PAR model is under active development. Please use it, try it on your
data and give us feedback in a [GitHub
issue](https://github.com/sdv-dev/SDV/issues) or in our [Slack
workspace](https://join.slack.com/t/sdv-space/shared_invite/zt-gdsfcb5w-0QQpFMVoyB2Yd6SRiMplcw).

</div>

Quick Usage
-----------

We will start by loading one of our demo datasets, `nasdaq100_2019`,
which contains daily stock market data from the NASDAQ 100 companies
during the year 2019.

In [1]:
from sdv.demo import load_timeseries_demo

data = load_timeseries_demo()
data.head()

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,AAPL,2018-12-31,39.6325,39.435001,140014000,737873400000.0,Technology,Computer Manufacturing
1,AAPL,2019-01-02,38.7225,39.48,148158800,737873400000.0,Technology,Computer Manufacturing
2,AAPL,2019-01-03,35.994999,35.547501,365248800,737873400000.0,Technology,Computer Manufacturing
3,AAPL,2019-01-04,36.1325,37.064999,234428400,737873400000.0,Technology,Computer Manufacturing
4,AAPL,2019-01-07,37.174999,36.982498,219111200,737873400000.0,Technology,Computer Manufacturing


As you can see, this table contains information about multiple Tickers,
including:

-   Symbol of the Ticker.
-   Date associated with the stock market values.
-   The opening and closing prices for the day.
-   The Volume of transactions of the day.
-   The MarketCap of the company.
-   The Sector and the Industry in which the company operates.

This is a very common and well-known format for timeseries data, and it
includes 4 types of columns:

### Entity Columns

These are columns that indicate how the rows are associated with
external, abstract `entities`. The group of rows associated with each
`entity_id` forms a timeseries sequence, where the order of the rows
matters and where inter-row dependencies exist. However, the rows of
different `entities` are completely independent from each other.

In this case, the external `entity` is the company, and the identifier
of the company within our data is the `Symbol` column.

In [2]:
entity_columns = ['Symbol']

<div class="alert alert-info">

**Note**

In some cases, the datasets do not contain any `entity_columns` because
the rows are not associated with any external entity. In these cases,
the `entity_columns` specification can be omitted and the complete
dataset will be interpreted as a single timeseries sequence.

</div>
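
As a rough illustration of how entity columns split a table into
sequences, the sketch below counts the sequences in a small, invented
stand-in for the demo table. The `toy` values are assumptions made up
for this example, and nothing here is part of the `PAR` API.

```python
import pandas as pd

# Invented stand-in for the demo table; the prices are illustrative only.
toy = pd.DataFrame({
    'Symbol': ['AAPL', 'AAPL', 'AAPL', 'TSLA', 'TSLA'],
    'Date': ['2019-01-02', '2019-01-03', '2019-01-04', '2019-01-02', '2019-01-03'],
    'Close': [39.48, 35.55, 37.06, 62.02, 60.07],
})

entity_columns = ['Symbol']

# Each distinct combination of entity values identifies one sequence.
num_sequences = toy.groupby(entity_columns).ngroups
print(num_sequences)  # 2
```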

### Context

The timeseries datasets may have one or more `context_columns`.
`context_columns` are variables that provide information about the
entities associated with the timeseries in the form of attributes and
which may condition how the timeseries variables evolve.

For example, in our stock market case, the `MarketCap`, the `Sector` and
the `Industry` variables are all contextual attributes associated with
each company, and they have a great impact on what each timeseries
looks like.

In [3]:
context_columns = ['MarketCap', 'Sector', 'Industry']

<div class="alert alert-info">

**Note**

The `context_columns` are attributes that are associated with the
entities, and which do not change over time. For this reason, since each
timeseries sequence has a single entity associated, the values of the
`context_columns` are expected to remain constant within each
combination of `entity_columns` values.

</div>
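
One way to check this expectation before fitting is to count the
distinct context values per entity. The following is a minimal sketch
on an invented table; `inconsistent_context` is a hypothetical helper
written for this example and is not provided by SDV.

```python
import pandas as pd

# Invented data: the second TSLA row deliberately changes its Sector.
toy = pd.DataFrame({
    'Symbol': ['AAPL', 'AAPL', 'TSLA', 'TSLA'],
    'Sector': ['Technology', 'Technology', 'Capital Goods', 'Technology'],
    'MarketCap': [7.4e11, 7.4e11, 5.0e10, 5.0e10],
})


def inconsistent_context(df, entity_columns, context_columns):
    """Return the entities whose context columns take more than one value."""
    distinct = df.groupby(entity_columns)[context_columns].nunique()
    return distinct[(distinct > 1).any(axis=1)]


bad = inconsistent_context(toy, ['Symbol'], ['Sector', 'MarketCap'])
print(bad)
```

An empty result means every entity has a single value for each context
column.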

### Sequence Index

By definition, timeseries datasets have inter-row dependencies for
which the order of the rows matters. In most cases, this order will be
indicated by a `sequence_index` column that contains sortable values
such as integers, floats or datetimes. In some other cases there may be
no `sequence_index`, which means that the rows are assumed to be already
given in the right order.

In this case, the column that indicates the order of the rows within
each sequence is the `Date` column:

In [4]:
sequence_index = 'Date'
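
To make the idea of "sortable values" concrete, the sketch below orders
an invented table by entity and date and confirms that each sequence is
then in increasing order. This only illustrates the concept; the guide
does not state that `PAR` requires the data to be sorted beforehand.

```python
import pandas as pd

# Invented rows, deliberately shuffled.
toy = pd.DataFrame({
    'Symbol': ['TSLA', 'AAPL', 'AAPL', 'TSLA'],
    'Date': ['2019-01-03', '2019-01-03', '2019-01-02', '2019-01-02'],
    'Close': [60.07, 35.55, 39.48, 62.02],
})
toy['Date'] = pd.to_datetime(toy['Date'])

# Order the rows by entity first, then by the sequence index.
ordered = toy.sort_values(['Symbol', 'Date']).reset_index(drop=True)

# Check that dates never go backwards inside any single sequence.
is_ordered = ordered.groupby('Symbol')['Date'].apply(
    lambda dates: dates.is_monotonic_increasing
)
print(bool(is_ordered.all()))  # True
```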

### Data Columns

Finally, the rest of the columns of the dataset are what we call the
`data_columns`, and they are the columns that our `PAR` model will learn
to generate synthetically, conditioned on the values of the
`context_columns`.
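
Put differently, the `data_columns` are whatever is left once the other
three roles are assigned. A small sketch of that bookkeeping, using the
column names of the demo table (the `reserved` name is just for this
example):

```python
all_columns = [
    'Symbol', 'Date', 'Open', 'Close', 'Volume',
    'MarketCap', 'Sector', 'Industry',
]
entity_columns = ['Symbol']
context_columns = ['MarketCap', 'Sector', 'Industry']
sequence_index = 'Date'

# Columns that already play another role in the timeseries format.
reserved = set(entity_columns) | set(context_columns) | {sequence_index}

# Everything else is a data column, kept in its original order.
data_columns = [column for column in all_columns if column not in reserved]
print(data_columns)  # ['Open', 'Close', 'Volume']
```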

Let's now see how to use the `PAR` class to learn this timeseries dataset
and generate new synthetic timeseries that replicate its properties.

For this, you will need to:

-   Import the `sdv.timeseries.PAR` class and create an instance of it
    passing the variables that we just created.
-   Call its `fit` method passing the timeseries data.
-   Call its `sample` method indicating the number of sequences that we
    want to generate.

In [5]:
from sdv.timeseries import PAR

model = PAR(
    entity_columns=entity_columns,
    context_columns=context_columns,
    sequence_index=sequence_index,
)
model.fit(data)

<div class="alert alert-info">

**Note**

Notice that the model `fitting` process took care of transforming the
different fields using the appropriate [Reversible Data
Transforms](http://github.com/sdv-dev/RDT) to ensure that the data has a
format that the underlying models can handle.

</div>

### Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic
data by calling the `sample` method of your model, passing the number
of sequences that you want to generate.

Let's start by generating a single sequence.

In [6]:
new_data = model.sample(1)

This will return a table with the same format as the one the model was
fitted on, but filled with new synthetic data that resembles the
original.

In [7]:
new_data.head()

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,aIWZWCGPVU,2019-01-02,183.437196,183.531971,8583004,45834410000.0,Technology,Semiconductors
1,aIWZWCGPVU,2019-01-05,183.437196,137.288708,18137034,45834410000.0,Technology,Semiconductors
2,aIWZWCGPVU,2019-01-04,119.467777,117.415616,18499391,45834410000.0,Technology,Semiconductors
3,aIWZWCGPVU,2019-01-05,117.547403,116.944213,12249561,45834410000.0,Technology,Semiconductors
4,aIWZWCGPVU,2019-01-06,111.969057,123.915607,12068214,45834410000.0,Technology,Semiconductors


<div class="alert alert-info">

**Note**

Notice how the model generated a random string for the `Symbol`
identifier which does not look like the regular Ticker symbols that we
saw in the original data. This is because the model needs you to tell it
how these symbols need to be generated by providing a regular expression
that it can use. We will see how to do this in a later section.

</div>

### Save and Load the model

In many scenarios it will be convenient to generate synthetic versions
of your data directly in systems that do not have access to the original
data source. For example, you may want to generate testing data on
the fly inside a testing environment that does not have access to your
production database. In these scenarios, fitting the model with real
data every time that you need to generate new data is not feasible, so
you will need to fit a model in your production environment, save the
fitted model into a file, send this file to the testing environment and
then load it there to be able to `sample` from it.

Let's see how this process works.

#### Save and share the model

Once you have fitted the model, all you need to do is call its `save`
method passing the name of the file in which you want to save the model.
Note that the extension of the filename is not relevant, but we will be
using the `.pkl` extension to highlight that the serialization protocol
used is [pickle](https://docs.python.org/3/library/pickle.html).

In [8]:
model.save('my_model.pkl')

This will have created a file called `my_model.pkl` in the same
directory in which you are running SDV.

<div class="alert alert-info">

**Note**

If you inspect the generated file you will notice that its size is much
smaller than the size of the data that you used to generate it. This is
because the serialized model contains **no information about the
original data**, other than the parameters it needs to generate
synthetic versions of it. This means that you can safely share this
`my_model.pkl` file without the risk of disclosing any of your real
data!

</div>

#### Load the model and generate new data

The file you just generated can be sent over to the system where the
synthetic data will be generated. Once it is there, you can load it
using the `PAR.load` method, and then you are ready to sample new data
from the loaded instance:

In [9]:
loaded = PAR.load('my_model.pkl')
loaded.sample(num_sequences=1).head()

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,hO,2019-01-01,183.437196,227.279329,6569659,891671500.0,Consumer Services,Industrial Machinery/Components
1,hO,2019-01-01,165.890654,172.680317,-12948192,891671500.0,Consumer Services,Industrial Machinery/Components
2,hO,2019-01-02,280.426422,230.925929,3216414,891671500.0,Consumer Services,Industrial Machinery/Components
3,hO,2019-01-07,283.994792,279.855631,1215530,891671500.0,Consumer Services,Industrial Machinery/Components
4,hO,2019-01-07,303.788616,276.491937,485251,891671500.0,Consumer Services,Industrial Machinery/Components


<div class="alert alert-warning">

**Warning**

Notice that the system where the model is loaded needs to also have
`sdv` installed, otherwise it will not be able to load the model and use
it.

</div>
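
The reason for this requirement comes from how pickle works: a pickle
file refers to classes by module and name instead of embedding their
code, so the loading side must be able to import the same classes. The
sketch below shows this with a standard-library object, `Fraction`,
used purely as a stand-in for a fitted model; it does not involve SDV
itself.

```python
import os
import pickle
import tempfile
from fractions import Fraction

# Fraction is only a stand-in for a fitted model object.
original = Fraction(3, 4)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'stand_in.pkl')
    # Write the object to a file, as a model's save step would.
    with open(path, 'wb') as handle:
        pickle.dump(original, handle)
    # Read the raw bytes back to inspect what was stored.
    with open(path, 'rb') as handle:
        payload = handle.read()

restored = pickle.loads(payload)

# The file names the module of the class rather than carrying its code.
print(b'fractions' in payload)  # True
print(restored == original)     # True
```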

### Conditional Sampling

In the previous examples we had the model generate random values for us
to populate the `context_columns` and the `entity_columns`. In order to
do this, the model learned the context and entity values using a
`GaussianCopula`, which it later used to sample new realistic values for
them. This is fine for cases in which we do not have any constraints
regarding the type of data that we generate, but in some cases we might
want to control the values of the contextual columns to force the model
into generating data of a certain type.

In order to achieve this, we will first have to create a
`pandas.DataFrame` with the expected values.

As an example, let's generate values for two companies in the Technology
and Health Care sectors.

In [10]:
import pandas as pd

context = pd.DataFrame([
    {
        'Symbol': 'AAAA',
        'MarketCap': 1.2345e+11,
        'Sector': 'Technology',
        'Industry': 'Electronic Components'
    },
    {
        'Symbol': 'BBBB',
        'MarketCap': 4.5678e+10,
        'Sector': 'Health Care',
        'Industry': 'Medical/Nursing Services'
    },
])
context

Unnamed: 0,Symbol,MarketCap,Sector,Industry
0,AAAA,123450000000.0,Technology,Electronic Components
1,BBBB,45678000000.0,Health Care,Medical/Nursing Services


Once you have created this, you can simply pass the dataframe as the
`context` argument to the `sample` method.

In [11]:
new_data = model.sample(context=context)

And we can now see the data generated for the two companies:

In [12]:
new_data[new_data.Symbol == 'AAAA'].head()

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,AAAA,2019-01-02,56.788037,183.531971,467831,123450000000.0,Technology,Electronic Components
1,AAAA,2019-01-01,183.437196,109.853322,12287392,123450000000.0,Technology,Electronic Components
2,AAAA,2019-01-03,99.706027,125.668329,10803297,123450000000.0,Technology,Electronic Components
3,AAAA,2019-01-05,107.323586,183.531971,7367826,123450000000.0,Technology,Electronic Components
4,AAAA,2019-01-08,125.369102,114.97389,6569659,123450000000.0,Technology,Electronic Components


In [13]:
new_data[new_data.Symbol == 'BBBB'].head()

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
215,BBBB,2018-12-27,183.437196,183.531971,6569659,45678000000.0,Health Care,Medical/Nursing Services
216,BBBB,2018-12-31,140.279261,160.459861,-338301,45678000000.0,Health Care,Medical/Nursing Services
217,BBBB,2019-01-03,132.545216,125.508325,5575496,45678000000.0,Health Care,Medical/Nursing Services
218,BBBB,2019-01-05,113.432206,97.730619,2774735,45678000000.0,Health Care,Medical/Nursing Services
219,BBBB,2019-01-07,109.552246,102.639291,3950797,45678000000.0,Health Care,Medical/Nursing Services
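
If you want to confirm programmatically that the sampled rows carry the
context you asked for, a merge against the context table is one option.
This is a minimal sketch: the `generated` frame is an invented stand-in
for the output of `model.sample(context=context)`, reduced to a few
columns.

```python
import pandas as pd

context = pd.DataFrame({
    'Symbol': ['AAAA', 'BBBB'],
    'Sector': ['Technology', 'Health Care'],
})

# Invented stand-in for the sampled output.
generated = pd.DataFrame({
    'Symbol': ['AAAA', 'AAAA', 'BBBB'],
    'Sector': ['Technology', 'Technology', 'Health Care'],
    'Close': [183.5, 109.9, 160.5],
})

# Attach the requested value to every generated row of the same entity.
merged = generated.merge(context, on='Symbol', suffixes=('', '_requested'))

# True when each row has exactly the Sector that was requested for it.
matches = bool((merged['Sector'] == merged['Sector_requested']).all())
print(matches)  # True
```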


Advanced Usage
--------------

Now that we have discovered the basics, let's go over a few more
advanced usage examples and see the different arguments that we can pass
to our `PAR` Model in order to customize it to our needs.

### How to customize the generated IDs?

In the previous examples we saw how the `Symbol` values were generated
as random strings that do not look like the ones typically seen for
Tickers, which are usually strings of between 2 and 4 uppercase
letters.

In order to fix this and force the model to generate values that are
valid for the field, we can use the `field_types` argument to indicate
the characteristics of each field by passing a dictionary that follows
the `Metadata` field specification.

For this case in particular, we will indicate that the `Symbol` field
needs to be generated using the regular expression `[A-Z]{2,4}`.

In [14]:
field_types = {
    'Symbol': {
        'type': 'id',
        'subtype': 'string',
        'regex': '[A-Z]{2,4}'
    }
}
model = PAR(
    entity_columns=entity_columns,
    context_columns=context_columns,
    sequence_index=sequence_index,
    field_types=field_types
)
model.fit(data)

After this, we can observe how the new `Symbols` are generated as
indicated.

In [15]:
model.sample(num_sequences=1).head()

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,GW,2018-12-31,303.32467,200.678131,8322448,25410480000.0,Technology,Electronic Components
1,GW,2019-01-02,177.960162,190.637568,5428279,25410480000.0,Technology,Electronic Components
2,GW,2019-01-05,133.421485,209.447492,6569659,25410480000.0,Technology,Electronic Components
3,GW,2019-01-07,197.513859,183.961058,2979529,25410480000.0,Technology,Electronic Components
4,GW,2019-01-07,169.781218,183.846672,6989041,25410480000.0,Technology,Electronic Components


<div class="alert alert-info">

**Note**

Notice how in this case we only specified the properties of the `Symbol`
field and the PAR model was able to handle the other fields
appropriately without needing any indication from us.

</div>
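
To get a feel for what the regular expression `[A-Z]{2,4}` accepts, it
can be tried on its own with Python's standard `re` module. The sketch
below is independent of SDV and simply tests a few strings taken from
the outputs in this guide, plus a one-letter example.

```python
import re

pattern = re.compile(r'[A-Z]{2,4}')

candidates = ['GW', 'AAPL', 'hO', 'aIWZWCGPVU', 'A']

# fullmatch requires the whole string to fit the pattern.
valid = [value for value in candidates if pattern.fullmatch(value)]
print(valid)  # ['GW', 'AAPL']
```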

### Can I control the length of the sequences?

When learning the data, the PAR model also learned the distribution of
the lengths of the sequences, so each generated sequence may have a
different length:

In [16]:
model.sample(num_sequences=5).groupby('Symbol').size()

Symbol
EIZG    185
EJU     252
EM      252
HZV     252
ZGUY    246
dtype: int64

If we want to force a specific length on the generated sequences we can
pass the `sequence_length` argument to the `sample` method:

In [17]:
model.sample(num_sequences=5, sequence_length=100).groupby('Symbol').size()

Symbol
FOO     100
OQXN    100
TIU     100
XTKV    100
ZVER    100
dtype: int64
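
When choosing a value for `sequence_length`, it can help to look at how
long the sequences in the training data are. The sketch below measures
this on an invented table using the same `groupby('Symbol').size()`
call shown above; the numbers are illustrative only.

```python
import pandas as pd

# Invented data with sequences of different lengths.
toy = pd.DataFrame({
    'Symbol': ['AAPL'] * 3 + ['TSLA'] * 2 + ['MSFT'] * 3,
    'Close': [39.5, 35.5, 37.1, 62.0, 60.1, 101.1, 97.4, 101.9],
})

# Number of rows in each sequence.
lengths = toy.groupby('Symbol').size()

shortest = int(lengths.min())
longest = int(lengths.max())
print(shortest, longest)  # 2 3
```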

### Can I use timeseries without context?

Sometimes the timeseries datasets do not provide any additional
properties from the entities associated with each sequence, other than
the unique identifier of the entity.

Let's simulate this situation by dropping the context columns from our
data.

In [18]:
no_context = data[['Symbol', 'Date', 'Open', 'Close', 'Volume']].copy()
no_context.head()

Unnamed: 0,Symbol,Date,Open,Close,Volume
0,AAPL,2018-12-31,39.6325,39.435001,140014000
1,AAPL,2019-01-02,38.7225,39.48,148158800
2,AAPL,2019-01-03,35.994999,35.547501,365248800
3,AAPL,2019-01-04,36.1325,37.064999,234428400
4,AAPL,2019-01-07,37.174999,36.982498,219111200
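
An equivalent way to build such a table is to drop the context columns
by name instead of listing the columns to keep. A minimal sketch on an
invented two-row table (the values are made up for illustration):

```python
import pandas as pd

# Invented stand-in for the demo table.
toy = pd.DataFrame({
    'Symbol': ['AAPL', 'TSLA'],
    'Date': ['2019-01-02', '2019-01-02'],
    'Close': [39.48, 62.02],
    'MarketCap': [7.4e11, 5.0e10],
    'Sector': ['Technology', 'Capital Goods'],
    'Industry': ['Computer Manufacturing', 'Auto Manufacturing'],
})
context_columns = ['MarketCap', 'Sector', 'Industry']

# drop returns a new DataFrame and leaves the original untouched.
no_context_toy = toy.drop(columns=context_columns)
print(list(no_context_toy.columns))  # ['Symbol', 'Date', 'Close']
```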


In these cases, we can simply skip the context columns when creating the
model, and PAR will be able to learn the timeseries without imposing any
conditions on them.

In [19]:
model = PAR(
    entity_columns=entity_columns,
    sequence_index=sequence_index,
    field_types=field_types,
)
model.fit(no_context)
model.sample(num_sequences=1).head()

Unnamed: 0,Symbol,Date,Open,Close,Volume
0,NZJ,2019-01-03,283.358468,320.417504,6569659
1,NZJ,2018-12-28,267.504874,183.531971,26998765
2,NZJ,2019-01-03,183.437196,350.709475,4440035
3,NZJ,2019-01-03,302.68309,325.844561,10384659
4,NZJ,2019-01-04,310.52061,360.810723,6569659


In this case, of course, we are not able to sample new sequences
conditioned on any value, but we are still able to force the symbols
that we want on the generated data by passing them in a
`pandas.DataFrame`:

In [20]:
symbols = pd.DataFrame({
    'Symbol': ['TSLA']
})
model.sample(context=symbols).head()

Unnamed: 0,Symbol,Date,Open,Close,Volume
0,TSLA,2019-01-02,183.437196,238.092983,6569659
1,TSLA,2019-01-04,295.915921,287.785729,9121682
2,TSLA,2019-01-04,339.393268,259.81719,9797831
3,TSLA,2019-01-06,269.608546,271.854816,-142704
4,TSLA,2019-01-08,316.25451,267.436525,9561240


### What happens if there are no `entity_columns` either?

In some cases the timeseries datasets are made of a single timeseries
sequence with no identifiers of external entities. For example, suppose
we only had the data from one company:

In [21]:
tsla = no_context[no_context.Symbol == 'TSLA'].copy()
del tsla['Symbol']
tsla.head()

Unnamed: 0,Date,Open,Close,Volume
1008,2018-12-31,67.557999,66.559998,31511500
1009,2019-01-02,61.220001,62.023998,58293000
1010,2019-01-03,61.400002,60.071999,34826000
1011,2019-01-04,61.200001,63.537998,36970500
1012,2019-01-07,64.344002,66.991997,37756000


In this case, we can simply omit the `entity_columns` argument when
creating our PAR instance:

In [22]:
model = PAR(
    sequence_index=sequence_index,
)
model.fit(tsla)
model.sample()

Unnamed: 0,Date,Open,Close,Volume
0,2018-12-31,58.323193,54.638159,14633613
1,2019-01-02,64.963547,54.638159,27377623
2,2019-01-02,66.218629,68.069365,36860547
3,2019-01-05,68.713604,70.431854,45715575
4,2019-01-06,68.018849,70.771949,-2642317
...,...,...,...,...
247,2020-01-05,53.544431,56.569719,70254877
248,2020-01-07,55.006984,51.746359,46508664
249,2020-01-08,54.552286,53.639119,17514409
250,2020-01-09,52.936391,51.917528,44074317
