# Synthesize Timeseries Sequences

In this notebook, we'll use the SDV library to create multiple synthetic sequences. The SDV uses machine learning to learn patterns from real data and emulates them when creating synthetic data.

We'll use the Probabilistic AutoRegressive (**PAR**) algorithm to do this. PAR uses a neural network to create sequences.

In [1]:
from sdv.datasets.demo import download_demo
from sdv.sequential import PARSynthesizer
from sdv.evaluation.single_table import get_column_plot

In [2]:
real_data, metadata = download_demo(
    modality='sequential',
    dataset_name='nasdaq100_2019'
)

**Details**: The data is available as a single table.
- `Symbol` describes the ticker symbol of the company
- `Date` describes the point in time that the prices correspond to
- Columns such as `Open`, `Close` and `Volume` are measurements that change daily
- Columns such as `Sector` and `Industry` describe fixed, unchanging values for every company

In [3]:
real_data.head()

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,AAPL,2018-12-31,39.6325,39.435001,140014000,737873400000.0,Technology,Computer Manufacturing
1,AAPL,2019-01-02,38.7225,39.48,148158800,737873400000.0,Technology,Computer Manufacturing
2,AAPL,2019-01-03,35.994999,35.547501,365248800,737873400000.0,Technology,Computer Manufacturing
3,AAPL,2019-01-04,36.1325,37.064999,234428400,737873400000.0,Technology,Computer Manufacturing
4,AAPL,2019-01-07,37.174999,36.982498,219111200,737873400000.0,Technology,Computer Manufacturing


In [4]:
metadata

{
    "sequence_index": "Date",
    "columns": {
        "Symbol": {
            "sdtype": "id",
            "regex_format": "[A-Z]{4}"
        },
        "Date": {
            "sdtype": "datetime",
            "datetime_format": "%Y-%m-%d"
        },
        "Open": {
            "sdtype": "numerical",
            "computer_representation": "Float"
        },
        "Close": {
            "sdtype": "numerical",
            "computer_representation": "Float"
        },
        "Volume": {
            "sdtype": "numerical",
            "computer_representation": "Int64"
        },
        "MarketCap": {
            "sdtype": "numerical",
            "computer_representation": "Float"
        },
        "Sector": {
            "sdtype": "categorical"
        },
        "Industry": {
            "sdtype": "categorical"
        }
    },
    "sequence_key": "Symbol",
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1"
}
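
The printed metadata can be read like a plain dictionary. The sketch below assumes a dictionary shaped like the output above; `metadata_dict` is a trimmed, hand-copied subset, and `columns_by_sdtype` is a hypothetical helper, not part of the SDV. In the SDV, such a dictionary can typically be obtained with `metadata.to_dict()`, but treat that as an assumption to verify against your installed version.

```python
from collections import defaultdict

# Trimmed, hand-copied subset of the metadata printed above (illustrative only).
metadata_dict = {
    "sequence_index": "Date",
    "sequence_key": "Symbol",
    "columns": {
        "Symbol": {"sdtype": "id", "regex_format": "[A-Z]{4}"},
        "Date": {"sdtype": "datetime", "datetime_format": "%Y-%m-%d"},
        "Open": {"sdtype": "numerical", "computer_representation": "Float"},
        "Sector": {"sdtype": "categorical"},
        "Industry": {"sdtype": "categorical"},
    },
}


def columns_by_sdtype(metadata_dict):
    """Group column names by their declared sdtype (hypothetical helper)."""
    grouped = defaultdict(list)
    for name, spec in metadata_dict["columns"].items():
        grouped[spec["sdtype"]].append(name)
    return dict(grouped)


grouped_columns = columns_by_sdtype(metadata_dict)
print("sequence key:", metadata_dict["sequence_key"])
print("sequence index:", metadata_dict["sequence_index"])
print("categorical columns:", grouped_columns["categorical"])
```

Grouping by `sdtype` makes it easy to see which columns are measurements and which are candidates for context.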

## 1.1 What is sequential data?

A **sequence** is a set of measurements taken in a particular order, such as the `Open`, `Close` and `Volume` of stock prices. Some datasets have a **sequence index** that prescribes this order. In our case, this is the `Date` column.

In a single sequence, all measurements belong to the same entity. For example, if we isolate only the stock from Amazon (`Symbol='AMZN'`), then we have a single sequence of data. This sequence has 252 measurements, all ordered by `Date`.

In [5]:
amzn_sequence = real_data[real_data['Symbol'] == 'AMZN']
amzn_sequence

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
252,AMZN,2018-12-31,1510.800049,1501.969971,6954500,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
253,AMZN,2019-01-02,1465.199951,1539.130005,7983100,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
254,AMZN,2019-01-03,1520.010010,1500.280029,6975600,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
255,AMZN,2019-01-04,1530.000000,1575.390015,9182600,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
256,AMZN,2019-01-07,1602.310059,1629.510010,7993200,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
...,...,...,...,...,...,...,...,...
499,AMZN,2019-12-23,1788.260010,1793.000000,2136400,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
500,AMZN,2019-12-24,1793.810059,1789.209961,881300,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
501,AMZN,2019-12-26,1801.010010,1868.770020,6005400,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
502,AMZN,2019-12-27,1882.920044,1869.800049,6186600,4.035002e+11,Consumer Services,Catalog/Specialty Distribution


In a **multi-sequence** dataset, there are multiple sequences existing in the same table. For example, in our dataset, there are different sequences for each company: Amazon, Google, Netflix, etc.

A **sequence key** is a column that can be used to identify each sequence. In this case, it is the `Symbol` column. If we inspect it, we can see that it contains 103 unique values (a few companies, such as `GOOGL` and `GOOG`, appear under more than one ticker), which means there are 103 sequences in the data.

In [6]:
real_data['Symbol'].unique()

array(['AAPL', 'AMZN', 'MSFT', 'FB', 'TSLA', 'GOOGL', 'GOOG', 'NVDA',
       'ADBE', 'PYPL', 'NFLX', 'INTC', 'CMCSA', 'PEP', 'CSCO', 'COST',
       'AVGO', 'QCOM', 'TMUS', 'TXN', 'AMGN', 'CHTR', 'SBUX', 'AMD', 'ZM',
       'INTU', 'ISRG', 'MDLZ', 'GILD', 'JD', 'BKNG', 'VRTX', 'FISV',
       'ADP', 'ATVI', 'REGN', 'MELI', 'CSX', 'AMAT', 'MU', 'LRCX', 'ADSK',
       'ILMN', 'BIIB', 'ADI', 'DOCU', 'LULU', 'MNST', 'WDAY', 'CTSH',
       'EXC', 'EBAY', 'KHC', 'EA', 'NXPI', 'BIDU', 'XEL', 'DXCM', 'SGEN',
       'CTAS', 'IDXX', 'ORLY', 'SNPS', 'ROST', 'KLAC', 'SPLK', 'CDNS',
       'NTES', 'MAR', 'VRSK', 'WBA', 'PCAR', 'ASML', 'PAYX', 'MRNA',
       'ANSS', 'XLNX', 'MCHP', 'CPRT', 'ALXN', 'ALGN', 'FAST', 'SWKS',
       'SIRI', 'VRSN', 'PDD', 'CERN', 'DLTR', 'INCY', 'MXIM', 'TTWO',
       'CDW', 'CHKP', 'CTXS', 'TCOM', 'BMRN', 'ULTA', 'EXPE', 'WDC',
       'FOXA', 'LBTYK', 'FOX', 'LBTYA'], dtype=object)

**The PAR synthesizer is suited for multi-sequence data.** This dataset, with 103 sequences, is a good candidate.
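
To see how many sequences a table holds and how long each one is, the rows can be grouped by the sequence key. This is a minimal sketch on a toy table that stands in for `real_data`; `toy_data` and `sequence_lengths` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy stand-in for real_data: two sequences identified by 'Symbol'.
toy_data = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Date": pd.to_datetime(
        ["2019-01-02", "2019-01-03", "2019-01-04", "2019-01-02", "2019-01-03"]
    ),
    "Close": [10.0, 10.5, 10.2, 55.0, 54.1],
})


def sequence_lengths(data, sequence_key):
    """Return the number of rows in each sequence, indexed by the sequence key."""
    return data.groupby(sequence_key).size()


lengths = sequence_lengths(toy_data, "Symbol")
print(len(lengths), "sequences")
print(lengths.to_dict())
```

The same call on the notebook's table would be `sequence_lengths(real_data, 'Symbol')`.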

## 1.2 What are Context Columns?
A **context** column does not change during the course of a sequence. In our case, `Sector` and `Industry` are context columns.

If we choose a sequence -- such as Amazon (`Symbol='AMZN'`) -- then we'll see that the context values don't change. Amazon is always a `'Consumer Services'` company.

In [7]:
real_data[real_data['Symbol'] == 'AMZN']['Sector'].unique()

array(['Consumer Services'], dtype=object)

**The PAR Synthesizer learns sequence information based on the context.** It's important to identify these columns ahead of time.
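
One way to identify candidate context columns is to check which columns hold a single value within every sequence. The sketch below runs on a toy table; `find_constant_columns` is a hypothetical helper, not an SDV function.

```python
import pandas as pd


def find_constant_columns(data, sequence_key):
    """Return columns whose value never changes within any sequence."""
    # Count distinct values per sequence; dropna=False counts NaN as a value.
    distinct = data.groupby(sequence_key).nunique(dropna=False)
    return [
        column for column in distinct.columns
        if column != sequence_key and (distinct[column] <= 1).all()
    ]


# Toy stand-in for real_data: 'Close' varies, 'Sector' is fixed per sequence.
toy_data = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "BBB", "BBB"],
    "Close": [10.0, 10.5, 55.0, 54.1],
    "Sector": ["Technology", "Technology", "Health Care", "Health Care"],
})

constant_columns = find_constant_columns(toy_data, "Symbol")
print(constant_columns)
```

A column passing this check is only a candidate. Whether to declare it as context remains a modeling choice.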

# 2. Basic Usage

## 2.1 Creating a Synthesizer

An SDV **synthesizer** is an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data.

In [8]:
synthesizer = PARSynthesizer(
    metadata,
    context_columns=['Sector', 'Industry'])

synthesizer.fit(real_data)

<font color="maroon"><i><b>This step takes about 5 min to complete.</b> For larger datasets, this phase may take longer.</i></font>

When this code finishes running, the synthesizer is ready to use.

## 2.2 Generating Synthetic Data

Use the `sample` method and pass in the number of sequences to synthesize. The synthesizer algorithmically determines how long to make each sequence.

In [9]:
synthetic_data = synthesizer.sample(num_sequences=10)
synthetic_data.head()

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,AAAA,2019-01-07,22.038722,183.531971,7760803,5893961000.0,,Biotechnology: Commercial Physical & Biologica...
1,AAAA,2018-12-30,81.37698,183.531971,10000,28210410000.0,,Biotechnology: Commercial Physical & Biologica...
2,AAAA,2019-01-01,168.147425,128.063154,5283434,41239800000.0,,Biotechnology: Commercial Physical & Biologica...
3,AAAA,2019-01-03,111.623019,123.544582,1168152,29527710000.0,,Biotechnology: Commercial Physical & Biologica...
4,AAAA,2019-01-04,116.369078,89.499184,1679515,30996260000.0,,Biotechnology: Commercial Physical & Biologica...


In [10]:
synthetic_data[['Symbol', 'Industry']].groupby(['Symbol']).first().reset_index()

Unnamed: 0,Symbol,Industry
0,AAAA,Biotechnology: Commercial Physical & Biologica...
1,AAAB,Automotive Aftermarket
2,AAAC,Biotechnology: In Vitro & In Vivo Diagnostic S...
3,AAAD,"Computer Software: Programming, Data Processing"
4,AAAE,Industrial Specialties
5,AAAF,Other Specialty Stores
6,AAAG,Computer Communications Equipment
7,AAAH,Industrial Specialties
8,AAAI,Television Services
9,AAAJ,Catalog/Specialty Distribution


## 2.3 Saving and Loading
We can save the synthesizer to share with others and sample more synthetic data in the future.

In [11]:
# save_path = 'par_synthesizer.pkl'
# synthesizer.save(save_path)

# synthesizer = PARSynthesizer.load(save_path)

# 3. PAR Customization

When using this synthesizer, we can make a tradeoff between training time and data quality using the `epochs` parameter. A higher `epochs` value means that the synthesizer trains for longer, which ideally improves the data quality.

In [12]:
custom_synthesizer = PARSynthesizer(
    metadata,
    epochs=250,
    context_columns=['Sector', 'Industry'],
    verbose=True)

custom_synthesizer.fit(real_data)

Epoch 250 | Loss -2.618673324584961: 100%|██████████| 250/250 [03:37<00:00,  1.15it/s] 


<font color="maroon"><i><b>This step can take several minutes to complete</b> (about 3.5 min in the run shown above). We can use the `verbose` parameter to track progress. For larger datasets, this phase may take longer.</i></font>

In [13]:
# save_path = 'par_custom_synthesizer.pkl'
# custom_synthesizer.save(save_path)

# custom_synthesizer = PARSynthesizer.load(save_path)

# 4. Sampling Options
Using the PAR synthesizer, you can customize the synthetic data to suit your needs.

## 4.1 Specify Sequence Length

By default, the synthesizer algorithmically determines the length of each sequence. However, you can also specify a fixed, predetermined length.

In [14]:
synthetic_data = custom_synthesizer.sample(num_sequences=3, sequence_length=10)

100%|██████████| 3/3 [00:00<00:00, 21.68it/s]


In [15]:
synthetic_data.head()

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,AAAA,2019-01-01,5.29,183.531971,19659423,27549300000.0,,Biotechnology: Commercial Physical & Biologica...
1,AAAA,2019-01-02,97.825273,183.531971,1246769,94086560000.0,,Biotechnology: Commercial Physical & Biologica...
2,AAAA,2019-01-03,85.504444,85.673321,10000,56189360000.0,,Biotechnology: Commercial Physical & Biologica...
3,AAAA,2019-01-03,101.997292,88.891088,3643255,44749170000.0,,Biotechnology: Commercial Physical & Biologica...
4,AAAA,2019-01-05,96.422442,100.127489,10000,55609520000.0,,Biotechnology: Commercial Physical & Biologica...
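
To confirm that a sample honors the requested `sequence_length`, the row count of every sequence can be compared against the expected value. This is a sketch on toy data; `has_fixed_length` is a hypothetical helper.

```python
import pandas as pd


def has_fixed_length(data, sequence_key, expected_length):
    """Check that every sequence in `data` has exactly `expected_length` rows."""
    lengths = data.groupby(sequence_key).size()
    return bool((lengths == expected_length).all())


# Toy stand-in for a sample of 3 sequences with 2 rows each.
toy_sample = pd.DataFrame({
    "Symbol": ["AAAA", "AAAA", "AAAB", "AAAB", "AAAC", "AAAC"],
    "Close": [183.5, 85.6, 101.9, 88.8, 96.4, 100.1],
})

print(has_fixed_length(toy_sample, "Symbol", expected_length=2))
print(has_fixed_length(toy_sample, "Symbol", expected_length=10))
```

On the notebook's sample, the equivalent call would be `has_fixed_length(synthetic_data, 'Symbol', 10)`.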


In [16]:
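# Plot the real vs. synthetic distribution of every column.
for column in real_data.columns:
    try:
        fig = get_column_plot(
            real_data=real_data,
            synthetic_data=synthetic_data,
            column_name=column,
            metadata=metadata
        )
        fig.show()
    except Exception as error:
        # Some columns may not be plottable; report the reason and continue.
        print(f'Skipping {column}: {error}')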

In [17]:
synthetic_data = custom_synthesizer.sample(num_sequences=3, sequence_length=100)

100%|██████████| 3/3 [00:01<00:00,  2.36it/s]


In [18]:
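# Repeat the column plots for the longer (100-row) sequences.
for column in real_data.columns:
    try:
        fig = get_column_plot(
            real_data=real_data,
            synthetic_data=synthetic_data,
            column_name=column,
            metadata=metadata
        )
        fig.show()
    except Exception as error:
        # Some columns may not be plottable; report the reason and continue.
        print(f'Skipping {column}: {error}')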
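
Alongside the plots, a numeric summary can make the comparison between real and synthetic columns easier to scan. The sketch below is a simple pandas comparison of means and standard deviations, not an SDV quality metric; `compare_numeric_summary` and the toy tables are hypothetical.

```python
import pandas as pd


def compare_numeric_summary(real, synthetic, columns):
    """Return mean and standard deviation per column for real and synthetic data."""
    rows = []
    for column in columns:
        rows.append({
            "column": column,
            "real_mean": real[column].mean(),
            "synthetic_mean": synthetic[column].mean(),
            "real_std": real[column].std(),
            "synthetic_std": synthetic[column].std(),
        })
    return pd.DataFrame(rows).set_index("column")


# Toy stand-ins for real_data and synthetic_data.
toy_real = pd.DataFrame({"Close": [10.0, 12.0, 14.0]})
toy_synthetic = pd.DataFrame({"Close": [11.0, 13.0]})

summary = compare_numeric_summary(toy_real, toy_synthetic, ["Close"])
print(summary)
```

On the notebook's data, the call might look like `compare_numeric_summary(real_data, synthetic_data, ['Open', 'Close', 'Volume'])`.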

## 4.2 Conditional Sampling Using Context

You can pass in context columns and allow the PAR synthesizer to simulate the sequence based on those values.

Let's start by creating a scenario with 2 companies in the Technology sector and 3 others in the Consumer Services sector. Each row corresponds to a new sequence that we want to synthesize.

In [19]:
import pandas as pd

scenario_context = pd.DataFrame(data={
    'Symbol': real_data["Symbol"].value_counts().index[:5],
    'Sector': ['Technology']*2 + ['Consumer Services']*3,
    'Industry': ['Computer Manufacturing', 'Computer Software: Prepackaged Software',
                 'Hotels/Resorts', 'Restaurants', 'Clothing/Shoe/Accessory Stores']
})

scenario_context

Unnamed: 0,Symbol,Sector,Industry
0,AAPL,Technology,Computer Manufacturing
1,KLAC,Technology,Computer Software: Prepackaged Software
2,MRNA,Consumer Services,Hotels/Resorts
3,PAYX,Consumer Services,Restaurants
4,ASML,Consumer Services,Clothing/Shoe/Accessory Stores
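
Before sampling, it can help to confirm that every context value in the scenario also occurs in the training data. How the synthesizer treats a category it has not seen is not stated in this notebook, so the check below is a precaution under that assumption. `find_unseen_values` and the toy tables are hypothetical.

```python
import pandas as pd


def find_unseen_values(real, scenario, context_columns):
    """Map each context column to scenario values absent from the real data."""
    unseen = {}
    for column in context_columns:
        known = set(real[column].dropna().unique())
        missing = sorted(set(scenario[column]) - known)
        if missing:
            unseen[column] = missing
    return unseen


# Toy stand-ins for real_data and scenario_context.
toy_real = pd.DataFrame({"Sector": ["Technology", "Consumer Services"]})
toy_scenario = pd.DataFrame({"Sector": ["Technology", "Energy"]})

unseen_values = find_unseen_values(toy_real, toy_scenario, ["Sector"])
print(unseen_values)
```

An empty dictionary means every scenario value was present in the real data.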


Now we can simulate this scenario using our trained synthesizer.

In [20]:
custom_synthesizer.sample_sequential_columns(
    context_columns=scenario_context,
    sequence_length=2
)

100%|██████████| 5/5 [00:00<00:00, 78.38it/s]


Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,AAPL,2019-01-01,71.197683,183.531971,10000,95264010000.0,Technology,Computer Manufacturing
1,AAPL,2019-01-03,70.335414,5.27,111760,76464460000.0,Technology,Computer Manufacturing
2,KLAC,2019-01-01,204.014435,183.531971,3673427,25381390000.0,Technology,Computer Software: Prepackaged Software
3,KLAC,2019-01-01,183.437196,195.942724,3926616,,Technology,Computer Software: Prepackaged Software
4,MRNA,2019-01-01,73.712226,108.739218,8005900,133335300000.0,Consumer Services,Hotels/Resorts
5,MRNA,2018-12-31,97.144911,84.263123,5233339,,Consumer Services,Hotels/Resorts
6,PAYX,2019-01-06,183.437196,325.324281,6569659,,Consumer Services,Restaurants
7,PAYX,2019-01-03,112.235859,173.152746,18343498,126649200000.0,Consumer Services,Restaurants
8,ASML,2019-01-01,147.096949,290.521409,19337826,,Consumer Services,Clothing/Shoe/Accessory Stores
9,ASML,2019-01-01,249.429564,183.531971,344693,148991900000.0,Consumer Services,Clothing/Shoe/Accessory Stores
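
In the output above, some sequences contain repeated or out-of-order dates (for example, the `KLAC` and `MRNA` rows). The sketch below flags such sequences so they can be inspected or sorted afterwards. `unordered_sequences` is a hypothetical helper, and the toy table copies only the `Symbol` and `Date` values of the first six rows.

```python
import pandas as pd


def unordered_sequences(data, sequence_key, sequence_index):
    """Return keys of sequences whose index values are not strictly increasing."""
    flagged = []
    for key, group in data.groupby(sequence_key, sort=False):
        values = pd.to_datetime(group[sequence_index])
        # Strictly increasing means sorted with no repeated timestamps.
        if not (values.is_monotonic_increasing and values.is_unique):
            flagged.append(key)
    return flagged


# Symbol and Date values copied from the first six rows of the output above.
toy_sample = pd.DataFrame({
    "Symbol": ["AAPL", "AAPL", "KLAC", "KLAC", "MRNA", "MRNA"],
    "Date": ["2019-01-01", "2019-01-03", "2019-01-01",
             "2019-01-01", "2019-01-01", "2018-12-31"],
})

flagged = unordered_sequences(toy_sample, "Symbol", "Date")
print(flagged)
```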
