# Synthesize Sequences (PAR)

In this notebook, we'll use the SDV library to create multiple synthetic sequences. The SDV uses machine learning to learn patterns from real data and emulates them when creating synthetic data.

We'll use the **PAR** algorithm to do this. PAR uses a neural network to create sequences.

# 1. Loading the demo data
For this demo, we'll use a fake dataset that describes the daily prices of the 100 largest companies listed on the NASDAQ stock exchange.

In [75]:
from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality='sequential',
    dataset_name='nasdaq100_2019'
)

**Details**: The data is available as a single table.
- `Symbol` describes the ticker symbol of the company
- `Date` describes the point of time that the prices correspond to
- Columns such as `Open`, `Close` and `Volume` are measurements that change daily
- Columns such as `Sector` and `Industry` describe fixed, unchanging values for every company

In [76]:
real_data.head()

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,AAPL,2018-12-31,39.6325,39.435001,140014000,737873400000.0,Technology,Computer Manufacturing
1,AAPL,2019-01-02,38.7225,39.48,148158800,737873400000.0,Technology,Computer Manufacturing
2,AAPL,2019-01-03,35.994999,35.547501,365248800,737873400000.0,Technology,Computer Manufacturing
3,AAPL,2019-01-04,36.1325,37.064999,234428400,737873400000.0,Technology,Computer Manufacturing
4,AAPL,2019-01-07,37.174999,36.982498,219111200,737873400000.0,Technology,Computer Manufacturing


In [77]:
real_data.shape

(25784, 8)

## 1.1 What is sequential data?

A **sequence** is a set of measurements taken in a particular order, such as the `Open`, `Close` and `Volume` of stock prices. Some datasets have a **sequence index** that prescribes this order. In our case, it is the `Date` column.

In a single sequence, all measurements belong to the same entity. For example, if we isolate only the stock from Amazon (`Symbol='AMZN'`), then we have a single sequence of data. This sequence has 252 measurements with a `Date` ranging from the end of 2018 to the end of 2019.

In [78]:
amzn_sequence = real_data[real_data['Symbol'] == 'AMZN']
amzn_sequence

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
252,AMZN,2018-12-31,1510.800049,1501.969971,6954500,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
253,AMZN,2019-01-02,1465.199951,1539.130005,7983100,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
254,AMZN,2019-01-03,1520.010010,1500.280029,6975600,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
255,AMZN,2019-01-04,1530.000000,1575.390015,9182600,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
256,AMZN,2019-01-07,1602.310059,1629.510010,7993200,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
...,...,...,...,...,...,...,...,...
499,AMZN,2019-12-23,1788.260010,1793.000000,2136400,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
500,AMZN,2019-12-24,1793.810059,1789.209961,881300,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
501,AMZN,2019-12-26,1801.010010,1868.770020,6005400,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
502,AMZN,2019-12-27,1882.920044,1869.800049,6186600,4.035002e+11,Consumer Services,Catalog/Specialty Distribution
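
A sequence index is only useful if it actually orders the rows of each sequence. As a minimal sketch, assuming a table shaped like `real_data`, the check below parses the dates and tests whether they never decrease within each sequence. The toy frame and the helper name `index_is_increasing` are illustrative and not part of SDV.

```python
import pandas as pd

def index_is_increasing(df, sequence_key, sequence_index):
    """Return, per sequence, whether the sequence index never decreases."""
    dates = pd.to_datetime(df[sequence_index])
    # Group the parsed dates by the sequence key and test each group's order
    return dates.groupby(df[sequence_key]).apply(lambda s: s.is_monotonic_increasing)

# Toy stand-in for real_data: AMZN is in date order, AAPL is not
toy = pd.DataFrame({
    'Symbol': ['AMZN', 'AMZN', 'AMZN', 'AAPL', 'AAPL'],
    'Date': ['2018-12-31', '2019-01-02', '2019-01-03', '2019-01-03', '2019-01-02'],
})

ordered = index_is_increasing(toy, 'Symbol', 'Date')
print(ordered.to_dict())
```

On the demo data, the equivalent call would be `index_is_increasing(real_data, 'Symbol', 'Date')`.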


In a **multi-sequence** dataset, there are multiple sequences existing in the same table. For example, in our dataset, there are different sequences for each company: Amazon, Google, Netflix, etc.

A **sequence key** is a column that can be used to identify each sequence. In this case, it is the `Symbol` column. If we inspect it, we can see that it contains 103 unique ticker symbols, which means there are 103 sequences in the data.

In [79]:
real_data['Symbol'].unique()

array(['AAPL', 'AMZN', 'MSFT', 'FB', 'TSLA', 'GOOGL', 'GOOG', 'NVDA',
       'ADBE', 'PYPL', 'NFLX', 'INTC', 'CMCSA', 'PEP', 'CSCO', 'COST',
       'AVGO', 'QCOM', 'TMUS', 'TXN', 'AMGN', 'CHTR', 'SBUX', 'AMD', 'ZM',
       'INTU', 'ISRG', 'MDLZ', 'GILD', 'JD', 'BKNG', 'VRTX', 'FISV',
       'ADP', 'ATVI', 'REGN', 'MELI', 'CSX', 'AMAT', 'MU', 'LRCX', 'ADSK',
       'ILMN', 'BIIB', 'ADI', 'DOCU', 'LULU', 'MNST', 'WDAY', 'CTSH',
       'EXC', 'EBAY', 'KHC', 'EA', 'NXPI', 'BIDU', 'XEL', 'DXCM', 'SGEN',
       'CTAS', 'IDXX', 'ORLY', 'SNPS', 'ROST', 'KLAC', 'SPLK', 'CDNS',
       'NTES', 'MAR', 'VRSK', 'WBA', 'PCAR', 'ASML', 'PAYX', 'MRNA',
       'ANSS', 'XLNX', 'MCHP', 'CPRT', 'ALXN', 'ALGN', 'FAST', 'SWKS',
       'SIRI', 'VRSN', 'PDD', 'CERN', 'DLTR', 'INCY', 'MXIM', 'TTWO',
       'CDW', 'CHKP', 'CTXS', 'TCOM', 'BMRN', 'ULTA', 'EXPE', 'WDC',
       'FOXA', 'LBTYK', 'FOX', 'LBTYA'], dtype=object)

**The PAR synthesizer is suited for multi-sequence data.** So this dataset, with about 100 sequences, is a good candidate. The cells below narrow it to 16 symbols and to dates up to 2019-04-01 before training.

In [80]:
categories_to_filter = [
    'AAPL', 'AMZN', 'MSFT', 'FB', 'TSLA', 'GOOGL', 'GOOG', 'NVDA',
    'ADBE', 'PYPL', 'NFLX', 'INTC', 'CMCSA', 'PEP', 'CSCO', 'COST',
]
data = real_data[real_data['Symbol'].isin(categories_to_filter)]

In [81]:
data.Symbol.unique()

array(['AAPL', 'AMZN', 'MSFT', 'FB', 'TSLA', 'GOOGL', 'GOOG', 'NVDA',
       'ADBE', 'PYPL', 'NFLX', 'INTC', 'CMCSA', 'PEP', 'CSCO', 'COST'],
      dtype=object)

In [82]:
data.shape

(4032, 8)

In [83]:
data.Date.min()

'2018-12-31'

In [84]:
data.Date.max()

'2019-12-30'

In [85]:
data.Date.value_counts()

Date
2018-12-31    16
2019-09-09    16
2019-08-20    16
2019-08-21    16
2019-08-22    16
              ..
2019-05-08    16
2019-05-09    16
2019-05-10    16
2019-05-13    16
2019-12-30    16
Name: count, Length: 252, dtype: int64

In [86]:
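# Dates are ISO-formatted strings, so string comparison matches date order; copy to avoid modifying a view of `data`
data_trimmed = data[data['Date'] <= '2019-04-01'].copy()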

In [87]:
data_trimmed.shape

(1008, 8)
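
The cut-off filter above compares `Date` values as strings. That works here because the dates are ISO-formatted (`YYYY-MM-DD`), so alphabetical order matches chronological order. A sketch of the same filter on parsed datetimes, which does not rely on the string format, could look like this. The toy frame and the helper name `trim_to_date` are illustrative.

```python
import pandas as pd

def trim_to_date(df, date_column, cutoff):
    """Keep rows whose date is on or before `cutoff`; returns a copy."""
    dates = pd.to_datetime(df[date_column])
    return df[dates <= pd.Timestamp(cutoff)].copy()

# Toy stand-in for `data`: one row before, one on, and one after the cut-off
toy = pd.DataFrame({
    'Symbol': ['AAPL', 'AAPL', 'AAPL'],
    'Date': ['2019-03-29', '2019-04-01', '2019-04-02'],
})

trimmed = trim_to_date(toy, 'Date', '2019-04-01')
print(trimmed['Date'].tolist())
```

The cut-off is inclusive, matching the `<=` comparison used in the cell above.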

In [88]:
data_trimmed.Symbol.nunique()

16

In [89]:
data_trimmed.Symbol.unique()

array(['AAPL', 'AMZN', 'MSFT', 'FB', 'TSLA', 'GOOGL', 'GOOG', 'NVDA',
       'ADBE', 'PYPL', 'NFLX', 'INTC', 'CMCSA', 'PEP', 'CSCO', 'COST'],
      dtype=object)

In [90]:
data_trimmed.Date.value_counts()

Date
2018-12-31    16
2019-03-11    16
2019-02-20    16
2019-02-21    16
2019-02-22    16
              ..
2019-02-07    16
2019-02-08    16
2019-02-11    16
2019-02-12    16
2019-04-01    16
Name: count, Length: 63, dtype: int64

In [91]:
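# Drop rows with a missing Date; calling dropna(inplace=True) on a single column does not remove rows from the frame
data_trimmed = data_trimmed.dropna(subset=['Date'])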

In [92]:
data_trimmed.Date.nunique()

63

In [93]:
data_trimmed.head()

Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,AAPL,2018-12-31,39.6325,39.435001,140014000,737873400000.0,Technology,Computer Manufacturing
1,AAPL,2019-01-02,38.7225,39.48,148158800,737873400000.0,Technology,Computer Manufacturing
2,AAPL,2019-01-03,35.994999,35.547501,365248800,737873400000.0,Technology,Computer Manufacturing
3,AAPL,2019-01-04,36.1325,37.064999,234428400,737873400000.0,Technology,Computer Manufacturing
4,AAPL,2019-01-07,37.174999,36.982498,219111200,737873400000.0,Technology,Computer Manufacturing


## 1.2 What are Context Columns?
A **context** column does not change during the course of a sequence. In our case, `Sector` and `Industry` are context columns, and `MarketCap` is treated as one when the synthesizer is created in section 2.1.

If we choose a sequence -- such as Amazon (`Symbol='AMZN'`) -- then we'll see that the context values don't change. Amazon is always a `'Consumer Services'` company.
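
One way to confirm this is to count the distinct values of each candidate column inside every sequence. The sketch below uses a toy frame in place of the demo data, and the helper name `constant_within_sequences` is made up for illustration.

```python
import pandas as pd

def constant_within_sequences(df, sequence_key, columns):
    """Return, per column, whether it has at most one distinct value in every sequence."""
    distinct = df.groupby(sequence_key)[columns].nunique(dropna=False)
    return (distinct <= 1).all()

# Toy stand-in: 'Sector' is fixed per company, 'Close' changes daily
toy = pd.DataFrame({
    'Symbol': ['AMZN', 'AMZN', 'AAPL', 'AAPL'],
    'Sector': ['Consumer Services', 'Consumer Services', 'Technology', 'Technology'],
    'Close': [1501.97, 1539.13, 39.44, 39.48],
})

result = constant_within_sequences(toy, 'Symbol', ['Sector', 'Close'])
print(result.to_dict())
```

A column that comes back `True` for every sequence is a candidate for `context_columns`. A column that comes back `False` changes within a sequence and should stay a regular measurement.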

In [73]:
data_trimmed.to_excel('data_trimmed.xlsx')

# 2. Basic Usage

## 2.1 Creating a Synthesizer

An SDV **synthesizer** is an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data.

In [95]:
from sdv.sequential import PARSynthesizer

synthesizer = PARSynthesizer(
    metadata,
    epochs=500,
    context_columns=['MarketCap', 'Sector', 'Industry'],
    enforce_min_max_values=True,
    verbose=True)

synthesizer.fit(data_trimmed)

Loss (-4.864): 100%|█████████████████████████████████████████████████████████████████| 500/500 [00:41<00:00, 12.08it/s]


## 2.2 Generating Synthetic Data

Use the `sample` method and pass in the number of sequences to synthesize. By default, the synthesizer algorithmically determines how long to make each sequence.

In [48]:
# num_sequences is the number of sequences (companies) to create, not the number of rows
synthetic_data = synthesizer.sample(num_sequences=1008)
synthetic_data.head()


The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.


100%|██████████████████████████████████████████████████████████████████████████████| 1008/1008 [04:58<00:00,  3.38it/s]


Unnamed: 0,Symbol,Date,Open,Close,Volume,MarketCap,Sector,Industry
0,AAFC,,85.980261,99.023256,27711095,422766000000.0,Miscellaneous,Consumer Electronics/Video Chains
1,AAFC,,88.404471,100.91242,10806184,422766000000.0,Miscellaneous,Consumer Electronics/Video Chains
2,AAFC,,55.546494,34.049999,21019612,422766000000.0,Miscellaneous,Consumer Electronics/Video Chains
3,AAFC,,33.490002,34.049999,34958548,422766000000.0,Miscellaneous,Consumer Electronics/Video Chains
4,AAFC,,44.669461,34.049999,36660069,422766000000.0,Miscellaneous,Consumer Electronics/Video Chains


In [98]:
data.shape

(4032, 8)

In [99]:
data.to_excel('de_real_data1.xlsx')

In [97]:
synthetic_data.to_excel('de_synthetic_data1.xlsx')

In [96]:
synthetic_data.shape

(63504, 8)

In [51]:
synthetic_data.Date.value_counts(dropna = False)

Date
NaN           62499
2018-12-31      207
2019-01-01       86
2019-01-03       52
2019-01-02       45
              ...  
2019-02-21        2
2019-02-25        1
2019-03-24        1
2019-03-25        1
2019-03-26        1
Name: count, Length: 87, dtype: int64
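
In the counts above, 62499 of the 63504 synthetic rows have no `Date`. A quick sketch for measuring the missing share of every column, shown on a toy frame with illustrative names:

```python
import pandas as pd

# Toy stand-in for synthetic_data with mostly missing dates
toy_synthetic = pd.DataFrame({
    'Symbol': ['AAFC', 'AAFC', 'AAFD', 'AAFD'],
    'Date': [None, '2018-12-31', None, None],
    'Close': [99.02, 100.91, 34.05, 34.05],
})

# Fraction of missing values per column, from 0.0 (none) to 1.0 (all)
missing_share = toy_synthetic.isna().mean()
print(missing_share.to_dict())
```

Applied to `synthetic_data`, the `Date` entry would come out near 0.98 (62499 / 63504).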

In [52]:
synthetic_data.Date.unique()

array([nan, '2018-12-31', '2019-01-02', '2019-01-01', '2019-01-03',
       '2019-01-05', '2019-01-06', '2019-01-07', '2019-01-04',
       '2019-01-08', '2019-01-11', '2019-01-12', '2019-01-15',
       '2019-01-16', '2019-01-17', '2019-01-18', '2019-01-19',
       '2019-01-20', '2019-01-21', '2019-01-23', '2019-01-24',
       '2019-01-25', '2019-01-26', '2019-01-27', '2019-01-28',
       '2019-01-30', '2019-01-31', '2019-02-01', '2019-02-02',
       '2019-02-04', '2019-02-05', '2019-02-06', '2019-02-07',
       '2019-02-09', '2019-02-10', '2019-02-11', '2019-02-12',
       '2019-02-13', '2019-02-14', '2019-02-15', '2019-02-16',
       '2019-02-17', '2019-02-18', '2019-02-19', '2019-02-20',
       '2019-02-22', '2019-02-23', '2019-01-09', '2019-01-10',
       '2019-01-14', '2019-01-22', '2019-01-29', '2019-02-03',
       '2019-01-13', '2019-02-08', '2019-02-21', '2019-02-25',
       '2019-02-26', '2019-02-27', '2019-02-28', '2019-03-01',
       '2019-03-02', '2019-03-04', '2019-03-05', ...

In [53]:
synthetic_data.Date.nunique()

86

In [54]:
data_trimmed.Date.nunique()

63

In [55]:
synthetic_data.Symbol.nunique()

1008

In [56]:
synthetic_data.shape

(63504, 8)
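
63504 rows across 1008 sequences is an average of 63 rows per sequence, the same length as every sequence in `data_trimmed`. A sketch that reads these numbers directly from a sampled table, using a toy frame (the helper name `summarize_sequences` is hypothetical):

```python
import pandas as pd

def summarize_sequences(df, sequence_key):
    """Return (number of sequences, shortest length, longest length)."""
    lengths = df.groupby(sequence_key).size()
    return len(lengths), int(lengths.min()), int(lengths.max())

# Toy stand-in for synthetic_data: two sequences of three rows each
toy_synthetic = pd.DataFrame({
    'Symbol': ['AAFC'] * 3 + ['AAFD'] * 3,
    'Close': [99.0, 100.9, 34.1, 12.0, 12.5, 13.1],
})

n_sequences, shortest, longest = summarize_sequences(toy_synthetic, 'Symbol')
print(n_sequences, shortest, longest)

# When every sequence has the same length, rows = sequences * length
print(n_sequences * longest == len(toy_synthetic))
```

If the shortest and longest lengths differ, the sampled sequences do not all have the same number of rows.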

In [57]:
loss_values = synthesizer.get_loss_values()

In [58]:
import plotly.express as px

loss_df = synthesizer.get_loss_values()
# loss_df['Deepecho_loss'] = loss_df['Loss'].apply(lambda x: x.item())

In [59]:
fig = px.line(loss_df, x='Epoch', y=['Loss'])
fig.update_layout(template='plotly_white', legend_title_text='', legend_orientation='v', legend=dict(x=1.1, y=0.3))

fig.show()

The synthesizer is generating entirely new sequences in the same format as the real data. **Each sequence represents an entirely new company** based on the overall patterns from the dataset. **_They do not map or correspond to any real company._**

For example, fictitious company `AAFC` is a Consumer Electronics/Video Chains company and `AAFD` is a Computer Software: Prepackaged Software company. A list of our synthetic companies is shown below.

In [60]:
synthetic_data.shape

(63504, 8)

In [61]:
synthetic_data[['Symbol', 'Sector', 'Industry', 'MarketCap']].groupby(['Symbol']).first().reset_index()

Unnamed: 0,Symbol,Sector,Industry,MarketCap
0,AAFC,Miscellaneous,Consumer Electronics/Video Chains,4.227660e+11
1,AAFD,Technology,Computer Software: Prepackaged Software,
2,AAFE,Consumer Services,"Computer Software: Programming, Data Processing",4.873686e+11
3,AAFF,Technology,Catalog/Specialty Distribution,7.182782e+11
4,AAFG,Technology,Computer Software: Prepackaged Software,7.266094e+11
...,...,...,...,...
1003,ABRR,,Semiconductors,1.904565e+11
1004,ABRS,Miscellaneous,Auto Manufacturing,5.436237e+10
1005,ABRT,Technology,Computer Communications Equipment,2.644218e+11
1006,ABRU,Technology,Computer Software: Prepackaged Software,7.368885e+11


In [62]:
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    data_trimmed,
    synthetic_data,
    metadata)

Generating report ...
(1/2) Evaluating Column Shapes: : 100%|██████████████████████████████████████████████████| 8/8 [00:00<00:00, 95.19it/s]
(2/2) Evaluating Column Pair Trends: : 100%|███████████████████████████████████████████| 28/28 [00:00<00:00, 36.03it/s]

Overall Score: 61.44%

Properties:
- Column Shapes: 77.46%
- Column Pair Trends: 45.42%
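
The overall score reported above is consistent with a plain average of the two property scores. The arithmetic, as a short check with illustrative variable names:

```python
# Property scores copied from the report output above
column_shapes = 77.46
column_pair_trends = 45.42

# Unweighted mean of the two properties
overall = (column_shapes + column_pair_trends) / 2
print(f"{overall:.2f}%")
```

The low Column Pair Trends score is therefore what pulls the overall score down.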


In [63]:
fig = quality_report.get_visualization(property_name='Column Shapes')
fig.show()

In [64]:
import plotly.subplots as sp
import plotly.graph_objects as go 
from sdv.evaluation.single_table import get_column_plot


In [65]:
fig = get_column_plot(
    real_data=data_trimmed,
    synthetic_data=synthetic_data,
    column_name='Volume',
    metadata=metadata
)
fig.show()

In [66]:
fig = get_column_plot(
    real_data=data_trimmed,
    synthetic_data=synthetic_data,
    column_name='Close',
    metadata=metadata
)
fig.show()

In [100]:
fig = get_column_plot(
    real_data=data_trimmed,
    synthetic_data=synthetic_data,
    column_name='Date',
    metadata=metadata
)
fig.show()

In [67]:
fig = get_column_plot(
    real_data=data_trimmed,
    synthetic_data=synthetic_data,
    column_name='Open',
    metadata=metadata
)
fig.show()

In [38]:
#pip install --upgrade nbformat

## 2.3 Saving and Loading
We can save the synthesizer to share with others and sample more synthetic data in the future.

In [130]:
# synthesizer.save('my_synthesizer.pkl')

# synthesizer = PARSynthesizer.load('my_synthesizer.pkl')

# 3. PAR Customization

We can customize our PARSynthesizer in many ways.

- Use the `epochs` parameter to make a tradeoff between training time and data quality. Higher epochs mean the synthesizer will train for longer, ideally improving the data quality.
- Use the `enforce_min_max_values` parameter to specify whether the synthesized data should always be within the same min/max ranges as the real data. Toggle this to `False` in order to enable forecasting.
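
To see what `enforce_min_max_values` is meant to guarantee, one can compare the range of a synthetic column against the real one after sampling. This is a minimal sketch on toy frames. The helper `within_real_range` is hypothetical and not an SDV function.

```python
import pandas as pd

def within_real_range(real, synthetic, column):
    """True if every non-missing synthetic value lies inside the real min/max."""
    low, high = real[column].min(), real[column].max()
    values = synthetic[column].dropna()
    # Series.between includes both endpoints by default
    return bool(values.between(low, high).all())

real = pd.DataFrame({'Close': [35.5, 37.1, 39.5]})
in_range = pd.DataFrame({'Close': [36.0, 39.5, None]})
forecast_like = pd.DataFrame({'Close': [36.0, 41.2]})

print(within_real_range(real, in_range, 'Close'))
print(within_real_range(real, forecast_like, 'Close'))
```

With `enforce_min_max_values=True`, the first result is the expected one. With `False`, values outside the real range become possible, which is what forecasting needs.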


In [131]:
# custom_synthesizer = PARSynthesizer(
#     metadata,
#     epochs=250,
#     context_columns=['Sector', 'Industry'],
#     enforce_min_max_values=False,
#     verbose=True)

# custom_synthesizer.fit(real_data)

<font color="maroon"><i><b>This step takes about 10 min to complete.</b> We can use the `verbose` parameter to track progress. For larger datasets, this phase may take longer.</i></font>

# 4. Sampling Options
Using the PAR synthesizer, you can customize the synthetic data to suit your needs.

## 4.1 Specify Sequence Length

By default, the synthesizer algorithmically determines the length of each sequence. However, you can also specify a fixed, predetermined length.

In [132]:
# custom_synthesizer.sample(num_sequences=3, sequence_length=2)

In [134]:
# long_sequence = custom_synthesizer.sample(num_sequences=1, sequence_length=500)

# 5. What's Next?

For more information about the PAR Synthesizer, visit the **[documentation](https://docs.sdv.dev/sdv/sequential-data/modeling/parsynthesizer)**.

**Need more help?** [Browse all tutorials](https://docs.sdv.dev/sdv/demos).
