### Demo "Rossmann"

In this notebook, we show step by step how to download the "Rossmann" dataset, explain its structure, and sample data from a model fitted to it.

The "Rossmann" dataset is available on Kaggle. It contains two tables: "store" and "historical".

**store**

This is the main (parent) table of the dataset. Its primary key is "Store".

It contains supplemental information about the stores.

| Fields                    | Type        | Subtype | Additional Properties |
|---------------------------|-------------|---------|-----------------------|
| Store                     | id          | integer |                       |
| StoreType                 | categorical |         |                       |
| Assortment                | categorical |         |                       |
| PromoInterval             | categorical |         |                       |
| CompetitionOpenSinceYear  | numerical   | integer |                       |
| CompetitionOpenSinceMonth | numerical   | integer |                       |
| CompetitionDistance       | numerical   | integer |                       |
| Promo2                    | boolean     |         |                       |
| Promo2SinceYear           | numerical   | integer |                       |
| Promo2SinceWeek           | numerical   | integer |                       |

**historical**

This is the child table of "store". On Kaggle the file is called "train.csv", but we have renamed it to "historical". Its primary key is "Id" and its foreign key is "Store".

It contains historical data, including sales.

| Fields        | Type        | Subtype | Additional Properties     |
|---------------|-------------|---------|---------------------------|
| Id            | id          | integer |                           |
| Store         | id          | integer | foreign key (store.Store) |
| Date          | datetime    |         | format: "%m/%d/%y"        |
| DayOfWeek     | numerical   | integer |                           |
| Promo         | numerical   | integer |                           |
| StateHoliday  | categorical |         |                           |
| Open          | numerical   | integer |                           |
| SchoolHoliday | numerical   | integer |                           |
| Customers     | numerical   | integer |                           |
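
To make the two schemas concrete, the following minimal sketch parses a date with the format listed above and checks that every "Store" value in "historical" points at an existing row in "store". The rows are made-up illustrative values, not records from the real dataset, and plain Python containers stand in for the tables. The helper names `parse_date` and `orphan_ids` are hypothetical.

```python
from datetime import datetime

# Hypothetical rows, invented for illustration only.
store_rows = [
    {"Store": 1, "Assortment": "a", "Promo2": False},
    {"Store": 2, "Assortment": "c", "Promo2": True},
]
historical_rows = [
    {"Id": 0, "Store": 1, "Date": "07/31/15", "Customers": 555},
    {"Id": 1, "Store": 2, "Date": "07/31/15", "Customers": 625},
    {"Id": 2, "Store": 1, "Date": "07/30/15", "Customers": 546},
]

DATE_FORMAT = "%m/%d/%y"  # format given for the "Date" field above


def parse_date(value):
    """Parse a "Date" string such as "07/31/15" into a date object."""
    return datetime.strptime(value, DATE_FORMAT).date()


def orphan_ids(parent_rows, child_rows):
    """Return the "Id" of every child row whose "Store" has no parent row."""
    parent_keys = {row["Store"] for row in parent_rows}
    return [row["Id"] for row in child_rows if row["Store"] not in parent_keys]


parsed = [parse_date(row["Date"]) for row in historical_rows]
orphans = orphan_ids(store_rows, historical_rows)
print(parsed[0].isoformat(), orphans)
```

An empty list of orphans means the foreign key relationship holds for these rows.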

### 1. Download the demo data

To download the demo data we will use the "load_demo" function. In this example we will use the **rossmann** dataset. Datasets are downloaded from an [Amazon S3 bucket](https://sdv-demos.s3.eu-west-3.amazonaws.com/index.html).

In [None]:
from sdv import load_demo

metadata, tables = load_demo(dataset_name='rossmann', metadata=True)

By default, datasets are downloaded to the "data" folder inside the SDV package. If SDV was installed via pip into a virtual environment, the data is stored inside that environment. You can change the output path using the "data_path" argument.

In [None]:
from sdv import load_demo

metadata, tables = load_demo(
    dataset_name='rossmann',
    data_path='/home/josedavid/.sdv',
    metadata=True
)
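
The path in the cell above is specific to one machine. A portable alternative, sketched here with only the standard library, is to derive the folder from the current user's home directory. The `.sdv` folder name simply mirrors the example above, and `demo_data_dir` is a hypothetical helper, not part of SDV. The resulting string could then be passed as "data_path".

```python
from pathlib import Path


def demo_data_dir(base=None, folder=".sdv"):
    """Build the path of a demo-data folder under `base` (default: the home directory)."""
    root = Path(base) if base is not None else Path.home()
    return root / folder


# This only builds the path; it does not create the folder on disk.
data_path = str(demo_data_dir())
print(data_path)
```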

### 2. Create an instance of SDV

Once the data is downloaded, we can create a new instance of SDV.

In [None]:
from sdv import SDV

sdv = SDV()

### 3. Train the model

Once the SDV object has been created, we must fit the model.

To do so, we call the "fit" method with the metadata and the tables loaded from the CSV files in step 1.

In [None]:
sdv.fit(metadata, tables=tables)

### 4. Data sampling

After fitting the model, we are ready to generate data. To create data for all the tables, we call the "sample_all" method.

In [None]:
samples = sdv.sample_all()

In [None]:
samples['store'].head()

In [None]:
samples['historical'].head()

The "sample_all" method returns a dictionary that maps each table name in the dataset to a dataframe with the sampled rows.

Alternatively, we can sample table by table by calling the "sample" method with the table name and the number of rows to sample.
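
Because the result is keyed by table name, a quick overview is a loop over the dictionary. The sketch below uses lists of rows as a stand-in so that it runs without SDV; with the real output each value would be a dataframe, for which `len()` also gives the number of rows. `row_counts` is a hypothetical helper and the rows are invented.

```python
def row_counts(tables):
    """Map each table name to its number of rows."""
    return {name: len(table) for name, table in tables.items()}


# Stand-in for the dictionary returned by sampling (invented rows).
fake_samples = {
    "store": [{"Store": 1}, {"Store": 2}],
    "historical": [
        {"Id": 0, "Store": 1},
        {"Id": 1, "Store": 1},
        {"Id": 2, "Store": 2},
    ],
}

for name, count in row_counts(fake_samples).items():
    print(f"{name}: {count} rows")
```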

In [None]:
sample_store = sdv.sample('store', 5)
sample_historical = sdv.sample('historical', 5)

### 5. Data evaluation

Once we have generated sample data, we may want to evaluate it.

SDV includes an evaluation package that calculates scores using different descriptors and metrics. In this example, we will use the default metrics and descriptors.

In [None]:
from sdv.evaluation import evaluate

evaluate(tables, samples)
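
To give an intuition for what a descriptor-based comparison measures, the sketch below compares one descriptor (the mean) of a numerical column between real and synthetic values. It is not SDV's evaluation code: the function name and the numbers are made up for illustration, and only the standard library is used.

```python
from statistics import mean


def relative_mean_gap(real_values, synthetic_values):
    """Absolute difference of the two means, relative to the real mean.

    Assumes the real mean is non-zero.
    """
    real_mean = mean(real_values)
    return abs(mean(synthetic_values) - real_mean) / abs(real_mean)


# Invented "Customers" values for illustration only.
real_customers = [500, 600, 700]
synthetic_customers = [540, 600, 720]

gap = relative_mean_gap(real_customers, synthetic_customers)
print(f"relative gap in mean Customers: {gap:.3f}")
```

A gap close to zero indicates that the synthetic column reproduces this particular descriptor well; it says nothing about other properties of the distribution.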