### Demo "walmart"

The dataset describes a set of stores. Each store has a size and a category, plus additional information for a given date: the average temperature in the region, the cost of fuel in the region, promotional data, the consumer price index, the unemployment rate and whether the date is a special holiday.

From those stores we obtained a training set of historical data covering 2010-02-05 to 2012-11-01. This historical data includes the sales of each department on a specific date.
In this notebook, we will show you step by step how to download the "Walmart" dataset, explain its structure and sample the data.

In this demonstration we will show how SDV can be used to generate synthetic data. This data can later be used to train machine learning models.

*The dataset used in this example can be found in [Kaggle](https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data), but we will show how to download it from SDV.*

### Data model summary

<p style="text-align: center"><b>stores</b></p>

| Field | Type        | Subtype | Additional Properties |
|-------|-------------|---------|-----------------------|
| Store | id          | integer | Primary key           |
| Size  | numerical   | integer |                       |
| Type  | categorical |         |                       |

Contains information about the 45 stores, indicating the type and size of each store.

<p style="text-align: center"><b>features</b></p>

| Fields       | Type      | Subtype | Additional Properties       |
|--------------|-----------|---------|-----------------------------|
| Store        | id        | integer | foreign key (stores.Store)  |
| Date         | datetime  |         | format: "%Y-%m-%d"          |
| IsHoliday    | boolean   |         |                             |
| Fuel_Price   | numerical | float   |                             |
| Unemployment | numerical | float   |                             |
| Temperature  | numerical | float   |                             |
| CPI          | numerical | float   |                             |
| MarkDown1    | numerical | float   |                             |
| MarkDown2    | numerical | float   |                             |
| MarkDown3    | numerical | float   |                             |
| MarkDown4    | numerical | float   |                             |
| MarkDown5    | numerical | float   |                             |

Contains additional data related to the store, department, and regional activity for the given dates.

<p style="text-align: center"><b>depts</b></p>

| Fields       | Type      | Subtype | Additional Properties       |
|--------------|-----------|---------|-----------------------------|
| Store        | id        | integer | foreign key (stores.Store)  |
| Date         | datetime  |         | format: "%Y-%m-%d"          |
| Weekly_Sales | numerical | float   |                             |
| Dept         | numerical | integer |                             |
| IsHoliday    | boolean   |         |                             |

Contains historical training data, which covers 2010-02-05 to 2012-11-01.
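
The three tables above can also be written down as a plain Python dictionary. The sketch below is only an illustration: the key names (`primary_key`, `fields`, `ref`, `format`) are assumed to follow the JSON metadata layout SDV used at the time of this demo, and `walmart_metadata` and `relationships` are hypothetical names that are not part of SDV.

```python
# Hypothetical, hand-written description of the data model summarized above.
# The key names are assumptions about SDV's metadata layout, not a verified schema.

def numerical(subtype):
    """Build the entry for a numerical field with the given subtype."""
    return {"type": "numerical", "subtype": subtype}

# Both child tables point to the primary key of `stores`.
store_fk = {"type": "id", "subtype": "integer",
            "ref": {"table": "stores", "field": "Store"}}
date_field = {"type": "datetime", "format": "%Y-%m-%d"}

features_fields = {
    "Store": store_fk,
    "Date": date_field,
    "IsHoliday": {"type": "boolean"},
}
for name in ("Fuel_Price", "Unemployment", "Temperature", "CPI"):
    features_fields[name] = numerical("float")
for i in range(1, 6):
    features_fields[f"MarkDown{i}"] = numerical("float")

walmart_metadata = {
    "tables": {
        "stores": {
            "primary_key": "Store",
            "fields": {
                "Store": {"type": "id", "subtype": "integer"},
                "Size": numerical("integer"),
                "Type": {"type": "categorical"},
            },
        },
        "features": {"fields": features_fields},
        "depts": {
            "fields": {
                "Store": store_fk,
                "Date": date_field,
                "Weekly_Sales": numerical("float"),
                "Dept": numerical("integer"),
                "IsHoliday": {"type": "boolean"},
            },
        },
    }
}

def relationships(metadata):
    """List (child_table, child_field, parent_table, parent_field) tuples."""
    found = []
    for table_name, table in metadata["tables"].items():
        for field_name, field in table["fields"].items():
            ref = field.get("ref")
            if ref:
                found.append((table_name, field_name, ref["table"], ref["field"]))
    return found

for relation in relationships(walmart_metadata):
    print(relation)
```

Listing the relationships this way makes it easy to see that `features` and `depts` are both children of `stores` through the `Store` field.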

### 1. Load data

Let's start by downloading the data set, in this case *walmart*. We will use the SDV function `load_demo`, where we can specify the name of the dataset we want and whether we also want its Metadata object. To learn more about the demo data, [see the documentation](https://sdv-dev.github.io/SDV/api/sdv.demo.html).

In [1]:
from sdv import load_demo

metadata, tables = load_demo(dataset_name='walmart', metadata=True)

INFO - Loading table stores
INFO - Loading table features
INFO - Loading table depts


Our dataset is downloaded from an [Amazon S3 bucket](http://sdv-datasets.s3.amazonaws.com/index.html) that contains all the data sets available through the `load_demo` function.

### 2. Create an instance of SDV and train the instance

Once the data is downloaded, we have to create an SDV instance and use it to analyze the loaded tables and build a statistical model of the data. In this case, fitting the model is quick because the dataset is small. With larger datasets, however, it can be a slow process.

In [2]:
from sdv import SDV

sdv = SDV()
sdv.fit(metadata, tables=tables)

INFO - Modeling stores
INFO - Loading transformer CategoricalTransformer for field Type
INFO - Loading transformer NumericalTransformer for field Size
INFO - Modeling depts
INFO - Loading transformer DatetimeTransformer for field Date
INFO - Loading transformer NumericalTransformer for field Weekly_Sales
INFO - Loading transformer NumericalTransformer for field Dept
INFO - Loading transformer BooleanTransformer for field IsHoliday
INFO - Modeling features
INFO - Loading transformer DatetimeTransformer for field Date
INFO - Loading transformer NumericalTransformer for field MarkDown1
INFO - Loading transformer BooleanTransformer for field IsHoliday
INFO - Loading transformer NumericalTransformer for field MarkDown4
INFO - Loading transformer NumericalTransformer for field MarkDown3
INFO - Loading transformer NumericalTransformer for field Fuel_Price
INFO - Loading transformer NumericalTransformer for field Unemployment
INFO - Loading transformer NumericalTransformer for field Temperatur

Note: We may not want to train the model every time we want to generate new synthetic data. We can [save](https://sdv-dev.github.io/SDV/api/sdv.sdv.html#sdv.sdv.SDV.save) the SDV instance and [load](https://sdv-dev.github.io/SDV/api/sdv.sdv.html#sdv.sdv.SDV.load) it later.
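
The "fit once, reuse later" pattern from the note can be sketched with the standard library alone. This is not SDV code: it assumes that saving and loading an instance amounts to a pickle round trip, and `fitted_state` is a hypothetical stand-in for a trained SDV instance. With SDV itself, the corresponding calls would presumably be `sdv.save(path)` and `SDV.load(path)`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted SDV instance (the real object is richer).
fitted_state = {"tables": ["stores", "features", "depts"], "fitted": True}

path = os.path.join(tempfile.mkdtemp(), "sdv_walmart.pkl")

# Assumed equivalent of `sdv.save(path)`: serialize the fitted object to disk.
with open(path, "wb") as output:
    pickle.dump(fitted_state, output)

# Assumed equivalent of `SDV.load(path)`: restore it in a later session,
# without repeating the modeling step.
with open(path, "rb") as saved:
    restored = pickle.load(saved)

print(restored["fitted"])
```

Because the file is a pickle, it should only be loaded from a trusted source and with a compatible version of the library that wrote it.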

### 3. Generate synthetic data

Once the instance is trained, we are ready to generate the synthetic data.

The easiest way to generate synthetic data for the entire dataset is to call the `sample_all` method. By default, this method generates only 5 rows, but we can specify the number of rows to generate with the `num_rows` argument. To learn more about the available arguments, see [sample_all](https://sdv-dev.github.io/SDV/api/sdv.sampler.html#sdv.sampler.Sampler.sample_all).

In [3]:
samples = sdv.sample_all()

This returns a dictionary with a `pandas.DataFrame` for each table.

In [4]:
samples['stores'].head()

|   | Type | Size   | Store |
|---|------|--------|-------|
| 0 | A    | 246421 | 0     |
| 1 | B    | 101653 | 1     |
| 2 | B    | 149370 | 2     |
| 3 | B    | 86468  | 3     |
| 4 | C    | 7703   | 4     |


In [5]:
samples['features'].head()

|   | Date | MarkDown1 | Store | IsHoliday | MarkDown4 | MarkDown3 | Fuel_Price | Unemployment | Temperature | MarkDown5 | MarkDown2 | CPI |
|---|------|-----------|-------|-----------|-----------|-----------|------------|--------------|-------------|-----------|-----------|-----|
| 0 | 2011-10-09 03:08:43.017056768 | 6893.109081 | 0 | False | NaN | NaN | 3.347564 | NaN | 64.794222 | 6653.340383 | 65.129589 | 146.314647 |
| 1 | 2011-12-11 06:28:45.549379072 | NaN | 0 | False | 2028.10673 | NaN | 3.338502 | 6.839839 | 62.70754 | NaN | 8321.455672 | 149.725961 |
| 2 | 2010-01-14 08:59:13.278471680 | -4611.501659 | 0 | False | NaN | -54633.624043 | 3.514647 | 5.800018 | 45.156123 | NaN | NaN | 173.136556 |
| 3 | 2011-11-03 09:36:35.068498176 | 10198.372303 | 0 | False | NaN | -8417.349682 | 3.195244 | 5.856891 | 70.328883 | 7021.08613 | NaN | 152.523479 |
| 4 | 2012-09-04 16:15:21.479766272 | 14899.947285 | 0 | False | NaN | NaN | 2.944646 | 6.887388 | 85.053296 | 12144.532402 | 10886.509626 | 139.517963 |


In [6]:
samples['depts'].head()

|   | Date                          | Weekly_Sales | Store | Dept | IsHoliday |
|---|-------------------------------|--------------|-------|------|-----------|
| 0 | 2011-07-05 05:39:04.576852736 | 51183.491903 | 0     | 33   | False     |
| 1 | 2011-11-19 13:41:56.459749888 | 25908.64571  | 0     | 37   | False     |
| 2 | 2011-10-13 16:16:41.476217600 | 9646.520687  | 0     | 71   | False     |
| 3 | 2011-09-11 03:56:14.725833472 | 23035.729599 | 0     | 64   | False     |
| 4 | 2010-11-19 20:35:42.432448768 | 11745.744019 | 0     | 13   | False     |
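
Since `sample_all` returns one `pandas.DataFrame` per table, a quick way to inspect the result is to check that every `Store` value in the child tables also exists in `stores`. The sketch below uses small stand-in DataFrames holding only the key columns from the outputs above. The helper `orphan_rows` is a hypothetical name, not an SDV function.

```python
import pandas as pd

# Stand-in for the dictionary returned by `sdv.sample_all()`,
# reduced to the key columns shown in the outputs above.
samples = {
    "stores": pd.DataFrame({"Store": [0, 1, 2, 3, 4],
                            "Type": ["A", "B", "B", "B", "C"]}),
    "features": pd.DataFrame({"Store": [0, 0, 0, 0, 0]}),
    "depts": pd.DataFrame({"Store": [0, 0, 0, 0, 0],
                           "Dept": [33, 37, 71, 64, 13]}),
}

def orphan_rows(samples, child, parent="stores", key="Store"):
    """Return the rows of `child` whose key has no match in `parent`."""
    known_keys = samples[parent][key]
    table = samples[child]
    return table[~table[key].isin(known_keys)]

for child in ("features", "depts"):
    print(child, "rows without a matching store:", len(orphan_rows(samples, child)))
```

An empty result for both child tables means the sampled foreign keys are consistent with the sampled `stores` table.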


We may not want to generate data for every table in the dataset, but for just one of them. SDV supports this through the `sample` method: we only need to specify the name of the table we want to synthesize and the number of rows to generate. In this case, the "walmart" data set has 3 tables: stores, features and depts.

In the following example, we will generate 1000 rows of the "features" table.

In [7]:
sdv.sample('features', 1000)

{'features':                              Date     MarkDown1  Store  IsHoliday  \
 0   2012-08-23 10:52:37.774681344  13966.037644      5      False   
 1   2011-07-14 19:04:23.880190720           NaN      5      False   
 2   2011-05-15 00:21:00.399277056           NaN      5      False   
 3   2011-09-29 16:04:14.657976576  15692.818648      5      False   
 4   2012-02-02 01:14:58.155269888   9960.687695      5      False   
 5   2011-03-31 09:38:41.608038144  11183.251948      5      False   
 6   2012-09-30 13:46:30.558435840   7548.289858      5      False   
 7   2012-11-21 09:12:59.703204864   5806.214722      5      False   
 8   2011-01-16 20:32:24.967801344           NaN      5      False   
 9   2010-03-22 09:16:19.507253504           NaN      5      False   
 10  2013-03-10 07:24:26.572299520   7023.819698      5      False   
 11  2009-09-25 19:41:30.437349376           NaN      5      False   
 12  2010-04-30 04:40:04.605824512           NaN      5      False   
 13  201

SDV has tools to evaluate the generated synthetic data and compare it with the original data. To learn more about the evaluation tools, [see the documentation](https://sdv-dev.github.io/SDV/api/sdv.evaluation.html#sdv.evaluation.evaluate).
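
Independently of those tools, a simple sanity check is to compare sampled values against what we know about the real data. The sketch below is not part of SDV's evaluation module: it takes a few `Date` values copied from the `features` output above and counts how many fall outside the training window of 2010-02-05 to 2012-11-01. The names `to_day` and `outside` are hypothetical.

```python
from datetime import date, datetime

# Training window stated in the data description.
TRAIN_START = date(2010, 2, 5)
TRAIN_END = date(2012, 11, 1)

# A few sampled timestamps copied from the `features` output above.
sampled_dates = [
    "2012-08-23 10:52:37.774681344",
    "2011-07-14 19:04:23.880190720",
    "2012-11-21 09:12:59.703204864",
    "2013-03-10 07:24:26.572299520",
    "2009-09-25 19:41:30.437349376",
]

def to_day(timestamp):
    """Keep only the calendar day, matching the "%Y-%m-%d" format of the real data."""
    return datetime.strptime(timestamp[:10], "%Y-%m-%d").date()

days = [to_day(timestamp) for timestamp in sampled_dates]
outside = [day for day in days if not TRAIN_START <= day <= TRAIN_END]

print(f"{len(outside)} of {len(days)} sampled dates fall outside the training window")
```

Checks like this do not replace a proper evaluation, but they quickly show where the synthetic values drift from the original ranges, for example the sampled dates before 2010-02-05 or after 2012-11-01.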