### Demo "Airbnb simplified"

In this notebook, we show step by step how to download the "Airbnb simplified" dataset, explain its structure, and sample data from it.

The original "Airbnb" dataset can be found on Kaggle; the version used here has been simplified. It contains 1000 entries in the "users" table and their 26987 corresponding entries in the "sessions" table.

**users**

This is the main table of the dataset; its primary key is "id".

It contains information about each user.

**sessions**

This is a child table of "users"; it has no primary key, and its foreign key is "user".
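
The parent/child relationship described above can be illustrated with a small, self-contained sketch. It uses plain Python rows rather than the real tables so that it runs without SDV; only the "id" and "user" column names come from the description above, while the other columns, the sample values, and the helper names `orphan_sessions` and `sessions_per_user` are hypothetical.

```python
# Toy rows standing in for the two tables. Only "id" (primary key of users)
# and "user" (foreign key of sessions) come from the dataset description;
# the other columns and all values are hypothetical.
users = [
    {"id": "u1", "age": 34},
    {"id": "u2", "age": 27},
]
sessions = [
    {"user": "u1", "action": "search"},
    {"user": "u1", "action": "book"},
    {"user": "u2", "action": "search"},
]


def orphan_sessions(users, sessions):
    """Return the session rows whose "user" value matches no users "id"."""
    known_ids = {row["id"] for row in users}
    return [row for row in sessions if row["user"] not in known_ids]


def sessions_per_user(sessions):
    """Count how many child rows reference each parent key."""
    counts = {}
    for row in sessions:
        counts[row["user"]] = counts.get(row["user"], 0) + 1
    return counts


print(orphan_sessions(users, sessions))   # rows that break the relationship
print(sessions_per_user(sessions))        # child rows per parent
```

Because "sessions" has no primary key, several rows may share the same "user" value; the foreign key only requires that each value exists as an "id" in "users".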

### 1. Download the demo data

To download the demo data, we will use the `load_demo` function. In this example we will use the **airbnb-simplified** dataset. Datasets are downloaded from an Amazon S3 bucket.

In [1]:
from sdv import load_demo

metadata, tables = load_demo(dataset_name='airbnb-simplified', metadata=True)

INFO - Downloading dataset airbnb-simplified from https://sdv-demos.s3.eu-west-3.amazonaws.com/airbnb-simplified.zip
INFO - Extracting dataset into /home/josedavid/Projects/SDV/sdv/data


By default, datasets will be downloaded to the "data" folder within SDV. If SDV is installed via pip, the data will be stored in the virtual environment. You can change the output path using the "data_path" argument.

In [None]:
from sdv import load_demo

metadata, tables = load_demo(
    dataset_name='airbnb-simplified',
    data_path='/home/josedavid/.sdv',
    metadata=True
)

### 2. Create an instance of SDV

Once the data is downloaded, we can create a new instance of SDV.

In [2]:
from sdv import SDV

sdv = SDV()

### 3. Train the model

Once the SDV object has been created, we must fit the model.

To do so, we call the "fit" method, passing the metadata loaded earlier and the tables read from the CSV files.

In [3]:
sdv.fit(metadata, tables=tables)

INFO - Modeling users
INFO - Loading transformer DatetimeTransformer for field date_account_created
INFO - Loading transformer DatetimeTransformer for field timestamp_first_active
INFO - Loading transformer DatetimeTransformer for field date_first_booking
INFO - Loading transformer CategoricalTransformer for field gender
INFO - Loading transformer NumericalTransformer for field age
INFO - Loading transformer CategoricalTransformer for field signup_method
INFO - Loading transformer CategoricalTransformer for field signup_flow
INFO - Loading transformer CategoricalTransformer for field language
INFO - Loading transformer CategoricalTransformer for field affiliate_channel
INFO - Loading transformer CategoricalTransformer for field affiliate_provider
INFO - Loading transformer CategoricalTransformer for field first_affiliate_tracked
INFO - Loading transformer CategoricalTransformer for field signup_app
INFO - Loading transformer CategoricalTransformer for field first_device_type
INFO - Loa

### 4. Data sampling

After fitting the model, we are ready to generate data. To create data for all the tables, we call the "sample_all" method.

In [4]:
samples = sdv.sample_all()

In [None]:
samples['users'].head()

In [None]:
samples['sessions'].head()

The "sample_all" method returns a dictionary, keyed by table name, with one dataframe for each table in the dataset.

Alternatively, we can sample table by table by calling the "sample" method with the table name and the number of rows to sample.
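
As a minimal sketch of working with such a dictionary, the snippet below counts the rows of each table. The helper name `row_counts` and the `samples_stub` data are hypothetical stand-ins so the example runs without SDV; the helper relies only on `len()`, which pandas dataframes also support, so the same call should work on the real `samples` dictionary.

```python
def row_counts(samples):
    """Map each table name to its number of rows.

    Works for any mapping whose values support len(), which includes
    pandas dataframes.
    """
    return {name: len(table) for name, table in samples.items()}


# Hypothetical stand-in for the dictionary of sampled tables:
# lists of rows are used here instead of dataframes.
samples_stub = {
    "users": [{"id": 1}, {"id": 2}],
    "sessions": [{"user": 1}, {"user": 1}, {"user": 2}],
}

for name, n_rows in row_counts(samples_stub).items():
    print(f"{name}: {n_rows} rows")
```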

In [None]:
sample_users = sdv.sample('users', 5)
sample_sessions = sdv.sample('sessions', 5)

### 5. Data evaluation

Once we have generated sample data, we may want to evaluate it.

SDV implements an evaluation package to calculate scores using different descriptors and metrics. In this example, we will use the default metrics and descriptors.

In [5]:
from sdv.evaluation import evaluate

evaluate(tables, samples)

  return 1 - (numerator / denominator)


mse         inf
rmse        inf
r2_score    NaN
dtype: float64