# The Dataset and MultiDataset objects

Both the **Dataset** and **MultiDataset** objects are central to making the most of YData's package features.
- The **Dataset** object is an abstraction over the Dask engine that lets you scale any data preparation job easily. If you prefer to work directly with core Python libraries such as Dask, pandas or NumPy, you can convert a Dataset through the following methods:
    - to_dask: if you need scalability and want to make the most of Dask's native API.
    - to_pandas: if you want to keep it as pythonic as possible and your workloads do not require scale.
    - NumPy: if arrays are your thing and your workloads do not require scale.

The **MultiDataset** object is a composite of the **Dataset** object: it represents a group of Datasets.

## The Dataset object

In [1]:
import pandas as pd
from ydata.dataset import Dataset

It is straightforward to read data with pandas and then convert it to a Dataset.
Converting to a Dataset not only lets you handle larger scales and DAG-like workloads, it also lets you feed the data into downstream applications such as `Metadata`, the **ydata** `Synthesizers` and `ProfileReport`.

In [8]:
data = pd.read_csv('{insert-csv}.csv')

#Create the dataset object
dataset = Dataset(data)

#Getting some info from the Dataset
#Schema - Columns and variable types
print('\033[1m Dataset schema \033[0m')
print(dataset.schema)

#Nrows - Number of rows
print(dataset.nrows)

print("\n\033[1m Dataset shape - Number of training rows and columns for both training and holdout \033[0m")
print(dataset.shape(lazy_eval=False))

Dataset schema
{'age': <VariableType.INT: 'int'>, 'gender': <VariableType.FLOAT: 'float'>, 'height': <VariableType.INT: 'int'>, 'weight': <VariableType.FLOAT: 'float'>, 'ap_hi': <VariableType.INT: 'int'>, 'ap_lo': <VariableType.INT: 'int'>, 'cholesterol': <VariableType.INT: 'int'>, 'gluc': <VariableType.INT: 'int'>, 'smoke': <VariableType.INT: 'int'>, 'alco': <VariableType.FLOAT: 'float'>, 'active': <VariableType.FLOAT: 'float'>, 'cardio': <VariableType.FLOAT: 'float'>}
110982

Dataset shape - Number of training rows and columns for both training and holdout
(110982, 12)


In [27]:
print("\n\033[1m Dataset total memory usage \033[0m")
print(dataset.memory_usage.compute())


Dataset total memory usage
Index          887856
active         887856
age            887856
alco           887856
ap_hi          887856
ap_lo          887856
cardio         887856
cholesterol    887856
gender         887856
gluc           887856
height         887856
smoke          887856
weight         887856
dtype: int64
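
The per-column figure above can be checked by hand. The sketch below assumes that every column, including the index, is stored as a 64-bit integer or float (8 bytes per value), which is consistent with the reported numbers but is not stated by the library output itself.

```python
# Figures copied from the outputs above.
N_ROWS = 110_982        # dataset.nrows
BYTES_PER_VALUE = 8     # assumption: int64 / float64 storage for every column
N_SERIES = 13           # 12 data columns plus the index

per_column_bytes = N_ROWS * BYTES_PER_VALUE
total_bytes = per_column_bytes * N_SERIES

print(per_column_bytes)                      # matches the 887856 reported per column
print(total_bytes)                           # sum over all 13 entries
print(f"{total_bytes / 1024**2:.2f} MiB")    # same total, in mebibytes
```

Summing the per-column entries this way gives the in-memory footprint of the whole table, which is a quick sanity check before deciding whether a pandas conversion is affordable.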


### Interact with the data

#Getting the n first rows from a dataset
dataset.head(n=100)

In [10]:
#Get the number of columns and number of rows
print(dataset.ncols, dataset.nrows)

12 110982


In [21]:
#Create new Dataset that results from the applied transformation
dataset.apply(lambda row: row.age/360, axis=1).head(100)

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=(None, 'float64'))



| idx | 0 |
|---|---|
| 0 | 51.091667 |
| 1 | 56.188889 |
| 2 | 52.380556 |
| 3 | 48.952778 |
| 4 | 48.538889 |
| ... | ... |
| 95 | 58.544444 |
| 96 | 53.494444 |
| 97 | 51.138889 |
| 98 | 60.722222 |
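
To make the transformation concrete, here is a plain-Python sketch of the same row-wise computation. It assumes that `age` is recorded in days (the source does not say so explicitly) and reuses the divisor 360 from the cell above; the helper name `to_years` is hypothetical.

```python
# First five `age` values, copied from the table printed later in this section.
ages_in_days = [18393, 20228, 18857, 17623, 17474]


def to_years(age_in_days, days_per_year=360):
    """Convert an age in days to years, using the same divisor as the cell above."""
    return age_in_days / days_per_year


converted = [round(to_years(age), 6) for age in ages_in_days]
print(converted)  # [51.091667, 56.188889, 52.380556, 48.952778, 48.538889]
```

The values agree with the first five rows of the output. If the column really is in days, a divisor of 365.25 would give a closer approximation of years.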


In [31]:
#Count distinct values for a given col (column_name)
dataset.value_counts(col='cardio')

TypeError: type of the return value must be dask.dataframe.core.Series; got dict instead
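
The call above failed with a `TypeError` in this run. As a rough illustration of the result such a count is meant to produce, the sketch below tallies distinct values with the standard library. It is not the library's implementation and does not fix the error; it only shows the idea on a handful of rows.

```python
from collections import Counter

# `cardio` values of the first ten rows, copied from the table printed later in this section.
cardio = [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

# Counter maps each distinct value to the number of times it occurs.
counts = Counter(cardio)
print(dict(counts))  # {0.0: 6, 1.0: 4}
```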

In [37]:
#Select the columns from the dataset based on the existing dtypes
#Valid dtypes include int, string, float, date and datetime 
dataset.select_dtypes('float').head(100)

| idx | gender | weight | alco | active | cardio |
|---|---|---|---|---|---|
| 0 | 2.0 | 62.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 85.0 | 0.0 | 1.0 | 1.0 |
| 2 | 1.0 | 64.0 | 0.0 | 0.0 | 1.0 |
| 3 | 2.0 | 82.0 | 0.0 | 1.0 | 1.0 |
| 4 | 1.0 | 56.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 95 | 1.0 | 53.0 | 0.0 | 1.0 | 0.0 |
| 96 | 2.0 | 65.0 | 0.0 | 1.0 | 0.0 |
| 97 | 1.0 | 99.0 | 0.0 | 0.0 | 1.0 |
| 98 | 2.0 | 100.0 | 0.0 | 0.0 | 1.0 |
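
The selected columns follow directly from the schema printed earlier. The sketch below reproduces the selection on a plain dictionary; the variable types are written as strings for simplicity, and the helper `columns_of_type` is hypothetical, not part of the ydata API.

```python
# Schema copied from the `dataset.schema` output above, with types as plain strings.
schema = {
    "age": "int", "gender": "float", "height": "int", "weight": "float",
    "ap_hi": "int", "ap_lo": "int", "cholesterol": "int", "gluc": "int",
    "smoke": "int", "alco": "float", "active": "float", "cardio": "float",
}


def columns_of_type(schema, dtype):
    """Return the column names whose variable type equals `dtype`, in schema order."""
    return [name for name, var_type in schema.items() if var_type == dtype]


float_columns = columns_of_type(schema, "float")
print(float_columns)  # ['gender', 'weight', 'alco', 'active', 'cardio']
```

The result lists the same five columns, in the same order, as the output of `select_dtypes('float')`.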


In [39]:
#Calculate number of unique observations for a given column
dataset.uniques('active')

3

In [43]:
#Drop columns from the Dataset object
# If inplace=True, the current Dataset object is changed. Otherwise the method creates a new Dataset object
filter_dataset = dataset.drop_columns('gender', inplace=False)
filter_dataset.head(10)

| idx | age | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18393 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0.0 | 1.0 | 0.0 |
| 1 | 20228 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0.0 | 1.0 | 1.0 |
| 2 | 18857 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0.0 | 0.0 | 1.0 |
| 3 | 17623 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0.0 | 1.0 | 1.0 |
| 4 | 17474 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0.0 | 0.0 | 0.0 |
| 5 | 21914 | 151 | 67.0 | 120 | 80 | 2 | 2 | 0 | 0.0 | 0.0 | 0.0 |
| 6 | 22113 | 157 | 93.0 | 130 | 80 | 3 | 1 | 0 | 0.0 | 1.0 | 0.0 |
| 7 | 22584 | 178 | 95.0 | 130 | 90 | 3 | 3 | 0 | 0.0 | 1.0 | 1.0 |
| 8 | 17668 | 158 | 71.0 | 110 | 70 | 1 | 1 | 0 | 0.0 | 1.0 | 0.0 |
| 9 | 19834 | 164 | 68.0 | 110 | 60 | 1 | 1 | 0 | 0.0 | 0.0 | 0.0 |
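
The difference between the two `inplace` modes can be illustrated with a toy class. `TinyTable` below is a hypothetical stand-in, not the ydata `Dataset`; in particular, returning `None` when `inplace=True` follows the pandas convention and is an assumption about the real method.

```python
class TinyTable:
    """Hypothetical column store used only to illustrate the `inplace` flag."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def drop_columns(self, name, inplace=False):
        remaining = {k: v for k, v in self.columns.items() if k != name}
        if inplace:
            # Mutate this object and return nothing (assumed convention).
            self.columns = remaining
            return None
        # Leave this object alone and hand back a new one.
        return TinyTable(remaining)


table = TinyTable({"age": [18393, 20228], "gender": [2.0, 1.0], "height": [168, 156]})

filtered = table.drop_columns("gender", inplace=False)
print(list(filtered.columns))  # new object without 'gender'
print(list(table.columns))     # original still has 'gender'

table.drop_columns("gender", inplace=True)
print(list(table.columns))     # original has now changed
```

Keeping `inplace=False` is usually the safer choice in a notebook, because re-running a cell then cannot silently alter the object that earlier cells depend on.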


In [45]:
#Subset the Dataset based on a provided list of columns
subset_dataset = dataset[['age', 'height', 'weight', 'ap_hi']]

#Get the last n records from the dataset
subset_dataset.tail(n=10)

| idx | age | height | weight | ap_hi |
|---|---|---|---|---|
| 0 | 18393 | 168 | 62.0 | 110 |
| 1 | 20228 | 156 | 85.0 | 140 |
| 2 | 18857 | 165 | 64.0 | 130 |
| 3 | 17623 | 169 | 82.0 | 150 |
| 4 | 17474 | 156 | 56.0 | 100 |
| 5 | 21914 | 151 | 67.0 | 120 |
| 6 | 22113 | 157 | 93.0 | 130 |
| 7 | 22584 | 178 | 95.0 | 130 |
| 8 | 17668 | 158 | 71.0 | 110 |
| 9 | 19834 | 164 | 68.0 | 110 |


In [46]:
#Get a random sample of size n from a Dataset
sample_dataset = dataset.sample(size=1000)
len(sample_dataset)

1000
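
For intuition, drawing a fixed-size random sample amounts to picking row positions at random. The sketch below does this with the standard library, assuming sampling without replacement and a fixed seed; whether `Dataset.sample` behaves the same way is an assumption, not something the source confirms.

```python
import random

N_ROWS = 110_982   # dataset.nrows, from the output earlier in this section
SAMPLE_SIZE = 1000

# A dedicated generator with a fixed seed keeps the draw reproducible.
rng = random.Random(42)

# random.sample draws without replacement, so no row position repeats.
sampled_rows = rng.sample(range(N_ROWS), k=SAMPLE_SIZE)

print(len(sampled_rows))       # number of sampled positions
print(len(set(sampled_rows)))  # number of distinct positions
```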

### Dataset object & YData Connectors

In [50]:
from ydata.connectors.filetype import FileType
from ydata.connectors import GCSConnector
from ydata.utils.formats import read_json

In [53]:
token = read_json('gcs_credentials.json')
conn = GCSConnector(project_id=token['project_id'], keyfile_dict=token)
data = conn.read_file('gs://ydata_testdata/tabular/cardio/data.csv', file_type=FileType.CSV)


+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| toolz   | 0.11.2 | 0.12.0    | None    |
+---------+--------+-----------+---------+


In [54]:
data

<ydata.dataset.dataset.Dataset at 0x7f4ba8c9a4f0>

## The MultiDataset object

In [56]:
from ydata.connectors import MySQLConnector

# Connection settings. Replace the password placeholder with your own credentials
username = 'YDataSQL'
password = '{insert-password}'
hostname = 'ydata.database.windows.net'
database = 'berka'
schema = 'berka'

conn_str = {
    "hostname": hostname,
    "username": username,
    "password": password,
    "port": '3306',
    "database": database,
}

conn = MySQLConnector(conn_string=conn_str)
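
Rather than writing credentials into the notebook, the connection settings can be read from environment variables. The sketch below builds a dictionary with the same keys as `conn_str`; the helper `build_conn_str` and the `YDATA_DB_*` variable names are hypothetical and chosen only for illustration.

```python
import os


def build_conn_str(env=None):
    """Assemble the connector settings from environment variables."""
    env = os.environ if env is None else env
    return {
        "hostname": env["YDATA_DB_HOST"],
        "username": env["YDATA_DB_USER"],
        "password": env["YDATA_DB_PASSWORD"],
        "port": env.get("YDATA_DB_PORT", "3306"),  # fall back to the port used above
        "database": env["YDATA_DB_NAME"],
    }


# Explicit mapping so the sketch runs without touching the real environment.
example_env = {
    "YDATA_DB_HOST": "localhost",
    "YDATA_DB_USER": "reader",
    "YDATA_DB_PASSWORD": "not-a-real-password",
    "YDATA_DB_NAME": "berka",
}

conn_str = build_conn_str(example_env)
print(sorted(conn_str))
```

Calling `build_conn_str()` with no argument would read from the process environment instead, and the resulting dictionary could then be passed to the connector in the same way as the hand-written one.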

In [None]:
#Read the database
database = conn.read_database(table='account')