# The Dataset and MultiDataset objects

Both the **Dataset** and **MultiDataset** objects are central to getting the most out of YData's package features.
- The **Dataset** object is an abstraction over the Dask engine that lets you scale data preparation jobs easily. If you prefer to work directly with core Python libraries such as Dask, pandas or NumPy, you can convert a Dataset with the following methods:
    - to_dask: if you need scalability and want to make full use of the native Dask API.
    - to_pandas: if you want to keep it as pythonic as possible and your workloads do not require scale.
    - to_numpy: if arrays are your thing and your workloads do not require scale.
    
The **MultiDataset** object is a composite of **Dataset** objects: it represents a group of Datasets.

## The Dataset object

In [1]:
from ydata.labs import DataSources

It is straightforward to read data with a pandas connector and then convert it to a Dataset.
Converting to a Dataset not only lets you handle larger volumes of data and DAG-like workloads, it also lets you feed the data into downstream applications such as `Metadata`, **ydata** `Synthesizers` and `ProfileReport`.

In [2]:
datasource = DataSources.get(uid='{uid}', namespace='{namespace}')
dataset = datasource.dataset

#Getting some info from the Dataset
#Schema - Columns and variable types
print('\033[1m Dataset schema \033[0m')
print(dataset.schema)

#Nrows - Number of rows
print(dataset.nrows)

print("\n\033[1m Dataset shape - Number of training rows and columns for both training and holdout \033[0m")
print(dataset.shape(lazy_eval=False))

Dataset schema
{'id': <VariableType.INT: 'int'>, 'age': <VariableType.INT: 'int'>, 'gender': <VariableType.INT: 'int'>, 'height': <VariableType.INT: 'int'>, 'weight': <VariableType.FLOAT: 'float'>, 'ap_hi': <VariableType.INT: 'int'>, 'ap_lo': <VariableType.INT: 'int'>, 'cholesterol': <VariableType.INT: 'int'>, 'gluc': <VariableType.INT: 'int'>, 'smoke': <VariableType.INT: 'int'>, 'alco': <VariableType.INT: 'int'>, 'active': <VariableType.INT: 'int'>, 'cardio': <VariableType.INT: 'int'>}
70000

Dataset shape - Number of training rows and columns for both training and holdout
(70000, 13)


In [3]:
print("\n\033[1m Dataset total memory usage \033[0m")
print(dataset.memory_usage.compute())


Dataset total memory usage
Index             260
active         560000
age            560000
alco           560000
ap_hi          560000
ap_lo          560000
cardio         560000
cholesterol    560000
gender         560000
gluc           560000
height         560000
id             560000
smoke          560000
weight         560000
dtype: int64
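
The per-column figures are consistent with 8-byte storage: 70,000 rows × 8 bytes = 560,000 bytes for a 64-bit integer or float column. A quick check with plain pandas (not the YData API, and with zero-filled placeholder data) reproduces the number:

```python
import numpy as np
import pandas as pd

n_rows = 70_000

# Placeholder values; only the dtypes and the row count matter for memory usage.
pdf = pd.DataFrame({
    "age": np.zeros(n_rows, dtype="int64"),
    "weight": np.zeros(n_rows, dtype="float64"),
})

# index=False leaves out the index entry, whose size depends on the index type.
usage = pdf.memory_usage(index=False)
print(usage)
print("total bytes:", int(usage.sum()))
```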


### Interact with the data

#Getting the n first rows from a dataset
dataset.head(n=100)

In [4]:
#Get the number of columns and number of rows
print(dataset.ncols, dataset.nrows)

13 70000


In [5]:
#Create new Dataset that results from the applied transformation
dataset.apply(lambda row: row.age/360, axis=1).head(100)

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=(None, 'float64'))



|     | 0         |
|-----|-----------|
| 0   | 51.091667 |
| 1   | 56.188889 |
| 2   | 52.380556 |
| 3   | 48.952778 |
| 4   | 48.538889 |
| ... | ...       |
| 95  | 58.544444 |
| 96  | 53.494444 |
| 97  | 51.138889 |
| 98  | 60.722222 |


In [6]:
#Count distinct values for a given col (column_name)
dataset.value_counts(col='cardio')

0    35021
1    34979
Name: cardio, dtype: int64

In [7]:
#Select the columns from the dataset based on the existing dtypes
#Valid dtypes include int, string, float, date and datetime 
dataset.select_dtypes('int').head(100)

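|     | id  | age   | gender | height | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|-----|-----|-------|--------|--------|-------|-------|-------------|------|-------|------|--------|--------|
| 0   | 0   | 18393 | 2      | 168    | 110   | 80    | 1           | 1    | 0     | 0    | 1      | 0      |
| 1   | 1   | 20228 | 1      | 156    | 140   | 90    | 3           | 1    | 0     | 0    | 1      | 1      |
| 2   | 2   | 18857 | 1      | 165    | 130   | 70    | 3           | 1    | 0     | 0    | 0      | 1      |
| 3   | 3   | 17623 | 2      | 169    | 150   | 100   | 1           | 1    | 0     | 0    | 1      | 1      |
| 4   | 4   | 17474 | 1      | 156    | 100   | 60    | 1           | 1    | 0     | 0    | 0      | 0      |
| ... | ... | ...   | ...    | ...    | ...   | ...   | ...         | ...  | ...   | ...  | ...    | ...    |
| 95  | 129 | 21076 | 1      | 158    | 110   | 70    | 1           | 1    | 0     | 0    | 1      | 0      |
| 96  | 131 | 19258 | 2      | 165    | 110   | 70    | 1           | 1    | 0     | 0    | 1      | 0      |
| 97  | 132 | 18410 | 1      | 165    | 150   | 110   | 1           | 1    | 0     | 0    | 0      | 1      |
| 98  | 133 | 21860 | 2      | 170    | 120   | 80    | 1           | 1    | 0     | 0    | 0      | 1      |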


In [8]:
#Calculate number of unique observations for a given column
dataset.uniques('active')

2
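
For readers who know pandas, the three calls above correspond roughly to familiar DataFrame operations. The sketch below uses plain pandas on a few made-up rows; it illustrates the semantics only and is not how `Dataset` implements them.

```python
import pandas as pd

# Made-up rows; the column names mirror the example dataset.
pdf = pd.DataFrame({
    "age": [18393, 20228, 18857, 17623],
    "weight": [62.0, 85.0, 64.0, 82.0],
    "active": [1, 1, 0, 1],
    "cardio": [0, 1, 1, 1],
})

# Frequency of each distinct value in one column (cf. value_counts(col=...)).
cardio_counts = pdf["cardio"].value_counts()

# Number of distinct values in one column (cf. uniques(...)).
n_active = pdf["active"].nunique()

# Keep only the columns of a given type (cf. select_dtypes('int')).
int_columns = pdf.select_dtypes(include="int64").columns.tolist()

print(cardio_counts.to_dict(), n_active, int_columns)
```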

In [9]:
#Drop columns from the Dataset object
# If inplace=True, then the current Dataset object is changed. Otherwise the method creates a new Dataset object
filter_dataset = dataset.drop_columns('gender', inplace=False)
filter_dataset.head(10)

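|     | id  | age   | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|-----|-----|-------|--------|--------|-------|-------|-------------|------|-------|------|--------|--------|
| 0   | 0   | 18393 | 168    | 62.0   | 110   | 80    | 1           | 1    | 0     | 0    | 1      | 0      |
| 1   | 1   | 20228 | 156    | 85.0   | 140   | 90    | 3           | 1    | 0     | 0    | 1      | 1      |
| 2   | 2   | 18857 | 165    | 64.0   | 130   | 70    | 3           | 1    | 0     | 0    | 0      | 1      |
| 3   | 3   | 17623 | 169    | 82.0   | 150   | 100   | 1           | 1    | 0     | 0    | 1      | 1      |
| 4   | 4   | 17474 | 156    | 56.0   | 100   | 60    | 1           | 1    | 0     | 0    | 0      | 0      |
| 5   | 8   | 21914 | 151    | 67.0   | 120   | 80    | 2           | 2    | 0     | 0    | 0      | 0      |
| 6   | 9   | 22113 | 157    | 93.0   | 130   | 80    | 3           | 1    | 0     | 0    | 1      | 0      |
| 7   | 12  | 22584 | 178    | 95.0   | 130   | 90    | 3           | 3    | 0     | 0    | 1      | 1      |
| 8   | 13  | 17668 | 158    | 71.0   | 110   | 70    | 1           | 1    | 0     | 0    | 1      | 0      |
| 9   | 14  | 19834 | 164    | 68.0   | 110   | 60    | 1           | 1    | 0     | 0    | 0      | 0      |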
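
The `inplace` flag described in the comment resembles the convention pandas uses for `DataFrame.drop`. As a rough analogy only (plain pandas with made-up rows; what `drop_columns` returns when `inplace=True` is not shown in this notebook):

```python
import pandas as pd

pdf = pd.DataFrame({"id": [0, 1], "gender": [2, 1], "height": [168, 156]})

# inplace=False (the pandas default): the original is untouched and a new
# DataFrame without the column is returned.
without_gender = pdf.drop(columns="gender", inplace=False)

# inplace=True: the object itself is modified and the call returns None.
scratch = pdf.copy()
returned = scratch.drop(columns="gender", inplace=True)

print(list(pdf.columns), list(without_gender.columns), list(scratch.columns), returned)
```

Keeping `inplace=False` is usually the safer choice in a notebook, because re-running a cell does not fail on a column that has already been removed.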


In [10]:
#Subset the Dataset based on a provided list of columns
subset_dataset = dataset[['age', 'height', 'weight', 'ap_hi']]

#Get the last n records from the dataset
subset_dataset.tail(n=10)

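|       | age   | height | weight | ap_hi |
|-------|-------|--------|--------|-------|
| 69990 | 15094 | 168    | 72.0   | 110   |
| 69991 | 20609 | 159    | 72.0   | 130   |
| 69992 | 18792 | 161    | 56.0   | 170   |
| 69993 | 19699 | 172    | 70.0   | 130   |
| 69994 | 21074 | 165    | 80.0   | 150   |
| 69995 | 19240 | 168    | 76.0   | 120   |
| 69996 | 22601 | 158    | 126.0  | 140   |
| 69997 | 19066 | 183    | 105.0  | 180   |
| 69998 | 22431 | 163    | 72.0   | 135   |
| 69999 | 20540 | 170    | 72.0   | 120   |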


In [11]:
#Get a random sample of size n from a Dataset
sample_dataset = dataset.sample(size=1000)
len(sample_dataset)

1000
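
A random sample generally contains different rows on each run unless a seed is fixed. Whether `Dataset.sample` accepts a seed is not shown here, so the sketch below demonstrates the idea with plain pandas, where the size argument is called `n` and the seed `random_state`. The data is synthetic.

```python
import pandas as pd

# Synthetic stand-in with the same number of rows as the example dataset.
pdf = pd.DataFrame({"id": range(70_000)})

# Draw 1,000 rows without replacement; a fixed seed makes the draw repeatable.
sample_a = pdf.sample(n=1000, random_state=42)
sample_b = pdf.sample(n=1000, random_state=42)

print(len(sample_a), sample_a.index.equals(sample_b.index))
```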