# The Dataset and Metadata objects

Both the **Dataset** and **Metadata** objects are the pillars for leveraging the features of YData's package.
- The **Dataset** object is an abstraction over different Python engines for handling data:
    - Dask: If you're looking for scalability.
    - Pandas: If you want to keep it as pythonic as possible.
    - Numpy: If arrays are your thing.
    
    
- The **Metadata** object helps you extract the main information from your dataset:
    - The columns metadata: both variable and data type (numerical, categorical, etc.)
    - The data warnings: checks for the presence of duplicates, variables with skewness, etc.
    
The **Metadata** object only works with a **Dataset** as input. In this notebook, we show the features and capabilities of these objects and how to combine them with other pieces of YData's package.
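To make the two warning types mentioned above concrete, here is a minimal sketch using plain pandas on a small, hypothetical table (`toy` and its columns are made up for illustration). It is not how the Metadata object computes its warnings, only an illustration of what "duplicates" and "skewness" mean for a dataset.

```python
import pandas as pd

# Hypothetical toy data, not the diabetes dataset used in this notebook.
toy = pd.DataFrame({
    "patient": [1, 2, 2, 3, 4, 5],
    "visits": [1, 1, 1, 2, 1, 40],
})

# Duplicates: rows that repeat an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# Skewness: a strongly positive value indicates a long right tail.
visits_skew = toy["visits"].skew()

print(f"duplicate rows: {n_duplicates}")
print(f"skewness of 'visits': {visits_skew:.2f}")
```

A single extreme value (the 40 in `visits`) is enough to produce a large positive skewness, which is the kind of column a skewness warning is meant to flag.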

In [9]:
import pandas as pd

from ydata.dataset import Dataset
from ydata.metadata import Metadata

In [10]:
data = pd.read_csv('diabetes.csv')

#Create the dataset object
dataset = Dataset(data)

#Getting some info from the Dataset
#Schema - Columns and variable types
print('\033[1m Dataset schema \033[0m')
print(dataset.schema)

#Nrows - Number of rows
print(dataset.nrows)

print("\n\033[1m Dataset shape - Number of training rows and columns for both training and holdout \033[0m")
print(dataset.shape(lazy_eval=False))

Dataset schema
{'encounter_id': <VariableType.INT: 'int'>, 'patient_nbr': <VariableType.INT: 'int'>, 'race': <VariableType.STR: 'string'>, 'gender': <VariableType.STR: 'string'>, 'age': <VariableType.STR: 'string'>, 'weight': <VariableType.STR: 'string'>, 'admission_type_id': <VariableType.INT: 'int'>, 'discharge_disposition_id': <VariableType.INT: 'int'>, 'admission_source_id': <VariableType.INT: 'int'>, 'time_in_hospital': <VariableType.INT: 'int'>, 'payer_code': <VariableType.STR: 'string'>, 'medical_specialty': <VariableType.STR: 'string'>, 'num_lab_procedures': <VariableType.INT: 'int'>, 'num_procedures': <VariableType.INT: 'int'>, 'num_medications': <VariableType.INT: 'int'>, 'number_outpatient': <VariableType.INT: 'int'>, 'number_emergency': <VariableType.INT: 'int'>, 'number_inpatient': <VariableType.INT: 'int'>, 'diag_1': <VariableType.STR: 'string'>, 'diag_2': <VariableType.STR: 'string'>, 'diag_3': <VariableType.STR: 'string'>, 'number_diagnoses': <VariableType.INT

### Extract the metadata from the Dataset

In [11]:
#init the metadata object
metadata = Metadata()

#calculate the Metadata of a given Dataset
metadata(dataset)

#Print the full metadata summary
print('\n\033[1mMetadata summary\033[0m')
print(metadata.summary)

for item, values in metadata.summary.items():
    print('\n\033[4m'+item+'\033[0m')
    print(values)

[########################################] | 100% Completed |  4.0s

Metadata summary
{'target': None, 'dataset_attrs': None, 'nrows': 11105, 'summary': {'nrows': 11105, 'cardinality': {'encounter_id': 11083.560117437017, 'patient_nbr': 9065.006722473736, 'race': 6.000274674963478, 'gender': 2.000030518198964, 'age': 10.000763017065795, 'weight': 10.000763017065795, 'admission_type_id': 7.000373866960488, 'discharge_disposition_id': 14.001495574319138, 'admission_source_id': 10.000763017065795, 'time_in_hospital': 14.001495574319138, 'payer_code': 1.0000076294721387, 'medical_specialty': 44.01477712229648, 'num_lab_procedures': 106.08581642423437, 'num_procedures': 7.000373866960488, 'num_medications': 59.026573872641585, 'number_outpatient': 14.001495574319138, 'number_emergency': 11.000923260056664, 'number_inpatient': 14.001495574319138, 'diag_1': 457.59382330741664, 'diag_2': 441.4836953993499, 'diag_3': 484.7886474548861, 'number_diagnoses': 9.000618037546497, 'max_glu_ser
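The `cardinality` entry in the summary is the number of distinct values per column. The non-integer values in the output above suggest the counts are estimated rather than exact; that is an inference from the printed numbers, not something this notebook states. For comparison, a minimal pandas sketch of exact cardinality on a small, hypothetical table:

```python
import pandas as pd

# Hypothetical toy data used only to illustrate exact cardinality.
toy = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M"],
    "encounter_id": [101, 102, 103, 104, 105],
})

# nunique() counts the distinct values in each column.
exact_cardinality = toy.nunique().to_dict()
print(exact_cardinality)
```

A column whose cardinality is close to the number of rows, like `encounter_id` here, usually behaves like an identifier rather than a feature.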

In [12]:
#Set the target variable
metadata.target='readmitted'

### Updating columns datatypes
The automated inference might not be correct in every case. For that reason, we always recommend updating the data types according to your own understanding of the data.

The update can be done by column or for a group of columns.

In [13]:
print(metadata)



In this particular example, the 'encounter_id' column has been mistakenly identified as a numerical column instead of an ID. The code snippet below shows how to change the data types:

In [15]:
#Change the data type of a single column
print('\n\033[1mChanging one column data type\033[0m')
metadata.columns = {'encounter_id': 'id'}

print(f"'encounter_id': {metadata.columns['encounter_id'].datatype.name}")

print('\n\033[1mChanging multiple columns data types\033[0m')
metadata.columns = {'patient_nbr': 'id',
                    'admission_type_id': 'id'}

print(f"'patient_nbr': {metadata.columns['patient_nbr'].datatype.name}")
print(f"'admission_type_id': {metadata.columns['admission_type_id'].datatype.name}")


Changing one column data type
'encounter_id': ID

Changing multiple columns data types
'patient_nbr': ID
'admission_type_id': ID
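When a dataset has many columns, it can help to shortlist the ones that probably need to be reclassified as IDs before updating them by hand. The sketch below is one possible heuristic in plain pandas, assuming that ID columns either end in `_id` or are almost entirely unique. The helper `id_like_columns`, its parameters, and the toy data are hypothetical and not part of YData's package.

```python
import pandas as pd

# Hypothetical toy data with two ID-like columns and one numeric feature.
toy = pd.DataFrame({
    "encounter_id": [11, 12, 13, 14, 15, 16],
    "admission_type_id": [1, 1, 2, 3, 1, 2],
    "num_medications": [5, 9, 5, 12, 7, 9],
})


def id_like_columns(df, suffix="_id", min_unique_ratio=0.95):
    """Return columns that look like identifiers by name or by uniqueness."""
    if len(df) == 0:
        return []
    candidates = []
    for name in df.columns:
        unique_ratio = df[name].nunique() / len(df)
        if str(name).endswith(suffix) or unique_ratio >= min_unique_ratio:
            candidates.append(name)
    return candidates


candidates = id_like_columns(toy)
print(candidates)
```

The resulting list is only a set of candidates to review. Columns confirmed as IDs can then be written as a `{column: 'id'}` dictionary, the same shape assigned to `metadata.columns` above.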


### Filtering metadata by columns
For some activities (e.g. data synthesis) the full metadata is not needed, and only a subset of the columns should be considered. The Metadata object lets users select only the needed columns, as in the example below.

In [23]:
filtered_metadata = metadata['encounter_id', 'age', 'acarbose', 'readmitted']

print('\n\033[1mNew available metadata\033[0m')
print(filtered_metadata)


New available metadata


## YData connectors and the Metadata

In [16]:
import os
from pathlib import Path
from ydata.connectors import GCSConnector
from ydata.utils.formats import read_json

In [19]:
token = read_json('gcs_credentials.json')
conn = GCSConnector(project_id=token['project_id'], keyfile_dict=token)
data = conn.read_file('gs://ydata_testdata/tabular/diabetes/data.csv')
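The cell above reads `project_id` straight from the credentials file, so a malformed keyfile only fails once the connector is built. As an optional safeguard, the sketch below checks the file first using only the standard library. The helper `load_keyfile` is hypothetical, and it stands in for `read_json` only on the assumption that the credentials are a plain JSON object.

```python
import json
from pathlib import Path


def load_keyfile(path):
    """Load a JSON keyfile and check that it has a 'project_id' entry."""
    keyfile = json.loads(Path(path).read_text())
    if "project_id" not in keyfile:
        raise KeyError(f"'project_id' missing from {path}")
    return keyfile
```

The returned dictionary can be used wherever `token` is used above, for example `token = load_keyfile('gcs_credentials.json')`.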

In [20]:
# The YData connectors already return an object of type Dataset.
diabetes_metadata = Metadata()
diabetes_metadata(data)

<ydata.metadata.metadata.Metadata at 0x7ff8e52d9d10>

In [21]:
print(diabetes_metadata)



## The dataset