# The Dataset and Metadata

The **Dataset** and **Metadata** objects are the pillars of YData's package: most of its features build on them.
- The **Dataset** object is an abstraction of different Python engines for handling data:    
    - Dask: If you're looking for scalability.
    - Pandas: If you want to keep it as pythonic as possible.
    - Numpy: If arrays are your thing.
- The **Metadata** object helps you extract the main insights from your dataset and assess its quality:
    - The column metadata: both the variable type and the data type (numerical, categorical, etc.) of each column.
    - The data warnings: checks for the presence of duplicates, skewed variables, etc.
    
The **Metadata** object only accepts a **Dataset** as input. In this notebook, we show the features and capabilities of these objects and how to combine them with other pieces of YData's package.
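
To make the data warnings listed above more concrete, here is a minimal sketch of two such checks (duplicate rows and skewness) written with the Python standard library only. The function names `count_duplicate_rows` and `skewness` are hypothetical and the logic is an assumption about what such checks typically compute; it is not YData's implementation, and YData may count duplicates or measure skewness differently.

```python
from collections import Counter


def count_duplicate_rows(rows):
    """Count rows that repeat an earlier row (each extra copy counts once)."""
    counts = Counter(tuple(row) for row in rows)
    return sum(n - 1 for n in counts.values())


def skewness(values):
    """Population (Fisher-Pearson) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


# Toy data, invented for illustration only.
rows = [(1, "a"), (2, "b"), (1, "a"), (1, "a")]
print(count_duplicate_rows(rows))   # two extra copies of (1, "a")
print(skewness([1, 1, 1, 1, 10]))   # positive: long right tail
```

A warning would then be raised when a check crosses a chosen threshold, for example when the absolute skewness of a numerical column exceeds some cutoff.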

In [1]:
from ydata.labs import DataSources
from ydata.metadata import Metadata

In [2]:
datasource = DataSources.get(uid='{uid}', namespace='{namespace}')
dataset = datasource.dataset

#Getting some info from the Dataset
#Schema - Columns and variable types
print('\033[1m Dataset schema \033[0m')
print(dataset.schema)

#Nrows - Number of rows
print(dataset.nrows)

print("\n\033[1m Dataset shape - Number of rows and columns \033[0m")
print(dataset.shape(lazy_eval=False))

Dataset schema
{'encounter_id': <VariableType.INT: 'int'>, 'patient_nbr': <VariableType.INT: 'int'>, 'race': <VariableType.STR: 'string'>, 'gender': <VariableType.STR: 'string'>, 'age': <VariableType.STR: 'string'>, 'weight': <VariableType.STR: 'string'>, 'admission_type_id': <VariableType.INT: 'int'>, 'discharge_disposition_id': <VariableType.INT: 'int'>, 'admission_source_id': <VariableType.INT: 'int'>, 'time_in_hospital': <VariableType.INT: 'int'>, 'payer_code': <VariableType.STR: 'string'>, 'medical_specialty': <VariableType.STR: 'string'>, 'num_lab_procedures': <VariableType.INT: 'int'>, 'num_procedures': <VariableType.INT: 'int'>, 'num_medications': <VariableType.INT: 'int'>, 'number_outpatient': <VariableType.INT: 'int'>, 'number_emergency': <VariableType.INT: 'int'>, 'number_inpatient': <VariableType.INT: 'int'>, 'diag_1': <VariableType.STR: 'string'>, 'diag_2': <VariableType.STR: 'string'>, 'diag_3': <VariableType.STR: 'string'>, 'number_diagnoses': <VariableType.INT

### Extract the metadata from the Dataset

In [3]:
#init the metadata object
metadata = Metadata()

#calculate the Metadata of a given Dataset
metadata(dataset)

#Getting the full metadata summary
print('\n\033[1mMetadata summary\033[0m')

for item, values in metadata.summary.items():
    print('\n\033[4m'+item+'\033[0m')
    print(values)

[########################################] | 100% Completed | 307.88 ms
[########################################] | 100% Completed | 1.68 s
[########################################] | 100% Completed | 628.97 ms
[########################################] | 100% Completed | 341.05 ms
[########################################] | 100% Completed | 58.35 s

Metadata summary

nrows
101766

cardinality
{'encounter_id': 101766, 'patient_nbr': 71518, 'race': 6, 'gender': 3, 'age': 10, 'weight': 10, 'admission_type_id': 8, 'discharge_disposition_id': 26, 'admission_source_id': 17, 'time_in_hospital': 14, 'payer_code': 18, 'medical_specialty': 73, 'num_lab_procedures': 118, 'num_procedures': 7, 'num_medications': 75, 'number_outpatient': 39, 'number_emergency': 33, 'number_inpatient': 21, 'diag_1': 717, 'diag_2': 749, 'diag_3': 790, 'number_diagnoses': 16, 'max_glu_serum': 4, 'A1Cresult': 4, 'metformin': 4, 'repaglinide': 4, 'nateglinide': 4, 'chlorpropamide': 4, 'glime
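
The `cardinality` entry in the summary reports the number of distinct values per column. As a rough illustration of that idea (a sketch using plain Python, not YData's implementation; the helper names `cardinality` and `unique_columns` and the toy table are made up for this example):

```python
def cardinality(table):
    """Return the number of distinct values per column.

    `table` maps a column name to the list of values in that column.
    """
    return {name: len(set(values)) for name, values in table.items()}


def unique_columns(table):
    """Columns whose cardinality equals the row count (every value distinct)."""
    card = cardinality(table)
    return [name for name, values in table.items() if card[name] == len(values)]


# Toy table, invented for illustration only.
toy = {
    "encounter_id": [101, 102, 103, 104],
    "gender": ["Female", "Male", "Female", "Female"],
}
print(cardinality(toy))     # {'encounter_id': 4, 'gender': 2}
print(unique_columns(toy))  # ['encounter_id']
```

In the summary above, `encounter_id` has a cardinality of 101766, which equals `nrows`: every value is distinct, which is typical of an identifier column.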

In [4]:
## Setting the target variable
metadata.target='readmitted'

### Updating columns datatypes
The automated inference is not guaranteed to be correct in every case. For that reason, we recommend reviewing the inferred datatypes and updating them according to your own understanding of the data.

The update can be done for a single column or for a group of columns.

In [5]:
print(metadata)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 50
Duplicate rows: 145
Target column:

Column detail:
                      Column    Data type Variable type Characteristics
0               encounter_id    numerical           int                
1                patient_nbr    numerical           int                
2                       race  categorical        string                
3                     gender  categorical        string                
4                        age  categorical        string                
5                     weight  categorical        string                
6          admission_type_id  categorical           int                
7   discharge_disposition_id  categorical           int                
8        admission_source_id  categorical           int                
9           time_in_hospital  categorical           int                
10            

In this particular example, the 'encounter_id' column has been mistakenly identified as a numerical column, instead of a categorical one. The code snippet below shows how to change the datatypes:

In [6]:
#Updating the data type of the encounter_id column
print('\n\033[1mChanging the encounter_id column data type\033[0m')
metadata.update_datatypes({'encounter_id': 'categorical'})

print(f"'encounter_id': {metadata.columns['encounter_id'].datatype.name}")


[1mChanging the encounter_id column data type[0m
'encounter_id': CATEGORICAL
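
To see why an integer identifier can end up classified as numerical, consider the toy inference rule below. It is only an assumed heuristic for illustration (the function names, the `max_categories` threshold and the sample data are all hypothetical), not the rule YData applies. The sketch also shows the general pattern of layering user overrides on top of inferred types, which is what a dictionary-based update does.

```python
def infer_datatype(values, max_categories=20):
    """Toy rule: numeric columns with many distinct values are numerical;
    everything else is treated as categorical."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric and len(set(values)) > max_categories:
        return "numerical"
    return "categorical"


def infer_all(table, overrides=None):
    """Infer a data type per column, then apply user-supplied corrections."""
    inferred = {name: infer_datatype(col) for name, col in table.items()}
    inferred.update(overrides or {})
    return inferred


# Toy table with 100 rows, invented for illustration only.
toy = {
    "encounter_id": list(range(1000, 1100)),               # 100 distinct integers
    "time_in_hospital": [i % 14 + 1 for i in range(100)],  # 14 distinct integers
    "race": ["A", "B"] * 50,
}

auto = infer_all(toy)
fixed = infer_all(toy, overrides={"encounter_id": "categorical"})
print(auto["encounter_id"], "->", fixed["encounter_id"])
```

Under this rule the identifier looks numerical because it is an integer column with many distinct values, so only domain knowledge can correct it.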


### Filtering metadata by columns
Some activities (e.g. data synthesis) do not need the full metadata, only the portion that covers a subset of the columns. The Metadata object lets you select just the columns you need, as in the example below.

In [7]:
filtered_metadata = metadata['encounter_id', 'age', 'acarbose', 'readmitted']

print('\n\033[1mNew available metadata\033[0m')
print(filtered_metadata)


New available metadata
Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 4
Duplicate rows: 145
Target column:

Column detail:
         Column    Data type Variable type Characteristics
0  encounter_id  categorical           int              id
1           age  categorical        string                
2      acarbose  categorical        string                
3    readmitted  categorical        string                

0  cardinality   [encounter_id]
1    imbalance  [age, acarbose]
2       unique   [encounter_id]
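
The `metadata['encounter_id', 'age', ...]` syntax relies on a standard Python mechanism: a comma-separated subscript is passed to `__getitem__` as a single tuple. The class below is a hypothetical stand-in (named `ToyMetadata` here, and not YData's class) that sketches how column filtering of this kind can be implemented so that it returns a new, smaller object and leaves the original untouched.

```python
class ToyMetadata:
    """Hypothetical stand-in for a metadata container; not YData's class."""

    def __init__(self, columns):
        self.columns = dict(columns)  # column name -> data type

    def __getitem__(self, key):
        # meta['a', 'b'] passes the tuple ('a', 'b'); meta['a'] passes a str.
        names = (key,) if isinstance(key, str) else tuple(key)
        missing = [name for name in names if name not in self.columns]
        if missing:
            raise KeyError(f"Unknown columns: {missing}")
        # Build a new object so the original metadata is left unchanged.
        return ToyMetadata({name: self.columns[name] for name in names})


meta = ToyMetadata(
    {"encounter_id": "categorical", "age": "categorical", "weight": "categorical"}
)
subset = meta["encounter_id", "age"]
print(list(subset.columns))  # ['encounter_id', 'age']
print(len(meta.columns))     # 3
```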

