# The Dataset and Metadata

The **Dataset** and **Metadata** objects are the pillars for leveraging [ydata-sdk](https://pypi.org/project/ydata-sdk/) features.
- The **Dataset** object is an abstraction of different Python engines for handling data:
    - Dask: If you're looking for scalability.
    - Pandas: If you want to keep it as pythonic as possible.
    - Numpy: If arrays are your thing.
- The **Metadata** object helps you extract the main insights from your dataset and assess its quality:
    - The columns metadata: Both Variable and Data type (numerical, categorical, etc.)
    - The data warnings: Checks for the presence of duplicates, variables with skewness, etc.
    
The **Metadata** object only works with a **Dataset** as input. This notebook walks through the features and capabilities of both objects and shows how to combine them with other pieces of YData's package offering.
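To make the data warnings mentioned above more concrete, here is a minimal sketch of two such checks (duplicate rows and skewness) written with the standard library only. It illustrates the idea and is not how ydata-sdk computes its warnings; `count_duplicate_rows` and `sample_skewness` are hypothetical helper names.

```python
from collections import Counter
from statistics import fmean


def count_duplicate_rows(rows):
    """Count the rows that repeat an earlier, identical row."""
    counts = Counter(tuple(row) for row in rows)
    return sum(n - 1 for n in counts.values())


def sample_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5, using population moments."""
    values = list(values)
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    m3 = fmean([(v - mean) ** 3 for v in values])
    # A constant column has no spread, so report zero skewness
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


rows = [(25, 50000, "M"), (32, 60000, "F"), (25, 50000, "M")]
print("Duplicate rows:", count_duplicate_rows(rows))
print("Skewness of a right-tailed column:", sample_skewness([1, 1, 1, 10]))
```

With pandas, `DataFrame.duplicated` and `Series.skew` cover the same ground, although `Series.skew` applies a bias correction, so its values differ slightly from the formula above.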

### Authenticate with your YData account

In [1]:
# Authenticate with your ydata-sdk token - https://dashboard.ydata.ai/
import os

os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'
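
Hard-coding the key makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key is either already exported in your shell or can be typed in interactively; `ensure_license_key` is a hypothetical helper and not part of ydata-sdk:

```python
import os
from getpass import getpass


def ensure_license_key(env=os.environ, prompt=getpass):
    """Return the license key, asking for it only if it is not already set.

    Hypothetical helper (not part of ydata-sdk). `env` and `prompt` are
    parameters so the function can be exercised without touching the real
    environment or blocking on user input.
    """
    key = env.get("YDATA_LICENSE_KEY", "").strip()
    if not key:
        key = prompt("ydata-sdk license key: ").strip()
        if not key:
            raise ValueError("No ydata-sdk license key provided")
        # Store the key under the variable name used in the cell above
        env["YDATA_LICENSE_KEY"] = key
    return key
```

Calling `ensure_license_key()` can stand in for the direct assignment to `os.environ` shown above.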

## 📥 Creating a Dataset from a DataFrame

You can create a `Dataset` from a pandas, Dask, or NumPy object. Here’s a basic example with pandas:

In [2]:
import pandas as pd
from ydata.dataset import Dataset

# Create a simple DataFrame
df = pd.DataFrame({
    "age": [25, 32, 40],
    "income": [50000, 60000, 75000],
    "gender": ["M", "F", "M"]
})

# Wrap it into a ydata-sdk Dataset
dataset = Dataset(df)
dataset.head()

|   | age | income | gender |
|---|-----|--------|--------|
| 0 | 25  | 50000  | M      |
| 1 | 32  | 60000  | F      |
| 2 | 40  | 75000  | M      |


### 🔍 Exploring the Dataset

Once you've created a `Dataset`, you can inspect its structure and content, as well as filter and select columns. For more information about the `Dataset` object interface, see [ydata-sdk's documentation](https://docs.sdk.ydata.ai/latest/api/datasets/dataset/).

In [3]:
# Number of rows
print("Number of rows:", dataset.nrows)

# Shape of the dataset
print("Shape:", dataset.shape())

# Schema: Column types and variable types
print("Schema:")
print(dataset.schema)

# First rows
dataset.head()

Number of rows: 3
Shape: (3, 3)
Schema:
{'age': <VariableType.INT: 'int'>, 'income': <VariableType.INT: 'int'>, 'gender': <VariableType.STR: 'string'>}


|   | age | income | gender |
|---|-----|--------|--------|
| 0 | 25  | 50000  | M      |
| 1 | 32  | 60000  | F      |
| 2 | 40  | 75000  | M      |
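
The printed schema reads like a mapping from column name to variable type. Assuming it behaves like a regular dictionary, as the output above suggests, the sketch below groups columns by type, which can help when deciding which columns to select. `columns_by_type` is a hypothetical helper, and plain strings stand in for the SDK's `VariableType` values.

```python
from collections import defaultdict


def columns_by_type(schema):
    """Invert a {column: type} mapping into {type: [columns]}."""
    grouped = defaultdict(list)
    for column, variable_type in schema.items():
        # Enum members expose their label through `.value`; plain strings
        # have no such attribute and are used as-is.
        key = getattr(variable_type, "value", variable_type)
        grouped[key].append(column)
    return dict(grouped)


# Plain strings stand in for the VariableType values printed above
example_schema = {"age": "int", "income": "int", "gender": "string"}
print(columns_by_type(example_schema))
# {'int': ['age', 'income'], 'string': ['gender']}
```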


## ✳️ Creating Metadata from a Dataset

The `Metadata` object gives you valuable insights into the dataset's structure and data quality. You can also interact with the `Metadata` object and select the information it holds. For more information, see [ydata-sdk's API reference documentation](https://docs.sdk.ydata.ai/latest/api/metadatas/metadata/).


In [4]:
from ydata.metadata import Metadata

# Compute the Dataset metadata
metadata = Metadata(dataset=dataset)

# Display the Metadata object (use print(metadata) for the full summary)
metadata

[########################################] | 100% Completed | 104.88 ms
[########################################] | 100% Completed | 104.36 ms


<ydata.metadata.metadata.Metadata at 0x12228ffb0>

In [5]:
# Set the target variable
metadata.target = 'income'

### Updating columns datatypes
The automated inference may not be correct in every case. We therefore recommend reviewing the inferred data types and updating them to match your understanding of the data.

The update can be applied to a single column or to a group of columns.

In [6]:
print(metadata)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 3
Number of rows: 3
Duplicate rows: 0
Target column:

Column detail:
   Column    Data type Variable type Characteristics
0     age  categorical           int                
1  income  categorical           int                
2  gender  categorical        string                

0           unique                            [age, income]
1  constant_length                                 [gender]
2      correlation  [age|income, age|gender, income|gender]



If the automated inference assigns the wrong data type to a column (for example, an identifier column detected as numerical instead of categorical), it can be corrected with `update_datatypes`. The code snippet below shows the call on the `gender` column:

In [7]:
# Update the data type of the gender column
print('\n\033[1mChanging the gender column data type\033[0m')
metadata.update_datatypes({'gender': 'categorical'})

print(f"'gender': {metadata.columns['gender'].datatype.name}")


Changing the gender column data type
'gender': CATEGORICAL
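
Because `update_datatypes` takes a dictionary, updating a group of columns comes down to building one mapping. A minimal sketch, assuming `'numerical'` is accepted as a data type name in the same way `'categorical'` is; `build_datatype_update` is a hypothetical helper, not part of ydata-sdk:

```python
def build_datatype_update(columns, datatype, known_columns):
    """Map every column in `columns` to `datatype`, rejecting unknown names."""
    unknown = [c for c in columns if c not in known_columns]
    if unknown:
        raise KeyError(f"Columns not found in metadata: {unknown}")
    return {column: datatype for column in columns}


update = build_datatype_update(
    columns=["age", "income"],
    datatype="numerical",  # assumed to be a valid data type name
    known_columns=["age", "income", "gender"],
)
print(update)
# {'age': 'numerical', 'income': 'numerical'}

# With a real Metadata object, the mapping would be passed in one call:
# metadata.update_datatypes(update)
```

Validating the names first means a typo surfaces as a clear error before the metadata is touched.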


### Filtering metadata by columns
Some activities (e.g., data synthesis) do not need the full metadata and only require a subset of the columns. The Metadata object lets you select only the columns you need, as in the example below.

In [8]:
filtered_metadata = metadata['gender', 'income']

print('\n\033[1mNew available metadata\033[0m')
print(filtered_metadata)


New available metadata
Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 2
Number of rows: 3
Duplicate rows: 0
Target column:

Column detail:
   Column    Data type Variable type Characteristics
0  income  categorical           int                
1  gender  categorical        string                

0           unique  [income]
1  constant_length  [gender]

