# Metadata components

The `Metadata` object is the central piece of the Data Catalog: it is responsible for extracting information and insights from your datasets.  
The main information about your dataset is summarized in `Metadata.summary`, a dictionary keyed by the different attributes of your dataset (e.g. `cardinality`, `zeros` or `correlation`).

The metadata has several individual components:

- the base component, which offers type inference, a basic summary and statistics about the dataset
- the correlation component to compute the correlation between all pairs of columns
- the characteristics component to deduce some attributes from each column (e.g. `id`, `phone number`, `location`, etc.)
- the interaction component to deduce the interaction between all pairs of columns (TBD)
- the constraint engine to validate certain rules on the dataset (TBD)

Because some of the components are computationally intensive (e.g. correlation for datasets with a lot of columns) or not relevant for your particular use case (e.g. the characteristics on a purely numerical dataset), the components can be triggered individually.

In this tutorial, we will see how to easily determine what should be computed and how to finely tune the computation.

First, let's retrieve our dataset from the datasource.

In [None]:
from ydata.labs import DataSources
from ydata.metadata import Metadata

datasource = DataSources.get(uid='{uid}', namespace='{namespace}')
dataset = datasource.dataset

The metadata interface offers two flags to easily separate the most computationally expensive components from the main computation:
- `infer_characteristics` (default: `False`): indicates if the characteristics should be deduced,
- `pairwise_metrics` (default: `True`): indicates if the (computationally expensive) pairwise metrics should be computed. It includes the correlation and interactions.
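
As a rough illustration of how these two flags might be chosen, the sketch below derives them from simple properties of a dataset. The helper `choose_metadata_flags` and its `max_columns_for_pairwise` threshold are hypothetical and not part of the library; only the flag names `infer_characteristics` and `pairwise_metrics` come from the interface described above.

```python
def choose_metadata_flags(n_columns, has_text_columns, max_columns_for_pairwise=50):
    """Return keyword arguments for Metadata(...).

    Hypothetical helper: the threshold is an arbitrary illustration,
    not a recommendation from the library.
    """
    return {
        # Characteristics (email, location, ...) are of little use on purely numerical data.
        "infer_characteristics": has_text_columns,
        # Pairwise metrics scale with the number of column pairs, so skip them for wide datasets.
        "pairwise_metrics": n_columns <= max_columns_for_pairwise,
    }


flags = choose_metadata_flags(n_columns=8, has_text_columns=True)
print(flags)
```

The resulting dictionary could then be unpacked into the constructor, for example `Metadata(dataset, **flags)`.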

In [8]:
m = Metadata(dataset, infer_characteristics=False, pairwise_metrics=False)

[########################################] | 100% Completed | 100.84 ms
[########################################] | 100% Completed | 214.53 ms


In [9]:
print(m)

Metadata Summary 
 
Dataset type: TABULAR
Dataset attributes: 
Number of columns: 8
Number of rows: 10000
Duplicate rows: 0
Target column: 

Column detail: 
        Column  Data type Variable type Characteristics
0         name     string        string                
1      address     string        string                
2        email     string        string                
3         city     string        string                
4        state     string        string                
5    date_time       date      datetime                
6   randomdata  numerical           int                
7  randomdata2  numerical         float                

0  cardinality  [name, address, email, city, state]
1       unique          [address, email, date_time]



Disabling these flags also means that the warnings related to these metrics are not available at this stage!  
If you inspect the summary after the main computation, you will notice that the characteristics and the correlation have not been computed:

In [10]:
m.summary['characteristics'], m.summary['correlation']  # Both empty

({},
 Empty DataFrame
 Columns: []
 Index: [])

It is possible to compute the characteristics or the correlation when needed. The following methods compute the characteristics and the correlation, respectively. They return the corresponding values **and automatically update the summary and the warnings**:

In [11]:
m.compute_characteristics(dataset)

{'email': [{'characteristic': <ColumnCharacteristic.EMAIL: 'email'>,
   'value': 0.9996985955896898,
   'upper_bound': 1,
   'lower_bound': 0.9993093265368261}],
 'state': [{'characteristic': <ColumnCharacteristic.LOCATION: 'location'>,
   'value': 0.9996985955896898,
   'upper_bound': 1,
   'lower_bound': 0.9993093265368261}]}
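
The returned dictionary can be post-processed like any other Python mapping. The sketch below keeps only the characteristics whose `lower_bound` clears a threshold. The helper name `confident_characteristics` and the threshold are hypothetical, reading `lower_bound` as a confidence bound is an assumption, and plain strings stand in for the `ColumnCharacteristic` members so that the example runs on its own.

```python
def confident_characteristics(characteristics, min_lower_bound=0.95):
    """Map each column to the characteristics whose lower bound clears the threshold."""
    selected = {}
    for column, candidates in characteristics.items():
        names = [
            candidate["characteristic"]
            for candidate in candidates
            if candidate["lower_bound"] >= min_lower_bound
        ]
        if names:
            selected[column] = names
    return selected


# Same shape as the output above; strings replace the ColumnCharacteristic enum (assumption).
example = {
    "email": [{"characteristic": "email", "value": 0.9996985955896898,
               "upper_bound": 1, "lower_bound": 0.9993093265368261}],
    "state": [{"characteristic": "location", "value": 0.9996985955896898,
               "upper_bound": 1, "lower_bound": 0.9993093265368261}],
}
print(confident_characteristics(example))
```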

In [12]:
m.compute_correlation(dataset)

Unnamed: 0,name,address,email,city,state,date_time,randomdata,randomdata2
name,1.0,0.0,0.017837,0.040656,0.003685,1.0,0.0,0.012544
address,0.0,1.0,0.000143,0.0,0.0,1.0,0.0,0.0
email,0.017837,0.000143,1.0,0.0,0.0,1.0,0.0,0.0
city,0.040656,0.0,0.0,1.0,0.013123,1.0,0.012164,0.0
state,0.003685,0.0,0.0,0.013123,1.0,1.0,0.015672,0.0
date_time,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
randomdata,0.0,0.0,0.0,0.012164,0.015672,1.0,1.0,0.000858
randomdata2,0.012544,0.0,0.0,0.0,0.0,1.0,0.000858,1.0
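
To show how such a matrix might be inspected, here is a minimal sketch that lists the column pairs whose absolute correlation reaches a threshold. It works on a nested mapping rather than on the object returned by `compute_correlation`; if that object is a pandas `DataFrame`, `DataFrame.to_dict()` produces a mapping of this shape. The helper `strongly_correlated_pairs` and the `0.8` threshold are hypothetical, and the values are a subset of the matrix above.

```python
from itertools import combinations


def strongly_correlated_pairs(matrix, threshold=0.8):
    """Return (column_a, column_b, value) for each pair at or above the threshold."""
    pairs = []
    # combinations() visits each unordered pair once and skips the diagonal.
    for a, b in combinations(list(matrix), 2):
        value = matrix[a][b]
        if abs(value) >= threshold:
            pairs.append((a, b, value))
    return pairs


# Three columns taken from the correlation matrix shown above.
matrix = {
    "city": {"city": 1.0, "state": 0.013123, "date_time": 1.0},
    "state": {"city": 0.013123, "state": 1.0, "date_time": 1.0},
    "date_time": {"city": 1.0, "state": 1.0, "date_time": 1.0},
}
print(strongly_correlated_pairs(matrix))
```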


In [13]:
m.summary['characteristics'], m.summary['correlation'] # Both are updated

({'email': [{'characteristic': <ColumnCharacteristic.EMAIL: 'email'>,
    'value': 0.9996985955896898,
    'upper_bound': 1,
    'lower_bound': 0.9993093265368261}],
  'state': [{'characteristic': <ColumnCharacteristic.LOCATION: 'location'>,
    'value': 0.9996985955896898,
    'upper_bound': 1,
    'lower_bound': 0.9993093265368261}]},
                  name   address     email      city     state  date_time  \
 name         1.000000  0.000000  0.017837  0.040656  0.003685        1.0   
 address      0.000000  1.000000  0.000143  0.000000  0.000000        1.0   
 email        0.017837  0.000143  1.000000  0.000000  0.000000        1.0   
 city         0.040656  0.000000  0.000000  1.000000  0.013123        1.0   
 state        0.003685  0.000000  0.000000  0.013123  1.000000        1.0   
 date_time    1.000000  1.000000  1.000000  1.000000  1.000000        1.0   
 randomdata   0.000000  0.000000  0.000000  0.012164  0.015672        1.0   
 randomdata2  0.012544  0.000000  0.000000  0

It is possible to defer the computation using the flag `deferred=True`. In this case, the `compute_<component>` method returns a `distributed.Future` object immediately. This is useful for large datasets, for which any computation might take some time.


The `Future` is also available in the `Metadata.status` dictionary under the key `<component>` (e.g. `Metadata.status['correlation']`).   
When the task is finished, the metadata instance summary is automatically updated as well as the warnings.
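
The general pattern (submit a task, keep working, and let a callback update shared state once the task finishes) can be sketched with the standard library alone. This is only an analogy: `concurrent.futures` stands in for `distributed`, and `slow_correlation` and the `summary` dictionary are hypothetical placeholders, not the library's internals.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_correlation():
    """Placeholder for an expensive computation (hypothetical)."""
    time.sleep(0.2)
    return {"computed": True}


summary = {}  # stands in for the metadata summary

with ThreadPoolExecutor(max_workers=1) as executor:
    # submit() returns a Future immediately; the work runs in the background.
    future = executor.submit(slow_correlation)
    # The callback updates the summary as soon as the task completes.
    future.add_done_callback(lambda f: summary.update(correlation=f.result()))
    # ... do something else in parallel ...
    result = future.result()  # blocks until the result is available

# Leaving the `with` block waits for the worker, so the callback has run by now.
print(result, summary)
```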

In [14]:
m = Metadata(dataset, infer_characteristics=False, pairwise_metrics=False)

[########################################] | 100% Completed | 202.69 ms


In [18]:
future_correlation = m.compute_correlation(dataset, deferred=True)  # Returns immediately 
# ... Do something else in parallel

In [19]:
# It is possible to check the status either in the object itself or in the metadata.status
future_correlation.status, m.status['correlation'].status  # `pending`

('pending', 'pending')



In [23]:
# Whenever the correlation is needed, you can wait for the result until it is available (if not already)
corr_matrix = future_correlation.result()  # Wait for the result
# or
m.status['correlation'].result()

Unnamed: 0,name,address,email,city,state,date_time,randomdata,randomdata2
name,1.0,0.0,0.017837,0.040656,0.003685,1.0,0.0,0.012544
address,0.0,1.0,0.000143,0.0,0.0,1.0,0.0,0.0
email,0.017837,0.000143,1.0,0.0,0.0,1.0,0.0,0.0
city,0.040656,0.0,0.0,1.0,0.013123,1.0,0.012164,0.0
state,0.003685,0.0,0.0,0.013123,1.0,1.0,0.015672,0.0
date_time,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
randomdata,0.0,0.0,0.0,0.012164,0.015672,1.0,1.0,0.000858
randomdata2,0.012544,0.0,0.0,0.0,0.0,1.0,0.000858,1.0


It is now available in the summary as the computation is finished:

In [21]:
m.summary['correlation']

Unnamed: 0,name,address,email,city,state,date_time,randomdata,randomdata2
name,1.0,0.0,0.017837,0.040656,0.003685,1.0,0.0,0.012544
address,0.0,1.0,0.000143,0.0,0.0,1.0,0.0,0.0
email,0.017837,0.000143,1.0,0.0,0.0,1.0,0.0,0.0
city,0.040656,0.0,0.0,1.0,0.013123,1.0,0.012164,0.0
state,0.003685,0.0,0.0,0.013123,1.0,1.0,0.015672,0.0
date_time,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
randomdata,0.0,0.0,0.0,0.012164,0.015672,1.0,1.0,0.000858
randomdata2,0.012544,0.0,0.0,0.0,0.0,1.0,0.000858,1.0


In addition, you can now check whether there are any warnings about high correlation:

In [22]:
m.warnings['correlation']

