# __Metadata tutorial__

In this tutorial you will learn how to access and manage the DataFrame metadata. The functions explained below belong to a class named _VariantMetadata()_. To call them we will use an instance named _metadata_, which every _Oskar()_ instance holds as an attribute by default.
> <span style="color:#ff6600">**To optimise performance, we decided to integrate PyOskar into the default PySpark API. This means that data can be managed with functions from both libraries at the same time, for example to select specific fields, visualize the DataFrame, filter the output, or perform other operations beyond the PyOskar API.**</span>

First, we need to import the PyOskar and PySpark modules. Second, we need to create an instance of the _Oskar()_ object, on which much of the functionality depends. Finally, we call the _load()_ function with the path to the parquet file to load our data into a DataFrame _df_, and we are ready to start playing.

NOTE: Whenever we use the _load()_ function, it automatically looks in the same location for a file with the same name plus ".meta.json.gz" or ".meta.json". If that file is found, it is set as the DataFrame metadata.
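
The naming convention in this note can be pictured with a few lines of standard-library Python. This is only a sketch of the convention, not PyOskar's actual implementation: `find_metadata_path` is a hypothetical helper, and the order in which the two suffixes are tried is an assumption.

```python
import os
import tempfile

# Hypothetical helper illustrating the sidecar-file convention.
# The suffix order is an assumption, not taken from PyOskar's source.
METADATA_SUFFIXES = (".meta.json.gz", ".meta.json")


def find_metadata_path(data_path):
    """Return the first existing metadata file next to data_path, or None."""
    for suffix in METADATA_SUFFIXES:
        candidate = data_path + suffix
        if os.path.exists(candidate):
            return candidate
    return None


# Demo with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "example.parquet")
    missing = find_metadata_path(data_path)        # no sidecar file yet
    open(data_path + ".meta.json", "w").close()    # create an empty sidecar file
    found_name = os.path.basename(find_metadata_path(data_path))

print(missing)
print(found_name)
```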

In [47]:
from pyoskar.core import Oskar
from pyoskar.sql import *
from pyoskar.analysis import *
from pyspark.sql.functions import *

oskar = Oskar(spark)
df = oskar.load("./data/platinum_chr22.small.parquet")

You can use the PySpark _show()_ method to print the data from _df_. This is what our test DataFrame looks like. As you can see, for this tutorial we have selected a small dataset from Illumina Platinum Genomes with 1,000 random variants from chromosome 22, which belong to a set of 17 samples.

In [2]:
print("Print first 20 variants:")
df.show()

Print first 20 variants:
+---------------+-----+----------+--------+--------+---------+---------+------+----+------+-----+----+--------------------+--------------------+
|             id|names|chromosome|   start|     end|reference|alternate|strand|  sv|length| type|hgvs|             studies|          annotation|
+---------------+-----+----------+--------+--------+---------+---------+------+----+------+-----+----+--------------------+--------------------+
|22:16054454:C:T|   []|        22|16054454|16054454|        C|        T|     +|null|     1|  SNV|  []|[[hgvauser@platin...|[22, 16054454, 16...|
|22:16065809:T:C|   []|        22|16065809|16065809|        T|        C|     +|null|     1|  SNV|  []|[[hgvauser@platin...|[22, 16065809, 16...|
|22:16077310:T:A|   []|        22|16077310|16077310|        T|        A|     +|null|     1|  SNV|  []|[[hgvauser@platin...|[22, 16077310, 16...|
|22:16080499:A:G|   []|        22|16080499|16080499|        A|        G|     +|null|     1|  SNV|  []|[[h

NOTE: With the _df.printSchema()_ command you can check the dataset hierarchy and all its fields.

## Read Metadata

This method reads the specified metadata file and returns its content as a dict object.
<br>
<br>
Usage:
```python
readMetadata(meta_path[str])
```

In [42]:
# oskar.metadata.readMetadata("./data/platinum_chr22.small.parquet.meta.json.gz")

*_This call is commented out by default because of its huge output, but you can try it by removing the hash._
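
To get a feel for what reading a ".meta.json.gz" file into a dict involves, the sketch below uses only the standard `gzip` and `json` modules. It is not how _readMetadata()_ is implemented; `read_json_gz` is a hypothetical helper, and the toy file content is made up, reusing only the top-level keys that appear later in this tutorial.

```python
import gzip
import json
import os
import tempfile


def read_json_gz(path):
    """Read a gzip-compressed JSON file and return the parsed object."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Write a toy metadata file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.meta.json.gz")
    toy = {"version": "toy", "species": {}, "creationDate": "", "studies": []}
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(toy, handle)
    loaded = read_json_gz(path)

# Printing only the keys keeps the output short.
print(sorted(loaded))
```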

## Variant Metadata
This method retrieves the metadata from the given DataFrame and returns it as a dict object.
<br>
<br>
Usage:
```python
variantMetadata(df[DataFrame])
```

In [41]:
# oskar.metadata.variantMetadata(df)

*_This call is commented out by default because of its huge output, but you can try it by removing the hash._
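
Because the full dict is large, it can help to look at a summary first. The sketch below is one way to do that with plain Python; `summarize` is a hypothetical helper and `toy_metadata` is a made-up stand-in for the real metadata.

```python
def summarize(metadata):
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in metadata.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} item(s)"
        else:
            summary[key] = repr(value)
    return summary


# Made-up stand-in for the dict returned by variantMetadata(df).
toy_metadata = {"version": "toy", "species": {}, "creationDate": "", "studies": [{}, {}]}

for key, description in summarize(toy_metadata).items():
    print(f"{key}: {description}")
```

With a real DataFrame, the same helper could be applied to the result of `oskar.metadata.variantMetadata(df)`.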

## Set Metadata

This method sets a VariantMetadata object on the given DataFrame and returns the resulting DataFrame.
<br>
<br>
Usage:
```python
setVariantMetadata(df[DataFrame], variant_metadata[VariantMetadata])
```

In [40]:
# Load the metadata into a variable.
df_metadata = oskar.metadata.variantMetadata(df)

# We can edit the metadata as we please but the structure must be respected.
df_metadata["version"] = "This version says that I love PyOskar"
df_metadata["species"] = {}
df_metadata["creationDate"] = ""
df_metadata["studies"] = []

# Set the new metadata into the DataFrame.
new_dataframe = oskar.metadata.setVariantMetadata(df, df_metadata)

# Print the new metadata.
print(oskar.metadata.variantMetadata(new_dataframe))

{'version': 'This version says that I love PyOskar', 'species': {}, 'creationDate': '', 'studies': []}
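
When editing metadata, it can be useful to work on a copy and to guard against accidental changes to the layout. The sketch below reads "the structure must be respected" as "keep the same top-level keys", which is an assumption; `edit_preserving_keys` is a hypothetical helper and not part of PyOskar.

```python
import copy


def edit_preserving_keys(metadata, updates):
    """Return an edited deep copy; reject updates that add new top-level keys."""
    unknown = set(updates) - set(metadata)
    if unknown:
        raise KeyError(f"unknown top-level keys: {sorted(unknown)}")
    edited = copy.deepcopy(metadata)  # leave the caller's dict untouched
    edited.update(updates)
    return edited


# Made-up stand-in for the dict returned by variantMetadata(df).
original = {"version": "toy", "species": {}, "creationDate": "", "studies": []}
edited = edit_preserving_keys(original, {"version": "edited"})

print(original["version"])
print(edited["version"])
```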


## Samples

This method returns a list of all the samples in the DataFrame metadata. Optionally, we can restrict it to a particular study.
<br>
<br>
Usage:
```python
samples(df[DataFrame], studyId[str]=None)
```

In [52]:
oskar.metadata.samples(df, "hgvauser@platinum:illumina_platinum")

['NA12877',
 'NA12878',
 'NA12879',
 'NA12880',
 'NA12881',
 'NA12882',
 'NA12883',
 'NA12884',
 'NA12885',
 'NA12886',
 'NA12887',
 'NA12888',
 'NA12889',
 'NA12890',
 'NA12891',
 'NA12892',
 'NA12893']
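
Since the result is a plain Python list, ordinary list operations apply. As a small sketch, the snippet below rebuilds the list printed above, checks that it has the 17 samples mentioned at the start of the tutorial, and builds a lookup from sample ID to its position in the list. The variable names are illustrative only.

```python
# Same IDs as in the output above: NA12877 through NA12893.
samples = [f"NA{number}" for number in range(12877, 12894)]

# Lookup from sample ID to its position in the returned list.
position = {sample: index for index, sample in enumerate(samples)}

print(len(samples))         # 17
print(position["NA12878"])  # 1
```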

## Pedigrees

This method returns a list of all the pedigrees, including their clinical data, in the DataFrame metadata. Optionally, we can restrict it to a particular study.
<br>
<br>
Usage:
```python
pedigrees(df[DataFrame], studyId[str]=None)
```
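
The returned structure is deeply nested: in the output shown further down, each member can carry a `father` entry that is itself a member-like dict. A small helper can flatten such a chain. The sketch below is hypothetical (`lineage` is not part of PyOskar), follows only the `father` key because that is the only parent key visible in this tutorial's output, and uses a trimmed copy of the first member with the phenotypes omitted.

```python
def lineage(member):
    """Follow nested 'father' entries and return the chain of member IDs."""
    chain = [member["id"]]
    while "father" in member:
        member = member["father"]
        chain.append(member["id"])
    return chain


# Trimmed copy of the first member in the tutorial output (phenotypes omitted).
member = {
    "id": "NA12884",
    "father": {
        "id": "NA12878",
        "father": {"id": "NA12892", "sex": "FEMALE"},
        "sex": "FEMALE",
    },
    "sex": "MALE",
}

print(lineage(member))  # ['NA12884', 'NA12878', 'NA12892']
```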

In [51]:
oskar.metadata.pedigrees(df, "hgvauser@platinum:illumina_platinum")

[{'name': 'FF',
  'phenotypes': [],
  'members': [{'id': 'NA12884',
    'father': {'id': 'NA12878',
     'father': {'id': 'NA12892',
      'sex': 'FEMALE',
      'phenotypes': [{'id': 'JJ',
        'name': 'JJ',
        'attributes': {},
        'ageOfOnset': '',
        'status': 'UNKNOWN'}]},
     'sex': 'FEMALE',
     'phenotypes': [{'id': 'LL',
       'name': 'LL',
       'attributes': {},
       'ageOfOnset': '',
       'status': 'UNKNOWN'}]},
    'sex': 'MALE',
    'phenotypes': [{'id': 'JJ',
      'name': 'JJ',
      'attributes': {},
      'ageOfOnset': '',
      'status': 'UNKNOWN'}],
    'attributes': {}},
   {'id': 'NA12878',
    'father': {'id': 'NA12892',
     'sex': 'FEMALE',
     'phenotypes': [{'id': 'JJ',
       'name': 'JJ',
       'attributes': {},
       'ageOfOnset': '',
       'status': 'UNKNOWN'}]},
    'sex': 'FEMALE',
    'phenotypes': [{'id': 'LL',
      'name': 'LL',
      'attributes': {},
      'ageOfOnset': '',
      'status': 'UNKNOWN'}]},
   {'id': 'NA12