# Example of Metadata Tracking with OpenCGA
OpenCGA has many components; one of the most important for this project is OpenCGA Catalog. This component lets users create a metadata database to track users, projects, studies, files, samples, families, and jobs. We will interact with this database through the REST API included in OpenCGA, using the Python client library `PyOpenCGA`. See the [PyOpenCGA documentation](http://docs.opencb.org/display/opencga/Python) to learn more.

This notebook serves as an example of how to use OpenCGA to track metadata in the Silent Genomes Project. More specifically, it shows how to use `PyOpenCGA` with the OpenCGA instance created on the BCCHR server.

The following link provides further reference on OpenCGA and how to use it:
* [Overview of Data Management](http://docs.opencb.org/display/opencga/Data+Management) including definitions of the major entities involved in OpenCGA Catalog

# Imports
The following cell imports the libraries needed to run the code in this notebook.

In [14]:
# Import libraries needed to run the script
import os
import pandas as pd
import matplotlib.pyplot as plt
from time import sleep
from pprint import pprint

from result import *
from pyopencga.opencga_config import ClientConfiguration
from pyopencga.opencga_client import OpenCGAClient

# Configure and Connect
Before using OpenCGA, the client must be configured and the user authenticated. The OpenCGA host, username, and password are sensitive and should not be shared openly. For this reason, they are stored as environment variables in a `.env` file located in the base directory of this project.

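To read this information into your environment, activate your virtual environment and load the variables in the same shell you will launch the notebook from. Note that the variables only reach the notebook process if the lines in `.env` use `export`. Run the following commands before starting the notebook: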
```
source venv/bin/activate
source .env
```

This should load the variables `OPENCGA_HOST`, `OPENCGA_USERNAME`, and `OPENCGA_PASSWORD` into your environment so you can successfully connect to the client.
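
If a variable is missing, `os.environ.get` silently returns `None`, and the failure only shows up later as a connection or login error. A minimal sketch of an early check is shown below; the helper name `missing_env_vars` is hypothetical and not part of PyOpenCGA.

```python
import os

# The three variables this notebook expects to find in the environment
REQUIRED_VARS = ("OPENCGA_HOST", "OPENCGA_USERNAME", "OPENCGA_PASSWORD")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All OpenCGA connection variables are set.")
```

Running this before the configuration cell makes it clear whether `source .env` took effect in the shell that launched the notebook.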

In [2]:
# Configure the OpenCGA client
config = ClientConfiguration({
        "rest": {
                "host": os.environ.get("OPENCGA_HOST")
        }
})
oc = OpenCGAClient(config)

# Authenticate the user
user = os.environ.get("OPENCGA_USERNAME")
passwd = os.environ.get("OPENCGA_PASSWORD")
oc.login(user, passwd)

# Setup OpenCGA Clients
users = oc.users
projects = oc.projects
studies = oc.studies
files = oc.files
jobs = oc.jobs
families = oc.families
individuals = oc.individuals
samples = oc.samples
cohorts = oc.cohorts
panels = oc.panels

# Querying the Metadata Database
The following section outlines a number of queries that can be performed using OpenCGA. Information in this database is organized into the following major entities:
1. **_Users_**: a person who will be using OpenCGA -- every user should be authenticated and some can perform specific actions
2. **_Groups_**: you can create a group of users to simplify data permission management. These are defined at the study level (i.e. each study contains different groups) and can only be created by the study owner
3. **_Projects_**: a piece of planned work or an activity that is finished over a period of time and is intended to achieve a particular purpose. Any user with full permissions can create any number of projects and studies. A project needs, at minimum, a name, an alias (project identifier), and an organism (species)
4. **_Studies_**: projects are composed of a set of studies. A study is the activity of examining a subject in detail in order to discover new information. Aside from Users and Projects, most of the data models in Catalog belong to a particular study
5. **_Variable Sets_**: a set of Variables (the complete definition of fields that need to be populated). Every Study can have as many different Variable Set definitions as necessary
6. **_Annotation Sets_**: the values defined for each of the Variables are called Annotations, and the population of a whole Variable Set is called an Annotation Set. An Annotation Set is always related to one concrete Variable Set
7. **_Files_**: every File record contains the physical path (i.e. the URI) where the files/folders are stored in the file system. Catalog also creates a virtual file structure, so no matter where the files really are, users can organize and work with them differently
8. **_Individuals and Families_**: an Individual is a subject (typically a person) for which some analysis will be made. A group of Individuals with any parental or blood relationship is called a Family
9. **_Samples and Cohorts_**: a Sample is any biological material, normally extracted from an Individual. Cohorts contain groups of samples sharing some particular conditions such as "healthy" vs "infected"
10. **_Clinical Analysis_**: contains all the information about the Individuals and Samples needed to perform a real clinical analysis
11. **_Job_**: every time the user calls an analysis web service to run anything, a new Job is created. This job contains the essential information about the task that needs to be run. OpenCGA supports SGE

The above entities are related in the following way:
![OpenCGA Data Model](http://docs.opencb.org/download/attachments/327907/catalog_data_models_v13.png?version=1&modificationDate=1560245879990&api=v2)
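
The relationship between Variable Sets and Annotation Sets described in the list above can be easier to follow with a toy example. The sketch below uses plain Python dictionaries. The field names, the `sample_qc` identifier, and the `check_annotation_set` helper are all hypothetical and do not reflect the actual OpenCGA schema or API.

```python
# Hypothetical, simplified stand-in for a Variable Set: NOT the OpenCGA schema.
# It defines which variables exist, their types, and whether they are required.
variable_set = {
    "id": "sample_qc",
    "variables": {
        "tissue": {"type": str, "required": True},
        "coverage": {"type": float, "required": False},
    },
}


def check_annotation_set(annotations, variable_set):
    """Return a list of problems found when matching annotations to a variable set."""
    problems = []
    variables = variable_set["variables"]
    for name, spec in variables.items():
        if name not in annotations:
            if spec["required"]:
                problems.append(f"missing required variable: {name}")
        elif not isinstance(annotations[name], spec["type"]):
            problems.append(f"wrong type for variable: {name}")
    for name in annotations:
        if name not in variables:
            problems.append(f"unknown variable: {name}")
    return problems


# An Annotation Set fills in values for exactly one Variable Set
annotation_set = {
    "variableSetId": "sample_qc",
    "annotations": {"tissue": "blood", "coverage": 31.5},
}
print(check_annotation_set(annotation_set["annotations"], variable_set))
```

The point of the sketch is only the one-to-one link: the Variable Set is the definition, and each Annotation Set is one filled-in instance of it.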


For more information regarding API endpoints for OpenCGA, you can refer to the demo Swagger located [here](http://bioinfo.hpc.cam.ac.uk/opencga-demo/webservices/#/). 

If you need more information about an object or method, you can use the `help()` command as shown below:

In [22]:
help(oc.samples)

Help on Samples in module pyopencga.rest_clients.sample_client object:

class Samples(pyopencga.rest_clients._parent_rest_clients._ParentBasicCRUDClient, pyopencga.rest_clients._parent_rest_clients._ParentAclRestClient, pyopencga.rest_clients._parent_rest_clients._ParentAnnotationSetRestClient)
 |  Samples(configuration, session_id=None, login_handler=None, *args, **kwargs)
 |  
 |  This class contains method for Samples webservice
 |  
 |  Method resolution order:
 |      Samples
 |      pyopencga.rest_clients._parent_rest_clients._ParentBasicCRUDClient
 |      pyopencga.rest_clients._parent_rest_clients._ParentAclRestClient
 |      pyopencga.rest_clients._parent_rest_clients._ParentAnnotationSetRestClient
 |      pyopencga.rest_clients._parent_rest_clients._ParentRestClient
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, configuration, session_id=None, login_handler=None, *args, **kwargs)
 |      :param login_handler: a parameterless method that can log

### Querying for Samples
Samples belong to a study, which in turn belongs to a project, so a sample query names the study to search.

In [19]:
samples.search(study='1kG_phase3').num_results()

1000

In [16]:
sample_results = samples.search(study='1kG_phase3', limit=2)

for res in sample_results.results():
    pprint(res)

{'annotationSets': [],
 'attributes': {'OPENCGA_INDIVIDUAL': {'affectationStatus': 'UNKNOWN',
                                       'attributes': {},
                                       'creationDate': '20210901022141',
                                       'disorders': [],
                                       'ethnicity': 'AFR',
                                       'father': {'parentalConsanguinity': False,
                                                  'release': 0,
                                                  'version': 0},
                                       'id': 'HG00096',
                                       'karyotypicSex': 'XY',
                                       'lifeStatus': 'ALIVE',
                                       'location': {},
                                       'modificationDate': '20210904233501',
                                       'mother': {'parentalConsanguinity': False,
                                                  'relea
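
Each result is a deeply nested dictionary, so printing it in full is hard to read. A minimal sketch for pulling out a few fields is shown below. The `pick` helper is hypothetical, and the example record is a trimmed-down copy that uses only keys visible in the output above.

```python
def pick(record, dotted_path, default=None):
    """Follow a dot-separated path through nested dicts, returning default if a key is missing."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Trimmed-down record using only keys visible in the output above
example = {
    "annotationSets": [],
    "attributes": {
        "OPENCGA_INDIVIDUAL": {
            "id": "HG00096",
            "karyotypicSex": "XY",
            "ethnicity": "AFR",
        }
    },
}

prefix = "attributes.OPENCGA_INDIVIDUAL."
fields = ("id", "karyotypicSex", "ethnicity", "lifeStatus")
summary = {field: pick(example, prefix + field) for field in fields}
print(summary)
```

Here `lifeStatus` is deliberately left out of the trimmed record to show that a missing key yields `None` instead of raising a `KeyError`. The same helper can be applied to each item from `sample_results.results()`.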

In [21]:
# Inspect the response object returned by the earlier search, avoiding a new unscoped query
help(sample_results)

Help on RESTResponse in module pyopencga.commons object:

class RESTResponse(builtins.object)
 |  RESTResponse(response)
 |  
 |  Methods defined here:
 |  
 |  __init__(self, response)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  num_matches(self)
 |      Return the total number of matches taking of all the DataResponses
 |  
 |  num_results(self)
 |      Return the total number of results taking of all the DataResponses
 |  
 |  results(self)
 |      Iterates over all the results of all the QueryResults
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
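
Since a single search may return only part of the matching records, it can be useful to page through them. The sketch below assumes the search call accepts `limit` and `skip` query parameters; `limit` appears earlier in this notebook, while `skip` is an assumption to verify against your OpenCGA version. The `iter_all_results` helper and the fake classes are hypothetical and exist only so the example runs without a server.

```python
def iter_all_results(search, page_size=100, **query):
    """Yield every result by calling search() page by page until a short page comes back."""
    skip = 0
    while True:
        response = search(limit=page_size, skip=skip, **query)
        page = list(response.results())
        yield from page
        if len(page) < page_size:
            break
        skip += page_size


# Stand-ins for running without a server (hypothetical, not part of PyOpenCGA)
class FakeResponse:
    def __init__(self, items):
        self._items = items

    def results(self):
        return iter(self._items)


def fake_search(limit, skip, **query):
    data = [{"id": f"S{i}"} for i in range(5)]
    return FakeResponse(data[skip:skip + limit])


print([r["id"] for r in iter_all_results(fake_search, page_size=2)])
```

Against a real server the call would look like `iter_all_results(samples.search, study='1kG_phase3')`, provided the assumption about `skip` holds. Comparing `num_results()` with `num_matches()` on a response is one way to tell whether a query was cut short.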

