# Contribution guidelines: Data

Thank you for considering a contribution to the data collection of **NetworkCommons**! To keep the resource user-friendly and transparent, every contribution should contain at least the following elements. For examples of existing datasets, see the [Datasets details](../datasets.html).

## 1. Data information
* Experimental design: number of samples, number of experiments (if applicable), and confounding factors.
* Data production and processing: tools used and how the data processing was performed (if applicable).
* Files: number and type of files, with a short description of their contents.
* Link to the database from which the data was retrieved.
* Link to the dataset publication.
* Path information explaining the structure of the data directories.

This information should be appended to the existing YAML file in `networkcommons/data/datasets.yaml`.

An example of this can be found below:

    NCI60:
        name: NCI60
        description: NCI-60 cell line data
        publication_link: https://doi.org/10.1038/nrc1951
        detailed_description: >-
            This dataset contains data from the NCI-60 cell line panel.
            It includes three files: TF activities from transcriptomics data,
            metabolite abundances and gene reads.
        path: NCI60/{cell_line}/{cell_line}__{data_type}.tsv

This information can then be accessed via `nc.data.omics.datasets()`:

    import networkcommons as nc

    nc.data.omics.datasets()

| | name | description | publication_link | detailed_description |
|---|---|---|---|---|
| decryptm | DecryptM | Drug perturbation proteomics and phosphoproteomics data | https://doi.org/10.1126/science.ade3925 | This dataset contains the profiling of 31 cancer drugs in 13 human cancer cell line models resulted in 1.8 million dose-response curves, including 47,502 regulated phosphopeptides, 7316 ubiquitinylated peptides, and 546 regulated acetylated peptides. |
| panacea | Panacea | Pancancer Analysis of Chemical Entity Activity RNA-Seq data | https://doi.org/10.1016/j.xcrm.2021.100492 | PANACEA contains dose-response and perturbational profiles for 32 kinase inhibitors in 11 cancer cell lines, in addition to a DMSO control. Originally, this resource served as the basis for a DREAM Challenge assessing the accuracy and sensitivity of computational algorithms for de novo drug polypharmacology predictions. |
| CPTAC | CPTAC | Clinical Proteomic Tumor Analysis Consortium data | https://doi.org/10.1158/2159-8290.CD-13-0219 | This dataset contains data from the Clinical Proteomic Tumor Analysis Consortium. It includes various cancer types and proteomic data. |
| NCI60 | NCI60 | NCI-60 cell line data | https://doi.org/10.1038/nrc1951 | This dataset contains data from the NCI-60 cell line panel. It includes three files: TF activities from transcriptomics data, metabolite abundances and gene reads. |
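
Before appending a new entry to the YAML file, it can help to check it locally. The following is a minimal sketch and not part of NetworkCommons: `REQUIRED_KEYS`, `check_entry` and `path_fields` are hypothetical names, and treating every key of the NCI60 example as mandatory is an assumption. The entry is written as a plain Python dictionary so that the sketch needs only the standard library.

```python
from string import Formatter

# Keys used by the NCI60 example entry; treating all of them as
# mandatory is an assumption of this sketch.
REQUIRED_KEYS = (
    "name",
    "description",
    "publication_link",
    "detailed_description",
    "path",
)


def check_entry(entry):
    """Return the sorted list of required keys that are missing or empty."""
    return sorted(key for key in REQUIRED_KEYS if not entry.get(key))


def path_fields(template):
    """Return the placeholder names of a path template, in order, without repeats."""
    fields = []
    for _, field_name, _, _ in Formatter().parse(template):
        if field_name and field_name not in fields:
            fields.append(field_name)
    return fields


# Entry mirroring the NCI60 example (detailed_description shortened).
nci60 = {
    "name": "NCI60",
    "description": "NCI-60 cell line data",
    "publication_link": "https://doi.org/10.1038/nrc1951",
    "detailed_description": "This dataset contains data from the NCI-60 cell line panel.",
    "path": "NCI60/{cell_line}/{cell_line}__{data_type}.tsv",
}

missing = check_entry(nci60)
fields = path_fields(nci60["path"])
example_path = nci60["path"].format(cell_line="A498", data_type="RNA")

print(missing)
print(fields)
print(example_path)
```

The placeholders reported by `path_fields` are the arguments a retrieval function for the dataset would need to accept, which is a quick way to confirm that the `path` entry and the planned function signature agree.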


## 2. API
The data will either be deposited on the [NetworkCommons server](https://commons.omnipathdb.org/) or accessed directly from the original source. In either case, the following functions are required:

* A function that provides an overview of the subsets (if applicable). For example, see `nc.data.omics.decryptm_experiments()`.
* If the data comprises several files (for example, different omics layers or metadata tables), a function that lists them. For example, see `nc.data.omics.nci60_datatypes()`.
* A function that retrieves the data. For example, see `nc.data.omics.nci60_table()`. Ideally, it returns a `pd.DataFrame`; we are planning to expand support to `AnnData` instances.

These new functions can be implemented in a new file, `_{dataset}`, inside the `networkcommons/data/omics/` folder.

For example, `nc.data.omics.nci60_table()` returns a single `pd.DataFrame` for a given data type and cell line:

    nc.data.omics.nci60_table(cell_line='A498', data_type='RNA').head()

| | ID | score |
|---|---|---|
| 0 | WASH7P | -2.109966 |
| 1 | NOC2L | -1.480194 |
| 2 | HES4 | -0.781522 |
| 3 | ISG15 | 0.406806 |
| 4 | AGRN | -0.32497 |
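
Putting the requirements of this section together, a module for a new dataset could be laid out roughly as follows. This is a sketch under stated assumptions, not the NetworkCommons implementation: the dataset name `mydataset`, the three function names, the `_FILES` dictionary and its contents are all hypothetical. The dictionary stands in for files that a real module would download from the server or the original source, and the overview functions return plain lists here, which may differ from what the existing functions return.

```python
import io

import pandas as pd

# Hypothetical stand-in for remote files, keyed by relative path.
# A real module would download these instead of keeping them in memory.
_FILES = {
    "mydataset/exp1/exp1__RNA.tsv": "ID\tscore\nGENE_A\t1.5\nGENE_B\t-0.3\n",
    "mydataset/exp1/exp1__meta.tsv": "sample\tgroup\ns1\tcontrol\ns2\ttreated\n",
    "mydataset/exp2/exp2__RNA.tsv": "ID\tscore\nGENE_A\t0.2\n",
}


def _split(path):
    """Split 'mydataset/{experiment}/{experiment}__{data_type}.tsv' into its parts."""
    _, experiment, filename = path.split("/")
    data_type = filename.rsplit(".", 1)[0].split("__", 1)[1]
    return experiment, data_type


def mydataset_experiments():
    """Overview of the subsets (here: experiments) of the dataset."""
    return sorted({_split(path)[0] for path in _FILES})


def mydataset_datatypes(experiment):
    """Data types (one file each) available for one experiment."""
    return sorted(dt for exp, dt in map(_split, _FILES) if exp == experiment)


def mydataset_table(experiment, data_type):
    """Return one table of the dataset as a pandas DataFrame."""
    path = f"mydataset/{experiment}/{experiment}__{data_type}.tsv"
    if path not in _FILES:
        raise ValueError(f"No {data_type!r} table for experiment {experiment!r}")
    return pd.read_csv(io.StringIO(_FILES[path]), sep="\t")


print(mydataset_experiments())
print(mydataset_datatypes("exp1"))
print(mydataset_table("exp1", "RNA"))
```

Keeping the path template in one place (`mydataset_table`) and deriving the overview functions from the available files means the three functions cannot drift apart when files are added.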
