--------
# Day 1: Sources of omics data and prior knowledge

Today we will work with sources of omics data and prior knowledge. First, we will explore their content through the web interfaces that most of these databases offer; later, we will learn how to access that content programmatically.

## Session organization and schedule

- Introduction (30 mins)
- Task 1: Omics data (1 hour)
- Break (15 mins)
- Task 2: Retrieving prior knowledge for a single protein (45 mins)
- Task 3: Retrieving interactions (45 mins)
- Q&A and wrap-up (15 mins)

### Organization

Each student will work on a given set of tasks. At the end of each task, we will have a short Q&A session and a break.

### Schedule

* 13:00 - 13:15 -> Introduction, setup and troubleshooting
* 13:15 - 13:45 -> Task 1: Omics data
* 13:45 - 14:30 -> Task 2: Retrieving prior knowledge for a single protein
* 14:30 - 14:45 -> Break
* 14:45 - 15:30 -> Task 3: Retrieving interactions
* 15:30 - 15:45 -> Wrap-up of results and preparation of presentations. Send presentations to martin.garrido@uni-heidelberg.de
* 15:45 - 16:30 -> Presentations

In [None]:
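# Run this before Task 3: installs OmnipathR from Bioconductor.
# requireNamespace() checks for BiocManager without attaching it.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("OmnipathR", update = FALSE)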

------------

# Task 1: Retrieving omics data from public databases and repositories (45 mins)

1. Select a database from the "pointers" section below. What type of omics data does it contain?
2. Choose a biological context that may interest you. For example, a disease such as breast cancer, a biological process such as DNA damage, or a treatment such as TNF inhibition. Write it down.
3. Search the database for records containing omics data for your biological context. Keywords such as "breast cancer" or "radiation" could be useful for this task. How many records match your query? Select one record and write down its accession ID within the database of your choice.
4. Describe the dataset:
  1. When was it published? Does it have an associated publication?
  2. What specific omics data does this dataset contain?  
  3. What specific experiments were performed?
  4. How many samples does the dataset contain?
  5. Try to download the data associated with the record. What format does it have? Is it raw or preprocessed? Describe the data format.

## Pointers
- https://www.ebi.ac.uk/arrayexpress/
- https://www.ncbi.nlm.nih.gov/geo/
- https://www.ebi.ac.uk/pride/
- https://www.ebi.ac.uk/metabolights/
- https://www.omicsdi.org/

## Context-specific pointers
- https://proteomic.datacommons.cancer.gov/pdc/
- https://gdc.cancer.gov/
- https://gtexportal.org/home/

## Preprocessed data repositories (only for transcriptomics)
- https://ncbiinsights.ncbi.nlm.nih.gov/2023/04/19/human-rna-seq-geo/  
- https://maayanlab.cloud/biojupies/
- https://www.refine.bio/
- https://rna.recount.bio/ 

# Q&A and discussion on the main challenges of public omics data accession and reuse (15 mins)

--------------------

# Task 2: Retrieving knowledge for a single protein

## Web access

1. Choose a **human protein** that interests you. Write its name down.
2. Select one database from the "pointers" section. Look for your protein. Can you find a record associated with it? 
3. What ID does it have in this database? Does it have the same name that you wrote down?
4. Retrieve some information about your protein:
  1. General description
  2. Alternative names
  3. Number of alternative isoforms
  4. Protein sequence of the alternative isoforms
  5. Subcellular location

## Programmatic access

1. Write a short function in R or Python that takes one of your protein's IDs as input and retrieves its sequence (hint: [UniProt provides easy programmatic access to its records](https://www.uniprot.org/help/api_retrieve_entries)).
2. Using your function, retrieve the sequence of the following proteins: P04637, P40763, Q92630, P00533, Q9BXS6
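
One possible starting point for the two steps above, as a minimal sketch in base R. It assumes that UniProt serves a FASTA record at `https://rest.uniprot.org/uniprotkb/<accession>.fasta` and that each record holds a single sequence (the canonical isoform). The function names `uniprot_fasta_url`, `parse_fasta_sequence` and `get_uniprot_sequence` are hypothetical helpers, not part of any package.

```r
# Build the UniProt REST URL for one accession in FASTA format.
# Assumption: the endpoint pattern below matches the current UniProt REST API.
uniprot_fasta_url <- function(accession) {
  paste0("https://rest.uniprot.org/uniprotkb/", accession, ".fasta")
}

# Turn the lines of a single-record FASTA file into one sequence string:
# drop the header line (starts with ">") and join the remaining lines.
parse_fasta_sequence <- function(fasta_lines) {
  sequence_lines <- fasta_lines[!startsWith(fasta_lines, ">")]
  paste(trimws(sequence_lines), collapse = "")
}

# Download and parse the sequence for one accession (requires internet access).
get_uniprot_sequence <- function(accession) {
  fasta_lines <- readLines(uniprot_fasta_url(accession), warn = FALSE)
  parse_fasta_sequence(fasta_lines)
}

# Example usage (commented out because it needs network access):
# accessions <- c("P04637", "P40763", "Q92630", "P00533", "Q9BXS6")
# sequences <- vapply(accessions, get_uniprot_sequence, character(1))
# nchar(sequences)  # sequence length per accession
```

Keeping the parsing step separate from the download makes it easy to check the function on a small hand-written FASTA snippet before querying the server.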

## Pointers

- https://www.uniprot.org/
- https://www.ncbi.nlm.nih.gov/gene/
- https://www.ensembl.org/index.html
- https://www.proteinatlas.org/

# Q&A and discussion on the benefits of databases with APIs (10 mins)

--------------------

# Task 3: Retrieving interactions between biomolecules

## Web access

1. Select one database from the "pointers" section. What type of interactions does it contain?
2. Using the protein that you selected in Task 2, retrieve its interactors from the database that you chose.
3. What type of evidence supports the interactions that you retrieved? Are there interactions with stronger supporting evidence than others?
4. Choose another protein, preferably less studied than the one that you chose before. Repeat steps 1, 2, and 3. Does it have more or fewer interactors than your previous choice? Why might that be?

## Programmatic access

1. Using OmniPath (see OmniPath pointers), retrieve all interactions that involve at least one of the proteins from Task 2 (P04637, P40763, Q92630, P00533, Q9BXS6).
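
A minimal sketch of one way to approach this in R with the OmnipathR package installed at the start of the session. It assumes that `import_omnipath_interactions()` is available in the installed version (newer releases may expose it under a different name) and that the returned table has `source` and `target` columns holding UniProt accessions. `filter_by_partner` and `get_interactions_for` are hypothetical helper names. This call is expected to return the core OmniPath interaction dataset only; the package documentation describes how to include the additional datasets.

```r
# Keep the rows where at least one of the given proteins appears
# as either the source or the target of the interaction.
filter_by_partner <- function(interactions, proteins) {
  involved <- interactions$source %in% proteins | interactions$target %in% proteins
  interactions[involved, , drop = FALSE]
}

# Download interactions with OmnipathR and filter them (requires internet access).
# Assumption: import_omnipath_interactions() exists in the installed version.
get_interactions_for <- function(proteins) {
  all_interactions <- OmnipathR::import_omnipath_interactions()
  filter_by_partner(all_interactions, proteins)
}

# Example usage (commented out because it needs OmnipathR and network access):
# proteins <- c("P04637", "P40763", "Q92630", "P00533", "Q9BXS6")
# hits <- get_interactions_for(proteins)
# nrow(hits)
```

Filtering locally after one download keeps the logic explicit; for larger queries it may be preferable to restrict the request on the server side instead.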

## Pointers

- https://reactome.org/
- https://string-db.org/
- http://stitch.embl.de/
- https://www.grnpedia.org/trrust/
- https://thebiogrid.org/
- https://www.ebi.ac.uk/intact/home
- https://www.wikipathways.org/index.php/WikiPathways

## OmniPath pointers

- https://omnipathdb.org/
- https://github.com/saezlab/OmnipathR
- https://github.com/saezlab/omnipath

# Q&A and brief discussion on the limitations and biases in prior knowledge (10 mins)

--------------------