# Introduction

This is the first of a series of practical sessions in which we will learn about databases, omics data, prior knowledge, data visualization, functional analysis and logic network modeling. Below you can find the schedule and the contact details of the trainer for each day.

1. June 29th. Intro to omics and prior knowledge databases ( martin.garrido@uni-heidelberg.de / mgrcprof@gmail.com )
2. June 30th. Data visualization and exploratory data analysis ( roramirezf@uni-heidelberg.de )
3. July 1st. Hypothesis testing ( roramirezf@uni-heidelberg.de )
4. July 4th. Functional omics + small network exercise, not running CARNIVAL ( pau.badia@uni-heidelberg.de )
5. July 5th. Logic network modeling ( jovan.tanevski@uni-heidelberg.de )

More information about our group can be found at https://saezlab.org/ and we are always welcoming interns!

# Working tool: Jupyter Notebooks in Google Colab

We will use Jupyter Notebooks running on the Google Colab platform. Jupyter Notebooks can be used to write text using the [Markdown format](https://www.markdownguide.org/) and to execute R / Python code. 

To work with today's notebook, please make a copy of this notebook in your Google Drive by selecting:

* File > Save a copy in Drive

The choice of programming language is up to you. This notebook is set up to run R, but if you prefer to use Python, you can change the runtime by selecting:

* Runtime > Change runtime type > Runtime type > Python3

**Once the runtime is changed to Python3, it cannot be changed back to R, so please be careful**

In [None]:
# if using R, run before starting the session
if(!require(BiocManager)) install.packages("BiocManager")
BiocManager::install("OmnipathR", update = FALSE)

In [None]:
# if using Python, run before starting the session
! pip install omnipath

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting omnipath
  Downloading omnipath-1.0.5-py3-none-any.whl (39 kB)
Collecting docrep>=0.3.1
  Downloading docrep-0.3.2.tar.gz (33 kB)
Collecting urllib3>=1.26.0
  Downloading urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
Collecting inflect>=4.1.0
  Downloading inflect-5.6.0-py3-none-any.whl (33 kB)
Collecting requests>=2.24.0
  Downloading requests-2.28.0-py3-none-any.whl (62 kB)
Building wheels for collected packages: docrep
  Building wheel for docrep (setup.py) ... done
  Created wheel for docrep: filename=docrep-0.3.2-py3-none-any.whl size=19896 sha256=c5792e68d2c1b0344cbee37977df7ba994c8f7b91c3b381b9bbad21cd90a7bd3
  Stored in directory: /root/.cache/pip/wheels/4b/a1/89/8c863c13903012831ee9e6f0544375e06de9c461659e968c40
Successfully built docrep
Ins

# Day 1: Sources of omics data and prior knowledge

Today we will interact with databases of omics data and prior knowledge. First, we will play around with their content using the web services that most of them offer, and later we will learn how to access and interact with their content programmatically.

## Session organization and schedule

### Organization

1. Form groups of 2 or 3 students
2. Each group will be working independently on a given set of tasks 
3. By the end of the session (15:45), each group will give a short presentation (15 minutes) about their results and findings. This is not for evaluation purposes, so do not feel pressured to make a "perfect" presentation. Rather, focus on documenting your solutions, and try to explain which aspects of the tasks you found most interesting or challenging.

### Schedule

* 13:00 - 13:15 -> Introduction, setup and troubleshooting
* 13:15 - 13:45 -> Task 1: Omics data
* 13:45 - 14:30 -> Task 2: Retrieving prior knowledge for a single protein
* 14:30 - 14:45 -> Break
* 14:45 - 15:30 -> Task 3: Retrieving interactions
* 15:30 - 15:45 -> Wrap-up of results and preparation of presentations. Send presentations to martin.garrido@uni-heidelberg.de
* 15:45 - 16:30 -> Presentations

# IMPORTANT: Don't be shy, ask questions. We are all learning.

# Task 1: Omics data

1. What are omics data? (1 answer per group)
2. Select a database from the "pointers" section. What type of omics data does it contain? (from now on, as many databases as group members)
3. Choose a biological context that may interest you. For example, a disease such as breast cancer, a stimulus such as DNA damage, or a treatment such as TNF inhibition. Write it down.
4. Search the database for records containing omics data for your biological context. Keywords such as "breast cancer" or "radiation" are useful for this task. How many records match your query? Select one record and write down its accession ID within the database of your choice.
5. Describe the dataset:
  1. When was it published? Does it have an associated publication?
  2. What specific omics data does this dataset contain?  
  3. What specific experiments were performed?
  4. How many samples does the dataset contain?
  5. Try to download the data associated with the record. What format does it have? Is it raw or preprocessed? Describe the data format.
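
Once a file has been downloaded, a quick look at its first lines often helps to describe the format. The following is a minimal sketch, not part of the original tasks: the helper names `peek` and `guess_delimiter`, the file name and the toy values are all made up for illustration, and real records may use formats (binary, archives, vendor-specific raw files) that this does not handle.

```python
import gzip
import os
import tempfile

def peek(path, n_lines=5):
    """Return the first lines of a plain or gzip-compressed text file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for _, line in zip(range(n_lines), handle)]

def guess_delimiter(line, candidates=("\t", ",", ";")):
    """Return the candidate delimiter that occurs most often in the line, or None."""
    counts = {delimiter: line.count(delimiter) for delimiter in candidates}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

# demo on a tiny, made-up tab-separated file (hypothetical values)
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example_counts.tsv")
    with open(example_path, "w", encoding="utf-8") as handle:
        handle.write("gene\tsample_1\tsample_2\nTP53\t10\t12\n")
    first_lines = peek(example_path)

print(first_lines)
print(repr(guess_delimiter(first_lines[0])))
```

A header row with sample names followed by numeric columns usually points to a preprocessed table, whereas raw data tends to come in instrument- or sequencer-specific formats.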

## Pointers
- https://www.ebi.ac.uk/arrayexpress/
- https://www.ncbi.nlm.nih.gov/geo/
- https://www.ebi.ac.uk/pride/
- https://www.ebi.ac.uk/metabolights/
- https://www.omicsdi.org/

## Context-specific pointers
- https://proteomic.datacommons.cancer.gov/pdc/
- https://gdc.cancer.gov/
- https://gtexportal.org/home/

# <font color='green'>Task 1 solution (only Web access)</font>




# Task 2: Retrieving prior knowledge for a single protein

## Web access

1. Choose a **human protein** that interests you. Write its name down.
2. Select one database from the "pointers" section. Look for your protein. Can you find a record associated with it? 
3. What ID does it have in this database? Does it have the same name that you wrote down?
4. Retrieve some information about your protein:
  1. General description
  2. Alternative names
  3. Number of alternative isoforms
  4. Protein sequence of the alternative isoforms
  5. Subcellular location

## Programmatic access

1. Write a short function in R / Python to retrieve the protein sequence of your protein using one of its IDs as input (hint: [UniProt provides easy programmatic access to its records](https://www.uniprot.org/help/api_retrieve_entries)).
2. Using your function, retrieve the sequence of the following proteins: P04637, P40763, Q92630, P00533, Q9BXS6
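
Before sending requests, it can be useful to check that an identifier looks like a UniProtKB accession at all. The sketch below uses the accession pattern published in the UniProt help pages; the helper name `is_uniprot_accession` is hypothetical, and a match only means the string has the right shape, not that the entry exists.

```python
import re

# accession pattern as published in the UniProt help pages
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def is_uniprot_accession(identifier):
    """Return True if the string has the shape of a UniProtKB accession."""
    return UNIPROT_ACCESSION.fullmatch(identifier.strip()) is not None

# the five accessions from the task, plus a gene symbol for contrast
for identifier in ["P04637", "P40763", "Q92630", "P00533", "Q9BXS6", "TP53"]:
    print(identifier, is_uniprot_accession(identifier))
```

Gene symbols such as TP53 do not match, which is a reminder that names and accession IDs are different kinds of identifiers.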

## Pointers

- https://www.uniprot.org/
- https://www.ncbi.nlm.nih.gov/gene/
- https://www.ensembl.org/index.html
- https://www.proteinatlas.org/

# <font color='green'>Task 2 solution (Web access)</font>

In [6]:
# Task 2 solution (Programmatic access)

# import urllib.request, needed to send requests to the UniProt API
import urllib.request

# define function
def get_uniprot_sequence(protein_id):

    # create API URL
    api_url = "https://rest.uniprot.org/uniprotkb/" + protein_id + ".txt"

    # define the keep and sequence variables
    keep = False
    sequence = ""

    # iterate over the lines of the API response
    for line in urllib.request.urlopen(api_url):

        # if the line starts with //, do not keep sequence
        if line.startswith(b"//"):
            keep = False

        # if the keep variable is true, then store sequence
        if keep:
            sequence += line.decode().strip()

        # if the line starts with the SQ letters, keep sequence in the next iteration
        if line.startswith(b"SQ"):
            keep = True
    
    # remove all white spaces in sequence
    sequence = sequence.replace(" ", "")

    # return the sequence
    return sequence

# iterate over accession IDs and retrieve their sequences
accession_list = ["P04637", "P40763", "Q92630", "P00533", "Q9BXS6"]
for accession in accession_list:
    print(get_uniprot_sequence(accession))

MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
MAQWNQLQQLDTRYLEQLHQLYSDSFPMELRQFLAPWIESQDWAYAASKESHATLVFHNLLGEIDQQYSRFLQESNVLYQHNLRRIKQFLQSRYLEKPMEIARIVARCLWEESRLLQTAATAAQQGGQANHPTAAVVTEKQQMLEQHLQDVRKRVQDLEQKMKVVENLQDDFDFNYKTLKSQGDMQDLNGNNQSVTRQKMQQLEQMLTALDQMRRSIVSELAGLLSAMEYVQKTLTDEELADWKRRQQIACIGGPPNICLDRLENWITSLAESQLQTRQQIKKLEELQQKVSYKGDPIVQHRPMLEERIVELFRNLMKSAFVVERQPCMPMHPDRPLVIKTGVQFTTKVRLLVKFPELNYQLKIKVCIDKDSGDVAALRGSRKFNILGTNTKVMNMEESNNGSLSAEFKHLTLREQRCGNGGRANCDASLIVTEELHLITFETEVYHQGLKIDLETHSLPVVVISNICQMPNAWASILWYNMLTNNPKNVNFFTKPPIGTWDQVAEVLSWQFSSTTKRGLSIEQLTTLAEKLLGPGVNYSGCQITWAKFCKENMAGKGFSFWVWLDNIIDLVKKYILALWNEGYIMGFISKERERAILSTKPPGTF
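
As an alternative to parsing the flat-text record, the same REST endpoint can serve an entry in FASTA format by using the `.fasta` extension. The following is a sketch under that assumption, not the reference solution: `parse_fasta` and `get_uniprot_sequence_fasta` are names introduced here, and the record used for the offline check is made up.

```python
import urllib.request

def parse_fasta(fasta_text):
    """Return the sequence of the first FASTA record as a single string."""
    sequence_lines = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # a second header starts a new record, so stop here
            seen_header = True
        elif seen_header and line:
            sequence_lines.append(line)
    return "".join(sequence_lines)

def get_uniprot_sequence_fasta(protein_id):
    """Download one UniProtKB entry in FASTA format and return its sequence."""
    api_url = "https://rest.uniprot.org/uniprotkb/" + protein_id + ".fasta"
    with urllib.request.urlopen(api_url) as response:
        return parse_fasta(response.read().decode())

# offline check with a made-up record (hypothetical accession and sequence)
example = ">sp|X00000|EXAMPLE_HUMAN Hypothetical protein\nMEEPQ\nSDPSV\n"
print(parse_fasta(example))
```

Calling `get_uniprot_sequence_fasta("P04637")` requires network access. Separating the parsing from the download makes it possible to test the parser without contacting the server.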

# Task 3: Retrieving interactions

What type of information can we represent with interactions? Intro to interaction databases and to OmniPath (~10 mins)

## Web access

1. Select one database from the "pointers" section. What type of interactions does it contain?
2. Using the protein that you selected in Task 2, retrieve its interactors from the database that you chose.
3. What type of evidence is used to give support to the interactions that you retrieved? Are there interactions with stronger supporting evidence than others?
4. Choose another protein, preferably less studied than the one that you chose before. Repeat steps 1, 2, and 3. Does it have more or fewer interactors than your previous choice? Why does this happen?

## Programmatic access

1. Using OmniPath (see OmniPath pointers), retrieve all the interactions where at least one of the previous proteins is involved (list: P04637, P40763, Q92630, P00533, Q9BXS6)

## Pointers

- https://reactome.org/
- https://string-db.org/
- http://stitch.embl.de/
- https://www.grnpedia.org/trrust/
- https://thebiogrid.org/
- https://www.ebi.ac.uk/intact/home
- https://www.wikipathways.org/index.php/WikiPathways

## OmniPath pointers

- https://omnipathdb.org/
- https://github.com/saezlab/OmnipathR
- https://github.com/saezlab/omnipath

# <font color='green'>Task 3 solution (Web access)</font>

In [7]:
# Task 3 solution (Programmatic access)

# import the omnipath library
import omnipath

# define interesting proteins
interesting_proteins = ["P04637", "P40763", "Q92630", "P00533", "Q9BXS6"]

# retrieve all interactions in omnipath
all_interactions = omnipath.interactions.AllInteractions.get()

# subset pandas dataframe with all interactions to those in which target or source are in the list
interesting_interactions = all_interactions[
    (all_interactions["target"].isin(interesting_proteins))
    | (all_interactions["source"].isin(interesting_proteins))
]

# print the head of the source, target and type columns of the interesting interactions
print(interesting_interactions[["source", "target", "type"]].head())


     source  target                type
73   P00533  Q8NET8  post_translational
210  P04637  O43663  post_translational
211  P04637  P18847  post_translational
212  P18847  P04637  post_translational
247  P04198  P04637  post_translational
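
To compare how well covered each protein is, one option is to count the interactions in which it takes part. The sketch below uses only the five rows printed above as input, so the numbers are illustrative rather than real totals; `count_interactions` is a hypothetical helper name.

```python
from collections import Counter

proteins_of_interest = {"P04637", "P40763", "Q92630", "P00533", "Q9BXS6"}

# (source, target) pairs copied from the five rows printed above
interactions = [
    ("P00533", "Q8NET8"),
    ("P04637", "O43663"),
    ("P04637", "P18847"),
    ("P18847", "P04637"),
    ("P04198", "P04637"),
]

def count_interactions(pairs, proteins):
    """Count, for each protein of interest, the interactions that involve it."""
    counts = Counter()
    for source, target in pairs:
        # the set keeps a self-interaction from being counted twice
        for protein in {source, target} & proteins:
            counts[protein] += 1
    return counts

counts = count_interactions(interactions, proteins_of_interest)
for protein, n_interactions in counts.most_common():
    print(protein, n_interactions)
```

With the full table, the same function can be fed with `interesting_interactions[["source", "target"]].itertuples(index=False)`.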


# Before tomorrow's session, please have a look at the following article:
# https://www.ahajournals.org/doi/10.1161/JAHA.120.019667