# Data collection and model annotation
<br>
<div align='center'><img src="https://raw.githubusercontent.com/vporubsky/tellurium-libroadrunner-tutorial/master/data_aggregation_logo.png" width = "50%" style="padding: 0px"></div>
<br>
<div align='center' style='font-size:100%'>
Veronica L. Porubsky, BS
<div align='center' style='font-size:100%'>Sauro Lab PhD Student, Department of Bioengineering<br>
Head of Outreach, <a href="https://reproduciblebiomodels.org/dissemination-and-training/seminar/">Center for Reproducible Biomedical Modeling</a><br>
University of Washington, Seattle, WA USA
</div>
</div>
<hr>

## TOC
* [What is data collection?](#data-collection)
* [What databases are useful for biochemical network modeling?](#databases)
* [What is metadata and how much should we collect?](#metadata)
* [Packages and constants](#packages-constants)
* [Importing data programmatically with KEGG and bioservices](#import-with-KEGG)
* [Collecting metadata programmatically with ChEBI](#collection-with-chebi)
* [Importing a kinetic constant](#kinetic-constant-import)
* [Storing collected data and metadata in a Pandas dataframe](#storing-aggregated-data)
* [What is model annotation and why do we do it?](#model-annotation)
* [Adding annotations to the repressilator model with Antimony](#repressilator-annotations-antimony)
* [Adding annotations to the repressilator model with sbmlutils](#repressilator-annotations-sbmlutils)
* [Exercises](#exercises)

# What is data collection? <a class="anchor" id="data-collection"></a>

Collecting data from multiple experiments, scientific papers, and online data sources is necessary
to fully inform your model.

<br>
<br>
Typically, you will need to curate the collected data to ensure that the quality of the measurements is acceptable for
inclusion in your model. This may involve excluding data measured in species or under environmental conditions that
differ from your model system.
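
The curation step can be sketched in a few lines of Python. This is only an illustration: the record layout (`organism`, `temperature_C`), the helper name `curate`, the tolerance, and all values are hypothetical placeholders rather than real measurements.

```python
# Hypothetical collected measurements; the values are placeholders, not real data.
measurements = [
    {"parameter": "kcat", "value": 1.0, "organism": "Escherichia coli", "temperature_C": 37.0},
    {"parameter": "kcat", "value": 2.0, "organism": "Homo sapiens", "temperature_C": 37.0},
    {"parameter": "kcat", "value": 3.0, "organism": "Escherichia coli", "temperature_C": 25.0},
]


def curate(records, organism, temperature_C, tolerance_C=2.0):
    """Keep records from the model organism that were measured near the model temperature."""
    return [
        record for record in records
        if record["organism"] == organism
        and abs(record["temperature_C"] - temperature_C) <= tolerance_C
    ]


curated = curate(measurements, organism="Escherichia coli", temperature_C=37.0)
print(curated)
```

In practice the exclusion criteria depend on the model system, so the organism and temperature filter above is just one possible choice.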


# What databases are useful for biochemical network modeling? <a class="anchor" id="databases"></a>


<ul>
  <li>SABIO-RK: biochemical reaction kinetics database</li>
     <ul class="square">
      <li>Describes chemical reactions and kinetics</li>
      <li>Contains information about participants and modifiers in reactions</li>
      <li>Metabolic and signaling network reactions</li>
     </ul>
  <li>BRENDA: the comprehensive enzyme information system</li>
     <ul class="square">
      <li>Enzyme information classified by the biochemical reaction it catalyzes</li>
      <li>Kinetic information about substrates and products is available</li> 
     </ul>
  <li>ChEBI: dictionary of "small" chemical compounds</li>
  <li>KEGG: collection of pathway/genome/disease/drug databases</li>
  <li>BioCyc: collection of pathway/genome databases</li>
       <ul class="square">
      <li>Search for genes, proteins, metabolites, or pathways, and occurrences of your term will be located in multiple databases</li>
     </ul>
  <li>BioModels Database: repository of mathematical models of biological systems</li>
      <ul class="square">
      <li> *Will be covered in more detail later in the course</li> 
     </ul>
</ul>

<a href="https://www.sciencedirect.com/science/article/abs/pii/S0958166917301428?via%3Dihub">Appendix A of Goldberg et al. (2018)</a> provides a useful and more comprehensive list of data sources containing intracellular biochemical data. 


# What is metadata and how much should we collect? <a class="anchor" id="metadata"></a>

<ul>
  <li>Metadata: data that describes biochemical data</li>
  <li>Collect information about:</li>
     <ul class="square">
      <li>Units</li>
      <li>Estimates of measurement accuracy</li>
      <li>Annotations</li>
      <li>Ontology terms defining the annotations</li>
      <li>etc.</li>
     </ul>
  <li>Collect provenance data:</li>
     <ul class="square">
      <li>Lab which generated the data</li>
      <li>Experimental conditions</li>
      <li>Protocol used to generate the data</li>
      <li>Paper which reported the measurement</li>
      <li>etc.</li>
     </ul>
</ul>
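
One way to keep a measurement together with the metadata and provenance listed above is a small record type. The sketch below is an assumption about how such a record could look: the class name `Measurement`, its field names, and every value are hypothetical placeholders (only the SBO identifier for a decay constant appears elsewhere in this tutorial).

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Measurement:
    name: str
    value: float
    units: str
    uncertainty: float  # estimate of measurement accuracy, in the same units as value
    annotations: list = field(default_factory=list)  # ontology term identifiers
    provenance: dict = field(default_factory=dict)   # lab, conditions, protocol, paper


# Placeholder values for illustration only
record = Measurement(
    name="example_decay_constant",
    value=0.5,
    units="1/s",
    uncertainty=0.05,
    annotations=["SBO:0000356"],
    provenance={
        "lab": "placeholder lab",
        "conditions": "placeholder conditions",
        "protocol": "placeholder protocol",
        "paper": "placeholder citation",
    },
)

print(asdict(record))
```

Converting the record with `asdict` gives a plain dictionary, which is convenient when the records are later collected into a table.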

# Packages and constants <a class="anchor" id="packages-constants"></a>

In [35]:
!pip install tellurium -q
!pip install bioservices -q
!pip install pandas -q
!pip install sbmlutils -q
!pip install openpyxl -q
# pathlib and ssl are part of the Python standard library, so they are not installed with pip

import tellurium as te # Python-based modeling environment for kinetic models
from bioservices import KEGG, ChEBI # Use for querying databases
from sbmlutils.metadata.annotator import ModelAnnotator, annotate_sbml # Use to annotate SBML
from pathlib import Path # Used for sbmlutils annotation functionality
import ssl # Use for managing data import from GitHub
# Disable HTTPS certificate verification so raw data on GitHub can be read (acceptable for this tutorial only)
ssl._create_default_https_context = ssl._create_unverified_context


# Importing data programmatically with KEGG and bioservices <a class="anchor" id="import-with-KEGG"></a>

In [17]:
# Select database
database = KEGG()

# Retrieve a KEGG entry
tetR_query = database.get("K18476")

# Parse the query result into a dictionary
tetR_dict = database.parse(tetR_query)

# Show information about the query
print(tetR_dict['NAME'])
print(tetR_dict['DEFINITION'])


['tetR']
TetR/AcrR family transcriptional regulator, tetracycline repressor protein


# Collecting metadata programmatically with ChEBI <a class="anchor" id="collection-with-chebi"></a>

In [18]:
# Select database
database = ChEBI()

# Retrieve a ChEBI entry for D-fructose 1,6-bisphosphate
query = database.getCompleteEntity("CHEBI:78682")

print(query.definition)


A ketohexose bisphosphate that is D-fructose substituted by phosphate groups at positions 1 and 6. It is an intermediate in the glycolysis metabolic pathway.


# Importing a kinetic constant <a class="anchor" id="kinetic-constant-import"></a>


# Storing collected data and metadata in a Pandas dataframe <a class="anchor" id="storing-aggregated-data"></a>

A Pandas dataframe is a tabular data structure with labeled rows and columns. Dataframes are useful for
storing collected values alongside their metadata and for pulling out specific quantities from a dataset by label.
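
As a minimal sketch, the two entries retrieved earlier in this notebook can be stored in a dataframe and queried by label. The column names (`source_database`, `description`) and the output file name are illustrative choices, and the ChEBI description is shortened from the definition printed above.

```python
import pandas as pd

# Entries retrieved earlier in this notebook, one row per entry
collected = pd.DataFrame(
    [
        {
            "identifier": "K18476",
            "source_database": "KEGG",
            "name": "tetR",
            "description": "TetR/AcrR family transcriptional regulator, tetracycline repressor protein",
        },
        {
            "identifier": "CHEBI:78682",
            "source_database": "ChEBI",
            "name": "D-fructose 1,6-bisphosphate",
            "description": "Intermediate in the glycolysis metabolic pathway",
        },
    ]
).set_index("identifier")

# Pull out one quantity by row label and column label
print(collected.loc["K18476", "name"])

# Select every row that came from one database
chebi_rows = collected[collected["source_database"] == "ChEBI"]
print(chebi_rows)

# Save the table, including its metadata columns, for later reuse
collected.to_csv("collected_data.csv")
```

Indexing by identifier keeps each value tied to the database entry it came from, which makes the provenance easy to recover later.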

# What is model annotation and why do we do it?  <a class="anchor" id="model-annotation"></a>



# Adding annotations to the repressilator model with Antimony <a class="anchor" id="repressilator-annotations-antimony"></a>

In [22]:
repressilator_str = """
# Species:
species M1, P3, P1, M2, P2, M3;

# Reactions:
J0:  -> M1; a_m1*(Kr_P3^n1/(Kr_P3^n1 + P3^n1)) + leak1;
J1: M1 -> ; d_m1*M1;
J2:  -> P1; a_p1*M1;
J3: P1 -> ; d_p1*P1;
J4:  -> M2; a_m2*(Kr_P1^n2/(Kr_P1^n2 + P1^n2)) + leak2;
J5: M2 -> ; d_m2*M2;
J6:  -> P2; a_p2*M2;
J7: P2 -> ; d_p2*P2;
J8:  -> M3; a_m3*(Kr_P2^n3/(Kr_P2^n3 + P2^n3)) + leak3;
J9: M3 -> ; d_m3*M3;
J10:  -> P3; a_p3*M3;
J11: P3 -> ; d_p3*P3;

# Species initializations:
M1 = 0.604016261711246;
P3 = 1.10433330559171;
P1 = 7.94746428021418;
M2 = 2.16464969760648;
P2 = 3.55413750091507;
M3 = 2.20471854765531;

# Variable initializations:
a_m1 = 1.13504504342841;
Kr_P3 = 0.537411795656332;
n1 = 7.75907326833983;
leak1 = 2.59839004225795e-07;
d_m1 = 0.360168301619141;
a_p1 = 5.91755684808254;
d_p1 = 1.11075218613419;
a_m2 = 2.57306185467814;
Kr_P1 = 0.190085253528206;
n2 = 6.89140262856765;
leak2 = 1.51282707494481e-06;
d_m2 = 1.05773721506759;
a_p2 = 8.35628834784826;
d_p2 = 0.520562081730298;
a_m3 = 0.417889543691157;
Kr_P2 = 2.71031378955001;
n3 = 0.44365980532785;
leak3 = 3.63586125130783e-11;
d_m3 = 0.805873530762994;
a_p3 = 4.61276807677109;
d_p3 = 1.54954108126666;



// CV terms (syntax reference only, commented out so that the model string loads;
// replace A with a model element and "cvterm" with a resource URI):
// A.sboTerm = 236 or A.sboTerm = SBO:0000236
// A identity "cvterm" or A biological_entity_is "cvterm"
// A hasPart "cvterm" or A part "cvterm"
// A isPartOf "cvterm" or A parthood "cvterm"
// A isVersionOf "cvterm" or A hypernym "cvterm"
// A hasVersion "cvterm" or A version "cvterm"
// A isHomologTo "cvterm" or A homolog "cvterm"
// A isDescribedBy "cvterm" or A description "cvterm"
// A isEncodedBy "cvterm" or A encoder "cvterm"
// A encodes "cvterm" or A encodement "cvterm"
// A occursIn "cvterm" or A container "cvterm"
// A hasProperty "cvterm" or A property "cvterm"
// A isPropertyOf "cvterm" or A propertyBearer "cvterm"
// A hasTaxon "cvterm" or A taxon "cvterm"
"""


# Adding annotations to the repressilator model with sbmlutils <a class="anchor" id="repressilator-annotations-sbmlutils"></a>

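An alternative for annotation in Python is sbmlutils, which applies annotations in bulk from a table (here, a CSV file) whose rows match model components by pattern and assign each match a qualifier and a resource.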

In [37]:
# Load the repressilator model
repressilator = te.loada(repressilator_str)

# Save the repressilator model in the SBML format to be used for annotation with sbmlutils
te.saveToFile('repressilator_sbml.xml', repressilator.getCurrentSBML())

# Simulate the repressilator model
repressilator.simulate(0, 100, 500)
repressilator.plot(figsize = (10, 8), linewidth = 3)

|    | pattern   | sbml_type   | annotation_type   | qualifier   | resource        | name              |
|---:|:----------|:------------|:------------------|:------------|:----------------|:------------------|
|  0 | d_m1      | parameter   | rdf               | BQB_IS      | sbo/SBO:0000356 | Decay constant    |
|  1 | d_m2      | parameter   | rdf               | BQB_IS      | sbo/SBO:0000356 | Decay constant    |
|  2 | d_m3      | parameter   | rdf               | BQB_IS      | sbo/SBO:0000356 | Decay constant    |
|  3 | d_p1      | parameter   | rdf               | BQB_IS      | sbo/SBO:0000356 | Decay constant    |
|  4 | d_p2      | parameter   | rdf               | BQB_IS      | sbo/SBO:0000356 | Decay constant    |
|  5 | d_p3      | parameter   | rdf               | BQB_IS      | sbo/SBO:0000356 | Decay constant    |
|  6 | n1        | parameter   | rdf               | BQB_IS      | sbo/SBO:0000190 | Hill coefficient  |
|  7 | n2        | parameter   | rdf               | BQ

In [None]:
# Read and print the annotation file, stored as a .csv file on GitHub
annotations_url = 'https://raw.githubusercontent.com/sys-bio/network-modeling-summer-school-2021/main/annotations/repressilator_annotations.csv'

repressilator_annotations = ModelAnnotator.read_annotations_df(annotations_url,
                                                               file_format="csv")
print(repressilator_annotations.to_markdown())
repressilator_annotations.to_csv('repressilator_annotations.csv')  

# Annotate the existing repressilator SBML
repressilator_doc = annotate_sbml(
    source=Path('repressilator_sbml.xml'),
    annotations_path=Path('repressilator_annotations.csv'),
    filepath=Path('repressilator_annotated.xml')
)

# Save annotated SBML string to file in working directory
REPRESSILATOR_ANNOTATED_SBML = repressilator_doc.getSBMLDocument().toSBML()
te.saveToFile('repressilator_annotated.xml',
              REPRESSILATOR_ANNOTATED_SBML)
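
To check which annotations ended up in an SBML file, the RDF resources can be listed with the Python standard library. This is a rough sketch rather than part of sbmlutils: the helper name `list_annotations` is hypothetical, and the inline fragment (including the exact form of the resource URI) is an illustrative assumption about what an annotated parameter looks like.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Illustrative fragment in the shape of an annotated SBML parameter (assumed, not tool output)
sbml_fragment = """
<listOfParameters xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
  <parameter id="d_m1" metaid="meta_d_m1">
    <annotation>
      <rdf:RDF>
        <rdf:Description rdf:about="#meta_d_m1">
          <bqbiol:is>
            <rdf:Bag>
              <rdf:li rdf:resource="https://identifiers.org/SBO:0000356"/>
            </rdf:Bag>
          </bqbiol:is>
        </rdf:Description>
      </rdf:RDF>
    </annotation>
  </parameter>
</listOfParameters>
"""


def list_annotations(root):
    """Yield (element id, resource URI) pairs for every annotated element."""
    for element in root.iter():
        element_id = element.get("id")
        if element_id is None:
            continue
        # Match the annotation child with or without an SBML namespace prefix
        annotation = next(
            (child for child in element if child.tag.endswith("annotation")), None
        )
        if annotation is None:
            continue
        for item in annotation.iter(f"{{{RDF_NS}}}li"):
            yield element_id, item.get(f"{{{RDF_NS}}}resource")


root = ET.fromstring(sbml_fragment)
found = list(list_annotations(root))
print(found)
```

For a file on disk, `ET.parse('repressilator_annotated.xml').getroot()` can be passed to the same function in place of the parsed fragment.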



# Exercises <a class="anchor" id="exercises"></a>

## Exercise 1:

Visit SABIO-RK and find a reaction involving tau-protein. What tissue and organism is the provided reaction relevant to,
according to the results?



What other metadata are available for the reaction?

## Exercise 1 Solution:

Tissue: brain, organisms: Homo sapiens and Rattus norvegicus


## Exercise 2:

Visit BRENDA and search for the enzyme lactase. What reaction does this enzyme catalyze? Record the reactants and products in the reaction.

What other databases does BRENDA link to which could help you build a pathway model containing the specified reaction?


## Exercise 2 Solution:



## Exercise 3:

Craving a coffee? Look up "caffeine" in ChEBI. Getting late in your region of the world? Try "melatonin" instead.

What is the ChEBI ID for your small molecule? What organisms was the metabolite detected in or isolated from according to the ChEBI data?


## Exercise 3 Solution:



## Exercise 4:

Explore this <a href="https://biocyc.org/overviewsWeb/celOv.shtml?orgid=ECOLI">full metabolic map on BioCyc</a>.


## Exercise 5:

Programmatically access the ChEBI entry for glucose and print the molecular formula information.

## Exercise 5 Solution:

In [None]:
# Select database
database = ChEBI()

# Retrieve the ChEBI entry for glucose
query = database.getCompleteEntity("CHEBI:17234")

print(query.Formulae)

# Acknowledgements
<br>
<div align='left'><img src="https://raw.githubusercontent.com/vporubsky/tellurium-libroadrunner-tutorial/master/acknowledgments.png" width="80%"></div>

<br>
<html>
   <head>
      <title>Bibliography</title>
   </head>
   <body>
      <h1>Bibliography</h1>
      <ol>
         <li>
            <p>K. Choi et al., <cite>Tellurium: An extensible python-based modeling environment for systems and synthetic biology</cite>, Biosystems, vol. 171, pp. 74–79, Sep. 2018.</p>
         </li>
         <li>
            <p>E. T. Somogyi et al., <cite>libRoadRunner: a high performance SBML simulation and analysis library</cite>, Bioinformatics, vol. 31, no. 20, pp. 3315–21, Oct. 2015.</p>
         </li>
         <li>
            <p>L. P. Smith, F. T. Bergmann, D. Chandran, and H. M. Sauro, <cite>Antimony: a modular model definition language</cite>, Bioinformatics, vol. 25, no. 18, pp. 2452–2454, Sep. 2009.</p>
         </li>
         <li>
            <p>K. Choi, L. P. Smith, J. K. Medley, and H. M. Sauro, <cite>phraSED-ML: a paraphrased, human-readable adaptation of SED-ML</cite>, J. Bioinform. Comput. Biol., vol. 14, no. 06, Dec. 2016.</p>
         </li>         
         <li>
            <p>B. N. Kholodenko, O. V. Demin, G. Moehren, and J. B. Hoek, <cite>Quantification of short term signaling by the epidermal growth factor receptor</cite>, J. Biol. Chem., vol. 274, no. 42, Oct. 1999.</p>
         </li>
      </ol>
   </body>
</html>