# Seminar Quantitative Microbiology
# Metabolite Production with Strain Engineering

## Introduction


For an overview of Jupyter notebooks, read [this review](https://www.nature.com/articles/d41586-018-07196-1).

---

## Tutorial Steps
  * Setup of the Python environment
    * Basic libraries (sys, pandas, numpy, matplotlib, zipfile, cobrapy)
  * Analysis of Genome Scale Metabolic Model
    * Retrieval of GSMM for *P. pastoris*
    * Flux variability of exchange reactions
    * Minimal medium composition
  * Experimental growth rate reproduction
    * Familiarizing with biomass composition reactions
    * Defining functions for correct biomass equation switch
    * Data retrieval
    * Simulation loop
    * Graphical output


## Set-up compute environment

Before we can analyse a GSMM, we have to set up the Python environment so that it includes the cobrapy toolbox, and we have to download the GSMM.

### Basic Python libraries 
Some libraries that facilitate data manipulation

In [86]:
import sys # loading commands to control/navigate within the system architecture
import random # random sampling, used below for sequence generation
from os.path import join
import xlrd
# Loading pandas, a library for data manipulation
import pandas as pd
# import lxml

# Loading numpy, a library for manipulation of numbers
import numpy as np

# loading matplotlib, a library for visualization
import matplotlib.pyplot as plt
%matplotlib inline

# loading cobrapy, a library dedicated to the analysis of genome scale metabolic models
from cobra.io import read_sbml_model, write_sbml_model, load_matlab_model

# loading escher for metabolic network visualization
import escher
from escher import Builder
from time import sleep
escher.rc['never_ask_before_quit'] = True
# list of available maps
# print(escher.list_available_maps())

# loading Memote, quality assessment of GSMM
from memote import test_model, snapshot_report

print('System ready')

System ready


In [87]:
!wget http://bigg.ucsd.edu/static/models/e_coli_core.xml

# loading a visualization file of the metabolic network. 
# For frequently used models, like iJO1366, Escher automatically retrieves the visualization file.  
#!wget http://bigg.ucsd.edu/static/models/iJO1366.json

--2020-08-31 16:18:28--  http://bigg.ucsd.edu/static/models/e_coli_core.xml
Resolving bigg.ucsd.edu (bigg.ucsd.edu)... 169.228.33.117
Connecting to bigg.ucsd.edu (bigg.ucsd.edu)|169.228.33.117|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 707166 (691K) [application/xml]
Saving to: ‘e_coli_core.xml.1’


2020-08-31 16:18:30 (734 KB/s) - ‘e_coli_core.xml.1’ saved [707166/707166]



In [88]:

# generating cobra variable from SBML/xml/mat file
myFile = 'e_coli_core.xml'
model = read_sbml_model(myFile)
model

| | |
|---|---|
| Name | e_coli_core |
| Memory address | 0x07fb0580f1b90 |
| Number of metabolites | 72 |
| Number of reactions | 95 |
| Number of groups | 0 |
| Objective expression | 1.0*BIOMASS_Ecoli_core_w_GAM - 1.0*BIOMASS_Ecoli_core_w_GAM_reverse_712e5 |
| Compartments | extracellular space, cytosol |


### Flux variability of exchange reactions

Flux balance analysis returns a single optimal solution. Usually, however, a number of alternative flux distributions exist at or near the optimum, and these can be physiologically relevant. To quantify the variability of the exchange fluxes around the optimal solution, 'flux variability analysis' can be performed ([Mahadevan & Schilling, 2003](http://dx.doi.org/10.1016/j.ymben.2003.09.002)). Use the following command to obtain the minimum and maximum fluxes of the exchange reactions in the model.

model.summary(fva=.95)

In [89]:
solution = model.optimize()
model.summary() # the additional argument fva=.95 reports flux ranges for solutions reaching at least 95% of the optimum


| IN_FLUXES ID | IN_FLUXES FLUX | OUT_FLUXES ID | OUT_FLUXES FLUX | OBJECTIVES ID | OBJECTIVES FLUX |
|---|---|---|---|---|---|
| o2_e | 21.799493 | h2o_e | 29.175827 | BIOMASS_Ecoli_core_w_GAM | 0.873922 |
| glc__D_e | 10.0 | co2_e | 22.809833 | | |
| nh4_e | 4.765319 | h_e | 17.530865 | | |
| pi_e | 3.214895 | | | | |


In [24]:
# Create the Escher builder object
builder = Builder()
# Load an Escher map
builder.map_name = 'e_coli_core.Core metabolism'
builder.model = model
builder.reaction_data = solution.fluxes
# Add some data for metabolites
#builder.metabolite_data = solution.shadow_prices
# Simplify the map by hiding some labels
builder.hide_secondary_metabolites = True
#builder.hide_all_labels = True
builder.reaction_scale = [
    { 'type': 'min', 'color': '#000000', 'size': 12 },
    { 'type': 'median', 'color': '#ffffff', 'size': 20 },
    { 'type': 'max', 'color': '#ff0000', 'size': 25 }
]
builder.reaction_scale_preset = 'GaBuRd'

# Make all the arrows three times as thick
builder.reaction_scale = [
    {k: v * 3 if k == 'size' else v for k, v in x.items()}
    for x in builder.reaction_scale
]
builder.save_html('escher_map_file.html')

Downloading Map from https://escher.github.io/1-0-0/6/maps/Escherichia%20coli/e_coli_core.Core%20metabolism.json


### Malate Export Reaction
coding for malate export reaction.

In [None]:
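def generate_Genome(Genome_Size, Genome_GGcont, Model):
    '''
    Function to generate a random genome sequence with a defined GC content.
    The genome is meant for mapping promoter activities to reactions in a cobrapy model;
    Model is accepted for that mapping step but is not used yet.
    '''
    # G and C share the GC fraction equally, A and T share the remainder
    weights = [Genome_GGcont/2, Genome_GGcont/2, (1-Genome_GGcont)/2, (1-Genome_GGcont)/2]
    return ''.join(random.choices(['G','C','A','T'], weights=weights, k=Genome_Size))

In [None]:
# 1. generating random genome with defined GC-content
# Genome_Size, Genome_GGcont and Model have to be defined before running this cell
Genome_Sequence = generate_Genome(Genome_Size, Genome_GGcont, Model)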

In [195]:
Model = model
# coding sequence construction
# first we determine the minimum gene length needed to distinguish the enzymes in the model
Enzyme_Number = len(Model.reactions)
Gen_Minimum = np.ceil(np.log2(Enzyme_Number))

# genes consist of codon triplets, so we round up to the next whole number of codons
Gene_Length = int(np.ceil(Gen_Minimum/3))
CodonFile = 'CodonTriplets.csv'
CodonTriplets = pd.read_csv(CodonFile, delimiter=';', skipinitialspace=True)
# split the table into stop codons and coding codons
IsStop = np.array(['Stop' in s for s in CodonTriplets['Name']])
CodonStop = CodonTriplets[IsStop].reset_index()
CodonCoding = CodonTriplets[~IsStop].reset_index()
Gene_Coding = [random.choices(CodonCoding['Triplet'], weights=CodonCoding['Percent']) for CodonId in range(Gene_Length)]
Gene_Stop = random.choices(CodonStop['Triplet'], weights=CodonStop['Percent'])
Gene_Coding.append(Gene_Stop)
Gene_Coding.insert(0,['ATG'])
Gene_Coding = ''.join([Letter for Nest in Gene_Coding for Letter in Nest])
print(Gene_Coding)


ATGGATCCCCGTTGA
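
The construction above can also be sketched with the standard library only. The codon tables below are hypothetical stand-ins with made-up weights (they are not the content of `CodonTriplets.csv`), and `random_gene` is an illustrative helper name. The length rule is the same as in the cell above.

```python
import math
import random

# Hypothetical codon usage (triplet -> relative weight); NOT the real CodonTriplets.csv
CODING_CODONS = {'GAT': 3.0, 'CCC': 1.0, 'CGT': 2.0, 'AAA': 4.0}
STOP_CODONS = {'TAA': 3.0, 'TGA': 1.0, 'TAG': 0.5}

def random_gene(n_reactions, rng=random):
    """Return start codon + weighted random coding codons + one weighted stop codon."""
    # log2 of the number of reactions, rounded up to a whole number of codons
    n_codons = math.ceil(math.ceil(math.log2(n_reactions)) / 3)
    coding = rng.choices(list(CODING_CODONS),
                         weights=list(CODING_CODONS.values()), k=n_codons)
    stop = rng.choices(list(STOP_CODONS),
                       weights=list(STOP_CODONS.values()), k=1)
    return ''.join(['ATG'] + coding + stop)

# 95 reactions as in e_coli_core: ceil(log2(95)) = 7, ceil(7/3) = 3 coding codons
gene = random_gene(95, rng=random.Random(0))
print(gene, len(gene))
```

For 95 reactions this gives three coding codons, so the gene has 15 nucleotides including start and stop codon, the same length as the sequence printed above.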


In [81]:
# background genome construction
Genome_Size = 20
GC_cont = .5

SeqNestList = [random.choices([Letter for Nest in random.choices([['G','C'],['A','T']], weights=[GC_cont, 1-GC_cont]) for Letter in Nest]) for _x in range(Genome_Size)]
Genome_Sequence = ''.join(Letter for Nest in SeqNestList for Letter in Nest)
Genome_Sequence

'CTCAAAACTTCTCCCCGTTC'
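
With only 20 nucleotides the realized GC content can deviate considerably from the target. The following sketch, using only the standard library and the hypothetical helper names `random_genome` and `gc_fraction`, shows one way to check the realized GC fraction of a generated sequence.

```python
import random

def random_genome(size, gc_content, rng=random):
    """Random nucleotide string in which G and C together have probability gc_content."""
    weights = [gc_content / 2, gc_content / 2,
               (1 - gc_content) / 2, (1 - gc_content) / 2]
    return ''.join(rng.choices('GCAT', weights=weights, k=size))

def gc_fraction(sequence):
    """Fraction of G and C nucleotides in the sequence."""
    return (sequence.count('G') + sequence.count('C')) / len(sequence)

# A longer sequence approaches the target GC content much more closely
genome = random_genome(10000, 0.5, rng=random.Random(1))
print(gc_fraction(genome))
```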

In [202]:
import pickle
import os
# generating promoter sequence
WeightFile = os.path.join('Models','NucleotideWeightTable.pkl')
with open(WeightFile, 'rb') as handle:
    Nucleotides_Weight = pickle.load(handle)
RefFile = os.path.join('Models','RefSeq.txt')
with open(RefFile) as f:
    RefSeq = f.read()
Nucleotides_Weight.index = Nucleotides_Weight.index+40

Int64Index([ 0,  1,  2,  3,  4,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
            20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 32, 35, 36, 37, 38],
           dtype='int64')

In [239]:
myAr = np.asarray([Letter for Letter in RefSeq])
myAr[0]

'G'

In [240]:
Bases = ['A','C','G','T']
for idx in range(len(Nucleotides_Weight)):
#     print(Nucleotides_Weight.index[idx], Nucleotides_Weight.iloc[idx].values)
#     print(random.choices(Bases,Nucleotides_Weight.iloc[idx].values))
#     print(Nucleotides_Weight.index[idx])
    np.put(myAr,Nucleotides_Weight.index[idx],random.choices(Bases,Nucleotides_Weight.iloc[idx].values))
#     myAr[idx] = random.choices(Bases,Nucleotides_Weight.iloc[idx].values)

# type(idx)


In [241]:
myAr

array(['T', 'T', 'C', 'C', 'A', 'T', 'T', 'G', 'T', 'T', 'A', 'T', 'A',
       'A', 'G', 'G', 'A', 'G', 'G', 'C', 'A', 'G', 'T', 'A', 'C', 'T',
       'A', 'A', 'T', 'A', 'T', 'C', 'T', 'T', 'T', 'G', 'C', 'G', 'G',
       'G'], dtype='<U1')
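
The position-wise replacement used above can be illustrated on a small example. This is a minimal sketch that assumes the weights are given as a plain dictionary mapping a 0-based position to weights for A, C, G, T; the function name `mutate_positions` and the inputs are hypothetical and do not reflect the actual content of `NucleotideWeightTable.pkl` or `RefSeq.txt`.

```python
import random

def mutate_positions(ref_seq, position_weights, rng=random):
    """Replace selected positions of ref_seq by bases drawn from per-position weights.

    position_weights maps a 0-based position to weights for A, C, G, T (in that order).
    Positions without an entry keep the reference nucleotide.
    """
    bases = ['A', 'C', 'G', 'T']
    seq = list(ref_seq)
    for pos, weights in position_weights.items():
        seq[pos] = rng.choices(bases, weights=weights, k=1)[0]
    return ''.join(seq)

# Hypothetical inputs: a short reference and weights for two positions only
reference = 'GGGGGGGG'
weights = {1: [1, 0, 0, 0],   # position 1 is always replaced by A
           5: [0, 0, 0, 1]}   # position 5 is always replaced by T
promoter = mutate_positions(reference, weights, rng=random.Random(0))
print(promoter)  # prints GAGGGTGG
```

Because every position in the table must exist in the reference sequence, any offset added to the positions has to stay within the length of the reference.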

## Prediction of growth rates
### Choose the Substrate

In [None]:
myhost.Choose_Substrate(None)
myhost.show_BiotechSetting()

### Familiarizing with biomass composition reactions

Microorganisms adapt to their substrate. Different substrates provide different amounts of energy and require different cellular resources to be metabolized. In a GSMM, these differences may be represented by substrate-specific equations/reactions. iMR1026 for *P. pastoris* contains separate biomass equations for glucose, glycerol, glucose-glycerol mixtures, and methanol. When simulating a model, we have to make sure that the biomass equation matches the substrate.

In [None]:
# List of all reactions with 'BIOMASS' in their name
model.reactions.query('?..?')

In [None]:
# Looking in detail to biomass with methanol; use the reaction name given to you from the list of all biomass reactions;
model.reactions.?..?

### Defining functions for correct biomass equation switch

For each substrate, the boundary exchange fluxes are activated and the reactions of competing substrates are disabled.

Remember to write a specific function to adapt the model for glucose utilization.

In [None]:
def AdaptMethanol(model, meoh_up):
  model.objective = 'BIOMASS_meoh'
  # setting uptake reactions right
  model.reactions.Ex_glc_D.lower_bound = 0
  model.reactions.Ex_glyc.lower_bound = 0
  model.reactions.ATPM.lower_bound = .4
  model.reactions.Ex_meoh.lower_bound = -np.abs(meoh_up)
  # setting additional biomass composition
  model.reactions.LIPIDS_meoh.upper_bound = 1000
  model.reactions.PROTEINS_meoh.upper_bound = 1000
  model.reactions.STEROLS_meoh.upper_bound = 1000
  model.reactions.BIOMASS_meoh.upper_bound = 1000
  # deactivating Glyc-based biomass composition
  model.reactions.LIPIDS_glyc.upper_bound = 0
  model.reactions.PROTEINS_glyc.upper_bound = 0
  model.reactions.STEROLS_glyc.upper_bound = 0
  model.reactions.BIOMASS_glyc.upper_bound = 0  
  # deactivating Glc-based biomass composition
  model.reactions.LIPIDS.upper_bound = 0
  model.reactions.PROTEINS.upper_bound = 0
  model.reactions.STEROLS.upper_bound = 0
  model.reactions.BIOMASS.upper_bound = 0  
  return model

def AdaptGlycerol(model, glyc_up):
  model.objective = 'BIOMASS_glyc'
  # setting uptake reactions right
  model.reactions.Ex_meoh.lower_bound = 0
  model.reactions.Ex_glc_D.lower_bound = 0
  model.reactions.ATPM.lower_bound = 2.5
  model.reactions.Ex_glyc.lower_bound = -np.abs(glyc_up)
  # setting additional biomass composition
  model.reactions.LIPIDS_glyc.upper_bound = 1000
  model.reactions.PROTEINS_glyc.upper_bound = 1000
  model.reactions.STEROLS_glyc.upper_bound = 1000
  model.reactions.BIOMASS_glyc.upper_bound = 1000  
  # deactivating MeOH-based biomass composition
  model.reactions.LIPIDS_meoh.upper_bound = 0
  model.reactions.PROTEINS_meoh.upper_bound = 0
  model.reactions.STEROLS_meoh.upper_bound = 0
  model.reactions.BIOMASS_meoh.upper_bound = 0
  # deactivating Glc-based biomass composition
  model.reactions.LIPIDS.upper_bound = 0
  model.reactions.PROTEINS.upper_bound = 0
  model.reactions.STEROLS.upper_bound = 0
  model.reactions.BIOMASS.upper_bound = 0
  return model

def AdaptGlucose(model, glc_up):
  # setting uptake reactions right
  model.reactions.Ex_meoh.lower_bound = 0
  model.reactions.Ex_glyc.lower_bound = 0
  model.reactions.Ex_glc_D.lower_bound = -np.abs(glc_up)
  # setting additional biomass composition
  model.reactions.LIPIDS.upper_bound = 1000
  model.reactions.PROTEINS.upper_bound = 1000
  model.reactions.STEROLS.upper_bound = 1000
  model.reactions.BIOMASS.upper_bound = 1000  
  # deactivating Glyc-based biomass composition
  model.reactions.LIPIDS_glyc.upper_bound = 0
  model.reactions.PROTEINS_glyc.upper_bound = 0
  model.reactions.STEROLS_glyc.upper_bound = 0
  model.reactions.BIOMASS_glyc.upper_bound = 0  
  # deactivating MeOH-based biomass composition
  model.reactions.LIPIDS_meoh.upper_bound = 0
  model.reactions.PROTEINS_meoh.upper_bound = 0
  model.reactions.STEROLS_meoh.upper_bound = 0
  model.reactions.BIOMASS_meoh.upper_bound = 0
  model.objective = 'BIOMASS'
  return model

### Data retrieval

To evaluate the growth rate prediction of the *P. pastoris* model, we use experimental data from the closely related organism *Ogataea polymorpha*. The measurements in the table are taken from [van Dijken et al., 1976](https://dx.doi.org/10.1007/bf00446560) for methanol and from [de Koning et al., 1987](https://dx.doi.org/10.1007/BF00456710) and [Moon et al., 2003](https://dx.doi.org/10.1385/ABAB:111:2:65) for glycerol.

Data Address: [here](https://rwth-aachen.sciebo.de/s/o72jwWQWh3ame1e/download)

**When using Binder:**
Click on "here" and save the file. Afterwards, go back to the "Home" tab in the browser and navigate to the root directory of your cloud system (i.e. where you see the Jupyter notebook files). Click on the "Upload" button on the right side of the window and select the file you just saved. The file "Opol-expt-grwth_MeOH-Glyc.csv" should now be listed; click on "Upload" again in the line of that file. Now go back to the simulation tab. Replace '?..?' in the cell below with the name of the csv file you downloaded. Then press Ctrl+Enter to load the experimental data.

**When using Colab:**
Click on "here" and save the file. Then click on "Files" on the left side of the window. Upload the file you have just saved by clicking on "Upload" and selecting this file. Now you should see the file "Opol-expt-grwth_MeOH-Glyc.csv" in the "Files" section. If this is not the case, click on "Refresh". Then right-click on this file and select "Copy path". Replace '?..?' in the cell below with the file path by pressing Ctrl+V. Then evaluate the cell with Ctrl+Enter.

In [None]:
# Create Excel file with all uptake rates and then import it instead as follows:
# data = pd.read_excel('nameOfFile.xlsx', sheet_name='Ecol', index_col=0)

data = pd.read_csv('?..?', delimiter=',|;', engine='python')
data


Next, we add measurements for glucose from [Lehnen et al., 2017](https://doi.org/10.1016/j.meteno.2017.07.001) to our basic data table. The required information is the growth rate on glucose for *H. polymorpha*. Extract the necessary information from Table 3 in the article. The name of the exchange reaction is 'Ex_glc_D'; enter it in place of the '?..?' that belongs to 'Exchange' in the cell below.

In [None]:
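# DataFrame.append was removed in pandas 2.0, therefore the new row is added with pd.concat
new_row = pd.DataFrame([{'Substrate':'Glucose', 'Exchange':'?..?', 'uptake rate (mmol/gCDW/h)':?..?, 'growth rate (/h)':?..?, 'source':'Lehnen et al.'}])
data = pd.concat([data, new_row], ignore_index=True)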

### Simulation loop
For-Loop over all experimental data points.

Remember to add a decision when you include glucose.

In [None]:
# At the beginning a substrate was selected, therefore only one execution is necessary, approximately as follows:
# Substrate = myhost.var_Substrate
# model=AdaptSubstrate(model, data.loc[['Substrate'],['uptake rate']].values)
# growth_simulated = model.optimize()

growth_simulated = []
# iteration over all rows in 'data'
for index, row in data.iterrows():
  # changes made to the model inside the 'with' block are reverted when the block ends
  with model as test_model:
    print(index) # printing the row number to get feedback that everything is working
    # selecting the right substrate in the model based on 'Substrate' in 'data'
    if row['Substrate'] == 'Methanol':
      test_model = AdaptMethanol(test_model, row['uptake rate (mmol/gCDW/h)'])
    elif row['Substrate'] == 'Glycerol':
      test_model = AdaptGlycerol(test_model, row['uptake rate (mmol/gCDW/h)'])
    elif row['Substrate'] == 'Glucose':
      test_model = AdaptGlucose(test_model, row['uptake rate (mmol/gCDW/h)'])
    else:
      print('substrate not considered')
    growth_simulated.append(test_model.slim_optimize())


### Graphical output

Remember to add glucose to the visualization. Add the right axis labels, and a file name.

In [None]:
# the style has to be selected before plotting (Matplotlib >= 3.6 calls it 'seaborn-v0_8-paper')
plt.style.use('seaborn-paper')
plt.scatter(data['growth rate (/h)'][0:7], growth_simulated[0:7], s=50, c='k', marker='o');
plt.scatter(data['growth rate (/h)'][7], growth_simulated[7], s=50, c='k', marker='s');
plt.scatter(data['growth rate (/h)'][8], growth_simulated[8], s=50, c='k', marker='d');
plt.xlabel('?..?');
plt.ylabel('?..?');
myline = np.linspace(0,np.max(growth_simulated),10);
plt.plot(myline,myline,'k--');
plt.title('Growth rate comparison');
plt.legend(['Optimum', 'Methanol (van Dijken)', 'Glycerol (deKoning, Moon)', 'Glucose (Lehnen et al.)'], loc=2);

# Saving figure
plt.savefig('?..?.png')