# Protein Comparison

## Objective and Prerequisites

In this example, we’ll show you how to use mathematical optimization to address a Protein Comparison problem. You’ll learn how to model this problem – which involves measuring the similarities of two proteins – as a quadratic assignment problem using the Gurobi Python API and find an optimal solution to it with the Gurobi Optimizer.

This model is example 29 from the fifth edition of Model Building in Mathematical Programming by H. Paul Williams on pages 290-291 and 345.

This modeling example is at the advanced level, where we assume that you know Python and the Gurobi Python API and that you have advanced knowledge of building mathematical optimization models. Typically, the objective function and/or constraints of these examples are complex or require advanced features of the Gurobi Python API.

**Download the Repository** <br /> 
You can download the repository containing this and other examples by clicking [here](https://github.com/Gurobi/modeling-examples/archive/master.zip). 

**Gurobi License** <br /> 
In order to run this Jupyter Notebook properly, you must have a Gurobi license. If you do not have one, you can request an [evaluation license](https://www.gurobi.com/downloads/request-an-evaluation-license/?utm_source=3PW&utm_medium=OT&utm_campaign=WW-MU-MUI-OR-O_LEA-PR_NO-Q3_FY20_WW_JPME_PROTEIN_COMPARISON_COM_EVAL_GitHub&utm_term=Protein_Comparison&utm_content=C_JPM) as a *commercial user*, or download a [free license](https://www.gurobi.com/academia/academic-program-and-licenses/?utm_source=3PW&utm_medium=OT&utm_campaign=WW-MU-EDU-OR-O_LEA-PR_NO-Q3_FY20_WW_JPME_PROTEIN_COMPARISON_ACADEMIC_EVAL_GitHub&utm_term=Protein_Comparison&utm_content=C_JPM) as an *academic user*.

In [1]:
import gurobipy as gp
from gurobipy import GRB

# tested with Python 3.7.0 & Gurobi 9.1.0

## Problem Description

This problem is based on one discussed in a paper by Forrester and Greenberg (2008).
It is concerned with measuring the similarity of two proteins. A protein can be
represented by a graph in which the nodes are the amino acids and an edge is
present whenever two acids are within a threshold distance of each
other. This graph is known as the contact map of the protein.
Given two contact maps representing proteins, we would like to find the largest
(measured by the number of corresponding edges) isomorphic subgraphs in each
graph. The acids in each protein are ordered. We need to preserve this
ordering in each subgraph, which implies that there can be no crossovers
in the comparison. This is illustrated in the following figure.

![crossover](crossover.PNG)

If $i < k$ in the contact map for the first protein then we cannot have $l < j$ in the second protein, if $i$ is to be
associated with $j$ and $k$ with $l$ in the comparison. The following figure gives a comparison between two small contact
maps leading to five corresponding edges.

![comparison](comparison.PNG)
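
The no-crossover rule can be sketched in a few lines of plain Python. This is only an illustration: the helper `has_crossover` and the two small matchings are hypothetical and not part of the original example. A matching is stored as a dictionary from nodes of the first protein to nodes of the second.

```python
from itertools import combinations


def has_crossover(matching):
    """Return True if two matched pairs (i, j), (k, l) satisfy i < k and l < j."""
    pairs = sorted(matching.items())  # sorted by the node index in the first protein
    return any(l < j for (i, j), (k, l) in combinations(pairs, 2))


# Hypothetical matchings: keys are nodes of the first protein, values nodes of the second.
ordered = {1: 2, 2: 3, 4: 6}
crossed = {1: 3, 2: 2}

print(has_crossover(ordered))  # False
print(has_crossover(crossed))  # True
```

The check only looks at the order of the matched nodes; it does not test whether a node of the second protein is used twice, which is handled separately by the matching constraints of the model.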

The goal is to compare the contact maps given by the following figures.

Contact map of the first protein:

![map1](map1.PNG)

Contact map of the second protein:

![map2](map2.PNG)
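
In this example the contact maps are given directly as edge lists. As a rough sketch of where such a map could come from, the snippet below connects two residues whenever their Euclidean distance is at most a threshold. The function `contact_map`, the coordinates, and the threshold value are all assumptions made for illustration and do not come from the source.

```python
from itertools import combinations
from math import sqrt


def contact_map(coords, threshold):
    """Return edges (i, k) with i < k for residues at most `threshold` apart."""
    edges = []
    for (i, p), (k, q) in combinations(sorted(coords.items()), 2):
        distance = sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        if distance <= threshold:
            edges.append((i, k))
    return edges


# Hypothetical 2D coordinates for four residues (not taken from the source).
coords = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (2.0, 0.0), 4: (1.0, 1.0)}

print(contact_map(coords, threshold=1.0))  # [(1, 2), (2, 3), (2, 4)]
```

The resulting edge list has the same shape as the edge sets used in the model formulation, with each edge written so that the smaller node index comes first.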

## Model Formulation

The contact map of the first protein is represented by the graph $G_{1} = (N_{1},E_{1})$, and the contact map of the second protein is represented by the graph $G_{2} = (N_{2},E_{2})$.

### Sets and Indices

$i,k \in N_{1} =\{1,2,...,9\}$: Nodes in graph $G_{1}$ which are the acids in the first protein.

$E_{1} = \{(1,2),(2,9),(3,4),(3,5),(5,6),(6,7),(7,9),(8,9) \}$: Edges in graph $G_{1}$.

$j,l \in N_{2} =\{1,2,...,11\} $: Nodes in graph $G_{2}$ which are the acids in the second protein.

$E_{2} = \{(1,4),(2,3),(4,6),(4,7),(5,6),(6,8),(7,8),(7,10),(9,10),(10,11) \}$: Edges in graph $G_{2}$.

### Decision variables

$\text{map}_{i,j} = x_{i,j} = 1$, iff node $i$ in $G_{1}$ is matched with node $j$ in $G_{2}$.

$ w_{i,j,k,l} = x_{i,j}*x_{k,l}  = 1$, iff an edge $(i,k) \in E_{1}$ is matched with edge $(j,l) \in E_{2}$.

### Constraints

**$G_{1}$ matching**: Each node in $G_{2}$ can be matched with at most one node in $G_{1}$.

$$
\sum_{i \in N_{1} } x_{i,j} \leq 1 \quad \forall j \in N_{2}
$$

**$G_{2}$ matching**: Each node in $G_{1}$ can be matched with at most one node in $G_{2}$.

$$
\sum_{j \in N_{2} } x_{i,j} \leq 1 \quad \forall i \in N_{1}
$$

**Edge matching**: If edges $(i, k)$ and $(j, l)$ are matched, then so are their corresponding nodes.

$$
 w_{i,j,k,l} \leq x_{i,j}, \;  w_{i,j,k,l} \leq x_{k,l} \quad \forall 
 (i,j,k,l) \in ijkl = \{ i,k \in N_{1}, j,l \in N_{2}: (i,k) \in E_{1},  (j,l) \in E_{2}  \}
$$
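
The two inequalities above only bound $w_{i,j,k,l}$ from above. Because the objective maximizes the sum of the $w$ variables, each one is pushed to its largest feasible value, and that value coincides with the product $x_{i,j} x_{k,l}$ for binary inputs. The following sketch checks this by enumeration; the helper `largest_w` is a hypothetical name introduced only for this illustration.

```python
from itertools import product


def largest_w(x_ij, x_kl):
    """Largest binary w satisfying w <= x_ij and w <= x_kl."""
    return max(w for w in (0, 1) if w <= x_ij and w <= x_kl)


# Compare against the product for every combination of binary values.
table = [(x_ij, x_kl, largest_w(x_ij, x_kl))
         for x_ij, x_kl in product((0, 1), repeat=2)]

for x_ij, x_kl, w in table:
    print(x_ij, x_kl, w, w == x_ij * x_kl)
```

This is why no lower-bounding constraint of the form $w \geq x_{i,j} + x_{k,l} - 1$ is needed in a maximization model of this kind.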

**No crossovers**: If $i < k$ in $G_{1}$, then nodes $i$ and $k$ cannot be matched with nodes $j$ and $l$ in $G_{2}$ such that $j > l$.

$$
x_{i,j} +  x_{k,l} \leq 1 \quad \forall 
(i,j,k,l) \in ijklx = \{ i,k \in N_{1}, \; j,l \in N_{2}: i < k, \; j > l \}
$$


### Objective function
The objective is to maximize the number of edge matchings.

$$
\sum_{(i,j,k,l) \in ijkl} w_{i,j,k,l}
$$

This integer linear programming formulation of the Protein Comparison problem is in fact a linearization of a quadratic assignment formulation of the problem. With Gurobi 9.1.0, you can solve the quadratic assignment formulation directly, without the auxiliary variables $w_{i,j,k,l}$ and the edge matching constraints that link them to $x_{i,j}$. The quadratic formulation is as follows.

### Objective function
The objective is to maximize the number of edge matchings.

$$
\sum_{(i,j,k,l) \in ijkl} x_{i,j}*x_{k,l}
$$

### Constraints

**$G_{1}$ matching**: Each node in $G_{2}$ can be matched with at most one node in $G_{1}$.

$$
\sum_{i \in N_{1} } x_{i,j} \leq 1 \quad \forall j \in N_{2}
$$

**$G_{2}$ matching**: Each node in $G_{1}$ can be matched with at most one node in $G_{2}$.

$$
\sum_{j \in N_{2} } x_{i,j} \leq 1 \quad \forall i \in N_{1}
$$

**No crossovers**: If $i < k$ in $G_{1}$, then nodes $i$ and $k$ cannot be matched with nodes $j$ and $l$ in $G_{2}$ such that $j > l$.

$$
x_{i,j} +  x_{k,l} \leq 1 \quad \forall 
(i,j,k,l) \in ijklx = \{ i,k \in N_{1}, \; j,l \in N_{2}: i < k, \; j > l \}
$$
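
Because this instance is small, the quadratic formulation can be sanity-checked without a solver. The sketch below is not the method used in this example: it simply enumerates every order-preserving partial matching (which is one-to-one and crossover-free by construction) and evaluates the quadratic objective for each. The function names are hypothetical. The resulting value can be compared against the optimum of 5 that the solver reports in the output of this example.

```python
edges1 = {(1, 2), (2, 9), (3, 4), (3, 5), (5, 6), (6, 7), (7, 9), (8, 9)}
edges2 = {(1, 4), (2, 3), (4, 6), (4, 7), (5, 6), (6, 8), (7, 8), (7, 10),
          (9, 10), (10, 11)}
n1, n2 = 9, 11  # number of nodes in G1 and G2


def count_matched_edges(matching):
    """Quadratic objective: edges (i, k) of G1 whose images form an edge of G2."""
    return sum(
        1
        for i, k in edges1
        if i in matching and k in matching
        and (matching[i], matching[k]) in edges2
    )


def best_matching(i=1, last_j=0, matching=None):
    """Enumerate order-preserving partial matchings; return (value, matching)."""
    matching = {} if matching is None else matching
    if i > n1:
        return count_matched_edges(matching), dict(matching)
    # Option 1: leave node i of G1 unmatched.
    best = best_matching(i + 1, last_j, matching)
    # Option 2: match node i with a node j of G2 beyond the last one used.
    for j in range(last_j + 1, n2 + 1):
        matching[i] = j
        candidate = best_matching(i + 1, j, matching)
        if candidate[0] > best[0]:
            best = candidate
        del matching[i]
    return best


best_value, best_map = best_matching()
print(best_value)
print(best_map)
```

Enumeration of this kind grows combinatorially with the number of nodes, so it is only practical for very small contact maps; for larger instances the optimization model is the appropriate tool.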

## Input Data 

In [2]:
# nodes in G1

nodes1 = [*range(1,10)]

# edges (i,k) in G1

edges1 = [(1,2),(2,9),(3,4),(3,5),(5,6),(6,7),(7,9),(8,9)]

# nodes in G2

nodes2 = [*range(1,12)]

# edges (j,l) in G2

edges2 = [(1,4),(2,3),(4,6),(4,7),(5,6),(6,8),(7,8),(7,10),(9,10),(10,11)]

## Preprocessing

In [3]:
# Node matching: matchings of nodes in G1 with nodes in G2

list_ij = []

for i in nodes1:
    for j in nodes2:
        tp = i,j
        list_ij.append(tp)
        
ij = gp.tuplelist(list_ij)

# Edge matching: matchings of edges in G1 with edges in G2

list_ijkl = []

for i,k in edges1:
    for j,l in edges2:
        tp = i,j,k,l
        list_ijkl.append(tp)
        
ijkl = gp.tuplelist(list_ijkl)

# No crossover 

list_nox = []

for i,j in ij:
    for k,l in ij:
        if i < k and l < j:
            tp = i,j,k,l
            list_nox.append(tp)
            
nox = gp.tuplelist(list_nox)  
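
As a cross-check of the preprocessing step, the same index sets can be built with plain list comprehensions and their sizes counted. This sketch deliberately uses ordinary Python lists instead of `gp.tuplelist`, so it runs without Gurobi; the variable names mirror those in the cell above.

```python
nodes1 = range(1, 10)
nodes2 = range(1, 12)
edges1 = [(1, 2), (2, 9), (3, 4), (3, 5), (5, 6), (6, 7), (7, 9), (8, 9)]
edges2 = [(1, 4), (2, 3), (4, 6), (4, 7), (5, 6), (6, 8), (7, 8), (7, 10),
          (9, 10), (10, 11)]

# Node matchings, edge matchings, and crossover pairs.
ij = [(i, j) for i in nodes1 for j in nodes2]
ijkl = [(i, j, k, l) for i, k in edges1 for j, l in edges2]
nox = [(i, j, k, l) for i, j in ij for k, l in ij if i < k and l < j]

print(len(ij), len(ijkl), len(nox))  # 99 80 1980
```

These counts line up with the solver log later in the example: 99 binary variables, 80 quadratic objective terms, and 2000 rows, that is, 1980 no-crossover constraints plus 11 and 9 matching constraints.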
        

## Model Deployment

We create a model and the decision variables. The binary decision variables match nodes in $G_{1}$ with nodes in $G_{2}$; the constraints that follow ensure that the matching is one-to-one and free of crossovers, so that matched edges correspond properly.

In [4]:
model = gp.Model('ProteinComparison')

# Map nodes in G1 with nodes in G2
map_nodes = model.addVars(ij, vtype=GRB.BINARY, name="map")



**$G_{1}$ matching constraint**: Each node in $G_{2}$ can be matched with at most one node in $G_{1}$.

$$
\sum_{i \in N_{1} } x_{i,j} \leq 1 \quad \forall j \in N_{2}
$$

In [5]:
# Each node in G2 is matched with at most one node in G1

node1_match = model.addConstrs((gp.quicksum(map_nodes[i,j] for i in nodes1) <= 1 for j in nodes2 ) ,name='node1_match')


**$G_{2}$ matching constraint**: Each node in $G_{1}$ can be matched with at most one node in $G_{2}$.

$$
\sum_{j \in N_{2} } x_{i,j} \leq 1 \quad \forall i \in N_{1}
$$

In [6]:
# Each node in G1 is matched with at most one node in G2

node2_match = model.addConstrs((gp.quicksum(map_nodes[i,j] for j in nodes2) <= 1 for i in nodes1 ) ,name='node2_match')


**No crossovers**: If $i < k$ in $G_{1}$, then nodes $i$ and $k$ cannot be matched with nodes $j$ and $l$ in $G_{2}$ such that $j > l$.

$$
x_{i,j} +  x_{k,l} \leq 1 \quad \forall 
(i,j,k,l) \in ijklx = \{ i,k \in N_{1}, \; j,l \in N_{2}: i < k, \; j > l \}
$$

In [7]:
# No crossovers

no_crossover = model.addConstrs((map_nodes[i,j] + map_nodes[k,l] <= 1 for i,j,k,l in nox), name='no_crossover')

### Objective function

Maximize the matchings of edges in G1 with edges in G2. 

$$
\sum_{(i,j,k,l) \in ijkl} x_{i,j}*x_{k,l}
$$

In [8]:
# Objective function

model.setObjective(gp.quicksum(map_nodes[i,j]*map_nodes[k,l] for i,j,k,l in ijkl ) , GRB.MAXIMIZE )

In [9]:
# Verify model formulation

model.write('ProteinComparison.lp')

# Run optimization engine

model.optimize()

Gurobi Optimizer version 9.1.0 build v9.1.0rc0 (win64)
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads
Optimize a model with 2000 rows, 99 columns and 4158 nonzeros
Model fingerprint: 0x22958823
Model has 80 quadratic objective terms
Variable types: 0 continuous, 99 integer (99 binary)
Coefficient statistics:
  Matrix range     [1e+00, 1e+00]
  Objective range  [0e+00, 0e+00]
  QObjective range [2e+00, 2e+00]
  Bounds range     [1e+00, 1e+00]
  RHS range        [1e+00, 1e+00]
Found heuristic solution: objective -0.0000000
Presolve removed 1876 rows and 17 columns
Presolve time: 0.01s
Presolved: 204 rows, 162 columns, 2026 nonzeros
Variable types: 0 continuous, 162 integer (162 binary)

Root relaxation: objective -6.923077e+00, 178 iterations, 0.00 seconds

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0    6.92308    0   52   -0.00000    6.9230

In [10]:
# Output report

print(f"Maximum number of edge matches: {round(model.objVal)}") 


for i,j,k,l in ijkl:
    if map_nodes[i,j].x*map_nodes[k,l].x > 0.5:
        print(f"Edge {i,k} in G1 is mapped with edge {j,l} in G2")

Maximum number of edge matches: 5
Edge (1, 2) in G1 is mapped with edge (2, 3) in G2
Edge (3, 4) in G1 is mapped with edge (4, 6) in G2
Edge (3, 5) in G1 is mapped with edge (4, 7) in G2
Edge (5, 6) in G1 is mapped with edge (7, 8) in G2
Edge (7, 9) in G1 is mapped with edge (9, 10) in G2
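
The reported solution can be checked independently of the solver. The sketch below copies the five edge pairs from the output above, recovers the implied node matching, and verifies that it is consistent and order-preserving. The variable names are chosen for this illustration only.

```python
edges1 = {(1, 2), (2, 9), (3, 4), (3, 5), (5, 6), (6, 7), (7, 9), (8, 9)}
edges2 = {(1, 4), (2, 3), (4, 6), (4, 7), (5, 6), (6, 8), (7, 8), (7, 10),
          (9, 10), (10, 11)}

# Edge pairs copied from the output report: (edge in G1, edge in G2).
reported = [((1, 2), (2, 3)), ((3, 4), (4, 6)), ((3, 5), (4, 7)),
            ((5, 6), (7, 8)), ((7, 9), (9, 10))]

node_map = {}
for (i, k), (j, l) in reported:
    assert (i, k) in edges1 and (j, l) in edges2
    for a, b in ((i, j), (k, l)):
        # setdefault returns the existing image if node a was already mapped.
        assert node_map.setdefault(a, b) == b, "node mapped to two different nodes"

# Strictly increasing images mean the matching is one-to-one and has no crossovers.
images = [node_map[i] for i in sorted(node_map)]
is_order_preserving = all(a < b for a, b in zip(images, images[1:]))

# Count the G1 edges whose images form an edge of G2 under the node matching.
matched = sum(1 for i, k in edges1
              if i in node_map and k in node_map
              and (node_map[i], node_map[k]) in edges2)

print(node_map)
print(is_order_preserving, matched)  # True 5
```

The node matching recovered this way is 1→2, 2→3, 3→4, 4→6, 5→7, 6→8, 7→9, 9→10, with node 8 of $G_{1}$ left unmatched.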


---
## References

H. Paul Williams, Model Building in Mathematical Programming, fifth edition.

Forrester, R.J. and Greenberg, H.J. (2008) Quadratic Binary Programming Models in Computational Biology. Algorithmic Operations Research, 3, 110–129.

Copyright © 2020 Gurobi Optimization, LLC