How to deal with ambiguity<a id="cross-link_ms-ambiguity"></a>
==========================

**Table of contents**

 - [1. Cross-link identification ambiguity](#autotoc2v1)
 - [2. Compositional ambiguity](#autotoc2v2)


> This tutorial continues on from the [introduction to cross-linking](cross-link_ms-colab.ipynb) tutorial.


There are several ways to deal with ambiguity in XL-MS data. Let's first define the possible ambiguities:

 1. cross-link identification ambiguity
 2. compositional ambiguity
 3. state ambiguity

First, we need to install IMP:

In [None]:
!echo "deb https://integrativemodeling.org/latest/download $(lsb_release -cs)/" > /etc/apt/sources.list.d/salilab.list
!wget -O /etc/apt/trusted.gpg.d/salilab.asc https://salilab.org/~ben/pubkey256.asc
!apt update
!apt install imp
import sys
sys.path.append('/usr/lib/python3.8/dist-packages')

In [None]:
from __future__ import print_function

import IMP
import IMP.pmi
import IMP.pmi.topology
import IMP.pmi.io
import IMP.pmi.io.crosslink

# 1. Cross-link identification ambiguity<a id="autotoc2v1"></a>

There are several ways to model identification ambiguity.

One of them is to use the `UniqueID` keyword: cross-links that share the same UniqueID are considered ambiguous.

In [None]:
xldb='''Protein 1,Protein 2,Residue 1,Residue 2,UniqueID,Score
ProtA,ProtB,1,10,1,1.0
ProtA,ProtB,1,11,1,2.0
ProtA,ProtB,1,21,2,2.0
'''

with open('xlinks.csv', 'w') as xlf:
    xlf.write(xldb)

In the example above, cross-links ProtA:1-ProtB:10 and ProtA:1-ProtB:11 are ambiguous because they were assigned to the same UniqueID.
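
To make the grouping rule concrete, here is a minimal sketch in plain Python (standard library only, not the IMP API) that reads the same CSV text and counts how many candidate cross-links share each UniqueID. The helper name `ambiguity_by_unique_id` is hypothetical and only illustrates the idea.

```python
import csv
import io
from collections import Counter

# Same content as the xlinks.csv file written above.
XLDB = """Protein 1,Protein 2,Residue 1,Residue 2,UniqueID,Score
ProtA,ProtB,1,10,1,1.0
ProtA,ProtB,1,11,1,2.0
ProtA,ProtB,1,21,2,2.0
"""

def ambiguity_by_unique_id(csv_text):
    """Return {UniqueID: number of cross-links sharing that ID}.

    Hypothetical helper for illustration; it is not part of IMP.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return dict(Counter(row["UniqueID"] for row in reader))

counts = ambiguity_by_unique_id(XLDB)
print(counts)
```

A UniqueID with a count above one marks a group of ambiguous cross-links; a count of one marks an unambiguous cross-link.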

Now we create a conversion map between the internal keywords for cross-link features and the column names used in the file:

In [None]:
cldbkc = IMP.pmi.io.crosslink.CrossLinkDataBaseKeywordsConverter()
cldbkc.set_protein1_key("Protein 1")
cldbkc.set_protein2_key("Protein 2")
cldbkc.set_residue1_key("Residue 1")
cldbkc.set_residue2_key("Residue 2")
cldbkc.set_unique_id_key("UniqueID")
cldbkc.set_id_score_key("Score")

With this keyword interpreter, let's read the cross-link database:

In [None]:
cldb = IMP.pmi.io.crosslink.CrossLinkDataBase(cldbkc)
cldb.create_set_from_file("xlinks.csv")

Let's check that the database looks OK:

In [None]:
print(cldb)

1
--- XLUniqueID 1
--- XLUniqueSubIndex 1
--- XLUniqueSubID 1.1
--- Protein1 ProtA
--- Protein2 ProtB
--- Residue1 1
--- Residue2 10
--- IDScore 1.0
--- Redundancy 1
--- RedundancyList ['1.1']
--- Ambiguity 2
--- Residue1LinksNumber 3
--- Residue2LinksNumber 1
-------------
--- XLUniqueID 1
--- XLUniqueSubIndex 2
--- XLUniqueSubID 1.2
--- Protein1 ProtA
--- Protein2 ProtB
--- Residue1 1
--- Residue2 11
--- IDScore 2.0
--- Redundancy 1
--- RedundancyList ['1.2']
--- Ambiguity 2
--- Residue1LinksNumber 3
--- Residue2LinksNumber 1
-------------
2
--- XLUniqueID 2
--- XLUniqueSubIndex 1
--- XLUniqueSubID 2.1
--- Protein1 ProtA
--- Protein2 ProtB
--- Residue1 1
--- Residue2 21
--- IDScore 2.0
--- Redundancy 1
--- RedundancyList ['2.1']
--- Ambiguity 1
--- Residue1LinksNumber 3
--- Residue2LinksNumber 1
-------------


As you can see, there are two unique indexes, 1 and 2. The first spectral index contains two identifications, with sub-IDs 1.1 and 1.2, corresponding to the two ambiguous cross-links.
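
The sub-ID labels in the printout can be thought of as the unique ID followed by a running counter within that ID. The sketch below reproduces that labelling in plain Python. It is an assumption about the naming pattern seen in the output above, not IMP's actual implementation, and `label_sub_ids` is a hypothetical helper.

```python
from collections import defaultdict

# (UniqueID, protein1, residue1, protein2, residue2), in file order.
links = [
    (1, "ProtA", 1, "ProtB", 10),
    (1, "ProtA", 1, "ProtB", 11),
    (2, "ProtA", 1, "ProtB", 21),
]

def label_sub_ids(links):
    """Pair each link with a '<UniqueID>.<running index>' label.

    Hypothetical helper mirroring the labels seen in the printout.
    """
    next_index = defaultdict(int)  # per-UniqueID counter, starts at 0
    labelled = []
    for link in links:
        unique_id = link[0]
        next_index[unique_id] += 1
        labelled.append(("%s.%d" % (unique_id, next_index[unique_id]), link))
    return labelled

for sub_id, link in label_sub_ids(links):
    print(sub_id, link)
```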

# 2. Compositional ambiguity<a id="autotoc2v2"></a>

Compositional ambiguity occurs when identical copies of the same protein are present in the sample, and we are not able to attribute the cross-link to one or the other copy.

To complicate the example, let's suppose we already have an identification ambiguity, and see how the two kinds of ambiguity combine with each other. In the data below, note that two cross-links have the same UniqueID:

In [None]:
xldb='''Protein 1,Protein 2,Residue 1,Residue 2,UniqueID,Score
ProtA,ProtB,1,10,1,1.0
ProtA,ProtB,1,11,1,2.0
ProtB,ProtA,21,1,2,2.0
ProtA,ProtA,1,2,3,3.0
'''

with open('xlinks.csv', 'w') as xlf:
    xlf.write(xldb)

We will first create a database:

In [None]:
cldbkc = IMP.pmi.io.crosslink.CrossLinkDataBaseKeywordsConverter()
cldbkc.set_protein1_key("Protein 1")
cldbkc.set_protein2_key("Protein 2")
cldbkc.set_residue1_key("Residue 1")
cldbkc.set_residue2_key("Residue 2")
cldbkc.set_unique_id_key("UniqueID")
cldbkc.set_id_score_key("Score")

cldb = IMP.pmi.io.crosslink.CrossLinkDataBase(cldbkc)
cldb.create_set_from_file("xlinks.csv")

Now, we know that there are two copies of ProtA, which we called ProtA.1 and ProtA.2 in our IMP [Hierarchy](https://integrativemodeling.org/2.19.0/doc/ref/classIMP_1_1atom_1_1Hierarchy.html). Let's rename ProtA to ProtA.1 at both ends of each cross-link:

In [None]:
from IMP.pmi.io.crosslink import FilterOperator as FO
import operator

fo1 = FO(cldb.protein1_key, operator.eq, "ProtA")
cldb.set_value(cldb.protein1_key, "ProtA.1", fo1)
fo2 = FO(cldb.protein2_key, operator.eq, "ProtA")
cldb.set_value(cldb.protein2_key, "ProtA.1", fo2)
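
Conceptually, each filter selects the cross-links whose field satisfies a predicate, and the new value is then written only to the selected ones. The plain-Python sketch below mimics that behaviour on a list of dictionaries. It does not use IMP; the function `set_value_where` and the dictionary keys are hypothetical and chosen only for illustration.

```python
import operator

# One dictionary per cross-link; only the protein names matter here.
rows = [
    {"Protein1": "ProtA", "Protein2": "ProtB"},
    {"Protein1": "ProtB", "Protein2": "ProtA"},
    {"Protein1": "ProtA", "Protein2": "ProtA"},
]

def set_value_where(rows, key, new_value, op, operand):
    """Set row[key] = new_value for every row where op(row[key], operand) holds.

    Hypothetical stand-in for a filter followed by an assignment.
    """
    for row in rows:
        if op(row[key], operand):
            row[key] = new_value

# Rename ProtA to ProtA.1 at the first end, then at the second end.
set_value_where(rows, "Protein1", "ProtA.1", operator.eq, "ProtA")
set_value_where(rows, "Protein2", "ProtA.1", operator.eq, "ProtA")
print(rows)
```

Two passes are needed because ProtA can appear at either end of a cross-link, as in the ProtB-ProtA and ProtA-ProtA entries.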

Next, we clone all cross-links involving ProtA.1 so that they are also assigned to ProtA.2:

In [None]:
cldb.clone_protein("ProtA.1", "ProtA.2")

Let's check that the database looks OK:

In [None]:
print(cldb)

1
--- XLUniqueID 1
--- XLUniqueSubIndex 1
--- XLUniqueSubID 1.1
--- Protein1 ProtA.1
--- Protein2 ProtB
--- Residue1 1
--- Residue2 10
--- IDScore 1.0
--- Redundancy 1
--- RedundancyList ['1.1']
--- Ambiguity 4
--- Residue1LinksNumber 5
--- Residue2LinksNumber 2
-------------
--- XLUniqueID 1
--- XLUniqueSubIndex 2
--- XLUniqueSubID 1.2
--- Protein1 ProtA.2
--- Protein2 ProtB
--- Residue1 1
--- Residue2 10
--- IDScore 1.0
--- Redundancy 1
--- RedundancyList ['1.2']
--- Ambiguity 4
--- Residue1LinksNumber 5
--- Residue2LinksNumber 2
-------------
--- XLUniqueID 1
--- XLUniqueSubIndex 3
--- XLUniqueSubID 1.3
--- Protein1 ProtA.1
--- Protein2 ProtB
--- Residue1 1
--- Residue2 11
--- IDScore 2.0
--- Redundancy 1
--- RedundancyList ['1.3']
--- Ambiguity 4
--- Residue1LinksNumber 5
--- Residue2LinksNumber 2
-------------
--- XLUniqueID 1
--- XLUniqueSubIndex 4
--- XLUniqueSubID 1.4
--- Protein1 ProtA.2
--- Protein2 ProtB
--- Residue1 1
--- Residue2 11
--- IDScore 2.0
--- Redundancy 1
--- Red

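As you can see, there are three unique indexes, 1, 2 and 3. The first index contains four cross-links (two identifications, each with two possible copies of ProtA), the second two cross-links (one identification with two possible copies of ProtA), and the third four cross-links (two possible copies of ProtA at each end).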
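
The counts above (four, two and four) follow from simple combinatorics: every end of a cross-link that falls on ProtA can be assigned to either copy. The sketch below checks this in plain Python, without IMP. The function `expand_copies` and the `COPIES` table are hypothetical names used only for this illustration, and the ordering of the expanded links is not meant to reproduce IMP's.

```python
from itertools import product

# Each protein with several copies maps to its copy names.
COPIES = {"ProtA": ["ProtA.1", "ProtA.2"]}

# (UniqueID, protein1, residue1, protein2, residue2), as in xlinks.csv.
links = [
    (1, "ProtA", 1, "ProtB", 10),
    (1, "ProtA", 1, "ProtB", 11),
    (2, "ProtB", 21, "ProtA", 1),
    (3, "ProtA", 1, "ProtA", 2),
]

def expand_copies(links, copies):
    """Group links by UniqueID, enumerating every copy assignment.

    Hypothetical helper; proteins without copies are kept unchanged.
    """
    expanded = {}
    for unique_id, p1, r1, p2, r2 in links:
        choices1 = copies.get(p1, [p1])
        choices2 = copies.get(p2, [p2])
        for c1, c2 in product(choices1, choices2):
            expanded.setdefault(unique_id, []).append((c1, r1, c2, r2))
    return expanded

expanded = expand_copies(links, COPIES)
sizes = {unique_id: len(group) for unique_id, group in expanded.items()}
print(sizes)
```

The identification ambiguity and the compositional ambiguity multiply within unique ID 1, which is why its two identifications become four alternatives.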