# The `comparison` module

In this notebook, we explore the main classes in the `comparison` module that can be used to calculate

- _fingerprint distances_ between two fingerprints representing two structures (`FingerprintDistance`) and
- _feature distances_ between each feature of two fingerprints (`FeatureDistances`).

__Note__: The calculation of _feature distances_ is the step before the calculation of the final _fingerprint distance_.

Such distances can be generated not only between two fingerprints, as described above, but also in bulk for a set of fingerprints in an all-against-all comparison, using the classes `FeatureDistancesGenerator` and `FingerprintDistanceGenerator`.

Let's take a look at the API logic in this table again:

| Action                                                         | Module       | Class for single calculation | Class for bulk calculation     |
|----------------------------------------------------------------|--------------|------------------------------|--------------------------------|
| Encode structures as fingerprint                               | `encoding`   | `Fingerprint`                | `FingerprintGenerator`         |
| Compare fingerprint features (calculate feature distance**s**) | `comparison` | `FeatureDistances`           | `FeatureDistancesGenerator`    |
| Compare fingerprints (calculate fingerprint distance)          | `comparison` | `FingerprintDistance`        | `FingerprintDistanceGenerator` |

In [None]:
%load_ext autoreload
%autoreload 2

In [1]:
from pathlib import Path

from opencadd.databases.klifs import setup_local

from kissim.encoding import FingerprintGenerator
from kissim.comparison import FeatureDistances, FeatureDistancesGenerator
from kissim.comparison import FingerprintDistance, FingerprintDistanceGenerator



In [2]:
HERE = Path(_dh[-1])
DATA = HERE / "../../kissim/tests/data/KLIFS_download/"

## Get local KLIFS structures

We use the `opencadd.databases.klifs` module to access structures in our local KLIFS download.

In [17]:
klifs_session = setup_local(DATA)

In [18]:
structures = klifs_session.structures.all_structures()

In [19]:
structure_klifs_ids = structures["structure.klifs_id"].to_list()
print(f"Number of structures: {len(structure_klifs_ids)}")
print(*structure_klifs_ids)

Number of structures: 16
109 118 110 113 111 116 112 114 115 117 12347 1641 2542 3833 5399 9122


## Generate fingerprints

Let's generate a few fingerprints for the structures in our local KLIFS download using the bulk fingerprint generator `FingerprintGenerator`.

In [30]:
# Use local KLIFS session to access KLIFS data
fingerprint_generator = FingerprintGenerator.from_structure_klifs_ids(structure_klifs_ids, klifs_session=klifs_session)
print(f"Number of fingerprints: {len(fingerprint_generator.data.keys())}")

117: Local complex.pdb or pocket.pdb file missing: /home/dominique/Documents/GitHub/kissim/docs/tutorials/../../kissim/tests/data/KLIFS_download/HUMAN/ABL2/3gvu_altA_chainA/complex.pdb
117: Empty fingerprint (data unaccessible).
9122: Non-standard residue MSE is set to MET.
9122: Non-standard residue MSE is set to MET.
9122: Non-standard residue MSE is set to MET.
9122: Non-standard residue MSE is set to MET.
9122: Non-standard residue MSE is set to MET.
9122: Non-standard residue MSE is set to MET.
9122: Non-standard residue MSE is set to MET.
9122: Non-standard residue MSE is set to MET.
9122: Non-standard residue MSE is set to MET.
9122: Non-standard residue MSE is set to MET.
9122: Non-standard residue MSE is set to MET.
9122: Non-standard residue MSE is set to MET.


Number of fingerprints: 15


__Note__: If a fingerprint cannot be generated (e.g. because structural data is missing), the structure is removed from the list.

For convenience later on, we extract a list of `Fingerprint` objects from the `FingerprintGenerator` object.

In [21]:
fingerprints = list(fingerprint_generator.data.values())



## Compare two fingerprints

Let's first focus on the comparison between two fingerprints only.

For two fingerprints (`Fingerprint` objects), we will 

1. Calculate the _feature distances_ using `FeatureDistances` and 
2. Calculate the final _fingerprint distance_ from these _feature distances_ and the given _feature weights_ using `FingerprintDistance`.

### Generate feature distances between two fingerprints (`FeatureDistances`)

- Input: Two `Fingerprint` objects
- Output: `FeatureDistances` object
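
To make the `distance` and `bit_coverage` columns of the resulting table more concrete, here is a minimal sketch. It assumes that a feature distance is the mean absolute difference over the positions where both fingerprints have a value, and that the bit coverage is the fraction of such positions. The helper `feature_distance` and the example vectors are hypothetical and only illustrate the idea; this is not kissim's implementation.

```python
import math


def feature_distance(values1, values2):
    """Return (distance, coverage) for two equally long feature vectors.

    ASSUMPTION: mean absolute difference over positions that are not NaN
    in either vector; coverage is the fraction of such positions.
    """
    if not values1 or len(values1) != len(values2):
        raise ValueError("Feature vectors must be non-empty and of equal length.")
    # Keep only positions where both fingerprints have a value
    shared = [
        (a, b)
        for a, b in zip(values1, values2)
        if not (math.isnan(a) or math.isnan(b))
    ]
    coverage = len(shared) / len(values1)
    if not shared:
        # No comparable positions: the distance is undefined
        return math.nan, coverage
    distance = sum(abs(a - b) for a, b in shared) / len(shared)
    return distance, coverage


# Made-up feature vectors with one missing value each
example1 = [1.0, 2.0, math.nan, 3.0]
example2 = [1.0, 3.0, 2.0, math.nan]
distance, coverage = feature_distance(example1, example2)
print(distance, coverage)
```

In this toy example only two of the four positions are comparable, so the coverage is 0.5 and the distance is averaged over those two positions only.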

In [8]:
fingerprint1 = fingerprints[0]
fingerprint2 = fingerprints[1]

In [9]:
feature_distances = FeatureDistances.from_fingerprints(fingerprint1, fingerprint2)
print(f"Kinase pair: {feature_distances.kinase_pair_ids}")
print(f"Structure pair: {feature_distances.structure_pair_ids}")
feature_distances.data

Kinase pair: ('ABL2', 'ABL2')
Structure pair: (109, 118)


|   | feature_type    | feature_name             | distance | bit_coverage |
|---|-----------------|--------------------------|----------|--------------|
| 0 | physicochemical | size                     | 0.0      | 1.0          |
| 1 | physicochemical | hbd                      | 0.0      | 1.0          |
| 2 | physicochemical | hba                      | 0.0      | 1.0          |
| 3 | physicochemical | charge                   | 0.0      | 1.0          |
| 4 | physicochemical | aromatic                 | 0.0      | 1.0          |
| 5 | physicochemical | aliphatic                | 0.0      | 1.0          |
| 6 | physicochemical | sco                      | 0.08     | 0.88         |
| 7 | physicochemical | exposure                 | 0.294118 | 1.0          |
| 8 | distances       | distance_to_centroid     | 0.059839 | 1.0          |
| 9 | distances       | distance_to_hinge_region | 0.122168 | 1.0          |


### Generate fingerprint distance between two fingerprints (`FingerprintDistance`)

- Input: `FeatureDistances` object and optionally feature weights
- Output: `FingerprintDistance` object

#### Use standard feature weights

In [10]:
fingerprint_distance = FingerprintDistance.from_feature_distances(feature_distances, feature_weights=None)
print(f"Fingerprint distance: {fingerprint_distance.distance}")
print(f"Fingerprint bit coverage: {fingerprint_distance.bit_coverage}")
print(f"Feature weights: {fingerprint_distance.feature_weights}")

Fingerprint distance: 0.07421423894307076
Fingerprint bit coverage: 0.9919999999999999
Feature weights: [0.06666667 0.06666667 0.06666667 0.06666667 0.06666667 0.06666667
 0.06666667 0.06666667 0.06666667 0.06666667 0.06666667 0.06666667
 0.06666667 0.06666667 0.06666667]
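
The bit coverage printed above is consistent with a weighted sum of the per-feature bit coverages. The sketch below checks this under two assumptions: the five features not displayed in the table above have a bit coverage of 1.0, and values are combined as a plain weighted sum. `weighted_sum` is a hypothetical helper, not part of kissim.

```python
def weighted_sum(values, weights):
    """Combine per-feature values into one number using feature weights."""
    if len(values) != len(weights):
        raise ValueError("Need exactly one weight per feature.")
    return sum(value * weight for value, weight in zip(values, weights))


n_features = 15
equal_weights = [1 / n_features] * n_features

# Bit coverages per feature: only `sco` (index 6) is 0.88 in the table above.
# ASSUMPTION: the five features that are not displayed have coverage 1.0.
bit_coverages = [1.0] * 6 + [0.88] + [1.0] * 8

coverage = weighted_sum(bit_coverages, equal_weights)
print(round(coverage, 3))  # 0.992
```

The fingerprint distance is presumably combined from the feature distances in the same way, but it cannot be recomputed here because only ten of the fifteen feature distances are shown.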


#### Use user-defined feature weights

In [11]:
feature_weights = [0.3 / 8] * 8 + [0.5 / 4] * 4 + [0.2 / 3] * 3
fingerprint_distance = FingerprintDistance.from_feature_distances(feature_distances, feature_weights=feature_weights)
print(f"Fingerprint distance: {fingerprint_distance.distance}")
print(f"Fingerprint bit coverage: {fingerprint_distance.bit_coverage}")
print(f"Feature weights: {fingerprint_distance.feature_weights}")

Fingerprint distance: 0.08417398268335104
Fingerprint bit coverage: 0.9954999999999999
Feature weights: [0.0375     0.0375     0.0375     0.0375     0.0375     0.0375
 0.0375     0.0375     0.125      0.125      0.125      0.125
 0.06666667 0.06666667 0.06666667]
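
The weight list above spreads a total weight of 0.3, 0.5 and 0.2 over groups of 8, 4 and 3 features. The following sketch builds the same list from per-group totals and runs two sanity checks that seem sensible before passing weights on (15 entries, sum of 1.0). The group labels are placeholders chosen for illustration. The coverage check again assumes that all features except `sco` have a bit coverage of 1.0.

```python
import math

# Number of features per group and total weight per group.
# The labels are hypothetical placeholders, ordered as in the weight list.
group_sizes = {"group_1": 8, "group_2": 4, "group_3": 3}
group_totals = {"group_1": 0.3, "group_2": 0.5, "group_3": 0.2}

feature_weights = []
for name, size in group_sizes.items():
    # Each feature in a group gets an equal share of the group's total weight
    feature_weights.extend([group_totals[name] / size] * size)

if len(feature_weights) != 15:
    raise ValueError("Expected one weight per feature (15 in total).")
if not math.isclose(sum(feature_weights), 1.0):
    raise ValueError("Feature weights should sum to 1.0.")

# ASSUMPTION: only `sco` (index 6) has a reduced bit coverage of 0.88
bit_coverages = [1.0] * 6 + [0.88] + [1.0] * 8
coverage = sum(c * w for c, w in zip(bit_coverages, feature_weights))
print(round(coverage, 4))  # 0.9955
```

Because `sco` sits in the first group, its reduced coverage now carries a weight of 0.0375 instead of 1/15, which is why the bit coverage rises from 0.992 to 0.9955.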


## Compare all-against-all fingerprints

Let's now take a look at the bulk distance generators to generate all-against-all comparisons for a set of fingerprints.

For a `FingerprintGenerator` object, which contains the fingerprints for a set of structures, we will 

1. Calculate _feature distances_ for all fingerprint pairs using `FeatureDistancesGenerator` and 
2. Calculate the final _fingerprint distance_ for all fingerprint pairs from these _feature distances_ and the given _feature weights_ using `FingerprintDistanceGenerator`.

### Generate feature distances for all structure/fingerprint pairs (`FeatureDistancesGenerator`)

- Input: `FingerprintGenerator` object
- Output: `FeatureDistancesGenerator` object
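
As a rough illustration of what "all-against-all" means here, the sketch below enumerates unordered pairs with `itertools.combinations`, using the 16 structure KLIFS IDs printed earlier. That the generator pairs fingerprints in exactly this way is an assumption, and the pair count will be smaller if fingerprints are dropped beforehand.

```python
from itertools import combinations

# Structure KLIFS IDs in the order printed earlier in this notebook
structure_klifs_ids = [
    109, 118, 110, 113, 111, 116, 112, 114,
    115, 117, 12347, 1641, 2542, 3833, 5399, 9122,
]

# Unordered pairs without self-comparisons: n * (n - 1) / 2 pairs
pairs = list(combinations(structure_klifs_ids, 2))

print(len(pairs))  # 120
print(pairs[0])    # (109, 118)
```

The first pair, `(109, 118)`, matches the example structure pair shown in the output of the next cell.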

In [12]:
feature_distances_generator = FeatureDistancesGenerator.from_fingerprint_generator(fingerprint_generator)
feature_distances_list = list(feature_distances_generator.data.values())
print("One example structure pair:")
print(feature_distances_list[0].structure_pair_ids)
feature_distances_list[0].data

One example structure pair:
(109, 118)


|   | feature_type    | feature_name             | distance | bit_coverage |
|---|-----------------|--------------------------|----------|--------------|
| 0 | physicochemical | size                     | 0.0      | 1.0          |
| 1 | physicochemical | hbd                      | 0.0      | 1.0          |
| 2 | physicochemical | hba                      | 0.0      | 1.0          |
| 3 | physicochemical | charge                   | 0.0      | 1.0          |
| 4 | physicochemical | aromatic                 | 0.0      | 1.0          |
| 5 | physicochemical | aliphatic                | 0.0      | 1.0          |
| 6 | physicochemical | sco                      | 0.08     | 0.88         |
| 7 | physicochemical | exposure                 | 0.294118 | 1.0          |
| 8 | distances       | distance_to_centroid     | 0.059839 | 1.0          |
| 9 | distances       | distance_to_hinge_region | 0.122168 | 1.0          |


### Generate fingerprint distances for all structure/fingerprint pairs (`FingerprintDistanceGenerator`)

- Input: `FeatureDistancesGenerator` object and optionally feature weights
- Output: `FingerprintDistanceGenerator` object

In [13]:
fingerprint_distance_generator = FingerprintDistanceGenerator.from_feature_distances_generator(feature_distances_generator)

In [14]:
fingerprint_distance_generator.data.head(20)

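|   | structure1 | structure2 | kinase1 | kinase2 | distance | coverage |
|---|------------|------------|---------|---------|----------|----------|
| 0 | 109        | 118        | ABL2    | ABL2    | 0.074214 | 0.992    |
| 1 | 109        | 110        | ABL2    | ABL2    | 0.061968 | 0.986667 |
| 2 | 109        | 113        | ABL2    | ABL2    | 0.064064 | 0.984    |
| 3 | 109        | 111        | ABL2    | ABL2    | 0.064064 | 0.984    |
| 4 | 109        | 116        | ABL2    | ABL2    | 0.05863  | 0.978    |
| 5 | 109        | 112        | ABL2    | ABL2    | 0.061968 | 0.986667 |
| 6 | 109        | 114        | ABL2    | ABL2    | 0.05863  | 0.978    |
| 7 | 109        | 115        | ABL2    | ABL2    | 0.074239 | 0.992    |
| 8 | 109        | 117        | ABL2    | ABL2    | 0.000402 | 0.994    |
| 9 | 109        | 12347      | ABL2    | BRAF    | 0.259053 | 0.919333 |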


#### Kinase distance matrix

In [15]:
fingerprint_distance_generator.kinase_distance_matrix(by="minimum")

| kinase1 / kinase2 | AAK1     | ABL2     | ADCK3    | AKT1     | ALK      | BRAF     | CHK1     |
|-------------------|----------|----------|----------|----------|----------|----------|----------|
| AAK1              | 0.0      | 0.254438 | 0.303542 | 0.350129 | 0.267515 | 0.307277 | 0.22959  |
| ABL2              | 0.254438 | 0.0      | 0.343806 | 0.339039 | 0.147781 | 0.259043 | 0.2382   |
| ADCK3             | 0.303542 | 0.343806 | 0.0      | 0.420406 | 0.328649 | 0.376875 | 0.347142 |
| AKT1              | 0.350129 | 0.339039 | 0.420406 | 0.0      | 0.340086 | 0.291347 | 0.359764 |
| ALK               | 0.267515 | 0.147781 | 0.328649 | 0.340086 | 0.0      | 0.277828 | 0.23237  |
| BRAF              | 0.307277 | 0.259043 | 0.376875 | 0.291347 | 0.277828 | 0.0      | 0.30333  |
| CHK1              | 0.22959  | 0.2382   | 0.347142 | 0.359764 | 0.23237  | 0.30333  | 0.0      |
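
The option `by="minimum"` suggests that each kinase pair is represented by the smallest distance among its structure pairs. The sketch below illustrates this reading with three ABL2/BRAF structure pairs whose distances are taken from the structure distance matrix in the next section. `kinase_distances` is a hypothetical helper, and the grouping logic is an assumption rather than kissim's code.

```python
from collections import defaultdict

# (kinase1, kinase2, distance) for three ABL2/BRAF structure pairs
structure_pairs = [
    ("ABL2", "BRAF", 0.259053),  # structures 109 and 12347
    ("ABL2", "BRAF", 0.259043),  # structures 117 and 12347
    ("ABL2", "BRAF", 0.273133),  # structures 118 and 12347
]


def kinase_distances(structure_pairs, reducer=min):
    """Reduce structure-pair distances to one distance per kinase pair."""
    grouped = defaultdict(list)
    for kinase1, kinase2, distance in structure_pairs:
        # Sort the kinase names so that (A, B) and (B, A) share one key
        key = tuple(sorted((kinase1, kinase2)))
        grouped[key].append(distance)
    return {key: reducer(distances) for key, distances in grouped.items()}


kinase_pair_distances = kinase_distances(structure_pairs)
print(kinase_pair_distances)  # {('ABL2', 'BRAF'): 0.259043}
```

The result, 0.259043, agrees with the ABL2/BRAF entry in the kinase distance matrix above. Passing another `reducer`, for example `max`, would give a different summary of the same structure pairs.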


#### Structure distance matrix

In [16]:
fingerprint_distance_generator.structure_distance_matrix()

| structure1 / structure2 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 1641 | 2542 | 3833 | 5399 | 9122 | 12347 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 109 | 0.0 | 0.061968 | 0.064064 | 0.061968 | 0.064064 | 0.05863 | 0.074239 | 0.05863 | 0.000402 | 0.074214 | 0.253045 | 0.339039 | 0.277368 | 0.147781 | 0.358882 | 0.259053 |
| 110 | 0.061968 | 0.0 | 0.02399 | 0.0 | 0.02399 | 0.017678 | 0.084937 | 0.017678 | 0.061966 | 0.084942 | 0.240377 | 0.358892 | 0.266317 | 0.156963 | 0.350771 | 0.278838 |
| 111 | 0.064064 | 0.02399 | 0.0 | 0.02399 | 0.0 | 0.025339 | 0.081998 | 0.025339 | 0.064067 | 0.081992 | 0.238899 | 0.361963 | 0.254438 | 0.153613 | 0.343806 | 0.277165 |
| 112 | 0.061968 | 0.0 | 0.02399 | 0.0 | 0.02399 | 0.017678 | 0.084937 | 0.017678 | 0.061966 | 0.084942 | 0.240377 | 0.358892 | 0.266317 | 0.156963 | 0.350771 | 0.278838 |
| 113 | 0.064064 | 0.02399 | 0.0 | 0.02399 | 0.0 | 0.025339 | 0.081998 | 0.025339 | 0.064067 | 0.081992 | 0.238899 | 0.361963 | 0.254438 | 0.153613 | 0.343806 | 0.277165 |
| 114 | 0.05863 | 0.017678 | 0.025339 | 0.017678 | 0.025339 | 0.0 | 0.080826 | 0.0 | 0.058632 | 0.080825 | 0.2382 | 0.350851 | 0.25739 | 0.154057 | 0.353562 | 0.274522 |
| 115 | 0.074239 | 0.084937 | 0.081998 | 0.084937 | 0.081998 | 0.080826 | 0.0 | 0.080826 | 0.074353 | 0.000335 | 0.246792 | 0.353798 | 0.282923 | 0.163763 | 0.360811 | 0.273198 |
| 116 | 0.05863 | 0.017678 | 0.025339 | 0.017678 | 0.025339 | 0.0 | 0.080826 | 0.0 | 0.058632 | 0.080825 | 0.2382 | 0.350851 | 0.25739 | 0.154057 | 0.353562 | 0.274522 |
| 117 | 0.000402 | 0.061966 | 0.064067 | 0.061966 | 0.064067 | 0.058632 | 0.074353 | 0.058632 | 0.0 | 0.07433 | 0.253139 | 0.339073 | 0.277392 | 0.147863 | 0.358887 | 0.259043 |
| 118 | 0.074214 | 0.084942 | 0.081992 | 0.084942 | 0.081992 | 0.080825 | 0.000335 | 0.080825 | 0.07433 | 0.0 | 0.246844 | 0.353715 | 0.282949 | 0.163769 | 0.360833 | 0.273133 |
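
The matrix above is symmetric with zeros on the diagonal. As a minimal sketch of how such a matrix can be assembled from a list of pairwise distances, the example below mirrors each pair into both triangles of a nested dictionary. The helper `to_distance_matrix` is hypothetical, and the zero diagonal is an assumption based on the output shown; the three distances are copied from the matrix above.

```python
# Pairwise distances for three structures, copied from the matrix above
pair_distances = {
    (109, 118): 0.074214,
    (109, 110): 0.061968,
    (118, 110): 0.084942,
}


def to_distance_matrix(pair_distances):
    """Build a symmetric nested-dict matrix from unordered pair distances."""
    ids = sorted({structure_id for pair in pair_distances for structure_id in pair})
    # ASSUMPTION: a structure has distance 0.0 to itself; unknown pairs stay None
    matrix = {i: {j: (0.0 if i == j else None) for j in ids} for i in ids}
    for (i, j), distance in pair_distances.items():
        # Mirror each pair so that matrix[i][j] == matrix[j][i]
        matrix[i][j] = distance
        matrix[j][i] = distance
    return matrix


matrix = to_distance_matrix(pair_distances)
for structure_id, row in matrix.items():
    print(structure_id, row)
```

Storing each distance only once per pair and mirroring it on demand keeps the pair list half the size of the full matrix.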
