# This notebook demonstrates how to compare model-assigned attributions with the ground-truth attributions given by the Polar deterministic binding rule

The function `compute_attribution_auc` can be used to analyse any dataframe of the kind that our `compute_attributions` script generates under the Polar deterministic binding rule.

In [1]:
%config Completer.use_jedi = False
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob



In [2]:
def compute_attribution_auc(attribution_df, random=False):

    # Attribution AUC = 1 - (#changes needed / max #changes needed),
    # where a change is a swap of two adjacent rows and the goal is to
    # move all of the binding atoms to the top of the dataframe.
    #
    # If all of the binding atoms are ranked at the top the score is 1,
    # and if they are all placed at the bottom the score is 0.
    #
    # The rows are assumed to be in rank order already (top row = highest-ranked
    # atom) with an index running from 0 to n - 1. The score is not defined if
    # there are no binding atoms or if every atom is a binding atom.

    if random:
        # Random baseline: replace the attributions with uniform noise and re-sort.
        # Work on a copy so that the caller's dataframe is left unchanged.
        attribution_df = attribution_df.copy()
        attribution_df['attribution'] = np.random.uniform(size=attribution_df.shape[0])
        attribution_df = attribution_df.sort_values('attribution', ascending=True).reset_index(drop=True)

    # Obtain the number of binding atoms and their 1-based ranks
    num_binding_atoms = int(attribution_df['binding'].sum())

    binding_atom_ranks = []

    for idx, row in attribution_df.iterrows():
        if row['binding'] == 1:
            binding_atom_ranks.append(idx + 1)

    # With every binding atom at the top, the ranks sum to 1 + 2 + ... + num_binding_atoms
    num_changes_needed = sum(binding_atom_ranks) - sum(range(num_binding_atoms + 1))
    max_num_changes = num_binding_atoms * (attribution_df.shape[0] - num_binding_atoms)

    attribution_auc = 1 - num_changes_needed / max_num_changes

    return attribution_auc
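
In symbols, the score computed by the function can be written as follows. The notation is introduced here for illustration only: $N$ is the number of atoms in the dataframe, $k$ the number of binding atoms, and $r_1, \dots, r_k$ the 1-based ranks of the binding atoms.

$$
\text{Attribution AUC} = 1 - \frac{\sum_{i=1}^{k} r_i - \frac{k(k+1)}{2}}{k\,(N - k)}
$$

The subtracted term $k(k+1)/2 = 1 + 2 + \dots + k$ is the rank sum obtained when every binding atom sits at the top, so the numerator is $0$ for a perfect ranking and $k(N - k)$ when all binding atoms are at the bottom.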

## Load in an example attribution df

In [3]:
df = pd.read_csv('../data/polar_attr_df_example.csv', sep = ' ')

In [4]:
df

| Unnamed: 0 | atom_idx | x | y | z | attribution | binding |
|---|---|---|---|---|---|---|
| 0 | 10 | -2.3033 | 0.7924 | -0.3791 | -0.042 | 0.0 |
| 1 | 8 | -0.175 | -0.2503 | -0.0171 | -0.053 | 1.0 |
| 2 | 13 | -4.3739 | 1.429 | -0.1046 | -0.056 | 0.0 |
| 3 | 14 | -3.6864 | 0.3331 | -0.4069 | -0.088 | 0.0 |
| 4 | 9 | -1.2717 | -0.0827 | -0.9192 | -0.099 | 0.0 |
| 5 | 12 | -3.6611 | 2.4153 | 0.0938 | -0.115 | 0.0 |
| 6 | 4 | 2.6241 | -0.7778 | 0.1187 | -0.116 | 0.0 |
| 7 | 11 | -2.3332 | 2.1108 | -0.0357 | -0.117 | 1.0 |
| 8 | 3 | 2.1177 | 0.5014 | -0.0444 | -0.118 | 0.0 |
| 9 | 7 | 0.3303 | -1.5907 | -0.3213 | -0.124 | 0.0 |


If we inspect the loaded dataframe, we can see the index of each atom and its 3D coordinates. These can be used to map each row back to a specific atom in the original SDF file.

The `attribution` column gives the model-assigned attributions, computed via atomic masking (or by any other means). The `binding` column contains a 0 if the atom is not involved in binding and a 1 if it is involved in binding (according to the deterministic binding rule).

We want the active atoms to be given a high rank by the model attributions.
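
`compute_attribution_auc` does not sort its input outside the `random` branch, so the rows need to be in rank order before scoring. The following is a minimal sketch of that preparation, assuming that a higher attribution means a higher rank, as in the example dataframe above. The values in `toy_df` are made up for illustration and are not taken from the example CSV.

```python
import pandas as pd

# Toy values for illustration only (not taken from the example CSV).
toy_df = pd.DataFrame({
    "atom_idx": [0, 1, 2, 3, 4],
    "attribution": [-0.30, -0.05, -0.20, -0.10, -0.40],
    "binding": [0.0, 1.0, 0.0, 1.0, 0.0],
})

# Highest attribution first, with a fresh 0..n-1 index so that
# row position + 1 is the rank of the atom.
ranked_df = toy_df.sort_values("attribution", ascending=False).reset_index(drop=True)

# 1-based ranks of the binding atoms.
binding_ranks = (ranked_df.index[ranked_df["binding"] == 1] + 1).tolist()
print(binding_ranks)
```

Resetting the index matters here because the function derives each rank from the row index rather than from the row position.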


In [5]:
compute_attribution_auc(df) 

0.7666666666666666

The Attribution AUC score is between 0 and 1. 

* A score of 1 corresponds to a perfect ranking, with the active atoms ranked at the top and all other atoms below.
* A score of 0 corresponds to the worst possible ranking, with all the active atoms ranked at the bottom.
* A random ranking scores 0.5 on average.
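
These three properties can be checked with a small simulation. Each swap of adjacent rows fixes exactly one misordered pair, so the number of changes needed equals the number of (binding, non-binding) pairs in which the non-binding atom is ranked higher. The sketch below uses that pairwise view. `pairwise_auc` is a hypothetical helper written for this illustration, and the labels are toy values.

```python
import random

def pairwise_auc(binding):
    """Fraction of (binding, non-binding) pairs with the binding atom ranked higher.

    `binding` holds 0/1 labels listed from the top-ranked atom downwards.
    """
    non_binding_above = 0
    misordered_pairs = 0
    for label in binding:
        if label == 1:
            # Every non-binding atom seen so far outranks this binding atom.
            misordered_pairs += non_binding_above
        else:
            non_binding_above += 1
    k = sum(binding)
    return 1 - misordered_pairs / (k * (len(binding) - k))

best = pairwise_auc([1, 1, 0, 0, 0])    # binding atoms at the top
worst = pairwise_auc([0, 0, 0, 1, 1])   # binding atoms at the bottom

# Random rankings of a toy complex with 2 binding atoms out of 10.
rng = random.Random(0)
labels = [1, 1] + [0] * 8
scores = []
for _ in range(5000):
    rng.shuffle(labels)
    scores.append(pairwise_auc(labels))
random_mean = sum(scores) / len(scores)
print(best, worst, round(random_mean, 2))
```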

In the example above, the model assigns a high rank to the atom with index 8, which helps its Attribution AUC, but only assigns a medium ranking to the other active atom, number 11. 

For a set of $n$ synthetic protein-ligand complexes, we report the Attribution AUC statistic as the mean of the individual Attribution AUCs. In our experience, an average score above 0.8 indicates very good performance.
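
A minimal sketch of that aggregation, assuming each complex has already been reduced to its 0/1 `binding` labels in rank order. `rank_sum_auc` and `per_complex_labels` are hypothetical names, and the labels are toy values rather than real results.

```python
from statistics import mean

def rank_sum_auc(binding):
    """Hypothetical helper: Attribution AUC from 0/1 labels, top-ranked atom first."""
    k = sum(binding)
    n_atoms = len(binding)
    ranks = [pos + 1 for pos, label in enumerate(binding) if label == 1]
    # Adjacent swaps needed to bring every binding atom to the top.
    num_changes_needed = sum(ranks) - k * (k + 1) // 2
    return 1 - num_changes_needed / (k * (n_atoms - k))

# Toy labels for three complexes (illustrative only, not real results).
per_complex_labels = [
    [1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0],
]

per_complex_auc = [rank_sum_auc(labels) for labels in per_complex_labels]
mean_attribution_auc = mean(per_complex_auc)
print(per_complex_auc, mean_attribution_auc)
```

Averaging per complex gives every complex the same weight regardless of how many atoms it contains.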