# Local copy of KLIFS IDs

In the `local` module of `opencadd.databases.klifs`, we load KLIFS metadata from two KLIFS download files, `overview.csv` and `KLIFS_export.csv`, to create one KLIFS metadata table (which is standardized across the `local` and `remote` modules).

These KLIFS download files do not contain the kinase, ligand and structure KLIFS IDs. To make results from the `local` and `remote` modules easily comparable, we add these KLIFS IDs to the local KLIFS metadata table upon local session initialization (`local.SessionInitialization`).

Therefore, for each locally available structure (max. about 11,000 structures), we need to find its associated kinase, ligand and structure ID.
Since we do not want to query the KLIFS web server for each of these structures every time we initialize a local session, we fetch a local copy of the KLIFS IDs here.
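As a rough illustration of how such a local copy could be used, the sketch below attaches KLIFS IDs to a local metadata table by merging on PDB ID, alternate model and chain. This is not the code used in `opencadd`; the `local_metadata` table, its toy values and the PDB ID `9xyz` are hypothetical and only serve to show the lookup.

```python
import pandas as pd

# Toy stand-in for the local copy of KLIFS IDs (hypothetical subset of columns)
klifs_ids = pd.DataFrame(
    {
        "structure.klifs_id": [1, 2, 3],
        "structure.pdb_id": ["3dko", "2rei", "3dko"],
        "structure.alternate_model": ["A", "B", "B"],
        "structure.chain": ["A", "A", "A"],
        "kinase.klifs_id": [415, 415, 415],
    }
)

# Hypothetical local metadata table without KLIFS IDs;
# "9xyz" is a made-up entry that has no match in the ID copy
local_metadata = pd.DataFrame(
    {
        "structure.pdb_id": ["3dko", "2rei", "9xyz"],
        "structure.alternate_model": ["A", "B", "-"],
        "structure.chain": ["A", "A", "A"],
    }
)

key = ["structure.pdb_id", "structure.alternate_model", "structure.chain"]

# Left merge keeps every local structure; `validate` raises an error if the
# key is not unique in the ID copy (i.e. one structure maps to several IDs)
merged = local_metadata.merge(klifs_ids, on=key, how="left", validate="many_to_one")

# Local structures for which no KLIFS ID could be found
missing = merged[merged["structure.klifs_id"].isna()]
print(merged)
print(f"Structures without KLIFS ID: {len(missing)}")
```

A left merge is chosen here so that local structures without a match stay visible instead of being dropped silently.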

In [1]:
from datetime import date

import pandas as pd

from opencadd.databases.klifs.api import setup_remote

INFO:opencadd.databases.klifs.api:If you want to see an non-truncated version of the DataFrames in this module, use `pd.set_option('display.max_columns', 50)` in your notebook.


In [2]:
# Work with remote KLIFS data
remote = setup_remote()

INFO:opencadd.databases.klifs.api:Set up remote session...
INFO:opencadd.databases.klifs.api:Remote session is ready!


## Get kinase and structure IDs

In [3]:
# Fetch all structures (keep only ID related columns)
structures_all = remote.structures.all_structures()
structures_all = structures_all[
    [
        "structure.klifs_id",
        "structure.pdb_id",
        "structure.alternate_model",
        "structure.chain",
        "kinase.klifs_name",
        "kinase.klifs_id",
        "ligand.expo_id",
    ]
]
# Sort by structure KLIFS ID (reassign instead of sorting the column subset in place)
structures_all = structures_all.sort_values("structure.klifs_id")
# Show data
print(structures_all.shape)
structures_all.head()

(11659, 7)


Unnamed: 0,structure.klifs_id,structure.pdb_id,structure.alternate_model,structure.chain,kinase.klifs_name,kinase.klifs_id,ligand.expo_id
7710,1,3dko,A,A,EphA7,415,IHZ
7709,2,2rei,B,A,EphA7,415,-
7708,3,3dko,B,A,EphA7,415,IHZ
7707,4,2rei,A,A,EphA7,415,-
9563,5,3v8t,B,A,ITK,474,477


In [4]:
print("Sanity check: Are there multiple structure KLIFS IDs for one KLIFS structure?")
sizes = structures_all.groupby(["structure.pdb_id", "structure.alternate_model", "structure.chain"]).size()
if len(sizes[sizes > 1]) > 0:
    print(sizes[sizes > 1])
else:
    print("All good!")

Sanity check: Are there multiple structure KLIFS IDs for one KLIFS structure?
All good!
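The same uniqueness check can also be written with `DataFrame.duplicated`, which returns the offending rows directly. The sketch below runs on toy data with a deliberately duplicated key; the structure KLIFS ID `99` is made up for illustration.

```python
import pandas as pd

# Toy data: the last row repeats the key of the first row (hypothetical ID 99)
toy = pd.DataFrame(
    {
        "structure.klifs_id": [1, 2, 99],
        "structure.pdb_id": ["3dko", "2rei", "3dko"],
        "structure.alternate_model": ["A", "B", "A"],
        "structure.chain": ["A", "A", "A"],
    }
)

key = ["structure.pdb_id", "structure.alternate_model", "structure.chain"]

# keep=False marks every member of a duplicated group, not only the repeats
duplicates = toy[toy.duplicated(subset=key, keep=False)]

if len(duplicates) > 0:
    print(duplicates.sort_values(key))
else:
    print("All good!")
```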


In [5]:
# Save local copy of KLIFS IDs
filename = f"klifs_ids.{date.today().strftime('%Y%m%d')}.csv.zip"
structures_all.to_csv(filename, index=False, compression="zip")

## Test: Load data

In [6]:
pd.read_csv(filename).head()

Unnamed: 0,structure.klifs_id,structure.pdb_id,structure.alternate_model,structure.chain,kinase.klifs_name,kinase.klifs_id,ligand.expo_id
0,1,3dko,A,A,EphA7,415,IHZ
1,2,2rei,B,A,EphA7,415,-
2,3,3dko,B,A,EphA7,415,IHZ
3,4,2rei,A,A,EphA7,415,-
4,5,3v8t,B,A,ITK,474,477
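One thing that may be worth checking when reloading the file is how pandas parses ID strings: by default, `pd.read_csv` converts strings such as `NA` to missing values. The sketch below writes a toy table to a temporary zip file and reads it back twice. The ligand ID `NA` is a hypothetical entry added only to show the effect; whether such an ID occurs in the KLIFS data is not checked here.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy table; the "NA" ligand ID is hypothetical and only illustrates the parsing issue
original = pd.DataFrame(
    {
        "structure.klifs_id": [1, 2, 5, 6],
        "structure.pdb_id": ["3dko", "2rei", "3v8t", "0xyz"],
        "ligand.expo_id": ["IHZ", "-", "477", "NA"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "klifs_ids_toy.csv.zip"
    original.to_csv(path, index=False, compression="zip")

    # Default parsing: "NA" is interpreted as a missing value
    default_read = pd.read_csv(path)

    # Keep ligand IDs as plain strings and disable the default NaN markers
    strict_read = pd.read_csv(path, dtype={"ligand.expo_id": str}, keep_default_na=False)

print(default_read["ligand.expo_id"].isna().sum())
print(strict_read["ligand.expo_id"].tolist() == original["ligand.expo_id"].tolist())
```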


## Get ligand IDs?

In [7]:
ligands_all = remote.ligands.all_ligands()

In [8]:
print("Sanity check: Are there multiple ligand KLIFS IDs for one ligand PDB?")
sizes = ligands_all.groupby(["ligand.expo_id"]).size()
if len(sizes[sizes > 1]) > 0:
    duplicated_expo_ids = sizes[sizes > 1].index
    print(
        ligands_all[ligands_all["ligand.expo_id"].isin(duplicated_expo_ids)][
            ["ligand.klifs_id", "ligand.expo_id"]
        ].sort_values("ligand.expo_id")
    )
    print("These PDB IDs need to be checked manually!")
else:
    print("All good!")

Sanity check: Are there multiple ligand KLIFS IDs for one ligand PDB?
All good!
