# Loading `kissim` results

This short notebook shows how to load the following `kissim` output files as Python objects:

- `fingerprints.json`: Fingerprints for all successfully encoded structures
- `fingerprints_clean.json`: Fingerprints with outlier structures removed
- `feature_distances.csv.bz2`: Feature distances between all fingerprint pairs
- `fingerprint_distances.csv.bz2`: Fingerprint distances between all fingerprint pairs

In [1]:
from pathlib import Path

from kissim.encoding import FingerprintGenerator
from kissim.comparison import FeatureDistancesGenerator, FingerprintDistanceGenerator

from src.paths import PATH_RESULTS



In [2]:
HERE = Path(_dh[-1])  # noqa: F821
RESULTS = PATH_RESULTS / "all"
RESULTS

PosixPath('/home/dominique/Documents/GitHub/kissim_app/src/../results/all')

## Load fingerprints

### Without outlier filtering

In [3]:
%%time
fingerprints = FingerprintGenerator.from_json(RESULTS / "fingerprints.json")
len(fingerprints.data)

CPU times: user 1.1 s, sys: 125 ms, total: 1.22 s
Wall time: 1.27 s


4681

### With outlier filtering

In [4]:
%%time
fingerprints = FingerprintGenerator.from_json(RESULTS / "fingerprints_clean.json")
len(fingerprints.data)

CPU times: user 1.4 s, sys: 84.5 ms, total: 1.49 s
Wall time: 1.48 s


4681

## Load feature distances

In [5]:
%%time
feature_distances = FeatureDistancesGenerator.from_csv(RESULTS / "feature_distances.csv.bz2")
len(feature_distances.data)

CPU times: user 2min 17s, sys: 1.3 s, total: 2min 18s
Wall time: 2min 18s


10953540

## Load fingerprint distances

In [6]:
%%time
fingerprint_distances = FingerprintDistanceGenerator.from_csv(
    RESULTS / "fingerprint_distances.csv.bz2"
)
len(fingerprint_distances.data)

CPU times: user 20.4 s, sys: 32 ms, total: 20.4 s
Wall time: 20.4 s


10953540
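Both distance files are bz2-compressed CSV files, so they can also be inspected without `kissim`. The following is a minimal sketch using only the standard library. It writes and reads a hypothetical toy file (`toy_distances.csv.bz2`); the rows and the `distance` column name are assumptions for illustration and do not reflect the full layout of the real files.

```python
import bz2
import csv
import tempfile
from pathlib import Path

# Hypothetical toy rows; values and the "distance" column are illustrative assumptions.
rows = [
    {"kinase.1": "AAK1", "kinase.2": "ABL1", "distance": "0.25"},
    {"kinase.1": "AAK1", "kinase.2": "EGFR", "distance": "0.31"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_distances.csv.bz2"

    # Write a bz2-compressed CSV in text mode.
    with bz2.open(path, "wt", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["kinase.1", "kinase.2", "distance"])
        writer.writeheader()
        writer.writerows(rows)

    # Read it back; bz2.open decompresses transparently.
    with bz2.open(path, "rt", newline="") as f:
        loaded = list(csv.DictReader(f))

print(len(loaded))
```

If `pandas` is available, `pd.read_csv` infers the compression from the `.bz2` extension by default, which is usually the more convenient route for files of this size.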

## Stats on all `kissim` datasets

In [7]:
import numpy as np
import pandas as pd

In [8]:
dataset_types = ["all", "dfg_in", "dfg_out"]
stats = {}

for dataset_type in dataset_types:
    
    print(dataset_type)
    stats[dataset_type] = {}

    # Set path to data folder
    RESULTS = PATH_RESULTS / dataset_type
    
    # Load distances
    fingerprint_distances = FingerprintDistanceGenerator.from_csv(
        RESULTS / "fingerprint_distances.csv.bz2"
    )
    n_structures = len(fingerprint_distances.structure_ids)
    stats[dataset_type]["Number of structures"] = n_structures
    n_kinases = len(fingerprint_distances.kinase_ids)
    stats[dataset_type]["Number of kinases"] = n_kinases
    # Observed number of structure pairs vs. the expected n * (n - 1) / 2
    stats[dataset_type]["Number of structure pairs w/o self-comparison"] = len(
        fingerprint_distances.data
    )
    stats[dataset_type]["Number of structure pairs w/o self-comparison (theory)"] = (
        n_structures * (n_structures - 1) // 2
    )
    # Sort kinase pairs alphabetically and remove self-comparison
    kinase_pairs = fingerprint_distances.data[["kinase.1", "kinase.2"]].copy()
    kinase_pairs[["kinase.1", "kinase.2"]] = np.sort(
        kinase_pairs[["kinase.1", "kinase.2"]], axis=1
    )
    kinase_pairs = kinase_pairs.drop_duplicates()
    kinase_pairs = kinase_pairs[kinase_pairs["kinase.1"] != kinase_pairs["kinase.2"]]
    stats[dataset_type]["Number of kinase pairs w/o self-comparison"] = len(kinase_pairs)
    stats[dataset_type]["Number of kinase pairs w/o self-comparison (theory)"] = (
        n_kinases * (n_kinases - 1) // 2
    )

pd.DataFrame(stats)

all
dfg_in
dfg_out


| | all | dfg_in | dfg_out |
|---|---|---|---|
| Number of structures | 4681 | 4112 | 406 |
| Number of kinases | 279 | 257 | 71 |
| Number of structure pairs w/o self-comparison | 10953540 | 8452216 | 82215 |
| Number of structure pairs w/o self-comparison (theory) | 10953540 | 8452216 | 82215 |
| Number of kinase pairs w/o self-comparison | 38781 | 32896 | 2485 |
| Number of kinase pairs w/o self-comparison (theory) | 38781 | 32896 | 2485 |
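The "theory" rows are the number of unordered pairs without self-comparison, n(n - 1)/2. The sketch below illustrates the same counting logic with the standard library only. The toy mapping `structure_to_kinase` is hypothetical; the structure and kinase counts in `reported` are copied from the table above.

```python
from itertools import combinations
from math import comb

# Hypothetical toy data: each structure ID maps to a kinase name.
structure_to_kinase = {"s1": "AAK1", "s2": "AAK1", "s3": "ABL1", "s4": "EGFR"}

# All unordered structure pairs without self-comparison.
structure_pairs = list(combinations(sorted(structure_to_kinase), 2))

# Map each structure pair to a kinase pair, sorting the names within the pair
# so that (A, B) and (B, A) collapse into one entry; the set drops duplicates.
kinase_pairs = {
    tuple(sorted((structure_to_kinase[a], structure_to_kinase[b])))
    for a, b in structure_pairs
}
# Remove kinase self-comparisons such as (AAK1, AAK1).
kinase_pairs = {pair for pair in kinase_pairs if pair[0] != pair[1]}

n_structures = len(structure_to_kinase)
n_kinases = len(set(structure_to_kinase.values()))

# Observed counts match n * (n - 1) / 2 on the toy data.
assert len(structure_pairs) == comb(n_structures, 2)
assert len(kinase_pairs) == comb(n_kinases, 2)

# The same formula applied to the (structures, kinases) counts from the table.
reported = {"all": (4681, 279), "dfg_in": (4112, 257), "dfg_out": (406, 71)}
theory = {name: (comb(s, 2), comb(k, 2)) for name, (s, k) in reported.items()}
print(theory)
```

Because observed and theoretical counts agree for every dataset, each distance file contains every structure pair exactly once, and every kinase pair is covered by at least one structure pair.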
