# pssm.compare
> Algorithms for comparing two PSSMs

In [None]:
#| default_exp pssm.compare

## Overview

Functions for comparing Position-Specific Scoring Matrices (PSSMs) using various divergence and similarity metrics.

---

**KL Divergence**

`kl_divergence` computes the Kullback-Leibler divergence $D_{KL}(P \| Q)$ between two PSSMs, measuring information lost when one distribution approximates another. Returns per-position divergence values (non-negative, 0 means identical). Note: KL divergence is **asymmetric**.

```python
kl_divergence(
    p1=pssm_df,   # target PSSM P (DataFrame, shape: AA × positions)
    p2=pssm_df2,  # predicted PSSM Q (DataFrame, same shape as p1)
)
```

`kl_divergence_flat` computes mean KL divergence from flattened PSSM Series (e.g., rows from a PSSM DataFrame).

```python
kl_divergence_flat(
    p1_flat=pssms.iloc[1],  # target flattened PSSM (pd.Series with 'pos_AA' index)
    p2_flat=pssms.iloc[0],  # predicted flattened PSSM (pd.Series, same structure)
)
```

---

**JS Divergence**

`js_divergence` computes the Jensen-Shannon divergence, a **symmetric** version of KL divergence using the mixture $M = \frac{1}{2}(P + Q)$. Returns per-position divergence as a Series.

```python
js_divergence(
    p1=pssm_df,    # first PSSM (DataFrame, shape: AA × positions)
    p2=pssm_df2,   # second PSSM (DataFrame, same shape)
    index=True,    # whether to return Series with position index
)
```

`js_divergence_flat` computes mean JS divergence from flattened PSSM Series.

```python
js_divergence_flat(
    p1_flat=pssms.iloc[1],  # first flattened PSSM (pd.Series)
    p2_flat=pssms.iloc[0],  # second flattened PSSM (pd.Series)
)
```

---

**JS Similarity**

`js_similarity` converts JS divergence to a similarity score in $[0, 1]$ by normalizing to bits (dividing by $\log 2$) then computing $1 - JS_{bits}$. Returns per-position similarity.

```python
js_similarity(
    pssm1=pssm_df,   # first PSSM (DataFrame)
    pssm2=pssm_df2,  # second PSSM (DataFrame)
).mean()  # average across positions
```

`js_similarity_flat` computes mean JS similarity from flattened PSSM Series.

```python
js_similarity_flat(
    p1_flat=pssms.iloc[1],  # first flattened PSSM (pd.Series)
    p2_flat=pssms.iloc[0],  # second flattened PSSM (pd.Series)
)
```

---

**Cosine Similarity**

`cosine_similarity` computes cosine similarity $\frac{P \cdot Q}{\|P\| \|Q\|}$ per position between two PSSMs. Since PSSM values are probabilities in $[0,1]$, similarity is bounded in $[0, 1]$.

```python
cosine_similarity(
    pssm1=pssm_df,   # first PSSM (DataFrame, shape: AA × positions)
    pssm2=pssm_df2,  # second PSSM (DataFrame, same shape)
)
```

`cosine_overall_flat` computes a single overall cosine similarity across all positions from flattened PSSMs.

```python
cosine_overall_flat(
    pssm1_flat=pssms.iloc[0],  # first flattened PSSM (pd.Series)
    pssm2_flat=pssms.iloc[1],  # second flattened PSSM (pd.Series)
)
```


## Setup

In [None]:
#| export
import numpy as np, pandas as pd
from katlas.pssm.core import EPSILON

## Compare PSSM

In [None]:
from katlas.data import *
from katlas.pssm.core import *

In [None]:
pssms = Data.get_pspa_scale()

In [None]:
# one example
pssm_df = recover_pssm(pssms.iloc[1])
pssm_df2 = recover_pssm(pssms.iloc[0])

### KL divergence

In [None]:
#| export
def kl_divergence(p1,  # target pssm p (array-like, shape: (AA, positions))
                  p2,  # pred pssm q (array-like, same shape as p1)
                 ):
    """
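    KL divergence D_KL(p1 || p2), summed over amino acids at each position.

    p1 and p2 are pandas objects (a DataFrame with aa as index and position as
    columns, or a flattened Series), because `.align` is called on them below.
    Returns one value per position for DataFrames, or a single sum for a Series.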
    """
    assert p1.shape == p2.shape
    p1, p2 = p1.align(p2, join='inner', axis=None)
    # Mask invalid positions (both zero)
    valid = (p1 + p2) > 0
    p1 = np.where(valid, p1, 0.0)
    p2 = np.where(valid, p2, 0.0)

    # KL divergence: sum_x p1(x) log(p1(x)/p2(x))
    kl = np.sum(p1 * np.log((p1 + EPSILON) / (p2 + EPSILON)), axis=0)

    return kl

The Kullback–Leibler (KL) divergence between two probability distributions $P$ and $Q$ is defined as:

$$
\mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right)
$$

This measures the information lost when $Q$ is used to approximate $P$. It is **not symmetric**, i.e.,

$$
\mathrm{KL}(P \| Q) \ne \mathrm{KL}(Q \| P)
$$

and it is **non-negative**, meaning:

$$
\mathrm{KL}(P \| Q) \ge 0
$$

with equality if and only if $P = Q$ almost everywhere.

In practical computation, to avoid numerical instability when $P(x) = 0$ or $Q(x) = 0$, we often add a small constant $\varepsilon$:

$$
\mathrm{KL}_\varepsilon(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{Q(x) + \varepsilon} \right)
$$
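As a minimal sketch of the smoothed formula above, the following NumPy snippet evaluates it on two toy distributions and shows the asymmetry. The helper name `kl_eps` and the value `EPS = 1e-8` are assumptions made for illustration; the actual value of `EPSILON` in `katlas.pssm.core` may differ.

```python
import numpy as np

EPS = 1e-8  # assumed smoothing constant, not necessarily katlas' EPSILON


def kl_eps(p, q, eps=EPS):
    """Smoothed KL divergence between two 1-D probability vectors (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


p = np.array([0.7, 0.2, 0.1, 0.0])      # peaked distribution with one zero entry
q = np.array([0.25, 0.25, 0.25, 0.25])  # uniform distribution

kl_pq = kl_eps(p, q)  # information lost when q approximates p
kl_qp = kl_eps(q, p)  # much larger: p assigns ~0 mass where q has 0.25
kl_pp = kl_eps(p, p)  # identical distributions give 0

print(kl_pq, kl_qp, kl_pp)
```

The zero entry in `p` contributes nothing to `kl_pq` (it is multiplied by `p`), but it dominates `kl_qp`, where its size depends directly on the chosen `eps`.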

In [None]:
kl_divergence(pssm_df,pssm_df2)

array([0.29182172, 0.11138481, 0.24590698, 0.46021635, 0.36874823,
       0.53858511, 1.51571614, 0.02905442, 0.08530757, 0.07753394])

In [None]:
kl_divergence(pssm_df,pssm_df2).mean(),kl_divergence(pssm_df,pssm_df2).max()

(np.float64(0.37242752573216287), np.float64(1.5157161422110503))

In [None]:
#| export
def kl_divergence_flat(p1_flat, # pd.Series of target flattened pssm p
                       p2_flat, # pd.Series of pred flattened pssm q
                       ):

    "p1 and p2 are two flattened pd.Series whose index labels combine position and aa"
    kld = kl_divergence(p1_flat,p2_flat) # already a single sum for 1d input, so do not call kld.mean()
    total_position = len(p1_flat.index.str.extract(r'(-?\d+)').drop_duplicates())
    return float(kld/total_position)

In [None]:
%%time
kl_divergence_flat(pssms.iloc[1],pssms.iloc[0])

CPU times: user 3.03 ms, sys: 0 ns, total: 3.03 ms
Wall time: 3.04 ms


0.37242752573216287
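The flat variant divides the summed divergence by the number of distinct positions, which it reads from the Series index with a regular expression. The sketch below isolates that step on a toy index. The labels (`'-1P'`, `'0S'`, ...) are hypothetical stand-ins for the position-plus-amino-acid labels of a flattened PSSM.

```python
import pandas as pd

# Hypothetical flattened index: '<position><amino acid>' labels
idx = pd.Index(['-1P', '-1G', '0S', '0T', '1P', '1G'])
flat = pd.Series([0.6, 0.4, 0.5, 0.5, 0.9, 0.1], index=idx)

# Pull the (optionally negative) integer prefix out of every label,
# then keep one row per distinct position
positions = flat.index.str.extract(r'(-?\d+)').drop_duplicates()
n_positions = len(positions)

print(positions[0].tolist(), n_positions)
```

Because the pattern allows an optional leading minus sign, positions such as `-1` and `1` are counted separately.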

### JS divergence

In [None]:
#| export
def js_divergence(p1, # pssm 
                  p2, # pssm
                  index=True,
                 ):
    "p1 and p2 are two DataFrames with index as aa and column as position; flattened Series need index=False"
    assert p1.shape==p2.shape
    p1, p2 = p1.align(p2, join='inner', axis=None)
    if index: positions=p1.columns
    valid = (p1 + p2) > 0
    p1 = np.where(valid, p1, 0.0)
    p2 = np.where(valid, p2, 0.0)
    
    m = 0.5 * (p1 + p2)
    
    js = 0.5 * np.sum(p1 * np.log((p1+ EPSILON) / (m + EPSILON)), axis=0) + \
         0.5 * np.sum(p2 * np.log((p2+ EPSILON) / (m + EPSILON)), axis=0)
    return pd.Series(js,index=positions) if index else js

The Jensen-Shannon divergence between two probability distributions $P$ and $Q$ is defined as:

$$
\mathrm{JS}(P \| Q) = \frac{1}{2} \, \mathrm{KL}(P \| M) + \frac{1}{2} \, \mathrm{KL}(Q \| M)
$$

where $M = \frac{1}{2}(P + Q)$ is the average (mixture) distribution, and $\mathrm{KL}$ denotes the Kullback–Leibler divergence:

$$
\mathrm{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right)
$$

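Substituting this definition and adding a small smoothing constant $\varepsilon$ for numerical stability gives: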

$$
\mathrm{JS}_\varepsilon(P \| Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x) + \varepsilon}{M(x) + \varepsilon} \right)
+ \frac{1}{2} \sum_{x \in \mathcal{X}} Q(x) \log \left( \frac{Q(x) + \varepsilon}{M(x) + \varepsilon} \right)
$$
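The following sketch evaluates the smoothed JS formula on toy vectors to illustrate two properties: the result does not depend on argument order, and with the natural logarithm it reaches about $\log 2$ for distributions with disjoint support. The helper name `js_eps` and the value `EPS = 1e-8` are assumptions for illustration only.

```python
import numpy as np

EPS = 1e-8  # assumed smoothing constant, not necessarily katlas' EPSILON


def js_eps(p, q, eps=EPS):
    """Smoothed Jensen-Shannon divergence between two 1-D probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)))
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)))
    return float(0.5 * kl_pm + 0.5 * kl_qm)


p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])  # no overlap with p
r = np.array([0.5, 0.3, 0.2])

js_pr, js_rp = js_eps(p, r), js_eps(r, p)  # symmetric in its arguments
js_disjoint = js_eps(p, q)                 # close to log(2)
js_same = js_eps(r, r)                     # 0 for identical inputs

print(js_pr, js_rp, js_disjoint, np.log(2), js_same)
```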

In [None]:
js_divergence(pssm_df,pssm_df2)

Position
-5    0.065539
-4    0.025712
-3    0.054799
-2    0.103192
-1    0.083377
 0    0.105490
 1    0.344049
 2    0.007299
 3    0.020949
 4    0.018206
dtype: float64

In [None]:
js_divergence(pssm_df,pssm_df2).max(),js_divergence(pssm_df,pssm_df2).mean()

(np.float64(0.34404931056288773), np.float64(0.08286124552178498))

In [None]:
#| export
def js_divergence_flat(p1_flat, # pd.Series of flattened pssm
                       p2_flat, # pd.Series of flattened pssm
                       ):

    "p1 and p2 are two flattened pd.Series whose index labels combine position and aa"
    js = js_divergence(p1_flat,p2_flat,index=False)
    total_position = len(p1_flat.index.str.extract(r'(-?\d+)').drop_duplicates())
    return float(js/total_position)

In [None]:
%%time
js_divergence_flat(pssms.iloc[1],pssms.iloc[0])

CPU times: user 2.16 ms, sys: 14 μs, total: 2.18 ms
Wall time: 2.16 ms


0.08286124552178498

### JS similarity

To convert the Jensen–Shannon divergence into a similarity measure, we first normalize it to bits by dividing by $\log 2$. Because the JS divergence computed with the natural logarithm is bounded above by $\log 2$, the normalized divergence lies within the range $[0, 1]$.
$$
\mathrm{JS}_{\text{bits}}(P \| Q) = \frac{\mathrm{JS}(P \| Q)}{\log 2}
$$

The similarity is then defined as one minus this normalized divergence:
$$
\mathrm{Sim}_{\mathrm{JS}}(P, Q) = 1 - \mathrm{JS}_{\text{bits}}(P \| Q)
$$

Thus, $\mathrm{Sim}_{\mathrm{JS}}$ ranges from 0 (completely dissimilar) to 1 (identical distributions).
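A small self-contained sketch of this conversion, checked on three toy cases: identical, partially overlapping, and fully disjoint distributions. The function name `js_similarity_1d` and the default `eps=1e-8` are assumptions for illustration, not part of the library.

```python
import numpy as np


def js_similarity_1d(p, q, eps=1e-8):
    """1 - JS divergence in bits for two 1-D probability vectors (eps is assumed)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    js = 0.5 * np.sum(p * np.log((p + eps) / (m + eps))) \
       + 0.5 * np.sum(q * np.log((q + eps) / (m + eps)))
    return float(1.0 - js / np.log(2))  # divide by log(2) to convert nats to bits


sim_same = js_similarity_1d([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])      # identical
sim_partial = js_similarity_1d([0.5, 0.5, 0.0], [0.5, 0.0, 0.5])   # half the mass overlaps
sim_disjoint = js_similarity_1d([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # no overlap

print(sim_same, sim_partial, sim_disjoint)
```

In this toy setup the three cases land at roughly 1, 0.5, and 0, matching the interpretation of the score as the shared fraction of information.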

In [None]:
#| export
def js_similarity(pssm1,pssm2):
    "Convert JSD to bits so it lies in [0, 1], then return 1 - JSD."
    distance = js_divergence(pssm1,pssm2)/np.log(2)
    similarity = 1-distance
    return similarity

In [None]:
js_similarity(pssm_df,pssm_df2).mean()

np.float64(0.880456492003838)

In [None]:
#| export
def js_similarity_flat(p1_flat,p2_flat):
    "Convert mean JSD to bits so it lies in [0, 1], then return 1 - JSD."
    return 1-(js_divergence_flat(p1_flat,p2_flat)/np.log(2))

In [None]:
js_similarity_flat(pssms.iloc[1],pssms.iloc[0])

np.float64(0.880456492003838)

### Cosine similarity

In [None]:
#| export
def cosine_similarity(pssm1: pd.DataFrame, pssm2: pd.DataFrame) -> pd.Series:
    "Compute cosine similarity per position (column) between two PSSMs."
    
    assert pssm1.shape == pssm2.shape, "PSSMs must have the same shape"
    
    sims = {}
    for pos in pssm1.columns:
        v1 = pssm1[pos]
        v2 = pssm2[pos]
        v1,v2 = v1.align(v2, join='inner') # make sure the aa index match with each other

        norm1 = np.linalg.norm(v1)
        norm2 = np.linalg.norm(v2)

        if norm1 == 0 or norm2 == 0:
            sims[pos] = 0.0
        else:
            dot_product = sum(v1*v2) # np.dot(v1,v2)
            sims[pos] = dot_product / (norm1 * norm2)

    return pd.Series(sims)

The cosine similarity between two vectors $P$ and $Q$ (e.g., two PSSM columns representing amino acid probability distributions) is defined as:

$$
\mathrm{cos}(P, Q) = \frac{P \cdot Q}{\|P\| \, \|Q\|}
$$

where $P \cdot Q = \sum_{i=1}^{n} P_i Q_i$ is the dot product between $P$ and $Q$, and $\|P\| = \sqrt{\sum_{i=1}^{n} P_i^2}$ is the Euclidean norm of $P$.

Since all entries of $P$ and $Q$ are nonnegative probabilities (i.e., $P_i, Q_i \in [0,1]$), the cosine similarity lies within the range:

$$
0 \leq \mathrm{cos}(P, Q) \leq 1
$$
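To make the bounds concrete, here is a minimal sketch of the cosine formula on single columns. The helper `cosine` is a hypothetical name. Returning 0 for an all-zero vector mirrors the convention used in `cosine_similarity` above.

```python
import numpy as np


def cosine(p, q):
    """Cosine similarity of two 1-D vectors; 0.0 if either has zero norm."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    denom = np.linalg.norm(p) * np.linalg.norm(q)
    if denom == 0:
        return 0.0
    return float(np.dot(p, q) / denom)


col = np.array([0.7, 0.2, 0.1])

cos_same = cosine(col, col)                 # identical columns
cos_scaled = cosine(col, 2 * col)           # rescaling does not change the angle
cos_orth = cosine([1, 0, 0], [0, 1, 0])     # no shared amino acid
cos_zero = cosine([0, 0, 0], col)           # zero-norm guard

print(cos_same, cos_scaled, cos_orth, cos_zero)
```

Unlike the KL and JS divergences, cosine similarity ignores overall scale, so it compares the direction of two columns rather than their absolute probabilities.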

In [None]:
cosine_similarity(pssm_df,pssm_df2).sort_values()

 1    0.130818
-2    0.606234
-1    0.731466
 0    0.780066
-5    0.780504
-3    0.786276
-4    0.901395
 3    0.918692
 4    0.934967
 2    0.971066
dtype: float64

In [None]:
cosine_similarity(pssm_df,pssm_df2).mean()

np.float64(0.7541484704577781)

In [None]:
#| export
def cosine_overall_flat(pssm1_flat, pssm2_flat):
    """Compute overall cosine similarity between two PSSMs (flattened)."""
    # match index for dot product
    pssm1_flat, pssm2_flat = pssm1_flat.align(pssm2_flat, join='inner')
    norm1 = np.linalg.norm(pssm1_flat)
    norm2 = np.linalg.norm(pssm2_flat)
    if norm1 == 0 or norm2 == 0: return 0.0
    dot_product = sum(pssm1_flat*pssm2_flat) # np.dot(pssm1_flat, pssm2_flat)
    return  dot_product/ (norm1 * norm2)

In [None]:
cosine_overall_flat(pssms.iloc[0],pssms.iloc[0])

np.float64(0.9999999999999999)

In [None]:
cosine_overall_flat(pssms.iloc[0],pssms.iloc[1])

np.float64(0.6614783212500968)

## Export -

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()