## Update

The host has since explained `bpps` here:

https://www.kaggle.com/c/stanford-covid-vaccine/discussion/182021#1006800

## About

The provided dataset contains a folder named `bpps`, but there is little explanation of the folder or its contents. After [a simple search](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2018-4), I found that `bpps` stands for `Base Pairing Probabilities`.
Each matrix in the `bpps` folder is a Base Pairing Probability Matrix (`BPPM`), which can be treated as an adjacency matrix of the RNA sequence. I'm not 100% sure, but as I understand it, the matrix describes the structure of the RNA.
We also have a `structure` column in `train.json` and `test.json`. The structure in the JSON files and the `BPPM` are strongly connected. Let's have a look.

If you'd like to know more about BPPs, check out [this](https://onlinelibrary.wiley.com/doi/pdf/10.1002/bip.360290621?casa_token=5__Sglto484AAAAA%3AXSJ0MfHd0atxB5PYqMyDsJvqIE79vTeneakVoku__oJZFP-wTki5QvoRWp1tjOpYgtkccjtfE1MKzQ).
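To make the "probability matrix" reading concrete, here is a minimal sketch on a hand-made 4-base matrix (the values are hypothetical, not taken from the dataset). If the entries really are pairing probabilities, I would expect the matrix to be symmetric, and each row sum to be the probability that the base is paired with anything at all.

```python
import numpy as np

# Hypothetical toy BPPM for a 4-base sequence (made-up values).
toy_bppm = np.array([
    [0.0, 0.0, 0.1, 0.8],
    [0.0, 0.0, 0.7, 0.1],
    [0.1, 0.7, 0.0, 0.0],
    [0.8, 0.1, 0.0, 0.0],
])

# P(i pairs with j) should equal P(j pairs with i).
is_symmetric = np.allclose(toy_bppm, toy_bppm.T)

# Row sum = probability that base i is paired with any other base.
pair_prob = toy_bppm.sum(axis=1)

# The remainder is the probability that base i stays unpaired.
unpaired_prob = 1.0 - pair_prob
```

The same three lines can be run on a real matrix loaded from the `bpps` folder to check whether these assumptions hold there.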

## Libraries

In [None]:
import graphviz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from pathlib import Path

## Data Loading

In [None]:
DATA_DIR = Path("../input/stanford-covid-vaccine/")
BPPS_DIR = DATA_DIR / "bpps"

train = pd.read_json(DATA_DIR / "train.json", lines=True)
test = pd.read_json(DATA_DIR / "test.json", lines=True)

bppm_paths = list(BPPS_DIR.glob("*.npy"))

In [None]:
len(train) + len(test) == len(bppm_paths)

Each `id` corresponds to one `.npy` file in the `bpps` folder.
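Comparing counts only shows that the totals match. A slightly stricter check compares the ids against the file stems directly. The helper below, `find_missing_ids`, is a hypothetical name introduced for this sketch, and the ids in the example are made up.

```python
from pathlib import Path


def find_missing_ids(ids, paths):
    """Return the ids that have no matching `<id>.npy` file, sorted."""
    stems = {Path(p).stem for p in paths}
    return sorted(set(ids) - stems)


# Toy example with made-up ids and paths.
toy_ids = ["id_aaa", "id_bbb", "id_ccc"]
toy_paths = [Path("bpps/id_aaa.npy"), Path("bpps/id_ccc.npy")]
missing = find_missing_ids(toy_ids, toy_paths)
```

In this notebook, calling it with the combined `id` values of `train` and `test` and with `bppm_paths` should return an empty list if every `id` has its own file.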

## Compare BPPM and structure

In [None]:
def get_bppm(id_):
    return np.load(BPPS_DIR / f"{id_}.npy")


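def draw_structure(structure: str):
    # Build a symmetric 0/1 pairing matrix from dot-bracket notation.
    pm = np.zeros((len(structure), len(structure)))
    start_token_indices = []  # stack of unmatched "(" positions
    for i, token in enumerate(structure):
        if token == "(":
            start_token_indices.append(i)
        elif token == ")":
            # ")" pairs with the most recent unmatched "(".
            j = start_token_indices.pop()
            pm[i, j] = 1.0
            pm[j, i] = 1.0
    return pm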


def plot_structures(bppm: np.ndarray, pm: np.ndarray):
    fig, axes = plt.subplots(1, 2, figsize=(10, 10))
    axes[0].imshow(bppm)
    axes[0].set_title("BPPM")
    axes[1].imshow(pm)
    axes[1].set_title("structure")
    plt.show()

In [None]:
idx = 0
sample = train.loc[idx]

bppm = get_bppm(sample.id)
pm = draw_structure(sample.structure)
plot_structures(bppm, pm)

In [None]:
idx = 1
sample = train.loc[idx]

bppm = get_bppm(sample.id)
pm = draw_structure(sample.structure)
plot_structures(bppm, pm)

In [None]:
idx = 2
sample = train.loc[idx]

bppm = get_bppm(sample.id)
pm = draw_structure(sample.structure)
plot_structures(bppm, pm)

In [None]:
idx = 3
sample = train.loc[idx]

bppm = get_bppm(sample.id)
pm = draw_structure(sample.structure)
plot_structures(bppm, pm)

In [None]:
idx = 4
sample = train.loc[idx]

bppm = get_bppm(sample.id)
pm = draw_structure(sample.structure)
plot_structures(bppm, pm)

In [None]:
idx = 5
sample = train.loc[idx]

bppm = get_bppm(sample.id)
pm = draw_structure(sample.structure)
plot_structures(bppm, pm)

The two matrices are similar, and some are almost identical. In some cases, however, the BPPM is a bit blurred, which may be related to the `signal_to_noise` or `SN_filter` values.
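One way to put a number on "similar" versus "blurred" is to average the BPPM probability over the pairs listed in `structure`. This is only a sketch of one possible score: `pairs_from_structure` and `mean_pair_probability` are hypothetical helpers introduced here, and the toy matrix is made up.

```python
import numpy as np


def pairs_from_structure(structure):
    """Return (open_index, close_index) pairs from dot-bracket notation."""
    stack, pairs = [], []
    for i, token in enumerate(structure):
        if token == "(":
            stack.append(i)
        elif token == ")":
            pairs.append((stack.pop(), i))
    return pairs


def mean_pair_probability(bppm, structure):
    """Average BPPM probability over the pairs present in `structure`."""
    pairs = pairs_from_structure(structure)
    if not pairs:
        return float("nan")  # no paired bases, so the score is undefined
    rows, cols = zip(*pairs)
    return float(bppm[list(rows), list(cols)].mean())


# Toy example: two pairs, one confident (0.9) and one blurred (0.5).
toy_structure = "((..))"
toy_bppm = np.zeros((6, 6))
toy_bppm[0, 5] = toy_bppm[5, 0] = 0.9
toy_bppm[1, 4] = toy_bppm[4, 1] = 0.5
score = mean_pair_probability(toy_bppm, toy_structure)
```

A score near 1 means the BPPM agrees sharply with `structure`, while a lower score means the probability mass is spread out. Computing this per row of `train` and comparing it with `signal_to_noise` would be one way to test the guess above, though I have not verified that relationship.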

## Visualize graph structure

As noted above, this matrix can be treated as a graph structure. Let's visualize it as a graph.

In [None]:
def visualize_graph(bppm: np.ndarray, sequence: str, threshold=0.1):
    # Keep only base pairs whose probability exceeds the threshold.
    indices = np.where(bppm > threshold)
    edges = list(zip(indices[0], indices[1], bppm[indices]))

    g = graphviz.Graph(format="png")
    for from_, to, coef in edges:
        # Use only the lower triangle so each pair is drawn once.
        if from_ > to:
            # Node names combine the base letter and its position, e.g. "G(0)".
            g.edge(sequence[from_] + f"({from_})",
                   sequence[to] + f"({to})",
                   label=f"{coef:.2f}",
                   penwidth=f"{int(max(1, abs(coef * 20)))}")
    g.render("./graph")
    return g
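If only the edges are needed, for example as input to a graph model, the thresholding step can be sketched without graphviz. `bppm_to_edges` is a hypothetical helper, the toy matrix is made up, and the sketch assumes the matrix is symmetric so that the upper triangle carries all the information.

```python
import numpy as np


def bppm_to_edges(bppm, threshold=0.1):
    """Return (i, j, probability) for each pair i < j above the threshold."""
    upper = np.triu(bppm, k=1)  # drop the diagonal and the mirrored half
    rows, cols = np.nonzero(upper > threshold)
    return [(int(i), int(j), float(upper[i, j])) for i, j in zip(rows, cols)]


# Hypothetical toy BPPM for a 4-base sequence (made-up values).
toy_bppm = np.array([
    [0.0, 0.0, 0.1, 0.8],
    [0.0, 0.0, 0.7, 0.1],
    [0.1, 0.7, 0.0, 0.0],
    [0.8, 0.1, 0.0, 0.0],
])
edges = bppm_to_edges(toy_bppm, threshold=0.5)
```

Raising `threshold` keeps only the confident pairs, while lowering it also keeps the weak alternatives that make some graphs look cluttered.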

In [None]:
idx = 0
sample = train.loc[idx]

bppm = get_bppm(sample.id)
visualize_graph(bppm, sample.sequence, threshold=0.05)

In [None]:
idx = 1
sample = train.loc[idx]

bppm = get_bppm(sample.id)
visualize_graph(bppm, sample.sequence)

In [None]:
idx = 2
sample = train.loc[idx]

bppm = get_bppm(sample.id)
visualize_graph(bppm, sample.sequence)

In [None]:
idx = 3
sample = train.loc[idx]

bppm = get_bppm(sample.id)
visualize_graph(bppm, sample.sequence)

In [None]:
idx = 4
sample = train.loc[idx]

bppm = get_bppm(sample.id)
visualize_graph(bppm, sample.sequence)

In [None]:
idx = 5
sample = train.loc[idx]

bppm = get_bppm(sample.id)
visualize_graph(bppm, sample.sequence)

## EOF