# Analysis of chemo diversity on pharmacological and biological activities

The [dataset](results/activities_2022-01-29_16-33-05.csv) is obtained from the Scopus downloader version 2.0 from the [base set](data/activities.csv) of chemical compounds and activities.

**TODO** check:

- <https://stackoverflow.com/questions/53927460/select-rows-in-pandas-multiindex-dataframe>
- <https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#multiindex-query-syntax>


## Loading data from Scopus

### Boilerplate

Some Python configuration.


In [48]:
from pathlib import Path
from itertools import islice, product
from pprint import pprint

import pandas as pd
import numpy as np
from scipy import linalg

# import prince
import holoviews as hv
import seaborn as sns
import matplotlib.pyplot as plt


import biblio_extractor as bex

np.set_printoptions(precision=4, suppress=True)
pd.set_option("display.max_columns", 10)
pd.set_option("display.max_rows", 20)
pd.set_option("display.min_rows", 10)
# pd.set_option('display.expand_frame_repr', False)

DATASET_FILENAME = Path("results/activities_2022-01-29_16-33-05.csv")


### Loading the CSV file

Now, we load the dataset we'll use in this Notebook. It is generated using [our tool](biblio_extractor.py) in the same repository.
The dataset is a 106 \* 66 matrix, with rows and columns indexed by three levels (using Pandas's `MultiIndex`) as follows:

- level 0 is _class_ (chemical, biological or pharmacological class)
- level 1 is _name_ (compound or activity, a.k.a. keyword)
- level 2 is _kind_ (either _w/o_ for _without_ or _w/_ for _with_)

The matrix is indexed by _two_ disjoint finite sets of keywords:

- **rows**: a set of 53 (chemical) compounds = {acridine, triterpene, ...}
- **columns**: a set of 33 (biological, pharmacological) activities = {germination, cytotoxicity, ...}

An example is shown below, restricted to two compounds and two activities.
Note that this is **not** an extract of the whole dataset, the (w/o, w/o) cells being smaller here.

|                   |            |     | allelopathy | allelopathy | pharmaco     | pharmaco     |
| ----------------- | ---------- | --- | ----------- | ----------- | ------------ | ------------ |
|                   |            |     | germination | germination | cytotoxicity | cytotoxicity |
|                   |            |     | w/o         | w/          | w/o          | w/           |
| alkaloid          | acridine   | w/o | 2100        | 32          | 28           | 2104         |
| alkaloid          | acridine   | w/  | 1294        | 11          | 9            | 1296         |
| terpenoid/terpene | triterpene | w/o | 1283        | 11          | 9            | 1285         |
| terpenoid/terpene | triterpene | w/  | 2111        | 32          | 28           | 2115         |

Let $M$ be this matrix and, for each pair of keywords made of a compound $c_i$ and an activity $a_j$, let $M_{ij}$ denote the 2 by 2 block of $M$ at $(c_i, a_j)$, called the **ij confusion submatrix**.
Assume that $M_{ij}$ is of the form:

$$
\begin{bmatrix}
U & V\\
X & Y
\end{bmatrix}
$$

where:

- $U = (\text{w/o}, \text{w/o})$ is the number of papers that have **neither** the $c_i$ compound **nor** the $a_j$ activity;
- $V = (\text{w/o}, \text{w/})$ is the number of papers that have the $a_j$ activity but **not** the $c_i$ compound;
- $X = (\text{w/}, \text{w/o})$ is the number of papers that have the $c_i$ compound but **not** the $a_j$ activity;
- $Y = (\text{w/}, \text{w/})$ is the number of papers that have **both** the $c_i$ compound **and** the $a_j$ activity.

We avoid the open world hypothesis by restricting the analysis to the papers in the domain $D$,
which is the set of papers that have at least one compound and one activity.
By construction:

- $U + V$ and $X + Y$ are constant for each $c_i$ (whatever the choice of $a_j$): $X + Y$ is the total number of papers in $D$ with the $c_i$ compound, and $U + V$ the number without it;
- $U + X$ and $V + Y$ are constant for each $a_j$ (whatever the choice of $c_i$): $V + Y$ is the total number of papers in $D$ with the $a_j$ activity, and $U + X$ the number without it;
- each confusion submatrix $M_{ij}$ is such that $U + V + X + Y = |D|$, where $|D|$ is _the total number of papers_ under scrutiny.
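
As a quick check of the three properties above, the following sketch rebuilds the small running example by hand and verifies them on every 2 by 2 submatrix. The frame `toy` and the level names `class`, `name` and `kind` are illustrative assumptions; they are not read from the dataset file.

```python
import pandas as pd

# Hypothetical reconstruction of the running example (values copied from the table above).
kinds = ["w/o", "w/"]
rows = pd.MultiIndex.from_tuples(
    [(c, n, k) for c, n in [("alkaloid", "acridine"), ("terpenoid/terpene", "triterpene")] for k in kinds],
    names=["class", "name", "kind"],
)
cols = pd.MultiIndex.from_tuples(
    [(c, n, k) for c, n in [("allelopathy", "germination"), ("pharmaco", "cytotoxicity")] for k in kinds],
    names=["class", "name", "kind"],
)
toy = pd.DataFrame(
    [
        [2100, 32, 28, 2104],
        [1294, 11, 9, 1296],
        [1283, 11, 9, 1285],
        [2111, 32, 28, 2115],
    ],
    index=rows,
    columns=cols,
)

totals, with_compound, with_activity = set(), {}, {}
for compound in ["acridine", "triterpene"]:
    for activity in ["germination", "cytotoxicity"]:
        # 2 by 2 confusion submatrix [[U, V], [X, Y]] for this pair
        sub = toy.xs(compound, level="name").xs(activity, level="name", axis=1)
        U, V, X, Y = (int(v) for v in sub.to_numpy().ravel())
        totals.add(U + V + X + Y)
        with_compound.setdefault(compound, set()).add(X + Y)
        with_activity.setdefault(activity, set()).add(V + Y)

print(totals)         # a single value: the size of the toy domain
print(with_compound)  # one value per compound, whatever the activity
print(with_activity)  # one value per activity, whatever the compound
```

Each set holding exactly one value is what "constant" means in the list above.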


In [49]:
dataset = pd.read_csv(DATASET_FILENAME, index_col=[0, 1, 2], header=[0, 1, 2])
# dataset.index.names = ["class", "name", "with"]
# dataset.columns.names = ["class", "name", "with"]

all_compounds = set(dataset.index.get_level_values(1))
all_activities = set(dataset.columns.get_level_values(1))

dataset


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,abiotic,abiotic,abiotic,abiotic,abiotic,...,pharmaco,pharmaco,pharmaco,toxicity,toxicity
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,antioxidant,antioxidant,drought,drought,metal,...,cytotoxicity,sedative,sedative,toxicity,toxicity
Unnamed: 0_level_2,Unnamed: 1_level_2,Unnamed: 2_level_2,w/o,w/,w/o,w/,w/o,...,w/,w/o,w/,w/o,w/
alkaloid,acridine,w/o,179092,62176,240191,1077,216324,...,39229,238785,2483,215604,25664
alkaloid,acridine,w/,2430,266,2694,2,2439,...,1296,2688,8,2294,402
alkaloid,benzylamine,w/o,180754,62371,242046,1079,218089,...,40400,240640,2485,217161,25964
alkaloid,benzylamine,w/,768,71,839,0,674,...,125,833,6,737,102
alkaloid,colchicine,w/o,175968,62250,237143,1075,213101,...,39268,235761,2457,213018,25200
...,...,...,...,...,...,...,...,...,...,...,...,...,...
terpenoid/terpene,sesterterpene,w/,182,7,189,0,189,...,113,188,1,177,12
terpenoid/terpene,tetraterpene/carotenoid/xanthophyll,w/o,178534,54855,232655,734,208780,...,40054,230901,2488,208151,25238
terpenoid/terpene,tetraterpene/carotenoid/xanthophyll,w/,2988,7587,10230,345,9983,...,471,10572,3,9747,828
terpenoid/terpene,triterpene,w/o,177099,61285,237308,1076,213234,...,38416,235915,2469,212731,25653


We can extract the "old" 53\*33 matrix we used in the previous version of this software, where we had only the with/with queries.
In our small running example, that would be:

|                   |            | allelopathy | pharmaco     |
| ----------------- | ---------- | ----------- | ------------ |
|                   |            | germination | cytotoxicity |
| alkaloid          | acridine   | 11          | 1296         |
| terpenoid/terpene | triterpene | 32          | 2115         |
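
The extraction can be tried on the running example before applying it to the full dataset. The sketch below is only an illustration: `toy` is a hand-built frame (an assumption, not loaded from the CSV), and the boolean-mask variant is shown as one possible equivalent of the `xs` cross-section.

```python
import pandas as pd

# Hypothetical reconstruction of the 4 by 4 running example.
kinds = ["w/o", "w/"]
rows = pd.MultiIndex.from_tuples(
    [(c, n, k) for c, n in [("alkaloid", "acridine"), ("terpenoid/terpene", "triterpene")] for k in kinds]
)
cols = pd.MultiIndex.from_tuples(
    [(c, n, k) for c, n in [("allelopathy", "germination"), ("pharmaco", "cytotoxicity")] for k in kinds]
)
toy = pd.DataFrame(
    [
        [2100, 32, 28, 2104],
        [1294, 11, 9, 1296],
        [1283, 11, 9, 1285],
        [2111, 32, 28, 2115],
    ],
    index=rows,
    columns=cols,
)

# Cross-section: keep the "w/" kind on both axes; xs drops the selected level.
by_xs = toy.xs("w/", level=2).xs("w/", level=2, axis=1)

# Same selection with boolean masks on the last level, dropping it explicitly.
row_mask = toy.index.get_level_values(2) == "w/"
col_mask = toy.columns.get_level_values(2) == "w/"
by_mask = toy.loc[row_mask, col_mask].droplevel(2).droplevel(2, axis=1)

print(by_xs)
```

Both selections keep only the (w/, w/) cell of each confusion submatrix, which is why the result has half as many rows and columns.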


In [50]:
with_with_matrix = dataset.xs("w/", level=2).xs("w/", level=2, axis=1)
with_with_matrix


Unnamed: 0_level_0,Unnamed: 1_level_0,abiotic,abiotic,abiotic,abiotic,abiotic,...,pharmaco,pharmaco,pharmaco,pharmaco,toxicity
Unnamed: 0_level_1,Unnamed: 1_level_1,antioxidant,drought,metal,uv,salt,...,wound,anticancer,cytotoxicity,sedative,toxicity
alkaloid,acridine,266,2,257,80,163,...,77,147,1296,8,402
alkaloid,benzylamine,71,0,165,23,80,...,19,19,125,6,102
alkaloid,colchicine,192,4,84,20,187,...,222,171,1257,34,866
alkaloid,cyclopeptide,57,1,168,16,35,...,51,45,523,4,171
alkaloid,imidazole,1082,8,2507,302,1195,...,460,336,2816,486,1768
...,...,...,...,...,...,...,...,...,...,...,...,...
terpenoid/terpene,polyterpene,0,0,0,0,0,...,0,0,0,0,2
terpenoid/terpene,sesquiterpene,863,17,72,39,59,...,124,114,2037,30,341
terpenoid/terpene,sesterterpene,7,0,0,0,3,...,2,6,113,1,12
terpenoid/terpene,tetraterpene/carotenoid/xanthophyll,7587,345,592,392,513,...,97,51,471,3,828


We run the sanity checks explained earlier:

- the names (level 1 of rows/columns) are _unique_;
- the sum of each confusion submatrix is _constant_ and equals the total number of papers $|D|$, here 243 964.


In [51]:
# sanity check #1: each name appears once, with exactly two rows/columns (w/o and w/)
assert dataset.shape == (2 * len(all_compounds), 2 * len(all_activities))

# sanity check #2: summing over level 2 (kind) on both rows and columns produces a matrix of constants: the number of papers
# (the transposes avoid groupby(axis=1), which is deprecated in recent pandas)
submatrix_sum = dataset.groupby(level=1).sum().T.groupby(level=1).sum().T
number_of_papers = np.unique(submatrix_sum.values)
# a single value is expected, provided the Scopus collection did not evolve while querying
assert len(number_of_papers) == 1

number_of_papers = number_of_papers[0]
print(f"The domain contains {number_of_papers} papers")


The domain contains 243964 papers


Let's illustrate the content of this table. The **2 by 2 confusion submatrix** for _acridine_ and _cytotoxicity_ is as follows.


In [52]:
# NB: despite its name, this variable holds the acridine/cytotoxicity submatrix
acridine_antioxidant_submatrix = dataset.loc[("alkaloid", "acridine"), ("pharmaco", "cytotoxicity")]
print(f"Among {number_of_papers} papers, there are")
for i, j in product(bex.SELECTORS, bex.SELECTORS):
    print(f"{acridine_antioxidant_submatrix.loc[i,j]} papers {i} acridine and {j} cytotoxicity in their keywords")

print("The acridine and cytotoxicity confusion matrix is as follows")
acridine_antioxidant_submatrix


Among 243964 papers, there are
202039 papers w/o acridine and w/o cytotoxicity in their keywords
39229 papers w/o acridine and w/ cytotoxicity in their keywords
1400 papers w/ acridine and w/o cytotoxicity in their keywords
1296 papers w/ acridine and w/ cytotoxicity in their keywords
The acridine and cytotoxicity confusion matrix is as follows


Unnamed: 0,w/o,w/
w/o,202039,39229
w/,1400,1296


Unnamed: 0_level_0,Unnamed: 1_level_0,abiotic,abiotic,abiotic,abiotic,abiotic,allelopathy,allelopathy,allelopathy,allelopathy,allelopathy,...,pharmaco,pharmaco,pharmaco,pharmaco,pharmaco,pharmaco,pharmaco,pharmaco,pharmaco,toxicity
Unnamed: 0_level_1,Unnamed: 1_level_1,antioxidant,drought,metal,uv,salt,antifeedant,arbuscula,attractant,germination,herbicidal,...,antiparasitic,antiviral,anti-inflammatory,arthritis,burns,wound,anticancer,cytotoxicity,sedative,toxicity
alkaloid,acridine,266,2,257,80,163,1,0,0,11,1,...,26,117,62,26,16,77,147,1296,8,402
alkaloid,benzylamine,71,0,165,23,80,0,0,0,4,6,...,2,24,53,12,1,19,19,125,6,102
alkaloid,colchicine,192,4,84,20,187,0,0,0,47,0,...,16,108,878,1630,32,222,171,1257,34,866
alkaloid,cyclopeptide,57,1,168,16,35,0,0,0,18,1,...,9,92,89,685,2,51,45,523,4,171
alkaloid,imidazole,1082,8,2507,302,1195,3,0,1,53,9,...,92,1211,989,384,121,460,336,2816,486,1768
alkaloid,indole,1324,53,1482,283,717,4,0,13,177,6,...,55,859,1024,310,19,347,486,3683,79,1372
alkaloid,indolizidine,1,0,2,0,8,0,0,0,0,0,...,2,14,2,1,0,1,6,27,0,4
alkaloid,isoquinoline,138,2,149,20,142,2,0,0,6,3,...,15,322,99,59,5,27,84,633,18,185
alkaloid,isoxazole,127,2,78,20,137,0,0,0,4,13,...,51,233,462,509,2,51,77,364,27,273
alkaloid,muscarine,2,0,1,0,13,0,0,0,0,0,...,0,2,8,3,0,4,0,2,3,19


In [None]:
with_with_total = with_with_matrix.values.sum()
print(
    f"Total number of positive/positive occurrences is {with_with_total} for {number_of_papers} papers (average={with_with_total/number_of_papers})"
)


We compute the marginal sums on rows and columns.


In [None]:
margin_idx = (bex.CLASS_SYMB, bex.MARGIN_SYMB, bex.SELECTORS[1])
margin_cols = dataset.groupby(level=1).sum().drop_duplicates().reset_index(drop=True)
margin_cols.index = pd.MultiIndex.from_tuples([margin_idx])
margin_cols


In [None]:
# transpose, group and transpose back: groupby(axis=1) is deprecated in recent pandas
margin_rows = dataset.T.groupby(level=1).sum().T.iloc[:, 0]
margin_rows.name = margin_idx
margin_rows = pd.DataFrame(margin_rows)
margin_rows


In [None]:
# we may add those margins to the original dataset
dataset_margins = dataset.copy()
dataset_margins[margin_idx] = margin_rows
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
dataset_margins = pd.concat([dataset_margins, margin_cols]).fillna(number_of_papers).astype(int)


In [None]:
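# see https://en.wikipedia.org/wiki/Confusion_matrix
# /!\ what a jungle /!\


def fowlkes_mallows(arr):
    """Fowlkes-Mallows index: geometric mean of Y / (Y + V) and Y / (Y + X)."""
    arr = arr.reshape(2, 2)
    return np.sqrt(arr[1][1] / (arr[1][1] + arr[0][1]) * arr[1][1] / (arr[1][1] + arr[1][0]))


def acc(arr):
    """Accuracy: share of papers on the diagonal, (U + Y) / |D|."""
    arr = arr.reshape(2, 2)
    return (arr[1][1] + arr[0][0]) / (arr.sum())


def x_score(arr):
    """Diagonal minus off-diagonal counts over |D|: (U + Y - V - X) / |D|."""
    arr = arr.reshape(2, 2)
    return (arr[1][1] + arr[0][0] - arr[0][1] - arr[1][0]) / (arr.sum())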


print(fowlkes_mallows(acridine_antioxidant_submatrix.values))
print(acc(acridine_antioxidant_submatrix.values))
print(x_score(acridine_antioxidant_submatrix.values))
acridine_antioxidant_submatrix
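
To make the three scores concrete, here is a minimal standalone sketch that recomputes them from the acridine/cytotoxicity counts printed earlier. The names `sub`, `accuracy`, `x` and `fm` are local to this example. It also checks a relation that follows directly from $U + V + X + Y = |D|$: `x_score` equals `2 * acc - 1`, so the two scores rank the pairs identically.

```python
import numpy as np

# Counts printed earlier for acridine and cytotoxicity, laid out as [[U, V], [X, Y]].
sub = np.array([[202039, 39229], [1400, 1296]])
(U, V), (X, Y) = sub
total = sub.sum()

accuracy = (U + Y) / total
x = (U + Y - V - X) / total
fm = np.sqrt(Y / (Y + V) * Y / (Y + X))

print(f"acc={accuracy:.4f} x_score={x:.4f} fowlkes_mallows={fm:.4f}")

# x_score is an affine rescaling of accuracy because the four cells sum to |D|.
assert np.isclose(x, 2 * accuracy - 1)
```

Accuracy is driven mostly by the large $U$ cell, whereas the Fowlkes-Mallows index does not involve $U$ at all, which explains why the two values differ so much on this pair.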


In [None]:
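# reshape the values into a 4D array
C, A = len(all_compounds), len(all_activities)
print(C, A)
M = dataset.values.reshape((C, 2, A, 2))  # axes: (compound, row kind, activity, column kind)
M = np.moveaxis(M, 1, -2)  # axes: (compound, activity, row kind, column kind)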
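
The effect of the reshape followed by `moveaxis` is easier to see on the 4 by 4 running example. In the sketch below, `values` is a hand-typed copy of that example (an assumption for illustration), with two compounds and two activities.

```python
import numpy as np

# Running example: rows are (acridine w/o, acridine w/, triterpene w/o, triterpene w/),
# columns are (germination w/o, germination w/, cytotoxicity w/o, cytotoxicity w/).
values = np.array(
    [
        [2100, 32, 28, 2104],
        [1294, 11, 9, 1296],
        [1283, 11, 9, 1285],
        [2111, 32, 28, 2115],
    ]
)
C, A = 2, 2

M = values.reshape((C, 2, A, 2))  # (compound, row kind, activity, column kind)
M = np.moveaxis(M, 1, -2)         # (compound, activity, row kind, column kind)

# M[i, j] is now the 2 by 2 confusion submatrix of compound i and activity j.
print(M[0, 1])                # acridine / cytotoxicity
print(M.sum(axis=(2, 3)))     # total of each submatrix
```

This relies on the two kinds of each keyword being stored on adjacent rows and columns, w/o first, as in the dataset shown earlier.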


We obtain the same values as `submatrix_sum`.


In [None]:
# M.sum(axis=(2,3)) or similarly
np.sum(M, axis=(2, 3), keepdims=False)


In [None]:
# 1D is easier for apply_along_axis
M2 = M.reshape((C * A, 4))

print(M2[0])
print(M[0, 0])
print(acc(M[0, 0]))
print(x_score(M[0, 0]))
print(fowlkes_mallows(M[0, 0]))
print("--")


for fnct in [acc, x_score, fowlkes_mallows]:
    confused = np.apply_along_axis(fnct, 1, M2).reshape((C, A))

    print(f"{fnct.__name__}: min={confused.min()} max={confused.max()} mean={confused.mean()} std={confused.std()}")
# np.apply_over_axes(f, M2, axes=1)
# confused
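
As a possible alternative to `np.apply_along_axis`, the same scores can be computed for all pairs at once by indexing the last two axes of the 4D array. The sketch below is not the notebook's method, only a hedged illustration on the running example; `values` and the `vectorized` dictionary are names introduced here, and the three functions are redefined so the block runs on its own.

```python
import numpy as np


def acc(arr):
    arr = arr.reshape(2, 2)
    return (arr[1][1] + arr[0][0]) / arr.sum()


def x_score(arr):
    arr = arr.reshape(2, 2)
    return (arr[1][1] + arr[0][0] - arr[0][1] - arr[1][0]) / arr.sum()


def fowlkes_mallows(arr):
    arr = arr.reshape(2, 2)
    return np.sqrt(arr[1][1] / (arr[1][1] + arr[0][1]) * arr[1][1] / (arr[1][1] + arr[1][0]))


# Running example reshaped to (compound, activity, row kind, column kind).
values = np.array(
    [
        [2100, 32, 28, 2104],
        [1294, 11, 9, 1296],
        [1283, 11, 9, 1285],
        [2111, 32, 28, 2115],
    ]
)
M = np.moveaxis(values.reshape((2, 2, 2, 2)), 1, -2)

# One (compound, activity) array per cell of the confusion submatrices.
U, V = M[..., 0, 0], M[..., 0, 1]
X, Y = M[..., 1, 0], M[..., 1, 1]
total = M.sum(axis=(2, 3))

vectorized = {
    "acc": (U + Y) / total,
    "x_score": (U + Y - V - X) / total,
    "fowlkes_mallows": np.sqrt(Y / (Y + V) * Y / (Y + X)),
}

# Compare with the row-by-row computation used in the cell above.
M2 = M.reshape((-1, 4))
for fnct in (acc, x_score, fowlkes_mallows):
    looped = np.apply_along_axis(fnct, 1, M2).reshape((2, 2))
    assert np.allclose(looped, vectorized[fnct.__name__])
```

On the full dataset, pairs with no (w/, w/) paper would make the Fowlkes-Mallows denominator zero in either formulation, so the resulting values would need to be handled explicitly.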


In [None]:
confused_df = pd.DataFrame(100 * confused)
confused_df.index = with_with_matrix.index  #  set([(x,y) for x,y,_ in dataset.index.to_list()])
confused_df.columns = with_with_matrix.columns
# confused_df.values = confused
# confused_df


In [None]:
# TODO: colors by class

# prince is required here (its import is commented out in the boilerplate cell)
import prince

ca = prince.CA(n_components=2, n_iter=5, copy=True, check_input=True, engine="auto", random_state=42, benzecri=False)

for (df, name) in [(with_with_matrix, "Original"), (confused_df, "Confused")]:
    print(f"-------{name}-------")
    ca = ca.fit(df)

    # ca.row_coordinates(df)[:10]
    # ca.column_coordinates(df)[:10]

    pprint(ca.explained_inertia_)
    pprint(ca.col_masses_[:10])
    pprint(ca.eigenvalues_)
    pprint(ca.total_inertia_)

    ax = ca.plot_coordinates(
        X=df,  # plot the frame the model was just fitted on
        ax=None,
        figsize=(12, 12),
        x_component=0,
        y_component=1,
        show_row_labels=True,
        show_col_labels=True,
    )
    plt.show()
