# Exploratory Data Analysis

![histogram.png](../assets/histogram.png "histogram")

## Summary

#### How many different molecules are represented in the dataset?
In the training dataset of 9768 entries, 4084 unique molecule strings are present. 
 
#### What are the relative proportions of pIC50 and pKi measurements?

There are about 4 times as many pIC50 measurements as pKi measurements. Our objective is to estimate pKi values, so leveraging the 'extra' samples that only include the related pIC50 values will be important.

#### What are typical values for pKi and pIC50? 

Values fall within a similar range of about 6 to 12, but pKi values tend to be a little higher and are less spread out. Neither distribution is close to normal.

#### How many measurements of which type are present for each enzyme? 

pKi measurements are rare in this dataset. While there are up to 2900 measurements of pIC50 for a single enzyme (JAK1), pKi measurements range from 143 (TYK2) to 608 (JAK1) samples. 

![measurements](../assets/measurement_frequency.png "samples by enzyme and measurement type")


## EDA notes 

The training dataset contains 9,768 entries but only 4,084 unique molecules. Most measurements are pIC50, and the enzymes appear in decreasing frequency: JAK1/JAK2, then JAK3, then TYK2. Only a small number of samples have pKi measurements, and the classes are imbalanced across enzymes.

To take advantage of the most plentiful data offered in the dataset, the SMILES sequences, my initial approach is to train an autoregressive transformer model on the SMILES strings. The next step is to use the pre-trained transformer as a feature extractor to train an ensemble of multilayer perceptrons (MLPs). 

The MLPs are trained as a cohort in bootstrap fashion, _i.e._ they are randomly exposed to different minibatches during training. Each MLP outputs a prediction for every combination of measurement type and enzyme (2 types × 4 enzymes = 8 values), and losses are calculated with masks so that only the measurements actually present for each molecule contribute. Compared to training separate models for each measurement type and/or enzyme, this should allow the models to learn more general ways to process the features encoded by the transformer model.
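
The masked loss described above can be sketched roughly as follows. This is a minimal illustration and not the training code itself: the function name `masked_mse`, the choice of mean squared error, the column ordering, and the toy values are all assumptions made for the example.

```python
import numpy as np

def masked_mse(predictions, targets, mask):
    """Mean squared error over only the entries where mask == 1.

    All three arrays have shape (batch, 8). Missing targets should hold a
    finite placeholder (e.g. 0.0), because NaN * 0 is still NaN.
    """
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    mask = np.asarray(mask, dtype=float)
    squared_error = mask * (predictions - targets) ** 2
    # Guard against a batch that happens to contain no measurements
    return squared_error.sum() / max(mask.sum(), 1.0)

# Assumed column order: 2 measurement types x 4 enzymes = 8 outputs
columns = [f"{m}:{e}" for m in ("pIC50", "pKi")
           for e in ("JAK1", "JAK2", "JAK3", "TYK2")]

# Toy batch of two molecules; every output is predicted as 7.0
predictions = np.full((2, 8), 7.0)
targets = np.zeros((2, 8))
mask = np.zeros((2, 8))

# Molecule 0 has only a pIC50:JAK1 measurement
targets[0, columns.index("pIC50:JAK1")] = 8.0
mask[0, columns.index("pIC50:JAK1")] = 1.0

# Molecule 1 has only a pKi:TYK2 measurement
targets[1, columns.index("pKi:TYK2")] = 9.0
mask[1, columns.index("pKi:TYK2")] = 1.0

loss = masked_mse(predictions, targets, mask)
print(loss)  # (1.0 + 4.0) / 2 = 2.5
```

Only the two measured entries contribute to the loss; the other fourteen outputs are ignored, so a molecule with a single measurement can still be used to update the shared model.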

At inference time, the entire ensemble is queried, and the statistics of its predictions (for example, the spread across members) give a hint of prediction uncertainty. I consider uncertainty estimates important when samples are few and imbalanced. Another way to estimate uncertainty is to use the loss of an autoencoder as a proxy for novelty, with higher autoencoder losses corresponding to more rarely seen data. I did not use this method here (or its close cousin, random network distillation), but training an autoencoder on the transformer-encoded features would be worth investigating in future work.
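
As a rough sketch of how ensemble statistics can serve as an uncertainty hint, the snippet below summarizes the predictions of several ensemble members with a per-output mean and sample standard deviation. The function name `ensemble_summary` and the toy predictions are hypothetical, and the standard deviation is just one possible choice of spread statistic.

```python
import statistics

def ensemble_summary(member_predictions):
    """Return per-output mean and sample standard deviation.

    member_predictions: one list of output values per ensemble member,
    all of the same length. Needs at least two members for stdev.
    """
    # Transpose so each tuple holds every member's value for one output
    per_output = list(zip(*member_predictions))
    means = [statistics.fmean(values) for values in per_output]
    spreads = [statistics.stdev(values) for values in per_output]
    return means, spreads

# Toy example: 3 ensemble members, 2 outputs each (values are made up)
member_predictions = [
    [7.0, 8.0],
    [7.5, 8.0],
    [8.0, 8.0],
]

means, spreads = ensemble_summary(member_predictions)
print(means)    # [7.5, 8.0]
print(spreads)  # [0.5, 0.0]
```

Here the members disagree on the first output and agree on the second, so the first prediction would be treated as the less certain of the two.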


In [None]:
import numpy.random as npr
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt
my_cmap = plt.get_cmap("viridis")

from importlib import reload



In [None]:
df = pd.read_csv("../data/train_JAK.csv")
df.head(10)

In [None]:

print("\n Whole dataset statistics \n", df.describe())
print("\n pIC50 statistics \n", df.loc[df['measurement_type'] == "pIC50"].describe())
print("\n pKi statistics \n", df.loc[df['measurement_type'] == "pKi"].describe())

In [None]:
unique_smiles = df["SMILES"].unique()

number_smiles = len(unique_smiles)

print(f"Dataset contains {number_smiles} unique molecules")

labels = []
sample_counts = []
for measurement in ["pIC50", "pKi"]:
    
    m_df = df.loc[df["measurement_type"] == measurement]
    
    for enzyme in ["JAK1", "JAK2", "JAK3", "TYK2"]:
        me_df = m_df.loc[m_df["Kinase_name"] == enzyme]
        
        message = f"\ndataset includes {len(me_df)} {measurement} measurements of {enzyme}"
        print(message)
        
        labels.append(f"{measurement}:{enzyme}")
        sample_counts.append(len(me_df))
        
fig, ax = plt.subplots(1,1, figsize=(6,5))
ax.bar(labels, sample_counts, color=my_cmap(20))

ax.set_ylabel("number of samples")
#ax.set_xticks([ii*2 for ii, elem in enumerate(labels)])
#ax.set_xticklabels(labels)
plt.xticks(rotation = 20)
#plt.tight_layout()
plt.title("Measurement frequency by type and enzyme")
plt.savefig("../assets/measurement_frequency.png")
plt.show()

In [None]:
# Extract the measurement values for each measurement type as NumPy arrays
pic50 = df.loc[df["measurement_type"] == "pIC50", "measurement_value"].to_numpy()
pki = df.loc[df["measurement_type"] == "pKi", "measurement_value"].to_numpy()

In [None]:

# Overlay both distributions on twin y-axes, since pIC50 samples far outnumber pKi samples
fig, ax = plt.subplots(1, 1)
ax.hist(pic50, bins=32, label="pIC50", color=my_cmap(64), alpha=0.25)
ax.set_xlabel("measurement value")
ax.set_ylabel("pIC50 count")
ax2 = ax.twinx()

ax2.hist(pki, bins=32, label="pKi", color=my_cmap(192), alpha=0.25)

ax2.set_ylabel("pKi count")
ax.set_title("pKi and pIC50")
ax2.legend(loc=1)
ax.legend(loc=2)
plt.tight_layout()
plt.savefig("../assets/histogram.png")
plt.show()