### Analyzing the distribution of new train/test/val fingerprints

The goal of this notebook is to visualize the distribution of reactions across the training, testing, and validation sets. Ideally, every unique generalized reaction rule in JN1224MIN should have reactions in each of the three sets, and within each set the ratio of positive to negative reactions mapped to a rule should mirror that rule's ratio in our overall dataset.

To ensure this, we performed an 80/10/10 train/test/validation split on reactions in a stratified manner, at the level of individual reaction rules. This works for the most part, but we quickly run into two issues: (1) what to do when fewer than 10 reactions are mapped to a rule, and (2) what to do when many reactions are mapped to a rule but all of them are negative (e.g. rule0005, reverse monooxygenation reactions). In the first case there are too few reactions to populate all three sets reliably; in the second, a stratified split is not possible because only one of the two labels (positive or negative) is present.

We circumvent both issues as follows. First, we lump all rare reaction rules, i.e. those with fewer than 10 examples mapped to them, into a single rare-rules category. All reactions belonging to this category are then divided into 80/10/10 train/test/val sets in a stratified fashion. Second, for reaction rules that do have more than 10 examples mapped to them but comprise only a single class (i.e. only infeasible reactions), we drop the stratification parameter when using scikit-learn to perform the split, since stratification has no effect when only one class is present.
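
The lumping and splitting steps described above can be sketched as follows. This is a minimal illustration on toy data, not the code that produced the fingerprint files. It assumes scikit-learn's `train_test_split` is the splitting function, that the 80/10/10 split is done in two stages (80/20, then the 20% halved), that rules with exactly 10 examples count as non-rare, and that stratification is also skipped when a class has fewer than two members, since scikit-learn cannot stratify in that case. The names `assign_split_groups`, `stratify_labels`, `split_80_10_10`, and `rare_rules` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

MIN_EXAMPLES = 10          # rarity threshold quoted in the text
RARE_GROUP = "rare_rules"  # hypothetical name for the lumped category


def assign_split_groups(df, rule_col="Rule"):
    """Map each reaction to its own rule, or to RARE_GROUP if the rule is rare."""
    per_row_count = df[rule_col].map(df[rule_col].value_counts())
    return df[rule_col].where(per_row_count >= MIN_EXAMPLES, RARE_GROUP)


def stratify_labels(frame, label_col="Label"):
    """Return the labels to stratify on, or None when stratifying is not possible."""
    counts = frame[label_col].value_counts()
    # Assumption: skip stratification for single-class groups and for groups
    # where some class has fewer than two members.
    if len(counts) < 2 or counts.min() < 2:
        return None
    return frame[label_col]


def split_80_10_10(frame, seed=0):
    """Split one group of reactions into train, test, and validation frames."""
    train, rest = train_test_split(
        frame, test_size=0.2, random_state=seed, stratify=stratify_labels(frame)
    )
    test, val = train_test_split(
        rest, test_size=0.5, random_state=seed, stratify=stratify_labels(rest)
    )
    return train, test, val


# Toy data: one balanced rule, one all-negative rule, and two rare rules.
toy = pd.DataFrame({
    "Rule": ["ruleA"] * 100 + ["ruleB"] * 20 + ["ruleC"] * 6 + ["ruleD"] * 6,
    "Label": [1, 0] * 50 + [0] * 20 + [1, 0] * 6,
})

groups = assign_split_groups(toy)
parts = {"Train": [], "Test": [], "Validation": []}
for _, frame in toy.groupby(groups):
    for name, part in zip(parts, split_80_10_10(frame)):
        parts[name].append(part)

toy_train, toy_test, toy_val = (pd.concat(parts[name]) for name in parts)
```

With this toy input, `ruleC` and `ruleD` (6 reactions each) are split together as one rare group, while `ruleB` is split without stratification because it contains only negative reactions.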

In [1]:
import pandas as pd

import matplotlib.pyplot as plt
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['figure.dpi'] = 500 # Resolution of figures
plt.rcParams["figure.autolayout"] = True
plt.rcParams["legend.loc"] = 'best'
plt.rcParams['xtick.labelsize']=18
plt.rcParams['ytick.labelsize']=18

In [2]:
train_filepath = '/projects/p30041/YashChainani/ML_feasibility_build/data/fingerprinted_data/training_fingerprints/all_BKM_rxns_ecfp4_train_fingerprints_max_species_4_by_descending_MW'

In [3]:
test_filepath = '/projects/p30041/YashChainani/ML_feasibility_build/data/fingerprinted_data/testing_fingerprints/all_BKM_rxns_ecfp4_test_fingerprints_max_species_4_by_descending_MW'

In [4]:
val_filepath = '/projects/p30041/YashChainani/ML_feasibility_build/data/fingerprinted_data/validation_fingerprints/all_BKM_rxns_ecfp4_val_fingerprints_max_species_4_by_descending_MW'

In [None]:
train_df = pd.read_parquet(train_filepath)
test_df = pd.read_parquet(test_filepath)
val_df = pd.read_parquet(val_filepath)

In [None]:
# Add a 'set' column to each dataframe to indicate the source
train_df['set'] = 'Train'
test_df['set'] = 'Test'
val_df['set'] = 'Validation'

In [None]:
# Combine the dataframes
combined_df = pd.concat([train_df, test_df, val_df])

In [None]:
# Get the counts of each reaction rule by feasibility in each set
count_data = combined_df.groupby(['set', 'Rule', 'Label']).size().reset_index(name='count')

In [None]:
# Pivot the data for plotting
pivot_data_combined = count_data.pivot_table(index=['Rule', 'Label'], columns='set', values='count').fillna(0)
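
The two goals stated at the top of the notebook (every rule present in every set, and per-rule label ratios that mirror the overall ratio) can also be checked numerically rather than by eye. The sketch below runs on a toy stand-in for `combined_df`. It assumes the `Label` column is coded 1 for positive and 0 for negative reactions, which is not confirmed by the notebook, and `positive_fraction_by_set` is a hypothetical helper name.

```python
import pandas as pd


def positive_fraction_by_set(df, rule_col="Rule", label_col="Label", set_col="set"):
    """Fraction of positive reactions per rule (rows) and per set (columns).

    Assumes labels are coded 1 (positive) and 0 (negative). A NaN entry
    means the rule has no reactions in that set.
    """
    return df.groupby([rule_col, set_col])[label_col].mean().unstack(set_col)


# Toy stand-in for combined_df: rule0001 appears in all three sets,
# rule0002 is all-negative and missing from the validation set.
toy_df = pd.DataFrame({
    "Rule": ["rule0001"] * 8 + ["rule0002"] * 4,
    "Label": [1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    "set": ["Train"] * 4 + ["Test"] * 2 + ["Validation"] * 2 + ["Train"] * 3 + ["Test"],
})

fractions = positive_fraction_by_set(toy_df)

# Rules that are absent from at least one of the three sets
missing_rules = fractions.index[fractions.isna().any(axis=1)].tolist()

# Absolute gap between each set's positive fraction and the rule's overall one
overall = toy_df.groupby("Rule")["Label"].mean()
gap = fractions.sub(overall, axis=0).abs()
```

On the real data, the same function could be called on `combined_df`; a non-empty `missing_rules` or a large entry in `gap` would flag rules whose split deviates from the intended distribution.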

#### Visualizing the distribution of positive/negative reactions for the first five reaction rules within each of the three sets: train, test, val

In [None]:
pivot_data = pivot_data_combined.loc[['rule0001','rule0002','rule0003','rule0004','rule0005']]

In [None]:
# Plot
fig, ax = plt.subplots(figsize=(15, 7))

# Width of a bar 
width = 0.2       

# Set the position of the bars on the x-axis
indices = range(len(pivot_data.index))

# Plotting
train_bars = ax.bar(indices, pivot_data['Train'], width, label='Train')
test_bars = ax.bar([i + width for i in indices], pivot_data['Test'], width, label='Test')
val_bars = ax.bar([i + width*2 for i in indices], pivot_data['Validation'], width, label='Validation')

# Add xticks on the middle of the group bars
ax.set_xlabel('Reaction Rule and Feasibility')
ax.set_ylabel('Count')
ax.set_title('Distribution of Reaction Rules and Feasibility Across Train, Test, and Validation Sets')
ax.set_xticks([i + width for i in indices])
ax.set_xticklabels([f"{rule}_{feas}" for rule, feas in pivot_data.index], rotation=90)

# Create legend & Show graphic
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
pivot_data = pivot_data_combined.loc[['rule0006','rule0007','rule0008','rule0009','rule0012']]

In [None]:
# Plot
fig, ax = plt.subplots(figsize=(15, 7))

# Width of a bar 
width = 0.2       

# Set the position of the bars on the x-axis
indices = range(len(pivot_data.index))

# Plotting
train_bars = ax.bar(indices, pivot_data['Train'], width, label='Train')
test_bars = ax.bar([i + width for i in indices], pivot_data['Test'], width, label='Test')
val_bars = ax.bar([i + width*2 for i in indices], pivot_data['Validation'], width, label='Validation')

# Add xticks on the middle of the group bars
ax.set_xlabel('Reaction Rule and Feasibility')
ax.set_ylabel('Count')
ax.set_title('Distribution of Reaction Rules and Feasibility Across Train, Test, and Validation Sets')
ax.set_xticks([i + width for i in indices])
ax.set_xticklabels([f"{rule}_{feas}" for rule, feas in pivot_data.index], rotation=90)

# Create legend & Show graphic
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
pivot_data = pivot_data_combined.loc[['rule0013','rule0014','rule0015','rule0016','rule0017']]

In [None]:
# Plot
fig, ax = plt.subplots(figsize=(15, 7))

# Width of a bar 
width = 0.2       

# Set the position of the bars on the x-axis
indices = range(len(pivot_data.index))

# Plotting
train_bars = ax.bar(indices, pivot_data['Train'], width, label='Train')
test_bars = ax.bar([i + width for i in indices], pivot_data['Test'], width, label='Test')
val_bars = ax.bar([i + width*2 for i in indices], pivot_data['Validation'], width, label='Validation')

# Add xticks on the middle of the group bars
ax.set_xlabel('Reaction Rule and Feasibility')
ax.set_ylabel('Count')
ax.set_title('Distribution of Reaction Rules and Feasibility Across Train, Test, and Validation Sets')
ax.set_xticks([i + width for i in indices])
ax.set_xticklabels([f"{rule}_{feas}" for rule, feas in pivot_data.index], rotation=90)

# Create legend & Show graphic
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
pivot_data = pivot_data_combined.loc[['rule0018','rule0019','rule0020','rule0021','rule0022']]

In [None]:
# Plot
fig, ax = plt.subplots(figsize=(15, 7))

# Width of a bar 
width = 0.2       

# Set the position of the bars on the x-axis
indices = range(len(pivot_data.index))

# Plotting
train_bars = ax.bar(indices, pivot_data['Train'], width, label='Train')
test_bars = ax.bar([i + width for i in indices], pivot_data['Test'], width, label='Test')
val_bars = ax.bar([i + width*2 for i in indices], pivot_data['Validation'], width, label='Validation')

# Add xticks on the middle of the group bars
ax.set_xlabel('Reaction Rule and Feasibility')
ax.set_ylabel('Count')
ax.set_title('Distribution of Reaction Rules and Feasibility Across Train, Test, and Validation Sets')
ax.set_xticks([i + width for i in indices])
ax.set_xticklabels([f"{rule}_{feas}" for rule, feas in pivot_data.index], rotation=90)

# Create legend & Show graphic
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
pivot_data = pivot_data_combined.loc[['rule0023','rule0024','rule0025','rule0026','rule0027']]

In [None]:
# Plot
fig, ax = plt.subplots(figsize=(15, 7))

# Width of a bar 
width = 0.2       

# Set the position of the bars on the x-axis
indices = range(len(pivot_data.index))

# Plotting
train_bars = ax.bar(indices, pivot_data['Train'], width, label='Train')
test_bars = ax.bar([i + width for i in indices], pivot_data['Test'], width, label='Test')
val_bars = ax.bar([i + width*2 for i in indices], pivot_data['Validation'], width, label='Validation')

# Add xticks on the middle of the group bars
ax.set_xlabel('Reaction Rule and Feasibility')
ax.set_ylabel('Count')
ax.set_title('Distribution of Reaction Rules and Feasibility Across Train, Test, and Validation Sets')
ax.set_xticks([i + width for i in indices])
ax.set_xticklabels([f"{rule}_{feas}" for rule, feas in pivot_data.index], rotation=90)

# Create legend & Show graphic
ax.legend()
plt.tight_layout() 
plt.show()

In [None]:
pivot_data = pivot_data_combined.loc[['rule0028','rule0029','rule0030']]

In [None]:
# Plot
fig, ax = plt.subplots(figsize=(15, 7))

# Width of a bar 
width = 0.2       

# Set the position of the bars on the x-axis
indices = range(len(pivot_data.index))

# Plotting
train_bars = ax.bar(indices, pivot_data['Train'], width, label='Train')
test_bars = ax.bar([i + width for i in indices], pivot_data['Test'], width, label='Test')
val_bars = ax.bar([i + width*2 for i in indices], pivot_data['Validation'], width, label='Validation')

# Add xticks on the middle of the group bars
ax.set_xlabel('Reaction Rule and Feasibility')
ax.set_ylabel('Count')
ax.set_title('Distribution of Reaction Rules and Feasibility Across Train, Test, and Validation Sets')
ax.set_xticks([i + width for i in indices])
ax.set_xticklabels([f"{rule}_{feas}" for rule, feas in pivot_data.index], rotation=90)

# Create legend & Show graphic
ax.legend()
plt.tight_layout() 
plt.show()
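
The plotting cells above differ only in the list of rules they select, so they could be folded into a single helper. The sketch below is one possible refactoring under that assumption. `plot_rule_group` is a hypothetical name, and the toy table only mimics the shape of `pivot_data_combined` (a `(Rule, Label)` row index with one column per set); its counts are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

SETS = ("Train", "Test", "Validation")


def plot_rule_group(pivot, rules, width=0.2):
    """Grouped bar chart of per-set counts for the given reaction rules."""
    subset = pivot.loc[list(rules)]          # select rules on the first index level
    positions = range(len(subset.index))     # one group per (rule, label) pair

    fig, ax = plt.subplots(figsize=(15, 7))
    for offset, set_name in enumerate(SETS):
        ax.bar([p + offset * width for p in positions],
               subset[set_name], width, label=set_name)

    ax.set_xlabel('Reaction Rule and Feasibility')
    ax.set_ylabel('Count')
    ax.set_title('Distribution of Reaction Rules and Feasibility '
                 'Across Train, Test, and Validation Sets')
    # Centre each tick under the middle bar of its group
    ax.set_xticks([p + width for p in positions])
    ax.set_xticklabels([f"{rule}_{label}" for rule, label in subset.index],
                       rotation=90)
    ax.legend()
    fig.tight_layout()
    return fig, ax


# Toy table shaped like pivot_data_combined (illustrative counts only)
toy_pivot = pd.DataFrame(
    {"Train": [80, 40, 16], "Test": [10, 5, 2], "Validation": [10, 5, 2]},
    index=pd.MultiIndex.from_tuples(
        [("rule0001", 0), ("rule0001", 1), ("rule0002", 0)],
        names=["Rule", "Label"],
    ),
)

fig, ax = plot_rule_group(toy_pivot, ["rule0001", "rule0002"])
```

In the notebook, each group would then reduce to one call such as `plot_rule_group(pivot_data_combined, ['rule0001', 'rule0002', 'rule0003', 'rule0004', 'rule0005'])` followed by `plt.show()`.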