# Splitting Train and Test Sets based on clusters

The motivation behind this experiment is to make sure our predictive model is not biased toward the chemicals in the training set. To do this, we take a list of chemicals clustered by similarity, find their neighbors in the graph, calculate each cluster's weight relative to the full graph, and then use these weights to separate the training and test sets, so that each cluster of chemicals appears in only one of the sets.

### Create dataframe from csv file

In [1]:
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

In [19]:
clustered_chemicals = pd.read_csv("/home/raldisi/Desktop/Clustered_chemicals.csv")

### Splitting training and testing sets based on clusters

In [6]:
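# GroupShuffleSplit keeps every row of a cluster on the same side of the split;
# next() takes only the first of the generated splits
splitter = GroupShuffleSplit(test_size=.20, n_splits=2, random_state=7)
train_inds, test_inds = next(splitter.split(clustered_chemicals, groups=clustered_chemicals['Cluster']))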

train = clustered_chemicals.iloc[train_inds]
test = clustered_chemicals.iloc[test_inds]
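
As a quick sanity check on the cell above, the sketch below confirms that a group-based split never places the same cluster in both sets. The `toy` dataframe and its values are hypothetical stand-ins for `Clustered_chemicals.csv`; only the column names `PubchemID` and `Cluster` are taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data standing in for Clustered_chemicals.csv
toy = pd.DataFrame({
    "PubchemID": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "Cluster":   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})

splitter = GroupShuffleSplit(test_size=0.20, n_splits=1, random_state=7)
train_idx, test_idx = next(splitter.split(toy, groups=toy["Cluster"]))

# Collect the cluster labels on each side and look for any overlap
train_clusters = set(toy.iloc[train_idx]["Cluster"])
test_clusters = set(toy.iloc[test_idx]["Cluster"])
shared_clusters = train_clusters & test_clusters
print(shared_clusters)
```

The same intersection check can be run on `train` and `test` from the cell above; an empty set means no cluster leaked across the split.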

### Splitting training and testing sets based on weighted clusters

In [6]:
import pandas as pd
import pybel
import networkx as nx
from tqdm import tqdm_notebook as tqdm

In [4]:
full_graph = pybel.from_pickle("/home/raldisi/Desktop/full_graph.pickle")

In [5]:
clusters_dict = {i : clustered_chemicals['PubchemID'].loc[clustered_chemicals['Cluster'] == i].tolist()
                for i in range(1,clustered_chemicals.Cluster.nunique()+1)}

In this step, we build, for each cluster, a list of its chemicals and their neighbors in the full graph.

In [7]:
subgraphs_dict = {}
for cluster, chemicals in tqdm(clusters_dict.items()):
    chemicals_subgraph = []
    for chemical in chemicals:
        # format the chemical as a BEL node to match the nodes in the full graph
        chemical = pybel.dsl.Abundance(namespace='pubchem', name=str(chemical))
        # ignore chemicals not in the graph
        if chemical not in full_graph.nodes():
            continue
        chemicals_subgraph.append(chemical)
        for neighbor in full_graph.neighbors(chemical):
            chemicals_subgraph.append(neighbor)
    # ignore clusters with no chemicals in the graph
    # (this check must sit at the cluster level to actually skip the assignment)
    if not chemicals_subgraph:
        continue
    subgraphs_dict[cluster] = list(dict.fromkeys(chemicals_subgraph))  # remove duplicates, keep order

Now we calculate the weight of each cluster by counting the number of edges in its subgraph and dividing it by the number of edges in the full graph.

In [8]:
fullgraph_edges = len(full_graph.edges())
cluster_weights = {}
for cluster, nodes in tqdm(subgraphs_dict.items()):
    subgraph = full_graph.subgraph(nodes)
    edges = len(subgraph.edges())
    cluster_weights[cluster] = edges/fullgraph_edges
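
To make the weight definition concrete, here is a minimal sketch on a hypothetical toy graph. It assumes the PyBEL graph behaves like a networkx `MultiDiGraph`, where `neighbors` returns successors only and parallel edges are counted separately. The node names and edges are invented for illustration.

```python
import networkx as nx

# Hypothetical toy graph; the MultiDiGraph type is an assumption about the full graph
toy_graph = nx.MultiDiGraph()
toy_graph.add_edges_from([
    ("a", "b"), ("a", "b"),  # two parallel edges
    ("b", "c"),
    ("c", "d"),
    ("d", "a"),              # "d" points at "a", so it is a predecessor of "a"
])

cluster_members = ["a"]

# Cluster nodes plus their neighbors (successors in a directed graph)
nodes = set(cluster_members)
for member in cluster_members:
    nodes.update(toy_graph.neighbors(member))

# Weight = edges inside the induced subgraph / edges in the whole graph
subgraph_edges = toy_graph.subgraph(nodes).number_of_edges()
weight = subgraph_edges / toy_graph.number_of_edges()
print(sorted(nodes), subgraph_edges, weight)
```

Here the neighborhood of `a` is `{a, b}`, so only the two parallel `a -> b` edges fall inside the subgraph, giving a weight of 2/5. Node `d` is left out even though it has an edge to `a`, which is worth keeping in mind if incoming edges should also count toward a cluster's neighborhood.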

Next, we create a CSV file from the previous clusters file that contains each chemical, its cluster, and its cluster's weight.

In [20]:
clustered_chemicals['weight'] = None

In [24]:
for cluster, weight in cluster_weights.items():
    clustered_chemicals.loc[clustered_chemicals['Cluster'] == cluster, 'weight'] = weight
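
The loop above assigns weights one cluster at a time. As a possible alternative, the sketch below does the same lookup in one step with `Series.map`. The `toy` dataframe and `toy_weights` dictionary are hypothetical. Unlike the loop, this yields a float column in which clusters without a weight become `NaN` rather than `None`.

```python
import pandas as pd

# Hypothetical toy data; cluster 3 deliberately has no weight
toy = pd.DataFrame({"PubchemID": [101, 102, 103, 104], "Cluster": [1, 1, 2, 3]})
toy_weights = {1: 0.25, 2: 0.5}

# Look up each row's cluster in the dictionary; missing clusters map to NaN
toy["weight"] = toy["Cluster"].map(toy_weights)
print(toy)
```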

In [26]:
clustered_chemicals.to_csv('/home/raldisi/SE_KGE/resources/clusters_weights.csv')

In [None]:
# next steps:
## split sets based on weights or clusters
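
One way the weight-based split could look is sketched below. This is not the notebook's method, only an illustration under stated assumptions: the function `split_clusters_by_weight`, its parameters, and the toy weights are all hypothetical, and the greedy strategy (shuffle the clusters, then add whole clusters to the test set while the test share of the total weight stays within the target) is just one of several reasonable choices.

```python
import random

def split_clusters_by_weight(cluster_weights, test_fraction=0.2, seed=7):
    """Hypothetical helper: assign whole clusters to train or test by weight."""
    clusters = sorted(cluster_weights)
    random.Random(seed).shuffle(clusters)  # reproducible random order

    target = test_fraction * sum(cluster_weights.values())
    test_clusters, test_weight = set(), 0.0
    for cluster in clusters:
        weight = cluster_weights[cluster]
        # Add the cluster to the test set only if it does not exceed the target
        if test_weight + weight <= target:
            test_clusters.add(cluster)
            test_weight += weight

    train_clusters = set(clusters) - test_clusters
    return train_clusters, test_clusters

# Hypothetical weights for five clusters
toy_weights = {1: 0.4, 2: 0.1, 3: 0.1, 4: 0.2, 5: 0.2}
train_clusters, test_clusters = split_clusters_by_weight(toy_weights)
print(train_clusters, test_clusters)
```

Rows could then be selected with `clustered_chemicals['Cluster'].isin(test_clusters)`, which keeps every chemical of a cluster in a single set.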