# Dataset Compilation

#### Introduction 

Use this notebook to learn the required node embeddings from a graph network, label the positive genes, and compile the final dataset(s).

<u>Define positive gene:</u>

Input:	Genetic association data, predefined score threshold

Process:	Remove the gene IDs whose score does not exceed the threshold; the remaining gene IDs are the positive genes for the disease

Quality control:	Manually filter the data according to the threshold. The number of remaining records should match the number of records in the output csv

Output:	Positive gene list, in csv format

Remark:	Users can modify the threshold for other needs, or use other disease association data for positive labelling

<u>Learn embedding: </u>

Input:	Largest subgraph of the PPI network, grid of hyperparameters for node embedding

Process:	Use node2vec to learn embeddings according to the parameter grid

Quality control:
-	Number of columns should match the dimension specified
-	File names should match the parameters specified
-	Number of rows should match the total number of nodes in the input subgraph

Output:
-	Tabular embedding file in emb format, one for each parameter setting
-	Binary files of the node2vec models, in pickle format

<u>Dataset compilation:</u>

Input:	Positive gene list, embedding file

Process:	For each embedding file, parse the embedding data and label each gene with a class label {1: positive, 0: unlabelled}

Quality control:
-	9 datasets should be generated, each with the correct number of columns. All should have the same number of rows
-	Manually check that the sum of the y labels matches the number of positive genes

Output:	For each embedding file, generate a dataset, in csv format

Remarks:	Users can define their own parameter grid


#### Import Libraries

In [1]:
import os
import pandas as pd
import pickle
from node2vec import Node2Vec

#### Define helper functions

In [3]:
# define helper function to parse the embedding output of node2vec
def load_embedding(file_path_and_name):
    '''
    Read an embedding file exported from node2vec and return it as a pandas dataframe.
    All values, including the features, are kept as strings as read from the file.
    parameter:
        file_path_and_name: str, the embedding file including the path
    return:
        df: pandas dataframe object with the columns 'id', 'feature_1', ..., 'feature_n'
    '''
    with open(file_path_and_name, 'r') as file:
        lines = file.readlines()

    # the first line holds the number of rows and the number of features
    num_rows, num_features = map(int, lines[0].split())

    data = []

    # each remaining line holds a node id followed by its feature values
    for line in lines[1:]:
        values = line.strip().split()
        data.append(values)

    df = pd.DataFrame(data)

    column_names = ['id'] + ['feature_' + str(i) for i in range(1, num_features+1)]
    df.columns = column_names

    return df

# define helper function to label the positive data
def positive_labelling(data, positive_ids):
    '''
    Label a row as positive when its 'id' appears in positive_ids: 1 for the positive class, 0 otherwise.
    The label column is 'y'.

    parameters:
        data: pandas dataframe, the target dataset
        positive_ids: list of str, the positive ids
    return:
        data: pandas dataframe, a copy of the input with an added column 'y' as the class label
    '''
    data = data.copy()
    def set_y_value(row):
        if row['id'] in positive_ids:
            return 1
        else:
            return 0

    data['y'] = data.apply(set_y_value, axis=1)

    return data
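
For larger tables, the row-wise `apply` in `positive_labelling` can be slow. The sketch below shows one possible vectorised alternative based on pandas `Series.isin`. The function name `positive_labelling_vectorised` and the toy gene IDs are hypothetical and not part of the original notebook; the sketch assumes `positive_ids` is a list of IDs, as it is when built later in this notebook.

```python
import pandas as pd

def positive_labelling_vectorised(data, positive_ids):
    """Hypothetical alternative: label rows whose 'id' is in positive_ids with 1, else 0."""
    labelled = data.copy()  # leave the caller's dataframe untouched
    # isin() tests all ids in one vectorised call and returns a boolean Series
    labelled['y'] = labelled['id'].isin(list(positive_ids)).astype(int)
    return labelled

# toy data with made-up ids, for illustration only
toy_embedding = pd.DataFrame({
    'id': ['GENE_A', 'GENE_B', 'GENE_C'],
    'feature_1': ['0.1', '0.2', '0.3'],
})
toy_labelled = positive_labelling_vectorised(toy_embedding, ['GENE_A', 'GENE_C'])
print(toy_labelled['y'].tolist())
```

Note that passing a dataframe instead of a list to either version would not work as intended, because `in` on a dataframe tests the column labels rather than the cell values.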

#### Load required data files

In [2]:
# load the largest PPI subgraph

ppi_subgraph_path = os.path.join('data', 'ppi', 'largest_subgraph.gpickle')

with open(ppi_subgraph_path, 'rb') as f:
    largest_subgraph = pickle.load(f)

#### Set Threshold of Disease Gene Association Score for Positive Labelling

In [4]:
# load the disease genetic association list
dise_asso_score = pd.read_csv(os.path.join('open_targets_data', 'RAGeneticAssociationAll.csv'))

In [3]:
# preview the list
dise_asso_score

Unnamed: 0,target_id,target_approved_symbol,disease_id,disease_name,genetic_association_score_Indirect_and_direct
0,ENSG00000164512,ANKRD55,EFO_0000685,rheumatoid arthritis,0.885188
1,ENSG00000175354,PTPN2,EFO_0000685,rheumatoid arthritis,0.846631
2,ENSG00000172575,RASGRP1,EFO_0000685,rheumatoid arthritis,0.829440
3,ENSG00000150347,ARID5B,EFO_0000685,rheumatoid arthritis,0.792187
4,ENSG00000198369,SPRED2,EFO_0000685,rheumatoid arthritis,0.789214
...,...,...,...,...,...
880,ENSG00000131435,PDLIM4,EFO_0000685,rheumatoid arthritis,0.031211
881,ENSG00000156831,NSMCE2,EFO_0000685,rheumatoid arthritis,0.031077
882,ENSG00000197162,ZNF785,EFO_0000685,rheumatoid arthritis,0.031049
883,ENSG00000151135,TMEM263,EFO_0000685,rheumatoid arthritis,0.030665


In [5]:
# set the gene association threshold; it will be used for labelling the positive genes in the dataset
SCORE_TH = 0.2

In [6]:
dise_asso_score_TH = dise_asso_score.loc[dise_asso_score['genetic_association_score_Indirect_and_direct'] > SCORE_TH]
dise_asso_score_TH.to_csv(os.path.join('data', 'others', 'dise_asso_score_TH.csv'))

In [7]:
dise_asso_score_TH

Unnamed: 0,target_id,target_approved_symbol,disease_id,disease_name,genetic_association_score_Indirect_and_direct
0,ENSG00000164512,ANKRD55,EFO_0000685,rheumatoid arthritis,0.885188
1,ENSG00000175354,PTPN2,EFO_0000685,rheumatoid arthritis,0.846631
2,ENSG00000172575,RASGRP1,EFO_0000685,rheumatoid arthritis,0.829440
3,ENSG00000150347,ARID5B,EFO_0000685,rheumatoid arthritis,0.792187
4,ENSG00000198369,SPRED2,EFO_0000685,rheumatoid arthritis,0.789214
...,...,...,...,...,...
398,ENSG00000111785,RIC8B,EFO_0000685,rheumatoid arthritis,0.206936
399,ENSG00000135116,HRK,EFO_0000685,rheumatoid arthritis,0.206170
400,ENSG00000138311,ZNF365,EFO_0000685,rheumatoid arthritis,0.205471
401,ENSG00000119522,DENND1A,EFO_0000685,rheumatoid arthritis,0.204873
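
The quality-control step for the positive gene list (count the records that pass the threshold independently and compare with the filtered table) can be sketched on a toy table. The gene IDs, the scores, and the short column name `score` are made up for illustration; only the threshold value and the strict `>` comparison follow the notebook.

```python
import pandas as pd

# toy association table; ids and scores are invented for this sketch
toy_scores = pd.DataFrame({
    'target_id': ['GENE_A', 'GENE_B', 'GENE_C', 'GENE_D'],
    'score': [0.85, 0.20, 0.45, 0.03],
})

SCORE_TH = 0.2  # same threshold value as in the notebook

# keep only the genes scoring strictly above the threshold
toy_positive = toy_scores.loc[toy_scores['score'] > SCORE_TH]

# quality control: count the passing records without pandas filtering and compare
expected_count = sum(1 for s in toy_scores['score'].tolist() if s > SCORE_TH)
assert len(toy_positive) == expected_count

toy_positive_ids = toy_positive['target_id'].tolist()
print(toy_positive_ids)
```

Because the comparison is strict, a gene scoring exactly at the threshold (`GENE_B` in this toy table) is not treated as positive.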


#### Set File Paths for Embeddings and Datasets

In [8]:
temp_folder = os.path.join('tempfolder')
datasets_path = os.path.join('data', 'datasets')
embeddings_path = os.path.join('data', 'embeddings')

#### Learn Embeddings from Graph

In [9]:
# learn embeddings with node2vec
# caution: each embedding takes a long time to run
# budget time: 1 embedding model takes more than 320 minutes to finish initialisation and fitting
# the params grid below creates 9 embedding models

# Define the parameter grid
params = {
    'p': [4],
    'q': [1],
    'weight_key': ['scoring'],
    'dimensions': [128,96,64],
    'walk_length': [100],
    'num_walks': [500],
    'workers': [4],
    'temp_folder': [temp_folder],
    'seed': [None, 37, 44]
}

# Dictionary to store initialized objects
node2vec_models = {}

# Iterate through parameter combinations, fit each model and export its embedding
from itertools import product

for dimension, walk_length, num_walks, workers, seed, weight_key, tmp_dir, p, q in product(
    params['dimensions'], params['walk_length'], params['num_walks'],
    params['workers'], params['seed'], params['weight_key'],
    params['temp_folder'], params['p'], params['q'],
):
    # name each model after its parameters; the seed suffix is omitted when no seed is set
    key = f"p_{p}_q_{q}_dim_{dimension}_walkleng_{walk_length}_numwalks_{num_walks}"
    if seed:
        key += f"_seed_{seed}"

    node2vec_obj = Node2Vec(
        largest_subgraph, p=p, q=q, weight_key=weight_key,
        dimensions=dimension, walk_length=walk_length,
        num_walks=num_walks, workers=workers,
        temp_folder=tmp_dir, seed=seed
    )
    node2vec_models[key] = node2vec_obj.fit(window=10, min_count=1, batch_words=4)

    # save the embedding in word2vec text format (.emb)
    filename = key + '.emb'
    file_path = os.path.join(embeddings_path, filename)
    node2vec_models[key].wv.save_word2vec_format(file_path)


Computing transition probabilities:   0%|          | 0/16210 [00:00<?, ?it/s]

Generating walks (CPU: 2):   9%|▉         | 11/125 [14:20<2:38:52, 83.62s/it]
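
Because each model takes hours to fit, it may help to list the expected model names before starting the run. The sketch below enumerates the grid with `itertools.product` and rebuilds the key naming scheme without fitting anything. `toy_params` is a reduced copy of the grid that keeps only the values entering the file name, and `build_key` is a hypothetical helper, not part of the original notebook.

```python
from itertools import product

# reduced copy of the parameter grid: only the values that enter the model key
toy_params = {
    'p': [4],
    'q': [1],
    'dimensions': [128, 96, 64],
    'walk_length': [100],
    'num_walks': [500],
    'seed': [None, 37, 44],
}

def build_key(p, q, dimension, walk_length, num_walks, seed):
    """Hypothetical helper mirroring the naming scheme of the training loop."""
    key = f"p_{p}_q_{q}_dim_{dimension}_walkleng_{walk_length}_numwalks_{num_walks}"
    if seed:
        key += f"_seed_{seed}"  # the suffix is omitted when no seed is set
    return key

expected_keys = [
    build_key(p, q, dim, wl, nw, seed)
    for p, q, dim, wl, nw, seed in product(
        toy_params['p'], toy_params['q'], toy_params['dimensions'],
        toy_params['walk_length'], toy_params['num_walks'], toy_params['seed'],
    )
]

print(len(expected_keys))
for key in expected_keys:
    print(key)
```

With three dimensions and three seed settings this yields 9 distinct keys, which matches the number of models and `.emb` files the grid is expected to produce.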

In [None]:
# show the keys of the node2vec models, which should be the list of models named by their respective parameters
# the keys are used to collate the respective datasets and model evaluation results in the later processes
node2vec_models.keys()
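
Once the `.emb` files are written, the embedding quality-control checks (row count and column count) could be automated along the following lines. This is a minimal sketch: `check_embedding_file` is a hypothetical helper, and the toy file is written to a temporary folder so that the example runs without the real embeddings. It relies only on the file layout that `load_embedding` already assumes, namely a header line with the number of rows and features followed by one line per node.

```python
import os
import tempfile

def check_embedding_file(path, expected_nodes, expected_dim):
    """Hypothetical QC helper: True if the file matches the expected shape."""
    with open(path, 'r') as fh:
        header = fh.readline().split()
        n_rows, n_dim = int(header[0]), int(header[1])
        rows = [line.split() for line in fh if line.strip()]
    header_ok = (n_rows == expected_nodes) and (n_dim == expected_dim)
    # each data row holds the node id followed by one value per dimension
    rows_ok = len(rows) == n_rows and all(len(r) == n_dim + 1 for r in rows)
    return header_ok and rows_ok

# toy embedding file with made-up ids and values
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_dim_3.emb')
with open(toy_path, 'w') as fh:
    fh.write('2 3\n')
    fh.write('GENE_A 0.1 0.2 0.3\n')
    fh.write('GENE_B 0.4 0.5 0.6\n')

shape_ok = check_embedding_file(toy_path, expected_nodes=2, expected_dim=3)
wrong_dim = check_embedding_file(toy_path, expected_nodes=2, expected_dim=4)
print(shape_ok, wrong_dim)
```

For the real files, `expected_nodes` would be the node count of the input subgraph (for example `largest_subgraph.number_of_nodes()`, assuming the pickled object is a NetworkX graph) and `expected_dim` the dimension encoded in the file name.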

In [42]:
# store the fitted node2vec models in case they are needed later

filename = 'node2vec_models.pickle'
file_path = os.path.join(embeddings_path, filename)

with open(file_path, 'wb') as f:
    pickle.dump(node2vec_models, f, pickle.HIGHEST_PROTOCOL)
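
To reuse the stored models in a later session, the pickle file can be read back with `pickle.load`. The round trip below is a sketch that uses a placeholder dictionary in a temporary folder instead of the real fitted models; the dictionary content is made up for illustration.

```python
import os
import pickle
import tempfile

# stand-in for the dictionary of fitted models; a plain string replaces the model object
toy_models = {'p_4_q_1_dim_64_walkleng_100_numwalks_500': 'fitted model placeholder'}

toy_path = os.path.join(tempfile.mkdtemp(), 'node2vec_models.pickle')

# write the dictionary in the same way as the cell above
with open(toy_path, 'wb') as f:
    pickle.dump(toy_models, f, pickle.HIGHEST_PROTOCOL)

# read it back; the keys identify the parameter setting of each model
with open(toy_path, 'rb') as f:
    reloaded_models = pickle.load(f)

print(list(reloaded_models.keys()))
```

Unpickling the real models requires the libraries that define the model classes to be installed in the loading environment, and pickle files should only be loaded from trusted sources.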

In [34]:
# load the genetic association list for use of positive labelling
positive_label_file_path = os.path.join('data', 'others', 'dise_asso_score_TH.csv')
ra_gene_ass_TH = pd.read_csv(positive_label_file_path)
positive_ids = ra_gene_ass_TH['target_id'].values.tolist()
print ("number of positive labels: ", len(positive_ids))

number of positive labels:  403


#### Create Dataset in .csv Format

In [43]:
# convert all embedding files to dataframes and do the positive labelling
# export the datasets in csv format

# List all files in the directory
file_list = os.listdir(embeddings_path)

# Filter files with .emb extension
emb_files = [file for file in file_list if file.endswith(".emb")] # user can add more criteria to filter the emb files

# Iterate through .emb files and export to csv 
for emb_file in emb_files:
    file_path = os.path.join(embeddings_path, emb_file)
    file_name_without_ext = os.path.splitext(emb_file)[0]
    
    embed_df = load_embedding(file_path)
    dataset = positive_labelling(embed_df, positive_ids)
    dataset_filename = 'dataset_' + file_name_without_ext + '.csv'
    dataset_file_path = os.path.join(datasets_path, dataset_filename)
    dataset.to_csv(dataset_file_path, index=False)
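
The dataset quality-control checks from the introduction can be sketched as a small report per dataset. `check_dataset` is a hypothetical helper and the toy data is invented for illustration. One assumption is made explicit here: a positive gene can only be labelled if it is a node in the graph, so the sum of `y` is compared with the number of positive IDs that actually occur in the dataset, and any positive IDs that are absent are listed separately.

```python
import pandas as pd

def check_dataset(dataset, positive_ids, expected_dim):
    """Hypothetical QC helper: returns simple checks for one compiled dataset."""
    ids_in_graph = set(positive_ids) & set(dataset['id'])
    return {
        # columns are 'id', one column per embedding dimension, and 'y'
        'columns_ok': dataset.shape[1] == expected_dim + 2,
        # every positive gene that is a node in the graph should be labelled 1
        'labels_ok': int(dataset['y'].sum()) == len(ids_in_graph),
        # positive genes absent from the graph cannot be labelled
        'missing_positive_ids': sorted(set(positive_ids) - set(dataset['id'])),
    }

# toy dataset with made-up ids and two embedding dimensions
toy_dataset = pd.DataFrame({
    'id': ['GENE_A', 'GENE_B', 'GENE_C'],
    'feature_1': [0.1, 0.2, 0.3],
    'feature_2': [0.4, 0.5, 0.6],
    'y': [1, 0, 1],
})
toy_positive_ids = ['GENE_A', 'GENE_C', 'GENE_X']  # GENE_X is not in the toy dataset

report = check_dataset(toy_dataset, toy_positive_ids, expected_dim=2)
print(report)
```

If `missing_positive_ids` is not empty for the real data, the sum of `y` will be lower than the length of the positive gene list, which is worth keeping in mind when checking the labels manually.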
