# Prediction of Transcription Factors using GCNs
To evaluate how well graph convolutional networks (GCNs) work on PPI networks combined with gene expression, I'll use the human gene expression from the dual RNA-seq data. I want to predict whether a gene is a transcription factor (TF) or not, using only that information and a limited number of genes as the training set.

**In this notebook, I want to prepare the data for use with the GCN.**

In [9]:
import pandas as pd
import numpy as np
import h5py, os
from sklearn.model_selection import train_test_split
import mygene

from goatools.base import download_go_basic_obo, download_ncbi_associations
from goatools.associations import read_ncbi_gene2go
from goatools.go_search import GoSearch

In [116]:
# constants
TEST_RATIO = 0.4
VAL_SIZE = 500

## 1. Load PPI network with Gene Expression
We first need the PPI network and the gene expression, both of which were already produced by the preprocessing steps.

In [2]:
fname = '../data/preprocessing/ppi_networks.h5'
with h5py.File(fname, 'r') as f:
    gene_expression_data = pd.DataFrame(f['gene_expression'][:])
    ppi_network = f['consensusPathDB_ppi'][:]
    gene_names = f['gene_names'][:]
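
Before going on, it can help to check that the three arrays describe the same set of nodes. The sketch below uses toy arrays rather than the real file, and it assumes that the network is stored as a square adjacency matrix whose rows are in the same order as the expression rows and the gene names; that layout is my assumption, and `check_alignment` is a hypothetical helper.

```python
import numpy as np

def check_alignment(features, adjacency, names):
    """Check that all three arrays agree on the number of nodes.

    Returns True if the adjacency matrix is also symmetric
    (i.e. the network is undirected).
    """
    n = adjacency.shape[0]
    assert adjacency.shape == (n, n), "adjacency matrix must be square"
    assert features.shape[0] == n, "one expression row per node expected"
    assert names.shape[0] == n, "one name row per node expected"
    return bool(np.allclose(adjacency, adjacency.T))

# toy stand-ins for ppi_network, gene_expression_data and gene_names
toy_adj = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]])
toy_feat = np.ones((3, 4))
toy_names = np.array([["ENSG_TOY_1", "A"],
                      ["ENSG_TOY_2", "B"],
                      ["ENSG_TOY_3", "C"]])

is_symmetric = check_alignment(toy_feat, toy_adj, toy_names)
print(is_symmetric)
```

On the real data the same call would be `check_alignment(gene_expression_data.values, ppi_network, gene_names)`, again under the adjacency-matrix assumption.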

## 2. Get Labels if a Gene is a TF or Not for all Genes
I will do that using goatools, a Python library that is available through bioconda. This involves multiple steps. First, I have to get the IDs of the genes that are annotated with the TF GO term or one of its children.
Then, I have to convert the Entrez IDs to Ensembl IDs.

In [15]:
# download GO terms and associations
obo_fname = download_go_basic_obo()
gene2go = download_ncbi_associations()
# get all GO terms associated with human genes (maybe only the ones contained in gene_names?)
go2geneids_human = read_ncbi_gene2go("gene2go", taxids=[9606], go2geneids=True)
print("{N} GO terms associated with human NCBI Entrez GeneIDs".format(N=len(go2geneids_human)))

# search in there for GO ID GO:0003700 (DNA binding TF activity)
srchhelp = GoSearch("go-basic.obo", go2items=go2geneids_human)
# Log file for search details (opened here; nothing below writes to it)
fout_allgos = "TF_GO_search.log"
with open(fout_allgos, "w") as log:
    # Add children GOs of TF
    tf_with_children = srchhelp.add_children_gos(['GO:0003700', 'GO:0000130'])
    # Get Entrez GeneIDs for the TF GOs
    entrez_ids = list(srchhelp.get_items(tf_with_children))
print("{N} human NCBI Entrez GeneIDs under Transcription Factors found.".format(N=len(entrez_ids)))

  EXISTS: go-basic.obo
  EXISTS: gene2go
  READ: gene2go
17529 GO terms associated with human NCBI Entrez GeneIDs
load obo file go-basic.obo
go-basic.obo: fmt(1.2) rel(2017-11-02) 47,030 GO Terms
1248 human NCBI Entrez GeneIDs under Transcription Factors found.
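
To make the "term or one of its children" step concrete, here is a toy sketch in plain Python that does not use goatools at all. The mini ontology, the GO-like IDs and the gene IDs are all made up, and `descendants` is a hypothetical helper; it only illustrates the idea of walking down the GO graph and then collecting the genes of every term reached.

```python
from collections import deque

def descendants(children, roots):
    """Return the root terms plus everything reachable below them."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:  # a term can have several parents
                seen.add(child)
                queue.append(child)
    return seen

# made-up ontology: parent term -> list of child terms
toy_children = {
    "GO:A": ["GO:B", "GO:C"],
    "GO:B": ["GO:D"],
    "GO:C": ["GO:D"],
}
# made-up annotations: term -> set of gene IDs
toy_go2genes = {
    "GO:A": {1},
    "GO:B": {2, 3},
    "GO:D": {3, 4},
    "GO:X": {99},  # unrelated term, must not be picked up
}

terms = descendants(toy_children, ["GO:A"])
genes = set().union(*(toy_go2genes.get(t, set()) for t in terms))
print(sorted(terms))
print(sorted(genes))
```

Genes annotated only to an unrelated term (gene 99 here) stay out, while genes annotated to several matching terms (gene 3) are counted once.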


In [39]:
# convert the entrez genes to ensembl IDs
mg = mygene.MyGeneInfo()
res = mg.querymany(entrez_ids, scopes='entrezgene', fields='ensembl.gene', species='human')
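def first_ensembl_gene(hit):
    # a hit carries either one Ensembl record (dict) or several (list); take the first
    ens = hit['ensembl']
    return ens[0]['gene'] if isinstance(ens, list) else ens['gene']
ens_ids = [first_ensembl_gene(x) for x in res if 'ensembl' in x]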
print ("{} human transcription factors with corresponding Ensembl ID".format(len(ens_ids)))

querying 1-1000...done.
querying 1001-1248...done.
Finished.
1232 human transcription factors with corresponding Ensembl ID
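
The ID conversion has to deal with three kinds of hits: one Ensembl record, several Ensembl records, or none at all. The sketch below replays that logic on mocked hits so it runs without a network connection. The hit dictionaries are hand-written imitations of the shape the conversion code above relies on, the IDs are invented, and `pick_ensembl_gene` is a hypothetical helper name.

```python
def pick_ensembl_gene(hit):
    """Return the (first) Ensembl gene ID stored in a single hit."""
    ens = hit["ensembl"]
    if isinstance(ens, list):  # several Ensembl genes for one Entrez ID
        return ens[0]["gene"]
    return ens["gene"]         # exactly one Ensembl gene

# mocked hits, not real query results
mock_hits = [
    {"query": "1", "ensembl": {"gene": "ENSG_TOY_A"}},
    {"query": "2", "ensembl": [{"gene": "ENSG_TOY_B"}, {"gene": "ENSG_TOY_C"}]},
    {"query": "3", "notfound": True},  # no Ensembl mapping, gets skipped
]

toy_ens_ids = [pick_ensembl_gene(h) for h in mock_hits if "ensembl" in h]
print(toy_ens_ids)
```

Taking only the first record and skipping hits without an `ensembl` field is why the list shrinks from 1248 Entrez IDs to 1232 Ensembl IDs in the run above.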


## 3. Split Genes into Training, Testing and Validation Sets

In [155]:
# Construct DF with label column and gene expression
gene_names_df = pd.DataFrame(gene_names,
                             index=gene_names[:, 0],
                             columns=['ID', 'name']
                            ).drop('ID', axis=1)
gene_names_df['node_number'] = np.arange(0, gene_names_df.shape[0])
ge = gene_expression_data.set_index(gene_names_df.index)
features_labeled = ge.join(gene_names_df)
assert (features_labeled.isnull().sum().sum() == 0)
features_labeled['label'] = gene_names_df.index.isin(ens_ids)

# split for training, testing & validation
X_train, X_test = train_test_split(features_labeled,
                                   stratify=features_labeled.label,
                                   test_size=TEST_RATIO
                                  )
X_val = X_train[-VAL_SIZE:]
X_train = X_train[:-VAL_SIZE]

print ("Split Training and Testing with {}% test nodes".format(TEST_RATIO*100.))
print ("Training Nodes: {}\t# of Labels in Train Set: {}".format(X_train.shape[0], X_train.label.sum()))
print ("Testing Nodes: {}\t# of Labels in Test Set: {}".format(X_test.shape[0], X_test.label.sum()))

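# construct training, testing and validation masks
def build_mask(features_labeled, X):
    # 1 for every node (gene) that belongs to the subset X, 0 otherwise
    # (the previous version only marked the TF-positive genes of X)
    mask = features_labeled.index.isin(X.index).astype(int)
    # sanity check: mask follows the node order of gene_names
    assert (np.all(features_labeled.index == gene_names[:, 0]))
    return mask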

train_mask = build_mask(features_labeled, X_train)
test_mask = build_mask(features_labeled, X_test)
val_mask = build_mask(features_labeled, X_val)

# construct labels
y_train = pd.get_dummies(features_labeled[features_labeled.isin(X_train)].label).values
y_test = pd.get_dummies(features_labeled[features_labeled.isin(X_test)].label).values
y_val = pd.get_dummies(features_labeled[features_labeled.isin(X_val)].label).values

Split Training and Testing with 40.0% test nodes
Training Nodes: 5274	# of Labels in Train Set: 428
Testing Nodes: 3850	# of Labels in Test Set: 310
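
As a toy illustration of what the masks and label matrices are meant to contain, here is a small sketch with five made-up genes. It assumes the usual semi-supervised layout for node classification, in which a mask flags which nodes belong to a set and the label matrix is one-hot with all-zero rows for nodes outside that set. That layout is my assumption about the downstream GCN, not something this notebook states.

```python
import numpy as np
import pandas as pd

# five toy genes in node order, two of them labeled as TF
toy = pd.DataFrame(
    {"label": [True, False, False, True, False]},
    index=["g0", "g1", "g2", "g3", "g4"],
)
train_ids = ["g0", "g1", "g4"]  # toy training subset

# mask: True for every node in the training subset, regardless of its label
train_mask = toy.index.isin(train_ids)

# one-hot labels for all nodes: column 0 = not TF, column 1 = TF
y_all = np.column_stack([~toy.label.values, toy.label.values]).astype(int)

# keep the labels of training nodes only; other rows become all zeros
y_train = y_all * train_mask[:, None]

print(train_mask)
print(y_train)
```

Note that the mask marks negative training genes (`g1`, `g4`) as well as the positive one (`g0`), so both classes can contribute to a masked loss.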


## 4. Write PPI Network, Gene Expression & Label Information to HDF5 Container

In [156]:
string_dt = h5py.special_dtype(vlen=str)
# the context manager closes the file even if one of the writes fails
with h5py.File('../data/tfprediction/gcn_input.h5', 'w') as f:

    # add network & features
    f.create_dataset('network', data=ppi_network, shape=ppi_network.shape)
    f.create_dataset('features', data=gene_expression_data, shape=gene_expression_data.shape)
    f.create_dataset('gene_names', data=gene_names, dtype=string_dt)

    # add labels
    f.create_dataset('y_train', data=y_train, shape=y_train.shape)
    f.create_dataset('y_test', data=y_test, shape=y_test.shape)
    f.create_dataset('y_val', data=y_val, shape=y_val.shape)

    # add masks
    f.create_dataset('mask_train', data=train_mask, shape=train_mask.shape)
    f.create_dataset('mask_test', data=test_mask, shape=test_mask.shape)
    f.create_dataset('mask_val', data=val_mask, shape=val_mask.shape)
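
A quick way to convince myself that a container like this can be read back is a round trip on toy data. The sketch below writes a tiny numeric array and a variable-length string array to a temporary file and reads them again. The file name, array contents and IDs are invented. Depending on the h5py version, strings may come back as `bytes`, so the sketch decodes them defensively.

```python
import os
import tempfile

import h5py
import numpy as np

# toy stand-ins for ppi_network and gene_names
toy_network = np.array([[0, 1], [1, 0]], dtype=np.float32)
toy_names = np.array([["ENSG_TOY_1", "GENE1"],
                      ["ENSG_TOY_2", "GENE2"]], dtype=object)

path = os.path.join(tempfile.mkdtemp(), "toy_gcn_input.h5")

# write: same dataset names and string dtype as in the cell above
with h5py.File(path, "w") as f:
    f.create_dataset("network", data=toy_network)
    f.create_dataset("gene_names", data=toy_names,
                     dtype=h5py.special_dtype(vlen=str))

# read everything back into memory before the file is closed
with h5py.File(path, "r") as f:
    network_back = f["network"][:]
    names_raw = f["gene_names"][:]

# decode bytes to str where necessary
names_back = [[n.decode() if isinstance(n, bytes) else n for n in row]
              for row in names_raw]

print(np.array_equal(network_back, toy_network))
print(names_back)
```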