# Prediction of Transcription Factors using GCNs
To evaluate how well graph convolutional networks (GCNs) perform on PPI networks combined with gene expression, I'll use the human gene expression from the dual RNA-seq data. The goal is to predict whether a gene is a transcription factor (TF) or not, using only that information and a limited number of genes as the training set.

**In this notebook, I want to prepare the data for use with the GCN.**

In [9]:
import pandas as pd
import numpy as np
import h5py, os
import mygene

from goatools.base import download_go_basic_obo, download_ncbi_associations
from goatools.associations import read_ncbi_gene2go
from goatools.go_search import GoSearch

## 1. Load PPI network with Gene Expression
We first need the PPI network and the gene expression data, both of which were already produced by the preprocessing steps.

In [2]:
fname = '../data/preprocessing/ppi_networks.h5'
with h5py.File(fname, 'r') as f:
    gene_expression_data = pd.DataFrame(f['gene_expression'][:])
    ppi_network = f['consensusPathDB_ppi'][:]
    gene_names = f['gene_names'][:]
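
Before going further, it can help to confirm that the three objects describe the same set of genes. The following is a minimal sketch, assuming the PPI network is a square adjacency matrix with one row per gene and that the expression table and the name array use the same row order. The helper name `check_alignment` and the toy inputs are made up for illustration.

```python
def check_alignment(expression, network, names):
    """Return the number of genes, or raise ValueError if the inputs disagree."""
    n_genes = len(names)
    # The expression table needs one row per gene.
    if len(expression) != n_genes:
        raise ValueError("expression has {} rows, expected {}".format(len(expression), n_genes))
    # The adjacency matrix needs to be square, with one row and column per gene.
    if len(network) != n_genes or any(len(row) != n_genes for row in network):
        raise ValueError("network is not a {0} x {0} matrix".format(n_genes))
    return n_genes


# Toy data (hypothetical): 3 genes, 2 expression samples.
toy_names = ["geneA", "geneB", "geneC"]
toy_expression = [[1.0, 2.0], [0.5, 0.1], [3.2, 2.9]]
toy_network = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
n_checked = check_alignment(toy_expression, toy_network, toy_names)
```

Because the helper only relies on `len()` and row iteration, it should also accept the NumPy arrays and the DataFrame loaded above, e.g. `check_alignment(gene_expression_data, ppi_network, gene_names)`.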

## 2. Get Labels for Whether a Gene is a TF or Not for all Genes
I will do that using goatools, a Python library that is available through bioconda. This involves multiple steps. First, I have to get the IDs of genes that are annotated with the TF GO term or one of its children.
Then, I have to convert the Entrez IDs to Ensembl IDs.

In [15]:


# download GO terms and associations
obo_fname = download_go_basic_obo()
gene2go = download_ncbi_associations()
# get all GO terms associated with human genes (maybe only the ones contained in gene_names?)
go2geneids_human = read_ncbi_gene2go(gene2go, taxids=[9606], go2geneids=True)
print("{N} GO terms associated with human NCBI Entrez GeneIDs".format(N=len(go2geneids_human)))

# search in there for GO ID GO:0003700 (DNA binding TF activity) and GO:0000130
srchhelp = GoSearch(obo_fname, go2items=go2geneids_human)
# The log file is opened, but no call below receives `log`, so nothing is written to it
fout_allgos = "TF_GO_search.log"
with open(fout_allgos, "w") as log:
    # Add children GOs of the TF terms
    tf_with_children = srchhelp.add_children_gos(['GO:0003700', 'GO:0000130'])
    # Get Entrez GeneIDs annotated with any of the TF GOs
    entrez_ids = list(srchhelp.get_items(tf_with_children))
print("{N} human NCBI Entrez GeneIDs under Transcription Factors found.".format(N=len(entrez_ids)))

  EXISTS: go-basic.obo
  EXISTS: gene2go
  READ: gene2go
17529 GO terms associated with human NCBI Entrez GeneIDs
load obo file go-basic.obo
go-basic.obo: fmt(1.2) rel(2017-11-02) 47,030 GO Terms
1248 human NCBI Entrez GeneIDs under Transcription Factors found.


In [39]:
mg = mygene.MyGeneInfo()
res = mg.querymany(entrez_ids, scopes='entrezgene', fields='ensembl.gene', species='human')
def first_ensembl_gene(hit):
    # a hit can list several Ensembl genes; keep only the first one in that case
    ens = hit['ensembl']
    return ens[0]['gene'] if isinstance(ens, list) else ens['gene']
ens_ids = [first_ensembl_gene(x) for x in res if 'ensembl' in x]
print("{} human transcription factors with corresponding Ensembl ID".format(len(ens_ids)))

querying 1-1000...done.
querying 1001-1248...done.
Finished.
1232 human transcription factors with corresponding Ensembl ID
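
With the Ensembl IDs of the TFs at hand, the per-gene labels can be derived by checking each gene of the network against that list. The following is a minimal sketch, assuming the network genes are identified by Ensembl IDs and may come back from HDF5 as bytes. The helper name `make_labels` and the toy IDs are placeholders, not real identifiers from the data.

```python
def make_labels(gene_ids, positive_ids):
    """Return a list with 1 for genes found in positive_ids and 0 otherwise."""
    positives = set(positive_ids)
    labels = []
    for gene_id in gene_ids:
        # String datasets read with h5py are often bytes; decode before comparing.
        if isinstance(gene_id, bytes):
            gene_id = gene_id.decode("utf-8")
        labels.append(1 if gene_id in positives else 0)
    return labels


# Toy data (hypothetical IDs): two of the three network genes are TFs,
# and one TF ID does not occur in the network at all.
toy_gene_ids = [b"ENSG00000000001", "ENSG00000000002", b"ENSG00000000003"]
toy_tf_ids = ["ENSG00000000003", "ENSG00000000001", "ENSG00000009999"]
toy_labels = make_labels(toy_gene_ids, toy_tf_ids)
print(toy_labels)  # prints [1, 0, 1]
```

TF IDs that are not part of the network are simply ignored, so the number of positive labels can be smaller than `len(ens_ids)`. If `gene_names` holds more than one column per gene, only the column with the Ensembl IDs would be passed as `gene_ids`.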


## 3. Split Genes into Training, Testing and Validation Sets
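
One possible way to do this is a stratified split, so that the rare TF class is represented in every set. The following is a minimal sketch using only the standard library. The 60/20/20 fractions, the seed and the helper name `stratified_split` are assumptions for illustration, not settings taken from this project.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split gene indices into train/val/test sets, class by class."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for cls in sorted(set(labels)):
        # Shuffle the indices of one class, then cut them by the fractions.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        splits["train"] += idx[:n_train]
        splits["val"] += idx[n_train:n_train + n_val]
        # Whatever remains goes to the test set.
        splits["test"] += idx[n_train + n_val:]
    return {name: sorted(members) for name, members in splits.items()}


# Toy labels (hypothetical): 10 TFs and 40 non-TFs.
toy_split_labels = [1] * 10 + [0] * 40
toy_splits = stratified_split(toy_split_labels)
```

A GCN is usually trained on the full graph, so the index lists would typically be turned into boolean masks over all genes rather than used to slice the network.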

## 4. Write PPI Network, Gene Expression & Label Information to HDF5 Container
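
A minimal sketch of how such a container could be written with h5py is given here. The dataset names (`network`, `features`, `gene_names`, `labels`, `mask_*`), the helper name `write_container` and the toy arrays are assumptions for illustration. The layout that the GCN code expects may differ.

```python
import os
import tempfile

import h5py
import numpy as np


def write_container(path, network, features, gene_names, labels, masks):
    """Write network, features, gene names, labels and split masks to HDF5."""
    with h5py.File(path, "w") as f:
        f.create_dataset("network", data=np.asarray(network))
        f.create_dataset("features", data=np.asarray(features))
        # Fixed-length byte strings are a simple string type for h5py to store.
        f.create_dataset("gene_names", data=np.asarray(gene_names, dtype="S"))
        f.create_dataset("labels", data=np.asarray(labels, dtype=np.int8))
        # One boolean mask over all genes per split (train/val/test).
        for split, mask in masks.items():
            f.create_dataset("mask_" + split, data=np.asarray(mask, dtype=bool))


# Toy data (hypothetical): 3 genes, 2 expression samples.
toy_out_network = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_out_features = [[1.0, 2.0], [0.5, 0.1], [3.2, 2.9]]
toy_out_names = ["geneA", "geneB", "geneC"]
toy_out_labels = [1, 0, 1]
toy_out_masks = {
    "train": [True, False, False],
    "val": [False, True, False],
    "test": [False, False, True],
}

out_path = os.path.join(tempfile.mkdtemp(), "toy_container.h5")
write_container(out_path, toy_out_network, toy_out_features,
                toy_out_names, toy_out_labels, toy_out_masks)

# Read the file back to confirm the round trip.
with h5py.File(out_path, "r") as f:
    stored_names = [n.decode("utf-8") for n in f["gene_names"][:]]
    stored_labels = f["labels"][:].tolist()
    stored_train_mask = f["mask_train"][:].tolist()
```

Storing the masks next to the labels keeps the split reproducible, since the training notebook then reads the same assignment instead of redrawing it.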