## KEGG Pathways Selection Process  

Building on the previous preprocessing steps for the KEGG pathways, we now keep only the pathways in which a large number of genes are also present in our gene expression matrix.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('df-geneid1.csv')
df.head(1)

Unnamed: 0.1,Unnamed: 0,pathways,gene1,gene2,gene3,gene4,gene5,gene6,gene7,gene8,...,gene991,gene992,gene993,gene994,gene995,gene996,gene997,gene998,gene999,gene1000
0,0,path:hsa00010,3101,3098,3099,80201,2645,2821,5213,5214,...,0,0,0,0,0,0,0,0,0,0


In [2]:
# Store the pathway names as a list so they can label the rows of the final DataFrame later
pathways = df.pathways.tolist()

In [3]:
rows = df.values.tolist()

In [16]:
all_genes = pd.read_csv('all_genes.csv').dropna()
all_genes["HGNC_ID"] = all_genes["HGNC_ID"].astype(int)
all_genes.head(1)

Unnamed: 0,gene_sliced,2001/1/1,2002/2/1,2003/1/1,2004/2/1,2006/2/1,2008/1/1,2010/2/1,2012/2/1,2013/2/1,...,2078/2/1,2080/2/1,2081/2/1,2082/1/1,2083/2/1,2084/1/1,2085/2/1,HGNC_ID,Approved_Symbol,Approved_name
0,ENSG00000237973,41.1537,32.840876,33.472636,68.599342,55.83454,85.471215,82.970549,75.094779,34.152149,...,58.834271,44.044076,33.586355,41.367272,81.654563,56.295355,42.16532,52014,MTCO1P12,MT-CO1 pseudogene 12


In [17]:
#Creating a list with the approved symbol
symbol = all_genes['Approved_Symbol'].tolist()

Convert each row of `df` into a list so that its gene IDs can be compared against the list of all genes of interest (the `HGNC_ID` column of `all_genes`).

In [18]:
listofdf = df.values.tolist()

In [39]:
genesofinterest = all_genes['HGNC_ID'].tolist()

In [40]:
result = [[j for j in i if j in genesofinterest] for i in listofdf]
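The filter above checks every value against a Python list, which scans the list on each lookup. A minimal sketch of the same filter using a `set` is shown here on toy data. The IDs are illustrative only, `genes_set` is a name introduced for this example, and slicing with `row[3:]` assumes the three leading non-gene columns seen in `df.head(1)` (two index columns and `pathways`). Unlike the cell above, it skips those leading fields so that a row index cannot accidentally match a gene ID.

```python
# Toy stand-ins for the notebook's data (values are illustrative only).
listofdf = [
    [0, 0, "path:hsa00010", 3101, 3098, 3099, 0, 0],
    [1, 1, "path:hsa00020", 1431, 47, 0, 0, 0],
]
genesofinterest = [3098, 3099, 47]

# A set gives constant-time membership checks instead of scanning a list.
genes_set = set(genesofinterest)

# Assumption: the first three fields are two index columns and the pathway
# name, so only the remaining fields are treated as gene IDs.
result = [[gene for gene in row[3:] if gene in genes_set] for row in listofdf]
print(result)
```

On the full data this should give the same gene lists as the original comprehension, provided no leading index value coincides with an ID in `genesofinterest`.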

The DataFrame below has one row per pathway and lists the genes of that pathway (by `HGNC_ID`) that are also present in the gene expression matrix.

In [68]:
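# The rows have different lengths, so pandas pads the shorter ones with NaN.
# NaN cannot be stored in an integer column, so dtype=int is not requested
# and the padded columns are float (as in the output below).
df2 = pd.DataFrame(data = result, index = pathways)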
df2.index.name='Pathway'
df2.head(1)

Pathway,0,1,2,3,4,5,6,7,8,9,...,89,90,91,92,93,94,95,96,97,98
path:hsa00010,3098.0,5213.0,8789.0,226.0,2597.0,5232.0,5224.0,10327.0,220.0,669.0,...,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN


We will retain the pathways with the highest gene representation, measured as the number of non-NaN values in each row of `df2`.

In [57]:
Non_nan_values = df2.count(axis=1)

In [58]:
non_nan = pd.DataFrame(Non_nan_values)

In [59]:
non_nan.index.name = 'Pathway'

In [60]:
non_nan.columns = ['Number of non NaN values']

In [70]:
#non_nan.sort_values('Number of non NaN values', ascending = False).head(5)

In [71]:
pathwayscount = df2.merge(non_nan, on = 'Pathway')
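The counting and selection steps can be sketched end to end on toy data. This is only an illustration under stated assumptions: the gene IDs are made up for the example, and `gene_counts`, `top_n` and `selected` are hypothetical names, since the notebook does not state which cutoff it uses to decide which pathways to keep.

```python
import pandas as pd

# Toy filtered gene lists; the rows have different lengths,
# so pandas pads the shorter ones with NaN.
result = [[3098, 5213, 8789], [226, 2597], [669]]
pathways = ["path:hsa00010", "path:hsa00020", "path:hsa00030"]

df2 = pd.DataFrame(result, index=pathways)
df2.index.name = "Pathway"

# count(axis=1) counts the non-NaN cells in each row,
# i.e. how many genes of interest each pathway retained.
gene_counts = df2.count(axis=1).rename("Number of non NaN values")

# Hypothetical cutoff: keep the top_n pathways with the most retained genes.
top_n = 2
selected = gene_counts.nlargest(top_n).index.tolist()
print(selected)
```

A fixed minimum count (for example `gene_counts[gene_counts >= k]` for some chosen `k`) would work equally well; which rule is appropriate depends on the downstream analysis.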