# Constructing a Cancer-Specific Network

__Introduction:__  
This notebook uses [__PCNet__](http://www.ndexbio.org/#/network/f93f402c-86d4-11e7-a10d-0ac135e8bacf) from Huang, Carlin, et al. (in press) and several collections of cancer-related genes (listed below) to construct a cancer-specific subnetwork. The subnetwork can be used with the pyNBS algorithm to stratify patients with sparse mutational profiles. The notebook also uses a module that wraps the [MyGene.info](http://mygene.info/) Python API (developed by Huang et al.) to normalize all gene names to HUGO symbols.


__Steps to construct Cancer Subnetwork:__
1. Load network
2. Compile all cancer genes from cancer-related gene sets into a single list
3. Keep only the edges that connect two cancer genes, and remove all other nodes and edges from the network
4. Write the filtered network to file as an edge list.
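
The four steps can be sketched on a toy example using only the standard library. The edge list, gene sets, and variable names below are made up for illustration; the notebook itself works on PCNet with networkx, so this is only a minimal sketch of the filtering logic.

```python
# Step 1: a toy edge list standing in for PCNet (hypothetical data).
toy_edges = [('TP53', 'MDM2'), ('TP53', 'GENE_X'),
             ('KRAS', 'BRAF'), ('GENE_X', 'GENE_Y')]

# Step 2: merge several gene sets into one set of cancer genes.
toy_gene_sets = [['TP53', 'KRAS'], ['BRAF', 'TP53'], ['MDM2']]
toy_cancer_genes = set()
for gene_set in toy_gene_sets:
    toy_cancer_genes.update(gene_set)

# Step 3: keep only edges whose two endpoints are both cancer genes.
filtered_edges = [(a, b) for a, b in toy_edges
                  if a in toy_cancer_genes and b in toy_cancer_genes]

# Step 4: format the result as tab-delimited edge list lines.
edge_lines = ['\t'.join(edge) for edge in filtered_edges]
print('\n'.join(edge_lines))
```

Edges that touch `GENE_X` or `GENE_Y` are dropped because those names are not in any of the toy gene sets.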

__The following is a list of the four cancer-related gene sets used to filter PCNet:__  

|File Name|Cancer Gene Set Description|Citation|
|:---|:---|:---|
|hallmarks.txt|Genes from hallmark cancer pathways|Hanahan D and Weinberg RA (2011) Hallmarks of Cancer: The Next Generation. Cell. 144(5), 646-674.|
|vogelstein.txt|List of tumor suppressor and oncogenes from Vogelstein et al.|Vogelstein B, et al. (2013) Cancer genome landscapes. Science. 339(6127), 1546-1558.|
|sanger_CL_genes.txt|Recurrently mutated cancer genes discovered from cancer cell lines (Sanger UK)|Iorio F, et al. (2016) A Landscape of Pharmacogenomic Interactions in Cancer. Cell. 166(3), 740-754.|
|cgc.txt|Genes from the Cancer Gene Census (COSMIC v81)|Forbes SA, et al. (2017) COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45(D1), D777-D783.|


In [1]:
import pandas as pd
import networkx as nx
from pyNBS import gene_conversion_tools as gct 

# The gct module imports the "requests" package.
# Troubleshooting "ImportError: No module named requests":
# in a terminal, run "conda activate py27env" and then "python -m pip install requests"
# https://stackoverflow.com/questions/17309288/importerror-no-module-named-requests
# https://pypi.org/project/requests/#downloads

## Load Network

In [3]:
# Load network
network_file = './Supplementary_Notebook_Data/CancerSubnetwork_Data/PCNet.txt'
network = nx.read_edgelist(network_file, delimiter='\t', data=True)

## Get all cancer-related genes


#### Get genes from all cancer hallmark pathways and convert them from Entrez to HUGO Symbols (Hanahan, Weinberg 2011)

In [25]:
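# Load pathway gene sets
# Record lines are tab-delimited: the gene set name is the second '|'-separated
# part of the first column, and gene IDs start at the third column
with open('./Supplementary_Notebook_Data/CancerSubnetwork_Data/hallmarks.txt') as f:
    lines = f.read().splitlines()
hallmark_genesets = {}
for line in lines:
    if '\t' in line:
        fields = line.split('\t')
        hallmark_genesets[fields[0].split('|')[1]] = fields[2:]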
        
len(hallmark_genesets)

28
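
The parsing logic in the cell above can be checked on a few hand-written lines. The sample lines, the `hsa0000x` identifiers, and the variable names here are hypothetical; only the layout (name after the first `|`, gene IDs from the third tab-delimited column) is taken from the notebook code.

```python
# Hypothetical lines in the assumed layout:
# "<id>|<name><TAB><description><TAB><gene><TAB><gene>..."
sample_lines = [
    'hsa00001|Toy pathway A\tdescription\t998\t7414\t998',
    'a header line without tabs',
    'hsa00002|Toy pathway B\tdescription\t1499',
]

toy_genesets = {}
for line in sample_lines:
    if '\t' not in line:
        continue  # skip lines that are not gene set records
    fields = line.split('\t')
    name = fields[0].split('|')[1]   # text after the first '|'
    toy_genesets[name] = fields[2:]  # gene IDs start at the third column

print(sorted(toy_genesets))
```

Repeated IDs within a gene set are kept at this stage, which matches the repeated entries visible in the notebook output; deduplication happens later when the lists are merged.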

In [8]:
hallmark_genesets

{'Adherens junction': ['4301',
  '4008',
  '998',
  '7414',
  '60',
  '646048',
  '646821',
  '71',
  '8826',
  '8826',
  '8826',
  '5879',
  '5880',
  '5881',
  '998',
  '1500',
  '7414',
  '6714',
  '25945',
  '5818',
  '5819',
  '81607',
  '1499',
  '8826',
  '81',
  '87',
  '88',
  '89',
  '117178',
  '3643',
  '3480',
  '1457',
  '1459',
  '1460',
  '5795',
  '7046',
  '7048',
  '2260',
  '1956',
  '2064',
  '4233',
  '7082',
  '7082',
  '60',
  '646048',
  '646821',
  '71',
  '1495',
  '1496',
  '29119',
  '5777',
  '5770',
  '5879',
  '5880',
  '5881',
  '5879',
  '5880',
  '5881',
  '4089',
  '1495',
  '1496',
  '29119',
  '999',
  '1500',
  '1500',
  '999',
  '999',
  '51176',
  '6932',
  '6934',
  '83439',
  '999',
  '81',
  '87',
  '88',
  '89',
  '8976',
  '6714',
  '7454',
  '8976',
  '56288',
  '25945',
  '5818',
  '5819',
  '81607',
  '6714',
  '5879',
  '5880',
  '5881',
  '998',
  '1499',
  '999',
  '6885',
  '51701',
  '4087',
  '4088',
  '1387',
  '2033',
  '2534',
 

In [26]:
# Convert cancer-hallmark gene set genes to HUGO with MyGene.info
all_hallmark_genes_entrez = []
for hallmark in hallmark_genesets:
    all_hallmark_genes_entrez = all_hallmark_genes_entrez + hallmark_genesets[hallmark]
all_hallmark_genes_entrez = list(set(all_hallmark_genes_entrez))

len(all_hallmark_genes_entrez) # 1711 genes

1711

In [8]:
# Get gene conversion query string
query_string, valid_genes, invalid_genes = gct.query_constructor(all_hallmark_genes_entrez)
# Set scopes (gene naming systems to search)
scopes = "entrezgene, retired"
# Set fields (naming systems from which to return gene names)
fields = "symbol, entrezgene"
# Query MyGene.Info
match_list = gct.query_batch(query_string, scopes=scopes, fields=fields)

1711 Valid Query Genes
0 Invalid Query Genes
1711 Matched query results
Batch query complete: 3.63 seconds


In [27]:
# Get gene conversion maps
match_table_trim, query_to_symbol, query_to_entrez = gct.construct_query_map_table(match_list, 
                                                    valid_genes, display_unmatched_queries=True)
# Collapse cancer-hallmark gene set genes as HUGO Symbols only
all_hallmark_genes_symbol = [str(query_to_symbol[gene]) for gene in all_hallmark_genes_entrez]

print len(all_hallmark_genes_symbol) # 1711

Queries with partial matching results found: 7
{u'query': u'731751', u'notfound': True}
{u'query': u'652671', u'notfound': True}
{u'query': u'651610', u'notfound': True}
{u'query': u'652799', u'notfound': True}
{u'query': u'646821', u'notfound': True}
{u'query': u'652346', u'notfound': True}
{u'query': u'650621', u'notfound': True}

0 Queries with mutliple matches found

Query mapping table/dictionary construction complete: 0.27 seconds
1711
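
The output above reports seven Entrez IDs with no match, yet the symbol list still has 1711 entries. One possible explanation, which is an assumption because the internals of `gct.construct_query_map_table` are not shown here, is that unmatched IDs map to `None` and `str()` turns them into the literal string `'None'`. The sketch below uses a hypothetical mapping dictionary to show how such entries could be separated out instead.

```python
# Hypothetical mapping; None marks an ID with no match (assumed behavior).
toy_query_to_symbol = {'1956': 'EGFR', '999': 'CDH1', '646821': None}
toy_entrez = ['1956', '999', '646821']

# Naive conversion keeps the unmatched ID as the string 'None'.
naive = [str(toy_query_to_symbol[g]) for g in toy_entrez]

# Safer: keep only IDs that actually mapped, and track the rest.
mapped = [toy_query_to_symbol[g] for g in toy_entrez
          if toy_query_to_symbol.get(g) is not None]
unmapped = [g for g in toy_entrez if toy_query_to_symbol.get(g) is None]

print(naive)
print(mapped)
print(unmapped)
```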


In [10]:
all_hallmark_genes_symbol[0:3]

['LPL', 'LRP11', 'RFC2']

In [11]:
# Check whether a driver gene (PRKCB, one of the 47) is present in the hallmark gene set
'PRKCB' in all_hallmark_genes_symbol

True

#### Load genes determined by Vogelstein as tumor suppressors or oncogenes (Vogelstein et al 2013)

In [11]:
# Vogelstein cancer genes list
f = open('./Supplementary_Notebook_Data/CancerSubnetwork_Data/vogelstein.txt')
lines = f.read().splitlines()
Vogelstein_genes = [line.split('\t')[0] for line in lines]
# [0] selects only the first column (gene name)

In [13]:
len(Vogelstein_genes) #138

138

#### Load genes determined as recurrently mutated across 1,001 cancer cell lines (Iorio et al 2016)

In [12]:
f = open('./Supplementary_Notebook_Data/CancerSubnetwork_Data/sanger_CL_genes.txt')
Sanger_genes = f.read().splitlines()

In [15]:
len(Sanger_genes) # 2369 (contains repeated names)

2369

In [16]:
len(set(Sanger_genes)) # 470 (unique names)

470
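
The gap between the two counts comes from names that appear on more than one line of the file. A short sketch with `collections.Counter` shows how the repeated names could be listed; the gene list below is a made-up stand-in for `Sanger_genes`.

```python
from collections import Counter

# Hypothetical list with repeats, standing in for Sanger_genes.
toy_genes = ['TP53', 'KRAS', 'TP53', 'EGFR', 'KRAS', 'TP53']

counts = Counter(toy_genes)
# Names that occur more than once, in alphabetical order.
repeated = sorted(g for g, n in counts.items() if n > 1)

print(len(toy_genes))   # total entries, repeats included
print(len(counts))      # unique names
print(repeated)
```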

#### Load genes from the Cancer Gene Census from COSMIC v81 (Forbes et al 2017)

In [13]:
COSMIC_table = pd.read_csv('./Supplementary_Notebook_Data/CancerSubnetwork_Data/cgc_v81.txt')
COSMIC_genes = list(COSMIC_table['Gene Symbol'])

In [14]:
print len(COSMIC_genes) # 595 genes
print len(set(COSMIC_genes)) # 594 genes

595
594


In [34]:
COSMIC_table.head(2)

Unnamed: 0,Gene Symbol,Name,Entrez GeneId,Genome Location,Chr Band,Somatic,Germline,Tumour Types(Somatic),Tumour Types(Germline),Cancer Syndrome,Tissue Type,Molecular Genetics,Role in Cancer,Mutation Types,Translocation Partner,Other Germline Mut,Other Syndrome,Synonyms
0,ABI1,abl-interactor 1,10006,10:26748570-26860863,10p11.2,yes,,AML,,,L,Dom,TSG,T,KMT2A,,,"ABI1,E3B1,ABI-1,SSH3BP1,10006"
1,ABL1,v-abl Abelson murine leukemia viral oncogene h...,25,9:130835447-130885683,9q34.1,yes,,"CML, ALL, T-ALL",,,L,Dom,oncogene,"T, Mis","BCR, ETV6, NUP214",,,"ABL1,p150,ABL,c-ABL,JTK7,bcr/abl,v-abl,P00519,..."


#### Combine all cancer gene lists together

In [15]:
cancer_genes = list(set(all_hallmark_genes_symbol+Vogelstein_genes+Sanger_genes+COSMIC_genes))
print "Number of HUGO Cancer Genes:", len(cancer_genes) # 2322 genes

Number of HUGO Cancer Genes: 2322


In [16]:
##---------Extra: Export Cancer gene list to Table
all_hallmark_genes_symbol.sort(reverse=False)
Vogelstein_genes.sort(reverse=False)
Sanger_genes.sort(reverse=False)
COSMIC_genes.sort(reverse=False)


print '----------------'
print len(all_hallmark_genes_symbol),len(Vogelstein_genes),len(Sanger_genes),len(COSMIC_genes) # python 2.7
print len(list(set(all_hallmark_genes_symbol+Vogelstein_genes+Sanger_genes+COSMIC_genes)))

# Make Table
df_A = pd.DataFrame(all_hallmark_genes_symbol)
df_A.rename(columns={0:'HallmarkCan_1711'},inplace=True) # column 1
df_B = pd.DataFrame(Vogelstein_genes)
df_B.rename(columns={0:'Vogelstein_138'},inplace=True) # column 2
df_C = pd.DataFrame(Sanger_genes)
df_C.rename(columns={0:'Sanger_2369'},inplace=True) # column 3
df_D = pd.DataFrame(COSMIC_genes)
df_D.rename(columns={0:'COSMIC_595'},inplace=True) # column 4

df_cat = pd.concat([df_A, df_B, df_C, df_D ], axis=1) # Final Table
df_cat.head(3)

----------------
1711 138 2369 595
2322


Unnamed: 0,HallmarkCan_1711,Vogelstein_138,Sanger_2369,COSMIC_595
0,A2M,ABL1,ABCB1,ABI1
1,ABCB1,ACVR1B,ABL2,ABL1
2,ABCB11,AKT1,ACACA,ABL2


In [26]:
# Export/convert dataframe to csv or tsv file
# https://towardsdatascience.com/how-to-export-pandas-dataframe-to-csv-2038e43d9c03
# https://www.geeksforgeeks.org/how-to-write-pandas-dataframe-as-tsv-using-python/
 
#datafraxx.to_csv('testdatafraxx.csv', header=False, index=False)    # in case of apriori frequency file

df_cat.to_csv('SuppTable2_GeneList_CancerSubnetwork_2322_4sources.tsv', sep="\t",index=False) 

In [17]:
# Extra: Combine 47 Driver genes
Driver_table = pd.read_csv('../../../12.TCGA-LUAD/4.1_LUAD-26Drivers-273Passen.tsv',sep='\t',header=0)
Driver47 = Driver_table['Final_47Drivers_15Feb22'].dropna().unique().tolist()

In [18]:
Driver_table.head(1)

Unnamed: 0,Drivers,Passengers,273genes,183Drivers,182Drivers,87Drivers,22Drivers,181Drivers_Rubio-Perez2015,26Drivers_Aus2019,20Drivers_Bailey2018,...,List_181(Rubio-Perez2015)+34Drivers(Bailey2020)_13Feb2022,Unnamed: 16,TestRemained_165Drivers_13Feb22,Unnamed: 18,Final_47Drivers_15Feb22,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24
0,ATM,ABCC11,HNRNPM,ACAD8,ACAD8,TP53,TP53,ACAD8,ADGRL2,ARID1A,...,ACAD8,,ACAD8,Rubio-Perez2015,AFF2,"Bailey et al., 2020",Group A (35 Drivers),,,


In [19]:
len(Driver47)

47

In [67]:
print Driver47

['AFF2', 'AMER1', 'ARID1A', 'ASXL3', 'ATM', 'BAZ2B', 'BRAF', 'CDH12', 'CDK12', 'CDKN2A', 'COL1A1', 'CREBBP', 'CTNNB1', 'DNMT3A', 'EGFR', 'EIF4G1', 'EPHA4', 'FAT1', 'FN1', 'KDR', 'KEAP1', 'KMT2A', 'KMT2C', 'KRAS', 'LPHN2', 'MAP3K4', 'MET', 'MGA', 'MMP2', 'MMP16', 'NCAM1', 'NF1', 'NTRK2', 'NUP98', 'PIK3CA', 'PRKCB', 'PTPRD', 'RB1', 'RBM10', 'RYR2', 'SETD2', 'SMAD4', 'SMARCA4', 'SORCS3', 'STK11', 'SVEP1', 'TP53']


In [20]:
# Note: the gene name 'LPHN2' (NCBI) is replaced here with 'ADGRL2' (HUGO); all other names are unchanged
Driver47b = ['AFF2', 'AMER1', 'ARID1A', 'ASXL3', 'ATM', 'BAZ2B', 'BRAF', 'CDH12', 'CDK12', 'CDKN2A', 
             'COL1A1', 'CREBBP', 'CTNNB1', 'DNMT3A', 'EGFR', 'EIF4G1', 'EPHA4', 'FAT1', 'FN1', 'KDR', 
             'KEAP1', 'KMT2A', 'KMT2C', 'KRAS', 'ADGRL2', 'MAP3K4', 'MET', 'MGA', 'MMP2', 'MMP16', 'NCAM1', 
             'NF1', 'NTRK2', 'NUP98', 'PIK3CA', 'PRKCB', 'PTPRD', 'RB1', 'RBM10', 'RYR2', 'SETD2', 'SMAD4', 
             'SMARCA4', 'SORCS3', 'STK11', 'SVEP1', 'TP53']
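
Rather than retyping the whole list, the same renaming could be expressed with a small alias dictionary. This is only a sketch: `alias_to_current` and `toy_drivers` are hypothetical names, and the dictionary holds just the one rename mentioned in the notebook.

```python
# Only the rename noted in the notebook; extend as needed.
alias_to_current = {'LPHN2': 'ADGRL2'}

# A short stand-in for the 47-gene driver list.
toy_drivers = ['KRAS', 'LPHN2', 'TP53']

# Replace a name if it has an entry in the alias map, otherwise keep it.
updated_drivers = [alias_to_current.get(g, g) for g in toy_drivers]

print(updated_drivers)
```

Applied to `Driver47`, the same comprehension would leave the list order intact and change only the `LPHN2` entry.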

In [21]:
cancer_genes2 = list(set(all_hallmark_genes_symbol+Vogelstein_genes+Sanger_genes+COSMIC_genes+Driver47))
print "Number of HUGO Cancer Genes:", len(cancer_genes2) # increase 2330-2322 = 8 genes

Number of HUGO Cancer Genes: 2330


In [22]:
cancer_genes3 = list(set(all_hallmark_genes_symbol+Vogelstein_genes+Sanger_genes+COSMIC_genes+Driver47b))
print "Number of HUGO Cancer Genes:", len(cancer_genes3) # increase 2331-2322 = 9 genes

Number of HUGO Cancer Genes: 2331


### Rename variables for Thesis

In [24]:
# Original merged cancer gene sets
cancer_genes_ori = list(set(all_hallmark_genes_symbol+Vogelstein_genes
                            +Sanger_genes+COSMIC_genes))

# All cancer gene sets merged with 47 LUAD driver genes
cancer_genes_fin = list(set(all_hallmark_genes_symbol+Vogelstein_genes
                            +Sanger_genes+COSMIC_genes+Driver47b))

print "Number of HUGO Original Cancer Genes:", len(cancer_genes_ori)
print "Number of HUGO Cancer Genes merged with drivers:", len(cancer_genes_fin)
print "No.of increased genes:", len(cancer_genes_fin)- len(cancer_genes_ori)

Number of HUGO Original Cancer Genes: 2322
Number of HUGO Cancer Genes merged with drivers: 2331
No.of increased genes: 9
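
The count alone does not say which genes the driver list contributed. A set difference would list them; the sketch below uses small made-up sets in place of `cancer_genes_ori` and `Driver47b`.

```python
# Hypothetical stand-ins for the merged gene set and the driver list.
toy_original = set(['TP53', 'KRAS', 'EGFR'])
toy_drivers = ['TP53', 'ADGRL2', 'SVEP1']

toy_merged = toy_original | set(toy_drivers)

# Genes present after the merge that were not in the original set.
added = sorted(toy_merged - toy_original)

print(len(toy_merged) - len(toy_original))
print(added)
```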


### Generate Cancer Gene Network
Note: The resulting network may not be **exactly** the same as the Cancer Subnetwork found in ```'~/Examples/Example_Data/Network_Files/CancerSubnetwork.txt'``` because [MyGene.Info](http://mygene.info/) gene name mappings may change over time.

In [61]:
# PCnet
PCnodes = network.nodes # len(network.nodes) = 19781
list(PCnodes)

[u'UBE2Q1',
 u'RNF14',
 u'UBE2Q2',
 u'RNF10',
 u'RNF11',
 u'RNF13',
 u'REM1',
 u'REM2',
 u'C16orf13',
 u'RPEL1',
 u'CCDC109B',
 u'UCHL5',
 u'RNF17',
 u'NBEAL1',
 u'MZT2A',
 u'MZT2B',
 u'ATRX',
 u'PMM2',
 u'PMM1',
 u'ASS1',
 u'OR14C36',
 u'FHIT',
 u'SPX',
 u'ZNF709',
 u'ZNF708',
 u'ZNF879',
 u'ZNF878',
 u'DISC1',
 u'CAMK1',
 u'STYXL1',
 u'SPR',
 u'ZNF700',
 u'ZNF707',
 u'ZNF706',
 u'ZNF704',
 u'ZC3H10',
 u'RNF114',
 u'RNF115',
 u'ZC3H15',
 u'ZC3H14',
 u'SPN',
 u'RNF111',
 u'NACAP1',
 u'ZC3H18',
 u'GRIN1',
 u'DHX8',
 u'DHX9',
 u'TCOF1',
 u'NSRP1',
 u'NUP98',
 u'XPC',
 u'SP1',
 u'XPA',
 u'SP3',
 u'NUP93',
 u'SP5',
 u'SP6',
 u'CAMKV',
 u'C16orf93',
 u'SPPL3',
 u'GOLIM4',
 u'OPA3',
 u'OPA1',
 u'LOC100996763',
 u'RAB40C',
 u'RAB40B',
 u'RAB40A',
 u'COL7A1',
 u'GTSE1',
 u'FAM183A',
 u'ARFRP1',
 u'OVCH1',
 u'FLG2',
 u'OR52I2',
 u'SPPL2A',
 u'SPPL2C',
 u'SPPL2B',
 u'MYO3A',
 u'ITGA9',
 u'UGCG',
 u'MYO3B',
 u'ATP2A1',
 u'ATP2A2',
 u'ATP2A3',
 u'ITGA1',
 u'ITGA2',
 u'NOP2',
 u'ITGA4',
 u'ITGA5',


In [65]:
# Check that 'ADGRL2' (the name that replaced 'LPHN2') is a node in PCNet
'ADGRL2' in PCnodes

True
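
The same membership test can be run over the whole driver list to see which names would be silently dropped when the network is filtered. The node set and driver list below are hypothetical stand-ins for `PCnodes` and `Driver47`.

```python
# Hypothetical stand-ins for the PCNet node set and the driver list.
toy_nodes = set(['TP53', 'KRAS', 'ADGRL2'])
toy_drivers = ['TP53', 'LPHN2', 'ADGRL2']

# Driver names that are not nodes in the network.
missing = [g for g in toy_drivers if g not in toy_nodes]

print(missing)
```

Any name reported here is a candidate for an alias check before building the subnetwork.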

In [53]:
# Filter PCNet to only contain genes from the combined cancer gene list and the edges between those genes
cancer_subnetwork1 = network.subgraph(cancer_genes) # cancer_genes = 2322

In [54]:
cancer_subnetwork2 = network.subgraph(cancer_genes2)# cancer_genes2 = 2330

In [71]:
cancer_subnetwork3 = network.subgraph(cancer_genes3)# cancer_genes3 = 2331

In [56]:
# cancer_genes2 = 2330 (including 47 driver genes)
# dict() gives a node-to-degree mapping whether degree() returns a dict or a degree view
gene_degree2 = pd.Series(dict(cancer_subnetwork2.degree()), name='degree')
print "Number of connected genes in Cancer Subnetwork2:", len(cancer_subnetwork2.nodes())-len(gene_degree2[gene_degree2==0])
print "Number of interactions in Cancer Subnetwork2:", len(cancer_subnetwork2.edges())

Number of connected genes in Cancer Subnetwork2: 2304
Number of interactions in Cancer Subnetwork2: 204871


In [72]:
# "network" is loaded PCNet
# cancer_genes3 = 2331 (including the 47 driver genes, with 'LPHN2' (NCBI) changed to 'ADGRL2' (HUGO))

cancer_subnetwork3 = network.subgraph(cancer_genes3)
# dict() gives a node-to-degree mapping whether degree() returns a dict or a degree view
gene_degree3 = pd.Series(dict(cancer_subnetwork3.degree()), name='degree')

print "Number of connected genes in Cancer Subnetwork3:", (len(cancer_subnetwork3.nodes())
                                                            - len(gene_degree3[gene_degree3==0]))
print "Number of interactions in Cancer Subnetwork3:", len(cancer_subnetwork3.edges())

#****---------Note: Genes with no edges have not been removed at this step.---------*****

Number of connected genes in Cancer Subnetwork3: 2305
Number of interactions in Cancer Subnetwork3: 204922
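
The two reported numbers can be understood with a plain-Python sketch: a gene counts as connected if it has at least one edge to another gene in the list, and an interaction is an edge with both endpoints in the list. The edges and gene names below are made up, and this is not how networkx computes the values internally.

```python
# Hypothetical edge list and gene list.
toy_edges = [('TP53', 'MDM2'), ('KRAS', 'BRAF'),
             ('TP53', 'KRAS'), ('EGFR', 'GENE_X')]
toy_genes = set(['TP53', 'MDM2', 'KRAS', 'BRAF', 'EGFR'])

# Count degrees using only edges whose endpoints are both in the gene list.
degree = dict((g, 0) for g in toy_genes)
n_edges = 0
for a, b in toy_edges:
    if a in toy_genes and b in toy_genes:
        degree[a] += 1
        degree[b] += 1
        n_edges += 1

# Genes left with no edge inside the subnetwork.
isolated = sorted(g for g, d in degree.items() if d == 0)
n_connected = len(degree) - len(isolated)

print(n_connected)
print(n_edges)
print(isolated)
```

Here `EGFR` ends up isolated because its only neighbor is outside the gene list, which is the situation the note about genes with no edges refers to.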


In [73]:
## Write the cancer subnetwork generated from cancer_genes3, which includes the 47 driver genes with 'LPHN2' (NCBI) changed to 'ADGRL2' (HUGO)

# Write the filtered cancer subnetwork to file
# Note: Genes with no edges connecting them to any other gene will be removed during this step

gct.write_edgelist(cancer_subnetwork3.edges(), 
                './Supplementary_Notebook_Results/CancerSubnetwork3_47driv_15Feb22.txt', binary=True)

Edge list saved: 1.08 seconds
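
As a rough illustration of what a two-column, tab-delimited edge list looks like on disk, the sketch below writes a toy edge list to a temporary file and reads it back. It does not reproduce `gct.write_edgelist`; the exact output format of that function is not shown in this notebook, and the file name and edges here are hypothetical.

```python
import os
import tempfile

# Hypothetical edges to write.
toy_edges = [('TP53', 'MDM2'), ('KRAS', 'BRAF')]

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'toy_subnetwork.txt')

# One edge per line, the two gene names separated by a tab.
with open(path, 'w') as handle:
    for a, b in toy_edges:
        handle.write('{0}\t{1}\n'.format(a, b))

# Read the file back to confirm the round trip.
with open(path) as handle:
    read_back = [tuple(line.rstrip('\n').split('\t')) for line in handle]

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)

print(read_back)
```

Because only edges are written, a gene with no edges never appears in the file, which is consistent with the note in the cell above.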
