In [1]:
import gseapy as gp
import pandas as pd
import numpy as np
import scipy

# CWAS Gene Set Enrichment

In this notebook, we explore whether categories significantly enriched for SVs in cases vs. controls reflect any higher-level biological process. One way to do this is with gene set enrichment.

# What is gene set enrichment anyway?

The idea behind gene set enrichment is that any list of genes (e.g. upregulated genes, or genes targeted by a "category") would overlap a given gene set to some degree by pure chance, even if the list were random. So we measure the observed overlap with the gene set and compare it to that background expectation.

This is a bit weirder with these data, since we have actual _SV counts_. The underlying test for gene set enrichment is a Fisher's exact test, which runs perfectly fine on counts like these, but I'm not entirely sure it's the right thing to do.

We'll do it anyway I guess.
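
To make the comparison to background concrete, here is a minimal sketch of the one-sided test behind this kind of enrichment, using only the standard library. The function name `enrichment_p_value` and its arguments are illustrative assumptions, not part of `gseapy` or `scipy`. It computes the hypergeometric upper tail, which is the quantity a one-sided Fisher's exact test evaluates.

```python
from math import comb


def enrichment_p_value(overlap, list_size, set_size, background_size):
    """Probability of seeing at least `overlap` gene set members in a random
    list of `list_size` genes drawn from `background_size` genes, of which
    `set_size` belong to the gene set (hypergeometric upper tail)."""
    total = comb(background_size, list_size)
    max_overlap = min(list_size, set_size)
    tail = sum(
        comb(set_size, i) * comb(background_size - set_size, list_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total


# toy numbers: 10 background genes, 5 of them in the gene set,
# and a list of 5 genes that all fall in the gene set
p = enrichment_p_value(overlap=5, list_size=5, set_size=5, background_size=10)
print(p)
```

With real data the "list" would be the genes hit by SVs in a category, which is where the question about counts versus unique genes comes in.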

# Load in the data

We load in the SVs and dosages, which we'll need.

## Read in SVs

In [2]:
# define the folder name for all our results
folder_name = "processed-data-v2.5.2"

# define SVs and dosages for discovery and validation
sv_path = "gs://vanallen-pedsv-analysis/beds/PedSV.v2.5.2.full_cohort.analysis_samples.sites.bed.gz"
dosages_path = "gs://vanallen-pedsv-analysis/beds/PedSV.v2.5.2.full_cohort.analysis_samples.allele_dosages.bed.gz"

# define metadata
metadata_path = "gs://vanallen-pedsv-analysis/sample_info/PedSV.v2.5.2.cohort_metadata.w_control_assignments.tsv.gz"
samples_path = "gs://vanallen-pedsv-analysis/sample_info/PedSV.v2.5.2.final_analysis_cohort.samples.list"

Load metadata and SVs

In [3]:
metadata = pd.read_csv(
    metadata_path,
    sep="\t",
)

# add a sex label to metadata
metadata["sex"] = (metadata["chrX_CopyNumber"].round() < 2).astype(int)

###############
### Samples ###
###############
samples = pd.read_csv(
    samples_path,
    header=None,
)[0].to_list()

Now we load the SVs. We'll eventually combine discovery and validation data, but it's easiest to keep them separate for now, since the SVs and dosages are not fully overlapping.

In [4]:
###############
##### SVs #####
###############
svs = pd.read_csv(
    sv_path,
    sep="\t",
)

###############
### Dosages ###
###############
dosages = pd.read_csv(
    dosages_path,
    sep="\t",
    index_col=False,
)


## Read in the category results

I'm going to concatenate all this data so that I'm only dealing with a few files. Focusing on neuroblastoma.

In [203]:
###########
### SVs ###
###########
nbl_coding_svs = pd.read_csv(
    "data/CWAS data for Jett/List of variants by category for each CWAS analysis/neuroblastoma_all_coding_SVs_in_each_category_list_combined_11_3_23.txt",
    sep="\t",
)

nbl_noncoding_svs = pd.read_csv('data/CWAS data for Jett/List of variants by category for each CWAS analysis/neuroblastoma_all_noncoding_SVs_in_each_category_list_combined_BURDEN_TESTING_with_col_names_11_3_23.txt', sep='\t')
nbl_noncoding_svs = nbl_noncoding_svs.rename(columns = {'emd': 'end'})

# combine the SVs
nbl_coding_svs['sv_category'] = 'coding'
nbl_noncoding_svs['sv_category'] = 'non-coding'

nbl_category_svs = pd.concat([nbl_coding_svs, nbl_noncoding_svs])
nbl_category_svs.head(2)

Unnamed: 0,SV,chrom,start,end,category,sv_category
0,PedSV.2.5.2_DUP_chr1_794,chr1,19221626,19301822,DUP.RARE.PREDICTED_COPY_GAIN.lof_constrained.e...,coding
1,PedSV.2.5.2_DUP_chr1_1379,chr1,44731601,44792024,DUP.RARE.PREDICTED_COPY_GAIN.lof_constrained.e...,coding


In [204]:
#########################
### FRAMEWORK RESULTS ###
#########################
nbl_singleton_coding_framework_results = pd.read_csv(
    "data/CWAS data for Jett/CWAS sum stats/neuroblastoma_all_coding_cwas_concatenated_glm_results_SINGLETON_11_3_23.txt",
    sep="\t",
)
nbl_singleton_coding_framework_results[['af_category', 'sv_category']] = ['singleton', 'coding']

nbl_rare_coding_framework_results = pd.read_csv(
    "data/CWAS data for Jett/CWAS sum stats/neuroblastoma_all_coding_cwas_concatenated_glm_results_RARE_11_3_23.txt",
    sep="\t",
)
nbl_rare_coding_framework_results[['af_category', 'sv_category']] = ['rare', 'coding']

nbl_singleton_noncoding_framework_results = pd.read_csv(
    "data/CWAS data for Jett/CWAS sum stats/neuroblastoma_all_noncoding_cwas_concatenated_glm_results_SINGLETON_11_3_23.txt",
    sep="\t",
)
nbl_singleton_noncoding_framework_results[['af_category', 'sv_category']] = ['singleton', 'non-coding']

nbl_rare_noncoding_framework_results = pd.read_csv(
    "data/CWAS data for Jett/CWAS sum stats/neuroblastoma_all_noncoding_cwas_concatenated_glm_results_RARE_11_3_23.txt",
    sep="\t",
)
nbl_rare_noncoding_framework_results[['af_category', 'sv_category']] = ['rare', 'non-coding']

nbl_framework_results = pd.concat([nbl_singleton_coding_framework_results, nbl_rare_coding_framework_results, 
                                   nbl_singleton_noncoding_framework_results, nbl_rare_noncoding_framework_results])

In [205]:
nbl_framework_results.head(2)

Unnamed: 0,point_estimate,std_error,z_score,p_value,SV_counts_cases,SV_counts_cases_max,number_of_cases_with_zero_SVs,total_cases,SV_counts_controls,SV_counts_controls_max,...,number_of_unique_SVs,category_name,sv_type,frequency,mean_SVs_per_case,mean_SVs_per_control,mean_SVs_total,negative_log10_p_value,af_category,sv_category
0,0.271918,0.051227,5.3081,1.11e-07,438,5,336,646,2441,4,...,459,ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTI...,ANY,SINGLETON,0.678019,0.519362,0.538533,6.955563,singleton,coding
1,0.276065,0.054006,5.111756,3.19e-07,393,5,356,646,2172,4,...,411,DEL.SINGLETON.ANY.ANY.ANY.protein_coding,DEL,SINGLETON,0.608359,0.462128,0.479798,6.495968,singleton,coding


In [206]:
nbl_coding_framework = pd.read_csv(
    "data/CWAS data for Jett/CWAS frameworks/CWAS_categories_neuroblastoma_coding_8_17_23.txt",
    sep="\t",
)

nbl_noncoding_framework = pd.read_csv(
    "data/CWAS data for Jett/CWAS frameworks/CWAS_rare_categories_neuroblastoma_noncoding_10_2_23.txt",
    sep="\t",
)

# Walk through a coding example

Let's extract all the data we need to test the top-ranked coding category against a single hallmark gene set.

In [207]:
test_framework = nbl_framework_results.query('af_category == "singleton" & sv_category == "coding"').loc[0, ["category_name"]].item()
framework_components = test_framework.split(".")
genic_relationship = framework_components[2]
test_framework, genic_relationship

('ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTIAL_EXON_DUP.ANY.ANY.protein_coding',
 'PREDICTED_LOF_or_PREDICTED_PARTIAL_EXON_DUP')

In [208]:
nbl_coding_framework.head(5)

Unnamed: 0,sv_type,frequency,genic_relationship,constraint,expression,gene_group
0,DUP,RARE,PREDICTED_COPY_GAIN,lof_constrained,expressed_in_adrenal_gland,protein_coding
1,DEL,SINGLETON,PREDICTED_INTRAGENIC_EXON_DUP,missense_constrained,ANY,cosmic_cancer_genes
2,CPX_or_INV,,PREDICTED_LOF_or_PREDICTED_PARTIAL_EXON_DUP,unconstrained,,germline_CPGs
3,INS_ALL,,ANY,ANY,,base_excision_repair_genes
4,ANY,,,,,chromatin_organization_genes


In [209]:
svs_in_category = nbl_category_svs[(nbl_category_svs['sv_category'] == "coding") & 
                                   (nbl_category_svs["category"] == test_framework)]

# subset the actual SV matrix
svs_in_category = svs[svs["name"].isin(svs_in_category["SV"].tolist())]

svs_in_category.head(2)

Unnamed: 0,#chrom,start,end,name,svtype,AC,AF,ALGORITHMS,AN,BOTHSIDES_SUPPORT,...,trio_POPMAX_FREQ_HOMALT,trio_POPMAX_CN_FREQ,trio_POPMAX_CN_NONREF_FREQ,gnomad_v3.1_sv_POPMAX_AF,gnomad_v3.1_sv_POPMAX_FREQ_HOMREF,gnomad_v3.1_sv_POPMAX_FREQ_HET,gnomad_v3.1_sv_POPMAX_FREQ_HOMALT,gnomad_v3.1_sv_POPMAX_CN_FREQ,gnomad_v3.1_sv_POPMAX_CN_NONREF_FREQ,FILTER
64,chr1,923800,943501,PedSV.2.5.2_DEL_chr1_80,DEL,1,7.7e-05,depth,13038,False,...,0.0,,,0.000107,,,,,,PASS
152,chr1,1240217,1243609,PedSV.2.5.2_DEL_chr1_210,DEL,1,7.4e-05,manta,13462,True,...,0.0,,,1.8e-05,,,,,,PASS


In [210]:
svs_in_category.shape

(3837, 952)

Next, we determine the genes in question. The genes are stored in columns of `svs` named after each genic relationship. Since this category's `genic_relationship` is a collapsed `A_or_B` label, we split it into its two component columns.

In [211]:
genic_relationships = genic_relationship.split("_or_")
genic_relationships

['PREDICTED_LOF', 'PREDICTED_PARTIAL_EXON_DUP']

In [212]:
svs_in_category[genic_relationships].head()

Unnamed: 0,PREDICTED_LOF,PREDICTED_PARTIAL_EXON_DUP
64,SAMD11,
152,C1QTNF12,
162,"ACAP3,INTS11,PUSL1,SCNN1D",
173,CPTP,
198,"TMEM88B,VWA1",


We should be a bit more careful if any results turn up positive, but for now we'll just register an SV as contributing to a count for that gene.

In [223]:
nbl_samples = metadata[
    (metadata["neuroblastoma_case"] == True)
]["entity:sample_id"].tolist()

nbl_sv_dosages = (
    dosages
    .set_index("ID")
    .loc[svs_in_category["name"].tolist(), nbl_samples]
)
nbl_sv_dosages.head(2)

ID,PT_00QYKRAX,PT_00Y8C0XA,PT_025YMME2,PT_02AE4RSP,PT_02SNWVRF,PT_06Z51EN5,PT_0CKD259J,PT_0GMP9VVY,PT_0MVMPZKX,PT_11XN6CG5,...,SJ058317,SJ058342,SJ058362,SJ058440,SJ058446,SJ058473,SJ058476,SJ063820,SJ063821,SJ071354
PedSV.2.5.2_DEL_chr1_80,0.0,0.0,0.0,0.0,,0.0,0.0,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PedSV.2.5.2_DEL_chr1_210,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We hit a small snag here. We need to drop samples that are poorly genotyped for these SVs. The strategy that Ryan and Riaz used is to drop samples with > 5% `NaN` genotyping rate. We do that here.

In [224]:
# boolean mask of samples to keep: those with < 5% missing (NaN) dosages
well_genotyped = pd.isnull(nbl_sv_dosages).sum(axis = 0) / len(nbl_sv_dosages) < 0.05
nbl_sv_dosages = nbl_sv_dosages.loc[:, well_genotyped]

In [225]:
nbl_sv_dosages.shape

(3837, 646)

In [226]:
nbl_sv_dosages.iloc[:, 3:].fillna(0).sum(axis = 1)

ID
PedSV.2.5.2_DEL_chr1_80       0.0
PedSV.2.5.2_DEL_chr1_210      0.0
PedSV.2.5.2_DEL_chr1_219      0.0
PedSV.2.5.2_DEL_chr1_248      0.0
PedSV.2.5.2_DEL_chr1_291      1.0
                             ... 
PedSV.2.5.2_DEL_chr22_3934    0.0
PedSV.2.5.2_DEL_chr22_3958    0.0
PedSV.2.5.2_CPX_chr22_67      0.0
PedSV.2.5.2_DEL_chr22_4049    1.0
PedSV.2.5.2_DEL_chr22_4050    1.0
Length: 3837, dtype: float64

So now we can generate our counts, treating `NaN` dosages as 0. Note that these are all singleton SVs, so this point is a little moot.

In [227]:
sv_counts = nbl_sv_dosages.fillna(0).sum(axis=1).sort_values()
sv_counts.head()

ID
PedSV.2.5.2_DEL_chr1_80       0.0
PedSV.2.5.2_DEL_chr12_7605    0.0
PedSV.2.5.2_DEL_chr12_7621    0.0
PedSV.2.5.2_DEL_chr12_7627    0.0
PedSV.2.5.2_DEL_chr12_7653    0.0
dtype: float64

Note that some SVs have 0 counts, presumably because those SVs are present in non-neuroblastoma samples? Let's just verify that.

In [228]:
test_sv = dosages.set_index("ID").loc["PedSV.2.5.2_DEL_chr1_80"].iloc[3:]
test_sv[test_sv == 1]

SJ042098    1.0
Name: PedSV.2.5.2_DEL_chr1_80, dtype: object

In [229]:
metadata.set_index("entity:sample_id").loc["SJ042098"]

ancestry_short_variant_inferred_or_reported                                       NaN
batch                                          PedSV.v2-wgd_score_1-median_coverage_2
study                                                                          StJude
disease                                                                  osteosarcoma
family_id                                                                         NaN
                                                                ...                  
pancan_control                                                                  False
osteosarcoma_control                                                            False
neuroblastoma_control                                                           False
ewing_control                                                                   False
sex                                                                                 0
Name: SJ042098, Length: 73, dtype: object

Yep. Alright, we can move on and actually count things up. First, how many SVs are we actually dealing with here?

In [230]:
sv_counts.sum()

438.0

So maybe that's not so bad? We'll see what happens.

In [231]:
genes_in_svs = svs_in_category[['name'] + genic_relationships].set_index('name')
genes_in_svs.loc[sv_counts.index, 'count'] = sv_counts.astype(int)

# simple enough to go through
gene_counts = []
for index, row in genes_in_svs.iterrows():
    if not pd.isnull(row['PREDICTED_LOF']):
        gene_counts += row['PREDICTED_LOF'].split(',') * row['count']
    if not pd.isnull(row['PREDICTED_PARTIAL_EXON_DUP']):
        gene_counts += row['PREDICTED_PARTIAL_EXON_DUP'].split(',') * row['count']

gene_counts = pd.DataFrame(np.unique(gene_counts, return_counts = True), index = ['gene', 'count']).T
gene_counts.head(2), gene_counts.shape

(    gene count
 0   ABAT     1
 1  ABCA8     1,
 (511, 2))
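
The tallying loop above can also be sketched with `collections.Counter`. This is an equivalent illustration on toy data, assuming each SV carries a comma-separated gene string (or nothing) and an integer carrier count. The `toy_svs` records are made up for the example; only the gene names are borrowed from the table above.

```python
from collections import Counter

# hypothetical records: (comma-separated genes or None, number of carriers)
toy_svs = [
    ("SAMD11", 1),
    ("ACAP3,INTS11,PUSL1,SCNN1D", 2),
    (None, 1),            # SV with no gene annotation in this column
    ("SAMD11,CPTP", 0),   # SV carried by no neuroblastoma case
]

gene_counts = Counter()
for genes, n_carriers in toy_svs:
    if genes is None:
        continue
    for gene in genes.split(","):
        # each carrier of the SV adds one count to every gene it affects
        gene_counts[gene] += n_carriers

# drop genes whose only SVs had zero carriers
gene_counts = Counter({g: c for g, c in gene_counts.items() if c > 0})
print(gene_counts)
```

As in the loop above, an SV with zero carriers contributes nothing, so its genes only appear if another SV hits them.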

Great. Now we can try merging this with a gene set to test significance. We'll try it with a small, well-characterized gene set first.

In [232]:
hallmark = gp.get_library(name='MSigDB_Hallmark_2020')

In [233]:
g2m_checkpoint = hallmark['G2-M Checkpoint']
len(g2m_checkpoint)

200

We need to calculate 4 numbers for our Fisher's exact test:

1. The counts of genes in the gene set and category
2. The counts of genes in the category and not the gene set
3. The counts of genes in the gene set and not the category
4. The counts of genes in neither (~19k)
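
As a sketch, the four cells listed above can be assembled by a small helper like the one below. The name `build_contingency_table` and the toy inputs are assumptions for illustration. This version also subtracts the category's own genes from the "neither" cell, which is the stricter reading of item 4; against a background of ~19k genes the difference is small.

```python
def build_contingency_table(gene_counts, gene_set, n_background=19000):
    """Return [[in category & in set, in category & not in set],
               [in set only,          in neither]].

    gene_counts: dict mapping gene -> number of SV observations in the category
    gene_set: iterable of gene names
    n_background: assumed total number of genes (~19k, as in the text)

    Note the top row sums SV counts while the bottom row counts unique genes.
    """
    gene_set = set(gene_set)
    in_both = sum(c for g, c in gene_counts.items() if g in gene_set)
    category_only = sum(c for g, c in gene_counts.items() if g not in gene_set)
    set_only = len(gene_set - set(gene_counts))
    # unique genes that are in neither the category nor the gene set
    neither = n_background - len(set(gene_counts) | gene_set)
    return [[in_both, category_only], [set_only, neither]]


toy_counts = {"GENE_A": 2, "GENE_B": 1, "GENE_C": 1}
toy_set = ["GENE_A", "GENE_X", "GENE_Y"]
table = build_contingency_table(toy_counts, toy_set, n_background=100)
print(table)
```

The resulting nested list can be passed directly to `scipy.stats.fisher_exact`.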

In [234]:
genes_in_category_and_gs = gene_counts[gene_counts['gene'].isin(g2m_checkpoint)]['count'].sum()
genes_in_category_and_not_gs = gene_counts[~gene_counts['gene'].isin(g2m_checkpoint)]['count'].sum()

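# the bottom row counts unique genes; the background assumes ~19,000 genes in total
# and does not subtract the genes already counted in the category
genes_not_in_category_and_in_gs = len(set(g2m_checkpoint) - set(gene_counts['gene']))
genes_not_in_category_and_not_gs = 19000 - genes_not_in_category_and_in_gs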

Let's do a Fisher's exact test.

In [235]:
cont_table = np.array([[genes_in_category_and_gs, genes_in_category_and_not_gs], 
                       [genes_not_in_category_and_in_gs, genes_not_in_category_and_not_gs]])
scipy.stats.fisher_exact(cont_table)

(0.8896300501466553, 1.0)

So this is not significant. But at least this process makes sense. Now we can generalize a bit.

In [236]:
hallmark_results = []
for gs, genes in hallmark.items():

    genes_in_category_and_gs = gene_counts[gene_counts['gene'].isin(genes)]['count'].sum()
    genes_in_category_and_not_gs = gene_counts[~gene_counts['gene'].isin(genes)]['count'].sum()

    genes_not_in_category_and_in_gs = len(set(genes) - set(gene_counts['gene']))
    genes_not_in_category_and_not_gs = 19000 - genes_not_in_category_and_in_gs
    
    cont_table = np.array([[genes_in_category_and_gs, genes_in_category_and_not_gs], 
                       [genes_not_in_category_and_in_gs, genes_not_in_category_and_not_gs]])
    res, p = scipy.stats.fisher_exact(cont_table)
    
    data = f'{genes_in_category_and_gs}/{genes_in_category_and_not_gs}'
    expected = f'{genes_not_in_category_and_in_gs}/{genes_not_in_category_and_not_gs}'
    
    hallmark_results.append([gs, res, p, data, expected])
    
hallmark_results = pd.DataFrame(hallmark_results, columns = ['gene_set', 'res', 'p', 'data', 'expected'])

In [237]:
hallmark_results.query('p < 0.05')

Unnamed: 0,gene_set,res,p,data,expected
11,Adipogenesis,2.642514,0.001632,14/533,187/18813
25,mTORC1 Signaling,0.0,0.007595,0/547,200/18800


Interesting... we'll follow up on that in a second.

# Walk through a non-coding example

Let's extract all the data we need to test the top-ranked non-TAD non-coding category against a single hallmark gene set.

In [238]:
# we'll select a non-tad framework for testing
nontad_test_framework = nbl_framework_results[(nbl_framework_results['af_category'] == "singleton") & 
                      (nbl_framework_results['sv_category'] == "non-coding") &
                      (~nbl_framework_results['category_name'].str.contains('tad'))].iloc[0]['category_name']

framework_components = nontad_test_framework.split(".")
genic_relationship = framework_components[2]
test_framework, genic_relationship

('ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTIAL_EXON_DUP.ANY.ANY.protein_coding',
 'ANY')

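This is good practice: the component we pulled out is `ANY`, which has a specific meaning in the context of non-coding analysis. Two caveats about the cell above: the first element it echoes is the stale `test_framework` from the coding example rather than `nontad_test_framework`, and in non-coding category names index 2 is the `functional_intersection` component, while the genic relationship sits at index 4 (see the framework columns below). We proceed treating the genic relationship as `ANY`.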

In [239]:
nbl_noncoding_framework.head(6)

Unnamed: 0,sv_type,frequency,functional_intersection,functional_category,genic_relationship,constraint,expression,gene_group
0,DUP,RARE,PREDICTED_NONCODING_BREAKPOINT,neuroblastoma_atac_peaks,PREDICTED_INTERGENIC,lof_constrained,expressed_in_adrenal_gland,protein_coding
1,DEL,SINGLETON,PREDICTED_NONCODING_SPAN,neuroblastoma_chromHMM15_Enh,PREDICTED_INTRONIC,ANY,ANY,cosmic_and_germline_CPGs
2,CPX_or_INV,,ANY,neuroblastoma_chromHMM15_Enh_conserved,PREDICTED_PROMOTER,,,
3,INS_ALL,,,neuroblastoma_chromHMM15_EnhG,PREDICTED_UTR,,,
4,ANY,,,neuroblastoma_chromHMM15_EnhG_conserved,ANY,,,
5,,,,neuroblastoma_H3K27Ac_peak,,,,


So `ANY` really means `PREDICTED_INTERGENIC | PREDICTED_INTRONIC | PREDICTED_PROMOTER | PREDICTED_UTR`.
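
Since the position of the genic relationship differs between coding and non-coding category names, a small helper can make the bookkeeping explicit. The sketch below maps the dot-separated components of a category name onto the column names shown in the two framework tables. It assumes component values never contain a `.` themselves, and the non-coding category name used here is a made-up example in the same style, not one taken from the results.

```python
CODING_FIELDS = [
    "sv_type", "frequency", "genic_relationship",
    "constraint", "expression", "gene_group",
]
NONCODING_FIELDS = [
    "sv_type", "frequency", "functional_intersection", "functional_category",
    "genic_relationship", "constraint", "expression", "gene_group",
]


def parse_category(category_name, sv_category):
    """Split a category name into named components (assumes no '.' inside values)."""
    fields = CODING_FIELDS if sv_category == "coding" else NONCODING_FIELDS
    parts = category_name.split(".")
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} components, got {len(parts)}")
    return dict(zip(fields, parts))


coding = parse_category(
    "ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTIAL_EXON_DUP.ANY.ANY.protein_coding",
    "coding",
)
# hypothetical non-coding category name, for illustration only
noncoding = parse_category(
    "DEL.SINGLETON.ANY.neuroblastoma_H3K27Ac_peak.ANY.ANY.ANY.protein_coding",
    "non-coding",
)
print(coding["genic_relationship"], noncoding["genic_relationship"])
```

Looking components up by name avoids hard-coding index 2 or index 4 in each cell.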

In [240]:
genic_relationships = ['PREDICTED_INTERGENIC', 'PREDICTED_INTRONIC', 'PREDICTED_PROMOTER', 'PREDICTED_UTR']

# a weird feature of these data is that PREDICTED_INTERGENIC is actually boolean, and refers to PREDICTED_NEAREST_TSS
genic_relationships[genic_relationships.index('PREDICTED_INTERGENIC')] = 'PREDICTED_NEAREST_TSS'

Subset down to those SVs.

In [241]:
svs_in_category = nbl_category_svs[(nbl_category_svs['sv_category'] == "non-coding") & 
                                   (nbl_category_svs["category"] == nontad_test_framework)]

# subset the actual SV matrix
svs_in_category = svs[svs["name"].isin(svs_in_category["SV"].tolist())]

svs_in_category.head(2)

Unnamed: 0,#chrom,start,end,name,svtype,AC,AF,ALGORITHMS,AN,BOTHSIDES_SUPPORT,...,trio_POPMAX_FREQ_HOMALT,trio_POPMAX_CN_FREQ,trio_POPMAX_CN_NONREF_FREQ,gnomad_v3.1_sv_POPMAX_AF,gnomad_v3.1_sv_POPMAX_FREQ_HOMREF,gnomad_v3.1_sv_POPMAX_FREQ_HET,gnomad_v3.1_sv_POPMAX_FREQ_HOMALT,gnomad_v3.1_sv_POPMAX_CN_FREQ,gnomad_v3.1_sv_POPMAX_CN_NONREF_FREQ,FILTER
120,chr1,1116266,1116473,PedSV.2.5.2_DEL_chr1_165,DEL,1,7.4e-05,manta,13462,True,...,0.0,,,0.0,,,,,,PASS
137,chr1,1157302,1157390,PedSV.2.5.2_DEL_chr1_186,DEL,1,7.4e-05,wham,13462,False,...,0.0,,,3.1e-05,,,,,,PASS


In [242]:
svs_in_category.shape

(3021, 952)

In [243]:
svs_in_category[genic_relationships].head()

Unnamed: 0,PREDICTED_NEAREST_TSS,PREDICTED_INTRONIC,PREDICTED_PROMOTER,PREDICTED_UTR
120,,,C1orf159,
137,TTLL10,,,
224,,ATAD3B,,
584,,,C1orf174,
836,HES3,,,


We should be a bit more careful if any results turn up positive, but for now we'll just register an SV as contributing to a count for that gene.

In [244]:
nbl_samples = metadata[
    (metadata["neuroblastoma_case"] == True)
]["entity:sample_id"].tolist()

nbl_sv_dosages = (
    dosages
    .set_index("ID")
    .loc[svs_in_category["name"].tolist(), nbl_samples]
)
nbl_sv_dosages.head(2)

ID,PT_00QYKRAX,PT_00Y8C0XA,PT_025YMME2,PT_02AE4RSP,PT_02SNWVRF,PT_06Z51EN5,PT_0CKD259J,PT_0GMP9VVY,PT_0MVMPZKX,PT_11XN6CG5,...,SJ058317,SJ058342,SJ058362,SJ058440,SJ058446,SJ058473,SJ058476,SJ063820,SJ063821,SJ071354
PedSV.2.5.2_DEL_chr1_165,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PedSV.2.5.2_DEL_chr1_186,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Again, we subset our dosage matrix and drop samples with poor genotyping rates.

In [245]:
# boolean mask of samples to keep: those with < 5% missing (NaN) dosages
well_genotyped = pd.isnull(nbl_sv_dosages).sum(axis = 0) / len(nbl_sv_dosages) < 0.05
nbl_sv_dosages = nbl_sv_dosages.loc[:, well_genotyped]

In [246]:
nbl_sv_dosages.shape

(3021, 646)

In [247]:
sv_counts = nbl_sv_dosages.fillna(0).sum(axis=1).sort_values()
sv_counts.head()

ID
PedSV.2.5.2_DEL_chr1_165     0.0
PedSV.2.5.2_DEL_chr12_612    0.0
PedSV.2.5.2_DEL_chr12_615    0.0
PedSV.2.5.2_DEL_chr12_705    0.0
PedSV.2.5.2_DEL_chr12_839    0.0
dtype: float64

First, how many SVs are we actually dealing with here?

In [248]:
sv_counts.sum()

339.0

So maybe that's not so bad? We'll see what happens.

In [249]:
genes_in_svs = svs_in_category[['name'] + genic_relationships].set_index('name')
genes_in_svs.loc[sv_counts.index, 'count'] = sv_counts.astype(int)

# simple enough to go through
gene_counts = []
for index, row in genes_in_svs.iterrows():
    for col in genic_relationships:
        if not pd.isnull(row[col]):
            gene_counts += row[col].split(',') * row['count']

gene_counts = pd.DataFrame(np.unique(gene_counts, return_counts = True), index = ['gene', 'count']).T
gene_counts.head(2)

Unnamed: 0,gene,count
0,ABCD3,1
1,ACP6,1


Calculate with Fisher's exact:

In [250]:
genes_in_category_and_gs = gene_counts[gene_counts['gene'].isin(g2m_checkpoint)]['count'].sum()
genes_in_category_and_not_gs = gene_counts[~gene_counts['gene'].isin(g2m_checkpoint)]['count'].sum()

genes_not_in_category_and_in_gs = len(set(g2m_checkpoint) - set(gene_counts['gene']))
genes_not_in_category_and_not_gs = 19000 - genes_not_in_category_and_in_gs

Let's do a Fisher's exact test.

In [251]:
cont_table = np.array([[genes_in_category_and_gs, genes_in_category_and_not_gs], 
                       [genes_not_in_category_and_in_gs, genes_not_in_category_and_not_gs]])
scipy.stats.fisher_exact(cont_table)

(0.854746571709978, 1.0)

So this is not significant. But at least this process makes sense. Now we can generalize a bit.

In [252]:
hallmark_results = []
for gs, genes in hallmark.items():

    genes_in_category_and_gs = gene_counts[gene_counts['gene'].isin(genes)]['count'].sum()
    genes_in_category_and_not_gs = gene_counts[~gene_counts['gene'].isin(genes)]['count'].sum()

    genes_not_in_category_and_in_gs = len(set(genes) - set(gene_counts['gene']))
    genes_not_in_category_and_not_gs = 19000 - genes_not_in_category_and_in_gs
    
    cont_table = np.array([[genes_in_category_and_gs, genes_in_category_and_not_gs], 
                       [genes_not_in_category_and_in_gs, genes_not_in_category_and_not_gs]])
    res, p = scipy.stats.fisher_exact(cont_table)
    
    data = f'{genes_in_category_and_gs}/{genes_in_category_and_not_gs}'
    expected = f'{genes_not_in_category_and_in_gs}/{genes_not_in_category_and_not_gs}'
    
    hallmark_results.append([gs, res, p, data, expected])
    
hallmark_results = pd.DataFrame(hallmark_results, columns = ['gene_set', 'res', 'p', 'data', 'expected'])

In [253]:
hallmark_results.query('p < 0.05')

Unnamed: 0,gene_set,res,p,data,expected
0,TNF-alpha Signaling via NF-kB,3.00233,0.002914,10/328,191/18809
3,Mitotic Spindle,2.387308,0.024082,8/330,191/18809
11,Adipogenesis,2.374747,0.02472,8/330,192/18808
16,Protein Secretion,3.085912,0.02769,5/333,92/18908
23,Unfolded Protein Response,2.602281,0.049873,5/333,109/18891
38,UV Response Dn,3.822695,0.000987,9/329,135/18865
43,Bile Acid Metabolism,2.626515,0.048346,5/333,108/18892
45,Allograft Rejection,2.693884,0.008778,9/329,191/18809
49,Pancreas Beta Cells,4.468657,0.034603,3/335,38/18962


Interesting... we'll follow up on that in a second.

# Generalized gene set enrichment

Alright, we've been through two examples. Now let's try generalizing across two axes: categories and gene sets. For now, we'll only consider categories that are significant in neuroblastoma.

I don't know if these `p_values` are already corrected or not. I'll assume they're not.

## Define the categories for analysis

Here, we'll select which categories we want to examine. We'll stick to neuroblastoma, but we'll examine `singleton` and `rare`, as well as `noncoding` and `coding`.

Worth mentioning that the `noncoding` categories could be quite difficult to interpret.

In [254]:
nbl_analysis_categories = nbl_framework_results.query('negative_log10_p_value > 3.5')
nbl_analysis_categories.shape

(77, 22)

Here, we'll also define a helpful lookup that maps from the "collapsed" genic relationships to all their component relationships.

In [255]:
gr_coding_mapping = {'PREDICTED_LOF_or_PREDICTED_PARTIAL_EXON_DUP': ['PREDICTED_LOF', 'PREDICTED_PARTIAL_EXON_DUP'],
                     'ANY': ['PREDICTED_COPY_GAIN', 'PREDICTED_INTRAGENIC_EXON_DUP', 'PREDICTED_LOF', 'PREDICTED_PARTIAL_EXON_DUP']}

gr_noncoding_mapping = {'ANY': ['PREDICTED_INTERGENIC', 'PREDICTED_INTRONIC', 'PREDICTED_PROMOTER', 'PREDICTED_UTR']}
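
As a usage sketch, the lookups can be wrapped in a helper that expands a collapsed relationship into its component columns. The function name `expand_genic_relationship` and the upper-case constant names are assumptions for illustration. The helper returns a new list on every call, so the lookup tables themselves are never modified, and it applies the `PREDICTED_INTERGENIC` to `PREDICTED_NEAREST_TSS` swap described earlier.

```python
GR_CODING = {
    "PREDICTED_LOF_or_PREDICTED_PARTIAL_EXON_DUP": [
        "PREDICTED_LOF", "PREDICTED_PARTIAL_EXON_DUP",
    ],
    "ANY": [
        "PREDICTED_COPY_GAIN", "PREDICTED_INTRAGENIC_EXON_DUP",
        "PREDICTED_LOF", "PREDICTED_PARTIAL_EXON_DUP",
    ],
}
GR_NONCODING = {
    "ANY": [
        "PREDICTED_INTERGENIC", "PREDICTED_INTRONIC",
        "PREDICTED_PROMOTER", "PREDICTED_UTR",
    ],
}


def expand_genic_relationship(gr, sv_category):
    """Return the SV-table columns to read genes from for one genic relationship."""
    mapping = GR_CODING if sv_category == "coding" else GR_NONCODING
    # uncollapsed relationships map to themselves
    columns = mapping.get(gr, [gr])
    # intergenic SVs are annotated via the nearest TSS column instead
    return [
        "PREDICTED_NEAREST_TSS" if c == "PREDICTED_INTERGENIC" else c
        for c in columns
    ]


print(expand_genic_relationship("ANY", "non-coding"))
print(expand_genic_relationship("PREDICTED_COPY_GAIN", "coding"))
```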


## Define the gene sets for analysis

Let's highlight some specific gene sets for analysis. We'll do the following:

* `MSigDB_Hallmark_2020`
* `GO_Biological_Process_2023` (this is the default for GO term analysis)
* `Reactome_2022`

We'll begin with these, and then we can add in other specific ones that might be relevant later:

In [256]:
gs = {}
for db_name in ['MSigDB_Hallmark_2020', 'GO_Biological_Process_2023', 'Reactome_2022']:
    db = gp.get_library(name=db_name)
    gs[db_name] = db

In [258]:
adrenal_genes = pd.read_csv('ref/adrenal-specific-genes.txt', sep='\t', comment = '#')['Gene Name'].tolist()
gs['custom'] = {'adrenal-specific-exp': adrenal_genes}

In [259]:
gs_count = 0
for db_name, db in gs.items():
    gs_count += len(db.values())
gs_count

7273

Clearly that's going to lead to some false positives, but it is what it is.
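
One common way to temper this is to control the false discovery rate across all tests. Below is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment. The function name `benjamini_hochberg` is an assumption for illustration, and the notebook itself does not apply any correction. In practice, `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` provides the same adjustment.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.005]
toy_adjusted = benjamini_hochberg(toy_p)
print(toy_adjusted)
```

Whether to correct within each category or across all category and gene set pairs is a separate design choice that this sketch does not settle.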

## Run the thing

This code will need to be decently adaptable, since it has to handle a few different unique components (noncoding categories, etc).

In [316]:
nbl_samples = metadata[
    (metadata["neuroblastoma_case"] == True)
]["entity:sample_id"].tolist()

In [317]:
gse_results = []

for i, (index, row) in enumerate(nbl_analysis_categories.iterrows()):
    print(i, end = ', ')
    cat_name = row['category_name']
    af_category = row['af_category']
    sv_category = row['sv_category']
    p_category = row['p_value']
    
    cat_components = cat_name.split('.')
    
    # here, we define the necessary genic relationships
    # we handle the collapsed categories as well
    if sv_category == 'coding':
        gr = cat_components[2]

        # convert gr to components
        genic_rel = gr_coding_mapping.get(gr, [gr])
        
    elif sv_category == 'non-coding':
        gr = cat_components[4]

        # convert gr to components
        genic_rel = gr_noncoding_mapping.get(gr, [gr])
        
    # swap out intergenic for nearest_tss (on a new list, so the shared lookup is not mutated)
    genic_rel = ['PREDICTED_NEAREST_TSS' if g == 'PREDICTED_INTERGENIC' else g for g in genic_rel]
        
    # next, we pull out our SVs in this category
    svs_in_category = nbl_category_svs[(nbl_category_svs['sv_category'] == sv_category) & 
                                       (nbl_category_svs["category"] == cat_name)]
    
    # subset the actual SV matrix
    svs_in_category = svs[svs["name"].isin(svs_in_category["SV"].tolist())]
    
    # extract the dosages
    nbl_sv_dosages = (
        dosages
        .set_index("ID")
        .loc[svs_in_category["name"].tolist(), nbl_samples]
        )
    
    # keep only samples with < 5% missing (NaN) dosages for these SVs
    well_genotyped = pd.isnull(nbl_sv_dosages).sum(axis = 0) / len(nbl_sv_dosages) < 0.05
    nbl_sv_dosages = nbl_sv_dosages.loc[:, well_genotyped]
    
    # convert dosages to counts
    sv_counts = nbl_sv_dosages.fillna(0).sum(axis=1).sort_values()
    
    # define the number of unique SVs
    sv_counts = sv_counts[sv_counts > 0]
    num_unique_svs = len(sv_counts)
    
    # define our gene counts
    genes_in_svs = svs_in_category[['name'] + genic_rel].set_index('name').loc[sv_counts.index]
    genes_in_svs['count'] = sv_counts.astype(int)
    
    gene_counts = []
    for index, row in genes_in_svs.iterrows():
        for col in genic_rel:
            if not pd.isnull(row[col]):
                gene_counts += row[col].split(',') * row['count']
           
    gene_counts = pd.DataFrame(np.unique(gene_counts, return_counts = True), index = ['gene', 'count']).T
    
    # for each gene set, create a contingency matrix and calculate our
    # fisher's result
    base_row = [cat_name, af_category, sv_category, p_category]
    for db_name, db in gs.items():
        for gs_name, gs_genes in db.items():
            top_left = gene_counts[gene_counts['gene'].isin(gs_genes)]['count'].sum()
            top_right = gene_counts[~gene_counts['gene'].isin(gs_genes)]['count'].sum()

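            # bottom row counts unique genes; the background assumes ~19,000 genes and
            # does not subtract the genes already counted in the category
            bottom_left = len(set(gs_genes) - set(gene_counts['gene']))
            bottom_right = 19000 - bottom_left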
            
            cont_table = np.array([[top_left, top_right], 
                                   [bottom_left, bottom_right]])
            res, p = scipy.stats.fisher_exact(cont_table)
            
            unique_overlaps = len(set(gs_genes) & set(gene_counts['gene']))
            data = f'{top_left}/{top_right + top_left}'
            expected = f'{bottom_left}/{bottom_right + bottom_left}'
            
            # store our data
            row = base_row + [db_name, gs_name, unique_overlaps, data, expected, p]
            gse_results.append(row)
            
names = ['category', 'af_category', 'sv_category', 'category_p', 'db', 'gs', 'gs_unique_overlap', 'data', 'expected', 'p']
gse_results = pd.DataFrame(gse_results, columns = names)

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 

In [318]:
gse_results.query('p < 0.05')

Unnamed: 0,category,af_category,sv_category,category_p,db,gs,gs_unique_overlap,data,expected,p
11,ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTI...,singleton,coding,1.110000e-07,MSigDB_Hallmark_2020,Adipogenesis,13,14/547,187/19000,1.632194e-03
25,ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTI...,singleton,coding,1.110000e-07,MSigDB_Hallmark_2020,mTORC1 Signaling,0,0/547,200/19000,7.594762e-03
65,ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTI...,singleton,coding,1.110000e-07,GO_Biological_Process_2023,ATP Metabolic Process (GO:0046034),4,5/547,30/19000,2.732070e-03
119,ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTI...,singleton,coding,1.110000e-07,GO_Biological_Process_2023,DNA Recombination (GO:0006310),4,4/547,38/19000,2.935292e-02
139,ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTI...,singleton,coding,1.110000e-07,GO_Biological_Process_2023,ER-associated Misfolded Protein Catabolic Proc...,1,2/547,8/19000,3.031102e-02
...,...,...,...,...,...,...,...,...,...,...
559833,DEL.RARE.ANY.recombination_hotspot_conserved.A...,rare,non-coding,2.740000e-04,Reactome_2022,TP53 Regulates Transcription Of Additional Cel...,1,1/26,13/19000,1.896912e-02
559866,DEL.RARE.ANY.recombination_hotspot_conserved.A...,rare,non-coding,2.740000e-04,Reactome_2022,Tie2 Signaling R-HSA-210993,1,1/26,16/19000,2.298868e-02
559888,DEL.RARE.ANY.recombination_hotspot_conserved.A...,rare,non-coding,2.740000e-04,Reactome_2022,Transcriptional Regulation By RUNX2 R-HSA-8878166,1,11/26,118/19000,6.387280e-18
559908,DEL.RARE.ANY.recombination_hotspot_conserved.A...,rare,non-coding,2.740000e-04,Reactome_2022,Transmission Across Chemical Synapses R-HSA-11...,1,2/26,245/19000,4.446992e-02


In [319]:
gse_results.query('category == "ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTIAL_EXON_DUP.ANY.ANY.protein_coding" & p < 0.05 & db == "MSigDB_Hallmark_2020"').sort_values(by = 'p')

Unnamed: 0,category,af_category,sv_category,category_p,db,gs,gs_unique_overlap,data,expected,p
11,ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTI...,singleton,coding,1.11e-07,MSigDB_Hallmark_2020,Adipogenesis,13,14/547,187/19000,0.001632
25,ANY.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTI...,singleton,coding,1.11e-07,MSigDB_Hallmark_2020,mTORC1 Signaling,0,0/547,200/19000,0.007595


In [285]:
# gs['GO_Biological_Process_2023']['Neurogenesis (GO:0022008)']

In [320]:
gse_results[gse_results['gs'].str.contains('Neurogenesis|Nervous')].query('p < 0.05').sort_values(by = 'p')

Unnamed: 0,category,af_category,sv_category,category_p,db,gs,gs_unique_overlap,data,expected,p
337043,ANY.SINGLETON.ANY.recombination_hotspot_conser...,singleton,non-coding,0.000313,GO_Biological_Process_2023,Nervous System Development (GO:0007399),11,13/115,422/19000,0.000002
198856,DEL.SINGLETON.ANY.recombination_hotspot_conser...,singleton,non-coding,0.000012,GO_Biological_Process_2023,Nervous System Development (GO:0007399),9,11/87,424/19000,0.000004
197176,DEL.SINGLETON.ANY.recombination_hotspot_conser...,singleton,non-coding,0.000012,GO_Biological_Process_2023,Central Nervous System Development (GO:0007417),7,8/87,276/19000,0.000046
335363,ANY.SINGLETON.ANY.recombination_hotspot_conser...,singleton,non-coding,0.000313,GO_Biological_Process_2023,Central Nervous System Development (GO:0007417),8,9/115,275/19000,0.000055
518256,DEL.RARE.ANY.recombination_hotspot_conserved.P...,rare,non-coding,0.000228,GO_Biological_Process_2023,Myelination In Peripheral Nervous System (GO:0...,1,2/19,11/19000,0.000073
...,...,...,...,...,...,...,...,...,...,...
220900,DEL.SINGLETON.ANY.neuroblastoma_H3K27Ac_peak.A...,singleton,non-coding,0.000146,GO_Biological_Process_2023,Peripheral Nervous System Neuron Development (...,1,1/178,4/19000,0.045559
215482,ANY.SINGLETON.ANY.neuroblastoma_chromHMM15_Enh...,singleton,non-coding,0.000074,GO_Biological_Process_2023,Regulation Of Nervous System Process (GO:0031644),1,2/354,17/19000,0.046481
214244,ANY.SINGLETON.ANY.neuroblastoma_chromHMM15_Enh...,singleton,non-coding,0.000074,GO_Biological_Process_2023,Positive Regulation Of Nervous System Process ...,1,2/354,17/19000,0.046481
328352,ANY.SINGLETON.PREDICTED_NONCODING_BREAKPOINT.n...,singleton,non-coding,0.000309,GO_Biological_Process_2023,Enteric Nervous System Development (GO:0048484),1,1/154,5/19000,0.047287


In [271]:
[g for g in gs['GO_Biological_Process_2023'].keys() if 'neurogenesis' in g.lower()]

['Negative Regulation Of Neurogenesis (GO:0050768)',
 'Neurogenesis (GO:0022008)',
 'Positive Regulation Of Neurogenesis (GO:0050769)',
 'Regulation Of Neurogenesis (GO:0050767)']

In [None]:
def calc_p_dev_rank():
    raise NotImplementedError  # unfinished stub