In [1]:
import glob
from collections import defaultdict

import gseapy as gp
import numpy as np
import pandas as pd
import scipy

In [1]:
# efficient fishers tests
import pyranges
import statsmodels.stats.multitest

# CWAS Gene Set Enrichment

In this notebook, we implement gene set enrichment analysis for CWAS categories.

## How is gene set enrichment defined generally?

Gene set enrichment (or "GO term analysis"--the terms are interchangeable in this context) generally attempts to determine if groups of genes are overrepresented in a particular context. Gene set enrichment always requires three components:

1. A gene set (a list of genes)
2. A background "reference" from which genes are "drawn" under the null model
3. An "observed" set of genes (e.g. from an experiment)

There are a variety of tests that can be performed on the above data, but the most common is the Fisher's exact test, which tests if membership in the observed set of genes is associated with membership in the gene set. The classic case in which this is used is differential expression with RNA sequencing. Consider an example below:

1. A gene set corresponding to DNA damage repair: `BRCA1, BRCA2, TP53, MSH1, MSH2, MSH6`
2. An RNA-sequencing experiment in which all ~19k genes are sequenced (and are thus "eligible" to be differentially expressed genes),
3. A list of differentially expressed genes (`BRCA1, BRCA2, TTN`)

A gene set enrichment analysis might then be a Fisher's exact table of the following formulation:

```
number of genes in gene set and differentially expressed (n = 2) | number of genes in gene set and not differentially expressed (n = 4)
number of genes not in gene set and differentially expressed (n = 1)| number of genes not in gene set and not differentially expressed (n = 18993)
```

In [3]:
# Fisher's exact test for the above contingency table
stat, p = scipy.stats.fisher_exact(np.array([[2, 4], [1, 18993]]))
print(p)

2.4928560531097303e-07
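As a cross-check, the two-sided p-value above can be reproduced from the hypergeometric distribution using only the standard library. This is a minimal sketch, not a description of how `scipy.stats.fisher_exact` works internally. The helper name `fisher_exact_two_sided` is hypothetical, and the two-sided rule assumed here sums the probabilities of every table with the same margins that is no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b            # first row total (e.g. genes in the gene set)
    col1 = a + c            # first column total (e.g. differentially expressed genes)
    total = a + b + c + d

    def weight(x):
        # unnormalized hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(total - row1, col1 - x)

    lowest = max(0, col1 - (total - row1))
    highest = min(row1, col1)
    observed = weight(a)
    # integer arithmetic: add up all tables that are no more likely than the observed one
    extreme = sum(weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed)
    return extreme / comb(total, col1)


p = fisher_exact_two_sided(2, 4, 1, 18993)
print(p)
```

Only the tables with 2 or 3 gene-set genes among the differentially expressed genes contribute to the sum here, which is why the p-value is so small.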


Another way to conceptualize the Fisher's test is in terms of "true positives", "true negatives", etc. In this way, the above contingency table would be:

```
True positives (n = 2) | False positives (n = 4)
False negatives (n = 1)| True negatives (n = 18993)
```
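The four cells can be derived from the three components with plain set operations. The sketch below is illustrative only: the function name `contingency_from_sets` and the toy gene names `G0` to `G19` are hypothetical and do not come from the analysis.

```python
def contingency_from_sets(gene_set, observed, background):
    """Return [[TP, FP], [FN, TN]] in the layout of the table above."""
    background = set(background)
    gene_set = set(gene_set) & background      # ignore genes outside the background
    observed = set(observed) & background
    tp = len(gene_set & observed)              # in gene set and observed
    fp = len(gene_set - observed)              # in gene set, not observed
    fn = len(observed - gene_set)              # observed, not in gene set
    tn = len(background) - tp - fp - fn        # neither
    return [[tp, fp], [fn, tn]]


# hypothetical toy universe of 20 genes
background = [f"G{i}" for i in range(20)]
gene_set = ["G0", "G1", "G2", "G3"]
observed = ["G0", "G1", "G7"]

table = contingency_from_sets(gene_set, observed, background)
print(table)  # [[2, 2], [1, 15]]
```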

## How is gene set enrichment defined in this context?

In the context of this project, we are interested in the categories that are demonstrated to have higher burden in cases vs. controls. If specific gene sets are enriched in these categories, this may reflect an underlying biological process that is targeted by the increased SV burden.

Rather than unique genes, the process we define makes use of the fact that the Fisher's exact test is well defined for _counts_ generally (in this case, SV counts). Traditional packages for GO term enrichment analysis do not allow for gene duplicates (e.g. a gene hit multiple times by SVs), but duplicates are valid here. We use the following Fisher's contingency table for a given category and a given gene set:

```
Count of gene-set genes affected by category SVs | Count of non-gene-set genes affected by category SVs
Count of gene-set genes affected by non-category SVs | Count of non-gene-set genes affected by non-category SVs
```

We add a continuity correction (pseudocount) of 1 to all cells, and the "background" of non-category SVs is defined as all rare SVs that affect genes. This "background" is further filtered such that:

1) the background SV class matches the category of interest (coding vs. noncoding)
2) SVs affecting more than 10 genes are removed (across coding and noncoding effects)
3) SVs that do not have a coding/noncoding consequence detailed in the CWAS are removed
4) noncoding SVs farther than 500 kb from their nearest gene for the noncoding consequence “INTERGENIC” are removed.
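
The count-based table and the pseudocount described above can be sketched on toy data. The record layout `(sv_name, gene, dose)`, the function name `count_table`, and all SV and gene names below are assumptions made for illustration. The notebook builds the same quantities from its own data frames.

```python
def count_table(records, gene_set, category_svs, pseudocount=1):
    """Build the 2x2 table of SV dosage counts.

    records: iterable of (sv_name, gene, dose) tuples, one per SV-gene pair.
    """
    gene_set = set(gene_set)
    category_svs = set(category_svs)
    table = [[0, 0], [0, 0]]
    for sv_name, gene, dose in records:
        row = 0 if sv_name in category_svs else 1   # category vs. non-category SV
        col = 0 if gene in gene_set else 1          # gene-set vs. non-gene-set gene
        table[row][col] += dose                     # repeated hits on a gene add up
    return [[cell + pseudocount for cell in row] for row in table]


# hypothetical records: (SV, affected gene, number of alleles carried in the cohort)
records = [
    ("sv1", "GENE_A", 2),
    ("sv2", "GENE_A", 1),
    ("sv2", "GENE_B", 1),
    ("sv3", "GENE_C", 3),
    ("sv4", "GENE_A", 1),
]
table = count_table(records, gene_set={"GENE_A"}, category_svs={"sv1", "sv2"})
print(table)  # [[4, 2], [2, 4]]
```

Note that `GENE_A` is counted three times in the top-left cell, which a unique-gene enrichment test would not allow.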

## A sidenote on baseline enrichment

Some categories, particularly those that restrict to regions of the genome like "adrenally expressed", will by definition be enriched for particular gene sets. In these cases, it is particularly important to compare the enrichment in cases against the enrichment in controls.

# Load in the data

Here, we load in the SVs and associated sample dosages.

## Read in SVs, metadata, dosages, and reference genes

In [4]:
# define SVs and dosages for discovery and validation
sv_path = "gs://vanallen-pedsv-analysis/beds/PedSV.v2.5.3.full_cohort.analysis_samples.sites.bed.gz"
dosages_path = "gs://vanallen-pedsv-analysis/beds/PedSV.v2.5.3.full_cohort.analysis_samples.allele_dosages.bed.gz"

# define metadata
metadata_path = "gs://vanallen-pedsv-analysis/sample_info/PedSV.v2.5.3.cohort_metadata.w_control_assignments.tsv.gz"
samples_path = "gs://vanallen-pedsv-analysis/sample_info/PedSV.v2.5.3.final_analysis_cohort.samples.list"

gene_ref_path = "data/updated-cwas/genes/gencode_hg38_protein_coding_genes_for_annotation_7_31_23 (1).txt"

In [5]:
gene_ref = pd.read_csv(gene_ref_path)

# ENSG00 genes are removed
gene_ref = gene_ref[~gene_ref["value"].str.startswith("ENSG00")]
gene_ref = sorted(set(gene_ref["value"].tolist()))
len(gene_ref)

19201

In [6]:
intergenic_sv_to_gene_distances = pd.read_csv('data/updated-cwas/intergenic-sv-to-gene-distances.csv')

Load metadata and SVs

In [7]:
metadata = pd.read_csv(
    metadata_path,
    sep="\t",
)

# add a sex label to metadata (1 if the rounded chrX copy number is below 2, else 0)
metadata["sex"] = (metadata["chrX_CopyNumber"].round() < 2).astype(int)

###############
### Samples ###
###############
samples = defaultdict(dict)

total_samples = []
for disease in ["neuroblastoma", "ewing", "osteosarcoma"]:
    for cohort in ["case", "control"]:
        disease_cohort_samples = metadata[(metadata[f"{disease}_{cohort}"] == True)][
            "entity:sample_id"
        ].tolist()

        samples[disease][cohort] = disease_cohort_samples
        total_samples += disease_cohort_samples

        print(disease, cohort, len(disease_cohort_samples))

total_samples = sorted(set(total_samples))

neuroblastoma case 688
neuroblastoma control 4830
ewing case 773
ewing control 4574
osteosarcoma case 284
osteosarcoma control 4805


Now we load the SVs.

In [8]:
###############
##### SVs #####
###############

# only use the first 49 columns of the SV file, which includes affected genes
svs = pd.read_csv(sv_path, sep="\t", usecols=range(49))

###############
### Dosages ###
###############

# we only need the dosages of our samples in question
dosage_head = pd.read_csv(dosages_path, sep="\t", index_col=False, nrows=1)

cols = [3] + [i for i, c in enumerate(dosage_head.columns) if c in total_samples]

dosages = pd.read_csv(dosages_path, sep="\t", index_col=False, usecols=cols)


In [9]:
svs.head(2)

Unnamed: 0,#chrom,start,end,name,svtype,AC,AF,ALGORITHMS,AN,BOTHSIDES_SUPPORT,...,PREDICTED_INV_SPAN,PREDICTED_LOF,PREDICTED_MSV_EXON_OVERLAP,PREDICTED_NEAREST_TSS,PREDICTED_NONCODING_BREAKPOINT,PREDICTED_NONCODING_SPAN,PREDICTED_PARTIAL_EXON_DUP,PREDICTED_PROMOTER,PREDICTED_TSS_DUP,PREDICTED_UTR
0,chr1,12000,30001,PedSV.2.5.2_CNV_chr1_1,CNV,0,,depth,0,False,...,,,,OR4F5,"ewing_chromHMM15_Quies,neuroblastoma_chromHMM1...","neuroblastoma_chromHMM15_EnhG_conserved,neurob...",,,,
1,chr1,12000,40001,PedSV.2.5.2_DUP_chr1_1,DUP,356,0.027423,depth,12982,False,...,,,,OR4F5,"ewing_chromHMM15_Quies,neuroblastoma_chromHMM1...","neuroblastoma_chromHMM15_EnhG_conserved,neurob...",,,,


In [10]:
dosages = dosages.set_index("ID")

Add the intergenic distances, if relevant

In [11]:
svs = svs.merge(intergenic_sv_to_gene_distances[['name', 'distance']].rename(columns = {'distance': 'distance_to_nearest_gene'}), 
          on = 'name', how = 'left')

In [12]:
svs.shape

(228887, 50)

## Read in the category SVs and category results

Here, we read in the category SVs (which SVs are in each category) and the category results (case vs. control burdens)

In [13]:
###########
### SVs ###
###########
category_svs = []
for file in glob.glob("data/updated-cwas/svs-in-categories/*.txt"):
    disease = file.split("/")[-1].split("_")[0]

    sv_category = "noncoding" if "noncoding" in file else "coding"

    cat_svs = pd.read_csv(file, sep="\t")
    cat_svs[["disease", "sv_category"]] = [disease, sv_category]
    category_svs.append(cat_svs)

category_svs = pd.concat(category_svs)
category_svs.head(2)

Unnamed: 0,SV,chrom,start,end,category,disease,sv_category
0,PedSV.2.5.2_DUP_chr1_455,chr1,6720524,6722070,DUP.RARE.PREDICTED_NONCODING_BREAKPOINT.ewing_...,ewing,noncoding
1,PedSV.2.5.2_DUP_chr1_898,chr1,23649481,23651544,DUP.RARE.PREDICTED_NONCODING_BREAKPOINT.ewing_...,ewing,noncoding


Read in category burden testing:

In [14]:
#########################
### FRAMEWORK RESULTS ###
#########################
columns = ['category_name', 'point_estimate', 'std_error', 'z_score', 'p_value']
framework_results = []
for file in glob.glob("data/updated-cwas/summary-stats/*.txt"):

    disease = file.split("/")[-1].split("_")[0]

    sv_category = "noncoding" if "noncoding" in file else "coding"

    data = pd.read_csv(file, sep="\t", usecols = columns)
    data[["disease", "sv_category"]] = [
        disease,
        sv_category,
    ]
    framework_results.append(data)

framework_results = pd.concat(framework_results)
framework_results['negative_log10_p_value'] = -np.log10(framework_results['p_value'])

In [15]:
framework_results.head(2)

Unnamed: 0,point_estimate,std_error,z_score,p_value,category_name,disease,sv_category,negative_log10_p_value
0,0.219704,0.58782,0.373761,0.708582,DUP.RARE.PREDICTED_NONCODING_BREAKPOINT.ewing_...,ewing,noncoding,0.14961
1,-0.14026,0.584964,-0.239775,0.810504,DUP.RARE.PREDICTED_NONCODING_BREAKPOINT.ewing_...,ewing,noncoding,0.091245


Here, we import a file that depicts all the possible filters that can be applied to categories:

In [16]:
framework_schema = defaultdict(dict)
for file in glob.glob("data/updated-cwas/schema/*.txt"):

    sv_category = "noncoding" if "noncoding" in file else "coding"
    suffix = file.split("/")[-1]
    if sv_category == "coding":
        disease = suffix.split('_')[2]
    else:
        disease = suffix.split('_')[3]

    data = pd.read_csv(file, sep="\t")
    framework_schema[disease][sv_category] = data

In [17]:
framework_schema["neuroblastoma"]["coding"]

Unnamed: 0,sv_type,frequency,genic_relationship,constraint,expression,gene_group
0,DUP,RARE,PREDICTED_COPY_GAIN,lof_constrained,expressed_in_adrenal_gland,protein_coding
1,DEL,SINGLETON,PREDICTED_INTRAGENIC_EXON_DUP,missense_constrained,ANY,cosmic_cancer_genes
2,CPX_or_INV,,PREDICTED_LOF_or_PREDICTED_PARTIAL_EXON_DUP,unconstrained,,germline_CPGs
3,INS_ALL,,ANY,ANY,,base_excision_repair_genes
4,ANY,,,,,chromatin_organization_genes
5,,,,,,dna_damage_bypass_genes
6,,,,,,dna_damage_reversal_genes
7,,,,,,dna_DSB_repair_genes
8,,,,,,dna_DSB_response_genes
9,,,,,,dna_repair_genes


# Defining the "background reference" of gene counts

Here, we define all the genes affected by SVs in the dataset that go into the "denominator" of the Fisher exact test. Later, we simply extract the genes affected by SVs in categories to form the "numerator" from this reference.

As mentioned earlier, the following filters are also applied to this background:

1) the background SV class matches the category of interest (coding vs. noncoding)
2) SVs affecting more than 10 genes are removed (across coding and noncoding effects)
3) SVs that do not have a coding/noncoding consequence detailed in the CWAS are removed
4) noncoding SVs farther than 500 kb from their nearest gene for the noncoding consequence “INTERGENIC” are removed.
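
The four filters can be summarized as a single predicate per SV. This is a minimal sketch under stated assumptions: the `ToySV` container, the `in_background` function, and the example SVs are hypothetical, and a missing distance is treated as "not too far", which mirrors how a comparison against a missing value behaves in pandas.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class ToySV:
    name: str
    genes_coding: FrozenSet[str]
    genes_noncoding: FrozenSet[str]
    only_nearest_tss: bool = False                    # sole gene link is the nearest-TSS annotation
    distance_to_nearest_gene: Optional[float] = None  # in base pairs, if known


def in_background(sv, sv_category, max_genes=10, max_distance=5e5):
    """Apply filters 1-4 to one SV for a given class ('coding' or 'noncoding')."""
    genes = sv.genes_coding if sv_category == "coding" else sv.genes_noncoding
    if not genes:                                               # filters 1 and 3
        return False
    if len(sv.genes_coding | sv.genes_noncoding) > max_genes:   # filter 2
        return False
    too_far = (
        sv.distance_to_nearest_gene is not None
        and sv.distance_to_nearest_gene > max_distance
    )
    if sv.only_nearest_tss and too_far:                         # filter 4
        return False
    return True


svs_toy = [
    ToySV("del_lof", frozenset({"GENE_A"}), frozenset()),
    ToySV("near_intergenic", frozenset(), frozenset({"GENE_B"}), True, 2e5),
    ToySV("far_intergenic", frozenset(), frozenset({"GENE_C"}), True, 8e5),
    ToySV("multigene", frozenset(f"G{i}" for i in range(11)), frozenset()),
]
kept_coding = [sv.name for sv in svs_toy if in_background(sv, "coding")]
kept_noncoding = [sv.name for sv in svs_toy if in_background(sv, "noncoding")]
print(kept_coding, kept_noncoding)
```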

## How an SV affecting a gene appears in the VCF

An example:

In [18]:
svs[svs['name'] == 'PedSV.2.5.2_DEL_chr1_100']

Unnamed: 0,#chrom,start,end,name,svtype,AC,AF,ALGORITHMS,AN,BOTHSIDES_SUPPORT,...,PREDICTED_LOF,PREDICTED_MSV_EXON_OVERLAP,PREDICTED_NEAREST_TSS,PREDICTED_NONCODING_BREAKPOINT,PREDICTED_NONCODING_SPAN,PREDICTED_PARTIAL_EXON_DUP,PREDICTED_PROMOTER,PREDICTED_TSS_DUP,PREDICTED_UTR,distance_to_nearest_gene
78,chr1,958900,983001,PedSV.2.5.2_DEL_chr1_100,DEL,4,0.000307,depth,13032,False,...,"KLHL17,NOC2L,PERM1,PLEKHN1",,,"ewing_H3K27Ac_peak,ewing_H3K27Ac_peak_conserve...","ewing_ABC_MAX_enhancer,ewing_H3K27Ac_peak_cons...",,,,,


This SV is a coding SV that affects 4 genes with a predicted loss of function consequence: `KLHL17,NOC2L,PERM1,PLEKHN1`. Thus it would contribute a count to each of these genes in our reference dataset.
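
The expansion from one SV row to one count per affected gene can be sketched without pandas. The gene annotation below is copied from the example row, while the helper name `expand_sv_genes` and the dose of 1 are illustrative assumptions.

```python
from collections import Counter


def expand_sv_genes(sv_to_genes, sv_to_dose):
    """Turn comma-separated gene annotations into per-gene dosage counts."""
    counts = Counter()
    for sv_name, gene_field in sv_to_genes.items():
        # drop blanks and duplicates so an SV counts once per gene per allele
        for gene in set(filter(None, gene_field.split(","))):
            counts[gene] += sv_to_dose.get(sv_name, 0)
    return counts


# annotation copied from the example row above; the dose of 1 is an assumed value
sv_to_genes = {"PedSV.2.5.2_DEL_chr1_100": "KLHL17,NOC2L,PERM1,PLEKHN1"}
sv_to_dose = {"PedSV.2.5.2_DEL_chr1_100": 1}

gene_counts = expand_sv_genes(sv_to_genes, sv_to_dose)
print(sorted(gene_counts.items()))
```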

## Adding an annotation for coding/non-coding/multi effects

We add this annotation to filter out SVs that affect many genes

In [19]:
# for each class of consequence we record which genes an SV affects (genes_<label>) and how many
# unique genes that is (num_genes_<label>); the "both" label pools coding and noncoding consequences.
coding_cols = [
    "PREDICTED_LOF",
    "PREDICTED_PARTIAL_EXON_DUP",
    "PREDICTED_INTRAGENIC_EXON_DUP",
    "PREDICTED_COPY_GAIN",
]
noncoding_cols = [
    "PREDICTED_NEAREST_TSS",
    "PREDICTED_INTRONIC",
    "PREDICTED_PROMOTER",
    "PREDICTED_UTR",
]

# for each class of consequences (noncoding, coding, and "both"), we assess the genes that are affected by the SV
for label, cols in zip(
    ["coding", "noncoding", "both"],
    [coding_cols, noncoding_cols, coding_cols + noncoding_cols],
):

    # we define all the genes that are "affected" by an SV. We
    # do this by looking at successive genic relationships
    affected_genes = np.array(svs[cols[0]].fillna("").str.split(","))
    for c in cols[1:]:
        affected_genes += np.array(svs[c].fillna("").str.split(","))

    # remove blanks and duplicate genes - if an SV affects the same gene in more than one way, that's fine
    affected_genes = [[g for g in set(g_list) if g != ""] for g_list in affected_genes]
    affected_genes_col = [",".join(g_list) for g_list in affected_genes]

    # extract out how many unique genes are affected by each SV
    num_unique_affected_genes = [len(set(g_list)) for g_list in affected_genes]
    multiple_genes = np.array(num_unique_affected_genes) > 1

    # add a column to our SVs
    svs[f"num_genes_{label}"] = num_unique_affected_genes
    svs[f"genes_{label}"] = affected_genes_col

In [20]:
# as an example, the 3rd SV in this dataset affects OR4F5 - a copy gain
svs[['name', 'genes_coding'] + coding_cols].head(4)

Unnamed: 0,name,genes_coding,PREDICTED_LOF,PREDICTED_PARTIAL_EXON_DUP,PREDICTED_INTRAGENIC_EXON_DUP,PREDICTED_COPY_GAIN
0,PedSV.2.5.2_CNV_chr1_1,,,,,
1,PedSV.2.5.2_DUP_chr1_1,,,,,
2,PedSV.2.5.2_DUP_chr1_4,OR4F5,,,,OR4F5
3,PedSV.2.5.2_CNV_chr1_2,,,,,


In [21]:
multigene_svs = svs.query("num_genes_both > 10")["name"].tolist()

# only 64 SVs affect more than 10 genes
svs.shape, len(multigene_svs)

((228887, 56), 64)

We drop these SVs.

In [22]:
svs = svs.query("num_genes_both <= 10").copy()
category_svs = category_svs[~(category_svs["SV"].isin(multigene_svs))].copy()

## Drop intergenic SVs that are far from their gene

We only drop an SV in this manner if its nearest-TSS annotation is the only way it affects a gene (i.e. it has no coding consequence and no other noncoding consequence).

In [23]:
only_intergenic_svs = svs[(svs['PREDICTED_NEAREST_TSS'] == svs['genes_both']) & (svs['num_genes_coding'] == 0)].copy()
dropped_intergenic_svs = only_intergenic_svs.query('distance_to_nearest_gene > 5e5')['name'].tolist()

In [24]:
svs.shape

(228823, 56)

In [25]:
svs = svs[~(svs['name'].isin(dropped_intergenic_svs))]

In [26]:
# we are left with 206533 SVs
svs.shape

(206533, 56)

## Create the reference for gene counts

These references will also serve as the lookup point for our actual categories. That way, we don't have to recalculate counts, etc.

In [44]:
# identify rare SVs first - this is the "background" (multigene SVs and noncoding SVs far from their gene have already been removed)
rare_svs = svs[svs["AF"] < 0.01].copy().set_index("name")

ref_counts = []
for disease in ["neuroblastoma", "ewing"]:

    sub_reference = rare_svs.copy()
    
    # fetch the samples we have for this disease
    disease_samples = samples[disease]

    # identify SVs that affect genes in coding or non-coding fashion
    for sv_category in ["coding", "noncoding"]:
        sub_reference = rare_svs[rare_svs[f"num_genes_{sv_category}"] > 0]

        # now build the reference gene counts - for cases and controls
        gene_counts = {}
        for cohort in ["case", "control"]:

            # get the case or control samples, and then subset the dosage matrix to these samples
            cohort_samples = disease_samples[cohort]
            sub_dosages = dosages.loc[sub_reference.index, cohort_samples]

            # drop samples with bad genotyping and fill in na's as 0 (effectively ignored)
            kept_samples = np.where(
                pd.isnull(sub_dosages).sum(axis=0) / len(sub_dosages) < 0.05
            )[0]
            sub_dosages = sub_dosages.iloc[:, kept_samples].fillna(0).astype(int)

            # finally, we create the gene list with counts. First get the genes affected by each SV
            sv_genes = sub_reference[f"genes_{sv_category}"]
            
            # and expand them into their own rows (in case an SV affects more than one gene)
            sv_genes = sv_genes.str.split(',').explode()
            
            # next, determine the number of times the SV is present in our samples (case or control)
            sv_names = sv_genes.index
            sv_counts = sub_dosages.loc[sv_names].sum(axis=1)

            # construct the dataframe containing the SV names, the genes they affect (one each row),
            # and their dosage in the dataset (e.g. number of times this gene is affected)
            subreference_counts = pd.DataFrame(
                [sv_names, sv_counts, sv_genes], index=["name", "dose", "gene"]
            ).T

            # remove genes not in our reference (shouldn't happen, but just in case)
            subreference_counts = subreference_counts[
                subreference_counts["gene"].isin(gene_ref)
            ]

            # add relevant metadata
            subreference_counts["disease"] = disease
            subreference_counts["sv_category"] = sv_category
            subreference_counts["cohort"] = cohort
            
            # store
            ref_counts.append(subreference_counts)
            
            # if neuroblastoma cases and coding SVs, do a small subanalysis
            if disease == "neuroblastoma" and cohort == "case" and sv_category == "coding":
                
                # get the samples affected by the SVs
                sv_to_samples = {sv: ','.join(sub_dosages.columns[sub_dosages.loc[sv] > 0].tolist()) for sv in sub_dosages.index}
                
                # add this info
                nbl_coding_singleton_export = subreference_counts.copy()
                nbl_coding_singleton_export['samples'] = nbl_coding_singleton_export['name'].apply(lambda sv: sv_to_samples[sv])

        print(disease, sv_category)

ref_counts = pd.concat(ref_counts)

neuroblastoma coding
neuroblastoma noncoding
ewing coding
ewing noncoding


In [41]:
result_dict = {sv: sub_dosages.columns[sub_dosages.loc[sv] > 0].tolist() for sv in sub_dosages.index}

In [48]:
nbl_coding_singleton_export.query('dose == 1').to_csv('nbl-singletons.csv', index = False)


In [37]:
sv_genes

name
PedSV.2.5.2_DEL_chr1_2        OR4F5
PedSV.2.5.2_INS_chr1_2        OR4F5
PedSV.2.5.2_DEL_chr1_3        OR4F5
PedSV.2.5.2_INS_chr1_3        OR4F5
PedSV.2.5.2_DEL_chr1_9        OR4F5
                              ...  
PedSV.2.5.2_DEL_chrX_9256      IL9R
PedSV.2.5.2_DEL_chrX_9259      IL9R
PedSV.2.5.2_DEL_chrX_9264      IL9R
PedSV.2.5.2_DEL_chrY_59       TSPY2
PedSV.2.5.2_DEL_chrY_440     EIF1AY
Name: genes_noncoding, Length: 179975, dtype: object

In [35]:
subreference_counts

Unnamed: 0,name,dose,gene,disease,sv_category,cohort
0,PedSV.2.5.2_DEL_chr1_2,2,OR4F5,ewing,noncoding,control
1,PedSV.2.5.2_INS_chr1_2,1,OR4F5,ewing,noncoding,control
2,PedSV.2.5.2_DEL_chr1_3,2,OR4F5,ewing,noncoding,control
3,PedSV.2.5.2_INS_chr1_3,1,OR4F5,ewing,noncoding,control
4,PedSV.2.5.2_DEL_chr1_9,1,OR4F5,ewing,noncoding,control
...,...,...,...,...,...,...
179970,PedSV.2.5.2_DEL_chrX_9256,0,IL9R,ewing,noncoding,control
179971,PedSV.2.5.2_DEL_chrX_9259,0,IL9R,ewing,noncoding,control
179972,PedSV.2.5.2_DEL_chrX_9264,1,IL9R,ewing,noncoding,control
179973,PedSV.2.5.2_DEL_chrY_59,25,TSPY2,ewing,noncoding,control


In [34]:
sub_dosages

name,TPMCCDG10002,TPMCCDG10012,TPMCCDG10017,TPMCCDG10018,TPMCCDG10028,TPMCCDG10029,TPMCCDG10041,TPMCCDG10042,TPMCCDG10043,TPMCCDG10044,...,ssi_25815,ssi_25837,ssi_25849,ssi_25985,ssi_26055,ssi_26060,ssi_26301,ssi_26305,ssi_26369,ssi_26393
PedSV.2.5.2_DEL_chr1_2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PedSV.2.5.2_INS_chr1_2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PedSV.2.5.2_DEL_chr1_3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PedSV.2.5.2_INS_chr1_3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PedSV.2.5.2_DEL_chr1_9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
PedSV.2.5.2_DEL_chrX_9256,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PedSV.2.5.2_DEL_chrX_9259,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PedSV.2.5.2_DEL_chrX_9264,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PedSV.2.5.2_DEL_chrY_59,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
len(cohort_samples), len(kept_samples)

(4574, 4573)

In [28]:
ref_counts.head(2)

Unnamed: 0,name,dose,gene,disease,sv_category,cohort
0,PedSV.2.5.2_DEL_chr1_80,0,SAMD11,neuroblastoma,coding,case
1,PedSV.2.5.2_DEL_chr1_100,1,KLHL17,neuroblastoma,coding,case


For example, if we wanted to see how often `TTN` is affected in neuroblastoma cases (the query does not filter on `sv_category`, so both coding and noncoding SVs are returned):

In [30]:
ref_counts.query('gene == "TTN" & disease == "neuroblastoma" & cohort == "case" & dose > 0')

Unnamed: 0,name,dose,gene,disease,sv_category,cohort
1451,PedSV.2.5.2_DUP_chr2_3218,1,TTN,neuroblastoma,coding,case
1454,PedSV.2.5.2_DEL_chr2_11838,1,TTN,neuroblastoma,coding,case
24078,PedSV.2.5.2_DEL_chr2_11830,2,TTN,neuroblastoma,noncoding,case
24079,PedSV.2.5.2_DEL_chr2_11831,1,TTN,neuroblastoma,noncoding,case
24080,PedSV.2.5.2_DEL_chr2_11835,2,TTN,neuroblastoma,noncoding,case
24081,PedSV.2.5.2_INS_chr2_5134,2,TTN,neuroblastoma,noncoding,case
24083,PedSV.2.5.2_DEL_chr2_11839,1,TTN,neuroblastoma,noncoding,case


So the total dosage of `TTN` in neuroblastoma cases, across coding and noncoding SVs, is 1 + 1 + 2 + 1 + 2 + 2 + 1 = 10.
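
The same sum can be split by SV class. The rows below are copied from the `TTN` output above; the variable names are only for illustration.

```python
from collections import defaultdict

# (SV name, dose, sv_category) for TTN in neuroblastoma cases, copied from the output above
ttn_rows = [
    ("PedSV.2.5.2_DUP_chr2_3218", 1, "coding"),
    ("PedSV.2.5.2_DEL_chr2_11838", 1, "coding"),
    ("PedSV.2.5.2_DEL_chr2_11830", 2, "noncoding"),
    ("PedSV.2.5.2_DEL_chr2_11831", 1, "noncoding"),
    ("PedSV.2.5.2_DEL_chr2_11835", 2, "noncoding"),
    ("PedSV.2.5.2_INS_chr2_5134", 2, "noncoding"),
    ("PedSV.2.5.2_DEL_chr2_11839", 1, "noncoding"),
]

dose_by_class = defaultdict(int)
for _, dose, sv_category in ttn_rows:
    dose_by_class[sv_category] += dose

total_dose = sum(dose_by_class.values())
print(dict(dose_by_class), total_dose)  # {'coding': 2, 'noncoding': 8} 10
```

Because each enrichment test is restricted to one `sv_category`, only the matching part of this total enters a given contingency table.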

# An example

To illustrate the process we'll systematically undertake later, consider a simple example gene set:

In [31]:
# a contrived "gene set" designed explicitly to be higher in cases than controls
test_gs = ['MSL1', 'WIZ', 'ACTC1', 'NFAT5', 'STAG1', 'MRC2', 'NANOGP8', 'CLTA', 'TTN']

And a specific category: singleton deletions with a predicted LoF or partial exon duplication consequence in lof_constrained genes expressed in the adrenal gland:

In [32]:
test_cat = "DEL.SINGLETON.PREDICTED_LOF_or_PREDICTED_PARTIAL_EXON_DUP.lof_constrained.expressed_in_adrenal_gland.protein_coding"

We begin by identifying the SVs that are part of that category.

In [33]:
test_cat_svs = category_svs[
    (category_svs["category"] == test_cat)
    & (category_svs["disease"] == "neuroblastoma")
]["SV"].tolist()

len(test_cat_svs)

249

So 249 SVs are part of this category. We next extract all the genes affected by these SVs.

In [34]:
# category SVs
test_cat_counts = ref_counts[
    (ref_counts["name"].isin(test_cat_svs))
    & (ref_counts["disease"] == "neuroblastoma")
    & (ref_counts["sv_category"] == "coding")
    & (ref_counts["dose"] > 0)
]


# non-category SVs
test_non_cat_counts = ref_counts[
    (~ref_counts["name"].isin(test_cat_svs))
    & (ref_counts["disease"] == "neuroblastoma")
    & (ref_counts["sv_category"] == "coding")
    & (ref_counts["dose"] > 0)
]

test_cat_counts.head(2)

Unnamed: 0,name,dose,gene,disease,sv_category,cohort
46,PedSV.2.5.2_DEL_chr1_401,1,GNB1,neuroblastoma,coding,case
397,PedSV.2.5.2_DEL_chr1_5766,1,NFIA,neuroblastoma,coding,case


These counts include both cohorts (cases and controls). We now test them against the example gene set defined above.

Next, for both cases and controls, we calculate the 4 elements of our contingency table

In [35]:
for cohort in ["case", "control"]:

    # the counts ascribed to the category within the cohort
    cohort_counts_in_cat = test_cat_counts[test_cat_counts["cohort"] == cohort]
    
    # the counts ascribed to non-category SVs within the cohort
    cohort_counts_in_non_cat = test_non_cat_counts[test_non_cat_counts['cohort'] == cohort]
    
    # number of genes affected by category SVs and in gene set (top left)
    true_positives = cohort_counts_in_cat[cohort_counts_in_cat['gene'].isin(test_gs)]['dose'].sum()
    
    # number of genes affected by category SVs and not gene set (top right)
    false_positives = cohort_counts_in_cat[~cohort_counts_in_cat['gene'].isin(test_gs)]['dose'].sum()
    
    # number of genes affected by non-category SVs and in the gene set (bottom left)
    false_negatives = cohort_counts_in_non_cat[cohort_counts_in_non_cat['gene'].isin(test_gs)]['dose'].sum()
    
    # number of genes affected by non-category SVs and not in the gene set (bottom right)
    true_negatives = cohort_counts_in_non_cat[~cohort_counts_in_non_cat['gene'].isin(test_gs)]['dose'].sum()
    
    # make the contingency table
    cont_table = np.array(
            [
                [true_positives, false_positives],
                [false_negatives, true_negatives],
            ]
        )
    
    print(cohort)
    print(cont_table)

    # pseudocount of 1
    print(scipy.stats.fisher_exact(cont_table + 1))
    print()

case
[[   8   45]
 [   2 3926]]
(256.10869565217394, 1.958690806922406e-15)

control
[[    4   178]
 [   41 26443]]
(17.58712423516893, 1.79514162402813e-05)



So the odds ratio of this gene set from the Fisher's exact test is much higher in cases compared to controls (though it is significant in both)
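
The odds ratios printed above follow directly from the contingency tables once the pseudocount is added. The sketch below redoes that arithmetic with the tables copied from the output; the function name `odds_ratio` is hypothetical, and the final ratio of odds ratios is only a descriptive summary, not a formal test.

```python
def odds_ratio(table, pseudocount=1):
    """Sample odds ratio (a*d)/(b*c) of a 2x2 table after adding a pseudocount to each cell."""
    (a, b), (c, d) = [[cell + pseudocount for cell in row] for row in table]
    return (a * d) / (b * c)


# contingency tables copied from the output above
case_table = [[8, 45], [2, 3926]]
control_table = [[4, 178], [41, 26443]]

or_case = odds_ratio(case_table)          # (9 * 3927) / (46 * 3)
or_control = odds_ratio(control_table)    # (5 * 26444) / (179 * 42)
print(or_case, or_control, or_case / or_control)
```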

# Generalized gene set enrichment

Here, we apply the above process to all gene sets and a subset of categories


## Define the gene sets

These GO terms were taken directly from the GO website via the GO API. We only analyze gene sets that have between 30 and 1000 genes.

In [36]:
gene_sets = {}
with open("data/updated-cwas/go-gene-sets.txt") as gs_in:
    for line in gs_in:
        comp = line.strip().split("\t")
        gs = comp[0]
        genes = comp[1:]

        if 30 <= len(genes) <= 1000:
            gene_sets[gs] = genes

In [37]:
# 2725 out of 12558 gene sets are analyzed
print(len(gene_sets))

2725


## Define the categories

We only analyze categories that are at least nominally significant (p < 0.05). Additionally, we only evaluate `protein_coding` categories, as the other gene groups are already prefiltered to specific gene sets.

In [38]:
categories_to_analyze = framework_results[(framework_results['p_value'] < 0.05) &
                                          (framework_results['category_name'].str.contains('protein_coding'))]
categories_to_analyze = categories_to_analyze.sort_values(by = ['disease', 'sv_category']).reset_index(drop=True)
len(categories_to_analyze)

810

In [39]:
categories_to_analyze.head(2)

Unnamed: 0,point_estimate,std_error,z_score,p_value,category_name,disease,sv_category,negative_log10_p_value
0,0.589401,0.27894,2.113004,0.0346,ANY.SINGLETON.PREDICTED_COPY_GAIN.lof_constrai...,ewing,coding,1.460918
1,0.54119,0.227879,2.374903,0.017554,ANY.SINGLETON.PREDICTED_INTRAGENIC_EXON_DUP.AN...,ewing,coding,1.755634


## Convert to matrix algebra

The contingency-table counts for the Fisher's exact test can be computed for every gene set at once using matrix algebra (simple dot products). To achieve this, we construct a sparse matrix representing which genes are in each gene set.

In [40]:
# define lookup dictionaries for indices
gene_to_idx = {gene: i for i, gene in enumerate(gene_ref)}

In [41]:
# convert the gene sets to a sparse matrix
gs_to_idx = {}

values = []
row_indices = []
column_indices = []
for i, (gs, genes) in enumerate(gene_sets.items()):
    gs_to_idx[gs] = i

    column_indices += [gene_to_idx[g] for g in genes]
    row_indices += [i] * len(genes)
    values += [1] * len(genes)

values = np.array(values)
row_indices = np.array(row_indices)
column_indices = np.array(column_indices)

gs_gene_sparse_mtx = scipy.sparse.csr_matrix(
    (values, (row_indices, column_indices)), shape=(len(gs_to_idx), len(gene_to_idx))
).T

In [42]:
gs_gene_sparse_mtx

<19201x2725 sparse matrix of type '<class 'numpy.int64'>'
	with 396034 stored elements in Compressed Sparse Column format>
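
To make the matrix formulation concrete, here is a small dense sketch with NumPy. The five genes, two gene sets, and all counts are hypothetical, and dense arrays stand in for the sparse matrix used in this notebook. One dot product yields the top-left cell for every gene set at once, and the remaining cells follow from the totals.

```python
import numpy as np

# hypothetical toy data: 5 genes, 2 gene sets
genes = ["A", "B", "C", "D", "E"]
membership = np.array([      # rows = genes, columns = gene sets (1 = member)
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 0],
    [0, 0],
])
cat_counts = np.array([2, 0, 1, 0, 3])      # dosage per gene from category SVs
noncat_counts = np.array([1, 4, 0, 2, 5])   # dosage per gene from non-category SVs

tp = cat_counts @ membership                # category and gene set, one value per gene set
fp = cat_counts.sum() - tp                  # category and not gene set
fn = noncat_counts @ membership             # non-category and gene set
tn = noncat_counts.sum() - fn               # non-category and not gene set

# brute-force check of the top-left cell for the first gene set
first_set = {"A", "B"}
tp_loop = sum(int(c) for g, c in zip(genes, cat_counts) if g in first_set)
print(tp, fp, fn, tn, tp_loop)
```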

## Run the gene set enrichment

In [43]:
categories_to_analyze['disease'].value_counts()

neuroblastoma    539
ewing            271
Name: disease, dtype: int64

In [44]:
fisher_results = []

for i, (index, row) in enumerate(categories_to_analyze.iterrows()):
    
    #######################################
    ### STEP 1 - DEFINE OUR GENE COUNTS ###
    #######################################
    
    # This process is the exact same for all categories - later steps will subset our gene counts
    if i % 10 == 0:
        print(i, end = ', ')
    cat_name = row['category_name']
    sv_category = row['sv_category']
    disease = row['disease']
    
    # next, we pull out our SVs in this category
    svs_in_category = category_svs[(category_svs["category"] == cat_name) &
                                   (category_svs["disease"] == disease)]['SV'].tolist()
    
    ###############################
    ### EXTRACT CATEGORY COUNTS ###
    ###############################
    
    # get the counts of genes affected by the category SVs
    category_counts = ref_counts[
        (ref_counts["name"].isin(svs_in_category))
        & (ref_counts["disease"] == disease)
        & (ref_counts["sv_category"] == sv_category)
        & (ref_counts["dose"] > 0)
    ]
    
    ###############################
    ### DEFINE REFERENCE COUNTS ###
    ###############################
    
    # get the counts of genes affected by the NON category SVs
    ref_category_counts = ref_counts[
        (~ref_counts["name"].isin(svs_in_category))
        & (ref_counts["disease"] == disease)
        & (ref_counts["sv_category"] == sv_category)
        & (ref_counts["dose"] > 0)
    ]
    
    ##############################
    ### RUN ANALYSES BY COHORT ###
    ###############################
        
    for cohort in ['case', 'control']:
        
        cohort_category_counts = category_counts[category_counts['cohort'] == cohort]
        cohort_ref_category_counts = ref_category_counts[ref_category_counts['cohort'] == cohort]
        
        # transform gene counts into arrays
        cohort_gene_counts = cohort_category_counts.groupby(['gene'])['dose'].sum()
        cohort_ref_gene_counts = cohort_ref_category_counts.groupby(['gene'])['dose'].sum()
        
        # fill in missing gene counts (with 0)
        analysis_gene_counts = cohort_gene_counts.reindex(gene_ref, fill_value = 0).values.reshape(1, -1).astype(int)
        analysis_ref_gene_counts = cohort_ref_gene_counts.reindex(gene_ref, fill_value = 0).values.reshape(1, -1).astype(int)
        
        ###################################
        ### RUN THE FISHER'S EXACT TEST ###
        ###################################
                
        # convert to sparse
        analysis_gene_counts = scipy.sparse.csr_matrix(analysis_gene_counts)
        analysis_ref_gene_counts = scipy.sparse.csr_matrix(analysis_ref_gene_counts)
            
        # build our contingency table using matrix math
        cat_and_gs = analysis_gene_counts.dot(gs_gene_sparse_mtx)
        cat_and_not_gs = analysis_gene_counts.sum(axis = 1) - cat_and_gs
        not_cat_and_gs = analysis_ref_gene_counts.dot(gs_gene_sparse_mtx)
        not_cat_and_not_gs = analysis_ref_gene_counts.sum(axis = 1) - not_cat_and_gs

        # contingency table
        tp = cat_and_gs.todense().A1.astype(int)
        fp = cat_and_not_gs.A1.astype(int)
        fn = not_cat_and_gs.todense().A1.astype(int)
        tn = not_cat_and_not_gs.A1.astype(int)

        # calculate p values (pseudocount = 1 keeps the odds ratio finite when a cell is 0)
        results = pyranges.statistics.fisher_exact(tp, fp, fn, tn, pseudocount = 1)

        results['ref_freq'] = pd.Series(fn.astype(str)) + '/' + pd.Series((fn + tn).astype(str))
        results['cat_freq'] = pd.Series((tp).astype(str)) + '/' + pd.Series((tp + fp).astype(str))
        results['gs'] = gene_sets.keys()
        results['category'] = cat_name
        results['cohort'] = cohort
        results['sv_category'] = sv_category
        results['disease'] = disease
            
        # add in a feature for unique number of genes in the overlap
        unique_gene_overlaps = analysis_gene_counts.astype(bool).multiply(gs_gene_sparse_mtx.T.astype(bool))
        unique_gene_overlaps = unique_gene_overlaps.todense().A.sum(axis = 1)
        results['num_unique_genes_in_overlap'] = unique_gene_overlaps

        results = results[['disease', 'category', 'sv_category', 'cohort', 'gs', 
                           'ref_freq', 'cat_freq', 'num_unique_genes_in_overlap', 'OR', 'P']]
        results = results.rename(columns = {'OR': 'odds_ratio', 'P': 'p'})

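        # FDR (Benjamini-Hochberg) correction across all gene sets
        fdr_p = statsmodels.stats.multitest.multipletests(results['p'].to_list(), method='fdr_bh')[1]
        results['fdr_p'] = fdr_p
        # Bonferroni correction; note this is not capped at 1, so bonf_p can exceed 1
        results['bonf_p'] = results['p'] * len(gene_sets)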

        fisher_results.append(results)
                
gse_results = pd.concat(fisher_results)

0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 
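
The loop above builds the contingency table for every gene set at once with a single sparse dot product. A minimal sketch of that idea on toy data follows. The gene names, counts, and membership matrix are invented for illustration, and `scipy.stats.fisher_exact` stands in for `pyranges.statistics.fisher_exact`, so no pseudocount is applied here.

```python
import numpy as np
import scipy.sparse
import scipy.stats

# Hypothetical toy data: 4 genes (rows) and 2 gene sets (columns).
genes = ["BRCA1", "BRCA2", "TP53", "TTN"]
membership = scipy.sparse.csr_matrix(
    np.array(
        [
            [1, 0],  # BRCA1 is in set 0
            [1, 0],  # BRCA2 is in set 0
            [1, 1],  # TP53 is in both sets
            [0, 1],  # TTN is in set 1
        ]
    )
)

# Per-gene counts as 1 x n_genes rows: category SVs and reference (non-category) SVs.
cat_counts = scipy.sparse.csr_matrix(np.array([[2, 1, 0, 1]]))
ref_counts_row = scipy.sparse.csr_matrix(np.array([[5, 3, 4, 8]]))

# A single dot product gives, for every gene set, the counts that fall inside it.
cat_and_gs = np.asarray(cat_counts.dot(membership).todense()).ravel()
cat_and_not_gs = cat_counts.sum() - cat_and_gs
ref_and_gs = np.asarray(ref_counts_row.dot(membership).todense()).ravel()
ref_and_not_gs = ref_counts_row.sum() - ref_and_gs

# One Fisher's exact test per gene set (plain scipy, no pseudocount).
p_values = []
for j in range(membership.shape[1]):
    table = [
        [cat_and_gs[j], cat_and_not_gs[j]],
        [ref_and_gs[j], ref_and_not_gs[j]],
    ]
    _, p = scipy.stats.fisher_exact(table)
    p_values.append(p)
```

Each column of `membership` plays the role of one gene set, so adding gene sets only widens the matrix and leaves the rest of the code unchanged.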

In [48]:
gse_results.head(2)

Unnamed: 0,disease,category,sv_category,cohort,gs,ref_freq,cat_freq,num_unique_genes_in_overlap,odds_ratio,p,fdr_p,bonf_p
0,ewing,ANY.SINGLETON.PREDICTED_COPY_GAIN.lof_constrai...,coding,case,regulation of protein secretion (GO:0050708),54/3909,1/50,1,2.804364,0.505341,1.0,1377.053474
1,ewing,ANY.SINGLETON.PREDICTED_COPY_GAIN.lof_constrai...,coding,case,negative regulation of fibroblast proliferatio...,3/3909,0/50,0,19.151961,1.0,1.0,2725.0
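
The `bonf_p` values in this output exceed 1 because the loop multiplies each p value by the number of gene sets without capping the result. The sketch below, on made-up p values, compares that uncapped product with a capped version and with the corrections returned by `statsmodels.stats.multitest.multipletests`. It is an illustration only and does not change the results above.

```python
import numpy as np
import statsmodels.stats.multitest as multitest

# Hypothetical p values, for illustration only.
p = np.array([0.001, 0.02, 0.03, 0.5])
n_tests = len(p)

# Uncapped Bonferroni product, as computed in the loop: can be larger than 1.
bonf_unclipped = p * n_tests

# Capping at 1 keeps the adjusted values interpretable as probabilities.
bonf_clipped = np.minimum(bonf_unclipped, 1.0)

# statsmodels applies the same cap for method="bonferroni".
bonf_sm = multitest.multipletests(p, method="bonferroni")[1]

# Benjamini-Hochberg FDR, the method used for fdr_p in the loop.
fdr_bh = multitest.multipletests(p, method="fdr_bh")[1]
```

If capped values are preferred downstream, `results['bonf_p'].clip(upper=1)` would be one way to apply the same cap to the pandas column.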


Round the p values and odds ratios for export (this step is currently commented out, so full precision is kept).

In [49]:
# gse_results["p"] = gse_results["p"].round(5)
# gse_results["fdr_p"] = gse_results["fdr_p"].round(5)
# gse_results["odds_ratio"] = gse_results["odds_ratio"].round(5)

We now reorganize this dataframe so that the results for the two cohorts (`case`, `control`) are on a single row.

In [50]:
cases = gse_results.query('cohort == "case"')
controls = gse_results.query('cohort == "control"')

Each cohort should have exactly the same order of rows, so we can just add new columns.

In [51]:
overlap_cols = [
    "disease",
    "category",
    "sv_category",
    "gs",
]

In [52]:
# check that the key columns match row-for-row between cases and controls
(cases[overlap_cols] == controls[overlap_cols]).all()

disease        True
category       True
sv_category    True
gs             True
dtype: bool

In [53]:
combined_gse_results = cases.drop(columns=["cohort"]).copy()

# rename some columns
combined_gse_results.columns = [
    c + "_cases" if c not in overlap_cols else c for c in combined_gse_results.columns
]

d_temp = controls.copy().drop(columns=overlap_cols + ['cohort'])
d_temp.columns = [c + '_controls' for c in d_temp.columns]
combined_gse_results[d_temp.columns] = d_temp
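
The column assignment above relies on cases and controls sharing the same row order. As a possible alternative, the sketch below joins the two cohorts on the key columns with `DataFrame.merge`, which does not depend on row order. The toy dataframe is invented for illustration and carries only a `p` column.

```python
import pandas as pd

key_cols = ["disease", "category", "sv_category", "gs"]

# Hypothetical stand-in for gse_results with a single statistic column.
toy = pd.DataFrame(
    {
        "disease": ["ewing"] * 4,
        "category": ["cat_a"] * 4,
        "sv_category": ["coding"] * 4,
        "gs": ["set_1", "set_2", "set_1", "set_2"],
        "cohort": ["case", "case", "control", "control"],
        "p": [0.5, 1.0, 0.18, 0.07],
    }
)

toy_cases = toy[toy["cohort"] == "case"].drop(columns="cohort")
toy_controls = toy[toy["cohort"] == "control"].drop(columns="cohort")

# Join on the key columns; validate raises if any key is duplicated in either cohort.
toy_combined = toy_cases.merge(
    toy_controls,
    on=key_cols,
    suffixes=("_cases", "_controls"),
    validate="one_to_one",
)
```

With `validate="one_to_one"`, a duplicated key raises an error instead of silently misaligning rows, which covers the same concern as the equality check above.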

In [54]:
combined_gse_results.head(2)

Unnamed: 0,disease,category,sv_category,gs,ref_freq_cases,cat_freq_cases,num_unique_genes_in_overlap_cases,odds_ratio_cases,p_cases,fdr_p_cases,bonf_p_cases,ref_freq_controls,cat_freq_controls,num_unique_genes_in_overlap_controls,odds_ratio_controls,p_controls,fdr_p_controls,bonf_p_controls
0,ewing,ANY.SINGLETON.PREDICTED_COPY_GAIN.lof_constrai...,coding,regulation of protein secretion (GO:0050708),54/3909,1/50,1,2.804364,0.505341,1.0,1377.053474,270/22714,0/192,0,0.429134,0.179149,0.513875,488.181684
1,ewing,ANY.SINGLETON.PREDICTED_COPY_GAIN.lof_constrai...,coding,negative regulation of fibroblast proliferatio...,3/3909,0/50,0,19.151961,1.0,1.0,2725.0,7/22714,1/192,1,29.567708,0.065132,0.288123,177.483892


In [58]:
combined_gse_results.to_csv(
    "data/updated-cwas/results/cwas-gene-set-enrichment-results.csv", index=False
)

We also export the reference counts.

In [59]:
ref_counts.to_csv('data/updated-cwas/reference-counts.csv', index = False)

This notebook is getting much too large, so I'm going to analyze these results in a different notebook.