#### Filter the gene set based on the output from script `Step_01_remove_non_redundant_gene_set.R`

In [1]:
library(tidyverse)
library(pheatmap)
library(fgsea)

── Attaching core tidyverse packages ── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted pac

Load the original gene set file. The filtered gene set must later be written in the same format.

In [2]:
pathways <- gmtPathways("c5.go.bp.v2023.2.Hs.symbols.gmt")

Load the gene set distance matrix produced by `Step_01`

In [3]:
distance_matrix <- readRDS("fgsea_gene_set_distance_matrix.rds")

In [4]:
dim(distance_matrix)

In [5]:
# run hierarchical clustering
hc <- hclust(as.dist(distance_matrix), method = "complete")

In [9]:
# cut the tree at a dissimilarity (height) of 0.7; with complete linkage, all gene sets
# within a cluster then have a pairwise dissimilarity of 0.7 or less
# the higher this value (closer to 1), the fewer the clusters: at h = 1, everything with a dissimilarity
# of 1 or less (which is everything, since Jaccard values range from 0 to 1) is one cluster
# the lower this value (closer to 0), the more clusters there are: at h = 0, every gene set is its own cluster
dissimilarity_threshold <- 0.7
clusters <- cutree(hc, h = dissimilarity_threshold)
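To see how the cut height changes the number of clusters, here is a minimal sketch on a toy Jaccard dissimilarity matrix. The gene sets in `toy_sets` and the helper `jaccard_dist` are made up for illustration and are not part of the pipeline.

```r
# Toy gene sets (hypothetical, for illustration only)
toy_sets <- list(
  A = c("g1", "g2", "g3", "g4"),
  B = c("g1", "g2", "g3", "g5"),
  C = c("g6", "g7", "g8"),
  D = c("g6", "g7", "g9")
)

# Jaccard dissimilarity: 1 - |intersection| / |union|
jaccard_dist <- function(x, y) {
  1 - length(intersect(x, y)) / length(union(x, y))
}

# Fill a symmetric dissimilarity matrix for all pairs of toy gene sets
n <- length(toy_sets)
toy_dist <- matrix(0, n, n, dimnames = list(names(toy_sets), names(toy_sets)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    toy_dist[i, j] <- jaccard_dist(toy_sets[[i]], toy_sets[[j]])
  }
}

# Same clustering call as above, applied to the toy matrix
toy_hc <- hclust(as.dist(toy_dist), method = "complete")

# Count the clusters obtained at several cut heights
heights <- c(0, 0.45, 0.7, 1)
n_clusters <- sapply(heights, function(h) length(unique(cutree(toy_hc, h = h))))
names(n_clusters) <- heights
n_clusters
```

With these toy sets the count falls from four clusters at h = 0 to a single cluster at h = 1, which matches the behaviour described in the comments above.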

In [10]:
GO_term_cluster_assignments <- as.data.frame(clusters)

In [11]:
# number of clusters
num_clusters <- length(unique(GO_term_cluster_assignments$clusters))
num_clusters

In [8]:
write.csv(GO_term_cluster_assignments, file = paste0("fgseapy_at_", dissimilarity_threshold, "_dissimilarity_GO_BP_terms.csv"))

In [9]:
# cluster names are strings 
cluster_names = as.character(seq(1, num_clusters))

Randomly sample one gene set per cluster

In [10]:
# Initialize an empty named character vector
representative_GO_term_per_cluster <- character(length(cluster_names))
names(representative_GO_term_per_cluster) <- cluster_names

# Loop to fill the vector with representative GO terms
for (cluster_name in cluster_names) {
    representative_GO_term_per_cluster[cluster_name] <- sample(names(clusters[clusters == cluster_name]), 1)
}
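Because `sample()` draws at random, rerunning the cell above can pick different representatives each time. The sketch below shows one way to make the draw reproducible by fixing the seed first. The cluster assignments in `toy_clusters` and the seed value are assumptions made for illustration, and `split()` with `vapply()` is an alternative to the loop, not the method used in this notebook.

```r
# Hypothetical cluster assignments in the shape cutree() returns:
# a named integer vector (names = gene set names, values = cluster ids)
toy_clusters <- c(GOBP_A = 1L, GOBP_B = 1L, GOBP_C = 2L, GOBP_D = 3L, GOBP_E = 3L)

# Fix the RNG seed so the same representatives are drawn on every run
# (the value 42 is arbitrary)
set.seed(42)

# Group gene set names by cluster id, then draw one name from each group
members_by_cluster <- split(names(toy_clusters), toy_clusters)
toy_representatives <- vapply(
  members_by_cluster,
  function(members) sample(members, 1),
  character(1)
)
toy_representatives
```

The result is a character vector named by cluster id, the same shape as `representative_GO_term_per_cluster`.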

The original .gmt file is loaded as a named list, so we can simply subset it to the representative GO terms

In [11]:
class(pathways)

In [12]:
filtered_pathways <- pathways[names(pathways) %in% representative_GO_term_per_cluster]

In [13]:
head(filtered_pathways)

Write the filtered gene sets in GMT format

In [14]:
# specify the output file and open a file connection
output_file <- "filtered_GO_Hs_symbols.gmt"
file_conn <- file(output_file, "w")

# loop through all of the GO terms in the filtered set
for (go_term in names(filtered_pathways)) {
  # fetch the associated genes for this GO term
  genes <- filtered_pathways[[go_term]]
  
  # write to the file in GMT format: GO term, description (use GO term as placeholder), and genes
  line <- paste(go_term, go_term, paste(genes, collapse = "\t"), sep = "\t")
  
  # write the line to the file
  writeLines(line, file_conn)
}

# close the file connection
close(file_conn)

cat("File written to", output_file, "\n")

File written to filtered_GO_Hs_symbols.gmt 


In [15]:
reloaded_filtered_pathways <- gmtPathways(output_file)

In [16]:
head(reloaded_filtered_pathways)

The filtered set is now compatible with fgseapy and reloads correctly with `gmtPathways`.
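A stricter check than inspecting `head()` is to confirm that the gene sets survive a write and reload unchanged. The sketch below does this in base R on made-up gene sets (`toy_pathways` is hypothetical). It parses the file with `strsplit()` instead of `gmtPathways()` so that it runs without fgsea.

```r
# Hypothetical gene sets standing in for filtered_pathways
toy_pathways <- list(
  GOBP_EXAMPLE_ONE = c("TP53", "BRCA1", "MYC"),
  GOBP_EXAMPLE_TWO = c("EGFR", "KRAS")
)

# Build one tab-separated GMT line per gene set: name, description, genes
# (the name is reused as the description placeholder, as above)
gmt_lines <- vapply(
  names(toy_pathways),
  function(term) paste(c(term, term, toy_pathways[[term]]), collapse = "\t"),
  character(1)
)

toy_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, toy_file)

# Parse the file back: field 1 is the name, field 2 the description,
# and the remaining fields are the genes
fields <- strsplit(readLines(toy_file), "\t", fixed = TRUE)
reloaded <- lapply(fields, function(f) f[-(1:2)])
names(reloaded) <- vapply(fields, `[`, character(1), 1)

# TRUE when set names, gene order, and gene membership all survive the round trip
round_trip_ok <- identical(reloaded, toy_pathways)
round_trip_ok
```

In the notebook itself, a similar comparison between `reloaded_filtered_pathways` and `filtered_pathways` might serve the same purpose, assuming `gmtPathways()` returns a named list of character vectors in file order.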