Decide on minimum read count for exon expression #88

kubu4 · 2024-01-11T22:18:06Z

No description provided.

sr320 · 2024-01-13T16:35:33Z

20x

kubu4 · 2024-01-13T16:49:19Z

Is this per exon, or the mean across all exons per gene?

sr320 · 2024-01-13T17:12:44Z

I think we want to keep exons with low / no expression. Let's set the threshold as the sum of read counts for the first 6 exons (as this is what we are looking at) to be 1000.

kubu4 · 2024-01-22T18:04:58Z

Okay, when I do this (sum read coverage across first 6 exons per gene), I end up with only 2,497 genes having a sum of >= 1000.

< 1000      >= 1000 
  35763        2501

kubu4 · 2024-01-22T18:07:09Z

Can check out code here:

ceabigr/code/65-exon-coverage.qmd

Lines 357 to 375 in a6a5f24

    
    ## Test exon sum threshold

    ### Sums of Exons 1 - 6

    ```{r test-exon-sums}
    test_sum <- S9Mdata %>%
      dplyr::select(GeneID,Exon1,Exon2,Exon3,Exon4,Exon5,Exon6)

    test_sum$Sum <- rowSums(test_sum[, -1], na.rm = TRUE)

    str(test_sum)
    ```

    ### Gene counts with exon coverage threshold

    ```{r test-check-gene-counts-exon-sums}
    # Count the occurrences of values in test_sum$Sum
    sum_counts <- table(ifelse(test_sum$Sum <= 1000, "<= 1000", "> 1000"))

    # Display the counts
    print(sum_counts)
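
One thing worth noting about the snippet above: it bins with `<= 1000` / `> 1000`, while the counts posted in this thread are labeled `< 1000` / `>= 1000`. The two only differ for genes whose sum is exactly at the threshold. A minimal sketch of the difference, using made-up sums (the `toy_sums` values and gene names are hypothetical, not from the data):

```r
# Toy per-gene sums (hypothetical values), including one gene at exactly 1000
toy_sums <- c(geneA = 999, geneB = 1000, geneC = 1001)
threshold <- 1000

# Binning as in the snippet above: a sum of exactly 1000 counts as "below"
strict <- table(ifelse(toy_sums <= threshold, "<= 1000", "> 1000"))

# Binning that matches the "< 1000 / >= 1000" labels: exactly 1000 passes
inclusive <- table(ifelse(toy_sums >= threshold, ">= 1000", "< 1000"))

print(strict)
print(inclusive)
```

Whether this matters for the real counts depends on how many genes sum to exactly the threshold value.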

sr320 · 2024-01-22T18:19:26Z

What would the gene count be if we reduced the sum threshold to 100?

kubu4 · 2024-01-22T18:23:49Z

< 100    >= 100 
  6510    31754

sr320 · 2024-01-22T18:33:21Z

Greater than 500?

kubu4 · 2024-01-22T18:45:06Z

 < 500     >= 500 
 31627       6637

sr320 · 2024-01-22T19:09:14Z

Let's go forward with > 100.

kubu4 · 2024-01-24T00:39:28Z

Alrighty, we may need to make further adjustments. Those numbers above were just from a single sample that I was using for code testing.

I've managed to write code to look at all the files and do the threshold filtering for all samples on a per-gene basis. I.e., a gene is retained only if its summed read count across exons 1-6 is >= n in every sample.

| Threshold | Genes |
|-----------|-------|
| 10        | 23101 |
| 25        | 18119 |
| 50        | 13827 |
| 75        | 10357 |
| 100       | 7485  |
| 500       | 385   |
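
A table like this can be produced in one pass rather than re-running the filter per threshold: a gene passes at threshold n in every sample exactly when its minimum sum across samples is >= n. The sketch below illustrates that idea with a made-up matrix; the `sums` values, gene names, and sample names are hypothetical, and this is not necessarily how the numbers above were generated.

```r
# Hypothetical exon 1-6 read sums: rows are genes, columns are samples
# (matrix() fills column by column, so each row of numbers below is one sample)
sums <- matrix(
  c(120,  15, 600,   8,
    110,  30, 550,  40,
    105,  12, 700,   9),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

thresholds <- c(10, 25, 50, 75, 100, 500)

# Lowest sum each gene reaches in any sample
min_sum <- apply(sums, 1, min)

# Number of genes whose lowest sum still meets each threshold
genes_passing <- vapply(thresholds, function(n) sum(min_sum >= n), integer(1))

print(data.frame(Threshold = thresholds, Genes = genes_passing))
```

Note that a gene with a missing value in any sample would get an `NA` minimum here and would need explicit handling.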

kubu4 · 2024-01-24T00:47:14Z

Relevant code section:

ceabigr/code/65-exon-coverage.qmd

Lines 437 to 486 in 6b4bf89

    
    # Filter genes with exon threshold

    ## Sum of Exons 1-6 for each file.

    ```{r}
    # Initialize an empty list to store results for each file
    sums_list <- list()

    # Loop through each file in the list
    for (file in all) {
      # Read in the data file
      data <- read.csv(paste0("../output/65-exon-coverage/", file, "_exon_reads.processed_data.csv"))

      # Calculate the sum of Exons 1-6
      sum_result <- data %>%
        dplyr::select(GeneID, Exon1, Exon2, Exon3, Exon4, Exon5, Exon6) %>%
        mutate(Sum = rowSums(.[, -1], na.rm = TRUE))

      # Count GeneIDs meeting the threshold for the current file
      geneid_count <- sum_result %>%
        dplyr::filter(Sum >= exon_sum_threshold) %>%
        nrow()

      cat("\nCount of GeneIDs meeting the threshold (", exon_sum_threshold, "summed reads ):\n")

      # Append the result and count to the lists
      sums_list[[file]] <- sum_result
    }
    ```

    ## Filter across all samples

    ```{r}
    # Rename the Sum column in each data frame in sums_list
    renamed_sums_list <- lapply(seq_along(sums_list), function(i) {
      colname <- paste0(all[i], "_Sum")
      col_index <- grep("Sum", colnames(sums_list[[i]]))
      colnames(sums_list[[i]])[col_index] <- colname
      dplyr::select(sums_list[[i]], GeneID, colname)
    })

    # Join data frames in renamed_sums_list on GeneID
    all_geneids_df <- Reduce(function(x, y) dplyr::full_join(x, y, by = "GeneID"), renamed_sums_list)

    # Filter rows where all values are >= the set threshold
    final_geneids <- all_geneids_df %>%
      dplyr::filter(if_all(ends_with("_Sum"), ~. >= exon_sum_threshold)) %>%
      dplyr::select(GeneID)

    # Print count of unique GeneIDs meeting the threshold across all files
    cat("Final count of unique GeneIDs meeting the threshold (", exon_sum_threshold, "summed reads ) across all files:", nrow(final_geneids), "\n")

    # Print the structure of the final data frame
    str(final_geneids)
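
For anyone wanting to try the rename / join / filter step in isolation, here is a small self-contained sketch on toy data. The sample names (`S1`, `S2`), gene IDs, and sums are hypothetical. It differs from the code above in two small ways: it wraps the column name in `all_of()` so the external character vector is passed to `select()` explicitly, and it keys the list by sample name instead of indexing into `all`.

```r
library(dplyr)

# Hypothetical per-sample tables of exon 1-6 sums (toy values)
sums_list <- list(
  S1 = data.frame(GeneID = c("g1", "g2", "g3"), Sum = c(50, 5, 20)),
  S2 = data.frame(GeneID = c("g1", "g2"),       Sum = c(12, 40))
)
exon_sum_threshold <- 10

# Rename each Sum column to <sample>_Sum
renamed <- lapply(names(sums_list), function(s) {
  colname <- paste0(s, "_Sum")
  df <- sums_list[[s]]
  names(df)[names(df) == "Sum"] <- colname
  dplyr::select(df, GeneID, dplyr::all_of(colname))
})

# Full join on GeneID: genes absent from a sample get NA in that column
all_geneids_df <- Reduce(function(x, y) dplyr::full_join(x, y, by = "GeneID"), renamed)

# Keep genes meeting the threshold in every sample; rows with NA are dropped
final_geneids <- all_geneids_df %>%
  dplyr::filter(dplyr::if_all(dplyr::ends_with("_Sum"), ~ .x >= exon_sum_threshold)) %>%
  dplyr::select(GeneID)

cat("Genes meeting the threshold in every sample:", nrow(final_geneids), "\n")
```

In this toy case `g3` is missing from `S2`, so the full join gives it an `NA` and `filter()` drops it; only genes present and above threshold in all samples survive. Separately, in the per-file loop above, `geneid_count` is computed but the `cat()` call prints only the label, so the per-file count never appears in the output.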

sr320 · 2024-01-24T17:06:53Z

Let's go with a threshold of 10.

sr320 closed this as completed Feb 15, 2024
