Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide on minimum read count for exon expression #88

Closed
kubu4 opened this issue Jan 11, 2024 · 13 comments
Closed

Decide on minimum read count for exon expression #88

kubu4 opened this issue Jan 11, 2024 · 13 comments

Comments

@kubu4
Copy link
Collaborator

kubu4 commented Jan 11, 2024

No description provided.

@sr320
Copy link
Owner

sr320 commented Jan 13, 2024

20x

@kubu4
Copy link
Collaborator Author

kubu4 commented Jan 13, 2024

Is this per exon or mean across all exons per gene ?

@sr320
Copy link
Owner

sr320 commented Jan 13, 2024

I think we want to keep exon with low / no expression. .. Lets set a threshold as sum of read counts for first 6 exons (as this is what we are looking at) to be 1000.

@kubu4
Copy link
Collaborator Author

kubu4 commented Jan 22, 2024

Okay, when I do this (sum read coverage across first 6 exons per gene), I end up with only 2,497 genes having a sum of >= 1000.

< 1000      >= 1000 
  35763        2501

@kubu4
Copy link
Collaborator Author

kubu4 commented Jan 22, 2024

Can check out code here:

## Test exon sum threshold
### Sums of Exons 1 - 6
```{r test-exon-sums}
test_sum <- S9Mdata %>%
dplyr::select(GeneID,Exon1,Exon2,Exon3,Exon4,Exon5,Exon6)
test_sum$Sum <- rowSums(test_sum[, -1], na.rm = TRUE)
str(test_sum)
```
### Gene counts with exon coverage threshold
```{r test-check-gene-counts-exon-sums}
# Count the occurrences of values in test_sum$Sum
sum_counts <- table(ifelse(test_sum$Sum <= 1000, "<= 1000", "> 1000"))
# Display the counts
print(sum_counts)

@sr320
Copy link
Owner

sr320 commented Jan 22, 2024

what would the gene count be if reduced sum to 100.

@kubu4
Copy link
Collaborator Author

kubu4 commented Jan 22, 2024

< 100    >= 100 
  6510    31754

@sr320
Copy link
Owner

sr320 commented Jan 22, 2024

Greater than 500?

@kubu4
Copy link
Collaborator Author

kubu4 commented Jan 22, 2024

 < 500     >= 500 
 31627       6637

@sr320
Copy link
Owner

sr320 commented Jan 22, 2024

lets go forward with > 100

@kubu4
Copy link
Collaborator Author

kubu4 commented Jan 24, 2024

Alrighty, we may need to make further adjustments. Those numbers above were just from a single sample that I was using for code testing.

I've managed to write code to look at all the files and do the threshold filtering for all samples on a per gene basis. I.e. All samples must have an exon coverage sum threshold of n.

Threshold Genes
10 23101
25 18119
50 13827
75 10357
100 7485
500 385

@kubu4
Copy link
Collaborator Author

kubu4 commented Jan 24, 2024

Relevant code section:

# Filter genes with exon threshold
## Sum of Exons 1-6 for each file.
```{r}
# Initialize an empty list to store results for each file
sums_list <- list()
# Loop through each file in the list
for (file in all) {
# Read in the data file
data <- read.csv(paste0("../output/65-exon-coverage/", file, "_exon_reads.processed_data.csv"))
# Calculate the sum of Exons 1-6
sum_result <- data %>%
dplyr::select(GeneID, Exon1, Exon2, Exon3, Exon4, Exon5, Exon6) %>%
mutate(Sum = rowSums(.[, -1], na.rm = TRUE))
# Count GeneIDs meeting the threshold for the current file
geneid_count <- sum_result %>%
dplyr::filter(Sum >= exon_sum_threshold) %>%
nrow()
cat("\nCount of GeneIDs meeting the threshold (", exon_sum_threshold, "summed reads ):\n")
# Append the result and count to the lists
sums_list[[file]] <- sum_result
}
```
## Filter across all samples
```{r}
# Rename the Sum column in each data frame in sums_list
renamed_sums_list <- lapply(seq_along(sums_list), function(i) {
colname <- paste0(all[i], "_Sum")
col_index <- grep("Sum", colnames(sums_list[[i]]))
colnames(sums_list[[i]])[col_index] <- colname
dplyr::select(sums_list[[i]], GeneID, colname)
})
# Join data frames in renamed_sums_list on GeneID
all_geneids_df <- Reduce(function(x, y) dplyr::full_join(x, y, by = "GeneID"), renamed_sums_list)
# Filter rows where all values are >= the set threshold
final_geneids <- all_geneids_df %>%
dplyr::filter(if_all(ends_with("_Sum"), ~. >= exon_sum_threshold)) %>%
dplyr::select(GeneID)
# Print count of unique GeneIDs meeting the threshold across all files
cat("Final count of unique GeneIDs meeting the threshold (", exon_sum_threshold, "summed reads ) across all files:", nrow(final_geneids), "\n")
# Print the structure of the final data frame
str(final_geneids)

@sr320
Copy link
Owner

sr320 commented Jan 24, 2024

lets go with threshold of 10

@sr320 sr320 closed this as completed Feb 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants