Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

General Recommendations for preprocessing RNA-Seq data #99

Closed
nijibabulu opened this issue Apr 16, 2024 · 2 comments
Closed

General Recommendations for preprocessing RNA-Seq data #99

nijibabulu opened this issue Apr 16, 2024 · 2 comments

Comments

@nijibabulu
Copy link

We are generally interested in detecting perturbed regulator proteins and their networks using causal networks in a de-novo fashion, which would in theory make sense to apply inverse CARNIVAL to solve. However, I am a unclear on how to preprocess the input genes and network to get a reliably solvable problem. My questions are generally:

  • Is there a current general guideline for how to preprocess inputs for CARNIVAL based off of single-contrast DE runs (either limma, DESeq2 or edgeR) as well as PKNs?
  • Is there also a way to estimate whether the input PKN+measurement combination may not be solvable?

Below I discuss what I have tried so far.

I noticed that in the available examples, the PKNs and measurement inputs are filtered by some subset of genes, usually TFs and pathway genes. (Below)

https://github.com/saezlab/transcriptutorial
https://github.com/saezlab/CausalToxNet
https://github.com/harrytr/p53

These often have pre-loaded networks and TFs, but I am trying to develop a pipeline. I tried two different methods based on those and got very different outcomes in terms of converging on a solution. I am starting with 17.5k gene statistics (based on running limma) as measurements. I then used the following:

ULM/collectri

# start with gene statistics in gene_mat 
net <- decoupleR::get_collectri(organism = 'human', split_complexes = FALSE)
gene_tf_ulm <- decoupleR::run_ulm(mat = gene_mat, net = net, .source = "source",
      .mor = "mor", .target = "target")
gene_tf_acts <- tibble::deframe(gene_tf_ulm[,c("source", "score")])
net_pruned <- CARNIVAL::prune_network(net, names(gene_tf_acts), names(gene_tf_acts)) |>
    dplyr::select(source, interaction=mor, target)
CARNIVAL::generateLpFileCarnival(measurements = gene_tf_acts,
                                 priorKnowledgeNetwork = net_pruned)

VIPER/Omnipath

This is adapted more or less from the p53 repository

regulon_df <- OmnipathR::import_tfregulons_interactions(organism = 9606)
network <-
    regulon_df |>
    dplyr::filter(is_directed == 1, is_inhibition + is_stimulation == 1,
                  dorothea_level %in% c("A", "B", "C")) |>
    dplyr::transmute(source = source_genesymbol,
                     interaction = -1 * is_inhibition + is_stimulation,
                     target = target_genesymbol) |>
    dplyr::distinct()

TF_activities <- decoupleR::run_viper(gene_t_mat, dorothea::dorothea_hs_pancancer,
    .source = "tf", .target = "target", .mor = "mor")
measurements <- TF_activities[,c("source", "score")]  |> tibble::deframe()
pathway_activities <-
    progeny::progeny(as.matrix(measurements), scale = TRUE, top = 100, perm = 1000, organism = "Human")

# snip: convert these activities to progeny_weights via the progenyMembers.Rdata

CARNIVAL::generateLpFileCarnival(measurements = measurements,
                                   priorKnowledgeNetwork = network,
                                   weights = progeny_weights)

The VIPER/Omnipath method is solvable quickly with CBC, but as far as I am aware from what I am reading, is somewhat out of date in comparison to the previous method.

Thank you for your help!

@gabora
Copy link
Member

gabora commented Apr 17, 2024

Hello

  1. see the tutorial on how to produce CARNIVAL inputs, i.e. estimating TF activity: https://saezlab.github.io/decoupleR/articles/tf_bk.html

We are not using all DE genes as "measurements" in CARNIVAL. CARNIVAL models how sinaling leads to the regulators of differential gene expression, ie. to transcription factors. This is also why our prior knowledge network is a protein-protein interaction network and not a gene regulatory network.

  1. About the solvability of problems:
    I cannot think of a case when the CARNIVAL optimization would lead to infeasibility, i.e. when some constraints are contradictory. I guess what you observe is the numerical issues with CBC. We also observed strong limitation with the non-commercial optimizers and thus we recommend Gurobi or cplex that are free in academia.

If you would work with this in a non-academic setting, I would suggest to try to eliminate nodes and edges from the prior knowledge network. We used a couple of strategies in cosmos (https://github.com/saezlab/cosmosR), e.g.
(1) remove nodes from the network that are not expressed in any condition based on the RNAseq data
(2) remove nodes that are not connected with a directed path to the measurement (they are not observable anyways)
(3) remove nodes that are further than N number of steps from the inputs/measurements. E.g. N = 8. The rationale here is that the more steps we do on the network the less likely that our prediction makes any sense.

From these, mostly (3) has the strongest impact on the computational time. Maybe you need to go down to N=5 for CBC.

@gabora gabora closed this as completed Apr 17, 2024
@nijibabulu
Copy link
Author

Thank you for this very clear answer! This is all very helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants