Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How does a ULM on LFCs from bulk RNA-seq work for TF activity inference? #21

Closed
adamklie opened this issue Oct 14, 2022 · 6 comments
Closed
Assignees
Labels
question User question: anything that's not obviously a bug

Comments

@adamklie
Copy link

Describe your question
Thanks once again for putting together this package. I've found it super usable and useful so far. I just have hopefully a quick question about the underlying method for analysis of bulk RNA-seq data.

I'm getting some awesome results from fitting a univariate linear model to log fold changes (LFC) from DESeq2 using the DoRothEA network, but I don't really understand how I'm getting these TF activities. I follow the general idea that "the observed molecular readouts in mat are the response variable and the regulator weights in net are the explanatory ones," but I guess I don't really understand what these are in my specific context and how I end up with an activity score for each TF. I'm not sure I even follow what each data point is in this context. Am I fitting a separate model for each TF?

If someone could help me understand this better or point me to the details (that I apologize in advance if I missed them) that would be amazing!

@adamklie adamklie added the question User question: anything that's not obviously a bug label Oct 14, 2022
@PauBadiaM PauBadiaM self-assigned this Oct 17, 2022
@PauBadiaM
Copy link
Member

Hi @adamklie

Glad you find the package useful!

As you know decoupler takes an input matrix of gene expression (GEX) and a prior knowledge network (Net) which we transform into matrix format internally. In your case, your GEX is made out of the contrasts' statistics between conditions (if you only have one then it is only one row). When you run ulm, for each "sample" (in your case contrasts) and TF, you fit a univariate linear model. The response variable is the observed GEX (in your case the observed change between conditions) and the explanatory are the weights for that TF. Once the model is fitted, we extract the t-value of the fitted model as the "activity". If the genes that belong to a certain TF show increased activity when their weights are positive, the slope will be positive and we will have positive activities. If there is disagreement, for example genes with negative logFCs and positive weights, the slope will be negative.

image

To briefly answer your question, yes, we fit a separate model for each TF in ulm. However, in the case of mlm we fit one for each sample/contrast, where we include all TFs at the same time, thus the name "multivariate".

Hope this is helpful! Feel free to ask more questions if needed.

@adamklie
Copy link
Author

Thank you so much for the detailed explanation and the figure! Makes it very clear how this is working. Now I'm trying to think about when you might expect one to work better than the other. I would expect that many of the explanatory TFs would have correlated weights that might make it harder to fit a mlm, but that many target genes should be explained by the action of multiple TFs. I guess its not hard to try both and inspect the fit, but are there other considerations or sub-selections of the network or data that make sense before fitting the mlm?

@PauBadiaM
Copy link
Member

Very good points. In the end there is no free lunch, there is a tradeoff.

The advantage of ulm is that we don't care if two TFs have highly correlated weights because we are testing then independently, but you might get false positives and we are not accounting for TF interaction effects when modeling activities.

The advantage of mlm is that since it is multivariate, it accounts for these interaction effects when modeling activities but if the network is too co-linear it might break for those TFs. BTW, you can always check how co-linear a given net is for your data using:

dc.check_corr(net, mat=mat)

If you see that some TFs pairs have high correlations (> 0.9), you should definitely double check how the obtained activities look for these when using mlm.

Therefore, if you are not sure of which one to pick, you can always use the consensus score, which by default models activities based on the results of ulm, norm_wsum and mlm. In our benchmarks we have seen that depending on the data, one of these three methods outperforms the other two, but that their consensus is always the slightly better alternative.

Hope this is helpful!

@adamklie
Copy link
Author

So helpful. Thanks again for taking the time to explain, I really appreciate it!

@gjones3339
Copy link

gjones3339 commented Oct 24, 2024

If we only care regulatory activity in a list of about 30-50 genes (and their logFCs), can we still run consensus? Or we should stick to ORA in that case?

@PauBadiaM
Copy link
Member

Hi @gjones3339,

Without a proper background of genes I would recommend just running ORA (there we assume a background of ~20k genes, all the protein coding genes). For list of genes you can easily run ora with get_ora_df, an example can be seen in this vignette. Hope this is helpful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question User question: anything that's not obviously a bug
Projects
None yet
Development

No branches or pull requests

3 participants