Evaluation plots of the effect of Bayesian estimated gene expression distribution vs empirical gene expression from samples of various sizes #36

Closed
jianwu1 opened this issue Mar 4, 2022 · 8 comments

@jianwu1
Collaborator

jianwu1 commented Mar 4, 2022

No description provided.

@jianwu1 jianwu1 created this issue from a note in benchmark (In progress) Mar 4, 2022
@stemangiola
Owner

What do you mean? Installing the package will update the model. The new model takes the dataset of origin into account. Could you please rephrase?

@jianwu1
Collaborator Author

jianwu1 commented Mar 4, 2022

@stemangiola
Hi Stefano,
Here is an evaluation of the gene expression distribution estimated from the model.
IQR_bayes_vs_empirical
I selected the most variable gene (by IQR.bayes) from each cell type and looked at its empirical expression across samples. The black bars indicate the median of the gene expression estimated by the Bayesian model. For example, the Bayesian model estimates a higher variation for SNORD22 expression in t_helper cells than the empirical (imputed) data show. The black bar for SNORD22 (the median of the estimated SNORD22 expression in t_helper cells) is much higher than the empirical expression of SNORD22 in the samples we have.
empirical_counts_of_genes_highly_variable_in_bayes
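
The per-cell-type selection described above can be sketched roughly as follows. This is only an illustration in Python, not the code that produced the figures: the record layout and the field names `cell_type`, `gene` and `iqr_bayes` are assumptions standing in for the IQR.bayes column, and all numbers are invented.

```python
# Toy records; field names are hypothetical and the values are made up.
records = [
    {"cell_type": "t_helper", "gene": "SNORD22", "iqr_bayes": 4.2},
    {"cell_type": "t_helper", "gene": "GENE_A", "iqr_bayes": 1.1},
    {"cell_type": "t_CD8_naive", "gene": "GENE_B", "iqr_bayes": 0.7},
    {"cell_type": "t_CD8_naive", "gene": "GENE_C", "iqr_bayes": 3.5},
]


def most_variable_gene_per_cell_type(rows):
    """Return {cell_type: gene} for the gene with the largest Bayesian IQR."""
    best = {}
    for row in rows:
        current = best.get(row["cell_type"])
        # Keep the row with the highest iqr_bayes seen so far for this cell type.
        if current is None or row["iqr_bayes"] > current["iqr_bayes"]:
            best[row["cell_type"]] = row
    return {cell_type: row["gene"] for cell_type, row in best.items()}


top_genes = most_variable_gene_per_cell_type(records)
print(top_genes)
```

The empirical counts of each selected gene would then be plotted per sample, with the Bayesian median overlaid as the black bar.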

mean_bayes_vs_empirical
For each cell type, I selected the gene whose mean expression deviates most between the Bayesian estimate and the empirical data (mean.bayes against means.empirical) and looked at its empirical expression across samples. The black bars indicate the median of the gene expression estimated by the Bayesian model.
empirical_counts_of_genes_with_higher_mean_in_bayes

I think the plots show that the Bayesian-estimated gene expression preserves the variability of imputed genes. However, for unimputed genes, such as in t_CD8_naive, the variation of the Bayesian-estimated gene expression is much higher than that of the empirical expression.

@jianwu1 jianwu1 changed the title from "FOR STEFANO: I have to update the Bayes model to take into account of dataset of origin." to "Evaluation plots of the effect of Bayesian estimated gene expression distribution vs empirical gene expression from samples of various sizes" Mar 4, 2022
@jianwu1
Collaborator Author

jianwu1 commented Mar 4, 2022

For comparison, here is the plot from the previous Bayes model:
figure4

@stemangiola
Owner

Hello Jian,

This can make sense if we have just one dataset with extreme consistency within it. We would fool ourselves into thinking that the world is that consistent, when in fact a single laboratory has produced a dataset with no variation. Can you

  1. color by dataset, and add the feature on the x axis
  2. tell me whether these data points are imputed somehow? For example, for T helper h1 I see a lot of points > 0 but with sd ~ 0. Very peculiar for experimental data.

@jianwu1
Copy link
Collaborator Author

jianwu1 commented Mar 4, 2022

@stemangiola

this can make sense, in case we have just one dataset, and extreme consistency within that dataset.
I see. You are right, Stefano: the CD8_naive cells have only 4 datasets in the collection, which is reflected in the few data points in the boxplot. The small number of datasets is probably the cause of the artificially small empirical variation in gene expression.

  1. color by dataset, and add feature in the x axis
    color_by_dataset

2. T helper h1 I see a lot of points > 0 but with sd ~ 0. Very peculiar for experimental data.
Hi Stefano, you are right: most of the invariant gene expression values are imputed.
invariant_imputed_genes
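
A check like the one above (points > 0 with sd ~ 0, and whether they are imputed) could be sketched as follows. This is a hedged illustration only: the data layout, the `is_imputed` flag and the tolerance `sd_tol` are assumptions, and the numbers are invented.

```python
from statistics import pstdev

# Hypothetical layout: (cell_type, gene) -> list of (count, is_imputed), one pair per sample.
observations = {
    ("t_helper_h1", "GENE_A"): [(5.0, True), (5.0, True), (5.0, True)],
    ("t_helper_h1", "GENE_B"): [(2.0, False), (9.0, False), (4.0, False)],
    ("t_helper_h1", "GENE_C"): [(0.0, False), (0.0, False), (0.0, False)],
}


def invariant_nonzero(obs, sd_tol=1e-8):
    """Map each expressed but invariant (cell_type, gene) to its imputed fraction."""
    flagged = {}
    for key, samples in obs.items():
        counts = [count for count, _ in samples]
        # Expressed in every sample (all counts > 0) yet essentially no spread.
        if min(counts) > 0 and pstdev(counts) <= sd_tol:
            flagged[key] = sum(imputed for _, imputed in samples) / len(samples)
    return flagged


flagged = invariant_nonzero(observations)
print(flagged)
```

An imputed fraction close to 1 for the flagged genes would be consistent with the invariance coming from imputation rather than from the experiment.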

I have another question about imputation, which is actually a question that was put to me at my final thesis talk. Someone asked whether imputation by cell type would distort the biological expression of genes in cell types, and I didn't fully comprehend it back then. When we have a few samples of the same cell type, the transcriptomic differences between samples are technical missing values and ought to be imputed. However, when we use the gene expression of child cell types to impute expression in the parent cell types, even level by level, wouldn't that confound the biological differences?

@stemangiola
Owner

stemangiola commented Mar 5, 2022

To do:

  1. select imputed genes that differ between Bayes and non-Bayes and draw boxplots, coloured by dataset
  2. select imputed genes excluding 0 counts and do the same
  3. select unimputed genes and do the same
  4. select imputed genes by mean difference and do the same
  5. select unimputed genes by mean difference and do the same
  6. draw a boxplot for genes where the bayes mean = 0 and arithmetic mean > 0
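
The selections in this list could be prototyped along these lines before plotting. This is a minimal sketch under stated assumptions: the fields `imputed`, `mean_bayes` and `counts` are hypothetical names, the values are invented, and the exact comparison with 0 may need a tolerance on real model output.

```python
from statistics import mean

# Hypothetical per-gene summary rows for one cell type; values are made up.
genes = [
    {"gene": "GENE_A", "imputed": True, "mean_bayes": 0.0, "counts": [0.0, 3.0, 6.0]},
    {"gene": "GENE_B", "imputed": False, "mean_bayes": 7.0, "counts": [2.0, 2.0, 2.0]},
    {"gene": "GENE_C", "imputed": True, "mean_bayes": 4.0, "counts": [4.0, 5.0, 3.0]},
]


def mean_difference(row):
    """Absolute gap between the Bayesian mean and the arithmetic mean of the counts."""
    return abs(row["mean_bayes"] - mean(row["counts"]))


def rank_by_mean_difference(rows, imputed):
    """Genes of one imputation status, largest mean difference first (items 4 and 5)."""
    subset = [row for row in rows if row["imputed"] == imputed]
    return [row["gene"] for row in sorted(subset, key=mean_difference, reverse=True)]


def zero_bayes_positive_empirical(rows):
    """Genes with Bayes mean = 0 but arithmetic mean > 0 (item 6)."""
    return [row["gene"] for row in rows
            if row["mean_bayes"] == 0 and mean(row["counts"]) > 0]


imputed_ranked = rank_by_mean_difference(genes, imputed=True)
unimputed_ranked = rank_by_mean_difference(genes, imputed=False)
zero_mean_genes = zero_bayes_positive_empirical(genes)
print(imputed_ranked, unimputed_ranked, zero_mean_genes)
```

Each resulting gene list would then feed a boxplot of empirical counts coloured by dataset.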

@jianwu1
Copy link
Collaborator Author

jianwu1 commented Mar 5, 2022

@stemangiola

  1. select imputed different between bayes and non-bayes and draw boxplots, coloured by dataset
    IQR_bayes_vs_empirical

highly_variable_imputed_genes

2. select imputed excluding 0 counts and do the same
highly_variable_imputed_non_zero

highly_variable_non_zero_imputed_genes

3. select unimputed and do the same
highly_variable_unimputed_genes

highly_variable_unimputed_genes(1)

4. select genes imputed with the mean difference and do the same

high_mean_difference_imputed

imputed_genes_with_highest_mean_difference

5. select unimputed genes by mean difference and do the same

high_mean_difference_unimputed

unimputed_genes_highest_diference_in_mean

6. draw a boxplot for genes where the bayes mean = 0 and arithmetic mean > 0

zero_mean

genes_with_zero_mean_in_bayes

@Kamran-Khan96
Collaborator

Obsolete
