About data preprocessing for diffusion pseudotime #26

Closed
ShuhuaGao opened this issue Jul 3, 2017 · 4 comments

@ShuhuaGao

Hi, first, thanks for sharing this analysis tool. I much prefer Python to R, though most bioinformatics tools are written in R. I have a question about how data should be processed before we feed it as adata into dpt for pseudotime ordering.

The DPT algorithm can accept multiple types of data, most commonly single-cell qPCR (Ct values) and RNA-Seq (FPKM/TPM) data. Is the preprocessing procedure the same for each? I have also looked at Monocle 2, where it seems much more complicated. For instance, on the 4th page of its documentation (link), it asks you to specify a different expressionFamily, i.e., the appropriate distribution of the data, for each kind of data. How about the dpt function in scanpy? Does it treat all kinds of data the same way?

According to my understanding,

  • For qPCR data, we should provide delta_Ct = LOD - Ct values to dpt (LOD: limit of detection);
  • For RNA-Seq data, we should provide log2(FPKM+1) to dpt.

Is that right?
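
For illustration, the two transforms proposed above can be written out in plain Python. This is only a minimal sketch of the arithmetic, not part of Scanpy: the LOD value, the sample numbers, and the clipping of undetected genes to 0 are assumptions made for the example.

```python
import math

LOD = 28.0  # hypothetical limit of detection, in PCR cycles


def delta_ct(ct, lod=LOD):
    """Return LOD - Ct.

    Assumption for this sketch: genes at or beyond the limit of
    detection (Ct >= LOD) are clipped to 0 instead of going negative.
    """
    return max(lod - ct, 0.0)


def log2_fpkm(fpkm):
    """Return log2(FPKM + 1); the +1 keeps zero expression at 0."""
    return math.log2(fpkm + 1.0)


# Made-up example values for one cell.
ct_values = [18.0, 25.5, 28.0, 31.0]
fpkm_values = [0.0, 1.0, 3.0, 1023.0]

delta_ct_values = [delta_ct(ct) for ct in ct_values]
log_fpkm_values = [log2_fpkm(x) for x in fpkm_values]

print(delta_ct_values)  # [10.0, 2.5, 0.0, 0.0]
print(log_fpkm_values)  # [0.0, 1.0, 2.0, 10.0]
```

Both outputs end up on a log2-like scale, which is why the question of whether a further log transform is needed comes up later in the thread.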

Any help is appreciated.

@falexwolf
Member

falexwolf commented Jul 3, 2017

Hi ShuhuaGao,

thanks for your input! It's true that Monocle 2 has many more options for preprocessing. I believe, though, that the limited options in Scanpy are enough for a robust pseudotime and branching inference with DPT, simply because DPT itself is very robust. Nonetheless, I have to admit that I have not worked with a wide range of data types. From this limited experience, my understanding is the following:

  • For RNA-Seq data, you should normalize and extract highly variable genes. This is most simply done with the Cell Ranger procedure sc.pp.recipe_zheng17 (example here) or, if you want more control, the Seurat workflow (example here).
  • For qPCR, a simple log-normalization (sc.pp.log1p) should suffice (see example here); you might, though, consider "normalizing per cell / UMI correction", one of the steps done in the RNA-seq part (sc.pp.normalize_per_cell).
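
As a rough illustration of what "normalizing per cell" followed by a log transform does to a count matrix, here is a plain-Python sketch. It does not call Scanpy and is not Scanpy's implementation; the helper names, the `target_sum` parameter, and the toy counts are assumptions made for the example.

```python
import math


def normalize_per_cell(counts, target_sum):
    """Scale each cell (row) so that its counts sum to target_sum."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with no counts at all are left as zeros.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([value * scale for value in cell])
    return normalized


def log1p_matrix(matrix):
    """Apply log(1 + x) element-wise."""
    return [[math.log1p(value) for value in row] for row in matrix]


# Toy matrix: rows are cells, columns are genes.
counts = [
    [10, 0, 30],  # cell 1: 40 counts in total
    [1, 1, 2],    # cell 2: 4 counts in total
]

normalized = normalize_per_cell(counts, target_sum=100.0)
logged = log1p_matrix(normalized)

print(normalized)  # [[25.0, 0.0, 75.0], [25.0, 25.0, 50.0]]
```

After the scaling step both cells have the same total, so differences in sequencing depth no longer dominate the comparison between cells.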

Ask if you have further questions. 😄

@ShuhuaGao
Author

Hi, Alex,

Many thanks for your quick reply. I just saw your reply as it is almost 10PM in Singapore now. It is understandable to perform quality control, in-cell normalization and to extract the highly variable genes for ordering. I got your point.

Regarding qPCR, do we really need a log normalization? I think a log transform is only required for RNA-Seq data, to turn a skewed distribution into a roughly normal one. For qPCR data, the delta_Ct value is already on a log scale. In the example you mentioned, there is no call to sc.pp.log1p either. Instead, we just read the data with
adata = sc.read(filename, sheet='dCt_values.txt', backup_url=backup_url)
and no further processing is applied. As can be found in the original paper, the so-called dCt_value is simply defined as HK_Ct - Ct, where HK_Ct is the mean Ct of 4 housekeeping genes, computed per cell.
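
The dCt definition above can be sketched in a few lines of plain Python. The gene names and Ct values below are made up for illustration, and dropping the housekeeping genes from the output is an assumption of this sketch, not something stated in the paper.

```python
from statistics import mean


def dct_for_cell(ct_by_gene, housekeeping_genes):
    """Compute dCt = HK_Ct - Ct for every non-housekeeping gene of one cell.

    HK_Ct is the mean Ct of the housekeeping genes in that same cell.
    """
    hk_ct = mean(ct_by_gene[gene] for gene in housekeeping_genes)
    return {
        gene: hk_ct - ct
        for gene, ct in ct_by_gene.items()
        if gene not in housekeeping_genes
    }


# Hypothetical Ct values for a single cell.
cell_ct = {
    "HK1": 18.0, "HK2": 19.0, "HK3": 20.0, "HK4": 21.0,  # housekeeping genes
    "GeneA": 22.5,
    "GeneB": 15.5,
}
housekeeping = {"HK1", "HK2", "HK3", "HK4"}

dct = dct_for_cell(cell_ct, housekeeping)
print(dct)  # {'GeneA': -3.0, 'GeneB': 4.0}
```

Because Ct counts PCR cycles, a difference of one unit corresponds to roughly a two-fold change in template amount, which is why dCt values already behave like log2-scale expression.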

Besides, in many cases, there may be no UMI data available. In such a case, the normalization per cell for RNA-Seq is actually to compute the FPKM/TPM to compensate for the sequencing depth, right? Usually, the RNA-Seq data in FPKM form is already provided in publications. And then we work on this data to find the highly variable genes. (Just personal understanding. I am new to this field from mechatronics engineering.)
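
To make the depth-compensation point concrete, here is a minimal sketch of a TPM calculation in plain Python. The read counts and gene lengths are invented for the example, and the function name is hypothetical.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts of one cell to transcripts per million.

    Counts are first divided by gene length (in kilobases), then scaled
    so that the values of the cell sum to one million.
    """
    rates = [count / length for count, length in zip(counts, lengths_kb)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [rate / total * 1e6 for rate in rates]


lengths_kb = [1.0, 2.0, 4.0]   # hypothetical gene lengths in kb

shallow = [100, 200, 0]        # cell sequenced at low depth
deep = [1000, 2000, 0]         # same composition, 10x the depth

tpm_shallow = tpm(shallow, lengths_kb)
tpm_deep = tpm(deep, lengths_kb)

print(tpm_shallow)  # [500000.0, 500000.0, 0.0]
print(tpm_deep)     # [500000.0, 500000.0, 0.0]
```

The two cells differ ten-fold in depth but get identical TPM values, which is the sense in which FPKM/TPM already includes a per-cell normalization.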

Anyway, thanks again for your help. I noticed that there are no examples for pseudo-time ordering with RNA-Seq data. Maybe I can provide one in the near future, as I am working on gene network modeling based on the pseudo-time information.

@falexwolf
Member

Hi!

Everything you write makes sense: if the qPCR values are already on a log scale, you shouldn't log-transform them again, and if the RNA-seq data is already in FPKM form, you do not need to account for UMI correction.

Regarding the pseudotime example for RNA-seq data: here is a public one. But it would be nice to have more!

Thanks for your input!

@ShuhuaGao
Author

Thanks for your reply. I will try that and may give more feedback. Cheers!
