Question regarding batch effect removal step #5

Open
sdontsay opened this issue May 14, 2023 · 6 comments

@sdontsay

Hi Ye,

Thanks for this great tool. I am trying to run it on the Kim et al. 2020 dataset, though not the copy you provided: I downloaded it elsewhere, in a form where the cells are concatenated together, 16707 cells in total. That shouldn't be a problem: the resolution is still 500k, I separated the cells by their IDs, and I converted the data to a format that scVI-3D can process. To account for batch effects, I also included a cell summary file like the following (sampled from the file):

name batch cell_type
cell_1.txt IMR90-HAP1.R1 HAP1
cell_2.txt IMR90-HAP1.R1 HAP1
cell_3.txt IMR90-HAP1.R1 HAP1
cell_4.txt IMR90-HAP1.R1 HAP1
cell_5.txt IMR90-HAP1.R1 HAP1
cell_6.txt IMR90-HAP1.R1 IMR90
cell_7.txt IMR90-HAP1.R1 HAP1
cell_8.txt IMR90-HAP1.R1 HAP1
cell_9.txt IMR90-HAP1.R1 HAP1

As the cell summary above shows, I don't have the depth and sparsity columns that the example file has, but I think this should be enough for batch removal.

However, when running the algorithm, I got the following error after 400 epochs:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/mmfs1/apps/spack/0.16.1/linux-rhel8-zen2/gcc-10.2.0/python-3.8.6-2pmflf74yv3epdgoav5gykxzbrdxl37l/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/mmfs1/scratch/sdontsay/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/mmfs1/scratch/sdontsay/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/mmfs1/scratch/sdontsay/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/mmfs1/scratch/scVI-3D/scripts/scVI-3D.py", line 194, in normalize
    imputeTmp = imputeTmp + model.get_normalized_expression(library_size = bandDepth, transform_batch = batchName)
  File "/mmfs1/scratch/sdontsay/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mmfs1/scratch/sdontsay/lib/python3.8/site-packages/scvi/model/base/_rnamixin.py", line 100, in get_normalized_expression
    transform_batch = _get_batch_code_from_category(
  File "/mmfs1/scratch/sdontsay/lib/python3.8/site-packages/scvi/model/_utils.py", line 243, in _get_batch_code_from_category
    raise ValueError(f'"{cat}" not a valid batch category.')
ValueError: "GM12878-IMR90.R1" not a valid batch category.
"""

I don't understand how to fix this, since I have included the batch information in the cell summary file. I also checked the scvi-tools source code, and it looks like it can only account for known batches. Is that correct? I previously used another tool that could handle the batch information I attached, so I assumed I could simply pass in the batch information and have the effect removed. If not, please correct me. Thanks!
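
One way to narrow this down, offered as a rough sketch rather than a confirmed diagnosis: the ValueError in the traceback is raised when the name passed as transform_batch is not among the batch categories the model knows. Assuming the model in question was set up on only a subset of the cells, it may help to compare the batch labels in the full summary file with the labels of the cells in that subset. The summary rows, the subset, and the helper names below (read_summary, batch_counts, missing_batches) are all made up for illustration and are not part of scVI-3D or scvi-tools.

```python
import io
from collections import Counter

# Illustrative rows only (not taken from the real summary file).
SUMMARY_TEXT = """\
name batch cell_type
cell_101.txt IMR90-HAP1.R1 HAP1
cell_102.txt IMR90-HAP1.R1 HAP1
cell_103.txt IMR90-HAP1.R1 IMR90
cell_104.txt GM12878-IMR90.R1 GM12878
"""


def read_summary(handle):
    """Parse a whitespace-delimited summary into dicts keyed by the header row."""
    rows = [line.split() for line in handle if line.strip()]
    header, records = rows[0], rows[1:]
    return [dict(zip(header, rec)) for rec in records]


def batch_counts(records, cell_names=None):
    """Count cells per batch, optionally restricted to a subset of cell names."""
    if cell_names is not None:
        wanted = set(cell_names)
        records = [r for r in records if r["name"] in wanted]
    return Counter(r["batch"] for r in records)


def missing_batches(records, cell_names):
    """Batches listed in the full summary but absent from the given subset."""
    all_batches = set(batch_counts(records))
    seen_batches = set(batch_counts(records, cell_names))
    return sorted(all_batches - seen_batches)


records = read_summary(io.StringIO(SUMMARY_TEXT))
print(batch_counts(records))

# Hypothetical subset of cells that one model happened to be set up on.
subset = ["cell_101.txt", "cell_102.txt", "cell_103.txt"]
print(missing_batches(records, subset))
```

If a batch shows up in the second list, asking a model built on that subset to transform into that batch would be expected to fail in the way shown above; whether this is what happens inside scVI-3D.py is an assumption that would need checking against the script.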

@yezhengSTAT
Owner

yezhengSTAT commented May 14, 2023

Hello,
I haven't come across a similar error before. Could you check whether there is anything special about the GM12878-IMR90.R1 batch category? Have you tried the demo data to see whether it runs successfully? Does the program complete successfully if the GM12878-IMR90.R1-related cells are excluded?

Personally, I do not recommend removing the batch effect when batch and cell type are confounded. In the paper we also provide batch removal results (UMAP figures) for a case where batch and cell type are confounded, and the removal tends to blur the cell type separation.
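
A quick way to see whether batch and cell type are confounded in a summary file is to list, for each cell type, the batches it appears in. The sketch below uses a deliberately simple heuristic (a cell type seen in exactly one batch is flagged), which is an assumption of this example and not a criterion from the paper; the function names and the toy labels are hypothetical.

```python
from collections import defaultdict


def batches_per_cell_type(pairs):
    """Map each cell type to the set of batches in which it was observed."""
    by_type = defaultdict(set)
    for batch, cell_type in pairs:
        by_type[cell_type].add(batch)
    return dict(by_type)


def confounded_cell_types(pairs):
    """Cell types observed in exactly one batch.

    For these, a batch difference cannot be told apart from a
    cell type difference using the labels alone.
    """
    return sorted(
        cell_type
        for cell_type, batches in batches_per_cell_type(pairs).items()
        if len(batches) == 1
    )


# Toy (batch, cell_type) labels, one pair per cell.
pairs = [
    ("batchA", "type1"),
    ("batchA", "type2"),
    ("batchB", "type2"),
    ("batchB", "type3"),
]
print(confounded_cell_types(pairs))
```

Here type2 is shared by both batches, so it can anchor the correction, while type1 and type3 each occur in a single batch and are flagged.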

Best,
Ye

@sdontsay
Author

Thank you, Ye, for your quick response! I have already run the demo data, and it worked well. I can try running the Kim et al. dataset without the GM12878-IMR90.R1-related cells, but even if that works it won't help me much, since I still need those cells to be normalized.

I guess I can run with batch removal turned off, as you suggested above. Also, do you have a cell summary file for the Kim et al. dataset that you provide in BandNorm? If so, perhaps I can try yours.

Additionally, when I ran your script, I found that some of the scvi-tools calls in scVI-3D.py have been deprecated: "scvi.data.setup_anndata(adata)" is now called as "scvi.model.SCVI.setup_anndata(adata)". You may need to update your code accordingly.
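
One possible way to tolerate both spellings is a small wrapper that prefers the newer entry point and falls back to the older one. This is only a sketch: setup_anndata_compat and its scvi_module parameter are hypothetical names, not part of scVI-3D or scvi-tools, and the demo at the bottom uses stand-in objects so it runs without scvi-tools installed.

```python
from types import SimpleNamespace


def setup_anndata_compat(adata, scvi_module=None, **kwargs):
    """Call whichever setup_anndata entry point the given scvi module offers."""
    if scvi_module is None:
        # Imported lazily so this sketch can be loaded without scvi-tools.
        import scvi as scvi_module
    model_cls = getattr(getattr(scvi_module, "model", None), "SCVI", None)
    if model_cls is not None and hasattr(model_cls, "setup_anndata"):
        # Newer spelling: scvi.model.SCVI.setup_anndata(adata, ...)
        return model_cls.setup_anndata(adata, **kwargs)
    # Older spelling: scvi.data.setup_anndata(adata, ...)
    return scvi_module.data.setup_anndata(adata, **kwargs)


# Stand-ins that mimic the two API layouts, for demonstration only.
calls = []
new_api = SimpleNamespace(
    model=SimpleNamespace(
        SCVI=SimpleNamespace(
            setup_anndata=lambda adata, **kw: calls.append(("new", kw))
        )
    ),
    data=SimpleNamespace(),
)
old_api = SimpleNamespace(
    model=SimpleNamespace(SCVI=SimpleNamespace()),
    data=SimpleNamespace(
        setup_anndata=lambda adata, **kw: calls.append(("old", kw))
    ),
)

setup_anndata_compat(object(), new_api, batch_key="batch")
setup_anndata_compat(object(), old_api, batch_key="batch")
print(calls)
```

With a real installation the call would simply be setup_anndata_compat(adata, batch_key="batch"), which imports scvi itself. Other calls in the script may have changed between versions too, so this covers only the one mentioned above.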

Thanks

@yezhengSTAT
Owner

Yes, the summary file for Kim2020 is provided through the BandNorm package: https://sshen82.github.io/BandNorm/articles/BandNorm-tutorial.html#download-existing-single-cell-hi-c-data

More specifically: https://pages.stat.wisc.edu/~sshen82/bandnorm/Summary/Kim2020_Summary.txt

Yes, scvi-tools has been updated quite frequently since we launched scVI-3D. Thanks for pointing it out! We will make scVI-3D more robust to both newer and older versions.

Thanks,
Ye

@sdontsay
Author

Thanks for the information!

Also, may I ask a question about BandNorm? In the BandNorm tutorial you provided (https://sshen82.github.io/BandNorm/articles/BandNorm-tutorial.html), the same contact-region input files (format 1) can be passed to BandNorm for normalization. However, I don't see any mention of including the cell summary information when calling the main BandNorm function, "bandnorm_result = bandnorm(hic_df = hic_df, save = FALSE)", whereas scVI-3D has that option. Did I miss something, or is it just not necessary?

Thanks

@yezhengSTAT
Owner

Hello,
Yes, you are right. Summary information is not needed for BandNorm normalization. BandNorm itself does not remove the batch effect, so it only needs the contact counts as input. To remove the batch effect, you can run Harmony after the BandNorm normalization, as indicated in https://sshen82.github.io/BandNorm/articles/BandNorm-tutorial.html#use-bandnorm.

Best,
Ye

@sdontsay
Author

Thank you, Ye. I think I've got what I needed to know.

Regards
