Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a batch_key option to HVG function to reduce the batch effects #622

merged 7 commits into from Apr 30, 2019


Copy link

commented Apr 26, 2019

(sorry, I messed up my branch, so sending a new PR)

I added a new batch_key option to HVG function. If specified, it runs the HVG selection in every batch separately and then merges the list in order to reduce the batch effects by avoiding the selection of batch-specific genes. This doesn't fully correct the batch effect but reduces it.

Running the function for each batch is trivial but merging is trickier than I thought. How I do it now is as follows:

  • hvg is run on each batch and resulting hvg lists are first concatenated into a single dataframe.
    The data frame is grouped by genes. mean, dispersion and normalized dispersion values are aggregated via np.nanmean. Two new columns are created 1) "in how many batches a gene is detected as hvg". 2) intersection of all HVGs across batches.

  • if n_top_genes is given, combined hvg lists are sorted by 1) in how many batches a gene is detected as hvg 2) normalized dispersion. normalized dispersion is used to break the ties. Then top n genes are selected as the final hvg list.

  • if n_top_genes is not given, same mean and dispersion thresholds are applied to the combined hvg list.

Here is the code to see the improvement of this approach:

import scanpy as sc
import numpy as np
import pandas as pd

ad ="pancreas.h5ad", backup_url="") # adapted from scGen repo
ad.obs["cell_type"] = ad.obs["celltype"].tolist()

ad = sc.AnnData(ad.raw.X, var=ad.raw.var, obs=ad.obs)

sc.pp.highly_variable_genes(ad, batch_key='batch')
sc.pp.neighbors(ad), color=["batch", "cell_type"])

This comment has been minimized.

Copy link

commented Apr 26, 2019

This is very helpful! Great! 😄

But, can have a non-recursive formulation of this? Others and I worked to get rid of many of the initial recursive formulations as they were hard to read.

And here, it's the same thing. It's already a very long function and should not get longer. Can you just rename the old highly_variable_genes to _highly_variable_genes_single_batch and remove the recursion? It's a very simple change, I'd be grateful...

Can you also update docs/release_notes.rst with a link to this PR?

@gokceneraslan gokceneraslan force-pushed the gokceneraslan:master branch from 2ef0632 to 3db4715 Apr 30, 2019

@gokceneraslan gokceneraslan merged commit 7302283 into theislab:master Apr 30, 2019

1 check failed

continuous-integration/travis-ci/pr The Travis CI build failed

This comment has been minimized.

Copy link

commented Apr 30, 2019

Awesome! 😄

But you forgot to advertise your work in the release_notes.rst. 😉


This comment has been minimized.

Copy link
Collaborator Author

commented May 1, 2019

Actually, I added two commits to my master branch and one was about the release notes. But then instead of pushing to my fork, I pushed these to the master branch of scanpy repo by mistake. Fortunately, there were just these two commits that I wanted to add to this PR, so everything should be all right. Sorry for the confusion.


This comment has been minimized.

Copy link

commented May 7, 2019

Just noticed this. I had a similar function in scanpy-scripts ( but yours is much more comprehensive in reporting aggregated information. Good job!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
3 participants
You can’t perform that action at this time.