New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combat Function for Batch Effects #398

Merged
merged 6 commits into from Jan 6, 2019

Conversation

Projects
None yet
4 participants
@Marius1311
Copy link
Contributor

Marius1311 commented Dec 14, 2018

Hi all,

this is a re-implementation of the ComBat function in python for batch effect removal by Brent Pedersen at the University of Utah which I slightly modified to work with AnnData objects. I asked Brent for permission and he would be happy with us using this.

Originally, the code was written in R for the SVA package:
Jeffrey T. Leek, W. Evan Johnson, Hilary S. Parker, Andrew E. Jaffe
and John D. Storey (). sva: Surrogate Variable Analysis. R package
version 3.4.0.

The idea is taken from this paper:
Johnson WE, Rabinovic A, Li C (2007). Adjusting batch effects in microarray
expression data using Empirical Bayes methods. Biostatistics 8:118-127.

Originally, the method was developed to adjust for batch effects in microarray data, however, it is commonly applied to scRNA-seq data nowadays. The method fits linear models to the genes and pools statistical power by means of EB to estimate per gene correction factors.

I understand that @mbuttner also has an implementation of this - maybe we can combine and get the best of both approaches?

@falexwolf
Copy link
Member

falexwolf left a comment

Thank you for the very well documented PR, @Marius1311! I have a few stylistic comments on the code below, which you should be able to address without much work.

But much more importantly, as this is a complete reimplementation, we need at least one unit test, better a couple more. Looking forward to the updated PR! 😄

Show resolved Hide resolved scanpy/preprocessing/combat.py Outdated
Show resolved Hide resolved scanpy/preprocessing/combat.py Outdated
Show resolved Hide resolved scanpy/preprocessing/combat.py Outdated
Show resolved Hide resolved scanpy/preprocessing/combat.py Outdated
Show resolved Hide resolved scanpy/preprocessing/combat.py Outdated
Show resolved Hide resolved scanpy/preprocessing/combat.py Outdated
sys.stderr.write("Finding parametric adjustments\n")
gamma_star, delta_star = [], []
for i, batch_idxs in enumerate(batch_info):
temp = it_sol(s_data[batch_idxs], gamma_hat[i],

This comment has been minimized.

@falexwolf

falexwolf Dec 15, 2018

Member

Mind your indentation. Generally, we'd indent along increments of 4 spaces and not along the position where a parenthesis opens. Here, you added a space that created misalignment.

This comment has been minimized.

@Marius1311

Marius1311 Dec 17, 2018

Author Contributor

I don't get this comment - indentation between line 97 and 98 is 4 spaces, so I assume you mean the indentation between lines 98 and 99?

This comment has been minimized.

@flying-sheep

flying-sheep Dec 18, 2018

Member

yup, that one:

temp = it_sol(
    s_data...,
    delta_hat...,
)

This comment has been minimized.

@Marius1311

Marius1311 Dec 18, 2018

Author Contributor

okay, got it.

This comment has been minimized.

@Marius1311

Marius1311 Dec 18, 2018

Author Contributor

spyder does that automatically though.

adata.X = bayesdata.values.transpose()


def it_sol(sdat, g_hat, d_hat, g_bar, t2, a, b, conv=0.0001):

This comment has been minimized.

@falexwolf

falexwolf Dec 15, 2018

Member

Make this a private function. Also, at least some documentation would be good, otherwise no one will ever understand how this works.

Finally, this looks a lot like a function that can be dramatically be sped up using numba, eg as in pp.calculate_qc_metrics.

This comment has been minimized.

@Marius1311

Marius1311 Dec 17, 2018

Author Contributor

Numba is a good option here but not entirely straightforward since data is stored in pd.DataFrame which numba doesn't know. Could add this in the future. I did however make it_sol a private function and I added some comments.

This comment has been minimized.

@flying-sheep

flying-sheep Dec 18, 2018

Member

you can access the underlying data using some_series.values: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.values.html

That’s a numpy array that numba should understand.

This comment has been minimized.

@Marius1311

Marius1311 Dec 18, 2018

Author Contributor

did that.

Show resolved Hide resolved scanpy/preprocessing/combat.py
Show resolved Hide resolved scanpy/preprocessing/combat.py Outdated
@fidelram

This comment has been minimized.

Copy link
Collaborator

fidelram commented Dec 17, 2018

adata.X = bayesdata.values.transpose()


def it_sol(sdat, g_hat, d_hat, g_bar, t2, a, b, conv=0.0001):

This comment has been minimized.

@flying-sheep

flying-sheep Dec 18, 2018

Member

you can access the underlying data using some_series.values: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.values.html

That’s a numpy array that numba should understand.

Show resolved Hide resolved scanpy/preprocessing/combat.py Outdated
Show resolved Hide resolved scanpy/preprocessing/combat.py

Marius1311 added some commits Dec 18, 2018

@Marius1311

This comment has been minimized.

Copy link
Contributor Author

Marius1311 commented Dec 18, 2018

Hey everyone, thanks for your feedback! In the latest commit, I have tried to include all of your comments, including the more stylistic comments, the references, the numba integration, the unit tests and so on. Have a look and see what you think. I won't be able to work on this any more this year because I am going on holidays. Merry Christmas everyone!

@falexwolf

This comment has been minimized.

Copy link
Member

falexwolf commented Jan 3, 2019

A very happy new year!

Your improvements look great! I see one small issue that I'll fix myself and I'll merge this and add you to the author list. It's a very nice comprehensive contribution!

@falexwolf falexwolf merged commit 8668f66 into theislab:master Jan 6, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@falexwolf

This comment has been minimized.

Copy link
Member

falexwolf commented Jan 6, 2019

Thank you very much! I merged this via the command line after adapting to the private module design.

I still get a to me cryptic AttributeError from patsy on my Mac, but the tests are fine and on the Linux server it also runs fine:

preprocessing/_combat.py:150: in combat
    s_data, design, var_pooled, stand_mean = stand_data(model, data)
preprocessing/_combat.py:78: in stand_data
    design = design_mat(model, batch_levels)
preprocessing/_combat.py:32: in design_mat
    model, return_type="dataframe")
../../../miniconda3/lib/python3.6/site-packages/patsy/highlevel.py:291: in dmatrix
    NA_action, return_type)
../../../miniconda3/lib/python3.6/site-packages/patsy/highlevel.py:165: in _do_highlevel_design
    NA_action)
../../../miniconda3/lib/python3.6/site-packages/patsy/highlevel.py:62: in _try_incr_builders
    formula_like = ModelDesc.from_formula(formula_like)
../../../miniconda3/lib/python3.6/site-packages/patsy/desc.py:164: in from_formula
    tree = parse_formula(tree_or_string)
../../../miniconda3/lib/python3.6/site-packages/patsy/parse_formula.py:148: in parse_formula
    _atomic_token_types)
../../../miniconda3/lib/python3.6/site-packages/patsy/infix_parser.py:210: in infix_parse
    for token in token_source:
../../../miniconda3/lib/python3.6/site-packages/patsy/parse_formula.py:94: in _tokenize_formula
    yield _read_python_expr(it, end_tokens)
../../../miniconda3/lib/python3.6/site-packages/patsy/parse_formula.py:44: in _read_python_expr
    for pytype, token_string, origin in it:
../../../miniconda3/lib/python3.6/site-packages/patsy/util.py:332: in next
    return six.advance_iterator(self._it)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

code = ''

    def python_tokenize(code):
        # Since formulas can only contain Python expressions, and Python
        # expressions cannot meaningfully contain newlines, we'll just remove all
        # the newlines up front to avoid any complications:
        code = code.replace("\n", " ").strip()
        it = tokenize.generate_tokens(StringIO(code).readline)
        try:
            for (pytype, string, (_, start), (_, end), code) in it:
                if pytype == tokenize.ENDMARKER:
                    break
                origin = Origin(code, start, end)
>               assert pytype not in (tokenize.NL, tokenize.NEWLINE)
E               AssertionError

../../../miniconda3/lib/python3.6/site-packages/patsy/tokens.py:35: AssertionError
@falexwolf

This comment has been minimized.

Copy link
Member

falexwolf commented Jan 6, 2019

This is a substantial contribution and I added Marius as author: 5557338 😄

@Marius1311

This comment has been minimized.

Copy link
Contributor Author

Marius1311 commented Jan 7, 2019

Thanks Alex! That's great, thanks also for adding me to the authors list. I haven't seen that patsy error on my Ubuntu machine either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment