
Queries fix #467

Open
wants to merge 24 commits into
base: master
from

Conversation

6 participants
@ivirshup
Collaborator

commented Feb 8, 2019

Following up on #242.

Here's my solution to the current queries being pretty unreliable for me (due to an issue with the bioservices module). It's all a pretty thin wrapper around pybiomart, which has a nice API and is well tested, but has maintenance issues.

Currently I've replaced the gene_coordinates query with a more generic biomart_annotations – the example covers the functionality of gene_coordinates. I'm debating how to add tests, given that they're network-based (they could fail when nothing is wrong with the code) and can take a while.

@flying-sheep flying-sheep force-pushed the theislab:master branch 2 times, most recently from 3efb194 to fc84096 Feb 12, 2019

@ivirshup ivirshup force-pushed the ivirshup:queries_fix branch from af52e01 to 5cab263 Feb 15, 2019

@flying-sheep

Member

commented Feb 15, 2019

Looks nice! The singledispatch makes it pretty neat!

Could you please add return type annotations? I think gprofiler usually returns a pd.DataFrame, but if the query comes back empty it for some reason returns an empty list? I guess that’s a leftover from the R port.

I think we could return gprofiler(...) or None to catch that case and add the return annotation Optional[pd.DataFrame].
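A minimal sketch of that normalization (with `_fake_gprofiler` as a hypothetical stand-in for the real call; an explicit emptiness check is used rather than `... or None`, because `bool(df)` raises for DataFrames):

```python
from typing import List, Optional, Union

import pandas as pd

def _fake_gprofiler(query: List[str]) -> Union[pd.DataFrame, list]:
    # stand-in for the real gprofiler() call, which returns an
    # empty list (not a DataFrame) when the query has no hits
    if not query:
        return []
    return pd.DataFrame({"term.id": ["GO:0008150"], "p.value": [0.01]})

def enrich(query: List[str]) -> Optional[pd.DataFrame]:
    # normalize the empty-list leftover to None; the check is
    # explicit because bool(DataFrame) raises ValueError
    result = _fake_gprofiler(query)
    if isinstance(result, pd.DataFrame) and not result.empty:
        return result
    return None
```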

@LuckyMD

Contributor

commented Feb 15, 2019

gprofiler functionality is being added to scanpy? I have a small wrapper for that as well... the main component being a try/except wrapper around it, as it can raise an error when there are no results.

@flying-sheep

Member

commented Feb 15, 2019

Ah? The code seems like it just returns an empty list when there are no results. Where is the error thrown?

@LuckyMD

Contributor

commented Feb 15, 2019

This is the traceback I get when I receive no results:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-91-65d9edda0106> in <module>()
      1 test = gprofiler(list(module_membership['MEpink']), 
      2                                custom_bg=list(np.unique([v.split(".")[0] for v in var.index.tolist()])),
----> 3                                organism='mmusculus', correction_method='fdr', src_filter=['GO:BP'])

~/anaconda3/lib/python3.6/site-packages/gprofiler/__init__.py in gprofiler(query, organism, ordered_query, significant, exclude_iea, region_query, max_p_value, max_set_size, correction_method, hier_filtering, domain_size, custom_bg, numeric_ns, no_isects, png_fn, include_graph, src_filter)
    147                           "query.size", "overlap.size", "recall", "precision",
    148                           "term.id", "domain", "subgraph.number", "term.name",
--> 149                           "relative.depth", "intersection"]
    150     enrichment.index = enrichment['term.id']
    151     numeric_columns = ["query.number", "p.value", "term.size",

~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   4387         try:
   4388             object.__getattribute__(self, name)
-> 4389             return object.__setattr__(self, name, value)
   4390         except AttributeError:
   4391             pass

pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()

~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
    644 
    645     def _set_axis(self, axis, labels):
--> 646         self._data.set_axis(axis, labels)
    647         self._clear_item_cache()
    648 

~/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   3321             raise ValueError(
   3322                 'Length mismatch: Expected axis has {old} elements, new '
-> 3323                 'values have {new} elements'.format(old=old_len, new=new_len))
   3324 
   3325         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 0 elements, new values have 14 elements

I guess it has to do with the output expecting 14 values for the pd.DataFrame, but receiving 0.
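That failure mode can be reproduced without gprofiler at all; it's pandas refusing to assign the 14 expected column names to a frame built from zero results (a minimal sketch):

```python
import pandas as pd

# Minimal reproduction: the wrapper builds a DataFrame from zero
# results, then assigns the 14 expected column names to it.
enrichment = pd.DataFrame()
try:
    enrichment.columns = ["col%d" % i for i in range(14)]
except ValueError as err:
    print(err)  # Length mismatch: Expected axis has 0 elements, ...
```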

@ivirshup ivirshup force-pushed the ivirshup:queries_fix branch from 5cab263 to a88fbb2 Feb 16, 2019

@ivirshup

Collaborator Author

commented Feb 16, 2019

How about an empty DataFrame instead of None? I think I prefer len(results) == 0 to results is None.

Also, I need to look into the arguments to gprofiler a bit more before this is ready to merge. I'd also like to add tests, but probably ones that are optional. Does scanpy have a preferred way of adding tests that don't run by default?

@ivirshup ivirshup force-pushed the ivirshup:queries_fix branch from a88fbb2 to a8c15ce Feb 16, 2019

Attempted documentation of `enrich` and formatting
I've taken a shot at documenting a single dispatch function, largely following how it's done in `R`. I assume that will be most familiar to people who use scanpy. Also made the API to `gprofiler` more flexible by letting the user pass arbitrary arguments.

Additionally, ran formatting with black.
@ivirshup

Collaborator Author

commented Feb 18, 2019

Ha, just saw your PR to gprofiler @flying-sheep! I guess enrich should return None then?

@flying-sheep

Member

commented Feb 18, 2019

Eh, good question what would be better. I think both work, but if we return an empty one, we have to make sure that the column names and types are still correct.

In [4]: bool(pd.DataFrame(dict(a=[])))                                                                                                                                    
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-5108fad10cc0> in <module>
----> 1 bool(pd.DataFrame(dict(a=[])))

/usr/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
   1477         raise ValueError("The truth value of a {0} is ambiguous. "
   1478                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1479                          .format(self.__class__.__name__))
   1480 
   1481     __bool__ = __nonzero__

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [5]: pd.DataFrame(dict(a=[])).empty                                                                                                                                    
Out[5]: True
@ivirshup

Collaborator Author

commented Feb 19, 2019

I liked returning a data frame for type stability (i.e. this method always returns a data frame). However, you bring up a good point about the columns and dtypes of the data frame being part of its type. Because it'd add more complexity than it's worth, and type stability isn't that much of a thing in Python, I've gone ahead with None.

org: str = "hsapiens",
correction_method: str = "analytical",
gprofiler_kwargs: dict = {},
) -> pd.DataFrame:

@flying-sheep

flying-sheep Feb 19, 2019

Member

You said you’ve gone with None, so this is Optional[pd.DataFrame]

@ivirshup

ivirshup Feb 20, 2019

Author Collaborator

Oops, only updated the docs.

Any chance you know of some tooling where I could just specify the types once and have them propagate between annotations and docs? Or maybe a linter that would warn me about mismatches?

@flying-sheep

flying-sheep Feb 21, 2019

Member

That’s what I’ve worked on a lot in the scanpydoc package.

Just delete them from the docs, they’ll be added automatically from the type annotations!
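A sketch of that convention (the signature here is hypothetical, not the one in the PR): types live only in the annotations, and the numpydoc Parameters section leaves them out so scanpydoc can add them back when rendering.

```python
from typing import Iterable

import pandas as pd

def biomart_annotations(
    attrs: Iterable[str], *, host: str = "www.ensembl.org"
) -> pd.DataFrame:
    """Retrieve annotations from BioMart.

    Parameters
    ----------
    attrs
        Attributes to query BioMart for (no type listed here;
        scanpydoc fills it in from the annotation).
    host
        BioMart host to connect to.
    """
    raise NotImplementedError("illustration only")
```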

@liiskolb


commented Feb 22, 2019

Hey!
I'm a member of the g:Profiler (biit.cs.ut.ee/gprofiler) development team and was scanning the web to find services that might depend on us.

We recently went live with an extensive update which might break some of the previous pipelines and wrappers.

All the existing Python and R packages should work; however, they are linking to an archived data version and don't access the most up-to-date data from g:Profiler due to the new API etc.

We have already created a new R package that corresponds to the new API; the Python package is still in progress.

I just wanted to let you know and please feel free to contact me if I can be of any help.

All the best,
Liis

@ivirshup

Collaborator Author

commented Feb 27, 2019

@liiskolb Thanks for letting us know. Where can I track the Python package? Could you also point me towards the old and new docs for the API? I'd like to check what's different. Thanks!

@flying-sheep I'm thinking I'll drop the enrich function from this PR until this is resolved. Thoughts?

@liiskolb


commented Feb 27, 2019

@ivirshup
Our official Python package is not public yet.
The old API description is here: https://biit.cs.ut.ee/gprofiler_archive2/r1760_e93_eg40/web/help.cgi?help_id=55
and the new one with Python examples is here: https://biit.cs.ut.ee/gprofiler/page/apis.
So, not only have the parameters changed, but also the request type, paths, and output format (it's JSON now).
Let me know if you have any other questions about the tool, results, enrichment analysis, etc.

@LuckyMD

Contributor

commented Feb 27, 2019

Hi,
This is very relevant information for me as well. I currently use Valentine Svensson's g:Profiler Python API (https://github.com/vals/python-gprofiler) in my single-cell tutorial (https://github.com/theislab/single-cell-tutorial). I'm guessing that will be similarly outdated soon... I would thus be quite keen for this to be included in scanpy @ivirshup ;).

@falexwolf

Member

commented Mar 10, 2019

Let me know when this can be merged. The discussion here got a bit long; it's interesting, but I can't figure out if we're there yet. 😄

@flying-sheep, happy if you merge this, too, of course.

@flying-sheep

Member

commented Mar 11, 2019

I made some changes regarding how we want APIs defined from now on.

@ivirshup I hope I didn’t make any mistakes, I’m still a bit sick. Could you please check my changes in 4550142?

@flying-sheep

Member Author

commented on scanpy/queries/_queries.py in 4550142 Mar 11, 2019

Import order: Three blocks, separated by a blank line. stdlib imports, then 3rd-party imports, then relative imports (from .foo import bar)

@flying-sheep

Member Author

commented on scanpy/queries/_queries.py in 4550142 Mar 11, 2019

Rely on type annotations unless their information isn’t sufficient.

Exception: If the default value is not literally there, but None means “use the default”, that can simply be mentioned in the description.

Collaborator

replied Mar 13, 2019

While I like the DRY aspect, I don't think I like removing type information from the doc-strings. There are instances where types can't quite be captured (e.g. list-like, a string as an enumerated type), or where you want to avoid an import for performance/dependency reasons. Also, I personally look at the docs mostly through an IPython REPL, and it's nice to have the type annotation next to the description instead of having to scroll up.

Ideally I'd like a linter which tells you when the type annotations and docs are out of sync and a formatter which propagates one to the other.

Member Author

replied Mar 13, 2019

That’s all accounted for, don’t worry.

  1. your examples can be captured very well in the type system. “list-like” is too vague, but you can pick something from collections.abc and use its generic version from typing. E.g. if you expect isinstance(param, collections.abc.Iterable) for a parameter, you do def fun(..., param: typing.Iterable[ElemType]):

  2. if you really can’t capture something (e.g. you want to specify param in {'x', 'y'}), then you can of course leave in manual type information.

  3. You don’t have to scroll up. We add the types back to the docstrings using

    scanpy/scanpy/utils.py

    Lines 80 to 82 in d4a7a2d

    def annotate_doc_types(mod: ModuleType, root: str):
        for c_or_f in descend_classes_and_funcs(mod, root):
            c_or_f.getdoc = partial(getdoc, c_or_f)
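A hypothetical standalone analogue of that mechanism (names and behavior here are illustrative, not scanpy's actual implementation): IPython's `?` consults an object's `getdoc()` method when one exists, so binding one lets the rendered docstring be augmented with type information.

```python
from functools import partial

def getdoc(obj) -> str:
    # hypothetical stand-in for scanpy's getdoc: append parameter
    # types recovered from __annotations__ to the plain docstring
    doc = obj.__doc__ or ""
    anns = ", ".join(
        f"{name}: {tp}" for name, tp in getattr(obj, "__annotations__", {}).items()
    )
    return doc + (f"\n\nAnnotations: {anns}" if anns else "")

def pca(n_comps: int = 50) -> None:
    """Run PCA."""

# what annotate_doc_types does per object: IPython's `?` uses an
# object's getdoc() method, when present, instead of raw __doc__
pca.getdoc = partial(getdoc, pca)
print(pca.getdoc())
```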

Collaborator

replied Mar 16, 2019

  1. I think this breaks down into two parts
  • I think we never actually resolved how you're supposed to say, "can be a tuple, list, array, pd.Series, or equivalent but not a str" using types.
  • But what about the case of typing for an optional dependency? The signatures get evaluated before the function is called, so you can't do the internal import thing. If I have a function which actually only takes a Zarr array, but Zarr is an optional dependency, is there a simple way to type that?
  2. That sounds good, I just don't want to split where I look for documentation based on the kind of argument, which leads me to:
  3. My documentation might be rendering differently than yours then. Here's what ?sc.pp.pca looks like for me:

[screenshot of ?sc.pp.pca output]

Member Author

replied Mar 16, 2019

I think we never actually resolved how you're supposed to say, "can be a tuple, list, array, pd.Series, or equivalent but not a str" using types.

Am I missing something? That’s just typing.Sequence[ElemType] or typing.Iterable[ElemType] depending on what the function actually expects (is iteration enough or do we need indexing)

Strings are a Sequence[str], so if ElemType isn’t str, we have expressed that.
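The runtime side of that distinction can be sketched with the `collections.abc` ABCs (the runtime counterparts of `typing.Sequence` and `typing.Iterable`):

```python
from collections.abc import Iterable, Sequence

# Sequence adds indexing on top of Iterable's iteration
assert isinstance([1, 2], Sequence)
assert isinstance((1, 2), Sequence)
assert isinstance(iter([1, 2]), Iterable)
assert not isinstance(iter([1, 2]), Sequence)  # iterators can't be indexed
assert isinstance("ab", Sequence)  # a str is a Sequence whose elements are str
print("all checks pass")
```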

But what about the case of typing for an optional dependency?

try:
    from zarr import Array as Zarr  # or so
except ImportError:
    Zarr = NewType('Zarr', np.ndarray)
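A runnable sketch of that fallback pattern, assuming numpy as the base type (the dummy type only exists so annotations still evaluate when zarr isn't installed):

```python
from typing import NewType

import numpy as np

try:
    from zarr import Array as ZarrArray  # optional dependency
except ImportError:
    # dummy subtype standing in for zarr.Array, so annotations
    # still evaluate without zarr installed
    ZarrArray = NewType("ZarrArray", np.ndarray)

def total(arr: "ZarrArray") -> float:
    # accepts the real zarr.Array or any ndarray-compatible input
    return float(np.asarray(arr).sum())

print(total(np.arange(4)))  # → 6.0
```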

Here's what ?sc.pp.pca looks like for me

Good catch. It broke when moving everything to scanpy/__init__.py, I restored it in 89abcb8

[screenshot of restored ?sc.pp.pca output]

Collaborator

replied Mar 17, 2019

The issue of how to type that sequence (or collection, iterable, etc.) has come up for me with scanpy, since many arguments take list-like collections of strings.

Zarr = NewType('Zarr', np.ndarray): I'm not sure this hits the cases of a function which only accepts that Zarr type, or expensive optional imports. I think there might be a solution based on PEP 563, but I'd have to play around with that for a bit first.

That change fixes it, thanks! Any chance the type annotations under Parameters could get backticks or (🤞) syntax highlighting for values?

Member Author

replied Mar 17, 2019

The issue of how to type that sequence (or collection, iterable, etc.) has come up for me with scanpy, since many arguments take list-like collections of strings.

That's good. When we're confused about what we can really pass there and ask ourselves "what's a 'list-like'?", then we just discovered a bug in the documentation. Either ours, or (if we just pass on the parameters) another project's. In any case, someone didn't properly specify what's accepted there, and us finally thinking about that will only improve the life of everyone.

I think you approach this from a different angle than me. You probably think "there's this magic 'list-like' that we can't know anything about, so we should just pass it on to the user", and I think "someone failed to specify what 'list-like' really means, so someone will have to dive into the code to figure it out, either some user running into a bug or the person writing the parameter docs right now".

Zarr = NewType('Zarr', np.ndarray): I'm not sure this hits the cases of a function which only accepts that Zarr type, or expensive optional imports.

It does. NewType(name, cls) says “we have a subtype of cls”. zarr.Array is a subtype of np.ndarray, so if zarr can’t be imported, we instead define that dummy class that takes its role. PEP 563 is the more elegant solution, but only available in Python 3.7+

Any chance the type annotations under Parameters could get backticks or (🤞) syntax highlighting for values?

OK, so what you see there is the final step. Shift-Tab or ? will retrieve some info about the object (signature, docstring, …) and display it. The signature is python code, so it can be highlighted, but the docstring can be anything (Python doesn’t require it to be valid reStructuredText) so it will just be displayed as-is.

It's possible to change this function to include highlighting, but it would be a bit involved.

You can easily add backticks in the getdoc function.

Collaborator

replied Mar 26, 2019

Sorry, just getting back to this, had a conference and getting my own package together (which scanpydoc is really helping with).

On -like objects: to some extent this magic unknowable type is pythonic to me. For example, all the times file-like is used. I get what you mean though, it's easy to run into trouble when you assume listlike[0] will return the same thing as next(iter(listlike)). Polymorphism in python leaves a lot to be desired.

If there was an easy way to say "coercible to {x}" with types, that might be the best here. For example, "coercible to np.array" means you can pass a list, tuple, series, or whatever else, but not an iterator.

NewType(name, cls) says “we have a subtype of cls”.

Ahh, when I first looked into NewType I missed that it makes a unique new type; this makes more sense now. I see how this works when the package is not installed, but what if the package is available later? If the NewType signature gets set on an object, but that object is pickled and loaded into an environment where the actual type is available, what does its signature say now?

OK, so what you see there is the final step

ipython's codebase has always seemed hard to wrangle with, so I probably won't be the one to fix that. Might consider changing getdoc if you don't get to it first.

Any chance you've considered moving the getdoc like functions to scanpydoc? It seems like it'd fit well there.

Member Author

replied Mar 26, 2019

“file-like” is just a synonym for isinstance(thing, io.IOBase). “list-like” is often a synonym for isinstance(thing, collections.abc.Sequence), but sometimes not.

Polymorphism in python leaves a lot to be desired.

Python’s protocols are beautiful and their usability through ABCs is pretty cool. The problem is that people aren’t too aware of all this. With 10 minutes more effort, they could figure out if they just need an Iterable or a Sequence and cleanly define an API, but instead they write “list-like” and call it a day.

The problem is that python has a great runtime type system, but only some rigorously typed libraries (e.g. ones using mypy in testing) know the advantages.

If there was an easy way to say "coercible to {x}" with types, that might be the best here.

There’s no standard coercion mechanism in python (like R’s as function), so “coercible” just means “which types’ __init__ accept this as their only parameter”.

array-like is pretty complex. The right way would be for numpy to formalize this in code, which is being done in numpy-stubs, but that's not ready yet. mypy has numpy-mypy with a working _ArrayLike today. Sadly from __future__ import annotations only exists since Python 3.7, or we could use it today. (I don't want a hard runtime dependency on mypy. Maybe we could do a "scanpy.types" module that imports the mypy types if available and stubs them out if not.)

If the NewType signature gets set on an object, but that object is pickled and loaded into an environment where the actual type is available, what does it's signature say now?

Good point. Generally, it says “pickle is not designed to work through software updates, don’t do this” 😉 There’s a few ways to customize pickle to circumvent this though.

ipython's codebase has always seemed hard to wrangle with, so I probably won't be the one to fix that.

Too bad, because that flow-control-heavy function is something we can’t easily monkeypatch to get some rad HTML shift-tab introspections. I think it’s much easier to introduce it there than in scanpydoc or so.

Might consider changing getdoc if you don't get to it first. Any chance you've considered moving the getdoc like functions to scanpydoc? It seems like it'd fit well there.

Sure, sounds like a good idea! The whole annotate_doc_types thing would be a good fit.

@flying-sheep

Member Author

commented on scanpy/queries/_queries.py in 4550142 Mar 11, 2019

We use the * after the mandatory arguments so people have to spell out rarer arguments.

Sadly Python 3.5 doesn’t accept a * and a trailing ,.
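The pattern under discussion, sketched with hypothetical names: everything after the bare `*` must be spelled out by keyword.

```python
def enrich(query, *, org: str = "hsapiens", correction_method: str = "analytical"):
    # hypothetical signature: parameters after * are keyword-only
    return org

print(enrich(["TP53"], org="mmusculus"))  # → mmusculus

try:
    enrich(["TP53"], "mmusculus")  # positional: raises TypeError
except TypeError as err:
    print(err)
```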

Collaborator

replied Mar 13, 2019

I like this. Any chance you know if there is a formatter option/error code for black to stop adding trailing commas?

Member Author

replied Mar 13, 2019

Yes. black -t py35 or add the following to your pyproject.toml:

[tool.black]
target_version = ['py35']
@fidelram

Collaborator

commented Mar 12, 2019

This PR looks very useful. Do you mind adding an example in the function descriptions?

@ivirshup

Collaborator Author

commented Mar 13, 2019

@liiskolb, any chance you have an estimate of when the python package will be released? I'd like to have this PR merge with up-to-date results, and am trying to figure out if I should write a little client.

@fidelram Sure!

Just a heads up to everyone, I'm pretty swamped this week and probably won't get around to updating this PR until at least this weekend.

@liiskolb


commented Mar 14, 2019

@ivirshup We estimate that the package will be released around the 15th of April, so in a month or so.
If this is ok, then I'll let you know when it's out :)

@ivirshup

Collaborator Author

commented Mar 17, 2019

Writing the example for sc.queries.enrich(adata, ...) made me realize I probably don't want to encourage that. This could potentially be fixed by a default p-value cutoff.

As for the gprofiler version, there are two paths forward I think would work:

  1. Put a hold on the enrich function for a bit for the new API to be released
  2. If it'd be useful enough to include now, the docs could note that the current implementation is provisional and results will change pretty soon. Maybe @fidelram and @LuckyMD have thoughts on that?
@liiskolb


commented Apr 9, 2019

@ivirshup We have updated gprofiler-official to version 1.0.0 that corresponds to the new API.
See the descriptions here: https://pypi.org/project/gprofiler-official/#description

Update queries
Now uses `gprofiler-official` and can take cut-offs for differential expression.
@LuckyMD

Contributor

commented Apr 15, 2019

@ivirshup Sorry, I didn't see this until just now. I guess @liiskolb's comment addresses your options though... as it's out now, it would be really helpful to have in scanpy.

@ivirshup

Collaborator Author

commented Apr 15, 2019

Just pushing the updates now @LuckyMD 😄.

One issue with the enrichment as-is is that gprofiler-official's import name conflicts with the previous unofficial wrapper. I'm worried that this will break people's environments if they're not aware of this. @liiskolb, do you have any thoughts on this?

Otherwise, I think this should be alright. I'd like to know if there'd be any interest in moving the utility function rank_genes_groups_df (added here) into a more central place. I personally use it any time I use scanpy's differential expression.

@ivirshup ivirshup force-pushed the ivirshup:queries_fix branch from 891c42d to 449cd3d Apr 15, 2019

@ivirshup

Collaborator Author

commented Apr 17, 2019

@flying-sheep, current thoughts on this?

@@ -93,7 +93,8 @@ def biomart_annotations(
Returns
-------
A `pd.DataFrame` containing annotations.
:class:`pandas.DataFrame`
Dataframe containing annotations.

@flying-sheep

flying-sheep Apr 17, 2019

Member

Due to recent doc changes: If you just have one return value, write prose and use a return value annotation.

(same for every other time you changed this in this commit)

@ivirshup

ivirshup Apr 17, 2019

Author Collaborator

Just changed.

Also, I think it would be reasonable to include this with v1.5, if v1.4.1 is otherwise just bug fixes.

@liiskolb


commented Apr 17, 2019

@ivirshup "One issue with the enrichment as is, is that gprofiler-official import name conflicts with the previous unofficial wrapper. I'm worried that this will break peoples environments if they're not aware of this. @liiskolb, do you have any thoughts on this?"

I'm not really sure I got it right, but if the new version of scanpy includes the new version of gprofiler-official, then it should work well. If people have an old version of scanpy that uses the old version of gprofiler, then it should also work, but with data from an archived release of g:Profiler.

With this kind of update it is inevitable that some environments break (we have the experience, as you can see ;)); these just need to be solved case by case if people with problems start to contact us. They could be advised to update their packages to solve these issues.

@ivirshup ivirshup changed the title [WIP] Queries fix Queries fix Apr 22, 2019
