Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Marker gene overlap scoring function #549

Merged
merged 17 commits into from Mar 31, 2019

Conversation

Projects
None yet
5 participants
@LuckyMD
Copy link
Contributor

LuckyMD commented Mar 22, 2019

Introduces a function to calculate marker gene overlaps between a reference set of marker genes provided as a dictionary, and data-derived marker genes as calculated by sc.tl.rank_genes_groups().

Currently implemented overlap functions are: overlap counts (with row or column normalization), overlap coefficient, and jaccard index.

Still to do:

  • write a test
  • finish documentation
  • allow p-value thresholding when available
  • allow using top X marker genes rather than all calculated markers
  • test that it works properly...
@falexwolf

This comment has been minimized.

Copy link
Member

falexwolf commented Mar 24, 2019

This looks very good already! Nice code! :) Thanks for also adding the things you're still planning to add.

@flying-sheep might not like that you spell out the types in the docs in addition to the type annotations but I think it's good: the string selection values for method can't be represented as a type annotation, I think; and I still like browsing the type annotations with shift+tab at the same time when reading the rest of the parameter explanation. So, I'd say, just keep it as it is.

@flying-sheep

This comment has been minimized.

Copy link
Member

flying-sheep commented Mar 25, 2019

Could you please do Dict[KeyType, ValueType] instead of dict in the type annotations?

@falexwolf you forgot that the types will be added to the shift-tab info too:

annotate_doc_types(sys.modules[__name__], 'scanpy')

So yes, @LuckyMD please remove all redundant type info in the docs. The info for method and normalize should stay, the rest can go with absolutely no loss of information anywhere.

@falexwolf

This comment has been minimized.

Copy link
Member

falexwolf commented Mar 29, 2019

OK, cool; thanks @flying-sheep!

The function looks otherwise good!

Examples
--------
>>> adata = sc.datasets.pbmc68k_reduced()
>>>

This comment has been minimized.

Copy link
@fidelram

fidelram Mar 29, 2019

Collaborator

don't forget to finish this

This comment has been minimized.

Copy link
@LuckyMD

LuckyMD Mar 29, 2019

Author Contributor

of course... it was just a copy-paste from embedding density for now ;).

@LuckyMD

This comment has been minimized.

Copy link
Contributor Author

LuckyMD commented Mar 29, 2019

Could you please do Dict[KeyType, ValueType] instead of dict in the type annotations?

@flying-sheep When I put in key: dict[str, set] it tells me type annotations are not subscriptable. How can I state the dict key-value types required?

@LuckyMD

This comment has been minimized.

Copy link
Contributor Author

LuckyMD commented Mar 29, 2019

Finished most of the features... just need to add the documentation... would appreciate if someone could take a look at the code @flying-sheep, @falexwolf...

@LuckyMD

This comment has been minimized.

Copy link
Contributor Author

LuckyMD commented Mar 29, 2019

From my side it's ready to be merged. I have left a note in the comments about extending this to enrichment scores, and that this would be difficult. I though it might be good to leave that in there in case anyone wants to extend it later. Thoughts on that?

# Create a pandas dataframe with the results
marker_groups = list(reference_markers.keys())
clusters = list(cluster_ids)
marker_matching_df = pd.DataFrame(marker_match, index=marker_groups, columns=clusters)

This comment has been minimized.

Copy link
@falexwolf

falexwolf Mar 29, 2019

Member

Currently, we cannot write DataFrames using adata.write... :(

What I would do, is add an inplace parameter that defaults to False. If users set it to True, they get a NotImplementedError. Then, instead of adding the marker_matching_df inplace to adata, you simply return it.

At some point, we will be able to store dfs within .h5ad. Then we can also implement the inplace version of this function.

@flying-sheep Any chance that you would have fun implementing this. It's just some reorganization fo the code as I implemented the functionality already for .obs and .var via https://github.com/theislab/anndata/blob/473413a76ebe741754b63558c4c5a34861451c82/anndata/base.py#L193-L222

It could be a 30 min job. Upon write, each dataframe is transformed to a record array, upon read, it's transformed back to a dataframe.

Having this would also allow a much nicer return value of rank_genes_groups...

@falexwolf

This comment has been minimized.

Copy link
Member

falexwolf commented Mar 29, 2019

This looks great, Malte! Only thing see my comment above. You should get on the author list! :)

LuckyMD
@LuckyMD

This comment has been minimized.

Copy link
Contributor Author

LuckyMD commented Mar 29, 2019

I'd love to get on the author list :).

Will add the inplace option now with the NotImplementedError raised. The only thing remaining is what @flying-sheep suggested with the typing for the dictionary, which I can't seem to do.

@LuckyMD

This comment has been minimized.

Copy link
Contributor Author

LuckyMD commented Mar 29, 2019

Found the Dict[str, set] option in the typing module. I've left the dict typing for the hidden functions though... I guess they're not user exposed anyway.

I left the code in there for inplace=True... so when the pandas writing option is implemented, the if inplace: raise NotImplementedError('...') lines just need to be removed.

Ready to be merged from my side now.

@LuckyMD

This comment has been minimized.

Copy link
Contributor Author

LuckyMD commented Mar 29, 2019

Not sure what's up with Travis... the tests pass on my machine, and they were passing on Travis the whole time... my last commit hasn't really changed anything that would cause this fail.

LuckyMD added some commits Mar 29, 2019

@falexwolf

This comment has been minimized.

Copy link
Member

falexwolf commented Mar 31, 2019

Great! Travis looks like stalled build...

@falexwolf falexwolf merged commit 4acb7f0 into master Mar 31, 2019

1 of 2 checks passed

continuous-integration/travis-ci/push The Travis CI build failed
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@flying-sheep flying-sheep deleted the marker_overlap branch Mar 31, 2019

@falexwolf

This comment has been minimized.

Copy link
Member

falexwolf commented Mar 31, 2019

Damn, there seems to be some Python 3.5 related thing going on: https://travis-ci.org/theislab/scanpy/jobs/513835981

@LuckyMD, any idea why this returns something completely different?

@LuckyMD

This comment has been minimized.

Copy link
Contributor Author

LuckyMD commented Mar 31, 2019

I have no idea... I was trying to understand how this would be different... the matrix I expect is [[3,1],[1,3]]... And that's what I get for python 3.6. I don't know what changed from 3.5. Maybe pandas .iloc[] does something else?

@LuckyMD

This comment has been minimized.

Copy link
Contributor Author

LuckyMD commented Mar 31, 2019

It started to go wrong with ff26149 and all I changed there was that I moved the if inplace: statement up to the user tests instead of down to where the outputs were written.

Is python 3.5 somehow sensitive to whitespace after if statements? I found a whitespace after the if inplace: ... maybe that's it?

@LuckyMD

This comment has been minimized.

Copy link
Contributor Author

LuckyMD commented Apr 1, 2019

Does anyone have a nice instruction set on how I can reproduce the travis python 3.5 environment? @flying-sheep? Maybe I can hunt down the cause of the error then... although it's apparently only sporadic according to @ivirshup :/

@flying-sheep

This comment has been minimized.

Copy link
Member

flying-sheep commented Apr 1, 2019

I don’t think there’s an easy way to reproduce it exactly, sorry.

@ivirshup

This comment has been minimized.

Copy link
Collaborator

ivirshup commented Apr 1, 2019

@flying-sheep

This comment has been minimized.

Copy link
Member

flying-sheep commented Apr 2, 2019

Maybe travis’ ubuntu-systemd?

Then you’d need to use travis-build (for which there also is a docker image) to convert the .yml to a .sh and execute that on the ubuntu container. (travis-build exists to do all this but that’s maybe overkill)

@LuckyMD

This comment has been minimized.

Copy link
Contributor Author

LuckyMD commented Apr 2, 2019

In #583 I have amended this PR a bit... corrected the normalization to make more sense, made the input dict value type more malleable, and I'm playing with the Travis build there.

For some reason it already stopped failing when I just changed the order of assert statements. I thought there might have been too many assert statements in a single function, so I split them up into separate testing functions. I think that's probably better coding practice anyway... but now they fail again for python 3.5... Maybe move the discussion there?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.