New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Easier 10x reading #442

Merged
merged 5 commits into from Jan 23, 2019

Conversation

Projects
None yet
3 participants
@ivirshup
Copy link
Contributor

ivirshup commented Jan 23, 2019

Removes the need to specify a genome string in 10x h5 files by default. This removes a personal annoyance of mine, where I had to figure out what the reference was called when there's often only one reference used per file.

For cellranger v3.0.0+ files, specifying a genome acts as a filter on input, as it did already. However, it doesn't only act if gex_only is True. Additionally, the behavior of gex_only has been changed to fit the documentation, i.e. it just filters for gene expression variables.

For legacy files:

  • If the file only has one genome group, that one is used by default
  • If multiple genomes are found and the user did not specify one, an error will be thrown. This is because there are no structural assurances the genomes will match to the same samples.

As the behavior of the function has meaningfully changed, this is a breaking change (though I'd be surprised if it affected many people). Personally, I haven't seen many 10x files which contain multiple genomes, so I'd appreciate feedback or examples from people who have.

ivirshup added some commits Jan 22, 2019

No longer require genome field for read_10x_h5
Removed requirement for `genome` field in `read_10x_h5`. For legacy files:

* If only one genome is found, it will be used
* If multiple genomes are found, an error will be thrown, as there are no assurances the genomes will match to the same samples

For v3 files, specifying genome now acts as a filter on features. I've also changed the behaviour of `gex_only` to fit its documentation.
@falexwolf

This comment has been minimized.

Copy link
Member

falexwolf commented Jan 23, 2019

Wow, awesome! @gokceneraslan wanted this from the beginning. Thank you so much! 😄

@falexwolf falexwolf merged commit 6d2ece5 into theislab:master Jan 23, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@flying-sheep

This comment has been minimized.

Copy link
Member

flying-sheep commented Jan 23, 2019

I don’t consider it breaking. If I understod you right, the only change in behavior are that not specifying a genome works now in cases where there’s only one.

No longer throwing an error is a perfectly fine change!

@ivirshup

This comment has been minimized.

Copy link
Contributor Author

ivirshup commented Jan 23, 2019

I thought it was breaking due to a couple behavior changes:

One case where the results would be different is the call sc.read_10x_h5(h5pth), where the file at h5pth is a legacy formatted file which contains mm10 and hg38 genomes. Prior to this PR, only the mm10 genome would be read in. Now, an error is thrown.

If the file had the v3 format (also containing two genomes), now values for features from both genomes would be read in, instead of just mm10.

Even better than removing an error, previously v3 files would get all the vars filtered out and not throw an error!

@flying-sheep

This comment has been minimized.

Copy link
Member

flying-sheep commented Jan 23, 2019

OK! You explained that the behavior you have for v3 isn’t possible to replicate for legacy, so I’m sure this is fine.

@ivirshup

This comment has been minimized.

Copy link
Contributor Author

ivirshup commented Jan 23, 2019

It's definitely possible to replicate, I don't know if it makes sense as a default. If the legacy files with multiple genomes almost always had the same cells aligned against all those genomes, it could make sense to read everything in by default. I'm not sure if that's the case though.

@ivirshup ivirshup referenced this pull request Jan 23, 2019

Merged

Better 10x read errors #444

@flying-sheep

This comment has been minimized.

Copy link
Member

flying-sheep commented Jan 23, 2019

Ha, that’s what I meant, that you said it doesn’t make sense.

If multiple genomes are found and the user did not specify one, an error will be thrown. This is because there are no structural assurances the genomes will match to the same samples.

I very much agree!

@falexwolf

This comment has been minimized.

Copy link
Member

falexwolf commented Jan 26, 2019

Oh, I was too hasty in merging this. Thanks for clarifying more of this. I think it's perfectly fine to have this better and more stringent behavior.

Added a note in the release notes: f428848

Btw: I think we should have much nicer release notes with batches both for subversions and author contributions. I'll try improving them very soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment