Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate signatures from 10x single cell bam file #539

Merged
merged 24 commits into from
Oct 1, 2018
Merged

Generate signatures from 10x single cell bam file #539

merged 24 commits into from
Oct 1, 2018

Conversation

olgabot
Copy link
Collaborator

@olgabot olgabot commented Sep 1, 2018

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@luizirber
Copy link
Member

Hi @olgabot! This is pretty cool =]

I created a PR for your branch to fix the tests and avoid a direct dependency on pysam. When you write the tests for 10x you can do like the tests in tests/test_sbt.py, where they check if ipfsapi and redis are installed: https://github.com/dib-lab/sourmash/blob/12aa76f0a08e5b702d927eed7185a87277af2b51/tests/test_sbt.py#L367
(so it will be pysam = pytest.importorskip('pysam')

Another thing: did you try bamnostic? It's a pure Python SAM/BAM reader, and avoids problems like pysam not working on Python 3.7 at the moment...

@olgabot
Copy link
Collaborator Author

olgabot commented Sep 2, 2018

Glad you like it! @jamestwebber helped me get started with the code :)

(so it will be pysam = pytest.importorskip('pysam')

Great! I'll add that change.

Another thing: did you try bamnostic? It's a pure Python SAM/BAM reader, and avoids problems like pysam not working on Python 3.7 at the moment...

I hadn't heard of bamnostic before! My concern is that running this code took ~3 hours on a smallish 10X run of ~600 QC-passing cells and since bamnostic doesn't use the C-level API, it will be even slower. I can play around with using joblib to parallelize, though. But wider usability may win over speed...

@luizirber
Copy link
Member

almost there, we are missing an
import pytest
in the beginning of tests/test_sourmash.py, and then tests will work =]

@luizirber
Copy link
Member

Cool, tests passing! \o/

On the performance side, you're building one minhash per barcode, and then saving all the barcodes on a signature:
https://github.com/dib-lab/sourmash/blob/ef25bcf36a2db9c55ef6b5812abd6d2128c5ad8c/sourmash/commands.py#L250L266
I think you can rewrite this block and use a multiprocessing.pool to process them in parallel, but I wouldn't necessarily require that for accepting this PR

@olgabot
Copy link
Collaborator Author

olgabot commented Sep 6, 2018 via email

@olgabot
Copy link
Collaborator Author

olgabot commented Sep 6, 2018

Hello! I'm trying to figure out whether it's the iterating over the alignment file or creating the hashes that's taking the most time but I can't seem to get the profiler working due to the relative imports in commands.py, e.g.

from . import DEFAULT_SEED, MinHash, load_sbt_index, create_sbt_index

Here's the full stack trace below:

 ⚙  Thu  6 Sep - 11:01  ~/code/sourmash   origin ☊ olgabot/10x-singlecell-bam 3☀ 1● 
  python -m cProfile -o sourmash_10x.prof sourmash/commands.py compute -k 31 --input-is-10x tests/test-data/lung_ptprc/
Traceback (most recent call last):
  File "/anaconda3/envs/sourmash/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/envs/sourmash/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/anaconda3/envs/sourmash/lib/python3.6/cProfile.py", line 160, in <module>
    main()
  File "/anaconda3/envs/sourmash/lib/python3.6/cProfile.py", line 153, in main
    runctx(code, globs, None, options.outfile, options.sort)
  File "/anaconda3/envs/sourmash/lib/python3.6/cProfile.py", line 20, in runctx
    filename, sort)
  File "/anaconda3/envs/sourmash/lib/python3.6/profile.py", line 64, in runctx
    prof.runctx(statement, globals, locals)
  File "/anaconda3/envs/sourmash/lib/python3.6/cProfile.py", line 100, in runctx
    exec(cmd, globals, locals)
  File "sourmash/commands.py", line 12, in <module>
    from . import DEFAULT_SEED, MinHash, load_sbt_index, create_sbt_index
ImportError: cannot import name 'DEFAULT_SEED'

Do you know what may be happening?

@olgabot
Copy link
Collaborator Author

olgabot commented Sep 6, 2018

omg nevermind .. somehow I moved the __init__.py file to the root dir:

⚙  Thu  6 Sep - 11:47  ~/code/sourmash   origin ☊ olgabot/10x-singlecell-bam ↑1 5☀ 4● 1‒ 1✚ ⚑ 
  git status
On branch olgabot/10x-singlecell-bam
Your branch is ahead of 'origin/olgabot/10x-singlecell-bam' by 1 commit.
  (use "git push" to publish your local commits)

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	renamed:    sourmash/__init__.py -> __init__.py

-_-;;;;;

🤦‍♀️

@codecov-io
Copy link

codecov-io commented Sep 6, 2018

Codecov Report

Merging #539 into master will increase coverage by 0.04%.
The diff coverage is 97.22%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #539      +/-   ##
=========================================
+ Coverage   90.76%   90.8%   +0.04%     
=========================================
  Files          33      34       +1     
  Lines        5011    5047      +36     
  Branches       36      36              
=========================================
+ Hits         4548    4583      +35     
- Misses        463     464       +1
Impacted Files Coverage Δ
sourmash/tenx.py 100% <100%> (ø)
sourmash/commands.py 91.75% <96.15%> (+0.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 12aa76f...3f03839. Read the comment docs.

@olgabot
Copy link
Collaborator Author

olgabot commented Sep 6, 2018

BTW, bamnostic seems to not be able to deal with 10x bam files, which have 3-field tags like this: CB:Z:TACTTACAGCGAGAAA-1

(sourmash) 
 ⚙  Thu  6 Sep - 13:33  ~/code/sourmash   origin ☊ olgabot/10x-singlecell-bam ↑1 8☀ 3● 3‒ 1✚ ⚑ 
  sourmash compute -k 31 --input-is-10x  --force tests/test-data/lung_ptprc
computing signatures for files: tests/test-data/lung_ptprc
Computing signature for ksizes: [31]
Computing only DNA (and not protein) signatures.
Computing a total of 1 signature(s).
Traceback (most recent call last):
  File "/anaconda3/envs/sourmash/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash', 'console_scripts', 'sourmash')()
  File "/Users/olgabot/code/sourmash/sourmash/__main__.py", line 77, in main
    cmd(sys.argv[2:])
  File "/Users/olgabot/code/sourmash/sourmash/commands.py", line 268, in compute
    barcodes, bam_file = read_10x_folder(filename)
  File "/Users/olgabot/code/sourmash/sourmash/tenx.py", line 17, in read_10x_folder
    os.path.join(folder, 'possorted_genome_bam.bam'), mode='rb')
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/core.py", line 153, in __init__
    bgzf.BgzfReader.__init__(self, **kwargs)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/bgzf.py", line 471, in __init__
    self._load_header(check_sq)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/bgzf.py", line 652, in _load_header
    self._header = BAMheader(self)
  File "/Users/olgabot/anaconda/envs/sourmash/lib/python3.6/site-packages/bamnostic/bgzf.py", line 311, in __init__
    tag, value = field.split(':')
ValueError: too many values to unpack (expected 2)

@luizirber
Copy link
Member

I see, thanks for creating an issue in their repo!

Another profiling approach you can take is https://github.com/benfred/py-spy (saw it yesterday, tested and it works pretty well!)

@olgabot
Copy link
Collaborator Author

olgabot commented Sep 13, 2018

Update from me: I'm migrating to bamnostic because pysam alignment objects cannot be pickled. But there's still issues with reading the 10x bam files: betteridiot/bamnostic#16

@olgabot
Copy link
Collaborator Author

olgabot commented Sep 14, 2018

Alright, the bamnostic issue has been fixed! It does require the latest bamnostic git version (reflected in the requirements.txt) and I asked them to do a version bump soon so the requirements are simpler.

@olgabot
Copy link
Collaborator Author

olgabot commented Sep 14, 2018

ahh I spoke too soon... fixing now

@olgabot
Copy link
Collaborator Author

olgabot commented Sep 14, 2018

Now this seems to be failing due to a deprecation error that happens for every read. Hopefully fixed by: betteridiot/bamnostic#18

@olgabot
Copy link
Collaborator Author

olgabot commented Sep 18, 2018

Ugh Python2.7 gets in the way of everything!! @betteridiot is working on fixing a RecursionError in Python2.7 in bamnostic here: betteridiot/bamnostic#20

@codecov
Copy link

codecov bot commented Sep 18, 2018

Codecov Report

Merging #539 into master will decrease coverage by 0.16%.
The diff coverage is 70.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #539      +/-   ##
==========================================
- Coverage   90.78%   90.62%   -0.17%     
==========================================
  Files          33       34       +1     
  Lines        5016     5057      +41     
  Branches       37       37              
==========================================
+ Hits         4554     4583      +29     
- Misses        462      474      +12
Impacted Files Coverage Δ
sourmash/tenx.py 100% <100%> (ø)
sourmash/commands.py 90.52% <61.29%> (-1.25%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2534989...f0cf1b1. Read the comment docs.

@olgabot olgabot changed the title [WIP] Generate signatures from 10x single cell bam file Generate signatures from 10x single cell bam file Sep 19, 2018
@olgabot
Copy link
Collaborator Author

olgabot commented Sep 19, 2018

The build is passing now! Though it looks like I didn't cover everything with the tests.. let me know if you want me to add more.

@luizirber
Copy link
Member

Hi @olgabot! This is looking great =]

Two things:

@luizirber luizirber merged commit 05584d2 into sourmash-bio:master Oct 1, 2018
@luizirber
Copy link
Member

Thanks @olgabot! 🎊

I've punted the test issue to #551, and made 10x support optional in #550

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants