GWAS tutorial #629

Merged: 8 commits into sgkit-dev:main from the gwas-tutorial branch, Jul 30, 2021

Conversation

tomwhite
Collaborator

Fixes #463

Here is the rendered notebook with brief explanatory notes. The notebook output is also included in the site documentation using JupyterBook.

@tomwhite
Collaborator Author

The build is failing because the notebook can't import sgkit. I'm not sure how to fix this though. Should the notebook call pip install sgkit?

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jupyter_cache/executors/utils.py", line 51, in single_nb_execution
    executenb(
  File "/opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/nbclient/client.py", line 1112, in execute
    return NotebookClient(nb=nb, resources=resources, km=km, **kwargs).execute()
  File "/opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/nbclient/util.py", line 74, in wrapped
    return just_run(coro(*args, **kwargs))
  File "/opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/nbclient/util.py", line 53, in just_run
    return loop.run_until_complete(coro)
  File "/opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/nbclient/client.py", line 553, in async_execute
    await self.async_execute_cell(
  File "/opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/nbclient/client.py", line 857, in async_execute_cell
    self._check_raise_for_error(cell, exec_reply)
  File "/opt/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/nbclient/client.py", line 760, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
import sgkit as sg
from sgkit.io.vcf import vcf_to_zarr
------------------

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/tmp/ipykernel_4990/1400953291.py in <module>
----> 1 import sgkit as sg
      2 from sgkit.io.vcf import vcf_to_zarr

ModuleNotFoundError: No module named 'sgkit'
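
The thread does not record how this import failure was eventually resolved. As a sketch only, a docs build could check for the package up front and fail with a readable message instead of a long nbclient traceback. The helper name `require_module` and the hint text are hypothetical, not part of sgkit or JupyterBook.

```python
import importlib.util


def require_module(name, hint):
    """Raise a readable error if the top-level module `name` is not importable.

    Hypothetical helper for illustration. Only pass top-level names:
    find_spec imports parent packages when given a dotted name.
    """
    if importlib.util.find_spec(name) is None:
        raise ModuleNotFoundError(f"No module named {name!r}. {hint}")


# Usage: a module that is always present passes silently.
require_module("json", hint="This ships with Python.")
```

Calling `require_module("sgkit", hint=...)` in the first notebook cell would report the missing package in one line, assuming the build environment is where the package is expected to be installed.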

@tomwhite
Collaborator Author

The build is now failing when trying to execute the notebook, with

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.11/x64/lib/python3.7/site-packages/jupyter_cache/executors/utils.py", line 56, in single_nb_execution
    record_timing=False,
  File "/opt/hostedtoolcache/Python/3.7.11/x64/lib/python3.7/site-packages/nbclient/client.py", line 1112, in execute
    return NotebookClient(nb=nb, resources=resources, km=km, **kwargs).execute()
  File "/opt/hostedtoolcache/Python/3.7.11/x64/lib/python3.7/site-packages/nbclient/util.py", line 74, in wrapped
    return just_run(coro(*args, **kwargs))
  File "/opt/hostedtoolcache/Python/3.7.11/x64/lib/python3.7/site-packages/nbclient/util.py", line 53, in just_run
    return loop.run_until_complete(coro)
  File "/opt/hostedtoolcache/Python/3.7.11/x64/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
    return future.result()
  File "/opt/hostedtoolcache/Python/3.7.11/x64/lib/python3.7/site-packages/nbclient/client.py", line 554, in async_execute
    cell, index, execution_count=self.code_cells_executed + 1
  File "/opt/hostedtoolcache/Python/3.7.11/x64/lib/python3.7/site-packages/nbclient/client.py", line 857, in async_execute_cell
    self._check_raise_for_error(cell, exec_reply)
  File "/opt/hostedtoolcache/Python/3.7.11/x64/lib/python3.7/site-packages/nbclient/client.py", line 760, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
VCF_FILE = "https://storage.googleapis.com/sgkit-gwas-tutorial/1kg.vcf.bgz"
vcf_to_zarr(VCF_FILE, "1kg.zarr", max_alt_alleles=1,
          fields=["FORMAT/GT", "FORMAT/DP", "FORMAT/GQ", "FORMAT/AD"],
          field_defs={"FORMAT/AD": {"Number": "R"}})

and

/opt/hostedtoolcache/Python/3.7.11/x64/lib/python3.7/site-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.VCF.__init__()

/opt/hostedtoolcache/Python/3.7.11/x64/lib/python3.7/site-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.HTSFile._open_htsfile()

OSError: Error opening https://storage.googleapis.com/sgkit-gwas-tutorial/1kg.vcf.bgz

I can't reproduce on Mac, so I think it may be related to the Linux wheel. See also brentp/cyvcf2#205

@tomwhite force-pushed the gwas-tutorial branch 2 times, most recently from cf2d4fe to e470220, on July 20, 2021 13:36
@codecov-commenter

codecov-commenter commented Jul 20, 2021

Codecov Report

Merging #629 (4a404e2) into main (5f43ecc) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##              main      #629   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           35        35           
  Lines         2815      2829   +14     
=========================================
+ Hits          2815      2829   +14     

Impacted Files               Coverage Δ
sgkit/io/utils.py            100.00% <0.00%> (ø)
sgkit/io/vcf/vcf_reader.py   100.00% <0.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tomwhite
Collaborator Author

The build was failing when cyvcf2 was reading a remote VCF file, so I spent some time getting the cyvcf2 wheels built again in the hope that would fix it. Unfortunately it didn't (brentp/cyvcf2#216), so to make progress on this issue I have changed the notebook to download the remote VCF to a local file before using cyvcf2 to read it.

This should be ready for review now. The rendered notebook is here.
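
The download-then-read workaround described above might look roughly like the following. This is a minimal sketch, not the notebook's actual code: `download_to_local` and `convert_remote_vcf` are hypothetical helper names, and the `vcf_to_zarr` arguments are copied from the failing cell quoted earlier in this thread. Whether an accompanying index file is also needed depends on how the conversion is invoked, which this sketch does not address.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_local(url, dest):
    """Stream `url` to the local path `dest` and return it as a Path."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def convert_remote_vcf(url, local_vcf, zarr_path):
    """Hypothetical helper: fetch the VCF first, then convert the local copy."""
    # Imported here so download_to_local is usable without sgkit installed.
    from sgkit.io.vcf import vcf_to_zarr

    local = download_to_local(url, local_vcf)
    # Arguments mirror the cell quoted in this thread; cyvcf2 now opens a
    # local path instead of a remote URL.
    vcf_to_zarr(str(local), zarr_path, max_alt_alleles=1,
                fields=["FORMAT/GT", "FORMAT/DP", "FORMAT/GQ", "FORMAT/AD"],
                field_defs={"FORMAT/AD": {"Number": "R"}})
```

Streaming with `shutil.copyfileobj` avoids holding the whole file in memory, which matters for VCFs of any real size.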

@jeromekelleher
Collaborator

One question: is this a good thing to encourage:

from sgkit.io.vcf import vcf_to_zarr

Why not just use sg.vcf_to_zarr? (I had assumed we'd export these functions to the top-level?) It seems odd to me to bind ourselves to this internal package layout permanently.

@jeromekelleher
Copy link
Collaborator

Not sure about phrasing: "so it only loads the first alternate allele" - how about "converts" rather than "loads"?

(This is one of the reasons I dislike notebooks - impossible to collaborate on via Git!)

@jeromekelleher
Copy link
Collaborator

Actually, is there any reason not to convert this to MyST markdown? I guess the upside of a notebook is that people can download it directly and run it there. This isn't so easy if you go the whole hog with JupyterBook. Maybe there's an intermediate step though?

@tomwhite
Copy link
Collaborator Author

Thanks for taking a look @jeromekelleher!

> Why not just use sg.vcf_to_zarr? (I had assumed we'd export these functions to the top-level?)

This goes back to the discussion about keeping the IO packages separate (#494). As it stands if you import sgkit.io.vcf on Windows you get an import error. I think we could move everything to the top-level with a bit more work, though arguably we could do it in a future release.
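
One general way to expose a function at a package's top level without importing its platform-specific dependencies eagerly is to defer the import until the function is called. The sketch below illustrates that idea only; `lazy_function` is a hypothetical helper, and nothing here is confirmed to be how sgkit handles (or plans to handle) its IO packages.

```python
import importlib


def lazy_function(module_name, attr_name):
    """Return a wrapper that imports `module_name` only when it is called."""
    def wrapper(*args, **kwargs):
        try:
            module = importlib.import_module(module_name)
        except ImportError as exc:
            # Re-raise with a message that names the function the user called.
            raise ImportError(
                f"{attr_name} needs {module_name!r}, which could not be imported"
            ) from exc
        return getattr(module, attr_name)(*args, **kwargs)

    wrapper.__name__ = attr_name
    return wrapper


# Usage with a stdlib module: the import happens on the first call, not here.
sqrt = lazy_function("math", "sqrt")
result = sqrt(9.0)
```

With this pattern, importing the top-level package succeeds everywhere, and the import error only appears when someone calls a function whose dependencies are missing.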

> Not sure about phrasing: "so it only loads the first alternate allele" - how about "converts" rather than "loads"?

Agreed - I'll fix this.

> Actually, is there any reason not to convert this to MyST markdown? I guess the upside of a notebook is that people can download it directly and run it there.

Yes, that's why I went with a notebook. Also MyST is relatively new, so I wasn't sure how stable it is yet.

@jeromekelleher
Collaborator

> This goes back to the discussion about keeping the IO packages separate (#494). As it stands if you import sgkit.io.vcf on Windows you get an import error. I think we could move everything to the top-level with a bit more work, though arguably we could do it in a future release.

Right, gotcha.

@jeromekelleher
Collaborator

I've had a quick look through, and it LGTM. Probably easier to do small edits for any changes, so let's merge whenever you're happy.

@tomwhite added the auto-merge label (Auto merge label for mergify test flight) on Jul 29, 2021
@tomwhite merged commit 5d58d09 into sgkit-dev:main on Jul 30, 2021
@hammer
Contributor

hammer commented Jul 30, 2021

Great to see this go in! @tomwhite should we file an sgkit issue to follow up on the remote read issue in the cyvcf2 wheels?

@tomwhite
Collaborator Author

tomwhite commented Aug 2, 2021

> should we file an sgkit issue to follow up on the remote read issue in the cyvcf2 wheels?

Opened #645

Labels
auto-merge Auto merge label for mergify test flight
Development

Successfully merging this pull request may close these issues.

Replicating Hail's GWAS Tutorial
4 participants