Minimum viable sgkit dataset #748

benjeffery · 2022-10-20T12:40:21Z

Round trips an sgkit dataset.

benjeffery · 2022-10-28T11:31:45Z

@jeromekelleher I'm thinking about how to cope with the uuid method here. We originally thought to make this a read-only operation on the sgkit dataset, but the only way i can see to support uuid is to generate one and write it back to the dataset. Otherwise, the dataset will have a different uuid each time it is read.

jeromekelleher · 2022-10-28T12:51:54Z

Let's get rid of UUID, it's pointless. We've never used it and it's based on an unrealistic idea that datasets get made once and never changed.

So, just hack around it in whatever way is simplest in the short term.

benjeffery · 2022-10-31T16:32:31Z

Ok I've removed uuid as that was the simplest way.

For ancestral allele - I'm thinking we provide a method on the dataset that takes either an array of ancestral allele values and converts it to the needed index. Could also accept a sequence that the alleles are extracted from at the site's positions?

hyanwong · 2022-10-31T17:12:56Z

For ancestral allele - I'm thinking we provide a method on the dataset that takes an array of ancestral allele values and converts it to the needed index. Could also accept a sequence that the alleles are extracted from at the site's positions?

Both would be useful, as ancestral alleles are sometimes provided as a FASTA file by ensembl (e.g. http://ftp.ensembl.org/pub/release-108/fasta/ancestral_alleles/). I guess we would need to check that they were of the same length as the sequence length. The approach would be similar to the reference_sequence stuff, I guess, other than we don't have to store it.

benjeffery · 2022-10-31T22:09:34Z

I guess we would need to check that they were of the same length as the sequence length.

There is no sequence length stored in sgkit (makes sense as a VCF has no such concept. We should check that the sequence value at each site is one of the alleles though, right?

FASTA won't work for indels right? Can't get alleles from the FASTA for them without knowing their length.

hyanwong · 2022-10-31T22:12:58Z

There is sometimes a VCF sequence length definition, I think: you can get it using vcf.seqlens in cyvcf2, but it's an optional thing. Either way, we need some way to put the seq len into the new format, right? And we can check it from there.

But yes, FASTA is no good for indels. In general, using the VCF AA field seems preferable, but the ensembl AA data used to be of better quality. Being able to change the AA state by editing the samples file without having to read all the variation data in again would be really useful, I think.

benjeffery · 2022-11-01T09:58:06Z

Either way, we need some way to put the seq len into the new format, right?

At the moment the concept is read-only from sgkit. Any use case that requires changing the dataset will need to be thought through as where possible that should be an sgkit function/workflow unless it is very tsinfer specific.

hyanwong · 2022-11-01T10:05:31Z

Either way, we need some way to put the seq len into the new format, right?

At the moment the concept is read-only from sgkit. Any use case that requires changing the dataset will need to be thought through as where possible that should be an sgkit function/workflow unless it is very tsinfer specific.

Right, so the question is whether SGkit should have seqlen stored somewhere, e.g. when read from a VCF line like

##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>

(https://samtools.github.io/hts-specs/VCFv4.2.pdf). I guess at the moment that data is not stored in an SGkit data file, although is is returned by htslib/cyvcf2 (annoyingly undocumented, though: brentp/cyvcf2@a138fad)

tomwhite · 2022-11-01T10:27:50Z

Right, so the question is whether SGkit should have seqlen stored somewhere

We should change sgkit to store it if it's in the VCF. Some previous discussion here: https://github.com/pystatgen/sgkit/issues/464

jeromekelleher · 2022-11-01T13:10:02Z

We should be able to derive VCF-spec defined assembly and contig information (including length) from the sgkit dataset, so let's follow that up as an upstream improvement.

tomwhite · 2022-11-02T15:26:42Z

I've opened https://github.com/pystatgen/sgkit/pull/946 to add contig lengths to the dataset. Assembly information is not exposed by cyvcf2 so I haven't added that yet.

benjeffery · 2022-11-03T14:55:51Z

Trying to sort out dependency issues here as sgkit has upper bounds on pandas and cyvcf2 needs to be in the conda part for windows compatibility. Apart from that this is ready to review!

jeromekelleher

Yep, LGTM as a minimum viable start. Caught a few small things.

jeromekelleher · 2022-11-03T15:28:23Z

tsinfer/formats.py

@@ -703,7 +694,6 @@ def __str__(self):
            ("format_name", self.format_name),
            ("format_version", self.format_version),
            ("finalised", self.finalised),
-            ("uuid", self.uuid),


We should document this in the CHANGELOG I guess, but totally fine to day "this is gone, deal with it"

jeromekelleher · 2022-11-03T15:30:12Z

tsinfer/formats.py

+    def __init__(self, path):
+        self.path = path
+        self.data = zarr.open(path, mode="r")
+        self._num_sites, self._num_individuals, self.ploidy = self.data[


Maybe

genotypes_arr = self.data["call_genotypes"] self._num_sites, self._num_individuals, self.ploidy = genotypes_arr.shape

Currently line breaking is a bit eye hurty

jeromekelleher · 2022-11-03T15:32:24Z

tsinfer/formats.py

+    @property
+    def sites_ancestral_allele(self):
+        try:
+            return self.data["sites/ancestral_allele"]


We should probably be more sgkit-like in our choice of variable names here, I guess variant_ancestral_allele?

I think it likely we end up with variant_ancestral_allele_index so will put that for now, but final decision will be part of #764

jeromekelleher · 2022-11-03T15:33:24Z

tsinfer/formats.py

+
+    @property
+    def sites_genotypes(self):
+        gt = self.data["call_genotype"]


Not obvious to me what this is doing, can we get a comment please? I.e., is it making a full copy and returning as a numpy array?

jeromekelleher · 2022-11-03T15:34:17Z

tsinfer/formats.py

+    @property
+    def metadata(self):
+        try:
+            return self.data.attrs["metadata_schema"]


Typo, should be "metadata"

jeromekelleher · 2022-11-03T15:37:17Z

It's probably simplest to drop python 3.7 here from CI I'd say?

benjeffery · 2022-11-04T15:31:28Z

It's probably simplest to drop python 3.7 here from CI I'd say?

It's not that simple, the circle CI tests are also failing, I can recreate locally and it seems to be an issue with xarray on Py3.7 (which is unsupported). Using Py3.8 and doing an unpinned install of the circle CI deps results in:

ERROR: sgkit 0.5.0 has requirement numpy<1.22, but you'll have numpy 1.23.4 which is incompatible.
ERROR: sgkit 0.5.0 has requirement pandas<1.4.0, but you'll have pandas 1.5.1 which is incompatible.
ERROR: sgkit 0.5.0 has requirement zarr<2.11.0,>=2.10.0, but you'll have zarr 2.13.3 which is incompatible

Along with failing tests.

Fixing these versions then passes - so I need to upgrade the circle CI tests to 3.8 and then be very careful about the pinning of sgkits deps - this should fix circle CI. Then remove the 3.7 tests on the github workflow.

codecov · 2022-11-04T15:41:56Z

Codecov Report

Merging #748 (ff0b28c) into main (53b1866) will increase coverage by 0.12%.
The diff coverage is 91.06%.

❗ Current head ff0b28c differs from pull request most recent head 38cd717. Consider uploading reports for the commit 38cd717 to get more accurate results

@@            Coverage Diff             @@
##             main     #748      +/-   ##
==========================================
+ Coverage   93.34%   93.46%   +0.12%     
==========================================
  Files          17       17              
  Lines        5361     5691     +330     
  Branches      984     1008      +24     
==========================================
+ Hits         5004     5319     +315     
- Misses        235      246      +11     
- Partials      122      126       +4

Flag	Coverage Δ
C	`93.46% <91.06%> (+0.12%)`	⬆️
python	`96.44% <91.06%> (-0.09%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
tsinfer/formats.py	`97.59% <91.01%> (-0.91%)`	⬇️
tsinfer/inference.py	`98.71% <100.00%> (+0.15%)`	⬆️
tsinfer/constants.py	`100.00% <0.00%> (ø)`
tsinfer/algorithm.py	`98.47% <0.00%> (+0.01%)`	⬆️
tsinfer/threads.py	`60.86% <0.00%> (+1.29%)`	⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

jeromekelleher · 2022-11-07T10:39:26Z

I'm going to unsubscribe here @benjeffery - can you ping me for final review when ready please?

jeromekelleher · 2022-11-07T10:40:04Z

(Looks like a fun time trying to get the package version problem solved! 🤮 )

benjeffery · 2022-11-07T14:44:56Z

@jeromekelleher Have you seen this error before? "The process cannot access the file because it is being used by another process" It's in the CLI tests that I haven't touched!

jeromekelleher · 2022-11-07T17:22:23Z

Yes, I think I'm having the same problem over in #769 ...

benjeffery · 2022-11-08T11:53:54Z

The sgkit[vcf] format doesn't seem to work with conda as cyvcf2 is missing. I could add this to the conda specification, but that would then fail on windows as there is no cyvcf2 for windows on conda. With the previous pip setup I could specify not to install a module on a specific OS - I'll try to find the conda way of doing this.

jeromekelleher · 2022-11-08T12:50:23Z

I'm not sure we need vcf support in sgkit here?

benjeffery · 2022-11-08T12:53:29Z

It's nice to confirm that the version of sgkit creates a dataset we can read, we could store the dataset, but would need to update it when sgkit changed anything.

jeromekelleher · 2022-11-08T13:00:06Z

Right, fair enough.

The sgkit on conda should be "batteries included" anyway and have all the functionality. There's no advantage to getting it via pip because cyvcf still won't be available.

jeromekelleher · 2022-11-08T13:00:52Z

requirements/CI-tests-conda/requirements.txt

@@ -10,7 +12,8 @@ pytest==7.2.0
 pytest-xdist==3.0.2 
 python-lmdb==1.3.0 
 seaborn==0.12.1
-sortedcontainers==2.4.0 
+sgkit[vcf]==0.5.0


delete the [vcf] here I think

benjeffery · 2022-11-29T13:59:28Z

@jeromekelleher Would be good to get this merged before #778 as it contains the sgkit testing infrastructure. We have issues filed for the follow-up work.

jeromekelleher · 2022-11-29T15:18:12Z

Go for it!

benjeffery · 2022-11-29T23:49:25Z

@Mergifyio rebase

mergify · 2022-11-29T23:50:33Z

rebase

✅ Branch has been successfully rebased

benjeffery · 2022-11-30T00:04:06Z

Manual merge due to mergify config change.

tomwhite mentioned this pull request Nov 2, 2022

Add contig_lengths dataset attribute if found in the VCF file sgkit-dev/sgkit#946

Merged

benjeffery force-pushed the sgkit-sampledata branch from ff30e85 to 0a0f60c Compare November 3, 2022 12:16

benjeffery changed the title ~~WIP - infer from a sgkit dataset~~ Minimum viable sgkit dataset Nov 3, 2022

benjeffery force-pushed the sgkit-sampledata branch 5 times, most recently from cf2e67d to 9821f20 Compare November 3, 2022 14:30

benjeffery marked this pull request as ready for review November 3, 2022 14:55

jeromekelleher reviewed Nov 3, 2022

View reviewed changes

benjeffery force-pushed the sgkit-sampledata branch 3 times, most recently from 236ac2b to 61e1d70 Compare November 7, 2022 13:36

benjeffery force-pushed the sgkit-sampledata branch from 61e1d70 to ae2b087 Compare November 7, 2022 14:12

benjeffery force-pushed the sgkit-sampledata branch from ae2b087 to 770b3b1 Compare November 8, 2022 11:19

jeromekelleher reviewed Nov 8, 2022

View reviewed changes

benjeffery force-pushed the sgkit-sampledata branch from 7b04422 to db8b8ef Compare November 10, 2022 11:04

benjeffery force-pushed the sgkit-sampledata branch from ff0b28c to fdebc53 Compare November 29, 2022 13:58

jeromekelleher approved these changes Nov 29, 2022

View reviewed changes

benjeffery added the AUTOMERGE-REQUESTED label Nov 29, 2022

benjeffery added 2 commits November 29, 2022 23:50

Add sgkit dataset

b809db0

Remove uuid

38cd717

benjeffery force-pushed the sgkit-sampledata branch from fdebc53 to 38cd717 Compare November 29, 2022 23:50

benjeffery merged commit 6ca6edc into tskit-dev:main Nov 30, 2022

mergify bot removed the AUTOMERGE-REQUESTED label Nov 30, 2022

benjeffery deleted the sgkit-sampledata branch November 30, 2022 00:04

Minimum viable sgkit dataset #748

Minimum viable sgkit dataset #748

Conversation

benjeffery commented Oct 20, 2022 • edited Loading

benjeffery commented Oct 28, 2022

jeromekelleher commented Oct 28, 2022

benjeffery commented Oct 31, 2022

hyanwong commented Oct 31, 2022 • edited by benjeffery Loading

benjeffery commented Oct 31, 2022 • edited Loading

hyanwong commented Oct 31, 2022 • edited Loading

benjeffery commented Nov 1, 2022

hyanwong commented Nov 1, 2022

tomwhite commented Nov 1, 2022

jeromekelleher commented Nov 1, 2022

tomwhite commented Nov 2, 2022

benjeffery commented Nov 3, 2022

jeromekelleher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeromekelleher commented Nov 3, 2022

benjeffery commented Nov 4, 2022 • edited Loading

codecov bot commented Nov 4, 2022 • edited Loading

Codecov Report

jeromekelleher commented Nov 7, 2022

jeromekelleher commented Nov 7, 2022

benjeffery commented Nov 7, 2022

jeromekelleher commented Nov 7, 2022

benjeffery commented Nov 8, 2022

jeromekelleher commented Nov 8, 2022

benjeffery commented Nov 8, 2022

jeromekelleher commented Nov 8, 2022

Choose a reason for hiding this comment

benjeffery commented Nov 29, 2022

jeromekelleher commented Nov 29, 2022

benjeffery commented Nov 29, 2022

mergify bot commented Nov 29, 2022

✅ Branch has been successfully rebased

benjeffery commented Nov 30, 2022

benjeffery commented Oct 20, 2022 •

edited

Loading

hyanwong commented Oct 31, 2022 •

edited by benjeffery

Loading

benjeffery commented Oct 31, 2022 •

edited

Loading

hyanwong commented Oct 31, 2022 •

edited

Loading

benjeffery commented Nov 4, 2022 •

edited

Loading

codecov bot commented Nov 4, 2022 •

edited

Loading