Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimum viable sgkit dataset #748

Merged
merged 2 commits into from
Nov 30, 2022
Merged

Conversation

benjeffery
Copy link
Member

@benjeffery benjeffery commented Oct 20, 2022

Round trips an sgkit dataset.

@benjeffery
Copy link
Member Author

@jeromekelleher I'm thinking about how to cope with the uuid method here. We originally thought to make this a read-only operation on the sgkit dataset, but the only way i can see to support uuid is to generate one and write it back to the dataset. Otherwise, the dataset will have a different uuid each time it is read.

@jeromekelleher
Copy link
Member

Let's get rid of UUID, it's pointless. We've never used it and it's based on an unrealistic idea that datasets get made once and never changed.

So, just hack around it in whatever way is simplest in the short term.

@benjeffery
Copy link
Member Author

Ok I've removed uuid as that was the simplest way.

For ancestral allele - I'm thinking we provide a method on the dataset that takes either an array of ancestral allele values and converts it to the needed index. Could also accept a sequence that the alleles are extracted from at the site's positions?

@hyanwong
Copy link
Member

hyanwong commented Oct 31, 2022

For ancestral allele - I'm thinking we provide a method on the dataset that takes an array of ancestral allele values and converts it to the needed index. Could also accept a sequence that the alleles are extracted from at the site's positions?

Both would be useful, as ancestral alleles are sometimes provided as a FASTA file by ensembl (e.g. http://ftp.ensembl.org/pub/release-108/fasta/ancestral_alleles/). I guess we would need to check that they were of the same length as the sequence length. The approach would be similar to the reference_sequence stuff, I guess, other than we don't have to store it.

@benjeffery
Copy link
Member Author

benjeffery commented Oct 31, 2022

I guess we would need to check that they were of the same length as the sequence length.

There is no sequence length stored in sgkit (makes sense as a VCF has no such concept. We should check that the sequence value at each site is one of the alleles though, right?

FASTA won't work for indels right? Can't get alleles from the FASTA for them without knowing their length.

@hyanwong
Copy link
Member

hyanwong commented Oct 31, 2022

There is sometimes a VCF sequence length definition, I think: you can get it using vcf.seqlens in cyvcf2, but it's an optional thing. Either way, we need some way to put the seq len into the new format, right? And we can check it from there.

But yes, FASTA is no good for indels. In general, using the VCF AA field seems preferable, but the ensembl AA data used to be of better quality. Being able to change the AA state by editing the samples file without having to read all the variation data in again would be really useful, I think.

@benjeffery
Copy link
Member Author

Either way, we need some way to put the seq len into the new format, right?

At the moment the concept is read-only from sgkit. Any use case that requires changing the dataset will need to be thought through as where possible that should be an sgkit function/workflow unless it is very tsinfer specific.

@hyanwong
Copy link
Member

hyanwong commented Nov 1, 2022

Either way, we need some way to put the seq len into the new format, right?

At the moment the concept is read-only from sgkit. Any use case that requires changing the dataset will need to be thought through as where possible that should be an sgkit function/workflow unless it is very tsinfer specific.

Right, so the question is whether SGkit should have seqlen stored somewhere, e.g. when read from a VCF line like

##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>

(https://samtools.github.io/hts-specs/VCFv4.2.pdf). I guess at the moment that data is not stored in an SGkit data file, although is is returned by htslib/cyvcf2 (annoyingly undocumented, though: brentp/cyvcf2@a138fad)

@tomwhite
Copy link

tomwhite commented Nov 1, 2022

Right, so the question is whether SGkit should have seqlen stored somewhere

We should change sgkit to store it if it's in the VCF. Some previous discussion here: https://github.com/pystatgen/sgkit/issues/464

@jeromekelleher
Copy link
Member

We should be able to derive VCF-spec defined assembly and contig information (including length) from the sgkit dataset, so let's follow that up as an upstream improvement.

@tomwhite
Copy link

tomwhite commented Nov 2, 2022

I've opened https://github.com/pystatgen/sgkit/pull/946 to add contig lengths to the dataset. Assembly information is not exposed by cyvcf2 so I haven't added that yet.

@benjeffery benjeffery changed the title WIP - infer from a sgkit dataset Minimum viable sgkit dataset Nov 3, 2022
@benjeffery benjeffery force-pushed the sgkit-sampledata branch 5 times, most recently from cf2e67d to 9821f20 Compare November 3, 2022 14:30
@benjeffery
Copy link
Member Author

Trying to sort out dependency issues here as sgkit has upper bounds on pandas and cyvcf2 needs to be in the conda part for windows compatibility. Apart from that this is ready to review!

@benjeffery benjeffery marked this pull request as ready for review November 3, 2022 14:55
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, LGTM as a minimum viable start. Caught a few small things.

@@ -703,7 +694,6 @@ def __str__(self):
("format_name", self.format_name),
("format_version", self.format_version),
("finalised", self.finalised),
("uuid", self.uuid),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should document this in the CHANGELOG I guess, but totally fine to day "this is gone, deal with it"

def __init__(self, path):
self.path = path
self.data = zarr.open(path, mode="r")
self._num_sites, self._num_individuals, self.ploidy = self.data[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

genotypes_arr = self.data["call_genotypes"]
self._num_sites, self._num_individuals, self.ploidy = genotypes_arr.shape

Currently line breaking is a bit eye hurty

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@property
def sites_ancestral_allele(self):
try:
return self.data["sites/ancestral_allele"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably be more sgkit-like in our choice of variable names here, I guess variant_ancestral_allele?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it likely we end up with variant_ancestral_allele_index so will put that for now, but final decision will be part of #764

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


@property
def sites_genotypes(self):
gt = self.data["call_genotype"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not obvious to me what this is doing, can we get a comment please? I.e., is it making a full copy and returning as a numpy array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@property
def metadata(self):
try:
return self.data.attrs["metadata_schema"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo, should be "metadata"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeromekelleher
Copy link
Member

It's probably simplest to drop python 3.7 here from CI I'd say?

@benjeffery
Copy link
Member Author

benjeffery commented Nov 4, 2022

It's probably simplest to drop python 3.7 here from CI I'd say?

It's not that simple, the circle CI tests are also failing, I can recreate locally and it seems to be an issue with xarray on Py3.7 (which is unsupported). Using Py3.8 and doing an unpinned install of the circle CI deps results in:

ERROR: sgkit 0.5.0 has requirement numpy<1.22, but you'll have numpy 1.23.4 which is incompatible.
ERROR: sgkit 0.5.0 has requirement pandas<1.4.0, but you'll have pandas 1.5.1 which is incompatible.
ERROR: sgkit 0.5.0 has requirement zarr<2.11.0,>=2.10.0, but you'll have zarr 2.13.3 which is incompatible

Along with failing tests.

Fixing these versions then passes - so I need to upgrade the circle CI tests to 3.8 and then be very careful about the pinning of sgkits deps - this should fix circle CI. Then remove the 3.7 tests on the github workflow.

@codecov
Copy link

codecov bot commented Nov 4, 2022

Codecov Report

Merging #748 (ff0b28c) into main (53b1866) will increase coverage by 0.12%.
The diff coverage is 91.06%.

❗ Current head ff0b28c differs from pull request most recent head 38cd717. Consider uploading reports for the commit 38cd717 to get more accurate results

@@            Coverage Diff             @@
##             main     #748      +/-   ##
==========================================
+ Coverage   93.34%   93.46%   +0.12%     
==========================================
  Files          17       17              
  Lines        5361     5691     +330     
  Branches      984     1008      +24     
==========================================
+ Hits         5004     5319     +315     
- Misses        235      246      +11     
- Partials      122      126       +4     
Flag Coverage Δ
C 93.46% <91.06%> (+0.12%) ⬆️
python 96.44% <91.06%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
tsinfer/formats.py 97.59% <91.01%> (-0.91%) ⬇️
tsinfer/inference.py 98.71% <100.00%> (+0.15%) ⬆️
tsinfer/constants.py 100.00% <0.00%> (ø)
tsinfer/algorithm.py 98.47% <0.00%> (+0.01%) ⬆️
tsinfer/threads.py 60.86% <0.00%> (+1.29%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jeromekelleher
Copy link
Member

I'm going to unsubscribe here @benjeffery - can you ping me for final review when ready please?

@jeromekelleher
Copy link
Member

(Looks like a fun time trying to get the package version problem solved! 🤮 )

@benjeffery benjeffery force-pushed the sgkit-sampledata branch 3 times, most recently from 236ac2b to 61e1d70 Compare November 7, 2022 13:36
@benjeffery
Copy link
Member Author

@jeromekelleher Have you seen this error before? "The process cannot access the file because it is being used by another process" It's in the CLI tests that I haven't touched!

@jeromekelleher
Copy link
Member

Yes, I think I'm having the same problem over in #769 ...

@benjeffery
Copy link
Member Author

The sgkit[vcf] format doesn't seem to work with conda as cyvcf2 is missing. I could add this to the conda specification, but that would then fail on windows as there is no cyvcf2 for windows on conda. With the previous pip setup I could specify not to install a module on a specific OS - I'll try to find the conda way of doing this.

@jeromekelleher
Copy link
Member

I'm not sure we need vcf support in sgkit here?

@benjeffery
Copy link
Member Author

It's nice to confirm that the version of sgkit creates a dataset we can read, we could store the dataset, but would need to update it when sgkit changed anything.

@jeromekelleher
Copy link
Member

Right, fair enough.

The sgkit on conda should be "batteries included" anyway and have all the functionality. There's no advantage to getting it via pip because cyvcf still won't be available.

@@ -10,7 +12,8 @@ pytest==7.2.0
pytest-xdist==3.0.2
python-lmdb==1.3.0
seaborn==0.12.1
sortedcontainers==2.4.0
sgkit[vcf]==0.5.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete the [vcf] here I think

@benjeffery
Copy link
Member Author

@jeromekelleher Would be good to get this merged before #778 as it contains the sgkit testing infrastructure. We have issues filed for the follow-up work.

@jeromekelleher
Copy link
Member

Go for it!

@benjeffery
Copy link
Member Author

@Mergifyio rebase

@mergify
Copy link
Contributor

mergify bot commented Nov 29, 2022

rebase

✅ Branch has been successfully rebased

@benjeffery
Copy link
Member Author

Manual merge due to mergify config change.

@benjeffery benjeffery merged commit 6ca6edc into tskit-dev:main Nov 30, 2022
@benjeffery benjeffery deleted the sgkit-sampledata branch November 30, 2022 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants