How do I... additions #1165

Open
hammer opened this issue Jan 6, 2024 · 3 comments
Labels
documentation Improvements or additions to documentation

Comments

@hammer
Contributor

hammer commented Jan 6, 2024

Operations I find myself doing regularly. How do I...

  • add new sample metadata?
  • add new variant metadata?
  • merge 2 datasets with separate chromosomes/contigs into 1 dataset?

We have a TODO in the docs for "Adding custom data to a Dataset", so the first two solutions should probably go there as well.
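For the first two items, a minimal sketch using plain xarray assignment might look like the following. The dataset here is a tiny hand-built stand-in rather than a real sgkit dataset, and `sample_cohort` and `variant_score` are hypothetical variable names chosen for illustration.

```python
import numpy as np
import xarray as xr

# Tiny stand-in for an sgkit dataset: 3 variants, 2 samples.
ds = xr.Dataset(
    {
        "variant_position": ("variants", np.array([100, 200, 300])),
        "sample_id": ("samples", np.array(["S1", "S2"])),
    }
)

# New per-sample metadata: one value per entry along "samples".
# "sample_cohort" is a hypothetical variable name.
ds["sample_cohort"] = ("samples", np.array(["case", "control"]))

# New per-variant metadata: one value per entry along "variants".
# "variant_score" is a hypothetical variable name.
ds["variant_score"] = ("variants", np.array([0.1, 0.5, 0.9]))
```

The new values have to be in the same order as `sample_id` (or the variants) already in the dataset; metadata keyed by ID would need reordering first.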

hammer added the documentation label Jan 6, 2024
@hammer
Contributor Author

hammer commented Jan 6, 2024

Ah, I already filed the sample metadata part of this: https://github.com/pystatgen/sgkit/issues/1151

@hammer
Contributor Author

hammer commented Jan 6, 2024

Just noting that merging separate contigs could be as easy as
ds = xr.concat([ds1, ds2], dim='variants')
but we would also need to update ds.variant_contig. I think contig_id needs its contigs dimension extended as well, and the contigs entry in the Dataset attributes updated. I need to read more about what we do with the contigs dimension...

(Looks like the contigs attribute was deprecated in https://github.com/pystatgen/sgkit/issues/1035, so I'll just update contig_id.)

Eh, on second thought that concat was far too optimistic: we only want to concatenate the variant data variables.
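One way to find the variant data variables without hardcoding their names is to filter on dimensions. This is only a sketch: `variant_data_vars` is a hypothetical helper, and the dataset is a small hand-built stand-in.

```python
import numpy as np
import xarray as xr


def variant_data_vars(ds, dim="variants"):
    """Return the names of data variables whose dimensions include `dim`."""
    return [name for name, var in ds.data_vars.items() if dim in var.dims]


# Stand-in dataset with a mix of variant, sample, and contig variables.
ds = xr.Dataset(
    {
        "variant_position": ("variants", np.array([100, 200])),
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.zeros((2, 3, 2), dtype="int8"),
        ),
        "sample_id": ("samples", np.array(["S1", "S2", "S3"])),
        "contig_id": ("contigs", np.array(["1"])),
    }
)

names = variant_data_vars(ds)
```

Note that this would also pick up variant_contig in a real dataset, which still needs its indexes shifted before concatenation.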

@hammer
Contributor Author

hammer commented Jan 6, 2024

Okay, I think this should do it for merging 2 datasets with no overlapping contigs, which I think will be the most common case. It assumes both datasets hold the same samples in the same order; to do this safely we'd need to build an index over samples and merge on sample_id, which may be worth doing someday...
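The safer variant could look roughly like the sketch below, which reorders the second dataset to match the first by sample_id before any concatenation. `align_samples` is a hypothetical helper, not an sgkit function, and it assumes both datasets are meant to contain exactly the same samples.

```python
import numpy as np
import pandas as pd
import xarray as xr


def align_samples(chr1, chr2):
    """Reorder chr2 along 'samples' to match chr1's sample_id order.

    Hypothetical helper: raises if the two datasets do not hold the
    same set of unique sample IDs.
    """
    ids1 = pd.Index(chr1["sample_id"].values)
    ids2 = pd.Index(chr2["sample_id"].values)
    if not (ids1.is_unique and ids2.is_unique):
        raise ValueError("sample_id values must be unique")
    # Position in chr2 of each chr1 sample; -1 where a sample is missing.
    order = ids2.get_indexer(ids1)
    if len(ids1) != len(ids2) or (order < 0).any():
        raise ValueError("chr1 and chr2 do not contain the same samples")
    return chr2.isel(samples=order)


# Stand-in datasets: same samples, different order.
chr1 = xr.Dataset({"sample_id": ("samples", np.array(["S1", "S2", "S3"]))})
chr2 = xr.Dataset(
    {
        "sample_id": ("samples", np.array(["S3", "S1", "S2"])),
        "call_dosage": (("variants", "samples"), np.array([[3, 1, 2]])),
    }
)

aligned = align_samples(chr1, chr2)
```

After alignment, the per-sample columns of chr2 line up with chr1, so concatenating along variants no longer silently mixes samples.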

(This was almost there: somehow contig_id and variant_contig become regular arrays instead of dask arrays, and I need to rechunk everything to be able to write it as Zarr.)

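import xarray as xr


def concat_chrs(chr1, chr2):
    """Concatenate two datasets that share samples but have no contigs in common."""
    new_ds_dict = {}

    # Concatenate contig_id, then shift chr2's variant_contig indexes by the
    # number of contigs in chr1 so they point into the combined contig_id
    n_contigs_chr1 = chr1.sizes['contigs']
    new_ds_dict['contig_id'] = xr.concat([chr1.contig_id, chr2.contig_id], dim='contigs')
    new_ds_dict['variant_contig'] = xr.concat([chr1.variant_contig, chr2.variant_contig + n_contigs_chr1], dim='variants')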

    # Concatenate remaining variant data variables
    data_vars_variants = [
        'call_genotype',
        'call_genotype_mask',
        'variant_allele',
        'variant_id',
        'variant_position',
        ]
    for dv in data_vars_variants:
        new_ds_dict[dv] = xr.concat([chr1[dv], chr2[dv]], dim='variants')

    # Copy over sample data variables from chr1
    data_vars_samples = [
        'sample_family_id',
        'sample_id',
        'sample_maternal_id',
        'sample_member_id',
        'sample_paternal_id',
        'sample_phenotype',
        'sample_sex',
    ]
    for dv in data_vars_samples:
        new_ds_dict[dv] = chr1[dv]

    return xr.Dataset(data_vars=new_ds_dict)  
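
A quick sanity check on the merged result can catch a missed index shift. The sketch below is an assumption about what is worth checking, not something sgkit provides: `check_contig_index` is a hypothetical helper, and `merged` is a hand-built stand-in for the output of concat_chrs.

```python
import numpy as np
import xarray as xr


def check_contig_index(ds):
    """Check that variant_contig indexes into contig_id (hypothetical helper).

    Returns the contig name for each variant so the result can be eyeballed.
    """
    n_contigs = ds.sizes["contigs"]
    contig_idx = ds["variant_contig"].values
    if contig_idx.min() < 0 or contig_idx.max() >= n_contigs:
        raise ValueError("variant_contig has indexes outside contig_id")
    if len(np.unique(ds["contig_id"].values)) != n_contigs:
        raise ValueError("contig_id contains duplicates")
    return ds["contig_id"].values[contig_idx]


# Stand-in for a merged dataset: two variants on contig "1", one on contig "2".
merged = xr.Dataset(
    {
        "contig_id": ("contigs", np.array(["1", "2"])),
        "variant_contig": ("variants", np.array([0, 0, 1])),
        "variant_position": ("variants", np.array([10, 20, 5])),
    }
)

contig_names = check_contig_index(merged)
```

A duplicate in contig_id would suggest the two inputs did overlap in contigs after all, which is the case this function is not meant to handle.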
