# Merging VCFs sample-wise

See https://github.com/pystatgen/sgkit/issues/840
    
How can we merge VCFs that each contain a single sample, when some variants are present in one file but not in the other?

It's normally preferable to do this with [bcftools merge](https://samtools.github.io/bcftools/bcftools.html#merge), but this page shows a way to do it with Xarray.

In [1]:
! cat 1.vcf

##fileformat=VCFv4.0
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE1
chr1	14458	.	G	.	.	PASS	AC=0;AN=2	GT	0/0
chr1	14464	.	A	T	110	PASS	AC=1;AN=2	GT	0/1
chr1	14465	.	G	.	.	PASS	AC=0;AN=2	GT	0/0


In [2]:
! cat 2.vcf

##fileformat=VCFv4.0
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE2
chr1	14464	.	A	T	110	PASS	AC=1;AN=2	GT	0/1
chr1	14465	.	G	.	.	PASS	AC=0;AN=2	GT	0/0
chr1	14676	.	G	.	.	PASS	AC=0;AN=2	GT	0/0
chr1	14677	.	G	A	85	PASS	AC=1;AN=2	GT	0/1


In [3]:
import sgkit as sg
from sgkit.io.vcf import vcf_to_zarr
import xarray as xr

In [4]:
# Convert the compressed copies of the VCFs shown above to Zarr and load them as datasets
vcf_to_zarr("1.vcf.gz", "1.zarr")
vcf_to_zarr("2.vcf.gz", "2.zarr")

ds1 = sg.load_dataset("1.zarr")
ds2 = sg.load_dataset("2.zarr")

# load into memory
ds1.load()
ds2.load()

[W::vcf_parse] INFO 'AC' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'AN' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'AC' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'AN' is not defined in the header, assuming Type=String


In [5]:
# Restrict the datasets to contig and position (for the join index) plus variables with a samples dimension
# (named keep_vars rather than vars to avoid shadowing the Python builtin)
keep_vars = ["variants", "variant_contig", "variant_position"]  # index vars
keep_vars.extend([v for v in ds1.data_vars if "samples" in ds1[v].dims])  # vars with a samples dim

ds1 = ds1[keep_vars]
ds2 = ds2[keep_vars]

ds1_ind = ds1.set_index(variants=("variant_contig", "variant_position"))
ds1_ind.load()

ds2_ind = ds2.set_index(variants=("variant_contig", "variant_position"))
ds2_ind.load()
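
The index is built from both contig and position because a position alone is ambiguous once more than one contig is present. The following is a minimal sketch of what `set_index` produces, using a hypothetical toy dataset (the `toy` names and values are made up for illustration and are not read from the VCFs above):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: position 14458 occurs on two different contigs
toy = xr.Dataset(
    {
        "variant_contig": ("variants", np.array([0, 0, 1])),
        "variant_position": ("variants", np.array([14458, 14464, 14458])),
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.zeros((3, 1, 2), dtype="int8"),
        ),
    }
)

# Combine contig and position into a single MultiIndex on the variants dimension
toy_ind = toy.set_index(variants=["variant_contig", "variant_position"])
index = toy_ind.indexes["variants"]

print(index.names)
print(index.is_unique)  # unique only because the contig is part of the key
```

With position as the only key, the first and third variants would collide; the (contig, position) pair keeps them distinct.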

In [6]:
# Concatenate the two indexed datasets along the samples dimension
ds = xr.concat([ds1_ind, ds2_ind], dim="samples", data_vars="minimal")
ds
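
To see how the concatenation treats variants that appear in only one file, here is a self-contained sketch on hypothetical toy data that mimics the two VCFs above. It is a simplification, not the notebook's exact method: it indexes variants by position only (enough for a single contig) instead of a (contig, position) MultiIndex, and it passes `join="outer"` and `fill_value=-1` explicitly. The `-1` is chosen to match sgkit's usual encoding of a missing allele call; treat that as an assumption to check against your sgkit version.

```python
import numpy as np
import xarray as xr


def toy_dataset(sample_id, positions, genotypes):
    """Build a hypothetical single-sample dataset indexed by position."""
    calls = np.asarray(genotypes, dtype="int8")[:, np.newaxis, :]
    return xr.Dataset(
        data_vars={"call_genotype": (("variants", "samples", "ploidy"), calls)},
        coords={"variants": positions, "sample_id": ("samples", [sample_id])},
    )


a = toy_dataset("SAMPLE1", [14458, 14464, 14465], [[0, 0], [0, 1], [0, 0]])
b = toy_dataset(
    "SAMPLE2", [14464, 14465, 14676, 14677], [[0, 1], [0, 0], [0, 0], [0, 1]]
)

# Outer join: keep the union of variants, filling absent calls with -1
merged = xr.concat([a, b], dim="samples", join="outer", fill_value=-1)

print(merged.sizes)
print(merged["call_genotype"].sel(variants=14458).values)
```

Without an explicit `fill_value`, xarray fills the gaps with NaN, which forces integer genotype arrays to a floating-point dtype; whether that matters depends on what is done with the merged dataset afterwards.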

## Note
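TODO: add back the other variant fields (such as `variant_allele`), which were dropped when the datasets were restricted to the index and per-sample variables.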