# Founder haplotypes

## Overview

`xftsim` simulations can be highly realistic, but they can also be quite simplistic. Founder data is one place where this choice arises: we can either use real phased haplotype data or simulate haplotypes as independent Bernoulli trials. Because fully synthetic data is extremely convenient to work with, we recommend using it at the very least to prototype or debug simulations.

In what follows, we first introduce tools for generating haplotypes from scratch, then cover tools for importing external haplotype data.

## Haplotypes from scratch

The simplest way to generate fully synthetic haplotype data is to draw each allele as an independent Bernoulli trial. Given a set of allele frequencies, `founders.founder_haplotypes_from_AFs()` will randomly generate a haplotype array using this method:

In [1]:
import xftsim as xft
from xftsim import founders


# allele frequencies for three variants
afs = [0, 1, .5]
founders.founder_haplotypes_from_AFs(n=10, afs=afs)
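
To make the sampling scheme concrete, here is a minimal NumPy sketch of independent Bernoulli draws from a vector of allele frequencies. It is not the `xftsim` implementation: the helper name `bernoulli_haplotypes` and the column layout (two haploid copies per variant, side by side) are assumptions made for illustration, and the function returns a plain array rather than an `xft.struct.HaplotypeArray`.

```python
import numpy as np


def bernoulli_haplotypes(n, afs, seed=None):
    """Hypothetical helper: draw an (n, 2 * m) array of 0/1 alleles.

    Each of the m variants gets two columns (one per haploid copy), and
    every entry is an independent Bernoulli(af) draw.
    """
    rng = np.random.default_rng(seed)
    afs = np.asarray(afs, dtype=float)
    # Repeat each frequency twice so both haploid copies share it.
    probs = np.repeat(afs, 2)
    # Broadcasting draws one Bernoulli trial per individual per column.
    return rng.binomial(1, probs, size=(n, probs.size)).astype(np.int8)


haps = bernoulli_haplotypes(n=10, afs=[0, 1, .5], seed=1)
print(haps.shape)  # (10, 6)
```

With allele frequencies of 0 and 1, the first two columns are all zeros and the next two are all ones, so only the final pair of columns varies between individuals.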

For convenience, the function `founders.founder_haplotypes_uniform_AFs()` will uniformly sample `m` allele frequencies between `minMAF` and `1 - minMAF`:

In [2]:
founders.founder_haplotypes_uniform_AFs(n=10,m=10,minMAF=.05)
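
The allele-frequency sampling step can be sketched in a few lines of NumPy. The name `uniform_afs` is hypothetical, and this only illustrates drawing `m` frequencies uniformly between `minMAF` and `1 - minMAF`; it is not the library's own code.

```python
import numpy as np


def uniform_afs(m, minMAF=.05, seed=None):
    """Hypothetical helper: m allele frequencies, uniform on [minMAF, 1 - minMAF)."""
    if not 0 <= minMAF <= .5:
        raise ValueError("minMAF must lie between 0 and 0.5")
    rng = np.random.default_rng(seed)
    return rng.uniform(minMAF, 1 - minMAF, size=m)


afs = uniform_afs(m=10, minMAF=.05, seed=1)
# The minor allele frequency is the smaller of af and 1 - af.
mafs = np.minimum(afs, 1 - afs)
print(afs.round(3))
print(mafs.round(3))
```

Bounding the frequencies away from 0 and 1 guarantees that no variant is monomorphic in expectation, which is usually what we want when prototyping.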

:::{note}
    
When creating a generic `xft.Index.VariantIndex` (to assign variant IDs, etc.) as in the above examples, `xftsim` will attempt to divide variants evenly between (up to) 22 chromosomes.
    
:::
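
As a rough picture of what an even division across up to 22 chromosomes might look like, the sketch below labels variants in near-equal contiguous blocks. The helper `assign_chromosomes` and the contiguous-block scheme are assumptions for illustration; `xftsim` may assign chromosomes differently.

```python
import numpy as np


def assign_chromosomes(m, max_chrom=22):
    """Hypothetical helper: label m variants with chromosomes 1..k,
    where k = min(m, max_chrom), in near-equal contiguous blocks."""
    if m < 1:
        raise ValueError("m must be at least 1")
    k = min(m, max_chrom)
    blocks = np.array_split(np.arange(m), k)
    chrom = np.empty(m, dtype=int)
    for label, block in enumerate(blocks, start=1):
        chrom[block] = label
    return chrom


# With fewer variants than chromosomes, each variant gets its own chromosome.
print(assign_chromosomes(10))
# With more variants, block sizes differ by at most one.
print(np.bincount(assign_chromosomes(50))[1:])
```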

Of course, we can always directly provide haplotypes to the `xft.struct.HaplotypeArray` constructor:

In [3]:
import numpy as np

arbitrary_haplotypes = np.random.binomial(1, .3, (100, 200))
xft.struct.HaplotypeArray(arbitrary_haplotypes)



## Haplotypes from external datasets

We currently support the [plink binary (bfile)](https://www.cog-genomics.org/plink/1.9/formats#bed) and variant call (VCF) formats by interfacing with the [pandas-plink](https://github.com/limix/pandas-plink) and [sgkit](https://pystatgen.github.io/sgkit/latest/) libraries, respectively. We hope to add support for `bgen` and `plink2 (pfile)` formats in the future.


### plink bfiles

:::{danger}

The plink bfile format is inherently diploid and does not preserve phase. Thus, when reading in plink data, haplotypes at heterozygous loci are assigned randomly.

:::
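
To illustrate what random assignment at heterozygous loci means, here is a minimal NumPy sketch that splits unphased genotype counts into two haploid arrays. The helper `random_phase` is hypothetical, it ignores missing genotypes, and it is not necessarily how `xftsim` performs this step internally.

```python
import numpy as np


def random_phase(genotypes, seed=None):
    """Hypothetical helper: split an (n, m) array of 0/1/2 alternate-allele
    counts into two (n, m) haploid arrays, phasing heterozygotes at random."""
    rng = np.random.default_rng(seed)
    g = np.asarray(genotypes)
    # Homozygotes are unambiguous: 0 -> (0, 0) and 2 -> (1, 1).
    first = (g == 2).astype(np.int8)
    second = first.copy()
    # For heterozygotes, a fair coin decides which copy carries the allele.
    het = g == 1
    coin = rng.integers(0, 2, size=g.shape).astype(bool)
    first[het & coin] = 1
    second[het & ~coin] = 1
    return first, second


g = np.array([[0, 1, 2],
              [1, 1, 0]])
h0, h1 = random_phase(g, seed=1)
print((h0 + h1 == g).all())  # True
```

The genotype counts are preserved, but which haploid copy carries each heterozygous allele is arbitrary, so the linkage phase between heterozygous sites in the result carries no information.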


For this example, we'll use the example data that's packaged with the `pandas-plink` library:

In [4]:
from pandas_plink import get_data_folder
from os.path import join

pdat = founders.founder_haplotypes_from_plink_bfile(join(get_data_folder(), "chr*.bed"))
pdat

Multiple files read in this order: ['chr11', 'chr12']


| | Array | Chunk |
|---|---|---|
| Bytes | 34.23 kiB | 21.30 kiB |
| Shape | (14, 2504) | (14, 1558) |
| Dask graph | 2 chunks in 11 graph layers | 2 chunks in 11 graph layers |
| Data type | int8 numpy.ndarray | int8 numpy.ndarray |


Because `pandas-plink` returns a lazy dask array, we need to call the `.compute()` method to load it into memory explicitly:

In [5]:
pdat.compute()

We can write haplotype arrays out to plink bfiles using `xft.io.write_to_plink1()`:


In [6]:
xft.io.write_to_plink1(pdat, '/tmp/plink')

Writing FAM... done.
Writing BIM... done.





### VCF / sgkit data

Support for VCF files is provided via compatibility with the `sgkit` library. We recommend the following procedure:

 1. Convert the VCF file (or a subset thereof) to a [zarr](https://zarr.readthedocs.io/en/stable/) store on disk using `sgkit.io.vcf.vcf_to_zarr()`
 2. Lazily load the sgkit data using `sgkit.load_dataset()`
 3. Convert the sgkit data to an `xftsim`-compatible DataArray using `xft.founders.founder_haplotypes_from_sgkit_dataset()`
 4. Save the founder data to zarr for future use using `xft.io.save_haplotype_zarr()`
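
The steps above can be sketched as a single function. The wrapper name `vcf_to_founder_zarr` and its three path parameters are hypothetical, and the argument order passed to `xft.io.save_haplotype_zarr()` is an assumption, so check the signatures in your installed versions before relying on it. Reading VCF files with `sgkit` may also require its optional VCF dependencies.

```python
def vcf_to_founder_zarr(vcf_path, sgkit_store, founder_store):
    """Hypothetical wrapper around the four recommended steps."""
    # Imports are deferred so the function can be defined even when the
    # optional dependencies are not installed.
    import sgkit
    from sgkit.io.vcf import vcf_to_zarr
    import xftsim as xft

    # 1. Convert the VCF file to a zarr store on disk.
    vcf_to_zarr(vcf_path, sgkit_store)
    # 2. Lazily load the sgkit dataset from that store.
    ds = sgkit.load_dataset(sgkit_store)
    # 3. Convert the dataset to an xftsim-compatible DataArray.
    haplotypes = xft.founders.founder_haplotypes_from_sgkit_dataset(ds)
    # 4. Save the founder data for future use
    #    (assumed argument order: array first, then path).
    xft.io.save_haplotype_zarr(haplotypes, founder_store)
    return haplotypes
```

Keeping the intermediate sgkit store and the final founder store as separate paths means the expensive VCF conversion only has to be run once.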
 

## Haplotypes from zarr stores

Zarr stores are the preferred file format for saving and loading haplotype array data in `xftsim`. Haplotype data can be read from zarr using `xft.io.load_haplotype_zarr()` and written using `xft.io.save_haplotype_zarr()`.