# Data structures


In [22]:
import numpy as np
import pandas as pd
import xftsim as xft
from xftsim import index, struct

xft.config.print_durations_threshold=10. ## reduce verbosity
np.random.seed(123) ## set random seed for reproducibility

Here we introduce `HaplotypeArray` and `PhenotypeArray` objects, the two primary data objects `xftsim` operates on. These objects are indexed as follows:

| Object | Row Index | Column Index |
| --- | --- | --- |
|`struct.HaplotypeArray`|`index.SampleIndex`|`index.HaploidVariantIndex`|
|`struct.PhenotypeArray`|`index.SampleIndex`|`index.ComponentIndex`|


:::{warning}

It will be challenging to understand this information if you haven't read [the tutorial on indexing](./indexing.ipynb)!

:::

Counterintuitively, neither `HaplotypeArray` nor `PhenotypeArray` is an actual class. Rather, both construct instances of `xarray.DataArray` with an extended API available through the `xft` accessor. If you're confused, don't worry: we'll go through all of this step by step.

:::{tip}

The [xarray documentation](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html) can be very helpful if you haven't used `xarray` before!

:::
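
The accessor pattern mentioned above is a general `xarray` extension mechanism rather than something specific to `xftsim`. As a minimal sketch, the snippet below registers a toy accessor on `xarray.DataArray`. The accessor name `demo`, the class `DemoAccessor`, and its members are hypothetical and chosen only for illustration; they say nothing about how the `xft` accessor is implemented internally.

```python
import numpy as np
import xarray as xr

# Hypothetical accessor registered under the name "demo" (illustration only).
@xr.register_dataarray_accessor("demo")
class DemoAccessor:
    def __init__(self, data_array):
        self._obj = data_array  # the DataArray this accessor is attached to

    @property
    def n_rows(self):
        """Length of the first dimension."""
        return self._obj.shape[0]

    def column_means(self):
        """Mean over the first dimension, i.e. one value per column."""
        return self._obj.mean(dim=self._obj.dims[0])

# The object itself is still an ordinary DataArray...
arr = xr.DataArray(np.arange(6.0).reshape(3, 2), dims=("sample", "component"))
print(type(arr).__name__)

# ...but the extra methods are reachable through the accessor namespace.
print(arr.demo.n_rows)
print(arr.demo.column_means().values)
```

This shows why an object can be a plain `DataArray` and still expose library-specific methods: the extra API lives in a namespace attached to the array, not in a subclass.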

First, we'll run a small simulation (the details of which aren't important for now) so that we have some arrays to work with:


In [None]:
founder_haplotypes = xft.founders.founder_haplotypes_uniform_AFs(n=800, m=100)
architecture = xft.arch.GCTA_Architecture(h2=[.5,.5], phenotype_name=['height', 'BMD'], haplotypes=founder_haplotypes)
recombination_map = xft.reproduce.RecombinationMap.constant_map_from_haplotypes(founder_haplotypes, p =.1)
mating_regime = xft.mate.LinearAssortativeMatingRegime(r = .5, offspring_per_pair=2,
                                                       component_index = xft.index.ComponentIndex.from_product(['height', 'BMD'], ['phenotype']))
sim = xft.sim.Simulation(founder_haplotypes=founder_haplotypes,
                         mating_regime=mating_regime,
                         recombination_map=recombination_map,
                         architecture=architecture)
sim.run(2)

haplo = sim.haplotypes
pheno = sim.phenotypes

## Phenotype arrays

A phenotype array's rows correspond to individuals and its columns correspond to phenotypic components:

In [62]:
pheno

### Retrieving index objects

We can extract component and sample index objects as follows:

In [63]:
pheno.xft.get_sample_indexer() ## same as .get_row_indexer()
pheno.xft.get_component_indexer() ## same as .get_column_indexer()


<ComponentIndex>
  3 components of 2 phenotypes spanning 1 generation
                               phenotype_name   component_name  \
component                                                        
height.additiveGenetic.proband         height  additiveGenetic   
BMD.additiveGenetic.proband               BMD  additiveGenetic   
height.additiveNoise.proband           height    additiveNoise   
BMD.additiveNoise.proband                 BMD    additiveNoise   
height.phenotype.proband               height        phenotype   
BMD.phenotype.proband                     BMD        phenotype   

                                vorigin_relative  
component                                         
height.additiveGenetic.proband                -1  
BMD.additiveGenetic.proband                   -1  
height.additiveNoise.proband                  -1  
BMD.additiveNoise.proband                     -1  
height.phenotype.proband                      -1  
BMD.phenotype.proband                       

The six columns of `pheno` are those indexed above. As a check, we confirm that the fifth column (height phenotype) equals the sum of the first and third columns (the height genetic and height noise components, respectively):

In [74]:
np.all(pheno[:,[0,2]].sum(axis=1) == pheno[:,4])
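
The same kind of check can be sketched on a toy NumPy array. The array `toy` and its three-column layout are made up for illustration and stand in for a single phenotype's genetic, noise, and total components. Exact equality (`==`) on floating-point sums can be sensitive to the order of operations, so a tolerance-based comparison such as `np.allclose` is often the safer choice when components are computed in different ways.

```python
import numpy as np

rng = np.random.default_rng(0)
genetic = rng.normal(size=5)
noise = rng.normal(size=5)

# Toy stand-in with columns [genetic, noise, phenotype]
toy = np.column_stack([genetic, noise, genetic + noise])

# Exact comparison, analogous to the check above
exact = np.all(toy[:, [0, 1]].sum(axis=1) == toy[:, 2])

# Tolerance-based comparison, robust to floating-point rounding
close = np.allclose(toy[:, [0, 1]].sum(axis=1), toy[:, 2])
print(exact, close)
```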

### Subsetting phenotype arrays

In addition to the standard `xarray.DataArray` indexing shown above, we can subset according to `phenotype_name`, `component_name`, and/or `vorigin_relative` by supplying a `dict` to `xft.__getitem__`:

In [86]:
pheno.xft[{'phenotype_name':'height'}]

In [88]:
pheno.xft[{'component_name':'phenotype'}]

In [90]:
pheno.xft[{'component_name':['phenotype','additiveNoise']}]

We can also use this method to alter the values of the underlying `DataArray`:

In [None]:
pheno.xft[{'component_name':['phenotype','additiveNoise']}] *= 2
pheno.xft[{'component_name':['phenotype','additiveNoise']}]
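
To make the idea of metadata-based selection concrete, here is a minimal sketch using only NumPy and pandas. It is not how `xftsim` implements `xft.__getitem__`. The helper `column_mask` is hypothetical, and combining multiple keys with a logical AND is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd

# Toy component metadata mirroring the six columns listed earlier
meta = pd.DataFrame({
    "phenotype_name": ["height", "BMD"] * 3,
    "component_name": ["additiveGenetic"] * 2
                      + ["additiveNoise"] * 2
                      + ["phenotype"] * 2,
})
values = np.arange(12.0).reshape(2, 6)  # 2 individuals x 6 components

def column_mask(meta, query):
    """Boolean mask of columns matching every key in `query` (hypothetical helper)."""
    mask = np.ones(len(meta), dtype=bool)
    for key, wanted in query.items():
        # Accept either a single string or a list of strings
        wanted = [wanted] if isinstance(wanted, str) else list(wanted)
        mask &= meta[key].isin(wanted).to_numpy()
    return mask

# Select all height components
height = column_mask(meta, {"phenotype_name": "height"})
print(values[:, height])

# Select, then modify in place, the noise and phenotype components
both = column_mask(meta, {"component_name": ["phenotype", "additiveNoise"]})
values[:, both] *= 2
print(values)
```

The point of the sketch is that the dictionary keys refer to columns of the index metadata, and the selection reduces to a boolean mask over the array's columns.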

## Haplotype arrays

For $n$ individuals and $m$ diploid loci, a haplotype array is an $n\times 2m$ `xarray.DataArray` of 8-bit integers with rows corresponding to individuals, even columns (indexed starting at zero) corresponding to maternal haplotypes, and odd columns corresponding to paternal haplotypes:

In [10]:
haplo
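
The interleaved column layout described above can be sketched with a plain NumPy array. The array `hap` below is random toy data, not simulation output, and the variable names are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3  # individuals, diploid loci

# n x 2m array of 0/1 alleles stored as 8-bit integers
hap = rng.integers(0, 2, size=(n, 2 * m), dtype=np.int8)

maternal = hap[:, 0::2]  # even columns: 0, 2, 4, ...
paternal = hap[:, 1::2]  # odd columns:  1, 3, 5, ...

# Diploid genotypes (allele counts 0, 1, or 2) are the sum of the two copies
genotypes = maternal + paternal
print(hap.shape, genotypes.shape)
```

Strided slicing (`0::2`, `1::2`) returns views rather than copies, so splitting an array stored this way into its maternal and paternal halves is cheap.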

There are several custom methods available through the `xft` accessor, including:

### Retrieving index objects

We can extract variant and sample index objects as follows:

In [12]:
haplo.xft.get_sample_indexer()  ## same as haplo.xft.get_row_indexer()

<SampleIndex>
  Generation 0
  800 indviduals from 400 families
  387 biological females
  413 biological males
                  iid    fid  sex
sample                           
0..1_0.1_0        1_0    1_0    0
            ...
0..1_799.1_399  1_799  1_399    1

[800 rows x 3 columns]

In [14]:
haplo.xft.get_variant_indexer()  ## same as haplo.xft.get_column_indexer()

<HaploidVariantIndex>
  100 diploid variants on 20 chromosome(s)
  MAF ranges from nan to nan
  0 annotation(s) 
        vid  chrom zero_allele one_allele        af hcopy  pos_bp  pos_cM
variant                                                                  
0.0       0      0           A          G  0.641788     0     NaN     NaN
0.1       0      0           A          G  0.641788     1     NaN     NaN
1.0       1      0           A          G  0.335031     0     NaN     NaN
1.1       1      0           A          G  0.335031     1     NaN     NaN
2.0       2      0           A          G  0.527685     0     NaN     NaN
...      ..    ...         ...        ...       ...   ...     ...     ...
97.1     97     19           A          G  0.276791     1     NaN     NaN
98.0     98     19           A          G  0.701539     0     NaN     NaN
98.1     98     19           A          G  0.701539     1     NaN     NaN
99.0     99     19           A          G  0.500467     0     NaN     NaN

### Computing empirical allele frequencies

Empirical diploid allele frequencies are available via the `af_empirical` property. Here we compare ancestral and empirical allele frequencies:

In [24]:
pd.DataFrame.from_dict(dict(ancestral = haplo.xft.get_variant_indexer().to_diploid().af,
                            empirical = haplo.xft.af_empirical)).T

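| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ancestral | 0.641788 | 0.335031 | 0.527685 | 0.452337 | 0.847903 | 0.33893 | 0.128769 | 0.396704 | 0.241113 | 0.258492 | ... | 0.109714 | 0.394861 | 0.845744 | 0.323193 | 0.477052 | 0.101487 | 0.247678 | 0.276791 | 0.701539 | 0.500467 |
| empirical | 0.66125 | 0.321875 | 0.524375 | 0.484375 | 0.850625 | 0.3625 | 0.143125 | 0.405 | 0.254375 | 0.254375 | ... | 0.098125 | 0.383125 | 0.830625 | 0.336875 | 0.440625 | 0.0975 | 0.263125 | 0.291875 | 0.69875 | 0.495 |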
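
For intuition, an empirical diploid allele frequency can be computed by hand from an interleaved haplotype matrix. The following is a minimal NumPy sketch on a tiny made-up array. It assumes the frequency refers to the allele coded as `1` and is not a reproduction of the `af_empirical` implementation.

```python
import numpy as np

# 3 individuals x 2 diploid loci; columns are interleaved as
# (locus 0 copy 0, locus 0 copy 1, locus 1 copy 0, locus 1 copy 1)
hap = np.array([[1, 0, 1, 1],
                [0, 0, 1, 0],
                [1, 1, 0, 1]], dtype=np.int8)

# Average the two haploid copies of each locus across all individuals
af = (hap[:, 0::2].mean(axis=0) + hap[:, 1::2].mean(axis=0)) / 2
print(af)
```

With $n$ individuals, each frequency is a count out of $2n$ haplotype copies, which is why empirical values in a finite sample differ slightly from the ancestral frequencies used to generate the founders.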


We can use the `use_empirical_afs()` method to replace ancestral allele frequencies with empirical allele frequencies. This is useful when ancestral allele frequencies are unknown, as is often the case with real data.

In [25]:
haplo.xft.use_empirical_afs()
## now ancestral == empirical
pd.DataFrame.from_dict(dict(ancestral = haplo.xft.get_variant_indexer().to_diploid().af,
                            empirical = haplo.xft.af_empirical)).T

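| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ancestral | 0.66125 | 0.321875 | 0.524375 | 0.484375 | 0.850625 | 0.3625 | 0.143125 | 0.405 | 0.254375 | 0.254375 | ... | 0.098125 | 0.383125 | 0.830625 | 0.336875 | 0.440625 | 0.0975 | 0.263125 | 0.291875 | 0.69875 | 0.495 |
| empirical | 0.66125 | 0.321875 | 0.524375 | 0.484375 | 0.850625 | 0.3625 | 0.143125 | 0.405 | 0.254375 | 0.254375 | ... | 0.098125 | 0.383125 | 0.830625 | 0.336875 | 0.440625 | 0.0975 | 0.263125 | 0.291875 | 0.69875 | 0.495 |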


### Interpolating genetic distances

Given physical positions and a genetic map, centiMorgan distances can be interpolated using the `interpolate_cM()` method.
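
The underlying idea, linear interpolation of genetic positions along a map, can be sketched with plain NumPy. This does not call `interpolate_cM()`, whose arguments are not shown here. The map values and variant positions below are hypothetical, and linear interpolation between map points is an assumption of this sketch.

```python
import numpy as np

# Hypothetical genetic map: physical positions (bp) and genetic positions (cM)
map_bp = np.array([0, 1_000_000, 2_000_000, 4_000_000])
map_cM = np.array([0.0, 1.0, 3.0, 4.0])

# Physical positions of the variants we want genetic positions for
variant_bp = np.array([500_000, 1_500_000, 3_000_000])

# Piecewise-linear interpolation between neighbouring map points
variant_cM = np.interp(variant_bp, map_bp, map_cM)
print(variant_cM)
```

Note that `np.interp` expects the map's physical positions to be sorted in increasing order and clamps queries outside the map to the end values.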

:::{danger}

Example coming soon!

:::

