# Data structures


In [1]:
import numpy as np
import pandas as pd
import xftsim as xft
from xftsim import index, struct

xft.config.print_durations_threshold=10. ## reduce verbosity
np.random.seed(123) ## set random seed for reproducibility

Here we introduce `HaplotypeArray` and `PhenotypeArray` objects, the two primary data objects `xftsim` operates on. These objects are indexed as follows:

| Object | Row Index | Column Index |
| --- | --- | --- |
|`struct.HaplotypeArray`|`index.SampleIndex`|`index.HaploidVariantIndex`|
|`struct.PhenotypeArray`|`index.SampleIndex`|`index.ComponentIndex`|


:::{warning}

It will be challenging to understand this information if you haven't read [the tutorial on indexing](./indexing.ipynb)!

:::

Counterintuitively, neither `HaplotypeArray` nor `PhenotypeArray` is an actual class. Rather, both construct instances of `xarray.DataArray` with an extended API available through the `xft` accessor. If you're confused, don't worry: we'll go through all of this step by step.

:::{tip}

The [xarray documentation](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html) can be very helpful if you haven't used `xarray` before!

:::
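
To make the accessor idea concrete, here is a minimal sketch of how `xarray` lets a library attach extra methods to an ordinary `DataArray`. The accessor name `demo` and its two members are hypothetical and chosen only for illustration; they are not part of `xftsim`, which registers its own functionality under `xft`.

```python
import numpy as np
import xarray as xr


# Hypothetical accessor registered under the name "demo" (xftsim uses "xft")
@xr.register_dataarray_accessor("demo")
class DemoAccessor:
    def __init__(self, xarray_obj):
        # xarray passes in the DataArray the accessor is attached to
        self._obj = xarray_obj

    @property
    def n_rows(self):
        # Length of the first dimension
        return self._obj.shape[0]

    def column_means(self):
        # Mean over the first dimension, giving one value per column
        return self._obj.mean(dim=self._obj.dims[0])


arr = xr.DataArray(np.arange(6.0).reshape(3, 2), dims=("sample", "component"))
n_rows = arr.demo.n_rows
col_means = arr.demo.column_means().values
```

The object `arr` is still a plain `xarray.DataArray`; the extra API only appears under the registered accessor name.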

First, we'll run a small simulation (the details of which aren't important for now) so that we have some arrays to work with:


In [2]:
founder_haplotypes = xft.founders.founder_haplotypes_uniform_AFs(n=800, m=100)
architecture = xft.arch.GCTA_Architecture(h2=[.5,.5], phenotype_name=['height', 'BMD'], haplotypes=founder_haplotypes)
recombination_map = xft.reproduce.RecombinationMap.constant_map_from_haplotypes(founder_haplotypes, p =.1)
mating_regime = xft.mate.LinearAssortativeMatingRegime(r = .5, offspring_per_pair=2,
                                                       component_index = xft.index.ComponentIndex.from_product(['height', 'BMD'], ['phenotype']))
sim = xft.sim.Simulation(founder_haplotypes=founder_haplotypes,
                         mating_regime=mating_regime,
                         recombination_map=recombination_map,
                         architecture=architecture)
sim.run(2)

haplo = sim.haplotypes
pheno = sim.phenotypes



## Phenotype arrays

A phenotype array's rows correspond to individuals and its columns correspond to phenotypic components:

In [3]:
pheno

### Retrieving index objects

We can extract component and sample index objects as follows:

In [4]:
pheno.xft.get_sample_indexer() ## same as .get_row_indexer()
pheno.xft.get_component_indexer() ## same as .get_column_indexer()


<ComponentIndex>
  3 components of 2 phenotypes spanning 1 generation
                               phenotype_name   component_name  \
component                                                        
height.additiveGenetic.proband         height  additiveGenetic   
BMD.additiveGenetic.proband               BMD  additiveGenetic   
height.additiveNoise.proband           height    additiveNoise   
BMD.additiveNoise.proband                 BMD    additiveNoise   
height.phenotype.proband               height        phenotype   
BMD.phenotype.proband                     BMD        phenotype   

                                vorigin_relative     comp_type  
component                                                       
height.additiveGenetic.proband                -1  intermediate  
BMD.additiveGenetic.proband                   -1  intermediate  
height.additiveNoise.proband                  -1  intermediate  
BMD.additiveNoise.proband                     -1  intermediate  
height.phen

The six columns of `pheno` are those indexed above. To confirm, we check that the fifth column (height phenotype) equals the sum of the first and third columns (the height genetic and height noise components, respectively):

In [5]:
np.all(pheno[:,[0,2]].sum(axis=1) == pheno[:,4])
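
Exact equality between floating-point sums can be sensitive to rounding, depending on how the columns were computed, so a tolerance-based comparison is a common alternative. The sketch below uses plain NumPy arrays with made-up values as stand-ins for the genetic, noise, and phenotype columns; it does not use simulation output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for the genetic and noise columns (not xftsim output)
genetic = rng.normal(size=5)
noise = rng.normal(size=5)
phenotype = genetic + noise

# Stack into an (n, 3) array with columns: genetic, noise, phenotype
values = np.column_stack([genetic, noise, phenotype])

# Compare the summed components to the phenotype within a small tolerance
components_match = np.allclose(values[:, [0, 1]].sum(axis=1), values[:, 2])
```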

### Subsetting phenotype arrays

In addition to the standard `xarray.DataArray` indexing shown above, we can subset according to `phenotype_name`, `component_name`, and/or `vorigin_relative` by supplying a `dict` to `xft.__getitem__`:

In [6]:
pheno.xft[{'phenotype_name':'height'}]

In [7]:
pheno.xft[{'component_name':'phenotype'}]

In [8]:
pheno.xft[{'component_name':['phenotype','additiveNoise']}]

We can also use this method to alter values of the underlying `DataArray`:

In [9]:
pheno.xft[{'component_name':['phenotype','additiveNoise']}] *= 2
pheno.xft[{'component_name':['phenotype','additiveNoise']}]

### Conversion to Pandas

Finally, we can convert from a phenotype DataArray to a multiindexed Pandas DataFrame using the `DataArray.xft.as_pd()` method:

In [10]:
pheno.xft.as_pd()

Columns are labeled `phenotype_name.component_name.vorigin_relative`; rows are indexed by `iid`, `fid`, and `sex`.

| iid | fid | sex | height.additiveGenetic.proband | BMD.additiveGenetic.proband | height.additiveNoise.proband | BMD.additiveNoise.proband | height.phenotype.proband | BMD.phenotype.proband |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1_0 | 1_0 | 0 | -0.172140 | -2.900228 | 0.586642 | 0.260576 | 0.242362 | -5.539879 |
| 1_1 | 1_0 | 0 | -0.220377 | -2.694321 | -2.370100 | -2.470488 | -2.810854 | -7.859131 |
| 1_2 | 1_1 | 1 | -1.749179 | 0.770812 | -0.494897 | 0.059959 | -3.993255 | 1.601582 |
| 1_3 | 1_1 | 1 | -0.452184 | -0.654971 | -0.873004 | -1.964959 | -1.777373 | -3.274901 |
| 1_4 | 1_2 | 0 | -1.181860 | -1.471807 | 0.849380 | 0.655862 | -1.514341 | -2.287753 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1_795 | 1_397 | 0 | 0.890462 | 0.752347 | -0.316503 | 1.258179 | 1.464421 | 2.762873 |
| 1_796 | 1_398 | 0 | 0.435559 | 0.841148 | -0.780206 | -0.057162 | 0.090912 | 1.625135 |
| 1_797 | 1_398 | 1 | 0.576414 | 1.731700 | -1.368366 | -0.218389 | -0.215537 | 3.245011 |
| 1_798 | 1_399 | 1 | 1.045583 | 0.784169 | 1.202751 | -0.892162 | 3.293918 | 0.676176 |
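
Once the data are in pandas form, the column levels (`phenotype_name`, `component_name`, `vorigin_relative`) can be used for selection with ordinary pandas tools. The sketch below builds a small stand-in frame with made-up values and the same three column levels, rather than calling `as_pd()`, so it runs without a simulation; the row index is left as a default integer index for brevity.

```python
import numpy as np
import pandas as pd

# Column MultiIndex with the same three levels and column order as above
columns = pd.MultiIndex.from_product(
    [["additiveGenetic", "additiveNoise", "phenotype"], ["height", "BMD"], ["proband"]],
    names=["component_name", "phenotype_name", "vorigin_relative"],
).reorder_levels(["phenotype_name", "component_name", "vorigin_relative"])

# Made-up values for three individuals (not simulation output)
df = pd.DataFrame(np.arange(18.0).reshape(3, 6), columns=columns)

# Select every column belonging to one phenotype, keeping all levels
height = df.xs("height", axis=1, level="phenotype_name", drop_level=False)

# Select several components at once with a boolean mask on one level
mask = df.columns.get_level_values("component_name").isin(["phenotype", "additiveNoise"])
subset = df.loc[:, mask]
```

Here `height` has three columns (one per component) and `subset` has four (two components for each of two phenotypes).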


## Haplotype arrays

For $n$ individuals and $m$ diploid loci, a haplotype array is an $n\times 2m$ `xr.DataArray` of 8-bit integers with rows corresponding to individuals, even columns (indexed starting at zero) corresponding to maternal haplotypes, and odd columns corresponding to paternal haplotypes:

In [15]:
haplo
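
The interleaved column layout can be illustrated with a toy NumPy array. This sketch does not use `xftsim`; the array is randomly generated, and the variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3  # 4 individuals, 3 diploid loci (toy sizes)

# n x 2m array of 8-bit integers; each entry is 0 or 1
haplotypes = rng.integers(0, 2, size=(n, 2 * m), dtype=np.int8)

maternal = haplotypes[:, 0::2]  # even columns: 0, 2, 4
paternal = haplotypes[:, 1::2]  # odd columns: 1, 3, 5

# Diploid genotype: allele count per individual and locus, in {0, 1, 2}
genotypes = maternal + paternal
```

Slicing with a step of two separates the maternal and paternal copies, and their sum gives the usual 0/1/2 genotype coding.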

### Retrieving index objects

We can extract variant and sample index objects as follows:

In [16]:
haplo.xft.get_sample_indexer()  ## same as haplo.xft.get_row_indexer()

<SampleIndex>
  Generation 0
  800 indviduals from 400 families
  399 biological females
  401 biological males
                  iid    fid  sex
sample                           
0..1_0.1_0        1_0    1_0    0
            ...
0..1_799.1_399  1_799  1_399    1

[800 rows x 3 columns]

In [17]:
haplo.xft.get_variant_indexer()  ## same as haplo.xft.get_column_indexer()

<HaploidVariantIndex>
  100 diploid variants on 20 chromosome(s)
  MAF ranges from nan to nan
  0 annotation(s) 
        vid  chrom zero_allele one_allele        af hcopy  pos_bp  pos_cM
variant                                                                  
0.0       0      0           A          G  0.657175     0     NaN     NaN
0.1       0      0           A          G  0.657175     1     NaN     NaN
1.0       1      0           A          G  0.328911     0     NaN     NaN
1.1       1      0           A          G  0.328911     1     NaN     NaN
2.0       2      0           A          G  0.281481     0     NaN     NaN
...      ..    ...         ...        ...       ...   ...     ...     ...
97.1     97     19           A          G  0.419101     1     NaN     NaN
98.0     98     19           A          G  0.292685     0     NaN     NaN
98.1     98     19           A          G  0.292685     1     NaN     NaN
99.0     99     19           A          G  0.374765     0     NaN     NaN

### Computing empirical allele frequencies

Empirical diploid allele frequencies are available via the `af_empirical` property. Here we compare ancestral and empirical allele frequencies:

In [18]:
pd.DataFrame.from_dict(dict(ancestral = haplo.xft.get_variant_indexer().to_diploid().af,
                            empirical = haplo.xft.af_empirical)).T

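| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ancestral | 0.657175 | 0.328911 | 0.281481 | 0.541052 | 0.675575 | 0.438485 | 0.884611 | 0.647864 | 0.484746 | 0.413694 | ... | 0.663967 | 0.896287 | 0.384732 | 0.710038 | 0.574542 | 0.653361 | 0.220902 | 0.419101 | 0.292685 | 0.374765 |
| empirical | 0.6425 | 0.33 | 0.265 | 0.51125 | 0.67875 | 0.450625 | 0.913125 | 0.6475 | 0.498125 | 0.435625 | ... | 0.654375 | 0.898125 | 0.388125 | 0.73375 | 0.573125 | 0.66375 | 0.23375 | 0.415625 | 0.325625 | 0.375 |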
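
One natural definition of an empirical allele frequency is the mean allele value across all $2n$ haplotype copies at a locus. Whether `af_empirical` is computed in exactly this way is an assumption on our part; the sketch below only illustrates the idea with a randomly generated NumPy array and made-up ancestral frequencies.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 1000, 4
ancestral_af = np.array([0.1, 0.3, 0.5, 0.9])  # made-up frequencies

# Each locus occupies two adjacent columns (maternal, paternal),
# so each frequency is repeated twice when drawing alleles
haplotypes = (rng.random((n, 2 * m)) < np.repeat(ancestral_af, 2)).astype(np.int8)

# Group the two copies of each locus, then average over individuals and copies
empirical_af = haplotypes.reshape(n, m, 2).mean(axis=(0, 2))
```

With a finite sample, the empirical values scatter around the ancestral ones, much like the two rows compared above.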


We can use the `use_empirical_afs()` method to replace ancestral allele frequencies with empirical allele frequencies. This is useful when ancestral allele frequencies are unknown, as is often the case with real data.

In [19]:
haplo.xft.use_empirical_afs()
## now ancestral == empirical
pd.DataFrame.from_dict(dict(ancestral = haplo.xft.get_variant_indexer().to_diploid().af,
                            empirical = haplo.xft.af_empirical)).T

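| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ancestral | 0.6425 | 0.33 | 0.265 | 0.51125 | 0.67875 | 0.450625 | 0.913125 | 0.6475 | 0.498125 | 0.435625 | ... | 0.654375 | 0.898125 | 0.388125 | 0.73375 | 0.573125 | 0.66375 | 0.23375 | 0.415625 | 0.325625 | 0.375 |
| empirical | 0.6425 | 0.33 | 0.265 | 0.51125 | 0.67875 | 0.450625 | 0.913125 | 0.6475 | 0.498125 | 0.435625 | ... | 0.654375 | 0.898125 | 0.388125 | 0.73375 | 0.573125 | 0.66375 | 0.23375 | 0.415625 | 0.325625 | 0.375 |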


### Interpolating genetic distances

Given physical positions and a genetic map, centiMorgan distances can be interpolated using the `interpolate_cM()` method.
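
As a rough illustration of the general idea only, the sketch below linearly interpolates genetic positions from a small, hypothetical genetic map using `np.interp`. It does not call `interpolate_cM()`, and the interpolation scheme that method uses is not stated here, so linear interpolation is an assumption.

```python
import numpy as np

# Hypothetical genetic map: physical position (bp) -> genetic position (cM)
map_pos_bp = np.array([0, 1_000_000, 2_000_000, 4_000_000])
map_pos_cM = np.array([0.0, 1.5, 2.0, 5.0])

# Physical positions of the variants to be placed on the map
variant_pos_bp = np.array([500_000, 1_500_000, 3_000_000])

# Piecewise-linear interpolation between neighboring map points
variant_pos_cM = np.interp(variant_pos_bp, map_pos_bp, map_pos_cM)
# variant_pos_cM is [0.75, 1.75, 3.5]
```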

:::{danger}

Example coming soon!

:::

