# Statistics

## Overview

`xftsim` provides a number of built-in procedures for computing sample statistics and running common estimation procedures (e.g., variance-component estimation and association studies). During each generation of a simulation, every procedure supplied in the `statistics` attribute of a `sim.Simulation` object is run. You can also configure `xftsim` to write genotypes and phenotypes to disk each generation and then analyze them with any external software, as detailed in [the post-processing section](./proc.py).

As a starting point, we'll initialize a small simulation with an empty `statistics` list:

In [1]:
import xftsim as xft
from xftsim import stats
xft.config.print_durations_threshold = 10
n=2000; m=800

founder_haplotypes = xft.founders.founder_haplotypes_uniform_AFs(n=n, m=m)
architecture = xft.arch.GCTA_Architecture(h2=[.5,.4], phenotype_name=['height', 'BMD'], 
                                          haplotypes=founder_haplotypes)
recombination_map = xft.reproduce.RecombinationMap.constant_map_from_haplotypes(founder_haplotypes, 
                                                                                p =.1)
mating_regime = xft.mate.RandomMatingRegime(mates_per_female=1,
                                            offspring_per_pair=2)

sim = xft.sim.Simulation(founder_haplotypes=founder_haplotypes,
                         architecture=architecture,
                         recombination_map=recombination_map,
                         mating_regime=mating_regime,
                         statistics = [])



We didn't specify any statistics, so the simulation will just proceed through phenotype construction and mating each generation without saving any additional results.

## Accessing results

After adding a `Statistic`, we'll begin to accumulate results in the `results_store` attribute as we iterate. The `results_store` is just a nested dict that we can print nicely using `xft.utils.print_tree()`:

In [2]:
sim.statistics = [stats.SampleStatistics(means=False, 
                                         variances=False, 
                                         variance_components = False)]
sim.run(2)
xft.utils.print_tree(sim.results_store)

0: 
|__sample_statistics: 
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>
1: 
|__sample_statistics: 
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>


The `results_store` is a dict of dicts of dicts indexed by generation, then statistic name, then statistic component. For convenience, `sim.results` returns the most recent generation only and is equivalent to `sim.results_store[sim.generation]`:

In [3]:
sim.results

{'sample_statistics': {'vcov': phenotype_name                                           height  \
  component_name                                  additiveGenetic   
  vorigin_relative                                        proband   
  phenotype_name component_name  vorigin_relative                   
  height         additiveGenetic proband                 0.490407   
  BMD            additiveGenetic proband                -0.014808   
  height         additiveNoise   proband                -0.004141   
  BMD            additiveNoise   proband                 0.000860   
  height         phenotype       proband                 0.486266   
  BMD            phenotype       proband                -0.013948   
  
  phenotype_name                                              BMD        height  \
  component_name                                  additiveGenetic additiveNoise   
  vorigin_relative                                        proband       proband   
  phenotype_name component_na
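
The nesting described above can be mimicked with plain dictionaries. The following is a minimal sketch, not `xftsim`'s implementation: `toy_statistic` and its values are hypothetical stand-ins, and only the generation, statistic name, component layout is taken from the output shown earlier.

```python
# Toy stand-in for a results store: generation -> statistic name -> component.
# Nothing here calls xftsim; the statistic and its values are made up.

def toy_statistic(generation):
    """Hypothetical statistic returning named components for one generation."""
    return {"vcov": [[1.0, 0.0], [0.0, 1.0]], "generation_seen": generation}

results_store = {}
for generation in range(2):
    # One entry per generation, keyed by the statistic's name.
    results_store[generation] = {"sample_statistics": toy_statistic(generation)}

# Drill down by generation, then statistic name, then component.
vcov_gen0 = results_store[0]["sample_statistics"]["vcov"]

# Results for the most recent generation, analogous to what `sim.results` returns.
latest_generation = max(results_store)
latest = results_store[latest_generation]

print(sorted(results_store))  # generations recorded
print(sorted(latest))         # statistic names in the latest generation
```

Indexing by generation first makes it cheap to pull out everything computed at one time point, at the cost of a small loop when you want one component across all generations.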

## Skipping generations

We don't always want to run every statistic every generation. We can pick and choose by altering the `statistics` attribute of a simulation object on the fly. Here we compute sample statistics only for generations zero and four:


In [4]:
sim = xft.sim.Simulation(founder_haplotypes=founder_haplotypes,
                         architecture=architecture,
                         recombination_map=recombination_map,
                         mating_regime=mating_regime,
                         statistics = [stats.SampleStatistics()])
sim.run(1)
sim.statistics = []
sim.run(3)
sim.statistics = [stats.SampleStatistics()]
sim.run(1)
xft.utils.print_tree(sim.results_store)

0: 
|__sample_statistics: 
|____means: <class 'pandas.core.series.Series'>
|____variances: <class 'pandas.core.series.Series'>
|____variance_components: <class 'pandas.core.series.Series'>
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>
4: 
|__sample_statistics: 
|____means: <class 'pandas.core.series.Series'>
|____variances: <class 'pandas.core.series.Series'>
|____variance_components: <class 'pandas.core.series.Series'>
|____vcov: <class 'pandas.core.frame.DataFrame'>
|____corr: <class 'pandas.core.frame.DataFrame'>
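
For longer schedules, it can help to work out the run lengths programmatically. The helper below is a hypothetical sketch (`plan_segments` is not part of `xftsim`); it assumes, as in the cell above, that generations are counted from zero and that each call to `sim.run(k)` advances the simulation by `k` generations.

```python
from itertools import groupby

def plan_segments(n_generations, record_at):
    """Group consecutive generations into (record, length) runs.

    `record_at` is the set of generation indices at which statistics
    should be computed; all other generations are run without them.
    """
    flags = [g in record_at for g in range(n_generations)]
    return [(flag, len(list(run))) for flag, run in groupby(flags)]

segments = plan_segments(5, {0, 4})
print(segments)  # [(True, 1), (False, 3), (True, 1)]
```

Each `(record, length)` pair then corresponds to one assignment to `sim.statistics` (the desired list when `record` is true, `[]` otherwise) followed by `sim.run(length)`.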


## Sample and mating statistics

Two particularly useful sets of statistics are provided by `stats.SampleStatistics` and `stats.MatingStatistics`. We've seen plenty of the former already; the latter produces statistics at the level of the mate pair rather than the offspring (a given pair may have several offspring). By default, `stats.MatingStatistics` reports reproduction rates and cross-mate, cross-trait phenotypic correlation matrices:


In [5]:
sim.statistics = [stats.MatingStatistics()]
sim.run(1)
sim.results

{'mating_statistics': {'n_reproducing_pairs': 1000,
  'n_total_offspring': 2000,
  'mean_n_offspring_per_pair': 2.0,
  'mean_n_female_offspring_per_pair': 1.056,
  'mate_correlations': component                height.phenotype.mother  BMD.phenotype.mother  \
  component                                                                
  height.phenotype.mother                 1.000000             -0.004248   
  BMD.phenotype.mother                   -0.004248              1.000000   
  height.phenotype.father                -0.010330             -0.012517   
  BMD.phenotype.father                    0.021719              0.025569   
  
  component                height.phenotype.father  BMD.phenotype.father  
  component                                                               
  height.phenotype.mother                -0.010330              0.021719  
  BMD.phenotype.mother                   -0.012517              0.025569  
  height.phenotype.father                 1.000000        

We can expand this to include all phenotype components (e.g., heritable and non-heritable portions) by supplying the `full` flag:

In [6]:
sim.statistics = [stats.MatingStatistics(full=True)]
sim.run(1)
sim.results['mating_statistics']['mate_correlations']

component,height.additiveGenetic.mother,BMD.additiveGenetic.mother,height.additiveNoise.mother,BMD.additiveNoise.mother,height.phenotype.mother,BMD.phenotype.mother,height.additiveGenetic.father,BMD.additiveGenetic.father,height.additiveNoise.father,BMD.additiveNoise.father,height.phenotype.father,BMD.phenotype.father
height.additiveGenetic.mother,1.0,-0.021017,-0.007734,0.049782,0.697232,0.026017,-0.054339,-0.00799,0.023681,-0.058008,-0.019167,-0.050465
BMD.additiveGenetic.mother,-0.021017,1.0,-0.022737,-0.000432,-0.03107,0.620076,-0.021841,-0.028459,0.034324,0.007174,0.010244,-0.012142
height.additiveNoise.mother,-0.007734,-0.022737,1.0,0.005671,0.711432,-0.009657,-0.024672,0.031432,0.010509,-0.028199,-0.008878,-0.002485
BMD.additiveNoise.mother,0.049782,-0.000432,0.005671,1.0,0.039051,0.784274,0.023771,-0.03727,0.007536,-0.023036,0.021333,-0.041328
height.phenotype.mother,0.697232,-0.03107,0.711432,0.039051,1.0,0.011361,-0.055875,0.016917,0.024176,-0.060981,-0.019834,-0.037247
BMD.phenotype.mother,0.026017,0.620076,-0.009657,0.784274,0.011361,1.0,0.005098,-0.046896,0.027208,-0.013622,0.023093,-0.039956
height.additiveGenetic.father,-0.054339,-0.021841,-0.024672,0.023771,-0.055875,0.005098,1.0,-0.049287,0.031182,-0.020988,0.690617,-0.047224
BMD.additiveGenetic.father,-0.00799,-0.028459,0.031432,-0.03727,0.016917,-0.046896,-0.049287,1.0,-0.030599,-0.004466,-0.055067,0.620793
height.additiveNoise.father,0.023681,0.034324,0.010509,0.007536,0.024176,0.027208,0.031182,-0.030599,1.0,0.046876,0.744404,0.017647
BMD.additiveNoise.father,-0.058008,0.007174,-0.028199,-0.023036,-0.060981,-0.013622,-0.020988,-0.004466,0.046876,1.0,0.019897,0.781194
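
One pattern in the full matrix is worth a quick sanity check. Within a parent, the phenotype correlates with its genetic component at about 0.70 for height and 0.62 for BMD. If the phenotype is the sum of independent genetic and noise components (an assumption here, though consistent with the component names), that correlation equals `sqrt(h2)`, or about 0.71 and 0.63 for the heritabilities of 0.5 and 0.4 specified earlier. The standard-library sketch below does not use `xftsim`; it simulates such a sum and compares the empirical correlation with `sqrt(h2)`.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)
n = 5000
empirical = {}
for name, h2 in [("height", 0.5), ("BMD", 0.4)]:
    # Independent components with variances h2 and 1 - h2 (assumed additive model).
    genetic = [rng.gauss(0, math.sqrt(h2)) for _ in range(n)]
    noise = [rng.gauss(0, math.sqrt(1 - h2)) for _ in range(n)]
    phenotype = [g + e for g, e in zip(genetic, noise)]
    empirical[name] = pearson(genetic, phenotype)
    print(name, round(empirical[name], 3), "expected", round(math.sqrt(h2), 3))
```

The same reasoning gives `sqrt(1 - h2)` for the correlation between the noise component and the phenotype, which matches the values of roughly 0.71 and 0.78 in the matrix.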


## Estimators

So far, `xftsim` includes routines for genome-wide association studies (GWAS) and Haseman-Elston (HE) regression for estimating heritability and genetic correlations, which we demonstrate briefly below. For further details, see the API documentation: {ref}`stats <xftsim:xftsim.stats module>`.


In [7]:
sim.statistics = [stats.GWAS_Estimator(), stats.HasemanElstonEstimator(randomized=True)]
sim.run(1)

### GWAS

The GWAS estimator returns a 3-dimensional array consisting of slopes, standard errors, test statistics and p-values:

In [8]:
sim.results['GWAS']['estimates']
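
To make the four reported quantities concrete, here is a generic single-variant regression written with the standard library only. It is a textbook sketch and not a description of `stats.GWAS_Estimator` internals: the function name, the simulated data, and the normal approximation for the p-value are all assumptions made for illustration.

```python
import math
import random

def marginal_association(genotype, phenotype):
    """Simple linear regression of a phenotype on one variant.

    Returns slope, standard error, z statistic, and a two-sided
    p-value based on a normal approximation.
    """
    n = len(genotype)
    mean_g = sum(genotype) / n
    mean_p = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotype)
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotype, phenotype))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_g
    rss = sum((p - intercept - slope * g) ** 2 for g, p in zip(genotype, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = slope / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return slope, se, z, p_value

# Simulated data: allele counts in {0, 1, 2} and a true per-allele effect of 0.5.
rng = random.Random(0)
n = 2000
genotype = [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(n)]
phenotype = [0.5 * g + rng.gauss(0, 1) for g in genotype]

slope, se, z, p_value = marginal_association(genotype, phenotype)
print(round(slope, 3), round(se, 3), round(z, 1), p_value)
```

Repeating this regression for every variant and every phenotype yields one slope, standard error, test statistic, and p-value per variant-phenotype pair, which is why a 3-dimensional array is a natural container for the results.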

### HE Regression

The HE regression estimator produces genetic (co)variance and correlation estimates. We highly recommend setting the `randomized` flag to `True` (as we have done above) for all but the smallest problems.

In [9]:
sim.results['HE_regression']

{'cov_HE': phenotype_name                                    height       BMD
 component_name                                 phenotype phenotype
 vorigin_relative                                 proband   proband
 phenotype_name component_name vorigin_relative                    
 height         phenotype      proband           0.478390  0.021629
 BMD            phenotype      proband           0.021629  0.420101,
 'corr_HE': phenotype_name                                    height       BMD
 component_name                                 phenotype phenotype
 vorigin_relative                                 proband   proband
 phenotype_name component_name vorigin_relative                    
 height         phenotype      proband           1.000000  0.048248
 BMD            phenotype      proband           0.048248  1.000000}
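
For intuition about what the estimator targets, the sketch below implements the naive, exact form of univariate Haseman-Elston regression with the standard library: regress the products of standardized phenotypes for each pair of individuals on the corresponding genomic relatedness, and read the slope as the heritability estimate. It is not `xftsim`'s implementation and says nothing about what `randomized=True` does internally. The function names, sample sizes, and simulated data are assumptions chosen to keep the example small.

```python
import math
import random

def standardize(values):
    """Center to mean 0 and scale to variance 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def he_regression(genotypes, phenotype):
    """Slope from regressing y_i * y_j on relatedness A_ij over all pairs i < j."""
    n, m = len(genotypes), len(genotypes[0])
    # Standardize each variant (column), then store values by individual (row).
    columns = [standardize([row[k] for row in genotypes]) for k in range(m)]
    z = [[columns[k][i] for k in range(m)] for i in range(n)]
    y = standardize(phenotype)
    relatedness, products = [], []
    for i in range(n):
        for j in range(i + 1, n):
            relatedness.append(sum(a * b for a, b in zip(z[i], z[j])) / m)
            products.append(y[i] * y[j])
    r_bar = sum(relatedness) / len(relatedness)
    p_bar = sum(products) / len(products)
    sxy = sum((r - r_bar) * (p - p_bar) for r, p in zip(relatedness, products))
    sxx = sum((r - r_bar) ** 2 for r in relatedness)
    return sxy / sxx

# Simulate unlinked variants and a phenotype with heritability 0.5.
rng = random.Random(1)
n, m, h2 = 300, 100, 0.5
freqs = [rng.uniform(0.2, 0.8) for _ in range(m)]
genotypes = [[(rng.random() < f) + (rng.random() < f) for f in freqs] for _ in range(n)]
effects = [rng.gauss(0, 1) for _ in range(m)]
genetic = standardize([sum(g * b for g, b in zip(row, effects)) for row in genotypes])
noise = standardize([rng.gauss(0, 1) for _ in range(n)])
phenotype = [math.sqrt(h2) * g + math.sqrt(1 - h2) * e for g, e in zip(genetic, noise)]

h2_hat = he_regression(genotypes, phenotype)
print(round(h2_hat, 2))  # a noisy estimate of h2 = 0.5 at this sample size
```

The pairwise loop scales quadratically in the number of individuals, which illustrates why an exact computation becomes impractical for all but the smallest problems.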

### Coming soon

We plan to implement the following additional estimation procedures in the near term:

 - polygenic score computation
 - partitioned HE-regression
 - LD score regression
 - a general cross-validation wrapper for examining out-of-sample performance of all estimators

### Creating custom estimators

:::{note}

Tutorial coming soon.

:::
