In [1]:
import linear_dag as ld
import networkx as nx
import numpy as np

First, we simulate some genotypes from one of several fixed ARGs. For example, "2-1" corresponds to a simulation with 4 ancestral haplotypes and 3 mutations. There is a root haplotype with a mutation and two children, each also having a mutation; those children recombine into the fourth haplotype, which does not have a mutation of its own.

In [2]:
number_of_samples = 100
sim = ld.Simulate.simulate_example(example="2-1", ns=number_of_samples)

`sim` is itself a `LinearARG` instance, and it also holds the simulated genotypes.

In [3]:
print(sim.shape, sim.A.shape, sim.sample_haplotypes.shape)

(100, 3) (104, 104) (100, 3)


In `sim.A`, the last four rows/columns correspond to ancestral haplotypes.

In [4]:
print(sim.A[number_of_samples:,:][:,number_of_samples:].todense())

[[ 0.  0.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 1.  0.  0.  0.]
 [-1.  1.  1.  0.]]


The `A_ancestral` attribute gives the same matrix:

In [5]:
print(sim.A_ancestral)

[[ 0.  0.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 1.  0.  0.  0.]
 [-1.  1.  1.  0.]]
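
One way to read this matrix is sketched below using only NumPy (not `linear_dag`). It assumes that entry `[i, j]` is the weight of the edge from haplotype `j` to haplotype `i`, that the three mutations sit on haplotypes 0, 1 and 2, and that a haplotype's genotype is the sum of edge-weight products over all paths leading to it, i.e. a row of `(I - A)^{-1}`. Under those assumptions, the `-1` edge is what stops the recombinant from counting the root mutation twice. The variable names are illustrative.

```python
import numpy as np

# Ancestral adjacency matrix as printed above (assumed convention: A[i, j] is edge j -> i).
A_ancestral = np.array([
    [ 0, 0, 0, 0],
    [ 1, 0, 0, 0],
    [ 1, 0, 0, 0],
    [-1, 1, 1, 0],
], dtype=float)

# (I - A)^{-1} sums the products of edge weights over every directed path.
path_sums = np.linalg.inv(np.eye(4) - A_ancestral)

# Keep the columns of the haplotypes assumed to carry mutations (0, 1, 2).
haplotype_genotypes = np.rint(path_sums[:, :3]).astype(int)
print(haplotype_genotypes)
# [[1 0 0]
#  [1 1 0]
#  [1 0 1]
#  [1 1 1]]

# Dropping the -1 edge makes the recombinant inherit the root mutation twice.
A_no_correction = A_ancestral.copy()
A_no_correction[3, 0] = 0
uncorrected = np.rint(np.linalg.inv(np.eye(4) - A_no_correction)[:, :3]).astype(int)
print(uncorrected[3])  # [2 1 1]
```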


We can reconstruct the linear ARG from the genotype matrix.

In [6]:
linarg_initial = ld.LinearARG.from_genotypes(sim.sample_haplotypes)

However, this has extra edges due to the missing recombination node:

In [7]:
linarg_initial.nnz, sim.nnz

(160, 105)
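
To see why a missing recombination node costs edges, here is a toy NumPy sketch. It is an illustration under assumptions, not a description of what `from_genotypes` does internally: it assumes each recombinant sample must otherwise be wired directly to both children and to the root (with weight `-1`), and it uses the path-sum reading of the adjacency matrix. The helper `genotypes` and the choice of 5 recombinant samples are hypothetical.

```python
import numpy as np

def genotypes(A, sample_rows, mutation_cols):
    """Rows of (I - A)^{-1} for the samples, restricted to the mutation nodes."""
    paths = np.linalg.inv(np.eye(A.shape[0]) - A)
    return np.rint(paths[np.ix_(sample_rows, mutation_cols)]).astype(int)

n_rec = 5  # hypothetical number of recombinant samples

# Without a recombination node: ancestral nodes 0, 1, 2, then the samples.
A_flat = np.zeros((3 + n_rec, 3 + n_rec))
A_flat[1, 0] = A_flat[2, 0] = 1
flat_samples = list(range(3, 3 + n_rec))
for s in flat_samples:
    A_flat[s, [1, 2]] = 1   # one edge from each child haplotype
    A_flat[s, 0] = -1       # correction so the root mutation is counted once

# With a recombination node (node 3): each sample needs a single edge.
A_rec = np.zeros((4 + n_rec, 4 + n_rec))
A_rec[1, 0] = A_rec[2, 0] = 1
A_rec[3, [1, 2]] = 1
A_rec[3, 0] = -1
rec_samples = list(range(4, 4 + n_rec))
for s in rec_samples:
    A_rec[s, 3] = 1

# Both graphs encode the same genotypes for the samples...
g_flat = genotypes(A_flat, flat_samples, [0, 1, 2])
g_rec = genotypes(A_rec, rec_samples, [0, 1, 2])
print(np.array_equal(g_flat, g_rec))  # True

# ...but the second one uses fewer edges.
nnz_flat = np.count_nonzero(A_flat)
nnz_rec = np.count_nonzero(A_rec)
print(nnz_flat, nnz_rec)  # 17 10
```

In this toy setting the saving grows by two edges for every additional recombinant sample.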

linarg_initial has no recombination nodes, so it gives us the reduced graph of the simulated linear ARG:

In [8]:
print(linarg_initial.A[number_of_samples:,:][:,number_of_samples:].todense())

[[0 0 0]
 [1 0 0]
 [1 0 0]]


We can find the recombination node and improve sparsity as follows:

In [9]:
linarg_recom = linarg_initial.unweight()
linarg_recom = linarg_recom.find_recombinations()
linarg_recom.nnz

All properties hold for the Trios instance.


107

This isn't exactly the original linear ARG, as it has 2 additional edges (107 nonzero entries versus 105). One of the things I'm currently working on is improving `find_recombinations` so that it doesn't create these. Here's where those extra edges come from:

In [10]:
print(linarg_recom.A[number_of_samples:,:][:,number_of_samples:].todense())
print(linarg_recom.variant_indices - number_of_samples)

[[ 0  0  0  0  0  0]
 [-1  0  0  0  0  0]
 [ 1  0  0  0  0  0]
 [ 1  0  0  0  0  0]
 [ 0  1  1  0  0  0]
 [ 0  0  0  1  1  0]]
[0 2 3]


Notice that the `-1` edge has been given its own node (node 1), and the recombination event has been split into two separate events. The first produces a recombination between nodes 1 and 2 (node 4), and the second produces a recombination between node 4 and node 3 (node 5). Let's check that the only ancestral nodes with samples as direct children are nodes 0, 2, 3, and 5:

In [11]:
samples = linarg_recom.sample_indices
ancestors_with_children = number_of_samples + np.array([0,2,3,5])
ancestors_without_children = number_of_samples + np.array([1,4])

print(linarg_recom.A[samples, :][:, ancestors_with_children].nnz)
print(linarg_recom.A[samples, :][:, ancestors_without_children].nnz)


100
0
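
As a further sanity check, the sketch below compares the haplotypes encoded by the 6-node matrix printed above with those of the original 4-node matrix. It uses only NumPy and assumes the path-sum reading of the adjacency matrix (genotypes are rows of `(I - A)^{-1}`, with `A[i, j]` the edge from `j` to `i`). The variant columns `[0, 2, 3]` are taken from the printed `variant_indices`; for the original matrix, mutations are assumed to sit on haplotypes 0, 1 and 2.

```python
import numpy as np

def path_genotypes(A, variant_cols):
    """Rows of (I - A)^{-1}, restricted to the nodes that carry variants."""
    paths = np.linalg.inv(np.eye(A.shape[0]) - A)
    return np.rint(paths[:, variant_cols]).astype(int)

# Original simulated ARG (4 ancestral haplotypes).
A_sim = np.array([
    [ 0, 0, 0, 0],
    [ 1, 0, 0, 0],
    [ 1, 0, 0, 0],
    [-1, 1, 1, 0],
])

# Inferred ARG after find_recombinations (6 ancestral nodes), as printed above.
A_recom = np.array([
    [ 0, 0, 0, 0, 0, 0],
    [-1, 0, 0, 0, 0, 0],
    [ 1, 0, 0, 0, 0, 0],
    [ 1, 0, 0, 0, 0, 0],
    [ 0, 1, 1, 0, 0, 0],
    [ 0, 0, 0, 1, 1, 0],
])

g_sim = path_genotypes(A_sim, [0, 1, 2])
g_recom = path_genotypes(A_recom, [0, 2, 3])

# Nodes 0, 2, 3, 5 reproduce the four simulated haplotypes.
print(np.array_equal(g_recom[[0, 2, 3, 5]], g_sim))  # True

# Nodes 1 and 4 are intermediate bookkeeping nodes, not valid haplotypes.
print(g_recom[1], g_recom[4])  # [-1  0  0] [0 1 0]
```

Under these assumptions, nodes 1 and 4 do not correspond to any real haplotype, which is consistent with no sample attaching to them directly.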


To verify that the linear ARGs are equivalent, we can compute allele counts as follows:

In [12]:
linarg_triangular = linarg_recom.make_triangular() # re-orders rows + columns s.t. adjacency matrix is triangular
sim_triangular = sim.make_triangular()

v = np.ones(number_of_samples)
allele_count_from_X = v @ sim.sample_haplotypes
allele_count_from_linarg = v @ linarg_triangular
allele_count_from_sim = v @ sim_triangular

print(allele_count_from_X, allele_count_from_linarg, allele_count_from_sim)

[100.  52.  50.] [[100.  52.  50.]] [[100.  52.  50.]]
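
The triangular ordering matters because, with a triangular adjacency matrix, a product like `v @ X` can be computed by substitution without ever forming the genotype matrix. The sketch below is a minimal dense NumPy illustration of that idea, not the `linear_dag` implementation. It assumes parents are ordered before children (so `A` is strictly lower triangular), that `A[i, j]` is the edge from `j` to `i`, and that mutations sit on nodes 0, 1 and 2. The function name `propagate_counts` and the 6-sample layout are hypothetical.

```python
import numpy as np

def propagate_counts(A, v_full):
    """Solve (I - A)^T y = v_full by back-substitution, visiting children before parents."""
    y = v_full.astype(float).copy()
    for i in range(A.shape[0] - 1, -1, -1):
        # A[k, i] is nonzero only for children k > i, whose y[k] is already final.
        y[i] += A[i + 1:, i] @ y[i + 1:]
    return y

# Four ancestral haplotypes (nodes 0..3), as in the "2-1" example.
n_anc = 4
sample_parents = [0, 1, 1, 2, 3, 3]  # hypothetical: haplotype copied by each sample
n = n_anc + len(sample_parents)

A = np.zeros((n, n))
A[1, 0] = A[2, 0] = 1
A[3, [1, 2]] = 1
A[3, 0] = -1
for s, parent in enumerate(sample_parents, start=n_anc):
    A[s, parent] = 1

# v puts weight 1 on every sample and 0 on ancestral nodes.
v_full = np.zeros(n)
v_full[n_anc:] = 1

allele_counts = propagate_counts(A, v_full)[[0, 1, 2]]
print(allele_counts)  # [6. 4. 3.]

# Cross-check against the explicit genotype matrix X.
X = np.linalg.inv(np.eye(n) - A)[n_anc:, :3]
print(np.allclose(v_full[n_anc:] @ X, allele_counts))  # True
```

With a sparse matrix the same substitution touches each edge once, so the cost scales with the number of nonzero entries rather than with the size of the genotype matrix.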
