# Treemix Ag1000G phase2 populations

To build a Treemix dataset I need unlinked SNPs, so I have to prune my allele count dataset to remove SNPs in high LD.
To do this I need:

    - Phase2 genotype callset
    - Phase2 allele counts
    
Importing modules:

In [49]:
%run imports.ipynb

Importing callsets:

In [50]:
callset_pass = callset_biallel
allele_counts = zarr.open('data/phase2_biallel_allele_count.zarr/')
outgroup_allele_counts = zarr.open('data/outgroup_alleles_phase2.zarr/')

------------------------

Keep only the SNPs that segregate in every selected population of the phase2 callset:

In [51]:
def ingroup_ascertainment(chrom, start, stop, segpops):

    # locate region
    pos = allel.SortedIndex(callset_pass[chrom]['variants']['POS'][:])
    locr = pos.locate_range(start, stop)

    # ascertain SNPs
    loca = np.zeros(pos.shape, dtype='b1')
    loca[locr] = True
    log('Populations ascertainment, initial', nnz(loca))
    
    # require segregating
    for pop in segpops:
        ac = allel.AlleleCountsArray(allele_counts[chrom][pop][:, :2])
        loc_seg = (ac.min(axis=1) > 0)
        loca &= loc_seg
        log('After require segregating in', pop, nnz(loca))
        
    return loca
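
To illustrate the segregating filter on its own, here is a minimal sketch on made-up allele counts. The population names (`popA`, `popB`), the counts and the helper `segregating_in_all` are hypothetical and only mirror the `ac.min(axis=1) > 0` test used above; they are not part of the Ag1000G data.

```python
import numpy as np

# Hypothetical allele counts: rows are SNPs, columns are (ref, alt) counts.
toy_counts = {
    'popA': np.array([[10, 0], [6, 4], [3, 7], [0, 10]]),
    'popB': np.array([[8, 2], [5, 5], [10, 0], [1, 9]]),
}

def segregating_in_all(counts_by_pop):
    """Boolean mask of SNPs where both alleles are observed in every population."""
    n_variants = next(iter(counts_by_pop.values())).shape[0]
    keep = np.ones(n_variants, dtype=bool)
    for ac in counts_by_pop.values():
        # a SNP segregates in a population when its smaller allele count is > 0
        keep &= ac.min(axis=1) > 0
    return keep

mask = segregating_in_all(toy_counts)
print(mask)
```

Only the second SNP has both alleles present in both toy populations, so it is the only one kept. This is also why the counts in the logs further down can only shrink as each population is added.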

Define the function for LD pruning. LD pruning removes SNPs that are highly correlated with other SNPs. Using sliding windows, the function computes pairwise LD between all SNPs within each window, then removes one SNP from each correlated pair.

In [52]:
def downsample_and_prune(chrom, start, stop, loc_asc,
                         n=100000, ldp_size=500, ldp_step=250, ldp_threshold=.1, ldp_n_iter=1):

    # all variant positions
    pos = allel.SortedIndex(callset_pass[chrom]['variants']['POS'][:])
    posa = pos[loc_asc]

    # randomly downsample
    if n < posa.shape[0]:
        posds = np.random.choice(posa, n, replace=False)
        posds.sort()
        posds = allel.SortedIndex(posds)
    else:
        # skip downsampling
        posds = posa
    locds = pos.locate_keys(posds)    

    # load genotype data for the (downsampled) ascertained SNPs only
    genotype = allel.GenotypeChunkedArray(callset_pass[chrom]['calldata/GT'])
    geno_subset = genotype.subset(sel0=locds)
    gn = geno_subset.to_n_alt()

    # prune; positions are compressed together with genotypes so they stay aligned
    posu = posds
    for i in range(ldp_n_iter):
        loc_unlinked = allel.locate_unlinked(gn, size=ldp_size, step=ldp_step, threshold=ldp_threshold)
        n_retain = np.count_nonzero(loc_unlinked)
        n_remove = gn.shape[0] - n_retain
        log('iteration', i+1, 'retaining', n_retain, 'removing', n_remove, 'variants')
        # carry the pruned set forward so further iterations prune it again
        gn = gn.compress(loc_unlinked, axis=0)
        posu = posu.compress(loc_unlinked)

    # boolean mask over all variants on the chromosome
    locu = pos.locate_keys(posu)

    return locu
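
To make the windowed pruning idea concrete, here is a simplified sketch in plain NumPy. It is not the scikit-allel implementation: `toy_locate_unlinked` is a hypothetical helper, it uses the squared Pearson correlation between alt-allele counts as a stand-in for the LD estimate, and the genotypes are made up.

```python
import numpy as np

def toy_locate_unlinked(gn, size=4, step=2, threshold=0.1):
    """Simplified windowed LD pruning.

    gn is an (n_variants, n_samples) array of alt-allele counts (0, 1 or 2).
    Returns a boolean mask of the variants that are kept.
    """
    n_variants = gn.shape[0]
    keep = np.ones(n_variants, dtype=bool)
    # slide a window of `size` variants along the array, moving `step` each time
    for w_start in range(0, n_variants, step):
        w_stop = min(w_start + size, n_variants)
        for i in range(w_start, w_stop):
            if not keep[i]:
                continue
            for j in range(i + 1, w_stop):
                if not keep[j]:
                    continue
                r = np.corrcoef(gn[i], gn[j])[0, 1]
                # drop the second SNP of any pair whose r**2 exceeds the threshold
                if r ** 2 > threshold:
                    keep[j] = False
    return keep

# Made-up genotypes: SNP 1 duplicates SNP 0, the others are uncorrelated.
toy_gn = np.array([
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [1, 0, 1, 1, 2, 1],
    [0, 1, 0, 2, 1, 2],
])
toy_keep = toy_locate_unlinked(toy_gn)
print(toy_keep)
```

The duplicated SNP is removed and the three mutually uncorrelated SNPs are retained, which is the behaviour the pruning step relies on at genome scale.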

Define the function for generating the Treemix input file:

In [53]:
def to_treemix(acs, fn):
    pops = sorted(acs.keys())
    n_variants = acs[pops[0]].shape[0]
    n_alleles = acs[pops[0]].shape[1]
    assert n_alleles == 2, 'only biallelic variants supported'
    for pop in pops[1:]:
        assert n_variants == acs[pop].shape[0], 'bad number of variants for pop %s' % pop
        assert n_alleles == acs[pop].shape[1], 'bad number of alleles for pop %s' % pop
        
    with open(fn, 'wt', encoding='ascii') as f:
        print(' '.join(pops), file=f)
        for i in range(n_variants):
            print(' '.join([','.join(map(str, acs[pop][i])) for pop in pops]), file=f)
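
As a quick illustration of the layout written by `to_treemix`, the sketch below formats a tiny made-up dataset as a string instead of writing a file. `format_treemix`, `popA`, `popB` and the counts are hypothetical; the layout follows the function above: a header line with population names, then one line per SNP with comma-separated counts for the two alleles in each population.

```python
def format_treemix(acs):
    """Return the Treemix-style text for a dict of population -> allele counts."""
    pops = sorted(acs.keys())
    n_variants = len(acs[pops[0]])
    # header: space-separated population names, in sorted order
    lines = [' '.join(pops)]
    for i in range(n_variants):
        # one field per population, the two allele counts joined by a comma
        lines.append(' '.join(','.join(map(str, acs[pop][i])) for pop in pops))
    return '\n'.join(lines) + '\n'

toy_acs = {
    'popB': [[8, 2], [5, 5]],
    'popA': [[10, 0], [6, 4]],
}
treemix_text = format_treemix(toy_acs)
print(treemix_text, end='')
# popA popB
# 10,0 8,2
# 6,4 5,5
```

Because the populations are sorted, the column order in the file does not depend on the order in which they were added to the dictionary.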


Define the last function, the analysis function, which combines all the functions above and applies them to my populations, chromosomes and regions of interest.

In [54]:
def run_analysis(rname, chrom, start, stop, segpops,
                 n=100000, ldp_size=500, ldp_step=250, ldp_threshold=.1, ldp_n_iter=1):

    # initial ascertainment
    loc_asc = ingroup_ascertainment(chrom, start, stop, segpops=segpops)
    
    # downsample and prune
    locu = downsample_and_prune(chrom, start, stop, loc_asc, 
                                n=n, ldp_size=ldp_size, ldp_step=ldp_step, 
                                ldp_threshold=ldp_threshold, ldp_n_iter=ldp_n_iter)
    
    # write allele counts
    acsu = dict()
    for pop in populations:
        acsu[pop] = allele_counts[chrom][pop][:, :2][locu]

    outdir = 'data/treemix/seg_%s_ldp_%s' % ('_'.join(segpops), ldp_n_iter)
    !mkdir -pv {outdir}
    fn = os.path.join(outdir, '%s.allele_counts.txt' % rname)
    to_treemix(acsu, fn)
    !gzip -fv {fn}

Set the parameters for generating the Treemix files, then run the analysis on chromosome arms 3R and 3L, on the X chromosome, and on the X region involved in speciation between <i>An. gambiae</i> and <i>An. coluzzii</i>:

In [55]:
segpops = ['AOcol', 'BFcol', 'CIcol', 'GHcol', 'GNcol', 'GHgam', 'CMgam', 'BFgam', 'GNgam', 'GQgam', 'UGgam', 'GAgam', 'FRgam', 'KE', 'GM', 'GW']
n = 100000
ldp_n_iter = 1
region_X_speciation = 'X-speciation', 'X', 15000000, 24000000 
region_X_free = 'X-free', 'X', 1, 14000000 
region_3L_free = '3L-free', '3L', 15000000, 41000000
region_3R_free = '3R-free', '3R', 1, 37000000 

In [46]:
rname, chrom, start, stop = region_3L_free
log(rname, chrom, start, stop)
run_analysis(rname, chrom, start, stop, segpops,n=n, ldp_n_iter=ldp_n_iter)

3L-free 3L 15000000 41000000
ingroup ascertainment, initial 5989818
after require segregating in AOcol 672596
after require segregating in BFcol 448836
after require segregating in CIcol 392968
after require segregating in GHcol 368591
after require segregating in GNcol 190540
after require segregating in GHgam 168923
after require segregating in CMgam 168833
after require segregating in BFgam 168532
after require segregating in GNgam 167228
after require segregating in GQgam 146236
after require segregating in UGgam 146178
after require segregating in GAgam 142414
after require segregating in FRgam 96294
after require segregating in KE 68187
after require segregating in GM 68141
after require segregating in GW 68139
iteration 1 retaining 48664 removing 19475 variants
data/treemix/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_CMgam_BFgam_GNgam_GQgam_UGgam_GAgam_FRgam_KE_GM_GW_ldp_1/3L-free.allele_counts.txt:	 92.5% -- replaced with data/treemix/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_CMgam_B

In [45]:
rname, chrom, start, stop = region_3R_free
log(rname, chrom, start, stop)
run_analysis(rname, chrom, start, stop, segpops,n=n, ldp_n_iter=ldp_n_iter) #outgroups

3R-free 3R 1 37000000
ingroup ascertainment, initial 8535400
after require segregating in AOcol 944999
after require segregating in BFcol 632162
after require segregating in CIcol 555287
after require segregating in GHcol 520052
after require segregating in GNcol 270605
after require segregating in GHgam 238337
after require segregating in CMgam 238226
after require segregating in BFgam 237796
after require segregating in GNgam 236126
after require segregating in GQgam 204576
after require segregating in UGgam 204483
after require segregating in GAgam 200000
after require segregating in FRgam 134437
after require segregating in KE 94097
after require segregating in GM 94004
after require segregating in GW 93998
iteration 1 retaining 66465 removing 27533 variants
mkdir: created directory 'data/treemix'
mkdir: created directory 'data/treemix/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_CMgam_BFgam_GNgam_GQgam_UGgam_GAgam_FRgam_KE_GM_GW_ldp_1'
data/treemix/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHg

In [47]:
rname, chrom, start, stop = region_X_free
log(rname, chrom, start, stop)
run_analysis(rname, chrom, start, stop, segpops, n=n, ldp_n_iter=ldp_n_iter)

X-free X 1 14000000
ingroup ascertainment, initial 3357129
after require segregating in AOcol 283965
after require segregating in BFcol 173888
after require segregating in CIcol 146215
after require segregating in GHcol 135601
after require segregating in GNcol 55134
after require segregating in GHgam 48802
after require segregating in CMgam 48790
after require segregating in BFgam 48695
after require segregating in GNgam 48384
after require segregating in GQgam 40476
after require segregating in UGgam 40466
after require segregating in GAgam 38465
after require segregating in FRgam 16632
after require segregating in KE 9986
after require segregating in GM 9980
after require segregating in GW 9980
iteration 1 retaining 8207 removing 1773 variants
data/treemix/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_CMgam_BFgam_GNgam_GQgam_UGgam_GAgam_FRgam_KE_GM_GW_ldp_1/X-free.allele_counts.txt:	 93.6% -- replaced with data/treemix/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_CMgam_BFgam_GNgam_GQgam_UGgam_

In [56]:
rname, chrom, start, stop = region_X_speciation
log(rname, chrom, start, stop)
run_analysis(rname, chrom, start, stop, segpops, n=n, ldp_n_iter=ldp_n_iter)

X-speciation X 15000000 24000000
Populations ascertainment, initial 883199
After require segregating in AOcol 54420
After require segregating in BFcol 22465
After require segregating in CIcol 18595
After require segregating in GHcol 17516
After require segregating in GNcol 7051
After require segregating in GHgam 2757
After require segregating in CMgam 2734
After require segregating in BFgam 2634
After require segregating in GNgam 2549
After require segregating in GQgam 1574
After require segregating in UGgam 1570
After require segregating in GAgam 1331
After require segregating in FRgam 553
After require segregating in KE 350
After require segregating in GM 347
After require segregating in GW 347
iteration 1 retaining 149 removing 198 variants
data/treemix/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_CMgam_BFgam_GNgam_GQgam_UGgam_GAgam_FRgam_KE_GM_GW_ldp_1/X-speciation.allele_counts.txt:	 92.8% -- replaced with data/treemix/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_CMgam_BFgam_GNgam_GQgam_UGg

## Treemix

Total SNPs retained per region after LD pruning:
- <b>3L-free</b>: 48664 SNPs
- <b>3R-free</b>: 66465 SNPs
- <b>X-free</b>: 8207 SNPs
- <b>X-speciation</b>: 149 SNPs
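
These totals can be cross-checked against the files on disk by counting the data lines of each gzipped Treemix input. The sketch below is one possible way to do it; `count_treemix_snps` is a hypothetical helper, and the demo runs on a small temporary file with made-up counts rather than on the real outputs.

```python
import gzip
import os
import tempfile

def count_treemix_snps(path):
    """Return (population names, number of SNP rows) for a gzipped Treemix input file."""
    with gzip.open(path, 'rt', encoding='ascii') as f:
        # first line is the header with population names
        header = f.readline().split()
        # every remaining non-empty line is one SNP
        n_snps = sum(1 for line in f if line.strip())
    return header, n_snps

# Demo on a made-up file; point `path` at a real *.allele_counts.txt.gz to check a region.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.allele_counts.txt.gz')
with gzip.open(demo_path, 'wt', encoding='ascii') as f:
    f.write('popA popB\n10,0 8,2\n6,4 5,5\n')

demo_pops, demo_n = count_treemix_snps(demo_path)
print(demo_pops, demo_n)
```

For a real region, the returned count should match the "retaining" number logged by the pruning step.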

In [2]:
from IPython.display import Image

### 3L-free

### 3R-free

### X-free

### X-speciation