# Treemix Ag1000G phase2 populations

To build a Treemix dataset I need unlinked SNPs, so I have to prune my allele count datasets to remove SNPs in high LD.
To do this I need:

    - Phase2 Genotype callset
    - Phase2 Allele count
    
Importing modules:

In [1]:
%run imports.ipynb



Importing callsets:

In [2]:
callset_pass = callset_biallel
allele_counts = zarr.open('../data/phase2_biallel_allele_count.zarr/')
outgroup_allele_counts = zarr.open('../data/outgroup_alleles_phase2.zarr/')

------------------------

Taking only segregating SNPs for the phase2 callset:

In [3]:
def ingroup_ascertainment(chrom, start, stop, segpops):

    # locate region
    pos = allel.SortedIndex(callset_pass[chrom]['variants']['POS'][:])
    locr = pos.locate_range(start, stop)

    # ascertain SNPs
    loca = np.zeros(pos.shape, dtype='b1')
    loca[locr] = True
    log('Populations ascertainment, initial', nnz(loca))
    
    # require segregating
    for pop in segpops:
        ac = allel.AlleleCountsArray(allele_counts[chrom][pop][:, :2])
        loc_seg = (ac.min(axis=1) > 0)
        loca &= loc_seg
        log('After require segregating in', pop, nnz(loca))
        
    return loca
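The segregating filter above can be illustrated on toy data. This is a minimal sketch using only NumPy; the population names `popA` and `popB` and their counts are invented for illustration and are not part of the Ag1000G data. A SNP is kept only if both alleles are observed in every population.

```python
import numpy as np

# Toy allele counts (rows = SNPs, columns = [ref, alt]) for two hypothetical populations.
toy_counts = {
    'popA': np.array([[10, 0], [6, 4], [3, 7], [0, 10]]),
    'popB': np.array([[8, 2], [5, 5], [10, 0], [1, 9]]),
}

# Start with every SNP selected, then keep only those segregating in each population.
keep = np.ones(4, dtype=bool)
for pop, ac in toy_counts.items():
    segregating = ac.min(axis=1) > 0  # both alleles observed at least once
    keep &= segregating
    print('After require segregating in', pop, int(np.count_nonzero(keep)))

print(keep)  # only the second SNP segregates in both populations
```

Because the masks are combined with `&=`, the count can only go down as populations are added, which is the pattern visible in the logs further down.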

Define the function for LD pruning. LD pruning removes SNPs that are highly correlated with one another: using sliding windows, the function computes pairwise LD between all SNPs within each window, then removes one SNP from each correlated pair.

In [4]:
def downsample_and_prune(chrom, start, stop, loc_asc,
                         n=100000, ldp_size=500, ldp_step=250, ldp_threshold=.1, ldp_n_iter=1):

    # all variant positions
    pos = allel.SortedIndex(callset_pass[chrom]['variants']['POS'][:])
    posa = pos[loc_asc]

    # randomly downsample
    if n < posa.shape[0]:
        posds = np.random.choice(posa, n, replace=False)
        posds.sort()
        posds = allel.SortedIndex(posds)
    else:
        # skip downsampling
        posds = posa
    locds = pos.locate_keys(posds)    

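    # load genotype data
    # NOTE: genotypes are subset with loc_asc rather than locds, so pruning
    # runs on every ascertained SNP and the random downsample above is unused
    genotype = allel.GenotypeChunkedArray(callset_pass[chrom]['calldata/GT'])
    geno_subset = genotype.subset(sel0=loc_asc)
    gn = geno_subset.to_n_alt()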

    
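    # prune, starting from the ascertained positions that match the rows of gn
    posu = posa
    for i in range(ldp_n_iter):
        loc_unlinked = allel.locate_unlinked(gn, size=ldp_size, step=ldp_step, threshold=ldp_threshold)
        n_retain = np.count_nonzero(loc_unlinked)
        n_remove = gn.shape[0] - n_retain
        log('iteration', i+1, 'retaining', n_retain, 'removing', n_remove, 'variants')
        # carry the pruned genotypes and positions into the next iteration
        gn = gn.compress(loc_unlinked, axis=0)
        posu = posu.compress(loc_unlinked)

    # boolean mask over all variants marking the SNPs that survived pruning
    locu = pos.locate_keys(posu)

    return locu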
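To make the windowed pruning idea concrete, here is a simplified sketch in plain NumPy. It is not the scikit-allel implementation: `toy_locate_unlinked` is a hypothetical helper that uses the squared Pearson correlation between genotype vectors as the LD measure and greedily drops the later SNP of each correlated pair within a sliding window. The genotype matrix is invented for illustration.

```python
import numpy as np

def toy_locate_unlinked(gn, size=3, step=1, threshold=0.1):
    """Greedy windowed LD pruning on a (variants x samples) matrix of alt-allele counts."""
    keep = np.ones(gn.shape[0], dtype=bool)
    for start in range(0, gn.shape[0], step):
        stop = min(start + size, gn.shape[0])
        for i in range(start, stop):
            if not keep[i]:
                continue
            for j in range(i + 1, stop):
                if not keep[j]:
                    continue
                r = np.corrcoef(gn[i], gn[j])[0, 1]
                if r ** 2 > threshold:
                    keep[j] = False  # drop the later SNP of the correlated pair
    return keep

# Toy genotypes coded as number of alternate alleles per sample.
gn = np.array([
    [0, 1, 2, 0, 1, 2],  # SNP 0
    [0, 1, 2, 0, 1, 2],  # SNP 1: identical to SNP 0, so r^2 = 1
    [1, 0, 1, 1, 2, 1],  # SNP 2: uncorrelated with SNP 0
    [1, 2, 1, 1, 0, 1],  # SNP 3: perfectly anti-correlated with SNP 2
])

keep = toy_locate_unlinked(gn, size=3, step=1, threshold=0.1)
print(keep)  # SNPs 0 and 2 are retained, SNPs 1 and 3 are removed
```

The `size`, `step` and `threshold` arguments play the same role as `ldp_size`, `ldp_step` and `ldp_threshold` in the function above: larger windows compare more distant SNPs, and a lower threshold prunes more aggressively.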

Define function for generating treemix file:

In [5]:
def to_treemix(acs, fn):
    pops = sorted(acs.keys())
    n_variants = acs[pops[0]].shape[0]
    n_alleles = acs[pops[0]].shape[1]
    assert n_alleles == 2, 'only biallelic variants supported'
    for pop in pops[1:]:
        assert n_variants == acs[pop].shape[0], 'bad number of variants for pop %s' % pop
        assert n_alleles == acs[pop].shape[1], 'bad number of alleles for pop %s' % pop
        
    with open(fn, 'wt', encoding='ascii') as f:
        print(' '.join(pops), file=f)
        for i in range(n_variants):
            print(' '.join([','.join(map(str, acs[pop][i])) for pop in pops]), file=f)
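As a quick check of the file layout that `to_treemix` produces, the sketch below writes two toy SNPs for two hypothetical populations into an in-memory buffer with the same join logic. The population names and counts are invented; only the layout matters: a header line of population names, then one line per SNP with a `ref,alt` pair for each population.

```python
import io

# Hypothetical [ref, alt] counts for two SNPs in two populations.
toy_acs = {
    'popB': [[8, 2], [5, 5]],
    'popA': [[10, 0], [6, 4]],
}

buf = io.StringIO()
pops = sorted(toy_acs.keys())  # populations are written in alphabetical order
print(' '.join(pops), file=buf)
for i in range(2):
    print(' '.join(','.join(map(str, toy_acs[pop][i])) for pop in pops), file=buf)

print(buf.getvalue())
# popA popB
# 10,0 8,2
# 6,4 5,5
```

Note that the columns follow the sorted population names, not the order in which the populations were passed in.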


Define the last function, the analysis function that chains all the functions above and applies them to my populations, chromosomes and regions of interest.

In [14]:
def run_analysis(rname, chrom, start, stop, segpops,
                 n=100000, ldp_size=500, ldp_step=250, ldp_threshold=.1, ldp_n_iter=1):

    # initial ascertainment
    loc_asc = ingroup_ascertainment(chrom, start, stop, segpops=segpops)
    
    # downsample and prune
    locu = downsample_and_prune(chrom, start, stop, loc_asc, 
                                n=n, ldp_size=ldp_size, ldp_step=ldp_step, 
                                ldp_threshold=ldp_threshold, ldp_n_iter=ldp_n_iter)
    
    # write allele counts
    acsu = dict()
    for pop in segpops:
        acsu[pop] = allele_counts[chrom][pop][:, :2][locu]

    outdir = 'treemix/ag_pops/seg_%s_ldp_%s' % ('_'.join(segpops), ldp_n_iter)
    !mkdir -pv {outdir}
    fn = os.path.join(outdir, '%s.allele_counts.txt' % rname)
    to_treemix(acsu, fn)
    !gzip -fv {fn}

Declare the values for generating my Treemix files and run the analysis for chromosomes 3R, 3L and X, and for the X region involved in speciation between <i>An. gambiae</i> and <i>An. coluzzii</i>:

In [15]:
segpops = ['BFcol', 'CIcol', 'GHcol', 'GNcol','GHgam', 'BFgam', 'GNgam', 'GM', 'GW']
n = 100000
ldp_n_iter = 1
region_X_speciation = 'X-speciation', 'X', 15000000, 24000000 
region_X_free = 'X-free', 'X', 1, 14000000 
region_3L_free = '3L-free', '3L', 15000000, 41000000
region_3R_free = '3R-free', '3R', 1, 24000000 

In [16]:
rname, chrom, start, stop = region_3L_free
log(rname, chrom, start, stop)
run_analysis(rname, chrom, start, stop, segpops,n=n, ldp_n_iter=ldp_n_iter)

3L-free 3L 15000000 41000000
Populations ascertainment, initial 5989818
After require segregating in BFcol 1743824
After require segregating in CIcol 966122
After require segregating in GHcol 771417
After require segregating in GNcol 264372
After require segregating in GHgam 199473
After require segregating in BFgam 198010
After require segregating in GNgam 194238
After require segregating in GM 191296
After require segregating in GW 191011
iteration 1 retaining 124132 removing 66879 variants
treemix/ag_pops/seg_BFcol_CIcol_GHcol_GNcol_GHgam_BFgam_GNgam_GM_GW_ldp_1/3L-free.allele_counts.txt:	 91.8% -- replaced with treemix/ag_pops/seg_BFcol_CIcol_GHcol_GNcol_GHgam_BFgam_GNgam_GM_GW_ldp_1/3L-free.allele_counts.txt.gz


In [17]:
rname, chrom, start, stop = region_3R_free
log(rname, chrom, start, stop)
run_analysis(rname, chrom, start, stop, segpops,n=n, ldp_n_iter=ldp_n_iter) #outgroups

3R-free 3R 1 24000000
Populations ascertainment, initial 5760020
After require segregating in BFcol 1662192
After require segregating in CIcol 935048
After require segregating in GHcol 745633
After require segregating in GNcol 253224
After require segregating in GHgam 193813
After require segregating in BFgam 192709
After require segregating in GNgam 189321
After require segregating in GM 186583
After require segregating in GW 186277
iteration 1 retaining 124545 removing 61732 variants
treemix/ag_pops/seg_BFcol_CIcol_GHcol_GNcol_GHgam_BFgam_GNgam_GM_GW_ldp_1/3R-free.allele_counts.txt:	 92.3% -- replaced with treemix/ag_pops/seg_BFcol_CIcol_GHcol_GNcol_GHgam_BFgam_GNgam_GM_GW_ldp_1/3R-free.allele_counts.txt.gz


In [8]:
rname, chrom, start, stop = region_X_free
log(rname, chrom, start, stop)
run_analysis(rname, chrom, start, stop, segpops, n=n, ldp_n_iter=ldp_n_iter)

X-free X 1 14000000
Populations ascertainment, initial 3357129
After require segregating in AOcol 283965
After require segregating in BFcol 173888
After require segregating in CIcol 146215
After require segregating in GHcol 135601
After require segregating in GNcol 55134
After require segregating in GHgam 48802
After require segregating in CMgam 48790
After require segregating in BFgam 48695
After require segregating in GNgam 48384
After require segregating in GQgam 40476
After require segregating in UGgam 40466
After require segregating in GAgam 38465
After require segregating in FRgam 16632
After require segregating in KE 9986
After require segregating in GM 9980
After require segregating in GW 9980
iteration 1 retaining 8207 removing 1773 variants
mkdir: created directory 'd/data/treemix3'
mkdir: created directory 'd/data/treemix3/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_CMgam_BFgam_GNgam_GQgam_UGgam_GAgam_FRgam_KE_GM_GW_ldp_1'
d/data/treemix3/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_

In [9]:
rname, chrom, start, stop = region_X_speciation
log(rname, chrom, start, stop)
run_analysis(rname, chrom, start, stop, segpops, n=n, ldp_n_iter=ldp_n_iter)

X-speciation X 15000000 24000000
Populations ascertainment, initial 883199
After require segregating in AOcol 54420
After require segregating in BFcol 22465
After require segregating in CIcol 18595
After require segregating in GHcol 17516
After require segregating in GNcol 7051
After require segregating in GHgam 2757
After require segregating in CMgam 2734
After require segregating in BFgam 2634
After require segregating in GNgam 2549
After require segregating in GQgam 1574
After require segregating in UGgam 1570
After require segregating in GAgam 1331
After require segregating in FRgam 553
After require segregating in KE 350
After require segregating in GM 347
After require segregating in GW 347
iteration 1 retaining 149 removing 198 variants
d/data/treemix3/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_CMgam_BFgam_GNgam_GQgam_UGgam_GAgam_FRgam_KE_GM_GW_ldp_1/X-speciation.allele_counts.txt:	 92.8% -- replaced with d/data/treemix3/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_CMgam_BFgam_GNgam_GQg

In [11]:
df = pd.read_csv('d/data/treemix/seg_AOcol_BFcol_CIcol_GHcol_GNcol_GHgam_CMgam_BFgam_GNgam_GQgam_UGgam_GAgam_FRgam_KE_GM_GW_ldp_1/X-speciation.allele_counts.txt.gz', sep = ' ')
df

Unnamed: 0,AOcol,BFcol,BFgam,CIcol,CMgam,FRgam,GAgam,GHcol,GHgam,GM,GNcol,GNgam,GQgam,GW,KE,UGgam
0,1560,1482,1840,1411,5940,480,1380,1082,240,1300,80,800,180,1820,960,2240
1,1560,1500,1840,1420,5940,480,1380,1100,240,1300,80,782,180,1820,960,2240
2,1560,1500,1831,1420,5940,480,1380,1100,240,1300,80,800,180,1820,960,2240
3,1560,1500,1822,1420,5904,480,1380,1100,240,1300,80,800,180,1820,960,2231
4,1560,13614,1795,1339,5893,480,1380,9218,240,1291,80,800,180,1793,6234,2240
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
144,1560,1500,1804,1420,5895,480,1380,1100,231,1291,80,791,180,1820,960,2231
145,1560,1500,1840,1420,5940,480,1380,1100,240,1300,80,800,126,1820,960,2240
146,1560,1500,1840,1420,5931,480,1380,1100,240,1300,80,800,180,1820,960,2240
147,1560,1500,1840,1420,5940,480,1380,1100,240,1300,80,800,180,1820,960,2231


In [13]:
df.to_csv('x_spec.txt',index=False, sep=" ")

## Treemix

Total SNPs per region:
- <b>3L-free</b>: 48664 SNPs
- <b>3R-free</b>: 66465 SNPs
- <b>X-free</b>: 8207 SNPs
- <b>X-speciation</b>: 149 SNPs

In [2]:
from IPython.display import Image

### 3L-free

### 3R-free

### X-free

### X-speciation