In [2]:
import numpy as np
import pandas as pd; pd.set_option('display.max_rows', 10000)
import allel 
import matplotlib.pyplot as plt
import zarr
import h5py
import seaborn as sns
from sklearn import metrics
from tqdm import tqdm

In [3]:
%run "~/ag1000g/selective_sweeps/scripts/sweeps_functions.py"
samples = pd.read_csv("../../data/samples.meta.txt", sep='\t')

### Ne estimates from Ag1000g populations (LDNe)

In preparation for the LLINEUP data, and with some spare time in the first few months of my PhD, I began analysing the effective population size (Ne) of the Ag1000g populations. I now have a functional snakemake pipeline that applies LDNe and IBDNe to genomic data, and I have run it on these populations. I have also calculated Ne from theta=4NeMu.

Estimates of effective population size were obtained with NeEstimator v2.1 (Do *et al*., 2014), using the linkage disequilibrium method (LDNe) of Waples and Do (2008). This method is widely used and generally performs robustly in comparisons between single-sample estimators (see references for further reading).

In [11]:
Ne = pd.read_csv("Ne_analyses.LDNe", sep="\t")
Ne[Ne['AF'] == 'minAF_0.05']

Unnamed: 0,AF,chrom,pop,sample_size,independent_comparisons,overall_r^2,expected_r^2,Ne_estimate,Parametric_CI_lower,Parametric_CI_upper,Jackknife_CI_lower,Jackknife_CI_upper
0,minAF_0.05,3L,GHcol,55.0,6465956,0.019633,0.019242,851.2,806.9,900.7,338.5,Infinite
4,minAF_0.05,3R,GHcol,55.0,6705702,0.019729,0.01924,680.4,652.1,711.2,224.3,Infinite
8,minAF_0.05,3L,GHgam,12.0,13844967,0.109337,0.10824,279.1,259.7,301.6,98.9,Infinite
12,minAF_0.05,3R,GHgam,12.0,13760829,0.110819,0.10824,117.7,114.0,121.7,34.1,Infinite
16,minAF_0.05,3L,BFgam,92.0,2357246,0.011382,0.011249,2503.3,2167.7,2960.1,1317.0,21833.9
20,minAF_0.05,3R,BFgam,92.0,2520667,0.011443,0.011247,1699.6,1541.9,1892.6,1040.1,4508.3
24,minAF_0.05,3L,BFcol,75.0,3364929,0.0139,0.013903,Infinite,17897.2,Infinite,10892.7,Infinite
28,minAF_0.05,3R,BFcol,75.0,3173472,0.013897,0.013902,Infinite,19659.2,Infinite,9299.5,Infinite
32,minAF_0.05,3L,UGgam,112.0,2733796,0.009316,0.009184,2515.7,2249.7,2851.9,1960.6,3500.0
36,minAF_0.05,3R,UGgam,112.0,2759529,0.009329,0.009184,2304.8,2080.2,2583.0,1798.6,3198.8


Be aware that both methods of producing confidence intervals, the parametric and the pseudo-jackknife, may be suboptimal according to Jones *et al*. (2016). Unfortunately, the methods they suggest have not yet been implemented in any usable form.
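
To see roughly how the `Ne_estimate` column relates to the two r^2 columns, here is a minimal sketch. It assumes the first-order approximation for unlinked loci, in which the r^2 attributable to drift is about 1/(3Ne) once the r^2 expected from finite sample size has been subtracted. The function name `ld_ne_approx` is hypothetical, and this is not NeEstimator's exact calculation, which applies further bias corrections.

```python
import math

def ld_ne_approx(overall_r2, expected_r2):
    """Approximate Ne from mean r^2 between unlinked loci.

    Assumes r2_drift ~= 1 / (3 * Ne), where r2_drift is the observed r^2
    minus the r^2 expected from sampling alone. This is a simplification,
    not the exact LDNe computation.
    """
    r2_drift = overall_r2 - expected_r2
    if r2_drift <= 0:
        # Sampling noise accounts for all observed LD: no finite estimate
        return math.inf
    return 1.0 / (3.0 * r2_drift)

# Values copied from the minAF_0.05 table
print(ld_ne_approx(0.019633, 0.019242))  # GHcol 3L
print(ld_ne_approx(0.013900, 0.013903))  # BFcol 3L
```

For GHcol 3L this gives a value of roughly 850, close to the reported 851.2. For BFcol the observed r^2 is below the sampling expectation, which is why the estimate is reported as `Infinite`.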

### IBDNe 

### Estimating Ne from theta=4NeMu

In [9]:
pops = samples.population.unique()
chroms = ['3L', '3R']

# Mutation rate per site per generation (Drosophila estimate, see Methods)
mu = 3.5e-9

# Ne_Ag maps population -> {chromosome -> Ne}
Ne_Ag = dict()

for pop in pops:
    Ne = dict()  # per-chromosome estimates for this population
    for chrom in chroms:

        Ag_array = zarr.open_array(f"/home/sanj/ag1000g/data/ag1000g.phase2.ar1.pass/{chrom}/calldata/GT/")
        pos = zarr.open_array(f"/home/sanj/ag1000g/data/ag1000g.phase2.ar1.pass/{chrom}/variants/POS")
        geno = allel.GenotypeChunkedArray(Ag_array)
        print(f"-------------------  Arrays loaded {pop} -----------------")

        # Keep only the samples that belong to this population
        pop_bool = samples.population == pop
        pop_geno = geno.compress(pop_bool, axis=1)

        print(f"Counting alleles {pop} {chrom}")
        ac = pop_geno.count_alleles()
        print("Computing theta")
        # Watterson's theta over the whole chromosome
        theta = allel.watterson_theta(pos, ac)
        print('done')
        # theta = 4*Ne*mu, so Ne = theta / (4*mu)
        Neff = theta/(4*mu)

        Ne[chrom] = Neff

    Ne_Ag[pop] = Ne

-------------------  Arrays loaded ------------------------
Counting alleles GHcol 3L
Computing theta
done
-------------------  Arrays loaded ------------------------
Counting alleles GHcol 3R
Computing theta
done
-------------------  Arrays loaded ------------------------
Counting alleles GHgam 3L
Computing theta
done
-------------------  Arrays loaded ------------------------
Counting alleles GHgam 3R
Computing theta
done
-------------------  Arrays loaded ------------------------
Counting alleles BFgam 3L
Computing theta
done
-------------------  Arrays loaded ------------------------
Counting alleles BFgam 3R
Computing theta
done
-------------------  Arrays loaded ------------------------
Counting alleles BFcol 3L
Computing theta
done
-------------------  Arrays loaded ------------------------
Counting alleles BFcol 3R
Computing theta
done
-------------------  Arrays loaded ------------------------
Counting alleles UGgam 3L
Computing theta
done
-------------------  Arrays loaded --

In [10]:
Ne_theta = pd.DataFrame.from_dict(Ne_Ag).T
Ne_theta.round()

population,3L,3R
GHcol,870489.0,952726.0
GHgam,735458.0,816605.0
BFgam,1303007.0,1397081.0
BFcol,1126854.0,1242037.0
UGgam,1230586.0,1321118.0
GM,867399.0,945927.0
GW,1154793.0,1259574.0
KE,220627.0,237080.0
CMgam,1645481.0,1767792.0
FRgam,287884.0,321054.0


#### Questions for Martin, Dave, Alistair etc

What is the best way to present these data, given that there are many populations and multiple chromosomes?

Should this work remain a 2-pager sent to those who would be interested, or is it worth producing a standalone paper, perhaps by:

- Adding analyses into runs of homozygosity (ROH), and comparing ROH profiles between populations

- Alternatively, writing a review of studies of effective population size in *Anopheles* mosquitoes, and including these analyses as part of it.

### Methods
The methods need checking: as this was initially just exploratory, I made quick guesses at the appropriate choices.

- The analysis has been done on whole populations - however, some populations are from multiple sites in the same country, whereas some are just one site. How will this affect analyses? One might certainly expect it to affect IBDNe estimates.

#### LDNe
- Chromosomes 3L and 3R were chosen for analysis to avoid major inversions, in agreement with the analysis conducted by the Kern lab in Phase 1. 
- Pericentromeric regions of low recombination were removed - the exact boundaries I used were somewhat arbitrary and based on plots in the Phase 1 paper.
- SNPs were restricted to non-coding regions - a better option might be to restrict to SNPs at least some fixed distance x away from coding regions (though I doubt this would change the results significantly).
- 10,000 random SNPs were used for each population, and each population has a different random selection, rather than the same 10,000 SNP positions.
- MAFs of 0, 0.01, 0.02, 0.05 were all tested with LDNe. I have presented 0.05 here, as using the lower MAFs gives a larger Ne estimate but with many more infinity values.
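
The SNP selection described in the last two bullets can be sketched as follows. This is an illustrative reconstruction, not the pipeline's actual code: the function `sample_snps`, its parameters, and the toy allele counts are all hypothetical, and it assumes biallelic allele counts are already available as a NumPy array.

```python
import numpy as np

def sample_snps(allele_counts, min_maf=0.05, n_snps=10_000, seed=None):
    """Return sorted indices of up to n_snps random SNPs with MAF >= min_maf.

    allele_counts: (n_variants, 2) array of ref/alt allele counts
    for a single population.
    """
    ac = np.asarray(allele_counts, dtype=float)
    total = ac.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Minor allele frequency; sites with no called alleles get 0
        maf = np.where(total > 0, ac.min(axis=1) / total, 0.0)
    passing = np.flatnonzero(maf >= min_maf)
    rng = np.random.default_rng(seed)
    n = min(n_snps, passing.size)
    # Sample without replacement, then restore genomic order
    return np.sort(rng.choice(passing, size=n, replace=False))

# Toy allele counts (hypothetical): MAFs are 0, 0.1, 0.5, 0.1, 0
toy_ac = np.array([[10, 0], [9, 1], [5, 5], [18, 2], [0, 10]])
idx = sample_snps(toy_ac, min_maf=0.05, n_snps=2, seed=0)
print(idx)
```

Calling this once per population with that population's own allele counts gives each population its own random SNP set, as described above. Using a shared set of positions instead would require filtering on MAF jointly across populations first.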

#### IBDNe
- Chromosomes 3L and 3R were chosen as above.
- Pericentromeric regions of low recombination were removed as above. 
- That's it.

#### Theta=4NeMu
- I used Watterson's theta across the whole chromosome (not restricted to non-coding regions as above).
- Ne = theta/(4*mu)
- mu = 3.5e-9 (Drosophila estimate from Keightley et al. 2009, following Miles et al. 2017, supplementary)
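
As a worked illustration of the two formulas above, here is a small sketch using the textbook definition of Watterson's theta (number of segregating sites divided by the harmonic number a_n, then by sequence length). The function names and the toy numbers are hypothetical. The notebook itself uses `allel.watterson_theta`, and this sketch is not meant to reproduce its exact handling of missing data or accessibility.

```python
def watterson_theta_per_base(n_segregating, n_chromosomes, n_bases):
    """Watterson's theta per base: S / a_n / L, with a_n = sum_{i=1}^{n-1} 1/i."""
    a_n = sum(1.0 / i for i in range(1, n_chromosomes))
    return n_segregating / a_n / n_bases

def ne_from_theta(theta, mu=3.5e-9):
    """Rearrange theta = 4 * Ne * mu to give Ne."""
    return theta / (4.0 * mu)

# Toy numbers, purely illustrative: 11 segregating sites among
# 4 sampled chromosomes over 1,000 bases
theta = watterson_theta_per_base(n_segregating=11, n_chromosomes=4, n_bases=1000)
ne = ne_from_theta(theta)
print(theta, ne)
```

Working backwards from the table, an Ne of about 870,000 (GHcol, 3L) with mu = 3.5e-9 corresponds to a per-base theta of about 0.012.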

### References

In [17]:
samples.groupby(['population', 'region']).agg('count')

population,region,ox_code,src_code,country,contributor,contact,year,m_s,sex,n_sequences,mean_coverage
AOcol,Luanda,78,78,78,78,0,78,78,78,78,78
BFcol,Bana,6,6,6,6,6,6,6,6,6,6
BFcol,Bana M,18,18,18,18,18,18,18,18,18,18
BFcol,Bana V,16,16,16,16,16,16,16,16,16,16
BFcol,Pala,10,10,10,10,10,10,10,10,10,10
BFcol,Sour,25,25,25,25,25,25,25,25,25,25
BFgam,Bana,17,17,17,17,17,17,17,17,17,17
BFgam,Bana M,2,2,2,2,2,2,2,2,2,2
BFgam,Bana V,1,1,1,1,1,1,1,1,1,1
BFgam,Pala,46,46,46,46,46,46,46,46,46,46
