In [1]:
import numpy as np
import pandas as pd; pd.set_option('display.max_rows', 10000)
import allel 
import zarr
from IPython.display import HTML, display

In [2]:
samples = pd.read_csv("../../data/samples.meta.txt",  sep='\t')

## Estimates of recent effective population size (*Ne*) on the Ag1000g data
##### Sanjay C Nagi

In preparation for the LLINEUP trial data, and with a bit of spare time, I began analysing effective population size in the Ag1000g populations. I now have a snakemake pipeline that applies LDNe and IBDNe to *Ag* WGS data, and I have also calculated *Ne* from theta=4NeMu. *I have not yet attempted to fully interpret the results*. 


### LDNe

Effective population size was estimated in NeEstimator v2.1 (Do *et al*., 2014), using the linkage disequilibrium method (LDNe) of Waples and Do (2008). This method is widely used (it is the most widely cited of all methods to estimate *Ne*) and generally performs robustly in comparisons between single-sample estimators (see references for further reading).

In [3]:
Ne = pd.read_csv("Ne_analyses.LDNe", sep="\t")

In [4]:
Ne['Ne_estimate'] = Ne['Ne_estimate'].replace("Infinite", np.inf).astype(float) #change string to np.inf to allow sorting
Ne[Ne['AF'] == 'minAF_0.05'].drop(columns=['AF', 'independent_comparisons', 
                                           'overall_r^2', 'expected_r^2', 
                                           'Parametric_CI_lower', 'Parametric_CI_upper']).sort_values(by='Ne_estimate')

Unnamed: 0,chrom,pop,sample_size,Ne_estimate,Jackknife_CI_lower,Jackknife_CI_upper
56,3L,KE,47.9,2.4,2.0,2.9
60,3R,KE,47.9,2.8,2.5,3.1
72,3L,FRgam,23.9,100.0,74.5,148.9
76,3R,FRgam,24.0,110.2,87.3,147.7
12,3R,GHgam,12.0,117.7,34.1,Infinite
116,3R,GNcol,4.0,128.5,18.6,Infinite
88,3L,AOcol,77.9,247.6,134.5,969.3
8,3L,GHgam,12.0,279.1,98.9,Infinite
100,3R,GAgam,69.0,298.1,246.7,374.4
92,3R,AOcol,78.0,302.8,165.8,1198.5


Qualitatively, we can see that the largest populations, at least in terms of effective population size, are *coluzzii* from Burkina Faso (BFcol) and Côte d'Ivoire (CIcol), as well as *gambiae* from Uganda (UGgam) and Burkina Faso (BFgam).

The Guinea-Bissau population (GW) also displays a high *Ne*, which we might expect - overall LD will surely be lower in a hybrid population.

Ignoring Bioko and Guinean *coluzzii* because of their small sample sizes (9 and 4) and infinite estimates, the smallest effective population sizes are found in the Kenyan population (KE) and *gambiae* from Mayotte (FRgam), as we would expect, but also in the Ghanaian *gambiae* population (GHgam).

In phase 1, it was noted that the Gabon (GAgam) and Angolan (AOcol) populations had reduced diversity and more extensive LD, indicating smaller *Ne* than other West African populations, in agreement with the LDNe estimates above.

#### Notes 

- Values of infinity can result when the correction for sampling error exceeds the observed LD - the bias-corrected estimate of Ne is then negative, and is reported as infinite. This is observed in Bioko (GQgam) and Guinean *coluzzii* (GNcol), and is possibly contributing to Kenya's minuscule Ne estimates.
- Infinity may also result from an effective population size that is too large to estimate with the given sample size. For example, we get an infinity value for the point estimate of Burkina Faso *coluzzii* (BFcol). In this case, the lower CI is informative. <br>
- I have excluded the parametric CIs, which seemed far too narrow. Both methods of producing confidence intervals, parametric and jackknife, may be suboptimal according to Jones *et al*. (2016). Unfortunately, the methods they suggest have not yet been implemented in any usable manner.
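
To illustrate how an infinite estimate can arise, here is a minimal sketch using the simple large-sample approximation for unlinked loci under random mating, r^2 ≈ 1/(3Ne) + 1/S, where S is the sample size. This is not the empirically bias-corrected calculation that NeEstimator applies; the function name and the input values are hypothetical.

```python
import math

def ldne_simple(r2_observed, sample_size):
    """Simplified LD-based Ne for unlinked loci: r2 ~ 1/(3*Ne) + 1/S."""
    # Subtract the LD expected from sampling a finite number of individuals.
    r2_drift = r2_observed - 1.0 / sample_size
    if r2_drift <= 0:
        # All observed LD is attributable to sampling error: the formula
        # would give a negative (or undefined) Ne, reported as infinity.
        return math.inf
    return 1.0 / (3.0 * r2_drift)

print(ldne_simple(0.05, 50))    # about 11.1
print(ldne_simple(0.005, 100))  # inf
```

With small samples the 1/S term is large and noisy, so small fluctuations in the observed r^2 can push the corrected value to zero or below.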

### IBDNe 

IBDNe estimates effective population sizes in the recent past based on the size and abundance of shared segments of IBD in a population (Browning & Browning, 2015). A high abundance of IBD tracts would indicate a small *Ne*, and the length of those IBD tracts would indicate when that small *Ne* occurred (as they are whittled down in size by recombination over time).

- The full estimates range from 0-300 generations ago. However, the authors suggest that the IBD tracts are informative from 4 to approximately 200 generations ago. 
- Some of the predicted histories of effective population size are particularly volatile, and the confidence intervals vary in size massively. <br>
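
As a rough illustration of why tract length carries timing information: two haplotypes that share a common ancestor g generations ago are separated by 2g meioses, so IBD segment lengths are approximately exponentially distributed with a mean of about 100/(2g) cM. The sketch below tabulates this for a few ages. It is a back-of-envelope approximation, not part of IBDNe itself, and the function name is hypothetical.

```python
def expected_ibd_length_cm(generations):
    """Approximate mean IBD segment length (cM) for a common ancestor
    `generations` ago, assuming 2 * generations meioses separate the pair."""
    return 100.0 / (2.0 * generations)

for g in (4, 25, 50, 200):
    print(f"{g:>3} generations: ~{expected_ibd_length_cm(g):.2f} cM")
```

Under this approximation, tracts from 200 generations ago average only about 0.25 cM, which gives a sense of why older time points are harder to resolve.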

The IBDNe plots are in the attached PDFs. A couple of populations are missing because of small sample sizes. The harmonic means are as follows, and seem to agree roughly with the LDNe data:

In [5]:
pd.read_csv("ibdne/IBDNe_Harmonic_means.txt", sep="\t").round()

Unnamed: 0,Population,IBDNe_harmonic_means
0,KE,2354.0
1,FRgam,18747.0
2,GAgam,20470.0
3,CIcol,25168.0
4,GM,27957.0
5,AOcol,30554.0
6,BFgam,47377.0
7,UGgam,48750.0
8,GHcol,98537.0
9,GNgam,176859.0
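
For reference, a harmonic mean over a per-generation *Ne* trajectory can be computed as sketched below. The column names (`GEN`, `NE`), the toy values, and the generation window are assumptions for illustration; they are not taken from the pipeline that produced the table above.

```python
import pandas as pd

def harmonic_mean_ne(trajectory, min_gen=4, max_gen=200):
    """Harmonic mean of Ne over generations min_gen..max_gen (inclusive)."""
    in_window = (trajectory["GEN"] >= min_gen) & (trajectory["GEN"] <= max_gen)
    window = trajectory[in_window]
    # Harmonic mean = n / sum(1 / x); it is dominated by the smallest values.
    return len(window) / (1.0 / window["NE"]).sum()

# Toy trajectory: only generations 4, 5 and 6 fall inside the window.
toy = pd.DataFrame({
    "GEN": [0, 4, 5, 6, 300],
    "NE": [1e6, 1000.0, 2000.0, 4000.0, 1e6],
})
print(round(harmonic_mean_ne(toy)))  # 1714
```

Because the harmonic mean is pulled towards the smallest values, a single bottleneck generation can lower it substantially.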


### Estimating Ne from theta=4NeMu

As all populations are assumed to share the same mutation rate, the Ne estimates here are simply a function of diversity (Watterson's theta in this case, though I have seen others use pi).
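
As a worked illustration of this relationship: Watterson's estimator divides the number of segregating sites S by the harmonic number a_n = 1 + 1/2 + ... + 1/(n-1), where n is the number of sampled chromosomes, and then by the number of sites considered to give a per-site value. Ne follows as theta/(4*mu). The sketch uses made-up counts and hypothetical function names; the analysis itself uses `allel.watterson_theta`.

```python
import numpy as np

def watterson_theta_per_site(n_segregating, n_chromosomes, n_sites):
    """Per-site Watterson's theta from a count of segregating sites."""
    a_n = np.sum(1.0 / np.arange(1, n_chromosomes))  # 1 + 1/2 + ... + 1/(n-1)
    return n_segregating / a_n / n_sites

def ne_from_theta(theta, mu=3.5e-9):
    """Rearrangement of theta = 4 * Ne * mu."""
    return theta / (4 * mu)

# Toy example: 11 segregating sites among 4 chromosomes over 1,000 sites.
theta = watterson_theta_per_site(n_segregating=11, n_chromosomes=4, n_sites=1000)
print(theta)                 # approximately 0.006
print(ne_from_theta(theta))  # approximately 428571
```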

In [15]:
pops = samples.population.unique()
chroms = ['3L', '3R']

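mu = 3.5e-9  # mutation rate per site per generation (see Methods)

# per-chromosome estimates for the population currently being processed
# (note: this rebinds Ne, which held the LDNe table in the cells above)
Ne = dict()
Ne_pi = dict()
# per-population estimates: {population: {chromosome: Ne}}
Ne_Ag = dict()
Ne_Ag_pi = dict()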
        
for pop in pops:
    for chrom in chroms:
        
        Ag_array  = zarr.open_array(f"/home/sanj/ag1000g/data/ag1000g.phase2.ar1.pass/{chrom}/calldata/GT/")
        pos  = zarr.open_array(f"/home/sanj/ag1000g/data/ag1000g.phase2.ar1.pass/{chrom}/variants/POS")
        geno = allel.GenotypeChunkedArray(Ag_array)
        print(f"-------------------  Arrays loaded {pop} -----------------")
     
        pop_bool = samples.population == pop
        pop_geno = geno.compress(pop_bool, axis=1)
        
        print(f"Counting alleles {pop} {chrom}")
        ac = pop_geno.count_alleles()
        print("Computing theta")
        theta = allel.watterson_theta(pos, ac)
        pi = allel.sequence_diversity(pos, ac)
        print('done')
        Neff = theta/(4*mu)
        Neff_pi = pi/(4*mu)
        
        Ne[chrom] = Neff
        Ne_pi[chrom] = Neff_pi
    
    Ne_Ag[pop] = dict(Ne)
    Ne_Ag_pi[pop] = dict(Ne_pi)

Ne_theta = pd.DataFrame.from_dict(Ne_Ag).T
Ne_pi = pd.DataFrame.from_dict(Ne_Ag_pi).T
Ne_theta.columns = ['3L_theta', '3R_theta']
Ne_pi.columns = ['3L_pi', '3R_pi']
Ne_Ag = pd.concat([Ne_theta, Ne_pi], axis=1)

Ne_Ag.round().to_csv("Ne_theta_pi_Ag.csv", index=True)

-------------------  Arrays loaded GHcol -----------------
Counting alleles GHcol 3L
Computing theta
done
-------------------  Arrays loaded GHcol -----------------
Counting alleles GHcol 3R
Computing theta
done
-------------------  Arrays loaded GHgam -----------------
Counting alleles GHgam 3L
Computing theta
done
-------------------  Arrays loaded GHgam -----------------
Counting alleles GHgam 3R
Computing theta
done
-------------------  Arrays loaded BFgam -----------------
Counting alleles BFgam 3L
Computing theta
done
-------------------  Arrays loaded BFgam -----------------
Counting alleles BFgam 3R
Computing theta
done
-------------------  Arrays loaded BFcol -----------------
Counting alleles BFcol 3L
Computing theta
done
-------------------  Arrays loaded BFcol -----------------
Counting alleles BFcol 3R
Computing theta
done
-------------------  Arrays loaded UGgam -----------------
Counting alleles UGgam 3L
Computing theta
done
-------------------  Arrays loaded UGgam -----

In [17]:
pd.read_csv("Ne_theta_pi_Ag.csv", index_col=0)

Unnamed: 0,3L_theta,3R_theta,3L_pi,3R_pi
GHcol,870489.0,952726.0,522127.0,569754.0
GHgam,735458.0,816605.0,538470.0,595453.0
BFgam,1303007.0,1397081.0,545999.0,600311.0
BFcol,1126854.0,1242037.0,529589.0,577963.0
UGgam,1230586.0,1321118.0,543128.0,596927.0
GM,867399.0,945927.0,550725.0,598330.0
GW,1154793.0,1259574.0,558378.0,608367.0
KE,220627.0,237080.0,353424.0,372746.0
CMgam,1645481.0,1767792.0,546319.0,601318.0
FRgam,287884.0,321054.0,360686.0,403933.0


## Questions for Martin/Dave/Eric/others

Should this work remain as a 2-pager, sent to those who would be interested? 

Or is it worth producing a standalone paper, perhaps by either.... 

- Adding analyses into runs of homozygosity (ROH), and comparing ROH profiles between populations. (*I have since realised this was pretty much done in original phase 1 paper, though only discussed for Kenya*).

- Writing a review of studies of effective population size in *Anopheles* mosquitoes, and including these analyses as part of that.

---------
And what is the best way to present this data? ... lots of populations, multiple chromosomes etc. 

## Methods
The methods need checking - as this was initially just exploratory, I made rapid decisions regarding what might be appropriate.

- The analysis has been done on whole populations - however, many populations are from multiple sites in the same country, whereas some are just one site. How will this affect analyses? One might certainly expect it to affect IBDNe estimates.

#### LDNe
- Chromosomes 3L and 3R were chosen for analysis to avoid major inversions, in agreement with the IBDNe analysis conducted by the Kern lab in Phase 1. 
- Pericentromeric regions of low recombination were removed - the exact values I used were slightly arbitrary and based on plots in the phase 1 paper. 
- SNPs were restricted to non-coding regions - a better option might be to restrict to sites at least some distance x away from coding regions (though I doubt this will modify the results significantly).
- 10,000 random SNPs were used for each population, and each population has a different random selection, rather than the same 10,000 SNP positions. Is it more appropriate to use the same 10,000? I don't know.
- MAFs of 0, 0.01, 0.02, 0.05 were all tested with LDNe. I have presented 0.05 here, as using the lower MAF thresholds gives a larger Ne estimate but with many more infinity values.
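
If the same SNP positions were to be used for every population, one option is to draw a single seeded subsample and reuse it, as sketched below. The seed, the function name, and the toy number of candidate SNPs are assumptions; whether a shared or a per-population subsample is more appropriate remains the open question raised above.

```python
import numpy as np

def sample_snp_indices(n_candidates, n_keep=10_000, seed=42):
    """Draw n_keep distinct variant indices, reproducibly, in sorted order."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(n_candidates, size=n_keep, replace=False)
    return np.sort(idx)

# One shared draw, to be reused when subsetting each population's
# genotype array along the variants axis.
shared_idx = sample_snp_indices(n_candidates=1_000_000)
print(len(shared_idx), shared_idx[:5])
```

One caveat with a shared subsample is that a site passing the MAF threshold in one population may be rare or fixed in another, so the number of usable SNPs would then differ between populations.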

--------
#### IBDNe
- Chromosomes 3L and 3R were chosen as above.
- Pericentromeric regions of low recombination were removed as above. 
- That's it

-------
#### Theta=4NeMu
- I have used Watterson's theta across the accessible regions of the whole chromosome (not restricted to non-coding regions as above)
- mu=3.5e-9 (the *Drosophila* estimate of Keightley *et al*., 2009, as used in Miles *et al*., 2017, supplementary text)
- Ne=theta/(4*mu)

### References

TODO