# Creating genotype matrices of historic surveillance SNPs

## Goal

Create matrices of the 24 SNP barcode from Daniels et al. 2008 and the 96 SNP barcode from Lo et al. 2018. These updated matrices can be used in transmission models.

In [1]:
# load packages - have been installed in vcf virtual env
import os, sys
import zarr
import numpy as np
import pandas as pd
import allel  
import dask.array as da

## Load literature position files

These files were generated from tables in the papers since accompanying files with the chromosome and position were not available. Additional file 4 from Daniels 2008 was manually copied. Table 1 from Lo 2018 was downloaded as .jpg, then converted to PDF and from PDF to Excel. The resulting table was manually checked and corrected for chromosome and position errors that occurred during the conversion.

In [2]:
pos_dir = '/mnt/c/Users/jribado/Dropbox (IDM)/Data, Dynamics, and Analytics Folder/Projects/malaria_barcode/pfv6_barcodeSubset'
daniels_df = pd.read_csv(os.path.join(pos_dir, "2008Daniels_BarcodePositions.txt"), sep='\t', header=0)
lo_df      = pd.read_csv(os.path.join(pos_dir, "2018Lo_BarcodePositions.txt"), sep='\t', header=0)

In [3]:
# create a list with the chromosome and position for each study
daniels_pos = [x + "_" + str(y) for x, y in zip(["{:02d}".format(x) for x in daniels_df["chr"]], daniels_df["position"].tolist())]
print(len(daniels_pos))
daniels_pos[1:6]

24


['01_539044', '02_842803', '04_282592', '05_9311601', '06_145472']

In [4]:
lo_pos = [x + "_" + str(y) for x, y in zip(["{:02d}".format(x) for x in lo_df["Chromosome"]], lo_df["Position"].tolist())]
print(len(lo_pos))
lo_pos[1:6]

96


['01_107988', '01_388876', '01_497218', '01_506357', '01_565523']
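The same key-building comprehension is repeated for each study, so it could be factored into a small helper. The sketch below is one possible way to do that; `make_keys` is a hypothetical name, and the toy values are made up for illustration rather than taken from the position files. It assumes the chromosome and position columns hold integer-like values.

```python
def make_keys(chroms, positions):
    """Return 'CC_POS' strings with the chromosome zero-padded to two digits."""
    return ["{:02d}_{}".format(int(c), int(p)) for c, p in zip(chroms, positions)]


# Toy input (illustrative values only, not from the barcode position files)
example_keys = make_keys([1, 2, 14], [1000, 20500, 333])
print(example_keys)
```

Under those assumptions, a call such as `make_keys(daniels_df["chr"], daniels_df["position"])` would build the same kind of list as the cells above.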

## Load variant file 

In [33]:
# read in zarr files
zarr_path='/mnt/c/Users/jribado/Dropbox (IDM)/Data, Dynamics, and Analytics Folder/Projects/Pfv6 Malaria/Pf_6.zarr.zip'
callset = zarr.open(zarr_path, mode='r')

In [37]:
# get the chromosome and position for all possible variants
zarr_chrom = [x.split("_",2)[1] for x in callset['variants/CHROM'][:].tolist()]
zarr_pos = callset['variants/POS'][:].tolist()
zarr_index = [x + "_" + str(y) for x, y in zip(zarr_chrom, zarr_pos)]
print(len(zarr_index))
zarr_index[1:6]

6051696


['01_37', '01_58', '01_65', '01_72', '01_79']

## Find index overlaps

There are ~6 million variant positions in the Pfv6 release, each with a variant present in at least one of the 7,113 samples. The SNPs in both studies should overlap with these positions - especially if the study samples are a part of the consortium. 

### Daniels 2008

In [55]:
len(list(set(zarr_index)&set(daniels_pos)))

5

### Lo 2018

In [56]:
len(list(set(zarr_index)&set(lo_pos)))

25

That's not great. 
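The counts above say how many barcode positions were found, but not which ones are missing. A minimal sketch of one way to list both groups is shown here; `split_by_presence` is a hypothetical helper and the keys are toy values, not real barcode positions.

```python
def split_by_presence(barcode_keys, callset_keys):
    """Split barcode keys into those found in the callset and those missing.

    The order of the barcode keys is preserved in both returned lists.
    """
    callset_set = set(callset_keys)  # set lookup keeps this fast for millions of keys
    found = [k for k in barcode_keys if k in callset_set]
    missing = [k for k in barcode_keys if k not in callset_set]
    return found, missing


# Toy keys (illustrative only)
toy_callset = ["01_37", "01_58", "01_65", "02_100"]
toy_barcode = ["01_58", "02_100", "03_999"]
found, missing = split_by_presence(toy_barcode, toy_callset)
print("found:", found)
print("missing:", missing)
```

Applied to the notebook's variables, this would be called as `split_by_presence(daniels_pos, zarr_index)` or `split_by_presence(lo_pos, zarr_index)`.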

### Overlap discrepancy hypotheses

1. The studies are using different reference genomes, which would change the position of SNPs.
    - Likely - check papers for reference genome used to call SNPs.
    - Update: Daniels 2008 coordinates reflect the 3D7 genome from PlasmoDB v5. Lo 2018 uses samples from 5 study sites in western Kenya with a 126 SNP panel from Neafsey et al. 2008, which also likely used the 3D7 v5 genome available at the time. When checking PlasmoDB, the only 3D7 genome available matches the one used for Pfv6 read alignment.
2. The samples used in the studies to find SNPs are not in MalariaGEN, and the other samples in Pfv6 from the same population have different genomic structure at these sites. 
    - Likely - chosen SNPs are the result of available samples at the time and could reflect population-specific SNPs. 
    - Update: Lo 2018 finds SNPs specific to western Kenya; Daniels 2008 SNPs in this cohort are relatively homogeneous. It may be that these SNPs are specific to this population and therefore not represented in Pfv6 samples. 
3. Genomic structure around these positions makes it difficult to unambiguously map reads to the region, and the mapping isn't good enough to call variants at these sites. 
    - Less likely - there are 7k samples, and there is no filtering at the moment for low quality SNPs.   
4. The indexes do not match between the file types - both should use 1-based indexing. 
    - Check by adding one to the VCF position before matching. If all, or nearly all, positions match then it's an indexing error. 
    - Update: Only 2 additional SNPs matched between files, which is a function of genome diversity, not a systematic error in calling positions. 
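The indexing check in hypothesis 4 can be sketched as a comparison of overlap counts under different position offsets. This is an illustrative sketch, not the code used to produce the update above: `overlap_by_offset` is a hypothetical helper, positions are represented as `(chromosome, position)` tuples instead of the string keys used elsewhere in this notebook, and the toy values are made up.

```python
def overlap_by_offset(barcode, callset, offsets=(-1, 0, 1)):
    """Count barcode positions found in the callset after shifting callset positions.

    barcode, callset: iterables of (chrom, pos) tuples with integer positions.
    Returns a dict mapping each offset to the number of matching positions.
    """
    barcode_set = set(barcode)
    counts = {}
    for off in offsets:
        shifted = {(chrom, pos + off) for chrom, pos in callset}
        counts[off] = len(barcode_set & shifted)
    return counts


# Toy data: two callset positions are one less than the barcode positions
toy_callset = [("01", 99), ("01", 199), ("02", 50)]
toy_barcode = [("01", 100), ("01", 200), ("02", 50)]
counts = overlap_by_offset(toy_barcode, toy_callset)
print(counts)
```

If an off-by-one error were the cause, the count at offset `1` (or `-1`) would cover nearly all barcode positions, while the count at offset `0` would stay low.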

## Pf3k sanity check

Checking whether population structure differences affect the ability to call a SNP (hypothesis 2) is difficult when the samples are not shared across studies. Missing variants could be further confounded by updates to the variant calling pipeline.

Pf3k samples are a subset of the Pfv6 samples, which makes them a good sanity check for at least the latter point: the effect of the updated pipeline on the number of variants called. Variant calls were run individually for each Pfv6 sample, meaning that the addition of more samples should not have reduced the confidence of calls at individual positions due to population differences. The number of variants should remain roughly equal between the two versions, assuming only minor updates to the variant calling algorithm. 

(Due to some memory storage issues, I cannot read in each chromosome from Pf3k. These files are stored on the P:/ drive and cannot be accessed from the Linux subsystem installed on Windows that this notebook is currently running from. I am checking overlaps on the first chromosome.)  

In [30]:
# read in variant files for Pf3k, one zarr store per chromosome (01-14)
chrom_ids = ["{:02d}".format(item) for item in range(1, 15)]
pf3k_chrom, pf3k_pos, pf3k_variant_keep = [], [], []
for chrom_id in chrom_ids:
    pf3k_zarr = "/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_" + chrom_id + "_v3.zarr"
    print(pf3k_zarr)
    pf3k_callset = zarr.open(pf3k_zarr, mode='r')
    # keep the two-digit chromosome number from the CHROM name (second "_"-separated field)
    pf3k_chrom.extend([name.split("_", 2)[1] for name in pf3k_callset['variants/CHROM'][:]])
    pf3k_pos.extend(pf3k_callset['variants/POS'][:])
    # flag high quality variants: passes filters, is a SNP, VQSLOD > 6, and biallelic
    quality_set = pf3k_callset['variants/FILTER_PASS'][:]
    snp_set = pf3k_callset['variants/is_snp'][:]
    vqslod_set = pf3k_callset['variants/VQSLOD'][:] > 6
    biallelic_set = pf3k_callset['variants/numalt'][:] == 1
    pf3k_variant_keep.extend(quality_set & snp_set & vqslod_set & biallelic_set)
    

/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_01_v3.zarr
/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_02_v3.zarr
/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_03_v3.zarr
/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_04_v3.zarr
/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_05_v3.zarr
/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_06_v3.zarr
/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_07_v3.zarr
/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_08_v3.zarr
/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_09_v3.zarr
/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_10_v3.zarr
/mnt/c/Users/jribado/OneDrive - IDMOD/Malaria/pf3k_zarr_format/SNP_INDEL_Pf3D7_11_v3.zarr
/mnt/c/Use

In [24]:
pf3k_index = [x + "_" + str(y) for x, y in zip(pf3k_chrom, pf3k_pos)]

In [27]:
len(list(set(pf3k_index)&set(daniels_pos)))

4

In [58]:
pf3k_callset.tree(expand=True)

In [31]:
np.count_nonzero(pf3k_variant_keep)

247496

In [41]:
# get high quality snps for pfv6 to find overlaps
quality_set = callset['variants/FILTER_PASS'][:]
snp_set = callset['variants/is_snp'][:]
vsqlod_set = callset['variants/VQSLOD'][:]  > 6
biallelic_set = callset['variants/numalt'][:] == 1
variant_keep = quality_set & snp_set & vsqlod_set & biallelic_set 
filt_n = np.count_nonzero(variant_keep)
print("Number of filtered variants:", filt_n)

Number of filtered variants: 89578
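The same four filters are applied to both Pf3k and Pfv6, so they could be wrapped in one function. The sketch below assumes the callset behaves like a mapping from field names to arrays that support `[:]`, which holds for the toy dictionary used here and is how the zarr callsets are accessed above. `high_quality_mask` and `vqslod_min` are hypothetical names, and the toy values are made up.

```python
import numpy as np


def high_quality_mask(cs, vqslod_min=6):
    """Boolean mask of variants that pass filters, are SNPs, exceed the VQSLOD cutoff, and are biallelic."""
    filter_pass = np.asarray(cs['variants/FILTER_PASS'][:], dtype=bool)
    is_snp = np.asarray(cs['variants/is_snp'][:], dtype=bool)
    high_vqslod = np.asarray(cs['variants/VQSLOD'][:]) > vqslod_min
    biallelic = np.asarray(cs['variants/numalt'][:]) == 1
    return filter_pass & is_snp & high_vqslod & biallelic


# Toy callset: only the first variant satisfies all four conditions
toy_callset = {
    'variants/FILTER_PASS': np.array([True, True, False, True, True]),
    'variants/is_snp':      np.array([True, False, True, True, True]),
    'variants/VQSLOD':      np.array([7.5, 9.0, 8.0, 2.0, 8.0]),
    'variants/numalt':      np.array([1, 1, 1, 1, 2]),
}
mask = high_quality_mask(toy_callset)
print(mask, np.count_nonzero(mask))
```

Keeping the cutoff as a parameter makes it easier to see how sensitive the variant counts are to the VQSLOD threshold.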


In [73]:
pf3k_match=pd.DataFrame(list(zip(pf3k_index, pf3k_variant_keep)), columns=['pos', 'hq'])
pf3k_match.insert(0, 'id', range(0, len(pf3k_match)))
pf3k_keep = pf3k_match[pf3k_match.hq == True].id.values
pf3k_keep_list = [pf3k_index[i] for i in pf3k_keep]

In [71]:
pfv6_match=pd.DataFrame(list(zip(zarr_index, variant_keep)), columns=['pos', 'hq'])
pfv6_match.insert(0, 'id', range(0, len(pfv6_match)))
pfv6_keep = pfv6_match[pfv6_match.hq == True].id.values
pfv6_keep_list = [zarr_index[i] for i in pfv6_keep]

In [74]:
len(list(set(pf3k_keep_list)&set(pfv6_keep_list)))

35696
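The high quality keys can also be selected directly with a boolean mask, without building an intermediate DataFrame. This is a minimal sketch of that alternative, assuming the key list and the mask have the same length and order; `keys_passing` is a hypothetical helper and the toy keys are illustrative only.

```python
import numpy as np


def keys_passing(keys, mask):
    """Return the set of keys whose corresponding mask entry is True."""
    keys = np.asarray(keys)
    mask = np.asarray(mask, dtype=bool)
    if keys.shape != mask.shape:
        raise ValueError("keys and mask must have the same length")
    return set(keys[mask].tolist())


# Toy data standing in for the two releases
toy_a_keys = ["01_37", "01_58", "01_65"]
toy_a_mask = [True, False, True]
toy_b_keys = ["01_37", "01_65", "01_72"]
toy_b_mask = [False, True, True]

shared = keys_passing(toy_a_keys, toy_a_mask) & keys_passing(toy_b_keys, toy_b_mask)
print(shared)
```

With the notebook's variables, the equivalent call would be `keys_passing(pf3k_index, pf3k_variant_keep) & keys_passing(zarr_index, variant_keep)`.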

variant_keep[1:4]