# Creating matrices of SNPs from VCF files

## Goal

Learn to subset variants from Zarr files so they can be used in IDM models.

The Zarr format is similar to HDF5: it lets you subset large files without loading them entirely into memory. Variants called from whole genome sequencing of a population are typically stored in VCF format.
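
This notebook starts from Zarr files that already exist. For reference, one way to produce such a file from a VCF is scikit-allel's `vcf_to_zarr`. The sketch below assumes that route; the helper name and both paths are placeholders, not the files used here.

```python
def convert_vcf_to_zarr(vcf_path, zarr_path):
    """Convert one VCF file into a Zarr store (hypothetical helper)."""
    # imported inside the function so the sketch can be defined
    # even where scikit-allel is not installed
    import allel

    # fields='*' requests every available variants/ and calldata/ field;
    # overwrite=True replaces arrays that already exist in the store
    allel.vcf_to_zarr(vcf_path, zarr_path, fields='*', overwrite=True)


# placeholder paths, not the files used in this notebook:
# convert_vcf_to_zarr('example_chr07.vcf.gz', 'example_chr07.zarr')
```

The conversion only needs to run once per VCF. After that, `zarr.open(..., mode='r')` reads individual arrays on demand.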

In [27]:
# load packages - have been installed in vcf virtual env
import os, sys
import zarr
import numpy as np
import pandas as pd
import allel  
import dask.array as da
from itertools import compress

## Load literature position files

These position files were obtained from Rachel Daniels on 05/07/2020. They update the positions in the 2008 paper to the Pf3D7v3.1 reference, to which the Pf3k WGS samples are aligned.

In [8]:
pos_dir = '/home/jribado/Dropbox (IDM)/parasite_genetics/genomics/senegal'
daniels_df = pd.read_csv(os.path.join(pos_dir, "2008Daniels_BarcodePositions_Updated.txt"), sep='\t', header=0)
# create a list with the chromosome and position for each study
# daniels_pos = [x + "_" + str(y) for x, y in zip(["{:02d}".format(x) for x in daniels_df["chr"]], daniels_df["position"].tolist())]
print("Number of variants from user: ", daniels_df.shape[0])

Number of variants from user:  24


## Load variant file 

In [9]:
# read in zarr files
chrom = 7
zarr_path = '/home/jribado/Dropbox (IDM)/Data, Dynamics, and Analytics Folder/Projects/malaria_pfcommunity/malaria_pf3k/pf3k_zarr/'
zarr_file = zarr_path + 'SNP_INDEL_Pf3D7_' + "{:02d}".format(chrom) + '_v3.zarr'
callset = zarr.open(zarr_file, mode='r')

## Find index overlaps

This looks for overlaps without considering quality filtering parameters. Quality metrics are applied in the following section.

In [10]:
# get the chromosome and position for all possible variants
zarr_chrom = [x.split("_",2)[1] for x in callset['variants/CHROM'][:].tolist()]
zarr_index = callset['variants/POS'][:].tolist()
print("Possible variants on chromosome", chrom, ":", len(zarr_index))

Possible variants on chromosome 7 : 300181


In [11]:
# subset the variants on the chromosome of interest
# the 'chr' column holds strings, so match on the zero-padded chromosome number
daniels_chrom_sub = daniels_df[daniels_df['chr'].str.contains("{:02d}".format(chrom))]
print("Possible variants on chromosome", chrom, "from user defined variant list :", daniels_chrom_sub.shape[0])
print(daniels_chrom_sub["position"].tolist())

Possible variants on chromosome 7 from user defined variant list : 8
[221722, 435497, 489666, 602559, 616459, 628392, 736978, 1359804]


In [12]:
# boolean mask over the Zarr variants: True where the position is in the user list
user_overlap = np.isin(zarr_index, daniels_chrom_sub["position"])
print("Number of overlapping variants from VCF and user defined variant list on chromosome", chrom, ":", np.count_nonzero(user_overlap))
print("Indices of overlapping positions on chromosome", chrom, ":", np.where(user_overlap)[0])

Number of overlapping variants from VCF and user defined variant list on chromosome 7 : 8
Indices of overlapping positions on chromosome 7 : [ 58951  87768  95123 158078 160895 162482 177615 260960]
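
Note that the second printed line lists row indices into the Zarr arrays, not genomic positions. The toy sketch below shows the same overlap logic on made-up positions, including how to map the indices back to positions and how to spot requested positions that are missing from the file. All values and variable names are illustrative.

```python
import numpy as np

# toy stand-ins: positions in the variant file and positions of interest
file_positions = np.array([100, 250, 400, 777, 900])
wanted_positions = np.array([250, 900, 1234])  # 1234 is absent from the file

# boolean mask over the file's variants, True where the position is wanted
overlap_mask = np.isin(file_positions, wanted_positions)

# row indices of the matches, and the positions they correspond to
overlap_index = np.flatnonzero(overlap_mask)
found_positions = file_positions[overlap_mask]

# wanted positions that never appear in the file
missing_positions = np.setdiff1d(wanted_positions, file_positions)

print(overlap_index, found_positions, missing_positions)
```

Checking `missing_positions` is a quick way to confirm that every requested position was found before moving on.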


Yay! The updated positions are in the VCF file, which means we can rely on standard recalibration calls to minimize errors :). 

We are less concerned about non-Senegal samples at the moment. We can filter those out. 

In [31]:
sen_bool = np.array(["Sen" in i for i in np.array(callset['samples']).tolist()])
print(np.count_nonzero(sen_bool))
sen_names = list(compress(np.array(callset['samples']).tolist(), sen_bool))

137


['SenP005.02',
 'SenP008.04',
 'SenP011.02',
 'SenP019.04',
 'SenP027.02',
 'SenP031.01',
 'SenP051.02',
 'SenP060.02',
 'SenT001.08',
 'SenT001.11',
 'SenT002.07',
 'SenT002.09',
 'SenT004.10',
 'SenT008.10',
 'SenT009.10',
 'SenT011.09',
 'SenT013.11',
 'SenT015.09',
 'SenT015.11',
 'SenT021.09',
 'SenT022.09',
 'SenT022.11',
 'SenT024.10',
 'SenT025.10',
 'SenT026.10',
 'SenT029.09',
 'SenT030.11',
 'SenT032.09',
 'SenT033.08',
 'SenT033.09',
 'SenT036.10',
 'SenT037.09',
 'SenT040.11',
 'SenT042.09',
 'SenT044.10',
 'SenT044.11',
 'SenT045.10',
 'SenT045.11',
 'SenT046.10',
 'SenT046.11',
 'SenT047.09',
 'SenT052.11',
 'SenT053.11',
 'SenT055.11',
 'SenT058.10',
 'SenT058.11',
 'SenT061.09',
 'SenT063.07',
 'SenT064.10',
 'SenT066.08',
 'SenT067.09',
 'SenT067.11',
 'SenT069.11',
 'SenT072.11',
 'SenT074.10',
 'SenT074.11',
 'SenT075.10',
 'SenT077.08',
 'SenT077.11',
 'SenT078.10',
 'SenT080.10',
 'SenT081.10',
 'SenT082.10',
 'SenT084.09',
 'SenT084.10',
 'SenT084.11',
 'SenT085.

## Pull variant information

Now the fun part: actually getting the relevant information! These two sections highlight pulling genotypes or count information for different alleles.

I recommend filtering the variants based on MalariaGEN guidelines. The two variables that can be relaxed below are the number of alternative alleles and the VQSLOD score. The filter is currently set to pull only biallelic SNPs, but it can be changed to pull multiallelic sites. The VQSLOD score is the log odds ratio that a variant is a true SNP under a Gaussian mixture model trained on good calls versus one trained on bad calls. The higher the score, the more confident you can be in the call. Anything above 1 is technically considered "true", so the threshold can be relaxed to that level if more variants are needed (Pfv6 uses a filter of 6).

More info on VQSLOD scores: https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-6-Variant_filtering.pdf

In [18]:
# get high quality snps for pfv6 to find overlaps
quality_set = callset['variants/FILTER_PASS'][:]
snp_set = callset['variants/is_snp'][:]
vsqlod_set = callset['variants/VQSLOD'][:]  > 2
biallelic_set = callset['variants/numalt'][:] < 2
variant_hq = quality_set & snp_set & vsqlod_set & biallelic_set
print("Number of filtered variants:", np.count_nonzero(variant_hq))

Number of filtered variants: 46029


In [19]:
variant_keep = quality_set & snp_set & vsqlod_set & biallelic_set & user_overlap
print(type(variant_keep))
np.count_nonzero(variant_keep)

<class 'numpy.ndarray'>


6
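
To make the two adjustable knobs explicit, the filters above can be wrapped in a small function. This is a minimal sketch on made-up arrays; `high_quality_mask`, `min_vqslod`, and `max_alt` are hypothetical names, not part of the callset or of scikit-allel.

```python
import numpy as np

# toy per-variant annotations (made-up values)
filter_pass = np.array([True, True, False, True, True])
is_snp = np.array([True, True, True, False, True])
vqslod = np.array([6.5, 1.2, 8.0, 3.0, 2.5])
numalt = np.array([1, 1, 1, 1, 2])


def high_quality_mask(filter_pass, is_snp, vqslod, numalt,
                      min_vqslod=2.0, max_alt=1):
    """Boolean mask of variants passing every criterion (hypothetical helper)."""
    return filter_pass & is_snp & (vqslod > min_vqslod) & (numalt <= max_alt)


strict = high_quality_mask(filter_pass, is_snp, vqslod, numalt)
# relaxing both knobs: lower VQSLOD cutoff, allow multiallelic sites
relaxed = high_quality_mask(filter_pass, is_snp, vqslod, numalt,
                            min_vqslod=1.0, max_alt=3)
print(strict.sum(), relaxed.sum())
```

With the real data, the four inputs would be the `variants/FILTER_PASS`, `variants/is_snp`, `variants/VQSLOD`, and `variants/numalt` arrays loaded above.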

### Genotype matrix

In [34]:
gt_zarr = callset['calldata/GT']
gt_dask = allel.GenotypeDaskArray(gt_zarr)
gt_daskSub = gt_dask.subset(user_overlap, sen_bool).compute()

In [35]:
gt_daskSub

      0    1    2    3    4  ...  132  133  134  135  136
0   0/0  0/1  1/1  0/0  1/1  ...  0/0  0/0  0/0  0/0  0/0
1   1/1  0/0  0/0  0/0  1/1  ...  0/0  1/1  0/0  0/0  0/0
2   0/0  0/0  0/0  1/1  1/1  ...  0/0  1/1  1/1  1/1  0/0
...
5   0/0  0/0  1/1  0/0  0/0  ...  1/1  0/0  1/1  1/1  1/1
6   0/0  0/1  0/0  0/0  1/1  ...  1/1  1/1  1/1  1/1  1/1
7   0/0  0/2  0/0  1/1  0/0  ...  0/0  0/0  1/1  2/2  1/1


This is not an elegant, "Pythonic" way to pull genotypes, but it works. I would be thrilled to learn a better way to loop through 3D NumPy arrays to make this faster.

In [69]:
# load the REF and ALT alleles once for the selected variants,
# instead of re-reading them from the callset on every call
ref_sub = np.array(callset['variants/REF'][:])[user_overlap]
alt_sub = np.array(callset['variants/ALT'][:])[user_overlap]

def gt_merge(list1, list2, j):
    # map allele codes to bases for variant j; -1 marks a missing call
    dic = {-1: "X",
           0: ref_sub[j],
           1: alt_sub[j, 0],
           2: alt_sub[j, 1],
           3: alt_sub[j, 2]}
    merged_list = []
    for a, b in zip(list1, list2):
        # matching alleles keep their code; mixed calls become "N"
        merged_list.append(a if a == b else "N")
    # translate allele codes to bases, leaving "N" untouched
    return [dic.get(n, n) for n in merged_list]

In [81]:
data = []
for j in range(0, len(gt_daskSub)):
    data.append(gt_merge(gt_daskSub[j,:,0], gt_daskSub[j,:,1], j))
df = pd.DataFrame(data,
                 index=["Pf3D7_" + str("{:02d}".format(chrom)) + "_v3:" + str(x) for x in np.array(callset['variants/POS'][:])[user_overlap]],
                 columns=sen_names)
print(df)

                    SenP005.02 SenP008.04 SenP011.02 SenP019.04 SenP027.02  \
Pf3D7_07_v3:221722           G          N          A          G          A   
Pf3D7_07_v3:435497           T          A          A          A          T   
Pf3D7_07_v3:489666           C          C          C          T          T   
Pf3D7_07_v3:602559           T          N          C          T          C   
Pf3D7_07_v3:616459           A          G          G          G          G   
Pf3D7_07_v3:628392           C          C          T          C          C   
Pf3D7_07_v3:736978           A          N          A          A          C   
Pf3D7_07_v3:1359804          C          N          C          A          C   

                    SenP031.01 SenP051.02 SenP060.02 SenT001.08 SenT001.11  \
Pf3D7_07_v3:221722           A          G          G          G          N   
Pf3D7_07_v3:435497           A          T          A          T          T   
Pf3D7_07_v3:489666           T          T          C          C
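
One possible way to avoid the Python loop is to let NumPy do the allele lookup for all variants and samples at once. The sketch below is an assumption about how that could look, not a tested replacement for `gt_merge`. `genotypes_to_bases` is a hypothetical helper, and the toy arrays stand in for `gt_daskSub`, `variants/REF`, and `variants/ALT`.

```python
import numpy as np


def genotypes_to_bases(gt, ref, alt):
    """Convert diploid genotype codes to one base per call (hypothetical helper).

    gt:  (n_variants, n_samples, 2) integer array, -1 for a missing allele
    ref: (n_variants,) reference alleles
    alt: (n_variants, n_alt) alternate alleles
    """
    gt = np.asarray(gt)
    # column 0 is REF, columns 1.. are the ALT alleles
    alleles = np.column_stack([ref, alt])
    first, second = gt[:, :, 0], gt[:, :, 1]
    # look up the base for the first allele; missing (-1) is clipped to 0
    # here and overwritten with "X" further down
    index = np.clip(first, 0, None).astype(np.intp)
    bases = np.take_along_axis(alleles, index, axis=1)
    same = first == second
    out = np.where(same, bases, "N")              # mixed calls become "N"
    out = np.where(same & (first < 0), "X", out)  # missing calls become "X"
    return out


# toy data: 2 variants x 3 samples
gt = np.array([[[0, 0], [0, 1], [1, 1]],
               [[-1, -1], [2, 2], [0, 0]]], dtype=np.int8)
ref = np.array(["G", "T"])
alt = np.array([["A", "", ""], ["C", "G", ""]])
print(genotypes_to_bases(gt, ref, alt))
```

The result is a 2D array of bases with one row per variant and one column per sample, so it can be passed to `pd.DataFrame` in the same way as `data` above.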

### Counts matrix

The counts matrix gives a good sense of the proportion of reads that align to each allele. This is worth considering when expanding any models to handle polyinfections at "biallelic" or multiallelic sites. Change the index into the allele count array to pull the first, second, third, ..., nth allele for each call.

In [75]:
# pull total number of high quality reads that align to each position per sample
dp_zarr = callset['calldata/DP']
# DP is a (variants x samples) depth array, so a plain dask array is enough here
dp_dask = da.from_zarr(dp_zarr)
dp_variant_selection = dp_dask[user_overlap].compute()

In [84]:
# get the number of reads that align to each of the alleles
ad_zarr=callset['calldata/AD']
ad_array=da.from_zarr(ad_zarr)
ad_variant_selection=ad_array[user_overlap]

In [85]:
ad_variant_selection

        Array          Chunk
Bytes   168.96 kB      2.05 kB
Shape   (8, 2640, 4)   (4, 64, 4)
Count   379 Tasks      168 Chunks
Type    int16          numpy.ndarray

In [86]:
# pull the first item in the allele count array to get the reference count
ad_array_ref=ad_variant_selection[:,:,0]
ad_array_ref=ad_array_ref.compute()

In [53]:
# check both data frames have the same dimension
np.shape(ad_array_ref) == np.shape(dp_variant_selection)

True

In [55]:
# suppress divide-by-zero warnings: samples without coverage at a site have a depth of 0 and yield NaN
np.seterr(divide='ignore', invalid='ignore')
ref_het_calc=ad_array_ref/dp_variant_selection
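
An alternative that avoids changing NumPy's global error settings is to divide only where the depth is positive. The sketch below also takes the allele index as a parameter, so the same function covers the reference (0) and any alternate allele (1..n). `allele_fraction` is a hypothetical helper and the arrays are made up.

```python
import numpy as np


def allele_fraction(ad, dp, allele=0):
    """Fraction of reads supporting one allele (hypothetical helper).

    ad: (n_variants, n_samples, n_alleles) per-allele read depths
    dp: (n_variants, n_samples) total read depths
    allele: 0 for the reference, 1..n for the alternate alleles
    """
    counts = np.asarray(ad)[:, :, allele].astype(float)
    depth = np.asarray(dp).astype(float)
    # start from NaN so calls with no coverage stay NaN
    out = np.full(counts.shape, np.nan)
    # divide only where the depth is positive
    np.divide(counts, depth, out=out, where=depth > 0)
    return out


# toy data: 2 variants x 2 samples x 2 alleles
ad = np.array([[[8, 2], [0, 0]],
               [[5, 5], [0, 4]]])
dp = ad.sum(axis=2)  # assumption for the toy data: depth equals the sum of allele depths
print(allele_fraction(ad, dp, allele=0))
```

In the real callset, `calldata/DP` and the sum of `calldata/AD` are separate fields and need not agree, so the toy `dp` here is only a convenience.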

In [59]:
# assign the row names that correspond to the chromosome and position
pos_keep = np.where(user_overlap)[0]
# use a new name so the integer `chrom` defined earlier is not overwritten
chrom_labels = ["chr"+x.split("_",2)[1] for x in callset['variants/CHROM'][:][pos_keep].tolist()]
pos = callset['variants/POS'][:][pos_keep].tolist()
pos_index = [x + "_" + str(y) for x, y in zip(chrom_labels, pos)]

het_df=pd.DataFrame(data=ref_het_calc,
                     index=pos_index,
                     columns=callset['samples'])
het_df
het_df.to_csv('~/Desktop/alleleCounts_test.csv', header=True, index=True, sep="\t")