# Creating matrices of SNPs from VCF files

## Goal

Learn to subset variants from Zarr files that can be used in IDM models. 

The Zarr format is similar to HDF5 and allows subsetting large files without loading them entirely into memory. For whole genome sequencing data, the variants in a population are typically stored in VCF format.

In [1]:
# load packages - have been installed in vcf virtual env
import os, sys
import zarr
import numpy as np
import pandas as pd
import allel  
import dask.array as da
from itertools import compress

## Load literature position files

These position files were obtained from Rachel Daniels on 05/07/2020. They update the positions in the 2008 paper to the Pf3D7v3.1 reference, to which the Pf3k WGS samples are aligned.

In [2]:
pos_dir = '/home/jribado/Dropbox (IDM)/parasite_genetics/genomics/senegal'
daniels_df = pd.read_csv(os.path.join(pos_dir, "2008Daniels_BarcodePositions_Updated.txt"), sep='\t', header=0)
# create a list with the chromosome and position for each study
# daniels_pos = [x + "_" + str(y) for x, y in zip(["{:02d}".format(x) for x in daniels_df["chr"]], daniels_df["position"].tolist())]
print("Number of variants from user: ", daniels_df.shape[0])

Number of variants from user:  24


## Load variant file 

In [3]:
# read in zarr files
chrom = 1
zarr_path = '/home/jribado/Dropbox (IDM)/Data, Dynamics, and Analytics Folder/Projects/malaria_pfcommunity/malaria_pf3k/pf3k_zarr/'
zarr_file = zarr_path + 'SNP_INDEL_Pf3D7_' + "{:02d}".format(chrom) + '_v3.zarr'
callset = zarr.open(zarr_file, mode='r')

## Find index overlaps

This looks for overlaps without considering quality filtering parameters. Quality metrics are applied in the following section.

In [4]:
# get the chromosome and position for all possible variants
zarr_chrom = [x.split("_",2)[1] for x in callset['variants/CHROM'][:].tolist()]
zarr_index = callset['variants/POS'][:].tolist()
print("Possible variants on chromosome", chrom, ":", len(zarr_index))

Possible variants on chromosome 1 : 159946


In [5]:
# subset the variants on the chromosome of interest
daniels_chrom_sub = daniels_df[daniels_df['chr'].str.contains("{:02d}".format(chrom))]
print("Possible variants on chromosome", chrom, "from user defined variant list :", daniels_chrom_sub.shape[0])
print(daniels_chrom_sub["position"].tolist())

Possible variants on chromosome 1 from user defined variant list : 2
[130339, 537322]


In [41]:
user_overlap = [i in daniels_chrom_sub["position"].tolist() for i in zarr_index]
print(len(user_overlap))
print("Number of overlapping variants from VCF and user defined variant list on chromosome", chrom, ":", np.count_nonzero(user_overlap))
print("Overlapping positions on chromosome", chrom, ":", np.where(user_overlap)[0])

159946
Number of overlapping variants from VCF and user defined variant list on chromosome 01 : 2
Overlapping positions on chromosome 01 : [ 57944 114554]
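
The membership test above loops over every callset position in Python, which gets slow on larger chromosomes. A minimal sketch of a vectorized alternative using `np.isin` follows; the `toy_*` arrays are made-up stand-ins for `callset['variants/POS'][:]` and the user position list, not real data.

```python
import numpy as np

# toy stand-ins (hypothetical values) for the callset positions and the user list
toy_zarr_pos = np.array([1050, 130339, 250000, 537322, 600001])
toy_user_pos = [130339, 537322, 999999]

# boolean mask with one entry per variant in the callset
toy_overlap = np.isin(toy_zarr_pos, toy_user_pos)

print(np.count_nonzero(toy_overlap))  # number of overlapping variants
print(np.where(toy_overlap)[0])       # row indices of the overlapping variants
```

Because the result is already a NumPy boolean array, it can be combined with the quality masks using `&` without any conversion.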


Yay! The updated positions are in the VCF file, which means we can rely on standard recalibration calls to minimize errors :). 

We are less concerned about non-Senegal samples at the moment. We can filter those out. 

In [7]:
sen_bool = np.array(["Sen" in i for i in np.array(callset['samples']).tolist()])
print(np.count_nonzero(sen_bool))
sen_names = list(compress(np.array(callset['samples']).tolist(), sen_bool))

137


## Pull variant information

Now the fun part, actually getting the relevant information! These two sections will highlight pulling genotypes or count information for different alleles.

I recommend filtering the variants based on MalariaGEN guidelines. The two variables that can be relaxed below are the number of alternative alleles and the VQSLOD score. The filter is intended to pull only biallelic SNPs, but it can be changed to pull multiallelic sites. The VQSLOD score is the log odds ratio that a variant is a true SNP rather than a false one, computed from Gaussian mixture models fit to a good (true) and a bad (false) set of variants. The higher the score, the more confident you can be in the call. Anything above 1 is technically considered "true", so the threshold can be lowered toward that value if more variants need to be included (Pfv6 uses a filter of 6).

More info on VQSLOD scores: https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-6-Variant_filtering.pdf
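
To make the effect of the two adjustable parameters concrete, here is a small sketch on made-up values. The function name `toy_variant_filter`, its `vqslod_min` and `max_alt` parameters, and the five toy variants are all assumptions for illustration; they are not part of the Pf3k callset or of scikit-allel.

```python
import numpy as np

def toy_variant_filter(filter_pass, is_snp, vqslod, numalt, vqslod_min=2, max_alt=1):
    """Combine per-variant quality fields into a single boolean mask."""
    filter_pass = np.asarray(filter_pass, dtype=bool)
    is_snp = np.asarray(is_snp, dtype=bool)
    score_ok = np.asarray(vqslod) > vqslod_min   # stricter as vqslod_min grows
    alt_ok = np.asarray(numalt) <= max_alt       # max_alt=1 keeps biallelic sites only
    return filter_pass & is_snp & score_ok & alt_ok

# five hypothetical variants
toy_pass   = [True, True, False, True, True]
toy_snp    = [True, True, True, False, True]
toy_vqslod = [6.5, 1.2, 8.0, 7.1, 3.0]
toy_numalt = [1, 1, 1, 1, 2]

strict  = toy_variant_filter(toy_pass, toy_snp, toy_vqslod, toy_numalt)
relaxed = toy_variant_filter(toy_pass, toy_snp, toy_vqslod, toy_numalt,
                             vqslod_min=1, max_alt=2)
print(strict)
print(relaxed)
```

Relaxing both parameters recovers the low-scoring variant and the site with two alternate alleles, while the variants that fail `FILTER_PASS` or are not SNPs stay excluded.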

In [8]:
# get high quality snps for pfv6 to find overlaps
quality_set = callset['variants/FILTER_PASS'][:]
snp_set = callset['variants/is_snp'][:]
vsqlod_set = callset['variants/VQSLOD'][:]  > 2
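# note: numalt < 3 keeps sites with up to two ALT alleles; use numalt == 1 for strictly biallelic sites
biallelic_set = callset['variants/numalt'][:] < 3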
variant_hq = quality_set & snp_set & vsqlod_set & biallelic_set
print("Number of filtered variants:", np.count_nonzero(variant_hq))

Number of filtered variants: 17887


In [9]:
variant_keep = quality_set & snp_set & vsqlod_set & biallelic_set & user_overlap
print(type(variant_keep))
np.count_nonzero(variant_keep)

<class 'numpy.ndarray'>


2

### Genotype matrix

In [33]:
gt_zarr = callset['calldata/GT']
gt_dask = allel.GenotypeDaskArray(gt_zarr)
gt_daskSub = gt_dask.subset(variant_keep, sen_bool).compute()

In [11]:
ac     = gt_daskSub.count_alleles(max_allele=2)
sub_af = ac/ac.sum(axis=1, keepdims=True)
pd.DataFrame(sub_af,
             index=["Pf3D7_" + str("{:02d}".format(chrom)) + "_v3:" + str(x) for x in np.array(callset['variants/POS'][:])[variant_keep]],
             columns=["REF", "ALT1", "ALT2"])

                         REF      ALT1  ALT2
Pf3D7_01_v3:130339  0.302920  0.697080   0.0
Pf3D7_01_v3:537322  0.021898  0.978102   0.0
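
For readers who want to see what the allele counting step amounts to, the sketch below reproduces the idea in plain NumPy on a toy genotype array. It is an illustration of the calculation (counting each allele code per variant and ignoring missing calls coded as -1), not scikit-allel's implementation; `toy_count_alleles` and `toy_gt` are hypothetical names.

```python
import numpy as np

def toy_count_alleles(gt, max_allele):
    """Count allele codes 0..max_allele per variant; missing calls (-1) are ignored."""
    gt = np.asarray(gt)
    n_variants = gt.shape[0]
    calls = gt.reshape(n_variants, -1)  # flatten samples x ploidy into one axis
    ac = np.zeros((n_variants, max_allele + 1), dtype=int)
    for allele in range(max_allele + 1):
        ac[:, allele] = (calls == allele).sum(axis=1)
    return ac

# toy array with shape (variants, samples, ploidy)
toy_gt = np.array([
    [[0, 0], [1, 1], [1, 1], [-1, -1]],
    [[1, 1], [1, 1], [0, 1], [1, 1]],
])
toy_ac = toy_count_alleles(toy_gt, max_allele=2)
toy_af = toy_ac / toy_ac.sum(axis=1, keepdims=True)  # row-wise allele frequencies
print(toy_ac)
print(toy_af)
```

Each row of `toy_af` sums to 1 because the denominator only includes called alleles.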


This is not an elegant "Pythonic way" to pull genotypes, but it works. Would be thrilled to learn a better way to loop through 3D Numpy arrays to make this faster. 

In [12]:
def gt_merge(list1, list2, j):
    # map genotype codes to nucleotides for variant j:
    # -1 is a missing call, 0 is the reference allele, 1-3 are the alternate alleles
    ref = np.array(callset['variants/REF'][:])[variant_keep]
    alt = np.array(callset['variants/ALT'][:])[variant_keep]
    dic = {-1: "X", 0: ref[j], 1: alt[j, 0], 2: alt[j, 1], 3: alt[j, 2]}
    merged_list = []
    for i in range(0, len(list1)):
        if list1[i] == list2[i]:
            # both calls agree: keep the allele code, or mark a missing call as "X"
            if list1[i] == -1:
                merged_list.append("X")
            else:
                merged_list.append(list1[i])
        else:
            # the two calls disagree (mixed call): mark as "N"
            merged_list.append("N")
    # translate allele codes to nucleotides; "X" and "N" pass through unchanged
    merged_replace = [dic.get(n, n) for n in merged_list]
    return merged_replace

In [13]:
data = []
for j in range(0, len(gt_daskSub)):
    data.append(gt_merge(gt_daskSub[j,:,0], gt_daskSub[j,:,1], j))
df = pd.DataFrame(data,
                 index=["Pf3D7_" + str("{:02d}".format(chrom)) + "_v3:" + str(x) for x in np.array(callset['variants/POS'][:])[variant_keep]],
                 columns=sen_names)
print(df)

                   SenP005.02 SenP008.04 SenP011.02 SenP019.04 SenP027.02  \
Pf3D7_01_v3:130339          C          T          T          C          C   
Pf3D7_01_v3:537322          A          A          A          A          A   

                   SenP031.01 SenP051.02 SenP060.02 SenT001.08 SenT001.11  \
Pf3D7_01_v3:130339          T          C          T          T          N   
Pf3D7_01_v3:537322          A          A          A          A          A   

                    ... SenT230.08 SenT231.08 SenT232.08 SenT233.08  \
Pf3D7_01_v3:130339  ...          T          T          T          T   
Pf3D7_01_v3:537322  ...          A          A          A          A   

                   SenT235.08 SenT236.08 SenV034.04 SenV035.04 SenV042.05  \
Pf3D7_01_v3:130339          T          C          C          T          T   
Pf3D7_01_v3:537322          A          A          A          A          A   

                   SenV092.05  
Pf3D7_01_v3:130339          T  
Pf3D7_01_v3:537322        
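
One possible way to avoid the Python loops is to treat the genotype codes as column indices into a per-variant allele table and let NumPy fancy indexing do the lookup. The sketch below runs on toy inputs and follows the same rules as `gt_merge` (agreeing calls give the nucleotide, disagreeing calls give "N", two missing calls give "X"); `toy_gt_merge` and the toy arrays are hypothetical and have not been run against the Pf3k callset.

```python
import numpy as np

def toy_gt_merge(gt, ref, alt):
    """gt: (variants, samples, 2) ints; ref: (variants,); alt: (variants, n_alt)."""
    gt = np.asarray(gt)
    # per-variant lookup table: column 0 is REF, columns 1.. are the ALT alleles
    alleles = np.column_stack([ref, alt]).astype(object)
    first, second = gt[:, :, 0], gt[:, :, 1]
    rows = np.arange(gt.shape[0])[:, None]
    # clip so that -1 does not index from the end; those cells are overwritten below
    calls = alleles[rows, np.clip(first, 0, None)]
    calls[first != second] = "N"                   # the two calls disagree
    calls[(first == -1) & (second == -1)] = "X"    # both calls missing
    return calls

toy_gt = np.array([
    [[0, 0], [1, 1], [0, 1], [-1, -1]],
    [[1, 1], [1, 1], [-1, 1], [0, 0]],
])
toy_ref = np.array(["C", "G"])
toy_alt = np.array([["T", ""], ["A", ""]])
toy_calls = toy_gt_merge(toy_gt, toy_ref, toy_alt)
print(toy_calls)
```

The returned array has one row per variant and one column per sample, so it can be passed directly to `pd.DataFrame` with the same index and column labels used above.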

In [68]:
# define functions from working pieces above
def variant_filter(callset, vsqlod_min, num_alt):
    quality_set = callset['variants/FILTER_PASS'][:]
    snp_set = callset['variants/is_snp'][:]
    vsqlod_set = callset['variants/VQSLOD'][:]  > vsqlod_min
    alt_set = callset['variants/numalt'][:] < num_alt + 1 
    variant_hq = quality_set & snp_set & vsqlod_set & alt_set
    return(variant_hq)
    
def read_zarr(chrom, zarr_path):
    # zarr_file = zarr_path + 'SNP_INDEL_Pf3D7_' + "{:02d}".format(chrom) + '_v3.zarr'
    zarr_file = zarr_path + 'SNP_INDEL_Pf3D7_' + chrom + '_v3.zarr'
    callset = zarr.open(zarr_file, mode='r')
    return(callset)

def overlap_avail(user_pos_list, callset):
    zarr_index   = callset['variants/POS'][:].tolist()
    user_overlap =  [i in user_pos_list for i in zarr_index]
    return(user_overlap)
    
def overlap_filter(user_snps, filter_snps):
    return(user_snps & filter_snps)
        
def user_snps(user_df, chrom):
    user_df  = pd.read_csv(user_df, sep='\t', header=0)
    user_sub = user_df[user_df['chr'].str.contains(chrom)]
    return(user_sub["position"].tolist())

def gt_subset(callset, variant_bool, sample_bool):
    gt_zarr = callset['calldata/GT']
    gt_dask = allel.GenotypeDaskArray(gt_zarr)
    gt_daskSub = gt_dask.subset(variant_bool, sample_bool).compute()
    return(gt_daskSub)

def allele_freq(callset, variant_bool, sample_bool, num_alt):
    gt_zarr = callset['calldata/GT']
    gt_dask = allel.GenotypeDaskArray(gt_zarr)
    gt_daskSub = gt_dask.subset(variant_bool, sample_bool).compute()
    print(gt_daskSub)
    # count number of alleles
    ac = gt_daskSub.count_alleles(max_allele=num_alt)
    sub_af = ac/ac.sum(axis=1, keepdims=True)
    print(type(sub_af))
    sub_df = pd.DataFrame(sub_af,
                 index=["Pf3D7_" + chrom + "_v3:" + str(x) for x in np.array(callset['variants/POS'][:])[variant_bool]],
                 columns=["REF"] + ["ALT" + str(x) for x in range(1, num_alt + 1)])
    return(sub_df)

In [69]:
num_alt = 3
positions = []


for i in [1,2]:
    # note: allele_freq reads the global `chrom` to label its rows
    chrom = "{:02d}".format(i)
    zarr_path = '/home/jribado/Dropbox (IDM)/Data, Dynamics, and Analytics Folder/Projects/malaria_pfcommunity/malaria_pf3k/pf3k_zarr/'
    callset = read_zarr(chrom, zarr_path)

    daniels_snps = user_snps(os.path.join(pos_dir, "2008Daniels_BarcodePositions_Updated.txt"), chrom)
    variant_bool = overlap_filter(overlap_avail(daniels_snps, callset), variant_filter(callset, 2, num_alt))
    sample_bool  = np.array(["Sen" in i for i in np.array(callset['samples']).tolist()])
    freq = allele_freq(callset, variant_bool, sample_bool, num_alt)
    print(freq)

0/0 1/1 1/1 0/0 0/0 ... 0/0 0/0 1/1 1/1 1/1
1/1 1/1 1/1 1/1 1/1 ... 1/1 1/1 1/1 1/1 1/1

<class 'numpy.ndarray'>
                         REF      ALT1  ALT2  ALT3
Pf3D7_01_v3:130339  0.302920  0.697080   0.0   0.0
Pf3D7_01_v3:537322  0.021898  0.978102   0.0   0.0
1/1 0/1 1/1 0/0 1/1 ... 0/0 1/1 1/1 0/0 0/0

<class 'numpy.ndarray'>
                         REF      ALT1  ALT2  ALT3
Pf3D7_02_v3:842805  0.722222  0.277778   0.0   0.0


### Counts matrix

The counts matrix gives a good sense of the proportion of reads that align to each allele. This is worth considering when expanding any models to handle polyinfections at "biallelic" or multiallelic sites. Change the index into the allele depth array to pull the 1st, 2nd, 3rd, ..., nth allele for each sample.
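
As a toy illustration of the idea, the sketch below picks one allele out of an allele depth array and divides it by the total depth, leaving `NaN` where a sample has no coverage. The arrays are made up, and `toy_dp` is computed as the sum of the allele depths purely for convenience; in a real callset `DP` and `AD` are separate fields that need not agree exactly.

```python
import numpy as np

# toy allele depths with shape (variants, samples, alleles); hypothetical values
toy_ad = np.array([
    [[30, 0, 0], [5, 15, 0], [0, 0, 0]],
    [[0, 40, 0], [10, 10, 0], [8, 0, 2]],
])
# assumption for the toy example only: total depth equals the sum of allele depths
toy_dp = toy_ad.sum(axis=2)

def toy_allele_fraction(ad, dp, allele_index):
    """Fraction of reads supporting one allele; NaN where total depth is zero."""
    counts = ad[:, :, allele_index].astype(float)
    out = np.full(counts.shape, np.nan)
    np.divide(counts, dp, out=out, where=dp > 0)
    return out

ref_frac = toy_allele_fraction(toy_ad, toy_dp, 0)   # reference allele
alt1_frac = toy_allele_fraction(toy_ad, toy_dp, 1)  # first alternate allele
print(ref_frac)
print(alt1_frac)
```

Using `where=dp > 0` restricts the division to covered sites, so there is no need to silence NumPy's division warnings globally.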

In [21]:
# pull total number of high quality reads that align to each position per sample
dp_zarr = callset['calldata/DP']
dp_dask = allel.AlleleCountsDaskArray(dp_zarr)
dp_variant_selection = dp_dask.compress(user_overlap, axis=0).compute()

In [27]:
dp_variant_selection

       0   1   2   3    4  ...  2635  2636  2637  2638  2639
0     70  33  24  24   82  ...    38    40    39    41    62
1     65  35  21  26  112  ...    53    59    40    44    41
2    122  23  26  22  120  ...    55    59    58   114    52
...  ...  ..  ..  ..  ...  ...   ...   ...   ...   ...   ...
5    133  39  36  24  136  ...    64    43    67   123   127
6     47  36  29  29   60  ...    80    28    38    85    84
7    143  25  22  25  123  ...   105    43    80   151   142


In [129]:
# get the number of reads that align to each of the alleles
ad_zarr=callset['calldata/AD']
ad_array=da.from_zarr(ad_zarr)
ad_variant_selection=ad_array[user_overlap]

In [130]:
# pull the first item in the allele count array to get the reference count
ad_array_ref=ad_variant_selection[:,:,0]
ad_array_ref=ad_array_ref.compute()

In [131]:
# check both data frames have the same dimension
np.shape(ad_array_ref) == np.shape(dp_variant_selection)

True

In [132]:
# ignore zero division for site where one population may have an allele but other samples may not have adequate coverage
np.seterr(divide='ignore', invalid='ignore')
ref_het_calc=ad_array_ref/dp_variant_selection

In [135]:
# assign the row names that correspond to the chromosome and position
# use user_overlap (not variant_keep) so the labels match the rows of ref_het_calc,
# and keep `chrom` untouched so it is not overwritten with a list
pos_keep = np.where(user_overlap)[0]
chrom_ids = [x.split("_",2)[1] for x in callset['variants/CHROM'][:][pos_keep].tolist()]
pos = callset['variants/POS'][:][pos_keep].tolist()
pos_index = ["Pf3D7_" + c + "_v3:" + str(p) for c, p in zip(chrom_ids, pos)]

het_df=pd.DataFrame(data=ref_het_calc,
                     index=pos_index,
                     columns=callset['samples'][:])
het_df
# het_df.to_csv('~/Desktop/alleleCounts_test.csv', header=True, index=True, sep="\t")