# Creating matrices of SNPs from VCF files

## Goal

Learn to subset variants from Zarr files that can be used in IDM models. 

The Zarr format is similar to HDF5 and allows large files to be subset without loading them fully into memory. Variants called from whole genome sequencing of a population are typically stored in VCF format.

In [2]:
# load packages - have been installed in vcf virtual env
import os, sys
import zarr
import numpy as np
import pandas as pd
import allel  
import dask.array as da

## Load literature position files

These position files were generated from tables in the papers, since accompanying files with the chromosome and position were not available. Additional file 4 from Daniels 2018 was manually copied.

In [152]:
pos_dir = '/home/jribado/Dropbox (IDM)/Data, Dynamics, and Analytics Folder/Projects/malaria_pfcommunity/'
daniels_df = pd.read_csv(os.path.join(pos_dir, "2008Daniels_BarcodePositions.txt"), sep='\t', header=0)
# create a list with the chromosome and position for each study
# daniels_pos = [x + "_" + str(y) for x, y in zip(["{:02d}".format(x) for x in daniels_df["chr"]], daniels_df["position"].tolist())]
print("Number of variants from user: ", daniels_df.shape[0])

Number of variants from user:  24


## Load variant file 

In [162]:
# read in zarr files
chrom = 7
zarr_path = '/home/jribado/Dropbox (IDM)/Data, Dynamics, and Analytics Folder/Projects/malaria_pfcommunity/malaria_pf3k/pf3k_zarr/'
zarr_file = zarr_path + 'SNP_INDEL_Pf3D7_' + "{:02d}".format(chrom) + '_v3.zarr'
callset = zarr.open(zarr_file, mode='r')

## Find index overlaps

This looks for overlaps without considering quality filtering parameters. Quality metrics are applied in the following section.

In [163]:
# get the chromosome and position for all possible variants
zarr_chrom = [x.split("_",2)[1] for x in callset['variants/CHROM'][:].tolist()]
zarr_index = callset['variants/POS'][:].tolist()
print("Possible variants on chromosome", chrom, ":", len(zarr_index))

Possible variants on chromosome 7 : 300181


In [267]:
# subset the variants on the chromosome of interest
daniels_chrom_sub = daniels_df.loc[daniels_df.chr == chrom]
print("Possible variants on chromosome", chrom, "from user defined variant list :", daniels_chrom_sub.shape[0])
print(daniels_chrom_sub["position"].tolist())

Possible variants on chromosome 7 from user defined variant list : 8
[277104, 490877, 545046, 657939, 671839, 683772, 792356, 1415182]


In [165]:
user_overlap = [i in daniels_chrom_sub["position"].tolist() for i in zarr_index]
print("Number of overlapping variants from VCF and user defined variant list on chromosome", chrom, ":", np.count_nonzero(user_overlap))
print("Indices of overlapping variants on chromosome", chrom, ":", np.where(user_overlap)[0])

Number of overlapping variants from VCF and user defined variant list on chromosome 7 : 3
Indices of overlapping variants on chromosome 7 : [119800 168112 277966]
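
The list comprehension above scans the user list once for every variant in the callset. A vectorized alternative is sketched below with `np.isin`. The `zarr_pos` array is a small made-up stand-in for `callset['variants/POS'][:]`, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for callset['variants/POS'][:] on one chromosome
zarr_pos = np.array([1200, 277104, 545046, 545100, 671839, 900000, 1415182])

# Barcode positions on chromosome 7, copied from the list printed above
user_pos = np.array([277104, 490877, 545046, 657939,
                     671839, 683772, 792356, 1415182])

# Boolean mask over the callset: True where the position is in the user list
overlap_mask = np.isin(zarr_pos, user_pos)

# Row indices of the matching variants in the callset
overlap_idx = np.flatnonzero(overlap_mask)

# User positions that have no matching variant in the callset
missing = np.setdiff1d(user_pos, zarr_pos)

print("Overlap indices:", overlap_idx)
print("User positions not found:", missing)
```

The resulting mask is a NumPy boolean array, so it can be combined with other filters using `&` in the same way as the list-based `user_overlap`.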


## Pull variant information

Now the fun part: actually getting the relevant information! These two sections highlight pulling genotypes or count information for different alleles.

I recommend filtering the variants based on MalariaGEN guidelines. The two parameters that can be relaxed below are the number of alternative alleles and the VQSLOD score. The filter is currently set to pull only biallelic SNPs, but it can be changed to pull multiallelic sites. The VQSLOD score is the log-odds ratio of a variant being a true SNP under a Gaussian mixture model trained on good variants versus one trained on bad variants. The higher the score, the more confident you can be in the call. Anything above 1 is technically considered "true" if the filter needs to be relaxed to include more variants (Pfv6 uses a filter of 6); the code below uses a threshold of 2.

More info on VQSLOD scores: https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-6-Variant_filtering.pdf

In [145]:
# get high quality snps for pfv6 to find overlaps
quality_set = callset['variants/FILTER_PASS'][:]
snp_set = callset['variants/is_snp'][:]
vsqlod_set = callset['variants/VQSLOD'][:]  > 2
biallelic_set = callset['variants/numalt'][:] == 1
variant_hq = quality_set & snp_set & vsqlod_set & biallelic_set
print("Number of variants passing filters:", np.count_nonzero(variant_hq))

Number of variants passing filters: 46029


In [146]:
variant_keep = quality_set & snp_set & vsqlod_set & biallelic_set & user_overlap 
np.count_nonzero(variant_keep)

1

Only one of the variants meets the criteria, even with relaxed quality, so let's proceed with the 3 original overlaps for demonstration's sake.
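
If the thresholds need to be relaxed, it can help to wrap the filter in a small function with the VQSLOD cutoff and the allowed number of alternative alleles as parameters. This is a minimal sketch on made-up arrays. The function name `build_variant_mask` and its parameters are hypothetical, and the inputs stand in for the `FILTER_PASS`, `is_snp`, `VQSLOD`, and `numalt` arrays loaded above.

```python
import numpy as np

def build_variant_mask(filter_pass, is_snp, vqslod, numalt,
                       min_vqslod=2.0, max_alt=1):
    """Combine per-variant quality filters into one boolean mask.

    Hypothetical helper: the defaults mirror the thresholds used in this
    notebook (VQSLOD > 2, exactly one alternative allele).
    """
    vqslod = np.asarray(vqslod, dtype=float)
    numalt = np.asarray(numalt)
    return (
        np.asarray(filter_pass, dtype=bool)
        & np.asarray(is_snp, dtype=bool)
        & (vqslod > min_vqslod)
        & (numalt >= 1)
        & (numalt <= max_alt)
    )

# Made-up values for five variants
filter_pass = [True, True, False, True, True]
is_snp      = [True, True, True, False, True]
vqslod      = [6.5, 1.5, 8.0, 7.0, 3.0]
numalt      = [1, 1, 1, 1, 2]

strict = build_variant_mask(filter_pass, is_snp, vqslod, numalt)
relaxed = build_variant_mask(filter_pass, is_snp, vqslod, numalt,
                             min_vqslod=1.0, max_alt=3)
print("Strict :", np.count_nonzero(strict))
print("Relaxed:", np.count_nonzero(relaxed))
```

Keeping the thresholds as arguments makes it easy to compare how many barcode positions survive under different settings.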

### Genotype matrix

In [45]:
gt_zarr = callset['calldata/GT']
gt_dask = allel.GenotypeDaskArray(gt_zarr)
gt_daskSub = gt_dask.subset(user_overlap).compute()

In [46]:
gt_daskSub

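| | 0 | 1 | 2 | 3 | 4 | ... | 2635 | 2636 | 2637 | 2638 | 2639 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2/2 | 0/0 | 0/0 | 0/0 | ./. | ... | ./. | 0/0 | ./. | 0/0 | 0/0 |
| 1 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 2 | ./. | ./. | ./. | 0/0 | 0/0 | ... | ./. | ./. | ./. | ./. | ./. |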


This is not an elegant, "Pythonic" way to pull genotypes, but it works. I would be thrilled to learn a better way to loop through 3D NumPy arrays to make this faster.

In [363]:
def merge(list1, list2, j):
    # look up the REF and ALT alleles for the j-th selected variant
    ref = callset['variants/REF'][:][user_overlap][j]
    alt = callset['variants/ALT'][:][user_overlap][j]
    # map allele codes to bases; -1 marks a missing call
    dic = {-1: "X", 0: ref, 1: alt[0], 2: alt[1], 3: alt[2]}
    merged_list = []
    for i in range(len(list1)):
        if list1[i] == list2[i]:
            # both alleles agree: keep the allele code (translated below)
            merged_list.append(list1[i])
        else:
            # alleles differ (heterozygous or mixed call)
            merged_list.append("N")

    merged_replace = [dic.get(n, n) for n in merged_list]
    return merged_replace

In [365]:
data = []
for j in range(0, len(gt_daskSub)):
    data.append(merge(gt_daskSub[j,:,0], gt_daskSub[j,:,1], j))
df = pd.DataFrame(data,
                 index=np.array(callset['variants/POS'][:])[user_overlap],
                 columns=callset['samples'])
print(df)

/samples 7G8 ERS740936 ERS740937 ERS740940 GB4 PA0007-C PA0008-C PA0011-C  \
545046               G         G         G   X        G        G        X   
671839     G         G         G         G   G        G        G        G   
1415182    X         X         X         G   G        G        G        G   

/samples PA0012-C PA0015-C  ... SenT230.08 SenT231.08 SenT232.08 SenT233.08  \
545046          G        X  ...          X          G          X          X   
671839          G        G  ...          G          G          G          G   
1415182         X        X  ...          X          X          G          X   

/samples SenT235.08 SenT236.08 SenV034.04 SenV035.04 SenV042.05 SenV092.05  
545046            G          X          G          X          G          G  
671839            G          G          G          G          G          G  
1415182           G          X          X          X          X          X  

[3 rows x 2640 columns]
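
One possible way to avoid the Python loop is to translate all genotype codes at once with NumPy indexing. The sketch below runs on a small made-up genotype array. The function name `genotypes_to_bases` is hypothetical, and it assumes the same conventions as the loop above: `-1` for a missing allele, "X" for missing calls, and "N" when the two alleles differ.

```python
import numpy as np

def genotypes_to_bases(gt, ref, alt, missing="X", mixed="N"):
    """Translate a (variants, samples, 2) genotype array into base calls."""
    gt = np.asarray(gt)
    # Column 0 holds REF, the remaining columns hold the ALT alleles
    alleles = np.column_stack([ref, alt]).astype(object)
    a0, a1 = gt[:, :, 0], gt[:, :, 1]
    # Look up the base for the first allele of every call;
    # clip so that missing (-1) codes still index a valid column
    idx = np.clip(a0, 0, None).astype(np.intp)
    bases = np.take_along_axis(alleles, idx, axis=1)
    # Calls whose two alleles differ are marked as mixed
    bases = np.where(a0 != a1, mixed, bases)
    # Calls where both alleles are missing are marked as missing
    bases = np.where((a0 < 0) & (a1 < 0), missing, bases)
    return bases

# Made-up example: 2 variants x 3 samples
gt = np.array([[[2, 2], [0, 0], [-1, -1]],
               [[0, 0], [0, 1], [1, 1]]])
ref = np.array(["A", "G"])
alt = np.array([["C", "T", ""],
                ["T", "", ""]])

calls = genotypes_to_bases(gt, ref, alt)
print(calls)
```

With the real data, `gt` would be the computed genotype subset, and `ref` and `alt` would be the `variants/REF` and `variants/ALT` arrays restricted to the same variants.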


### Counts matrix

The counts matrix gives a good sense of the proportion of reads that align to each allele. This is useful when expanding models to handle polyinfections at "biallelic" or multiallelic sites. Change the index into the allele depth array to pull the 1st, 2nd, 3rd, ..., nth allele for each sample.

In [48]:
# pull total number of high quality reads that align to each position per sample
dp_zarr = callset['calldata/DP']
dp_dask = allel.AlleleCountsDaskArray(dp_zarr)
dp_variant_selection = dp_dask.compress(user_overlap, axis=0).compute()

In [49]:
# get the number of reads that align to each of the alleles
ad_zarr=callset['calldata/AD']
ad_array=da.from_zarr(ad_zarr)
ad_variant_selection=ad_array[user_overlap]

In [52]:
# pull the first item in the allele count array to get the reference count
ad_array_ref=ad_variant_selection[:,:,0]
ad_array_ref=ad_array_ref.compute()

In [53]:
# check both data frames have the same dimension
np.shape(ad_array_ref) == np.shape(dp_variant_selection)

True

In [55]:
# ignore zero-division warnings at sites where some samples have an allele but others lack adequate coverage
np.seterr(divide='ignore', invalid='ignore')
ref_het_calc=ad_array_ref/dp_variant_selection
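
As an alternative to silencing warnings globally with `np.seterr`, the division can be restricted to cells with non-zero depth. The sketch below uses made-up arrays. The function name `allele_fraction` is hypothetical, and it assumes allele depths shaped (variants, samples, alleles) and total depths shaped (variants, samples), as in the arrays above.

```python
import numpy as np

def allele_fraction(ad, dp, allele=0):
    """Fraction of reads supporting one allele; NaN where depth is zero."""
    ad = np.asarray(ad, dtype=float)
    dp = np.asarray(dp, dtype=float)
    # Read counts for the requested allele (0 = reference)
    counts = ad[:, :, allele]
    # Start from NaN so cells without coverage stay undefined
    out = np.full(dp.shape, np.nan)
    # Divide only where there is at least one read
    np.divide(counts, dp, out=out, where=dp > 0)
    return out

# Made-up example: 2 variants x 2 samples x 2 alleles
ad = np.array([[[8, 2], [0, 0]],
               [[5, 5], [3, 1]]])
dp = np.array([[10, 0],
               [10, 4]])

ref_frac = allele_fraction(ad, dp, allele=0)
alt_frac = allele_fraction(ad, dp, allele=1)
print(ref_frac)
print(alt_frac)
```

Changing the `allele` argument pulls the fraction for the first, second, or later alternative allele without recomputing the selection.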

In [59]:
# assign the row names that correspond to the chromosome and position
pos_keep = np.where(user_overlap)[0]
chrom = ["chr"+x.split("_",2)[1] for x in callset['variants/CHROM'][:][pos_keep].tolist()]
pos = callset['variants/POS'][:][pos_keep].tolist()
pos_index = [x + "_" + str(y) for x, y in zip(chrom, pos)]

het_df=pd.DataFrame(data=ref_het_calc,
                     index=pos_index,
                     columns=callset['samples'])
het_df
het_df.to_csv('~/Desktop/alleleCounts_test.csv', header=True, index=True, sep="\t")