## Work with 1KGP

In [1]:
import numpy as np

In [6]:
x = [1,2,3,4]
y=x[1:2]
y

[2]

In [12]:
x = np.zeros((2,3))
y = np.array([1,2])
x[:,1]=y
x

array([[0., 1., 0.],
       [0., 2., 0.]])

In [14]:
np.arange(1,2)

array([1])

In [1]:
from collections import defaultdict

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

import vcf

In [2]:
vcf_fname = r'/ifs1/pub/database/ftp.ncbi.nih.gov/1000genomes/ftp/phase3/integrated_sv_map/supporting/GRCh38_positions/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz'

vcf_ind_fname = r'/ifs1/pub/database/ftp.ncbi.nih.gov/1000genomes/ftp/phase3/integrated_sv_map/supporting/GRCh38_positions/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz.tbi'


In [3]:
v = vcf.Reader(filename=vcf_fname)

print('Variant Level information')
infos = v.infos
for info in infos:
    print(info)

print('Sample Level information')
fmts = v.formats
for fmt in fmts:
    print(fmt)

Variant Level information
TSD
SVTYPE
MSTART
MLEN
MEND
MEINFO
SVLEN
IMPRECISE
CIEND
CIPOS
END
CS
MC
AF
NS
EAS_AF
EUR_AF
AFR_AF
AMR_AF
SAS_AF
SITEPOST
AC
AN
Sample Level information
GT


We start by inspecting the annotations available for each record (remember that each record encodes a variant, such as a SNP, CNV, or indel, and the state of that variant per sample). At the variant (record) level, we find, among others, AC: the total number of ALT alleles in called genotypes, AF: the estimated allele frequency, NS: the number of samples with data, and AN: the total number of alleles in called genotypes. The remaining fields are mostly specific to the 1000 Genomes Project (here, we will try to be as general as possible). Your own dataset may have more annotations (or none of these).


At the sample level, there is only one annotation in this file: GT, the genotype. Other VCF files often also carry DP at both levels, as a per-variant (total) read depth and a per-sample read depth; be sure not to confuse the two.
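
The split between variant-level and sample-level annotations comes from the `##INFO` and `##FORMAT` lines of the VCF header. As a minimal sketch using only the standard library, the snippet below separates the two kinds of IDs. The three header lines and the function name `split_annotations` are hypothetical and only illustrate the idea; this is not how PyVCF parses headers.

```python
import re

# Hypothetical three-line header excerpt; a real file has many more lines.
HEADER = """\
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
"""

# Matches the ID inside an ##INFO or ##FORMAT meta-information line.
PATTERN = re.compile(r'^##(INFO|FORMAT)=<ID=([^,>]+)')


def split_annotations(header_text):
    """Return (variant_level_ids, sample_level_ids) from VCF header text."""
    ids = {'INFO': [], 'FORMAT': []}
    for line in header_text.splitlines():
        match = PATTERN.match(line)
        if match:
            kind, ident = match.groups()
            ids[kind].append(ident)
    return ids['INFO'], ids['FORMAT']


variant_level, sample_level = split_annotations(HEADER)
print(variant_level)
print(sample_level)
```

INFO IDs describe the variant as a whole, while FORMAT IDs describe each sample's call for that variant.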

In [20]:
# now that we know which information is available, let's inspect a single VCF record

v = vcf.Reader(filename=vcf_fname)
rec = next(v)
print(rec.CHROM, rec.POS, rec.ID, rec.REF, rec.ALT, rec.QUAL, rec.FILTER)
print(rec.INFO)
print(rec.FORMAT)
samples = rec.samples
print(len(samples))
sample = samples[1]
print(sample)
print(sample.sample)
print(sample.called,sample.gt_type, sample.gt_alleles, sample.is_het, sample.is_variant, sample.phased)
print(sample['GT'])

1 710330 ALU_umary_ALU_2 A [<INS:ME:ALU>] None None
{'AC': [35], 'AF': [0.00698882], 'AFR_AF': [0.0], 'AMR_AF': [0.0072], 'AN': 5008, 'CS': 'ALU_umary', 'EAS_AF': [0.0069], 'EUR_AF': [0.0189], 'MEINFO': ['AluYa4_5', '1', '223', '-'], 'NS': 2504, 'SAS_AF': [0.0041], 'SITEPOST': 0.9998, 'SVLEN': [222], 'SVTYPE': 'ALU', 'TSD': 'null'}
GT
2504
Call(sample=HG00097, CallData(GT=0|0))
HG00097
True 0 ['0', '0'] False False True
0|0
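
The INFO values printed above can be cross-checked with a little arithmetic. This is a rough sketch, assuming that AF for this record is simply AC divided by AN, and that every sample is diploid and called at this site (so AN is twice NS). Neither assumption is guaranteed for other records or other files.

```python
# Values copied from the INFO dictionary printed above.
info = {'AC': [35], 'AF': [0.00698882], 'AN': 5008, 'NS': 2504}

# One frequency per ALT allele: ALT allele count over total called alleles.
computed_af = [ac / info['AN'] for ac in info['AC']]
print(computed_af)

# Diploid samples contribute two alleles each when every genotype is called.
alleles_if_all_called = 2 * info['NS']
print(alleles_if_all_called)
```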


We start by retrieving the standard information: the chromosome, position, identifier, reference base (typically just one), alternative bases (there can be more than one, but it's not uncommon as a first filtering approach to accept only a single ALT, for example, only bi-allelic SNPs), quality (as you might expect, Phred-scaled), and filter status. For this record, the ALT is a symbolic allele (<INS:ME:ALU>), and both the quality and the filter status are empty (None). Regarding the filter status, remember that whatever the VCF file says, you may still want to apply extra filters (as in the next recipe).
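
The "single ALT" first-pass filter mentioned above can be sketched without any VCF library. The `Record` tuple and the `is_biallelic_snp` helper below are hypothetical stand-ins, not PyVCF objects, and the three records are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed record; only the fields used here.
Record = namedtuple('Record', ['chrom', 'pos', 'ref', 'alts'])


def is_biallelic_snp(record):
    """True when REF and the single ALT are each exactly one base."""
    bases = set('ACGT')
    return (len(record.alts) == 1
            and record.ref in bases
            and record.alts[0] in bases)


records = [
    Record('1', 100, 'A', ['G']),             # bi-allelic SNP
    Record('1', 200, 'C', ['T', 'G']),        # tri-allelic SNP
    Record('1', 300, 'A', ['<INS:ME:ALU>']),  # symbolic structural variant
]
kept = [r for r in records if is_biallelic_snp(r)]
print(kept)
```

Only the first record passes: the second has two ALT alleles, and the third has a symbolic ALT rather than a single base.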


We then print the additional variant-level information (AC, AF, AN, NS, and so on), followed by the sample format (in this case, only GT). Finally, we count the number of samples (2,504) and inspect a single sample to check whether it was called for this variant. We also print the reported alleles, heterozygosity, and phasing status (this dataset happens to be phased, which is not that common).
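
To make the per-sample flags less opaque, here is a hand-written approximation of how a GT string maps to them. It is a sketch and not PyVCF's implementation: the function name `describe_genotype` is hypothetical, and the handling of uncalled genotypes is an assumption.

```python
import re


def describe_genotype(gt):
    """Summarize a GT string such as '0|0', '0/1', or './.'."""
    alleles = re.split(r'[|/]', gt)
    called = all(a != '.' for a in alleles)
    phased = '|' in gt          # '|' marks phased, '/' marks unphased
    if not called:
        gt_type = None
    elif len(set(alleles)) > 1:
        gt_type = 1             # heterozygous
    elif alleles[0] == '0':
        gt_type = 0             # homozygous reference
    else:
        gt_type = 2             # homozygous alternate
    return {
        'called': called,
        'gt_type': gt_type,
        'gt_alleles': alleles,
        'is_het': gt_type == 1,
        'is_variant': called and gt_type != 0,
        'phased': phased,
    }


summary = describe_genotype('0|0')
print(summary)
```

For `0|0`, the summary agrees with the values printed for sample HG00097 above: called, homozygous reference, not heterozygous, not a variant, and phased.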

In [None]:
# Let's check the type of variant and the number of non-biallelic SNPs in a single pass:
f = vcf.Reader(filename=vcf_fname)

my_type = defaultdict(int)
num_alts = defaultdict(int)

for rec in f:
    my_type[rec.var_type, rec.var_subtype] += 1
    if rec.is_snp:
        num_alts[len(rec.ALT)] += 1
print(my_type)
print(num_alts)

We use the now-common Python default dictionary, which starts every counter at zero the first time a key is seen. We find that this dataset has INDELs (both insertions and deletions), CNVs, and, of course, SNPs (roughly two-thirds being transitions with one-third transversions). There is a residual number (79) of tri-allelic SNPs.
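
The counting pattern and the transition/transversion split can be illustrated on a handful of made-up (REF, ALT) pairs. This is a sketch only: the `snp_class` helper is hypothetical, the data are invented, and the counts say nothing about the real file.

```python
from collections import defaultdict

PURINES = {'A', 'G'}
PYRIMIDINES = {'C', 'T'}


def snp_class(ref, alt):
    """Return 'ts' for a transition, 'tv' for a transversion."""
    # A transition stays within purines or within pyrimidines.
    same_family = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return 'ts' if same_family else 'tv'


# Made-up (REF, ALT) pairs, for illustration only.
snps = [('A', 'G'), ('C', 'T'), ('A', 'C'), ('G', 'A'), ('T', 'G'), ('T', 'C')]

counts = defaultdict(int)   # missing keys start at 0
for ref, alt in snps:
    counts[snp_class(ref, alt)] += 1
print(dict(counts))
```

Because `defaultdict(int)` supplies a zero for unseen keys, the loop needs no explicit initialization, which is the same reason the cell above can increment `my_type` and `num_alts` directly.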

In [11]:
samples

[Call(sample=HG00096, CallData(GT=0|0)),
 Call(sample=HG00097, CallData(GT=0|0)),
 Call(sample=HG00099, CallData(GT=0|0)),
 Call(sample=HG00100, CallData(GT=0|0)),
 Call(sample=HG00101, CallData(GT=0|0)),
 Call(sample=HG00102, CallData(GT=0|0)),
 Call(sample=HG00103, CallData(GT=0|0)),
 Call(sample=HG00105, CallData(GT=0|0)),
 Call(sample=HG00106, CallData(GT=0|0)),
 Call(sample=HG00107, CallData(GT=0|0)),
 Call(sample=HG00108, CallData(GT=0|0)),
 Call(sample=HG00109, CallData(GT=0|0)),
 Call(sample=HG00110, CallData(GT=0|0)),
 Call(sample=HG00111, CallData(GT=0|0)),
 Call(sample=HG00112, CallData(GT=0|0)),
 Call(sample=HG00113, CallData(GT=0|0)),
 Call(sample=HG00114, CallData(GT=0|0)),
 Call(sample=HG00115, CallData(GT=0|0)),
 Call(sample=HG00116, CallData(GT=0|0)),
 Call(sample=HG00117, CallData(GT=0|0)),
 Call(sample=HG00118, CallData(GT=0|0)),
 Call(sample=HG00119, CallData(GT=0|0)),
 Call(sample=HG00120, CallData(GT=0|0)),
 Call(sample=HG00121, CallData(GT=0|0)),
 Call(sample=HG0