# [technical note] pgenlib

- Yosuke Tanigawa (ytanigaw@stanford.edu)
- 2017.02.07

## objective 
- Learn how to call pgenlib through its Python API
- Test whether we can infer haplotypes from informative SNPs on long reads

### update
- 2017/02/07:
  - changed population reference to 1KG
- 2017/02/06: 
  - updated pgenlib python binding from v0.5 to v0.6 
  - learned how to get genotype info from pgen file (imputed file)
- 2017/02/03: 
  - Initial draft: working with bgen file (array data without imputation)

## data set in mind
- `/share/PI/mrivas/data/nanopore-wgs-consortium-old/nanopore-wgs.25000.sorted.10k.mapq50.ext.sorted.informative.q14.snps`

## dependencies

In [1]:
import numpy as np
import pgenlib as pg
import pandas as pd
import subprocess as sp
import sys

In [2]:
import matplotlib
matplotlib.use('agg')
from matplotlib import pyplot as plt

## how to read pgen file (or bed file)
- In the following sections, I try the basic API calls described in the pgenlib specification

In [3]:
chr20 = pg.PgenReader('/scratch/users/ytanigaw/20170207/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes-pgen.pgen')

In [4]:
type(chr20)

Python.pgenlib.PgenReader

- We can now access the genotype information through a pgenlib reader object

In [5]:
bim20 = pd.read_csv('/scratch/users/ytanigaw/20170207/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes-pgen.bim',
                    sep = '\t', 
                    names = ['chr', 'id', 'morgan', 'bp', 'pri', 'sec'])

In [6]:
bim20.head()

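| | chr | id | morgan | bp | pri | sec |
|---|---|---|---|---|---|---|
| 0 | 20 | rs527639301 | 0 | 60343 | A | G |
| 1 | 20 | rs538242240 | 0 | 60419 | G | A |
| 2 | 20 | rs149529999 | 0 | 60479 | T | C |
| 3 | 20 | rs150241001 | 0 | 60522 | TC | T |
| 4 | 20 | rs533509214 | 0 | 60568 | C | A |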


- The bim file is now loaded in memory as well
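For reference, a `.bim` file is a headerless, tab-separated table with six columns per variant. The sketch below parses a few of the rows shown above without pandas. `BimRow` and `parse_bim_lines` are hypothetical names introduced only for this illustration, and the field names simply follow the column labels used in this note.

```python
from collections import namedtuple

# Hypothetical record type; field names follow the column labels used in this note.
BimRow = namedtuple('BimRow', ['chr', 'id', 'morgan', 'bp', 'pri', 'sec'])

def parse_bim_lines(lines):
    """Parse tab-separated .bim lines into BimRow records."""
    rows = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 6:
            raise ValueError('expected 6 columns, got {}'.format(len(fields)))
        chrom, var_id, morgan, bp, pri, sec = fields
        rows.append(BimRow(chrom, var_id, float(morgan), int(bp), pri, sec))
    return rows

# Three of the rows displayed by bim20.head().
sample_lines = [
    '20\trs527639301\t0\t60343\tA\tG',
    '20\trs538242240\t0\t60419\tG\tA',
    '20\trs150241001\t0\t60522\tTC\tT',
]
bim_rows = parse_bim_lines(sample_lines)
print(bim_rows[0].bp)
```

Raising on a wrong column count is a deliberate choice here: a space-delimited or truncated file fails loudly instead of producing shifted columns.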

#### number of SNP sites

In [7]:
!wc -l /scratch/users/ytanigaw/20170207/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes-pgen.bim

1812841 /scratch/users/ytanigaw/20170207/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes-pgen.bim


In [8]:
bim20.shape

(1812841, 6)

## region of interest
- Extract the SNP sites from the data file (column 7 of each row after the header, split on `;`) as follows:

In [9]:
regions_raw = !cat /share/PI/mrivas/data/nanopore-wgs-consortium-old/nanopore-wgs.25000.sorted.10k.mapq50.ext.sorted.informative.q14.snps|awk '{if(NR > 1){print $7}}'|sed -e 's/;/\n/g'
regions = [r.split(',') for r in regions_raw]
len(regions)

36979

- our benchmark chromosome is chr20

In [10]:
regions_chr20 = [x for x in regions if x[0][:5] == 'chr20']
print(len(regions_chr20))

569


In [11]:
regions_chr20[:10]

[['chr20:712806', 'T', 'A', '*', 'None', '16'],
 ['chr20:712968', 'C', 'G', '*', 'None', '19'],
 ['chr20:712969', 'T', 'G', '*', 'None', '18'],
 ['chr20:713575', 'a', 'T', '*', 'None', '14'],
 ['chr20:714008', 'a', 'G', 'rs2317021', 'True', '18'],
 ['chr20:714873', 'T', 'C', '*', 'None', '14'],
 ['chr20:717649', 'c', 'T', 'rs6117373', 'True', '16'],
 ['chr20:719542', 'c', 'T', '*', 'None', '15'],
 ['chr20:720916', 'G', 'A', 'rs2144930', 'True', '14'],
 ['chr20:725258', 'c', 'T', '*', 'None', '14']]
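Each record above is a six-field, comma-separated string. Below is a small sketch of turning one raw `;`-separated field into typed records. The field names (`allele_a`, `allele_b`, `rsid`, `flag`, `score`) are guesses made for illustration, since the column semantics of the `.snps` file are not documented in this note, and `parse_snp_field` is a hypothetical helper.

```python
from collections import namedtuple

# Hypothetical field names: the meaning of each column is an assumption.
SnpRecord = namedtuple(
    'SnpRecord',
    ['chrom', 'pos', 'allele_a', 'allele_b', 'rsid', 'flag', 'score'])

def parse_snp_field(field):
    """Split a ';'-separated field into SnpRecord tuples."""
    records = []
    for item in field.split(';'):
        if not item:
            continue  # skip empty pieces, e.g. from a trailing ';'
        locus, allele_a, allele_b, rsid, flag, score = item.split(',')
        chrom, pos = locus.split(':')
        records.append(SnpRecord(chrom, int(pos), allele_a, allele_b,
                                 None if rsid == '*' else rsid,
                                 flag, int(score)))
    return records

# Two records copied from the output above, joined as they would be in the file.
raw = 'chr20:713575,a,T,*,None,14;chr20:714008,a,G,rs2317021,True,18'
records = parse_snp_field(raw)

# Keep chr20 records that carry an rs identifier.
known = [r for r in records if r.chrom == 'chr20' and r.rsid is not None]
print(known[0].pos)
```

Comparing the parsed chromosome name with `==` sidesteps prefix matching on the raw string, and storing the position as an integer makes the later coordinate lookup a direct comparison.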

In [13]:
regions_chr20_snps = [x for x in regions_chr20 if x[3][:2] == 'rs']
print(len(regions_chr20_snps))

194


In [14]:
regions_chr20_snps[:10]

[['chr20:714008', 'a', 'G', 'rs2317021', 'True', '18'],
 ['chr20:717649', 'c', 'T', 'rs6117373', 'True', '16'],
 ['chr20:720916', 'G', 'A', 'rs2144930', 'True', '14'],
 ['chr20:727476', 'c', 'T', 'rs60461885', 'True', '16'],
 ['chr20:728499', 'a', 'C', 'rs6054508', 'True', '16'],
 ['chr20:730893', 'c', 'T', 'rs57596504', 'True', '16'],
 ['chr20:731005', 'a', 'G', 'rs13039346', 'True', '19'],
 ['chr20:732295', 'a', 'G', 'rs6140092', 'True', '19'],
 ['chr20:734617', 'a', 'G', 'rs56210940', 'True', '16'],
 ['chr20:736644', 'g', 'A', 'rs6140099', 'True', '16']]

## find variant_idxs of interest
- we need to map genomic coordinates to variant indices

In [15]:
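def browse_bim_file_by_genome_index(bim_df, index, head_num = 1, verbose = False):
    # Return the first head_num rows whose bp is >= index.
    # This is an exact match only if index itself is present in bim_df.
    if verbose:
        print('query: {}'.format(index))
    entry = bim_df.loc[bim_df['bp'] >= index].head(head_num)
    return entry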

In [16]:
idxs_chr20 = [int(r[0].split(':')[1]) for r in regions_chr20_snps]
print(len(idxs_chr20))
print(idxs_chr20[:10])

194
[714008, 717649, 720916, 727476, 728499, 730893, 731005, 732295, 734617, 736644]


In [17]:
browse_bim_file_by_genome_index(bim20, idxs_chr20[0], verbose=True)

query: 714008


| | chr | id | morgan | bp | pri | sec |
|---|---|---|---|---|---|---|
| 20751 | 20 | rs2317021 | 0 | 714008 | G | A |
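Note that `browse_bim_file_by_genome_index` returns the first row whose `bp` is at or after the query, so a position that is absent from the bim file would silently yield its right-hand neighbour. The following is a minimal sketch of an exact-match lookup, assuming the positions are sorted in ascending order (as the rows displayed earlier suggest). `find_exact` is a hypothetical helper, not part of pgenlib or pandas.

```python
from bisect import bisect_left

def find_exact(sorted_positions, query):
    """Return the index of the first entry equal to query, or None if absent."""
    i = bisect_left(sorted_positions, query)
    if i < len(sorted_positions) and sorted_positions[i] == query:
        return i
    return None

# Toy positions taken from the region list above.
positions = [712806, 714008, 717649, 720916]
print(find_exact(positions, 714008))  # present in the list
print(find_exact(positions, 714009))  # absent from the list
```

Returning `None` for an absent position makes the mismatch explicit, so the caller can decide whether to skip the site or report it.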


In [18]:
# map bp position -> row index in the bim file (for a repeated position, the last row wins)
genome_index_to_variant_index = dict(zip(bim20['bp'], range(len(bim20))))

In [19]:
variant_idxs = np.array([genome_index_to_variant_index[pos] for pos in idxs_chr20 
                         if pos in genome_index_to_variant_index],
                        dtype = np.uint32)

In [20]:
print(len(idxs_chr20))
print(len(variant_idxs))

194
181
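Here 13 of the 194 query positions have no entry in the bim file and are dropped silently by the list comprehension. The sketch below repeats the same mapping on toy data and also reports which positions were dropped. All positions and variable names in it are made up for illustration.

```python
# Toy stand-in for bim20['bp']; position 300 appears twice on purpose.
bp_column = [100, 200, 300, 300, 400]
position_to_index = dict(zip(bp_column, range(len(bp_column))))

queries = [200, 250, 300]
found = [position_to_index[p] for p in queries if p in position_to_index]
missing = [p for p in queries if p not in position_to_index]

print(found)    # row indices of the queries present in the table
print(missing)  # queries with no matching position
```

Because `dict(zip(...))` keeps the last index seen for a repeated key, a position that occupies several rows maps only to its last row. Whether that is the intended row should be checked before relying on the mapping.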


### number of samples, number of variants, phase info

In [21]:
print(chr20.get_raw_sample_ct())
print(chr20.get_variant_ct())
print(chr20.hardcall_phase_present())

2504
1812841
False


### access genotype info of a specific locus

- read the first entry

In [22]:
# int8 buffer with one entry per sample
res_ary = np.zeros(chr20.get_raw_sample_ct(), dtype=np.int8)
chr20.read(variant_idxs[0], res_ary)
print(len(res_ary))
print(np.sum(res_ary == 0))
print(np.sum(res_ary == 1))
print(np.sum(res_ary == 2))
print(np.sum(res_ary == -9))  # -9 marks a missing call

2504
340
962
1202
0
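From these counts, the frequency of the counted allele at this site can be worked out directly, assuming that 0/1/2 are allele counts per diploid sample and that -9 marks a missing call. Here is a short worked calculation using the numbers printed above:

```python
# Genotype counts printed above for the first variant of interest.
counts = {0: 340, 1: 962, 2: 1202, -9: 0}

called = counts[0] + counts[1] + counts[2]     # samples with a non-missing call
allele_copies = counts[1] + 2 * counts[2]      # copies of the counted allele
frequency = allele_copies / float(2 * called)  # two alleles per diploid sample

print(called)
print(round(frequency, 4))
```

The three genotype classes add up to all 2504 samples, and the counted allele is carried on about 67% of the chromosomes at this site.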


- Next, read all variants of interest at once

In [23]:
res_mat = np.zeros((len(variant_idxs), chr20.get_raw_sample_ct()), dtype=np.int8)

In [24]:
res_ary.shape

(2504,)

In [25]:
chr20.read_list(variant_idxs, res_mat)

- resulting matrix
  - each row corresponds to one locus on the genome
  - each column corresponds to one individual in the reference panel
  - each entry (0/1/2) is the alternate allele count

In [26]:
res_mat[:10, :15]

array([[2, 2, 1, 0, 2, 1, 1, 0, 2, 1, 1, 2, 1, 0, 2],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 2, 1, 2, 2, 1, 1, 0, 0, 1, 0, 1],
       [1, 1, 0, 0, 2, 1, 2, 1, 1, 1, 0, 0, 1, 0, 1],
       [2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 0, 2, 0, 1],
       [1, 1, 0, 1, 2, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1],
       [1, 1, 0, 1, 2, 1, 2, 1, 1, 1, 0, 0, 1, 0, 1],
       [1, 1, 0, 1, 2, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1],
       [1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 0, 1]], dtype=int8)
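As a quick use of this layout, the sketch below computes a per-variant allele frequency from the first three rows of the matrix shown above, restricted to the 15 displayed samples. It assumes that missing calls are stored as -9 and excludes them from both the numerator and the denominator. The variable names are illustrative only.

```python
import numpy as np

# First three rows of the matrix shown above (15 samples each).
geno = np.array([
    [2, 2, 1, 0, 2, 1, 1, 0, 2, 1, 1, 2, 1, 0, 2],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 1, 2, 2, 1, 1, 0, 0, 1, 0, 1],
], dtype=np.int8)

called = geno >= 0                                     # False where the call is missing (-9)
allele_copies = np.where(called, geno, 0).sum(axis=1)  # allele copies per variant
n_called = called.sum(axis=1)                          # non-missing samples per variant
freq = allele_copies / (2.0 * n_called)                # two alleles per diploid sample

print(freq)
```

A row with no called samples would divide by zero, so a real run over many variants would need to handle that case explicitly.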