# [technical note] [WIP] pgenlib

- Yosuke Tanigawa (ytanigaw@stanford.edu)
- 2017.02.06

## objective 
- Learn how to call pgenlib through its Python API
- Demonstrate whether we can infer haplotypes from informative SNPs on long reads

### update
- 2017/02/06: 
  - updated pgenlib python binding from v0.5 to v0.6 
  - learned how to get genotype info from pgen file (imputed file)
- 2017/02/03: 
  - Initial draft: working with bgen file (array data without imputation)

## data set in mind
- `/share/PI/mrivas/data/nanopore-wgs-consortium-old/nanopore-wgs.25000.sorted.10k.mapq50.ext.sorted.informative.q14.snps`

## dependencies

In [1]:
import numpy as np
import pgenlib as pg
import pandas as pd
import subprocess as sp
import sys

In [2]:
import matplotlib
matplotlib.use('agg')
from matplotlib import pyplot as plt

## how to read pgen file (or bed file)
- In the following sections, I try the basic API calls, following the specification

In [3]:
chr11 = pg.PgenReader('/share/PI/mrivas/ukbb/download/chr11impv1-pgen.pgen')
type(chr11)

Python.pgenlib.PgenReader

- Genotype information is now accessible through the pgenlib reader

In [4]:
bim11 = pd.read_csv('/share/PI/mrivas/ukbb/download/chr11impv1-pgen.bim', sep = '\t', 
                   names = ['chr', 'id', 'morgan', 'bp', 'pri', 'sec'])

In [5]:
bim11.head()

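| | chr | id | morgan | bp | pri | sec |
|---|---|---|---|---|---|---|
| 0 | 11 | rs371609562 | 0 | 61395 | CTT | C |
| 1 | 11 | rs566764841 | 0 | 77250 | G | GT |
| 2 | 11 | rs200634578 | 0 | 87150 | A | T |
| 3 | 11 | rs560955407 | 0 | 87203 | G | T |
| 4 | 11 | rs537964411 | 0 | 87209 | TA | T |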


- The .bim file is now loaded in memory as a DataFrame

#### imputed file has more SNP sites

In [6]:
!wc -l /share/PI/mrivas/ukbb/download/chr11impv1-pgen.bim

3603469 /share/PI/mrivas/ukbb/download/chr11impv1-pgen.bim


In [7]:
!wc -l /share/PI/mrivas/data/ukbb/cal/chrom11.bim

43237 /share/PI/mrivas/data/ukbb/cal/chrom11.bim


## region of interest
- Extract the SNP sites from the nanopore data file as follows:

In [8]:
regions_raw = !cat /share/PI/mrivas/data/nanopore-wgs-consortium-old/nanopore-wgs.25000.sorted.10k.mapq50.ext.sorted.informative.q14.snps|awk '{if(NR > 1){print $7}}'|sed -e 's/;/\n/g'
regions = [r.split(',') for r in regions_raw]
len(regions)

36979
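
The shell pipeline above can also be sketched in pure Python, which avoids the dependency on `awk` and `sed`. The function name `parse_regions` and the sample lines are hypothetical. The sketch assumes, like the `awk` call, that the file is whitespace-delimited, has one header line, and stores `;`-separated SNP records in column 7.

```python
def parse_regions(lines):
    """Collect the ';'-separated records of column 7 and split each on ','."""
    regions = []
    for line_no, line in enumerate(lines, start=1):
        if line_no == 1:
            continue  # skip the header line, like NR > 1 in awk
        fields = line.split()
        if len(fields) < 7:
            # awk would emit an empty line here; this sketch skips the row instead
            continue
        for record in fields[6].split(';'):
            regions.append(record.split(','))
    return regions


# Made-up input lines, only to illustrate the assumed layout.
example_lines = [
    "c1 c2 c3 c4 c5 c6 snps",
    "r1 a b c d e chr11:100,C,G,*,None,15;chr11:200,G,A,rs1,True,14",
    "r2 a b c d e chr2:300,T,C,*,None,17",
]
example_regions = parse_regions(example_lines)
```

With a real file, the call would look like `with open(path) as f: regions = parse_regions(f)`.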

- Our benchmark chromosome is chr11 (the only chromosome for which we have a pgen file)

In [9]:
regions_chr11 = [x for x in regions if x[0].startswith('chr11:')]
print(len(regions_chr11))

1430


In [10]:
regions_chr11[:10]

[['chr11:2937020', 'C', 'G', '*', 'None', '15'],
 ['chr11:2937222', 'G', 'A', '*', 'None', '15'],
 ['chr11:2939266', 'T', 'G', '*', 'None', '14'],
 ['chr11:2939392', 'G', 'A', 'rs445679', 'True', '14'],
 ['chr11:2940346', 'c', 'G', '!', 'None', '19'],
 ['chr11:2940347', 'c', 'A', '!', 'None', '16'],
 ['chr11:2940939', 'c', 'T', '*', 'None', '15'],
 ['chr11:2942214', 'a', 'G', 'rs2519158', 'True', '15'],
 ['chr11:2943996', 'C', 'G', '*', 'None', '15'],
 ['chr11:2944330', 'G', 'A', '*', 'None', '18']]

In [11]:
# Keep only sites annotated with an rsID
regions_chr11_snps = [x for x in regions_chr11 if x[3].startswith('rs')]
print(len(regions_chr11_snps))

529


In [12]:
regions_chr11_snps[:10]

[['chr11:2939392', 'G', 'A', 'rs445679', 'True', '14'],
 ['chr11:2942214', 'a', 'G', 'rs2519158', 'True', '15'],
 ['chr11:2947246', 'G', 'A', 'rs61871192', 'True', '15'],
 ['chr11:2950558', 'A', 'G', 'rs13390', 'True', '19'],
 ['chr11:2957645', 'a', 'G', 'rs12360591', 'True', '14'],
 ['chr11:2958774', 'g', 'T', 'rs61871197', 'True', '14'],
 ['chr11:2962132', 'T', 'C', 'rs2583422', 'True', '17'],
 ['chr11:2962177', 'G', 'A', 'rs4758492', 'True', '14'],
 ['chr11:2962842', 'G', 'A', 'rs775456943', 'False', '15'],
 ['chr11:2963780', 'G', 'A', 'rs563544435', 'False', '15']]

## find variant_idxs of interest
- We need a correspondence between genomic coordinates and variant indices

In [13]:
def browse_bim_file_by_genome_index(bim_df, index, head_num=1, verbose=False):
    # Return the first head_num rows (in file order) whose position is >= index
    if verbose:
        print('query: {}'.format(index))
    entry = bim_df.loc[bim_df['bp'] >= index].head(head_num)
    return entry

In [14]:
idxs_chr11 = [int(r[0].split(':')[1]) for r in regions_chr11_snps]
print(len(idxs_chr11))
print(idxs_chr11[:10])

529
[2939392, 2942214, 2947246, 2950558, 2957645, 2958774, 2962132, 2962177, 2962842, 2963780]


In [15]:
browse_bim_file_by_genome_index(bim11, idxs_chr11[0], verbose=True)

query: 2939392


| | chr | id | morgan | bp | pri | sec |
|---|---|---|---|---|---|---|
| 89248 | 11 | rs445679 | 0 | 2939392 | G | A |
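
The boolean mask in `browse_bim_file_by_genome_index` scans every row of the .bim table for each query. If the `bp` column is sorted in ascending order (an assumption that is not checked here), a binary search finds the same first row. A minimal sketch with the standard library; the helper name `first_index_at_or_after` is hypothetical, and the toy positions are the first five `bp` values shown by `bim11.head()`.

```python
from bisect import bisect_left


def first_index_at_or_after(sorted_positions, query):
    """Return the index of the first position >= query, or None if there is none."""
    i = bisect_left(sorted_positions, query)
    return i if i < len(sorted_positions) else None


toy_positions = [61395, 77250, 87150, 87203, 87209]
exact_hit = first_index_at_or_after(toy_positions, 87150)   # query equals a position
in_between = first_index_at_or_after(toy_positions, 80000)  # query falls between two positions
past_end = first_index_at_or_after(toy_positions, 90000)    # query is beyond the last position
```

On the real table, the positions could be passed as `bim11['bp'].tolist()`. Note that, like the original function, this returns the next variant when the query position itself is absent, so the caller still has to compare positions to confirm an exact match.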


In [16]:
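# Map bp position -> row index in the .bim file (= variant index in the pgen file).
# If several rows share a position, only the last row index is kept.
genome_index_to_variant_index = dict(zip(bim11['bp'], range(len(bim11))))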

In [17]:
variant_idxs = np.array([genome_index_to_variant_index[pos] for pos in idxs_chr11 
                         if pos in genome_index_to_variant_index],
                        dtype = np.uint32)

In [18]:
print(len(idxs_chr11))
print(len(variant_idxs))

529
483
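
Two details of this position-based lookup are worth checking. `dict(zip(...))` keeps only the last row index when several .bim rows share a `bp` value, and a position match does not by itself guarantee that the rsID matches as well. A minimal sketch that keys on position but verifies the ID, using made-up rows; `build_position_index` and `lookup_variant` are hypothetical helper names, not part of pgenlib or pandas.

```python
from collections import defaultdict


def build_position_index(positions):
    """Map each position to the list of all row indices that carry it."""
    index = defaultdict(list)
    for row, bp in enumerate(positions):
        index[bp].append(row)
    return index


def lookup_variant(index, ids, bp, rsid):
    """Return the row index whose position and ID both match, or None."""
    for row in index.get(bp, []):
        if ids[row] == rsid:
            return row
    return None


# Made-up rows: two variants share position 200.
toy_bp = [100, 200, 200, 300]
toy_ids = ['rs1', 'rs2', 'rs3', 'rs4']
toy_index = build_position_index(toy_bp)
```

Here `lookup_variant(toy_index, toy_ids, 200, 'rs2')` resolves to row 1, whereas `dict(zip(toy_bp, range(4)))[200]` would return row 2.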


### # of samples, # of variants, phase info

In [19]:
print(chr11.get_raw_sample_ct())
print(chr11.get_variant_ct())
print(chr11.hardcall_phase_present())

152249
3603469
False


### access genotype info of a specific locus

- read the first entry

In [22]:
res_ary = np.zeros(chr11.get_raw_sample_ct(), dtype=np.int8)
chr11.read(variant_idxs[0], res_ary)
print(len(res_ary))
# Count samples per genotype value: 0/1/2 alternate alleles, -9 = missing
print(sum(res_ary == 0))
print(sum(res_ary == 1))
print(sum(res_ary == 2))
print(sum(res_ary == -9))

152249
151838
306
15
90
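
The four counts above can be gathered in one pass, and they are enough to compute the alternate allele frequency among non-missing samples. A minimal sketch, assuming the coding used in this note (0/1/2 is the alternate allele count, -9 is a missing call); `summarize_genotypes` is a hypothetical helper and the toy input is made up.

```python
from collections import Counter


def summarize_genotypes(genotypes, missing_code=-9):
    """Count genotype values and compute the alternate allele frequency."""
    counts = Counter(int(g) for g in genotypes)
    n_called = counts[0] + counts[1] + counts[2]
    alt_alleles = counts[1] + 2 * counts[2]
    # Each called sample contributes two alleles
    alt_freq = alt_alleles / (2.0 * n_called) if n_called else float('nan')
    return {
        'counts': dict(counts),
        'n_called': n_called,
        'n_missing': counts[missing_code],
        'alt_freq': alt_freq,
    }


toy_genotypes = [0, 0, 1, 2, -9, 0]
summary = summarize_genotypes(toy_genotypes)
```

The same function accepts the NumPy array directly, for example `summarize_genotypes(res_ary)`.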


- Next, read all variants of interest at once

In [23]:
res_mat = np.zeros((len(variant_idxs), chr11.get_raw_sample_ct()), dtype=np.int8)

In [24]:
res_ary.shape

(152249,)

In [25]:
chr11.read_list(variant_idxs, res_mat)

- resulting matrix
  - each row corresponds to one locus on the genome
  - each column corresponds to an individual in the reference panel
  - 0/1/2 represents the alternate allele count; -9 marks a missing call

In [26]:
res_mat[:10, :15]

array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  0,  0,  2,  1,  1,  2,  1, -9,  1,  0,  0,  0,  0],
       [ 2,  2,  2,  2,  1,  2,  2,  1,  1,  2,  1,  2,  2,  2,  2],
       [ 2,  1,  1,  1,  1,  1,  1,  1,  1,  1,  0,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  1,  2,  2,  1,  1,  2,  1,  2,  2,  1,  2],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 2,  2,  2,  2,  1,  2,  2,  1,  1,  2,  1,  2,  2,  1,  2],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 0,  0, -9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  2,  0,  0,  0,  0,  0,  0,  0,  0,  2,  0,  0,  0]], dtype=int8)
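
Because missing calls are stored as -9, plain row sums or means over `res_mat` would be biased unless the missing entries are masked first. A minimal NumPy sketch of the per-variant missing rate and alternate allele frequency on a made-up matrix; `per_variant_stats` is a hypothetical helper, and the -9 missing code is taken from the counts shown earlier in this note.

```python
import numpy as np


def per_variant_stats(mat, missing_code=-9):
    """Return (missing_rate, alt_freq) per row of a variants x samples matrix."""
    mat = np.asarray(mat)
    missing = (mat == missing_code)
    n_called = (~missing).sum(axis=1)
    # Replace missing calls by 0 before summing; use int64 to avoid int8 overflow
    alt_alleles = np.where(missing, 0, mat).sum(axis=1, dtype=np.int64)
    with np.errstate(divide='ignore', invalid='ignore'):
        alt_freq = alt_alleles / (2.0 * n_called)  # NaN if no sample is called
    return missing.mean(axis=1), alt_freq


toy_mat = np.array([[0, 1, 2, -9],
                    [2, 2, 2, 2]], dtype=np.int8)
missing_rate, alt_freq = per_variant_stats(toy_mat)
```

Applied to the real data, the call would be `per_variant_stats(res_mat)`, giving one value per variant in `variant_idxs`.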