# validation 
## Yosuke Tanigawa (ytanigaw@stanford.edu) 2017/8/20

Evaluation on chr20 does not look great: genotype accuracy is about 49% (97,384 of 198,417 positions).



## 1] data set
- reference panel : UKBB imputed genotype (v1) with 112k keep file
- read : nanopore consortium data (chr 20) subset to > 10k mapped fragments (roughly 14x coverage)

## 2] results
### genotype (alternative allele count) accuracy measure
- `~49%` accuracy in genotype (definition follows)
  - accuracy := (1 / (number of SNP positions)) * \sum_i indicator{alt_c(posterior, i) == alt_c(validation, i)}
  - where,
    - indicator is the indicator function (1 if the condition holds, 0 otherwise)
    - alt_c(dataset = d, position = i) is the alternative allele count at position i in dataset d
    - posterior is our prediction
    - validation is from the Platinum Genomes dataset
    - we are looking at 198,417 positions on chr20
    - 97384 / 198417 = 49% accuracy
    - breakdown
      - 94172/148750 (63.3%) for reference allele homozygous positions (in the Platinum Genomes dataset)
      - 51/43343 (0.12%) for heterozygous positions (in the Platinum Genomes dataset)
      - 3161/6324 (50.0%) for alternative allele homozygous positions (in the Platinum Genomes dataset)
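
The accuracy definition above can be sketched with NumPy. The toy arrays are made up for illustration (they are not the chr20 data), and `genotype_accuracy` is a hypothetical helper name. Each matrix is assumed to have one row per SNP position and one column per haplotype, coded 0 (reference) or 1 (alternative).

```python
import numpy as np

def genotype_accuracy(pred_alleles, truth_alleles):
    """Fraction of positions where the alternative allele counts agree.

    Both inputs are (n_positions, 2) arrays of 0/1 allele codes (assumed layout).
    """
    pred_alt_c = np.asarray(pred_alleles).sum(axis=1)
    truth_alt_c = np.asarray(truth_alleles).sum(axis=1)
    return float(np.mean(pred_alt_c == truth_alt_c))

# Toy example (illustrative values only)
posterior = np.array([[1, 1], [0, 0], [0, 0], [1, 1]])
validation = np.array([[0, 1], [0, 0], [0, 0], [1, 1]])
acc = genotype_accuracy(posterior, validation)
print(acc)  # 0.75: three of the four positions have matching counts
```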

## 3] what to check

### bin size can be too large for the phased dataset
- number of SNPs per bin ranges from 3 to 1600+
- number of unique haplotypes per bin ranges from 5 to 82208
- the definition of bins might not be optimal
- `/oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/nanopore/src/inference/prior_count.out`

### heterozygous calls are very rare in our posterior calls (227 of 198,417 positions)
- the read-specific error rate estimate may be incorrect (?)
- `/oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/nanopore/src/inference/pgenout.out`

### need to evaluate the estimate at different coverage levels
- how much performance can we get from the prior probability alone (?) => useful for judging the validity of the read-specific error rate


## 4] method summary

### data prep
  - UKBB imputed data (v1) subset to 112k individuals and biallelic SNPs
  - phased with Eagle 2

###  partition genome into bins based on LD structure (`plink1.9 --blocks`)
  - https://github.com/rivas-lab/nanopore/blob/master/notebook/data_prep/population_ref/define_bins/define_bins.ipynb
  
### prior: just a frequency count in population
  - https://github.com/rivas-lab/nanopore/blob/master/src/inference/prior_count.py
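
A minimal sketch of a frequency-count prior, assuming each phased haplotype within a bin is represented as a string of 0/1 allele codes. `haplotype_prior` and the toy haplotypes are hypothetical and are not taken from `prior_count.py`.

```python
from collections import Counter

def haplotype_prior(haplotypes):
    """Map each distinct haplotype in a bin to its relative frequency."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items()}

# Toy bin with four phased haplotypes over three SNPs (illustrative only)
bin_haplotypes = ["001", "001", "010", "001"]
prior = haplotype_prior(bin_haplotypes)
print(prior)  # {'001': 0.75, '010': 0.25}
```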
  
### log-likelihood: binomial model with mismatch rate estimate from non-SNP sites
  - https://github.com/rivas-lab/nanopore/blob/master/src/inference/log_likelihood.py
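
One way to picture the likelihood term, as a sketch under stated assumptions rather than the code in `log_likelihood.py`: mismatches at the SNP sites covered by a read are treated as independent, with a single mismatch rate for the read. A read with `k` mismatches out of `n` sites against a candidate haplotype then has log-likelihood `k*log(e) + (n-k)*log(1-e)`. The function name and inputs are hypothetical.

```python
import math

def read_log_likelihood(read_alleles, hap_alleles, mismatch_rate):
    """Log-likelihood of one read given one candidate haplotype.

    Assumes independent mismatches with a single rate, 0 < mismatch_rate < 1.
    """
    if len(read_alleles) != len(hap_alleles):
        raise ValueError("read and haplotype must cover the same SNP sites")
    n = len(read_alleles)
    k = sum(r != h for r, h in zip(read_alleles, hap_alleles))
    return k * math.log(mismatch_rate) + (n - k) * math.log1p(-mismatch_rate)

# Toy read over four SNP sites, with an assumed 10% mismatch rate
ll_match = read_log_likelihood([0, 1, 1, 0], [0, 1, 1, 0], 0.1)
ll_one_off = read_log_likelihood([0, 1, 1, 0], [0, 1, 0, 0], 0.1)
print(ll_match > ll_one_off)  # True: the exact match is more likely
```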

### log-posterior: just compute from prior and likelihood
  - https://github.com/rivas-lab/nanopore/blob/master/src/inference/log_posterior.py
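
Combining the two terms amounts to adding log-prior and log-likelihood per haplotype and then normalizing. The sketch below assumes both are given as dictionaries keyed by haplotype; the names and numbers are illustrative and not taken from `log_posterior.py`.

```python
import math

def log_posterior(log_prior, log_lik):
    """Normalized log-posterior over the haplotypes of one bin."""
    unnorm = {h: log_prior[h] + log_lik[h] for h in log_prior}
    # log-sum-exp with the maximum factored out for numerical stability
    m = max(unnorm.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in unnorm.values()))
    return {h: v - log_z for h, v in unnorm.items()}

# Toy bin with two haplotypes (illustrative values only)
log_prior = {"001": math.log(0.75), "010": math.log(0.25)}
log_lik = {"001": math.log(0.1), "010": math.log(0.9)}
post = {h: math.exp(v) for h, v in log_posterior(log_prior, log_lik).items()}
print(post)  # approximately {'001': 0.25, '010': 0.75}
```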

### pgen output: call homozygous (for a block) iff posterior(haplotype) >= 0.9
  - https://github.com/rivas-lab/nanopore/blob/master/src/inference/log_posterior.py
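
The calling rule can be sketched as a simple threshold on the most probable haplotype. The notes above do not say what is emitted when no haplotype reaches the threshold, so this hypothetical `call_homozygous` returns `None` in that case as a placeholder.

```python
def call_homozygous(posterior, threshold=0.9):
    """Return (hap, hap) if the top haplotype reaches the threshold, else None.

    `posterior` maps haplotype -> posterior probability for one block.
    The None fallback is an assumption made for this sketch.
    """
    best_hap = max(posterior, key=posterior.get)
    if posterior[best_hap] >= threshold:
        return (best_hap, best_hap)
    return None

confident = call_homozygous({"001": 0.95, "010": 0.05})
ambiguous = call_homozygous({"001": 0.60, "010": 0.40})
print(confident, ambiguous)  # ('001', '001') None
```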

In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import dok_matrix

import pgenlib as pg

In [2]:
population_ref_f='../../public_data/intermediate/population_ref/chr20-alleles'
out_pgen_f='../../private_data/output/chr20.sorted-chr20-alleles.pgen'
validation_f='../../public_data/intermediate/validation/NA12878-snps'

In [3]:
def read_alleles_range_wrapper(pgen_f):
    with pg.PgenReader(pgen_f) as pgr:
        buf = np.zeros((pgr.get_variant_ct(), pgr.get_raw_sample_ct() * 2), dtype=np.int32)
        pgr.read_alleles_range(0, pgr.get_variant_ct(), buf)
    return buf

In [4]:
def read_bim(bim_f):
    return pd.read_csv(
        bim_f, sep='\t', 
        names=['chr', 'rsid', 'genetic_dist', 'pos', 'a1', 'a2']
    )    

In [5]:
out_mat = read_alleles_range_wrapper(out_pgen_f)

In [6]:
pop_bim_df = read_bim('{}.bim'.format(population_ref_f))
val_bim_df = read_bim('{}.bim'.format(validation_f))

In [7]:
pop_bim_a1 = dict(zip(pop_bim_df.pos, pop_bim_df.a1))
pop_bim_a2 = dict(zip(pop_bim_df.pos, pop_bim_df.a2))
pop_bim_idx = dict(zip(pop_bim_df.pos, pop_bim_df.index))

In [8]:
val_bim_a1 = dict(zip(val_bim_df.pos, val_bim_df.a1))
val_bim_a2 = dict(zip(val_bim_df.pos, val_bim_df.a2))

In [9]:
idx = set(pop_bim_df.pos) & set(val_bim_df.pos)

In [10]:
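# no_flip: a1/a2 agree between the validation and population .bim files
no_flip = {x for x in idx if val_bim_a1[x] == pop_bim_a1[x] and val_bim_a2[x] == pop_bim_a2[x]}
# flip: a1 and a2 are swapped between the two files
flip    = {x for x in idx if val_bim_a2[x] == pop_bim_a1[x] and val_bim_a1[x] == pop_bim_a2[x]}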

In [11]:
sorted(idx - flip - no_flip)

[1941171, 7601825, 12739300, 61948796]

In [12]:
val_pgen_mat = read_alleles_range_wrapper('{}.pgen'.format(validation_f))

In [13]:
# Align validation alleles to the row order of the population reference.
# Rows never assigned here stay 0, i.e. they count as homozygous reference.
val_sparse = dok_matrix(out_mat.shape, dtype=np.int32)
for val_idx in range(val_pgen_mat.shape[0]):
    pos = val_bim_df.pos[val_idx]
    if pos in no_flip:
        val_sparse[pop_bim_idx[pos], :] = val_pgen_mat[val_idx, :]
    elif pos in flip:
        # a1/a2 are swapped, so recode 0 <-> 1
        val_sparse[pop_bim_idx[pos], :] = 1 - val_pgen_mat[val_idx, :]

In [14]:
val_mat = np.array(val_sparse.todense(), dtype = np.int32)

In [15]:
val_mat.shape, out_mat.shape

((198417, 2), (198417, 2))

In [16]:
val_mat.sum(axis=1).shape, out_mat.sum(axis=1).shape

((198417,), (198417,))

In [17]:
val_mat.sum(axis=1)[:10]

array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])

In [18]:
out_mat.sum(axis=1)[:10]

array([2, 2, 2, 0, 0, 0, 2, 2, 2, 0])

In [26]:
np.sum(val_mat.sum(axis=1) == out_mat.sum(axis=1)), len(val_mat)

(97384, 198417)

In [23]:
100.0 * np.sum(val_mat.sum(axis=1) == out_mat.sum(axis=1)) / len(val_mat)

49.08047193536844

In [21]:
np.sum(out_mat.sum(axis=1) == 1)

227

In [25]:
val_ref = (val_mat.sum(axis=1) == 0)
val_het = (val_mat.sum(axis=1) == 1)
val_alt = (val_mat.sum(axis=1) == 2)

In [27]:
np.sum(val_mat.sum(axis=1)[val_ref] == out_mat.sum(axis=1)[val_ref]), len(val_mat[val_ref])

(94172, 148750)

In [28]:
np.sum(val_mat.sum(axis=1)[val_het] == out_mat.sum(axis=1)[val_het]), len(val_mat[val_het])

(51, 43343)

In [31]:
np.sum(val_mat.sum(axis=1)[val_alt] == out_mat.sum(axis=1)[val_alt]), len(val_mat[val_alt])

(3161, 6324)
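
The per-class counts above only show the diagonal of the comparison. A 3x3 confusion matrix (rows: validation alt count, columns: predicted alt count) would also show where the missed heterozygous positions end up. This is a sketch on made-up toy arrays; `genotype_confusion` is a hypothetical helper and assumes all counts are 0, 1, or 2.

```python
import numpy as np

def genotype_confusion(truth_alt_c, pred_alt_c):
    """3x3 table: entry [t, p] counts positions with truth t and prediction p."""
    confusion = np.zeros((3, 3), dtype=np.int64)
    # np.add.at accumulates correctly when the same (t, p) pair repeats
    np.add.at(confusion, (truth_alt_c, pred_alt_c), 1)
    return confusion

# Toy alt allele counts (illustrative only)
truth = np.array([0, 0, 1, 1, 2, 1])
pred = np.array([0, 2, 0, 2, 2, 1])
cm = genotype_confusion(truth, pred)
print(cm)
# [[1 0 1]
#  [1 1 1]
#  [0 0 1]]
```

With the notebook's arrays, the equivalent call would be `genotype_confusion(val_mat.sum(axis=1), out_mat.sum(axis=1))`.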