# Minor patch on `ukb_snp_qc` file for the array dataset

## Yosuke Tanigawa, 2020/10/14


- The SNP QC file provided by UK Biobank (http://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=1955) contains marker-level QC information.
- The header line of the file seems to contain errors: the `PC19_loading`, `PC29_loading`, and `PC39_loading` columns are all mislabeled as `PC9_loading`.
- This notebook patches those column names and merges the patched table with the pvar file.
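
The header problem described above can be spotted by looking for repeated column names. The following is a minimal base-R sketch on a made-up header vector (`toy_header` is hypothetical and much shorter than the real header), not a check that was run in this notebook:

```r
# Toy header imitating the problem: one column name appears more than once.
# `toy_header` is a hypothetical example, not the real header of ukb_snp_qc.txt.
toy_header <- c('rs_id', 'PC8_loading', 'PC9_loading', 'PC9_loading', 'PC10_loading')

# Names that occur more than once, and the positions where they occur
dup_names <- unique(toy_header[duplicated(toy_header)])
dup_positions <- which(toy_header %in% dup_names)

print(dup_names)
print(dup_positions)
```

On the real file, the same idea could presumably be applied to the first line only, so that the full table does not need to be loaded.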



In [1]:
suppressWarnings(suppressPackageStartupMessages({
    library(tidyverse)
    library(data.table)
}))


In [2]:
snp_qc_f <- '/oak/stanford/groups/mrivas/ukbb24983/snp/snp_download/ukb_snp_qc.txt'
pvar_f <- '/oak/stanford/groups/mrivas/ukbb24983/cal/pgen/ukb24983_cal_cALL_v2_hg19.pvar.zst'
snp_qc_patched_f <- '/oak/stanford/groups/mrivas/ukbb24983/snp/ukb_snp_qc.pvar'


In [3]:
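# Read the zstd-compressed pvar file; keep the chromosome column as character
# so that labels such as X, Y, XY, and MT are preserved
pvar_df <- fread(cmd=paste('zstdcat', pvar_f), colClasses = c('#CHROM'='character')) %>%
    rename('CHROM'='#CHROM')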


In [4]:
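# Read the SNP QC file and rename the mislabeled columns by position:
# columns 137, 147, and 157 all carry the header `PC9_loading`
snp_qc_f %>%
    fread(colClasses = c('chromosome'='character')) %>%
    rename('PC19_loading'=137, 'PC29_loading'=147, 'PC39_loading'=157) -> snp_qc_df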
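
Renaming by position silently does the wrong thing if the column layout ever changes. As a hedged sketch, a guard could confirm that each target position currently holds the expected wrong name before patching. The helper `patch_header` and the toy header below are hypothetical and use base R only:

```r
# Hypothetical helper: rename columns by position, but only if each target
# position currently holds the expected (wrong) name.
patch_header <- function(col_names, patches, expected_old = 'PC9_loading') {
    idx <- unname(patches)
    # Refuse to patch if any target position holds a different name
    stopifnot(all(col_names[idx] == expected_old))
    col_names[idx] <- names(patches)
    # After patching, every column name should be unique
    stopifnot(anyDuplicated(col_names) == 0)
    col_names
}

# Toy header: positions 3 and 4 are mislabeled copies of position 2
toy_names <- c('rs_id', 'PC9_loading', 'PC9_loading', 'PC9_loading')
patched <- patch_header(toy_names, c('PC19_loading' = 3, 'PC29_loading' = 4))
print(patched)
```

With the real table, the `patches` argument would presumably hold the positions 137, 147, and 157 used in the cell above.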


In [5]:
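# Attach the QC columns to every variant in the pvar file,
# matching on variant ID, position, and both alleles
pvar_df %>%
    left_join(
        snp_qc_df,
        by=c('ID'='rs_id', 'POS'='position', 'REF'='allele1_ref', 'ALT'='allele2_alt')
    ) -> full_df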
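
A left join can add rows when keys are duplicated on the right-hand side and leaves `NA` where no match is found, so it may be worth checking both. The sketch below uses toy data frames and base-R `merge(..., all.x = TRUE)` as a stand-in for `left_join`; the data and object names are hypothetical:

```r
# Toy stand-ins for pvar_df and snp_qc_df (hypothetical data)
pvar_toy <- data.frame(
    ID = c('rs1', 'rs2', 'rs3'), POS = c(100L, 200L, 300L),
    REF = c('A', 'C', 'G'), ALT = c('G', 'T', 'A'),
    stringsAsFactors = FALSE
)
qc_toy <- data.frame(
    rs_id = c('rs1', 'rs2'), position = c(100L, 200L),
    allele1_ref = c('A', 'C'), allele2_alt = c('G', 'T'),
    array = c(2L, 0L),
    stringsAsFactors = FALSE
)

# Base-R equivalent of a left join on the four key columns
joined <- merge(
    pvar_toy, qc_toy,
    by.x = c('ID', 'POS', 'REF', 'ALT'),
    by.y = c('rs_id', 'position', 'allele1_ref', 'allele2_alt'),
    all.x = TRUE
)

# The join should neither drop nor duplicate variants
stopifnot(nrow(joined) == nrow(pvar_toy))

# Variants without a QC record show up as NA in the QC columns
n_unmatched <- sum(is.na(joined$array))
print(n_unmatched)
```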


In [6]:
full_df %>% count(array)

array,n
<int>,<int>
0,17536
1,34197
2,753693


In [9]:
full_df %>% filter(CHROM != chromosome) %>% count(CHROM, chromosome)

CHROM,chromosome,n
<chr>,<chr>,<int>
MT,26,265
X,23,18857
XY,25,1357
Y,24,691
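
The mismatches listed above all follow one pattern: the QC file uses numeric codes where the pvar file uses labels (23 for X, 24 for Y, 25 for XY, 26 for MT). A minimal sketch of a recoding step that would make the two columns directly comparable, assuming only the four pairs observed in that output (`code_to_label` and `recode_chrom` are hypothetical names):

```r
# Code-to-label pairs taken from the mismatch table above
code_to_label <- c('23' = 'X', '24' = 'Y', '25' = 'XY', '26' = 'MT')

# Hypothetical helper: translate numeric codes to labels, leave other values untouched
recode_chrom <- function(chromosome) {
    is_coded <- chromosome %in% names(code_to_label)
    unname(ifelse(is_coded, code_to_label[chromosome], chromosome))
}

recoded <- recode_chrom(c('1', '22', '23', '24', '25', '26'))
print(recoded)
```

After such a recoding, a filter like `CHROM != chromosome` would be expected to return no rows if these four codes are the only source of disagreement.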


In [10]:
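# Drop the numeric-coded chromosome column (CHROM from the pvar file is kept),
# restore the `#CHROM` header, and write a tab-delimited file
full_df %>%
    select(-chromosome) %>%
    rename('#CHROM' = 'CHROM') %>%
    fwrite(snp_qc_patched_f, sep='\t', na = "NA", quote=FALSE)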
