In [2]:
suppressWarnings(suppressPackageStartupMessages({
    library(tidyverse)
    library(data.table)
}))


As the PLINK log file (`/scratch/groups/mrivas/ukbb24983/array-exome-combined/pgen/ukb24983_cal_hla_cnv_exomeOQFE.bed.log`) shows,
some variants were removed by the allele frequency threshold:

> 2962 variants removed due to minor allele threshold(s)

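The `.pvar` file written by PLINK does not carry the `geno_data_source` column, so we add it back (together with `FILTER`) by joining the array and exome annotation tables on `ID`. The checksums below confirm that the first five columns of the rewritten file are identical to the PLINK output.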


```
$ cd /scratch/groups/mrivas/ukbb24983/array-exome-combined/pgen

$ cat ukb24983_cal_hla_cnv_exomeOQFE.pvar | md5sum
a76257a1dc2bca3b84c04cf7788673cd  -

$ mv ukb24983_cal_hla_cnv_exomeOQFE.pvar ukb24983_cal_hla_cnv_exomeOQFE.plink.pvar

$ # we execute the code in this notebook to generate the merged file

$ cat ukb24983_cal_hla_cnv_exomeOQFE.pvar | cut -f1-5 | md5sum
a76257a1dc2bca3b84c04cf7788673cd  -
```
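The same "first five columns are unchanged" check can also be done from R. This is a minimal sketch on toy data, not the real files: the tables `plink_pvar` and `merged_pvar` and the temporary file paths are hypothetical stand-ins for the `.plink.pvar` and rewritten `.pvar` files.

```r
suppressPackageStartupMessages(library(data.table))

# Hypothetical toy pvar as written by PLINK (five standard columns).
plink_pvar <- data.table(
    `#CHROM` = c('1', '1', 'X'),
    POS = c(100L, 200L, 300L),
    ID = c('rs1', 'rs2', 'var3'),
    REF = c('A', 'C', 'G'),
    ALT = c('G', 'T', 'A')
)

# The rewritten file keeps those columns and appends the annotation columns.
merged_pvar <- copy(plink_pvar)
merged_pvar[, FILTER := c('.', 'hwe', '.')]
merged_pvar[, geno_data_source := c('cal', 'cal', 'exome200k')]

plink_f <- tempfile(fileext = '.pvar')
merged_f <- tempfile(fileext = '.pvar')
fwrite(plink_pvar, plink_f, sep = '\t')
fwrite(merged_pvar, merged_f, sep = '\t')

# Read everything as character so the comparison is on the text as written.
before <- fread(plink_f, colClasses = 'character')
after <- fread(merged_f, colClasses = 'character', select = 1:5)

first_five_match <- identical(as.matrix(before), as.matrix(after))
stopifnot(first_five_match)
```

Unlike the `md5sum` comparison, this compares parsed values rather than raw bytes, so it would not detect differences in quoting or line endings.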


In [1]:
ukb_d <- '/scratch/groups/mrivas/ukbb24983'
data_d <- '/scratch/groups/mrivas/ukbb24983/array-exome-combined/pgen'

# input
in_f <- file.path(data_d, 'ukb24983_cal_hla_cnv_exomeOQFE.plink.pvar')
array_f <- file.path(ukb_d, 'array-combined/annotation/annotation_20201012/ukb24983_cal_hla_cnv.annot_compact_20201023.tsv.gz')
exome_f <- file.path(ukb_d, 'exome/annotation/20201025_exome_oqfe_2020/ukb24983_exomeOQFE.annotation.20210108.compact.tsv.gz')

# output
out_f <- file.path(data_d, 'ukb24983_cal_hla_cnv_exomeOQFE.pvar')


In [4]:
array_f %>% fread(select=c('ID', 'FILTER', 'geno_data_source')) -> array_df

exome_f %>% fread(select=c('ID', 'FILTER')) %>%
mutate(geno_data_source = 'exome200k') -> exome_df


In [5]:
in_f %>% fread(colClasses = c('#CHROM'='character')) %>%
rename('CHROM'='#CHROM') -> df


In [6]:
df %>% left_join(bind_rows(array_df, exome_df), by='ID') -> merged_df
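A left join keeps the row count of `df` only if `ID` is unique in the combined annotation table, and it leaves `NA` for any variant found in neither annotation. The following is a minimal sketch of that sanity check on toy data; `toy_pvar`, `toy_array`, and `toy_exome` are hypothetical stand-ins for `df`, `array_df`, and `exome_df`.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy stand-ins for df, array_df, and exome_df.
toy_pvar <- tibble(
    CHROM = c('1', '1', '2'),
    POS = c(100L, 200L, 300L),
    ID = c('rs1', 'rs2', 'var3')
)
toy_array <- tibble(
    ID = c('rs1', 'rs2'),
    FILTER = c('.', 'hwe'),
    geno_data_source = c('cal', 'cal')
)
toy_exome <- tibble(ID = 'var3', FILTER = '.') %>%
    mutate(geno_data_source = 'exome200k')

toy_annot <- bind_rows(toy_array, toy_exome)
toy_merged <- toy_pvar %>% left_join(toy_annot, by = 'ID')

# Duplicate IDs in the annotation would add rows to the join result.
n_dup_ids <- sum(duplicated(toy_annot$ID))
# Variants missing from both annotation tables would get NA here.
n_unmatched <- sum(is.na(toy_merged$geno_data_source))

stopifnot(
    n_dup_ids == 0,
    n_unmatched == 0,
    nrow(toy_merged) == nrow(toy_pvar),
    identical(toy_merged$ID, toy_pvar$ID)
)
```

In the notebook itself, the `count(geno_data_source)` output serves a similar purpose: an `NA` row there would indicate unmatched variants.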


In [7]:
merged_df %>% count(geno_data_source)


geno_data_source,n
<chr>,<int>
cal,802498
cnv,275180
exome200k,17660693
hla,328


In [8]:
merged_df %>% count(FILTER)

FILTER,n
<chr>,<int>
.,18134874
gnomad_af,164
hwe,34668
hwe;gnomad_af,3
hwe;mgi,3
mcpi,10
mgi,41
missingness,351835
missingness;gnomad_af,40
missingness;hwe,24008


In [10]:
merged_df %>%
rename('#CHROM' = 'CHROM') %>%
fwrite(out_f, sep='\t', na = "NA", quote=FALSE)
