# Quality control of UKBB exomes, Q4/2020

The data consist of approximately 200,000 exomes, released in two batches:

* 50K exomes were made available in March 2019
* 150K exomes were made available in October 2020

The 50K set includes the following family relationships:

* 194 parent-offspring pairs
* 613 full-sibling pairs
* 26 trios
* 1 monozygotic twin pair
* 195 genetically determined second-degree relationships

**Quality control published for the 50K set**

FASTQ files were aligned to GRCh38 with BWA-MEM to generate BAM files.

Duplicates in the BAM files were identified and marked using Picard.

Variants were called with WeCall, producing gVCF files.

Samples were excluded if they showed:
* Discordance between genetic and reported sex
* High rates of heterozygosity/contamination (Dstat>0.4)
* Low sequence coverage (<85% of bases with 20X coverage)
* Sample duplication
* WES variants discordant with the genotyping chip
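The threshold-based exclusion criteria above can be sketched as a single awk pass over a per-sample metrics table. The table layout (sample ID, reported sex, genetic sex, Dstat, percent of bases at 20X) and the function name `flag_samples` are assumptions made for illustration, not the format used by UKBB. The duplicate and chip-discordance checks are left out because they need pairwise comparisons.

```shell
# Hypothetical input: tab-separated, one header line, columns
#   sample  reported_sex  genetic_sex  dstat  pct_20x
flag_samples() {
    awk -F'\t' 'NR > 1 {
        reason = ""
        if ($2 != $3) reason = reason "sex_mismatch;"   # genetic vs reported sex
        if ($4 > 0.4) reason = reason "high_dstat;"     # Dstat > 0.4
        if ($5 < 85)  reason = reason "low_coverage;"   # < 85% of bases at 20X
        if (reason != "") print $1 "\t" reason
    }'
}

# Toy data, not real UKBB samples
printf 'sample\treported_sex\tgenetic_sex\tdstat\tpct_20x\nS1\tF\tF\t0.10\t95.2\nS2\tM\tF\t0.12\t91.0\nS3\tF\tF\t0.55\t80.1\n' | flag_samples
```

Only samples that fail at least one check are printed, together with the reasons, so the output can be reviewed before anything is removed.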

A project-level VCF (pVCF) was then created.

Goldilocks filter:
* SNV genotypes with DP<7 were changed to no-call
* SNV heterozygotes were retained if the allele balance ratio was AB>=0.15
* Multiallelic sites were left-normalized and represented as bi-allelic
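The two genotype-level rules above can be illustrated with a small awk sketch. The input layout (variant, sample, genotype, depth, allele balance) and the function name `goldilocks_snv` are hypothetical. The notes do not say what happens to heterozygotes below the AB threshold, so this sketch only labels them rather than deciding how they are handled.

```shell
# Hypothetical input: whitespace-separated, no header, columns
#   variant  sample  genotype  depth  allele_balance
goldilocks_snv() {
    awk '{
        gt = $3
        status = "keep"
        if ($4 < 7) {
            # DP < 7: genotype becomes a no-call
            gt = "./."
            status = "nocall_low_dp"
        } else if ((gt == "0/1" || gt == "1/0") && $5 < 0.15) {
            # heterozygote with AB < 0.15: labelled only (handling is an assumption)
            status = "het_fails_ab"
        }
        print $1, $2, gt, status
    }'
}

# Toy genotypes, not real data
printf 'var1 s1 0/1 12 0.48\nvar1 s2 0/1 5 0.40\nvar1 s3 0/1 20 0.10\nvar1 s4 1/1 9 1.00\n' | goldilocks_snv
```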

# Quality control of pVCF/PLINK files

This pipeline is intended to be used as an extra quality-control step after obtaining the joint-called files provided by the UKBB (PLINK and pVCF formats).

To download the PLINK files, generate and use the following script:

```
tpl_file=../farnam.yml
jobid=23155
cwd=/home/dc2325/scratch60/exomes_UKBB
job_size=1
numThreads=22
exome_UKBB=/home/dc2325/project/UKBB_GWAS_dev/workflow/exome_UKBB.ipynb
exome_sbatch=../output/$(date +"%Y-%m-%d")_exome_download.sbatch

cmd_args="""default
    --cwd $cwd
    --jobid $jobid
    --job_size $job_size
    --numThreads $numThreads
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $exome_UKBB \
    --to-script $exome_sbatch \
    --args "$cmd_args"
```

## Basic summary statistics using PLINK v1.9

### Calculate MAF for chr1 using PLINK

In [107]:
module load PLINK/1.90-beta5.3




In [2]:
plink --bfile /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1 --freq gz --out /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1

PLINK v1.90b4.6 64-bit (15 Aug 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.log.
Options in effect:
  --bfile /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1
  --freq gz
  --out /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1

191893 MB RAM detected; reserving 95946 MB for main workspace.
1783906 variants loaded from .bim file.
200643 people (90020 males, 110437 females, 186 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.nosex .
Using up to 35 threads (change this with --threads).
Before main variant filters, 200643 founders and 0 nonfounders present.
Calculating allele frequencies... done.
To

In [26]:
# Load the per-variant allele frequencies written by --freq
frq <- read.table(gzfile('/home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.frq.gz'), header=TRUE)
head(frq, 10)
dim(frq)

### Calculate number of variants above/below threshold with R

In [28]:
options(scipen = 999)
# MAF is NA for some variants; exclude NAs explicitly, otherwise they appear as NA rows and inflate the row counts
rare_var <- frq[!is.na(frq$MAF) & frq$MAF <= 0.005, ]
common_var <- frq[!is.na(frq$MAF) & frq$MAF > 0.005, ]
# Total number of variants 1783906
dim(rare_var)
dim(common_var)

### Calculate number of variants using awk

In [105]:
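# NR > 1 skips the header line; otherwise "MAF" is coerced to 0 and counted as MAF < 0.005
# NA values in the MAF column are also coerced to 0 and therefore fall in the first bin
zcat /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.frq.gz | awk 'NR > 1 && ($5 + 0) < 0.005' | wc -l
zcat /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.frq.gz | awk 'NR > 1 && ($5 + 0) > 0.005' | wc -l
zcat /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.frq.gz | awk 'NR > 1 && ($5 + 0) == 0.005' | wc -l

1757725
26180
1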
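The three passes over the file can also be done in one, with NA values counted separately instead of being coerced to 0. This is a minimal sketch: the function name `count_maf_bins` is made up here, it assumes the standard PLINK `.frq` column order (CHR SNP A1 A2 MAF NCHROBS), and it treats MAF equal to the threshold as rare.

```shell
# Reads a PLINK .frq table on stdin; optional first argument is the MAF threshold.
count_maf_bins() {
    awk -v thr="${1:-0.005}" 'NR > 1 {
        if ($5 == "NA") missing++          # no frequency available
        else if ($5 <= thr) rare++         # MAF at or below the threshold
        else common++
    }
    END { printf "rare=%d common=%d missing=%d\n", rare, common, missing }'
}

# Toy table, not real data
printf ' CHR SNP A1 A2 MAF NCHROBS\n 1 a G C 0.0001 100\n 1 b A G 0.2 100\n 1 c T A NA 0\n 1 d T C 0.005 100\n' | count_maf_bins
# prints: rare=2 common=1 missing=1
```

On the real file this would be called as `zcat ukb23155_c1_b0_v1.frq.gz | count_maf_bins 0.005`.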



## Evaluate the missingness per individual/SNP

In [108]:
plink --bfile /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1 --missing --out /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1

PLINK v1.90b4.6 64-bit (15 Aug 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.log.
Options in effect:
  --bfile /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1
  --missing
  --out /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1

128238 MB RAM detected; reserving 64119 MB for main workspace.
1783906 variants loaded from .bim file.
200643 people (90020 males, 110437 females, 186 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 200643 founders and 0 nonfounders present.
Calculating allele frequencies... done.

In [112]:
# .imiss: missingness per individual; .lmiss: missingness per variant
imiss <- read.table('/home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.imiss', header=TRUE)
lmiss <- read.table('/home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.lmiss', header=TRUE)

In [113]:
# F_MISS: frequency of missing genotypes
head(imiss)

| | FID | IID | MISS_PHENO | N_MISS | N_GENO | F_MISS |
|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<fct>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 1434748 | 1434748 | Y | 21320 | 1783906 | 0.01195 |
| 2 | 5523981 | 5523981 | Y | 19100 | 1783906 | 0.01071 |
| 3 | 5023838 | 5023838 | Y | 22973 | 1783906 | 0.01288 |
| 4 | 4023729 | 4023729 | Y | 19854 | 1783906 | 0.01113 |
| 5 | 4442146 | 4442146 | Y | 21056 | 1783906 | 0.0118 |
| 6 | 5654789 | 5654789 | Y | 20164 | 1783906 | 0.0113 |


In [120]:
min(imiss$F_MISS)
max(imiss$F_MISS)
miss_ten <- imiss[imiss[,'F_MISS']>0.1 ,] # No individuals missing more than 10% of genotypes
dim(miss_ten)
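The same per-individual summary can be obtained without loading the table into R. The sketch below assumes the standard PLINK `.imiss` column order (FID IID MISS_PHENO N_MISS N_GENO F_MISS); the function name `summarize_imiss` is hypothetical, and the default threshold of 0.1 mirrors the 10% cut-off used above.

```shell
# Reads a PLINK .imiss table on stdin; optional first argument is the F_MISS threshold.
summarize_imiss() {
    awk -v thr="${1:-0.1}" 'NR > 1 {
        v = $6 + 0                              # force numeric comparison
        if (NR == 2 || v < min) min = v
        if (NR == 2 || v > max) max = v
        if (v > thr) n++                        # individuals above the threshold
    }
    END { printf "min=%g max=%g above_threshold=%d\n", min, max, n }'
}

# Toy table, not real data
printf 'FID IID MISS_PHENO N_MISS N_GENO F_MISS\n1 1 Y 5 100 0.05\n2 2 Y 20 100 0.2\n3 3 Y 1 100 0.01\n' | summarize_imiss
# prints: min=0.01 max=0.2 above_threshold=1
```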

In [114]:
head(lmiss)

| | CHR | SNP | N_MISS | N_GENO | F_MISS |
|---|---|---|---|---|---|
| | `<int>` | `<fct>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 1 | 1:69081:G:C | 2018 | 200643 | 0.01006 |
| 2 | 1 | 1:69134:A:G | 887 | 200643 | 0.004421 |
| 3 | 1 | 1:69149:T:A | 1018 | 200643 | 0.005074 |
| 4 | 1 | 1:69217:G:A | 455 | 200643 | 0.002268 |
| 5 | 1 | 1:69224:A:T | 276 | 200643 | 0.001376 |
| 6 | 1 | 1:69231:C:T | 508 | 200643 | 0.002532 |


In [125]:
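# F_MISS: proportion of samples missing this SNP
# Variants missing in at least 10% of samples (the output below starts at row 39, which is consistent with a 0.1 cut-off)
lmiss_ten <- lmiss[lmiss[,'F_MISS']>=0.1 ,] # 26989
dim(lmiss_ten)
head(lmiss_ten)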

| | CHR | SNP | N_MISS | N_GENO | F_MISS |
|---|---|---|---|---|---|
| | `<int>` | `<fct>` | `<int>` | `<int>` | `<dbl>` |
| 39 | 1 | 1:69511:A:T | 127496 | 200643 | 0.6354 |
| 99 | 1 | 1:69897:T:C | 30228 | 200643 | 0.1507 |
| 124 | 1 | 1:925849:D:16 | 200643 | 200643 | 1.0 |
| 365 | 1 | 1:930388:D:8 | 200643 | 200643 | 1.0 |
| 450 | 1 | 1:931128:D:13 | 200643 | 200643 | 1.0 |
| 452 | 1 | 1:931131:D:4 | 188524 | 200643 | 0.9396 |
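To turn the high-missingness variants into a plain list of IDs (for example, to pass to PLINK's `--exclude`), a one-line awk filter is enough. This is a sketch assuming the standard `.lmiss` column order (CHR SNP N_MISS N_GENO F_MISS); the function name `high_missing_variants` and the default threshold of 0.1 are illustrative choices.

```shell
# Reads a PLINK .lmiss table on stdin; optional first argument is the F_MISS threshold.
# Prints the ID of every variant whose missing rate is at or above the threshold.
high_missing_variants() {
    awk -v thr="${1:-0.1}" 'NR > 1 && ($5 + 0) >= (thr + 0) { print $2 }'
}

# Toy table, not real data
printf 'CHR SNP N_MISS N_GENO F_MISS\n1 v1 5 100 0.05\n1 v2 10 100 0.1\n1 v3 100 100 1\n' | high_missing_variants
```

Writing the IDs to a file keeps the exclusion list reviewable before any variant is dropped.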


In [None]:
# Check missingness between cases and controls

## Evaluate Hardy-Weinberg equilibrium

In [109]:
plink --bfile /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1 --hardy --out /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1

PLINK v1.90b4.6 64-bit (15 Aug 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.log.
Options in effect:
  --bfile /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1
  --hardy
  --out /home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1

128238 MB RAM detected; reserving 64119 MB for main workspace.
1783906 variants loaded from .bim file.
200643 people (90020 males, 110437 females, 186 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 200643 founders and 0 nonfounders present.
Calculating allele frequencies... done.
T

In [None]:
# Load the per-variant Hardy-Weinberg test results written by --hardy
hardy <- read.table('/home/dc2325/scratch60/exomes_UKBB/ukb23155_c1_b0_v1.hwe', header=TRUE)
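A possible next step is to list the variants with a small Hardy-Weinberg p-value. The sketch below assumes the standard PLINK 1.9 `.hwe` column order (CHR SNP TEST A1 A2 GENO O(HET) E(HET) P) and keeps only rows whose TEST label starts with `ALL`. The function name `hwe_failures` and the default threshold of 1e-6 are assumptions for illustration; no threshold is specified in these notes.

```shell
# Reads a PLINK .hwe table on stdin; optional first argument is the p-value threshold.
# Prints variant ID and p-value for rows that fall below the threshold.
hwe_failures() {
    awk -v thr="${1:-1e-6}" 'NR > 1 && $3 ~ /^ALL/ && $9 != "NA" && ($9 + 0) < (thr + 0) {
        print $2, $9
    }'
}

# Toy table, not real data
printf ' CHR SNP TEST A1 A2 GENO O(HET) E(HET) P\n 1 v1 ALL(NP) C G 0/10/990 0.01 0.0099 1\n 1 v2 ALL(NP) T A 50/0/950 0 0.095 1e-40\n 1 v3 ALL(NP) A T 0/0/0 nan nan NA\n' | hwe_failures
# prints: v2 1e-40
```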