/
ImputeDCas20_5629.Rmd
153 lines (123 loc) · 6.22 KB
/
ImputeDCas20_5629.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
---
title: "Impute TARI DCas20_5629"
site: workflowr::wflow_site
date: "2020-December-16"
output: workflowr::wflow_html
editor_options:
chunk_output_type: inline
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```
DArTseqLD (DCas20-5629).
# Copy data
Copy the imputation reference panel from 2019 to the `data/` folder.
```{bash,eval = FALSE}
mkdir /workdir/mw489/
cp -r /home/jj332_cas/marnin/TARI_2020GS /workdir/mw489/
cp -r /home/jj332_cas/CassavaGenotypeData/CassavaGeneticMap /workdir/mw489/TARI_2020GS/data/
cp /home/jj332_cas/CassavaGenotypeData/nextgenImputation2019/ImputationEastAfrica_StageII_90919/chr*_ImputationReferencePanel_StageVI_91119.vcf.gz /workdir/mw489/TARI_2020GS/data/
```
# Impute
Impute with [Beagle V5.0](https://faculty.washington.edu/browning/beagle/b5_0.html).
Use the "imputation reference panel" dataset from 2019, e.g. `chr1_ImputationReferencePanel_StageVI_91119.vcf.gz` as reference.
Used 1 large memory Cornell CBSU machine (e.g. [cbsulm16; 112 cores, 512 GB RAM](https://biohpc.cornell.edu/lab/hardware.aspx)), running 1 chromosome at a time.
R functions are stored in the `code/` sub-directory. Functions sourced from e.g. **imputationFunctions.R** are wrappers around e.g. Beagle, and other command line programs.
```{bash, eval=F}
cd /workdir/mw489/TARI_2020GS/
```
```{r,eval = FALSE}
targetVCFpath<-here::here("data/Report-DCas20-5629") # location of the targetVCF
refVCFpath<-here::here("data/")
mapPath<-here::here("data/CassavaGeneticMap/")
outPath<-here::here("output/")
outSuffix<-"DCas20_5629"
```
```{r,eval = FALSE}
source(here::here("code","imputationFunctions.R"))
# cbsurobbins using 64 cores
purrr::map(1:18,~runBeagle5(targetVCF=paste0(targetVCFpath,"chr",.,"_DCas20_5629.vcf.gz"),
refVCF=paste0(refVCFpath,"chr",.,"_ImputationReferencePanel_StageVI_91119.vcf.gz"),
mapFile=paste0(mapPath,"chr",.,"_cassava_cM_pred.v6_91019.map"),
outName=paste0(outPath,"chr",.,"_DCas20_5629_EA_REFimputed"),
nthreads=64))
```
Clean up Beagle log files after run. Move to sub-directory `output/BeagleLogs/`.
```{bash,eval = FALSE}
cd /workdir/mw489/TARI_2020GS/output/;
mkdir BeagleLogs;
cp *_DCas20_5629_EA_REFimputed.log BeagleLogs/
cp -r BeagleLogs /home/jj332_cas/marnin/TARI_2020GS/output/
cp *_DCas20_5629_EA_REFimputed* /home/jj332_cas/marnin/TARI_2020GS/output/
```
# Post-impute filter
For now, the function will just do a fixed filter: AR2>0.75 (DR2>0.75 as of Beagle5.0), P_HWE>1e-20, MAF>0.005 [0.5%].
It can easily be modified in the future to include parameters to vary the filter specifications.
Input parameters
```{r,eval = FALSE}
#' @inPath path to input VCF-to-be-filtered, can be left null if path included in @inName . Must end in "/"
#' @inName name of input VCF file EXCLUDING file extension. Assumes .vcf.gz
#' @outPath path where filtered VCF and related are to be stored.Can be left null if path included in @outName . Must end in "/".
#' @outName name desired for output EXCLUDING extension. Output will be .vcf.gz
```
Loop to filter all 18 VCF files in parallel
```{r,eval = FALSE}
inPath<-here::here("output/")
outPath<-here::here("output/")
source(here::here("code","imputationFunctions.R"))
require(furrr); options(mc.cores=18); plan(multiprocess)
future_map(1:18,~postImputeFilter(inPath=inPath,
inName=paste0("chr",.,"_DCas20_5629_EA_REFimputed"),
outPath=outPath,
outName=paste0("chr",.,"_DCas20_5629_EA_REFimputedAndFiltered")))
```
Check what's left
```{r,eval = FALSE}
purrr::map(1:18,~system(paste0("zcat ",here::here("output/"),"chr",.,"_DCas20_5629_EA_REFimputedAndFiltered.vcf.gz | wc -l")))
# 5785
# 2106
# 1951
# 2306
# 2271
# 2093
# 911
# 1795
# 2048
# 1504
# 1486
# 1881
# 1500
# 2757
# 2205
# 1562
# 1636
# 1519
```
```{bash, eval=F}
cd /workdir/mw489/TARI_2020GS/output/;
cp -r *_DCas20_5629_EA_REFimputed* /home/jj332_cas/marnin/TARI_2020GS/output/
```
# Formats for downstream analysis
The function below will (1) convert the input VCF to plink1.9 binary format and (2) convert the plink binary to a dosage (0,1,2) matrix with special attention to which allele gets counted in the file.
**NOTICE:** I was worried about `plink1.9` changing allele codes between files. There is some risk the counted allele could switch between e.g. the reference panel and the progeny files because of allele freq. (see plink documentation). To avoid this, went to extra trouble: write a file suffixed `*.alleleToCount` listing SNP ID (column 1) and the ALT allele from the VCF (column 2). Pass the file to `plink1.9` using the `--recode-allele` flag to ensure all output dosages count the ALT allele consistent with the VCFs. The reason to use `plink1.9` is that `Beagle5` imputed files don't have a **DS** (dosage) field that can be directly extracted. Instead, phased genotypes e.g. `0|1` need to be converted to dosages (e.g. `0|1 --> 1`, `1|1 --> 2`). An alternative might be to extract the haplotypes using `vcftools` and manually (in R) computed the dosages; that would give most control but is slow.
```{bash, eval=F}
cd /home/jj332_cas/marnin/TARI_2020GS/;
```
```{r, eval=F}
library(tidyverse); library(magrittr);
source(here::here("code","imputationFunctions.R"))
require(furrr); options(mc.cores=18); plan(multiprocess)
pathOut<-here::here("output/")
# Imputation reference panel
future_map(1:18,~convertVCFtoDosage(pathIn="/home/jj332_cas/CassavaGenotypeData/nextgenImputation2019/ImputationEastAfrica_StageII_90919/",
pathOut=pathOut,
vcfName = paste0("chr",.,"_ImputationReferencePanel_StageVI_91119")))
# DCas20_5629
future_map(1:18,~convertVCFtoDosage(pathIn=here::here("output/"),pathOut=pathOut,
vcfName = paste0("chr",.,"_DCas20_5629_EA_REFimputedAndFiltered")))
# Genome-wide dosage (for use in R) for each dataset
# Imputation reference panels
createGenomewideDosage(pathIn = here::here("output/"), chroms=1:18, "_ImputationReferencePanel_StageVI_91119")
# DCas20_5629
createGenomewideDosage(pathIn = here::here("output/"), chroms=1:18, "_DCas20_5629_EA_REFimputedAndFiltered")
```