---
title: "TARI 2020 GS and related procedures"
site: workflowr::wflow_site
author: "Marnin Wolfe"
output:
  workflowr::wflow_html:
    toc: false
editor_options:
  chunk_output_type: console
---
This repository and website document all analyses, summaries, tables, and figures associated with TARI genomic prediction and related procedures (e.g. imputation).
# December Imputations
### DCas20_5629
GS C1.
Imputed with the East Africa Imputation Reference Panel dataset, which is available on the [Cassavabase FTP server](ftp://ftp.cassavabase.org/marnin_datasets/nextgenImputation2019/ImputationEastAfrica_StageII_90919/) under names such as `chr*_ImputationReferencePanel_StageVI_91119.vcf.gz`. The [code and documentation are here](https://wolfemd.github.io/NaCRRI_2020GS/).
**Steps**:
1. [Convert DCas20_5629 report to VCF for imputation](convertDCas20_5629_ToVCF.html)
2. [Impute DCas20_5629](ImputeDCas20_5629.html) with the East Africa reference panel
**Files**:
- **RefPanel VCF filename:** `chr*_ImputationReferencePanel_StageVI_91119.vcf.gz`
- **Imputed filename:** `chr*_DCas20_5629_EA_REFimputed.vcf.gz`
- **Post-impute filtered filename:** `chr*_DCas20_5629_EA_REFimputedAndFiltered.vcf.gz`
- **Genome-wide dosage matrix format for use in R:**
    - Imputation Reference Panel: `DosageMatrix_ImputationReferencePanel_StageVI_91119.rds`
    - DCas20\_5629 with standard post-impute filter: `DosageMatrix_DCas20_5629_EA_REFimputedAndFiltered.rds`
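
The `chr*` in the filenames above stands for one file per chromosome. As a minimal sketch, assuming the files are prefixed `chr1` through `chr18` (cassava has 18 chromosomes, but the exact prefixes used on the FTP server are not confirmed here), the per-chromosome filenames could be generated like this:

```r
# Assumption: one file per chromosome, prefixed chr1 ... chr18.
chroms <- 1:18
imputed_vcfs <- paste0("chr", chroms,
                       "_DCas20_5629_EA_REFimputedAndFiltered.vcf.gz")
head(imputed_vcfs, 2)
```
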
**HOW TO COMBINE DOSAGE MATRICES:** Users will want to combine the genotypes in the imputation reference panel with the genotypes in the imputed DArT file. The two matrices can have slightly different sets of markers along the columns, so keep only the markers they share before stacking the samples. Here is a basic example of how to combine them:
```{r, eval=F}
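# Read the genome-wide dosage matrices (markers along the columns)
snps_refpanel<-readRDS("DosageMatrix_ImputationReferencePanel_StageVI_91119.rds")
snps_dcas20_5629<-readRDS("DosageMatrix_DCas20_5629_EA_REFimputedAndFiltered.rds")
# Keep only the markers present in both matrices
snps2keep<-colnames(snps_refpanel)[colnames(snps_refpanel) %in% colnames(snps_dcas20_5629)]
# Stack the samples; rbind() works on matrices, unlike dplyr::bind_rows()
snps<-rbind(snps_refpanel[,snps2keep],
            snps_dcas20_5629[,snps2keep])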
```
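
To see what the intersect-and-stack step does, here is a small self-contained sketch on toy data. The clone and marker names are made up for illustration, and it assumes the dosage matrices have samples on rows and markers on columns, as the text above implies.

```r
# Toy dosage matrices (hypothetical names): rows = samples, columns = markers.
ref <- matrix(c(0, 1, 2,
                2, 1, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("refClone1", "refClone2"),
                              c("S1_100_A", "S1_200_T", "S2_300_G")))
new <- matrix(c(1, 1,
                0, 2),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("newClone1", "newClone2"),
                              c("S1_200_T", "S2_300_G")))

# Markers present in both matrices, in reference-panel column order.
shared <- intersect(colnames(ref), colnames(new))

# Stack samples on the shared markers; drop = FALSE keeps the matrix shape
# even if only one marker is shared.
combined <- rbind(ref[, shared, drop = FALSE],
                  new[, shared, drop = FALSE])

dim(combined)        # 4 samples by 2 shared markers
colnames(combined)
```

Using `drop = FALSE` is a defensive choice: without it, subsetting to a single shared marker would return a plain vector and `rbind()` would no longer stack samples as rows.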
# December Genomic Prediction
Get TARI training population (TP) data from Cassavabase. Use it with the imputed data to predict GEBV/GETGV for all samples in the new report (**DCas20-5629**).
1. [Prepare training dataset](01-cleanTPdata.html): Download data from the database (DB), then clean and format it.
2. [Get BLUPs combining all trial data](02-GetBLUPs.html): Combine data from all trait-trials to get BLUPs for downstream genomic prediction.
    * Fit a mixed model to the multi-trial dataset and extract BLUPs, de-regressed BLUPs, and weights. Include two rounds of outlier removal.
3. [Check prediction accuracy](03-CrossValidation.html): Evaluate prediction accuracy with cross-validation.
4. [Genomic prediction](04-GetGBLUPs.html): Predict _genomic_ BLUPs (GEBV and GETGV) for all selection candidates using all available data.
5. [Results](05-Results.html): Home for plots and other results.
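
Step 2 mentions de-regressed BLUPs and weights. The linked analysis page is the authoritative description of how they were computed here. As a rough sketch only, one widely used approach (Garrick et al. 2009) divides each BLUP by its reliability and derives a weight from reliability and heritability. The function name, argument names, and the constant `c = 0.1` below are assumptions for illustration, not taken from this repository.

```r
# Sketch of de-regression in the style of Garrick et al. (2009).
# All names here are hypothetical. Inputs are per-clone BLUPs and their
# prediction error variances (PEV), plus the genetic variance (Vg) and
# heritability (H2) estimated by the mixed model.
deregress_blups <- function(blup, pev, Vg, H2, c = 0.1) {
  rel <- 1 - pev / Vg                              # reliability of each BLUP
  drg <- blup / rel                                # de-regressed BLUP
  wt  <- (1 - H2) / ((c + (1 - rel) / rel) * H2)   # weight for the next stage
  data.frame(BLUP = blup, REL = rel, drgBLUP = drg, WT = wt)
}

deregress_blups(blup = c(1.2, -0.5), pev = c(0.2, 0.5), Vg = 1, H2 = 0.5)
```

Clones with a low PEV get a reliability near 1, so their BLUPs are barely rescaled and they receive larger weights in the downstream prediction model.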
**Files**: everything is in the `output/` sub-directory.
- **GEBVs for parent selection:** `GEBV_TARI_ModelA_2020Dec21.csv`
- **GETGVs for variety advancement:** `GETGV_TARI_ModelADE_2020Dec21.csv`
- **Tidy, long-form CSV of predictions, including PEVs:** `genomicPredictions_TARI_2020Dec21.csv`
[**DOWNLOAD FROM CASSAVABASE FTP SERVER**](ftp://ftp.cassavabase.org/marnin_datasets/TARI_2020GS/output/)
or
[**DOWNLOAD FROM GitHub**](https://github.com/wolfemd/TARI_2020GS/tree/master/output)
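
Once downloaded, the prediction files can be read into R as ordinary CSVs. A minimal sketch, assuming the file sits in `output/` relative to the working directory. The column layout is not documented on this page, so the code only inspects the table rather than relying on specific column names.

```r
# Assumption: the file was downloaded to output/ under the working directory.
gebv_file <- file.path("output", "GEBV_TARI_ModelA_2020Dec21.csv")

if (file.exists(gebv_file)) {
  gebvs <- read.csv(gebv_file, stringsAsFactors = FALSE)
  str(gebvs)    # inspect column names and types before using them
} else {
  message("File not found: ", gebv_file)
}
```
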
# Data availability and reproducibility
The R package **workflowr** was used to document this study reproducibly.
Much of the supporting data *and* output from the analyses documented here is too large for GitHub.
The repository will be mirrored, with all data, here: <ftp://ftp.cassavabase.org/marnin_datasets/TARI_2020GS/>.
# Directory structure of this repository
**NOTICE:** `data/` and `output/` are empty on GitHub. Please see <ftp://ftp.cassavabase.org/marnin_datasets/TARI_2020GS/> for access.
1. `data/`: raw data (e.g. unimputed SNP data)
2. `output/`: outputs (e.g. imputed SNP data)
3. `analysis/`: most code and workflow documented in **.Rmd** files
4. `docs/`: compiled **.html**, "knitted" from **.Rmd**
Supporting functions are in `code/`.
The analyses in the **html** / **Rmd** files referenced above often source R scripts from the `code/` sub-folder.
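
As an illustration of that pattern, the sketch below sources every R script found in a folder. The helper name `source_dir` is hypothetical, and the individual analyses may instead source specific files by name.

```r
# Hypothetical helper: source all .R files in a directory (default: code/).
source_dir <- function(path = "code") {
  scripts <- list.files(path, pattern = "\\.[Rr]$", full.names = TRUE)
  for (f in scripts) source(f, local = FALSE)   # evaluate in the global environment
  invisible(scripts)                            # return the sourced paths quietly
}

# Usage, from the repository root:
# source_dir("code")
```
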