-
Notifications
You must be signed in to change notification settings - Fork 2
/
05-getBLUPs.Rmd
671 lines (472 loc) · 32.1 KB
/
05-getBLUPs.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
# Preliminary field trial analysis
```{r setup, include=F, echo=F}
library(tidyverse); library(genomicMateSelectR); library(sommer); library(gt)
```
- **Context and Purpose:**
- **Upstream:** Section \@ref(prepare-genotypic-data) - quality control steps for genotypic data
- **Downstream:** Genomic prediction and related analyses.
- **Inputs:**
- **Expected outputs:**
## Process Map
![](images/get_blups_process_map.png){width=100%}
## One-stage or multi-stage?
We often have very large, multi-year, multi-location, multi-trial-type (**MET**) datasets we use to train genomic prediction models. The number of plots can be in the range of 10- or even 100,000 plots with many thousands of unique genotypes observed in unbalanced fashion across heterogenous experimental designs. All that to say, the computational burden and level of expertise required to execute genomic prediction analyses directly on these data is very great.
For the sake of semi-automation and computational efficiency, the standard genomic prediction pipeline I have implemented for NextGen Cassava, which I demonstrate below is a two-stage weighted genomic prediction:
**Stage 1. Get BLUPs**
- Conduct preliminary analysis of the trial data *without* genomic relatedeness / marker-information.
Identify the best-fitting model for the data and potentially curate the raw data, removing outliers.
- Fit mixed-model to **MET** data. Extract **`BLUPs`**, **`PEVs`** and variance components (**`VarComps`**). Compute de-regressed BLUPs (**`drgBLUPs`**) and weights (**`WTS`**) for "Stage 2".
**Stage 2. Cross-validation and genomic prediction**
Conduct weighted cross-validation and genomic prediction analyses using **`de-regressed BLUPs`** (alternative **`BLUEs`**) as the response variable and the weights from Stage 1. This effectively reduces the number of "observations" in the analysis to the number of unique genotypes (i.e. clones, inbred lines, etc.), leading to lower computational burden.
**Note of advice for single-stage analyses:** I strongly recommend conducting preliminary analysis of the trial data *without* genomic relatedeness / marker-information. Ensure the model will converge, residuals look acceptable and otherwise assess the best fitting model *before* committing to lengthy computationally intensive analyses. :)
## Set-up training datasets
```{r}
phenos<-readRDS(here::here("output","phenotypes_cleaned.rds"))
```
The pipeline version of this analysis will use the `TRUE/FALSE` values of **`CompleteBlocks`** and **`IncompleteBlocks`** ([Preliminary analysis of trial data](#detect_designs) ).
```{r}
phenos %>%
count(CompleteBlocks,IncompleteBlocks,locationName) %>%
spread(locationName,n)
```
Convert the data to "long format" . Remove missing values. "Nest" the data by Trait.
```{r}
traits<-c("DM","MCMDS","logFYLD","logDYLD")
phenos<-phenos %>%
# Convert the data to "long format"
pivot_longer(cols = all_of(traits),
names_to = "Trait", values_to = "Value") %>%
# Remove missing values
filter(!is.na(Value)) %>%
# Nest the MultiEnvironmentTrial data by trait
nest(METdata=c(-Trait))
phenos %>%
mutate(N_plots=map_dbl(METdata,nrow))
```
Where previously there was one row per plot and a large number of columns, now things are simple and tidy.
One row per trait. The actual plot-basis data are now contained within **METdata**, a list-type column, each element containing a `tibble`.
To demonstrate, check the contents of row 1 of the **METdata** column:
```{r}
phenos$METdata[[1]] %>% head %>% rmarkdown::paged_table()
```
## Considerations before modeling
There is a "standard" model I have deployed reliably on Cassava MET data for the GS pipeline.
For the sake of demonstration, I want to pick out one trait, and do some model comparisons to illustrate the basics of selecting the best-fitting model.
Before doing that, a few considerations.
### Mixed-model software
In the past, I have used both `lmer()` (`lme4` R package) and `asreml()` (the R package interface for the `asreml` software) for this sort of analysis. `lmer()` is limited to homogeneous error variance structures and only certain design models / variance structures. `asreml()` is fast, flexible and industry standard *BUT* proprietary.
The `sommer` package is fully open-source. The `sommer::mmer()` function is designed to fit almost any model `asreml` can, using a very similar syntax... but it is much more memory intensive and relatively slow. `sommer` uses direct-inversion of matrices, which have to be held in-memory in R.
Below, I'll use `sommer::mmer()`.
*Later, we will address ways to speed up such analyses, e.g. by linking R to a multi-threaded matrix algebra library like `OpenBLAS`.*
#### Learning about mixed-models
**PLACEHOLDER FOR LINKS, REFERENCES, ETC.**
### Sources of variability
**Fixed Factors:**
- **`yearInLoc`**: fixed-effect for year-location combination
**Random Factors:**
- **`germplasmName`**: random genetic effect, primary aim is to compute BLUPs for this term
- **`repInTrial`**: complete block replicate effects with levels uniquely nested within trials (**`studyName`**`'s`)
- **`blockInRep`**: incomplete blocks, with levels unique nested within replicates (within **`repInTrial`**`'s`)
**Error Variance:**
- **`studyName`**: heterogeneous error variances for each trial?
- Another option might be to use the **`plotArea`**, or the **`plantsPerPlot`**, or the **`trialType`**
**"Bonus" points:** If I had extra time, **`plantingDate`** and **`harvestDate`** information may provide important covariates: for example to correct yield and dry matter for trial-to-trial differences in developmental stage (age) at harvest...
### Parallel process with `future`
Below, I wanted to fit 4 different models to the data. After initial tests, I felt inpatient.
I decided to demonstrate a version of parallel processing that most laptops should be able to do. I used the package `future` to run each model at the same time.
See [this quick overview of `library(future)`](https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html) to learn more.
Below, I choose `plan(multisession)` (activates parallel future processing), as opposed to the default, `plan(sequential)`.
To get the models to all run in the background, at the same time, I substitute `%<-%{ }` for `<-` in the code below.
## Comparison of models
What follows is a demonstration, not a complete evaluation of the options, but it shows a strategy to compare models.
Pick out the first chunk of **`METdata`**.
```{r}
# grab just one chunk of the data to "experiment" on
METdata<-phenos$METdata[[1]] %>%
# sommer doesn't recognize logical T/F variables
mutate(CompleteBlocks=as.character(CompleteBlocks),
IncompleteBlocks=as.character(IncompleteBlocks))
```
Not all trials (**`studyName`**s) have complete and/or incomplete blocking structures. For this reason, the "standard" model I've implemented for NextGen Cassava predictions over the last several years is:
- **`yearInLoc`**: fixed factor
- **`repInTrial`** and **`blockInRep`** both random using the `at()` variance structure to model only at trials where **`CompleteBlocks=="TRUE"`**.
- If the phenotype is a plot-basis yield component, e.g. **`logFYLD`**, include **`PropNOHAV`** as an additional fixed-effect covariate.
### Fit models
The following code block implements 4 different models, each will execute in the background and at the same time.
```{r, eval=F}
library(future);
plan(multisession)
# the model I've applied to NextGen predictions the last several years
conv_model %<-%{ mmer(Value~yearInLoc,
random=~germplasmName +
vs(at(CompleteBlocks,"TRUE"),repInTrial) +
vs(at(IncompleteBlocks,"TRUE"),blockInRep),
data=METdata,
verbose = FALSE) }
# add heterogeneous-error variance for each trial (studyName)
conv_het_error_model %<-% { mmer(Value~yearInLoc,
random=~germplasmName +
vs(at(CompleteBlocks,"TRUE"),repInTrial) +
vs(at(IncompleteBlocks,"TRUE"),blockInRep),
rcov=~vs(ds(studyName),units),
data=METdata,
verbose = FALSE) }
# simplify: no het. error, no at() variance structure
simple_model %<-% { mmer(Value~yearInLoc,
random=~germplasmName + repInTrial + blockInRep,
data=METdata,
verbose = FALSE) }
# no at() variance structure, include het. error variance by trial
simple_het_error_model %<-% { mmer(Value~yearInLoc,
random=~germplasmName + repInTrial + blockInRep,
rcov=~vs(ds(studyName),units),
data=METdata,
verbose = FALSE) }
```
```{r, eval=F}
# Save output of above so the above models don't need to be re-run
## if and when I reformat or otherwise re-knit this doc to HTML
save(conv_model,conv_het_error_model,simple_model,simple_het_error_model,
file=here::here("output","models_compared.Rdata"))
```
### Model comparisons
```{r}
# load the output from the models from where I stored it
load(here::here("output","models_compared.Rdata"))
```
Which model fits best? How to compare?
`sommer` provides `anova()` for pairwise comparisons of models using a likelihood ratio test *and* by AIC, BIC.
```{r}
anova(conv_model,conv_het_error_model)
```
Here's a very not-pretty not-high-throughput way to set-up a table comparing the AIC for all 4 models.
```{r}
tibble(Model=c("conv_model","conv_het_error_model","simple_model","simple_het_error_model"),
AIC=c(conv_model$AIC,conv_het_error_model$AIC,simple_model$AIC,simple_het_error_model$AIC)) %>%
arrange(AIC) %>% gt::gt()
```
The "simple" heterogeneous error model fits the data best, I'll go ahead and use it for this demonstration.
In addition, for plot-basis yield traits, the proportion of stands harvested (**PropNOHAV**) as fixed-effect covariate.
## Learn functions and iteration
At this point, taking a moment to familiarize yourself with some R packages and programming concepts might be adviseable.
1. Learn to read and write your own functions in R
2. Learn to iterate procedures in R
### Write functions in R
<https://swcarpentry.github.io/r-novice-inflammation/02-func-R/>
<https://r4ds.had.co.nz/functions.html>
### Iteration in the tidyverse
Packages `purrr` and (for parallel processing) `furrr`.
<https://r4ds.had.co.nz/iteration.html>
<https://www.r-bloggers.com/2021/09/tidy-parallel-processing-in-r-with-furrr/>
<https://purrr.tidyverse.org/>
<https://dcl-prog.stanford.edu/purrr-mutate.html>
<https://dcl-prog.stanford.edu/purrr-parallel.html>
<https://furrr.futureverse.org/>
## Analyze all traits
Next, we will fit the "best" model from the previous step, but to all traits.
Below, I'll take the time to illustrate a tidy way to set-up and then execute the model, processing traits in parallel.
In short, we have four data chunks set-up in the tibble `phenos`
```{r}
phenos
```
Let's go step-by-step to build a good understanding of a programming workflow. We'll use similar procedures in many of the analyses and data processes we do downstream.
(1) We will write a function that will take each data chunk (tibbles stored in `phenos$METdata`) as an input argument.
(2) We'll execute the function (and thus the analysis) simultaneously (in parallel) across the data chunks using the `furrr` package.
### Test code for function
Here's how we want our call to `mmer()` to fit the mixed-model to look like:
```{r, eval=F}
# DON"T RUN
simple_het_error_model<-mmer(fixed = Value~yearInLoc,
random=~germplasmName + repInTrial + blockInRep,
rcov=~vs(ds(studyName),units),
data=METdata, getPEV = T)
```
To build a function around this, we want to be able to input different values ofthe arguments to `mmer()`, namely: `data=`, `fixed=`, `random=` and `rcov=`.
Note that in this case, the only actual difference among traits in our example is going to be an added fixed effect for the yield traits, the **`PropNOHAV`**.
```{r set input args}
# Input arguments for the function:
Trait<-phenos$Trait[[1]]
# input for "data="
METdata<-phenos$METdata[[1]]
# input for "fixed="
fixedFormula<-ifelse(Trait %in% c("logDYLD","logFYLD","logRTNO","logTOPYLD"),
"Value ~ yearInLoc + PropNOHAV",
"Value ~ yearInLoc")
# input for "random="
randFormula<-"~germplasmName + repInTrial + blockInRep";
# input for "rcov="
rcovFormula="~vs(ds(studyName),units)"
```
Here's how we can call `mmer()` inside another function, using the variables in the above code chunk as input arguments:
```{r fit mmer model,cache=T, eval=F}
starttime<-proc.time()[3]
mmer_output<-mmer(fixed = as.formula(fixedFormula),
random = as.formula(randFormula),
rcov = as.formula(rcovFormula),
data=METdata,
getPEV = T)
stoptime<-proc.time()[3]; elapsed<-stoptime-starttime; elapsed/60
# iteration LogLik wall cpu(sec) restrained
# 1 -367.006 10:18:35 31 0
# 2 -338.545 10:19:5 61 0
# 3 -325.9 10:19:38 94 0
# 4 -321.988 10:20:8 124 0
# 5 -321.2 10:20:37 153 0
# 6 -321.017 10:21:4 180 0
# 7 -320.971 10:21:29 205 0
# 8 -320.959 10:21:55 231 0
# 9 -320.956 10:22:20 256 0
# 10 -320.955 10:22:46 282 0
# elapsed
# 4.775417
```
Took 5 whole minutes to fit that, yikes.
**Outlier Detection and Removal:** My standard pipeline includes outlier detection and removal. (1) Fit the model. (2) Define outliers as observations where the absolute-value of the standardized residuals \>3.3. (3) Remove outliers and re-fit the model. (4) Check for and remove residuals again. (5) If more residuals were detected and removed, re-fit a final time.
For the sake of a simple, quick example, the demonstration below, will just do one round removing outliers.
```{r outlier detection, cache=T, eval=F}
# index observations that are defined as outliers
outliers<-which(abs(scale(mmer_output$residuals))>3.3)
length(outliers) # How many?
# [1] 6
```
```{r remove outliers and refit, cache=T, eval=F}
# remove outliers, if any
if(length(outliers)>0){
x<-METdata[-outliers,]
# Refit the model
starttime<-proc.time()[3]
mmer_output<-mmer(fixed = as.formula(fixedFormula),
random = as.formula(randFormula),
rcov= as.formula(rcovFormula),
data=x,
getPEV = T)
stoptime<-proc.time()[3]; elapsed<-stoptime-starttime; elapsed/60
}
if(length(outliers)==0){ outliers<-NULL }
save(mmer_output,outliers,
file=here::here("output","mmer_output_example_during_stage1.Rdata"))
```
Next we extract and format the outputs we want from the mixed-model; these will be combined into a list and passed to the `return()` statement of the function we are building.
In the chunk below, I extract and compute a number of things:
1) Model fit stats like AIC, log likelihood and the number of grouping factors, mostly stored for posterity
2) Variance component estimates
3) A broad-sense heritability value `H2`
4) Extract **`BLUP`**s and corresponding **`PEV`**s (prediction error variance) for the genotype-factor ("**germplasmName**")
5) Compute
- reliabilities (**`REL`**) for each **`BLUP`**
- de-regressed BLUPs (**`drgBLUP`**)
- Weighting factor (**`WT`**) for each BLUP, to be used in downstream analyses where the **`drgBLUP`** is supplied as the response variable.
In a few sections ([see here](#de-regression-explained)), I'll break down what these are and how they are used.
```{r tidy outputs}
load(here::here("output","mmer_output_example_during_stage1.Rdata"))
# summary(mmer_output)
# log likelihood of the model, AIC, convergence T/F
modelfit<-summary(mmer_output)$logo
# number of groups for factors, could be used to compute DF
groups<-summary(mmer_output)$groups
# variance components
varcomp<-summary(mmer_output)$varcomp
# genotypic variance
Vg<-mmer_output$sigma$germplasmName %>% as.vector()
# Mean error-variance across trials
## across heterogeneous error variance estimates
meanHetVarE<-mmer_output$sigma %>%
.[names(.) %>% grep(":units",.,value=T)] %>%
unlist() %>%
mean()
# Mean number of reps per accession
meanNreps<-METdata %>%
count(germplasmName) %$% mean(n)
# Broad-sense heritability
H2<-Vg/(Vg+(meanHetVarE/meanNreps))
# Extract the BLUPs and PEVs, compute Reliability (REL),
# de-regressed BLUPs and weights for downstream analysis
blups<-mmer_output$U$germplasmName$Value %>%
tibble(germplasmName=names(.),BLUP=.) %>%
mutate(germplasmName=gsub("germplasmName","",germplasmName),
# prediction error variance
PEV=diag(mmer_output$PevU$germplasmName$Value),
REL=1-PEV/Vg, # Reliability
drgBLUP=BLUP/REL, # De-regressed BLUP
WT=(1-H2)/((0.1 + (1-REL)/REL)*H2) # weight for downstream
)
# Combine all outputs into one object the function can return()
out<-tibble(H2=H2,meanHetVarE=meanHetVarE,meanNreps=meanNreps,
modelfit=list(modelfit),
groups=list(groups),
blups=list(blups),
varcomp=list(varcomp),
outliers=list(outliers))
```
### Set-up the function
I combined the code chunks above into a function **`get_blups()`**. To spare the duplication of code here, the function is in its own R script: **`code/get_blups.R`**.
**A NOTE ON DEBUGGING:** Building and executing analyses in this way is often an iterative process. After my first run of the function `get_blups()`, I noticed all the results were the same. In this case, I leftout an argument from the `function()` call. Details not important. *SUGGESTION*: When you put the code into an actual `function()`, test it on a single code chunk, especially if the analysis will run a long time or use large resources.
### Run the full analysis
Finally, we are ready to source the function we built and execute the full analysis on all traits with it.
```{r}
source(here::here("code","get_blups.R"))
phenos %<>%
mutate(fixedFormula=ifelse(Trait %in% c("logDYLD","logFYLD","logRTNO","logTOPYLD"),
"Value ~ yearInLoc + PropNOHAV",
"Value ~ yearInLoc"),
randFormula="~germplasmName + repInTrial + blockInRep",
rcovFormula="~vs(ds(studyName),units)")
```
The next chunk uses the function `future_pmap()` from the `furrr` package to exectute the `get_blups()` function across all four traits in parallel.
```{r, cache=T, eval=F}
starttime<-proc.time()[3]
require(furrr); plan(multisession, workers = 4)
blups<-phenos %>%
mutate(modelOut=future_pmap(.,get_blups)) %>%
unnest(modelOut)
stoptime<-proc.time()[3]; elapsed<-stoptime-starttime; elapsed/60
# elapsed
# 17.25957
# Save
saveRDS(blups,file = here::here("output","blups.rds"))
```
Took 17 minutes to run on my laptop :(
### Why is MCMDS heritability 0?
```{r}
blups<-readRDS(here::here("output","blups.rds"))
# blups %>% rmarkdown::paged_table()
blups %>%
dplyr::select(Trait,H2,meanHetVarE,meanNreps,outliers) %>%
mutate(H2=round(H2,3),
meanNreps=round(meanNreps,3),
outliers=map_dbl(outliers,length)) %>%
rmarkdown::paged_table()
```
Why is the H2 for **MCMDS** 0? Aside from this being an unfortunate result, knowing that it is unusual / atypical requires knowledge of cassava genetic architecture.
The heritablity of **DM**, **logFYLD** and **logDYLD** are both pretty much in line with "typical" results, *but* MCMDS in most cassava populations is a pretty heritable trait.
I notice there were 33 outliers for MCMDS.
Let's look at the distribution of MCMDS in the raw data:
```{r}
blups$METdata[[2]] %>% ggplot(.,aes(x=Value)) +
geom_density()
```
This is actually often how MCMDS scores look like.
Which are the observations that were removed as outliers?
```{r}
blups$METdata[[2]][blups$outliers[[2]],] %$% summary(Value)
```
They are exclusively high scores.
I wonder if removing them was a mistake:
To find out, let's re-run the analysis just for MCMDS, this time turning outlier removal off. I actually added an `removeOutliers=T/F` argument and a few lines of code to the `get_blups()` function to make this happen:
```{r, cache=T, eval=F}
source(here::here("code","get_blups.R"))
blups_mcmds_noRemoveOut<-phenos %>%
filter(Trait=="MCMDS") %>%
mutate(modelOut=pmap(.,get_blups,removeOutliers = FALSE)) %>%
unnest(modelOut)
save(blups_mcmds_noRemoveOut,
file=here::here("output","blups_mcmds_noRemoveOut.Rdata"))
```
```{r}
load(here::here("output","blups_mcmds_noRemoveOut.Rdata"))
blups_mcmds_noRemoveOut %>%
dplyr::select(Trait,H2,meanHetVarE,meanNreps,outliers) %>%
mutate(H2=round(H2,3),
meanNreps=round(meanNreps,3),
outliers=map_dbl(outliers,length)) %>%
rmarkdown::paged_table()
```
```{r}
blups_mcmds_noRemoveOut$blups[[1]] %>%
summary(across(~summary(.)))
```
**DECLARING MCMDS A LOSS** You can see from the summary, MCMDS *in this dataset* have essentially no genetic variance / reliability. For the sake of moving on with the example, I will exclude MCMDS from further consideration.
For my breeding program or critical experiment, I *would not* stop there. If there are not alternative trials to draw data from, the non-normally distributed, pseudo-count distributed MCMDS data can be modeled as such.
## De-regressed BLUPs and weighted two-stage prediction {#de-regression-explained}
The **`BLUP`** represents a single performance prediction for each unique **germplasmName** *without* accounting for genomic relatedness using marker data but accounting for the potentially unbalanced original trial data's experimental design factors. The **BLUP**s are in the units of the original trait but centered to the population mean.
**BLUP**s are shrunken, *i.e.* regressed towards the mean.Downstream genomic predictions fit the **germplasmName** again as a random effect in a mixed-model. For this reasion, using **BLUP**s directly as the response variable would doubly shrink the data. Instead, it is advised to do what is called "de-regression" of the BLUPs @Garrick2009.
The de-regressed BLUP (**drgBLUP**) is simply the BLUP divided by a quantity called the **reliability** or $r^{2}$.
The reliability is defined as: $r^{2} = 1 - \frac{PEV}{\sigma^{2}_{g}}$
- **PEV:** = prediction error variance, a measure exactly what it sounds like, the error variance associated with each individual prediction, which may be different among predictions, depending on the data. When we ran `sommer::mmer()` we set `getPEV=TRUE` and then later extracted the diagonal-values of the resulting PEV matrix (`PEV=diag(mmer_output$PevU$germplasmName$Value)`).
- $\sigma^{2}_{g}$ = the genetic variance estimated in Step 1. **PEV** will always be some fraction of the total genetic variance.
Thus reliability ranges from 0 to 1 and measures the *certainty* surrounding each **`BLUP`** value; or rather, the probability the **`BLUP`** would change if anothe experiment (data point) were added. So if $r^{2}=1$, then we have essentially zero expected error for a BLUP. For example, check varieties often have so many observations in a dataset that they will have high reliability.
So if a de-regressed BLUP is defined as $\frac{BLUP}{r^{2}}$ you can see that for clones with high reliability, the BLUP will stay essentially the same, but if the reliability is low, the **`BLUP`** will be *un-shrink* or *de-regress* strongly. This is going to happen because genotypes with few observations and thus low reliability will be more strongly shrunken to zero during the first step.
Below is a density plot showing the distribution of the original data compared to the **`BLUP`** and the **`drgBLUP`**. This is meant to show how shrinkage and unshrinkage effect the data. Since BLUPs and drgBLUPs are centered on the mean, I added back the grand mean for comparison to the original data.
```{r class.source = 'fold-hide'}
popMean<-mean(blups$METdata[[1]]$Value)
blups$METdata[[1]] %>%
dplyr::select(observationUnitDbId,Value) %>%
mutate(Data="Original Data") %>%
bind_rows(blups$blups[[1]] %>%
dplyr::select(germplasmName,BLUP,drgBLUP) %>%
pivot_longer(cols=c("BLUP","drgBLUP"),
names_to = "Data", values_to = "Value") %>%
mutate(Value=Value+popMean)) %>%
ggplot(.,aes(x=Value,fill=Data)) + geom_density(alpha=0.75) + theme_bw() +
geom_vline(xintercept = popMean)
```
Next, I grab the blups from only the **germplasmName** with the two highest and the two lowest **REL**. In all cases, the de-regressed BLUP should shift away from the mean, but proportional to the reliability. Below, I devised a plot to show that:
```{r class.source = 'fold-hide'}
library(ggrepel)
blups$blups[[1]] %>%
slice_max(order_by = REL,n = 2) %>%
bind_rows(blups$blups[[1]] %>%
slice_min(order_by = REL,n = 2)) %>%
dplyr::select(germplasmName,REL,BLUP,drgBLUP) %>%
pivot_longer(cols = c("BLUP","drgBLUP"),
names_to = "Data",values_to = "Value") %>%
ggplot(.,aes(x=Value,y=germplasmName,
color=REL,group=germplasmName)) +
geom_line() +
geom_point(aes(shape=Data)) +
geom_text_repel(aes(label=Data)) +
geom_vline(xintercept = 0)
```
The de-regressed BLUP will be used as a response variable in downstream genomic prediction.
We can account for the differential reliability among genotypes downstream using a weighted analysis. We will compute our weights according to @Garrick2009. Using `sommer::mmer()`, weights can be supplied through the `weights=` argument (see the [sommer::mmer documentation](https://cran.r-project.org/web/packages/sommer/sommer.pdf) for details). The values supplied will be weight the error term proportional to $\frac{1}{\sqrt{WT}}$, meaning drgBLUP with higher WT have smaller error; more effect on the prediction.
@Garrick2009's recommended weight is:
$$WT = \frac{1-H^{2}}{0.1+\frac{1-r^{2}}{r^{r}} \times H^{2}}$$
- $H^{2}$ is the broad-sense heritability (H2).
- $r^{2}$ is reliability (REL).
```{r class.source = 'fold-hide'}
REL<-H2<-seq(from=0.1,to=0.9,by=0.05)
expand.grid(REL=REL,H2=H2) %>%
mutate(WT=(1-H2)/((0.1 + (1-REL)/REL)*H2)) %>%
ggplot(.,aes(x=REL,y=WT,color=H2, group=H2)) + geom_line()
```
I made a plot showing how **`WT`** varies across a range of **`H2`** and **`REL`**. This is to show how the weighting behaves. As we see, **`WT`** scales with **`REL`** and **`H2`**. **`H2`** conditions the overall variability among individuals in their weight in the analysis. This works such that if heritability is low, there is a greater relative premium (weight) placed on individuals with high quality data (high **`REL`**), usually those with high numbers of observations, e.g. check-genotypes. If heritability is high, **`WT`**s are low and there isn't much difference between low and high **`REL`**.
Here's a plot that shows from the actual `blups`, how the **`WT`** scales with the number of observations-per-genotype.
```{r}
blups$blups[[1]] %>%
# Count the number of plots per clone in the raw data and merge to the BlUPs
left_join(blups$METdata[[1]] %>% count(germplasmName,name = "Nplots")) %>%
ggplot(.,aes(x=Nplots,y=WT)) + geom_point()
```
Next, I plot the **`BLUP`**s again the de-regressed BLUPs and scale point-size (**`WT`**) and color (**`REL`**). The red line has a slope of 1. You can see clearly how high reliability predictions don't change much (close to the red line). You can also see that the farther the **`BLUP`** from the mean, the more the de-regression (the more the original shrinkage).
```{r}
blups$blups[[1]] %>%
# Count the number of plots per clone in the raw data and merge to the BlUPs
left_join(blups$METdata[[1]] %>% count(germplasmName,name = "Nplots")) %>%
ggplot(.,aes(x=BLUP,y=drgBLUP,color=REL,size=WT)) +
geom_point() + geom_vline(xintercept = 0) + geom_hline(yintercept = 0) +
geom_abline(slope=1, color='darkred')
```
### Computing H2 for weights
Quick note on computing $H^2$ for the **`WT`**s above. Heterogeneneous error variance model does not lend itself to standard compute of $H^2$ across trials.
We need an $H^2$ that reflects the overall dataset. The analysis above used heterogenoeous error variances. Therefore, I decided to use the following $H^2$:
$$H^2 = \frac{\sigma^2_g}{\sigma^2_g+\bar{\sigma^2_{e}}/\bar{n}_{reps}}$$
The $\sigma^2_g$ is the mixed-model estimate.
The $\bar{\sigma^2_{e}}$ is the average error-variance estimate across trials.
$\bar{n}_{reps}$ is the mean number of reps per genotype in the dataset.
So this $H^2$ definition should reflect the average trial and the average number of replicates per line.
Alternatives are definitely possible and users are encouraged to question and experiment!
## Mixed models in brief
There are often factors that effect our experiments that we don't specifically care about. As an example, we might replicate our experiment within a site by planting two different fields with the same design. We can easily imagine that small variations in soil, slope and aspect, elevation and planting date might influence the outcome at least to some extent. But most of the time, we aren't after the effect of the field or planting date. We want to learn some general about the location as a whole.
We can make some assumption that there are a universe of possible outcomes from any given part of a field station, for example, that might be observed. Every replicate / experiment we do just shows us one manifestation of that.
This makes the replication, field or planting date a **random variable**.
In addition, data are often complex and messy. Sample sizes in individual grouping variables often leave something to be desired , especially if we are trying to fit complicated models with many parameters. On top of that our data points might not be truly independent - for instance, there could be spatial autocorrelation between plots or blocks that are more similar because they are in a similar location. Or, as is the case in genomic selection, there we might have information about the genetic relatedness of the germplasm we are growing, that we can measure and want to account for.
This is why mixed models were developed, to deal with such messy data and to allow us to use all our data, even when we have low sample sizes, structured data, and many covariates to fit. Oh, and on top of all that mixed models allow us to save degrees of freedom compared to running standard linear models! Sounds good, doesn't it?
In broad terms, **fixed effects** are variables that we expect will have an effect on the dependent/response variable. Most of the time, **random effects** are usually grouping factors for which we are trying to control. A lot of the time we are not specifically interested in their impact on the response variable. Additionally, the data for our random effect is just a sample of all the possibilities.
The above three paragraphs were paraphrased from (<https://ourcodingclub.github.io/2017/03/15/mixed-models.html>).
There is a **major** exception to the statement above about using random-effects to control e.g. replication variance: genotypes.
Our primary objective is to obtain correct rankings of the genetic- or heritable-economic value of each accession, in order to determine which to cross or advance.
### Links for learning
<https://si.biostat.washington.edu/sites/default/files/modules/Seattle-SISG-18-IntroQG-Lecture08.pdf>
<https://towardsdatascience.com/how-linear-mixed-model-works-350950a82911>
<https://stats.oarc.ucla.edu/other/mult-pkg/introduction-to-linear-mixed-models/>
[Excellence in Breeding - Basics of GS - 2021 YouTube Video Series with Eduardo Covarrubias](https://www.youtube.com/playlist?list=PLZ0lafzH_UmfHsoePj0t7xQglKSOpzbDp)