analysis/baselineSim.Rmd

---
title: "Baseline simulation of cassava GS"
author: "Marnin Wolfe"
date: "2021-Aug-27"
output: 
  workflowr::wflow_html:
    code_folding: hide
    toc: true
editor_options:
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, 
                      tidy='styler', 
                      tidy.opts=list(strict=FALSE,width.cutoff=100), 
                      highlight=TRUE)
```

# Previous steps

1.  Empirical estimate of TrialType-specific error variances in terms of the IITA selection index (SELIND). See that analysis [here](https://wolfemd.github.io/IITA_2021GS/inputsForSimulationV2.html).

2.  [Burn-in simulations](burnInSims.html)

# Set-up a singularity shell with R+OpenBLAS

***This is optional and can be skipped.***

If you want the advantage of multi-threaded BLAS to speed up predictions within the simulations, you need an R instance that is linked to OpenBLAS (another example is Microsoft R Open). For CBSU, the recommended approach is currently to use singularity shells provided by the "rocker" project. They even come pre-installed with tidyverse :). Linked to OpenBLAS, using a simple function `RhpcBLASctl::blas_set_num_threads()` I can add arguments to functions to control this feature. For optimal performance, it is import to balance the number of threads each R session uses for BLAS against any other form of parallel processing being used and considering total available system resources.

```{bash, eval=F}
# 0) Pull a singularity image with OpenBLAS enabled R + tidyverse from rocker/
# singularity pull ~/rocker2.sif docker://rocker/tidyverse:latest;

# 1) start a screen shell 
screen; 
# 2) reserve interactive slurm
salloc -n 25 --mem 60G;
# 3) start the singularity Linux shell inside that
singularity shell ~/rocker2.sif; 
# Project directory, so R will use as working dir.
cd /home/mw489/BreedingSchemeOpt/;
# 3) Start R
export OMP_NUM_THREADS=1;
R
```

```{r, eval=F}
# Install genomicMateSelectR to user-accessible libPath
### In a singularity shell, sintall as follows:
libPath<-"/home/mw489/R/x86_64-pc-linux-gnu-library/4.1" # should be YOUR libPath
withr::with_libpaths(new=libPath, devtools::install_github("wolfemd/genomicMateSelectR", ref = 'master'))
### Else, simply
devtools::install_github("wolfemd/genomicMateSelectR", ref = 'master')
# Install my own forked repo of AlphaSimHlpR
withr::with_libpaths(new=libPath, install.packages("Rcpp"))
withr::with_libpaths(new=libPath, install.packages("AlphaSimR"))
withr::with_libpaths(new=libPath, install.packages("optiSel"))
withr::with_libpaths(new=libPath, install.packages("rgl"))
withr::with_libpaths(new=libPath, devtools::install_github("wolfemd/AlphaSimHlpR", ref = 'master', force=T))
# devtools::install_github("wolfemd/AlphaSimHlpR", ref = 'master', force=T)
```

# A small example

**Objective:** develop a hopefully faster GS simulator which uses a fixed number of clones for predictions... hoping to constrain compute requirements better than using the `trainingPopCycles` parameter, without sacrificing much prediction accuracy.

For specifying a `newBSP`, there are some considerations. The `bsp` from previous run contains an entry `bsp$checks` specifying a pop-object with randomly chosen checks from the burn-in phase. *For now*, I will ensure the burn-in checks are used post burn-in. Keep it simple. For that reason, `newBSP` should not alter the `nChks` value. If it does, might cause code to break...

-   Set-up my own population improvement function

-   Added some options to `bsp` objects to control which phenotype records are used, which clones are includedin the TP, and which clones are considered selection candidates

-   JL's original `popImprov1Cyc` drew `candidates` from the full `records$F1@id` (excluding potentially indivs only scored during the current year, if `useCurrentPhenoTrain=FALSE`).

-   My changes:

    -   `nTrainPopCycles`: draw training pop clones only from this number of recent cycles.

    -   `nYrsAsCandidates`: candidates for selection only from this number of recent years

    -   `maxTrainingPopSize`: From the lines in the most recent cycles (indicated by `nTrainPopCycles`), subsample this number of lines for training data.

        -   This is *in addition to* the "check" (`bsp$checks@id`) and the lines indicates as selection `candidates` according to the setting of `nYrsAsCandidates`.
        -   Phenotypic records of `candidates` will *also* be included in any predictions.
        -   All "historical" data will always be used, but the number of maximum training lines will be held constant.
        -   Replaces the stage-specific `bsp$trainingPopCycles`, which will be unused in this pipeline, but not deleted from the package.
        
```{r inputs,eval=T}
suppressMessages(library(AlphaSimHlpR))
suppressMessages(library(tidyverse))
suppressMessages(library(genomicMateSelectR))
select <- dplyr::select

schemeDF<-read.csv(here::here("data","baselineScheme - Test.csv"), 
                   header = T, stringsAsFactors = F)

source(here::here("code","runSchemesPostBurnIn.R"))

simulations<-readRDS(here::here("output","burnIn_test.rds"))
```

## Example
```{r}
newBSP<-specifyBSP(schemeDF = schemeDF,
                   nChr = 3,effPopSize = 100,quickHaplo = F,
                   segSites = 400, nQTL = 40, nSNP = 100, genVar = 40,
                   gxeVar = NULL, gxyVar = 15, gxlVar = 10,gxyxlVar = 5,
                   meanDD = 0.5,varDD = 0.01,relAA = 0.5,
                   stageToGenotype = "PYT",
                   nParents = 10, nCrosses = 4, nProgeny = 50,nClonesToNCRP = 3,
                   phenoF1toStage1 = T,errVarPreStage1 = 500,
                   useCurrentPhenoTrain = F, 
                   nCyclesToKeepRecords = 30,
                   selCritPipeAdv = selCritIID, 
                   selCritPopImprov =  selCritIID,
                   nTrainPopCycles=6,
                   nYrsAsCandidates=2,
                   maxTrainingPopSize=500)
```

```{r continue after burn-in with GS, eval=F}
start<-proc.time()[3]
postBurnInGS_test<-runSchemesPostBurnIn(simulations = simulations,
                                        newBSP=newBSP,
                                        nPostBurnInCycles=10,
                                        selCritPop="parentSelCritGEBV",
                                        selCritPipe="selCritIID",
                                        productFunc="productPipeline",
                                        popImprovFunc="popImprovByParentSel",
                                        ncores=4,
                                        nBLASthreads=4)
saveRDS(postBurnInGS_test,file = here::here("output","postBurnInGS_test.rds"))
end<-proc.time()[3]; print(paste0((end-start)/60," mins elapsed"))
```

## Benchmark "full-scale" GS cycles

## Debug
```{r debugging params}
# burnInSim<-simulations$burnInSim[[1]]
# selCritPop="parentSelCritGEBV";
# selCritPipe="selCritIID";
# productFunc="productPipeline";
# popImprovFunc="popImprov1Cyc";
# newBSP=burnInSim$bsp
# newBSP[["maxTrainingPopSize"]]<-500
# newBSP[["nYrsAsCandidates"]]<-2
# newBSP[["nTrainPopCycles"]]<-10
# newBSP$checks<-NULL
```

```{r runSchemesPostBurnIn}
# runSchemesPostBurnIn<-function(simulations,
#                                newBSP=NULL, # so you can change the scheme entirely after burn-in
#                                nPostBurnInCycles,
#                                selCritPop="selCritIID",
#                                selCritPipe="selCritIID",
#                                productFunc="productPipeline",
#                                popImprovFunc="popImprovByParentSel",
#                                ncores=1,
#                                nBLASthreads=NULL,nThreadsMacs2=NULL){
# 
#   require(furrr); plan(multisession, workers = ncores)
#   options(future.globals.maxSize=+Inf); options(future.rng.onMisuse="ignore")
# 
#   simulations<-simulations %>%
#     mutate(SimOutput=future_map2(SimRep,burnInSim,function(SimRep,burnInSim,...){
#       # debug 
#       # burnInSim<-simulations$burnInSim[[1]]
#       if(!is.null(nBLASthreads)) { RhpcBLASctl::blas_set_num_threads(nBLASthreads) }
#       cat("******", SimRep, "\n")
# 
#       # This CONTINUES where previous sims left off
#       ## no initialize step
#       ## Keep burn-in stage sim params "SP"
#       SP<-burnInSim$SP
#       ## specify a potentially new bsp object 
#       ## (keep checks stored in burn-in stage's bsp)
#       if(!is.null(newBSP)){ 
#         bsp<-newBSP; bsp$checks<-burnInSim$bsp$checks 
#       } else { bsp<-burnInSim$bsp }
#       ## 'historical' records from burn-in
#       records<-burnInSim$records
#       ## override burn-in specified product and population improvement funcs 
#       bsp[["productPipeline"]] <- get(productFunc)
#       bsp[["populationImprovement"]] <- get(popImprovFunc)
#       bsp[["selCritPipeAdv"]] <- get(selCritPipe)
#       bsp[["selCritPopImprov"]] <- get(selCritPop)
# 
#       # Post burn-in cycles
#       cat("\n"); cat("Post burn-in cycles"); cat("\n")
#       for (cycle in 1:nPostBurnInCycles){
#         cat(cycle, " ")
#         records <- bsp$productPipeline(records, bsp, SP)
#         records <- bsp$populationImprovement(records, bsp, SP)
#       }
# 
#       # Finalize the stageOutputs
#       records <- AlphaSimHlpR:::lastCycStgOut(records, bsp, SP)
# 
#       return(list(records=records,
#                   bsp=bsp,
#                   SP=SP))
#     },
#     nPostBurnInCycles=nPostBurnInCycles,
#     selCritPop=selCritPop,
#     selCritPipe=selCritPipe,
#     productFunc=productFunc,
#     popImprovFunc=popImprovFunc,
#     nBLASthreads=nBLASthreads,
#     nThreadsMacs2=nThreadsMacs2))
#   plan(sequential)
#   return(simulations)
# }

```