---
title: "Variable preparation"
author: "Robert Schlegel"
date: "2019-05-23"
output: workflowr::wflow_html
editor_options:
chunk_output_type: console
---
```{r global_options, include = FALSE}
knitr::opts_chunk$set(fig.width = 8, fig.align = 'center',
echo = TRUE, warning = FALSE, message = FALSE,
eval = TRUE, tidy = FALSE)
```
## Introduction
This vignette walks through the steps needed to create mean 'whatever' states during all of the MHWs detected in the previous [SST preparation](https://robwschlegel.github.io/MHWNWA/sst-prep.html) vignette. These 'whatever' states are any of the abiotic variables present in the NAPA model that have been deemed relevant with respect to the forcing of extreme ocean surface temperatures.
```{r startup}
# Packages used in this vignette
library(tidyverse) # Base suite of functions
library(ncdf4) # For opening and working with NetCDF files
library(lubridate) # For convenient date manipulation
# Set number of cores
doMC::registerDoMC(cores = 50)
# Disable scientific notation for numeric values
# I just find it annoying
options(scipen = 999)
# Corners of the study area
NWA_corners <- readRDS("data/NWA_corners.Rda")
# The NAPA data location
NAPA_files <- dir("../../data/NAPA025/1d_grid_T_2D", full.names = T)
# The NAPA model lon/lat values
NAPA_coords <- readRDS("data/NAPA_coords.Rda")
# Load NAPA bathymetry/lon/lat
NAPA_bathy <- readRDS("data/NAPA_bathy.Rda")
# Load MHW results
NAPA_MHW_sub <- readRDS("data/NAPA_MHW_sub.Rda")
```
For the upcoming variable prep we will also want the NAPA coordinates that fall within our chosen study area, as defined by `NWA_corners`. We will also create a subsetted bathymetry file to serve as an effective land mask. This will help us to reduce a lot of computational cost as we go along because many of the pixels over land are given 0 values, rather than missing values, which is a strange choice.
```{r NAPA-coords-sub}
# The NAPA coordinates for the study area only
NAPA_coords_sub <- NAPA_coords %>%
filter(lon >= NWA_corners[1], lon <= NWA_corners[2],
lat >= NWA_corners[3], lat <= NWA_corners[4])
# saveRDS(NAPA_coords_sub, "data/NAPA_coords_sub.Rda")
# The NAPA bathymetry for the study area only
NAPA_bathy_sub <- NAPA_bathy %>%
right_join(NAPA_coords_sub, by = c("lon_index", "lat_index", "lon", "lat"))
saveRDS(NAPA_bathy_sub, "data/NAPA_bathy_sub.Rda")
```
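As a toy illustration of the land-mask idea (synthetic grid; the convention that land pixels carry a bathymetry of 0 is taken from the note above):

```r
# Toy grid standing in for the NAPA bathymetry: bathy == 0 marks land pixels
toy_bathy <- data.frame(
  lon   = c(-65, -64, -63, -62),
  lat   = c( 42,  43,  44,  45),
  bathy = c(  0, 150,   0, 320)
)

# Dropping the 0-depth rows gives an effective land mask, so later
# per-pixel computations never touch the meaningless land values
# (with dplyr this would be: filter(toy_bathy, bathy > 0))
toy_bathy_sub <- subset(toy_bathy, bathy > 0)
nrow(toy_bathy_sub)  # 2
```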
## Chosen variables
There are many variables present in the NAPA model, more than we would really need to use for this project. We have therefore chosen to narrow our investigation. Listed below are all of the variables found within a given NAPA surface layer NetCDF file.
```{r ncdf-var-dump}
# Spreadsheet of variables present in the NetCDF files
NAPA_vars <- ncdump::NetCDF(NAPA_files[1])$variable[1:6]
NAPA_vars
```
These variables do not amount to an unruly volume of information, so we will extract and compile all of them when creating the data packets that will be fed into our SOMs later. Many of these variables are likely to be highly auto-correlated, in which case we will need to select only the most representative of them. I foresee the multiple heat terms coming to a head with one another, though I can't yet say from this early vantage point which will be best.
It should go without saying, but we will not be including lon/lat or the three time variables at the end of the above spreadsheet, as these are not proper abiotic variables. Our initial variable list is therefore as follows:
```{r ncdf-var-initial}
# Remove unwanted variables
NAPA_vars <- ncdump::NetCDF(NAPA_files[1])$variable[c(1:5, 8:17), 1:6]
# Save
saveRDS(NAPA_vars, "data/NAPA_vars.Rda")
NAPA_vars
```
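The auto-correlation concern above can be screened with a simple pairwise-correlation check before committing to a variable set. This is a sketch on synthetic data; the variable names and the 0.95 cutoff are placeholders, not values from the analysis:

```r
set.seed(13)
n <- 100
# Two nearly redundant heat terms plus an independent variable
toy_vars <- data.frame(
  heat_flux_a = rnorm(n),
  wind_speed  = rnorm(n)
)
toy_vars$heat_flux_b <- toy_vars$heat_flux_a + rnorm(n, sd = 0.05)

cor_mat <- cor(toy_vars)
# Flag pairs above an (arbitrary) 0.95 threshold, upper triangle only
high_cor <- which(abs(cor_mat) > 0.95 & upper.tri(cor_mat), arr.ind = TRUE)
# One member of each redundant pair; the other could then be dropped
rownames(cor_mat)[high_cor[, "row"]]  # "heat_flux_a"
```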
## Synoptic states
With the variables chosen, the next step is to create mean synoptic states for each variable during each of the MHWs detected in all sub-regions. To make that process go more smoothly we will first create a date index of all of the NAPA files present on Eric Oliver's `tikoraluk` server. Unfortunately that means that from here on the code in this vignette will only run on said server. The output of this vignette will however be publicly available [here](https://github.com/robwschlegel/MHWNWA/tree/master/data).
### Date index for NAPA files
To create the index of dates found within each of the thousands of NAPA surface NetCDF files we will use a simple for loop to crawl through the files and record, in one long spreadsheet, which dates are found in which files. While this could be done on the fly in the following steps, it will just be easier to have a stable index prepared.
```{r date-index, eval=FALSE}
# Pull out the dates
NAPA_files_dates <- data.frame()
for(i in seq_along(NAPA_files)){
file_name <- NAPA_files[i]
date_start <- ymd(str_sub(basename(as.character(file_name)), start = 29, end = 36))
date_end <- ymd(str_sub(basename(as.character(file_name)), start = 38, end = 45))
date_seq <- seq(date_start, date_end, by = "day")
date_info <- data.frame(file = file_name, date = date_seq)
NAPA_files_dates <- rbind(date_info, NAPA_files_dates)
}
# Order by date, just for tidiness
NAPA_files_dates <- dplyr::arrange(NAPA_files_dates, date)
# Save
# saveRDS(NAPA_files_dates, "data/NAPA_files_dates.Rda")
```
### Variable climatologies
Part of the data packet we need to create for the SOMs is the anomaly values. To create anomalies, however, we first need climatologies for all of the variables. This may prove to be a somewhat daunting task, but it's what we are here to do! To create a climatology of values we will need to load all of the files and then go pixel-wise about getting the seasonal (daily) climatologies. This will be done with the same function (`ts2clm()`) that is used for the MHW climatologies. We will first create a function that extracts the desired variables from any NetCDF files fed to it. With that done it should be a routine matter to get the climatologies. Hold onto your hats, this is going to be RAM heavy...
```{r clim-var-all}
```
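The chunk above is left empty in the source; as a rough base-R sketch of what a seasonal (day-of-year) climatology amounts to (the real analysis would call `heatwaveR::ts2clm()` per pixel, which additionally smooths the daily means and adds a percentile threshold; the dates and values here are synthetic):

```r
set.seed(66)
# Three years of synthetic daily values for a single pixel
dates <- seq(as.Date("2000-01-01"), as.Date("2002-12-31"), by = "day")
vals  <- sin(2 * pi * as.integer(format(dates, "%j")) / 365) +
  rnorm(length(dates), sd = 0.2)
pixel <- data.frame(t = dates, temp = vals)

# The day-of-year mean is the core of a seasonal climatology
pixel$doy <- as.integer(format(pixel$t, "%j"))
clim <- aggregate(temp ~ doy, data = pixel, FUN = mean)
nrow(clim)  # 366, since 2000 is a leap year
```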
### Variable extractor
We needed a list of the dates present in each file so that we can easily load only the NetCDF files we need to extract our desired variables. The dates we want are the range of dates during each of the MHWs detected in the [SST preparation](https://robwschlegel.github.io/MHWNWA/sst-prep.html) vignette. In the chunk below we will create a function that decides which files should have their variables loaded and a function that binds everything up into tidy data packets that our SOM can ingest.
```{r ncdf-var-extractor}
# Function for extracting the desired variables from a given NetCDF file
# testers...
# file_name <- NAPA_files[1]
extract_all_var <- function(file_name, coords = NWA_corners){
# Extract and join variables with a for loop
# This should be optimised...
NAPA_vars_extracted <- data.frame()
for(i in seq_along(NAPA_vars$name)){
extract_one <- extract_one_var(NAPA_vars$name[i], file_name = file_name)
if(nrow(NAPA_vars_extracted) == 0){
NAPA_vars_extracted <- extract_one
} else {
NAPA_vars_extracted <- left_join(NAPA_vars_extracted, extract_one, by = c("lon_index", "lat_index", "t"))
}
}
# Subset to the study area
# NAPA_vars_sub <- left_join(NAPA_vars_extracted, NAPA_coords, by = c("lon_index", "lat_index")) %>%
# filter(lon >= NWA_corners[1], lon <= NWA_corners[2],
# lat >= NWA_corners[3], lat <= NWA_corners[4]) %>%
# dplyr::select(lon_index, lat_index, lon, lat, everything())
# Free up memory, useful when running multi-core
# rm(NAPA_vars_extracted); gc()
# Exit
return(NAPA_vars_extracted)
}
# Function for extracting variables from as many files as a MHW event requires
# testers...
# event_sub <- NAPA_MHW_event[23,]
data_packet <- function(event_sub){
date_idx <- seq(event_sub$date_start, event_sub$date_end, by = "day")
file_idx <- filter(NAPA_files_dates, date %in% date_idx) %>%
mutate(file = as.character(file)) %>%
select(file) %>%
unique()
# Roughly 264 seconds for seven files
packet_base <- plyr::ldply(file_idx$file, extract_all_var) %>%
filter(t %in% date_idx)
return(packet_base)
}
```
```{r synoptic-states}
# Load NAPA file date index
NAPA_files_dates <- readRDS("data/NAPA_files_dates.Rda")
# MHW Events
NAPA_MHW_event <- NAPA_MHW_sub %>%
select(-clims, -cats) %>%
unnest(events) %>%
filter(row_number() %% 2 == 0) %>%
unnest(events)
```
With all of the synoptic snapshots for our chosen variables created, it is now time to feed them to the [Self-organising map (SOM) analysis](https://robwschlegel.github.io/MHWNWA/som.html).