analysis/som.Rmd

---
title: "Self-organising map (SOM) analysis"
author: "Robert Schlegel"
date: "2020-08-25"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
csl: FMars.csl
bibliography: MHWflux.bib
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(fig.width = 8, fig.align = 'center',
                      echo = TRUE, warning = FALSE, message = FALSE, 
                      eval = TRUE, tidy = FALSE)
```

## Introduction

This vignette contains the code used to perform the self-organising map (SOM) analysis on the mean synoptic states created in the [data preparation](https://robwschlegel.github.io/MHWflux/data-prep.html) vignette. We'll start by creating custom packets that meet certain experimental criteria before feeding them into a SOM.

```{r libraries}
# Load functions and objects to be used below
source("code/functions.R")
```

## Data packet

In this step we will create a data packet that can be fed directly into the SOM algorithm. This means that it must be converted into a super-wide matrix format. In the first run of this analysis on the NAPA model data it was found that the inclusion of the Labrador Sea complicated the results quite a bit. It was also unclear whether or not the Gulf of St Lawrence (GSL) region should be included in the analysis. So in the second run of this analysis multiple different SOM variations were employed and it was decided that the GSL region should be included.

### Prep synoptic state packets

Up first we must create the synoptic state packets.

```{r unnest-packets, eval=FALSE}
# Set number of cores
  # NB: 50 cores uses too much RAM
registerDoParallel(cores = 20)

# Load needed data
ALL_anom <- readRDS("data/ALL_anom.Rda")
ALL_other <- readRDS("data/ALL_other.Rda")

# Create one big anomaly packet from OISST data
system.time(synoptic_states <- plyr::ddply(OISST_MHW_event, c("region", "event_no"),
                                           data_packet_func, .parallel = T, df = ALL_anom)) # 129 seconds
# Save
saveRDS(synoptic_states, "data/synoptic_states.Rda")

# Create other synoptic states per MHW per variable
doParallel::registerDoParallel(cores = 10) # NB: Be careful here...
system.time(synoptic_states_other <- plyr::ddply(OISST_MHW_event, c("region", "event_no"),
                                                 data_packet_func, .parallel = T, df = ALL_other)) # 212 seconds
# Save
saveRDS(synoptic_states_other, "data/synoptic_states_other.Rda")
```

### Create SOM packet

With all of our data ready we may now prepare and save them for the SOM.

```{r create-SOM-packet, eval=FALSE}
## Create wide data packet that is fed to SOM
system.time(packet <- synoptic_states %>%
              select(region, event_no, synoptic) %>%
              unnest(cols = "synoptic") %>%
              wide_packet_func()) # 79 seconds

# Save
saveRDS(packet, "data/packet.Rda")
```

## Run SOM models

Now we feed the SOM with a function that ingests the data packet and produces results for us. The function below has been greatly expanded on from the previous version of this project and now performs all of the SOM related work in one go. This allowed me to remove a couple hundreds lines of code and text from this vignette.

```{r som-run, eval=FALSE}
# # OISST SOM analysis
packet <- readRDS("data/packet.Rda")
synoptic_states_other <- readRDS("data/synoptic_states_other.Rda")
system.time(som <- som_model_PCI(packet, synoptic_states_other)) # 176 seconds
saveRDS(som, file = "data/som.Rda")
# saveRDS(som, file = "shiny/som.Rda")
```

## Investigate clustering of MHWs

A reviewer of the manuscript noted that the MHWs appear to be clustered closely together both within and across regions. This is by design in the methodology, but just how exactly these events cluster together in the SOM nodes warrants further investigation. The answer of how closely events cluster together within each region was answered in the [MHWs vs. heat flux](https://robwschlegel.github.io/MHWflux/mhw-flux.html#Small_events_and_the_shoaling_MLD) vignette. In this section we will first create an index of MHWs that can be said to be occurring across multiple regions at once. We then look to see how often these events are clustered into the same, or different nodes.

```{r event-cluster}
# Double for loop to crawl through events and find overlapping events
# So hideous...
event_overlap_res <- data.frame()
for(i in 1:nrow(OISST_MHW_event)){
  df1 <- OISST_MHW_event[i,]
  df1_dates <- seq(df1$date_start, df1$date_end, by = "day")
  OISST_MHW_event_sub <- filter(OISST_MHW_event, region != df1$region)
  for(j in 1:nrow(OISST_MHW_event_sub)){
    df2 <- OISST_MHW_event_sub[j,]
    df2_dates <- seq(df2$date_start, df2$date_end, by = "day")
    if(any(df2_dates %in% df1_dates)){
      event_overlap <- df2 %>% 
        dplyr::rename(region_match = region, 
                      event_no_match = event_no,
                      duration_match = duration) %>% 
        mutate(region = df1$region,
               event_no = df1$event_no,
               duration = df1$duration,
               date_peak_orig = df1$date_peak) %>% 
        dplyr::select(region, event_no, duration, region_match, 
                      event_no_match, duration_match:date_end, date_peak_orig) %>% 
        mutate(strng1 = paste(region, event_no, region_match, event_no_match),
               strng2 = paste(region_match, event_no_match, region, event_no))
      
      # Check that the pairing hasn't already been counted
      if(!event_overlap$strng2 %in% event_overlap_res$strng1){
        event_overlap_res <- rbind(event_overlap_res, event_overlap)
      }
    }
  }
}; rm(i, j)

# The events per node
SOM <- readRDS("data/som.Rda")
SOM_info <- SOM$info %>% 
  mutate(node = LETTERS[node])

# How well do these overlaps match up?
event_overlap_node <- event_overlap_res %>% 
  left_join(SOM_info[,c("region", "event_no", "node")], 
            by = c("region", "event_no")) %>% 
  left_join(SOM_info[,c("region", "event_no", "node")], 
            by = c("region_match" = "region", "event_no_match" = "event_no")) %>% 
  mutate(match = node.x == node.y)

# Get the total count of overlapping events clustered into the same node
paste0(sum(event_overlap_node$match), "/",
       nrow(event_overlap_node))

# The number of matched events not clustered together when one event is 10 days or shorter
nrow(filter(event_overlap_node, match == FALSE, duration <= 10 | duration_match <= 10))
```

This result shows us that of the 321 overlapping event pairs, 190 of them are clustered into the same nodes as one another. Looking through the results manually one may see that many of the events that are not clustered together, one of the events is short and the other is longer. Indeed, of the 131 overlapping events that are not clustered together, for 74 of these pairs at least one of the events is 10 days or shorter in duration. We can conclude from this that when synoptic scale patterns are causing multiple MHWs simultaneously across regions that the SOM clusters these correctly, with the exception of when one of the events is very short. This is likely because the cause of the much shorter event is a more transient driver that isn't resolved as clearly in the mean synoptic state for the longer event.

Another minor point raised by a reviewer was that the atmospheric pattern in Node I appears to precede that of Node L, and they wanted to know if the MHWs clustered into Node I actually were coming before those events in Node L, and that these two nodes were really just a continuation of a single atmospheric pattern. To determine this we find the gaps in occurrence of the start, peak, and end dates of the MHWs between these nodes. First, we can simply look at the results from the above chunk and confirm that there are indeed no events in Node I that overlap in time with Node L.

```{r node-I-L}
# Prep data
OISST_MHW_event_node <- left_join(OISST_MHW_event, 
                                  SOM_info[,c("region", "event_no", "season_peak", "node")], 
                                  by = c("region", "event_no", "season" = "season_peak")) %>% 
  dplyr::select(region, event_no, node, duration, date_start:date_end)
events_I <- filter(OISST_MHW_event_node, node == "I") %>% 
  left_join(event_overlap_node)
events_L <- filter(OISST_MHW_event_node, node == "L")

# Find the nearest occurring events between nodes
dist_days <- function(df){
  
  # Convert dates to integers for FNN
  df_int <- df %>% 
    mutate(date_start = as.integer(date_start),
           date_peak = as.integer(date_peak),
           date_end = as.integer(date_end))
  events_L_int <- events_L %>% 
    mutate(date_start = as.integer(date_start),
           date_peak = as.integer(date_peak),
           date_end = as.integer(date_end))
  
  # Find nearest event
  res <- events_L[as.numeric(knnx.index(as.matrix(events_L_int[,c("date_start", "date_peak", "date_end")]),
                                        as.matrix(df_int[,c("date_start", "date_peak", "date_end")]), k = 1)),] %>% 
    mutate(region_I = df$region,
           event_no_I = df$event_no,
           duration_I = df$duration,
           date_start_I = df$date_start,
           date_peak_I = df$date_peak,
           date_end_I = df$date_end,
           start_diff = date_start - date_end_I)
  return(res)
}
event_day_dist <- plyr::ddply(events_I, c("region", "event_no"), dist_days)

# The gaps between the nearest events between the nodes
event_day_dist$start_diff[order(event_day_dist$start_diff)]
```

From these results we see that the events occurring in Node I and L usually occurred very far apart from one another. Nearly two years or more in most cases. Remember that the average time distance between events in the MHWs for this study is ~ two months, so this is a very large difference relative to the average. We do however see that there are four values falling within one month of each other. Looking at these values more closely one sees that these are four events from Node I relating to two events in Node L as the Node I events are in two sets of two. Surprisingly, one of these near matches shows that the Node L event occurred before the two nearby Node I events. This implies that this massive low pressure system is moving the opposite direction than expected. Indeed, one of these backward matching events is in the same region, the Gulf of Maine, for both nodes. That warrants a closer look. GOing through the real day by day steps around these events we see that... 

## References