analysis/mhw-flux.Rmd

---
title: "MHWs vs. heat flux"
author: "Robert Schlegel"
date: "2020-02-25"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
csl: FMars.csl
bibliography: MHWflux.bib
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(fig.width = 8, fig.align = 'center',
                      echo = TRUE, warning = FALSE, message = FALSE, 
                      eval = TRUE, tidy = FALSE)
```

## Introduction

This vignette will walk through the thinking and the process for how to link physical variables to their potential effect on driving or dissipating MHWs. The primary source that inspired this work was @Chen2016. In this paper the authors were able to illustrate which parts of the heat budget were most likely driving the anomalous heat content in the surface of the ocean. What this analysis seeks to do is to build on this methodology by applying the fundamental concept to ALL of the MHWs detected in the NW Atlantic. Fundamentally we are running thousands of RMSE and correlation calculations between SST anomalies and the co-occurring anomalies for a range of physical variables. The lower the RMSE and the stronger the correlation (both positive and negative) the more of an indication this is to us that these phenomena are related. We will also be comparing the magnitude of change for Qx terms and SSTa during the onset and decline of MHWs to see how much of these events can be attributed to heat flux.

```{r startup}
# All of the libraries and objects used in the project
# Note that this also loads the data we will be using in this vignette
source("code/functions.R")
```

## Relating SSTa to physical variables

We know when MHWs occurred, and our physical data are prepped, so what we need to do is run magnitude comparisons, RMSE, and correlations between SSTa from the start to peak and peak to end of each event for the full suite of variables. This will show us for each event which values related the best for the onset AND decline of the events. We will run RMSE and correlations on the full time series, but these were decided not to be included in the final manuscript as they don't add much other than to show how much better the calculations for onset or decline are.

The magnitude of change in SSTa and SST_Qx for the onset and decline of MHWs will be calculated by finding how much SSTa and SST_Qx increase from the onset to the peak of a MHW, and how much SSTa and SST_Qx decrease on the decline of the same event. The thinking here is that if SSTa exceeds the SST_Qx increase for the onset, this will indicate that some advective process is aiding the development of the SSTa, but if SST_Qx exceeds SSTa, this indicates that the heat flux term is responsible for the creation of the event. Likewise for the decline of events, if the SST_Qx value decreases more than SSTa this would mean that negative heat flux into the ocean was terminating the event, whereas a greater decrease in SSTa would imply that advection was primarily terminating the event.

Running correlations is useful to see how well certain variables change in relation to SSTa, but these do not show how similar those proportions of change are. In order to do this we need to calculate the RMSE between the SST anomalies and the SST_Qx terms. This may only be done with the SST_Qx terms because they are in the same units as SST. One additional tweak we will need here is to add the cumulative Qx term for the following day to the SSTa from the first day of the MHW. This is done so that we may see more closely how the developing Qx term matches to the development of the SSTa. Effectively this tells us how much the Qx term may be driving SSTa.

```{r MHW-var-cor, eval=FALSE}
# Extract just the event info
OISST_MHW_event_index <- OISST_MHW_event %>% 
  select(event_no, region, season) %>% 
  ungroup() %>% 
  mutate(row_index = 1:n())

# Run all the stats
ALL_cor <- plyr::ddply(OISST_MHW_event_index, .parallel = T,
                       .variables = c("row_index"), .fun = stats_all,
                       df_event = OISST_MHW_event) %>% 
  left_join(OISST_MHW_event_index, by = "row_index") %>% 
  select(region, season, event_no, ts, everything()) %>%
  arrange(region, event_no) %>% 
  mutate(Parameter2 = factor(Parameter2))

# Save
saveRDS(ALL_cor, "data/ALL_cor.Rda")
saveRDS(ALL_cor, "shiny/ALL_cor.Rda")
```

What we have now is a long dataframe containing the magnitude of change, RMSE, and correlations of different variables with SSTa. It must be pointed out that the correlations for the non-Qx terms are for the same day, there is no time lag introduced, which may be important. Below we are going to visualise the range of correlations for each variable to see how much each distribution is skewed. This skewness could probably be quantified in a meaningful way... but let's look at the data first.

```{r shiny-histo}
# source("shiny/app.R")
# Or it is live here:
# https://robert-schlegel.shinyapps.io/MHWflux/
```

There are some really clear patterns coming through in the data. In particular SSS seems to be strongly related to the onset of MHWs. There are a lot of nuances in these data and so I think this is actually an example of where a Shiny app is useful to interrogate the data.

In the shiny app it also comes out that the longer events tend not to correlate strongly with a single variable. This is to be expected and supports the argument that very persistent MHWs are supported by a confluence of variables. How to parse that out is an interesting challenge.

## Results

### 12 hour offset

While going through the data it was discovered that the OISST temperature values represent an average from midnight to midnight. This therefore puts them off set to the Qx terms, which were also being calculated over the same time period. But to compare the Qx terms to SST there should be a 12 hour offset. So 12 hours were added to the terms before re-running the pipeline and recalculating all of the correlations and RMSE. In the code chunk below we compare the stats pre-12 hour shift and afterwards.

```{r shift-compare}
# Load old and new data
ALL_cor_old <- readRDS("data/ALL_cor_old.Rda") %>% 
  mutate(region = toupper(region)) %>% 
  mutate(run = "old")
ALL_cor <- readRDS("data/ALL_cor.Rda") %>% 
  dplyr::select(region:n_Obs, rmse) %>% 
  mutate(run = "new")
ALL_cor_stack <- rbind(ALL_cor, ALL_cor_old) %>% 
  filter(rmse > 0) %>% 
  mutate(Parameter2 = as.character(Parameter2),
         Parameter2 = case_when(Parameter2 == "qnet_budget" ~ "qnet",
                                Parameter2 == "lhf_budget" ~ "lhf",
                                Parameter2 == "shf_budget" ~ "shf",
                                Parameter2 == "lwr_budget" ~ "lwr",
                                Parameter2 == "swr_budget" ~ "swr",
                                Parameter2 == "qnet_mld_cum" ~ "qnet",
                                Parameter2 == "lhf_mld_cum" ~ "lhf",
                                Parameter2 == "shf_mld_cum" ~ "shf",
                                Parameter2 == "lwr_mld_cum" ~ "lwr",
                                Parameter2 == "swr_mld_cum" ~ "swr",
                                TRUE ~ Parameter2))

# Compare RMSE quantiles
ALL_cor_stack %>% 
  filter(rmse > 0) %>% 
  group_by(run, region, ts, Parameter2) %>% 
  summarise(q10 = quantile(rmse, 0.10), 
            q50 = quantile(rmse, 0.50),
            q90 = quantile(rmse, 0.90)) %>% 
  pivot_wider(names_from = run, values_from = q10:q90)

# Plot difference
ALL_cor_stack %>% 
  filter(rmse > 0) %>% 
  ggplot(aes(x = ts, y = rmse)) +
  geom_boxplot(aes(fill = run)) +
  facet_grid(Parameter2~region) #+
  # scale_y_continuous(limits = c(0, 2))
```

The new an improved method that correctly integrates Qx into SSTa creates much lower RMSE values for almost every variable and region. A smashing success.

### Magnitude of change summary

In this sub-section we will look at how the different magnitudes of change for onset and decline of the SST_Qx terms compare to the same changes in SSTa.

```{r mag-summary}
# Load magnitude data
ALL_mag <- readRDS("data/ALL_cor.Rda") %>% 
  dplyr::select(region:ts, Parameter2, n_Obs, mag) %>% 
  na.omit()

# Calculate the proportions of change in magnitudes
ALL_mag_prop <- ALL_mag %>% 
  filter(Parameter2 != "sst") %>% 
  left_join(filter(ALL_mag, Parameter2 == "sst"), 
            by = c("region", "season", "event_no", "ts", "n_Obs")) %>% 
  filter(ts != "full") %>% 
  dplyr::select(-Parameter2.y) %>% 
  dplyr::rename(var = Parameter2.x, mag_Qx = mag.x, mag_SSTa = mag.y) %>% 
  mutate(prop = mag_Qx/mag_SSTa) %>% 
  mutate(var = as.character(var),
         var = case_when(var == "lwr_budget" ~ "Qlw",
                         var == "swr_budget" ~ "Qsw",
                         var == "lhf_budget" ~ "Qlh",
                         var == "shf_budget" ~ "Qsh",
                         var == "qnet_budget" ~ "Qnet",
                         TRUE ~ var),
         var = factor(var, levels = c("Qnet", "Qlh", "Qsh", "Qlw", "Qsw")),
         region = toupper(region))

# Find an event with good onset and decline match
ALL_mag_prop %>% filter(var == "Qnet", prop >= 1, prop <= 2)

# Create schematic figure showing how magnitudes are calculated
p1 <- fig_mag_func(2, "GSL") +
  ggtitle("Good onset, bad decline")
p2 <- fig_mag_func(44, "GM") +
  ggtitle("Bad onset, good decline")
p3 <- fig_mag_func(32, "GSL") +
  ggtitle("Good onset, good decline")
mag_schematic <- ggpubr::ggarrange(p1, p2, p3, ncol = 3, nrow = 1, common.legend = TRUE)
# mag_schematic

# Scatterplot of SSTa vs SST_Qx (facets per variable)
mag_scatter <- ALL_mag_prop %>% 
  ggplot(aes(x = mag_SSTa, y = mag_Qx)) +
  geom_point(aes(colour = n_Obs, shape = ts)) +
  geom_smooth(aes(linetype = ts), method = "lm", colour = "black") +
  scale_colour_gradientn(colours = rainbow(10)) +
  facet_wrap(~var, nrow = 1) +
  labs(x = "ΔSSTa", y = "ΔSST_Qx", shape = "Time series", linetype = "Time series") +
  theme(legend.position = "top")
# mag_scatter

# Boxplots of proportions of change for SST_Qx for onset/decline
mag_box <- ALL_mag_prop %>% 
  ggplot(aes(x = ts, y = prop)) +
  geom_hline(aes(yintercept = 0), colour = "red") +
    geom_boxplot(aes(fill = season), outlier.shape = NA) +
  # geom_boxplot(aes(fill = region), outlier.shape = NA) +
  scale_y_continuous(limits = c(-2, 2)) +
  facet_wrap(~var, nrow = 1) +
  labs(x = NULL, y = "ΔSST_Qx / ΔSSTa")
# mag_box

# Combine for one massive display
mag_full <- ggpubr::ggarrange(mag_schematic, mag_scatter, mag_box, ncol = 1, nrow = 3, labels = c("A)", "B)", "C)"))
ggsave("output/magnitude_schematic.png", height = 8, width = 10)
ggsave("output/magnitude_schematic.pdf", height = 8, width = 10)
mag_full
```

The first thing we want to look at with these magnitude scores is how many MHWs appear to be driven or decayed by Qnet. We determine this if the proportion of change in Qnet is more than half that of the change in SSTa. But also, because the Qx variables can be greater than the sum (Qnet), we will see when individual variables are contributing more than half of the change in SSTa, too.

```{r mag-prop}
# Create TRUE/FALSE table when prop is over 0.5 for any variable
ALL_mag_TF <- ALL_mag_prop %>% 
  mutate(prop_TF = ifelse(prop > 0.5, TRUE, FALSE)) %>% 
  dplyr::select(region:n_Obs, prop_TF) %>% 
  pivot_wider(names_from = var, values_from = prop_TF)

# Total proportion of MHWs forced by Qnet
sum(ALL_mag_TF$Qnet)/nrow(ALL_mag_TF)

# Total number of events driven by Qnet
nrow(filter(ALL_mag_TF, ts == "onset", Qnet == TRUE))

# Total number of events decayed by Qnet
nrow(filter(ALL_mag_TF, ts == "decline", Qnet == TRUE))

# Proportion of MHWs with Qnet inset and decline
ALL_mag_TF %>% 
  dplyr::select(region:ts, Qnet) %>% 
  pivot_wider(names_from = ts, values_from = Qnet) %>% 
  filter(onset == TRUE, decline == TRUE)

# Counts by time series phase
ALL_mag_count_ts <- ALL_mag_prop %>% 
  group_by(ts, var) %>% 
  mutate(ts_count = n(),
         group = "Total") %>% 
  filter(prop > 0.5) %>% 
  mutate(filter_count = n(),
         filter_prop = round(filter_count/ts_count, 2)) %>% 
  dplyr::select(group, ts, var, filter_prop) %>% 
  distinct() %>% 
  mutate(var = factor(var, levels = c("Qnet", "Qlh", "Qsh", "Qlw", "Qsw"))) %>% 
  pivot_wider(names_from = var, values_from = filter_prop) %>% 
  dplyr::select(group, ts, Qnet, Qlh, Qsh, Qlw, Qsw)
# knitr::kable(ALL_mag_count_ts)

# Counts by time region phase
ALL_mag_count_region <- ALL_mag_prop %>% 
  group_by(region, ts, var) %>% 
  mutate(ts_count = n()) %>% 
  filter(prop > 0.5) %>% 
  mutate(filter_count = n(),
         filter_prop = round(filter_count/ts_count, 2)) %>% 
  dplyr::select(region, ts, var, filter_prop) %>% 
  distinct() %>% 
  pivot_wider(names_from = var, values_from = filter_prop) %>% 
  dplyr::select(region, ts, Qnet, Qlh, Qsh, Qlw, Qsw) %>% 
  arrange(region, ts) %>% 
  dplyr::rename(group = region)
# knitr::kable(ALL_mag_count_region)

# Counts by time series phase
ALL_mag_count_season <- ALL_mag_prop %>% 
  group_by(season, ts, var) %>% 
  mutate(ts_count = n()) %>% 
  filter(prop > 0.5) %>% 
  mutate(filter_count = n(),
         filter_prop = round(filter_count/ts_count, 2)) %>% 
  dplyr::select(season, ts, var, filter_prop) %>% 
  distinct() %>% 
  pivot_wider(names_from = var, values_from = filter_prop) %>% 
  dplyr::select(season, ts, Qnet, Qlh, Qsh, Qlw, Qsw) %>% 
  arrange(season, ts) %>% 
  dplyr::rename(group = season)
# knitr::kable(ALL_mag_count_season)
ALL_mag_count_ts_region_season <- rbind(ALL_mag_count_ts, ALL_mag_count_region, ALL_mag_count_season) %>% 
  dplyr::select(group:Qnet) %>% 
  pivot_wider(names_from = ts, values_from = Qnet) %>% 
  dplyr::select(group, onset, decline) %>% 
  mutate(onset = paste0(onset*100,"%"),
         decline = paste0(decline*100,"%"))
knitr::kable(ALL_mag_count_ts_region_season)
```

We also want a table showing the count of best magnitude match per variable per region/season.

```{r mag-table}
# Get the top results other than Qnet
ALL_mag_top <- ALL_mag_prop %>% 
  filter(var != "Qnet") %>% # Remove Qnet
  group_by(region, season, event_no, ts) %>% 
  filter(prop == max(prop))

# The top count by region
ALL_mag_region <- ALL_mag_top %>% 
  group_by(region, ts, var) %>% 
  summarise(count = n(), .groups = "drop") %>% 
  pivot_wider(names_from = var, values_from = count) %>% 
  mutate(Qlw = replace_na(Qlw, 0)) %>% 
  left_join(mag_region_count, by = c("region", "ts")) %>% 
  dplyr::select(region, ts, count, Qlh:Qsw) %>% 
  mutate(Qlh = paste0(Qlh, " (",round((Qlh/count)*100),"%)"),
         Qsh = paste0(Qsh, " (",round((Qsh/count)*100),"%)"),
         Qlw = paste0(Qlw, " (",round((Qlw/count)*100),"%)"),
         Qsw = paste0(Qsw, " (",round((Qsw/count)*100),"%)")) %>% 
  dplyr::rename(group = region)
# knitr::kable(ALL_mag_region)

# The top count by season
ALL_mag_season <- ALL_mag_top %>% 
  group_by(season, ts, var) %>% 
  summarise(count = n(), .groups = "drop") %>% 
  pivot_wider(names_from = var, values_from = count) %>% 
  mutate(Qsh = replace_na(Qsh, 0)) %>%
  left_join(mag_season_count, by = c("season", "ts")) %>% 
  dplyr::select(season, ts, count, Qlh:Qsw) %>% 
  mutate(Qlh = paste0(Qlh, " (",round((Qlh/count)*100),"%)"),
         Qsh = paste0(Qsh, " (",round((Qsh/count)*100),"%)"),
         Qlw = paste0(Qlw, " (",round((Qlw/count)*100),"%)"),
         Qsw = paste0(Qsw, " (",round((Qsw/count)*100),"%)")) %>% 
  dplyr::rename(group = season)
# knitr::kable(ALL_mag_season)

# Print table
ALL_mag_region_season <- rbind(ALL_mag_region, ALL_mag_season)
knitr::kable(ALL_mag_region_season)
```

These tables help to reiterate how much more important Qsw is for the onset and decline of MHWs over the other variables, and that Qlh often comes in second. For the seasons though we see a much more complex relationship that I won't go into here.

### RMSE summary

Here we will go about summarising the RMSE results. We will also create some boxplots to help with the process.

```{r RMSE-boxplot}
# The base RMSE results
ALL_RMSE <- ALL_cor %>% 
  filter(rmse > 0 ) %>% 
  dplyr::rename(var = Parameter2) %>% 
  dplyr::select(region:ts, var, n_Obs, rmse) %>% 
  mutate(var = as.character(var),
         var = case_when(var == "lwr_budget" ~ "Qlw",
                         var == "swr_budget" ~ "Qsw",
                         var == "lhf_budget" ~ "Qlh",
                         var == "shf_budget" ~ "Qsh",
                         var == "qnet_budget" ~ "Qnet",
                         TRUE ~ var),
         var = factor(var, levels = c("Qnet", "Qlh", "Qsh", "Qlw", "Qsw")),
         region = toupper(region)) %>% 
  # filter out events where Qnet is not a primary drivers
  left_join(ALL_mag_TF, by = c("region", "season", "event_no", "ts", "n_Obs")) %>% 
  filter(Qnet == TRUE)

# Summaries by region
ALL_RMSE_region <- ALL_RMSE %>% 
  dplyr::select(-season, -event_no, -n_Obs) %>% 
  group_by(region, ts, var) %>% 
  summarise(rmse_min = min(rmse),
           rmse_mean = mean(rmse),
           rmse_max = max(rmse), .groups = "drop")

# Boxplot of region summaries
ALL_RMSE %>% 
  dplyr::select(-season, -event_no, -n_Obs) %>% 
  ggplot(aes(x = var, y = rmse)) +
  geom_boxplot(aes(fill = ts)) +
  facet_wrap(~region) #+
  # theme(axis.text.x = element_text(angle = 45))

# Summaries by season
ALL_RMSE_season <- ALL_RMSE %>% 
  dplyr::select(-region, -event_no, -n_Obs) %>% 
  group_by(season, ts, var) %>% 
  summarise(rmse_min = min(rmse),
           rmse_mean = mean(rmse),
           rmse_max = max(rmse), .groups = "drop")

# Boxplot of season results
ALL_RMSE %>% 
  dplyr::select(-region, -event_no, -n_Obs) %>% 
  ggplot(aes(x = var, y = rmse)) +
  geom_boxplot(aes(fill = ts)) +
  facet_wrap(~season)

# Boxplot by variable
ALL_RMSE %>% 
  filter(var != "Qnet") %>% 
  ggplot(aes(x = region, y = rmse)) +
  geom_boxplot(aes(fill = ts)) +
  facet_wrap(~var)
ALL_RMSE %>% 
    filter(var != "Qnet") %>% 
  ggplot(aes(x = season, y = rmse)) +
  geom_boxplot(aes(fill = ts)) +
  facet_wrap(~var)
```

These boxplots help to tell a high level story about the importance of the Qx terms with SSTa, but I want a slightly more fine-grain level of results. To do this I am going to count which of the Qx terms has the lowest RMSE per MHW. These will then be grouped by region and season to provide an understanding for how this may vary.

```{r RMSE-table}
# Get the top results other than Qnet
ALL_RMSE_top <- ALL_RMSE %>% 
  filter(var != "Qnet") %>% # Remove Qnet
  group_by(region, season, event_no, ts) %>% 
  filter(rmse == min(rmse))

# The top count overall
ALL_RMSE_ts <- ALL_RMSE_top %>%
  group_by(ts) %>%
  mutate(total_count = n(),
         group = "Total") %>% 
  group_by(group, ts, total_count, var) %>% 
  summarise(count = n(), .groups = "drop") %>% 
  pivot_wider(names_from = var, values_from = count) %>% 
  replace_na(list(Qlh = 0, Qsh = 0, Qlw = 0, Qsw = 0)) %>% 
  dplyr::select(group, ts, total_count, Qlh:Qsw) %>% 
  filter(ts != "full") %>% 
  mutate(Qlh = paste0(Qlh, " (",round((Qlh/total_count)*100),"%)"),
         Qsh = paste0(Qsh, " (",round((Qsh/total_count)*100),"%)"),
         Qlw = paste0(Qlw, " (",round((Qlw/total_count)*100),"%)"),
         Qsw = paste0(Qsw, " (",round((Qsw/total_count)*100),"%)"))

# The top count by region
ALL_RMSE_region <- ALL_RMSE_top %>%
  group_by(region, ts) %>%
  mutate(total_count = n()) %>% 
  group_by(region, ts, total_count, var) %>% 
  summarise(count = n(), .groups = "drop") %>% 
  pivot_wider(names_from = var, values_from = count) %>% 
  replace_na(list(Qlh = 0, Qsh = 0, Qlw = 0, Qsw = 0)) %>% 
  dplyr::select(region, ts, total_count, Qlh:Qlw) %>% 
  filter(ts != "full") %>% 
  mutate(Qlh = paste0(Qlh, " (",round((Qlh/total_count)*100),"%)"),
         Qsh = paste0(Qsh, " (",round((Qsh/total_count)*100),"%)"),
         Qlw = paste0(Qlw, " (",round((Qlw/total_count)*100),"%)"),
         Qsw = paste0(Qsw, " (",round((Qsw/total_count)*100),"%)")) %>% 
  dplyr::rename(group = region)
# knitr::kable(ALL_RMSE_region)

# The top count by season
ALL_RMSE_season <- ALL_RMSE_top %>% 
  group_by(season, ts) %>%
  mutate(total_count = n()) %>% 
  group_by(season, ts, total_count, var) %>% 
  summarise(count = n(), .groups = "drop") %>% 
  pivot_wider(names_from = var, values_from = count) %>% 
  replace_na(list(Qlh = 0, Qsh = 0, Qlw = 0, Qsw = 0)) %>% 
  dplyr::select(season, ts, total_count, Qlh:Qlw) %>% 
  filter(ts != "full") %>% 
  mutate(Qlh = paste0(Qlh, " (",round((Qlh/total_count)*100),"%)"),
         Qsh = paste0(Qsh, " (",round((Qsh/total_count)*100),"%)"),
         Qlw = paste0(Qlw, " (",round((Qlw/total_count)*100),"%)"),
         Qsw = paste0(Qsw, " (",round((Qsw/total_count)*100),"%)")) %>% 
  dplyr::rename(group = season)
# knitr::kable(ALL_RMSE_season)

# Print table
ALL_RMSE_ts_region_season <- rbind(ALL_RMSE_ts, ALL_RMSE_region, ALL_RMSE_season) %>% 
  dplyr::rename(count = total_count)
knitr::kable(ALL_RMSE_ts_region_season)
```

I'm pretty happy with how these results are displayed. But I think it would be better to show the regions and seasons in separate tables just to keep them more manageable. In the last chunk in this sub-section we will grab some specific summary stats that are used in the publication.

```{r RMSE-stats}
# Short event best fits
ALL_RMSE_short <- ALL_RMSE %>% 
  filter(n_Obs <= 14, var != "Qnet") %>% 
  group_by(region, season, event_no, ts) %>% 
  filter(rmse == min(rmse))

# Short event index
ALL_RMSE_long <- ALL_RMSE %>% 
  filter(n_Obs > 14, var != "Qnet") %>% 
  group_by(region, season, event_no, ts) %>% 
  filter(rmse == min(rmse))

# The top Qx for events of 14 days or less by region
ALL_RMSE_short_region <- ALL_RMSE_short %>%
  group_by(region, ts) %>% 
  mutate(count = n()) %>% 
  group_by(region, ts, count, var) %>% 
  summarise(count_Q = n(), .groups = "drop") %>% 
  pivot_wider(names_from = var, values_from = count_Q) %>% 
  replace_na(list(Qlh = 0, Qsh = 0, Qlw = 0, Qsw = 0)) %>% 
  dplyr::select(region, ts, count, Qlh:Qlw) %>% 
  mutate(Qlh = paste0(Qlh, " (",round((Qlh/count)*100),"%)"),
         Qsh = paste0(Qsh, " (",round((Qsh/count)*100),"%)"),
         Qlw = paste0(Qlw, " (",round((Qlw/count)*100),"%)"),
         Qsw = paste0(Qsw, " (",round((Qsw/count)*100),"%)"))
knitr::kable(ALL_RMSE_short_region)

# The top Qx for events of more than 14 days by region
ALL_RMSE_long_region <- ALL_RMSE_long %>%
  group_by(region, ts) %>% 
  mutate(count = n()) %>% 
  group_by(region, ts, count, var) %>% 
  summarise(count_Q = n(), .groups = "drop") %>% 
  pivot_wider(names_from = var, values_from = count_Q) %>% 
  replace_na(list(Qlh = 0, Qsh = 0, Qlw = 0, Qsw = 0)) %>% 
  dplyr::select(region, ts, count, Qlh:Qlw) %>% 
  mutate(Qlh = paste0(Qlh, " (",round((Qlh/count)*100),"%)"),
         Qsh = paste0(Qsh, " (",round((Qsh/count)*100),"%)"),
         Qlw = paste0(Qlw, " (",round((Qlw/count)*100),"%)"),
         Qsw = paste0(Qsw, " (",round((Qsw/count)*100),"%)"))
knitr::kable(ALL_RMSE_long_region)

# The top Qx for events of 14 days or less by season
ALL_RMSE_short_season <- ALL_RMSE_short %>%
  group_by(season, ts) %>% 
  mutate(count = n()) %>% 
  group_by(season, ts, count, var) %>% 
  summarise(count_Q = n(), .groups = "drop") %>% 
  pivot_wider(names_from = var, values_from = count_Q) %>% 
  replace_na(list(Qlh = 0, Qsh = 0, Qlw = 0, Qsw = 0)) %>% 
  dplyr::select(season, ts, count, Qlh:Qlw) %>% 
  mutate(Qlh = paste0(Qlh, " (",round((Qlh/count)*100),"%)"),
         Qsh = paste0(Qsh, " (",round((Qsh/count)*100),"%)"),
         Qlw = paste0(Qlw, " (",round((Qlw/count)*100),"%)"),
         Qsw = paste0(Qsw, " (",round((Qsw/count)*100),"%)"))
knitr::kable(ALL_RMSE_short_season)

# The top Qx for events of more than 14 days by season
ALL_RMSE_long_season <- ALL_RMSE_long %>%
  group_by(season, ts) %>% 
  mutate(count = n()) %>% 
  group_by(season, ts, count, var) %>% 
  summarise(count_Q = n(), .groups = "drop") %>% 
  pivot_wider(names_from = var, values_from = count_Q) %>% 
  replace_na(list(Qlh = 0, Qsh = 0, Qlw = 0, Qsw = 0)) %>% 
  dplyr::select(season, ts, count, Qlh:Qlw) %>% 
  mutate(Qlh = paste0(Qlh, " (",round((Qlh/count)*100),"%)"),
         Qsh = paste0(Qsh, " (",round((Qsh/count)*100),"%)"),
         Qlw = paste0(Qlw, " (",round((Qlw/count)*100),"%)"),
         Qsw = paste0(Qsw, " (",round((Qsw/count)*100),"%)"))
knitr::kable(ALL_RMSE_long_season)
```

We also want to see how RMSE scores change the shorter the time series portion of the MHW is.

```{r RMSE-short-change}
# Faceted scatter plot showing how RMSE changes based on ts length
ggplot(ALL_RMSE, aes(x = n_Obs, y = rmse)) +
  geom_point(aes(colour = ts)) +
  geom_smooth(aes(colour = ts), method = "lm") +
  facet_wrap(~region)

# Faceted scatter plot showing how RMSE changes based on ts length
ggplot(ALL_RMSE, aes(x = n_Obs, y = rmse)) +
  geom_point(aes(colour = ts)) +
  geom_smooth(aes(colour = ts), method = "lm") +
  facet_wrap(~season)
```


### Correlations: summary, regions, seasons

With the correlations calculated for the onset, decline, and full extent of each MHW, we want to know if any signals emerge from the regions and/or seasons of occurrence of these events. Is the relationship between SSS and MHW onset stronger in the winter? Stronger in certain region? Having manually looked through the Shiny app it does look like there are some patterns. Below is the code used to determine the stats referred to in the paper.

```{r}
# Overall correlation results
cor_summary_all <- ALL_cor %>% 
  filter(Parameter1 == "sst",
         Parameter2 != "sst") %>% 
  group_by(ts, Parameter2) %>% 
  summarise(q10_r = quantile(r, 0.1),
            mean_r = mean(r, na.rm = T),
            q90_r = quantile(r, 0.9),
            mean_rmse = mean(rmse, na.rm = T))

# Overall positive only results
cor_positive_all <- ALL_cor %>% 
  filter(Parameter1 == "sst",
         Parameter2 != "sst",
         r > 0) %>% 
  group_by(ts, Parameter2) %>% 
  summarise(q10_r = quantile(r, 0.1),
            mean_r = mean(r, na.rm = T),
            q90_r = quantile(r, 0.9),
            mean_rmse = mean(rmse, na.rm = T))

# Overall negative only results
cor_negative_all <- ALL_cor %>% 
  filter(Parameter1 == "sst",
         Parameter2 != "sst",
         r < 0) %>% 
  group_by(ts, Parameter2) %>% 
  summarise(q10_r = quantile(r, 0.1),
            mean_r = mean(r, na.rm = T),
            q90_r = quantile(r, 0.9),
            mean_rmse = mean(rmse, na.rm = T))

# region correlation results
cor_summary_region <- ALL_cor %>% 
  filter(Parameter1 == "sst",
         Parameter2 != "sst") %>% 
  group_by(ts, region, Parameter2) %>% 
  summarise(q10_r = quantile(r, 0.1),
            mean_r = mean(r, na.rm = T),
            q90_r = quantile(r, 0.9),
            mean_rmse = mean(rmse, na.rm = T))

# Look at specific results
cor_summary_region %>% 
  filter(Parameter2 == "mld_cum")

# region correlation results
cor_summary_season <- ALL_cor %>% 
  filter(Parameter1 == "sst",
         Parameter2 != "sst") %>% 
  group_by(ts, season, Parameter2) %>% 
  summarise(q10_r = quantile(r, 0.1),
            mean_r = mean(r, na.rm = T),
            q90_r = quantile(r, 0.9),
            mean_rmse = mean(rmse, na.rm = T))

# Look at specific results
cor_summary_season %>% 
  filter(Parameter2 == "mld")

# Test for normality of r distributions

```

### Lead or lag of heat flux

Also could look into lead or lag of Qnet etc.

### Variability in Qx during regins and seasons

See if there is more variance in Qx during certain seasons or regions.

### Relationships

With patterns pulled out by region and season, we want to see if there are any relationships between MHWs that show strong correlations at onset/decline with a particular QX term and strong correlations at onset/decline with an air/sea variable. We will look for this within regions and seasons as well. For example, do MHWs that correlate well with an increase in SSS also correlate well with a decrease in long-wave radiation during the decline of the event? I'm not sure how best to go about this in a clean manner.

Another thing to consider would be if fast onset slow decline (and vice versa) events have different characteristics to slower evolving events. The same question could be posed to long vs short events and those with high intensities vs low. In order to begin this investigation we must join the MHW results to the correlation results. We will visualise these patterns with heatmaps.

```{r metrics_cor}
# Strong positive correlation at onset and strong positive and decline and vice versa
ALL_cor_wide <- readRDS("data/ALL_cor.Rda") %>% 
  dplyr::rename(var = Parameter2) %>% 
  ungroup() %>% 
  filter(Parameter1 == "sst") %>% 
  dplyr::select(region:ts, var, r, n_Obs) %>% 
  mutate(var = as.character(var)) %>% 
  mutate(var = case_when(var == "sst" ~ "SST",
                         var == "bottomT" ~ "SBT",
                         var == "sss" ~ "SSS",
                         var == "mld_cum" ~ "MLD_c",
                         var == "mld_1_cum" ~ "MLD_1_c",
                         var == "t2m" ~ "Air_temp",
                         var == "tcc_cum" ~ "Cloud_cover_c",
                         var == "p_e_cum" ~ "Precip_Evap_c",
                         var == "mslp_cum" ~ "MSLP_c",
                         var == "lwr_budget" ~ "Qlw",
                         var == "swr_budget" ~ "Qsw",
                         var == "lhf_budget" ~ "Qlh",
                         var == "shf_budget" ~ "Qsh",
                         var == "qnet_budget" ~ "Qnet",
                         TRUE ~ var)) %>% 
  filter(var %in% c("SST", "SSS", "SBT", "MLD_c", "MLD_1_c", "Air_temp", "Cloud_cover_c",
                           "Precip_Evap_c", "MSLP_c", "Qlw", "Qsw", "Qlh", "Qsh", "Qnet")) %>% 
  pivot_wider(values_from = r, names_from = var)

# Combine MHW metrics and correlation results
events_cor_prep <- OISST_MHW_event %>% 
  dplyr::select(region, season, event_no, duration, intensity_mean, intensity_max, 
                intensity_cumulative, rate_onset, rate_decline) %>% 
  left_join(ALL_cor_wide, by = c("region", "season", "event_no"))

# Heatmap showing average correlations by MHW duration
events_cor_prep %>% 
  mutate(duration = plyr::round_any(duration, 10)) %>% 
  group_by(ts, duration) %>% 
  mutate(count = n()) %>% 
  summarise_if(is.numeric, mean) %>% 
  pivot_longer(cols = Air_temp:Qsw) %>% 
  filter(name != "sst",
         ts != "full",
         duration <= 50) %>% # Only 1 event is longer than this 
  ggplot(aes(x = duration, y = name)) +
  geom_tile(aes(fill = value)) +
  geom_text(aes(label = count)) +
  facet_wrap(~ts) +
  scale_fill_gradient2(low = "blue", high = "red") +
  coord_cartesian(expand = F) +
  labs(y = NULL, x = "Duration (10 day steps)", fill = "r (mean)") +
  theme(legend.position = "bottom")

# Heatmap showing average correlations by MHW max intensity
events_cor_prep %>% 
  mutate(intensity_max = plyr::round_any(intensity_max, 0.5)) %>% 
  group_by(ts, intensity_max) %>%
  mutate(count = n()) %>% 
  summarise_if(is.numeric, mean) %>% 
  pivot_longer(cols = Air_temp:Qsw) %>% 
  filter(name != "sst",
         ts != "full") %>% 
  ggplot(aes(x = intensity_max, y = name)) +
  geom_tile(aes(fill = value)) +
  geom_text(aes(label = count)) +
  facet_wrap(~ts) +
  scale_fill_gradient2(low = "blue", high = "red") +
  coord_cartesian(expand = F) +
  labs(y = NULL, x = "Max Intensity (°C; 0.5° steps)", fill = "r (mean)") +
  theme(legend.position = "bottom")

# Heatmap showing average correlations by MHW rate onset
events_cor_prep %>% 
  mutate(rate_onset = round(rate_onset, 1)) %>% 
  group_by(ts, rate_onset) %>% 
  mutate(count = n()) %>% 
  summarise_if(is.numeric, mean) %>% 
  pivot_longer(cols = Air_temp:Qsw) %>% 
  filter(name != "sst",
         ts != "full",
         rate_onset <= 0.75) %>% # Only two is faster than this 
  ggplot(aes(x = rate_onset, y = name)) +
  geom_tile(aes(fill = value)) +
  geom_text(aes(label = count)) +
  facet_wrap(~ts) +
  scale_fill_gradient2(low = "blue", high = "red") +
  coord_cartesian(expand = F) +
  labs(y = NULL, x = "Rate of onset (°C; 0.1° steps)", fill = "r (mean)") +
  theme(legend.position = "bottom")

# Heatmap showing average correlations by MHW rate decline
events_cor_prep %>% 
  mutate(rate_decline = round(rate_decline, 1)) %>% 
  filter(rate_decline <= 0.6) %>% # nip off a couple of outliers
  group_by(ts, rate_decline) %>% 
  mutate(count = n()) %>% 
  summarise_if(is.numeric, mean) %>% 
  pivot_longer(cols = Air_temp:Qsw) %>% 
  filter(name != "sst",
         ts != "full") %>% 
  ggplot(aes(x = rate_decline, y = name)) +
  geom_tile(aes(fill = value)) +
  geom_text(aes(label = count)) +
  facet_wrap(~ts) +
  scale_fill_gradient2(low = "blue", high = "red") +
  coord_cartesian(expand = F) +
  labs(y = NULL, x = "Rate of decline (°C; 0.1° steps)", fill = "r (mean)") +
  theme(legend.position = "bottom")
```

### Linearity of events

Another observation I made while going through the results in the app was how the onset and decline of events tend to have a better RMSE when the SSTa trend is more linear. My hypothesis is that this is because there is less advective pressure on the dispersion of the heat anomaly, therefore more of the heatflux is responsible for the temperature anomaly. THis is a pretty straight forward statement, but I think it is useful in that one may be able to take linearity of SSTa as a proxy for the proportion of heat flux that is responsible for it. So in the chunk below I use simple linear models to create an R2 value for each onset and decline portion of an event. Then I find how close to 1.0 that R2 value is (meaning more linear SSTa) and I see how RMSE changes as R2 improves. I hypothesised that more linear SSTa onset will provide better (lower) RMSE values. The same is likely true for decline but I'm not certain.

```{r linearity}
# R2 for RMSE vs R2
ALL_cor_R2 <- readRDS("data/ALL_cor.Rda") %>% 
  ungroup() %>% 
  filter(Parameter1 == "sst", rmse > 0) %>% 
  nest_by(region, season, ts) %>% 
  summarise(broom::glance(lm(rmse ~ sst_R2, data = data)))

# Scatterplots
readRDS("data/ALL_cor.Rda") %>% 
  ungroup() %>% 
  filter(Parameter1 == "sst", rmse > 0) %>% 
  ggplot(aes(x = sst_R2, y = rmse)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_grid(region ~ Parameter2)

# Boxplots
readRDS("data/ALL_cor.Rda") %>% 
  ungroup() %>% 
  filter(Parameter1 == "sst", rmse > 0) %>% 
  mutate(sst_R2_cut = cut(sst_R2, breaks = c(0, 0.25, 0.5, 0.75,1.0))) %>% 
  na.omit() %>% 
  ggplot(aes(x = sst_R2_cut, y = rmse)) +
  geom_boxplot() +
  facet_grid(region ~ Parameter2)
```

Nope. If there is any relationship it is incredibly noisey. I'll not be pursuing this further for this project, but I'm not entirely dissuaded from pursuing this avenue of research in the future with a deeper learning approach.

### Choice events

There are a lot of results to wade through and though it is clear there are important signals in the results, it is proving difficult to distill them. One thought is that we don't need to look at all of the events, just the longest/most intense events with strong r values. This is first done by cutting out all Cat. I events. We then find strong correlations with long events. There should just be a few.

Once this has been done we group events by their strongest Qx relationship. Then find their strongest relationship with the next level of variables (e.g. MLD, MSLP, and so on). Ideally one may find the top four flavours.

```{r choice-events}
# Filter out smol events
events_cor_cat <- events_cor_prep %>% 
  left_join(OISST_MHW_cats[,c("region", "event_no", "category")], by = c("region", "event_no")) %>% 
  filter(category != "I Moderate", duration >= 21)

# Events with high Qlw correlations at onset
events_cor_cat %>% 
  filter(ts == "onset", Qlw >= 0.7)

# Melt the data frame and find the q term with the highest correlation
# Those are then used to separate events into groups
events_q_onset <- events_cor_cat %>% 
  filter(ts == "onset") %>% 
  dplyr::select(region:event_no, ts, Qnet:Qsh) %>% 
  pivot_longer(cols = Qnet:Qsh, names_to = "var", values_to = "val") %>% 
  group_by(region, season, event_no) %>% 
  filter(abs(val) == max(abs(val)))

# With this guide one may then parse out the q groups
events_lhf_p_onset <- events_q_onset %>% 
  filter(val >= 0, var == "Qlh") %>% 
  left_join(events_cor_cat) %>% 
  ungroup() %>% 
  # unite(region, season, event_no, sep = "_")
  mutate(event_no = as.factor(event_no)) %>% 
  dplyr::select(region:ts, Air_temp:Qsw) %>% 
  pivot_longer(Air_temp:Qsw) %>% 
  ggplot(aes(x = event_no, y = name)) +
  geom_tile(aes(fill = value)) +
  # facet_wrap(~name, scales = "free") +
  scale_fill_gradient2(low = "blue", high = "red") +
  coord_cartesian(expand = F) +
  labs(y = NULL, fill = "r (mean)") +
  theme(legend.position = "bottom")
events_lhf_p_onset
```

### Table

In the following table a more concise summary of the results is presented.

```{r, echo=FALSE, message=FALSE}
# NB: This table was created manually by going through the Shiny app one variable at a time.
res_table <- read_csv("data/res_table.csv")
knitr::kable(res_table, caption = "Most of the variables that have been correlated against the temperature anomalies during the onset, decline, and full duration of MHWs. The cumulative heat flux terms were corrected for by the daily MLD (Q/(rho x Cp x hmld)) before the correlations were calculated. Correlations were also run on the cumulative flux terms without correcting for MLD, but there was little difference so the results are not itemised here. This table shows the full names of the variables, as well as the abbreviations used in the code. The 'onset' column describes (in shorthand) what the tendency of correlations for the MHWs is during the onset of events. This is repeated for the 'full' and 'decline' columns respectively. The 'season' column briefly states the most clear/noteworthy pattern(s) when looking at how the correlations are divided up by season. The same is done in the 'region' column. The last column, 'story', gives a TRUE/FALSE if I think the variable has a story to tell. Something worth pursuing further. Particularly to see if the variables relate strongly to other variables, not just temperature. This then could provide a framework for determining 'types' of MHWs (e.g. strong SSS change with strong latent heat flux).")
```

With a table organised by each variable, it makes sense to also create a table organised by season, and another by region.

```{r}

```

### NWA 2012

From Chen et al. 2016 (JGR)
Such an extreme event in the MAB was attributed to the anomalous atmospheric forcing, which was linked to the northward shift in the jet stream position [Chen et al., 2014a, 2015]. The anomalously warm atmospheric conditions in the winter of 2011–2012 increased the ocean heat content (increased the ocean heat content anomaly) and facilitated the extreme warm ocean temperature in spring 2012 [Chen et al., 2014a, 2015]. On the other hand, the ocean advection played a secondary role, which partially damped the heat content anomaly created by the air-sea heat flux [Chen et al., 2015].
In both cases, initial temperature and ocean advection are not sufficient to describe the seasonal mean temperature. Additional cooling (warming) in addition to ocean advection is needed to further describe the winter (spring) temperature. In comparison, using the sum of the initial temperature and air-sea flux yields a much better description of seasonal mean temperatures (Figures 5c and 5f)
While the overall role of ocean advection is smaller than that of air-sea flux in determining the winter and spring temperatures, the year-to-year changes in the relative importance is worth investigating.
Normally, given anomalous initial temperature, air will act to damp the temperature anomaly, as in winter 2007 or 2011, or even 2005 to some extent. However, in winter 2012, the air continued to increase the temperature anomaly.
Out of the 12 years 2003–2014, the air-sea flux normally dominated the temperature anomaly in the MAB during winter. In only 3 years was the winter time temperature anomaly primarily controlled by ocean advection.
For spring, ocean advection has more control on the temperature anomalies than air-sea flux does, although the difference is smaller (Table 2). In both seasons, the relative importance of air-sea flux and ocean advection does not seem to be related to either the initial or seasonal mean thermal condition of the shelf water (fourth and fifth columns of Tables 1 and 2).
The correlation coefficients increase from 0.66 in the first half of February to 0.91 in the second half of March. This suggests that estimation of spring temperature anomaly in the MAB  based  on  the  thermal  condition  2 months before spring is statistically possible.
This suggests that more northerly jet stream positions result in larger heatflux from the atmosphere into the ocean in the MAB. This is likely due to warmer and more humid air overlying the continental shelf, which reduces the heat loss from the ocean during the cooling seasons [Chenet al., 2014a].
In spring and summer, the air-sea flux may be less correlated with the air temperature due to the shallowness of the surface mixed layer, and thus may be disconnected from large-scale atmospheric circulation, i.e., jet stream variability.

## References