<a href="https://colab.research.google.com/github/samsoe/mpg_notebooks/blob/master/yvp_species_richness_WRANGLE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Documentation

[Readme fixed plot vegetation data](https://docs.google.com/document/d/16-Aq8u9Rudd78fSzfjvpCXyQgE-BstC-d2PjYfmLtcw/edit?usp=sharing)

# Security

* The user must load a `json` file containing the BigQuery API key into the local directory `/content/...`
* The user must have a Google Maps API key to enable mapping.
   * CAUTION: make sure the key is deleted from the current instance of the notebook before sharing.
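
A missing key file is easier to diagnose if it is caught before authentication is attempted. The following is a minimal sketch of such a check; `check_key_file` is a hypothetical helper and is not part of `bigrquery`.

```r
# Hypothetical helper: stop early with a clear message if the key file is missing
check_key_file <- function(path) {
  if (!file.exists(path)) {
    stop("Key file not found: ", path, call. = FALSE)
  }
  # Return the path invisibly so the call can be nested inside bq_auth()
  invisible(path)
}

# Possible usage, with the key path used later in this notebook:
# bq_auth(path = check_key_file("/content/mpg-data-warehouse-api_key-master.json"))
```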

# Tools

In [None]:
library(tidyverse)

* Remember that the file containing authorization keys for BigQuery must be loaded into the virtual environment manually.

In [None]:
install.packages("bigrquery")
library(bigrquery)

# Source

## Database Connection

In [3]:
# BigQuery API Key
bq_auth(path = "/content/mpg-data-warehouse-api_key-master.json")

In [4]:
Sys.setenv(BIGQUERY_TEST_PROJECT = "mpg-data-warehouse")

In [5]:
billing <- bq_test_project()

### vegetation_point_intercept_gridVeg

In this view of the yvp data, species from the cover-based and additional species summaries will be vertically combined for each grid point. Since the additional species summary records species presence only, this view will be limited entirely to species presence. The result will be the plant species richness for each grid point, which is useful for comparing plant communities, finding the locations of rarer species, or identifying grid points where non-native species are just getting established. After these data are processed, we will want to retain knowledge of whether a species was detected during cover-based or additional species surveys so that we can evaluate the potential rarity of a given species. The new variable detection_type will allow us to do this.
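
As a rough illustration of the vertical combination described above, the sketch below stacks two small tables and counts distinct species per survey. The data frames `cover` and `additional` and all of their values are made up for the example; only the column names `survey_code`, `species_key`, and `detection_type` and the two detection labels come from this notebook.

```r
library(dplyr)

# Hypothetical toy data standing in for the two species summaries
cover <- tibble(
  survey_code = c("YVP 1 2018-07-01", "YVP 1 2018-07-01", "YVP 2 2018-07-02"),
  species_key = c(16L, 72L, 16L)
)
additional <- tibble(
  survey_code = c("YVP 1 2018-07-01", "YVP 2 2018-07-02"),
  species_key = c(163L, 179L)
)

# Tag each source with its detection type, then stack the rows
combined <- bind_rows(
  cover %>% mutate(detection_type = "cover_est"),
  additional %>% mutate(detection_type = "supplemental_obs")
)

# Species richness: number of distinct species per survey
richness <- combined %>%
  group_by(survey_code) %>%
  summarise(richness = n_distinct(species_key), .groups = "drop")
```

Counting with `n_distinct()` keeps the richness estimate correct even if a species appears more than once within a survey.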


In [16]:
# Cover-based half of the species list; every row is tagged detection_type = "cover_est"
sql_vegetation_cover <-
"
SELECT
  CONCAT(plot_code, \" \", date) AS survey_code,
  plot_code,
  SUBSTR(SAFE_CAST(date AS STRING), 1, 4) AS year,
  plot_loc,
  plot_rep,
  plot_num,
  (\"cover_est\") AS detection_type,
  species_key
FROM
  `mpg-data-warehouse.vegetation_fixed_plot_yvp.yvp_vegetation_cover`
"


In [17]:
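# NOTE: `sql_yvp` is not defined in the cells shown here. The output below includes
# "supplemental_obs" rows, so it is presumably the cover query above combined with
# the additional species query.
bq_yvp <- bq_project_query(billing, sql_yvp)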

In [18]:
tb_yvp <- bq_table_download(bq_yvp)

In [19]:
df_yvp <- as.data.frame(tb_yvp)

In [64]:
df_yvp %>% glimpse() 

Rows: 22,962
Columns: 8
$ survey_code    <chr> "YVP 12 2018-07-10", "YVP 12 2018-07-10", "YVP 12 2018…
$ plot_code      <chr> "YVP 12", "YVP 12", "YVP 12", "YVP 12", "YVP 12", "YVP…
$ year           <chr> "2018", "2018", "2018", "2018", "2018", "2018", "2018"…
$ plot_loc       <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ plot_rep       <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ plot_num       <int> 12, 12, 12, 12, 12, 246, 246, 246, 246, 571, 571, 571,…
$ detection_type <chr> "supplemental_obs", "supplemental_obs", "supplemental_…
$ species_key    <int> 72, 179, 426, 63, 484, 402, 53, 274, 334, 496, 240, 56…


# Wrangle

With the data from yvp_vegetation_cover, species lists must first be summarized as distinct key_plant_species values within survey_code values. This is because the raw data are estimated in 10 subplots per transect, and species names will often be redundant among subplots. Then the yvp_vegetation_cover data can be vertically bound to the yvp_additional_species data after some light coercion of field names.
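
The subplot redundancy described above can be collapsed with `dplyr::distinct()`. This is a minimal sketch on made-up rows; `cover_raw` and `cover_species` are hypothetical names, and the real data carry more columns than shown here.

```r
library(dplyr)

# Hypothetical subplot-level rows: species 232 was recorded in three subplots
cover_raw <- tibble(
  survey_code    = rep("YVP 10 2017-06-09", 4),
  detection_type = "cover_est",
  species_key    = c(232L, 232L, 232L, 37L)
)

# Keep one row per species per survey
cover_species <- cover_raw %>%
  distinct(survey_code, detection_type, species_key)
```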

One caution with these data: according to protocol, a plant species is only included in the additional species table if it was not found during cover-based surveys. In practice, I assume that this is routinely violated because it isn't easy to remember all the species surveyed, nor is it efficient to check time after time. It's important that we eliminate duplicate species for a given grid point. When duplication exists, default to detection_type = "cover_est". This will prevent upward bias of richness estimates and will make downstream analyses less complicated. Some operation that again summarizes distinct key_plant_species values within survey_code values will be necessary. For additional information on this point, please see instructions for a similar operation with the point-intercept data in the gridVeg [Readme](https://docs.google.com/document/d/1JWnhxNjeSQZkSnGhtHP68i_l1mDj4vPFMBdUvGqN0TA/edit#heading=h.hnb7ex8jlp42).
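
One possible way to apply the "default to cover_est" rule is to sort so that cover-based rows come first within each survey and species, then keep the first row of each pair. The sketch below uses made-up rows, and `combined` and `resolved` are hypothetical names; it is one approach among several, not necessarily the one used downstream.

```r
library(dplyr)

# Hypothetical rows: species 16 was detected by both methods in the same survey
combined <- tibble(
  survey_code    = rep("YVP 10 2018-07-12", 3),
  species_key    = c(16L, 16L, 433L),
  detection_type = c("supplemental_obs", "cover_est", "supplemental_obs")
)

resolved <- combined %>%
  # FALSE sorts before TRUE, so "cover_est" rows come first within each pair
  arrange(survey_code, species_key, detection_type != "cover_est") %>%
  # Keep the first row of each survey/species pair
  distinct(survey_code, species_key, .keep_all = TRUE)
```

Species detected only in the additional species survey keep their `supplemental_obs` label, so the information needed to judge rarity is preserved.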


In [25]:
df_yvp %>% glimpse()

Rows: 22,962
Columns: 8
$ survey_code    <chr> "YVP 12 2018-07-10", "YVP 12 2018-07-10", "YVP 12 2018…
$ plot_code      <chr> "YVP 12", "YVP 12", "YVP 12", "YVP 12", "YVP 12", "YVP…
$ year           <chr> "2018", "2018", "2018", "2018", "2018", "2018", "2018"…
$ plot_loc       <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ plot_rep       <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ plot_num       <int> 12, 12, 12, 12, 12, 246, 246, 246, 246, 571, 571, 571,…
$ detection_type <chr> "supplemental_obs", "supplemental_obs", "supplemental_…
$ species_key    <int> 72, 179, 426, 63, 484, 402, 53, 274, 334, 496, 240, 56…


In [63]:
df_yvp %>%
  filter(survey_code == "YVP 10 2017-06-09", species_key == 232)

survey_code,plot_code,year,plot_loc,plot_rep,plot_num,detection_type,species_key
<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>
YVP 10 2017-06-09,YVP 10,2017,,,10,cover_est,232
YVP 10 2017-06-09,YVP 10,2017,,,10,cover_est,232
YVP 10 2017-06-09,YVP 10,2017,,,10,cover_est,232
YVP 10 2017-06-09,YVP 10,2017,,,10,cover_est,232
YVP 10 2017-06-09,YVP 10,2017,,,10,cover_est,232
YVP 10 2017-06-09,YVP 10,2017,,,10,cover_est,232
YVP 10 2017-06-09,YVP 10,2017,,,10,cover_est,232
YVP 10 2017-06-09,YVP 10,2017,,,10,cover_est,232
YVP 10 2017-06-09,YVP 10,2017,,,10,cover_est,232
YVP 10 2017-06-09,YVP 10,2017,,,10,cover_est,232


In [62]:
# Count rows per species within each survey
df_yvp %>%
  group_by(survey_code, species_key) %>%
  count(species_key)

survey_code,species_key,n
<chr>,<int>,<int>
YVP 10 2017-06-09,5,1
YVP 10 2017-06-09,37,2
YVP 10 2017-06-09,39,2
YVP 10 2017-06-09,51,1
YVP 10 2017-06-09,72,1
YVP 10 2017-06-09,82,1
YVP 10 2017-06-09,84,1
YVP 10 2017-06-09,90,2
YVP 10 2017-06-09,153,2
YVP 10 2017-06-09,163,2


In [57]:
df_yvp %>%
  group_by(survey_code, species_key) %>%
  distinct(detection_type) %>%
  count(species_key) %>%
  filter(n > 1) %>%
  arrange(desc(n), survey_code, species_key) %>%
  select(survey_code, species_key) %>%
  head()

survey_code,species_key
<chr>,<int>
YVP 10 2018-07-12,16
YVP 10 2018-07-12,163
YVP 10 2018-07-12,169
YVP 10 2018-07-12,433
YVP 10 2019-07-02,16
YVP 10 2019-07-02,163


In [36]:
df_yvp %>%
  filter(survey_code == "YVP 10 2018-07-12" &
         species_key == 16)

survey_code,plot_code,year,plot_loc,plot_rep,plot_num,detection_type,species_key
<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>
YVP 10 2018-07-12,YVP 10,2018,,,10,supplemental_obs,16
YVP 10 2018-07-12,YVP 10,2018,,,10,cover_est,16


In [51]:
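# Survey/species pairs recorded under more than one detection_type.
# No head() here, so that all duplicated pairs are kept, not only the first six.
duplicates <- df_yvp %>%
  group_by(survey_code, species_key) %>%
  distinct(detection_type) %>%
  count(species_key) %>%
  filter(n > 1) %>%
  ungroup() %>%
  arrange(desc(n), survey_code, species_key) %>%
  select(survey_code, species_key)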

In [53]:
vars <- c("survey_code", "species_key")

# Show the rows of df_yvp whose survey/species pair appears in `duplicates`
df_yvp %>%
  semi_join(duplicates, by = vars)