<a href="https://colab.research.google.com/github/samsoe/mpg_notebooks/blob/master/yvp_species_richness_WRANGLE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Documentation

[Readme fixed plot vegetation data](https://docs.google.com/document/d/16-Aq8u9Rudd78fSzfjvpCXyQgE-BstC-d2PjYfmLtcw/edit?usp=sharing)

# Security

* The user must load a `json` file containing the BigQuery API key into the local directory `/content/...`
* The user must have a Google Maps API key to enable mapping. 
   * CAUTION: make sure the key is deleted from the current instance of the notebook before sharing.

# Tools

In [65]:
library(tidyverse)

* Remember that the file containing authorization keys for BigQuery must be loaded into the virtual environment manually.

In [66]:
install.packages("bigrquery")
library(bigrquery)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



# Source

In this view of the YVP data, species from the cover-based and additional species summaries will be vertically combined for each grid point. Since the additional species summary records species presence only, this view will be limited entirely to species presence. The result will be the plant species richness for each grid point, which is useful for comparing plant communities, finding the locations of rarer species, or identifying grid points where non-native species are just getting established. After these data are processed, we will want to retain knowledge of whether a species was detected during point-intercept or additional species surveys so that we can evaluate the potential rarity of a given species. The new variable detection_type will allow us to do this.
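
As a minimal sketch of the idea described above, the following toy example stacks two presence tables and counts distinct species per survey. The data frames `cover` and `additional` and all of their values are invented for illustration; they are not drawn from the warehouse tables.

```r
library(dplyr)

# Toy stand-ins for the two survey tables (all values are hypothetical)
cover <- data.frame(
  survey_code    = c("A 2019-05-09", "A 2019-05-09", "B 2019-05-10"),
  detection_type = "cover_est",
  species_key    = c(5L, 20L, 5L)
)
additional <- data.frame(
  survey_code    = c("A 2019-05-09", "B 2019-05-10"),
  detection_type = "supplemental_obs",
  species_key    = c(31L, 36L)
)

# Stack the tables vertically, then count distinct species per survey
richness <- bind_rows(cover, additional) %>%
  distinct(survey_code, species_key) %>%
  count(survey_code, name = "richness")

print(richness)
```

Because detection_type travels with each row until the final count, the stacked table can still be filtered by detection method before richness is summarized.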

## Database Connection

In [99]:
# BigQuery API Key
bq_auth(path = "/content/mpg-data-warehouse-api_key-master.json")

In [100]:
Sys.setenv(BIGQUERY_TEST_PROJECT = "mpg-data-warehouse")

In [101]:
billing <- bq_test_project()

### yvp_vegetation_cover

In [74]:
sql_vegetation_cover <- 
"
SELECT
  CONCAT(plot_code, \" \", date) AS survey_code,
  plot_code,
  SUBSTR(SAFE_CAST(date AS STRING), 1, 4) AS year,
  plot_loc,
  plot_rep,
  plot_num,
  (\"cover_est\") AS detection_type,
  species_key
FROM
  `mpg-data-warehouse.vegetation_fixed_plot_yvp.yvp_vegetation_cover`
"

In [76]:
bq_vegetation_cover <- bq_project_query(billing, sql_vegetation_cover)

Auto-refreshing stale OAuth token.



In [77]:
tb_vegetation_cover <- bq_table_download(bq_vegetation_cover)

In [78]:
df_vegetation_cover <- as.data.frame(tb_vegetation_cover)

In [79]:
df_vegetation_cover %>% glimpse() 

Rows: 21,682
Columns: 8
$ survey_code    <chr> "YVP N7 2017-06-08", "YVP N7 2017-06-08", "YVP N7 2017…
$ plot_code      <chr> "YVP N7", "YVP N7", "YVP N7", "YVP N7", "YVP N7", "YVP…
$ year           <chr> "2017", "2017", "2017", "2017", "2017", "2017", "2017"…
$ plot_loc       <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",…
$ plot_rep       <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ plot_num       <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
$ detection_type <chr> "cover_est", "cover_est", "cover_est", "cover_est", "c…
$ species_key    <int> 82, 113, 153, 187, 233, 266, 286, 320, 389, 411, 437, …


### yvp_additional_species

In [105]:
sql_additional_species <- "
SELECT 
  CONCAT(plot_code, \" \", date) AS survey_code,
  plot_code,
  SUBSTR(SAFE_CAST(date AS STRING), 1, 4) AS year,
  plot_loc,
  plot_rep,
  plot_num,
  (\"supplemental_obs\") AS detection_type,
  species_key
FROM
  `mpg-data-warehouse.vegetation_fixed_plot_yvp.yvp_additional_species`
"

In [107]:
bq_additional_species <- bq_project_query(billing, sql_additional_species)

In [108]:
tb_additional_species <- bq_table_download(bq_additional_species)

In [109]:
df_additional_species <- as.data.frame(tb_additional_species)

In [110]:
df_additional_species %>% glimpse()

Rows: 1,280
Columns: 8
$ survey_code    <chr> "YVP 12 2018-07-10", "YVP 12 2018-07-10", "YVP 12 2018…
$ plot_code      <chr> "YVP 12", "YVP 12", "YVP 12", "YVP 12", "YVP 12", "YVP…
$ year           <chr> "2018", "2018", "2018", "2018", "2018", "2018", "2018"…
$ plot_loc       <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ plot_rep       <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ plot_num       <int> 12, 12, 12, 12, 12, 246, 246, 246, 246, 571, 571, 571,…
$ detection_type <chr> "supplemental_obs", "supplemental_obs", "supplemental_…
$ species_key    <int> 72, 179, 426, 63, 484, 402, 53, 274, 334, 496, 240, 56…


# Wrangle

With the data from yvp_vegetation_cover, species lists must first be summarized as distinct key_plant_species values (species_key in the queries above) within survey_code values. This is necessary because the raw data are recorded in 10 subplots per transect, so the same species is often repeated among subplots. The yvp_vegetation_cover data can then be vertically bound to the yvp_additional_species data after some light coercion of field names.
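
The collapse from subplot records to one row per species per survey can be sketched as follows. The data frame `raw_cover`, its `subplot` column, and its values are hypothetical, added only to show why the distinct columns are worth naming when extra fields are present.

```r
library(dplyr)

# Hypothetical raw cover records: one row per species per subplot
raw_cover <- data.frame(
  survey_code = c("A 2019-05-09", "A 2019-05-09", "A 2019-05-09", "B 2019-05-10"),
  subplot     = c(1L, 2L, 2L, 1L),
  species_key = c(5L, 5L, 20L, 5L)
)

# Keep one row per species within each survey; the subplot column is dropped
survey_species <- raw_cover %>%
  distinct(survey_code, species_key) %>%
  arrange(survey_code, species_key)

print(survey_species)
```

The queries in this notebook do not select a subplot field, so a bare `distinct()` over all columns has the same effect there.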

One caution with these data: according to protocol, a plant species is only included in the additional species table if it was not found during cover-based surveys. In practice, I assume that this rule is routinely violated, because it isn't easy to remember every species already surveyed, nor is it efficient to check each time. We must therefore eliminate duplicate species for a given grid point. When duplication exists, default to detection_type = "cover_est". This will prevent upward bias of richness estimates and will make downstream analyses less complicated. Some operation that again summarizes distinct key_plant_species values within survey_code values will be necessary. For additional information on this point, please see the instructions for a similar operation with the point-intercept data in the gridVeg [Readme](https://docs.google.com/document/d/1JWnhxNjeSQZkSnGhtHP68i_l1mDj4vPFMBdUvGqN0TA/edit#heading=h.hnb7ex8jlp42).
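
One possible way to apply the "default to cover_est" rule is sketched below. It is not necessarily the method used later in this workflow; the data frame `combined` and its rows are invented for illustration. The approach sorts cover_est ahead of supplemental_obs and then keeps the first row for each survey and species.

```r
library(dplyr)

# Hypothetical combined records; species 183 appears under both detection types
combined <- data.frame(
  survey_code    = rep("YVP NC294 2019-05-09", 4),
  detection_type = c("cover_est", "cover_est", "supplemental_obs", "supplemental_obs"),
  species_key    = c(5L, 183L, 183L, 31L)
)

deduplicated <- combined %>%
  # Order rows so that cover_est precedes supplemental_obs within each species
  arrange(
    survey_code,
    species_key,
    factor(detection_type, levels = c("cover_est", "supplemental_obs"))
  ) %>%
  # Keep the first row per survey and species, retaining the other columns
  distinct(survey_code, species_key, .keep_all = TRUE)

print(deduplicated)
```

Ordering by an explicit factor avoids relying on the alphabetical order of the detection_type labels, which would otherwise break if the labels were renamed.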


## Remove duplicates

### yvp_vegetation_cover

In [81]:
df_vegetation_cover %>% glimpse()

Rows: 21,682
Columns: 8
$ survey_code    <chr> "YVP N7 2017-06-08", "YVP N7 2017-06-08", "YVP N7 2017…
$ plot_code      <chr> "YVP N7", "YVP N7", "YVP N7", "YVP N7", "YVP N7", "YVP…
$ year           <chr> "2017", "2017", "2017", "2017", "2017", "2017", "2017"…
$ plot_loc       <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",…
$ plot_rep       <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ plot_num       <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
$ detection_type <chr> "cover_est", "cover_est", "cover_est", "cover_est", "c…
$ species_key    <int> 82, 113, 153, 187, 233, 266, 286, 320, 389, 411, 437, …


In [131]:
# remove duplicate species records per survey
df_vegetation_cover <- df_vegetation_cover %>%
  group_by(survey_code) %>%
  distinct() %>%
  arrange(desc(survey_code), species_key) %>% glimpse()

Rows: 5,292
Columns: 8
Groups: survey_code [175]
$ survey_code    <chr> "YVP NC294 2019-05-09", "YVP NC294 2019-05-09", "YVP N…
$ plot_code      <chr> "YVP NC294", "YVP NC294", "YVP NC294", "YVP NC294", "Y…
$ year           <chr> "2019", "2019", "2019", "2019", "2019", "2019", "2019"…
$ plot_loc       <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",…
$ plot_rep       <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",…
$ plot_num       <int> 294, 294, 294, 294, 294, 294, 294, 294, 294, 294, 294,…
$ detection_type <chr> "cover_est", "cover_est", "cover_est", "cover_est", "c…
$ species_key    <int> 5, 20, 39, 52, 57, 67, 72, 74, 82, 90, 153, 154, 174, …


### yvp_additional_species

In [111]:
df_additional_species %>% glimpse()

Rows: 1,280
Columns: 8
$ survey_code    <chr> "YVP 12 2018-07-10", "YVP 12 2018-07-10", "YVP 12 2018…
$ plot_code      <chr> "YVP 12", "YVP 12", "YVP 12", "YVP 12", "YVP 12", "YVP…
$ year           <chr> "2018", "2018", "2018", "2018", "2018", "2018", "2018"…
$ plot_loc       <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ plot_rep       <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", …
$ plot_num       <int> 12, 12, 12, 12, 12, 246, 246, 246, 246, 571, 571, 571,…
$ detection_type <chr> "supplemental_obs", "supplemental_obs", "supplemental_…
$ species_key    <int> 72, 179, 426, 63, 484, 402, 53, 274, 334, 496, 240, 56…


In [130]:
# remove duplicates
df_additional_species <- df_additional_species %>%
  group_by(survey_code) %>%
  distinct() %>%
  arrange(desc(survey_code), species_key) %>% glimpse()

Rows: 1,275
Columns: 8
Groups: survey_code [178]
$ survey_code    <chr> "YVP NC294 2019-05-09", "YVP NC294 2019-05-09", "YVP N…
$ plot_code      <chr> "YVP NC294", "YVP NC294", "YVP NC294", "YVP NC294", "Y…
$ year           <chr> "2019", "2019", "2019", "2019", "2019", "2019", "2019"…
$ plot_loc       <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",…
$ plot_rep       <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",…
$ plot_num       <int> 294, 294, 294, 294, 294, 294, 294, 294, 294, 294, 294,…
$ detection_type <chr> "supplemental_obs", "supplemental_obs", "supplemental_…
$ species_key    <int> 31, 36, 84, 178, 183, 316, 362, 36, 216, 316, 342, 362…


## Combine dataframes

In [133]:
species_richness <- union_all(df_vegetation_cover, df_additional_species) %>% glimpse()

Rows: 6,567
Columns: 8
Groups: survey_code [187]
$ survey_code    <chr> "YVP NC294 2019-05-09", "YVP NC294 2019-05-09", "YVP N…
$ plot_code      <chr> "YVP NC294", "YVP NC294", "YVP NC294", "YVP NC294", "Y…
$ year           <chr> "2019", "2019", "2019", "2019", "2019", "2019", "2019"…
$ plot_loc       <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",…
$ plot_rep       <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",…
$ plot_num       <int> 294, 294, 294, 294, 294, 294, 294, 294, 294, 294, 294,…
$ detection_type <chr> "cover_est", "cover_est", "cover_est", "cover_est", "c…
$ species_key    <int> 5, 20, 39, 52, 57, 67, 72, 74, 82, 90, 153, 154, 174, …


In [152]:
# look for duplicates
species_richness %>%
  group_by(survey_code) %>%
  filter(survey_code == "YVP NC294 2019-05-09") %>% 
  count(species_key) %>%
  arrange(desc(n))

survey_code,species_key,n
<chr>,<int>,<int>
YVP NC294 2019-05-09,183,2
YVP NC294 2019-05-09,5,1
YVP NC294 2019-05-09,20,1
YVP NC294 2019-05-09,31,1
YVP NC294 2019-05-09,36,1
YVP NC294 2019-05-09,39,1
YVP NC294 2019-05-09,52,1
YVP NC294 2019-05-09,57,1
YVP NC294 2019-05-09,67,1
YVP NC294 2019-05-09,72,1
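
The check above inspects a single survey. A sketch of the same check across every survey at once might look like the following; `species_richness_toy` and its values are hypothetical stand-ins for the combined table.

```r
library(dplyr)

# Hypothetical combined table; species 183 is recorded twice in survey "A"
species_richness_toy <- data.frame(
  survey_code = c("A", "A", "A", "B", "B"),
  species_key = c(183L, 183L, 5L, 36L, 72L)
)

# Count rows per survey and species, then keep only the repeated combinations
duplicates <- species_richness_toy %>%
  count(survey_code, species_key) %>%
  filter(n > 1)

print(duplicates)
```

An empty result from this check would indicate that no species is counted more than once in any survey.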
