<a href="https://colab.research.google.com/github/samsoe/mpg_notebooks/blob/master/YVP_Additional_Species_Wrangle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*R Notebook*

# README

* README for fixed plot vegetation data - [Additional Species Data](https://docs.google.com/document/d/16-Aq8u9Rudd78fSzfjvpCXyQgE-BstC-d2PjYfmLtcw/edit#heading=h.t9gebon1aetd)

# Load Tools

In [0]:
# Package and library installation
packages_needed = c("tidyverse", "gsheet") # comma delimited vector of package names
packages_installed = packages_needed %in% rownames(installed.packages())

if (any(! packages_installed))
  install.packages(packages_needed[! packages_installed])
# attach each package by name
for (pkg in packages_needed) {
  library(pkg, character.only = TRUE)
}

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



# Source

In [0]:
# 2020-04-28_yvp_additional_species.csv
src = 'https://drive.google.com/uc?id=1GWDvhXIHsrOUaRveq5SoozgZ7oUW9XJy'

In [0]:
df <- read.csv(file = src)

In [0]:
head(df, n=2)

| | plot_code | date | species_code | cover_pct |
|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<fct>` | `<int>` |
| 1 | YVP 10 | 2017-06-09 | BALSAG | 1 |
| 2 | YVP 10 | 2017-06-09 | ERICOR | 1 |


# Wrangle

## Structure Columns

### plot_code

In [0]:
# convert to string
df$plot_code <- as.character(df$plot_code)

### plot_loc

In [0]:
# detect "N" in 'plot_code' and write to new column 'plot_loc'
df <- df %>%
  mutate(plot_loc = ifelse(str_detect(plot_code, "N"), "N", NA))

In [0]:
# strip "N" from 'plot_code' if present
df$plot_code <- str_remove(df$plot_code, "N")

In [0]:
# reorder columns
df <- df[,c(1,5,2,3,4)]

### plot_rep

In [0]:
# detect "A", "B", "C" characters in plot_code and if present write to 'plot_rep'
df <- df %>%
  mutate(plot_rep = case_when(str_detect(plot_code, "A")~"A",
                              str_detect(plot_code, "B")~"B",
                              str_detect(plot_code, "C")~"C"))

In [0]:
# strip "A", "B", "C" from plot_code
df$plot_code <- str_remove(df$plot_code, "[ABC]")

In [0]:
# reorder columns
df <- df[,c(1,2,6,3,4,5)]

### plot_num

In [0]:
# extract the digits from 'plot_code' to populate 'plot_num'
df <- df %>%
  mutate(plot_num = str_extract(plot_code, "[:digit:].*"))

In [0]:
df <- df[,c(1,2,3,7,4,5,6)]
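
The three parsing steps above (`plot_loc`, `plot_rep`, `plot_num`) can be tried on a small toy vector before running them on the full dataset. This is only a sketch: the plot codes below are hypothetical, since the source preview shows only `YVP 10`, and the regular expressions are simplified stand-ins for the ones used in the cells above.

```r
library(dplyr)
library(stringr)

# Hypothetical plot codes for illustration only; the real suffix layout may differ
toy <- tibble(plot_code = c("YVP 10", "YVP N10", "YVP 12A", "YVP N7B"))

toy_parsed <- toy %>%
  mutate(
    # "N" marks the plot location; NA when absent
    plot_loc  = if_else(str_detect(plot_code, "N"), "N", NA_character_),
    # replicate letter A, B, or C; str_extract returns NA when absent
    plot_rep  = str_extract(plot_code, "[ABC]"),
    # first run of digits in the code
    plot_num  = str_extract(plot_code, "[0-9]+"),
    # strip the location and replicate letters last, after they have been captured
    plot_code = str_remove_all(plot_code, "[NABC]")
  )

print(toy_parsed)
```

Doing all four assignments in one `mutate()` avoids the intermediate column reordering, but it assumes that the letters N, A, B, and C never appear elsewhere in a plot code.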

### date

In [0]:
# convert to date
df$date <- as.Date(df$date)

### subplot

In [0]:
# not present in source dataset

### species_key

This will be imported from the plant species metadata table, and we can use it to join and correct species codes in the future. But because joining the key to the species codes will require that the codes be corrected first, we will skip this step for now.
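
A small sketch of why the join has to wait: a code that does not exactly match the metadata table gets an `NA` key. The two toy tables below are stand-ins, not the real data; the code/key pairs are taken from the first two rows of the species list shown later in this notebook, and `UNKN SP` is one of the uncorrected codes found in the source data.

```r
library(dplyr)

# Toy observations: one code that matches the metadata, one that does not
obs <- tibble(species_code = c("CRYP_SP", "UNKN SP"), cover_pct = c(1L, 2L))

# Toy stand-in for the species metadata table
spp_toy <- tibble(
  key_PlantSpecies = c(783, 784),
  key_PlantCode    = c("CRYP_SP", "UNKN_SP")
)

joined <- obs %>%
  left_join(spp_toy, by = c("species_code" = "key_PlantCode"))

# "UNKN SP" (space instead of underscore) finds no match, so its key is NA
print(joined)
```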

### species_code

In [0]:
# convert to string
df$species_code <- as.character(df$species_code)

### cover_pct

In [0]:
typeof(df$cover_pct)

In [0]:
head(df)

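| | plot_code | plot_loc | plot_rep | plot_num | date | species_code | cover_pct |
|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<date>` | `<chr>` | `<int>` |
| 1 | YVP 10 | NA | NA | 10 | 2017-06-09 | BALSAG | 1 |
| 2 | YVP 10 | NA | NA | 10 | 2017-06-09 | ERICOR | 1 |
| 3 | YVP 10 | NA | NA | 10 | 2017-06-09 | ERINAU | 2 |
| 4 | YVP 10 | NA | NA | 10 | 2017-06-09 | ERIPUM | 1 |
| 5 | YVP 10 | NA | NA | 10 | 2017-06-09 | LEWRED | 1 |
| 6 | YVP 10 | NA | NA | 10 | 2017-06-09 | PURVIR | 10 |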


# Correct errors in species codes
The species codes in the source data contain numerous errors, and some reflect outdated taxonomy where species names have since been revised. Left uncorrected, these codes can create spurious species or make it impossible to join with the available species metadata. Several steps are needed:

1. Trim leading and trailing spaces from the codes (this was done in Excel before the source CSV files were created)
2. Read in the master list of species metadata and query the YVP species codes to identify which ones don't align
3. Align the species codes, identify the ones that are wrong, and correct them
4. Import the numeric key from the species metadata so that future alignments are easier and errors are less common
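
Step 1 was done in a spreadsheet, but the same trimming could also be done in R before the codes are compared. The sketch below uses toy data (the padded codes and the two-row master list are made up for illustration) and shows the trim followed by the mismatch query from step 2.

```r
library(dplyr)
library(stringr)

# Toy codes with stray leading/trailing spaces (hypothetical)
raw <- tibble(species_code = c(" BALSAG", "ANTE SP ", "BALSAG", "ANTSPP"))

# Toy stand-in for the master species list
master <- tibble(key_PlantCode = c("BALSAG", "ANTE_SP"))

unmatched <- raw %>%
  mutate(species_code = str_trim(species_code)) %>%                # step 1: trim spaces
  anti_join(master, by = c("species_code" = "key_PlantCode")) %>%  # step 2: keep non-matching codes
  distinct(species_code) %>%
  arrange(species_code)

print(unmatched)
```

Note that `str_trim()` only removes outer whitespace, so `ANTE SP` still fails to match `ANTE_SP` and has to be handled by the correction table.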

### Read in master list of species metadata and codes

In [0]:
# 2020-04-27_MPGR_plant_species_list
spp = gsheet2tbl("https://docs.google.com/spreadsheets/d/1wPen7yeimXtY4qK5Nj4JPvlgHYamoogR0YJekaF7i9Y") %>% 
as_tibble() %>% glimpse()

Rows: 754
Columns: 9
$ key_PlantSpecies <dbl> 784, 783, 782, 781, 780, 779, 778, 777, 776, 775, 77…
$ key_PlantCode    <chr> "UNKN_SP", "CRYP_SP", "RUME_SP", "HIER_SP", "BOEC_SP…
$ NameScientific   <chr> "Unknown", "Cryptantha spp.", "Rumex spp.", "Hieraci…
$ NameSynonym      <chr> NA, NA, NA, NA, "Arabis spp.", NA, NA, NA, NA, NA, N…
$ NameCommon       <chr> "unknown", "cryptantha", "dock", "hawkweed", "rockcr…
$ NameFamily       <chr> "unknown", "Boraginaceae", "Polygonaceae", "Asterace…
$ NativeStatus     <chr> "unknown", "native", "nonnative", "unknown", "native…
$ LifeCycle        <chr> "unknown", "unknown", "Perennial", "Perennial", "Bie…
$ LifeForm         <chr> "unknown", "forb", "forb", "forb", "forb", "forb", "…


### Align species codes and identify mistakes


In [0]:
# Align the species codes 
# Produce df of codes that don't match the master list
collisions_species_codes = 
df %>% 
anti_join(spp, by = c("species_code" = "key_PlantCode")) %>% 
group_by(species_code) %>% 
distinct(species_code) %>% 
arrange(species_code) %>% 
print(n = Inf)

# A tibble: 29 x 1
# Groups:   species_code [29]
   species_code      
   <chr>             
 1 ANTE SP           
 2 ANTSPP            
 3 ANTSPP2           
 4 ARTSPP            
 5 ASTMIN            
 6 BOEC SP           
 7 BOESPP            
 8 CARE SP           
 9 CAREX SP          
10 CREACU            
11 DESC SP           
12 ERIG SP           
13 ERISPP            
14 ERITRA            
15 GAIARI?           
16 HEISCO            
17 HEISPP            
18 PENSPP            
19 PURVIR            
20 ROSSPP            
21 SALI SP           
22 SALIX SP          
23 SELDEN            
24 SENIINT           
25 SOLCAN            
26 UNKN SP           
27 UNKNOWN ASTERACEAE

### Create file that associates errors with corrections

In [0]:
# Produce file `collisions_species_codes` for work in spreadsheet outside of this environment
# The file will save to the `content` folder in the drive tree
# BL downloaded the file to his desktop to produce a new naming key file
filename = "collisions_species_codes.csv"
if (filename %in% list.files(getwd())) {
  cat("file already exists in working directory: ", filename, "\n", "working directory: ", getwd(), "\n")
} else {
  write.csv(collisions_species_codes, filename)
  cat(filename, " written to working directory \n", "working directory: ", getwd(), "\n")
}

collisions_species_codes.csv  written to working directory 
 working directory:  /content 


In [0]:
# Import csv file with the updated codes 
# This file was produced by visually aligning the codes with a file that Rebecca Durham provided
code_corrections <- read.csv(file = "https://drive.google.com/uc?id=11Eo8DKXp0GR5qLXiRoAwBg9MPCpX1AXq",
  colClasses = c("character", "character")) %>% 
glimpse()

Rows: 29
Columns: 2
$ plantcode_incorrect <chr> "ANTE SP", "ANTSPP", "ANTSPP2", "ARTSPP", "ASTMIN…
$ plantcode_corrected <chr> "ANTE_SP", "ANTE_SP", "ANTE_SP", "ARTE_SP", "ASTM…


### Cascade changes through dataset


In [0]:
# Create new df to hold corrected information
# Change species_code to character variable to avoid problems with levels later
yvp_addtl_spp_correct = df %>% mutate(species_code = as.character(species_code)) %>% glimpse()

Rows: 1,280
Columns: 7
$ plot_code    <chr> "YVP 10", "YVP 10", "YVP 10", "YVP 10", "YVP 10", "YVP 1…
$ plot_loc     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ plot_rep     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ plot_num     <chr> "10", "10", "10", "10", "10", "10", "10", "10", "10", "1…
$ date         <date> 2017-06-09, 2017-06-09, 2017-06-09, 2017-06-09, 2017-06…
$ species_code <chr> "BALSAG", "ERICOR", "ERINAU", "ERIPUM", "LEWRED", "PURVI…
$ cover_pct    <int> 1, 1, 2, 1, 1, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5…


In [0]:
# Loop operation used to update each instance of an incorrect code
# Embed logic control to prevent errors if this loop is run on a df with corrected codes
# Variable to track loop cycles
cycles = 0

for (i in seq_len(nrow(code_corrections))) {
  index = which(yvp_addtl_spp_correct$species_code == code_corrections$plantcode_incorrect[i])

  # apply the correction only when at least one matching row exists
  if (length(index) != 0) {
    cat("number of incorrect code entries: ", length(index), "\n")
    cat("incorrect code: ", code_corrections$plantcode_incorrect[i], "\n")
    yvp_addtl_spp_correct[index, ]$species_code = code_corrections$plantcode_corrected[i]
    print(yvp_addtl_spp_correct[index, c(1,5,6,7)])
    cycles = cycles + length(index)
    cat("\n")
  } else {
    cat("no incorrect code entries were found \n")
  }

  cat("number of corrections made (cumulative): ", cycles, "\n\n\n")

}
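
As an alternative to the loop, the same substitution can be sketched as a join against the correction table. This is not the method used in this notebook, just a vectorized variant shown on toy data; the incorrect/corrected pairs are the first few shown in the `code_corrections` glimpse above, and the object names `dat`, `corrections`, and `fixed` are placeholders.

```r
library(dplyr)

# Toy observations: one valid code and three that need correcting
dat <- tibble(species_code = c("BALSAG", "ANTE SP", "ANTSPP", "ARTSPP"))

# Toy correction table with the same column names as code_corrections
corrections <- tibble(
  plantcode_incorrect = c("ANTE SP", "ANTSPP", "ARTSPP"),
  plantcode_corrected = c("ANTE_SP", "ANTE_SP", "ARTE_SP")
)

fixed <- dat %>%
  left_join(corrections, by = c("species_code" = "plantcode_incorrect")) %>%
  # use the corrected code where one exists, otherwise keep the original
  mutate(species_code = coalesce(plantcode_corrected, species_code)) %>%
  select(-plantcode_corrected)

print(fixed)
```

Like the loop, this is safe to rerun on already corrected data: corrected codes no longer match `plantcode_incorrect`, so `coalesce()` leaves them unchanged. It does not print the per-code report that the loop produces.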

In [0]:
# Rescan for incorrect species codes
yvp_addtl_spp_correct %>% 
anti_join(spp, by = c("species_code" = "key_PlantCode")) %>% 
group_by(species_code) %>% distinct(species_code) %>% arrange(species_code)

species_code
<chr>
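
The empty result above is checked by eye. One possible way to make the rescan fail loudly is a small helper that stops when any code is still unmatched. `assert_codes_match()` is a hypothetical name, not part of this notebook or of any package, and the toy tables are for illustration only.

```r
library(dplyr)

# Hypothetical helper: stop if any species code is missing from the master list
assert_codes_match <- function(data, master) {
  unmatched <- data %>%
    anti_join(master, by = c("species_code" = "key_PlantCode")) %>%
    distinct(species_code)
  if (nrow(unmatched) > 0) {
    stop("unmatched species codes: ",
         paste(unmatched$species_code, collapse = ", "))
  }
  invisible(TRUE)
}

master_toy <- tibble(key_PlantCode = c("BALSAG", "ANTE_SP"))

# All codes match: returns silently
assert_codes_match(tibble(species_code = c("BALSAG", "ANTE_SP")), master_toy)

# One code does not match: capture the error message instead of halting
result <- tryCatch(
  assert_codes_match(tibble(species_code = "ANTSPP"), master_toy),
  error = function(e) conditionMessage(e)
)
print(result)
```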


In [0]:
# Incorporate serial key for species codes
yvp_addtl_spp_FINAL = 
yvp_addtl_spp_correct %>% 
left_join(spp %>% select(key_PlantSpecies, key_PlantCode), by = c("species_code" = "key_PlantCode")) %>% 
rename(species_key = key_PlantSpecies) %>% 
select(c(1,2,3,4,5,8,6,7)) %>% 
glimpse()

Rows: 1,280
Columns: 8
$ plot_code    <chr> "YVP 10", "YVP 10", "YVP 10", "YVP 10", "YVP 10", "YVP 1…
$ plot_loc     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ plot_rep     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ plot_num     <chr> "10", "10", "10", "10", "10", "10", "10", "10", "10", "1…
$ date         <date> 2017-06-09, 2017-06-09, 2017-06-09, 2017-06-09, 2017-06…
$ species_key  <dbl> 72, 212, 218, 220, 298, 433, 16, 72, 163, 169, 212, 218,…
$ species_code <chr> "BALSAG", "ERICOR", "ERINAU", "ERIPUM", "LEWRED", "PRUVI…
$ cover_pct    <int> 1, 1, 2, 1, 1, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5…


In [0]:
summary(yvp_addtl_spp_FINAL)

  plot_code           plot_loc           plot_rep           plot_num        
 Length:1280        Length:1280        Length:1280        Length:1280       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
      date             species_key    species_code         cover_pct     
 Min.   :2017-05-08   Min.   :  3.0   Length:1280        Min.   : 0.000  
 1st Qu.:2017-06-09   1st Qu.: 91.0   Class :character   1st Qu.: 1.000  
 Median :2018-07-02   Median :240.0   Mode  :character   Median : 1.000  
 Mean   :2018-07-17   Mean   :258.2                      Mean   : 1.635  
 3rd Qu.:2019-05-29   3rd Qu.:405.0                      3rd Qu.: 1.000  
 Max.   :2019-07-

# Output
## Export wrangled dataframe to csv
Export the full dataset so that it can be pushed to the BigQuery (BQ) database.

In [0]:
filename_final = "yvp_additional_species_FINAL.csv"

if (filename_final %in% list.files(getwd())) {
  cat("file already exists in working directory:", filename_final, "\n", "working directory:", getwd(), "\n")
} else {
  write.csv(yvp_addtl_spp_FINAL, filename_final)
  cat(filename_final, "written to working directory \n", "working directory:", getwd(), "\n")
}

yvp_additional_species_FINAL.csv written to working directory 
 working directory: /content 
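
One detail to keep in mind for the export: `write.csv()` writes row names as an unnamed first column by default. If the destination table should contain only the eight data columns, `row.names = FALSE` suppresses that column. The sketch below is an assumption about what the destination needs, not a description of the current pipeline; it uses a two-row toy data frame and a temporary file.

```r
# Toy data frame standing in for yvp_addtl_spp_FINAL
out <- data.frame(species_code = c("BALSAG", "ERICOR"), cover_pct = c(1L, 1L))

# Write to a temporary file so nothing in the working directory is touched
tmp <- tempfile(fileext = ".csv")
write.csv(out, tmp, row.names = FALSE)

# Read it back to confirm that only the data columns were written
back <- read.csv(tmp)
print(names(back))
```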


## Push to BigQuery
Link to BQ table and/or text description...

## Field datasheet version
TBD based on conversation with Rebecca Durham (BL, 2020-05-05)

I don't know what is best here: include the entire list? Completely blank list?

Unknown: does RD want the cumulative species list for each subplot, or just from 2019?

Make the filename easy to differentiate from the full datasets, like append "field data sheets" or "data collection sheets" or something.