<a href="https://colab.research.google.com/github/samsoe/mpg_notebooks/blob/master/YVP_Vegetation_Cover_Data_Wrangle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*R Notebook*

# README

* [Readme fixed grid plot vegetation data](https://docs.google.com/document/d/16-Aq8u9Rudd78fSzfjvpCXyQgE-BstC-d2PjYfmLtcw/edit?usp=sharing)

# Load Tools

In [None]:
# Package and library installation
packages_needed = c("tidyverse", "gsheet") # comma delimited vector of package names
packages_installed = packages_needed %in% rownames(installed.packages())

if (any(! packages_installed))
  install.packages(packages_needed[! packages_installed])
for (i in 1:length(packages_needed)) {
  library(packages_needed[i], character.only = T)
}

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.4     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



# Source

In [None]:
# 2020-04-28_yvp_vegetation_cover
src = 'https://drive.google.com/uc?id=1pemnlKIlfAQw2JSMN7yDlYMG5QhUW-NP'

In [None]:
df <- read.csv(file = src)

In [None]:
head(df, n=2)

Unnamed: 0_level_0,plot_code,date,subplot,species_code,cover_pct
Unnamed: 0_level_1,<chr>,<chr>,<int>,<chr>,<int>
1,YVP 10,2017-06-09,1,BOESPP,1
2,YVP 10,2017-06-09,1,CREINT,1


# Wrangle

## Structure columns

### Plot Code Transformation
The plot code used in the source data is a complex string. It is needed to provide a unique key to each survey location, but because it is a string it is difficult to sort or filter plots. Further, the plot codes used here will be difficult to associate with the extensive grid point metadata stored elsewhere in the MPG Data Warehouse. 

Solution: parse the identifiers in the plot code into separate fields, but retain the original character string for internal use.
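
A minimal sketch of this parsing on made-up plot codes, assuming the `dplyr`, `stringr`, and `tibble` packages are installed. The names `toy` and `toy_parsed` are hypothetical, and the regular expressions are one possible shorthand, not necessarily the calls used elsewhere in this notebook.

```r
library(dplyr)
library(stringr)

# Hypothetical plot codes written in the same style as the source data
toy <- tibble::tibble(plot_code = c("YVP 10", "YVP N111", "YVP NB294"))

toy_parsed <- toy %>%
  mutate(
    # "N" anywhere in the code marks the location identifier
    plot_loc = if_else(str_detect(plot_code, "N"), "N", NA_character_),
    # first "A", "B", or "C" in the code, if any, is the replicate identifier
    plot_rep = str_extract(plot_code, "[ABC]"),
    # first run of digits is the plot number
    plot_num = as.integer(str_extract(plot_code, "[0-9]+"))
  )

print(toy_parsed)
```

Under these assumptions, `"YVP NB294"` is split into `plot_loc = "N"`, `plot_rep = "B"`, and `plot_num = 294`, while `"YVP 10"` keeps `NA` in both character fields.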

#### plot_code

In [None]:
# convert to string
df$plot_code <- as.character(df$plot_code)

#### plot_loc

In [None]:
# detect "N" in 'plot_code' and write to new column 'plot_loc'
df <- df %>%
  mutate(plot_loc = ifelse(str_detect(plot_code, "N"), "N", NA))

In [None]:
# reorder columns
df <- df[,c(1,6,2,3,4,5)]

#### plot_rep

In [None]:
# detect "A", "B", "C" characters in plot_code and if present write to 'plot_rep'
df <- df %>%
  mutate(plot_rep = case_when(str_detect(plot_code, "A")~"A",
                              str_detect(plot_code, "B")~"B",
                              str_detect(plot_code, "C")~"C"))

In [None]:
# reorder columns
df <- df[,c(1,2,7,3,4,5,6)]

#### plot_num

In [None]:
# extract the numeric portion of 'plot_code' to populate 'plot_num'
df <- df %>%
  mutate(plot_num = str_extract(plot_code, "[:digit:].*"),
         plot_num = as.integer(plot_num))

In [None]:
# reorder columns
df <- df[,c(1,2,3,8,4,5,6,7)]

### date

In [None]:
# convert to date
df$date <- as.Date(df$date)

### subplot

In [None]:
# convert to integer
df$subplot <- as.integer(df$subplot)

### species_key

This will be imported from the plant species metadata table, and we can use it to join and correct species codes in the future. But because joining the key to the species codes will require that the codes be corrected first, we will skip this step for now.


### species_code

In [None]:
# convert to string
df$species_code <- as.character(df$species_code)

In [None]:
head(df, n=2)

Unnamed: 0_level_0,plot_code,plot_loc,plot_rep,plot_num,date,subplot,species_code,cover_pct
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<int>,<date>,<int>,<chr>,<int>
1,YVP 10,,,10,2017-06-09,1,BOESPP,1
2,YVP 10,,,10,2017-06-09,1,CREINT,1


## Identify Double Counted Species
In a few instances, a plant species is counted twice in the same survey subplot. This could inflate the cover reported for that species. In these cases, the desired end product is a single row for each species in each subplot survey. When the reported percent cover differs between repeated entries, we cannot tell which one is correct. We used the following algorithm to process these repeated or double counts:

* If the cover_pct values are equal, simply delete one of the rows
* If the cover_pct values are not equal, delete one of the rows and change cover_pct for the remaining one to NA
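
The two rules above can be sketched on a small made-up table. This is an illustrative, vectorized alternative that assumes `dplyr` 1.0 or later; `toy` and `toy_resolved` are hypothetical names, and grouping by `date` (not by year) is an assumption of the sketch.

```r
library(dplyr)

# Hypothetical observations: one equal duplicate, one conflicting duplicate,
# and one species that was recorded only once
toy <- tibble::tribble(
  ~plot_code, ~date,        ~subplot, ~species_code, ~cover_pct,
  "YVP 1",    "2017-06-01", 1L,       "ACHMIL",      5L,
  "YVP 1",    "2017-06-01", 1L,       "ACHMIL",      5L,
  "YVP 1",    "2017-06-01", 2L,       "FESIDA",      10L,
  "YVP 1",    "2017-06-01", 2L,       "FESIDA",      1L,
  "YVP 1",    "2017-06-01", 3L,       "POASEC",      2L
)

toy_resolved <- toy %>%
  group_by(plot_code, date, subplot, species_code) %>%
  summarize(
    # keep the value when all entries agree, otherwise record NA
    cover_pct = if (n_distinct(cover_pct) == 1) first(cover_pct) else NA_integer_,
    .groups = "drop"
  )

print(toy_resolved)
```

Each plot, date, subplot, and species combination collapses to one row: the equal duplicate keeps its value, and the conflicting duplicate is set to `NA`.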

In [None]:
str(df)

'data.frame':	21728 obs. of  8 variables:
 $ plot_code   : chr  "YVP 10" "YVP 10" "YVP 10" "YVP 10" ...
 $ plot_loc    : chr  NA NA NA NA ...
 $ plot_rep    : chr  NA NA NA NA ...
 $ plot_num    : int  10 10 10 10 10 10 10 10 10 10 ...
 $ date        : Date, format: "2017-06-09" "2017-06-09" ...
 $ subplot     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ species_code: chr  "BOESPP" "CREINT" "EUPESU" "FESCAM" ...
 $ cover_pct   : int  1 1 5 25 25 10 1 1 5 1 ...


In [None]:
head(df)

Unnamed: 0_level_0,plot_code,plot_loc,plot_rep,plot_num,date,subplot,species_code,cover_pct
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<int>,<date>,<int>,<chr>,<int>
1,YVP 10,,,10,2017-06-09,1,BOESPP,1
2,YVP 10,,,10,2017-06-09,1,CREINT,1
3,YVP 10,,,10,2017-06-09,1,EUPESU,5
4,YVP 10,,,10,2017-06-09,1,FESCAM,25
5,YVP 10,,,10,2017-06-09,1,FESIDA,25
6,YVP 10,,,10,2017-06-09,1,GEUTRI,10


In [None]:
# Find instances where a plant species is counted twice in the same year-plot-subplot combination
dbl_counts <- df %>%
  group_by(year = as.numeric(substring(date,0,4)), plot_code, subplot, species_code) %>%
  summarize(counted = n()) %>% 
  ungroup() %>%
  arrange(year, plot_code, subplot, desc(counted)) %>%
  filter(counted > 1) %>%
  print(n=Inf)

# A tibble: 46 x 5
    year plot_code subplot species_code counted
   <dbl> <chr>       <int> <chr>          <int>
 1  2017 YVP 144         2 VERVER             2
 2  2017 YVP 180         7 FRIPUD             2
 3  2017 YVP 203         4 COLLIN             2
 4  2017 YVP 355        10 PSESPI             2
 5  2017 YVP 44          9 ORTTEN             2
 6  2017 YVP N111        2 DRAVER             2
 7  2017 YVP NB294       8 MICGRA             2
 8  2018 YVP 112         9 ALYALY             2
 9  2018 YVP 12          4 HOLUMB             2
10  2018 YVP 144        10 ACHMIL             2
11  2018 YVP 184         4 HOLUMB             2
12  2018 YVP 185        

In [None]:
view_doubles  <- dbl_counts %>%
  left_join(df %>% mutate(year = as.numeric(substring(date,0,4))))

Joining, by = c("year", "plot_code", "subplot", "species_code")



In [None]:
str(view_doubles)

tibble [92 × 10] (S3: tbl_df/tbl/data.frame)
 $ year        : num [1:92] 2017 2017 2017 2017 2017 ...
 $ plot_code   : chr [1:92] "YVP 144" "YVP 144" "YVP 180" "YVP 180" ...
 $ subplot     : int [1:92] 2 2 7 7 4 4 10 10 9 9 ...
 $ species_code: chr [1:92] "VERVER" "VERVER" "FRIPUD" "FRIPUD" ...
 $ counted     : int [1:92] 2 2 2 2 2 2 2 2 2 2 ...
 $ plot_loc    : chr [1:92] NA NA NA NA ...
 $ plot_rep    : chr [1:92] NA NA NA NA ...
 $ plot_num    : int [1:92] 144 144 180 180 203 203 355 355 44 44 ...
 $ date        : Date[1:92], format: "2017-05-30" "2017-05-30" ...
 $ cover_pct   : int [1:92] 3 4 1 1 10 1 20 2 4 1 ...


In [None]:
view_doubles %>%
  distinct(date, plot_code, species_code, subplot) %>%
  arrange(date, plot_code, species_code)

date,plot_code,species_code,subplot
<date>,<chr>,<chr>,<int>
2017-05-08,YVP NB294,MICGRA,8
2017-05-18,YVP 203,COLLIN,4
2017-05-25,YVP N111,DRAVER,2
2017-05-30,YVP 144,VERVER,2
2017-05-31,YVP 180,FRIPUD,7
2017-06-02,YVP 355,PSESPI,10
2017-06-06,YVP 44,ORTTEN,9
2018-05-28,YVP 144,ACHMIL,10
2018-05-28,YVP N278,ARESER,2
2018-05-28,YVP N522,LITRUD,1


In [None]:
view_doubles %>%
  distinct(date, plot_code, subplot, plot_loc) %>%
  arrange(date, plot_code)

date,plot_code,subplot,plot_loc
<date>,<chr>,<int>,<chr>
2017-05-08,YVP NB294,8,N
2017-05-18,YVP 203,4,
2017-05-25,YVP N111,2,N
2017-05-30,YVP 144,2,
2017-05-31,YVP 180,7,
2017-06-02,YVP 355,10,
2017-06-06,YVP 44,9,
2018-05-28,YVP 144,10,
2018-05-28,YVP N278,2,N
2018-05-28,YVP N522,1,N


### Resolve double counts

* If the 'cover_pct' values are equal, delete one of the rows
* If the 'cover_pct' values are not equal, delete one of the rows and change cover_pct for the remaining one to NA

In [None]:
str(view_doubles)

tibble [92 × 10] (S3: tbl_df/tbl/data.frame)
 $ year        : num [1:92] 2017 2017 2017 2017 2017 ...
 $ plot_code   : chr [1:92] "YVP 144" "YVP 144" "YVP 180" "YVP 180" ...
 $ subplot     : int [1:92] 2 2 7 7 4 4 10 10 9 9 ...
 $ species_code: chr [1:92] "VERVER" "VERVER" "FRIPUD" "FRIPUD" ...
 $ counted     : int [1:92] 2 2 2 2 2 2 2 2 2 2 ...
 $ plot_loc    : chr [1:92] NA NA NA NA ...
 $ plot_rep    : chr [1:92] NA NA NA NA ...
 $ plot_num    : int [1:92] 144 144 180 180 203 203 355 355 44 44 ...
 $ date        : Date[1:92], format: "2017-05-30" "2017-05-30" ...
 $ cover_pct   : int [1:92] 3 4 1 1 10 1 20 2 4 1 ...


In [None]:
distinct_doubles <- view_doubles %>%
  distinct(date, plot_code, subplot, species_code) %>%
  arrange(date, plot_code)

In [None]:
str(distinct_doubles)

tibble [46 × 4] (S3: tbl_df/tbl/data.frame)
 $ date        : Date[1:46], format: "2017-05-08" "2017-05-18" ...
 $ plot_code   : chr [1:46] "YVP NB294" "YVP 203" "YVP N111" "YVP 144" ...
 $ subplot     : int [1:46] 8 4 2 2 7 10 9 10 2 1 ...
 $ species_code: chr [1:46] "MICGRA" "COLLIN" "DRAVER" "VERVER" ...


In [None]:
nrow(distinct_doubles)

In [None]:
for (row in 1:nrow(distinct_doubles)) {
  dbl_ref <- distinct_doubles[row, ]
  
  # date, plot_code, species_code, subplot
  selected_rows <- filter(df, date == dbl_ref$date &
                        plot_code == dbl_ref$plot_code &
                        species_code == dbl_ref$species_code &
                        subplot == dbl_ref$subplot)
                        
  # identify indices of duplicate observations in original dataframe
  selected_indices <- which(df$date == dbl_ref$date &
                        df$plot_code == dbl_ref$plot_code &
                        df$species_code == dbl_ref$species_code &
                        df$subplot == dbl_ref$subplot)

  # Display for Review
  print(selected_rows)

  # compare "cover_pct" observations for equality
  if (var(selected_rows$cover_pct) == 0) {
    print("EQUAL")
    # drop duplicate observation
    df <- df[-last(selected_indices), ]
  } else {
    print("NOT EQUAL")
    # set first row "cover_pct" to NA
    df$cover_pct[first(selected_indices)] <- NA

    # drop duplicate observation
    df <- df[-last(selected_indices), ]
  }
}

In [None]:
str(df)

'data.frame':	21682 obs. of  8 variables:
 $ plot_code   : chr  "YVP 10" "YVP 10" "YVP 10" "YVP 10" ...
 $ plot_loc    : chr  NA NA NA NA ...
 $ plot_rep    : chr  NA NA NA NA ...
 $ plot_num    : chr  "10" "10" "10" "10" ...
 $ date        : Date, format: "2017-06-09" "2017-06-09" ...
 $ subplot     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ species_code: chr  "BOESPP" "CREINT" "EUPESU" "FESCAM" ...
 $ cover_pct   : int  1 1 5 25 25 10 1 1 5 1 ...


In [None]:
# rescan for double observations
# Find instances where a plant species is counted twice in the same year-plot-subplot combination
dbl_recount <- df %>%
  group_by(year = as.numeric(substring(date,0,4)), plot_code, subplot, species_code) %>%
  summarize(counted = n()) %>% 
  ungroup() %>%
  arrange(year, plot_code, subplot, desc(counted)) %>%
  filter(counted > 1) %>%
  print(n=Inf)

# A tibble: 0 x 5
# … with 5 variables: year <dbl>, plot_code <chr>, subplot <int>,
#   species_code <chr>, counted <int>


In [None]:
# display previously duplicated plots for review
for (row in 1:nrow(distinct_doubles)) {
  dbl_ref <- distinct_doubles[row, ]

  # date, plot_code, species_code, subplot
  selected_rows <- filter(df, date == dbl_ref$date &
                        plot_code == dbl_ref$plot_code &
                        species_code == dbl_ref$species_code &
                        subplot == dbl_ref$subplot)
  print(selected_rows[,c(1,5,6,7,8)])
}

  plot_code       date subplot species_code cover_pct
1 YVP NB294 2017-05-08       8       MICGRA         1
  plot_code       date subplot species_code cover_pct
1   YVP 203 2017-05-18       4       COLLIN        NA
  plot_code       date subplot species_code cover_pct
1  YVP N111 2017-05-25       2       DRAVER         1
  plot_code       date subplot species_code cover_pct
1   YVP 144 2017-05-30       2       VERVER        NA
  plot_code       date subplot species_code cover_pct
1   YVP 180 2017-05-31       7       FRIPUD         1
  plot_code       date subplot species_code cover_pct
1   YVP 355 2017-06-02      10       PSESPI        NA
  plot_code       date subplot species_code cover_pct
1    YVP 44 2017-06-06       9       ORTTEN        NA
  plot_code       date subplot species_code cover_pct
1   YVP 144 2018-05-28      10       ACHMIL         5
  plot_code       date subplot species_code cover_pct
1  YVP N278 2018-05-28       2       ARESER        NA
  plot_code       date subpl

## Correct errors in species codes
The species codes in the source data contain numerous errors, and some reflect outdated taxonomy where species names have since been revised. Left uncorrected, these codes artificially create new species and prevent joins with the available species metadata. Several steps must be accomplished here:

1. Trim leading or trailing spaces from the code (this was done in Excel before the source CSV files were created)
2. Read in the master list of species metadata and query the YVP species codes to identify which ones don't align
3. Align the species codes, identify the ones that are wrong, and correct them
4. Import the numeric key from the species metadata so that future alignments are easier and errors are less common
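
Steps 2 through 4 can be sketched with joins on made-up data. This is a hedged illustration, assuming `dplyr` is installed; the tables `obs`, `master`, and `corrections` and the numeric keys in them are hypothetical, and only the column names follow the ones used in this notebook.

```r
library(dplyr)

# Hypothetical observations, master species list, and corrections table
obs <- tibble::tibble(species_code = c("ACHMIL", "AGOS SP", "ARNCOR?"))
master <- tibble::tibble(
  key_PlantSpecies = c(1, 2, 3),  # made-up keys
  key_PlantCode    = c("ACHMIL", "AGOS_SP", "ARNCOR")
)
corrections <- tibble::tibble(
  plantcode_incorrect = c("AGOS SP", "ARNCOR?"),
  plantcode_corrected = c("AGOS_SP", "ARNCOR")
)

# Step 2: codes that do not appear in the master list
unmatched <- obs %>%
  anti_join(master, by = c("species_code" = "key_PlantCode")) %>%
  distinct(species_code)

# Steps 3 and 4: swap in corrected codes, then attach the numeric key
obs_fixed <- obs %>%
  left_join(corrections, by = c("species_code" = "plantcode_incorrect")) %>%
  mutate(species_code = coalesce(plantcode_corrected, species_code)) %>%
  select(-plantcode_corrected) %>%
  left_join(master, by = c("species_code" = "key_PlantCode")) %>%
  rename(species_key = key_PlantSpecies)

print(unmatched)
print(obs_fixed)
```

A join-based correction leaves codes that are already correct unchanged, because `coalesce()` falls back to the original code whenever no correction is listed, so it can be rerun safely on corrected data.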

### Read in master list of species metadata and codes


In [None]:
# 2020-04-27_MPGR_plant_species_list
spp = gsheet2tbl("https://docs.google.com/spreadsheets/d/1wPen7yeimXtY4qK5Nj4JPvlgHYamoogR0YJekaF7i9Y") %>% 
as_tibble() %>% glimpse()

Rows: 754
Columns: 9
$ key_PlantSpecies <dbl> 784, 783, 782, 781, 780, 779, 778, 777, 776, 775, 77…
$ key_PlantCode    <chr> "UNKN_SP", "CRYP_SP", "RUME_SP", "HIER_SP", "BOEC_SP…
$ NameScientific   <chr> "Unknown", "Cryptantha spp.", "Rumex spp.", "Hieraci…
$ NameSynonym      <chr> NA, NA, NA, NA, "Arabis spp.", NA, NA, NA, NA, NA, N…
$ NameCommon       <chr> "unknown", "cryptantha", "dock", "hawkweed", "rockcr…
$ NameFamily       <chr> "unknown", "Boraginaceae", "Polygonaceae", "Asterace…
$ NativeStatus     <chr> "unknown", "native", "nonnative", "unknown", "native…
$ LifeCycle        <chr> "unknown", "unknown", "Perennial", "Perennial", "Bie…
$ LifeForm         <chr> "unknown", "forb", "forb", "forb", "forb", "forb", "…


### Align species codes and identify mistakes


In [None]:
# Align the species codes 
# Produce df of codes that don't match the master list
collisions_species_codes = 
df %>% 
anti_join(spp, by = c("species_code" = "key_PlantCode")) %>% 
group_by(species_code) %>% 
distinct(species_code) %>% 
arrange(species_code) %>% 
print(n = Inf)

# A tibble: 60 x 1
# Groups:   species_code [60]
   species_code
   <chr>
 1 AGOS SP
 2 AGROSP
 3 ALOP SP
 4 ANDOCCUAL
 5 ANTE SP
 6 ANTSPP
 7 ARNCOR?
 8 ARTE SP
 9 ARTSPP
10 BOEC SP
11 BOESPP
12 CARE SP
13 CARE SP2
14 CARE SP4
15 CARSPP
16 CARSPP 1
17 CARSPP 2
18 CARSPP 4
19 CARSPP2
20 CASSPP
21 CAST SP
22 CERA SP
23 CERINT
24 CHEN SP
25 CREIINT
26 CREP SP
27

### Create file that associates errors with corrections

In [None]:
# Produce file `collisions_species_codes` for work in spreadsheet outside of this environment
# The file will save to the `content` folder in the drive tree
# BL downloaded the file to his desktop to produce a new naming key file
filename = "collisions_species_codes.csv"
if (filename %in% list.files(getwd())) {
  cat("file already exists in working directory: ", filename, "\n", "working directory: ", getwd(), "\n")
} else {
  write.csv(collisions_species_codes, filename)
  cat(filename, " written to working directory \n", "working directory: ", getwd(), "\n")
}


file already exists in working directory:  collisions_species_codes.csv 
 working directory:  /content 


In [None]:
# Import csv file with the updated codes 
# This file was produced by visually aligning the codes with a file that Rebecca Durham provided
code_corrections <- read.csv(file = "https://drive.google.com/uc?id=1D0j3U4Or2PviFS02F3rxTRXr1SpGOB0a",
  colClasses = c("character", "character")) %>% 
glimpse()

Rows: 60
Columns: 2
$ plantcode_incorrect <chr> "AGOS SP", "ALOP SP", "ANDOCCUAL", "ARNCOR?", "AR…
$ plantcode_corrected <chr> "AGOS_SP", "ALOP_SP", "ANDOCC", "ARNCOR", "ARTDRA…


### Cascade changes through dataset


In [None]:
# Create new df to hold corrected information
# Change species_code to character variable to avoid problems with levels later
yvp_veg_cover_correct = df %>% mutate(species_code = as.character(species_code)) %>% glimpse()

Rows: 21,682
Columns: 8
$ plot_code    <chr> "YVP 10", "YVP 10", "YVP 10", "YVP 10", "YVP 10", "YVP 1…
$ plot_loc     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ plot_rep     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ plot_num     <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
$ date         <date> 2017-06-09, 2017-06-09, 2017-06-09, 2017-06-09, 2017-06…
$ subplot      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ species_code <chr> "BOESPP", "CREINT", "EUPESU", "FESCAM", "FESIDA", "GEUTR…
$ cover_pct    <int> 1, 1, 5, 25, 25, 10, 1, 1, 5, 1, 30, 2, 2, 1, 3, 0, 1, 0…


In [None]:
# Loop operation used to update each instance of an incorrect code
# Embed logic control to prevent errors if this loop is run on a df with corrected codes
# Variable to track loop cycles
cycles = 0

for (i in seq_len(nrow(code_corrections))) {
  index = which(yvp_veg_cover_correct$species_code == code_corrections$plantcode_incorrect[i])

  # proceed only if at least one row still carries the incorrect code
  if (length(index) > 0) {
    cat("number of incorrect code entries: ", length(index), "\n")
    cat("incorrect code: ", code_corrections$plantcode_incorrect[i], "\n")
    yvp_veg_cover_correct[index, ]$species_code = code_corrections$plantcode_corrected[i]
    print(yvp_veg_cover_correct[index, c(1,5,6,7,8)])
    cycles = cycles + length(index)
    cat("\n")
  } else {
    cat("no incorrect code entries were found \n")
  }

  cat("number of corrections made (cumulative): ", cycles, "\n\n\n")

}

number of incorrect code entries:  1 
incorrect code:  AGOS SP 
      plot_code       date subplot species_code cover_pct
19181   YVP 205 2019-06-26       4      AGOS_SP         1

number of corrections made (cumulative):  1 


number of incorrect code entries:  9 
incorrect code:  ALOP SP 
      plot_code       date subplot species_code cover_pct
21124  YVP N348 2019-07-05       2      ALOP_SP        40
21135  YVP N348 2019-07-05       3      ALOP_SP        10
21144  YVP N348 2019-07-05       4      ALOP_SP        50
21155  YVP N348 2019-07-05       5      ALOP_SP        80
21164  YVP N348 2019-07-05       6      ALOP_SP        10
21179  YVP N348 2019-07-05       7      ALOP_SP        65
21188  YVP N348 2019-07-05       8      ALOP_SP        25
21198  YVP N348 2019-07-05       9      ALOP_SP        60
21205  YVP N348 2019-07-05      10      ALOP_SP         2

number of corrections made (cumulative):  10 


number of incorrect code entries:  2 
incorrect code:  ANDOCCUAL 
     plot_cod

In [None]:
# Rescan for incorrect species codes
yvp_veg_cover_correct %>% 
anti_join(spp, by = c("species_code" = "key_PlantCode")) %>% 
group_by(species_code) %>% distinct(species_code) %>% arrange(species_code)

species_code
<chr>


In [None]:
# Incorporate serial key for species codes
yvp_vegetation_cover_FINAL = 
yvp_veg_cover_correct %>% 
left_join(spp %>% select(key_PlantSpecies, key_PlantCode), by = c("species_code" = "key_PlantCode")) %>% 
rename(species_key = key_PlantSpecies) %>% 
select(c(1,2,3,4,5,6,9,7,8)) %>% 
glimpse()

Rows: 21,682
Columns: 9
$ plot_code    <chr> "YVP 10", "YVP 10", "YVP 10", "YVP 10", "YVP 10", "YVP 1…
$ plot_loc     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ plot_rep     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ plot_num     <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
$ date         <date> 2017-06-09, 2017-06-09, 2017-06-09, 2017-06-09, 2017-06…
$ subplot      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ species_key  <dbl> 780, 163, 230, 232, 233, 250, 84, 316, 320, 343, 483, 57…
$ species_code <chr> "BOEC_SP", "CREINT", "EUPESU", "FESCAM", "FESIDA", "GEUT…
$ cover_pct    <int> 1, 1, 5, 25, 25, 10, 1, 1, 5, 1, 30, 2, 2, 1, 3, 0, 1, 0…


In [None]:
summary(yvp_vegetation_cover_FINAL)

  plot_code           plot_loc           plot_rep            plot_num    
 Length:21682       Length:21682       Length:21682       Min.   :  7.0  
 Class :character   Class :character   Class :character   1st Qu.: 62.0  
 Mode  :character   Mode  :character   Mode  :character   Median :209.0  
                                                          Mean   :244.9  
                                                          3rd Qu.:386.0  
                                                          Max.   :571.0  
                                                                         
      date               subplot        species_key    species_code      
 Min.   :2017-05-08   Min.   : 1.000   Min.   :  3.0   Length:21682      
 1st Qu.:2017-06-09   1st Qu.: 3.000   1st Qu.:153.0   Class :character  
 Median :2018-07-02   Median : 5.000   Median :274.0   Mode  :character  
 Mean   :2018-07-22   Mean   : 5.499   Mean   :280.1                     
 3rd Qu.:2019-05-28   3rd Qu.: 8.000  

# Output

## Export Wrangled DataFrame to CSV 
Export the full data set to CSV so that it can be pushed to the BigQuery (BQ) database.
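
One detail that may be worth checking before an upload: by default `write.csv()` writes the row names as an extra, unnamed first column. The sketch below uses base R only and a made-up data frame; `toy`, `out_path`, and `toy_check` are hypothetical names, and whether the extra column matters for the target table is an assumption, not something this notebook states.

```r
# Hypothetical two-row data frame standing in for the wrangled data
toy <- data.frame(
  plot_code = c("YVP 10", "YVP N111"),
  cover_pct = c(1L, NA),
  stringsAsFactors = FALSE
)

# Write to a temporary file without the row-name column;
# missing values are written as empty fields
out_path <- file.path(tempdir(), "toy_export.csv")
write.csv(toy, out_path, row.names = FALSE, na = "")

# Read the file back to confirm the columns match the data frame
toy_check <- read.csv(out_path, stringsAsFactors = FALSE)
print(names(toy_check))
```

Reading the file back is a cheap way to confirm that the exported columns are the ones the destination schema expects.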




In [None]:
filename_final = "yvp_vegetation_cover_FINAL.csv"

if (filename_final %in% list.files(getwd())) {
  cat("file already exists in working directory:", filename_final, "\n", "working directory:", getwd(), "\n")
} else {
  write.csv(yvp_vegetation_cover_FINAL, filename_final)
  cat(filename_final, "written to working directory \n", "working directory:", getwd(), "\n")
}

yvp_vegetation_cover_FINAL.csv written to working directory 
 working directory: /content 


## Push to BigQuery

"yvp_vegetation_cover_FINAL.csv" uploaded manually to BigQuery

## Export field datasheet version
Field datasheets need a complete, cumulative species list for each plot and subplot, with the cover_pct column set to 0 so that field techs can overwrite the 0 when a species is found. The date column is left blank so that field techs can fill in the survey date. Columns needed only for data analysis (plot_loc, plot_rep, and species_key) are excluded; plot_num is kept because it helps with sorting and finding plots.

**Schema for field data sheet**

* plot_num (helps for sorting and finding plots)
* plot_code
* date
* subplot
* species_code
* cover_pct
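
A small sketch of how such a sheet could be derived, using made-up observations and assuming `dplyr` is installed. `toy_cover` and `toy_sheet` are hypothetical names, and the `distinct()` plus `mutate()` idiom is one possible approach among several.

```r
library(dplyr)

# Hypothetical observations: FESIDA recorded in two different years,
# ACHMIL recorded once, all in the same plot and subplot
toy_cover <- tibble::tibble(
  plot_num     = c(10L, 10L, 10L),
  plot_code    = "YVP 10",
  date         = as.Date(c("2017-06-09", "2018-06-11", "2018-06-11")),
  subplot      = 1L,
  species_code = c("FESIDA", "FESIDA", "ACHMIL"),
  cover_pct    = c(25L, 20L, 1L)
)

toy_sheet <- toy_cover %>%
  # one row per species ever recorded in each plot and subplot
  distinct(plot_num, plot_code, subplot, species_code) %>%
  # blank date and zero cover, to be filled in by field techs
  mutate(date = as.Date(NA), cover_pct = 0) %>%
  select(plot_num, plot_code, date, subplot, species_code, cover_pct) %>%
  arrange(plot_num, plot_code, subplot, species_code)

print(toy_sheet)
```

The repeated FESIDA record collapses to a single line, so each species appears once per subplot regardless of how many years it was observed.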

In [None]:
field_datasheet = 
yvp_vegetation_cover_FINAL %>% 
select(plot_num, plot_code, subplot, species_code) %>% 
group_by(plot_num, plot_code, subplot) %>% 
distinct(species_code) %>% 
select(-species_code, species_code) %>% 
add_column(date = NA, .after = "plot_code") %>% 
add_column(cover_pct = 0) %>% 
arrange(plot_num, plot_code, subplot, species_code) %>% 
glimpse()

Rows: 8,861
Columns: 6
Groups: plot_num, plot_code, subplot [580]
$ plot_num     <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,…
$ plot_code    <chr> "YVP N7", "YVP N7", "YVP N7", "YVP N7", "YVP N7", "YVP N…
$ date         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ subplot      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,…
$ species_code <chr> "ACHMIL", "ALYALY", "BROTEC", "CAMMIC", "CARE_SP", "COLL…
$ cover_pct    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…


In [None]:
filename_field_datasheet = "yvp_vegetation_cover_field_datasheet_FINAL.csv"

if (filename_field_datasheet %in% list.files(getwd())) {
  cat("file already exists in working directory:", filename_field_datasheet, "\n", "working directory:", getwd(), "\n")
} else {
  write.csv(field_datasheet, filename_field_datasheet)
  cat(filename_field_datasheet, "written to working directory \n", "working directory:", getwd(), "\n")
}

yvp_vegetation_cover_field_datasheet_FINAL.csv written to working directory 
 working directory: /content 
