<a href="https://colab.research.google.com/github/wildlifeai/spyfish_analysis/blob/main/accuracy_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What analysed videos do we have?

**NOTE THAT CHANGES THINGS**: the citsci datasets are the aggregated
classifications for *each 10 seconds of video, not for each deployment*.
We want to compare between deployments, so I’ll need to do that myself
with a `group_by` kind of method before I start adding absent species to
the dataset.
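
A minimal sketch of that aggregation step, in plain Python on made-up rows. It assumes the deployment-level value for a species is the largest `how_many` seen in any of that site's 10-second clips, and that each `siteName` corresponds to one deployment; both assumptions need confirming. The rows and the helper name `max_per_deployment` are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-clip rows, using the Zooniverse export's column names.
clip_rows = [
    {"siteName": "TUH_005", "label": "SNAPPER", "how_many": 2},
    {"siteName": "TUH_005", "label": "SNAPPER", "how_many": 3},
    {"siteName": "TUH_005", "label": "SPOTTY", "how_many": 1},
    {"siteName": "TUH_009", "label": "SNAPPER", "how_many": 4},
]


def max_per_deployment(rows):
    """Return {(siteName, label): largest how_many across that site's clips}."""
    best = defaultdict(int)
    for row in rows:
        key = (row["siteName"], row["label"])
        best[key] = max(best[key], row["how_many"])
    return dict(best)


deployment_max = max_per_deployment(clip_rows)
for (site, species), count in sorted(deployment_max.items()):
    print(site, species, count)
```

The same grouping could be done in R with `group_by(siteName, label)` followed by `summarise()`; the point here is only that the clip rows collapse to one row per site and species.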

## Citizen scientist videos

100% complete on Zooniverse:

-   Te Whanganui o Hei (Cathedral Cove) Marine Reserve
-   LONG SPECIES LIST: Te Tapuwae o Rongokako Marine Reserve (movies) -
    **has biologist for comparison**
    -   **ASK**: the siteName column lists different sites from the
        filename column
-   LONG SPECIES LIST: Kāpiti Marine Reserve (movies)
-   LONG SPECIES LIST: Tūhua Marine Reserve (movies) - **has biologist
    for comparison**
-   LONG SPECIES LIST: Tapuae Marine Reserve (movies) - **has biologist
    for comparison**
-   LONG SPECIES LIST: Goat Island Marine Reserve (movies)

Still ongoing:

-   Tapuae Marine Reserve (movies) - 94% complete - **has biologist for
    comparison**
-   Tonga Island Marine Reserve (movies) - 64% complete - **has
    biologist for comparison**

## Biologist-analysed videos



Datasets that are in the correct format are:

-   Copy of BUV data entry sheet for **Horoirangi 2022** example sites
    jane.xlsx - *Sheet2*
-   MPAMAR data **Akaroa Pohatu BUV 2021** - Video analysis data sheet -
    DOC-7166069.xlsm
-   MPAMAR Data BUV **Akaroa Pohatu 2017** Video analysis data
    sheet.xlsm
-   MPAMAR Data BUV **Akaroa Pohatu 2019** Video analysis data
    sheet.xlsm
-   MPAMAR Data BUV **Horoirangi 2021** - Video analysis sheet.xlsm
-   MPAMAR Data BUV **Tapuae 2021** Video analysis sheet.xlsm
-   MPAMAR Data BUV **Te Angiangi 2021** Video analysis data sheet.xlsm
-   MPAMAR Data BUV **Te Tapuwae o Rongokako 2021** - Video analysis
    sheet - DOC-6731514.xlsm
-   MPAMAR Data BUV **Tonga Island 2021** Video analysis data sheet.xlsm
-   MPAMAR Data BUV **Tuhua 2021** Video analysis sheet -
    DOC-6891090.xlsm

Almost correct format but not finished:

-   MPAMAR Data BUV **Tuhua 2020** Video analysis sheet.xlsm - no
    *MaxCount compiled* sheet, but an *All counts compiled* sheet is
    present

Datasets not in the correct format are:

-   BUV data entry sheet for Horoirangi 2022 example sites.xlsx - but
    this one has a copy with the correct format as one sheet
-   MPAMAR Data BUV **Horoirangi Tonga Island MR 2004** -
    DOC-6831278.xlsx
-   MPAMAR Data BUV **Tapuae MR 2011** - DOC-1243658.xlsx
-   MPAMAR Data BUV **Tuhua MR 2004** - DOC-1159857.xlsx
-   MRMDATA - BUV - **Long Bay Okura Marine Reserve 2021** -
    DOC-7164369.xlsx
-   **Poor Knights Islands BUV analysis April 2015** - DOC-2654059.xlsx

# Import data

Here we import the data for videos that have been analysed by both
biologists and citizen scientists. Currently, that is these three:

-   LONG SPECIES LIST: Tapuae Marine Reserve (movies)
-   LONG SPECIES LIST: Te Tapuwae o Rongokako Marine Reserve (movies)
-   LONG SPECIES LIST: Tūhua Marine Reserve (movies)

**ASK** or check:

## Add a common name column

The biologist data needs a common name column (`COMMONNAME`) written in
the same format/style as Zooniverse’s exported classifications. So the
common names need to be all caps with no spaces or hyphens (e.g. BLUECOD
and SHORTTAILEDSTINGRAY), and NULL SAMPLE needs to be written as
NOTHINGHERE instead. Write this in and save the file before importing it
into R.
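
The naming rule above could be applied with a small helper rather than by hand. This is a sketch only: the function name `to_zooniverse_label` and the example inputs are hypothetical. It also removes apostrophes and macrons, which is an assumption based on labels such as SANDAGERSWRASSE and KOHERU in the species lists, not something the Zooniverse export documents.

```python
import re
import unicodedata


def to_zooniverse_label(common_name):
    """Convert a common name to the all-caps, no-separator label style."""
    # Split accented letters (e.g. macrons) into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", common_name)
    # Drop everything that is not an ASCII letter or digit, then upper-case.
    label = re.sub(r"[^A-Za-z0-9]", "", decomposed).upper()
    # Null samples are written as NOTHINGHERE on the Zooniverse side.
    return "NOTHINGHERE" if label == "NULLSAMPLE" else label


examples = ["Blue cod", "Short-tailed stingray", "Sandager's wrasse", "NULL SAMPLE"]
for name in examples:
    print(name, "->", to_zooniverse_label(name))
```

Any name that does not map cleanly this way (for example a species grouped under a combined Zooniverse label) would still need to be written in manually.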

## Connect to the Google Drive folder with the Excel files

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Load the R extension for Python (`rpy2`)

In [2]:
%load_ext rpy2.ipython

## Specify the path to the folder with the Excel data

In [3]:
%%R
path_data <- '/content/drive/MyDrive/Projects/Spyfish Aotearoa/hiromi_zooniverse/Accuracy analysis/data'

## Save the path of each Excel file

In [6]:
%%R
bio_tapuae_path <- paste(path_data, 'MPAMAR Data BUV Tapuae 2021 Video analysis sheet .xlsm', sep="/")
bio_te_tapuwae_path <- paste(path_data, 'MPAMAR Data BUV Te Tapuwae o Rongokako 2021 - Video analysis sheet - DOC-6731514.xlsm', sep="/")
bio_tuhua_path <-  paste(path_data, 'MPAMAR Data BUV Tuhua 2021 Video analysis sheet - DOC-6891090.xlsm', sep="/")

citsci_tapuae_path <- paste(path_data, 'Zooniverse - Long Species List Tapuae 2023-07-21.csv', sep="/")
citsci_te_tapuwae_path <- paste(path_data, 'Zooniverse - Long Species List Te Tapuwae o Rongokako 2023-07-13.csv', sep="/")
citsci_tuhua_path <- paste(path_data, 'Zooniverse - Long Species List Tuhua 2023-07-21.csv', sep="/")

## Read in biologist data

In [None]:
%%R
#Tapuae
library(readxl)
bio_tapuae <- read_excel(bio_tapuae_path, sheet = "MaxCount compiled")
head(bio_tapuae)
str(bio_tapuae)

#Te Tapuwae o Rongokako
bio_te_tapuwae <- read_excel(bio_te_tapuwae_path, sheet = "MaxCount compiled")
head(bio_te_tapuwae)
str(bio_te_tapuwae)

#Tūhua
bio_tuhua <- read_excel(bio_tuhua_path, sheet = "MaxCount compiled")
head(bio_tuhua)
str(bio_tuhua)

## Read in citizen scientist data


In [None]:
%%R
#Tapuae
citsci_tapuae <- read.csv(citsci_tapuae_path)
head(citsci_tapuae)
str(citsci_tapuae)

#Te Tapuwae o Rongokako
citsci_te_tapuwae <- read.csv(citsci_te_tapuwae_path)
head(citsci_te_tapuwae)
str(citsci_te_tapuwae)

#Tūhua
citsci_tuhua <- read.csv(citsci_tuhua_path)
head(citsci_tuhua)
str(citsci_tuhua)

Potentially useful variables for the accuracy analysis are:

**Biologists:**

-   SurveyName
-   LinkToMarineReserve
-   IsLongTermMonitoring
-   **SiteID**
-   Depth
-   UnderwaterVisibility
-   IsControlSite
-   IsNullSample
-   IsBadDeployment
-   **ScientificName**
-   **MaxN**
-   **TimeOfMaxN** - not so important for data, important for still
    frames workflow
-   **COMMONNAME** - same as Zooniverse label, I wrote this in, includes
    NOTHINGHERE

**Citizen scientists:**

-   **label** - species
-   **first_seen** = TimeOfMaxN
-   **how_many** = MaxN
-   **filename** - has site name in it
-   **siteName** - Tūhua is fine, but Te Tapuwae’s values differ from
    the site names in the filenames
-   classifications_count
-   retirement_reason
-   n_users
-   SurveyID
-   subject_ids

# Preparing the datasets

The site ID info is in the column SiteID in biologist classifications
and the column siteName in citizen scientist classifications.

-   **ASK**: **the siteName column and the site name in the movie file
    name are different in Te Tapuwae’s citsci dataset. Leave Te Tapuwae
    for now.** This isn’t a problem with the other two.
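
One way to see how widespread the mismatch is would be to compare each filename's leading site code with its `siteName`. The sketch below assumes the site code is the first seven characters of the filename; the rows, site codes and helper name `site_mismatches` are made up for illustration.

```python
# Hypothetical rows; real filenames and site codes will differ.
rows = [
    {"filename": "ABC_001_clip_01.mp4", "siteName": "ABC_001"},
    {"filename": "ABC_002_clip_01.mp4", "siteName": "ABC_012"},
    {"filename": "ABC_003_clip_04.mp4", "siteName": "ABC_003"},
]


def site_mismatches(rows, prefix_length=7):
    """Return the rows whose filename prefix disagrees with siteName."""
    return [
        row for row in rows
        if row["filename"][:prefix_length] != row["siteName"]
    ]


mismatched = site_mismatches(rows)
for row in mismatched:
    print(row["filename"], "is listed under", row["siteName"])
```

A list like this, run on the real export, would give concrete examples to bring along when asking about the discrepancy.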

### Rename columns to match and round biologist MaxN

-   **ASK**: why do the biologist MaxN values have so many decimal places?


In [None]:
%%R
library(dplyr)

#Function to process citsci dataset
citsci_process <- function(df) {
  df %>%
    rename(citsci_species = label,
         citsci_first_seen = first_seen,
         citsci_how_many = how_many,
         siteID = siteName)
}

#Processing the datasets
citsci_tapuae <- citsci_process(citsci_tapuae)
str(citsci_tapuae)
citsci_tuhua <- citsci_process(citsci_tuhua)

#Function to process biologist dataset
bio_process <- function(df) {
  df %>%
    #Round MaxN to whole numbers (as.numeric() in case it was read in as text)
    mutate(MaxN = round(as.numeric(MaxN))) %>%
    #Rename columns to match the citizen scientist dataset
    rename(bio_species = COMMONNAME,
           bio_first_seen = TimeOfMaxN,
           bio_how_many = MaxN,
           siteID = SiteID)
}

#Processing the datasets
bio_tapuae <- bio_process(bio_tapuae)
bio_tuhua <- bio_process(bio_tuhua)

# Adding species absences

It will be helpful to include species whose maximum count (MaxN) is zero
in the dataset so we can compare those as well, particularly if the
biologists notice something that the citizen scientists don’t. Currently
the Zooniverse and biologist data only show species presences, and when
nothing appears at all the record says “NULL DEPLOYMENT” (biologist
data) or “NOTHINGHERE” (Zooniverse data).
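
The idea can be sketched in a few lines of plain Python before working through it in R. This is an illustration under assumptions, not the notebook's method: the generic keys `species` and `how_many`, the helper name `add_absences` and the rows are all hypothetical. Species outside the list are simply ignored here and would need separate handling. Repeated rows for one site and species keep the last value.

```python
def add_absences(rows, species_list):
    """Return one row per (siteID, species), with 0 for unrecorded species."""
    # Drop bad deployments first so they never receive zero rows.
    kept = [r for r in rows if r["species"] != "BADDEPLOYMENT"]
    # Record sites before ignoring NOTHINGHERE, so empty deployments stay in.
    sites = sorted({r["siteID"] for r in kept})
    counts = {
        (r["siteID"], r["species"]): r["how_many"]
        for r in kept
        if r["species"] != "NOTHINGHERE"
    }
    return [
        {"siteID": site, "species": sp, "how_many": counts.get((site, sp), 0)}
        for site in sites
        for sp in species_list
    ]


rows = [
    {"siteID": "TUH_005", "species": "SNAPPER", "how_many": 3},
    {"siteID": "TUH_006", "species": "NOTHINGHERE", "how_many": None},
    {"siteID": "TUH_007", "species": "BADDEPLOYMENT", "how_many": None},
]
filled = add_absences(rows, ["SNAPPER", "SPOTTY"])
for row in filled:
    print(row)
```

The order of operations matters: a deployment with only a NOTHINGHERE record should end up with a zero for every species, while a bad deployment should disappear entirely.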

## Testing adding species absences with a smaller dataset

Below, I specified the Zooniverse species list for Tūhua and made it so
that species absences are also recorded in the dataset, with a
NOTHINGHERE record producing a zero count for every species. Bad
deployments are deleted.



In [None]:
%%R
#Create the original "test_bio_tuhua" dataset (semi mock, semi real)
test_bio_tuhua <- data.frame(
  bio_species = c("SPOTTY", "SCARLETWRASSE", "MORAYEEL", "GOATFISH", "SCARLETWRASSE", "REDPIGFISH", "SNAPPER", "NOTHINGHERE", "MORAYEEL", "SNAPPER", "SCORPIONFISH", "SCHOOLSHARK"),
  bio_first_seen = c(0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 2, 5),
  bio_how_many = c(1, 1, 1, 1, 1, 1, 3, NA, 2, 4, 1, 8),
  siteID = c("TUH_005", "TUH_005", "TUH_005", "TUH_005", "TUH_005", "TUH_005", "TUH_005", "TUH_006", "TUH_006", "TUH_009", "TUH_009", "TUH_009"),
  stringsAsFactors = FALSE
)

#Create the species list
species_list <- c(
  "BANDEDWRASSE", "BLUEMAOMAO", "BLUEMOKI",
  "BUTTERFLYPERCH", "EAGLERAY", "GOATFISH", "GREENBONEBUTTERFISH",
  "HIWIHIWI", "JACKMACKEREL","KAHAWAI", "KINGFISH", "KOHERU",
  "MARBLEFISH", "MORAYEEL", "PINKMAOMAO", "PORAE", "REDANDHALFBANDEDPERCHES",
  "REDMOKI", "REDPIGFISH", "SANDAGERSWRASSE", "SCARLETWRASSE",
  "SHORTTAILEDSTINGRAY", "SILVERSWEEP", "SINGLESPOTDEMOISELLE",
  "SMOOTHLEATHERJACKET", "SNAPPER", "SPOTTY", "TARAKIHI",
  "TREVALLY", "TWOSPOTDEMOISELLE", "MARINEMAMMAL", "OTHER"
)

#Function to add missing species rows for a given siteID
add_missing_bio_species <- function(df, site_id) {
  existing_species <- df$bio_species[df$siteID == site_id]
  missing_species <- setdiff(species_list, existing_species)

  new_rows <- data.frame(
    bio_species = missing_species,
    bio_first_seen = rep(NA, length(missing_species)),
    bio_how_many = rep(0, length(missing_species)),
    siteID = rep(site_id, length(missing_species)),
    stringsAsFactors = FALSE
  )

  df <- rbind(df, new_rows)
  return(df)
}

#Remove rows with "BADDEPLOYMENT" first so bad deployments get no absence rows
test_bio_tuhua <- test_bio_tuhua[test_bio_tuhua$bio_species != "BADDEPLOYMENT", ]

#Record the sites before removing "NOTHINGHERE", so empty deployments keep their siteID
unique_sites <- unique(test_bio_tuhua$siteID)
test_bio_tuhua <- test_bio_tuhua[test_bio_tuhua$bio_species != "NOTHINGHERE", ]

#Add missing species rows (count of 0) for every site
for (site in unique_sites) {
  test_bio_tuhua <- add_missing_bio_species(test_bio_tuhua, site)
}

#Replace species not in the species list with "OTHER"
test_bio_tuhua$bio_species[!(test_bio_tuhua$bio_species %in% species_list)] <- "OTHER"

#Compile rows with "OTHER" species into one row per siteID
other_rows <- test_bio_tuhua[test_bio_tuhua$bio_species == "OTHER", ]
unique_sites <- unique(other_rows$siteID)

consolidated_rows <- data.frame(
  bio_species = "OTHER",
  bio_first_seen = NA,
  bio_how_many = sapply(unique_sites, function(site) sum(other_rows$bio_how_many[other_rows$siteID == site])),
  siteID = unique_sites,
  stringsAsFactors = FALSE
)

test_bio_tuhua <- test_bio_tuhua[test_bio_tuhua$bio_species != "OTHER", ]
test_bio_tuhua <- rbind(test_bio_tuhua, consolidated_rows)

#Sort the dataset by siteID
test_bio_tuhua <- test_bio_tuhua[order(test_bio_tuhua$siteID), ]

#Reset row names
rownames(test_bio_tuhua) <- NULL

#Print the final modified dataset
print(test_bio_tuhua)


### Testing with a small citizen scientist dataset

In [None]:
%%R
test_citsci_tuhua <- head(
  select(citsci_tuhua,
         citsci_species, citsci_first_seen, citsci_how_many, siteID), #can add more columns later if wanted
  n = 30
)

#Create the species list
species_list <- c(
  "BANDEDWRASSE", "BLUEMAOMAO", "BLUEMOKI",
  "BUTTERFLYPERCH", "EAGLERAY", "GOATFISH", "GREENBONEBUTTERFISH",
  "HIWIHIWI", "JACKMACKEREL","KAHAWAI", "KINGFISH", "KOHERU",
  "MARBLEFISH", "MORAYEEL", "PINKMAOMAO", "PORAE", "REDANDHALFBANDEDPERCHES",
  "REDMOKI", "REDPIGFISH", "SANDAGERSWRASSE", "SCARLETWRASSE",
  "SHORTTAILEDSTINGRAY", "SILVERSWEEP", "SINGLESPOTDEMOISELLE",
  "SMOOTHLEATHERJACKET", "SNAPPER", "SPOTTY", "TARAKIHI",
  "TREVALLY", "TWOSPOTDEMOISELLE", "MARINEMAMMAL", "OTHER"
)

#Function to add missing species rows for a given siteID
add_missing_citsci_species <- function(df, site_id) {
  existing_species <- df$citsci_species[df$siteID == site_id]
  missing_species <- setdiff(species_list, existing_species)

  new_rows <- data.frame(
    citsci_species = missing_species,
    citsci_first_seen = rep(NA, length(missing_species)),
    citsci_how_many = rep(0, length(missing_species)),
    siteID = rep(site_id, length(missing_species)),
    stringsAsFactors = FALSE
  )

  df <- rbind(df, new_rows)
  return(df)
}

#Record the sites before removing "NOTHINGHERE", so empty deployments keep their siteID
unique_sites <- unique(test_citsci_tuhua$siteID)
test_citsci_tuhua <- test_citsci_tuhua[test_citsci_tuhua$citsci_species != "NOTHINGHERE", ]

#Add missing species rows (count of 0) for every site
for (site in unique_sites) {
  test_citsci_tuhua <- add_missing_citsci_species(test_citsci_tuhua, site)
}

#Replace species not in the species list with "OTHER"
test_citsci_tuhua$citsci_species[!(test_citsci_tuhua$citsci_species %in% species_list)] <- "OTHER"

#Compile rows with "OTHER" species into one row per siteID
other_rows <- test_citsci_tuhua[test_citsci_tuhua$citsci_species == "OTHER", ]
unique_sites <- unique(other_rows$siteID)

consolidated_rows <- data.frame(
  citsci_species = "OTHER",
  citsci_first_seen = NA,
  citsci_how_many = sapply(unique_sites, function(site) sum(other_rows$citsci_how_many[other_rows$siteID == site])),
  siteID = unique_sites,
  stringsAsFactors = FALSE
)

test_citsci_tuhua <- test_citsci_tuhua[test_citsci_tuhua$citsci_species != "OTHER", ]
test_citsci_tuhua <- rbind(test_citsci_tuhua, consolidated_rows)

# Sort the dataset by siteID
test_citsci_tuhua <- test_citsci_tuhua[order(test_citsci_tuhua$siteID), ]

# Reset row names
rownames(test_citsci_tuhua) <- NULL

# Print the final modified dataset
print(test_citsci_tuhua)

# TE TAPUWAE - cit sci dataset site code problem, leave for now

All the code below was written and run before I noticed the site code
problem, where the site codes in the biologist data differ from those in
the Zooniverse data.

# Preparing the datasets

The site ID info is in the column SiteID in biologist classifications
and the column filename in citizen scientist classifications.

In [13]:
%%R
#Create a new column called SiteID in the citizen science dataset
citsci_te_tapuwae <- transform(citsci_te_tapuwae,
                                SiteID = substr(filename, 1, 7)) #first seven characters of the filename column values


#Round values in the biologist dataset (this rounds TimeOfMaxN; MaxN is left unchanged)
bio_te_tapuwae <- bio_te_tapuwae %>%
  mutate(TimeOfMaxN = round(as.numeric(TimeOfMaxN)))

### Rename columns

In [14]:
%%R
#Citizen science datasets
library(dplyr)
citsci_te_tapuwae <- citsci_te_tapuwae %>%
  rename(
    citsci_species = label,
    citsci_first_seen = first_seen,
    citsci_how_many = how_many
  )

#Biologist datasets
bio_te_tapuwae <- bio_te_tapuwae %>%
  rename(
    bio_species = COMMONNAME,
    bio_first_seen = TimeOfMaxN,
    bio_how_many = MaxN
  )

# Adding species absences

## Testing with a smaller dataset



In [15]:
%%R
#Take a small subset of the biologist dataset for testing
library(dplyr)

test_bio_te_tapuwae <- head(
  select(bio_te_tapuwae,
         bio_species, bio_first_seen, bio_how_many, SiteID),
  n = 30
)
test_bio_te_tapuwae <- test_bio_te_tapuwae[15:25,]

Below, I specified the Zooniverse species list for Te Tapuwae and made it so that species absences are also recorded in the dataset, with a NOTHINGHERE record producing a zero count for every species. Bad deployments are deleted.


In [None]:
%%R
#Te Tapuwae o Rongokako species list
species_list <- c(
  "BANDEDWRASSE", "BLUECOD", "BLUEMAOMAO", "BLUEMOKI", "BROADNOSESEVENGILLSHARK",
  "BUTTERFLYPERCH", "TWOSPOTDEMOISELLE", "CONGEREEL", "EAGLERAY", "HIWIHIWI",
  "KAHAWAI", "MARBLEFISH", "MORAYEEL", "PORAE", "REDANDHALFBANDEDPERCHES",
  "REDMOKI", "ROCKLOBSTER", "SCARLETWRASSE", "SCHOOLSHARK", "SCORPIONFISH",
  "SHORTTAILEDSTINGRAY", "SILVERSWEEP", "SMOOTHLEATHERJACKET", "SNAPPER",
  "SPOTTY", "TARAKIHI", "TREVALLY", "MARINEMAMMAL"
)

#Function to add missing species rows for a given SiteID (biologist version)
add_missing_bio_species <- function(df, site_id) {
  existing_species <- df$bio_species[df$SiteID == site_id]
  missing_species <- setdiff(species_list, existing_species)

  #Store counts as numbers so the added zeros match the existing column type
  df$bio_how_many <- as.numeric(df$bio_how_many)

  new_rows <- tibble(
    bio_species = missing_species,
    bio_first_seen = rep(NA, length(missing_species)),
    bio_how_many = rep(0, length(missing_species)),
    SiteID = rep(site_id, length(missing_species))
  )

  df <- bind_rows(df, new_rows)
  return(df)
}

#Removing rows with bad deployment
test_bio_te_tapuwae <- test_bio_te_tapuwae %>%
  filter(bio_species != "BADDEPLOYMENT")

#Removing rows with "NOTHINGHERE" and adding missing species rows
test_bio_te_tapuwae <- test_bio_te_tapuwae %>%
  filter(bio_species != "NOTHINGHERE") %>%
  group_by(SiteID) %>%
  do(add_missing_bio_species(., unique(.$SiteID))) %>%
  ungroup()

#Output the updated dataset
test_bio_te_tapuwae


### Testing with a small citizen scientist dataset



In [None]:
%%R
#Obtain test dataset (has a NOTHINGHERE in it at the start)
test_citsci_te_tapuwae <- head(
  select(citsci_te_tapuwae,
         citsci_species, citsci_first_seen, citsci_how_many, SiteID),
  n = 30
)

#Function to add missing species rows for a given SiteID (citizen scientist version)
add_missing_citsci_species <- function(df, site_id) {
  existing_species <- df$citsci_species[df$SiteID == site_id]
  missing_species <- setdiff(species_list, existing_species)

  #Store counts as numbers so the added zeros match the existing column type
  df$citsci_how_many <- as.numeric(df$citsci_how_many)

  new_rows <- tibble(
    citsci_species = missing_species,
    citsci_first_seen = rep(NA, length(missing_species)),
    citsci_how_many = rep(0, length(missing_species)),
    SiteID = rep(site_id, length(missing_species))
  )

  df <- bind_rows(df, new_rows)
  return(df)
}

#Removing rows with bad deployment
test_citsci_te_tapuwae <- test_citsci_te_tapuwae %>%
  filter(citsci_species != "BADDEPLOYMENT")

#Removing rows with "NOTHINGHERE" and adding missing species rows
test_citsci_te_tapuwae <- test_citsci_te_tapuwae %>%
  filter(citsci_species != "NOTHINGHERE") %>%
  group_by(SiteID) %>%
  do(add_missing_citsci_species(., unique(.$SiteID))) %>%
  ungroup()

#Output the updated dataset
test_citsci_te_tapuwae