# Check BUV Deployment sheet help

This notebook is part of the Spyfish Aotearoa data standardisation efforts. It cleans the existing BUV Deployments csv file (obtained from the SharePoint list with the same name).

The output of this notebook is:
- lists of rows with suspicious values
- a csv file with cleaned SurveyIDs, SiteIDs and DropIDs, the expected fileName and LinkToVideoFile, and columns stating whether these last two match the existing values and, for fileName, what the discrepancy is.


Some of this code will be repurposed for ongoing checks of the BUV Deployment data as part of the pipeline.



The following SharePoint lists are currently available:

- BUV Deployment
- BUV Survey Metadata
- BUV Survey Sites
- Marine reserves
- BUV Metadata Definitions

In [118]:
# Last changed 2025.04.15

In [None]:
## Run the code below first if you get "ModuleNotFoundError: No module named 'sftk'" or a similar error.
## See the README.md Usage section for instructions, or uncomment and run:

# import sys
# sys.path.append('path/to/Spyfish-Aotearoa-toolkit')


In [35]:
import os
import logging  # find logs in the following folders .sftk > logs - defined in the log_config
import pandas as pd
from pathlib import Path

from sftk.utils import read_file_to_df, is_format_match

## Load files

In [36]:
data_folder_path = "path/to/folder/with/data"

In [None]:
# Used the csv retrieved from the sharepoint list
buv_df = read_file_to_df(os.path.join(data_folder_path, "BUV Deployment.csv"))
buv_df.shape

In [None]:
buv_df

In [None]:
buv_df.columns

In [None]:
reserves_df = read_file_to_df(os.path.join(data_folder_path, "Marine Reserves.csv"))
reserves_df.sample(3)

In [None]:
reserves_df.columns

In [None]:
survey_df = read_file_to_df(os.path.join(data_folder_path, "BUV Survey Metadata.csv"))
survey_df.sample(3)

In [None]:
survey_df.columns

In [None]:
sites_df = read_file_to_df(os.path.join(data_folder_path, "BUV Survey Sites.csv"))
sites_df.sample(3)

In [None]:
sites_df.columns

## Extract column sets

In [None]:
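survey_ids = survey_df["SurveyID"]
print(len(set(survey_ids)), len(survey_ids))
# Convert to a set before adding the extra ID: Series.update would instead
# overwrite existing values by index (and modify survey_df in place)
survey_ids = set(survey_ids)
survey_ids.update(["RTT_20250226_BUV"])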
# survey_ids

In [None]:
# Check the surveys that have double location acronyms
# (one filter for both pairs, since a cell only displays its last expression)
survey_df[survey_df["SurveyLocationAcronym"].isin(['AKA; POU', 'CRP; TAW'])]

In [None]:
survey_acronyms = survey_df["SurveyLocationAcronym"] 
print(len(set(survey_acronyms)), len(survey_acronyms)) # ok to differ, as there are multiple years for each acronym
survey_acronyms = set(survey_acronyms)
# TODO check if this is ok: these acronyms are added to account for acronym pairs, e.g., 'CRP; TAW'
survey_acronyms.update(["CRP", "AKA", "POU", "BNP"])
print(len(survey_acronyms))
# survey_acronyms
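
As an alternative to listing the individual acronyms by hand, the paired entries could be split on the semicolon. The sketch below assumes pairs are always written like `'CRP; TAW'`; `split_acronym_pairs` is a hypothetical helper, not part of `sftk`.

```python
def split_acronym_pairs(acronyms):
    """Return the individual acronyms found in an iterable of acronym entries.

    Entries such as 'CRP; TAW' are split on ';'. Non-string entries
    (for example NaN from empty cells) are skipped.
    """
    individual = set()
    for entry in acronyms:
        if not isinstance(entry, str):
            continue
        for part in entry.split(";"):
            part = part.strip()
            if part:
                individual.add(part)
    return individual


# Made-up example values, including a missing entry
example_acronyms = {"AKA; POU", "CRP; TAW", "SLI", float("nan")}
individual_acronyms = split_acronym_pairs(example_acronyms)
print(sorted(individual_acronyms))
```

The result could be merged into the existing set with `survey_acronyms.update(...)`, which keeps the paired entries available as well.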

In [None]:
reserve_acronyms = reserves_df["SurveyLocationAcronym"]
print(len(set(reserve_acronyms)), len(reserve_acronyms))
reserve_acronyms = set(reserve_acronyms)
# reserve_acronyms

In [None]:
site_ids = sites_df["SiteID"]
print(len(set(site_ids)), len(site_ids))
site_ids = set(site_ids)
# site_ids

### Check SiteID duplicates:

In [51]:
# TODO Check duplicate SiteIDs

duplicate_site_ids_df = sites_df[sites_df.duplicated(subset=["SiteID"], keep=False)].sort_values(by="SiteID")
# duplicate_site_ids_df

In [None]:
duplicate_site_ids_df["SiteID"].unique()

## Check all entries have respective "parent" in definition list


In [None]:
# Reserve acronyms that do not have survey equivalent
print(len(reserve_acronyms - survey_acronyms))
reserve_acronyms - survey_acronyms

In [None]:
# SurveyID acronyms that do not have equivalent in reserve acronyms
print(len(survey_acronyms - reserve_acronyms))
survey_acronyms - reserve_acronyms
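
The two cells above are the same check run in both directions. A minimal sketch of that idea with plain Python sets follows; `compare_id_sets` and the example values are hypothetical.

```python
def compare_id_sets(children, parents):
    """Return IDs missing on either side of a child/parent relationship."""
    children, parents = set(children), set(parents)
    return {
        "children_without_parent": children - parents,
        "parents_without_children": parents - children,
    }


# Made-up acronyms: 'XXX' is used in a survey but not defined as a reserve
report = compare_id_sets(
    children={"CRP", "SLI", "XXX"},
    parents={"CRP", "SLI", "ANG"},
)
print(report)
```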

In [None]:
# TODO check if survey acronyms the same as surveyIDs
survey_df[survey_df["SurveyID"].str[:3] != survey_df["SurveyLocationAcronym"]]

In [None]:
# SiteIDs in BUV deployment have equivalent in sites_df
buv_sites = set(buv_df["SiteID"].unique())
buv_sites - site_ids

In [None]:
len(site_ids  - buv_sites)

## Review Various Columns

### Fix survey IDs

In [58]:
# Combinations of all acronyms that can be at the beginning of a survey
acronym_pattern = "|".join(survey_acronyms)
# print(acronym_pattern)
survey_id_pattern = fr"^({acronym_pattern})_(\d{{8}})_BUV$"

site_id_pattern = fr"^({acronym_pattern})_(\d{{3}})$"
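
To make the two patterns concrete, here is a small standalone sketch using only `re`. The acronym list is made up, and `matches` is a hypothetical stand-in for `is_format_match`, whose exact behaviour is not shown in this notebook. `re.escape` is added as a precaution in case an acronym ever contains a regex metacharacter.

```python
import re

# Hypothetical acronyms; the notebook derives the real set from survey_df
example_acronyms = ["CRP", "TAW", "SLI"]
example_acronym_pattern = "|".join(re.escape(a) for a in sorted(example_acronyms))

example_survey_id_pattern = rf"^({example_acronym_pattern})_(\d{{8}})_BUV$"
example_site_id_pattern = rf"^({example_acronym_pattern})_(\d{{3}})$"


def matches(pattern, value):
    """Return True when value is a string that matches the pattern."""
    return isinstance(value, str) and re.match(pattern, value) is not None


print(matches(example_survey_id_pattern, "CRP_20220407_BUV"))  # well-formed SurveyID
print(matches(example_survey_id_pattern, "RTT_BUV_20250226"))  # unknown acronym, wrong order
print(matches(example_site_id_pattern, "SLI_105"))             # well-formed SiteID
print(matches(example_site_id_pattern, float("nan")))          # missing value
```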

In [None]:
buv_df[buv_df["SurveyID"].isna()]

Check that all SurveyIDs comply with the expected format:

In [60]:
def confirm_fix_survey_ids(row):
    survey_id = row["SurveyID"]
    if survey_id == "RTT_BUV_20250226":
        return "RTT_20250226_BUV"
    # TODO check if this is still needed if pd.isna is handled in is_format_match, and also check if that can solve it
    if isinstance(survey_id, float):  # when SurveyID is missing (NaN)
        try:
            if is_format_match(survey_id_pattern, row["DropID"][:16]):
                return row["DropID"][:16]
        except Exception as e:
            # row.get avoids a second error when the row has no DropID column;
            # single quotes inside the f-string keep it valid before Python 3.12
            logging.error(f"Error processing survey with DropID {row.get('DropID')}: {e}")
        return f"FIX_{survey_id}"

    if not is_format_match(survey_id_pattern, survey_id):
        # logging.warning(f"{survey_id} doesn't follow the SurveyID format")
        print(f"{survey_id} doesn't follow the SurveyID format")
    return survey_id

In [61]:
buv_df["new_SurveyID"] = buv_df.apply(confirm_fix_survey_ids, axis=1)
survey_df["new_SurveyID"] = survey_df.apply(confirm_fix_survey_ids, axis=1)

### Fix SiteIDs 

- get from siteid
- get from filename
- TODO: get from lat lon (Some of the SiteIDs with missing values might have some issues with Lat Lon)

In [62]:
def fix_SiteID(row):
    # TODO watch out, if filename fixed the site is later in the string
    # if row["fileName"] == "CRP_20220407_BUV_CRP_018_01.mp4":
        # print(row)
    site_id = row["SiteID"]
    survey_acronym = row["new_SurveyID"][:3]
    site_pattern = r"^_\d{3}$"
    if not is_format_match(site_id_pattern, site_id):
        if row["fileName"] == "CRP_20220407_BUV_CRP_018_01.mp4":
            print(row["fileName"], row["fileName"][:7], row["fileName"][17:24] )

        try: # filename route
            site_acronym = row["fileName"][17:20]
            site_num = row["fileName"][20:24] 
        except Exception as e:
            logging.error(f"Error processing survey {row['new_SurveyID']} filename {row['fileName']}: {e}")
            return f"FIX_{site_id}"
    else:
        # print(site_id)
        site_acronym = site_id[:3]
        site_num = site_id[3:]
    # Accept matching acronyms, plus TAW sites recorded under CRP surveys (the 'CRP; TAW' pair)
    if site_acronym == survey_acronym or (site_acronym == "TAW" and survey_acronym == "CRP"):
        if is_format_match(site_pattern, site_num):
            return site_acronym + site_num
         
    return f"FIX_{site_id}"
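
The fixed-position slices in `fix_SiteID` rely on the file name layout `SurveyID_SiteID_count.mp4`. The sketch below shows which characters those slices pick out, and a regex-based alternative that does not depend on positions. `parse_file_name` is a hypothetical helper, and the pattern assumes three-letter uppercase acronyms, which may not hold for every record.

```python
import re

example_name = "CRP_20220407_BUV_CRP_018_01.mp4"

# The slices used in the notebook
print(example_name[:16])    # SurveyID part
print(example_name[17:20])  # site acronym
print(example_name[20:24])  # underscore plus three-digit site number

# Assumed layout: <ACR>_<8 digits>_BUV_<ACR>_<3 digits>_<2 digits>.mp4
FILE_NAME_RE = re.compile(
    r"^(?P<survey_id>[A-Z]{3}_\d{8}_BUV)"
    r"_(?P<site_id>[A-Z]{3}_\d{3})"
    r"_(?P<count>\d{2})\.mp4$"
)


def parse_file_name(file_name):
    """Return the file name parts as a dict, or None if the layout differs."""
    if not isinstance(file_name, str):
        return None
    match = FILE_NAME_RE.match(file_name)
    return match.groupdict() if match else None


parts = parse_file_name(example_name)
print(parts)
```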

In [None]:
print(buv_df[buv_df["SiteID"].isna()].shape)
buv_df["new_SiteID"] = buv_df.apply(fix_SiteID, axis=1)
print(buv_df[buv_df["new_SiteID"].astype(str).str.startswith("FIX")].shape)

In [None]:

buv_df[buv_df["new_SiteID"].astype(str).str.startswith("FIX")]
# WGI_20220518_BUV	AHE_060 - are they also related?
# RON_20250128_BUV has plus LAT


In [None]:
buv_df[buv_df["fileName"] == "CRP_20220407_BUV_CRP_018_01.mp4"]

### Get repeated DeploymentIDs

Repeats happen when the first attempts at a site are null or bad deployments. The highest duplicate_count should belong to the good deployment.

In [None]:
# Use the cleaned IDs, consistent with the groupby in the next cell
buv_df.duplicated(subset=["new_SurveyID", "new_SiteID"], keep=False).sum()

In [None]:
buv_df["duplicate_count"] = buv_df.groupby(["new_SurveyID", "new_SiteID"]).cumcount() + 1
len(buv_df[buv_df["duplicate_count"].isna()]) # should be 0
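
A toy example of how `groupby(...).cumcount()` numbers repeated survey/site pairs, using made-up rows. Note that `cumcount` follows row order, so the highest count only lands on the good deployment if the rows are already in chronological order; that ordering is an assumption about the input data.

```python
import pandas as pd

# Made-up rows: one survey/site pair appears three times
example_df = pd.DataFrame({
    "new_SurveyID": ["SLI_20240124_BUV"] * 3 + ["CRP_20220407_BUV"],
    "new_SiteID": ["SLI_105", "SLI_105", "SLI_105", "CRP_018"],
})

# Number each repeat within its (survey, site) group, starting at 1
example_df["duplicate_count"] = (
    example_df.groupby(["new_SurveyID", "new_SiteID"]).cumcount() + 1
)

# Zero-padded two-digit suffix, as in the DropID format
example_df["new_DropID"] = (
    example_df["new_SurveyID"]
    + "_"
    + example_df["new_SiteID"]
    + "_"
    + example_df["duplicate_count"].astype(str).str.zfill(2)
)
print(example_df)
```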

In [None]:
#buv_df[buv_df["SurveyID"].str.startswith("SLI")][["new_SurveyID", "new_SiteID","duplicate_count", "IsBadDeployment"]]

In [None]:
# TODO Potential issue: ANG lots of bad deployment not many redone deployments
buv_df[buv_df["new_SurveyID"].str.startswith("ANG")][["new_SurveyID", "new_SiteID","duplicate_count", "IsBadDeployment"]]


In [37]:
def make_new_DropID(row):
     return f'{row["new_SurveyID"]}_{row["new_SiteID"]}_{int(row["duplicate_count"]):02d}'

buv_df["new_DropID"] = buv_df.apply(make_new_DropID, axis=1)

In [None]:

len(buv_df[buv_df.duplicated(subset=["new_DropID"], keep=False)]) # should be 0

## Create new fileName and LinkToVideoFile entries with new_DropID info

In [None]:
buv_df["new_fileName"] = buv_df["new_DropID"] + ".mp4"
buv_df["new_fileName"]

In [217]:
# Example LinkToVideoFile: SurveyID/DropID/fileName
# buv_df["LinkToVideoFile"].iloc[0]

In [None]:
# Combine path parts to create LinkToVideoFile

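# Build the link as a plain string with "/" separators, so it can be compared
# directly with the existing LinkToVideoFile strings (Path objects never equal strings)
buv_df["new_LinkToVideoFile"] = buv_df["new_SurveyID"] + "/" + buv_df["new_DropID"] + "/" + buv_df["new_fileName"]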
buv_df["new_LinkToVideoFile"]
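
A small sketch of the `SurveyID/DropID/fileName` layout for a single row. It uses `PurePosixPath` so the separator is always `/` regardless of operating system, and converts to `str` because a path object never compares equal to a string. `build_video_link` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def build_video_link(survey_id, drop_id, file_name):
    """Join the three parts with '/' and return a plain string."""
    return str(PurePosixPath(survey_id) / drop_id / file_name)


link = build_video_link(
    "CRP_20220407_BUV",
    "CRP_20220407_BUV_CRP_018_01",
    "CRP_20220407_BUV_CRP_018_01.mp4",
)
print(link)

# A path object and the equivalent string are not equal
print(PurePosixPath(link) == link)
```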


In [41]:
def is_match_fileName(row):
    """Check if new and old file names are the same.
    
    The function flags two situations (on top of matches): 
     - when the only error is the number of 0s in deployment number
     - when the deployment duplicate number is different.
     """
    if row["fileName"] ==  row["new_fileName"]:
        return "True"
    try: 
        # discrepancy with the duplicate number
        if row["fileName"][:-5] == row["new_fileName"][:-5] and row["fileName"][-5] != row["new_fileName"][-5]:
            return "deployment_duplicate"
    except (TypeError, IndexError):  # missing (NaN) or very short fileName
        # print(row["fileName"])
        pass
    try: 
        # discrepancy with the number of zeros in duplicate number 
        if row["fileName"][:-7] + row["fileName"][-5:] == row["new_fileName"]:
            return "digit_num"
    except (TypeError, IndexError):  # missing (NaN) or very short fileName
        # print(row["fileName"])
        pass
  
    return "False"
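
To see what the two slice comparisons catch, here is a standalone sketch of the same logic on made-up names. `classify_file_name` is a hypothetical copy that takes two strings instead of a row. Note that the `digit_num` slice removes exactly two characters, so a four-digit suffix is flagged but a three-digit one is not.

```python
def classify_file_name(old_name, new_name):
    """Mirror the comparison logic on plain strings."""
    if old_name == new_name:
        return "True"
    if not isinstance(old_name, str):  # missing fileName
        return "False"
    # Same name except for the last digit before '.mp4'
    if old_name[:-5] == new_name[:-5] and old_name[-5:-4] != new_name[-5:-4]:
        return "deployment_duplicate"
    # Same name once two extra characters before the last digit are dropped
    if old_name[:-7] + old_name[-5:] == new_name:
        return "digit_num"
    return "False"


new = "SLI_20240124_BUV_SLI_105_01.mp4"
print(classify_file_name("SLI_20240124_BUV_SLI_105_01.mp4", new))
print(classify_file_name("SLI_20240124_BUV_SLI_105_02.mp4", new))
print(classify_file_name("SLI_20240124_BUV_SLI_105_0001.mp4", new))
print(classify_file_name("SLI_20240124_BUV_SLI_105_001.mp4", new))
print(classify_file_name(float("nan"), new))
```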


In [42]:
# create columns with the info on how the old and new columns mis-match
buv_df["match_fileName"] = buv_df.apply(is_match_fileName, axis=1)
buv_df["match_LinkToVideoFile"] = buv_df["LinkToVideoFile"] == buv_df["new_LinkToVideoFile"]

In [None]:
# TODO check more closely the situations where the duplicate num does not match.
with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
    display( buv_df[(buv_df["match_fileName"] == "deployment_duplicate")][["fileName", "new_fileName", "match_fileName"]])

In [None]:
# Show all fileNames that do not match (and are not NA)
with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
   display( buv_df[(buv_df["match_fileName"] != "True") & (~buv_df["fileName"].isna())][["fileName", "new_fileName", "match_fileName"]])


In [None]:
# TODO: Another example of duplicate_count issue, 
# All the SLI_20240124_BUV / SLI_105 have False isBadDeployment 
# Where is 03 ?
with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
   display(buv_df[buv_df["new_SiteID"] == "SLI_105"])


In [None]:
# TODO another example issue DropID == SLI_20240124_BUV_SLI_005_02 but there is no 01 for that year/site
with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
   display(buv_df[buv_df["new_SiteID"] == "SLI_005"])


## Export current state

In [None]:
buv_df.columns

In [None]:
buv_df.to_csv("BUV Deployments Comparison 2025-04-08.csv", index=False)

If you want to export the "new" version of the data, assuming it's all correct:

In [None]:
to_export = buv_df.copy()
to_export = to_export[['new_DropID', 'new_SurveyID', 'new_SiteID', 'Latitude', 'Longitude', 'EventDate',
       'Created By', 'TideLevel', 'Weather', 'UnderwaterVisibility',
       'ReplicateWithinSite', 'EventTimeStart', 'EventTimeEnd',
       'DepthDeployment', 'DepthStrata', 'NZMHCS_Abiotic', 'NZMHCS_Biotic',
       'NotesDeployment', 'RecordedBy', 'IsBadDeployment', 'fps', 'duration',
       'new_fileName', 'new_LinkToVideoFile', 'SamplingStart', 'SamplingEnd', 'ID']]
to_export.rename(columns={
    "new_DropID": "DropID",
    'new_SurveyID': 'SurveyID', 
    'new_SiteID': 'SiteID',
    'new_fileName': 'fileName', 
    'new_LinkToVideoFile': 'LinkToVideoFile'
}, inplace=True)
to_export.columns

In [165]:

to_export.to_csv("BUV Deployments Clean.csv", index=False)

# SiteID in BUV Deployment problems

Some RON surveys have positive Latitude values, and these also seem to be about 0.2 off the existing Latitudes.

In [None]:
buv_df[buv_df["new_SurveyID"].str.startswith("RON")]["Latitude"].unique()

In [None]:
# astype(str) guards against missing SiteIDs, as in the next cell
sites_df[sites_df["SiteID"].astype(str).str.startswith("RON")]["Latitude"].unique()

In [None]:
sites_df[sites_df["SiteID"].astype(str).str.startswith("RON")]["Latitude"].min() # 25

# General BUV Deployment review

In [230]:
# all referring to the created values
# gives only the first error, not all of them
def define_buv_row_issue(row):
    if row["new_SurveyID"] not in survey_ids:
        return "SurveyID does not exist in Survey Metadata"
    if row["new_SiteID"] not in site_ids:
        return "SiteID does not exist"
    
    survey_acronym = row["new_SurveyID"][:3]
    site_acronym = row["new_SiteID"][:3] 
    if survey_acronym != site_acronym:
        # TODO account for TAW and CRP, leaving as is now for them to be checked.
        return "Site and Survey do not reference the same marine reserve"
    
    if not str(row["new_DropID"]).startswith(str(row["new_SurveyID"])):
        return "Drop does not contain correct SurveyID info"
    
    if row["new_DropID"][17:24] != row["new_SiteID"]:
        return "Drop does not contain correct SiteID info"

    return True
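
Because the function above stops at the first problem, a row with several issues only reports one of them. A possible variant that collects every issue is sketched below; `collect_row_issues` is hypothetical, and it takes the ID sets as arguments so it can be tried on a plain dict.

```python
def collect_row_issues(row, survey_ids, site_ids):
    """Return every issue found for one deployment row (empty list if none)."""
    issues = []
    survey_id = str(row["new_SurveyID"])
    site_id = str(row["new_SiteID"])
    drop_id = str(row["new_DropID"])

    if survey_id not in survey_ids:
        issues.append("SurveyID does not exist in Survey Metadata")
    if site_id not in site_ids:
        issues.append("SiteID does not exist")
    if survey_id[:3] != site_id[:3]:
        issues.append("Site and Survey do not reference the same marine reserve")
    if not drop_id.startswith(survey_id):
        issues.append("Drop does not contain correct SurveyID info")
    if drop_id[17:24] != site_id:
        issues.append("Drop does not contain correct SiteID info")
    return issues


# Made-up row: the site is unknown and belongs to a different acronym
example_row = {
    "new_SurveyID": "CRP_20220407_BUV",
    "new_SiteID": "TAW_018",
    "new_DropID": "CRP_20220407_BUV_TAW_018_01",
}
example_issues = collect_row_issues(
    example_row, survey_ids={"CRP_20220407_BUV"}, site_ids={"CRP_018"}
)
print(example_issues)
```

With a DataFrame this could be applied row-wise, for example `buv_df.apply(lambda r: collect_row_issues(r, survey_ids, site_ids), axis=1)`.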

In [None]:
buv_df["valid_entry"] = buv_df.apply(define_buv_row_issue, axis=1)
# Count the rows with an issue (valid_entry holds True or the issue description, never False)
len(buv_df[buv_df["valid_entry"] != True])

In [232]:
buv_df.to_csv("BUV Deployments Comparison.csv", index=False)

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None): 

    display(buv_df[buv_df["valid_entry"] == "SiteID does not exist"])

In [None]:
buv_df[buv_df["valid_entry"] == "SurveyID does not exist in Survey Metadata"]

In [None]:
buv_df[buv_df["valid_entry"] == "Site and Survey do not reference the same marine reserve"]




In [None]:
print(len(buv_df[buv_df["valid_entry"] == "Drop does not contain correct SurveyID info"]))
print(len(buv_df[buv_df["valid_entry"] == "Drop does not contain correct SiteID info"]))