<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Extract-BUV-survey-site-info-from-Deployment/Video-analysis-sheet" data-toc-modified-id="Extract-BUV-survey-site-info-from-Deployment/Video-analysis-sheet-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Extract BUV survey site info from Deployment/Video analysis sheet</a></span><ul class="toc-item"><li><span><a href="#Import-file/sheet-from-to-extract-sites" data-toc-modified-id="Import-file/sheet-from-to-extract-sites-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Import file/sheet from to extract sites</a></span></li><li><span><a href="#Read-file-into-dataframe,-keep-relevant-columns" data-toc-modified-id="Read-file-into-dataframe,-keep-relevant-columns-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Read file into dataframe, keep relevant columns</a></span></li><li><span><a href="#extract-SiteID" data-toc-modified-id="extract-SiteID-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>extract SiteID</a></span></li><li><span><a href="#Extract-lat-&amp;-long" data-toc-modified-id="Extract-lat-&amp;-long-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Extract lat &amp; long</a></span></li><li><span><a href="#Fix-SiteNames,-observed-typos,-and-relevant-capitalisation" data-toc-modified-id="Fix-SiteNames,-observed-typos,-and-relevant-capitalisation-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Fix SiteNames, observed typos, and relevant capitalisation</a></span></li><li><span><a href="#Keep-only-the-relevant-columns" data-toc-modified-id="Keep-only-the-relevant-columns-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Keep only the relevant columns</a></span></li><li><span><a href="#Export-site-info-into-csv" data-toc-modified-id="Export-site-info-into-csv-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Export site info into csv</a></span></li><li><span><a href="#Check-which-SiteNames-are-missing" data-toc-modified-id="Check-which-SiteNames-are-missing-1.8"><span 
class="toc-item-num">1.8&nbsp;&nbsp;</span>Check which SiteNames are missing</a></span></li></ul></li></ul></div>

In [None]:
# Last changed 2024.12.11

# Extract BUV survey site info from Deployment/Video analysis sheet

This notebook is part of the 2024 Spyfish data cleaning. It is used to migrate information about the sites of the BUV drops from Excel spreadsheets to a consolidated "BUV Survey Sites" SharePoint list. The site-related information from the sheets is cleaned up before it is uploaded to the SharePoint list.

This notebook currently extracts site info from files formatted like [BUV MOKES 04](https://docnz.sharepoint.com/:x:/r/teams/SpyfishAotearoa/_layouts/15/Doc.aspx?sourcedoc=%7B3488D733-A46B-4DF4-8394-C111F1766355%7D&file=BUV%20Mokes%2004%20-%20DOCDM-54594.xls).

In [None]:
import os
import pandas as pd

from ipyfilechooser import FileChooser
from IPython.display import display

## Import file/sheet from to extract sites

In [None]:
file_chooser = FileChooser(title='<b>Select the file from which to extract the sites</b>')
display(file_chooser)

In [None]:
site_file_name = file_chooser.selected
assert site_file_name is not None, "Select site_file_name in the cell above."
print(f"The site_file_name is {site_file_name}")

In [None]:
tabs = pd.ExcelFile(site_file_name).sheet_names
for i, e in enumerate(tabs):
    print(i, e)

FILE_NUM = int(input("select sheet you want to process: "))
sheet_name_select = tabs[FILE_NUM]
print("\nselected sheet name: ", sheet_name_select)
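
The sheet selection above converts the typed value straight to an index, so a typo raises an error and a negative number silently selects a sheet from the end of the list. A minimal sketch of a stricter check, assuming a hypothetical helper `pick_sheet` and made-up sheet names:

```python
def pick_sheet(sheet_names, raw_choice):
    """Return the sheet name for a typed index, rejecting anything out of range."""
    try:
        index = int(raw_choice)
    except ValueError:
        raise ValueError(f"'{raw_choice}' is not a whole number") from None
    # Negative indices are valid in Python lists but are almost certainly a typo here.
    if not 0 <= index < len(sheet_names):
        raise ValueError(f"Choose a number between 0 and {len(sheet_names) - 1}")
    return sheet_names[index]

# Hypothetical sheet names, for illustration only.
example_tabs = ["Deployment", "Video analysis", "Notes"]
selected = pick_sheet(example_tabs, "1")
print(selected)
```

In the notebook this could be called as `pick_sheet(tabs, input(...))` in place of the direct index lookup.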

## Read file into dataframe, keep relevant columns

In [None]:
site_df = pd.read_excel(site_file_name, sheet_name=sheet_name_select)
print(site_df.columns)
site_df

In [None]:
# STATION is the column containing the Site Names
print("The count and list of the existing site IDs")
len(site_df["STATION"].unique()), site_df["STATION"].unique()

In [None]:
# Insert column names that contain relevant Site information
site_df = site_df[['STATION', 'SITE', 'DATE', 'TIME', 'DEPTH', 'LAT', 'LONG', 'AREA', 'HABITAT', 'Stand']]

# Once the count columns are dropped, the rows without site information are empty and can be deleted,
# so that each station features only once.
# Compare the row counts before and after to check that dropping the empty rows worked.
print(len(site_df))
site_df = site_df.dropna()
print(len(site_df))
print(f"The number of sites is: {len(site_df)}")
site_df

## extract SiteID

In [None]:
CURRENT_RESERVE_CODE = "ABC"

def get_SiteID(site_name):
    return f"{CURRENT_RESERVE_CODE}_{site_name.split('-')[-1].zfill(3)}"

In [None]:
site_df["SiteID"] = site_df["STATION"].apply(get_SiteID)
site_df.sample(3)
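
As a minimal sketch of what the SiteID step produces, the example below applies the same rule (take the part after the last hyphen and zero-pad it to three characters) to a few station labels. The labels such as `MOK-4` and the helper name `make_site_id` are assumptions for illustration; the labels in a real sheet may look different.

```python
# Hypothetical reserve code and station labels, for illustration only.
RESERVE_CODE = "ABC"

def make_site_id(station, reserve_code=RESERVE_CODE):
    """Build a SiteID from the part of the station label after the last hyphen."""
    # str() guards against numeric cells; strip() removes stray whitespace.
    suffix = str(station).strip().split("-")[-1].strip()
    return f"{reserve_code}_{suffix.zfill(3)}"

examples = ["MOK-4", "MOK-12", "7"]
site_ids = [make_site_id(s) for s in examples]
print(site_ids)  # ['ABC_004', 'ABC_012', 'ABC_007']
```

A label without a hyphen is used whole, so it is worth checking the generated IDs for duplicates before uploading.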

## Extract lat & long

In [None]:
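def convert_lat_long(val):
    # Expects a "degrees minutes seconds" string separated by whitespace
    deg, minutes, seconds = str(val).split()
    # float() also accepts fractional values, which int() would reject
    new_val = float(deg) + float(minutes) / 60 + float(seconds) / 3600
    # for NZ locations: values below 100 are latitudes (south, so negative),
    # larger values are longitudes (east, so positive)
    if new_val < 100:
        new_val *= -1
    return new_val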

# test cases
# print(convert_lat_long("35 56 66"))
# print(convert_lat_long("175 08 78"))
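
The conversion above relies on the formula decimal degrees = degrees + minutes / 60 + seconds / 3600, and infers the sign from the size of the value. A minimal sketch of the same arithmetic with the hemisphere passed explicitly instead is shown below. The helper name `dms_to_decimal`, its `negative` flag and the coordinates are assumptions for illustration, not values from the survey sheets.

```python
def dms_to_decimal(dms, negative=False):
    """Convert a 'deg min sec' string to decimal degrees."""
    deg, minutes, seconds = (float(part) for part in dms.split())
    decimal = deg + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative by convention.
    return -decimal if negative else decimal

# Illustrative coordinates only.
lat = round(dms_to_decimal("41 17 20", negative=True), 6)
lon = round(dms_to_decimal("174 46 37"), 6)
print(lat, lon)
```

Passing the hemisphere explicitly avoids the magnitude check, at the cost of having to know which column is being converted.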

In [None]:
site_df["Latitude"] = site_df["LAT"].apply(convert_lat_long).round(6)
site_df["Longitude"] = site_df["LONG"].apply(convert_lat_long).round(6)
site_df.sample(3)

## Fix SiteNames, observed typos, and relevant capitalisation


In [None]:
def fix_names(name):
    name_list = name.split(" ")
    
    # TODO: make it more robust by either adding more words or better instead keeping words 
    # such as "of", "and", etc lowercase, and capitalizing everything else
    for i,w in enumerate(name_list):
        if w in ["bay", "slot", "twins", "point", "site", "rock", "cave", "greenstone"]:
            name_list[i] = w.capitalize()
        
        # fix specific typos
        if w == "Roxk":
            name_list[i] = "Rock"

    return " ".join(name_list)

site_df["SiteName"] = site_df["SITE"].apply(fix_names)
site_df.sample(3)
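
One possible way to address the TODO in `fix_names` is to capitalise every word except a short list of connecting words. The sketch below is an assumption about how that could look, not the notebook's current behaviour; `SMALL_WORDS`, `TYPO_FIXES`, `tidy_site_name` and the example names are hypothetical.

```python
# Words kept lowercase unless they start the name (an assumed, incomplete list).
SMALL_WORDS = {"of", "and", "the", "to", "in", "at"}
# Known typos mapped to their corrections.
TYPO_FIXES = {"Roxk": "Rock"}

def tidy_site_name(name):
    """Capitalise each word except small connecting words, and fix known typos."""
    words = []
    for i, word in enumerate(name.split()):
        word = TYPO_FIXES.get(word, word)
        if i > 0 and word.lower() in SMALL_WORDS:
            words.append(word.lower())
        else:
            # Change only the first letter so the rest of the word is left untouched.
            words.append(word[:1].upper() + word[1:])
    return " ".join(words)

print(tidy_site_name("bay of islands site"))  # Bay of Islands Site
print(tidy_site_name("twin Roxk"))            # Twin Rock
```

Unlike `name.split(" ")`, `name.split()` also collapses repeated spaces, which may or may not be wanted for the site names.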

## Keep only the relevant columns
Review the dataframe before exporting it to a file in the next step.

In [None]:
# If needed can add more columns
site_df_to_extract = site_df[['SiteID', 'SiteName', 'Latitude', 'Longitude']]
# reset_index returns a new dataframe, so the result has to be assigned
site_df_to_extract = site_df_to_extract.reset_index(drop=True)
site_df_to_extract

## Export site info into csv

In [None]:

def export_to_annotations(df_with_vals, file_name, export_csv_file_name=None):
    if not export_csv_file_name:
        # Name the export after the source file, without its extension
        export_file_name = os.path.splitext(os.path.basename(file_name))[0]
        export_csv_file_name = f"sites_{export_file_name}.csv"

    # Make an "export" folder next to the source file
    path_to_export = os.path.join(os.path.dirname(file_name), "export")
    print(path_to_export)
    os.makedirs(path_to_export, exist_ok=True)
    export_location = os.path.join(path_to_export, export_csv_file_name)

    print(f"Exporting data to file: '{export_location}'")
    df_with_vals.to_csv(export_location)


print(f"Showing sample of export with shape: {site_df_to_extract.shape}")
# sample() raises an error if asked for more rows than exist, so cap the sample size
display(site_df_to_extract.sample(min(10, len(site_df_to_extract))))
export_to_annotations(site_df_to_extract, site_file_name)
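
The export location is derived from the source file: an `export` folder beside it, and a csv named after it with a `sites_` prefix. A minimal sketch of that naming rule using `pathlib` is shown below. The helper name `build_export_path` and the `data/` folder are assumptions for illustration.

```python
from pathlib import Path

def build_export_path(source_file, export_dir_name="export"):
    """Return <source folder>/export/sites_<source name without extension>.csv."""
    source = Path(source_file)
    # .stem drops only the final extension, so dots inside the name are kept.
    return source.parent / export_dir_name / f"sites_{source.stem}.csv"

# The folder "data" is hypothetical.
path = build_export_path("data/BUV Mokes 04 - DOCDM-54594.xls")
print(path.name)  # sites_BUV Mokes 04 - DOCDM-54594.csv
```

This only builds the path. Creating the folder and writing the file would still be done with `os.makedirs` and `to_csv` as in the cell above.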


## Check which SiteNames are missing

Copy and paste the names from the BUV SiteNames on SharePoint into the cell below to compare them with the processed file.

In [None]:
# Insert the sites that are currently on SharePoint (or in any other list) to check for completeness
existing_sites = """List of Existing Sites
Site 2
Site 3
Site 4""".split("\n")

In [None]:
site_names = list(site_df_to_extract["SiteName"])
site_names

In [None]:
set_existing_sites = set(existing_sites)
set_site_names = set(site_names)
# Names that appear in only one of the two sources
in_existing = sorted(set_existing_sites - set_site_names)
in_mok_file = sorted(set_site_names - set_existing_sites)

print("Present only in BUV Sites SharePoint list\n", in_existing)
print("Present only in processed file\n", in_mok_file)
print(len(in_existing))
print(len(in_mok_file))
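
The set comparison above treats names as different if they differ only in case or spacing, which can happen with copy-pasted lists. A minimal sketch of normalising the names before comparing them is shown below. The `normalise` helper and the site names are assumptions for illustration.

```python
def normalise(name):
    """Collapse whitespace and ignore case so near-identical names compare equal."""
    return " ".join(name.split()).casefold()

# Hypothetical names, for illustration only.
sharepoint_names = ["Site 2 ", "site 3", "Site 4"]
file_names = ["Site 2", "Site 3", "Site 5"]

sharepoint_keys = {normalise(n) for n in sharepoint_names}
file_keys = {normalise(n) for n in file_names}

only_in_sharepoint = sorted(sharepoint_keys - file_keys)
only_in_file = sorted(file_keys - sharepoint_keys)
print(only_in_sharepoint, only_in_file)  # ['site 4'] ['site 5']
```

The results are the normalised keys, so they are useful for spotting gaps rather than as display names.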