<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Scientific-Names-Validity-Review" data-toc-modified-id="Scientific-Names-Validity-Review-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Scientific Names Validity Review</a></span><ul class="toc-item"><li><span><a href="#Chose-excel-file-containing-the-scientific-names-to-check" data-toc-modified-id="Chose-excel-file-containing-the-scientific-names-to-check-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Chose excel file containing the scientific names to check</a></span></li><li><span><a href="#Small-test-df" data-toc-modified-id="Small-test-df-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Small test df</a></span></li><li><span><a href="#Check-each-of-the-scientific-names" data-toc-modified-id="Check-each-of-the-scientific-names-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Check each of the scientific names</a></span></li><li><span><a href="#run-the-name-checker-on-each-row" data-toc-modified-id="run-the-name-checker-on-each-row-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>run the name checker on each row</a></span></li></ul></li><li><span><a href="#run-the-name-checker-on-the-whole-dataframe" data-toc-modified-id="run-the-name-checker-on-the-whole-dataframe-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>run the name checker on the whole dataframe</a></span></li><li><span><a href="#TODOs-and-Extras:" data-toc-modified-id="TODOs-and-Extras:-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>TODOs and Extras:</a></span><ul class="toc-item"><li><span><a href="#potential-todos" data-toc-modified-id="potential-todos-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>potential todos</a></span></li><li><span><a href="#print-WoRMS-suggestions" data-toc-modified-id="print-WoRMS-suggestions-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>print WoRMS suggestions</a></span></li><li><span><a href="#Example-API-return" data-toc-modified-id="Example-API-return-3.3"><span 
class="toc-item-num">3.3&nbsp;&nbsp;</span>Example API return</a></span></li></ul></li></ul></div>

In [1]:
# Last changed 2025.03.25

# Scientific Names Validity Review

This notebook is part of the Spyfish data cleaning process. It reviews the validity of species scientific names in a given column of an input file (the examples below use CSV files).

The checks are performed with calls to the [WoRMS API](https://www.marinespecies.org/rest/AphiaRecordsByName).
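
To make the request concrete, here is a minimal sketch of how a lookup URL for that endpoint could be built. The helper name `build_worms_url` is hypothetical, and this is not necessarily how `ScientificNameProcessing` issues its requests; `like` and `marine_only` are query parameters of the WoRMS `AphiaRecordsByName` endpoint.

```python
from urllib.parse import quote, urlencode

WORMS_BASE_URL = "https://www.marinespecies.org/rest/AphiaRecordsByName"


def build_worms_url(scientific_name: str, like: bool = False, marine_only: bool = True) -> str:
    """Build the lookup URL for one scientific name (hypothetical helper)."""
    # the name goes into the URL path, so spaces must be percent-encoded
    path = quote(scientific_name.strip())
    # WoRMS expects lowercase "true"/"false" strings for boolean parameters
    query = urlencode({"like": str(like).lower(), "marine_only": str(marine_only).lower()})
    return f"{WORMS_BASE_URL}/{path}?{query}"


url = build_worms_url("Kathetostoma giganteum")
print(url)
```

With `like=False` only exact matches are requested, which is one possible reason a name with a typo would come back without a record.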



This notebook outputs a CSV file with the following columns:
- **aphiaID**: from WoRMS API
- **scientificName**: the scientific name, validated by the WoRMS API
- **commonName**: the common name
- **taxonRank**: the corresponding rank of the scientific name



If you already have a cleaned CSV and would like to check and add a few names (from a list or from another file), scroll down to [Extra scientific names to add](#extra-scientific-names-to-add).

In [8]:
## Run the code below first. If you get "ModuleNotFoundError: No module named 'sftk'" or a similar error,
## check the Usage section of README.md for instructions, or run the code below:
import sys
sys.path.append('path/to/Spyfish-Aotearoa-toolkit')


In [11]:
import pandas as pd

from sftk.utils import read_file_to_df
from sftk.clean_data import ScientificNameProcessing

## Choose CSV file(s) containing the scientific names to check

In [None]:
# Example usage
scientific_names_file = {"path": "/path/to/scientific_name/csv_file", # change to your path
                         "columns":["scientificName", "commonName"]} # columns that hold the scientific and common names

scientific_names_file_example =  {"path": "../tests/mock_data/sample_clean_scientific_names.csv", 
                                 "columns": ["scientificName", "commonName"]} 


# Add multiple files if necessary: 
scientific_names_files = [scientific_names_file_example]

print("The scientific_names_files are") 
scientific_names_files


In [None]:
raw_scientific_names_dfs = []

for scientific_names_file in scientific_names_files:
    
    current_df = read_file_to_df(scientific_names_file["path"])
    
    current_df.rename(columns={
        # add the next line if there is a column called scientific_name not intended to be used as scientific name
        "scientific_name": "hold", 
        scientific_names_file["columns"][0]: "scientific_name", 
        # delete the next line if there is no column for common name 
        scientific_names_file["columns"][1]: "common_name"
        }, 
        inplace=True)
    raw_scientific_names_dfs.append(current_df)
   

# Concatenating all the files
raw_scientific_names_df = pd.concat(raw_scientific_names_dfs, ignore_index=True)
print(f"Length of scientific names dataframe: {len(raw_scientific_names_df)}")
raw_scientific_names_df.sample(3)

In [109]:
# create checkpoint csv data if necessary
# raw_scientific_names_df.to_csv("checkpoint_concatenated_scientific_names.csv", index=False)

In [None]:
raw_scientific_names_df = raw_scientific_names_df[["scientific_name", "common_name"]]
raw_scientific_names_df.sample(3)

In [None]:
# sort values by scientific name
raw_scientific_names_df =  raw_scientific_names_df.sort_values(by=["scientific_name", "common_name"], ascending=[True, False])
# review duplicates for "scientific_name"
raw_scientific_names_df[raw_scientific_names_df.duplicated(subset='scientific_name', keep=False)]

# uncomment row if you want to drop duplicates and keep the first of the duplicates
#raw_scientific_names_df = raw_scientific_names_df.drop_duplicates(subset='scientific_name', keep='first')
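
The interaction between the sort order and `keep='first'` is easy to miss, so here is a small self-contained sketch on made-up rows (the common names are placeholders, not real data). Because `sort_values` puts missing values last by default, the row that has a common name comes first within each scientific name and is the one that survives `drop_duplicates`.

```python
import pandas as pd

# toy data: one scientific name appears twice, once without a common name
names = pd.DataFrame({
    "scientific_name": ["Conger wilsoni", "Conger wilsoni", "Oligoplites saurus"],
    "common_name": [None, "placeholder name A", "placeholder name B"],
})

# same sort as in the cell above; missing common names end up last within each group
names = names.sort_values(by=["scientific_name", "common_name"], ascending=[True, False])

# keep=False flags every member of a duplicate group, which is useful for review
dupes = names[names.duplicated(subset="scientific_name", keep=False)]

# keep="first" keeps only the first row of each group after sorting
deduped = names.drop_duplicates(subset="scientific_name", keep="first")

print(len(dupes), len(deduped))
print(deduped)
```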


## Small test df
Use this small dataframe for testing, to avoid making repeated API calls.

In [15]:
# column names match the snake_case names expected by the cells below
# raw_scientific_names_df = pd.DataFrame(['Kathetostoma giganteum', # correct
#                                     'Kathetostoma giganteu', # typo
#                                     'Cephaloscyllium isabellum', # replaced by new nomenclature
#                                     'Triglidae sp', # only genus correct
#                                     'Blennioidei sp' # new nomenclature genus to fix
#                                    ], columns=["scientific_name"])
# raw_scientific_names_df["common_name"] = "Test"
# raw_scientific_names_df

## Run the name checker on each row and check the scientific names 

In [114]:
def process_dataframe_row(row):
    # get relevant values from original dataframe
    scientific_name = row.get("scientific_name")
    common_name = row.get("common_name")
    
    return ScientificNameProcessing(scientific_name, common_name).query_api()

In [None]:
clean_scientific_names_df = raw_scientific_names_df.apply(process_dataframe_row, axis=1)
clean_scientific_names_df

In [None]:
# Convert list of dataclass instances to DataFrame
# TODO refactor this as now it goes from Dataframe > list > dataframe
clean_scientific_names_df = pd.DataFrame(clean_scientific_names_df.tolist())

clean_scientific_names_df.sample(3)

In [None]:
# order by aphia_id
clean_scientific_names_df = clean_scientific_names_df.sort_values(by=["aphia_id","scientific_name"], ascending=[True, False])
clean_scientific_names_df.sample()

# Review dataframe created with API response

In [None]:
# review API irregularities: 
clean_scientific_names_df[clean_scientific_names_df["status"] != "accepted"]

In [None]:
# review mismatches between the queried scientific names and the names accepted by WoRMS
# if there is a discrepancy, query those names again, because the aphia_id refers to the old name:
clean_scientific_names_df[clean_scientific_names_df["scientific_names_match"] != True]


## check for duplicates

In [None]:
# check for duplicates aphia_id
clean_scientific_names_df[clean_scientific_names_df.duplicated(subset='aphia_id', keep=False)]

In [None]:
# check for duplicates scientific_name
clean_scientific_names_df[clean_scientific_names_df.duplicated(subset='scientific_name', keep=False)]

In [None]:
# check for duplicates common name
clean_scientific_names_df[clean_scientific_names_df.duplicated(subset='common_name', keep=False)]

## delete specific rows by id

In [60]:
# clean_scientific_names_df = clean_scientific_names_df[clean_scientific_names_df["aphia_id"] != 278154]

In [123]:
# export checkpoint if necessary
# clean_scientific_names_df.to_csv("checkpoint_api_scientific_names.csv", index=False)

# Extra scientific names to add

In [None]:
## upload from checkpoint or cleaned csv file

# scientific_names_file = "/path/to/clean/scientific/name/file"
# clean_scientific_names_df = read_file_to_df(scientific_names_file)
# print(clean_scientific_names_df.columns)
### rename columns if necessary
## clean_scientific_names_df.columns = ["aphia_id","scientific_name","common_name","taxon_rank"]
# clean_scientific_names_df.sample(3)

In [36]:
# Dictionary of additional scientific names (keys) and their common names (values) to check:

scientific_names_todo = {'Chondrichthyes' : None,
 'Conger wilsoni' : None,
 'Oligoplites saurus' : None,
 'Pseudocaranx georgianus' : None,
 'Acanthoclininae sp' : None,
 "test": None}


In [None]:
# Alternatively, get the names from a CSV file (this overwrites the dictionary above):

scientific_names_file_to_check = "path/to/file/with/more/scientific/names"
scientific_names_file_to_check_df = read_file_to_df(scientific_names_file_to_check)
scientific_names_todo = scientific_names_file_to_check_df.set_index('scientificName')['commonName'].to_dict()
print(scientific_names_todo)
len(scientific_names_todo)


In [None]:
remaining_names_todo = set(scientific_names_todo.keys()) - set(clean_scientific_names_df["scientific_name"])
print(len(remaining_names_todo))
remaining_names_todo

In [None]:
# TODO: fix this based on the cell above
to_add_to_df = []
for sn in remaining_names_todo:
    to_add_to_df.append(ScientificNameProcessing(sn, scientific_names_todo[sn]).query_api())
new_entries_df = pd.DataFrame(to_add_to_df)
new_entries_df

In [None]:
# TODO check if this works?
new_clean_scientific_names_df = pd.concat([clean_scientific_names_df, new_entries_df], ignore_index=True)
print(len(new_clean_scientific_names_df))
new_clean_scientific_names_df.sample(3)


In [None]:
clean_scientific_names_df = new_clean_scientific_names_df
# export checkpoint if necessary
# clean_scientific_names_df.to_csv("checkpoint_api_scientific_names.csv", index=False)

## Review dataframe for duplicates

In [None]:
clean_scientific_names_df[clean_scientific_names_df.duplicated(subset='scientific_name', keep='last')]

In [None]:
clean_scientific_names_df[clean_scientific_names_df.duplicated(subset='common_name', keep=False) & ~clean_scientific_names_df["common_name"].isna()]

In [None]:
clean_scientific_names_df[clean_scientific_names_df.duplicated(subset='aphia_id', keep=False)]

### other reviews: 

In [None]:
# check scientific names that end with sp
clean_scientific_names_df[clean_scientific_names_df["scientific_name"].str.endswith(" sp")]

In [None]:
clean_scientific_names_df["scientific_name"].str.split()

In [None]:
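# names that do not consist of exactly two words (i.e. not a "Genus species" binomial)
# .str.len() counts the words per row; a plain len() would return the number of rows instead
clean_scientific_names_df[clean_scientific_names_df["scientific_name"].str.split().str.len() != 2]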

Replace common names that are the same as the scientific names with None

In [53]:
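def delete_common_name(row):
    # missing values (None or NaN) stay missing; this also makes the cell safe to re-run
    if pd.isna(row["common_name"]):
        return None
    # a common name identical to the scientific name carries no extra information
    if row["common_name"].lower() == str(row["scientific_name"]).lower():
        return None
    return row["common_name"]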


In [None]:
clean_scientific_names_df["common_name"] = clean_scientific_names_df.apply(delete_common_name, axis=1)

# Review dataframe 

In [63]:
ids_to_check = set()

In [None]:
non_matching_names = clean_scientific_names_df[clean_scientific_names_df["scientific_names_match"] != True]
print("Non matching names n: ",len(non_matching_names))
print(list(non_matching_names["aphia_id"]))
to_add = list(non_matching_names["aphia_id"])
ids_to_check.update(to_add)
non_matching_names

In [None]:
non_accepted_names = clean_scientific_names_df[clean_scientific_names_df["status"] != "accepted"]
print("Non accepted names n: ",len(non_accepted_names))
print(list(non_accepted_names["aphia_id"]))
to_add = list(non_accepted_names["aphia_id"])
ids_to_check.update(to_add)
non_accepted_names

In [None]:
missing_common_names = clean_scientific_names_df[clean_scientific_names_df["common_name"].isna()]
print("Missing common names n: ",len(missing_common_names))
print(list(missing_common_names["aphia_id"]))
to_add = list(missing_common_names["aphia_id"])
ids_to_check.update(to_add)
missing_common_names

In [None]:
same_names = clean_scientific_names_df[clean_scientific_names_df["common_name"] == clean_scientific_names_df["scientific_name"]]
print("Same common and scientific names n: ",len(same_names))
print(list(same_names["aphia_id"]))
to_add = list(same_names["aphia_id"])
ids_to_check.update(to_add)
same_names

In [None]:
# print the count explicitly: only the last expression of a cell is displayed
print(len(ids_to_check))
ids_to_check

# Aphia ID to find the rows in the exported csv that need to be checked

Non species taxon: 

In [None]:
non_species_taxon = clean_scientific_names_df[clean_scientific_names_df["taxon_rank"] != "Species"]
print("Non species rank: ",len(non_species_taxon))
print(list(non_species_taxon["aphia_id"]))
non_species_taxon

# Export to csv

In [None]:
clean_scientific_names_df.sample()

In [None]:
# copy the selection so that renaming its columns does not affect clean_scientific_names_df
to_export_df = clean_scientific_names_df[["aphia_id","scientific_name", "common_name", "taxon_rank"]].copy()
to_export_df.columns = ["aphiaID","scientificName","commonName", "taxonRank"]
to_export_df.sample()

In [64]:
# export the dataframe with the renamed columns (aphiaID, scientificName, commonName, taxonRank)
to_export_df.to_csv("clean_scientific_names.csv", index=False)


## Example API return

In [None]:
[
  {
    "AphiaID": 277101,
    "url": "https://www.marinespecies.org/aphia.php?p=taxdetails&id=277101",
    "scientificname": "Cephaloscyllium isabellum",
    "authority": "(Bonnaterre, 1788)",
    "status": "unaccepted",
    "unacceptreason": null,
    "taxonRankID": 220,
    "rank": "Species",
    "valid_AphiaID": 298238,
    "valid_name": "Cephaloscyllium isabella",
    "valid_authority": "(Bonnaterre, 1788)",
    "parentNameUsageID": 204168,
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Elasmobranchii",
    "order": "Carcharhiniformes",
    "family": "Scyliorhinidae",
    "genus": "Cephaloscyllium",
    "citation": "Froese, R. and D. Pauly. Editors. (2024). FishBase. Cephaloscyllium isabellum (Bonnaterre, 1788). Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=277101 on 2024-11-29",
    "lsid": "urn:lsid:marinespecies.org:taxname:277101",
    "isMarine": 1,
    "isBrackish": 0,
    "isFreshwater": 0,
    "isTerrestrial": 0,
    "isExtinct": null,
    "match_type": "exact",
    "modified": "2023-01-11T08:59:53.383Z"
  }
]
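
In the record above, `status` is `unaccepted`, so the currently valid identifiers are found in `valid_AphiaID` and `valid_name` rather than in `AphiaID` and `scientificname`. As a sketch only, the snippet below parses a trimmed copy of that response and pulls out those fields. The function `summarise_record` and the keys of the dictionary it returns are assumptions made for illustration; they mirror the column names used in this notebook but are not taken from the `sftk` implementation.

```python
import json

# trimmed copy of the example response above (only the fields used here)
sample_response = """
[
  {
    "AphiaID": 277101,
    "scientificname": "Cephaloscyllium isabellum",
    "status": "unaccepted",
    "rank": "Species",
    "valid_AphiaID": 298238,
    "valid_name": "Cephaloscyllium isabella",
    "match_type": "exact"
  }
]
"""


def summarise_record(record: dict, queried_name: str) -> dict:
    """Collect the fields needed for review (hypothetical helper)."""
    return {
        "queried_name": queried_name,
        "aphia_id": record["AphiaID"],
        "status": record["status"],
        "taxon_rank": record["rank"],
        # identifiers of the currently valid record, which differ when the name was replaced
        "valid_aphia_id": record["valid_AphiaID"],
        "valid_name": record["valid_name"],
        # False means the queried name is not the name WoRMS currently considers valid
        "scientific_names_match": queried_name == record["valid_name"],
    }


records = json.loads(sample_response)
summary = summarise_record(records[0], "Cephaloscyllium isabellum")
print(summary["status"], "->", summary["valid_name"], summary["valid_aphia_id"])
```

A row like this one is what the review cells flag: the status is not `accepted` and the names do not match, so the name should be queried again with the value in `valid_name`.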