<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Scientific-Names-Validity-Review" data-toc-modified-id="Scientific-Names-Validity-Review-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Scientific Names Validity Review</a></span><ul class="toc-item"><li><span><a href="#Chose-excel-file-containing-the-scientific-names-to-check" data-toc-modified-id="Chose-excel-file-containing-the-scientific-names-to-check-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Chose excel file containing the scientific names to check</a></span></li><li><span><a href="#Small-test-df" data-toc-modified-id="Small-test-df-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Small test df</a></span></li><li><span><a href="#Check-each-of-the-scientific-names" data-toc-modified-id="Check-each-of-the-scientific-names-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Check each of the scientific names</a></span></li><li><span><a href="#run-the-name-checker-on-each-row" data-toc-modified-id="run-the-name-checker-on-each-row-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>run the name checker on each row</a></span></li></ul></li><li><span><a href="#run-the-name-checker-on-the-whole-dataframe" data-toc-modified-id="run-the-name-checker-on-the-whole-dataframe-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>run the name checker on the whole dataframe</a></span></li><li><span><a href="#TODOs-and-Extras:" data-toc-modified-id="TODOs-and-Extras:-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>TODOs and Extras:</a></span><ul class="toc-item"><li><span><a href="#potential-todos" data-toc-modified-id="potential-todos-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>potential todos</a></span></li><li><span><a href="#print-WoRMS-suggestions" data-toc-modified-id="print-WoRMS-suggestions-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>print WoRMS suggestions</a></span></li><li><span><a href="#Example-API-return" data-toc-modified-id="Example-API-return-3.3"><span 
class="toc-item-num">3.3&nbsp;&nbsp;</span>Example API return</a></span></li></ul></li></ul></div>

In [None]:
# Last changed 2025.03.21

# Scientific Names Validity Review

This notebook is part of the Spyfish data cleaning process. It reviews the validity of the species' scientific names in a given column of an Excel (or CSV) file.

The checks are performed with calls to the [WoRMS API](https://www.marinespecies.org/rest/AphiaRecordsByName).
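
As a rough illustration of what such a call looks like, the sketch below builds the request URL for a single name. The helper name `build_worms_url` is hypothetical (it is not part of this notebook); the endpoint and the `like`, `marine_only` and `offset` query parameters are the ones used in the name checker further down. It only builds the string and makes no network call.

```python
from urllib.parse import quote, urlencode

WORMS_BY_NAME = "https://www.marinespecies.org/rest/AphiaRecordsByName"


def build_worms_url(scientific_name, like=False, marine_only=True, offset=1):
    """Build the AphiaRecordsByName URL for one name (hypothetical helper)."""
    # The name is part of the URL path, so spaces must be percent-encoded.
    path_name = quote(scientific_name.strip())
    # Send lowercase 'true'/'false' rather than Python's True/False.
    query = urlencode({
        "like": str(like).lower(),
        "marine_only": str(marine_only).lower(),
        "offset": offset,
    })
    return f"{WORMS_BY_NAME}/{path_name}?{query}"


print(build_worms_url("Kathetostoma giganteum"))
```

Passing a plain string with spaces to `requests.get` generally works too, since `requests` encodes the URL itself; spelling it out just makes the final URL easy to inspect.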



This notebook writes a CSV file with the following columns:
- **aphiaID**: the AphiaID returned by the WoRMS API
- **scientificName**: the scientific name, validated by the WoRMS API
- **commonName**: the common name, taken from the `CommonName_literature` column of the input
- **taxonRank**: the taxonomic rank of the scientific name



In [1]:
import requests
import pandas as pd

from dataclasses import dataclass
from typing import Optional

## Choose Excel file containing the scientific names to check

In [2]:
# Run this code if you want to use a widget to select your file: 

# from ipyfilechooser import FileChooser
# from IPython.display import display

# file_chooser = FileChooser(title='<b>Select the Excel file containing scientific names</b>')
# display(file_chooser)
# scientific_names_file = file_chooser.selected
# assert scientific_names_file is not None, "Select the file containing the scientific names in the cell above."

In [3]:
scientific_names_file = "/path/to/scientific_name/csv_or_excel"


In [None]:


print(f"The scientific_names_file is {scientific_names_file}")
# The Excel call below assumes the file has a single sheet or that the relevant sheet is first.
# If it is not, pass sheet_name=... to pd.read_excel.
if scientific_names_file.endswith(".csv"):
    # TODO not sure
    raw_scientific_names_df = pd.read_csv(scientific_names_file)
else: 
    raw_scientific_names_df = pd.read_excel(scientific_names_file)

# If the column containing the scientific names is not called ScientificName, rename it
# (replace "YourColumn" with the actual column name):
# raw_scientific_names_df = raw_scientific_names_df.rename(columns={"YourColumn": "ScientificName"})

In [None]:
raw_scientific_names_df.sample(10)

In [None]:
raw_scientific_names_df.columns

## Small test df
A small DataFrame to use for testing, so that the full list of names does not have to be sent to the API every time.

In [5]:
# raw_scientific_names_df = pd.DataFrame(['Kathetostoma giganteum', # correct
#                                     'Kathetostoma giganteu', # typo
#                                     'Cephaloscyllium isabellum', # replaced by new nomenclature
#                                     'Triglidae sp', # higher taxon only (a family), name is correct
#                                     'Blennioidei sp' # higher taxon only, replaced by new nomenclature
#                                    ], columns=["ScientificName"])
# raw_scientific_names_df["CommonName_literature"] = "Test"
# raw_scientific_names_df

Unnamed: 0,ScientificName,CommonName_literature
0,Kathetostoma giganteum,Test
1,Kathetostoma giganteu,Test
2,Cephaloscyllium isabellum,Test
3,Triglidae sp,Test
4,Blennioidei sp,Test
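
Another way to keep the number of API calls down on the full data is to look up each distinct name only once. The sketch below is an illustration under assumptions: `fake_lookup` is a hypothetical stand-in for the real WoRMS request, so nothing is sent over the network, and `cached_lookup` is not part of this notebook.

```python
from functools import lru_cache

calls = []  # records every name that actually reaches the lookup


def fake_lookup(scientific_name):
    """Stand-in for a real WoRMS request (assumption: one request per name)."""
    calls.append(scientific_name)
    return {"oldScientificName": scientific_name}


@lru_cache(maxsize=None)
def cached_lookup(scientific_name):
    # lru_cache returns the stored result for names it has already seen
    return fake_lookup(scientific_name)


names = ["Kathetostoma giganteum", "Triglidae sp", "Kathetostoma giganteum"]
results = [cached_lookup(name) for name in names]

print(len(results), "results from", len(calls), "lookups")
```

The cache hands back the same object for a repeated name, so results should be treated as read-only or copied before being modified.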


## Check each of the scientific names 



In [6]:

@dataclass
class ScientificNameEntry:
    aphiaID: Optional[int] = -1
    commonName: Optional[str] = None
    scientificName: Optional[str] = None
    oldScientificName: Optional[str] = None
    scientificNamesMatch: bool = False
    taxonRank: Optional[str] = None


In [7]:
def get_scientific_name_info(scientific_name, common_name=None):
    """Query the WoRMS API and return a ScientificNameEntry populated with the response."""

    # print(f"Checking {scientific_name} - {common_name}")
    # Entries not identified to species level are written as "<taxon> sp" in the DOC data:
    # query WoRMS without the suffix and add it back to the validated name.
    sp_suffix = ""
    if scientific_name.endswith(" sp"):
        scientific_name = scientific_name.split()[0]
        sp_suffix = " sp"

    url = f"https://www.marinespecies.org/rest/AphiaRecordsByName/{scientific_name}?like=false&marine_only=true&offset=1"

    # The timeout stops the notebook from hanging if the server does not answer
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        # Use the first record returned
        response_json = response.json()[0]

        accepted_scientific_name = response_json.get("valid_name")
        scientific_name_instance = ScientificNameEntry(
            aphiaID=response_json.get("AphiaID"),
            commonName=common_name,
            # Guard against a missing valid_name (None + str would raise a TypeError)
            scientificName=(accepted_scientific_name + sp_suffix) if accepted_scientific_name else None,
            oldScientificName=scientific_name,
            scientificNamesMatch=accepted_scientific_name == scientific_name,
            taxonRank=response_json.get("rank"),
        )
        return scientific_name_instance

    # Any other status (for example an empty reply when nothing matches) ends up here
    print(f"{scientific_name} was not processed, more info:\n{response.content}")
    return ScientificNameEntry(commonName=common_name,
                               oldScientificName=scientific_name)
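
The field handling can be tried without touching the network. The sketch below is an illustration, not this notebook's implementation: `parse_record` is a hypothetical helper, and the record is a trimmed copy of the example API return at the end of the notebook. It also reads `valid_AphiaID`, which the function above does not use, to show that the ID of the queried name and the ID of the valid name can differ.

```python
def parse_record(record, queried_name):
    """Pull the fields of interest out of one WoRMS record (hypothetical helper)."""
    valid_name = record.get("valid_name")  # None if the key is missing or null
    return {
        "aphiaID": record.get("AphiaID"),
        "validAphiaID": record.get("valid_AphiaID"),
        "scientificName": valid_name,
        "oldScientificName": queried_name,
        "scientificNamesMatch": valid_name == queried_name,
        "taxonRank": record.get("rank"),
    }


# Trimmed copy of the example API return shown at the end of this notebook
record = {
    "AphiaID": 277101,
    "scientificname": "Cephaloscyllium isabellum",
    "status": "unaccepted",
    "rank": "Species",
    "valid_AphiaID": 298238,
    "valid_name": "Cephaloscyllium isabella",
}

parsed = parse_record(record, "Cephaloscyllium isabellum")
print(parsed["scientificName"], parsed["scientificNamesMatch"])
print(parsed["aphiaID"], parsed["validAphiaID"])
```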



In [8]:
def process_dataframe_row(row):
    # get relevant values from original dataframe
    scientific_name = row.get("ScientificName")
    common_name = row.get("CommonName_literature")
    
    return get_scientific_name_info(scientific_name, common_name)


## run the name checker on each row 


In [9]:
clean_scientific_names_df = raw_scientific_names_df.apply(process_dataframe_row, axis=1)
# clean_scientific_names_df

Kathetostoma giganteu was not processed, more info:
b''


In [10]:
# Convert list of dataclass instances to DataFrame
clean_scientific_names_df = pd.DataFrame(clean_scientific_names_df.tolist())

clean_scientific_names_df

Unnamed: 0,aphiaID,commonName,scientificName,oldScientificName,scientificNamesMatch,taxonRank
0,275992,Test,Kathetostoma giganteum,Kathetostoma giganteum,True,Species
1,-1,Test,,Kathetostoma giganteu,False,
2,277101,Test,Cephaloscyllium isabella,Cephaloscyllium isabellum,False,Species
3,125598,Test,Triglidae sp,Triglidae,True,Family
4,151738,Test,Blenniiformes sp,Blennioidei,False,Suborder


# Extra scientific names to add onto the list 

In [11]:
# Extra scientific names to look up, unless they are already in the cleaned DataFrame

scientific_names_todo = {
    'Chondrichthyes',  # FIX: probably too broad
    'Conger wilsoni',
    'Oligoplites saurus',
    'Pseudocaranx georgianus',
}

# Drop the names that were already validated above
for sn in clean_scientific_names_df["scientificName"]:
    scientific_names_todo.discard(sn)


scientific_names_todo

{'Chondrichthyes',
 'Conger wilsoni',
 'Oligoplites saurus',
 'Pseudocaranx georgianus'}

In [12]:
to_add_to_df = []
for sn in scientific_names_todo:
    to_add_to_df.append(get_scientific_name_info(sn))
new_entries_df = pd.DataFrame(to_add_to_df)
new_entries_df


Unnamed: 0,aphiaID,commonName,scientificName,oldScientificName,scientificNamesMatch,taxonRank
0,1039991,,Pseudocaranx georgianus,Pseudocaranx georgianus,True,Species
1,217546,,Conger wilsoni,Conger wilsoni,True,Species
2,159645,,Oligoplites saurus,Oligoplites saurus,True,Species
3,1517375,,Chondrichthyes,Chondrichthyes,True,Parvphylum


In [13]:

clean_scientific_names_df = pd.concat([clean_scientific_names_df, new_entries_df], ignore_index=True)

clean_scientific_names_df

Unnamed: 0,aphiaID,commonName,scientificName,oldScientificName,scientificNamesMatch,taxonRank
0,275992,Test,Kathetostoma giganteum,Kathetostoma giganteum,True,Species
1,-1,Test,,Kathetostoma giganteu,False,
2,277101,Test,Cephaloscyllium isabella,Cephaloscyllium isabellum,False,Species
3,125598,Test,Triglidae sp,Triglidae,True,Family
4,151738,Test,Blenniiformes sp,Blennioidei,False,Suborder
5,1039991,,Pseudocaranx georgianus,Pseudocaranx georgianus,True,Species
6,217546,,Conger wilsoni,Conger wilsoni,True,Species
7,159645,,Oligoplites saurus,Oligoplites saurus,True,Species
8,1517375,,Chondrichthyes,Chondrichthyes,True,Parvphylum


# Review dataframe 

In [14]:
# Sanity check: every input name and every extra name should have exactly one output row.
# False here means rows were lost or duplicated.
len(clean_scientific_names_df) == len(raw_scientific_names_df) + len(to_add_to_df)

True

In [15]:
ids_to_check = set()

In [16]:
non_matching_names = clean_scientific_names_df[~clean_scientific_names_df["scientificNamesMatch"]]
print("Non matching names n: ",len(non_matching_names))
print(list(non_matching_names["aphiaID"]))
to_add = list(non_matching_names["aphiaID"])
ids_to_check.update(to_add)
non_matching_names

Non matching names n:  3
[-1, 277101, 151738]


Unnamed: 0,aphiaID,commonName,scientificName,oldScientificName,scientificNamesMatch,taxonRank
1,-1,Test,,Kathetostoma giganteu,False,
2,277101,Test,Cephaloscyllium isabella,Cephaloscyllium isabellum,False,Species
4,151738,Test,Blenniiformes sp,Blennioidei,False,Suborder


In [17]:
missing_common_names = clean_scientific_names_df[clean_scientific_names_df["commonName"].isna()]
print("Missing common names n: ",len(missing_common_names))
print(list(missing_common_names["aphiaID"]))
to_add = list(missing_common_names["aphiaID"])
ids_to_check.update(to_add)
missing_common_names

Missing common names n:  4
[1039991, 217546, 159645, 1517375]


Unnamed: 0,aphiaID,commonName,scientificName,oldScientificName,scientificNamesMatch,taxonRank
5,1039991,,Pseudocaranx georgianus,Pseudocaranx georgianus,True,Species
6,217546,,Conger wilsoni,Conger wilsoni,True,Species
7,159645,,Oligoplites saurus,Oligoplites saurus,True,Species
8,1517375,,Chondrichthyes,Chondrichthyes,True,Parvphylum


# Aphia IDs of the rows in the exported csv that need to be checked

In [18]:
len(ids_to_check)
ids_to_check

{-1, 151738, 159645, 217546, 277101, 1039991, 1517375}

# A few more, for curiosity

In [19]:
non_species_taxon = clean_scientific_names_df[clean_scientific_names_df["taxonRank"] != "Species"]
print("Non species rank: ",len(non_species_taxon))
print(list(non_species_taxon["aphiaID"]))
non_species_taxon



Non species rank:  4
[-1, 125598, 151738, 1517375]


Unnamed: 0,aphiaID,commonName,scientificName,oldScientificName,scientificNamesMatch,taxonRank
1,-1,Test,,Kathetostoma giganteu,False,
3,125598,Test,Triglidae sp,Triglidae,True,Family
4,151738,Test,Blenniiformes sp,Blennioidei,False,Suborder
8,1517375,,Chondrichthyes,Chondrichthyes,True,Parvphylum


# Export to csv

In [20]:
columns_to_export = ["aphiaID","commonName","scientificName", "taxonRank"]

clean_scientific_names_df.to_csv("clean_scientific_names.csv", columns=columns_to_export, index=False)
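
Once the file is exported, the rows that need a manual look can be pulled out by Aphia ID. A minimal sketch using only the standard library; the CSV text is a few rows copied from the tables above in the exported column layout, and `ids_to_check` is re-declared locally with two example IDs so that the block runs on its own.

```python
import csv
import io

# A few rows in the exported layout (copied from the outputs above)
exported_csv = """aphiaID,commonName,scientificName,taxonRank
275992,Test,Kathetostoma giganteum,Species
-1,Test,,
277101,Test,Cephaloscyllium isabella,Species
125598,Test,Triglidae sp,Family
"""

ids_to_check = {-1, 277101}  # example subset of the IDs collected above

# Keep only the rows whose aphiaID was flagged for review
rows_to_review = [
    row
    for row in csv.DictReader(io.StringIO(exported_csv))
    if int(row["aphiaID"]) in ids_to_check
]

for row in rows_to_review:
    print(row["aphiaID"], row["scientificName"] or "<no match>")
```

With the real file, `io.StringIO(exported_csv)` would be replaced by an open file handle on `clean_scientific_names.csv`.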


# Extras: 


- TODO(low): handle the different response scenarios explicitly: non-perfect match, HTTP 204, HTTP 404
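
One possible shape for this TODO is to turn each response into an explicit outcome before building the entry. This is a sketch under assumptions: the function name and the outcome labels are hypothetical, a 204 (or an empty result list) is treated as "no match", and `match_type` and `status` are the fields shown in the example API return below.

```python
def classify_lookup(status_code, records=None):
    """Label the outcome of one name lookup (hypothetical helper)."""
    if status_code == 204 or (status_code == 200 and not records):
        return "no_match"        # nothing found for this name
    if status_code != 200:
        return "request_failed"  # e.g. 404 or a server error
    first = records[0]
    if first.get("match_type") != "exact":
        return "inexact_match"   # found, but not an exact match
    if first.get("status") != "accepted":
        return "not_accepted"    # exact hit on a name that is not the accepted one
    return "ok"


print(classify_lookup(204))
print(classify_lookup(404))
print(classify_lookup(200, [{"match_type": "exact", "status": "unaccepted"}]))
print(classify_lookup(200, [{"match_type": "exact", "status": "accepted"}]))
```

Each label could then drive a different action, for example retrying a "no_match" name with `like=true` to get suggestions.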


## Example API return

In [None]:
[
  {
    "AphiaID": 277101,
    "url": "https://www.marinespecies.org/aphia.php?p=taxdetails&id=277101",
    "scientificname": "Cephaloscyllium isabellum",
    "authority": "(Bonnaterre, 1788)",
    "status": "unaccepted",
    "unacceptreason": null,
    "taxonRankID": 220,
    "rank": "Species",
    "valid_AphiaID": 298238,
    "valid_name": "Cephaloscyllium isabella",
    "valid_authority": "(Bonnaterre, 1788)",
    "parentNameUsageID": 204168,
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Elasmobranchii",
    "order": "Carcharhiniformes",
    "family": "Scyliorhinidae",
    "genus": "Cephaloscyllium",
    "citation": "Froese, R. and D. Pauly. Editors. (2024). FishBase. Cephaloscyllium isabellum (Bonnaterre, 1788). Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=277101 on 2024-11-29",
    "lsid": "urn:lsid:marinespecies.org:taxname:277101",
    "isMarine": 1,
    "isBrackish": 0,
    "isFreshwater": 0,
    "isTerrestrial": 0,
    "isExtinct": null,
    "match_type": "exact",
    "modified": "2023-01-11T08:59:53.383Z"
  }