<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Scientific-Names-Validity-Review" data-toc-modified-id="Scientific-Names-Validity-Review-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Scientific Names Validity Review</a></span><ul class="toc-item"><li><span><a href="#Chose-excel-file-containing-the-scientific-names-to-check" data-toc-modified-id="Chose-excel-file-containing-the-scientific-names-to-check-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Chose excel file containing the scientific names to check</a></span></li><li><span><a href="#Small-test-df" data-toc-modified-id="Small-test-df-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Small test df</a></span></li><li><span><a href="#Check-each-of-the-scientific-names" data-toc-modified-id="Check-each-of-the-scientific-names-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Check each of the scientific names</a></span></li><li><span><a href="#run-the-name-checker-on-each-row" data-toc-modified-id="run-the-name-checker-on-each-row-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>run the name checker on each row</a></span></li></ul></li><li><span><a href="#run-the-name-checker-on-the-whole-dataframe" data-toc-modified-id="run-the-name-checker-on-the-whole-dataframe-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>run the name checker on the whole dataframe</a></span></li><li><span><a href="#TODOs-and-Extras:" data-toc-modified-id="TODOs-and-Extras:-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>TODOs and Extras:</a></span><ul class="toc-item"><li><span><a href="#potential-todos" data-toc-modified-id="potential-todos-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>potential todos</a></span></li><li><span><a href="#print-WoRMS-suggestions" data-toc-modified-id="print-WoRMS-suggestions-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>print WoRMS suggestions</a></span></li><li><span><a href="#Example-API-return" data-toc-modified-id="Example-API-return-3.3"><span 
class="toc-item-num">3.3&nbsp;&nbsp;</span>Example API return</a></span></li></ul></li></ul></div>

In [None]:
# Last changed 2024.12.11

# Scientific Names Validity Review

This notebook is part of the 2024 Spyfish data cleaning process. It reviews the validity of the species scientific names in a given column of an Excel sheet.

The checks are performed with calls to the [WoRMS API](https://www.marinespecies.org/rest/AphiaRecordsByName).


The notebook produces a data frame with two columns: the given scientific name and the response from WoRMS, which is one of:
- _exact_: the given scientific name is valid according to the WoRMS API
- _Alternative name_: a new matched name, if the previous one is deprecated or not accepted
- _False_: if there is no match.
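
The three outcomes can be illustrated without calling the API. The sketch below is not the notebook's own checker: `classify_record` is a hypothetical helper that takes an already parsed WoRMS record (or `None` when nothing was found) and applies the rule described above. The field names `status`, `match_type` and `valid_name` are those shown in the example API return at the end of this notebook; the two records are hand-written stand-ins that keep only those fields.

```python
def classify_record(record):
    """Map one parsed WoRMS record to 'exact', an alternative name, or False.

    Hypothetical helper for illustration only. `record` is a dict shaped like
    the example API return, or None when WoRMS found no match.
    """
    if record is None:
        return False
    if record["status"] == "accepted":
        # The name is valid as given; report how WoRMS matched it.
        return record["match_type"]
    # Deprecated or unaccepted name: report the name WoRMS considers valid.
    return record["valid_name"]


# Stand-in records, reduced to the fields the rule needs.
accepted_record = {
    "scientificname": "Kathetostoma giganteum",
    "status": "accepted",
    "valid_name": "Kathetostoma giganteum",
    "match_type": "exact",
}
unaccepted_record = {
    "scientificname": "Cephaloscyllium isabellum",
    "status": "unaccepted",
    "valid_name": "Cephaloscyllium isabella",
    "match_type": "exact",
}

for record in (accepted_record, unaccepted_record, None):
    print(classify_record(record))
```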


In [3]:
import requests
import pandas as pd

from ipyfilechooser import FileChooser
from IPython.display import display


## Choose the Excel file containing the scientific names to check

In [None]:
file_chooser = FileChooser(title='<b>Select the Excel file containing scientific names</b>')
display(file_chooser)

In [None]:
scientific_names_file = file_chooser.selected
assert scientific_names_file is not None, "Select the file containing the scientific names in the cell above."

print(f"The scientific_names_file is {scientific_names_file}")
# The next call assumes the file has a single sheet, or that the relevant sheet comes first.
# Otherwise, pass the sheet via the sheet_name parameter of read_excel.
sheet_df = pd.read_excel(scientific_names_file)

# If the column containing scientific names is not called ScientificName, replace it with the actual column name
sheet_df = sheet_df[["ScientificName"]]

In [None]:
sheet_df.sample(10)

## Small test df
Use this small data frame for testing, to avoid making 160 API calls.
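
On the full sheet, the number of calls could also be reduced by checking each distinct name only once. The following is a minimal sketch, assuming the sheet contains repeated names (the source does not say whether it does). `fake_check` is a hypothetical offline stand-in for the real checker, and `cached_check` is likewise a name introduced here for illustration.

```python
from functools import lru_cache

calls = []  # records every name that actually reaches the checker


def fake_check(name):
    """Offline stand-in for an API-backed checker."""
    calls.append(name)
    return "exact"


@lru_cache(maxsize=None)
def cached_check(name):
    # Each distinct name reaches fake_check at most once;
    # repeats are answered from the cache.
    return fake_check(name)


names = ["Triglidae sp", "Kathetostoma giganteum", "Triglidae sp"]
results = [cached_check(name) for name in names]
print(len(results), "results from", len(calls), "calls")
```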

In [6]:
sheet_df = pd.DataFrame(['Kathetostoma giganteum', # correct
                                    'Kathetostoma giganteu', # typo
                                    'Cephaloscyllium isabellum', # replaced by new nomenclature
                                    'Triglidae sp', # only genus correct
                                    'Blennioidei sp' # new nomenclature genus to fix
                                   ], columns=["ScientificName"])
sheet_df

Unnamed: 0,ScientificName
0,Kathetostoma giganteum
1,Kathetostoma giganteu
2,Cephaloscyllium isabellum
3,Triglidae sp
4,Blennioidei sp


In [7]:
scientific_names_df = sheet_df.copy()
scientific_names_df

Unnamed: 0,ScientificName
0,Kathetostoma giganteum
1,Kathetostoma giganteu
2,Cephaloscyllium isabellum
3,Triglidae sp
4,Blennioidei sp


## Check each of the scientific names 



In [16]:
def check_name(scientific_name):
    """Look up a scientific name in WoRMS.

    Returns the WoRMS match type when the name is accepted, the valid name
    when WoRMS suggests a replacement, and False when there is no match.
    """
    print(f"Checking {scientific_name}")

    # Entries such as "Triglidae sp" name only a higher taxon: look up the first word.
    genus = False
    if scientific_name.endswith(" sp"):
        scientific_name = scientific_name.split()[0]
        genus = True

    url = f"https://www.marinespecies.org/rest/AphiaRecordsByName/{scientific_name}?like=false&marine_only=true&offset=1"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises on 4xx/5xx; an empty 2xx response passes through
    accepted = False
    if response.status_code == 200:
        # Only the first record returned is considered.
        response_json = response.json()[0]
        if response_json["status"] == 'accepted':
            if response_json["scientificname"] != scientific_name:
                print(f"Names don't match: {response_json['scientificname']}: {scientific_name}")
            accepted = response_json["match_type"]
        else:
            # Unaccepted name: return the name WoRMS considers valid.
            accepted = response_json['valid_name']
            if genus:
                accepted = f"{accepted} sp"
        print(f"{scientific_name} was processed, match: {accepted}")
    else:
        print(f"{scientific_name} was not processed, more info:\n{response.content}")
    return accepted
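
The request URL used in `check_name` can also be assembled with explicit percent-encoding, which makes the handling of the space inside a name visible. This is a sketch using only the standard library; `build_worms_url` and `BASE_URL` are hypothetical names, and the query parameters are the ones used in the function above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.marinespecies.org/rest/AphiaRecordsByName"


def build_worms_url(scientific_name):
    """Return the AphiaRecordsByName URL for a non-fuzzy, marine-only lookup."""
    params = urlencode({"like": "false", "marine_only": "true", "offset": 1})
    # quote() percent-encodes the space between the words of the name as %20.
    return f"{BASE_URL}/{quote(scientific_name)}?{params}"


print(build_worms_url("Kathetostoma giganteum"))
```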
    

## run the name checker on each row 


In [17]:
def check_scientific_names(scientific_names_df, column_scientific_names="ScientificName"):
    scientific_names_df["WoRMSScientificNameMatch"] = scientific_names_df[column_scientific_names].apply(check_name)
    return scientific_names_df


# run the name checker on the whole dataframe


In [18]:
check_scientific_names(scientific_names_df)

Checking Kathetostoma giganteum
Kathetostoma giganteum was processed, match: exact
Checking Kathetostoma giganteu
Kathetostoma giganteu was not processed, more info:
b''
Checking Cephaloscyllium isabellum
Cephaloscyllium isabellum was processed, match: Cephaloscyllium isabella
Checking Triglidae sp
Triglidae was processed, match: exact
Checking Blennioidei sp
Blennioidei was processed, match: Blenniiformes sp


Unnamed: 0,ScientificName,WoRMSScientificNameMatch
0,Kathetostoma giganteum,exact
1,Kathetostoma giganteu,False
2,Cephaloscyllium isabellum,Cephaloscyllium isabella
3,Triglidae sp,exact
4,Blennioidei sp,Blenniiformes sp


In [20]:
matching_names = scientific_names_df[scientific_names_df["WoRMSScientificNameMatch"] == "exact"]
print("Matching names n: ",len(matching_names))
matching_names

Matching names n:  2


Unnamed: 0,ScientificName,WoRMSScientificNameMatch
0,Kathetostoma giganteum,exact
3,Triglidae sp,exact


In [21]:
wrong_names = scientific_names_df[
    (scientific_names_df["WoRMSScientificNameMatch"] != "exact")
    & (scientific_names_df["WoRMSScientificNameMatch"] != False)
]
print("Wrong names n: ",len(wrong_names))
wrong_names

Wrong names n:  2


Unnamed: 0,ScientificName,WoRMSScientificNameMatch
2,Cephaloscyllium isabellum,Cephaloscyllium isabella
4,Blennioidei sp,Blenniiformes sp


In [22]:
missing_names = scientific_names_df[scientific_names_df["WoRMSScientificNameMatch"] == False]
print("Missing names n: ",len(missing_names))
missing_names

Missing names n:  1


Unnamed: 0,ScientificName,WoRMSScientificNameMatch
1,Kathetostoma giganteu,False


In [23]:
# Check that every scientific name falls into exactly one of the three groups; the assert fails otherwise

total_processed = len(wrong_names) + len(missing_names) + len(matching_names)
assert len(scientific_names_df) == total_processed, "Some names are unaccounted for; check the results for another issue"
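
The same bookkeeping can be sketched in plain Python: every result falls into exactly one of three buckets, so the bucket counts must add up to the number of names. `bucket` is a hypothetical helper, and the list reuses the five results from the test run above.

```python
from collections import Counter


def bucket(match):
    """Name the category of one WoRMSScientificNameMatch value."""
    if match is False:
        return "missing"
    if match == "exact":
        return "matching"
    return "wrong"  # any other value is a name suggested by WoRMS


matches = ["exact", False, "Cephaloscyllium isabella", "exact", "Blenniiformes sp"]
counts = Counter(bucket(match) for match in matches)

# The buckets are mutually exclusive, so their sizes sum to the total.
assert sum(counts.values()) == len(matches)
print(dict(counts))
```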

# TODOs and Extras: 

The results above are not written back to the original sheet automatically; they need to be transferred manually.

## potential todos

- TODO: create export file, a standardized output of the species list to be used for the app drop down selection.
- TODO(low): deal with the various scenarios: non perfect match/204/404
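
For the low-priority TODO, one possible shape is a small function that turns the HTTP status into an explicit outcome before any JSON is parsed. This is a sketch under assumptions: the test run above shows an empty body for the unmatched name, which is consistent with a 204 (no content) response, and the function name and the labels it returns are hypothetical.

```python
def describe_status(status_code):
    """Translate an HTTP status code into a label for the results column."""
    if status_code == 200:
        return "has_records"  # the body should hold a JSON list of records
    if status_code == 204:
        return "no_match"  # the request succeeded but nothing was found
    if 400 <= status_code < 500:
        return "bad_request"  # for example a malformed or missing name
    return "server_or_other"  # anything else deserves a retry or a manual look


for code in (200, 204, 404, 500):
    print(code, describe_status(code))
```

Keeping the mapping in one place would let the checker record why a name was not processed instead of a bare `False`.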


## print WoRMS suggestions

In [None]:
print("Scientific name in excel: WoRMS suggested name")
for i, row in wrong_names.iterrows():
    print(f"{row['ScientificName']}: {row['WoRMSScientificNameMatch']}")

## Example API return

In [None]:
[
  {
    "AphiaID": 277101,
    "url": "https://www.marinespecies.org/aphia.php?p=taxdetails&id=277101",
    "scientificname": "Cephaloscyllium isabellum",
    "authority": "(Bonnaterre, 1788)",
    "status": "unaccepted",
    "unacceptreason": null,
    "taxonRankID": 220,
    "rank": "Species",
    "valid_AphiaID": 298238,
    "valid_name": "Cephaloscyllium isabella",
    "valid_authority": "(Bonnaterre, 1788)",
    "parentNameUsageID": 204168,
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Elasmobranchii",
    "order": "Carcharhiniformes",
    "family": "Scyliorhinidae",
    "genus": "Cephaloscyllium",
    "citation": "Froese, R. and D. Pauly. Editors. (2024). FishBase. Cephaloscyllium isabellum (Bonnaterre, 1788). Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=277101 on 2024-11-29",
    "lsid": "urn:lsid:marinespecies.org:taxname:277101",
    "isMarine": 1,
    "isBrackish": 0,
    "isFreshwater": 0,
    "isTerrestrial": 0,
    "isExtinct": null,
    "match_type": "exact",
    "modified": "2023-01-11T08:59:53.383Z"
  }