# Exploring Pterosaur Data with Jupyter Notebook

Welcome to this Jupyter Notebook, where we delve into the fascinating world of paleobiology! In this notebook, we'll explore data from the Paleobiology Database to provide a clearer, more accessible overview of pterosaur species and genera. Our goal is to transform complex, raw paleobiological datasets into a format that is not only easier to navigate but also more engaging for those who are new to the field or simply curious about pterosaurs.

### What You'll Find Here

1. **Data Acquisition**: We'll begin by retrieving pterosaur data from the Paleobiology Database, which is a comprehensive repository of fossil records and related information.

2. **Data Cleanup and Transformation**: Next, we'll process and clean the data to ensure it's in a usable format. This involves handling missing values, standardizing formats, and organizing the data for better readability.

3. **Data Visualization**: To make the data more digestible, we'll create visualizations that highlight key aspects of pterosaur species and genera. This includes distribution maps, frequency charts, and other graphical representations.

4. **Insights and Future Actions**: Finally, we'll summarize our findings and offer insights into the patterns and trends observed in the data. From there, we'll discuss how else this data can be used.


### Why This Matters

The Paleobiology Database is a treasure trove of information about prehistoric life, but its sheer volume and complexity can be overwhelming to navigate. By condensing and visualizing this data, we hope to make it more accessible and enjoyable for a broader audience. Whether you're a student, educator, or simply a pterosaur enthusiast, this notebook is designed to offer a clearer view into the ancient world of pterosaurs.


In [1]:
# library imports
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib as mpl
import requests as rq
import io

## Data Acquisition
We start by accessing the Paleobiology Database through their [data service API](https://paleobiodb.org/data1.2/). The query criteria are based on the Taxonomy of Fossil Occurrences dataset. After cleanup, we are left with each taxon's classification, diet, and first and last appearances in the fossil record.

In [2]:
#taxa_url = rq.get('https://paleobiodb.org/data1.2/occs/taxa.csv?base_name=Pterosauria&idreso=species&idqual=certain&pres=regular&max_ma=252&min_ma=65&show=class,size,app,ecospace,img').content
#taxa = pd.read_csv(io.StringIO(taxa_url.decode('utf-8')))[['taxon_rank', 'taxon_name', 'genus', 'family', 'order', 'taxon_size', 'diet', 'firstapp_max_ma', 'lastapp_min_ma']]

# Use the following code if URL no longer works 
# *** (CSV updated on 6/26/2024) ***
taxa = pd.read_csv('../data-reserve/pterosauria-taxa.csv')[['taxon_rank', 'taxon_name', 'genus', 'family', 'order', 'taxon_size', 'diet', 'firstapp_max_ma', 'lastapp_min_ma']]

# Keep only named rows at genus or species rank
taxa = taxa.dropna(subset=['taxon_name']).query("taxon_rank in ['genus', 'species']")
taxa.columns = ['Rank', 'Name', 'Genus', 'Family', 'Order', 'Taxon Size', 'Diet', 'Max MYA', 'Min MYA']

taxa.head()

Unnamed: 0,Rank,Name,Genus,Family,Order,Taxon Size,Diet,Max MYA,Min MYA
4,genus,Pachagnathus,Pachagnathus,Raeticodactylidae,Pterosauria,2.0,piscivore,227.0,208.5
5,species,Pachagnathus benitoi,Pachagnathus,Raeticodactylidae,Pterosauria,1.0,piscivore,227.0,208.5
6,genus,Yelaphomte,Yelaphomte,Raeticodactylidae,Pterosauria,2.0,piscivore,227.0,208.5
7,species,Yelaphomte praderioi,Yelaphomte,Raeticodactylidae,Pterosauria,1.0,piscivore,227.0,208.5
14,genus,Dearc,Dearc,Rhamphorhynchidae,Pterosauria,2.0,piscivore,168.2,165.3
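
The commented-out request above packs every filter into one long URL string. As a minimal sketch, the same URL can be assembled from named parameters, which makes individual filters easier to read and change. The helper name `build_pbdb_url` is hypothetical and not part of the original notebook; the parameter names and values are copied from the URL in the cell above.

```python
from urllib.parse import urlencode

PBDB_BASE = 'https://paleobiodb.org/data1.2'


def build_pbdb_url(endpoint: str, **params) -> str:
    """Assemble a data service URL from keyword parameters (hypothetical helper)."""
    # safe=',' keeps comma-separated values such as show=class,size readable
    return f'{PBDB_BASE}/{endpoint}?{urlencode(params, safe=",")}'


# Same filters as the commented-out taxa request in the cell above
taxa_query = build_pbdb_url(
    'occs/taxa.csv',
    base_name='Pterosauria',
    idreso='species',
    idqual='certain',
    pres='regular',
    max_ma=252,
    min_ma=65,
    show='class,size,app,ecospace,img',
)
print(taxa_query)
```

The resulting string can be passed to `rq.get` exactly as before.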


While the dataset above contains plenty of information, it does not include the locations where each species has been found. To account for this, I am pulling a second dataset that contains all known fossil occurrences and their coordinates. Once merged with the taxa dataset, it will help us analyze where particular pterosaurs were found.

In [4]:
# Requesting occurrence data from the Paleobiology Database API and keeping only the name and coordinate columns
occ_url = rq.get('https://paleobiodb.org/data1.2/occs/list.csv?base_name=Pterosauria&taxon_reso=species&idqual=certain&pres=regular&max_ma=252&min_ma=65&show=class,coords,loc,acconly').content
occ = pd.read_csv(io.StringIO(occ_url.decode('utf-8')))[['accepted_name', 'lng', 'lat']]#.drop(columns=['early_interval', 'late_interval', 'occurrence_no', 'record_type', 'reid_no', 'flags', 'collection_no', 'accepted_rank', 'accepted_no', 'reference_no', 'phylum', 'class', 'county', 'latlng_basis', 'latlng_precision', 'geogscale', 'geogcomments'])

# Use the following code if URL no longer works 
# *** (CSV updated on 6/26/2024) ***
# occ = pd.read_csv('../data-reserve/pterosauria-occ.csv')[['accepted_name', 'lng', 'lat']]


occ.columns = ['Name', 'Longitude', 'Latitude']# County may need to be added later

occ.head()

Unnamed: 0,Name,Longitude,Latitude
0,Dendrorhynchoides curvidentatus,120.8722,41.601398
1,Rhamphinion jenkinsi,-111.006943,35.695278
2,Eurolimnornis corneti,22.4,46.950001
3,Arambourgiania philadelphiae,36.033333,32.016666
4,Pterodaustro guinazui,-66.993797,-32.501301
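
Both acquisition cells switch between the live API and a saved CSV by commenting lines in and out. One possible way to automate that choice is sketched below. The function name `load_csv` and the timeout value are assumptions rather than part of the original notebook; the fallback path in the usage comment is the one referenced in the cell above.

```python
import io

import pandas as pd
import requests


def load_csv(url: str, fallback_path: str, timeout: float = 30.0) -> pd.DataFrame:
    """Read a CSV from `url`, falling back to a local copy if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat HTTP error codes as failures too
        return pd.read_csv(io.StringIO(response.text))
    except requests.RequestException:
        # Network problem, bad URL, or HTTP error: use the saved snapshot instead
        return pd.read_csv(fallback_path)


# Example usage (not executed here):
# occ_raw = load_csv(occ_request_url, '../data-reserve/pterosauria-occ.csv')
```

Keep in mind that the local snapshot may be older than the live data, so results can differ slightly depending on which source was used.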


## Data Cleanup and Transformation
With both dataframes at our disposal, the next step is to clean them up into a presentable format. First, we want to fill in the Age columns and add the corresponding Period and Epoch. To do this, we can compare each first and last appearance date against the boundaries of the Mesozoic ages, then choose the Period and Epoch based on the age the species or genus lived in.
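
To illustrate the second half of that idea, here is a minimal sketch of deriving the combined epoch and period label from an age name with a dictionary lookup. The names `STAGES`, `AGE_TO_PERIOD`, and `label_period` are hypothetical; the groupings follow the age lists used in this notebook.

```python
import pandas as pd

# Age names grouped by (epoch, period), matching the lists used in this notebook
STAGES = {
    ('Lower', 'Triassic'): ['Induan', 'Olenekian'],
    ('Middle', 'Triassic'): ['Anisian', 'Ladinian'],
    ('Upper', 'Triassic'): ['Carnian', 'Norian', 'Rhaetian'],
    ('Lower', 'Jurassic'): ['Hettangian', 'Sinemurian', 'Pliensbachian', 'Toarcian'],
    ('Middle', 'Jurassic'): ['Aalenian', 'Bajocian', 'Bathonian', 'Callovian'],
    ('Upper', 'Jurassic'): ['Oxfordian', 'Kimmeridgian', 'Tithonian'],
    ('Lower', 'Cretaceous'): ['Berriasian', 'Valanginian', 'Hauterivian',
                              'Barremian', 'Aptian', 'Albian'],
    ('Upper', 'Cretaceous'): ['Cenomanian', 'Turonian', 'Coniacian',
                              'Santonian', 'Campanian', 'Maastrichtian'],
}

# Flatten into a single age -> "Epoch Period" lookup table
AGE_TO_PERIOD = {
    age: f'{epoch} {period}'
    for (epoch, period), group in STAGES.items()
    for age in group
}


def label_period(age_names: pd.Series) -> pd.Series:
    """Map age names to 'Epoch Period' labels; unknown or missing ages become NaN."""
    return age_names.map(AGE_TO_PERIOD)


print(label_period(pd.Series(['Norian', 'Bathonian', 'Oxfordian'])).tolist())
```

A lookup table like this keeps every age in exactly one place, so a change to the grouping only needs to be made once.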

In [None]:
# Looking at the dataframe, we need to clean the columns relating to each taxon's time range --> some pterosaurs have NaN as their entries for Early and Late Ages
periods = ['Triassic', 'Jurassic', 'Cretaceous']
epochs = ['Lower', 'Middle', 'Upper']

tri_ages = ['Induan', 'Olenekian', 'Anisian', 'Ladinian', 'Carnian', 'Norian', 'Rhaetian']
jur_ages = ['Hettangian', 'Sinemurian', 'Pliensbachian', 'Toarcian', 'Aalenian', 'Bajocian', 'Bathonian', 'Callovian', 'Oxfordian', 'Kimmeridgian', 'Tithonian']
cre_ages = ['Berriasian', 'Valanginian', 'Hauterivian', 'Barremian', 'Aptian', 'Albian', 'Cenomanian', 'Turonian', 'Coniacian', 'Santonian', 'Campanian', 'Maastrichtian']

ages = [*tri_ages, *jur_ages, *cre_ages]

# Conditions for each age, listed oldest to youngest (upper boundary inclusive)
mya_args = lambda x : [(taxa[x] <= 251.9) & (taxa[x] > 251.2), 
                (taxa[x] <= 251.2) & (taxa[x] > 247.2),
                (taxa[x] <= 247.2) & (taxa[x] > 242),
                (taxa[x] <= 242) & (taxa[x] > 237),
                (taxa[x] <= 237) & (taxa[x] > 227),
                (taxa[x] <= 227) & (taxa[x] > 208.5),
                (taxa[x] <= 208.5) & (taxa[x] > 201.4),
                (taxa[x] <= 201.4) & (taxa[x] > 199.5),
                (taxa[x] <= 199.5) & (taxa[x] > 192.9),
                (taxa[x] <= 192.9) & (taxa[x] > 184.2),
                (taxa[x] <= 184.2) & (taxa[x] > 174.7),
                (taxa[x] <= 174.7) & (taxa[x] > 170.9),
                (taxa[x] <= 170.9) & (taxa[x] > 168.2),
                (taxa[x] <= 168.2) & (taxa[x] > 165.3),
                (taxa[x] <= 165.3) & (taxa[x] > 161.5),
                (taxa[x] <= 161.5) & (taxa[x] > 154.8),
                (taxa[x] <= 154.8) & (taxa[x] > 149.2),
                (taxa[x] <= 149.2) & (taxa[x] > 145),
                (taxa[x] <= 145) & (taxa[x] > 139.8),
                (taxa[x] <= 139.8) & (taxa[x] > 132.6),
                (taxa[x] <= 132.6) & (taxa[x] > 125.77),
                (taxa[x] <= 125.77) & (taxa[x] > 121.4),
                (taxa[x] <= 121.4) & (taxa[x] > 113),
                (taxa[x] <= 113) & (taxa[x] > 100.5),
                (taxa[x] <= 100.5) & (taxa[x] > 93.9),
                (taxa[x] <= 93.9) & (taxa[x] > 89.8),
                (taxa[x] <= 89.8) & (taxa[x] > 86.3),
                (taxa[x] <= 86.3) & (taxa[x] > 83.6),
                (taxa[x] <= 83.6) & (taxa[x] > 72.1),
                (taxa[x] <= 72.1) & (taxa[x] > 66)] 
                
# Arguments for Period and Epoch (will combine into one column later)
pers = lambda x : [(taxa[x].isin(tri_ages)),
                   (taxa[x].isin(jur_ages)),
                   (taxa[x].isin(cre_ages))]

eps = lambda x: [((taxa[x] == 'Induan') | (taxa[x] == 'Olenekian') | (taxa[x] == 'Hettangian') | (taxa[x] == 'Sinemurian') | 
                  (taxa[x] == 'Pliensbachian') | (taxa[x] == 'Toarcian') | (taxa[x] == 'Berriasian') | (taxa[x] == 'Valanginian') | 
                  (taxa[x] == 'Hauterivian') | (taxa[x] == 'Barremian') | (taxa[x] == 'Aptian') | (taxa[x] == 'Albian')),
                 ((taxa[x] == 'Anisian') | (taxa[x] == 'Ladinian') | (taxa[x] == 'Aalenian') | (taxa[x] == 'Bajocian') | 
                  (taxa[x] == 'Bathonian') | (taxa[x] == 'Callovian')),
                 ((taxa[x] == 'Carnian') | (taxa[x] == 'Norian') | (taxa[x] == 'Rhaetian') | (taxa[x] == 'Oxfordian') | 
                  (taxa[x] == 'Kimmeridgian') | (taxa[x] == 'Tithonian') | (taxa[x] == 'Cenomanian') | (taxa[x] == 'Turonian') | 
                  (taxa[x] == 'Coniacian') | (taxa[x] == 'Santonian') | (taxa[x] == 'Campanian') | (taxa[x] == 'Maastrichtian'))
                 ]


# Adding the Period and Age columns
taxa['Early Age'] = np.select(mya_args('Max MYA'), ages, default=pd.NaT)

# We add 0.01 to accommodate edge cases where a pterosaur is estimated to have lived at the boundary between two Mesozoic ages
taxa['Min MYA'] += 0.01
taxa['Late Age'] = np.select(mya_args('Min MYA'), ages, default=pd.NaT)
taxa['Min MYA'] -= 0.01
taxa['Late Age'] = taxa['Late Age'].fillna(taxa['Early Age'])

taxa['Early Period'] = np.select(eps('Early Age'), epochs, default=pd.NaT) + ' ' + np.select(pers('Early Age'), periods, default=pd.NaT)
taxa['Late Period'] = np.select(eps('Late Age'), epochs, default=pd.NaT) + ' ' + np.select(pers('Late Age'), periods, default=pd.NaT)

taxa.head()


Unnamed: 0,Rank,Name,Genus,Family,Order,Taxon Size,Diet,Max MYA,Min MYA,Early Age,Late Age,Early Period,Late Period
4,genus,Pachagnathus,Pachagnathus,Raeticodactylidae,Pterosauria,2.0,piscivore,227.0,208.5,Norian,Norian,Upper Triassic,Upper Triassic
5,species,Pachagnathus benitoi,Pachagnathus,Raeticodactylidae,Pterosauria,1.0,piscivore,227.0,208.5,Norian,Norian,Upper Triassic,Upper Triassic
6,genus,Yelaphomte,Yelaphomte,Raeticodactylidae,Pterosauria,2.0,piscivore,227.0,208.5,Norian,Norian,Upper Triassic,Upper Triassic
7,species,Yelaphomte praderioi,Yelaphomte,Raeticodactylidae,Pterosauria,1.0,piscivore,227.0,208.5,Norian,Norian,Upper Triassic,Upper Triassic
14,genus,Dearc,Dearc,Rhamphorhynchidae,Pterosauria,2.0,piscivore,168.2,165.3,Bathonian,Bathonian,Middle Jurassic,Middle Jurassic
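
The chain of range conditions in the cell above amounts to binning a number into labelled intervals. As an alternative sketch, `pd.cut` can do the same with one list of boundaries. The names `AGES`, `BOUNDS`, and `assign_age` are hypothetical; the boundary values are copied from the conditions above. Setting `right=False` makes each interval include its younger boundary, which plays the role of the 0.01 offset for last appearance dates.

```python
import pandas as pd

# Ages from oldest to youngest, as in the notebook
AGES = ['Induan', 'Olenekian', 'Anisian', 'Ladinian', 'Carnian', 'Norian', 'Rhaetian',
        'Hettangian', 'Sinemurian', 'Pliensbachian', 'Toarcian', 'Aalenian', 'Bajocian',
        'Bathonian', 'Callovian', 'Oxfordian', 'Kimmeridgian', 'Tithonian',
        'Berriasian', 'Valanginian', 'Hauterivian', 'Barremian', 'Aptian', 'Albian',
        'Cenomanian', 'Turonian', 'Coniacian', 'Santonian', 'Campanian', 'Maastrichtian']

# Boundaries in millions of years ago, oldest to youngest (one more than AGES)
BOUNDS = [251.9, 251.2, 247.2, 242, 237, 227, 208.5, 201.4, 199.5, 192.9, 184.2,
          174.7, 170.9, 168.2, 165.3, 161.5, 154.8, 149.2, 145, 139.8, 132.6,
          125.77, 121.4, 113, 100.5, 93.9, 89.8, 86.3, 83.6, 72.1, 66]


def assign_age(mya: pd.Series, include_older_bound: bool = True) -> pd.Series:
    """Bin dates (in MYA) into named ages; values outside the Mesozoic become NaN."""
    # pd.cut needs ascending edges, so both lists are reversed (youngest first)
    return pd.cut(mya, bins=BOUNDS[::-1], labels=AGES[::-1], right=include_older_bound)


# First appearances: a date on a boundary belongs to the younger age
early = assign_age(pd.Series([227.0, 168.2, 170.9]))
# Last appearances: a date on a boundary belongs to the older age
late = assign_age(pd.Series([208.5, 165.3, 154.8]), include_older_bound=False)
print(early.tolist(), late.tolist())
```

The result is a categorical column, so it may need `.astype(object)` before being combined with plain strings.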


I also want to split this dataframe into separate genus and species tables for easier tracking.

In [None]:
species = taxa.loc[taxa['Rank'] == 'species'].reset_index().drop(columns=['Rank', 'Taxon Size', 'index'])
genus = taxa.loc[taxa['Rank'] == 'genus'].reset_index().drop(columns=['Rank', 'Genus', 'index'])

genus.head()

Unnamed: 0,Name,Family,Order,Taxon Size,Diet,Max MYA,Min MYA,Early Age,Late Age,Early Period,Late Period
0,Pachagnathus,Raeticodactylidae,Pterosauria,2.0,piscivore,227.0,208.5,Norian,Norian,Upper Triassic,Upper Triassic
1,Yelaphomte,Raeticodactylidae,Pterosauria,2.0,piscivore,227.0,208.5,Norian,Norian,Upper Triassic,Upper Triassic
2,Dearc,Rhamphorhynchidae,Pterosauria,2.0,piscivore,168.2,165.3,Bathonian,Bathonian,Middle Jurassic,Middle Jurassic
3,Angustinaripterus,Rhamphorhynchidae,Pterosauria,2.0,piscivore,170.9,154.8,Bajocian,Oxfordian,Middle Jurassic,Upper Jurassic
4,Cacibupteryx,Rhamphorhynchidae,Pterosauria,2.0,piscivore,161.5,154.8,Oxfordian,Oxfordian,Upper Jurassic,Upper Jurassic


Next, we want to include the location where each pterosaur was found. To do this, we can left-join the occurrence data onto the taxa dataframe. The main obstacle is that the occurrence data contains duplicate names, since multiple fossils of the same species are often found. To account for this, we combine each Latitude and Longitude pair into a tuple and collect the unique tuples into one list per species.

In [None]:
# Pair each occurrence's coordinates as a (Latitude, Longitude) tuple
occ['Coordinates'] = list(zip(occ['Latitude'], occ['Longitude']))

occ = occ.groupby(occ['Name']).aggregate({'Coordinates' : 'unique'}).reset_index()
occ.head()

Unnamed: 0,Name,Coordinates
0,Aerodraco sedgwickii,"[(52.200001, 0.133333)]"
1,Aerotitan sudamericanus,"[(-39.472221, -67.333611)]"
2,Aetodactylus halli,"[(32.605556, -97.061111)]"
3,Afrotapejara zouhri,"[(30.9, -3.983333), (31.647499, -4.228333)]"
4,Alamodactylus byrdi,"[(33.0, -96.833336)]"
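
To make the deduplication step easier to follow, here is a small sketch on made-up data. The taxon names and coordinates are placeholders, not real records. It drops repeated name and coordinate pairs first and then gathers what remains into one list per name, which is one alternative to aggregating with `'unique'`.

```python
import pandas as pd

# Toy occurrence table: 'Taxon A' was found three times, twice at the same site
toy = pd.DataFrame({
    'Name': ['Taxon A', 'Taxon A', 'Taxon A', 'Taxon B'],
    'Latitude': [30.9, 30.9, 31.6, 52.2],
    'Longitude': [-3.98, -3.98, -4.23, 0.13],
})

# Pair coordinates as (Latitude, Longitude) tuples
toy['Coordinates'] = list(zip(toy['Latitude'], toy['Longitude']))

# Remove repeated sites, then collect the remaining tuples into a list per name
sites = (
    toy.drop_duplicates(subset=['Name', 'Coordinates'])
       .groupby('Name')['Coordinates']
       .agg(list)
       .reset_index()
)
print(sites)
```

The result holds plain Python lists rather than NumPy arrays, which can be simpler to iterate over when plotting points later.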


Now that the data is formatted correctly, we can left-join the occurrence data onto the species dataframe.

In [None]:
species = species.merge(occ, on='Name', how='left').fillna('Not Applicable')
species.head()

Unnamed: 0,Name,Genus,Family,Order,Diet,Max MYA,Min MYA,Early Age,Late Age,Early Period,Late Period,Coordinates
0,Pachagnathus benitoi,Pachagnathus,Raeticodactylidae,Pterosauria,piscivore,227.0,208.5,Norian,Norian,Upper Triassic,Upper Triassic,"[(-31.633333, -67.26667)]"
1,Yelaphomte praderioi,Yelaphomte,Raeticodactylidae,Pterosauria,piscivore,227.0,208.5,Norian,Norian,Upper Triassic,Upper Triassic,"[(-31.633333, -67.26667)]"
2,Dearc sgiathanach,Dearc,Rhamphorhynchidae,Pterosauria,piscivore,168.2,165.3,Bathonian,Bathonian,Middle Jurassic,Middle Jurassic,"[(57.58453, -6.14067)]"
3,Angustinaripterus longicephalus,Angustinaripterus,Rhamphorhynchidae,Pterosauria,piscivore,170.9,154.8,Bajocian,Oxfordian,Middle Jurassic,Upper Jurassic,"[(29.411388, 104.852501)]"
4,Cacibupteryx caribensis,Cacibupteryx,Rhamphorhynchidae,Pterosauria,piscivore,161.5,154.8,Oxfordian,Oxfordian,Upper Jurassic,Upper Jurassic,"[(22.716667, -83.616669)]"
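
A left-join keeps every species even when no occurrence matches, so it can be worth checking how many rows came back without coordinates. The sketch below uses toy frames with placeholder names and the `indicator` option of `DataFrame.merge` to list unmatched species. It also fills only the `Coordinates` column, which is an assumption about the intent, since the cell above fills every column.

```python
import pandas as pd

# Toy stand-ins for the species and occurrence tables
species_toy = pd.DataFrame({'Name': ['Taxon A', 'Taxon B', 'Taxon C']})
occ_toy = pd.DataFrame({
    'Name': ['Taxon A', 'Taxon B'],
    'Coordinates': [[(1.0, 2.0)], [(3.0, 4.0)]],
})

# indicator=True adds a '_merge' column recording where each row was found
merged = species_toy.merge(occ_toy, on='Name', how='left', indicator=True)

# Species present in the taxa table but absent from the occurrence table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Name'].tolist()
print(unmatched)

# Fill only the coordinate column so numeric columns keep their dtype
merged['Coordinates'] = merged['Coordinates'].fillna('Not Applicable')
merged = merged.drop(columns='_merge')
```

Any names in `unmatched` point to spelling differences or taxa without accepted occurrences, and are worth reviewing before mapping.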
