# Exploring Dinosaur Data with Jupyter Notebook

Welcome to this Jupyter Notebook, where we delve into the fascinating world of paleobiology! In this notebook, we'll explore data from the Paleobiology Database to provide a clearer, more accessible overview of dinosaur species and genera. Our goal is to transform complex, raw paleobiological datasets into a format that is not only easier to navigate but also more engaging for those who are new to the field or simply curious about dinosaurs.

### What You'll Find Here

1. **Data Acquisition**: We'll begin by retrieving dinosaur data from the Paleobiology Database, which is a comprehensive repository of fossil records and related information.

2. **Data Cleanup and Transformation**: Next, we'll process and clean the data to ensure it's in a usable format. This involves handling missing values, standardizing formats, and organizing the data for better readability.

3. **Data Visualization and Analysis**: To make the data more digestible, we'll create visualizations that highlight key aspects of dinosaur species and genera. This includes distribution maps, frequency charts, and other graphical representations.

4. **Insights and Future Actions**: Finally, we'll summarize our findings and offer insights into the patterns and trends observed in the data. From there, we will proceed to discuss how else this data can be used.


### Why This Matters

The Paleobiology Database is a treasure trove of information about prehistoric life, but its sheer volume and complexity can be overwhelming to navigate. By condensing and visualizing this data, we hope to make it more accessible and enjoyable for a broader audience. Whether you're a student, educator, or simply a dinosaur enthusiast, this notebook is designed to offer a clearer view into the ancient world of dinosaurs.


In [56]:
# library imports
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib as mpl
import requests as rq
import folium as fm
import re
import io

### **Data Acquisition**
We start by accessing the Paleobiology Database through their [data service API](https://paleobiodb.org/data1.2/). The query criteria are based on the Taxonomy of Fossil Occurrences dataset. After cleanup, we are left with the classifications, diet, and first/last appearances in the fossil record.

In [57]:
taxa_url = rq.get('https://paleobiodb.org/data1.2/occs/taxa.csv?base_name=Dinosauria&idreso=species&idqual=certain&pres=regular&max_ma=252&min_ma=65&show=class,size,app,ecospace,img').content
taxa = pd.read_csv(io.StringIO(taxa_url.decode('utf-8')))[['taxon_rank', 'taxon_name', 'genus', 'family', 'order', 'taxon_size', 'diet', 'firstapp_max_ma', 'lastapp_min_ma']]

# Use the following code if URL no longer works 
# *** (CSV updated on 6/26/2024) ***
#taxa = pd.read_csv('../data-reserve/dinosauria-taxa.csv')[['taxon_rank', 'taxon_name', 'genus', 'family', 'order', 'taxon_size', 'diet', 'firstapp_max_ma', 'lastapp_min_ma']]

taxa = taxa.dropna(subset=['taxon_name']).query('(taxon_rank == \'genus\') or (taxon_rank == \'species\')')
taxa.columns = ['Rank', 'Name', 'Genus', 'Family', 'Order', 'Taxon Size', 'Diet', 'Max MYA', 'Min MYA']

taxa = taxa.replace(regex=['NO_FAMILY_SPECIFIED', 'NO_ORDER_SPECIFIED'], value='Unknown')

taxa['Diet'] = taxa['Diet'].str.capitalize()
taxa['Order'] = 'Unknown'

taxa.head()

Unnamed: 0,Rank,Name,Genus,Family,Order,Taxon Size,Diet,Max MYA,Min MYA
10,genus,Ajkaceratops,Ajkaceratops,Unknown,Unknown,2.0,Herbivore,86.3,83.6
11,species,Ajkaceratops kozmai,Ajkaceratops,Unknown,Unknown,1.0,Herbivore,86.3,83.6
12,genus,Turanoceratops,Turanoceratops,Unknown,Unknown,2.0,Herbivore,93.9,89.8
13,species,Turanoceratops tardabilis,Turanoceratops,Unknown,Unknown,1.0,Herbivore,93.9,89.8
14,genus,Zuniceratops,Zuniceratops,Unknown,Unknown,2.0,Herbivore,93.9,89.8
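
The long query string used above can be hard to read and edit. As a minimal sketch, the same kind of URL could be assembled from named parameters with the standard library. The helper name `build_query_url` is hypothetical and not part of the notebook; the parameter names are the ones already present in the request above.

```python
from urllib.parse import urlencode

TAXA_ENDPOINT = 'https://paleobiodb.org/data1.2/occs/taxa.csv'

def build_query_url(base, **params):
    """Join a base URL and keyword parameters into one query URL.

    Commas are left unescaped so that list-style values such as
    'class,size' stay readable, matching the hand-written URL.
    """
    return base + '?' + urlencode(params, safe=',')

# Same parameters as the taxa request above, spelled out one per line
taxa_query = build_query_url(
    TAXA_ENDPOINT,
    base_name='Dinosauria',
    idreso='species',
    idqual='certain',
    pres='regular',
    max_ma=252,
    min_ma=65,
    show='class,size,app,ecospace,img',
)
print(taxa_query)
```

The resulting string can be passed to `rq.get` exactly like the literal URL. Calling `raise_for_status()` on the response before decoding it would make a failed download obvious, at which point the local CSV fallback can be used instead.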


While the dataset above contains a wealth of information, it does not include the locations where the species were found. To account for this, I am retrieving another dataset that contains all known fossil occurrences and their respective locations. Once merged with the taxa dataset, this data will help analyze where certain dinosaurs have been found.

In [58]:
# Scraping data from the Paleobiology Database and cleaning up the Null values
occ_url = rq.get('https://paleobiodb.org/data1.2/occs/list.csv?base_name=Dinosauria&taxon_reso=species&idqual=certain&pres=regular&max_ma=252&min_ma=65&show=class,coords,loc,strat,acconly').content
occ = pd.read_csv(io.StringIO(occ_url.decode('utf-8')))[['accepted_name', 'lng', 'lat', 'formation']]

# Use the following code if URL no longer works 
# *** (CSV updated on 6/26/2024) ***
# occ = pd.read_csv('../data-reserve/dinosauria-occ.csv')[['accepted_name', 'lng', 'lat', 'formation']]

occ.columns = ['Name', 'Longitude', 'Latitude', 'Formation']

occ.head()

Unnamed: 0,Name,Longitude,Latitude,Formation
0,Chaoyangsaurus youngi,123.966698,42.9333,Tuchengzi
1,Protarchaeopteryx robusta,120.73333,41.799999,Yixian
2,Caudipteryx zoui,120.73333,41.799999,Yixian
3,Gorgosaurus libratus,-111.528732,50.740726,Dinosaur Park
4,Gorgosaurus libratus,-111.549347,50.737015,Dinosaur Park


### **Data Cleanup and Transformation**
With both of these dataframes at our disposal, the next step is to clean them up into a presentable format. First, we want to fix the null values in the Age columns and add the corresponding Periods and Epochs. To do this, we compare each first and last appearance date (in millions of years ago) against the boundaries of the Mesozoic ages, then choose the Period and Epoch based on the Age the species or genus lived in.

In [59]:

# Looking at the dataframe, we need to clean the columns relating to species lifetime --> Some dinosaurs have NaN as their entries for Early and Late Ages
periods = ['Triassic', 'Jurassic', 'Cretaceous']
epochs = ['Lower', 'Middle', 'Upper']

tri_ages = ['Induan', 'Olenekian', 'Anisian', 'Ladinian', 'Carnian', 'Norian', 'Rhaetian']
jur_ages = ['Hettangian', 'Sinemurian', 'Pliensbachian', 'Toarcian', 'Aalenian', 'Bajocian', 'Bathonian', 'Callovian', 'Oxfordian', 'Kimmeridgian', 'Tithonian']
cre_ages = ['Berriasian', 'Valanginian', 'Hauterivian', 'Barremian', 'Aptian', 'Albian', 'Cenomanian', 'Turonian', 'Coniacian', 'Santonian', 'Campanian', 'Maastrichtian']

ages = [*tri_ages, *jur_ages, *cre_ages]

# Arguments for Age
mya_args = lambda x : [(taxa[x] <= 251.9) & (taxa[x] > 251.2), 
                (taxa[x] <= 251.2) & (taxa[x] > 247.2),
                (taxa[x] <= 247.2) & (taxa[x] > 242),
                (taxa[x] <= 242) & (taxa[x] > 237),
                (taxa[x] <= 237) & (taxa[x] > 227),
                (taxa[x] <= 227) & (taxa[x] > 208.5),
                (taxa[x] <= 208.5) & (taxa[x] > 201.4),
                (taxa[x] <= 201.4) & (taxa[x] > 199.5),
                (taxa[x] <= 199.5) & (taxa[x] > 192.9),
                (taxa[x] <= 192.9) & (taxa[x] > 184.2),
                (taxa[x] <= 184.2) & (taxa[x] > 174.7),
                (taxa[x] <= 174.7) & (taxa[x] > 170.9),
                (taxa[x] <= 170.9) & (taxa[x] > 168.2),
                (taxa[x] <= 168.2) & (taxa[x] > 165.3),
                (taxa[x] <= 165.3) & (taxa[x] > 161.5),
                (taxa[x] <= 161.5) & (taxa[x] > 154.8),
                (taxa[x] <= 154.8) & (taxa[x] > 149.2),
                (taxa[x] <= 149.2) & (taxa[x] > 145),
                (taxa[x] <= 145) & (taxa[x] > 139.8),
                (taxa[x] <= 139.8) & (taxa[x] > 132.6),
                (taxa[x] <= 132.6) & (taxa[x] > 125.77),
                (taxa[x] <= 125.77) & (taxa[x] > 121.4),
                (taxa[x] <= 121.4) & (taxa[x] > 113),
                (taxa[x] <= 113) & (taxa[x] > 100.5),
                (taxa[x] <= 100.5) & (taxa[x] > 93.9),
                (taxa[x] <= 93.9) & (taxa[x] > 89.8),
                (taxa[x] <= 89.8) & (taxa[x] > 86.3),
                (taxa[x] <= 86.3) & (taxa[x] > 83.6),
                (taxa[x] <= 83.6) & (taxa[x] > 72.1),
                (taxa[x] <= 72.1) & (taxa[x] > 66)] 
                
# Arguments for Period and Epoch (will combine into one column later)
pers = lambda x : [(taxa[x].isin(tri_ages)),
                   (taxa[x].isin(jur_ages)),
                   (taxa[x].isin(cre_ages))]

eps = lambda x: [((taxa[x] == 'Induan') | (taxa[x] == 'Olenekian') | (taxa[x] == 'Hettangian') | (taxa[x] == 'Sinemurian') | 
                  (taxa[x] == 'Pliensbachian') | (taxa[x] == 'Toarcian') | (taxa[x] == 'Berriasian') | (taxa[x] == 'Valanginian') | 
                  (taxa[x] == 'Hauterivian') | (taxa[x] == 'Barremian') | (taxa[x] == 'Aptian') | (taxa[x] == 'Albian')),
                 ((taxa[x] == 'Anisian') | (taxa[x] == 'Ladinian') | (taxa[x] == 'Aalenian') | (taxa[x] == 'Bajocian') | 
                  (taxa[x] == 'Bathonian') | (taxa[x] == 'Callovian')),
                 ((taxa[x] == 'Carnian') | (taxa[x] == 'Norian') | (taxa[x] == 'Rhaetian') | (taxa[x] == 'Oxfordian') | 
                  (taxa[x] == 'Kimmeridgian') | (taxa[x] == 'Tithonian') | (taxa[x] == 'Cenomanian') | (taxa[x] == 'Turonian') | 
                  (taxa[x] == 'Coniacian') | (taxa[x] == 'Santonian') | (taxa[x] == 'Campanian') | (taxa[x] == 'Maastrichtian'))
                 ]


# Adding the Period and Age columns
taxa['Early Age'] = np.select(mya_args('Max MYA'), ages, default=pd.NaT)

# We add 0.01 to accommodate edge cases where a dinosaur is estimated to have lived at the cusp of two Mesozoic ages
taxa['Min MYA'] += 0.01
taxa['Late Age'] = np.select(mya_args('Min MYA'), ages, default=pd.NaT)
taxa['Min MYA'] -= 0.01
taxa['Late Age'] = taxa['Late Age'].fillna(taxa['Early Age'])

taxa['Early Period'] = np.select(eps('Early Age'), epochs, default=pd.NaT) + ' ' + np.select(pers('Early Age'), periods, default=pd.NaT)
taxa['Late Period'] = np.select(eps('Late Age'), epochs, default=pd.NaT) + ' ' + np.select(pers('Late Age'), periods, default=pd.NaT)

# Adding a lifespan column to show how long each species/genus lived
taxa['Lifespan (MYA)'] = taxa['Max MYA'] - taxa['Min MYA']
taxa.head()


Unnamed: 0,Rank,Name,Genus,Family,Order,Taxon Size,Diet,Max MYA,Min MYA,Early Age,Late Age,Early Period,Late Period,Lifespan (MYA)
10,genus,Ajkaceratops,Ajkaceratops,Unknown,Unknown,2.0,Herbivore,86.3,83.6,Santonian,Santonian,Upper Cretaceous,Upper Cretaceous,2.7
11,species,Ajkaceratops kozmai,Ajkaceratops,Unknown,Unknown,1.0,Herbivore,86.3,83.6,Santonian,Santonian,Upper Cretaceous,Upper Cretaceous,2.7
12,genus,Turanoceratops,Turanoceratops,Unknown,Unknown,2.0,Herbivore,93.9,89.8,Turonian,Turonian,Upper Cretaceous,Upper Cretaceous,4.1
13,species,Turanoceratops tardabilis,Turanoceratops,Unknown,Unknown,1.0,Herbivore,93.9,89.8,Turonian,Turonian,Upper Cretaceous,Upper Cretaceous,4.1
14,genus,Zuniceratops,Zuniceratops,Unknown,Unknown,2.0,Herbivore,93.9,89.8,Turonian,Turonian,Upper Cretaceous,Upper Cretaceous,4.1
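
The interval conditions in the cell above amount to a lookup in a sorted list of boundaries. As a rough sketch of that idea, and not a replacement for the notebook's own code, the same half-open intervals `(lower, upper]` can be searched with `bisect`. Only the Upper Cretaceous boundaries are included to keep the example short; `BOUNDS`, `AGES`, and `age_for` are hypothetical names.

```python
from bisect import bisect_left

# Upper Cretaceous boundaries in Ma, ascending, copied from the conditions above
BOUNDS = [66, 72.1, 83.6, 86.3, 89.8, 93.9, 100.5]
# AGES[k] covers the interval (BOUNDS[k], BOUNDS[k + 1]]
AGES = ['Maastrichtian', 'Campanian', 'Santonian', 'Coniacian', 'Turonian', 'Cenomanian']

def age_for(ma):
    """Return the age whose interval (lower, upper] contains ma, else None."""
    i = bisect_left(BOUNDS, ma)
    if i == 0 or i == len(BOUNDS):
        # At or below the youngest boundary, or above the oldest one
        return None
    return AGES[i - 1]

print(age_for(86.3))   # upper bound of the Santonian interval
print(age_for(83.61))  # just above the Campanian/Santonian boundary
```

Applied to a column, this would look like `taxa['Max MYA'].map(age_for)`. A missing value falls through to `None` because every comparison against NaN is false.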


I also want to split this dataframe to separate the genera from the species for easier tracking.

In [60]:
species = taxa.loc[taxa['Rank'] == 'species'].reset_index().drop(columns=['Rank', 'Taxon Size', 'index'])
genus = taxa.loc[taxa['Rank'] == 'genus'].reset_index().drop(columns=['Rank', 'Genus', 'index'])

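# Dropping this count by 1 because Taxon Size appears to include the genus itself (a genus with one species is listed with a size of 2)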
genus['Taxon Size'] = genus['Taxon Size'].astype(int) - 1
genus.head()

Unnamed: 0,Name,Family,Order,Taxon Size,Diet,Max MYA,Min MYA,Early Age,Late Age,Early Period,Late Period,Lifespan (MYA)
0,Ajkaceratops,Unknown,Unknown,1,Herbivore,86.3,83.6,Santonian,Santonian,Upper Cretaceous,Upper Cretaceous,2.7
1,Turanoceratops,Unknown,Unknown,1,Herbivore,93.9,89.8,Turonian,Turonian,Upper Cretaceous,Upper Cretaceous,4.1
2,Zuniceratops,Unknown,Unknown,1,Herbivore,93.9,89.8,Turonian,Turonian,Upper Cretaceous,Upper Cretaceous,4.1
3,Bagaceratops,Protoceratopsidae,Unknown,1,Herbivore,83.6,66.0,Campanian,Maastrichtian,Upper Cretaceous,Upper Cretaceous,17.6
4,Breviceratops,Protoceratopsidae,Unknown,1,Herbivore,83.6,72.1,Campanian,Campanian,Upper Cretaceous,Upper Cretaceous,11.5


Next, we want to include the location where each dinosaur was found. To do this, we can left-join the occurrences data onto the taxa dataframe. The biggest obstacle is that the occurrences data contains duplicates, since multiple fossils of the same species have been found. To account for this, we combine each Latitude and Longitude into a two-element tuple and collect these into a list of unique tuples per species.

In [61]:
occ['Coordinates'] = tuple(zip(occ['Latitude'], occ['Longitude']))

occ = occ.groupby(occ['Name']).aggregate({'Formation' : 'unique', 'Coordinates' : 'unique'}).reset_index()
occ.head()

Unnamed: 0,Name,Formation,Coordinates
0,Aardonyx celestae,[Elliot],"[(-28.466389, 27.824444)]"
1,Abavornis bonaparti,[Bissekty],"[(42.117294, 62.655315)]"
2,Abdarainurus barsboldi,[Alagteeg],"[(44.523335, 103.154999)]"
3,Abditosaurus kuehnei,[Conques],"[(42.159443, 0.973056)]"
4,Abelisaurus comahuensis,[Anacleto],"[(-38.76173, -67.982765)]"
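
To make the grouping step easier to follow, here is a small sketch on a toy table. The species names and coordinates are invented for illustration and do not come from the database. It shows how repeated occurrences collapse into one row per species with a list of unique `(latitude, longitude)` tuples.

```python
import pandas as pd

# Toy occurrence table: 'Species A' appears three times, twice at the same spot
toy_occ = pd.DataFrame({
    'Name': ['Species A', 'Species A', 'Species B', 'Species A'],
    'Latitude': [1.0, 1.0, 2.0, 3.0],
    'Longitude': [10.0, 10.0, 20.0, 30.0],
})

# Pair each latitude with its longitude as a tuple
toy_occ['Coordinates'] = list(zip(toy_occ['Latitude'], toy_occ['Longitude']))

# One row per species, keeping only the distinct coordinate tuples
grouped = toy_occ.groupby('Name')['Coordinates'].agg('unique').reset_index()

# Plain dict view of the result, convenient for inspection
coords_by_name = {name: list(coords)
                  for name, coords in zip(grouped['Name'], grouped['Coordinates'])}
print(coords_by_name)
```

Tuples are used rather than lists because they are hashable, which is what lets `unique` deduplicate them.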


Now that the data is formatted correctly, we can left-join it onto the species dataframe.

In [62]:
species = species.merge(occ, on='Name', how='left').fillna('Not Applicable')
species.head()

Unnamed: 0,Name,Genus,Family,Order,Diet,Max MYA,Min MYA,Early Age,Late Age,Early Period,Late Period,Lifespan (MYA),Formation,Coordinates
0,Ajkaceratops kozmai,Ajkaceratops,Unknown,Unknown,Herbivore,86.3,83.6,Santonian,Santonian,Upper Cretaceous,Upper Cretaceous,2.7,[Csehbánya],"[(47.216702, 17.6)]"
1,Turanoceratops tardabilis,Turanoceratops,Unknown,Unknown,Herbivore,93.9,89.8,Turonian,Turonian,Upper Cretaceous,Upper Cretaceous,4.1,[Bissekty],"[(42.117294, 62.655315)]"
2,Zuniceratops christopheri,Zuniceratops,Unknown,Unknown,Herbivore,93.9,89.8,Turonian,Turonian,Upper Cretaceous,Upper Cretaceous,4.1,[Moreno Hill],"[(35.066666, -108.849998)]"
3,Bagaceratops rozhdestvenskyi,Bagaceratops,Protoceratopsidae,Unknown,Herbivore,83.6,72.1,Campanian,Campanian,Upper Cretaceous,Upper Cretaceous,11.5,"[Baruungoyot, Bayan Mandahu]","[(43.25, 99.75), (43.299999, 99.599998), (41.7..."
4,Breviceratops kozlowskii,Breviceratops,Protoceratopsidae,Unknown,Herbivore,83.6,72.1,Campanian,Campanian,Upper Cretaceous,Upper Cretaceous,11.5,[Baruungoyot],"[(43.487499, 101.125)]"
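
A left join keeps every species even when no occurrence matches. One way to see which rows came back empty, sketched here on invented toy data, is the `indicator=True` option of `DataFrame.merge`. It adds a `_merge` column marking each row as `both` or `left_only`.

```python
import pandas as pd

# Toy inputs: 'Species B' has no occurrence record
species_toy = pd.DataFrame({'Name': ['Species A', 'Species B']})
occ_toy = pd.DataFrame({'Name': ['Species A'],
                        'Coordinates': [[(1.0, 10.0)]]})

merged = species_toy.merge(occ_toy, on='Name', how='left', indicator=True)

# Species present in the taxa table but absent from the occurrences
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Name'].tolist()
print(unmatched)
```

Checking this list before filling missing values can help explain why a particular taxon seems to be missing location data.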


#### **Extra Datasets**

In [63]:
formations = pd.read_csv('../data-reserve/formation-list.csv')

# this was the initial data scrape from wikipedia's list of fossil formations
#formations = pd.read_html('https://en.wikipedia.org/wiki/List_of_stratigraphic_units_with_dinosaur_body_fossils')
#formations = pd.concat(formations).reset_index()[['Name', 'Location']]

#formations['Name'] = formations['Name'].str.replace(r'\[\d+\]', '', regex=True)
#formations['Location'] = formations['Location'].str.replace(r'\[\d+\]', '', regex=True).str.split()
#formations['Center'] = pd.NA

#formations.to_csv('../data-reserve/formation-list.csv', index=False)

formations.head()

Unnamed: 0,Name,Location,Center
0,Aguja Formation,"['Mexico', 'USA']",
1,Allen Formation,['Argentina'],
2,Bajo de la Carpa Formation,['Argentina'],
3,Barun Goyot Formation,['Mongolia'],
4,Bayan Mandahu Formation,['China'],


##### Dataset Testing - *delete this once final product is done*

Antarctopelta is not being found in the dataframe. Need to double-check this in the CSV and on the Paleobiology Database.

In [64]:
test = species.loc[species['Genus'] == 'Antarctopelta']
test

Unnamed: 0,Name,Genus,Family,Order,Diet,Max MYA,Min MYA,Early Age,Late Age,Early Period,Late Period,Lifespan (MYA),Formation,Coordinates


In [65]:
g_test = genus.loc[genus['Name'] == 'Camarasaurus']
g_test

Unnamed: 0,Name,Family,Order,Taxon Size,Diet,Max MYA,Min MYA,Early Age,Late Age,Early Period,Late Period,Lifespan (MYA)
528,Camarasaurus,Camarasauridae,Unknown,5,Herbivore,161.5,145.0,Oxfordian,Tithonian,Upper Jurassic,Upper Jurassic,16.5


### **Data Visualization and Analysis**
In this section, we perform a comprehensive analysis of the paleobiological dataset, focusing on the Dinosauria and Pterosauria taxa. The analysis aims to uncover patterns, trends, and insights regarding the diversity, distribution, and evolutionary history of these ancient species.

***Geographical Analysis***: Using the coordinates of fossil findings, we map the geographical spread of species, revealing hotspots of paleobiological activity and potential correlations with ancient environmental conditions.

***Species Distribution***: We analyze the distribution of species across different taxonomic orders, periods, and geographical locations. Visualizations such as bar charts, pie charts, and maps are employed to highlight diversity and prevalence within these groups.

***Temporal Analysis***: By examining the first and last appearances of species, we create timelines to track the evolutionary lifespan and extinction patterns. This helps to identify periods of significant biodiversity or extinction events.

***Lifespan Analysis***: The analysis of species lifespans across different orders and periods provides insights into the survival and adaptation strategies of these taxa. Statistical comparisons and visual summaries are used to present the findings.

#### **Geographical Analysis** ####

In [66]:
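map = fm.Map(location=[20, 0], zoom_start=1.5)
for index, i in species.iterrows():
    # Species without occurrence data hold the 'Not Applicable' filler string
    # instead of a list of coordinates, so there is nothing to plot for them
    if isinstance(i['Coordinates'], str):
        continue

    # Build the period label once per species; show a range only when the periods differ
    perd = str(i['Early Period'])
    if (i['Early Period'] != i['Late Period']):
        perd += ' - ' + str(i['Late Period'])

    # One marker per unique (latitude, longitude) pair
    for coords in i['Coordinates']:
        pops = fm.IFrame('<h2><b><i>' + i['Name'] + '</i></b></h2><b>Order:</b> ' + i['Order'] + '<br><b>Diet:</b> ' + i['Diet'] + '<br><b>Period:</b> ' + perd)
        fm.CircleMarker(location=[coords[0], coords[1]],
                        radius=5, popup=fm.Popup(pops, min_width = 250, max_width = 250),
                        color='green').add_to(map)

display(map)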
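
The popup text is built by concatenating strings inside the loop. As an optional sketch, that logic could live in a small helper that also escapes the values before they are placed into HTML. The function name `popup_html` is hypothetical, and the markup mirrors the string used in the cell above.

```python
from html import escape

def popup_html(name, order, diet, early_period, late_period):
    """Build the popup markup for one species.

    The period is shown as a range only when the early and late
    periods differ. Values are escaped so that stray '<' or '&'
    characters in the data cannot break the markup.
    """
    period = str(early_period)
    if early_period != late_period:
        period += ' - ' + str(late_period)
    return ('<h2><b><i>' + escape(str(name)) + '</i></b></h2>'
            + '<b>Order:</b> ' + escape(str(order))
            + '<br><b>Diet:</b> ' + escape(str(diet))
            + '<br><b>Period:</b> ' + escape(period))

print(popup_html('Camarasaurus', 'Unknown', 'Herbivore',
                 'Upper Jurassic', 'Upper Jurassic'))
```

Inside the loop, the returned string would be passed to `fm.IFrame` in place of the inline concatenation.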