# Exploring Dinosaur Data with Jupyter Notebook

Welcome to this Jupyter Notebook, where we delve into the fascinating world of paleobiology! In this notebook, we'll explore data from the Paleobiology Database to provide a clearer, more accessible overview of dinosaur species and genera. Our goal is to transform complex, raw paleobiological datasets into a format that is not only easier to navigate but also more engaging for those who are new to the field or simply curious about dinosaurs.

### What You'll Find Here

1. **Data Acquisition**: We'll begin by retrieving dinosaur data from the Paleobiology Database, which is a comprehensive repository of fossil records and related information.

2. **Data Cleaning and Transformation**: Next, we'll process and clean the data to ensure it's in a usable format. This involves handling missing values, standardizing formats, and organizing the data for better readability.

3. **Data Visualization**: To make the data more digestible, we'll create visualizations that highlight key aspects of dinosaur species and genera. This includes distribution maps, frequency charts, and other graphical representations.

4. **Insights and Conclusions**: Finally, we'll summarize our findings and offer insights into the patterns and trends observed in the data.

### Why This Matters

The Paleobiology Database is a treasure trove of information about prehistoric life, but its sheer volume and complexity can be overwhelming. By condensing and visualizing this data, we hope to make it more accessible and enjoyable for a broader audience. Whether you're a student, educator, or simply a dinosaur enthusiast, this notebook is designed to offer a clearer view into the ancient world of dinosaurs.


In [39]:
# library imports
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib as mpl
import requests as rq
import io

## Data Retrieval
We start by accessing the Paleobiology Database through its [data service API](https://paleobiodb.org/data1.2/). The query criteria are based on the Taxonomy of Fossil Occurrences dataset. After cleanup, we are left with each taxon's classification, diet, and first and last appearances in the fossil record.

In [40]:
# Request the dinosaur taxa (252 to 65 Ma) from the occurrence taxonomy endpoint as CSV
taxa_csv = rq.get('https://paleobiodb.org/data1.2/occs/taxa.csv?base_name=Dinosauria&idreso=species&idqual=certain&pres=regular&max_ma=252&min_ma=65&show=class,size,app,ecospace,img').content
# Keep only the classification, size, diet, and first/last appearance columns
taxa = pd.read_csv(io.StringIO(taxa_csv.decode('utf-8')))[['taxon_rank', 'taxon_name', 'genus', 'family', 'order', 'taxon_size', 'diet', 'firstapp_max_ma', 'lastapp_min_ma']]
# Drop unnamed rows and keep genus- and species-level entries only
taxa = taxa.dropna(subset=['taxon_name']).query("taxon_rank in ['genus', 'species']")
taxa.columns = ['Rank', 'Name', 'Genus', 'Family', 'Order', 'Taxon Size', 'Diet', 'Max MYA', 'Min MYA']

taxa.head()

Unnamed: 0,Rank,Name,Genus,Family,Order,Taxon Size,Diet,Max MYA,Min MYA
10,genus,Ajkaceratops,Ajkaceratops,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,2.0,herbivore,86.3,83.6
11,species,Ajkaceratops kozmai,Ajkaceratops,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,1.0,herbivore,86.3,83.6
12,genus,Turanoceratops,Turanoceratops,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,2.0,herbivore,93.9,89.8
13,species,Turanoceratops tardabilis,Turanoceratops,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,1.0,herbivore,93.9,89.8
14,genus,Zuniceratops,Zuniceratops,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,2.0,herbivore,93.9,89.8
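
The request above packs every filter into one long URL. As a minimal sketch, the same URL can be assembled from a dictionary so that each parameter sits on its own line. The helper name `build_pbdb_url` is hypothetical (it is not part of this notebook or of any PBDB client), and the snippet only builds the string; the result would still be passed to `rq.get` as in the cell above.

```python
from urllib.parse import urlencode

PBDB_BASE = 'https://paleobiodb.org/data1.2/'  # base URL used throughout this notebook


def build_pbdb_url(endpoint, params):
    """Hypothetical helper: join the base URL, an endpoint, and its query parameters."""
    # safe=',' leaves the commas in list-valued parameters such as `show` unescaped
    return PBDB_BASE + endpoint + '?' + urlencode(params, safe=',')


# Same parameters as the taxa request above, one per line
taxa_params = {
    'base_name': 'Dinosauria',
    'idreso': 'species',
    'idqual': 'certain',
    'pres': 'regular',
    'max_ma': 252,
    'min_ma': 65,
    'show': 'class,size,app,ecospace,img',
}

taxa_url = build_pbdb_url('occs/taxa.csv', taxa_params)
```

Keeping the parameters in a dictionary makes it easier to change one filter (for example the time range) without retyping the whole query string.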


The dataset above is rich in detail, but it does not record where each species was found. To fill that gap, I am pulling a second dataset that lists every known fossil occurrence together with its coordinates. Once merged with the taxa dataset, it will let us analyze where particular dinosaurs have been found.

In [41]:
# Fetch the dinosaur fossil occurrences and their coordinates from the Paleobiology Database
occ_csv = rq.get('https://paleobiodb.org/data1.2/occs/list.csv?base_name=Dinosauria&taxon_reso=species&idqual=certain&pres=regular&max_ma=252&min_ma=65&show=class,coords,loc,acconly').content
# Keep only the accepted taxon name and the coordinates
occ = pd.read_csv(io.StringIO(occ_csv.decode('utf-8')))[['accepted_name', 'lng', 'lat']]
occ.columns = ['Species', 'Longitude', 'Latitude']  # County may need to be added later

occ.head()

Unnamed: 0,Species,Longitude,Latitude
0,Chaoyangsaurus youngi,123.966698,42.9333
1,Protarchaeopteryx robusta,120.73333,41.799999
2,Caudipteryx zoui,120.73333,41.799999
3,Gorgosaurus libratus,-111.528732,50.740726
4,Gorgosaurus libratus,-111.549347,50.737015
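
The merge mentioned earlier is not shown at this point in the notebook, so here is a minimal sketch of how it might look. It assumes the join key is the occurrence table's `Species` column matched against the taxa table's `Name` column, which the notebook does not confirm. The frames `occ_demo` and `taxa_demo` are toy stand-ins: the occurrence rows are copied from the output above, and the taxa rows are placeholders holding only the key and rank.

```python
import pandas as pd

# Toy stand-in for `occ`: three rows copied from the occurrence output above
occ_demo = pd.DataFrame({
    'Species': ['Caudipteryx zoui', 'Gorgosaurus libratus', 'Gorgosaurus libratus'],
    'Longitude': [120.73333, -111.528732, -111.549347],
    'Latitude': [41.799999, 50.740726, 50.737015],
})

# Placeholder stand-in for `taxa`: only the join key and rank, no real attributes
taxa_demo = pd.DataFrame({
    'Rank': ['species', 'species'],
    'Name': ['Caudipteryx zoui', 'Gorgosaurus libratus'],
})

# Left join: keep every occurrence and attach the matching taxon row, if any
located = occ_demo.merge(taxa_demo, left_on='Species', right_on='Name', how='left')

# Quick sanity check: number of occurrences per species
counts = located.groupby('Species').size()
```

A left join keeps occurrences whose species has no taxa row (their taxa columns become `NaN`), which makes mismatched names easy to spot before any analysis.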


In [42]:
testing = taxa.loc[taxa['Genus'] == 'Nyasasaurus'].reset_index()[['Name', 'Genus']]
testing

Unnamed: 0,Name,Genus
0,Nyasasaurus,Nyasasaurus
1,Nyasasaurus parringtoni,Nyasasaurus


## Data Cleanup and Formatting

In [43]:

# Looking at the dataframe, we need to derive the early and late intervals ourselves.
# --> Use np.select() to assign geological ages manually from a premade list of ages and conditions
periods = ['Triassic', 'Jurassic', 'Cretaceous']
epochs = ['Lower', 'Middle', 'Upper']

tri_ages = ['Induan', 'Olenekian', 'Anisian', 'Ladinian', 'Carnian', 'Norian', 'Rhaetian']
jur_ages = ['Hettangian', 'Sinemurian', 'Pliensbachian', 'Toarcian', 'Aalenian', 'Bajocian', 'Bathonian', 'Callovian', 'Oxfordian', 'Kimmeridgian', 'Tithonian']
cre_ages = ['Berriasian', 'Valanginian', 'Hauterivian', 'Barremian', 'Aptian', 'Albian', 'Cenomanian', 'Turonian', 'Coniacian', 'Santonian', 'Campanian', 'Maastrichtian']

ages = [*tri_ages, *jur_ages, *cre_ages]

# Conditions for np.select: one (lower, upper] Ma range per age, in the same order as `ages`
mya_args = lambda x : [(taxa[x] <= 251.9) & (taxa[x] > 251.2), 
                (taxa[x] <= 251.2) & (taxa[x] > 247.2),
                (taxa[x] <= 247.2) & (taxa[x] > 242),
                (taxa[x] <= 242) & (taxa[x] > 237),
                (taxa[x] <= 237) & (taxa[x] > 227),
                (taxa[x] <= 227) & (taxa[x] > 208.5),
                (taxa[x] <= 208.5) & (taxa[x] > 201.4),
                (taxa[x] <= 201.4) & (taxa[x] > 199.5),
                (taxa[x] <= 199.5) & (taxa[x] > 192.9),
                (taxa[x] <= 192.9) & (taxa[x] > 184.2),
                (taxa[x] <= 184.2) & (taxa[x] > 174.7),
                (taxa[x] <= 174.7) & (taxa[x] > 170.9),
                (taxa[x] <= 170.9) & (taxa[x] > 168.2),
                (taxa[x] <= 168.2) & (taxa[x] > 165.3),
                (taxa[x] <= 165.3) & (taxa[x] > 161.5),
                (taxa[x] <= 161.5) & (taxa[x] > 154.8),
                (taxa[x] <= 154.8) & (taxa[x] > 149.2),
                (taxa[x] <= 149.2) & (taxa[x] > 145),
                (taxa[x] <= 145) & (taxa[x] > 139.8),
                (taxa[x] <= 139.8) & (taxa[x] > 132.6),
                (taxa[x] <= 132.6) & (taxa[x] > 125.77),
                (taxa[x] <= 125.77) & (taxa[x] > 121.4),
                (taxa[x] <= 121.4) & (taxa[x] > 113),
                (taxa[x] <= 113) & (taxa[x] > 100.5),
                (taxa[x] <= 100.5) & (taxa[x] > 93.9),
                (taxa[x] <= 93.9) & (taxa[x] > 89.8),
                (taxa[x] <= 89.8) & (taxa[x] > 86.3),
                (taxa[x] <= 86.3) & (taxa[x] > 83.6),
                (taxa[x] <= 83.6) & (taxa[x] > 72.1),
                (taxa[x] <= 72.1) & (taxa[x] > 66)] 
                
pers = lambda x : [(taxa[x].isin(tri_ages)),
                   (taxa[x].isin(jur_ages)),
                   (taxa[x].isin(cre_ages))]

# Epoch conditions (Lower, Middle, Upper), in the same order as `epochs`
eps = lambda x: [taxa[x].isin(['Induan', 'Olenekian', 'Hettangian', 'Sinemurian', 'Pliensbachian', 'Toarcian',
                               'Berriasian', 'Valanginian', 'Hauterivian', 'Barremian', 'Aptian', 'Albian']),
                 taxa[x].isin(['Anisian', 'Ladinian', 'Aalenian', 'Bajocian', 'Bathonian', 'Callovian']),
                 taxa[x].isin(['Carnian', 'Norian', 'Rhaetian', 'Oxfordian', 'Kimmeridgian', 'Tithonian',
                               'Cenomanian', 'Turonian', 'Coniacian', 'Santonian', 'Campanian', 'Maastrichtian'])]


# Adding the Period and Age columns
taxa['Early Age'] = np.select(mya_args('Max MYA'), ages, default=pd.NaT)

# We add 0.01 to accommodate edge cases where a dinosaur's last appearance falls exactly on the boundary
# between two Mesozoic ages (e.g. 83.6 would otherwise be assigned to the following, younger age)
taxa['Min MYA'] += 0.01
taxa['Late Age'] = np.select(mya_args('Min MYA'), ages, default=pd.NaT)
taxa['Min MYA'] -= 0.01
taxa['Late Age'] = taxa['Late Age'].fillna(taxa['Early Age'])


taxa['Early Period'] = np.select(eps('Early Age'), epochs, default=pd.NaT) + ' ' + np.select(pers('Early Age'), periods, default=pd.NaT)
taxa['Late Period'] = np.select(eps('Late Age'), epochs, default=pd.NaT) + ' ' + np.select(pers('Late Age'), periods, default=pd.NaT)

taxa


Unnamed: 0,Rank,Name,Genus,Family,Order,Taxon Size,Diet,Max MYA,Min MYA,Early Age,Late Age,Early Period,Late Period
10,genus,Ajkaceratops,Ajkaceratops,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,2.0,herbivore,86.3,83.6,Santonian,Santonian,Upper Cretaceous,Upper Cretaceous
11,species,Ajkaceratops kozmai,Ajkaceratops,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,1.0,herbivore,86.3,83.6,Santonian,Santonian,Upper Cretaceous,Upper Cretaceous
12,genus,Turanoceratops,Turanoceratops,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,2.0,herbivore,93.9,89.8,Turonian,Turonian,Upper Cretaceous,Upper Cretaceous
13,species,Turanoceratops tardabilis,Turanoceratops,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,1.0,herbivore,93.9,89.8,Turonian,Turonian,Upper Cretaceous,Upper Cretaceous
14,genus,Zuniceratops,Zuniceratops,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,2.0,herbivore,93.9,89.8,Turonian,Turonian,Upper Cretaceous,Upper Cretaceous
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3432,species,Tawa hallae,Tawa,NO_FAMILY_SPECIFIED,Theropoda,1.0,carnivore,227.0,208.5,Norian,Norian,Upper Triassic,Upper Triassic
3433,genus,Yaverlandia,Yaverlandia,NO_FAMILY_SPECIFIED,Theropoda,2.0,carnivore,132.6,121.4,Hauterivian,Barremian,Lower Cretaceous,Lower Cretaceous
3434,species,Yaverlandia bitholus,Yaverlandia,NO_FAMILY_SPECIFIED,Theropoda,1.0,carnivore,132.6,121.4,Hauterivian,Barremian,Lower Cretaceous,Lower Cretaceous
3436,genus,Nyasasaurus,Nyasasaurus,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,3.0,,247.2,242.0,Anisian,Anisian,Middle Triassic,Middle Triassic
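
The long condition list and the temporary 0.01 shift can be avoided by binning. The sketch below is an alternative, not the method used above: it uses `pd.cut` on a small subset of the boundaries (Upper Cretaceous only) and two rows taken from the output. The names `edges`, `labels`, and `demo` are introduced here purely for illustration. With `right=True` each bin is `(lower, upper]`, matching the `<=`/`>` conditions used for `Max MYA`; with `right=False` each bin is `[lower, upper)`, which assigns a boundary value to the older age, the same result the shift produces in the rows shown, without modifying `Min MYA`.

```python
import pandas as pd

# Upper Cretaceous boundaries in Ma, ascending, copied from the conditions above
edges = [66.0, 72.1, 83.6, 86.3, 89.8, 93.9, 100.5]
# One label per bin, youngest first to match the ascending edges
labels = ['Maastrichtian', 'Campanian', 'Santonian', 'Coniacian', 'Turonian', 'Cenomanian']

# Two rows taken from the output table above
demo = pd.DataFrame({
    'Name': ['Ajkaceratops', 'Turanoceratops'],
    'Max MYA': [86.3, 93.9],
    'Min MYA': [83.6, 89.8],
})

# (lower, upper] bins: a first appearance on a boundary belongs to the younger age
demo['Early Age'] = pd.cut(demo['Max MYA'], bins=edges, labels=labels, right=True).astype(object)

# [lower, upper) bins: a last appearance on a boundary belongs to the older age
demo['Late Age'] = pd.cut(demo['Min MYA'], bins=edges, labels=labels, right=False).astype(object)
```

For these two rows the result agrees with the table above (Santonian and Turonian for both columns). Values outside the listed edges come back as `NaN`, so extending `edges` and `labels` to all thirty ages would be needed before applying this to the full `taxa` dataframe.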
