#### Paleo Analysis and Parsing of the Paleontological Database
Data extracted from *The Paleobiology Database*


In [12]:
# library imports
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib as mpl
import requests as rq
import io

In [26]:
# Download dinosaur occurrences from the Paleobiology Database data service and clean up the null values
url = rq.get('https://paleobiodb.org/data1.2/occs/list.csv?base_name=Dinosauria&taxon_reso=species&idqual=certain&pres=regular&max_ma=252&min_ma=65&show=class,coords,loc,acconly').content
occ = pd.read_csv(io.StringIO(url.decode('utf-8'))).drop(columns=['occurrence_no', 'record_type', 'reid_no', 'flags', 'collection_no', 'accepted_rank', 'accepted_no', 'reference_no', 'phylum', 'class', 'county', 'latlng_basis', 'latlng_precision', 'geogscale', 'geogcomments'])
occ.columns = ['Name', 'Early Interval', 'Late Interval', 'Max MYA', 'Min MYA', 'Order', 'Family', 'Genus', 'Longitude', 'Latitude', 'Country', 'State'] # County may need to be added later
occ = occ[['Name', 'Genus', 'Family', 'Order', 'Max MYA', 'Min MYA', 'Early Interval', 'Late Interval', 'Country', 'State']]
occ['Late Interval'] = occ['Late Interval'].fillna(occ['Early Interval'])

# USE THIS BLOCK TO CHANGE THE VALUES FOR WHEN FAMILY OR ORDER IS UNKNOWN
#occ.loc[occ['Family'] == 'NO_FAMILY_SPECIFIED', 'Family'] = 'Unknown Family'
#occ.loc[occ['Order'] == 'NO_ORDER_SPECIFIED', 'Order'] = 'Unknown Order'


# Note: the comparisons are inclusive (<= and >=); each condition only matches an occurrence whose whole age range falls inside a single period
conditions = [
    (occ['Max MYA'] <= 251.9) & (occ['Min MYA'] >= 201.4), 
    (occ['Max MYA'] <= 201.4) & (occ['Min MYA'] >= 145.0),
    (occ['Max MYA'] <= 145.0) & (occ['Min MYA'] >= 66.0)
]

eras = ['Triassic', 'Jurassic', 'Cretaceous']

# Still need to address the cases where a species lived within two time periods (e.g. Late Jurassic to Early Cretaceous); these currently fall through to the default
occ['Era'] = np.select(conditions, eras, default=pd.NaT)
occ.head()

Unnamed: 0,Name,Genus,Family,Order,Max MYA,Min MYA,Early Interval,Late Interval,Country,State,Era
0,Chaoyangsaurus youngi,Chaoyangsaurus,Chaoyangsauridae,NO_ORDER_SPECIFIED,152.2,132.6,Late Kimmeridgian,Valanginian,CN,Liaoning,NaT
1,Protarchaeopteryx robusta,Protarchaeopteryx,NO_FAMILY_SPECIFIED,Theropoda,125.77,119.5,Late Barremian,Early Aptian,CN,Liaoning,Cretaceous
2,Caudipteryx zoui,Caudipteryx,NO_FAMILY_SPECIFIED,Theropoda,125.77,119.5,Late Barremian,Early Aptian,CN,Liaoning,Cretaceous
3,Gorgosaurus libratus,Gorgosaurus,Tyrannosauridae,Theropoda,83.6,72.1,Late Campanian,Late Campanian,CA,Alberta,Cretaceous
4,Gorgosaurus libratus,Gorgosaurus,Tyrannosauridae,Theropoda,83.6,72.1,Late Campanian,Late Campanian,CA,Alberta,Cretaceous
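
The first row above (*Chaoyangsaurus youngi*, 152.2 to 132.6 MYA) gets `NaT` because its age range crosses the Jurassic-Cretaceous boundary, so none of the three conditions match. One possible way to handle this, shown here only as a sketch, is to label an occurrence with every period its range overlaps. The `PERIODS` table and the `label_periods` helper are hypothetical names introduced for illustration; the boundary ages are the ones used in the conditions above, and the two sample rows are copied from the output.

```python
import pandas as pd

# Period boundaries in MYA (start, end), taken from the conditions above.
PERIODS = [
    ("Triassic", 251.9, 201.4),
    ("Jurassic", 201.4, 145.0),
    ("Cretaceous", 145.0, 66.0),
]

def label_periods(max_mya, min_mya):
    """Return every period overlapped by the age range, joined with '-'.

    Hypothetical helper: strict inequalities are used so that a range that
    merely touches a boundary is not counted as overlapping the next period.
    """
    hits = [name for name, start, end in PERIODS
            if max_mya > end and min_mya < start]
    return "-".join(hits) if hits else None

# Two rows copied from the occurrence table shown above.
sample = pd.DataFrame({
    "Name": ["Chaoyangsaurus youngi", "Gorgosaurus libratus"],
    "Max MYA": [152.2, 83.6],
    "Min MYA": [132.6, 72.1],
})
sample["Era"] = [label_periods(hi, lo)
                 for hi, lo in zip(sample["Max MYA"], sample["Min MYA"])]
print(sample)
```

With this approach *Chaoyangsaurus youngi* would be labelled `Jurassic-Cretaceous` instead of `NaT`. Whether a combined label or a separate column per period is more useful depends on how the `Era` column is used later.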


Because of the way the Paleobiology Database is set up, we need to create two dataframes: one with the fossil occurrences (above) and one with all the recognized dinosaur genera and species (below).

In [28]:
#url2 = rq.get('https://paleobiodb.org/data1.2/occs/taxa.csv?base_name=Dinosauria&rank=max_genus&taxon_status=accepted&pres=regular&max_ma=252&min_ma=65&show=parent,app,size,class').content
#dinos = pd.read_csv(io.StringIO(url2.decode('utf-8')))
#dinos

The dataset we have here shows the occurrences of each species of dinosaur, where they were found, and where each fossil sits within the geological time scale. However, rather than showing each individual occurrence, I want to represent each species only once and give a general idea of where and when the dinosaur lived. To do this, we can group the occurrences by species (keeping a count so that the taxon size is preserved) and build a list of the unique locations where the dinosaur was found. Using the same logic, we can also derive a more accurate time range for the species from the oldest and youngest ages across all of its occurrences.

Unnamed: 0,Name,Genus,Family,Order,Max MYA,Min MYA,Early Interval,Late Interval,Country,State,Era
2300,Ajkaceratops kozmai,Ajkaceratops,NO_FAMILY_SPECIFIED,NO_ORDER_SPECIFIED,86.3,83.6,Santonian,Santonian,HU,Veszprém,Cretaceous
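
The per-species summary described above could be sketched with a pandas `groupby`, as below. This is a minimal illustration rather than the notebook's final method: `occ_sample` and `species` are hypothetical names, the sample rows are copied from the occurrence table shown earlier, and taking the maximum of `Max MYA` and the minimum of `Min MYA` is one assumed way of combining the age ranges.

```python
import pandas as pd

# A few rows copied from the occurrence table shown earlier.
occ_sample = pd.DataFrame({
    "Name": ["Gorgosaurus libratus", "Gorgosaurus libratus", "Caudipteryx zoui"],
    "Max MYA": [83.6, 83.6, 125.77],
    "Min MYA": [72.1, 72.1, 119.5],
    "Country": ["CA", "CA", "CN"],
    "State": ["Alberta", "Alberta", "Liaoning"],
})

def unique_joined(values):
    """Collapse a column to a comma-separated string of its unique, sorted values."""
    return ", ".join(sorted(values.dropna().unique()))

# One row per species: occurrence count, widest age range, unique locations.
species = (
    occ_sample.groupby("Name")
    .agg(**{
        "Occurrences": ("Country", "size"),   # rows per species, NaN included
        "Max MYA": ("Max MYA", "max"),        # oldest bound over all occurrences
        "Min MYA": ("Min MYA", "min"),        # youngest bound over all occurrences
        "Countries": ("Country", unique_joined),
        "States": ("State", unique_joined),
    })
    .reset_index()
)
print(species)
```

The same pattern should apply to the full `occ` dataframe, with `Genus`, `Family`, and `Order` carried along (for example with a `"first"` aggregation) if they are constant within a species.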
