#### Paleo Analysis and Parsing of the Paleontological Database
Data extracted from *The Paleobiology Database*


In [19]:
# library imports
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib as mpl
import requests as rq
import io

In [34]:
# Download occurrence data from the Paleobiology Database data service and fill in missing values
csv_bytes = rq.get('https://paleobiodb.org/data1.2/occs/list.csv?base_name=Dinosauria&taxon_reso=species&idmod=!ns,ng,af,cf,sl,if,eg,qm,qu&pres=regular&max_ma=252&min_ma=65&show=class,coords,loc,acconly').content
# Parse the CSV payload and drop the bookkeeping columns that are not needed for this analysis
dinos = pd.read_csv(io.StringIO(csv_bytes.decode('utf-8'))).drop(columns=['occurrence_no', 'record_type', 'reid_no', 'flags', 'collection_no', 'accepted_rank', 'accepted_no', 'reference_no', 'phylum', 'class', 'county', 'latlng_basis', 'latlng_precision', 'geogscale', 'geogcomments'])
dinos.columns = ['Name', 'Early Interval', 'Late Interval', 'Max MYA', 'Min MYA', 'Order', 'Family', 'Genus', 'Longitude', 'Latitude', 'Country', 'State'] # County may need to be added later
dinos = dinos[['Name', 'Genus', 'Family', 'Order', 'Max MYA', 'Min MYA', 'Early Interval', 'Late Interval', 'Country', 'State']]
dinos['Late Interval'] = dinos['Late Interval'].fillna(dinos['Early Interval'])

# USE THIS BLOCK TO CHANGE THE VALUES FOR WHEN FAMILY OR ORDER IS UNKNOWN
#dinos.loc[dinos['Family'] == 'NO_FAMILY_SPECIFIED', 'Family'] = 'Unknown Family'
#dinos.loc[dinos['Order'] == 'NO_ORDER_SPECIFIED', 'Order'] = 'Unknown Order'


# Note: each condition joins two inclusive comparisons (<=, >=) with the element-wise & operator; the three ranges do not overlap beyond their shared boundaries
conditions = [
    (dinos['Max MYA'] <= 251.9) & (dinos['Min MYA'] >= 201.4), 
    (dinos['Max MYA'] <= 201.4) & (dinos['Min MYA'] >= 145.0),
    (dinos['Max MYA'] <= 145.0) & (dinos['Min MYA'] >= 66.0)
]

eras = ['Triassic', 'Jurassic', 'Cretaceous']

# TODO: handle species whose range spans two time periods (e.g. Late Jurassic to Early Cretaceous); they match no condition and receive the default
dinos['Era'] = np.select(conditions, eras, default=pd.NaT)
dinos.head()

| | Name | Genus | Family | Order | Max MYA | Min MYA | Early Interval | Late Interval | Country | State | Era |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gorgosaurus libratus | Gorgosaurus | Tyrannosauridae | Theropoda | 83.6 | 72.1 | Late Campanian | Late Campanian | CA | Alberta | Cretaceous |
| 1 | Centrosaurus apertus | Centrosaurus | Ceratopsidae | NO_ORDER_SPECIFIED | 83.6 | 72.1 | Late Campanian | Late Campanian | CA | Alberta | Cretaceous |
| 2 | Gorgosaurus libratus | Gorgosaurus | Tyrannosauridae | Theropoda | 83.6 | 72.1 | Late Campanian | Late Campanian | CA | Alberta | Cretaceous |
| 3 | Gorgosaurus libratus | Gorgosaurus | Tyrannosauridae | Theropoda | 83.6 | 72.1 | Late Campanian | Late Campanian | CA | Alberta | Cretaceous |
| 4 | Albertosaurus sarcophagus | Albertosaurus | Tyrannosauridae | Theropoda | 72.1 | 66.0 | Late Maastrichtian | Late Maastrichtian | CA | Alberta | Cretaceous |
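
The code above leaves one case open: an occurrence whose age range spans two time periods matches none of the three conditions. One possible way to handle it, sketched here as an assumption rather than the notebook's chosen approach, is to list every period that the occurrence's range overlaps. The helper `periods_spanned`, the `PERIODS` table, and the toy `sample` rows are all hypothetical; the boundary values are copied from the conditions above.

```python
import pandas as pd

# Period boundaries in millions of years ago, copied from the conditions above.
PERIODS = [
    ('Triassic', 251.9, 201.4),
    ('Jurassic', 201.4, 145.0),
    ('Cretaceous', 145.0, 66.0),
]

def periods_spanned(max_mya, min_mya):
    """Hypothetical helper: name every period overlapping the range [min_mya, max_mya]."""
    names = [
        name for name, start, end in PERIODS
        # Strict inequalities: a range that only touches a boundary does not count as overlap.
        if max_mya > end and min_mya < start
    ]
    return ' to '.join(names) if names else None

# Toy rows for illustration only, not real database records.
sample = pd.DataFrame({'Max MYA': [83.6, 152.1, 201.4], 'Min MYA': [72.1, 139.8, 190.8]})
sample['Era'] = [
    periods_spanned(max_mya, min_mya)
    for max_mya, min_mya in zip(sample['Max MYA'], sample['Min MYA'])
]
print(sample)
```

With this approach the second toy row, which runs from 152.1 to 139.8 MYA, is labelled `Jurassic to Cretaceous` instead of falling through to the default.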


The dataset we have here lists every recorded occurrence of each dinosaur species, where the fossil was found, and where it sits within the geological time scale. However, rather than showing each individual occurrence, I want to represent each species only once and give a general idea of where and when the dinosaur lived. To do this, we can group the occurrences by species, keep a count of them so that the number of occurrences behind each species is not lost, and build a list of the unique locations where the dinosaur was found. Using this same logic, we can also place each species more accurately on the geological time scale, based on all occurrences of the fossil.
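
A minimal sketch of that per-species summary, assuming a frame shaped like `dinos`. The toy `occurrences` rows, the combined `Location` label, and the output column names `Occurrences` and `Locations` are illustrative assumptions, not values or names from the database. Taking the largest `Max MYA` and the smallest `Min MYA` is one plausible reading of "based on all occurrences".

```python
import pandas as pd

# Toy occurrence rows shaped like the `dinos` frame (illustrative values, not real records).
occurrences = pd.DataFrame({
    'Name': ['Gorgosaurus libratus', 'Gorgosaurus libratus', 'Albertosaurus sarcophagus'],
    'Max MYA': [83.6, 77.0, 72.1],
    'Min MYA': [72.1, 70.0, 66.0],
    'Country': ['CA', 'US', 'CA'],
    'State': ['Alberta', 'Montana', 'Alberta'],
})

# Combine state and country into a single location label per occurrence.
occurrences['Location'] = occurrences['State'] + ', ' + occurrences['Country']

# One row per species: occurrence count, unique locations, and the widest age range observed.
species = (
    occurrences.groupby('Name')
    .agg(**{
        'Occurrences': ('Max MYA', 'size'),
        'Locations': ('Location', lambda s: sorted(s.dropna().unique())),
        'Max MYA': ('Max MYA', 'max'),
        'Min MYA': ('Min MYA', 'min'),
    })
    .reset_index()
)
print(species)
```

The dictionary is unpacked into `agg` because output names such as `Max MYA` contain a space and cannot be written as plain keyword arguments.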