## Project Goals

This project takes the perspective of a biodiversity analyst for the National Park Service. The National Park Service wants to ensure the survival of at-risk species and maintain the level of biodiversity within its parks. The analyst's main objectives are therefore to understand the characteristics of the species and their conservation status, and how those species relate to the national parks. The questions posed are:

- What is the distribution of conservation status for species?
- Are certain types of species more likely to be endangered?
- Are the differences between species and their conservation status significant?
- Which animal is most prevalent, and what is its distribution among the parks?

## Explore Data

This project comes with two data sets. The first CSV file has information about each species, and the second has observations of species by park location. The data is inspired by real data but is mostly fictional.

The `species_info.csv` file contains information on the different species in the national parks. The columns in the data set are:

- `category` - the taxonomic category of each species
- `scientific_name` - the scientific name of each species
- `common_names` - the common names of each species
- `conservation_status` - the conservation status of each species

The `observations.csv` file contains recorded sightings of different species throughout the national parks in the past 7 days. The columns are:

- `scientific_name` - the scientific name of each species
- `park_name` - the name of the national park
- `observations` - the number of observations in the past 7 days
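
Both files share the `scientific_name` column, so they can be joined on it. The following is a minimal sketch on tiny hand-made samples rather than the real files; the rows, the `*_sample` names, and `combined` are illustrative assumptions.

```python
import pandas as pd

# Tiny illustrative samples; the values are made up, not read from the CSV files
species_sample = pd.DataFrame({
    "category": ["Mammal", "Bird"],
    "scientific_name": ["Bos bison", "Corvus corax"],
    "conservation_status": [None, "Species of Concern"],
})
observations_sample = pd.DataFrame({
    "scientific_name": ["Bos bison", "Bos bison", "Corvus corax"],
    "park_name": ["Yosemite National Park", "Bryce National Park",
                  "Yosemite National Park"],
    "observations": [10, 20, 30],
})

# A left join keeps every observation row and attaches the species attributes
combined = observations_sample.merge(species_sample, on="scientific_name", how="left")
print(combined[["scientific_name", "park_name", "category", "observations"]])
```

If a scientific name is repeated in the species table, a join like this multiplies the matching observation rows, so it is worth checking for repeated names first.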


In [2]:
# Import the libraries
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline

In [8]:
# Load the species data set and preview the first rows
species = pd.read_csv("species_info.csv", encoding="utf-8")
species.head()

| | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | NaN |
| 1 | Mammal | Bos bison | American Bison, Bison | NaN |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | NaN |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | NaN |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | NaN |


In [9]:
# Load the observations data set and preview the first rows
observations = pd.read_csv("observations.csv", encoding="utf-8")
observations.head()

| | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |


In [10]:
# Get data characteristics
print(f"Species shape: {species.shape}")
print(f"Observations shape: {observations.shape}")

Species shape: (5824, 4)
Observations shape: (23296, 3)


In [16]:
# Get distinct species
print(f"Number of species: {species.scientific_name.nunique()}")

# Get unique categories of species
print(f"Number of categories: {species.category.nunique()}")
print(f"Categories: {species.category.unique()}")

# Get the count of categories
species.groupby("category").size()

Number of species: 5541
Number of categories: 7
Categories: ['Mammal' 'Bird' 'Reptile' 'Amphibian' 'Fish' 'Vascular Plant'
 'Nonvascular Plant']


category
Amphibian              80
Bird                  521
Fish                  127
Mammal                214
Nonvascular Plant     333
Reptile                79
Vascular Plant       4470
dtype: int64
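
The species table has 5824 rows but only 5541 unique scientific names, so some names must appear more than once. One way to inspect such repeats is `DataFrame.duplicated`. The sketch below runs on a small made-up frame, not the real file, and the second common name is invented for illustration.

```python
import pandas as pd

# Made-up sample with one scientific name listed twice
sample = pd.DataFrame({
    "scientific_name": ["Cervus elaphus", "Cervus elaphus", "Ovis aries"],
    "common_names": ["Wapiti Or Elk", "Rocky Mountain Elk", "Domestic Sheep"],
})

# keep=False marks every row whose name occurs more than once
repeated = sample[sample.duplicated(subset="scientific_name", keep=False)]
print(repeated)

# Rows beyond the first occurrence of each name
n_extra_rows = len(sample) - sample["scientific_name"].nunique()
print(f"Extra rows: {n_extra_rows}")
```

Applied to the real `species` frame, the same subtraction would give 5824 - 5541 = 283 extra rows.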

In [18]:
# Explore conservation status groups
print(f"Number of conservation statuses: {species.conservation_status.nunique()}")
print(f"Unique conservation statuses: {species.conservation_status.unique()}")

Number of conservation statuses: 4
Unique conservation statuses: [nan 'Species of Concern' 'Endangered' 'Threatened' 'In Recovery']


In [19]:
# Get the count of conservation status groups
print(species.groupby("conservation_status").size())
print(f"NaN values: {species.conservation_status.isna().sum()}")

conservation_status
Endangered             16
In Recovery             4
Species of Concern    161
Threatened             10
dtype: int64
NaN values: 5633


There are 5,633 `nan` values, which correspond to species with no conservation concern. The remaining entries are 161 species of concern, 16 endangered, 10 threatened, and 4 in recovery.

Note: `nan` values usually need careful treatment, but here the absence of data means that the species is not under any conservation status.
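
Because the missing values carry meaning here, one option is to replace them with an explicit label so that they show up in group counts. This is a sketch on a small made-up Series; the label `No Intervention` is an assumption, not a value from the data set.

```python
import pandas as pd

# Made-up sample of statuses; None stands in for the missing values
status = pd.Series(
    ["Endangered", None, "Species of Concern", None, None],
    name="conservation_status",
)

# groupby and value_counts skip missing values by default, so label them explicitly
labelled = status.fillna("No Intervention")
counts = labelled.value_counts()
print(counts)
```

With the label in place, the unprotected species appear as their own group instead of being dropped from the counts.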

In [22]:
# Get the distinct parks
print(f"Number of parks: {observations.park_name.nunique()}")
print(f"Unique parks: {observations.park_name.unique()}")

# Get the number of observations
print(f"Number of observations: {observations.observations.sum()}")

Number of parks: 4
Unique parks: ['Great Smoky Mountains National Park' 'Yosemite National Park'
 'Bryce National Park' 'Yellowstone National Park']
Number of observations: 3314739
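
The overall total can also be broken down by park with a `groupby`. This is a sketch on made-up rows: the park names come from the output above, but the counts are invented.

```python
import pandas as pd

# Made-up observation counts for illustration only
sample = pd.DataFrame({
    "park_name": ["Bryce National Park", "Yosemite National Park",
                  "Bryce National Park", "Yellowstone National Park"],
    "observations": [84, 138, 16, 250],
})

# Sum the sightings per park and list the busiest park first
per_park = (
    sample.groupby("park_name")["observations"]
    .sum()
    .sort_values(ascending=False)
)
print(per_park)
```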


## Analyze Data