# Introduction

Biodiversity is essential to ecosystem health, but is threatened by habitat loss, climate change, and other human activities. National parks aim to conserve wildlife, but even protected areas experience biodiversity declines. Tracking species distributions and conservation status in parks can inform management efforts.

This project analyzes National Park Service biodiversity data to understand species status and distributions in national parks. Key questions include:

- What is the distribution of conservation status for observed species? Are certain species more endangered?
- How does species prevalence differ between parks? Which species are most widespread?
- Are the differences in conservation status between species statistically significant?

This project scopes the problem, prepares and analyzes the data, plots the results, and seeks to explain the findings.

**Data sources:**

Both `Observations.csv` and `Species_info.csv` were provided by [Codecademy.com](https://www.codecademy.com).

Note: The data for this project is *inspired* by real data, but is mostly fictional.

## Scoping

Defining a clear project scope provides direction and focus when starting a new data analysis project. This scope contains four key sections:

- Goals: The high-level objectives and intentions for the project. What questions are we trying to answer?

- Data: Reviewing the available data to ensure it can support the project goals. For this project, biodiversity data has already been obtained.

- Analysis Plan: The methods and specific research questions to analyze the data and meet the goals. What techniques and visualizations will be used?

- Evaluation: Drawing conclusions from the analysis to arrive at findings and insights that achieve the project goals. What stories does the data tell us?


### Project Goals

This project takes the perspective of a biodiversity analyst for the National Park Service. My goal is to help conserve at-risk species and maintain biodiversity within the national parks. My key objectives are to analyze species characteristics, conservation status, and distributions across parks. Specifically, I aim to explore these questions:

- What is the distribution of conservation status for observed species? Are certain species more endangered?
- How does species prevalence differ between parks? Which species are most widespread?
- Are the differences in conservation status between species statistically significant?

By analyzing biodiversity data to uncover patterns, trends, and insights related to these questions, I aim to generate data-driven recommendations to help guide conservation priorities and actions across national parks. My goal is to support the Park Service mission of protecting vulnerable species.

### Data

This project uses two data sets supplied with the project package, both in CSV format. The first file contains information about each species; the second contains observations of species along with park locations. Together they will be used to address the project goals.

### Analysis

In this section, descriptive statistics and data visualization techniques will be employed to understand the data better. Statistical inference will also be used to test whether the observed differences are statistically significant. Some of the key metrics that will be computed include:

1. Distributions
1. Counts
1. Relationships between species
1. Conservation status of species
1. Observations of species in parks
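
The inference step in this plan can be illustrated with a chi-squared test of independence on a contingency table of category against protection status. The following is a minimal sketch under stated assumptions, not the analysis itself: the `toy` rows and the `is_protected` column are hypothetical, invented only to show the mechanics, and the statistic is computed by hand with NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not taken from the project data.
toy = pd.DataFrame({
    "category": ["Mammal"] * 4 + ["Bird"] * 4,
    "is_protected": [True, True, False, False, True, False, False, False],
})

# Counts: how many species fall into each category / protection combination.
contingency = pd.crosstab(toy["category"], toy["is_protected"])

# Expected counts if category and protection status were independent.
observed = contingency.to_numpy()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Pearson chi-squared statistic and its degrees of freedom.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(contingency)
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}")
```

With counts this small the test is not meaningful. On the real data, a p-value could be obtained from the chi-squared distribution with `dof` degrees of freedom, for example via `scipy.stats.chi2_contingency`, which by default applies a continuity correction to 2x2 tables and so would not reproduce the hand-computed statistic exactly.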

### Evaluation

Lastly, it's a good idea to revisit the goals and check whether the output of the analysis answers the questions set out in the goals section. This section will also reflect on what has been learned through the process and whether any questions could not be answered. It may also cover limitations and whether any of the analysis could have been done with different methodologies.


### Import Python Modules
First, import the Python modules needed for this project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


### Import the data
To analyze species conservation status and park observations, the data sets are loaded into pandas DataFrames for exploration in Python. This allows flexible manipulation and visualization of the biodiversity data.

The `Observations.csv` is read into a DataFrame called `observations` and `Species_info.csv` is read into a DataFrame called `species`. The `.head()` method is used to glimpse the first few rows of each DataFrame and verify the data loaded correctly.

#### Species
`Species_info.csv` contains information about the different species in the national parks. The columns are:
- `category`: the category of each species
- `scientific_name`: the scientific name of each species
- `common_names`: the common names of each species
- `conservation_status`: the conservation status of each species

In [3]:
species = pd.read_csv('species_info.csv')
species.head(10)

|   | category | scientific_name | common_names | conservation_status |
|---|---|---|---|---|
| 0 | Mammal | Clethrionomys gapperi gapperi | Gapper's Red-Backed Vole | |
| 1 | Mammal | Bos bison | American Bison, Bison | |
| 2 | Mammal | Bos taurus | Aurochs, Aurochs, Domestic Cattle (Feral), Dom... | |
| 3 | Mammal | Ovis aries | Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral) | |
| 4 | Mammal | Cervus elaphus | Wapiti Or Elk | |
| 5 | Mammal | Odocoileus virginianus | White-Tailed Deer | |
| 6 | Mammal | Sus scrofa | Feral Hog, Wild Pig | |
| 7 | Mammal | Canis latrans | Coyote | Species of Concern |
| 8 | Mammal | Canis lupus | Gray Wolf | Endangered |
| 9 | Mammal | Canis rufus | Red Wolf | Endangered |

#### Observations
`Observations.csv` records sightings of the different species in the national parks. The columns are:
- `scientific_name`: the scientific name of each species
- `park_name`: the name of the national park
- `observations`: the number of observations in the last 7 days

In [4]:
observations = pd.read_csv('observations.csv')
observations.head(10)

|   | scientific_name | park_name | observations |
|---|---|---|---|
| 0 | Vicia benghalensis | Great Smoky Mountains National Park | 68 |
| 1 | Neovison vison | Great Smoky Mountains National Park | 77 |
| 2 | Prunus subcordata | Yosemite National Park | 138 |
| 3 | Abutilon theophrasti | Bryce National Park | 84 |
| 4 | Githopsis specularioides | Great Smoky Mountains National Park | 85 |
| 5 | Elymus virginicus var. virginicus | Yosemite National Park | 112 |
| 6 | Spizella pusilla | Yellowstone National Park | 228 |
| 7 | Elymus multisetus | Great Smoky Mountains National Park | 39 |
| 8 | Lysimachia quadrifolia | Yosemite National Park | 168 |
| 9 | Diphyscium cumberlandianum | Yellowstone National Park | 250 |
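
To connect this preview to the question of how prevalence differs between parks, observations can be summed per park. This is a minimal sketch using only the ten previewed rows, so the totals say nothing about the full data set. The names `preview_obs` and `per_park` are illustrative.

```python
import pandas as pd

# The ten rows shown in the preview above (scientific names omitted for brevity).
preview_obs = pd.DataFrame({
    "park_name": [
        "Great Smoky Mountains National Park",
        "Great Smoky Mountains National Park",
        "Yosemite National Park",
        "Bryce National Park",
        "Great Smoky Mountains National Park",
        "Yosemite National Park",
        "Yellowstone National Park",
        "Great Smoky Mountains National Park",
        "Yosemite National Park",
        "Yellowstone National Park",
    ],
    "observations": [68, 77, 138, 84, 85, 112, 228, 39, 168, 250],
})

# Total sightings per park, largest first.
per_park = (
    preview_obs.groupby("park_name")["observations"]
    .sum()
    .sort_values(ascending=False)
)

print(per_park)
```

The same `groupby` call applied to the full `observations` DataFrame would give park-level totals for the whole data set.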


#### Data Shape
Next, check the dimensions of each data set. `species` has 5,824 rows and 4 columns, while `observations` has 23,296 rows and 3 columns.

In [5]:
print("Species data shape: ", species.shape)
print("Observations data shape: ", observations.shape)

Species data shape:  (5824, 4)
Observations data shape:  (23296, 3)
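
Row counts alone do not show whether rows are unique or complete. The helper below is a small sketch of a possible follow-up check. `summarize` is a hypothetical function name and `demo` is made-up data, used here only so the block runs on its own.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return basic size and quality figures for a DataFrame."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
    }


# Made-up data: one exact duplicate row and one missing status.
demo = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Bos bison"],
    "conservation_status": ["Endangered", "Endangered", np.nan],
})

summary = summarize(demo)
print(summary)
```

On the project data this would be called as `summarize(species)` and `summarize(observations)`. The reported row counts satisfy 23,296 = 4 × 5,824, which would be consistent with one observation row per species row in each of four parks, though that is an inference the data would need to confirm.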


### Explore the data