# Project Title

## Introduction

In this section, briefly describe:
- **Goal of the analysis** — What are you trying to achieve?
- **Key question(s)** — The main problem(s) or topic(s) you aim to answer.
- **Context or background** — Why this question matters, and any relevant domain info.

## Table of Contents
1. [Overview](#introduction)

2. [Setup & Scope](#setup--scope)

3. [Data Exploration](#data-exploration)

4. [Analysis](#analysis)

5. [Conclusions](#conclusions)

## Setup & Scope 

This analysis is based on a dataset provided by Codecademy, inspired by information from the U.S. National Park Service. It focuses on endangered species observed across various national parks and aims to uncover patterns in species vulnerability and distribution.

The following key questions guide the scope of this report:
- Which biological categories (e.g., mammals, birds, plants) are most affected by endangerment?
- What are the top ten most critically endangered species?
- Are certain national parks home to a higher concentration of endangered species?
- How are endangered species distributed across parks and categories?
- How does conservation status vary across different species categories?
- Which endangered species have the highest number of recorded observations?
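
As a rough illustration of how the question about conservation status and species category might be approached, the sketch below cross-tabulates the two columns with `pd.crosstab`. The small `toy_species` frame is invented for illustration and is not taken from `species_info.csv`; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical toy rows: the column names mirror species_info.csv,
# but the values are invented for illustration only.
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird", "Bird", "Vascular Plant"],
    "conservation_status": ["Endangered", "Species of Concern",
                            "Species of Concern", "Species of Concern",
                            "Threatened"],
})

# Rows are categories, columns are statuses, cells are species counts.
status_by_category = pd.crosstab(toy_species["category"],
                                 toy_species["conservation_status"])
print(status_by_category)
```

Combinations that never occur in the data show up as 0 in the resulting table, which makes gaps between categories easy to spot.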

## Python Libraries

In [109]:
# Data handling
import numpy as np
import pandas as pd

# Visualization
from matplotlib import pyplot as plt
import seaborn as sns

## Data Exploration

### Observations

The data comes from two files:
- observations: a file containing the number of observations in each park for each species, identified by its scientific name.


In [106]:
observations = pd.read_csv('observations.csv')
observations.head()

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85



- species_info: a file containing information about each species, including category (mammal, bird, etc.), scientific_name, common_names and conservation_status.


In [111]:
species = pd.read_csv('species_info.csv')
species.head()

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Domesticated Cattle",
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",
4,Mammal,Cervus elaphus,Wapiti Or Elk,



Examining the species_info file shows several duplicated scientific names. For example, Canis lupus appears three times with different conservation statuses, and Vireo solitarius appears twice, once with two common names.
The conservation status is NaN for species that are not endangered, so these values need to be replaced with a string that marks them as not endangered.
Using dtypes, info and describe, we can confirm that observations is the only integer column. Every other column is of type object, which matches the categorical nature of those variables.
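
The duplicate and missing-value checks described above could be done along these lines. This is a minimal sketch on an invented `toy_species` frame: the rows and status values are hypothetical, and only the column names come from `species_info.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, invented for illustration only.
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Mammal", "Bird"],
    "scientific_name": ["Canis lupus", "Canis lupus",
                        "Bos bison", "Vireo solitarius"],
    "common_names": ["Gray Wolf", "Gray Wolf, Wolf",
                     "American Bison, Bison", "Blue-Headed Vireo"],
    "conservation_status": ["Endangered", "In Recovery", np.nan, np.nan],
})

# Count rows per scientific name and keep the names seen more than once.
name_counts = toy_species["scientific_name"].value_counts()
repeated_names = name_counts[name_counts > 1]
print(repeated_names)

# Number of missing values in each column.
missing_per_column = toy_species.isna().sum()
print(missing_per_column)
```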


## Cleaning and Formatting the Data

- observations: the data contains several identical rows, meaning the same scientific name, the same park and the same number of observations. Since there are no timestamps or any other distinguishing field, I treated them as duplicates and removed them. No NaN values were found.

- species_info: this file is more complex. Several rows share the same category and scientific name but differ in their common names or conservation status. Since common names are not an analyzable parameter, I ignored them and kept the last row that appears in the CSV, on the assumption that it is the most recently updated entry. The file has no timestamps to confirm or refute this.
The conservation status column has 5633 NaN values, meaning there is no data about the conservation status of these species. I replaced them with "No Intervention". The data cannot tell us what the actual status of these species is, because the sample is small relative to the general population.

- For observation-based analysis, I added a column holding each scientific name's total number of observations in the observations.csv file.

In [None]:
# Check for missing values in both DataFrames
print(observations.isnull().sum())
print(species.isnull().sum())

# Count every row that belongs to a group of identical rows in 'observations'
# (keep=False marks all members of each group, not only the extra copies)
print(observations.duplicated(keep=False).sum())

# Drop the duplicated rows, keeping the first occurrence of each
observations.drop_duplicates(inplace=True)

# Replace missing conservation statuses with 'No Intervention'
species['conservation_status'] = species['conservation_status'].fillna('No Intervention')

# Collect the 'species_info' rows that share category and scientific_name, for inspection
dup_rows = species[species.duplicated(subset=["category", "scientific_name"], keep=False)].sort_values(by=["scientific_name"])

# Keep only the last row of each (category, scientific_name) pair;
# .copy() makes the result an independent DataFrame before a column is added below
species = species.drop_duplicates(subset=["category", "scientific_name"], keep='last').copy()

# Sum the observations of each species across parks and add the total to the species DataFrame
obs_counts = observations.groupby('scientific_name')['observations'].sum()
species['observations'] = species['scientific_name'].map(obs_counts)
print(species.sort_values(by='observations', ascending=False))
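
Before keeping only the last row per species, it may be worth checking how many duplicated names actually disagree on `conservation_status`, since `keep='last'` discards the other values without any warning. A minimal sketch on invented rows follows; `toy_species` and its values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration only.
toy_species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Bird", "Bird"],
    "scientific_name": ["Canis lupus", "Canis lupus",
                        "Vireo solitarius", "Vireo solitarius"],
    "conservation_status": ["Endangered", "In Recovery",
                            "No Intervention", "No Intervention"],
})

keys = ["category", "scientific_name"]

# Count distinct statuses per species; more than one means the duplicates conflict.
status_counts = toy_species.groupby(keys)["conservation_status"].nunique()
conflicting = status_counts[status_counts > 1]
print(conflicting)

# keep='last' retains the final row of each group, in file order.
deduped = toy_species.drop_duplicates(subset=keys, keep="last")
print(deduped)
```

If the list of conflicting species is short, the rows can be reviewed by hand instead of relying on file order.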


## Data Exploration

- observations 

- What does each file look like? 

- general statistics 

- comments about the data 


In [112]:
observations.info()
observations.describe(include='all')
observations.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB


scientific_name    object
park_name          object
observations        int64
dtype: object

- species_info 

In [113]:
species.info()
species.describe(include='all')
species.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB


category               object
scientific_name        object
common_names           object
conservation_status    object
dtype: object

## Analysis

In this combined section, cover:
- **Approach** — The methods you used to explore and answer the question.
- **Exploratory findings** — Summary stats, plots, correlations.
- **Techniques applied** — Statistical tests, models, transformations, etc.
- **Results** — Tables, charts, metrics.
- **Interpretation** — What the results mean for the original question.

## Conclusions

Summarize:
- **Main takeaway** — The answer to your original question.
- **Implications** — How results could be used or acted upon.
- **Limitations** — Gaps in data, assumptions, or methods.
- **Next steps** — Further analyses, data to collect, or experiments to run.
