### This is a data science project about biodiversity

Let's first take a look at the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

species_data = pd.read_csv("species_info.csv")
observation_data = pd.read_csv("observations.csv")

print(species_data.head(), observation_data.head())

### A few potential questions to look at
1. Does the number of observations match the conservation status (i.e. endangered -> fewer observations), and is it consistent between the national parks?
2. Are there differences in the diversity of species between the national parks?
3. Are rare or endangered species clustered in particular parks?

Let's start with question 1. First, we need to find out which conservation status values occur in the data.

In [None]:
print(species_data.conservation_status.unique())
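
`unique()` lists the status values but not how often each one occurs, or how many species have no status at all. The following is a minimal sketch on a toy table, assuming that some species carry a missing status; `toy_species` and its values are hypothetical and are not taken from `species_info.csv`.

```python
import pandas as pd

# Hypothetical stand-in for species_info.csv (toy values, not the real data)
toy_species = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Bison bison", "Acer rubrum", "Myotis lucifugus"],
    "conservation_status": ["Endangered", None, None, "Species of Concern"],
})

# dropna=False keeps missing statuses as their own row in the tally
status_counts = toy_species["conservation_status"].value_counts(dropna=False)
print(status_counts)

# Number of species without any recorded status
n_missing = toy_species["conservation_status"].isna().sum()
print(n_missing)
```

On the real data, the same two calls could be run on `species_data` directly.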

To group the observations by conservation status, the two datasets first need to be joined on the species name.

In [None]:
merged_data = observation_data.merge(
    species_data[['scientific_name', 'conservation_status']],
    on='scientific_name',
    how='left'
)
print(merged_data.head())
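
A left merge like this one repeats observation rows whenever a `scientific_name` appears more than once in the species table, which would inflate any later sums. Whether that happens in `species_info.csv` is not checked here, so the sketch below only shows how one might test for it on toy data. The tables and the choice to keep the first row per name are assumptions for illustration.

```python
import pandas as pd

# Toy tables with hypothetical values; "Canis lupus" appears twice on purpose
toy_species = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Acer rubrum"],
    "conservation_status": ["Endangered", "In Recovery", None],
})
toy_obs = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Acer rubrum"],
    "park_name": ["Park A", "Park A"],
    "observations": [30, 120],
})

# Count names that occur more than once in the lookup table
n_dupes = toy_species["scientific_name"].duplicated().sum()
print(n_dupes)

# validate="many_to_one" makes pandas raise if the right-hand key is not unique
try:
    toy_obs.merge(toy_species, on="scientific_name", how="left", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print(merge_ok)

# One possible fix (an assumption about what is wanted): keep the first row per name
unique_species = toy_species.drop_duplicates(subset="scientific_name")
toy_merged = toy_obs.merge(
    unique_species, on="scientific_name", how="left", validate="many_to_one"
)
print(len(toy_merged))
```

Keeping the first row is only one option; if duplicated names carry different statuses, deciding which one to keep is a judgment call about the data.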


Next, we plot the distribution of observations per species for each conservation status.

In [None]:
# Create the plot
plt.figure(figsize=(10,6))
sns.violinplot(
    x='conservation_status',
    y='observations',
    data=merged_data
)
plt.title('Distribution of Observations per Species by Conservation Status')
plt.xlabel('Conservation Status')
plt.ylabel('Number of Observations')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

As expected, the distribution for "Endangered" is widest at the lowest number of observations. "Threatened" and "In Recovery" are widest slightly above it, and "Species of Concern" is widest at the highest number of observations.
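
Reading the widest point of a violin by eye can be backed up with a numeric summary. A minimal sketch, assuming a table shaped like `merged_data`; the `toy_merged` values are hypothetical and chosen only to show the mechanics.

```python
import pandas as pd

# Hypothetical stand-in for merged_data (toy values)
toy_merged = pd.DataFrame({
    "conservation_status": ["Endangered", "Endangered", "Threatened",
                            "Species of Concern", "Species of Concern", None],
    "observations": [20, 40, 70, 120, 140, 150],
})

# Rows with a missing status are dropped by groupby by default
status_summary = (
    toy_merged
    .groupby("conservation_status")["observations"]
    .agg(["count", "median", "min", "max"])
    .sort_values("median")
)
print(status_summary)
```

Sorting by the median puts the statuses in the same order one would read off the plot, which makes the two easy to compare.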

Next, we check whether the number of observations is consistent between the parks. Because the parks differ in size, we compare each species' share of the total observations in its park rather than the raw counts.

In [None]:
# Total observations for each park
park_totals = observation_data.groupby('park_name')['observations'].sum().reset_index()
park_totals = park_totals.rename(columns={'observations': 'total_park_observations'})

# Merged data
merged = observation_data.merge(park_totals, on='park_name', how='left')

# Percentage of total observations per park for each species
merged['species_percent'] = (
    merged['observations'] / merged['total_park_observations'] * 100
)

# Taking only the 10 species with the most observations
top_species = observation_data.groupby('scientific_name')['observations'].sum().reset_index()
top_species = top_species.sort_values('observations', ascending=False)
top_species_list = top_species['scientific_name'].head(10).tolist()

# Filtering the dataframe to only include these species
merged_top = merged[merged['scientific_name'].isin(top_species_list)]

# Plotting the percentages
plt.figure(figsize=(12,6))
sns.barplot(
    data=merged_top,
    x='park_name',
    y='species_percent',
    hue='scientific_name',
    errorbar=None
)
plt.ylabel('Percent of Park Observations (%)')
plt.title('Top 10 Species by Share of Observations in Each Park')
plt.legend(title='Species', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()


Even the largest shares are small, and the ratios between species are not consistent across the parks. This could be because the parks have different distributions of habitats for these species.
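
The per-park normalisation used above can be checked on a small table where the totals are easy to verify by hand. This sketch uses `groupby(...).transform("sum")` instead of a separate totals table and merge; it is an alternative way to compute the same quantity, and the toy values are hypothetical.

```python
import pandas as pd

# Toy observations (hypothetical values): Park A totals 100, Park B totals 200
toy_obs = pd.DataFrame({
    "park_name": ["Park A", "Park A", "Park B", "Park B"],
    "scientific_name": ["Acer rubrum", "Bison bison", "Acer rubrum", "Bison bison"],
    "observations": [30, 70, 150, 50],
})

# transform("sum") broadcasts each park's total back onto its own rows
park_total = toy_obs.groupby("park_name")["observations"].transform("sum")
toy_obs["species_percent"] = toy_obs["observations"] / park_total * 100

# Species as rows, parks as columns, for a side-by-side comparison
share_table = toy_obs.pivot(
    index="scientific_name", columns="park_name", values="species_percent"
)
print(share_table)
```

Each column of `share_table` sums to 100, which is a quick sanity check that the normalisation was done within parks and not across the whole dataset.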

Now to question 2: are there differences in the diversity of species between the parks?

In [None]:
# 1) Species richness and total observations per park
richness_obs = (
    observation_data
    .groupby('park_name')
    .agg(
        species_richness=('scientific_name', 'nunique'),
        total_observations=('observations', 'sum')
    )
    .reset_index()
)

# 2) Species per 100 observations
richness_obs['species_per_100_obs'] = (
    richness_obs['species_richness'] / richness_obs['total_observations'] * 100
)

# 3) Plot bar chart
plt.figure(figsize=(10, 6))
sns.barplot(
    data=richness_obs,
    x='park_name',
    y='species_per_100_obs',
    color='steelblue',
    errorbar=None  # no CI; values are already aggregated
)
plt.title('Species per 100 Observations in Each Park')
plt.xlabel('Park')
plt.ylabel('Species per 100 Observations')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Plotting the observations per category (Mammal, Amphibian, ...)
# Merge in category (major group) info
obs_cat = observation_data.merge(
    species_data[['scientific_name', 'category']],
    on='scientific_name',
    how='left'
)

# Sum observations by park and category
park_cat = (
    obs_cat
    .groupby(['park_name', 'category'])['observations']
    .sum()
    .unstack(fill_value=0)  # columns = categories
)

# Optional: convert to percentages within each park
park_cat_pct = park_cat.div(park_cat.sum(axis=1), axis=0) * 100

# Stacked bar plot (absolute counts -> use park_cat; percentages -> park_cat_pct)
ax = park_cat.plot(
    kind='bar',
    stacked=True,
    figsize=(10, 6),
    colormap='tab20'
)
# For percentages, use: ax = park_cat_pct.plot(...)

plt.title('Observations per Major Group in Each Park')
plt.xlabel('Park')
plt.ylabel('Number of Observations')  # or 'Percent of Observations' if using park_cat_pct
plt.xticks(rotation=45, ha='right')
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()


The first plot shows that the Great Smoky Mountains have the highest number of species per 100 observations. This means that either there are more species in total or there are fewer observations per species, i.e., smaller populations. The latter seems more probable. Conversely, Yellowstone has the lowest number of species per 100 observations, which suggests that some species there have very large populations.

The second figure shows that Vascular Plants are by far the most common category of observations in every park. The next largest category is Birds, which already has far fewer observations, and the remaining categories have fewer still.
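
Species per 100 observations mixes richness with how many observations a park has, so a measure that also reflects how evenly observations are spread across species can complement it. The sketch below computes the Shannon index, H = -sum(p * ln p), per park. This is a different technique from the one used in the notebook, not a replacement for it, and the toy table and the helper `shannon_index` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy observations (hypothetical): Park A is even, Park B is dominated by one species
toy_obs = pd.DataFrame({
    "park_name": ["Park A"] * 3 + ["Park B"] * 3,
    "scientific_name": ["sp1", "sp2", "sp3"] * 2,
    "observations": [100, 100, 100, 280, 10, 10],
})

def shannon_index(counts: pd.Series) -> float:
    """Shannon index H = -sum(p * ln p) over species with non-zero counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

# Sum per species first in case a species has several rows in a park
species_totals = toy_obs.groupby(["park_name", "scientific_name"])["observations"].sum()

# One index value per park
shannon_per_park = species_totals.groupby(level="park_name").apply(shannon_index)
print(shannon_per_park)
```

Both toy parks have three species, but the park where one species dominates gets the lower index, which is the distinction that a richness-based measure cannot show.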

Next, we are going to check if rare or endangered species are present in all parks equally, or if they cluster in some specific ones.

In [None]:
# Merge status into observations
obs_status = observation_data.merge(
    species_data[['scientific_name', 'conservation_status']],
    on='scientific_name',
    how='left'
)

rare_statuses = ['Species of Concern', 'Endangered', 'Threatened', 'In Recovery']

rare_obs = obs_status[obs_status['conservation_status'].isin(rare_statuses)]

# Number of distinct rare species per park
rare_species_per_park = (
    rare_obs
    .groupby('park_name')['scientific_name']
    .nunique()
    .reset_index(name='n_rare_species')
)
print(rare_species_per_park)

rare_obs_per_park = (
    rare_obs
    .groupby('park_name')['observations']
    .sum()
    .reset_index(name='rare_observations')
)
print(rare_obs_per_park)

# Total observations per park
total_obs_per_park = (
    observation_data
    .groupby('park_name')['observations']
    .sum()
    .reset_index(name='total_observations')
)

# Combine with rare observations
park_rare = rare_obs_per_park.merge(total_obs_per_park, on='park_name', how='left')
park_rare['rare_share_pct'] = (
    park_rare['rare_observations'] / park_rare['total_observations'] * 100
)
print(park_rare)

# Plotting rare observations per park
plt.figure(figsize=(8, 5))
sns.barplot(
    data=park_rare,
    x='park_name',
    y='rare_share_pct',
    color='firebrick',
    errorbar=None
)
plt.ylabel('Rare / Endangered Observations (%)')
plt.xlabel('Park')
plt.title('Share of Rare/Endangered Observations by Park')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


The same number of rare species is recorded in every park. Because the total number of observations differs widely between the parks, the absolute counts say little. The percentages of rare-species observations out of all observations are more comparable. In every park, about 3% of all observations are of rare species, with the largest share (3.23%) in Bryce National Park and the smallest (2.97%) in Great Smoky Mountains National Park.
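
The totals above pool all four statuses, so a difference in a single status could be hidden. A minimal sketch of a per-status breakdown, assuming a table shaped like `rare_obs`; `toy_rare` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for rare_obs (toy values)
toy_rare = pd.DataFrame({
    "park_name": ["Park A", "Park A", "Park A", "Park B", "Park B", "Park B"],
    "scientific_name": ["Canis lupus", "Myotis lucifugus", "Gymnogyps californianus"] * 2,
    "conservation_status": ["Endangered", "Species of Concern", "Endangered"] * 2,
    "observations": [30, 90, 20, 60, 200, 45],
})

# Distinct species per park and status (parks as rows, statuses as columns)
species_by_status = (
    toy_rare
    .groupby(["park_name", "conservation_status"])["scientific_name"]
    .nunique()
    .unstack(fill_value=0)
)
print(species_by_status)

# Summed observations per park and status
obs_by_status = (
    toy_rare
    .groupby(["park_name", "conservation_status"])["observations"]
    .sum()
    .unstack(fill_value=0)
)
print(obs_by_status)
```

Dividing `obs_by_status` by each park's total observations would give per-status shares that can be compared in the same way as the pooled percentage.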