# Describing Conservation Status Across Observations

## Purpose

This analysis describes the **distribution of conservation status classifications** across all recorded observations in the dataset.

The goal is to establish a clear **baseline understanding** of how conservation status categories are represented overall, prior to examining differences by park or other subgroups.

---

## Context and Scope

This notebook follows a prior **methodological validation step**, which established that observation structure is consistent across parks at the chosen level of aggregation. Building on that foundation, the present analysis focuses on **descriptive characterization**, not comparison.

This analysis therefore examines:

* Overall frequency and proportion of conservation status categories
* Dataset-level representation, aggregated across all parks and taxa

---

## Analytical Question

> What is the overall distribution of conservation status classifications in the dataset?

---

## Method Overview

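1. Aggregate observations by conservation status, collapsed to a binary grouping (`No Concern` vs. `Conservation Concern`)
2. Compute total counts and relative proportions, both observation-weighted and species-weighted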
3. Visualize the distribution using simple summary plots

Counts and proportions are presented together to provide both scale and relative context.
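
As a minimal sketch of steps 1 and 2, the aggregation can be tried on toy data before running it on the full dataset. The column names mirror those used in the code cells of this notebook, but the values below are invented for illustration only and say nothing about the real data.

```python
import pandas as pd

# Toy data (invented values); column names follow the notebook's data frame.
toy = pd.DataFrame({
    "conservation_status": ["No Concern", "No Concern", "Endangered", "Threatened"],
    "observations": [120, 80, 30, 20],
})

# Collapse every status other than "No Concern" into one group.
toy["agg_cons_status"] = toy["conservation_status"].where(
    toy["conservation_status"].eq("No Concern"),
    "Conservation Concern",
)

# Sum observations per group, then express each group as a share of the total.
summary = (
    toy.groupby("agg_cons_status")["observations"]
       .sum()
       .rename("obs_count")
       .reset_index()
)
summary["prop"] = summary["obs_count"] / summary["obs_count"].sum()
print(summary)
```

Keeping `obs_count` and `prop` in the same table is what lets the reader see scale and relative share side by side.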

---

## Interpretive Boundaries

* Counts of conservation status labels reflect **classification frequency**, not species abundance or population size
* Observation counts are influenced by reporting effort and data collection practices
* This analysis does not assess ecological risk or conservation outcomes

Results here serve as **descriptive context** for subsequent comparative analyses rather than standalone inference.


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from util import summarize

sns.set_theme(style="whitegrid")

# load merged / cleaned data frame `df_merged.feather`
df = pd.read_feather('df_merged.feather')

In [None]:
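# Add a binary column: 'No Concern' vs. 'Conservation Concern'.
# Note: a missing status would fall into 'Conservation Concern' here, so this
# relies on the cleaned data having no NaN in `conservation_status`.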
df['agg_cons_status'] = np.where(
    df['conservation_status'].eq('No Concern'),
    'No Concern',
    'Conservation Concern'
)

In [None]:
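# Preview the first rows; print explicitly because a cell only
# auto-displays its last expression
print(df.head())
df.info()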

In [None]:
# Observation-weighted
obs_weighted = (
    df.groupby("agg_cons_status", observed=True)["observations"]
      .sum()
      .rename("obs_count")
      .reset_index()
)
obs_weighted["prop"] = obs_weighted["obs_count"] / obs_weighted["obs_count"].sum()

# Species-weighted (unique taxa counts).
# `nunique` already ignores repeated rows; a taxon listed under both
# groups is counted once in each.
species_weighted = (
    df.groupby("agg_cons_status", observed=True)["scientific_name"]
      .nunique()
      .rename("species_count")
      .reset_index()
)
species_weighted["prop"] = species_weighted["species_count"] / species_weighted["species_count"].sum()

obs_weighted, species_weighted


In [None]:

# Combine into long format
plot_df = pd.concat([
    obs_weighted.assign(method="Observation-weighted")[["agg_cons_status", "prop", "method"]],
    species_weighted.assign(method="Species-weighted")[["agg_cons_status", "prop", "method"]],
], ignore_index=True)

# Ensure consistent ordering
order = ["No Concern", "Conservation Concern"]
plot_df["agg_cons_status"] = pd.Categorical(plot_df["agg_cons_status"], categories=order, ordered=True)

# Pivot to wide for easy grouped bars
wide = plot_df.pivot(index="agg_cons_status", columns="method", values="prop")

ax = wide.plot(kind="bar", figsize=(8, 5))
ax.set_ylabel("Proportion")
ax.set_xlabel("")
ax.set_title("Conservation Status Proportions: Observation vs Species Weighted")
ax.legend(title="")
plt.xticks(rotation=0)
plt.tight_layout()
plt.savefig('fig2_cons_status_proportions.png')


In [None]:
diff = (
    plot_df.pivot(index="agg_cons_status", columns="method", values="prop")
    .assign(delta=lambda d: d["Species-weighted"] - d["Observation-weighted"])
)

ax = diff["delta"].plot(kind="bar", figsize=(6, 4))
ax.axhline(0, linewidth=1)
ax.set_ylabel("Species-weighted − Observation-weighted")
ax.set_xlabel("")
ax.set_title("Difference in Proportions by Weighting")
plt.xticks(rotation=0)
plt.tight_layout()
# Not saved: the plot is interesting but adds no information beyond the grouped bar chart above


## Validation Finding

Normalized biodiversity composition by major taxonomic category appears **broadly similar** across parks.

* Category-level proportions are **closely aligned** across parks
* This similarity persists despite substantial **species-level** variation in observation counts
* At this level of aggregation, there is **no strong evidence** of large, systematic differences in reporting structure between parks

**Interpretation (bounded):**
This pattern is *consistent with* methodological comparability across parks for analyses conducted at the level of major taxonomic categories. However, these results do **not** establish ecological equivalence, and they do not rule out category-specific detection effects, uneven sampling intensity, or other sources of bias that could matter at finer resolution (e.g., species-level comparisons).

Given the high similarity of these conditional distributions, an additional “all-parks” aggregate category-composition plot would likely be **informationally redundant** and is omitted for scope control.
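
For reference, a normalized category-by-park comparison of the kind described above could be computed roughly as follows. This is only a sketch under stated assumptions: the column names `park_name` and `category` are hypothetical, and the toy values are invented, so it does not reproduce the validation notebook's actual code or results.

```python
import pandas as pd

# Toy data (invented values). `park_name` and `category` are assumed column names.
toy = pd.DataFrame({
    "park_name": ["Park A", "Park A", "Park A", "Park B", "Park B", "Park B"],
    "category": ["Bird", "Mammal", "Vascular Plant"] * 2,
    "observations": [50, 30, 120, 100, 60, 240],
})

# Total observations per park and category, one row per park.
counts = (
    toy.groupby(["park_name", "category"])["observations"]
       .sum()
       .unstack(fill_value=0)
)

# Normalize each row so a park's category shares sum to 1.
composition = counts.div(counts.sum(axis=1), axis=0)
print(composition.round(3))
```

In this toy case the two parks differ in raw counts but have identical normalized rows, which is the sense in which composition can be "closely aligned" despite variation in observation totals.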

---

## Scope Control: Validation vs. Description

Additional rollups (e.g., flora vs. fauna) were explored during validation. While they are descriptively consistent with the category-level results, they do not provide **independent** evidence about consistency across parks beyond what is already shown in the normalized category-by-park comparison.

To keep the analysis focused on comparisons where meaningful differences are more likely to emerge, these higher-level aggregations are omitted.
