# Predicting Endangered Species Status with Logistic Regression

This notebook analyzes biodiversity data from U.S. national parks. We explore relationships
between species characteristics and whether a species is classified as endangered.

**Objectives**
- Perform exploratory analysis and a Chi‑squared test on category vs. endangered status.
- Engineer features (e.g., `is_mammal`, total park observations).
- Train and evaluate a logistic‑regression model to predict endangered status.
- Visualize model performance with a confusion matrix.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay
from scipy.stats import chi2_contingency

# Load data (ensure CSVs are alongside this notebook)
species = pd.read_csv('species_info.csv')
obs = pd.read_csv('observations.csv')

# Binary target: endangered if conservation_status != 'No Intervention'
species['endangered'] = (species['conservation_status'] != 'No Intervention').astype(int)

# Feature engineering
species['is_mammal'] = (species['category'] == 'Mammal').astype(int)
# Total observations per species across all parks
species_counts = obs.groupby('scientific_name')['observations'].sum().rename('obs_count')
# Join counts back to species (set index to scientific_name for join)
species = species.set_index('scientific_name').join(species_counts).fillna({'obs_count': 0}).reset_index()

# Chi‑squared test: category vs endangered
contingency = pd.crosstab(species['category'], species['endangered'])
chi2, p, dof, expected = chi2_contingency(contingency)
print(f'Chi‑squared: {chi2:.2f}, p‑value: {p:.3e}')

# Prepare features and target
X = species[['category', 'is_mammal', 'obs_count']]
y = species['endangered']

categorical = ['category']
numeric = ['is_mammal', 'obs_count']

preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
    ('num', StandardScaler(), numeric)
])

clf = Pipeline([
    ('prep', preprocess),
    ('logreg', LogisticRegression(max_iter=1000, C=1.0))
])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)
print(f'Accuracy: {acc:.3f}, ROC‑AUC: {auc:.3f}')

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title('Confusion Matrix')
plt.show()

Chi‑squared: 0.00, p‑value: 1.000e+00


ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 1

## Conclusion
The logistic‑regression model demonstrates how species metadata and observation counts can help
predict endangered status. Further improvements could include additional ecological features and
more sophisticated models like gradient boosting or random forests.