<a href="https://colab.research.google.com/github/rodrigo-koblitz/shared_particular_colabs/blob/main/evolution_hirarquical.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np

# --- Create Effort DataFrame ---
np.random.seed(42) # for reproducibility

num_efforts = 10
dates = pd.date_range(start='2023-01-01', periods=num_efforts, freq='D')  # already a DatetimeIndex
locations = [f'Site_{i % 3 + 1}' for i in range(num_efforts)]
visibility_conditions = np.random.choice(['Excellent', 'Good', 'Moderate', 'Poor'], size=num_efforts, p=[0.3, 0.4, 0.2, 0.1])
effort_durations_hours = np.round(np.random.uniform(2, 8, num_efforts), 1)

effort_data = {
    'effort_id': range(1, num_efforts + 1),
    'date': dates,
    'location': locations,
    'visibility': visibility_conditions,
    'effort_duration_hours': effort_durations_hours
}

effort_df = pd.DataFrame(effort_data)

print("Effort DataFrame:")
display(effort_df)

Effort DataFrame:


|   | effort_id | date | location | visibility | effort_duration_hours |
|---|---|---|---|---|---|
| 0 | 1 | 2023-01-01 | Site_1 | Good | 2.1 |
| 1 | 2 | 2023-01-02 | Site_2 | Poor | 7.8 |
| 2 | 3 | 2023-01-03 | Site_3 | Moderate | 7.0 |
| 3 | 4 | 2023-01-04 | Site_1 | Good | 3.3 |
| 4 | 5 | 2023-01-05 | Site_2 | Excellent | 3.1 |
| 5 | 6 | 2023-01-06 | Site_3 | Excellent | 3.1 |
| 6 | 7 | 2023-01-07 | Site_1 | Excellent | 3.8 |
| 7 | 8 | 2023-01-08 | Site_2 | Moderate | 5.1 |
| 8 | 9 | 2023-01-09 | Site_3 | Good | 4.6 |
| 9 | 10 | 2023-01-10 | Site_1 | Moderate | 3.7 |


In [2]:
# --- Create Sightings DataFrame (empty as no animals were found) ---
sightings_df = pd.DataFrame(columns=['sighting_id', 'effort_id', 'species', 'count'])

print("\nSightings DataFrame:")
display(sightings_df)


Sightings DataFrame:


|   | sighting_id | effort_id | species | count |
|---|---|---|---|---|


# Task
Simulate animal sightings data, analyze it in conjunction with the effort data, and then outline a conceptual approach for building a hierarchical model using both datasets.

## Simulate sightings data

### Subtask:
Generate simulated animal sightings data to populate the `sightings_df` based on the existing `effort_df`.


In [5]:
np.random.seed(42) # Set a random seed for reproducibility

num_simulated_dolphin_sightings = np.random.randint(15, 26) # Determine the number of sightings to simulate (between 15 and 25)

unique_effort_ids = effort_df['effort_id'].unique() # Create a list of unique effort_id values from effort_df

simulated_effort_ids = np.random.choice(unique_effort_ids, size=num_simulated_dolphin_sightings, replace=True)

simulated_species = ['Dolphin'] * num_simulated_dolphin_sightings # All sightings are for 'Dolphin'

simulated_counts = np.random.randint(1, 21, size=num_simulated_dolphin_sightings) # Generate a random count for each sighting (between 1 and 20)

dolphin_sightings_data = pd.DataFrame({
    'sighting_id': range(1, num_simulated_dolphin_sightings + 1),
    'effort_id': simulated_effort_ids,
    'species': simulated_species,
    'count': simulated_counts
})

print("Simulated Dolphin Sightings Data:")
display(dolphin_sightings_data.head())

Simulated Dolphin Sightings Data:


|   | sighting_id | effort_id | species | count |
|---|---|---|---|---|
| 0 | 1 | 4 | Dolphin | 12 |
| 1 | 2 | 8 | Dolphin | 12 |
| 2 | 3 | 5 | Dolphin | 17 |
| 3 | 4 | 7 | Dolphin | 10 |
| 4 | 5 | 10 | Dolphin | 16 |


**Reasoning**:
The previous step generated the `dolphin_sightings_data` DataFrame. The next step copies it into `sightings_df` (which was previously empty), completing the subtask of populating `sightings_df` with simulated sightings, here all for 'Dolphin'.



In [6]:
sightings_df = dolphin_sightings_data.copy()

print("Populated Sightings DataFrame with Dolphin data:")
display(sightings_df.head())

Populated Sightings DataFrame with Dolphin data:


|   | sighting_id | effort_id | species | count |
|---|---|---|---|---|
| 0 | 1 | 4 | Dolphin | 12 |
| 1 | 2 | 8 | Dolphin | 12 |
| 2 | 3 | 5 | Dolphin | 17 |
| 3 | 4 | 7 | Dolphin | 10 |
| 4 | 5 | 10 | Dolphin | 16 |


### Buckland's First Stages of Hierarchical Model Conception: Estimating Detectability

Before diving into the full hierarchical model, it's crucial to understand a foundational concept: **detectability**. In wildlife surveys, animals are often present but not always detected. Factors like observer skill, environmental conditions, or the duration of effort can influence the probability of detecting an animal, even if it's there.

Building on the distance-sampling work of statisticians such as Stephen T. Buckland, the first stage often involves estimating this **detection probability (p)**. This parameter quantifies the likelihood that an animal (or species) present within a surveyed area is actually observed.

#### 1. Data Preparation for Detectability

To estimate detectability using our `effort_df` and `sightings_df`, we first need to create a dataset that combines effort information with a clear binary outcome: whether a dolphin was sighted during a given effort. This will involve:

*   **Merging:** Combining `effort_df` with `sightings_df` based on `effort_id`. Critically, we need to ensure all efforts are present, even those with no sightings.
*   **Creating a binary response:** A new column, `has_sighting`, will indicate `1` if at least one dolphin was recorded for that effort, and `0` if no dolphins were recorded.
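
The two preparation steps above can be sketched on a small toy example. The frames `efforts` and `sightings` and their values are hypothetical stand-ins for `effort_df` and `sightings_df`. The sketch assumes we want exactly one row per effort, so sightings are aggregated per `effort_id` before the left merge.

```python
import pandas as pd

# Hypothetical toy data standing in for effort_df and sightings_df.
efforts = pd.DataFrame({
    "effort_id": [1, 2, 3, 4],
    "visibility": ["Good", "Poor", "Good", "Excellent"],
})
sightings = pd.DataFrame({
    "sighting_id": [1, 2, 3, 4],
    "effort_id": [1, 1, 2, 4],
    "count": [5, 3, 12, 7],
})

# Aggregate to one row per effort so the merge cannot duplicate efforts.
per_effort = sightings.groupby("effort_id", as_index=False).agg(
    n_sightings=("sighting_id", "count"),
    total_count=("count", "sum"),
)

# Left merge keeps efforts without sightings; their aggregates become 0.
detection = efforts.merge(per_effort, on="effort_id", how="left")
filled = detection[["n_sightings", "total_count"]].fillna(0).astype(int)
detection[["n_sightings", "total_count"]] = filled

# Binary response: 1 if at least one sighting was recorded for the effort.
detection["has_sighting"] = (detection["n_sightings"] > 0).astype(int)
print(detection)
```

With one row per effort, the mean of `has_sighting` is the share of efforts with at least one sighting, rather than a share of sighting rows.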

#### 2. Mathematical Formulation (Simplified Logistic Regression)

A common approach to model this binary detection outcome is through **logistic regression**. This statistical model predicts the probability of an event (in our case, detection) based on one or more predictor variables (covariates). The probability of detection, often denoted as `p`, can be modeled as:

$$ \text{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 \cdot \text{covariate}_{1i} + \beta_2 \cdot \text{covariate}_{2i} + \dots $$

Where:
*   $p_i$ is the probability of detecting dolphins during effort $i$.
*   $\beta_0$ is the intercept.
*   $\beta_k$ are the coefficients for the $k^{th}$ covariate.
*   $\text{covariate}_{ki}$ are the values of the predictor variables (e.g., `visibility`, `effort_duration_hours`) for effort $i$.

From this, the probability $p_i$ can be calculated as:

$$ p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{covariate}_{1i} + \dots)}} $$
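
As a small numeric illustration of the two formulas above, the sketch below converts a linear predictor into a detection probability and back. The coefficient values `beta_0` and `beta_duration` are made up for illustration and are not estimates from our data.

```python
import numpy as np

def inverse_logit(eta):
    """Map a linear predictor (log-odds) to a probability."""
    return 1.0 / (1.0 + np.exp(-eta))

def logit(p):
    """Map a probability to log-odds."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only to illustrate the formula.
beta_0 = -1.0
beta_duration = 0.3

durations = np.array([2.0, 5.0, 8.0])  # effort_duration_hours
eta = beta_0 + beta_duration * durations
p = inverse_logit(eta)

for hours, prob in zip(durations, p):
    print(f"{hours:.0f} h -> p = {prob:.3f}")
```

Under these assumed values, longer efforts map to higher detection probabilities, and applying `logit` to `p` returns the original linear predictor.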

#### 3. Using Our Data to Estimate Detectability

We can use covariates from our `effort_df` (such as `visibility` and `effort_duration_hours`) to predict `has_sighting`. For instance:

*   **`visibility`**: We expect that better visibility conditions might increase the probability of detection.
*   **`effort_duration_hours`**: Longer observation efforts might also increase detection probability.

In [16]:
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder

# 1. Prepare the data for detectability modeling

# Create a DataFrame to hold all efforts and their sighting status
# Start with all efforts from effort_df
prob_detection_df = effort_df.copy()

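# Merge with sightings_df to get count information. Use how='left' to keep all efforts.
# Note: an effort with several sightings appears once per sighting after this merge,
# so the result has one row per sighting (plus one row per effort without sightings).
# Fill NaN counts with 0, meaning no dolphins were sighted during that effort.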
prob_detection_df = pd.merge(
    prob_detection_df,
    sightings_df[['effort_id', 'count']],
    on='effort_id',
    how='left'
).fillna({'count': 0})

# Create the binary 'has_sighting' column
prob_detection_df['has_sighting'] = (prob_detection_df['count'] > 0).astype(int)

print("DataFrame for Detectability Modeling (first 5 rows):")
display(prob_detection_df.head())
print("\nValue counts for 'has_sighting':")
display(prob_detection_df['has_sighting'].value_counts())

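# 2. Prepare categorical variables for logistic regression (e.g., 'location', 'visibility')
# Label encode 'location' and 'visibility' for the model.
# Note: LabelEncoder assigns integer codes in sorted order, so the model treats each
# encoded column as a single numeric predictor rather than as separate categories.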
le = LabelEncoder()
prob_detection_df['location_encoded'] = le.fit_transform(prob_detection_df['location'])
prob_detection_df['visibility_encoded'] = le.fit_transform(prob_detection_df['visibility'])

# 3. Fit a simple Logistic Regression model for detectability
# Define dependent and independent variables
Y = prob_detection_df['has_sighting']
X = prob_detection_df[['effort_duration_hours', 'visibility_encoded', 'location_encoded']]

# Add a constant to the independent variables for the intercept term
X = sm.add_constant(X)

# Fit the logistic regression model
logit_model = sm.Logit(Y, X)
result = logit_model.fit()

print("\nLogistic Regression Results for Detectability:")
print(result.summary())

print("\nInterpretation of Detectability Parameters:")
print("The coefficients in the 'coef' column represent the change in the log-odds of detection for a one-unit increase in the predictor variable.")
print("For example, a positive coefficient for 'effort_duration_hours' would suggest that longer efforts increase the log-odds (and thus the probability) of detecting dolphins.")
print("Because 'visibility_encoded' and 'location_encoded' are integer label codes, each has a single coefficient: the change in log-odds per one-step increase in the code, not a contrast against a baseline category.")

|   | effort_id | date | location | visibility | effort_duration_hours | count | has_sighting | location_encoded | visibility_encoded |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-01-01 | Site_1 | Good | 2.1 | 9.0 | 1 | 0 | 1 |
| 1 | 2 | 2023-01-02 | Site_2 | Poor | 7.8 | 7.0 | 1 | 1 | 3 |
| 2 | 2 | 2023-01-02 | Site_2 | Poor | 7.8 | 14.0 | 1 | 1 | 3 |
| 3 | 3 | 2023-01-03 | Site_3 | Moderate | 7.0 | 15.0 | 1 | 2 | 2 |
| 4 | 3 | 2023-01-03 | Site_3 | Moderate | 7.0 | 19.0 | 1 | 2 | 2 |
| 5 | 4 | 2023-01-04 | Site_1 | Good | 3.3 | 12.0 | 1 | 0 | 1 |
| 6 | 4 | 2023-01-04 | Site_1 | Good | 3.3 | 20.0 | 1 | 0 | 1 |
| 7 | 5 | 2023-01-05 | Site_2 | Excellent | 3.1 | 17.0 | 1 | 1 | 0 |
| 8 | 5 | 2023-01-05 | Site_2 | Excellent | 3.1 | 12.0 | 1 | 1 | 0 |
| 9 | 5 | 2023-01-05 | Site_2 | Excellent | 3.1 | 9.0 | 1 | 1 | 0 |
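
The model above feeds integer label codes for `visibility` and `location` into the regression. One common alternative, shown here only as a sketch on hypothetical toy values, is one-hot (dummy) encoding with a dropped reference level, so that each remaining category gets its own coefficient relative to that baseline.

```python
import pandas as pd

# Hypothetical toy rows; the column names mirror effort_df.
toy = pd.DataFrame({
    "visibility": ["Good", "Poor", "Moderate", "Excellent", "Good"],
    "effort_duration_hours": [2.1, 7.8, 7.0, 3.1, 4.6],
})

# drop_first=True removes the first level in sorted order ("Excellent"),
# which then acts as the reference category.
dummies = pd.get_dummies(
    toy["visibility"], prefix="visibility", drop_first=True, dtype=float
)

# Design matrix: numeric covariate plus one indicator column per category.
X = pd.concat([toy[["effort_duration_hours"]], dummies], axis=1)
print(X)
```

With this design matrix, a coefficient on `visibility_Poor` would be read as the difference in log-odds between poor and excellent visibility. Whether a model with this many parameters can be fitted depends on having enough efforts in each category.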


### Nichols, MacKenzie, and Colleagues: The Concept of Detectability ($p$)

Building on the foundational work of researchers like Nichols, MacKenzie, and their collaborators, particularly in the field of **occupancy modeling**, the concept of detectability ($p$) is central to robust ecological inference. Rather than simply recording whether an animal was observed, their work emphasizes separating the probability that a species occupies a site ($\psi$) from the probability that it is **detected given that it occupies the site** ($p$).

In our current context, where `has_sighting` represents a binary outcome for each effort, we can think of an empirical estimate of detectability. This empirical estimate of $p$ is a direct proportion of successful detections relative to the total opportunities for detection.

#### Empirical Formulation of Detectability ($p$)

A very basic, empirical estimate of detectability (often a component of more complex models) can be calculated as:

$$ p_{\text{empirical}} = \frac{\text{Number of efforts with at least one dolphin sighting}}{\text{Total number of efforts}} $$

This simple ratio gives us a baseline understanding of how often we are detecting dolphins when we put forth effort. It's important to note that this is a simplification; formal occupancy models require repeated surveys at the same site to properly estimate both occupancy ($\psi$) and detection probability ($p$) while accounting for spatial and temporal variation, and to distinguish between true absence and non-detection. However, this empirical calculation serves to illustrate the fundamental idea of quantifying the likelihood of observing a species.

We can also explore how this empirical detectability varies with covariates such as `visibility` or `effort_duration_hours` by calculating the proportion of detections within categories of these variables. This informs us about the factors that might influence our ability to detect dolphins.
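
For a continuous covariate such as `effort_duration_hours`, "categories" have to be created first. The sketch below bins durations with `pd.cut` and computes the proportion of detections per bin. The table `per_effort`, its values, and the bin edges are hypothetical; it assumes one row per effort.

```python
import pandas as pd

# Hypothetical per-effort table (one row per effort), for illustration only.
per_effort = pd.DataFrame({
    "effort_duration_hours": [2.1, 3.3, 4.6, 5.1, 7.0, 7.8],
    "has_sighting": [0, 1, 0, 1, 1, 1],
})

# Bin the continuous covariate into three duration classes (assumed edges).
per_effort["duration_bin"] = pd.cut(
    per_effort["effort_duration_hours"],
    bins=[2, 4, 6, 8],
    labels=["2-4 h", "4-6 h", "6-8 h"],
    include_lowest=True,
)

# Proportion of efforts with a detection, and number of efforts, per bin.
summary = per_effort.groupby("duration_bin", observed=True)["has_sighting"].agg(
    ["mean", "count"]
)
print(summary)
```

Reporting the number of efforts next to each proportion helps show how little data sits behind each bin.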

In [15]:
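# Calculate overall empirical detection probability.
# Note: prob_detection_df has one row per sighting, so efforts with several sightings
# are counted more than once and this mean is weighted toward them.
overall_empirical_detectability = prob_detection_df['has_sighting'].mean()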
print(f"Overall Empirical Detection Probability: {overall_empirical_detectability:.3f}")

# Calculate empirical detection probability by visibility
empirical_detectability_by_visibility = prob_detection_df.groupby('visibility')['has_sighting'].mean()
print("\nEmpirical Detection Probability by Visibility:")
display(empirical_detectability_by_visibility)

# Calculate empirical detection probability by location
empirical_detectability_by_location = prob_detection_df.groupby('location')['has_sighting'].mean()
print("\nEmpirical Detection Probability by Location:")
display(empirical_detectability_by_location)

print("\nNote: These are empirical averages. Formal occupancy models (e.g., as developed by Nichols and MacKenzie) use more sophisticated methods to estimate detection probability while accounting for imperfect detection and confounding factors, often requiring repeated surveys at sites.")

Overall Empirical Detection Probability: 0.955

Empirical Detection Probability by Visibility:


| visibility | has_sighting |
|---|---|
| Excellent | 1.0 |
| Good | 0.75 |
| Moderate | 1.0 |
| Poor | 1.0 |



Empirical Detection Probability by Location:


| location | has_sighting |
|---|---|
| Site_1 | 1.0 |
| Site_2 | 1.0 |
| Site_3 | 0.8 |



Note: These are empirical averages. Formal occupancy models (e.g., as developed by Nichols and MacKenzie) use more sophisticated methods to estimate detection probability while accounting for imperfect detection and confounding factors, often requiring repeated surveys at sites.


## Outline hierarchical model approach

### Subtask:
Provide a conceptual explanation of how a two-step hierarchical model would be built using the `effort_df` and the now populated `sightings_df`, and how it would evolve into a single model.

#### Instructions
1. Explain the concept of a two-step hierarchical model in the context of wildlife data, where the first step models detection probability/occurrence (e.g., whether a dolphin was seen at all), and the second step models abundance given detection (e.g., how many dolphins were seen).
2. Describe how the `effort_df` and `sightings_df` would be used in each step of this two-step model. For instance, which variables from each DataFrame would inform each part of the model.
3. Discuss the limitations or potential issues of a two-step approach (e.g., propagation of error, ignoring dependencies).
4. Explain how these two steps could be integrated into a single, unified hierarchical model (e.g., a Zero-Inflated Poisson or Zero-Inflated Negative Binomial model) that simultaneously accounts for both presence/absence (detection) and abundance (count) within a single statistical framework.
5. Highlight the benefits of using a single hierarchical model over a two-step approach, particularly in terms of statistical efficiency and more robust inference.

### Conceptual Explanation of Hierarchical Models for Dolphin Sightings

In ecological modeling, especially when dealing with count data that often includes many zeros (meaning no animals were detected), a simple regression model might not adequately capture the underlying ecological processes. Hierarchical models offer a powerful framework to address this complexity by modeling different aspects of the observation process.

#### 1. Two-Step Hierarchical Model Concept
A two-step hierarchical model breaks down the observation process into two distinct parts:

*   **Step 1: Occurrence/Detection Model (Probability of seeing *any* dolphins)**
    This first step models the probability that an animal (or group of animals) is present and detected during an effort. It's essentially a binary outcome: did we see dolphins (1) or not (0)? This accounts for factors that influence whether a species is observed at all, even if it's present. For instance, bad visibility might reduce the chance of detection, regardless of how many dolphins are actually there.

*   **Step 2: Abundance Model (Count *given* detection)**
    If dolphins were detected (i.e., the outcome of Step 1 is '1'), this second step models *how many* dolphins were observed. This focuses on the actual number of individuals, conditioned on the fact that some were seen. This part might be influenced by factors related to the population size or aggregation behavior.

#### 2. Using `effort_df` and `sightings_df` in the Two-Step Model

To apply this to our `effort_df` and `sightings_df`:

*   **Data Preparation:** First, we would need to create a new binary column in `effort_df` (or a merged version) indicating `has_sighting` (1 if `count > 0` for that `effort_id`, 0 otherwise). For efforts with no sightings, the `count` would effectively be 0.

*   **Step 1: Occurrence/Detection Model (e.g., Logistic Regression)**
    *   **Response Variable:** `has_sighting` (binary: 0 or 1).
    *   **Predictor Variables (from `effort_df`):** Factors influencing detectability or presence.
        *   `location`: Different sites might have varying dolphin presence or detection probabilities.
        *   `visibility`: Poor visibility directly reduces the likelihood of seeing dolphins.
        *   `effort_duration_hours`: Longer effort duration generally increases detection probability.
        *   `date`: Could account for temporal variations in dolphin presence or detectability.
    *   **Model Type:** Typically a logistic regression or probit model is used here.

*   **Step 2: Abundance Model (e.g., Poisson or Negative Binomial Regression)**
    *   **Response Variable:** `count` (from `sightings_df`), *but only for efforts where `has_sighting` is 1*.
    *   **Predictor Variables (from `effort_df` or `merged_df`):** Factors influencing the number of dolphins when they *are* seen.
        *   `location`: Certain locations might host larger groups of dolphins.
        *   `effort_duration_hours`: Longer observation times might lead to higher counts.
        *   `visibility`: While primarily affecting detection, it might also influence how many individuals can be accurately counted.
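    *   **Model Type:** Poisson regression (for count data) or Negative Binomial regression (if there's overdispersion in counts) would be appropriate. Because only efforts with at least one dolphin enter this step, the counts are strictly positive, so a zero-truncated version of either distribution matches the data more closely.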

#### 3. Limitations of the Two-Step Approach

While conceptually straightforward, the two-step approach has several drawbacks:

*   **Propagation of Error:** Errors or uncertainties from the first step (detection probability) are carried over and can be compounded in the second step (abundance). This can lead to biased estimates in the abundance model.
*   **Ignoring Dependencies:** The two steps are treated as independent, but in reality, they are inherently linked. The factors influencing detection often also influence abundance, and separating them can lead to loss of information or incorrect inferences.
*   **Loss of Statistical Efficiency:** By analyzing the data in two separate stages, we might not be fully leveraging all available information simultaneously, leading to less efficient parameter estimates.
*   **Ignoring Zero Structure:** This approach doesn't explicitly model where the zeros come from. An effort with no sightings is simply a 0 in Step 1, with no distinction between zeros produced by true absence or non-detection and zeros that arise by chance from the count process.

#### 4. Evolution into a Single, Unified Hierarchical Model

To overcome these limitations, the two steps can be integrated into a single, unified hierarchical model, often referred to as a **Zero-Inflated** model (e.g., Zero-Inflated Poisson (ZIP) or Zero-Inflated Negative Binomial (ZINB)). These models explicitly account for the two processes simultaneously:

*   **Zero-Inflated Poisson (ZIP) / Zero-Inflated Negative Binomial (ZINB) Model:**
    These models assume that zeros can arise from two distinct processes:
    1.  **Structural Zeros (Non-detection/True Absence):** There were no dolphins to be seen, or they were present but not detected (modeled by a Bernoulli or logistic component).
    2.  **Sampling Zeros (Actual Count Process):** Dolphins could have been recorded during the effort, but the count process (Poisson or Negative Binomial distribution) happened to produce a zero.

    The model essentially has two sub-models running in parallel:
    *   **Detection/Zero-generating process:** This component (often a logistic regression) predicts the probability of an observation being a "structural zero" (i.e., no dolphins were detected, whether due to true absence or failure to detect). It uses `effort_df` variables like `visibility`, `effort_duration_hours`, and `location`.
    *   **Count process:** This component (a Poisson or Negative Binomial regression) models the expected count *if* the observation is not a structural zero. It uses `effort_df` variables like `location` and `effort_duration_hours`, and it is informed by all efforts, zeros included, because some of the zeros may be sampling zeros.

    The two components are combined in a single likelihood, so the parameters of both processes are estimated simultaneously.
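
To make the mixture structure concrete, here is a minimal sketch of the Zero-Inflated Poisson probability mass function and log-likelihood for a single pair of parameters. The names `pi_zero` (probability of a structural zero) and `lam` (Poisson mean), and the example counts, are illustrative assumptions. In a full model both parameters would depend on covariates through a logit link and a log link, respectively.

```python
import math

def zip_pmf(k, pi_zero, lam):
    """P(Y = k) under a Zero-Inflated Poisson with constant parameters."""
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    if k == 0:
        # A zero is either structural or a sampling zero from the Poisson part.
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson

def zip_log_likelihood(counts, pi_zero, lam):
    """Joint log-likelihood: both processes enter one objective."""
    return sum(math.log(zip_pmf(k, pi_zero, lam)) for k in counts)

# Hypothetical per-effort counts, including several zeros.
counts = [0, 0, 0, 3, 5, 0, 2, 4]
print(zip_pmf(0, 0.3, 2.0))
print(zip_log_likelihood(counts, 0.3, 2.0))
```

Maximizing this single log-likelihood over both parameters at once is what distinguishes the unified model from fitting the two steps separately.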

#### 5. Benefits of a Single Hierarchical Model

Using a single, unified hierarchical model (like ZIP or ZINB) offers significant advantages:

*   **Integrated Inference:** It provides a more coherent and robust statistical inference by modeling both detection and abundance within a single framework, accounting for their interdependence.
*   **More Robust Parameter Estimates:** By simultaneously estimating parameters for both the zero-generating and count processes, the model can provide less biased and more precise estimates.
*   **Statistical Efficiency:** It makes more efficient use of the data, leading to potentially stronger statistical power for detecting effects.
*   **Separating Structural from Sampling Zeros:** It distinguishes efforts where no dolphins could be recorded (true absence or non-detection) from efforts where the count process simply happened to yield zero. Separating true absence from non-detection within the structural zeros still requires repeated surveys, as in formal occupancy models.
*   **Realistic Representation of Ecological Processes:** This approach better reflects the ecological reality where detection is an integral part of observing abundance.

In summary, while a two-step approach can be a useful starting point, a unified hierarchical model like a Zero-Inflated Poisson or Negative Binomial provides a statistically more sound and ecologically more realistic approach for analyzing wildlife count data with excess zeros, such as our dolphin sightings data.