# Birds Biodiversity Technical report
LEIVA Martin (22205863), PEÑA CASTAÑO Javier (22203616), HERRERA NATIVI Vladimir (22205706)

It remains to be decided whether the final technical report will be written as a Markdown (.md) file or as a PDF.

---
NOTE : 30/10/2025

### Next steps to take into consideration

**Overview.md**\
First steps and motivation:
- Check schema consistency with `df.info()` and `df.head()`.
- Validate effort assumptions – confirm each transect has 10 points, look for sites with fewer than 10 completed visits, and note detection modality distributions.
- Create orientation summaries – species richness by site, observer workload, yearly totals, detection modality shares.
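
The checks listed above could look roughly like the following sketch. The column names (`TRANSECT_NAME`, `POINT_ID`, `YEAR`, `PASS`) and the toy table are assumptions made for illustration; they would need to be adapted to the actual cleaned observation table.

```python
import pandas as pd

# Toy stand-in for the cleaned observation table.
# All column names here are assumptions for illustration only.
obs = pd.DataFrame({
    "TRANSECT_NAME": ["A"] * 10 + ["B"] * 8,
    "POINT_ID": list(range(1, 11)) + list(range(1, 9)),
    "YEAR": [2019] * 18,
    "PASS": [1] * 18,
})

obs.info()         # dtypes and non-null counts per column
print(obs.head())  # first rows, to eyeball the schema

# Number of distinct points recorded for each transect
points_per_transect = obs.groupby("TRANSECT_NAME")["POINT_ID"].nunique()

# Transects that do not reach the expected 10 points
incomplete = points_per_transect[points_per_transect < 10]
print(incomplete)
```

In this toy table, transect `B` is flagged because only 8 of its points appear.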

**Assignment.pdf**\
Goal: to quantify how biodiversity indicators have evolved over time, with statistical uncertainty assessments, highlighting species-specific stories.

Main points to cover:
- Dataset Familiarisation and Descriptive Analysis: data structure, descriptive statistics
- Multi-Year Indicator Trends: select indicators, compute annual estimates for each, then quantify and interpret temporal trends (maybe linear models)
- Species-Level Evolution: choose a subset of birds and, for each species, analyse how its recorded presence has changed (with CIs) and the potential reasons

---
## Initial approach, data discovery 
First problems encountered: initial reading and separation of the raw data.

| Sheet name | Problem in raw data |
|------------|---------------------|
| `ESPECES`  | The data skipped the first two columns of the Excel sheet, had no headers, and contained a single NaN in the last column. |
| `GPS-MILIEU` | The data skipped the first two columns and contained no NaNs. The headers were misplaced because of a two-level header for the geographical coordinates, and the last 3 columns had no headers. |
| `NOM FRANÇAIS` | The original sheet had 26 columns, but the last one was "hidden" (its first values started a couple of thousand rows in). Headers were misplaced because of three-level and two-level headers in columns 13 to 25. |

To correct these problems we applied a two-step solution: first we dropped the redundant columns (in `ESPECES` and `GPS-MILIEU`), then we redefined the headers of the 3 sheets as listed below:

| Sheet name | New headers |
|------------|---------------------|
| `ESPECES`  | "ESPECIES_NAME", "LATIN_NAME", "NATURE" |
| `GPS-MILIEU` | "TRANSECT_NAME", "COORDINATE_X", "COORDINATE_Y", "HABITAT_TYPE", "TRANSECT_ID", "POINT_ID" |
| `NOM FRANÇAIS` | (Col 13 to 26) : "AL25", "VL25", "AL50", "VL50", "AL100", "VL100", "AG100", "VG100", "VOL", "TOT_A", "TOT_V_sV", "TOT_AV_sV", "TOT_AV_V", "COMPANIED" |

For `NOM FRANÇAIS`, only the headers of columns 13 to 25 were changed, following a pattern that we established: \
For columns 13 to 20 (distances de contact) we established a 3-part code: the first letter (A or V) stands for Auditif or Visuel, the second letter (L or G) stands for "<" or ">", and the number is the distance in meters.\
For columns 22 to 25 (totaux) a similar approach was used: the first part has one or two letters (A, V or AV) for "Auditif", "Visuel" or "A+V", and the second part takes one of two values (sV or V) for "sans Vol" or "avec Vol".

For this study, we defined the total number of birds per record as `TOT_AV_sV` (auditory + visual counts of non-flying birds). We excluded `TOT_AV_V` (= `TOT_AV_sV` + `VOL`) because the meaning of `VOL` is unclear: many rows had non-NaN values in it even though the `distances de contact` fields showed no detected birds, which we found inconsistent.
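
The inconsistency described above can be checked with a short sketch like the one below. It uses the renamed distance columns and a `VOL` column as listed in the header table; the toy rows are invented, and treating an empty distance band as either NaN or 0 is an assumption about how the sheet encodes "no detection".

```python
import numpy as np
import pandas as pd

# Distance-band columns, named as in the header table above
DIST_COLS = ["AL25", "VL25", "AL50", "VL50", "AL100", "VL100", "AG100", "VG100"]

# Toy rows for illustration; real data would come from the cleaned sheet.
df = pd.DataFrame(0.0, index=range(3), columns=DIST_COLS)
df["VOL"] = [np.nan, 2.0, 1.0]
df.loc[2, "AL25"] = 3.0  # row 2 has a genuine distance-band detection

# Assumption: "no detection" may be stored as NaN or as 0.
no_contact = df[DIST_COLS].fillna(0).sum(axis=1) == 0

# Rows with a VOL value but no bird in any distance band
suspicious = no_contact & df["VOL"].notna()
print(df[suspicious])
```

Counting `suspicious.sum()` on the real table would give the number of rows behind the decision to rely on `TOT_AV_sV`.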

**These new sheets were saved in `data/cleaned`.**

---

# Storytelling

## 1. Dataset Familiarisation and Descriptive Analysis 

- **Cleaning**: describe all the inconsistencies and initial problems, and how we handle them in `utils.py`
- **Familiarisation**: quick overview of the initial statistical analyses (one for each of the 3 initial tabs), with first key insights

Here we can explain the dropped columns and the new header names, ...

## 2. Multi-Year Indicator Trends
Here we should explain in detail the choice of each indicator, how we processed it, and the conclusions.

### Density Study - Summary of Computations

#### 1. Definition
The density indicator measures the relative abundance of birds observed per transect and year.

#### 2. Computation Steps

**(a) Counting per Transect and Year**  
From the cleaned observation dataset (`nom_francais_clean`), total bird counts were aggregated by `(year, transect)` using columns such as `TOT_A`, `TOT_V_sV`, etc.

**(b) Normalization**  
Each transect’s annual count was normalized by the maximum count observed across all years:

$$ \text{density\_norm}_{i,t} = \frac{\text{count}_{i,t}}{\max(\text{count}_{\text{all years}})} $$
Densities are thus scaled to the range [0, 1].
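
Steps (a) and (b) can be sketched as follows. This is a minimal illustration, not the report's own code: the `YEAR` and `TRANSECT_NAME` column names are assumptions, `TOT_AV_sV` is used as the count column, and the maximum is taken over all transect-year cells, which is one possible reading of the formula above.

```python
import pandas as pd

# Toy observation records; column names other than TOT_AV_sV are assumed.
obs = pd.DataFrame({
    "YEAR": [2018, 2018, 2018, 2019, 2019],
    "TRANSECT_NAME": ["A", "A", "B", "A", "B"],
    "TOT_AV_sV": [4, 6, 5, 20, 10],
})

# (a) total count per (year, transect)
counts = (
    obs.groupby(["YEAR", "TRANSECT_NAME"], as_index=False)["TOT_AV_sV"].sum()
       .rename(columns={"TOT_AV_sV": "count"})
)

# (b) divide by the largest transect-year count (assumed global maximum)
counts["density_norm"] = counts["count"] / counts["count"].max()
print(counts)
```

If the intended normalization is per transect instead, the denominator would become `counts.groupby("TRANSECT_NAME")["count"].transform("max")`.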

#### 3. Bootstrap Estimation

A bootstrap resampling method was used to estimate uncertainty:

1. For each year, resample transects with replacement $B$ times (e.g. $B = 1000$).
2. Compute the mean normalized density for each resample.
3. Obtain 95% confidence intervals from the empirical quantiles of the bootstrap distribution.

$$ \text{CI}_{95\%} = [\hat{\theta}^*_{2.5\%}, \hat{\theta}^*_{97.5\%}] $$
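
The three bootstrap steps can be sketched with NumPy as below. The function name `bootstrap_mean_ci`, the fixed seed and the example densities are all hypothetical; the sketch shows the percentile bootstrap for a single year.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-transect values (one year)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: resample transects with replacement, n_boot times
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    # Step 2: mean normalized density of each resample
    boot_means = values[idx].mean(axis=1)
    # Step 3: empirical 2.5% and 97.5% quantiles (for alpha = 0.05)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lower, upper

# Invented normalized densities for six transects in one year
densities_one_year = [0.12, 0.30, 0.25, 0.08, 0.41, 0.19]
mean, lo, hi = bootstrap_mean_ci(densities_one_year)
print(f"mean={mean:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

Applying the function to each year's transect densities gives the yearly means and interval bounds to plot.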

#### 4. Derived Indicators and Visualization

- Annual mean normalized density computed and plotted over time.  
- Per-year density distribution visualized by transect (color-coded).  
- Temporal evolution of normalized densities visualized across transects.  
- Bootstrap mean and confidence intervals plotted for density trends.
  
#### Interpretation

The density analysis provides a standardized way to compare bird abundance across years by accounting for variation in sampling effort. The initial density plots reveal clear differences in abundance between transects, reflecting underlying environmental heterogeneity and habitat-specific suitability. When examining annual mean densities, we observe temporal fluctuations that suggest shifts in overall bird activity or detectability from year to year.

The bootstrap procedure adds statistical rigor by quantifying uncertainty around the estimated yearly densities. The resulting confidence intervals show how stable or variable these density estimates are over time. Narrow intervals indicate consistent sampling responses across transects, while wider intervals suggest greater spatial or temporal variability in bird presence.

Since 2018, the transect **Fort de France Centre Ville** consistently emerges as the most densely sampled and bird-rich zone, showing the highest normalized densities in the period 2018–2025. This pattern likely reflects a combination of favorable urban microhabitats, consistent observer effort, and possibly higher detectability of certain species in this area. The strong density signature of this transect provides a valuable focal point for investigating urban bird community dynamics within the broader regional context.

The fitted regression trends allow us to determine whether density changes represent meaningful ecological patterns rather than random fluctuation. Where the regression slope is statistically significant, we can infer a directional trend—either increasing or declining density—over the study period. In contrast, non-significant slopes suggest that densities remain broadly stable, with year-to-year variation falling within the expected range of sampling variation.

Taken together, the density plots, bootstrap confidence intervals, and regression analysis provide complementary evidence that allows us to evaluate not only the magnitude of density changes but also their reliability and ecological relevance. This integrated approach helps distinguish true shifts in bird community activity such as the consistently high urban density observed at **Fort de France Centre Ville** from random sampling fluctuation.


### Species Diversity Study

### Detectability Study (auditory V.S. visual)

#### Introduction

For this sampling indicator, we wanted to see what share of the detections were visual or auditory each year, restricting the analysis to the five most visited transects (for more reliable data).

To analyse if this indicator was informative, we first plotted some initial histograms, where we observed that for some transects there seemed to be an increase in the ratio of auditory observations. In order to confirm this trend, we decided to further research this topic.

To count the needed values for this, we used the cleaned observation dataset `nom_francais_clean`. 

#### Computations

Total auditory and visual bird counts were aggregated by `(year, transect)` by summing column `TOT_AV_sV`, while the individual auditory and visual totals were extracted from columns `TOT_A` and `TOT_V_sV` respectively.

The auditory share, expressed as a percentage, is computed as:
$$
\widehat{p}_{y,t}=\frac{A_{y,t}}{AV_{y,t}},\qquad \text{Auditory (in \%)}=100\times \widehat{p}_{y,t}
$$
where $A_{y,t}=\text{auditory detections}$, $V_{y, t} = \text{visual detections}$ and $AV_{y,t} = A_{y,t} + V_{y,t}$ for each year $y$ at transect $t$.

To compute the confidence intervals for the observed shares, we used the Wilson method, because classic "Wald" confidence intervals perform poorly with small $n$ or extreme $p$ (the bounds can leave $[0, 1]$).
Wilson score intervals have better coverage and stay in $[0, 1]$, which matters because, if we increase the number of transects, some $\text{year} \times \text{transect}$ cells might have modest totals.

For the $95\%$ CI, let $z = 1.96$, $n = AV_{y,t}$ and $\widehat{p} = \frac{A_{y,t}}{AV_{y,t}}$. The interval is:

$$
\text{CI}_{\text{Wilson}} = \frac{\widehat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\widehat{p}(1 - \widehat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}
$$

In the code, this is calculated by the imported function `statsmodels.stats.proportion.proportion_confint(method="wilson")`, then we multiply it by 100 to get the percentage.
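
To make the formula concrete, here is a hand-rolled version of the Wilson interval in plain Python. It is only a sketch for checking the arithmetic and does not replace the `statsmodels` call used in the analysis. The function name `wilson_interval` and the example counts are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clip tiny floating-point overshoots so the bounds stay in [0, 1]
    return max(0.0, centre - half), min(1.0, centre + half)

# Invented example: 50 auditory detections out of 100 in one (year, transect) cell
low, high = wilson_interval(50, 100)
print(f"auditory share CI: {100 * low:.1f}% to {100 * high:.1f}%")
```

Unlike the Wald interval, the bounds remain inside $[0, 1]$ even when the observed share is 0 or 1.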

Then, to interpret the slope and temporal trend, we used a **Binomial Generalized Linear Model**. This model is useful for this study, because the share of auditory detections is a proportion (each value comes from a count of auditory detections out of a total number of detections), so it follows a binomial distribution. Also, a GLM with a binomial family and a logit link models how the probability of an auditory detection changes with time (year), while taking into account that the years with more detections provide more reliable estimates. 

$$
\text{logit}(\text{Pr}[\text{auditory} \mid y,t]) = \beta_0 + \beta_1 (y - \bar{y})
$$

With:
1. $\beta_0$: intercept
2. $\beta_1$: slope on year
3. $(y-\bar{y})$: centered year, where $\bar{y}$ is the mean year

To compute the p-value $p$ for each transect, we apply a Wald test to the slope:

$$
z = \frac{\hat \beta_1}{\text{SE}(\hat \beta_1)},\qquad p=2\,(1 - \Phi(|z|))
$$

where $\Phi(\cdot)$ is the standard normal CDF.
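
The model and test above can be sketched without any modelling library, by fitting the binomial GLM with iteratively reweighted least squares (IRLS) in NumPy. This is an illustrative substitute, not the code used in the analysis: the function name `binomial_trend` and the counts are invented, and the sketch assumes no fitted probability reaches exactly 0 or 1.

```python
import math
import numpy as np

def binomial_trend(years, auditory, total, n_iter=25):
    """Fit logit(p) = b0 + b1 * (year - mean year); return (b1, SE, z, p-value)."""
    years = np.asarray(years, dtype=float)
    k = np.asarray(auditory, dtype=float)   # auditory detections per year
    n = np.asarray(total, dtype=float)      # all detections per year
    X = np.column_stack([np.ones_like(years), years - years.mean()])
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = n * mu * (1.0 - mu)                           # IRLS weights
        z_work = eta + (k / n - mu) / (mu * (1.0 - mu))   # working response
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z_work))
    # Standard errors from the inverse information matrix at the final fit
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = n * mu * (1.0 - mu)
    cov = np.linalg.inv(X.T @ (w[:, None] * X))
    se = math.sqrt(cov[1, 1])
    z = beta[1] / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return beta[1], se, z, p_value

# Invented counts for one transect: auditory share rising from 20% to 80%
slope, se, z, p_value = binomial_trend(
    [2014, 2015, 2016, 2017], [20, 40, 60, 80], [100, 100, 100, 100]
)
print(f"slope={slope:.3f}, SE={se:.3f}, z={z:.2f}, p={p_value:.2e}")
```

Years with more detections get larger weights `w`, which is how the model gives more influence to better-sampled years.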

#### Conclusion

After plotting these figures with their corresponding CIs, p-values and slopes, we observe the following:
1. Morne Babet: $p = 1.32 \times 10^{-6} \implies$ the p-value is much smaller than $0.05$. The trend is highly significant, with a strong and steady increase in the proportion of auditory detections over time.
2. Habitation Petit Rivière: $p = 1.8 \times 10^{-9} \implies$ same interpretation as for Morne Babet.
3. Moulin à Vent: $p = 0.000369 \implies$ the p-value is smaller than $0.05$. The trend is significant and, although there is more variability across years, the fitted line still shows a positive slope.
4. Là-Haut: $p = 0.723 \implies$ the p-value is much larger than $0.05$. The trend is not significant and the percentage of auditory observations remains roughly constant over time.
5. Hôtel des Plaisirs: $p = 0.259 \implies$ the p-value is larger than $0.05$. Although the slope is slightly positive, the uncertainty is large, so we cannot conclude that the auditory share increased.

Overall, the conclusions differ across the five most visited transects: Morne Babet, Habitation Petit Rivière and Moulin à Vent show a proportion of auditory detections that rises significantly over time, whereas Là-Haut and Hôtel des Plaisirs show no statistically detectable trend. The significant cases show consistent positive slopes, supported by narrow CIs and GLM p-values well below $0.05$, which indicates that the increase is unlikely to be due to chance. The two non-significant transects show wider uncertainty and flat fitted lines, suggesting a stable percentage of auditory observations. Taken together, these findings imply that changes in detectability are not uniform across space.

### Spatial Coverage Study (Not interesting)

#### Summary

Initially, we analysed the spatial coverage, in order to see if it was an informative sampling indicator. We used the cleaned observation dataset `nom_francais_clean` and grouped for each year, the following attributes:
1. The distinct transects that were surveyed
2. The total point-visits that were recorded (row count)

Afterwards, we defined a new column `combo`, which contains each unique combination of (transect, point, n° pass) in order to count how many of such combos were actually observed, and divide them by a theoretical maximum: number of transects that year $\times$ 10 points $\times$ 2 passes (given maximal passes). This yields the `combo_coverage` metric per year.
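
The `combo_coverage` computation described above can be sketched as follows. The column names (`YEAR`, `TRANSECT_NAME`, `POINT_ID`, `PASS`) are assumptions, and the toy data uses 2 points per transect instead of 10 to stay small.

```python
import pandas as pd

# Toy records; column names are assumed for illustration.
obs = pd.DataFrame({
    "YEAR":          [2019] * 5 + [2020] * 4,
    "TRANSECT_NAME": ["A", "A", "A", "B", "B", "A", "A", "A", "A"],
    "POINT_ID":      [1, 1, 2, 1, 1, 1, 1, 2, 2],
    "PASS":          [1, 2, 1, 1, 1, 1, 2, 1, 1],
})
POINTS_PER_TRANSECT = 2  # 10 in the report; 2 here to keep the toy data small
MAX_PASSES = 2

# One row per unique (transect, point, pass) combination within each year
combos = obs.drop_duplicates(["YEAR", "TRANSECT_NAME", "POINT_ID", "PASS"])

per_year = combos.groupby("YEAR").agg(
    n_transects=("TRANSECT_NAME", "nunique"),
    n_combos=("PASS", "size"),
)
# Observed combos divided by the theoretical maximum for that year
per_year["combo_coverage"] = per_year["n_combos"] / (
    per_year["n_transects"] * POINTS_PER_TRANSECT * MAX_PASSES
)
print(per_year)
```

Note that the denominator only counts transects visited that year, so a transect that was skipped entirely does not lower the ratio.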

In the end, we have three different figures:
1. `Transects sampled per year`: how many distinct transects were visited each year.
2. `Point-pass coverage ratio per year`: the share of possible (point, pass) combos that were actually observed, aggregated at the year level.
3. `Coverage heatmap`: for the top 15 transects, the counts of unique (point, pass) combos per year, showing which transects were thoroughly sampled and when.

#### Analysis

Let's analyse each of those figures:

1) **Transects sampled per year** : 
Early years show fewer transects visited (around 41 in 2014), then a ramp-up through 2018 (around 65) and a rather stable plateau afterward (mostly between 62–65). So spatial coverage by number of transects expanded fast and then stabilized.

2) **Point–pass coverage ratio per year** :
Although the line looks dramatic at first, the y-axis range is very tight: from around 0.91 to 1.00. The only clear drop is between 2019 and 2020; after that, the coverage rebounds to around 0.99 and remains high. This graph shows near-complete coverage in most years, with 2020 as the outlier.

3) **Coverage heatmap (unique point×pass combos by transect–year, top-15 transects)**:
In the first few years, we can see some gaps (Borelie, Boucle du Vauclin) and then some isolated holes (Hôtel des Plaisirs around 2018–2019). Besides those, cells are almost always at the maximum (around 20 combos), which indicates full point×pass coverage after 2016.

Overall, with the three figures, we can observe that there is a big increase in spatial coverage in the first years, followed by a stable coverage from 2018 onward. This sampling indicator looks strong and steady after the initial ramp-up, so besides the 2020 dip and a few early gaps, we decided that there wasn't much added value in further analyzing this indicator.

## 3. Species-Level Evolution

We should explain the choice of our subset (top 5 most observed birds), and then the choice of indicators to study evolution. Here we should detail our methods for bootstrapping and choice of indexes 

### Abundance Evolution Study 

We use mean abundance instead of total number counted per year because:

 - Sampling effort changes across years (number of transects visited, observers present, duration, weather conditions, etc.).
 - The total number of birds counted is therefore not directly comparable from year to year.

But the mean abundance standardizes for effort, making trends fair and comparable.
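
A small invented example illustrates the argument. The `SPECIES` and `YEAR` column names are hypothetical, and the sketch assumes that "mean abundance" means the mean of `TOT_AV_sV` per record.

```python
import pandas as pd

# Invented records: the same typical count per record,
# but twice as many records (more effort) in 2020.
obs = pd.DataFrame({
    "YEAR": [2015, 2015, 2020, 2020, 2020, 2020],
    "SPECIES": ["Sucrier à ventre jaune"] * 6,
    "TOT_AV_sV": [3, 5, 3, 5, 4, 4],
})

yearly = obs.groupby(["SPECIES", "YEAR"])["TOT_AV_sV"].agg(total="sum", mean="mean")
print(yearly)
```

Here the yearly total doubles purely because more records were collected, while the mean per record stays the same.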

#### Trend Significance per Species

For all these species except "Quiscale merle", the slope is positive, so the fitted abundance increases over time, but the trend is statistically significant (p-value < 0.05) only for "Elénie siffleuse" and "Tourterelle à queue carrée". "Quiscale merle" has a negative slope, which could suggest that its abundance is decreasing over time, but the p-value is quite high, so the evidence for a decline is weak.

#### Interpretation of Species-Specific Abundance Trends

The figures above display the temporal evolution of abundance for the five most frequently recorded bird species in the dataset, using the standardized mean abundance (TOT_AV_sV) per year and associated uncertainty estimated via bootstrap resampling. For **Élenie siffleuse**, the annual mean abundance shows a generally increasing pattern over the study period, and the fitted linear trend is positive with confidence intervals that do not overlap heavily with zero, indicating a statistically supported rise in abundance. A similar positive and statistically significant trend is observed for **Tourterelle à queue carrée**, where both the slope and the bootstrap confidence intervals suggest a sustained increase in occurrence intensity over time. 

In contrast, **Quiscale merle** shows a declining fitted trend line, but the year-to-year variability is relatively high and the confidence intervals are broader, resulting in a non-significant trend. This suggests that although the species may be experiencing a reduction in recorded abundance, further data or more controlled sampling would be needed to confidently confirm this decline. For **Sporophile rougegorge** and **Sucrier à ventre jaune**, the mean abundance values fluctuate from year to year without displaying a marked directional change. Their fitted trends are near-flat and associated confidence intervals are wide, indicating that these populations have remained relatively stable across the studied period.

Overall, these results highlight two species experiencing significant increases (Élenie siffleuse and Tourterelle à queue carrée), one species with a possible but unconfirmed decline (Quiscale merle), and two species showing stable abundance levels with no evidence of directional long-term change (Sporophile rougegorge and Sucrier à ventre jaune).

#### Limitations

The mean-abundance and bootstrap approach provides a robust, effort-standardized indicator of species presence over time. However, this method does not explicitly correct for variation in detection probability (observer effects, weather, time of day), and it assumes independence between sampling events, which may not always hold. Additionally, mean abundance reflects relative observation rates rather than absolute population sizes: it measures how often a species is recorded, not how many individuals exist in the ecosystem. The linear trend model may also fail to capture non-linear ecological dynamics. Therefore, while the approach reliably identifies broad directional changes, results should be interpreted as population indices rather than direct population estimates.

### Per-Transect Detection Rate Evolution Study 


### Ecological Interpretation of Observed Abundance Trends

The differing abundance trajectories among the five focal bird species suggest that shifts in land use, habitat structure, and species-specific ecological traits have influenced population dynamics over the study period.

The increasing trends observed for **Élenie siffleuse** and **Tourterelle à queue carrée** likely reflect their status as generalist species with flexible foraging strategies and broad habitat tolerance. Both can exploit semi-open and human-modified environments, such as gardens, secondary vegetation, or agricultural mosaics. As such landscapes have expanded or become more prevalent, these species may have gained additional foraging and nesting opportunities, resulting in sustained increases in their recorded abundance. Their positive slopes and strong confidence intervals suggest that these increases represent robust ecological responses, rather than sampling variability.

In contrast, the decline suggested for **Quiscale merle**, though not statistically significant, may indicate greater ecological sensitivity. This species displays more territorial behavior and may rely on more specific foraging or social conditions. Declining abundance could arise from competition with expanding generalist species, altered food availability, or localized habitat degradation. However, the wide confidence intervals and year-to-year abundance variability suggest that more data are needed before confirming this as a true long-term population decline. Continued monitoring is therefore essential.

For **Sporophile rougegorge** and **Sucrier à ventre jaune**, the absence of a consistent upward or downward trend indicates population stability. These species are also relatively common but appear to maintain stable ecological niches across habitats. Their feeding strategies (granivory for Sporophile rougegorge, nectarivory/insectivory for Sucrier à ventre jaune) allow them to utilize widely available and renewable resources, which may buffer their populations against environmental change. Their near-flat trend lines and broad confidence intervals support this interpretation of dynamic equilibrium rather than directional shifts.

Overall, the observed patterns align with a well-established ecological principle:

Species with greater ecological flexibility tend to persist or increase under landscape change, whereas species with narrower ecological requirements or territorial constraints may be more vulnerable to decline.

Finally, while the bootstrap confidence intervals account for sampling uncertainty, additional factors such as observer variability, detection probability, and habitat-specific visibility may also influence recorded abundance. Continued standardized monitoring will improve the precision of future analyses and help clarify emerging trends.



## 4. Synthesis 

Conclusions

---

