# Birds Biodiversity Technical report
LEIVA Martin (22205863), PEÑA CASTAÑO Javier (22203616), HERRERA NATIVI Vladimir (22205706)

It is still to be decided whether the final technical report will be written as an .md file or a PDF.

---
NOTE : 30/10/2025

### Next steps to take into consideration

**Overview.md**\
Motivations for the first steps:
- Check schema consistency with `df.info()` and `df.head()`.
- Validate effort assumptions: confirm that each transect has 10 points, look for sites with fewer than 10 completed visits, and note the distribution of detection modalities.
- Create orientation summaries: species richness by site, observer workload, yearly totals, and detection modality shares.
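
The checks listed above could be sketched as follows. This is a minimal illustration on toy data, not the project's code: the column names (`year`, `transect`, `point`, `species`) and both helper functions are hypothetical, and richness is computed per transect as a stand-in for "site".

```python
import pandas as pd

# Hypothetical toy observations: one row per record at a (year, transect, point).
obs = pd.DataFrame({
    "year":     [2020, 2020, 2020, 2021, 2021],
    "transect": ["T1", "T1", "T2", "T1", "T2"],
    "point":    [1, 2, 1, 1, 1],
    "species":  ["A", "B", "A", "A", "C"],
})

EXPECTED_POINTS = 2  # toy value; the real protocol expects 10 points per transect


def incomplete_transects(df, expected_points):
    """Return the (year, transect) pairs with fewer distinct points than expected."""
    points = df.groupby(["year", "transect"])["point"].nunique()
    return points[points < expected_points]


def richness_by_transect(df):
    """Number of distinct species recorded on each transect."""
    return df.groupby("transect")["species"].nunique()


# Schema check: column names, dtypes and non-null counts, then the first rows.
obs.info()
print(obs.head())

missing = incomplete_transects(obs, EXPECTED_POINTS)
richness = richness_by_transect(obs)
print(missing)
print(richness)
```

On the real data, `EXPECTED_POINTS` would be set to 10 and any pair returned by `incomplete_transects` would be flagged for review.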

**Assignment.pdf**\
Goal: to quantify how biodiversity indicators have evolved over time, with statistical uncertainty assessments, highlighting species-specific stories.

Main points to address:
- Dataset Familiarisation and Descriptive Analysis: data structure, descriptive statistics
- Multi-Year Indicator Trends: select indicators, compute annual estimates for each, then quantify and interpret temporal trends (possibly with linear models)
- Species-Level Evolution: choose a subset of birds and, for each species, analyse how its recorded presence has changed (with confidence intervals) and consider potential reasons

---
## Initial approach, data discovery
First problems encountered: initial reading and separation of the raw data.

| Sheet name | Problem in raw data |
|------------|---------------------|
| `ESPECES`  | The raw data skipped the first two columns of the Excel sheet, had no headers, and contained a single NaN in the last column. |
| `GPS-MILIEU` | The raw data skipped the first two columns and contained no NaNs. The headers were misplaced because of a two-level header for the geographical coordinates, and the last 3 columns had no headers. |
| `NOM FRANÇAIS` | The original sheet had 26 columns, but the last one was "hidden" (its first values started a couple of thousand rows in). Headers were misplaced because of two- and three-level headers over columns 13 to 25. |

To correct these problems we applied a two-step solution: first we dropped the redundant columns (in `ESPECES` and `GPS-MILIEU`), then we redefined the headers of the 3 sheets as listed below:

| Sheet name | New headers |
|------------|-------------|
| `ESPECES`  | "ESPECIES_NAME", "LATIN_NAME", "NATURE" |
| `GPS-MILIEU` | "TRANSECT_NAME", "COORDINATE_X", "COORDINATE_Y", "HABITAT_TYPE", "TRANSECT_ID", "POINT_ID" |
| `NOM FRANÇAIS` | (Col 13 to 26): "AL25", "VL25", "AL50", "VL50", "AL100", "VL100", "AG100", "VG100", "VOL", "TOT_A", "TOT_V_sV", "TOT_AV_sV", "TOT_AV_V", "COMPANIED" |
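
As an illustration of the two-step fix (drop the redundant columns, then assign the new headers), here is a minimal sketch for the `ESPECES` sheet. It is not the code in `utils.py`: the function name `clean_sheet`, the toy rows, and the assumption that the sheet was read without headers are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw ESPECES sheet read without headers:
# two redundant leading columns, then three unnamed data columns.
# All cell values are placeholders, not real data.
raw = pd.DataFrame([
    [np.nan, np.nan, "species_1", "latin_1", "x"],
    [np.nan, np.nan, "species_2", "latin_2", np.nan],
])

ESPECES_HEADERS = ["ESPECIES_NAME", "LATIN_NAME", "NATURE"]


def clean_sheet(raw_df, headers, n_leading=2):
    """Drop the leading redundant columns, then assign the new headers."""
    cleaned = raw_df.iloc[:, n_leading:].copy()
    if cleaned.shape[1] != len(headers):
        raise ValueError("header count does not match column count")
    cleaned.columns = headers
    return cleaned.reset_index(drop=True)


especes = clean_sheet(raw, ESPECES_HEADERS)
print(especes)
```

The explicit length check is a design choice: it fails loudly if a sheet ever has a different number of columns than the header list expects.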

For `NOM FRANÇAIS`, only the headers of columns 13 to 25 were changed, following a pattern that we established:\
For columns 13 to 20 (*distances de contact*, contact distances) we established a three-part code: the first letter (A or V) stands for "Auditif" (auditory) or "Visuel" (visual), the second letter (L or G) stands for "<" or ">", and the number is the distance in meters.\
For columns 22 to 25 (*totaux*, totals) a similar approach was used: the first part has one or two letters (A, V or AV) for "Auditif", "Visuel" or "A+V", and the second part takes one of two values (sV or V) for "sans Vol" (without flight) or "avec Vol" (with flight).
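
To make the naming scheme concrete, the sketch below decodes a header back into a readable description. The function `describe_header` and the English labels are illustrative assumptions, not part of the project code.

```python
import re

MODALITY = {"A": "auditory", "V": "visual", "AV": "auditory + visual"}
COMPARATOR = {"L": "<", "G": ">"}
FLIGHT = {"sV": "without flight", "V": "with flight"}


def describe_header(name):
    """Translate a cleaned NOM FRANÇAIS header into a readable description."""
    # Contact-distance columns: modality letter, comparator letter, distance.
    distance = re.fullmatch(r"([AV])([LG])(\d+)", name)
    if distance:
        modality, comparator, metres = distance.groups()
        return f"{MODALITY[modality]}, {COMPARATOR[comparator]} {metres} m"
    # Total columns: TOT_, modality, then an optional flight suffix.
    total = re.fullmatch(r"TOT_(AV|A|V)(?:_(sV|V))?", name)
    if total:
        modality, flight = total.groups()
        label = f"total {MODALITY[modality]}"
        return f"{label}, {FLIGHT[flight]}" if flight else label
    # Headers outside the scheme (e.g. VOL) are returned unchanged.
    return name


for header in ["AL25", "VG100", "TOT_A", "TOT_V_sV", "TOT_AV_V", "VOL"]:
    print(header, "->", describe_header(header))
```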

**These new sheets were saved in `data/cleaned`.**

---

# Storytelling

## 1. Dataset Familiarisation and Descriptive Analysis 

- **Cleaning**: Describe all the inconsistencies and initial problems, and how we handle them in `utils.py`
- **Familiarisation**: Give a quick overview of the initial statistical analyses (one for each of the 3 initial tabs) and the first key insights

Here we can explain the dropped columns, the new header names, etc.

## 2. Multi-Year Indicator Trends
Here we should explain in detail which indicators we chose, how we computed them, and what we concluded.

### Density Study - Summary of Computations

#### 1. Definition
The density indicator measures the relative abundance of birds observed per transect and year.

#### 2. Computation Steps

**(a) Counting per Transect and Year**  
From the cleaned observation dataset (`nom_francais_clean`), total bird counts were aggregated by `(year, transect)` using columns such as `TOT_A`, `TOT_V_sV`, etc.

**(b) Normalization**  
Each transect’s annual count was normalized by the maximum count observed across all years:

$$ \text{density\_norm}_{i,t} = \frac{\text{count}_{i,t}}{\max(\text{count}_{\text{all years}})} $$

where $i$ indexes transects and $t$ indexes years. Densities are thus scaled to the range [0, 1].
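
Steps (a) and (b) could be sketched as below. This is an illustration under stated assumptions, not the project's implementation: the column names `year` and `transect`, the choice of `TOT_AV_V` as the count column, and the reading of the formula as "divide by the largest transect-year count in the whole dataset" are all assumptions.

```python
import pandas as pd

# Hypothetical toy records; the real table is nom_francais_clean.
records = pd.DataFrame({
    "year":     [2020, 2020, 2020, 2021, 2021],
    "transect": ["T1", "T1", "T2", "T1", "T2"],
    "TOT_AV_V": [3, 5, 4, 2, 10],
})


def normalized_density(df, count_col="TOT_AV_V"):
    """Sum counts per (year, transect) and divide by the largest such sum."""
    counts = (
        df.groupby(["year", "transect"], as_index=False)[count_col]
          .sum()
          .rename(columns={count_col: "count"})
    )
    # Assumption: a single global maximum over all transect-year pairs.
    counts["density_norm"] = counts["count"] / counts["count"].max()
    return counts


density = normalized_density(records)
print(density)
```

If the intended denominator is instead each transect's own maximum over the years, the division would use `counts.groupby("transect")["count"].transform("max")`.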

#### 3. Bootstrap Estimation

A bootstrap resampling method was used to estimate uncertainty:

1. For each year, resample transects with replacement $B$ times (e.g. $B = 1000$).
2. Compute the mean normalized density for each resample.
3. Obtain 95% confidence intervals from the empirical quantiles of the bootstrap distribution.

$$ \text{CI}_{95\%} = [\hat{\theta}^*_{2.5\%}, \hat{\theta}^*_{97.5\%}] $$
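
The three bootstrap steps can be sketched with NumPy as follows. The function name `bootstrap_mean_ci`, the fixed seed, and the toy densities are assumptions made for illustration.

```python
import numpy as np


def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of one year's transect densities."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row of idx is one resample of the transects, drawn with replacement.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # Empirical 2.5% and 97.5% quantiles of the bootstrap means.
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lower, upper


# Hypothetical normalized densities for the transects surveyed in one year.
year_densities = [0.20, 0.35, 0.40, 0.55, 0.80]
mean, lower, upper = bootstrap_mean_ci(year_densities)
print(f"mean={mean:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

Running the function once per year gives the series of means and intervals needed for the trend plot.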

#### 4. Derived Indicators and Visualization

- Annual mean normalized density computed and plotted over time.  
- Per-year density distribution visualized by transect (color-coded).  
- Temporal evolution of normalized densities visualized across transects.  
- Bootstrap mean and confidence intervals plotted for density trends.
  
#### 5. Interpretation

The density analysis provides a standardized way to compare bird abundance across
years by accounting for variation in sampling effort. The initial density plots
show clear differences in abundance between transects, reflecting environmental
heterogeneity and habitat-specific suitability. When examining annual mean
densities, we observe temporal fluctuations that suggest shifts in overall bird
activity or detectability from year to year.

The bootstrap procedure adds statistical rigor by quantifying uncertainty around
the estimated yearly densities. The resulting confidence intervals show how
stable or variable these density estimates are over time. Narrow intervals
indicate consistent sampling responses across transects, while wider intervals
suggest greater spatial or temporal variability in bird presence.

The fitted regression trends allow us to determine whether density changes
represent meaningful ecological patterns rather than random fluctuation. Where
the regression slope is statistically significant, we can infer a directional
trend—either increasing or declining density—over the study period. In contrast,
non-significant slopes suggest that densities remain broadly stable, with
year-to-year variation falling within the expected range of sampling variation.
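
One way to test whether a slope differs from zero is sketched below. The report does not state how its p-values were obtained, so the permutation test used here is a substitute chosen to keep the example NumPy-only, not necessarily the method behind the reported results. The function name and the toy series are hypothetical.

```python
import numpy as np


def slope_permutation_test(years, values, n_perm=2000, seed=0):
    """OLS slope of values on years, with a two-sided permutation p-value."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    x = years - years.mean()  # centering leaves the slope unchanged
    slope = np.polyfit(x, values, 1)[0]
    rng = np.random.default_rng(seed)
    # Shuffling the values breaks any link with time: this is the null distribution.
    null_slopes = np.array([
        np.polyfit(x, rng.permutation(values), 1)[0] for _ in range(n_perm)
    ])
    # Add-one correction keeps the p-value strictly positive.
    p_value = (np.sum(np.abs(null_slopes) >= abs(slope)) + 1) / (n_perm + 1)
    return slope, p_value


# Hypothetical annual mean densities with a steady upward drift.
years = np.arange(2014, 2024)
annual_density = np.array([0.30, 0.33, 0.31, 0.36, 0.38, 0.37, 0.41, 0.43, 0.42, 0.46])
slope, p_value = slope_permutation_test(years, annual_density)
print(f"slope={slope:.4f} per year, p={p_value:.4f}")
```

A small p-value means that few shuffled series produce a slope as steep as the observed one, which corresponds to the "statistically significant slope" case described above.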

Taken together, the density plots, bootstrap confidence intervals, and regression
analysis provide complementary evidence that allows us to evaluate not only the
magnitude of density changes but also their reliability and ecological relevance.
This integrated approach helps distinguish true shifts in bird community
activity from random sampling fluctuation.

### Species Diversity Study

### Detectability Study (auditory vs. visual)

### Spatial Coverage Study (explain why it is not interesting)

## 3. Species-Level Evolution

We should explain the choice of our subset (the top 5 most observed birds), and then the choice of indicators used to study their evolution. Here we should detail our bootstrapping methods and our choice of indices.

### Abundance Evolution Study 

We use mean abundance instead of total number counted per year because:

 - Sampling effort changes across years (number of transects visited, observers present, duration, weather conditions, etc.).
 - The total number of birds counted is therefore not directly comparable from year to year.

Mean abundance, by contrast, standardizes for effort, which makes trends fair and comparable across years.
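
A small toy example shows why the total is misleading when effort changes. The data below are invented for illustration: the second year has twice as many records with the same per-record counts.

```python
import pandas as pd

# Hypothetical counts of one species: 2021 has twice the sampling effort of 2020.
counts = pd.DataFrame({
    "year":     [2020, 2020, 2021, 2021, 2021, 2021],
    "TOT_AV_V": [2, 4, 2, 4, 2, 4],
})

# Total, mean and number of records per year, side by side.
yearly = counts.groupby("year")["TOT_AV_V"].agg(
    total="sum", mean="mean", n_records="count"
)
print(yearly)
```

Here the total doubles between the two years purely because effort doubled, while the mean stays at 3.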

#### Trend Significance per Species

For all of these species except "Quiscale merle", the slope is > 0, so the estimated abundance increases over time. However, the trend is statistically significant (p-value < 0.05) only for "Élénie siffleuse" and "Tourterelle à queue carrée". "Quiscale merle" has a negative slope, which suggests a possible decline in abundance, but its p-value is quite high, so the evidence is weak.

#### Interpretation of Species-Specific Abundance Trends

The figures above display the temporal evolution of abundance for the five most frequently recorded bird species in the dataset, using the standardized mean abundance (`TOT_AV_V`) per year and the associated uncertainty estimated via bootstrap resampling. For **Élénie siffleuse**, the annual mean abundance shows a generally increasing pattern over the study period, and the fitted linear trend is positive with a confidence interval that lies almost entirely above zero, indicating a statistically supported rise in abundance. A similar positive and statistically significant trend is observed for **Tourterelle à queue carrée**, where both the slope and the bootstrap confidence intervals suggest a sustained increase in abundance over time.

In contrast, **Quiscale merle** shows a declining fitted trend line, but the year-to-year variability is relatively high and the confidence intervals are broader, resulting in a non-significant trend. This suggests that although the species may be experiencing a reduction in recorded abundance, further data or more controlled sampling would be needed to confirm this decline. For **Sporophile rougegorge** and **Sucrier à ventre jaune**, the mean abundance values fluctuate from year to year without a marked directional change. Their fitted trends are near-flat and the associated confidence intervals are wide, so there is no evidence of a directional change and these populations appear relatively stable across the studied period.

Overall, these results highlight two species experiencing significant increases (Élénie siffleuse and Tourterelle à queue carrée), one species with a possible but unconfirmed decline (Quiscale merle), and two species showing stable abundance levels with no evidence of directional long-term change (Sporophile rougegorge and Sucrier à ventre jaune).

#### Limitations

The mean-abundance and bootstrap approach provides a robust, effort-standardized indicator of species presence over time. However, this method does not explicitly correct for variation in detection probability (observer effects, weather, time of day) and assumes independence between sampling events, which may not always hold. Additionally, mean abundance reflects relative observation rates rather than absolute population sizes: it measures how often a species is recorded, not how many individuals exist in the ecosystem. The linear trend model may also fail to capture non-linear ecological dynamics. Therefore, while the approach reliably identifies broad directional changes, results should be interpreted as population indices rather than direct population estimates.

### Per-Transect Detection Rate Evolution Study 

## 4. Synthesis 

Conclusions

---

