<a href="https://colab.research.google.com/github/timosachsenberg/EuBIC2026/blob/main/notebooks/EUBIC_Task3_Quant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dependencies (for Google Colab)
!pip install -q "pyopenms>=3.5.0" "pyopenms-viz>=1.0.0"

# Notebook 3 – Quantification

In this tutorial, we demonstrate a complete feature detection and annotation workflow using Biosaur2, an isotope-aware feature detection algorithm, as implemented in pyOpenMS.

**What is quantification?** In proteomics, quantification refers to measuring the abundance of peptides or proteins in a sample. This is typically done by integrating the MS1 signal intensity over the elution time of each peptide, producing a value proportional to the amount of that peptide present.
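As a minimal sketch of what "integrating the MS1 signal over the elution time" means, the snippet below applies the trapezoidal rule to a small made-up intensity trace. The function name `integrate_trace` and the numbers are illustrative assumptions; Biosaur2 may integrate differently (for example, by summing intensities).

```python
def integrate_trace(rts, intensities):
    """Trapezoidal area under an intensity-versus-retention-time trace."""
    if len(rts) != len(intensities):
        raise ValueError("rts and intensities must have the same length")
    area = 0.0
    for i in range(1, len(rts)):
        dt = rts[i] - rts[i - 1]  # time between consecutive scans (s)
        area += 0.5 * (intensities[i] + intensities[i - 1]) * dt
    return area


# Toy elution profile: five MS1 scans, 2 s apart (made-up values)
rts = [100.0, 102.0, 104.0, 106.0, 108.0]
intensities = [0.0, 50.0, 100.0, 50.0, 0.0]

area = integrate_trace(rts, intensities)
apex = max(intensities)
print(f"Integrated intensity: {area}, apex intensity: {apex}")
```

The integrated value depends on both peak height and peak width, which is why it is usually a more robust abundance estimate than the apex alone.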

**What is feature detection?** Feature detection groups related peaks across multiple spectra into a single entity called a "feature". Unlike peak picking (which finds peaks in individual spectra), feature detection tracks signals across retention time, recognizing that the same peptide ion produces peaks in consecutive scans as it elutes from the LC column.

In this notebook we will:

1. **Apply the Biosaur2 algorithm to detect isotope-resolved features from mzML data.**

2. **Annotate the feature map with peptide identifications.**

3. **Visually inspect detected features in retention time–m/z–intensity space.**

---

<details>
<summary><b>Quick Reference: Key Terms Used in This Notebook</b></summary>

| Term | Definition |
|------|------------|
| **Feature** | A detected peptide signal tracked across multiple MS1 scans |
| **Centroid** | Intensity-weighted center of a peak (single m/z value) |
| **Apex** | Point of maximum intensity during elution (peak top) |
| **Integrated intensity** | Area under the curve - sum of intensities across elution |
| **FeatureMap** | OpenMS container holding all detected features |
| **IDMapper** | Tool linking MS2 identifications to MS1 features |
| **RT tolerance** | How much retention time can differ in ID mapping |
| **m/z tolerance** | How much m/z can differ in ID mapping (often in ppm) |

</details>

<details>
<summary><b>How Does Feature Detection Differ From Peak Picking?</b></summary>

**Peak Picking** (per-spectrum):
```
Spectrum 1: Find peaks at m/z 500.25, 501.25, 502.26...
Spectrum 2: Find peaks at m/z 500.24, 501.26, 502.25...
Spectrum 3: Find peaks at m/z 500.25, 501.25, 502.26...
```
Each spectrum analyzed independently - no connection between them.

**Feature Detection** (cross-spectrum):
```
Feature: Same peptide ion tracked across RT
┌────────────────────────────────────────┐
│    ▁▂▄▆█▆▄▂▁   ← Intensity over time   │
│    ╔═══════╗                           │
│    ║ m/z=500.25 ║  ← Feature boundary  │
│    ║ RT=100-120 ║                      │
│    ║ z=2        ║  ← Charge state      │
│    ╚═══════╝                           │
│    Integrated intensity = 1,234,567    │
└────────────────────────────────────────┘
```

Feature detection provides:
- Single quantitative value per peptide (not per spectrum)
- Charge state inference from isotope pattern
- Noise filtering (signals must persist over time)
- Ready for quantitative comparison

</details>
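The cross-spectrum idea can be sketched in plain Python: peaks from consecutive spectra are chained into a trace whenever their m/z values agree within a tolerance. This is a simplified illustration, not the Biosaur2 implementation; `build_traces`, the 25 ppm tolerance, and the intensities are assumptions chosen so the toy m/z values from the peak-picking example above link up.

```python
def ppm_diff(mz_a, mz_b):
    """Absolute m/z difference in parts per million, relative to mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def build_traces(spectra, tol_ppm=25.0):
    """Chain peaks from consecutive spectra into traces.

    spectra: list of (rt, [(mz, intensity), ...]) sorted by rt.
    Returns a list of traces; each trace is a list of (rt, mz, intensity).
    """
    traces = []
    open_traces = []  # traces that were extended in the previous scan
    for rt, peaks in spectra:
        next_open = []
        for mz, intensity in peaks:
            for trace in open_traces:
                if ppm_diff(mz, trace[-1][1]) <= tol_ppm:
                    trace.append((rt, mz, intensity))
                    next_open.append(trace)
                    open_traces.remove(trace)  # one peak per trace per scan
                    break
            else:
                # No open trace matched: this peak starts a new trace
                trace = [(rt, mz, intensity)]
                traces.append(trace)
                next_open.append(trace)
        open_traces = next_open  # traces not extended in this scan are closed
    return traces


spectra = [
    (100.0, [(500.25, 10.0), (501.25, 6.0)]),
    (101.0, [(500.24, 20.0), (501.26, 12.0)]),
    (102.0, [(500.25, 10.0), (501.25, 6.0)]),
]
traces = build_traces(spectra)
for trace in traces:
    total = sum(intensity for _, _, intensity in trace)
    print(f"m/z ~{trace[0][1]:.2f}: {len(trace)} scans, summed intensity {total}")
```

Each trace yields one quantitative value, in contrast to the three independent peak lists that peak picking alone produces.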

In [None]:

%matplotlib inline
import os
import pyopenms as oms
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
print("pyOpenMS version:", oms.__version__)


pyOpenMS version: 3.5.0


In [None]:
# Download mzML and idXML files from course repository
if not os.path.exists("BSA1.mzML"):
    !wget -q -O "BSA1.mzML" https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/data/BSA1.mzML

if not os.path.exists("BSA1_F1.idXML"):
    !wget -q -O "BSA1_F1.idXML" https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/data/BSA1_F1.idXML

# Load mzML file into MSExperiment
exp = oms.MSExperiment()
oms.MzMLFile().load("BSA1.mzML", exp)

print(f"Loaded {exp.size()} spectra from BSA1.mzML")

**File formats used:**
- **mzML**: An open XML format for mass spectrometry raw data (spectra with m/z and intensity values)
- **idXML**: An OpenMS XML format storing peptide and protein identification results from database searches (PSMs, scores, sequences)
- **featureXML**: An OpenMS XML format storing detected features with their quantitative properties (RT, m/z, intensity, charge, boundaries)

# 1. Apply the Biosaur2 algorithm.

**Aims of this task**

- Biosaur2 is an isotope-aware feature detection method that identifies peptide features by clustering peaks across retention time and evaluating their isotopic patterns and charge state consistency.
- This strategy enables reliable discrimination between true peptide signals and background noise, particularly in complex LC–MS datasets.
- Each detected feature represents an aggregated MS1 signal characterized by a centroid mass-to-charge ratio, a retention time apex, an integrated intensity, and an inferred charge state.

**Implementation**
- A Biosaur2 algorithm instance (`Biosaur2Algorithm()`) was initialized and provided with the experimental data as an `MSExperiment` loaded from the mzML file.
- Feature detection was executed using default algorithm parameters, yielding a FeatureMap object containing all detected features.



<details>
<summary><b>Deep Dive: How Does Biosaur2 Work?</b></summary>

**Biosaur2's Approach to Feature Detection:**

1. **Peak Detection**: Find centroided peaks in each MS1 spectrum

2. **Isotope Pattern Recognition**:
   ```
   Looking for characteristic isotope patterns:
   
   ┌─────────────────────────────────────┐
   │  █                                  │
   │  █  █                               │
   │  █  █  █                            │
   │  █  █  █  █                         │
   │──┴──┴──┴──┴─────────────────────────│
   │  M  M+1 M+2 M+3   (spacing = 1/z)   │
   └─────────────────────────────────────┘
   ```

3. **Charge State Inference**: Use isotope spacing to determine charge
   - 1.0 m/z spacing → +1
   - 0.5 m/z spacing → +2
   - 0.33 m/z spacing → +3

4. **RT Tracking**: Follow the same isotope pattern across consecutive scans
   ```
   Scan 100: ████  (found)
   Scan 101: ████  (found, same pattern)
   Scan 102: ████  (found, intensity increasing)
   Scan 103: █████ (APEX - maximum intensity)
   Scan 104: ████  (found, intensity decreasing)
   Scan 105: ███   (found)
   Scan 106: --    (not found, feature ends)
   ```

5. **Integration**: Sum intensities across all scans to get total abundance

**Why "isotope-aware" matters:**
- Distinguishes overlapping peptides by isotope pattern
- Correct charge state assignment
- Better noise rejection (noise doesn't have isotope patterns)

</details>
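The charge-inference step can be illustrated with a short sketch: the mean spacing between isotope peaks is compared with the spacing expected for each candidate charge (about 1.00335 / z, the 13C–12C mass difference divided by the charge). The function `infer_charge`, its tolerance, and the example m/z values are assumptions for illustration and do not reproduce Biosaur2's internal scoring.

```python
ISOTOPE_MASS_DIFF = 1.00335  # approximate 13C - 12C mass difference in Da


def infer_charge(isotope_mzs, max_charge=6, tol=0.01):
    """Return the charge whose expected isotope spacing best matches the
    observed mean spacing, or None if nothing matches within `tol`."""
    if len(isotope_mzs) < 2:
        return None
    gaps = [b - a for a, b in zip(isotope_mzs, isotope_mzs[1:])]
    mean_gap = sum(gaps) / len(gaps)
    best_charge, best_err = None, tol
    for z in range(1, max_charge + 1):
        err = abs(mean_gap - ISOTOPE_MASS_DIFF / z)
        if err <= best_err:
            best_charge, best_err = z, err
    return best_charge


charge_a = infer_charge([500.25, 501.25, 502.26])   # spacing ~1.0
charge_b = infer_charge([500.25, 500.75, 501.25])   # spacing ~0.5
charge_c = infer_charge([400.0, 400.334, 400.668])  # spacing ~0.33
charge_d = infer_charge([500.0, 500.7])             # no plausible charge
print(charge_a, charge_b, charge_c, charge_d)
```

Peaks that do not fit any expected spacing (the last case) are the kind of signal an isotope-aware detector can reject as noise.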

In [None]:
# Initialize Biosaur2 feature detection algorithm
biosaur = oms.Biosaur2Algorithm()

# Provide the MS data to the algorithm
biosaur.setMSData(exp)

# Create an empty FeatureMap to store detected features
features = oms.FeatureMap()

# Run feature detection
biosaur.run(features)

print(f"Detected {features.size()} features")

# 2. Annotate the feature map with peptide identifications.

**Aims of this task**
- To associate MS1-level quantitative features with sequence-level peptide identifications.
- To integrate peptide and protein identification information into the detected feature map.
- To enable biologically interpretable, peptide-resolved quantitative analysis.

**Implementation**
- Peptide and protein identifications were loaded from an idXML file generated by MS/MS database searching.
- An `IDMapper` instance was initialized to perform feature–identification mapping. See: [https://pyopenms.readthedocs.io/en/latest/user_guide/PSM_to_features.html](https://pyopenms.readthedocs.io/en/latest/user_guide/PSM_to_features.html)
- Peptide identifications were annotated onto detected features based on proximity in RT and m/z space.
- The annotated features were stored within the existing `FeatureMap` structure for downstream analysis.
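To make "proximity in RT and m/z space" concrete, here is a simplified sketch of the matching logic on plain dictionaries. It compares each identification with the feature centroid only; the real `IDMapper` has more options and may use feature boundaries as well. The function `map_ids_to_features`, the dictionary keys, and the peptide sequences are hypothetical.

```python
def map_ids_to_features(features, psms, rt_tol=5.0, mz_tol_ppm=10.0):
    """Attach each PSM to every feature whose centroid lies within both tolerances.

    Returns (mapping, unmapped): mapping is feature id -> list of sequences,
    unmapped is the list of sequences that matched no feature.
    """
    mapping = {f["id"]: [] for f in features}
    unmapped = []
    for psm in psms:
        matched = False
        for f in features:
            rt_ok = abs(psm["rt"] - f["rt"]) <= rt_tol
            ppm = abs(psm["mz"] - f["mz"]) / f["mz"] * 1e6
            if rt_ok and ppm <= mz_tol_ppm:
                mapping[f["id"]].append(psm["sequence"])
                matched = True
        if not matched:
            unmapped.append(psm["sequence"])
    return mapping, unmapped


# Made-up features and identifications
toy_features = [
    {"id": "f1", "rt": 1000.5, "mz": 500.2510},
    {"id": "f2", "rt": 1200.0, "mz": 650.3000},
]
toy_psms = [
    {"sequence": "PEPTIDEK", "rt": 1001.2, "mz": 500.2498},  # 0.7 s, ~2.4 ppm off
    {"sequence": "SAMPLER", "rt": 1210.0, "mz": 650.3001},   # 10 s off
]

mapping, unmapped = map_ids_to_features(toy_features, toy_psms)
print(mapping, unmapped)

loose_mapping, loose_unmapped = map_ids_to_features(toy_features, toy_psms, rt_tol=20.0)
print(loose_mapping, loose_unmapped)
```

With the 5 s tolerance the second identification stays unmapped; widening the RT tolerance to 20 s assigns it to `f2`.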

In [None]:
# Load the identification (.idXML) file and extract peptide and protein identifications
peptide_ids = oms.PeptideIdentificationList()
protein_ids = []
oms.IdXMLFile().load("BSA1_F1.idXML", protein_ids, peptide_ids)


In [None]:
# Configure IDMapper
id_mapper = oms.IDMapper()
params = id_mapper.getParameters()
params.setValue("rt_tolerance", 5.0)  # RT tolerance in seconds
params.setValue("mz_tolerance", 10.0)  # m/z tolerance in ppm
id_mapper.setParameters(params)


<details>
<summary><b>Deep Dive: Why Do We Need IDMapper Tolerances?</b></summary>

**The Problem: MS1 vs. MS2 Don't Align Perfectly**

When a peptide elutes, the MS acquires data like this:

```
Time ──────────────────────────────────────────→
       MS1    MS1    MS2    MS1    MS2    MS1
       ↑      ↑      ↑      ↑      ↑      ↑
       │      │      │      │      │      │
       ├──────┼──────┼──────┼──────┼──────┤
       │  Feature detected here (MS1)    │
       │      RT = 1000.5 s              │
       │      m/z = 500.2510             │
       │                                  │
       │  PSM recorded here (MS2)         │
       │      RT = 1001.2 s   ← 0.7s off! │
       │      m/z = 500.2498  ← 2.4 ppm! │
       └──────────────────────────────────┘
```

**Why the differences?**
- MS2 happens slightly after the precursor was selected from MS1
- m/z calibration may differ slightly between MS levels
- Precursor isolation isn't perfectly centered

**Tolerance guidelines:**

| Parameter | Typical Value | Explanation |
|-----------|---------------|-------------|
| RT tolerance | 5-20 s | Depends on cycle time and peak width |
| m/z tolerance | 5-20 ppm | Depends on instrument calibration |

**What happens with wrong tolerances:**
- **Too tight**: Miss valid matches (features without IDs)
- **Too loose**: Wrong IDs mapped to features (false associations)

</details>
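The ppm arithmetic behind these tolerances is easy to check by hand. The sketch below uses two hypothetical helper functions, `ppm_error` and `ppm_to_da`, to reproduce the 2.4 ppm offset from the diagram above and to convert a ppm tolerance into an absolute m/z window.

```python
def ppm_error(observed_mz, reference_mz):
    """Signed m/z deviation in parts per million."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def ppm_to_da(mz, ppm):
    """Absolute m/z half-window corresponding to a ppm tolerance at a given m/z."""
    return mz * ppm * 1e-6


offset_ppm = ppm_error(500.2498, 500.2510)
window_10ppm = ppm_to_da(500.0, 10.0)
window_100ppm = ppm_to_da(500.0, 100.0)

print(f"PSM vs. feature offset: {offset_ppm:.1f} ppm")
print(f"10 ppm at m/z 500  = +/-{window_10ppm:.4f}")
print(f"100 ppm at m/z 500 = +/-{window_100ppm:.4f}")
```

Because ppm is relative, the same tolerance corresponds to a wider absolute window at higher m/z.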

---

### Exercise 1: Effect of IDMapper Tolerances

**Think about this:**

1. If you set `rt_tolerance` to 0.1 seconds (very tight), what would happen?
2. If you set `mz_tolerance` to 100 ppm (very loose), what could go wrong?

<details>
<summary><b>Click to check your answers</b></summary>

**Answer 1: Very tight RT tolerance (0.1 s)**
- Most identifications would fail to map to features
- You'd have features without peptide annotations
- Only perfect timing matches would work

**Answer 2: Very loose m/z tolerance (100 ppm)**
- Multiple identifications might map to the same feature
- Wrong peptides might get assigned to features
- Risk: peptide A's ID gets mapped to peptide B's feature
- At m/z 500, 100 ppm = ±0.05 Da, wide enough to capture a different, near-isobaric ion eluting at the same time

**Best practice**: Start with instrument-appropriate defaults, then adjust if mapping rate is too low or too high.

</details>

---

In [None]:
id_mapper.annotate(features, peptide_ids, protein_ids, True, True, exp)


# 3. Visually inspect detected features in retention time–m/z–intensity space.

**Aims of this task**
- To visually evaluate the detected MS1 features in retention time–mass-to-charge–intensity space, enabling qualitative assessment of feature detection performance.

**Implementation**
- The detected feature map was converted into a tabular pandas DataFrame for exploratory analysis. See: [https://pyopenms.readthedocs.io/en/latest/user_guide/export_pandas_dataframe.html](https://pyopenms.readthedocs.io/en/latest/user_guide/export_pandas_dataframe.html)
- The plotting backend was configured to enable mass spectrometry–specific visualizations. See: [https://pyopenms-viz.readthedocs.io/en/latest/](https://pyopenms-viz.readthedocs.io/en/latest/)
- A peak map visualization was generated, projecting features in retention time, m/z, and intensity space.


In [None]:
# Export features into dataframe
df = features.get_df()
df.head(2)

| feature_id | peptide_sequence | peptide_score | ID_filename | ID_native_id | charge | rt | mz | rt_start | rt_end | mz_start | mz_end | quality | intensity |
|------------|------------------|---------------|-------------|--------------|--------|----|----|----------|--------|----------|--------|---------|-----------|
| 12592275275208522725 | | | unknown | | 5 | 2272.696289 | 674.534031 | 2252.12915 | 2299.366455 | 674.532496 | 676.139162 | 9.0 | 83422.359375 |
| 17439098422051461769 | | | | | 3 | 2208.773438 | 655.301653 | 2195.811768 | 2217.431396 | 655.297018 | 657.308736 | 7.0 | 230824.5 |


---

### Exercise 2: Explore the Feature DataFrame

Before looking at the visualizations, explore the DataFrame to understand the data:

```python
# Try these commands:
print(f"Total features detected: {len(df)}")
print(f"Features with peptide IDs: {df['peptide_sequence'].notna().sum()}")
print(f"Charge state distribution:\n{df['charge'].value_counts()}")
print(f"Intensity range: {df['intensity'].min():.0f} to {df['intensity'].max():.0f}")
```

**Questions:**
1. What percentage of features have peptide identifications?
2. What's the most common charge state? Why might this be?
3. Look at the `quality` column - what do you think it represents?

<details>
<summary><b>Click for discussion</b></summary>

**Expected observations:**

1. **ID coverage**: Typically 20-50% of features have peptide IDs
   - Not all MS1 features trigger MS2 scans (DDA limitation)
   - Some MS2 spectra fail to identify (low quality, modifications)
   - Some features may be contaminants (not in database)

2. **Charge distribution**: +2 and +3 are typically most common
   - Tryptic peptides have K/R at C-terminus → at least +1
   - N-terminus adds another charge
   - Larger peptides may have internal His, Lys → +3, +4

3. **Quality score**: Algorithm confidence in the feature
   - Higher = more confident detection
   - Based on isotope pattern fit, peak shape, signal-to-noise
   - Can be used to filter low-quality features

</details>

---

In [None]:
# interactive PeakMap plot with plotly
from pyopenms_viz._plotly import PLOTLYPeakMapPlot

plot = PLOTLYPeakMapPlot(
    data=df,
    x="rt",
    y="mz",
    z="intensity",
    width=800,
    height=800,
    grid=False,
    add_marginals=True, # showing RT and intensities
)

plot.show();

In [None]:
# Plot the peak map and draw a bounding box around each detected feature
plot = PLOTLYPeakMapPlot(
    data=df,
    x="rt",
    y="mz",
    z="intensity",
    width=1000,
    height=1000,
    grid=False,
)

# Create rectangles for all features
shapes = []
for _, row in df.iterrows():
    shapes.append(
        dict(
            type="rect",
            x0=row["rt_start"],
            x1=row["rt_end"],
            y0=row["mz_start"],
            y1=row["mz_end"],
            line=dict(color="blue", width=1)
        )
    )

# Add all rectangles to the plot
plot.fig.update_layout(shapes=shapes)

# Show the interactive plot
plot.show();

In [None]:
# Filter features within the RT window
df_cut = df[(df["rt_start"] >= 1600) & (df["rt_end"] <= 1650)]

# Plot peakmap
plot = PLOTLYPeakMapPlot(
    data=df_cut,
    x="rt",
    y="mz",
    z="intensity",
    width=1000,
    height=1000,
    grid=False,
)

# Create rectangles for filtered features
shapes = []
for _, row in df_cut.iterrows():
    shapes.append(
        dict(
            type="rect",
            x0=row["rt_start"],
            x1=row["rt_end"],
            y0=row["mz_start"],
            y1=row["mz_end"],
            line=dict(color="blue", width=1)
        )
    )

# Add rectangles to the plot
plot.fig.update_layout(shapes=shapes)

# Show the interactive plot
plot.show();
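The RT filter above keeps only features whose whole elution range lies inside the window, so features that straddle a window edge are dropped. The following sketch contrasts that "fully inside" test with an "overlaps" test on made-up feature boundaries; the helper names `fully_inside` and `overlaps` are illustrative.

```python
def fully_inside(feature, rt_min, rt_max):
    """True if the feature's whole elution range lies within the window."""
    return feature["rt_start"] >= rt_min and feature["rt_end"] <= rt_max


def overlaps(feature, rt_min, rt_max):
    """True if any part of the feature's elution range falls in the window."""
    return feature["rt_start"] <= rt_max and feature["rt_end"] >= rt_min


# Made-up feature boundaries (seconds)
toy = [
    {"name": "A", "rt_start": 1605.0, "rt_end": 1640.0},  # entirely inside
    {"name": "B", "rt_start": 1590.0, "rt_end": 1610.0},  # crosses left edge
    {"name": "C", "rt_start": 1645.0, "rt_end": 1670.0},  # crosses right edge
    {"name": "D", "rt_start": 1700.0, "rt_end": 1720.0},  # outside
]

inside = [f["name"] for f in toy if fully_inside(f, 1600.0, 1650.0)]
touching = [f["name"] for f in toy if overlaps(f, 1600.0, 1650.0)]
print("Fully inside:", inside)
print("Overlapping: ", touching)
```

On the DataFrame, the overlap variant would be `df[(df["rt_start"] <= 1650) & (df["rt_end"] >= 1600)]`, which is useful when boxes cut off at the plot border should still be drawn.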

## Summary

Congratulations! You've completed the proteomics data analysis workflow. Here's what you learned:

| Step | Concept | Key pyOpenMS Tool |
|------|---------|-------------------|
| **Feature detection** | Group MS1 peaks across RT into features | `Biosaur2Algorithm` |
| **Isotope awareness** | Use isotope patterns to infer charge | Built into Biosaur2 |
| **ID mapping** | Link MS2 identifications to MS1 features | `IDMapper` |
| **Tolerance settings** | Account for measurement variability | RT and m/z tolerances |
| **Data export** | Convert to pandas for analysis | `FeatureMap.get_df()` |
| **Visualization** | Interactive peak maps with annotations | `pyopenms_viz` |

---

## Complete Workflow Summary

You've now learned the entire bottom-up proteomics workflow:

```
┌─────────────────────────────────────────────────────────────────┐
│                    BOTTOM-UP PROTEOMICS                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Notebook 1: RAW DATA                                           │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐   │
│  │ Proteins│ →   │ Digest  │ →   │ LC-MS   │ →   │ Spectra │   │
│  │ (FASTA) │     │(Trypsin)│     │  Data   │     │ (mzML)  │   │
│  └─────────┘     └─────────┘     └─────────┘     └─────────┘   │
│                                                                 │
│  Notebook 2: IDENTIFICATION                                     │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐   │
│  │  MS2    │ →   │ Theo.   │ →   │  Match  │ →   │  PSMs   │   │
│  │ Spectra │     │ Spectra │     │ & Score │     │(idXML)  │   │
│  └─────────┘     └─────────┘     └─────────┘     └─────────┘   │
│                                                                 │
│  Notebook 3: QUANTIFICATION                                     │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐   │
│  │  MS1    │ →   │ Feature │ →   │   ID    │ →   │Annotated│   │
│  │ Spectra │     │ Detect  │     │ Mapping │     │Features │   │
│  └─────────┘     └─────────┘     └─────────┘     └─────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## Bonus Challenges

<details>
<summary><b>Challenge 1 (Beginner): Filter by Quality</b></summary>

Filter the DataFrame to keep only high-quality features:

```python
# Filter features with quality > 5
df_high_quality = df[df['quality'] > 5]
print(f"High quality features: {len(df_high_quality)} / {len(df)}")

# Compare ID rates
original_id_rate = df['peptide_sequence'].notna().mean()
filtered_id_rate = df_high_quality['peptide_sequence'].notna().mean()
print(f"ID rate: {original_id_rate:.1%} → {filtered_id_rate:.1%}")
```

**Question**: Does filtering by quality improve the identification rate?

</details>

<details>
<summary><b>Challenge 2 (Intermediate): Highlight Identified Features</b></summary>

Modify the visualization to show identified features in a different color:

```python
# Create shapes with different colors for identified vs unidentified
shapes = []
for _, row in df.iterrows():
    color = "green" if pd.notna(row['peptide_sequence']) else "blue"
    shapes.append(
        dict(
            type="rect",
            x0=row["rt_start"],
            x1=row["rt_end"],
            y0=row["mz_start"],
            y1=row["mz_end"],
            line=dict(color=color, width=1)
        )
    )
```

**Observe**: Where in the RT/m/z space are identified features concentrated?

</details>

<details>
<summary><b>Challenge 3 (Advanced): Intensity Distribution Analysis</b></summary>

Analyze whether feature intensity affects identification:

```python
import matplotlib.pyplot as plt
import numpy as np

# Split features by identification status
identified = df[df['peptide_sequence'].notna()]['intensity']
unidentified = df[df['peptide_sequence'].isna()]['intensity']

# Plot histograms
plt.figure(figsize=(10, 5))
plt.hist(np.log10(identified), bins=50, alpha=0.5, label='Identified')
plt.hist(np.log10(unidentified), bins=50, alpha=0.5, label='Unidentified')
plt.xlabel('log10(Intensity)')
plt.ylabel('Count')
plt.legend()
plt.title('Intensity Distribution: Identified vs Unidentified Features')
plt.show()
```

**Question**: Are higher-intensity features more likely to be identified? Why?

</details>

<details>
<summary><b>Challenge 4 (Expert): Compare Quantification Methods</b></summary>

Compare different ways to summarize peptide abundance:

```python
# For identified peptides, compare:
# 1. Apex intensity (single point)
# 2. Integrated intensity (area under curve)

identified_df = df[df['peptide_sequence'].notna()].copy()

# Calculate peak width
identified_df['peak_width'] = identified_df['rt_end'] - identified_df['rt_start']

# Estimate apex vs integrated ratio
# (Assuming a Gaussian peak: Area ≈ 1.06 × height × FWHM, where FWHM is the
#  full width at half maximum, which is narrower than rt_end - rt_start)

# Explore the relationship between peak width and intensity
```

**Discussion**: When would apex intensity be preferred over integrated intensity?

</details>
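The Gaussian relation mentioned in Challenge 4 can be verified numerically. The sketch below computes the closed-form area, height × FWHM × sqrt(π / (4 ln 2)), and compares it with a brute-force sum over a simulated Gaussian elution profile. The peak height and width are made-up values, and real chromatographic peaks are often tailed rather than Gaussian.

```python
import math

# Area of a Gaussian = height * FWHM * sqrt(pi / (4 ln 2)), about 1.0645
GAUSS_FACTOR = math.sqrt(math.pi / (4.0 * math.log(2.0)))


def gaussian_area(height, fwhm):
    """Closed-form area under a Gaussian peak."""
    return GAUSS_FACTOR * height * fwhm


def numeric_area(height, fwhm, step=0.01):
    """Rectangle-rule integration of the same Gaussian over +/- 6 sigma."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    n_steps = int(12.0 * sigma / step)
    total = 0.0
    for i in range(n_steps + 1):
        t = -6.0 * sigma + i * step  # time relative to the apex
        total += height * math.exp(-0.5 * (t / sigma) ** 2) * step
    return total


height, fwhm = 1.0e5, 10.0  # made-up apex intensity and FWHM in seconds
closed_form = gaussian_area(height, fwhm)
brute_force = numeric_area(height, fwhm)
print(f"Factor: {GAUSS_FACTOR:.4f}")
print(f"Closed form: {closed_form:.1f}, numeric: {brute_force:.1f}")
```

For a fixed apex height, the area scales linearly with peak width, so two peptides with equal apex intensity can differ in integrated intensity if their elution profiles differ in width.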

---

## What's Next?

You now have the foundation to explore more advanced topics:

- **Label-free quantification (LFQ)**: Compare peptide abundances across samples
- **Isobaric labeling (TMT/iTRAQ)**: Multiplex quantification using reporter ions
- **Data-independent acquisition (DIA)**: Alternative to DDA that fragments all precursors in predefined m/z windows, often giving more complete and reproducible quantification
- **Statistical analysis**: Differential expression between conditions
- **Protein inference**: Roll up peptide quantities to protein level

**Resources:**
- [pyOpenMS Documentation](https://pyopenms.readthedocs.io/)
- [OpenMS Tutorials](https://openms.readthedocs.io/en/latest/tutorials/index.html)