<a href="https://colab.research.google.com/github/timosachsenberg/EuBIC2026/blob/main/notebooks/EUBIC_Task3_Quant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dependencies (for Google Colab)
!pip install -q "pyopenms>=3.5.0" "pyopenms-viz>=1.0.0"

# Notebook 3 – Quantification

> **Prerequisites**: This notebook builds on [Notebook 1 - From Proteins to Spectra](EUBIC_Task1_Peaks.ipynb) and [Notebook 2 - Peptide Identification](EUBIC_Task2_ID.ipynb). You should be familiar with MS1 spectra, peptide identification, and PSMs.

In this tutorial, we demonstrate a complete feature detection and annotation workflow using the OpenMS implementation of Biosaur2, an isotope-aware feature detection algorithm.

**What is quantification?** In proteomics, quantification refers to measuring the abundance of peptides or proteins in a sample. This is typically done by integrating the MS1 signal intensity over the elution time of each peptide, producing a value proportional to the amount of that peptide present.

**What is feature detection?** Feature detection groups related peaks across multiple spectra into a single entity called a "feature". Unlike peak picking (which finds peaks in individual spectra), feature detection tracks signals across retention time, recognizing that the same peptide ion produces peaks in consecutive scans as it elutes from the LC column.

In this notebook we will:

1. **Apply the Biosaur2 algorithm to detect isotope-resolved features from mzML data.**

2. **Annotate the feature map with peptide identifications.**

3. **Visually inspect detected features in retention time–m/z–intensity space.**

---

<details>
<summary><b>Quick Reference: Key Terms Used in This Notebook</b></summary>

| Term | Definition |
|------|------------|
| **Feature** | A detected peptide signal tracked across multiple MS1 scans |
| **Centroid** | Intensity-weighted center of a peak (single m/z value) |
| **Apex** | Point of maximum intensity during elution (peak top) |
| **Integrated intensity** | Area under the curve - sum of intensities across elution |
| **FeatureMap** | OpenMS container holding all detected features |
| **IDMapper** | Tool linking MS2 identifications to MS1 features |
| **RT tolerance** | Maximum allowed difference in retention time (seconds) when matching an MS2 identification to an MS1 feature |
| **m/z tolerance** | Maximum allowed difference in mass-to-charge ratio (often in ppm) when matching an MS2 identification to an MS1 feature |

</details>

<details>
<summary><b>How Does Feature Detection Differ From Peak Picking?</b></summary>

**Peak Picking** (per-spectrum):

Peak picking converts the raw signal intensities recorded by the ion detector (profile data) into discrete m/z values (centroided data). Each continuous peak profile is reduced to a single m/z position representing its intensity-weighted center.

```
Spectrum 1: Find peaks at m/z 500.25, 501.25, 502.26...
Spectrum 2: Find peaks at m/z 500.24, 501.26, 502.25...
Spectrum 3: Find peaks at m/z 500.25, 501.25, 502.26...
```
Each spectrum analyzed independently - no connection between them.
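
The "intensity-weighted center" of a profile peak can be made concrete in a few lines of plain Python. This is only a sketch: the profile points are made-up values and `intensity_weighted_centroid` is a hypothetical helper, not a pyOpenMS function. Real peak pickers often use more elaborate fitting than a simple weighted mean.

```python
def intensity_weighted_centroid(mzs, intensities):
    """Return the intensity-weighted mean m/z of one profile peak."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("peak has no signal")
    return sum(mz * inten for mz, inten in zip(mzs, intensities)) / total


# Hypothetical profile points sampled across a single peak (made-up values)
profile_mz = [500.240, 500.245, 500.250, 500.255, 500.260]
profile_int = [10.0, 60.0, 100.0, 60.0, 10.0]

centroid_mz = intensity_weighted_centroid(profile_mz, profile_int)
print(f"Centroid m/z: {centroid_mz:.4f}")
```

Because the toy peak is symmetric, the centroid coincides with the most intense point; for a skewed peak it would shift toward the heavier side.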

**Feature Detection** (cross-spectrum):

![Feature detection illustration](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/feature.png)

*A feature represents a peptide ion tracked across retention time. The feature boundaries are defined by the m/z range spanning the isotope pattern (monoisotopic peak plus heavier isotopes) and the RT range spanning the chromatographic elution profile.*

Feature detection provides:
- Single quantitative value per peptide ion (i.e., per peptide observed with a specific charge state)
- Charge state inference from isotope pattern
- Noise filtering (signals must persist over time)
- Ready for quantitative comparison
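
The cross-spectrum idea, and the noise filtering that comes with it, can be sketched with a toy linker that chains centroided peaks from consecutive scans when their m/z values agree within a ppm tolerance. This is not how Biosaur2 is implemented; `link_peaks_across_scans`, the tolerance, and the scan values are assumptions made for illustration.

```python
def link_peaks_across_scans(scans, tol_ppm=10.0):
    """Greedily chain centroided peaks from consecutive scans into traces.

    scans: list of (rt, [(mz, intensity), ...]) sorted by rt.
    Returns a list of traces; each trace is a list of (rt, mz, intensity).
    """
    finished, open_traces = [], []
    for rt, peaks in scans:
        still_open = []
        unused = list(peaks)
        for trace in open_traces:
            last_mz = trace[-1][1]
            # Unused peaks of this scan within the ppm tolerance of the trace
            candidates = [p for p in unused
                          if abs(p[0] - last_mz) / last_mz * 1e6 <= tol_ppm]
            if candidates:
                best = min(candidates, key=lambda p: abs(p[0] - last_mz))
                unused.remove(best)
                trace.append((rt, best[0], best[1]))
                still_open.append(trace)
            else:
                finished.append(trace)  # signal did not persist
        # Every peak that extended no trace starts a new one
        still_open.extend([[(rt, mz, inten)] for mz, inten in unused])
        open_traces = still_open
    return finished + open_traces


# Made-up scans: one persistent signal near m/z 500.25 plus two one-off peaks
scans = [
    (100.0, [(500.2500, 1e4), (620.3000, 5e3)]),
    (101.0, [(500.2510, 3e4)]),
    (102.0, [(500.2495, 2e4), (710.4000, 4e3)]),
]
traces = link_peaks_across_scans(scans)
persistent = [t for t in traces if len(t) >= 3]
print(f"{len(traces)} traces, {len(persistent)} persisting over 3 scans")
```

Requiring a minimum number of consecutive scans is what removes the two isolated peaks here; a real feature finder additionally groups traces into isotope patterns.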

</details>

<details>
<summary><b>Why Use Integrated Intensity for Quantification?</b></summary>

Integrating the signal over the entire elution profile (area under the curve) rather than using a single intensity value (e.g., apex intensity) provides more robust and comparable quantification across samples.

**The key advantage**: Slight differences in chromatography between experiments can cause peak broadening or sharpening, changing the apex intensity. However, the integrated area remains constant because the same total amount of analyte elutes - it's just spread over a wider or narrower time window.

```
Sample A (sharper peak):     Sample B (broader peak):
       ████                        ▄▄████▄▄
      ██████         vs           ██████████
     ████████                    ████████████
    ──────────                  ──────────────
    Area = 100                   Area = 100
    Apex = 50                    Apex = 30
```

Both samples contain the same peptide amount, but apex intensities differ. Integrated intensities correctly show equal abundance.
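
The same point can be checked numerically. The sketch below assumes idealized Gaussian elution profiles with identical area but different widths; the widths, sampling interval, and helper names are arbitrary choices for illustration, so the apex values will not match the schematic numbers above.

```python
import math


def gaussian_profile(area, sigma, rt_apex, rts):
    """Sample a Gaussian elution profile with the given total area."""
    height = area / (sigma * math.sqrt(2 * math.pi))
    return [height * math.exp(-0.5 * ((rt - rt_apex) / sigma) ** 2) for rt in rts]


def trapezoid_area(xs, ys):
    """Integrate ys over xs with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


rts = [i * 0.5 for i in range(241)]                  # 0 to 120 s in 0.5 s steps
sharp = gaussian_profile(100.0, 3.0, 60.0, rts)      # narrow peak
broad = gaussian_profile(100.0, 5.0, 60.0, rts)      # same amount, wider peak

for name, profile in [("sharp", sharp), ("broad", broad)]:
    print(f"{name}: apex = {max(profile):.2f}, area = {trapezoid_area(rts, profile):.2f}")
```

The apex of the broader peak is clearly lower, while both integrated areas come out at essentially the same value.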

</details>

<details>
<summary><b>New to interactive Plotly visualizations?</b></summary>

This notebook uses **Plotly** for interactive visualizations. Here's how to use them:

**Navigation controls:**
- **Zoom**: Click and drag to select a region
- **Pan**: Hold Shift + click and drag
- **Reset**: Double-click to reset view
- **Hover**: Move mouse over points to see details

**Toolbar (top-right):**
- 📷 Download as PNG
- 🔍 Zoom/Pan toggle
- ↩️ Reset axes
- 🏠 Reset to original view

**Tips for exploration:**
- Start zoomed out to see overall patterns
- Zoom into regions of interest
- Use hover info to identify specific features

</details>

In [None]:

%matplotlib inline
import os
import pyopenms as oms
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
print("pyOpenMS version:", oms.__version__)


pyOpenMS version: 3.5.0


In [None]:
# Download mzML and idXML files from course repository
if not os.path.exists("UPS1_5min.mzML"):
    !wget -q -O "UPS1_5min.mzML" https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/data/UPS1_5min.mzML

if not os.path.exists("UPS1_5min.idXML"):
    !wget -q -O "UPS1_5min.idXML" https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/data/UPS1_5min.idXML

# Load mzML file into MSExperiment
exp = oms.MSExperiment()
oms.MzMLFile().load("UPS1_5min.mzML", exp)

print(f"Loaded {exp.size()} spectra from UPS1_5min.mzML")

**File formats used:**

| Format | Description |
|--------|-------------|
| **mzML** | Open XML format for mass spectrometry raw data (spectra with m/z and intensity values) |
| **idXML** | OpenMS XML format storing peptide and protein identification results from database searches |
| **featureXML** | OpenMS XML format storing detected features with quantitative properties (RT, m/z, intensity, charge, boundaries) |

# 1. Apply the Biosaur2 algorithm.

**Biosaur2** ([Ivanov et al., 2024](https://doi.org/10.1021/acs.jproteome.4c00513)) is an isotope-aware feature detection method that identifies peptide features by clustering peaks across retention time and evaluating their isotopic patterns and charge state consistency. This strategy enables reliable discrimination between true peptide signals and background noise, particularly in complex LC–MS datasets. The algorithm has been implemented in OpenMS and is available through pyOpenMS as `Biosaur2Algorithm`.

Each detected feature represents an aggregated MS1 signal characterized by a centroid mass-to-charge ratio, a retention time apex, an integrated intensity, and an inferred charge state.

The implementation initializes a `Biosaur2Algorithm()` instance and provides the experimental data as an `MSExperiment` loaded from the mzML file. Feature detection executes with default parameters, yielding a FeatureMap object containing all detected features.

<details>
<summary><b>Deep Dive: How Does Biosaur2 Work?</b></summary>

**Biosaur2's Approach to Feature Detection:**

1. **Peak Detection**: Find centroided peaks in each MS1 spectrum

2. **Isotope Pattern Recognition**: Look for characteristic isotope patterns where peaks are spaced by 1/z (charge state):

![Isotope pattern example](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/DFPIANGER_isoDistribution.png)

3. **Charge State Inference**: Use isotope spacing to determine charge
   - 1.0 m/z spacing → +1
   - 0.5 m/z spacing → +2
   - 0.33 m/z spacing → +3

4. **RT Tracking**: Follow the same isotope pattern across consecutive scans to build the elution profile:

![RT tracking illustration](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/rt_tracking.svg)

5. **Integration**: Sum intensities across all scans to get total abundance

**Why "isotope-aware" matters:**
- Distinguishes overlapping peptides by isotope pattern
- Correct charge state assignment
- Better noise rejection (noise doesn't have isotope patterns)
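
The charge inference step can be sketched as follows. This is a simplified illustration, not Biosaur2's actual code: `infer_charge` is a hypothetical helper, and the only physical input is the spacing between neighbouring isotopes, taken here as roughly 1.00335 Da (the 13C/12C mass difference). The first peak list reuses the m/z values from the peak picking example earlier in this notebook; the other two are made up.

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference in Da


def infer_charge(isotope_mzs, max_charge=6):
    """Estimate the charge from the mean spacing of consecutive isotope peaks."""
    if len(isotope_mzs) < 2:
        raise ValueError("need at least two isotope peaks")
    spacings = [b - a for a, b in zip(isotope_mzs, isotope_mzs[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    charge = round(ISOTOPE_SPACING / mean_spacing)
    # Clamp to a plausible range for peptide ions
    return max(1, min(max_charge, charge))


charge_a = infer_charge([500.25, 501.25, 502.26])     # ~1.0 m/z spacing
charge_b = infer_charge([650.30, 650.80, 651.30])     # ~0.5 m/z spacing
charge_c = infer_charge([433.550, 433.884, 434.219])  # ~0.33 m/z spacing
print(charge_a, charge_b, charge_c)
```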

</details>

In [None]:
# Initialize Biosaur2 feature detection algorithm
biosaur = oms.Biosaur2Algorithm()

# Provide the MS data to the algorithm
biosaur.setMSData(exp)

# Create an empty FeatureMap to store detected features
features = oms.FeatureMap()

# Run feature detection
biosaur.run(features)

print(f"Detected {features.size()} features")

<details>
<summary><b>pyOpenMS Reference: Feature Detection Algorithms</b></summary>

| Algorithm | Best For | Key Characteristics |
|-----------|----------|---------------------|
| `Biosaur2Algorithm` | General DDA/DIA data | Isotope-aware, fast, good default choice |
| `FeatureFinderCentroided` | High-resolution centroided data | Classic OpenMS algorithm, many parameters |
| `FeatureFinderMultiplex` | SILAC, dimethyl labeling | Detects isotope-labeled peptide pairs/triplets |
| `FeatureFinderIdentification` | Targeted extraction | Uses peptide IDs to guide feature detection |

**Common FeatureMap operations:**
```python
# Get number of features
print(f"Detected {features.size()} features")

# Export to pandas DataFrame
df = features.get_df()

# Access individual features
for feature in features:
    print(f"m/z: {feature.getMZ():.4f}, RT: {feature.getRT():.1f}, intensity: {feature.getIntensity():.0f}")

# Save to featureXML
oms.FeatureXMLFile().store("output.featureXML", features)
```

</details>

---

### Quick Check: Feature Detection Results

**Predict first, then verify!**

Before looking at the output above, think about these questions:

1. **Prediction**: Given that we loaded ~150 MS1 spectra from a 5-minute LC-MS run, how many features would you expect to detect? (Pick: <500, 500-2000, >2000)

2. **Prediction**: Do you expect more +2 or +3 charged features for tryptic peptides?

3. **Reasoning**: Why might the number of features differ significantly from the number of MS2 spectra (~1900)?

<details>
<summary><b>Click to check your predictions</b></summary>

**Answer 1: Number of features**

You'd typically expect **>2000 features** from this data. The actual number (~7500) reflects:
- Sample complexity (UPS1 standard contains 48 human proteins)
- High sensitivity detecting low-abundance peptides
- Multiple charge states per peptide counted as separate features
- Some background/noise features that pass detection thresholds

**Answer 2: Charge state**

**+2 is most common** (~2900 features), followed closely by +3 (~2100 features). For tryptic peptides this is expected because:
- Trypsin cleaves after K/R → one positive charge at C-terminus
- N-terminus gains a proton → second positive charge
- Average tryptic peptide: ~10 amino acids → +2 is most stable

Larger peptides (>15 aa) often show +3 due to additional basic residues.

**Answer 3: Features vs MS2 spectra**

We have ~7500 features but only ~1900 MS2 spectra because:
- **More features than MS2**: Not every MS1 feature triggers an MS2 scan (DDA selects only the most intense precursors)
- **Multiple charge states**: Same peptide at different charges = multiple features
- **Low-abundance features**: Many detected features are below the MS2 selection threshold
- **Some MS2 redundancy**: The same peptide may be fragmented multiple times

</details>

---

# 2. Feature map annotations with peptide identifications.

To make quantitative data biologically interpretable, we link MS1-level features with peptide identifications from MS2 database searching. This **ID mapping** step associates each detected feature with its corresponding peptide sequence (if identified), enabling peptide-resolved quantitative analysis.

Peptide and protein identifications are loaded from an idXML file generated by MS/MS database searching. An `IDMapper` instance performs the feature–identification mapping based on proximity in RT and m/z space. The annotated features are stored within the existing `FeatureMap` structure for downstream analysis. See the [pyOpenMS PSM mapping documentation](https://pyopenms.readthedocs.io/en/latest/user_guide/PSM_to_features.html) for details.

<details>
<summary><b>pyOpenMS Reference: IDMapper and Feature Annotation</b></summary>

| Class/Method | Purpose | Example |
|--------------|---------|---------|
| `IDMapper` | Map identifications to features | `mapper = oms.IDMapper()` |
| `.getParameters()` | Get current parameter set | `params = mapper.getParameters()` |
| `.setParameters()` | Set configuration | `mapper.setParameters(params)` |
| `.annotate()` | Perform mapping | `mapper.annotate(features, pep_ids, prot_ids, True, True, exp)` |
| `IdXMLFile().load()` | Load identifications | `oms.IdXMLFile().load("file.idXML", prot_ids, pep_ids)` |

**Key IDMapper parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `rt_tolerance` | 5.0 | RT tolerance in seconds |
| `mz_tolerance` | 20.0 | m/z tolerance (in ppm if `mz_measure` is "ppm") |
| `mz_measure` | "ppm" | Units for m/z tolerance ("ppm" or "Da") |
| `ignore_charge` | false | Match regardless of charge state |

**Example usage:**
```python
# Load identifications
pep_ids = oms.PeptideIdentificationList()
prot_ids = []
oms.IdXMLFile().load("results.idXML", prot_ids, pep_ids)

# Configure mapper
mapper = oms.IDMapper()
params = mapper.getParameters()
params.setValue("rt_tolerance", 10.0)
params.setValue("mz_tolerance", 15.0)
mapper.setParameters(params)

# Annotate features
mapper.annotate(features, pep_ids, prot_ids, True, True, exp)
```

</details>

In [None]:
# Load the identification (.idXML) file and extract peptide and protein identifications
peptide_ids = oms.PeptideIdentificationList()
protein_ids = []
oms.IdXMLFile().load("UPS1_5min.idXML", protein_ids, peptide_ids)
print(f"Loaded {len(peptide_ids)} peptide identifications")

In [None]:
# Configure IDMapper
id_mapper = oms.IDMapper()
params = id_mapper.getParameters()
# RT tolerance: max allowed retention time difference (in seconds) between 
# the MS2 identification and the MS1 feature apex
params.setValue("rt_tolerance", 5.0)
# m/z tolerance: max allowed mass-to-charge difference (in ppm) between
# the precursor selected for MS2 and the feature's centroid m/z  
params.setValue("mz_tolerance", 10.0)
id_mapper.setParameters(params)

In [None]:
# Run the ID mapping: annotate features with peptide identifications
# This links MS2 identifications to MS1 features based on RT and m/z proximity
id_mapper.annotate(features, peptide_ids, protein_ids, True, True, exp)

# Check how many features now have peptide annotations
n_annotated = sum(1 for f in features if f.getPeptideIdentifications())
print(f"Features with peptide IDs: {n_annotated} / {features.size()}")

<details>
<summary><b>Deep Dive: Why Do We Need IDMapper Tolerances?</b></summary>

**The Problem: MS1 vs. MS2 Don't Align Perfectly**

When a peptide elutes, the MS acquires data like this:

```
Time ──────────────────────────────────────────→
       MS1    MS1    MS2    MS1    MS2    MS1
       ↑      ↑      ↑      ↑      ↑      ↑
       │      │      │      │      │      │
       ├──────┼──────┼──────┼──────┼──────┤
       │  Feature detected here (MS1)     │
       │      RT = 1000.5 s               │
       │      m/z = 500.2510              │
       │                                  │
       │  PSM recorded here (MS2)         │
       │      RT = 1001.2 s   ← 0.7s off! │
       │      m/z = 500.2498  ← 2.4 ppm!  │
       └──────────────────────────────────┘
```

**Why the differences?**
- MS2 happens slightly after the precursor was selected from MS1
- m/z calibration may differ slightly between MS levels
- Precursor isolation isn't perfectly centered

**Tolerance guidelines:**

| Parameter | Typical Value | Explanation |
|-----------|---------------|-------------|
| RT tolerance | 5-20 s | Depends on cycle time and peak width |
| m/z tolerance | 5-20 ppm | Depends on instrument calibration |

**What happens with wrong tolerances:**
- **Too tight**: Miss valid matches (features without IDs)
- **Too loose**: Wrong IDs mapped to features (false associations)
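
The numbers in the diagram can be reproduced with a small tolerance check. This is a deliberately simplified sketch of the matching idea, not the `IDMapper` implementation (which also considers charge and has further options); `ppm_error` and `within_tolerances` are hypothetical helpers, and the default tolerances simply mirror the values used in this notebook.

```python
def ppm_error(observed_mz, reference_mz):
    """Signed m/z deviation of observed from reference, in parts per million."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def within_tolerances(feature_rt, feature_mz, psm_rt, psm_mz,
                      rt_tol_s=5.0, mz_tol_ppm=10.0):
    """True if the PSM lies within both the RT and the m/z tolerance."""
    return (abs(psm_rt - feature_rt) <= rt_tol_s
            and abs(ppm_error(psm_mz, feature_mz)) <= mz_tol_ppm)


# Values taken from the diagram above
feature_rt, feature_mz = 1000.5, 500.2510
psm_rt, psm_mz = 1001.2, 500.2498

delta_rt = psm_rt - feature_rt
delta_ppm = ppm_error(psm_mz, feature_mz)
print(f"RT offset: {delta_rt:.1f} s, m/z offset: {delta_ppm:.1f} ppm")
print("match (5 s, 10 ppm):  ", within_tolerances(feature_rt, feature_mz, psm_rt, psm_mz))
print("match (0.1 s, 10 ppm):", within_tolerances(feature_rt, feature_mz, psm_rt, psm_mz, rt_tol_s=0.1))
```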

</details>

---

### Exercise 1: Effect of IDMapper Tolerances

**Predict first, then verify!** This is how scientists think.

1. **Prediction**: If you set `rt_tolerance` to 0.1 seconds (very tight), will you get MORE or FEWER features with peptide identifications?
2. **Why?** Write down your reasoning before looking at the answer.
3. **Verify**: After running the IDMapper, check how many features have peptide IDs.

<details>
<summary><b>Click to reveal the answer</b></summary>

**Answer**: FEWER features with peptide IDs.

**Reasoning**:
- MS2 acquisition happens 0.5-2 seconds after the precursor was selected
- With 0.1 second tolerance, most valid matches will be rejected
- You'd see many features without annotations

**Consequences of tolerance settings:**

| Setting | Result |
|---------|--------|
| **Too tight RT (0.1 s)** | Most IDs fail to map - features stay unannotated |
| **Too loose RT (60 s)** | Wrong IDs might map to adjacent features |
| **Too tight m/z (1 ppm)** | Calibration errors cause missed matches |
| **Too loose m/z (100 ppm)** | At m/z 500, 100 ppm = ±0.05 Da, wide enough to match an unrelated ion with a similar m/z |

**Typical good values:**
- RT: 5-10 seconds for standard DDA data
- m/z: 10-20 ppm for Orbitrap data
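
The trade-off in the table can be explored without rerunning the mapper by sweeping tolerances over a handful of toy values. Everything in this sketch is invented for illustration (the feature and PSM coordinates, `ppm_to_da`, `count_matches`); it mimics the matching logic only roughly and says nothing about the real UPS1 numbers.

```python
def ppm_to_da(mz, ppm):
    """Convert a ppm tolerance into an absolute m/z window at a given m/z."""
    return mz * ppm * 1e-6


# Toy (rt, mz) pairs, invented for illustration only
toy_features = [(1000.5, 500.2510), (1040.0, 500.2600), (1100.0, 742.3900)]
toy_psms = [(1001.2, 500.2498), (1043.5, 500.2610), (1112.0, 742.3910)]


def count_matches(rt_tol_s, mz_tol_ppm):
    """Return (features with at least one PSM, features with more than one PSM)."""
    annotated = ambiguous = 0
    for f_rt, f_mz in toy_features:
        window = ppm_to_da(f_mz, mz_tol_ppm)
        hits = sum(
            1 for p_rt, p_mz in toy_psms
            if abs(p_rt - f_rt) <= rt_tol_s and abs(p_mz - f_mz) <= window
        )
        annotated += hits >= 1
        ambiguous += hits > 1
    return annotated, ambiguous


print(f"100 ppm at m/z 500 = +/-{ppm_to_da(500.0, 100.0):.3f} Da")
for rt_tol, mz_tol in [(0.1, 10.0), (5.0, 10.0), (20.0, 10.0), (60.0, 100.0)]:
    annotated, ambiguous = count_matches(rt_tol, mz_tol)
    print(f"RT {rt_tol:>5} s, m/z {mz_tol:>5} ppm: "
          f"{annotated} annotated, {ambiguous} ambiguous")
```

In this toy set, the tightest setting annotates nothing, while the loosest one annotates every feature but leaves two of them with competing PSMs.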

</details>

---

# 3. Visually inspect detected features in retention time–m/z–intensity space.

Finally, we visualize the detected MS1 features in retention time–m/z–intensity space to qualitatively assess feature detection performance. This 2D representation reveals patterns in peptide elution, identifies potential issues (noise, contaminants, missed features), and provides an intuitive overview of the data.

The detected feature map is first converted into a tabular pandas DataFrame for exploratory analysis (see [export documentation](https://pyopenms.readthedocs.io/en/latest/user_guide/export_pandas_dataframe.html)). We then use pyopenms_viz to generate interactive peak map visualizations (see [pyopenms_viz documentation](https://pyopenms-viz.readthedocs.io/en/latest/)).

<details>
<summary><b>pyOpenMS Reference: Visualization with pyopenms_viz</b></summary>

**Registering the plotting backend:**
```python
import pyopenms_viz  # Registers ms_plotly backend for pandas
```

**Available plot types:**

| Plot Type | pandas syntax | Use Case |
|-----------|--------------|----------|
| Spectrum plot | `df.plot(kind='spectrum', ...)` | MS1/MS2 spectra, mirror plots |
| Chromatogram | `df.plot(kind='chromatogram', ...)` | TIC, XIC, BPC |
| Peak map | `PLOTLYPeakMapPlot(...)` | 2D RT vs m/z visualization |
| Mobilogram | `df.plot(kind='mobilogram', ...)` | Ion mobility data |

**Peak map parameters:**

| Parameter | Description |
|-----------|-------------|
| `x`, `y`, `z` | Column names for RT, m/z, intensity |
| `width`, `height` | Plot dimensions in pixels |
| `grid` | Show grid lines (True/False) |
| `add_marginals` | Show RT and m/z distributions on axes |

**Example:**
```python
from pyopenms_viz._plotly import PLOTLYPeakMapPlot

plot = PLOTLYPeakMapPlot(
    data=df,
    x="rt",
    y="mz", 
    z="intensity",
    width=800,
    height=600,
    add_marginals=True
)
plot.show()
```

</details>

In [None]:
# Export features into dataframe
df = features.get_df()
df.head(2)

| feature_id | peptide_sequence | peptide_score | ID_filename | ID_native_id | charge | rt | mz | rt_start | rt_end | mz_start | mz_end | quality | intensity |
|------------|------------------|---------------|-------------|--------------|--------|----|----|----------|--------|----------|--------|---------|-----------|
| 12592275275208522725 | | | unknown | | 5 | 2272.696289 | 674.534031 | 2252.12915 | 2299.366455 | 674.532496 | 676.139162 | 9.0 | 83422.359375 |
| 17439098422051461769 | | | | | 3 | 2208.773438 | 655.301653 | 2195.811768 | 2217.431396 | 655.297018 | 657.308736 | 7.0 | 230824.5 |


<details>
<summary><b>Understanding the Feature Peak Map</b></summary>

**What does the peak map show?**

The peak map visualizes detected features in 3D space:
- **X-axis (RT)**: Retention time - when the peptide elutes
- **Y-axis (m/z)**: Mass-to-charge ratio of the feature
- **Color/Z-axis (intensity)**: Signal strength (abundance)

```
m/z ↑
    │
900 │         ●    ●              ← High m/z features
    │       ●●●  ●●●
800 │     ●●●●● ●●●●●
    │    ●●●●●●●●●●●●●            ← Most peptides here
700 │   ●●●●●●●●●●●●●●●
    │  ●●●●●●●●●●●●●●●●●
600 │ ●●●●●●●●●●●●●●●●●●●         ← Typical tryptic peptide m/z range
    │●●●●●●●●●●●●●●●●●●●●●
500 │ ●●●●●●●●●●●●●●●●●●●
    └────────────────────────→ RT (seconds)
         2400    2500    2600
```

**Feature bounding boxes:**
```
┌────────────────┐
│  Feature       │  ← Blue rectangle
│  RT: 2450-2470 │     shows the detected
│  m/z: 650-652  │     feature boundaries
└────────────────┘
```

**What patterns to look for:**

| Pattern | Indicates |
|---------|-----------|
| Dense clusters | Complex elution regions |
| Vertical streaks | Co-eluting peptides |
| Empty regions | Gradient void volumes |
| Horizontal bands | Contaminants (same m/z, all RTs) |

</details>

---

### Exercise 2: Explore the Feature DataFrame

**Predict first, then verify!**

Before running the exploration commands below, make predictions:

1. **Prediction**: What percentage of features do you think will have peptide identifications? (Pick: <20%, 20-50%, >50%)
2. **Prediction**: Which charge state will be most common? (+1, +2, +3, or +4?)
3. **Why?** Write down your reasoning.

Now verify your predictions:

```python
# Try these commands:
print(f"Total features detected: {len(df)}")
print(f"Features with peptide IDs: {df['peptide_sequence'].notna().sum()}")
print(f"ID rate: {df['peptide_sequence'].notna().mean():.1%}")
print(f"\nCharge state distribution:\n{df['charge'].value_counts().sort_index()}")
print(f"\nIntensity range: {df['intensity'].min():.0f} to {df['intensity'].max():.0f}")
```

<details>
<summary><b>Click to check your predictions</b></summary>

**Expected observations:**

**1. ID coverage: Typically 20-50% of features have peptide IDs**

Why not 100%?
- Not all MS1 features trigger MS2 scans (DDA limitation)
- Some MS2 spectra fail to identify (low quality, modifications)
- Some features may be contaminants (not in database)
- Multiple charge states of same peptide = multiple features but same ID

**2. Charge distribution: +2 and +3 are typically most common**

Why?
- Tryptic peptides have K/R at C-terminus → at least +1
- N-terminus adds another charge → +2 baseline
- Larger peptides may have internal His, Lys, Arg → +3, +4
- +1 ions are less common (small peptides only)

**3. Quality score meaning:**
- Higher = more confident detection
- Based on isotope pattern fit, peak shape, signal-to-noise
- Can be used to filter low-quality features

</details>

---

In [None]:
# interactive PeakMap plot with plotly
from pyopenms_viz._plotly import PLOTLYPeakMapPlot

plot = PLOTLYPeakMapPlot(
    data=df,
    x="rt",
    y="mz",
    z="intensity",
    width=800,
    height=800,
    grid=False,
    add_marginals=True,  # show marginal distributions along the RT and m/z axes
)

plot.show();

In [None]:
# Plot the peak map with a bounding box drawn at each feature position
plot = PLOTLYPeakMapPlot(
    data=df,
    x="rt",
    y="mz",
    z="intensity",
    width=1000,
    height=1000,
    grid=False,
)

# Create rectangles for all features
shapes = []
for _, row in df.iterrows():
    shapes.append(
        dict(
            type="rect",
            x0=row["rt_start"],
            x1=row["rt_end"],
            y0=row["mz_start"],
            y1=row["mz_end"],
            line=dict(color="blue", width=1)
        )
    )

# Add all rectangles to the plot
plot.fig.update_layout(shapes=shapes)

# Show the interactive plot
plot.show();

In [None]:
# Keep only features that lie entirely within the RT window 2500-2600 s
df_cut = df[(df["rt_start"] >= 2500) & (df["rt_end"] <= 2600)]

# Plot peakmap
plot = PLOTLYPeakMapPlot(
    data=df_cut,
    x="rt",
    y="mz",
    z="intensity",
    width=1000,
    height=1000,
    grid=False,
)

# Create rectangles for filtered features
shapes = []
for _, row in df_cut.iterrows():
    shapes.append(
        dict(
            type="rect",
            x0=row["rt_start"],
            x1=row["rt_end"],
            y0=row["mz_start"],
            y1=row["mz_end"],
            line=dict(color="blue", width=1)
        )
    )

# Add rectangles to the plot
plot.fig.update_layout(shapes=shapes)

# Show the interactive plot
plot.show();

In [None]:
# Overlay raw MS1 peak data with feature boundaries
# This shows how detected features correspond to the underlying spectral data

import plotly.graph_objects as go

# Extract all MS1 peaks from the experiment
ms1_peaks = []
for spectrum in exp:
    if spectrum.getMSLevel() == 1:
        rt = spectrum.getRT()
        mzs, intensities = spectrum.get_peaks()
        for mz, intensity in zip(mzs, intensities):
            ms1_peaks.append({"rt": rt, "mz": mz, "intensity": intensity})

ms1_df = pd.DataFrame(ms1_peaks)

# Filter to a specific RT and m/z window for better visualization
rt_min, rt_max = 2500, 2600
mz_min, mz_max = 600, 900

ms1_filtered = ms1_df[
    (ms1_df["rt"] >= rt_min) & (ms1_df["rt"] <= rt_max) &
    (ms1_df["mz"] >= mz_min) & (ms1_df["mz"] <= mz_max)
]

df_filtered = df[
    (df["rt"] >= rt_min) & (df["rt"] <= rt_max) &
    (df["mz"] >= mz_min) & (df["mz"] <= mz_max)
]

# Create the plot
fig = go.Figure()

# Add raw MS1 peaks as scatter points (subsample for performance)
sample_size = min(50000, len(ms1_filtered))
ms1_sample = ms1_filtered.sample(n=sample_size, random_state=42) if len(ms1_filtered) > sample_size else ms1_filtered

fig.add_trace(go.Scattergl(
    x=ms1_sample["rt"],
    y=ms1_sample["mz"],
    mode="markers",
    marker=dict(
        size=2,
        color=np.log10(ms1_sample["intensity"] + 1),
        colorscale="Viridis",
        opacity=0.5
    ),
    name="MS1 peaks",
    hovertemplate="RT: %{x:.1f}s<br>m/z: %{y:.4f}<extra></extra>"
))

# Add feature bounding boxes
for _, row in df_filtered.iterrows():
    color = "green" if pd.notna(row['peptide_sequence']) else "rgba(0,100,255,0.7)"
    fig.add_shape(
        type="rect",
        x0=row["rt_start"], x1=row["rt_end"],
        y0=row["mz_start"], y1=row["mz_end"],
        line=dict(color=color, width=1.5),
        fillcolor="rgba(0,0,0,0)"
    )

fig.update_layout(
    title="MS1 Peaks with Feature Boundaries (green = identified)",
    xaxis_title="Retention Time (s)",
    yaxis_title="m/z",
    width=1000,
    height=800,
    showlegend=True
)

fig.show()

---

### Exercise 3: Interpret the Feature Map

**Predict first, then verify!** Look at the peak map visualization with feature bounding boxes.

1. **Prediction**: For a single peptide ion, what shape should its feature bounding box have? (tall and narrow, short and wide, or roughly square?)

2. **Observation**: Do you see any features that overlap in m/z? What might this indicate?

3. **Exploration**: Zoom into a region with many features. Can you identify isotope patterns (vertically stacked peak traces within a bounding box, spaced by ~0.5 m/z for +2 or ~0.33 m/z for +3)?

<details>
<summary><b>Click to check your predictions</b></summary>

**Answer 1: Feature shape**

Features should be **short and wide** (spans RT, narrow in m/z):
- **RT range**: Peptides elute over 10-60 seconds → wide
- **m/z range**: Includes isotopes (M, M+1, M+2) → narrow (typically <2 Da)

```
m/z ↑
    │  ┌──────────────────┐  ← Short in m/z (isotope envelope)
    │  │                  │
    │  └──────────────────┘  ← Wide in RT (elution time)
    └────────────────────────→ RT
```
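
As a quick check, the box dimensions can be computed directly from the boundary columns. The sketch below hard-codes the two example rows printed by `df.head(2)` earlier in this notebook so that it runs on its own; with the full DataFrame the same arithmetic would be `df["rt_end"] - df["rt_start"]` and `df["mz_end"] - df["mz_start"]`.

```python
# Boundary values copied from the two example rows shown by df.head(2)
example_rows = [
    {"rt_start": 2252.12915, "rt_end": 2299.366455,
     "mz_start": 674.532496, "mz_end": 676.139162},
    {"rt_start": 2195.811768, "rt_end": 2217.431396,
     "mz_start": 655.297018, "mz_end": 657.308736},
]

rt_widths = [row["rt_end"] - row["rt_start"] for row in example_rows]
mz_heights = [row["mz_end"] - row["mz_start"] for row in example_rows]

for width, height in zip(rt_widths, mz_heights):
    print(f"RT width: {width:.1f} s, m/z height: {height:.2f}")
```

Both example features span tens of seconds in RT but only about 1.6 to 2.0 units in m/z. How "wide" the box looks on screen still depends on the axis scaling of the plot.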

**Answer 2: Overlapping features**

Features that overlap in m/z but elute at different RTs are distinct signals with similar m/z. Possible causes:
- Same peptide eluting multiple times (carryover)
- Isobaric peptides (different sequence, same mass)
- Same peptide from different proteins

Features overlapping in BOTH m/z AND RT might indicate:
- Co-eluting isobaric peptides (challenging for quantification)

Multiple charge states of the same peptide, by contrast, co-elute but appear at different m/z. Their m/z values are related through the neutral mass: (m/z₁ - 1.00728) × z₁ ≈ (m/z₂ - 1.00728) × z₂
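
The relationship between charge states of one peptide can be checked numerically. The sketch assumes protonated ions, so that each charge adds one proton mass (about 1.007276 Da); the neutral mass used here is a hypothetical value, and both helper functions are illustrative rather than part of pyOpenMS.

```python
PROTON_MASS = 1.007276  # Da, mass added per charge for protonated ions


def mz_at_charge(neutral_mass, charge):
    """m/z of a protonated ion with the given neutral mass and charge."""
    return neutral_mass / charge + PROTON_MASS


def neutral_mass_from_mz(mz, charge):
    """Neutral mass recovered from an observed m/z and charge."""
    return (mz - PROTON_MASS) * charge


peptide_mass = 1299.60  # hypothetical peptide neutral mass in Da
mz_2plus = mz_at_charge(peptide_mass, 2)
mz_3plus = mz_at_charge(peptide_mass, 3)

print(f"+2: m/z {mz_2plus:.4f}, +3: m/z {mz_3plus:.4f}")
print(f"Recovered masses: {neutral_mass_from_mz(mz_2plus, 2):.4f}, "
      f"{neutral_mass_from_mz(mz_3plus, 3):.4f}")
```

Two co-eluting features whose recovered neutral masses agree within the mass tolerance are good candidates for being the same peptide in different charge states.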

**Answer 3: Isotope patterns**

For a +2 charged peptide, you should see isotope peaks spaced ~0.5 m/z apart:
- Monoisotopic peak (lightest)
- M+1 (+0.5 m/z)
- M+2 (+1.0 m/z)

Each isotope peak is tracked as part of the same feature in Biosaur2's isotope-aware detection.

</details>

---

## Summary

Congratulations! You've completed the proteomics data analysis workflow. Here's what you learned:

| Step | Concept | Key pyOpenMS Tool |
|------|---------|-------------------|
| **Feature detection** | Group MS1 peaks across RT into features | `Biosaur2Algorithm` |
| **Isotope awareness** | Use isotope patterns to infer charge | Built into Biosaur2 |
| **ID mapping** | Link MS2 identifications to MS1 features | `IDMapper` |
| **Tolerance settings** | Account for measurement variability | RT and m/z tolerances |
| **Data export** | Convert to pandas for analysis | `FeatureMap.get_df()` |
| **Visualization** | Interactive peak maps with annotations | `pyopenms_viz` |

---

<details>
<summary><b>Putting It All Together: Complete Workflow Reflection</b></summary>

**Test your understanding of the complete proteomics workflow!**

You've now completed all three notebooks. Think about how they connect:

**Question 1**: A peptide has been identified by database search (Notebook 2) but wasn't found as a feature (Notebook 3). What could explain this?

<details>
<summary>Answer</summary>

Possible explanations:
- Peptide had low MS1 signal (below feature detection threshold)
- MS2 was acquired during co-elution with another peptide (chimeric)
- Feature detection parameters were too strict
- The peptide eluted at the edge of the gradient (poor peak shape)

</details>

**Question 2**: You found a feature with high intensity (Notebook 3) but it has no peptide identification. What could you do to investigate?

<details>
<summary>Answer</summary>

Investigation steps:
- Check if MS2 was acquired for this precursor (DDA fragments only the most intense precursors in each cycle and may have skipped it)
- Look at the m/z - is it in the typical peptide range?
- Check the isotope pattern - does it look like a peptide?
- Consider: could it be a contaminant, lipid, or metabolite?
- Try searching with different modifications or a larger database

</details>

**Question 3**: How do mass tolerances connect across the workflow?

<details>
<summary>Answer</summary>

Mass tolerance appears at multiple stages:
1. **Notebook 2 (Candidate selection)**: Precursor mass ± tolerance → candidate peptides
2. **Notebook 2 (Alignment)**: Fragment m/z ± tolerance → matched ions
3. **Notebook 3 (IDMapper)**: Feature m/z ± tolerance → mapped identifications

These should all be set according to your instrument's accuracy. Inconsistent tolerances can cause problems (e.g., strict ID tolerance but loose feature mapping could mis-assign IDs).
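
To make the "± tolerance" idea concrete, the following sketch converts a relative tolerance in ppm into an absolute m/z window. The helpers `ppm_window` and `within_tolerance` are hypothetical names written for this example; they do not reproduce how `IDMapper` or the search engine implement their tolerance checks, and the m/z values are made up.

```python
def ppm_window(mz: float, tol_ppm: float) -> tuple:
    """Lower and upper m/z bounds for a symmetric ppm tolerance around mz."""
    delta = mz * tol_ppm * 1e-6
    return mz - delta, mz + delta

def within_tolerance(observed_mz: float, reference_mz: float,
                     tol_ppm: float) -> bool:
    """True if observed_mz lies inside the ppm window around reference_mz."""
    low, high = ppm_window(reference_mz, tol_ppm)
    return low <= observed_mz <= high

# Hypothetical feature at m/z 600.0 with a 10 ppm tolerance
low, high = ppm_window(600.0, 10.0)
print(f"10 ppm window at m/z 600: {low:.4f} to {high:.4f}")

# Two hypothetical identification precursors near that feature
for precursor_mz in (600.004, 600.010):
    ok = within_tolerance(precursor_mz, 600.0, 10.0)
    print(f"precursor {precursor_mz:.3f}: {'mapped' if ok else 'not mapped'}")
```

Because a ppm tolerance scales with m/z, the same setting gives a wider absolute window at m/z 1200 than at m/z 600, which is worth keeping in mind when comparing tolerances given in ppm and in Da.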

</details>

</details>

---

## Complete Workflow Summary

**The three notebooks covered:**

| Notebook | Topic | Key Steps |
|----------|-------|-----------|
| **Notebook 1** | Raw Data | Protein digestion → LC-MS → Spectra (mzML) |
| **Notebook 2** | Identification | MS2 spectra → Theoretical spectra → Match & Score → PSMs (idXML) |
| **Notebook 3** | Quantification | MS1 spectra → Feature detection → ID mapping → Annotated features |

---

## Bonus Challenges

<details>
<summary><b>Challenge 1 (Beginner): Filter by Quality</b></summary>

Filter the DataFrame to keep only high-quality features:

```python
# Filter features with quality > 5
df_high_quality = df[df['quality'] > 5]
print(f"High quality features: {len(df_high_quality)} / {len(df)}")

# Compare ID rates
original_id_rate = df['peptide_sequence'].notna().mean()
filtered_id_rate = df_high_quality['peptide_sequence'].notna().mean()
print(f"ID rate: {original_id_rate:.1%} → {filtered_id_rate:.1%}")
```

**Question**: Does filtering by quality improve the identification rate?

</details>

<details>
<summary><b>Challenge 2 (Intermediate): Visualize Identified Features with Peptide Sequences</b></summary>

Create a visualization that highlights identified features in green and adds peptide sequence labels. This helps you see which peptides were successfully quantified.

**Task**: Modify the peak map to:
1. Show identified features in green, unidentified in blue
2. Add peptide sequence annotations to identified features

<details>
<summary><b>Click to reveal solution</b></summary>

```python
import pandas as pd
import plotly.graph_objects as go

# Filter to a region with identified features
df_viz = df[(df["rt"] >= 2450) & (df["rt"] <= 2650) & 
            (df["mz"] >= 500) & (df["mz"] <= 900)]

fig = go.Figure()

# Add feature rectangles with color coding
shapes = []
annotations = []

for _, row in df_viz.iterrows():
    has_id = pd.notna(row['peptide_sequence'])
    color = "green" if has_id else "rgba(100,100,255,0.5)"
    width = 2 if has_id else 1
    
    shapes.append(dict(
        type="rect",
        x0=row["rt_start"], x1=row["rt_end"],
        y0=row["mz_start"], y1=row["mz_end"],
        line=dict(color=color, width=width),
        fillcolor="rgba(0,255,0,0.1)" if has_id else "rgba(0,0,0,0)"
    ))
    
    # Add sequence label for identified features
    if has_id:
        seq = row['peptide_sequence']
        # Truncate long sequences
        label = seq if len(seq) <= 12 else seq[:10] + "..."
        annotations.append(dict(
            x=(row["rt_start"] + row["rt_end"]) / 2,
            y=row["mz_end"] + 2,  # Position above the box
            text=f"{label} (+{row['charge']})",
            showarrow=False,
            font=dict(size=8, color="darkgreen"),
            textangle=0
        ))

# Add a scatter trace for the legend
fig.add_trace(go.Scatter(x=[None], y=[None], mode='markers',
    marker=dict(size=10, color='green'), name='Identified'))
fig.add_trace(go.Scatter(x=[None], y=[None], mode='markers',
    marker=dict(size=10, color='blue'), name='Unidentified'))

fig.update_layout(
    shapes=shapes,
    annotations=annotations,
    title="Feature Map with Peptide Sequence Labels",
    xaxis_title="Retention Time (s)",
    yaxis_title="m/z",
    width=1200,
    height=900,
    showlegend=True
)

fig.show()
```

**Observations to make:**
- Are identified features clustered in certain RT/m/z regions?
- Do identified features tend to have higher intensities?
- Can you spot the same peptide at different charge states?

</details>

</details>

<details>
<summary><b>Challenge 3 (Advanced): Intensity Distribution Analysis</b></summary>

Analyze whether feature intensity affects identification:

```python
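import matplotlib.pyplot as plt
import numpy as np

# Split features by identification status
identified = df[df['peptide_sequence'].notna()]['intensity']
unidentified = df[df['peptide_sequence'].isna()]['intensity']

# Keep positive intensities only (log10 is undefined for zero)
identified = identified[identified > 0]
unidentified = unidentified[unidentified > 0]

# Plot overlaid histograms on a log10 intensity scale
plt.figure(figsize=(10, 5))
plt.hist(np.log10(identified), bins=50, alpha=0.5, label='Identified')
plt.hist(np.log10(unidentified), bins=50, alpha=0.5, label='Unidentified')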
plt.xlabel('log10(Intensity)')
plt.ylabel('Count')
plt.legend()
plt.title('Intensity Distribution: Identified vs Unidentified Features')
plt.show()
```

**Question**: Are higher-intensity features more likely to be identified? Why?

</details>

<details>
<summary><b>Challenge 4 (Expert): Summarize Peptide Quantities Across Charge States</b></summary>

In proteomics, the same peptide often appears as multiple features with different charge states (+2, +3, etc.). For accurate peptide-level quantification, we need to combine these into a single abundance value per peptide sequence.

**Task**: Group features by peptide sequence and sum their intensities to get peptide-level quantities.

<details>
<summary><b>Click to reveal solution</b></summary>

```python
import matplotlib.pyplot as plt

# Filter to identified features only
identified_df = df[df['peptide_sequence'].notna()].copy()

print(f"Identified features: {len(identified_df)}")
print(f"Unique peptide sequences: {identified_df['peptide_sequence'].nunique()}")

# Group by peptide sequence and aggregate
peptide_quantities = identified_df.groupby('peptide_sequence').agg({
    'intensity': 'sum',           # Sum intensities across charge states
    'charge': lambda x: list(x),  # List all observed charge states
    'rt': 'mean',                 # Average RT (should be similar)
    'mz': 'first',                # Representative m/z
    'quality': 'mean'             # Average quality
}).reset_index()

# Add charge state count
peptide_quantities['n_charge_states'] = peptide_quantities['charge'].apply(len)
peptide_quantities['charge_states'] = peptide_quantities['charge'].apply(
    lambda x: '+' + ', +'.join(map(str, sorted(set(x))))
)

# Sort by intensity (most abundant first)
peptide_quantities = peptide_quantities.sort_values('intensity', ascending=False)

# Display results
print("\nPeptide-level quantities (top 10 by intensity):")
print(peptide_quantities[['peptide_sequence', 'intensity', 'n_charge_states', 
                          'charge_states', 'rt']].head(10).to_string(index=False))

# Visualize: peptides with multiple charge states
multi_charge = peptide_quantities[peptide_quantities['n_charge_states'] > 1]
print(f"\nPeptides detected in multiple charge states: {len(multi_charge)}")

# Plot intensity contribution by charge state for a peptide
if len(multi_charge) > 0:
    example_peptide = multi_charge.iloc[0]['peptide_sequence']
    example_features = identified_df[identified_df['peptide_sequence'] == example_peptide]
    
    plt.figure(figsize=(8, 4))
    plt.bar([f"+{c}" for c in example_features['charge']], 
            example_features['intensity'])
    plt.xlabel('Charge State')
    plt.ylabel('Feature Intensity')
    plt.title(f'Intensity by Charge State: {example_peptide}')
    plt.show()
    
    print(f"\nExample: {example_peptide}")
    print(f"  Total peptide intensity: {example_features['intensity'].sum():.0f}")
    print(f"  Charge states: {list(example_features['charge'].values)}")
```

**Discussion questions:**
1. What percentage of identified peptides appear in multiple charge states?
2. Does one charge state typically dominate the intensity?
3. Why is summing intensities across charge states important for comparing peptide abundances between samples?

</details>

</details>

---

## What's Next?

You now have the foundation to explore more advanced topics:

- **Label-free quantification (LFQ)**: Compare peptide abundances across samples
- **Isobaric labeling (TMT/iTRAQ)**: Multiplex quantification using reporter ions
- **Data-independent acquisition (DIA)**: Alternative to DDA that fragments all precursors in wide isolation windows, which typically gives fewer missing values across runs
- **Statistical analysis**: Differential expression between conditions
- **Protein inference**: Roll up peptide quantities to protein level

**Resources:**
- [pyOpenMS Documentation](https://pyopenms.readthedocs.io/)
- [OpenMS Tutorials](https://openms.readthedocs.io/en/latest/tutorials/index.html)

---

**Previous notebooks:** [Notebook 1 - Peaks](EUBIC_Task1_Peaks.ipynb) | [Notebook 2 - Identification](EUBIC_Task2_ID.ipynb)