<a href="https://colab.research.google.com/github/timosachsenberg/EuBIC2026/blob/main/notebooks/EUBIC_Task2_ID.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dependencies (for Google Colab)
# Quote the requirement specifiers so the shell does not treat ">" as output redirection
!pip install -q "pyopenms>=3.5.0" "pyopenms-viz>=1.0.0"

# Notebook 2 – Peptide Identification by Database Search

> **Prerequisites**: This notebook builds on concepts from [Notebook 1 - From Proteins to Spectra](EUBIC_Task1_Peaks.ipynb). You should be familiar with protein digestion, MS1/MS2 spectra, and isotope patterns.

In the previous notebook, we explored how proteins are digested into peptides and how LC–MS data is structured. Now we tackle the central challenge of proteomics: **identifying which peptides produced the observed spectra**.

**Database search** is a peptide identification strategy where observed spectra are compared against theoretical spectra generated from a protein sequence database. Each spectrum is matched to candidate peptides whose precursor mass falls within tolerance, and the best-matching peptide is selected based on a scoring function.

This notebook implements a simplified peptide database search workflow—the same conceptual approach used by tools like Comet, MS-GF+, and Sage. We will:

1. **Digest proteins** and compute theoretical peptide masses
2. **Match precursor masses** to find candidate peptides for each spectrum  
3. **Generate theoretical fragment spectra** for each candidate
4. **Align and score** observed vs. theoretical spectra
5. **Select the best-matching peptide** for each spectrum
6. **Visualize** the match with an interactive mirror plot

![Database search workflow illustration](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/database_search_illustration.png)

*Image: Overview of database search workflow - observed spectra are matched against theoretical spectra generated from a protein sequence database. Source: [pyOpenMS documentation](https://pyopenms.readthedocs.io).*

---

<details>
<summary><b>Quick Reference: Key Terms Used in This Notebook</b></summary>

| Term | Definition |
|------|------------|
| **Precursor** | The intact peptide ion selected for fragmentation in MS2 |
| **MS2** | Tandem mass spectrometry - fragmentation spectrum of a selected precursor |
| **b-ion** | N-terminal fragment ion from peptide backbone cleavage |
| **y-ion** | C-terminal fragment ion from peptide backbone cleavage |
| **PSM** | Peptide-Spectrum Match - pairing of spectrum with identified peptide |
| **Tolerance** | Maximum allowed mass difference for peak matching |
| **ppm** | Parts per million - relative mass tolerance that scales with m/z |
| **FDR** | False Discovery Rate - estimated proportion of incorrect identifications |
| **Monoisotopic mass** | Mass calculated using the most abundant isotope of each element (for C, H, N, O, and S this is also the lightest) |

</details>

<details>
<summary><b>Common Errors and How to Fix Them</b></summary>

**1. FileNotFoundError: FASTA or mzML file not found**
```python
# Error: FileNotFoundError: [Errno 2] No such file or directory: 'two_ups_proteins.fasta'
```
**Cause**: The data file wasn't downloaded or the path is wrong.

**Fix**: Make sure the download cell ran successfully:
```python
import os
print(os.listdir("."))  # Check what files exist
```

---

**2. No candidate peptides found (empty DataFrame)**
```python
# Output: Spectra with candidate peptides: 0
```
**Cause**: Mass tolerance too tight, or precursor masses don't match database.

**Fix**: 
- Increase tolerance: `absolute_tolerance=0.5` instead of `0.1`
- Check that the FASTA contains proteins from your sample
- Verify precursor masses are being extracted correctly

---

**3. All scores are zero**
```python
# Output: scores = [0, 0, 0, ...]
```
**Cause**: Alignment found no matching peaks.

**Fix**:
- Increase fragment tolerance: `p.setValue("tolerance", 1000.0)` (1000 ppm)
- Check that theoretical spectra are being generated: `print(len(theo_spectrum))`
- Verify the spectrum has peaks: `print(len(observed_spectrum))`

---

**4. ModuleNotFoundError: No module named 'pyopenms'**
```python
# Error: ModuleNotFoundError: No module named 'pyopenms'
```
**Fix**: Install the package:
```python
!pip install "pyopenms>=3.5.0"
```

---

**5. Empty theoretical spectrum (0 peaks)**
```python
# Output: theo_spectrum has 0 peaks
```
**Cause**: Invalid peptide sequence or wrong charge state.

**Fix**:
- Check peptide is valid: `print(peptide.toString())`
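- Ensure the precursor charge is at least 2: fragment charges run from 1 to `min(2, precursor_charge - 1)`, so a precursor charge of 1 (or an unset charge of 0) yields no fragments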
- Verify parameters: `params.setValue("add_b_ions", "true")`

---

**6. Mirror plot not displaying**

**Cause**: Plotly renderer not configured for your environment.

**Fix for Jupyter**:
```python
import plotly.io as pio
pio.renderers.default = "notebook"  # or "jupyterlab" or "colab"
```

</details>

<details>
<summary><b>New to pandas DataFrames?</b></summary>

This notebook uses **pandas DataFrames** extensively. Here's a quick primer:

```python
import pandas as pd

# DataFrames are like spreadsheets with named columns
df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'score': [95, 87]
})

# Access columns
df['score']              # Get the 'score' column
df[['name', 'score']]    # Get multiple columns

# Filter rows
df[df['score'] > 90]     # Rows where score > 90

# Apply functions to columns
df['score'].apply(lambda x: x * 2)  # Double all scores

# Iterate over rows
for idx, row in df.iterrows():
    print(row['name'], row['score'])
```

**Key operations used in this notebook:**
- `.copy()` - Creates independent copy to avoid warnings
- `.head(n)` - First n rows
- `.apply()` - Apply function to each row/element
- `.at[idx, col]` - Access specific cell

</details>

## Overview

Having established a foundation in enzymatic digestion and mass-spectral visualization, we now turn to the central task of peptide identification through database search. Each step mirrors the core logic employed by modern search engines, but implemented transparently for educational purposes.

**Workflow steps:**

1. **Compute monoisotopic peptide masses** – Calculate the neutral mass for every peptide from in-silico digestion. We'll use `AASequence.getMonoWeight()` to compute the exact mass from elemental composition.

2. **Compute precursor masses for MS/MS spectra** – For each MS2 spectrum, derive the neutral precursor mass from m/z and charge. The precursor information is accessed via `getPrecursors()` from each spectrum's metadata.

3. **Select candidate peptides** – Compare precursor masses to find peptides within mass tolerance. Using vectorized NumPy comparison, we filter to peptides within a configurable mass window.

4. **Generate theoretical fragment spectra** – Create b/y ion spectra for each candidate peptide. `TheoreticalSpectrumGenerator` predicts fragment m/z values based on sequence and charge state.

5. **Align observed and theoretical spectra** – Match experimental peaks to theoretical fragments. `SpectrumAlignment` pairs peaks within a specified tolerance, returning matched index pairs.

6. **Score peptide–spectrum matches** – Count matched b/y ions as a simple scoring function. Ion annotations stored in the spectrum's metadata let us identify which fragment types matched.

7. **Select best candidate** – Choose the highest-scoring peptide for each spectrum. A simple `argmax` over candidate scores identifies the most likely peptide identity.

8. **Visualize with mirror plot** – Compare experimental and theoretical spectra interactively. `pyopenms_viz` renders both spectra with matched peaks annotated for quality assessment.
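
Steps 6 and 7 reduce to a count followed by an argmax. The following is a minimal sketch on hypothetical toy data, not the notebook's implementation: the peptide strings, the ion labels, and the helper `count_matched_ions` are made up for illustration, and it assumes every theoretical peak carries a label that starts with `b` or `y`.

```python
def count_matched_ions(annotations, matched_theo_indices):
    """Count matched theoretical peaks whose label starts with 'b' or 'y'."""
    return sum(1 for i in matched_theo_indices if annotations[i][:1] in ("b", "y"))

# Hypothetical toy data: for each candidate, the labels of its theoretical peaks
# and the indices of the peaks that found a partner in the observed spectrum.
candidates = {
    "PEPTIDEK": (["b2+", "y1+", "b3+", "y2+", "y3+"], [0, 1, 3, 4]),
    "TIDEPEPK": (["b2+", "y1+", "b3+", "y2+", "y3+"], [1]),
}

scores = {
    seq: count_matched_ions(labels, matched)
    for seq, (labels, matched) in candidates.items()
}
best = max(scores, key=scores.get)  # argmax over candidate scores
print(scores, "->", best)
```

The two toy candidates are permutations of each other, so they share a precursor mass and can only be told apart by their fragments, which is exactly why the fragment-level score is needed.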

In [None]:
%matplotlib inline
import os
import pyopenms as oms
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
print("pyOpenMS version:", oms.__version__)

In [None]:
# Download FASTA from course repository (2 UPS1 proteins: Complement C5 and EGF)
if not os.path.exists("two_ups_proteins.fasta"):
    !wget -q -O "two_ups_proteins.fasta" "https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/data/two_ups_proteins.fasta"

# Download mzML (5-minute subset of UPS1 spike-in experiment, ~36MB)
if not os.path.exists("UPS1_5min.mzML"):
    !wget -q -O "UPS1_5min.mzML" "https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/data/UPS1_5min.mzML"

In [None]:
# Digest proteins into peptides (same approach as Notebook 1)
def preprocess_database(fasta_path, min_len_pept=6, max_len_pept=30, missed_cleavages=2):
    """
    Load a FASTA file and digest all proteins with Trypsin.
    Returns a list of unique peptides (as AASequence objects).
    """
    database_entries = []
    f = oms.FASTAFile()
    f.load(fasta_path, database_entries)

    # Configure protein digestion with Trypsin
    dig = oms.ProteaseDigestion()
    dig.setEnzyme("Trypsin")
    dig.setMissedCleavages(missed_cleavages)

    peptides = []
    for entry in database_entries:
        protein = oms.AASequence(entry.sequence)
        peptides_ = []
        dig.digest(protein, peptides_, min_len_pept, max_len_pept)
        peptides.extend(peptides_)

    # Remove duplicate peptides (same sequence from different proteins or missed cleavage variants)
    seen = set()
    unique_peptides = []
    for pep in peptides:
        seq_str = pep.toString()
        if seq_str not in seen:
            seen.add(seq_str)
            unique_peptides.append(pep)

    return unique_peptides

<details>
<summary><b>pyOpenMS Reference: FASTAFile & ProteaseDigestion</b></summary>

| Class/Method | Purpose | Documentation |
|--------------|---------|---------------|
| `FASTAFile()` | Read/write FASTA format files | [FASTAFile docs](https://pyopenms.readthedocs.io/en/latest/apidoc/pyopenms.html#pyopenms.FASTAFile) |
| `.load(path, entries)` | Load FASTA into list of entries | |
| `ProteaseDigestion()` | Configure enzymatic digestion | [ProteaseDigestion docs](https://pyopenms.readthedocs.io/en/latest/apidoc/pyopenms.html#pyopenms.ProteaseDigestion) |
| `.setEnzyme(name)` | Set enzyme (e.g., "Trypsin") | |
| `.setMissedCleavages(n)` | Allow up to n missed cleavages | |
| `.digest(protein, peptides, min_len, max_len)` | Digest protein into peptides | |
| `AASequence(string)` | Create amino acid sequence | [AASequence docs](https://pyopenms.readthedocs.io/en/latest/apidoc/pyopenms.html#pyopenms.AASequence) |

See also: [Peptides and Proteins Tutorial](https://pyopenms.readthedocs.io/en/latest/user_guide/peptides_proteins.html)
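
For intuition about what `ProteaseDigestion` produces, here is a pure-Python sketch of a tryptic digest. It assumes the common rule "cleave after K or R unless the next residue is P"; the function `tryptic_digest` and the toy sequence are hypothetical, and pyOpenMS's own enzyme definitions may differ in edge cases.

```python
import re

def tryptic_digest(protein, missed_cleavages=2, min_len=6, max_len=30):
    """Return tryptic peptides (as strings), allowing up to `missed_cleavages`."""
    # Cleavage positions: directly after K or R, unless followed by P
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", protein)]
    bounds = [0] + [s for s in sites if s < len(protein)] + [len(protein)]

    peptides = []
    for i in range(len(bounds) - 1):
        # Join 1 .. missed_cleavages + 1 consecutive fragments
        last = min(i + 2 + missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            pep = protein[bounds[i]:bounds[j]]
            if min_len <= len(pep) <= max_len:
                peptides.append(pep)
    return peptides

# Toy sequence: the R is followed by P, so no cleavage happens there
print(tryptic_digest("AAAKGGGRPLLLKCCC", missed_cleavages=0, min_len=1))
# ['AAAK', 'GGGRPLLLK', 'CCC']
```

Raising `missed_cleavages` adds longer peptides that span one or more uncleaved sites, which is why the peptide list grows quickly with that parameter.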

</details>

In [None]:
# Digest the UPS proteins database
peptides = preprocess_database("two_ups_proteins.fasta")
print(f"Total unique peptides after digestion: {len(peptides)}")

## 1. Compute the Monoisotopic Mass of Each Peptide

![Database search workflow illustration](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/database_search_illustration_digest.png)

After protein digestion, the peptide identification workflow starts with calculating the **monoisotopic mass** of each peptide. This mass serves as the primary filter for matching spectra to candidate sequences—only peptides whose theoretical mass falls within tolerance of the observed precursor mass are considered for further scoring.

The `getMonoWeight()` method from `AASequence` computes a peptide's monoisotopic mass from its elemental composition. See the [pyOpenMS peptide documentation](https://pyopenms.readthedocs.io/en/latest/user_guide/peptides_proteins.html) for details.
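
As a cross-check on what such a calculation involves, the neutral monoisotopic mass of an unmodified peptide is the sum of its residue masses plus one water. The sketch below uses standard monoisotopic residue masses rounded to five decimals; the helper `mono_mass` is hypothetical and ignores modifications, so treat it as an illustration rather than a replacement for `getMonoWeight()`.

```python
# Monoisotopic residue masses in Da (amino acid minus H2O), rounded
MONO_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # H2O added for the free N- and C-terminus

def mono_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO_RESIDUE_MASS[aa] for aa in sequence) + WATER

print(f"PEPTIDE: {mono_mass('PEPTIDE'):.4f} Da")
```

Leucine and isoleucine share the same mass, so peptides that differ only by an L/I swap cannot be separated by precursor mass.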

<details>
<summary><b>pyOpenMS Reference: AASequence Mass Calculation</b></summary>

| Method | Purpose | Documentation |
|--------|---------|---------------|
| `AASequence.getMonoWeight()` | Get monoisotopic mass (neutral) | [AASequence docs](https://pyopenms.readthedocs.io/en/latest/apidoc/pyopenms.html#pyopenms.AASequence) |
| `AASequence.getAverageWeight()` | Get average mass (isotope-averaged) | |
| `AASequence.getMZ(charge)` | Get m/z for given charge state | Adds protons and divides by charge |
| `AASequence.getFormula()` | Get elemental formula | Returns `EmpiricalFormula` object |

**Mass types:**
- **Monoisotopic**: Uses lightest stable isotope of each element (for high-res MS)
- **Average**: Weighted average by natural isotope abundance (for low-res MS)

**Example:**
```python
peptide = oms.AASequence.fromString("PEPTIDER")
print(f"Monoisotopic mass: {peptide.getMonoWeight():.4f} Da")
print(f"Average mass: {peptide.getAverageWeight():.4f} Da")
print(f"m/z at +2: {peptide.getMZ(2):.4f}")
```

See also: [Peptides and Proteins Tutorial](https://pyopenms.readthedocs.io/en/latest/user_guide/peptides_proteins.html)

</details>

In [None]:
# Function to compute monoisotopic masses for a list of peptides
def get_peptide_weights(peptides):
    return [p.getMonoWeight() for p in peptides]

# calculate the monoisotopic masses of peptides
peptides_weight = get_peptide_weights(peptides)
peptides_weight[:5] # Preview a few weights

<details>
<summary><b>pyOpenMS Reference: Precursor Information</b></summary>

| Class/Method | Purpose | Documentation |
|--------------|---------|---------------|
| `MSSpectrum.getPrecursors()` | Get list of precursor ions (MS2) | [MSSpectrum docs](https://pyopenms.readthedocs.io/en/latest/apidoc/pyopenms.html#pyopenms.MSSpectrum) |
| `Precursor.getMZ()` | Get precursor m/z | |
| `Precursor.getCharge()` | Get precursor charge state | |
| `Precursor.getUnchargedMass()` | Get neutral mass (removes protons) | Calculates: (m/z × charge) - (charge × proton_mass) |
| `Precursor.getIntensity()` | Get precursor intensity | |

**Example:**
```python
for spec in exp:
    if spec.getMSLevel() == 2:
        precursors = spec.getPrecursors()
        if len(precursors) > 0:
            prec = precursors[0]
            print(f"m/z: {prec.getMZ():.4f}, charge: +{prec.getCharge()}")
            print(f"Neutral mass: {prec.getUnchargedMass():.4f} Da")
```

See also: [MS Data Tutorial](https://pyopenms.readthedocs.io/en/latest/user_guide/ms_data.html)

</details>

<details>
<summary><b>pyOpenMS Reference: PeakFileOptions</b></summary>

| Class/Method | Purpose | Documentation |
|--------------|---------|---------------|
| `PeakFileOptions()` | Configure file loading options | [PeakFileOptions docs](https://pyopenms.readthedocs.io/en/latest/apidoc/pyopenms.html#pyopenms.PeakFileOptions) |
| `.setMSLevels(levels)` | Only load specified MS levels | `[1]` for MS1 only, `[2]` for MS2 only |
| `.setRTRange(range)` | Limit retention time range | `oms.DRange1(start, end)` |
| `.setMZRange(range)` | Limit m/z range | `oms.DRange1(start, end)` |
| `MzMLFile.setOptions(opts)` | Apply options before loading | |

**Example - Load only MS2 spectra:**
```python
options = oms.PeakFileOptions()
options.setMSLevels([2])  # Only MS2

mzml = oms.MzMLFile()
mzml.setOptions(options)

exp = oms.MSExperiment()
mzml.load("data.mzML", exp)
print(f"Loaded {exp.size()} MS2 spectra")
```

**Example - Load specific RT range:**
```python
options = oms.PeakFileOptions()
options.setRTRange(oms.DRange1(1000, 2000))  # RT 1000-2000s
```

See also: [MS Data Tutorial](https://pyopenms.readthedocs.io/en/latest/user_guide/ms_data.html)

</details>

## 2. Calculate the Precursor Mass of MS2 Spectra

![Database search workflow illustration](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/database_search_illustration_precursor.png)

With peptide masses computed, the next step is extracting **precursor masses** from the MS2 spectra. Each MS2 spectrum contains metadata about the precursor ion that was selected for fragmentation—specifically its m/z and charge state. From these values, we calculate the neutral (uncharged) mass for comparison against our peptide database.

Precursor information is extracted using `getPrecursors()` from each `MSSpectrum`. The precursor object provides `getUnchargedMass()` and `getCharge()` methods. See the [MS data documentation](https://pyopenms.readthedocs.io/en/latest/user_guide/ms_data.html) for more details.
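
The conversion behind the neutral mass is simple arithmetic: multiply m/z by the charge and subtract one proton mass per charge. A minimal sketch, assuming positive-mode protonated ions; `neutral_mass` and `mz_from_mass` are hypothetical helper names, and the example m/z is the doubly protonated peptide PEPTIDE.

```python
PROTON = 1.007276466  # mass of a proton in Da

def neutral_mass(mz, charge):
    """Neutral mass from observed m/z and charge: (m/z - proton) * z."""
    return (mz - PROTON) * charge

def mz_from_mass(mass, charge):
    """Inverse operation: m/z of the [M + zH]z+ ion."""
    return (mass + charge * PROTON) / charge

mass = neutral_mass(400.687249, 2)
print(f"Neutral mass: {mass:.4f} Da")
print(f"Back to m/z at 2+: {mz_from_mass(mass, 2):.6f}")
```

A precursor with charge 0 (charge not recorded in the file) gives a meaningless result here, so such spectra are worth filtering out before matching.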

In [None]:
# Function to extract precursor masses and charges from MS2 spectra
def get_precursor_weights(MS2):
    """
    Extract precursor mass and charge for each MS2 spectrum.
    Skips spectra without precursor information.
    """
    precursors_charge = []
    precursors_M = []
    valid_indices = []  # track which spectra have valid precursors

    for i, spec in enumerate(MS2):
        precursors = spec.getPrecursors()
        if len(precursors) == 0:
            continue  # skip spectra without precursor info
        
        p = precursors[0]
        precursors_charge.append(p.getCharge())
        precursors_M.append(p.getUnchargedMass())
        valid_indices.append(i)

    return np.array(precursors_M), np.array(precursors_charge), valid_indices

# Load only MS2 spectra from mzML (more efficient for large files)
options = oms.PeakFileOptions()
options.setMSLevels([2])  # only load MS level 2

MS2 = oms.MSExperiment()
mzml = oms.MzMLFile()
mzml.setOptions(options)
mzml.load("UPS1_5min.mzML", MS2)

# Sort peaks by m/z in each spectrum
for spec in MS2:
    spec.sortByPosition()

# Get precursor mass and charge for each MS2 spectrum
P_mass, P_charge, valid_indices = get_precursor_weights(MS2)
print(f"Loaded {len(MS2)} MS2 spectra")
print(f"Spectra with valid precursors: {len(P_mass)}")

## 3. Identify Candidate Peptides

![Database search workflow illustration](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/database_search_illustration_candidate.png)

Now we connect precursor masses to peptide sequences. For each MS2 spectrum, we find all peptides from our database whose monoisotopic mass falls within **tolerance** of the observed precursor mass. These become the **candidate peptides** for that spectrum.

The implementation uses `np.isclose()` for efficient vectorized comparison. We convert peptide masses to a NumPy array, then for each precursor mass, find all peptides within tolerance. `np.isclose(a, b, atol=..., rtol=...)` accepts a pair when |a - b| ≤ atol + rtol × |b|, so `atol` is an absolute window in Da and `rtol` is a plain fraction of the peptide mass, not a ppm value (10 ppm corresponds to `rtol=1e-5`). The matching peptides are collected as candidate lists and stored in a DataFrame alongside the precursor charge (needed for fragment ion prediction). Spectra with zero candidates are filtered out.
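
The tolerance test can be spelled out in plain Python. This sketch reimplements the inequality NumPy documents for `np.isclose(a, b)`, namely |a - b| ≤ atol + rtol × |b|; the helper name `within_tolerance` and the example masses are made up for illustration.

```python
def within_tolerance(precursor_mass, peptide_mass, atol=0.0, rtol=0.0):
    """Scalar version of the np.isclose test: |a - b| <= atol + rtol * |b|."""
    return abs(precursor_mass - peptide_mass) <= atol + rtol * abs(peptide_mass)

peptide_mass = 1500.0

# Absolute window of 0.1 Da: a 0.05 Da deviation passes
print(within_tolerance(1500.05, peptide_mass, atol=0.1))     # True

# Relative window of 10 ppm (rtol = 10e-6) is only 0.015 Da wide at 1500 Da
print(within_tolerance(1500.05, peptide_mass, rtol=10e-6))   # False

# rtol = 0.1 means 10 percent, i.e. a 150 Da window at 1500 Da
print(within_tolerance(1600.0, peptide_mass, rtol=0.1))      # True
```

The last case is the one to watch for: an `rtol` that looks small is enormous on the mass scale, and because the two terms add up it silently overrides whatever `atol` was chosen.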

<details>
<summary><b>Deep Dive: Understanding Mass Tolerance</b></summary>

**Why do we need tolerance?**

Mass spectrometers don't measure masses perfectly. There's always some measurement error due to:
- Instrument calibration
- Temperature fluctuations
- Space charge effects
- Detector limitations

**Absolute vs. Relative Tolerance**

```
Absolute tolerance (e.g., 0.1 Da):
┌────────────────────────────────────────────┐
│  At m/z 500:  ±0.1 Da = ±200 ppm          │
│  At m/z 1000: ±0.1 Da = ±100 ppm          │
│  At m/z 2000: ±0.1 Da = ±50 ppm           │
└────────────────────────────────────────────┘
Same absolute window, but different relative precision!

Relative tolerance (e.g., 10 ppm):
┌────────────────────────────────────────────┐
│  At m/z 500:  ±0.005 Da                   │
│  At m/z 1000: ±0.01 Da                    │
│  At m/z 2000: ±0.02 Da                    │
└────────────────────────────────────────────┘
Window scales with m/z - matches instrument behavior!
```

**Converting ppm to Da:**
```
tolerance_Da = (m/z × ppm) / 1,000,000

Example: 10 ppm at m/z 1000
tolerance_Da = (1000 × 10) / 1,000,000 = 0.01 Da
```
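
The same conversion in Python, in both directions, reproduces the numbers in the boxes above. `ppm_to_da` and `da_to_ppm` are hypothetical helper names used only for this sketch.

```python
def ppm_to_da(mz, ppm):
    """Width in Da of a ppm tolerance at a given m/z."""
    return mz * ppm / 1e6

def da_to_ppm(mz, delta_da):
    """Relative size in ppm of an absolute tolerance at a given m/z."""
    return delta_da / mz * 1e6

for mz in (500.0, 1000.0, 2000.0):
    print(
        f"m/z {mz:6.0f}: 10 ppm = ±{ppm_to_da(mz, 10):.3f} Da, "
        f"0.1 Da = ±{da_to_ppm(mz, 0.1):.0f} ppm"
    )
```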

**Typical tolerance values:**
- Orbitrap instruments: 5-10 ppm
- TOF instruments: 10-20 ppm  
- Ion trap instruments: 0.2-0.5 Da (absolute)

</details>

In [None]:
def get_candidates_per_spectrum(precursor_weights, peptide_weights, peptides,
                                  spectrum_indices, absolute_tolerance=0.1, relative_tolerance=0.0):
    """
    For each precursor mass, find peptides whose masses match within tolerance.

    Parameters:
    -----------
    precursor_weights : array-like
        Precursor masses (one per spectrum).
    peptide_weights : array-like
        Theoretical peptide masses.
    peptides : list
        List of peptide sequences (AASequence objects).
    spectrum_indices : list
        Original indices of spectra in the MS2 experiment.
    absolute_tolerance : float
        Maximum absolute mass difference allowed in Da (default: 0.1).
    relative_tolerance : float
        Relative tolerance as a fraction of the peptide mass, passed to
        np.isclose as rtol (default: 0; 10 ppm would be 1e-5).

    Returns:
    --------
    pd.DataFrame with columns: 'candidates', 'num_candidates', 'spectrum_idx'
    """
    pept_candidates = []
    peptide_weights_arr = np.array(peptide_weights)

    for prec_weight in precursor_weights:
        # Find peptides matching precursor mass within tolerance
        pept_indices = np.where(
            np.isclose(prec_weight, peptide_weights_arr, 
                      atol=absolute_tolerance, rtol=relative_tolerance)
        )[0]
        pept_candidates.append([peptides[i] for i in pept_indices])

    return pd.DataFrame({
        'candidates': pept_candidates,
        'num_candidates': [len(c) for c in pept_candidates],
        'spectrum_idx': spectrum_indices
    })

In [None]:
# Find candidate peptides for each spectrum
candidate_df = get_candidates_per_spectrum(
    precursor_weights=P_mass, 
    peptide_weights=peptides_weight, 
    peptides=peptides,
    spectrum_indices=valid_indices
)

# Add precursor charge (needed for fragment charge calculation)
candidate_df["charge"] = P_charge

# Keep only spectra with at least one candidate peptide
candidate_df = candidate_df[
    candidate_df["num_candidates"] > 0
].copy()  # use .copy() to avoid SettingWithCopyWarning

print(f"Spectra with candidate peptides: {len(candidate_df)}")

# For this tutorial, work with the 5 spectra that have the most candidates
# (sort_values returns a new DataFrame, so the result must be assigned)
candidate_df = candidate_df.sort_values(by="num_candidates", ascending=False)
candidate_df = candidate_df.head(5).copy()
print(f"Working with {len(candidate_df)} spectra for demonstration")

print_df = candidate_df.copy()
print_df['candidates'] = print_df['candidates'].map(lambda x: list(map(lambda y: y.toString(), x)))
print_df[['spectrum_idx', 'candidates', 'charge']]

---

### Exercise 1: Effect of Mass Tolerance

**Predict first, then verify!**

1. **Prediction**: If we increase the `absolute_tolerance` from 0.1 Da to 1.0 Da, will we get MORE or FEWER candidate peptides per spectrum?

2. **Trade-off**: What are the consequences of using too narrow vs. too wide tolerance?

<details>
<summary><b>Click to reveal the answer</b></summary>

**Answer**: MORE candidate peptides.

A wider tolerance window means more peptides will have masses "close enough" to match each precursor. This has implications:

**Too narrow tolerance:**
- May miss correct peptides due to mass measurement error
- Fewer false candidates but risk of false negatives
- Good for high-accuracy instruments (Orbitrap, FT-ICR)

**Too wide tolerance:**
- More candidate peptides to evaluate (slower)
- Higher chance of incorrect matches (false positives)
- Better for low-accuracy instruments (ion traps)

**The sweet spot** depends on your instrument's mass accuracy:
- For Orbitrap data (5-10 ppm accurate): use 10-20 ppm
- For ion trap data (0.2-0.5 Da accurate): use 0.5 Da

**Try it**: Change `absolute_tolerance=0.1` to `absolute_tolerance=1.0` in the code above and observe how the number of candidates changes!

</details>

---

In [None]:
# Exercise 1: Experiment with mass tolerance
# Try changing the tolerance value and observe the effect on candidate counts

# Compare different tolerance settings
for tol in [0.01, 0.1, 0.5, 1.0]:
    test_df = get_candidates_per_spectrum(
        precursor_weights=P_mass, 
        peptide_weights=peptides_weight, 
        peptides=peptides,
        spectrum_indices=valid_indices,
        absolute_tolerance=tol
    )
    # Count spectra with at least one candidate
    n_with_candidates = (test_df["candidates"].apply(len) >= 1).sum()
    # Average candidates per spectrum (for those with candidates)
    avg_candidates = test_df[test_df["candidates"].apply(len) >= 1]["candidates"].apply(len).mean()
    print(f"Tolerance {tol:>4} Da: {n_with_candidates:>4} spectra with candidates, avg {avg_candidates:.1f} candidates each")

## 4. Generate Theoretical Spectra

![Database search workflow illustration](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/database_search_illustration_theoretical.png)

**Background: MS/MS fragmentation and ion types**

In **tandem mass spectrometry (MS/MS or MS2)**, a precursor ion is first isolated based on its m/z, then **fragmented** by collision with inert gas molecules (collision-induced dissociation, CID). The resulting fragment ions are analyzed in a second MS scan, producing a fragmentation spectrum that serves as a fingerprint for peptide identification.

![Precursor isolation window showing selected isotope pattern](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/isolation_window_illustration.png)

*Image: The isolation window (typically 1-2 Da wide) selects the precursor ion and its isotopes for fragmentation. Source: [pyOpenMS documentation](https://pyopenms.readthedocs.io).*

When peptides fragment along the backbone, they predominantly break at peptide bonds. If the charge stays on the **N-terminal fragment**, it's called a **b-ion**; if it stays on the **C-terminal fragment**, it's a **y-ion**. The series of b- and y-ions form a ladder pattern that reveals the amino acid sequence:

![Peptide fragmentation showing b and y ion ladder](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/peptide_fragmentation_tandemMS.svg)

*Image: Peptide backbone with labeled fragment ion positions. N-terminal ions (a, b, c) are numbered from the left, C-terminal ions (x, y, z) from the right. Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Peptide_fragmentation_tandemMS.svg), CC BY-SA 3.0.*

With candidate peptides identified, we now generate **theoretical fragmentation spectra** for each one. These predicted spectra show where we expect to see peaks if the peptide is the correct match—specifically, the m/z values of b-ions and y-ions based on the amino acid sequence.

We use `TheoreticalSpectrumGenerator` configured for b-ions, y-ions, and first prefix ions, with meta-information enabled so each peak carries its ion annotation. For each candidate peptide, we generate a spectrum with fragment charges from 1 up to min(2, precursor_charge - 1). See the [spectrum alignment documentation](https://pyopenms.readthedocs.io/en/latest/user_guide/spectrum_alignment.html) for details.
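
To see where the predicted peaks come from, the b/y ladder can be computed by hand: a b-ion is the sum of the first i residue masses plus the charge carriers, and a y-ion is the sum of the last i residue masses plus water plus the charge carriers. The sketch below is a simplified illustration, not what `TheoreticalSpectrumGenerator` does internally; the helper `by_ions` is hypothetical and the residue table only covers the amino acids of the example peptide.

```python
# Monoisotopic residue masses (Da) for the residues in "PEPTIDE" only
RESIDUE = {"P": 97.05276, "E": 129.04259, "T": 101.04768, "I": 113.08406, "D": 115.02694}
WATER, PROTON = 18.010565, 1.007276466

def by_ions(sequence, charge=1):
    """Return (label, m/z) for b1..b(n-1) and y1..y(n-1), sorted by m/z."""
    masses = [RESIDUE[aa] for aa in sequence]
    ions = []
    for i in range(1, len(masses)):
        b_neutral = sum(masses[:i])           # N-terminal fragment
        y_neutral = sum(masses[-i:]) + WATER  # C-terminal fragment keeps the water
        ions.append((f"b{i}", (b_neutral + charge * PROTON) / charge))
        ions.append((f"y{i}", (y_neutral + charge * PROTON) / charge))
    return sorted(ions, key=lambda ion: ion[1])

for label, mz in by_ions("PEPTIDE"):
    print(f"{label:>3}  {mz:9.4f}")
```

Consecutive ions of the same series differ by exactly one residue mass, which is what makes the ladder readable as a sequence.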

In [None]:
# Configure theoretical spectrum generation
tsg = oms.TheoreticalSpectrumGenerator()
params = oms.Param()
params.setValue("add_y_ions", "true")
params.setValue("add_b_ions", "true")
params.setValue("add_first_prefix_ion", "true")
params.setValue("add_precursor_peaks", "false")
params.setValue("add_losses", "false")
params.setValue("add_metainfo", "true")  # needed for ion annotations
tsg.setParameters(params)

# Generate theoretical spectra for each candidate peptide
candidate_df["theo_spectra"] = None

for idx, row in candidate_df.iterrows():
    # Max fragment charge is min(2, precursor_charge - 1); it depends only on the spectrum
    max_frag_charge = min(int(row["charge"]) - 1, 2)
    row_theo_spectra = []
    for peptide in row["candidates"]:
        theo_spectrum = oms.MSSpectrum()
        tsg.getSpectrum(theo_spectrum, peptide, 1, max_frag_charge)
        row_theo_spectra.append(theo_spectrum)
    candidate_df.at[idx, "theo_spectra"] = row_theo_spectra

print("Generated theoretical spectra for all candidates")

<details>
<summary><b>pyOpenMS Reference: TheoreticalSpectrumGenerator Parameters</b></summary>

| Parameter | Default | Description |
|-----------|---------|-------------|
| `add_b_ions` | true | N-terminal fragment ions (most common) |
| `add_y_ions` | true | C-terminal fragment ions (most common) |
| `add_a_ions` | false | a-ions: b-ions minus CO (-28 Da) |
| `add_c_ions` | false | c-ions: b-ions plus NH3 (+17 Da, for ETD fragmentation) |
| `add_x_ions` | false | x-ions: y-ions plus CO minus H2 (+26 Da, rare) |
| `add_z_ions` | false | z-ions: y-ions minus NH3 (-17 Da, for ETD fragmentation) |
| `add_losses` | false | Neutral losses: -H2O (S,T,E,D) and -NH3 (R,K,Q,N) |
| `add_precursor_peaks` | false | Include the intact precursor ion |
| `add_first_prefix_ion` | false | Include b1 ion (often missing in CID) |
| `add_metainfo` | false | Store ion annotations (needed for labeling peaks) |

**Ion nomenclature (Roepstorff-Fohlman):**

![Peptide fragmentation ion nomenclature](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/peptide_fragmentation.svg)

*Image: Peptide backbone cleavage producing a, b, c (N-terminal) and x, y, z (C-terminal) fragment ions. Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Peptide_fragmentation.svg), CC BY-SA 3.0.*

**Setting parameters:**
```python
tsg = oms.TheoreticalSpectrumGenerator()
params = tsg.getParameters()
params.setValue("add_losses", "true")  # Enable neutral losses
tsg.setParameters(params)
```

</details>

In [None]:
print_df = candidate_df.copy()
print_df['candidates'] = print_df['candidates'].map(lambda x: list(map(lambda y: y.toString(), x)))
print_df['theo_spectra'] = print_df['theo_spectra'].map(lambda x: len(x))
print_df[['spectrum_idx', 'candidates', 'charge', 'theo_spectra']]

## 5. Spectra Alignment

![Database search workflow illustration](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/database_search_illustration_align.png)

With theoretical spectra generated, we now **align** them against the observed MS2 spectra. Alignment identifies which theoretical peaks have matching experimental peaks within a specified mass tolerance. This forms the basis for scoring: more matched peaks suggest a better peptide-spectrum match.

**Tolerance** defines the maximum allowed difference between observed and theoretical peak masses. It can be absolute (e.g., 0.02 Da) or relative in ppm, which scales with m/z and better reflects how mass accuracy behaves in most instruments.

We use `SpectrumAlignment` with relative tolerance (500 ppm in this example). For each spectrum-candidate pair, `getSpectrumAlignment()` returns a list of matched peak index pairs. These alignments are stored in the DataFrame for subsequent scoring. See the [spectrum alignment documentation](https://pyopenms.readthedocs.io/en/latest/user_guide/spectrum_alignment.html) for details.
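
The idea of peak matching can be illustrated without pyOpenMS. The sketch below pairs each theoretical peak with its nearest observed peak and keeps the pair if the difference is within a ppm window. It is a simplified stand-in, not the algorithm `SpectrumAlignment` uses: the helper `match_peaks` is hypothetical, the tolerance is taken relative to the theoretical m/z (an assumption), both lists must be sorted, and one observed peak may be paired more than once.

```python
from bisect import bisect_left

def match_peaks(theo_mz, obs_mz, tol_ppm=500.0):
    """Return (theo_idx, obs_idx) pairs for peaks within tol_ppm of each other."""
    pairs = []
    for t_idx, t in enumerate(theo_mz):
        pos = bisect_left(obs_mz, t)
        # Only the neighbours around the insertion point can be the nearest peak
        neighbours = [i for i in (pos - 1, pos) if 0 <= i < len(obs_mz)]
        if not neighbours:
            continue
        o_idx = min(neighbours, key=lambda i: abs(obs_mz[i] - t))
        if abs(obs_mz[o_idx] - t) <= t * tol_ppm / 1e6:
            pairs.append((t_idx, o_idx))
    return pairs

theo = [148.0604, 227.1026, 324.1554]          # toy theoretical m/z values
obs = [120.0808, 148.0611, 227.3000, 324.1560]  # toy observed m/z values
print(match_peaks(theo, obs))  # [(0, 1), (2, 3)]
```

The middle theoretical peak stays unmatched because its nearest observed peak is about 0.2 Da away, while 500 ppm at m/z 227 allows only about 0.11 Da.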

In [None]:
# Configure spectrum alignment
spa = oms.SpectrumAlignment()
p = spa.getParameters()
p.setValue("tolerance", 500.0)  # 500 ppm
p.setValue("is_relative_tolerance", "true")
spa.setParameters(p)

# Align each theoretical spectrum with the observed spectrum
candidate_df["alignment"] = None

for idx, row in candidate_df.iterrows():
    # Get observed spectrum using stored spectrum index
    observed_spectrum = MS2[row["spectrum_idx"]]
    row_alignment = []
    
    for theo_spec in row["theo_spectra"]:
        alignment = []
        spa.getSpectrumAlignment(alignment, theo_spec, observed_spectrum)
        row_alignment.append(alignment)
    
    candidate_df.at[idx, "alignment"] = row_alignment

print("Aligned all theoretical spectra with observed spectra")

<details>
<summary><b>pyOpenMS Reference: SpectrumAlignment</b></summary>

| Class/Method | Purpose | Documentation |
|--------------|---------|---------------|
| `SpectrumAlignment()` | Align two spectra by matching peaks | [SpectrumAlignment docs](https://pyopenms.readthedocs.io/en/latest/apidoc/pyopenms.html#pyopenms.SpectrumAlignment) |
| `.getParameters()` | Get current parameter settings | Returns `Param` object |
| `.setParameters(params)` | Set alignment configuration | |
| `.getSpectrumAlignment(result, theo, exp)` | Perform alignment | Fills `result` with matched peak pairs |

**Key parameters:**
- `tolerance`: Mass tolerance value (Da or ppm)
- `is_relative_tolerance`: "true" for ppm, "false" for Da

**Example:**
```python
spa = oms.SpectrumAlignment()
p = spa.getParameters()
p.setValue("tolerance", 20.0)          # 20 ppm tolerance
p.setValue("is_relative_tolerance", "true")
spa.setParameters(p)

alignment = []
spa.getSpectrumAlignment(alignment, theo_spectrum, exp_spectrum)
# alignment = [(theo_idx, exp_idx), ...]
```

See also: [Spectrum Alignment Tutorial](https://pyopenms.readthedocs.io/en/latest/user_guide/spectrum_alignment.html)

</details>

<details>
<summary><b>Understanding the Alignment Results</b></summary>

**What does the alignment output mean?**

The alignment returns a list of `(theo_idx, exp_idx)` pairs:
```python
alignment = [(7, 7), (9, 39), (12, 61), (14, 97), ...]
#             │  │
#             │  └─ index in the experimental spectrum
#             └──── index in the theoretical spectrum
# (7, 7) is the first matched pair, (9, 39) the second, and so on.
```

**Visual interpretation:**
```
Theoretical:  ─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─
              0 1 2 3 4 5 6 7 8 9 ...
                            ↓   ↓
                            │   └── theo[9] matches exp[39]
                            └────── theo[7] matches exp[7]

Experimental: ─┬─┬─┬─┬─┬─┬─┬─┬─...─┬─┬─...─┬─┬─...
              0 1 2 3 4 5 6 7 8   39      61
                            ↑     ↑       ↑
                            matched peaks
```

**More pairs = better match** (in general)
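
To make the pair format concrete, here is a minimal pure-Python sketch of a nearest-peak matcher. It is not the algorithm that `SpectrumAlignment` uses internally. The function name `match_peaks` and the toy m/z values are invented for illustration, and both m/z lists are assumed to be sorted.

```python
def match_peaks(theo_mz, exp_mz, tol_ppm):
    """Greedy matching of two sorted m/z lists within a ppm tolerance.

    Returns (theo_idx, exp_idx) pairs; each experimental peak is used at most once.
    """
    pairs = []
    j = 0
    for i, t in enumerate(theo_mz):
        tol = t * tol_ppm / 1e6
        # Skip experimental peaks that lie below the tolerance window
        while j < len(exp_mz) and exp_mz[j] < t - tol:
            j += 1
        # Among the peaks inside the window, pick the closest one
        best, k = None, j
        while k < len(exp_mz) and exp_mz[k] <= t + tol:
            if best is None or abs(exp_mz[k] - t) < abs(exp_mz[best] - t):
                best = k
            k += 1
        if best is not None:
            pairs.append((i, best))
            j = best + 1  # do not reuse this experimental peak
    return pairs

theo = [175.119, 276.167, 389.251, 502.335]
exp = [120.081, 175.118, 210.500, 276.170, 502.300, 502.336]
pairs = match_peaks(theo, exp, tol_ppm=20.0)
print(pairs)  # [(0, 1), (1, 3), (3, 5)]
```

The third theoretical peak has no partner within 20 ppm, so it does not appear in the result.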

</details>

---

### Quick Check: Alignment Output

Look at the `alignment` column of the DataFrame printed by the cell below:

1. **Count**: How many peak pairs are in the alignment for the first candidate?
2. **Interpretation**: If the theoretical spectrum has 20 peaks and only 5 are matched, what might this tell you?

<details>
<summary><b>Click to check your understanding</b></summary>

**Answer 1**: Count the pairs in the alignment list. Each `(x, y)` tuple represents one matched peak.

**Answer 2**: Low match rate (5/20 = 25%) could indicate:
- **Wrong peptide candidate** - most theoretical peaks don't appear
- **Poor spectrum quality** - low signal-to-noise
- **Missing modifications** - actual peptide differs from theoretical
- **Wrong charge state** - fragment charges don't match predictions
- **Tolerance too tight** - valid peaks rejected as non-matches

A good match typically has >50% of theoretical ions matched.

</details>

---

In [None]:
# Build a display-friendly copy: peptide strings instead of peptide objects,
# and the number of theoretical spectra instead of the spectra themselves
print_df = candidate_df.copy()
print_df['candidates'] = print_df['candidates'].map(lambda peps: [pep.toString() for pep in peps])
print_df['theo_spectra'] = print_df['theo_spectra'].map(len)
print_df[['spectrum_idx', 'candidates', 'charge', 'theo_spectra', 'alignment']]

## 6. Calculate Score

![Database search workflow illustration](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/database_search_illustration_score.png)

With alignments computed, we now **score** each peptide-spectrum match. The score quantifies how well a candidate peptide explains the observed fragmentation pattern. Our simple approach counts matched b-ions and y-ions—more matches mean a higher score.

**Background: Peptide-Spectrum Matches (PSMs)**

A **PSM** pairs an observed MS2 spectrum with a candidate peptide sequence, along with a score indicating match quality. The score helps rank candidates and select the most likely identification.

The scoring function extracts ion annotations from the theoretical spectrum's metadata (StringDataArrays), checks whether each matched peak is a b-ion or y-ion, and returns the total count. While simple, this captures the core logic of peptide identification—real search engines use more sophisticated scoring that considers intensity patterns, consecutive ion series, and statistical significance.

In [None]:
def match_score(spec, matched_indices):
    """
    Calculate PSM score as count of matched b- and y-ions.
    
    Parameters:
    -----------
    spec : MSSpectrum
        Theoretical spectrum with ion annotations.
    matched_indices : list
        Indices of matched theoretical peaks.
    
    Returns:
    --------
    int : Number of matched b- and y-ions
    """
    # Ion labels are stored as bytes in the first StringDataArray; fetch it once
    annotations = spec.getStringDataArrays()[0]
    y_ion_count = 0
    b_ion_count = 0
    
    for idx in matched_indices:
        ion_type = annotations[int(idx)].decode()
        if ion_type.startswith('y'):
            y_ion_count += 1
        elif ion_type.startswith('b'):
            b_ion_count += 1
    
    return b_ion_count + y_ion_count

# Calculate scores for each candidate
candidate_df["scores"] = None

for idx, row in candidate_df.iterrows():
    row_scores = []
    for theo_spectrum, alignment in zip(row["theo_spectra"], row["alignment"]):
        # Get theoretical peak indices that were matched
        theo_peak_indices = [pair[0] for pair in alignment]
        score = match_score(theo_spectrum, theo_peak_indices)
        row_scores.append(score)
    candidate_df.at[idx, "scores"] = row_scores

print("Calculated scores for all PSMs")

In [None]:
# Same display-friendly view as before, now including the scores
print_df = candidate_df.copy()
print_df['candidates'] = print_df['candidates'].map(lambda peps: [pep.toString() for pep in peps])
print_df['theo_spectra'] = print_df['theo_spectra'].map(len)
print_df[['spectrum_idx', 'candidates', 'charge', 'theo_spectra', 'alignment', 'scores']]

---

### Exercise 2: Analyze the Scoring Results

Look at the scores in the results above and answer these questions:

1. **Interpretation**: For a peptide with 10 amino acids, what is the maximum possible score using our simple ion counting method? (Hint: consider both b and y ions)

2. **Quality assessment**: If a PSM has a score of 5 for a 15-residue peptide, would you consider this a good or poor match? Why?

<details>
<summary><b>Click to check your answers</b></summary>

**Answer 1: Maximum possible score**

For a peptide with n amino acids, counting each fragment once (a single charge state):
- b-ions: b1, b2, ..., b(n-1) = **n-1 ions**
- y-ions: y1, y2, ..., y(n-1) = **n-1 ions**
- Maximum score = 2 × (n-1)

For a 10-residue peptide: max score = 2 × 9 = **18**

**Answer 2: Quality assessment**

A score of 5 for a 15-residue peptide is relatively **poor**:
- Maximum possible: 2 × 14 = 28
- Coverage: 5/28 = 18%
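
The same arithmetic as a small sketch. The helper names `max_ion_count` and `ion_coverage` are made up here, and the calculation assumes each b- and y-ion is counted once (a single charge state).

```python
def max_ion_count(peptide_length: int) -> int:
    """Upper bound on the score: b1..b(n-1) plus y1..y(n-1), one charge state each."""
    return 2 * (peptide_length - 1)

def ion_coverage(score: int, peptide_length: int) -> float:
    """Fraction of the possible b/y ions that were matched."""
    return score / max_ion_count(peptide_length)

print(max_ion_count(10))              # 18
print(f"{ion_coverage(5, 15):.0%}")   # 18%
```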

Good matches typically have:
- >50% ion coverage
- Consecutive ion series (e.g., y3, y4, y5, y6...)
- Both b and y ions represented

**Why might coverage be low?**
- Spectrum quality (noise, low abundance)
- Post-translational modifications not considered
- Neutral losses not included in our simple model
- Fragment charge states we didn't predict

**Real search engines** use more sophisticated scoring that considers:
- Ion intensity patterns
- Consecutive ion series
- Expected vs. unexpected peaks
- Statistical significance

</details>

<details>
<summary><b>Deep Dive: More Sophisticated Scoring Functions</b></summary>

Our simple ion counting is educational but limited. Real search engines use advanced scoring:

**XCorr (Comet, SEQUEST)**
- Cross-correlation between observed and theoretical spectra
- Accounts for peak intensities and spacing
- Normalized for spectrum complexity

**Hyperscore (X!Tandem)**
- Dot product of matched intensities multiplied by the factorials of the numbers of matched b- and y-ions
- Rewards spectra with many matched b- and y-ions
- Fast computation
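
As a rough illustration, one commonly cited form of the hyperscore multiplies the summed intensity of the matched peaks by the factorials of the matched b- and y-ion counts. The sketch below computes its logarithm. The function name and inputs are hypothetical, and real implementations differ in details such as intensity normalization, so treat this as a teaching aid rather than a reference implementation.

```python
import math

def log_hyperscore(b_intensities, y_intensities):
    """log(dot * Nb! * Ny!), where dot is the summed intensity of matched b/y peaks."""
    dot = sum(b_intensities) + sum(y_intensities)
    if dot <= 0:
        return 0.0
    n_b, n_y = len(b_intensities), len(y_intensities)
    # lgamma(n + 1) == log(n!), which avoids overflow for long ion series
    return math.log(dot) + math.lgamma(n_b + 1) + math.lgamma(n_y + 1)

# Same total matched intensity, but spread over more matched ions
few = log_hyperscore([10.0], [10.0, 10.0])
many = log_hyperscore([5.0, 5.0, 5.0], [5.0, 5.0, 5.0])
print(f"{few:.2f} vs {many:.2f}")
```

With the total intensity held equal, the factorial terms make the spectrum with more matched ions score higher.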

**E-value (MS-GF+, OMSSA)**
- Statistical expectation value
- Expected number of random matches that would score at least as well
- Accounts for database size

**Percolator (post-processing)**
- Machine learning approach
- Combines multiple features (mass error, ion coverage, etc.)
- Improves FDR estimation

</details>

---

## 7. Select the Best Candidate

![Database search workflow illustration](https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/notebooks/images/database_search_illustration_compare.png)

For each spectrum, we now select the peptide with the **highest score** as the identified sequence.

**Note on FDR (False Discovery Rate):** In a real search engine, additional steps like **FDR control** would follow. The FDR estimates the proportion of incorrect identifications among all accepted PSMs—typically controlled at 1% using target-decoy approaches (searching against both real and reversed/shuffled protein sequences). This means roughly 1 in 100 accepted identifications may be incorrect. For this tutorial, we simply pick the top-scoring candidate without FDR filtering.

<details>
<summary><b>Deep Dive: Understanding FDR and Target-Decoy Approach</b></summary>

**The Problem: How Many Identifications Are Wrong?**

When you search thousands of spectra against a database, some will match by chance. But how many? That's where **False Discovery Rate (FDR)** comes in.

**Target-Decoy Strategy**

1. Create a **decoy database** by reversing or shuffling protein sequences
2. Search spectra against **both** target (real) and decoy databases
3. Decoy matches are incorrect by definition, so their count estimates the number of false positives among the target matches

```
Score Distribution:

High Score ←─────────────────────→ Low Score

TARGET:    ████████████████░░░░░░░░░░░░░░
DECOY:     ░░░░░░░░░░░░░░████████████████

           Good matches    Random matches
```

**FDR Calculation**

At any score threshold:
```
FDR = (# decoy hits) / (# target hits)
```

If you have 100 target hits and 1 decoy hit at a score threshold:
- FDR ≈ 1/100 = 1%
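
The calculation above can be sketched in a few lines. The function name `fdr_at_threshold` and the toy PSM list are invented for illustration. It uses the simple decoys/targets estimate from this section, not the refined estimators found in production tools.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as (# decoy hits) / (# target hits) among PSMs scoring >= threshold.

    `psms` is a list of (score, is_decoy) tuples.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

# Toy data: 100 target PSMs and 1 decoy PSM pass a score threshold of 10
psms = [(12, False)] * 100 + [(11, True)] + [(4, False)] * 30 + [(3, True)] * 28
print(f"{fdr_at_threshold(psms, 10):.1%}")  # 1.0%
```

Lowering the threshold admits more decoys relative to targets, so the estimated FDR rises.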

**Typical FDR thresholds:**
- PSM level: 1%
- Peptide level: 1%
- Protein level: 1%

**Why this matters**: Without FDR control, you might report thousands of "identifications" that are actually random matches!

</details>

In [None]:
# Select best candidate for each spectrum
def select_best_candidate(row):
    """Return the best-scoring peptide and its details."""
    if not row['candidates'] or not row['scores']:
        return None, None, None, None, 0
    
    best_idx = np.argmax(row['scores'])
    return (
        row['candidates'][best_idx],
        row['theo_spectra'][best_idx],
        row['alignment'][best_idx],
        row['scores'][best_idx],
        len(row['candidates'])
    )

# Create results summary
results = []
for idx, row in candidate_df.iterrows():
    best_pep, best_theo, best_align, best_score, n_candidates = select_best_candidate(row)
    if best_pep is not None:
        results.append({
            'spectrum_idx': row['spectrum_idx'],
            'peptide': best_pep.toString(),
            'score': best_score,
            'n_candidates': n_candidates,
            'charge': row['charge']
        })

results_df = pd.DataFrame(results)
print("=== Best Peptide Identifications ===\n")
print(results_df.to_string(index=False))

## 8. Visualize the Best Match with Mirror Plot

Finally, we visualize the best peptide-spectrum match using an interactive **mirror plot**.

**What is a mirror plot?** A mirror plot displays two spectra for easy visual comparison:
- **Top (pointing up)**: The experimental/observed spectrum, with matched peaks annotated by their ion type (b1, y3, etc.)
- **Bottom (pointing down/mirrored)**: The theoretical spectrum generated from the candidate peptide

Peaks that align vertically between the two spectra represent successful matches. Unmatched peaks in the experimental spectrum may be noise, neutral losses, or ions not included in the theoretical model. This visualization helps assess how well the identified peptide explains the observed fragmentation pattern.

<details>
<summary><b>Understanding the Mirror Plot</b></summary>

**How to read a mirror plot:**

```
Intensity ↑
          │  b3        y5   y7
          │   █    y4   █    █      ← Experimental (top)
          │   █     █   █    █
          │   █     █   █    █
──────────┼──────────────────────── m/z →
          │   █     █   █    █
          │   █     █   █    █
          │  b3    y4  y5   y7      ← Theoretical (bottom)
          ↓
```

**What to look for:**

| Pattern | Meaning |
|---------|---------|
| Many aligned peaks | Good match, confident ID |
| Unmatched peaks (top only) | Noise, neutral losses, or modifications |
| Missing theoretical peaks | Low fragmentation, ion suppression |
| Consecutive ion series | Strong sequence evidence (e.g., y3, y4, y5) |

**Peak annotations:**
- `b3` = b-ion containing the first 3 residues, counted from the N-terminus
- `y5` = y-ion containing the last 5 residues, counted from the C-terminus
- `b3++` = doubly-charged b3 ion
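
If you want to work with these labels programmatically, a small parser helps. The sketch below assumes labels of the form shown above (ion letter, position, then zero or more `+` signs, with no `+` treated as charge 1). The exact label format produced by your pyOpenMS version may differ, so check a few annotations first. `parse_ion` is a hypothetical helper, not a pyOpenMS function.

```python
import re

# Ion letter, fragment position, then optional '+' signs for the charge
ION_PATTERN = re.compile(r"^([abcxyz])(\d+)(\+*)$")

def parse_ion(annotation):
    """Split a label such as 'b3++' into (ion_type, position, charge)."""
    m = ION_PATTERN.match(annotation)
    if m is None:
        return None  # other label formats (e.g. neutral losses) are not handled here
    charge = len(m.group(3)) or 1  # assumption: no '+' means singly charged
    return m.group(1), int(m.group(2)), charge

print(parse_ion("b3++"))  # ('b', 3, 2)
print(parse_ion("y5"))    # ('y', 5, 1)
```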

</details>

---

### Exercise 3: Interpret the Mirror Plot

After generating the visualization, examine it carefully:

1. **Count the matched peaks**: How many b-ions and y-ions are annotated?
2. **Look for gaps**: Are there missing ions in the series (e.g., y3, y5, y7 but no y4, y6)?
3. **Check unmatched peaks**: Are there intense peaks in the experimental spectrum without annotations?

<details>
<summary><b>Click for interpretation guidance</b></summary>

**What gaps tell you:**
- Some fragment ions don't form efficiently (especially around Proline)
- Neutral losses may "steal" intensity from the main ion
- Some fragments may be outside the mass range

**Why unmatched peaks matter:**
- **Intense unmatched peaks** might indicate:
  - Post-translational modifications not in your search
  - Chimeric spectra (two peptides co-isolated)
  - Contaminants
- **Low-intensity unmatched peaks** are often just noise

**A good match typically has:**
- >50% of theoretical ions matched
- At least one long consecutive series (3+ ions)
- Matched ions have reasonable intensity

</details>

---

In [None]:
import pyopenms_viz  # registers the plotting backend

def create_annotated_spectrum_df(observed_spectrum, theo_spectrum, alignment):
    """
    Create an annotated DataFrame from observed spectrum with ion labels from alignment.
    """
    mzs, intensities = observed_spectrum.get_peaks()
    annotations = [""] * len(mzs)
    
    for theo_idx, exp_idx in alignment:
        annotations[exp_idx] = theo_spectrum.getStringDataArrays()[0][theo_idx].decode()

    return pd.DataFrame({
        'mz': mzs,
        'intensity': intensities,
        'ion_annotation': annotations
    })

# Get best matches and sort by score (top 3)
best_matches = []
for idx, row in candidate_df.iterrows():
    best_pep, best_theo, best_align, best_score, _ = select_best_candidate(row)
    if best_pep is not None:
        best_matches.append({
            'df_idx': idx,
            'spectrum_idx': row['spectrum_idx'],
            'peptide': best_pep,
            'theo_spectrum': best_theo,
            'alignment': best_align,
            'score': best_score
        })

# Sort by score descending and take top 3
best_matches.sort(key=lambda x: x['score'], reverse=True)
top_matches = best_matches[:3]

print(f"Showing top {len(top_matches)} matches by score:\n")

# Visualize top matches
for match in top_matches:
    observed_spectrum = MS2[match['spectrum_idx']]
    
    # Create annotated experimental spectrum DataFrame
    exp_df = create_annotated_spectrum_df(observed_spectrum, match['theo_spectrum'], match['alignment'])
    
    # Create theoretical spectrum DataFrame
    theo_df = match['theo_spectrum'].get_df()
    
    print(f"Spectrum {match['spectrum_idx']}: {match['peptide'].toString()} (score: {match['score']})")

    # Plot mirror spectrum
    exp_df.plot(
        kind='spectrum',
        backend='ms_plotly',
        x='mz',
        y='intensity',
        reference_spectrum=theo_df,
        mirror_spectrum=True,
        ion_annotation='ion_annotation',
        title=f"Best Match: {match['peptide'].toString()} (score: {match['score']})",
        width=900,
        height=500,
        annotate_top_n_peaks=0
    );

## Summary

Congratulations! You've implemented a complete peptide identification workflow. Here's what you learned:

| Step | Concept | Key pyOpenMS Tool |
|------|---------|-------------------|
| **1. Digestion** | Trypsin cleavage produces searchable peptides | `ProteaseDigestion` |
| **2. Mass calculation** | Monoisotopic mass for database filtering | `AASequence.getMonoWeight()` |
| **3. Candidate selection** | Mass tolerance defines search space | `np.isclose()` |
| **4. Theoretical spectra** | b/y ions from peptide fragmentation | `TheoreticalSpectrumGenerator` |
| **5. Alignment** | Match observed to theoretical peaks | `SpectrumAlignment` |
| **6. Scoring** | Count matched ions as simple score | Custom function |
| **7. Selection** | Best-scoring candidate wins | `np.argmax()` |
| **8. Visualization** | Mirror plots show match quality | `pyopenms_viz` |

---

<details>
<summary><b>Common Pitfalls: What Can Go Wrong?</b></summary>

When implementing or using peptide identification, watch out for these common issues:

| Problem | Symptom | Solution |
|---------|---------|----------|
| **Tolerance too tight** | Few/no candidates per spectrum | Increase mass tolerance (check instrument specs) |
| **Tolerance too loose** | Many wrong candidates, low scores | Decrease tolerance, use ppm instead of Da |
| **Missing modifications** | Good spectra get low scores | Add variable modifications (oxidation, etc.) |
| **Wrong enzyme** | Few database matches | Check if sample was digested with expected enzyme |
| **Database too small** | Low ID rate even for good spectra | Use appropriate species database |
| **Database too large** | Slow search, fewer identifications at a fixed FDR | Use reviewed (Swiss-Prot) entries, filter species |
| **Chimeric spectra** | Extra unmatched peaks | Use narrower isolation window, or accept lower coverage |
| **Low-quality spectra** | Low scores across the board | Check LC-MS acquisition, consider filtering spectra |

**Debugging tips:**
1. Start with a few spectra to validate the workflow
2. Check that precursor masses are extracted correctly
3. Verify theoretical spectra look reasonable (correct ions)
4. Inspect mirror plots for representative matches

</details>

---

## Bonus Challenges

<details>
<summary><b>Challenge 1 (Beginner): Analyze More Spectra</b></summary>

Modify the code to analyze more than 5 spectra:

```python
# Change this line:
candidate_df = candidate_df.head(5).copy()
# To:
candidate_df = candidate_df.head(20).copy()
```

**Questions to explore:**
1. How many spectra get identified?
2. What's the distribution of scores?
3. Are higher-charge precursors harder to identify?

</details>

<details>
<summary><b>Challenge 2 (Intermediate): Improve the Scoring</b></summary>

Modify the `match_score` function to weight consecutive ion series more heavily:

```python
def improved_score(spec, matched_indices):
    # Your code: Give bonus points for consecutive ions
    # e.g., if you match y3, y4, y5, give extra points
    pass
```

**Hint**: Sort the matched ions by number and check for sequences.

</details>

<details>
<summary><b>Challenge 3 (Advanced): Add Neutral Loss Support</b></summary>

Enable neutral losses in the theoretical spectrum generator:

```python
params.setValue("add_losses", "true")  # Enable losses
```

**Questions:**
1. How does this change the number of theoretical peaks?
2. Does it improve or hurt your scores?
3. What neutral losses are most common (H2O, NH3)?

</details>

<details>
<summary><b>Challenge 4 (Expert): Implement Simple FDR</b></summary>

Create a decoy database and estimate FDR:

1. Reverse all peptide sequences to create decoys
2. Search against combined target-decoy database
3. At each score threshold, calculate: FDR = decoys / targets
4. Find the score threshold for 1% FDR

**Warning**: This is a simplified exercise. Real FDR requires careful statistical treatment.

</details>

---

**Next up: [Notebook 3 - Quantification](EUBIC_Task3_Quant.ipynb)** - Learn how to measure peptide abundances and compare samples!

<details>
<summary><b>Good to Know: Monoisotopic vs Average Mass</b></summary>

**Why "monoisotopic" mass?**

Elements have multiple isotopes. For carbon:
- **¹²C** (98.9%): 12.000 Da (by definition)
- **¹³C** (1.1%): 13.003 Da

**Two ways to calculate peptide mass:**

| Type | Definition | Use Case |
|------|------------|----------|
| **Monoisotopic** | Uses the most abundant isotope of each element (for C, H, N, O and S, also the lightest) | High-resolution MS, peptide ID |
| **Average** | Weighted average by natural abundance | Low-resolution MS, older instruments |

**Example for water (H₂O):**
- Monoisotopic: 1.00783 + 1.00783 + 15.99491 = 18.01057 Da
- Average: 1.00794 + 1.00794 + 15.99940 = 18.01528 Da

For a typical 1500 Da peptide, the difference is ~0.5-1 Da!
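
The difference can be checked with a short calculation. The sketch below uses rounded element masses typed in by hand (not values read from pyOpenMS) and a hypothetical helper `formula_mass`. The peptide example is angiotensin II (DRVYIHPF), taken here to have the composition C50H71N13O12.

```python
# Rounded element masses in Da (hand-entered for this illustration)
MONO = {"H": 1.00782503, "C": 12.0, "N": 14.0030740, "O": 15.9949146, "S": 31.9720707}
AVG = {"H": 1.00794, "C": 12.0107, "N": 14.0067, "O": 15.9994, "S": 32.065}

def formula_mass(formula, table):
    """Sum element masses for a composition given as {element: count}."""
    return sum(table[element] * count for element, count in formula.items())

water = {"H": 2, "O": 1}
print(f"H2O monoisotopic: {formula_mass(water, MONO):.4f} Da")  # 18.0106
print(f"H2O average:      {formula_mass(water, AVG):.4f} Da")   # 18.0153

angiotensin_ii = {"C": 50, "H": 71, "N": 13, "O": 12}
diff = formula_mass(angiotensin_ii, AVG) - formula_mass(angiotensin_ii, MONO)
print(f"Angiotensin II, average minus monoisotopic: {diff:.2f} Da")
```

For this peptide of roughly 1046 Da the gap is already about 0.6 Da, and it grows with peptide size.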

**In high-resolution MS**, we can resolve isotopes, so we match against the monoisotopic mass (the lightest peak in the isotope pattern).

</details>
