<a href="https://colab.research.google.com/github/timosachsenberg/EuBIC2026/blob/main/notebooks/EUBIC_Task2_ID.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dependencies (for Google Colab)
!pip install -q "pyopenms>=3.5.0" "pyopenms-viz>=1.0.0"

# Notebook 2 – Peptide Identification by Database Search

In the previous notebook, we explored how proteins are digested into peptides and how LC–MS data is structured. Now we tackle the central challenge of proteomics: **identifying which peptides produced the observed spectra**.

**Database search** is a peptide identification strategy where observed spectra are compared against theoretical spectra generated from a protein sequence database. Each spectrum is matched to candidate peptides whose precursor mass falls within tolerance, and the best-matching peptide is selected based on a scoring function.

This notebook implements a simplified peptide database search workflow—the same conceptual approach used by tools like Comet, MS-GF+, and Sage. We will:

1. **Digest proteins** and compute theoretical peptide masses
2. **Match precursor masses** to find candidate peptides for each spectrum  
3. **Generate theoretical fragment spectra** for each candidate
4. **Align and score** observed vs. theoretical spectra
5. **Select the best-matching peptide** for each spectrum
6. **Visualize** the match with an interactive mirror plot

---

<details>
<summary><b>Quick Reference: Key Terms Used in This Notebook</b></summary>

| Term | Definition |
|------|------------|
| **Precursor** | The intact peptide ion selected for fragmentation in MS2 |
| **MS2/MS/MS** | Tandem mass spectrometry - fragmentation spectrum of a selected precursor |
| **b-ion** | N-terminal fragment ion from peptide backbone cleavage |
| **y-ion** | C-terminal fragment ion from peptide backbone cleavage |
| **PSM** | Peptide-Spectrum Match - pairing of spectrum with identified peptide |
| **Tolerance** | Maximum allowed mass difference for peak matching |
| **ppm** | Parts per million - relative mass tolerance that scales with m/z |
| **FDR** | False Discovery Rate - estimated proportion of incorrect identifications |
| **Monoisotopic mass** | Mass calculated using most abundant isotope of each element |

</details>

<details>
<summary><b>New to pandas DataFrames?</b></summary>

This notebook uses **pandas DataFrames** extensively. Here's a quick primer:

```python
# DataFrames are like spreadsheets with named columns
df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'score': [95, 87]
})

# Access columns
df['score']              # Get the 'score' column
df[['name', 'score']]    # Get multiple columns

# Filter rows
df[df['score'] > 90]     # Rows where score > 90

# Apply functions to columns
df['score'].apply(lambda x: x * 2)  # Double all scores

# Iterate over rows
for idx, row in df.iterrows():
    print(row['name'], row['score'])
```

**Key operations used in this notebook:**
- `.copy()` - Creates independent copy to avoid warnings
- `.head(n)` - First n rows
- `.apply()` - Apply function to each row/element
- `.at[idx, col]` - Access specific cell

</details>

## Overview

Having established a foundation in enzymatic digestion and mass-spectral visualization, we now turn to the central task of peptide identification through database search. Each step mirrors the core logic employed by modern search engines, but implemented transparently for educational purposes.

**Workflow steps:**

1. **Compute monoisotopic peptide masses** – Calculate the neutral mass for every peptide from in-silico digestion.

2. **Compute precursor masses for MS/MS spectra** – For each MS2 spectrum, derive the neutral precursor mass from m/z and charge.

3. **Select candidate peptides** – Compare precursor masses to find peptides within mass tolerance.

4. **Generate theoretical fragment spectra** – Create b/y ion spectra for each candidate peptide.

5. **Align observed and theoretical spectra** – Match experimental peaks to theoretical fragments.

6. **Score peptide–spectrum matches** – Count matched b/y ions as a simple scoring function.

7. **Select best candidate** – Choose the highest-scoring peptide for each spectrum.

8. **Visualize with mirror plot** – Compare experimental and theoretical spectra interactively.

In [None]:
%matplotlib inline
import os
import pyopenms as oms
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
print("pyOpenMS version:", oms.__version__)

pyOpenMS version: 3.5.0


In [None]:
# Download FASTA from course repository (2 UPS1 proteins: Complement C5 and EGF)
if not os.path.exists("two_ups_proteins.fasta"):
    !wget -q -O "two_ups_proteins.fasta" https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/data/two_ups_proteins.fasta

# Download mzML (5-minute subset of UPS1 spike-in experiment, ~36MB)
if not os.path.exists("UPS1_5min.mzML"):
    !wget -q -O "UPS1_5min.mzML" https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/data/UPS1_5min.mzML

In [None]:
# Digest proteins into peptides (same approach as Notebook 1)
def preprocess_database(fasta_path, min_len_pept=6, max_len_pept=30, missed_cleavages=2):
    """
    Load a FASTA file and digest all proteins with Trypsin.
    Returns a list of unique peptides (as AASequence objects).
    """
    database_entries = []
    f = oms.FASTAFile()
    f.load(fasta_path, database_entries)

    # Configure protein digestion with Trypsin
    dig = oms.ProteaseDigestion()
    dig.setEnzyme("Trypsin")
    dig.setMissedCleavages(missed_cleavages)

    peptides = []
    for entry in database_entries:
        protein = oms.AASequence.fromString(entry.sequence)
        peptides_ = []
        dig.digest(protein, peptides_, min_len_pept, max_len_pept)
        peptides.extend(peptides_)

    # Remove duplicate peptides (same sequence from different proteins or missed cleavage variants)
    seen = set()
    unique_peptides = []
    for pep in peptides:
        seq_str = pep.toString()
        if seq_str not in seen:
            seen.add(seq_str)
            unique_peptides.append(pep)

    return unique_peptides

In [None]:
# Digest the UPS proteins database
peptides = preprocess_database("two_ups_proteins.fasta")
print(f"Total unique peptides after digestion: {len(peptides)}")

## 1. Compute the Monoisotopic Mass of Each Peptide

**Aim of this task**
  - After protein digestion, the peptide identification workflow starts with calculating the monoisotopic mass.
  - Calculating the monoisotopic mass of each peptide allows for filtering candidate sequences based on their mass, facilitating comparison with experimental spectra.

**Implementation**
  - The `getMonoWeight()` method from `AASequence` can be used to compute a peptide sequence's monoisotopic mass.
  - See: [https://pyopenms.readthedocs.io/en/latest/user_guide/peptides_proteins.html](https://pyopenms.readthedocs.io/en/latest/user_guide/peptides_proteins.html)
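
To build intuition for what `getMonoWeight()` returns, here is a minimal sketch that sums monoisotopic residue masses and adds one water for the two termini. The residue table (a small subset, rounded to five decimals) and the helper name `mono_mass` are illustrative assumptions, not part of pyOpenMS, and modifications are ignored.

```python
# Monoisotopic residue masses in Da (illustrative subset, rounded to 5 decimals)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "P": 97.05276, "T": 101.04768,
    "I": 113.08406, "D": 115.02694, "K": 128.09496, "E": 129.04259,
}
WATER = 18.01056  # H2O: H on the N-terminus plus OH on the C-terminus


def mono_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide (hypothetical helper)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


print(f"{mono_mass('PEPTIDE'):.4f}")  # 799.3599
```

In the notebook itself we rely on `AASequence.getMonoWeight()`, which covers all residues and modifications.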

In [None]:
# Function to compute monoisotopic masses for a list of peptides
def get_peptide_weights(peptides):
    return [p.getMonoWeight() for p in peptides]

# calculate the monoisotopic masses of peptides
peptides_weight = get_peptide_weights(peptides)

## 2. Calculate the precursor mass of MS2 spectra

**Aim of this task**
  - After computing peptide monoisotopic masses, the next step is to determine the precursor masses from the MS2 spectra.
  - This allows identification of candidate peptides that match the experimental spectra.

**Implementation**
  - Precursor information can be extracted from each spectrum using the `getPrecursors()` method from `MSSpectrum`.
  - From the precursor object, the uncharged mass and charge can be obtained via `getUnchargedMass()` and `getCharge()`.
  - See: [https://pyopenms.readthedocs.io/en/latest/user_guide/ms_data.html](https://pyopenms.readthedocs.io/en/latest/user_guide/ms_data.html)
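
The relationship between precursor m/z, charge, and neutral mass can be sketched in a few lines. This assumes positive-mode ions that carry one proton per charge; the helper names `neutral_mass` and `mz_from_mass` are hypothetical and only illustrate the arithmetic.

```python
PROTON = 1.007276  # mass of a proton in Da


def neutral_mass(mz, charge):
    """Neutral mass of an ion observed at `mz` carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON)


def mz_from_mass(mass, charge):
    """Inverse operation: expected m/z of a neutral mass at a given charge."""
    return (mass + charge * PROTON) / charge


# A doubly charged precursor observed at m/z 400.6873
print(f"{neutral_mass(400.6873, 2):.4f}")
```

A precursor with an unknown charge (reported as 0) cannot be converted this way, which is why the sketch rejects it.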


In [None]:
# Function to extract precursor masses and charges from MS2 spectra
def get_precursor_weights(MS2):
    """
    Extract precursor mass and charge for each MS2 spectrum.
    Skips spectra without precursor information.
    """
    precursors_charge = []
    precursors_M = []
    valid_indices = []  # track which spectra have valid precursors

    for i, spec in enumerate(MS2):
        precursors = spec.getPrecursors()
        if len(precursors) == 0:
            continue  # skip spectra without precursor info
        
        p = precursors[0]
        precursors_charge.append(p.getCharge())
        precursors_M.append(p.getUnchargedMass())
        valid_indices.append(i)

    return np.array(precursors_M), np.array(precursors_charge), valid_indices

In [None]:
# Load only MS2 spectra from mzML (more efficient for large files)
options = oms.PeakFileOptions()
options.setMSLevels([2])  # only load MS level 2

MS2 = oms.MSExperiment()
mzml = oms.MzMLFile()
mzml.setOptions(options)
mzml.load("UPS1_5min.mzML", MS2)

# Sort peaks by m/z in each spectrum
for spec in MS2:
    spec.sortByPosition()

# Get precursor mass and charge for each MS2 spectrum
P_mass, P_charge, valid_indices = get_precursor_weights(MS2)
print(f"Loaded {len(MS2)} MS2 spectra")
print(f"Spectra with valid precursors: {len(P_mass)}")

## 3. Identify Candidate Peptides

**Aim of this task**

  - After obtaining precursor masses from the MS2 spectra and calculating monoisotopic masses for all peptides, the goal is to identify candidate peptides whose masses match the observed precursors.
  
**Implementation**

- Convert the list of peptide masses into a NumPy array to enable efficient vectorized comparison.

- For each precursor mass:

    - Use `np.isclose` with an absolute and optional relative tolerance to find all peptide masses within the specified matching window.

    - Retrieve the indices of matching peptides and collect the corresponding peptide sequences as candidate lists.

- Store the resulting candidate lists in a DataFrame for downstream theoretical spectrum generation.

- Add the precursor charge as an additional column, since fragment charge limits depend on it.

- Retain only spectra that have at least one candidate peptide.
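
To make the matching rule explicit, the sketch below re-implements the `np.isclose` criterion in plain Python: two masses match when `|a - b| <= atol + rtol * |b|`. The masses are made up for illustration, and `within_tolerance` is a hypothetical helper.

```python
def within_tolerance(precursor_mass, peptide_mass, atol=0.1, rtol=0.0):
    """Same test as np.isclose(precursor_mass, peptide_mass, atol=atol, rtol=rtol)."""
    return abs(precursor_mass - peptide_mass) <= atol + rtol * abs(peptide_mass)


# Made-up peptide masses (Da) and one precursor mass
peptide_masses = [799.36, 799.43, 800.40, 1045.53]
precursor = 799.365

# Absolute window of 0.1 Da
abs_hits = [m for m in peptide_masses if within_tolerance(precursor, m, atol=0.1)]
# Pure relative window of 10 ppm (rtol is a fraction, so 10 ppm = 10e-6)
ppm_hits = [m for m in peptide_masses if within_tolerance(precursor, m, atol=0.0, rtol=10e-6)]

print(abs_hits)  # [799.36, 799.43]
print(ppm_hits)  # [799.36]
```

Note that the two tolerances add up, so a pure ppm window requires setting the absolute tolerance to 0.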

<details>
<summary><b>Deep Dive: Understanding Mass Tolerance</b></summary>

**Why do we need tolerance?**

Mass spectrometers don't measure masses perfectly. There's always some measurement error due to:
- Instrument calibration
- Temperature fluctuations
- Space charge effects
- Detector limitations

**Absolute vs. Relative Tolerance**

```
Absolute tolerance (e.g., 0.1 Da):
┌────────────────────────────────────────────┐
│  At m/z 500:  ±0.1 Da = ±200 ppm          │
│  At m/z 1000: ±0.1 Da = ±100 ppm          │
│  At m/z 2000: ±0.1 Da = ±50 ppm           │
└────────────────────────────────────────────┘
Same absolute window, but different relative precision!

Relative tolerance (e.g., 10 ppm):
┌────────────────────────────────────────────┐
│  At m/z 500:  ±0.005 Da                   │
│  At m/z 1000: ±0.01 Da                    │
│  At m/z 2000: ±0.02 Da                    │
└────────────────────────────────────────────┘
Window scales with m/z - matches instrument behavior!
```

**Converting ppm to Da:**
```
tolerance_Da = (m/z × ppm) / 1,000,000

Example: 10 ppm at m/z 1000
tolerance_Da = (1000 × 10) / 1,000,000 = 0.01 Da
```
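
The conversion above can be written as two small helpers. The function names `ppm_to_da` and `ppm_error` are hypothetical and only illustrate the arithmetic.

```python
def ppm_to_da(mz, ppm):
    """Width in Da of a ppm tolerance at a given m/z."""
    return mz * ppm / 1e6


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error of an observation, in ppm of the theoretical value."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


# A 10 ppm window at three different m/z values
for mz in (500.0, 1000.0, 2000.0):
    print(mz, ppm_to_da(mz, 10))
```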

**Typical tolerance values:**
- Orbitrap instruments: 5-10 ppm
- TOF instruments: 10-20 ppm  
- Ion trap instruments: 0.2-0.5 Da (absolute)

</details>

In [None]:
def get_candidates_per_spectrum(precursor_weights, peptide_weights, peptides, 
                                  spectrum_indices, absolute_tolerance=0.1, relative_tolerance=0):
    """
    For each precursor mass, find peptides whose masses match within tolerance.

    Parameters:
    -----------
    precursor_weights : array-like
        Precursor masses (one per spectrum).
    peptide_weights : array-like
        Theoretical peptide masses.
    peptides : list
        List of peptide sequences (AASequence objects).
    spectrum_indices : list
        Original indices of spectra in the MS2 experiment.
    absolute_tolerance : float
        Maximum absolute mass difference allowed (default: 0.1 Da).
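    relative_tolerance : float
        Maximum relative mass difference allowed, as a fraction of the
        peptide mass (e.g., 10e-6 for 10 ppm; default: 0). np.isclose adds it
        to absolute_tolerance, so set absolute_tolerance=0 for a pure ppm window.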

    Returns:
    --------
    pd.DataFrame with columns: 'candidates', 'spectrum_idx'
    """
    pept_candidates = []
    peptide_weights_arr = np.array(peptide_weights)

    for prec_weight in precursor_weights:
        # Find peptides matching precursor mass within tolerance
        pept_indices = np.where(
            np.isclose(prec_weight, peptide_weights_arr, 
                      atol=absolute_tolerance, rtol=relative_tolerance)
        )[0]
        pept_candidates.append([peptides[i] for i in pept_indices])

    return pd.DataFrame({
        'candidates': pept_candidates,
        'spectrum_idx': spectrum_indices
    })

In [None]:
# Find candidate peptides for each spectrum
candidate_df = get_candidates_per_spectrum(
    precursor_weights=P_mass, 
    peptide_weights=peptides_weight, 
    peptides=peptides,
    spectrum_indices=valid_indices
)

# Add precursor charge (needed for fragment charge calculation)
candidate_df["charge"] = P_charge

# Keep only spectra with at least one candidate peptide
candidate_df = candidate_df[
    candidate_df["candidates"].apply(lambda x: len(x) >= 1)
].copy()  # use .copy() to avoid SettingWithCopyWarning

print(f"Spectra with candidate peptides: {len(candidate_df)}")

# For this tutorial, work with first 5 spectra that have candidates
candidate_df = candidate_df.head(5).copy()
print(f"Working with {len(candidate_df)} spectra for demonstration")
candidate_df

---

### Exercise 1: Effect of Mass Tolerance

**Predict first, then verify!**

1. **Prediction**: If we increase the `absolute_tolerance` from 0.1 Da to 1.0 Da, will we get MORE or FEWER candidate peptides per spectrum?

2. **Trade-off**: What are the consequences of using too narrow vs. too wide tolerance?

<details>
<summary><b>Click to reveal the answer</b></summary>

**Answer**: MORE candidate peptides.

A wider tolerance window means more peptides will have masses "close enough" to match each precursor. This has implications:

**Too narrow tolerance:**
- May miss correct peptides due to mass measurement error
- Fewer false candidates but risk of false negatives
- Good for high-accuracy instruments (Orbitrap, FT-ICR)

**Too wide tolerance:**
- More candidate peptides to evaluate (slower)
- Higher chance of incorrect matches (false positives)
- Better for low-accuracy instruments (ion traps)

**The sweet spot** depends on your instrument's mass accuracy:
- For Orbitrap data (5-10 ppm accurate): use 10-20 ppm
- For ion trap data (0.2-0.5 Da accurate): use 0.5 Da

**Try it**: Change `absolute_tolerance=0.1` to `absolute_tolerance=1.0` in the code above and observe how the number of candidates changes!

</details>

---

## 4. Generate Theoretical Spectra

**Background: MS/MS fragmentation and ion types**

In **tandem mass spectrometry (MS/MS or MS2)**, a precursor ion is first isolated based on its m/z, then **fragmented** by collision with inert gas molecules (collision-induced dissociation, CID). The resulting fragment ions are analyzed in a second MS scan, producing a fragmentation spectrum that serves as a fingerprint for peptide identification.

When peptides fragment along the backbone, they predominantly break at peptide bonds. If the charge stays on the **N-terminal fragment**, it's called a **b-ion**; if it stays on the **C-terminal fragment**, it's a **y-ion**. The series of b- and y-ions form a ladder pattern that reveals the amino acid sequence:

```
        b1  b2  b3  b4
         |   |   |   |
    H2N–[A]–[B]–[C]–[D]–[E]–COOH
             |   |   |   |
            y4  y3  y2  y1
```
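
The ladder can be turned into numbers with a short sketch. It assumes an unmodified peptide and the usual convention that a b-ion is the sum of its residues plus the charge-carrying protons, while a y-ion additionally contains one water. The residue table and `fragment_mz` are illustrative, not pyOpenMS API.

```python
# Monoisotopic residue masses in Da (illustrative subset)
RESIDUE_MASS = {"P": 97.05276, "E": 129.04259, "T": 101.04768,
                "I": 113.08406, "D": 115.02694}
WATER = 18.01056
PROTON = 1.007276


def fragment_mz(sequence, charge=1):
    """Return {label: m/z} for b1..b(n-1) and y1..y(n-1) at one charge state."""
    ions = {}
    n = len(sequence)
    for i in range(1, n):
        # b_i covers the first i residues, y_i the last i residues
        b_sum = sum(RESIDUE_MASS[aa] for aa in sequence[:i])
        y_sum = sum(RESIDUE_MASS[aa] for aa in sequence[n - i:]) + WATER
        ions[f"b{i}"] = (b_sum + charge * PROTON) / charge
        ions[f"y{i}"] = (y_sum + charge * PROTON) / charge
    return ions


ions = fragment_mz("PEPTIDE")
print(f"b2 = {ions['b2']:.4f}, y1 = {ions['y1']:.4f}")  # b2 = 227.1026, y1 = 148.0604
```

A useful sanity check: for singly charged fragments, b_i and y_(n-i) together always add up to the peptide mass plus two protons.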

**Aim of this task**

  - After finding the candidate peptides for each MS2 spectrum, the next step is to generate their theoretical fragmentation spectra.
  - These spectra serve as references for comparing against the observed MS2 spectra, which is essential for peptide identification.
  
**Implementation**

- Initialize a column in the DataFrame to store theoretical spectra corresponding to each candidate peptide.

- Configure a `TheoreticalSpectrumGenerator` by enabling b-ions, y-ions, and first prefix ions, while suppressing precursor peaks, neutral losses, and other additional features to keep the spectrum simple.

- Enable meta-information so that each theoretical peak contains its fragment-ion label in the StringDataArrays.

- For each candidate peptide:

    - Create an empty MSSpectrum object.

    - Generate the theoretical spectrum by specifying charge states from 1 up to the allowed maximum, defined as the minimum of 2 and (precursor charge − 1).

    - Append the resulting spectrum to a list.

- Store the complete list of theoretical spectra in the DataFrame for downstream scoring and annotation.

- See: [https://pyopenms.readthedocs.io/en/latest/user_guide/spectrum_alignment.html](https://pyopenms.readthedocs.io/en/latest/user_guide/spectrum_alignment.html)

In [None]:
# Configure theoretical spectrum generation
tsg = oms.TheoreticalSpectrumGenerator()
params = oms.Param()
params.setValue("add_y_ions", "true")
params.setValue("add_b_ions", "true")
params.setValue("add_first_prefix_ion", "true")
params.setValue("add_precursor_peaks", "false")
params.setValue("add_losses", "false")
params.setValue("add_metainfo", "true")  # needed for ion annotations
tsg.setParameters(params)

# Generate theoretical spectra for each candidate peptide
candidate_df["theo_spectra"] = None

for idx, row in candidate_df.iterrows():
    row_theo_spectra = []
    for peptide in row["candidates"]:
        theo_spectrum = oms.MSSpectrum()
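        # max fragment charge is min(2, precursor_charge - 1), but never below 1,
        # so singly charged or charge-unknown precursors still yield fragments
        max_frag_charge = max(1, min(int(row['charge']) - 1, 2))
        tsg.getSpectrum(theo_spectrum, peptide, 1, max_frag_charge)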
        row_theo_spectra.append(theo_spectrum)
    candidate_df.at[idx, "theo_spectra"] = row_theo_spectra

print("Generated theoretical spectra for all candidates")

In [None]:
candidate_df

Unnamed: 0,candidates,charge,theo_spectra
389,[(<pyopenms._pyopenms_1.Residue object at 0x7b...,4,"[[Peak1D(mz=65.0366, intensity=1.0), Peak1D(mz..."


## 5. Spectra Alignment

**Aim of this task**

  - After generating theoretical spectra for all candidate peptides, the next step is to align them with the corresponding observed MS2 spectra.
  - This alignment identifies which theoretical peaks match observed peaks, forming the basis for scoring and selecting the most likely peptide sequence.

**Background: Mass tolerance**

**Tolerance** defines the maximum allowed difference between observed and theoretical peak masses for them to be considered a match. This accounts for instrument measurement error and calibration imperfections.

Tolerance can be specified as:
- **Absolute** (e.g., 0.02 Da): fixed mass difference allowed
- **Relative** in **ppm (parts per million)**: scales with m/z. For example, 10 ppm at m/z 1000 means ±0.01 Da, while at m/z 500 it means ±0.005 Da. This better reflects how mass accuracy behaves in most instruments.
  
**Implementation**

  - Initialize a `SpectrumAlignment` object from PyOpenMS.
  - Configure alignment parameters:

      - Set a relative tolerance (e.g., 500 ppm) to define how closely peaks must match.
      - Use a relative tolerance rather than an absolute one so that the matching window scales with m/z, as instrument mass error does.

  - For each observed MS2 spectrum:

    - Iterate over all candidate theoretical spectra.
    - Align each theoretical spectrum to the observed spectrum using `getSpectrumAlignment`, which returns pairs of matched peak indices.
    - Store the resulting alignments for each candidate in the DataFrame for further analysis, such as scoring or visualization.

  - See: [https://pyopenms.readthedocs.io/en/latest/user_guide/spectrum_alignment.html](https://pyopenms.readthedocs.io/en/latest/user_guide/spectrum_alignment.html)
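
To illustrate the idea of peak alignment, here is a simplified nearest-peak matcher in plain Python. It is not the algorithm used by `SpectrumAlignment`; in particular, this sketch allows one observed peak to be matched by several theoretical peaks. The function name `match_peaks` and the toy m/z values are made up.

```python
from bisect import bisect_left


def match_peaks(theo_mz, obs_mz, tol_ppm=500.0):
    """Pair each theoretical peak with its nearest observed peak within tol_ppm.

    Both lists must be sorted by m/z. Returns (theo_idx, obs_idx) pairs.
    """
    pairs = []
    for t_idx, t in enumerate(theo_mz):
        pos = bisect_left(obs_mz, t)
        # The nearest observed peak sits just below or at the insertion point
        neighbours = [j for j in (pos - 1, pos) if 0 <= j < len(obs_mz)]
        if not neighbours:
            continue
        best = min(neighbours, key=lambda j: abs(obs_mz[j] - t))
        if abs(obs_mz[best] - t) <= t * tol_ppm / 1e6:
            pairs.append((t_idx, best))
    return pairs


theo = [100.0, 200.0, 300.0]
obs = [100.01, 150.0, 200.2, 300.0]
print(match_peaks(theo, obs))  # [(0, 0), (2, 3)]
```

The peak at 200.2 is rejected because 500 ppm at m/z 200 is only 0.1 Da.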

In [None]:
# Configure spectrum alignment
spa = oms.SpectrumAlignment()
p = spa.getParameters()
p.setValue("tolerance", 500.0)  # 500 ppm
p.setValue("is_relative_tolerance", "true")
spa.setParameters(p)

# Align each theoretical spectrum with the observed spectrum
candidate_df["alignment"] = None

for idx, row in candidate_df.iterrows():
    # Get observed spectrum using stored spectrum index
    observed_spectrum = MS2[row["spectrum_idx"]]
    row_alignment = []
    
    for theo_spec in row["theo_spectra"]:
        alignment = []
        spa.getSpectrumAlignment(alignment, theo_spec, observed_spectrum)
        row_alignment.append(alignment)
    
    candidate_df.at[idx, "alignment"] = row_alignment

print("Aligned all theoretical spectra with observed spectra")

In [None]:
candidate_df

Unnamed: 0,candidates,charge,theo_spectra,alignment
389,[(<pyopenms._pyopenms_1.Residue object at 0x7b...,4,"[[Peak1D(mz=65.0366, intensity=1.0), Peak1D(mz...","[[(7, 7), (9, 39), (12, 61), (14, 97), (21, 19..."


## 6. Calculate Score

**Aim of this task**

  - After aligning each theoretical spectrum with the observed MS2 spectrum, the next step is to quantify how well each candidate peptide explains the observed fragmentation pattern.
  - A simple and interpretable scoring approach is to count how many matched peaks correspond to b-ions and y-ions.

**Background: Peptide-Spectrum Matches (PSMs)**

A **PSM (Peptide-Spectrum Match)** is a pairing of an observed MS2 spectrum with a candidate peptide sequence, along with a score indicating how well the theoretical fragmentation pattern matches the observed peaks. The score helps rank candidates and select the most likely identification.
  
**Implementation**

  - For each observed spectrum:

    - Iterate through all candidate theoretical spectra.
    - From the alignment, retrieve the indices of the theoretical peaks that were matched to experimental peaks.
    - Apply the scoring function to compute the PSM score.

  - Define a scoring function that:

      - Receives the theoretical spectrum and the indices of theoretical peaks that were matched during alignment.
      - For each matched peak, extracts the fragment ion annotation from the theoretical spectrum (stored in StringDataArrays) to determine whether the peak corresponds to a b-ion or y-ion.
      - Counts the total number of matched b- and y-ions; this sum serves as the score for the peptide.

In [None]:
def match_score(spec, matched_indices):
    """
    Calculate PSM score as count of matched b- and y-ions.
    
    Parameters:
    -----------
    spec : MSSpectrum
        Theoretical spectrum with ion annotations.
    matched_indices : list
        Indices of matched theoretical peaks.
    
    Returns:
    --------
    int : Number of matched b- and y-ions
    """
    y_ion_count = 0
    b_ion_count = 0
    
    for idx in matched_indices:
        ion_type = spec.getStringDataArrays()[0][int(idx)].decode()
        if ion_type.startswith('y'):
            y_ion_count += 1
        elif ion_type.startswith('b'):
            b_ion_count += 1
    
    return b_ion_count + y_ion_count

# Calculate scores for each candidate
candidate_df["scores"] = None

for idx, row in candidate_df.iterrows():
    row_scores = []
    for theo_spectrum, alignment in zip(row["theo_spectra"], row["alignment"]):
        # Get theoretical peak indices that were matched
        theo_peak_indices = [pair[0] for pair in alignment]
        score = match_score(theo_spectrum, theo_peak_indices)
        row_scores.append(score)
    candidate_df.at[idx, "scores"] = row_scores

print("Calculated scores for all PSMs")

In [None]:
# View the results
candidate_df[['spectrum_idx', 'candidates', 'charge', 'scores']]

---

### Exercise 2: Analyze the Scoring Results

Look at the scores in the results above and answer these questions:

1. **Interpretation**: For a peptide with 10 amino acids, what is the maximum possible score using our simple ion counting method? (Hint: consider both b and y ions)

2. **Quality assessment**: If a PSM has a score of 5 for a 15-residue peptide, would you consider this a good or poor match? Why?

<details>
<summary><b>Click to check your answers</b></summary>

**Answer 1: Maximum possible score**

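For a peptide with n amino acids, counting each fragment at a single charge state:
- b-ions: b1, b2, ..., b(n-1) = **n-1 ions**
- y-ions: y1, y2, ..., y(n-1) = **n-1 ions**
- Maximum score = 2 × (n-1)

For a 10-residue peptide: max score = 2 × 9 = **18**

Note that the theoretical spectra in this notebook also contain doubly charged fragments whenever the precursor charge is 3 or higher, so for those spectra the count can reach 2 × 18 = **36**.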

**Answer 2: Quality assessment**

A score of 5 for a 15-residue peptide is relatively **poor**:
- Maximum possible: 2 × 14 = 28
- Coverage: 5/28 = 18%
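
The same arithmetic as a small sketch. The helper names are hypothetical, and the `n_charge_states` argument is an addition for spectra where fragments are generated at more than one charge state, as this notebook does for precursors with charge 3 or higher.

```python
def max_ion_count(n_residues, n_charge_states=1):
    """Number of theoretical b- and y-ions for a peptide of the given length."""
    return 2 * (n_residues - 1) * n_charge_states


def ion_coverage(score, n_residues, n_charge_states=1):
    """Fraction of theoretical b/y ions that were matched."""
    return score / max_ion_count(n_residues, n_charge_states)


print(max_ion_count(10))             # 18
print(f"{ion_coverage(5, 15):.0%}")  # 18%
```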

Good matches typically have:
- >50% ion coverage
- Consecutive ion series (e.g., y3, y4, y5, y6...)
- Both b and y ions represented

**Why might coverage be low?**
- Spectrum quality (noise, low abundance)
- Post-translational modifications not considered
- Neutral losses not included in our simple model
- Fragment charge states we didn't predict

**Real search engines** use more sophisticated scoring that considers:
- Ion intensity patterns
- Consecutive ion series
- Expected vs. unexpected peaks
- Statistical significance

</details>

<details>
<summary><b>Deep Dive: More Sophisticated Scoring Functions</b></summary>

Our simple ion counting is educational but limited. Real search engines use advanced scoring:

**XCorr (Comet, SEQUEST)**
- Cross-correlation between observed and theoretical spectra
- Accounts for peak intensities and spacing
- Normalized for spectrum complexity

**Hyperscore (X!Tandem)**
- Summed intensity of matched ions multiplied by the factorials of the matched b- and y-ion counts
- Rewards peptides that explain many b- and y-ions
- Fast computation
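
A rough sketch in the spirit of the hyperscore, assuming the commonly cited form (summed matched intensity) × Nb! × Ny!, computed in log space to avoid overflow. Real search engines differ in intensity normalization and scaling, so treat `log_hyperscore` as an illustration only.

```python
import math


def log_hyperscore(b_intensities, y_intensities):
    """log of (sum of matched intensities * Nb! * Ny!) for one PSM."""
    total = sum(b_intensities) + sum(y_intensities)
    if total <= 0:
        return 0.0  # nothing matched
    n_b, n_y = len(b_intensities), len(y_intensities)
    # math.lgamma(n + 1) equals log(n!)
    return math.log(total) + math.lgamma(n_b + 1) + math.lgamma(n_y + 1)


# Three matched b-ions and two matched y-ions, all with unit intensity
print(round(log_hyperscore([1.0, 1.0, 1.0], [1.0, 1.0]), 4))
```

Because of the factorial terms, each additional matched ion raises the score more than the previous one did, unlike our simple count.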

**E-value (MS-GF+, OMSSA)**
- Statistical expectation value
- Probability of match occurring by chance
- Accounts for database size

**Percolator (post-processing)**
- Machine learning approach
- Combines multiple features (mass error, ion coverage, etc.)
- Improves FDR estimation

</details>

---

## 7. Select the Best Candidate

For each spectrum, we now select the peptide with the **highest score** as the identified sequence.

**Note on FDR (False Discovery Rate):** In a real search engine, additional steps like **FDR control** would follow. The FDR estimates the proportion of incorrect identifications among all accepted PSMs—typically controlled at 1% using target-decoy approaches (searching against both real and reversed/shuffled protein sequences). This means roughly 1 in 100 accepted identifications may be incorrect. For this tutorial, we simply pick the top-scoring candidate without FDR filtering.

<details>
<summary><b>Deep Dive: Understanding FDR and Target-Decoy Approach</b></summary>

**The Problem: How Many Identifications Are Wrong?**

When you search thousands of spectra against a database, some will match by chance. But how many? That's where **False Discovery Rate (FDR)** comes in.

**Target-Decoy Strategy**

1. Create a **decoy database** by reversing or shuffling protein sequences
2. Search spectra against **both** target (real) and decoy databases
3. Decoy matches are incorrect by definition, so their number estimates how many target matches are false positives

```
Score Distribution:

High Score ←─────────────────────→ Low Score

TARGET:    ████████████████░░░░░░░░░░░░░░
DECOY:     ░░░░░░░░░░░░░░████████████████

           Good matches    Random matches
```

**FDR Calculation**

At any score threshold:
```
FDR = (# decoy hits) / (# target hits)
```

If you have 100 target hits and 1 decoy hit at a score threshold:
- FDR ≈ 1/100 = 1%
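
As a minimal sketch of this formula, the function below counts target and decoy PSMs at or above a score threshold. The scores are made up, and `fdr_at_threshold` is a hypothetical helper; it does not include refinements such as q-values.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Estimated FDR among target PSMs scoring at or above `threshold`."""
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    return n_decoy / n_target if n_target else 0.0


# Made-up scores for illustration only
targets = [12, 11, 10, 9, 9, 8, 5, 4]
decoys = [9, 4, 3, 2]

print(fdr_at_threshold(targets, decoys, 9))  # 0.2
```

Raising the threshold to 10 removes the only high-scoring decoy and brings the estimate down to 0.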

**Typical FDR thresholds:**
- PSM level: 1%
- Peptide level: 1%
- Protein level: 1%

**Why this matters**: Without FDR control, you might report thousands of "identifications" that are actually random matches!

</details>

In [None]:
# Select best candidate for each spectrum
def select_best_candidate(row):
    """Return the best-scoring peptide and its details."""
    if not row['candidates'] or not row['scores']:
        return None, None, None, None, 0
    
    best_idx = np.argmax(row['scores'])
    return (
        row['candidates'][best_idx],
        row['theo_spectra'][best_idx],
        row['alignment'][best_idx],
        row['scores'][best_idx],
        len(row['candidates'])
    )

# Create results summary
results = []
for idx, row in candidate_df.iterrows():
    best_pep, best_theo, best_align, best_score, n_candidates = select_best_candidate(row)
    if best_pep is not None:
        results.append({
            'spectrum_idx': row['spectrum_idx'],
            'peptide': best_pep.toString(),
            'score': best_score,
            'n_candidates': n_candidates,
            'charge': row['charge']
        })

results_df = pd.DataFrame(results)
print("=== Best Peptide Identifications ===\n")
print(results_df.to_string(index=False))

## 8. Visualize the Best Match with Mirror Plot

Finally, we visualize the best peptide-spectrum match using an interactive **mirror plot**.

**What is a mirror plot?** A mirror plot displays two spectra for easy visual comparison:
- **Top (pointing up)**: The experimental/observed spectrum, with matched peaks annotated by their ion type (b1, y3, etc.)
- **Bottom (pointing down/mirrored)**: The theoretical spectrum generated from the candidate peptide

Peaks that align vertically between the two spectra represent successful matches. Unmatched peaks in the experimental spectrum may be noise, neutral losses, or ions not included in the theoretical model. This visualization helps assess how well the identified peptide explains the observed fragmentation pattern.

In [None]:
import pyopenms_viz  # registers the plotting backend

def create_annotated_spectrum_df(observed_spectrum, theo_spectrum, alignment):
    """
    Create an annotated DataFrame from observed spectrum with ion labels from alignment.
    """
    mzs, intensities = observed_spectrum.get_peaks()
    annotations = [""] * len(mzs)
    
    for theo_idx, exp_idx in alignment:
        label = theo_spectrum.getStringDataArrays()[0][theo_idx]
        annotations[exp_idx] = label.decode() if isinstance(label, bytes) else str(label)
    
    return pd.DataFrame({
        'mz': mzs,
        'intensity': intensities,
        'ion_annotation': annotations
    })

# Get best matches and sort by score (top 3)
best_matches = []
for idx, row in candidate_df.iterrows():
    best_pep, best_theo, best_align, best_score, _ = select_best_candidate(row)
    if best_pep is not None:
        best_matches.append({
            'df_idx': idx,
            'spectrum_idx': row['spectrum_idx'],
            'peptide': best_pep,
            'theo_spectrum': best_theo,
            'alignment': best_align,
            'score': best_score
        })

# Sort by score descending and take top 3
best_matches.sort(key=lambda x: x['score'], reverse=True)
top_matches = best_matches[:3]

print(f"Showing top {len(top_matches)} matches by score:\n")

# Visualize top matches
for match in top_matches:
    observed_spectrum = MS2[match['spectrum_idx']]
    
    # Create annotated experimental spectrum DataFrame
    exp_df = create_annotated_spectrum_df(observed_spectrum, match['theo_spectrum'], match['alignment'])
    
    # Create theoretical spectrum DataFrame
    theo_df = match['theo_spectrum'].get_df()
    
    print(f"Spectrum {match['spectrum_idx']}: {match['peptide'].toString()} (score: {match['score']})")
    
    # Plot mirror spectrum
    exp_df.plot(
        kind='spectrum',
        backend='ms_plotly',
        x='mz',
        y='intensity',
        reference_spectrum=theo_df,
        mirror_spectrum=True,
        ion_annotation='ion_annotation',
        title=f"Best Match: {match['peptide'].toString()} (score: {match['score']})",
        width=900,
        height=500,
        annotate_top_n_peaks=0
    );

## Summary

Congratulations! You've implemented a complete peptide identification workflow. Here's what you learned:

| Step | Concept | Key Tool |
|------|---------|-------------------|
| **1. Digestion** | Trypsin cleavage produces searchable peptides | `ProteaseDigestion` |
| **2. Mass calculation** | Monoisotopic mass for database filtering | `AASequence.getMonoWeight()` |
| **3. Candidate selection** | Mass tolerance defines search space | `np.isclose()` |
| **4. Theoretical spectra** | b/y ions from peptide fragmentation | `TheoreticalSpectrumGenerator` |
| **5. Alignment** | Match observed to theoretical peaks | `SpectrumAlignment` |
| **6. Scoring** | Count matched ions as simple score | Custom function |
| **7. Selection** | Best-scoring candidate wins | `np.argmax()` |
| **8. Visualization** | Mirror plots show match quality | `pyopenms_viz` |

---

## Bonus Challenges

<details>
<summary><b>Challenge 1 (Beginner): Analyze More Spectra</b></summary>

Modify the code to analyze more than 5 spectra:

```python
# Change this line:
candidate_df = candidate_df.head(5).copy()
# To:
candidate_df = candidate_df.head(20).copy()
```

**Questions to explore:**
1. How many spectra get identified?
2. What's the distribution of scores?
3. Are higher-charge precursors harder to identify?

</details>

<details>
<summary><b>Challenge 2 (Intermediate): Improve the Scoring</b></summary>

Modify the `match_score` function to weight consecutive ion series more heavily:

```python
def improved_score(spec, matched_indices):
    # Your code: Give bonus points for consecutive ions
    # e.g., if you match y3, y4, y5, give extra points
    pass
```

**Hint**: Sort the matched ions by number and check for sequences.

</details>

<details>
<summary><b>Challenge 3 (Advanced): Add Neutral Loss Support</b></summary>

Enable neutral losses in the theoretical spectrum generator:

```python
params.setValue("add_losses", "true")  # Enable losses
```

**Questions:**
1. How does this change the number of theoretical peaks?
2. Does it improve or hurt your scores?
3. What neutral losses are most common (H2O, NH3)?

</details>

<details>
<summary><b>Challenge 4 (Expert): Implement Simple FDR</b></summary>

Create a decoy database and estimate FDR:

1. Reverse all peptide sequences to create decoys
2. Search against combined target-decoy database
3. At each score threshold, calculate: FDR = decoys / targets
4. Find the score threshold for 1% FDR

**Warning**: This is a simplified exercise. Real FDR requires careful statistical treatment.

</details>

---

**Next up: [Notebook 3 - Quantification](EUBIC_Task3_Quant.ipynb)** - Learn how to measure peptide abundances and compare samples!