<a href="https://colab.research.google.com/github/timosachsenberg/EuBIC2026/blob/main/notebooks/EUBIC_Task2_ID.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dependencies (for Google Colab)
# Quote the requirements so the shell does not treat ">=" as an output redirect
!pip install -q "pyopenms>=3.5.0" "pyopenms-viz>=1.0.0"

# Notebook 2 – Peptide Identification by Database Search

In the previous notebook, we explored how proteins are digested into peptides and how LC–MS data is structured. Now we tackle the central challenge of proteomics: **identifying which peptides produced the observed spectra**.

This notebook implements a simplified peptide database search workflow—the same conceptual approach used by tools like Comet, MS-GF+, and Sage. We will:

1. **Digest proteins** and compute theoretical peptide masses
2. **Match precursor masses** to find candidate peptides for each spectrum  
3. **Generate theoretical fragment spectra** for each candidate
4. **Align and score** observed vs. theoretical spectra
5. **Select the best-matching peptide** for each spectrum
6. **Visualize** the match with an interactive mirror plot

## Overview

Having established a foundation in enzymatic digestion and mass-spectral visualization, we now turn to the central task of peptide identification through database search. Each step mirrors the core logic employed by modern search engines, but implemented transparently for educational purposes.

**Workflow steps:**

1. **Compute monoisotopic peptide masses** – Calculate the neutral mass for every peptide from in-silico digestion.

2. **Compute precursor masses for MS/MS spectra** – For each MS2 spectrum, derive the neutral precursor mass from m/z and charge.

3. **Select candidate peptides** – Compare precursor masses to find peptides within mass tolerance.

4. **Generate theoretical fragment spectra** – Create b/y ion spectra for each candidate peptide.

5. **Align observed and theoretical spectra** – Match experimental peaks to theoretical fragments.

6. **Score peptide–spectrum matches** – Count matched b/y ions as a simple scoring function.

7. **Select best candidate** – Choose the highest-scoring peptide for each spectrum.

8. **Visualize with mirror plot** – Compare experimental and theoretical spectra interactively.

In [None]:
%matplotlib inline
import os
import pyopenms as oms
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
print("pyOpenMS version:", oms.__version__)

pyOpenMS version: 3.5.0


In [None]:
# Download FASTA from course repository (2 UPS1 proteins: Albumin and Carbonic Anhydrase)
if not os.path.exists("two_ups_proteins.fasta"):
    !wget -q -O "two_ups_proteins.fasta" https://raw.githubusercontent.com/timosachsenberg/EuBIC2026/main/data/two_ups_proteins.fasta

# Download mzML (large file ~732MB, UPS1 spike-in experiment)
if not os.path.exists("mzML_file.mzML"):
    !wget -O "mzML_file.mzML" https://abibuilder.cs.uni-tuebingen.de/archive/openms/Tutorials/Example_Data/ProteomicsLFQ/UPS1_50000amol_R2.mzML

In [None]:
# Digest proteins into peptides (same approach as Notebook 1)
def preprocess_database(fasta_path, min_len_pept=6, max_len_pept=30, missed_cleavages=2):
    """
    Load a FASTA file and digest all proteins with Trypsin.
    Returns a list of unique peptides (as AASequence objects).
    """
    database_entries = []
    f = oms.FASTAFile()
    f.load(fasta_path, database_entries)

    # Configure protein digestion with Trypsin
    dig = oms.ProteaseDigestion()
    dig.setEnzyme("Trypsin")
    dig.setMissedCleavages(missed_cleavages)

    peptides = []
    for entry in database_entries:
        protein = oms.AASequence.fromString(entry.sequence)
        peptides_ = []
        dig.digest(protein, peptides_, min_len_pept, max_len_pept)
        peptides.extend(peptides_)

    # Remove duplicate peptides (same sequence from different proteins or missed cleavage variants)
    seen = set()
    unique_peptides = []
    for pep in peptides:
        seq_str = pep.toString()
        if seq_str not in seen:
            seen.add(seq_str)
            unique_peptides.append(pep)

    return unique_peptides

In [None]:
# Digest the UPS proteins database
peptides = preprocess_database("two_ups_proteins.fasta")
print(f"Total unique peptides after digestion: {len(peptides)}")

## 1. Compute the Monoisotopic Mass of Each Peptide

**Aim of this task**
  - After protein digestion, the peptide identification workflow starts with calculating the monoisotopic mass.
  - Calculating the monoisotopic mass of each peptide allows for filtering candidate sequences based on their mass, facilitating comparison with experimental spectra.

**Implementation**
  - The `getMonoWeight()` method from `AASequence` can be used to compute a peptide sequence's monoisotopic mass.
  - See: [https://pyopenms.readthedocs.io/en/latest/user_guide/peptides_proteins.html](https://pyopenms.readthedocs.io/en/latest/user_guide/peptides_proteins.html)
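
To make explicit what a monoisotopic mass calculation involves, here is a minimal pure-Python sketch. It assumes an unmodified peptide and uses standard monoisotopic residue masses rounded to five decimals. The names `RESIDUE_MASS`, `WATER_MASS`, and `mono_mass` are hypothetical helpers for illustration and are not part of pyOpenMS.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.010565  # one H2O accounts for the free N- and C-termini


def mono_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide (illustrative helper)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


print(f"PEPTIDE: {mono_mass('PEPTIDE'):.4f} Da")
```

For the peptide PEPTIDE this gives roughly 799.36 Da. In the notebook itself, `getMonoWeight()` does this work and also accounts for modifications, which the sketch ignores.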

In [None]:
# Function to compute monoisotopic masses for a list of peptides
def get_peptide_weights(peptides):
    return [p.getMonoWeight() for p in peptides]

# calculate the monoisotopic masses of peptides
peptides_weight = get_peptide_weights(peptides)

## 2. Calculate the precursor mass of MS2 spectra

**Aim of this task**
  - After computing peptide monoisotopic masses, the next step is to determine the precursor masses from the MS2 spectra.
  - This allows identification of candidate peptides that match the experimental spectra.

**Implementation**
  - Precursor information can be extracted from each spectrum using the `getPrecursors()` method from `MSSpectrum`.
  - From the precursor object, the uncharged mass and charge can be obtained via `getUnchargedMass()` and `getCharge()`.
  - See: [https://pyopenms.readthedocs.io/en/latest/user_guide/ms_data.html](https://pyopenms.readthedocs.io/en/latest/user_guide/ms_data.html)
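
The relation between precursor m/z, charge, and neutral mass can be sketched as follows. This is a minimal illustration that assumes protonated ions in positive mode. `neutral_mass` and `mz_from_mass` are hypothetical helper names, not pyOpenMS functions.

```python
PROTON_MASS = 1.007276466812  # Da


def neutral_mass(precursor_mz, charge):
    """Neutral mass M from m/z and charge z, using M = z * (m/z - m_proton)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (precursor_mz - PROTON_MASS)


def mz_from_mass(mass, charge):
    """Inverse relation: m/z = M / z + m_proton."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return mass / charge + PROTON_MASS


# A doubly charged precursor observed at m/z 400.687 has a neutral mass near 799.36 Da
print(f"{neutral_mass(400.687, 2):.3f}")
```

Working with neutral masses lets one list of peptide masses serve precursors of every charge state.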


In [None]:
# Function to extract precursor masses and charges from MS2 spectra
def get_precursor_weights(MS2):
    """
    Extract precursor mass and charge for each MS2 spectrum.
    Skips spectra without precursor information.
    """
    precursors_charge = []
    precursors_M = []
    valid_indices = []  # track which spectra have valid precursors

    for i, spec in enumerate(MS2):
        precursors = spec.getPrecursors()
        if len(precursors) == 0:
            continue  # skip spectra without precursor info
        
        p = precursors[0]
        precursors_charge.append(p.getCharge())
        precursors_M.append(p.getUnchargedMass())
        valid_indices.append(i)

    return np.array(precursors_M), np.array(precursors_charge), valid_indices

In [None]:
# Load only MS2 spectra from mzML (more efficient for large files)
options = oms.PeakFileOptions()
options.setMSLevels([2])  # only load MS level 2

MS2 = oms.MSExperiment()
mzml = oms.MzMLFile()
mzml.setOptions(options)
mzml.load("mzML_file.mzML", MS2)

# Sort peaks by m/z in each spectrum
for spec in MS2:
    spec.sortByPosition()

# Get precursor mass and charge for each MS2 spectrum
P_mass, P_charge, valid_indices = get_precursor_weights(MS2)
print(f"Loaded {len(MS2)} MS2 spectra")
print(f"Spectra with valid precursors: {len(P_mass)}")

## 3. Identify Candidate Peptides

**Aim of this task**

  - After obtaining precursor masses from the MS2 spectra and calculating monoisotopic masses for all peptides, the goal is to identify candidate peptides whose masses match the observed precursors.
  
**Implementation**

- Convert the list of peptide masses into a NumPy array to enable efficient vectorized comparison.

- For each precursor mass:

    - Use `np.isclose` with an absolute and optional relative tolerance to find all peptide masses within the specified matching window.

    - Retrieve the indices of matching peptides and collect the corresponding peptide sequences as candidate lists.

- Store the resulting candidate lists in a DataFrame for downstream theoretical spectrum generation.

- Add the precursor charge as an additional column, since fragment charge limits depend on it.

- Retain only spectra that have at least one candidate peptide.
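
As an alternative illustration of the mass-window lookup, the sketch below uses binary search on a sorted mass list instead of `np.isclose`. This is a swapped-in technique shown for comparison, not what the notebook does. The helper names `candidates_in_window` and `ppm_to_da` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def ppm_to_da(mass, ppm):
    """Convert a relative tolerance in ppm into an absolute window in Da."""
    return mass * ppm * 1e-6


def candidates_in_window(precursor_mass, sorted_masses, tol_da=0.1):
    """Indices of all masses within +/- tol_da of precursor_mass.

    sorted_masses must be sorted in ascending order.
    """
    lo = bisect_left(sorted_masses, precursor_mass - tol_da)
    hi = bisect_right(sorted_masses, precursor_mass + tol_da)
    return list(range(lo, hi))


masses = sorted([799.36, 1000.50, 1000.55, 1200.70])
print(candidates_in_window(1000.52, masses))  # → [1, 2]
print(candidates_in_window(900.00, masses))   # → []
```

With only two proteins the vectorized comparison is fast enough. A sorted index like this mainly pays off for databases with millions of peptides.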

In [None]:
def get_candidates_per_spectrum(precursor_weights, peptide_weights, peptides, 
                                  spectrum_indices, absolute_tolerance=0.1, relative_tolerance=0):
    """
    For each precursor mass, find peptides whose masses match within tolerance.

    Parameters:
    -----------
    precursor_weights : array-like
        Precursor masses (one per spectrum).
    peptide_weights : array-like
        Theoretical peptide masses.
    peptides : list
        List of peptide sequences (AASequence objects).
    spectrum_indices : list
        Original indices of spectra in the MS2 experiment.
    absolute_tolerance : float
        Maximum absolute mass difference allowed (default: 0.1 Da).
    relative_tolerance : float
        Maximum relative mass difference allowed (default: 0).

    Returns:
    --------
    pd.DataFrame with columns: 'candidates', 'spectrum_idx'
    """
    pept_candidates = []
    peptide_weights_arr = np.array(peptide_weights)

    for prec_weight in precursor_weights:
        # Find peptides matching precursor mass within tolerance
        pept_indices = np.where(
            np.isclose(prec_weight, peptide_weights_arr, 
                      atol=absolute_tolerance, rtol=relative_tolerance)
        )[0]
        pept_candidates.append([peptides[i] for i in pept_indices])

    return pd.DataFrame({
        'candidates': pept_candidates,
        'spectrum_idx': spectrum_indices
    })

In [None]:
# Find candidate peptides for each spectrum
candidate_df = get_candidates_per_spectrum(
    precursor_weights=P_mass, 
    peptide_weights=peptides_weight, 
    peptides=peptides,
    spectrum_indices=valid_indices
)

# Add precursor charge (needed for fragment charge calculation)
candidate_df["charge"] = P_charge

# Keep only spectra with at least one candidate peptide
candidate_df = candidate_df[
    candidate_df["candidates"].apply(lambda x: len(x) >= 1)
].copy()  # use .copy() to avoid SettingWithCopyWarning

print(f"Spectra with candidate peptides: {len(candidate_df)}")

# For this tutorial, work with first 5 spectra that have candidates
candidate_df = candidate_df.head(5).copy()
print(f"Working with {len(candidate_df)} spectra for demonstration")
candidate_df

## 4. Generate Theoretical Spectra

**Aim of this task**

  - After finding the candidate peptides for each MS2 spectrum, the next step is to generate their theoretical fragmentation spectra.
  - These spectra serve as references for comparing against the observed MS2 spectra, which is essential for peptide identification.
  
**Implementation**

- Initialize a column in the DataFrame to store theoretical spectra corresponding to each candidate peptide.

- Configure a `TheoreticalSpectrumGenerator` by enabling b-ions, y-ions, and first prefix ions, while suppressing precursor peaks, neutral losses, and other additional features to keep the spectrum simple.

- Enable meta-information so that each theoretical peak contains its fragment-ion label in the StringDataArrays.

- For each candidate peptide:

    - Create an empty MSSpectrum object.

    - Generate the theoretical spectrum by specifying charge states from 1 up to the allowed maximum, defined as the minimum of 2 or (precursor charge − 1).

    - Append the resulting spectrum to a list.

- Store the complete list of theoretical spectra in the DataFrame for downstream scoring and annotation.

- See: [https://pyopenms.readthedocs.io/en/latest/user_guide/spectrum_alignment.html](https://pyopenms.readthedocs.io/en/latest/user_guide/spectrum_alignment.html)
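
To show what the generator computes, here is a minimal pure-Python sketch of b- and y-ion m/z values for an unmodified peptide. The residue table is a small subset chosen for the example, the label format (`b2+`, `y3++`) is an assumption that may differ from the labels pyOpenMS writes, and `by_ions` is a hypothetical helper. The sketch also clamps the maximum fragment charge to at least 1, which is an assumption of this example.

```python
PROTON_MASS = 1.007276466812  # Da
WATER_MASS = 18.010565        # Da
# Subset of monoisotopic residue masses, enough for the example peptide
RESIDUE_MASS = {"P": 97.05276, "E": 129.04259, "T": 101.04768,
                "I": 113.08406, "D": 115.02694}


def by_ions(sequence, precursor_charge):
    """Return (label, m/z) pairs for b/y ions, sorted by m/z."""
    # Fragment charges run from 1 to min(2, precursor charge - 1), at least 1
    max_charge = max(1, min(2, precursor_charge - 1))
    residues = [RESIDUE_MASS[aa] for aa in sequence]
    ions = []
    for i in range(1, len(residues)):
        b_neutral = sum(residues[:i])                # N-terminal fragment
        y_neutral = sum(residues[-i:]) + WATER_MASS  # C-terminal fragment
        for z in range(1, max_charge + 1):
            ions.append((f"b{i}" + "+" * z, (b_neutral + z * PROTON_MASS) / z))
            ions.append((f"y{i}" + "+" * z, (y_neutral + z * PROTON_MASS) / z))
    return sorted(ions, key=lambda ion: ion[1])


for label, mz in by_ions("PEPTIDE", precursor_charge=2)[:4]:
    print(f"{label:>4s}  {mz:.4f}")
```

Including `b1` here corresponds to the `add_first_prefix_ion` option. Neutral losses and other ion series are left out, as in the configuration described above.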

In [None]:
# Configure theoretical spectrum generation
tsg = oms.TheoreticalSpectrumGenerator()
params = oms.Param()
params.setValue("add_y_ions", "true")
params.setValue("add_b_ions", "true")
params.setValue("add_first_prefix_ion", "true")
params.setValue("add_precursor_peaks", "false")
params.setValue("add_losses", "false")
params.setValue("add_metainfo", "true")  # needed for ion annotations
tsg.setParameters(params)

# Generate theoretical spectra for each candidate peptide
candidate_df["theo_spectra"] = None

for idx, row in candidate_df.iterrows():
    row_theo_spectra = []
    for peptide in row["candidates"]:
        theo_spectrum = oms.MSSpectrum()
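        # max fragment charge is min(2, precursor_charge - 1), but at least 1 so that
        # singly charged (or unknown-charge) precursors still yield fragment ions
        max_frag_charge = max(1, min(int(row['charge']) - 1, 2))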
        tsg.getSpectrum(theo_spectrum, peptide, 1, max_frag_charge)
        row_theo_spectra.append(theo_spectrum)
    candidate_df.at[idx, "theo_spectra"] = row_theo_spectra

print("Generated theoretical spectra for all candidates")

In [None]:
candidate_df

| | candidates | charge | theo_spectra |
|---|---|---|---|
| 389 | `[(<pyopenms._pyopenms_1.Residue object at 0x7b...` | 4 | `[[Peak1D(mz=65.0366, intensity=1.0), Peak1D(mz...` |


## 5. Spectra Alignment

**Aim of this task**

  - After generating theoretical spectra for all candidate peptides, the next step is to align them with the corresponding observed MS2 spectra.
  - This alignment identifies which theoretical peaks match observed peaks, forming the basis for scoring and selecting the most likely peptide sequence.
  
**Implementation**

  - Initialize a `SpectrumAlignment` object from pyOpenMS.
  - Configure alignment parameters:

      - Set a relative tolerance (e.g., 500 ppm) to define how closely peaks must match.
      - Use a relative rather than an absolute tolerance, because mass measurement error typically scales with m/z.

  - For each observed MS2 spectrum:

    - Iterate over all candidate theoretical spectra.
    - Align each theoretical spectrum to the observed spectrum using `getSpectrumAlignment`, which returns pairs of matched peak indices.
    - Store the resulting alignments for each candidate in the DataFrame for further analysis, such as scoring or visualization.

  - See: [https://pyopenms.readthedocs.io/en/latest/user_guide/spectrum_alignment.html](https://pyopenms.readthedocs.io/en/latest/user_guide/spectrum_alignment.html)
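
The idea of matching peaks within a ppm tolerance can be sketched with a simple greedy nearest-peak search. This is a simplified stand-in for illustration and is not the algorithm `SpectrumAlignment` uses internally. `ppm_error` and `match_peaks` are hypothetical helpers, and both peak lists are assumed to be sorted by m/z.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_peaks(theo_mzs, obs_mzs, tol_ppm=500.0):
    """Greedy one-to-one matching; returns (theo_idx, obs_idx) pairs."""
    pairs = []
    if not obs_mzs:
        return pairs
    j = 0
    for i, t in enumerate(theo_mzs):
        # Advance to the observed peak closest to this theoretical peak
        while j + 1 < len(obs_mzs) and abs(obs_mzs[j + 1] - t) <= abs(obs_mzs[j] - t):
            j += 1
        already_used = bool(pairs) and pairs[-1][1] == j
        if not already_used and abs(ppm_error(obs_mzs[j], t)) <= tol_ppm:
            pairs.append((i, j))
    return pairs


theo = [100.0, 200.0, 300.0]
obs = [100.01, 150.0, 200.2, 299.9]
print(match_peaks(theo, obs))  # → [(0, 0), (2, 3)]
```

In this example the peak at 200.2 is 1000 ppm away from 200.0 and is rejected, while 299.9 lies about 333 ppm from 300.0 and is accepted. The pair order (theoretical index first) follows the argument order used in the alignment cell.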

In [None]:
# Configure spectrum alignment
spa = oms.SpectrumAlignment()
p = spa.getParameters()
p.setValue("tolerance", 500.0)  # 500 ppm
p.setValue("is_relative_tolerance", "true")
spa.setParameters(p)

# Align each theoretical spectrum with the observed spectrum
candidate_df["alignment"] = None

for idx, row in candidate_df.iterrows():
    # Get observed spectrum using stored spectrum index
    observed_spectrum = MS2[row["spectrum_idx"]]
    row_alignment = []
    
    for theo_spec in row["theo_spectra"]:
        alignment = []
        spa.getSpectrumAlignment(alignment, theo_spec, observed_spectrum)
        row_alignment.append(alignment)
    
    candidate_df.at[idx, "alignment"] = row_alignment

print("Aligned all theoretical spectra with observed spectra")

In [None]:
candidate_df

| | candidates | charge | theo_spectra | alignment |
|---|---|---|---|---|
| 389 | `[(<pyopenms._pyopenms_1.Residue object at 0x7b...` | 4 | `[[Peak1D(mz=65.0366, intensity=1.0), Peak1D(mz...` | `[[(7, 7), (9, 39), (12, 61), (14, 97), (21, 19...` |


## 6. Calculate Score

**Aim of this task**

  - After aligning each theoretical spectrum with the observed MS2 spectrum, the next step is to quantify how well each candidate peptide explains the observed fragmentation pattern.
  - A simple and interpretable scoring approach is to count how many matched peaks correspond to b-ions and y-ions.
  
**Implementation**

  - For each observed spectrum:

    - Iterate through all candidate theoretical spectra.
    - From the alignment, retrieve the indices of the theoretical peaks that were matched to experimental peaks.
    - Apply the scoring function to compute the peptide-spectrum match score.

  - Define a scoring function that:

      - Receives the theoretical spectrum and the indices of theoretical peaks that were matched during alignment.
      - For each matched peak, extracts the fragment ion annotation from the theoretical spectrum (stored in StringDataArrays) to determine whether the peak corresponds to a b-ion or y-ion.
      - Counts the total number of matched b- and y-ions; this sum serves as the score for the peptide.
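
The counting logic can be illustrated without pyOpenMS by operating on a plain list of ion labels. The label format shown (`b2+`, `y2++`) is an assumption for the example, and `count_by_ions` is a hypothetical helper.

```python
def count_by_ions(labels, matched_indices):
    """Count matched theoretical peaks whose label starts with 'b' or 'y'."""
    b_count = 0
    y_count = 0
    for i in matched_indices:
        label = labels[i]
        if isinstance(label, bytes):  # accept bytes as well as str labels
            label = label.decode()
        if label.startswith("b"):
            b_count += 1
        elif label.startswith("y"):
            y_count += 1
    return b_count + y_count


labels = ["b1+", "y1+", "b2+", "y2+", b"y2++"]
print(count_by_ions(labels, [0, 3, 4]))  # → 3
```

A plain count favors longer peptides, which have more theoretical fragments to match. Production search engines therefore use more elaborate scores.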

In [None]:
def match_score(spec, matched_indices):
    """
    Calculate PSM score as count of matched b- and y-ions.
    
    Parameters:
    -----------
    spec : MSSpectrum
        Theoretical spectrum with ion annotations.
    matched_indices : list
        Indices of matched theoretical peaks.
    
    Returns:
    --------
    int : Number of matched b- and y-ions
    """
    y_ion_count = 0
    b_ion_count = 0
    
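    if not matched_indices:
        return 0  # nothing matched, so there is nothing to count

    # Fragment-ion labels are stored in the first StringDataArray; fetch it once
    annotations = spec.getStringDataArrays()[0]

    for idx in matched_indices:
        label = annotations[int(idx)]
        # Accept both bytes and str labels
        ion_type = label.decode() if isinstance(label, bytes) else str(label)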
        if ion_type.startswith('y'):
            y_ion_count += 1
        elif ion_type.startswith('b'):
            b_ion_count += 1
    
    return b_ion_count + y_ion_count

# Calculate scores for each candidate
candidate_df["scores"] = None

for idx, row in candidate_df.iterrows():
    row_scores = []
    for theo_spectrum, alignment in zip(row["theo_spectra"], row["alignment"]):
        # Get theoretical peak indices that were matched
        theo_peak_indices = [pair[0] for pair in alignment]
        score = match_score(theo_spectrum, theo_peak_indices)
        row_scores.append(score)
    candidate_df.at[idx, "scores"] = row_scores

print("Calculated scores for all PSMs")

In [None]:
# View the results
candidate_df[['spectrum_idx', 'candidates', 'charge', 'scores']]

## 7. Select the Best Candidate

For each spectrum, we now select the peptide with the **highest score** as the identified sequence. In a real search engine, additional steps like FDR control would follow, but for this tutorial we simply pick the top-scoring candidate.
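
One detail worth knowing is how ties are handled: `np.argmax` returns the first index when several candidates share the top score. The sketch below makes ties visible. `best_candidate` is a hypothetical helper working on plain lists, not part of the notebook.

```python
def best_candidate(candidates, scores):
    """Return (best, top_score, is_tied), or None if there are no candidates."""
    if not candidates:
        return None
    top = max(scores)
    tied = [c for c, s in zip(candidates, scores) if s == top]
    # Like np.argmax, keep the first top-scoring candidate, but report ties
    return tied[0], top, len(tied) > 1


print(best_candidate(["AAK", "GGR", "PEK"], [3, 7, 7]))  # → ('GGR', 7, True)
```

A tied top score suggests that the simple ion count cannot discriminate between the candidates for that spectrum.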

In [None]:
# Select best candidate for each spectrum
def select_best_candidate(row):
    """Return the best-scoring peptide and its details."""
    if not row['candidates'] or not row['scores']:
        return None, None, None, None, 0
    
    best_idx = np.argmax(row['scores'])
    return (
        row['candidates'][best_idx],
        row['theo_spectra'][best_idx],
        row['alignment'][best_idx],
        row['scores'][best_idx],
        len(row['candidates'])
    )

# Create results summary
results = []
for idx, row in candidate_df.iterrows():
    best_pep, best_theo, best_align, best_score, n_candidates = select_best_candidate(row)
    if best_pep is not None:
        results.append({
            'spectrum_idx': row['spectrum_idx'],
            'peptide': best_pep.toString(),
            'score': best_score,
            'n_candidates': n_candidates,
            'charge': row['charge']
        })

results_df = pd.DataFrame(results)
print("=== Best Peptide Identifications ===\n")
print(results_df.to_string(index=False))

## 8. Visualize the Best Match with Mirror Plot

Finally, we visualize the best peptide-spectrum match using an interactive **mirror plot**:
- **Top**: Experimental spectrum with annotated matched peaks
- **Bottom**: Theoretical spectrum (mirrored)

This visualization confirms how well the identified peptide explains the observed fragmentation.

In [None]:
import pyopenms_viz  # registers the plotting backend

def create_annotated_spectrum_df(observed_spectrum, theo_spectrum, alignment):
    """
    Create an annotated DataFrame from observed spectrum with ion labels from alignment.
    """
    mzs, intensities = observed_spectrum.get_peaks()
    annotations = [""] * len(mzs)
    
    for theo_idx, exp_idx in alignment:
        label = theo_spectrum.getStringDataArrays()[0][theo_idx]
        annotations[exp_idx] = label.decode() if isinstance(label, bytes) else str(label)
    
    return pd.DataFrame({
        'mz': mzs,
        'intensity': intensities,
        'ion_annotation': annotations
    })

# Get best matches and sort by score (top 3)
best_matches = []
for idx, row in candidate_df.iterrows():
    best_pep, best_theo, best_align, best_score, _ = select_best_candidate(row)
    if best_pep is not None:
        best_matches.append({
            'df_idx': idx,
            'spectrum_idx': row['spectrum_idx'],
            'peptide': best_pep,
            'theo_spectrum': best_theo,
            'alignment': best_align,
            'score': best_score
        })

# Sort by score descending and take top 3
best_matches.sort(key=lambda x: x['score'], reverse=True)
top_matches = best_matches[:3]

print(f"Showing top {len(top_matches)} matches by score:\n")

# Visualize top matches
for match in top_matches:
    observed_spectrum = MS2[match['spectrum_idx']]
    
    # Create annotated experimental spectrum DataFrame
    exp_df = create_annotated_spectrum_df(observed_spectrum, match['theo_spectrum'], match['alignment'])
    
    # Create theoretical spectrum DataFrame
    theo_df = match['theo_spectrum'].get_df()
    
    print(f"Spectrum {match['spectrum_idx']}: {match['peptide'].toString()} (score: {match['score']})")
    
    # Plot mirror spectrum
    exp_df.plot(
        kind='spectrum',
        backend='ms_plotly',
        x='mz',
        y='intensity',
        reference_spectrum=theo_df,
        mirror_spectrum=True,
        ion_annotation='ion_annotation',
        title=f"Best Match: {match['peptide'].toString()} (score: {match['score']})",
        width=900,
        height=500,
        annotate_top_n_peaks=0
    );