<a href="https://colab.research.google.com/github/timosachsenberg/EuBIC2026/blob/main/notebooks/EUBIC_Task0_Prerequisites.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dependencies (for Google Colab)
!pip install -q "pyopenms>=3.5.0"

# Notebook 0 – Prerequisites: Python & pyOpenMS Fundamentals

**Welcome to the EuBIC 2026 Winter School!**

This optional notebook covers the foundational concepts you'll need for the main tutorials. It assumes some background in Python, but don't worry if you have never used Python before: just team up with someone who has, and we are sure you will learn a lot.

## Who should complete this notebook?

| Your Background | Recommendation |
|-----------------|----------------|
| New to Python | Complete Sections 1-3 |
| Know Python, new to pyOpenMS | Complete Sections 4-5 |
| Experienced with Python and pyOpenMS | Skip to Notebook 1 |

## Contents

1. **Python Essentials** – Variables, lists, dictionaries, functions
2. **NumPy Basics** – Arrays and vectorized operations
3. **Pandas Basics** – DataFrames for data analysis
4. **File Formats** – FASTA, mzML, idXML, featureXML
5. **pyOpenMS Object Model** – Core classes and patterns

---

# 1. Python Essentials

This section covers the Python basics you'll encounter throughout the tutorials.

## 1.1 Variables and Basic Types

Python uses dynamic typing – you don't need to declare variable types.

In [None]:
# Numbers
mass = 1234.5678       # float (decimal number)
charge = 2             # int (integer)
mz = mass / charge     # arithmetic operations

print(f"Mass: {mass}, Charge: {charge}, m/z: {mz}")

In [None]:
# Strings
peptide = "PEPTIDER"   # text in quotes
protein = 'PROTEIN'    # single or double quotes work

# String operations
print(f"Peptide: {peptide}")
print(f"Length: {len(peptide)} amino acids")
print(f"First residue: {peptide[0]}")
print(f"Last residue: {peptide[-1]}")

In [None]:
# Booleans
is_tryptic = True
has_modification = False

# Comparisons return booleans
print(f"Is charge > 1? {charge > 1}")
print(f"Is mass < 1000? {mass < 1000}")
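
To make dynamic typing concrete, the short sketch below rebinds one name to values of different types and checks each with the built-in `type()`. The variable names are purely illustrative.

```python
# Dynamic typing: the same name can refer to values of different types over time.
value = 2                           # starts out as an int
type_as_int = type(value).__name__

value = value / 4                   # true division (/) always produces a float
type_as_float = type(value).__name__

value = "PEPTIDER"                  # now the name refers to a str
type_as_str = type(value).__name__

print(type_as_int, type_as_float, type_as_str)
```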

## 1.2 Lists

Lists are ordered, mutable collections. You'll use them constantly.

In [None]:
# Creating lists
masses = [500.25, 600.30, 700.35, 800.40]
amino_acids = ['A', 'R', 'N', 'D', 'C']
empty_list = []

# Accessing elements (0-indexed!)
print(f"First mass: {masses[0]}")
print(f"Last mass: {masses[-1]}")
print(f"First three: {masses[:3]}")
print(f"Last two: {masses[-2:]}")

In [None]:
# Modifying lists
masses.append(900.45)      # Add to end
masses.insert(0, 400.20)   # Insert at position
print(f"After adding: {masses}")

# List information
print(f"Length: {len(masses)}")
print(f"Contains 600.30? {600.30 in masses}")

In [None]:
# Iterating over lists
print("All masses:")
for m in masses:
    print(f"  {m}")

# With index
print("\nWith indices:")
for i, m in enumerate(masses):
    print(f"  [{i}] {m}")

## 1.3 List Comprehensions

A concise way to create lists. You'll see these throughout the tutorials.

In [None]:
# Traditional loop approach
squared = []
for x in [1, 2, 3, 4, 5]:
    squared.append(x ** 2)
print(f"Squared (loop): {squared}")

# List comprehension - same result, one line
squared = [x ** 2 for x in [1, 2, 3, 4, 5]]
print(f"Squared (comprehension): {squared}")

In [None]:
# With filtering
masses = [400, 500, 600, 700, 800, 900, 1000]

# Only masses > 600
large_masses = [m for m in masses if m > 600]
print(f"Large masses: {large_masses}")

# Transform and filter
mz_values = [m / 2 for m in masses if m > 600]  # Assume charge 2 (simplified: ignores the proton mass)
print(f"m/z values (z=2, M>600): {mz_values}")

## 1.4 Dictionaries

Key-value pairs for storing related data.

In [None]:
# Amino acid masses (monoisotopic residue masses)
aa_masses = {
    'A': 71.037,   # Alanine
    'R': 156.101,  # Arginine
    'N': 114.043,  # Asparagine
    'D': 115.027,  # Aspartic acid
    'C': 103.009,  # Cysteine
}

# Accessing values
print(f"Mass of Alanine: {aa_masses['A']} Da")
print(f"Mass of Arginine: {aa_masses['R']} Da")

# Check if key exists
print(f"\nHave mass for 'K'? {'K' in aa_masses}")

In [None]:
# Calculate peptide mass
peptide = "ANDR"
water_mass = 18.011  # H2O: the terminal H and OH of the intact peptide

# Sum residue masses, then add one water for the terminal H and OH
peptide_mass = sum(aa_masses[aa] for aa in peptide) + water_mass
print(f"Mass of {peptide}: {peptide_mass:.3f} Da")
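
The lookup above raises a `KeyError` for any residue missing from `aa_masses`. As a minimal sketch, the hypothetical helper `peptide_mass` below reports unknown residues explicitly instead. It uses the same five-residue subset as above, so it is not a complete amino acid table.

```python
# Monoisotopic residue masses for a small subset of amino acids (same values as above)
aa_masses = {'A': 71.037, 'R': 156.101, 'N': 114.043, 'D': 115.027, 'C': 103.009}
WATER_MASS = 18.011  # one H2O for the terminal H and OH


def peptide_mass(sequence, residue_masses=aa_masses):
    """Return the neutral monoisotopic mass of a peptide (hypothetical helper)."""
    # Collect residues that have no entry in the mass table
    unknown = sorted(set(sequence) - set(residue_masses))
    if unknown:
        raise ValueError(f"No mass available for residue(s): {', '.join(unknown)}")
    return sum(residue_masses[aa] for aa in sequence) + WATER_MASS


print(f"ANDR: {peptide_mass('ANDR'):.3f} Da")

try:
    peptide_mass("ANDK")  # 'K' is not in the small table above
except ValueError as err:
    print(f"Error: {err}")
```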

## 1.5 Functions

Reusable blocks of code.

In [None]:
def calculate_mz(mass, charge):
    """
    Calculate m/z from mass and charge.

    Parameters:
    -----------
    mass : float
        Neutral mass in Daltons
    charge : int
        Charge state (positive)

    Returns:
    --------
    float : m/z value
    """
    proton_mass = 1.00728
    return (mass + charge * proton_mass) / charge

# Use the function
print(f"m/z of 1000 Da at z=1: {calculate_mz(1000, 1):.4f}")
print(f"m/z of 1000 Da at z=2: {calculate_mz(1000, 2):.4f}")
print(f"m/z of 1000 Da at z=3: {calculate_mz(1000, 3):.4f}")

In [None]:
# Functions with default parameters
def calculate_mz(mass, charge=2):  # charge defaults to 2
    proton_mass = 1.00728
    return (mass + charge * proton_mass) / charge

print(f"Default z=2: {calculate_mz(1000):.4f}")
print(f"Override z=3: {calculate_mz(1000, charge=3):.4f}")
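
Going the other way is just as common: an instrument reports m/z, and you want the neutral mass. The sketch below inverts the formula used above. `calculate_mass` is a hypothetical name chosen for this illustration, and the proton mass is the same rounded value as in `calculate_mz`.

```python
PROTON_MASS = 1.00728  # same rounded value as used above


def calculate_mz(mass, charge=2):
    """m/z of a neutral mass carrying `charge` protons."""
    return (mass + charge * PROTON_MASS) / charge


def calculate_mass(mz, charge):
    """Neutral mass from an observed m/z and charge (inverse of calculate_mz)."""
    return charge * (mz - PROTON_MASS)


# Round trip: mass -> m/z -> mass
mz = calculate_mz(1000.0, charge=2)
recovered_mass = calculate_mass(mz, charge=2)
print(f"m/z at z=2: {mz:.4f}")
print(f"Recovered mass: {recovered_mass:.4f} Da")
```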

### Exercise 1.1: Python Basics

Complete the following exercises to test your understanding.

In [None]:
# Exercise 1.1a: Create a list of peptide sequences
peptides = ["PEPTIDE", "SEQUENCE", "EXAMPLE"]

# TODO: Use a list comprehension to get the lengths of all peptides
# lengths = ???

# Uncomment to check:
# print(f"Lengths: {lengths}")  # Should be [7, 8, 7]

In [None]:
# Exercise 1.1b: Filter peptides
# TODO: Get only peptides longer than 7 amino acids
# long_peptides = ???

# Uncomment to check:
# print(f"Long peptides: {long_peptides}")  # Should be ['SEQUENCE']

<details>
<summary><b>Click for solutions</b></summary>

```python
# 1.1a
lengths = [len(p) for p in peptides]

# 1.1b
long_peptides = [p for p in peptides if len(p) > 7]
```

</details>

---

# 2. NumPy Basics

NumPy provides efficient arrays and mathematical operations. It's used extensively in scientific Python.

In [None]:
import numpy as np

# Creating arrays
masses = np.array([400.2, 500.3, 600.4, 700.5, 800.6])
print(f"Array: {masses}")
print(f"Type: {type(masses)}")
print(f"Shape: {masses.shape}")

In [None]:
# Vectorized operations (apply to all elements at once)
mz_z2 = masses / 2  # Divide all by 2 (simplified: ignores the proton mass)
print(f"Original masses: {masses}")
print(f"m/z at z=2: {mz_z2}")

# Much faster than loops for large arrays!
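
Vectorized arithmetic also covers the full m/z formula from Section 1.5, including the proton mass. The following is a minimal sketch that reuses the proton mass of 1.00728 from that section. The `charges` array and the use of broadcasting to get one row per charge state are additions for illustration.

```python
import numpy as np

PROTON_MASS = 1.00728
masses = np.array([400.2, 500.3, 600.4, 700.5, 800.6])

# One charge state: plain element-wise arithmetic on the whole array
mz_z2 = (masses + 2 * PROTON_MASS) / 2

# Several charge states at once: reshaping charges into a column lets
# broadcasting build a table with one row per charge and one column per mass
charges = np.array([1, 2, 3]).reshape(-1, 1)
mz_table = (masses + charges * PROTON_MASS) / charges

print(f"m/z at z=2: {mz_z2}")
print(f"Table shape (charges x masses): {mz_table.shape}")
```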

In [None]:
# Useful functions
intensities = np.array([1000, 5000, 2000, 8000, 3000])

print(f"Sum: {np.sum(intensities)}")
print(f"Mean: {np.mean(intensities):.1f}")
print(f"Max: {np.max(intensities)} at index {np.argmax(intensities)}")
print(f"Min: {np.min(intensities)} at index {np.argmin(intensities)}")

In [None]:
# Boolean indexing (filtering)
high_intensity = intensities > 3000
print(f"Boolean mask: {high_intensity}")
print(f"High intensity values: {intensities[high_intensity]}")
print(f"Corresponding masses: {masses[high_intensity]}")
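
Masks can be combined with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`. A short sketch using the same arrays as above, with arbitrary thresholds:

```python
import numpy as np

masses = np.array([400.2, 500.3, 600.4, 700.5, 800.6])
intensities = np.array([1000, 5000, 2000, 8000, 3000])

# Peaks that are both reasonably intense and inside a mass window
selected = (intensities >= 2000) & (masses < 750)
print(f"Selected masses: {masses[selected]}")

# ~ inverts a mask, giving everything that was not selected
print(f"Rejected masses: {masses[~selected]}")
```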

In [None]:
# Creating special arrays
zeros = np.zeros(5)
ones = np.ones(5)
range_arr = np.arange(0, 10, 2)  # start, stop, step
linspace = np.linspace(0, 100, 5)  # start, stop, num_points

print(f"Zeros: {zeros}")
print(f"Ones: {ones}")
print(f"Range: {range_arr}")
print(f"Linspace: {linspace}")

---

# 3. Pandas Basics

Pandas DataFrames are like spreadsheets in Python. You'll use them to analyze results.

In [None]:
import pandas as pd

# Creating a DataFrame
data = {
    'peptide': ['PEPTIDER', 'SAMPLER', 'TESTPEPTIDE'],
    'mass': [969.48, 787.40, 1248.59],
    'charge': [2, 2, 3],
    'score': [45.2, 32.1, 58.9]
}

df = pd.DataFrame(data)
df

In [None]:
# Accessing columns
print("Peptide column:")
print(df['peptide'])

print("\nMultiple columns:")
print(df[['peptide', 'score']])

In [None]:
# Adding new columns
df['mz'] = (df['mass'] + df['charge'] * 1.00728) / df['charge']
df['length'] = df['peptide'].apply(len)  # Apply len() to each value in the column
df

In [None]:
# Filtering rows
high_score = df[df['score'] > 40]
print("High scoring PSMs:")
high_score

In [None]:
# Summary statistics
print(df.describe())

In [None]:
# Iterating over rows
print("All PSMs:")
for idx, row in df.iterrows():
    print(f"  {row['peptide']}: score={row['score']}, m/z={row['mz']:.2f}")
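
`iterrows()` is convenient for printing but slow on large tables, so column-wise operations are usually preferable for analysis. The sketch below rebuilds the same small table and shows sorting and a per-charge summary. Which columns to sort and group by is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'peptide': ['PEPTIDER', 'SAMPLER', 'TESTPEPTIDE'],
    'mass': [969.48, 787.40, 1248.59],
    'charge': [2, 2, 3],
    'score': [45.2, 32.1, 58.9],
})

# Sort PSMs from best to worst score
ranked = df.sort_values('score', ascending=False)
print(ranked[['peptide', 'score']])

# Summarize per charge state: number of PSMs and mean score
summary = df.groupby('charge')['score'].agg(['count', 'mean'])
print(summary)
```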

---

# 4. File Formats

Common file formats you'll encounter in proteomics.

## 4.1 OpenMS File Formats

| Format | Extension | Contains |
|--------|-----------|----------|
| **idXML** | `.idXML` | Peptide/protein identifications from database search |
| **featureXML** | `.featureXML` | Detected features with quantitative information |
| **consensusXML** | `.consensusXML` | Features aligned across multiple samples |

---

# 5. pyOpenMS Object Model

pyOpenMS provides Python bindings to OpenMS, a C++ library for computational mass spectrometry.

**Online Documentation:**
- [pyOpenMS Documentation](https://pyopenms.readthedocs.io/en/latest/user_guide/index.html) – Official docs with tutorials and API
- Advanced: [OpenMS C++ Class Reference](https://abibuilder.cs.uni-tuebingen.de/archive/openms/Documentation/nightly/html/index.html) – Detailed API reference

In [None]:
import pyopenms as oms
print(f"pyOpenMS version: {oms.__version__}")

## 5.1 Getting Help and Documentation

pyOpenMS has extensive documentation. Here's how to explore it:

In [None]:
# List all available methods on a class using dir()
# Filter to show only public methods (not starting with '_')
spectrum = oms.MSSpectrum()
public_methods = [m for m in dir(spectrum) if not m.startswith('_')]
print("MSSpectrum methods (first 15):")
print(public_methods[:15])

In [None]:
# Get documentation for a specific method using help()
# This shows the docstring with parameter info and return types
help(spectrum.setRT)

In [None]:
# Show the full help for a class (the output is long)
help(oms.AASequence)

## 5.2 Common Pattern: Load → Process → Save

In [None]:
# The standard pyOpenMS workflow pattern:

# 1. Create empty container
exp = oms.MSExperiment()

# 2. Load data from file
# oms.MzMLFile().load("data.mzML", exp)

# 3. Process data
# ... (e.g., feature detection, filtering)

# 4. Save results
# oms.MzMLFile().store("output.mzML", exp)

print("Pattern: Create → Load → Process → Save")

## 5.3 Working with Algorithms

In [None]:
# Most algorithms follow this pattern:

# 1. Create algorithm instance
tsg = oms.TheoreticalSpectrumGenerator()

# 2. Get and modify parameters
params = tsg.getParameters()
params.setValue("add_b_ions", "true")
params.setValue("add_y_ions", "true")
tsg.setParameters(params)

# 3. Run the algorithm
peptide = oms.AASequence.fromString("PEPTIDER")
theo_spectrum = oms.MSSpectrum()
tsg.getSpectrum(theo_spectrum, peptide, 1, 2)  # charge 1 to 2

print(f"Generated {theo_spectrum.size()} theoretical peaks")