# Rescoring of non-cleavable XL-MS data using Oktoberfest

This notebook provides an overview of rescoring non-cleavable XL-MS data in Oktoberfest. 

## 1. Import necessary Python packages

In [1]:
import os
from oktoberfest.runner import run_job
import json
import urllib.request
import shutil
from tqdm import tqdm
import zipfile
import requests



## 2. Installation:

### ThermoRawFileParser:
- You need this if you want to read Thermo RAW files.

### Get the current directory and set the file name

In [2]:
download_dir = os.getcwd()
download_file = os.path.join(download_dir, "ThermoRawFileParser1.4.3.zip")
url = "https://github.com/compomics/ThermoRawFileParser/releases/download/v1.4.3/ThermoRawFileParser1.4.3.zip"

# set download to False if you already have the file and don't want to download again in the next step
download = True

### Download and extract files to the same directory

In [3]:
if download:
    with tqdm(unit="B", total=3520000, unit_scale=True, unit_divisor=1000, miniters=1, desc=url.split("/")[-1]) as t:
        urllib.request.urlretrieve(
            url=url,
            filename=download_file,
            reporthook=lambda blocks, block_size, _: t.update(blocks * block_size - t.n)
        )

with zipfile.ZipFile(download_file, 'r') as zip_ref:
    for member in zip_ref.namelist():
        # Remove any folder prefix to extract directly here
        filename = os.path.basename(member)
        if not filename:  # Skip directory entries
            continue
        source = zip_ref.open(member)
        target_path = os.path.join(download_dir, filename)
        with open(target_path, "wb") as target:
            with source as src:
                target.write(src.read())

ThermoRawFileParser1.4.3.zip: 3.52MB [00:01, 2.73MB/s]                                                                        


### Percolator:
- To install Percolator on Windows, download the installer: https://github.com/percolator/percolator/releases/download/rel-3-06-01/percolator-v3-06.exe
- Run the downloaded file. During setup, select "add percolator to the system PATH for all users" when asked.
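After installation, it can help to confirm that the executable is actually reachable from the notebook. The following is a minimal sketch using `shutil.which` from the standard library; the helper name `find_executable` is only illustrative, and the check assumes the executable is called `percolator`, as in the command that Oktoberfest logs later in this notebook.

```python
import shutil
from typing import Optional


def find_executable(name: str) -> Optional[str]:
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


# Assumption: the installer registered the binary under the name "percolator".
percolator_path = find_executable("percolator")
if percolator_path is None:
    print("percolator was not found on PATH; re-run the installer or add it manually.")
else:
    print(f"percolator found at {percolator_path}")
```

If the path is reported as missing right after installation, restarting the Jupyter kernel (or the terminal that launched it) is usually needed so that the updated PATH is picked up.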

### Oktoberfest:
- Oktoberfest currently supports Python versions 3.10 and 3.11. Support for 3.12 will be added in the near future.
- Install Oktoberfest with `pip install oktoberfest`.
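To check whether the running interpreter falls within the supported versions listed above, a small sketch like the following can be run before installing. The set of supported versions is copied from the text and may be outdated for newer Oktoberfest releases; `is_supported_python` is a hypothetical helper, not part of Oktoberfest.

```python
import sys

# Assumption: taken from the statement above (Python 3.10 and 3.11 only).
SUPPORTED_VERSIONS = {(3, 10), (3, 11)}


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version is in SUPPORTED_VERSIONS."""
    return (version_info[0], version_info[1]) in SUPPORTED_VERSIONS


if is_supported_python():
    print("Python version is supported.")
else:
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]} is not in the supported list.")
```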

## 3. Download files from Zenodo required for the rescoring task

The data used in this tutorial is available through a public Zenodo record.
The dataset is approximately 639 MB in size and includes:

- msms.csv: the output from **xiSearch (version 1.8.7)**

- one RAW file used for rescoring crosslinked peptides.

### Get the current directory and set the file name

In [4]:
download_dir = os.getcwd()
download_file = os.path.join(download_dir, "Oktoberfest_XL_input.zip")
url = "https://zenodo.org/records/15663781/files/Oktoberfest_XL_input.zip?download=1"

# set download to False if you already have the file and don't want to download again in the next step
download = True

### Download and extract files from Zenodo to the same directory

In [5]:
if download:
    with tqdm(unit="B", total=758000000, unit_scale=True, unit_divisor=1000, miniters=1, desc=url.split("/")[-1]) as t:
        urllib.request.urlretrieve(
            url=url,
            filename=download_file,
            reporthook=lambda blocks, block_size, _: t.update(blocks * block_size - t.n)
        )
    shutil.unpack_archive(download_file, download_dir)


Oktoberfest_XL_input.zip?download=1: 758MB [01:03, 12.0MB/s]                                                                  


### Check downloaded files

In [6]:
input_dir = download_file[:-4]
print(f"Downloaded data is stored in {input_dir}\nContents:")
os.listdir(input_dir)

Downloaded data is stored in /home/mkalhor/oktoberfest/tutorials/Oktoberfest_XL_input
Contents:


['msms.csv', 'XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw']

## 4. Rescoring
**Important**: The intensity model specified in the config file is
**Prosit_2024_intensity_XL_NMS2**. It is used because the crosslinker in this dataset, DSS, is non-cleavable.

If you are using a cleavable crosslinker (e.g., DSSO or DSBU), update the model name in the config file to:
**Prosit_2023_intensity_XL_CMS2**
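The model choice described above can be captured in a small lookup, which is convenient when generating configs for several datasets. This is only a sketch: `intensity_model_for` is a hypothetical helper, and the mapping covers only the crosslinkers named in this notebook (DSS, DSSO, DSBU). Other crosslinkers raise an error rather than being guessed.

```python
NON_CLEAVABLE_MODEL = "Prosit_2024_intensity_XL_NMS2"
CLEAVABLE_MODEL = "Prosit_2023_intensity_XL_CMS2"

# Only the crosslinkers mentioned in this tutorial are listed here.
NON_CLEAVABLE_CROSSLINKERS = {"DSS"}
CLEAVABLE_CROSSLINKERS = {"DSSO", "DSBU"}


def intensity_model_for(crosslinker: str) -> str:
    """Return the intensity model name for a known crosslinker."""
    name = crosslinker.strip().upper()
    if name in NON_CLEAVABLE_CROSSLINKERS:
        return NON_CLEAVABLE_MODEL
    if name in CLEAVABLE_CROSSLINKERS:
        return CLEAVABLE_MODEL
    raise ValueError(f"Unknown crosslinker: {crosslinker!r}")


print(intensity_model_for("DSS"))
```

The returned string can then be placed under `"models": {"intensity": ...}` in the config dictionary.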

### Generate config file

In [7]:
task_config_rescoring = {
    "type": "Rescoring",
    "tag": "",
    "inputs":{
        "search_results": input_dir + "/msms.csv",
        "search_results_type": "Xisearch",
        "spectra": input_dir,
        "spectra_type": "raw",
    },
    "output": "./XL_out",
    "models": {
        "intensity": "Prosit_2024_intensity_XL_NMS2",
        "irt": ""
    },
    "prediction_server": "koina.wilhelmlab.org:443",
    "ssl": True,
    "thermoExe": "ThermoRawFileParser.exe",
    "numThreads": 1,
    "fdr_estimation_method": "percolator",
    "allFeatures": False,
    "massTolerance": 40,
    "unitMassTolerance": "ppm",
    "ce_alignment_options": {
        "ce_range": [
            5,
            45
        ],
        "use_ransac_model": True
    }
}

### Save the config file

In [8]:
with open("./rescoring_config.json", "w") as fp:
    json.dump(task_config_rescoring, fp)
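
Before starting a job, it can be useful to reload the saved config and check that the paths it points to exist. The sketch below is not part of Oktoberfest; `check_rescoring_config` is a hypothetical helper that only inspects the `inputs` and `models` keys used in the config above.

```python
import json
import os


def check_rescoring_config(config_path):
    """Return a list of human-readable problems found in a rescoring config."""
    with open(config_path) as fp:
        config = json.load(fp)

    problems = []
    inputs = config.get("inputs", {})

    search_results = inputs.get("search_results", "")
    if not os.path.isfile(search_results):
        problems.append(f"search_results file not found: {search_results}")

    spectra = inputs.get("spectra", "")
    if not os.path.exists(spectra):
        problems.append(f"spectra path not found: {spectra}")

    if not config.get("models", {}).get("intensity"):
        problems.append("no intensity model set under models.intensity")

    return problems


# Only runs the check if the config from the previous cell exists.
if os.path.exists("./rescoring_config.json"):
    for problem in check_rescoring_config("./rescoring_config.json"):
        print(problem)
```

An empty list means none of these basic checks failed; it does not guarantee that the job itself will succeed.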

### Start rescoring

In [9]:
run_job("./rescoring_config.json")

2025-06-21 10:04:29,875 - INFO - oktoberfest.utils.config::read Reading configuration from ./rescoring_config.json
2025-06-21 10:04:29,892 - INFO - oktoberfest.runner::run_job Oktoberfest version 0.10.0
Copyright 2025, Wilhelmlab at Technical University of Munich
2025-06-21 10:04:29,920 - INFO - oktoberfest.runner::run_job Job executed with the following config:
2025-06-21 10:04:29,930 - INFO - oktoberfest.runner::run_job {
    "type": "Rescoring",
    "tag": "",
    "inputs": {
        "search_results": "/home/mkalhor/oktoberfest/tutorials/Oktoberfest_XL_input/msms.csv",
        "search_results_type": "Xisearch",
        "spectra": "/home/mkalhor/oktoberfest/tutorials/Oktoberfest_XL_input",
        "spectra_type": "raw"
    },
    "output": "./XL_out",
    "models": {
        "intensity": "Prosit_2024_intensity_XL_NMS2",
        "irt": ""
    },
    "prediction_server": "koina.wilhelmlab.org:443",
    "ssl": true,
    "thermoExe": "ThermoRawFileParser.exe",
    "numThreads": 1,
    "f



2025-06-21 10:07:38,069 - INFO - oktoberfest.preprocessing.preprocessing::annotate_spectral_library_xl Finished annotating.
2025-06-21 10:07:39,600 - INFO - oktoberfest.predict.predictor::from_config Using model Prosit_2024_intensity_XL_NMS2 via Koina


Prosit_2024_intensity_XL_NMS2:: 100%|███████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.08it/s]
Prosit_2024_intensity_XL_NMS2:: 100%|███████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.21s/it]

2025-06-21 10:07:47,237 - INFO - oktoberfest.runner::_get_best_ce Performing RANSAC regression





2025-06-21 10:07:48,508 - INFO - oktoberfest.utils.process_step::is_done Skipping ce_calib.XLpeplib_Beveridge_QEx-HFX_DSS_R1 step because XL_out/proc/ce_calib.XLpeplib_Beveridge_QEx-HFX_DSS_R1.done was found.
2025-06-21 10:07:48,625 - INFO - oktoberfest.predict.predictor::from_config Using model Prosit_2024_intensity_XL_NMS2 via Koina


Prosit_2024_intensity_XL_NMS2:: 100%|███████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.31it/s]
Prosit_2024_intensity_XL_NMS2:: 100%|███████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.26it/s]


2025-06-21 10:07:53,297 - INFO - oktoberfest.runner::run_rescoring Merging input tab files for rescoring without peptide property prediction
2025-06-21 10:07:53,372 - INFO - oktoberfest.runner::run_rescoring Merging input tab files for rescoring with peptide property prediction




2025-06-21 10:07:53,956 - INFO - oktoberfest.rescore.rescore::rescore_with_percolator Starting percolator with command percolator --weights XL_out/results/percolator/original.percolator.weights.csv                     --results-psms XL_out/results/percolator/original.percolator.psms.txt                     --decoy-results-psms XL_out/results/percolator/original.percolator.decoy.psms.txt                     --only-psms                     XL_out/results/percolator/original.tab 2> XL_out/results/percolator/original.log
2025-06-21 10:07:56,577 - INFO - oktoberfest.rescore.rescore::rescore_with_percolator Finished rescoring using percolator.
2025-06-21 10:07:56,722 - INFO - oktoberfest.runner::_rescore False
2025-06-21 10:07:56,725 - INFO - oktoberfest.rescore.rescore::rescore_with_percolator Starting percolator with command percolator --weights XL_out/results/percolator/rescore.percolator.weights.csv                     --results-psms XL_out/results/percolator/rescore.percolator.psms.txt 



2025-06-21 10:08:05,595 - INFO - oktoberfest.rescore.rescore::rescore_with_percolator Finished rescoring using percolator.
2025-06-21 10:08:05,697 - INFO - oktoberfest.runner::run_rescoring Finished rescoring.
2025-06-21 10:08:05,699 - INFO - oktoberfest.runner::run_rescoring Generating xiFDR input.




2025-06-21 10:08:05,891 - INFO - oktoberfest.runner::run_rescoring Finished Generating xiFDR input.


### Check the results

The results are written to the output folder specified in your config file.

You should find the following key output files:

1. **percolator_xifdr_input.csv**

   **Location:** `.../XL_out/results/percolator/percolator_xifdr_input.csv`  

   This file contains the Percolator score for each CSM (cross-linked spectrum match).  

   It is intended as input for the **xiFDR** tool, which estimates the FDR.  

   👉 Note: Oktoberfest **does not perform FDR estimation** itself for XL-MS data; it only generates Percolator-based scores.  

   You can load this file directly into **xiFDR** and apply FDR estimation as needed.  

   🔗 More info: [xiFDR GitHub Repository](https://github.com/Rappsilber-Laboratory/xiFDR)


   

2. **xisearch_xifdr_input.csv**  

   **Location:** `.../XL_out/results/percolator/xisearch_xifdr_input.csv`  

   This file contains the original xiSearch score for each CSM (cross-linked spectrum match).  

   It can also be used as input for **xiFDR**, just like the Percolator version.  

   Running xiFDR on both files lets you **compare the rescored results** (Percolator) against the original xiSearch scores.  

   📊 Useful for benchmarking rescoring effectiveness.
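
As a quick sanity check before moving to xiFDR, the two files can be inspected from the notebook. The sketch below uses only the standard library and makes no assumption about column names; it assumes the files are comma-delimited and that the output folder is `XL_out` relative to the working directory, as in the config above. `summarize_csv` is a hypothetical helper.

```python
import csv
import os


def summarize_csv(path):
    """Return (column_names, number_of_data_rows) for a comma-delimited file."""
    with open(path, newline="") as fp:
        reader = csv.reader(fp)
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return header, n_rows


results_dir = os.path.join("XL_out", "results", "percolator")
for name in ("percolator_xifdr_input.csv", "xisearch_xifdr_input.csv"):
    path = os.path.join(results_dir, name)
    if os.path.exists(path):
        columns, n_rows = summarize_csv(path)
        print(f"{name}: {n_rows} rows, columns: {columns}")
    else:
        print(f"{name} not found in {results_dir}")
```

The row counts and column lists of the two files give a first indication of whether they describe the same set of CSMs.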

