# Running Oktoberfest

This notebook walks you through the main workflows in Oktoberfest, similar to `Oktoberfest Tutorial.ipynb`. However, instead of rescoring with a pre-trained intensity predictor, it performs automatic refinement/transfer learning on the dataset locally and uses the resulting predictor for intensity prediction during rescoring.

File download and pre-processing take about 25 minutes, while the automatic refinement/transfer learning in the rescoring step can take multiple hours depending on the machine.

## 1 - Import necessary Python packages

In [1]:
import os
from oktoberfest.runner import run_job
import json
import urllib.request
import shutil
from tqdm import tqdm

## 2 - Download example files required to run different tasks from Zenodo 

The data used in this tutorial is provided in a public Zenodo record.
This is a larger dataset of 2.74 GB (2.55 GiB) in total. Download time should be about 15 minutes at an average of 3 MB/s.
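The size and time figures can be checked with a little arithmetic. This is only a rough estimate: the byte count is the `total` value passed to the progress bar in the download cell, and the 3 MB/s rate is the assumed average, not a measurement.

```python
# Rough download-time estimate under the assumptions stated above.
total_bytes = 2_739_196_307       # archive size used for the progress bar
rate_bytes_per_s = 3_000_000      # assumed average of 3 MB/s (decimal megabytes)

size_gb = total_bytes / 1000**3   # decimal gigabytes
size_gib = total_bytes / 1024**3  # binary gibibytes
minutes = total_bytes / rate_bytes_per_s / 60

# prints: 2.74 GB = 2.55 GiB, about 15 min at 3 MB/s
print(f"{size_gb:.2f} GB = {size_gib:.2f} GiB, about {minutes:.0f} min at 3 MB/s")
```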

### A - Get the current directory and set the file name

In [23]:
download_dir = os.getcwd()
download_file = os.path.join(download_dir, 'Oktoberfest_input.zip')
url = 'https://zenodo.org/record/7613029/files/Oktoberfest_input.zip'

download = True  # set this to False if you already have the file and don't want to download it again in the next step

### B - Download and extract files from Zenodo

In [24]:
if download:
    with tqdm(unit="B", total=2739196307, unit_scale=True, unit_divisor=1000, miniters=1, desc=url.split("/")[-1]) as t:
        urllib.request.urlretrieve(url=url, filename=download_file, reporthook=lambda blocks, block_size, _: t.update(blocks * block_size - t.n))
    shutil.unpack_archive(download_file, download_dir)

Oktoberfest_input.zip: 2.74GB [01:02, 43.5MB/s]                                                                                                                
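If a download is interrupted, the archive may be incomplete. A minimal sketch for checking this before unpacking, assuming the archive size equals the `total` value used for the progress bar above; the helper name `is_complete` is hypothetical and not part of Oktoberfest.

```python
import os

EXPECTED_BYTES = 2_739_196_307  # assumption: same value as `total` in the download cell


def is_complete(path, expected_bytes=EXPECTED_BYTES):
    """Return True if the file exists and has exactly the expected size in bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes
```

For example, `is_complete(download_file)` can be used to decide whether `download` needs to be set to `True` again.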


### C - Check downloaded files

In [25]:
input_dir = os.path.splitext(download_file)[0]  # strip the ".zip" extension to get the extracted folder
print(f'Downloaded data is stored in {input_dir}\nContents:')
os.listdir(input_dir)

Downloaded data is stored in /cmnfs/home/students/j.schlensok/oktoberfest/tutorials/Oktoberfest_input
Contents:


['config_files',
 'GN20170722_SK_HLA_G0103_R1_01.raw',
 'GN20170722_SK_HLA_G0103_R1_02.raw',
 'GN20170722_SK_HLA_G0103_R2_01.raw',
 'GN20170722_SK_HLA_G0103_R2_02.raw',
 'msms.txt',
 'peptides_spectral_library.csv']
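Instead of comparing the listing by eye, the check can be scripted. This is a sketch only: the helper `missing_inputs` is hypothetical, and the assumption that `msms.txt` plus at least one `.raw` file are what the later steps need is based on the configs used in this tutorial.

```python
import os


def missing_inputs(input_dir, required=("msms.txt",), raw_suffix=".raw"):
    """Return a list of what is missing from input_dir (empty if nothing is missing)."""
    entries = set(os.listdir(input_dir))
    # Required files that are not present by name.
    problems = [name for name in required if name not in entries]
    # At least one spectra file with the given suffix must exist.
    if not any(name.endswith(raw_suffix) for name in entries):
        problems.append(f"*{raw_suffix}")
    return problems
```

Calling `missing_inputs(input_dir)` should return an empty list for the folder shown above.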

## 3 - Running Oktoberfest

In [None]:
rescoring_config = {
    "type": "Rescoring",
    "tag": "",
    "output": "./out",
    "inputs": {
        "search_results": "./msms.txt",
        "search_results_type": "Maxquant",
        "spectra": "./",
        "spectra_type": "raw",
        "instrument_type": "QE"
    },
    "models": {
        "intensity": "baseline",
        "irt": "Prosit_2019_irt"
    },
    "prediction_server": "koina.wilhelmlab.org:443",
    "numThreads": 1,
    "dlomixInferenceBatchSize": 1024,
    "refinementLearningOptions": {
        "batchSize": 1024,
        "includeOriginalSequences": False,
        "improveFurther": False,
        "datasetFilteringOptions": {
            "searchEngineScoreThreshold": 0,
            "numDuplicates": 100
        }
    },
    "fdr_estimation_method": "mokapot",
    "allFeatures": False,
    "regressionMethod": "spline",
    "ssl": True,
    "thermoExe": "ThermoRawFileParser.exe",
    "massTolerance": 20,
    "unitMassTolerance": "ppm",
    "ce_alignment_options": {
        "ce_range": [19,50],
        "use_ransac_model": False
    }
}

In [None]:
with open("./rescoring_config.json", 'w') as f:
    json.dump(rescoring_config, f)

In [None]:
run_job("./rescoring_config.json")

### A - Preprocessing

This step reads the raw files, converts them to mzML, and loads the search results. Although the job type is `"CollisionEnergyCalibration"`, the actual CE calibration is skipped so that the intensity predictor is not trained using only a single collision energy value.
This should take around 5 minutes.

#### Generate config file

Note the `"intensity": "baseline"` entry in `"models"` (line 12). This tells Oktoberfest to perform local refinement/transfer learning on a baseline intensity predictor and use the result for rescoring, instead of using a pre-trained predictor provided through Koina. Alternatively, a path to another pre-trained intensity predictor can be provided.

In [20]:
task_config_preprocessing = {
    "type": "CollisionEnergyCalibration",
    "tag": "",
    "inputs":{
        "search_results": input_dir + "/msms.txt",
        "search_results_type": "Maxquant",
        "spectra": input_dir,
        "spectra_type": "raw"
    },
    "output": "./out",
    "models": {
        "intensity": "baseline",
        "irt": "Prosit_2019_irt"
    },
    "prediction_server": "koina.wilhelmlab.org:443",
    "ssl": True,
    "thermoExe": "/opt/compomics/ThermoRawFileParser.exe",  # ensure you point to the right location of the executable here!
    "massTolerance": 20,
    "unitMassTolerance": "ppm",
    "numThreads": 1
}

#### Save config as json

In [21]:
with open('./preprocessing_config.json', 'w') as fp:
    json.dump(task_config_preprocessing, fp)
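The config is defined as a Python dict, so booleans must be written as `True`/`False`; the `json` module converts them to JSON `true`/`false` in the saved file. A small illustration with a stand-in dict (not the full config):

```python
import json

config = {"ssl": True, "numThreads": 1}  # tiny stand-in for the full config
text = json.dumps(config)
print(text)  # prints: {"ssl": true, "numThreads": 1}

# Loading the JSON text again gives back the original Python values.
restored = json.loads(text)
print(restored == config)  # prints: True
```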

#### Run preprocessing job

In [26]:
run_job("./preprocessing_config.json")

2024-10-15 13:02:02,308 - INFO - oktoberfest.utils.config::read Reading configuration from ./ce_calibration_config.json
2024-10-15 13:02:02,313 - INFO - oktoberfest.runner::run_job Oktoberfest version 0.8.2
Copyright 2024, Wilhelmlab at Technical University of Munich
2024-10-15 13:02:02,315 - INFO - oktoberfest.runner::run_job Job executed with the following config:
2024-10-15 13:02:02,317 - INFO - oktoberfest.runner::run_job {
    "type": "CollisionEnergyCalibration",
    "tag": "",
    "inputs": {
        "search_results": "/cmnfs/home/students/j.schlensok/oktoberfest/tutorials/Oktoberfest_input/msms.txt",
        "search_results_type": "Maxquant",
        "spectra": "/cmnfs/home/students/j.schlensok/oktoberfest/tutorials/Oktoberfest_input",
        "spectra_type": "raw"
    },
    "output": "./out",
    "models": {
        "intensity": "baseline",
        "irt": "Prosit_2019_irt"
    },
    "prediction_server": "koina.wilhelmlab.org:443",
    "ssl": true,
    "thermoExe": "/opt/comp

XMLSyntaxError: AttValue: ' expected, line 1365283, column 84 (GN20170722_SK_HLA_G0103_R1_02.mzML, line 1365283)
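The error above reports that one of the converted mzML files is not well-formed XML. One possible cause, which the log does not confirm, is an interrupted or incomplete conversion. A minimal sketch for finding out which files fail to parse, using only the standard library; the helper name `first_xml_error` is hypothetical and not part of Oktoberfest.

```python
import xml.etree.ElementTree as ET


def first_xml_error(path):
    """Stream through an XML file; return None if it is well-formed, else the parser message."""
    try:
        for _event, elem in ET.iterparse(path, events=("end",)):
            elem.clear()  # drop element contents to keep memory use down on large files
    except ET.ParseError as err:
        return str(err)
    return None
```

Pass the path of each mzML file to `first_xml_error`; converting a file that fails this check again is one thing to try.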

### B - Refinement/Transfer Learning & Rescoring

In this step, a training dataset in Parquet format is generated and used to run automatic refinement/transfer learning of the intensity predictor specified by the user.
After the learning process completes successfully, a training report is generated and saved, in both Jupyter notebook and HTML format, in the `results/dlomix/` subdirectory of your output folder.
Finally, the refined model is used for intensity prediction in Oktoberfest's rescoring step.
Depending on your machine, this might take a couple of hours. The training results are cached so the refined intensity predictor can be reused in additional rescoring runs.
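Once training has finished, the report files can be located with a short helper. This is a sketch under assumptions: the `results/dlomix/` location is taken from the description above, the report file names are not known here, so the helper simply collects every `.ipynb` and `.html` file in that folder. The name `find_training_reports` is hypothetical.

```python
from pathlib import Path


def find_training_reports(output_dir):
    """Return notebook and HTML files found directly under <output_dir>/results/dlomix."""
    report_dir = Path(output_dir) / "results" / "dlomix"
    return sorted(p for p in report_dir.glob("*") if p.suffix in {".ipynb", ".html"})
```

With the configs in this tutorial, `find_training_reports("./out")` would be the call to use.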

#### Generate config file

Note the `"dlomixInferenceBatchSize"` key (line 24), as well as the `"refinementLearningOptions"` (lines 25-33) provided. All of these are set to their default values, which have been found to provide the best balance between training time and performance in practice.
If `"improveFurther"` is set to `true`, an additional training phase with a reduced learning rate is performed to achieve the best possible performance at the expense of a longer training duration. This is skipped in this tutorial to save time.

In [None]:
task_config_rescoring = {
    "type": "Rescoring",
    "tag": "",
    "inputs":{
        "search_results": input_dir + "/msms.txt",
        "search_results_type": "Maxquant",
        "spectra": input_dir,
        "spectra_type": "raw"
    },
    "output": "./out",
    "models": {
        "intensity": "baseline",
        "irt": "Prosit_2019_irt"
    },
    "prediction_server": "koina.wilhelmlab.org:443",
    "ssl": True,
    "thermoExe": "ThermoRawFileParser.exe",
    "numThreads": 4,
    "fdr_estimation_method": "percolator",  # ensure percolator is installed on your system
    "regressionMethod": "spline",
    "allFeatures": False,
    "massTolerance": 20,
    "unitMassTolerance": "ppm",
    "dlomixInferenceBatchSize": 1024,
    "refinementLearningOptions": {
        "batchSize": 1024,
        "includeOriginalSequences": False,
        "improveFurther": False,
        "datasetFilteringOptions": {
            "searchEngineScoreThreshold": 0,
            "numDuplicates": 100
        }
    },
}
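To try the additional training phase later without editing the config by hand, a variant can be derived from an existing dict. A minimal sketch with a stand-in config; the helper `with_improve_further` is hypothetical, and only the `"refinementLearningOptions"` and `"improveFurther"` keys come from this tutorial.

```python
import copy


def with_improve_further(config):
    """Return a deep copy of config with refinementLearningOptions.improveFurther enabled."""
    new_config = copy.deepcopy(config)  # leave the original dict untouched
    new_config.setdefault("refinementLearningOptions", {})["improveFurther"] = True
    return new_config


# Stand-in for the full rescoring config.
base = {"refinementLearningOptions": {"batchSize": 1024, "improveFurther": False}}
longer = with_improve_further(base)
```

Using a deep copy means the nested options of the original config are not modified, so both variants can be saved to separate JSON files.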

#### Save config as json

In [None]:
with open('./rescoring_config.json', 'w') as fp:
    json.dump(task_config_rescoring, fp)

#### Run rescoring job

In [None]:
run_job("rescoring_config.json")