# Oktoberfest Workshop

This notebook is prepared to be run in Google [Colaboratory](https://colab.research.google.com/).

This notebook contains tasks that are designed to guide new users through the following topics:

1. How to install oktoberfest and load packages
2. How to get the required data
3. How to prepare a configuration file
4. How to run a job and interpret the output

# Task 1: Installation

Before using Oktoberfest, the package and its dependencies need to be installed. On your own machine, this step is only required once, but it may need to be repeated in Google Colab.

## Task 1.1

What are the requirements for Oktoberfest and where do you find this information? (Hint: Search the Oktoberfest documentation at readthedocs using your favourite search engine).

## Task 1.2

Execute the code cell below, which installs Percolator and Oktoberfest, and restart the session if asked.

In [None]:
!wget https://github.com/percolator/percolator/releases/download/rel-3-06-01/percolator-v3-06-linux-amd64.deb
!dpkg -i percolator-v3-06-linux-amd64.deb
!pip install oktoberfest

For this notebook to work, a few packages that provide the functions used in the following tasks need to be imported. Should you get an error here, check that the installation of the required packages was successful.

## Task 1.3

Import the below packages and functions by executing the code in the cell.

In [None]:
from oktoberfest.runner import run_job
from oktoberfest import __version__ as version
import os
import json
import urllib.request
import shutil
from tqdm.auto import tqdm

If this works, you have installed Oktoberfest correctly.

## Task 1.4

How can you check that you are using the current stable version? (Hint: Check the output of `__version__` using the code cell below and compare it with the Oktoberfest documentation.)

In [None]:
# add code here to check the version of the imported Oktoberfest package
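
As a complement to printing the imported `version`, the standard library's `importlib.metadata` can report the installed version of any package, and a small helper can compare it with the version named in the documentation. This is a minimal sketch: the helper names `installed_or_none` and `is_at_least` are hypothetical, and the numeric comparison assumes plain `X.Y.Z` version strings with the same number of parts.

```python
from importlib.metadata import PackageNotFoundError, version as installed_version


def installed_or_none(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return installed_version(package)
    except PackageNotFoundError:
        return None


def is_at_least(current, required):
    """Compare two plain 'X.Y.Z' version strings numerically (assumes no pre-release tags)."""
    def as_tuple(text):
        return tuple(int(part) for part in text.split("."))
    return as_tuple(current) >= as_tuple(required)


# prints None if the package is not installed in the current environment
print("oktoberfest:", installed_or_none("oktoberfest"))
```

The second helper can be used as `is_at_least(version, "<version from the documentation>")` once you have looked up the current stable version.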

# Task 2: Getting the data

The data used in this notebook is provided as a zip archive that can be downloaded from the following Zenodo record: https://zenodo.org/records/10793943

## Task 2.1

Find the download link in the public Zenodo record. You can copy the link by hovering over the download button, clicking your right mouse button, and choosing the option to copy the download link.

## Task 2.2

Define variables for the download link (URL), the download directory, and the local file name using the code cell below, then execute the cell.


In [None]:
url =                    # here goes the download link of the file to download from the zenodo record, it should look like "https://zenodo.org/records/10782588/...", make sure to include the ""
download_dir =           # you can choose any directory, e.g. "Oktoberfest_input/", make sure to include the ""
file_name =              # you can choose any filename, e.g. "sample_data.zip"

## Task 2.3

Download and unpack the data using the below code cell. You should see a progress bar while it is downloading the file (86MB, approx. 1 minute).

In [None]:
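# create the download directory if it does not exist yet
os.makedirs(download_dir, exist_ok=True)
# store the archive under the file name chosen in Task 2.2
download_file = os.path.join(download_dir, file_name)
# the reporthook advances the progress bar to the number of bytes received so far
with tqdm(unit="B", total=70958154, unit_scale=True, unit_divisor=1000, miniters=1, desc=url.split("/")[-1]) as t:
    urllib.request.urlretrieve(url=url, filename=download_file, reporthook=lambda blocks, block_size, _: t.update(blocks * block_size - t.n))
# unpack the zip archive into the download directory
shutil.unpack_archive(download_file, download_dir, format="zip")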

## Task 2.4

Check that the download was successful. Hint: Use the file browser on the left side to search for the folder you defined using the `download_dir` variable above and check the content. What do you find here?
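
The same check can also be done in code. The following minimal sketch uses the standard library's `zipfile` module to confirm that a file is a readable zip archive and to summarize its content; the helper name `summarize_archive` is hypothetical.

```python
import zipfile


def summarize_archive(zip_path):
    """Return (number of files, total uncompressed bytes) for a zip archive."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path) as archive:
        # skip directory entries, only real files are counted
        infos = [info for info in archive.infolist() if not info.is_dir()]
        return len(infos), sum(info.file_size for info in infos)
```

Calling `summarize_archive(download_file)` after the download cell has run reports how many files the archive contains. An incomplete download typically shows up as a `ValueError` here.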

# Task 3: Rescoring with Oktoberfest

The main feature of Oktoberfest is rescoring. This requires two main inputs:
- unfiltered search results; for MaxQuant, this means a run with 100% PSM and peptide FDR
- acquired spectra, either in ThermoFisher .RAW, Bruker .d, or mzML format

In addition, Oktoberfest can acquire predictions from various data-dependent models that are provided by a Koina instance.

## Task 3.1

Where do you find information about the configuration options, example configurations, and the supported prediction models? (Hint: Check the [Usage principles](https://oktoberfest.readthedocs.io/en/latest/usage.html) in the Oktoberfest documentation.) Define the variables below accordingly.



In [None]:
spectra =                # this is the location of the mzML file containing the measured spectra, i.e. "<your download_dir>/<filename>.mzml"
spectra_type =           # this is the format the spectra are provided in ("mzml", "RAW", "d"), which one is correct here?

search_results =         # this is the location of the search engine output, i.e. "<your download_dir>/<search_engine output>"
search_results_type =    # this is the name of the search engine that produced the search results, which is the correct search engine here?

## Task 3.2

The data we are working with here was acquired using beam-type collision-induced dissociation (HCD) without tandem mass tags (TMT).

Which models should be used for fragment intensity prediction and retention time prediction, and which server URL provides access to these models? (Hint: Check the [Usage principles](https://oktoberfest.readthedocs.io/en/latest/usage.html) in the Oktoberfest documentation.)

Also specify the directory you want to store all the outputs from Oktoberfest in.

In [None]:
intensity_model =        # this is the model used for fragment intensity prediction, e.g. "some model"
retention_time_model =   # this is the model used for retention time prediction, e.g. "some model"
prediction_server =      # the Koina server that provides access to the specified models, e.g. "<url>:<port number>"

output_directory =       # this is the output folder for everything Oktoberfest produces during rescoring, e.g. "rescore_out"



## Task 3.3

Save the variables you have defined above in the configuration below and store it to disk. For simplicity, a minimal configuration is already provided for this task, so you can simply execute the code cell.

A detailed explanation of all available configuration options can be found in the [Usage principles](https://oktoberfest.readthedocs.io/en/latest/usage.html) in the Oktoberfest documentation.

What are the mass tolerance and unit variables for?

In [None]:
task_config_rescoring = {
    "type": "Rescoring",
    "inputs":{
        "search_results": search_results,
        "search_results_type": search_results_type,
        "spectra": spectra,
        "spectra_type": spectra_type
    },
    "output": output_directory,
    "models": {
        "intensity": intensity_model,
        "irt": retention_time_model
    },
    "prediction_server": prediction_server,
    "ssl": True,
    "numThreads": 1,
    "fdr_estimation_method": "percolator",
    "massTolerance": 20,
    "unitMassTolerance": "ppm"
}

# this is for storing the file on disk
with open('./rescoring_config.json', 'w') as fp:
    json.dump(task_config_rescoring, fp)

(Optional) You can now check whether the configuration file on disk looks correct by finding it with the file browser on the left.
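
The file can also be read back and inspected in code. This is a minimal sketch: the helper names `load_config` and `missing_keys` are hypothetical, and the `REQUIRED` list only mirrors top-level keys of the minimal configuration in this task; it is not Oktoberfest's own validation.

```python
import json
import os


def load_config(path):
    """Read a JSON configuration file back into a dictionary."""
    with open(path) as fp:
        return json.load(fp)


def missing_keys(config, required):
    """Return the required top-level keys that are absent from the configuration."""
    return [key for key in required if key not in config]


# assumption: keys copied from the minimal rescoring configuration in this task
REQUIRED = ["type", "inputs", "output", "models", "prediction_server"]

if os.path.exists("./rescoring_config.json"):
    config = load_config("./rescoring_config.json")
    print(json.dumps(config, indent=4))  # pretty-print for easier reading
    print("missing keys:", missing_keys(config, REQUIRED))
```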

The Oktoberfest documentation provides [example configurations](https://oktoberfest.readthedocs.io/en/latest/jobs.html#c-rescoring) that show you how a typical rescoring run for MaxQuant is set up with all the available options.

If you want to get detailed information about individual options and allowed values, you can check the documentation for the [full configuration](https://oktoberfest.readthedocs.io/en/latest/config.html).
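
The configuration above sets `massTolerance` to 20 and `unitMassTolerance` to "ppm". A tolerance in ppm is relative to the m/z value, so the absolute window grows with m/z, whereas a tolerance in Da is the same everywhere. The following worked example illustrates the arithmetic only; the helper name `tolerance_window` and the unit label "da" are assumptions made for this sketch.

```python
def tolerance_window(mz, tolerance, unit="ppm"):
    """Return the (lower, upper) m/z bounds around a value for a ppm or Da tolerance."""
    if unit == "ppm":
        delta = mz * tolerance / 1e6  # parts per million of the m/z value itself
    elif unit == "da":
        delta = tolerance  # absolute tolerance, independent of m/z
    else:
        raise ValueError(f"unknown unit: {unit}")
    return mz - delta, mz + delta


for mz in (200.0, 1000.0):
    low, high = tolerance_window(mz, 20, "ppm")
    print(f"m/z {mz}: {low:.4f} to {high:.4f}")
```

With 20 ppm, the window is ±0.004 at m/z 200 and ±0.02 at m/z 1000.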

## Task 3.4

Start the rescoring run.

After preparation of the configuration file, Oktoberfest can be instructed to run a job with it. This step may take a while (approx. 3-5 minutes) and provides log output that tracks the progress of rescoring.
Oktoberfest will perform the following steps:

- read the search results from MaxQuant and translate them to the internal format used by Oktoberfest (the specification for this format can be found in the documentation)
- parse the mzML data to retrieve MS2 spectra, then merge them with the search results to generate PSMs, filtering out spectra without a search result
- annotate the spectra with all y- and b-fragments in charge states 1-3
- perform an NCE calibration using the 1000 highest-scoring target PSMs to determine the NCE for which the highest spectral angle can be achieved with the acquired predictions
- predict fragment intensities and retention times for all PSMs
- align retention times, then calculate spectral angles and further features for rescoring using Percolator
- rescore using features from intensity and retention time prediction and the original search engine score
- plot summaries of the rescoring run

In [None]:
run_job("./rescoring_config.json")
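
The spectral angle named in the calibration and feature calculation steps above measures how similar a predicted spectrum is to an observed one. The following is a minimal sketch, assuming the normalized spectral angle definition commonly used with Prosit-style predictions (1 minus twice the angle between the two intensity vectors divided by pi). It is not Oktoberfest's implementation. The names `spectral_angle` and `best_nce` are hypothetical, and using the mean as the per-NCE summary is an assumption.

```python
import math


def spectral_angle(observed, predicted):
    """Normalized spectral angle between two intensity vectors (1 = same direction)."""
    norm_obs = math.sqrt(sum(x * x for x in observed))
    norm_pred = math.sqrt(sum(x * x for x in predicted))
    if norm_obs == 0 or norm_pred == 0:
        return 0.0
    cosine = sum(o * p for o, p in zip(observed, predicted)) / (norm_obs * norm_pred)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding outside [-1, 1]
    return 1 - 2 * math.acos(cosine) / math.pi


def best_nce(angles_by_nce):
    """Pick the NCE whose PSMs reach the highest mean spectral angle (assumed summary)."""
    return max(angles_by_nce, key=lambda nce: sum(angles_by_nce[nce]) / len(angles_by_nce[nce]))


# toy intensities for illustration only, not real predictions
print(spectral_angle([1.0, 0.0, 2.0], [1.0, 0.0, 2.0]))
print(spectral_angle([1.0, 0.0, 2.0], [0.0, 3.0, 0.0]))
```

Identical spectra give a value close to 1, and spectra without any shared fragment intensity give 0.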

## Task 3.5

Explore the output folder of Oktoberfest using the file browser on the left.

Where do you find information about the output folder structure and what you can find where? (Hint: Check the [Usage principles](https://oktoberfest.readthedocs.io/en/latest/usage.html) in the Oktoberfest documentation.)

Did rescoring work and provide better results (Results discussion follows)?
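
If you prefer code over the file browser, the output folder can be summarized per subfolder. This is a minimal sketch using only `os.walk`; the helper names `folder_overview` and `print_overview` are hypothetical, and no particular folder layout is assumed.

```python
import os


def folder_overview(directory):
    """Map each subfolder (relative path) to its number of files and total size in bytes."""
    overview = {}
    for root, _dirs, files in os.walk(directory):
        sizes = [os.path.getsize(os.path.join(root, name)) for name in files]
        overview[os.path.relpath(root, directory)] = (len(files), sum(sizes))
    return overview


def print_overview(directory):
    """Print one line per subfolder with file count and size in MB."""
    for folder, (count, size) in sorted(folder_overview(directory).items()):
        print(f"{folder:40s} {count:5d} files {size / 1e6:10.2f} MB")
```

After the rescoring run has finished, `print_overview(output_directory)` lists what Oktoberfest wrote and where.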

# Task 4: Spectral library generation

A second feature of Oktoberfest is the generation of spectral libraries, which can be used for DIA analysis. Similarly to rescoring, a configuration file needs to be prepared. In this case, one main input is required:

- fasta file, to perform an in-silico digestion with given settings

## Task 4.1

What inputs are required for spectral library generation? You can check the documentation again, and fill out the below code cell.

In [None]:
library_input =          # this is the location of the fasta or peptide list, e.g. "/path/to/<filename>.fasta"
library_input_type =     # this is the format of the input you provide, e.g. "fasta" or "peptides", which one is correct here?

output_directory =       # this is the output folder for everything Oktoberfest produces during spectral library generation, e.g. "speclib_out"

## Task 4.2

Choose some settings for the library generation. You can check the documentation for detailed information and play around with the values below. Beware that allowing more missed cleavages or more than one precursor charge will lead to longer prediction times and larger file sizes. It makes sense to try this out with minimal values first.

In [None]:
collisionEnergy =        # the collision energy for which the spectral library should be produced, e.g. 30
precursorCharge =        # the precursor charges that should be considered when creating the library, e.g. 0 or [2,3], or [1,2,3] (more than one increases prediction time / file size)
format =                 # the desired format for the library, e.g. "spectronaut" or "msp", "msp" is smaller, "spectronaut"

missedCleavages =        # this is the number of missed cleavages that should be allowed (higher values increase prediction time / file size)
minLength =              # minimal allowed peptide length, Prosit accepts everything >= 7
maxLength =              # maximal allowed peptide length, Prosit accepts everything <= 30
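
To get a feeling for why more missed cleavages increase prediction time and file size, the following minimal sketch performs a simplified tryptic in-silico digest. It cleaves after every K or R, ignores any proline rule, and is not Oktoberfest's digestion code. The function name `tryptic_digest` and the toy sequence are made up for illustration.

```python
def tryptic_digest(sequence, missed_cleavages=0, min_length=7, max_length=30):
    """Cleave after every K or R and join up to `missed_cleavages` adjacent pieces."""
    pieces, start = [], 0
    for index, residue in enumerate(sequence):
        if residue in "KR":
            pieces.append(sequence[start:index + 1])
            start = index + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal remainder without K or R

    peptides = set()
    for first in range(len(pieces)):
        for extra in range(missed_cleavages + 1):
            if first + extra >= len(pieces):
                break
            peptide = "".join(pieces[first:first + extra + 1])
            if min_length <= len(peptide) <= max_length:
                peptides.add(peptide)
    return sorted(peptides)


toy_protein = "AAAAAAKGGGGGGGRCCCCCCCK"  # toy sequence, not a real protein
for allowed in (0, 1, 2):
    print(allowed, "missed cleavages:", len(tryptic_digest(toy_protein, allowed)), "peptides")
```

Even for this short toy sequence, the number of peptides grows with every additional missed cleavage, and each peptide is predicted once per precursor charge.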


## Task 4.3

Save the variables you have defined above in the configuration below and store it to disk. For simplicity, a minimal configuration is already provided for this task, so you can simply execute the code cell.

A detailed explanation of all available configuration options can be found in the [Usage principles](https://oktoberfest.readthedocs.io/en/latest/usage.html) in the Oktoberfest documentation.

In [None]:
task_config_spectral_lib = {
    "type": "SpectralLibraryGeneration",
    "tag": "",
    "inputs": {
        "library_input": library_input,
        "library_input_type": library_input_type
    },
    "output": output_directory,
    "models": {
        "intensity": intensity_model,
        "irt": retention_time_model
    },
    "prediction_server": prediction_server,
    "ssl": True,
    "numThreads": 1,
    "spectralLibraryOptions": {
        "fragmentation": "HCD",
        "collisionEnergy": collisionEnergy,
        "precursorCharge": precursorCharge,
        "minIntensity": 5e-4,
        "batchsize": 10000,
        "format": format,
    },
    "fastaDigestOptions": {
        "digestion": "full",
        "missedCleavages": missedCleavages,
        "minLength": minLength,
        "maxLength": maxLength,
        "enzyme": "trypsin",
        "specialAas": "KR",
        "db": "target"
    },
}

with open('./spectral_library_config.json', 'w') as fp:
    json.dump(task_config_spectral_lib, fp)

## Task 4.4

Start the spectral library generation.

This step may take a while (TODO minutes). The log output tracks the progress of library generation. Oktoberfest will perform the following steps:

- read the fasta file and perform an in-silico digest according to the settings provided in the configuration file
- acquire fragment intensity and retention time predictions in batches and write them to disk on the fly

In [None]:
run_job("./spectral_library_config.json")

After Oktoberfest is done generating the library, the specified output folder contains a file called "myPrositLib.msp" (MSP) or "myPrositLib.csv" (spectronaut). Check to see if everything worked out correctly.
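
One quick check for the MSP output is to count the library entries. This is a minimal sketch, assuming that each entry in the MSP file begins with a line starting with `Name:`, as is conventional for the MSP format. The helper name `count_msp_entries` is hypothetical.

```python
def count_msp_entries(path):
    """Count library entries in an MSP file by counting lines that start with 'Name:'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith("Name:"))
```

For example, `count_msp_entries(os.path.join(output_directory, "myPrositLib.msp"))` should return a number greater than zero if the library was written.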