# Running Oktoberfest

This notebook provides an overview of the three main workflows in Oktoberfest: spectral library generation, CE calibration, and rescoring. Expect a total runtime of around 35 minutes: about 15 minutes for the one-time file download and about 20 minutes for rescoring.

## 1- Import necessary Python packages

In [None]:
import os
from oktoberfest.runner import run_job
import json
import urllib.request
import shutil
from tqdm import tqdm

## 2- Download example files from Zenodo required to run different tasks

The data used in this tutorial is provided in a public Zenodo record.
The archive is fairly large, about 2.55 GiB (2.74 GB) in total. Downloading should take ~15 minutes at an average of 3 MB/s.
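
The size and download-time figures can be cross-checked with a little arithmetic. In this sketch, the byte count is the `total` value passed to the progress bar in step B, and the 3 MB/s throughput is the assumed average quoted above; actual speeds will vary.

```python
# Archive size in bytes, taken from the progress-bar total in the download cell.
total_bytes = 2_739_196_307
# Assumed average throughput of 3 MB/s (decimal megabytes per second).
speed_bytes_per_s = 3_000_000

size_gb = total_bytes / 1000**3    # decimal gigabytes
size_gib = total_bytes / 1024**3   # binary gibibytes
minutes = total_bytes / speed_bytes_per_s / 60

print(f"{size_gb:.2f} GB = {size_gib:.2f} GiB, about {minutes:.0f} min at 3 MB/s")
```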

### A- Get the current directory and set the file name

In [None]:
download_dir = os.getcwd()
download_file = os.path.join(download_dir, 'Oktoberfest_input.zip')
url = 'https://zenodo.org/record/7613029/files/Oktoberfest_input.zip'

download = True  # set this to False if you already have the file and don't want to download it again in the next step

### B- Download and extract files from Zenodo to the same directory

In [None]:
if download:
    with tqdm(unit="B", total=2739196307, unit_scale=True, unit_divisor=1000, miniters=1, desc=url.split("/")[-1]) as t:
        urllib.request.urlretrieve(url=url, filename=download_file, reporthook=lambda blocks, block_size, _: t.update(blocks * block_size - t.n))
    shutil.unpack_archive(download_file, download_dir)
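
Instead of toggling `download` by hand, the flag could be derived from what is already on disk. The helper below is a hypothetical sketch, not part of Oktoberfest: it treats the archive as complete only if the file exists and its size matches the expected byte count (here assumed to be the `total` used for the progress bar).

```python
import os


def needs_download(path, expected_size):
    """Return True if `path` is missing or its size differs from `expected_size` bytes."""
    return not os.path.isfile(path) or os.path.getsize(path) != expected_size


# Possible usage in this notebook (not executed here):
# download = needs_download(download_file, 2739196307)
```

A size comparison only catches missing or truncated downloads; it does not verify the file contents.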

### C- Check downloaded files

In [None]:
input_dir = os.path.splitext(download_file)[0]  # archive path without the ".zip" extension
print(f'Downloaded data is stored in {input_dir}\nContents:')
os.listdir(input_dir)
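
Beyond listing the directory, it can help to check for the specific inputs that the tasks below refer to. A minimal sketch, assuming the extracted folder contains `msms.txt`, `peptides_spectral_library.csv`, and Thermo raw files with a `.raw` extension; the two helper names are hypothetical.

```python
import os


def missing_inputs(directory, required=("msms.txt", "peptides_spectral_library.csv")):
    """Return the required file names that are not present in `directory`."""
    present = set(os.listdir(directory))
    return [name for name in required if name not in present]


def raw_files(directory):
    """Return the sorted names of files ending in .raw (case-insensitive)."""
    return sorted(f for f in os.listdir(directory) if f.lower().endswith(".raw"))


# Possible usage in this notebook (not executed here):
# print(missing_inputs(input_dir), raw_files(input_dir))
```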

## 3- Running Different Tasks

### A- Spectral Library Generation

This is a small test case that should take around 1 minute.

#### Generate config file

In [None]:
task_config_spectral_lib = {
    "type": "SpectralLibraryGeneration",
    "tag": "",
    "inputs": {
        "library_input": input_dir + "/peptides_spectral_library.csv",
        "library_input_type": "peptides",
        "instrument_type": ""
    },
    "output": "./out",
    "models": {
        "intensity": "Prosit_2020_intensity_HCD",
        "irt": "Prosit_2019_irt"
    },
    "prediction_server": "koina.wilhelmlab.org:443",
    "ssl": True,
    "numThreads": 5,
    "spectralLibraryOptions": {
        "fragmentation": "HCD",
        "collisionEnergy": 30,
        "precursorCharge": [2,3],
        "minIntensity": 5e-4,
        "batchsize": 10000,
        "format": "msp",
        "nrOx": 1,
    },
    "fastaDigestOptions": {
        "digestion": "full",
        "missedCleavages": 1,
        "minLength": 7,
        "maxLength": 30,
        "enzyme": "trypsin",
        "specialAas": "KR",
        "db": "target"
    },
}

#### Save config as JSON

In [None]:
with open('./spectral_library_config.json', 'w') as fp:
    json.dump(task_config_spectral_lib, fp)
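
To see what actually ends up on disk, the file can be read back after writing. The sketch below uses a shortened stand-in config and a temporary file (the name `example_config.json` is hypothetical) to show that Python's `True` is serialized as JSON `true` and that the round trip preserves the dictionary.

```python
import json
import os
import tempfile

# Shortened stand-in for a task config; not a complete Oktoberfest config.
config = {"type": "SpectralLibraryGeneration", "ssl": True, "numThreads": 5}

path = os.path.join(tempfile.mkdtemp(), "example_config.json")
with open(path, "w") as fp:
    json.dump(config, fp, indent=4)  # indent only affects readability

with open(path) as fp:
    text = fp.read()

loaded = json.loads(text)
print(text)
```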

#### Run spectral library generation job

In [None]:
run_job("./spectral_library_config.json")

### B- CE Calibration

This step reads the raw files, converts them to mzML, loads the search results, and performs CE calibration on the top 1000 target PSMs (ranked by the Andromeda score in the msms.txt).
It should take around 10 minutes, of which 5 minutes are spent on file conversion, which has to be performed only once.

#### Generate config file

In [None]:
task_config_ce_calibration = {
    "type": "CollisionEnergyCalibration",
    "tag": "",
    "inputs":{
        "search_results": input_dir + "/msms.txt",
        "search_results_type": "Maxquant",
        "spectra": input_dir,
        "spectra_type": "raw"
    },
    "output": "./out",
    "models": {
        "intensity": "Prosit_2020_intensity_HCD",
        "irt": "Prosit_2019_irt"
    },
    "prediction_server": "koina.wilhelmlab.org:443",
    "ssl": True,
    "thermoExe": "/opt/compomics/ThermoRawFileParser.exe",  # ensure you point to the right location of the executable here!
    "massTolerance": 20,
    "unitMassTolerance": "ppm",
    "numThreads": 4
}
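
Because the `thermoExe` path depends on the local installation, it may be worth checking it before starting the job. The helper below is a hypothetical sketch: it accepts either a file path or a bare command name and returns `None` when nothing is found. Note that `shutil.which` only reports files the current user may execute, so a parser that is launched through another runtime may still need a manual check.

```python
import os
import shutil


def resolve_executable(path_or_name):
    """Return an absolute path if the file exists, else look it up on PATH (or None)."""
    if os.path.isfile(path_or_name):
        return os.path.abspath(path_or_name)
    return shutil.which(path_or_name)


# Possible usage in this notebook (not executed here):
# if resolve_executable(task_config_ce_calibration["thermoExe"]) is None:
#     print("ThermoRawFileParser not found; adjust 'thermoExe' in the config")
```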

#### Save config as JSON

In [None]:
with open('./ce_calibration_config.json', 'w') as fp:
    json.dump(task_config_ce_calibration, fp)

#### Run CE calibration job

In [None]:
run_job("./ce_calibration_config.json")

### C- Rescoring

Rescoring involves CE calibration, after which predictions with the optimal CE are retrieved. This takes around 10 minutes if file conversion and CE calibration were already performed in the previous step. If not, runtime increases to around 20 minutes.

#### Generate config file

In [None]:
task_config_rescoring = {
    "type": "Rescoring",
    "tag": "",
    "inputs":{
        "search_results": input_dir + "/msms.txt",
        "search_results_type": "Maxquant",
        "spectra": input_dir,
        "spectra_type": "raw"
    },
    "output": "./out",
    "models": {
        "intensity": "Prosit_2020_intensity_HCD",
        "irt": "Prosit_2019_irt"
    },
    "prediction_server": "koina.wilhelmlab.org:443",
    "ssl": True,
    "thermoExe": "ThermoRawFileParser.exe",
    "numThreads": 4,
    "fdr_estimation_method": "percolator",  # ensure percolator is installed on your system
    "regressionMethod": "spline",
    "allFeatures": False,
    "massTolerance": 20,
    "unitMassTolerance": "ppm"
}
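
The rescoring config shares most of its keys with the CE calibration config, so one way to keep the two in sync is to derive one from the other. This is only a sketch with an abbreviated stand-in dictionary; `make_rescoring_config` is a hypothetical helper, and the derived config inherits every shared value (including `thermoExe`) from its source.

```python
import copy

# Abbreviated stand-in for the CE calibration config defined earlier.
ce_config = {
    "type": "CollisionEnergyCalibration",
    "inputs": {"search_results_type": "Maxquant", "spectra_type": "raw"},
    "massTolerance": 20,
    "unitMassTolerance": "ppm",
    "numThreads": 4,
}


def make_rescoring_config(base):
    """Copy `base` and set the rescoring-specific keys used in this tutorial."""
    cfg = copy.deepcopy(base)  # deep copy so nested dicts are not shared
    cfg.update({
        "type": "Rescoring",
        "fdr_estimation_method": "percolator",
        "regressionMethod": "spline",
        "allFeatures": False,
    })
    return cfg


rescoring_config = make_rescoring_config(ce_config)
```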

#### Save config as JSON

In [None]:
with open('./rescoring_config.json', 'w') as fp:
    json.dump(task_config_rescoring, fp)

#### Run rescoring job

In [None]:
run_job("./rescoring_config.json")