# Rescoring of timsTOF data using Oktoberfest

This notebook provides an overview of rescoring timsTOF data in Oktoberfest. Including the one-time file download (about 12 minutes) and rescoring (about 20 minutes), the total runtime is around 35 minutes.

## 1. Import necessary Python packages

In [None]:
import os
from oktoberfest.runner import run_job
import json
import urllib.request
import shutil
from tqdm import tqdm

## 2. Download files from Zenodo required to run different tasks

The data used in this tutorial is provided in a public Zenodo record.
This is a larger dataset with 2.55 GB in total. Download time should be ~15 minutes (average 3 MB/s).

### Get the current directory and set the file name

In [None]:
download_dir = os.getcwd()
download_file = os.path.join(download_dir, "Oktoberfest_timsTOF_input.zip")
url = "https://zenodo.org/record/10868376/files/Oktoberfest_timsTOF_input.zip"

# set download to False if you already have the file and don't want to download again in the next step
download = True

### Download and extract files from Zenodo to the same directory

In [None]:
if download:
    with tqdm(unit="B", total=2160299072, unit_scale=True, unit_divisor=1000, miniters=1, desc=url.split("/")[-1]) as t:
        urllib.request.urlretrieve(url=url, filename=download_file, reporthook=lambda blocks, block_size, _: t.update(blocks * block_size - t.n))
    shutil.unpack_archive(download_file, download_dir)
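A large download can be interrupted without raising an error, which leaves a truncated archive behind. The sketch below shows one way to check the archive with the standard-library `zipfile` module before relying on it. The helper name `archive_is_usable` is hypothetical and not part of Oktoberfest, and the full integrity check reads every member, so it may take a while on an archive of this size.

```python
import zipfile


def archive_is_usable(path):
    """Return True if `path` is a readable zip archive with no corrupt members.

    Hypothetical helper for illustration only; not part of Oktoberfest.
    """
    # A truncated download usually lacks the end-of-central-directory record,
    # in which case is_zipfile() returns False.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        return archive.testzip() is None
```

For example, `archive_is_usable(download_file)` could be called after the download; if it returns `False`, deleting the file and downloading it again is probably the simplest fix.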

### Check downloaded files

In [None]:
# strip the ".zip" extension to get the path of the extracted folder
input_dir = os.path.splitext(download_file)[0]
print(f"Downloaded data is stored in {input_dir}\nContents:")
os.listdir(input_dir)

## 3. Rescoring

Rescoring involves CE calibration, after which predictions with the optimal CE are retrieved. This takes around 25 minutes, of which 15 minutes are spent on file transformation and MS2 spectra aggregation, which is only needed once. Subsequent runs therefore take around 10 minutes.
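To make the idea of CE calibration more concrete, the following minimal sketch picks the CE whose predicted spectrum has the highest spectral angle to an observed spectrum. It assumes the normalised spectral angle commonly used with Prosit, `1 - 2 * arccos(cosine) / pi`. The functions `spectral_angle` and `best_ce` and the toy intensities are illustrative assumptions, not Oktoberfest's actual implementation.

```python
import math


def spectral_angle(observed, predicted):
    """Normalised spectral angle between two intensity vectors (1 = identical)."""
    dot = sum(o * p for o, p in zip(observed, predicted))
    norm = math.sqrt(sum(o * o for o in observed)) * math.sqrt(sum(p * p for p in predicted))
    if norm == 0:
        return 0.0
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / norm))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def best_ce(observed, predictions_by_ce):
    """Return the CE whose predicted intensities best match the observed ones."""
    return max(predictions_by_ce, key=lambda ce: spectral_angle(observed, predictions_by_ce[ce]))


# Toy example with made-up intensities for three candidate CEs.
observed = [0.2, 1.0, 0.5]
predictions = {20: [1.0, 0.1, 0.1], 30: [0.2, 1.0, 0.4], 40: [0.0, 0.5, 1.0]}
print(best_ce(observed, predictions))
```

In practice the comparison would be made across many PSMs rather than a single spectrum; the sketch only illustrates the selection criterion.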

### Generate config file

In [None]:
task_config_rescoring = {
    "type": "Rescoring",
    "tag": "",
    "inputs":{
        "search_results": input_dir + "/txt",
        "search_results_type": "Maxquant",
        "spectra": input_dir,
        "spectra_type": "d"
    },
    "output": "./timstof_out",
    "models": {
        "intensity": "Prosit_2023_intensity_timsTOF",
        "irt": "Prosit_2019_irt"
    },
    "prediction_server": "koina.wilhelmlab.org:443",
    "ssl": True,
    "thermoExe": "ThermoRawFileParser.exe",
    "numThreads": 1,
    "fdr_estimation_method": "percolator",
    "regressionMethod": "spline",
    "allFeatures": False,
    "massTolerance": 40,
    "unitMassTolerance": "ppm",
    "ce_alignment_options": {
        "ce_range": [
            5,
            45
        ],
        "use_ransac_model": True
    }
}
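The config above sets `massTolerance` to 40 with `unitMassTolerance` set to `"ppm"`. As a worked illustration of what that means in absolute terms, the sketch below applies the standard ppm-to-Dalton conversion. The helper names are hypothetical, and how Oktoberfest applies the tolerance internally is not shown here.

```python
def ppm_to_da(mz, tolerance_ppm):
    """Convert a relative tolerance in ppm to an absolute window in Da at a given m/z."""
    return mz * tolerance_ppm / 1e6


def within_tolerance(observed_mz, theoretical_mz, tolerance_ppm=40):
    """Check whether an observed peak lies within the ppm window of a theoretical m/z."""
    return abs(observed_mz - theoretical_mz) <= ppm_to_da(theoretical_mz, tolerance_ppm)


# At m/z 500, a 40 ppm tolerance corresponds to a window of about +/- 0.02.
print(ppm_to_da(500.0, 40))
```

Because the tolerance is relative, the absolute window grows with m/z: the same 40 ppm is twice as wide at m/z 1000 as at m/z 500.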

### Save the config file

In [None]:
with open("./rescoring_config.json", "w") as fp:
    json.dump(task_config_rescoring, fp)
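Since a full run takes a while, it can help to confirm that the input paths in the saved config exist before starting. The sketch below reads the JSON file back and reports missing paths. The helper `missing_input_paths` is hypothetical and only inspects the `search_results` and `spectra` keys used in this notebook.

```python
import json
import os


def missing_input_paths(config_path):
    """Return the input paths from the config file that do not exist on disk.

    Hypothetical helper for illustration only; not part of Oktoberfest.
    """
    with open(config_path) as fp:
        config = json.load(fp)
    inputs = config.get("inputs", {})
    candidates = [inputs.get("search_results"), inputs.get("spectra")]
    # Skip keys that are absent and keep only paths that cannot be found.
    return [path for path in candidates if path and not os.path.exists(path)]
```

Calling `missing_input_paths("./rescoring_config.json")` should return an empty list if the download and extraction step completed as expected.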

### Start rescoring

In [None]:
run_job("./rescoring_config.json")

### Check the results

The results are written to the output folder specified in the config file. To provide a general overview, the output of the above cell also shows the CE alignment, the distribution of percolator scores for targets and decoys when rescoring with and without Prosit-derived features, and the PSMs and peptides below 1% FDR that are gained, shared and lost when rescoring with Prosit-derived features versus without.

The expected result should show:

1. CE alignment: For precursor charge states 1-4 (scatter plots), there is a linear relation between precursor mass and the delta in CE (the difference between the reported CE and the best CE, where the best CE is the one yielding the highest spectral angle for predictions). That is, the delta decreases with increasing precursor mass.
2. Target-Decoy percolator score distribution: Rescoring with Prosit-derived features (along the y-axis of the joint plot) shows a bimodal distribution for targets that separates what is expected to be true positives from false positives. The latter follow the decoy distribution very well, which indicates that percolator's rescoring worked well according to FDR estimation using the target-decoy approach. Rescoring without Prosit-derived features (along the x-axis) shows poorer separation, leading to fewer target PSMs / peptides retained below a 1% FDR cutoff (green dots not above a percolator score of 0, as indicated by the red line) compared to rescoring with Prosit-derived features.
3. Lost-common-gained PSMs and peptides: The stacked barplots quantify how many PSMs and peptides below 1% FDR are lost, common and gained when rescoring with Prosit-derived features versus rescoring without. You should see an increase of approximately 80% in PSMs and peptides below 1% FDR. These are 3570 common and 2852 gained peptides, and 4379 common and 3581 gained PSMs. At the same time, the low number of lost PSMs (55) and peptides (47) serves as another quality control: rescoring with Prosit-derived features improves the overall number without losing identifications that would be found using the baseline features, which include the search-engine-derived score. The exact numbers may vary based on percolator's random initialization.
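The approximate 80% increase can be reproduced from the counts quoted above. The sketch below assumes that the baseline (rescoring without Prosit-derived features) is the sum of common and lost identifications, and that the result with Prosit-derived features is the sum of common and gained identifications. The function name `relative_gain` is illustrative.

```python
def relative_gain(common, gained, lost):
    """Relative increase in identifications when adding Prosit-derived features."""
    baseline = common + lost        # identified without Prosit-derived features
    with_prosit = common + gained   # identified with Prosit-derived features
    return (with_prosit - baseline) / baseline


peptide_gain = relative_gain(common=3570, gained=2852, lost=47)
psm_gain = relative_gain(common=4379, gained=3581, lost=55)
print(f"Peptides: {peptide_gain:.1%}, PSMs: {psm_gain:.1%}")
```

Both values come out close to the approximately 80% stated above. Your own numbers may differ slightly, since the counts depend on percolator's random initialization.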