## Installation:

### Percolator:
- To install Percolator on Windows, download this file: https://github.com/percolator/percolator/releases/download/rel-3-06-01/percolator-v3-06.exe
- Run the downloaded installer. During setup, make sure to select "add percolator to the system PATH for all users" when asked.

### ThermoRawFileParser:
- You only need this if you want to read Thermo raw files.
- First, download this zip archive locally: https://github.com/compomics/ThermoRawFileParser/releases/download/v1.4.3/ThermoRawFileParser1.4.3.zip
- Extract the contents of the archive and note where you saved them. Oktoberfest needs this location later.

### Oktoberfest:
- Oktoberfest currently supports Python 3.9 and 3.10, so please install one of these Python versions.
- Install Oktoberfest using `pip install oktoberfest`.

### Site Annotation:
- Install the site annotation package using `pip install psite-annotation`.
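
Before moving on, it can help to confirm that the tools above are reachable from Python. The sketch below is not part of Oktoberfest: `check_environment` is a hypothetical helper that looks for the `percolator` executable on the PATH and checks whether the `oktoberfest` and `psite_annotation` packages can be found, without importing them.

```python
import importlib.util
import shutil


def check_environment(executables=("percolator",), packages=("oktoberfest", "psite_annotation")):
    """Return a dict mapping each tool name to True if it was found."""
    status = {}
    for exe in executables:
        # shutil.which returns None when the executable is not on the PATH
        status[exe] = shutil.which(exe) is not None
    for pkg in packages:
        # find_spec returns None when a top-level package is not installed
        status[pkg] = importlib.util.find_spec(pkg) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If `percolator` is reported as missing right after installation, opening a new terminal (so that the updated PATH is picked up) is worth trying first.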

## 1- Rescoring

In [1]:
from oktoberfest.runner import run_job
from oktoberfest import __version__ as version
import json
from tqdm.auto import tqdm


In [2]:
spectra = '/cmnfs/data/proteomics/Phospho_Histidines/'  # directory containing the measured spectra (mzML or raw files)
spectra_type = 'raw'  # format the spectra are provided in ("mzml", "raw", "d")

search_results = '/cmnfs/data/proteomics/Phospho_Histidines/Ecoli/'  # location of the search engine output
search_results_type = 'MsFragger'  # search engine type ('maxquant', 'MsFragger')

In [3]:
intensity_model = "Prosit_2024_intensity_PTMs_gl"  # model used for fragment intensity prediction
retention_time_model = "Prosit_2024_irt_PTMs_gl"  # model used for retention time prediction
prediction_server = "10.157.98.66:9500"  # Koina server that provides access to the specified models, given as "<url>:<port number>"

output_directory = '/cmnfs/proj/prosit/ptms/Phospho_hist/Ecoli'  # directory where the results are written

In [4]:
# path to your local ThermoRawFileParser.exe file, e.g. 'extracted_ThermoRawFileParser/ThermoRawFileParser.exe'
thermo_exe_directory = '/cmnfs/home/w.gabriel/ThermoRawFileParser/ThermoRawFileParser.exe'

- Documentation for the different parameters in the config file can be found here: https://oktoberfest.readthedocs.io/en/stable/config.html

In [5]:
task_config_rescoring = {
    "type": "Rescoring",
    "tag": "",
    "inputs": {
        "search_results": search_results,
        "search_results_type": search_results_type,
        "spectra": spectra,
        "spectra_type": spectra_type
    },
    "output": output_directory,
    "models": {
        "intensity": intensity_model,
        "irt": retention_time_model
    },
    "prediction_server": prediction_server,
    "ssl": False,
    "thermoExe": thermo_exe_directory,
    "numThreads": 5,
    "fdr_estimation_method": "percolator",
    "regressionMethod": "spline",
    "allFeatures": False,
    "pipeline": "cit",
    "ptm_localization": True,
    "ptmLocalizationOptions": {
        "unimod_id": 21,
        "possible_sites": ['H','S','T','Y'],
        "neutral_loss": True
    }
}

with open('./rescoring_config.json', 'w') as fp:
    json.dump(task_config_rescoring, fp)
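
A quick sanity check on the written file can catch missing keys or mistyped paths before a long run. The sketch below is illustrative only: `check_config` is a hypothetical helper, and the keys it requires are taken from the dictionary above, not from Oktoberfest's own validation.

```python
import json
from pathlib import Path

# Keys taken from the config dictionary above (an assumption, not Oktoberfest's official schema)
REQUIRED_KEYS = ("type", "inputs", "output", "models", "prediction_server")


def check_config(config_path):
    """Return a list of human-readable problems found in a config JSON file."""
    with open(config_path) as fp:
        config = json.load(fp)

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]

    # The input locations must exist before the job starts
    for name in ("search_results", "spectra"):
        value = config.get("inputs", {}).get(name)
        if value is None:
            problems.append(f"missing input: {name}")
        elif not Path(value).exists():
            problems.append(f"path does not exist: {name} = {value}")
    return problems
```

Calling `check_config("./rescoring_config.json")` returns an empty list when none of these problems are found.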

In [10]:
run_job("./rescoring_config.json")

2024-10-10 12:01:35,122 - INFO - oktoberfest.utils.config::read Reading configuration from ./rescoring_config.json
2024-10-10 12:01:35,127 - INFO - oktoberfest.runner::run_job Oktoberfest version 0.8.1
Copyright 2024, Wilhelmlab at Technical University of Munich
2024-10-10 12:01:35,128 - INFO - oktoberfest.runner::run_job Job executed with the following config:
2024-10-10 12:01:35,130 - INFO - oktoberfest.runner::run_job {
    "type": "Rescoring",
    "tag": "",
    "inputs": {
        "search_results": "/cmnfs/proj/prosit_cit/Test_oktoberfest/",
        "search_results_type": "MSFragger",
        "spectra": "/cmnfs/proj/prosit_cit/Test_oktoberfest/",
        "spectra_type": "raw"
    },
    "output": "/cmnfs/proj/prosit_cit/Test_oktoberfest/output/",
    "models": {
        "intensity": "Prosit_2024_intensity_cit",
        "irt": "Prosit_2024_irt_cit"
    },
    "prediction_server": "koina.wilhelmlab.org:443",
    "ssl": true,
    "thermoExe": "ThermoRawFileParser/ThermoRawFileParser.

100%|██████████| 1/1 [00:25<00:00, 25.22s/it]


2024-10-10 12:02:01,827 - INFO - spectrum_io.search_result.msfragger::filter_valid_prosit_sequences #sequences before filtering for valid prosit sequences: 99769
2024-10-10 12:02:01,963 - INFO - spectrum_io.search_result.msfragger::filter_valid_prosit_sequences #sequences after filtering for valid prosit sequences: 96891
2024-10-10 12:02:02,578 - INFO - oktoberfest.runner::_preprocess Read 96891 PSMs from /cmnfs/proj/prosit_cit/Test_oktoberfest/output/msms/msms.prosit
2024-10-10 12:02:02,663 - INFO - oktoberfest.preprocessing.preprocessing::split_search Creating split search results file /cmnfs/proj/prosit_cit/Test_oktoberfest/output/msms/YIG_244_L009_04_01_U01_R2.rescore
2024-10-10 12:02:03,247 - INFO - spectrum_io.raw.thermo_raw::convert_raw_mzml Found converted file at /cmnfs/proj/prosit_cit/Test_oktoberfest/output/spectra/YIG_244_L009_04_01_U01_R2.mzML, skipping conversion
2024-10-10 12:02:03,380 - INFO - spectrum_io.raw.msraw::_read_mzml_pyteomics Reading mzML file: /cmnfs/proj/pr

Prosit_2024_intensity_cit:: 100%|██████████| 32/32 [00:08<00:00,  3.97it/s]


2024-10-10 12:07:08,093 - INFO - oktoberfest.utils.process_step::is_done Skipping ce_calib.YIG_244_L009_04_01_U01_R2 step because /cmnfs/proj/prosit_cit/Test_oktoberfest/output/proc/ce_calib.YIG_244_L009_04_01_U01_R2.done was found.
2024-10-10 12:07:08,921 - INFO - oktoberfest.predict.predictor::from_config Using model Prosit_2024_intensity_cit via Koina


Prosit_2024_intensity_cit:: 100%|██████████| 97/97 [00:22<00:00,  4.28it/s]


2024-10-10 12:07:56,155 - INFO - oktoberfest.predict.predictor::from_config Using model Prosit_2024_irt_cit via Koina


Prosit_2024_irt_cit:: 100%|██████████| 97/97 [00:05<00:00, 18.38it/s]


2024-10-10 12:09:46,440 - INFO - spectrum_fundamentals.metrics.percolator::get_indices_below_fdr Found 29024 (out of 69348) targets below 0.01             FDR using spectral_angle as feature
2024-10-10 12:09:46,444 - INFO - spectrum_fundamentals.metrics.percolator::apply_lda_and_get_indices_below_fdr Found 29024 targets and 27543 decoys as input for the LDA model
2024-10-10 12:09:46,985 - INFO - spectrum_fundamentals.metrics.percolator::get_indices_below_fdr Found 35861 (out of 69348) targets below 0.01             FDR using lda_scores as feature
2024-10-10 12:09:47,101 - INFO - spectrum_fundamentals.metrics.percolator::calc Median absolute error predicted vs observed retention time on targets < 1% FDR: 0.6327482475378723
2024-10-10 12:09:52,159 - INFO - oktoberfest.runner::run_rescoring Merging input tab files for rescoring without peptide property prediction
2024-10-10 12:09:54,077 - INFO - oktoberfest.runner::run_rescoring Merging input tab files for rescoring with peptide property 
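
The two `get_indices_below_fdr` lines in the log above report how many targets pass 1% FDR using the spectral angle alone and using the LDA score. As a small worked example, the sketch below pulls those counts out of such lines with a regular expression and computes the relative gain. `targets_below_fdr` is a hypothetical helper written for this log format only.

```python
import re

# Matches e.g. "Found 29024 (out of 69348) targets below 0.01   FDR using spectral_angle as feature"
FDR_LINE = re.compile(r"Found (\d+) \(out of (\d+)\) targets below ([\d.]+)\s+FDR using (\w+) as feature")


def targets_below_fdr(log_text):
    """Map feature name -> (targets below FDR, total targets) for each matching log line."""
    return {m.group(4): (int(m.group(1)), int(m.group(2))) for m in FDR_LINE.finditer(log_text)}


# The two relevant lines, copied from the log output above (timestamps shortened)
log_text = """
... Found 29024 (out of 69348) targets below 0.01             FDR using spectral_angle as feature
... Found 35861 (out of 69348) targets below 0.01             FDR using lda_scores as feature
"""

counts = targets_below_fdr(log_text)
sa_hits, total = counts["spectral_angle"]
lda_hits, _ = counts["lda_scores"]
print(f"spectral_angle: {sa_hits / total:.1%} of targets")
print(f"lda_scores:     {lda_hits / total:.1%} of targets")
print(f"relative gain:  {lda_hits / sa_hits - 1:.1%}")
```

For this run, that is about 41.9% of targets with the spectral angle alone and about 51.7% with the LDA score, a relative gain of roughly 23.6%.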

## 2- Site annotation

In [2]:
import pandas as pd
import psite_annotation as pa
import re
from pathlib import Path

#### A. Extract spectral angle (SA) scores from rescore.tab and combine them with the rescored Percolator results

In [56]:
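# Folder containing the Percolator rescoring output.
# Assumption: Oktoberfest writes it to <output_directory>/results/percolator; adjust this if your layout differs.
fdr_dir = Path(output_directory) / "results" / "percolator"

# Load spectral_angle and SpecId from the rescore.tab file
combined_df = pd.read_csv(
    fdr_dir / "rescore.tab",
    sep='\t',
    usecols=["spectral_angle", "SpecId"]
)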

# Load the peptide-level Percolator results obtained after rescoring
df_prosit_psms = pd.read_csv(
    fdr_dir / "rescore.percolator.peptides.txt",
    sep='\t'
)

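# Remove the "_." prefix and "._" suffix that flank the peptide sequence
# (regex=False so that "." is matched literally and not as a wildcard)
df_prosit_psms['peptide'] = df_prosit_psms['peptide'].str.replace('._', '', regex=False)
df_prosit_psms['peptide'] = df_prosit_psms['peptide'].str.replace('_.', '', regex=False)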

# Rename PSMId to SpecId so that it matches the key column in rescore.tab
df_prosit_psms.rename(columns={"PSMId": "SpecId"}, inplace=True)

# Keep phosphoserine peptides (S[UNIMOD:21]) at 1% FDR
df_prosit_psms = df_prosit_psms[
    (df_prosit_psms['q-value'] <= 0.01) &
    (df_prosit_psms['peptide'].str.contains(r"S\[UNIMOD:21\]"))
]

# Add the spectral angle to each remaining peptide
merged_df = pd.merge(df_prosit_psms, combined_df, on="SpecId", how="left")
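
To make the clean-up, filter, and merge steps concrete, here is a toy example on made-up data. The peptides, IDs, and scores below are invented for illustration and do not come from the run above. It strips the flanking `_.` and `._` literally, keeps phosphoserine peptides at 1% FDR, and left-joins the spectral angle.

```python
import pandas as pd

# Made-up rows standing in for rescore.percolator.peptides.txt
peptides = pd.DataFrame({
    "PSMId": ["a", "b", "c"],
    "q-value": [0.001, 0.005, 0.20],
    "peptide": ["_.AAS[UNIMOD:21]K._", "_.PEPTIDEK._", "_.GS[UNIMOD:21]R._"],
})
# Made-up rows standing in for rescore.tab
scores = pd.DataFrame({"SpecId": ["a", "b", "c"], "spectral_angle": [0.81, 0.77, 0.35]})

# regex=False treats "." as a literal dot instead of "any character"
peptides["peptide"] = (
    peptides["peptide"]
    .str.replace("._", "", regex=False)
    .str.replace("_.", "", regex=False)
)
peptides = peptides.rename(columns={"PSMId": "SpecId"})

# Row "b" has no phosphoserine and row "c" fails the q-value cut-off, so only "a" remains
keep = (peptides["q-value"] <= 0.01) & peptides["peptide"].str.contains(r"S\[UNIMOD:21\]")
toy_merged = peptides[keep].merge(scores, on="SpecId", how="left")
print(toy_merged)
```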



#### B. Map peptides to the modification site/sequence window using the corresponding FASTA file

In [45]:
Fasta_file_path = "/cmnfs/home/w.gabriel/Prosit_PTM_paper/phospho_his/ecoli_uniprot.fasta"
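
The protein identifiers in the results follow the headers of this FASTA file. As a minimal illustration, assuming UniProt-style identifiers of the form `db|ACCESSION|ENTRY_NAME`, the sketch below applies the same regular expression used in the next cell to pull out the accession. The identifier strings are made-up examples, and `extract_accession` is a hypothetical helper.

```python
import re


def extract_accession(protein_id):
    """For an ID like 'db|ACC|NAME' return 'ACC'; IDs without two pipes are returned unchanged."""
    return re.sub(r'^.*\|(.*?)\|.*$', r'\1', protein_id)


print(extract_accession("sp|P12345|EXAMPLE_ECOLI"))  # P12345
print(extract_accession("P12345"))                   # P12345 (no pipes, left as is)
```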

In [57]:
# Optional: remove contaminant proteins, if present
#merged_df['Organism'] = merged_df['proteinIds'].apply(lambda x: x.split(':')[0].split('_')[-1])
#main_organism = merged_df['Organism'].value_counts().index[0]
#merged_df = merged_df[merged_df['proteinIds'].str.contains(r'_'+ main_organism, regex=True)]

# Select relevant columns (copy to avoid modifying a view of the original DataFrame)
merged_df = merged_df[['filename', 'peptide', 'proteinIds', 'spectral_angle']].copy()

# Extract the protein accession (the field between the pipes) from the 'proteinIds' column using a regex
merged_df['proteinIds'] = merged_df['proteinIds'].apply(lambda x: re.sub(r'^.*\|(.*?)\|.*$', r'\1', x))

# Pivot the DataFrame
dt = merged_df.pivot_table(index=['peptide', 'proteinIds'], columns='filename', values='spectral_angle').reset_index()
dt.rename(columns={'proteinIds': 'Proteins', 'peptide': 'Modified sequence'}, inplace=True)

# Remove all modification tags from 'Modified sequence' except S[UNIMOD:21], which is mapped below
mod_replacements = {
    r"C\[UNIMOD:4\]": "C",
    r"M\[UNIMOD:35\]": "M",
    r"H\[UNIMOD:21\]": "H",
    r"T\[UNIMOD:21\]": "T",
    r"Y\[UNIMOD:21\]": "Y",
}

# Apply all replacements
for pattern, replacement in mod_replacements.items():
    dt['Modified sequence'] = dt['Modified sequence'].str.replace(pattern, replacement, regex=True)

# Annotate peptide and site positions using the FASTA file that was used for the search
CustomFasta = Fasta_file_path
dt = pa.addPeptideAndPsitePositions(dt, CustomFasta, mod_dict={'S[UNIMOD:21]': 's'})
dt = dt[dt['Site positions'] != ""]

#dt.to_csv(output_directory +"/Cit_rescore_site_mapping.txt", sep='\t', index=False)

OSError: Cannot save file into a non-existent directory: 'X:/internal_projects/L009_cit_temp_MS_Fragger'
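
The `OSError` above is what pandas raises when `to_csv` targets a folder that does not exist. One way to avoid it is to create the folder before writing, sketched here with a hypothetical `save_table` helper and a temporary folder standing in for the real output directory. The toy table is invented for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_table(df, out_dir, filename):
    """Write df as a tab-separated file, creating out_dir first if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    out_path = out_dir / filename
    df.to_csv(out_path, sep="\t", index=False)
    return out_path


# Toy table and a temporary folder, used here only for illustration
toy = pd.DataFrame({"Modified sequence": ["AAsK"], "Site positions": ["P12345_S3"]})
with tempfile.TemporaryDirectory() as tmp:
    written = save_table(toy, Path(tmp) / "new_folder", "site_mapping.txt")
    print(written.exists())
```

In this notebook, the equivalent call would be `save_table(dt, output_directory, "Cit_rescore_site_mapping.txt")`.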