# Search Results Files

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import plotly.express as px
!pip -q install itables
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
import itables.options as opt
opt.maxBytes = 0
opt.classes = ["display", "nowrap","compact","hover"]
opt.showIndex = False
opt.style = "max-width:6000px"
pd.set_option('display.max_colwidth', 400)

In [None]:
! wget http://genesis.ugent.be/uvpublicdata/ionbot.workshop/ionbot.workshop.data.zip
! unzip ionbot.workshop.data.zip
! git clone https://github.com/sdgroeve/ionbot.workshop.git
! mv ionbot.workshop/* .

The `ionbot.twbx` result file contains the matched peptides and proteins as a compressed file. 

This file can be renamed to better reflect the processed sample.

In the following field you can specify the path to the result file: 

In [None]:
twbx_file = "ionbot_20150929_QE5_UPLC10_RJC_SA_Plaque6_01_filtered.twbx"
#twbx_file = "c:\\Work\\ionbot.workshop.data\\ionbot_20150929_QE5_UPLC10_RJC_SA_Plaque6_01_filtered.twbx"

This result file can be extracted as a zip file, but here we will decompress the file using Python.

You can specify the folder to extract the ionbot result files to:

In [None]:
result_folder = "my_results"
#result_folder = "c:\\Work\\ionbot.workshop\\my_results"

In [None]:
import zipfile

# Extract only the members under Data/; the context manager closes the archive afterwards
with zipfile.ZipFile(twbx_file) as archive:
    for file in archive.namelist():
        if file.startswith('Data/'):
            archive.extract(file, result_folder)

The result files are written to the subfolder `Data/ionbot_result`:

In [None]:
result_folder = result_folder + "/Data/ionbot_result"

The content of the result files is described [here](https://ionbot.cloud/help).
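
Before loading anything, it can help to confirm that the files used in this notebook were actually extracted. The following is a minimal sketch: the file names are the ones loaded in the cells of this notebook, while the helper `missing_result_files` and the temporary demo folder are made up for illustration.

```python
import os
import tempfile

# File names taken from the cells in this notebook.
EXPECTED_FILES = [
    "ionbot.first.csv",
    "ionbot.lower.csv",
    "ionbot.features.csv",
    "ionbot.first.proteins.csv",
    "ionbot.coeluting.proteins.csv",
]

def missing_result_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(folder, name))]

# Demonstration on a throwaway folder that holds only one of the files.
with tempfile.TemporaryDirectory() as demo_folder:
    open(os.path.join(demo_folder, "ionbot.first.csv"), "w").close()
    demo_missing = missing_result_files(demo_folder)

print(demo_missing)
```

In the notebook itself you would call `missing_result_files(result_folder)` and expect an empty list.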

## The PSM results

First, we load the result file that contains the first ranked matches for each MS2 spectrum:

In [None]:
ionbot = pd.read_csv("%s/ionbot.first.csv"%result_folder)

These are the column names:

In [None]:
for col in ionbot.columns:
    print(col)

Let's print some columns and explain the content:

In [None]:
cols_to_use = ["ionbot_match_id","database_peptide","matched_peptide",
               "modifications","modifications_delta","unexpected_modification"]
ionbot[cols_to_use]

The column `database` is `T` if the PSM matched the target database and `D` if it matched the decoy database.

In [None]:
cols_to_use = ["ionbot_match_id","database","q-value"]
ionbot[cols_to_use]

The result file contains all matches with FDR <= 1%. We can count the target and decoy matches:

In [None]:
print(ionbot["database"].value_counts())
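
The target and decoy counts allow a rough sanity check of the error rate. The sketch below applies the common target-decoy estimate (number of decoy matches divided by number of target matches) to a made-up table. Whether ionbot computes its q-values in exactly this way is not stated here, so treat this as an illustration only.

```python
import pandas as pd

# Made-up stand-in for the `database` column: 198 target and 2 decoy matches.
toy = pd.DataFrame({"database": ["T"] * 198 + ["D"] * 2})

counts = toy["database"].value_counts()
n_target = int(counts.get("T", 0))
n_decoy = int(counts.get("D", 0))

# Simple target-decoy estimate; guards against an empty target set.
estimated_fdr = n_decoy / n_target if n_target else float("nan")
print(f"{n_decoy} decoys / {n_target} targets = {estimated_fdr:.4f}")
```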

The column `psm_score` contains the PSM score for the matched spectra:

In [None]:
px.histogram(ionbot,
             x="psm_score", 
             color="database", 
             nbins=50
            )

Next, we load the result file that contains the lower ranked (co-eluting) matches for each MS2 spectrum and add these to the search results:

In [None]:
ionbot["rank"] = "first"
tmp = pd.read_csv("%s/ionbot.lower.csv"%result_folder)
tmp["rank"] = "lower"
# ignore_index avoids duplicated index labels in the combined table
ionbot = pd.concat([ionbot,tmp], ignore_index=True)

For the remainder, we remove the matches against the decoy database:

In [None]:
ionbot = ionbot[(ionbot["database"]=="T")]

While adding the lower ranked matches we created a column `rank` that has value 'first' if the match was ranked first based on the psm_score, and 'lower' otherwise:

In [None]:
print(ionbot["rank"].value_counts())

To reconstruct the LC-MS separation for matched MS2 spectra we can use the `observed_retention_time` and `precursor_mass` columns: 

In [None]:
fig = px.scatter(ionbot, 
                 x="observed_retention_time", 
                 y="precursor_mass", 
                 color="rank",
                 hover_data=["ionbot_match_id","matched_peptide"]
                )
fig.update_traces(marker=dict(size=2))
fig.show()

The `ionbot.features.csv` result file contains the matching information used in the PSM scoring function.

We load `ionbot.features.csv` and merge it with the search results: 

In [None]:
features = pd.read_csv("%s/ionbot.features.csv"%result_folder)
ionbot = ionbot.merge(features,on="ionbot_match_id",how="left")

for col in features.columns:
    print(col)
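
A left merge keeps every PSM, but it silently duplicates rows if an `ionbot_match_id` occurs more than once in the features table. The sketch below, on made-up data, uses the `validate` and `indicator` arguments of `DataFrame.merge` to guard against that and to list PSMs without features. It assumes one feature row per match, which is not verified here for the real files.

```python
import pandas as pd

# Made-up PSM and feature tables sharing the `ionbot_match_id` key.
psms = pd.DataFrame({"ionbot_match_id": ["a", "b", "c"],
                     "psm_score": [0.9, 0.8, 0.7]})
feats = pd.DataFrame({"ionbot_match_id": ["a", "b"],
                      "by-count": [12, 7]})

# validate raises MergeError if the key is not unique on both sides;
# indicator adds a `_merge` column telling where each row came from.
merged = psms.merge(feats, on="ionbot_match_id", how="left",
                    validate="one_to_one", indicator=True)

# PSMs that found no matching feature row.
unmatched = merged.loc[merged["_merge"] == "left_only", "ionbot_match_id"].tolist()
print(unmatched)
```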

We can plot these feature values as boxplots:

In [None]:
px.box(ionbot, 
       y=["by-count","all-count"],
       color="rank",
       hover_data=["ionbot_match_id"]
      )

In [None]:
px.box(ionbot, 
       y=["by-explained","all-explained"],
       color="rank",
       hover_data=["ionbot_match_id"]       
      )

In [None]:
px.box(ionbot, 
       y=["by-intensity-pattern-correlation"],
       color="rank",
       hover_data=["ionbot_match_id"]      
       )

In [None]:
px.box(ionbot, 
       y=["rt-pred-error"],
       color="rank",
       hover_data=["ionbot_match_id"]
      )

In [None]:
fig = px.scatter(ionbot, 
                 x="observed_retention_time", 
                 y="predicted_retention_time",
                 color="rank",
                 hover_data=["ionbot_match_id"]
                )
fig.update_traces(marker=dict(size=2))
fig.show()

In [None]:
fig = px.scatter(ionbot, 
                 x="corrected_retention_time", 
                 y="predicted_retention_time",
                 color="rank",
                 hover_data=["ionbot_match_id"]
                )
fig.update_traces(marker=dict(size=2))
fig.show()
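
The agreement between predicted and corrected retention times can also be summarised numerically. The sketch below uses made-up values and two generic measures, the Pearson correlation and the mean absolute difference. These are illustrative choices and not necessarily how the `rt-pred-error` feature is defined.

```python
import pandas as pd

# Made-up retention times (same unit in both columns).
toy = pd.DataFrame({
    "corrected_retention_time": [10.0, 20.0, 30.0, 40.0],
    "predicted_retention_time": [11.0, 19.0, 32.0, 41.0],
})

# Pearson correlation between the two columns.
pearson_r = toy["corrected_retention_time"].corr(toy["predicted_retention_time"])

# Mean absolute difference between prediction and corrected observation.
mean_abs_error = (toy["predicted_retention_time"]
                  - toy["corrected_retention_time"]).abs().mean()

print(round(pearson_r, 3), mean_abs_error)
```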

The `proteins` column contains detailed protein matching information:

In [None]:
ionbot[["ionbot_match_id","proteins"]]

## Adding Universal Spectrum Identifiers

If the spectrum files were uploaded to a public ProteomeXchange repository, then PSM annotations can be obtained by adding Universal Spectrum Identifiers (USIs).

The USI is a proposed standard in the process of being ratified by the Proteomics Standards Initiative (PSI) that enables the identification of a specific spectrum or PSM contained in public ProteomeXchange repositories.

For more information, including the draft specification, please see http://psidev.info/usi/

The required URL can be constructed from the columns in the result files:

In [None]:
dataset = "PXD008601"

def get_universal_link(x):
    file = '.'.join(x["spectrum_file"].split('.')[:-1])
    s = x["matched_peptide"]
    if str(x["modifications"]) != "nan":
        tmp = x["modifications_delta"].split("|")
        seq = list(x["matched_peptide"])
        for i in range(0,len(tmp),2):
            pos = int(tmp[i])
            delta = tmp[i+1]
            if not delta.startswith('-'):
                delta = '%2B' + delta
            if pos == 0: #N-TERM
                seq.insert(pos,"[%s]"%delta)
            elif pos == len(seq)+1: #C-TERM
                seq.insert(pos-2,"[%s]"%delta)
            else:
                seq.insert(pos,"[%s]"%delta)
        s = ''.join(seq)
    link = "http://proteomecentral.proteomexchange.org/usi/?usi=mzspec:%s:%s:scan:%i:%s/%i"%(
        dataset,file,x["scan"],s,x["charge"])
    return '<a target="_blank" href="%s">click</a>'%link

In [None]:
ionbot["USI"] = ionbot.apply(get_universal_link,axis=1)

We have now added a column `USI` that contains links to the spectrum annotations:

In [None]:
cols_to_use = ["ionbot_match_id","database_peptide","matched_peptide",
               "modifications","modifications_delta","unexpected_modification"]
ionbot[cols_to_use + ["USI"]]
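
The identifier embedded in each link follows the layout `mzspec:<dataset>:<run>:scan:<scan>:<peptide>/<charge>`. The sketch below assembles one such string and percent-encodes the `+` sign for use in a URL. The dataset, run name and scan number are taken from this notebook. The peptide sequence, its mass shift and the helper `build_usi` are made up for illustration.

```python
from urllib.parse import quote

def build_usi(dataset, run, scan, peptide, charge):
    """Assemble a scan-based USI string from its parts."""
    return f"mzspec:{dataset}:{run}:scan:{scan}:{peptide}/{charge}"

# Dataset, run name and scan number appear in this notebook;
# the peptide and its modification are hypothetical.
usi = build_usi("PXD008601",
                "20150929_QE5_UPLC10_RJC_SA_Plaque6_01",
                12057,
                "PEPTM[+15.9949]IDEK",
                2)

# Percent-encode the '+' so it survives as a URL query value,
# while keeping ':', '/', '[' and ']' readable.
encoded = quote(usi, safe=":/[]")
print(encoded)
```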

## JQuery Lorikeet PSM Annotations

Alternatively, PSM annotations can be computed from local MGF files:

In [None]:
import annotations.lorikeet

You need to specify the folder that contains the spectrum MGF files and a folder to store the annotated spectra that are written as HTML files:

In [None]:
mgf_folder = "mgfs/"
annotations_folder = "my_annotations/"

#mgf_folder = "c:\\Work\\ionbot.workshop.data\\mgfs\\"
#annotations_folder = "c:\\Work\\ionbot.workshop\\my_annotations\\"

Next, you can specify the PSMs to annotate as follows (for each PSM, the corresponding MGF file and the scan number need to be specified):

In [None]:
to_annotate = [
    ["20150929_QE5_UPLC10_RJC_SA_Plaque6_01.mgf",12057],
    ["20150929_QE5_UPLC10_RJC_SA_Plaque6_01.mgf",12058]
]

The following code will create the PSM annotations:

In [None]:
for mgf_file, scan in to_annotate:
    html_filename = annotations.lorikeet.generate_html(annotations_folder,mgf_folder,mgf_file,scan,ionbot,l_os="linux")
    print("Annotations written to %s"%html_filename)

## Modifications

The 'unexpected_modification' column only shows the matched unexpected modification, not the modifications set as variable (expected):

In [None]:
ionbot[["ionbot_match_id","modifications","unexpected_modification"]]

All matched modifications are in the 'modifications' column. We can parse this column as follows:

In [None]:
modifications = {}

def get_modifications(x):
    if str(x) == "nan":
        return
    tmp = x.split('|')
    for i in range(0,len(tmp),2):
        if not tmp[i+1] in modifications:
            modifications[tmp[i+1]] = 0
        modifications[tmp[i+1]] += 1
        
ionbot["modifications"].apply(get_modifications)
{k: v for k, v in sorted(modifications.items(), key=lambda item: item[1], reverse=True)}
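
The same tally can be written without a global dictionary. The sketch below assumes, as the parsing code in this section does, that the column holds `position|name` pairs joined by `|`. The example values and the helper `modification_names` are made up.

```python
from collections import Counter

def modification_names(value):
    """Return the modification names from a 'pos|name|pos|name' string."""
    if not isinstance(value, str):  # missing values are read as NaN (a float)
        return []
    fields = value.split("|")
    return fields[1::2]  # every second field is a name

# Hypothetical column values, including a missing entry.
toy_values = ["4|Oxidation", "0|Acetyl|4|Oxidation", float("nan")]

counts = Counter(name for value in toy_values
                 for name in modification_names(value))
print(counts.most_common())
```

`Counter.most_common()` returns the names sorted by decreasing count, which replaces the manual sort.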

## The protein results

There are two protein inference result files:

- ionbot.first.proteins.csv
- ionbot.coeluting.proteins.csv

The first file contains the protein statistics inferred from the first ranked matches only. The second file contains the protein statistics inferred from all co-eluting matches.

We will continue with the proteins inferred from all co-eluting matches:

In [None]:
proteins = pd.read_csv("%s/ionbot.coeluting.proteins.csv"%result_folder)

In [None]:
for col in proteins.columns:
    print(col)

These are the columns (described [here](https://ionbot.cloud/help)):

The `protein_group` column is a concatenation of the proteins it contains (search for '__'):

In [None]:
cols_to_use = ["ionbot_match_id","protein_group","protein","position_in_protein","uniprot_id"]
proteins[cols_to_use]

Notice how protein groups that contain more than one protein are also split over the rows. This allows the 'position_in_protein', 'uniprot_id', 'protein_length' and 'protein_description' columns to make sense.
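
To see how many proteins a group contains, and therefore how many rows it occupies per PSM, the group name can be split on its separator. This is a small sketch on made-up accessions, assuming `__` is the separator between group members.

```python
import pandas as pd

# Made-up accessions; "__" is assumed to separate the members of a group.
toy = pd.DataFrame({"protein_group": ["P11111",
                                      "P22222__P33333",
                                      "P44444__P55555__P66666"]})

# Split each group name into its member proteins and count them.
toy["members"] = toy["protein_group"].str.split("__")
toy["n_members"] = toy["members"].str.len()
print(toy[["protein_group", "n_members"]])
```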

However, we want to look at protein groups only, so we remove these duplicated rows:

In [None]:
cols_to_use = ["ionbot_match_id","is_shared_peptide","protein_group","protein_group_q-value","protein_group_PEP"]
proteins = proteins[cols_to_use].drop_duplicates(["ionbot_match_id","protein_group"])
proteins

PSMs matched with two or more protein groups are indicated in the `is_shared_peptide` column:

In [None]:
print(proteins["is_shared_peptide"].value_counts())

We will continue with non-shared peptide matches only (you can of course skip this step):

In [None]:
proteins = proteins[proteins["is_shared_peptide"]==False]

Now we can count the number of (non-shared) PSMs in each protein group and add this as a column called `#PSMs`:

In [None]:
tmp = proteins["protein_group"].value_counts().reset_index(level=0)
tmp.columns = ["protein_group","#PSMs"]
proteins = proteins.merge(tmp,on="protein_group",how="left")
proteins.drop_duplicates(["protein_group"])[["protein_group","protein_group_q-value","#PSMs"]]
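
The column names returned by `value_counts().reset_index()` differ between pandas versions. The sketch below, on made-up data, names both columns explicitly with `rename_axis` and `reset_index(name=...)`, so the result does not depend on those defaults.

```python
import pandas as pd

# Made-up protein group assignments, one row per PSM.
toy = pd.DataFrame({"protein_group": ["A", "A", "A", "B", "C__D", "C__D"]})

psm_counts = (
    toy["protein_group"]
    .value_counts()
    .rename_axis("protein_group")   # name of the index (the group)
    .reset_index(name="#PSMs")      # name of the count column
)

counts_by_group = dict(zip(psm_counts["protein_group"], psm_counts["#PSMs"]))
print(counts_by_group)
```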

We can then count the number of protein groups with a specific number of PSMs:

In [None]:
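tmp = proteins.drop_duplicates("protein_group")["#PSMs"].value_counts().reset_index()
# Name the columns explicitly; the defaults differ between pandas versions
tmp.columns = ["#PSMs","#protein groups"]
fig = px.pie(tmp, values='#protein groups', names='#PSMs', title='#PSMs in protein group')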
fig.update_traces(textposition='inside')
fig.show()

To compute counts at the peptide level we need to merge the `proteins` data with the `ionbot` data (we do this using the `ionbot_match_id` column):

In [None]:
proteins = proteins.merge(ionbot,on="ionbot_match_id",how="left")

In [None]:
proteins.columns

Now we can count the number of unique peptides in each protein group and add this as a column called `#peptides`:

In [None]:
tmp = proteins.drop_duplicates("matched_peptide")["protein_group"].value_counts().reset_index(level=0)
tmp.columns = ["protein_group","#peptides"]
proteins = proteins.merge(tmp,on="protein_group",how="left")

In [None]:
proteins[cols_to_use + ["#peptides"]]
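
An alternative way to obtain such a count is `groupby(...).nunique()`, which counts distinct peptides within each group rather than dropping duplicates across the whole table first. This is a sketch on made-up data, offered as a variation and not as the method used in this notebook.

```python
import pandas as pd

# Made-up PSM table: group A has two distinct peptides, group B has one.
toy = pd.DataFrame({
    "protein_group": ["A", "A", "A", "B", "B"],
    "matched_peptide": ["PEPTIDEK", "PEPTIDEK", "LSEQR", "AAAK", "AAAK"],
})

# Number of distinct peptides per protein group.
n_peptides = toy.groupby("protein_group")["matched_peptide"].nunique()

# Attach the count to every row of its group.
toy["#peptides"] = toy["protein_group"].map(n_peptides)
print(toy)
```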

In [None]:
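tmp = proteins.drop_duplicates("protein_group")["#peptides"].value_counts().reset_index()
# Name the columns explicitly; the defaults differ between pandas versions
tmp.columns = ["#peptides","#protein groups"]
fig = px.pie(tmp, values='#protein groups', names='#peptides', title='#Peptides in protein group')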
fig.update_traces(textposition='inside')
fig.show()

We can also compute protein group specific features:

In [None]:
cols = ["psm_score","all-count","by-intensity-pattern-correlation"]
metrics = ["min","max"]


feature_cols = []
for col in cols:
    for metric in metrics:
        feature_cols.append(col+"_"+metric)
        proteins[col+"_"+metric] = proteins.groupby('protein_group')[col].transform(metric)
        
feature_cols

In [None]:
proteins[["protein_group"] + feature_cols]
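
If one row per protein group is preferred over repeating the values on every PSM row, `groupby(...).agg(...)` gives a compact summary table. The sketch below uses made-up values for two of the columns used in this section and flattens the resulting two-level column names into the same `<column>_<metric>` pattern.

```python
import pandas as pd

# Made-up PSM-level values for two protein groups.
toy = pd.DataFrame({
    "protein_group": ["A", "A", "B"],
    "psm_score": [0.2, 0.9, 0.5],
    "all-count": [10, 30, 20],
})

# One row per group, with min and max of each feature.
summary = toy.groupby("protein_group")[["psm_score", "all-count"]].agg(["min", "max"])

# Flatten the two-level column index into names such as psm_score_min.
summary.columns = ["_".join(pair) for pair in summary.columns]
summary = summary.reset_index()
print(summary)
```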