# Introduction
Quantitative mass spectrometry is a formidable approach for identifying differentially abundant proteins in shotgun proteomics experiments ([Liu et al. 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3833812/), [Mann et al. 2013](https://pubmed.ncbi.nlm.nih.gov/23438854/)). A common problem in shotgun proteomics experiments has been low reproducibility, due to the complexity of the samples analyzed ([Liu et al. 2004](https://pubmed.ncbi.nlm.nih.gov/15253663/), [Li et al. 2009](https://pubmed.ncbi.nlm.nih.gov/19294629/), [Amodei et al. 2019](https://link.springer.com/article/10.1007/s13361-018-2122-8), [Michalski et al. 2011](https://pubmed.ncbi.nlm.nih.gov/21309581/)) as well as the available methods for analyzing the data. For example, [Michalski et al. 2011](https://pubmed.ncbi.nlm.nih.gov/21309581/) found that only about 16% of the detectable peptides are typically fragmented using data-dependent LC-MS/MS methods, and the overlap of peptide identifications between experiments is typically low (35-60%) ([Tabb et al. 2010](https://pubmed.ncbi.nlm.nih.gov/19921851/)). This problem is especially prominent in data-dependent acquisition (DDA) methods. Data-independent acquisition (DIA) methods, in which spectra are acquired over predefined fixed m/z windows, were introduced to address these issues ([Venable et al. 2004](https://pubmed.ncbi.nlm.nih.gov/15782151/), [Plumb et al. 2006](https://pubmed.ncbi.nlm.nih.gov/16755610/), [Distler et al. 2014](https://pubmed.ncbi.nlm.nih.gov/24336358/), [Moran et al. 2014](https://pubmed.ncbi.nlm.nih.gov/24129072/), [Pak et al. 2013](https://pubmed.ncbi.nlm.nih.gov/24006250/), [Geiger et al. 2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2953918/), [Panchaud et al. 2011](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3217585/), [Weisbrod et al. 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3319072/), [Carvalho et al. 2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2832823/), [Egertson et al. 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3881977/)). These methods have been shown to be more reproducible than DDA.

[Triqler](https://pubmed.ncbi.nlm.nih.gov/30482846/) is a novel software tool for protein quantification. It uses probabilistic graphical models to generate posterior distributions for fold changes between treatment groups, highlighting uncertainty rather than hiding it. Conventional (frequentist) methods use filters, thresholds and imputation to control the error rate and often ignore certain error sources. This project aims to benchmark Triqler against commonly used software for DIA protein quantification.

We have been supplied with data by Biognosys for this benchmarking project. The data set consists of 10 samples containing mixtures of Arabidopsis thaliana, Caenorhabditis elegans and Homo sapiens proteins. The concentration levels are known, and the proteins are quantified using Spectronaut. Theoretically, the results from Triqler should be more representative of the protein quantities, since no filters or imputation methods are used. However, previous attempts at showing this ([here](https://patruong.github.io/bayesProtQuant/)) have shown that the choice of imputation method can severely impact the results obtained by Spectronaut, making it look either much worse or much better by giving it an unfair disadvantage or advantage. One important aspect of this research is therefore how to make a fair comparison of Triqler and Spectronaut. Sub-questions relating to this aspect are "How do we make a fair imputation if we need to impute values?" and "How do we visualize the comparison in a meaningful and comprehensible way?".



## Problem
Triqler is a novel software tool that uses a Bayesian model for protein quantification. Previous efforts in DIA protein quantification have depended on spectral libraries for peptide and protein identification, as well as on the construction of pseudo-MS/MS spectra, which are similar to DDA spectra and therefore allow peptide and protein identification with conventional search-engine algorithms.

[read up more on DIA and fill in from Navarro et al. 2017, Kuharev et al. 2015 and Gotti et al. 2020].

The use of Bayesian modeling for protein quantification has not yet been shown to be better than existing methods, but the fact that Triqler handles errors in multiple steps in a more theoretically sound way than the most commonly used protein quantification pipelines suggests that it may be. A benchmark of Triqler is therefore needed to show its performance.

(Specifically, it is interesting to benchmark against DIA proteomics pipelines (such as directDIA, Spectronaut, DIA-Umpire etc.), as DIA produces more false positives, false negatives and complex spectra.)

## Preliminary Research Question
The research aims to answer the following question:
- Is Triqler a better alternative for protein quantification than existing methods?

This question will be answered by investigating the following sub-questions:
- How do we benchmark Triqler, a Bayesian model, against existing methods, such as Spectronaut?
- What performance metrics are relevant to benchmark against?
- How do we show the comparison in a fair manner?


## Limitations 
This project aims to benchmark Triqler against other DIA protein quantification workflows. 

## Previous Studies

### Data acquisition methods

A good description of the methods can be found in [Egertson et al. 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5127711/)

#### SWATH

How to acquire high-quality SWATH data; the link also contains references to DIA methods: [Schilling et al. 2017](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5669615/#R9)
#### MS<sup>E</sup>
#### Selected Reaction Monitoring (SRM)
#### Data dependent acquisition (DDA)

### Protein quantification using mass spectrometry
Label-free mass spectrometry (MS) is an increasingly important tool for researchers in the life sciences ([Zhu et al. 2009](https://www.hindawi.com/journals/bmri/2010/840518/)). In the last decade, developments in MS methodologies have made it possible to use MS for accurate protein quantification ([Ong et al. 2005](https://www.nature.com/articles/nchembio736)). Various software tools have been developed for processing raw MS data into quantified protein abundances ([Mueller et al. 2008](https://pubmed.ncbi.nlm.nih.gov/18173218/)), each of which has a unique set of algorithms for the different tasks of the MS data processing workflow. A systematic review of these tools has been conducted by [Välikangas et al. 2017](https://academic.oup.com/bib/article/19/6/1344/3859191).

### Label-free quantification


### Benchmarking of protein quantification.
[Mueller et al. 2008](https://pubmed.ncbi.nlm.nih.gov/18173218/)

[Välikangas et al. 2017](https://academic.oup.com/bib/article/19/6/1344/3859191)  


### Protein inference and why it is problematic.

Protein inference is the process of obtaining a list of identified proteins from peptide measurements ([Huang et al. 2012](https://academic.oup.com/bib/article/13/5/586/415393)). Several factors make obtaining a list of identified proteins challenging: (1) There are usually only a small number of peptides for each protein, because typically only the top-scoring peptide-spectrum matches (PSMs) are included in the set of peptide identifications, which causes problems for protein identification confidence ([Resing et al. 2004](https://pubmed.ncbi.nlm.nih.gov/15228325/)). (2) Peptides from the same protein are not equally likely to be identified in a standard proteomics experiment, i.e. there is a low signal-to-noise ratio ([Bihan et al. 2004](https://pubmed.ncbi.nlm.nih.gov/15595722/), [Kuster et al. 2005](https://pubmed.ncbi.nlm.nih.gov/15957003/), [Tang et al. 2006](https://pubmed.ncbi.nlm.nih.gov/16873510/)). (3) Many peptide sequences can be mapped to more than one protein in a database (so-called degenerate or shared peptides) ([Nesvizhskii et al. 2003](https://pubmed.ncbi.nlm.nih.gov/14632076/), [Nesvizhskii et al. 2005](https://pubmed.ncbi.nlm.nih.gov/16009968/)). (4) There is a variety of methods to estimate the false discovery rate (FDR) ([Li et al. 2012](https://pubmed.ncbi.nlm.nih.gov/23176300/)).
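To make point (3) concrete, the following is a minimal sketch of how shared peptides leave some proteins without unambiguous evidence. The peptide sequences, protein names and function names are hypothetical and only serve as an illustration; this is not how any particular inference tool works.

```python
from collections import defaultdict


def split_shared_peptides(peptide_to_proteins):
    """Split peptides into unique (one protein) and shared (several proteins)."""
    unique, shared = {}, {}
    for peptide, proteins in peptide_to_proteins.items():
        protein_set = set(proteins)
        if len(protein_set) == 1:
            unique[peptide] = next(iter(protein_set))
        else:
            shared[peptide] = sorted(protein_set)
    return unique, shared


def proteins_with_unique_evidence(unique):
    """Group unique peptides by the single protein they point to."""
    evidence = defaultdict(list)
    for peptide, protein in unique.items():
        evidence[protein].append(peptide)
    return dict(evidence)


# Hypothetical identifications: sequences and protein names are made up.
identified = {
    "LSEQVK": ["protA"],
    "AVDTLR": ["protA", "protB"],  # shared between two proteins
    "GFTESK": ["protC"],
}
unique, shared = split_shared_peptides(identified)
evidence = proteins_with_unique_evidence(unique)
print(shared)    # peptides that cannot be assigned to one protein
print(evidence)  # proteins supported by at least one unique peptide
```

In this toy example protB is only supported by a shared peptide, so whether it is reported depends entirely on how the inference method treats shared peptides.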

When conducting a shotgun proteomics experiment, we are often interested in the set of proteins that were present in a sample before proteolytic digestion. The most common method to control for errors is to use thresholds on the FDR. Because the measured entities, the spectra, are measurements of peptides rather than proteins, protein-level FDRs become complicated ([The et al. 2016](https://onlinelibrary.wiley.com/doi/full/10.1002/pmic.201500431)). [The et al. 2016](https://onlinelibrary.wiley.com/doi/full/10.1002/pmic.201500431) point out that there are two competing null hypotheses in current protein inference methods that are meaningful: 1) "false discoveries are protein with incorrect best scoring peptide inference" and 2) "false discoveries are absent proteins". These two null hypotheses are meaningful in the right context and can be accurately estimated. Besides these, there is a very diverse set of possible null hypotheses, which makes it very complicated to define what protein-level FDR means.


Cont to read [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3489551/]

Cont to read [https://pubmed.ncbi.nlm.nih.gov/23176300/].


### Protein quantification for DIA.
There are many studies of data-dependent acquisition (DDA) protein quantification ([Välikangas et al. 2017](https://academic.oup.com/bib/article/19/6/1344/3859191), [Mueller et al. 2008](https://pubmed.ncbi.nlm.nih.gov/18173218/)). However, in recent years the interest seems to have shifted towards data-independent acquisition (DIA) protein quantification ([Purvine et al. 2003](https://pubmed.ncbi.nlm.nih.gov/12833507/), [Venable et al. 2004](https://pubmed.ncbi.nlm.nih.gov/15782151/), [Law et al. 2013](https://pubmed.ncbi.nlm.nih.gov/24206228/), [Egertson et al. 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5127711/)), with in-depth evaluation studies such as [Navarro et al. 2017](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5120688/), [Kuharev et al. 2015](https://pubmed.ncbi.nlm.nih.gov/25545627/), [Gotti et al. 2020](https://www.biorxiv.org/content/10.1101/2020.11.03.365585v1.full) and [Zhang et al. 2020](https://onlinelibrary.wiley.com/doi/10.1002/pmic.201900276).

DDA selects precursor ions based on abundance and suffers from stochastic and irreproducible precursor ion selection ([Liu et al. 2004](https://pubmed.ncbi.nlm.nih.gov/15253663/), [Li et al. 2009](https://pubmed.ncbi.nlm.nih.gov/19294629/), [Amodei et al. 2019](https://link.springer.com/article/10.1007/s13361-018-2122-8)), under-sampling ([Michalski et al. 2011](https://pubmed.ncbi.nlm.nih.gov/21309581/)) and long instrument cycle times ([Li et al. 2009](https://pubmed.ncbi.nlm.nih.gov/19294629/)). DIA, in contrast, fragments all precursor ions within a fixed window, regardless of intensity, and yields comprehensive fragment-ion data. It therefore has the advantage of being reproducible, as well as offering the precision of selected reaction monitoring (SRM) ([Bruderer et al. 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4424408/)). However, wide isolation windows produce highly chimeric spectra, which limits the sensitivity and accuracy of quantification and identification ([Amodei et al. 2019](https://link.springer.com/article/10.1007/s13361-018-2122-8)). Several DIA mass spectrometric strategies have been implemented to handle these problems (SWATH-MS ([Gillet et al. 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3433915/)), HDMSE ([Geromanos et al. 2012](https://pubmed.ncbi.nlm.nih.gov/22811061/)) and AIF ([Geiger et al. 2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2953918/))). In addition to the mass spectrometric problems, the computational methods (raw data processing, protein database searching and statistical analysis of the quantitative data) critically impact the results of the analysis. [Navarro et al. 2017](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5120688/) developed a computational benchmarking framework for label-free quantitative proteomics, LFQbench, to analyze the performance of five DIA approaches: four peptide-centric query tools (OpenSWATH ([Röst et al. 2014](https://pubmed.ncbi.nlm.nih.gov/24727770/)), SWATH 2.0, Skyline ([MacLean et al. 2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2844992/), [Egertson et al. 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5127711/)) and Spectronaut ([Bruderer et al. 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4424408/))), which use MS/MS libraries to extract groups of signals that reliably represent a specific peptide, with statistical methods to discern true from false matches ([Pak et al. 2013](https://pubmed.ncbi.nlm.nih.gov/24006250/), [Röst et al. 2014](https://pubmed.ncbi.nlm.nih.gov/24727770/), [Reiter et al. 2011](https://pubmed.ncbi.nlm.nih.gov/21423193/)), and one data-centric approach (DIA-Umpire ([Tsou et al. 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4399776/))), which constructs pseudo-tandem MS spectra that can be identified and quantified using conventional database searching and protein-inference tools. It has been shown that the overlap of peptide and protein identifications provided by the four library-based tools is higher than what is typically achieved for MS/MS identifications between different DDA search engines ([Bell et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2785450/), [Shteynberg et al. 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3769318/), [Yuan et al. 2014](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4184451/)).



- Thought:
Should we not be using the same data set as [Kuharev et al. 2015](https://pubmed.ncbi.nlm.nih.gov/25545627/) and add Triqler to the results in [Navarro et al. 2017](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5120688/)?

What are assay libraries? (From context: "Here, discovery workflows such as DIA-Umpire may prove an important orthogonal source for the generation of assay libraries.")

#### Peptide-centric query tool for DIA
These are tools based on searching and scoring against an assay library (an empirical spectral library generated from DDA experiments). [Gillet et al. 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3433915/) showed that it is possible to analyze DIA data in a targeted fashion, and [Weisbrod et al. 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3319072/) showed that it is possible to identify peptides by searching peptide fragmentation patterns against DIA data.
 

CHECK THIS! (Analyze DIA in a targeted fashion: picking peaks as in DDA but on the SWATH window, vs. SRM/PRM targeted fashion)
CHECK THIS! (Including data set of spectral assays)
On spectral assays for targeted analysis of DIA data: [Midha et al. 2020](https://www.nature.com/articles/s41597-020-00724-7)

##### OpenSWATH
[Röst et al. 2014](https://www.nature.com/articles/nbt.2841) presented the OpenSWATH DIA analysis pipeline. It was developed to analyze SWATH-MS data ([Gillet et al. 2012](https://www.mcponline.org/content/11/6/O111.016717)), a method that uses fixed m/z windows and iterates through the full mass range. It has the following steps:

- Decoy generation. OpenSWATH takes an assay library (database of known peptides for a species) as input and generates decoys according to one of four methods: "shuffle", "pseudo-reverse", "reverse" and "shift". These decoys are appended to the target assay library for later classification and error rate estimation.
- Data conversion. OpenSWATH takes the SWATH-MS data and the assay library, and converts them to suitable formats.
- Normalization. Each run is aligned to a normalized retention-time scale, and outliers are removed using Chauvenet's criterion.
- Chromatogram extraction.
- Peak picking. The chromatogram peaks are picked using either Gaussian or Savitzky-Golay smoothing.
- Peak grouping. The peaks are grouped by identifying the most intense peak over all chromatograms and using it as a seed to create a peak for all the other chromatograms.
- Peak-group scoring. The peak groups are scored based on the elution profiles of the fragment ions, the fit to the expected retention time and fragment-ion intensities from the assay library, and the properties of the full MS2 spectrum at the chromatographic peak apex.
- Statistical analysis. The decoy assays are scored in the same way as the target assays; using these scores, FDRs can be estimated (for example using mProphet ([Reiter et al. 2011](https://www.nature.com/articles/nmeth.1584))).
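The decoy-based error estimation in the first and last steps can be illustrated with a minimal sketch. It assumes that "pseudo-reverse" means reversing the sequence while keeping the C-terminal residue in place, and it uses the simple decoys-over-targets ratio as the FDR estimate. Both are simplifying assumptions for illustration; mProphet and OpenSWATH use more elaborate scoring and estimation, and all function names and scores below are hypothetical.

```python
def pseudo_reverse(sequence):
    """Reverse a peptide sequence but keep the last residue in place (assumed definition)."""
    return sequence[:-1][::-1] + sequence[-1]


def decoy_fdr(scored_assays, threshold):
    """Estimate the FDR among targets scoring >= threshold as (#decoys / #targets)."""
    targets = sum(1 for score, is_decoy in scored_assays
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in scored_assays
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


# Hypothetical peak-group scores as (score, is_decoy) pairs.
scored = [
    (0.9, False), (0.8, False), (0.7, True), (0.6, False),
    (0.5, False), (0.4, True), (0.3, True),
]
print(pseudo_reverse("PEPTIDEK"))
print(decoy_fdr(scored, threshold=0.5))
print(decoy_fdr(scored, threshold=0.8))
```

The ratio is only meaningful when targets and decoys are equally numerous and scored identically, which is the reason the decoys are appended to the library before scoring.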

(I guess the unique component in this algorithm is how it groups peaks to make them search-able against the assay libraries.)

##### Skyline
[Egertson et al. 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5127711/) describe how peptides can be extracted from DIA data using targeted chromatogram extraction, similarly to SRM. Unlike SRM, DIA is not constrained to a pre-determined set of peptides. In this method, a spectral library is generated using DDA to aid in the interpretation of the DIA data, and chromatograms are extracted from the DIA data for all peptides in the library. Pre-existing spectral libraries can also be used.

The Skyline platform ([MacLean et al. 2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2844992/pdf/btq054.pdf)) has commonly been used for SRM data acquisition and interpretation, but can also be used for interpreting DIA data. The procedure is the following:

- Generate quality control instrument method. This step basically sets the background proteome, the enzymes used for digestion and the maximum number of allowed missed cleavages for peptides, and specifies whether or not structural or isotope modifications are allowed. (I GUESS THIS STEP CHECKS WHICH SPECTRA ARE ALLOWED OR CONSIDERED TRASH. BASICALLY THE DATA USED FOR NORMALIZATION AND DATA CLEANSING PROCEDURE? CHECK THIS)
- Generate DIA instrument method. This step focuses on window settings and filtering of the chromatograms.
- Generate DDA instrument method. This step focuses on the settings for generating the spectral library.
- Run quality control samples, to get the samples to normalize against.
- Run samples. At least one DDA set should be acquired to generate the spectral library.
- Build spectral library, by searching the DDA data using automated database search algorithms (such as SEQUEST or Mascot).
- Analyze the DIA data using the spectral library.

(ASK ABOUT THIS. SKYLINE WAS HARD TO UNDERSTAND. KEEP WRITING AND REVISIT AFTER MEETING FRIDAY)

##### SWATH 2.0 (PeakView)


[Lambert et al. 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3882083/)

A bit about SWATH 2.0 and Spectronaut DIA [Midha et al. 2020](https://www.nature.com/articles/s41467-020-18901-y)

##### Spectronaut
[Bruderer et al. 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4424408/) developed Spectronaut, which uses a novel method called hyper reaction monitoring (HRM). HRM uses a retention-time-normalized (iRT) spectral library for targeted data extraction.


#### Data centric DIA 

#### DIA-Umpire
[Tsou et al. 2015](https://www.nature.com/articles/nmeth.3255#Fig2) developed a computational workflow for DIA data that detects precursor and fragment chromatographic features and assembles them into pseudo-tandem MS spectra, which can be identified with conventional database-searching algorithms and protein-inference tools. This allows untargeted analysis of DIA data without a spectral library. The workflow consists of the following steps:

- Precursor and fragment-ion 2D peak detection. Peaks are required to exist in at least three m/z vectors, and the m/z values are calculated as the weighted average of the detected m/z values in a retention-time span. A smoothing function is applied to the resulting weighted average.
- Precursor-fragment grouping. The algorithm uses the relationship between a precursor ion and its fragments (so-called coelution). It calculates the correlation coefficients and the retention-time differences of the LC elution peak apexes between all detected precursor ions and all possible fragment ions, to better connect precursor ions with their most likely fragment ions.
- Generation of pseudo-MS/MS spectra. The fragment ions to keep are selected based on ranking functions, and their intensities are taken as the LC apex intensity of the fragment-ion peak curve, weighted by the square of the correlation with the precursor peak curve.
- Peptide and protein identification using the pseudo-MS/MS spectra. The pseudo-MS/MS spectra have similar characteristics to DDA spectra. Therefore, DDA search engines are used to identify the peptides and proteins.
- Optional targeted identification to increase coverage across multiple samples. More details in [Tsou et al. 2015](https://www.nature.com/articles/nmeth.3255#ref-CR9).
- Quantification in DIA data. The smoothed MS1 monoisotopic peak is used to determine the MS1 precursor-ion intensity. The MS2 fragment-ion intensity is the unadjusted LC apex intensity of the smoothed fragment signal. Protein quantities are computed as the sum of the intensities of all matched fragments of all identified peptide ions from that protein.
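The precursor-fragment grouping and the summed protein quantity described above can be sketched as follows. This is a simplified illustration, not the DIA-Umpire implementation: it uses only the Pearson correlation of elution profiles (the retention-time apex difference and the ranking functions are left out), and the `min_corr` threshold, the function names and all intensities are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long intensity profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def group_fragments(precursor_profile, fragment_profiles, min_corr=0.9):
    """Keep fragments whose elution profile correlates with the precursor's."""
    kept = {}
    for name, profile in fragment_profiles.items():
        r = pearson(precursor_profile, profile)
        if r >= min_corr:  # min_corr is a hypothetical cut-off
            kept[name] = r
    return kept


def protein_quantity(fragment_intensities_per_peptide):
    """Sum the intensities of all matched fragments of all peptide ions of a protein."""
    return sum(sum(v) for v in fragment_intensities_per_peptide.values())


# Hypothetical elution profiles sampled at the same retention times.
precursor = [0.0, 2.0, 8.0, 10.0, 8.0, 2.0, 0.0]
fragments = {
    "y4": [0.0, 1.0, 4.0, 5.0, 4.0, 1.0, 0.0],  # same shape, half the intensity
    "b3": [5.0, 4.0, 1.0, 0.0, 1.0, 4.0, 5.0],  # opposite shape, not coeluting
}
grouped = group_fragments(precursor, fragments)
quantity = protein_quantity({"pepA": [5.0, 3.0], "pepB": [2.0]})
print(grouped, quantity)
```

Only the coeluting fragment survives the grouping; a chimeric window would contain many such unrelated fragments, which is why the correlation step matters before building pseudo-MS/MS spectra.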


### Experimental design for DIA methods
Key considerations when designing an optimal DIA method are sample complexity, chromatography conditions, the mass spectrometric equipment, and the desired sensitivity ([Egertson et al. 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5127711/)).


### Missing values and Bayesian missing data. 
Missing values are a frequent problem in life science studies. Incomplete data can cause a substantial amount of bias and seriously compromise inferences from studies if they are not handled appropriately ([Jakobsen et al. 2017](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-017-0442-1)). Besides the common single imputation methods (such as min, max, mean and median imputation), which cause bias ([Dziura et al. 2013](https://pubmed.ncbi.nlm.nih.gov/24058309/)), various methods have been developed to tackle this problem, such as imputation models ([Barnard et al. 1999](https://journals.sagepub.com/doi/10.1177/096228029900800103)) and multiple imputation methods ([Jørgensen et al. 2014](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0111964)). The mechanism causing the data to be missing is another aspect that should be considered when handling missing values, if possible ([Jakobsen et al. 2017](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-017-0442-1#ref-CR19)). Bayesian methods have received much attention in the literature for handling missing values in various contexts, with or without taking different missingness mechanisms into account ([Ma et al. 2018](https://www.researchgate.net/publication/324510645_Bayesian_methods_for_dealing_with_missing_data_problems#pfe)). In Bayesian models, the missingness can be modelled from the observed quantities, and the missing values can be marginalized over. Handling missing values with Bayesian models is therefore very natural and does not require imputation, as in [The et al. 2018](https://www.biorxiv.org/content/10.1101/357285v2). However, incorrectly specified missing-data models also cause problems for Bayesian models ([Mason J. 2010](https://spiral.imperial.ac.uk/handle/10044/1/5498)), and correctly specifying the missing-data model greatly improves the overall model fit ([Mason et al. 2010](http://eprints.ncrm.ac.uk/1691/1/InsightsSubmitted.pdf)). In [The et al. 2018](https://www.biorxiv.org/content/10.1101/357285v2), a probability distribution is assigned over the possible values of the missing value, and this distribution is marginalized over when inferring the protein's quantity. This results in a protein quantity that incorporates the uncertainty, which manifests as a larger variance than for proteins without missing values.
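A small numerical illustration of why single imputation can hide uncertainty: filling missing replicates with the observed mean leaves the mean unchanged but shrinks the estimated variance, the opposite of the larger variance a marginalizing model reports. The values and the `mean_impute` helper below are made up for illustration and are unrelated to Triqler's actual model.

```python
import statistics


def mean_impute(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical protein quantities in five replicates, two of them missing.
replicates = [10.0, 12.0, 14.0, None, None]
observed = [v for v in replicates if v is not None]
imputed = mean_impute(replicates)

var_observed = statistics.variance(observed)  # sample variance, observed only
var_imputed = statistics.variance(imputed)    # sample variance after imputation
print(var_observed, var_imputed)
```

Here the sample variance halves after imputation, even though no new information was added.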


### Missing values in proteomics.

## Data
The data for this benchmark study is provided by Biognosys. The data set is generated using (?). It consists of ten samples of mixtures of various ratios of proteins from C. elegans, H. sapiens and A. thaliana. There are five replicates for each sample.


| Species     | S01 | S02  | S03   | S04    | S05   | S06    | S07   | S08   | S09   | S10 |
|-------------|-----|------|-------|--------|-------|--------|-------|-------|-------|-----|
| A. thaliana | 0.5 | 0.5  | 0.5   | 0.5    | 0.5   | 0.5    | 0.5   | 0.5   | 0.5   | 0.5 |
| C. elegans  | 0.5 | 0.25 | 0.125 | 0.0625 | 0.031 | 0.0155 | 0.008 | 0.004 | 0.002 | 0   |
| H. sapiens  | 0   | 0.25 | 0.375 | 0.4375 | 0.469 | 0.4845 | 0.492 | 0.496 | 0.498 | 0.5 |

<center><strong>Table 1</strong> . Protein ratios of the ten samples {S01, S02, ..., S10}.</center>
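Since the mixing ratios are known, the expected fold change between any two samples follows directly from Table 1. The sketch below computes the expected log2 fold change for a species between two samples. The function and variable names are hypothetical, and returning `None` when a species is absent from one of the samples is a design choice made here, not something prescribed by the data set.

```python
import math

SAMPLES = ["S01", "S02", "S03", "S04", "S05", "S06", "S07", "S08", "S09", "S10"]

# Protein ratios copied from Table 1.
FRACTIONS = {
    "A. thaliana": [0.5] * 10,
    "C. elegans": [0.5, 0.25, 0.125, 0.0625, 0.031, 0.0155, 0.008, 0.004, 0.002, 0.0],
    "H. sapiens": [0.0, 0.25, 0.375, 0.4375, 0.469, 0.4845, 0.492, 0.496, 0.498, 0.5],
}


def expected_log2_fold_change(species, sample_a, sample_b):
    """Expected log2(a / b) for a species, or None if it is absent from either sample."""
    a = FRACTIONS[species][SAMPLES.index(sample_a)]
    b = FRACTIONS[species][SAMPLES.index(sample_b)]
    if a == 0 or b == 0:
        return None
    return math.log2(a / b)


print(expected_log2_fold_change("C. elegans", "S01", "S02"))
print(expected_log2_fold_change("A. thaliana", "S01", "S10"))
print(expected_log2_fold_change("H. sapiens", "S01", "S02"))
```

A. thaliana is constant across samples and can therefore serve as a background for which no differential abundance is expected.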

The peak areas of the MS2 intensities for each protein are given, as well as protein quantities based on Spectronaut's protein quantification method. These protein quantities are based on the three most intense peptides and on the reproducibility of identification (search scores). Very low-intensity MS2 peaks are set to the value 1.0. These very low-intensity peaks arise from small noise peaks and from local normalization effects produced by Spectronaut. In both cases, they are to be considered noise, and they arise because the dynamic ranges of MS1 and MS2 are not the same.




## Method

MS2 peak intensities and protein quantities from Spectronaut are provided by Biognosys. The MS2 peaks, search scores and protein identifications are used by Triqler to perform another protein quantification. The results from Triqler and Spectronaut will then be compared.

Currently, the following analyses are considered interesting to perform.

- Number of proteins quantified by each method, including:
  - Number of proteins quantified for each sample.
  - Number of proteins quantified for each species.
  - Number of proteins quantified for each sample and species.
  - These protein quantifications should be performed without missing value imputation for Spectronaut.
- Performing exploratory data analysis, including:
  - Explore the distribution of protein quantities.
    - Box plot.
    - Distribution plot.
    - Scatter plot (think about what we see here?).
- Analyzing the effect of missing values on protein quantities for Spectronaut:
  - Trying different imputation methods.
  - Exploring distributions using different imputation methods.
  - Checking the coefficient of variation to see the stability of the imputation methods.
- Analyzing protein quantity variation to see the stability of each method. This is done by:
  - Visualizing protein quantities for a random subsample of proteins for each method.
  - Performing differential expression analysis between samples for each species.
  - Checking the coefficient of variation.
- Data thresholding and cleaning.
  - Removing proteins for Spectronaut which have more than x replicates missing.
  - Removing proteins for Spectronaut and Triqler which have a high coefficient of variation.
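The coefficient of variation and the two cleaning rules in the list above could be sketched as follows. This is a minimal illustration under stated assumptions: missing replicates are represented as `None`, the thresholds `max_missing` (the "x" above) and `max_cv` are placeholders to be chosen later, and the protein names and quantities are made up.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, ignoring missing values."""
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        return None
    mean = statistics.mean(observed)
    if mean == 0:
        return None
    return statistics.stdev(observed) / mean


def filter_proteins(quantities, max_missing, max_cv):
    """Keep proteins with at most max_missing missing replicates and CV <= max_cv."""
    kept = {}
    for protein, replicates in quantities.items():
        n_missing = sum(1 for v in replicates if v is None)
        if n_missing > max_missing:
            continue
        cv = coefficient_of_variation(replicates)
        if cv is None or cv > max_cv:
            continue
        kept[protein] = cv
    return kept


# Hypothetical quantities for three proteins in five replicates of one sample.
quantities = {
    "protA": [100.0, 110.0, 90.0, 100.0, 100.0],  # stable
    "protB": [100.0, None, None, None, 120.0],    # too many missing replicates
    "protC": [10.0, 200.0, 50.0, 400.0, 5.0],     # high variation
}
kept = filter_proteins(quantities, max_missing=2, max_cv=0.5)
print(kept)
```

Applying the same function to imputed and non-imputed Spectronaut quantities would give one way to check how much an imputation method changes the apparent stability.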

Notes for future:
(Also, since missing value imputation can have a 
   
Missing not at random and missing at random...)

Something about missing value type in description in our data?.


#### ADD

======

To add:

Different methods for protein inference. You are using whole-cell lysate, which means you will not get a realistic picture of which protein inference method is the right one. See e.g. https://onlinelibrary.wiley.com/doi/full/10.1002/pmic.201500431; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6474350/

There are some benchmarks between DIA methods, e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5120688/

Those studies were criticized for having difficulty drawing the line between methods that identified different numbers of proteins. It is easier to be quantitatively correct if you only identify a few confident proteins, but this is not apparent from the article. Stefan Tenzer told me that he was subjected to a good deal of pressure while they were writing, so the article should be seen as a political compromise between different labs.


