adding overhangs for synthetic peptides and QconCATs #5

pavel-shliaha · 2014-04-29T19:03:39Z

It is now quite clear that 100% digestion efficiency with trypsin should not be assumed in proteomics workflows. Inefficient trypsin digestion also poses a very serious problem in absolute quantitation workflows that use isotopically labelled standards.

Isotopic standards are currently used as follows: the peptides to be quantified are synthesised in labelled form. A known amount of the labelled peptide is then spiked into the sample prior to its analysis by LC-MS. After acquisition, the amount of unlabelled peptide (and hence of its protein of origin) is computed as follows:

quantity_unlabelled = signal_unlabelled/signal_labelled * quantity_labelled
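As a minimal sketch, the formula above can be written as a small R function. The function name and all numbers are made up for illustration only:

```r
# Hypothetical helper implementing the ratio formula above.
quantify_unlabelled <- function(signal_unlabelled, signal_labelled, quantity_labelled) {
  signal_unlabelled / signal_labelled * quantity_labelled
}

# Made-up example: 50 fmol of labelled standard spiked in, and the unlabelled
# signal is 40% of the labelled signal.
quantify_unlabelled(signal_unlabelled = 2e5,
                    signal_labelled   = 5e5,
                    quantity_labelled = 50)
# 20 (fmol)
```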

Consider quantitation of the following peptide: VTTYFPSVNLR. Below is a piece of protein sequence it originates from:

GNIR.VTTYFPSVNLR.KSSQK

Note that to release the peptide from the protein, cleavage must occur after its C-terminal R. However, this R is followed by K, which is expected to result in two dead-end products:

VTTYFPSVNLR and VTTYFPSVNLRK

As a result, the amount of the VTTYFPSVNLR peptide is no longer proportional to the protein amount, and if absolute quantitation is performed using this peptide only, the amount of protein will be underestimated (a specific example of this happening is given in ref. 1).
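To make the direction of the bias concrete, here is a rough sketch. The function name and the 30% figure are hypothetical, not measured values:

```r
# Hypothetical illustration: if a fraction of the protein molecules ends up as
# the dead-end product VTTYFPSVNLRK, only the remainder is seen as VTTYFPSVNLR.
estimate_from_peptide <- function(true_protein, fraction_missed) {
  true_protein * (1 - fraction_missed)
}

# 100 fmol of protein, 30% of it (made-up number) released as VTTYFPSVNLRK
estimate_from_peptide(true_protein = 100, fraction_missed = 0.3)
# 70 (fmol), i.e. the protein amount is underestimated by 30%
```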

The most obvious approach to counteract the problem is to ignore peptides like this. However, this is not usually possible, given that only a limited number of peptides suitable for quantitation is available per protein. Thus the best solution is to mimic the cleavage site by adding 3 amino acids before and after the peptide.

However consider the following peptide:

QNGRLR.HFTIPSHR.ARAGR

If we add RLR to the N-terminus of the peptide sequence, the cleavage site again does not mimic what happens in the protein, since cleavage after the first R in the protein yields a dead-end product:

LR.HFTIPSHR

Hence the overhang needs to be extended to include the 3 aa before RLR. However, this extension of overhangs is not always possible, since there is a limit to the length of peptide that can be synthesised (usually no longer than 20 aa). Additional parameters therefore need to be passed to the model to determine the optimal compromise.
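A rough sketch of how such a case could be detected. `has_extra_site` is a hypothetical helper, not part of cleaver; it only checks whether a 3 aa N-terminal overhang contains a K/R besides the one directly preceding the peptide:

```r
# Hypothetical helper: TRUE if the n-residue N-terminal overhang contains an
# additional K/R besides the residue adjacent to the peptide.
has_extra_site <- function(flank_before, n = 3) {
  aa       <- strsplit(flank_before, split = "")[[1]]
  overhang <- tail(aa, n)          # the n residues closest to the peptide
  inner    <- head(overhang, -1)   # drop the residue adjacent to the peptide
  any(inner %in% c("K", "R"))
}

has_extra_site("GNIR")    # FALSE: the overhang NIR has no additional K/R
has_extra_site("QNGRLR")  # TRUE: the overhang RLR starts with an extra R
```

When this returns TRUE, the overhang would have to be extended further (if the maximum peptide length allows it).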

I will write out a detailed outline of the workflow if this functionality is to be added to cleaver.

references:

1. Kito, Keiji, et al. "A synthetic protein approach toward accurate mass spectrometric quantification of component stoichiometry of multiprotein complexes." Journal of Proteome Research 6.2 (2007): 792-800.

pavel-shliaha · 2014-05-01T17:33:10Z

My current code:

addOverhangs <- function (pep_seq, proteins, maxLength,
                          preferN = FALSE, preferC = FALSE){

  proteinSeq <- grep (pep_seq, proteins, value = TRUE)

  # the peptide must map to exactly one protein sequence:
  # zero matches -> "not_found", several matches -> "non_proteotypic"
  if (length (proteinSeq) != 1){
    resultList <- list ("AA_before_20" = NA, "AA_after_20"  = NA, 
                        "spikeTide" = pep_seq,
                        "result" = if (length (proteinSeq) > 1) "non_proteotypic" else "not_found")
    return (resultList)
  } 

  proteinSeqAA <- strsplit (proteinSeq, split = "")[[1]]


  pepPosition <- regexpr (pep_seq, proteinSeq)[1]
  pepLength   <- nchar (pep_seq)

  ###############################################################################
  #  add 20 aa before

  # first index of the window, clipped to the protein N-terminus
  aaStart    <- max (pepPosition - 20, 1)

  # seq_len () gives an empty index if the peptide starts at position 1
  AA_before_20 <- paste0 (proteinSeqAA[aaStart - 1 + seq_len (pepPosition - aaStart)],  collapse = "")


  ###############################################################################
  # add 20 aa after

  aaFirstAfter <- pepPosition + pepLength
  # last index of the window: 20 residues, clipped to the protein C-terminus
  aaEnd        <- min (aaFirstAfter + 19, nchar (proteinSeq))

  # seq_len () gives an empty index if the peptide ends at the protein C-terminus
  AA_after_20 <- paste0 (proteinSeqAA[aaFirstAfter - 1 + seq_len (aaEnd - aaFirstAfter + 1)],  collapse = "")

  # apply the following rules:

  ##############################################################
  # 1) for the preceding AA

  aaBefore <- strsplit (AA_before_20, split = "")[[1]]
  aaBasic  <- which (aaBefore == "K" | aaBefore == "R")

  if (length (aaBasic) > 1){
    aaBasic2 <- c (0, aaBasic[1:(length (aaBasic) - 1)]) 
    firstGoodAA   <- which ((aaBasic - aaBasic2 > 3))

    if (length (firstGoodAA) > 0){
      firstGoodAA <- aaBasic[max (firstGoodAA)]
      aaToAddBefore <- paste (aaBefore[(firstGoodAA - 3) : length (aaBefore)] , collapse = "")
    } else {
      aaToAddBefore <- tail (aaBefore, 4)
    }

  } else {
    aaToAddBefore <- tail (aaBefore, 4)
  }

  overhang_before <- paste (aaToAddBefore, collapse = "")

  ############################################################################
  # 2) for the following AA

  aaAfter <- strsplit (AA_after_20, split = "")[[1]]
  aaBasic  <- c (0,  which (aaAfter == "K" | aaAfter == "R"))

  if (length (aaBasic) > 1) {
    aaBasic2 <- c (aaBasic[2:length (aaBasic)], length (aaAfter))
    firstGoodAA   <- which (aaBasic2 - aaBasic > 2)

    if (length ( firstGoodAA) > 0){
      firstGoodAA   <- aaBasic[min (firstGoodAA)]
      aaToAddAfter  <- aaAfter[1: (firstGoodAA + 3)]
    } else{
      aaToAddAfter <- head (aaAfter, 3)
    }

  } else {
    aaToAddAfter <- head (aaAfter, 3)
  }

  overhang_after <- paste (aaToAddAfter, collapse = "")

  ############################################################################
  # add overhangs

  length_with_overhangs <- sum (nchar (overhang_before), nchar (pep_seq),nchar (overhang_after))

  # option 1: adding full overhangs
  if (length_with_overhangs <= maxLength ){
    spikeTide <-  paste (overhang_before, pep_seq, overhang_after , sep = ".")
    result <- "complete_overhangs"
  } 

  # option 2: shortening the preceding overhang (the succeeding overhang is at most 3 amino acids long)
  if (length_with_overhangs > maxLength &&
      nchar (pep_seq) + 7   <= maxLength &&
      nchar (overhang_before) > 4 &&  nchar (overhang_after) < 4 ){

    aaAllowedBefore <- maxLength - nchar (pep_seq) - nchar (overhang_after)
    aaBefore <- strsplit (overhang_before, split = "")[[1]]
    aaBefore <- aaBefore[(length (aaBefore) - aaAllowedBefore + 1) :  length (aaBefore)]
    new_overhang_before <- paste (aaBefore, collapse = "")
    spikeTide <-  paste (new_overhang_before, pep_seq, overhang_after , sep = ".")
    result    <- "N_overhang_shortened"
  } 

  # option 3: shortening the succeeding overhang (the preceding overhang is at most 4 amino acids long)
  if (length_with_overhangs > maxLength &&
      nchar (pep_seq) + 7   <= maxLength &&
      nchar (overhang_before) < 5 &&  nchar (overhang_after) > 3 ){

    aaAllowedAfter <- maxLength - nchar (pep_seq) - nchar (overhang_before)
    aaAfter <- strsplit (overhang_after, split = "")[[1]]
    aaAfter <- aaAfter[1 :aaAllowedAfter]
    new_overhang_after <- paste (aaAfter, collapse = "")
    spikeTide <-  paste (overhang_before, pep_seq, new_overhang_after , sep = ".")
    result    <- "C_overhang_shortened"
  } 

  # option 4: shortening both overhangs, if both need to be shortened
  if (length_with_overhangs >  maxLength &&
      nchar (pep_seq) + 7   <= maxLength &&
      nchar (overhang_before) > 4 &&  nchar (overhang_after) > 3 ){

    new_overhang_before <- paste0 (tail (strsplit (overhang_before, split = "")[[1]] , 4), collapse = "")
    new_overhang_after  <- paste0 (head (strsplit (overhang_after, split = "")[[1]] , 3), collapse = "")

    spikeTide <-  paste (new_overhang_before, pep_seq, new_overhang_after, sep = ".")
    result    <- "both_overhangs_shortened"
  } 


  # option 5: add a single overhang
  # important: do not add fewer than 4 amino acids on the N-terminus
  # or fewer than 3 amino acids on the C-terminus
  if (nchar (pep_seq) + 7 > maxLength){

    numAAToAdd <- maxLength - nchar (pep_seq)

    # fallback so that spikeTide and result are always defined: if the full
    # overhangs did not fit and no single overhang is added below, the bare
    # peptide is returned ("no_overhangs" is a placeholder label)
    if (length_with_overhangs > maxLength){
      spikeTide <- pep_seq
      result    <- "no_overhangs"
    }

    # if user wants overhang on N-terminus
    if (preferN && numAAToAdd >= 4) {
      if (nchar (overhang_before) <= numAAToAdd){
        spikeTide <-  paste (overhang_before, pep_seq, sep = ".")
        result    <- "N_overhang_only"
      } else {
        # keep the numAAToAdd residues closest to the peptide
        aaBefore <- strsplit (overhang_before, split = "")[[1]]
        new_overhang_before <- paste (tail (aaBefore, numAAToAdd), collapse = "")
        spikeTide <-  paste (new_overhang_before, pep_seq, sep = ".")
        result    <- "N_overhang_only_shortened"
      }
    }

    # if user wants overhang on C-terminus
    if ((preferC && numAAToAdd >= 3) || numAAToAdd == 3) {
      if (nchar (overhang_after) <= numAAToAdd){
        spikeTide <-  paste (pep_seq, overhang_after,  sep = ".")
        result    <- "C_overhang_only"
      } else {
        # keep the numAAToAdd residues closest to the peptide
        aaAfter <- strsplit (overhang_after, split = "")[[1]]
        new_overhang_after <- paste (head (aaAfter, numAAToAdd), collapse = "")
        spikeTide <-  paste (pep_seq, new_overhang_after , sep = ".")
        result    <- "C_overhang_only_shortened"
      }
    }

  } 


  # return the results

  resultList <- list ("AA_before_20" = AA_before_20,
                      "AA_after_20"  = AA_after_20, 
                      "spikeTide" = spikeTide, 
                      "result" = result)

  return (resultList)
}
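For comparison, a compact sketch of just the flank-extraction step using `substring ()`, which returns an empty string when the window is empty and so needs no special cases at the protein termini. `getFlanks` and the toy protein below (built from the fragment GNIR.VTTYFPSVNLR.KSSQK quoted earlier) are illustrative assumptions, not part of the function above:

```r
# Hypothetical helper: up to n residues before and after the first occurrence
# of pep_seq in a single protein sequence.
getFlanks <- function(pep_seq, proteinSeq, n = 20) {
  pos <- as.integer(regexpr(pep_seq, proteinSeq, fixed = TRUE))
  if (pos < 0) return(NULL)                       # peptide not found
  firstAfter <- pos + nchar(pep_seq)
  list(before = substring(proteinSeq, max(1, pos - n), pos - 1),
       after  = substring(proteinSeq, firstAfter,
                          min(nchar(proteinSeq), firstAfter + n - 1)))
}

# toy protein made only of the fragment shown in the first comment
getFlanks("VTTYFPSVNLR", "GNIRVTTYFPSVNLRKSSQK")
# $before is "GNIR", $after is "KSSQK"
```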

pavel-shliaha · 2014-05-02T15:56:06Z

A couple more comments:

The output: I believe the user might want the following:

the new sequence: YDSKVNQADNLIEVGKGPEK
the new sequence where the cleavage sites are shown as dots YDSK.VNQADNLIEVGK.GPEK
the complete overhang
the suggested overhang (might not be the same as complete if shortened)
result spelled out: e.g. "complete_overhangs" or "C_overhang_only"
20 amino acids before and 20 amino acids after for user to be able to examine how overhangs were created
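A possible shape for the first two items of that output, as a sketch only. `formatSpikeTide` and the element names are hypothetical:

```r
# Hypothetical helper: build the plain and the dotted sequence from the two
# overhangs and the peptide.
formatSpikeTide <- function(overhang_before, pep_seq, overhang_after) {
  list(sequence        = paste0(overhang_before, pep_seq, overhang_after),
       sequence_dotted = paste(overhang_before, pep_seq, overhang_after, sep = "."))
}

formatSpikeTide("YDSK", "VNQADNLIEVGK", "GPEK")
# $sequence is "YDSKVNQADNLIEVGKGPEK"
# $sequence_dotted is "YDSK.VNQADNLIEVGK.GPEK"
```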

An example table (with peptide sequences and output) is in:

"data:\RAW\pvs22_QTOF_DATA_data3\data_for_synapter_2.0\cleaver_overhangs"

pavel-shliaha · 2014-05-02T16:00:10Z

Sometimes a company will require the synthesised peptide to end with a certain amino acid (JPT enforces K|R on the C-terminus). There should be an argument for this, e.g. end = "K". Note that this enforced AA is part of the peptide being ordered, hence it should be counted against the maximum allowed peptide sequence length.
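A minimal sketch of how the length budget could account for this, assuming the enforced residue is simply appended to the ordered sequence. `overhangBudget` is a hypothetical name, and `end` mirrors the argument proposed above:

```r
# Hypothetical helper: number of residues left for both overhangs once the
# peptide and the enforced C-terminal residue(s) are accounted for.
overhangBudget <- function(pep_seq, maxLength = 20, end = "") {
  maxLength - nchar(pep_seq) - nchar(end)
}

overhangBudget("VNQADNLIEVGK", maxLength = 20)             # 8
overhangBudget("VNQADNLIEVGK", maxLength = 20, end = "K")  # 7
```

With end = "K" only 7 residues remain, so the 4 + 4 overhangs of the YDSK.VNQADNLIEVGK.GPEK example would no longer fit into 20 aa.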

sgibb · 2015-01-03T20:55:57Z

Closed via lgatto/Pbase#6.

sgibb added a commit to lgatto/Pbase that referenced this issue Jul 27, 2014

first implementation of an algorithm to suggest heavy labeled peptides;

f889b85

see sgibb/cleaver#5 for details

sgibb mentioned this issue Jul 27, 2014

Calculate heavy labeled peptides lgatto/Pbase#6

Merged

sgibb closed this as completed Jan 3, 2015
