adding overhangs for synthetic peptides and QconCATs #5

Closed
pavel-shliaha opened this issue Apr 29, 2014 · 4 comments

@pavel-shliaha

It is quite clear now that 100% digestion efficiency with trypsin should not be assumed in proteomics workflows. Inefficient trypsin digestion also poses a very serious problem in absolute quantitation workflows that use isotopically labelled standards.

The way isotopic standards are currently used is that the peptides to be quantified are synthesised in labelled form. A known amount of the labelled peptide is then spiked into the sample prior to its analysis by LC-MS. After the acquisition, the amount of unlabelled peptide (and hence of its protein of origin) is computed as follows:

quantity_unlabelled = signal_unlabelled/signal_labelled * quantity_labelled
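
As a minimal sketch of the formula above in R (the helper name and the numbers are made up for illustration and do not come from any package):

```r
# Hypothetical helper; argument names follow the formula above.
quantify_unlabelled <- function(signal_unlabelled, signal_labelled,
                                quantity_labelled) {
  signal_unlabelled / signal_labelled * quantity_labelled
}

# Illustrative values only: 50 fmol of labelled standard spiked in,
# unlabelled signal twice as intense as the labelled one.
quantify_unlabelled(signal_unlabelled = 2e6,
                    signal_labelled   = 1e6,
                    quantity_labelled = 50)
```

The estimate is only as good as the assumption that all of the protein ends up as the measured peptide.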

Consider quantitation of the following peptide: VTTYFPSVNLR. Below is a piece of protein sequence it originates from:

GNIR.VTTYFPSVNLR.KSSQK

Note that to get the peptide out of the protein, digestion should occur after R. However, this R is followed by K, which is expected to result in two dead-end products:

VTTYFPSVNLR and VTTYFPSVNLRK

As a result, the amount of the VTTYFPSVNLR peptide is no longer proportional to the protein amount, and if absolute quantitation is performed using this peptide only, the amount of protein will be underestimated (a specific example of this happening is given in ref. 1).
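
The two products can be reproduced with a small in-silico digest. This is only a sketch in base R under a simplified rule (cleave after every K or R, no proline exception); `digest` is a hypothetical helper, not the cleaver API:

```r
# Simplified trypsin rule (assumption): cleave after every K or R.
digest <- function(protein, missed = 0) {
  sites  <- as.integer(gregexpr("[KR]", protein)[[1]])
  # drop the "no match" value (-1) and a site at the very end of the protein
  sites  <- sites[sites > 0 & sites < nchar(protein)]
  starts <- c(1, sites + 1)
  ends   <- c(sites, nchar(protein))
  n      <- length(starts)
  peptides <- character(0)
  for (m in 0:missed) {
    if (m >= n) break
    i <- seq_len(n - m)
    # join m + 1 neighbouring fragments, i.e. m missed cleavages
    peptides <- c(peptides, substring(protein, starts[i], ends[i + m]))
  }
  peptides
}

products <- digest("GNIRVTTYFPSVNLRKSSQK", missed = 1)
c("VTTYFPSVNLR", "VTTYFPSVNLRK") %in% products
```

Both forms appear among the products once one missed cleavage is allowed, which is why the signal of VTTYFPSVNLR alone does not account for all of the protein.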

The most obvious approach to counteract the problem is to ignore peptides like this. However, this is usually not possible, given that only a limited number of peptides suitable for quantitation is available for each protein. Thus the best solution is to mimic the cleavage site by adding 3 amino acids before and after the peptide.
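
A rough sketch of adding fixed-length flanks, assuming the peptide occurs once in the protein; `add_flanks` and its argument `n` are hypothetical names used only for illustration:

```r
# Hypothetical helper: add up to n residues on each side of the peptide,
# taken from the protein sequence, with dots marking the cleavage sites.
add_flanks <- function(pep_seq, protein, n = 3) {
  start <- regexpr(pep_seq, protein, fixed = TRUE)[1]
  if (start < 0) stop("peptide not found in protein")
  end    <- start + nchar(pep_seq) - 1
  # substr() returns "" when the flank would be empty (peptide at a terminus)
  before <- substr(protein, max(start - n, 1), start - 1)
  after  <- substr(protein, end + 1, min(end + n, nchar(protein)))
  paste(before, pep_seq, after, sep = ".")
}

add_flanks("VTTYFPSVNLR", "GNIRVTTYFPSVNLRKSSQK")
```

For the example above this gives the overhangs NIR and KSS around the peptide.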

However consider the following peptide:

QNGRLR.HFTIPSHR.ARAGR

If we add RLR to the N-terminus of the peptide sequence, the cleavage site again does not mimic what happens in the protein, since cleavage after the first R in the protein yields a dead-end product:

LR.HFTIPSHR

Hence the overhang needs to be extended by 3 aa before the RLR. However, this extension of overhangs is not always possible, since there is a limit to the length of peptide that can be synthesised (usually a synthetic peptide can be no longer than 20 aa). Hence additional parameters need to be passed to the model to determine the optimal compromise.
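
To make the RLR case concrete, here is a small sketch that flags an N-terminal overhang containing a K or R before its final residue, followed by a length check for the extended overhang. `has_extra_site` is a hypothetical helper, and the 20 aa limit is the usual value mentioned above:

```r
# Hypothetical helper: TRUE if an N-terminal overhang contains a K or R
# other than its last residue (the intended cleavage site).
has_extra_site <- function(overhang) {
  aa    <- strsplit(overhang, split = "")[[1]]
  basic <- which(aa %in% c("K", "R"))
  any(basic < length(aa))
}

has_extra_site("NIR")  # overhang of VTTYFPSVNLR: no extra site
has_extra_site("RLR")  # overhang of HFTIPSHR: extra R at position 1

# Extended N-terminal overhang plus peptide, checked against a 20 aa limit
# (the C-terminal overhang is left out of this check).
nchar("QNGRLR") + nchar("HFTIPSHR") <= 20
```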

I will write out a detailed outline of the workflow if this functionality is to be added to cleaver.

references:

  1. Kito, Keiji, et al. "A synthetic protein approach toward accurate mass spectrometric quantification of component stoichiometry of multiprotein complexes." Journal of proteome research 6.2 (2007): 792-800.
@pavel-shliaha
Author

my current code

addOverhangs <- function (pep_seq, proteins, maxLength,
                          preferN = FALSE, preferC = FALSE){

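  # match the peptide literally (fixed = TRUE) rather than as a regular expression
  proteinSeq <- grep (pep_seq, proteins, value = TRUE, fixed = TRUE)

  if (length (proteinSeq) == 0){
    stop ("peptide not found in any protein")
  }

  # the peptide must occur in exactly one protein to be proteotypic
  if (length (proteinSeq) > 1){
    resultList <- list ("AA_before_20" = NA, "AA_after_20"  = NA, 
                        "spikeTide" = pep_seq, "result" =  "non_proteotypic")
    return (resultList)
  } 

  proteinSeqAA <- strsplit (proteinSeq, split = "")[[1]]

  pepPosition <- regexpr (pep_seq, proteinSeq, fixed = TRUE)[1]
  pepLength   <- nchar (pep_seq)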

  ###############################################################################
  #  add 20 aa before

  aaStart    <- max (pepPosition - 20, 1)

  if (pepPosition > 1) {
    AA_before_20 <- paste0 (proteinSeqAA[aaStart : (pepPosition - 1)],  collapse = "")
  } else {
    # peptide sits at the protein N-terminus: nothing precedes it
    AA_before_20 <- ""
  }


  ###############################################################################
  # add 20 aa after

  # + 19 (not + 20) so that exactly 20 residues are taken
  aaEnd    <- min (pepPosition + pepLength + 19, nchar (proteinSeq))

  if (pepPosition + pepLength <= nchar (proteinSeq)) {
    AA_after_20 <- paste0 (proteinSeqAA[(pepPosition + pepLength) : aaEnd],  collapse = "")
  } else {
    # peptide sits at the protein C-terminus: nothing follows it
    AA_after_20 <- ""
  }

  # apply the following rules:

  ##############################################################
  # 1) for the preceding AA

  aaBefore <- strsplit (AA_before_20, split = "")[[1]]
  aaBasic  <- which (aaBefore == "K" | aaBefore == "R")

  if (length (aaBasic) > 1){
    aaBasic2 <- c (0, aaBasic[1:(length (aaBasic) - 1)]) 
    firstGoodAA   <- which ((aaBasic - aaBasic2 > 3))

    if (length (firstGoodAA) > 0){
      firstGoodAA <- aaBasic[max (firstGoodAA)]
      aaToAddBefore <- paste (aaBefore[(firstGoodAA - 3) : length (aaBefore)] , collapse = "")
    } else {
      aaToAddBefore <- tail (aaBefore, 4)
    }

  } else {
    aaToAddBefore <- tail (aaBefore, 4)
  }

  overhang_before <- paste (aaToAddBefore, collapse = "")

  ############################################################################
  # 2) for the following AA

  aaAfter <- strsplit (AA_after_20, split = "")[[1]]
  aaBasic  <- c (0,  which (aaAfter == "K" | aaAfter == "R"))

  if (length (aaBasic) > 1) {
    aaBasic2 <- c (aaBasic[2:length (aaBasic)], length (aaAfter))
    firstGoodAA   <- which (aaBasic2 - aaBasic > 2)

    if (length ( firstGoodAA) > 0){
      firstGoodAA   <- aaBasic[min (firstGoodAA)]
      aaToAddAfter  <- aaAfter[1: (firstGoodAA + 3)]
    } else{
      aaToAddAfter <- head (aaAfter, 3)
    }

  } else {
    aaToAddAfter <- head (aaAfter, 3)
  }

  overhang_after <- paste (aaToAddAfter, collapse = "")

  ############################################################################
  # add overhangs

  length_with_overhangs <- sum (nchar (overhang_before), nchar (pep_seq),nchar (overhang_after))

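  # fallback if none of the options below applies (e.g. the peptide is too
  # long for both overhangs and neither preferN nor preferC is set);
  # "no_overhangs" is a placeholder label
  spikeTide <- pep_seq
  result    <- "no_overhangs"

  # option 1: adding full overhangs
  if (length_with_overhangs <= maxLength ){
    spikeTide <-  paste (overhang_before, pep_seq, overhang_after , sep = ".")
    result <- "complete_overhangs"
  } 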

  # option 2: shortening preceding overhang (succeeding overhang is 3 amino acids long)
  if (length_with_overhangs > maxLength &&
      nchar (pep_seq) + 7   <= maxLength &&
      nchar (overhang_before) > 4 &&  nchar (overhang_after) < 4 ){

    aaAllowedBefore <- maxLength - nchar (pep_seq) - nchar (overhang_after)
    aaBefore <- strsplit (overhang_before, split = "")[[1]]
    aaBefore <- aaBefore[(length (aaBefore) - aaAllowedBefore + 1) :  length (aaBefore)]
    new_overhang_before <- paste (aaBefore, collapse = "")
    spikeTide <-  paste (new_overhang_before, pep_seq, overhang_after , sep = ".")
    result    <- "N_overhang_shortened"
  } 

  # option 3: shortening succeeding overhang (preceding overhang is 4 amino acids long)
  if (length_with_overhangs > maxLength &&
      nchar (pep_seq) + 7   <= maxLength &&
      nchar (overhang_before) < 5 &&  nchar (overhang_after) > 3 ){

    aaAllowedAfter <- maxLength - nchar (pep_seq) - nchar (overhang_before)
    aaAfter <- strsplit (overhang_after, split = "")[[1]]
    aaAfter <- aaAfter[1 :aaAllowedAfter]
    new_overhang_after <- paste (aaAfter, collapse = "")
    spikeTide <-  paste (overhang_before, pep_seq, new_overhang_after , sep = ".")
    result    <- "C_overhang_shortened"
  } 

  # option 4: shortening both overhangs, if both need to be shortened
  if (length_with_overhangs >  maxLength &&
      nchar (pep_seq) + 7   <= maxLength &&
      nchar (overhang_before) > 4 &&  nchar (overhang_after) > 3 ){

    new_overhang_before <- paste0 (tail (strsplit (overhang_before, split = "")[[1]] , 4), collapse = "")
    new_overhang_after  <- paste0 (head (strsplit (overhang_after, split = "")[[1]] , 3), collapse = "")

    spikeTide <-  paste (new_overhang_before, pep_seq, new_overhang_after, sep = ".")
    result    <- "both_overhangs_shortened"
  } 


  # option 5: add a single overhang
  # important: do not add fewer than 4 amino acids on the N-terminus or fewer than 3 amino acids on the C-terminus
  if ( nchar (pep_seq) + 7  > maxLength){

    numAAToAdd <- maxLength - nchar (pep_seq)

    # if user wants overhang on N-terminus
    if (preferN & numAAToAdd >= 4) { # add amino acids
      if (nchar (overhang_before) == 4 ){
        spikeTide <-  paste (overhang_before, pep_seq, sep = ".")
        result    <- "N_overhang_only"
      } else {
        aaAllowedBefore <- maxLength - nchar (pep_seq)
        aaBefore <- strsplit (overhang_before, split = "")[[1]]
        aaBefore <- aaBefore[(length (aaBefore) - aaAllowedBefore + 1) :  length (aaBefore)]
        new_overhang_before <- paste (aaBefore, collapse = "")
        spikeTide <-  paste (new_overhang_before, pep_seq, sep = ".")
        result    <- "N_overhang_only_shortened"
      }  
    }

    # if user wants overhang on C-terminus
    if ((preferC & numAAToAdd >= 3) | numAAToAdd == 3) { # add amino acids
      if (nchar (overhang_after) == 3 ){
        spikeTide <-  paste (pep_seq, overhang_after,  sep = ".")
        result    <- "C_overhang_only"
      } else {
        aaAllowedAfter <- maxLength - nchar (pep_seq)
        aaAfter <- strsplit (overhang_after, split = "")[[1]]
        aaAfter <- aaAfter[1 :aaAllowedAfter]
        new_overhang_after <- paste (aaAfter, collapse = "")
        spikeTide <-  paste (pep_seq, new_overhang_after , sep = ".")
        result    <- "C_overhang_only_shortened"
      }  
    }


  } 


  # return the results

  resultList <- list ("AA_before_20" = AA_before_20,
                      "AA_after_20"  = AA_after_20, 
                      "spikeTide" = spikeTide, 
                      "result" = result)

  return (resultList)
}

@pavel-shliaha
Author

A couple more comments:

  1. the output: I believe the user might want the following output:
  • the new sequence: YDSKVNQADNLIEVGKGPEK
  • the new sequence where the cleavage sites are shown as dots YDSK.VNQADNLIEVGK.GPEK
  • the complete overhang
  • the suggested overhang (might not be the same as complete if shortened)
  • result spelled out: e.g. "complete_overhangs" or "C_overhang_only"
  • 20 amino acids before and 20 amino acids after for user to be able to examine how overhangs were created
  2. an example table (with peptide sequences and output) is in:

"data:\RAW\pvs22_QTOF_DATA_data3\data_for_synapter_2.0\cleaver_overhangs"

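The output fields suggested in the first item above could be assembled from the dotted sequence roughly as follows. This is only a sketch: the element names are hypothetical, the `result` value is shown purely as an illustration, and it assumes both overhangs are present (three dot-separated parts):

```r
dotted <- "YDSK.VNQADNLIEVGK.GPEK"

# split on literal dots to recover the N overhang, peptide and C overhang
parts <- strsplit(dotted, split = ".", fixed = TRUE)[[1]]

out <- list(
  sequence        = paste(parts, collapse = ""),  # sequence to order
  sequence_dotted = dotted,                       # cleavage sites as dots
  overhang_N      = parts[1],
  peptide         = parts[2],
  overhang_C      = parts[3],
  result          = "complete_overhangs"          # illustrative value
)

out$sequence
```

A result with a single overhang (e.g. "C_overhang_only") has only two parts and would need separate handling.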
@pavel-shliaha
Author

  1. Sometimes a company will require the synthesised peptide to end with a certain amino acid (JPT enforces K|R on the C-terminus). There should be an argument for this, e.g. end = "K". Note that this enforced AA is part of the peptide being ordered, hence it should be counted towards the maximum peptide sequence length.
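
One way to account for the enforced residue is to reserve its position before deciding how many overhang residues fit. A minimal sketch, assuming the enforced residue is appended as one extra position; `overhang_budget` is a hypothetical helper and `end` is the argument name proposed above:

```r
# Hypothetical helper: number of residues left for overhangs once the
# peptide itself and an enforced C-terminal residue are accounted for.
overhang_budget <- function(pep_seq, maxLength, end = NULL) {
  enforced <- if (is.null(end)) 0 else nchar(end)
  maxLength - nchar(pep_seq) - enforced
}

overhang_budget("VNQADNLIEVGK", maxLength = 20)             # no enforced residue
overhang_budget("VNQADNLIEVGK", maxLength = 20, end = "K")  # one position reserved
```

If the chosen C-terminal overhang already ends in the required residue, nothing would need to be appended, so a real implementation would probably check that first.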

@sgibb
Owner

sgibb commented Jan 3, 2015

Closed via lgatto/Pbase#6.

@sgibb sgibb closed this as completed Jan 3, 2015
lgatto pushed a commit to lgatto/Pbase that referenced this issue Feb 14, 2015
jorainer pushed a commit to lgatto/Pbase that referenced this issue Jul 4, 2018
see sgibb/cleaver#5 for details

git-svn-id: file:///home/git/hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/Pbase@98013 bc3139a8-67e5-0310-9ffc-ced21a209358