Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

computation of conditions bug #52

Closed
pavel-shliaha opened this issue Sep 12, 2017 · 15 comments
Closed

computation of conditions bug #52

pavel-shliaha opened this issue Sep 12, 2017 · 15 comments
Assignees
Labels
Milestone

Comments

@pavel-shliaha
Copy link
Collaborator

pavel-shliaha commented Sep 12, 2017

If I execute the following code I get

myoTD <- readTopDownFiles(path = getwd(),
                          type = c("b", "c", "y", "z"),
                          adducts = data.frame(mass=c(-H, H), name=c("cmH", "zpH"), to=c("c", "z")),
                          neutralLoss = NULL,
                          modifications = NULL,
                          sampleColumns = sampleColumns) 
myoTD <- filterInjectionTime(myoTD, maxDeviation=log2(3), keepTopN=2)
myoTD <- filterIntensity(myoTD, threshold=0.1)
myoTD <- filterCv(myoTD, threshold=30)
myoTD <- filterNonReplicatedFragments(myoTD)
ncb <- as(myoTD, "NCBSet")
ncbRed <- ncb[,ncb$AgcTarget == 5e5 & ncb$EtdReagentTarget %in% c(1e7, 0) & ncb$Mz == 707.3,]
ncbRedCond <- as.data.frame (bestConditions(ncbRed), stringsAsFactors = FALSE)
ncbRedCond["conditions"] <- row.names (ncbRed@colData)[ncbRedCond$index]
conditions

#   index bonds                              conditions
#1     95    90  C0707.30_5.0e+05_1.0e+07_07_05.00_00_1

so the best condition is index 95 with 90 bonds. However this is not the case: the best condition is 131 with 93 fragments

xx <- ncbRed@assay[, c(131, 95)]
sum(xx[, 1] == 1, xx[, 1] == 2, 2*sum(xx[, 1] == 3))
sum(xx[, 2] == 1, xx[, 2] == 2, 2*sum(xx[, 2] == 3))

C131 gives 93 and C95 gives 92 fragments. I imagine its a simple bug of not awarding 2 points for "Both" fragment (3 in your case)?

@sgibb
Copy link
Owner

sgibb commented Sep 12, 2017

You are right. Currently n-term, c-term, both are counted once. As far as I understood that's the way you suggested it in #22.

IMHO there was a misunderstanding. I asked:

Would you prefer a C-terminal fragment with higher intensity about a "both" fragment with lower intensity?

And you said

it depends on application. However in our case I can filter to select only reproducible fragments of high intensity before starting aggregation into N, C and Both

That's why I thought you don't care whether a bond is covered by one or two fragments and we look for the highest bond coverage (regardless its covered by one or two fragments).

But I understand that a bond is covered better by "both" than by a single "N" or "C".

I will change that.

@sgibb sgibb added this to the bioc release milestone Sep 12, 2017
@sgibb sgibb self-assigned this Sep 12, 2017
@pavel-shliaha
Copy link
Collaborator Author

pavel-shliaha commented Sep 12, 2017

yes, sorry I understood your question in a very convoluted way ... I thought you were asking "why is "both" with low intensity fragments better (gets a score of 2) than N-terminal with high intensity (gets a score of one)". Facepalm.

Also now you probably understand why I suggested splitting the table into N- and C-terminal part for computing combinations... Given we have to test both the tables its much simpler to do this separately

@sgibb
Copy link
Owner

sgibb commented Sep 12, 2017

Ok, I changed the function. Now you get a three column matrix (with conditions as row.names); columns are: index, number of fragments, number of covered bonds:

bestConditions(n)
#                                       index fragments bonds
#C937.30_5.0e+05_0.0e+00_00.00_28_00_01  4830        72    69
#C703.20_1.0e+06_1.0e+07_02.50_00_21_1   3903        12    11
#C703.20_1.0e+06_1.0e+07_05.00_28_00_2   3945         8     8
#C703.20_5.0e+05_1.0e+06_15.00_00_00_2   2929         4     4
#C703.20_1.0e+06_5.0e+06_15.00_14_00_3   3830         3     3
#C937.30_5.0e+05_0.0e+00_00.00_35_00_07  4847         3     2
#C562.70_5.0e+05_5.0e+06_02.50_28_00_2   1052         2     2
#C562.70_1.0e+05_1.0e+06_02.50_00_14_3    129         1     1
#C562.70_5.0e+05_0.0e+00_00.00_00_14_01   702         1     1

Could you please test again? (Hopefully I don't mess everything up. It is dangerous to code at night ...)

@pavel-shliaha
Copy link
Collaborator Author

pavel-shliaha commented Sep 13, 2017

this is still a problem, and I think I know why. But just to reiterate, when we calculate coverage:

both - gets 2 points
N/C - 1 point
none - 0 points

when you are combining conditions for a particular bond

both + both = both - 0 points
both + N = both - 0 points
both + C = both - 0 points
none + none = none - 0 points
N + N = N - 0 points
C + C = C - 0 points
N + C = both - 1 point
both + N = both 1 point
both + C = both 1 point

the part that is wrong in your code is that

both + N gets 0 point
both + C gets 0 point

at least your code seems to perform this way numerically, i.e. when you add a new condition you do not reward having both in the position you only previously had N or C. Might it be smth to do with this

x[.row(x[, hc[1L], drop=FALSE]), ] <- 0L

here you seem to assign 0 to all rows even if you only had N/C in the bond?

@sgibb
Copy link
Owner

sgibb commented Sep 13, 2017

To be honest I don't completely understand your example. But should it work in the following way?

Example data:

Matrix:

bond C1 C2 C3
1 N B
2 N C

First round: Select condition C1 (because 2 bonds covered) and return the following:

index fragments bonds
1 2 2

Now my code does the following (because all bonds are already covered by at least one fragment):

bond C1 C2 C3
1
2

Am I right, that you suggest to degrade B in C2 to C (because N was already present in C1) and keep C in C3 (because bond 2 was not covered by B or C):

bond C1 C2 C3
1 C
2 C

So you will get 2 more rounds:

index fragments bonds
1 2 2
2 1 1
3 1 1

@pavel-shliaha
Copy link
Collaborator Author

sorry I made a mistake in my description, since summing conditions is asymmetrical, as I just realise. From now on lets put already selected condition as the first one, so:

both + N = both - 0 points
N + both = both - 1 point

the latter is broken in your code.

@pavel-shliaha
Copy link
Collaborator Author

as to your comment: yes Both becomes C when N is already covered. But your code makes Both into nothing.

I understand this is confusing so:

THE BOTTOM LINE IS WE NEED TO ACCUMULATE AS MANY FRAGMENTS GOING BOTH DIRECTIONS. HAVING A FRAGMENT GOING IN THE DIRECTION THAT IS ALREADY PRESENT GIVES 0. HAVING A FRAGMENT GO IN A DIRECTION THAT IS ABSENT GIVES 1

hence my suggestion to split the table into 2 parts.

NFragTable <- as.matrix (ncbRed@assay)
NFragTable <- t(apply (NFragTable, 1, function(x) ifelse (x %in% c(1,3), TRUE, FALSE))) 

CFragTable <- as.matrix (ncbRed@assay)
CFragTable <- t(apply (CFragTable, 1, function(x) ifelse (x %in% c(2,3), TRUE, FALSE))) 


numCond <- 10
  
combTable <- as.data.frame (matrix (ncol = 2, nrow = numCond, 
                                      dimnames = list(NULL, c("index","fragments"))), stringsAsFactors = FALSE)
  
  for (i in 1:numCond){
    numBonds <- colSums (NFragTable) + colSums (CFragTable) 
    sel <- order (numBonds, decreasing = TRUE)[1]
    combTable[i, "index"] <- sel
    combTable[i, "fragments"] <- numBonds[sel]
    
    NFragTable <- NFragTable[!NFragTable[, sel], ] 
    CFragTable <- CFragTable[!CFragTable[, sel], ]
    
    sel <- NFragTable[, sel] | CFragTable[, sel]
    
    NFragTable <- NFragTable[!sel, ]
    CFragTable <- CFragTable[!sel, ]
}

sgibb added a commit that referenced this issue Sep 21, 2017
@sgibb
Copy link
Owner

sgibb commented Sep 21, 2017

@pavel-shliaha could please (re)test bestConditions in the most recent version.

@pavel-shliaha
Copy link
Collaborator Author

yep, it now works!

@pavel-shliaha
Copy link
Collaborator Author

right now the bestConditions() returns a table with the number of fragments covered + index. I know we discussed this and I know I asked for fragments not bonds, but it seems the field requires to maximise bonds not fragments. Could you please now compute both: bonds covered and fragments number and introduce an argument, maximise = "bonds" or "fragments", with "fragments" as default. If "fragments" is selected as in now,the conditions are listed in order of increasing fragments if "bonds" are selected the conditions are returned in order of maximising bonds covered. However metrics are always returned

@pavel-shliaha pavel-shliaha reopened this Jan 13, 2018
@sgibb sgibb modified the milestones: bioc release, bioc 3.7 Jan 14, 2018
@sgibb
Copy link
Owner

sgibb commented Jan 15, 2018

I have a question regarding the final reporting table. The current approach is:

  1. Find the highest fragment coverage (easy to change to highest bond coverage; I already did that.)
  2. Remove the fragments from the whole dataset (easy to change for bonds)
  3. Start at 1. again until now additional coverage could be generated.
  4. Report results.

The problem is in the implementation of step 2. Image the following conditions:

cond1 cond2
bond1 C -
bond2 N B

maximise fragments:

  1. Select cond1 or 2 based on intensity (e.g. we select cond1)
  2. Remove C from bond1 and N from bond2 (turn B in cond2 into a C).
  3. Select cond2 which adds a C:
  4. Report
index fragments bonds
1 2 2
2 1 ?

The question is what should be reported for the ? in the table above? 1 because the C covers 1 bond (then the colSum would be larger than the real number of bonds). Or zero because the bond is already covered by condition 1 (then the colSum would be equal to the real bond number, but some fragments won't cover a bond)?

same problem for maximise bonds is IMHO easy:

  1. Select cond1 because it covers 2 bonds.
  2. Remove both bonds from the coverage set
  3. Nothing left
  4. Report
index fragments bonds
1 2 2

@pavel-shliaha
Copy link
Collaborator Author

pavel-shliaha commented Jan 15, 2018

In your example, the answer we are looking for is

index fragments bonds
1 2 2
2 1 0

however, what is important attempting to maximise bonds will generate a different optimal combination than attempting to maximise fragments. Hence we need an argument: maximise = "bonds" or "fragments" However in both cases (if possible, I would prefer to have both columns: fragments and bonds) calculated.

I know this is not what I initially wanted, but it seems the field requires to report coverage in terms of bonds, however, I would still like to have the fragment functionality calculated.

@sgibb
Copy link
Owner

sgibb commented Jan 15, 2018

As we discussed via skype we optimise for number of fragments (default) or bonds. If there are multiple conditions with the same number of fragments/bonds we choose the one with the highest overall intensity. The bonds (fragments) are ignored for optimisation if maximise="fragments" (maximise="bonds") but reported.

library("topdownr")
data(tds)
tds <- aggregate(tds)
ncb <- as(tds, "NCBSet")

bestConditions(ncb, maximise="fragments")
#                                index fragments bonds
# C1.0e+06_0.0e+00_00.00_00_14_1   219        99    98
# C1.0e+06_1.0e+06_05.00_21_00_1   160        36    31
# C1.0e+06_1.0e+06_02.50_00_21_1   154         9     6
# C1.0e+06_1.0e+06_15.00_14_00_1   181         5     2
# C1.0e+06_1.0e+06_02.50_00_07_1   152         5     4
# C5.0e+05_0.0e+00_00.00_00_21     144         4     1
# C1.0e+06_1.0e+06_02.50_00_14_1   153         3     2
# C1.0e+06_1.0e+06_10.00_00_21     176         3     1
# C1.0e+06_1.0e+06_10.00_21_00_1   171         2     0
# C5.0e+05_1.0e+06_10.00_00_00_1   104         2     0
# C1.0e+06_1.0e+06_15.00_00_14     186         2     2
# C5.0e+05_0.0e+00_00.00_35_00_1   141         1     0
# C1.0e+05_1.0e+06_02.50_00_00_1    11         1     0
# C1.0e+06_1.0e+06_10.00_28_00     172         1     0
# C1.0e+06_1.0e+06_05.00_28_00_1   161         1     1
# C5.0e+05_1.0e+06_05.00_00_07_1    88         1     1
# C1.0e+06_1.0e+06_30.00_00_00     201         1     0
# C1.0e+06_1.0e+06_10.00_00_07     174         1     0
# C5.0e+05_1.0e+06_15.00_07_00     105         1     0
# C5.0e+05_1.0e+06_02.50_00_21_1    79         1     0
# C1.0e+06_1.0e+06_10.00_00_14     175         1     0
# C5.0e+05_1.0e+06_10.00_00_14_1   100         1     0
# C5.0e+05_1.0e+06_50.00_28_00     130         1     0
# C5.0e+05_0.0e+00_00.00_07_00_1   137         1     0
# C1.0e+06_1.0e+06_10.00_00_35_1   178         1     0
# C1.0e+06_0.0e+00_00.00_21_00_1   215         1     0
bestConditions(ncb, maximise="bonds")
#                                index fragments bonds
# C1.0e+06_0.0e+00_00.00_00_14_1   219        99    98
# C1.0e+06_1.0e+06_10.00_00_07     174        35    32
# C1.0e+06_1.0e+06_05.00_00_07     163         5     5
# C5.0e+05_0.0e+00_00.00_35_00_1   141         6     3
# C1.0e+06_1.0e+06_10.00_28_00     172         2     2
# C1.0e+06_1.0e+06_05.00_28_00_1   161         4     2
# C5.0e+05_1.0e+06_05.00_00_14_1    89         4     2
# C5.0e+05_1.0e+06_02.50_00_21_1    79         3     2
# C5.0e+05_1.0e+06_05.00_00_07_1    88         2     1
# C1.0e+06_1.0e+06_02.50_00_07_1   152         2     1
# C1.0e+06_1.0e+06_15.00_00_14     186         1     1

@sgibb sgibb closed this as completed Jan 17, 2018
@pavel-shliaha
Copy link
Collaborator Author

If possible could you please introduce 2 additional columns containing the total number of fragments and bonds covered in every condition. So now there will be 5 columns:

  1. "index"
  2. "fragments_added_to_combination"
  3. "bonds_added_to_combination"
  4. "fragments_in_codnition"
  5. "bonds_in_condition"

this will make the table more infromative

@sgibb
Copy link
Owner

sgibb commented Jan 30, 2018

Now there are 5 columns:

library("topdownr")
data(tds)
tds <- aggregate(tds)
ncb <- as(tds, "NCBSet")

bestConditions(ncb, maximise="fragments", n=10)
#                                Index FragmentsAddedToCombination
# C1.0e+06_0.0e+00_00.00_00_14_1   219                          97
# C1.0e+06_1.0e+06_05.00_21_00_1   160                          37
# C1.0e+06_1.0e+06_02.50_00_07_1   152                           9
# C1.0e+06_1.0e+06_02.50_00_21_1   154                           6
# C1.0e+06_1.0e+06_15.00_14_00_1   181                           5
# C5.0e+05_0.0e+00_00.00_00_21     144                           4
# C1.0e+06_1.0e+06_02.50_00_14_1   153                           3
# C1.0e+06_1.0e+06_10.00_00_21     176                           3
# C1.0e+06_1.0e+06_10.00_21_00_1   171                           2
# C5.0e+05_1.0e+06_10.00_00_00_1   104                           2
#                                BondsAddedToCombination FragmentsInCondition
# C1.0e+06_0.0e+00_00.00_00_14_1                      96                   97
# C1.0e+06_1.0e+06_05.00_21_00_1                      32                   89
# C1.0e+06_1.0e+06_02.50_00_07_1                       8                   90
# C1.0e+06_1.0e+06_02.50_00_21_1                       3                   90
# C1.0e+06_1.0e+06_15.00_14_00_1                       2                   75
# C5.0e+05_0.0e+00_00.00_00_21                         1                   29
# C1.0e+06_1.0e+06_02.50_00_14_1                       2                   67
# C1.0e+06_1.0e+06_10.00_00_21                         1                   62
# C1.0e+06_1.0e+06_10.00_21_00_1                       0                   86
# C5.0e+05_1.0e+06_10.00_00_00_1                       0                   63
#                                BondsInCondition
# C1.0e+06_0.0e+00_00.00_00_14_1               96
# C1.0e+06_1.0e+06_05.00_21_00_1               88
# C1.0e+06_1.0e+06_02.50_00_07_1               90
# C1.0e+06_1.0e+06_02.50_00_21_1               87
# C1.0e+06_1.0e+06_15.00_14_00_1               71
# C5.0e+05_0.0e+00_00.00_00_21                 29
# C1.0e+06_1.0e+06_02.50_00_14_1               66
# C1.0e+06_1.0e+06_10.00_00_21                 62
# C1.0e+06_1.0e+06_10.00_21_00_1               83
# C5.0e+05_1.0e+06_10.00_00_00_1               62

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants