Quantile Normalization

mjarek66 edited this page Feb 18, 2016 · 8 revisions
Input matrix is:

 A    5    4    3
 B    NA   1    4
 C    NA   4    6
 D    4    2    8


 After NA replacement with column average:

 A    5    4    3
 B    4.5  1    4
 C    4.5  4    6
 D    4    2    8

Ranks are:

 A    iv   iii   i
 B    ii   i     ii
 C    ii   iii   iii
 D    i    ii    iv

 After ordering we have a temporary matrix,

 A    4     1   3
 B    4.5   2   4
 C    4.5   4   6
 D    5     4   8

 from which we can calculate row-wise averages (note how duplicates are dealt with: tied values share the same rank).

 A (4 + 1 + 3)/3 = 2.66 = rank i
 B (4.5 + 2 + 4)/3 = 3.5 = rank ii
 C (4.5 + 4 + 6)/3 = 4.83 = rank iii
 D (5 + 4 + 8)/3 = 5.66 = rank iv

 Normalized matrix is:

 A    5.66      4.83    2.66
 B    3.5       2.66    3.5
 C    3.5       4.83    4.83
 D    2.66      3.5     5.66

In the normalized matrix, each column should have similar moments (mean, variance, and so on).
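
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the actual implementation: the function name `quantile_normalize` is made up for this page, `None` stands in for NA, tied values are assumed to share the lowest rank of the tie (as in the example), and every column is assumed to have at least one measured value.

```python
from statistics import mean

def quantile_normalize(matrix):
    """Quantile-normalize columns; None marks a missing value (NA)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_cols)]

    # Step 1: replace NA with the mean of the measured values in that column.
    filled = []
    for col in cols:
        col_mean = mean(v for v in col if v is not None)
        filled.append([col_mean if v is None else v for v in col])

    # Step 2: sort each column, then average across columns at each rank position.
    sorted_cols = [sorted(col) for col in filled]
    rank_means = [mean(sc[k] for sc in sorted_cols) for k in range(n_rows)]

    # Step 3: give each value the average for its rank. index() returns the
    # first position of a value, so tied values share the lowest rank.
    out = [[None] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(filled):
        for i, v in enumerate(col):
            out[i][j] = rank_means[sorted_cols[j].index(v)]
    return out

example = [[5, 4, 3], [None, 1, 4], [None, 4, 6], [4, 2, 8]]
for label, row in zip("ABCD", quantile_normalize(example)):
    print(label, [round(v, 2) for v in row])
```

Up to rounding in the last digit, the printed rows should agree with the normalized matrix above.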


Alternative approach to consider:

As an alternative to the above (currently implemented) approach, 
for missing data points one could/should consider the average rank 
across the row (peptide/probe/analyte). 

Thus, if a given peptide has on average a low rank (low abundance) across all 
samples in which it is measured, it should not be imputed as mid-rank 
within a column, as the current approach implies. 

Rather, it should be imputed with a low rank/value that follows its 
average behavior across samples. We cannot impute an average value before 
re-normalizing, since columns/samples could have significant batch effects, 
for instance (although in practice we have not observed this for P100 or GCP
so far - kudos to Broad!).

However, we can impute the average rank and then, subsequently, the corresponding 
average value for that rank. In other words, once we have the expected rank, 
we assign to the missing point the value for that rank in the column of interest 
(if this rank is present), or the average value of the closest ranks (if the average 
rank falls between two ranks), or the value for the lowest (or highest) rank if the 
average rank falls below the lowest (above the highest) rank. 
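
The lookup rule just described can be written as a small helper. This is a sketch under assumptions: the name `value_at_rank` is hypothetical, ranks are taken to be the contiguous integers 1..n, and `rank_values[k]` is assumed to hold the value already computed for rank k + 1.

```python
import math

def value_at_rank(rank_values, rank):
    """Return the value for a possibly fractional, 1-based rank."""
    if rank <= 1:                      # below the lowest rank: use the lowest
        return rank_values[0]
    if rank >= len(rank_values):       # above the highest rank: use the highest
        return rank_values[-1]
    lo, hi = math.floor(rank), math.ceil(rank)
    if lo == hi:                       # the rank itself is present
        return rank_values[lo - 1]
    # The rank falls between two ranks: average the values of the closest two.
    return (rank_values[lo - 1] + rank_values[hi - 1]) / 2

# Hypothetical per-rank values for ranks i..iv.
per_rank = [2.0, 4.0, 5.0, 9.0]
print(value_at_rank(per_rank, 3))     # rank is present
print(value_at_rank(per_rank, 1.5))   # between ranks i and ii
print(value_at_rank(per_rank, 0.2))   # below the lowest rank
```

Note that a plain average of the two closest ranks is used here, following the wording above; linear interpolation would be another option but is not what the text describes.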

Thus, for the example used above, this would look as follows:

Input matrix is:

  A    5    4    3
  B    NA   1    4
  C    NA   4    6
  D    4    2    8


Ranks are:

  A    ii   iii   i
  B    1.5  i     ii
  C    3.0  iii   iii
  D    i    ii    iv


After ordering we have a temporary matrix,

  A    4     1   3
  B    5     2   4
  C    NA    4   6
  D    NA    4   8

  from which we can calculate row-wise averages over the available values (note how duplicates are dealt with).

  A (4 + 1 + 3)/3 = 2.66 = rank i
  B (5 + 2 + 4)/3 = 3.66 = rank ii
  C (4 + 6)/2 = 5.0 = rank iii
  D (4 + 8)/2 = 6.0 = rank iv

  Normalized matrix is:

  A    3.66                5.0     2.66
  B    (3.66+2.66)/2       2.66    3.66
  C    5.0                 5.0     5.0
  D    2.66                3.66    6.0



For row B, we have rank 1.5 in the first column, so we take the average 
of the values for ranks i and ii in that column. For row C, we simply take 
the value for rank iii.
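
The whole alternative variant can be sketched as follows. Again, this is only an illustration of the idea, not tested production code: the name `quantile_normalize_row_rank` is made up, `None` stands in for NA, ties are assumed to share the lowest rank, and we assume that every row has at least one measured value and that at least one column is complete (so every rank has an average).

```python
import math
from statistics import mean

def quantile_normalize_row_rank(matrix):
    """Alternative variant: impute a missing entry from its row's average rank."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_cols)]
    # Sorted measured values per column; missing entries are left out.
    sorted_cols = [sorted(v for v in col if v is not None) for col in cols]

    # 1-based ranks; ties share the lowest rank, missing entries stay None.
    ranks = [[None if v is None else sorted_cols[j].index(v) + 1
              for j, v in enumerate(row)] for row in matrix]

    # Average at each rank position over the columns that have a value there.
    rank_means = [mean(sc[k] for sc in sorted_cols if k < len(sc))
                  for k in range(n_rows)]

    def at_rank(r):
        # Clamp to the available ranks, then average the two closest ranks
        # (for an integer rank both neighbours are the rank itself).
        r = min(max(r, 1), n_rows)
        lo, hi = math.floor(r), math.ceil(r)
        return (rank_means[lo - 1] + rank_means[hi - 1]) / 2

    out = []
    for row_ranks in ranks:
        expected = mean(r for r in row_ranks if r is not None)  # row's average rank
        out.append([at_rank(expected if r is None else r) for r in row_ranks])
    return out

example = [[5, 4, 3], [None, 1, 4], [None, 4, 6], [4, 2, 8]]
for label, row in zip("ABCD", quantile_normalize_row_rank(example)):
    print(label, [round(v, 2) for v in row])
```

Up to rounding in the last digit, the printed rows should agree with the normalized matrix above, with the first entry of row B being the average of the values for ranks i and ii.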

So, in the end, we are imputing the mean value for that row (over 
values that are not missing), as desired. This is unlikely to make a 
significant overall difference in our case, 
given how well the data seem to be normalized already. However, such 
per-row average imputed values would be more realistic. 

It should also be noted that neither of the two variants gives a perfect solution, 
and both break down as the number of missing values becomes large. 
In the second (alternative) approach with the per-row mean discussed here, 
ranks for non-missing data points get inflated as the fraction of missing data 
points increases, which could lead to systematic bias (probably not so bad, 
though, given that missing points are largely expected to correspond 
to low-abundance (low-rank) peptides). Needs testing!