Quantile Normalization
mjarek66 edited this page Feb 18, 2016 · 8 revisions
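The input matrix is:

    A  5    4  3
    B  NA   1  4
    C  NA   4  6
    D  4    2  8

After replacing each NA with its column average:

    A  5    4  3
    B  4.5  1  4
    C  4.5  4  6
    D  4    2  8

The ranks within each column are:

    A  iv   iii  i
    B  ii   i    ii
    C  ii   iii  iii
    D  i    ii   iv

Sorting each column in ascending order gives a temporary matrix. The row labels here only mark the position in the sorted order: row A holds the smallest value of each column, row D the largest.

    A  4    1  3
    B  4.5  2  4
    C  4.5  4  6
    D  5    4  8

From this matrix we calculate the row-wise averages, which become the values assigned to each rank:

    A  (4 + 1 + 3)/3   = 2.67 = rank i
    B  (4.5 + 2 + 4)/3 = 3.5  = rank ii
    C  (4.5 + 4 + 6)/3 = 4.83 = rank iii
    D  (5 + 4 + 8)/3   = 5.67 = rank iv

The normalized matrix is:

    A  5.67  4.83  2.67
    B  3.5   2.67  3.5
    C  3.5   4.83  4.83
    D  2.67  3.5   5.67

Note how duplicates are dealt with: tied values share the lower rank. The two 4.5 values in the first column both get rank ii and therefore the value 3.5, so the value for rank iii is not used in that column.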
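In the normalized matrix, each column should have similar moments (mean, variance, and so on), because every column now draws its values from the same set of rank averages; tied ranks keep the columns from matching exactly.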
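The steps of the worked example above can be sketched in plain Python. This is a minimal illustration and not the project's actual implementation: the function name `quantile_normalize`, the use of `None` for NA, and the list-of-rows layout are assumptions made here. Ties share the lower rank, as in the example.

```python
from statistics import mean


def quantile_normalize(matrix):
    """Quantile-normalize the columns of a list-of-rows matrix.

    Hypothetical helper: None marks a missing value (NA).
    Assumes every column has at least one observed value.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # 1. Replace each missing value with the mean of its column.
    filled = []
    for col in columns:
        col_mean = mean(v for v in col if v is not None)
        filled.append([col_mean if v is None else v for v in col])

    # 2. Sort every column, then average across columns position by
    #    position: reference[r] is the value assigned to rank r (0-based).
    sorted_cols = [sorted(col) for col in filled]
    reference = [mean(sc[r] for sc in sorted_cols) for r in range(n_rows)]

    # 3. Map each value to the reference value of its rank. list.index
    #    returns the first match, so tied values share the lower rank.
    normalized = [[None] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(filled):
        for i, value in enumerate(col):
            rank = sorted_cols[j].index(value)
            normalized[i][j] = reference[rank]
    return normalized


example = [
    [5, 4, 3],        # A
    [None, 1, 4],     # B
    [None, 4, 6],     # C
    [4, 2, 8],        # D
]
for label, row in zip("ABCD", quantile_normalize(example)):
    print(label, [round(v, 2) for v in row])
```

Run on the example input, this reproduces the normalized matrix above up to rounding. The column means come out at roughly 3.83, 3.96, and 4.17: close, but not identical, because of the tied values in the first two columns.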
Alternative approach to consider:
As an alternative to the above (currently implemented) approach, missing data points could be imputed from the average rank across the row (peptide/probe/analyte). If a given peptide has, on average, a low rank (low abundance) across all samples in which it is measured, it should not be imputed as mid-rank within a column, as the current approach implies. Rather, it should be imputed with a low rank/value that follows its average behavior across samples.

We cannot impute an average value before re-normalizing, since columns/samples could have significant batch effects, for instance (although in practice we have not observed this for P100 or GCP so far - kudos to Broad!). However, we can impute the average rank and then the corresponding value for that rank. In other words, once we have the expected rank, we assign to the missing point the value for that rank in the column of interest (if this rank is present), or the average value of the closest ranks (if the average rank falls between two ranks), or the value for the lowest (or highest) rank if the average rank falls below the lowest (or above the highest) rank.

For the example used above, this would look as follows. The input matrix is:

    A  5    4  3
    B  NA   1  4
    C  NA   4  6
    D  4    2  8

The ranks are as follows; the missing entries in rows B and C get the average of that row's ranks in the other columns:

    A  ii   iii  i
    B  1.5  i    ii
    C  3.0  iii  iii
    D  i    ii   iv

After ordering each column we have a temporary matrix, with the missing values left at the end of the first column:

    A  4    1  3
    B  5    2  4
    C  NA   4  6
    D  NA   4  8

From this matrix we calculate the row-wise averages over the values that are present:

    A  (4 + 1 + 3)/3 = 2.67 = rank i
    B  (5 + 2 + 4)/3 = 3.67 = rank ii
    C  (4 + 6)/2     = 5.0  = rank iii
    D  (4 + 8)/2     = 6.0  = rank iv

The normalized matrix is:

    A  3.67  5.0   2.67
    B  3.17  2.67  3.67
    C  5.0   5.0   5.0
    D  2.67  3.67  6.0

For row B, we have rank 1.5 in the first column, so we take the average of the values for ranks i and ii: (2.67 + 3.67)/2 = 3.17. For row C, we simply take the value for rank iii. So, in this example, each missing point ends up with the mean of that row's non-missing normalized values, as desired.
This is unlikely to make a significant overall difference in our case, given how well the data seem to be normalized already. However, such per-row average imputed values would be more realistic. It should also be noted that neither variant gives a perfect solution, and both break down as the number of missing values becomes large. In the second (alternative) approach discussed here, which uses the per-row mean, ranks for non-missing data points get inflated as the fraction of missing data points increases, which could lead to systematic bias (probably not so bad, though, given that missing points are largely expected to correspond to low-abundance (low-rank) peptides). Needs testing!
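As a starting point for such testing, the alternative approach could be sketched as follows. This is an illustration under stated assumptions, not an existing implementation: the name `quantile_normalize_row_rank` is hypothetical, `None` stands for NA, and every row is assumed to have at least one observed value. Following the worked example rather than the prose, the value for an imputed rank is looked up in the row-wise rank averages, and a fractional rank takes the plain average of the two neighboring ranks.

```python
import math
from statistics import mean


def quantile_normalize_row_rank(matrix):
    """Quantile-normalize columns, imputing NAs via the row's mean rank.

    Hypothetical helper: None marks a missing value (NA).
    Assumes every row has at least one observed value.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Sort only the observed values of each column.
    sorted_cols = [sorted(v for v in col if v is not None) for col in columns]

    # 1-based ranks, indexed as ranks[column][row]; ties share the lower
    # rank and missing entries stay None.
    ranks = [
        [None if v is None else sorted_cols[j].index(v) + 1 for v in col]
        for j, col in enumerate(columns)
    ]

    # Value for each rank: average over the columns that have a value
    # at that position in their sorted order.
    n_ranks = max(len(sc) for sc in sorted_cols)
    reference = [
        mean(sc[r] for sc in sorted_cols if r < len(sc))
        for r in range(n_ranks)
    ]

    def value_at(rank):
        # Clamp to the available ranks, then average the two closest
        # integer ranks (identical when the rank is already whole).
        rank = min(max(rank, 1), n_ranks)
        lo, hi = math.floor(rank), math.ceil(rank)
        return (reference[lo - 1] + reference[hi - 1]) / 2

    normalized = [[None] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        observed = [ranks[j][i] for j in range(n_cols) if ranks[j][i] is not None]
        for j in range(n_cols):
            # A missing entry gets the row's mean rank over observed columns.
            rank = ranks[j][i] if ranks[j][i] is not None else mean(observed)
            normalized[i][j] = value_at(rank)
    return normalized


example = [
    [5, 4, 3],        # A
    [None, 1, 4],     # B
    [None, 4, 6],     # C
    [4, 2, 8],        # D
]
for label, row in zip("ABCD", quantile_normalize_row_rank(example)):
    print(label, [round(v, 2) for v in row])
```

On the example input this matches the normalized matrix of the alternative approach up to rounding. Comparing its output with the column-average variant on matrices with an increasing fraction of missing values would be one way to probe the bias discussed above.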
Development sponsored by NIH funded BD2K grant http://lincs-dcic.org/