Sorlie et al. (2001) examined 85 samples profiled on cDNA microarrays to classify breast carcinomas based on variations in gene expression levels. The data consist of 456 cDNA clones from 427 unique genes for 78 carcinomas, 3 benign tumors, and 4 normal tissues.

Sample Size    Number of Features    Number of Classes    Disease
85             456                   5                    Breast Cancer

Data Source and Preprocessing

We acquired the data set from the hybridHclust package on CRAN, where the missing values had already been imputed with a 10-nearest-neighbors method. Because Sorlie et al. (2001) divided the data set into five subtypes based on gene expression, we treat the data set as having 5 classes.
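
To illustrate the idea behind the 10-nearest-neighbors imputation mentioned above, here is a minimal sketch in Python. It is not the code used by hybridHclust or by the original authors. The function name `knn_impute`, the choice to search for neighbors among rows, the distance (mean squared difference over columns observed in both rows), and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def knn_impute(X, k=10):
    """Hypothetical helper: fill each NaN with the mean of that column
    over the k nearest rows that have the column observed."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        # Compare row i with every row on the columns both have observed.
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        counts = shared.sum(axis=1)
        with np.errstate(invalid="ignore", divide="ignore"):
            dist = (diff ** 2).sum(axis=1) / counts
        dist[counts == 0] = np.inf   # no overlap, so not comparable
        dist[i] = np.inf             # a row is never its own neighbor
        order = np.argsort(dist)
        for j in np.where(missing[i])[0]:
            # Keep only comparable neighbors that observed column j.
            donors = [r for r in order
                      if np.isfinite(dist[r]) and not missing[r, j]][:k]
            if donors:
                filled[i, j] = X[donors, j].mean()
    return filled

# Toy matrix (assumed data, not the Sorlie measurements).
toy = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, np.nan],
    [1.0, 2.0, 5.0],
    [10.0, 10.0, 10.0],
])
imputed = knn_impute(toy, k=2)
```

In the toy matrix, the second row is closest to the first and third rows, so its missing entry is replaced by the average of their third-column values. With the real data no such step is needed, since the values from hybridHclust are already imputed.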


BibTeX Record

author = {S{\o}rlie, Therese and Perou, Charles M and Tibshirani, Robert and Aas, Turid and Geisler, Stephanie and Johnsen, Hilde and Hastie, Trevor and Eisen, Michael B and van de Rijn, Matt and Jeffrey, Stefanie S and Thorsen, Thor and Quist, Hanne and Matese, John C and Brown, Patrick O and Botstein, David and L{\o}nning, Per Eystein and B{\o}rresen-Dale, Anne-Lise},
title = {{Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications}},
journal = {Proceedings of the National Academy of Sciences},
year = {2001},
volume = {98},
pages = {10869--10874},
month = sep