Khan (2001)

John Ramey edited this page Dec 29, 2012 · 2 revisions


Khan et al. (2001) study "the small, round blue cell tumors (SRBCTs) of childhood, which include neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL) and the Ewing family of tumors (EWS)." In the version of the data set that we provide, the class labels correspond to these four SRBCT categories.

The gene expression data consist of 63 training samples obtained from cDNA microarrays containing 6567 genes. The authors filtered the data, retaining 2308 genes. Our version of the data consists of the 63 observations on these 2308 filtered genes.

Khan et al. (2001) used principal component analysis (PCA) to reduce the data to 10 dimensions; these 10 principal components accounted for 63% of the total variance. According to the authors, the remaining principal components contained variance unrelated to separating the four cancers.
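The dimension reduction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes NumPy, uses a hypothetical helper named `pca_reduce`, and runs on a random stand-in matrix with the same shape as the data set (63 x 2308), so the retained-variance fraction it reports will not match the 63% figure from the paper.

```python
import numpy as np


def pca_reduce(X, n_components=10):
    """Project the rows of X onto the leading principal components.

    Returns the projected scores and the fraction of total variance
    retained by those components. (Hypothetical helper for illustration.)
    """
    # Center each feature (column) so the SVD yields principal components.
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Scores are the coordinates of each sample in the component basis.
    scores = U[:, :n_components] * s[:n_components]
    # Squared singular values are proportional to component variances.
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return scores, explained


# Synthetic stand-in data (assumption): 63 samples by 2308 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(63, 2308))

scores, explained = pca_reduce(X, n_components=10)
print(scores.shape)  # (63, 10)
print(f"fraction of variance retained: {explained:.3f}")
```

On the real expression matrix, `explained` would play the role of the proportion of variability that the paper reports for its 10 components.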

| Sample Size | Number of Features | Number of Classes | Disease |
| --- | --- | --- | --- |
| 63 | 2,308 | 4 | SRBCT |

Data Source and Preprocessing

We obtained the Khan data from the pamr package on CRAN. We use the 2308 filtered genes, but we do not provide the PCA-reduced version of the data described in the Khan paper.


Link to Original Paper

Project Website with Supplemental Data and Information

BibTeX Record

author = {Khan, J and Wei, J S and Ringn{\'e}r, M and Saal, L H and Ladanyi, M and Westermann, F and Berthold, F and Schwab, M and Antonescu, C R and Peterson, C and Meltzer, P S},
title = {{Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks.}},
journal = {Nature Medicine},
year = {2001},
volume = {7},
number = {6},
pages = {673--679},
month = jun