# Clustering of Edgar Anderson's Iris Data

To demonstrate k-means clustering, we will use Fisher’s and Anderson’s [`iris`](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html) dataset from the R [`datasets`](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html) package. The dataset gives the sepal length, sepal width, petal length, and petal width, in centimeters, for 50 flowers from each of 3 species of iris (setosa, versicolor, and virginica). For convenience, a copy of this dataset is provided at http://cobweb.cs.uga.edu/~mec/iris.csv. First, let's load the data into a [`Relation`](http://cobweb.cs.uga.edu/~jam/scalation_1.3/scalation_mathstat/target/scala-2.12/api/scalation/relalgebra/Relation$.html) to see what's available:

In [6]:
import scalation.columnar_db._
val url = "http://cobweb.cs.uga.edu/~mec/iris.csv"
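// Arguments: source URL, relation name, column domains (S = string, D = double), key column index, field separator
val rel = Relation(url, "longley", "SDDDDS", 0, ",")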
rel.show()

import scalation.columnar_db._
url: String = http://cobweb.cs.uga.edu/~mec/iris.csv
0
rel: scalation.columnar_db.Relation =
Relation(longley, 0,
WrappedArray(id, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species),
VectorS(1,	2,	3,	4,	5,	6,	7,	8,	9,	10,	11,	12,	13,	14,	15,	16,	17,	18,	19,	20,	21,	22,	23,	24,	25,	26,	27,	28,	29,	30,	31,	32,	33,	34,	35,	36,	37,	38,	39,	40,	41,	42,	43,	44,	45,	46,	47,	48,	49,	50,	51,	52,	53,	54,	55,	56,	57,	58,	59,	60,	61,	62,	63,	64,	65,	66,	67,	68,	69,	70,	71,	72,	73,	74,	75,	76,	77,	78,	79,	80,	81,	82,	83,	84,	85,	86,	87,	88,	89,	90,	91,	92,	93,	94,	95,	96,	97,	98,	99,	100,	101,	102,	103,	104,	105,	106,	107,	108,	109,	110,	111,	112,	113,	114,	115,	116,	117,	118,	119,	120,	121,	122,	123,	124,	125,	126,	127,	128,	129,	130,	131,	132,	133,	134,	135,	136,	137,	138,	139,	140,	141,	142,	143,	144,	145,	146,	147,	148,	149,	150)
VectorD...
|--------------------------------------------------------------------------------------------------------------|


Let's see if a cluster analysis reveals any relationship between petal size and species. To do this, we will build a matrix whose rows are the (`Petal.Length`, `Petal.Width`) points, taken from columns 3 and 4 of the relation (counting from 0), then attempt to partition them into three clusters (i.e., the number of species) using the *k-means++* clustering algorithm provided in ScalaTion's `KMeansPPClusterer`.

In [7]:
import scalation.analytics.clusterer.KMeansPPClusterer
import scalation.linalgebra.MatrixD
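// Columns 3 and 4 of the relation hold Petal.Length and Petal.Width
val x = rel.toMatriD((3 to 4).toSeq).asInstanceOf[MatrixD]
// Cluster the points into 3 groups: cl is the clusterer, c holds one cluster index per row
val (cl, c) = KMeansPPClusterer(x, 3)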

import scalation.analytics.clusterer.KMeansPPClusterer
import scalation.linalgebra.MatrixD
x: scalation.linalgebra.MatrixD =

MatrixD(1.40000,	0.200000,
	1.40000,	0.200000,
	1.30000,	0.200000,
	1.50000,	0.200000,
	1.40000,	0.200000,
	1.70000,	0.400000,
	1.40000,	0.300000,
	1.50000,	0.200000,
	1.40000,	0.200000,
	1.50000,	0.100000,
	1.50000,	0.200000,
	1.60000,	0.200000,
	1.40000,	0.100000,
	1.10000,	0.100000,
	1.20000,	0.200000,
	1.50000,	0.400000,
	1.30000,	0.400000,
	1.40000,	0.300000,
	1.70000,	0.300000,
	1.50000,	0.300000,
	1.70000,	0.200000,
	1.50000,	0.400000,
	1.00000,	0.200000,
	1.70000,	0.500000,
	1.90000,	0.200000,
	1.60000,	0.200000,
	1.60000,	0.400000,
	1.50000,	0.200000,
	1.40000,	0.200000,
	1.60000,	0.200000,
	1.60000,	0.200000,
	1.50000,	0.400000,
	1.50000,	0.100000,
	1.40000,	0.200000,
	1.50000,	0.200000,
	1.20000,	0.200000,
	1.30000,	0.200000,
	1.40000,	0.100...
cl: scalation.analytics.clusterer.KMeansPPClusterer = scalation.analytics.clusterer.KMeansPPClusterer@5a9ccb
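
For intuition about what a k-means++ clusterer does, here is a minimal sketch in plain Scala. It is not ScalaTion's implementation: the object `KMeansPPSketch` and its methods `seed` and `cluster` are hypothetical names introduced only for illustration. The sketch assumes the standard formulation, in which the first center is picked uniformly at random, each later center is picked with probability proportional to its squared distance from the nearest center chosen so far, and the usual assign-and-update iterations follow.

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch only; names are hypothetical and not part of ScalaTion.
object KMeansPPSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def dist2(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // k-means++ seeding: first center uniform at random, later centers drawn
  // with probability proportional to squared distance to the nearest center.
  def seed(x: Array[Point], k: Int, rng: Random): Array[Point] = {
    val centers = ArrayBuffer(x(rng.nextInt(x.length)))
    while (centers.length < k) {
      val d2 = x.map(p => centers.map(c => dist2(p, c)).min)
      var r = rng.nextDouble() * d2.sum
      var i = 0
      // Walk the cumulative weights until the random threshold is reached.
      while (i < x.length - 1 && r >= d2(i)) { r -= d2(i); i += 1 }
      centers += x(i)
    }
    centers.toArray
  }

  // Assign each point to its nearest center, move each center to the mean of
  // its points, and repeat until the assignments stop changing.
  def cluster(x: Array[Point], k: Int, rng: Random = new Random(0),
              maxIter: Int = 100): (Array[Point], Array[Int]) = {
    val dim = x(0).length
    var centers = seed(x, k, rng)
    var assign = Array.fill(x.length)(-1)
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val next = x.map(p => centers.indices.minBy(j => dist2(p, centers(j))))
      changed = !next.sameElements(assign)
      assign = next
      centers = Array.tabulate(k) { j =>
        val members = x.indices.filter(i => assign(i) == j).map(i => x(i))
        if (members.isEmpty) centers(j) // leave an empty cluster's center in place
        else Array.tabulate(dim)(d => members.map(_(d)).sum / members.length)
      }
      iter += 1
    }
    (centers, assign)
  }
}
```

Calling `KMeansPPSketch.cluster(points, 3)` on an array of two-dimensional points returns the final centers together with one cluster index per point, which plays the same role as `c` above. Because the seeding is random, the cluster numbering can differ from run to run.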

Using the cluster assignments in `c`, let's select the rows of the dataset that belong to each cluster and see whether each cluster corresponds to a single species.

In [8]:
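// For each cluster k: collect the row indices assigned to k, select those rows, and show their Species column
for (k <- 0 until 3) {
    val rows = c.zipWithIndex.filter(_._1 == k).map(_._2).toSeq
    rel.selectAt(rows).pi("Species").show()
}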

|--------------------|
| Relation name = longley_s_9_p_10, key-column = -1    |
|--------------------|
|            Species |
|--------------------|
|         versicolor |
|         versicolor |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |
|          virginica |


Based on the output, each of the three clusters is clearly dominated by a single species. In fact, the second cluster contains only a single species. This suggests that petal size is related to species.
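
To put a number on "dominated", one option is to cross-tabulate cluster assignments against species labels and compute the purity of the clustering. The sketch below is plain Scala and does not use ScalaTion. The object `ClusterCrossTab` and its methods are hypothetical names, and it assumes the cluster indices and the species labels are already available as two sequences of equal length (for example, `c.toSeq` and the values of the `Species` column, however they are extracted).

```scala
// Illustrative sketch only; names are hypothetical and not part of ScalaTion.
object ClusterCrossTab {
  // For each cluster index, count how many rows carry each label.
  def crossTab(assign: Seq[Int], labels: Seq[String]): Map[Int, Map[String, Int]] = {
    require(assign.length == labels.length, "need one label per assignment")
    assign.zip(labels)
      .groupBy(_._1)
      .map { case (k, pairs) =>
        k -> pairs.groupBy(_._2).map { case (label, group) => label -> group.length }
      }
  }

  // Purity: the fraction of rows whose label is the majority label of their cluster.
  def purity(assign: Seq[Int], labels: Seq[String]): Double =
    crossTab(assign, labels).values.map(_.values.max).sum.toDouble / assign.length
}
```

A purity of 1.0 would mean every cluster contains exactly one species, and lower values indicate more mixing. The per-cluster counts from `crossTab` show which species are being confused with each other.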

## References
* Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) *The New S Language.* Wadsworth & Brooks/Cole.
* Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. *Annals of Eugenics*, 7, Part II, 179–188.
* Anderson, E. (1935) The irises of the Gaspé Peninsula. *Bulletin of the American Iris Society*, 59, 2–5.