Cluster Analysis

Clustering wines
k-means

In [None]:
data(wine, package='rattle')
head(wine)

In [None]:
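wine.stand <- scale(wine[-1])  # Standardize the variables (column 1, the wine type, is dropped)

# K-Means
set.seed(1234)  # k-means starts from random centers; fix the seed for reproducibility
k.means.fit <- kmeans(wine.stand, 3) # k = 3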
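
As a small illustration of what scale() does in the cell above, the sketch below standardizes a toy matrix by hand and compares the result with scale(). The matrix values and the names x, z_manual and z_scale are made up for this example and are not part of the wine data.

```r
# Toy matrix (hypothetical values, not taken from the wine data)
x <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), ncol = 2,
            dimnames = list(NULL, c("a", "b")))

# Manual z-score: subtract each column mean, then divide by each column sd
centered <- sweep(x, 2, colMeans(x))                 # default FUN is "-"
z_manual <- sweep(centered, 2, apply(x, 2, sd), "/")

# scale() does the same with its defaults (center = TRUE, scale = TRUE)
z_scale <- scale(x)

all.equal(z_manual, z_scale, check.attributes = FALSE)
colMeans(z_scale)       # column means are numerically zero
apply(z_scale, 2, sd)   # column standard deviations are one
```

After standardization every variable contributes on the same scale to the Euclidean distances that k-means uses.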

In [None]:
# k.means.fit contains all the elements of the clustering output:
attributes(k.means.fit)

In [None]:
# Centroids:
k.means.fit$centers

In [None]:
# Clusters:
k.means.fit$cluster

In [None]:
# Cluster size:
k.means.fit$size

In [None]:
wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}

wssplot(wine.stand, nc=6) 
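
A plot like the one produced by wssplot is usually read by looking for the bend ("elbow") where adding another cluster stops reducing the within-groups sum of squares by much. The sketch below shows this on simulated data with three well-separated groups. The data, the seed and the names toy, wss and drop are assumptions made for this illustration, and nstart = 20 is added only to make each fit less dependent on its random start.

```r
# Simulated data: three compact groups centred at 0, 5 and 10 (illustration only)
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# Reduction gained by each additional cluster, as a share of the k = 1 value
drop <- -diff(wss)
round(drop / wss[1], 3)

plot(1:6, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
```

In this toy case almost all of the reduction happens by k = 3, which is where the curve flattens.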

In [None]:
# The cluster package lets us represent the cluster solution in 2 dimensions (with the aid of PCA):
library(cluster)
clusplot(wine.stand, k.means.fit$cluster, main='2D representation of the Cluster solution',
         color=TRUE, shade=TRUE,
         labels=2, lines=0)

In [None]:
#In order to evaluate the clustering performance we build a confusion matrix:
table(wine[,1],k.means.fit$cluster)
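
When reading such a table, keep in mind that cluster numbers are arbitrary, so the large counts need not lie on the diagonal. One way to summarize the agreement in a single number is purity, sketched below. The label vectors are hypothetical and are not the wine results, and purity is only one possible summary.

```r
# Hypothetical labels, for illustration only (not the wine results)
truth   <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
cluster <- c(2, 2, 2, 3, 3, 1, 1, 1, 1, 1)

# Rows are the known classes, columns are the cluster assignments
tab <- table(truth, cluster)
tab

# Purity: match each cluster to its most frequent true class,
# then take the share of observations covered by those matches
purity <- sum(apply(tab, 2, max)) / sum(tab)
purity
```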

Hierarchical clustering:
Hierarchical methods use a distance matrix as an input for the clustering algorithm. The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another.
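
A minimal sketch of how the metric can change which points count as close: with the three made-up points below, B is nearer to A under Euclidean distance, while C is nearer to A under Manhattan distance. The points and the names pts, d_euc and d_man are assumptions for this example only.

```r
# Three hypothetical points in the plane
pts <- rbind(A = c(0, 0), B = c(3, 3), C = c(0, 5))

d_euc <- as.matrix(dist(pts, method = "euclidean"))
d_man <- as.matrix(dist(pts, method = "manhattan"))

d_euc["A", c("B", "C")]   # A-B = sqrt(18), A-C = 5
d_man["A", c("B", "C")]   # A-B = 6,        A-C = 5

# Nearest neighbour of A under each metric
names(which.min(d_euc["A", c("B", "C")]))
names(which.min(d_man["A", c("B", "C")]))
```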

In [None]:
d <- dist(wine.stand, method = "euclidean") # Euclidean distance matrix.

In [None]:
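# We use the Euclidean distances as input to the clustering algorithm.
# Ward's minimum variance criterion minimizes the total within-cluster variance;
# with unsquared Euclidean distances it is method="ward.D2" (the old name "ward" now maps to "ward.D").
H.fit <- hclust(d, method="ward.D2")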

In [None]:
# The clustering output can be displayed in a dendrogram
plot(H.fit) # display dendrogram
groups <- cutree(H.fit, k=3) # cut tree into 3 clusters
# draw dendrogram with red borders around the 3 clusters
rect.hclust(H.fit, k=3, border="red") 

In [None]:
table(wine[,1],groups)

Study case I: European Protein Consumption
We consider 25 European countries (n = 25 units) and their protein intakes (in percent) from nine major food sources (p = 9). The first rows of the data are shown below.


In [None]:
url = 'http://www.biz.uiowa.edu/faculty/jledolter/DataMining/protein.csv'
food <- read.csv(url)
head(food)

In [None]:
# We start by clustering on just red and white meat (p = 2) with k = 3 clusters.
set.seed(123456789) ## to fix the random starting clusters
grpMeat <- kmeans(food[,c("WhiteMeat","RedMeat")], centers=3, nstart=10)
grpMeat

In [None]:
## list of cluster assignments
o=order(grpMeat$cluster)
data.frame(food$Country[o],grpMeat$cluster[o])

In [None]:
# To see a graphical representation of the clustering solution, we plot the cluster assignments on a scatter plot of red meat against white meat:
# (full column names are used; food$Red and food$White only work through partial matching)
plot(food$RedMeat, food$WhiteMeat, type="n", xlim=c(3,19), xlab="Red Meat", ylab="White Meat")
text(x=food$RedMeat, y=food$WhiteMeat, labels=food$Country, col=grpMeat$cluster+1)

Next, we cluster on all nine protein groups and ask for seven clusters. The resulting clusters, shown in color on a scatter plot of white meat against red meat (any other pair of features could be selected), make a lot of sense: countries in close geographic proximity tend to be clustered into the same group.

In [None]:
## Same analysis, but now clustering on all protein groups
## and with the number of clusters changed to 7
set.seed(123456789)
grpProtein <- kmeans(food[,-1], centers=7, nstart=10)
o=order(grpProtein$cluster)
data.frame(food$Country[o],grpProtein$cluster[o])

In [None]:
library(cluster)
clusplot(food[,-1], grpProtein$cluster, main='2D representation of the Cluster solution', color=TRUE, shade=TRUE, labels=2, lines=0)

Alternatively, we can take a hierarchical approach using the agnes function in the cluster package. The argument diss=FALSE indicates that the input is raw data, from which agnes computes the dissimilarity matrix itself. The argument metric="euclidean" selects Euclidean distance. No standardization is applied and the linkage is the default "average" linkage.

In [None]:
## drop column 1 (Country) so that only the numeric protein columns enter the distances
foodagg <- agnes(food[,-1], diss=FALSE, metric="euclidean")
plot(foodagg, main='Dendrogram') ## dendrogram
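
To obtain flat cluster assignments from an agnes fit, one option is to convert it with as.hclust() and cut the tree with cutree(), as was done for hclust above. The sketch below uses six made-up points instead of the protein data; toy, fit and toy_groups are names chosen for this example only.

```r
library(cluster)

# Six hypothetical points forming two obvious groups
toy <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))

# Same settings as described above: raw data, Euclidean distance, average linkage
fit <- agnes(toy, diss = FALSE, metric = "euclidean", method = "average")

# Convert to an hclust object and cut the tree into k = 2 groups
toy_groups <- cutree(as.hclust(fit), k = 2)
toy_groups

fit$ac   # agglomerative coefficient; values near 1 suggest strong structure
```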

Study case II: Social Network Clustering Analysis

From the top 500 words appearing across all pages, 36 words were chosen to represent five categories of interests, namely extracurricular activities, fashion, religion, romance, and antisocial behavior. The 36 words include terms such as football, sexy, kissed, bible, shopping, death, and drugs. The final dataset indicates, for each person, how many times each word appeared in the person’s SNS profile.

In [None]:
teens <- read.csv("d:/student/snsdata.csv")
head(teens,3)

In [None]:
dim(teens)

In [None]:
str(teens)

In [None]:
summary(teens$age)

In [None]:
teens = na.omit(teens)
dim(teens)

We’ll start our cluster analysis by considering only the 36 features that represent the number of times various interests appeared on the SNS profiles of teens. For convenience, let’s make a data frame containing only these features:

In [None]:
interests <- teens[5:40]

In [None]:
#To apply z-score standardization to the interests data frame, we can use the scale() function with lapply(), as follows:
interests_z <- as.data.frame(lapply(interests, scale))

In [None]:
set.seed(2345)  # fix the random starting centers so the clusters are reproducible
teen_clusters <- kmeans(interests_z, 5)

In [None]:
teen_clusters$size

In [None]:
teen_clusters$centers
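
Because the features were z-scored, a center value above zero means the cluster mentions that word more often than the average profile, and a value below zero means less often. A quick way to read a centers matrix is to pick the highest-scoring feature per cluster, as sketched below. The matrix is hypothetical and much smaller than teen_clusters$centers, and its values are invented for illustration.

```r
# Hypothetical centers matrix: rows are clusters, columns are z-scored features
centers <- rbind("1" = c(football = 1.2,  shopping = -0.3, bible = 0.1),
                 "2" = c(football = -0.4, shopping = 0.9,  bible = -0.2))

# Feature with the largest center value in each cluster
top_feature <- apply(centers, 1, function(r) names(r)[which.max(r)])
top_feature

# Features above the overall average (z > 0) in each cluster
centers > 0
```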

In [None]:
par(mfrow=c(2,3))  # 2 x 3 grid so that all five clusters fit on one page
# one pie chart per cluster, showing the total word counts within that cluster
for (k in 1:5) {
  pie(colSums(interests[teen_clusters$cluster==k,]), cex=0.5, main=paste("Cluster", k))
}