chapter4.Rmd

# Chapter 4: Clustering and classification

```{r setup4, include=FALSE}
library(MASS)
library(tidyverse)
library(corrplot)
library(ggplot2)
library(GGally)
library(plotly)
data("Boston")

```

## Introduction

In this chapter I will look into closely to Boston dataframe that has 506 rows and 14 columns. The dataframe contains information about factors effecting housing values in suburban areas of Boston. These factors (further addessed as variables) include e.g crimerate (crim), full value property tax rate per 10000$ (tax) and pupil- teacher ratio (pratio). Full description of the data frame can be found here: [Housing Values in Suburbs of Boston](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html )

```{r columns, echo=FALSE}
colnames(Boston)
str(Boston)
dim(Boston)

```

Below is a graphical overview done with pairs- function and the full summary of all variables. 


```{r overview4, echo=FALSE}
pairs(Boston)
summary(Boston)


```
From the summary, we can see that all the variables in this data frame are numeric. All the data in this frame is percents, proprotions, mean or median values or calculated with their own function. For example Medv is the median value of homes in 1000$s and black is the amount of blacks calculated with the following function: 1000(Bk - 0.63)^2, Bk is the proportion of blacks per town. As an exception to others, Charles River dummy variable (Chas) is a binary variable with values 1 or 0 indicating if the river crosses the area or not.


```{r pressure, echo=FALSE}

cor_matrix<-cor(Boston) %>% round(digits= 2)

corrplot(cor_matrix, type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6, method="circle")
```

Above is the correlation chart of the values. Size of the circle varies according to correlation coefficents. The color of the circle indicates whether it is negatively or positively correlating.

In here it's visible that rad (index of accessibility to radial highways) correlates positively to dis (weighted mean of distances to five Boston employment centres) and lstat(lower status of the population (percent)) correlates positively with medv (median value of owner-occupied homes in \$1000s). 

##Standardizing 

```{r scale, echo=FALSE}
boston_scaled <- scale(Boston)
summary(boston_scaled)
class(boston_scaled)
boston_scaled <- as.data.frame(boston_scaled)
```
Above here is the summary of Boston data frame after standardizing it with scale-function. Compared to previous summary, it is clear that the values have changed. All the values have been minimized in size, which can bee seen best by looking at the min- values. All of them are negative, where as previously they had positive values. Also a remarkable change is that median values have all turned to zeros. This has to do with the standardization function. In standadrization the value, standard score, is counted from function (x- mean)/sd, which leads to no mean values at all. Here is a helpful post about understanding scale function: [Understanding scale in r](http://stackoverflow.com/questions/20256028/understanding-scale-in-r)

```{r categorial, echo=FALSE}
scaled_crim <- boston_scaled$crim

summary(scaled_crim)

bins <- quantile(scaled_crim)

crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

table(crime)

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
```

```{r divide, echo=FALSE}
n <- nrow(boston_scaled)

ind <- sample(n,  size = n * 0.8)

train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]


print("Training set")
nrow(train)
print("Testing set")
nrow(test)
``` 

I divided the scaled data frame into two sets: test and training set. The training set has 80% of the data from Boston scaled and the total number of rows for both sets is printed here. 

##Linear Discriminant Analysis (LDA)

In this section, I fitted a Linear Discriminant Analysis (LDA, in short) by choosing categorial variable crime as target variable and rest as explanatory in R:s lda- function. 

```{r LDA, echo=FALSE}
lda.fit <- lda(crime ~., data = train)

lda.fit

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col= classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)

```


The graph of LDA model shows that clearly high crime rate stands out from the rest of the data. Also, index of accessibility to radial highways (rad) is separated from the rest of the arrows and pointing towards high value of crime rate. 

 
```{r LDAwo, echo=FALSE}
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

lda.pred <- predict(lda.fit, newdata = test)

table(correct = correct_classes, predicted = lda.pred$class)

```

From test data set the variable crime has been dropped and saved separately. Above is the prediction run on our LDA model. From predicted values, 67/102 were predicted the same as the correct values. The best prediction was done on high- values: 22/24. As seen from the graph before, this was expected, since high values differ from the rest of the values. Worst prediction was done on medium low- values 13/29, which is approximately 44% predicted correctly. Looking back at the graph above, medium low- values are most mixed with low and medium high- values, making predicting difficult.

## K-means

Unlike LDA, K-means is a clustering method that divides observations into clusters. The number of clusters is determined beforehand by adding number of centers as attribute to K-means function.

First, I reloaded the Boston data frame and scaled it again. Inorder execute K-means, the distances of observation need to be calculated. I used here eucledian, since it's the most popular distance measurement type. 
```{r reload, echo=FALSE}
data("Boston")
boston_scaled <- scale(Boston)
dist_eu <- dist(Boston)
km <-kmeans(dist_eu, centers = 15)

pairs(Boston, col = km$cluster)

```


Above is the plot of Boston data frame coloured with K-means clusters. K-means is ran here with 15 clusters. The graph is difficult to interpret, which could be because of wrong number of clusters. Let's determine the correct amount of clusters using the total of within cluster sum of squares (WCSS).


```{r wcss, echo=FALSE}
set.seed(123)

dist_eu <- dist(Boston)

k_max <- 15

twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')

``` 

From the visualization of the result, we can see that the biggest drop of total WCSS is when the amount of clusters is 2. This means that 2 clusters is the best amount for our data frame. 

```{r kmeans, echo=FALSE}

km <-kmeans(dist_eu, centers = 2)

pairs(Boston, col = km$cluster)

```

The new visualization with only two clusters looks different to the one drawn before. Crime rate clearly (with a few exceptions) belongs to one cluster seen in red and proportion of residential owned lots (zn) to other cluster coloured black. To the same cluster with crime rate belong most of the observations from lower status of population (lstat), proportion of blacks (black) and nitrogen oxide concentration (nox). This means that part of town with high crime rate has also most likely lower status population, black people and high amounts of nitrogen oxide in the air. Areas with high amount of residential owned plots are likely to have accessibility to radial highways (rad), high full value tax rate (tax) and more pupils towards one teacher.


##Bonus

To execute the bonus task I reloaded the Boston dataset and standardized it again. The graph below shows the K-means clustering of this dataset with 4 centers.


```{r bonus, echo=FALSE}
data("Boston")
boston_scaled <- scale(Boston)
dist_eu <- dist(Boston)
km <-kmeans(dist_eu, centers = 4)

pairs(Boston, col = km$cluster)

lda.fit2 <- lda(km$cluster ~., data = Boston)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
plot(lda.fit2, dimen = 2, col= classes, pch=classes)
lda.arrows(lda.fit2, myscale = 2)


``` 

The graph above is LDA model with the K-means clusters as target variable and Boston data frame as data. The variable with the longest arrow is nitrogen oxides concentration (nox) and it divides the data frames observations into two separate areas. From the other variables it is harder to say which one could be influential linear separator, so let's look at the graphs vectors more closely.

```{r bonusclose, echo=FALSE}
plot(lda.fit2, dimen = 2, col= classes, pch=classes)
lda.arrows(lda.fit2, myscale = 10)

```

The two variables showing to be linear separators are variables index of accessibility to radial highways (rad) and Charles River dummy variable (chas) that separate the observations from cluster one from the rest of the group. 


##Super bonus


```{r superbonus, echo=FALSE}

model_predictors <- dplyr::select(train, -crime)

dim(model_predictors)
dim(lda.fit$scaling)

matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)

```

As a super bonus task I ran the given code that makes the three matrix products from previous LDA (not the one in Bonus task) and inserts the into plot_ly function. The result is this beautiful 3D graph. If you move your mouse over the observations, you can see their exact coordinates.