# Homework 08
This homework is based on the clustering lectures. Check the lecture notes and TA notes - they should help!

## Question 1
This question will walk you through creating your own `kmeans` function.

#### a) What are the steps of `kmeans`?
**Hint**: There are 4 steps/builder functions that you'll need.

The four steps of k-means are:

1. Assign each point to one of the clusters at random.
2. Calculate the mean position (center) of each cluster from the current assignments.
3. Loop through the points and reassign each one to the cluster whose center is closest.
4. Repeat steps 2 and 3 until the centers stop moving.

#### b) Create the builder function for step 1.

In [9]:
library(dplyr)

label_randomly <- function(n_points, n_clusters){
  # Cycle through cluster IDs 1..n_clusters so every cluster gets points, then shuffle
  sample(((1:n_points) %% n_clusters) + 1, n_points, replace = FALSE)
}



#### c) Create the builder function for step 2.

In [10]:
get_cluster_means <- function(data, labels){
  data %>%
    mutate(label__ = labels) %>%
    group_by(label__) %>%
    summarize(across(everything(), mean), .groups = "drop") %>%
    arrange(label__)
}

#### d) Create the builder function for step 3.
*Hint*: There are two ways to do this part - one is significantly more efficient than the other. You can do either.  

In [11]:
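assign_cluster <- function(data, means){
  data_matrix <- as.matrix(data)
  means_matrix <- as.matrix(means %>% dplyr::select(-label__))
  # Pair every data row with every mean row:
  # dii = 1,1,...,2,2,... and mii = 1,2,...,1,2,...
  dii <- rep(1:nrow(data), each = nrow(means))
  mii <- rep(1:nrow(means), times = nrow(data))
  # drop = FALSE keeps these as matrices even for single-column data
  data_repped <- data_matrix[dii, , drop = FALSE]
  means_repped <- means_matrix[mii, , drop = FALSE]
  # Squared Euclidean distance for every (point, mean) pair
  diff_squared <- (data_repped - means_repped)^2
  all_distances <- rowSums(diff_squared)
  # Keep the closest mean for each point, returned in data-row order
  tibble(dii=dii, mii=mii, distance=all_distances) %>%
    group_by(dii) %>%
    arrange(distance) %>%
    filter(row_number()==1) %>%
    ungroup() %>%
    arrange(dii) %>%
    pull(mii)
}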

#### e) Create the builder function for step 4.

In [12]:
kmeans_done <- function(old_means, new_means, eps=1e-6){
  om <- as.matrix(old_means)
  nm <- as.matrix(new_means)
  # Mean Euclidean distance that the centers moved between iterations
  m <- mean(sqrt(rowSums((om - nm)^2)))
  m < eps
}

#### f) Combine them all into your own `kmeans` function.

In [16]:
mykmeans <- function(data, n_clusters, eps=1e-6, max_it = 1000, verbose = FALSE){
  labels <- label_randomly(nrow(data), n_clusters)
  old_means <- get_cluster_means(data, labels)
  new_means <- old_means  # returned unchanged if the loop never runs (max_it = 0)
  done <- FALSE
  it <- 0
  # && is the scalar logical operator meant for control flow
  while(!done && it < max_it){
    labels <- assign_cluster(data, old_means)
    new_means <- get_cluster_means(data, labels)
    # Pass eps through so the caller's tolerance is actually used
    if(kmeans_done(old_means, new_means, eps)){
      done <- TRUE
    } else {
      old_means <- new_means
      it <- it + 1
      if(verbose){
        cat(sprintf("%d\n", it))
      }
    }
  }
  list(labels=labels, means=new_means)
}

## Question 2
This is where we'll test your `kmeans` function.

#### a) Read in the `voltages_df.csv` data set.

In [19]:
voltages_df <- read.csv('voltages_df.csv')

#### b) Call your `kmeans` function with 3 clusters. Print the results with `results$labels` and `results$means`.

In [20]:
results <- mykmeans(voltages_df, 3)
print(results$labels)
print(results$means)

  [1] 1 3 3 3 3 3 2 2 3 1 1 1 2 1 3 1 1 1 1 1 3 3 3 3 3 1 1 1 2 2 2 2 3 2 2 3 3
 [38] 1 2 3 1 1 1 2 3 2 2 3 2 2 3 2 1 1 2 1 3 1 2 3 1 1 3 1 2 2 3 2 1 2 1 3 2 2
 [75] 3 3 1 2 2 2 2 2 3 2 2 3 3 3 1 2 2 2 3 2 3 3 2 3 3 2 2 2 3 2 1 1 2 1 1 3 1
[112] 1 3 2 1 2 1 1 3 3 3 3 1 1 2 3 3 2 1 2 3 1 3 2 1 3 3 3 1 2 1 3 3 2 1 1 3 1
[149] 1 2 1 2 3 3 2 1 2 2 3 1 2 2 1 3 1 2 3 1 3 1 1 2 3 1 2 2 3 2 1 3 2 3 2 1 2
[186] 3 3 1 3 1 1 3 1 1 2 2 1 3 1 1 2 1 2 2 3 1 3 2 3 1 3 2 1 2 2 3 2 2 3 3 3 3
[223] 3 3 3 2 2 2 2 1 3 1 2 3 3 2 2 1 1 2 2 1 2 2 3 2 1 2 2 3 2 3 1 2 3 2 2 1 1
[260] 1 3 3 2 1 3 3 3 2 2 2 1 3 3 3 2 1 2 2 3 3 2 1 2 1 1 3 2 3 3 2 2 1 2 2 1 3
[297] 1 2 2 3 3 2 1 1 1 2 1 3 3 3 1 1 1 1 1 1 3 1 3 1 1 2 3 2 1 1 1 1 3 1 3 1 2
[334] 1 3 3 2 2 1 3 2 3 2 2 1 2 3 2 1 1 2 3 3 3 1 2 3 1 2 2 3 3 3 3 1 2 1 3 1 1
[371] 1 1 1 3 3 2 2 1 1 1 1 2 2 1 1 2 2 3 3 2 1 3 3 3 3 2 3 3 2 1 1 1 1 3 3 1 1
[408] 2 2 2 3 2 2 1 3 3 3 2 3 3 3 1 1 3 1 2 3 3 3 3 2 2 3 2 1 3 1 1 1 2 2 2 2 1
[445] 3 1 1 1 2 3 1 1 1 3 3 1 3 3 1 2 3 

#### c) Call R's `kmeans` function with 3 clusters. Print the results with `results$cluster` and `results$centers` (R's `kmeans` result has no `labels` element).
*Hint*: Use the `as.matrix()` function to make the `voltages_df` data frame a matrix before calling `kmeans()`.

In [32]:
resultsR <- kmeans(as.matrix(voltages_df),3)
print(resultsR$cluster)
print(resultsR$centers) # used this instead of results$labels as that was not working

  [1] 2 1 1 1 1 1 3 3 1 2 2 2 3 2 1 2 2 2 2 2 1 1 1 1 1 2 2 2 3 3 3 3 1 3 3 1 1
 [38] 2 3 1 2 2 2 3 1 3 3 1 3 3 1 3 2 2 3 2 1 2 3 1 2 2 1 2 3 3 1 3 2 3 2 1 3 3
 [75] 1 1 2 3 3 3 3 3 1 3 3 1 1 1 2 3 3 3 1 3 1 1 3 1 1 3 3 3 1 3 2 2 3 2 2 1 2
[112] 2 1 3 2 3 2 2 1 1 1 1 2 2 3 1 1 3 2 3 1 2 1 3 2 1 1 1 2 3 2 1 1 3 2 2 1 2
[149] 2 3 2 3 1 1 3 2 3 3 1 2 3 3 2 1 2 3 1 2 1 2 2 3 1 2 3 3 1 3 2 1 3 1 3 2 3
[186] 1 1 2 1 2 2 1 2 2 3 3 2 1 2 2 3 2 3 3 1 2 1 3 1 2 1 3 2 3 3 1 3 3 1 1 1 1
[223] 1 1 1 3 3 3 3 2 1 2 3 1 1 3 3 2 2 3 3 2 3 3 1 3 2 3 3 1 3 1 2 3 1 3 3 2 2
[260] 2 1 1 3 2 1 1 1 3 3 3 2 1 1 1 3 2 3 3 1 1 3 2 3 2 2 1 3 1 1 3 3 2 3 3 2 1
[297] 2 3 3 1 1 3 2 2 2 3 2 1 1 1 2 2 2 2 2 2 1 2 1 2 2 3 1 3 2 2 2 2 1 2 1 2 3
[334] 2 1 1 3 3 2 1 3 1 3 3 2 3 1 3 2 2 3 1 1 1 2 3 1 2 3 3 1 1 1 1 2 3 2 1 2 2
[371] 2 2 2 1 1 3 3 2 2 2 2 3 3 2 2 3 3 1 1 3 2 1 1 1 1 3 1 1 3 2 2 2 2 1 1 2 2
[408] 3 3 3 1 3 3 2 1 1 1 3 1 1 1 2 2 1 2 3 1 1 1 1 3 3 1 3 2 1 2 2 2 3 3 3 3 2
[445] 1 2 2 2 3 1 2 2 2 1 1 2 1 1 2 3 1 

#### d) Are your labels/clusters the same? If not, why? Are your means the same?

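The label numbers are not the same, because k-means starts from a random initialization and the cluster IDs it ends up with are arbitrary. Comparing the two printouts, though, the groupings themselves appear to match up to a renaming: points labeled 1, 2, and 3 by `mykmeans` are labeled 2, 3, and 1 by R's `kmeans`. If the partitions are identical, the means are the same as well, just listed in a different row order. Different random starts can also converge to genuinely different local optima, in which case both the labels and the means would differ.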
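
One way to check whether two labelings describe the same partition is to cross-tabulate them. The sketch below uses the first ten labels from each printout above; `mine` and `theirs` are hypothetical names standing in for `results$labels` and `resultsR$cluster`.

```r
# First ten labels from each printout (stand-ins for the full vectors).
mine   <- c(1, 3, 3, 3, 3, 3, 2, 2, 3, 1)
theirs <- c(2, 1, 1, 1, 1, 1, 3, 3, 1, 2)

# Rows are my cluster IDs, columns are the other labeling's IDs.
tab <- table(mine, theirs)
print(tab)

# The partitions agree up to renaming when every row and every column
# of the table has exactly one non-zero cell.
same_partition <- all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
print(same_partition)
```

Running the same check on the full label vectors would confirm (or refute) the match for all points, not just the first ten.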

## Question 3
#### a) Explain the process of using a for loop to assign clusters for kmeans.

A for loop assigns clusters by iterating over every data point in the dataset. For each point, an inner loop computes the distance to each centroid, and the point is given the label of the nearest one. This assignment pass is repeated, together with recomputing the means, on every k-means iteration until the centers stop moving.
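
A minimal sketch of the loop-based assignment, assuming the data and the centroids are plain numeric matrices. The function name `assign_cluster_loop` and the toy points are made up for illustration.

```r
# Loop-based assignment (hypothetical helper, not part of the homework code).
assign_cluster_loop <- function(data_matrix, means_matrix) {
  n <- nrow(data_matrix)
  k <- nrow(means_matrix)
  labels <- integer(n)
  for (i in seq_len(n)) {
    best_dist <- Inf
    for (j in seq_len(k)) {
      # Squared Euclidean distance from point i to centroid j
      d <- sum((data_matrix[i, ] - means_matrix[j, ])^2)
      if (d < best_dist) {
        best_dist <- d
        labels[i] <- j
      }
    }
  }
  labels
}

# Toy data: two points near (0, 0) and two near (5, 5).
pts <- matrix(c(0, 0,  0.1, 0.2,  5, 5,  4.8, 5.1), ncol = 2, byrow = TRUE)
ctr <- matrix(c(0, 0,  5, 5), ncol = 2, byrow = TRUE)
print(assign_cluster_loop(pts, ctr))  # 1 1 2 2
```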

#### b) Explain the process of vectorizing the code to assign clusters for kmeans.

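Vectorizing the assignment replaces the loops with matrix math. Each data point is repeated once per centroid and the centroids are tiled to match, so row i of one matrix lines up with row i of the other and every (point, centroid) pair appears exactly once. Subtracting the two matrices, squaring, and taking row sums gives all squared distances at once; each point is then assigned to the centroid with the smallest distance.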
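
An alternative way to vectorize the same step, sketched here under the assumption that the inputs are numeric matrices, is to expand the squared distance as ||x||² − 2x·m + ||m||² and build the full points-by-centroids distance matrix with one matrix product. This is a different formulation from the replicate-and-subtract approach described above, and `assign_cluster_vec` is a hypothetical name.

```r
# Vectorized assignment via the expansion ||x - m||^2 = ||x||^2 - 2 x.m + ||m||^2.
assign_cluster_vec <- function(data_matrix, means_matrix) {
  x_sq <- rowSums(data_matrix^2)    # squared norm of every point (length n)
  m_sq <- rowSums(means_matrix^2)   # squared norm of every centroid (length k)
  # n x k matrix of squared distances
  d2 <- outer(x_sq, m_sq, "+") - 2 * tcrossprod(data_matrix, means_matrix)
  # Column index of the smallest distance in each row
  max.col(-d2, ties.method = "first")
}

# Toy data: two points near (0, 0) and two near (5, 5).
pts <- matrix(c(0, 0,  0.1, 0.2,  5, 5,  4.8, 5.1), ncol = 2, byrow = TRUE)
ctr <- matrix(c(0, 0,  5, 5), ncol = 2, byrow = TRUE)
print(assign_cluster_vec(pts, ctr))  # 1 1 2 2
```

This version only allocates an n × k matrix rather than two (n·k) × p matrices, at the cost of being a little less obvious to read.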

#### c) State which (for loops or vectorizing) is more efficient and why.

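Vectorizing is more efficient in R. The matrix operations run in optimized compiled code over whole arrays, whereas a for loop pays the interpreter overhead for every point and centroid pair, one at a time. The trade-off is memory: the vectorized version builds matrices with (number of points) × (number of clusters) rows.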

## Question 4
#### When does `kmeans` fail? What assumption does `kmeans` use that causes it to fail in this situation?

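`kmeans` fails when the clusters are not compact, convex blobs, for example in radial datasets such as concentric rings. It assumes that clusters are roughly spherical and of similar size, so that each point belongs with its nearest centroid in Euclidean distance. When the true groups are elongated, nested, or very different in spread, the centroids no longer summarize the clusters and the resulting partition gives little useful insight.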
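
The failure on radial data can be reproduced with a small synthetic example. The sketch below uses made-up data (two noisy concentric rings of radius 1 and 5) and R's built-in `kmeans`; the variable names are illustrative only.

```r
set.seed(1)
n <- 200
theta  <- runif(2 * n, 0, 2 * pi)
radius <- c(rep(1, n), rep(5, n)) + rnorm(2 * n, sd = 0.1)
rings  <- cbind(x = radius * cos(theta), y = radius * sin(theta))
truth  <- rep(1:2, each = n)   # 1 = inner ring, 2 = outer ring

fit <- kmeans(rings, centers = 2, nstart = 10)

# Cross-tabulate true ring membership against the k-means clusters.
tab <- table(truth, fit$cluster)
print(tab)

# Fraction of points whose cluster matches their ring, under the better
# of the two possible label matchings.
agreement <- max(sum(diag(tab)), sum(tab) - sum(diag(tab))) / sum(tab)
print(agreement)
```

Because both rings share the same center, no pair of centroids can separate them: the boundary between two centroids is a straight line, so each cluster ends up with a slice of both rings and the agreement stays well below 1.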

## Question 5
#### What assumption do Gaussian mixture models make?

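Gaussian mixture models assume that the data are generated from a mixture of K Gaussian distributions, each with its own mean, covariance, and mixing weight, all estimated from the data. Each point gets a probability of belonging to every component rather than a single hard label, so clusters can overlap and can be elliptical, which makes GMMs a better fit than `kmeans` in some cases.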
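
To make the "parameters are estimated from the data" part concrete, here is a minimal sketch of fitting a two-component, one-dimensional mixture with the EM algorithm in base R. The data are synthetic, the fixed 100 iterations are an arbitrary choice, and a real analysis would more likely use a dedicated package.

```r
set.seed(1)
# Synthetic data: two overlapping Gaussian groups centered at 0 and 5.
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 5, sd = 1))

# Initial guesses for means, standard deviations, and mixing weights.
mu    <- c(min(x), max(x))
sigma <- c(1, 1)
w     <- c(0.5, 0.5)

for (iter in 1:100) {
  # E-step: responsibility of each component for each point (soft labels).
  dens <- cbind(w[1] * dnorm(x, mu[1], sigma[1]),
                w[2] * dnorm(x, mu[2], sigma[2]))
  resp <- dens / rowSums(dens)

  # M-step: re-estimate the parameters as responsibility-weighted averages.
  nk    <- colSums(resp)
  w     <- nk / length(x)
  mu    <- colSums(resp * x) / nk
  sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
}

print(round(mu, 2))
print(round(sigma, 2))
print(round(w, 2))
# Each row of resp sums to 1: a point can belong partly to both components.
print(head(round(resp, 3)))
```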


## Question 6
#### What assumption does spectral clustering make? Why does this help us?

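Spectral clustering assumes that two points are more likely to be in the same cluster if they are close to one another; that is, clusters are defined by connectivity between neighboring points rather than by distance to a shared center. This helps because groups that are connected but not compact, such as rings or other non-convex shapes, can still be separated, which `kmeans` cannot do on its own.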
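
A minimal sketch of the idea in base R, assuming a Gaussian similarity between points and the normalized graph Laplacian (one common variant of spectral clustering, not necessarily the one from lecture). The two-ring data and the bandwidth `sigma = 0.5` are made up for illustration.

```r
set.seed(1)
n <- 100
theta  <- seq(0, 2 * pi, length.out = n + 1)[-1]
radius <- c(rep(1, n), rep(5, n)) + rnorm(2 * n, sd = 0.05)
rings  <- cbind(radius * cos(c(theta, theta)), radius * sin(c(theta, theta)))
truth  <- rep(1:2, each = n)   # 1 = inner ring, 2 = outer ring

# Similarity graph: nearby points get weights near 1, distant points near 0.
sigma <- 0.5
W <- exp(-as.matrix(dist(rings))^2 / (2 * sigma^2))
diag(W) <- 0

# Symmetric normalized Laplacian: I - D^(-1/2) W D^(-1/2).
deg   <- rowSums(W)
L_sym <- diag(2 * n) - W / sqrt(outer(deg, deg))

# eigen() sorts eigenvalues in decreasing order, so the eigenvectors for the
# two smallest eigenvalues are the last two columns.
eig <- eigen(L_sym, symmetric = TRUE)
U   <- eig$vectors[, (2 * n - 1):(2 * n)]
U   <- U / sqrt(rowSums(U^2))   # normalize each row to unit length

# Cluster the embedded points with ordinary k-means.
fit <- kmeans(U, centers = 2, nstart = 10)
tab <- table(truth, fit$cluster)
print(tab)
agreement <- max(sum(diag(tab)), sum(tab) - sum(diag(tab))) / sum(tab)
print(agreement)
```

The k-means step at the end works here because, in the eigenvector embedding, points on the same ring land close together even though the rings themselves are not compact in the original coordinates.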

## Question 7
#### Define the gap statistic method. What do we use it for?

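The gap statistic method compares the within-cluster dispersion of the clustering for each value of K with the dispersion expected when the same clustering is run on reference data that has no cluster structure (for example, points drawn uniformly over the range of the original data). The gap is the difference between the two on a log scale, and a large gap means the clustering is much tighter than chance would give. We use it to choose how many clusters to use for a dataset.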
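
A minimal base-R sketch of the computation, assuming k-means as the clustering method and a uniform reference distribution over the bounding box of the data. The three-blob data, `K_max = 6`, and `B = 20` reference draws are arbitrary illustrative choices, and the helper names are hypothetical.

```r
set.seed(42)
# Synthetic data: three well-separated blobs of 50 points each.
blob_centers <- rbind(c(0, 0), c(4, 0), c(2, 4))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, blob_centers[i, 1], 0.3), rnorm(50, blob_centers[i, 2], 0.3))
}))

# log of the total within-cluster sum of squares for k clusters.
log_wk <- function(data, k) {
  log(kmeans(data, centers = k, nstart = 10, iter.max = 50)$tot.withinss)
}

# Reference data with no cluster structure: uniform over each column's range.
uniform_like <- function(data) {
  apply(data, 2, function(col) runif(length(col), min(col), max(col)))
}

K_max <- 6
B <- 20
obs <- sapply(1:K_max, function(k) log_wk(x, k))
ref <- replicate(B, {
  ref_data <- uniform_like(x)
  sapply(1:K_max, function(k) log_wk(ref_data, k))
})  # K_max x B matrix of reference log-dispersions

# Gap(k) = mean reference log-dispersion minus observed log-dispersion.
gap <- rowMeans(ref) - obs
# Spread of the reference values (sample sd used here for simplicity).
s_k <- apply(ref, 1, sd) * sqrt(1 + 1 / B)
print(round(gap, 2))

# Usual selection rule: smallest k with Gap(k) >= Gap(k+1) - s_(k+1).
ok <- which(gap[-K_max] >= gap[-1] - s_k[-1])
k_hat <- if (length(ok) > 0) ok[1] else K_max
print(k_hat)
```

The `cluster` package provides `clusGap()` for the same purpose; the hand-rolled version is only meant to show what is being compared.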