# Homework 08
This homework is based on the clustering lectures. Check the lecture notes and TA notes - they should help!

## Question 1
This question will walk you through creating your own `kmeans` function.

#### a) What are the steps of `kmeans`?
**Hint**: There are 4 steps/builder functions that you'll need.
1. Load data
2. Assign points to clusters at random
3. Compute cluster means (centroids)
4. Iterate, reassign labels, calculate new means

#### b) Create the builder function for step 1.
```
load_data <- function(file_path) {
  data <- read.csv(file_path)
  return(data)
}
```

#### c) Create the builder function for step 2.
```
assign_random_clusters <- function(data, k) {
  cluster_labels <- paste0("C", 0:(k - 1))
  data$cluster <- sample(cluster_labels, nrow(data), replace = TRUE)
  return(data)
}
```

#### d) Create the builder function for step 3.
*Hint*: There are two ways to do this part - one is significantly more efficient than the other. You can do either.  
```
library(dplyr)    # group_by(), summarise(), %>%
library(ggplot2)  # plotting

compute_centroids <- function(data) {
  centers_data <- data %>%
    group_by(cluster) %>%
    summarise(
      x = mean(x),
      y = mean(y),
      .groups = "drop"
    ) %>%
    arrange(cluster) %>%
    mutate(label = paste0("μ", row_number() - 1))
  
  return(centers_data)
}


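# Compute the centroids to plot (assumes `data` already has a `cluster` column from step 2)
centers_data <- compute_centroids(data)

ggplot(data, aes(x, y, color = cluster)) +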
  geom_point(size = 2, alpha = 0.7) +
  geom_point(data = centers_data, aes(x, y, fill = cluster),
             shape = 21, size = 5, color = "black", stroke = 1.2) +
  geom_text(data = centers_data, aes(x, y, label = label),
            vjust = -1, fontface = "bold", color = "black") +
  coord_equal() +
  theme_bw() +
  ggtitle("Compute cluster means (centroids)")
```

#### e) Create the builder function for step 4.
```
assign_to_nearest <- function(data, centers) {
  data %>%
    rowwise() %>%
    mutate(
      cluster = centers$cluster[which.min(sqrt((x - centers$x)^2 + (y - centers$y)^2))]
    ) %>%
    ungroup()
}

data_new <- assign_to_nearest(data, centers_data)

ggplot(data_new, aes(x, y, color = factor(cluster))) +
  geom_point(size = 2, alpha = 0.7) +
  geom_point(data = centers_data, aes(x, y, fill = factor(cluster)),
             shape = 21, size = 5, color = "black", stroke = 1.2) +
  geom_text(data = centers_data, aes(x, y, label = label),
            vjust = -1, fontface = "bold", color = "black") +
  coord_equal() +
  theme_bw() +
  ggtitle("Iterate and reassign labels")

# Recalculate the means with the step 3 builder function
new_centers_data <- compute_centroids(data_new)

ggplot(data_new, aes(x, y, color = factor(cluster))) +
  geom_point(size = 2, alpha = 0.7) +
  geom_point(data = new_centers_data, aes(x, y, fill = factor(cluster)),
             shape = 21, size = 5, color = "black", stroke = 1.2) +
  geom_text(data = new_centers_data, aes(x, y, label = label),
            vjust = -1, fontface = "bold", color = "black") +
  coord_equal() +
  theme_bw() +
  ggtitle("Recalculate means")
```

#### f) Combine them all into your own `kmeans` function.

```
kmeans_custom <- function(file_path, k, max_iter = 10) {
  library(dplyr)

  # Step 1: load the data
  data <- read.csv(file_path)

  # Step 2: assign every point to a random cluster
  cluster_labels <- paste0("C", 0:(k - 1))
  data$cluster <- sample(cluster_labels, nrow(data), replace = TRUE)

  # Step 3: the centroid of each cluster is the mean of its numeric columns
  compute_centroids <- function(data) {
    data %>%
      group_by(cluster) %>%
      summarise(across(.cols = where(is.numeric) & !any_of("cluster"), mean), .groups = "drop") %>%
      arrange(cluster) %>%
      mutate(label = paste0("μ", row_number() - 1))
  }

  # Step 4a: reassign each point to its nearest centroid (Euclidean distance)
  assign_to_nearest <- function(data, centers) {
    # Use only the numeric feature columns, matching compute_centroids()
    feature_cols <- setdiff(names(data)[sapply(data, is.numeric)], "cluster")

    centers_no_label <- centers[, feature_cols, drop = FALSE]
    centers_clusters <- centers$cluster

    # dist() on the stacked rows gives all pairwise distances;
    # keep only the point-to-centroid block
    distances <- as.matrix(dist(rbind(data[, feature_cols, drop = FALSE], centers_no_label)))
    n_data <- nrow(data)
    n_centers <- nrow(centers)

    dist_matrix <- distances[1:n_data, (n_data + 1):(n_data + n_centers), drop = FALSE]
    data$cluster <- centers_clusters[apply(dist_matrix, 1, which.min)]
    data
  }

  # Step 4b: iterate until the labels stop changing or max_iter is reached
  for (i in seq_len(max_iter)) {
    centers_data <- compute_centroids(data)
    new_data <- assign_to_nearest(data, centers_data)

    if (all(new_data$cluster == data$cluster)) {
      break
    }

    data <- new_data
  }

  final_centers <- compute_centroids(data)

  return(list(
    labels = data$cluster,
    means = final_centers
  ))
}
```

## Question 2
This is when we'll test your `kmeans` function.
#### a) Read in the `voltages_df.csv` data set. 

```
voltages_df <- read.csv("voltages_df.csv")
head(voltages_df)
```

| | X0 | X1.00401606425703 | X2.00803212851406 | X3.01204819277108 | X4.01606425702811 | X5.02008032128514 | X6.02409638554217 | X7.0281124497992 | X8.03212851405623 | X9.03614457831325 | ⋯ | X240.963855421687 | X241.967871485944 | X242.971887550201 | X243.975903614458 | X244.979919678715 | X245.983935742972 | X246.987951807229 | X247.991967871486 | X248.995983935743 | X250 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | -1.031463 | 1.104665 | 0.8982475 | 0.4142208 | -1.1490888 | -1.07851 | -1.002401 | -0.9182083 | -0.8215574 | -0.7023741 | ⋯ | -0.7392703 | -0.7633694 | -0.7792297 | -0.784434 | -0.777982 | -0.7608812 | -0.736983 | -0.7138199 | -0.7014771 | -0.7056029 |
| 2 | -1.031463 | 1.246157 | 1.0948587 | 0.9039343 | 0.465441 | -1.160496 | -1.112005 | -1.0721319 | -1.0385633 | -1.0075872 | ⋯ | -0.8859964 | -0.8511675 | -0.8064307 | -0.7534558 | -0.6954785 | -0.6404759 | -0.6105817 | -0.6348313 | -0.6767121 | -0.7140939 |
| 3 | -1.031463 | 1.216111 | 1.0557873 | 0.8417629 | -0.5636836 | -1.147653 | -1.101783 | -1.0645681 | -1.0336197 | -1.0051885 | ⋯ | -0.9503509 | -0.9122991 | -0.8625269 | -0.8016142 | -0.7306757 | -0.6527186 | -0.5812047 | -0.587556 | -0.6768023 | -0.7206992 |
| 4 | -1.031463 | 1.166244 | 0.9899628 | 0.7230858 | -1.1806746 | -1.125106 | -1.077167 | -1.0370309 | -1.0027385 | -0.9709488 | ⋯ | -0.9498509 | -0.9236047 | -0.8896604 | -0.850212 | -0.8086367 | -0.7700917 | -0.7418958 | -0.7315532 | -0.7409824 | -0.7644406 |
| 5 | -1.031463 | 1.230222 | 1.07467 | 0.873388 | 0.2116394 | -1.153728 | -1.106832 | -1.0691075 | -1.038335 | -1.0108187 | ⋯ | -0.8710166 | -0.8237315 | -0.7590005 | -0.6698582 | -0.5061566 | 1.0975578 | 0.9348933 | 0.6673692 | -1.1669718 | -1.1047735 |
| 6 | -1.031463 | 1.25765 | 1.1112886 | 0.9322788 | 0.604562 | -1.166325 | -1.113488 | -1.0696859 | -1.0332517 | -1.0009558 | ⋯ | -0.9092342 | -0.8715416 | -0.8213803 | -0.7589597 | -0.6842115 | -0.5958772 | -0.4766488 | 1.1008087 | 0.9169321 | 0.5137345 |


#### b) Call your `kmeans` function with 3 clusters. Print the results with `results$labels` and `results$means`. 

```
voltages_df <- read.csv("voltages_df.csv", header = TRUE)

results <- kmeans_custom("voltages_df.csv", 3)
print(results$labels)
print(results$means)
```


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




  [1] "C2" "C1" "C1" "C1" "C1" "C1" "C0" "C0" "C1" "C2" "C2" "C2" "C0" "C2" "C1"
 [16] "C2" "C2" "C2" "C2" "C2" "C1" "C1" "C1" "C1" "C1" "C2" "C2" "C2" "C0" "C0"
 [31] "C0" "C0" "C1" "C0" "C0" "C1" "C1" "C2" "C0" "C1" "C2" "C2" "C2" "C0" "C1"
 [46] "C0" "C0" "C1" "C0" "C0" "C1" "C0" "C2" "C2" "C0" "C2" "C1" "C2" "C0" "C1"
 [61] "C2" "C2" "C1" "C2" "C0" "C0" "C1" "C0" "C2" "C0" "C2" "C1" "C0" "C0" "C1"
 [76] "C1" "C2" "C0" "C0" "C0" "C0" "C0" "C1" "C0" "C0" "C1" "C1" "C1" "C2" "C0"
 [91] "C0" "C0" "C1" "C0" "C1" "C1" "C0" "C1" "C1" "C0" "C0" "C0" "C1" "C0" "C2"
[106] "C2" "C0" "C2" "C2" "C1" "C2" "C2" "C1" "C0" "C2" "C0" "C2" "C2" "C1" "C1"
[121] "C1" "C1" "C2" "C2" "C0" "C1" "C1" "C0" "C2" "C0" "C1" "C2" "C1" "C0" "C2"
[136] "C1" "C1" "C1" "C2" "C0" "C2" "C1" "C1" "C0" "C2" "C2" "C1" "C2" "C2" "C0"
[151] "C2" "C0" "C1" "C1" "C0" "C2" "C0" "C0" "C1" "C2" "C0" "C0" "C2" "C1" "C2"
[166] "C0" "C1" "C2" "C1" "C2" "C2" "C0" "C1" "C2" "C0" "C0" "C1" "C0" "C2" "C1"
[181] "C0" "C1" "C0" "C2" "C

#### c) Call R's `kmeans` function with 3 clusters. Print the results with `results$cluster` and `results$centers`.
*Hint*: Use the `as.matrix()` function to make the `voltages_df` data frame a matrix before calling `kmeans()`.

```
voltages_df <- read.csv("voltages_df.csv", header = TRUE)

voltages_matrix <- as.matrix(voltages_df)

results <- kmeans(voltages_matrix, centers = 3)

print(results$cluster)
results$centers
```

  [1] 1 3 3 3 3 3 2 2 3 1 1 1 2 1 3 1 1 1 1 1 3 3 3 3 3 1 1 1 2 2 2 2 3 2 2 3 3
 [38] 1 2 3 1 1 1 2 3 2 2 3 2 2 3 2 1 1 2 1 3 1 2 3 1 1 3 1 2 2 3 2 1 2 1 3 2 2
 [75] 3 3 1 2 2 2 2 2 3 2 2 3 3 3 1 2 2 2 3 2 3 3 2 3 3 2 2 2 3 2 1 1 2 1 1 3 1
[112] 1 3 2 1 2 1 1 3 3 3 3 1 1 2 3 3 2 1 2 3 1 3 2 1 3 3 3 1 2 1 3 3 2 1 1 3 1
[149] 1 2 1 2 3 3 2 1 2 2 3 1 2 2 1 3 1 2 3 1 3 1 1 2 3 1 2 2 3 2 1 3 2 3 2 1 2
[186] 3 3 1 3 1 1 3 1 1 2 2 1 3 1 1 2 1 2 2 3 1 3 2 3 1 3 2 1 2 2 3 2 2 3 3 3 3
[223] 3 3 3 2 2 2 2 1 3 1 2 3 3 2 2 1 1 2 2 1 2 2 3 2 1 2 2 3 2 3 1 2 3 2 2 1 1
[260] 1 3 3 2 1 3 3 3 2 2 2 1 3 3 3 2 1 2 2 3 3 2 1 2 1 1 3 2 3 3 2 2 1 2 2 1 3
[297] 1 2 2 3 3 2 1 1 1 2 1 3 3 3 1 1 1 1 1 1 3 1 3 1 1 2 3 2 1 1 1 1 3 1 3 1 2
[334] 1 3 3 2 2 1 3 2 3 2 2 1 2 3 2 1 1 2 3 3 3 1 2 3 1 2 2 3 3 3 3 1 2 1 3 1 1
[371] 1 1 1 3 3 2 2 1 1 1 1 2 2 1 1 2 2 3 3 2 1 3 3 3 3 2 3 3 2 1 1 1 1 3 3 1 1
[408] 2 2 2 3 2 2 1 3 3 3 2 3 3 3 1 1 3 1 2 3 3 3 3 2 2 3 2 1 3 1 1 1 2 2 2 2 1
[445] 3 1 1 1 2 3 1 1 1 3 3 1 3 3 1 2 3 

| | X0 | X1.00401606425703 | X2.00803212851406 | X3.01204819277108 | X4.01606425702811 | X5.02008032128514 | X6.02409638554217 | X7.0281124497992 | X8.03212851405623 | X9.03614457831325 | ⋯ | X240.963855421687 | X241.967871485944 | X242.971887550201 | X243.975903614458 | X244.979919678715 | X245.983935742972 | X246.987951807229 | X247.991967871486 | X248.995983935743 | X250 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -1.031463 | 0.9381238 | 0.7619864 | 0.3631543 | -1.1179412 | -1.051145 | -0.9766807 | -0.8694758 | -0.6892375 | -0.5661321 | ⋯ | -0.7900387 | -0.8070676 | -0.8182598 | -0.8207339 | -0.8132928 | -0.7969549 | -0.77567272 | -0.75689256 | -0.7496483 | -0.7570393 |
| 2 | -1.031463 | 1.3093239 | 1.1616772 | 0.9787498 | 0.6481497 | -1.16861 | -1.1196122 | -1.0590962 | -0.9943176 | -0.9237437 | ⋯ | 0.3364266 | 0.8337474 | 0.7125412 | -0.2659209 | -1.0409179 | -1.0587745 | -1.01359887 | -0.96467777 | -0.9151047 | -0.8610245 |
| 3 | -1.031463 | 1.2439759 | 1.0924697 | 0.900444 | 0.3011754 | -1.159714 | -1.1098127 | -1.0685484 | -1.0338649 | -1.0022396 | ⋯ | -0.9107472 | -0.8732292 | -0.8234477 | -0.7607812 | -0.6682618 | -0.3380864 | -0.04693168 | 0.02820486 | -0.41135 | -0.8115784 |


#### d) Are your labels/clusters the same? If not, why? Are your means the same?
The labels from my `kmeans_custom` and R's `kmeans` are different: mine are C0, C1, and C2, while R's are 1, 2, and 3. The means are the same for both functions, but the labels are permuted: C0 corresponds to 2, C1 corresponds to 3, and C2 corresponds to 1. The labels differ because they are arbitrary names that depend on the order in which the algorithm happens to form the clusters, which in turn depends on the random initialization. So although the labels do not match, the centroids are consistent between my k-means function and R's.
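
One way to check a label correspondence like this is a cross-tabulation. The sketch below uses only the first ten labels printed by each function above; the variable names are illustrative and not part of the assignment.

```r
# First ten labels from kmeans_custom() and from R's kmeans(), copied from the outputs above
custom_labels  <- c("C2", "C1", "C1", "C1", "C1", "C1", "C0", "C0", "C1", "C2")
builtin_labels <- c(1, 3, 3, 3, 3, 3, 2, 2, 3, 1)

# Rows are my labels, columns are R's labels
crosstab <- table(custom = custom_labels, builtin = builtin_labels)
print(crosstab)

# A pure relabeling leaves exactly one nonzero count in every row and column
is_relabeling <- all(rowSums(crosstab > 0) == 1) && all(colSums(crosstab > 0) == 1)
print(is_relabeling)
```

Running the same `table()` call on the full label vectors would confirm the mapping for every point rather than just this sample.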

## Question 3
#### a) Explain the process of using a for loop to assign clusters for kmeans.
To assign clusters with a for loop, you loop over every data point and, for each point, loop over every cluster centroid. Inside the inner loop you calculate the distance between the point and the centroid, keeping track of which centroid is closest. Once the inner loop finishes, you assign the point to the cluster of its closest centroid. This requires a double loop over all points and all clusters, which is slow for large datasets because each distance is calculated one pair at a time.
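
A minimal sketch of this double loop, assuming the points and centroids are stored as numeric matrices with one row each. The function name `assign_clusters_loop` and the toy data are made up for illustration.

```r
# Assign each row of `points` to the index of its nearest row in `centroids`
assign_clusters_loop <- function(points, centroids) {
  n <- nrow(points)
  k <- nrow(centroids)
  labels <- integer(n)

  for (i in seq_len(n)) {            # outer loop: every data point
    best_dist <- Inf
    for (j in seq_len(k)) {          # inner loop: every centroid
      d <- sqrt(sum((points[i, ] - centroids[j, ])^2))
      if (d < best_dist) {           # remember the closest centroid so far
        best_dist <- d
        labels[i] <- j
      }
    }
  }
  labels
}

# Toy data: two points near (0, 0) and two near (5, 5)
points    <- rbind(c(0, 0), c(0.5, 0.2), c(5, 5), c(5.2, 4.9))
centroids <- rbind(c(0, 0), c(5, 5))
loop_labels <- assign_clusters_loop(points, centroids)
print(loop_labels)  # 1 1 2 2
```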

#### b) Explain the process of vectorizing the code to assign clusters for kmeans.
Vectorizing the cluster assignment for k-means means computing the distances for all point-cluster pairs at once. The steps are:
1. Replicate the data matrix so that each point appears once per cluster (Data').
2. Replicate the cluster means matrix so that each centroid appears once per data point (Means'). After steps 1 and 2, the two replicated matrices have the same shape.
3. Compute the distances between all point-cluster pairs with a single vectorized operation.
4. Assign each data point to its closest centroid.

This approach is much faster than looping explicitly.
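
The replication steps can be sketched in base R as follows. This assumes matrix inputs, and the function name `assign_clusters_vectorized` and the toy data are illustrative only.

```r
assign_clusters_vectorized <- function(points, centroids) {
  n <- nrow(points)
  k <- nrow(centroids)

  # Index vectors with repeated entries
  point_idx    <- rep(seq_len(n), times = k)  # 1..n, 1..n, ... (once per cluster)
  centroid_idx <- rep(seq_len(k), each = n)   # 1,1,...,1, 2,2,...,2, ...

  # Steps 1 and 2: replicated matrices with the same shape (n * k rows)
  data_rep  <- points[point_idx, , drop = FALSE]
  means_rep <- centroids[centroid_idx, , drop = FALSE]

  # Step 3: all point-centroid distances in one vectorized operation
  d <- sqrt(rowSums((data_rep - means_rep)^2))

  # Reshape so that column j holds the distances to centroid j
  dist_matrix <- matrix(d, nrow = n, ncol = k)

  # Step 4: index of the smallest distance in each row
  max.col(-dist_matrix, ties.method = "first")
}

points    <- rbind(c(0, 0), c(0.5, 0.2), c(5, 5), c(5.2, 4.9))
centroids <- rbind(c(0, 0), c(5, 5))
vec_labels <- assign_clusters_vectorized(points, centroids)
print(vec_labels)  # 1 1 2 2
```

The replicated matrices cost extra memory (n times k rows), which is the price paid for removing the explicit loops.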

#### c) State which (for loops or vectorizing) is more efficient and why.
Vectorizing is much more efficient than looping. Constructing index vectors with repeated entries lets us build the replicated matrices, so all distances are computed in a single operation that runs in R's optimized compiled code. With loops, the distances are computed individually, one pair at a time, and every iteration adds interpreter overhead.

## Question 4
#### When does `kmeans` fail? What assumption does `kmeans` use that causes it to fail in this situation?

K-means fails when the structure of the data does not match its assumptions about cluster shape and distribution, for example when clusters are elongated, concentric, or very different in size.

The assumptions are: the data are vectorial; each cluster is a roughly spherical Gaussian, and all clusters have about the same size and shape; and each element is either in a cluster or not (hard assignment).

The last assumption can be relaxed so that each element is only assigned a probability of being in each cluster, based on its distance from the center. This is called fuzzy k-means. Because membership still depends on distance from a center, it softens the assignment of ambiguous points and outliers but does not by itself separate concentric clusters.
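
As an illustration of such a failure, the sketch below runs R's built-in `kmeans` on two concentric rings. The data are synthetic and every parameter (ring radii, noise level, seed) is an arbitrary choice for the example.

```r
set.seed(7)
n <- 200

# Two concentric rings: an inner ring of radius 1 and an outer ring of radius 5
theta  <- runif(2 * n, 0, 2 * pi)
radius <- rep(c(1, 5), each = n) + rnorm(2 * n, sd = 0.1)
rings  <- cbind(x = radius * cos(theta), y = radius * sin(theta))
truth  <- rep(c(1, 2), each = n)   # which ring each point really belongs to

fit <- kmeans(rings, centers = 2, nstart = 10)

# Fraction of points grouped consistently with the true rings,
# allowing for the arbitrary order of the cluster labels
agreement <- max(mean(fit$cluster == truth), mean(fit$cluster != truth))
print(agreement)
```

With two centroids, k-means can only split the plane along a straight line, so it cannot place the inner ring and the outer ring in separate clusters; the agreement stays well below 1.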



## Question 5
#### What assumption do Gaussian mixture models make?
Gaussian mixture models assume that the data are drawn from a mixture of N Gaussian distributions whose individual parameters are estimated from the data. Because each Gaussian has its own parameters, the model handles clusters of different sizes and shapes more easily. The parameters are estimated with Expectation-Maximization, which alternates between computing how likely each point is to belong to each Gaussian and modifying the parameters of the model to maximize the likelihood of the data we have.
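
A minimal sketch of Expectation-Maximization for a two-component mixture in one dimension, written in base R. The synthetic data, the starting values, and the fixed 100 iterations are all assumptions made for the example; a real analysis would more likely use a dedicated package.

```r
set.seed(42)
# Synthetic data: two Gaussians with different means and spreads
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 6, sd = 1.5))

# Starting guesses: one mean at each extreme, equal spreads and weights
mu     <- range(x)
sigma  <- rep(sd(x), 2)
weight <- c(0.5, 0.5)

for (iter in seq_len(100)) {
  # E-step: probability that each point belongs to each Gaussian
  dens <- cbind(weight[1] * dnorm(x, mu[1], sigma[1]),
                weight[2] * dnorm(x, mu[2], sigma[2]))
  resp <- dens / rowSums(dens)

  # M-step: update the parameters using those probabilities as weights
  n_k    <- colSums(resp)
  mu     <- colSums(resp * x) / n_k
  sigma  <- sqrt(colSums(resp * outer(x, mu, "-")^2) / n_k)
  weight <- n_k / length(x)
}

print(round(rbind(mu, sigma, weight), 2))
```

The estimated means and standard deviations should land near the values used to generate the data, including the different spread of each component.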

## Question 6
#### What assumption does spectral clustering make? Why does this help us?

Spectral clustering requires only one assumption: two points are more likely to be in the same cluster if they are close to one another. This helps because assuming that a metric exists is a much weaker condition than assuming a vector space, so more types of data can be used. The pairwise similarities also allow us to compute the graph Laplacian, whose eigenvectors can be truncated to form a low-dimensional representation of the data on which k-means can then be run.
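
A rough sketch of that pipeline in base R, applied to two concentric rings. The Gaussian similarity, its width `sigma`, the unnormalized Laplacian, and the synthetic data are all choices made for this illustration rather than the only way to do spectral clustering.

```r
set.seed(3)
n <- 100

# Two concentric rings with evenly spaced angles and a little radial noise
theta  <- rep(seq(0, 2 * pi, length.out = n + 1)[-1], times = 2)
radius <- rep(c(1, 5), each = n) + rnorm(2 * n, sd = 0.05)
rings  <- cbind(radius * cos(theta), radius * sin(theta))
truth  <- rep(c(1, 2), each = n)

# Similarity graph: nearby points get a weight near 1, distant points near 0
sigma <- 0.5
W <- exp(-as.matrix(dist(rings))^2 / (2 * sigma^2))
diag(W) <- 0

# Unnormalized graph Laplacian L = D - W
L <- diag(rowSums(W)) - W

# eigen() sorts eigenvalues in decreasing order, so the last columns
# are the eigenvectors with the smallest eigenvalues
eig <- eigen(L, symmetric = TRUE)
embedding <- eig$vectors[, (2 * n - 1):(2 * n)]

# k-means on the low-dimensional representation
fit <- kmeans(embedding, centers = 2, nstart = 10)
agreement <- max(mean(fit$cluster == truth), mean(fit$cluster != truth))
print(agreement)
```

Only pairwise distances enter the computation, so the same steps apply to any data for which a similarity between items can be defined.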

## Question 7
#### Define the gap statistic method. What do we use it for?

The gap statistic method estimates the number of clusters that best represents the data. We use it when the right number of clusters isn't clear. It works by comparing the clustering results on the data to clustering results on randomly generated reference data. For each candidate number of clusters, the within-cluster variation is measured for both data sets. The gap statistic is the difference between the log within-cluster variation of the reference data and that of the actual data; a large gap means the data cluster much more tightly than random data would.
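
A simplified sketch of the idea in base R, using three synthetic blobs. The helper names `log_wk` and `ref_log_wk`, the uniform reference box, and picking the largest gap are assumptions of this example; the full method uses a standard-error rule to choose k, and the `cluster` package provides `clusGap` for real use.

```r
set.seed(1)

# Synthetic data: three well-separated blobs of 50 points each
blob_centers <- rbind(c(0, 0), c(6, 0), c(3, 5))
pts <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, blob_centers[i, 1], 0.5), rnorm(50, blob_centers[i, 2], 0.5))
}))

# Log of the total within-cluster sum of squares for k clusters
log_wk <- function(x, k) {
  log(kmeans(x, centers = k, nstart = 20)$tot.withinss)
}

# Average log within-cluster variation over B uniform reference data sets
# drawn from the bounding box of the real data
ref_log_wk <- function(x, k, B = 20) {
  mean(replicate(B, {
    ref <- apply(x, 2, function(col) runif(length(col), min(col), max(col)))
    log_wk(ref, k)
  }))
}

ks  <- 1:5
gap <- sapply(ks, function(k) ref_log_wk(pts, k) - log_wk(pts, k))

best_k <- ks[which.max(gap)]
print(round(gap, 2))
print(best_k)
```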