# Clustering with K-means
Learning Goals:
- Understand how K-means clusters data in 2-dimensional feature space.
- Apply K-means clustering to classify satellite pixels.
- Visualise results using geemap and compare them with RGB and NDVI layers.



_______________________________________________________________________________
# K-means on synthetic data
First, let us generate some fake data that has no labels but does have a clear cluster structure.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Generate 3 clusters of 2D points
np.random.seed(42)
cluster_1 = np.random.normal(loc=[2, 2], scale=0.5, size=(100, 2))
cluster_2 = np.random.normal(loc=[6, 6], scale=0.5, size=(100, 2))
cluster_3 = np.random.normal(loc=[2, 6], scale=0.5, size=(100, 2))

data = np.vstack((cluster_1, cluster_2, cluster_3))

# Plot the unlabelled data
plt.scatter(data[:, 0], data[:, 1], c='gray')
plt.title("Unlabelled Fake Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

Next, apply the K-means algorithm.

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(data)
labels = kmeans.labels_

# Visualize the clustering
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='Set1')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='black', marker='x', s=100, label='Centroids')
plt.title("K-Means Clustering on Fake Data")
plt.legend()
plt.show()
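
To see roughly what `KMeans` is doing under the hood, here is a minimal sketch of the classic two-step loop (often called Lloyd's algorithm): assign every point to its nearest centroid, then move each centroid to the mean of its points. The function name `lloyd_kmeans` and its parameters are made up for illustration. scikit-learn's own implementation uses a smarter initialisation (k-means++) and further optimisations, so treat this as a teaching aid rather than a replacement.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Teaching sketch of K-means (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    # Final assignment so the labels match the returned centroids
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1), centroids

# Three well-separated blobs, similar in spirit to the fake data above
rng = np.random.default_rng(42)
demo = np.vstack([
    rng.normal(loc=[2, 2], scale=0.5, size=(100, 2)),
    rng.normal(loc=[6, 6], scale=0.5, size=(100, 2)),
    rng.normal(loc=[2, 6], scale=0.5, size=(100, 2)),
])
demo_labels, demo_centroids = lloyd_kmeans(demo, k=3)
print(demo_centroids.round(1))
```

Because the starting centroids are random, a different `seed` can converge to a different (sometimes worse) solution, which is one reason library implementations offer repeated initialisations.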

Let's move some of these clusters closer to each other...

In [None]:
# Generate 3 clusters of 2D points that overlap
np.random.seed(42)
cluster_1 = np.random.normal(loc=[2, 2], scale=0.5, size=(100, 2))
cluster_2 = np.random.normal(loc=[6, 6], scale=0.5, size=(100, 2))
# The third cluster is moved (loc) to sit between the other two and dispersed further (scale)
cluster_3 = np.random.normal(loc=[4, 4], scale=0.8, size=(100, 2))

data = np.vstack((cluster_1, cluster_2, cluster_3))

# Plot the unlabelled data
plt.scatter(data[:, 0], data[:, 1], c='gray')
plt.title("Unlabelled Fake Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

Let us apply K-means once more...

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(data)
labels = kmeans.labels_

# Visualize the clustering
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='Set1')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='black', marker='x', s=100, label='Centroids')
plt.title("K-Means Clustering on Fake Data That Overlaps")
plt.legend()
plt.show()

Exercise: alter the dispersion and position of the first cluster so that it interacts more with the third.
*   Set the position to [3,2] and the scale (dispersion) to 0.6

As a pair, discuss whether the resulting clusters are truly meaningful. Should this data now be two clusters rather than three?

Then, add a fourth cluster to our synthetic data:
*   At position [2,7] with scale 0.5, but leave k-means set to n_clusters=3.

Does the resulting clustering change your answer to the above question?

In [None]:
# Use this code cell to copy in the example code above and modify it to answer the k-means exercise so that you can compare results more easily
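
If you would like a number to support your discussion, one option (not the only one) is the silhouette score, which is higher when points sit close to their own cluster and far from the others. The sketch below rebuilds the overlapping data from earlier under the assumed name `data_overlap` and compares a few values of `n_clusters`. The explicit `n_init=10` is an assumption made to keep behaviour consistent across scikit-learn versions. Swap in your own modified data to test the exercise variants.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Rebuild the overlapping fake data (same settings as the earlier cell)
np.random.seed(42)
data_overlap = np.vstack((
    np.random.normal(loc=[2, 2], scale=0.5, size=(100, 2)),
    np.random.normal(loc=[6, 6], scale=0.5, size=(100, 2)),
    np.random.normal(loc=[4, 4], scale=0.8, size=(100, 2)),
))

# Fit K-means for several cluster counts and record the silhouette score of each
scores = {}
for k in range(2, 6):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data_overlap)
    scores[k] = silhouette_score(data_overlap, labels_k)
    print(f"k={k}: silhouette = {scores[k]:.3f}")
```

A higher score suggests a cleaner separation, but it is only a guide: it cannot tell you whether the clusters mean anything in the real world.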


________________________________________________________________________________
# K-means on Sentinel 2
Now we will have a first go at applying unsupervised classification to satellite data.

In [None]:
# Our usual set up for ee and geemap
import ee
import geemap

ee.Authenticate()
ee.Initialize(project='earthengine-ml-testing') #<- Remember to change this to your own project's name!

Load Sentinel 2 and grab the bands we want from the area we want:

In [None]:
# Auckland area
point = ee.Geometry.Point([174.7633, -36.8485])
image = ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED') \
    .filterBounds(point) \
    .filterDate('2023-02-01', '2023-02-28') \
    .sort('CLOUDY_PIXEL_PERCENTAGE') \
    .first()

# Select relevant bands and reduce size
band_names = ['B2', 'B3', 'B4', 'B8']  # Blue, Green, Red, NIR
bands = image.select(band_names)
region = point.buffer(1000).bounds()  # small area for demo

# Sample image pixels to NumPy, indexing by band name so the column order is explicit
sample = bands.sample(region=region, scale=10, numPixels=1000, seed=42).getInfo()
pixels = np.array([[f['properties'][b] for b in band_names] for f in sample['features']])


Apply the K-means classifier. Note that we have pulled the pixels from the Earth Engine server side into a NumPy array on our client side. This is an important step that you will likely repeat often in your future work in this module!

In [None]:
# scikit-learn K-means algorithm
kmeans_sat = KMeans(n_clusters=4, random_state=0).fit(pixels)
labels_sat = kmeans_sat.labels_

Now let's visualize this in 2D Feature Space:

In [None]:
# Let's plot Red vs NIR as an example
plt.scatter(pixels[:, 2], pixels[:, 3], c=labels_sat, cmap='Set3')
plt.xlabel('Red (B4)')
plt.ylabel('NIR (B8)')
plt.title('K-Means Clustering in Spectral Feature Space')
plt.show()
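
To help interpret what each cluster might represent, one simple approach is to compute NDVI, defined as (NIR - Red) / (NIR + Red), for every sampled pixel and average it per cluster. The sketch below uses a small made-up array, `pixels_demo`, so that it runs without Earth Engine. Its values are hypothetical stand-ins, not real Sentinel-2 reflectances, and the column order (B2, B3, B4, B8) is assumed to match the band selection above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in pixels with columns B2, B3, B4, B8 (not real Sentinel-2 data)
rng = np.random.default_rng(0)
water_like = rng.normal(loc=[600, 500, 300, 200], scale=40, size=(50, 4))
vegetation_like = rng.normal(loc=[300, 600, 400, 3500], scale=40, size=(50, 4))
pixels_demo = np.vstack((water_like, vegetation_like))

labels_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels_demo)

# NDVI per pixel from the Red (column 2) and NIR (column 3) bands
red, nir = pixels_demo[:, 2], pixels_demo[:, 3]
ndvi = (nir - red) / (nir + red)

# Mean NDVI per cluster: high values hint at vegetation, low or negative at water
cluster_ndvi = {}
for cluster_id in np.unique(labels_demo):
    cluster_ndvi[int(cluster_id)] = float(ndvi[labels_demo == cluster_id].mean())
    print(f"cluster {cluster_id}: mean NDVI = {cluster_ndvi[int(cluster_id)]:.2f}")
```

In the notebook itself you could replace `pixels_demo` and `labels_demo` with `pixels` and `labels_sat` to summarise the real clusters.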

In your pair, answer the following discussion prompts:
- What patterns did the K-means pick up in the image data?
- How might the choice of bands or number of clusters affect results?
- What are the risks of using K-means for land cover classification?
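
As a starting point for the question about band choice, the sketch below clusters the same pixels twice, once with all four bands and once with the visible bands only, and then measures how well the two labelings agree using the adjusted Rand index (1.0 means identical groupings, values near 0 mean agreement no better than chance). The data is a made-up stand-in with three hypothetical surface types, two of which differ only in NIR. None of the values come from real imagery.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Hypothetical stand-in pixels with columns B2, B3, B4, B8 (not real Sentinel-2 data)
rng = np.random.default_rng(1)
surface_a = rng.normal(loc=[600, 500, 300, 200], scale=40, size=(60, 4))
surface_b = rng.normal(loc=[300, 600, 400, 3500], scale=40, size=(60, 4))
surface_c = rng.normal(loc=[300, 600, 400, 1500], scale=40, size=(60, 4))  # differs from b only in NIR
pixels_bands = np.vstack((surface_a, surface_b, surface_c))

# Cluster once with all four bands and once with the visible bands (B2, B3, B4) only
labels_all = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pixels_bands)
labels_visible = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pixels_bands[:, :3])

# Agreement between the two clusterings, ignoring how the cluster ids are numbered
ari = adjusted_rand_score(labels_all, labels_visible)
print(f"Adjusted Rand index (all bands vs visible only): {ari:.2f}")
```

Dropping a band that carries the distinguishing signal tends to lower the agreement, so it is worth checking which bands actually separate the surfaces you care about.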