# Clustering
## GGR473 Lab 5 [part 2]

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt
import os
import numpy as np

### Data
Let's work with the same data we used in part 1: census tracts with aggregated features for AirBnB properties. 

For the lab we will aim to identify clusters in the data based on AirBnB price. However, you can think about how data may be clustered based on multiple features and/or spatial location. For example, you could try repeating the process using both price and number of bedrooms as features for clustering. 

I have also added a point file of the AirBnB locations to the data folder in case you'd like to test working with non-aggregated point data and different variables.

In [None]:
# You may have seen an error when previously importing a shapefile whereby the projection library couldn't be found
# The following code points to the PROJ library and should address the error
os.environ['PROJ_LIB'] = '/path/to/env/share/proj'

# Import shapefile and store data as geopandas geodataframe
gdf = gpd.read_file("data/airbnbct23.shp")
print(gdf.info())

### K-means clustering

K-means clustering is an unsupervised machine learning algorithm used for partitioning a dataset into K distinct clusters. It is commonly used to segment data and identify patterns and anomalies. 

In our example, we can identify which areas have similar pricing characteristics. To do this we will again use the [scikit-learn](https://scikit-learn.org/stable/index.html) library, specifically the [KMeans](https://scikit-learn.org/stable/modules/clustering.html#k-means) class.

**Aim:** Group census tracts based on Airbnb price into distinct clusters

- pricemean = mean AirBnB price for census tract

For the following example, I choose k=4 clusters to identify clear groups of high and low pricing. You can choose the number of clusters based on your own subject knowledge, but there are also technical approaches to selecting k. You may like to explore options such as [elbow plots](https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/) for finding the optimal number of clusters.
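As a rough sketch of the elbow idea, the cell below fits K-means for several values of k and records the inertia (within-cluster sum of squares) for each. The prices are synthetic stand-ins rather than the lab data, and `n_init=10` and `random_state=0` are assumptions chosen only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for gdf[['pricemean']]: three loose price groups (illustrative values only)
rng = np.random.default_rng(0)
prices = np.concatenate([
    rng.normal(100, 10, 50),
    rng.normal(200, 10, 50),
    rng.normal(400, 10, 50),
]).reshape(-1, 1)

# Fit K-means for a range of k and record the inertia (within-cluster sum of squares)
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(prices)
    inertias.append(model.inertia_)

# Print k alongside its inertia; look for the k after which the drop levels off
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting `k_values` against `inertias` (e.g., with `plt.plot`) gives the elbow plot itself. To try it on the lab data, you could swap `prices` for `gdf[['pricemean']]`.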


In [None]:
from sklearn.cluster import KMeans

# Select relevant feature(s) for clustering
X = gdf[['pricemean']] #To cluster based on multiple features (e.g., price and no. bedroom) use [['pricemean', 'bedrmsmed']] 

# Initialize model by setting number of clusters
km4 = KMeans(n_clusters=4)

# Fit the model to the relevant features
km4clusters = km4.fit_predict(X)  # fit_predict() fits the model and assigns cluster labels (e.g., 0, 1, 2, 3) which are helpful for grouping later

# Add cluster labels to the original DataFrame
gdf['km4clusters'] = km4clusters


The code above seeds k initial cluster centres (centroids) and assigns each data point to its nearest centroid. By default, scikit-learn chooses the seeds with the k-means++ scheme, which spreads them out rather than placing them purely at random. For each cluster, the algorithm then recalculates the centroid as the mean of all of its assigned data points and repositions the centroid accordingly. This process repeats until the centroids stop moving (convergence) or the maximum number of iterations is reached.
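To make the assign-and-update loop concrete, here is a toy version written with NumPy only. It is a simplified illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the purely random seeding, and the six made-up prices are all assumptions for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Toy K-means (Lloyd's algorithm) for illustration; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Seed the centroids with k distinct observations chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                new_centroids[j] = members.mean(axis=0)
        # Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Made-up mean prices for six tracts
X_demo = np.array([[90.0], [100.0], [110.0], [290.0], [300.0], [310.0]])
labels, centroids = kmeans_sketch(X_demo, k=2)
print(labels)
print(np.sort(centroids.ravel()))
```

The three low prices end up in one cluster and the three high prices in the other, with centroids at the group means. Which group gets label 0 depends on the seeding, which is why cluster numbers should be read as arbitrary names.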

You may see a warning for `n_init` above. `n_init` is the number of times the algorithm is run with different centroid seeds (the best run is kept) and is currently set to 10. The maximum number of iterations within a single run is a separate parameter, `max_iter`. In a future version of the KMeans class, the default for `n_init` will change from 10 to 'auto'.

Now let's use [matplotlib.pyplot](https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html) to map which neighbourhoods were assigned to each cluster based on AirBnB price.

In [None]:
# Create a figure and axes
f, ax = plt.subplots(1, figsize=(9, 9)) # f is the top level container for our plot and ax is the area where data are plotted

# Extract clusters from new column 'km4clusters' and show on map
gdf.plot(column='km4clusters', categorical=True, legend=True, ax=ax)

# Turn off axes lines and values
ax.set_axis_off()

# Add title
plt.title('AirBnB price classification for Toronto')

# Display the map
plt.show()

The map above shows a pattern: there is a class at the core and north of the city (cluster number 3), two sporadic clusters (1 and 2), and a more suburban cluster (0).

This gives us an insight into the geographical structure, but it does not tell us much about the defining elements of these groups. To find out, let's look at the characteristics of the clusters and visualize the mean AirBnB price for each.

If working with multiple variables, you could stack the different variables on the bar plot to view their proportions across groups. Some standardisation may be helpful here. For example, including a mean price of $200 and a median of 2 bedrooms in the same bar would make it difficult to see how clusters have been divided up based on high price, low number of bedrooms, etc., as price would dominate the display. Instead, displaying standardised values (e.g., [z-scores](https://wikitekkee.com/how-to-calculate-z-score-in-python/) where z = (X-mean)/std) would make it easier to compare variables.
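One possible way to apply the z-score idea is sketched below: standardise each variable across tracts first, then average the z-scores within each cluster. The small DataFrame is made up for illustration, and `bedrmsmed` is the bedroom column name suggested in the clustering cell, so check that it matches your data.

```python
import pandas as pd

# Made-up tract-level values with cluster labels (illustrative only)
demo = pd.DataFrame({
    "pricemean": [100.0, 120.0, 400.0, 500.0, 180.0, 200.0],
    "bedrmsmed": [1.0, 1.0, 3.0, 4.0, 2.0, 2.0],
    "km4clusters": [0, 0, 1, 1, 2, 2],
})
features = ["pricemean", "bedrmsmed"]

# z = (X - mean) / std, computed column by column (pandas uses the sample std, ddof=1)
z = (demo[features] - demo[features].mean()) / demo[features].std()
z["km4clusters"] = demo["km4clusters"]

# Mean z-score of each variable within each cluster
cluster_z_means = z.groupby("km4clusters")[features].mean()
print(cluster_z_means.round(2))
```

Both columns are now on the same unitless scale, so something like `cluster_z_means.plot(kind='barh')` would show price and bedrooms side by side without price dominating.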

In [None]:
# Group data points by cluster label and calculate mean for each cluster
cluster_means = gdf.groupby('km4clusters')[['pricemean']].mean()


# Create a horizontal bar plot for cluster means
f, ax = plt.subplots(1, figsize=(18, 9))
cluster_means.plot(kind='barh', stacked=True, ax=ax)

# Add labels, title and legend
plt.xlabel('Mean Price')
plt.ylabel('Cluster')
plt.title('Cluster means for AirBnB price')
plt.legend(['Mean price for AirBnB'])

plt.show()


We can see that cluster 1 is based on high prices and cluster 0 on low prices. Based on the map, is this what you would expect to see? It seems there are some interesting outliers for high prices.

So far our clustering has been based purely on the price variable. Let's incorporate spatial relationships by including a spatial weight in the model.

### Spatial autocorrelation
First, check whether spatial autocorrelation exists using Moran's I statistic (a global measure of spatial autocorrelation). The result indicates whether there are good grounds for including spatial weights in your model.
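For intuition about what the statistic measures, here is a small NumPy sketch of the textbook formula I = (n / S0) * (sum of w_ij * z_i * z_j) / (sum of z_i^2), where z is the deviation from the mean and S0 is the sum of all weights. The helper `morans_i`, the four-tract weights matrix, and the prices are hypothetical. It uses raw binary weights, so its values need not match a library that transforms the weights first.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for a 1-D array and an n x n spatial weights matrix."""
    z = values - values.mean()   # deviations from the mean
    s0 = weights.sum()           # sum of all weights
    n = len(values)
    return (n / s0) * (z @ weights @ z) / (z @ z)

# Four hypothetical tracts in a row; neighbours share an edge (binary contiguity)
w_demo = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

clustered = np.array([100.0, 110.0, 300.0, 310.0])    # similar prices sit side by side
alternating = np.array([100.0, 300.0, 100.0, 300.0])  # high and low prices alternate

print(morans_i(clustered, w_demo))    # positive: neighbours tend to be alike
print(morans_i(alternating, w_demo))  # negative: neighbours tend to differ
```

The formula only gives the statistic itself. The p-value reported by `esda` comes from comparing the observed value with values obtained after randomly shuffling the prices across tracts.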

In [None]:
from esda.moran import Moran
from libpysal.weights import Queen  # needed for Queen.from_dataframe() below

# Generate spatial weights (using Queen contiguity in this case)
w = Queen.from_dataframe(gdf)

# Calculate Moran's I to see if spatial autocorrelation in price exists
moran = Moran(gdf['pricemean'], w)

# Print Moran's I statistic and p-value
print("Moran's I:", moran.I)
print("P-value:", moran.p_sim)

It seems there is one census tract that doesn't have any neighbours based on Queen contiguity! I'm sure you can figure out which census tract this is. As a result, it was excluded from the calculation.

Otherwise, a significant positive Moran's I statistic (albeit quite a small one) suggests there is positive spatial autocorrelation in price, i.e., similar price values tend to cluster spatially.

We can also use a Local Moran's I statistic (similar to Lab 4) to see where clustering of prices occurs.

In [None]:
from esda.moran import Moran_Local
from splot.esda import moran_scatterplot, plot_local_autocorrelation

# Calculate local Moran's I
moran_loc = Moran_Local(gdf['pricemean'], w)

# Plot Moran scatterplot
fig, ax = moran_scatterplot(moran_loc, p=0.05)
plt.show()

# Plot Local Moran's I Map
plot_local_autocorrelation(moran_loc, gdf, 'pricemean')
plt.show()

It's clear there's some clustering of high values and low values in certain locations. This looks a little different to where we see census tracts that have been grouped due to high prices. 

### Spatial K-means clustering
Let's see if spatial autocorrelation affects our model by adding spatial weights. 

In [None]:
from libpysal.weights import Queen

# Convert the spatial weights matrix to a numpy array so it can be appended to your feature set of price values
w_matrix = w.full()[0]

# Append spatial weights
X_with_weights = np.hstack((X.values, w_matrix))

# Perform K-Means clustering on the extended dataset
skm4clusters = km4.fit_predict(X_with_weights)


# Add cluster labels to geodataframe
gdf['skm4clusters'] = skm4clusters


In [None]:
# Create two subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot the clusters from K-Means on the first subplot (axes[0])
gdf.plot(column='km4clusters', categorical=True, legend=True, ax=axes[0])

# Plot the clusters from Spatial K-Means on the second subplot (axes[1])
gdf.plot(column='skm4clusters', categorical=True, legend=True, ax=axes[1])

# Set titles for plots
axes[0].set_title('K-Means clustering')
axes[1].set_title('Spatial K-Means clustering')

# Display the plots
plt.tight_layout()
plt.show()

The outputs are similar, but including spatial weights has changed the census tract groupings slightly (e.g., in East Scarborough). You may like to take time exploring the cluster means with bar plots. Note that the cluster labels may not align between the two maps, but they can be [updated](https://www.digitalocean.com/community/tutorials/update-rows-and-columns-python-pandas).

Once you have worked through the notebook and feel happy with the concepts, export your notebook by selecting File > Print Preview > Save as PDF.
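One way to make the two sets of labels comparable is to renumber each so that 0 is always the cheapest cluster. The sketch below does this on a made-up DataFrame. The helper `relabel_by_mean` and the `_ranked` column names are hypothetical, and ordering by mean price is just one reasonable convention.

```python
import pandas as pd

def relabel_by_mean(df, label_col, value_col):
    """Return labels renumbered so 0 is the cluster with the lowest mean of value_col."""
    order = df.groupby(label_col)[value_col].mean().sort_values().index
    mapping = {old: new for new, old in enumerate(order)}
    return df[label_col].map(mapping)

# Made-up example: both runs find the same three groups but number them differently
demo = pd.DataFrame({
    "pricemean":    [100.0, 110.0, 300.0, 310.0, 200.0, 210.0],
    "km4clusters":  [2, 2, 0, 0, 1, 1],
    "skm4clusters": [1, 1, 2, 2, 0, 0],
})

# Add a price-ranked version of each label column
for col in ["km4clusters", "skm4clusters"]:
    demo[col + "_ranked"] = relabel_by_mean(demo, col, "pricemean")
print(demo)
```

After relabelling, the two `_ranked` columns agree wherever the two runs grouped tracts the same way, so the side-by-side maps would use matching colours.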

Upload your PDF to Part 2 of the lab.