# Problem Set 5 - K-Means Clustering  

In this problem set, you will explore seasonal ice velocity patterns on the Greenland Ice Sheet using K-Means clustering. The analysis is based on Solgaard, A. M., Rapp, D., Noël, B. P. Y., & Hvidberg, C. S. (2022). Seasonal patterns of Greenland ice velocity from Sentinel-1 SAR data linked to runoff. *Geophysical Research Letters*, 49, e2022GL100343. https://doi.org/10.1029/2022GL100343.

**[1]** In the code block below, import the libraries you will need for data preprocessing and K-Means clustering.

**[2] (2 pts)** Load the dataset into a dataframe and display the first few lines of the dataframe. Due to the data volume, this data set is hosted locally on our server at /scratch/ml_in_eas/data/VelocityTimeSeries.csv. 

You should see that the data has 29 features, labeled t1 through t29. Each row in the dataframe represents a single location on the ice sheet. Each column in the dataframe (i.e., each of the 29 features) represents the velocity at that location at a particular time. For example, t1 corresponds to 23 Jan 2019 and t29 to 18 Dec 2019. Each row of the dataframe is therefore a time series of ice velocity from Jan-Dec 2019 at a particular location on the ice sheet.

Your goal in this problem set is to find clusters of common seasonal velocity patterns, i.e., velocity time series from different locations on the ice sheet that behave similarly. For example, all locations that have slow spring velocities, speed up during the summer, and return to slow velocities in the winter might constitute a cluster. Glaciologists are interested in these patterns because they can shed light on interactions between ice velocity and surface melting. For example, previous work has identified a common pattern in which some glaciers speed up when surface meltwater runoff is high, because the additional basal lubrication allows the ice to slide faster, and slow down when runoff is low.

**[3] (5 pts)** Scale each time series to its own range (i.e., apply min-max scaling to each row of the dataframe). Note that this is a different approach to data standardization! Typically we standardize each feature (column) independently. In this case, however, all of our features have meaningful relationships with one another. Since we want to cluster time series with common patterns, it makes more sense to rescale each time series by its own min and max, rather than scaling all velocities at a given point in time by the min and max velocities across the whole ice sheet.
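
As an illustration of the difference between row-wise and column-wise scaling, here is a minimal sketch of row-wise min-max scaling. It uses a tiny made-up table rather than the real data; the values and the names `df`, `row_min`, `row_max`, and `scaled` are assumptions for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the velocity table: 3 locations x 4 time steps
df = pd.DataFrame(
    [[10.0, 20.0, 30.0, 20.0],
     [100.0, 150.0, 200.0, 100.0],
     [5.0, 5.5, 6.0, 5.0]],
    columns=["t1", "t2", "t3", "t4"],
)

# Min and max of each row (one value per location)
row_min = df.min(axis=1)
row_max = df.max(axis=1)

# Subtract each row's minimum, then divide by that row's range.
# axis=0 aligns the per-row Series with the DataFrame's row index.
scaled = df.sub(row_min, axis=0).div(row_max - row_min, axis=0)
```

After scaling, every row runs from 0 to 1 regardless of how fast the ice moves at that location, so only the shape of the seasonal cycle remains. A row with constant velocity has zero range and would produce NaN values under this sketch, so such rows would need separate handling.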

**[4] (7 pts)** Instantiate, train, and predict clusters using a K-Means model. Solgaard et al. (2022) ultimately chose four clusters as the optimal representation of the data, so run this initial model with four clusters.
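
The instantiate/train/predict pattern in scikit-learn can be sketched as follows. This is only an illustration on synthetic data: the array `X`, the group centers, and the choices `n_init=10` and `random_state=0` are assumptions for the example, not settings taken from Solgaard et al. (2022).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 "time series" of 29 steps drawn around 4 group centers
centers = rng.uniform(0, 1, size=(4, 29))
X = np.vstack([c + 0.01 * rng.standard_normal((50, 29)) for c in centers])

# Instantiate the model, then fit it and predict a cluster label per row in one call
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```

Here `labels` holds one integer cluster index per row of `X`, and `kmeans.cluster_centers_` holds one 29-step centroid per cluster.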

**[5] (15 pts)** Make a figure with four subplots, one for each predicted cluster. For each cluster, plot the mean velocity as a function of the day of the year, as well as the 10th and 90th percentiles of the velocity. You should produce four different velocity time series with uncertainty bounds that represent the average behavior of each cluster.   
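
One possible way to compute the per-cluster mean and percentile bounds is sketched below with NumPy. The arrays `X_scaled` and `labels` are random placeholders standing in for the scaled time series and the K-Means predictions, and the `summary` dictionary is a hypothetical container chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: 100 scaled time series of 29 steps, each assigned to 1 of 4 clusters
X_scaled = rng.uniform(0, 1, size=(100, 29))
labels = rng.integers(0, 4, size=100)
labels[:4] = [0, 1, 2, 3]  # make sure every cluster has at least one member

summary = {}
for k in np.unique(labels):
    members = X_scaled[labels == k]  # all time series assigned to cluster k
    summary[int(k)] = {
        "mean": members.mean(axis=0),               # mean at each time step
        "p10": np.percentile(members, 10, axis=0),  # 10th percentile at each time step
        "p90": np.percentile(members, 90, axis=0),  # 90th percentile at each time step
    }
```

Each entry of `summary` contains three arrays of length 29. On a matplotlib axis, the mean could be drawn with `ax.plot` and the band between `p10` and `p90` with `ax.fill_between`, using the day of year as the x-axis.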

In 2014, when ice sheet velocity data were much more sparse, a paper manually sorted glacier velocity patterns from a subset of Greenland glaciers into three modes that were related to surface meltwater runoff patterns. Compare the patterns from your clusters to the patterns in Figure 2 of Moon, T., I. Joughin, B. Smith, M. R. van den Broeke, W. J. van de Berg, B. Noël, and M. Usher (2014), Distinct patterns of seasonal Greenland glacier velocity, *Geophys. Res. Lett.*, 41, 7209–7216, doi:10.1002/2014GL061836.

Answer the questions below:    
(1) How do the velocity patterns that you inferred using K-Means compare to the velocity patterns in Moon et al. (2014)?            
(2) Does your machine learning analysis provide further support for any of the glacier velocity modes originally identified in Moon et al. (2014)? If so, which ones?

In [None]:
import numpy as np

# Day-of-year axis for plotting normalized velocity: one value per time step, 12 days apart
doy = np.arange(24, 365, 12)
doy = doy[0:29]  # keep 29 values, one per feature t1-t29

**[6] (6 pts)** We typically use the silhouette score to choose the optimal number of clusters for K-Means. Unfortunately, the silhouette score is computationally expensive, and for a dataset of this size it would take a very long time to calculate, especially if we wanted to try many different numbers of clusters to get an accurate picture of the best option. As discussed in class, an alternative way to choose the number of clusters that is less informative but easier to compute is the "elbow plot".   

Fit a `KMeans()` model for each number of clusters from 2 to 25 and create an elbow plot. Remember that the `inertia_` attribute of the trained `KMeans()` object is the sum of squared distances of samples to their closest cluster center.   
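
The loop behind an elbow plot can be sketched as follows. The data here are random numbers, so the resulting curve will not show a meaningful elbow; the array `X`, the lists `ks` and `inertias`, and the `n_init` and `random_state` values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled velocity time series: 300 rows x 29 time steps
X = rng.uniform(0, 1, size=(300, 29))

ks = list(range(2, 26))  # numbers of clusters to try: 2 through 25
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest center
    inertias.append(model.inertia_)
```

Plotting `inertias` against `ks` gives the elbow plot. Inertia generally falls as clusters are added, so the point of interest is where the curve flattens rather than where it is lowest.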

Answer the following question:     
(1) How many clusters do you think would be optimal based on your elbow plot?

**[7] (15 pts)** Train a new K-Means model that uses eight clusters. As you did for the four-cluster model, plot the mean velocity time series and its uncertainty (the 10th to 90th percentile range of velocity) for each of the eight clusters.   

Answer the questions below:     
(1) How much additional information about glacier velocity patterns do the extra four clusters seem to provide?    
(2) If we wanted to further investigate links between velocity patterns and their physical drivers, do you think the four cluster or eight cluster results would be more useful? Why?