### Building clustering models

In this module we use [Gaussian mixture models](http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html) in Spark MLlib to look for structure in the rider data.

This notebook is based on material supplied by Cloudera under their Cloudera Academic Partner program and on the book *Spark: The Definitive Guide* by Bill Chambers and Matei Zaharia.

_Gaussian mixture models_ (GMMs) make different assumptions than k-means. K-means tries to group data by minimizing the sum of squared distances from each point to the center of its cluster. Gaussian mixture models assume that each cluster produces data based upon random draws from a Gaussian distribution. This means that a cluster should be less likely to have data at its edge and much more likely to have data near its center, as reflected in the Gaussian distribution. Each Gaussian cluster can be of arbitrary size, with its own mean and standard deviation (in more than one dimension, a covariance matrix, and hence a possibly different ellipsoid shape). There are still k user-specified clusters that will be created during training.

A simplified way of thinking about GMMs is that they are like a soft version of k-means. K-means creates very rigid structures: each point belongs to exactly one cluster. GMMs allow for a more nuanced assignment, in which each point is associated with every cluster through a probability instead of a rigid boundary. See [The Elements of Statistical Learning, Section 14.3](https://web.stanford.edu/~hastie/ElemStatLearn/). Source: *Spark: The Definitive Guide*.
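The difference between hard and soft assignment can be sketched in one dimension. The mixture below (equal weights, means 0 and 4, unit standard deviations) and the function names are hypothetical, chosen only for illustration; this is not how Spark implements the model internally.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def soft_assign(x, weights, means, stds):
    """GMM style: posterior probability of each component given x (sums to 1)."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

def hard_assign(x, means):
    """k-means style: index of the nearest center."""
    return min(range(len(means)), key=lambda i: abs(x - means[i]))

# Hypothetical two-component mixture (illustrative values only)
weights = [0.5, 0.5]
means = [0.0, 4.0]
stds = [1.0, 1.0]

for x in (0.0, 2.0, 3.0):
    probs = soft_assign(x, weights, means, stds)
    print(x, hard_assign(x, means), [round(p, 3) for p in probs])
```

A point halfway between the two centers gets a single label from the hard assignment, while the soft assignment reports that both components are equally plausible.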



Topics
- Extract, transform, and select the features
- Build and evaluate a Gaussian mixture model
- Plot cluster locations (geo data)
- Save and apply cluster model
- Explore cluster profiles

In [0]:
# Load clean rider data
riders = spark.read.parquet("/mnt/cis442f-data/duocar/clean/riders/")

#### Preprocess the data

We are going to focus on student riders in this analysis.

In [0]:
# Filter on student riders
from pyspark.sql.functions import col
students = riders.filter(col("student") == True)

In [0]:
students.printSchema()

In [0]:
# Inspect the data of the students dataframe
students.drop("home_block","home_lat", "home_lon", "work_lat", "work_lon", "student").show(5)
students.select("student","home_block","home_lat", "home_lon", "work_lat", "work_lon").show(5)

In [0]:
students.count()

#### Extract, transform, and select the features

We select home latitude and longitude as the features, so we can think of each cluster as a 2-dimensional Gaussian bell.
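As a rough illustration of what a 2-dimensional Gaussian bell means, the sketch below evaluates a bivariate Gaussian density at a few points. The function name `bivariate_normal_pdf` and the mean and covariance values are assumptions made for this example, not values from the rider data.

```python
import math

def bivariate_normal_pdf(x, y, mean, cov):
    """Density of a 2D Gaussian at (x, y) for cov = [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x - mean[0], y - mean[1]
    # Quadratic form using the closed-form inverse of the 2x2 covariance matrix
    q = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

# Hypothetical bell centered at the origin with unit variances
mean = (0.0, 0.0)
cov = [[1.0, 0.0], [0.0, 1.0]]

for point in [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]:
    print(point, bivariate_normal_pdf(point[0], point[1], mean, cov))
```

The density is highest at the mean and falls off quickly with distance, which is why points far from every cluster center are unlikely under the model.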

In [0]:
# Select the features to cluster on: home latitude and longitude
selected = ["home_lat", "home_lon"]

# Assemble the feature vector
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=selected, outputCol="features")
assembled = assembler.transform(students)
assembled.select("features").show(5,False)

In [0]:
assembled.count()

#### Build and evaluate a Gaussian mixture model

In [0]:
# Specify a Gaussian mixture model with two clusters:
from pyspark.ml.clustering import GaussianMixture
gm = GaussianMixture(featuresCol="features", k=2)
type(gm)

In [0]:
# Examine all arguments:
print(gm.explainParams())

In [0]:
# Fit the Gaussian mixture model: 
gm_model = gm.fit(assembled)
type(gm_model)

In [0]:
# Examine (multivariate) Gaussian distribution function
# Note there are only two rows because k = 2
gm_model.gaussiansDF.head(5)

- Each `mean` vector is the center of one cluster.
- Each `cov` (covariance) matrix describes the shape of the Gaussian associated with that cluster.
- Below is a graphical representation of three clusters (so not our example) with different shapes in 2D space.

Have a look at pages 23 to 28 of [this document](http://www.ee.columbia.edu/~stanchen/spring16/e6870/slides/lecture3.pdf) to get a sense of how 2D Gaussian shapes are encoded in the covariance matrix (matrix Σ on page 28):

![GMM Plot](https://cis442f-open-data.s3.amazonaws.com/pictures/gmm_plot1.png "GMM Plot")
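To make the link between a covariance matrix and a cluster's shape concrete, the sketch below converts a symmetric 2x2 covariance matrix into the axis lengths and orientation of its one-standard-deviation ellipse. The function name `ellipse_from_cov` and the example matrices are hypothetical; they are not the values fitted in this notebook.

```python
import math

def ellipse_from_cov(a, b, c):
    """Return (major_std, minor_std, angle_degrees) for cov = [[a, b], [b, c]].

    The angle is measured from the first feature axis toward the second.
    """
    mid = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    lam_major = mid + radius              # larger eigenvalue
    lam_minor = max(mid - radius, 0.0)    # smaller eigenvalue, guarded against rounding
    angle = 0.5 * math.atan2(2.0 * b, a - c)
    return math.sqrt(lam_major), math.sqrt(lam_minor), math.degrees(angle)

# Hypothetical covariance matrices (illustrative only)
print(ellipse_from_cov(4.0, 0.0, 1.0))  # stretched along the first axis
print(ellipse_from_cov(1.0, 0.0, 4.0))  # stretched along the second axis
print(ellipse_from_cov(2.0, 1.0, 2.0))  # tilted: the two features are positively correlated
```

Zero off-diagonal entries give an axis-aligned ellipse, while a non-zero off-diagonal entry tilts it. In this notebook the first feature is latitude and the second is longitude, so the angle would be measured from the latitude axis.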

#### Plot student home locations

The following cells contain code to plot the centroids of the two clusters on a map of the Fargo, ND area.

In [0]:
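# Install folium mapping library if necessary
# (on newer Databricks runtimes where dbutils.library is no longer available, run `%pip install folium` instead)
dbutils.library.installPyPI("folium")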

In [0]:
import folium
# Plot Gaussian means (cluster centers):
center_map = folium.Map(location=[46.8772222, -96.7894444], zoom_start=13)
for cluster in gm_model.gaussiansDF.collect():
  folium.Marker(cluster["mean"]).add_to(center_map)

html_string = center_map._repr_html_()

# Display the map 
displayHTML(html_string)


So, the first centroid from `gm_model.gaussiansDF`, [46.893, -96.804], is located at North Dakota State University, while the second, [46.866, -96.758], is located at Minnesota State University Moorhead.



Like other clustering algorithms, GMMs include a summary class to help with model evaluation. This includes information about the clusters created, such as the:
- Weights
- Means
- Covariances of the Gaussian mixture components

These can help us learn more about the underlying structure in our data.
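The mixing weights and the cluster sizes measure related but different things. In the standard EM procedure for fitting GMMs, each weight is updated as the average posterior probability of its component over all points, whereas a cluster size counts the points whose most probable component is that cluster. The sketch below uses hypothetical probability vectors and helper names, not output from `gm_model`.

```python
def mixing_weights(probabilities):
    """Average each component's posterior probability over all points."""
    n = len(probabilities)
    k = len(probabilities[0])
    return [sum(row[j] for row in probabilities) / n for j in range(k)]

def cluster_sizes(probabilities):
    """Count points by their most probable component (hard assignment)."""
    sizes = [0] * len(probabilities[0])
    for row in probabilities:
        sizes[row.index(max(row))] += 1
    return sizes

# Hypothetical posterior probabilities for four points and two components
probabilities = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.4], [0.1, 0.9]]

print(mixing_weights(probabilities))
print(cluster_sizes(probabilities))
```

When nearly every point has a probability close to 0 or 1, as in this notebook, the weights and the cluster-size proportions come out almost identical.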

In [0]:
# Examine mixing weights
gm_model.weights
# about 76% of student riders belong to the cluster around North Dakota State

In [0]:
# Examine model summary
gm_model.hasSummary

In [0]:
# Examine cluster sizes
gm_model.summary.clusterSizes
# 162 students in the cluster around North Dakota State University

In [0]:
# Examine predictions DataFrame
gm_model.summary.predictions.printSchema()

We can look at some rows of the `gm_model.summary.predictions` DataFrame to examine the predictions. Note that a value like `probability=DenseVector([1.0, 0.0])` indicates that the algorithm is highly confident that this particular location belongs to the first of the two clusters identified, i.e., the one around North Dakota State University.
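A probability vector can be read in two ways: its largest entry identifies the predicted cluster, and the size of that entry indicates how clear-cut the assignment is. The sketch below assumes the prediction is the index of the largest probability; the helper names and the 0.9 threshold are hypothetical choices for illustration. The first two vectors are taken from the output shown later in this notebook, and the third is made up.

```python
def predict(probability):
    """Index of the most probable cluster."""
    return max(range(len(probability)), key=lambda i: probability[i])

def is_ambiguous(probability, threshold=0.9):
    """True when no cluster reaches the (hypothetical) confidence threshold."""
    return max(probability) < threshold

rows = [
    [1.0, 8.997262071721094e-20],   # from this notebook's output
    [6.866266523170403e-20, 1.0],   # from this notebook's output
    [0.55, 0.45],                   # made-up borderline case
]

for probability in rows:
    print(predict(probability), is_ambiguous(probability))
```

Values such as `8.99e-20` are effectively zero, so for the student riders the two clusters are well separated and almost no point would be flagged as ambiguous.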

In [0]:
for row in gm_model.summary.predictions.select("features","prediction", "probability").head(5):
    print(row)

In [0]:
# Examine predictions DataFrame (alternatively, in tabular form)
gm_model.summary.predictions.select("features","prediction","probability").show(5, False)

#### Save and apply clustering model

In [0]:
# Save the Gaussian mixture model
gm_model.write().overwrite().save("/mnt/my-data/myduocar/gm_model")

In [0]:
# Load the Gaussian mixture model
# Useful if you would like to use it later to make predictions
from pyspark.ml.clustering import GaussianMixtureModel
gm_model_loaded = GaussianMixtureModel.load("/mnt/my-data/myduocar/gm_model")

In [0]:
# Apply the Gaussian mixture model
clustered = gm_model_loaded.transform(assembled)

In [0]:
# Examine schema and view data:
clustered.printSchema()

In [0]:
for item in clustered.head(5):
    print(str(item) + "\n")

In [0]:
# Compute cluster sizes:
clustered.groupBy("prediction").count().orderBy("prediction").show()

#### Explore cluster profiles

It looks like:
- A larger proportion of the riders at North Dakota State are male
- There are slightly more female riders at Moorhead
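Statements about proportions are easier to check when the counts per cluster and sex are converted into within-cluster shares. The sketch below shows that normalization in plain Python; the counts are hypothetical placeholders, NOT the actual DuoCar results, and the helper name is an assumption.

```python
from collections import defaultdict

def within_cluster_shares(counts):
    """Convert {(cluster, category): count} into shares that sum to 1 per cluster."""
    totals = defaultdict(int)
    for (cluster, _), n in counts.items():
        totals[cluster] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}

# Hypothetical counts, not taken from the rider data
counts = {
    (0, "female"): 40, (0, "male"): 60,
    (1, "female"): 11, (1, "male"): 9,
}

shares = within_cluster_shares(counts)
for key in sorted(shares):
    print(key, round(shares[key], 2))
```

Comparing shares instead of raw counts avoids being misled by the very different sizes of the two clusters.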

In [0]:
# Explore clusters
clustered \
  .groupBy("prediction", "sex") \
  .count() \
  .orderBy("prediction", "sex") \
  .show()

In [0]:
display(clustered.where(clustered.sex.isNotNull()))
# display(clustered) # version to include nulls

id,birth_date,start_date,first_name,last_name,sex,ethnicity,student,home_block,home_lat,home_lon,work_lat,work_lon,features,probability,prediction
220200000013,1998-04-29,2017-01-01,Conor,Curro,male,White,True,380170005022009,46.889479,-96.811096,,,"Map(vectorType -> dense, length -> 2, values -> List(46.889479, -96.811096))","Map(vectorType -> dense, length -> 2, values -> List(1.0, 8.997262071721094E-20))",0
220200000014,1998-07-08,2017-01-01,Robert,Dunnan,male,White,True,380170003002002,46.897359,-96.801023,,,"Map(vectorType -> dense, length -> 2, values -> List(46.897359, -96.801023))","Map(vectorType -> dense, length -> 2, values -> List(1.0, 7.754974693892166E-20))",0
220200000017,1998-12-25,2017-01-01,Zachary,Brown,male,White,True,380170006004000,46.887537,-96.810385,,,"Map(vectorType -> dense, length -> 2, values -> List(46.887537, -96.810385))","Map(vectorType -> dense, length -> 2, values -> List(1.0, 1.1278032217941238E-19))",0
220200000020,1998-08-11,2017-01-01,Scott,Griffith,male,White,True,270270204001005,46.866674,-96.755357,,,"Map(vectorType -> dense, length -> 2, values -> List(46.866674, -96.755357))","Map(vectorType -> dense, length -> 2, values -> List(6.866266523170403E-20, 1.0))",1
220200000023,1994-11-26,2017-01-01,Rebecca,Kendall,female,White,True,380170004003004,46.893968,-96.796288,,,"Map(vectorType -> dense, length -> 2, values -> List(46.893968, -96.796288))","Map(vectorType -> dense, length -> 2, values -> List(1.0, 1.2374437405336763E-19))",0
220200000029,1994-12-09,2017-01-01,Jiri,Webster,male,White,True,380170006004021,46.887771,-96.823359,,,"Map(vectorType -> dense, length -> 2, values -> List(46.887771, -96.823359))","Map(vectorType -> dense, length -> 2, values -> List(1.0, 1.3753455782506818E-18))",0
220200000036,1996-12-15,2017-01-02,Ben,Sparks,male,White,True,380170003002004,46.895864,-96.805807,,,"Map(vectorType -> dense, length -> 2, values -> List(46.895864, -96.805807))","Map(vectorType -> dense, length -> 2, values -> List(1.0, 7.898802369107408E-20))",0
220200000039,1993-01-06,2017-01-02,Kayla,Walls,female,White,True,270270203002010,46.867739,-96.768726,,,"Map(vectorType -> dense, length -> 2, values -> List(46.867739, -96.768726))","Map(vectorType -> dense, length -> 2, values -> List(1.7052976104358498E-19, 1.0))",1
220200000046,1993-12-07,2017-01-02,Amber,Barnhill,female,White,True,380170005022008,46.889493,-96.810336,,,"Map(vectorType -> dense, length -> 2, values -> List(46.889493, -96.810336))","Map(vectorType -> dense, length -> 2, values -> List(1.0, 8.491394218189515E-20))",0
220200000055,1998-07-26,2017-01-03,Dean,King,male,White,True,270270204001005,46.866674,-96.755357,,,"Map(vectorType -> dense, length -> 2, values -> List(46.866674, -96.755357))","Map(vectorType -> dense, length -> 2, values -> List(6.866266523170403E-20, 1.0))",1


### Hands On

![Hands-on](https://cis442f-open-data.s3.amazonaws.com/pictures/hands.png "Hands-on")


#### Exercises

(1) Experiment with different values of k (number of clusters).

(2) Experiment with other hyperparameters.

(3) Look for clusters for all riders (not just student riders).
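When experimenting with k, one common way to compare fits is the Bayesian information criterion (BIC), which penalizes the log-likelihood by the number of free parameters. The sketch below is only an aid for the exercise, not part of the original material. It assumes full covariance matrices and that the total log-likelihood of each fitted model is available (PySpark's model summary exposes a `logLikelihood` attribute that could serve as the input). The log-likelihood values shown are hypothetical.

```python
import math

def gmm_num_params(k, d):
    """Free parameters of a k-component GMM with full covariances in d dimensions."""
    weights = k - 1                       # mixing weights sum to 1
    means = k * d                         # one d-dimensional mean per component
    covariances = k * d * (d + 1) // 2    # one symmetric d x d matrix per component
    return weights + means + covariances

def bic(log_likelihood, n, k, d):
    """Bayesian information criterion; lower values indicate a better trade-off."""
    return gmm_num_params(k, d) * math.log(n) - 2.0 * log_likelihood

# Hypothetical log-likelihoods for models fitted with different k (n points, d = 2)
n, d = 200, 2
for k, log_likelihood in [(2, 950.0), (3, 960.0), (4, 962.0)]:
    print(k, gmm_num_params(k, d), round(bic(log_likelihood, n, k, d), 1))
```

A larger k always fits the training data at least as well, so a penalized score like this helps judge whether the extra clusters are worth their added parameters.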



#### References

[Wikipedia - Cluster analysis](https://en.wikipedia.org/wiki/Cluster_analysis)

[Spark Documentation - Clustering](http://spark.apache.org/docs/latest/ml-clustering.html)

[Spark Python API - GaussianMixture, GaussianMixtureModel, and GaussianMixtureSummary classes](http://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#clustering)

[Post on how to display maps on Databricks](https://forums.databricks.com/questions/444/how-to-create-maps-in-databricks.html)

#### The following cells helped in figuring out how to plot the map

In [0]:
# You can iterate over the rows
for row in gm_model.gaussiansDF.head(5):
    print(row)

# Find the cluster centroids for plotting on a map
for cluster in gm_model.gaussiansDF.collect():
    print(cluster[0])

# And show that the lat/long are returned in a DenseVector
# (`cluster` still refers to the last row from the loop above)
print(type(cluster["mean"]))

In [0]:
# This allows you to save the html version of the map
# Useful since the interactive version does not get saved
# with the HTML version of the notebook

# Save map to local storage
center_map.save("cluster-map.html")
# Copy file to where I can download it
dbutils.fs.cp('file:/databricks/driver/cluster-map.html', '/mnt/my-data/cluster-map.html')

# Tried to use the following code to print the map, but it didn't work
# Read all lines of html map file at once
#file = open('cluster-map.html',mode='r')
#all_of_it = file.read()
#file.close()
#display(all_of_it)

**This is a static version of the locations of the two clusters**

I downloaded the html file and opened it with a regular browser. This is a screenshot.
![Clusters](https://cis442f-open-data.s3.amazonaws.com/pictures/cluster-map.png "Clusters")

If you take the following code and run it in a regular Jupyter notebook, you should get an interactive map:

```
import folium
# Plot Gaussian means (cluster centers):
m = folium.Map(location=[46.8772222, -96.7894444], zoom_start=13)
tooltip = 'Click me!'
folium.Marker([46.8934749136,-96.8042130494], popup='<i>North Dakota State</i>', tooltip=tooltip).add_to(m)
folium.Marker([46.866446549,-96.7584140588], popup='<i>Minnesota State University Moorhead</i>', tooltip=tooltip).add_to(m)
m  # Display the map
```