# ￼7. Building a Clustering Model with Spark

* 싸이그래머 / 스칼라ML : PART 2 - SparkML 파트
* 김무성

# Contents

* Types of clustering models
* Extracting the right features from your data
* Training a clustering model
* Making predictions using a clustering model
* Evaluating the performance of clustering models
* Tuning parameters for clustering models
* Summary

# Types of clustering models

* K-means clustering
* Mixture models
* Hierarchical clustering

## K-means clustering

* Initialization methods
* Variants

<img src="figures/cap1.png" />

* http://spark.apache.org/docs/latest/mllib-clustering.html

<img src="figures/cap2.png" width=600 />

<img src="figures/cap3.png" width=600 />

<img src="figures/cap4.png" width=600 />

<img src="figures/cap5.png" width=600 />

### Initialization methods

* https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods

<img src="figures/cap6.png" width=600 />

### Variants

* fuzzy K-means

## Mixture models

* https://en.wikipedia.org/wiki/Mixture_model

## Hierarchical clustering

* https://en.wikipedia.org/wiki/Hierarchical_clustering

# Extracting the right features from your data

In [None]:
# 파이썬 커널로 바꿔서 실행하자.
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

In [None]:
# 파이썬 커널로 바꿔서 실행하자.
!unzip ml-100k.zip

## Extracting features from the MovieLens dataset

* Extracting movie genre labels
* Training the recommendation model
* Normalization

In [6]:
val movies = sc.textFile("ml-100k/u.item")

In [7]:
println(movies.first)

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0


### Extracting movie genre labels

In [8]:
val genres = sc.textFile("ml-100k/u.genre")

In [9]:
genres.take(5).foreach(println)

unknown|0
Action|1
Adventure|2
Animation|3
Children's|4


In [10]:
val genreMap = genres.filter(!_.isEmpty).map(line => line. split("\\|")).map(array => (array(1), array(0))).collectAsMap

In [11]:
println(genreMap)

Map(2 -> Adventure, 5 -> Comedy, 12 -> Musical, 15 -> Sci-Fi, 8 -> Drama, 18 -> Western, 7 -> Documentary, 17 -> War, 1 -> Action, 4 -> Children's, 11 -> Horror, 14 -> Romance, 6 -> Crime, 0 -> unknown, 9 -> Fantasy, 16 -> Thriller, 3 -> Animation, 10 -> Film-Noir, 13 -> Mystery)


In [12]:
val titlesAndGenres = movies.map(_.split("\\|")).map { array =>
     val genres = array.toSeq.slice(5, array.size)
     val genresAssigned = genres.zipWithIndex.filter { case (g, idx)
     =>
       g == "1"
     }.map { case (g, idx) =>
       genreMap(idx.toString)
     }
     (array(0).toInt, (array(1), genresAssigned))
   }

In [13]:
println(titlesAndGenres.first)

(1,(Toy Story (1995),ArrayBuffer(Animation, Children's, Comedy)))


### Training the recommendation model

In [14]:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val rawData = sc.textFile("ml-100k/u.data")
val rawRatings = rawData.map(_.split("\t").take(3))
val ratings = rawRatings.map{ case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) }
ratings.cache
val alsModel = ALS.train(ratings, 50, 10, 0.1)

In [15]:
import org.apache.spark.mllib.linalg.Vectors
val movieFactors = alsModel.productFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
val movieVectors = movieFactors.map(_._2)
val userFactors = alsModel.userFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
val userVectors = userFactors.map(_._2)

### Normalization

In [16]:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val movieMatrix = new RowMatrix(movieVectors)
val movieMatrixSummary =
   movieMatrix.computeColumnSummaryStatistics()
val userMatrix = new RowMatrix(userVectors)
val userMatrixSummary =
   userMatrix.computeColumnSummaryStatistics()

println("Movie factors mean: " + movieMatrixSummary.mean)
println("Movie factors variance: " + movieMatrixSummary.variance)
println("User factors mean: " + userMatrixSummary.mean)
println("User factors variance: " + userMatrixSummary.variance)

Movie factors mean: [0.20789576878673827,-0.1218642314985826,-0.08050077987566967,0.07875341160286595,0.03130059086048734,-0.17394223160805616,0.27811477528867007,-0.1147974188467822,0.48425019720189455,-0.010162416832331846,0.02780273325699798,-0.029333127912586705,0.002269696770613964,-0.126404266095983,0.23410194056932715,-0.27416059304200485,0.3412157869899644,0.13145729997521985,-0.20346040224647674,-0.417533289718383,0.07048275645402653,-0.16785782941356098,0.022321968114607554,-0.024652617382200193,-0.12933346180310487,0.11004002792496512,-0.013093729829460665,-0.09579578037552726,0.32819452849369735,0.12809119179121456,-0.30179630431435733,-0.2102948194374519,-0.3346491788915079,0.48234844468680715,0.17666643892854142,-0.1846671476193655,0.018568612051423895,-0.10691549152841391,-0.2899559200445816,0.21883869892817648,0.23024444699992483,-0.08566829466179592,-0.1909448688106259,0.011998112661985923,0.1595211027318184,-0.2589499515216911,0.16105921011794508,0.10893434958360669,0

# Training a clustering model

* Training a clustering model on the MovieLens dataset

## Training a clustering model on the MovieLens dataset

In [17]:
import org.apache.spark.mllib.clustering.KMeans
val numClusters = 5
val numIterations = 10
val numRuns = 3

In [18]:
val movieClusterModel = KMeans.train(movieVectors, numClusters, 
    numIterations, numRuns)

In [19]:
val movieClusterModelConverged = KMeans.train(movieVectors, 
    numClusters, 100)

In [20]:
val userClusterModel = KMeans.train(userVectors, numClusters,
   numIterations, numRuns)

# Making predictions using a clustering model

* Interpreting cluster predictions on the MovieLens dataset

In [21]:
val movie1 = movieVectors.first
val movieCluster = movieClusterModel.predict(movie1)
println(movieCluster)

4


In [22]:
val predictions = movieClusterModel.predict(movieVectors)
println(predictions.take(10).mkString(","))

4,4,4,2,3,2,0,3,1,0


## Interpreting cluster predictions on the MovieLens dataset

* Interpreting the movie clusters

### Interpreting the movie clusters

In [23]:
import breeze.linalg._
import breeze.numerics.pow
def computeDistance(v1: DenseVector[Double], v2: DenseVector[Double])
   = pow(v1 - v2, 2).sum

In [24]:
val titlesWithFactors = titlesAndGenres.join(movieFactors)
val moviesAssigned = titlesWithFactors.map { case (id, ((title,
   genres), vector)) =>
     val pred = movieClusterModel.predict(vector)
     val clusterCentre = movieClusterModel.clusterCenters(pred)
     val dist = computeDistance(DenseVector(clusterCentre.toArray),
   DenseVector(vector.toArray))
     (id, title, genres.mkString(" "), pred, dist)
   }
val clusterAssignments = moviesAssigned.groupBy { case (id, title,
   genres, cluster, dist) => cluster }.collectAsMap

In [25]:
for ( (k, v) <- clusterAssignments.toSeq.sortBy(_._1)) {
     println(s"Cluster $k:")
     val m = v.toSeq.sortBy(_._5)
     println(m.take(20).map { case (_, title, genres, _, d) =>
     (title, genres, d) }.mkString("\n"))
     println("=====\n")
}

Cluster 0:
(Angela (1995),Drama,0.2722493162569773)
(Moonlight and Valentino (1995),Drama Romance,0.32607016919158377)
(Blue Chips (1994),Drama,0.3318913159955939)
(Outlaw, The (1943),Western,0.36948033644835687)
(Johns (1996),Drama,0.3948319364647146)
(Outbreak (1995),Action Drama Thriller,0.4200712768357253)
(River Wild, The (1994),Action Thriller,0.42240423529013393)
(Mr. Jones (1993),Drama Romance,0.4349995632989728)
(Intimate Relations (1996),Comedy,0.44102632964306193)
(Mr. Wonderful (1993),Comedy Romance,0.4541827549187795)
(Air Up There, The (1994),Comedy,0.5395869522340357)
(Wedding Bell Blues (1996),Comedy,0.5418234889672385)
(Tainted (1998),Comedy Thriller,0.5418234889672385)
(Next Step, The (1995),Drama,0.5418234889672385)
(Private Benjamin (1980),Comedy,0.5546262184382943)
(Maverick (1994),Action Comedy Western,0.5610467996096883)
(Target (1995),Action Drama,0.5632667662491723)
(New Jersey Drive (1995),Crime Drama,0.5673731701567497)
(Nightwatch (1997),Horror Thriller,0.57

# Evaluating the performance of clustering models

* Internal evaluation metrics
* External evaluation metrics
* Computing performance metrics on the MovieLens dataset

## Internal evaluation metrics

## External evaluation metrics

## Computing performance metrics on the MovieLens dataset

In [26]:
val movieCost = movieClusterModel.computeCost(movieVectors)
val userCost = userClusterModel.computeCost(userVectors)
println("WCSS for movies: " + movieCost)
println("WCSS for users: " + userCost)

WCSS for movies: 2291.295963921779
WCSS for users: 1489.5822206450414


# Tuning parameters for clustering models

* Selecting K through cross-validation

## Selecting K through cross-validation

In [28]:
val trainTestSplitMovies = movieVectors.randomSplit(Array(0.6, 0.4),
   123)
val trainMovies = trainTestSplitMovies(0)
val testMovies = trainTestSplitMovies(1)
val costsMovies = Seq(2, 3, 4, 5, 10, 20).map { k => (k, KMeans.
   train(trainMovies, numIterations, k, numRuns).computeCost(testMovies))
   }
println("Movie clustering cross-validation:")
costsMovies.foreach { case (k, cost) => 
    println(f"WCSS for K=$k id $cost%2.2f") }

Movie clustering cross-validation:
WCSS for K=2 id 867.49
WCSS for K=3 id 871.64
WCSS for K=4 id 856.87
WCSS for K=5 id 852.78
WCSS for K=10 id 848.25
WCSS for K=20 id 857.27


In [29]:
val trainTestSplitUsers = userVectors.randomSplit(Array(0.6, 0.4),
   123)
val trainUsers = trainTestSplitUsers(0)
val testUsers = trainTestSplitUsers(1)
val costsUsers = Seq(2, 3, 4, 5, 10, 20).map { k => (k,
   KMeans.train(trainUsers, numIterations, k,
   numRuns).computeCost(testUsers)) }
println("User clustering cross-validation:")
costsUsers.foreach { case (k, cost) => 
    println(f"WCSS for K=$k id $cost%2.2f") }

User clustering cross-validation:
WCSS for K=2 id 577.85
WCSS for K=3 id 569.65
WCSS for K=4 id 567.82
WCSS for K=5 id 570.76
WCSS for K=10 id 569.52
WCSS for K=20 id 572.66


# Summary

# 참고자료

* [1] book - https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-spark
* [2] jypyter/all-spark-notebook docker - https://hub.docker.com/r/jupyter/all-spark-notebook/