### grp

# Spark: The Definitive Guide

## PART 6: Advanced Analytics and Machine Learning 

## dataPaths

In [1]:
clust = '/Users/grp/sparkTheDefinitiveGuide/data/retail-data/by-day/*.csv'

## _Chapter #29 - Unsupervised Learning_

-  Unsupervised learning (UL) tries to discover patterns and underlying structure in a dataset, typically via clustering:
    -  clustering can produce odd clusters in high-dimensional spaces, aka "the curse of dimensionality":
        -  as the feature space grows in dimensionality it becomes increasingly sparse and noisier
        -  the amount of data needed to compute statistically reliable results therefore grows very quickly with the number of dimensions
        -  as a result the model may latch onto noise instead of the factors that actually cause the cluster groupings
-  K-Means / Bisecting K-Means group data by minimizing the sum of squared distances between points and the center of their cluster
-  Gaussian Mixture Model (GMM) produces clusters by modeling the data as a mixture of Gaussian distributions
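
A rough way to see the curse of dimensionality is to compare the nearest and farthest neighbor of a random query point as the number of dimensions grows. The sketch below is plain Python on uniform random data (not the retail dataset); the helper name `relative_contrast` and the chosen dimensions are assumptions made for illustration.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)

# the contrast shrinks as dimensions are added: "near" and "far" look alike
contrasts = {dim: relative_contrast(dim) for dim in (2, 20, 200)}
for dim, contrast in contrasts.items():
    print(dim, round(contrast, 2))
```

When the contrast gets close to zero, distance-based methods such as K-Means have little signal left to separate groups, which is one reason odd clusters show up.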

### Use Cases:
-  identifying anomalies in data
-  topic modeling to identify topics within unstructured text
-  identifying groups in data

### MLlib Unsupervised Models:
-  K-Means
-  Bisecting K-Means
-  GMM (Gaussian Mixture Model)
-  LDA (Latent Dirichlet Allocation)

### Model Configuration:
-  Model Hyperparameters (determine how the model is structured and initialized)
-  Training Parameters (control how the model is trained)
-  Prediction Parameters (control how the model makes predictions)
-  Model Summary (provides information about the final trained model)

### _Chapter #29 Exercises (UL)_

In [2]:
from pyspark.ml.feature import VectorAssembler

In [3]:
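# combine the two numeric columns into a single vector column for MLlib
va = VectorAssembler().setInputCols(["Quantity", "UnitPrice"]).setOutputCol("features")

# read a small sample (50 rows) of the retail data and keep rows that have a description
sales = va.transform(
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(clust)
    .limit(50)
    .coalesce(1)
    .where("Description IS NOT NULL"))

sales.cache()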

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string, features: vector]

In [4]:
sales.printSchema()
sales.show(3, True)

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)
 |-- features: vector (nullable = true)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|   features|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----------+
|   580538|    23084|  RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|   14075.0|United Kingdom|[48.0,1.79]|
|   580538|    23077| DOUGHNUT LIP GLOSS |      20|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|[20.0,1.25]|
|   580538|    22906

## K-Means:
-  K points in the dataset are randomly chosen as the initial cluster centers
-  every remaining point is assigned to the nearest center based on Euclidean distance:
    -  the cluster centers (centroids) are then recomputed and all points are reassigned to the nearest centroid
    -  the process repeats until the max # of iterations is reached or the centroid locations stop changing
-  must choose the K value, which can be a hard task and requires experimentation and an understanding of the dataset
-  adjusting parameters is a trade-off: it can increase processing time but may lead to better clustering results
    -  Model Hyperparameters:
        -  K:
            -  hardcoded number of clusters in model
    -  Training Parameters:
        -  initMode:
            -  determines starting locations of the centroids:
                -  random (random position of centroids)
                -  k-means|| (default, a parallel variant of k-means++ that gives well spread out centroids)
        -  initSteps:
            -  number of steps for the k-means|| initialization mode
        -  maxIter:
            -  total # of iterations over data before stopping
        -  tol:
            -  threshold for changes in centroids to stop model
    -  Metrics Summary:
        -  summary class to evaluate model
        -  information about clusters created and their relative sizes
        -  computes "within set sum of squared errors" for how close values are from each cluster centroid (**computeCost**)
        -  goal is to minimize "within set sum of squared error"
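
The assign/update loop and the "within set sum of squared errors" described above can be sketched in plain Python. This is a minimal illustration, not MLlib's implementation: the function names, the toy points, and the way `tol` is compared against the squared centroid shift are assumptions made for this example.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans(points, k, max_iter=20, tol=1e-4, seed=0):
    # "random" initMode: pick k distinct data points as starting centroids
    centers = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        # assignment step: every point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol ** 2:  # centroids stopped moving
            break
    return centers

def wssse(points, centers):
    # within set sum of squared errors: squared distance to the nearest centroid
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# hypothetical toy data: two well separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans(points, k=2)
cost = wssse(points, centers)
print(sorted(centers), cost)
```

Running the sketch with several values of `k` and comparing `cost` is a simple way to experiment with the choice of K.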

In [5]:
from pyspark.ml.clustering import KMeans

In [6]:
km = KMeans().setK(5)
print(km.explainParams())
kmModel = km.fit(sales)

featuresCol: features column name. (default: features)
initMode: The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (default: k-means||)
initSteps: The number of steps for k-means|| initialization mode. Must be > 0. (default: 2)
k: The number of clusters to create. Must be > 1. (default: 2, current: 5)
maxIter: max number of iterations (>= 0). (default: 20)
predictionCol: prediction column name. (default: prediction)
seed: random seed. (default: 7969353092125344463)
tol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.0001)


### _Metrics Summary Example_

In [7]:
summary = kmModel.summary
print(summary.clusterSizes) # number of points
kmModel.computeCost(sales)
centers = kmModel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

[10, 8, 29, 2, 1]
Cluster Centers: 
[23.2    0.956]
[ 2.5     11.24375]
[7.55172414 2.77172414]
[48.    1.32]
[36.    0.85]
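
As a sanity check on the centers printed above, a point can be assigned by hand to its nearest center. The sketch below copies the five K-Means centers from the output and uses a hypothetical helper `nearest_center`; it mimics what a prediction amounts to under Euclidean distance and is not the MLlib code path.

```python
# cluster centers copied from the K-Means output above, in cluster-index order
centers = [
    (23.2, 0.956),
    (2.5, 11.24375),
    (7.55172414, 2.77172414),
    (48.0, 1.32),
    (36.0, 0.85),
]

def nearest_center(point, centers):
    # index of the center with the smallest squared Euclidean distance
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(point, center))
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i]))

# feature vectors of the first two rows shown by sales.show(3, True)
first = nearest_center((48.0, 1.79), centers)
second = nearest_center((20.0, 1.25), centers)
print(first, second)
```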


## Bisecting K-Means:
-  "top-down" clustering method, whereas K-Means works "bottom-up" by initially assigning points to different groups
-  starts with a single group containing all points and then repeatedly splits groups until the specified K # of clusters is reached
-  Model Hyperparameters:
    -  K:
        -  hardcoded number of clusters in model
-  Training Parameters:
    -  minDivisibleClusterSize:
        -  minimum # of points (if >= 1.0) or minimum proportion of points (if < 1.0)
        -  a cluster must contain at least this many points to be eligible for further splitting (default: 1.0)
    -  maxIter:
        -  total # of iterations over data before stopping
-  Metrics Summary:
    -  same as K-Means
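
The top-down idea can be sketched in plain Python: keep all points in one cluster, then repeatedly split the cluster with the largest squared error using a 2-means step until K clusters exist. This is a simplified illustration under stated assumptions, not MLlib's algorithm: the helper names, the deterministic 2-means start, the "split the worst cluster first" order, and the rounding in `min_divisible_points` are all choices made for this example.

```python
import math

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def sse(points):
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)

def two_means(points, max_iter=20):
    # deterministic start: the first point and the point farthest from it
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    left, right = [], []
    for _ in range(max_iter):
        new_left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        new_right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not new_left or not new_right or (new_left, new_right) == (left, right):
            break
        left, right = new_left, new_right
        c0, c1 = mean(left), mean(right)
    return left, right

def min_divisible_points(min_divisible_cluster_size, n_total):
    # >= 1.0 is read as a point count, < 1.0 as a proportion of all points
    if min_divisible_cluster_size >= 1.0:
        return math.ceil(min_divisible_cluster_size)
    return math.ceil(min_divisible_cluster_size * n_total)

def bisecting_kmeans(points, k, min_divisible_cluster_size=1.0):
    clusters = [list(points)]
    min_size = min_divisible_points(min_divisible_cluster_size, len(points))
    while len(clusters) < k:
        # only clusters that are large enough and not degenerate may be split
        candidates = [c for c in clusters if len(c) >= max(min_size, 2) and sse(c) > 0]
        if not candidates:
            break
        target = max(candidates, key=sse)
        left, right = two_means(target)
        if not left or not right:
            break
        clusters.remove(target)
        clusters += [left, right]
    return clusters

# hypothetical toy data: three well separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0), (20, 1)]
clusters = bisecting_kmeans(points, k=3)
print([len(c) for c in clusters])
```

Raising `min_divisible_cluster_size` above the size of every cluster stops the splitting early, so fewer than K clusters may come back.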

In [8]:
from pyspark.ml.clustering import BisectingKMeans

In [9]:
bkm = BisectingKMeans().setK(5).setMaxIter(5)
print(bkm.explainParams())
bkmModel = bkm.fit(sales)

featuresCol: features column name. (default: features)
k: The desired number of leaf clusters. Must be > 1. (default: 4, current: 5)
maxIter: max number of iterations (>= 0). (default: 20, current: 5)
minDivisibleClusterSize: The minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) of a divisible cluster. (default: 1.0)
predictionCol: prediction column name. (default: prediction)
seed: random seed. (default: -6311319853468918464)


### _Metrics Summary Example_

In [10]:
summary = bkmModel.summary
print(summary.clusterSizes) # number of points
bkmModel.computeCost(sales)
centers = bkmModel.clusterCenters() # centers of the bisecting model
print("Cluster Centers: ")
for center in centers:
    print(center)

[16, 8, 13, 10, 3]


## GMM:
-  assumes each cluster generates its data through random draws from a Gaussian distribution, so every point gets a probability of belonging to each cluster
    -  Model Hyperparameters:
        -  K:
            -  hardcoded number of clusters in model
    -  Training Parameters:
        -  maxIter:
            -  total # of iterations over data before stopping
        -  tol:
            -  threshold for changes in weights to stop model
    -  Metrics Summary:
        - produces cluster metrics like:
            -  weights
            -  means
            -  covariance of Gaussian mixture
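
How weights, means, and (co)variances turn into per-cluster probabilities can be sketched for the one-dimensional case. The mixture values below are made up for illustration, and `responsibilities` is a hypothetical helper; the model fitted in these notes works on two-dimensional vectors with full covariance matrices.

```python
import math

def gaussian_pdf(x, mean, variance):
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def responsibilities(x, weights, means, variances):
    # weight each Gaussian's density by its mixture weight, then normalize
    weighted = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(weighted)
    return [value / total for value in weighted]

# hypothetical two-component mixture
weights = [0.5, 0.5]
means = [0.0, 10.0]
variances = [4.0, 4.0]

near_first = responsibilities(1.0, weights, means, variances)
halfway = responsibilities(5.0, weights, means, variances)
print(near_first, halfway)
```

A point close to one mean gets almost all of its probability from that component, while a point halfway between two equally weighted components is split evenly. This is the kind of soft assignment a `probability` column expresses.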

In [11]:
from pyspark.ml.clustering import GaussianMixture

In [12]:
gmm = GaussianMixture().setK(5)
print(gmm.explainParams())
model = gmm.fit(sales)

featuresCol: features column name. (default: features)
k: Number of independent Gaussians in the mixture model. Must be > 1. (default: 2, current: 5)
maxIter: max number of iterations (>= 0). (default: 100)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
seed: random seed. (default: -7090211980209472397)
tol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.01)


### _Metrics Summary Example_

In [13]:
summary = model.summary
print(model.weights)
model.gaussiansDF.show(3)
summary.cluster.show(3)
summary.clusterSizes
summary.probability.show(3)

[0.16503937777770641, 0.35496420094056985, 0.06003637101912308, 0.1999636297743671, 0.21999642048823354]
+--------------------+--------------------+
|                mean|                 cov|
+--------------------+--------------------+
|[2.54180583818530...|0.785769315153778...|
|[5.07243095740621...|2.059950971034034...|
|[43.9877864408847...|32.22707068867282...|
+--------------------+--------------------+
only showing top 3 rows

+----------+
|prediction|
+----------+
|         2|
|         3|
|         3|
+----------+
only showing top 3 rows

+--------------------+
|         probability|
+--------------------+
|[1.37632400885157...|
|[4.89041912245635...|
|[1.67299627008735...|
+--------------------+
only showing top 3 rows



## LDA:
-  hierarchical clustering model
-  performs well on text documents
-  treats each document as a mixture with a variable contribution from each of multiple topics
-  implementations:
    -  online LDA:
        -  works better when there are more examples (documents)
    -  expectation maximization (EM):
        -  works better with larger input vocabulary and more topics associated with corpus
-  Model Hyperparameters:
    -  K:
        -  total number of topics for data
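    -  docConcentration:
        -  concentration parameter ("alpha") of the Dirichlet prior on each document's distribution over topics (document-topic distribution)
    -  topicConcentration:
        -  concentration parameter of the symmetric Dirichlet prior on each topic's distribution over terms (topic-term distribution)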
-  Training Parameters:
    -  maxIter:
        -  total # of iterations over data before stopping
    -  optimizer:
        -  determines training mechanism [EM or online (default)]
    -  learningDecay:
        -  only relevant for online optimizer
        -  learning rate, set as an exponential decay rate (should be in (0.5, 1.0])
    -  learningOffset:
        -  only relevant for online optimizer
        -  downweights early iterations
    -  optimizeDocConcentration:
        -  only relevant for online optimizer
        -  indicates whether docConcentration will be optimized during training
    -  subsamplingRate:
        -  only relevant for online optimizer
        -  fraction of the corpus sampled in each iteration of mini-batch gradient descent
    -  seed:
        -  random seed, set it to reproduce the same results
    -  checkpointInterval:
        -  saves model's work over course of training for recovery purposes
-  Prediction Parameters:
    -  topicDistributionCol:
        -  holds output of topic mixture distribution for each document
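
The role of the concentration parameters is easier to see with a small simulation of the generative story behind LDA: draw a topic mixture from a Dirichlet prior, then draw each word from one of the topics. The sketch below is plain Python under stated assumptions: the two topics and their word probabilities are invented, and `sample_dirichlet`, `average_peak`, and `generate_document` are hypothetical helpers, not MLlib functions.

```python
import random

def sample_dirichlet(rng, concentration, size):
    # symmetric Dirichlet draw via normalized Gamma samples
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def average_peak(concentration, n_topics=5, n_docs=500, seed=0):
    # average weight of the dominant topic across many simulated documents
    rng = random.Random(seed)
    peaks = [max(sample_dirichlet(rng, concentration, n_topics)) for _ in range(n_docs)]
    return sum(peaks) / n_docs

def generate_document(rng, topic_mix, topics, length):
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(topics)), weights=topic_mix)[0]
        terms = list(topics[topic])
        probs = list(topics[topic].values())
        words.append(rng.choices(terms, weights=probs)[0])
    return words

# low concentration: documents dominated by one topic; high: evenly mixed
sparse_peak = average_peak(0.1)
dense_peak = average_peak(10.0)
print(round(sparse_peak, 2), round(dense_peak, 2))

# hypothetical topics as term -> probability maps
topics = [
    {"water": 0.5, "hot": 0.3, "bottle": 0.2},
    {"vintage": 0.6, "doormat": 0.4},
]
rng = random.Random(1)
topic_mix = sample_dirichlet(rng, 0.5, len(topics))
doc = generate_document(rng, topic_mix, topics, 6)
print(topic_mix, doc)
```

Training runs this story in reverse: given only the words, it estimates the topic mixtures and the term distributions.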

In [14]:
from pyspark.ml.feature import Tokenizer, CountVectorizer

In [15]:
tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")

tokenized = tkn.transform(sales.drop("features"))
cv = CountVectorizer()\
.setInputCol("DescOut")\
.setOutputCol("features")\
.setVocabSize(500)\
.setMinTF(0)\
.setMinDF(0)\
.setBinary(True)

cvFitted = cv.fit(tokenized)
prepped = cvFitted.transform(tokenized)
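
What the tokenizer and a binary count vectorizer produce can be sketched without Spark. The sketch assumes lowercasing plus whitespace splitting for tokens and a vocabulary ranked by corpus frequency with ties broken alphabetically (the tie-break is an assumption of this example); the third description is made up.

```python
from collections import Counter

def tokenize(text):
    # lowercase, then split on whitespace
    return text.lower().split()

def fit_vocabulary(token_lists, vocab_size):
    counts = Counter(token for tokens in token_lists for token in tokens)
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return ranked[:vocab_size]

def binary_vector(tokens, vocabulary):
    # 1 if the term occurs in the document at all, else 0
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

# first two descriptions come from the sales sample, the third is hypothetical
descriptions = ["RABBIT NIGHT LIGHT", "DOUGHNUT LIP GLOSS ", "NIGHT LIGHT"]
token_lists = [tokenize(d) for d in descriptions]
vocabulary = fit_vocabulary(token_lists, vocab_size=500)
vectors = [binary_vector(tokens, vocabulary) for tokens in token_lists]
print(vocabulary)
print(vectors)
```

With binary output each entry only records whether a term is present, so repeated words in one description do not add weight.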

In [16]:
from pyspark.ml.clustering import LDA

In [17]:
lda = LDA().setK(10).setMaxIter(5)
print(lda.explainParams())
model = lda.fit(prepped)

checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
docConcentration: Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). (undefined)
featuresCol: features column name. (default: features)
k: The number of topics (clusters) to infer. Must be > 1. (default: 10, current: 10)
keepLastCheckpoint: (For EM optimizer) If using checkpointing, this indicates whether to keep the last checkpoint. If false, then the checkpoint will be deleted. Deleting the checkpoint can cause failures if a data partition is lost, so set this bit with care. (default: True)
learningDecay: Learning rate, set as anexponential decay rate. This should be between (0.5, 1.0] to guarantee asymptotic convergence. (default: 0.51)
learningOffset: A (pos

In [18]:
# limiting to 3 terms per topic
df = model.describeTopics(3)
df.show(7)
# searching vocabulary
print(cvFitted.vocabulary[:10])
# number of terms in vocabulary
print(len(cvFitted.vocabulary))

+-----+-------------+--------------------+
|topic|  termIndices|         termWeights|
+-----+-------------+--------------------+
|    0|[137, 90, 49]|[0.00891024805336...|
|    1| [56, 29, 98]|[0.00916648406836...|
|    2|[15, 131, 45]|[0.00897001752758...|
|    3|   [2, 7, 16]|[0.01734152140219...|
|    4|  [40, 10, 8]|[0.01155559318753...|
|    5| [11, 23, 13]|[0.01462723671834...|
|    6|    [3, 1, 0]|[0.01443826360785...|
+-----+-------------+--------------------+
only showing top 7 rows

['water', 'hot', 'vintage', 'bottle', 'paperweight', '6', 'home', 'doormat', 'landmark', 'bicycle']
141


### _Map Index-Terms Example_

In [19]:
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

In [20]:
def indicesToTermsMap(vocabulary):
    # returns a UDF that maps an array of term indices to the matching vocabulary words
    def indicesToTerms(indices):
        return [vocabulary[int(i)] for i in indices]
    return udf(indicesToTerms, ArrayType(StringType()))

In [21]:
maps = df.withColumn("termWords", indicesToTermsMap(cvFitted.vocabulary)("termIndices"))
maps.select("topic", "termIndices", "termWords").show(7, False)

+-----+-------------+-----------------------------+
|topic|termIndices  |termWords                    |
+-----+-------------+-----------------------------+
|0    |[137, 90, 49]|[slate, woodland, design]    |
|1    |[56, 29, 98] |[message, night, rack]       |
|2    |[15, 131, 45]|[kit, or, amelie]            |
|3    |[2, 7, 16]   |[vintage, doormat, leaf]     |
|4    |[40, 10, 8]  |[notting, frame, landmark]   |
|5    |[11, 23, 13] |[ribbons, christmas, classic]|
|6    |[3, 1, 0]    |[bottle, hot, water]         |
+-----+-------------+-----------------------------+
only showing top 7 rows
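
The lookup done inside the UDF can be checked by hand with the first ten vocabulary terms printed earlier. The sketch below is plain Python and the helper name `indices_to_terms` is an assumption; it only covers indices 0 to 9 because the rest of the vocabulary is not shown.

```python
# first ten terms of cvFitted.vocabulary as printed above
vocabulary = ['water', 'hot', 'vintage', 'bottle', 'paperweight',
              '6', 'home', 'doormat', 'landmark', 'bicycle']

def indices_to_terms(indices, vocabulary):
    # each term index is a position in the fitted vocabulary
    return [vocabulary[int(i)] for i in indices]

topic_6_terms = indices_to_terms([3, 1, 0], vocabulary)  # termIndices of topic 6
print(topic_6_terms)
```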


