# LDA Clustering with PySpark

This notebook shows how to implement and measure a Latent Dirichlet Allocation (LDA) topic model for clustering documents.

* Method: [LDA](https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA)
* Dataset: MLlib sample LDA data

Terminology:
* term = "word": an element of the vocabulary
* token: instance of a term appearing in a document
* topic: multinomial distribution over terms representing some concept
* document: one piece of text, corresponding to one row in the input data
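
To make the term/token distinction concrete, here is a minimal sketch in plain Python. The sentence is a hypothetical toy document, not part of the MLlib sample data, and whitespace splitting stands in for whatever tokenizer a real pipeline would use.

```python
from collections import Counter

# Hypothetical toy document (an assumption for illustration only)
document = "spark makes topic models easy and spark scales"

# Every word occurrence is a token
tokens = document.split()

# Each distinct word is a term; the set of terms is the vocabulary
term_counts = Counter(tokens)
vocabulary = sorted(term_counts)

print("Tokens in document:", len(tokens))
print("Terms in vocabulary:", len(vocabulary))
print("Tokens of the term 'spark':", term_counts["spark"])
```

The word "spark" is one term that contributes two tokens, which is why the token count is larger than the vocabulary size.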

**NOTE**: This feature is experimental and under active development.

## Imports

In [None]:
from os import environ
# Set SPARK_HOME
# environ["SPARK_HOME"] = "/home/students/spark-2.2.0"

import findspark
findspark.init()

from itertools import product

from pyspark import SparkContext
from pyspark.sql import SQLContext

from pyspark.ml.clustering import LDA
from sklearn import metrics

import seaborn as sb
import matplotlib.pyplot as plt
from pylab import rcParams

%matplotlib inline
rcParams['figure.figsize'] = 10, 8
sb.set_style('whitegrid')

## Get Some Context

In [None]:
# Create a SparkContext and a SQLContext to use
sc = SparkContext(appName="LDA Clustering with Spark")
sqlContext = SQLContext(sc)

## Load and Prepare the Data

In [None]:
DATA_FILE = "/Users/robert.dempsey/Dev/daamlobd/data/mllib/sample_lda_libsvm_data.txt"

In [None]:
data = sqlContext.read.format("libsvm").load(DATA_FILE)

In [None]:
# View the first three records
data.take(3)

## Identify the Number of Clusters and Optimizer to Use

Arguments:
* k: number of topics (clusters)
* seed: random seed
* optimizer: Optimizer or inference algorithm used to estimate the LDA model
  * online
  * em

In [None]:
# Create a list of (k, optimizer) tuples to test cluster ranges with different optimizers
cluster_range = range(2, 11)
optimizer = ['online', 'em']

cluster_range_optimizer = list(product(cluster_range, optimizer))
print(cluster_range_optimizer)

In [None]:
# Create a list of LDA models
lda_models = [LDA(k=i[0], optimizer=i[1], maxIter=50) for i in cluster_range_optimizer]
print(len(lda_models))

Metrics:
* logLikelihood: a lower bound on the log likelihood of the entire corpus (higher is better)
* logPerplexity: an upper bound on the log perplexity of the corpus (lower is better)

Perplexity is a measurement of how well a probability distribution or probability model predicts a sample.
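
As a rough illustration of that definition, the sketch below computes perplexity for two hypothetical models from the probabilities they assign to each token. It assumes the standard textbook formula, perplexity = exp(-(1/N) * sum of log p(token)); the probabilities and function names are made up for the example. Spark computes variational bounds rather than exact values, so treat this as intuition, not as Spark's implementation.

```python
import math

# Hypothetical probabilities that two toy models assign to the same four tokens
confident_model = [0.5, 0.4, 0.5, 0.25]
uncertain_model = [0.1, 0.05, 0.1, 0.2]

def log_perplexity(token_probs):
    """Negative average log-probability per token (lower is better)."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return -log_likelihood / len(token_probs)

def perplexity(token_probs):
    """Exponentiated log perplexity."""
    return math.exp(log_perplexity(token_probs))

print("Confident model log perplexity:", log_perplexity(confident_model))
print("Uncertain model log perplexity:", log_perplexity(uncertain_model))
```

A model that assigns probability 0.25 to every token has a perplexity of 4: it is as uncertain as a uniform choice among four terms. Higher log likelihood and lower log perplexity both indicate a better fit.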

In [None]:
# For each model, fit it to the data and get the logLikelihood and logPerplexity scores
cluster_ll_scores = list()
cluster_lp_scores = list()

# Fit each of the models on the data
for lda_model in lda_models:
    model = lda_model.fit(data)
    ll = model.logLikelihood(data)
    lp = model.logPerplexity(data)
    cluster_ll_scores.append(ll)
    cluster_lp_scores.append(lp)

# Show one of the LP scores
cluster_lp_scores[0]

In [None]:
# Create a scatterplot of the LL and LP scores
plt.scatter(cluster_ll_scores, cluster_lp_scores)
plt.title("logPerplexity and logLikelihood Scores")
plt.xlabel("logLikelihood")
plt.ylabel("logPerplexity")
plt.show()

In [None]:
# Plot a bar chart of the LP scores
chart_labels = ["{}_{}".format(i[0], i[1]) for i in cluster_range_optimizer]

sb.barplot(y=chart_labels, x=cluster_lp_scores)

**Observation**: based on the bar chart above, the model with 2 topics and the online optimizer appears to have the best (lowest) logPerplexity score.

In [None]:
# Get the index value of the min cluster lp score
min_score_index = cluster_lp_scores.index(min(cluster_lp_scores))

# Get the number of clusters used for the model with the min score
params_to_use = cluster_range_optimizer[min_score_index]

print("Number of topics: {}".format(params_to_use[0]))
print("Optimizer: {}".format(params_to_use[1]))

## Fit an LDA Model

In [None]:
# Fit the model
lda_model = LDA(k=params_to_use[0], optimizer=params_to_use[1], maxIter=50)
model = lda_model.fit(data)

## View Model Information

### logLikelihood and logPerplexity

In [None]:
ll = model.logLikelihood(data)
lp = model.logPerplexity(data)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

### Topics

In [None]:
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)
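
`describeTopics` reports each topic as term indices with their weights, not as words. The sample libsvm data ships without a vocabulary, so the indices cannot be mapped back to words here. The plain-Python sketch below shows conceptually what selecting the top-weighted terms of one topic looks like; the vocabulary, the weights, and the `top_terms` helper are all hypothetical and would come from your own preprocessing in a real pipeline.

```python
# Hypothetical vocabulary and term weights for a single topic (assumptions)
vocabulary = ["spark", "topic", "model", "data", "cluster"]
topic_weights = [0.10, 0.35, 0.30, 0.05, 0.20]

def top_terms(weights, max_terms):
    """Return the indices and weights of the highest-weighted terms."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    indices = ranked[:max_terms]
    return indices, [weights[i] for i in indices]

term_indices, term_weights = top_terms(topic_weights, 3)

# Map the term indices back to words using the hypothetical vocabulary
top_words = [vocabulary[i] for i in term_indices]
print(list(zip(top_words, term_weights)))
```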

## Shut it Down

In [None]:
sc.stop()