### Topic Modeling with Latent Dirichlet Allocation (LDA)

In this notebook we use Latent Dirichlet Allocation (LDA) to look for topics in the ride reviews. It is based on material supplied by Cloudera under its Cloudera Academic Partner program and on the book *Spark: The Definitive Guide* by Bill Chambers and Matei Zaharia.

Topics
- Extracting and transforming features
- Counting frequency of words
- Specifying and fitting a topic model using LDA
- Exploring topics identified by LDA
- Applying the topic model

In [0]:
# Read the ride review data
reviews = spark.read.parquet("/mnt/cis442f-data/duocar/clean/ride_reviews/")
reviews.head(5) 

#### Extracting and transforming features

The ride reviews are not in a form amenable to machine learning algorithms. Spark MLlib provides a number of feature extractors and feature transformers to preprocess the ride reviews into a form appropriate for modeling.


**Parse the ride reviews**
- Use the [RegexTokenizer](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.RegexTokenizer.html) class to improve the tokenization
- Use the [StopWordsRemover](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StopWordsRemover.html) class to remove common words
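
To make these two steps concrete, here is a minimal plain-Python sketch of roughly what they do to a single review. It is an illustration only, not Spark's implementation: the sample sentence and the tiny `STOP_WORDS` set are made up for this example, and the real `StopWordsRemover` uses a much longer default English list.

```python
import re

# Hypothetical stand-in for StopWordsRemover's default English list (much longer in Spark)
STOP_WORDS = {"the", "was", "and", "a", "but", "it"}

def tokenize(text, pattern=r"[a-zA-Z-']+"):
    # Mimics RegexTokenizer with gaps=False: lowercase first, then keep every match
    return re.findall(pattern, text.lower())

def remove_stop_words(words, stop_words=STOP_WORDS):
    # Mimics StopWordsRemover: drop any token found in the stop-word list
    return [w for w in words if w not in stop_words]

sample = "The driver was friendly, and the car wasn't dirty!"
words = tokenize(sample)
words_removed = remove_stop_words(words)
print(words)
print(words_removed)
```

Because the pattern describes the tokens themselves (`gaps=False`) rather than the separators, punctuation such as commas and exclamation marks never makes it into the output, while apostrophes and hyphens inside words are kept.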

In [0]:
# Use RegexTokenizer class to tokenize reviews (configured to remove punctuation)
from pyspark.ml.feature import RegexTokenizer
# Note: the character range must be A-Z; A-z would also match the characters [ \ ] ^ _ `
tokenizer = RegexTokenizer(inputCol="review", outputCol="words", gaps=False, pattern="[a-zA-Z-']+")
tokenized = tokenizer.transform(reviews)

In [0]:
# Use StopWordsRemover to remove common words
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol="words", outputCol="words_removed")
print(remover.getStopWords()[:10])  # peek at the first few default stop words
removed = remover.transform(tokenized)
for row in removed.head(5):
    print(row)

#### Count the frequency of words in each review

In [0]:
# Count the words
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words_removed", outputCol="words_vectorized", vocabSize=100)
vectorizer_model = vectorizer.fit(removed)
vectorized = vectorizer_model.transform(removed)
# vectorized.head(5)

In [0]:
# Our vocabulary seems reasonable now
print(list(enumerate(vectorizer_model.vocabulary[:100])))

In [0]:
# We can see how the processed reviews are vectorized
vectorized.select("words_removed", "words_vectorized").head(5)
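
The sketch below shows, in plain Python, the idea behind these count vectors: build a vocabulary from the most frequent terms in the corpus, then represent each document as counts indexed by vocabulary position. The function names and sample documents are made up for illustration, and details such as tie-breaking among equally frequent terms may differ from Spark's `CountVectorizer`.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size):
    # Keep the vocab_size most frequent terms across the whole corpus
    counts = Counter(word for doc in docs for word in doc)
    return [word for word, _ in counts.most_common(vocab_size)]

def vectorize(doc, vocabulary):
    # Sparse representation: {vocabulary index: count}, ignoring out-of-vocabulary words
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = Counter(index[w] for w in doc if w in index)
    return dict(sorted(counts.items()))

# Made-up tokenized reviews
docs = [
    ["driver", "friendly", "car", "clean"],
    ["driver", "late", "car", "car"],
    ["driver", "rude", "driver"],
]
vocabulary = fit_vocabulary(docs, vocab_size=2)
vectors = [vectorize(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Words that fall outside the vocabulary (here everything except the two most frequent terms) are simply dropped, which is why the choice of `vocabSize` affects what the topic model can see.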

#### Specify and fit a topic model using LDA

In [0]:
# Use the `LDA` class to specify an LDA model
from pyspark.ml.clustering import LDA
lda = LDA(featuresCol="words_vectorized", k=9)

# Use the `fit` method to fit the LDA model
lda_model = lda.fit(vectorized)
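
For orientation, the generative model that LDA assumes can be written as follows. This is the standard textbook formulation rather than anything specific to this notebook, and the symbols are introduced here for illustration. In Spark's `LDA` class, $\alpha$ corresponds to the `docConcentration` parameter and $\beta$ to `topicConcentration`.

$$
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Here $K$ is the number of topics (`k=9` in this notebook), $D$ is the number of reviews, $\theta_d$ is the topic mixture of review $d$, $\phi_k$ is the word distribution of topic $k$, $z_{d,n}$ is the topic assigned to the $n$-th word of review $d$, and $w_{d,n}$ is that word. Fitting the model estimates the $\phi_k$ and $\theta_d$ from the observed word counts.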

#### Explore the topics identified by LDA

In [0]:
# lda_model.describeTopics().show(truncate=False)
lda_model.describeTopics().show()

In [0]:
vocab = vectorizer_model.vocabulary
wordNumbers = 10  # number of words per topic

# Capture the data from the model's describeTopics() method
topics_data = lda_model.describeTopics(maxTermsPerTopic=wordNumbers).collect()

# Map the term indices of each topic back to vocabulary words
topics = []
terms = []
terms_in_words = []
weights = []

for row in topics_data:
    topics.append(row["topic"])
    terms.append(row["termIndices"])
    terms_in_words.append([vocab[term] for term in row["termIndices"]])
    weights.append(row["termWeights"])

# Print the vocabulary for each topic discovered by LDA
for topic, words in zip(topics, terms_in_words):
    print("Topic: " + str(topic), end="  ")
    print(words)
    # print(weights[topic])

# Note: If we were working with this sort of data a lot it would make sense to turn this sort of analysis into a function that we could call at will
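
One possible shape for such a function is sketched below. The name `topics_to_words` and the sample data are made up for illustration; the function assumes each record unpacks into `(topic, termIndices, termWeights)`, which is the column order that `describeTopics()` returns.

```python
def topics_to_words(topic_rows, vocabulary):
    """Map each topic's term indices to (word, weight) pairs.

    topic_rows: iterable of (topic, termIndices, termWeights) records, e.g. the
    result of lda_model.describeTopics(...).collect().
    """
    result = {}
    for topic, term_indices, term_weights in topic_rows:
        result[topic] = [(vocabulary[i], w) for i, w in zip(term_indices, term_weights)]
    return result

# Made-up stand-in data so the sketch runs without a fitted model
sample_vocab = ["driver", "car", "clean", "late"]
sample_rows = [(0, [1, 2], [0.6, 0.3]), (1, [3, 0], [0.5, 0.4])]
for topic, words in topics_to_words(sample_rows, sample_vocab).items():
    print("Topic:", topic, words)
```

Keeping the weights next to the words makes it easier to judge how strongly each term is tied to its topic.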

#### Apply the topic model

In [0]:
reviews_with_topics = lda_model.transform(vectorized)
for row in reviews_with_topics.select("review", "topicDistribution").head(15):
    print (row)
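
Each `topicDistribution` holds one weight per topic, so a simple way to summarize a review is to pick its highest-weighted topic. The plain-Python sketch below illustrates this with a made-up distribution; the helper name `dominant_topic` is hypothetical, and with real output you would first convert the vector to a sequence, for example with `row["topicDistribution"].toArray()`.

```python
def dominant_topic(topic_distribution):
    # Return (index, weight) of the most probable topic
    weights = list(topic_distribution)
    best = max(range(len(weights)), key=weights.__getitem__)
    return best, weights[best]

# Made-up distribution over three topics
sample_distribution = [0.05, 0.70, 0.25]
print(dominant_topic(sample_distribution))
```

A review whose largest weight is only slightly above the others is a mixture of topics, so the weight is worth keeping alongside the index.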

### Hands On

![Hands-on](https://cis442f-open-data.s3.amazonaws.com/pictures/hands.png "Hands-on")


#### Exercises

(1) Determine if increasing the vocabulary size improves the solution.

(2) Experiment with different hyperparameters for the LDA algorithm.

(3) Use the `HashingTF` estimator rather than the `CountVectorizer` estimator to generate the term-frequency vectors.

(4) Use the `NGram` transformer to consider pairs of words rather than single words.
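
As a hint for exercise (4), here is a plain-Python sketch of the kind of output the `NGram` transformer produces for `n=2`: each run of consecutive tokens joined by a space. The helper name and sample tokens are made up for illustration.

```python
def ngrams(tokens, n=2):
    # Slide a window of n tokens over the list and join each window with a space
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up tokens from a single review
sample_tokens = ["driver", "friendly", "car", "clean"]
print(ngrams(sample_tokens))
```

Reviews with fewer than `n` tokens yield an empty list, so very short reviews contribute nothing to a bigram vocabulary.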



**References**

[Wikipedia - Latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

[Spark Documentation - Latent Dirichlet allocation](http://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda)

[Spark Python API - LDA, LDAModel, LocalLDAModel, and DistributedLDAModel classes](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#clustering)