In [1]:
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

from collections import Counter
from pyspark.ml.clustering import LDA, LDAModel, DistributedLDAModel
from pyspark.ml.linalg import Vectors

import pickle as pkl

sc = pyspark.SparkContext()
sc.setCheckpointDir('checkpoint/')
spark = SparkSession.builder.getOrCreate()

DATA_FOLDER = "data/"
RESULTS_FOLDER = "results/"
RDD_FILENAME = "stemstem.parquet"

## 1. Loading stemmed corpus

First, we load the corpus of stemmed reviews. Stemming keeps words that share a stem (e.g. wait, waited, waiting) from being counted as different words, which would otherwise add noise to our analysis.

To keep all computations within a reasonable running time, we first restrict ourselves to a corpus of 50k reviews.

In [2]:
def get_rdd(filename, max_size = None):
    df = spark.read.parquet(DATA_FOLDER + filename)
    rdd = df.rdd.zipWithIndex().map(lambda r: (r[1], r[0]["Words"]))
    return rdd if max_size is None else sc.parallelize(rdd.take(max_size))

In [3]:
rdd = get_rdd(RDD_FILENAME, max_size = 50000)

## 2. Preparing the data

In order to use the LDA models provided by `pyspark`, we need to reshape the data. More precisely, we need to compute the number of occurrences of each term in each review. This is the purpose of the function `count_words` below:

In [4]:
def count_words(rdd):
    def word_occurrences(l):
        counter = dict(Counter(l))
        return list(zip(counter.keys(), counter.values()))
    
    return rdd.mapValues(word_occurrences)
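
To make the per-review counting concrete, here is a minimal pure-Python sketch of what `word_occurrences` does to a single review, without Spark. The token list `toy_review` is a made-up example, not taken from the corpus.

```python
from collections import Counter

def word_occurrences(tokens):
    """Return (word, count) pairs for one tokenized review."""
    return list(Counter(tokens).items())

# Hypothetical stemmed tokens of a single review
toy_review = ["kindl", "read", "kindl", "screen", "read", "kindl"]

pairs = word_occurrences(toy_review)
print(pairs)
```

In the notebook, `rdd.mapValues` applies this same transformation to every review while leaving the review index untouched.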

Next, we establish the set of all words present in our reviews. This is done in the `wordify` function below: `words` is built by taking the union of the word sets of all reviews. Finally, we assign a unique integer index to each word, which we use below to build the feature vectors.

In [5]:
def wordify(rdd):
    words = list(rdd.map(lambda p: set(p[1])).reduce(lambda x, y: x | y))
    return dict([(words[i], i) for i in range(len(words))])
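
The same idea can be sketched without Spark on a tiny corpus. The corpus `toy_corpus` is invented for illustration, and the call to `sorted` is an assumption added here to make the indices reproducible; `wordify` itself uses whatever order the set happens to yield.

```python
from functools import reduce

# Hypothetical (review id, stemmed tokens) pairs
toy_corpus = [
    (0, ["kindl", "read", "kindl"]),
    (1, ["read", "stori"]),
]

# Union of the per-review word sets, as in the reduce step of wordify
vocabulary = reduce(lambda x, y: x | y, (set(tokens) for _, tokens in toy_corpus))

# One unique integer index per word (sorted only for reproducibility)
word_index = {word: i for i, word in enumerate(sorted(vocabulary))}
print(word_index)
```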

The last step of data preparation is to map the review of each entry of `rdd` to a vector of size `len(words)` containing at index $i$ the number of occurrences of word $i$. Since each review contains only a small subset of the vocabulary, we store these vectors as `SparseVector` instances, which keep only the non-zero entries.

In [6]:
def LDAify(rdd, words):
    return (rdd
            .mapValues(lambda l: [(words[w], occurrences) for (w, occurrences) in l])
            .mapValues(lambda l: Vectors.sparse(len(words), dict(l)))
            .map(lambda p: list(p)))
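
As an illustration of the sparse layout, the sketch below represents a review as a size plus an `{index: count}` dictionary, which is the information passed to `Vectors.sparse` above. The helper names `to_sparse` and `to_dense` and the toy vocabulary are hypothetical and only serve the example.

```python
def to_sparse(pairs, word_index):
    """Map (word, count) pairs to (vector size, {word index: count})."""
    return len(word_index), {word_index[w]: c for w, c in pairs}

def to_dense(size, entries):
    """Expand the sparse form into a plain list, filling the gaps with zeros."""
    return [entries.get(i, 0) for i in range(size)]

# Hypothetical vocabulary and word counts for one review
word_index = {"kindl": 0, "read": 1, "stori": 2, "screen": 3}
pairs = [("read", 1), ("stori", 4)]

size, entries = to_sparse(pairs, word_index)
dense = to_dense(size, entries)
print(size, entries, dense)
```

With a realistic vocabulary the dense list would be almost entirely zeros, which is why only the non-zero entries are stored.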

The dictionary can be computed with `words = wordify(rdd)`. Since this computation is quite time-consuming, we ran it once and stored the result in a Pickle file, which we load as follows:

In [7]:
words = {}
with open(RESULTS_FOLDER + "words_lda.pickle", 'rb') as handle:
    words = pkl.load(handle)

We then build the data matrix `X`:

In [8]:
X = LDAify(count_words(rdd), words)

## 3. Core LDA analysis

LDA's `describeTopics()` method returns the `k` topics, each described by its top-weighted terms. However, terms are identified by their indices rather than by the words themselves. We therefore implement the following `format_terms` function to retrieve each word from its index in `words`.

In [9]:
def format_terms(indices, weights):
    words_list = list(words)
    return [(words_list[index], weights[i]) for (i, index) in enumerate(indices)]
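
Note that `format_terms` relies on the keys of `words` being stored in index order. A slightly more defensive variant, sketched below under that caveat, builds the inverse mapping explicitly from the stored indices. The names `invert_index` and `format_terms_explicit` and the toy data are hypothetical.

```python
def invert_index(word_index):
    """Build a list such that inverse[i] is the word whose index is i."""
    inverse = [None] * len(word_index)
    for word, i in word_index.items():
        inverse[i] = word
    return inverse

def format_terms_explicit(indices, weights, inverse):
    """Pair each term index with its word and weight."""
    return [(inverse[i], w) for i, w in zip(indices, weights)]

# Hypothetical dictionary whose key order does not follow the indices
word_index = {"stori": 2, "kindl": 0, "read": 1}
inverse = invert_index(word_index)

terms = format_terms_explicit([2, 0], [0.5, 0.25], inverse)
print(terms)
```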

LDA also depends on two hyperparameters, $\alpha$ and $\beta$, the concentration parameters of the Dirichlet priors placed on the per-document topic distributions and on the per-topic term distributions, respectively. We tune them with a simple grid search.

This is the purpose of the `test_coefs` function below. It fits one model per $(\alpha, \beta)$ pair, collects the topics of each model, and records which topics contain the stem "kindl", so that we can manually evaluate their relevance and find the best values for $\alpha$ and $\beta$.

In [10]:
def test_coefs():
    res = []
    X_ = X.toDF(["id", "feature"])
    for alpha in [1.01, 1.5, 2, 3]:
        for beta in [1.01, 1.5, 2, 3]:
            print("alpha = " + str(alpha) + " / beta = " + str(beta))
            lda = (LDA(k = 10)
                   .setFeaturesCol("feature")
                   .setMaxIter(100)
                   .setOptimizer("em")
                   .setDocConcentration([float(alpha)])
                   .setTopicConcentration(float(beta)))
            model = lda.fit(X_)
            res += [model.describeTopics().rdd.map(lambda r: (r[0], format_terms(r[1], r[2]))).take(10)]

    # Record (grid position, topic number) for every topic containing the stem "kindl"
    interesting_indices = []
    for i in range(4):
        for j in range(4):
            # `terms` avoids shadowing the global `words` dictionary
            for k, (_, terms) in enumerate(res[4*i + j]):
                if "kindl" in [term for (term, _) in terms]:
                    interesting_indices.append((4*i+j, k))

    return res, interesting_indices
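
The flat index `4*i + j` used above identifies one $(\alpha, \beta)$ pair of the grid. The following sketch, which does not train any model, shows how such an index maps back to the pair it stands for. The helper `grid_position` is a hypothetical name introduced for the example.

```python
from itertools import product

alphas = [1.01, 1.5, 2, 3]
betas = [1.01, 1.5, 2, 3]

# Same visiting order as the two nested loops: alpha outer, beta inner
grid = list(product(alphas, betas))

def grid_position(flat_index, n_betas=len(betas)):
    """Recover the (alpha, beta) pair stored at position 4*i + j."""
    i, j = divmod(flat_index, n_betas)
    return alphas[i], betas[j]

print(len(grid), grid_position(9))
```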

In [11]:
# Chosen from above
alpha = 2
beta = 1.5

Again, training an LDA model is quite time-consuming. Therefore, we saved the model trained with $\alpha = 2$ and $\beta = 1.5$ in a Parquet file. The model can be retrieved as follows:

In [12]:
best_model = DistributedLDAModel.load(RESULTS_FOLDER + "kindle_lda_model.parquet")

## 4. Topics extraction

In [13]:
def print_topics(model, dictionary):
    words_list = list(dictionary)
    for i, words in model.describeTopics(10).rdd.map(lambda r: (r[0], (zip(r[1], r[2])))).groupByKey().sortByKey().collect():
        print("Topic " + str(i+1))
        for word_indices in words:
            for word, per in (map(lambda w: (words_list[w[0]], w[1]), word_indices)):
                print("    " + word + ": " + str(100*per) + "%")
        print()

print_topics(best_model, words)

Topic 1
    review: 1.669342772819849%
    relationship: 1.227819401281716%
    togeth: 1.1226048807265143%
    into: 0.9942960889184277%
    onli: 0.9100196149317308%
    sexi: 0.8871170059162949%
    she': 0.8252354260346444%
    doesn't: 0.7530020812918945%
    romanc: 0.7046308878346287%
    passion: 0.5523344025810424%

Topic 2
    novel: 3.416132220056701%
    mysteri: 3.056578227312123%
    twist: 2.247111886674283%
    suspens: 1.6679165260551003%
    excel: 1.6058436532190203%
    fiction: 1.342177967207062%
    adventur: 1.2226281607025942%
    murder: 1.1559497799029421%
    humor: 1.138142625814316%
    action: 0.9842488594565195%

Topic 3
    seri: 7.396401356429165%
    becaus: 5.167379252584321%
    author: 4.1938369866492335%
    didn't: 3.359057931263465%
    definit: 2.6682958127516434%
    anoth: 1.8695779456988404%
    onli: 1.6990315579187822%
    couldn't: 1.6985199654105725%
    wasn't: 1.6225247585990297%
    disappoint: 1.4891925098047047%

Topic 4
    stori: 2

From the topics displayed above, we note that many of them seem related to specific book genres. For example, the topic containing "relationship", "togeth", "sexi", "romanc" and "passion" could correspond to a genre that we might call "Love & Romance".

However, one topic in particular caught our attention: topic number 6, which essentially contains words specific to Kindle features.

In [14]:
kindle_topic = best_model.describeTopics(best_model.vocabSize()).rdd.filter(lambda r: r[0] == 5).map(lambda r: r[2]).collect()[0]
kindle_topic_vect = Vectors.dense(kindle_topic)

Next, we want a measure of how closely each review matches this topic, which tells us how confident we can be that a review is about Kindle features. For this task, we use a simple measure based on cosine similarity, which is widely used in topic extraction.

We then tune the threshold on this measure so as to select a manageable number (~50) of reviews.

In [15]:
def sim(u, v):
    # 1 minus the cosine of the angle between u and v
    return 1 - u.dot(v)/(u.norm(2)*v.norm(2))

In [16]:
def normalize(v):
    return Vectors.sparse(v.size, dict(zip(v.indices, map(lambda x: x/len(words), v.values))))
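
For reference, the cosine term that `sim` is built on can be sketched in pure Python, with a review stored as an `{index: count}` dictionary and the topic as a dense list. The function name `cosine_similarity` and all vectors below are hypothetical. The sketch also checks that dividing every entry by the same constant, as `normalize` does, leaves the cosine unchanged.

```python
import math

def cosine_similarity(sparse, dense):
    """Cosine of the angle between a sparse {index: value} vector and a dense list."""
    dot = sum(value * dense[i] for i, value in sparse.items())
    norm_sparse = math.sqrt(sum(value * value for value in sparse.values()))
    norm_dense = math.sqrt(sum(x * x for x in dense))
    return dot / (norm_sparse * norm_dense)

# Hypothetical topic weights over a four-word vocabulary
topic = [0.6, 0.3, 0.1, 0.0]
review_a = {0: 3, 1: 1}   # uses the topic's heaviest words
review_b = {3: 2}         # only uses a word with zero weight in the topic

cos_a = cosine_similarity(review_a, topic)
cos_b = cosine_similarity(review_b, topic)

# Rescaling all counts by a constant does not change the cosine
review_a_scaled = {i: value / 1000 for i, value in review_a.items()}
cos_a_scaled = cosine_similarity(review_a_scaled, topic)

print(cos_a, cos_b, 1 - cos_a, 1 - cos_b)
```

A cosine close to 1 means the review points in nearly the same direction as the topic, while a cosine of 0 means they share no weighted word; subtracting the cosine from 1 reverses that ordering.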

In [17]:
matching_reviews = X.map(lambda r: (r[0], sim(normalize(r[1]), kindle_topic_vect))).filter(lambda r: r[1] > 0.999998).collectAsMap()

Finally, we retrieve the `ASIN` of the product those reviews concern, for further analysis:

In [18]:
matching_asins = (spark.read.parquet(DATA_FOLDER + RDD_FILENAME).rdd
                  .zipWithIndex()
                  .filter(lambda r: r[1] in matching_reviews.keys())
                  .map(lambda r: r[0]["Asin"])
                  .collect())

In [19]:
matching_asins

['B00HMQVE4G',
 'B00GR4XFPU',
 'B00HUS1CCU',
 'B005FCVU02',
 'B0079UAT0A',
 'B00K9XXA32',
 'B005H4V4M2',
 'B009IQNMTS',
 'B00L19SVQ4',
 'B00C8EUHQS',
 'B00ASDAWUM',
 'B00FS32GZG',
 'B00BI4J0S0',
 'B00JVF40C4',
 'B008C9INNC',
 'B00B392XIE',
 'B004QGY35W',
 'B00AXBK8YY',
 'B00BPSJ0JI',
 'B005DA0LN8',
 'B005COO1X6',
 'B00AFEOBZ6',
 'B00B2K1AWO',
 'B00I214D70',
 'B00K31DCD8',
 'B00AYYOSS2',
 'B00B4ISWTS',
 'B00DMNDTR8',
 'B0073VIZB0',
 'B00BC6SVY8',
 'B00DR0B31U',
 'B00ILKTBSI',
 'B00BEMD8FW',
 'B005IGBHRG',
 'B00AV2FS36',
 'B00BNVGUPO',
 'B002RI9TIW',
 'B00A94QQCI']