# LDA topic modeling using MLlib
Using some lyrics that I had downloaded from DarkLyrics.com, I try to see what major themes are expressed in them.

Note that each document is a single song, not an album.

In [1]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import Tokenizer, CountVectorizer, StopWordsRemover

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName('Metal Lyrics LDA') \
        .getOrCreate()

Load the data into a `pyspark.sql` DataFrame.

`data_test/` contains 1,500 text files of lyrics, randomly sampled from about 200,000.

In [3]:
from pyspark.sql.functions import monotonically_increasing_id as mid

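# wholeTextFiles yields one (file_path, file_contents) pair per file
documents = spark.sparkContext.wholeTextFiles('data_test/')
df = spark.createDataFrame(documents, ['doc_name', 'doc_text'])

# Return a new DF with all the columns plus a unique (not necessarily consecutive) id
df = df.withColumn("id", mid())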
tokenizer = Tokenizer(inputCol="doc_text", outputCol="words")
df = tokenizer.transform(df)

Since the lyrics are in many different languages, I remove stop words for several languages; otherwise the topics would roughly be arranged by language. Unfortunately, this removes the word "die" (a German article) from the English lyrics as well as the German ones.
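
To see why "die" disappears, here is a small pure-Python sketch of merging per-language stop lists. The tiny lists below are made up for illustration (they are not Spark's real defaults), and `keep` is a hypothetical allow-list for words that matter in one of the languages.

```python
# Tiny, made-up stop lists for illustration only (not Spark's defaults)
english_stops = ["the", "a", "and", "of"]
german_stops = ["der", "die", "das", "und"]

combined = set(english_stops) | set(german_stops)


def remove_stop_words(tokens, stop_words, keep=()):
    """Drop tokens found in stop_words, except those explicitly listed in keep."""
    blocked = set(stop_words) - set(keep)
    return [t for t in tokens if t not in blocked]


lyric = "the brave and the damned die alone".split()

# "die" is a German article, so the merged list strips it from English text too
print(remove_stop_words(lyric, combined))
# ['brave', 'damned', 'alone']

# A hypothetical allow-list keeps it, at the cost of also keeping the German article
print(remove_stop_words(lyric, combined, keep=["die"]))
# ['brave', 'damned', 'die', 'alone']
```

In the notebook itself, the equivalent would probably be to filter `"die"` out of the `stop_words` list before passing it to `StopWordsRemover`; whether that helps or hurts the topics is something I haven't tested.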

`CountVectorizer` does the word counting here, in place of the usual hand-rolled combination of `.map()` and `.reduceByKey()`.
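
As a rough mental model of what `CountVectorizer` computes, here is a pure-Python sketch: build a vocabulary ordered by corpus-wide frequency, then turn each document into sparse counts over that vocabulary. The function names are my own, and Spark's implementation differs in details (tie-breaking, `vocabSize` and `minDF` filtering, distributed execution).

```python
from collections import Counter


def fit_vocabulary(docs):
    """Return the vocabulary ordered by descending corpus-wide term frequency."""
    totals = Counter(token for doc in docs for token in doc)
    # Break ties alphabetically so the ordering is deterministic in this sketch
    return [w for w, _ in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))]


def to_sparse_counts(doc, vocabulary):
    """Return ({index: count}, size): a sparse count vector for one document."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = Counter(index[t] for t in doc if t in index)
    return dict(counts), len(vocabulary)


docs = [["blood", "night", "blood"], ["night", "fire"]]
vocab = fit_vocabulary(docs)
print(vocab)                              # ['blood', 'night', 'fire']
print(to_sparse_counts(docs[0], vocab))   # ({0: 2, 1: 1}, 3)
```

The position of a word in `vocab` is what the indices returned later by `describeTopics` refer to, which is why the fitted model's `vocabulary` is needed to print the topics.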

In [4]:
stop_words = StopWordsRemover.loadDefaultStopWords('english')
stop_words += StopWordsRemover.loadDefaultStopWords('german')
stop_words += StopWordsRemover.loadDefaultStopWords('spanish')
stop_words += StopWordsRemover.loadDefaultStopWords('french')
stop_words += ["i'm",' ','','-',"don't","you're","i'll","can't","it'",
              "we'll","it's","ne ","i've","you'll","let","there's","oh"]

remover = StopWordsRemover(inputCol="words", outputCol="filtered", stopWords=stop_words)
removed = remover.transform(df)

cv = CountVectorizer(inputCol="filtered", outputCol="vectors")
model = cv.fit(removed)
df_vec = model.transform(removed)
# df_vec.show()

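Something that isn't clear in the Python documentation: I was getting an error when passing `(document_key, SparseVector())` tuples to `LDA.train`. The docs refer you to `CountVectorizer` for performing word counts, but its output triggered an error saying that a Vector type was expected. The likely cause is that `pyspark.ml.feature.CountVectorizer` emits `pyspark.ml.linalg` vectors, while `pyspark.mllib.clustering.LDA` expects `pyspark.mllib.linalg` vectors. So I instead convert to `(id, DenseVector())` pairs, using the `DenseVector` from `pyspark.mllib.linalg`. (`pyspark.mllib.linalg.Vectors.fromML` should do the same conversion while keeping the vectors sparse, though I haven't tried it here.)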

I imagine that it's much clearer in Scala!

In [5]:
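from pyspark.mllib.linalg import DenseVector

# Build [id, DenseVector] pairs; toArray() expands the sparse counts to a dense array.
# Index the row instead of using a tuple-unpacking lambda, which Python 3 rejects.
corpus = df_vec.select("id", "vectors").rdd \
    .map(lambda row: [row[0], DenseVector(row[1].toArray())]) \
    .cache()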
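
The dense conversion has a memory cost worth keeping in mind. The sketch below shows, in plain Python, what expanding a sparse count vector amounts to, followed by a back-of-the-envelope size estimate using this notebook's numbers (1,500 documents, 36,415 vocabulary words). It assumes 8-byte floats and ignores per-object overhead; `to_dense` is an illustrative helper, not a Spark function.

```python
def to_dense(sparse_counts, size):
    """Expand an {index: count} mapping into a list of length size."""
    dense = [0.0] * size
    for idx, count in sparse_counts.items():
        dense[idx] = float(count)
    return dense


print(to_dense({0: 2, 3: 1}, 5))  # [2.0, 0.0, 0.0, 1.0, 0.0]

# Rough memory estimate for the dense corpus, assuming 8-byte floats
n_docs, vocab_size = 1500, 36415
print(n_docs * vocab_size * 8 / 1e6)  # about 437 (MB)
```

That is fine for a 1,500-song sample, but it would scale poorly to the full collection, where keeping the vectors sparse would matter much more.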

In [6]:
# Cluster the documents into five topics using LDA
NUM_TOPICS = 5
ldaModel = LDA.train(corpus, k=NUM_TOPICS)

Print out the results with the top 20 words for each topic.

In [7]:
NUM_WORDS = 20
# describeTopics returns one (word_indices, weights) pair per topic
topics = ldaModel.describeTopics(NUM_WORDS)
print("{} words in vocabulary".format(len(model.vocabulary)))
print("")

for i, (word_indices, weights) in enumerate(topics):
    print("Top words for topic {}".format(i))
    # Map each vocabulary index back to its word
    result = [model.vocabulary[idx] for idx in word_indices]
    print(', '.join(result))
    print("")

36415 words in vocabulary

Top words for topic 0
life, time, one, blood, night, eyes, flesh, like, day, cry, wake, death, soul, heart, world, see, end, us, dead, never

Top words for topic 1
time, life, never, see, one, us, black, night, away, world, take, fire, know, new, eyes, heart, pain, come, back, mind

Top words for topic 2
death, itâs, see, time, life, like, fall, na, one, silence, keep, sense, words, away, dead, never, death,, must, frozen, behind

Top words for topic 3
see, come, one, go, like, eyes, light, time, never, life, could, know, dead, make, dark, mind, forever, god, take, us

Top words for topic 4
like, know, got, feel, get, see, one, never, take, time, way, come, end, love, world, life, want, we're, every, right

