# Topic Modeling Song Lyrics

We will perform topic modeling with two techniques, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), using tools from scikit-learn and gensim. All topic modeling code is contained in the `topic_modeling.py` script.

In [24]:
import pandas as pd
import numpy as np

In [25]:
data = pd.read_csv('data/lyrics_bob-dylan.csv')
data = data.dropna()

## Perform TFIDF Vectorization

Before we can start topic modeling, we must apply term frequency-inverse document frequency (TFIDF) vectorization to our tokenized dataset. TFIDF measures how important a word is to a document in a collection or corpus ([ref](https://www.wikiwand.com/en/Tf%E2%80%93idf)). For example, suppose the word "like" appears in almost every song. TFIDF downweights "like" because it occurs frequently across the corpus. Now suppose "democracy" appears in one of those songs but is very rare across all songs. TFIDF upweights "democracy" because it seldom occurs elsewhere in the corpus.
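To make the "like" versus "democracy" example concrete, here is a minimal pure-Python sketch. The toy corpus and the helper `tfidf_weights` are hypothetical and exist only for illustration. The weighting follows the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization), but it is not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three already-tokenized "songs".
docs = [
    "like a rolling stone like a river".split(),
    "democracy is coming like a storm".split(),
    "like the wind like the rain".split(),
]

n_docs = len(docs)
# Document frequency: in how many songs does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf_weights(doc):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    raw = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf_weights(docs[1])
# "like" occurs in every song, "democracy" in only one, so within the
# second song "democracy" receives the larger weight.
print(weights["like"] < weights["democracy"])
```

Both words occur once in the second song, so the difference in weight comes entirely from the inverse document frequency term.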

Note: scikit-learn's `TfidfVectorizer` expects an array of strings. So, we will need to concatenate our tokenized words together as a string for TFIDF to work properly. That being said, our concatenated tokenized words are very different from our original lyrics because we filtered out stopwords and performed lemmatization.
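As a rough sketch of that concatenation step, assuming each song is stored as a list of tokens (the `tokenized_songs` list below is hypothetical; in this notebook the joined strings already live in the `processed_lyrics` column):

```python
# Hypothetical token lists, one per song, after stopword removal
# and lemmatization.
tokenized_songs = [
    ["blow", "wind", "answer", "friend"],
    ["knock", "heaven", "door"],
]

# Join each token list back into a single space-separated string,
# which is the input shape TfidfVectorizer works with.
processed_lyrics = [" ".join(tokens) for tokens in tokenized_songs]
print(processed_lyrics)
```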

In [61]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', min_df=3, max_df=0.9)
X = tfidf.fit_transform(data['processed_lyrics'])
print("TFIDF matrix dimensions:", X.shape)

TFIDF matrix dimensions: (596, 2621)


In [62]:
X

<596x2621 sparse matrix of type '<class 'numpy.float64'>'
	with 34582 stored elements in Compressed Sparse Row format>

Now that we have our TFIDF matrix, we can start topic modeling with NMF and LDA.

## Non-negative Matrix Factorization (NMF)

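NMF starts with a Document-Word Matrix, $DWM$, whose entry $DWM_{ij}$ is the weight of word $w_j$ in document $d_i$: a raw count under count vectorization, or a tf-idf weight as used here. NMF factorizes this matrix into two smaller non-negative matrices, a Document-Topic Matrix $DTM$ and a Topic-Word Matrix $TWM$, so that $DWM \approx DTM \cdot TWM$. In scikit-learn, `nmf.transform(X)` returns the document-topic weights and `nmf.components_` holds the topic-word weights.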
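The factorization can be sketched on a toy matrix with documents as rows and words as columns. The function `nmf_multiplicative` and the matrix `X_toy` are hypothetical names for illustration. The sketch uses the classic multiplicative update rules for the Frobenius loss, which is a different solver from the coordinate descent that scikit-learn's `NMF` uses by default, so treat it as a way to see the idea rather than a reproduction of the library.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix X (docs x words) into W (docs x k)
    and H (k x words) using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    W = rng.random((n_docs, k)) + eps
    H = rng.random((k, n_words)) + eps
    for _ in range(n_iter):
        # Each update rescales the factor elementwise, so entries that
        # start non-negative stay non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-word matrix: two documents use the first two words,
# the other two documents use the last two words.
X_toy = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

W_toy, H_toy = nmf_multiplicative(X_toy, k=2)
toy_error = np.linalg.norm(X_toy - W_toy @ H_toy)
print("document-topic shape:", W_toy.shape)
print("topic-word shape:", H_toy.shape)
print("reconstruction error:", toy_error)
```

Each row of `H_toy` plays the role of one topic's word weights, and each row of `W_toy` says how strongly a document loads on each topic.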

In [63]:
from sklearn.decomposition import NMF

In [66]:
k_topics = 8
nmf = NMF(n_components=k_topics)
nmf.fit(X)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=8, random_state=None, shuffle=False, solver='cd',
  tol=0.0001, verbose=0)

In [67]:
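# get_feature_names_out() requires scikit-learn >= 1.0;
# older releases use get_feature_names() instead.
tfidf_features = tfidf.get_feature_names_out()
top_n_words = 20

# For each topic, rank the vocabulary by weight and print the top words.
for i in range(k_topics):
    topic = pd.DataFrame(data={'word': tfidf_features, 'weight': nmf.components_[i]})
    sorted_topic = topic.sort_values('weight', ascending=False).head(top_n_words)
    print("Topic %s:" % (i+1), ' '.join(sorted_topic['word']))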

Topic 1: time say like know day come got long gone night heart eye tell man way hard good think away old
Topic 2: instrumental babe lookin gather stone fall rain shine hey sky mornin awake light twice wait use walkin floor everybody ledge
Topic 3: baby want got right come like mind honey night know wrong babe man unto stay really hurt lookin blue lover
Topic 4: love let true know heart want make pure blue like world need seen tonight girl fool little anybody moon hold
Topic 5: lord die home plow believe ground fixin day child hand worry hold yes jackson highway whistle lucky gone engine george
Topic 6: gon mama let high trouble levee make know goin water best easy road wake yes tell walk friend way come
Topic 7: knock heaven knockin door close trying let anymore walking like leave miss lucky orleans sister hurry baltimore wading missouri new
Topic 8: said went man got asked took right little clothes dream war shelter started highway heard come god kinda seen came


## Latent Dirichlet Allocation (LDA)

In [70]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

k_topics = 8
# Note: LDA is a generative model of word counts; X here is the
# TFIDF matrix built above.
lda = LDA(n_components=k_topics, max_iter=10, learning_method='online')
lda.fit(X)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=8, n_jobs=1, n_topics=None, perp_tol=0.1,
             random_state=None, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [71]:
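# get_feature_names_out() requires scikit-learn >= 1.0;
# older releases use get_feature_names() instead.
tfidf_features = tfidf.get_feature_names_out()
top_n_words = 20

# For each topic, rank the vocabulary by weight and print the top words.
for i in range(k_topics):
    topic = pd.DataFrame(data={'word': tfidf_features, 'weight': lda.components_[i]})
    sorted_topic = topic.sort_values('weight', ascending=False).head(top_n_words)
    print("Topic %s:" % i, ' '.join(sorted_topic['word']))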

Topic 0: guillotine unlike mouthpiece disillusioned limited scuff plier stuffed obscenity baptized handcuff remark honesty meantime despise unclear lodged outsider reappear sex
Topic 1: knock knockin heaven stoned stone door congressman prophesize critic rapidly writer stalled rattle drenched senator breakfast admit libido ragin mortician
Topic 2: ramble wan dawn break happens baby gon told day know say got quit old killed night babe played like line
Topic 3: know got love like come baby time gon said say want man night let day tell heart gone way away
Topic 4: jane blowing land fair babe come whistle maid queen beauty gypsy head gon summer like bright young true nothin little
Topic 5: father seven billy broken glory good ship night drank mama danced dream sun path day wind tree best hour goin
Topic 6: instrumental mood said water come let high lord risin gon low got child turn need trouble mind day hear little
Topic 7: success failure speaks madam dangles ideal matchstick horseman quo

In [72]:
import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda, X, tfidf, mds='tsne')
panel

## Comparing NMF vs. LDA

Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) are both topic modeling tools. The main difference is that LDA takes a Bayesian approach and places Dirichlet priors on top of the generative model. NMF produces a single fixed set of topic-word weights, while LDA's topic-word distributions also depend on how the priors are tuned (`doc_topic_prior` and `topic_word_prior` in scikit-learn), in addition to the number of components $k$ that both methods share. NMF would be a better choice if the topic probabilities are fixed for each document. Also, if our dataset is small, LDA may perform worse, since it could introduce too much variability into the model.

Unlike NMF, LDA offers no closed-form way to reconstruct X. We need to use Monte Carlo simulation: first sample from the distribution of Z (the distribution of topics for each sample), then from the distribution of W (the distribution of words for topic $Z_i$).
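The two-stage sampling can be sketched as follows. Everything here is a hypothetical illustration: the six-word vocabulary, the `topic_word` matrix, the prior `alpha`, and the helper `sample_document` are made up for the example and are not taken from the fitted model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and topic-word distributions (rows sum to 1).
vocab = ["love", "heart", "road", "train", "lord", "heaven"]
topic_word = np.array([
    [0.45, 0.45, 0.04, 0.02, 0.02, 0.02],
    [0.03, 0.02, 0.45, 0.45, 0.03, 0.02],
    [0.02, 0.03, 0.02, 0.03, 0.45, 0.45],
])
# Hypothetical symmetric Dirichlet prior over the three topics.
alpha = np.full(3, 0.5)

def sample_document(n_words):
    """Draw one synthetic document from the LDA generative process."""
    # Topic proportions for this document.
    theta = rng.dirichlet(alpha)
    # Stage 1: sample a topic assignment Z for every word position.
    z = rng.choice(len(theta), size=n_words, p=theta)
    # Stage 2: sample each word from its assigned topic's distribution.
    words = [vocab[rng.choice(len(vocab), p=topic_word[t])] for t in z]
    return theta, z, words

theta, z, words = sample_document(12)
print("topic proportions:", theta)
print("sampled words:", " ".join(words))
```

Repeating the draw many times and averaging the word counts gives a Monte Carlo estimate of a document's expected word distribution, which is the quantity NMF obtains directly as a matrix product.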