# Topic Modeling Song Lyrics

We will perform topic modeling with two techniques, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), using tools from scikit-learn and gensim. All topic modeling code is contained in the `modeling.py` script.

In [9]:
import pandas as pd
import numpy as np

In [49]:
data = pd.read_csv('data/lyrics_drake.csv')
data = data.dropna()

## Perform TFIDF Vectorization

Before we can start topic modeling, we must apply term frequency-inverse document frequency (TFIDF) vectorization to our tokenized dataset. TFIDF is used to determine how important a word is to a document in a collection or corpus ([ref](https://www.wikiwand.com/en/Tf%E2%80%93idf)). For example, let's say the word "like" appears in a given song and is also very popular across all songs. Using TFIDF, we downweight the importance of "like" because it occurs frequently throughout our corpus. Let's say "democracy" is another word in that same song, but it is very rare across all songs. Its importance would be upweighted using TFIDF because it doesn't occur very often in our corpus.
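To make the "like" versus "democracy" example concrete, here is a minimal hand-rolled sketch of TFIDF on a toy corpus. The corpus and the helper names (`idf`, `tfidf_weights`) are made up for illustration. The smoothed IDF formula and L2 normalization are meant to mirror scikit-learn's defaults, but this is a simplified approximation rather than the library's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-tokenized "songs" (not the real dataset).
docs = [
    ["like", "love", "like"],
    ["like", "money"],
    ["like", "democracy"],
]
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger values.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_weights(doc):
    # Raw term count times IDF, then L2-normalize the document vector.
    counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = [tfidf_weights(doc) for doc in docs]
print(weights[2])  # weights for the song containing "like" and "democracy"
```

In the third toy song both words occur once, yet "democracy" ends up with the larger weight because it appears in only one document while "like" appears in all three.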

Note: scikit-learn's `TfidfVectorizer` expects an array of strings. So, we will need to concatenate our tokenized words together as a string for TFIDF to work properly. That being said, our concatenated tokenized words are very different from our original lyrics because we filtered out stopwords and performed lemmatization.
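As a small sketch of that concatenation step, assuming each song is stored as a list of lemmatized, stopword-free tokens (the `tokenized_lyrics` variable and its contents are hypothetical, not taken from the dataset):

```python
# Hypothetical tokenized lyrics: one list of tokens per song.
tokenized_lyrics = [
    ["know", "girl", "love"],
    ["money", "cash", "bill"],
]

# Join each token list into one space-separated string, giving one string per song.
processed_lyrics = [" ".join(tokens) for tokens in tokenized_lyrics]
print(processed_lyrics)
```

A list like `processed_lyrics` plays the same role as the `processed_lyrics` column used in the next cell.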

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_hiphop = tfidf.fit_transform(data['processed_lyrics'])
print("TFIDF matrix dimensions:", X_hiphop.shape)

TFIDF matrix dimensions: (373, 7161)


In [51]:
X_hiphop

<373x7161 sparse matrix of type '<class 'numpy.float64'>'
	with 50454 stored elements in Compressed Sparse Row format>
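The summary above also tells us how sparse the matrix is. A quick back-of-the-envelope check, using only the figures printed above (the variable names are illustrative):

```python
# Figures copied from the sparse-matrix summary printed above.
n_docs, n_terms, n_stored = 373, 7161, 50454

density = n_stored / (n_docs * n_terms)   # fraction of entries that are stored
avg_terms_per_song = n_stored / n_docs    # stored (distinct) terms per song, on average

print(f"density: {density:.2%}")
print(f"average distinct terms per song: {avg_terms_per_song:.1f}")
```

Fewer than 2% of the entries are stored, which is why scikit-learn returns a sparse matrix rather than a dense array.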

Now that we have our TFIDF matrix, we can start topic modeling with NMF and LDA.

## Non-negative Matrix Factorization (NMF)

In [52]:
from sklearn.decomposition import NMF

In [53]:
n_topics = 6
nmf = NMF(n_components=n_topics)
nmf.fit(X_hiphop)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=6, random_state=None, shuffle=False, solver='cd',
  tol=0.0001, verbose=0)

In [54]:
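# Vocabulary terms, in the same order as the columns of the TFIDF matrix.
# get_feature_names() was removed in scikit-learn 1.2; on versions older
# than 1.0, use tfidf.get_feature_names() instead.
tfidf_features = tfidf.get_feature_names_out()
top_n_words = 20

for i in range(n_topics):
    # Pair every vocabulary term with its weight in topic i
    topic = pd.DataFrame(data={'word': tfidf_features, 'weight': nmf.components_[i]})
    # Keep the highest-weighted terms for this topic
    sorted_topic = topic.sort_values('weight', ascending=False).head(top_n_words)
    print("Topic %s:" % i, ' '.join(sorted_topic['word']))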

Topic 0: know get like got say girl time one love wan let feel take want could baby never need thing tell
Topic 1: cake million made rule cash bill like dog know le everything nigga sotto worldwide around hov doin got play share
Topic 2: camera mean team knew thought ooh lie calling know care good girl wait knight taking one shining long scene stay
Topic 3: home goin hold going endlessly emotion alone thing hard know want love good girl exactly something act different mark baby
Topic 4: hell yeah right fuckin say fucked learned girl wit text like flew interview confession told confusing damn wish texting getting
Topic 5: nigga shit like got fuck boy man bitch back real yeah new made money quick whole friend admit tell make


## Latent Dirichlet Allocation (LDA)

In [55]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

In [60]:
n_components = 20
lda = LDA(n_components=n_components)
lda.fit(X_hiphop)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=20, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [65]:
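# get_feature_names() was removed in scikit-learn 1.2; on versions older
# than 1.0, use tfidf.get_feature_names() instead.
tfidf_features = tfidf.get_feature_names_out()
top_n_words = 20
k_topics = 4  # only print the first 4 of the 20 fitted topics
for i in range(k_topics):
    # Pair every vocabulary term with its weight in topic i
    topic = pd.DataFrame(data={'word': tfidf_features, 'weight': lda.components_[i]})
    # Keep the highest-weighted terms for this topic
    sorted_topic = topic.sort_values('weight', ascending=False).head(top_n_words)
    print("Topic %s:" % i, ' '.join(sorted_topic['word']))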

Topic 0: versace medusa mazi truey handcuff metropolis ferragamo egyptian pissing remixing pharaoh dyckman overload exquisite optimist strictly gated binoculars lingo nicely
Topic 1: get pleasure really count chainz killing ballplayer team whiskey nigga balled man carnival bitch agreed weh ashore unforgettable collect wildfire
Topic 2: nobody like ready nigga much belong low shit prettiest versace doubt grammy bottom know beating chillin yeah make rather liable
Topic 3: jumpman little tattoo frank home happening bit heartbreak smokin calabasas baddest nobu lightin babylon reside endlessly tiptoeing ooohhhh girrrrrrrrl shopping


## Comparing NMF vs. LDA

Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) are both topic modeling tools. The main difference is that LDA takes a Bayesian approach: it is a generative model with Dirichlet priors placed on its topic distributions. NMF returns a single fixed set of topic-word weights, while LDA's topic-word distributions also depend on how the priors are tuned (in scikit-learn, the `doc_topic_prior` and `topic_word_prior` hyperparameters, alongside the number of components $k$). NMF would be a better choice if the topic probabilities are fixed for each document (ref). Also, if our dataset is small, LDA may perform worse since it could introduce too much variability into the model (ref).
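For reference, the two models can be written side by side in standard textbook notation. The symbols $U$, $V$, $\theta_d$, $\phi_k$, $\alpha$ and $\beta$ are introduced here for illustration and are not defined elsewhere in this notebook. With the Frobenius loss shown in the NMF output above, NMF looks for non-negative factors that minimize the reconstruction error of the TFIDF matrix $X$:

$$
\min_{U \ge 0,\; V \ge 0} \; \lVert X - UV \rVert_F^2
$$

where $U$ holds the document-topic weights and $V$ the topic-word weights. LDA instead assumes that each document $d$ is generated as follows:

$$
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
\phi_k \sim \mathrm{Dirichlet}(\beta), \qquad
Z_i \sim \mathrm{Categorical}(\theta_d), \qquad
W_i \sim \mathrm{Categorical}(\phi_{Z_i})
$$

Here $\theta_d$ is the topic mixture of document $d$, $\phi_k$ is the word distribution of topic $k$, $Z_i$ is the topic assigned to the $i$-th word, and $W_i$ is the word itself. In scikit-learn, $\alpha$ and $\beta$ correspond to `doc_topic_prior` and `topic_word_prior`.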

Unlike NMF, reconstructing X with LDA has no closed-form solution. We need to use Monte Carlo simulation: first sample from the distribution of $Z$ (the distribution of topics for each sample), then from the distribution of $W$ (the distribution of words for topic $Z_i$).
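The contrast can be sketched on a toy model. Everything below (the vocabulary, the `topic_word` and `doc_topic` arrays, the seed, the token count) is invented for illustration and is not taken from the fitted models. The matrix product stands in for an NMF-style closed-form reconstruction, and the sampling loop follows the two-step Monte Carlo procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy model: 2 topics over a 4-word vocabulary.
vocab = ["love", "girl", "money", "cash"]
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05],  # word distribution for topic 0
    [0.05, 0.05, 0.50, 0.40],  # word distribution for topic 1
])
doc_topic = np.array([0.8, 0.2])  # topic mixture for a single document

# Closed form: one matrix product gives the expected word proportions.
expected_word_probs = doc_topic @ topic_word

# Monte Carlo: sample a topic Z_i for each token, then a word W_i from that topic.
n_tokens = 5000
z = rng.choice(len(doc_topic), size=n_tokens, p=doc_topic)
w = np.array([rng.choice(len(vocab), p=topic_word[k]) for k in z])
sampled_word_freqs = np.bincount(w, minlength=len(vocab)) / n_tokens

print("closed form:", expected_word_probs)
print("sampled:    ", sampled_word_freqs)
```

With enough sampled tokens, the empirical word frequencies should approach the closed-form proportions, but each run of the sampler only approximates them.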