# Topic Modeling Song Lyrics

We will perform topic modeling with two techniques, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), using tools from scikit-learn and gensim. All topic modeling code is contained in the `modeling.py` script.

In [9]:
import pandas as pd
import numpy as np

In [11]:
data = pd.read_csv('data/lyrics_hiphop.csv')
data = data.dropna()

## Perform TFIDF Vectorization

Before we can start topic modeling, we must apply term frequency-inverse document frequency (TFIDF) vectorization to our tokenized dataset. TFIDF measures how important a word is to a document in a collection or corpus ([ref](https://www.wikiwand.com/en/Tf%E2%80%93idf)). For example, suppose the word "like" is very popular across all songs. TFIDF downweights the importance of "like" because it occurs frequently within our corpus. Now suppose "democracy" appears in one song but is very rare across all other songs. TFIDF upweights its importance because it doesn't occur very often in our corpus.
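To make the weighting concrete, here is a minimal hand-rolled sketch on a hypothetical three-song corpus. It assumes the smoothed IDF formula that scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`, and skips the row normalization that `TfidfVectorizer` also applies, so the numbers illustrate the idea rather than reproduce the vectorizer's output. The names `songs`, `idf`, and `tfidf` are made up for this example.

```python
import math

# Toy corpus (hypothetical): "like" appears in every song, "democracy" in only one.
songs = [
    ["like", "love", "like"],
    ["like", "money"],
    ["like", "democracy"],
]


def idf(term, docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf(term, doc, docs):
    """Raw count of the term in the document, scaled by the term's IDF."""
    return doc.count(term) * idf(term, docs)


# Both words occur once in the last song, so only the IDF factor differs.
last_song = songs[2]
like_weight = tfidf("like", last_song, songs)
democracy_weight = tfidf("democracy", last_song, songs)
print("like:", like_weight, "democracy:", democracy_weight)
```

Because "like" occurs in every song, its IDF factor is the minimum possible, while the rarer "democracy" receives a larger weight within the same song.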

Note: scikit-learn's `TfidfVectorizer` expects an array of strings, so we need to join each song's tokenized words back into a single string for TFIDF to work properly. These joined strings are very different from our original lyrics because we filtered out stopwords and performed lemmatization.
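The joining step can be sketched as follows. The token lists here are hypothetical stand-ins for the output of the preprocessing step, and `tokenized_lyrics` is an assumed name; in this notebook the joined strings are already stored in the `processed_lyrics` column.

```python
# Hypothetical tokenized songs, after stopword removal and lemmatization.
tokenized_lyrics = [
    ["know", "life", "never", "time"],
    ["love", "baby", "heart"],
]

# Join each token list into one whitespace-separated string per song,
# which is the input shape TfidfVectorizer works with.
processed_lyrics = [" ".join(tokens) for tokens in tokenized_lyrics]
print(processed_lyrics)
```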

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_hiphop = tfidf.fit_transform(data['processed_lyrics'])
print("TFIDF matrix dimensions:", X_hiphop.shape)

TFIDF matrix dimensions: (24846, 138554)


In [34]:
X_hiphop

<24846x138554 sparse matrix of type '<class 'numpy.float64'>'
	with 3648296 stored elements in Compressed Sparse Row format>

Now that we have our TFIDF matrix, we can start topic modeling with NMF and LDA.

## Non-negative Matrix Factorization (NMF)

In [6]:
from sklearn.decomposition import NMF

In [20]:
n_topics = 10
nmf = NMF(n_components=n_topics)
nmf.fit(X_hiphop)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=10, random_state=None, shuffle=False, solver='cd',
  tol=0.0001, verbose=0)

In [31]:
# On scikit-learn >= 1.0 use tfidf.get_feature_names_out(); get_feature_names() was removed in 1.2
tfidf_features = tfidf.get_feature_names()
top_n_words = 20

# Print the highest-weighted words for each NMF topic
for i in range(n_topics):
    topic = pd.DataFrame(data={'word': tfidf_features, 'weight': nmf.components_[i]})
    sorted_topic = topic.sort_values('weight', ascending=False).head(top_n_words)
    print("Topic %s:" % i, ' '.join(sorted_topic['word']))

Topic 0: know life never time one day could would see say way feel world thing cause think take mind tell still
Topic 1: nigga shit money fuck real got know thug wit get die gon hood motherfucker gun street ride yeah dog young
Topic 2: ich und der die ist nicht da wir wie auf mich sie ein mir mit doch den wenn denn dich
Topic 3: que con como por los soy la una pero esta todo para ma quiero cuando porque tengo sin vida hay
Topic 4: girl baby yeah know wan want let got right tonight gon need tell get night make take body come like
Topic 5: love baby heart need away like never ooh feel way make loving babe give hate hurt one chorus fall show
Topic 6: instrumental harmony cell talk one dflo lyric pillow cooly redman verse shuttle groove slide version sex singing chorus waterfall peanut
Topic 7: bitch fuck shit hoe as dick pussy fuckin fucking niggaz give money got like fucked motherfucker bad wan damn hook
Topic 8: like get got back niggaz wit man money chorus shit come hit cause rock rap 

## Latent Dirichlet Allocation (LDA)

In [35]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

In [36]:
n_topics = 10
lda = LDA(n_components=n_topics)
lda.fit(X_hiphop)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [37]:
# On scikit-learn >= 1.0 use tfidf.get_feature_names_out(); get_feature_names() was removed in 1.2
tfidf_features = tfidf.get_feature_names()
top_n_words = 20

# Print the highest-weighted words for each LDA topic
for i in range(n_topics):
    topic = pd.DataFrame(data={'word': tfidf_features, 'weight': lda.components_[i]})
    sorted_topic = topic.sort_values('weight', ascending=False).head(top_n_words)
    print("Topic %s:" % i, ' '.join(sorted_topic['word']))

Topic 0: like nigga got get know love shit bitch yeah let girl baby see one fuck want back make time man
Topic 1: meg jireh egy juiciest hogy nem engedj chaaaange swaggggg swagk askjfas loveeeeeeeee juiccy pussssy onez vagy gyere nzz kpekkel csak
Topic 2: surender oohoh kokoro rnrn lilijo najsliczniejsza matko boska maryjo flyover datz nani qurl tatoe mou dhat lng datte itsu rabbana
Topic 3: ich und der ist nicht die da wir wie auf mich sie mit mir ein doch den wenn dich denn
Topic 4: instermentual starbound anata koishikute somedayz jak dirtv slavic dake luuden kare scrubb hitburn kisi zobacz audrey zutto hitotsu eldo chikyuu
Topic 5: shoobeedoo heythere cutle gotio shoobee snoobedeebeebop delacratic naya zindagi blessingis aroundyoull promiseto comesyou failyour struggleor comesyoull jabberwock borogoves raths outgrabe
Topic 6: liberian naku hito watashi hsubakcits chiva breakadawn betsu anata dirtyface piya penda zoosk frenchie ayoooooo atashi ooooooooooooh eopsin koto naka
Topic 7:

## Comparing NMF vs. LDA

Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) are both topic modeling tools. The main difference is that LDA takes a Bayesian approach and places Dirichlet priors on top of the generative model. NMF’s topic-word weights are fixed, while LDA’s topic-word distributions vary based on how the priors were tuned (in scikit-learn, the `doc_topic_prior` and `topic_word_prior` hyperparameters). Both methods also require choosing the number of components $k$. NMF would be a better choice if the topic probabilities are fixed for each document (ref). Also, if our dataset is small, LDA may perform worse since it could introduce too much variability into the model (ref).

Unlike NMF, LDA has no closed-form way to reconstruct X. Instead, we need Monte Carlo simulation: we sample from the distribution of Z (the distribution of topics for each sample), then from the distribution of W (the distribution of words for topic $Z_i$).
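The two-stage sampling described above can be sketched with the standard library alone. This is a toy illustration, not scikit-learn's inference procedure: the vocabulary, the `topic_word` table, and the prior values are all hypothetical, and the Dirichlet draw uses the standard trick of normalizing independent Gamma samples.

```python
import random

rng = random.Random(0)  # seeded so the sketch is repeatable


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical two-topic model over a four-word vocabulary.
vocabulary = ["love", "baby", "money", "street"]
topic_word = [
    [0.5, 0.4, 0.05, 0.05],  # topic 0: word probabilities
    [0.05, 0.05, 0.5, 0.4],  # topic 1: word probabilities
]


def generate_document(n_words, doc_topic_prior, rng):
    """Sample a topic mixture, then a topic and a word for each position."""
    theta = sample_dirichlet(doc_topic_prior, rng)  # document's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topic_word)), weights=theta)[0]  # topic Z_i
        word = rng.choices(vocabulary, weights=topic_word[z])[0]   # word given Z_i
        words.append(word)
    return theta, words


theta, words = generate_document(8, [0.5, 0.5], rng)
print(theta, words)
```

Repeating this draw many times and counting the sampled words gives a Monte Carlo estimate of a document's expected word frequencies, which is the role the simulation plays in the reconstruction described above.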