# Topic Modeling Song Lyrics

We will perform topic modeling with two techniques, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), using tools from scikit-learn and gensim. All topic modeling code is contained in the `topic_modeling.py` script.

In [4]:
import pandas as pd
import numpy as np

In [5]:
data = pd.read_csv('data/lyrics_indie.csv')
data = data.dropna()

## Perform TFIDF Vectorization

Before we can start topic modeling, we must apply term frequency-inverse document frequency (TFIDF) vectorization to our tokenized dataset. TFIDF measures how important a word is to a document in a collection or corpus ([ref](https://www.wikiwand.com/en/Tf%E2%80%93idf)). For example, suppose the word "like" appears in nearly every song. TFIDF downweights "like" because a word that occurs throughout the corpus does little to distinguish one song from another. Now suppose "democracy" appears in one of those songs but is very rare across all songs. TFIDF upweights it in that song because it occurs in so few documents in our corpus.
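To make the weighting concrete, here is a minimal sketch that computes TFIDF by hand on a toy corpus. The corpus and the helper names `idf` and `tfidf` are made up for illustration. The smoothed IDF formula, $\ln\frac{1 + n}{1 + df} + 1$, is the one scikit-learn documents as its default, but the sketch skips the row normalization that `TfidfVectorizer` also applies, so the numbers will not match its output exactly.

```python
import math

# Hypothetical toy corpus: three tiny "songs", already tokenized.
docs = [
    ["like", "love", "like"],
    ["like", "democracy"],
    ["like", "night"],
]

def idf(term, docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf(term, doc, docs):
    """Raw term count in one document times the corpus-level IDF."""
    return doc.count(term) * idf(term, docs)

# Both words occur once in the second song, but "like" is in every song.
common = tfidf("like", docs[1], docs)
rare = tfidf("democracy", docs[1], docs)
print(f"like: {common:.3f}, democracy: {rare:.3f}")
```

Both words have the same count in the second song, yet "democracy" ends up with the larger weight because it appears in only one document.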

Note: scikit-learn's `TfidfVectorizer` expects an array of strings. So, we will need to concatenate our tokenized words together as a string for TFIDF to work properly. That being said, our concatenated tokenized words are very different from our original lyrics because we filtered out stopwords and performed lemmatization.
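As a rough sketch of that concatenation step, assuming each song has already been reduced to a list of lemmatized tokens (the `tokenized_lyrics` list below is hypothetical and is not the project's data):

```python
# Hypothetical tokenized songs: stopwords removed, words lemmatized.
tokenized_lyrics = [
    ["love", "heart", "night"],
    ["rain", "sun"],
]

# Join each token list into one space-separated string per song,
# which is the input shape a text vectorizer expects.
processed = [" ".join(tokens) for tokens in tokenized_lyrics]
print(processed)
```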

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', min_df=3, max_df=0.9)
X = tfidf.fit_transform(data['processed_lyrics'])
print("TFIDF matrix dimensions:",X.shape)

TFIDF matrix dimensions: (3148, 17681)


With scikit-learn's `TfidfVectorizer`, you can specify a minimum and maximum document frequency (`min_df`, `max_df`). I set `min_df` to 3, which means that a word must appear in at least 3 documents for the vectorizer to include it. I set `max_df` to 0.9, which ignores words that appear in more than 90% of documents. You can think of it as a filter for corpus-specific stopwords.
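The filtering rule can be sketched in plain Python. This is an illustration of the idea rather than scikit-learn's implementation; the toy corpus and the helper `filter_vocabulary` are hypothetical, and `min_df=2` is used only because the corpus is so small.

```python
from collections import Counter

# Hypothetical toy corpus of four short "songs".
docs = [
    "love night love",
    "love rain",
    "love night sun",
    "love democracy",
]

def filter_vocabulary(docs, min_df, max_df):
    """Keep terms found in at least `min_df` documents (a count)
    and in at most a `max_df` fraction of all documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.split()))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

kept = filter_vocabulary(docs, min_df=2, max_df=0.9)
print(kept)  # ['night']
```

Here "love" is dropped because it appears in every document, and "rain", "sun", and "democracy" are dropped because each appears in only one.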

In [19]:
X

<3148x3570 sparse matrix of type '<class 'numpy.float64'>'
	with 119558 stored elements in Compressed Sparse Row format>

Now that we have our TFIDF matrix, we can start topic modeling with NMF and LDA.

## Non-negative Matrix Factorization (NMF)

NMF was popularized by Lee and Seung in 1999, who used it to learn parts-based representations of facial images. It starts with a document-word matrix $X$, where entry $X_{ij}$ holds the weight of word $w_i$ in document $d_j$ (a raw count or a tf-idf score, depending on the vectorizer). This matrix gets factorized into two smaller non-negative matrices, a word-topic matrix $W_{ik}$ and a topic-document matrix $H_{kj}$, so that $X \approx WH$. The columns of $W$ represent the $k$ topics discovered from the documents, while $H_{kj}$ holds the coefficient weights of the topics in each document. By reducing the dimensionality of our original document-word matrix, we are able to extract information about $k$ topics. Note that scikit-learn stores the transpose of this layout: rows of `X` are documents, and `nmf.components_` has one row per topic and one column per word.



<img src="images/matrix_factorization.png" width="50%"/>

Factorizing $X$ into $W$ and $H$ means optimizing an objective function, which in this case is the reconstruction error between $X$ and the product of its factors, $WH$. $W$ and $H$ are updated iteratively until convergence (i.e., until the reconstruction error stops decreasing by more than a small tolerance). In our example, a song represents one "document" in our $X$ matrix. Our goal is to reduce the dimensionality of our song-word matrix, $X$, so that we can extract $k$ meaningful topics.
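The iterative update can be sketched with the multiplicative update rules from Lee and Seung, which minimize the Frobenius reconstruction error. This is an illustrative sketch on random toy data, not what the cell below runs: scikit-learn's `NMF` defaults to a coordinate descent solver (`solver='cd'`). The matrix sizes, the iteration count, and the small constant `eps` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative matrix (hypothetical data): 6 words x 5 documents.
X = rng.random((6, 5))
k = 2

# Random non-negative starting points for the two factors.
W = rng.random((6, k))   # word-topic matrix
H = rng.random((k, 5))   # topic-document matrix
eps = 1e-10              # guards against division by zero

errors = []
for _ in range(200):
    # Multiplicative updates keep every entry of W and H non-negative.
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)
    # Frobenius norm of the residual after this iteration.
    errors.append(np.linalg.norm(X - W @ H))

print(f"error after first update: {errors[0]:.4f}")
print(f"error after last update:  {errors[-1]:.4f}")
```

Tracking `errors` shows the reconstruction error shrinking as the factors are refined; in practice one would stop once the improvement falls below a tolerance instead of running a fixed number of iterations.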

In [27]:
from sklearn.decomposition import NMF

k_topics = 8
nmf = NMF(n_components=k_topics)
nmf.fit(X)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=8, random_state=None, shuffle=False, solver='cd',
  tol=0.0001, verbose=0)

In [35]:
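# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2.
tfidf_features = tfidf.get_feature_names()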
top_n_words = 10
word_dict = dict()

for i in range(0,k_topics):
    topic = pd.DataFrame(data={'word':tfidf_features, 'weight':nmf.components_[i]})
    sorted_topic = topic.sort_values('weight', ascending=False).head(top_n_words)
    word_dict[i+1] = list(sorted_topic['word'])

pd.DataFrame(word_dict)

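| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 0 | time | instrumental | love | know | want | let | gon | que |
| 1 | like | purely | heart | heart | say | come | wan | pa |
| 2 | way | team | need | tell | need | home | tonight | le |
| 3 | day | lyric | said | feel | tell | leave | make | por |
| 4 | got | song | like | night | feel | shine | got | qui |
| 5 | say | devil | anymore | said | girl | light | run | los |
| 6 | away | ooh | darling | make | really | long | baby | comme |
| 7 | thing | motion | baby | alright | think | rain | try | tout |
| 8 | eye | frozen | life | think | hear | heart | lose | pero |
| 9 | life | captivating | hold | baby | ooh | sun | stop | pour |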


We looked at the top 10 most "relevant" words across 8 topics in our indie lyric corpus. Some of the topics are hard to summarize, but others are quite obvious. For example, Topic 3 is clearly about `love` and Topic 8 captures lyrics from non-English songs. 

Note that results can change if you try different values of $k$. Choosing a small $k$ can result in extremely broad topics, while choosing a large $k$ can lead to over-clustering, which produces many highly similar topics ([ref](https://arxiv.org/pdf/1404.4606.pdf)). There are strategies for identifying an optimal $k$ (e.g., term-centric stability analysis, k-clustering, etc.), but this is outside the scope of this project.

## Latent Dirichlet Allocation (LDA)

In [16]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

k_topics = 6
lda = LDA(n_components=k_topics, max_iter=15, learning_method='online')
lda.fit(X)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=15, mean_change_tol=0.001,
             n_components=6, n_jobs=1, n_topics=None, perp_tol=0.1,
             random_state=None, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [17]:
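# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2.
tfidf_features = tfidf.get_feature_names()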
top_n_words = 20
for i in range(0,k_topics):
    topic = pd.DataFrame(data={'word':tfidf_features, 'weight':lda.components_[i]})
    sorted_topic = topic.sort_values('weight', ascending=False).head(top_n_words)
    print("Topic %s:" % i, ' '.join(sorted_topic['word']))

Topic 0: que pa le non qui por amor los tout nous dans sem pour che comme quand sol suis plus est
Topic 1: know love like time come say let got want way day feel make away heart thing life night need gon
Topic 2: handa kita bein marilyn henry nãº monroe ano kong lagi held marry ohh chorus man paris smoked eaten dinner married
Topic 3: kau aku yang tak pergi hanya ore dan revoir kita kini lagi total lain mental outta repeat french door fake
Topic 4: mmmm dash rejoice mmmmm grandmother believer translation dialect dawning original pit mmm english wade person dawn hill silence shadow living
Topic 5: uptown congregation rotation station wilt working fork afloat dreaming beak childhood calming maid leaving spoken hospital wise course mobile giro


In [111]:
import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda, X, tfidf, mds='tsne')
pyLDAvis.save_html(panel, 'lda.html')
panel

## Comparing NMF vs. LDA

Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) are both topic modeling tools. The main difference is that LDA takes a Bayesian approach and places Dirichlet priors on top of the generative model. NMF's topic-word weights are fixed point estimates, while LDA's topic-word distributions vary with how the priors are tuned (in scikit-learn, the `doc_topic_prior` and `topic_word_prior` hyperparameters; the number of components $k$ is chosen separately in both models). NMF would be a better choice if the topic probabilities are fixed for each document. Also, if our dataset is small, LDA may have inferior performance since it could introduce too much variability to the model.

Unlike NMF, reconstructing $X$ with LDA does not have a closed-form solution. We need to use Monte Carlo simulation: first sample from the distribution of $Z$ (the distribution of topics for each sample), then from the distribution of $W$ (the distribution of words for topic $Z_i$; here $W$ denotes words, not the NMF factor above).
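The two-stage sampling can be sketched with NumPy. This is a toy illustration of LDA's generative process under assumed settings, not a reconstruction of the fitted model above: the vocabulary, the number of topics, and the Dirichlet concentration of 0.5 are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and model sizes.
vocab = np.array(["love", "heart", "night", "rain", "sun", "home"])
n_topics, n_words = 2, 8

# Topic-word distributions: one row per topic, drawn from a Dirichlet prior.
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)

# Topic mixture for a single document, also drawn from a Dirichlet prior.
doc_topic = rng.dirichlet(np.full(n_topics, 0.5))

# Stage 1: sample a topic assignment Z_i for each word slot.
z = rng.choice(n_topics, size=n_words, p=doc_topic)

# Stage 2: sample each word from the distribution of its assigned topic.
words = [str(rng.choice(vocab, p=topic_word[t])) for t in z]

print("topics:", z.tolist())
print("words: ", words)
```

Repeating these draws many times and averaging the resulting word counts gives a Monte Carlo estimate of the document's expected word distribution.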