# LDA Topic Modeling on Song Lyrics


This notebook performs LDA topic modeling on song lyrics [(see data source)](https://www.kaggle.com/albertsuarez/azlyrics).

**Author:** Tim Denzler

In [51]:
%load_ext autoreload
%autoreload 2
import warnings
import csv
import os
import pickle
import nltk
from tqdm import tqdm
warnings.filterwarnings('ignore')
#execute the following if not downloaded already:
#nltk.download('wordnet')
#nltk.download('stopwords')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
from langdetect import detect_langs
langs = detect_langs('Otec matka syn.')

## Step 1: Read CSV file and identify language

In [4]:
from langdetect import detect_langs
from langdetect.lang_detect_exception import LangDetectException

lyric_corpus = []
lang_filtered = 0
for filename in tqdm(os.listdir('azlyrics-scraper')):
    path = './azlyrics-scraper/'+filename
    with open(path, 'r') as f:
        data = csv.reader(f)
        headers = next(data) #skip headers
        for row in data:
            if len(row)< 5: #lyrics column not included
                continue
            if not row[4]: #lyrics are empty
                continue
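            try:
                langlist = detect_langs(row[4])  # language candidates with probabilities
            except LangDetectException:  # language could not be detected
                continue
            # keep a song only if it is clearly identified as English;
            # every rejected song is counted exactly once
            if any(language.lang == 'en' and language.prob >= 0.95 for language in langlist):
                lyric_corpus.append(row[4])
            else:
                lang_filtered += 1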

100%|██████████| 27/27 [18:30<00:00, 41.12s/it]
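
The filter rule used above can be looked at in isolation. The following is a minimal sketch that runs without langdetect: `Candidate` is a hypothetical stand-in for the objects returned by `detect_langs` (which expose `.lang` and `.prob`), and the probabilities are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for langdetect's result objects (.lang, .prob),
# defined here only so the sketch runs without the library.
Candidate = namedtuple("Candidate", ["lang", "prob"])

def is_confident_english(candidates, threshold=0.95):
    """Return True if any candidate is English with probability >= threshold."""
    return any(c.lang == "en" and c.prob >= threshold for c in candidates)

# Made-up detection results for three songs
clearly_english = [Candidate("en", 0.99)]
mixed_language = [Candidate("en", 0.57), Candidate("es", 0.43)]
clearly_czech = [Candidate("cs", 0.99)]

decisions = [is_confident_english(c)
             for c in (clearly_english, mixed_language, clearly_czech)]
print(decisions)  # [True, False, False]
```

A song with mixed-language lyrics is rejected even though English is its most likely language, because no candidate reaches the 0.95 threshold.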


In [5]:
print("Filtered", lang_filtered, "songs that had lyrics identified in a language other than English.", len(lyric_corpus), "songs remain for topic modeling")

Filtered 0 songs that had lyrics identified in a language other than English. 123794 songs remain for topic modeling


In [6]:
import pickle
# the with-statement closes the file even if pickling fails
with open("./data/lyric_corpus", 'wb') as fileObject:
    pickle.dump(lyric_corpus, fileObject)

## Step 2: Tokenization of Song Lyrics

All lyrics are currently stored as strings. First, each song's lyrics are converted to lowercase.
The strings are then tokenized with NLTK's *RegexpTokenizer* and stored in *lyric_corpus_tokenized*. Each word (`\w` matches word characters, i.e. letters, digits and the underscore) is now a string, and each song is a list of strings.
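
To illustrate what the `\w+` pattern does, the sketch below applies it with Python's built-in `re` module instead of NLTK. The sample line is made up, and `tokenize` is a helper name chosen for this illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return all maximal runs of word characters."""
    return re.findall(r"\w+", text.lower())

sample = "Don't stop believin', hold on 2 that feelin'!"
tokens = tokenize(sample)
print(tokens)
# ['don', 't', 'stop', 'believin', 'hold', 'on', '2', 'that', 'feelin']
```

Note that apostrophes split contractions ("don't" becomes "don" and "t"), which is one reason very short tokens are removed in the next step.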

In [52]:
from nltk.tokenize import RegexpTokenizer

lyric_corpus_tokenized = []
tokenizer = RegexpTokenizer(r'\w+')
for lyric in tqdm(lyric_corpus):
    tokenized_lyric = tokenizer.tokenize(lyric.lower())  #tokenize and lower each lyric
    lyric_corpus_tokenized.append(tokenized_lyric)

100%|██████████| 123794/123794 [00:11<00:00, 10342.80it/s]


In [53]:
def count_token_in_corpus (corpus): #returns number of tokens for any given corpus
    return_count = 0
    for song in corpus:
        return_count += len(song)
    print("Total of", return_count, "tokens in the current corpus.")

In [54]:
count_token_in_corpus(lyric_corpus_tokenized)

Total of 31455368 tokens in the current corpus.


## Step 3: Removing Numeric and Single Character Tokens

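Tokens that consist only of digits or are shorter than three characters are removed to reduce the dimensionality. Note that the condition `len(token) > 2` in the cell below drops two-character tokens as well as single characters.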
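
As a small illustration, the sketch below applies the same condition as the following cell to a made-up token list; `keep_token` is a helper name introduced only for this example.

```python
def keep_token(token):
    """Keep tokens longer than two characters that are not purely numeric."""
    return len(token) > 2 and not token.isnumeric()

tokens = ["don", "t", "stop", "on", "2", "1999", "route66", "that"]
filtered = [token for token in tokens if keep_token(token)]
print(filtered)  # ['don', 'stop', 'route66', 'that']
```

Tokens that mix letters and digits, such as "route66", pass the filter because `isnumeric()` is only true when every character is numeric.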

In [55]:
for s,song in tqdm(enumerate(lyric_corpus_tokenized)):
    filtered_song = []    
    for token in song:
        if len(token) > 2 and not token.isnumeric():
            filtered_song.append(token)
    lyric_corpus_tokenized[s] = filtered_song

123794it [00:06, 19741.66it/s]


In [56]:
count_token_in_corpus(lyric_corpus_tokenized)

Total of 22178517 tokens in the current corpus.


## Step 4: Token Lemmatization
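*NLTK's WordNetLemmatizer* is imported and applied to each token. Lemmatization reduces each word to its dictionary base form (lemma), so that inflected forms such as plurals map to the same token. Called without a part-of-speech argument, as in the cell below, `lemmatize` treats every token as a noun. The number of tokens does not change in this step.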

In [57]:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

for s,song in tqdm(enumerate(lyric_corpus_tokenized)):
    lemmatized_tokens = []
    for token in song:
        lemmatized_tokens.append(lemmatizer.lemmatize(token))
    lyric_corpus_tokenized[s] = lemmatized_tokens

123794it [01:16, 1619.44it/s]


In [58]:
count_token_in_corpus(lyric_corpus_tokenized)

Total of 22178517 tokens in the current corpus.


## Step 5: Remove Stop Words and Profanities
To further reduce dimensionality, words that carry little to no value for topic modeling are removed. For this, *NLTK's stopwords*, a list of common English stop words, is imported and extended with a few filler words that occurred in previous modeling attempts.
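
The filtering itself is a plain membership test per token. A minimal sketch with a hypothetical mini stop list (not NLTK's actual list) and a made-up song:

```python
# Hypothetical mini stop list for illustration; a set gives fast lookups
stop_words = {"the", "and", "you", "yeah", "ooh"}

song = ["ooh", "the", "night", "and", "you", "dance", "yeah"]
filtered_song = [token for token in song if token not in stop_words]
print(filtered_song)  # ['night', 'dance']
```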

In [59]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # set: fast membership tests in the loop below
new_stop_words = ['ooh','yeah','hey','whoa','woah', 'ohh', 'was', 'mmm', 'oooh','yah','yeh', 'hmm','deh','doh','jah','wa']
stop_words.update(new_stop_words)

for s,song in tqdm(enumerate(lyric_corpus_tokenized)):
    filtered_text = []    
    for token in song:
        if token not in stop_words:
            filtered_text.append(token)
    lyric_corpus_tokenized[s] = filtered_text

123794it [00:59, 2081.85it/s]


In [60]:
count_token_in_corpus(lyric_corpus_tokenized)

Total of 13787937 tokens in the current corpus.


Profanities are filtered out based on a predefined, comma-separated list stored in *profanity.txt*.

In [61]:
with open('profanity.txt', 'r') as file:
    prof_string = file.read().replace('\n', '')
# the file holds one comma-separated list; a set makes the lookups below fast
profanities = set(prof_string.split(", "))

In [62]:
for s,song in tqdm(enumerate(lyric_corpus_tokenized)):
    filtered_text = []    
    for token in song:
        if token not in profanities:
            filtered_text.append(token)
    lyric_corpus_tokenized[s] = filtered_text

123794it [03:20, 617.45it/s]


In [63]:
count_token_in_corpus(lyric_corpus_tokenized)

Total of 13527321 tokens in the current corpus.


In [64]:
# optionally store tokenized lyrics
with open("./data/lyric_corpus_tokenized", 'wb') as fileObject:
    pickle.dump(lyric_corpus_tokenized, fileObject)
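
A stored corpus can be read back with `pickle.load`, which makes it possible to skip the preprocessing steps in a later session. The sketch below shows the round trip on a made-up toy corpus in a temporary directory; with the real data, the path would be the one used in the cell above.

```python
import os
import pickle
import tempfile

toy_corpus = [["night", "dance"], ["heart", "rain"]]  # made-up stand-in

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lyric_corpus_tokenized")
    # store the corpus
    with open(path, "wb") as file_object:
        pickle.dump(toy_corpus, file_object)
    # read it back
    with open(path, "rb") as file_object:
        restored_corpus = pickle.load(file_object)

print(restored_corpus == toy_corpus)  # True
```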

## Step 6: Dictionary Creation and Filtering

A dictionary representation of the lyrics is created (mapping all tokens to a unique ID).

In [65]:
from gensim.corpora import Dictionary

dictionary = Dictionary(lyric_corpus_tokenized)
print('Number of unique tokens: ', len(dictionary))

Number of unique tokens:  132588


To further reduce dimensionality, tokens that occur in fewer than 100 songs, as well as tokens that occur in more than 80% of songs, are removed.
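
Both thresholds refer to document frequency, i.e. the number of songs a token appears in, not its total count. The following pure-Python sketch mimics that rule on a made-up four-song corpus. It is an illustration of the idea rather than gensim's implementation, and `filter_by_document_frequency` is a hypothetical helper name.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    document_frequency = Counter()
    for doc in docs:
        document_frequency.update(set(doc))  # count each token once per document
    max_documents = no_above * len(docs)
    return {token for token, n in document_frequency.items()
            if no_below <= n <= max_documents}

docs = [
    ["love", "night", "baby"],
    ["love", "heart"],
    ["love", "night", "rain"],
    ["love", "road"],
]
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.8)
print(kept)  # {'night'}
```

Here "love" is dropped because it appears in all four songs (more than 80%), and the tokens that appear in a single song fall below the lower threshold.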

In [66]:
dictionary.filter_extremes(no_below = 100, no_above = 0.8)
print('Number of unique tokens: ', len(dictionary))

Number of unique tokens:  5638


## Step 7: Bag-of-Words and Index to Dictionary Conversion

Each song (as of now a list of tokens) is converted into the bag-of-words format, which only stores the unique token ID and its count for each song.
<br>
<font color='red'> All preprocessing should be done before this step! </font>
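
As a rough illustration of the bag-of-words format, the sketch below builds the (token ID, count) pairs for one made-up song with `collections.Counter`. The `token2id` mapping and the helper `doc_to_bow` are assumptions for this example and only mimic what the dictionary does.

```python
from collections import Counter

def doc_to_bow(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs,
    ignoring tokens that are not in the mapping."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

token2id = {"heart": 0, "love": 1, "night": 2}  # hypothetical mapping
song = ["love", "night", "love", "zzz"]          # "zzz" is not in the mapping
bow = doc_to_bow(song, token2id)
print(bow)  # [(1, 2), (2, 1)]
```

Word order is lost in this representation, and tokens that were filtered out of the dictionary simply disappear from the song.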

In [67]:
from gensim.corpora import MmCorpus

gensim_corpus = [dictionary.doc2bow(song) for song in lyric_corpus_tokenized]

#create index to dictionary
temp = dictionary[0]  # "loads" the dictionary
id2word = dictionary.id2token

## Step 8: Setting the Model Parameters

Before training, the model's parameters have to be set.
Based on the *gensim* documentation:
- *chunksize* = number of documents processed in each training chunk
- *passes* = number of passes through the corpus during training
- *iterations* = maximum number of iterations over a chunk when inferring the topic distribution of its documents
- *num_topics* = number of topics the model is trained for
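
To get a feeling for the training effort, the arithmetic below estimates how many chunks are processed in total. It assumes that the corpus is split into consecutive chunks of `chunksize` documents in every pass and uses the corpus size reported in Step 1 together with the values set in the next cell.

```python
import math

num_documents = 123794  # corpus size reported in Step 1
chunksize = 2000
passes = 20

chunks_per_pass = math.ceil(num_documents / chunksize)
total_chunks = chunks_per_pass * passes
print(chunks_per_pass, total_chunks)  # 62 1240
```

A larger `chunksize` means fewer but more expensive updates per pass and a higher memory footprint per chunk.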

In [72]:
# training parameters
chunksize = 2000
passes = 20
iterations = 400
num_topics = 6

## Step 9: Execute Training and Calculate Coherence

An LDA topic model is trained for k topics (*num_topics*). Because the song corpus is large, this may take some time.

In [73]:
from gensim.models import LdaModel
lda_model = LdaModel(
    corpus=gensim_corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes
)

Calculate Cv coherence score (optional):

In [74]:
#from gensim.models.coherencemodel import CoherenceModel

#coherencemodel = CoherenceModel(model=lda_model, texts=lyric_corpus_tokenized, dictionary=dictionary, coherence='c_v')
#print(coherencemodel.get_coherence())

## Step 10: Visualize the LDA model using pyLDAvis
The trained model is visualized and the visualization is stored as an HTML file.

In [75]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

vis_data = gensimvis.prepare(lda_model, gensim_corpus, dictionary)
#pyLDAvis.display(vis_data)
pyLDAvis.save_html(vis_data, './Lyrics_LDA_k_'+ str(num_topics) +'.html')