# 4. Topic Modeling with LDA
Use this notebook to practice topic modeling (see the lecture slides for reference). First, load the transcript and preview a few rows.

In [1]:
import pandas as pd
transcript_df = pd.read_excel('data/excel/edi_2024_daniel_george.xlsx')
transcript_df.sample(3)

Unnamed: 0.1,Unnamed: 0,timestamp,speaker,utterance
477,477,00:42:41,Daniel,"OK, OK."
251,251,00:19:10,Daniel,This.
169,169,00:13:16,Yuxuan,"You know, like how they fold those, like reall..."


In [3]:
utterances = transcript_df['utterance'].tolist()
utterances[:3]

['Just record the voice and then next we can do something...',
 'Yes.',
 'But we need to make a video as well.']

In [4]:
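def remove_puncts(utterance_text, alphanumeric_only=True):
    """Lowercase the text, strip punctuation, and collapse repeated whitespace."""
    # Treat hyphens as word separators so "well-known" becomes "well known".
    utterance_text = utterance_text.replace('-', ' ')
    if alphanumeric_only:
        # Keep only letters, digits, and spaces.
        utterance_text = ''.join(e for e in utterance_text if e.isalnum() or e == ' ')
    # Collapse runs of whitespace into single spaces.
    return ' '.join(utterance_text.lower().split())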

In [5]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
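# Run once if the NLTK data is missing:
# nltk.download('stopwords')
# nltk.download('punkt')  # tokenizer data for word_tokenize ('punkt_tab' on newer NLTK releases)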
stop_words = set(stopwords.words('english'))

def get_words_tokenized_nopunct_nostop(utterances, stop_w=stop_words):
    utterance_words_list = []
    for utterance in utterances:
        clean_utterance = remove_puncts(utterance)
        words = word_tokenize(clean_utterance.lower())
        words_nostop = [word for word in words if word not in stop_w]
        utterance_words_list.append(words_nostop)
    return utterance_words_list

In [6]:
tokenized_utterances_list = get_words_tokenized_nopunct_nostop(utterances)
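Some utterances lose every token in this step: 'This.', for example, consists of a single stop word. As an optional check, the sketch below counts and filters empty token lists. It uses a small made-up list (`toy_tokenized`) in place of `tokenized_utterances_list`. Whether to drop empty utterances at all is a judgment call, since doing so breaks the one-to-one alignment with the rows of `transcript_df`.

```python
# Toy token lists standing in for tokenized_utterances_list (illustrative data only).
toy_tokenized = [
    ["record", "voice", "next", "something"],
    [],  # e.g. an utterance made up entirely of stop words
    ["need", "make", "video", "well"],
    [],
]

# Keep only utterances that still contain at least one token.
non_empty = [tokens for tokens in toy_tokenized if tokens]
dropped = len(toy_tokenized) - len(non_empty)
print(f"kept {len(non_empty)} utterances, dropped {dropped} empty ones")
```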

## Train an LDA Model
LDA requires the number of topics to be specified in advance. We start with 5 and adjust it if the resulting topics overlap.

In [16]:
import gensim

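# Map each distinct word to an integer id.
dictionary = gensim.corpora.Dictionary(tokenized_utterances_list)
# Convert each utterance into a bag of words: a list of (word_id, count) pairs.
corpus = [dictionary.doc2bow(utterance) for utterance in tokenized_utterances_list]
topic_count = 5
# passes=50 means the model sweeps over the whole corpus 50 times during training.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=topic_count, id2word=dictionary, passes=50)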
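To make the bag-of-words step concrete, the following plain-Python sketch mimics what `Dictionary` and `doc2bow` do conceptually: assign each distinct word an integer id, then represent each utterance as (id, count) pairs. The helper names (`build_vocab`, `to_bow`) and the toy data are made up for illustration, and gensim's actual id assignment may differ.

```python
from collections import Counter

# Toy tokenized utterances (illustrative data only).
toy_docs = [["make", "video", "video"], ["record", "voice", "video"]]

def build_vocab(docs):
    """Map each distinct word to an integer id, in first-seen order."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (word_id, count) pairs for the words of doc found in vocab."""
    counts = Counter(vocab[word] for word in doc if word in vocab)
    return sorted(counts.items())

vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(toy_corpus)
```

Word order within an utterance is discarded here; only the counts survive, which is all LDA uses.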

## Visualize the LDA Model
We use the famous [pyLDAvis](https://pypi.org/project/pyLDAvis/) library to make sense of the topics.

First, we suppress warnings in the notebook (otherwise you'll see many of them, caused by deprecations in the underlying packages).

In [17]:
import warnings
warnings.filterwarnings('ignore')

## IMPORTANT: Set the slider below to a $\lambda$ value of 0.4
A $\lambda$ of 1 ranks words by how often they occur within a topic, so words that are frequent across the whole transcript dominate every list: you get a global picture of the text, but little help in understanding the individual topics. A $\lambda$ of 0.1 favors words that are nearly unique to a topic, which highlights how the topics differ, but not what the topics are about. A $\lambda$ of 0.4 achieves a good balance. See [this paper](https://aclanthology.org/W14-3110.pdf) for more details.
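The relevance score behind the slider, as defined in the linked paper, is $\lambda \log \phi_{kw} + (1 - \lambda) \log(\phi_{kw} / p_w)$, where $\phi_{kw}$ is the probability of word $w$ in topic $k$ and $p_w$ is its overall probability in the corpus. The sketch below applies that formula to two made-up words with invented probabilities to show how the ranking flips as $\lambda$ decreases. The function names are illustrative and not part of pyLDAvis.

```python
import math

# Invented numbers for one topic: word -> (probability within the topic, probability in the corpus).
toy_words = {
    "like": (0.10, 0.09),   # frequent everywhere
    "fold": (0.04, 0.005),  # rarer, but concentrated in this topic
}

def relevance(phi_kw, p_w, lam):
    """Relevance of a word to a topic for a given lambda."""
    return lam * math.log(phi_kw) + (1 - lam) * math.log(phi_kw / p_w)

def rank_words(lam):
    """Return the toy words sorted from most to least relevant."""
    return sorted(toy_words, key=lambda w: relevance(*toy_words[w], lam), reverse=True)

for lam in (1.0, 0.4, 0.0):
    print(lam, rank_words(lam))
```

With these numbers, "like" ranks first only at $\lambda = 1$; at 0.4 and 0 the topic-specific "fold" moves to the top.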

In [18]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
gensimvis.prepare(ldamodel, corpus, dictionary)

## View topic distribution for a document (review)

In [14]:
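# Returns (topic_id, probability) pairs for the first utterance;
# topics below the model's minimum probability are omitted.
ldamodel.get_document_topics(corpus[0])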

[(0, 0.06777185), (1, 0.07171783), (2, 0.8605103)]
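
Each pair is a topic id and the share of the utterance attributed to that topic. As a small sketch, the dominant topic can be picked with `max`. The list below is copied from the output above rather than recomputed, so your own run will give different numbers.

```python
# Topic distribution copied from the output above: (topic_id, probability) pairs.
doc_topics = [(0, 0.06777185), (1, 0.07171783), (2, 0.8605103)]

# Pick the pair with the highest probability.
dominant_topic, dominant_share = max(doc_topics, key=lambda pair: pair[1])
print(f"dominant topic: {dominant_topic} ({dominant_share:.0%} of the utterance)")
```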

# Exercise: What are some topics relevant to you?
Explore the topics and their keywords. 