# Topic Modeling in Python and Interactive Visualization with pyLDAvis

We will be using the 20 newsgroups dataset as provided by scikit-learn.

pip install pyLDAvis==3.4.0

pip install scikit-learn==1.2.2

Check the installed version with pyLDAvis.\_\_version\_\_

In [1]:
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import numpy as np
import sklearn

In [2]:
print(pyLDAvis.__version__)
print(sklearn.__version__)

3.4.0
1.2.2
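The versions can also be read from the installed package metadata without importing the libraries. The sketch below uses only the standard library; the helper name installed_version is made up for illustration. Note that the lookup uses distribution names, so scikit-learn rather than the import name sklearn.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names, not import names: "scikit-learn" rather than "sklearn".
for name in ("pyLDAvis", "scikit-learn"):
    print(name, installed_version(name))
```

A result of None means the distribution is not installed in the current environment.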


## Load 20 newsgroups dataset

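First, the 20 newsgroups documents are loaded. The dataset is available in sklearn; here a saved copy is read from a local pickle file, 20news.pkl. As is usual for this dataset, the headers, footers and quotes have been removed.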

In [3]:
import pickle as pkl
with open('20news.pkl', 'rb') as f:  # the with-block closes the file after reading
    docs_raw = pkl.load(f)

In [5]:
len(docs_raw)

3273
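How 20news.pkl was produced is not shown here. A file of this kind could be created roughly as sketched below. The helper names fetch_docs, save_docs and load_docs are hypothetical, and the subset and categories that yield the 3273 documents are not stated, so they are left as parameters.

```python
import pickle


def fetch_docs(subset="train", categories=None):
    """Download 20 newsgroups posts with headers, footers and quotes stripped."""
    # Imported here because the first call downloads the dataset.
    from sklearn.datasets import fetch_20newsgroups

    bunch = fetch_20newsgroups(
        subset=subset,
        categories=categories,
        remove=("headers", "footers", "quotes"),
    )
    return bunch.data  # list of raw document strings


def save_docs(docs, path):
    """Pickle a list of documents to the given path."""
    with open(path, "wb") as fh:
        pickle.dump(list(docs), fh)


def load_docs(path):
    """Read a pickled list of documents back from the given path."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

For example, save_docs(fetch_docs(), '20news.pkl') would write a file that the loading cell above can read, although its document count may differ from the one shown.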

In [7]:
docs_raw[0]

"Hello World,\n\t     just bought a new Stealth two weeks ago. Got a grad student \n rebate. Someone told me that there's another $400 reabet for 1st time\n Chrysler buyer. True ? If yes can I still get it or am I too late ?\n"

## Convert to document-term matrix

Next, the raw documents are converted into a document-term matrix, here with TF-IDF weighting.

In [8]:
tfidf_vectorizer = TfidfVectorizer(strip_accents = 'unicode',
                                   stop_words = 'english',
                                   lowercase = True,
                                   token_pattern = r'\b[a-zA-Z]{3,}\b',
                                   max_df = 0.5, 
                                   min_df = 10)
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
print(dtm_tfidf.shape)

(3273, 3466)
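To see which tokens the token_pattern above lets through, the pattern can be tried on its own with the re module. This is only a sketch of the tokenization step: it lowercases and applies the regular expression, but it does not reproduce accent stripping, stop word removal, or the min_df and max_df filters.

```python
import re

# Same pattern as passed to TfidfVectorizer: words of three or more ASCII letters.
TOKEN_PATTERN = re.compile(r"\b[a-zA-Z]{3,}\b")


def tokenize(text):
    """Lowercase the text, then keep runs of three or more letters."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("Got a $400 rebate for 1st time"))
# ['got', 'rebate', 'for', 'time']
```

Numbers, one- and two-letter words, and mixed tokens such as "1st" are discarded. In the real vectorizer a word such as "for" would then also be dropped by the English stop word list.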


## Fit Latent Dirichlet Allocation models

Finally, the LDA model is fitted.

Use sklearn.decomposition.LatentDirichletAllocation.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

In [19]:
# Fit an LDA model with 10 topics on the TF-IDF matrix
lda = LatentDirichletAllocation(n_components=10)
lda.fit(dtm_tfidf)

**Proportion of each topic in each document:**

In [20]:
# Per-document topic proportions: one row per document, one column per topic
doc_topic = lda.transform(dtm_tfidf)

In [21]:
doc_topic.shape

(3273, 10)

In [22]:
doc_topic[0]

array([0.01999288, 0.0199928 , 0.01999293, 0.01999365, 0.01999625,
       0.01999295, 0.01999286, 0.82004233, 0.01999491, 0.02000844])
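Each row of doc_topic is a probability distribution over the 10 topics, so it sums to 1 and its largest entry marks the dominant topic. The sketch below checks this on the row printed above, using plain Python so it runs without the fitted model; the helper name dominant_topic is made up for illustration.

```python
# Topic proportions of the first document, copied from the output above.
doc0 = [0.01999288, 0.0199928, 0.01999293, 0.01999365, 0.01999625,
        0.01999295, 0.01999286, 0.82004233, 0.01999491, 0.02000844]


def dominant_topic(proportions):
    """Return (index, weight) of the largest entry in one row."""
    index = max(range(len(proportions)), key=lambda k: proportions[k])
    return index, proportions[index]


total = sum(doc0)
topic, weight = dominant_topic(doc0)
print(round(total, 6), topic, weight)
# 1.0 7 0.82004233
```

On the full matrix the same lookup is doc_topic.argmax(axis=1). The index is 0-based, so 7 refers to the eighth topic in sklearn's ordering.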

**Word distributions of the k topics (one row per topic, unnormalized):**

In [23]:
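# components_ has one row of unnormalized word weights per topic
topics = lda.components_
topics.shape  # (10, 3466); not echoed, since only the last expression is displayed
topics[1]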

array([0.1       , 0.10000003, 0.10000001, ..., 0.10000001, 0.10000001,
       0.10000003])
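The rows of components_ are pseudo-counts rather than probabilities. Dividing each row by its sum gives a word distribution per topic, and sorting a row gives that topic's top terms. Entries close to 0.1 are most likely words that received almost no weight beyond the default prior, which is 1 / n_components. The sketch below uses toy numbers and a toy vocabulary in place of lda.components_ and the vectorizer's feature names; the helper names are hypothetical.

```python
def normalize_rows(weights):
    """Scale each row so it sums to 1, giving one word distribution per topic."""
    result = []
    for row in weights:
        total = sum(row)
        result.append([w / total for w in row])
    return result


def top_terms(row, vocabulary, n=3):
    """Return the n vocabulary entries with the largest weight in one topic row."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return [vocabulary[j] for j in order[:n]]


# Toy stand-ins for lda.components_ and the vectorizer vocabulary.
toy_weights = [[0.1, 2.1, 0.1, 1.7], [3.1, 0.1, 0.8, 0.1]]
toy_vocab = ["car", "god", "engine", "faith"]

for k, dist in enumerate(normalize_rows(toy_weights)):
    print(k, top_terms(dist, toy_vocab, n=2))
# 0 ['god', 'faith']
# 1 ['car', 'engine']
```

With the fitted objects from this notebook, the vocabulary would come from tfidf_vectorizer.get_feature_names_out() and the rows from lda.components_.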

## Visualizing the models with pyLDAvis

In [24]:
pyLDAvis.lda_model.prepare(lda, dtm_tfidf, tfidf_vectorizer)
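Outside a notebook, the prepared data can be written to a standalone HTML file with pyLDAvis.save_html. The wrapper below is a sketch; the function name export_lda_vis and the default file name are assumptions made for illustration.

```python
def export_lda_vis(lda, dtm, vectorizer, path="lda_vis.html"):
    """Prepare the pyLDAvis data and write it to a standalone HTML file."""
    # Imported lazily so the helper can be defined before pyLDAvis is installed.
    import pyLDAvis
    import pyLDAvis.lda_model

    prepared = pyLDAvis.lda_model.prepare(lda, dtm, vectorizer)
    pyLDAvis.save_html(prepared, path)
    return path
```

Calling export_lda_vis(lda, dtm_tfidf, tfidf_vectorizer) with the objects fitted above would write lda_vis.html to the working directory. By default pyLDAvis orders topics by size, so the topic numbers in the visualization need not match the column indices of doc_topic.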