<a href="https://colab.research.google.com/github/simon-clematide/colab-notebooks-for-teaching/blob/main/notebooks/topic_modeling_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling Demo on English Newsgroup Texts
  - Illustrates the use of term frequency (TF) and TF-IDF document representations
  - Shows interactive topic modeling exploration with pyLDAvis

# Setup




In [1]:
%pip install gensim==4.3.3 numpy==1.26.4 pyldavis

# Colab ships newer versions of these packages, so the runtime must be restarted after installing the pinned ones
from IPython.display import HTML, display
display(HTML("""If new packages were installed, please restart the runtime via the Runtime menu.<br><br>
         <code>Runtime → Restart runtime</code><br><br>
    This is necessary for the newly installed packages to take effect.
    """))



In [7]:
# Try to suppress noisy warnings (this does not silence all of them yet)
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
# Module-specific filters for warnings raised by the notebook machinery
warnings.filterwarnings("ignore", category=DeprecationWarning, module="ipykernel.ipkernel")
warnings.filterwarnings("ignore", category=DeprecationWarning, module="jupyter_client.*")

# `pyLDAvis`

pyLDAvis also supports LDA models fitted with scikit-learn (via `pyLDAvis.lda_model`). Let's look at this in more detail, using the 20 newsgroups dataset as provided by scikit-learn.

In [8]:
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

## Load 20 newsgroups dataset

First, the 20 newsgroups dataset available in sklearn is loaded. The headers, footers, and quoted replies are removed so that the topics are learned from the message bodies rather than from metadata.

Newsgroup categories:
`['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']`

In [9]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [10]:
cats = ['sci.med', 'alt.atheism', 'rec.autos', 'sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), categories=cats)
docs_raw = newsgroups.data
print(len(docs_raw))

2858


In [11]:
print(docs_raw[72])

No, he's not nuts, WIP is second to none THE sports station.  They
don't have Tony Bruno working ESPN radio and Al Morganti doing Friday
Night Hockey because they suck.  I live in Richmond Va, but I visit
Phila often, and on the way I get WTEM Washington) and WIP.  I hear
the FAN at night wherever I go (the signal used to be WNBC, when they
played golden oldies) because you can't avoid it.  Of those three,
WIP has the best hosts hands down.  Chuck Cooperstein isn't a homer,
and neither is Jody Mac.  WTEM is too generic to be placed in the
catergory.  In fact if you have heard WTEM and the FAN you notice the
theme music is identical...same ownership?? I think so!  WIP is
totally original.  Their hosts actually have a personality (this is a
knock at TEM (the TEAM) not the FAN because Mike and the Mad Dog and
Sommers are good) I mean comparing the morning guys in Philadelphia
to the ones in Washington is a total joke.  Anyway, I like the FAN
and WIP, but I think the edge goes to 'IP.  

W

## Convert to document-term matrix

Next, the raw documents are converted into a document-term matrix, either as raw counts (TF) or as TF-IDF weights.

In [12]:
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5,  # exclude terms that occur in more than 50% of the documents
                                min_df = 10    # exclude terms that occur in fewer than 10 documents
                                )
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
print(dtm_tf.shape)

(2858, 3234)


In [13]:
# What does a given document look like in this representation?
# Get the mapping from column indices to vocabulary items
index2vocabulary_item = tf_vectorizer.get_feature_names_out()

# Get a dense copy of this document's row of the document-term matrix
doc_index = 72  # Index of the document to show
doc_matrix = dtm_tf.getrow(doc_index).toarray()

# Print the words and their counts in the document
for i, count in enumerate(doc_matrix[0]):
    if count > 0:
        word = index2vocabulary_item[i]
        print(f"{word}: {count}")

actually: 1
admit: 1
avoid: 1
away: 1
basically: 1
believe: 1
best: 2
bet: 1
blown: 1
book: 1
cause: 1
chuck: 1
cobb: 2
comparing: 1
control: 1
days: 1
die: 1
doing: 2
don: 1
edge: 1
fact: 1
fan: 6
fans: 1
finally: 1
friday: 1
games: 2
glad: 1
goes: 1
golden: 1
good: 1
got: 1
guy: 1
guys: 2
hands: 1
hard: 1
hear: 2
heard: 3
homer: 1
identical: 1
inches: 1
isn: 1
joke: 1
knock: 2
knows: 1
like: 3
line: 1
listening: 1
live: 1
mac: 1
mad: 2
mean: 1
mike: 1
morning: 1
music: 1
national: 1
night: 2
notice: 1
nuts: 1
ones: 1
original: 1
philadelphia: 1
phillies: 3
phone: 1
placed: 1
play: 2
played: 1
radio: 1
really: 1
remember: 1
richmond: 2
right: 1
said: 1
san: 1
second: 1
signal: 1
sports: 3
started: 1
station: 4
steve: 1
strong: 1
suck: 1
summer: 1
team: 1
think: 2
thought: 1
tony: 1
total: 1
totally: 2
used: 1
usually: 1
visit: 1
washington: 2
wasn: 1
way: 1
weekend: 1
went: 3
win: 1
working: 1
year: 1
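The cell above densifies one row and scans every vocabulary column. As a hedged alternative sketch, the non-zero entries of a sparse row can be read directly from its `indices` and `data` arrays. The small matrix and the `toy_*` names below are assumptions that stand in for `dtm_tf` and its vocabulary.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2x5 count matrix standing in for dtm_tf
toy_vocab = ["ball", "engine", "fan", "orbit", "radio"]
toy_dtm = csr_matrix(np.array([[0, 2, 0, 1, 0],
                               [1, 0, 6, 0, 1]]))

row = toy_dtm.getrow(1)  # still sparse: only the non-zero cells are stored

# row.indices holds the column positions, row.data the matching counts
nonzero = {toy_vocab[j]: int(v) for j, v in zip(row.indices, row.data)}
print(nonzero)
```

For a vocabulary of a few thousand terms the difference is negligible, but the same pattern scales to much larger matrices.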


Alternatively, we can build a TF-IDF document-term matrix:

In [14]:
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
print(dtm_tfidf.shape)

(2858, 3234)
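The following sketch shows where TF-IDF weights come from, assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`): each count is multiplied by `ln((1 + n) / (1 + df)) + 1`, and every document row is then scaled to unit Euclidean length. The toy corpus and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, only for illustration
toy_docs = ["rocket orbit orbit", "rocket engine", "car engine engine"]

counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]

df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency
weighted = counts * idf                     # term frequency times IDF

# L2-normalize each row so that every document vector has length 1
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, reference))
```

The row normalization is the reason all TF-IDF weights of a document lie between 0 and 1, regardless of how long the document is.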


In [19]:
# What does a given document look like in this representation?
# Get the mapping from column indices to vocabulary items
index2vocabulary_item = tfidf_vectorizer.get_feature_names_out()

# Get a dense copy of this document's row of the document-term matrix
doc_index = 72  # Index of the document to show
doc_matrix = dtm_tfidf.getrow(doc_index).toarray()

# Print the words and their TF-IDF weights in the document
for i, weight in enumerate(doc_matrix[0]):
    if weight > 0:
        word = index2vocabulary_item[i]
        print(f"{word:<12}: {weight:.4f}")

actually    : 0.0490
admit       : 0.0730
avoid       : 0.0655
away        : 0.0546
basically   : 0.0648
believe     : 0.0468
best        : 0.0995
bet         : 0.0767
blown       : 0.0827
book        : 0.0584
cause       : 0.0549
chuck       : 0.0865
cobb        : 0.1689
comparing   : 0.0804
control     : 0.0593
days        : 0.0552
die         : 0.0743
doing       : 0.1079
don         : 0.0335
edge        : 0.0819
fact        : 0.0507
fan         : 0.4015
fans        : 0.0701
finally     : 0.0664
friday      : 0.0790
games       : 0.1180
glad        : 0.0767
goes        : 0.0602
golden      : 0.0854
good        : 0.0385
got         : 0.0481
guy         : 0.0622
guys        : 0.1383
hands       : 0.0747
hard        : 0.0561
hear        : 0.1259
heard       : 0.1672
homer       : 0.0854
identical   : 0.0811
inches      : 0.0827
isn         : 0.0514
joke        : 0.0835
knock       : 0.1730
knows       : 0.0677
like        : 0.1004
line        : 0.0589
listening   : 0.0854
live        :

## Fit Latent Dirichlet Allocation models

Finally, the LDA models are fitted. `n_components` is the number of topics.

In [None]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=10, random_state=0, verbose=1, max_iter=10)
lda_tf.fit(dtm_tf)

## Visualizing the models with pyLDAvis
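pyLDAvis places the topics on a two-dimensional map using multidimensional scaling (MDS), a dimensionality reduction technique: topics with similar word distributions are drawn close to each other.

Can you re-identify the newsgroups? `['sci.med', 'alt.atheism', 'rec.autos', 'sci.space', 'rec.sport.baseball']`

Hover over the topic circles and the terms to explore the connection between words and topics.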

In [None]:
pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer)
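To give an intuition for how such a topic map can be computed, here is a rough NumPy sketch: pairwise Jensen-Shannon divergences between topic-word distributions, followed by classical MDS (PCoA). This is an illustration under the assumption that the default layout works along these lines. It is not pyLDAvis's actual code, and the three toy topic distributions are invented.

```python
import numpy as np

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two strictly positive probability vectors."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    return 0.5 * kl_pm + 0.5 * kl_qm

# Hypothetical topic-word distributions over a 4-word vocabulary
topics = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],  # very similar to topic 0
    [0.05, 0.05, 0.20, 0.70],  # very different from topics 0 and 1
])
n = len(topics)

# Pairwise distance matrix between topics
dist = np.array([[jensen_shannon(p, q) for q in topics] for p in topics])

# Classical MDS: double-center the squared distances, then eigendecompose
centering = np.eye(n) - np.ones((n, n)) / n
gram = -0.5 * centering @ (dist ** 2) @ centering
eigvals, eigvecs = np.linalg.eigh(gram)
order = np.argsort(eigvals)[::-1][:2]  # two largest eigenvalues
coords = eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))

print(coords.round(3))  # one 2D point per topic
```

In the resulting coordinates, topics 0 and 1 land close together while topic 2 lies far away, which mirrors how overlapping circles in the visualization indicate related topics.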

## Topic Modeling with TF-IDF values

In [None]:
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=10, random_state=0, verbose=1, max_iter=10)
lda_tfidf.fit(dtm_tfidf)

In [None]:
pyLDAvis.lda_model.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

### Using different MDS functions

With `sklearn` installed, other MDS functions, such as metric MDS (`mmds`) and t-SNE (`tsne`), can be used for plotting if the default PCoA is not satisfactory.

In [None]:
pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='mmds')

In [None]:
pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='tsne')