<a href="https://colab.research.google.com/github/simon-clematide/colab-notebooks-for-teaching/blob/main/notebooks/topic_modeling_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup




In [None]:
#! pip install --upgrade pandas-profiling jupyter
! pip install pyldavis # works with current pyldavis now!

In [2]:
# Try to suppress warnings (this does not catch all of them yet)
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning, module="ipykernel.ipkernel")


# `pyLDAvis`

pyLDAvis also supports LDA models fitted with scikit-learn. Let's look at this in more detail, using the 20 newsgroups dataset as provided by scikit-learn.

In [3]:
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

## Load 20 newsgroups dataset

First, we load the 20 newsgroups dataset available in sklearn. The headers, footers and quotes are removed so that the model sees only the message bodies. Only five of the categories listed here are used below.

Newsgroup categories:
`['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']`

In [4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [5]:
cats = ['sci.med', 'alt.atheism', 'rec.autos', 'sci.space','rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'),categories=cats)
docs_raw = newsgroups.data
print(len(docs_raw))

2858


In [6]:
print(docs_raw[72])

No, he's not nuts, WIP is second to none THE sports station.  They
don't have Tony Bruno working ESPN radio and Al Morganti doing Friday
Night Hockey because they suck.  I live in Richmond Va, but I visit
Phila often, and on the way I get WTEM Washington) and WIP.  I hear
the FAN at night wherever I go (the signal used to be WNBC, when they
played golden oldies) because you can't avoid it.  Of those three,
WIP has the best hosts hands down.  Chuck Cooperstein isn't a homer,
and neither is Jody Mac.  WTEM is too generic to be placed in the
catergory.  In fact if you have heard WTEM and the FAN you notice the
theme music is identical...same ownership?? I think so!  WIP is
totally original.  Their hosts actually have a personality (this is a
knock at TEM (the TEAM) not the FAN because Mike and the Mad Dog and
Sommers are good) I mean comparing the morning guys in Philadelphia
to the ones in Washington is a total joke.  Anyway, I like the FAN
and WIP, but I think the edge goes to 'IP.  

W


## Convert to document-term matrix

Next, the raw documents are converted into a document-term matrix, either as raw counts or in TF-IDF form.

In [7]:
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5,  # exclude words that occur in more than 50% of the documents
                                min_df = 10    # exclude tokens that occur in fewer than 10 documents
                                )
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
print(dtm_tf.shape)

(2858, 3234)


In [8]:
# What does a certain document look like in this representation?
# Get the mapping of column indices to vocabulary items
index2vocabulary_item = tf_vectorizer.get_feature_names_out()

# Get a dense copy of this document's row of the document-term matrix
doc_index = 72  # Index of the document to show
doc_matrix = dtm_tf.getrow(doc_index).toarray()

# Print the words and their counts in the document
for i, count in enumerate(doc_matrix[0]):
    if count > 0:
        word = index2vocabulary_item[i]
        print(f"{word}: {count}")

actually: 1
admit: 1
avoid: 1
away: 1
basically: 1
believe: 1
best: 2
bet: 1
blown: 1
book: 1
cause: 1
chuck: 1
cobb: 2
comparing: 1
control: 1
days: 1
die: 1
doing: 2
don: 1
edge: 1
fact: 1
fan: 6
fans: 1
finally: 1
friday: 1
games: 2
glad: 1
goes: 1
golden: 1
good: 1
got: 1
guy: 1
guys: 2
hands: 1
hard: 1
hear: 2
heard: 3
homer: 1
identical: 1
inches: 1
isn: 1
joke: 1
knock: 2
knows: 1
like: 3
line: 1
listening: 1
live: 1
mac: 1
mad: 2
mean: 1
mike: 1
morning: 1
music: 1
national: 1
night: 2
notice: 1
nuts: 1
ones: 1
original: 1
philadelphia: 1
phillies: 3
phone: 1
placed: 1
play: 2
played: 1
radio: 1
really: 1
remember: 1
richmond: 2
right: 1
said: 1
san: 1
second: 1
signal: 1
sports: 3
started: 1
station: 4
steve: 1
strong: 1
suck: 1
summer: 1
team: 1
think: 2
thought: 1
tony: 1
total: 1
totally: 2
used: 1
usually: 1
visit: 1
washington: 2
wasn: 1
way: 1
weekend: 1
went: 3
win: 1
working: 1
year: 1



Alternatively, we can build a TF-IDF document-term matrix.

In [9]:
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
print(dtm_tfidf.shape)

(2858, 3234)
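
To see where the TF-IDF values come from, the sketch below recomputes them by hand for a made-up toy corpus and compares the result with `TfidfVectorizer`. It assumes the vectorizer's default settings (smoothed idf, L2 row normalization, no sublinear tf scaling); with other settings the formula differs.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
toy_docs = ["fan fan station", "station radio", "radio doctor doctor"]

# Raw term counts per document
counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: in how many documents does each term occur?
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts and L2-normalize every document row
weights = counts * idf
weights /= np.linalg.norm(weights, axis=1, keepdims=True)

# Compare with the library result
tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
same = np.allclose(weights, tfidf)
print(same)
```

Because of the row normalization, the weights of one document are only comparable with each other, not directly with the weights of another document.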


In [10]:
# What does a certain document look like in this representation?
# Get the mapping of column indices to vocabulary items
index2vocabulary_item = tfidf_vectorizer.get_feature_names_out()

# Get a dense copy of this document's row of the document-term matrix
doc_index = 72  # Index of the document to show
doc_matrix = dtm_tfidf.getrow(doc_index).toarray()

# Print the words and their TF-IDF weights in the document
for i, count in enumerate(doc_matrix[0]):
    if count > 0:
        word = index2vocabulary_item[i]
        print(f"{word}: {count}")

actually: 0.04904840759498951
admit: 0.07299497510211472
avoid: 0.06545807155396616
away: 0.05456901558145122
basically: 0.06478407157709502
believe: 0.04675876453764755
best: 0.0994820820268445
bet: 0.07670547711801053
blown: 0.08267003006955992
book: 0.05844788921599561
cause: 0.054878639478271374
chuck: 0.08651485315372326
cobb: 0.16890930987684408
comparing: 0.08037327994808072
control: 0.05929226473552773
days: 0.05519560691471015
die: 0.07426877981418571
doing: 0.10794098035118706
don: 0.03345256161085763
edge: 0.08185979207508924
fact: 0.05069871139558411
fan: 0.4015050116497602
fans: 0.07009394406678396
finally: 0.06641310873076137
friday: 0.07903568115882252
games: 0.11800967575437019
glad: 0.07670547711801053
goes: 0.0601936062442923
golden: 0.0854450956691524
good: 0.03852922757106624
got: 0.048135332254219995
guy: 0.062202102910416196
guys: 0.1383003040359028
hands: 0.074721868417729
hard: 0.056079554382427656
hear: 0.12588956947024332
heard: 0.16722355825998003
homer: 0.08


## Fit Latent Dirichlet Allocation models

Finally, the LDA models are fitted. `n_components` is the number of topics.

In [11]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=10, random_state=0,verbose=1, max_iter=10)
lda_tf.fit(dtm_tf)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


## Visualizing the models with pyLDAvis
Multidimensional scaling (MDS) is a form of dimension reduction: pyLDAvis uses it to place the topics on a two-dimensional map.

Can you reidentify the newsgroups? `['sci.med', 'alt.atheism', 'rec.autos', 'sci.space','rec.sport.baseball']`

Hover over the topic circles and terms to explore the connection between words and topics.
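
As a numerical complement to the visual check, one can count how often each topic is the dominant topic of the documents from each newsgroup. This is only a sketch on a made-up toy corpus with made-up labels; `topic_label_table` is a hypothetical helper name.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_label_table(doc_topic, labels, n_labels):
    """Count how often each topic is the dominant one for documents of each label."""
    dominant = doc_topic.argmax(axis=1)  # most probable topic per document
    table = np.zeros((doc_topic.shape[1], n_labels), dtype=int)
    np.add.at(table, (dominant, labels), 1)  # rows: topics, columns: labels
    return table

# Hypothetical toy corpus with two made-up classes (0 = space, 1 = medicine)
toy_docs = [
    "rocket launch orbit rocket",
    "orbit satellite launch",
    "doctor patient medicine",
    "patient medicine doctor doctor",
]
toy_labels = np.array([0, 0, 1, 1])

toy_dtm = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_dtm)

# transform() returns one topic distribution per document (rows sum to 1)
doc_topic = toy_lda.transform(toy_dtm)
table = topic_label_table(doc_topic, toy_labels, n_labels=2)
print(table)
```

With the notebook's objects this would be `topic_label_table(lda_tf.transform(dtm_tf), newsgroups.target, len(newsgroups.target_names))`, where the columns follow the order of `newsgroups.target_names`.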

In [12]:
pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer)


## Topic Modeling with TF-IDF values

In [None]:
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=10, random_state=0, verbose=1,max_iter=10)
lda_tfidf.fit(dtm_tfidf)

In [None]:
pyLDAvis.lda_model.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

### Using different MDS functions

With `sklearn` installed, other MDS functions, such as MMDS and t-SNE, can be used for plotting if the default PCoA is not satisfactory.

In [None]:
pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='mmds')

In [None]:
pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='tsne')
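
To give an intuition for what a PCoA layout does, the sketch below computes pairwise Jensen-Shannon distances between a few made-up topic-term distributions and embeds them in two dimensions with classical MDS. It is an illustration written with plain NumPy, not pyLDAvis's own code; in particular, taking the square root of the divergence as the distance is a choice made here, and the function names are hypothetical.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pcoa(dist, n_dims=2):
    """Classical MDS (PCoA): find coordinates whose Euclidean distances approximate dist."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(gram)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_dims]         # keep the largest ones
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))

# Hypothetical topic-term distributions (each row sums to 1):
# topics 0 and 1 are similar, topic 2 is different.
topic_term = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
])

dist = np.array([[np.sqrt(max(js_divergence(p, q), 0.0)) for q in topic_term]
                 for p in topic_term])
coords = pcoa(dist)
print(coords)  # one 2D point per topic
```

For a fitted scikit-learn model, the topic-term distributions can be obtained by normalizing the rows of `components_`, for example `lda_tf.components_ / lda_tf.components_.sum(axis=1, keepdims=True)`.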