<a href="https://colab.research.google.com/github/simon-clematide/colab-notebooks-for-teaching/blob/main/notebooks/topic_modeling_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup




In [None]:
%pip install gensim==4.3.3 numpy==1.26.4 pyldavis

# Colab ships newer versions of these packages, so the runtime must be restarted after installing the pinned ones
from IPython.display import HTML, display
display(HTML("""If new packages were installed, please restart the runtime from the Runtime menu:<br><br>
         <code>Runtime → Restart runtime</code><br><br>
    The restart is necessary for the newly installed packages to take effect.
    """))

In [None]:
# Try to suppress noisy warnings (this does not catch all of them yet)
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning, module="ipykernel.ipkernel")


# `pyLDAvis`

pyLDAvis also supports LDA models fitted with scikit-learn. This notebook walks through that workflow in detail, using the 20 newsgroups dataset as provided by scikit-learn.

In [None]:
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

## Load 20 newsgroups dataset

First, the 20 newsgroups dataset available in sklearn is loaded. As always, the headers, footers and quotes are removed. Only five of the 20 categories listed below are used in this notebook.

Newsgroup categories:
`['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']`

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
cats = ['sci.med', 'alt.atheism', 'rec.autos', 'sci.space','rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'),categories=cats)
docs_raw = newsgroups.data
print(len(docs_raw))

In [None]:
print(docs_raw[72])

## Convert to document-term matrix

Next, the raw documents are converted into a document-term matrix, either as raw counts or in TF-IDF form.

In [None]:
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5,  # exclude tokens that occur in more than 50% of the documents
                                min_df = 10    # exclude tokens that occur in fewer than 10 documents
                                )
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
print(dtm_tf.shape)
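
Both `max_df` and `min_df` refer to document frequency (the number of documents a token occurs in), not to total token counts. The following is a minimal pure-Python sketch of that filtering step, not scikit-learn's actual implementation. The toy corpus and the helper names `document_frequencies` and `filter_vocabulary` are made up for illustration, and stop-word removal is left out.

```python
import re
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
toy_docs = [
    "the rocket reached orbit",
    "the orbit of the moon",
    "the doctor saw the patient",
    "the patient left",
]

# Same token pattern as in the vectorizer above: words of at least 3 letters
token_re = re.compile(r"\b[a-zA-Z]{3,}\b")

def document_frequencies(docs):
    """Count in how many documents each token occurs (not how often overall)."""
    df = Counter()
    for doc in docs:
        # set() makes sure a token is counted at most once per document
        df.update(set(token_re.findall(doc.lower())))
    return df

def filter_vocabulary(docs, min_df, max_df):
    """Keep tokens found in at least min_df documents and in at most a share max_df of them."""
    df = document_frequencies(docs)
    n_docs = len(docs)
    return sorted(t for t, c in df.items() if c >= min_df and c / n_docs <= max_df)

print(filter_vocabulary(toy_docs, min_df=2, max_df=0.5))
```

Here "the" is dropped because it occurs in every document, while tokens such as "rocket" are dropped because they occur in only one.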

In [None]:
# What does a given document look like in this representation?
# Get the mapping of column indices to vocabulary items
index2vocabulary_item = tf_vectorizer.get_feature_names_out()

# Get a dense representation of this document's row in the document-term matrix
doc_index = 72  # Index of the document to show
doc_matrix = dtm_tf.getrow(doc_index).toarray()

# Print the words and their counts in the document
for i, count in enumerate(doc_matrix[0]):
    if count > 0:
        word = index2vocabulary_item[i]
        print(f"{word}: {count}")

Alternatively, we can build a TF-IDF document-term matrix.

In [None]:
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
print(dtm_tfidf.shape)

In [None]:
# What does a given document look like in this representation?
# Get the mapping of column indices to vocabulary items
index2vocabulary_item = tfidf_vectorizer.get_feature_names_out()

# Get a dense representation of this document's row in the document-term matrix
doc_index = 72  # Index of the document to show
doc_matrix = dtm_tfidf.getrow(doc_index).toarray()

# Print the words and their TF-IDF weights in the document
for i, weight in enumerate(doc_matrix[0]):
    if weight > 0:
        word = index2vocabulary_item[i]
        print(f"{word}: {weight}")
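
To make the printed weights less of a black box, here is a minimal sketch of how a single TF-IDF row could be computed by hand. It assumes scikit-learn's default settings (raw counts as term frequency, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document). The helper `tfidf_row` and all numbers are hypothetical.

```python
import math

def tfidf_row(term_counts, doc_freq, n_docs):
    """Weight one document's raw counts by smoothed idf, then L2-normalize."""
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights  # empty document: nothing to normalize
    return {term: w / norm for term, w in weights.items()}

# Hypothetical numbers: one document in a corpus of 4 documents
counts = {"orbit": 2, "moon": 1}    # raw counts within the document
doc_freq = {"orbit": 2, "moon": 1}  # number of documents containing each term
row = tfidf_row(counts, doc_freq, n_docs=4)
for term, weight in sorted(row.items()):
    print(f"{term}: {weight:.3f}")
```

Because of the normalization, the squared weights of each document sum to 1, which is why the values above are no longer integers.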

## Fit Latent Dirichlet Allocation models

Finally, the LDA models are fitted. `n_components` is the number of topics.

In [None]:
# Fit LDA on the term-frequency (raw count) document-term matrix
lda_tf = LatentDirichletAllocation(n_components=10, random_state=0, verbose=1, max_iter=10)
lda_tf.fit(dtm_tf)
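
Before turning to the visualization, it can help to look at the highest-weighted terms of each topic as plain text. The sketch below uses a small made-up weight matrix in place of `lda_tf.components_`. The helper `top_terms` and the toy values are assumptions for illustration only.

```python
def top_terms(topic_term_weights, vocabulary, n_terms=3):
    """Return the n_terms highest-weighted vocabulary items for each topic."""
    result = []
    for row in topic_term_weights:
        # Column indices sorted by descending weight
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        result.append([vocabulary[j] for j in ranked[:n_terms]])
    return result

# Hypothetical 2-topic, 5-term weight matrix standing in for lda_tf.components_
toy_vocab = ["ball", "doctor", "game", "orbit", "patient"]
toy_components = [
    [8.1, 0.2, 6.5, 0.3, 0.1],
    [0.2, 7.4, 0.1, 0.4, 9.0],
]
for topic_id, terms in enumerate(top_terms(toy_components, toy_vocab)):
    print(topic_id, terms)
```

On the fitted model, a call such as `top_terms(lda_tf.components_, tf_vectorizer.get_feature_names_out(), n_terms=10)` should work the same way, since the rows of `components_` can be indexed like lists.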

## Visualizing the models with pyLDAvis
Multidimensional scaling (MDS) is a dimension-reduction technique: pyLDAvis uses it to place the topics on a two-dimensional map.

Can you recognize the newsgroups among the topics? `['sci.med', 'alt.atheism', 'rec.autos', 'sci.space','rec.sport.baseball']`

Hover over topic circles and terms to explore the connection between words and topics...

In [None]:
pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer)

## Topic Modeling with TF-IDF values

In [None]:
# Fit LDA on the TF-IDF document-term matrix
lda_tfidf = LatentDirichletAllocation(n_components=10, random_state=0, verbose=1, max_iter=10)
lda_tfidf.fit(dtm_tfidf)

In [None]:
pyLDAvis.lda_model.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

### Using different MDS functions

With `sklearn` installed, other MDS functions such as MMDS and t-SNE can be used for plotting if the default PCoA is not satisfactory.

In [None]:
pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='mmds')

In [None]:
pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='tsne')
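
All three MDS variants start from a matrix of pairwise distances between topics. As far as I know, pyLDAvis measures these distances with the Jensen-Shannon divergence between the topic-term distributions. The sketch below shows that distance on made-up distributions. It illustrates the idea only and is not pyLDAvis code.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric divergence: average KL of p and q to their midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-term distributions over a 3-word vocabulary (each row sums to 1)
toy_topics = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]

# Pairwise distance matrix: this is the kind of input an MDS method projects to 2D
distances = [[jensen_shannon(p, q) for q in toy_topics] for p in toy_topics]
for row in distances:
    print([round(d, 3) for d in row])
```

Topics 0 and 1 put their mass on the same words and therefore end up close together, while topic 2 lies far from both. The MDS method then only decides how these distances are squeezed into two dimensions.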