<img align="left" width="90" height="90" src="https://github.com/x-tabdeveloping/topicwizard/raw/main/assets/logo.svg">

# topicwizard

Pretty and opinionated topic model visualization in Python.

### Installation
Let us first install topicwizard from PyPI.

In [None]:
%pip install topic-wizard

### Loading a corpus

In this example we are going to investigate the topical content of an openly available dataset, 20newsgroups.
Let's fetch the dataset using scikit-learn's built-in loader.

In [4]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np

newsgroups = fetch_20newsgroups(subset="all", remove=('headers', 'footers', 'quotes'))
corpus = newsgroups.data
# scikit-learn returns the labels as integers, so we map them back to
# the actual textual labels.
group_labels = np.array(newsgroups.target_names)[newsgroups.target]
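The label mapping above relies on NumPy integer-array indexing. Here is a minimal sketch of the same idea on made-up data; `target_names` and `target` are toy stand-ins for the corresponding `newsgroups` attributes, not the real values.

```python
import numpy as np

# Toy stand-ins (assumed values) for newsgroups.target_names and newsgroups.target
target_names = ["sci.space", "rec.autos", "talk.politics.misc"]
target = np.array([2, 0, 0, 1])

# Indexing an array of names with an integer array picks one name per document
labels = np.array(target_names)[target]
print(labels.tolist())
```

The result has one textual label per document, in the same order as the corpus.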

## Fitting a Model
<img align="right" width="300" src="https://x-tabdeveloping.github.io/topicwizard/_images/pipeline.png">
Let us now fit a topic model to our data, which we can investigate later.
Classical topic models consist of the following components:
 * A vectorizer, which turns texts into bag-of-words numerical representations
 * Optional term weighting, like tf-idf. We will omit this in the current example.
 * A topic model, which can either be a generative probabilistic model (like LDA) or a matrix decomposition model (like LSA)
 
We are going to represent this structure as a scikit-learn pipeline.
This allows us to interact with this structure as an atomic unit.
 
In this example we are going to use Nonnegative Matrix Factorization (NMF) for discovering topics.
Note that topicwizard is also compatible with topic models from Gensim and BERTopic.

In [5]:
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Setting up topic modelling pipeline
vectorizer = CountVectorizer(max_df=0.8, min_df=10, stop_words="english")
# NMF topic model with 20 topics
nmf = NMF(n_components=20)
# Build a pipeline from the two components
pipeline = make_pipeline(vectorizer, nmf)

# Fit the pipeline to the data
pipeline.fit(corpus)

## Model Interpretation
We can then use topicwizard for interpreting the parameters of the fitted topic model.

### Web application

By far the easiest way to interpret our results is to launch the topicwizard web app.

Note that this may take a long time, as producing lower-dimensional projections of the data is computationally expensive.

In [None]:
import topicwizard

topicwizard.visualize(corpus, pipeline=pipeline)

You can also try disabling the documents page, which usually takes a long time to prepare.

In [None]:
topicwizard.visualize(corpus, pipeline=pipeline, exclude_pages=["documents"])

If you are interested in the relation of topics to predefined labels you can also pass those labels to topicwizard.

In [None]:
topicwizard.visualize(corpus, pipeline=pipeline, exclude_pages=["documents"], group_labels=group_labels)

### Individual Plots
If you are a bit more topic-model-savvy and want to produce individual, customizable interactive plots, you can also use the figures API.

These plots will be faster to produce and can also be saved as HTML if need be.

I am only going to demonstrate a couple of examples here. There are many more figures you can explore, so please consult the documentation.

Let's have a look at what kinds of words are important for the discovered topics:

In [None]:
from topicwizard.figures import topic_barcharts

topic_barcharts(corpus, pipeline=pipeline, top_n=5)

It would also be useful to see how the different words relate to each other.

In [6]:
from topicwizard.figures import word_map

word_map(corpus, pipeline=pipeline)

Since 20newsgroups contains precomputed labels we can also have a look at the labels' relations to topics.
I would like to see which topics are most relevant for each group, so let's plot that:

In [9]:
from topicwizard.figures import group_topic_barcharts

group_topic_barcharts(corpus, group_labels, pipeline=pipeline, top_n=5)