In this example we will see how to perform topic extraction using MiniSom. The goal is to extract the main topics (represented as a set of words) that occur in a collection of documents.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from minisom import MiniSom
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

The collection of documents that we will work with is the well-known `20newsgroups` dataset. It contains more than 10000 newsgroup posts. We will download the dataset using sklearn and transform the textual documents into a matrix `D`, where each row represents a post, using the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer">TF-IDF representation</a>:

In [2]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
documents = dataset.data

no_features = 1000

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=no_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
# MiniSom works on NumPy arrays, so convert the sparse TF-IDF matrix to a dense one
D = tfidf.toarray()
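
To make the TF-IDF representation more concrete, here is a minimal sketch that computes it by hand on a three-document toy corpus. The corpus and the names `docs`, `vocab`, `tfidf_row` and `toy_D` are made up for illustration. The smoothed IDF formula and the L2 row normalization are meant to mirror the scikit-learn defaults, but this is a simplification (whitespace tokenization, no stop words), not the library's implementation:

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from 20newsgroups
docs = [
    "the disk drive uses a scsi cable",
    "the scsi controller and the ide drive",
    "people discuss religion and truth",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency: rarer words get a larger weight
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(tokens)
    row = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

# One row per document, one column per vocabulary word (like the matrix D)
toy_D = [tfidf_row(doc) for doc in tokenized]

for w in ("religion", "and"):
    print(w, round(toy_D[2][vocab.index(w)], 3))
```

In the third document, "religion" occurs in no other document and therefore receives a larger weight than "and", which is shared with the second document.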

Now we train a SOM that clusters the documents. The total number of neurons in the SOM is also the number of topics to extract (here 2 x 4 = 8):

In [3]:
n_neurons = 2
m_neurons = 4
som = MiniSom(n_neurons, m_neurons, no_features)
som.pca_weights_init(D)
som.train(D, 40000, random_order=False, verbose=False)
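
As a rough illustration of why each neuron can be read as a topic, the sketch below shows how a SOM assigns a sample to its closest neuron (the best matching unit). It uses plain NumPy with random toy weights instead of MiniSom, and the names `toy_weights` and `best_matching_unit` are hypothetical; Euclidean distance is assumed here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight grid with the same layout as the SOM above:
# (rows, columns, features), i.e. one weight vector per neuron
n_rows, n_cols, n_feat = 2, 4, 5
toy_weights = rng.random((n_rows, n_cols, n_feat))

# Every neuron is one cluster, hence one topic
n_topics = n_rows * n_cols

def best_matching_unit(weights, x):
    """Return the grid coordinates of the neuron closest to sample x."""
    dist = np.linalg.norm(weights - x, axis=-1)  # shape (n_rows, n_cols)
    return np.unravel_index(np.argmin(dist), dist.shape)

# A sample identical to the weights of neuron (1, 2) is mapped to that neuron
x = toy_weights[1, 2].copy()
bmu = best_matching_unit(toy_weights, x)
print(n_topics, bmu)
```

During training, the weights of the winning neuron and its grid neighbours are pulled towards the sample, so after many iterations each weight vector resembles the documents mapped to it.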

We will take as the topic of each neuron the `top_keywords` words with the largest weights in its weight vector. The following loop goes through the weights of every neuron and maps the indices of the largest weights back to words using the feature names saved by the TfidfVectorizer:

In [4]:
top_keywords = 10

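# get_weights() returns an array of shape (n_neurons, m_neurons, no_features)
weights = np.asarray(som.get_weights())
cnt = 1
for i in range(n_neurons):
    for j in range(m_neurons):
        # argsort is ascending, so the strongest keyword is printed last
        keywords_idx = np.argsort(weights[i,j,:])[-top_keywords:]
        keywords = ' '.join([tfidf_feature_names[k] for k in keywords_idx])
        print('Topic', cnt, ':', keywords)
        cnt += 1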

Topic 1 : jewish like weapons ask armenians land turkish israel space armenia
Topic 2 : buy deleted check info help stuff mail thanks appreciated advance
Topic 3 : using board files sound include point com os x11 pc
Topic 4 : computer wire controller ide know 17 19 drive cable scsi
Topic 5 : know support video sin law define words does cards jesus
Topic 6 : low truth think knowledge want shall right religion people don
Topic 7 : expect man final home encryption administration games like money clinton
Topic 8 : study writing didn learn ed people good ideas interested alt
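
Because `np.argsort` sorts in ascending order, the keywords of each topic above are listed from weakest to strongest. The following sketch shows, on a hypothetical five-word vocabulary (`toy_names`) and a made-up weight vector (`toy_neuron`), how the same selection could be reversed to list the strongest keyword first:

```python
import numpy as np

# Hypothetical feature names and weights of a single neuron
toy_names = np.array(["cable", "drive", "jesus", "scsi", "space"])
toy_neuron = np.array([0.30, 0.50, 0.01, 0.70, 0.05])
top_k = 3

# Same selection as in the loop above: indices of the top_k largest weights,
# ordered from the smallest to the largest of them
ascending_idx = np.argsort(toy_neuron)[-top_k:]

# Reversing the slice puts the strongest keyword first
descending_idx = ascending_idx[::-1]

ascending_words = [str(w) for w in toy_names[ascending_idx]]
descending_words = [str(w) for w in toy_names[descending_idx]]
print(' '.join(ascending_words))
print(' '.join(descending_words))
```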
