In this example we will see how to perform topic extraction using MiniSom. The goal is to extract the main topics (represented as a set of words) that occur in a collection of documents.

In [1]:
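import sys
# Makes a local checkout of MiniSom importable; adjust or remove this path
# if MiniSom is already installed in your environment.
sys.path.insert(0, '/home/zephyr/Paper2Agent-main/Minisom_Agent/repo/minisom/')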

import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.dpi"] = 300
plt.rcParams["savefig.dpi"] = 300

from minisom import MiniSom
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

The collection of documents that we will work with is the well-known `20newsgroups` dataset. It contains more than 10000 newsgroup posts. We will download the dataset using sklearn and transform the textual documents into a matrix `D`, where each row represents a post, using the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer">TF-IDF representation</a>:

In [2]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
documents = dataset.data

no_features = 1000

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=no_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
D = tfidf.todense().tolist()
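
To make the TF-IDF step less opaque, here is a minimal sketch that rebuilds the weighting by hand on a toy corpus. It assumes the smoothed IDF formula and row-wise L2 normalisation that `TfidfVectorizer` applies with its default settings. The corpus, the whitespace tokenisation and the names `docs`, `vocab`, `col` and `tfidf_toy` are illustrative assumptions and are not part of the original notebook; stop-word removal, `max_df`, `min_df` and `max_features` are left out.

```python
import numpy as np

# Toy corpus (hypothetical, not taken from 20newsgroups).
docs = ["the gun law", "the hockey team", "the hockey game"]

# Whitespace tokenisation is a simplification of TfidfVectorizer's tokenizer.
vocab = sorted({word for doc in docs for word in doc.split()})
col = {word: k for k, word in enumerate(vocab)}

# Term counts: one row per document, one column per vocabulary word.
counts = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(docs):
    for word in doc.split():
        counts[row, col[word]] += 1

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts and scale every row to unit Euclidean length.
tfidf_toy = counts * idf
tfidf_toy /= np.linalg.norm(tfidf_toy, axis=1, keepdims=True)

print(vocab)
print(tfidf_toy.round(2))
```

Words that occur in every document (here `the`) receive the smallest weight, which is why the rows of `D` mostly highlight the words that distinguish one post from another.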

Now we have to train a SOM that clusters the documents. The total number of neurons in the SOM will also be the number of topics to extract:

In [3]:
n_neurons = 2
m_neurons = 4
som = MiniSom(n_neurons, m_neurons, no_features)
som.random_weights_init(D)
som.train(D, 5000, random_order=False, verbose=True)

 quantization error: 0.9884376329597593
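
The quantization error printed above is the average distance between each sample and the weight vector of its closest neuron (its best matching unit). The following sketch reproduces that computation with plain NumPy on random stand-in data, assuming Euclidean distance. The arrays `data` and `weights` and all other names are hypothetical and only mimic the shapes used in this example (a 2x4 grid of neurons).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 6 samples with 5 features and a 2x4 grid of neurons.
data = rng.random((6, 5))
weights = rng.random((2, 4, 5))

# Flatten the grid so each row is the weight vector of one neuron.
codebook = weights.reshape(-1, weights.shape[-1])

# Euclidean distance between every sample and every neuron.
distances = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)

# Index of the closest neuron (best matching unit) for each sample.
bmu = distances.argmin(axis=1)

# Quantization error: mean distance between a sample and its closest neuron.
q_error = distances.min(axis=1).mean()

# Grid coordinates of each winner, i.e. the neuron a sample is assigned to.
bmu_coords = np.column_stack(np.unravel_index(bmu, weights.shape[:2]))

print(q_error)
print(bmu_coords)
```

In the topic extraction setting, the row of `bmu_coords` for a document would identify the neuron, and therefore the topic, that the document is closest to.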


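We will treat as a topic the `top_keywords` words associated with the largest weights of each neuron. The following loop inspects the weight vector of every neuron and recovers the corresponding words using the feature names saved by the TfidfVectorizer. Since `np.argsort` sorts in ascending order, the keywords of each topic are printed from the smallest to the largest weight, so the strongest keyword appears last: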

In [4]:
top_keywords = 10

weights = som.get_weights()
cnt = 1
for i in range(n_neurons):
    for j in range(m_neurons):
        keywords_idx = np.argsort(weights[i,j,:])[-top_keywords:]
        keywords = ' '.join([tfidf_feature_names[k] for k in keywords_idx])
        print('Topic', cnt, ':', keywords)
        cnt += 1

Topic 1 : pc times point remember year way time better max old
Topic 2 : said idea month goal guns group police gun government hell
Topic 3 : mail israel position interested exists question gov course included new
Topic 4 : know chicago file widget window files manager running dos windows
Topic 5 : players christians thought need really christ hockey action don god
Topic 6 : used cause turkish gas just israeli basic apparently results people
Topic 7 : couple like modem day mention got drive steve questions ve
Topic 8 : says heard knows sure wanted talking team kill knew men
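
The keyword selection relies on `np.argsort`, which returns indices in ascending order of value. The small sketch below, built on a made-up vocabulary and weight vector (both hypothetical), shows the order produced by the slice used in the loop above and how reversing it would list the strongest keyword first.

```python
import numpy as np

# Hypothetical vocabulary and neuron weight vector, for illustration only.
feature_names = np.array(["car", "god", "gun", "hockey", "windows"])
neuron_weights = np.array([0.05, 0.40, 0.10, 0.30, 0.15])

top_k = 3

# argsort is ascending, so the last top_k indices hold the largest weights.
ascending_idx = np.argsort(neuron_weights)[-top_k:]

# Reverse the slice to list the strongest keyword first.
descending_idx = ascending_idx[::-1]

print(" ".join(feature_names[ascending_idx]))   # order used in the loop above
print(" ".join(feature_names[descending_idx]))  # strongest keyword first
```

Which order to print is a matter of presentation; the set of keywords assigned to each topic is the same in both cases.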
