In [0]:
import gensim

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

Download the data using `fetch_20newsgroups()` function.

In [0]:
data = fetch_20newsgroups(remove=("headers", "footer", "quotes")).data

# Part 1. LDA

## Task 1

1. Create an object of class `CountVectorizer`
2. Apply this class to the dataset. Store only top-1000 most common words

In [0]:
# I additionally remove stopwords, because otherwise the results would be noninterpreted
cv = CountVectorizer(max_features=1000, stop_words="english")
data = cv.fit_transform(data)

## Task 2

Get the top-1000 most common words

In [4]:
feature_names = cv.get_feature_names()
assert len(feature_names) == 1000

for feature in feature_names[-10:]:
    print(feature)

writing
written
wrong
x11
xt
year
years
yes
york
young


Train `LatentDirichletAllocation` with the following parameters:
* `n_topics=20`
* `max_iter=50`
* `learning_method="batch"`

In [5]:
lda = LatentDirichletAllocation(n_components=20, max_iter=50, learning_method="batch")
lda.fit(data)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=50,
                          mean_change_tol=0.001, n_components=20, n_jobs=None,
                          perp_tol=0.1, random_state=None,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=0)

Use the following function to print key words from each topic

In [6]:
def print_top_words(model, feature_names, n_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

print_top_words(lda, feature_names)

Topic 0
said car just didn time know did day went like
Topic 1
edu available ftp image software version pub graphics faq mit
Topic 2
00 25 10 15 20 16 14 17 12 30
Topic 3
space nasa gov earth data launch high water engine cost
Topic 4
mr president people government think going stephanopoulos know american clinton
Topic 5
armenian turkish armenians people war jews israeli armenia world turks
Topic 6
university list research information internet cx 1993 mail group center
Topic 7
game team year games season play hockey league players win
Topic 8
thanks mail edu know new does price like email looking
Topic 9
ax max g9v b8f a86 145 pl 0d 1d9 34u
Topic 10
key encryption chip __ ___ keys clipper use security public
Topic 11
file output entry program section files jpeg line rules 02
Topic 12
drive card scsi disk mac bit hard video speed pc
Topic 13
windows window use using problem program screen display dos application
Topic 14
don just like think know people good ve make really
Topic 15
god j

# Part 2. Gensim

Convert the corpus into Gensim-compatible format

In [0]:
corpus = gensim.matutils.Sparse2Corpus(data, documents_columns=False)

## Task 3

Create `id2word` dictionary

In [0]:
id2word = {}
for feature in feature_names:
    id2word[len(id2word)] = feature

## Task 4

Train `LDA` from `gensim`

In [0]:
# Specify `num_topics` to provide fair comparison between models
lda = gensim.models.LdaModel(corpus=corpus, id2word=id2word, num_topics=20)

Print most common words for each topic

In [10]:
lda.print_topics(lda.num_topics)

[(0,
  '0.042*"00" + 0.023*"hockey" + 0.019*"50" + 0.018*"25" + 0.017*"games" + 0.016*"sale" + 0.016*"team" + 0.016*"edu" + 0.015*"new" + 0.013*"10"'),
 (1,
  '0.029*"people" + 0.023*"don" + 0.022*"think" + 0.015*"just" + 0.010*"right" + 0.010*"israel" + 0.010*"know" + 0.009*"like" + 0.009*"good" + 0.008*"say"'),
 (2,
  '0.017*"power" + 0.015*"space" + 0.011*"team" + 0.010*"nasa" + 0.010*"division" + 0.009*"station" + 0.009*"ground" + 0.009*"year" + 0.009*"think" + 0.009*"toronto"'),
 (3,
  '0.115*"cx" + 0.057*"_o" + 0.031*"lk" + 0.030*"17" + 0.027*"w7" + 0.026*"chz" + 0.025*"86" + 0.023*"ah" + 0.022*"d9" + 0.021*"air"'),
 (4,
  '0.016*"time" + 0.016*"know" + 0.014*"book" + 0.013*"does" + 0.012*"people" + 0.010*"books" + 0.010*"church" + 0.009*"just" + 0.009*"like" + 0.009*"world"'),
 (5,
  '0.058*"edu" + 0.035*"com" + 0.019*"mail" + 0.015*"internet" + 0.012*"uk" + 0.011*"ac" + 0.011*"file" + 0.010*"cs" + 0.010*"send" + 0.010*"pgp"'),
 (6,
  '0.030*"25" + 0.029*"10" + 0.021*"16" + 0.01

## Task 5

Train `LSI` from `gensim`

In [0]:
# Specify `num_topics` to provide fair comparison between models
lsi = gensim.models.LsiModel(corpus=corpus, id2word=id2word, num_topics=20)

Print most common words for each topic

In [12]:
lsi.print_topics(lsi.num_topics)

[(0,
  '0.997*"ax" + 0.072*"max" + 0.016*"g9v" + 0.012*"b8f" + 0.010*"a86" + 0.009*"pl" + 0.007*"1d9" + 0.006*"1t" + 0.006*"145" + 0.006*"bhj"'),
 (1,
  '0.387*"145" + 0.378*"a86" + 0.374*"b8f" + 0.359*"g9v" + 0.324*"0d" + 0.226*"1d9" + 0.211*"0t" + 0.187*"2di" + 0.172*"_o" + 0.171*"34u"'),
 (2,
  '-0.309*"file" + -0.248*"edu" + -0.171*"use" + -0.144*"available" + -0.129*"com" + -0.122*"program" + -0.121*"information" + -0.117*"people" + -0.117*"pub" + -0.114*"ftp"'),
 (3,
  '0.972*"db" + 0.149*"cs" + 0.089*"al" + 0.082*"cx" + 0.064*"bits" + -0.041*"file" + -0.021*"edu" + 0.021*"gas" + 0.020*"higher" + 0.020*"ah"'),
 (4,
  '-0.675*"g9v" + 0.507*"0d" + 0.331*"_o" + -0.182*"b8f" + 0.129*"145" + 0.128*"6um" + -0.113*"1d9" + 0.110*"6ei" + 0.096*"3t" + 0.088*"75u"'),
 (5,
  '-0.244*"10" + -0.230*"14" + -0.225*"16" + -0.189*"12" + -0.186*"25" + -0.180*"15" + -0.178*"20" + -0.175*"13" + -0.174*"11" + -0.170*"00"'),
 (6,
  '0.298*"file" + -0.233*"stephanopoulos" + -0.233*"mr" + -0.210*"know" +

Which model highlighted the topics most successfully?

Мне кажется, лучше всего получилось выделить темы у `LDA` из `sklearn`. Если посмотреть на самые популярные слова в темах, то можно будет даже как-то их интерпретировать: спорт, политический конфликт США и Израиля, космос, цифры.

В `LDA` из `gensim` в целом тоже можно найти какие-то темы, но все как-то спутано и добавляются лишние слова.

В `LSI` все вообще плохо -- просто набор каких-то несвязанных слов.