<a href="https://colab.research.google.com/github/tashkinovnet/Home-Task/blob/Hometask_7.ipynb/Hometask_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Hometask 7**

###Task: run an LDA model with Gibbs sampling using 20 topics. Print the top 10 words for each topic. Match the resulting topics to the labels in the dataset. Make sure that at least a few topics are clearly interpretable, as in the examples below.

Examples of the top 10 words for some of the topics obtained after applying LDA:
* ['god', 'jesus', 'believe', 'life', 'bible', 'christian', 'world', 'church', 'word', 'people'] - this group clearly corresponds to soc.religion.christian
* ['drive', 'card', 'hard', 'bit', 'disk', 'scsi', 'memory', 'speed', 'mac', 'video'] - this group can be matched to 'comp.sys.ibm.pc.hardware' and 'comp.sys.mac.hardware'
* ['game', 'games', 'hockey', 'league', 'play', 'players', 'season', 'team', 'teams', 'win'] - the rec.sport.hockey topic

Tips:
* the model converges better and faster if you shrink the vocabulary by filtering out very common words and rare words. Vocabulary size can be controlled with the CountVectorizer parameters min_df (drops words whose document frequency is below the threshold) and max_df (drops words whose document frequency is above the threshold).
* as a starting point, the parameters $\alpha$ and $\beta$ can both be set to 1
* after 100 iterations you can expect a good distribution over topics. If that does not happen and the topics are a jumble, check the code and tune the vocabulary
* the third example shows that different forms of the same word occur (game/games, team/teams). Stemming and lemmatization can reduce words to a common form and merge words that are close in meaning
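
The first tip can be illustrated with a minimal pure-Python sketch of document-frequency filtering. It is not how CountVectorizer is implemented; it only mimics the thresholds as the notebook passes them, assuming min_df is an absolute document count and max_df is a fraction of all documents. The function name `prune_vocabulary` and the toy corpus are hypothetical.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep words whose document frequency lies in [min_df, max_df * n_docs].

    Assumption: min_df is an absolute count and max_df is a fraction,
    matching the values used in the notebook (min_df=2, max_df=0.95).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain the word at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc.lower().split()))
    max_count = max_df * n_docs
    return sorted(w for w, df in doc_freq.items() if min_df <= df <= max_count)


# Hypothetical toy corpus: "the" occurs everywhere, most other words occur once.
toy_docs = [
    "the team won the game",
    "the players lost the game",
    "the disk drive failed",
    "the scsi drive is fast",
]

print(prune_vocabulary(toy_docs, min_df=2, max_df=0.95))  # prints ['drive', 'game']
```

Here "the" is dropped because it appears in all four documents, and the words that appear in only one document are dropped as too rare, which leaves the two words that actually separate the themes.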

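For reference, the sampling step that $\alpha$ and $\beta$ enter can be written as the usual collapsed Gibbs update for LDA with symmetric priors. The notation is introduced here and is not part of the assignment: $n_{w,k}$ is the number of times word $w$ is assigned to topic $k$, $n_{d,k}$ is the number of tokens in document $d$ assigned to topic $k$, $n_k$ is the total number of tokens assigned to topic $k$, $V$ is the vocabulary size, and the superscript $-i$ means the current token is excluded from the counts.

$$
P(z_i = k \mid z_{-i}, w) \;\propto\; \frac{\left(n_{w_i,k}^{-i} + \beta\right)\left(n_{d,k}^{-i} + \alpha\right)}{n_{k}^{-i} + \beta V}
$$

This is the expression computed as `topic_probs` in the code cell, with `word_topic_counts`, `doc_topic_counts`, `topic_totals` and `n_words` playing the roles of $n_{w,k}$, $n_{d,k}$, $n_k$ and $V$.
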
In [2]:
import numpy as np
import random
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('stopwords')
nltk.download('wordnet')

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
documents = newsgroups_train.data
labels = newsgroups_train.target
label_names = newsgroups_train.target_names

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

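def preprocess_text(doc):
    """Lowercase, drop stop words and non-alphabetic tokens, then lemmatize."""
    # Note: isalpha() also discards tokens with attached punctuation, e.g. "game,".
    words = [
        lemmatizer.lemmatize(word.lower())
        for word in doc.split()
        if word.isalpha() and word.lower() not in stop_words
    ]
    return ' '.join(words)

preprocessed_documents = [preprocess_text(doc) for doc in documents]
# Keep at most 1000 words; drop words found in more than 95% of documents or in fewer than 2.
vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000)
X = vectorizer.fit_transform(preprocessed_documents)
vocab = vectorizer.get_feature_names_out()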

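n_topics = 20        # number of topics required by the task
alpha = 1.0          # symmetric prior on document-topic distributions
beta = 1.0           # symmetric prior on topic-word distributions
n_iterations = 100   # full Gibbs sweeps over the corpus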

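n_docs, n_words = X.shape
word_topic_counts = np.zeros((n_words, n_topics))  # times word w is assigned to topic k
doc_topic_counts = np.zeros((n_docs, n_topics))    # tokens of document d assigned to topic k
topic_totals = np.zeros(n_topics)                  # total tokens assigned to topic k
doc_topic_assignments = []

# Expand every document into a token list: each word index is repeated once per
# occurrence, so every token (not just every distinct word) gets its own topic.
doc_tokens = []
for d in range(n_docs):
    row = X[d]
    doc_tokens.append(np.repeat(row.indices, row.data))

# Random initialization of the topic assignments
for d in range(n_docs):
    topics = []
    for w in doc_tokens[d]:
        topic = random.randint(0, n_topics - 1)
        word_topic_counts[w, topic] += 1
        doc_topic_counts[d, topic] += 1
        topic_totals[topic] += 1
        topics.append(topic)
    doc_topic_assignments.append(topics)

# Gibbs Sampling
for it in range(n_iterations):
    for d in range(n_docs):
        for i, w in enumerate(doc_tokens[d]):
            # Remove the current token from all counts
            current_topic = doc_topic_assignments[d][i]
            word_topic_counts[w, current_topic] -= 1
            doc_topic_counts[d, current_topic] -= 1
            topic_totals[current_topic] -= 1

            # Conditional distribution over topics for this token
            topic_probs = (word_topic_counts[w] + beta) * (doc_topic_counts[d] + alpha) / (topic_totals + beta * n_words)
            topic_probs /= topic_probs.sum()

            # Sample a new topic and put the token back into the counts
            new_topic = np.random.choice(np.arange(n_topics), p=topic_probs)
            word_topic_counts[w, new_topic] += 1
            doc_topic_counts[d, new_topic] += 1
            topic_totals[new_topic] += 1
            doc_topic_assignments[d][i] = new_topic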

# Get the top words for each topic
def get_top_words(word_topic_counts, vocab, n_top_words=10):
    topics = []
    # One column of word_topic_counts per topic
    for topic_idx in range(word_topic_counts.shape[1]):
        top_words_idx = word_topic_counts[:, topic_idx].argsort()[::-1][:n_top_words]
        topics.append([vocab[i] for i in top_words_idx])
    return topics

topics = get_top_words(word_topic_counts, vocab)

for i, topic in enumerate(topics):
    print(f"Topic #{i + 1}: {', '.join(topic)}")

# Assign each document to its dominant topic, then collect the true labels per topic
topic_assignments = np.argmax(doc_topic_counts, axis=1)
topic_to_labels = {i: [] for i in range(n_topics)}

for doc_idx, topic in enumerate(topic_assignments):
    topic_to_labels[topic].append(labels[doc_idx])

# For every non-empty topic, report the most frequent dataset label among its documents
for topic_idx, label_list in topic_to_labels.items():
    label_counts = np.bincount(label_list, minlength=len(label_names))
    if label_counts.sum() > 0:
        top_label_idx = label_counts.argmax()
        print(f"Topic #{topic_idx + 1}: {label_names[top_label_idx]} ({label_counts[top_label_idx]} documents)")
        print(f"Top words: {', '.join(topics[topic_idx])}\n")

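The majority label printed for a topic does not show how mixed that topic is. As a possible complement, the sketch below also reports the share of a topic's documents that carry the majority label. The helper `topic_label_purity` and the toy data are hypothetical and are not part of the original solution; the function only assumes two equal-length sequences, one of dominant topics and one of true labels.

```python
from collections import Counter, defaultdict


def topic_label_purity(topic_assignments, labels):
    """Return {topic: (majority_label, share of the topic's documents with it)}."""
    by_topic = defaultdict(Counter)
    for topic, label in zip(topic_assignments, labels):
        by_topic[topic][label] += 1
    result = {}
    for topic, counts in by_topic.items():
        label, count = counts.most_common(1)[0]
        result[topic] = (label, count / sum(counts.values()))
    return result


# Hypothetical toy data: dominant topic and true label for six documents.
toy_topics = [0, 0, 0, 1, 1, 2]
toy_labels = ["hockey", "hockey", "autos", "crypt", "crypt", "autos"]

for topic, (label, share) in sorted(topic_label_purity(toy_topics, toy_labels).items()):
    print(f"Topic #{topic + 1}: {label} ({share:.0%} of its documents)")
```

With the notebook's variables, the same function could be called as `topic_label_purity(topic_assignments, labels)`, with the returned label indices looked up in `label_names`. A low share suggests a generic topic rather than one tied to a single newsgroup.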

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Topic #1: one, would, like, get, also, see, could, thing, time, say
Topic #2: make, one, would, good, like, use, need, want, also, look
Topic #3: one, like, never, know, people, much, something, thing, going, would
Topic #4: god, believe, christian, say, people, word, jesus, life, must, bible
Topic #5: government, key, use, public, law, system, chip, used, encryption, phone
Topic #6: less, since, often, case, quite, cause, problem, level, certain, may
Topic #7: think, would, know, want, even, people, say, much, get, one
Topic #8: use, using, set, file, function, line, following, read, number, several
Topic #9: window, system, use, using, card, drive, work, running, run, program
Topic #10: like, one, get, would, know, come, still, also, use, look
Topic #11: first, year, day, went, took, last, thought, said, told, month
Topic #12: anyone, please, know, thanks, looking, would, post, email, send, reply
Topic #13: state, people, government, right, american, force, member, world, war, child
