# HO02: Text Clustering
*Due* May 16 at 7:00 *Points* 5

## Problem
Cluster the 20 Newsgroups dataset, vectorizing it with both TF-IDF and Word2Vec, and apply each of the approaches below:

- K-Means (K=4)
- Spectral Clustering (K=6)
- Gaussian Mixture
- Agglomerative Clustering
- DBSCAN
- HDBSCAN

For each cluster, show the top 10 documents and the top 20 tokens that represent it.

**Use a Python library to download this dataset.**
<br>
https://builtin.com/data-science/tsne-python
https://github.com/VeereshElango/tsne-visualizations/blob/master/notebooks/t-SNE%20visualization%20of%2020%20News%20group%20dataset.ipynb

Clustering algorithms are unsupervised. They are used for exploratory data mining and preliminary analysis.
Clustering does not use labels.

### Step 1: Load the data

In [1]:
# Download the 20 Newsgroups dataset
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from sklearn.cluster import KMeans, SpectralClustering, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
import hdbscan
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re

import nltk
from gensim.models import Word2Vec

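nltk.download('stopwords', quiet=True)  # Make sure the stop-word list is available locally
news = fetch_20newsgroups(subset='train')
stop_words = set(stopwords.words('english'))  # A set makes membership tests fast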

In [2]:
list(news.target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [3]:
categorias = ['comp.windows.x',  'sci.electronics']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categorias)
print(f'Showing the first 10 of {len(newsgroups_train.data)} documents in the dataset:')
newsgroups_train.data[:10]

Showing the first 10 of 1184 documents in the dataset:


['From: cfb@fc.hp.com (Charlie Brett)\nSubject: Re: Los Angeles Freeway traffic reports\nNntp-Posting-Host: hpfcmgw.fc.hp.com\nOrganization: Hewlett-Packard Fort Collins Site\nX-Newsreader: TIN [version 1.1 PL8.5]\nLines: 21\n\n: While driving through the middle of nowhere, I picked up KNBR, AM 1070,\n: a clear-channel station based in Los Angeles. They had an ad \n: claiming that they were able to get traffic flow information from \n: all of the thousands of traffic sensors that CalTrans has placed\n: under the pavement. Does CalTrans sell this info? Does KNBR have\n: an exclusive? What\'s the deal?\n\n: ==Doug "Former L.A. commuter" Claar\n\nYou were right the second time, it is KNX. Believe it or not, I also\nlisten to KNX in the evenings here in Colorado! It\'s kind of fun driving\nthrough the country listening to traffic jams on the 405. Back to your\noriginal question. Yes, there are sensors just past every on-ramp and\noff-ramp on the freeways. They\'re the same sensors used at 

### Data preparation: normalization and vectorization
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [4]:
def normalizar(texto):
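    texto = texto.replace('\n', ' ')  # Replace line breaks with spaces
    texto = texto.lower()  # Convert to lowercase
    # Drop every character except letters, digits, @, apostrophes, and whitespace.
    # Punctuation is removed without inserting a space, so "sci.electronics" becomes "scielectronics".
    texto = re.sub(r'[^a-zA-Z0-9@\'\s]', '', texto)
    texto = ' '.join([palavra for palavra in texto.split() if palavra not in stop_words])  # Remove stop words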
    return texto

documentos = [normalizar(texto) for texto in newsgroups_train.data]

# Extract features from the dataset using TF-IDF
# min_df=0.01 ignores terms that appear in fewer than 1% of the documents (the "cut-off" in the literature)
tfidf_vectorizer = TfidfVectorizer(min_df=0.01)
tfidf_vectors = tfidf_vectorizer.fit_transform(documentos)
df_tfidf = pd.DataFrame(tfidf_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
df_tfidf

Unnamed: 0,10,100,11,12,120,120vac,121,122,13,14,...,xterminal,xview,xwindows,year,years,yes,yet,york,young,zero
0,0.000000,0.000000,0.066876,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.000000,0.087431,0.0,0.0,0.0,0.0
1,0.031507,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.000000,0.039397,0.0,0.0,0.0,0.0
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.379575,0.168979,0.055138,0.0,0.0,0.0,0.0
4,0.000000,0.000000,0.000000,0.058468,0.000000,0.0,0.0,0.0,0.0,0.063011,...,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1179,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0
1180,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0
1181,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0
1182,0.000000,0.023417,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.000000,0.044382,0.0,0.0,0.0,0.0


In [5]:
# Extract features from the dataset using Word2Vec

documentos_lista = [noticia.split() for noticia in documentos]
# Train the Word2Vec model
w2v_model = Word2Vec(documentos_lista, min_count=1)
# Represent each document as a matrix with one row (word vector) per token
w2v_vectors = [w2v_model.wv[tokens] for tokens in documentos_lista]

print(list(w2v_model.wv.key_to_index))      # List of words in the model's vocabulary
print(w2v_vectors)

[array([[-1.86447948e-02,  3.35907452e-02,  9.70787776e-04, ...,
        -2.87701637e-02, -7.88334385e-03,  5.84800402e-03],
       [-5.50100058e-02,  6.36893436e-02,  3.35966274e-02, ...,
        -8.01872760e-02, -6.47769868e-03,  1.70338210e-02],
       [-3.11476123e-02,  4.67386916e-02,  1.01382602e-02, ...,
        -4.91098240e-02, -1.81242707e-04,  6.73025427e-03],
       ...,
       [-7.87572488e-02,  7.63576329e-02,  3.99471000e-02, ...,
        -8.72473791e-02,  3.19810584e-04,  4.15861160e-02],
       [-7.25341067e-02,  7.60523453e-02,  2.74386816e-02, ...,
        -1.18667915e-01, -1.73777305e-02,  2.74749715e-02],
       [-2.27608025e-01,  2.41265103e-01,  8.75391066e-02, ...,
        -2.90333807e-01, -1.44123705e-02,  7.41316006e-02]], dtype=float32), array([[-0.03528229,  0.04032332,  0.0107287 , ..., -0.07219984,
        -0.0013865 ,  0.02363114],
       [-0.07576464,  0.06039803,  0.01889229, ..., -0.09857567,
        -0.01648353,  0.02039888],
       [-0.03700048,  0.04

In [None]:
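# How can each document's Word2Vec matrix (one row per token) be reduced to a single fixed-size vector?

from sklearn.decomposition import PCA

# Attempt 1: PCA over the token vectors of the first document only.
# This yields one value per token, so the length still varies from document to document.
pca = PCA(n_components=1)
w2v_primeiro_doc_pca = pca.fit_transform(w2v_vectors[0])

# Attempt 2: average the token vectors of each document,
# which gives one fixed-size vector per document.
w2v_vectors_reduzida = []
for doc in w2v_vectors:
    w2v_vectors_reduzida.append(np.mean(doc, axis=0))
w2v_vectors_reduzida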
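
The averaging step can be sketched in isolation. The block below uses a tiny hand-written embedding table as a hypothetical stand-in for `w2v_model.wv`, and the helper name `mean_pool` is also an assumption made for this example. It adds a guard for documents with no known tokens, since averaging an empty list would otherwise produce NaN values.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for w2v_model.wv.
toy_embeddings = {
    "window": np.array([1.0, 0.0, 0.0]),
    "server": np.array([0.0, 1.0, 0.0]),
    "circuit": np.array([0.0, 0.0, 1.0]),
}

def mean_pool(tokens, embeddings, dim):
    """Average the vectors of the known tokens; return zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

toy_docs = [["window", "server"], ["circuit"], []]
doc_matrix = np.vstack([mean_pool(doc, toy_embeddings, dim=3) for doc in toy_docs])
print(doc_matrix.shape)  # one row per document, one column per embedding dimension
```

With the notebook's own objects, the equivalent matrix would be `np.vstack(w2v_vectors_reduzida)`, which scikit-learn estimators such as `KMeans` accept directly. Note that the columns of such a matrix are embedding dimensions rather than vocabulary terms, so the top-token ranking used for TF-IDF does not carry over as is.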

# Developing the clustering algorithms

In [6]:
def kmeans_cluster(vetor, termos, k=4):
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(vetor)

    top_docs = []
    top_tokens = []
    # Feature indices of each centroid, sorted by descending weight
    order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
    for i in range(kmeans.n_clusters):
        # Indices of the 10 documents with the largest TF-IDF sum in the cluster
        cluster_docs = np.where(kmeans.labels_ == i)[0]
        cluster_scores = np.asarray(vetor[cluster_docs].sum(axis=1)).ravel()
        top_docs_idx = np.argsort(cluster_scores)[::-1][:10]
        top_docs.append(cluster_docs[top_docs_idx])

        # The 20 most representative tokens of the cluster (largest centroid weights)
        cluster_tokens = [termos[ind] for ind in order_centroids[i, :20]]
        top_tokens.append(cluster_tokens)

    # Print the top tokens and top documents of each cluster
    for i, docs in enumerate(top_docs):
        print(f"Cluster {i}:\nTop 20 representative tokens: \n\t{', '.join(top_tokens[i])}\nTop 10 docs:")

        for doc_idx in docs:
            print(f"\t{documentos[doc_idx]}")
        print("="*80 + "\n")

kmeans_cluster(tfidf_vectors, tfidf_vectorizer.get_feature_names_out())


Cluster 0:
Top 20 representative tokens: 
	one, would, use, writes, good, ground, power, article, radar, audio, lines, like, work, subject, current, radio, circuit, chip, organization, amp
Top 10 docs:
	jeh@cmkrnlcom subject electrical wiring faq question 120vac outlet wiring replyto wirefaq@ferretocunixonca keywords 120 240 ac outlets wiring power shock gfci expires 15 may 93 213516 pdt distribution world organization kernel mode systems san diego ca lines 1547 since electrical wiring questions turn time time scielectronics answers always apparent even skilled electronics hijacking following faq posting copy i've asked writers crosspost scielectronics future jeh@cmkrnlcom xnews cmkrnl newsanswers 6685 newsgroups miscconsumershouserecwoodworkingnewsanswersmiscanswersrecanswers subject electrical wiring faq messageid wirefaq733900891@ecicrl clewis@ferretocunixonca chris lewis date 4 apr 93 052149 gmt replyto wirefaq@ferretocunixonca wiring faq commentary reception followupto poster exp

In [7]:
def spectral_cluster(vetor, termos, k=6):
    spectral = SpectralClustering(n_clusters=k, random_state=0, affinity='nearest_neighbors', n_init=10).fit(vetor)

    top_docs = []
    top_tokens = []
    for i in range(spectral.n_clusters):
        cluster_docs = np.where(spectral.labels_ == i)[0]
        cluster_vectors = vetor[cluster_docs]

        # Indices of the 10 documents with the largest TF-IDF sum in the cluster
        cluster_scores = np.asarray(cluster_vectors.sum(axis=1)).ravel()
        top_docs_idx = np.argsort(cluster_scores)[::-1][:10]
        top_docs.append(cluster_docs[top_docs_idx])

        # The 20 tokens with the largest mean TF-IDF in the cluster
        cluster_mean = np.asarray(cluster_vectors.mean(axis=0)).flatten()
        top_tokens_idx = np.argsort(cluster_mean)[::-1][:20]
        cluster_tokens = [termos[j] for j in top_tokens_idx]
        top_tokens.append(cluster_tokens)

    # Print the top tokens and top documents of each cluster
    for i, docs in enumerate(top_docs):
        print(f"Cluster {i}:\nTop 20 representative tokens: \n\t{', '.join(top_tokens[i])}\nTop 10 docs:")

        for doc_idx in docs:
            print(f"\t{documentos[doc_idx]}")
        print("="*80 + "\n")

spectral_cluster(tfidf_vectors, tfidf_vectorizer.get_feature_names_out())


Cluster 0:
Top 20 representative tokens: 
	ground, wire, outlet, wiring, neutral, outlets, breaker, connected, 120vac, question, box, electrical, panel, circuits, circuit, screw, one, current, writes, may
Top 10 docs:
	jeh@cmkrnlcom subject electrical wiring faq question 120vac outlet wiring replyto wirefaq@ferretocunixonca keywords 120 240 ac outlets wiring power shock gfci expires 15 may 93 213516 pdt distribution world organization kernel mode systems san diego ca lines 1547 since electrical wiring questions turn time time scielectronics answers always apparent even skilled electronics hijacking following faq posting copy i've asked writers crosspost scielectronics future jeh@cmkrnlcom xnews cmkrnl newsanswers 6685 newsgroups miscconsumershouserecwoodworkingnewsanswersmiscanswersrecanswers subject electrical wiring faq messageid wirefaq733900891@ecicrl clewis@ferretocunixonca chris lewis date 4 apr 93 052149 gmt replyto wirefaq@ferretocunixonca wiring faq commentary reception follo

For the Gaussian Mixture, documents are ranked by their log-likelihood under the fitted model (`score_samples`), while words are ranked by their mean TF-IDF values.
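
One caveat is that `score_samples` returns the log-likelihood of each sample under the whole mixture, not under a single component. A possible alternative, sketched below on synthetic 2-D data rather than the newsgroup vectors, is to rank the members of each cluster by the posterior probability returned by `predict_proba`. The `*_toy` names and the choice of three top items are assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two well-separated 2-D blobs (illustrative only).
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

gmm_toy = GaussianMixture(n_components=2, random_state=0).fit(X_toy)
labels_toy = gmm_toy.predict(X_toy)

log_likelihood = gmm_toy.score_samples(X_toy)  # shape (n_samples,), whole-mixture log-density
posterior = gmm_toy.predict_proba(X_toy)       # shape (n_samples, n_components)

for k in range(gmm_toy.n_components):
    members = np.where(labels_toy == k)[0]
    # Rank members by how confidently they are assigned to component k.
    by_posterior = members[np.argsort(posterior[members, k])[::-1][:3]]
    # Rank members by overall density under the mixture.
    by_likelihood = members[np.argsort(log_likelihood[members])[::-1][:3]]
    print(k, by_posterior, by_likelihood)
```

The two rankings answer different questions: the posterior favors documents that clearly belong to one component, whereas the log-likelihood favors documents lying in dense regions of the fitted model.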

In [8]:
def gaussian_mixture_cluster(vetor, termos, k=4):
    # GaussianMixture needs a dense array; convert once and reuse it
    vetor_denso = vetor.toarray()

    # Fit the Gaussian Mixture Model (k=4 by default)
    gmm = GaussianMixture(n_components=k, random_state=0).fit(vetor_denso)

    # Assign each document to a cluster
    labels = gmm.predict(vetor_denso)
    # Log-likelihood of each document under the fitted mixture
    log_likelihood = gmm.score_samples(vetor_denso)

    # Collect the top 10 documents and top 20 tokens of each cluster
    top_documents = []
    top_tokens = []
    for i in range(gmm.n_components):
        cluster_indices = np.where(labels == i)[0]
        cluster_vectors = vetor[cluster_indices]

        # Top 10 documents, ranked by log-likelihood
        top_docs_idx = np.argsort(log_likelihood[cluster_indices])[::-1][:10]
        top_docs = [newsgroups_train.data[j] for j in cluster_indices[top_docs_idx]]
        top_documents.append(top_docs)

        # Top 20 tokens, ranked by mean TF-IDF
        cluster_mean = np.asarray(cluster_vectors.mean(axis=0)).flatten()
        top_tokens_idx = np.argsort(cluster_mean)[::-1][:20]
        cluster_tokens = [termos[j] for j in top_tokens_idx]
        top_tokens.append(cluster_tokens)

    # Print the top 10 documents and top 20 tokens of each cluster
    for i, docs in enumerate(top_documents):
        print(f"Cluster {i}:\nTop 20 representative tokens: \n\t{', '.join(top_tokens[i])}\nTop 10 docs:")
        for doc in docs:
            print("\t"+doc.replace('\n', ''))
        print("="*80 + "\n")

gaussian_mixture_cluster(tfidf_vectors, tfidf_vectorizer.get_feature_names_out())

Cluster 0:
Top 20 representative tokens: 
	subject, university, thanks, lines, organization, nntppostinghost, anyone, please, know, email, information, advance, would, help, internet, looking, distribution, 11, version, hi
Top 10 docs:
	From: rao@cse.uta.edu (Rao Venkatesh Simha)Subject: xrn, xarchie for HP 9000/730 - ASAPNntp-Posting-Host: cse.uta.eduOrganization: Computer Science Engineering at the University of Texas at ArlingtonLines: 10	Hi,	I need xrn and xarchie for the HP's (9000/730, version 8 OS), either inthe source form or, (preferably) in executable form. Please suggestwhere I can find this, 	Send e-mail to: rao@cse.uta.eduThanks in advance,Rao.-- SSC
	From: rao@cse.uta.edu (Rao Venkatesh Simha)Subject: xrn , xarchie for HP'sNntp-Posting-Host: cse.uta.eduOrganization: Computer Science Engineering at the University of Texas at ArlingtonLines: 10	Hi,	I need xrn and xarchie for the HP's (9000/730, version 8 OS), either inthe source form or, (preferably) in executable form. Pl

Because Agglomerative Clustering does not provide a scoring function, we rank the documents within each cluster by the sum of their TF-IDF values. We then select the top 10 documents and the top 20 words of each cluster, ranking the words by their mean TF-IDF values.
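
The same ranking recipe (documents by TF-IDF sum, tokens by mean TF-IDF) applies to any algorithm that only yields labels. A minimal sketch of a shared helper follows. The name `summarize_clusters` and its parameters are hypothetical, and it takes a dense array for simplicity.

```python
import numpy as np

def summarize_clusters(X, labels, terms, n_docs=10, n_tokens=20):
    """Return {label: (top_doc_indices, top_tokens)}, skipping the noise label -1."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    terms = np.asarray(terms)
    summary = {}
    for label in np.unique(labels):
        if label == -1:  # DBSCAN and HDBSCAN mark noise points with -1
            continue
        members = np.where(labels == label)[0]
        doc_scores = X[members].sum(axis=1)     # TF-IDF sum per document
        top_docs = members[np.argsort(doc_scores)[::-1][:n_docs]]
        token_scores = X[members].mean(axis=0)  # mean TF-IDF per term
        top_tokens = terms[np.argsort(token_scores)[::-1][:n_tokens]]
        summary[int(label)] = (top_docs.tolist(), top_tokens.tolist())
    return summary

# Tiny hand-made example: 4 documents, 3 terms, one noise point.
X_toy = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.2, 0.0],
    [0.0, 0.1, 0.8],
    [0.3, 0.3, 0.3],
])
result = summarize_clusters(X_toy, [0, 0, 1, -1], ["wire", "one", "window"], n_docs=1, n_tokens=2)
print(result)
```

With the notebook's sparse matrix, one would pass `tfidf_vectors.toarray()` together with `tfidf_vectorizer.get_feature_names_out()` and the `labels_` attribute of the fitted estimator.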

In [11]:
def agg_cluster(vetor, termos):
    # Cluster with Agglomerative Clustering (4 clusters)
    agg_clustering = AgglomerativeClustering(n_clusters=4).fit(vetor.toarray())

    # Cluster label assigned to each document
    labels = agg_clustering.labels_

    # Collect the top 10 documents and top 20 tokens of each cluster
    top_documents = []
    top_tokens = []
    for i in range(agg_clustering.n_clusters_):
        cluster_indices = np.where(labels == i)[0]
        cluster_vectors = vetor[cluster_indices]

        # Top 10 documents
        # Agglomerative Clustering provides no scoring function, so rank by the sum of TF-IDF values
        top_docs_idx = np.argsort(np.asarray(cluster_vectors.sum(axis=1)).flatten())[::-1][:10]
        top_docs = [newsgroups_train.data[j] for j in cluster_indices[top_docs_idx]]
        top_documents.append(top_docs)

        # Top 20 tokens, ranked by mean TF-IDF
        cluster_mean = np.asarray(cluster_vectors.mean(axis=0)).flatten()
        top_tokens_idx = np.argsort(cluster_mean)[::-1][:20]
        cluster_tokens = [termos[j] for j in top_tokens_idx]
        top_tokens.append(cluster_tokens)

    # Print the top 10 documents and top 20 tokens of each cluster
    for i, docs in enumerate(top_documents):
        print(f"Cluster {i}:\nTop 20 representative tokens: \n\t{', '.join(top_tokens[i])}\nTop 10 docs:")
        for doc in docs:
            print("\t"+doc.replace('\n', ''))
        print("="*80 + "\n")

agg_cluster(tfidf_vectors, tfidf_vectorizer.get_feature_names_out())

Cluster 0:
Top 20 representative tokens: 
	lines, subject, organization, would, university, nntppostinghost, one, use, like, thanks, know, anyone, get, window, writes, article, need, server, help, could
Top 10 docs:
	From: dbl@visual.com (David B. Lewis)Subject: comp.windows.x Frequently Asked Questions (FAQ) 4/5Summary: useful information about the X Window SystemArticle-I.D.: visual.C52Ep6.97pExpires: Sun, 2 May 1993 00:00:00 GMTReply-To: faq%craft@uunet.uu.net (X FAQ maintenance address)Organization: VISUAL, Inc.Lines: 968Archive-name: x-faq/part4Last-modified: 1993/04/04----------------------------------------------------------------------Subject:  80)! Where can I get an X-based plotting program?These usually are available from uucp sites such as uunet or other sites asmarked; please consult the archie server to find more recent versions. gnuplot	X (xplot), PostScript and a bunch of other drivers.	export.lcs.mit.edu [and elsewhere]:contrib/gnuplot3.1.tar.Z gl_plot	X output only [

For DBSCAN, the **eps** parameter sets the maximum distance between two points for them to be considered neighbors. If `eps` is too large, many points may be merged into a single cluster; if it is too small, many small clusters may be created or most points may be labeled as noise.

The **min_samples** parameter sets the minimum number of points that must lie within distance `eps` of a point for a cluster to form around it. If `min_samples` is too large, clusters may be hard to form; if it is too small, many small clusters may be created.
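
A common heuristic for choosing `eps` is to compute each point's distance to its k-th nearest neighbor, with k equal to `min_samples` (counting the point itself, as scikit-learn's DBSCAN does), and to pick a value near the bend of the sorted curve. The sketch below computes those distances on synthetic unit-length vectors, not on the newsgroup data. It also uses the fact that, for L2-normalized vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, so a Euclidean `eps` can be read as a cosine threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for TF-IDF rows: 200 non-negative vectors scaled to unit length.
X_toy = rng.random((200, 20))
X_toy /= np.linalg.norm(X_toy, axis=1, keepdims=True)

min_samples = 10

# For unit vectors, squared Euclidean distance = 2 - 2 * cosine similarity.
cosine = X_toy @ X_toy.T
distances = np.sqrt(np.clip(2.0 - 2.0 * cosine, 0.0, None))

# Sort each row; column 0 is the point itself (distance ~0), so column
# min_samples - 1 is the distance to the min_samples-th neighbor, self included.
k_distances = np.sort(distances, axis=1)[:, min_samples - 1]

# Candidate eps values are usually read off the sorted curve, for example by plotting it.
print(np.percentile(k_distances, [50, 90]))

# Cosine similarity threshold implied by a Euclidean eps on unit vectors.
eps = 0.95
implied_cosine = 1.0 - eps**2 / 2.0
print(implied_cosine)
```

Under this reading, `eps=0.95` on unit-length TF-IDF rows treats two documents as neighbors when their cosine similarity is at least about 0.55.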

In [12]:
from sklearn.preprocessing import Normalizer

# Reuse the TF-IDF features extracted earlier
vectorizer = tfidf_vectorizer
X = tfidf_vectors

# L2-normalize the rows so that Euclidean distances are comparable across documents
# (TfidfVectorizer already applies norm='l2' by default, so this is effectively a no-op here)
normalizer = Normalizer()
X_normalized = normalizer.fit_transform(X)

# Cluster with DBSCAN
dbscan = DBSCAN(eps=0.95, min_samples=10).fit(X_normalized)

# Cluster label assigned to each document
labels = dbscan.labels_

# Collect the top 10 documents and top 20 tokens of each cluster
unique_labels = np.unique(labels)
unique_labels = unique_labels[unique_labels != -1]  # Drop the noise label (-1)

if len(unique_labels) == 0:
    print("No meaningful cluster found. Try adjusting the DBSCAN hyperparameters.")
else:
    print(f'Grouped into {len(unique_labels)} clusters.')
    feature_names = vectorizer.get_feature_names_out()
    top_documents = []
    top_tokens = []
    for i in unique_labels:
        cluster_indices = np.where(labels == i)[0]
        cluster_vectors = X[cluster_indices]

        # Top 10 documents, ranked by the sum of TF-IDF values
        top_docs_idx = np.argsort(np.asarray(cluster_vectors.sum(axis=1)).flatten())[::-1][:10]
        top_docs = [newsgroups_train.data[j] for j in cluster_indices[top_docs_idx]]
        top_documents.append(top_docs)

        # Top 20 tokens, ranked by mean TF-IDF
        cluster_mean = np.asarray(cluster_vectors.mean(axis=0)).flatten()
        top_tokens_idx = np.argsort(cluster_mean)[::-1][:20]
        cluster_tokens = [feature_names[j] for j in top_tokens_idx]
        top_tokens.append(cluster_tokens)

    # Print the top 10 documents and top 20 tokens of each cluster
    for i, docs in enumerate(top_documents):
        print(f"Cluster {i}:\nTop 20 representative tokens: \n\t{', '.join(top_tokens[i])}\nTop 10 docs:")
        for doc in docs:
            print("\t"+doc.replace('\n', ''))
        print("="*80 + "\n")



Grouped into 4 clusters.
Cluster 0:
Top 20 representative tokens: 
	radar, detector, detectors, virginia, law, radio, state, receiver, one, operating, rf, cars, local, claim, car, others, detect, yes, antenna, illegal
Top 10 docs:
	From: whit@carson.u.washington.edu (John Whitmore)Subject: Re: Radar detector DETECTORS?Article-I.D.: shelley.1r4cucINNhamDistribution: naOrganization: University of Washington, SeattleLines: 18NNTP-Posting-Host: carson.u.washington.eduIn article <1993Apr19.231050.2196@Rapnet.Sanders.Lockheed.Com> babb@rapnet.sanders.lockheed.com (Scott Babb) writes:>Brian Day (bday@lambda.msfc.nasa.gov) wrote:>: On December 29, 1992, it was illegal to operate a radar detector>: in the state of Virginia.  If one got caught, one got fined $65.00.>The Federal Communications Act of 1934 made it *legal* for you to>operate a radio receiver of any kind, on any frequency (including>X, K, and Ka bands) in the United States. 	And the Commonwealth of Virginia has not exactly buttedaga

In [15]:

import numpy as np

# Reuse the TF-IDF features extracted earlier
vectorizer = tfidf_vectorizer
X = tfidf_vectors

# Create the HDBSCAN object and fit the model to the data
clusterer = hdbscan.HDBSCAN(min_cluster_size=12)
clusterer.fit(X)

# Cluster label assigned to each document
labels = clusterer.labels_

# Number of clusters, not counting the noise label (-1)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# Collect the top 10 documents and top 20 tokens of each cluster
unique_labels = np.unique(labels)
unique_labels = unique_labels[unique_labels != -1]  # Drop the noise label (-1)

if len(unique_labels) == 0:
    print("No meaningful cluster found. Try adjusting the HDBSCAN hyperparameters.")
else:
    print(f'Grouped into {len(unique_labels)} clusters.')
    feature_names = vectorizer.get_feature_names_out()
    top_documents = []
    top_tokens = []
    for i in unique_labels:
        cluster_indices = np.where(labels == i)[0]
        cluster_vectors = X[cluster_indices]

        # Top 10 documents, ranked by the sum of TF-IDF values
        top_docs_idx = np.argsort(np.asarray(cluster_vectors.sum(axis=1)).flatten())[::-1][:10]
        top_docs = [newsgroups_train.data[j] for j in cluster_indices[top_docs_idx]]
        top_documents.append(top_docs)

        # Top 20 tokens, ranked by mean TF-IDF
        cluster_mean = np.asarray(cluster_vectors.mean(axis=0)).flatten()
        top_tokens_idx = np.argsort(cluster_mean)[::-1][:20]
        cluster_tokens = [feature_names[j] for j in top_tokens_idx]
        top_tokens.append(cluster_tokens)

    # Print the top 10 documents and top 20 tokens of each cluster
    for i, docs in enumerate(top_documents):
        print(f"Cluster {i}:\nTop 20 representative tokens: \n\t{', '.join(top_tokens[i])}\nTop 10 docs:")
        for doc in docs:
            print("\t"+doc.replace('\n', ''))
        print("="*80 + "\n")


Grouped into 6 clusters.
Cluster 0:
Top 20 representative tokens: 
	cooling, nuclear, towers, water, steam, tower, cool, site, uhuraneoucomedu, air, mayhew, wtm, hot, john, power, bill, heat, cold, figured, ever
Top 10 docs:
	From: exuptr@exu.ericsson.se (Patrick Taylor, The Sounding Board)Subject: Re: How to the disks copy protected.Nntp-Posting-Host: 138.85.253.85Organization: Ericsson Network Systems, Inc.X-Disclaimer: This article was posted by a user at Ericsson.              Any opinions expressed are strictly those of the              user and not necessarily those of Ericsson.Lines: 36In article <1993Apr21.131908.29582@uhura.neoucom.edu> wtm@uhura.neoucom.edu (Bill Mayhew) writes:>From: wtm@uhura.neoucom.edu (Bill Mayhew)>Subject: Re: How to the disks copy protected.>Date: Wed, 21 Apr 1993 13:19:08 GMT>Write a good manual to go with the software.  The hassle of>photocopying the manual is offset by simplicity of purchasing>the package for only $15.  Also, consider offering an in