
In [2]:
!pip install bertopic[visualization]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertopic[visualization]
  Downloading bertopic-0.15.0-py2.py3-none-any.whl (143 kB)
Collecting hdbscan>=0.8.29 (from bertopic[visualization])
  Downloading hdbscan-0.8.29.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic[visualization])
  Downloading umap-learn-0.5.3.tar.gz (88 kB)
  Preparing metadata (setup.py) ... done


### 1. BERTopic
------------
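BERTopic is a topic modeling technique that combines BERT embeddings with class-based TF-IDF to build clusters of documents and describe them.
It works in three steps.

1) Embed the documents with SBERT
* "paraphrase-MiniLM-L6-v2": an SBERT model trained on English data
* "paraphrase-multilingual-MiniLM-L12-v2": a multilingual SBERT model trained on more than 50 languages

2) Cluster the documents: UMAP reduces the dimensionality of the embeddings, and HDBSCAN clusters the reduced embeddings into groups of semantically similar documents.

3) Create topic representations: extract the words of each topic with class-based TF-IDF.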
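
Step 3 can be illustrated with a small pure-Python sketch. It assumes the weighting usually given for class-based TF-IDF, `tf(t, c) * log(1 + A / f(t))`, where `A` is the average number of words per cluster and `f(t)` is the frequency of term `t` across all clusters. The helper names (`class_tfidf`, `top_words`) and the toy clusters are hypothetical, and the sketch uses raw counts instead of the normalized term frequencies a real implementation would use, so treat it as an illustration rather than BERTopic's code.

```python
import math
from collections import Counter


def class_tfidf(classes):
    """classes maps a cluster id to a list of tokenized documents."""
    # Treat each cluster as one long document and count its terms.
    tf_per_class = {
        cid: Counter(word for doc in docs for word in doc)
        for cid, docs in classes.items()
    }
    # Frequency of every term across all clusters.
    total_tf = Counter()
    for counts in tf_per_class.values():
        total_tf.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total_tf.values()) / len(tf_per_class)
    return {
        cid: {
            term: tf * math.log(1 + avg_words / total_tf[term])
            for term, tf in counts.items()
        }
        for cid, counts in tf_per_class.items()
    }


def top_words(scores, cid, n=3):
    """Return the n highest-scoring terms of a cluster (ties broken alphabetically)."""
    ranked = sorted(scores[cid].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical toy clusters: two "topics" with two short documents each.
toy_clusters = {
    0: [["game", "team", "the"], ["team", "season", "the"]],
    1: [["key", "chip", "the"], ["key", "encryption", "the"]],
}
scores = class_tfidf(toy_clusters)
print(top_words(scores, 0, n=2))  # ['team', 'game']
print(top_words(scores, 1, n=2))  # ['key', 'chip']
```

The word "the" is as frequent as "team" inside cluster 0, but because it appears in every cluster its score is pushed below the cluster-specific words.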


### 2. Loading the data
-------------------

In [3]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
docs[:5]

["\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n",
 'My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  - Diamond Stealth Pro Local Bus\n\n  - Orchid Farenheit 1280\n\n  - ATI Graphics Ultra Pro\n\n  - Any other high-per

In [4]:
len(docs)

18846

### 3. Topic modeling
---------------
Create a BERTopic model object and call fit_transform on the documents.

In [5]:
model = BERTopic()
topics, probabilities = model.fit_transform(docs)


In [6]:
print('Length of the per-document topic list:', len(topics))
print('Topic number of the first document:', topics[0])

Length of the per-document topic list: 18846
Topic number of the first document: 0


In [7]:
# Shows the number of topics, their sizes, and the words assigned to each topic
model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 6649 | `-1_to_is_of_the` | `[to, is, of, the, and, you, for, in, it, this]` | `[\n\n\tWhy do we follow God so blindly? Have ...` |
| 1 | 0 | 1824 | `0_game_team_games_he` | `[game, team, games, he, players, season, hocke...` | `[NHL RESULTS FOR GAMES PLAYED 4/15/93.\n\n----...` |
| 2 | 1 | 563 | `1_key_clipper_chip_encryption` | `[key, clipper, chip, encryption, keys, escrow,...` | `[Here is a revised version of my summary which...` |
| 3 | 2 | 525 | `2_ites_cheek_yep_huh` | `[ites, cheek, yep, huh, ken, why, each, very, ...` | `[\nHuh?, \nYep.\n, ites:]` |
| 4 | 3 | 438 | `3_monitor_card_video_drivers` | `[monitor, card, video, drivers, vga, monitors,...` | `[A couple of months ago I tried out a Hercules...` |
| ... | ... | ... | ... | ... | ... |
| 209 | 208 | 10 | `208_media_nw_washington_dc` | `[media, nw, washington, dc, street, news, ny, ...` | `[From article <1qvampINNmhf@darkstar.UCSC.EDU>...` |
| 210 | 209 | 10 | `209_toshiba_nec_drive_cdtechnology` | `[toshiba, nec, drive, cdtechnology, 3401, mult...` | `[The Toshiba has a 200ms access time, the NEC ...` |
| 211 | 210 | 10 | `210_law_jesus_paul_gentiles` | `[law, jesus, paul, gentiles, faith, heaven, go...` | `[\nOK, here's at least one Christian's answer:...` |
| 212 | 211 | 10 | `211_uninstall_windows_norton_group` | `[uninstall, windows, norton, group, 31, deskto...` | `[(NDW)\n\nIf an Uninstall icon doesn't exist i...` |


In [8]:
# The values in the Count column sum to the total number of documents
model.get_topic_info()['Count'].sum()

18846

Topic -1 is the largest group, but it holds the outlier documents that were not assigned to any topic. The actual topics therefore run from 0 to 211 (212 topics).
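
In the run above, 6,649 of the 18,846 documents (about 35%) ended up in topic -1. The following sketch shows one way to compute such a summary from the list of topic numbers returned by fit_transform. The function name `summarize_assignments` and the toy list are hypothetical; with a fitted model you would pass the real `topics` list instead.

```python
from collections import Counter


def summarize_assignments(topics):
    """Return (number of real topics, number of outliers, outlier share)."""
    counts = Counter(topics)
    n_outliers = counts.get(-1, 0)  # -1 marks documents without a topic
    n_topics = sum(1 for topic in counts if topic != -1)
    outlier_share = n_outliers / len(topics)
    return n_topics, n_outliers, outlier_share


# Hypothetical toy assignment: eight documents, two of them outliers.
toy_topics = [0, -1, 1, 0, -1, 2, 0, 1]
n_topics, n_outliers, share = summarize_assignments(toy_topics)
print(n_topics, n_outliers, round(share, 2))  # 3 2 0.25
```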

In [9]:
# Words and scores for topic 3
model.get_topic(3)

[('monitor', 0.022295341872923997),
 ('card', 0.02167927164641604),
 ('video', 0.015727636934186322),
 ('drivers', 0.012620073296151302),
 ('vga', 0.012602969688873656),
 ('monitors', 0.009620979290856008),
 ('diamond', 0.009068046739958546),
 ('screen', 0.008915507120835244),
 ('cards', 0.008491358410794408),
 ('mode', 0.008242045760807526)]

### 4. Topic visualization
----------
BERTopic can visualize the generated topics in a way very similar to LDAvis. The visualization gives more insight into the topics that were found.

In [10]:
model.visualize_topics()

### 5. Word visualization
----------------

In [11]:
model.visualize_barchart()

### 6. Topic similarity visualization

In [12]:
model.visualize_heatmap()

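### 7. Setting the number of topics
-----
There are two ways to control the number of topics.
1) Pass the desired number of topics as nr_topics when creating the model object. BERTopic finds similar topics and merges them into one.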
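
The idea of merging similar topics can be sketched in plain Python. This is a simplified illustration, not BERTopic's internal procedure: it assumes each topic is described by a numeric vector, repeatedly merges the pair with the highest cosine similarity, and stops when the requested number of topics remains. All names (`cosine`, `merge_topics`) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_topics(vectors, sizes, nr_topics):
    """Greedily merge the most similar pair of topics until nr_topics remain."""
    nr_topics = max(1, nr_topics)
    vectors = {k: list(v) for k, v in vectors.items()}
    sizes = dict(sizes)
    members = {k: [k] for k in vectors}
    while len(vectors) > nr_topics:
        keys = sorted(vectors)
        # Pick the pair of topics whose vectors are most similar.
        a, b = max(
            ((x, y) for i, x in enumerate(keys) for y in keys[i + 1:]),
            key=lambda pair: cosine(vectors[pair[0]], vectors[pair[1]]),
        )
        total = sizes[a] + sizes[b]
        # The merged topic gets the size-weighted average of both vectors.
        vectors[a] = [
            (sizes[a] * p + sizes[b] * q) / total
            for p, q in zip(vectors[a], vectors[b])
        ]
        sizes[a] = total
        members[a].extend(members.pop(b))
        del vectors[b], sizes[b]
    return members, sizes


# Hypothetical toy topics: 0 and 1 point in almost the same direction.
toy_vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}
toy_sizes = {0: 10, 1: 5, 2: 8}
members, sizes = merge_topics(toy_vectors, toy_sizes, nr_topics=2)
print(members)  # {0: [0, 1], 2: [2]}
print(sizes)    # {0: 15, 2: 8}
```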

In [13]:
model = BERTopic(nr_topics=20)

In [14]:
topics, probabilities = model.fit_transform(docs)

In [15]:
model.visualize_topics()

2) The other option is to let BERTopic reduce the number of topics automatically.

In [16]:
model = BERTopic(nr_topics="auto")
topics, probabilities = model.fit_transform(docs)

model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 6557 | `-1_the_to_of_and` | `[the, to, of, and, is, in, for, that, it, you]` | `[\n There are a couple of ways to look a...` |
| 1 | 0 | 2636 | `0_for_with_is_the` | `[for, with, is, the, and, to, it, drive, or, you]` | `[Archive-name: x-faq/part4\nLast-modified: 199...` |
| 2 | 1 | 1825 | `1_game_team_he_games` | `[game, team, he, games, the, was, in, players,...` | `[News:\n=====\nFor the first time all season, ...` |
| 3 | 2 | 1369 | `2_that_of_is_god` | `[that, of, is, god, not, to, the, you, in, and]` | `[\n\nThere was an article in USA today a few m...` |
| 4 | 3 | 1130 | `3_the_key_to_that` | `[the, key, to, that, of, and, be, they, is, this]` | `[THE WHITE HOUSE\n\n Office...` |
| ... | ... | ... | ... | ... | ... |
| 91 | 90 | 12 | `90_needles_acupuncture_needle_syringe` | `[needles, acupuncture, needle, syringe, hypode...` | `[\n\n\tAsk the practitioner whether he uses th...` |
| 92 | 91 | 11 | `91_xtermmap_numlock_definekey_xmodmap` | `[xtermmap, numlock, definekey, xmodmap, capslo...` | `[These are two common subjects so I hope someo...` |
| 93 | 92 | 11 | `92_moscow_aviation_russian_kaliningrad` | `[moscow, aviation, russian, kaliningrad, poljo...` | `[\nCorrection, and some more info: The Kalinin...` |
| 94 | 93 | 11 | `93_boards_solder_mask_green` | `[boards, solder, mask, green, board, fiberglas...` | `[The color of the board shows the composition ...` |


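### 8. Predicting the topic of an arbitrary document
-----------------
To predict the main topic of an arbitrary document with a trained topic model, use the transform() method. As an example, pass in the first document used for training and print its main topic number.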

In [18]:
new_doc = docs[0]

print(new_doc)



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




In [19]:
topics, probs = model.transform([new_doc])
print('Predicted topic number:', topics)

Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.

Predicted topic number: [1]
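
A topic number is easier to read next to its label. The sketch below maps the predicted number to a name using a small dictionary copied by hand from the get_topic_info() output of the nr_topics="auto" model shown earlier. The helper `describe` is hypothetical, and with a live model the same lookup could be done against the fitted model itself, for example with `model.get_topic(topic_id)`.

```python
# Labels copied from the get_topic_info() output of the nr_topics="auto" model.
topic_names = {
    -1: "-1_the_to_of_and",
    0: "0_for_with_is_the",
    1: "1_game_team_he_games",
}


def describe(topic_id, names):
    """Return a readable label for a predicted topic number."""
    if topic_id == -1:
        return "outlier (no topic assigned)"
    return names.get(topic_id, f"unknown topic {topic_id}")


predicted = [1]  # the value printed by model.transform([new_doc]) above
for topic_id in predicted:
    print(topic_id, "->", describe(topic_id, topic_names))  # 1 -> 1_game_team_he_games
```

The hockey post is assigned to the topic whose top words are game, team, he, and games, which matches its content.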


### 9. Saving and loading the model
-------------------------------


In [20]:
model.save("my_topics_model")
BerTopic_model = BERTopic.load("my_topics_model")