# AI Expert Training Course Lab 1 - Part 2

***
### Applied NLP: Topic Extraction
Applied Natural Language Processing: Topic Modeling

Instructor: Prof. 차미영 (KAIST School of Computing)    
Teaching assistants: 신민기, 정현규 (KAIST School of Computing)

Lab lead: 신민기 (mingi.shin@kaist.ac.kr)

# BERTopic

* To use a GPU, go to Runtime > Change runtime type and select GPU or TPU.

In [None]:
!pip install bertopic

In [None]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

from bertopic import BERTopic

# Scikit-learn: Machine learning library
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

# spacy for lemmatization
import spacy

import matplotlib.pyplot as plt
%matplotlib inline

# NLTK: NLP library
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Configure logging so that only errors are shown (optional)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Finding Topical Clusters with BERTopic

More detailed information: https://maartengr.github.io/BERTopic/index.html

### Load data

20newsgroups data: This dataset is a collection of newsgroup documents. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

The 20newsgroups dataset is a set of documents exchanged on mailing lists, each of which has its own central topic.

More detailed information: https://www.kaggle.com/crawford/20-newsgroups

In [None]:
dataset = fetch_20newsgroups(shuffle=True,
                            random_state=32,
                            remove=('headers', 'footers', 'quotes'))

In [None]:
news_df = pd.DataFrame({'News': dataset.data,
                       'Target': dataset.target})
news_df['Target_name'] = news_df['Target'].apply(lambda x: dataset.target_names[x])
news_df

### Preprocessing

In [None]:
data = news_df.News.values.tolist()

# Remove email addresses (any token containing '@')
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Collapse newlines and other whitespace runs into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"'", "", sent) for sent in data]

In [None]:
print(data[:1])
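
To see what the three substitutions do, the sketch below applies them to a made-up message. The sample text is hypothetical and not taken from the dataset, and `clean` is a helper name introduced here for illustration. The patterns match those in the preprocessing cell and are written as raw strings.

```python
import re

def clean(text):
    """Apply the three substitutions from the preprocessing cell, in order."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop any token containing '@'
    text = re.sub(r'\s+', ' ', text)        # collapse whitespace runs, including newlines
    text = re.sub(r"'", '', text)           # drop single quotes
    return text

# Hypothetical sample message, for illustration only.
sample = "Contact me at someone@example.com\nif it doesn't   work."
print(clean(sample))
```

Note that removing single quotes also turns contractions such as "doesn't" into "doesnt".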

### Build BERTopic

In [None]:
bertopic_model = BERTopic(language='english', top_n_words=10, n_gram_range=(1, 1), min_topic_size=10)

In [None]:
topics, probs = bertopic_model.fit_transform(data)

### Result

In [None]:
bertopic_model.get_topic_info()
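
`fit_transform` returns one topic id per document, and BERTopic uses `-1` for documents it leaves unassigned as outliers. As a rough cross-check of the `Count` column in the table, the ids can be tallied directly. The list below is a hypothetical stand-in for `topics`, so the sketch runs without fitting a model, and `topic_sizes` is a helper name introduced here.

```python
from collections import Counter

def topic_sizes(topic_ids):
    """Return (topic_id, document_count) pairs, largest topic first."""
    return Counter(topic_ids).most_common()

# Hypothetical assignments; in the notebook, pass the `topics` list instead.
example_topics = [0, 1, -1, 0, 2, 0, -1, 1]

for topic_id, count in topic_sizes(example_topics):
    label = "outliers" if topic_id == -1 else f"topic {topic_id}"
    print(f"{label}: {count} documents")
```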

### Visualization

In [None]:
# Visualize the topics
bertopic_model.visualize_topics()

In [None]:
bertopic_model.visualize_hierarchy(top_n_topics=50)
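
The hierarchy plot shows which topics sit close enough to be merged. One way to act on that, sketched below, is to pass `nr_topics` to the constructor so that BERTopic merges similar topics after fitting; raising `min_topic_size` also tends to yield fewer, larger topics. `topic_model_settings` and `build_topic_model` are hypothetical helper names, the value 20 is arbitrary, and `BERTopic` is imported inside the function only so that the settings can be inspected without loading the library.

```python
def topic_model_settings(nr_topics=None, min_topic_size=10):
    """Constructor arguments used in this notebook, plus an optional topic cap."""
    return {
        "language": "english",
        "top_n_words": 10,
        "n_gram_range": (1, 1),
        "min_topic_size": min_topic_size,
        # None keeps every cluster that was found; an int asks BERTopic to
        # merge similar topics after fitting until about that many remain.
        "nr_topics": nr_topics,
    }

def build_topic_model(**overrides):
    """Create an unfitted BERTopic model (requires `pip install bertopic`)."""
    from bertopic import BERTopic  # lazy import: heavy dependency
    return BERTopic(**topic_model_settings(**overrides))

settings = topic_model_settings(nr_topics=20)
print(settings)

# In the notebook, the model would then be fitted as before:
# reduced_model = build_topic_model(nr_topics=20)
# topics, probs = reduced_model.fit_transform(data)
```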

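+ How is the number of topics determined? How can we obtain a desired number of topics?
+ BERTopic uses a language model such as BERT. With that in mind, consider why, unlike with LDA, we did not apply preprocessing such as stopword removal or punctuation removal.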
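
As a hint for the second question, the sketch below shows one thing stopword removal can destroy: negation. With a small hypothetical stopword list (not NLTK's actual list), two sentences with opposite meanings collapse into the same word sequence, which removes exactly the kind of context a sentence-embedding model relies on.

```python
# Hypothetical mini stopword list for illustration; real lists are much longer.
STOPWORDS = {"this", "is", "not", "a", "the"}

def remove_stopwords(sentence):
    """Lowercase, split on whitespace, and drop tokens found in STOPWORDS."""
    return [tok for tok in sentence.lower().split() if tok not in STOPWORDS]

positive = "This is a good idea"
negative = "This is not a good idea"

print(remove_stopwords(positive))
print(remove_stopwords(negative))
# True means the two sentences can no longer be told apart.
print(remove_stopwords(positive) == remove_stopwords(negative))
```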