# 5. Topic Modeling

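- **Semantic relations:** Identifying semantic relations is one of the central tasks in natural language processing research. It serves as a foundational step in many areas, including named entity recognition, query expansion, and clustering.

- **Statistical similarity:** Given a document collection, the relationship between two words is quantified with statistical measures; in other words, similarity is inferred through statistical modeling.

- **Topic modeling**

- One of the unsupervised learning methods <br>

- An algorithm for discovering topics in a large, unstructured document collection (i.e., unstructured data) without supervision. (Here, a topic is a group of words that are likely to appear in the same context or that point to a similar theme.)

- **Generative model:** Assumes a probability distribution and its parameters, and generates data from them through a random process.
- A model that infers topics by using contextual cues to cluster words with similar meanings<br>
- Performance improves with more data and degrades with less, because the topics expressed in a collection show up more clearly the larger the collection is<br>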

The most widely used topic model is LDA (Latent Dirichlet Allocation), in which the per-document topic mixtures and the per-topic word distributions are assumed to follow Dirichlet distributions.
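To get a feel for what a Dirichlet distribution produces, the following minimal sketch draws two samples using only the standard library. The helper name `sample_dirichlet` and the concentration values `0.1` and `10.0` are assumptions chosen for illustration; they do not come from the original notebook.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Small alpha tends to give sparse mixtures (most mass on a few topics);
# large alpha tends to give mixtures close to uniform.
sparse = sample_dirichlet([0.1] * 5, rng)
smooth = sample_dirichlet([10.0] * 5, rng)

print([round(p, 3) for p in sparse])
print([round(p, 3) for p in smooth])
```

Each sample is a valid probability vector (non-negative, summing to 1), which is why a Dirichlet is a natural prior for "how much of each topic is in this document".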


## 5.1. LDA

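- The term distribution within each document is observable, but the term distribution of each topic is not known in advance.
- Without a model, the only way to get a rough idea of which topics exist would be to read through every document one by one.
- LDA therefore assumes a latent Dirichlet distribution behind each document and estimates the topics' term distributions from the directly observable term distributions of the documents in the collection.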
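The generative assumption behind the points above can be sketched in a few lines. This is a toy illustration, not how gensim or scikit-learn implement LDA: the vocabulary, the prior values `alpha` and `eta`, and the helper names are all made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # pick a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]      # pick a word from that topic
        words.append(w)
    return theta, words

rng = random.Random(42)
vocab = ["market", "bank", "price", "game", "team", "score"]  # toy vocabulary (hypothetical)
eta = [0.5] * len(vocab)  # prior over words within a topic (assumed value)
alpha = [0.5, 0.5]        # prior over topics within a document (assumed value)

# One word distribution per topic, drawn from the Dirichlet prior.
topic_word = [sample_dirichlet(eta, rng) for _ in alpha]

theta, doc = generate_document(topic_word, vocab, alpha, length=8, rng=rng)
print(theta)
print(doc)
```

Fitting LDA runs this story in reverse: only the words are observed, and the topic mixtures and topic-word distributions are estimated from them.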

<img src = "https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/5f1038ad42ed8a4428e395c96d57f83d201ef3b3/3-Figure1-1.png">


## 5.2. Topic Modeling with Gensim
Reference: https://nlpforhackers.io/topic-modeling/

Load the Brown corpus from nltk.

#### Brown Corpus: https://en.wikipedia.org/wiki/Brown_Corpus

In [21]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /Users/yoon/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [1]:
from nltk.corpus import brown 
data = [] 
for fileid in brown.fileids():
    document = ' '.join(brown.words(fileid))
    data.append(document)
 
NO_DOCUMENTS = len(data)
print(NO_DOCUMENTS)
print(data[:5])

500


In [2]:
import re
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords
 
NUM_TOPICS = 10
# A set makes the membership test below fast.
# Requires nltk.download('stopwords') and the tokenizer data
# (nltk.download('punkt'); 'punkt_tab' on recent NLTK releases).
STOPWORDS = set(stopwords.words('english'))
 
def clean_text(text):
    tokenized_text = word_tokenize(text.lower())
    # Keep tokens that are not stop words and begin with at least three letters or hyphens.
    cleaned_text = [t for t in tokenized_text if t not in STOPWORDS and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', t)]
    return cleaned_text
 
# Before using gensim, tokenize the text and remove stop words.
tokenized_data = []
for text in data:
    tokenized_data.append(clean_text(text))
  
# Use gensim's Dictionary to map each word to an integer id.
dictionary = corpora.Dictionary(tokenized_data)
 
# Convert each tokenized document into a bag-of-words vector.
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
 
# As a check, look at the document at index 20: [(word_id, count), ...]
print(corpus[20])
# [(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2),  ...
 
# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)
 
# Optional: build an LSI model
#lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

[(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2), (44, 2), (45, 2), (46, 2), (47, 2), (49, 1), (50, 1), (53, 1), (56, 1), (59, 1), (60, 1), (66, 1), (75, 1), (80, 1), (98, 1), (101, 1), (106, 1), (117, 1), (129, 1), (130, 2), (132, 2), (135, 2), (140, 1), (141, 2), (143, 4), (144, 2), (145, 2), (166, 1), (195, 1), (198, 3), (219, 1), (220, 4), (221, 3), (223, 1), (229, 4), (230, 4), (231, 2), (235, 1), (236, 1), (242, 2), (246, 2), (255, 1), (263, 1), (269, 1), (270, 5), (271, 2), (275, 5), (276, 1), (278, 4), (280, 2), (281, 1), (307, 2), (310, 1), (311, 3), (313, 1), (314, 5), (318, 4), (322, 1), (336, 1), (338, 3), (339, 1), (340, 1), (341, 1), (345, 1), (346, 1), (351, 1), (354, 1), (355, 1), (366, 3), (368, 13), (370, 1), (372, 1), (374, 3), (377, 3), (381, 3), (386, 1), (392, 6), (396, 1), (401, 1), (412, 2), (426, 2), (428, 2), (431, 2), (434, 2), (439, 2), (444, 1), (450, 1), (452, 1), (462, 1), (465, 1), (467, 1), (470, 1), (478, 1), (483, 1), (
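To make the `(word_id, count)` pairs in the output above concrete, here is a small pure-Python sketch of the same idea. It only illustrates the representation; gensim's `Dictionary` may assign ids differently, and the names `token2id` and `to_bow` are hypothetical helpers for this example.

```python
from collections import Counter

docs = [["topic", "model", "topic"], ["model", "corpus"]]  # toy tokenized documents

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (word_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

print(to_bow(docs[0]))               # [(0, 2), (1, 1)]
print(to_bow(["corpus", "unseen"]))  # [(2, 1)]
```

Word order is discarded and only counts remain, which is exactly the input LDA works with.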

In [4]:
print("LDA Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 highest-weighted words for each topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
 
print("=" * 20)
 

LDA Model:
Topic #0: 0.007*"one" + 0.006*"would" + 0.003*"could" + 0.003*"time" + 0.003*"said" + 0.003*"like" + 0.003*"two" + 0.002*"first" + 0.002*"way" + 0.002*"made"
Topic #1: 0.005*"one" + 0.005*"would" + 0.005*"said" + 0.004*"new" + 0.003*"could" + 0.003*"man" + 0.003*"time" + 0.003*"state" + 0.002*"may" + 0.002*"two"
Topic #2: 0.004*"would" + 0.004*"one" + 0.004*"may" + 0.003*"said" + 0.003*"two" + 0.003*"new" + 0.003*"first" + 0.002*"even" + 0.002*"man" + 0.002*"also"
Topic #3: 0.006*"one" + 0.005*"said" + 0.004*"would" + 0.003*"time" + 0.003*"could" + 0.003*"two" + 0.003*"first" + 0.002*"like" + 0.002*"man" + 0.002*"new"
Topic #4: 0.005*"would" + 0.005*"one" + 0.003*"new" + 0.003*"like" + 0.003*"time" + 0.003*"said" + 0.002*"many" + 0.002*"man" + 0.002*"could" + 0.002*"world"
Topic #5: 0.008*"one" + 0.004*"could" + 0.003*"time" + 0.003*"would" + 0.003*"said" + 0.003*"two" + 0.003*"like" + 0.003*"first" + 0.002*"new" + 0.002*"may"
Topic #6: 0.006*"would" + 0.005*"one" + 0.004*"s

In [5]:
text = "The economy is working better than ever"
bow = dictionary.doc2bow(clean_text(text))
 
print(lda_model[bow])
# [(0, 0.020005183), (1, 0.020005869), (2, 0.02000626), (3, 0.020005472), (4, 0.020009108), (5, 0.020005926), (6, 0.81994385), (7, 0.020006068), (8, 0.020006327), (9, 0.020005994)]
 

[(0, 0.020006057), (1, 0.02000642), (2, 0.020005446), (3, 0.020006329), (4, 0.81994295), (5, 0.020006884), (6, 0.020005438), (7, 0.020006325), (8, 0.020007018), (9, 0.020007117)]


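The LDA output describes how the text is distributed across the topics. In the output above, topic 4 carries about 0.82 of the weight, so topic 4 describes this text best. Because training involves randomness, the topic numbering and the weights differ from run to run. As a further example, consider a result such as:
[(0, 0.020229582), (1, 0.48642197), (2, 0.020894188), (3, 0.020058075), (4, 0.022410348), (5, 0.025939714), (6, 0.20046122), (7, 0.13457063), (8, 0.048185956), (9, 0.02082831)]
Here the text loads most heavily on topic 1 (0.486), so topic 1 describes the text best.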
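Picking out the dominant topic from such a result is a one-liner. The sketch below reuses the example distribution quoted above; the variable names are chosen for illustration.

```python
# (topic_id, probability) pairs, copied from the example result above.
topic_dist = [(0, 0.020229582), (1, 0.48642197), (2, 0.020894188),
              (3, 0.020058075), (4, 0.022410348), (5, 0.025939714),
              (6, 0.20046122), (7, 0.13457063), (8, 0.048185956),
              (9, 0.02082831)]

# The dominant topic is the pair with the largest probability.
top_topic, top_prob = max(topic_dist, key=lambda pair: pair[1])
print(top_topic, round(top_prob, 3))  # 1 0.486

# Rank all topics from most to least probable and show the top three.
ranked = sorted(topic_dist, key=lambda pair: pair[1], reverse=True)
print([topic_id for topic_id, _ in ranked[:3]])  # [1, 6, 7]
```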

As shown below, you can also compute the similarity between the query and every document to find the documents most similar to the text.

In [6]:
from gensim import similarities
 
# Index every document by its LDA topic vector (compared by cosine similarity).
lda_index = similarities.MatrixSimilarity(lda_model[corpus])
 
# Query the index with the topic vector of the new text.
# A separate name is used so the imported `similarities` module is not shadowed.
sims = lda_index[lda_model[bow]]
# Sort (document_id, similarity) pairs from most to least similar
sims = sorted(enumerate(sims), key=lambda item: -item[1])
 
# Top 10 most similar documents:
print(sims[:10])
# [(104, 0.87591344), (178, 0.86124849), (31, 0.8604598), (77, 0.84932965), (85, 0.84843522), (135, 0.84421808), (215, 0.84184396), (353, 0.84038532), (254, 0.83498049), (13, 0.82832891)]
 
# Show the beginning of the most similar document
document_id, similarity = sims[0]
print(data[document_id][:1000])


[(471, 0.99819046), (464, 0.99809164), (472, 0.99804556), (80, 0.99791795), (430, 0.9978959), (477, 0.9978152), (426, 0.9977547), (450, 0.9976276), (367, 0.9976251), (152, 0.99759597)]
Among us , we three handled quite a few small commissions , from spot drawings for advertising agencies uptown to magazine work and quick lettering jobs . Each of us had his own specialty besides . George did wonderful complicated pen-and-ink drawings like something out of a medieval miniature : hundreds of delicate details crammed into an eight-by-ten sheet and looking as if they had been done under a jeweler's glass . He also drew precise crisp spots , which he sold to various literary and artistic journals , The New Yorker , for instance , or Esquire . I did book jackets and covers for paperback reprints : naked girls huddling in corners of dingy furnished rooms while at the doorway , daring the cops to take him , is the guy in shirt sleeves clutching a revolver . The book could be The Brothers Karama
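
The `MatrixSimilarity` index used above ranks documents by cosine similarity between topic vectors. The following sketch shows that computation on made-up three-topic vectors; the vectors and document names are hypothetical, and the `cosine` helper is written out by hand rather than taken from gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy topic distributions over three topics (made-up numbers).
query = [0.8, 0.1, 0.1]
docs = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.1, 0.1, 0.8],
    "doc_c": [0.3, 0.4, 0.3],
}

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```

Documents whose weight sits on the same topics as the query score close to 1, regardless of how long the underlying texts are.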

## 5.3. Building an LDA Model with Scikit-Learn

In [7]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
 
NUM_TOPICS = 10
 
# The token pattern is a raw string so the regex escapes are passed through unchanged.
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)
 
# Build a Latent Dirichlet Allocation Model
# (the `n_topics` argument of older scikit-learn releases is now `n_components`)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape)  # (NO_DOCUMENTS, NUM_TOPICS)
 
# Let's see what the first document in the corpus looks like in topic space
print(lda_Z[0])




(500, 10)
[1.05596684e-04 1.05622756e-04 1.05600231e-04 1.05613095e-04
 8.76221712e-01 1.05622365e-04 1.05610043e-04 1.05596769e-04
 1.05602617e-04 1.22933424e-01]


In [8]:
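def print_topics(model, vectorizer, top_n=10):
    # get_feature_names_out() exists in scikit-learn >= 1.0;
    # older releases use get_feature_names() instead.
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so walk it backwards to get the top_n largest weights
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])
 
print("LDA Model:")
print_topics(lda_model, vectorizer)
print("=" * 20)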

LDA Model:
Topic 0:
[('new', 0.4036333169758561), ('playing', 0.40047542329360664), ('time', 0.3847133117702908), ('years', 0.349823911433485), ('church', 0.3394967971994472), ('music', 0.3211117623340652), ('world', 0.31889343779661106), ('good', 0.3185447037474335), ('man', 0.3109116792185486), ('way', 0.28351768003624256)]
Topic 1:
[('new', 782.1573650982392), ('time', 674.5518680711532), ('man', 649.722514088747), ('world', 641.121085935483), ('people', 558.4950072607762), ('life', 497.6328436724659), ('great', 470.50496390338196), ('years', 449.3836095784511), ('did', 447.8331704757196), ('like', 414.57520818391873)]
Topic 2:
[('parker', 34.7661817045717), ('stein', 14.629150968331386), ('association', 7.24959907803013), ('boston', 4.724937127222067), ('witnesses', 4.2246196674005265), ('missile', 4.083808813991264), ('said', 3.9536478235615697), ('union', 3.618046169140777), ('tears', 3.5928458392365563), ('pilots', 3.4348359786670137)]
Topic 3:
[('used', 188.79948334165022), ('n

A new document is transformed into topic space as follows.

In [9]:
text = "The economy is working better than ever"
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x)

[0.02500002 0.02500661 0.02500006 0.02500503 0.77497039 0.02500341
 0.02500387 0.02500001 0.02500109 0.02500951]


Similarity based on Euclidean distance can be computed as follows.

In [11]:
from sklearn.metrics.pairwise import euclidean_distances
 
def most_similar(x, Z, top_n=5):
    # Distance from x to every row of Z; a smaller distance means more similar
    dists = euclidean_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    closest = sorted(pairs, key=lambda item: item[1])[:top_n]
    return closest
 
nearest = most_similar(x, lda_Z)
document_id, distance = nearest[0]
print(data[document_id][:1000])

Sixty miles north of New York City where the wooded hills of Dutchess County meet the broad sweep of the Hudson River there is a new home development called `` Oakwood Heights '' . As a matter of fact you could probably find a new home development in every populated county in the country with three-bedroom ranch style cottages in the $14,000 range . But Oakwood Heights is unique in one particular . Its oil for heating is metered monthly to each home from a line that starts at a central storage point . This is a pilot operation sponsored by a new entity chartered in Delaware as the Tri-State Pipeline Corporation , with principal offices in New York State . Its president is Otis M. Waters , partner in the law firm of Timen & Waters , 540-K Chrysler Bldg. , New York City . Vice-president is Louis Berkman and the secretary-treasurer is Mark Ritter . Ritter is the builder of Oakwood Heights and president of Kahler-Craft Distributors , Inc. , Newburgh , N.Y. . The idea of a central tank with

## 5.4. Visualizing with pyLDAvis

In [12]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
 
NUM_TOPICS = 10
 
# The token pattern is a raw string so the regex escapes are passed through unchanged.
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)
 
# Build a Latent Dirichlet Allocation Model
# (the `n_topics` argument of older scikit-learn releases is now `n_components`)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
 
text = "The economy is working better than ever"
x = lda_model.transform(vectorizer.transform([text]))[0]
# The topic weights of a document sum to 1
print(x, x.sum())



[0.02500375 0.77496241 0.02500723 0.02500086 0.02500835 0.02500536
 0.02500005 0.02501147 0.02500001 0.02500051] 1.0


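LDA is an iterative algorithm. In the initialization step each word is assigned to a random topic; the algorithm then repeatedly visits each word and reassigns it to a topic by asking the following two questions:

- What is the probability that the word belongs to the topic?
- What is the probability that the document was generated by the topic?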
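
The loop described above corresponds closely to collapsed Gibbs sampling, which the sketch below implements on a toy corpus. This is an assumption about which inference scheme the description refers to: scikit-learn's `LatentDirichletAllocation` uses variational Bayes instead, so this is not what runs in the cells above. The function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy documents are all hypothetical.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # total words assigned to topic k
    assign = []

    # Initialization: assign every word occurrence to a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of topic t: (how much doc d uses t) * (how much t uses word w).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and put the word back into the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Toy corpus of word ids: ids 0-2 and ids 3-5 never co-occur (made-up data).
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
doc_topic, topic_word = gibbs_lda(docs, num_topics=2, vocab_size=6)
for counts in doc_topic:
    print(counts)  # per-document counts of words assigned to each topic
```

The two factors in `weights` line up with the two questions in the list: one measures how strongly the document is associated with a topic, the other how strongly the topic is associated with the word.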

Because LDA yields both word-topic and document-topic probabilities, its results are easy to visualize. The pyLDAvis library can be used for this.

In [13]:
# Note: recent pyLDAvis releases (3.4 and later) expose this module as pyLDAvis.lda_model.
import pyLDAvis.sklearn
 
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer, mds='tsne')
panel



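How to read the pyLDAvis panel:

- A topic drawn with a larger circle appears more frequently in the corpus.

- The more similar two topics are, the closer together they are plotted.

- Selecting a topic shows the list of words that best represent it.

- The displayed measures show how frequently a word occurs and how distinctive it is. The word-importance weighting can be changed by adjusting the lambda (λ) value on the slider at the side.

- Hovering over a word resizes the topic circles to reflect how strongly that word represents each topic.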
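
The effect of the λ slider can be illustrated with the relevance score used by the LDAvis method, `relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w))`. The sketch below applies it to two made-up terms; the probabilities are invented for illustration and are not taken from the model trained above.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """LDAvis-style relevance of a term for a topic, for a weight lam in [0, 1]."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Toy numbers (made up): "one" is frequent everywhere,
# "missile" is rare overall but concentrated in this topic.
terms = {
    "one": {"p_w_given_t": 0.05, "p_w": 0.05},
    "missile": {"p_w_given_t": 0.02, "p_w": 0.001},
}

for lam in (1.0, 0.0):
    ranked = sorted(terms, key=lambda t: relevance(lam=lam, **terms[t]), reverse=True)
    print(lam, ranked)
# 1.0 ['one', 'missile']
# 0.0 ['missile', 'one']
```

With λ = 1 terms are ranked purely by frequency within the topic, so generic words dominate; lowering λ favors words that are distinctive for the topic.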