# 5. Topic Modeling

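- **Semantic relations:** One of the central tasks in natural language processing research. They serve as a basic building block in many areas, including named entity recognition, query expansion, and clustering.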

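- **Topic modeling**

    - An unsupervised learning method.

    - An algorithm for discovering topics in a large, unstructured collection of documents (= unstructured data) without labels. Here, a topic means a group of words that are likely to appear in the same context or that point to a similar theme.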

- **Topic modeling algorithms**

    - LDA : Latent Dirichlet Allocation – The one we’ll be focusing on in this tutorial. Its foundations are Probabilistic Graphical Models
    - LSA or LSI: Latent Semantic Analysis or Latent Semantic Indexing – Uses Singular Value Decomposition (SVD) on the Document-Term Matrix. Based on Linear Algebra
    - NMF: Non-Negative Matrix Factorization – Based on Linear Algebra

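- **Statistical similarity:** Measures how two words in a given document collection are related by using statistical quantities; that is, similarity is inferred through statistical modeling.
- **Generative model:** Assumes some probability distribution and its parameters, then generates data from them through a random process.
- A topic model infers topics by using contextual cues to cluster words with similar meanings.
- Performance improves with more data and degrades with less, because the topics expressed in a collection show up more clearly as the collection grows.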
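
To make the generative-model idea concrete, here is a minimal sketch of an LDA-style generative process using NumPy. The vocabulary, the number of topics, the document length, and the Dirichlet parameters are all made-up assumptions for illustration; this is not how gensim or scikit-learn fit a model, only the "story" such models assume about how documents arise.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["economy", "market", "bank", "game", "team", "score"]  # toy vocabulary (assumption)
num_topics = 2
doc_length = 8

# Each topic is a distribution over the vocabulary, drawn from a Dirichlet prior.
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=num_topics)

def generate_document(alpha=0.5):
    # 1) Draw this document's topic proportions from a Dirichlet prior.
    theta = rng.dirichlet(np.full(num_topics, alpha))
    words = []
    for _ in range(doc_length):
        # 2) Pick a topic for this word position, 3) then a word from that topic.
        z = rng.choice(num_topics, p=theta)
        w = rng.choice(len(vocab), p=topic_word[z])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document()
print(theta, words)
```

Topic modeling runs this story in reverse: given only the words, it estimates `theta` and `topic_word`.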


## 5.1. LDA
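- The most widely used topic model is LDA (Latent Dirichlet Allocation), in which the topic proportions are assumed to follow a Dirichlet distribution.
- It is often described as a more advanced version of LSI (Latent Semantic Indexing).
    - LSI: uses Singular Value Decomposition (a linear-algebra technique) to capture the relationship between words and their (latent) concepts.
    - LSI represents documents by reducing the dimensionality of the term-document matrix.
    - LSA vs LDA: https://www.datasciencecentral.com/profiles/blogs/a-tale-about-lda2vec-when-lda-meets-word2vec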
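
As a rough illustration of the LSI idea, the sketch below applies a truncated SVD to a tiny, made-up document-term count matrix with NumPy. The matrix, the term list, and the choice of `k = 2` are assumptions for the example only.

```python
import numpy as np

# Rows are documents, columns are terms: a toy document-term count matrix (made up).
terms = ["economy", "market", "bank", "game", "team"]
X = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                        # number of latent dimensions to keep
docs_k = U[:, :k] * s[:k]    # documents in the k-dimensional latent space
terms_k = Vt[:k].T           # terms in the same latent space
X_k = docs_k @ Vt[:k]        # rank-k approximation of the original matrix

print(docs_k.round(2))
print(np.linalg.norm(X - X_k))  # reconstruction error from dropping dimensions
```

Keeping only the largest singular values is what reduces the dimensionality; documents that use related terms end up with similar coordinates in `docs_k`.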

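#### - The term distribution within each document is observable, but the term distribution of each topic is not known in advance. (Otherwise one would have to read through every document to get a rough idea of which topics exist.)
#### - LDA therefore assumes a latent Dirichlet distribution behind each document and estimates the topics' term distributions from the directly observable term distributions of the documents in the collection.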

<img src = "https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/5f1038ad42ed8a4428e395c96d57f83d201ef3b3/3-Figure1-1.png">

source = Blei et al., 2003. http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
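
Assuming the figure above is the plate diagram from the cited paper, it corresponds to the following joint distribution from Blei et al. (2003) for a single document with $N$ words. Here $\theta$ is the document's topic proportions, $z_n$ the topic assigned to the $n$-th word, $w_n$ the word itself, and $\alpha$, $\beta$ the model parameters:

$$
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
$$

Only $\mathbf{w}$ is observed; $\theta$ and $\mathbf{z}$ are the latent quantities that inference has to recover.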

## 5.2. Topic Modeling with Gensim
Reference: https://nlpforhackers.io/topic-modeling/

Load the Brown corpus from nltk.

#### brown corpus : https://en.wikipedia.org/wiki/Brown_Corpus

In [5]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /Users/yoon/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [6]:
from nltk.corpus import brown 
data = [] 
for fileid in brown.fileids():
    document = ' '.join(brown.words(fileid))
    data.append(document)
 
NO_DOCUMENTS = len(data)
print(NO_DOCUMENTS)
print(data[:5])

500


In [7]:
import re
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords
 
# Requires the NLTK 'stopwords' and 'punkt' data (e.g. nltk.download('stopwords')).
NUM_TOPICS = 10
STOPWORDS = stopwords.words('english')
 
def clean_text(text):
    tokenized_text = word_tokenize(text.lower())
    # Keep non-stopword tokens that start with at least three letters or hyphens
    cleaned_text = [t for t in tokenized_text if t not in STOPWORDS and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', t)]
    return cleaned_text
 
# Before using gensim, tokenize each document and remove stopwords.
tokenized_data = []
for text in data:
    tokenized_data.append(clean_text(text))
  
# Use gensim's Dictionary to map each word to an integer id.
dictionary = corpora.Dictionary(tokenized_data)
 
# Convert each tokenized document to a bag-of-words vector.
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
 
# As a check, look at the document at index 20: [(word_id, count), ...]
print(corpus[20])
# [(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2),  ...
 
# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)
 
# Optional) Build the LSI model
#lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

[(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2), (44, 2), (45, 2), (46, 2), (47, 2), (49, 1), (50, 1), (53, 1), (56, 1), (59, 1), (60, 1), (66, 1), (75, 1), (80, 1), (98, 1), (101, 1), (106, 1), (117, 1), (129, 1), (130, 2), (132, 2), (135, 2), (140, 1), (141, 2), (143, 4), (144, 2), (145, 2), (166, 1), (195, 1), (198, 3), (219, 1), (220, 4), (221, 3), (223, 1), (229, 4), (230, 4), (231, 2), (235, 1), (236, 1), (242, 2), (246, 2), (255, 1), (263, 1), (269, 1), (270, 5), (271, 2), (275, 5), (276, 1), (278, 4), (280, 2), (281, 1), (307, 2), (310, 1), (311, 3), (313, 1), (314, 5), (318, 4), (322, 1), (336, 1), (338, 3), (339, 1), (340, 1), (341, 1), (345, 1), (346, 1), (351, 1), (354, 1), (355, 1), (366, 3), (368, 13), (370, 1), (372, 1), (374, 3), (377, 3), (381, 3), (386, 1), (392, 6), (396, 1), (401, 1), (412, 2), (426, 2), (428, 2), (431, 2), (434, 2), (439, 2), (444, 1), (450, 1), (452, 1), (462, 1), (465, 1), (467, 1), (470, 1), (478, 1), (483, 1), (
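
The `(word_id, count)` pairs above are a bag-of-words encoding. The sketch below shows the idea in plain Python. It is not gensim's implementation: `build_vocabulary` and `to_bow` are hypothetical helpers, and the ids they assign need not match the ones gensim's `Dictionary` would assign.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["economy", "market", "economy"], ["market", "team"]]
token2id = build_vocabulary(docs)
print(to_bow(["economy", "economy", "team", "unseen"], token2id))
# → [(0, 2), (2, 1)]
```

Word order is discarded and unknown words are dropped, which is why a new document can only be described using vocabulary seen during training.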

In [8]:
print("LDA Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for each topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
 
print("=" * 20)


# Optional) LSI
#print("LSI Model:")
#for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for each topic
    #print("Topic #%s:" % idx, lsi_model.print_topic(idx, 10))
#print("=" * 20)

LDA Model:
Topic #0: 0.005*"one" + 0.004*"would" + 0.004*"said" + 0.003*"time" + 0.003*"could" + 0.003*"two" + 0.003*"made" + 0.002*"many" + 0.002*"new" + 0.002*"may"
Topic #1: 0.007*"one" + 0.004*"time" + 0.004*"would" + 0.003*"may" + 0.003*"first" + 0.003*"said" + 0.003*"like" + 0.003*"even" + 0.003*"could" + 0.003*"new"
Topic #2: 0.006*"one" + 0.005*"would" + 0.004*"new" + 0.003*"time" + 0.003*"said" + 0.003*"could" + 0.003*"may" + 0.002*"like" + 0.002*"first" + 0.002*"two"
Topic #3: 0.006*"one" + 0.005*"would" + 0.004*"new" + 0.004*"said" + 0.004*"man" + 0.003*"could" + 0.003*"like" + 0.003*"two" + 0.003*"time" + 0.003*"even"
Topic #4: 0.006*"one" + 0.004*"would" + 0.003*"could" + 0.003*"first" + 0.003*"also" + 0.003*"said" + 0.003*"two" + 0.003*"may" + 0.003*"new" + 0.002*"even"
Topic #5: 0.006*"one" + 0.003*"would" + 0.003*"may" + 0.003*"said" + 0.003*"man" + 0.003*"even" + 0.003*"new" + 0.002*"could" + 0.002*"time" + 0.002*"two"
Topic #6: 0.009*"would" + 0.006*"one" + 0.004*"sai

Next, check how the trained model handles a new input (a new document). We pass in a new document stored in the variable text.

In [9]:
text = "The economy is working better than ever"
bow = dictionary.doc2bow(clean_text(text))
 
print(lda_model[bow])
# [(0, 0.020005183), (1, 0.020005869), (2, 0.02000626), (3, 0.020005472), (4, 0.020009108), (5, 0.020005926), (6, 0.81994385), (7, 0.020006068), (8, 0.020006327), (9, 0.020005994)]
 
#print(lsi_model[bow])

[(0, 0.020006005), (1, 0.020006472), (2, 0.020005895), (3, 0.81994534), (4, 0.020006869), (5, 0.020005459), (6, 0.020006683), (7, 0.020005727), (8, 0.020005424), (9, 0.020006163)]


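The LDA output shows how the document (```text```) is distributed across the topics. Because LDA training starts from a random initialization, the exact numbers change from run to run. For example, a run that returned:
[(0, 0.020229582), (1, 0.48642197), (2, 0.020894188), (3, 0.020058075), (4, 0.022410348), (5, 0.025939714), (6, 0.20046122), (7, 0.13457063), (8, 0.048185956), (9, 0.02082831)]
assigns the largest share, 0.486, to topic 1, so topic 1 explains this text best.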
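
Reading off the dominant topic can also be done in code. This small sketch reuses the example distribution quoted above (hard-coded here, not recomputed from a model) and picks the pair with the highest probability.

```python
# Topic distribution copied from the example above: (topic_id, probability) pairs.
doc_topics = [(0, 0.020229582), (1, 0.48642197), (2, 0.020894188), (3, 0.020058075),
              (4, 0.022410348), (5, 0.025939714), (6, 0.20046122), (7, 0.13457063),
              (8, 0.048185956), (9, 0.02082831)]

# The dominant topic is the pair with the largest probability.
dominant_topic, weight = max(doc_topics, key=lambda pair: pair[1])

# The probabilities over all topics should add up to (approximately) 1.
total = sum(p for _, p in doc_topics)

print(dominant_topic, round(weight, 3), round(total, 3))
# → 1 0.486 1.0
```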

As shown below, we can also compute the similarity between the query and every document to find the documents most similar to the text.

In [10]:
from gensim import similarities
 
# Index the whole corpus in topic space; MatrixSimilarity uses cosine similarity.
lda_index = similarities.MatrixSimilarity(lda_model[corpus])
 
# Query the index with the new document's topic vector.
# (Named `sims` so the imported `similarities` module is not overwritten.)
sims = lda_index[lda_model[bow]]
# Sort by similarity, highest first
sims = sorted(enumerate(sims), key=lambda item: -item[1])
 
# Top 10 most similar documents:
print(sims[:10])
# [(104, 0.87591344), (178, 0.86124849), (31, 0.8604598), (77, 0.84932965), (85, 0.84843522), (135, 0.84421808), (215, 0.84184396), (353, 0.84038532), (254, 0.83498049), (13, 0.82832891)]
 
# Show the beginning of the most similar document
document_id, similarity = sims[0]
print(data[document_id][:1000])

[(172, 0.99788624), (473, 0.9976964), (353, 0.9976497), (329, 0.99762857), (388, 0.99759823), (6, 0.99753135), (275, 0.9974576), (366, 0.9973526), (17, 0.99733174), (140, 0.99733174)]
Her father , James Upton , was the Upton mentioned by Hawthorne in the famous introduction to the Scarlet Letter as one of those who came into the old custom house to do business with him as the surveyor of the port . A gentleman of the old school , Mr. Upton possessed intellectual power , ample means , and withal , was a devoted Christian . The daughter profited from his interest in scientific and philosophical subjects . Her mother also was a person of superior mind and broad interests . There is clear evidence that Lucy from childhood had an unusual mind . She possessed an observant eye , a retentive memory , and a critical faculty . When she was nine years old , she wrote a description of a store she had visited . She named 48 items , and said there were `` many more things which it would take too lon
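
The similarity scores above are cosine similarities between topic vectors. The following NumPy sketch shows that computation on made-up topic distributions; `cosine_similarities` is a hypothetical helper, not part of gensim.

```python
import numpy as np

def cosine_similarities(query, matrix):
    """Cosine similarity between one vector and every row of a matrix."""
    query = query / np.linalg.norm(query)
    rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return rows @ query

# Made-up topic distributions for three documents over three topics.
doc_topics = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
])
query = np.array([0.9, 0.05, 0.05])

sims = cosine_similarities(query, doc_topics)
ranking = np.argsort(-sims)   # document indices, most similar first
print(ranking, sims.round(3))
```

When nearly every document puts most of its weight on the same topic, as in the run above, all scores cluster near 1 and the ranking carries little information.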

## 5.3. Building LDA with Scikit-Learn

In [11]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
 
NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)
 
# Build a Latent Dirichlet Allocation Model
# (`n_topics` was renamed to `n_components` in newer scikit-learn releases.)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
 
# Let's see what the first document in the corpus looks like in topic space
print(lda_Z[0])




(500, 10)
[1.05599964e-04 1.05602222e-04 1.05610220e-04 8.17345798e-01
 1.05601929e-04 1.81809313e-01 1.05628347e-04 1.05617189e-04
 1.05613894e-04 1.05615687e-04]


Optional: LSI

    # Build a Latent Semantic Indexing Model
    lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
    lsi_Z = lsi_model.fit_transform(data_vectorized)
    print(lsi_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

    print(lsi_Z[0])

In [12]:
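def print_topics(model, vectorizer, top_n=10):
    # get_feature_names_out() needs scikit-learn >= 1.0;
    # older releases used get_feature_names() instead.
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the top_n largest weights, in descending order
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])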
 
print("LDA Model:")
print_topics(lda_model, vectorizer)
print("=" * 20)

LDA Model:
Topic 0:
[('khrushchev', 29.193331986664763), ('meeting', 23.693247592025543), ('president', 13.994849697532842), ('soviet', 11.564608263819794), ('premier', 11.102588626035738), ('moscow', 9.673522958567643), ('kennedy', 8.201571735951092), ('leader', 7.76360709048388), ('summit', 6.935680574420381), ('laos', 6.7176145626140364)]
Topic 1:
[('cooling', 24.203140709799737), ('house', 10.338661849208936), ('heating', 8.1089572652761), ('heat', 7.365936229853356), ('air', 7.348040239060102), ('conditioning', 6.0034548542256125), ('theaters', 5.030007305126337), ('unit', 4.602971649799021), ('cool', 4.5339465577978535), ('furnace', 3.3211774332173207)]
Topic 2:
[('shelter', 55.35640615660219), ('holmes', 30.140712758836486), ('detective', 23.598133826447366), ('pool', 23.148967720358993), ('locking', 22.65029455541106), ('used', 20.6385924944391), ('bar', 19.798535369238945), ('long', 19.730100972215652), ('cut', 19.726088967561214), ('frame', 19.394749435017427)]
Topic 3:
[('st

LSI : 
```
print("LSI Model:")
print_topics(lsi_model, vectorizer)
print("=" * 20)
```

A new, previously unseen document is transformed as follows.

In [13]:
text = "The economy is working better than ever"
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x)

[0.02500002 0.02500006 0.02500334 0.77497132 0.0250014  0.02500825
 0.02500898 0.02500137 0.025004   0.02500128]


Similarity based on Euclidean distance is computed as follows (a smaller distance means a more similar document).

In [14]:
from sklearn.metrics.pairwise import euclidean_distances
 
def most_similar(x, Z, top_n=5):
    dists = euclidean_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    most_similar = sorted(pairs, key=lambda item: item[1])[:top_n]
    return most_similar
 
similarities = most_similar(x, lda_Z)
document_id, similarity = similarities[0]
print(data[document_id][:1000])

Sen. John L. McClellan of Arkansas and Rep. David Martin of Nebraska are again beating the drums to place the unions under the anti-monopoly laws . Once more the fallacious equation is advanced to argue that since business is restricted under the anti-monopoly laws , there must be a corresponding restriction against labor unions : the law must treat everybody equally . Or , in the words of Anatole France , `` The law in its majestic equality must forbid the rich , as well as the poor , from begging in the streets and sleeping under bridges '' . The public atmosphere that has been generated which makes acceptance of this law a possibility stems from the disrepute into which the labor movement has fallen as a result of Mr. McClellan's hearings into corruption in labor-management relations and , later , into the jurisdictional squabbles that plagued industrial relations at the missile sites . The Senator was shocked by stoppages over allegedly trivial disputes that delayed our missile pro

### (Optional) Plotting words and documents in 2D with SVD
We can use SVD with 2 components (topics) to display words and documents in 2D. The process is really similar. Let’s start with displaying documents since it’s a bit more straightforward.

In [15]:
import pandas as pd
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()

In [16]:
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(data_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)
 

In [17]:
svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(data_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names().
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], vectorizer.get_feature_names_out()
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

## 5.4. Visualizing LDA with PyLDAvis

In [12]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
 
NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)
 
# Build a Latent Dirichlet Allocation Model
# (`n_topics` was renamed to `n_components` in newer scikit-learn releases.)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
lda_Z = lda_model.fit_transform(data_vectorized)
 
text = "The economy is working better than ever"
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x, x.sum())  # also check the sum of all factor values



[0.02500375 0.77496241 0.02500723 0.02500086 0.02500835 0.02500536
 0.02500005 0.02501147 0.02500001 0.02500051] 1.0


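The factors sum to 1 because LDA assumes that a document is generated from a mixture of topics. The goal of LDA is to estimate how much of a document was generated by each topic. In this example, more than half of the document is attributed to the second topic (```0.77496241```).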

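LDA is an iterative algorithm. In the initialization step, each word is assigned to a random topic. The algorithm then repeatedly visits each word and reassigns it to a topic, taking two questions into account:

- What is the probability that the word belongs to the topic?
- What is the probability that the document was generated by the topic?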
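
The random-assignment-then-reassign loop described above matches collapsed Gibbs sampling, one way of fitting LDA. The sketch below is a minimal version on a toy corpus of word ids. Note the assumptions: gensim and scikit-learn use variational inference rather than this sampler, and `gibbs_lda`, its hyperparameters (`alpha`, `beta`), and the corpus are made up for illustration.

```python
import numpy as np

def gibbs_lda(docs, vocab_size, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), num_topics))     # words in doc d assigned to topic k
    topic_word = np.zeros((num_topics, vocab_size))   # times word w is assigned to topic k
    topic_total = np.zeros(num_topics)                # total words assigned to topic k

    # Initialization: assign every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = rng.integers(num_topics, size=len(doc))
        assignments.append(z_doc)
        for w, z in zip(doc, z_doc):
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_total[z] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d, z] -= 1
                topic_word[z, w] -= 1
                topic_total[z] -= 1
                # P(word | topic) times P(topic | document), up to a constant.
                p_word = (topic_word[:, w] + beta) / (topic_total + vocab_size * beta)
                p_topic = doc_topic[d] + alpha
                p = p_word * p_topic
                z = rng.choice(num_topics, p=p / p.sum())
                # Record the new assignment.
                assignments[d][i] = z
                doc_topic[d, z] += 1
                topic_word[z, w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions.
    theta = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)
    return theta, topic_word

# Toy corpus of word ids (made up): words 0-2 and words 3-5 never co-occur.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
theta, topic_word = gibbs_lda(docs, vocab_size=6, num_topics=2)
print(theta.round(2))
```

The two factors `p_word` and `p_topic` are the code counterparts of the two questions in the list above.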

Because of these properties, the LDA result can be visualized easily with the PyLDAvis library.

In [13]:
# Note: pyLDAvis 3.4+ renamed this module to pyLDAvis.lda_model.
import pyLDAvis.sklearn
 
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer, mds='tsne')
panel


PyLDAvis

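- A topic drawn with a larger circle is more prevalent in the corpus.

- The more similar two topics are, the closer together they appear.

- Selecting a topic shows the list of words most representative of that topic.

- The bars show how frequent each word is and how distinctive it is for the topic. The weighting between the two can be changed by adjusting the lambda value on the slider.

- Hovering over a word resizes the topic circles to reflect how strongly that word is associated with each topic.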
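
The lambda slider controls the relevance score that LDAvis uses to rank words within a topic: `lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w))`. The sketch below evaluates it with NumPy on made-up probabilities; `relevance` is a hypothetical helper written for this example, not a pyLDAvis function.

```python
import numpy as np

def relevance(topic_word_probs, word_probs, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w)) for every word."""
    log_p = np.log(topic_word_probs)
    log_lift = log_p - np.log(word_probs)
    return lam * log_p + (1 - lam) * log_lift

words = ["one", "would", "khrushchev", "soviet"]     # toy vocabulary
p_w_given_t = np.array([0.40, 0.30, 0.20, 0.10])      # made-up p(w | topic)
p_w = np.array([0.50, 0.40, 0.06, 0.04])              # made-up corpus-wide p(w)

rankings = {}
for lam in (1.0, 0.0):
    order = np.argsort(-relevance(p_w_given_t, p_w, lam))
    rankings[lam] = [words[i] for i in order]
    print(lam, rankings[lam])
# → 1.0 ['one', 'would', 'khrushchev', 'soviet']
# → 0.0 ['khrushchev', 'soviet', 'one', 'would']
```

With lambda at 1 the ranking follows raw within-topic frequency, so common words dominate; lowering lambda favors words that are rare elsewhere in the corpus.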