# LDA (Latent Dirichlet Allocation)

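- Topic modeling: the process of discovering topics in a collection of documents
- LDA is a representative topic modeling algorithm
- Assumptions:
  "A document is a mixture of topics" & "each topic generates words according to a probability distribution"
- Given data, LDA traces back the process by which the documents were generated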

### 'To write this document I will include these topics, and for these topics I will use these words.'

## 1. Process
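* 1) Choose N, the number of words in the document
* 2) Choose the document's topic mixture from a probability distribution (e.g. sports 60%, fruit 40%)
* 3) Choose each word in the document
- ( Pick a topic T at random from the topic distribution -> sports with probability 60%, fruit with probability 40% )
- ( Pick a word for the document from the word distribution of the chosen topic T )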
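
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the generative story, not how gensim or scikit-learn fit a model. The toy topics, their word probabilities, the `alpha` values, and the helper names `sample_dirichlet` and `generate_document` are all assumptions made up for this example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, alpha, n_words, rng):
    """Generate one document of n_words words (step 1 fixes n_words)."""
    topics = list(topic_word_probs)
    # Step 2: draw this document's topic mixture
    mixture = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Step 3, first part: pick a topic according to the mixture
        topic = rng.choices(topics, weights=mixture, k=1)[0]
        # Step 3, second part: pick a word from that topic's word distribution
        vocab = list(topic_word_probs[topic])
        weights = list(topic_word_probs[topic].values())
        words.append(rng.choices(vocab, weights=weights, k=1)[0])
    return mixture, words

# Hypothetical toy topics; words and probabilities are invented for illustration
toy_topics = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "fruit": {"apple": 0.6, "banana": 0.4},
}
rng = random.Random(0)
mixture, words = generate_document(toy_topics, alpha=[1.0, 1.0], n_words=8, rng=rng)
print(mixture)  # topic proportions of this document, summing to 1
print(words)    # the generated words
```

Training LDA runs this story in reverse: only the words are observed, and the topic mixtures and word distributions are inferred.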

## 2. LSA vs LDA

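- 1) LSA: reduces the dimensionality of the DTM (document-term matrix) and groups words that lie close together in the reduced space into topics
- 2) LDA: extracts topics by jointly estimating the probability that a word belongs to a topic and the probability that a topic is present in a document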
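
One way to read the LDA line: once both sets of probabilities are known, the probability of seeing a word in a document is a sum over topics of P(topic | document) times P(word | topic). The sketch below works through that arithmetic with hypothetical numbers; `doc_topic`, `topic_word`, and `word_prob` are names invented for this example.

```python
# Hypothetical values: P(topic | document) and P(word | topic)
doc_topic = {"sports": 0.6, "fruit": 0.4}
topic_word = {
    "sports": {"game": 0.5, "team": 0.3, "apple": 0.2},
    "fruit": {"game": 0.1, "team": 0.1, "apple": 0.8},
}

def word_prob(word, doc_topic, topic_word):
    """P(word | document) as a topic-weighted mixture of word probabilities."""
    return sum(
        p_topic * topic_word[topic].get(word, 0.0)
        for topic, p_topic in doc_topic.items()
    )

for w in ("game", "team", "apple"):
    print(w, round(word_prob(w, doc_topic, topic_word), 3))
```

For "game" this is 0.6 * 0.5 + 0.4 * 0.1 = 0.34, and the three word probabilities sum to 1.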

## 3. Practice 1
( uses the same data as the LSA practice )

In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords

dataset = fetch_20newsgroups(shuffle=True,random_state=1, remove=('headers','footers','quotes'))
documents = dataset.data

news_df = pd.DataFrame({'document':documents})
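# regex=True is required in pandas >= 2.0, where str.replace treats the pattern literally by default
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z]", " ", regex=True)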
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())

stop_words = stopwords.words('english')
tokenized_doc = news_df['clean_doc'].apply(lambda x : x.split())
tokenized_doc = tokenized_doc.apply(lambda x : [item for item in x if item not in stop_words])

In [4]:
news_df['clean_doc'].head()

0    well sure about story seem biased what disagre...
1    yeah expect people read actually accept hard a...
2    although realize that principle your strongest...
3    notwithstanding legitimate fuss about this pro...
4    well will have change scoring playoff pool unf...
Name: clean_doc, dtype: object

In [3]:
tokenized_doc.head()

0    [well, sure, story, seem, biased, disagree, st...
1    [yeah, expect, people, read, actually, accept,...
2    [although, realize, principle, strongest, poin...
3    [notwithstanding, legitimate, fuss, proposal, ...
4    [well, change, scoring, playoff, pool, unfortu...
Name: clean_doc, dtype: object

### 1) Integer encoding & building the vocabulary
- Encode each word as an integer & record how often each word occurs in each news post
- Each word becomes a (word_id, word_frequency) pair

In [8]:
from gensim import corpora
dictionary = corpora.Dictionary(tokenized_doc)



In [9]:
corpus = [dictionary.doc2bow(text) for text in tokenized_doc]

In [10]:
# Print (word_id, word_frequency) for every word of the news post at index 5 (the 6th post)
corpus[5]

[(49, 1),
 (83, 1),
 (150, 1),
 (213, 1),
 (214, 1),
 (215, 1),
 (216, 1),
 (217, 1),
 (218, 2),
 (219, 1),
 (220, 1),
 (221, 1),
 (222, 1),
 (223, 1),
 (224, 2),
 (225, 1),
 (226, 1),
 (227, 1),
 (228, 1),
 (229, 1),
 (230, 1)]
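
To make the (word_id, word_frequency) format concrete, here is a minimal pure-Python sketch of the same idea. It is not gensim's implementation, and the ids it assigns will generally differ from those of `corpora.Dictionary`. The helpers `build_vocab` and `to_bow` and the two toy documents are assumptions for illustration.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (word_id, word_frequency) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy tokenized documents (hypothetical)
docs = [["well", "sure", "story", "seem"], ["well", "change", "scoring", "change"]]
token2id = build_vocab(docs)
print(to_bow(docs[1], token2id))  # [(0, 1), (4, 2), (5, 1)]
```

A frequency of 2, as for ids 218 and 224 in the output above, simply means the word occurs twice in that post.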

### 2) Training LDA
- The news data has 20 categories
- Train an LDA model with the number of topics set to 20

In [11]:
import gensim
NUM_TOPICS = 20
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS,
                                          id2word=dictionary, passes=15)

topics = ldamodel.print_topics(num_words=4)

In [12]:
for topic in topics:
    print(topic)

(0, '0.010*"nrhj" + 0.007*"wwiz" + 0.006*"bxom" + 0.006*"gizw"')
(1, '0.030*"windows" + 0.014*"color" + 0.011*"card" + 0.010*"video"')
(2, '0.023*"would" + 0.022*"thanks" + 0.020*"anyone" + 0.020*"please"')
(3, '0.017*"period" + 0.010*"power" + 0.009*"play" + 0.006*"scorer"')
(4, '0.005*"hitter" + 0.004*"innings" + 0.004*"inning" + 0.004*"pitched"')
(5, '0.007*"control" + 0.007*"guns" + 0.006*"firearms" + 0.006*"university"')
(6, '0.016*"file" + 0.011*"available" + 0.009*"program" + 0.009*"information"')
(7, '0.019*"game" + 0.017*"team" + 0.015*"year" + 0.013*"games"')
(8, '0.014*"health" + 0.010*"medical" + 0.010*"pain" + 0.008*"disease"')
(9, '0.017*"drive" + 0.015*"system" + 0.012*"chip" + 0.011*"scsi"')
(10, '0.011*"government" + 0.007*"armenian" + 0.007*"people" + 0.006*"armenians"')
(11, '0.013*"book" + 0.011*"jesus" + 0.008*"matthew" + 0.006*"word"')
(12, '0.020*"window" + 0.010*"motif" + 0.010*"using" + 0.009*"widget"')
(13, '0.013*"church" + 0.009*"water" + 0.009*"cover" + 0.0

### 3) Visualizing LDA

In [13]:
import pyLDAvis.gensim  # in pyLDAvis >= 3.0 this module is named pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

In [14]:
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(vis)

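- Each circle on the left represents one of the 20 topics
- Distance between circles: how similar the topics are ( overlapping circles = similar topics )
- The right side shows the terms that make up the selected topic and how much each one contributes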
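
To give a feel for what "distance between topics" can mean, the sketch below computes the Jensen-Shannon divergence between topic-word distributions. As far as I know, pyLDAvis uses this divergence by default before projecting the topics to two dimensions, but treat that as an assumption rather than a description of its internals. The three toy topics are invented for this example.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic-word distributions over the same 4-word vocabulary
topic_a = [0.5, 0.3, 0.1, 0.1]
topic_b = [0.4, 0.4, 0.1, 0.1]      # close to topic_a
topic_c = [0.05, 0.05, 0.4, 0.5]    # very different from topic_a

print(js_divergence(topic_a, topic_b))  # small: the circles would sit close together
print(js_divergence(topic_a, topic_c))  # larger: the circles would sit far apart
```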

### 4) Topic distribution per document

In [16]:
for i, topic_list in enumerate(ldamodel[corpus]):
    if i==5:
        break
    print('Topic proportions of document', i, ':', topic_list)

Topic proportions of document 0 : [(10, 0.29285672), (11, 0.16073646), (16, 0.4919449), (18, 0.041558668)]
Topic proportions of document 1 : [(4, 0.025609754), (7, 0.25793052), (13, 0.026179668), (16, 0.45710588), (17, 0.21488152)]
Topic proportions of document 2 : [(10, 0.27561134), (16, 0.57688445), (18, 0.1337945)]
Topic proportions of document 3 : [(2, 0.07093574), (3, 0.08743375), (8, 0.016644193), (9, 0.31374454), (10, 0.08422507), (14, 0.018146532), (16, 0.39916867)]
Topic proportions of document 4 : [(7, 0.6907137), (16, 0.27595297)]
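
The lists above do not contain all 20 topics, and their proportions do not quite sum to 1. gensim's `LdaModel` drops topics below its `minimum_probability` threshold (0.01 by default) when it reports a document's topics. The sketch below uses the numbers printed for document 0 to pick the dominant topic and to measure the probability mass that was left out; the variable names are chosen for this example only.

```python
# Topic proportions printed above for document 0
doc0 = [(10, 0.29285672), (11, 0.16073646), (16, 0.4919449), (18, 0.041558668)]

# Dominant topic: the (topic_id, proportion) pair with the largest proportion
top_topic, top_prop = max(doc0, key=lambda pair: pair[1])

# Mass of the topics that were listed; the rest belongs to filtered-out topics
listed_mass = sum(prop for _, prop in doc0)

print(top_topic, round(top_prop, 4))  # 16 0.4919
print(round(1 - listed_mass, 4))      # 0.0129
```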


In [19]:
def make_topic_table(ldamodel, corpus, texts):
    rows = []
    
    for i, topic_list in enumerate(ldamodel[corpus]):
        # With per_word_topics=True the model returns a tuple whose first element is the topic distribution
        doc = topic_list[0] if ldamodel.per_word_topics else topic_list
        # Sort this document's topics by proportion, highest first
        doc = sorted(doc, key=lambda x: x[1], reverse=True)
        
        if doc:
            # Keep only the dominant topic of the document
            topic_num, prop_topic = doc[0]
            rows.append([int(topic_num), round(prop_topic, 4), topic_list])
    
    # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
    return pd.DataFrame(rows)

In [20]:
Topic_table = make_topic_table(ldamodel, corpus, tokenized_doc)
Topic_table = Topic_table.reset_index()

In [21]:
Topic_table.columns = ['Document #', 'TOP Topic','Top Topic %','Each Topic %']

In [22]:
Topic_table.head()

Unnamed: 0,Document #,TOP Topic,Top Topic %,Each Topic %
0,0,16.0,0.492,"[(10, 0.2928441), (11, 0.16073583), (16, 0.491..."
1,1,16.0,0.4571,"[(4, 0.025609752), (7, 0.25792506), (13, 0.026..."
2,2,16.0,0.5769,"[(10, 0.2756111), (16, 0.57688487), (18, 0.133..."
3,3,16.0,0.3992,"[(2, 0.07094741), (3, 0.08743723), (8, 0.01664..."
4,4,7.0,0.6906,"[(7, 0.69064075), (16, 0.27602592)]"


## 4. Practice 2

### 1) Loading the data

In [23]:
import pandas as pd
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
text = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')

In [24]:
text.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [25]:
len(text)

1103663

In [26]:
text = text[['headline_text']].copy()  # copy so that later column assignments do not trigger SettingWithCopyWarning

### 2) Text preprocessing
- Remove stopwords & lemmatize & remove short words

In [27]:
import nltk

In [28]:
text['headline_text'] = text.apply(lambda row:nltk.word_tokenize(row['headline_text']), axis=1)

In [29]:
text.head(3)

Unnamed: 0,headline_text
0,"[aba, decides, against, community, broadcastin..."
1,"[act, fire, witnesses, must, be, aware, of, de..."
2,"[a, g, calls, for, infrastructure, protection,..."


a) Remove stopwords

In [31]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes the membership test fast on ~1.1M headlines
text['headline_text'] = text['headline_text'].apply(lambda x : [word for word in x if word not in (stop)])

In [33]:
text.head(3)

Unnamed: 0,headline_text
0,"[aba, decides, community, broadcasting, licence]"
1,"[act, fire, witnesses, must, aware, defamation]"
2,"[g, calls, infrastructure, protection, summit]"


b) Lemmatization ( verbs are reduced to their base form: third-person singular -> base form, past tense -> present tense )

In [34]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()  # create once instead of once per word
text['headline_text'] = text['headline_text'].apply(lambda x : 
                                                    [lemmatizer.lemmatize(word,pos='v') for word in x])

In [35]:
text.head(3)

Unnamed: 0,headline_text
0,"[aba, decide, community, broadcast, licence]"
1,"[act, fire, witness, must, aware, defamation]"
2,"[g, call, infrastructure, protection, summit]"


c) Remove words of length 3 or less

In [36]:
tokenized_doc = text['headline_text'].apply(lambda x : [word for word in x if len(word) > 3])

In [37]:
tokenized_doc.head(3)

0       [decide, community, broadcast, licence]
1      [fire, witness, must, aware, defamation]
2    [call, infrastructure, protection, summit]
Name: headline_text, dtype: object
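
Steps a) to c) can also be read as one small pipeline per headline. The sketch below mirrors that order in plain Python. To stay runnable without NLTK data it takes the stopword set and the lemmatizing function as arguments; `preprocess`, `toy_stop_words`, and `toy_lemmas` are hypothetical stand-ins, not NLTK's actual stopword list or lemmatizer.

```python
def preprocess(tokens, stop_words, lemmatize, min_len=4):
    """Apply stopword removal, lemmatization, and the length filter, in that order."""
    kept = [t for t in tokens if t not in stop_words]   # a) remove stopwords
    lemmas = [lemmatize(t) for t in kept]               # b) lemmatize
    return [t for t in lemmas if len(t) >= min_len]     # c) drop words of length 3 or less

# Hypothetical stand-ins so the sketch runs on its own
toy_stop_words = {"against", "be", "of", "for", "a"}
toy_lemmas = {"decides": "decide", "broadcasting": "broadcast"}

tokens = ["aba", "decides", "against", "community", "broadcasting", "licence"]
print(preprocess(tokens, toy_stop_words, lambda t: toy_lemmas.get(t, t)))
# ['decide', 'community', 'broadcast', 'licence']
```

With the real data, `stop_words` would be the NLTK stopword list and `lemmatize` a wrapper around `WordNetLemmatizer().lemmatize(word, pos='v')` as in the cells above.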

### 3) TF-IDF matrix
- Built after detokenizing ( joining the tokens back into strings )

In [38]:
detokenized_doc = []
for i in range(len(text)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)
    
text['headline_text'] = detokenized_doc

In [39]:
text['headline_text'].head()

0       decide community broadcast licence
1       fire witness must aware defamation
2    call infrastructure protection summit
3                   staff aust strike rise
4      strike affect australian travellers
Name: headline_text, dtype: object

Detokenization worked as expected. Next, build the TF-IDF matrix with TfidfVectorizer.


In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english',max_features=1000)
X = vectorizer.fit_transform(text['headline_text'])

X.shape

(1103663, 1000)
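
For intuition about what each entry of `X` holds, here is a hand-rolled sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It ignores `stop_words` and `max_features`, and the helper name `tfidf` is an assumption for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized per row."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty documents
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Two of the detokenized headlines shown above
headlines = ["staff aust strike rise", "strike affect australian travellers"]
rows = tfidf(headlines)
print(rows[0])
```

"strike" occurs in both headlines, so it gets a lower weight than "staff", which occurs in only one.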

### 4) Topic Modeling

In [42]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

In [43]:
lda_model = LDA(n_components=10, learning_method='online', random_state=42, max_iter=1)

In [44]:
lda_top = lda_model.fit_transform(X)

In [45]:
print(lda_model.components_.shape)
lda_model.components_

(10, 1000)


array([[1.00001915e-01, 1.00002295e-01, 1.00014575e-01, ...,
        1.00002978e-01, 1.00004944e-01, 7.01940718e+02],
       [1.00001760e-01, 1.00001208e-01, 1.00004019e-01, ...,
        1.00008561e-01, 1.00009485e-01, 1.00002809e-01],
       [1.00001980e-01, 1.00000673e-01, 1.00003029e-01, ...,
        1.00002936e-01, 1.00006167e-01, 1.00003873e-01],
       ...,
       [1.00001496e-01, 1.00000684e-01, 1.00001995e-01, ...,
        1.00004461e-01, 1.00006882e-01, 1.00003195e-01],
       [1.00001600e-01, 1.00001237e-01, 1.00005101e-01, ...,
        1.75869640e+03, 3.94800656e+02, 1.00004471e-01],
       [1.00001597e-01, 8.66862949e+02, 1.00011441e-01, ...,
        1.00004254e-01, 1.00001923e-01, 1.00003813e-01]])
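
The values in `components_` are pseudo-counts rather than probabilities. The many entries close to 0.1 appear to correspond to the default `topic_word_prior` of `1 / n_components` (here 1 / 10), meaning the term received almost no weight in that topic. Dividing each row by its sum turns it into a word distribution for that topic. The sketch below shows this on a hypothetical 2-topic, 4-term matrix; `normalize_rows`, `top_terms`, and the toy values are assumptions for illustration.

```python
def normalize_rows(components):
    """Turn each row of pseudo-counts into a probability distribution."""
    normalized = []
    for row in components:
        total = sum(row)
        normalized.append([v / total for v in row])
    return normalized

def top_terms(row, feature_names, n=2):
    """Return the n terms with the largest weight in one topic row."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:n]
    return [(feature_names[i], round(row[i], 3)) for i in order]

# Hypothetical matrix in the style of components_ (0.1 marks terms with almost no support)
toy_components = [
    [0.1, 120.0, 0.1, 80.0],
    [60.0, 0.1, 40.0, 0.1],
]
toy_terms = ["court", "police", "jail", "sydney"]

probs = normalize_rows(toy_components)
for idx, row in enumerate(probs):
    print('Topic %d :' % (idx + 1), top_terms(row, toy_terms))
```

Normalizing does not change the ranking of terms within a topic, so the top words are the same whether raw pseudo-counts or probabilities are sorted.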

In [46]:
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print('Topic %d :' %(idx+1), [(feature_names[i], topic[i].round(2)) for i in topic.argsort()[:-n-1:-1]])

In [47]:
get_topics(lda_model.components_, terms)

Topic 1 : [('queensland', 8131.82), ('perth', 6335.39), ('canberra', 6288.68), ('house', 6219.91), ('donald', 5757.13)]
Topic 2 : [('australia', 13989.98), ('court', 6181.41), ('live', 5656.81), ('years', 5563.76), ('jail', 4618.46)]
Topic 3 : [('charge', 8435.22), ('melbourne', 7615.09), ('north', 6240.89), ('kill', 6092.11), ('year', 5981.94)]
Topic 4 : [('government', 8662.56), ('home', 5750.72), ('warn', 5154.6), ('turnbull', 4946.32), ('health', 4266.21)]
Topic 5 : [('police', 12142.06), ('sydney', 8596.69), ('south', 6254.86), ('death', 6073.78), ('test', 5593.3)]
Topic 6 : [('interview', 5889.12), ('state', 4857.79), ('people', 4563.41), ('life', 4390.62), ('arrest', 4326.08)]
Topic 7 : [('election', 7652.81), ('adelaide', 6840.62), ('make', 6197.68), ('face', 5352.78), ('miss', 4602.49)]
Topic 8 : [('trump', 13042.38), ('report', 5560.34), ('market', 5093.57), ('rural', 4483.41), ('china', 4420.49)]
Topic 9 : [('australian', 11387.47), ('attack', 6825.45), ('rise', 4162.36), ('