# Latent Dirichlet Allocation (LDA)

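    LDA assumes that each document is a mixture of topics and that each
    topic generates words according to a probability distribution.
    Given the data, it traces the document-generation process backwards.
    
    Procedure
    1. Get the number of topics K from the user.
    2. Assign every word to one of the K topics at random.
    3. Repeat the following for every word in every document.
     3-1. For a word w in a document, assume that w itself is assigned to the wrong topic
          while all other words are assigned to the correct topics. Word w is then
          reassigned to a topic according to two criteria:
          P(topic t | document d) : the proportion of words in document d that are assigned to topic t.
          P(word w | topic t) : the share of topic t's assignments, across all documents, that belong to word w.
         
    Difference from LSA
    LSA : reduces the dimensionality of the DTM (document-term matrix) and groups words that lie close together in the reduced space into topics.
    LDA : extracts topics by estimating, as a joint probability, the probability that a word belongs to a topic and the probability that a topic is present in a document.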
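
The reassignment step in 3-1 matches what is usually called collapsed Gibbs sampling. The following is a minimal sketch of that idea in pure Python, not the algorithm scikit-learn runs later in this notebook (its `LatentDirichletAllocation` uses variational Bayes). The function name `lda_gibbs`, the smoothing values `alpha` and `beta`, and the toy word ids are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of lists of integer word ids."""
    rng = random.Random(seed)
    # Step 2: assign every word to one of the K topics at random.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d with topic t
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w has topic t
    topic_total = [0] * num_topics                              # words with topic t overall
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    # Step 3: revisit every word of every document.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = assignments[d][i]
                # Treat the current word as unassigned; all other words stay fixed.
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1
                # Weight of topic t is proportional to
                # P(topic t | document d) * P(word w | topic t), both smoothed.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                t_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1
    return assignments, doc_topic, topic_word

# Hypothetical toy corpus: 3 documents over a 4-word vocabulary (ids 0..3).
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
assignments, doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=4)
print(doc_topic)
```

The smoothing terms keep every topic reachable even when its current count is zero, so a word can always move to a topic that no other word in the document currently uses.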

In [6]:
import pandas as pd

# Skip malformed rows; on_bad_lines (pandas >= 1.3) replaces the removed error_bad_lines flag.
data = pd.read_csv("abcnews-date-text.csv", on_bad_lines="skip")
data.head()

   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers


In [14]:
text = data[['headline_text']].copy()  # explicit copy, so later column assignments do not trigger SettingWithCopyWarning

In [16]:
text.head()

                                       headline_text
0  aba decides against community broadcasting lic...
1     act fire witnesses must be aware of defamation
2     a g calls for infrastructure protection summit
3           air nz staff in aust strike for pay rise
4      air nz strike to affect australian travellers


Text preprocessing

In [17]:
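import nltk

# word_tokenize needs NLTK's tokenizer data; run nltk.download('punkt') once if it is missing
# (newer NLTK releases may ask for 'punkt_tab' instead).
# Split each headline into a list of word tokens.
text['headline_text'] = text['headline_text'].apply(nltk.word_tokenize)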


In [19]:
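from nltk.stem import WordNetLemmatizer

# Lemmatize every token as a verb (pos='v'); requires nltk.download('wordnet') once.
lemmatizer = WordNetLemmatizer()  # create one instance instead of one per word
text['headline_text'] = text['headline_text'].apply(lambda x: [lemmatizer.lemmatize(word, pos='v') for word in x])
print(text.head(5))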

                                       headline_text
0  [aba, decide, against, community, broadcast, l...
1  [act, fire, witness, must, be, aware, of, defa...
2  [a, g, call, for, infrastructure, protection, ...
3  [air, nz, staff, in, aust, strike, for, pay, r...
4  [air, nz, strike, to, affect, australian, trav...



In [20]:
# Drop short tokens (3 characters or fewer).
tokenized_doc = text['headline_text'].apply(lambda x: [word for word in x if len(word) > 3])
print(tokenized_doc[:5])

0    [decide, against, community, broadcast, licence]
1            [fire, witness, must, aware, defamation]
2          [call, infrastructure, protection, summit]
3                         [staff, aust, strike, rise]
4            [strike, affect, australian, travellers]
Name: headline_text, dtype: object


TF-IDF matrix

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Detokenize: join each token list back into a single space-separated string.
detokenized_doc = [" ".join(tokens) for tokens in tokenized_doc]

text['headline_text'] = detokenized_doc

vectorizer = TfidfVectorizer(stop_words='english',
                            max_features=1000)

X = vectorizer.fit_transform(text['headline_text'])
X.shape



(1226258, 1000)
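
To make the weights in `X` less opaque, here is a rough pure-Python sketch of the TF-IDF computation. It assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and leaves out stop-word removal and the `max_features` cut. The helper name `tfidf` is hypothetical; the three toy documents are taken from the tokenized headlines printed earlier.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where rows holds L2-normalized TF-IDF weights."""
    n_docs = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)                       # raw term counts in this document
        raw = [tf[w] * idf[w] for w in vocab]   # Counter returns 0 for absent words
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

toy_docs = [
    ["strike", "affect", "australian", "travellers"],
    ["staff", "aust", "strike", "rise"],
    ["call", "infrastructure", "protection", "summit"],
]
vocab, rows = tfidf(toy_docs)
for word, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(word, round(weight, 3))
```

Because "strike" occurs in two of the three toy headlines, it receives a lower weight in the first headline than words that occur only there.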

Topic modeling


In [24]:
from sklearn.decomposition import LatentDirichletAllocation

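# 10 topics, fitted with online variational Bayes in a single pass over the data (max_iter=1).
LDA = LatentDirichletAllocation(n_components=10,learning_method='online',
                                random_state=777,max_iter=1)

# fit_transform returns the document-topic matrix (one row per headline).
doc_topic = LDA.fit_transform(X)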
print(LDA.components_)

[[1.00001414e-01 1.00000943e-01 1.00002270e-01 ... 1.00004914e-01
  1.00002732e-01 1.00003056e-01]
 [1.00007711e-01 2.03994281e+02 1.00002099e-01 ... 1.00008282e-01
  1.00003643e-01 5.79474578e+02]
 [1.00001185e-01 1.00000510e-01 1.00001626e-01 ... 1.00015936e-01
  1.00008259e-01 1.00008496e-01]
 ...
 [1.00002327e-01 1.00000173e-01 6.40483496e+02 ... 1.00011263e-01
  1.00002932e-01 1.00006160e-01]
 [1.00005269e-01 1.00001100e-01 1.00001152e-01 ... 1.00005936e-01
  1.00000782e-01 1.00008356e-01]
 [1.00003439e-01 1.00000202e-01 1.00001393e-01 ... 1.00007561e-01
  1.00005423e-01 1.00004653e-01]]
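
The rows of `LDA.components_` are unnormalized pseudo-counts rather than probabilities. With `n_components=10` the default topic-word prior is 1/10, which is presumably why most entries sit just above 0.1. Dividing each row by its sum gives an estimate of P(word w | topic t). The sketch below shows that normalization on a small made-up matrix; `normalize_rows` and the toy values are hypothetical and are not taken from the fitted model.

```python
def normalize_rows(matrix):
    """Divide every row by its own sum so that each row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total for value in row])
    return normalized

# Hypothetical pseudo-counts: 2 topics over a 3-word vocabulary.
toy_components = [
    [0.1, 204.0, 0.1],
    [640.5, 0.1, 0.1],
]
topic_word_dist = normalize_rows(toy_components)
for topic_id, row in enumerate(topic_word_dist):
    print(topic_id, [round(p, 4) for p in row])
```

On the real model the same step can be written with NumPy as `LDA.components_ / LDA.components_.sum(axis=1)[:, np.newaxis]`.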


In [26]:
terms = vectorizer.get_feature_names_out() # Vocabulary: the 1,000 terms kept by the vectorizer (get_feature_names() was removed in scikit-learn 1.2).

def get_topics(components, feature_names, n=5):
    # For each topic, print its n highest-weighted terms together with their weights.
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1), [(feature_names[i], topic[i].round(2)) for i in topic.argsort()[:-n - 1:-1]])

get_topics(LDA.components_,terms)

Topic 1: [('court', 8210.31), ('change', 7263.63), ('year', 6107.99), ('woman', 5919.53), ('face', 5696.83)]
Topic 2: [('australian', 13282.55), ('donald', 9113.27), ('world', 6872.23), ('shoot', 5313.39), ('leave', 4927.4)]
Topic 3: [('coronavirus', 39269.55), ('covid', 19482.21), ('queensland', 12906.92), ('news', 8583.8), ('live', 7907.18)]
Topic 4: [('election', 9985.87), ('record', 6380.97), ('crash', 6152.95), ('tasmania', 6141.91), ('make', 6104.51)]
Topic 5: [('border', 6378.89), ('state', 6081.07), ('coast', 6014.33), ('restrictions', 5960.47), ('attack', 5827.67)]
Topic 6: [('police', 13929.72), ('sydney', 10950.59), ('case', 10135.02), ('government', 9187.84), ('home', 7318.49)]
Topic 7: [('australia', 19357.91), ('melbourne', 8899.05), ('report', 5574.88), ('north', 4923.87), ('interview', 4373.91)]
Topic 8: [('victoria', 10824.69), ('coronavirus', 8841.6), ('china', 8357.87), ('canberra', 6154.57), ('perth', 4708.38)]
Topic 9: [('charge', 8388.17), ('market', 6534.22), ('s