# Latent Dirichlet Allocation (LDA) Hands-on
- We run LDA with scikit-learn.

<b>1) Understanding the news headline data</b>
- Download the headlines of news articles published over 15 years.
- Link: https://www.kaggle.com/therohk/million-headlines

In [1]:
import pandas as pd
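# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('C:/Users/fxk/PycharmProjects/tenjumh/Study/NLP_Natural Language Processing/data/abcnews-date-text.csv', on_bad_lines='skip')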

In [3]:
print(len(data))
print(data.head(5))
# Two columns: publish_date (date) and headline_text (headline)

1103663
   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers


In [7]:
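# Only the headlines are needed, so store them separately
text = data[['headline_text']].copy()   # .copy() gives an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning
text1 = data['headline_text']   # this form returns a Series and loses the column axis, hence the DataFrame form above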

In [8]:
text1.head(5)

0    aba decides against community broadcasting lic...
1       act fire witnesses must be aware of defamation
2       a g calls for infrastructure protection summit
3             air nz staff in aust strike for pay rise
4        air nz strike to affect australian travellers
Name: headline_text, dtype: object

In [9]:
text.head(5)

                                       headline_text
0  aba decides against community broadcasting lic...
1     act fire witnesses must be aware of defamation
2     a g calls for infrastructure protection summit
3           air nz staff in aust strike for pay rise
4      air nz strike to affect australian travellers


In [14]:
print(text1.shape)
print(text.shape)

(1103663,)
(1103663, 1)


<b>2) Text preprocessing</b>
- Preprocessing steps: stopword removal, lemmatization, and removal of short words
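
The three steps can be sketched on a single headline without NLTK. The stopword set and lemma table below are tiny hypothetical stand-ins for `stopwords.words('english')` and `WordNetLemmatizer`, and `preprocess` is an illustrative helper, not part of the notebook.

```python
# Toy stand-ins (assumptions): NLTK's real stopword list and WordNet lemmatizer are far larger.
TOY_STOPWORDS = {"a", "against", "be", "for", "in", "of", "to"}
TOY_LEMMAS = {"decides": "decide", "calls": "call", "witnesses": "witness", "broadcasting": "broadcast"}

def preprocess(headline, min_len=4):
    """Tokenize on whitespace, drop stopwords, lemmatize, then drop short words."""
    tokens = headline.split()                                # whitespace tokenization
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]   # stopword removal
    tokens = [TOY_LEMMAS.get(t, t) for t in tokens]          # lemmatization via table lookup
    return [t for t in tokens if len(t) >= min_len]          # keep words longer than 3 characters

print(preprocess("a g calls for infrastructure protection summit"))
# ['call', 'infrastructure', 'protection', 'summit']
```

The order matters here: stopwords are removed before lemmatization, and the length filter runs last, mirroring the cells that follow.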

In [15]:
import nltk

# Tokenize each headline into words
text['headline_text'] = text.apply(lambda row: nltk.word_tokenize(row['headline_text']), axis=1)



In [16]:
print(text.head(5))

                                       headline_text
0  [aba, decides, against, community, broadcastin...
1  [act, fire, witnesses, must, be, aware, of, de...
2  [a, g, calls, for, infrastructure, protection,...
3  [air, nz, staff, in, aust, strike, for, pay, r...
4  [air, nz, strike, to, affect, australian, trav...


In [17]:
# Remove stopwords
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes each membership test constant time
text['headline_text'] = text['headline_text'].apply(lambda x: [word for word in x if word not in stop])
print(text.head(5))

                                       headline_text
0   [aba, decides, community, broadcasting, licence]
1    [act, fire, witnesses, must, aware, defamation]
2     [g, calls, infrastructure, protection, summit]
3          [air, nz, staff, aust, strike, pay, rise]
4  [air, nz, strike, affect, australian, travellers]




In [18]:
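# Lemmatization: reduce verbs to their base form (third-person singular to first person, past tense to present tense)

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()  # create once and reuse for every word
text['headline_text'] = text['headline_text'].apply(lambda x: [lemmatizer.lemmatize(word, pos='v') for word in x])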
print(text.head(5))

                                       headline_text
0       [aba, decide, community, broadcast, licence]
1      [act, fire, witness, must, aware, defamation]
2      [g, call, infrastructure, protection, summit]
3          [air, nz, staff, aust, strike, pay, rise]
4  [air, nz, strike, affect, australian, travellers]




In [19]:
# Remove short words (3 characters or fewer)

tokenized_doc = text['headline_text'].apply(lambda x: [word for word in x if len(word) > 3])
print(tokenized_doc[:5])

0       [decide, community, broadcast, licence]
1      [fire, witness, must, aware, defamation]
2    [call, infrastructure, protection, summit]
3                   [staff, aust, strike, rise]
4      [strike, affect, australian, travellers]
Name: headline_text, dtype: object


<b>3) TF-IDF</b>
- scikit-learn's TfidfVectorizer takes raw, untokenized text as input by default.
- We therefore undo the tokenization done during preprocessing, a step called detokenization.
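
As a minimal sketch of what detokenization means here, the token list below (taken from the first preprocessed headline) is joined back into one space-separated string. The variable names are illustrative only.

```python
tokens = ["decide", "community", "broadcast", "licence"]

# Detokenize: join the tokens with single spaces to get one string per document.
detokenized = " ".join(tokens)
print(detokenized)

# Splitting on whitespace recovers the same tokens, so nothing is lost in the round trip.
assert detokenized.split() == tokens
```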

In [21]:
# Detokenization (undo the tokenization)
detokenized_doc = []
for i in range(len(text)):   # join the fully preprocessed tokens (stopwords removed, lemmatized, short words dropped) so TF-IDF sees the cleaned text
    t = ' '.join(tokenized_doc[i])   # for str.join(), see "Study/Python Function Collection"
    detokenized_doc.append(t)

text['headline_text'] = detokenized_doc  # store the result back into text['headline_text']



In [22]:
text['headline_text'][:5]

0       decide community broadcast licence
1       fire witness must aware defamation
2    call infrastructure protection summit
3                   staff aust strike rise
4      strike affect australian travellers
Name: headline_text, dtype: object

- Detokenization is complete.
- Build the TF-IDF matrix with scikit-learn's TfidfVectorizer.
- Because the full vocabulary takes a long time to process, we keep only 1,000 words.
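
To make the TF-IDF matrix less of a black box, here is a rough pure-Python sketch of the computation under scikit-learn's default settings as I understand them: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three documents, the small `max_features` value, and the helper `tfidf_row` are assumptions for illustration. Tie-breaking among equally frequent terms may differ from scikit-learn, and the built-in English stopword filter is not reproduced.

```python
import math
from collections import Counter

docs = [
    "strike affect australian travellers",
    "staff aust strike rise",
    "fire witness must aware defamation",
]
max_features = 5  # hypothetical small value; the notebook uses 1000

# Keep only the most frequent terms across the corpus, then sort them alphabetically.
term_counts = Counter(word for doc in docs for word in doc.split())
vocab = sorted(term for term, _ in term_counts.most_common(max_features))

n_docs = len(docs)
doc_freq = {t: sum(t in doc.split() for doc in docs) for t in vocab}
# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector of one document over vocab."""
    counts = Counter(doc.split())
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero for empty rows
    return [v / norm for v in raw]

X = [tfidf_row(doc) for doc in docs]
print(len(X), len(vocab))  # rows = documents, columns = vocabulary terms
```

A document with no word in the retained vocabulary, like the third one here, becomes an all-zero row. That is the price of capping the vocabulary.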

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)  # keep only the 1,000 most frequent terms
X = vectorizer.fit_transform(text['headline_text'])
print(text.shape)
print(tokenized_doc.shape)
print(X.shape)     # 1,103,663 headlines x 1,000 vocabulary terms

(1103663, 1)
(1103663,)
(1103663, 1000)


<b>4) Topic modeling</b>

In [26]:
from sklearn.decomposition import LatentDirichletAllocation
lda_model = LatentDirichletAllocation(n_components=10, learning_method='online', random_state=777, max_iter=1)

In [27]:
lda_top = lda_model.fit_transform(X)
print(lda_model.components_)
print(lda_model.components_.shape)

[[1.00000703e-01 1.00000829e-01 1.00003578e-01 ... 1.00004871e-01
  1.00003129e-01 1.00002930e-01]
 [1.00001421e-01 8.66862951e+02 1.00008903e-01 ... 1.00004224e-01
  1.00005598e-01 7.01841034e+02]
 [1.00000648e-01 1.00000545e-01 1.00002661e-01 ... 1.00005158e-01
  1.00008596e-01 1.00001987e-01]
 ...
 [1.00001636e-01 1.00000889e-01 2.68570402e+03 ... 1.00003039e-01
  1.00010511e-01 1.00004475e-01]
 [1.00001352e-01 1.00000852e-01 1.00003353e-01 ... 1.00003378e-01
  1.00005211e-01 1.00003635e-01]
 [1.00002244e-01 1.00000967e-01 1.00003675e-01 ... 1.00002444e-01
  1.00003580e-01 1.00004738e-01]]
(10, 1000)
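
Each row of `components_` holds one topic's weights over the 1,000 vocabulary terms. scikit-learn describes these as pseudo-counts, so dividing a row by its sum gives a topic-word distribution. The sketch below uses a hypothetical 2-topic, 4-term matrix in plain Python lists to show that normalization and how the largest weights in a row can be ranked. `terms`, `components`, `topic_distribution`, and `top_terms` are illustrative names, not the notebook's objects.

```python
# Hypothetical 2-topic x 4-term weight matrix standing in for lda_model.components_
terms = ["court", "police", "election", "market"]
components = [
    [0.1, 120.0, 0.1, 30.0],
    [45.0, 0.1, 90.0, 0.1],
]

def topic_distribution(row):
    """Normalize one topic's pseudo-counts so they sum to 1."""
    total = sum(row)
    return [w / total for w in row]

def top_terms(row, n=2):
    """Sort indices by ascending weight, then take the last n in reverse order."""
    order = sorted(range(len(row)), key=row.__getitem__)  # plays the role of argsort()
    return [(terms[i], row[i]) for i in order[:-n - 1:-1]]

for idx, row in enumerate(components):
    print("Topic %d:" % (idx + 1), top_terms(row))
# Topic 1: [('police', 120.0), ('market', 30.0)]
# Topic 2: [('election', 90.0), ('court', 45.0)]
```

The slice `[:-n - 1:-1]` walks backward from the end of the ascending order and stops after `n` items, which is why the heaviest terms come first.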


In [29]:
terms = vectorizer.get_feature_names_out() # vocabulary of 1,000 terms (get_feature_names() was removed in scikit-learn 1.2)

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:"%(idx+1), [(feature_names[i], topic[i].round(2)) for i in topic.argsort()[:-n - 1:-1]])

get_topics(lda_model.components_,terms)

Topic 1: [('government', 8658.95), ('queensland', 8134.58), ('perth', 6332.45), ('year', 5981.93), ('change', 5833.07)]
Topic 2: [('world', 7026.33), ('house', 6217.97), ('donald', 5757.52), ('open', 5620.39), ('years', 5563.76)]
Topic 3: [('police', 12140.34), ('kill', 6091.65), ('interview', 5921.12), ('live', 5657.67), ('rise', 4162.16)]
Topic 4: [('court', 6173.46), ('crash', 5497.33), ('state', 4857.9), ('tasmania', 4443.89), ('accuse', 4300.92)]
Topic 5: [('australia', 13994.07), ('south', 6253.18), ('woman', 5614.31), ('coast', 5465.23), ('warn', 5155.11)]
Topic 6: [('charge', 8440.62), ('election', 7650.47), ('adelaide', 6839.75), ('murder', 6418.61), ('make', 6198.2)]
Topic 7: [('help', 5372.6), ('miss', 4601.06), ('people', 4561.71), ('2016', 4212.58), ('family', 4149.3)]
Topic 8: [('sydney', 8597.95), ('melbourne', 7603.52), ('canberra', 6285.91), ('plan', 5606.37), ('power', 4198.99)]
Topic 9: [('attack', 6818.74), ('market', 5094.55), ('council', 3854.2), ('share', 3811.79