# News Article Analysis
- Uses news article data from BIGKinds (https://www.bigkinds.or.kr/)

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/swkim01/DSAC3/blob/main/gg-51-뉴스기사.ipynb"><img src="https://github.com/swkim01/DSAC3/raw/main/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/swkim01/DSAC3/blob/main/gg-51-뉴스기사.ipynb"><img src="https://github.com/swkim01/DSAC3/raw/main/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

In [None]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# Download the news data
#!curl -L https://bit.ly/2X7UON2 -o news.xlsx
!curl -L https://github.com/swkim01/DSAC3/raw/main/news.xlsx -o news.xlsx


In [None]:
news_all = pd.read_excel("news.xlsx")
news_all.columns

Index(['뉴스 식별자', '일자', '언론사', '기고자', '제목', '통합 분류1', '통합 분류2', '통합 분류3',
       '사건/사고 분류1', '사건/사고 분류2', '사건/사고 분류3', '인물', '위치', '기관', '키워드', '특성추출',
       '본문', 'URL', '분석제외 여부'],
      dtype='object')

In [None]:
news_text = news_all['본문']
news_text[:5]

0    - 비핵화 수준 상응 조치 놓고\n- 양국 협상팀 막판까지 ‘밀당’\n- 1차 때와...
1    김정은 국무위원장이 27일 시작되는 제2차 북미정상회담 성공을 위해 심혈을 기울이고...
2    북미가 처음으로 정상 간 단독회담과 만찬을 가지며 또다시 새로운 역사 창조에 나섰다...
3    지난해 9월 남북정상회담 당시 리선권 북한 조국평화통일위원장의 '냉면' 발언으로 정...
4    지자체 민간 교류 활성화 대부분 \n여, 부처간 논의 예산 지원 확대 \n야, 사업...
Name: 본문, dtype: object

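- Create a `CountVectorizer` object, split the training data into tokens, and build the vocabulary
- `fit_transform` fits the vocabulary and runs `transform` in one call, returning the counts as a sparse matrix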

In [None]:
cv = CountVectorizer()
dtm = cv.fit_transform(news_text.tolist())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())
df[:5]

Unnamed: 0,00,000원을,001420,001550,002100,005690,01,017800,02,025860,...,힌국당,힘겨루기로,힘겨웠던,힘들다고,힘들어,힘들어지는,힘을,힘이,힘입어,靈山
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
len(cv.vocabulary_)

16199
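
As a minimal sketch of what `fit_transform` does, the toy example below runs `CountVectorizer` with default settings on two made-up English sentences. The `docs` list and the `toy_` names are hypothetical and are not part of the news data. It prints the learned vocabulary and the sparse document-term matrix converted to a dense array for inspection.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, for illustration only
docs = ["the summit begins today", "the summit ends today today"]

toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(docs)  # learns the vocabulary, then counts tokens

# vocabulary_ maps each token to its column index; order the tokens by that index
terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(terms)                    # column labels of the matrix
print(sparse.issparse(toy_dtm))  # the result is a SciPy sparse matrix
print(toy_dtm.toarray())        # one row per document, one column per term
```

Each row is a document and each column is a term. With the default settings, tokens are lowercased and single-character tokens are dropped.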

- Restrict the vocabulary to the 2,000 most frequent terms

In [None]:
cv = CountVectorizer(max_features=2000)
dtm = cv.fit_transform(news_text.tolist())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())
df[:3]

Unnamed: 0,00,01,02,0px,10,100주년,100주년을,10시,10일,10일까지,...,회동을,회복,회의론을,회의를,회의에서,효과,효과를,후보지로,후속,힘을
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
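
To illustrate how `max_features` trims the vocabulary, here is a small sketch on a made-up corpus (the `docs` list and variable names are hypothetical). The corpus is built so that the term totals are unambiguous: `summit` occurs four times, `talks` twice, and `lunch` once.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "summit" occurs 4 times, "talks" twice, "lunch" once
docs = ["summit summit summit talks", "summit talks lunch"]

full_cv = CountVectorizer().fit(docs)
small_cv = CountVectorizer(max_features=2).fit(docs)

full_terms = sorted(full_cv.vocabulary_)   # every term seen in the corpus
kept_terms = sorted(small_cv.vocabulary_)  # only the 2 most frequent terms

print(full_terms)
print(kept_terms)
```

The ranking is by total count across the whole corpus, so the least frequent term is the one that drops out of the vocabulary.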


In [None]:
# Sum each term's counts over all documents, then take the term with the largest total
df.sum().idxmax()

'2차'
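
The most frequent term can also be found without converting the matrix to a dense `DataFrame`. The sketch below sums the columns of the sparse matrix directly, using a made-up corpus (the `docs` list and variable names are hypothetical).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, for illustration only
docs = ["summit summit talks", "summit talks lunch", "lunch menu"]

toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(docs)

# Column sums on the sparse matrix give the corpus-wide count of each term
totals = np.asarray(toy_dtm.sum(axis=0)).ravel()

# Order the terms by their column index so they line up with `totals`
terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

top_term = terms[int(totals.argmax())]
print(top_term, int(totals.max()))
```

Summing on the sparse matrix avoids allocating a full documents-by-terms array, which matters more as the vocabulary grows.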

- Create a `TfidfVectorizer` object, fit it to the training data, and store the result as a document-term matrix


In [None]:
tv = TfidfVectorizer(max_features=2000)
dtm = tv.fit_transform(news_text.tolist())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(dtm.toarray(), columns=tv.get_feature_names_out())

In [None]:
df[:3]

Unnamed: 0,00,01,02,0px,10,100주년,100주년을,10시,10일,10일까지,...,회동을,회복,회의론을,회의를,회의에서,효과,효과를,후보지로,후속,힘을
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
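
To show where the TF-IDF weights come from, the sketch below recomputes them by hand on a made-up corpus and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). The `docs` list and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus: "summit" appears in every document
docs = [
    "summit talks begin",
    "summit talks end",
    "summit lunch menu",
]

# Raw term counts (both vectorizers tokenize and order columns the same way by default)
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 norm
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

tfidf = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, tfidf)
print(matches)
```

In this toy corpus `summit` appears in every document, so it receives the lowest IDF weight, while terms that occur in a single document are weighted highest.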
