# Creating a Bag of Words (BoW) from the Body Morphemes of NAVER News Articles

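In this notebook we split the body of each NAVER News article into morphemes and organize the frequency of every morpheme that appears in each body into a DataFrame. <br><br>
For reference,
  <span style = "color : #F7CAC9">**a representation that quantifies how often each word occurs in a text**</span>
is called a Bag of Words (BoW).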
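
As a quick illustration of the idea, here is a minimal sketch that counts tokens with the standard library's `collections.Counter`. The token list is invented for the example and is not taken from the crawled articles.

```python
from collections import Counter

# Toy tokens, made up for illustration (not from the crawled articles)
tokens = ["subway", "platform", "congestion", "subway", "model", "subway"]

# A Bag of Words keeps only how often each token occurs; word order is discarded
bow = Counter(tokens)

print(bow["subway"])    # 3
print(bow["platform"])  # 1
print(bow["bus"])       # 0, because Counter returns 0 for tokens it has never seen
```

The rest of the notebook builds the same kind of count, one row per article, with pandas.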

## Import Library Packages

In [1]:
import pandas as pd, numpy as np, warnings; warnings.filterwarnings('ignore')

## Loading the DataFrame

In [2]:
art_df = pd.read_csv("../Data/Crawling_Data/서울지하철혼잡AI예측_20240604_15시10분47초.csv")

art_df

Unnamed: 0,date,title,link,content
0,2023-11-02 12:01:07,지하철역 승강장 혼잡도 실시간 예측한다,https://n.news.naver.com/mnews/article/018/000...,"[\n행안부, 'AI 기반 지하철 승강장 혼잡도 예측 모델' 개발서울지하철 2개 역..."
1,2023-11-02 13:44:29,‘혼잡률 200% 육박’ 서울지하철…AI가 ‘심각’ 판단하면 진입 통제,https://n.news.naver.com/mnews/article/056/001...,"[\n\n\n\n\n김포시민들이 서울로 출퇴근할 때 주로 이용하는 교통수단, 김포 ..."
2,2023-11-03 05:34:00,의자 없애고 혼잡도 실시간 분석…서울 '지옥철' 오명 벗나,https://n.news.naver.com/mnews/article/421/000...,[\n승강장 혼잡도 AI 분석모델 장한평·군자역 시범적용'혼잡도 심각' 4·7호선 ...
3,2023-06-23 14:40:01,"김혜지 서울시의원, 지하철역 혼잡도…다양한 개선대책 추진",https://n.news.naver.com/mnews/article/081/000...,"[\n김 의원, 서울교통공사가 개발 중인 혼잡도 평가시스템 점검천호역을 비롯한 혼잡..."
4,2020-09-14 11:44:43,"SKT ""서울 지하철 1~8호선 혼잡도 미리 확인한다""",https://n.news.naver.com/mnews/article/629/000...,[\n\n\n\n\nSK텔레콤이 지하철의 칸별 혼잡도 예측 정보를 'T맵 대중교통'...
5,2018-09-10 13:06:51,[기고]빅데이터 기반 지하철 쾌적 경로 서비스 제공해야,https://n.news.naver.com/mnews/article/030/000...,[\n\n\n\n\n황보현우 겸임교수하루 평균 약 800만명이 이용하는 서울 지하철...
6,2023-12-06 11:35:03,국민 안전 지평 넓힌 공공데이터 활용[포럼],https://n.news.naver.com/mnews/article/021/000...,"[\n이상민 행정안전부 장관1990년 4월 24일, 인간이 가진 지식과 감각의 지평..."
7,2023-10-27 15:33:03,"핼러윈 주말, 홍대에 최대 7만명 모인다…마포구, 인파관리에 2850명 투입",https://n.news.naver.com/mnews/article/081/000...,[\n27일부터 5일간 특별 관리상상마당앞 합동상황실 운영지능형 인파관리시스템 활용...
8,2022-11-09 18:40:03,CCTV·통화 '밀집 데이터' 있었는데…골목 참사 못 막은 'IT 강국 코리아' [...,https://n.news.naver.com/mnews/article/015/000...,[\n장강호 사회부 기자\n\n\n\n이태원 참사가 발생한 지난달 29일. 이태원 ...
9,2023-01-01 09:19:01,"[신년사] 신상진 성남시장 ""시민과 함께 새로운 50년을 준비하겠다""",https://n.news.naver.com/mnews/article/002/000...,[\n\t\t\t존경하는 시민 여러분과 성남시 공직자 여러분! 계묘년(癸卯年) 새해...


## Splitting the article bodies into morphemes
- Split the Korean body of each article into morphemes <span style = "color :#F7CAC9">using KoNLPy's **Okt module**</span>.

In [3]:
from konlpy.tag import Okt
from tqdm import tqdm

In [4]:
art_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   date     10 non-null     object
 1   title    10 non-null     object
 2   link     10 non-null     object
 3   content  10 non-null     object
dtypes: object(4)
memory usage: 452.0+ bytes


### Building a dictionary whose keys are the article indices and whose values are the morpheme-split bodies

In [5]:
# Extract the nouns, adjectives, verbs, and adverbs from each article body into a list, then store that list in a dictionary
okt = Okt()
poss_dic = {}

for i in tqdm(art_df.index) :
  ## 1.
  main = art_df.loc[i, "content"]
  poss = okt.pos(main, norm = True, stem = True)

  ## 2.
  poss_lst = []
  for word, tag in poss :
    if tag in ['Noun','Adjective','Verb','Adverb']:
      poss_lst.append(word)

  ## 3. 
  poss_dic[i] = poss_lst

100%|██████████| 10/10 [00:03<00:00,  3.32it/s]


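1. Store the value at row i, column "content" of art_df (the body of each article) in a variable, <br>
  then use okt's pos function to perform part-of-speech tagging, which splits the text into (morpheme, part of speech) tuples, and keep the result as a list. <br><br>
2. From the tuples in the poss list, append to poss_lst only the morphemes tagged as noun, adjective, verb, or adverb. <br>
  The other parts of speech are left out because they were judged unnecessary for this text analysis. <br><br>
3. Build the poss_dic dictionary, whose keys are the article indices and whose values are the lists of morphemes from each body.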
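
Step 2 can be tried without installing KoNLPy. The sketch below uses a hand-written list of (morpheme, tag) tuples in the shape that `okt.pos` returns. The tuples are illustrative assumptions, not actual Okt output, and `KEEP_TAGS` is a name introduced here.

```python
# Hand-written (morpheme, tag) pairs shaped like the list okt.pos() returns.
# These are illustrative assumptions, not real Okt output.
tagged = [
    ("지하철", "Noun"),
    ("을", "Josa"),
    ("타다", "Verb"),
    ("매우", "Adverb"),
    ("혼잡하다", "Adjective"),
    (".", "Punctuation"),
]

# The same four tags that cell In [5] keeps; a set gives fast membership tests
KEEP_TAGS = {"Noun", "Adjective", "Verb", "Adverb"}

# Keep only the morpheme text of the tuples whose tag is in KEEP_TAGS
filtered = [word for word, tag in tagged if tag in KEEP_TAGS]

print(filtered)  # ['지하철', '타다', '매우', '혼잡하다']
```

The list comprehension is equivalent to the nested `for`/`if`/`append` loop in the cell above; either form works.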

In [6]:
poss_dic

{0: ['행안부',
  '기반',
  '지하철',
  '승강장',
  '혼잡',
  '예측',
  '모델',
  '개발',
  '서울',
  '지하철',
  '개',
  '역',
  '시범',
  '적용',
  '정확도',
  '표준',
  '모델',
  '추진',
  '이데일리',
  '이연호',
  '기자',
  '정부',
  '지하철역',
  '승강장',
  '혼잡',
  '상황',
  '실시간',
  '파악',
  '하다',
  '수',
  '있다',
  '인공',
  '지능',
  '기반',
  '데이터',
  '분석',
  '모델',
  '개발',
  '하다',
  '달',
  '현장',
  '시범',
  '적용',
  '하다',
  '전자',
  '관제',
  '실',
  '대시보드',
  '화면',
  '사진',
  '행정안전부',
  '행정안전부',
  '통합',
  '데이터',
  '센터',
  '지난',
  '지하철',
  '김포',
  '골드',
  '라인',
  '샘플',
  '진행',
  '하다',
  '온',
  '기반',
  '지하철',
  '승강장',
  '혼잡',
  '예측',
  '모델',
  '개발',
  '마치',
  '달',
  '지하철',
  '시범',
  '적용',
  '하다',
  '밝히다',
  '이번',
  '개발',
  '되다',
  '모델',
  '가다',
  '산출',
  '지하철',
  '승강장',
  '체류',
  '인원',
  '토대',
  '승강장',
  '면적',
  '고려',
  '밀도',
  '혼잡',
  '률',
  '산출',
  '뒤',
  '그',
  '수준',
  '단계',
  '단계',
  '구분',
  '하다',
  '표',
  '출하',
  '개념',
  '모델',
  '개발',
  '과정',
  '통합',
  '데이터',
  '센터',
  '서울',
  '교통',
  '공사',
  '김포',
  '골드',
  '라인',
  '함께',
  '참여',
  '하다',
  '지하철',


## Putting each article's morphemes into a Bag of Words (BoW)

### Removing duplicates from the morphemes of all news article bodies and storing them in a list

In [7]:
all_unique_words = []

for words in poss_dic.values() :
    all_unique_words.extend(words)

# Deduplicate once, after the morphemes of every article have been collected
all_unique_words = list(set(all_unique_words))

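1. Collect every morpheme stored in poss_dic into a single list. <br>
  extend is used so that only the elements of each list are added, rather than the list itself. <br><br>

2. Then, to remove duplicates, convert the list to a set, which does not allow duplicates, and convert it back to a list.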
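
The difference between `extend` and `append`, and the effect of the `set` round trip, can be seen on a small example. The two token lists are toy stand-ins for `poss_dic.values()`. The `sorted` call is an addition not used in the notebook; it is included only because a set has no guaranteed order, so sorting makes the column order reproducible between runs.

```python
# Two toy token lists standing in for poss_dic.values()
docs = [["지하철", "혼잡", "지하철"], ["혼잡", "예측"]]

nested = []
flat = []
for words in docs:
    nested.append(words)  # append adds the list object itself, giving a list of lists
    flat.extend(words)    # extend adds each element, giving one flat list

print(len(nested))  # 2
print(len(flat))    # 5

# set() drops duplicates; sorted() (an optional extra step) fixes the order
unique_words = sorted(set(flat))
print(len(unique_words))  # 3
```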

### Creating a DataFrame (BoW) that shows how many times each word appears in each article

In [8]:
datas = []

for i in tqdm(art_df.index) :
    words = poss_dic[i]  # list of the morphemes extracted from this article
    vc = pd.Series(words).value_counts() # Series (index: morpheme, values: frequency)
    data = vc.to_dict()
    datas.append(data)

df = pd.DataFrame(
    datas,
    index = art_df.index,
    columns = all_unique_words
)

100%|██████████| 10/10 [00:00<00:00, 2014.07it/s]


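1. Look up each article's index as a key in the poss_dic dictionary to get words, the list of that article's morphemes. <br><br>
2. Pass the words list through value_counts() to produce a Series whose index is the morphemes and whose values are their frequencies. <br><br>
3. Convert the Series vc into a dictionary whose keys are morphemes and whose values are frequencies, and append it to the datas list. <br><br>
4. Finally, use that list to build a DataFrame whose index is the article indices and whose columns are the deduplicated morphemes from all articles.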
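
The four steps can be reproduced on a two-article toy input to see where the NaN values come from. `poss_dic_toy`, `vocab`, `rows`, and `bow_df` are names introduced for this sketch; the tokens are made up.

```python
import pandas as pd

# Toy stand-in for poss_dic: article index -> list of morphemes
poss_dic_toy = {0: ["지하철", "혼잡", "지하철"], 1: ["혼잡", "예측"]}

# Deduplicated vocabulary across both articles (sorted for a stable column order)
vocab = sorted({w for words in poss_dic_toy.values() for w in words})

# One {morpheme: frequency} dictionary per article
rows = [pd.Series(words).value_counts().to_dict() for words in poss_dic_toy.values()]

# A morpheme missing from an article's dictionary becomes NaN in that row
bow_df = pd.DataFrame(rows, index=list(poss_dic_toy.keys()), columns=vocab)
print(bow_df)
```

Article 0 contains "지하철" twice, so that cell holds 2, while article 1 never uses the word, so pandas fills the cell with NaN.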

In [9]:
df

Unnamed: 0,반드시,필요하다,개다,보행,대기권,마이스,변화,삶,테크노,현안,...,각종,집회,알림,경찰서,민선,티켓,대처,대시보드,중심,확률
0,,,,,,,,,,,...,,,,,,,,2.0,,
1,,,,,,,,,,,...,1.0,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,1.0,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,1.0,,,,
5,1.0,,,,,,,,,,...,3.0,,,,,,,,,
6,,,,,2.0,,1.0,1.0,,,...,,,,,,,1.0,,,
7,,,,3.0,,,,,,,...,,,,2.0,,,,,1.0,
8,,1.0,,,,,,,,,...,,2.0,1.0,,,,,,,1.0
9,,,1.0,,,1.0,1.0,3.0,1.0,1.0,...,,,,,1.0,,,,1.0,


1. Most of the values are NaN, which is to be expected. <br><br>
2. Of all the unique morphemes collected across every article, only a small share appears in any single article.
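
One way to put a number on this sparsity is to measure the share of NaN cells. The sketch below runs on a small made-up frame named `toy`, not on the notebook's `df`, so the figures are not results from the article data.

```python
import numpy as np
import pandas as pd

# Made-up 3-article, 4-morpheme BoW with NaN where a morpheme is absent
toy = pd.DataFrame({
    "지하철": [2.0, np.nan, np.nan],
    "혼잡": [1.0, 1.0, np.nan],
    "예측": [np.nan, 1.0, np.nan],
    "집회": [np.nan, np.nan, 2.0],
})

# Share of all cells that are NaN (7 of the 12 cells here)
nan_share = toy.isna().to_numpy().mean()
print(f"{nan_share:.2f}")  # 0.58

# Share of the vocabulary that each article actually uses
vocab_share_per_doc = toy.notna().mean(axis=1)
print(vocab_share_per_doc.tolist())  # [0.5, 0.5, 0.25]
```

Applying the same two expressions to `df` would show how sparse the real matrix is.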

## Replacing every NaN with 0, since NaN values get in the way of analysis

In [10]:
df = df.fillna(0)

In [11]:
df

Unnamed: 0,반드시,필요하다,개다,보행,대기권,마이스,변화,삶,테크노,현안,...,각종,집회,알림,경찰서,민선,티켓,대처,대시보드,중심,확률
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,2.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
7,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0
8,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
9,0.0,0.0,1.0,0.0,0.0,1.0,1.0,3.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
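
As the output shows, the counts remain floats (`2.0`, `0.0`) after `fillna(0)`, because the columns were floats while they held NaN. If integer counts are preferred, one option is to cast after filling. This is an optional extra step sketched on a toy frame, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy BoW with one missing count
toy = pd.DataFrame({"지하철": [2.0, np.nan], "혼잡": [1.0, 1.0]})

# Fill the gaps with 0, then cast every column to an integer dtype
counts = toy.fillna(0).astype(int)

print(counts)
print(counts.dtypes)
```

Casting is safe here only because every NaN has already been replaced; calling `astype(int)` on a column that still contains NaN raises an error.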
