
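Applying Document Similarity
=====
Measuring Similarity Between Naver News Articles on 'Starbucks'
-----

This time we apply document similarity to a practical problem.<br>
Using Naver News articles about 'Starbucks' (search keyword '스타벅스') that we crawled ourselves, we will build a model that clusters similar articles.<br>
The data consists of Naver News articles collected from January 1, 2023 to April 27, 2023.

The process is as follows:
- Load the data
- Preprocess the data (remove duplicates, etc.)
- Preprocess the text (noun-centered)
- Compute cosine similarity
- Recommend articles based on document similarity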
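
Before working with the real data, the steps listed here can be sketched end to end on a tiny example. The headlines, the `title` and `url` column names, and the whitespace tokenization are all assumptions made for illustration; the notebook itself uses Korean titles and a KoNLPy tokenizer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy headlines standing in for the crawled news titles.
toy = pd.DataFrame({
    "title": [
        "starbucks launches spring latte",
        "starbucks launches spring latte",      # same article crawled twice
        "starbucks spring latte launch event",
        "bank opens new pension event",
    ],
    "url": ["u1", "u1", "u2", "u3"],
})

# 1) Data preprocessing: drop rows that share a URL, then renumber the rows.
toy = toy.drop_duplicates(subset="url").reset_index(drop=True)

# 2) Vectorize the titles with TF-IDF (default word tokens here, not nouns).
tfidf = TfidfVectorizer().fit_transform(toy["title"])

# 3) Pairwise cosine similarity between all remaining titles.
sim = cosine_similarity(tfidf)

# 4) Recommend the title most similar to title 0, excluding title 0 itself.
scores = pd.Series(sim[0]).drop(0)
best = scores.idxmax()
print(toy["title"][best])
```

The rest of the notebook follows the same four stages, only with more careful text preprocessing and a much larger similarity matrix.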

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Load the DataFrame
path = '/content/drive/MyDrive/text-mining/네이버 뉴스_스타벅스_2023.04.27.csv'
df = pd.read_csv(path)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16842 entries, 0 to 16841
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   제목         16842 non-null  object
 1   언론사        16842 non-null  object
 2   날짜         16842 non-null  object
 3   URL        16842 non-null  object
 4   네이버뉴스_URL  4842 non-null   object
dtypes: object(5)
memory usage: 658.0+ KB


In [6]:
df.head()

Unnamed: 0,제목,언론사,날짜,URL,네이버뉴스_URL
0,"오늘(1/1) 코스트코 정상영업, 지점별 1월 휴무일·영업시간 '확인하세요'",핀포인트뉴스,2023.01.01.,http://www.pinpointnews.co.kr/news/articleView...,
1,호랑이 가고 검은 토끼 온다…유통가 ‘토끼 마케팅’ 활발,인더뉴스,2023.01.01.,https://www.inthenews.co.kr/news/article.html?...,
2,[아듀2022 ②] 머지부터 FTX 파산까지...올해의 주요 이슈 TOP 10,토큰포스트,2023.01.01.,https://www.tokenpost.kr/article-117836,
3,강남 집 팔아 청소년 쉼터 세웠다…바리스타 키우는 회장님,중앙일보,2023.01.01.,https://www.joongang.co.kr/article/25130324,https://n.news.naver.com/mnews/article/025/000...
4,"변화하는 유통업계, 영역과 경계 허문 '파괴적 커머스' 시대 열렸다",뉴스1,2023.01.01.,https://www.news1.kr/articles/4865044,https://n.news.naver.com/mnews/article/421/000...


In [8]:
# Check for duplicates: whole rows, by URL, and by title
print(df.duplicated().sum())
print(df.duplicated(subset='URL').sum())
print(df.duplicated(subset='제목').sum())

71
71
1949


In [9]:
df[df.duplicated(subset='URL')]

Unnamed: 0,제목,언론사,날짜,URL,네이버뉴스_URL
415,"[업계만화경] '떠오르는 아침, 속 든든하게!' 조식 시장도 후끈후끈",나이스경제,2023.01.04.,http://www.niceeconomy.co.kr/news/articleView....,
1423,"KB국민은행, 퇴직연금 사전지정운용제도 이벤트 실시",소비자가 만드는 신문,2023.01.12.,http://www.consumernews.co.kr/news/articleView...,
1433,"KB국민은행, 3월 말까지 '퇴직연금 디폴트옵션' 이벤트 진행",SBS Biz,2023.01.12.,https://biz.sbs.co.kr/article_hub/20000098230?...,https://n.news.naver.com/mnews/article/374/000...
1806,"LG U+, 데이터 고객 맞춤형 뉴스 구독하면 선물",중소기업신문,2023.01.16.,http://www.smedaily.co.kr/news/articleView.htm...,
2219,직원들과 기념촬영하는 한덕수 총리,뉴시스,2023.01.18.,http://www.newsis.com/view/?id=NISI20230118_00...,https://n.news.naver.com/mnews/article/003/001...
...,...,...,...,...,...
14191,"KB국민은행, 1개월 만기 'KB 특별한 적금' 사전 예약 실시",여성소비자신문,2023.04.05.,http://www.wsobi.com/news/articleView.html?idx...,
15245,"경남정보대, 2023 국가산업대상 인재육성부문 수상",교수신문,2023.04.13.,http://www.kyosu.net/news/articleView.html?idx...,
15497,"고려대학교 안암병원 가정의학과 김양현 교수, 2023 자랑스러운 기업(기관/인물)&...",한국목재신문,2023.04.14.,https://www.woodkorea.co.kr/news/articleView.h...,
16622,"우리은행, '기업뱅킹' 이용 고객 대상 쓰리고 이벤트",EBN,2023.04.25.,https://www.ebn.co.kr/news/view/1575340/?sc=Naver,


In [10]:
# Remove duplicates by URL
df.drop_duplicates(subset='URL', inplace=True)

In [11]:
# Reset the index
df.reset_index(drop=True, inplace=True)
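
The three duplicate counts printed earlier (71, 71, 1949) differ because `duplicated` compares different columns each time. The following small sketch, using made-up rows, shows how the `subset` argument changes the count and why removing duplicates by URL still leaves repeated titles behind.

```python
import pandas as pd

# Hypothetical rows: the same URL appears twice, and the same title
# appears under three different rows (two distinct URLs).
demo = pd.DataFrame({
    "제목": ["A", "A", "B", "A"],
    "URL": ["u1", "u1", "u2", "u3"],
})

n_row_dups = demo.duplicated().sum()                # fully identical rows
n_url_dups = demo.duplicated(subset="URL").sum()    # repeated URLs
n_title_dups = demo.duplicated(subset="제목").sum()  # repeated titles

# Same two steps as the notebook: drop by URL, then renumber from 0.
deduped = demo.drop_duplicates(subset="URL").reset_index(drop=True)
print(n_row_dups, n_url_dups, n_title_dups, len(deduped))
```

Title "A" still occurs twice in `deduped`, which matters later when titles are used as lookup keys.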

In [12]:
# Install KoNLPy
!pip install Konlpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Collecting JPype1>=0.7.0 (from Konlpy)
  Downloading JPype1-1.4.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (465 kB)
Installing collected packages: JPype1, Konlpy
Successfully installed JPype1-1.4.1 Konlpy-0.6.0


In [13]:
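# Load the Okt morphological analyzer
from konlpy.tag import Okt
okt = Okt()

In [14]:
# Noun-centered tokenization
df['단어'] = df['제목'].map(okt.nouns)

In [16]:
# Flatten the per-title noun lists into one word list
from itertools import chain
word_list = list(chain.from_iterable(df['단어']))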

In [17]:
# Count word frequencies with Counter
from collections import Counter
c = Counter(word_list)
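
As a small illustration of this step, the sketch below flattens a few hypothetical token lists and counts them. `sum(lists, [])` and `itertools.chain.from_iterable` produce the same flat list; the latter avoids copying the growing list at every step, which can matter with tens of thousands of titles.

```python
from collections import Counter
from itertools import chain

# Hypothetical token lists, one list of nouns per title.
token_lists = [["스타벅스", "이벤트"], ["스타벅스", "출시"], ["이벤트", "스타벅스"]]

# Two equivalent ways to flatten a list of lists.
flat = list(chain.from_iterable(token_lists))
flat_via_sum = sum(token_lists, [])

counts = Counter(flat)
print(counts.most_common(2))
```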

In [18]:
# Most frequent words
c.most_common(100)

[('이벤트', 3458),
 ('스타벅스', 3372),
 ('진행', 1877),
 ('출시', 1502),
 ('증권', 1110),
 ('카드', 975),
 ('실시', 923),
 ('고객', 913),
 ('투자', 891),
 ('커피', 739),
 ('페이', 718),
 ('오픈', 684),
 ('서비스', 653),
 ('은행', 617),
 ('할인', 595),
 ('애플', 576),
 ('대상', 562),
 ('삼성', 557),
 ('기념', 548),
 ('국민은행', 548),
 ('이마트', 528),
 ('등', 478),
 ('롯데', 463),
 ('신세계', 440),
 ('최대', 437),
 ('연금', 422),
 ('유통', 410),
 ('돌파', 399),
 ('프로모션', 395),
 ('맞이', 375),
 ('라떼', 369),
 ('신한은행', 363),
 ('위', 360),
 ('한국', 359),
 ('주식', 357),
 ('판매', 356),
 ('개최', 341),
 ('봄', 341),
 ('행사', 340),
 ('명', 336),
 ('음료', 333),
 ('스', 321),
 ('혜택', 318),
 ('데이', 313),
 ('선물', 312),
 ('새해', 309),
 ('오늘', 302),
 ('스벅', 301),
 ('우리은행', 297),
 ('청년', 293),
 ('상품', 284),
 ('푸드', 279),
 ('신규', 278),
 ('갤럭시', 275),
 ('가입', 274),
 ('카페', 271),
 ('기업', 271),
 ('캠페인', 266),
 ('뱅크', 264),
 ('라이브', 264),
 ('종', 262),
 ('경품', 261),
 ('생활', 260),
 ('퀴즈', 259),
 ('아메리카노', 259),
 ('시장', 258),
 ('제공', 255),
 ('카카오', 250),
 ('쇼핑', 249),
 ('첫', 246),
 

In [19]:
# Keep only titles whose noun tokens include '스타벅스' (Starbucks) or '커피' (coffee)
df = df[df['단어'].map(lambda x: '스타벅스' in x or '커피' in x)].copy(); df

Unnamed: 0,제목,언론사,날짜,URL,네이버뉴스_URL,단어
75,"손흥민 효과! 메가커피, 선물하고 싶은 브랜드 등극",싱글리스트,2023.01.02.,http://www.slist.kr/news/articleView.html?idxn...,,"[손흥민, 효과, 메, 커피, 선물, 브랜드, 등]"
76,"‘손흥민 효과’ 메가커피, 선물하고 싶은 브랜드 등극",스포츠경향,2023.01.02.,http://sports.khan.co.kr/news/sk_index.html?ar...,https://n.news.naver.com/mnews/article/144/000...,"[손흥민, 효과, 메, 커피, 선물, 브랜드, 등]"
77,"메가MGC커피, 손흥민 효과로 '선물하고 싶은 브랜드' 등극",서울와이어,2023.01.02.,http://www.seoulwire.com/news/articleView.html...,,"[메, 커피, 손흥민, 효과, 선물, 브랜드, 등]"
78,"메가MCG커피, 손흥민 모델 기용 후 판매 급성장",포인트데일리,2023.01.02.,http://www.thekpm.com/news/articleView.html?id...,,"[메, 커피, 손흥민, 모델, 기용, 후, 판매, 급성]"
80,"[생활경제 이슈] 메가MGC커피, 손흥민 손잡고 선물하고 싶은 브랜드 등극 外",로이슈,2023.01.02.,http://www.lawissue.co.kr/view.php?ud=20230102...,,"[생활, 경제, 이슈, 메, 커피, 손흥민, 선물, 브랜드, 등]"
...,...,...,...,...,...,...
16732,"도미노피자, 오비맥주, 스타벅스, 맥도날드 다양한 고객 참여형 캠페인 전개",매일안전신문,2023.04.26.,https://idsn.co.kr/news/view/1065588006309683,,"[도미노피자, 오비맥주, 스타벅스, 맥도날드, 고객, 참여, 캠페인, 전개]"
16734,고궁서 즐기는 공짜 스벅 커피·'신비아파트' 체험행사,여성신문,2023.04.26.,http://www.womennews.co.kr/news/articleView.ht...,https://n.news.naver.com/mnews/article/310/000...,"[고궁, 공짜, 스벅, 커피, 신비, 아파트, 체험, 행사]"
16755,"中부주석, 스타벅스 창업자에 러브콜",매일경제,2023.04.26.,https://www.mk.co.kr/article/10722382,https://n.news.naver.com/mnews/article/009/000...,"[부주석, 스타벅스, 창업, 러브콜]"
16767,"스타벅스 콜드 브루, 누적판매 1억 5000만 잔 돌파",뉴스프리존,2023.04.26.,http://www.newsfreezone.co.kr/news/articleView...,,"[스타벅스, 콜드, 브루, 누적, 판매, 잔, 돌파]"


In [20]:
# Text cleaning: replace punctuation and symbols with spaces
import re
df['제목_전처리'] = df['제목'].map(lambda x: re.sub(r'[^\w\s]', ' ', x))

In [21]:
df['제목_전처리']

75                        손흥민 효과  메가커피  선물하고 싶은 브랜드 등극
76                        손흥민 효과  메가커피  선물하고 싶은 브랜드 등극
77                   메가MGC커피  손흥민 효과로  선물하고 싶은 브랜드  등극
78                         메가MCG커피  손흥민 모델 기용 후 판매 급성장
80          생활경제 이슈  메가MGC커피  손흥민 손잡고 선물하고 싶은 브랜드 등극 外
                             ...                      
16732        도미노피자  오비맥주  스타벅스  맥도날드 다양한 고객 참여형 캠페인 전개
16734                    고궁서 즐기는 공짜 스벅 커피  신비아파트  체험행사
16755                              中부주석  스타벅스 창업자에 러브콜
16767                   스타벅스 콜드 브루  누적판매 1억 5000만 잔 돌파
16770    中   美 포위망  깨기 총력전 스타벅스 창업자에  중국 경제 적극 참여해 달라 
Name: 제목_전처리, Length: 3650, dtype: object
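
To see what the cleaning pattern does to a single title, here is a minimal sketch. The helper name `clean_title` is hypothetical, and the final whitespace-collapsing step is an optional extra that the notebook does not perform.

```python
import re

def clean_title(title: str) -> str:
    """Replace every character that is neither a word character nor
    whitespace with a single space (same pattern as the cleaning cell)."""
    return re.sub(r"[^\w\s]", " ", title)

sample = "‘손흥민 효과’ 메가커피, 선물하고 싶은 브랜드 등극"
cleaned = clean_title(sample)
print(repr(cleaned))

# Optional extra step (not in the notebook): collapse repeated spaces.
collapsed = " ".join(cleaned.split())
print(collapsed)
```

Because Python 3 string patterns are Unicode-aware, `\w` matches Hangul and Chinese characters as well as Latin letters and digits, so only quotes, commas, brackets, and similar symbols are replaced.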

In [22]:
# Reset the index
df.reset_index(drop=True, inplace=True)

In [23]:
# TF-IDF vectorization: unigrams and bigrams that appear in at least 10 titles
vectorizer = TfidfVectorizer(min_df=10, ngram_range=(1, 2), tokenizer=okt.morphs)

In [26]:
features = vectorizer.fit_transform(df['제목_전처리'])

In [27]:
# Get the feature names
feature_names = vectorizer.get_feature_names_out()

In [28]:
feature_names

array(['000만', '000만 명', '1', ..., '휘호 전달', '힐링', '힐링 공간'], dtype=object)

In [29]:
# Build the document-term matrix (DTM) as a dense array
dtm_np = features.toarray()

In [32]:
# View the DTM as a DataFrame
pd.DataFrame(data=dtm_np, columns=feature_names)

Unnamed: 0,000만,000만 명,1,1 000만,1 명,1 위,1 절,1 호점,100,1000,...,회원 스타벅스,획득,효과,후원,휘호,휘호 기증,휘호 유지,휘호 전달,힐링,힐링 공간
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.370753,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.370753,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.321537,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3645,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3646,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3648,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
# Compute cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(dtm_np, dtm_np)
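
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below checks this by hand on a made-up corpus, and also illustrates that, because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), plain dot products give the same numbers. The documents are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["starbucks spring latte", "starbucks spring event", "bank pension event"]
X = TfidfVectorizer().fit_transform(docs)   # sparse matrix, rows have unit length

# Manual cosine similarity: dot products divided by the product of the norms.
dense = X.toarray()
norms = np.linalg.norm(dense, axis=1)
manual = (dense @ dense.T) / np.outer(norms, norms)

sim = cosine_similarity(X)   # accepts the sparse matrix directly
dot = linear_kernel(X)       # plain dot products, no normalization step

print(np.allclose(sim, manual), np.allclose(sim, dot))
```

Passing the sparse matrix directly, as done here, avoids building a dense copy first; whether that matters depends on the size of the vocabulary.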

In [34]:
# Map each title to its row index
indices = pd.Series(df.index, index=df['제목'])

In [36]:
indices

제목
손흥민 효과! 메가커피, 선물하고 싶은 브랜드 등극                        0
‘손흥민 효과’ 메가커피, 선물하고 싶은 브랜드 등극                       1
메가MGC커피, 손흥민 효과로 '선물하고 싶은 브랜드' 등극                   2
메가MCG커피, 손흥민 모델 기용 후 판매 급성장                         3
[생활경제 이슈] 메가MGC커피, 손흥민 손잡고 선물하고 싶은 브랜드 등극 外         4
                                                 ... 
도미노피자, 오비맥주, 스타벅스, 맥도날드 다양한 고객 참여형 캠페인 전개        3645
고궁서 즐기는 공짜 스벅 커피·'신비아파트' 체험행사                    3646
中부주석, 스타벅스 창업자에 러브콜                              3647
스타벅스 콜드 브루, 누적판매 1억 5000만 잔 돌파                   3648
中, ‘美 포위망’ 깨기 총력전…스타벅스 창업자에 “중국 경제 적극 참여해 달라”    3649
Length: 3650, dtype: int64

In [37]:
# Look up the row index of a news article by its title
idx = indices["스타벅스, 올해 커뮤니티 스토어 청년인재 배출…'역대 최다 인원'"]
print(idx)

1753
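
One caveat with a title-to-index lookup: duplicates were removed by URL, not by title, so a title may appear more than once. The following sketch with made-up titles shows what the lookup returns in that case, plus one possible workaround (keeping only the first occurrence), which is an assumption and not something the notebook does.

```python
import pandas as pd

# Hypothetical titles; "A" appears twice.
titles = pd.Series(["A", "B", "A"], name="제목")
indices = pd.Series(titles.index, index=titles)

print(indices["B"])   # unique title: a single integer position
print(indices["A"])   # repeated title: a Series holding both positions

# One defensive option: keep only the first position of each title.
first_pos = indices[~indices.index.duplicated(keep="first")]
print(first_pos["A"])
```

The title looked up in the notebook returned a single integer, so it is unique in this dataset, but other titles may not be.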


In [45]:
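# Similarity between the selected article and every other article
sim_scores = list(enumerate(cosine_sim[idx]))

In [46]:
# Sort the articles by similarity, highest first
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

In [49]:
# Top 11 entries: the article itself plus the 10 most similar articles
sim_scores = sim_scores[0:11]; sim_scores  # position 0 should be the article itself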

[(1753, 1.0),
 (1754, 0.6341466293697997),
 (1726, 0.6006578966817233),
 (1764, 0.5848960858757748),
 (1722, 0.5535078918877148),
 (1729, 0.5515513614344503),
 (1718, 0.5493485565097199),
 (2363, 0.5404018032691418),
 (1715, 0.5337755980560601),
 (1717, 0.5337755980560601),
 (1760, 0.5337755980560601)]

In [50]:
# Drop position 0 (the article itself)
sim_scores = sim_scores[1:11]

In [51]:
# Indices of the 10 most similar articles
news_indices = [i[0] for i in sim_scores]

In [52]:
# Show the titles of the similar articles
df['제목'].iloc[news_indices]

1754    스타벅스, 청년인재 대학생 양성…올해 역대 최다 졸업생 배출
1726           스타벅스, 커뮤니티 스토어 청년인재 졸업식 진행
1764         ‘스타벅스 청년인재’ 올해 16명 졸업… 역대 최다
1722    스타벅스, 청년인재 대학생 16명 졸업... 역대 최다 배출
1729     스타벅스의 청년인재 대학생, 올해 역대 최다 인원 졸업했다
1718     스타벅스, 올해 '청년인재 대학생' 최다 배출…16명 졸업
2363           스타벅스, '커뮤니티 스토어' 청년인재 모집한다
1715       스타벅스, 커뮤니티 스토어 청년인재 졸업생 16명 배출
1717       스타벅스, 커뮤니티 스토어 청년인재 졸업생 16명 배출
1760       스타벅스, 커뮤니티 스토어 청년인재 졸업생 16명 배출
Name: 제목, dtype: object
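
The lookup, sort, and slice steps can be wrapped into one helper. This is a sketch rather than the notebook's own code: the function name `top_k_similar` and the toy matrix are hypothetical. It removes the query article by index instead of assuming it sits at position 0 after sorting, because when another title has similarity 1.0 (an exact duplicate), a stable sort can place that duplicate first.

```python
import numpy as np

def top_k_similar(sim: np.ndarray, idx: int, k: int = 10) -> list[tuple[int, float]]:
    """Return up to k (index, score) pairs most similar to row idx,
    skipping idx itself."""
    scores = sim[idx]
    order = np.argsort(-scores, kind="stable")   # descending; ties keep index order
    order = order[order != idx][:k]
    return [(int(j), float(scores[j])) for j in order]

# Hypothetical 4x4 similarity matrix.
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.9],
    [0.2, 1.0, 0.1, 0.0],
    [0.9, 0.1, 1.0, 0.5],
    [0.9, 0.0, 0.5, 1.0],
])
print(top_k_similar(toy_sim, 0, k=2))
```

With the notebook's variables, a call such as `top_k_similar(cosine_sim, idx)` would play the role of the three cells above.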

In [53]:
# Cosine similarity matrix as a DataFrame
sim_df = pd.DataFrame(cosine_sim); sim_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3640,3641,3642,3643,3644,3645,3646,3647,3648,3649
0,1.000000,1.000000,0.771640,0.536160,0.627143,0.027781,0.000000,0.704813,0.394299,0.430379,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.033421,0.000000,0.000000,0.000000
1,1.000000,1.000000,0.771640,0.536160,0.627143,0.027781,0.000000,0.704813,0.394299,0.430379,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.033421,0.000000,0.000000,0.000000
2,0.771640,0.771640,1.000000,0.464987,0.823668,0.024093,0.000000,0.925677,0.595107,0.290204,...,0.000000,0.000000,0.000000,0.000000,0.046721,0.000000,0.028985,0.000000,0.000000,0.000000
3,0.536160,0.536160,0.464987,1.000000,0.446965,0.038661,0.000000,0.502320,0.404426,0.465671,...,0.000000,0.000000,0.000000,0.000000,0.074084,0.000000,0.046510,0.000000,0.063407,0.000000
4,0.627143,0.627143,0.823668,0.446965,1.000000,0.023160,0.000000,0.889800,0.485606,0.278957,...,0.000000,0.123744,0.000000,0.094688,0.000000,0.000000,0.027861,0.000000,0.000000,0.096305
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3645,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.004998,0.000000,0.000000,0.000000,...,0.081705,0.165300,0.155806,0.126486,0.005117,1.000000,0.000000,0.006463,0.004379,0.128647
3646,0.033421,0.033421,0.028985,0.046510,0.027861,0.040238,0.000000,0.031312,0.025210,0.029027,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000
3647,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.004839,0.000000,0.000000,0.000000,...,0.079104,0.179990,0.006844,0.854926,0.004954,0.006463,0.000000,1.000000,0.004240,0.734042
3648,0.000000,0.000000,0.000000,0.063407,0.000000,0.000000,0.003279,0.000000,0.000000,0.000000,...,0.053601,0.003999,0.004637,0.003060,0.049818,0.004379,0.000000,0.004240,1.000000,0.003112


In [54]:
# Build a boolean mask: similarity greater than 0.5
sim_boolean = sim_df > 0.5 ; sim_boolean

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3640,3641,3642,3643,3644,3645,3646,3647,3648,3649
0,True,True,True,True,True,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,True,True,True,True,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
2,True,True,True,False,True,False,False,True,True,False,...,False,False,False,False,False,False,False,False,False,False
3,True,True,False,True,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,True,True,False,True,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3645,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
3646,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
3647,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,True,False,True
3648,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False


In [55]:
# Articles with at least 10 entries (the article itself included) whose similarity exceeds 0.5
sim_boolean.sum() >= 10

0        True
1        True
2        True
3       False
4       False
        ...  
3645    False
3646    False
3647     True
3648     True
3649     True
Length: 3650, dtype: bool

In [56]:
# How many articles meet that condition?
(sim_boolean.sum() >= 10).sum()

2465
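
Note that each article's count includes the article itself, since the diagonal of a cosine similarity matrix is 1.0 and therefore always above the threshold. A toy illustration with a hypothetical 3x3 matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical symmetric similarity matrix.
toy_sim = pd.DataFrame(np.array([
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.1],
    [0.6, 0.1, 1.0],
]))

above = toy_sim > 0.5
counts_with_self = above.sum()               # column-wise count of True values
counts_without_self = counts_with_self - 1   # remove the diagonal entry

print(counts_with_self.tolist(), counts_without_self.tolist())
print(int((counts_with_self >= 2).sum()))
```

So a cutoff of `>= 10` on the raw counts effectively means "at least 9 other articles"; subtracting 1 first would make the cutoff refer to other articles only.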

In [57]:
# Indices of the articles that have at least 10 entries with similarity above 0.5
sim_df[sim_boolean.sum() >= 10].index

Int64Index([   0,    1,    2,    7,    8,   10,   11,   12,   13,   16,
            ...
            3621, 3627, 3628, 3630, 3635, 3636, 3643, 3647, 3648, 3649],
           dtype='int64', length=2465)

In [58]:
# How similar are the articles whose similarity to article 0 exceeds 0.4?
idx_list = sim_df[0][sim_df[0] > 0.4].index
df.iloc[idx_list,:]

Unnamed: 0,제목,언론사,날짜,URL,네이버뉴스_URL,단어,제목_전처리
0,"손흥민 효과! 메가커피, 선물하고 싶은 브랜드 등극",싱글리스트,2023.01.02.,http://www.slist.kr/news/articleView.html?idxn...,,"[손흥민, 효과, 메, 커피, 선물, 브랜드, 등]",손흥민 효과 메가커피 선물하고 싶은 브랜드 등극
1,"‘손흥민 효과’ 메가커피, 선물하고 싶은 브랜드 등극",스포츠경향,2023.01.02.,http://sports.khan.co.kr/news/sk_index.html?ar...,https://n.news.naver.com/mnews/article/144/000...,"[손흥민, 효과, 메, 커피, 선물, 브랜드, 등]",손흥민 효과 메가커피 선물하고 싶은 브랜드 등극
2,"메가MGC커피, 손흥민 효과로 '선물하고 싶은 브랜드' 등극",서울와이어,2023.01.02.,http://www.seoulwire.com/news/articleView.html...,,"[메, 커피, 손흥민, 효과, 선물, 브랜드, 등]",메가MGC커피 손흥민 효과로 선물하고 싶은 브랜드 등극
3,"메가MCG커피, 손흥민 모델 기용 후 판매 급성장",포인트데일리,2023.01.02.,http://www.thekpm.com/news/articleView.html?id...,,"[메, 커피, 손흥민, 모델, 기용, 후, 판매, 급성]",메가MCG커피 손흥민 모델 기용 후 판매 급성장
4,"[생활경제 이슈] 메가MGC커피, 손흥민 손잡고 선물하고 싶은 브랜드 등극 外",로이슈,2023.01.02.,http://www.lawissue.co.kr/view.php?ud=20230102...,,"[생활, 경제, 이슈, 메, 커피, 손흥민, 선물, 브랜드, 등]",생활경제 이슈 메가MGC커피 손흥민 손잡고 선물하고 싶은 브랜드 등극 外
7,"'손흥민 모델' 메가MGC커피, 선물하고 싶은 브랜드 등극",OBS,2023.01.02.,http://www.obsnews.co.kr/news/articleView.html...,,"[손흥민, 모델, 메, 커피, 선물, 브랜드, 등]",손흥민 모델 메가MGC커피 선물하고 싶은 브랜드 등극
9,“역시 손흥민 파워” 메가커피 모바일쿠폰 판매량 3배 증가,매일경제,2023.01.02.,https://www.mk.co.kr/article/10589534,https://n.news.naver.com/mnews/article/009/000...,"[역시, 손흥민, 파워, 메, 커피, 모바일, 쿠폰, 판매량, 배, 증가]",역시 손흥민 파워 메가커피 모바일쿠폰 판매량 3배 증가
10,"메가커피, 모바일 e쿠폰 판매량 3배 성장…전속 모델 손흥민 효과 ‘톡톡’",브릿지경제,2023.01.02.,https://www.viva100.com/main/view.php?key=2023...,,"[메, 커피, 모바일, 쿠폰, 판매량, 배, 성장, 전속, 모델, 손흥민, 효과, 톡톡]",메가커피 모바일 e쿠폰 판매량 3배 성장 전속 모델 손흥민 효과 톡톡
11,"‘손흥민 전속 모델’ 메가커피, 모바일 쿠폰 매출 3배 껑충",데일리안,2023.01.02.,https://www.dailian.co.kr/news/view/1189190/?s...,https://n.news.naver.com/mnews/article/119/000...,"[손흥민, 전속, 모델, 메, 커피, 모바일, 쿠폰, 매출, 배, 껑충]",손흥민 전속 모델 메가커피 모바일 쿠폰 매출 3배 껑충
16,‘손흥민 효과’ 톡톡…메가커피 모바일 쿠폰 판매량 3배↑,매일경제,2023.01.02.,https://www.mk.co.kr/article/10590093,https://n.news.naver.com/mnews/article/009/000...,"[손흥민, 효과, 톡톡, 메, 커피, 모바일, 쿠폰, 판매량, 배]",손흥민 효과 톡톡 메가커피 모바일 쿠폰 판매량 3배


In [60]:
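# Group news into topics: articles whose similarity to a seed article exceeds 0.25

idx_list = list(sim_df[sim_boolean.sum() >= 10].index)
cluster = []        # members of each topic (one Index per topic)
pass_list = set()   # articles already assigned to a topic
threshold = 0.25
id_idx = []         # seed (representative) article of each topic
for i in idx_list:
    if i not in pass_list:
        idx = sim_df[i][sim_df[i] > threshold].index
        cluster.append(idx)
        pass_list.update(idx)
        id_idx.append(i)

print(len(cluster))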

106
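
The grouping loop is a greedy, single-pass procedure: each not-yet-assigned candidate becomes the seed of a new topic, and every article close enough to that seed joins it. A self-contained sketch on a toy matrix follows; the function name `greedy_topics` and the matrix values are hypothetical.

```python
import numpy as np
import pandas as pd

def greedy_topics(sim_df: pd.DataFrame, candidates, threshold: float):
    """Greedy grouping: each unassigned candidate seeds a topic containing
    every article whose similarity to the seed exceeds the threshold."""
    assigned = set()
    seeds, topics = [], []
    for i in candidates:
        if i in assigned:
            continue                      # already covered by an earlier seed
        members = sim_df.index[sim_df[i] > threshold]
        topics.append(list(members))
        assigned.update(members)
        seeds.append(i)
    return seeds, topics

# Hypothetical similarity matrix for four articles.
toy_sim = pd.DataFrame(np.array([
    [1.0, 0.9, 0.3, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.3, 0.1, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]))
seeds, topics = greedy_topics(toy_sim, candidates=[0, 1, 2, 3], threshold=0.25)
print(seeds, topics)
```

Two properties of this approach are worth keeping in mind. The result depends on the order of the candidates, and members are compared only with the seed: articles 1 and 2 above share a topic although their mutual similarity is just 0.1. A member can also reappear in a later topic, because later seeds are not restricted to unassigned articles.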


In [61]:
# Starbucks news topics (one representative article per topic)
starbucks_df = df.iloc[id_idx,:].copy(); starbucks_df

Unnamed: 0,제목,언론사,날짜,URL,네이버뉴스_URL,단어,제목_전처리
0,"손흥민 효과! 메가커피, 선물하고 싶은 브랜드 등극",싱글리스트,2023.01.02.,http://www.slist.kr/news/articleView.html?idxn...,,"[손흥민, 효과, 메, 커피, 선물, 브랜드, 등]",손흥민 효과 메가커피 선물하고 싶은 브랜드 등극
12,"스타벅스, 국내산 흑미 ‘블랙 햅쌀 고봉 라떼’ 출시",조세일보,2023.01.02.,http://www.joseilbo.com/news/news_read.php?uid...,https://n.news.naver.com/mnews/article/123/000...,"[스타벅스, 국내, 산, 블랙, 햅쌀, 고봉, 라떼, 출시]",스타벅스 국내산 흑미 블랙 햅쌀 고봉 라떼 출시
31,"[인사] LG생활건강, 스타벅스 출신 문혜영 부사장 미주사업총괄 영입",이넷뉴스,2023.01.04.,https://www.enetnews.co.kr/news/articleView.ht...,,"[인사, 생활, 건강, 스타벅스, 출신, 혜영, 부사, 미주, 사업, 총괄, 입]",인사 LG생활건강 스타벅스 출신 문혜영 부사장 미주사업총괄 영입
64,"스타벅스, 2023년도 1분기 장애인 바리스타 채용",싱글리스트,2023.01.05.,http://www.slist.kr/news/articleView.html?idxn...,,"[스타벅스, 장애인, 바리스타, 채용]",스타벅스 2023년도 1분기 장애인 바리스타 채용
83,"스타벅스, 장애인 바리스타 공개채용...오는 15일까지 접수",뉴스핌,2023.01.05.,http://www.newspim.com/news/view/20230105000119,,"[스타벅스, 장애인, 바리스타, 공개, 채용, 접수]",스타벅스 장애인 바리스타 공개채용 오는 15일까지 접수
...,...,...,...,...,...,...,...
3443,"스타벅스, 장애인의 날 맞아 '텀블러 그림 공모전' 개최",포인트데일리,2023.04.19.,https://www.thekpm.com/news/articleView.html?i...,,"[스타벅스, 장애인, 날, 텀블러, 그림, 공모전, 개최]",스타벅스 장애인의 날 맞아 텀블러 그림 공모전 개최
3457,동서식품 ‘카누 바리스타’ 국산 캡슐커피 자존심 세운다,인더뉴스,2023.04.19.,https://www.inthenews.co.kr/news/article.html?...,,"[동서식품, 카누, 바리스타, 국산, 캡슐, 커피, 자존심]",동서식품 카누 바리스타 국산 캡슐커피 자존심 세운다
3492,"스타벅스, 다회용컵 사용 확산 '다다익선' 캠페인",싱글리스트,2023.04.20.,http://www.slist.kr/news/articleView.html?idxn...,,"[스타벅스, 회용컵, 사용, 확산, 다다익선, 캠페인]",스타벅스 다회용컵 사용 확산 다다익선 캠페인
3518,"스타벅스, 지구의날 기념 다회용 컵 사용 캠페인 전개...""지구도 지키고 이모티콘도""",한국면세뉴스,2023.04.20.,http://www.kdfnews.com/news/articleView.html?i...,,"[스타벅스, 지구, 날, 기념, 회용, 컵, 사용, 캠페인, 전개, 지구, 이모티콘]",스타벅스 지구의날 기념 다회용 컵 사용 캠페인 전개 지구도 지키고 이모티콘도
