<a href="https://colab.research.google.com/github/sgr1118/NLP_basic/blob/main/_4_%EA%B5%B0%EC%A7%91_%EB%B6%84%EC%84%9D(Cluster_Analysis).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

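# Cluster Analysis

* Cluster analysis groups data points that share similar characteristics.
* It partitions data into clusters based on similarity and then analyzes the characteristics of each cluster.
* When clustering text, the goal is for texts within a cluster to be as similar to each other as possible, and as dissimilar as possible from texts in other clusters.

## Measuring Document Similarity

* To compare documents with each other, compute the Euclidean distance, the Jaccard similarity, and the cosine similarity.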

In [2]:
import nltk

nltk.download('punkt')
nltk.download('wordnet')

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
d1 = 'Thank like a man of action and act like man of thought.'
d2 = 'Try no to become a man of success but rather try to become a man of value.'
d3 = 'Give me liberty, of give me death'

corpus = [d1, d2, d3]
print(corpus)

['Thank like a man of action and act like man of thought.', 'Try no to become a man of success but rather try to become a man of value.', 'Give me liberty, of give me death']


In [4]:
import pandas as pd

vector = CountVectorizer(stop_words = 'english')
bow = vector.fit_transform(corpus)

columns = []
for k, v in sorted(vector.vocabulary_.items(), key = lambda item:item[1]):
    columns.append(k)

df = pd.DataFrame(bow.toarray(),columns = columns)
df

Unnamed: 0,act,action,death,liberty,like,man,success,thank,thought,try,value
0,1,1,0,0,2,2,0,1,1,0,0
1,0,0,0,0,0,2,1,0,0,2,1
2,0,0,1,1,0,0,0,0,0,0,0


### Euclidean Distance

* Measures the distance between two points $p$ and $q$ in a multidimensional space.

$$ \sqrt{\sum_{i=1}^{n}\left (q_i - p_i \right)^2} $$


In [7]:
import numpy as np

def euclidean_distance(p, q):
    return np.sqrt(np.sum((q-p)**2))

In [9]:
print(euclidean_distance(bow[0].toarray(), bow[1].toarray()))
print(euclidean_distance(bow[0].toarray(), bow[2].toarray()))
print(euclidean_distance(bow[1].toarray(), bow[2].toarray()))

3.7416573867739413
3.7416573867739413
3.4641016151377544
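
As a cross-check on the three distances above, the sketch below recomputes them as a full pairwise matrix. The rows are copied by hand from the bag-of-words table shown earlier, and `euclidean_distances` from scikit-learn stands in for the hand-written function, so this is an illustration rather than part of the original notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Rows copied from the bag-of-words table above (English stop words removed).
bow_rows = np.array([
    [1, 1, 0, 0, 2, 2, 0, 1, 1, 0, 0],  # d1
    [0, 0, 0, 0, 0, 2, 1, 0, 0, 2, 1],  # d2
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # d3
])

# Entry [i, j] is the Euclidean distance between document i and document j.
dist_matrix = euclidean_distances(bow_rows)
print(dist_matrix)
```

The squared differences between d1 and d2 sum to 14, and those between d2 and d3 sum to 12, which is where the values √14 ≈ 3.742 and √12 ≈ 3.464 come from.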


### Jaccard Similarity

* Uses the ratio of the number of terms common to two text documents to the total number of unique terms across both documents.

$$ jaccard(A, B)=\frac{\left | A\cap B \right |}{\left | A\cup B \right |}=\frac{\left | A\cap B \right |}{\left | A \right | + \left | B \right | - \left | A\cap B \right |} $$


In [11]:
from nltk.tokenize import wordpunct_tokenize  # canonical location of the tokenizer

def jaccard_similarity(d1, d2):
    lemmatizer = WordNetLemmatizer()

    words1 = [lemmatizer.lemmatize(word.lower()) for word in wordpunct_tokenize(d1)]
    words2 = [lemmatizer.lemmatize(word.lower()) for word in wordpunct_tokenize(d2)]

    inter = len(set(words1).intersection(set(words2)))
    union = len(set(words1).union(set(words2)))

    return inter/union

In [13]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [14]:
print(jaccard_similarity(d1, d2))
print(jaccard_similarity(d1, d3))
print(jaccard_similarity(d2, d3))

0.2222222222222222
0.06666666666666667
0.058823529411764705
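
To make the formula concrete, here is a minimal sketch that computes the same score for `d1` and `d2` with plain Python sets. It assumes a regex tokenizer in place of `wordpunct_tokenize` and skips lemmatization, and the helper names `tokenize` and `jaccard_from_sets` are hypothetical. Punctuation counts as a token here, so the final period is part of the intersection.

```python
import re

def tokenize(text):
    # Lowercase, then split into runs of word characters or runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text.lower())

def jaccard_from_sets(a, b):
    inter = len(a & b)
    # |A ∪ B| = |A| + |B| - |A ∩ B|, the second form of the formula.
    union = len(a) + len(b) - inter
    return inter / union

d1 = 'Thank like a man of action and act like man of thought.'
d2 = 'Try no to become a man of success but rather try to become a man of value.'

set1, set2 = set(tokenize(d1)), set(tokenize(d2))
score = jaccard_from_sets(set1, set2)

print(sorted(set1 & set2))
print(score)
```

The two sets have 10 and 12 unique tokens with 4 in common, so the score is 4 / 18, which agrees with the first value printed above.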


### Cosine Similarity

* Uses the cosine of the angle between two vector representations. Both BoW and TF-IDF matrices can serve as vector representations of text.

$$ cosine(A, B)=\frac{A \cdot B}{\left \| A \right \| \left \| B \right \|}=\frac{\sum_{i=1}^{N}A_i\times B_i}{\sqrt{\sum_{i=1}^{N}\left (A_i \right)^2}\times \sqrt{\sum_{i=1}^{N}\left (B_i \right)^2}} $$

In [15]:
tfidf = TfidfVectorizer()

tfidf_vectors = tfidf.fit_transform(corpus)

print(cosine_similarity(tfidf_vectors[0], tfidf_vectors[1]))
print(cosine_similarity(tfidf_vectors[0], tfidf_vectors[2]))
print(cosine_similarity(tfidf_vectors[1], tfidf_vectors[2]))

[[0.22861951]]
[[0.06083323]]
[[0.04765587]]
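
The formula can also be evaluated directly with NumPy. The sketch below does so on the bag-of-words rows from the table shown earlier (not the TF-IDF vectors, so the numbers differ from the output above) and compares the result with scikit-learn's `cosine_similarity`. The helper name `cosine_manual` is hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_manual(a, b):
    # Dot product divided by the product of the two vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rows copied from the bag-of-words table (English stop words removed).
bow_rows = np.array([
    [1, 1, 0, 0, 2, 2, 0, 1, 1, 0, 0],  # d1
    [0, 0, 0, 0, 0, 2, 1, 0, 0, 2, 1],  # d2
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # d3
], dtype=float)

manual = cosine_manual(bow_rows[0], bow_rows[1])
library = cosine_similarity(bow_rows[[0]], bow_rows[[1]])[0, 0]

print(manual, library)
print(cosine_manual(bow_rows[0], bow_rows[2]))  # no shared terms
```

With stop words removed, d1 and d3 share no terms, so their dot product and therefore their cosine similarity is zero.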


## Clustering

* Download the review dataset (http://archive.ics.uci.edu/ml/machine-learning-databases/opinion/OpinosisDataset1.0.zip)

In [16]:
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/opinion/OpinosisDataset1.0.zip

--2023-01-25 04:53:15--  http://archive.ics.uci.edu/ml/machine-learning-databases/opinion/OpinosisDataset1.0.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 773840 (756K) [application/x-httpd-php]
Saving to: ‘OpinosisDataset1.0.zip’


2023-01-25 04:53:15 (4.42 MB/s) - ‘OpinosisDataset1.0.zip’ saved [773840/773840]



In [17]:
!unzip OpinosisDataset1.0.zip

Archive:  OpinosisDataset1.0.zip
   creating: OpinosisDataset1.0/examples/
   creating: OpinosisDataset1.0/examples/prepare4rouge/
   creating: OpinosisDataset1.0/examples/prepare4rouge/input/
   creating: OpinosisDataset1.0/examples/prepare4rouge/input/summaries-base/
  inflating: OpinosisDataset1.0/examples/prepare4rouge/input/summaries-base/accuracy_garmin_nuvi_255W_gps.baseline  
  inflating: OpinosisDataset1.0/examples/prepare4rouge/input/summaries-base/bathroom_bestwestern_hotel_sfo.baseline  
   creating: OpinosisDataset1.0/examples/prepare4rouge/input/summaries-gold/
   creating: OpinosisDataset1.0/examples/prepare4rouge/input/summaries-gold/accuracy_garmin_nuvi_255W_gps/
  inflating: OpinosisDataset1.0/examples/prepare4rouge/input/summaries-gold/accuracy_garmin_nuvi_255W_gps/accuracy_garmin_nuvi_255W_gps.1.gold  
  inflating: OpinosisDataset1.0/examples/prepare4rouge/input/summaries-gold/accuracy_garmin_nuvi_255W_gps/accuracy_garmin_nuvi_255W_gps.2.gold  
 extracting: Opinosis

In [19]:
!ls OpinosisDataset1.0/topics

accuracy_garmin_nuvi_255W_gps.txt.data
bathroom_bestwestern_hotel_sfo.txt.data
battery-life_amazon_kindle.txt.data
battery-life_ipod_nano_8gb.txt.data
battery-life_netbook_1005ha.txt.data
buttons_amazon_kindle.txt.data
comfort_honda_accord_2008.txt.data
comfort_toyota_camry_2007.txt.data
directions_garmin_nuvi_255W_gps.txt.data
display_garmin_nuvi_255W_gps.txt.data
eyesight-issues_amazon_kindle.txt.data
features_windows7.txt.data
fonts_amazon_kindle.txt.data
food_holiday_inn_london.txt.data
food_swissotel_chicago.txt.data
free_bestwestern_hotel_sfo.txt.data
gas_mileage_toyota_camry_2007.txt.data
interior_honda_accord_2008.txt.data
interior_toyota_camry_2007.txt.data
keyboard_netbook_1005ha.txt.data
location_bestwestern_hotel_sfo.txt.data
location_holiday_inn_london.txt.data
mileage_honda_accord_2008.txt.data
navigation_amazon_kindle.txt.data
parking_bestwestern_hotel_sfo.txt.data
performance_honda_accord_2008.txt.data
performance_netbook_1005ha.txt.data
price_amazon_kindle.txt.data
pri

In [None]:
import glob, os

path = r'./OpinosisDataset1.0/topics/'
files = glob.glob(os.path.join(path, '*data'))
filenames = []
opinions = []

for file_ in files:
    filename = file_.split('/')[-1]
    filename = filename.split('.')[0]
    filenames.append(filename)

    df = pd.read_table(file_, index_col = None, header = 0, encoding = 'latin1')
    opinions.append(df.to_string())

opinion_df = pd.DataFrame({'filename':filenames, 'opinion': opinions})
opinion_df

In [26]:
tfidf = TfidfVectorizer(stop_words = 'english', ngram_range = (1,2), min_df = 0.05, max_df = 0.85)

tfidf_vectors = tfidf.fit_transform(opinion_df['opinion'])
feature_name = tfidf.get_feature_names_out()
print(feature_name)

['00' '000' '000 miles' ... 'yes rooms' 'yields' 'zoom']


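### DBSCAN Algorithm

* A density-based clustering algorithm.
* Starting from a given vector, it forms a cluster when at least a threshold number of points lie within a given radius.
* Because clusters form only where the density exceeds a threshold, it handles noise well.
* Because it expands a cluster by moving the reference point through the points already in it, it is robust to irregularly shaped distributions.
* It is slower than K-means and is highly sensitive to its parameters, epsilon (`eps`) and `min_samples`.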

In [31]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps = 0.7, min_samples = 3, metric = 'cosine')
dbscan_label = dbscan.fit_predict(tfidf_vectors)
print(dbscan_label)

[ 0 -1  0 -1  1 -1 -1  0  2  3  0  3  1  0  3  0 -1  3 -1  0 -1  3  0 -1
 -1 -1 -1  1  0  3  1 -1  2  3  3  3  3 -1  3  2  1  3 -1  3 -1  2 -1 -1
 -1 -1  2]
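
The label `-1` marks points that DBSCAN treats as noise. How many clusters and noise points appear depends strongly on `eps`, which the sketch below illustrates on a small synthetic 2-D dataset. The points and the `eps` values are made up for illustration, and the default Euclidean metric is used instead of the cosine metric from the cell above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two dense groups and one isolated point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # dense group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # dense group B
    [10.0, 0.0],                          # isolated point
])

summary = {}
for eps in (0.05, 0.5, 20.0):
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(points)
    n_clusters = len(set(labels) - {-1})   # -1 is the noise label
    n_noise = int(np.sum(labels == -1))
    summary[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: labels={labels}, clusters={n_clusters}, noise={n_noise}")
```

A radius that is too small leaves every point as noise, while one that is too large merges everything into a single cluster. Only the middle value recovers the two groups and flags the isolated point.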


In [32]:
opinion_df['cluster'] = dbscan_label
opinion_df

Unnamed: 0,filename,opinion,cluster
0,video_ipod_nano_8gb,...,0
1,mileage_honda_accord_2008,...,-1
2,size_asus_netbook_1005ha,...,0
3,gas_mileage_toyota_camry_2007,...,-1
4,battery-life_ipod_nano_8gb,...,1
5,staff_bestwestern_hotel_sfo,...,-1
6,parking_bestwestern_hotel_sfo,...,-1
7,display_garmin_nuvi_255W_gps,...,0
8,comfort_toyota_camry_2007,...,2
9,service_swissotel_hotel_chicago,...,3


In [34]:
for cluster_num in set(dbscan_label):
    print('Clustr: {}'.format(cluster_num))
    df = opinion_df[opinion_df['cluster'] == cluster_num]
    for filename in df['filename']:
        print(filename)
    print()

Clustr: 0
video_ipod_nano_8gb
size_asus_netbook_1005ha
display_garmin_nuvi_255W_gps
screen_netbook_1005ha
screen_garmin_nuvi_255W_gps
speed_garmin_nuvi_255W_gps
screen_ipod_nano_8gb
keyboard_netbook_1005ha
voice_garmin_nuvi_255W_gps

Clustr: 1
battery-life_ipod_nano_8gb
battery-life_amazon_kindle
performance_honda_accord_2008
battery-life_netbook_1005ha
performance_netbook_1005ha

Clustr: 2
comfort_toyota_camry_2007
interior_toyota_camry_2007
seats_honda_accord_2008
interior_honda_accord_2008
comfort_honda_accord_2008

Clustr: 3
service_swissotel_hotel_chicago
price_amazon_kindle
bathroom_bestwestern_hotel_sfo
food_holiday_inn_london
price_holiday_inn_london
service_bestwestern_hotel_sfo
food_swissotel_chicago
location_bestwestern_hotel_sfo
rooms_bestwestern_hotel_sfo
room_holiday_inn_london
service_holiday_inn_london
rooms_swissotel_chicago
location_holiday_inn_london

Clustr: -1
mileage_honda_accord_2008
gas_mileage_toyota_camry_2007
staff_bestwestern_hotel_sfo
parking_bestwestern_ho

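### K-means Algorithm

* A representative clustering algorithm.
* The number of clusters k must be specified by the user.
* It searches for the clusters that minimize the sum of squared distances between each cluster's mean vector and the vectors assigned to that cluster.
* It is sensitive to noisy data, and because the initial centroids are chosen randomly, the resulting clusters can differ between runs or turn out poorly.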

In [35]:
from sklearn.cluster import KMeans

k = 3
kmeans = KMeans(n_clusters = k, max_iter = 10000, random_state = 42)
kmeans_label = kmeans.fit_predict(tfidf_vectors)
kmeans_centers = kmeans.cluster_centers_

print(kmeans_label) # each document is assigned one of 3 labels
pd.DataFrame(kmeans_centers) # 3 cluster centers (rows), each with 4,400 feature columns

[2 0 2 0 2 1 1 2 0 1 2 2 2 2 1 2 2 1 2 2 2 1 2 2 1 2 1 0 2 1 2 2 0 1 1 1 1
 2 1 0 2 1 2 1 2 0 0 2 0 2 0]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4390,4391,4392,4393,4394,4395,4396,4397,4398,4399
0,0.000932,0.007161,0.00486,0.003372,0.003252,0.004269,0.007004,0.0,0.0,0.0,...,0.004123,0.002509,0.005445,0.0,0.0,0.002379,0.001887,0.0,0.002924,0.0
1,0.004467,0.0,0.0,0.0,0.000195,0.0,0.0,0.000857,0.001729,0.0,...,0.001365,0.001024,0.002556,0.000865,0.0007,0.000951,0.003261,0.001657,0.0,0.0
2,0.001304,0.0,0.0,0.0,0.0,0.000551,0.0,0.0,0.0,0.003177,...,0.005709,0.003491,0.005241,0.00244,0.001774,0.0,0.002975,0.0,0.000411,0.007339


### Measuring Review Document Similarity
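
One plausible way to measure similarity between review documents is to compute the full cosine-similarity matrix over their TF-IDF vectors and look up each document's nearest neighbour. The sketch below does this on a made-up set of four short reviews. It is an assumption about what this section covers, not the original cell.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for review documents.
reviews = [
    "battery life is great",
    "battery life is short",
    "room was clean",
    "room was noisy",
]

vectors = TfidfVectorizer().fit_transform(reviews)

# sim[i, j] is the cosine similarity between review i and review j.
sim = cosine_similarity(vectors)

# Ignore self-similarity, then pick the most similar other review.
np.fill_diagonal(sim, -1.0)
nearest = sim.argmax(axis=1)

for i, j in enumerate(nearest):
    print(f"{reviews[i]!r} -> {reviews[j]!r} (similarity {sim[i, j]:.3f})")
```

With the notebook's data, `tfidf_vectors` and `opinion_df['filename']` could take the place of `vectors` and `reviews`.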

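## Hierarchical Clustering

* Hierarchical clustering is an algorithm that builds clusters by successively merging individual items with similar items or groups.
* Unlike non-hierarchical clustering, it can run without the number of clusters being specified in advance.  
* It requires the distances or similarities between all pairs of items to be computed beforehand, and its computational complexity is higher than that of non-hierarchical clustering.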




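### Agglomerative Clustering

* Practice hierarchical clustering with `AgglomerativeClustering`, a bottom-up form of hierarchical clustering.
* Agglomerative clustering starts with each item as its own cluster and repeatedly merges the two most similar clusters until a stopping condition is met.
* The criterion for choosing which two clusters to merge (the linkage) can be set to one of three options covered here: ward, average, or complete.

`ward`: merges the two clusters that produce the smallest increase in total within-cluster variance (default)

`average`: merges the two clusters with the smallest average distance between their points

`complete`: merges the two clusters with the smallest maximum distance between their points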
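
A minimal sketch of the three linkage options, assuming a small synthetic 2-D dataset (the points are made up) and a fixed `n_clusters=2` as the stopping condition:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two well-separated groups of three points each.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

results = {}
for linkage in ("ward", "average", "complete"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    results[linkage] = model.fit_predict(points)
    print(linkage, results[linkage])
```

On data this cleanly separated all three criteria give the same partition. They tend to diverge on elongated or unevenly sized clusters.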

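### Dendrogram

* Compute the pairwise Euclidean distances between words with `pdist`

* Cluster analysis and dendrogram visualization using the pairwise Euclidean distances between words

* Cluster analysis and dendrogram visualization using the pairwise cosine similarity between words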
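
The three steps above can be sketched with SciPy as follows. The words and their 2-D vectors are hypothetical placeholders, and `no_plot=True` is used so the example runs without matplotlib. Removing that argument draws the dendrogram when matplotlib is available.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical word vectors (in practice, e.g. columns of a TF-IDF matrix).
words = ["battery", "charge", "room", "hotel"]
word_vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])

# Condensed pairwise distance vectors; "cosine" here means 1 - cosine similarity.
euclid_dist = pdist(word_vectors, metric="euclidean")
cosine_dist = pdist(word_vectors, metric="cosine")

# Build the merge trees. Ward assumes Euclidean distances, so average linkage
# is used for the cosine case.
Z_euclid = linkage(euclid_dist, method="ward")
Z_cosine = linkage(cosine_dist, method="average")

# Compute the dendrogram layout without plotting; "ivl" holds the leaf order.
tree = dendrogram(Z_cosine, labels=words, no_plot=True)
print(tree["ivl"])

# Cut each tree into two flat clusters.
flat_euclid = fcluster(Z_euclid, t=2, criterion="maxclust")
flat_cosine = fcluster(Z_cosine, t=2, criterion="maxclust")
print(flat_euclid, flat_cosine)
```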

## Newsgroup Cluster Analysis

### Data Loading and Preprocessing

### K-means Clustering

### Agglomerative Clustering