# Clustering

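- Purpose of clustering
    - Group many samples into a small number of clusters and characterize each cluster in order to understand the data.
    - Use the cluster characteristics to make finer-grained decisions about the samples in each cluster.
- Distance and similarity
    - Grouping similar samples into one cluster requires a notion of distance or similarity.
    - `two samples are similar` = `their similarity is high` = `their distance is short`
        - Most distance and similarity measures are defined for numeric variables, so categorical (string) values must first be converted to numbers.
        - Dummy encoding
            - one-hot encoding
            - `pandas.get_dummies`
                - data: the DataFrame or Series to encode
                - drop_first: whether to drop the first dummy variable. Usually set to True, because one dummy in each group is fully determined by the others.
        - Euclidean distance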
            ![2_31.png](../materials/2_35.png) 
        - Manhattan distance (often used for data on a Likert scale, such as survey responses)
            ![2_31.png](../materials/2_34.png)
        - Cosine similarity: the cosine of the angle between two vectors. It measures similarity of direction and ignores vector length; commonly used in recommender systems.
            ![2_31.png](../materials/2_33.png)
        - Matching similarity: a measure for binary data, the fraction of features on which the two samples agree (both 1 or both 0).
            ![2_31.png](../materials/2_31.png)
        - Jaccard similarity: among the features where at least one of the two samples is 1, the fraction where both are 1 (features where both are 0 are ignored).
            ![2_31.png](../materials/2_32.png)
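The five measures listed above can be sketched in a few lines of NumPy. The vectors `x`, `y`, `a`, and `b` are made-up toy values, not taken from the lecture data, and the formulas are the standard textbook definitions rather than a transcription of the figures.

```python
import numpy as np

# Two toy numeric samples (values are made up for illustration).
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(x - y))           # sum of absolute differences
# Cosine similarity depends on direction only, not on vector length.
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Two toy binary samples.
a = np.array([1, 1, 0, 0, 0], dtype=bool)
b = np.array([1, 0, 0, 0, 1], dtype=bool)

matching = np.mean(a == b)                  # agreements / all features
jaccard = np.sum(a & b) / np.sum(a | b)     # both 1 / at least one 1

print(euclidean, manhattan, cosine, matching, jaccard)
```

Here `y` is just `x` scaled by 2, so the cosine similarity is 1 even though the Euclidean and Manhattan distances are not 0. The matching similarity counts the two shared zeros as agreement, while the Jaccard similarity ignores them.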

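Note for the book-purchase analysis below: Jaccard similarity is meant for binary data, but the purchase matrix there contains counts such as 3. <br>
scikit-learn treats any nonzero value (1, 2, or 3) as 1 when computing it.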

## Hierarchical Clustering

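- Clusters are merged repeatedly until all samples end up in a single cluster (agglomerative, bottom-up).
- At the start, every sample is its own cluster; the closest clusters are then merged one step at a time.
- Clusters grow progressively larger, and a suitable number of clusters is chosen somewhere along the way.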

![2_36-2.png](../materials/2_36.png)

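How should clusters be merged? We need a way to measure the distance between clusters.
- Single linkage: the distance between two clusters is the distance between their closest pair of points, one from each cluster.
- If two points happen to be unusually close while the rest are far apart, the result can be misleading. It is sensitive to outliers.
- It is computationally expensive because every pairwise distance between the clusters must be computed.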

![2_36-2.png](../materials/2_37.png)

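- Complete linkage: the distance between two clusters is the distance between their farthest pair of points, one from each cluster.
- If two points happen to be unusually far apart while the rest lie close together, the result can be misleading. It is sensitive to outliers.
- It is computationally expensive because every pairwise distance between the clusters must be computed.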

![2_36-2.png](../materials/2_38.png)

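- Average linkage: compute the distance for every pair of points across the two clusters and take the mean.
- It is relatively insensitive to outliers.
- It is also computationally expensive, since every pairwise distance must be computed.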

![2_36-2.png](../materials/2_39.png)
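Single, complete, and average linkage all start from the same table of pairwise distances and differ only in how they summarize it. A minimal sketch, assuming two made-up 2-D clusters `C1` and `C2` and Euclidean distance:

```python
import numpy as np

# Two toy clusters in 2-D (coordinates are made up for illustration).
C1 = np.array([[0.0, 0.0], [1.0, 0.0]])
C2 = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances: entry [i, j] is ||C1[i] - C2[j]||.
pairwise = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)

single = pairwise.min()     # closest pair across the two clusters
complete = pairwise.max()   # farthest pair across the two clusters
average = pairwise.mean()   # mean over all cross-cluster pairs

print(single, complete, average)
```

The `pairwise` array has one entry per cross-cluster pair, which is why all three methods share the same cost for building it.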

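- Centroid linkage: compute the centroid of all points in each cluster and use the distance between the two centroids.
- It is relatively insensitive to outliers.
- Only three computations are needed: one centroid per cluster and one distance between the centroids. The cost is low.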

![2_36-2.png](../materials/2_40.png)
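The three computations mentioned above can be written out directly. This is a sketch on made-up points, assuming Euclidean distance between the centroids:

```python
import numpy as np

# Two toy clusters (coordinates are made up for illustration).
C1 = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
C2 = np.array([[4.0, 5.0], [6.0, 5.0]])

r1 = C1.mean(axis=0)   # computation 1: centroid of C1
r2 = C2.mean(axis=0)   # computation 2: centroid of C2
centroid_distance = np.linalg.norm(r1 - r2)   # computation 3: one distance

print(r1, r2, centroid_distance)
```

No pairwise distance table is needed, in contrast to single, complete, and average linkage.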

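- Ward linkage is widely used; it is the default linkage in scikit-learn's `AgglomerativeClustering`.
- $r_1$ is the centroid of $C_1$ and $r_2$ is the centroid of $C_2$. $r_{1,2}$ is the centroid of the cluster obtained by merging the two.
- Now look at the formula. Term 1 is the sum of distances from each point of $C_1$ to $r_1$. Term 2 is the sum of distances from each point of $C_2$ to $r_2$. Term 3 is the sum of distances from every point of $C_1$ and $C_2$ to $r_{1,2}$.
- The Ward distance is the difference between term 3 and the within-cluster sums (term 1 + term 2), i.e. how much the total within-cluster spread grows when the two clusters are merged.
- It is very insensitive to outliers.
- It is very expensive to compute.
- It is popular because it **tends to produce clusters of similar size**.
    - The smaller a cluster is, the shorter its distance to other clusters tends to be.
    - See the notes for this part. The intent of the Ward distance is to measure distance while also taking into account how compact the existing clusters are.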

![2_36-2.png](../materials/2_41.png)
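The three terms can be computed directly. The sketch below assumes the standard form of Ward's criterion, which uses squared Euclidean distances; the figure's exact notation is not reproduced here. It also checks the result against the well-known closed form $\frac{n_1 n_2}{n_1 + n_2}\lVert r_1 - r_2 \rVert^2$. The helper `sse` and the toy points are made up for illustration.

```python
import numpy as np

def sse(points, center):
    """Sum of squared Euclidean distances from each point to `center`."""
    return float(np.sum((points - center) ** 2))

# Two toy clusters of different size (made-up coordinates).
C1 = np.array([[0.0, 0.0], [2.0, 0.0]])
C2 = np.array([[5.0, 0.0], [7.0, 0.0], [9.0, 0.0]])
merged = np.vstack([C1, C2])

r1, r2, r12 = C1.mean(axis=0), C2.mean(axis=0), merged.mean(axis=0)

term1 = sse(C1, r1)         # spread of C1 around its own centroid
term2 = sse(C2, r2)         # spread of C2 around its own centroid
term3 = sse(merged, r12)    # spread of the merged cluster around r12
ward = term3 - (term1 + term2)

# Closed form of the same quantity.
n1, n2 = len(C1), len(C2)
shortcut = n1 * n2 / (n1 + n2) * float(np.sum((r1 - r2) ** 2))

print(ward, shortcut)
```

The factor $n_1 n_2 / (n_1 + n_2)$ shrinks when either cluster is small, which is one way to see why small clusters get short Ward distances and are merged first.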

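Dendrogram
- The lower a connecting line sits, the earlier that merge happened.
- It shows the whole merging process at a glance, and the number of clusters follows from where the user draws the cut line.
- In practice, however, with many samples it becomes too complex to read.
    - So in real work the number of clusters is rarely chosen by drawing a line across a dendrogram.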

![2_36-2.png](../materials/2_42.png)
![2_36-2.png](../materials/2_43.png)
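Cutting a dendrogram at a chosen height can also be done without drawing it. The following sketch uses SciPy's hierarchy module, which is a swapped-in tool (the notes themselves use scikit-learn), on six made-up 1-D samples. The cut height `3.0` is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Six 1-D toy samples forming two obvious groups (made-up values).
X = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0]])

# Z has one row per merge: [cluster_i, cluster_j, merge_height, new_size].
Z = linkage(X, method="average", metric="euclidean")

# Cutting the tree at height 3 keeps only the merges below that line.
labels = fcluster(Z, t=3.0, criterion="distance")

# no_plot=True returns the tree layout as a dict without drawing anything.
tree = dendrogram(Z, no_plot=True)

print(Z[:, 2])        # merge heights, lowest (earliest) first
print(labels)         # cluster label per sample after the cut
print(tree["ivl"])    # leaf order along the horizontal axis
```

The last merge height is far above the others, so any cut line between them yields two clusters.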

![2_36-2.png](../materials/2_44.png)

- Ward linkage needs a centroid for each cluster. The centroid is a concept defined in Euclidean geometry, so only Euclidean distance can be used with it.
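The same restriction can be observed in SciPy, used here as a stand-in because its call signature is stable across versions. As far as I know, `scipy.cluster.hierarchy.linkage` raises a `ValueError` when `method="ward"` is combined with a non-Euclidean metric. The toy points are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four toy points (made-up coordinates).
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])

# Ward with Euclidean distance is accepted.
Z = linkage(X, method="ward", metric="euclidean")

# Ward with Manhattan ("cityblock") distance is expected to be rejected.
try:
    linkage(X, method="ward", metric="cityblock")
    rejected = False
except ValueError as err:
    rejected = True
    print("rejected:", err)
```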

![2_36-2.png](../materials/2_45.png)
![2_36-2.png](../materials/2_46.png)
![2_36-2.png](../materials/2_47.png)

In [7]:
import os
import pandas as pd

os.chdir(r"/Users/sanghyuk/Documents/preprocessing_python/lecture_source/2. 탐색적 데이터 분석/데이터")

### Clustering by customer characteristics

In [8]:
df = pd.read_csv("Telco_customer_info.csv", engine = "python", encoding='CP949')
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65


In [9]:
df.set_index('customerID', inplace = True)

In [10]:
df.head()

customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85
5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5
3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15
7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75
9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65


In [16]:
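# Note: the cells were run out of order (see the In [ ] numbers), so the output below
# already contains the 군집정보 (cluster label) column that is added in In [14].
df = pd.get_dummies(df, drop_first = True)
df.head()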

customerID,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,MultipleLines_Yes,...,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,군집정보
7590-VHVEG,0,1,29.85,29.85,0,1,0,0,1,0,...,0,0,0,0,0,1,0,1,0,2
5575-GNVDE,0,34,56.95,1889.5,1,0,0,1,0,0,...,0,0,0,1,0,0,0,0,1,1
3668-QPYBK,0,2,53.85,108.15,1,0,0,1,0,0,...,0,0,0,0,0,1,0,0,1,2
7795-CFOCW,0,45,42.3,1840.75,1,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,1
9237-HQITU,0,2,70.7,151.65,0,0,0,1,0,0,...,0,0,0,0,0,1,0,1,0,2
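The effect of `drop_first` is easier to see on a tiny frame. This is a sketch with a made-up three-row DataFrame `toy`; only the `Contract` categories are borrowed from the data above.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (made-up rows).
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

full = pd.get_dummies(toy)                       # one column per category
reduced = pd.get_dummies(toy, drop_first=True)   # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```

In `reduced`, a row with 0 in both remaining `Contract_*` columns can only be `Month-to-month`, so no information is lost. Numeric columns such as `tenure` pass through unchanged.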


In [17]:
df.columns

Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges',
       'gender_Male', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes',
       'MultipleLines_No phone service', 'MultipleLines_Yes',
       'InternetService_Fiber optic', 'InternetService_No',
       'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
       'OnlineBackup_No internet service', 'OnlineBackup_Yes',
       'DeviceProtection_No internet service', 'DeviceProtection_Yes',
       'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No internet service', 'StreamingTV_Yes',
       'StreamingMovies_No internet service', 'StreamingMovies_Yes',
       'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check', '군집정보'],
      dtype='object')

In [13]:
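from sklearn.cluster import AgglomerativeClustering
# Note: `affinity` was renamed to `metric` in newer scikit-learn releases;
# pass `metric = 'euclidean'` there if `affinity` is rejected.
clusters = AgglomerativeClustering(n_clusters = 3,
                                   affinity = 'euclidean',
                                   linkage = 'ward').fit(df)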

In [14]:
df['군집정보'] = clusters.labels_

In [15]:
df['군집정보'].head()

customerID
7590-VHVEG    2
5575-GNVDE    1
3668-QPYBK    2
7795-CFOCW    1
9237-HQITU    2
Name: 군집정보, dtype: int64

In [18]:
df.groupby(['군집정보'])[['MonthlyCharges', 'TotalCharges']].mean()

군집정보,MonthlyCharges,TotalCharges
0,92.846384,5615.733243
1,65.558747,2192.519269
2,48.428814,444.712194


In [19]:
df.groupby(['군집정보'])[['SeniorCitizen', 'StreamingTV_Yes']].mean()

군집정보,SeniorCitizen,StreamingTV_Yes
0,0.213708,0.744738
1,0.181723,0.390078
2,0.121936,0.176471
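The fit, label, and profile steps above can be reproduced end to end on synthetic data. In this sketch the three customer groups, their centers, and the column name `cluster` are made up. No distance argument is passed, because `linkage="ward"` uses Euclidean distance by default, which sidesteps the `affinity`/`metric` naming difference between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Three made-up customer groups: (monthly charge, total charge).
centers = np.array([[50.0, 500.0], [65.0, 2200.0], [90.0, 5600.0]])
X = np.vstack([c + rng.normal(scale=[3.0, 100.0], size=(30, 2)) for c in centers])
toy = pd.DataFrame(X, columns=["MonthlyCharges", "TotalCharges"])

# Fit Ward-linkage agglomerative clustering on the two features.
model = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(toy)

# Attach the labels afterwards, then profile each cluster by its means.
toy["cluster"] = model.labels_
profile = toy.groupby("cluster")[["MonthlyCharges", "TotalCharges"]].mean()
print(profile)
```

In this toy setup the clusters are driven almost entirely by `TotalCharges`, because its scale is far larger than that of `MonthlyCharges` and Euclidean distance does not rescale features.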


### Clustering customers by purchase history

In [20]:
df = pd.read_csv("베스트셀러_도서구매기록.txt", sep = "\t", engine = "python", encoding='CP949')

In [21]:
df.head()

Unnamed: 0,회원번호,책제목
0,75111,정글만리 1
1,48022,정글만리 1
2,3063,해커스 토익 Reading
3,84128,뱃살부터 빼셔야겠습니다
4,77611,장하준의 경제학 강의


In [22]:
matrix_df = pd.crosstab(index = df['회원번호'], columns = df['책제목'])
matrix_df.head()

회원번호,1cm+,21세기 자본,EBS FM 라디오 고교 영어듣기 (2014년),EBS N제 국어영역 국어 270제 A형 (2014년),EBS N제 국어영역 국어 270제 B형 (2014년),EBS N제 영어영역 영어 280제 (2014년),EBS 수능완성 국어영역 국어 A형 유형편+실전편 (2014년),EBS 수능완성 국어영역 국어 B형 유형편+실전편 (2014년),EBS 수능완성 수학영역 기하와 벡터 (2014년),EBS 수능완성 수학영역 미적분과 통계 기본 유형편+실전편 A형 (2014년),...,정글만리 3,제3인류 3,창문 넘어 도망친 100세 노인,책은 도끼다,"총, 균, 쇠",칼 비테의 자녀교육 불변의 법칙,코스모스,해커스 토익 Listening,해커스 토익 Reading,해커스 토익 보카 Vocabulary
808,0,0,1,0,1,1,0,3,0,1,...,0,0,1,0,0,0,0,0,0,0
1101,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1479,0,0,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1805,0,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2011,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
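`pd.crosstab` turns a long purchase log (one row per purchase) into a member-by-title count matrix, which is how a value such as 3 can appear. A sketch on a made-up log; the column names `member` and `title` stand in for `회원번호` and `책제목`.

```python
import pandas as pd

# Made-up purchase log: one row per purchase.
log = pd.DataFrame({
    "member": [808, 808, 808, 1101, 1101],
    "title": ["Book A", "Book B", "Book B", "Book A", "Book C"],
})

# Rows = members, columns = titles, cells = number of purchases.
matrix = pd.crosstab(index=log["member"], columns=log["title"])
print(matrix)
```

Member 808 bought "Book B" twice, so that cell holds 2 rather than 1. Combinations that never occur in the log are filled with 0.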


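An important point here: Jaccard similarity is meant for binary data, but this matrix contains counts such as 3. <br>
scikit-learn treats any nonzero value (1, 2, or 3) as 1 when computing it.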
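To make that behavior explicit rather than relying on the library, the counts can be binarized by hand before computing the similarity. A minimal sketch; the helper name `jaccard_similarity`, the toy count vectors, and the choice to return 0 when neither sample has any purchase are all assumptions made for illustration.

```python
import numpy as np

def jaccard_similarity(u, v):
    """Jaccard similarity of two count vectors after binarizing them."""
    u, v = np.asarray(u) > 0, np.asarray(v) > 0   # any purchase count -> True
    union = np.sum(u | v)                         # at least one of them bought it
    # Convention chosen here: similarity is 0 when neither bought anything.
    return np.sum(u & v) / union if union else 0.0

counts_a = [1, 0, 3, 0]
counts_b = [1, 2, 1, 0]
sim = jaccard_similarity(counts_a, counts_b)
print(sim)
```

Buying a title three times and buying it once are treated the same: `jaccard_similarity([1, 0, 3, 0], [1, 0, 1, 0])` is 1.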

In [24]:
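# Instantiate and fit the clustering model
# (in newer scikit-learn releases, pass `metric = "jaccard"` instead of `affinity`)

from sklearn.cluster import AgglomerativeClustering as AC
clustering_model = AC(n_clusters = 10,
                      affinity = "jaccard",
                      linkage = "average")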
clustering_model.fit(matrix_df)

AgglomerativeClustering(affinity='jaccard', linkage='average', n_clusters=10)
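The same idea (average linkage on Jaccard distances) can be sketched with SciPy, which avoids the `affinity`/`metric` naming difference. This is an alternative route, not the notebook's method. The purchase matrix is made up and binarized explicitly before the distances are computed.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up purchase counts: rows = members, columns = books.
purchases = np.array([
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [3, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 2, 1],
    [0, 0, 0, 1],
])

bought = purchases > 0                    # binarize: bought at least once
dist = pdist(bought, metric="jaccard")    # condensed Jaccard distance vector
Z = linkage(dist, method="average")       # average linkage on those distances
labels = fcluster(Z, t=2, criterion="maxclust")   # cut into at most 2 clusters

print(labels)
```

The first three members share no book with the last three, so every cross-group Jaccard distance is 1 and the two groups are only joined by the final merge.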

In [28]:
cluster_labels = clustering_model.labels_
cluster_labels

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [32]:
# Check which cluster each member belongs to
cluster_info = pd.DataFrame({"회원ID":matrix_df.index, "소속군집":cluster_labels})
cluster_info.head()

Unnamed: 0,회원ID,소속군집
0,808,0
1,1101,0
2,1479,0
3,1805,0
4,2011,0


In [30]:
cluster_info['소속군집'].value_counts() # the vast majority falls into cluster 0

0    665
1     12
2      6
3      4
8      2
4      2
9      1
7      1
6      1
5      1
Name: 소속군집, dtype: int64

In [31]:
matrix_df_with_cluster_info = pd.merge(matrix_df, cluster_info, left_index = True, right_on = '회원ID')
matrix_df_with_cluster_info.head()

Unnamed: 0,1cm+,21세기 자본,EBS FM 라디오 고교 영어듣기 (2014년),EBS N제 국어영역 국어 270제 A형 (2014년),EBS N제 국어영역 국어 270제 B형 (2014년),EBS N제 영어영역 영어 280제 (2014년),EBS 수능완성 국어영역 국어 A형 유형편+실전편 (2014년),EBS 수능완성 국어영역 국어 B형 유형편+실전편 (2014년),EBS 수능완성 수학영역 기하와 벡터 (2014년),EBS 수능완성 수학영역 미적분과 통계 기본 유형편+실전편 A형 (2014년),...,창문 넘어 도망친 100세 노인,책은 도끼다,"총, 균, 쇠",칼 비테의 자녀교육 불변의 법칙,코스모스,해커스 토익 Listening,해커스 토익 Reading,해커스 토익 보카 Vocabulary,회원ID,소속군집
0,0,0,1,0,1,1,0,3,0,1,...,1,0,0,0,0,0,0,0,808,0
1,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1101,0
2,0,0,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1479,0
3,0,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1805,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2011,0


We want to see which books each cluster bought the most.

In [80]:
matrix_df_with_cluster_info.groupby(['소속군집'])[matrix_df.columns].mean()

소속군집,1cm+,21세기 자본,EBS FM 라디오 고교 영어듣기 (2014년),EBS N제 국어영역 국어 270제 A형 (2014년),EBS N제 국어영역 국어 270제 B형 (2014년),EBS N제 영어영역 영어 280제 (2014년),EBS 수능완성 국어영역 국어 A형 유형편+실전편 (2014년),EBS 수능완성 국어영역 국어 B형 유형편+실전편 (2014년),EBS 수능완성 수학영역 기하와 벡터 (2014년),EBS 수능완성 수학영역 미적분과 통계 기본 유형편+실전편 A형 (2014년),...,정글만리 3,제3인류 3,창문 넘어 도망친 100세 노인,책은 도끼다,"총, 균, 쇠",칼 비테의 자녀교육 불변의 법칙,코스모스,해커스 토익 Listening,해커스 토익 Reading,해커스 토익 보카 Vocabulary
0,0.063158,0.039098,0.440602,0.264662,0.354887,0.619549,0.312782,0.372932,0.288722,0.324812,...,0.099248,0.033083,0.156391,0.042105,0.097744,0.034586,0.010526,0.03609,0.039098,0.04812
1,0.5,0.5,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,...,1.0,0.333333,1.166667,0.5,0.0,0.0,0.0,0.0,0.0,0.25
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.5,0.0,0.0,0.0,0.0
3,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [39]:
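# Find the most-purchased book in each cluster
# For each row, return the column that holds the maximum value.
# This is the difference between max (the value) and argmax (its position).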

matrix_df_with_cluster_info.groupby(['소속군집'])[matrix_df.columns].mean().idxmax(axis = 1)

소속군집
0    EBS 수능특강 영어영역 영어 (2014년)
1           창문 넘어 도망친 100세 노인
2                 나미야 잡화점의 기적
3              어떻게 원하는 것을 얻는가
4    EBS 수능특강 영어영역 영어 (2014년)
5                   강신주의 감정수업
6                    마법천자문 27
7                 꾸뻬 씨의 행복 여행
8              나는 까칠하게 살기로 했다
9                    월급쟁이 부자들
dtype: object
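The difference between `max` and `idxmax` can be checked on a small matrix. A sketch with made-up members, titles, and cluster labels; grouping by a label Series aligned on the index is used here instead of the merge step, purely to keep the example short.

```python
import pandas as pd

# Made-up member x title purchase counts.
matrix = pd.DataFrame(
    {"Book A": [2, 0, 0, 1], "Book B": [0, 2, 1, 0], "Book C": [0, 0, 0, 1]},
    index=[808, 1101, 1479, 1805],
)
# Made-up cluster label per member, aligned on the member index.
labels = pd.Series([0, 1, 1, 0], index=matrix.index, name="cluster")

cluster_means = matrix.groupby(labels).mean()

top_value = cluster_means.max(axis=1)      # the largest mean in each row
top_title = cluster_means.idxmax(axis=1)   # the column holding that mean

print(top_value)
print(top_title)
```

Because the cells are purchase counts rather than 0/1 flags, a cluster mean can exceed 1, as it does here and in the output above.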

## K-means

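1. Pick K centers at random.
2. Assign each remaining sample to its nearest center.
3. Move each center to the mean of the samples assigned to it.
4. Reassign every sample to its nearest updated center.
5. Repeat until nothing changes any more.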


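Note that the result depends on the random initial centers: the iteration can take many rounds and can settle in a poor local optimum, so implementations cap the number of iterations (for example `max_iter` in scikit-learn's `KMeans`).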

![2_36-2.png](../materials/2_48.png)
![2_36-2.png](../materials/2_49.png)

![2_36-2.png](../materials/2_50.png)

![2_36-2.png](../materials/2_51.png)
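The five steps map almost one-to-one onto a short NumPy loop. This is a minimal sketch, not scikit-learn's implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the rule that an empty cluster keeps its old center, and the two synthetic blobs are all assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means iteration; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2 and 4: assign every sample to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its samples
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```

The `max_iter` cap guards against very long runs. Different `seed` values may give different results on less clearly separated data.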

In [41]:
import os
import pandas as pd

os.chdir(r"/Users/sanghyuk/Documents/preprocessing_python/lecture_source/2. 탐색적 데이터 분석/데이터")

In [42]:
df = pd.read_csv("Telco_customer_info.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65


In [43]:
df.set_index('customerID', inplace = True)

In [44]:
df.head()

customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85
5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5
3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15
7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75
9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65


In [45]:
df = pd.get_dummies(df, drop_first = True)
df.head()

customerID,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,MultipleLines_Yes,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
7590-VHVEG,0,1,29.85,29.85,0,1,0,0,1,0,...,0,0,0,0,0,0,1,0,1,0
5575-GNVDE,0,34,56.95,1889.5,1,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,1
3668-QPYBK,0,2,53.85,108.15,1,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,1
7795-CFOCW,0,45,42.3,1840.75,1,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
9237-HQITU,0,2,70.7,151.65,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,1,0


In [46]:
df.columns

Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges',
       'gender_Male', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes',
       'MultipleLines_No phone service', 'MultipleLines_Yes',
       'InternetService_Fiber optic', 'InternetService_No',
       'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
       'OnlineBackup_No internet service', 'OnlineBackup_Yes',
       'DeviceProtection_No internet service', 'DeviceProtection_Yes',
       'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No internet service', 'StreamingTV_Yes',
       'StreamingMovies_No internet service', 'StreamingMovies_Yes',
       'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check'],
      dtype='object')

In [47]:
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters = 3).fit(df)

In [48]:
# Inspect the cluster centers and label the rows and columns
pd.DataFrame(clusters.cluster_centers_,
             columns = df.columns,
             index = range(3))

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,MultipleLines_Yes,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0.128691,18.261705,49.763794,687.758896,0.501321,0.37599,0.284994,0.891236,0.108764,0.269868,...,0.3644658,0.194718,0.3644658,0.195438,0.145738,0.165666,0.526531,0.164226,0.346218,0.32461
1,0.207196,44.123449,77.810701,3280.360205,0.51737,0.557072,0.30397,0.860422,0.139578,0.497519,...,0.001240695,0.545285,0.001240695,0.55273,0.302109,0.21464,0.671836,0.260546,0.348015,0.116005
2,0.216733,64.384861,97.979243,6297.778685,0.499602,0.740239,0.336255,0.998406,0.001594,0.829482,...,-1.110223e-16,0.807171,-1.110223e-16,0.81753,0.301195,0.517131,0.710757,0.332271,0.288446,0.051793
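Because K-means starts from random centers, cluster numbers and even the centers themselves can change between runs. A sketch of a reproducible variant on synthetic data: `random_state=0` and `n_init=10` are illustrative choices, not settings from the notebook, and the three blobs and their column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three made-up, well-separated groups of 40 samples each.
toy = pd.DataFrame(
    np.vstack([rng.normal(loc, 1.0, size=(40, 2)) for loc in (0.0, 10.0, 20.0)]),
    columns=["tenure", "MonthlyCharges"],
)

# Fixing random_state makes the run repeatable; n_init restarts guard
# against a single unlucky initialization.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy)

# Cluster numbering is arbitrary, so sort the centers before reading them.
centers = pd.DataFrame(model.cluster_centers_, columns=toy.columns)
centers = centers.sort_values("tenure").reset_index(drop=True)
print(centers)
```

Sorting the center table by one feature gives a stable ordering that makes runs easier to compare.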
