## Cluster Analysis

A machine learning technique that belongs to unsupervised learning.

- e.g., pattern recognition, image analysis, bioinformatics, customer segmentation

### K-Means Clustering Algorithm
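Given n data points and a number of clusters k, the goal of k-means clustering is to minimize the variance among the points within each cluster. In other words, data points that lie close to one another are grouped together and declared to be one cluster.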
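
Written out, the quantity being minimized is usually called the within-cluster sum of squares. The notation below is introduced here for illustration and does not come from the original notes: $C_j$ is the set of points assigned to cluster $j$, and $\mu_j$ is that cluster's centroid.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```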

![Random Unsplash Image](https://www.ncbi.nlm.nih.gov/books/NBK543520/bin/463627_1_En_9_Fig4_HTML.jpg)

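The k-means algorithm works as follows:

1. Choose k initial cluster centroids at random.
2. Compute the distance from each data point to every centroid, and assign each point to the cluster of its nearest centroid.
3. Move each centroid to the center of mass (the mean) of the points assigned to it.
4. Repeat steps 2 and 3.

Once the centroids have settled and no longer move between iterations, the iteration stops. The final position is a local optimum that can depend on the initial centroids.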
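
The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in these notes: the function name `kmeans`, its parameters, and the toy points are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Made-up toy data: two groups of nearby points.
toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])
centroids, labels = kmeans(toy, k=2)
print(centroids)
print(labels)
```

Because the initial centroids are random, the cluster numbering (which group is 0 and which is 1) can change with the seed.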




- Reference video: https://www.youtube.com/watch?v=YIGtalP1mv0

- Data source: https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python


#### Pros 

- Simple and cheap to compute
- Scales to large datasets
- Can be applied even to data without obvious patterns

#### Cons

- The value of k has to be chosen by the user in advance
- Sensitive to outliers
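
The sensitivity to outliers follows from the centroid being a mean. A small sketch with made-up numbers (the values and variable names are assumptions for illustration only) shows how far a single extreme point can drag a centroid:

```python
import numpy as np

# Hypothetical 1-D cluster, then the same cluster plus one outlier.
cluster = np.array([48.0, 50.0, 52.0])
with_outlier = np.append(cluster, 150.0)

centroid_clean = cluster.mean()         # 50.0
centroid_shifted = with_outlier.mean()  # 75.0

print(centroid_clean, centroid_shifted)
```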

### Import & Explore Data

In [1]:
#import data

import pandas as pd
import numpy as np

df = pd.read_csv("Mall_Customers.csv")
df.head()

|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |


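#### A few things that are useful when cleaning data
- df.shape : check the dimensions (rows, columns)
- df.rename(columns = {'old_name' : 'new_name'}) : rename columns
- df.dtypes : check the data type of each column
- df.describe() : summary statistics for each numeric column
- df.isnull().sum() : count the null values in each column
- df.dropna(how = "all") : drop rows in which every value is null
- df.columnname.unique() : show the unique values in a column
- len(df.columnname.unique()) : count how many unique values a column has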
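
The helpers listed above can be tried on a small made-up frame before running them on the real data. The frame `toy` and its column names are hypothetical. The sketch also contrasts `dropna(how="all")` with the default `dropna()`, which drops a row if any value is null.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 is partially null, row 2 is entirely null.
toy = pd.DataFrame({
    "Old Name": ["a", "b", None, "a"],
    "value": [1.0, np.nan, np.nan, 4.0],
})

toy = toy.rename(columns={"Old Name": "name"})  # rename returns a new DataFrame
print(toy.shape)           # (4, 2)
print(toy.dtypes)
print(toy.isnull().sum())  # null count per column

all_null_dropped = toy.dropna(how="all")  # drops only the entirely null row
any_null_dropped = toy.dropna()           # default how="any": drops rows 1 and 2

print(len(all_null_dropped), len(any_null_dropped))  # 3 2
print(toy["name"].nunique())  # 2 (nunique ignores nulls by default)
```

Note that `len(df.columnname.unique())` counts a null as one of the unique values, whereas `nunique()` skips nulls unless `dropna=False` is passed.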

In [2]:
df.rename(columns = {'CustomerID' : 'id', 'Gender': 'gender', 'Age' : 'age', 'Annual Income (k$)':'annual_income', 'Spending Score (1-100)': 'spending_score'}, inplace = True)
df.head()


|   | id | gender | age | annual_income | spending_score |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |


In [3]:
print(df['age'].max())
print(df['spending_score'].mean())

70
50.2


In [4]:
df.shape

(200, 5)

In [5]:
df.dtypes

id                 int64
gender            object
age                int64
annual_income      int64
spending_score     int64
dtype: object

In [6]:
df.describe()

|   | id | age | annual_income | spending_score |
|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 100.5 | 38.85 | 60.56 | 50.2 |
| std | 57.879185 | 13.969007 | 26.264721 | 25.823522 |
| min | 1.0 | 18.0 | 15.0 | 1.0 |
| 25% | 50.75 | 28.75 | 41.5 | 34.75 |
| 50% | 100.5 | 36.0 | 61.5 | 50.0 |
| 75% | 150.25 | 49.0 | 78.0 | 73.0 |
| max | 200.0 | 70.0 | 137.0 | 99.0 |


In [7]:
df.isnull().sum()

# If any rows were entirely null, they could be dropped with df = df.dropna(how='all')

id                0
gender            0
age               0
annual_income     0
spending_score    0
dtype: int64

In [8]:
df.gender.unique()

array(['Male', 'Female'], dtype=object)

In [9]:
len(df.id.unique())

200

### Groupby

* Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

In [10]:
# group customers by 'gender',
# and count the number of customers in each group. 

df2 = df.groupby('gender').id.count().reset_index()
df2

|   | gender | id |
|---|---|---|
| 0 | Female | 112 |
| 1 | Male | 88 |


In [11]:
# group customers by 'gender', 
# and calculate the average age per gender using mean() method.

df3 = df.groupby('gender').age.mean().reset_index()
df3

|   | gender | age |
|---|---|---|
| 0 | Female | 38.098214 |
| 1 | Male | 39.806818 |


In [14]:
# group customers by 'gender' and 'age',  
# and calculate the average annual income using mean() method.

df4 = df.groupby(['gender', 'age']).annual_income.mean().reset_index()
df4

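|   | gender | age | annual_income |
|---|---|---|---|
| 0 | Female | 18 | 65.00 |
| 1 | Female | 19 | 64.00 |
| 2 | Female | 20 | 26.50 |
| 3 | Female | 21 | 44.75 |
| 4 | Female | 22 | 37.00 |
| ... | ... | ... | ... |
| 82 | Male | 66 | 63.00 |
| 83 | Male | 67 | 45.00 |
| 84 | Male | 68 | 63.00 |
| 85 | Male | 69 | 44.00 |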
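
The cells above each compute one statistic per group. As a sketch, several statistics can be collected in a single call with pandas named aggregation. The miniature frame `toy` and the output column names (`customers`, `mean_age`, `mean_income`) are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the customer table.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [20, 30, 40, 40],
    "annual_income": [15, 20, 30, 25],
})

# Named aggregation: new_column=(source_column, aggregation).
summary = toy.groupby("gender", as_index=False).agg(
    customers=("age", "count"),
    mean_age=("age", "mean"),
    mean_income=("annual_income", "mean"),
)
print(summary)
```

Passing `as_index=False` keeps `gender` as an ordinary column, which plays the same role as the `reset_index()` calls used above.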


### One Hot Encoding

The process of converting categorical variables, which a model cannot read directly, into columns of 1s and 0s.

- **pd.get_dummies(df, columns = ['column_name'])**

![Random Unsplash Image](https://miro.medium.com/max/1400/1*O_pTwOZZLYZabRjw3Ga21A.png)


- Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
- Reference blog post: https://blog.naver.com/oys0608/222324158962
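
A minimal sketch of what `pd.get_dummies` produces, using a made-up three-row frame (`toy` is hypothetical). In pandas 2.0 and later the indicator columns are boolean by default, so `dtype=int` is passed here to get explicit 0/1 values.

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["Male", "Female", "Female"]})

# One indicator column per category; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)
print(encoded)

# drop_first=True keeps k-1 columns, because the dropped one is implied by the rest.
compact = pd.get_dummies(toy, columns=["gender"], dtype=int, drop_first=True)
print(compact)
```

With only two categories, `gender_Female` and `gender_Male` always sum to 1, so the `drop_first=True` variant carries the same information in a single column.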

In [15]:
df = pd.get_dummies(df, columns = ['gender'])
df.head()

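|   | id | age | annual_income | spending_score | gender_Female | gender_Male |
|---|---|---|---|---|---|---|
| 0 | 1 | 19 | 15 | 39 | 0 | 1 |
| 1 | 2 | 21 | 15 | 81 | 0 | 1 |
| 2 | 3 | 20 | 16 | 6 | 1 | 0 |
| 3 | 4 | 23 | 16 | 77 | 1 | 0 |
| 4 | 5 | 31 | 17 | 40 | 1 | 0 |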


### Normalisation 

Normalizing the data before analysis rescales each column to the range from 0 to 1.

![Random Unsplash Image](https://i2.wp.com/cmdlinetips.com/wp-content/uploads/2020/06/Quantile_Normalization_in_Python.png?w=379&ssl=1)

![Random Unsplash Image](https://i0.wp.com/cmdlinetips.com/wp-content/uploads/2020/06/Boxplot_after_Quantile_Normalization_Seaborn.png?w=603&ssl=1)

* Uses the MinMaxScaler class from the sklearn.preprocessing module.
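
Min-max scaling maps each value x to (x - min) / (max - min), computed per column. The sketch below applies that formula by hand in pandas. The series holds the five ages from the first rows shown earlier plus the dataset minimum (18) and maximum (70) from `df.describe()`. The variable names are made up for the example.

```python
import pandas as pd

# First five ages from the table above, plus the column min (18) and max (70).
age = pd.Series([19, 21, 20, 23, 31, 18, 70], name="age")

# Min-max scaling: x' = (x - min) / (max - min)
scaled = (age - age.min()) / (age.max() - age.min())
print(scaled.round(6).tolist())
# [0.019231, 0.057692, 0.038462, 0.096154, 0.25, 0.0, 1.0]
```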

In [16]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Rescale every column to [0, 1]; note that this includes the id column.
df[df.columns] = scaler.fit_transform(df)

In [17]:
df

|   | id | age | annual_income | spending_score | gender_Female | gender_Male |
|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.019231 | 0.000000 | 0.387755 | 0.0 | 1.0 |
| 1 | 0.005025 | 0.057692 | 0.000000 | 0.816327 | 0.0 | 1.0 |
| 2 | 0.010050 | 0.038462 | 0.008197 | 0.051020 | 1.0 | 0.0 |
| 3 | 0.015075 | 0.096154 | 0.008197 | 0.775510 | 1.0 | 0.0 |
| 4 | 0.020101 | 0.250000 | 0.016393 | 0.397959 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 0.979899 | 0.326923 | 0.860656 | 0.795918 | 1.0 | 0.0 |
| 196 | 0.984925 | 0.519231 | 0.909836 | 0.275510 | 1.0 | 0.0 |
| 197 | 0.989950 | 0.269231 | 0.909836 | 0.744898 | 0.0 | 1.0 |
| 198 | 0.994975 | 0.269231 | 1.000000 | 0.173469 | 0.0 | 1.0 |
