# References
1. [Customer Segmentation using RFM Analysis](https://www.kaggle.com/sarahm/customer-segmentation-using-rfm-analysis)
2. [RFM Analysis](https://www.kaggle.com/yugagrawal95/rfm-analysis)
3. [RFM Analysis Using Online Retail II Dataset](https://www.kaggle.com/sevvalyurtekin/rfm-analysis-using-online-retail-ii-dataset)
4. [RFM Analysis For Successful Customer Segmentation](https://www.kaggle.com/abdulmeral/rfm-analysis-for-successful-customer-segmentation)
5. [CRM Analytics](https://www.kaggle.com/sercanyesiloz/crm-analytics)
6. [Complete E-Commerce Analysis](https://www.kaggle.com/anmoltripathi/complete-e-commerce-analysis)

# Ideas from Each Kernel
## 1. Customer Segmentation using RFM Analysis
### Preprocessing
- `Country` == 'United Kingdom'
- `Quantity` > 0
- `CustomerID` is not null
- `InvoiceDate` >= '2010-12-09'

### How to create customer segments from RFM Models
- Quartile
    - `R_Quartile`:
        - 4: <= .25
        - 3: <= .5
        - 2: <= .75
        - 1: else
    - `F_Quartile`, `M_Quartile`:
        - 1: <= .25
        - 2: <= .5
        - 3: <= .75
        - 4: else
    - `RFMScore`: str(R_Quartile) + str(F_Quartile) + str(M_Quartile)

### How to interpret model
- Best Customers: RFMScore == '444'
- Loyal Customers: F_Quartile == 4
- Big Spenders: M_Quartile == 4
- Almost Lost: RFMScore == '244'
- Lost Customers: RFMScore == '144'
- Lost Cheap Customers: RFMScore == '111'
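
The rules above can be sketched on a toy table. The four customers and their quartile values below are made up for illustration, and the segment column names are my own; only the score construction and the rule conditions come from the kernel notes.

```python
import pandas as pd

# Hypothetical quartile scores for four customers (1 = worst, 4 = best).
scores = pd.DataFrame(
    {
        "R_Quartile": [4, 2, 1, 1],
        "F_Quartile": [4, 4, 4, 1],
        "M_Quartile": [4, 4, 4, 1],
    },
    index=["a", "b", "c", "d"],
)

# Concatenate the three digits into the RFMScore string.
scores["RFMScore"] = (
    scores["R_Quartile"].astype(str)
    + scores["F_Quartile"].astype(str)
    + scores["M_Quartile"].astype(str)
)

# The rules overlap (a '444' customer is also loyal and a big spender),
# so each segment is kept as its own boolean column.
segments = pd.DataFrame(
    {
        "best": scores["RFMScore"] == "444",
        "loyal": scores["F_Quartile"] == 4,
        "big_spender": scores["M_Quartile"] == 4,
        "almost_lost": scores["RFMScore"] == "244",
        "lost": scores["RFMScore"] == "144",
        "lost_cheap": scores["RFMScore"] == "111",
    }
)
```

Because the rules are not mutually exclusive, a single label column would have to pick a priority order; boolean columns avoid that choice.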
    
## 2. RFM Analysis
### Preprocessing
- `Country` == 'United Kingdom'
- `Quantity` > 0
- `UnitPrice` > 0
- `CustomerID` is not null

### How to create customer segments from RFM Models
- K-Means
- Quartile (omitted)

### How to interpret model
#### K-Means: number of clusters chosen with the Elbow Method
- Split into 3 clusters
- Each cluster was named after inspecting its data
    - Cluster 0 has high recency, which is bad. Clusters 1 and 2 have low recency, so they compete for the platinum and gold tiers.
    - Cluster 0 has low frequency, which is bad. Clusters 1 and 2 have high frequency, so they compete for the platinum and gold tiers.
    - Cluster 0 has low monetary value, which is bad. Cluster 1 has the highest monetary value (platinum), cluster 2 is in the middle (gold), and cluster 0 is the silver tier.
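
A minimal sketch of this naming step, assuming cluster labels already exist. The data is invented, and ordering the tiers by mean monetary value is my assumption about how the names could be assigned programmatically; the kernel did it by reading the table.

```python
import pandas as pd

# Hypothetical per-customer RFM values with K-Means labels already assigned.
rfm = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1, 2, 2],
        "recency": [200, 180, 10, 5, 30, 40],
        "frequency": [1, 2, 20, 25, 8, 6],
        "monetary": [50.0, 80.0, 5000.0, 7000.0, 900.0, 700.0],
    }
)

# Mean R, F, M per cluster: the table one reads before naming clusters.
profile = rfm.groupby("cluster")[["recency", "frequency", "monetary"]].mean()

# Assumption: tiers are ordered by mean monetary value, lowest first.
tiers = ["silver", "gold", "platinum"]
order = profile["monetary"].sort_values().index
tier_by_cluster = dict(zip(order, tiers))
rfm["tier"] = rfm["cluster"].map(tier_by_cluster)
```

K-Means label numbers are arbitrary, so deriving the names from the profile keeps them stable if the labels change between runs.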
    
## 3. RFM Analysis Using Online Retail II Dataset
- Uses the [Online Retail II Data Set from ML Repository](https://www.kaggle.com/mathchi/online-retail-ii-data-set-from-ml-repository) dataset

### Preprocessing
- dropna
- `InvoiceNo` does not start with 'C'

### How to create customer segments from RFM Models
- Quartile
    - Implemented cleanly with the `qcut` function
    - `R_Quartile`:
        - 5: <= .2
        - 4: <= .4
        - 3: <= .6
        - 2: <= .8
        - 1: else
    - `F_Quartile`, `M_Quartile`:
        - 1: <= .2
        - 2: <= .4
        - 3: <= .6
        - 4: <= .8
        - 5: else
    - `RFM_SCORE`: str(R_Quartile) + str(F_Quartile)
    - Note that `M_Quartile` is not used when computing `RFM_SCORE`
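
A small sketch of the `qcut` approach on made-up data. The column names `R` and `F` and the ten sample customers are assumptions for illustration; ranking frequency before binning is one way to avoid the duplicate bin edges that tied values would otherwise produce.

```python
import pandas as pd

# Hypothetical raw RFM values for ten customers.
rfm = pd.DataFrame(
    {
        "recency": [1, 3, 8, 15, 30, 45, 60, 90, 180, 365],
        "frequency": [1, 1, 1, 2, 2, 3, 5, 8, 13, 21],
    }
)

# Low recency is good, so the labels run from 5 down to 1.
rfm["R"] = pd.qcut(rfm["recency"], 5, labels=[5, 4, 3, 2, 1])

# Frequency has many ties; ranking first gives qcut unique bin edges.
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])

# Only R and F go into the score; M is left out.
rfm["RFM_SCORE"] = rfm["R"].astype(str) + rfm["F"].astype(str)
```

With `rank(method="first")`, tied customers are split between bins by row order, which is arbitrary but keeps the five bins equally sized.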
    
### How to interpret model
#### Classified by `RFM_SCORE` as follows
```python
seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}
```
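
As a sanity check on the map above, the sketch below confirms that every two-digit score from `11` to `55` matches exactly one pattern. The helper name `segment_of` is hypothetical, and `re.fullmatch` stands in for the pandas regex replacement used later in this notebook.

```python
import re

seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}

def segment_of(score: str) -> str:
    """Return the segment whose pattern fully matches a two-digit RF score."""
    matches = [name for pattern, name in seg_map.items() if re.fullmatch(pattern, score)]
    if len(matches) != 1:
        raise ValueError(f"{score!r} matched {len(matches)} patterns")
    return matches[0]

# All 25 combinations of R and F scores in 1..5; raises if any score is
# unmatched or ambiguous.
grid = {f"{r}{f}": segment_of(f"{r}{f}") for r in range(1, 6) for f in range(1, 6)}
```

Since the patterns partition the 5x5 grid, the order of entries in `seg_map` does not affect the result for two-digit scores.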

## 4. RFM Analysis For Successful Customer Segmentation
### Preprocessing
- `TotalPrice` = `UnitPrice` * `Quantity`
- `Country` == 'United Kingdom'
- `Quantity` > 0
- `TotalPrice` > 0

### How to create customer segments from RFM Models
- Quartile
    - `Rec_Tile`:
        - 5: <= .2
        - 4: <= .4
        - 3: <= .6
        - 2: <= .8
        - 1: else
    - `Freq_Tile`, `Mone_Tile`:
        - 1: <= .2
        - 2: <= .4
        - 3: <= .6
        - 4: <= .8
        - 5: else
    - `RFM Score`: str(Rec_Tile) + str(Freq_Tile) + str(Mone_Tile)
    - `RFM_Sum`: Rec_Tile + Freq_Tile + Mone_Tile
- K-Means

### How to interpret model
#### Quartile: evaluated by `RFM_Sum`
- Can't Loose Them: >= 9
- Champions: >= 8
- Loyal: >= 7
- Potential: >= 6
- Promising: >= 5
- Needs Attention: >= 4
- Require Activation: else
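
The thresholds above can be written as an ordered lookup. This is a sketch with names of my own choosing (`LEVELS`, `level_from_sum`), not the kernel's code; it assumes each tile is an integer from 1 to 5, so the sum ranges from 3 to 15.

```python
# Thresholds as listed above, checked from the highest down.
LEVELS = [
    (9, "Can't Loose Them"),
    (8, "Champions"),
    (7, "Loyal"),
    (6, "Potential"),
    (5, "Promising"),
    (4, "Needs Attention"),
]

def level_from_sum(rfm_sum: int) -> str:
    """Map an RFM_Sum to the first level whose threshold it reaches."""
    for threshold, name in LEVELS:
        if rfm_sum >= threshold:
            return name
    return "Require Activation"

# Every possible sum when each of the three tiles is 1..5.
levels = {s: level_from_sum(s) for s in range(3, 16)}
```

With these cut-offs, sums 9 through 15 all land in the top level and only a sum of 3 falls into "Require Activation", so the levels are far from evenly populated by construction.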

#### K-Means: number of clusters chosen with the Elbow Method
- The RFM data was Min-Max scaled.
- Split into 4 clusters
- No interpretation given
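
For reference, Min-Max scaling maps each column to the range 0 to 1. The sketch below applies the formula directly with pandas on invented values rather than reproducing the kernel's code.

```python
import pandas as pd

# Hypothetical RFM values on very different scales.
rfm = pd.DataFrame(
    {
        "recency": [1, 30, 365],
        "frequency": [1, 5, 21],
        "monetary": [20.0, 510.0, 1000.0],
    }
)

# Min-max scaling: (x - min) / (max - min), column by column.
scaled = (rfm - rfm.min()) / (rfm.max() - rfm.min())
```

After scaling, no single column dominates the Euclidean distances K-Means relies on, although extreme values still compress the rest of the column toward 0.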

## 5. CRM Analytics
- The visualizations are well done
### Preprocessing
- `InvoiceNo` does not start with 'C'
- `Quantity` > 0
- Remove Outliers ( `UnitPrice`, `Quantity` )
    - Q1: quartile 0.01
    - Q3: quartile 0.99
    - Outliers removed with the IQR method
- `TotalPrice` = `UnitPrice` * `Quantity`

### How to create customer segments from RFM Models
- Quartile
    - `recency_score`:
        - 5: <= .2
        - 4: <= .4
        - 3: <= .6
        - 2: <= .8
        - 1: else
    - `frequency_score`, `monetary_score`:
        - 1: <= .2
        - 2: <= .4
        - 3: <= .6
        - 4: <= .8
        - 5: else
    - `RFM_SCORE`: str(recency_score) + str(frequency_score) + str(monetary_score)
- K-Means
    
### How to interpret model
#### Quartile: classified by `RFM_SCORE` as follows
```python
seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}
```

#### K-Means: number of clusters chosen with the Elbow Method
- Only `recency_score` and `frequency_score` are used as input
- Split into 3 clusters
- No interpretation given

## 6. Complete E-Commerce Analysis
### Preprocessing
- `TotalPrice` = `UnitPrice` * `Quantity`
- `TotalPrice` >= 0

### How to create customer segments from RFM Models
- K-Means
    
### How to interpret model
#### K-Means: number of clusters chosen with the Elbow Method
- Only `recency_score` and `frequency_score` are used as input
- Split into 3 clusters
- Clusters were classified as follows after inspecting each cluster's RFM data

| Clusters | Recency                   | Frequency         | Monetary         |
|----------|---------------------------|-------------------|------------------|
| 0        | Have not visited recently | Least frequent    | Least spending   |
| 1        | Most recently visited     | Highest frequency | Spending Highest |
| 2        | Recently visited          | Decent frequency  | Decent Spending  |

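# Summary of References
1. There were broadly three ways to segment customers from RFM data. The names are my own.
    1. **Quartile String**: Compute scores from quantiles and concatenate them into a string. Customers showing the same pattern are grouped together.
    2. **Quartile Sum**: Compute scores from quantiles and add them up. Customers with similar totals are grouped together.
    3. **K-Means**: Cluster the RFM data with K-Means.
2. Data preprocessing was as follows.
    1. Drop transactions where `Quantity` or `UnitPrice` is 0 or less
    2. Keep only the United Kingdom in the `Country` column
    3. Drop transactions where `CustomerID` is null
    4. Drop transactions whose `InvoiceNo` starts with 'C'
3. Some kernels split the scores into 4 bins (quartiles) and others into 5 (quintiles).
4. Most kernels that used K-Means relied on the Elbow Method, and most settled on 3 clusters.
5. When clustering with K-Means, many kernels used only R and F rather than all of R, F, and M.
6. Some kernels applied Min-Max scaling to the data before K-Means.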

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/ecommerce-data/data.csv', 
                 dtype={'InvoiceNo': str, 'StockCode': str, 'Description': str, 'Quantity': int, 'UnitPrice': float, 'CustomerID': str, 'Country': str}, 
                 encoding='ISO-8859-1',
                 parse_dates=['InvoiceDate'])

df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.nunique().to_frame().transpose()

In [None]:
df = df[
    (df['Country'] == 'United Kingdom') &
    (~df['CustomerID'].isnull()) &
#     (~df['InvoiceNo'].str.startswith('C')) &
#     (df['Quantity'] > 0) & 
    (df['UnitPrice'] > 0) & 
    (~df['StockCode'].str.isalpha()) &
    (~df['StockCode'].str.contains('BANK|C2|DCGS|gift'))
].reset_index(drop=True)
df.head()

In [None]:
pd.DataFrame(
    index=['Quantity > 0', 'Quantity <= 0'], 
    columns=["Starts with 'C'", "Do not starts with 'C'"],
    data=[
          [len(df[(df['Quantity'] > 0) & (df['InvoiceNo'].str.startswith('C'))]), len(df[(df['Quantity'] > 0) & (~df['InvoiceNo'].str.startswith('C'))])],
          [len(df[(df['Quantity'] <= 0) & (df['InvoiceNo'].str.startswith('C'))]), len(df[(df['Quantity'] <= 0) & (~df['InvoiceNo'].str.startswith('C'))])]
          ]
    )

In [None]:
pd.DataFrame(
    index=['UnitPrice > 0', 'UnitPrice <= 0'], 
    columns=["Starts with 'C'", "Do not starts with 'C'"],
    data=[
          [len(df[(df['UnitPrice'] > 0) & (df['InvoiceNo'].str.startswith('C'))]), len(df[(df['UnitPrice'] > 0) & (~df['InvoiceNo'].str.startswith('C'))])],
          [len(df[(df['UnitPrice'] <= 0) & (df['InvoiceNo'].str.startswith('C'))]), len(df[(df['UnitPrice'] <= 0) & (~df['InvoiceNo'].str.startswith('C'))])]
          ]
    )

## Remove Outliers

In [None]:
def outlier_thresholds(dataframe, variable):
    # Despite the names, these are the 1st and 99th percentiles, not Q1 and Q3
    quartile1 = dataframe[variable].quantile(0.01)
    quartile3 = dataframe[variable].quantile(0.99)
    interquantile_range = quartile3 - quartile1
    # Limits sit 1.5 ranges beyond each percentile
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def replace_with_thresholds(dataframe, variable):
    # Caps values at the limits in place; no rows are dropped
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit
    
replace_with_thresholds(df, "Quantity")
replace_with_thresholds(df, "UnitPrice")

In [None]:
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
df.head()

## M ( Monetary )
- Cancelled orders must be deducted.
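
A toy illustration of why the cancellation rows are kept at this stage. The three transactions are invented; the point is only that a cancellation carries a negative `Quantity`, so summing `TotalPrice` nets it out.

```python
import pandas as pd

# Hypothetical transactions: invoice C1002 cancels part of invoice 1001.
tx = pd.DataFrame(
    {
        "CustomerID": ["A", "A", "B"],
        "InvoiceNo": ["1001", "C1002", "1003"],
        "Quantity": [10, -4, 3],
        "UnitPrice": [2.5, 2.5, 4.0],
    }
)
tx["TotalPrice"] = tx["Quantity"] * tx["UnitPrice"]

# Cancellation rows carry a negative TotalPrice, so a plain sum nets them out.
monetary = tx.groupby("CustomerID")["TotalPrice"].sum()
```

Filtering out `Quantity <= 0` before this step would instead credit customer A with the full 25.0.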

In [None]:
monetary = df.groupby('CustomerID', as_index=False)['TotalPrice'].agg('sum')
monetary.rename(columns={'TotalPrice': 'monetary'}, inplace=True)
monetary.head()

# F ( Frequency )
- Cancelled orders must not be counted.
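
The same idea on invented data: frequency here means the number of distinct calendar days with a purchase, and cancellation rows are dropped first so they do not add a day.

```python
import pandas as pd

# Hypothetical transactions; C1003 is a cancellation on a day with no purchase.
tx = pd.DataFrame(
    {
        "CustomerID": ["A", "A", "A", "B"],
        "InvoiceNo": ["1001", "1002", "C1003", "1004"],
        "Quantity": [10, 2, -4, 3],
        "InvoiceDate": pd.to_datetime(
            ["2011-01-05 09:00", "2011-01-05 15:30", "2011-01-07 10:00", "2011-02-01 12:00"]
        ),
    }
)

# Drop cancellation rows (non-positive Quantity) before counting.
purchases = tx[tx["Quantity"] > 0]

# Frequency = number of distinct calendar days with at least one purchase.
frequency = purchases.groupby("CustomerID")["InvoiceDate"].agg(
    lambda s: s.dt.date.nunique()
)
```

Customer A bought twice on the same day, which counts once; without the filter, the cancellation on 2011-01-07 would raise A's frequency to 2.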

In [None]:
print(len(df[df['Quantity'] <= 0]))

df = df[
    (df['Quantity'] > 0)
].reset_index(drop=True)

In [None]:
frequency = df.groupby('CustomerID')['InvoiceDate'].agg(lambda x: len(set(x.dt.date))).to_frame().reset_index()
frequency.rename(columns={'InvoiceDate': 'frequency'}, inplace=True)
frequency.head()

# R ( Recency )

In [None]:
max_date = max(df['InvoiceDate'].dt.date)
max_date

In [None]:
recency = df.groupby('CustomerID')['InvoiceDate'].agg(lambda x: (max_date - x.dt.date.max()).days + 1).to_frame().reset_index()
recency.rename(columns={'InvoiceDate': 'recency'}, inplace=True)
recency.head()

In [None]:
rfm = pd.merge(recency, frequency, how='outer', on='CustomerID')
rfm = pd.merge(rfm, monetary, how='outer', on='CustomerID')
rfm.head()

In [None]:
rfm.isnull().sum()

In [None]:
rfm[rfm['recency'].isnull()]

In [None]:
rfm = rfm[
    (~rfm['recency'].isnull()) &
    (~rfm['frequency'].isnull()) &
    (rfm['monetary'] > 0)
].reset_index(drop=True)
print(len(rfm))
rfm.head()

In [None]:
rfm['recency'].plot.hist()

In [None]:
rfm['frequency'].plot.hist()

In [None]:
rfm['monetary'].plot.hist()

# Segmentation Method 1: Quartile String

In [None]:
rfm["recency_score"] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
rfm["frequency_score"] = pd.qcut(rfm['frequency'].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
rfm["monetary_score"] = pd.qcut(rfm['monetary'], 5, labels=[1, 2, 3, 4, 5])
rfm['rfm_score'] = rfm['recency_score'].astype(str) + rfm['frequency_score'].astype(str)
rfm.head()

In [None]:
seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}

rfm['segment1'] = rfm['rfm_score'].replace(seg_map, regex=True)
rfm.head()

In [None]:
rfm['segment1'].value_counts().sort_values()

In [None]:
rfm.groupby('segment1')[['recency', 'frequency', 'monetary']].agg('mean')

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt

sns.scatterplot(x='recency', 
                y='frequency', 
                hue='segment1', # different colors by group
                style='segment1', # different shapes by group
                s=50, # marker size
                alpha=.5,
                data=rfm)
plt.show()

# Segmentation Method 2: Quartile Sum

In [None]:
def rfm_level_by_sum(score):
    if score >= 12:
        return 'Can\'t Loose Them'
    elif score >= 11:
        return 'Champions'
    elif score >= 10:
        return 'Loyal'
    elif score >= 9:
        return 'Potential'
    elif score >= 8:
        return 'Promising'
    elif score >= 7:
        return 'Needs Attention'
    else:
        return 'Require Activation'

rfm['rfm_sum'] = rfm['recency_score'].astype(int) + rfm['frequency_score'].astype(int) + rfm['monetary_score'].astype(int)
rfm['segment2'] = rfm['rfm_sum'].map(rfm_level_by_sum)
rfm.head()

In [None]:
rfm['segment2'].value_counts().sort_values()

In [None]:
rfm.groupby('segment2')[['recency', 'frequency', 'monetary']].agg('mean')

In [None]:
sns.scatterplot(x='recency', 
                y='frequency', 
                hue='segment2', # different colors by group
                style='segment2', # different shapes by group
                s=50, # marker size
                alpha=.5,
                data=rfm)
plt.show()

# Segmentation Method 3: K-Means
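- The Elbow Method and the Silhouette Score are used to choose the number of clusters
- Applying StandardScaler appears to give reasonable results
    - Reason 1: the number of customers in each cluster is relatively balanced
    - Reason 2: box plots show clear differences between the clusters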
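
To make the effect of standardization concrete, the sketch below applies the z-score formula by hand with NumPy on four invented customers. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the variance-share comparison is my own illustration, not part of the original analysis.

```python
import numpy as np

# Hypothetical RFM rows: recency (days), frequency (purchase days), monetary.
X = np.array(
    [
        [5.0, 12.0, 4000.0],
        [40.0, 3.0, 300.0],
        [300.0, 1.0, 50.0],
        [20.0, 7.0, 1500.0],
    ]
)

# Subtract the column mean and divide by the population standard
# deviation (np.std defaults to ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Share of total variance carried by each column before and after scaling.
raw_share = X.var(axis=0) / X.var(axis=0).sum()
scaled_share = Z.var(axis=0) / Z.var(axis=0).sum()
```

On the raw values, monetary carries almost all of the variance, so unscaled K-Means would cluster mostly on spending; after scaling, each feature contributes one third.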

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

kmeans_data = rfm[['recency', 'frequency', 'monetary']]
standard_scaler = StandardScaler()
kmeans_data = standard_scaler.fit_transform(kmeans_data)
kmeans_data = pd.DataFrame(kmeans_data, columns=['recency', 'frequency', 'monetary'])
kmeans_data.head()

In [None]:
kmeans_data.describe()

## Elbow Method

In [None]:
inertia = []
k = [1,2,3,4,5,6,7,8,9]
for i in k:
    kmean = KMeans(n_clusters=i, random_state=42)
    kmean.fit(kmeans_data)
    inertia.append(kmean.inertia_)

In [None]:
plt.plot(k, inertia, marker='o');

## Silhouette Score

In [None]:
# Source: https://ariz1623.tistory.com/224
# Takes a list of cluster counts and plots the silhouette coefficients for each count as filled areas
def visualize_silhouette(cluster_lists, X_features): 

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    import matplotlib.pyplot as plt
    import matplotlib.cm as cm

    # One subplot per cluster count in the input list
    n_cols = len(cluster_lists)

    # Create axs with as many sub-figures as there are cluster counts in the list
    fig, axs = plt.subplots(figsize=(4*n_cols, 4), nrows=1, ncols=n_cols)

    # Iterate over the cluster counts and visualize the silhouette coefficients for each
    for ind, n_cluster in enumerate(cluster_lists):

        # Run KMeans, then compute the average silhouette score and the per-sample silhouette values
        clusterer = KMeans(n_clusters = n_cluster, max_iter=500, random_state=0)
        cluster_labels = clusterer.fit_predict(X_features)

        sil_avg = silhouette_score(X_features, cluster_labels)
        sil_values = silhouette_samples(X_features, cluster_labels)

        y_lower = 10
        axs[ind].set_title('Number of Cluster : '+ str(n_cluster)+'\n' \
                          'Silhouette Score :' + str(round(sil_avg,3)) )
        axs[ind].set_xlabel("The silhouette coefficient values")
        axs[ind].set_ylabel("Cluster label")
        axs[ind].set_xlim([-0.1, 1])
        axs[ind].set_ylim([0, len(X_features) + (n_cluster + 1) * 10])
        axs[ind].set_yticks([])  # Clear the yaxis labels / ticks
        axs[ind].set_xticks([0, 0.2, 0.4, 0.6, 0.8, 1])

        # Draw each cluster's sorted silhouette values as a filled band with fill_betweenx()
        for i in range(n_cluster):
            ith_cluster_sil_values = sil_values[cluster_labels==i]
            ith_cluster_sil_values.sort()

            size_cluster_i = ith_cluster_sil_values.shape[0]
            y_upper = y_lower + size_cluster_i

            color = cm.nipy_spectral(float(i) / n_cluster)
            axs[ind].fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_sil_values, \
                                facecolor=color, edgecolor=color, alpha=0.7)
            axs[ind].text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
            y_lower = y_upper + 10

        axs[ind].axvline(x=sil_avg, color="red", linestyle="--")
        
visualize_silhouette([2,3,4,5,6,7,8,9], kmeans_data[['recency', 'frequency', 'monetary']])

The number of clusters is set to 4.

In [None]:
from sklearn import metrics

kmeans = KMeans(n_clusters=4, init = "k-means++", random_state = 42)
kmeans.fit(kmeans_data)

In [None]:
rfm['segment3'] = kmeans.labels_
rfm.head()

In [None]:
rfm.groupby('segment3')[['recency', 'frequency', 'monetary']].agg('mean')

## Interpreting the K-Means Segmentation Results

In [None]:
sns.scatterplot(x='recency', 
                y='frequency', 
                hue='segment3', # different colors by group
                style='segment3', # different shapes by group
                s=50, # marker size
                alpha=.5,
                data=rfm)
plt.show()

In [None]:
rfm['segment3'].value_counts().to_frame().transpose()

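- Cluster 2 has higher recency than the other clusters; that is, these customers have not been active recently.
- Clusters 0 and 2 purchase less frequently than clusters 1 and 3.
- Cluster 3 spent more money than the other clusters.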

In [None]:
plt.figure(figsize=(32, 8))
plt.subplot(1,3,1)
sns.boxplot(data=rfm, x='segment3', y='recency')
plt.subplot(1,3,2)
sns.boxplot(data=rfm, x='segment3', y='frequency')
plt.subplot(1,3,3)
sns.boxplot(data=rfm, x='segment3', y='monetary')
plt.show()

In summary:

| Clusters | Recency                   | Frequency         | Monetary         |
|----------|---------------------------|-------------------|------------------|
| 0        | Recently visited          | Least frequent    | Least spending   |
| 1        | Most Recently visited     | Decent frequency  | Decent Spending  |
| 2        | Have not visited recently | Least frequent    | Least spending   |
| 3        | Most Recently visited     | Highest frequency | Spending Highest |