# Market Basket Analysis

## Introduction to Market Basket Analysis

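**Market basket analysis**, also called **association analysis**, mines large volumes of transaction data for association rules that are hidden in the data. These rules describe what consumers typically buy and which products are frequently purchased together. The classic example of market basket analysis is beer and diapers.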

### Concepts of Market Basket Analysis

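Market basket analysis rests mainly on two probabilistic measures: **support** and **confidence**. The following example illustrates what they mean and how they are calculated. As the figure below shows, suppose there are 2000 invoices in total (denoted $C$). Of these, 1250 (750+500) contain product A (denoted $C_{A}$), 1000 (500+500) contain product B (denoted $C_{B}$), and 500 contain both A and B (denoted $C_{A, B}$).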

![](image/mbdata.png)

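We want the support and confidence of the rule "a customer who buys product A also buys product B" (market basket analysis writes this association rule as ${A}\Rightarrow{B}$):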

- Support:
The proportion of all invoices in which A and B are purchased together: $Pr(A, B)=\frac{C_{A, B}}{C}=\frac{500}{2000}=0.25$
A high support means that customers are likely to buy A and B together.

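- Confidence:
The proportion of invoices containing A that also contain B: $Pr(B|A)=\frac{Pr(A,B)}{Pr(A)}=\frac{\frac{C_{A, B}}{C}}{\frac{C_{A}}{C}}=\frac{\frac{500}{2000}}{\frac{1250}{2000}}=0.4$
A high confidence means that a customer who buys A is also likely to buy B. The converse does not follow: a customer who buys B does not necessarily buy A.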
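
The two calculations above can be reproduced in a few lines of Python. This is a minimal sketch that uses only the invoice counts from the example; the variable names are illustrative and do not come from the notebook.

```python
# Invoice counts taken from the example above
total_invoices = 2000      # C
invoices_with_a = 1250     # C_A
invoices_with_b = 1000     # C_B
invoices_with_both = 500   # C_{A,B}

# Support of A => B: share of all invoices that contain both A and B
support = invoices_with_both / total_invoices

# Confidence of A => B: share of invoices with A that also contain B
confidence_a_to_b = invoices_with_both / invoices_with_a

# Confidence of the reverse rule B => A, for comparison
confidence_b_to_a = invoices_with_both / invoices_with_b

print(f"support(A => B)    = {support:.2f}")
print(f"confidence(A => B) = {confidence_a_to_b:.2f}")
print(f"confidence(B => A) = {confidence_b_to_a:.2f}")
```

The reverse rule has a confidence of 0.50 rather than 0.40, which shows that confidence depends on the direction of the rule while support does not.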

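Before running a market basket analysis, a minimum support and a minimum confidence must be set. If these thresholds are too low, too many association rules are generated and they get in the way of decision-making. If they are too high, too few rules are found for the analysis to be useful.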
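
The effect of the minimum support threshold can be seen on a small example. The sketch below uses hypothetical toy transactions (not the Online Retail data) and a hypothetical helper, `count_frequent_itemsets`, that enumerates every itemset of one or two items by brute force. It is not the Apriori algorithm, only an illustration of how the threshold changes the number of itemsets kept.

```python
from itertools import combinations

# Hypothetical toy transactions, one set of items per invoice
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def count_frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items
    whose support is at least min_support (brute-force enumeration)."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            # Number of transactions that contain every item of the candidate
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= min_support:
                result[candidate] = support
    return result


low = count_frequent_itemsets(transactions, min_support=0.2)
high = count_frequent_itemsets(transactions, min_support=0.5)
print(len(low), "itemsets at min_support=0.2")
print(len(high), "itemsets at min_support=0.5")
```

On this toy data the lower threshold keeps 10 itemsets and the higher one keeps 5, of which only one (bread and milk) is a pair.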

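A strong association rule usually has high support and high confidence. However, high support and confidence do not by themselves guarantee that the items in the rule are strongly correlated. The **lift** of the rule should also be checked to see whether it is greater than 1.

  - lift > 1: A and B are positively associated
  - lift = 1: A and B are independent (no association)
  - lift < 1: A and B are negatively associated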

Lift is calculated as: $\frac{Pr(B|A)}{Pr(B)}=\frac{Pr(A,B)}{Pr(A){\times}Pr(B)}$
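
As a worked example, the lift of ${A}\Rightarrow{B}$ can be computed from the invoice counts used earlier, where $Pr(B)=\frac{C_{B}}{C}=\frac{1000}{2000}=0.5$:

$\frac{Pr(B|A)}{Pr(B)}=\frac{0.4}{0.5}=0.8=\frac{Pr(A,B)}{Pr(A){\times}Pr(B)}=\frac{0.25}{0.625{\times}0.5}$

The lift is below 1, so in that example A and B are negatively associated even though the rule has a support of 0.25 and a confidence of 0.4. Customers who buy A are less likely to buy B than customers in general.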

## Applying Market Basket Analysis

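Below we use the Python package mlxtend to analyze the Online Retail.xlsx dataset. As in previous sessions, the dataset contains the fields InvoiceNo (invoice number), StockCode (item code), Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. We treat all records that share an invoice number as one transaction and analyze which items tend to be purchased together.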

### Loading packages

In [1]:
# Load the required packages

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
'''
Work around Chinese characters not rendering in figures.
Reference:
https://codertw.com/%E7%A8%8B%E5%BC%8F%E8%AA%9E%E8%A8%80/359974/
'''
from matplotlib.font_manager import FontProperties

han_font = FontProperties(fname=r"c:/windows/fonts/msjh.ttc", size=14) # Chinese font (Microsoft JhengHei, Windows path)

In [3]:
'''
Set the visual style of the figures
'''
sns.set(style="whitegrid")

In [4]:
# Association analysis
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

### Reading the data

In [6]:
# Read the data file
df = pd.read_excel('Online Retail.xlsx')

### Cleaning the data

In [7]:
# Drop records that have no CustomerID
df = df.dropna(subset=['CustomerID'])

In [8]:
from datetime import date

df = df.assign(PurchaseDate=df.InvoiceDate.apply(lambda x: x.date()))

# Keep the one year of data from 2010-12-09 to 2011-12-09
# (the dataset ends on 2011-12-09, so only the lower bound is filtered)
df = df[df.PurchaseDate>=date(2010, 12, 9)]

In [9]:
# Keep purchase records only (exclude cancellations)
df = df[df.Quantity>0]

In [10]:
# Strip leading and trailing whitespace from the item descriptions
df['Description'] = df['Description'].str.strip()

In [11]:
# Drop records whose description is POSTAGE
df = df.loc[df.Description!="POSTAGE"]

### Inspecting the data

In [None]:
df.head()

## Analysis by country

In [16]:
df.groupby("Country").InvoiceNo.nunique().reset_index().sort_values("InvoiceNo", ascending=False)

Unnamed: 0,Country,InvoiceNo
34,United Kingdom,16000
14,Germany,437
13,France,377
10,EIRE,252
3,Belgium,97
22,Netherlands,93
29,Spain,87
0,Australia,54
25,Portugal,52
31,Switzerland,46


In [13]:
# Germany
print("Germany has {} records, {} invoices, and {} items".
      format(len(df[df.Country=="Germany"]),
             len(set(df[df.Country=="Germany"].InvoiceNo)),
             len(set(df[df.Country=="Germany"].Description))))

Germany has 8476 records, 437 invoices, and 1681 items


In [14]:
# France
print("France has {} records, {} invoices, and {} items".
      format(len(df[df.Country=="France"]),
             len(set(df[df.Country=="France"].InvoiceNo)),
             len(set(df[df.Country=="France"].Description))))

France has 7871 records, 377 invoices, and 1531 items


In [29]:
national_purchase = df.groupby(['Country', 'InvoiceNo', 'Description']).Quantity.sum().reset_index()
national_purchase.head()

Unnamed: 0,Country,InvoiceNo,Description,Quantity
0,Australia,539419,ALARM CLOCK BAKELIKE PINK,4
1,Australia,539419,CORONA MEXICAN TRAY,50
2,Australia,539419,DOORMAT RED RETROSPOT,4
3,Australia,539419,DOORMAT UNION FLAG,20
4,Australia,539419,JUMBO BAG RED RETROSPOT,10


In [31]:
national_purchase.Quantity = 1

### Market basket analysis for Germany

In [32]:
Germany_inv_items = national_purchase[national_purchase.Country=="Germany"]
Germany_inv_items.head()

Unnamed: 0,Country,InvoiceNo,Description,Quantity
21020,Germany,537892,4 TRADITIONAL SPINNING TOPS,1
21021,Germany,537892,BABUSHKA LIGHTS STRING OF 10,1
21022,Germany,537892,CHILDS BREAKFAST SET CIRCUS PARADE,1
21023,Germany,537892,PLASTERS IN TIN CIRCUS PARADE,1
21024,Germany,537892,RETROSPOT CHILDRENS APRON,1


In [34]:
Germany_inv_items = Germany_inv_items.pivot(index='InvoiceNo',
                                            columns='Description',
                                            values='Quantity')
Germany_inv_items.head()

In [35]:
Germany_inv_items = Germany_inv_items.fillna(0)
Germany_inv_items.head()

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE SKULLS,...,YULETIDE IMAGES GIFT WRAP SET,ZINC HEART T-LIGHT HOLDER,ZINC STAR T-LIGHT HOLDER,ZINC BOX SIGN HOME,ZINC FOLKART SLEIGH BELLS,ZINC HEART LATTICE T-LIGHT HOLDER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
537892,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537894,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537995,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
538174,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
538644,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
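
The pivot-and-fill steps above produce a table of 0.0 and 1.0 floats. To my understanding, recent versions of mlxtend recommend a boolean DataFrame as input to `apriori` and may warn about other dtypes. The sketch below shows one way to build a boolean basket table on hypothetical toy data in the same long format as `national_purchase`; the names `toy`, `basket`, and `basket_bool` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: one row per (invoice, item), Quantity already set to 1
toy = pd.DataFrame({
    "InvoiceNo": [1, 1, 2, 3, 3, 3],
    "Description": ["APRON", "LUNCH BOX", "APRON",
                    "APRON", "LUNCH BOX", "SNACK BOX"],
    "Quantity": [1, 1, 1, 1, 1, 1],
})

# One row per invoice, one column per item; absent combinations become NaN
basket = toy.pivot(index="InvoiceNo", columns="Description", values="Quantity")

# True where the invoice contains the item, False otherwise
basket_bool = basket.notna()

# The column mean of a boolean basket is the support of each single item
item_support = basket_bool.mean()

print(basket_bool)
print(item_support)
```

Using `notna()` replaces the `fillna(0)` step in this sketch, because a missing value already means that the item is not on the invoice.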


In [37]:
frequent_itemsets = apriori(Germany_inv_items, min_support=0.05, use_colnames=True)

frequent_itemsets.head()

Unnamed: 0,support,itemsets
0,0.100686,(6 RIBBONS RUSTIC CHARM)
1,0.073227,(ALARM CLOCK BAKELIKE PINK)
2,0.06865,(CHARLOTTE BAG APPLES DESIGN)
3,0.052632,(CHILDRENS CUTLERY DOLLY GIRL)
4,0.050343,(CHILDRENS CUTLERY SPACEBOY)


In [38]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN CIRCUS PARADE),0.112128,0.118993,0.050343,0.44898,3.773155,0.037001,1.598864
1,(PLASTERS IN TIN CIRCUS PARADE),(PLASTERS IN TIN SPACEBOY),0.118993,0.112128,0.050343,0.423077,3.773155,0.037001,1.538978
2,(PLASTERS IN TIN STRONGMAN),(PLASTERS IN TIN CIRCUS PARADE),0.073227,0.118993,0.050343,0.6875,5.777644,0.04163,2.819222
3,(PLASTERS IN TIN CIRCUS PARADE),(PLASTERS IN TIN STRONGMAN),0.118993,0.073227,0.050343,0.423077,5.777644,0.04163,1.606407
4,(PLASTERS IN TIN WOODLAND ANIMALS),(PLASTERS IN TIN CIRCUS PARADE),0.141876,0.118993,0.070938,0.5,4.201923,0.054056,1.762014
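
The rules table contains two columns, leverage and conviction, that the text above does not define. The sketch below recomputes lift, leverage, and conviction for row 2 from the other columns of that row. The leverage and conviction formulas are the commonly used definitions, which I assume are the ones mlxtend applies; the recomputed values agree with the table up to rounding.

```python
# Values copied from row 2 of the rules table above
antecedent_support = 0.073227   # Pr(A), A = PLASTERS IN TIN STRONGMAN
consequent_support = 0.118993   # Pr(B), B = PLASTERS IN TIN CIRCUS PARADE
support = 0.050343              # Pr(A, B)
confidence = 0.6875             # Pr(B|A)

# Lift: confidence relative to the baseline probability of B
lift = confidence / consequent_support

# Leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support

# Conviction: how often the rule would be wrong under independence,
# relative to how often it is actually wrong
conviction = (1 - consequent_support) / (1 - confidence)

print(f"lift       = {lift:.4f}")
print(f"leverage   = {leverage:.4f}")
print(f"conviction = {conviction:.4f}")
```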


In [39]:
rules[(rules.lift>=1) & (rules.confidence>0.6)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,(PLASTERS IN TIN STRONGMAN),(PLASTERS IN TIN CIRCUS PARADE),0.073227,0.118993,0.050343,0.6875,5.777644,0.04163,2.819222
18,(RED RETROSPOT CHARLOTTE BAG),(WOODLAND CHARLOTTE BAG),0.070938,0.128146,0.059497,0.83871,6.544931,0.050406,5.405492
20,(ROUND SNACK BOXES SET OF 4 FRUITS),(ROUND SNACK BOXES SET OF4 WOODLAND),0.160183,0.251716,0.132723,0.828571,3.291688,0.092402,4.364989
22,(SPACEBOY LUNCH BOX),(ROUND SNACK BOXES SET OF4 WOODLAND),0.105263,0.251716,0.070938,0.673913,2.677273,0.044442,2.294737


## Total (all countries)

In [41]:
inv_items = national_purchase.pivot(index='InvoiceNo',
                                    columns='Description',
                                    values='Quantity')
inv_items.head()

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 DAISY PEGS IN WOOD BOX,12 EGG HOUSE PAINTED WOOD,12 HANGING EGGS HAND PAINTED,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
537879,,,,,,,,,,,...,,,,,,,,,,
537880,,,,,,,,,,,...,,,,,,,,,,
537881,,,,,,,,,,,...,,,,,,,,,,
537882,,,,,,,,,,,...,,,,,,,,,,
537883,,,,,,,,,,,...,,,,,,,,,,


In [42]:
inv_items = inv_items.fillna(0)
inv_items.head()

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 DAISY PEGS IN WOOD BOX,12 EGG HOUSE PAINTED WOOD,12 HANGING EGGS HAND PAINTED,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,...,ZINC STAR T-LIGHT HOLDER,ZINC SWEETHEART SOAP DISH,ZINC SWEETHEART WIRE LETTER RACK,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS LARGE,ZINC T-LIGHT HOLDER STARS SMALL,ZINC TOP 2 DOOR WOODEN SHELF,ZINC WILLIE WINKIE CANDLE STICK,ZINC WIRE KITCHEN ORGANISER,ZINC WIRE SWEETHEART LETTER TRAY
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
537879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537880,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537881,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537882,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537883,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [48]:
frequent_itemsets = apriori(inv_items, min_support=0.03, use_colnames=True)

frequent_itemsets

Unnamed: 0,support,itemsets
0,0.039313,(6 RIBBONS RUSTIC CHARM)
1,0.035662,(60 TEATIME FAIRY CAKE CASES)
2,0.042289,(ALARM CLOCK BAKELIKE GREEN)
3,0.033135,(ALARM CLOCK BAKELIKE PINK)
4,0.047400,(ALARM CLOCK BAKELIKE RED)
...,...,...
86,0.039200,(VINTAGE SNAP CARDS)
87,0.105695,(WHITE HANGING HEART T-LIGHT HOLDER)
88,0.032742,(WOOD BLACK BOARD ANT WHITE FINISH)
89,0.043244,(WOODEN FRAME ANTIQUE WHITE)


In [49]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
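
The rules table above is empty. A likely explanation, which is an assumption because the `frequent_itemsets` output is truncated, is that no itemset with two or more items reaches `min_support=0.03` across all countries, so there is nothing to build a rule from. The sketch below shows one way to check the itemset sizes, using a hypothetical stand-in table, `toy_itemsets`, with three of the single-item rows shown above.

```python
import pandas as pd

# Hypothetical stand-in for frequent_itemsets: single-item itemsets only
toy_itemsets = pd.DataFrame({
    "support": [0.039313, 0.035662, 0.105695],
    "itemsets": [
        frozenset({"6 RIBBONS RUSTIC CHARM"}),
        frozenset({"60 TEATIME FAIRY CAKE CASES"}),
        frozenset({"WHITE HANGING HEART T-LIGHT HOLDER"}),
    ],
})

# Number of items in each itemset
itemset_sizes = toy_itemsets["itemsets"].apply(len)

# How many itemsets there are of each size
size_counts = itemset_sizes.value_counts()

# A rule needs an itemset with at least two items
has_multi_item = bool((itemset_sizes >= 2).any())

print(size_counts)
print("Any itemset with 2+ items:", has_multi_item)
```

If the same check on the real `frequent_itemsets` finds no itemset of size 2 or more, lowering `min_support` is one option, at the cost of a longer run and more rules to review.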


In [None]:
print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
print("Remaining order_item: {:21d}".format(len(order_item)))


In [None]:
gr2 = gr1.groupby("product_id")\
.order_id.nunique().reset_index()\
.sort_values("product_id", ascending=True)

In [None]:
gr2 = gr2.assign(totals=gr2.product_id*gr2.order_id)

In [None]:
total_sum = gr2.totals.sum()

In [None]:
gr2 = gr2.assign(percent=gr2.totals/total_sum)

In [None]:
gr2 = gr2.assign(cum_percent=np.around(gr2.percent.cumsum()*100, decimals=2))

In [None]:
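'''
Use a line plot to show the distribution of the number of items per order.
'''
plt.figure(figsize=[10, 5]) # figure size
ax = sns.lineplot(x="product_id", y="cum_percent", data=gr2)

ax.set_xlabel("Number of items per order", fontproperties=han_font) # x-axis label, using the font loaded earlier
ax.set_ylabel("Cumulative percentage of items sold", fontproperties=han_font)
ax.set_title('Distribution of the number of items per order', fontproperties=han_font, fontsize=18)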

In [None]:
# Best-selling products
gr3 = order_products_prior.groupby("product_name")\
.agg({'reordered': 'count'}).reset_index().sort_values("reordered", ascending=False)

gr3