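## Association Rules
### Support: how often an itemset appears across all transactions; $\sigma(X)$ denotes the support count of itemset X
- $s(X) = \sigma(X)/N$, where $N$ is the total number of transactions
- The support of a rule X==>Y is the support of X and Y taken together, i.e. the fraction of transactions that contain both X and Y.
- Example: 100 customers shop at a store on a given day and 30 of them buy both item X and item Y, so the support of X==>Y is 30/100 = 0.3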
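
The support calculation can be sketched in a few lines of plain Python. The helper name `support` and the make-up of the 100 baskets are assumptions for illustration; only the 30 baskets containing both X and Y come from the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


# Hypothetical day of 100 baskets: 30 contain both X and Y.
# How the other 70 baskets are split is an assumption.
transactions = [{"X", "Y"}] * 30 + [{"X"}] * 20 + [{"Y"}] * 10 + [{"Z"}] * 40

print(support({"X", "Y"}, transactions))  # 0.3
print(support({"X"}, transactions))       # 0.5
```

Representing each transaction as a `set` keeps the "contains all items" check to a single subset test (`<=`).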
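### Confidence: how frequently Y appears in transactions that contain X.
- $c(X \Rightarrow Y) = \sigma(X \cup Y)/\sigma(X) = P(XY)/P(X)$
- Confidence reflects how reliable a rule is: how likely a customer who bought the items in X is to also buy the items in Y
- Example: if 50% of the customers who buy potato chips also buy cola, the confidence of chips==>cola is 50%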
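
A minimal sketch of the confidence calculation, assuming a made-up set of 100 baskets in which half of the chip buyers also buy cola. The function name `confidence` and the basket counts are illustrative, not taken from a real dataset.

```python
def confidence(antecedent, consequent, transactions):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_x = sum(1 for t in transactions if antecedent <= t)
    n_xy = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return n_xy / n_x if n_x else float("nan")


# Assumed counts: 40 baskets contain chips, and 20 of those also contain cola.
transactions = (
    [{"chips", "cola"}] * 20
    + [{"chips"}] * 20
    + [{"cola"}] * 10
    + [{"bread"}] * 50
)

print(confidence({"chips"}, {"cola"}, transactions))  # 0.5
print(confidence({"cola"}, {"chips"}, transactions))  # 20/30, about 0.667
```

The two printed values differ because confidence is not symmetric: chips==>cola and cola==>chips divide the same joint count by different antecedent counts.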
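### (X, Y)==>Z
- Support: the probability that X, Y and Z all appear in the same transaction
- Confidence: the probability that a transaction containing (X, Y) also contains Z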
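
With a multi-item antecedent the arithmetic is the same, only the subset test involves more items. The sketch below uses six small illustrative baskets (the item names and contents are assumptions) and evaluates the rule (Onion, Potato)==>Burger.

```python
# Six illustrative baskets; each set is one transaction.
baskets = [
    {"Onion", "Potato", "Burger"},
    {"Potato", "Burger", "Milk"},
    {"Milk", "Beer"},
    {"Onion", "Potato", "Milk"},
    {"Onion", "Potato", "Burger", "Beer"},
    {"Onion", "Potato", "Burger", "Milk"},
]

antecedent = frozenset({"Onion", "Potato"})
consequent = frozenset({"Burger"})

n_total = len(baskets)
# Baskets that contain the whole antecedent.
n_antecedent = sum(antecedent <= b for b in baskets)
# Baskets that contain antecedent and consequent together.
n_rule = sum((antecedent | consequent) <= b for b in baskets)

rule_support = n_rule / n_total          # 3/6
rule_confidence = n_rule / n_antecedent  # 3/4

print(rule_support, rule_confidence)  # 0.5 0.75
```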
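### Lift: how much the presence of itemset A changes the probability that itemset B appears
- $lift(A \Rightarrow B) = \frac{confidence(A \Rightarrow B)}{support(B)}$
- A lift greater than 1 means A and B are positively associated (A makes B more likely); a lift equal to 1 means they are independent; a lift below 1 means they are negatively associated (A makes B less likely)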
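
The three cases can be checked with exact fractions. This is a small sketch: the function name `lift` and the support values passed to it are illustrative assumptions.

```python
from fractions import Fraction


def lift(supp_ab, supp_a, supp_b):
    """lift(A => B) = confidence(A => B) / support(B)."""
    confidence = supp_ab / supp_a
    return confidence / supp_b


# Positive association: B is more likely once A is present.
print(lift(Fraction(4, 6), Fraction(4, 6), Fraction(5, 6)))  # 6/5

# Independence: P(AB) = P(A) * P(B), so lift is exactly 1.
print(lift(Fraction(1, 4), Fraction(1, 2), Fraction(1, 2)))  # 1

# Negative association: B is less likely once A is present.
print(lift(Fraction(3, 6), Fraction(5, 6), Fraction(4, 6)))  # 9/10
```

Using `Fraction` avoids floating-point noise, so the independence case comes out as exactly 1.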

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [9]:
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Onion': [1, 0, 0, 1, 1, 1],
    'Potato': [1, 1, 0, 1, 1, 1],
    'Burger': [1, 1, 0, 0, 1, 1],
    'Milk': [0, 1, 1, 1, 0, 1],
    'Beer': [0, 0, 1, 0, 1, 0]
}

Set the minimum support to 50%.

In [10]:
df = pd.DataFrame(data)
df

,ID,Onion,Potato,Burger,Milk,Beer
0,1,1,1,1,0,0
1,2,0,1,1,1,0
2,3,0,0,0,1,1
3,4,1,1,0,1,0
4,5,1,1,1,0,1
5,6,1,1,1,1,0


In [3]:
frequent_itemsets = apriori(df[['Onion', 'Potato', 'Burger', 'Milk', 'Beer']], min_support=0.5, use_colnames=True)
frequent_itemsets



,support,itemsets
0,0.666667,(Onion)
1,0.833333,(Potato)
2,0.666667,(Burger)
3,0.666667,(Milk)
4,0.666667,"(Potato, Onion)"
5,0.5,"(Burger, Onion)"
6,0.666667,"(Potato, Burger)"
7,0.5,"(Potato, Milk)"
8,0.5,"(Potato, Burger, Onion)"


In [11]:
df2 = df[['Onion', 'Potato', 'Burger', 'Milk', 'Beer']].astype(bool)
df2

,Onion,Potato,Burger,Milk,Beer
0,True,True,True,False,False
1,False,True,True,True,False
2,False,False,False,True,True
3,True,True,False,True,False
4,True,True,True,False,True
5,True,True,True,True,False


In [12]:
bool_columns = df.columns.drop('ID')  # all column names except ID
bool_columns

Index(['Onion', 'Potato', 'Burger', 'Milk', 'Beer'], dtype='object')

In [13]:
frequent_itemsets = apriori(df2, min_support=0.5, use_colnames=True)
frequent_itemsets

,support,itemsets
0,0.666667,(Onion)
1,0.833333,(Potato)
2,0.666667,(Burger)
3,0.666667,(Milk)
4,0.666667,"(Potato, Onion)"
5,0.5,"(Burger, Onion)"
6,0.666667,"(Potato, Burger)"
7,0.5,"(Potato, Milk)"
8,0.5,"(Potato, Burger, Onion)"
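
As a cross-check on the table above, the same nine itemsets can be recovered by brute force in plain Python. This sketch is not the Apriori algorithm: it enumerates every candidate itemset without pruning, which is only practical for a handful of items. The variable names are illustrative.

```python
from itertools import combinations

# The six transactions from the DataFrame, written as sets of item names.
baskets = [
    {"Onion", "Potato", "Burger"},
    {"Potato", "Burger", "Milk"},
    {"Milk", "Beer"},
    {"Onion", "Potato", "Milk"},
    {"Onion", "Potato", "Burger", "Beer"},
    {"Onion", "Potato", "Burger", "Milk"},
]
min_support = 0.5

items = sorted(set().union(*baskets))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        # Support = share of baskets containing every item of the candidate.
        supp = sum(set(candidate) <= b for b in baskets) / len(baskets)
        if supp >= min_support:
            frequent[candidate] = supp

for itemset, supp in frequent.items():
    print(itemset, round(supp, 6))
```

Beer appears in only 2 of 6 baskets, so it falls below the 50% threshold and never shows up, alone or in combination.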


Computing the rules
- Choose the metric to filter on and its minimum threshold

In [14]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.7)
rules

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(Potato),(Onion),0.833333,0.666667,0.666667,0.8,1.2,1.0,0.111111,1.666667,1.0,0.8,0.4,0.9
1,(Onion),(Potato),0.666667,0.833333,0.666667,1.0,1.2,1.0,0.111111,inf,0.5,0.8,1.0,0.9
2,(Burger),(Onion),0.666667,0.666667,0.5,0.75,1.125,1.0,0.055556,1.333333,0.333333,0.6,0.25,0.75
3,(Onion),(Burger),0.666667,0.666667,0.5,0.75,1.125,1.0,0.055556,1.333333,0.333333,0.6,0.25,0.75
4,(Potato),(Burger),0.833333,0.666667,0.666667,0.8,1.2,1.0,0.111111,1.666667,1.0,0.8,0.4,0.9
5,(Burger),(Potato),0.666667,0.833333,0.666667,1.0,1.2,1.0,0.111111,inf,0.5,0.8,1.0,0.9
6,(Potato),(Milk),0.833333,0.666667,0.5,0.6,0.9,1.0,-0.055556,0.833333,-0.4,0.5,-0.2,0.675
7,(Milk),(Potato),0.666667,0.833333,0.5,0.75,0.9,1.0,-0.055556,0.666667,-0.25,0.5,-0.5,0.675
8,"(Potato, Burger)",(Onion),0.666667,0.666667,0.5,0.75,1.125,1.0,0.055556,1.333333,0.333333,0.6,0.25,0.75
9,"(Potato, Onion)",(Burger),0.666667,0.666667,0.5,0.75,1.125,1.0,0.055556,1.333333,0.333333,0.6,0.25,0.75


In [16]:
rules[(rules['lift']>1.125) & (rules['confidence']>0.8)]

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
1,(Onion),(Potato),0.666667,0.833333,0.666667,1.0,1.2,1.0,0.111111,inf,0.5,0.8,1.0,0.9
5,(Burger),(Potato),0.666667,0.833333,0.666667,1.0,1.2,1.0,0.111111,inf,0.5,0.8,1.0,0.9
10,"(Burger, Onion)",(Potato),0.5,0.833333,0.5,1.0,1.2,1.0,0.083333,inf,0.333333,0.6,1.0,0.8


| measure     | definition                                 | interpretation                           |
|-------------|--------------------------------------------|------------------------------------------|
| support     | $\text{supp}_T(A \Rightarrow B)$           | $P(A \cap B)$                          |
| confidence  | $\frac{\text{supp}_T(A \Rightarrow B)}{\text{supp}_T(A)}$ | $P(B \mid A)$                          |
| lift        | $\frac{\text{conf}_T(A \Rightarrow B)}{\text{supp}_T(B)}$ | $\frac{P(B \mid A)}{P(B)}$             |
| leverage    | $\text{supp}_T(A \Rightarrow B) - \text{supp}_T(A) \text{supp}_T(B)$ | $P(A \cap B) - P(A) P(B)$              |
| conviction  | $\frac{1 - \text{supp}_T(B)}{1 - \text{conf}_T(A \Rightarrow B)}$ | $\frac{1 - P(B)}{1 - P(B \mid A)}$     |


| measure     | min value, incompatibility | value at independence       | max value, logical rule       |
|-------------|----------------------------|-----------------------------|--------------------------------|
| support     | $0$                      | $\text{supp}_T(A) \text{supp}_T(B)$ | $\text{supp}_T(A)$           |
| confidence  | $0$                      | $\text{supp}_T(B)$        | $1$                          |
| lift        | $0$                      | $1$                       | $\frac{1}{\text{supp}_T(B)}$ |
| leverage    | $-\text{supp}_T(A) \text{supp}_T(B)$ | $0$                       | $\text{supp}_T(A) (1 - \text{supp}_T(B))$ |
| conviction  | $1 - \text{supp}_T(B)$   | $1$                       | $\infty$                     |
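
The definitions in the two tables translate directly into code. The sketch below is a plain-Python helper (the name `rule_measures` is an assumption, not part of mlxtend) that derives the measures from three supports and checks them against the Potato==>Onion and Onion==>Potato rows of the rules output.

```python
import math


def rule_measures(supp_a, supp_b, supp_ab):
    """Compute rule measures for A => B from supp(A), supp(B) and supp(A and B)."""
    confidence = supp_ab / supp_a
    lift = confidence / supp_b
    leverage = supp_ab - supp_a * supp_b
    # Conviction is unbounded for a logical rule (confidence = 1).
    if math.isclose(confidence, 1.0):
        conviction = math.inf
    else:
        conviction = (1 - supp_b) / (1 - confidence)
    return {
        "support": supp_ab,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Potato => Onion: supp(Potato) = 5/6, supp(Onion) = 4/6, joint support = 4/6.
for name, value in rule_measures(5 / 6, 4 / 6, 4 / 6).items():
    print(name, round(value, 6))

# Onion => Potato: every Onion basket also contains Potato, so conviction is inf.
print(rule_measures(4 / 6, 5 / 6, 4 / 6)["conviction"])
```

Swapping antecedent and consequent leaves support, lift and leverage unchanged but changes confidence and conviction, which matches the paired rows in the rules output.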

One-hot encoding a single categorical column with get_dummies

In [18]:
import pandas as pd

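# Create a DataFrame with one categorical column
# ('颜色' means "color"; the values are red, green, blue, red)
df = pd.DataFrame({
    '颜色': ['红色', '绿色', '蓝色', '红色']
})

# One-hot encode the "颜色" column
encoded_df = pd.get_dummies(df, columns=['颜色'])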
print(encoded_df)

   颜色_红色  颜色_绿色  颜色_蓝色
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False


In [17]:
df = pd.DataFrame({
    '颜色': ['红色', '绿色', '蓝色', '红色'],
    '尺寸': ['大', '中', '小', '中']
})

# One-hot encode both the "颜色" (color) and "尺寸" (size: large, medium, small) columns
encoded_multi = pd.get_dummies(df, columns=['颜色', '尺寸'])
print(encoded_multi)

   颜色_红色  颜色_绿色  颜色_蓝色   尺寸_中   尺寸_大   尺寸_小
0   True  False  False  False   True  False
1  False   True  False   True  False  False
2  False  False   True  False  False   True
3   True  False  False   True  False  False


In [20]:
# Create sample data
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Basket': [
        ['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
        ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
        ['Soda', 'Chips', 'Milk'],
        ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
        ['Soda', 'Coffee', 'Milk', 'Bread'],
        ['Beer', 'Chips']
    ]
}
df = pd.DataFrame(data)
# # Convert the Basket column to strings so it can be one-hot encoded
# df['Basket'] = df['Basket'].apply(lambda x: ', '.join(x))
# # One-hot encode with get_dummies
# encoded_df = pd.get_dummies(df, columns=['Basket'])
# print(encoded_df)

In [22]:
retail_id = df.drop('Basket', axis=1)
retail_basket = df['Basket'].str.join(',')  # e.g. ['Beer', 'Diaper', 'Chips'] → 'Beer,Diaper,Chips'
retail_basket

0              Beer,Diaper,Pretzels,Chips,Aspirin
1    Diaper,Beer,Chips,Lotion,Juice,BabyFood,Milk
2                                 Soda,Chips,Milk
3                  Soup,Beer,Diaper,Milk,IceCream
4                          Soda,Coffee,Milk,Bread
5                                      Beer,Chips
Name: Basket, dtype: object

In [23]:
retail_basket = retail_basket.str.get_dummies(',')
retail_basket

,Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
0,1,0,1,0,1,0,1,0,0,0,0,1,0,0
1,0,1,1,0,1,0,1,0,1,1,1,0,0,0
2,0,0,0,0,1,0,0,0,0,0,1,0,1,0
3,0,0,1,0,0,0,1,1,0,0,1,0,0,1
4,0,0,0,1,0,1,0,0,0,0,1,0,1,0
5,0,0,1,0,1,0,0,0,0,0,0,0,0,0


In [24]:
retail = retail_id.join(retail_basket)
retail

,ID,Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
0,1,1,0,1,0,1,0,1,0,0,0,0,1,0,0
1,2,0,1,1,0,1,0,1,0,1,1,1,0,0,0
2,3,0,0,0,0,1,0,0,0,0,0,1,0,1,0
3,4,0,0,1,0,0,0,1,1,0,0,1,0,0,1
4,5,0,0,0,1,0,1,0,0,0,0,1,0,1,0
5,6,0,0,1,0,1,0,0,0,0,0,0,0,0,0
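
The join-then-split route above is one way to one-hot encode list-valued baskets. A possible alternative, sketched here with pandas only, is to `explode` the lists into one row per item and collapse the dummies back per transaction. The variable names are illustrative. The result is cast to `bool` on the assumption that it will be passed to `apriori`, as in the earlier `astype(bool)` cell.

```python
import pandas as pd

baskets = pd.Series(
    [
        ['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
        ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
        ['Soda', 'Chips', 'Milk'],
        ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
        ['Soda', 'Coffee', 'Milk', 'Bread'],
        ['Beer', 'Chips'],
    ],
    name='Basket',
)

# explode() yields one row per (transaction, item) and repeats the
# transaction's index label for each of its items.
exploded = baskets.explode()

# One dummy column per item, then collapse back to one row per transaction.
onehot = pd.get_dummies(exploded).groupby(level=0).max().astype(bool)

print(onehot.shape)  # (6, 14)
print(onehot.loc[5, ['Beer', 'Chips', 'Milk']].tolist())  # [True, True, False]
```

This avoids choosing a separator character, which matters if an item name could itself contain a comma.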
