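# Association Rule Learning with Apriori
Association rule learning is a method for discovering interesting relations between variables in large databases. Its goal is to identify strong rules in a database using measures of interestingness. Association rules are widely used in market basket analysis, web usage mining, intrusion detection, continuous production, and bioinformatics.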
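
For reference, the two measures that the thresholds in this notebook refer to can be written as follows. These are the standard textbook definitions; the symbols $T$ (the set of transactions) and $X$, $Y$ (itemsets) are notation introduced here, not taken from the original text.

$$
\mathrm{supp}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|},
\qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
$$

A rule is usually called strong when both its support and its confidence meet user-chosen minimum thresholds.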

In [1]:
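# Build the example dataset: six shopping baskets
# Items: 牛奶 milk, 洋葱 onion, 牛肉 beef, 芸豆 kidney beans, 鸡蛋 eggs,
#        酸奶 yogurt, 玉米 corn, 豆腐 tofu, 香蕉 banana, 苹果 apple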
dataset = [
    ["牛奶", "洋葱", "牛肉", "芸豆", "鸡蛋", "酸奶"],
    ["玉米", "洋葱", "洋葱", "芸豆", "豆腐", "鸡蛋"],
    ["牛奶", "香蕉", "玉米", "芸豆", "酸奶"],
    ["芸豆", "玉米", "香蕉", "牛奶", "鸡蛋"],
    ["香蕉", "牛奶", "鸡蛋", "酸奶"],
    ["牛奶", "苹果", "芸豆", "鸡蛋"],
]
dataset

[['牛奶', '洋葱', '牛肉', '芸豆', '鸡蛋', '酸奶'],
 ['玉米', '洋葱', '洋葱', '芸豆', '豆腐', '鸡蛋'],
 ['牛奶', '香蕉', '玉米', '芸豆', '酸奶'],
 ['芸豆', '玉米', '香蕉', '牛奶', '鸡蛋'],
 ['香蕉', '牛奶', '鸡蛋', '酸奶'],
 ['牛奶', '苹果', '芸豆', '鸡蛋']]

In [2]:
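# Convert the dataset into the one-hot boolean format that apriori expects
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()  # create the encoder
te_ary = te.fit_transform(dataset)  # learn the item labels and one-hot encode the transactions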
te_ary

array([[ True,  True,  True, False,  True, False, False,  True, False,
         True],
       [ True, False, False,  True,  True, False,  True, False, False,
         True],
       [False,  True, False,  True,  True, False, False,  True,  True,
        False],
       [False,  True, False,  True,  True, False, False, False,  True,
         True],
       [False,  True, False, False, False, False, False,  True,  True,
         True],
       [False,  True, False, False,  True,  True, False, False, False,
         True]])
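
To make the encoding step concrete, here is a minimal pure-Python sketch of the same idea. It is not mlxtend's implementation; the names `columns` and `encoded` are illustrative. It assumes the columns are the distinct items in sorted order, which matches the column order shown in this notebook.

```python
# Pure-Python sketch of one-hot encoding a list of transactions.
dataset = [
    ["牛奶", "洋葱", "牛肉", "芸豆", "鸡蛋", "酸奶"],
    ["玉米", "洋葱", "洋葱", "芸豆", "豆腐", "鸡蛋"],
    ["牛奶", "香蕉", "玉米", "芸豆", "酸奶"],
    ["芸豆", "玉米", "香蕉", "牛奶", "鸡蛋"],
    ["香蕉", "牛奶", "鸡蛋", "酸奶"],
    ["牛奶", "苹果", "芸豆", "鸡蛋"],
]

# One column per distinct item, sorted (Python sorts strings by code point).
columns = sorted({item for transaction in dataset for item in transaction})

# One boolean row per transaction: True where the transaction contains the item.
encoded = [
    [item in set(transaction) for item in columns]
    for transaction in dataset
]

print(columns)
print(encoded[1])
```

The second transaction lists 洋葱 twice, but the encoding only records presence, so it produces a single `True` in that column.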

In [3]:
# Wrap the boolean array in a DataFrame
df = pd.DataFrame(te_ary, columns=te.columns_)  # label each column with its item name
df

,洋葱,牛奶,牛肉,玉米,芸豆,苹果,豆腐,酸奶,香蕉,鸡蛋
0,True,True,True,False,True,False,False,True,False,True
1,True,False,False,True,True,False,True,False,False,True
2,False,True,False,True,True,False,False,True,True,False
3,False,True,False,True,True,False,False,False,True,True
4,False,True,False,False,False,False,False,True,True,True
5,False,True,False,False,True,True,False,False,False,True


In [5]:
# Mine frequent itemsets with a minimum support of 0.5
from mlxtend.frequent_patterns import apriori

apriori(df, min_support=0.5, use_colnames=True)

,support,itemsets
0,0.833333,(牛奶)
1,0.5,(玉米)
2,0.833333,(芸豆)
3,0.5,(酸奶)
4,0.5,(香蕉)
5,0.833333,(鸡蛋)
6,0.666667,"(芸豆, 牛奶)"
7,0.5,"(酸奶, 牛奶)"
8,0.5,"(香蕉, 牛奶)"
9,0.666667,"(鸡蛋, 牛奶)"
10,0.5,"(玉米, 芸豆)"
11,0.666667,"(鸡蛋, 芸豆)"
12,0.5,"(鸡蛋, 芸豆, 牛奶)"
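
The support values in this table can be checked by counting transactions directly. The sketch below uses a hypothetical helper `support` (not part of mlxtend) and exact fractions so the results are easy to read.

```python
from fractions import Fraction

dataset = [
    ["牛奶", "洋葱", "牛肉", "芸豆", "鸡蛋", "酸奶"],
    ["玉米", "洋葱", "洋葱", "芸豆", "豆腐", "鸡蛋"],
    ["牛奶", "香蕉", "玉米", "芸豆", "酸奶"],
    ["芸豆", "玉米", "香蕉", "牛奶", "鸡蛋"],
    ["香蕉", "牛奶", "鸡蛋", "酸奶"],
    ["牛奶", "苹果", "芸豆", "鸡蛋"],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return Fraction(hits, len(transactions))


print(support({"牛奶"}, dataset))          # 5/6
print(support({"芸豆", "牛奶"}, dataset))  # 2/3
print(support({"洋葱"}, dataset))          # 1/3, below the 0.5 threshold
```

Items such as 洋葱 appear in fewer than half of the baskets, which is why they are absent from the result for `min_support=0.5`.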


In [6]:
# Show only the frequent itemsets that contain at least two items
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
frequent_itemsets[
    frequent_itemsets["itemsets"].apply(len) >= 2
]  # keep itemsets of length >= 2

,support,itemsets
6,0.666667,"(芸豆, 牛奶)"
7,0.5,"(酸奶, 牛奶)"
8,0.5,"(香蕉, 牛奶)"
9,0.666667,"(鸡蛋, 牛奶)"
10,0.5,"(玉米, 芸豆)"
11,0.666667,"(鸡蛋, 芸豆)"
12,0.5,"(鸡蛋, 芸豆, 牛奶)"
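
For readers curious how such a list can be produced, the following is a small level-wise sketch of the Apriori idea: build candidates of size k from frequent itemsets of size k-1, and discard any candidate that has an infrequent subset. It is an illustration written for this notebook (the function name `apriori_sketch` is made up) and makes no claim to mirror mlxtend's internals or performance.

```python
from itertools import combinations

dataset = [
    ["牛奶", "洋葱", "牛肉", "芸豆", "鸡蛋", "酸奶"],
    ["玉米", "洋葱", "洋葱", "芸豆", "豆腐", "鸡蛋"],
    ["牛奶", "香蕉", "玉米", "芸豆", "酸奶"],
    ["芸豆", "玉米", "香蕉", "牛奶", "鸡蛋"],
    ["香蕉", "牛奶", "鸡蛋", "酸奶"],
    ["牛奶", "苹果", "芸豆", "鸡蛋"],
]


def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return frequent


result = apriori_sketch(dataset, 0.5)
print(len(result))                                   # 13
print(result[frozenset({"鸡蛋", "芸豆", "牛奶"})])   # 0.5
```

On this dataset the sketch finds six single items, six pairs, and one triple, which agrees with the indices 0 to 12 shown in the outputs here.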


In [7]:
# Generate association rules, filtering by a minimum confidence threshold
from mlxtend.frequent_patterns import association_rules

association_rules(
    frequent_itemsets, metric="confidence", min_threshold=0.6
)  # confidence threshold of 0.6

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(芸豆),(牛奶),0.833333,0.833333,0.666667,0.8,0.96,1.0,-0.027778,0.833333,-0.2,0.666667,-0.2,0.8
1,(牛奶),(芸豆),0.833333,0.833333,0.666667,0.8,0.96,1.0,-0.027778,0.833333,-0.2,0.666667,-0.2,0.8
2,(酸奶),(牛奶),0.5,0.833333,0.5,1.0,1.2,1.0,0.083333,inf,0.333333,0.6,1.0,0.8
3,(牛奶),(酸奶),0.833333,0.5,0.5,0.6,1.2,1.0,0.083333,1.25,1.0,0.6,0.2,0.8
4,(香蕉),(牛奶),0.5,0.833333,0.5,1.0,1.2,1.0,0.083333,inf,0.333333,0.6,1.0,0.8
5,(牛奶),(香蕉),0.833333,0.5,0.5,0.6,1.2,1.0,0.083333,1.25,1.0,0.6,0.2,0.8
6,(鸡蛋),(牛奶),0.833333,0.833333,0.666667,0.8,0.96,1.0,-0.027778,0.833333,-0.2,0.666667,-0.2,0.8
7,(牛奶),(鸡蛋),0.833333,0.833333,0.666667,0.8,0.96,1.0,-0.027778,0.833333,-0.2,0.666667,-0.2,0.8
8,(玉米),(芸豆),0.5,0.833333,0.5,1.0,1.2,1.0,0.083333,inf,0.333333,0.6,1.0,0.8
9,(芸豆),(玉米),0.833333,0.5,0.5,0.6,1.2,1.0,0.083333,1.25,1.0,0.6,0.2,0.8
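
Confidence is directional, which is why 酸奶 → 牛奶 and 牛奶 → 酸奶 get different values even though both rules come from the same itemset. A minimal sketch that recomputes both by counting baskets (the helpers `count` and `confidence` are illustrative, not mlxtend functions):

```python
dataset = [
    ["牛奶", "洋葱", "牛肉", "芸豆", "鸡蛋", "酸奶"],
    ["玉米", "洋葱", "洋葱", "芸豆", "豆腐", "鸡蛋"],
    ["牛奶", "香蕉", "玉米", "芸豆", "酸奶"],
    ["芸豆", "玉米", "香蕉", "牛奶", "鸡蛋"],
    ["香蕉", "牛奶", "鸡蛋", "酸奶"],
    ["牛奶", "苹果", "芸豆", "鸡蛋"],
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= set(t) for t in dataset)


def confidence(antecedent, consequent):
    """Share of antecedent-containing transactions that also contain the consequent."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)


print(confidence({"酸奶"}, {"牛奶"}))  # 1.0: all 3 yogurt baskets contain milk
print(confidence({"牛奶"}, {"酸奶"}))  # 0.6: 3 of the 5 milk baskets contain yogurt
```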


In [8]:
# Try a stricter confidence threshold
association_rules(
    frequent_itemsets, metric="confidence", min_threshold=0.8
)  # confidence threshold of 0.8

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(酸奶),(牛奶),0.5,0.833333,0.5,1.0,1.2,1.0,0.083333,inf,0.333333,0.6,1.0,0.8
1,(香蕉),(牛奶),0.5,0.833333,0.5,1.0,1.2,1.0,0.083333,inf,0.333333,0.6,1.0,0.8
2,(玉米),(芸豆),0.5,0.833333,0.5,1.0,1.2,1.0,0.083333,inf,0.333333,0.6,1.0,0.8
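
Rules such as 芸豆 → 牛奶 are displayed with confidence 0.8 in the 0.6-threshold table, yet they do not appear here. A likely explanation is floating-point rounding. This assumes the library divides two floating-point supports and compares the result with `>=`, which is an assumption about its internals rather than something this notebook states. The arithmetic itself is easy to check:

```python
import math

support_rule = 4 / 6        # support of {芸豆, 牛奶}
support_antecedent = 5 / 6  # support of {芸豆}
confidence = support_rule / support_antecedent

print(confidence >= 0.8)              # False: the quotient is just below 0.8
print(round(confidence, 6))           # 0.8, which is what a rounded table shows
print(math.isclose(confidence, 0.8))  # True
```

If rules at exactly 0.8 are wanted, one option is to pass a threshold slightly below it, for example `min_threshold=0.8 - 1e-9`.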


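mlxtend presents association rules as a DataFrame rather than in the A → B notation. The columns are:
* antecedents: the antecedent of the rule (the "if" itemset)
* consequents: the consequent of the rule (the "then" itemset)
* antecedent support: support of the antecedent
* consequent support: support of the consequent
* support: support of the rule, i.e. the fraction of transactions that contain both the antecedent and the consequent
* confidence: confidence of the rule, i.e. support divided by antecedent support
* lift: the probability of the consequent given the antecedent, divided by the overall probability of the consequent (confidence / consequent support). A value of 1 means the two are independent.
* leverage: the difference between the observed support of the rule and the support expected if the antecedent and consequent were independent (support - antecedent support × consequent support).
* conviction: (1 - consequent support) / (1 - confidence). Like lift, it compares the rule against independence, but in terms of how often the rule fails; it is inf when confidence is 1.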
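
As a sanity check on these definitions, the sketch below recomputes lift, leverage, and conviction from the three support values of a rule. The function `rule_metrics` is a hypothetical helper written for this illustration, using the standard formulas; it is not an mlxtend API.

```python
import math


def rule_metrics(support_a, support_c, support_ac):
    """Confidence, lift, leverage and conviction of a rule A → C from its supports."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is unbounded when the rule never fails (confidence of exactly 1).
    conviction = math.inf if confidence == 1 else (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# 酸奶 → 牛奶: antecedent support 3/6, consequent support 5/6, rule support 3/6
yogurt_to_milk = rule_metrics(3 / 6, 5 / 6, 3 / 6)
# 牛奶 → 酸奶: the same itemset with antecedent and consequent swapped
milk_to_yogurt = rule_metrics(5 / 6, 3 / 6, 3 / 6)

print({k: round(v, 6) for k, v in yogurt_to_milk.items()})
print({k: round(v, 6) for k, v in milk_to_yogurt.items()})
```

Lift and leverage are symmetric in the antecedent and consequent, so both rules share them, while confidence and conviction depend on the direction of the rule.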