# Association Rules


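An **association rule** describes a hidden relationship between variables (items, attributes) in a dataset. Association rules are used mainly in **market basket analysis**, recommender systems, and data mining.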

---

### **1. Basic Concepts**
#### **(1) Rule Notation**
An association rule is usually written as:
$$
X \Rightarrow Y
$$
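where:
- \(X\) and \(Y\) are two disjoint itemsets.
- The rule reads: "if itemset \(X\) appears in a transaction, itemset \(Y\) is likely to appear in it as well."

For example, in supermarket purchase data:
- The rule **{milk, bread} → {butter}** says that a customer who buys milk and bread is likely to buy butter too.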

---

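### **2. Metrics for Association Rules**
The following metrics are commonly used to judge how useful a rule is:

#### **(1) Support**
The **probability that X and Y appear together** in a transaction:
$$
Support(X \Rightarrow Y) = P(X \cup Y)
$$
Here \(X \cup Y\) is the union of the two itemsets, so \(P(X \cup Y)\) is the probability that a transaction contains every item of both X and Y (not the probability of "X or Y"). That is:
$$
\text{Support} = \frac{\text{number of transactions containing both } X \text{ and } Y}{\text{total number of transactions}}
$$
Support measures how **widely** a rule applies: the higher the support, the more transactions the rule covers. High support alone does not make a rule reliable or interesting, which is why it is combined with the metrics below.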

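#### **(2) Confidence**
The **probability that a transaction contains Y given that it contains X**:
$$
Confidence(X \Rightarrow Y) = P(Y \mid X) = \frac{Support(X \cup Y)}{Support(X)}
$$
Confidence measures how **reliable** the rule is, that is, how likely Y is to occur when X occurs.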

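#### **(3) Lift**
Measures how strongly X and Y are associated, compared with how often Y occurs overall:
$$
Lift(X \Rightarrow Y) = \frac{Confidence(X \Rightarrow Y)}{Support(Y)}
$$
- **Lift > 1**: X and Y are **positively correlated** (Y is more likely to appear when X is present).
- **Lift = 1**: X and Y are **independent**; there is no association.
- **Lift < 1**: X and Y are **negatively correlated** (Y is less likely to appear when X is present).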
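Substituting the definition of confidence into the lift formula makes the comparison with independence explicit, and shows that lift is symmetric in X and Y:

$$
Lift(X \Rightarrow Y) = \frac{Support(X \cup Y)}{Support(X)\,Support(Y)} = Lift(Y \Rightarrow X)
$$

If X and Y were independent, \(Support(X \cup Y)\) would equal \(Support(X)\,Support(Y)\) and the ratio would be exactly 1. The symmetry is also why a rule and its reverse report the same lift but generally different confidence.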

---

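### **3. Association Rule Mining Algorithms**
Common algorithms for mining association rules include:

#### **(1) The Apriori Algorithm**
- Generates association rules from **frequent itemsets**.
- Uses a **level-wise search**: it starts from small itemsets and extends them one item at a time, keeping only the itemsets that meet the minimum support. Rules are then derived from the frequent itemsets and filtered by minimum confidence.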
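The level-wise search can be sketched in plain Python. This is a minimal illustration, not the implementation used by any library; the function name `apriori_frequent_itemsets` is hypothetical, and the five transactions are the milk/bread/butter baskets from the worked example in section 5.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Hypothetical helper: return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of `itemset`.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: a candidate survives only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep the candidates that meet the minimum support.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "tissues"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "tissues"},
]
result = apriori_frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 3))
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are, so candidates with an infrequent subset are discarded before their support is counted.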

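#### **(2) The FP-Growth (Frequent Pattern Growth) Algorithm**

- Compresses the transactions into an **FP-tree (frequent pattern tree)** and mines frequent itemsets directly from the tree, which avoids candidate generation and repeated scans of the data and so improves efficiency.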
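To show where the compression comes from, here is a small sketch that only builds the tree; the mining step and the header table of node links are left out. The names `FPNode`, `build_fp_tree`, and `count_nodes` are hypothetical, and the transactions are the five baskets from the worked example in section 5.

```python
from collections import Counter


class FPNode:
    """Hypothetical tree node: an item, a count, and child nodes keyed by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First pass: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Second pass: insert each transaction with its frequent items in one global
    # order (most frequent first, ties alphabetical) so common prefixes are shared.
    for t in transactions:
        ordered = sorted(
            (item for item in set(t) if item in frequent),
            key=lambda item: (-frequent[item], item),
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


def count_nodes(node):
    # Number of nodes below `node` (the root itself is not counted).
    return sum(1 + count_nodes(child) for child in node.children.values())


transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "tissues"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "tissues"},
]
root, frequent = build_fp_tree(transactions, min_count=2)
total_items = sum(len(t) for t in transactions)
print("item occurrences:", total_items, "tree nodes:", count_nodes(root))
```

Transactions that start with the same frequent items share a path, so the tree usually has fewer nodes than the data has item occurrences. A full FP-Growth implementation would then mine this tree recursively.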

---

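### **4. Applications of Association Rules**
1. **Market basket analysis** (product recommendations in supermarkets)
   - Finding products that are often bought together, such as the well-known "beer and diapers" example.
2. **Recommender systems** (personalized recommendations)
   - Movie recommendations: if user A watched movie X, the system can recommend movie Y, which is often watched together with X.
3. **Web clickstream analysis**
   - Analyzing users' click behavior to improve site layout.
4. **Medical diagnosis**
   - Finding relationships between symptoms and diseases to support doctors' decisions.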

---

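### **5. A Worked Example**
Suppose we have the following transactions:

| Transaction ID | Items purchased |
|---------|------------|
| 1       | milk, bread, butter |
| 2       | milk, bread |
| 3       | milk, tissues |
| 4       | bread, butter |
| 5       | milk, bread, butter, tissues |

From this data we can mine the rule:
- **{milk, bread} → {butter}**
  - Support = 2/5 = 40% (the fraction of transactions that contain {milk, bread, butter})
  - Confidence = 2/3 ≈ 66.7% (the support of {milk, bread, butter} divided by the support of {milk, bread}, which is 3/5)
  - Lift = 66.7% / 60% ≈ 1.11 (the confidence divided by the support of {butter}, which is 3/5; this indicates a slight positive correlation)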
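The three numbers above can be checked with a few lines of plain Python. The helper names `support`, `confidence`, and `lift` are chosen here for illustration and simply follow the formulas from section 2.

```python
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "tissues"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "tissues"},
]


def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    # Support of the combined itemset divided by the support of the antecedent.
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    # Confidence divided by the support of the consequent.
    return confidence(x, y, transactions) / support(y, transactions)


x, y = {"milk", "bread"}, {"butter"}
print(f"support    = {support(x | y, transactions):.3f}")
print(f"confidence = {confidence(x, y, transactions):.3f}")
print(f"lift       = {lift(x, y, transactions):.3f}")
```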


---

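### **Summary**
Association rules uncover relationships between items in data. The main metrics are **support, confidence, and lift**. Common algorithms are **Apriori** and **FP-Growth**, and they are widely used in **market analysis, recommender systems, and data mining**.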

In [200]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

In [201]:
data = {
    'ID':[1,2,3,4,5,6],
    'onion':[1,0,0,1,1,1],
    'potato':[1,1,0,1,1,1],
    'burger':[1,1,0,0,1,1],
    'milk':[0,1,1,1,0,1],
    'beer':[0,0,1,0,1,0]
}

In [202]:
df = pd.DataFrame(data)
df = df.drop(columns=['ID'])

In [203]:
df

Unnamed: 0,onion,potato,burger,milk,beer
0,1,1,1,0,0
1,0,1,1,1,0
2,0,0,0,1,1
3,1,1,0,1,0
4,1,1,1,0,1
5,1,1,1,1,0


In [204]:
# Select the frequent itemsets by setting a minimum support threshold
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)



In [205]:
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.666667,(onion)
1,0.833333,(potato)
2,0.666667,(burger)
3,0.666667,(milk)
4,0.666667,"(potato, onion)"
5,0.5,"(burger, onion)"
6,0.666667,"(potato, burger)"
7,0.5,"(potato, milk)"
8,0.5,"(potato, burger, onion)"


In [206]:
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)

In [207]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(potato),(onion),0.833333,0.666667,0.666667,0.8,1.2,1.0,0.111111,1.666667,1.0,0.8,0.4,0.9
1,(onion),(potato),0.666667,0.833333,0.666667,1.0,1.2,1.0,0.111111,inf,0.5,0.8,1.0,0.9
2,(burger),(onion),0.666667,0.666667,0.5,0.75,1.125,1.0,0.055556,1.333333,0.333333,0.6,0.25,0.75
3,(onion),(burger),0.666667,0.666667,0.5,0.75,1.125,1.0,0.055556,1.333333,0.333333,0.6,0.25,0.75
4,(potato),(burger),0.833333,0.666667,0.666667,0.8,1.2,1.0,0.111111,1.666667,1.0,0.8,0.4,0.9
5,(burger),(potato),0.666667,0.833333,0.666667,1.0,1.2,1.0,0.111111,inf,0.5,0.8,1.0,0.9
6,"(potato, burger)",(onion),0.666667,0.666667,0.5,0.75,1.125,1.0,0.055556,1.333333,0.333333,0.6,0.25,0.75
7,"(potato, onion)",(burger),0.666667,0.666667,0.5,0.75,1.125,1.0,0.055556,1.333333,0.333333,0.6,0.25,0.75
8,"(burger, onion)",(potato),0.5,0.833333,0.5,1.0,1.2,1.0,0.083333,inf,0.333333,0.6,1.0,0.8
9,(potato),"(burger, onion)",0.833333,0.5,0.5,0.6,1.2,1.0,0.083333,1.25,1.0,0.6,0.2,0.8


In [208]:
rules[(rules['lift'] > 1.125) & (rules['confidence'] > 0.8)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
1,(onion),(potato),0.666667,0.833333,0.666667,1.0,1.2,1.0,0.111111,inf,0.5,0.8,1.0,0.9
5,(burger),(potato),0.666667,0.833333,0.666667,1.0,1.2,1.0,0.111111,inf,0.5,0.8,1.0,0.9
8,"(burger, onion)",(potato),0.5,0.833333,0.5,1.0,1.2,1.0,0.083333,inf,0.333333,0.6,1.0,0.8


# Converting an ordinary dataset to one-hot encoding

In [209]:
# Create the retail shopping basket data
retail_shopping_basket = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Basket': [
        ['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
        ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
        ['Soda', 'Chips', 'Milk'],
        ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
        ['Soda', 'Coffee', 'Milk', 'Bread'],
        ['Beer', 'Chips']
    ]
}

In [210]:
# Convert to a DataFrame
retail = pd.DataFrame(retail_shopping_basket)
# Select the columns we need
retail = retail[['ID', 'Basket']]
# Raise the maximum column width so the baskets display in full
pd.options.display.max_colwidth = 100

In [211]:
# Display the DataFrame
retail

Unnamed: 0,ID,Basket
0,1,"[Beer, Diaper, Pretzels, Chips, Aspirin]"
1,2,"[Diaper, Beer, Chips, Lotion, Juice, BabyFood, Milk]"
2,3,"[Soda, Chips, Milk]"
3,4,"[Soup, Beer, Diaper, Milk, IceCream]"
4,5,"[Soda, Coffee, Milk, Bread]"
5,6,"[Beer, Chips]"


In [212]:
retail_id = retail.drop('Basket', axis=1)
retail_id

Unnamed: 0,ID
0,1
1,2
2,3
3,4
4,5
5,6


In [213]:
from pandas import DataFrame
retail_basket = retail.Basket.str.join(',')
DataFrame(retail_basket)

Unnamed: 0,Basket
0,"Beer,Diaper,Pretzels,Chips,Aspirin"
1,"Diaper,Beer,Chips,Lotion,Juice,BabyFood,Milk"
2,"Soda,Chips,Milk"
3,"Soup,Beer,Diaper,Milk,IceCream"
4,"Soda,Coffee,Milk,Bread"
5,"Beer,Chips"


In [214]:
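# Series.str.get_dummies(',') one-hot encodes a column of strings.
# It splits each string on the given separator `sep` and returns a DataFrame of 0/1 indicator columns.
# The input must be a Series of strings (here, the comma-joined baskets), not a column of lists or a DataFrame.
retail_basket = retail_basket.str.get_dummies(',')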
retail_basket

Unnamed: 0,Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
0,1,0,1,0,1,0,1,0,0,0,0,1,0,0
1,0,1,1,0,1,0,1,0,1,1,1,0,0,0
2,0,0,0,0,1,0,0,0,0,0,1,0,1,0
3,0,0,1,0,0,0,1,1,0,0,1,0,0,1
4,0,0,0,1,0,1,0,0,0,0,1,0,1,0
5,0,0,1,0,1,0,0,0,0,0,0,0,0,0
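To make the encoding step concrete, the following sketch builds the same 0/1 table by hand from the list-valued baskets, without pandas. The function name `one_hot` is hypothetical; the baskets are the six defined above.

```python
baskets = [
    ['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
    ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
    ['Soda', 'Chips', 'Milk'],
    ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
    ['Soda', 'Coffee', 'Milk', 'Bread'],
    ['Beer', 'Chips'],
]


def one_hot(baskets):
    """Hypothetical helper: return (column names, rows of 0/1 indicators)."""
    # One column per distinct item, in sorted order.
    columns = sorted({item for basket in baskets for item in basket})
    # One row per basket: 1 if the basket contains the item, else 0.
    rows = [[int(col in basket) for col in columns] for basket in baskets]
    return columns, rows


columns, rows = one_hot(baskets)
print(columns)
for row in rows:
    print(row)
```

Each row has one indicator per distinct item, which is the shape `apriori` expects once the `ID` column is removed.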


In [215]:
retail = retail_id.join(retail_basket)
retail

Unnamed: 0,ID,Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
0,1,1,0,1,0,1,0,1,0,0,0,0,1,0,0
1,2,0,1,1,0,1,0,1,0,1,1,1,0,0,0
2,3,0,0,0,0,1,0,0,0,0,0,1,0,1,0
3,4,0,0,1,0,0,0,1,1,0,0,1,0,0,1
4,5,0,0,0,1,0,1,0,0,0,0,1,0,1,0
5,6,0,0,1,0,1,0,0,0,0,0,0,0,0,0


In [216]:
# apriori expects a DataFrame that contains only booleans (True/False) or binary values (0/1).
# The "ID" column is not 0/1 data, so drop it before mining.
retail = retail.drop(columns=['ID'])
frequent_itemsets = apriori(retail, min_support=0.5, use_colnames=True)
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.666667,(Beer)
1,0.666667,(Chips)
2,0.5,(Diaper)
3,0.666667,(Milk)
4,0.5,"(Chips, Beer)"
5,0.5,"(Diaper, Beer)"


In [217]:
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)

In [218]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(Chips),(Beer),0.666667,0.666667,0.5,0.75,1.125,1.0,0.055556,1.333333,0.333333,0.6,0.25,0.75
1,(Beer),(Chips),0.666667,0.666667,0.5,0.75,1.125,1.0,0.055556,1.333333,0.333333,0.6,0.25,0.75
2,(Diaper),(Beer),0.5,0.666667,0.5,1.0,1.5,1.0,0.166667,inf,0.666667,0.75,1.0,0.875
3,(Beer),(Diaper),0.666667,0.5,0.5,0.75,1.5,1.0,0.166667,2.0,1.0,0.75,0.5,0.875


# Associations between movie genres

In [219]:
movies = pd.read_csv('movies.csv')
movies.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [220]:
movies_ohe = movies.drop('genres', axis=1).join(movies.genres.str.get_dummies('|'))
pd.options.display.max_columns = 100

In [221]:
movies_ohe.head(5)

Unnamed: 0,movieId,title,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [222]:
movies_ohe.shape

(9742, 22)

In [223]:
movies_ohe = movies_ohe.drop(columns=['movieId','title'])
frequent_itemsets_movies = apriori(movies_ohe, min_support=0.025, use_colnames=True)
frequent_itemsets_movies



Unnamed: 0,support,itemsets
0,0.187641,(Action)
1,0.129645,(Adventure)
2,0.062718,(Animation)
3,0.068158,(Children)
4,0.385547,(Comedy)
5,0.123075,(Crime)
6,0.045165,(Documentary)
7,0.447649,(Drama)
8,0.079963,(Fantasy)
9,0.10039,(Horror)


In [224]:
rules_movies = association_rules(frequent_itemsets_movies, metric='lift', min_threshold=1.25)
rules_movies

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(Adventure),(Action),0.129645,0.187641,0.062615,0.482977,2.57394,1.0,0.038289,1.571224,0.702576,0.245869,0.363553,0.408338
1,(Action),(Adventure),0.187641,0.129645,0.062615,0.333698,2.57394,1.0,0.038289,1.306247,0.752735,0.245869,0.234448,0.408338
2,(Crime),(Action),0.123075,0.187641,0.042907,0.348624,1.857929,1.0,0.019813,1.247142,0.526575,0.160215,0.198167,0.288645
3,(Action),(Crime),0.187641,0.123075,0.042907,0.228665,1.857929,1.0,0.019813,1.136892,0.568426,0.160215,0.120409,0.288645
4,(Action),(Sci-Fi),0.187641,0.100595,0.046294,0.246718,2.452576,1.0,0.027419,1.193981,0.729069,0.191345,0.162466,0.353461
5,(Sci-Fi),(Action),0.100595,0.187641,0.046294,0.460204,2.452576,1.0,0.027419,1.504937,0.658508,0.191345,0.33552,0.353461
6,(Thriller),(Action),0.194416,0.187641,0.067235,0.345829,1.843034,1.0,0.030754,1.241814,0.567807,0.213564,0.194726,0.352072
7,(Action),(Thriller),0.187641,0.194416,0.067235,0.358315,1.843034,1.0,0.030754,1.25542,0.563072,0.213564,0.203454,0.352072
8,(Adventure),(Animation),0.129645,0.062718,0.025354,0.195566,3.118175,1.0,0.017223,1.165145,0.780486,0.151813,0.141737,0.299911
9,(Animation),(Adventure),0.062718,0.129645,0.025354,0.404255,3.118175,1.0,0.017223,1.460953,0.724755,0.151813,0.315515,0.299911


In [227]:
rules_movies[rules_movies.lift > 4].sort_values(by=['lift'],ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
16,(Children),(Animation),0.068158,0.062718,0.031,0.454819,7.251799,1.0,0.026725,1.719213,0.925161,0.31038,0.418339,0.474545
17,(Animation),(Children),0.062718,0.068158,0.031,0.494272,7.251799,1.0,0.026725,1.842573,0.919791,0.31038,0.457281,0.474545


In [228]:
movies[movies.genres.str.contains('Children') & (~movies.genres.str.contains('Animation'))]

Unnamed: 0,movieId,title,genres
1,2,Jumanji (1995),Adventure|Children|Fantasy
7,8,Tom and Huck (1995),Adventure|Children
26,27,Now and Then (1995),Children|Drama
32,34,Babe (1995),Children|Drama
34,38,It Takes Two (1995),Children|Comedy
...,...,...,...
9636,179401,Jumanji: Welcome to the Jungle (2017),Action|Adventure|Children
9670,182731,Pixel Perfect (2004),Children|Comedy|Sci-Fi
9679,183301,The Tale of the Bunny Picnic (1986),Children
9697,184987,A Wrinkle in Time (2018),Adventure|Children|Fantasy|Sci-Fi
