# 1. Theoretical Foundations of Association Analysis

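**Association analysis**
* Also called association rule learning
* Finds hidden relationships between items in large datasets
* If a transaction that contains X is also likely to contain Y, this is written as an association rule
   * {X}→{Y}
   * Finding such relationships between items is what association analysis refers to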

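**What does association analysis produce?**
* Frequent itemsets: sets of items that often appear together
* Association rules: how strongly two items are related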

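**Example: mining information from a list of transactions**

* Transaction 0: soy milk (豆奶), lettuce (莴苣)
* Transaction 1: lettuce, diapers (尿布), wine (葡萄酒), beets (甜菜)
* Transaction 2: soy milk, diapers, wine, orange juice (橙汁)
* Transaction 3: lettuce, soy milk, diapers, wine
* Transaction 4: lettuce, soy milk, diapers, orange juice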

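**How is the mining done?**
* With two measures: support and confidence
* **Support**
   * Soy milk appears in 4 of the 5 transactions, so the support of {soy milk} is 4/5
   * {soy milk, diapers} appears in 3 of the 5 transactions, so its support is 3/5
* **Confidence**
   * The ratio of two supports: the support of the whole itemset divided by the support of the rule's left-hand side
      * For example, support({diapers, wine}) = 3/5 and support({diapers}) = 4/5
      * So the confidence of {diapers}→{wine} is (3/5)/(4/5) = 3/4
      * That is, the rule holds for 75% of the transactions that contain diapers
   * What about {wine}→{diapers}?
      * support({diapers, wine}) = 3/5 and support({wine}) = 3/5
      * confidence({wine}→{diapers}) = 1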

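Formal definition: confidence(A→B) = P(B|A) = support(A ∪ B) / support(A), where support(A ∪ B) is the fraction of transactions that contain every item of both A and B.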
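
The two measures can be sketched in a few lines of Python. This is a minimal illustration, assuming each transaction is stored as a set and keeping the original Chinese item names used by the code cells in this document; the helper names `support` and `confidence` are hypothetical and do not come from any library.

```python
# The five example transactions from this section.
# 豆奶 = soy milk, 莴苣 = lettuce, 尿布 = diapers,
# 葡萄酒 = wine, 甜菜 = beets, 橙汁 = orange juice
transactions = [
    {'豆奶', '莴苣'},
    {'莴苣', '尿布', '葡萄酒', '甜菜'},
    {'豆奶', '尿布', '葡萄酒', '橙汁'},
    {'莴苣', '豆奶', '尿布', '葡萄酒'},
    {'莴苣', '豆奶', '尿布', '橙汁'},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A), an estimate of P(B | A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({'豆奶'}, transactions))               # 4 of 5 transactions
print(support({'豆奶', '尿布'}, transactions))        # 3 of 5 transactions
print(confidence({'尿布'}, {'葡萄酒'}, transactions))  # about 3/4
print(confidence({'葡萄酒'}, {'尿布'}, transactions))  # 3/3
```

Confidence is not symmetric: swapping the two sides changes the denominator, which is why the two rules above give different values.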

# 2. The Apriori Algorithm

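**The Apriori algorithm**
   * Step 1, generate frequent itemsets: find all itemsets that meet the minimum support threshold;
   * Step 2, generate association rules: from the frequent itemsets found in step 1, keep the high-confidence rules, i.e. those that meet the minimum confidence threshold.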
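
The second step can be sketched on its own, assuming the frequent itemsets and their supports are already known. The supports below are the hand-computed values from the worked example in Section 1; the function name `generate_rules` and the threshold 0.8 are illustrative choices, not part of any library.

```python
from itertools import combinations

# Supports of three frequent itemsets from the worked example
# (尿布 = diapers, 葡萄酒 = wine).
supports = {
    frozenset({'尿布'}): 4 / 5,
    frozenset({'葡萄酒'}): 3 / 5,
    frozenset({'尿布', '葡萄酒'}): 3 / 5,
}


def generate_rules(supports, min_confidence):
    """Split each frequent itemset into antecedent -> consequent and
    keep the rules whose confidence meets the threshold."""
    rules = []
    for itemset, itemset_support in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty item on each side
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                # Assumes `supports` also holds every subset of each itemset.
                conf = itemset_support / supports[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules


for antecedent, consequent, conf in generate_rules(supports, min_confidence=0.8):
    print(set(antecedent), '->', set(consequent), conf)
```

With a threshold of 0.8 only {wine}→{diapers} survives, because {diapers}→{wine} has confidence 3/4.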


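**Core principle of Apriori**
* If an itemset is frequent, then all of its subsets are also frequent
* Equivalently, as the figure below shows, if an itemset is infrequent, then all of its supersets are also infrequent, so they can be pruned without counting their support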

<img src="apriori_algorithm.png">
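
The pruning principle can be sketched as a level-wise search in plain Python. This is a simplified illustration, not how any particular library implements it; the function name `apriori_frequent_itemsets` and the 0.6 threshold are assumptions made for the example.

```python
from itertools import combinations

# 豆奶 = soy milk, 莴苣 = lettuce, 尿布 = diapers,
# 葡萄酒 = wine, 甜菜 = beets, 橙汁 = orange juice
transactions = [
    {'豆奶', '莴苣'},
    {'莴苣', '尿布', '葡萄酒', '甜菜'},
    {'豆奶', '尿布', '葡萄酒', '橙汁'},
    {'莴苣', '豆奶', '尿布', '葡萄酒'},
    {'莴苣', '豆奶', '尿布', '橙汁'},
]


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def keep_frequent(candidates):
        supports = {c: sum(c <= t for t in transactions) / n for c in candidates}
        return {c: s for c, s in supports.items() if s >= min_support}

    # Level 1: single items.
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([item]) for item in items})
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop a candidate if any (k-1)-subset is infrequent,
        # since a superset of an infrequent itemset cannot be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
    return frequent


frequent = apriori_frequent_itemsets(transactions, min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

At the third level, {soy milk, diapers, wine} is discarded before its support is counted, because its subset {soy milk, wine} is already known to be infrequent.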

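```python
from apyori import apriori

# Item names are kept in the original Chinese:
# 豆奶 = soy milk, 莴苣 = lettuce, 尿布 = diapers,
# 葡萄酒 = wine, 甜菜 = beets, 橙汁 = orange juice
transactions = [
    ['豆奶', '莴苣'],
    ['莴苣', '尿布', '葡萄酒', '甜菜'],
    ['豆奶', '尿布', '葡萄酒', '橙汁'],
    ['莴苣', '豆奶', '尿布', '葡萄酒'],
    ['莴苣', '豆奶', '尿布', '橙汁'],
]
# No thresholds are passed, so apyori's default thresholds apply.
results = list(apriori(transactions))
```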

```python
results
```

Output (truncated):

```
[RelationRecord(items=frozenset({'尿布'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'尿布'}), confidence=0.8, lift=1.0)]),
 RelationRecord(items=frozenset({'橙汁'}), support=0.4, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'橙汁'}), confidence=0.4, lift=1.0)]),
 RelationRecord(items=frozenset({'甜菜'}), support=0.2, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'甜菜'}), confidence=0.2, lift=1.0)]),
 RelationRecord(items=frozenset({'莴苣'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'莴苣'}), confidence=0.8, lift=1.0)]),
 RelationRecord(items=frozenset({'葡萄酒'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'葡萄酒'}), confidence=0.6, lift=1.0)]),
 RelationRecord(items=frozenset({'豆奶'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozense
```

**The results include the following record**

```python
RelationRecord(items=frozenset({'尿布', '葡萄酒'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset({'尿布'}), items_add=frozenset({'葡萄酒'}), confidence=0.7499999999999999, lift=1.2499999999999998), OrderedStatistic(items_base=frozenset({'葡萄酒'}), items_add=frozenset({'尿布'}), confidence=1.0, lift=1.25)]),
```

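* {'尿布'}->{'葡萄酒'} (diapers→wine) has confidence=0.7499999999999999, i.e. 3/4 up to floating-point rounding
* {'葡萄酒'}->{'尿布'} (wine→diapers) has confidence=1.0
* Both values match the hand calculations in Section 1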
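
The record also reports a `lift` value that the text above does not define. Assuming the standard definition, lift(A→B) = confidence(A→B) / support(B), the printed values can be reproduced from the supports computed in Section 1. The variable names below are illustrative.

```python
# Supports taken from the worked example (尿布 = diapers, 葡萄酒 = wine).
support_diapers = 4 / 5
support_wine = 3 / 5
support_both = 3 / 5

# confidence(A -> B) = support(A ∪ B) / support(A)
conf_diapers_to_wine = support_both / support_diapers
conf_wine_to_diapers = support_both / support_wine

# lift(A -> B) = confidence(A -> B) / support(B); a value above 1 means
# B shows up more often alongside A than it does across all transactions.
lift_diapers_to_wine = conf_diapers_to_wine / support_wine
lift_wine_to_diapers = conf_wine_to_diapers / support_diapers

print(lift_diapers_to_wine, lift_wine_to_diapers)  # both about 1.25
```

Under this definition lift is symmetric in exact arithmetic, so both directions of the rule give the same value apart from floating-point rounding.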


# 3. The FP-Growth Algorithm

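* Typically more efficient than Apriori: it compresses the transactions into a prefix tree (the FP-tree) and mines frequent itemsets from that tree, without generating candidate itemsets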
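
To make the efficiency claim more concrete, here is a rough sketch of the data structure FP-Growth relies on, the FP-tree, which stores transactions that share a prefix along a shared path. It covers tree construction only, not the mining step. The names `FPNode`, `build_fp_tree` and `count_nodes` are hypothetical and unrelated to the `pyfpgrowth` API, and breaking ties by item name is an arbitrary choice made for the example.

```python
from collections import Counter


class FPNode:
    """One tree node: an item plus how many transactions pass through it."""

    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First pass: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    # Second pass: insert each transaction as a path, most frequent items
    # first, so that common prefixes are stored only once.
    root = FPNode(None)
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


transactions = [
    ['豆奶', '莴苣'],
    ['莴苣', '尿布', '葡萄酒', '甜菜'],
    ['豆奶', '尿布', '葡萄酒', '橙汁'],
    ['莴苣', '豆奶', '尿布', '葡萄酒'],
    ['莴苣', '豆奶', '尿布', '橙汁'],
]
root = build_fp_tree(transactions, min_count=3)
print(count_nodes(root))
```

Only two passes over the data are needed to build the tree, whereas Apriori rescans the transactions at every itemset size.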

```python
import pyfpgrowth

# 豆奶 = soy milk, 莴苣 = lettuce, 尿布 = diapers,
# 葡萄酒 = wine, 甜菜 = beets, 橙汁 = orange juice
transactions = [
    ['豆奶', '莴苣'],
    ['莴苣', '尿布', '葡萄酒', '甜菜'],
    ['豆奶', '尿布', '葡萄酒', '橙汁'],
    ['莴苣', '豆奶', '尿布', '葡萄酒'],
    ['莴苣', '豆奶', '尿布', '橙汁'],
]

# Use find_frequent_patterns to find patterns in baskets that occur over the
# support threshold (here an absolute count of 3 transactions):
patterns = pyfpgrowth.find_frequent_patterns(transactions, 3)
# Use generate_association_rules to find patterns that are associated with
# another with a certain minimum probability (here a confidence of 0.5):
rules = pyfpgrowth.generate_association_rules(patterns, 0.5)
```

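```python
patterns
```

```
{('尿布', '莴苣'): 3,
 ('尿布', '葡萄酒'): 3,
 ('尿布', '豆奶'): 3,
 ('莴苣',): 4,
 ('葡萄酒',): 3,
 ('豆奶',): 4}
```

Each value is the number of transactions that contain the pattern. This printed output does not list {尿布} (4 transactions) or {莴苣, 豆奶} (3 transactions), although both meet the threshold of 3 in the transaction list above.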

```python
rules
```

```
{('莴苣',): (('尿布',), 0.75), ('葡萄酒',): (('尿布',), 1.0), ('豆奶',): (('尿布',), 0.75)}
```
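
Each rule maps an antecedent to a (consequent, confidence) pair. As a sanity check, the confidences can be recomputed from the pattern counts printed above, since confidence(A→B) = count(A ∪ B) / count(A). The sketch below copies the `patterns` dictionary from that output; the helper `rule_confidence` is hypothetical, and it assumes that pattern keys are tuples sorted by item name, as they appear in the printed output.

```python
# Pattern counts copied from the `patterns` output above.
patterns = {
    ('尿布', '莴苣'): 3,
    ('尿布', '葡萄酒'): 3,
    ('尿布', '豆奶'): 3,
    ('莴苣',): 4,
    ('葡萄酒',): 3,
    ('豆奶',): 4,
}


def rule_confidence(antecedent, consequent, patterns):
    """count(antecedent ∪ consequent) / count(antecedent)."""
    # Assumption: keys are tuples of item names in sorted order.
    union = tuple(sorted(antecedent + consequent))
    return patterns[union] / patterns[antecedent]


print(rule_confidence(('莴苣',), ('尿布',), patterns))    # 3 / 4
print(rule_confidence(('葡萄酒',), ('尿布',), patterns))  # 3 / 3
print(rule_confidence(('豆奶',), ('尿布',), patterns))    # 3 / 4
```

These agree with the three rules printed above: 0.75, 1.0 and 0.75.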