Association analysis is the task of finding interesting relationships, such as frequent itemsets and association rules, in large datasets.

The **support** of an itemset is the fraction of transactions in the dataset that contain the itemset.

The **confidence** is defined for an association rule such as {diapers} ➞ {wine}. The confidence of this rule is support({diapers, wine}) / support({diapers}), that is, the fraction of transactions containing diapers that also contain wine.
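The two definitions above can be checked by hand on a tiny example. The sketch below is illustrative only: the transactions are invented, and the helper names `support` and `confidence` are assumptions, not functions from this notebook.

```python
# Hypothetical toy transactions, invented purely for illustration.
transactions = [
    {"diapers", "wine", "milk"},
    {"diapers", "wine"},
    {"diapers", "juice"},
    {"milk", "juice"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent | consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

sup_diapers = support({"diapers"}, transactions)        # 3 of 4 transactions
sup_both = support({"diapers", "wine"}, transactions)   # 2 of 4 transactions
conf = confidence({"diapers"}, {"wine"}, transactions)  # (2/4) / (3/4) = 2/3
print(sup_diapers, sup_both, conf)
```

Here two of the three transactions that contain diapers also contain wine, so the confidence of {diapers} ➞ {wine} is 2/3.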

> If an itemset is infrequent, then its supersets are also infrequent.
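This principle follows from the fact that a superset can never appear in more transactions than any of its subsets. The brute-force check below is a minimal sketch, not part of the algorithm itself. It reuses the four toy transactions from this section and assumes a minimum support of 0.5; the names `support`, `infrequent`, and `violations` are hypothetical.

```python
from itertools import combinations

# The four toy transactions used in this section, as sets.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
items = sorted(set().union(*transactions))
min_support = 0.5  # assumed threshold for this illustration

def support(itemset):
    """Fraction of transactions that contain the whole itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Enumerate every non-empty itemset over the five items (31 in total).
all_itemsets = [
    frozenset(c)
    for r in range(1, len(items) + 1)
    for c in combinations(items, r)
]

infrequent = [s for s in all_itemsets if support(s) < min_support]

# Look for a frequent itemset that is a proper superset of an infrequent one.
violations = [
    (small, big)
    for small in infrequent
    for big in all_itemsets
    if small < big and support(big) >= min_support
]
print(violations)  # []
```

Because {4} appears in only one of the four transactions, every itemset containing 4 can be discarded without counting it. Apriori uses exactly this pruning to avoid enumerating all itemsets.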

## Finding frequent itemsets with the Apriori algorithm

In [29]:
# create a simple dataset for testing
def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

In [8]:
# C1 is the list of candidate itemsets of size one
def createC1(dataSet):
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])  # store each new item as a one-element list
    C1.sort()
    # frozensets are immutable and hashable, so they can be used as dictionary keys
    return list(map(frozenset, C1))
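
The candidates are converted to `frozenset` because the counting step uses them as dictionary keys, and a plain `set` is unhashable. A minimal sketch of the difference, with a hypothetical `counts` dictionary:

```python
counts = {}

# A frozenset is hashable, so it works as a dictionary key.
key = frozenset([2, 5])
counts[key] = counts.get(key, 0) + 1

# Element order does not matter: this is the same key as above.
same_key = frozenset([5, 2])
counts[same_key] = counts.get(same_key, 0) + 1

# A mutable set cannot be used as a key and raises TypeError.
try:
    counts[{2, 5}] = 1
    error = None
except TypeError as exc:
    error = str(exc)

print(counts)
print(error)
```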

In [23]:
# scan the dataset and keep the candidate itemsets that meet the minimum support
# D: list of transactions (as sets), Ck: candidate itemsets, minSupport: minimum support value
def scanD(D, Ck, minSupport):
    # count how many transactions contain each candidate
    ssCnt = {}
    for tid in D:
        for can in Ck:
            # the candidate counts only if all of its items are in the transaction
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    # retList: the itemsets that meet the minimum support requirement
    # supportData: support value of every candidate found in at least one transaction
    return retList, supportData

In [24]:
dataSet = loadDataSet()
C1 = createC1(dataSet)
C1

[frozenset({1}),
 frozenset({2}),
 frozenset({3}),
 frozenset({4}),
 frozenset({5})]

In [25]:
D = list(map(set, dataSet))
D

[{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

In [27]:
L1,suppData0 = scanD(D,C1,0.5)
L1

[frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})]

In [28]:
suppData0

{frozenset({1}): 0.5,
 frozenset({3}): 0.75,
 frozenset({4}): 0.25,
 frozenset({2}): 0.75,
 frozenset({5}): 0.75}

### Putting together the full Apriori algorithm

In [30]:
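# build candidate itemsets of size k from the frequent itemsets of size k-1
def aprioriGen(Lk, k):
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i+1, lenLk):
            # sort before slicing: set iteration order is not guaranteed,
            # so the first k-2 items must come from the sorted itemsets
            L1 = sorted(Lk[i])[:k-2]
            L2 = sorted(Lk[j])[:k-2]
            # merge two itemsets only when their first k-2 items match
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList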
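
To see what the join step produces, it can be run on the frequent itemsets of the toy dataset at a minimum support of 0.5. The sketch below is self-contained and uses a hypothetical helper, `join_candidates`, that mirrors `aprioriGen` but sorts each itemset before taking its first k-2 items.

```python
def join_candidates(Lk, k):
    """Merge pairs of (k-1)-itemsets whose first k-2 sorted items agree."""
    candidates = []
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            prefix_i = sorted(Lk[i])[:k - 2]
            prefix_j = sorted(Lk[j])[:k - 2]
            if prefix_i == prefix_j:
                candidates.append(Lk[i] | Lk[j])
    return candidates

# Frequent 1-itemsets of the toy dataset at minimum support 0.5.
L1 = [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({5})]
# For k=2 the prefix is empty, so every pair is merged.
C2 = join_candidates(L1, 2)

# Frequent 2-itemsets of the toy dataset at minimum support 0.5.
L2 = [frozenset({1, 3}), frozenset({2, 3}), frozenset({2, 5}), frozenset({3, 5})]
# For k=3 only {2, 3} and {2, 5} share their first item, giving {2, 3, 5}.
C3 = join_candidates(L2, 3)

print(len(C2), C3)
```

Comparing prefixes means {2, 3, 5} is generated once, from {2, 3} and {2, 5}, rather than three times from every pair of its 2-item subsets.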