## Introduction
Frequent itemset mining is a data analysis method that applies statistical methods to a set of data to find interesting, previously unknown patterns in it: which items occur frequently, which items are associated, and which items correlate with others. The topic is usually introduced with a market basket dataset, which records the items each customer bought and looks like this:
<img src="http://gdurl.com/kIhH" width = "300" height = "200" alt="" align=center/>

One of the most common ways to represent these patterns is via association rules, a single example of which is given below:

    milk => bread [support = 25%, confidence = 60%, lift = 1.3 ]

`Support` is the fraction of all transactions in which milk and bread are purchased together.   
`Confidence` is the fraction of the transactions containing milk that also contain bread.   
`Lift` is the confidence divided by the overall support of bread; it indicates whether the two items occur together in the same orders simply by chance.
* lift = 1 implies no relationship between A and B.
* lift > 1 implies a positive relationship between A and B (they occur together more often than random).
* lift < 1 implies a negative relationship between A and B (they occur together less often than random).
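
To make these three measures concrete, here is a minimal sketch that computes them directly from a list of transactions. The function name `rule_metrics` and the toy baskets are made up for illustration and do not come from any library. The sketch assumes the antecedent and the consequent each occur in at least one transaction, otherwise the divisions would fail.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Count the transactions that contain each side, and both sides together.
    count_a = sum(1 for b in baskets if antecedent <= b)
    count_c = sum(1 for b in baskets if consequent <= b)
    count_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = count_both / n
    confidence = count_both / count_a
    lift = confidence / (count_c / n)
    return support, confidence, lift


# Hypothetical toy data: four shopping baskets.
toy_baskets = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["milk", "eggs"],
    ["bread"],
]
print(rule_metrics(toy_baskets, ["milk"], ["bread"]))
```

On this toy data the rule milk => bread has a lift slightly below 1, so buying milk makes bread marginally less likely than its overall frequency would suggest.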


The hidden patterns in historical data can help us make decisions. For example, if we find pairs of goods that are frequently bought together, we can optimize the placement of goods, or we can change where products are stored in the warehouse to save costs and improve economic efficiency.

## Apriori Algorithm
Apriori is the best-known example of a frequent pattern mining algorithm.  
Assume that we are running a grocery store that has only 4 kinds of goods: good 0, good 1, good 2, good 3. We can list all possible combinations of goods that may be purchased together:
<img src="http://gdurl.com/B_oY" width = "300" height = "200" alt="" align=center/>
There are 15 possible non-empty purchase sets. Lines between sets of items indicate that two or more sets can be combined to form a larger set.  
The Apriori algorithm uses the downward closure property:      
* If a set of items is frequent, all of its subsets are also frequent.    
* If a set of items is infrequent, all of its supersets are also infrequent.   

That is, if {0,1} is frequent, {0} and {1} must also be frequent. If {2} is not frequent, then {2,0} cannot be frequent.
Using this property, we can reduce the number of sets whose support we need to compute.
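
As a rough illustration of how this property saves work, the sketch below discards every candidate pair that contains an infrequent single item before any support counting takes place. The helper `has_infrequent_subset` and the assumed outcome of the first pass (good 2 is infrequent) are hypothetical and only serve the example.

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_smaller):
    """True if any subset with one item fewer is missing from frequent_smaller."""
    candidate = frozenset(candidate)
    return any(
        frozenset(sub) not in frequent_smaller
        for sub in combinations(candidate, len(candidate) - 1)
    )


# Assumed result of the first pass: goods 0, 1 and 3 are frequent, good 2 is not.
frequent_singles = {frozenset([0]), frozenset([1]), frozenset([3])}

all_pairs = [frozenset(p) for p in combinations(range(4), 2)]
surviving_pairs = [
    p for p in all_pairs if not has_infrequent_subset(p, frequent_singles)
]
print(len(all_pairs), "pairs before pruning,", len(surviving_pairs), "after")
```

Every pair that contains good 2 is dropped without scanning the transactions, which is where the savings come from on larger item sets.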

###  Apyori API
The Python Package Index includes an efficient implementation of the Apriori algorithm called apyori. We can use it directly by downloading the apyori.py file from https://pypi.python.org/pypi/apyori/1.1.1 and running it on a dataset.
The dataset we use was donated by Tom Brijs and contains the (anonymized) retail market basket data from an anonymous Belgian retail store.[2] Each row in the retail.csv file represents one customer transaction. With a frequent item set support threshold of 0.3 and a confidence threshold of 0.6, we find only one rule, which relates 2 items. We will use this as the ground truth to evaluate the correctness of our own implementation.

In [23]:
import csv
from apyori import apriori

def readData( path ):
    D = []
    with open(path, 'r') as f:
        dataset = csv.reader(f)
        for row in dataset:
            D.append(row)
    return D

D = readData('./retail.csv')
results = list(apriori(D, min_support=0.3, min_confidence=0.6))
print (results)

[RelationRecord(items=frozenset({'39', '48'}), support=0.33055057734624893, ordered_statistics=[OrderedStatistic(items_base=frozenset({'48'}), items_add=frozenset({'39'}), confidence=0.6916340334638661, lift=1.2032726128908016)])]


### Create frequent set with single item
Now we are going to implement the Apriori algorithm from scratch.   
According to the Wikipedia article, Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. So we start from the bottom, where each candidate set has only one item.

In [16]:

def createC1(dataSet):
    """ 
    Create the candidate sets C1, each containing a single item.
    Args:
        dataSet: dataset containing all transactions
    Returns: 
        C1: a sorted list of single-item candidate sets
    """
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [ item ] not in C1:
                C1.append( [ item ] )
    C1.sort()
    return C1


def scanD( D, Ck, minSupport ):
    '''
    This function scans all transactions, computes the support that dataset D gives to each candidate set in Ck,
    and filters the candidate sets with a minimum support threshold
    Args:
        D: dataset containing all transactions
        Ck: candidate sets
        minSupport: frequent item set support threshold
    Returns:
        retList: list of frequent sets
        supportData: dictionary of set: support value for each candidate set that occurs at least once
    '''      
    ssCnt = {}
    for tid in D:
        # for each transaction
        # freeze each candidate set so that it can be used as a key in the dictionary
        C = map(frozenset, Ck )
        for can in C:
            # check if candidate set is subset of transaction
            if can.issubset( tid ):
                ssCnt[ can ] = ssCnt.get( can, 0) + 1
                
    numItems = len( D )
    retList = []
    supportData = {}
    for key in ssCnt:
        # support of each candidate set = number of transactions containing it / number of all transactions
        support = ssCnt[ key ] / numItems
        supportData[ key ] = support
        # add to frequent set list
        if support >= minSupport:
            retList.insert( 0, key )
    return retList, supportData





In [17]:
# create a small dataset for testing
myData = [ [ 1, 3, 4 ], [ 2, 3, 5 ], [ 1, 2, 3, 5 ], [ 2, 5 ] ]
C1 = createC1(myData)

L, supportData = scanD( myData, C1, 0.5 )
   
print ("all frequent set: ", L)
print ("all support value: ", supportData)

all frequent set:  [frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})]
all support value:  {frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75}


### Create frequent item sets with k items
The pseudo code for the algorithm is given below for a transaction database T, and a support threshold of epsilon. Usual set theoretic notation is employed, though note that T is a multiset. Ck is the candidate set for level k. At each step, the algorithm is assumed to generate the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma.
<img src="http://gdurl.com/5JtK" width = "500" height = "500" alt="" align=center/>
(In many cases we only care about k = 2.)

In [26]:

def aprioriGen( Lk, k ):
    '''
    This function generates new candidate sets from existing frequent item sets
    Args:
        Lk: list of frequent item sets, each with k - 1 items
        k: number of items in the new candidate sets
    Returns:
        retList: a list of all candidate sets, each set has k items
    '''
    retList = []
    lenLk = len( Lk )
    for i in range( lenLk ):
        for j in range( i + 1, lenLk ):
            # sort before slicing so that the first k - 2 items are compared in a consistent order
            L1 = sorted( Lk[ i ] )[ : k - 2 ]
            L2 = sorted( Lk[ j ] )[ : k - 2 ]
            if L1 == L2:
                retList.append( Lk[ i ] | Lk[ j ] ) 
    return retList

def apriori( dataSet, minSupport = 0.5 ):
    '''
    This function computes the support value of each candidate item set
    and returns the frequent item sets
    Args:
        dataSet: dataset containing all transactions
        minSupport: frequent item set support threshold
    Returns:
        L: a list of lists of frequent item sets; L[ k - 1 ] holds the sets with k items
        suppData: dictionary of set: support value for each candidate set
    '''
    C1 = createC1( dataSet )
    L1, suppData = scanD( dataSet, C1, minSupport )
    L = [ L1 ]
    k = 2
    
    while ( len( L[ k - 2 ] ) > 0 ):
        Ck = aprioriGen( L[ k - 2 ], k )
        Lk, supK = scanD( dataSet, Ck, minSupport )
        
        suppData.update( supK )
        L.append( Lk )
        k += 1
    
    L = L[:-1]
    return L, suppData

In [27]:
L, suppData = apriori( myData, 0.5 )
print("all frequent set: ", L)
print("all support value: ",suppData)


all frequent set:  [[frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})], [frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}), frozenset({1, 3})], [frozenset({2, 3, 5})]]
all support value:  {frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25, frozenset({2}): 0.75, frozenset({5}): 0.75, frozenset({1, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5, frozenset({2, 3}): 0.5, frozenset({1, 5}): 0.25, frozenset({1, 2}): 0.25, frozenset({2, 3, 5}): 0.5}
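
One way to sanity-check this result is to compare it with a brute-force enumeration of every possible item set, which is feasible only for tiny datasets. The sketch below is such a check. `brute_force_frequent` is a hypothetical helper written for this comparison and is not part of the implementation above.

```python
from itertools import combinations


def brute_force_frequent(transactions, min_support):
    """Enumerate every non-empty item set and keep those meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*baskets))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            support = sum(1 for b in baskets if itemset <= b) / len(baskets)
            if support >= min_support:
                frequent[itemset] = support
    return frequent


# Same toy transactions as in the test cell above.
toy_data = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
frequent = brute_force_frequent(toy_data, 0.5)
print(len(frequent), "frequent item sets")
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

The brute-force version tests all 2^n - 1 item sets, so it is useful as a cross-check on the level-wise search but not as a replacement for it.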


### Mining Association Rules from Frequent item set
To find an association rule, we start with a frequent item set. The idea is that one item or set of items may imply another. From the example in figure 1, if {milk, bread} is a frequent item set, then there may be a related rule "milk --> bread", which means that someone who buys milk is statistically likely to buy bread as well; the reverse does not necessarily hold.   
\begin{equation}
Confidence\left( A\rightarrow B\right) =\frac {Support\left( A\cup B\right) }{Support\left( A\right) }
\end{equation}
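
As a small worked example of this formula, the sketch below computes confidence from a table of support values. The values are copied from the toy dataset output earlier in this notebook, and the helper name `confidence` is only illustrative.

```python
# Support values copied from the toy dataset output earlier in this notebook.
support = {
    frozenset({2, 3}): 0.5,
    frozenset({5}): 0.75,
    frozenset({2, 3, 5}): 0.5,
}


def confidence(antecedent, consequent, support):
    """Confidence(A -> B) = Support(A u B) / Support(A)."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    return support[antecedent | consequent] / support[antecedent]


print(confidence({2, 3}, {5}, support))  # 0.5 / 0.5
print(confidence({5}, {2, 3}, support))  # 0.5 / 0.75
```

The two directions give different values, which is why a rule and its reverse have to be tested separately.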
There is another property that reduces the number of rules we need to test:

* If a rule does not meet the minimum confidence requirement, then every rule built from the same frequent item set whose antecedent is a subset of that rule's antecedent will not meet it either.
 
We can explain this property with the figure below:

    The darker nodes in the graph represent rules with low confidence.
    If 012->3 has low confidence, then all rules whose consequent contains 3 (such as 01->23) also have low confidence.
<img src="http://gdurl.com/VdnW" width = "500" height = "500" alt="" align=center/>
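
This pruning is safe because shrinking the antecedent can only keep or increase its support, so the confidence can only stay the same or drop. The sketch below checks this on the frequent set {2, 3, 5} from the toy dataset. `rule_confidence` and `violations` are illustrative names, and the support values are copied from the output earlier in this notebook.

```python
from itertools import combinations

# Support values for the subsets of {2, 3, 5}, copied from the toy dataset output.
support = {
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({5}): 0.75,
    frozenset({2, 3}): 0.5, frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5,
    frozenset({2, 3, 5}): 0.5,
}
freq_set = frozenset({2, 3, 5})


def rule_confidence(antecedent):
    """Confidence of the rule antecedent -> (freq_set - antecedent)."""
    return support[freq_set] / support[antecedent]


violations = []
for size in range(2, len(freq_set)):
    for big in map(frozenset, combinations(sorted(freq_set), size)):
        for small in map(frozenset, combinations(sorted(big), size - 1)):
            # Moving an item from the antecedent to the consequent
            # must not raise the confidence.
            if rule_confidence(small) > rule_confidence(big):
                violations.append((small, big))

print("violations:", violations)
```

An empty list means that no rule with a smaller antecedent beats the rule it was derived from, so once a rule fails the threshold its descendants in the figure can be skipped.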

In [28]:

def calcConf( freqSet, H, supportData, brl, minConf=0.7 ):
    '''
    This function computes the confidence of each rule and keeps the rules whose
    confidence is at least the threshold
    Args:
        freqSet: frozenset of a frequent item set, e.g. {1,2,3}
        H: list of frozensets, the candidate consequents, e.g. [{1}, {2}, {3}]
        supportData: dictionary of set: support value for each frequent item set
        brl: list of association rules; passing rules are appended as tuples
        minConf: minimum confidence threshold
    Returns:
        prunedH: list of item sets taken from the rules that passed
    '''
    prunedH = []
    conf1 = float('inf')
    conf2 = float('inf')
    if len(freqSet) == 2:
        for conseq in H:
            lift = supportData[ freqSet ] / (supportData[ freqSet - conseq ] * supportData[conseq ])
            conf1 = supportData[ freqSet ] / supportData[ freqSet - conseq ]
            if conf1 >= minConf:
                print (freqSet - conseq, '-->', conseq, 'conf:', conf1, 'lift:', lift)
                brl.append( ( freqSet - conseq, conseq, conf1, lift) )
                prunedH.append( conseq )
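                # note: this return is inside the loop, so for a 2-item set
                # only the first rule that passes the threshold is recorded
                return prunedH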
    for conseq in H:
        lift = supportData[ freqSet ] / (supportData[ freqSet - conseq ] * supportData[conseq ])
        conf1 = supportData[ freqSet ] / supportData[ freqSet - conseq ]
        conf2 = supportData[ freqSet ] / supportData[ conseq ]
        if conf1 >= minConf:
            print (freqSet - conseq, '-->', conseq, 'conf:', conf1, 'lift:', lift)
            brl.append( ( freqSet - conseq, conseq, conf1, lift ) )
            if len(conseq) > 1:
                prunedH.append( conseq )
        if conf2 >= minConf:
            print (conseq, '-->', freqSet - conseq, 'conf:', conf2, 'lift:', lift)
            brl.append( ( conseq, freqSet - conseq, conf2, lift ) )
            if len(freqSet - conseq) > 1:
                prunedH.append( freqSet - conseq )
    return prunedH

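def rulesFromConseq( freqSet, H, supportData, brl, minConf=0.7 ):
    '''
    This function takes a frequent item set and recursively computes the confidence of each
    potential rule, growing the candidate consequents by one item per call for as long as
    the larger consequent still leaves at least one item for the antecedent.
    Args:
        freqSet: frozenset of a frequent item set, e.g. {1,2,3}
        H: list of frozensets, the candidate consequents, e.g. [{1}, {2}, {3}]
        supportData: dictionary of set: support value for each frequent item set
        brl: list of association rules; passing rules are appended as tuples
        minConf: minimum confidence threshold
    '''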
    m = len( H[ 0 ] )
    print(H)
    print(freqSet)
    print(m)
    if len( freqSet ) > m + 1:
        Hmp1 = aprioriGen( H, m + 1 )
        print(Hmp1)
        Hmp1 = calcConf( freqSet, Hmp1, supportData, brl, minConf )
        if len( Hmp1 ) > 1:
            print("recursion", Hmp1)
            rulesFromConseq( freqSet, Hmp1, supportData, brl, minConf )

def generateRules( L, supportData, minConf=0.7 ):
    '''
    This function generates association rules from the frequent item sets and their support values
    Args:
        L: list of lists of frequent item sets
        supportData: dictionary of all item set: support value
        minConf: minimum confidence threshold
    Returns: 
        bigRuleList: list of association rules
    '''
    bigRuleList = []
    for i in range( 1, len( L ) ):
        for freqSet in L[ i ]:

            H1 = [ frozenset( [ item ] ) for item in freqSet ]
            if i > 1:
                rulesFromConseq( freqSet, H1, supportData, bigRuleList, minConf )
            else:
                calcConf( freqSet, H1, supportData, bigRuleList, minConf )
    return bigRuleList

In [29]:
L, suppData = apriori( myData, 0.5 )
rules = generateRules( L, suppData, minConf=0.65 )
print ('rules:\n', rules)

frozenset({3}) --> frozenset({2}) conf: 0.6666666666666666 lift: 0.8888888888888888
frozenset({5}) --> frozenset({3}) conf: 0.6666666666666666 lift: 0.8888888888888888
frozenset({5}) --> frozenset({2}) conf: 1.0 lift: 1.3333333333333333
frozenset({3}) --> frozenset({1}) conf: 0.6666666666666666 lift: 1.3333333333333333
[frozenset({2}), frozenset({3}), frozenset({5})]
frozenset({2, 3, 5})
1
[frozenset({2, 3}), frozenset({2, 5}), frozenset({3, 5})]
frozenset({5}) --> frozenset({2, 3}) conf: 0.6666666666666666 lift: 1.3333333333333333
frozenset({2, 3}) --> frozenset({5}) conf: 1.0 lift: 1.3333333333333333
frozenset({3}) --> frozenset({2, 5}) conf: 0.6666666666666666 lift: 0.8888888888888888
frozenset({2, 5}) --> frozenset({3}) conf: 0.6666666666666666 lift: 0.8888888888888888
frozenset({2}) --> frozenset({3, 5}) conf: 0.6666666666666666 lift: 1.3333333333333333
frozenset({3, 5}) --> frozenset({2}) conf: 1.0 lift: 1.3333333333333333
recursion [frozenset({2, 3}), frozenset({2, 5}), frozense

### Test on a large dataset
Now it is time to test our implementation on a large dataset. We get the same result as in the first part.


In [30]:
D = readData('./retail.csv')
L, suppData = apriori( D, 0.3 )
print ("frequent item set finish")
print (L)
rules = generateRules( L, suppData, minConf=0.6 )
print ('rules:\n', rules)

frequent item set finish
[[frozenset({'48'}), frozenset({'39'})], [frozenset({'39', '48'})]]
frozenset({'48'}) --> frozenset({'39'}) conf: 0.6916340334638661 lift: 1.2032726128908016
rules:
 [(frozenset({'48'}), frozenset({'39'}), 0.6916340334638661, 1.2032726128908016)]


### Discussion
The problem with this simple implementation is that it cannot handle large datasets. When I tried to apply it to another dataset of over 400MB, I ran out of memory and the runtime was not acceptable. Paper [4] describes a fast implementation of Apriori that uses a trie data structure. 
<img src="http://gdurl.com/nG2l" width = "300" height = "300" alt="" align=center/>
This figure shows a trie that stores the candidates {A,C,D}, {A,E,G}, {A,E,L}, {A,E,M}, {K,M,N}. Every node stores the length of the longest directed path that starts from it, and a hash table stores the edges of each node to accelerate searching.
Significant advantages of this approach:
* Candidate generation becomes easy and fast: candidates can be generated from pairs of nodes that have the same parent.
* Association rules are produced much faster, since retrieving the support of an itemset is quicker.
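
To give a feel for the idea, here is a simplified sketch of a candidate trie with support counting. It is not the implementation from [4]: it leaves out the longest-path bookkeeping, uses a plain `dict` as the per-node hash table of edges, and all names (`TrieNode`, `insert_candidate`, `count_transaction`, `lookup_count`) as well as the sample transactions are made up for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}         # item -> TrieNode (hash table of edges)
        self.is_candidate = False  # True if a stored candidate ends here
        self.count = 0             # number of transactions containing it


def insert_candidate(root, candidate):
    """Store a candidate item set as a path of its sorted items."""
    node = root
    for item in sorted(candidate):
        node = node.children.setdefault(item, TrieNode())
    node.is_candidate = True


def count_transaction(node, transaction, start=0):
    """Increment every stored candidate contained in a sorted, duplicate-free transaction."""
    if node.is_candidate:
        node.count += 1
    for i in range(start, len(transaction)):
        child = node.children.get(transaction[i])
        if child is not None:
            count_transaction(child, transaction, i + 1)


def lookup_count(root, candidate):
    """Follow the candidate's path and return its counter (0 if not stored)."""
    node = root
    for item in sorted(candidate):
        node = node.children.get(item)
        if node is None:
            return 0
    return node.count if node.is_candidate else 0


root = TrieNode()
for cand in [("A", "C", "D"), ("A", "E", "G"), ("A", "E", "L"),
             ("A", "E", "M"), ("K", "M", "N")]:
    insert_candidate(root, cand)

# Hypothetical transactions, used only to exercise the counting.
transactions = [["A", "C", "D", "E", "G"], ["A", "E", "M"],
                ["K", "M", "N", "A"], ["C", "D"]]
for t in transactions:
    count_transaction(root, sorted(set(t)))

print(lookup_count(root, ("A", "E", "G")), lookup_count(root, ("A", "E", "L")))
```

Because candidates that share a prefix share a path, one walk over a transaction updates all matching candidates, instead of testing each candidate with a separate `issubset` call as `scanD` does.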


## Source
1. Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques, 3rd ed.
2. Frequent Itemset Mining Dataset Repository. http://fimi.ua.ac.be/data/
3. Frequent Pattern Mining and the Apriori Algorithm: A Concise Technical Overview. https://www.kdnuggets.com/2016/10/association-rule-learning-concise-technical-overview.html
4. Bodon, Ferenc. "A fast APRIORI implementation." FIMI. Vol. 3. 2003.
