# Introduction:

This tutorial introduces association rule mining and a seminal algorithm for it, the Apriori algorithm. Association rule mining builds on the broader concept of frequent pattern mining. Frequent patterns are patterns that recur frequently in a dataset. The motivation for frequent pattern mining comes from Rakesh Agrawal's concept of strong rules for discovering associations between products in the transaction records of supermarket point-of-sale systems. This application of frequent itemset mining is widely known as market basket analysis.

#### Market Basket Analysis:
Market basket analysis is the process of analyzing customer buying habits by discovering associations between the different products or items that customers place in their shopping baskets. Once discovered, these associations help retailers manage shelf space, develop marketing strategies, engage in selective marketing, and bundle products together. For example, if a customer buys a toothbrush, how likely is that customer to also buy a mouthwash such as Listerine?



Association rule mining also finds applications in recommendation systems on e-commerce and video streaming websites, borrower default prediction at lending firms, web usage mining, intrusion detection, and so on.

Although there are many association rule mining algorithms, this tutorial explores the Apriori algorithm. Along the way, we define the constituents of association rules, namely itemsets, frequent itemsets, and so on.

#### Tutorial Content:

In the build-up to implementing the Apriori algorithm, we will learn what itemsets are, how they are represented, and which measures quantify the interestingness of association rules. On the theoretical side, we use basic concepts of probability to define these measures. In Python, we will use the following libraries to implement the algorithm:

    numpy
    pandas
    itertools.chain
    itertools.combinations



 

In [122]:
import numpy
import pandas
import collections as cp
from itertools import chain
from itertools import combinations

# Measures of Rule Interestingness in dataset:

Two measures of rule interestingness lay the foundation for mining frequent patterns. They are known as rule support and confidence, and are illustrated by the following rule:

                                 toothbrush => mouthwash [support = 5%, confidence = 80%]

A support of 5% for this association rule means that 5% of all the transactions under analysis contain toothbrush and mouthwash purchased together.

A confidence of 80% means that 80% of the customers who bought a toothbrush also bought a mouthwash.

Association rules are considered to be interesting if they satisfy a minimum support threshold and a minimum confidence threshold. These thresholds can be set by the users of the system, decision managers of the organization, or domain experts. 


# Itemsets and Association Rules

Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of items and let D, the dataset under consideration, be a set of transactions. Each transaction T is a set of items such that T ⊆ I and has an identifier, TID. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅.
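
As a small illustration of these definitions, the sketch below represents transactions as Python `frozenset` objects, the same representation the implementation later in this tutorial uses. The item names and the variables `transactions`, `items`, `A` and `B` are hypothetical and serve only this example.

```python
# Hypothetical toy data: each transaction T is an immutable set of item names
transactions = [
    frozenset(["toothbrush", "mouthwash", "floss"]),
    frozenset(["toothbrush", "mouthwash"]),
    frozenset(["toothbrush", "soap"]),
    frozenset(["soap", "shampoo"]),
]

# The item universe I is the union of all transactions
items = frozenset().union(*transactions)

# A transaction T contains an itemset A when A is a subset of T
A = frozenset(["toothbrush", "mouthwash"])
containing = [T for T in transactions if A.issubset(T)]

# For a rule A => B, the two sides must not share any item
B = frozenset(["floss"])
is_valid_rule = A.isdisjoint(B)
```

Here `containing` holds the first two transactions, and `is_valid_rule` is true because A and B have no item in common.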

The association rule A ⇒ B holds in the transaction set D with support *s*, where *s* is the percentage of transactions in D that contain A ∪ B (that is, both A and B):

                                support( A ⇒ B ) = P(A ∪ B)

and with confidence *c*, where *c* is the percentage of transactions in D containing A that also contain B, which is equal to the conditional probability $P(B|A)$:

                                confidence( A ⇒ B ) = P(B|A)
                             
                             
Rules that satisfy both the minimum support threshold and the minimum confidence threshold are considered strong rules.
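
The two measures can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used later in the tutorial: the function names `support`, `confidence` and `is_strong` and the toy transactions are assumptions made for this example, and the code assumes the antecedent occurs in at least one transaction.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return float(hits) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(B|A) as support(A union B) / support(A)."""
    a = frozenset(antecedent)
    both = a.union(consequent)
    return support(both, transactions) / support(a, transactions)


def is_strong(antecedent, consequent, transactions, min_sup, min_conf):
    """A rule is strong when it meets both thresholds."""
    both = frozenset(antecedent).union(consequent)
    return (support(both, transactions) >= min_sup
            and confidence(antecedent, consequent, transactions) >= min_conf)


# Hypothetical toy transactions, not taken from the tutorial's dataset
transactions = [
    frozenset(["toothbrush", "mouthwash"]),
    frozenset(["toothbrush", "mouthwash", "floss"]),
    frozenset(["toothbrush", "soap"]),
    frozenset(["soap"]),
]

s = support(["toothbrush", "mouthwash"], transactions)       # 2 of 4 transactions
c = confidence(["toothbrush"], ["mouthwash"], transactions)  # 2 of the 3 toothbrush baskets
strong = is_strong(["toothbrush"], ["mouthwash"], transactions,
                   min_sup=0.4, min_conf=0.6)
```

With these toy transactions the rule toothbrush ⇒ mouthwash has support 0.5 and confidence 2/3, so it is strong for the thresholds chosen here.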

At this point we can describe association rule mining as, in general, a two-step process:

1. Find all the itemsets that are frequent by selecting the itemsets that occur at least as frequently as a predetermined minimum support count, min_sup.

2. Generate strong association rules from the frequent itemsets obtained in step 1. In addition to the min_sup requirement, these rules must satisfy the minimum confidence threshold, min_conf.

In [136]:
def fileExtract(filename):
    """Generator that yields each line of a CSV file as a frozenset of its fields"""
    with open(filename, 'rU') as file_iter:
        for line in file_iter:
            line = line.strip().rstrip(',')  # Remove the trailing comma at the end of the line
            record = frozenset(line.split(','))
            yield record

In [137]:
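# Each record is stored in a frozenset object, which is immutable.
# The path is a raw string so that the backslashes are kept literally.
# fileExtract returns a generator, so loadedData can be iterated only once.
filename = r"""C:\Users\Ketan\Documents\CMU\Fall Semester\Practical Data Science\Tutorial\INTEGRATED-DATASET.csv"""
loadedData = fileExtract(filename)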

```python
>>> print list(fileExtract(filename))  # a fresh generator, so loadedData is not consumed
[frozenset(['Brooklyn', 'LBE', '11204']), frozenset(['Cambria Heights', 'MBE', 'WBE', 'BLACK', '11411']), 
 frozenset(['MBE', '10598', 'BLACK', 'Yorktown Heights']), frozenset(['11561', 'MBE', 'BLACK', 'Long Beach']), 
 frozenset(['MBE', 'Brooklyn', 'ASIAN', '11235']), frozenset(['MBE', '10010', 'WBE', 'ASIAN', 'New York']), frozenset(['10026', 'MBE', 'New York', 'ASIAN']), 
 frozenset(['10026', 'MBE', 'New York', 'BLACK'])
 ....
 ....
 ....
 frozenset(['NON-MINORITY', 'WBE', '10025', 'New York']), frozenset(['MBE', '11554', 'WBE', 'ASIAN', 'East Meadow']), 
 frozenset(['MBE', 'Brooklyn', 'WBE', 'BLACK', '11208']), frozenset(['NON-MINORITY', 'WBE', '7717', 'Avon by the Sea']), 
 frozenset(['MBE', '11417', 'LBE', 'ASIAN', 'Ozone Park']), frozenset(['NON-MINORITY', '10010', 'WBE', 'New York']), 
 frozenset(['NON-MINORITY', 'Teaneck', 'WBE', '7666']), frozenset(['Bronx', 'MBE', 'WBE', 'BLACK', '10456']), 
 frozenset(['MBE', '7514', 'BLACK', 'Paterson']), frozenset(['NON-MINORITY', 'WBE', '10023', 'New York']), 
 frozenset(['MBE', 'Valley Stream', 'ASIAN', '11580']), frozenset(['MBE', 'Brooklyn', 'BLACK', '11214']), 
 frozenset(['New York', 'LBE', '10016']), frozenset(['MBE', 'New York', 'ASIAN', '10002'])]
```

In [138]:
def getItemsetsTransactionsList(loadedData):
    transactionList = list()                #Create list of transactions
    itemSet = set()             
    for record in loadedData:
        transaction = frozenset(record)
        transactionList.append(transaction)
        for item in transaction:
            itemSet.add(frozenset([item]))  # Generating 1-itemSets
    return itemSet, transactionList
itemSet, transactionList = getItemsetsTransactionsList(loadedData)

```python
>>> for transaction in transactionList: print transaction
frozenset(['Brooklyn', 'LBE', '11204'])
frozenset(['Cambria Heights', 'MBE', 'WBE', 'BLACK', '11411'])
frozenset(['MBE', '10598', 'BLACK', 'Yorktown Heights'])
frozenset(['11561', 'MBE', 'BLACK', 'Long Beach'])
frozenset(['MBE', 'Brooklyn', 'ASIAN', '11235'])
frozenset(['MBE', '10010', 'WBE', 'ASIAN', 'New York'])
frozenset(['10026', 'MBE', 'New York', 'ASIAN'])
frozenset(['10026', 'MBE', 'New York', 'BLACK'])
.....
.....
frozenset(['NON-MINORITY', 'WBE', 'Mineola', '11501'])
frozenset(['MBE', 'ASIAN', '10550', 'Mount Vernon'])
frozenset(['MBE', 'Port Chester', '10573', 'HISPANIC'])
frozenset(['NON-MINORITY', 'Merrick', 'WBE', '11566'])
```

Once we have generated the unique 1-itemsets and the list of all transactions, the next step is to process them by applying the Apriori algorithm.

The Apriori algorithm uses prior knowledge of the frequently occurring itemsets. It employs an iterative approach known as level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets.
        
 
Apriori Algorithm can be divided into two steps:

 1. Join Step
 2. Prune Step
        
###### 1. Join Step: 
In this step, a set of candidate k-itemsets, $C_k$, is generated by joining $L_{k-1}$ with itself. The frequent k-itemsets, $L_k$, are later selected from these candidates.
###### 2. Prune Step: 
$C_k$ is a superset of $L_k$: its members may or may not be frequent. To determine which candidates become part of $L_k$, the transaction list is scanned to count the occurrences of each candidate. All candidates whose count is at least the minimum support count become part of $L_k$. However, $C_k$ can be huge, which means a lot of counting work during each scan of the transaction list (which can itself be huge). To reduce this work, the algorithm makes use of what is called the *Apriori property*, which is described below.

Any (k − 1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k − 1)-subset of a candidate k-itemset is not in $L_{k-1}$, then the candidate cannot be frequent either and so can be removed from $C_k$. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
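
The join and prune steps can be sketched together as a single candidate-generation function. This is a simplified illustration under two assumptions: the join takes the union of any two frequent (k-1)-itemsets that yields exactly k items, rather than the ordered prefix join of the textbook, and the subset test uses a plain Python set lookup in place of a hash tree. The name `generate_candidates` and the sample itemsets are hypothetical.

```python
from itertools import combinations


def generate_candidates(prev_frequent, k):
    """Build the candidate k-itemsets C_k from the frequent (k-1)-itemsets."""
    prev_frequent = set(prev_frequent)

    # Join step: unions of two frequent (k-1)-itemsets that contain exactly k items
    joined = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k:
                joined.add(union)

    # Prune step: keep a candidate only if all of its (k-1)-subsets are frequent
    return set(
        c for c in joined
        if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))
    )


# Hypothetical frequent 2-itemsets over the items a, b, c and d
l2 = [frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")]
c3 = generate_candidates(l2, 3)
```

For this input the join produces {a, b, c}, {a, b, d} and {b, c, d}, and the prune step keeps only {a, b, c}, because {a, d} and {c, d} are not frequent.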

#### Illustration by example :

[<img src="https://s18.postimg.org/eujogosbt/Transactions.png">](https://s18.postimg.org/eujogosbt/Transactions.png)

                                        [Sourced from Data Mining: Concepts and Techniques]

Suppose we have a database of transactions as shown above. It has 9 transactions. Each transaction has one or more items that were purchased together. We will apply the Apriori algorithm to this transaction dataset to find the frequent itemsets.

**Step : 1.**    In the first iteration of the algorithm, each item that appears in the transaction set is a member of the set of candidate 1-itemsets, $C_1$. We therefore scan the dataset to count the occurrences of each item.

**Step : 2.**    We assume a minimum support count of 2, so the relative minimum support is $2/9 \approx 22\%$. Now we can identify the set of frequent 1-itemsets, $L_1$. It consists of all the candidate 1-itemsets in $C_1$ that satisfy the minimum support condition. In our case, all candidates satisfy this condition.
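
Steps 1 and 2 can be sketched as follows. The transaction table lives in the figure above and is not reproduced in the text, so the list below is an assumption: it uses the nine transactions of the textbook example the figure is sourced from, which agree with the support counts quoted later in this tutorial. The variable names are hypothetical.

```python
from collections import Counter

# Assumed contents of the transaction table; the figure above is authoritative
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup_count = 2

# Step 1: one scan over the data gives the support count of every candidate in C_1
c1_counts = Counter(item for t in transactions for item in t)

# Step 2: keep the candidates whose count reaches the minimum support count
l1 = {item: n for item, n in c1_counts.items() if n >= min_sup_count}

# Relative minimum support: 2 out of 9 transactions
relative_min_sup = float(min_sup_count) / len(transactions)
```

Under this assumption every item reaches the minimum support count, so `l1` contains all five items.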

[<img src="https://s14.postimg.org/qv5g284w1/image.png">](https://s14.postimg.org/qv5g284w1/image.png)

**Step : 3.**    Now comes the **join step**. We join $L_1$ with itself to generate the set of candidate 2-itemsets, $C_2$. Note that every subset of each candidate in $C_2$ is also frequent; hence, the **prune step** does not remove any candidates.

**Step : 4.**    Again, the transaction dataset is scanned to get the support counts of all the candidates in $C_2$. The candidates whose support count is at least *min_sup* make up the frequent 2-itemsets, $L_2$.

[<img src="https://s12.postimg.org/cw3wvqxsd/image.png">](https://s12.postimg.org/cw3wvqxsd/image.png)

**Step : 5.**    Now, to generate the candidate 3-itemsets, $C_3$, we join $L_2$ with itself, from which we obtain: {{$I_1$, $I_2$, $I_3$}, {$I_1$, $I_2$, $I_5$}, {$I_1$, $I_3$, $I_5$}, {$I_2$, $I_3$, $I_4$}, {$I_2$, $I_3$, $I_5$}, {$I_2$, $I_4$, $I_5$}}. We can apply the **prune step** here. The Apriori property says that for an itemset to be frequent, all of its subsets must also be frequent. If we take the fourth itemset, {$I_2$, $I_3$, $I_4$}, the subset {$I_3$, $I_4$} is not a frequent 2-itemset (please refer to the picture for $L_2$). Hence, {$I_2$, $I_3$, $I_4$} cannot be a frequent 3-itemset. The same can be deduced for {$I_1$, $I_3$, $I_5$}, {$I_2$, $I_3$, $I_5$} and {$I_2$, $I_4$, $I_5$}, so these four candidates are pruned. This saves the effort of retrieving the counts of these itemsets during the subsequent scan of the transaction dataset.

**Step : 6**     The transactions in the dataset are scanned to determine the counts of the remaining candidates, and those with counts of at least min_sup are selected as the frequent 3-itemsets, $L_3$.

[<img src="https://s13.postimg.org/d1w0nw05j/image.png">](https://s13.postimg.org/d1w0nw05j/image.png)

**Step : 7**     Next, the algorithm joins $L_3$ with itself to get the set of candidate 4-itemsets, $C_4$. The join results in {{$I_1$,$I_2$,$I_3$,$I_5$}}; however, this itemset is pruned because its subset {$I_2$,$I_3$,$I_5$} is not frequent. Hence we reach a point where $C_4 = \emptyset$, and the algorithm terminates, having found all the frequent itemsets.
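
The seven steps can be replayed with a short level-wise loop. This is a sketch, not the implementation given later: the transactions are assumed to be those of the textbook example behind the figures, the join is a simple union-based join, and `count_support` and `apriori_levels` are hypothetical names.

```python
from itertools import combinations

# Assumed contents of the transaction table shown in the first figure
transactions = [frozenset(t) for t in [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]]
min_sup_count = 2


def count_support(candidates, transactions):
    """Scan the transactions and count how many contain each candidate."""
    return {c: sum(1 for t in transactions if c.issubset(t)) for c in candidates}


def apriori_levels(transactions, min_sup_count):
    """Return {k: {itemset: support count}} for every non-empty level L_k."""
    items = set(item for t in transactions for item in t)
    candidates = set(frozenset([item]) for item in items)  # C_1
    levels = {}
    k = 1
    while candidates:
        counts = count_support(candidates, transactions)
        frequent = {c: n for c, n in counts.items() if n >= min_sup_count}
        if not frequent:
            break
        levels[k] = frequent
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that contain exactly k items
        joined = set(a | b for a in frequent for b in frequent if len(a | b) == k)
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = set(
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        )
    return levels


levels = apriori_levels(transactions, min_sup_count)
```

Under these assumptions `levels` has three entries: five frequent 1-itemsets, six frequent 2-itemsets, and the two frequent 3-itemsets {I1, I2, I3} and {I1, I2, I5}. The loop stops when the single candidate 4-itemset is pruned and the candidate set becomes empty.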


## Generating Association Rules from Frequent Itemsets:

Now that we have all the frequent itemsets, we proceed to find the association rules, which is the ultimate goal of the activity. Strong association rules satisfy both the minimum support threshold and the minimum confidence threshold. We can find the confidence of a rule between two itemsets A and B using the following equation:


                        confidence (A => B) = P(B|A) = support_count(A ∪ B)/support_count(A)
                        
Based on this equation, the association rules can be formed as follows :

- For each frequent itemset $l$, generate all nonempty proper subsets of $l$.

- For every such subset $s$ of $l$, output the rule “$s \Rightarrow (l - s)$” if 

 $\frac{support\_count(l)}{support\_count(s)} \geq min\_conf$

   where min_conf is the minimum confidence threshold.

From our example, one of the frequent 3-itemsets was $l$ = {$I_1$,$I_2$,$I_5$}. The nonempty proper subsets of this itemset are {$I_1$}, {$I_2$}, {$I_5$}, {$I_1$,$I_2$}, {$I_2$,$I_5$}, {$I_1$,$I_5$}. Applying the formula above gives the following association rules:

$I_1 \wedge I_2 \Rightarrow I_5$ : $confidence = 2/4 = 50\%$

$I_1 \wedge I_5 \Rightarrow I_2$ : $confidence = 2/2 = 100\%$

$I_2 \wedge I_5 \Rightarrow I_1$ : $confidence = 2/2 = 100\%$

$I_1 \Rightarrow I_2 \wedge I_5$ : $confidence = 2/6 \approx 33\%$

$I_2 \Rightarrow I_1 \wedge I_5$ : $confidence = 2/7 \approx 29\%$

$I_5 \Rightarrow I_1 \wedge I_2$ : $confidence = 2/2 = 100\%$

By fixing the minimum confidence threshold, we can select the rules that satisfy the condition and reject those that don't. For example, with a minimum confidence threshold of 70%, only the second, third and last rules above are strong.
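
The rule-generation procedure can be sketched for this one itemset as follows. The support counts are the numerators and denominators used in the confidence calculations above; the function name `rules_from_itemset` and the 70% threshold are assumptions chosen for illustration.

```python
from itertools import combinations

# Support counts read off the confidence calculations in the worked example
support_count = {
    frozenset(["I1"]): 6,
    frozenset(["I2"]): 7,
    frozenset(["I5"]): 2,
    frozenset(["I1", "I2"]): 4,
    frozenset(["I1", "I5"]): 2,
    frozenset(["I2", "I5"]): 2,
    frozenset(["I1", "I2", "I5"]): 2,
}


def rules_from_itemset(l, support_count, min_conf):
    """Return (antecedent, consequent, confidence) for each rule s => (l - s)
    whose confidence reaches min_conf."""
    l = frozenset(l)
    rules = []
    for size in range(1, len(l)):  # nonempty proper subsets only
        for s in combinations(sorted(l), size):
            s = frozenset(s)
            conf = float(support_count[l]) / support_count[s]
            if conf >= min_conf:
                rules.append((s, l - s, conf))
    return rules


strong = rules_from_itemset(["I1", "I2", "I5"], support_count, min_conf=0.7)
```

With a threshold of 0.7, `strong` contains the three rules whose confidence is 100%. Lowering `min_conf` to 0 returns all six rules listed above.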

In [142]:
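frequencySet = cp.defaultdict(int)  # Support count of every itemset examined so far
largeSet = dict()                   # Maps k to the set of frequent k-itemsets
assocRules = dict()
# Sanity check, not relevant for the calculation: the two dicts must be distinct objects
if assocRules is largeSet:
    print "Should not happen"
else:
    print "OK"
minSupport = 0.17
minConfidence = 0.5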

def getMinimumSupportItems(itemSet, transactionList, minSupport, freqSet):
    """Calculate the support of each itemset in itemSet over the transactions.
    Returns the set of itemsets whose support meets the minimum support threshold"""
    newItemSet = set()
    localSet = cp.defaultdict(int)  # Local counts of the transactions containing each itemset

    for item in itemSet:
        for transaction in transactionList:
            if item.issubset(transaction):
                freqSet[item] += 1   # Update the shared counts passed in by the caller
                localSet[item] += 1

    for item, count in localSet.items():
        support = float(count)/len(transactionList)

        if support >= minSupport:
            newItemSet.add(item)

    return newItemSet

In [None]:
# Printing and confirming the contents of the qualified newItemSet
supportOnlySet = getMinimumSupportItems(itemSet, transactionList, minSupport, frequencySet)
print supportOnlySet

These are all the frequent 1-itemsets:

```python
>>> print supportOnlySet
set([frozenset(['BLACK']), frozenset(['ASIAN']), frozenset(['New York']), frozenset(['MBE']), frozenset(['NON-MINORITY']), frozenset(['WBE'])])
```

In [144]:
def joinSet(itemSet, length):
    """Function to perform the join step of the Apriori Algorithm"""
    return set([i.union(j) for i in itemSet for j in itemSet if len(i.union(j)) == length])

In [132]:
def subsets(arr):
    """ Returns non empty subsets of arr"""
    return chain(*[combinations(arr, i + 1) for i, a in enumerate(arr)])
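
To see what this helper yields, the sketch below reruns the same one-liner on a three-item list. The input list and the name `result` are made up for this illustration.

```python
from itertools import chain, combinations


def subsets(arr):
    """Same helper as above: all non-empty subsets of arr, as tuples."""
    return chain(*[combinations(arr, i + 1) for i, a in enumerate(arr)])


# Hypothetical input: the items of one frequent 3-itemset
result = [frozenset(s) for s in subsets(["I1", "I2", "I5"])]
```

`result` holds seven subsets, ordered by size, and the last one is the full set itself. This is why the rule-generation loop further down only emits a rule when the remainder is non-empty.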

In [145]:
# We calculate the k-itemsets by iterating level-wise until there
# are no more frequent itemsets, as illustrated in the example above
toBeProcessedSet = supportOnlySet
k = 2 
while(toBeProcessedSet != set([])):
    largeSet[k-1] = toBeProcessedSet
    toBeProcessedSet = joinSet(toBeProcessedSet, k)
    toBeProcessedSet_c = getMinimumSupportItems(toBeProcessedSet,transactionList,minSupport,frequencySet)
    toBeProcessedSet = toBeProcessedSet_c
    k = k + 1

def getSupport(item):
    "Local function to get the support of k-itemsets"
    return float(frequencySet[item])/len(transactionList)

finalItems = []
for key, value in largeSet.items():
    finalItems.extend([(tuple(item), getSupport(item)) for item in value])
print finalItems

finalRules = []
for key, value in sorted(largeSet.items())[1:]:  # Skip the 1-itemsets, which cannot be split into a rule
    for item in value:
        _subsets = map(frozenset, [x for x in subsets(item)])
        for element in _subsets:
            remain = item.difference(element)
            if len(remain) > 0:
                confidence = getSupport(item)/getSupport(element)
                if confidence >= minConfidence:
                    finalRules.append(((tuple(element), tuple(remain)), confidence))
print finalRules

In [146]:
def printResults(items, rules):
    """prints the generated itemsets sorted by support and the confidence rules sorted by confidence"""
    for item, support in sorted(items, key=lambda (item, support): support):
        print "item: %s , %.3f" % (str(item), support)
    print "\n------------------------ RULES:"
    for rule, confidence in sorted(rules, key=lambda (rule, confidence): confidence):
        pre, post = rule
        print "Rule: %s ==> %s , %.3f" % (str(pre), str(post), confidence)
printResults(finalItems, finalRules)

```python
>>> printResults(finalItems, finalRules)
item: ('MBE', 'New York') , 0.170
item: ('New York', 'WBE') , 0.175
item: ('MBE', 'ASIAN') , 0.200
item: ('ASIAN',) , 0.202
item: ('New York',) , 0.295
item: ('NON-MINORITY',) , 0.300
item: ('NON-MINORITY', 'WBE') , 0.300
item: ('BLACK',) , 0.301
item: ('MBE', 'BLACK') , 0.301
item: ('WBE',) , 0.477
item: ('MBE',) , 0.671

------------------------ RULES:
Rule: ('New York',) ==> ('MBE',) , 0.578
Rule: ('New York',) ==> ('WBE',) , 0.594
Rule: ('WBE',) ==> ('NON-MINORITY',) , 0.628
Rule: ('ASIAN',) ==> ('MBE',) , 0.990
Rule: ('BLACK',) ==> ('MBE',) , 1.000
Rule: ('NON-MINORITY',) ==> ('WBE',) , 1.000
```

In [None]:
###### If the code is to be converted into a Python file, please copy this piece of code after the function definitions.
# You will be able to pass the min_sup and min_conf arguments from the command line
import sys
from optparse import OptionParser

if __name__ == "__main__":

    optparser = OptionParser()
    optparser.add_option('-f', '--inputFile',
                         dest='input',
                         help='filename containing csv',
                         default=None)
    optparser.add_option('-s', '--minSupport',
                         dest='minS',
                         help='minimum support value',
                         default=0.15,
                         type='float')
    optparser.add_option('-c', '--minConfidence',
                         dest='minC',
                         help='minimum confidence value',
                         default=0.6,
                         type='float')

    (options, args) = optparser.parse_args()

    # Read from standard input when no input file is given;
    # otherwise load the file with this tutorial's fileExtract generator
    if options.input is None:
        inFile = sys.stdin
    else:
        inFile = fileExtract(options.input)

    minSupport = options.minS
    minConfidence = options.minC

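    # runApriori is not defined in this tutorial; it is assumed to wrap the steps
    # above into a single function, as in the reference implementation
    items, rules = runApriori(inFile, minSupport, minConfidence)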

    printResults(items, rules)

## Summary and references

1. Han, J. and Kamber, M. *Data Mining: Concepts and Techniques*.
2. https://github.com/asaini/Apriori/ (reference for the implementation)