## Optimizing Frequent Itemset Mining using A-priori and PCY Algorithm

## Introduction


The goal of this tutorial is to find the N ingredients (we set N = 3 here, but it can be any number you want) that are most frequently used together in recipes from all over the world. We use a real dataset that contains 37852 recipes and 298 foods. It may be too small to show the full effect of the optimizations, but it shows how the algorithms work in practice and helps us find relationships within the data.


The frequent itemset problem is an interesting and useful branch of data mining and analytics that focuses on finding items that often appear together across a large number of datasets. It is usually described with the market-basket model. That is, we regard each data element as an item and each dataset as a basket. Normally the number of items in one basket is much smaller than the total number of items. We then define a set of items to be a frequent itemset if the number of baskets that contain it reaches a specific threshold.
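To make the market-basket model concrete, here is a minimal sketch on hypothetical toy baskets (not the recipe dataset). The names `support_counts` and `frequent_itemsets` are illustrative, and the sketch assumes that "frequent" means a count of at least the threshold, which is how `find_frequent_itemsets` later in this tutorial treats it.

```python
from itertools import combinations
from collections import Counter

# Hypothetical toy baskets, not taken from the recipe dataset
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter", "jam"},
]

def support_counts(baskets, size):
    """Count how many baskets contain each itemset of the given size."""
    counts = Counter()
    for basket in baskets:
        # Sort so that the same itemset always produces the same tuple
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return counts

def frequent_itemsets(baskets, size, threshold):
    """Return the itemsets whose count is at least the threshold."""
    counts = support_counts(baskets, size)
    return {itemset: c for itemset, c in counts.items() if c >= threshold}

pairs = frequent_itemsets(baskets, size=2, threshold=3)
print(pairs)
```

With a threshold of 3, only the pairs (bread, butter) and (bread, milk) survive; (butter, milk) appears in just two baskets.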

However, applications of frequent itemset analysis are not limited to market baskets. The same model can be used to mine many other kinds of data. Some examples are: 


1. **Related Documents**. Let items be words, and let baskets be documents (e.g., Web pages, blogs, tweets). A basket/document contains those items/words that are present in the document. If we ignore all the most common words, then we would hope to find among the frequent pairs some pairs of words that represent a joint concept. For example, we would expect a pair like {Brad, Angelina} to appear with surprising frequency. 

2. **Plagiarism**. Let the items be documents and the baskets be sentences. An item/document is "in" a basket/sentence if the sentence is in the document. This arrangement appears backward, but it is exactly what we need, and we should remember that the relationship between items and baskets is an arbitrary many-many relationship. In practice, even one or two sentences in common is a good indicator of plagiarism. 

3. **Biomarkers**. Let the items be of two types: biomarkers such as genes or blood proteins, and diseases. Each basket is the set of data about a patient: their genome and blood-chemistry analysis, as well as their medical history of the disease. A frequent itemset that consists of one disease and one or more biomarkers suggests a test for the disease.



## Implementation

#### Naive Algorithm

Let's start with the most intuitive method, which is easy both to understand and to implement. The steps are as follows:

1. Iterate over every line in the original file and collect all singletons (with duplicates removed).
2. Generate all triple combinations of these singletons.
3. Iterate over the file again and count the occurrences of each combination.
4. Output the top K most frequent itemsets.

We will only use the first 20 lines of the whole dataset to do a demo.

In [4]:
import heapq
import itertools
from collections import defaultdict

In [21]:
def combine(nums, N=3):
    """
        Generate the N-wise combinations of the given array nums
        Args:
            nums: target array (pass it sorted so that every combination is a sorted tuple)
            N: number of elements in each combination
        Returns:
            Python set, containing all combinations of size N, using nums
        Helper functions, from Leetcode Discuss, https://discuss.leetcode.com/topic/14371/fast-simple-python-code-recursive
    """
    return set(i for i in itertools.combinations(nums, N))

def get_all_items(recipes):
    items = set()
    for recipe in recipes:
        items = items | set(recipe)
    return items

def counting(coms, recipes, N):
    """
        For each recipe, generate its itemsets of size N and count those that are candidates in coms.
        Recipes and candidates are expected to hold sorted items, so that equal itemsets compare equal.
    """
    counter = {}
    for recipe in recipes:
        recipe_coms = combine(recipe, N)
        for com in recipe_coms:
            if com in coms:
                if com not in counter:
                    counter[com] = 0
                counter[com] += 1
    return counter

def get_top_K(counter, K=5):
    return sorted(zip(counter.values(), counter.keys()))[-K:]

In [95]:
def get_all_recipes(filename):
    lines = []
    with open(filename, 'r') as fp:
        for line in fp:
            lines.append(sorted([int(i) for i in line.strip().split()]))

    return lines

recipes = get_all_recipes("data_small")

In [26]:
def naive_algorithm(recipes):
    items = get_all_items(recipes)
    # Sort first: a set has no guaranteed order, and counting() looks up sorted tuples
    coms = combine(sorted(items))
    print(get_top_K(counting(coms, recipes, 3)))
    
%timeit naive_algorithm(recipes)

[(3, (51, 187, 529)), (3, (73, 470, 658)), (3, (205, 529, 966)), (3, (205, 529, 984)), (3, (622, 887, 888))]
[(3, (51, 187, 529)), (3, (73, 470, 658)), (3, (205, 529, 966)), (3, (205, 529, 984)), (3, (622, 887, 888))]
[(3, (51, 187, 529)), (3, (73, 470, 658)), (3, (205, 529, 966)), (3, (205, 529, 984)), (3, (622, 887, 888))]
[(3, (51, 187, 529)), (3, (73, 470, 658)), (3, (205, 529, 966)), (3, (205, 529, 984)), (3, (622, 887, 888))]
1 loop, best of 3: 10.3 s per loop


We get the following five triples, each appearing in 3 recipes:

*[(3, '529 51 187'), (3, '529 966 205'), (3, '529 984 205'), (3, '622 887 888'), (3, '658 73 470')]*

The naive method is feasible. However, it is far from satisfactory in both time and space complexity.

With **timeit**, we found that it takes about 10 s to find the top 5 frequent itemsets.

As the number of distinct items grows, the number of triple combinations grows cubically. Our dataset contains about 50000 lines of recipes, but the naive algorithm can only handle ~20 recipes. Moreover, we generate and store the whole list of combinations, even though most of them never appear in the data set. There is still a lot of room for optimization.

Before introducing the Apriori Algorithm, let's first define association rules.

#### Association Rules (some of the material comes from Wikipedia)

The Apriori Algorithm builds on a collection of if-then rules, called association rules. An association rule has the form I → j, where I is a set of items and j is an item. The rule implies that if all of the items in I appear in some basket, then j is likely to appear in that basket as well. Conversely, if any subset of a candidate itemset is not a frequent itemset, then the candidate itemset cannot be a frequent itemset either. 

The reason is simple. Let J ⊆ I. Then every basket that contains all the items in I surely contains all the items in J. Thus, **the count for J must be at least as great as the count for I**, and if the count for J is at most s, then the count for I is also at most s. Since J may be contained in some baskets that are missing one or more elements of I − J, it is entirely possible that the count for I is strictly less than the count for J.
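The inequality can be checked directly on a small example. The sketch below uses hypothetical toy baskets and an illustrative helper `count`; it compares the count of I = (1, 2, 3) with the count of each of its non-empty proper subsets J.

```python
from itertools import combinations

# Hypothetical toy baskets used only for illustration
baskets = [
    {1, 2, 3},
    {1, 2},
    {2, 3},
    {1, 2, 3, 4},
    {1, 3},
]

def count(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

I = (1, 2, 3)
count_I = count(I, baskets)

# Every non-empty proper subset J of I must satisfy count(J) >= count(I)
subset_counts = {}
for size in range(1, len(I)):
    for J in combinations(I, size):
        subset_counts[J] = count(J, baskets)

monotone = all(c >= count_I for c in subset_counts.values())
print(count_I, subset_counts, monotone)
```

Here I appears in 2 baskets, while each pair inside it appears in 3 and each single item in 4, so the count of I is strictly smaller than the count of every subset.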

Association rules typically do not consider the order of items, either within a transaction or across transactions.

#### Apriori Algorithm

The Apriori Algorithm makes full use of this property. Its core idea is: **if an itemset I is frequent, so are all of its subsets. Conversely, if one of its subsets is not a frequent itemset, I is not a frequent itemset either.**

![title](imgs/image01.png)
![title](imgs/image09.png)

If the size of our target itemset is K, and we define N as the frequency threshold of a frequent itemset (note that N here is the threshold, not the itemset size from the introduction), then the Apriori Algorithm works as follows:

1. Start with all singletons: load the data and eliminate every singleton whose frequency is less than N.
2. Form pairs from the remaining singletons, load the data again, and eliminate every pair whose frequency is less than N.
3. Repeat step 2 with larger itemsets until we form itemsets of size K. All subsets of these itemsets are frequent, so they are legal candidates and eligible for the final screening.
4. Run the final screening on these candidates to obtain all frequent itemsets of size K.
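The steps above can be sketched for an arbitrary target size. This is a minimal illustration on hypothetical toy baskets, not the implementation used in the rest of this tutorial: the function name `apriori` and its parameters are assumptions, and candidates are filtered on the fly while each basket is scanned rather than generated up front.

```python
from itertools import combinations
from collections import Counter

def apriori(baskets, max_size, threshold):
    """Level-wise Apriori sketch: returns {size: {itemset: count}}."""
    baskets = [sorted(set(b)) for b in baskets]

    # Pass 1: count singletons and keep the frequent ones
    counts = Counter((item,) for b in baskets for item in b)
    result = {1: {s: c for s, c in counts.items() if c >= threshold}}

    for size in range(2, max_size + 1):
        prev = result[size - 1]
        # Items that still occur in some frequent itemset of the previous level
        alive = {item for itemset in prev for item in itemset}
        counts = Counter()
        for b in baskets:
            kept = [item for item in b if item in alive]
            for cand in combinations(kept, size):
                # Prune: every (size - 1)-subset must already be frequent
                if all(sub in prev for sub in combinations(cand, size - 1)):
                    counts[cand] += 1
        result[size] = {s: c for s, c in counts.items() if c >= threshold}
    return result

# Hypothetical toy baskets
baskets = [[1, 2, 3], [1, 2, 3, 4], [1, 2], [2, 3], [1, 3, 4]]
levels = apriori(baskets, max_size=3, threshold=2)
print(levels[3])
```

In this example the pair (2, 4) appears only once, so the triples (1, 2, 4) and (2, 3, 4) are never counted at all.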

Let's set our target itemset size to 3 and the threshold to 3, and re-write the algorithm.

In [37]:
def find_frequent_itemsets(counter, threshold=3):
    candidates = []
    for k, v in counter.items():
        if v >= threshold:
            candidates.append(k)
    return candidates

def generate_triple_from_tuple(candidates_2):
    # Frequent pairs as a set of sorted tuples, for fast membership tests
    candidate_set = set(candidates_2)
    candidates_3 = []
    items = set()
    for i in candidates_2:
        items.add(i[0])
        items.add(i[1])
    
    # Sorted items guarantee that every generated pair and triple is a sorted tuple
    items = sorted(items)
    length = len(items)
    for i in range(length):
        for j in range(i + 1, length):
            if (items[i], items[j]) in candidate_set:
                for k in range(j + 1, length):
                    # A triple is a candidate only if all three of its pairs are frequent
                    if (items[i], items[k]) in candidate_set and (items[j], items[k]) in candidate_set:
                        candidates_3.append((items[i], items[j], items[k]))
                        
    return candidates_3

def apriori_algorithm(recipes):
    """
        Given recipes, this function runs Apriori Algorithm and obtains the frequent itemset
        Args:
            recipes: An array of recipe
        Returns:
            An array containing all frequent itemsets
    """
    
    # Get all singletons
    items = get_all_items(recipes)
    coms = combine(items, 1)
    
    # Retains frequent singletons of size 1
    counter_1 = counting(coms, recipes, 1)
    freq_1 = find_frequent_itemsets(counter_1)
    candidates_1 = sorted([t[0] for t in freq_1])
    
    # Get all tuples
    coms = combine(candidates_1, 2)
    counter_2 = counting(coms, recipes, 2)
    candidates_2 = find_frequent_itemsets(counter_2)

    # Get all triples
    candidates_3 = generate_triple_from_tuple(candidates_2)

    # Final screening and return the result (a set keeps the membership tests in counting() fast)
    return find_frequent_itemsets(counting(set(candidates_3), recipes, 3))
    
%timeit apriori_algorithm(recipes)

10 loops, best of 3: 100 ms per loop


This time it takes only about 100 ms to get the result, roughly a **100X speedup** over the 10.3 s of the naive algorithm! In fact, on a dataset 4 times larger, it still takes only 19.7 seconds to complete (a 5X speedup). But this is still not that satisfying. Can we do even better? We will introduce a further optimized algorithm, the **PCY Algorithm**.

#### PCY Algorithm

The PCY Algorithm is an improvement on the Apriori Algorithm, so most parts of the PCY Algorithm are the same as in the Apriori Algorithm. The key observation is that there is still some unused memory in round 1. 

![title](imgs/image08.png)


We can use this memory to further reduce the number of candidates. This is done by hashing. In practice, the itemsets of size 2 usually occupy the most memory. So while we count the frequency of singletons, we can also keep counters for pairs. Of course, we cannot afford a counter for every pair, but we can keep one counter for all pairs that share the same hash value. When we generate the candidate frequent pairs, we also look at the hash counters to see whether the count of the corresponding bucket reaches our threshold (we then say the bucket has "overflowed").

![title](imgs/image00.png)

1. If the bucket has not overflowed, we can directly conclude that the pair cannot be a frequent itemset, so we eliminate it before the final screening.
2. If the bucket has overflowed, the pair remains a valid candidate. Please note that **we cannot be sure that this pair is frequent, since the bucket may contain other pairs, and the count of the bucket is the sum of the counts of all pairs hashed into it.**
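The two cases can be seen on a small example. The sketch below is a hedged illustration on hypothetical toy baskets: `NUM_BUCKETS`, `bucket_of`, `pcy_pass_one` and `candidate_pairs` are names introduced here, and the hash folds the `2a + 3b` idea into a deliberately tiny table so that collisions occur.

```python
from itertools import combinations
from collections import Counter

NUM_BUCKETS = 5  # hypothetical table size, deliberately tiny to force collisions

def bucket_of(pair):
    """Hypothetical hash: 2a + 3b, folded into NUM_BUCKETS buckets."""
    return (pair[0] * 2 + pair[1] * 3) % NUM_BUCKETS

def pcy_pass_one(baskets):
    """Count singletons and, in the same scan, hash every pair into a bucket."""
    item_counts, bucket_counts = Counter(), Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair)] += 1
    return item_counts, bucket_counts

def candidate_pairs(item_counts, bucket_counts, threshold):
    """Pairs of frequent items whose bucket count reached the threshold."""
    frequent_items = sorted(i for i, c in item_counts.items() if c >= threshold)
    return [p for p in combinations(frequent_items, 2)
            if bucket_counts[bucket_of(p)] >= threshold]

# Hypothetical toy baskets
baskets = [[1, 2], [1, 2], [1, 2, 3], [2, 3], [3, 4], [1, 4]]
item_counts, bucket_counts = pcy_pass_one(baskets)
candidates = candidate_pairs(item_counts, bucket_counts, threshold=3)

# Second pass: exact counts, but only for the surviving candidates
cand_set = set(candidates)
exact = Counter(p for b in baskets
                for p in combinations(sorted(set(b)), 2) if p in cand_set)
print(candidates, dict(exact))
```

The pair (1, 3) is dropped because its bucket holds a single occurrence. The pair (2, 3) survives only because it shares a bucket with (1, 2) and (3, 4); its exact count is 2, so the final screening still removes it.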

A good question is how to choose the size of the hash table and the hash function. If the hash table is too large, we may consume too much memory. Conversely, if the hash table is too small, most of the buckets will overflow, and we cannot eliminate as many pairs as we hoped. Here we also provide a function to calculate the overflow rate of the hash table.
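To make the size trade-off concrete, the following sketch measures the overflow rate for several table sizes on synthetic baskets. Everything here is an assumption made for illustration: the helper `overflow_rate_for`, the modulo-based hash, and the generated baskets. Unlike the `overflow_rate` function in this tutorial, it divides by the total number of buckets rather than by the number of non-empty ones.

```python
from itertools import combinations
from collections import Counter

def overflow_rate_for(baskets, num_buckets, threshold):
    """Fraction of the num_buckets buckets whose count reaches the threshold."""
    buckets = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            # Hypothetical hash: 2a + 3b folded into num_buckets buckets
            buckets[(2 * a + 3 * b) % num_buckets] += 1
    overflowed = sum(1 for c in buckets.values() if c >= threshold)
    return overflowed / float(num_buckets)

# Synthetic baskets: 20 overlapping windows of three consecutive item ids
baskets = [[i, i + 1, i + 2] for i in range(1, 21)]

rates = {n: overflow_rate_for(baskets, n, threshold=3) for n in (2, 10, 1000)}
print(rates)
```

No pair in these baskets occurs more than twice, so an ideal table would have no overflowed bucket at threshold 3. With 2 buckets every bucket overflows and nothing can be pruned, with 10 buckets collisions still push 4 of them over the threshold, and with 1000 buckets none overflows.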

In [93]:
hashtable = {}

def overflow_rate(hashtable, threshold):
    counter = 0.0
    for v in hashtable.values():
        if v >= threshold:
            counter += 1.0
    return counter / len(hashtable)

def get_hash(pair):
    """
        This hash function calculates the hash value by multiplying the two items by small constants and adding the results
    """
    return pair[0] * 2 + pair[1] * 3

def pcy_counting_pair(hashtable, coms, recipes):
    """
        Count the singletons in coms over all recipes.
        This is the special version of PCY counting: in the same scan, it also hashes every pair of each recipe to a bucket in hashtable
    """

    counter = {}
    for recipe in recipes:
        recipe_coms = combine(recipe, 1)
        for com in recipe_coms:
            if com in coms:
                if com not in counter:
                    counter[com] = 0
                counter[com] += 1   
                
        # Core component of PCY
        recipe_coms = combine(recipe, 2)
        for com in recipe_coms:
            hash_value = get_hash(com)
            if hash_value not in hashtable:
                hashtable[hash_value] = 0
            hashtable[hash_value] += 1
   
    return counter

def pcy_combine(hashtable, nums, threshold=3):
    """
        Generate all pairs from the given array nums, keeping only pairs whose hash bucket has overflowed
        This is the special version of PCY combining. It checks each generated pair against the hashtable
        Args:
            hashtable: dict mapping a hash value to its bucket count, filled by pcy_counting_pair
            nums: target array
            threshold: minimum bucket count for a pair to stay a candidate
        Returns:
            Python set, containing the pairs from nums whose bucket count is at least threshold
    """
    return set(i for i in itertools.combinations(nums, 2) if hashtable.get(get_hash(i), 0) >= threshold)

In [98]:
def pcy_algorithm(hashtable, recipes):
    """
        Given recipes, this function runs PCY Algorithm and obtains the frequent itemset
        Args:
            hashtable: dict used for the PCY bucket counts; it is updated in place
            recipes: An array of recipe
        Returns:
            An array containing all frequent itemsets
    """
    
    # Get all singletons
    items = get_all_items(recipes)
    coms = combine(items, 1)
    
    # Retains frequent singletons of size 1
    counter_1 = pcy_counting_pair(hashtable, coms, recipes)
    freq_1 = find_frequent_itemsets(counter_1)
    candidates_1 = sorted([t[0] for t in freq_1])
    
    # Get all tuples
    coms = pcy_combine(hashtable, candidates_1)
    counter_2 = counting(coms, recipes, 2)
    candidates_2 = find_frequent_itemsets(counter_2)
    
    # Get all triples
    candidates_3 = generate_triple_from_tuple(candidates_2)

    # Final screening and return the result (a set keeps the membership tests in counting() fast)
    return find_frequent_itemsets(counting(set(candidates_3), recipes, 3))

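# Note: %timeit calls pcy_algorithm many times on the same dict, so bucket counts accumulate across runs
hashtable = {}
%timeit pcy_algorithm(hashtable, recipes)
print("The overflow rate is\t%s" % overflow_rate(hashtable, 3))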

10 loops, best of 3: 111 ms per loop
The overflow rate is	1.0


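Although PCY takes about 110 ms on the small dataset, which is no faster than Apriori, on a dataset 4 times larger it takes only 4.9 seconds to complete, versus 19.7 seconds for Apriori (a 4X speedup). That's great! But can we do even better?

Note that the reported overflow rate of 1.0 is inflated: `%timeit` calls `pcy_algorithm` many times on the same `hashtable`, so the bucket counts accumulate across runs. To measure the real rate, reset the hashtable and call the function once.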

#### Multi-hash PCY Algorithm

We can further optimize our algorithm by using the multi-hash version of the PCY Algorithm. Instead of keeping only one hash table, we use multiple hash tables, each with its own hash function, and we keep a pair as a candidate only when all of the buckets it hashes to have overflowed. This further reduces the number of candidates and thus speeds up our algorithm.

We do not provide a full implementation for the recipe dataset here, since it follows directly from the PCY code above. You are welcome to implement the multi-hash algorithm yourself.
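As a starting point, here is a minimal sketch of the idea on hypothetical toy baskets. The two hash functions, the table size and all function names are assumptions chosen for illustration. In a real setting the available memory would be split between the tables, so each table has fewer buckets than a single PCY table would.

```python
from itertools import combinations
from collections import Counter

NUM_BUCKETS = 5  # hypothetical size of each table

# Two hypothetical hash functions; any reasonably independent hashes would do
def hash_a(pair):
    return (pair[0] * 2 + pair[1] * 3) % NUM_BUCKETS

def hash_b(pair):
    return (pair[0] * 7 + pair[1]) % NUM_BUCKETS

HASHES = (hash_a, hash_b)

def multihash_pass_one(baskets, hashes):
    """Count singletons and hash every pair into one table per hash function."""
    item_counts = Counter()
    tables = [Counter() for _ in hashes]
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            for table, h in zip(tables, hashes):
                table[h(pair)] += 1
    return item_counts, tables

def multihash_candidates(item_counts, tables, hashes, threshold):
    """Keep a pair only if its bucket has overflowed in every table."""
    frequent_items = sorted(i for i, c in item_counts.items() if c >= threshold)
    return [p for p in combinations(frequent_items, 2)
            if all(t[h(p)] >= threshold for t, h in zip(tables, hashes))]

# Hypothetical toy baskets
baskets = [[1, 2], [1, 2], [1, 2, 3], [2, 3], [3, 4], [1, 4]]
item_counts, tables = multihash_pass_one(baskets, HASHES)

single = multihash_candidates(item_counts, tables[:1], HASHES[:1], threshold=3)
both = multihash_candidates(item_counts, tables, HASHES, threshold=3)
print(single, both)
```

With only the first hash, the infrequent pair (2, 3) survives because it collides with (1, 2). The second hash places it in a bucket that has not overflowed, so only the truly frequent pair (1, 2) remains.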

### You need MapReduce and Spark

You may have noticed that we use a very small amount of data for the demo. This is because these problems are in fact better suited to distributed parallel computing: a single machine may not have enough memory and computing power to find frequent pairs over a large dataset. Transforming our current implementation into a MapReduce or Spark version is not that hard. For example, the Apriori Algorithm can be rewritten as a K-phase MapReduce program, with one phase per itemset size from 1 to K.
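The phase structure can be sketched in plain Python. This is a single-process simulation under stated assumptions, not Hadoop or Spark code: `map_phase`, `reduce_phase` and `run_phases` are hypothetical names, and the shuffle step is imitated with a dictionary.

```python
from itertools import combinations
from collections import defaultdict

def map_phase(basket, size, prev_frequent):
    """Mapper: emit (itemset, 1) for each candidate itemset in one basket."""
    items = sorted(set(basket))
    for cand in combinations(items, size):
        # For size 1 there is nothing to prune against
        if size == 1 or all(sub in prev_frequent
                            for sub in combinations(cand, size - 1)):
            yield cand, 1

def reduce_phase(pairs, threshold):
    """Group by key, sum the values, keep itemsets that reach the threshold."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return {k: v for k, v in grouped.items() if v >= threshold}

def run_phases(baskets, max_size, threshold):
    """Run one map/reduce phase per itemset size, feeding each result forward."""
    frequent = {}
    for size in range(1, max_size + 1):
        emitted = [kv for b in baskets for kv in map_phase(b, size, frequent)]
        frequent = reduce_phase(emitted, threshold)
    return frequent

# Hypothetical toy baskets
baskets = [[1, 2, 3], [1, 2, 3, 4], [1, 2], [2, 3], [1, 3, 4]]
triples = run_phases(baskets, max_size=3, threshold=2)
print(triples)
```

Each mapper only needs its own basket plus the frequent itemsets of the previous phase, which is what makes the computation easy to distribute.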

## Summary

In this tutorial, we explored several implementations for finding frequent itemsets in baskets. We began with the naive method, which blindly generates all possible candidate itemsets and counts them one by one. Then we introduced association rules and the Apriori Algorithm, which efficiently eliminates most of the candidates before the final screening. After that, we further optimized memory usage by introducing the PCY Algorithm, which works well as the amount of data grows.