## Data Mining Lab2 - Association Rules and Frequent Itemsets
### Yage Hao(郝雅歌) and Hongyi Luo(罗弘毅)

In [1]:
import numpy as np
import itertools

## 1. Data Preprocessing

In [2]:
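# Read the receipts file and remove duplicate items within each basket.

receipts = []
with open('receipts_dataset.txt', 'r') as file:
    for line in file:
        # The last token is dropped (assumed to be the line terminator after a trailing space).
        x = list(map(int, line.split(" ")[:-1]))
        # Store each basket without duplicates
        receipts.append(list(set(x)))

# Number of receipts (baskets), used later to compute the relative support.
numItems = len(receipts)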

In [3]:
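# The order of the items in a basket may differ from the original file,
# because converting a list to a set does not preserve the insertion order.
# This is not important here: itemsets are sorted before they are counted.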
receipts[0]

[448, 834, 164, 775, 328, 687, 240, 368, 274, 561, 52, 630, 825, 25, 538, 730]
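
The reordering seen above comes from the conversion to a set. A minimal sketch, using a made-up toy basket (not taken from the dataset), of how sorting gives an order-independent representation:

```python
# Hypothetical toy basket with a duplicated item (not from receipts_dataset.txt).
raw_basket = [448, 52, 834, 52, 164]

# Deduplicate; a set has no defined order, so its iteration order should not be relied on.
basket = set(raw_basket)

# Sorting yields a canonical representation that does not depend on the set's internal order.
canonical = tuple(sorted(basket))
print(canonical)  # (52, 164, 448, 834)
```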

## 2. Task A : Frequent Itemsets

- In this part, we first count the occurrences of single items to get the frequent 1-itemsets, then choose candidate pairs among them, and so on for larger itemsets.
- We use a dictionary as our data structure to map each item combination to its number of occurrences.
- At each step we filter the candidates by the chosen support threshold and keep the frequent itemsets in a consistent form (sorted tuples).
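
The steps above can be sketched on a tiny example. This is only an illustration under stated assumptions: `toy_receipts`, `min_support` and the helper `frequent` are hypothetical names, and `collections.Counter` is used in place of the hand-written dictionary counting of the notebook.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: five small baskets.
toy_receipts = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3, 4]]
min_support = 0.5  # relative support: fraction of receipts containing the itemset


def frequent(counts, n_receipts, min_support):
    """Keep the itemsets whose relative support reaches min_support."""
    return {k: v for k, v in counts.items() if v / n_receipts >= min_support}


n = len(toy_receipts)

# Step 1: count single items and keep the frequent ones.
item_counts = Counter(item for r in toy_receipts for item in set(r))
L1 = frequent(item_counts, n, min_support)

# Step 2: count pairs built only from the frequent single items.
frequent_items = set(L1)
pair_counts = Counter()
for r in toy_receipts:
    kept = sorted(set(r) & frequent_items)
    pair_counts.update(combinations(kept, 2))
L2 = frequent(pair_counts, n, min_support)

print(L1)  # {1: 4, 2: 4, 3: 4}
print(L2)  # {(1, 2): 3, (1, 3): 3, (2, 3): 3}
```

Item 4 appears in only one of the five baskets, so it is dropped after the first step and never takes part in a candidate pair.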

### 2.1. Getting the candidates item combinations. (Ck)
 
    Build the candidate dictionary Ck: {seq -> number of occurrences of seq}.
    Given the frequent itemsets of length n_plet-1 and the receipts,
    return the dictionary with the occurrences of every candidate sequence of length n_plet.
    
    dict_freq_items: dict mapping itemsets to their occurrences, containing the frequent itemsets
                    of length n_plet-1. (Note: if n_plet=1, this parameter is not used and should be None)
    receipts: list of list of int, dataset of baskets 
    n_plet: int, number of elements per sequence, in [1, inf)
    

In [4]:
# Build the occurrence dictionary for itemsets of a given size (Ck).
# Key idea: if an itemset is frequent, then all of its subsets must be frequent too.
# So we start from the frequent single items and grow the itemsets one element at a time.


def get_ck_possible_itemset(dict_freq_items, receipts, n_plet):
    if n_plet < 1:
        # Stop here instead of continuing with an invalid size
        raise ValueError("n_plet must be at least 1")
        
    new_frequent_dict = {}
    
    if n_plet == 1:
        # brute force: count every item of every receipt
        for receipt in receipts:
            for el in receipt:
                if new_frequent_dict.get(el) is None:
                    new_frequent_dict[el] = 1
                else:
                    new_frequent_dict[el] += 1
                    
    # Case n_plet >= 2 - must compute subsequences
    else:
        # Collect the single items that appear in the frequent (n_plet-1)-itemsets
        
        if n_plet >= 3:
            # The keys are tuples of at least 2 elements, so they are flattened into single items.
            # For more details, see the Python itertools.chain documentation
            # [(1, 2), (3, 4)] -> 1, 2, 3, 4
            # The * unpacks the keys so that each tuple is passed to chain as a separate argument.
            freq_items = set(itertools.chain(*dict_freq_items.keys()))
        else:
            # For n_plet == 2 the input is the 1-item dictionary,
            # whose keys are already single items.
            freq_items = set(dict_freq_items.keys())

        # Count the n_plets
        
        for receipt in receipts:
            # 1- keep only the items of this receipt (one line of the file) that are still frequent
            Common_elements = freq_items.intersection(set(receipt))
            
            # 2- enumerate all sorted combinations of n_plet common elements; they are counted below
            Common_sequences = itertools.combinations(sorted(Common_elements), n_plet) 
            
            for seq in Common_sequences:
                if new_frequent_dict.get(seq) is None:
                    new_frequent_dict[seq] = 1
                else:
                    new_frequent_dict[seq] += 1

    return new_frequent_dict
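
The function above counts every combination of the items that survive in the frequent (n_plet-1)-itemsets. A stricter variant of the same idea, shown here only as a sketch and not as what the notebook does, checks that all (k-1)-subsets of a candidate are frequent before counting it. The helper name `has_all_frequent_subsets` and the toy pairs are assumptions made for illustration.

```python
from itertools import combinations


def has_all_frequent_subsets(candidate, frequent_prev):
    """Return True if every (k-1)-subset of `candidate` is in `frequent_prev`.

    candidate: sorted tuple of k items.
    frequent_prev: set of sorted tuples of length k-1.
    """
    k = len(candidate)
    return all(sub in frequent_prev for sub in combinations(candidate, k - 1))


# Hypothetical frequent pairs.
frequent_pairs = {(1, 2), (1, 3), (2, 3), (2, 4)}

print(has_all_frequent_subsets((1, 2, 3), frequent_pairs))  # True
print(has_all_frequent_subsets((1, 2, 4), frequent_pairs))  # False: (1, 4) is not frequent
```

Because `combinations` preserves the order of its input, the subsets of a sorted tuple are themselves sorted and can be looked up directly.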

### 2.2. Filtering the candidates to get the final results

    Filtering function
    Given a dictionary of occurrences of the elements and a threshold (minimum support threshold),
    return the dictionary of the elements whose relative support (occurrences / number of receipts) is >= threshold
    

In [5]:
def filter_under_threshold(dict_occurrences, thresh = 2):
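    # thresh is a relative support: the fraction of receipts that must contain the itemset.
    # Pass it explicitly as a value in [0, 1]; with the default of 2 every itemset is removed.
    to_rem_el = []
    for key in dict_occurrences.keys():
        # relative support = occurrences / number of receipts (definition from the slides)
        if dict_occurrences[key]/numItems < thresh:
            to_rem_el.append(key)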
            
    for el in to_rem_el:
        del dict_occurrences[el]
    
    return dict_occurrences
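
The same filtering can also be written without modifying the input dictionary. The following is a minimal sketch, assuming a hypothetical helper `filter_by_relative_support` that receives the number of receipts as an argument instead of reading a global variable; the counts are made up.

```python
def filter_by_relative_support(counts, n_receipts, min_support):
    """Return a new dict with the itemsets whose count / n_receipts >= min_support."""
    return {
        itemset: count
        for itemset, count in counts.items()
        if count / n_receipts >= min_support
    }


# Hypothetical occurrence counts over 1000 receipts.
toy_counts = {(1, 2): 30, (1, 3): 4, (2, 3): 12}
kept = filter_by_relative_support(toy_counts, n_receipts=1000, min_support=0.005)

print(kept)             # {(1, 2): 30, (2, 3): 12}
print(len(toy_counts))  # 3, the input dictionary is left untouched
```

Returning a new dictionary keeps the unfiltered counts available, which can help when trying several thresholds.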

### 2.3. Test the result
    After several tests, we think n_plet = 4 is a reasonable size at which to stop.

In [6]:
# get_ck_possible_itemset takes (dict_freq_items, receipts, n_plet)
# filter_under_threshold takes (dict_occurrences, thresh), with thresh as a fraction of all receipts

C1 = get_ck_possible_itemset(None, receipts, 1)
C1 = filter_under_threshold(C1, thresh = 0.005)

C2 = get_ck_possible_itemset(C1, receipts, 2)
C2 = filter_under_threshold(C2, thresh = 0.005)

C3 = get_ck_possible_itemset(C2, receipts, 3)
C3 = filter_under_threshold(C3, thresh = 0.005)

C4 = get_ck_possible_itemset(C3, receipts, 4)
C4 = filter_under_threshold(C4, thresh = 0.005)

In [7]:
list(C1.keys())[:10]

[448, 834, 164, 775, 328, 687, 240, 368, 274, 561]

In [9]:
list(C2.keys())[:10]

[(448, 538),
 (39, 704),
 (39, 825),
 (704, 825),
 (708, 883),
 (708, 978),
 (853, 883),
 (883, 978),
 (529, 782),
 (227, 390)]

In [13]:
list(C3.keys())[:10]

[(39, 704, 825),
 (708, 883, 978),
 (571, 623, 795),
 (571, 623, 853),
 (571, 795, 853),
 (623, 795, 853),
 (392, 801, 862),
 (350, 411, 572),
 (350, 411, 579),
 (350, 411, 803)]

## 3. Task B : Association Rules
- The main idea in this part is to use the frequent itemsets obtained previously to generate candidate rules and measure their confidence.
- So we need a new Rule class, a method to generate the sub-rules of a rule, and a method to calculate the confidence.
- For the sake of running performance, we also need to take care of duplicate rules.

### 3.1. Rule Class
    A simple example of a rule is A->B, where A is stored as "first" and B as "second".
    Either side can be an int or a list of int, so we must always check the data type.
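
One way to avoid the repeated type checks, shown only as a sketch and not used by the class below, is to convert both sides to sorted tuples as soon as they are received. The helper name `as_itemset` is hypothetical.

```python
def as_itemset(side):
    """Return a sorted tuple whether `side` is a single int or an iterable of ints."""
    if isinstance(side, int):
        return (side,)
    return tuple(sorted(side))


print(as_itemset(571))              # (571,)
print(as_itemset([853, 623, 795]))  # (623, 795, 853)
```

With both sides stored as tuples, the length of a side is always `len(side)` and the merged itemset of a rule is simply the sorted concatenation of its two sides.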

In [18]:
class Rule():

    def __init__(self, first, second):
        self.first = first
        self.second = second
        
    def get_first(self):
        return self.first
    
    def get_second(self):
        return self.second
    
    def get_first_length(self):
        if isinstance(self.first, int):
            return 1
        else: 
            return len(self.first)
        
    def get_sub_rule(self):
        # If first is a single int, no item can be moved from first to second
        if isinstance(self.first, int):
            print("Error: impossible to generate subrule")
            return None
        
        # Otherwise, we compute the subrule having the form
        # given Rule ([a,b,..,c,d], [e,f]) -> Rule ([a,b,..,c], [d,e,f])
        elif len(self.get_first()) == 2:
            new_first = self.get_first()[0]
            new_second = [self.get_first()[-1]]
            if isinstance(self.get_second(), int):
                new_second.append(self.get_second())
            else:
                for el in self.get_second():
                    new_second.append(el)
                new_second = sorted(new_second)
        else:
            new_first = self.get_first()[:-1]
            new_second = [self.get_first()[-1]]
            
            if isinstance(self.get_second(), int):
                new_second.append(self.get_second())
            else:
                for el in self.get_second():
                    new_second.append(el)
                new_second = sorted(new_second)
                

        return Rule(new_first, new_second)
    
    # Merge the first and the second
    # Given Rule ([a,b,..,c,d], [e,f]) -> [a,b,..,c,d,e,f]
    def get_rule_itemset(self):
        if isinstance(self.first, int):
            first = [self.first]
        elif len(self.first) == 1:
            # a sequence with a single element is used as-is
            first = self.first
        else:
            first = list(self.first)
        
        if isinstance(self.second, int):
            second = [self.second]
        elif len(self.second) == 1:
            second = self.second
        else:
            second = list(self.second)
        
        list_el_set = list(first)
        for el in second:
            list_el_set.append(el)
        return tuple(list_el_set)
        

### 3.2. Association rule extraction
    For a given confidence threshold, we analyze the frequent itemsets starting from the biggest ones.
    From each of them we generate rules by moving items from the left side to the right side, e.g. A,B,C,D -> (A,B,C -> D), (A,B -> C,D)
    Two facts always apply:
        1: If A,B,C -> D is below the confidence threshold, so is A,B -> C,D (the converse does not hold)
        2: conf(A,B→C,D) = supp(A,B,C,D)/supp(A,B)
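
A small worked example of the confidence formula, with made-up support counts, shows why moving an item from the left side to the right side can only lower the confidence: the numerator stays the same while the denominator supp(A,B) is at least as large as supp(A,B,C). The dictionary `support` and the helper `confidence` are hypothetical names for this sketch.

```python
# Hypothetical absolute supports (toy numbers, not from the dataset).
support = {
    ("A", "B"): 50,
    ("A", "B", "C"): 40,
    ("A", "B", "C", "D"): 30,
}


def confidence(antecedent, consequent, support):
    """conf(X -> Y) = supp(X and Y together) / supp(X)."""
    itemset = tuple(sorted(antecedent + consequent))
    return support[itemset] / support[tuple(sorted(antecedent))]


conf_abc_d = confidence(("A", "B", "C"), ("D",), support)  # 30 / 40
conf_ab_cd = confidence(("A", "B"), ("C", "D"), support)   # 30 / 50

print(conf_abc_d)  # 0.75
print(conf_ab_cd)  # 0.6
```

With a confidence threshold of 0.7, A,B,C -> D passes while A,B -> C,D does not, so a rule that passes still needs its sub-rule to be evaluated, whereas a rule that fails allows its sub-rules to be skipped.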
    

In [19]:
def generate_association_rules(dict_freq_items, confidence):
    # must generate the associations starting from the BIG itemsets
    # use a queue and append the candidate rules that pass the filter with the given confidence
    max_n_plet = len(dict_freq_items)-1
    
    # initialize queue
    queue_rules_to_eval = []
    
    # initialize list of valid rules of type (rule_tuple, rule_confidence)
    valid_rules = []
    
    # For each frequent sequence seq, the queue is seeded with the rules of the form:
    # (seq without seq[i]) -> seq[i]
    
    # Iterate over the dictionaries, from the longest frequent sequences to the shortest
    for freq_seq in dict_freq_items[::-1]:
    
    # Initialization: get all the frequent sequences of this length
        freq_seq = list(freq_seq.keys())

        for el in freq_seq:
            if isinstance(el, int):
                continue
            # append tuple of type (seq \ seq[j], seq[j])
            if len(el) > 2:
                for i in range(len(el)):
                    queue_rules_to_eval.append(Rule([el[j] for j in range(len(el)) if i != j], el[i]))
            elif len(el) == 2:
                queue_rules_to_eval.append(Rule(el[0], el[-1]))
                queue_rules_to_eval.append(Rule(el[-1], el[0]))

        def eval_rule_subrules(rule, c):
            # first: evaluate the rule as-is
            # if valid: record it and queue the rule one step lower, else do nothing

            def get_support(seq, dict_freq_items):
                # If the sequence is made by only a number, we have just an integer, not a list
                if isinstance(seq, int):
                    support = dict_freq_items[0].get(seq)
                else: #meaning that length >1:
                    support = dict_freq_items[len(seq) - 1].get(tuple(sorted(seq)))

                return support

            # confidence = support(I + j) / support(I)
            supp_I_j = get_support(rule.get_rule_itemset() , dict_freq_items)
            supp_I = get_support(rule.get_first(), dict_freq_items)

            rule_conf = supp_I_j/supp_I

            if rule_conf >= c:
                #generate under-rules and append to queue
                valid_rules.append([rule, rule_conf])

                if rule.get_first_length() >= 2:
                    rule_to_append = rule.get_sub_rule() 
                    queue_rules_to_eval.append(rule_to_append)

        previous_rule_len = 0
        
        while len(queue_rules_to_eval) > 0:
            curr_rule = queue_rules_to_eval.pop(0)

            # Whenever the length of the first part changes (all rules of the previous length are done),
            # the queue is pruned to speed up the process
            curr_Itemset_len = curr_rule.get_first_length()

            if previous_rule_len != curr_Itemset_len:
                # Keep a rule only if neither its first nor its second part has been seen so far.
                # Note: this is stricter than dropping exact (first, second) duplicates,
                # so some distinct rules that share one side are pruned as well.
                seen_f = set()
                seen_s = set()
                unique = []
                for obj in queue_rules_to_eval:
                    if str(obj.first) not in seen_f and str(obj.second) not in seen_s:
                        unique.append(obj)
                        # Lists can't be hashed, so their string form is stored instead
                        seen_f.add(str(obj.first))
                        seen_s.add(str(obj.second))

                queue_rules_to_eval = unique
                previous_rule_len = curr_Itemset_len

            eval_rule_subrules(curr_rule, confidence)

    return valid_rules

### 3.3. Test the results   
    We set the confidence threshold to 0.1, a value chosen arbitrarily.

In [21]:
# Each result is a [rule, confidence] pair; number the rules starting from 1
for i, (rule, conf) in enumerate(generate_association_rules([C1, C2, C3, C4], 0.1), start=1):
    print("Rule " + str(i) + " : ", rule.get_first(), "->", rule.get_second(), "\nConfidence:", np.round(conf, 3))

Rule 1 :  [623, 795, 853] -> 571 
Confidence: 0.91
Rule 2 :  [571, 795, 853] -> 623 
Confidence: 0.916
Rule 3 :  [571, 623, 853] -> 795 
Confidence: 0.919
Rule 4 :  [571, 623, 795] -> 853 
Confidence: 0.93
Rule 5 :  [411, 572, 579] -> 350 
Confidence: 0.945
Rule 6 :  [350, 572, 579] -> 411 
Confidence: 0.95
Rule 7 :  [350, 411, 579] -> 572 
Confidence: 0.941
Rule 8 :  [350, 411, 572] -> 579 
Confidence: 0.936
Rule 9 :  [350, 411, 842] -> 803 
Confidence: 0.956
Rule 10 :  [350, 411, 803] -> 842 
Confidence: 0.944
Rule 11 :  [290, 458, 888] -> 208 
Confidence: 0.96
Rule 12 :  [208, 458, 888] -> 290 
Confidence: 0.951
Rule 13 :  [208, 290, 888] -> 458 
Confidence: 0.942
Rule 14 :  [208, 290, 458] -> 888 
Confidence: 0.959
Rule 15 :  [354, 583, 617] -> 158 
Confidence: 0.951
Rule 16 :  [158, 583, 617] -> 354 
Confidence: 0.944
Rule 17 :  [158, 354, 617] -> 583 
Confidence: 0.933
Rule 18 :  [158, 354, 583] -> 617 
Confidence: 0.936
Rule 19 :  [541, 631, 963] -> 494 
Confidence: 0.95
Rule 20