# Association Rules & EDA

In [2]:
%matplotlib notebook

Make all required imports. For this notebook, you need to install the Python package [`efficient-apriori`](https://pypi.org/project/efficient-apriori/). You can add further packages if needed.

In [3]:
import numpy as np
import pandas as pd
from datetime import datetime


from efficient_apriori import apriori

from src.utils import powerset, binary_split

## 1 Complexity Analysis of Naive Brute Force Approach (6 Points)

The most naive approach for mining association rules would be to generate all possible rules and check if their support and confidence exceed the specified thresholds `minsup` and `minconf`. In the lecture, you have learned that, given $d$ unique items in a dataset of transactions, there are $3^d - 2^{d+1} + 1$ possible rules.

**Task 1 (6 points)** Prove that $d$ unique items result in $3^d - 2^{d+1} + 1$ possible rules! (Hint: Write out all possible rules for $d = 2, 3, 4, ...$ items; you should quickly spot the pattern that will allow you to validate the formula)

**Your answer:**

Given $d$ unique items, a rule $X \Rightarrow Y$ consists of two non-empty, disjoint itemsets $X$ and $Y$; rules with an empty left-hand side or an empty right-hand side are not counted.


For the left-hand side of a rule: consider an itemset $X$ with $k$ elements, where $1 \le k \le d-1$.

There are $\begin{pmatrix} d \\ k \end{pmatrix}$ possible itemsets of size $k$.

For the right-hand side of a rule: an itemset of size $k$ can form a rule with any non-empty subset of the remaining $d-k$ items.

An itemset of size $k$ can therefore form a rule with $2^{d-k}-1$ non-empty itemsets.


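So the total number of possible rules $N$ is:

$$
\begin{aligned}
N &= \sum_{k=1}^{d-1}\binom{d}{k}\left(2^{d-k} - 1\right) \\
  &= \sum_{k=0}^{d-1}\binom{d}{k}\left(2^{d-k} - 1\right) - \left(2^d - 1\right) \\
  &= \sum_{k=0}^{d}\binom{d}{k}2^{d-k} - \sum_{k=0}^{d}\binom{d}{k} - \left(2^d - 1\right) \\
  &= 3^d - 2^d - 2^d + 1 \\
  &= 3^d - 2^{d+1} + 1
\end{aligned}
$$

The second line adds and subtracts the $k=0$ term, the third line adds the $k=d$ term (which is $0$), and the fourth line uses the binomial theorem: $\sum_{k=0}^{d}\binom{d}{k}2^{d-k} = (2+1)^d = 3^d$ and $\sum_{k=0}^{d}\binom{d}{k} = (1+1)^d = 2^d$.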
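The closed form can also be sanity-checked by brute force for small $d$. The sketch below enumerates every pair of non-empty, disjoint itemsets and compares the count with the formula; the function name `count_rules_brute_force` is made up for this illustration.

```python
from itertools import combinations

def count_rules_brute_force(d):
    """Count all rules X => Y with X and Y non-empty and disjoint over d items."""
    items = range(d)
    count = 0
    for k in range(1, d):  # size of the left-hand side X
        for X in combinations(items, k):
            remaining = [i for i in items if i not in X]
            # Every non-empty subset of the remaining items is a valid right-hand side Y
            for m in range(1, len(remaining) + 1):
                count += sum(1 for _ in combinations(remaining, m))
    return count

for d in range(2, 7):
    # Brute-force count next to the closed form
    print(d, count_rules_brute_force(d), 3**d - 2**(d + 1) + 1)
```

Both columns should agree for every $d$; this is only a check for small values, not a replacement for the proof.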


---------------------------------------------------------------------------

## 2 Finding Association Rules - Implementation (14 Points)

Your task is to implement association rule mining. Of course, we do not use the naive approach discussed above where we would generate and check $3^d - 2^{d+1} + 1$ rules, with $d$ being the number of unique items in a dataset.

In the lecture, we saw that an association rule $X\Rightarrow Y$ has sufficient support only if the itemset $X\cup Y$ has sufficient support -- that is, $X\cup Y$ is a frequent itemset. Although there are still $2^d-1$ possible itemsets we need to check, this brute-force approach for Frequent Itemset Generation is easy to implement and gives a better understanding of the complexity of the task of association rule mining.

### Auxiliary Methods

We provide you with two methods to make the implementation tasks easier for you.

Given a set of items, `powerset()` returns all possible subsets of items with a specified minimum and maximum length. For example, you can use this method to generate all itemsets for a transaction.

In [4]:
for subset in powerset(('c', 'b', 'a'), min_len=1, max_len=3):
    print(subset)

('a',)
('b',)
('c',)
('a', 'b')
('a', 'c')
('b', 'c')
('a', 'b', 'c')
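`powerset()` is imported from `src.utils`, and its source is not shown here. As a rough sketch of how such a helper could be written, the function below builds on `itertools.combinations`; the name `powerset_sketch` and the default values are assumptions for illustration, not the course implementation.

```python
from itertools import combinations

def powerset_sketch(items, min_len=1, max_len=None):
    """Yield all subsets of `items` with min_len <= size <= max_len as sorted tuples."""
    unique_items = sorted(set(items))
    if max_len is None:
        max_len = len(unique_items)
    for size in range(min_len, min(max_len, len(unique_items)) + 1):
        # combinations() over a sorted input yields sorted tuples in lexicographic order
        yield from combinations(unique_items, size)

for subset in powerset_sketch(('c', 'b', 'a'), min_len=1, max_len=3):
    print(subset)
```

Sorting the items once up front is what makes every returned tuple usable as a canonical dictionary key.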


Given a set of items, `binary_split()` returns all combinations of how the input set can be split into 2 non-empty subsets (where the union of both subsets forms the input set). It's easy to see that you can use this to get from an itemset to all possible association rules.

In [5]:
for X, Y in binary_split(('b','c','a')):
    print('{} => {}'.format(X, Y))

('a',) => ('b', 'c')
('b',) => ('a', 'c')
('c',) => ('a', 'b')
('a', 'b') => ('c',)
('a', 'c') => ('b',)
('b', 'c') => ('a',)
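As with `powerset()`, the source of `binary_split()` is not part of this notebook. A possible sketch, assuming the same sorted-tuple convention, picks every proper non-empty subset as the left part and uses the remaining items as the right part; `binary_split_sketch` is a hypothetical name.

```python
from itertools import combinations

def binary_split_sketch(items):
    """Yield all (left, right) pairs of non-empty, disjoint sorted tuples covering `items`."""
    unique_items = sorted(set(items))
    for size in range(1, len(unique_items)):
        for left in combinations(unique_items, size):
            # The right part consists of all items that are not in the left part
            right = tuple(i for i in unique_items if i not in left)
            yield left, right

for X, Y in binary_split_sketch(('b', 'c', 'a')):
    print('{} => {}'.format(X, Y))
```

For an itemset with a single item, the loop body never runs, so 1-itemsets produce no rules.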


**Important:** Note that we implement each transaction and each itemset as Python `tuple` and not as a Python `set`. This makes the implementation of the algorithm much easier as tuples can be used as keys in dictionaries (sets cannot). To ensure that you don't have to worry about the order of items in a tuple, both `powerset` and `binary_split` return subsets or pairs of subsets where the items are sorted -- check the example outputs above.

### Toy Dataset

The following dataset with 5 transactions and 6 different items is taken directly from the lecture slides. This should make it easier to test your implementation. Note that you only implement the brute-force approach, which would not scale to the real-world dataset we look at later on.

In [6]:
transactions_demo = [
    ('bread', 'yogurt'),
    ('bread', 'milk', 'cereal', 'eggs'),
    ('yogurt', 'milk', 'cereal', 'cheese'),
    ('bread', 'yogurt', 'milk', 'cereal'),
    ('bread', 'yogurt', 'milk', 'cheese')
]

In the lecture, we talked about the completely naive approach of generating and checking all possible association rules given a set of $d$ unique items. This would result in $3^d - 2^{d+1} + 1$ rules to generate and check (you just proved this). A first way to address this $O(3^d)$ complexity was to decouple the calculation of support and confidence, which allowed us to split the task of finding association rules into two parts:

* **(1) Frequent Itemset Generation:** Identify all frequent itemsets -- that is, itemsets with a support greater than or equal to `minsup`. The key observation was that only frequent itemsets allow us to derive rules with sufficient support.

* **(2) Association Rules Generation:** Given all frequent itemsets, generate all possible rules and check if their confidence is at least `minconf`.

We have seen in the lecture that Step (2) is arguably less problematic, as the information required to calculate the confidence of a rule has already been generated during Step (1). We therefore focused on the complexity of Frequent Itemset Generation. Again, we first looked at the naive approach of checking for all possible itemsets whether their support is at least `minsup`. Here, given $d$ unique items, the number of possible itemsets is $2^d-1$.

In the following, you will implement Association Rule Mining using this naive approach. Although this approach is, with $O(2^d)$, still exponential, its implementation is straightforward, provides a better understanding of support and confidence, and helps to appreciate the complexity that calls for more advanced methods such as the Apriori algorithm.
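To make the measures concrete, here is a small worked example for the single rule `('cereal',) => ('milk',)` on the toy transactions. It uses the standard definitions: support is the fraction of transactions containing an itemset, confidence is $sup(X \cup Y) / sup(X)$, and lift is $sup(X \cup Y) / (sup(X) \cdot sup(Y))$. The helper `support` is a name chosen for this sketch only.

```python
import math

transactions = [
    ('bread', 'yogurt'),
    ('bread', 'milk', 'cereal', 'eggs'),
    ('yogurt', 'milk', 'cereal', 'cheese'),
    ('bread', 'yogurt', 'milk', 'cereal'),
    ('bread', 'yogurt', 'milk', 'cheese'),
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

sup_xy = support(('cereal', 'milk'), transactions)  # support of X u Y
sup_x = support(('cereal',), transactions)          # support of X
sup_y = support(('milk',), transactions)            # support of Y

confidence = sup_xy / sup_x
lift = sup_xy / (sup_x * sup_y)
print(sup_xy, confidence, lift)
```

Every transaction that contains cereal also contains milk, which is why the confidence of this rule reaches its maximum.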

### 2.1 Step 1: Frequent Itemset Generation (Brute Force) (7 Points)

Implement the brute force approach for Frequent Itemset Generation -- i.e., generate all possible itemsets for each transaction and calculate the overall support -- using the template for method `find_frequent_itemsets()` below. The expected format of the output is given in the comments below. The auxiliary method `powerset()` should help for this task.

In [7]:
def find_frequent_itemsets(transactions, minsup):
    
    num_transactions = len(transactions)
    
    #########################################################################################
    # Step 1: Count the number of occurrences of all possible itemsets
    #########################################################################################

    # Create a dictionary to keep track of the support counts for each itemset
    # e.g., support_counts = {(a,): 4, (b,): 20, (c,): 5, (a, c): 2, ...}
    support_counts = {}
    
    ### Your code starts here ###############################################################

    for transaction in transactions:
        # Enumerate every non-empty itemset contained in this transaction
        for itemset in powerset(transaction, min_len=1, max_len=len(transaction)):
            support_counts[itemset] = support_counts.get(itemset, 0) + 1
        
        
    ### Your code ends here #################################################################
                
                
    #########################################################################################
    # Step 2: Filter all itemset with a support >= minsup ==> frequent item sets
    #########################################################################################
    
    # In the end, frequent_itemsets as dictionary (key = itemset, value = support)
    # e.g., frequent_itemsets = {(a,): 0.8, (b,): 0.6, (c,): 0.8, (a, c): 0.4, ...}
    frequent_itemsets = {}

    ### Your code starts here ###############################################################

    for itemset, count in support_counts.items():
        # Relative support = fraction of transactions containing the itemset
        support = count / num_transactions
        if support >= minsup:
            frequent_itemsets[itemset] = support
    
    ### Your code ends here #################################################################

    # Return frequent itemsets (incl. their support)
    return frequent_itemsets

Test your implementation of `find_frequent_itemsets()`. The results should match the examples on the lecture slides.

In [8]:
frequent_itemsets = find_frequent_itemsets(transactions_demo, 0.6)

for itemset, support_count in frequent_itemsets.items():
    print(itemset, support_count)

('bread',) 0.8
('yogurt',) 0.8
('bread', 'yogurt') 0.6
('cereal',) 0.6
('milk',) 0.8
('bread', 'milk') 0.6
('cereal', 'milk') 0.6
('milk', 'yogurt') 0.6


### 2.2 Step 2: Find Association Rules (7 Points)

Complete method `find_association_rules()` to find all association rules with sufficient support and confidence for a given set of transactions. This method uses `find_frequent_itemsets` to first compute all frequent itemsets. Again, the expected format of the output is given in the comments below. The auxiliary method `binary_split()` should help for this task.

In [9]:
def find_association_rules(transactions, minsup=0.0, minconf=0.0):

    # Perform Step 1: Frequent Itemset Generation
    frequent_itemsets = find_frequent_itemsets(transactions, minsup)
    
    # In the end, rules is a dictionary (key = (X, Y), value = (support, confidence, lift))
    # e.g., {(('cereal',), ('milk',)): (0.6, 1.0, 1.25), ...}
    rules = {}
    
    ### Your code starts here ###############################################################
    for itemset, sup in frequent_itemsets.items():
        for X, Y in binary_split(itemset):
            # X and Y are subsets of a frequent itemset and hence frequent themselves,
            # so their supports are available in the dictionary
            sup_X = frequent_itemsets[X]
            sup_Y = frequent_itemsets[Y]
            conf = sup / sup_X
            lift = sup / (sup_X * sup_Y)
            if conf >= minconf:
                rules[(X, Y)] = (sup, conf, lift)
        
        
    ### Your code ends here #################################################################
                
    return rules

Test your implementation of `find_association_rules()`. The results should match with the examples on the lecture slides. You can also use the `efficient-apriori` package on the toy data to see if your implementation returns the same results; see below.

In [10]:
rules = find_association_rules(transactions_demo, minsup=0.6, minconf=1.0)
    
for (X, Y), (sup, conf, lift) in rules.items():
    print('Rule [{} => {}] (support: {}, confidence: {}, lift: {})'.format(X, Y, sup, conf, lift))

Rule [('cereal',) => ('milk',)] (support: 0.6, confidence: 1.0, lift: 1.25)
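The lookups `frequent_itemsets[X]` and `frequent_itemsets[Y]` in `find_association_rules()` rely on the fact that every subset of a frequent itemset is itself frequent, since a subset occurs in at least as many transactions as its superset. The sketch below verifies this property on the toy data; it recomputes the frequent itemsets with a local helper (`support`, a name chosen for this illustration) so that it runs on its own.

```python
from itertools import combinations

transactions = [
    ('bread', 'yogurt'),
    ('bread', 'milk', 'cereal', 'eggs'),
    ('yogurt', 'milk', 'cereal', 'cheese'),
    ('bread', 'yogurt', 'milk', 'cereal'),
    ('bread', 'yogurt', 'milk', 'cheese'),
]
minsup = 0.6

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(set(itemset) <= set(t) for t in transactions) / len(transactions)

items = sorted({item for t in transactions for item in t})
frequent = {
    candidate
    for size in range(1, len(items) + 1)
    for candidate in combinations(items, size)
    if support(candidate) >= minsup
}

# Every non-empty proper subset of a frequent itemset must be frequent as well
downward_closed = all(
    subset in frequent
    for itemset in frequent
    for size in range(1, len(itemset))
    for subset in combinations(itemset, size)
)
print(len(frequent), downward_closed)
```

The same property is what the Apriori algorithm exploits to prune candidates early.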


### Comparison with `efficient-apriori` package

You can run the apriori algorithm over the demo data to check if your implementation is correct. Try different values for the parameters `minsup` and `minconf` and compare the results. Note that the order of the returned association rules might differ between your implementation and the apriori one.

In [11]:
_, rules = apriori(transactions_demo, min_support=0.6, min_confidence=1.0, max_length=4)

for r in rules:
    print('Rule [{} => {}] (support: {}, confidence: {}, lift: {})'.format(r.lhs, r.rhs, r.support, r.confidence, r.lift))


Rule [('cereal',) => ('milk',)] (support: 0.6, confidence: 1.0, lift: 1.25)


---------------------------------------------------------------------------

## 3 Association Rule Mining over Real-World Data (30 Points)

Although you have now implemented association rule mining yourself, the brute-force approach for Frequent Itemset Generation does not scale to real-world datasets. Even small real-world datasets contain several thousand unique items, so the number of $2^d-1$ possible itemsets quickly explodes. You can try running your implementation on the retail dataset below, but chances are very high that the notebook will crash :).

In the following, we therefore use the `efficient-apriori` package that implements the Apriori algorithm.



### Dataset

This Online Retail II data set contains all the transactions of a UK-based and registered, non-store online retailer. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers. (Source: https://archive.ics.uci.edu/ml/machine-learning-databases/00502/)

Attribute Information:

* **Invoice**: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
* **StockCode**: Product (item) code. Nominal. A 5-digit integral number and an optional letter, uniquely assigned to each distinct product. The optional letter is used to further distinguish products that only differ w.r.t., for example, their color; see Row 2 and 3 below.
* **Description**: Product (item) name. Nominal.
* **Quantity**: The quantities of each product (item) per transaction. Numeric.
* **InvoiceDate**: Invoice date and time. Numeric. The day and time when a transaction was generated.
* **Price**: Unit price. Numeric. Product price per unit in sterling.
* **CustomerID**: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.

**Important:** 
* We treat products that only differ in the optional letter as different products. For example, 79323P (PINK CHERRY LIGHTS) and 79323W (WHITE CHERRY LIGHTS) are different products.
* Although the basic algorithm for Association Rule Mining we have seen in the lecture does not consider further information such as the quantity and price of items, the time of the transaction, or customer information, these attributes might still affect the data cleaning process.

In [12]:
df_retail = pd.read_csv('data/online-retail.csv')

df_retail.head()

|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID |
|---|---------|-----------|-------------|----------|-------------|-------|-------------|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 12/01/09 07:45 AM | 6.95 | 13085.0 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 12/01/09 07:45 AM | 6.75 | 13085.0 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 12/01/09 07:45 AM | 6.75 | 13085.0 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 12/01/09 07:45 AM | 2.1 | 13085.0 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 12/01/09 07:45 AM | 1.25 | 13085.0 |


In [13]:
num_entries, num_attributes = df_retail.shape

print('There are {} entries, each with {} attributes.'.format(num_entries, num_attributes))

There are 525461 entries, each with 7 attributes.


Since association rules will only have stock codes as antecedents and consequents, they are not easy to interpret. To quickly map stock codes to descriptions, we can generate a dictionary for later use. This might take a couple of seconds. 

**Note:** This is not required for completing the notebook, but it helps to better interpret the rules. You can also check out the lecture notebook about Association Rule Mining to see some examples.

In [14]:
%%time
code2desc = { row['StockCode']:row['Description'] for  idx, row in df_retail.iterrows() }

CPU times: user 22.9 s, sys: 174 ms, total: 23 s
Wall time: 23.4 s
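Most of the time in the cell above is spent in `iterrows()`, which creates a pandas `Series` for every row. A possible faster alternative is to zip the two columns directly; the sketch below compares both variants on a tiny stand-in DataFrame (`df_example`, built from the rows shown in the preview above). In both variants, a later row overwrites an earlier one with the same stock code.

```python
import pandas as pd

# Tiny stand-in for df_retail, using values from the preview above
df_example = pd.DataFrame({
    'StockCode': ['85048', '79323P', '79323W'],
    'Description': [
        '15CM CHRISTMAS GLASS BALL 20 LIGHTS',
        'PINK CHERRY LIGHTS',
        'WHITE CHERRY LIGHTS',
    ],
})

# Row-by-row variant as used in the notebook
code2desc_slow = {row['StockCode']: row['Description'] for _, row in df_example.iterrows()}

# Column-wise variant: pair up the two columns without creating a Series per row
code2desc_fast = dict(zip(df_example['StockCode'], df_example['Description']))

print(code2desc_fast == code2desc_slow)
```

On the full dataset, the column-wise variant would be applied to `df_retail` instead of `df_example`.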


Simple example for using `code2desc`

In [15]:
stock_code = '85048'

print('The item (description) for {} is: {}'.format(stock_code, code2desc[stock_code]))

The item (description) for 85048 is: 15CM CHRISTMAS GLASS BALL 20 LIGHTS


### 3.1 Exploratory Data Analysis & Data Cleaning (20 Points)

#### 3.1 a) Find and Remove "Dirty" Records (10 Points)

If you check the dataset against its description -- see the attribute information above -- you will notice that many records are "dirty", meaning they are not in the expected format. In the following, **identify at least 3 cases** of dirty records, and clean your data accordingly.


**Important:** 
* Recall from the lecture that data cleaning often involves making certain decisions. As such, you might come up with different steps than other students. This is OK as long as you can reasonably justify your steps.
* A transaction generally contains multiple rows/records. If at least one row/record of a transaction needs to be removed because it's dirty, you should remove all records of the same transaction to avoid inconsistencies.
* Perform the data cleaning on `df_retail_cleaned`, a copy of the original dataset; see code cell below. Later tasks will work on the original dataset `df_retail` to ensure that the results are consistent and do not depend on your choice of data preprocessing.

Please provide your answer below. It should list the different cases of "dirty" records you have identified and briefly discuss which data cleaning steps you can and/or need to perform to address those records.

**Your answer:**



(1) Incomplete data: missing values

E.g., missing values in 'Customer ID'.

Data cleaning steps:
- Check which columns contain missing values using the `isna()` method.
- Remove records that contain missing values in columns such as 'Customer ID'.

(2) Duplicate records

Some records might be duplicated. If so, keep only the first occurrence using the pandas `drop_duplicates()` method.

(3) Incorrect data in 'StockCode'

E.g., 'StockCode' = 'M', while 'StockCode' should be a 5-digit integral number and an optional letter.

Data cleaning step: remove records whose 'StockCode' is not a 5-digit integral number followed by an optional letter.

(4) Incorrect data in 'Quantity'

E.g., 'Quantity' = -770, while 'Quantity' should be an integer greater than 0.

Data cleaning step: either remove the records with 'Quantity' < 0 or, assuming that the negative sign is a data-entry mistake, turn the values into positive ones using the `abs()` method (absolute value).

(5) Incorrect data in 'Price'

E.g., 'Price' = 0, while the product price per unit should generally not be 0.

Data cleaning step: remove records with 'Price' <= 0; since such a transaction contains dirty data, all other records of the same transaction can be removed as well.

(6) (Optional) Cancellations

An 'Invoice' that starts with 'C' indicates a cancellation, so we might want to remove these records.

Data cleaning step: remove records whose 'Invoice' starts with 'C'.


Use the code cell below to actually implement your steps to identify "dirty" records as well as your steps to clean the data. The results should back up your answer above. Feel free to split the cell into multiple code cells to improve organization (not a must, though).

In [16]:
# We first create a copy of the dataset and use this one to clean the data.
df_retail_cleaned = df_retail.copy()


### Your code starts here ###############################################################

# 1. Check and remove missing values
print('Check empty data:')
print(df_retail_cleaned.isna().sum())  # number of missing values per column
# 'Description' and 'Customer ID' contain NaN values; drop every row with a missing value
df_retail_cleaned = df_retail_cleaned.dropna()
print('After removing nan data:')
df_retail_cleaned.shape



Check empty data:
Invoice             0
StockCode           0
Description      2928
Quantity            0
InvoiceDate         0
Price               0
Customer ID    107927
dtype: int64
After removing nan data:


(417534, 7)

In [17]:
# 2. detect duplicate values
print(df_retail_cleaned.shape)
print(df_retail_cleaned.drop_duplicates().shape) #smaller than df_retail_cleaned, which means there are duplicate values
# drop duplicate entries
df_retail_cleaned.drop_duplicates(inplace = True)


(417534, 7)
(410763, 7)


In [18]:
# 3. Check incorrect data for 'StockCode'

# 'StockCode' should only contain a 5-digit integral number and an optional letter
codes = df_retail_cleaned['StockCode']
five_digits = codes.str.isdigit() & (codes.str.len() == 5)
five_digits_and_letter = (
    (codes.str.len() == 6)
    & codes.str[:5].str.isdigit()
    & codes.str[-1].str.isalpha()
)
correct_stockcode = five_digits | five_digits_and_letter
df_retail_cleaned = df_retail_cleaned[correct_stockcode]
df_retail_cleaned.shape

(408090, 7)

In [19]:
# 4. check incorrect data for 'Quantity'
print('Check incorrect for column Quantity:')
print(df_retail_cleaned[df_retail_cleaned['Quantity'] < 0].shape)
# assuming negative quantity is a mistake, turn them into positive integers (absolute value)
df_retail_cleaned['Quantity'] = df_retail_cleaned['Quantity'].abs()

Check incorrect for column Quantity:
(9268, 7)


In [20]:
# 5. Incorrect data 'Price'
print('Check incorrect column price:')
print(df_retail_cleaned[df_retail_cleaned['Price'] <= 0].shape)
df_retail_cleaned = df_retail_cleaned[df_retail_cleaned['Price'] > 0] #drop all records where their price <= 0

# because 'price' is dirty, drop all records of the same transaction to avoid inconsistencies
dirty_price_invoice = list(df_retail_cleaned[df_retail_cleaned['Price'] <= 0]['Invoice'])
#print(dirty_price_invoice)
df_retail_cleaned = df_retail_cleaned[~df_retail_cleaned.Invoice.isin(dirty_price_invoice)] #drop all corresponding transactions
print('After this step:')
df_retail_cleaned.shape

Check incorrect column price:
(28, 7)
After this step:


(408062, 7)
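Note that in the cell above, the rows with `Price <= 0` are filtered out before `dirty_price_invoice` is computed, so that list is empty and only the 28 flagged rows themselves are dropped (408090 - 28 = 408062). If the intention is to remove the complete transactions, the invoice numbers have to be collected first. The sketch below shows this order on made-up data; `df_example` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical mini dataset: invoice 1001 contains one row with a price of 0
df_example = pd.DataFrame({
    'Invoice': ['1001', '1001', '1002', '1003'],
    'StockCode': ['11111', '22222', '33333', '44444'],
    'Price': [2.5, 0.0, 1.0, 4.0],
})

# 1) Collect the invoices that contain at least one non-positive price
dirty_invoices = df_example.loc[df_example['Price'] <= 0, 'Invoice'].unique()

# 2) Drop all rows that belong to one of these invoices
df_example_cleaned = df_example[~df_example['Invoice'].isin(dirty_invoices)]

print(df_example_cleaned['Invoice'].tolist())
```

Applying this order to `df_retail_cleaned` may remove more than 28 rows, so the row counts reported in this notebook would change.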

In [21]:
# 6. Drop cancelled transactions (optional)

# An 'Invoice' that starts with 'C' indicates a cancellation
df_retail_cleaned = df_retail_cleaned[~df_retail_cleaned['Invoice'].str.startswith('C')]

In [22]:
### Your code ends here #################################################################


print('After preprocessing, There are now {} entries.'.format(df_retail_cleaned.shape[0]))


After preprocessing, There are now 398794 entries.


#### 3.1 b) Basic Facts about the Dataset (8 Points)

The following tasks are about getting basic insights into the dataset. As the data preprocessing steps you choose to perform might affect the results of these tasks, please use the original data stored in `df_retail`. 

Use the code cell below to actually implement your steps that enabled you to answer the questions. This allows for a fairer grading even if your answers are not (exactly) correct. Again, please use the uncleaned dataset `df_retail` to ensure consistent results.

This is a markdown cell. Please fill in your answer for (1)~(8).

| No. | Question                                                                                       | Answer     |
|-----|------------------------------------------------------------------------------------------------|------------|
| 1)  | Starting date of the dataset?                                                                  | 2009-12-01 |
| 2)  | Ending date of the dataset?                                                                    | 2010-12-09 |
| 3)  | Number of customers?                                                                           | 4383       |
| 5)  | Number of unique items?                                                                        | 4632       |
| 4)  | Number of transactions (incl. canceled transactions)?                                          | 28816      |
| 6)  | Number of transactions Customer ID 17850 has made (incl. canceled transactions)?               | 158        |
| 7)  | Which customer (ID) has made the most transactions (incl. canceled transactions) and how many? | 14911, 270 |
| 8)  | What is the item ID of the best-seller (best-seller = item with the highest sales volume)?     | 21212      |

In [23]:
### Your code starts here ###############################################################

#1)
print(pd.to_datetime(df_retail['InvoiceDate']).min()) #starting date
#2)
print(pd.to_datetime(df_retail['InvoiceDate']).max()) #ending date
#3)
print(df_retail['Customer ID'].nunique())
#5)
print(df_retail['StockCode'].nunique())
#4)
print(df_retail['Invoice'].nunique())
#6)
print(df_retail[df_retail['Customer ID'] == 17850]['Invoice'].nunique())
#7)
#print(df_retail.groupby('Customer ID')['Invoice'].nunique().sort_values(ascending=False))
print('which customer(ID) has made the most transactions: ')
print(df_retail.groupby('Customer ID')['Invoice'].nunique().idxmax())
print('how many:')
print(df_retail.groupby('Customer ID')['Invoice'].nunique().max())
#8)
print('The best-seller is (Item ID): ')
# Sales volume = total quantity sold per item; select 'Quantity' before summing
quantity_per_item = df_retail.groupby('StockCode')['Quantity'].sum()
print(quantity_per_item.idxmax())
print('The highest sales volume is: ')
print(quantity_per_item.max())


### Your code ends here #################################################################


2009-12-01 07:45:00
2010-12-09 20:01:00
4383
4632
28816
158
which customer(ID) has made the most transactions: 
14911.0
how many:
270
The best-seller is (Item ID): 
21212
The highest sales volume is: 
59411


#### 3.1 c) Complexity Analysis for Brute-Force Approach (2 Points)

We know that our brute-force implementation for Frequent Itemset Generation has to check $2^d-1$ itemsets, and we know the number $d$ of unique items from our EDA. Suppose we can count $2^{36}$ itemsets per second.  Will we complete the counting before the sun burns out (the sun has another $5\cdot 10^9$ years to burn)?

**Your Answer:**

No, we will not complete the counting before the sun burns out.

Reason:

Number of seconds per year:
secs = 365 * 24 * 60 * 60 = 31,536,000 = $3.1536 \cdot 10^7$

Number of itemsets we can count in the remaining $5\cdot 10^9$ years:
counts = $2^{36} \cdot 3.1536 \cdot 10^7 \cdot 5 \cdot 10^9 = 2^{36} \cdot 1.5768 \cdot 10^{17} \approx 1.08 \cdot 10^{28} \approx 2^{93}$

Number of unique items (known from the EDA in 3.1 b): 4632

We therefore need to check $2^{4632} - 1$ itemsets, which exceeds the roughly $2^{93}$ itemsets we can count by a factor of about $2^{4539}$.

So we will not complete the counting before the sun burns out.
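Since Python integers have arbitrary precision, the comparison can also be done exactly. The following sketch uses the numbers from the task (counting rate $2^{36}$ per second, $5\cdot 10^9$ years, 4632 unique items); the constant names are chosen for this illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # ignoring leap years
YEARS_LEFT = 5 * 10**9                  # remaining lifetime of the sun
ITEMSETS_PER_SECOND = 2**36             # assumed counting rate from the task
NUM_UNIQUE_ITEMS = 4632                 # number of unique items from the EDA

# Maximum number of itemsets we can count before the sun burns out
max_counts = ITEMSETS_PER_SECOND * SECONDS_PER_YEAR * YEARS_LEFT

# Number of itemsets the brute-force approach has to check
num_itemsets = 2**NUM_UNIQUE_ITEMS - 1

# bit_length() gives the number of binary digits, i.e. roughly log2 of the value
print(max_counts.bit_length(), num_itemsets.bit_length())
print(num_itemsets > max_counts)
```

Comparing the bit lengths shows the gap directly: the number of itemsets has thousands of binary digits, while the number of feasible counts has fewer than a hundred.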


### 3.2 Running the Apriori Algorithm (10 Points)

The `efficient-apriori` package assumes a list of transactions as input; see `transactions_demo`. However, right now we have all transactions in a table-like format. We therefore need to transform the data. Note that using the `StockCode` to represent an item will suffice; we don't need the whole description here. It also saves memory as stock codes are typically much shorter than descriptions.

In [24]:
transactions_retail = df_retail.groupby(['Invoice']).agg({'StockCode': tuple})['StockCode'].to_list()

#
# Output format
#
# transactions_retail = [
#     (22554, 82494L, 21975),    
#     (21175, 84991, 85099F, 85099B),
#     (85099B, 21930),
#     ...
#]

**Hint:** You can check the lecture notebook for Association Rules to see examples using the `efficient-apriori` package, and how to evaluate results. This should help with the following tasks and questions.

**3.2 a) (1 Point)** Run efficient-apriori in python with min_support=0.5%, min_confidence=20%, max_length=4. Write down the rule with the highest lift (denoted as `r1`, e.g., `r1 = ( (A, B), (C,) )` to represent the rule $\{A,B\}\Rightarrow \{C\}$, where A, B and C are stock codes)

In [25]:
%%time

itemsets, rules = apriori(transactions_retail, min_support=0.005, min_confidence=0.2, max_length=4)
rules_filtered = rules
top_rule = sorted(rules_filtered, key=lambda rule: rule.lift, reverse=True)[0]
print(top_rule)

antecedent = [ code2desc[c] for c in top_rule.lhs]
consequent = [ code2desc[c] for c in top_rule.rhs]
print('{} => {} -- lift: {}'.format(antecedent, consequent, top_rule.lift))

r1 = tuple((top_rule.lhs, top_rule.rhs))

r1_lift = top_rule.lift

{22521, 22522} -> {22520, 22523} (conf: 0.924, supp: 0.005, lift: 155.628, conv: 12.999)
['CHILDS GARDEN TROWEL PINK', 'CHILDS GARDEN FORK BLUE '] => ['CHILDS GARDEN TROWEL BLUE ', 'CHILDS GARDEN FORK PINK'] -- lift: 155.62820777433782
CPU times: user 16min, sys: 2.23 s, total: 16min 2s
Wall time: 16min 3s


**3.2 b) (1 Point)** Run efficient-apriori in python with min_support=1%, min_confidence=20%, max_length=4. Write down the rule with the highest lift (denoted as `r2`, e.g., `r2 = ( (A, B), (C,) )` to represent the rule $\{A,B\}\Rightarrow \{C\}$, where A, B and C are stock codes). 

In [26]:
%%time

itemsets, rules = apriori(transactions_retail, min_support=0.01, min_confidence=0.2, max_length=4)
rules_filtered = rules
top_rule = sorted(rules_filtered, key=lambda rule: rule.lift, reverse=True)[0]
print(top_rule)

antecedent = [ code2desc[c] for c in top_rule.lhs]
consequent = [ code2desc[c] for c in top_rule.rhs]
print('{} => {} -- lift: {}'.format(antecedent, consequent, top_rule.lift))



r2 = tuple((top_rule.lhs, top_rule.rhs))

r2_lift = top_rule.lift

{22699} -> {22697} (conf: 0.741, supp: 0.010, lift: 53.631, conv: 3.804)
['ROSES REGENCY TEACUP AND SAUCER '] => ['GREEN REGENCY TEACUP AND SAUCER'] -- lift: 53.63111855574167
CPU times: user 4min 17s, sys: 623 ms, total: 4min 18s
Wall time: 4min 18s


**3.2 c) (1 Point)** Run efficient-apriori in python with min_support=0.5%, min_confidence=40%, max_length=4. Write down the rule with the highest lift (denoted as `r3`, e.g., `r3 = ( (A, B), (C,) )` to represent the rule $\{A,B\}\Rightarrow \{C\}$, where A, B and C are stock codes). 

In [27]:
%%time

itemsets, rules = apriori(transactions_retail, min_support=0.005, min_confidence=0.4, max_length=4)
rules_filtered = rules
top_rule = sorted(rules_filtered, key=lambda rule: rule.lift, reverse=True)[0]
print(top_rule)

antecedent = [ code2desc[c] for c in top_rule.lhs]
consequent = [ code2desc[c] for c in top_rule.rhs]
print('{} => {} -- lift: {}'.format(antecedent, consequent, top_rule.lift))

r3 = tuple((top_rule.lhs, top_rule.rhs))

r3_lift = top_rule.lift

{22521, 22522} -> {22520, 22523} (conf: 0.924, supp: 0.005, lift: 155.628, conv: 12.999)
['CHILDS GARDEN TROWEL PINK', 'CHILDS GARDEN FORK BLUE '] => ['CHILDS GARDEN TROWEL BLUE ', 'CHILDS GARDEN FORK PINK'] -- lift: 155.62820777433782
CPU times: user 16min 21s, sys: 2.74 s, total: 16min 24s
Wall time: 16min 25s


**3.2 d) (1 Point)** Run efficient-apriori in python with min_support=1%, min_confidence=40%, max_length=4. Write down the rule with the highest lift (denoted as `r4`, e.g., `r4 = ( (A, B), (C,) )` to represent the rule $\{A,B\}\Rightarrow \{C\}$, where A, B and C are stock codes).

In [28]:
%%time

itemsets, rules = apriori(transactions_retail, min_support=0.01, min_confidence=0.4, max_length=4)
rules_filtered = rules
top_rule = sorted(rules_filtered, key=lambda rule: rule.lift, reverse=True)[0]
print(top_rule)

antecedent = [ code2desc[c] for c in top_rule.lhs]
consequent = [ code2desc[c] for c in top_rule.rhs]
print('{} => {} -- lift: {}'.format(antecedent, consequent, top_rule.lift))

r4 = tuple((top_rule.lhs, top_rule.rhs))

r4_lift = top_rule.lift

{22699} -> {22697} (conf: 0.741, supp: 0.010, lift: 53.631, conv: 3.804)
['ROSES REGENCY TEACUP AND SAUCER '] => ['GREEN REGENCY TEACUP AND SAUCER'] -- lift: 53.63111855574167
CPU times: user 4min 26s, sys: 598 ms, total: 4min 26s
Wall time: 4min 27s


**3.2 e) (3 Points)** You must have noticed numerous differences between the 4 runs in a)-d). List at least 3 differences you have found. You may want to consider the elapsed time and the quality of the results. 

**Your Answer:**

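(1) Parameters: the runs use different combinations of `min_support` (0.5% vs. 1%) and `min_confidence` (20% vs. 40%).

(2) Running time: the runs with `min_support` = 0.5% take about 16 minutes, while the runs with `min_support` = 1% take only about 4 minutes.

(3) Results: the runs yield different sets of rules and different best (highest-lift) rules: {22521, 22522} -> {22520, 22523} for a) and c), and {22699} -> {22697} for b) and d).

(4) Quality of the top rules: the top rules differ in confidence, support, and lift (conf: 0.924, supp: 0.005, lift: 155.628 vs. conf: 0.741, supp: 0.010, lift: 53.631).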

**3.2 f) (3 Points)** From your observation, what are the effects of increasing/reducing `min_support` and `min_confidence`? Support your answer with evidence. You can perform more runs of efficient-apriori with different parameter settings, if needed.

**Your Answer:**

(1) Increasing/reducing `min_support`:

- Increasing `min_support` decreases the elapsed time.

Evidence: compare the runs for r3 and r4, which share the same `min_confidence` of 0.4. With `min_support` = 0.005, the wall time is about 16 minutes; with `min_support` = 0.01, the wall time is about 4 minutes.

We can find the same pattern in the runs for r1 and r2. With the same `min_confidence` of 0.2, the wall time is about 16 minutes for `min_support` = 0.005 and about 4 minutes for `min_support` = 0.01.

- Increasing `min_support` changes the best rule found: the lift of the best rule decreases.

Evidence: compare r3, r3_lift with r4, r4_lift. With the same `min_confidence` of 0.4 and `min_support` = 0.005, the rule found is {22521, 22522} -> {22520, 22523} with lift = 155.628; with `min_support` = 0.01, the rule found is {22699} -> {22697} with lift = 53.631. Hence r3_lift > r4_lift.

We can find the same pattern when comparing r1, r1_lift with r2, r2_lift. With the same `min_confidence` of 0.2 and `min_support` = 0.005, the rule found is {22521, 22522} -> {22520, 22523} with lift = 155.628; with `min_support` = 0.01, the rule found is {22699} -> {22697} with lift = 53.631. Hence r1_lift > r2_lift.

(2) Increasing/reducing `min_confidence`:

- Increasing `min_confidence` barely changes the elapsed time (it increased slightly in our runs).

Evidence: compare the runs for r2 and r4, which share the same `min_support` of 0.01. With `min_confidence` = 0.2, the wall time is 4min 18s; with `min_confidence` = 0.4, the wall time is 4min 27s (a slight increase but no big difference).

We can find the same pattern in the runs for r1 and r3. With the same `min_support` of 0.005, the wall time is 16min 3s for `min_confidence` = 0.2 and 16min 25s for `min_confidence` = 0.4 (a slight increase but no big difference).

The change in running time caused by increasing/reducing `min_confidence` is far less significant than the change caused by increasing/reducing `min_support`.

(Stage 1 of the Apriori algorithm, which uses `min_support`, affects the running time much more than Stage 2, which uses `min_confidence`.)

- Increasing `min_confidence` did not change the best rule found.

With the same `min_support` and different `min_confidence`, r1 and r3 are the same rule, and so are r2 and r4.

(We also tried `min_support` = 0.01 with `min_confidence` = 0.6 and still found the same best rule as r2 and r4.)

- Increasing `min_confidence` decreases the size of the resulting rule set.

Evidence: compare the runs for r2 and r4, which share the same `min_support` of 0.01. With `min_confidence` = 0.2, `len(rules)` reports 405 rules; with `min_confidence` = 0.4, only 200 rules are found.
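The strong effect of `min_support` on the running time is plausible because the support threshold determines how many itemsets remain frequent and have to be processed further. As a small illustration (on the toy transactions from Section 2, not on the retail data), the sketch below counts the frequent itemsets for several thresholds by brute force; `num_frequent_itemsets` is a name made up for this example.

```python
from itertools import combinations

transactions = [
    ('bread', 'yogurt'),
    ('bread', 'milk', 'cereal', 'eggs'),
    ('yogurt', 'milk', 'cereal', 'cheese'),
    ('bread', 'yogurt', 'milk', 'cereal'),
    ('bread', 'yogurt', 'milk', 'cheese'),
]
items = sorted({item for t in transactions for item in t})

def num_frequent_itemsets(minsup):
    """Count all itemsets whose relative support is at least `minsup` (brute force)."""
    count = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            sup = sum(set(candidate) <= set(t) for t in transactions) / len(transactions)
            if sup >= minsup:
                count += 1
    return count

for minsup in (0.2, 0.4, 0.6, 0.8):
    print(minsup, num_frequent_itemsets(minsup))
```

The count can only shrink as `minsup` grows, which matches the shorter running times observed for `min_support` = 0.01 compared to 0.005.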