# Association Rules:


Let T be the set of all transactions. 
Let $x_{i}$ be items in the set of all items, I. 
Let X, Y, Z, ... be itemsets (non-empty) that contain some subset of the $x_{i}$'s from I.

**Support Count Function**, defined as $\sigma(X) = \vert \{ t_{i} \vert X \subseteq t_{i}, t_{i} \in T  \} \vert$: the number of transactions that contain X. Note that when we count each transaction, *X must be contained in its entirety*.

**Set Count**: Define $\vert X \vert$ to be a count of the number of elements in the given set.

**Association Rule:** $X \rightarrow Y$: An implication, where X,Y $\subseteq$ I and X $\cap$ Y = $\emptyset$. How is this an "implication"? Recall that a Set Theory Implication is one where X $\subseteq$ Y: every point in X is also in Y, so we can never find a point in X that is not in Y (Here, $\mathbb{T} \rightarrow \mathbb{F}$, not possible). For our Association Rule, the claim is weaker: transactions that contain X *tend* to contain Y as well, but transactions that contain X without Y are allowed. The measures below quantify how often the rule holds.

## Strength/Co-Occurrence Measures:

*Note:* Let $\vert T \vert = N$

### Support (of an Itemset): 

$$s(X) = \frac{\sigma(X)}{N} $$

- Or, proportion of transactions in which itemset X occurs
- Range: [0,1]
- In general, for $X_{j} \subseteq X_{i}$, $s(X_{i}) \leq s(X_{j})$. As our set X gains items, the set of transactions that contain it entirely must stay the same, or shrink.
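The definitions of $\sigma(X)$ and $s(X)$ can be sketched in a few lines of plain Python. The transactions and the function names `support_count` and `support` are made up for illustration; they are not part of any library.

```python
# Toy transactions (hypothetical data, for illustration only).
T = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support_count(X, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    X = set(X)
    return sum(1 for t in transactions if X <= t)  # X <= t tests "X is a subset of t"


def support(X, transactions):
    """s(X) = sigma(X) / N."""
    return support_count(X, transactions) / len(transactions)


print(support_count({"milk"}, T))                # 4
print(support({"milk", "diapers"}, T))           # 0.6
print(support({"milk", "diapers", "beer"}, T))   # 0.4
```

Adding `beer` to `{milk, diapers}` lowers the support from 0.6 to 0.4, which is the "stay the same, or shrink" behaviour described above.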

### Support (of a Rule): 

$$s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N} $$

- Or, proportion of transactions in which $Z = X \cup  Y$ occurs.
- Range: [0,1]
- Again, for $Z_{j} \subseteq Z_{i}$, $s(Z_{i}) \leq s(Z_{j})$.
- Practically, Support is used to determine if our rule is just a random occurrence, or actually something to consider: a rule with very low support may have appeared purely by chance. A *strong* rule is one that has relatively high support.


### Confidence:

$$ c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)} = \frac{s(X \rightarrow Y)}{s(X)} $$

- Or, the proportion of transactions containing X and Y, over the proportion of transactions containing X. 

- Range: [0,1]
-  Measure of Reliability that Y is likely to be present in transactions that contain X. Because Confidence is rarely 1, it can be the case that X is present but not Y. **X -> Y should not be thought of as a set/logical implication, because of this. (there can be transactions where X occurs, but not Y - for set implication, this is not possible).**
- **Claim**: Confidence provides an estimate for conditional probability. Note that we defined X $\cap$ Y = $\emptyset$, so reading X and Y as events would give P(Y $\vert$ X) = P(XY) / P(X) = 0, which doesn't initially make sense. The disjointness is a statement about the *itemsets*, not about events. The relevant events are "the transaction contains X" and "the transaction contains Y", and a transaction contains the itemset X $\cup$ Y exactly when both events occur. So s(X $\cup$ Y) estimates the joint probability, and c(X $\rightarrow$ Y) = s(X $\cup$ Y) / s(X) is the sample estimate of P(Y $\vert$ X).
- The usual comment applies here: correlation does not imply causation. A rule that has high confidence indicates a *co-occurrence* between two sets, not that one causes the other.
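As a rough illustration of the confidence formula, the sketch below computes $c(X \rightarrow Y)$ on a handful of invented transactions. The data and the helper names `sigma` and `confidence` are assumptions for this example only.

```python
# Toy transactions (hypothetical data, for illustration only).
T = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def sigma(X, transactions):
    """Support count: transactions containing all of X."""
    return sum(1 for t in transactions if set(X) <= t)


def confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X), for disjoint itemsets X and Y."""
    X, Y = set(X), set(Y)
    if X & Y:
        raise ValueError("X and Y must be disjoint")
    count_x = sigma(X, transactions)
    if count_x == 0:
        raise ValueError("X never occurs, so confidence is undefined")
    return sigma(X | Y, transactions) / count_x


print(confidence({"diapers"}, {"beer"}, T))  # 0.75
print(confidence({"beer"}, {"diapers"}, T))  # 1.0
```

The two directions differ: every transaction with beer also has diapers, but one transaction with diapers has no beer. Confidence is therefore not symmetric in X and Y.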




## Objective Interest Measures

These are measures that use transactional data to rank different rules. These measures provide statistical information, and are considered to be domain independent (don't need special/expert information to make judgements). 

One limitation of this method is that a user must specify thresholds to cut off the ranked rules. There is no general rule for choosing the cutoff, and choosing one may require domain knowledge. 

The usual approach is to couple support and confidence together, to rank the best rules. These approaches have limitations however, as discussed below.

### Interpreting 2-Term Rules with Contingency Table:

Given $\{A\} \rightarrow \{B\}$, we can form the following contingency table:

CT | B | $\bar{B}$ | Marg(A)
--- | --- | --- | ---
**A** | f$_{11}$ | f$_{10}$ | f$_{1+}$
**$\bar{A}$** | f$_{01}$ | f$_{00}$ | f$_{0+}$
**Marg(B)** | f$_{+1}$ | f$_{+0}$ | N


Define:

- Support: $\frac{f_{11}}{N}$
- Confidence: $\frac{f_{11}}{f_{1+}}$
- Lift: $\frac{N f_{11}}{f_{1+} f_{+1}}$

Probabilistic Interpretation:

- Support estimates the probability that the two itemsets occur together in a transaction, P(AB).
- Confidence estimates the conditional probability P(B $\vert$ A).
- Lift measures the correlation between the two itemsets: the ratio of P(AB) to P(A)P(B).
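The contingency-table definitions translate directly into code. The sketch below takes the four cell counts, where `f00` is the number of transactions containing neither A nor B, and returns the three measures. The function name `rule_measures` and the example counts are invented for illustration.

```python
def rule_measures(f11, f10, f01, f00):
    """Support, confidence and lift of A -> B from a 2x2 contingency table.

    f11: A and B, f10: A without B, f01: B without A, f00: neither.
    """
    n = f11 + f10 + f01 + f00
    f1_plus = f11 + f10   # row marginal: transactions containing A
    f_plus1 = f11 + f01   # column marginal: transactions containing B
    support = f11 / n
    confidence = f11 / f1_plus
    lift = n * f11 / (f1_plus * f_plus1)
    return support, confidence, lift


# Hypothetical counts: 100 transactions in total.
print(rule_measures(f11=30, f10=20, f01=10, f00=40))  # (0.3, 0.6, 1.5)
```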

### Examples when Support and Confidence Fail:

#### Example 1 (over-estimate): $\{Tea\} \rightarrow \{Coffee\}$

CT | Coffee | $\overline{\text{Coffee}}$ | Marg(Tea)
--- | --- | --- | ---
**Tea** | 150 | 50 | 200
**$\overline{\text{Tea}}$** | 650 | 150 | 800
**Marg(Coffee)** | 800 | 200 | 1000

Here, we see that s(T -> C) = 150/1000 = 15%. Confidence is 150/200 = 75%. These look like decent numbers. On closer inspection, however, 80% of the whole sample drinks coffee, yet only 75% of tea drinkers do. Knowing that someone drinks tea therefore *lowers* the chance that they drink coffee: tea is negatively associated with coffee, and the high confidence overstates the rule.

#### Example 2 (underestimate): $\{Tea\} \rightarrow \{Honey\}$

CT | Honey | $\overline{\text{Honey}}$ | Marg(Tea)
--- | --- | --- | ---
**Tea** | 100 | 100 | 200
**$\overline{\text{Tea}}$** | 20 | 780 | 800
**Marg(Honey)** | 120 | 880 | 1000

Here, we have a rule support of 10%, and confidence of 50%, which look unimpressive. Yet, tea is very strongly associated with honey (positive relationship): only 12% of the whole sample uses honey, against 50% of tea drinkers, and 100 of the 120 honey users drink tea. Conversely, 88% of the sample doesn't use honey. Here the modest confidence understates the rule.

Both examples illustrate that support and confidence do not measure correlation between two variables. There are many ways to measure it, such as with Lift:

**Lift**: $l(A \rightarrow B) = \frac{s(A \rightarrow B)}{s(A) \, s(B)}$

Lift is an estimate of correlation. Recall that A and B are independent if P(AB) = P(A) x P(B). Since support is just a measure of the probability that an itemset occurs in a sample, we can define this in terms of support: $s_{\perp} = s(A)s(B)$. Then **lift is the ratio of joint rule support, written s(AB) below, over independently calculated rule support.**

Calculating Independence with Contingency Values:

$$ s(AB) = s(A)s(B) \iff \frac{f_{11}}{N} = \frac{f_{1+}f_{+1}}{N^{2}} \iff f_{11} = \frac{f_{1+}f_{+1}}{N} $$

Interpretation:

- Lift = 1: Then s(AB) = s(A)s(B). This means our two itemsets are independent.

- Lift > 1: Then s(AB) > s(A)s(B). A positive correlation exists between A and B.

- Lift < 1: Then s(AB) < s(A)s(B). A negative correlation exists between A and B.
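To make the interpretation concrete, the sketch below evaluates lift for the Tea/Coffee and Tea/Honey tables above, using the count form $N f_{11} / (f_{1+} f_{+1})$. The helper names `lift` and `describe` are made up for this example.

```python
def lift(f11, f1_plus, f_plus1, n):
    """Lift of A -> B from contingency counts: N * f11 / (f1+ * f+1)."""
    return n * f11 / (f1_plus * f_plus1)


def describe(value, tol=1e-9):
    """Map a lift value onto the three cases listed above."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positive" if value > 1.0 else "negative"


tea_coffee = lift(f11=150, f1_plus=200, f_plus1=800, n=1000)
tea_honey = lift(f11=100, f1_plus=200, f_plus1=120, n=1000)

print(tea_coffee, describe(tea_coffee))          # 0.9375 negative
print(round(tea_honey, 2), describe(tea_honey))  # 4.17 positive
```

Lift recovers what support and confidence missed: Tea -> Coffee falls slightly below 1 despite its 75% confidence, while Tea -> Honey is well above 1 despite its 50% confidence.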

### Other Measures:

**Conviction:** $$ cv(A \rightarrow B) = \frac{1 - s(B)}{1 - c(A \rightarrow B)} $$

This is another measure of the strength of implication. The higher the number, the stronger the implication. 

- Range: [0,$\infty$)

- Not symmetric: in general, cv(A $\rightarrow$ B) $\neq$ cv(B $\rightarrow$ A)

- If A,B are independent, then the value is 1.

- If full confidence (in implication), value is infinite.
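A minimal sketch of conviction, again evaluated on the Tea/Coffee and Tea/Honey counts from the tables above. The function name `conviction` is an assumption for this example, and returning `math.inf` at full confidence is one possible convention for the division by zero.

```python
import math


def conviction(f11, f1_plus, f_plus1, n):
    """cv(A -> B) = (1 - s(B)) / (1 - c(A -> B)), from contingency counts."""
    support_b = f_plus1 / n
    conf = f11 / f1_plus
    if conf == 1.0:
        return math.inf  # the rule is never violated in the data
    return (1 - support_b) / (1 - conf)


print(round(conviction(150, 200, 800, 1000), 4))  # Tea -> Coffee: 0.8
print(round(conviction(100, 200, 120, 1000), 4))  # Tea -> Honey: 1.76
print(conviction(50, 50, 80, 100))                # full confidence: inf
```

Values below 1 (Tea -> Coffee) mean the rule is wrong more often than it would be if A and B were independent. Values above 1 (Tea -> Honey) mean it is wrong less often.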

**Leverage:** $$lev(A \rightarrow B) = s(A \rightarrow B) - s(A)s(B) $$

Measures the difference between the observed joint support and the joint support expected if A and B were independent. Multiplied by N, it is the number of transactions containing both A and B beyond what independence would predict.

- Range: [-1,1] as a loose bound; the attainable values lie in [-0.25,0.25], with both extremes reached at s(A) = s(B) = 0.5.

- $= 0$ when A and B are independent.

- $> 0$ indicates positive relationship

- $< 0$ indicates negative relationship
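Leverage can be checked the same way. This sketch uses the Tea/Coffee and Tea/Honey counts from the tables above; the function name `leverage` is chosen for illustration.

```python
def leverage(f11, f1_plus, f_plus1, n):
    """lev(A -> B) = s(A -> B) - s(A) * s(B), from contingency counts."""
    joint = f11 / n                             # observed joint support
    expected = (f1_plus / n) * (f_plus1 / n)    # joint support under independence
    return joint - expected


print(round(leverage(150, 200, 800, 1000), 4))  # Tea -> Coffee: -0.01
print(round(leverage(100, 200, 120, 1000), 4))  # Tea -> Honey: 0.076
```

The signs agree with lift: slightly negative for Tea -> Coffee, positive for Tea -> Honey. Multiplying by N = 1000 gives about 10 fewer and 76 more joint transactions than independence would predict.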


## Code:

In [4]:
#Imports
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import altair as alt
import pyvis

In [5]:
#Standard Settings:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
alt.renderers.enable('notebook')
alt.data_transformers.enable('default', max_rows=None)
%matplotlib inline 
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 40)
pd.set_option('display.width', 1000)

### Support Functions:

In [10]:
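def encode_units(x):
    # Binarize a quantity: 1 if at least one unit was bought, else 0.
    return 1 if x >= 1 else 0


def loadcleandata(cty):
    # Reading the .xlsx is slow: this 20MB file took about 1min to load.
    df = pd.read_excel('./data/online_retail.xlsx')
    df['Description'] = df['Description'].str.strip()
    df = df.dropna(axis=0, subset=['InvoiceNo'])
    df['InvoiceNo'] = df['InvoiceNo'].astype('str')
    # Drop invoices whose number contains 'C' (assumed here to mark credits/cancellations).
    df = df[~df['InvoiceNo'].str.contains('C')]

    # One row per invoice, one column per product, summed quantities as values.
    basket = (df[df['Country'] == cty].groupby(['InvoiceNo', 'Description'])["Quantity"])
    basket = basket.sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')
    # One-hot encode; newer pandas versions spell applymap as DataFrame.map.
    basket_sets = basket.applymap(encode_units)
    # POSTAGE is a shipping charge, not a product; ignore it if absent.
    basket_sets = basket_sets.drop(columns='POSTAGE', errors='ignore')
    return basket_sets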



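#Note, you still need to sort the DF of rules that you get.

def getrules(bsets, minsup, met, minthresh=1):
    # Mine itemsets with support >= minsup, then keep rules whose metric `met` is >= minthresh.
    # The default threshold of 1 is meant for lift; pass a different value for e.g. confidence.
    frequent_itemsets = apriori(bsets, min_support=minsup, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric=met, min_threshold=minthresh)
    return rules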
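For intuition about what `getrules` returns, here is a minimal pure-Python sketch that enumerates every itemset exhaustively, keeps those meeting a minimum support, and emits rules with lift of at least 1. It is deliberately *not* the Apriori algorithm and not how mlxtend is implemented; the function name `brute_force_rules` and the toy transactions are invented for illustration, and the approach is only feasible for a handful of items.

```python
from itertools import combinations


def brute_force_rules(transactions, min_support, min_lift=1.0):
    """Exhaustively mine rules A -> B with support >= min_support and lift >= min_lift."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    # Step 1: support of every candidate itemset (exponential in the number of items).
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            sup = sum(1 for t in transactions if itemset <= t) / n
            if sup >= min_support:
                frequent[itemset] = sup

    # Step 2: split each frequent itemset into antecedent and consequent.
    # Every subset of a frequent itemset is itself frequent, so the lookups succeed.
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                a = frozenset(ante)
                b = itemset - a
                lift = sup / (frequent[a] * frequent[b])
                if lift >= min_lift:
                    rules.append({
                        "antecedents": a, "consequents": b, "support": sup,
                        "confidence": sup / frequent[a], "lift": lift,
                    })
    # As with the mlxtend output, the result still has to be sorted.
    return sorted(rules, key=lambda rule: rule["lift"], reverse=True)


# Toy transactions (hypothetical data, for illustration only).
T = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

for rule in brute_force_rules(T, min_support=0.6):
    print(sorted(rule["antecedents"]), "->", sorted(rule["consequents"]),
          round(rule["confidence"], 2), round(rule["lift"], 2))
```

On this toy data only the beer/diapers pair survives both thresholds, in both directions. The other frequent pairs have lift below 1 and are dropped.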

In [7]:
df = loadcleandata("Germany")

In [11]:
rules = getrules(df,0.05,"lift") # min support 5%; keep rules with lift >= 1

In [14]:
rules.sort_values(by="conviction",ascending=False).head()

index | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
11 | (RED RETROSPOT CHARLOTTE BAG) | (WOODLAND CHARLOTTE BAG) | 0.070022 | 0.126915 | 0.059081 | 0.84375 | 6.648168 | 0.050194 | 5.587746
12 | (ROUND SNACK BOXES SET OF 4 FRUITS) | (ROUND SNACK BOXES SET OF4 WOODLAND) | 0.157549 | 0.245077 | 0.131291 | 0.833333 | 3.400298 | 0.092679 | 4.52954
14 | (SPACEBOY LUNCH BOX) | (ROUND SNACK BOXES SET OF4 WOODLAND) | 0.102845 | 0.245077 | 0.070022 | 0.680851 | 2.778116 | 0.044817 | 2.365427
1 | (PLASTERS IN TIN CIRCUS PARADE) | (PLASTERS IN TIN WOODLAND ANIMALS) | 0.115974 | 0.137856 | 0.067834 | 0.584906 | 4.242887 | 0.051846 | 2.076984
6 | (PLASTERS IN TIN SPACEBOY) | (PLASTERS IN TIN WOODLAND ANIMALS) | 0.107221 | 0.137856 | 0.061269 | 0.571429 | 4.145125 | 0.046488 | 2.01167


