## Introduction:

After experimenting with the **mlxtend** library and investigating different metrics for itemset rules, we can now give a basic outline of how to use our rule metrics to identify top rules for store placement or sales campaigns.

In general, itemsets should be found with a **support** threshold, positive/negative relationships are identified with **lift**, and the strength of those rules is measured by **conviction**.



### Purpose and Goal:

What is the goal of Market Basket Analysis? 

It is to identify reliable, *non-obvious* relationships between itemsets. In the context of transactional/sales data, we are looking for rules that point out non-obvious relationships between goods and services, which we can then act on to exploit hidden streams of revenue.

We use the Apriori algorithm to identify itemsets with sufficient support (read: enough examples to be statistically relevant), and then find the top association rules between itemsets according to objective measures (lift, conviction, etc.).

Once our top rules are found, we apply context and human intuition to judge if the rules are worthwhile. For example, if we derive the following two rules:

$$\{ Coffee \} \rightarrow \{ Sugar, Milk \}$$


$$\{ Pork,Beef,Chicken \} \rightarrow \{ Lemon\;\; Juice \} $$

The former is obvious, and can be discarded. The latter might reflect a current cooking or health trend, and should be investigated further.



## Criteria for Rule Selection:

With all of these definitions and examples in mind, how should we select our rules? Let's recap what our 4 main metrics do:

1) Support (of a Rule): Used to get some baseline statistical significance. If it is too low, our rule could have occurred just by chance.

2) Confidence: A measure of strength of implication: how likely Y is to occur in a transaction, given that X occurs. It estimates the conditional probability P(Y|X), but it ignores how common Y is overall, so it can overstate a rule when Y is frequent anyway. It does not give information about the valence or kind of relationship between two itemsets.

3) Lift: A measure of correlation between two itemsets. Indicates the valence (lift > 1: positive, lift < 1: negative, lift = 1: independent) and strength of co-occurrence (not causality).

4) Conviction: Another measure of strength of implication, but one that indicates when two itemsets are independent (cv = 1), in addition to taking into account both the antecedent and consequent of the rule (confidence ignores the consequent's baseline support).
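To make the four metrics concrete, here is a minimal pure-Python sketch that computes them for a single rule using the standard textbook definitions: confidence = supp(X ∪ Y) / supp(X), lift = confidence / supp(Y), and conviction = (1 - supp(Y)) / (1 - confidence). The toy transactions and the helper names `support` and `rule_metrics` are made up for illustration and are not part of mlxtend.

```python
# Toy transactions (invented for illustration only).
transactions = [
    {"coffee", "milk", "sugar"},
    {"coffee", "milk"},
    {"coffee", "sugar"},
    {"milk"},
    {"tea", "sugar"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, lift and conviction for the rule antecedent -> consequent.

    Assumes the antecedent and the consequent each occur at least once.
    """
    sup_x = support(antecedent, transactions)
    sup_y = support(consequent, transactions)
    sup_xy = support(set(antecedent) | set(consequent), transactions)

    confidence = sup_xy / sup_x
    lift = confidence / sup_y
    # Conviction is unbounded when the rule never fails (confidence == 1).
    conviction = float("inf") if confidence == 1 else (1 - sup_y) / (1 - confidence)
    return {"support": sup_xy, "confidence": confidence,
            "lift": lift, "conviction": conviction}

metrics = rule_metrics({"coffee"}, {"milk"}, transactions)
print(metrics)
```

For {coffee} → {milk} in this toy data, lift and conviction both come out slightly above 1, which is a weak positive relationship.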


### Process to Find Good Rules:

1) For the *mlxtend.frequent_patterns.apriori()* function, set the support threshold low enough to get a large number of rules.

2) After itemsets have been mined with apriori(), get the top X rules based on the highest lift and conviction values, using the *mlxtend.frequent_patterns.association_rules()* function.

3) Sort by lift first, and conviction second.

4) Remove duplicate and circular rules. If there are bi-directional rules between two nodes, keep whichever rule has the higher conviction.

5) Display the rules with pyvis, and apply human/domain knowledge to pick out the desired rules.

The code below demonstrates this process:

In [2]:
import pandas as pd #Imported explicitly; the code below relies on pd.
from src.seanlib import *
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from pyvis.network import Network

### Rule Generation Code:

In [3]:
#Generation of Rules, prepping data frame
#Signature: String -> DataFrame

#Countries with more than 500 transactions (despite the variable name, these are countries, not cities).
#Some countries have so few transactions that MBA is pointless.
cities_avail = ['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 
        'Portugal', 'Italy', 'Belgium', 'Channel Islands',  'Cyprus', 'Sweden']

#Purpose: Given a Country, take the raw UCI dataset, and clean it up. 
def loadcleandata(cty):
    df = pd.read_excel('./data/online_retail.xlsx')
    df['Description'] = df['Description'].str.strip()
    df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
    df['InvoiceNo'] = df['InvoiceNo'].astype('str')
    df = df[~df['InvoiceNo'].str.contains('C')]    
    
    basket = (df[df['Country'] == cty].groupby(['InvoiceNo', 'Description'])["Quantity"])
    basket = basket.sum().unstack().reset_index().fillna(0).set_index('InvoiceNo') 
    basket_sets = basket.applymap(encode_units) 
    if "POSTAGE" in basket_sets.columns: #Some countries have a postage itemset; it ruins our graphs and analysis.
        basket_sets.drop('POSTAGE', inplace=True, axis=1)
    return basket_sets

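#Purpose: Binarize a quantity: 1 if the item was bought in the invoice, 0 otherwise.
#(The original two-branch version returned None for values strictly between 0 and 1.)
def encode_units(x):
    return 1 if x >= 1 else 0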

#Signature: BasketSets, Float, String -> DataFrame
#Purpose: get the frequent itemsets, then find the rules between them, based on the arguments supplied.
#Only rules whose metric `met` is at least 1 are kept (min_threshold=1).
def getrules(bsets,minsup,met):
    frequent_itemsets = apriori(bsets, min_support=minsup, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric=met, min_threshold=1)
    return rules
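The basket-building step inside `loadcleandata` can also be written as one vectorized comparison instead of an element-wise `applymap`. The sketch below assumes only the three UCI column names used above; the toy rows and the helper name `to_basket_sets` are invented for illustration. It produces a boolean invoice-by-item table, which is the kind of input `apriori` is designed to consume.

```python
import pandas as pd

# Toy invoice lines (invented); column names mirror the UCI dataset used above.
toy = pd.DataFrame({
    "InvoiceNo":   ["1", "1", "2", "2", "3"],
    "Description": ["MUG", "PLATE", "MUG", "MUG", "PLATE"],
    "Quantity":    [2, 1, 1, 3, -1],
})

def to_basket_sets(df):
    """Pivot invoice lines into a boolean invoice x item table."""
    basket = (df.groupby(["InvoiceNo", "Description"])["Quantity"]
                .sum()
                .unstack(fill_value=0))  # items absent from an invoice count as 0
    # True where the net quantity is positive, False otherwise.
    return basket > 0

basket_sets = to_basket_sets(toy)
print(basket_sets)
```

Invoice "3" only contains a negative quantity (a return), so its row is entirely False, which matches how `encode_units` treats non-positive values.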

### Graph Generation Code:

In [5]:
#Support Functions:
#Purpose: Turn a frozenset itemset into a readable string label.
#(The original loop kept only one arbitrary item of a multi-item itemset.)
def strends(x):
    return ", ".join(sorted(str(item) for item in x))

#Given a rules DataFrame, generate a pyvis Network (rendered to html with .show()).
def gen_graph(rules):
    #Work on a re-indexed copy, so the caller's DataFrame is left untouched.
    rules = rules.reset_index(drop=True)

    #Convert the frozenset itemsets into string labels for pyvis.
    rules["antecedents"] = rules["antecedents"].apply(strends)
    rules["consequents"] = rules["consequents"].apply(strends)

    #Collect the unique nodes, each mapped to the support of its own itemset.
    #(Previously a node first seen as a consequent was sized by the antecedent support of its row.)
    nodeDict = {}
    for _, row in rules.iterrows():
        nodeDict.setdefault(row["antecedents"], row["antecedent support"])
        nodeDict.setdefault(row["consequents"], row["consequent support"])

    net = Network(width="800px",height="800px",notebook=True,directed=True)

    #Node size scales with itemset support.
    for itemset, sup in nodeDict.items():
        net.add_node(itemset, value=sup)

    #Edge width scales with conviction.
    for i, row in rules.iterrows():
        if row["antecedents"] != row["consequents"]:
            net.add_edge(source=row["antecedents"],to=row["consequents"],
                        physics=False,value=row["conviction"]*2,arrowStrikethrough=False)

    return net
    

### Minimal Example: Use of code (Raw Data -> Final Graphical Representation): 

In [7]:
#Load the raw dataset (only for inspection; loadcleandata() reads the file itself):
dfOR = pd.read_excel('./data/online_retail.xlsx')


In [8]:
#Get the basket_sets related to a particular country directly:
bask_sets = loadcleandata("Germany")

In [10]:
#Getting Rules:
#Reuse the basket sets loaded above rather than reading the Excel file again.
rules = getrules(bask_sets,0.035,"lift") #Keeps rules with lift >= 1.
rules.sort_values(by=["lift","conviction"],ascending=False,inplace=True)
rules.reset_index(inplace=True, drop=True)


In [11]:
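#Drop every second row to remove bi-directional duplicates. This assumes that, after the sort above,
#each rule sits directly next to its mirror (same lift), with the higher-conviction direction first.
rules.drop(index=list(range(1,rules.shape[0],2)),inplace=True)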
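Dropping every other row works only while each rule and its mirror stay adjacent after sorting, which ties in lift can break. A more defensive variant, sketched here under the assumption that the rules frame has the `antecedents`, `consequents`, `lift` and `conviction` columns produced by `association_rules`, keys each rule on its unordered pair of itemsets and keeps the direction with the higher conviction. The helper name `drop_mirrored_rules` and the toy values are hypothetical.

```python
import pandas as pd

def drop_mirrored_rules(rules):
    """Keep one rule per unordered (antecedent, consequent) pair, preferring higher conviction."""
    # Visit rules from highest to lowest conviction, so the first rule seen for a pair wins.
    ranked = rules.sort_values("conviction", ascending=False)
    seen = set()
    keep = []
    for ante, cons in zip(ranked["antecedents"], ranked["consequents"]):
        pair = frozenset((ante, cons))  # A -> B and B -> A share the same key
        keep.append(pair not in seen)
        seen.add(pair)
    deduped = ranked[pd.Series(keep, index=ranked.index)]
    # Restore the ordering used in this tutorial: lift first, conviction second.
    return (deduped.sort_values(by=["lift", "conviction"], ascending=False)
                   .reset_index(drop=True))

# Toy rules with invented metric values, shaped like association_rules() output.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"A"}), frozenset({"B"}), frozenset({"C"})],
    "consequents": [frozenset({"B"}), frozenset({"A"}), frozenset({"D"})],
    "lift":        [2.0, 2.0, 1.2],
    "conviction":  [1.5, 3.0, 1.1],
})
deduped = drop_mirrored_rules(toy_rules)
print(deduped)
```

Unlike the positional drop, this does not care where the mirrored rule ends up in the sort order, and it leaves one-directional rules alone.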

In [12]:
network = gen_graph(rules)
network.show("mygraph.html")

...and a Rule Cluster is produced! The item labels in the UCI dataset are hard to interpret, so we can't really apply domain knowledge to pick out rules here. However, the cluster above shows which rules to focus on first: those with the heaviest arrows, i.e. the highest conviction. Thank you for following this tutorial; try MBA out on your own transactional datasets!

### END