### Introduction:

After experimenting with the **mlxtend** library and investigating different metrics for itemset rules, we can now outline how to use Affinity Analysis to identify product rules for store placement or sales campaigns. In general, itemsets are found with a **support** threshold, positive and negative relationships are identified with **lift**, and the strength of the resulting rules is measured by **conviction**.
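
To make the three metrics concrete, here is a minimal sketch that computes them by hand for a single rule. The transactions and the helper names `support` and `rule_metrics` are invented for illustration and are not part of mlxtend; the formulas are the standard definitions of support, confidence, lift, and conviction.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter"},
    {"jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, lift and conviction for antecedent -> consequent."""
    sup_a = support(antecedent, transactions)
    sup_c = support(consequent, transactions)
    sup_ac = support(set(antecedent) | set(consequent), transactions)
    confidence = sup_ac / sup_a
    # lift > 1: positive relationship, lift < 1: negative, lift == 1: independent.
    lift = confidence / sup_c
    # Conviction is infinite when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - sup_c) / (1 - confidence)
    return {
        "support": sup_ac,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


metrics = rule_metrics({"bread"}, {"butter"}, transactions)
print(metrics)
```

For this toy data the rule bread -> butter has a lift slightly above 1 and a conviction slightly above 1, i.e. a weak positive relationship.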



### Usage of Apriori Algorithm to get strong rules:

1) Set the support threshold low enough to get a large number of rules (say a few hundred).

2) After itemsets have been mined with the apriori algorithm, get the top X rules based on the highest lift values.

3) From the top X rules, filter down further using conviction.
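
Steps 2 and 3 above can be sketched with plain pandas. This is only an illustration under assumptions: the helper name `select_rules`, the default values for `top_x` and `min_conviction`, and the hand-built rules table are all hypothetical. The column names `lift` and `conviction` match those produced by mlxtend's `association_rules`.

```python
import pandas as pd


def select_rules(rules, top_x=50, min_conviction=1.2):
    """Keep the top_x rules by lift, then keep those with enough conviction."""
    top_by_lift = rules.nlargest(top_x, "lift")
    strong = top_by_lift[top_by_lift["conviction"] >= min_conviction]
    return strong.sort_values("conviction", ascending=False)


# Hand-built stand-in for a rules DataFrame (hypothetical values).
toy_rules = pd.DataFrame(
    {
        "antecedents": ["A", "B", "C", "D"],
        "consequents": ["B", "C", "D", "A"],
        "lift": [3.0, 2.5, 1.1, 2.8],
        "conviction": [1.5, 1.1, 2.0, float("inf")],
    }
)

selected = select_rules(toy_rules, top_x=3, min_conviction=1.2)
print(selected)
```

Rule C -> D has the highest finite conviction but is dropped because its lift is outside the top 3, which shows why the order of the two filters matters.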


There is also a pyvis graphical visualization to give an idea of the rules. Note: if there are bi-directional rules between two nodes, this algorithm chooses the rule with the stronger conviction and support.
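
One way the bi-directional selection described in the note could be implemented is sketched below. It is an assumption, not necessarily what the graph code does: the helper name `keep_stronger_direction` is hypothetical, and it assumes that `antecedents` and `consequents` have already been converted to plain strings.

```python
import pandas as pd


def keep_stronger_direction(rules):
    """For each pair of nodes, keep only the rule with the higher conviction.

    Ties on conviction are broken by support.
    """
    rules = rules.copy()
    # A -> B and B -> A get the same key because the two labels are sorted.
    rules["pair"] = [
        " | ".join(sorted((a, c)))
        for a, c in zip(rules["antecedents"], rules["consequents"])
    ]
    ordered = rules.sort_values(["conviction", "support"], ascending=False)
    deduped = ordered.drop_duplicates("pair").drop(columns="pair")
    return deduped.reset_index(drop=True)


# Hand-built example (hypothetical values).
toy_rules = pd.DataFrame(
    {
        "antecedents": ["A", "B", "C"],
        "consequents": ["B", "A", "D"],
        "support": [0.10, 0.10, 0.05],
        "conviction": [1.5, 1.2, 1.1],
    }
)

one_way = keep_stronger_direction(toy_rules)
print(one_way)
```

Here A -> B survives over B -> A because its conviction is higher, and the unrelated rule C -> D is kept unchanged.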

In [16]:
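import pandas as pd #pd is used below; import it explicitly instead of relying on the wildcard import
from src.seanlib import *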
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from pyvis.network import Network

### Rule Generation Code:

In [None]:
#Generation of Rules, prepping data frame
#Signature: String -> DataFrame

#Countries with more than 500 transactions. Some countries have so few that MBA is pointless.
cities_avail = ['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 
        'Portugal', 'Italy', 'Belgium', 'Channel Islands',  'Cyprus', 'Sweden']

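#Purpose: Given a country name (see cities_avail above; the entries are countries),
#return its invoices as a 0/1 encoded basket DataFrame: one row per invoice, one column per product.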
def loadcleandata(cty):
    df = pd.read_excel('./data/online_retail.xlsx') #Why did a 20MB file take 1min to load??
    df['Description'] = df['Description'].str.strip()
    df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
    df['InvoiceNo'] = df['InvoiceNo'].astype('str')
    df = df[~df['InvoiceNo'].str.contains('C')]    
    
    basket = (df[df['Country'] ==cty].groupby(['InvoiceNo', 'Description'])["Quantity"])
    basket = basket.sum().unstack().reset_index().fillna(0).set_index('InvoiceNo') 
    basket_sets = basket.applymap(encode_units) 
    print(basket_sets.head())
    #basket_sets.drop('POSTAGE', inplace=True, axis=1) No postage column exists in raw DF.
    return basket_sets

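def encode_units(x):
    #1 if at least one unit was bought on the invoice, else 0
    #(covers zero, negative and fractional quantities, so None is never returned).
    if x >= 1:
        return 1
    return 0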

#Note, you still need to sort the DF of rules that you get.

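#bsets: 0/1 basket DataFrame, minsup: minimum support, met: metric name (e.g. "lift").
#min_threshold=1 suits lift and conviction, where 1 means independence;
#with "confidence" it would keep only rules whose confidence is 1.
def getrules(bsets,minsup,met):
    frequent_itemsets = apriori(bsets, min_support=minsup, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric=met, min_threshold=1)
    return rules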

### Graph Generation Code:

In [None]:
#Support Functions:
#TO DO: Investigate double-arrow relationships. Are they the same strength, or do some overlap the others?
#Do I visualize both or the strongest one?

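def strends(x):
    #mlxtend returns antecedents/consequents as frozensets. Join the items into one
    #label so that multi-item itemsets are not collapsed to a single arbitrary item.
    return ", ".join(sorted(str(y) for y in x))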

#Given a rules Data Frame, generate an html graph.
def gen_graph(rules):
    #reset index
    rules.reset_index(inplace=True, drop=True)
    
    #convert the frozenset itemsets into plain string labels for pyvis.
    rules["antecedents"] = rules["antecedents"].apply(strends)
    rules["consequents"] = rules["consequents"].apply(strends)
    
    #Adding nodes to our graph.
    #Dict keys are unique, so each itemset becomes exactly one node.
    #Each node is sized by its own support, read from the column that matches its role.
    nodeDict = {}
    for rowindex in rules.index:
        row = rules.iloc[rowindex]
        if row["antecedents"] not in nodeDict:
            nodeDict[row["antecedents"]] = float(row["antecedent support"])
        if row["consequents"] not in nodeDict:
            nodeDict[row["consequents"]] = float(row["consequent support"])

    net = Network(width="800px",height="800px",notebook=True,directed=True)

    for itemset, itemsupport in nodeDict.items():
        net.add_node(itemset, value=itemsupport)
       
    #One directed edge per rule, with width scaled by conviction.
    for i, row in rules.iterrows():
        if row["antecedents"] != row["consequents"]: 
            net.add_edge(source=row["antecedents"],to=row["consequents"],
                        physics=False,value=row["conviction"]*2,arrowStrikethrough=False)
    
    return net
    

In [None]:
#Minimal Example of How to use code: 

In [4]:
#Load the raw dataset:
dfOR = pd.read_excel('./data/online_retail.xlsx')


In [14]:
#Get the basket_sets related to a particular country directly:
bask_sets = loadcleandata("Norway")

In [17]:
#Getting Rules:
df = loadcleandata("United Kingdom")
rules = getrules(df,0.05,"lift") #Do you even...
rules.sort_values(by=["lift","conviction"],ascending=False).head(10)

NameError: name 'apriori' is not defined