Association rule mining finds interesting co-occurrence patterns in datasets, typically ones built around transactions or orders. Each transaction is unique, but the same items can be purchased in many transactions.

The ultimate goal is to predict what a customer wants so you can have it ready and waiting for them to put in their cart. This leads to higher sales and happier customers, because they don't spend time searching for the items they want. See this website for extra info: https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/

#### Vocabulary
 - An "itemset" is a set of one or more items. Each transaction contains the itemset of everything bought in it.
 - A "pattern" is a conjunction of items, i.e. a unique itemset.
 - A "rule" X --> Y means that if you buy X you are likely to buy Y. A rule is simply a guide; what you do with it is up to you and your business needs.

#### Evaluation metrics include:

1. Support = how frequently an itemset occurs: the number of transactions containing that itemset / the total number of transactions. EX: number of transactions containing item 'X' / total transactions

2. Confidence = how often the rule is likely to be true. For rule X-->Y, confidence = frequency of X and Y co-occurring / frequency of X occurring in the entire dataset. This is the conditional probability of Y given X, P(Y|X).

3. Lift = how likely item Y is given that item X occurs, controlling for how frequently Y occurs in the entire dataset. For rule X-->Y, lift = P(Y|X)/P(Y).
    - Lift = 1 means X and Y are independent
    - Lift > 1 means X and Y are positively correlated
    - Lift < 1 means X and Y are negatively correlated
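
The three metrics can be checked by hand on a small example. The sketch below uses a made-up list of five transactions (the items and the helper names `support`, `confidence` and `lift` are illustrative only, not part of any library) and applies the definitions directly.

```python
# Toy transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Estimate of P(Y|X): support of X and Y together divided by support of X."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of X --> Y divided by the baseline support of Y."""
    return confidence(x, y, transactions) / support(y, transactions)

sup_xy = support({"bread", "milk"}, transactions)        # 3 of the 5 transactions
conf_xy = confidence({"bread"}, {"milk"}, transactions)  # 3 of the 4 bread transactions
lift_xy = lift({"bread"}, {"milk"}, transactions)        # 0.75 / 0.8
print(sup_xy, conf_xy, lift_xy)
```

Here the lift comes out slightly below 1: milk is already in 80% of all baskets, so knowing that a basket contains bread does not make milk any more likely.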

The Apriori algorithm is the most basic association rules algorithm. Its job is to find the 'frequent' itemsets. That's it: it generates a list of itemsets whose items occur together often enough to pass your specified threshold for 'frequent'.

The 'Apriori' principle states that if an individual item (say X) is not frequent, then no larger set containing that item can be frequent either ([X,Y] and [X,Y,Z] are NOT frequent). The Apriori algorithm therefore never generates those larger itemsets or checks their frequency, because it already knows they are NOT frequent.

1. Set a minimum value for support and confidence. This means that we are only interested in rules for items that occur often enough on their own (support) and co-occur often enough with other items (confidence).
2. Extract all the itemsets whose support is higher than the minimum threshold.
3. From those itemsets, select all the rules whose confidence is higher than the minimum threshold.
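
The three steps can be sketched in plain Python. This is a minimal illustration on invented data, not the implementation used by any of the packages below; the function names `frequent_itemsets` and `generate_rules` are hypothetical, and thresholds are treated as inclusive (`>=`).

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Step 2: level-wise search returning {itemset: support} for frequent itemsets."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Start with single items, then grow one item at a time.
    current = keep_frequent({frozenset([i]) for t in transactions for i in t})
    found = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        current = keep_frequent(candidates)
        found.update(current)
        k += 1
    return found

def generate_rules(freq, min_confidence):
    """Step 3: list (antecedent, consequent, confidence) for rules above the threshold."""
    rules = []
    for itemset, sup in freq.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / freq[antecedent]  # subsets of a frequent itemset are in freq
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules

# Toy transactions, invented for illustration only.
toy = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"},
       {"milk"}, {"bread", "milk"}]

freq = frequent_itemsets(toy, min_support=0.4)       # step 1: thresholds chosen here
found_rules = generate_rules(freq, min_confidence=0.7)
for antecedent, consequent, conf in found_rules:
    print(set(antecedent), "-->", set(consequent), round(conf, 2))
```

In this toy run the candidate {bread, milk, butter} is never counted, because its subset {milk, butter} falls below the support threshold.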

In [1]:
import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
from apyori import apriori  


In [55]:
store_df = pd.read_csv('store_data.csv',  header=None)  

In [56]:
store_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


In [57]:
#convert pd to list of lists, goes through entire dataframe and for each row simply pulls out the values and puts them into a list

records = []  
for i in range(0, len(store_df)):  
    records.append([str(store_df.values[i,j]) for j in range(0, len(list(store_df)))])#makes sure it goes through all 20 possible items in each transaction

In [58]:
records

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers',
  'meatballs',
  'eggs',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['chutney',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['turkey',
  'avocado',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 ['mineral water',
  'milk',
  'energy bar',
  'whole wheat rice',
  'green tea',
  'nan',
  'nan',
  'nan',
 

Convert the list of lists into a boolean matrix in which each unique item is a column and each row is a transaction. A cell is True if the transaction contains that item and False otherwise.

In [59]:
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(records).transform(records)
records_matrix = pd.DataFrame(te_ary, columns=te.columns_)
records_matrix.head()
#remove the 'nan' column: str() turned each missing value (np.nan) into the string 'nan', so the encoder treats it as an item
records_matrix=records_matrix.drop('nan', axis=1)
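
An alternative is to skip the missing values while building the list of lists, so that no 'nan' item reaches the encoder and there is no column to drop afterwards. The sketch below shows the idea on a two-row stand-in for `store_df` (the names `toy_df` and `clean_records` are illustrative).

```python
import numpy as np
import pandas as pd

# Stand-in for store_df: two transactions padded with NaN,
# the way read_csv(header=None) pads short rows.
toy_df = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", np.nan, np.nan],
])

# Keep only the real items in each row; the NaN padding is skipped
# instead of being converted to the string 'nan'.
clean_records = [
    [str(item) for item in row if pd.notna(item)]
    for row in toy_df.itertuples(index=False)
]
print(clean_records)
```

Lists built this way have different lengths per transaction, which is fine for an encoder that works on lists of items.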

## Execute Apriori algorithm to find the frequent itemsets and use these to set 'rules'

We need to select hyperparameters.

min support = minimum percent of time that itemset occurs, if min_support=0.1 we need that itemset to occur 10% of the time in transactions to be considered 'frequent'.


In [60]:
from mlxtend.frequent_patterns import apriori

apriori_rules=apriori(records_matrix, min_support=0.05, use_colnames=True)#if use_colnames=True we return the item name, otherwise the column index is returned

In [61]:
apriori_rules

Unnamed: 0,support,itemsets
0,0.087188,(burgers)
1,0.081056,(cake)
2,0.059992,(chicken)
3,0.163845,(chocolate)
4,0.080389,(cookies)
5,0.05106,(cooking oil)
6,0.179709,(eggs)
7,0.079323,(escalope)
8,0.170911,(french fries)
9,0.063325,(frozen smoothie)


In [62]:
#Next, filter these out, to only include the itemsets with at least 2 items

apriori_rules['length']=apriori_rules['itemsets'].apply(lambda x: len(x))

apriori_rules=apriori_rules[apriori_rules['length']>1]
apriori_rules

Unnamed: 0,support,itemsets,length
25,0.05266,"(chocolate, mineral water)",2
26,0.050927,"(eggs, mineral water)",2
27,0.059725,"(spaghetti, mineral water)",2


Interesting: mineral water seems to co-occur frequently with chocolate, eggs, and spaghetti. The store owner obviously can't put all of these items next to each other, so they will have to make a smart decision about placement.
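
Support alone does not say how strong each pairing is. As a rough check, confidence can be computed from the supports already printed above. The sketch below copies four of those values into a small DataFrame shaped like the frequent-itemset table (the `confidence` helper and the `support_of` lookup are illustrative names). Lift is left out because the support of mineral water on its own is not among the rows shown.

```python
import pandas as pd

# Supports copied from the outputs shown above; itemsets are stored as frozensets.
itemsets = pd.DataFrame({
    "support": [0.163845, 0.179709, 0.05266, 0.050927],
    "itemsets": [
        frozenset({"chocolate"}),
        frozenset({"eggs"}),
        frozenset({"chocolate", "mineral water"}),
        frozenset({"eggs", "mineral water"}),
    ],
})

# Lookup table from itemset to its support.
support_of = dict(zip(itemsets["itemsets"], itemsets["support"]))

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support_of[both] / support_of[frozenset(antecedent)]

conf_choc = confidence({"chocolate"}, {"mineral water"})
conf_eggs = confidence({"eggs"}, {"mineral water"})
print(round(conf_choc, 3), round(conf_eggs, 3))
```

With these numbers, roughly 32% of chocolate transactions and 28% of egg transactions also contain mineral water.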

## There are many other packages out there that do apriori.
The apyori package used below lets you set confidence, lift and minimum length directly in the function call. Its output is a bit more cumbersome to work with, but it is also a good option.

min_confidence = minimum confidence for a rule X --> Y, i.e. how often X and Y occur together divided by how often X occurs. 0.2 = 20%

min_length = minimum length of itemset, 2 = itemset must contain at least 2 items. Note that the output below still lists single-item records, so this setting does not appear to take effect in this run.

In [63]:
from apyori import apriori  

association_rules = apriori(records, min_support=0.05, min_confidence=0.0, min_lift=0, min_length=2)  
association_results = list(association_rules)  


In [64]:
association_results

[RelationRecord(items=frozenset({'burgers'}), support=0.0871883748833489, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'burgers'}), confidence=0.0871883748833489, lift=1.0)]),
 RelationRecord(items=frozenset({'cake'}), support=0.08105585921877083, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'cake'}), confidence=0.08105585921877083, lift=1.0)]),
 RelationRecord(items=frozenset({'chicken'}), support=0.05999200106652446, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'chicken'}), confidence=0.05999200106652446, lift=1.0)]),
 RelationRecord(items=frozenset({'chocolate'}), support=0.1638448206905746, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'chocolate'}), confidence=0.1638448206905746, lift=1.0)]),
 RelationRecord(items=frozenset({'cookies'}), support=0.08038928142914278, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=
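
One way to make this output easier to work with is to flatten it into a DataFrame with one row per rule. The sketch below relies only on the field names visible in the printed records (`items`, `support`, `ordered_statistics`, `items_base`, `items_add`, `confidence`, `lift`). So that it runs without apyori installed, it defines stand-in namedtuples with those fields; the helper name `to_frame` is hypothetical.

```python
from collections import namedtuple
import pandas as pd

# Stand-ins mirroring the field names in the output above.
# With apyori installed, pass association_results to to_frame() instead.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic",
                              ["items_base", "items_add", "confidence", "lift"])

sample_results = [
    RelationRecord(
        items=frozenset({"burgers"}),
        support=0.0871883748833489,
        ordered_statistics=[OrderedStatistic(frozenset(), frozenset({"burgers"}),
                                             0.0871883748833489, 1.0)],
    ),
    RelationRecord(
        items=frozenset({"cake"}),
        support=0.08105585921877083,
        ordered_statistics=[OrderedStatistic(frozenset(), frozenset({"cake"}),
                                             0.08105585921877083, 1.0)],
    ),
]

def to_frame(results):
    """Flatten records into one row per (antecedent --> consequent) statistic."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            rows.append({
                "antecedent": ", ".join(sorted(stat.items_base)),
                "consequent": ", ".join(sorted(stat.items_add)),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return pd.DataFrame(rows)

rules_df = to_frame(sample_results)
# Rows with an empty antecedent only restate single-item support; filter them out.
real_rules = rules_df[rules_df["antecedent"] != ""]
print(rules_df)
```

Both sample records have an empty `items_base`, so `real_rules` is empty here. On the full results it would keep only rows where one itemset implies another.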

-----------------------------------------------------------------------------------------

## Apriori is very memory intensive. If you have a large dataset, this may not work and you will need to use a generator.

An example is below:

In [70]:
#Returns frequency counts for items and item pairs
def freq(iterable):
    if type(iterable) == pd.core.series.Series:
        return iterable.value_counts().rename("freq")
    else: 
        return pd.Series(Counter(iterable)).rename("freq")

    
# Returns number of unique orders
def order_count(order_item):
    return len(set(order_item.index))


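# Returns generator that yields item pairs, one at a time
# Note: itertools.groupby only groups consecutive rows, so rows of the same order must be adjacent
def get_item_pairs(order_item):
    order_item = order_item.reset_index().to_numpy()  # as_matrix() was removed in pandas 1.0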
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]
              
        for item_pair in combinations(item_list, 2):
            yield item_pair
            

# Returns frequency and support associated with item
def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))


# Returns name associated with item
def merge_item_name(rules, item_name):
    columns = ['itemA','itemB','freqAB','supportAB','freqA','supportA','freqB','supportB', 
               'confidenceAtoB','confidenceBtoA','lift']
    rules = (rules
                .merge(item_name.rename(columns={'item_name': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(item_name.rename(columns={'item_name': 'itemB'}), left_on='item_B', right_on='item_id'))
    return rules[columns]   

In [71]:
#apriori rules

def association_rules(order_item, min_support, min_confidence):

    print("Starting order_item: {:22d}".format(len(order_item)))


    # Calculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Filter from order_item items below min support 
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]

    print("Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Filter from order_item orders with less than 2 items
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Recalculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Get item pairs generator
    item_pair_gen          = get_item_pairs(order_item)


    # Calculate item pair frequency and support
    item_pairs              = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) * 100

    print("Item pairs: {:31d}".format(len(item_pairs)))


    # Filter from item_pairs those below min support
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))


    # Create table of association rules and compute relevant metrics
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
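    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
    # supports are percentages here, so this ratio is 1/100 of the usual lift P(A,B)/(P(A)*P(B));
    # multiply by 100 before comparing against the lift = 1 independence baseline
    item_pairs['lift']           = item_pairs['supportAB'] / (item_pairs['supportA'] * item_pairs['supportB'])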
    
    #filter rules that are below minimum confidence set by user
    
    item_pairs                   = item_pairs[item_pairs['confidenceAtoB'] >= min_confidence]
    # Return association rules sorted by lift in descending order
    return item_pairs.sort_values('lift', ascending=False)

#data must be a pandas Series for this generator to work, indexed by transaction ID, where each row is one item in a specific transaction

transaction_ID | Item
        1         A
        1         B
        2         H
        2         G
        3         A
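
A Series in this shape can be built from per-transaction item lists as follows. The `baskets` dictionary is hypothetical and simply mirrors the small table above.

```python
import pandas as pd

# Hypothetical basket data keyed by transaction ID (same values as the table above).
baskets = {1: ["A", "B"], 2: ["H", "G"], 3: ["A"]}

# One row per (transaction, item): the transaction ID is the index, the item is the value.
order_item = pd.Series(
    [item for items in baskets.values() for item in items],
    index=[tid for tid, items in baskets.items() for _ in items],
    name="item",
)
order_item.index.name = "transaction_ID"
print(order_item)
```

Building it this way keeps all rows of a transaction next to each other, which matters for any grouping that only merges consecutive rows, such as `itertools.groupby`.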

In [72]:
#use the IP pair dataset from Manifold

df=pd.read_csv('df_test.csv')

In [73]:
data=df[['Date', 'Src_IP', 'Dst_IP']]
#melt 
data2=data
data2=pd.melt(data2, id_vars=['Date'])

# Convert from DataFrame to a Series, with Date as the index (the order_id) and IP as the value (the item_id)
#necessary for association mining
data2=data2[['Date', 'value']]
data2.columns=['Date', 'IP']
data_series = data2.set_index('Date')['IP'].rename('IP')

type(data_series)

pandas.core.series.Series

In [77]:
#run the model
from collections import Counter
from itertools import combinations, groupby

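rules = association_rules(data_series, min_support=0.001, min_confidence=0.01)  #min_support is in percent because the function multiplies support by 100, so 0.001 means the IP address must appear in at least 0.001% of orders; min_confidence is a plain ratio (0.01 = 1%)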

Starting order_item:                   8000
Items with support >= 0.001:              68
Remaining order_item:                  8000
Remaining orders with 2+ items:        3781
Remaining order_item:                  8000
Item pairs:                             107
Item pairs with support >= 0.001:        107




In [78]:
rules

Unnamed: 0,item_A,item_B,freqAB,supportAB,freqA,supportA,freqB,supportB,confidenceAtoB,confidenceBtoA,lift
104,10014_233,10016_204,1,0.026448,13,0.343824,8,0.211584,0.076923,0.125,0.363558
103,192.168.220.31,192.168.220.31,1,0.026448,25,0.661201,25,0.661201,0.04,0.04,0.060496
66,192.168.210.46,192.168.210.52,3,0.079344,170,4.496165,12,0.317376,0.017647,0.25,0.055603
102,192.168.210.26,192.168.210.26,1,0.026448,27,0.714097,27,0.714097,0.037037,0.037037,0.051866
94,192.168.210.255,192.168.210.255,56,1.48109,203,5.36895,203,5.36895,0.275862,0.275862,0.051381
95,192.168.220.255,192.168.220.255,64,1.692674,219,5.792118,219,5.792118,0.292237,0.292237,0.050454
78,192.168.100.14,192.168.100.14,1,0.026448,40,1.057921,40,1.057921,0.025,0.025,0.023631
0,DNS,DNS,30,0.793441,228,6.030151,228,6.030151,0.131579,0.131579,0.02182
90,192.168.210.52,192.168.210.47,1,0.026448,12,0.317376,160,4.231685,0.083333,0.00625,0.019693
27,192.168.210.2,192.168.210.54,1,0.026448,4,0.105792,601,15.895266,0.25,0.001664,0.015728
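
Two things stand out in this table: several rows pair an item with itself, and every lift is far below 1. The supports here are percentages (the function multiplies by 100), which shrinks the computed lift by a factor of 100. The sketch below takes three rows copied from the output, drops the self-pairs, and recomputes lift from fractional supports. The column name `lift_std` is illustrative, and whether self-pairs should be dropped depends on what a repeated IP means in your data.

```python
import pandas as pd

# Three rows copied from the output above (only the columns needed here).
sample_rules = pd.DataFrame({
    "item_A": ["10014_233", "DNS", "192.168.210.46"],
    "item_B": ["10016_204", "DNS", "192.168.210.52"],
    "supportAB": [0.026448, 0.793441, 0.079344],
    "supportA": [0.343824, 6.030151, 4.496165],
    "supportB": [0.211584, 6.030151, 0.317376],
})

# Drop pairs where both sides are the same item.
distinct = sample_rules[sample_rules["item_A"] != sample_rules["item_B"]].copy()

# Convert percentages to fractions, then apply lift = P(A,B) / (P(A) * P(B)).
frac = distinct[["supportAB", "supportA", "supportB"]] / 100
distinct["lift_std"] = frac["supportAB"] / (frac["supportA"] * frac["supportB"])
print(distinct[["item_A", "item_B", "lift_std"]])
```

On this scale the first pair has a lift of about 36, i.e. 100 times the 0.363558 shown in the table.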
