In [None]:
# default_exp models.ar

# AR
> Association Rules

Simple Association Rules (AR) are a simplified version of the association rule mining technique [Agrawal et al. 1993] with a maximum rule size of two. The method is designed to capture the frequency of two co-occurring events, e.g., “Customers who bought ... also bought”. Algorithmically, the rules and their corresponding importance are “learned” by counting how often the items $i$ and $j$ occurred together in a session of any user. Let a session $s$ be a chronologically ordered tuple of item click events $s = (s_1, s_2, s_3, \ldots, s_m)$ and $S_p$ the set of all past sessions. Given a user’s current session $s$ with $s_{|s|}$ being the last item interaction in $s$, we can define the score for a recommendable item $i$ as follows, where the indicator function $1_{EQ}(a,b)$ is 1 if $a$ and $b$ refer to the same item and 0 otherwise.

$$score_{AR}(i,s) = \dfrac{1}{\sum_{p \in S_p}\sum_{x=1}^{|p|}1_{EQ}(s_{|s|},p_x)\cdot(|p|-1)}\sum_{p \in S_p}\sum_{x=1}^{|p|}\sum_{y=1}^{|p|}1_{EQ}(s_{|s|},p_x)\cdot1_{EQ}(i,p_y)$$

In the above equation, the sums on the right represent the counting scheme. The fraction on the left normalizes the score by the total number of rule occurrences originating from the current item $s_{|s|}$. The list of recommendations returned by the AR method then contains the items with the highest scores in descending order. No minimum support or confidence thresholds are applied.
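
As a minimal sketch of the scoring rule above, the following snippet evaluates $score_{AR}$ directly on a few toy sessions. The function name `ar_scores` and the toy data are assumptions made for illustration and are not part of the exported module; the last item itself is dropped from the candidates.

```python
from collections import Counter


def ar_scores(past_sessions, current_session):
    """Score candidate items against the last item of the current session.

    Hypothetical helper: a direct, unoptimized reading of the formula.
    """
    last_item = current_session[-1]
    pair_counts = Counter()  # numerator: co-occurrences with last_item
    normalizer = 0           # denominator: rule occurrences starting at last_item

    for p in past_sessions:
        # Number of positions x in p where p_x equals the last item
        occurrences = sum(1 for item in p if item == last_item)
        if occurrences == 0:
            continue
        normalizer += occurrences * (len(p) - 1)
        # Every position y in p is counted once per matching position x
        for item in p:
            pair_counts[item] += occurrences

    # The current item is not a useful recommendation for itself
    pair_counts.pop(last_item, None)
    if normalizer == 0:
        return {}
    return {item: count / normalizer for item, count in pair_counts.items()}


past_sessions = [("a", "b", "c"), ("a", "b"), ("b", "d")]
scores = ar_scores(past_sessions, current_session=("d", "a"))
for item, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(item, round(score, 3))
```

In this toy data the item `a` occurs in two past sessions with three co-occurring positions in total, so `b` scores 2/3 and `c` scores 1/3.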

In [None]:
#hide
from nbdev.showdoc import *
from fastcore.nb_imports import *
from fastcore.test import *

In [None]:
#export
import numpy as np
import pandas as pd
import collections as col

In [None]:
#export
class AssosiationRules: 
    '''
    AssosiationRules(pruning=10, session_key='SessionID', item_keys=['ItemID'])

    Parameters
    --------
    pruning : int or float
        If >= 1, keep only the top N co-occurrences per item. If between 0 and 1,
        drop that fraction of the weakest co-occurrences per item. A value of 0
        disables pruning. (Default value: 10)
    session_key : string
        The data frame key for the session identifier. (Default value: 'SessionID')
    item_keys : list of string
        The data frame keys for the item: the item identifier first, followed by
        any item feature keys. (Default value: ['ItemID'])
    '''
    def __init__( self, pruning=10, session_key='SessionID', item_keys=['ItemID'] ):
        self.pruning = pruning
        self.session_key = session_key
        self.item_keys = list(item_keys)  # copy so the shared default list is never mutated
        self.items_features = {}
        self.predict_for_item_ids = []
        
    def fit( self, data):
        '''
        Trains the predictor.
        
        Parameters
        --------
        data: pandas.DataFrame
            Training data containing the session transactions.
            It has one column for session IDs, one for item IDs, and optionally
            further columns for item features.
            It must have a header. Column names are arbitrary, but must
            correspond to the ones set when initializing the model
            (session_key, item_keys).
        '''
        cur_session = -1
        last_items = []
        all_rules = []
        indices_item = []
        for i in self.item_keys:
            all_rules.append(dict())
            indices_item.append( data.columns.get_loc(i) )

        # Sort a copy with a stable algorithm so the caller's frame is left untouched
        data = data.sort_values(self.session_key, kind='mergesort')
        index_session = data.columns.get_loc(self.session_key)
        
        #Create Dictionary of items and their features
        for row in data.itertuples( index=False ):
            item_id = row[indices_item[0]]
            if not item_id in self.items_features.keys() :
                self.items_features[item_id] = []
                for i in indices_item:
                    self.items_features[item_id].append(row[i])
              
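        for i in range(len(self.item_keys)):
            rules = all_rules[i]
            index_item = indices_item[i]
            # Reset the session state so that each key starts from a clean slate
            cur_session = -1
            last_items = []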
            for row in data.itertuples( index=False ):
                session_id, item_id = row[index_session], row[index_item]
                if session_id != cur_session:
                    cur_session = session_id
                    last_items = []
                else: 
                    for item_id2 in last_items:
                        # Count the co-occurrence in both directions (rules are symmetric)
                        rules.setdefault(item_id, dict()).setdefault(item_id2, 0)
                        rules.setdefault(item_id2, dict()).setdefault(item_id, 0)
                        rules[item_id][item_id2] += 1
                        rules[item_id2][item_id] += 1
                        
                last_items.append(item_id)
                
            if self.pruning > 0:
                rules = self.prune(rules) 
                
            all_rules[i] = rules
        self.all_rules = all_rules
        self.predict_for_item_ids = list(self.all_rules[0].keys())

    def predict_next(self, session_items, k = 20):
        '''
        Recommends the k items most likely to follow the last item of the session.
                
        Parameters
        --------
        session_items : List
            Item IDs in the current session, in chronological order.
        k : Integer
            How many items to recommend.

        Returns
        --------
        out : numpy.ndarray
            IDs of the k highest-scoring items, in descending order of score.
            Items already in the session receive a score of zero.
        
        '''
        all_len = len(self.predict_for_item_ids)
        input_item_id = session_items[-1]
        preds = np.zeros( all_len ) 
             
        if input_item_id in self.all_rules[0].keys():
            for k_ind in range(all_len):
                key = self.predict_for_item_ids[k_ind]
                if key in session_items:
                    continue
                try:
                    preds[ k_ind ] += self.all_rules[0][input_item_id][key]
                except KeyError:  # no rule between the two items
                    pass
                for i in range(1, len(self.all_rules)):
                    input_item_feature = self.items_features[input_item_id][i]
                    key_feature = self.items_features[key][i]
                    try:
                        preds[ k_ind ] += self.all_rules[i][input_item_feature][key_feature]
                    except KeyError:  # no rule between the two feature values
                        pass
        
        series = pd.Series(data=preds, index=self.predict_for_item_ids)
        series = series / series.max()
        
        return series.nlargest(k).index.values
    
    def prune(self, rules): 
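        '''
        Prunes the rules of every item down to its strongest co-occurrences.

        Parameters
        --------
        rules : dict of dicts
            The rules mined from the training data

        Returns
        --------
        rules : dict of dicts
            The pruned rules: the top `pruning` co-occurrences per item if
            pruning >= 1, otherwise all but the weakest `pruning` fraction.
        '''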
        for k1 in rules:
            tmp = rules[k1]
            if self.pruning < 1:
                keep = len(tmp) - int( len(tmp) * self.pruning )
            elif self.pruning >= 1:
                keep = self.pruning
            counter = col.Counter( tmp )
            rules[k1] = dict()
            for k2, v in counter.most_common( keep ):
                rules[k1][k2] = v
        return rules
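
The effect of the `pruning` parameter can be illustrated in isolation. The sketch below mirrors the keep-count logic of `prune` for a single item; `prune_counts` and the example counts are hypothetical and only serve as an illustration.

```python
from collections import Counter


def prune_counts(counts, pruning):
    """Keep the strongest co-occurrences of one item (hypothetical helper)."""
    if pruning < 1:
        # Fractional value: drop that share of the weakest entries
        keep = len(counts) - int(len(counts) * pruning)
    else:
        # Integer value: keep the top N entries
        keep = int(pruning)
    return dict(Counter(counts).most_common(keep))


counts = {"b": 5, "c": 3, "d": 2, "e": 1}
top2 = prune_counts(counts, 2)     # keeps the 2 most frequent partners
half = prune_counts(counts, 0.5)   # drops the weakest 50% of partners
print(top2)
print(half)
```

With four co-occurring items, `pruning=2` and `pruning=0.5` both keep the same two strongest partners, `b` and `c`.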

In [None]:
import os
import time
import argparse
import pandas as pd
from recohut.utils.common_utils import download_url

In [None]:
data_root = '/content/data'
download_url('https://github.com/RecoHut-Datasets/yoochoose/raw/v4/yoochoose_train.txt', data_root)
download_url('https://github.com/RecoHut-Datasets/yoochoose/raw/v4/yoochoose_valid.txt', data_root)

Downloading https://github.com/RecoHut-Datasets/yoochoose/raw/v4/yoochoose_train.txt
Downloading https://github.com/RecoHut-Datasets/yoochoose/raw/v4/yoochoose_valid.txt


'/content/data/yoochoose_valid.txt'

In [None]:
parser = argparse.ArgumentParser()
parser.add_argument('--prune', type=int, default=10, help="Association Rules Pruning Parameter")
parser.add_argument('--K', type=int, default=20, help="K items to be used in Recall@K and MRR@K")
parser.add_argument('--itemid', default='sid', type=str)
parser.add_argument('--sessionid', default='uid', type=str)
parser.add_argument('--item_feats', default='', type=str, 
                    help="Names of Columns containing items features separated by #")
parser.add_argument('--valid_data', default='yoochoose_valid.txt', type=str)
parser.add_argument('--train_data', default='yoochoose_train.txt', type=str)
parser.add_argument('--data_folder', default=data_root, type=str)

# Get the arguments
args = parser.parse_args([])
train_data = os.path.join(args.data_folder, args.train_data)
x_train = pd.read_csv(train_data)
valid_data = os.path.join(args.data_folder, args.valid_data)
x_valid = pd.read_csv(valid_data)
# A stable sort keeps the file's row order within each session
x_valid = x_valid.sort_values(args.sessionid, kind='mergesort')

items_feats = [args.itemid]
ffeats = args.item_feats.strip().split("#")
if ffeats[0] != '':
    items_feats.extend(ffeats)

print('Finished Reading Data.')
# Fitting AR Model
print('Start Model Fitting...')
t1 = time.time()
model = AssosiationRules(session_key = args.sessionid, item_keys = items_feats, pruning=args.prune)
model.fit(x_train)
t2 = time.time()
print('End Model Fitting with total time =', t2 - t1)

print('Start Predictions...')
# Test Set Evaluation
test_size = 0.0
hit = 0.0
MRR = 0.0
cur_length = 0
cur_session = -1
last_items = []
t1 = time.time()
index_item = x_valid.columns.get_loc(args.itemid)
index_session = x_valid.columns.get_loc(args.sessionid)
train_items = model.items_features.keys()
counter = 0
for row in x_valid.itertuples(index=False):
    counter += 1
    if counter % 5000 == 0:
        print('Finished Prediction for ', counter, 'items.')
    session_id, item_id = row[index_session], row[index_item]
    if session_id != cur_session:
        cur_session = session_id
        last_items = []
        cur_length = 0
    
    if not item_id in last_items and item_id in train_items:
        if len(last_items) > cur_length: #make prediction
            cur_length += 1
            test_size += 1
            # Recommend the top-K items given the session so far
            predictions = model.predict_next(last_items, k = args.K)
            #print('preds:', predictions)
            # Evaluation
            rank = 0
            for predicted_item in predictions:
                rank += 1
                if predicted_item == item_id:
                    hit += 1.0
                    MRR += 1/rank
                    break
        
        last_items.append(item_id)
t2 = time.time()
print('Recall: {}'.format(hit / test_size))
print ('\nMRR: {}'.format(MRR / test_size))
print('End Model Predictions with total time =', t2 - t1)

Finished Reading Data.
Start Model Fitting...
End Model Fitting with total time = 47.870760679244995
Start Predictions...
Finished Prediction for  5000 items.
Recall: 0.26574500768049153

MRR: 0.1308005998098165
End Model Predictions with total time = 110.56373572349548
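
The two reported metrics can also be computed by a small standalone helper. This is a sketch, assuming each prediction is a ranked list of item IDs; `recall_and_mrr_at_k` and the toy data are hypothetical names used only for illustration.

```python
def recall_and_mrr_at_k(ranked_lists, targets, k=20):
    """Return (Recall@K, MRR@K) for ranked predictions and true next items."""
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = list(ranked)[:k]
        if target in top_k:
            hits += 1
            # Ranks are 1-based, so the first position contributes 1.0
            reciprocal_rank_sum += 1.0 / (top_k.index(target) + 1)
    n = len(targets)
    if n == 0:
        return 0.0, 0.0  # nothing to evaluate
    return hits / n, reciprocal_rank_sum / n


ranked_lists = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
targets = ["b", "d", "z"]
recall, mrr = recall_and_mrr_at_k(ranked_lists, targets, k=3)
print("Recall@3:", recall)
print("MRR@3:", mrr)
```

Two of the three targets appear in their lists (at ranks 2 and 1), so Recall@3 is 2/3 and MRR@3 is (1/2 + 1 + 0) / 3 = 0.5.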


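> **References:**
- Ludewig and Jannach, Evaluation of Session-based Recommendation Algorithms: [https://arxiv.org/pdf/1803.09587.pdf](https://arxiv.org/pdf/1803.09587.pdf)
- Agrawal et al. 1993, Mining Association Rules between Sets of Items in Large Databases: [http://www.rakesh.agrawal-family.com/papers/sigmod93assoc.pdf](http://www.rakesh.agrawal-family.com/papers/sigmod93assoc.pdf)
- iCV-SBR source code (AR and SR, Python): [https://github.com/mmaher22/iCV-SBR/tree/master/Source%20Codes/AR%26SR_Python](https://github.com/mmaher22/iCV-SBR/tree/master/Source%20Codes/AR%26SR_Python)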

In [None]:
#hide
%reload_ext watermark
%watermark -a "Sparsh A." -m -iv -u -t -d

Author: Sparsh A.

Last updated: 2022-01-01 06:12:09

Compiler    : GCC 7.5.0
OS          : Linux
Release     : 5.4.144+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit

pandas  : 1.1.5
numpy   : 1.19.5
argparse: 1.1
IPython : 5.5.0

