# Apriori algorithm

The Apriori algorithm finds frequent itemsets in transaction data and derives association rules from them.

Association rules are evaluated with the following metrics:

1. Support

   The fraction of all transactions (purchases) that contain a given itemset. In our case, a transaction is the set of artists a single user has listened to.

2. Confidence

   The probability that a customer who buys product X also buys product Y. In our case, this is the probability that a user who listens to artist X also listens to artist Y.

3. Lift
   
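   How much more often X and Y occur together than would be expected if they were independent: the confidence of X → Y divided by the support of Y. A lift above 1 indicates a positive association, a lift of 1 indicates independence, and a lift below 1 indicates a negative association.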


**Metrics formulas**

![Apriori algorithm metrics formulas](data/apriory.jpeg?raw=true "Apriori algorithm metrics formulas")
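
As a rough illustration of the three metrics, the sketch below computes them on a tiny made-up set of listening histories. It assumes the standard textbook definitions (support as a fraction of transactions, confidence as support(X ∪ Y) / support(X), lift as confidence / support(Y)); the `transactions` data and the helper names `support`, `confidence`, and `lift` are hypothetical and do not come from the notebook.

```python
# Toy listening histories (hypothetical data, not taken from lastfm.csv).
transactions = [
    {"coldplay", "radiohead", "muse"},
    {"coldplay", "radiohead"},
    {"radiohead", "the beatles"},
    {"coldplay", "muse"},
    {"the beatles"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Estimated P(Y | X): support of X and Y together over support of X."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Confidence of X -> Y relative to the baseline support of Y."""
    return confidence(x, y, transactions) / support(y, transactions)


x, y = {"coldplay"}, {"radiohead"}
print("support   :", support(x | y, transactions))
print("confidence:", confidence(x, y, transactions))
print("lift      :", lift(x, y, transactions))
```

In this toy data the lift of coldplay → radiohead comes out slightly above 1, meaning the two artists co-occur a little more often than independence would predict. The notebook cells that follow work with raw counts and a count threshold rather than fractions.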

In [1]:
from collections import defaultdict
import pandas as pd

In [2]:
THRESHOLD = 180  # occurrence count that makes an itemset "frequent"

In [3]:
item_counts = defaultdict(int)
pair_counts = defaultdict(int)
triple_counts = defaultdict(int)

In [4]:
# read in data
df = pd.read_csv('data/lastfm.csv')
lastfm = df[['user', 'artist', 'country']]

In [9]:
# map each user to the list of artists they listened to;
# sort=False keeps users in order of first appearance, as the original loop did
records = lastfm.groupby('user', sort=False)['artist'].apply(list).to_dict()

In [10]:
def normalize_group(*args):
    # canonical, hashable key for a group: its items sorted and rendered as a string
    return str(sorted(args))

In [11]:
def generate_pairs(*args):
    pairs = []
    for idx_1 in range(len(args)):
        for idx_2 in range(idx_1 + 1, len(args)):
            pairs.append(normalize_group(args[idx_1], args[idx_2]))
    return pairs
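
The two helpers above can also be expressed with `itertools.combinations`, which generalizes to groups of any size. This is only a sketch of an alternative, not the notebook's own code: `generate_groups` is a hypothetical name, and `normalize_group` is repeated here so the example runs on its own.

```python
from itertools import combinations


def normalize_group(*args):
    # Same normalization as in the notebook: sorted items rendered as a string.
    return str(sorted(args))


def generate_groups(items, size):
    """Hypothetical helper: normalized keys for every `size`-item subset of `items`."""
    return [normalize_group(*combo) for combo in combinations(items, size)]


pairs = generate_groups(["muse", "coldplay", "radiohead"], 2)
for pair in pairs:
    print(pair)
```

Because `combinations` yields subsets in the same index order as the nested loops in `generate_pairs`, the resulting list should match what `generate_pairs("muse", "coldplay", "radiohead")` returns.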

In [12]:
# first pass
# find candidate items
for artists in records.values():
    for item in artists:
        item_counts[item] += 1

# filter for frequent items
frequent_artists = set()
for key in item_counts:
    if item_counts[key] > THRESHOLD:
        frequent_artists.add(key)

In [13]:
# second pass ----------------------------------------
# get counts of candidate pairs
for artists in records.values():
    for idx_1 in range(len(artists)-1):
        if artists[idx_1] not in frequent_artists:
            continue
        for idx_2 in range(idx_1 + 1, len(artists)):
            if artists[idx_2] not in frequent_artists:
                continue
            # [a, b] is the same as [b, a] with this normalization
            pair = normalize_group(artists[idx_1], artists[idx_2])
            pair_counts[pair] += 1

# get frequent pairs (strictly above the threshold, as in the other passes)
frequent_pairs = set()
for key in pair_counts:
    if pair_counts[key] > THRESHOLD:
        frequent_pairs.add(key)

In [14]:
# third pass -------------------------------------
# find candidate triples
for artists in records.values():
    for idx_1 in range(len(artists)-2):
        if artists[idx_1] not in frequent_artists:
            continue
        for idx_2 in range(idx_1 + 1, len(artists) - 1):
            if artists[idx_2] not in frequent_artists:
                continue
            first_pair = normalize_group(artists[idx_1], artists[idx_2])
            if first_pair not in frequent_pairs:
                continue
            for idx_3 in range(idx_2 + 1, len(artists)):
                if artists[idx_3] not in frequent_artists:
                    continue
                # now check that all pairs are frequent
                pairs = generate_pairs(
                    artists[idx_1], artists[idx_2], artists[idx_3])
                if any(pair not in frequent_pairs for pair in pairs):
                    continue
                triple = normalize_group(
                    artists[idx_1], artists[idx_2], artists[idx_3])
                triple_counts[triple] += 1

In [15]:
# get frequent triples
frequent_triples = set()
for key in triple_counts:
    if triple_counts[key] > THRESHOLD:
        frequent_triples.add(key)

In [16]:
# view our results -----------------------------------
# keep only the frequent triples and sort them by ascending count
triple_counts = {k: v for k, v in triple_counts.items() if v > THRESHOLD}
sorted_triples = sorted(triple_counts.items(), key=lambda elem: elem[1])
for entry in sorted_triples:
    print(f'{entry[0]} : {entry[1]}')

['coldplay', 'radiohead', 'sigur rós'] : 181
['coldplay', 'muse', 'red hot chili peppers'] : 181
['coldplay', 'muse', 'the beatles'] : 184
['radiohead', 'the beatles', 'the white stripes'] : 184
['arctic monkeys', 'coldplay', 'the killers'] : 185
['radiohead', 'red hot chili peppers', 'the beatles'] : 187
['death cab for cutie', 'radiohead', 'the beatles'] : 187
['coldplay', 'the beatles', 'the killers'] : 188
['coldplay', 'oasis', 'radiohead'] : 191
['muse', 'radiohead', 'the killers'] : 192
['beck', 'radiohead', 'the beatles'] : 195
['led zeppelin', 'radiohead', 'the beatles'] : 196
['arctic monkeys', 'coldplay', 'radiohead'] : 196
['coldplay', 'red hot chili peppers', 'the beatles'] : 201
['muse', 'placebo', 'radiohead'] : 205
['muse', 'radiohead', 'the beatles'] : 207
['bob dylan', 'radiohead', 'the beatles'] : 208
['david bowie', 'radiohead', 'the beatles'] : 209
['coldplay', 'death cab for cutie', 'radiohead'] : 212
['coldplay', 'radiohead', 'red hot chili peppers'] : 222
['coldp