# Aggregate

This notebook contains an implementation of Aggregate - an experimental method for building recommender systems based on computing a similarity matrix.

## What does this algorithm do?

The algorithm takes a set of items as input and outputs an extension of this set that contains additional items which are most likely to belong to it, based on previous transaction data.

## Sampling the dataset

In this demonstration, the transaction data is playlists belonging to the [Spotify Million Playlist Dataset](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge), and the items are artists.

Since the original dataset contains 1 million playlists, we select a random sample of 100 thousand playlists for our experiment.

In [1]:
import json

`filtered-data.json` contains transaction data - the playlists.  
`artist-info.json` contains metadata about the artists: a mapping from artist ID to additional information such as the artist's name.

In [2]:
# Use context managers so both files are closed once they have been read.
with open("preprocessed-data/filtered-data.json") as data_file:
    data = json.load(data_file)
with open("preprocessed-data/artist-info.json") as info_file:
    info = json.load(info_file)

In [3]:
TOTAL_PLAYLISTS = len(data)
NUM_ARTISTS = len(info)

In [4]:
NUM_PLAYLISTS = 100000

In [5]:
import random

In [6]:
random.seed(30)

In [7]:
selected_playlists = random.sample(range(TOTAL_PLAYLISTS), NUM_PLAYLISTS)
sample = [data[i] for i in selected_playlists]

## Constructing the similarity matrix

The algorithm first scans the transaction data to build a matrix whose values denote the similarities between artists.  
Each similarity is expressed with a measure we have devised called modified-lift, which we have also used in Crawl.

$$\text{modified-lift}(A, B) = \frac{\text{support}(\{A, B\})}{\sqrt{\text{support}(\{A\}) \cdot \text{support}(\{B\})}}$$
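
As a worked illustration of the formula, here is a minimal sketch that computes modified-lift on a handful of made-up playlists. The names `support`, `modified_lift`, and `toy_playlists` are hypothetical and are not part of the notebook; support is taken to be the fraction of playlists that contain every item of a set.

```python
from math import sqrt

# Hypothetical toy transactions: each playlist is a set of artist IDs.
toy_playlists = [{0, 1}, {0, 1, 2}, {0, 2}, {1}]

def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def modified_lift(a, b, transactions):
    """support({a, b}) / sqrt(support({a}) * support({b}))."""
    denom = sqrt(support({a}, transactions) * support({b}, transactions))
    # Assumption: treat the similarity as 0 when either item never occurs.
    return support({a, b}, transactions) / denom if denom else 0.0

# Artists 0 and 1 each appear in 3 of 4 playlists and co-occur in 2 of 4,
# so the value is 0.5 / sqrt(0.75 * 0.75) = 2/3.
print(modified_lift(0, 1, toy_playlists))
```

Because every support shares the same denominator (the number of playlists), the ratio is unchanged if raw counts are used instead of fractions: the normalizing factors cancel between the numerator and the square root.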

In [8]:
# 1000 x 1000 matrix of zeros, indexed by artist ID on both axes.
mat = [[0] * 1000 for _ in range(1000)]

In [9]:
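# Count co-occurrences: off-diagonal cells count playlists that contain both
# artists, and diagonal cells count playlists that contain the artist
# (assuming a playlist lists each artist at most once).
for playlist in sample:
    n = len(playlist)
    for i in range(n):
        for j in range(i):
            mat[playlist[j]][playlist[i]] += 1
            mat[playlist[i]][playlist[j]] += 1
        mat[playlist[i]][playlist[i]] += 1
# Turn the off-diagonal counts into modified-lift values. Raw counts can stand
# in for supports because the 1/len(sample) factors cancel in the ratio.
for i in range(1000):
    for j in range(1000):
        if i != j:
            mat[i][j] /= (mat[i][i] * mat[j][j]) ** 0.5
# Every artist has similarity 1 with itself.
for i in range(1000):
    mat[i][i] = 1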
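
To make the construction easier to check by hand, the following sketch builds the same kind of matrix for a tiny made-up dataset. `build_similarity` and `toy_playlists` are hypothetical names, and two choices here are assumptions rather than the notebook's behavior: repeated artists within a playlist are counted once, and an artist that never appears keeps similarity 0 with everyone else instead of triggering a division by zero.

```python
from math import sqrt

def build_similarity(playlists, num_items):
    """Return a num_items x num_items modified-lift matrix from raw counts."""
    counts = [[0] * num_items for _ in range(num_items)]
    for playlist in playlists:
        items = sorted(set(playlist))  # assumption: ignore repeated artists
        for a in items:
            for b in items:
                counts[a][b] += 1  # a == b counts playlists containing a
    sim = [[0.0] * num_items for _ in range(num_items)]
    for a in range(num_items):
        sim[a][a] = 1.0  # every artist has similarity 1 with itself
        for b in range(num_items):
            denom = sqrt(counts[a][a] * counts[b][b])
            if a != b and denom > 0:  # guard: unseen artists keep similarity 0
                sim[a][b] = counts[a][b] / denom
    return sim

toy_playlists = [[0, 1], [0, 1, 2], [0, 2], [1]]
toy_sim = build_similarity(toy_playlists, 4)  # artist 3 never appears
print(toy_sim[0][1])  # 2 co-occurrences / sqrt(3 * 3)
```

The resulting matrix is symmetric, which is why the notebook can read a row of `mat` wherever a column would be needed.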

## The algorithm

The array `score` maintains the score of each item at any point in time: the sum of the modified-lift of that item with every item that is already part of the playlist.  
At each step, the algorithm greedily picks the unselected item with the highest score, in other words, the item that the algorithm predicts is most likely to belong to the set.  
The scores of all the items are then updated by adding the modified-lift of each item with the item that has just been added.  
The process repeats until the playlist reaches the requested size.  
The array `selected` keeps track of which items are already part of the playlist that is being constructed.

In [10]:
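def init_scores(playlist):
    # Start with a zero score and an unselected flag for every artist.
    score, selected = [], []
    for i in range(1000):
        score.append(0)
        selected.append(False)
    # Add each seed artist's similarity row to the scores and mark it selected.
    for item in playlist:
        for i in range(1000):
            score[i] += mat[item][i]
        selected[item] = True
    return score, selected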

In [11]:
def extend(playlist, size):
    playlist = playlist.copy()  # leave the caller's list untouched
    score, selected = init_scores(playlist)
    while len(playlist) < size:
        # Find the unselected artist with the highest score.
        maxi, val = -1, -1
        for i in range(1000):
            if (not selected[i]) and (score[i] > val):
                maxi = i
                val = score[i]
        playlist.append(maxi)
        # Add the new artist's similarity row to every score.
        for i in range(1000):
            score[i] += mat[maxi][i]
        selected[maxi] = True
    return playlist
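
The greedy selection can be traced by hand on a small example. The sketch below is a self-contained variant that takes the similarity matrix as a parameter; `greedy_extend` and the 4 x 4 matrix `toy_sim` are hypothetical and chosen only for illustration. As an added assumption, the requested size is capped at the number of items so the loop cannot run out of candidates.

```python
def greedy_extend(seed, size, sim):
    """Greedily extend `seed` to `size` items using similarity matrix `sim`."""
    n = len(sim)
    size = min(size, n)  # assumption: cannot select more items than exist
    result = list(seed)
    selected = set(seed)
    # Initial score of each item: summed similarity to the seed items.
    score = [sum(sim[item][i] for item in seed) for i in range(n)]
    while len(result) < size:
        # Pick the unselected item with the highest score (first one on ties).
        best = max((i for i in range(n) if i not in selected),
                   key=lambda i: score[i])
        result.append(best)
        selected.add(best)
        # Fold the new item's similarities into every score.
        for i in range(n):
            score[i] += sim[best][i]
    return result

# Hypothetical similarities: items 0 and 1 are close, as are items 2 and 3.
toy_sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
print(greedy_extend([0], 3, toy_sim))  # prints [0, 1, 2]
```

Starting from item 0, item 1 is chosen first because it has the highest similarity to the seed. Item 2 follows because its accumulated score (0.1 + 0.2) exceeds that of item 3 (0.0 + 0.1).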

## Function for printing playlists

In [12]:
def print_playlist(playlist):
    for idx, item in enumerate(playlist):
        print(f"{idx + 1}:\t{info[item]['name']}")

## Testing with Rock artists

We now test the performance of the algorithm by providing it with a playlist of artists belonging to the Rock genre.  
Ideally, the algorithm should extend the playlist in such a way that other Rock artists appear at the top.

In [13]:
rock_playlist = [150, 155, 221, 239, 753]
print_playlist(rock_playlist)

1:	alt-J
2:	Arctic Monkeys
3:	MGMT
4:	Nirvana
5:	Dire Straits


In [14]:
extended_rock_playlist = extend(rock_playlist, 20)
print_playlist(extended_rock_playlist)

1:	alt-J
2:	Arctic Monkeys
3:	MGMT
4:	Nirvana
5:	Dire Straits
6:	Cage The Elephant
7:	The Black Keys
8:	Red Hot Chili Peppers
9:	Weezer
10:	The Killers
11:	The White Stripes
12:	The Strokes
13:	Modest Mouse
14:	Vampire Weekend
15:	Two Door Cinema Club
16:	Phoenix
17:	Foster The People
18:	Grouplove
19:	Passion Pit
20:	Young the Giant


As you can see from the results, the algorithm does indeed add Rock artists.  
It correctly identified the nature of the sample playlist from the transaction data alone, without being explicitly told the genre.

## Testing with Pop artists

In [15]:
pop_playlist = [53, 165, 319, 538, 620]
print_playlist(pop_playlist)

1:	Shawn Mendes
2:	Zara Larsson
3:	MØ
4:	Dua Lipa
5:	Billie Eilish


In [16]:
extended_pop_playlist = extend(pop_playlist, 20)
print_playlist(extended_pop_playlist)

1:	Shawn Mendes
2:	Zara Larsson
3:	MØ
4:	Dua Lipa
5:	Billie Eilish
6:	The Chainsmokers
7:	Hailee Steinfeld
8:	Kygo
9:	Major Lazer
10:	Calvin Harris
11:	DJ Snake
12:	Martin Garrix
13:	Zedd
14:	Galantis
15:	Jonas Blue
16:	David Guetta
17:	Cheat Codes
18:	Clean Bandit
19:	Selena Gomez
20:	Ariana Grande


## Testing with Metal and Punk artists

In [17]:
metal_punk_playlist = [156, 200, 355, 388, 705]
print_playlist(metal_punk_playlist)

1:	Green Day
2:	Led Zeppelin
3:	My Chemical Romance
4:	Metallica
5:	Black Sabbath


In [18]:
extended_metal_punk_playlist = extend(metal_punk_playlist, 20)
print_playlist(extended_metal_punk_playlist)

1:	Green Day
2:	Led Zeppelin
3:	My Chemical Romance
4:	Metallica
5:	Black Sabbath
6:	Guns N' Roses
7:	Aerosmith
8:	AC/DC
9:	Queen
10:	Van Halen
11:	Lynyrd Skynyrd
12:	Kansas
13:	Boston
14:	The Rolling Stones
15:	Creedence Clearwater Revival
16:	Eagles
17:	The Who
18:	Steve Miller Band
19:	Bon Jovi
20:	Journey
