# Minhash for recommendations

In the [previous notebook](04-minhash.ipynb) we saw how Minhash can be used to approximate the similarity of sets. In this notebook we will see that Minhash can also be used to make recommendations. 

We illustrate this technique using a data set which contains users' listening history from a music streaming service. If you're interested in how we generated this data, take a look at [this notebook](99a-data-generator.ipynb). 

In [1]:
import os
import pandas as pd

path = 'data/' 
files = os.listdir(path)
files_data = [i for i in files if i.startswith('userdat')][0:2] #listing all files in the directory of the correct form
df = pd.DataFrame(columns=['user', 'artist','plays'])
for j in files_data:
    pseudo_data = pd.read_parquet('data/'+j)
    df = pd.concat([df, pseudo_data])
    print(df.shape)

(2593050, 3)
(5558500, 3)


The data contains three columns and over 44million rows. Let's take a closer look at a sample of the data. 

In [2]:
df.sample(10, random_state=1)

Unnamed: 0,user,artist,plays
25,16835,159869,1
286,20566,112677,1
345,15874,28069,1
5034,17126,5326,3
1072,21593,104998,28
35,16299,50306,1
254,20407,158054,1
171,16519,17408,1
458,18413,94156,12
767,16707,91662,3


The first column is an integer representing a user id, the second is an integer representing an artist name, and the third column is an integer indicating how many times the user listened to the artist. 

We take one pass through the data to identify all unique artists all unique users and artists. 

In [3]:
artists = df['artist'].unique()
users = df['user'].unique()

In [4]:
print("There are ", len(artists), " artists and ", len(users), " users in our data." , sep="")

There are 45785 artists and 6200 users in our data.


We map those user names to unique integers and store those in a dictionary.

In [5]:
dusers = {x+1:y for x,y in enumerate(sorted(set(users)))}

We also load in a dictionary which maps from the artist integers to artist names. 

In [6]:
import pickle
file = open("data/dartists.pkl","rb")
dartists = pickle.load(file)

We want to convert the integers representing artist names back into artist names, using the dictionary. 

We group the data set by user. From there we can see which artists a particular user has listened to:

In [7]:
import numpy as np

def user_data(user, grouped_data, dusers):
    return grouped_data.get_group(dusers[user]) 

def top_k_listens(listening_history, k=10):
    top_k = listening_history.sort_values(by="plays", ascending = False)["artist"].head(k).values
    return artist_names(top_k)

def artist_names(artist_ints, artist_dic = dartists):
    return [artist_dic[k] for k in artist_ints]

In [8]:
df.sample(10)

Unnamed: 0,user,artist,plays
1222,21261,50314,1
1516,20600,159867,1
37,19208,27917,1
1865,19025,73240,7
457,18494,54504,1
459,16449,159589,19
396,18508,114830,2
2593,15540,167984,80
779,15549,71498,1
1151,21328,154714,142


In [9]:
df['artists']=np.array(artist_names(df['artist'].values))

In [10]:
df.sample(10)

Unnamed: 0,user,artist,plays,artists
149,20365,162345,48,Thirdorgan
1494,17973,123627,5,Ruth Rendell
93,16205,49970,7,Yuppster
145,18951,51945,3,Altoriø Ðeðëliai
28,17175,119050,1,Al Bowlly & New Mayfair Dance
42,21187,1509,180,Come Undone
669,21451,127596,30,"Destiny'S Child, Sugababes, Inaya Day, Etc."
2954,20598,37203,3,The Klezmer Conservatory Band
97,18960,83796,1,The Samuel Jackson Five
760,17207,70590,1,Chaoze One


In [11]:
grouped_df = df.groupby(['user'])

For a particular user we can have a look at their listening history as well as their most listened to artists. 

In [12]:
import numpy as np
u100 = user_data(100, grouped_df, dusers)
u100_samp = u100.sample(10)

In [13]:
u100_samp

Unnamed: 0,user,artist,plays,artists
377,15600,101082,9,Badgeman
99,15600,144794,17,Sway Feat. Akon
138,15600,137355,1,Ricardo Villalobos/Studio 1
100,15600,125234,5,Elephant
72,15600,57372,2,Carnival
148,15600,126762,2,The Sect & Raiden
49,15600,43264,1,Wighnomy Brothers
118,15600,169158,22,John Boswell
165,15600,157873,1,Digital Beach Feat Sandra O
15,15600,106108,10,Lighthouse Family


In [14]:
top_k_listens(u100, 10)

['Beyblade',
 'Soichiro Otsuka',
 'Subwave',
 '98º',
 'Cccp-Svegliami',
 'The Assassination Collective',
 'Conjunto Época De Ouro',
 'Quinn Golden',
 'Impellitteri',
 'Clubhouse']

For each user, we want to generate a minhash of their listening history. The minhash class which we used in the previous notebook has been put into its own module for ease. 

In [15]:
from datasketching.minhash import SimpleMinhash
from datasketching.minhash import murmurmaker

In [16]:
def generate_minhash_sig(user_dat, nhash):
    mh = SimpleMinhash(nhash)
    for row in user_dat:
        mh.add(row)
    return mh

So for each user, we want to compose a list of all the artists they listened to. From there we will generate minhashes for each user, then make predictions. 

In [17]:
un_artists = grouped_df['artist']

In [19]:
## this next cell takes about ten minutes to run with 128 nhash
## 80 minutes with 1024 hash functions. 

In [20]:
import time

start = time.time()
mh_sigs = un_artists.apply(generate_minhash_sig, nhash = 128)
end = time.time()

print(end-start)

632.7212808132172


Once we have minhash signatures for all of the users we can compare them. But this isnt a quick process - suppose we want to find users who are similar to user 2. 

In [21]:
sim=[]
for mh in range(1, len(mh_sigs)):
    sim.append(mh_sigs[dusers[mh]].similarity(mh_sigs[dusers[100]]))

Let's take a look at the users who are most similar to user 2:

In [22]:
similar = set(sorted(sim, reverse = True)[1:10])
similar_users = ([i for i, e in enumerate(sim) if e in similar])
for j in similar_users:
    print(dusers[j])

15555
15558
15568
15574
15577
15579
15587
15589
15593
15598


These are the most similar users. Let's go ahead and look at the top artists listened to by all these users. 

Going to look at the unique artists listened to by each of these, remove uniques listened to by user 2, and then return the most listened across the other users. 

Look at the ?top 10 artists most listened to by these users that our user didnt listen to. 

In [23]:
unheard = []
for u in similar_users:
    u_dat = user_data(u, grouped_df, dusers)
    unheard = unheard + list(top_k_listens(u_dat, 2))

In [24]:
unheard

['The Paper Dolls',
 'Lazze Ohlyz',
 'Дони И Нети & Мариана Попова & Графа',
 'Dumb Dan',
 'Îðäàëèîí',
 "Sex O'Clock Usa",
 'Isookschitterend',
 'Danilo Ercole',
 'Urban Cookie Collective',
 'Carl Loewe',
 'Zhiguli',
 'Cyril Paulus',
 'Dam',
 'Chris Berry & Panjea',
 'I Pilot Daemon',
 'De Mens',
 'Quinn Golden',
 'Andrea Williams',
 'Screen Test',
 'Louden Swain']

So that's just a quick example of how we can use minhash to identify songs we should recommend to a particular user. This method works fine on a small number of users, but falls into dificulty when the number of users grows, and the number of users for which we want to make recommendations for grows. 

## Locality-Sensitive Minhash

One big disadvantage of using Minhash signatures to identify similar users is the number of pairwise comparisons which must be made to determine similarity. 

Locality-sensitive Minhash is a technique we can use to identify candidate pairs of similar users for a much smaller computational cost. The method works by hashing subsets minhash signatures. If 2 users have identical signatures in ANY of the subsets these users are considered a candidate pair. And from there you can go and compute their approximate Jaccard index, or similarity, using the full minhash signatures, to determine just how similar they are, and decide if you want to make recommendations. 

The way in locality sensitive minhash works is by splitting the minhash signatures into bands. The bands are then hashed to buckets. 

if, in any band, two users map to the same bucket, they would be considered a candidate pair. At that point you’d go back and look at their minhash signatures, and compare those to determine how similar the users are. 


And thus we only have to compute the similarity of the minhash signatures for a subset of the whole population. 

In [25]:
from datasketching.minhash import LSHMinhash
import random

In [26]:
def lsmh(mh_sig, bands):
    ### assumes that the lenth of the mhsig is divisible by bands. 
    ### make more robust
    rows = int(len(mh_sig.buckets)/bands)
    return [mh_sig.hashes[0]([b for b in band]) for band in mh_sig.buckets.copy().reshape((rows, bands))]
    

In [28]:
from collections import defaultdict

bands = [defaultdict(lambda: list()) for i in range(16)]

for ind, mh_sg in enumerate(mh_sigs):
    for idx, key in enumerate(lsmh(mh_sg, bands=16)):
        bands[idx][key % (1 << 14)].append(ind)

We've made a dictionary of values for each band, where the keys correspond to buckets, and the values are indexes of minhash signature which mapped to that bucket. 

If two minhash signatures hash to the same key in ANY band we consider them to be 'candidate pairs'. 

This means that the corresponding users _may_ be similar. We can check if they are similar by comparing the set of artists they have each listend to. From there we can use this information to make recomendations, or move on to consider other candidate pairs. 

In [29]:
bands

[defaultdict(<function __main__.<listcomp>.<lambda>()>,
             {15922: [0],
              9137: [1],
              11451: [2, 19],
              10018: [3, 3246],
              8993: [4, 660, 799],
              7087: [5, 5163],
              4477: [6, 1132],
              6296: [7, 43],
              5901: [8],
              14274: [9, 4584],
              9295: [10],
              13749: [11],
              15662: [12],
              3040: [13],
              4237: [14],
              1938: [15],
              2496: [16, 1819],
              10517: [17],
              10: [18, 5268],
              4395: [20],
              11511: [21],
              10497: [22, 2313],
              1582: [23],
              4565: [24, 279, 2948, 4831],
              7433: [25],
              11778: [26, 845],
              6989: [27],
              14322: [28, 2089],
              14640: [29],
              3598: [30],
              7864: [31],
              12040: [32, 5299],
              252