# Minhash for recommendations

In the [previous notebook](04-minhash.ipynb) we saw how Minhash can be used to approximate the similarity of sets. In this notebook we will see that Minhash can also be used to make recommendations. 

We illustrate this technique using a data set which contains users' listening history from a music streaming service. 

In [1]:
import pandas as pd

df = pd.read_parquet("data/music.parquet")
df.shape

(19098853, 3)

The data contains three columns and over 19million rows. Let's take a closer look at a sample of the data. 

In [2]:
df.sample(10, random_state=1)

Unnamed: 0,0,1,2
12708245,user_000684,2006-11-20T00:52:10Z,The Tragically Hip
574545,user_000023,2007-11-17T09:00:20Z,Rebekah Del Rio
11024724,user_000590,2008-06-15T20:16:28Z,Explosions In The Sky
7138851,user_000366,2006-11-18T00:59:37Z,Meltiis
7982242,user_000427,2007-12-23T06:33:29Z,Derrick Morgan
7447167,user_000384,2008-07-04T14:30:41Z,Girl Talk
17460811,user_000906,2006-09-18T12:25:52Z,Creedence Clearwater Revisited
15220317,user_000793,2007-09-27T17:38:17Z,Lightning Bolt
13575207,user_000714,2009-03-16T06:26:37Z,Britney Spears
2938512,user_000147,2006-05-08T19:43:47Z,Broadcast


The data contains three columns - the first ['0'] is a user id, the second ['1'] is a timestamp representing when the user listened to the song and the artist is named in column ['2'].

We take one pass through the data to identify all unique artists all unique users. 

In [3]:
artists = df['2'].unique()
users = df['0'].unique()

We map those artist names and user names to unique integers and store those in a dictionary.

In [4]:
dartists = {y:x+1 for x, y in enumerate(sorted(set(artists)))}
dartists2 = {x+1:y for x,y in enumerate(sorted(set(artists)))}
dusers = {y:x+1 for x,y in enumerate(sorted(set(users)))}
dusers2 = {x+1:y for x,y in enumerate(sorted(set(users)))}

In [5]:
print("There are ", len(dartists), " artists and ", len(dusers), " users in our data." , sep="")

There are 173921 artists and 992 users in our data.


We group the data set by user. From there we can see which artists a particular user has listened to:

In [6]:
grouped_df = df.groupby(['0'])

In [7]:
def user_data(user, grouped_data, dusers2):
    return grouped_data.get_group(dusers2[user]) 

def top_k_listens(listening_history, k=10):
    hist = listening_history.groupby(['2'])
    return hist.count().sort_values(by='0', ascending=False).head(k).index.values

In [8]:
u2 = user_data(2, grouped_df, dusers2)

In [9]:
(top_k_listens(u2, 10))

array(['The Libertines', 'Babyshambles', 'Kettcar', 'The Kooks',
       'Maxïmo Park', 'Death Cab For Cutie', 'Sophie Milman',
       'Bright Eyes', 'Adam Green', 'Peter Doherty'], dtype=object)

For each user, we want to generate a minhash of their listening history. The minhash class which we used in the previous notebook has been put into its own module for ease. 

In [10]:
from datasketching.minhash import SimpleMinhash
from datasketching.minhash import murmurmaker

In [11]:
def generate_minhash_sig(user_dat, nhash):
    mh = SimpleMinhash(nhash)
    for row in user_dat:
        mh.add(row)
    return mh

So for each user, we want to compose a list of all the artists they listened to. From there we will generate minhashes for each user, then make predictions. 

In [13]:
def unique_artists(df):
    return df['2'].unique()

In [14]:
un_artists = grouped_df.apply(unique_artists)

In [15]:
mh_sigs = un_artists.apply(generate_minhash_sig, nhash = 128)

Once we have minhash signatures for all of the users we can compare them. But this isnt a quick process - suppose we want to find users who are similar to user 2. 

In order to do that we have to compare 997 pairs of users. This isn't too bad, but clearly is going to get out of hand as the number of users we want to make recomendations for, and the number of users in the data set grows. 

Code which runs the comparison for user 2. 

In [16]:
sim=[]
for mh in range(0, 992):
    sim.append(mh_sigs[mh].similarity(mh_sigs[1]))

Let's take a look at the users who are most similar to user 2:

In [17]:
similar = set(sorted(sim, reverse = True)[1:10])
similar

{0.1796875, 0.1875, 0.1953125, 0.203125, 0.2265625}

In [18]:
similar_users = ([i for i, e in enumerate(sim) if e in similar])

These are the most similar users. Let's go ahead and look at the top artists listened to by all these users. 

Going to look at the unique artists listened to by each of these, remove uniques listened to by user 2, and then return the most listened across the other users. 

Look at the ?top 10 artists most listened to by these users that our user didnt listen to. 

In [19]:
def user_data(user, grouped_data, dusers2):
    return grouped_data.get_group(dusers2[user]) 

def top_k_listens(listening_history, k=10):
    hist = listening_history.groupby(['2'])
    return hist.count().sort_values(by='0', ascending=False).head(k).index.values

In [20]:
u2 = user_data(2, grouped_df, dusers2)
top_k_listens(u2, 10)

array(['The Libertines', 'Babyshambles', 'Kettcar', 'The Kooks',
       'Maxïmo Park', 'Death Cab For Cutie', 'Sophie Milman',
       'Bright Eyes', 'Adam Green', 'Peter Doherty'], dtype=object)

In [21]:
import numpy as np
unheard = []
for u in similar_users:
    u_dat = user_data(u, grouped_df, dusers2)
    unheard = unheard + list(top_k_listens(u_dat, 10))

In [22]:
np.setdiff1d(unheard, un_artists[1])

array(['2Pac', 'A Boy Named Thor', 'Afi', 'Akira Yamaoka', 'Amaral',
       'Annihilator', 'Baba Zula', 'Black Sabbath', 'Brigitte Bardot',
       'Bt', 'Camel', 'Chris Vrenna', 'Depeche Mode', 'Franz Schubert',
       'Frédéric Chopin', 'Genesis', 'Giuseppe Verdi', 'Gogol Bordello',
       'Gustavo Santaolalla', 'Göksel Baktagir', 'Infusion',
       'Iron Maiden', 'Jane Birkin & Serge Gainsbourg', 'Johannes Brahms',
       'Jurassic 5', 'Laura Pausini', 'Luz Casal', 'M. Ward', 'Mae',
       'Megadeth', 'Mercan Dede', 'Mew', 'Ministry Of Sound', 'Mis-Teeq',
       'N.E.R.D.', 'Orbital', 'Ozric Tentacles', 'Paul Mccartney',
       'Queens Of The Stone Age', 'Rush', 'Scorpions', 'Serge Gainsbourg',
       'Serj Tankian', 'Story One', 'The Bear Quartet',
       'The Crystal Method', 'The Minders', 'The Twin Atlas', 'Tool',
       'Tori Amos', 'Unkle', 'Сергей Васильевич Рахманинов',
       'Сергей Сергеевич Прокофьев'], dtype='<U30')

In [23]:
# Also tried an alternative method, lookin at "most listened to by all those similar users",
# but once you remove everything user 2 listens to, there is no artist listened to by multiple users! 

import numpy as np
unheard = []
for u in similar_users:
    unheard = unheard + list(np.setdiff1d(un_artists[u], un_artists[1]))
    
from itertools import groupby
freqs = [(key, len(list(group))) for key, group in groupby(unheard)]

sorted_fr = sorted(freqs, key = lambda i: i[1],reverse=False) 
sorted_fr[:10]

[('!!!', 1),
 ('+/-', 1),
 ('1 Mile North', 1),
 ('10,000 Maniacs', 1),
 ('13 & God', 1),
 ('1800S Sea Monster', 1),
 ('1990S', 1),
 ("2 Many Dj'S", 1),
 ('2Pac', 1),
 ('4Hero', 1)]

## Locality-Sensitive Minhash

One big disadvantage of using Minhash signatures to identify similar users is the number of pairwise comparisons which must be made to determine similarity. 

Locality-sensitive Minhash is a technique we can use to identify candidate pairs of similar users for a much smaller computational cost. The method works by hashing subsets minhash signatures. If 2 users have identical signatures in ANY of the subsets these users are considered a candidate pair. And from there you can go and compute their approximate Jaccard index, or similarity, using the full minhash signatures, to determine just how similar they are, and decide if you want to make recommendations. 

The way in locality sensitive minhash works is by splitting the minhash signatures into bands. The bands are then hashed to buckets. 

if, in any band, two users map to the same bucket, they would be considered a candidate pair. At that point you’d go back and look at their minhash signatures, and compare those to determine how similar the users are. 


And thus we only have to compute the similarity of the minhash signatures for a subset of the whole population. 