# Data Generator 

In this notebook we use a small, real dataset as the seed for generating synthetic user data. We start from the last.fm dataset, which can be found [here](http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html).

The data set contains music listening history from 992 users. Specifically, we have access to the songs a user listened to, the recording artist and a timestamp for each listen. 

For the purpose of our application (MinHash for recommendation) we only need, for each user, the list of artists they listened to and the number of times they played each artist.
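As a small illustration of that target shape, the sketch below counts plays per user and artist on a toy frame. The column names `user`, `timestamp` and `artist` and the rows themselves are made up for the example; the real frame in this notebook uses the column labels `'0'`, `'1'` and `'2'`.

```python
import pandas as pd

# Toy listening log (hypothetical rows, not taken from the last.fm data).
toy = pd.DataFrame(
    {
        "user": ["u1", "u1", "u1", "u2", "u2"],
        "timestamp": ["t1", "t2", "t3", "t4", "t5"],
        "artist": ["Broadcast", "Broadcast", "Girl Talk", "Girl Talk", "Pinch"],
    }
)

# One row per (user, artist) pair, with the number of plays for that pair.
plays = toy.groupby(["user", "artist"]).size().reset_index(name="plays")
print(plays)
```

Each row of `plays` is one artist a user listened to, together with how often they played it.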

We begin by loading in the existing data, and extracting the artists played and number of plays for each user. 

In [1]:
import pandas as pd

df = pd.read_parquet("data/music.parquet")

In [2]:
df.sample(10, random_state=1)

Unnamed: 0,0,1,2
12708245,user_000684,2006-11-20T00:52:10Z,The Tragically Hip
574545,user_000023,2007-11-17T09:00:20Z,Rebekah Del Rio
11024724,user_000590,2008-06-15T20:16:28Z,Explosions In The Sky
7138851,user_000366,2006-11-18T00:59:37Z,Meltiis
7982242,user_000427,2007-12-23T06:33:29Z,Derrick Morgan
7447167,user_000384,2008-07-04T14:30:41Z,Girl Talk
17460811,user_000906,2006-09-18T12:25:52Z,Creedence Clearwater Revisited
15220317,user_000793,2007-09-27T17:38:17Z,Lightning Bolt
13575207,user_000714,2009-03-16T06:26:37Z,Britney Spears
2938512,user_000147,2006-05-08T19:43:47Z,Broadcast


In [3]:
users = df['0'].unique()
dusers = {x+1:y for x,y in enumerate(sorted(set(users)))}
len(dusers)

992

For each user, we want to compute the set of unique artists they listened to and generate a MinHash signature from it.
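To make the idea of a signature concrete, here is a minimal, self-contained MinHash sketch. It is not the `datasketching` implementation used in the cells below; the helper names `minhash_signature` and `estimated_jaccard` are hypothetical, and salting a single `blake2b` hash with a seed is just one simple way to simulate many hash functions.

```python
import hashlib


def minhash_signature(items, nhash=128):
    """Return nhash minimum hash values for a non-empty collection of strings."""
    signature = []
    for seed in range(nhash):
        # Prefixing each item with the seed stands in for nhash different hash functions.
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest(),
                    "big",
                )
                for item in items
            )
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = {"Broadcast", "Girl Talk", "Pinch", "Teebee"}
b = {"Broadcast", "Girl Talk", "Kubiks", "Solar Fields"}
sig_a = minhash_signature(a)
sig_b = minhash_signature(b)

# The true Jaccard similarity of a and b is 2/6; the printed value is an estimate of it.
print(estimated_jaccard(sig_a, sig_b))
```

The estimate gets closer to the true Jaccard similarity as `nhash` grows, which is why a signature of 128 values is a reasonable stand-in for a user's full artist set.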

In [4]:
from datasketching.minhash import SimpleMinhash
from datasketching.minhash import murmurmaker

In [5]:
def generate_minhash_sig(user_dat, nhash):
    mh = SimpleMinhash(nhash)
    for row in user_dat:
        mh.add(row)
    return mh

def unique_artists(df):
    return df['2'].unique()

In [6]:
grouped_df = df.groupby(['0'])

In [7]:
un_artists = grouped_df.apply(unique_artists)
mh_sigs = un_artists.apply(generate_minhash_sig, nhash = 128)

In [8]:
len(un_artists[0])

657

From each user we are going to generate 100 pseudo users. For user x we do this by:

1. Sampling 70% of user x's listening history.
2. Finding the 10 most similar users among the existing users, then:
   a. sampling 6 of these;
   b. sampling 5% of each of these users' listening history (excluding *everything* that user x listened to, and everything already sampled from other users).
3. Applying a Poisson random variable to their history.
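The steps above can be sketched as a single function. This is only an illustration under stated assumptions: `make_pseudo_user` and its parameters are hypothetical names, the 5% is measured against user x's artist count (as the cells below do), and step 3 is read here as drawing play counts from a Poisson distribution with a placeholder rate, which is one possible interpretation rather than the notebook's method.

```python
import numpy as np


def make_pseudo_user(base, neighbours, rng, keep_frac=0.7, borrow_frac=0.05, n_borrow=6):
    """Build one pseudo user's artist list.

    base: list of unique artists for user x.
    neighbours: list of artist lists, one per similar user.
    """
    # Step 1: keep a random 70% of user x's artists.
    n_keep = int(len(base) * keep_frac)
    artists = rng.choice(base, size=n_keep, replace=False).tolist()

    # Step 2a: choose which of the similar users to borrow from.
    n_pick = min(n_borrow, len(neighbours))
    picked = rng.choice(len(neighbours), size=n_pick, replace=False)

    # Step 2b: borrow artists that user x never played and that were not borrowed yet.
    n_take = int(len(base) * borrow_frac)
    seen = set(base)
    for idx in picked:
        candidates = sorted(set(neighbours[idx]) - seen)
        take = min(n_take, len(candidates))
        borrowed = rng.choice(candidates, size=take, replace=False).tolist() if take else []
        artists.extend(borrowed)
        seen.update(borrowed)
    return artists


rng = np.random.default_rng(0)
base = [f"artist_{i}" for i in range(100)]
neighbours = [[f"artist_{i}" for i in range(50 * k, 50 * k + 120)] for k in range(1, 9)]
artists = make_pseudo_user(base, neighbours, rng)

# Step 3 (assumed reading): Poisson play counts, shifted so every artist has at least one play.
plays = rng.poisson(lam=5, size=len(artists)) + 1
print(len(artists), plays[:5])
```

Tracking `seen` across the loop is what enforces both exclusions in step 2b: artists user x already has, and artists already taken from another similar user.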


In [9]:
artists_listened = len(un_artists[0])

# Similarity of every user's signature to user 0's signature.
sim = [mh_sigs[i].similarity(mh_sigs[0]) for i in range(len(mh_sigs))]

# Rank users by similarity, drop user 0 itself, and keep the 10 closest.
ranked = sorted(range(len(sim)), key=lambda i: sim[i], reverse=True)
similar_users = sorted([i for i in ranked if i != 0][:10])

In [10]:
similar_users

[52, 77, 355, 383, 446, 501, 836, 840, 867, 937]

In [11]:
from random import sample 

selected = sample(similar_users, 6)

In [12]:
selected

[840, 77, 383, 836, 501, 446]

In [13]:
#### take a subsample of 70% of user 0's listening history. 

In [14]:
import numpy as np

# Number of artists to borrow from each selected user: 5% of user 0's artist count.
to_sample = int(np.floor(artists_listened * 0.05))

listened = []
for u in selected:
    print(len(listened))
    # Artists user u played that user 0 has not, and that were not already borrowed.
    possible = np.setdiff1d(un_artists[u], list(un_artists[0]) + listened)
    listened = listened + list(np.random.choice(possible, size=to_sample, replace=False))
    print(len(possible))

# Add 70% of user 0's own artists, sampled without replacement.
n_own = int(np.floor(artists_listened * 0.7))
listened = listened + list(np.random.choice(un_artists[0], size=n_own, replace=False))



0
2910
32
1252
64
1342
96
3230
128
434
160
3800


In [15]:
len(listened)

651

In [16]:
def user_data(user, grouped_data, dusers):
    return grouped_data.get_group(dusers[user]) 

In [17]:

def top_k_listens(listening_history, k=artists_listened):
    """Count plays per artist, most played first. `k` is currently unused."""
    hist = listening_history.groupby(['2'])
    return hist.count().sort_values(by='0', ascending=False) #.head(k).index.values

In [18]:
u_0 = user_data(1, grouped_df, dusers)

In [19]:
listen_vals = top_k_listens(u_0)['1'].values

In [20]:
## that's the number of listens. 

In [21]:
user_plays = np.random.choice(listen_vals, size=len(listened), replace = False)

In [22]:
type(listened)

list

In [23]:
type(user_plays)

numpy.ndarray

In [24]:
# Use a new name so the user_data() function defined above is not overwritten.
pseudo_user = {'user': np.repeat('user_0', len(listened), axis=0), 'artist': listened, 'plays': user_plays}
user_df = pd.DataFrame(pseudo_user)

In [25]:
user_df

Unnamed: 0,user,artist,plays
0,user_0,Angus And Julia Stone,7
1,user_0,Pink Martini,16
2,user_0,Total Science Vs Undercover Agent,1
3,user_0,Kubiks,115
4,user_0,Pinch,3
5,user_0,Put Your Woman First,1
6,user_0,Paolo Fedreghini,3
7,user_0,Solar Fields,15
8,user_0,Teebee,2
9,user_0,The Sentinel,1
