# Data Generator 

In this notebook we use a small, real dataset to simulate synthetic user data. The data we begin with is the 'last.fm' data set and can be found [here](http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html).  

The data set contains music listening history from 992 users. Specifically, we have access to the songs a user listened to, the recording artist and a timestamp for each listen. 

For the purpose of our application (Minhash for recommendation) it will be enough for us to just have a list of artists listened to by a user, as well as how many times they played that artist.

We begin by loading in the existing data, and extracting the artists played and number of plays for each user. 

In [1]:
import pandas as pd

df = pd.read_parquet("data/music.parquet")

In [2]:
df.sample(10, random_state=1)

Unnamed: 0,0,1,2
12708245,user_000684,2006-11-20T00:52:10Z,The Tragically Hip
574545,user_000023,2007-11-17T09:00:20Z,Rebekah Del Rio
11024724,user_000590,2008-06-15T20:16:28Z,Explosions In The Sky
7138851,user_000366,2006-11-18T00:59:37Z,Meltiis
7982242,user_000427,2007-12-23T06:33:29Z,Derrick Morgan
7447167,user_000384,2008-07-04T14:30:41Z,Girl Talk
17460811,user_000906,2006-09-18T12:25:52Z,Creedence Clearwater Revisited
15220317,user_000793,2007-09-27T17:38:17Z,Lightning Bolt
13575207,user_000714,2009-03-16T06:26:37Z,Britney Spears
2938512,user_000147,2006-05-08T19:43:47Z,Broadcast


In [3]:
users = df['0'].unique()
dusers = {x+1:y for x,y in enumerate(sorted(set(users)))}
len(dusers)

992

For each user, we want to compute the unique artists they listened to.
and generate minhash signatures for them

In [4]:
from datasketching.minhash import SimpleMinhash
from datasketching.minhash import murmurmaker

In [5]:
def generate_minhash_sig(user_dat, nhash):
    mh = SimpleMinhash(nhash)
    for row in user_dat:
        mh.add(row)
    return mh

def unique_artists(df):
    return df['2'].unique()

In [6]:
grouped_df = df.groupby(['0'])

In [7]:
un_artists = grouped_df.apply(unique_artists)
mh_sigs = un_artists.apply(generate_minhash_sig, nhash = 128)

In [8]:
len(un_artists[0])

657

from each user we are going to generate pseudo users. For user x we do this by: 
    
    1. sampling 70% of user x's listening history 
    2. finding the 10 most similar users from the existing set of users
        a. sample 6 of these
        b. sample 5% of each of these user's listening history (excluding *everything* that user x listened to, and everything already sampled from other users.)
    3. sample the frequency of plays for each artist randomly, without replacement, from user x's play frequencies. 


In [9]:
x = mh_sigs[0]
artists_listened = len(un_artists[0])
sim=[]
for mh in range(0, 992):
    sim.append(mh_sigs[mh].similarity(mh_sigs[0]))

similar = set(sorted(sim, reverse = True)[1:11])
similar
similar_users = ([i for i, e in enumerate(sim) if e in similar])

In [10]:
similar_users

[52, 77, 355, 383, 446, 501, 836, 840, 867, 937]

In [11]:
from random import sample 

selected = sample(similar_users, 6)

In [12]:
selected

[836, 501, 867, 383, 840, 77]

In [13]:
import numpy as np

####### all of that user's listening history 

heard = un_artists[0]
listened = []
to_sample = int(np.floor(artists_listened)*0.05)
for u in selected:
    possible = np.setdiff1d(un_artists[u], (list(un_artists[0])+listened))
    listened = listened + list(np.random.choice(un_artists[u], size=to_sample, replace = False))

    
#### take a subsample of 70% of user 0's listening history. 
listened = listened + list(np.random.choice(un_artists[0], size=int(np.floor(artists_listened*0.7))))

In [14]:
listen_vals = grouped_df.get_group(dusers[1]).groupby(['2']).count()['1'].values
listen_vals[0:100]

array([ 26,   1,   1,   2, 146,  64,   2,   1,   5,   4,   3,   2,  21,
         1,  86,  18,   5,   1, 128,   6,  15,   2,   5,   1,  12,   5,
         2,   7,   1,   4,  35,   1,   1,   2,   1,   1,   6,   8,  21,
         4,  10,   7,  19,   1,   4,   1,   1,   2,   8,   1,   7,   5,
         4,   3,  76,   4,   8,   9,  19,   1,   1,   1,  14, 448,   1,
         4,   1,   1,   1, 180,   6,  12,  17,   1,   1,   4,   1,  13,
        16,   4,   1, 100,   3,  67,   6,   1,   1,  11,   2, 115,   4,
         1,   1,   7,   1,   1,  12,   1,   4,  22])

In [15]:
user_plays = np.random.choice(listen_vals, size=len(listened), replace = False)

In [16]:
user_data = {'user':np.repeat('user_0',len(listened), axis=0), 'artist':listened, 'plays':user_plays} 
user_df = pd.DataFrame(user_data) 

In [17]:
user_df.sample(10)

Unnamed: 0,user,artist,plays
397,user_0,Hagen String Quartett,1
353,user_0,Alva Noto,1
7,user_0,Älien Mutation,1
510,user_0,A Hundred Birds,4
628,user_0,Will Web,29
261,user_0,Steve Orchard,1
169,user_0,Maroon 5,13
473,user_0,Hagen String Quartett,1
231,user_0,Pleasure,2
121,user_0,Gil Scott-Heron,2


Now we wrap this in a function to simulate multiple pseudo users from each user. 

In [18]:
import random
random.seed(102)
new_users = pd.DataFrame( columns=['user', 'artist','plays'])    
ii = 0 

for u in range(0, 30):    
    print(u)
    x = mh_sigs[u]
    artists_listened = len(un_artists[u])
    to_sample = int(np.floor(artists_listened)*0.05)
    sim=[]
    for mh in range(0, 992):
        sim.append(mh_sigs[mh].similarity(mh_sigs[0]))
    
    similar = set(sorted(sim, reverse=True)[1:11]) # the ten largest similarities
    similar_users = ([i for i, e in enumerate(sim) if e in similar]) # extract the user values
    
    
    user_play_fr = grouped_df.get_group(dusers[(u+1)]).groupby(['2']).count()['1'].values
    
    
    for j in range(0, 20):
        # print(j)
        ### make 5 new users for each user
        username = 'user'+str(u)+str(j)
        #print(username)
        selected = sample(similar_users, 6)
        listened = []
        for k in selected:
            possible = np.setdiff1d(un_artists[k], (list(un_artists[u])+listened))
            listened = listened + list(np.random.choice(un_artists[k], size = to_sample, replace = False))
            
        listened = listened + list(np.random.choice(un_artists[u], size=int(np.floor(artists_listened*0.7))))
        
        ### now simulate user plays. 
        user_plays = np.random.choice(user_play_fr, size=len(listened), replace = False)
        
        user_data = {'user':np.repeat(username,len(listened), axis=0) , 'artist':listened, 'plays':user_plays} 
        user_df = pd.DataFrame(user_data) 
        new_users = pd.concat([new_users, user_df])
        
    ii = ii + 1
    #print(ii)
    if ii == 10:
        ### write file to parquet every 10th user, and begin a new file
        filename='data/userdat'+str(u)+'.parquet'
        print(filename)
        new_users.to_parquet(filename)
        ii = 0
        new_users = pd.DataFrame( columns=['user', 'artist','plays'])    


        


0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
data/userdat9.parquet
10
1
11
2
12
3
13
4
14
5
15
6
16
7
17
8
18
9
19
10
data/userdat19.parquet
20
1
21
2
22
3
23
4
24
5
25
6
26
7
27
8
28
9
29
10
data/userdat29.parquet
