### <span style="color:red">This notebook was used to automate hyperparameter tuning</span>
We grouped the whole process, from setting the parameters to training the model, into a single function. We call this function in a loop to vary the hyperparameter we wish to optimize; the metrics are appended to a text file after each model finishes training.

# Deezer playlist dataset and song recommendation with word2vec

In this mini project we will develop a word2vec network and use it to build a playlist completion tool (song suggestion). The data is hosted in the following repository: http://github.com/comeetie/deezerplay.git. To learn more about word2vec and this data, see the following two references:

- Efficient estimation of word representations in vector space, Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. (https://arxiv.org/abs/1301.3781)
- Word2vec applied to Recommendation: Hyperparameters Matter, H. Caselles-Dupré, F. Lesaint and J. Royo-Letelier. (https://arxiv.org/pdf/1804.04212.pdf)

The parts you have to complete are highlighted in red.

## Preparation of data

The data is a list of playlists. Each playlist is a list in which the Deezer ID of each song is followed by the artist ID.

In [1]:
import numpy as np
data = np.load("./music_2.npy",allow_pickle=True)
[len(data), np.mean([len(p) for p in data])]

[100000, 24.21338]

The dataset we are going to work on contains 100,000 playlists, with an average length of about 24.2 entries. We will start by keeping only the song identifiers.

In [2]:
playlist_track = [list(filter(lambda w: w.split("_")[0]==u"track",playlist)) for playlist in data]
playlist_artist = [list(filter(lambda w: w.split("_")[0]==u"artist",playlist)) for playlist in data]

In [3]:
# number of distinct songs
tracks = np.unique(np.concatenate(playlist_track))
Vt = len(tracks)
Vt

338509

The number of distinct songs in this dataset is quite high, with more than 300,000 songs.

## Creating a song dictionary
We will assign each song an integer that serves as its unique identifier and as the input to our network. To save resources, we will only work with songs that occur at least twice in the playlists.

In [4]:
# number of occurrences of each track
track_counts = dict((tracks[i],0) for i in range(0, Vt))
for p in playlist_track:
    for a in p:
        track_counts[a] = track_counts[a] + 1

In [5]:
# Filter very rare songs to save resources
playlist_track_filter = [list(filter(lambda a : track_counts[a]> 1, playlist)) for playlist in playlist_track]
# get the counts
counts  =  np.array(list(track_counts.values()))
# sort
order = np.argsort(-counts)
# array of Deezer ids, sorted by decreasing count
tracks_list_ordered = np.array(list(track_counts.keys()))[order]
# Vocabulary size = number of kept songs
Vt=np.where(counts[order]==1)[0][0]
# build the dict: Deezer track id -> integer id in [0, Vt)
track_dict = dict((tracks_list_ordered[i],i) for i in range(0, Vt))
# playlist conversion to list of integers
corpus_num_track = [[track_dict[track] for track in play ] for play in playlist_track_filter]
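
As a minimal sketch of the same idea on toy data, the counting, filtering, and indexing steps can also be written with `collections.Counter`. The toy playlists and the helper name `build_track_dict` are illustrative assumptions and are not part of the notebook.

```python
from collections import Counter

def build_track_dict(playlists, min_count=2):
    """Map each track seen at least `min_count` times to an integer id.

    Ids are assigned by decreasing frequency, so id 0 is the most frequent track.
    """
    counts = Counter(track for playlist in playlists for track in playlist)
    kept = [track for track, c in counts.most_common() if c >= min_count]
    return {track: i for i, track in enumerate(kept)}

# Toy playlists (hypothetical ids written in the "track_<id>" style)
toy_playlists = [
    ["track_1", "track_2", "track_3"],
    ["track_2", "track_3"],
    ["track_3", "track_4"],
]
toy_dict = build_track_dict(toy_playlists)
# Convert playlists to integer ids, dropping the filtered-out tracks
toy_corpus = [[toy_dict[t] for t in p if t in toy_dict] for p in toy_playlists]
print(toy_dict)    # {'track_3': 0, 'track_2': 1}
print(toy_corpus)  # [[1, 0], [1, 0], [0]]
```

Sorting by frequency before assigning ids keeps the most common songs at the lowest indices, which matches the ordering used in the cell above.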

### Creating the training, validation, and test sets

To learn the parameters of our method we keep the first l-1 songs of each playlist for training (where l is the length of the playlist). To evaluate completion performance, we keep the last two songs of each playlist: the objective is to recover the last song from the next-to-last one.
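
On a single toy playlist, this split can be sketched as follows. The function name `split_playlist` is a hypothetical helper introduced only for illustration.

```python
def split_playlist(playlist):
    """Return (training part, (seed, target)) for one playlist of integer ids.

    The training part is everything except the last song. The evaluation pair
    is the next-to-last song (seed) and the last song (target to recover).
    """
    if len(playlist) < 2:
        raise ValueError("need at least two songs to build an evaluation pair")
    train_part = playlist[:-1]
    seed, target = playlist[-2], playlist[-1]
    return train_part, (seed, target)

train_part, (seed, target) = split_playlist([3, 7, 1, 9, 4])
print(train_part)    # [3, 7, 1, 9]
print(seed, target)  # 9 4
```

Note that the seed stays in the training part, while the target is never seen during training.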



In [6]:
# main part of each playlist, used for training
play_app  = [corpus_num_track[i][:(len(corpus_num_track[i])-1)] 
             for i in range(len(corpus_num_track)) if len(corpus_num_track[i])>1]
# the last two elements are used for validation and test
# sample without replacement so that the test indices are unique
index_tst = np.random.choice(100000, 20000, replace=False)
index_val = np.setdiff1d(range(100000),index_tst)

play_tst  = np.array([corpus_num_track[i][(len(corpus_num_track[i])-2):len(corpus_num_track[i])] 
             for i in index_tst if len(corpus_num_track[i])>3])
play_val  = np.array([corpus_num_track[i][(len(corpus_num_track[i])-2):len(corpus_num_track[i])] 
             for i in index_val if len(corpus_num_track[i])>3])[:10000]


In [7]:
# import Keras
from keras.models import Sequential, Model
from keras.layers import Embedding, Reshape, Activation, Input, Dense, Flatten, Dot
from keras.utils import np_utils
from keras.preprocessing.sequence import skipgrams

### word2vec hyperparameters

The word2vec method involves a number of hyperparameters. We define them below and give them initial values that we will refine later.
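
One of these hyperparameters, the smoothing exponent applied to the negative-sampling distribution, is the one varied later in this notebook. As a small illustration on made-up counts (the list `toy_counts` and the helper `smoothed_distribution` are assumptions for this sketch), raising the frequencies to a power `alpha` and renormalizing changes how often popular songs are drawn as negatives:

```python
import numpy as np

def smoothed_distribution(counts, alpha):
    """Raise normalized frequencies to the power `alpha`, then renormalize."""
    freq = np.asarray(counts, dtype=float)
    freq = freq / freq.sum()
    smooth = np.power(freq, alpha)
    return smooth / smooth.sum()

toy_counts = [8, 1, 1]  # one popular song and two rare ones
for alpha in (1.0, 0.0, -1.0):
    print(alpha, np.round(smoothed_distribution(toy_counts, alpha), 3))
# alpha = 1  keeps the empirical frequencies: [0.8, 0.1, 0.1]
# alpha = 0  gives the uniform distribution
# alpha = -1 favors rare songs: [0.059, 0.471, 0.471]
```

Negative exponents invert the popularity ranking, so rare songs become the most likely negatives.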


In [8]:
from sklearn.neighbors import KDTree
import time
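
The training function in the next cell also sub-samples positive pairs whose first song is very frequent, controlled by a threshold `sub_t`. The discard probability used there is (f - t)/f - sqrt(t/f), where f is the relative frequency of the song and t the threshold. This is equivalent to keeping a pair with probability min(1, t/f + sqrt(t/f)). The sketch below evaluates that keep probability in isolation; the helper name `keep_probability` and the example frequencies are assumptions made for illustration.

```python
import numpy as np

def keep_probability(freq, sub_t):
    """Probability of keeping a pair whose first song has relative frequency `freq`."""
    freq = np.asarray(freq, dtype=float)
    ratio = sub_t / freq
    # 1 - discard probability, capped at 1 for rare songs
    return np.minimum(1.0, ratio + np.sqrt(ratio))

# Hypothetical relative frequencies, from very common to rare
freqs = np.array([1e-2, 1e-4, 1e-6])
print(keep_probability(freqs, sub_t=1e-5))
# approximately [0.033, 0.416, 1.0]: common songs are mostly dropped,
# songs rarer than the threshold are always kept
```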

In [9]:
def full_process(vector_dim_input,window_width_input,neg_sample_input,min_batch_size_input,samp_coef_input,sub_samp_input):  
    # latent space dimension
    vector_dim = vector_dim_input
    # window size
    window_width = window_width_input
    # number of negative sample per positive sample
    neg_sample = neg_sample_input
    # mini-batch size (number of playlists drawn per batch)
    min_batch_size = min_batch_size_input
    # smoothing factor for the sampling table of negative pairs 
    samp_coef = samp_coef_input
    # parameter to sub-sample frequent songs
    sub_samp = sub_samp_input

    # get the counts
    counts = np.array(list(track_counts.values()),dtype='float')[order[:Vt]]
    # normalization
    st =  counts/np.sum(counts)
    # smoothing
    st_smooth = np.power(st,samp_coef)
    st_smooth = st_smooth/np.sum(st_smooth)

    # inputs
    input_target = Input((1,), dtype='int32')
    input_context = Input((1,), dtype='int32')

    embedding = Embedding(Vt, vector_dim, input_length=1, name='embedding')
    target = embedding(input_target)
    context = embedding(input_context)
    dot_product = Dot(axes=2, normalize=True)([target, context])
    dot_product = Flatten()(dot_product)

    output = Dense(1, activation='sigmoid',name="classif")(dot_product)

    Track2Vec = Model(inputs=[input_target, input_context], outputs=output)
    Track2Vec.compile(loss='binary_crossentropy', optimizer='adam',metrics=["accuracy"])

    # function to generate word2vec positive and negative pairs
    # from an array of ints that represents a text or, here, a playlist
    # params
    # seq : input text or playlist (array of int)
    # window : context window size
    # neg_samples : number of negative samples to generate per positive one
    # neg_sampling_table : sampling table for negative samples
    # sub_sampling_table : sampling table for sub-sampling common words / songs
    # sub_t : sub-sampling parameter
    def word2vecSampling(seq,window,neg_samples,neg_sampling_table,sub_sampling_table,sub_t):
        # vocab size
        V = len(neg_sampling_table)
        # extract positive pairs 
        positives = skipgrams(sequence=seq, vocabulary_size=V, window_size=window,negative_samples=0)
        ppairs    = np.array(positives[0])
        # sub sampling
        if (ppairs.shape[0]>0):
            f = sub_sampling_table[ppairs[:,0]]
            subprob = ((f-sub_t)/f)-np.sqrt(sub_t/f)
            tokeep = (subprob<np.random.uniform(size=subprob.shape[0])) | (subprob<0)
            ppairs = ppairs[tokeep,:]
        nbneg     = ppairs.shape[0]*neg_samples
        # sample negative pairs
        if (nbneg > 0):
            negex     = np.random.choice(V, nbneg, p=neg_sampling_table)
            negexcontext = np.repeat(ppairs[:,0],neg_samples)
            npairs    = np.transpose(np.stack([negexcontext,negex]))
            pairs     = np.concatenate([ppairs,npairs],axis=0)
            labels    = np.concatenate([np.repeat(1,ppairs.shape[0]),np.repeat(0,nbneg)])
            perm      = np.random.permutation(len(labels))
            res = [pairs[perm,:],labels[perm]]
        else:
            res=[[],[]]
        return res

    import random

    def track_ns_generator(corpus_num, nbm):

        while 1:

            # draw nbm playlists at random from corpus_num
            result = [word2vecSampling(batch, window_width, neg_sample, st_smooth, st, sub_samp) for batch in random.sample(corpus_num, nbm)]
            x_temp = np.array([i for fxres in [rot[0] for rot in result if len(rot[0]) > 0] for i in fxres], dtype=np.int32)

            # build the x and y data
            y = np.array([i for fyres in [rot[1] for rot in result if len(rot[1]) > 0] for i in fyres], dtype=np.int32)
            x = [x_temp[:,0], x_temp[:,1]]

            yield (x, y)

    start_time = time.time()
    #Learning
    # %%script false --no-raise-error
    hist=Track2Vec.fit(x=track_ns_generator(play_app,min_batch_size),steps_per_epoch = 200,epochs=60)
    end_time = time.time()
    #save weights
    vectors_tracks = Track2Vec.get_weights()[0]
    with open('latent_positions.npy', 'wb') as f:
        np.save(f, vectors_tracks)
    #load them    
    vectors_tracks=np.load("latent_positions.npy")
    # nearest-neighbour search (about 5 min on top of the training time)
    kdt = KDTree(vectors_tracks, leaf_size=10, metric='euclidean')
    def predict_batch(seeds,k,X,kdt):
        return kdt.query(X[seeds,:], k=k+1, return_distance=False)[:,1:]
    indexes = predict_batch(play_val[:,0],10,vectors_tracks,kdt)

    #metrics that we have to keep track of along with accuracy
    
    NDGCatK=0
    for k in [np.where(indexes[i] == value)[0] for i, value in enumerate(play_val[:,1])]:
        if len(k)>0:
            NDGCatK+= sum(1/np.log2(k+2))/len(play_val[:,1])
    
    n,HitatK = 0,0
    for i,value in enumerate(play_val[:,1]):
        if value in indexes[i]:
            n+=1
    HitatK = n/len(play_val[:,1])
    
    #save hyperparameters used + metrics + computational time in a file
    with open("Results.txt", "a") as f:
        f.write("samp_coef= {} || NDGCatK={} || HitatK={} || accuracy={} || loss={} || computational time={} seconds  \n"
                .format(samp_coef_input,NDGCatK,HitatK,hist.history.get('accuracy')[-1],hist.history.get('loss')[-1],end_time - start_time))
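
The two ranking metrics computed at the end of the function can be checked on a toy example. The sketch below is a standalone version under the same assumption as the notebook, namely a single relevant song per query. The function names `hit_at_k` and `ndcg_at_k` and the toy arrays are introduced here for illustration only.

```python
import numpy as np

def hit_at_k(recommended, targets):
    """Fraction of queries whose target appears among the recommendations."""
    hits = [t in row for row, t in zip(recommended, targets)]
    return float(np.mean(hits))

def ndcg_at_k(recommended, targets):
    """Mean of 1/log2(rank + 1) over queries (rank starts at 1, 0 on a miss)."""
    gains = []
    for row, t in zip(recommended, targets):
        pos = np.where(np.asarray(row) == t)[0]
        gains.append(1.0 / np.log2(pos[0] + 2) if len(pos) > 0 else 0.0)
    return float(np.mean(gains))

toy_recs = np.array([[5, 2, 9],   # target 5 found at rank 1
                     [4, 7, 1],   # target 1 found at rank 3
                     [3, 8, 6]])  # target 0 not found
toy_targets = np.array([5, 1, 0])
print(hit_at_k(toy_recs, toy_targets))   # 2 hits out of 3
print(ndcg_at_k(toy_recs, toy_targets))  # (1 + 1/log2(4) + 0) / 3 = 0.5
```

With one relevant item per query the ideal DCG is 1, so the normalized score reduces to the discounted gain at the position of the hit.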
    

In [11]:
samp_coef_list = [-1, -0.7, -0.5, -0.2, 0, 0.2, 0.5, 0.7, 1]
for samp_coef in samp_coef_list:
    full_process(30, 3, 5, 50, samp_coef, 0.00001)

Epoch 1/60
...
Epoch 60/60
Epoch 1/60
...