<a href="https://colab.research.google.com/github/sergiomora03/AdvancedTopicsAnalytics/blob/main/exercises/E3-Song%20EmbeddingsVisualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Song Embeddings - Skipgram Recommender

In this notebook, we'll use human-made music playlists to learn song embeddings. We'll treat each playlist as a sentence and the songs it contains as words. We feed these "sentences" to the word2vec algorithm, which learns an embedding for every song. These embeddings can then be used to recommend similar songs. Spotify, Airbnb, Alibaba, and others use this technique, and the recommendations it produces account for a large share of their user activity, media consumption, and/or sales (in the case of Alibaba).

The [dataset we'll use](https://www.cs.cornell.edu/~shuochen/lme/data_page.html) was collected by Shuo Chen from Cornell University. The dataset contains playlists from hundreds of radio stations from around the US.

## Importing packages and dataset

In [None]:
import numpy as np
import pandas as pd
import gensim
from gensim.models import Word2Vec
from urllib import request
import warnings
warnings.filterwarnings('ignore')

The playlist dataset is a text file in which every line represents one playlist, written as a whitespace-separated sequence of song IDs.

In [None]:
# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]
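
To make the parsing step concrete, here is a minimal sketch that applies the same logic to a small in-memory string instead of the downloaded file. The sample text and the `parse_playlists` helper are hypothetical, made up for illustration; only the steps (skip two metadata lines, split on whitespace, drop single-song playlists) mirror the cell above.

```python
def parse_playlists(raw_text, n_header_lines=2):
    """Split raw text into playlists (lists of song-ID strings).

    Hypothetical helper: skips the metadata lines, tokenizes each
    remaining line on whitespace, and drops playlists with fewer
    than two songs.
    """
    lines = raw_text.split('\n')[n_header_lines:]
    tokenized = (line.split() for line in lines)
    return [songs for songs in tokenized if len(songs) > 1]


# Made-up sample: two metadata lines, then three playlists,
# one of which contains a single song and is filtered out.
sample = "metadata line 1\nmetadata line 2\n0 1 2 3 \n4 \n5 6 \n"
toy_playlists = parse_playlists(sample)
print(toy_playlists)
```

Splitting with `split()` and no argument also discards trailing whitespace, so the empty string left after the final newline produces an empty token list and is dropped by the length check.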


The `playlists` variable now contains a Python list. Each item in this list is a playlist, itself a list of song IDs. We can look at the first two playlists here:

In [None]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',

## Training the Word2Vec Model
Our dataset is now in the shape the Word2Vec model expects as input. Any parameter we don't set keeps its gensim default; in particular, gensim trains a CBOW model (`sg=0`) unless `sg=1` is passed to select skip-gram. We pass the dataset to the model and set the following key parameters:
 * **vector_size**: Embedding size for the songs.
 * **window**: word2vec algorithm parameter: the maximum distance between the current and predicted word (song) within a sentence (playlist).
 * **negative**: word2vec algorithm parameter: the number of negative examples used at each training step, which the model must learn to identify as noise.
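
The effect of **window** is easiest to see by listing which songs count as context for each song in a playlist. The following is a simplified sketch, not gensim's actual implementation: `context_pairs` and the toy playlist are hypothetical, and refinements that real word2vec implementations apply (such as negative sampling) are omitted.

```python
def context_pairs(playlist, window):
    """Yield (center, context) song pairs at most `window` positions apart."""
    for i, center in enumerate(playlist):
        start = max(0, i - window)
        end = min(len(playlist), i + window + 1)
        for j in range(start, end):
            if j != i:  # a song is not its own context
                yield center, playlist[j]


# Made-up playlist of four song IDs
toy_playlist = ['10', '11', '12', '13']
pairs = list(context_pairs(toy_playlist, window=1))
print(pairs)
```

With `window=1` each song is paired only with its immediate neighbours. A large value such as `window=20` treats most of a playlist as shared context, which fits the idea that songs in the same playlist are related even when they are not adjacent.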


In [None]:
model = Word2Vec(playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4)

The model is now trained. Every song has an embedding. We only have song IDs, though, no titles or other info. Let's grab the song information file.

## Song Title and Artist File
Let's load and parse the file containing song titles and artists.

In [None]:
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]

Now, `songs` is a list containing the ID, title, and artist of every song in our dataset. It looks like this:

In [None]:
songs[:3]

[['0 ', 'Gucci Time (w\\/ Swizz Beatz)', 'Gucci Mane'],
 ['1 ', 'Aston Martin Music (w\\/ Drake & Chrisette Michelle)', 'Rick Ross'],
 ['2 ', 'Get Back Up (w\\/ Chris Brown)', 'T.I.']]
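
Note that the IDs in this output carry a trailing space (`'0 '`). As a minimal sketch of an alternative, the lines could be parsed into a dictionary keyed by integer ID. The `parse_song_lines` helper is hypothetical, and the sample lines are reconstructed from the first two rows shown above, assuming tab-separated fields as in the cell that reads the file.

```python
def parse_song_lines(lines):
    """Map integer song ID -> (title, artist), skipping malformed lines."""
    lookup = {}
    for line in lines:
        fields = line.rstrip().split('\t')
        if len(fields) != 3:
            continue  # e.g. the empty string left after the final newline
        song_id, title, artist = fields
        lookup[int(song_id.strip())] = (title, artist)
    return lookup


# Sample lines reconstructed from the output above (tab-separated)
sample_lines = [
    "0 \tGucci Time (w\\/ Swizz Beatz)\tGucci Mane",
    "1 \tAston Martin Music (w\\/ Drake & Chrisette Michelle)\tRick Ross",
    "",
]
song_lookup = parse_song_lines(sample_lines)
print(song_lookup[1])
```

Normalizing the IDs to integers up front avoids having to remember the trailing space when looking songs up by label.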

To simplify looking up song titles by ID, we'll define a pandas dataframe to hold song information.

In [None]:
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [None]:
songs_df.head()

| id | title | artist |
|---|---|---|
| 0 | Gucci Time (w\/ Swizz Beatz) | Gucci Mane |
| 1 | Aston Martin Music (w\/ Drake & Chrisette Mich... | Rick Ross |
| 2 | Get Back Up (w\/ Chris Brown) | T.I. |
| 3 | Hot Toddy (w\/ Jay-Z & Ester Dean) | Usher |
| 4 | Whip My Hair | Willow |


Pandas dataframes give us the ability to easily search through the columns of our dataset. We can look at the songs of a certain artist, for example.

In [None]:
songs_df[songs_df.artist == 'Rush'].head()

| id | title | artist |
|---|---|---|
| 1861 | Tom Sawyer | Rush |
| 2640 | Red Barchetta | Rush |
| 2655 | Fly By Night | Rush |
| 2691 | Freewill | Rush |
| 2748 | Limelight | Rush |


### Looking up songs by their IDs
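Pandas also gives us the ability to retrieve the information of multiple songs at once. Note that `iloc` selects rows by position, not by index label; this works here because each song's ID matches its row position in the file. Let's, for example, retrieve the info for songs number 1, 10, and 100.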

In [None]:
songs_df.iloc[[1,10,100]]

| id | title | artist |
|---|---|---|
| 1 | Aston Martin Music (w\/ Drake & Chrisette Mich... | Rick Ross |
| 10 | Shake It | Elephant Man |
| 100 | I'm Yours | Jason Mraz |


## Recommending Similar Songs
Let's now pick a song and see which similar songs the model recommends.

In [None]:
songs_df.iloc[2172]

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object

In [None]:
song_id = 2172

# Ask the model for songs similar to song #2172
model.wv.most_similar(positive=str(song_id))

[('2849', 0.9976274967193604),
 ('3116', 0.9950224161148071),
 ('3119', 0.994614839553833),
 ('3126', 0.9944372177124023),
 ('11596', 0.9942123293876648),
 ('11517', 0.9939004778862),
 ('1922', 0.9933257699012756),
 ('5586', 0.9932605028152466),
 ('1954', 0.9928205013275146),
 ('2014', 0.992260754108429)]
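
The scores returned by `most_similar` are cosine similarities between the query song's embedding and the other embeddings. The ranking can be sketched in plain Python as follows; the two-dimensional `toy_embeddings` and both helper functions are hypothetical and only illustrate the computation, not gensim's optimized implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, topn=2):
    """Rank all other items by cosine similarity to the query item."""
    query = embeddings[query_id]
    scores = [
        (item_id, cosine_similarity(query, vector))
        for item_id, vector in embeddings.items()
        if item_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-D embeddings: 'b' points almost the same way as 'a',
# 'c' is orthogonal to it, and 'd' points the opposite way.
toy_embeddings = {
    'a': [1.0, 0.0],
    'b': [0.9, 0.1],
    'c': [0.0, 1.0],
    'd': [-1.0, 0.0],
}
ranked = most_similar('a', toy_embeddings)
print(ranked)
```

Because cosine similarity ignores vector length, two songs score close to 1 whenever their embeddings point in nearly the same direction, which is why all ten scores above are so close to each other.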

Let's look up the titles and artists of these songs:

In [None]:
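# most_similar returns (song_id, score) pairs whose IDs are strings;
# convert them to integers because iloc requires integer positions
similar_songs = [int(sid) for sid, _ in model.wv.most_similar(positive=str(song_id))]
songs_df.iloc[similar_songs]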

| id | title | artist |
|---|---|---|
| 2849 | Run To The Hills | Iron Maiden |
| 3116 | Communication Breakdown | Led Zeppelin |
| 3119 | There's Only One Way To Rock | Sammy Hagar |
| 3126 | Heavy Metal | Sammy Hagar |
| 11596 | Hallowed Be Thy Name | Iron Maiden |
| 11517 | Mary Had A Little Lamb | Stevie Ray Vaughan & Double Trouble |
| 1922 | One | Metallica |
| 5586 | The Last In Line | Dio |
| 1954 | The Number Of The Beast | Iron Maiden |
| 2014 | Youth Gone Wild | Skid Row |


Let's define a function that prints out both the song title and the recommendations based on it:


In [None]:
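def print_recommendations(song_id):
    # Print the query song's title and artist
    print(songs_df.iloc[song_id])
    # The model returns string IDs; iloc requires integer positions
    similar_songs = [int(sid) for sid, _ in model.wv.most_similar(positive=str(song_id))]
    return songs_df.iloc[similar_songs]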


## More Example Recommendations

### Paranoid Android - Radiohead

In [None]:
print_recommendations(19563)

title     Paranoid Android
artist           Radiohead
Name: 19563 , dtype: object


| id | title | artist |
|---|---|---|
| 21628 | I Got Mine | The Black Keys |
| 40279 | Something To Die For | The Sounds |
| 24752 | The Denial Twist | The White Stripes |
| 44542 | - | - |
| 42416 | - | - |
| 34358 | One Time One Night | Los Lobos |
| 18002 | Bittersweet Memories | Bullet For My Valentine |
| 62660 | I Don't Believe You | The Thermals |
| 32649 | Teenage Lobotomy | The Ramones |
| 34407 | Bad Vibrations | The Black Angels |


### California Love - 2Pac

In [None]:
print_recommendations(842)

title     California Love (w\/ Dr. Dre & Roger Troutman)
artist                                              2Pac
Name: 842 , dtype: object


| id | title | artist |
|---|---|---|
| 330 | Hate It Or Love It (w\/ 50 Cent) | The Game |
| 6741 | Love In This Club (w\/ Young Jeezy) | Usher |
| 886 | Heartless | Kanye West |
| 5668 | How We Do (w\/ 50 Cent) | The Game |
| 5788 | Drop It Like It's Hot (w\/ Pharrell) | Snoop Dogg |
| 890 | Knock You Down (w\/ Ne-Yo & Kanye West) | Keri Hilson |
| 413 | If I Ruled The World (Imagine That) (w\/ Laury... | Nas |
| 12205 | Give It Up To Me | Sean Paul |
| 18844 | Murder She Wrote | Chaka Demus & Pliers |
| 1560 | In Da Club | 50 Cent |


### Billie Jean - Michael Jackson

In [None]:
print_recommendations(3822)

title         Billie Jean
artist    Michael Jackson
Name: 3822 , dtype: object


| id | title | artist |
|---|---|---|
| 3358 | Maneater | Daryl Hall & John Oates |
| 4187 | I Wanna Dance With Somebody (Who Loves Me) | Whitney Houston |
| 4288 | Straight Up | Paula Abdul |
| 1506 | The Way You Make Me Feel | Michael Jackson |
| 12749 | Wanna Be Startin' Somethin' | Michael Jackson |
| 500 | Don't Stop 'Til You Get Enough | Michael Jackson |
| 4157 | P.Y.T. (Pretty Young Thing) | Michael Jackson |
| 3384 | Hungry Eyes | Eric Carmen |
| 4271 | Walking On Sunshine | Katrina & The Waves |
| 3859 | Always Something There To Remind Me | Naked Eyes |


### Exercise:

Build a visualization of the song embeddings learned by the recommender.