# Recommending songs by embeddings

**NOTE:** This notebook is based on the tutorial in Chapter 2 of *[Hands-On Large Language Models](https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/)* by [Jay Alammar](https://www.linkedin.com/in/jalammar/) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

The idea here is that we have a collection of song playlists like this...

- Rosanna * Billie Jean * Let's Go Crazy * etc.
- Fade to Black * Between the Lines * One * etc.

...and the word embedding model treats each playlist like a sentence and each song like a word, so songs that tend to appear next to each other across many playlists end up with similar embeddings. We can then use those similarities to generate new playlists based on individual songs.
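As a rough intuition for what "appear next to each other" means, the sketch below counts how often two songs sit side by side in a few toy playlists. The playlists and the `neighbor_counts` name are made up for illustration, and word2vec does not literally count pairs like this; it learns vectors from the same kind of neighborhood information.

```python
from collections import Counter

# Hypothetical toy playlists; not taken from the real dataset.
toy_playlists = [
    ["Rosanna", "Billie Jean", "Let's Go Crazy"],
    ["Billie Jean", "Let's Go Crazy", "Kiss"],
    ["Fade to Black", "One", "Billie Jean"],
]

# Count how often two songs are direct neighbors, ignoring their order.
neighbor_counts = Counter()
for playlist in toy_playlists:
    for left, right in zip(playlist, playlist[1:]):
        neighbor_counts[frozenset((left, right))] += 1

# Pairs that co-occur most often are the ones an embedding model
# would tend to place close together.
for pair, count in neighbor_counts.most_common(3):
    print(sorted(pair), count)
```

In this toy data, "Billie Jean" and "Let's Go Crazy" are neighbors in two playlists, so they are the pair we would most expect to end up with similar embeddings.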

In [3]:
%%capture 
# %%capture prevents this cell from printing a ton of STDERR stuff to the screen

## NOTE: Uncomment the next line to install stuff if you need to.
##       Also, installing can take a few minutes...

# !pip install gensim # we use gensim to train a word2vec model

In [4]:
## Import modules we'll need
import urllib.request
from gensim.models import word2vec # We will train a word2vec model with playlist data
import pandas as pd # we'll use pandas to format data

In [5]:
## Read in a tab-delimited file that contains song id numbers
## paired with song names and artists.
# id_to_title = pd.read_csv("song_hash.txt", sep="\t", 
#                           header=None, 
#                           names=["id", "title", "artist"])
id_to_title = pd.read_csv("https://raw.githubusercontent.com/StatQuest/embeddings_for_recommendations/main/song_hash.txt", 
                          sep="\t", 
                          header=None, 
                          names=["id", "title", "artist"])
id_to_title.head() # print out the first few rows

| | id | title | artist |
|---|---|---|---|
| 0 | 0 | Gucci Time (w\/ Swizz Beatz) | Gucci Mane |
| 1 | 1 | Aston Martin Music (w\/ Drake & Chrisette Mich... | Rick Ross |
| 2 | 2 | Get Back Up (w\/ Chris Brown) | T.I. |
| 3 | 3 | Hot Toddy (w\/ Jay-Z & Ester Dean) | Usher |
| 4 | 4 | Whip My Hair | Willow |


----

# Import the playlist data

In [6]:
## NOTE: The data files were originally created by Shuo Chen (shuochen@cs.cornell.edu) 
##       in the Dept. of Computer Science, Cornell University.
## I downloaded them from here: https://www.cs.cornell.edu/~shuochen/lme/data_page.html
##
## open() opens the file...
## read() reads it in as one long string...
## split('\n') splits that string into a list of lines (one playlist per line)...
## [2:] skips the first two lines of metadata
# data = open("train.txt", "r").read().split('\n')[2:]

data = urllib.request.urlopen('https://raw.githubusercontent.com/StatQuest/embeddings_for_recommendations/main/train.txt')
data = data.read().decode("utf-8").split('\n')[2:]

In [7]:
## Remove playlists with only one song
playlists = [s.rstrip().split() for s in data if len(s.split()) > 1]
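To see what the parsing and filtering steps above do, here is a minimal sketch on a made-up string. The `raw_text` value is a hypothetical stand-in for the downloaded file, assuming the same layout: two metadata lines followed by one playlist of space-separated song ids per line.

```python
# Hypothetical stand-in for the downloaded text (not the real data).
raw_text = "metadata line 1\nmetadata line 2\n0 1 2 3 \n4 \n5 6 \n"

# Split into lines and drop the two metadata lines.
lines = raw_text.split("\n")[2:]

# Keep only playlists with more than one song. The single-song line
# and the empty string left after the final newline are both dropped.
toy_playlists = [line.split() for line in lines if len(line.split()) > 1]

print(toy_playlists)
```

Note that the song ids stay as strings after `split()`. That is why later lookups in the trained model use a string key rather than an integer.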

In [8]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',

In [9]:
## Train a word embedding model with our playlists
## NOTE: By default Word2Vec uses the "CBOW" (continuous bag of words) method for 
##       training. CBOW uses surrounding words to predict a word in the middle.
##       For example, if the training set was "Troll2 is great", then
##       CBOW would use "Troll2" and "great" to predict "is".
## vector_size : dimensionality of the word vectors.
## negative : If > 0, negative sampling will be used, 
##            and specifies how many "noise words" should be drawn (usually between 5-20).
## min_count : Ignores all words with total frequency lower than this.
## workers : Use this many worker threads to train the model.
## NOTE: The values I selected for the arguments allowed for relatively fast training and 
##       worked well enough.
model = word2vec.Word2Vec(playlists, vector_size=32, negative=10, min_count=1, workers=4)
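The CBOW idea described in the comments can be sketched in plain Python. The `cbow_pairs` helper below is hypothetical and only shows how (context, target) training pairs could be formed with a fixed window. It is not gensim's implementation, which handles windowing, sampling, and the actual training internally.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context, target) pairs using a fixed-size window.

    Illustrative only: this shows the shape of CBOW training examples,
    not how gensim builds them.
    """
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        # Context = up to `window` tokens on each side of the target.
        context = tokens[start:i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target


pairs = list(cbow_pairs(["Troll2", "is", "great"], window=1))
for context, target in pairs:
    print(context, "->", target)
```

With playlists, the tokens are song ids instead of words, so each song is predicted from the songs around it in the playlist.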

In [10]:
song_id = 3822 # Billie Jean - Michael Jackson
# song_id = 2172 # Fade To Black - Metallica
# song_id = 842 # California Love - 2Pac

In [11]:
id_to_title.iloc[song_id]

id                   3822
title         Billie Jean
artist    Michael Jackson
Name: 3822, dtype: object

In [12]:
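## Find the most similar songs.
## NOTE: The playlists store song ids as strings, so we look up str(song_id).
##       By default, most_similar() returns the 10 closest matches.
new_playlist = pd.DataFrame(model.wv.most_similar(positive=[str(song_id)]),
                            columns=["id", "sim"])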

In [13]:
new_playlist

| | id | sim |
|---|---|---|
| 0 | 4111 | 0.985519 |
| 1 | 11622 | 0.981036 |
| 2 | 500 | 0.977418 |
| 3 | 4181 | 0.975966 |
| 4 | 19162 | 0.971487 |
| 5 | 3809 | 0.969897 |
| 6 | 3791 | 0.966917 |
| 7 | 3381 | 0.966449 |
| 8 | 3893 | 0.965715 |
| 9 | 3385 | 0.965069 |
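The `sim` column holds cosine similarities between the query song's embedding and each candidate's embedding. As a minimal sketch of that calculation, the code below uses made-up 2-dimensional vectors. The names `toy_vectors`, `query`, and `ranked` are hypothetical, and the real embeddings here have 32 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, chosen so the answers are easy to check by hand.
query = [1.0, 0.0]
toy_vectors = {
    "same_direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Rank candidates from most to least similar to the query.
ranked = sorted(
    ((name, cosine_similarity(query, vec)) for name, vec in toy_vectors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, sim in ranked:
    print(name, round(sim, 3))
```

Values close to 1 mean two vectors point in nearly the same direction, which is why the top recommendations above all have similarities above 0.96.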


In [14]:
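## Print out the song names and artists for the new playlist
## NOTE: most_similar() returns the ids as strings, so we convert them
##       to integers before using them as row positions with iloc.
id_to_title.iloc[new_playlist["id"].astype(int)]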

| | id | title | artist |
|---|---|---|---|
| 4111 | 4111 | Rosanna | Toto |
| 11622 | 11622 | Mandolin Rain | Bruce Hornsby & The Range |
| 500 | 500 | Don't Stop 'Til You Get Enough | Michael Jackson |
| 4181 | 4181 | Kiss | Prince & The Revolution |
| 19162 | 19162 | I Can't Wait | Nu Shooz |
| 3809 | 3809 | Super Freak | Rick James |
| 3791 | 3791 | When Doves Cry | Prince & The Revolution |
| 3381 | 3381 | Let's Go Crazy | Prince & The Revolution |
| 3893 | 3893 | Word Up | Cameo |
| 3385 | 3385 | She's So High | Tal Bachman |


# Bam!!!