# Seminar 1 - Implementing a Music Recommender System

## Introduction

Recommender systems are a subclass of information filtering systems that suggest the items most relevant to a particular user. A good implementation of such a system contributes greatly to the quality of the platform it runs on.<br>
Take the social network TikTok as an example. In just a few years TikTok has gained enormous popularity, for which its recommendation algorithm is largely responsible. Besides funny videos, recommender systems are used to recommend films, music, books and products of all kinds.<br>
Broadly speaking, there are two types of recommender systems: collaborative filtering and content-based filtering.<br>
**Collaborative filtering** is an approach to recommender system design based on the assumption that users with similar characteristics or past behaviour will want similar items.<br>
**Content-based filtering** is a form of recommender system in which recommendations are based on the content of the items and on the user's preferences.<br>

## Music recommender systems

Spotify and Apple Music are just some of the services that include recommender systems for music. Spotify, for example, generates a number of personalised playlists for every user each week. Playlists can be generated based on the user's mood or on previous listening.
Spotify is well known for its recommender systems and is frequently praised for their accuracy. But how do these systems actually work?
In this seminar we present only one method of implementing a recommender system. We focus on the **content-based filtering** approach, which compares songs and suggests those that are similar. Song similarity can be measured in different ways, e.g. by artist, style of the song, genre and so on.

### Building the recommender system

The final product of this seminar is a system that recommends a set of other songs based on a given playlist. Implementing a system of this kind takes a few steps.<br>
We have to start with data cleaning. We work with the dataset "playlistDF", which contains a large number of playlists and songs, along with a set of attributes describing each song within a playlist. Let us take a closer look at the shape and contents of the dataset.

In [8]:
# Import library
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import re

In [9]:
# Import processed data
playlistDF = pd.read_csv("data/processed_dataPlaylist.csv")
print(playlistDF.columns)
playlistDF.drop(columns=["Unnamed: 0",'Unnamed: 0.1'], inplace = True)
playlistDF.head()

Index(['Unnamed: 0.1', 'Unnamed: 0', 'pos', 'artist_name', 'track_uri',
       'artist_uri', 'track_name', 'album_uri', 'duration_ms_x', 'album_name',
       'name', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url',
       'duration_ms_y', 'time_signature', 'artist_pop', 'genres', 'track_pop'],
      dtype='object')


Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,name,danceability,...,type,id,uri,track_href,analysis_url,duration_ms_y,time_signature,artist_pop,genres,track_pop
0,0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Throwbacks,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
1,73,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,w o r k o u t,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
2,14,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,party playlist,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
3,42,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Dance mix,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
4,1,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,spin,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69


In [10]:
playlistDF.shape

(67499, 30)

In [11]:
for col in playlistDF.columns:
    print(col)

pos
artist_name
track_uri
artist_uri
track_name
album_uri
duration_ms_x
album_name
name
danceability
energy
key
loudness
mode
speechiness
acousticness
instrumentalness
liveness
valence
tempo
type
id
uri
track_href
analysis_url
duration_ms_y
time_signature
artist_pop
genres
track_pop


The dataset we use is complex. Each record is described by 30 variables, even after dropping the unneeded columns "Unnamed: 0" and 'Unnamed: 0.1'. The output above shows that some songs are repeated, which makes sense given that one song can appear in several playlists. Let us see how many distinct songs we actually have and how they are described.
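To make the repetition concrete, one can count in how many playlists each song occurs. The sketch below uses a handful of (artist, track, playlist) rows copied from the notebook outputs as plain tuples; the variable names are illustrative and not part of the notebook.

```python
from collections import Counter

# (artist_name, track_name, playlist name) rows copied from the notebook outputs
rows = [
    ("Missy Elliott", "Lose Control (feat. Ciara & Fat Man Scoop)", "Throwbacks"),
    ("Missy Elliott", "Lose Control (feat. Ciara & Fat Man Scoop)", "w o r k o u t"),
    ("Missy Elliott", "Lose Control (feat. Ciara & Fat Man Scoop)", "party playlist"),
    ("Missy Elliott", "Lose Control (feat. Ciara & Fat Man Scoop)", "Dance mix"),
    ("Missy Elliott", "Lose Control (feat. Ciara & Fat Man Scoop)", "spin"),
    ("Britney Spears", "Toxic", "Throwbacks"),
]

# Count playlist memberships per (artist, track) pair
playlists_per_song = Counter((artist, track) for artist, track, _ in rows)

for (artist, track), n in playlists_per_song.most_common():
    print(f"{artist} - {track}: {n} playlist(s)")
```

Every row beyond the first for a given song is a duplicate from the point of view of a song catalogue, which is why a separate, deduplicated song table is needed.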

In [12]:
## Get all songs that we have from the playlist dataset

## We have duplicates in songs across playlists. We need a separate dataFrame for song data
def remove_duplicates(data):
    data['artists_song'] = data.apply(lambda row: row['artist_name']+row['track_name'],axis = 1)
    return data.drop_duplicates('artists_song')

songsDF = remove_duplicates(playlistDF)
print("Are all songs unique: ",len(pd.unique(songsDF.artists_song))==len(songsDF))

Are all songs unique:  True
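One caveat about the key used above: concatenating artist and track name without a separator can, in principle, map two different songs to the same string. The sketch below uses made-up names (not taken from the dataset) to show the effect, and a tuple key that avoids it.

```python
# Hypothetical (artist, track) pairs, chosen only to provoke a collision
songs = [("AB", "C"), ("A", "BC")]

# Key built by plain concatenation, as in remove_duplicates
concat_keys = {artist + track for artist, track in songs}
# Key that keeps the two fields separate
tuple_keys = {(artist, track) for artist, track in songs}

print(len(concat_keys))  # number of distinct concatenated keys
print(len(tuple_keys))   # number of distinct (artist, track) keys
```

In pandas, the same idea would be to deduplicate on both columns, e.g. `drop_duplicates(['artist_name', 'track_name'])`; this is a suggestion, not what the notebook does.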


In [13]:
songsDF.head()

Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,name,danceability,...,id,uri,track_href,analysis_url,duration_ms_y,time_signature,artist_pop,genres,track_pop,artists_song
0,0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Throwbacks,0.904,...,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69,Missy ElliottLose Control (feat. Ciara & Fat M...
6,1,Britney Spears,6I9VzXrHxO9rA9A5euc8Ak,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Toxic,spotify:album:0z7pVBGOD7HCIB7S8eLkLI,198800,In The Zone,Throwbacks,0.774,...,6I9VzXrHxO9rA9A5euc8Ak,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,https://api.spotify.com/v1/tracks/6I9VzXrHxO9r...,https://api.spotify.com/v1/audio-analysis/6I9V...,198800,4,84,dance_pop pop post-teen_pop,83,Britney SpearsToxic
19,2,Beyoncé,0WqIKmW4BTrj3eJFmnCKMv,spotify:artist:6vWDO969PvNqNYHIOW5v0m,Crazy In Love,spotify:album:25hVFAxTlDvXbx2X2QkUkE,235933,Dangerously In Love (Alben für die Ewigkeit),Throwbacks,0.664,...,0WqIKmW4BTrj3eJFmnCKMv,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,https://api.spotify.com/v1/tracks/0WqIKmW4BTrj...,https://api.spotify.com/v1/audio-analysis/0WqI...,235933,4,86,dance_pop pop r&b,25,BeyoncéCrazy In Love
46,3,Justin Timberlake,1AWQoqb9bSvzTjaLralEkT,spotify:artist:31TPClRtHm23RisEBtV3X7,Rock Your Body,spotify:album:6QPkyl04rXwTGlGlcYaRoW,267266,Justified,Throwbacks,0.892,...,1AWQoqb9bSvzTjaLralEkT,spotify:track:1AWQoqb9bSvzTjaLralEkT,https://api.spotify.com/v1/tracks/1AWQoqb9bSvz...,https://api.spotify.com/v1/audio-analysis/1AWQ...,267267,4,82,dance_pop pop,79,Justin TimberlakeRock Your Body
55,4,Shaggy,1lzr43nnXAijIGYnCT8M8H,spotify:artist:5EvFsr3kj42KNv97ZEnqij,It Wasn't Me,spotify:album:6NmFmPX56pcLBOFMhIiKvF,227600,Hot Shot,Throwbacks,0.853,...,1lzr43nnXAijIGYnCT8M8H,spotify:track:1lzr43nnXAijIGYnCT8M8H,https://api.spotify.com/v1/tracks/1lzr43nnXAij...,https://api.spotify.com/v1/audio-analysis/1lzr...,227600,4,75,pop_rap reggae_fusion,2,ShaggyIt Wasn't Me


In [14]:
print("Number of songs in the dataset: ", len(songsDF))

Number of songs in the dataset:  34247


We have 34,247 songs available. Not all features from the dataset above are needed to build the recommender system. We concentrate on the core features that describe a song and build recommendations from the artist, the genres and the audio description of the song ('danceability', 'energy', 'key', 'loudness'...). The core features fall into the following three groups:
1. **Metadata**: id, genres, artist_pop, track_pop
2. **Audio**:
    - Mood (Danceability, Valence, Energy, Tempo)
    - Properties (Loudness, Speechiness, Instrumentalness)
    - Context (Liveness, Acousticness)
    - Key, Mode
3. **Text** (track name)

Metadata is information attached to the track itself rather than derived from its audio, such as the artist name, the release date and similar. We extract these key features from our dataset and build the recommender system on them.

In [15]:
# Select useful columns
def select_useful_cols(df):
    return df[['artist_name','id','track_name','danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', "artist_pop", "genres", "track_pop"]]
songsDF = select_useful_cols(songsDF)
songsDF.head(5)

Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,artist_pop,genres,track_pop
0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,Lose Control (feat. Ciara & Fat Man Scoop),0.904,0.813,4,-7.105,0,0.121,0.0311,0.00697,0.0471,0.81,125.461,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
6,Britney Spears,6I9VzXrHxO9rA9A5euc8Ak,Toxic,0.774,0.838,5,-3.914,0,0.114,0.0249,0.025,0.242,0.924,143.04,84,dance_pop pop post-teen_pop,83
19,Beyoncé,0WqIKmW4BTrj3eJFmnCKMv,Crazy In Love,0.664,0.758,2,-6.583,0,0.21,0.00238,0.0,0.0598,0.701,99.259,86,dance_pop pop r&b,25
46,Justin Timberlake,1AWQoqb9bSvzTjaLralEkT,Rock Your Body,0.892,0.714,4,-6.055,0,0.141,0.201,0.000234,0.0521,0.817,100.972,82,dance_pop pop,79
55,Shaggy,1lzr43nnXAijIGYnCT8M8H,It Wasn't Me,0.853,0.606,0,-4.596,1,0.0713,0.0561,0.0,0.313,0.654,94.759,75,pop_rap reggae_fusion,2


In [16]:
songsDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34247 entries, 0 to 67498
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist_name       34247 non-null  object 
 1   id                34247 non-null  object 
 2   track_name        34247 non-null  object 
 3   danceability      34247 non-null  float64
 4   energy            34247 non-null  float64
 5   key               34247 non-null  int64  
 6   loudness          34247 non-null  float64
 7   mode              34247 non-null  int64  
 8   speechiness       34247 non-null  float64
 9   acousticness      34247 non-null  float64
 10  instrumentalness  34247 non-null  float64
 11  liveness          34247 non-null  float64
 12  valence           34247 non-null  float64
 13  tempo             34247 non-null  float64
 14  artist_pop        34247 non-null  int64  
 15  genres            34247 non-null  object 
 16  track_pop         34247 non-null  int64 

The output above gives a clearer view of the variables we work with from now on. We reduced the dimensionality of the dataset from 31 columns (the 30 original columns plus the helper column artists_song) to 17. It is worth noting that the dataset contains no missing values, which is convenient.

**Music genre** is one of the most important features when building a recommender system. When exploring new music, many listeners orient themselves mainly by the genre of the artist or song. Our dataset does not assign a single genre to a song; a song can be classified into several genres. Because later steps need it, we convert the "genres" column into a list.

In [17]:
def genre_preprocess(df):
    df['genres_list'] = df['genres'].apply(lambda x: x.split(" "))
    return df
songsDF = genre_preprocess(songsDF)

songsDF[['track_name','genres_list']].head(5)


Unnamed: 0,track_name,genres_list
0,Lose Control (feat. Ciara & Fat Man Scoop),"[dance_pop, hip_hop, hip_pop, pop, pop_rap, r&..."
6,Toxic,"[dance_pop, pop, post-teen_pop]"
19,Crazy In Love,"[dance_pop, pop, r&b]"
46,Rock Your Body,"[dance_pop, pop]"
55,It Wasn't Me,"[pop_rap, reggae_fusion]"


The output above shows the genres assigned to the first five songs in our dataset. For example, the song "Rock Your Body" is classified into 2 different genres.
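A quick way to see that some genres are far more common than others is to count them across songs. The sketch below uses the four genre lists that are shown in full in the output above (the first song's list is truncated there, so it is left out); the names `genres_by_track` and `genre_counts` are illustrative.

```python
from collections import Counter

# Genre lists of four songs, copied from the output above
genres_by_track = {
    "Toxic": ["dance_pop", "pop", "post-teen_pop"],
    "Crazy In Love": ["dance_pop", "pop", "r&b"],
    "Rock Your Body": ["dance_pop", "pop"],
    "It Wasn't Me": ["pop_rap", "reggae_fusion"],
}

# In how many of these songs does each genre occur?
genre_counts = Counter(g for genres in genres_by_track.values() for g in genres)

for genre, count in genre_counts.most_common():
    print(f"{genre}: {count}")
```

Even in this tiny sample "pop" and "dance_pop" appear in most songs, while the other genres appear once, so the common genres say less about any individual song.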

**The song title** can also serve as a recommendation signal, so it is worth extracting some information from it. Specifically, we add two columns to our dataset, **"Subjectivity"** and **"Polarity"**.<br>
The **Subjectivity** variable expresses how much of the text is personal opinion as opposed to factual information. TextBlob returns it as a score in [0, 1], which we convert into a categorical variable with the values [low, medium, high].<br>
The **Polarity** variable expresses how negative or positive the sentiment of the text is. TextBlob returns it as a score in [-1, 1], which we also convert into a categorical variable, with the values [Negative, Neutral, Positive].<br>
Both variables are computed with the "TextBlob" library. We extend our dataset with these two variables.

In [18]:
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
    return TextBlob(text).sentiment.polarity

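def getAnalysis(score, task="polarity"):
    if task == "subjectivity":
        # NOTE: there is a single cut point at 1/3, so "medium" is returned
        # only when the score is exactly 1/3; anything above it is "high"
        if score < 1/3:
            return "low"
        elif score > 1/3:
            return "high"
        else:
            return "medium"
    else:
        # Polarity: the sign of the score decides the label
        if score < 0:
            return 'Negative'
        elif score == 0:
            return 'Neutral'
        else:
            return 'Positive'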

def sentiment_analysis(df, text_col): #analysis based on the track name
    df['subjectivity'] = df[text_col].apply(getSubjectivity).apply(lambda x: getAnalysis(x,"subjectivity"))
    df['polarity'] = df[text_col].apply(getPolarity).apply(getAnalysis)
    return df

In [19]:
# Show result
sentiment = sentiment_analysis(songsDF, "track_name")
sentiment[['track_name','subjectivity','polarity']].head()

Unnamed: 0,track_name,subjectivity,polarity
0,Lose Control (feat. Ciara & Fat Man Scoop),low,Neutral
6,Toxic,low,Neutral
19,Crazy In Love,high,Negative
46,Rock Your Body,low,Neutral
55,It Wasn't Me,low,Neutral


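For example, the title "Crazy In Love" is scored as highly subjective and as having negative polarity, presumably because the word "crazy" outweighs "love" in TextBlob's lexicon. Title-based sentiment is therefore only a rough signal: it reflects the wording of the title, not necessarily the mood of the song.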
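Note that `getAnalysis` uses a single cut point at 1/3 for subjectivity, so "medium" is produced only for a score of exactly 1/3. If three evenly sized bins are wanted instead, a two-threshold variant could look like the sketch below. The function `bin_score` and its cut points are hypothetical and not part of the notebook; switching to it would change the feature values and therefore the recommendations shown later.

```python
def bin_score(score, low_cut=1/3, high_cut=2/3):
    """Map a score in [0, 1] to 'low', 'medium' or 'high'.

    The cut points are assumptions chosen to split the range into thirds.
    """
    if score < low_cut:
        return "low"
    if score < high_cut:
        return "medium"
    return "high"

for s in (0.0, 0.5, 0.9):
    print(s, bin_score(s))
```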

To build a good recommender system, certain columns need to be processed with **one-hot encoding (OHE)**. This operation turns a categorical variable into a group of bits, one per category, with a single one and zeros everywhere else. A simple example of how it works is shown in the figure below.

![ohe.png](attachment:ohe.png)
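As a complement to the figure, the encoding can be written out by hand. This is a minimal sketch in plain Python, assuming the categories are ordered alphabetically; `one_hot` is an illustrative helper, whereas the notebook itself relies on `pd.get_dummies`.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 row per value, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows

# Subjectivity labels of the first five songs, as shown in the output above
labels = ["low", "low", "high", "low", "low"]
categories, encoded = one_hot(labels)

print(categories)
for row in encoded:
    print(row)
```

Only the categories that occur in the input get a column, so with these five labels there is no "medium" column; on the full dataset there is one.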

We created the "subjectivity" column, which is a categorical variable. We process this column with OHE and show what the expanded variable looks like.

In [20]:
def ohe_prep(df, column, new_name): 
    
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    return tf_df


# One-hot encoding for ONLY subjectivity num
subject_ohe = ohe_prep(sentiment, 'subjectivity','subject')
subject_ohe.head(5)

Unnamed: 0,subject|high,subject|low,subject|medium
0,0,1,0
1,0,1,0
2,1,0,0
3,0,1,0
4,0,1,0


Earlier we stressed the importance of the genre variable, and also noted that a song or artist can be classified into several genres. The earlier outputs showed that some genre values are more general than others. For example, a song may be classified as "pop", "dance_pop" or some other form of pop music. Clearly not all genre values are equally important, i.e. they do not carry the same amount of information. We therefore need to assign weights to the genres. We do this with the **TF-IDF** (Term Frequency-Inverse Document Frequency) method. The formula for **TF-IDF** is shown in the figure below.

![Screenshot%20%2878%29.png](attachment:Screenshot%20%2878%29.png)

1. **Term frequency (TF)**: the number of times a word (i.e. term) occurs in a document, divided by the total number of words in that document.

2. **Inverse Document Frequency (IDF)**: the logarithm of the total number of documents divided by the number of documents in which the word (i.e. term) occurs.

**TF** expresses how important a word is within one document, while **IDF** expresses how rare, and therefore how informative, the word is across all documents. In our case the documents are songs, and the terms or words are the individual genres. We want to know how dominant a genre is within one song and how common it is across the whole dataset, so that a genre attached to a large share of songs (such as "pop") gets a lower weight than a rare one. In this way each genre is assigned an appropriate weight. A clearer illustration of the **TF-IDF** method is shown below.

![Screenshot%20%2879%29.png](attachment:Screenshot%20%2879%29.png)
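The two definitions can be checked with a small worked example. The sketch below implements the plain textbook formulas on three genre lists copied from the earlier output; the helper names `tf`, `idf` and `tf_idf` are illustrative. scikit-learn's `TfidfVectorizer` uses, by default, a smoothed IDF and L2-normalises each row, so its numbers differ, but the effect is the same: rare genres weigh more than common ones.

```python
import math

# Each "document" is the genre list of one song (copied from the earlier output)
docs = {
    "Toxic": ["dance_pop", "pop", "post-teen_pop"],
    "Rock Your Body": ["dance_pop", "pop"],
    "It Wasn't Me": ["pop_rap", "reggae_fusion"],
}

def tf(term, doc):
    # Share of the document's terms that equal `term`
    return doc.count(term) / len(doc)

def idf(term, all_docs):
    # log(total documents / documents containing the term)
    containing = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / containing)

def tf_idf(term, doc, all_docs):
    return tf(term, doc) * idf(term, all_docs)

all_docs = list(docs.values())
for term in docs["Toxic"]:
    print(term, round(tf_idf(term, docs["Toxic"], all_docs), 3))
```

Within "Toxic" all three genres have the same TF, but "post-teen_pop" occurs in only one of the three songs and therefore ends up with a higher weight than "pop" or "dance_pop".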

This method is implemented with the TfidfVectorizer class from scikit-learn.

In [21]:
# TF-IDF implementation
tfidf = TfidfVectorizer()
tfidf_matrix =  tfidf.fit_transform(songsDF['genres_list'].apply(lambda x: " ".join(x)))
genre_df = pd.DataFrame(tfidf_matrix.toarray())
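# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
# NOTE: drop() returns a new frame; the result is not assigned, so 'genre|unknown' stays in genre_df
genre_df.drop(columns='genre|unknown')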
genre_df.reset_index(drop = True, inplace=True)
genre_df.head(5)



Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|_hip_hop,genre|_roll,genre|a_cappella,genre|abstract_beats,genre|abstract_hip_hop,genre|accordion,genre|acid_jazz,genre|acid_rock,...,genre|xtra_raw,genre|yacht_rock,genre|ye_ye,genre|yodeling,genre|york_indie,genre|zambian_hip_hop,genre|zhongguo_feng,genre|zolo,genre|zouk,genre|zouk_riddim
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The dataset we use contains a huge number of genres and subgenres: the TF-IDF matrix has 2,147 genre columns. This shows how finely Spotify distinguishes individual genres. The output above does not reveal which genres carry weight for the first five songs, because only the first and last few columns are displayed; what it does show is that those particular genres do not apply to these songs, so their weights are 0.

The features we have are on different scales, so they need to be **scaled**. Here we use MinMaxScaler and, for now, normalise only the "artist_pop" feature.
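One detail worth checking is how the genre strings are split into terms. According to the scikit-learn documentation, `TfidfVectorizer` by default keeps only tokens made of two or more word characters, so genre names containing `-` or `&` are split or dropped; this may explain fragments such as `genre|_roll` among the column names above. The sketch below reproduces that default pattern with the standard `re` module on one hypothetical genre string and compares it with plain whitespace splitting.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers:
# runs of two or more word characters
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

genres = "dance_pop pop post-teen_pop r&b"

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, genres.lower())
whitespace_tokens = genres.lower().split()

print(default_tokens)     # "post-teen_pop" is split and "r&b" disappears
print(whitespace_tokens)  # one token per genre
```

If whole genre names are wanted as terms, passing `token_pattern=r"\S+"` to `TfidfVectorizer` is one option. This is a suggestion rather than what the notebook does, and it would change the number of genre columns.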

In [22]:
# Normalization
pop = songsDF[["artist_pop"]].reset_index(drop = True)
scaler = MinMaxScaler()
pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns)
pop_scaled.head()

Unnamed: 0,artist_pop
0,0.74
1,0.84
2,0.86
3,0.82
4,0.75
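Min-max scaling maps each value to x' = (x - min) / (max - min), so the smallest value in the column becomes 0 and the largest becomes 1. The sketch below applies the formula by hand to the five `artist_pop` values shown earlier. It assumes that the full column spans 0 to 100, which is what the scaled output above suggests; `min_max_scale` is an illustrative helper, not part of the notebook.

```python
def min_max_scale(values, lo=None, hi=None):
    """Rescale values linearly so that lo maps to 0 and hi maps to 1."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    return [(v - lo) / (hi - lo) for v in values]

# artist_pop of the first five songs (from the earlier table)
artist_pop = [74, 84, 86, 82, 75]

# Assumption: the full column spans 0..100
scaled = min_max_scale(artist_pop, lo=0, hi=100)
print(scaled)

# Scaling only these five values would use their own min (74) and max (86) instead
print(min_max_scale(artist_pop))
```

The second call illustrates that the result depends on the range the scaler was fitted on, which is why the scaler has to be fitted on the full column.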


The steps presented above were carried out on helper objects. They now need to be applied to the whole dataset, and the newly processed dataset saved.

## Transforming the dataset

Let us summarise the steps presented so far. First we extracted all genres and assigned them weights with the **TF-IDF** method. We added the **"subjectivity"** and **"polarity"** columns and introduced the **OHE** method. Finally we noted the need for **feature scaling**. We carried out these steps on helper objects, which now have to be turned into the actual feature set. We do this with the function defined below, which also multiplies each block of features by a hand-chosen weight (for example 0.2 for the scaled audio and popularity features and 0.5 for key and mode).

In [23]:
def create_feature_set(df, float_cols):
    
    # Tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['genres_list'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
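    # get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
    # NOTE: drop() returns a new frame; the result is not assigned, so 'genre|unknown' stays in genre_df
    genre_df.drop(columns='genre|unknown')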
    genre_df.reset_index(drop = True, inplace=True)
    
    # Sentiment analysis => subjectivity and polarity
    df = sentiment_analysis(df, "track_name")

    # One-hot Encoding => for categorical values
    subject_ohe = ohe_prep(df, 'subjectivity','subject') * 0.3
    polar_ohe = ohe_prep(df, 'polarity','polar') * 0.5
    key_ohe = ohe_prep(df, 'key','key') * 0.5
    mode_ohe = ohe_prep(df, 'mode','mode') * 0.5

    # Normalization
    # Scale popularity columns
    pop = df[["artist_pop","track_pop"]].reset_index(drop = True)
    scaler = MinMaxScaler()
    pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns) * 0.2 

    # Scale audio columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) * 0.2

    # Concatenate all features
    final = pd.concat([genre_df, floats_scaled, pop_scaled, subject_ohe, polar_ohe, key_ohe, mode_ohe], axis = 1)
    
    # Add song id
    final['id']=df['id'].values
    
    return final

In [24]:
# Save the data and generate the features
float_cols = songsDF.dtypes[songsDF.dtypes == 'float64'].index.values
songsDF.to_csv("data/allsongs_data.csv", index = False)

# Generate features
complete_feature_set = create_feature_set(songsDF, float_cols=float_cols)
complete_feature_set.to_csv("data/complete_feature.csv", index = False)
complete_feature_set.head()



Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|_hip_hop,genre|_roll,genre|a_cappella,genre|abstract_beats,genre|abstract_hip_hop,genre|accordion,genre|acid_jazz,genre|acid_rock,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0UaMYEvWZi0ZqiDOoHU3YI
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,6I9VzXrHxO9rA9A5euc8Ak
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0WqIKmW4BTrj3eJFmnCKMv
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,1AWQoqb9bSvzTjaLralEkT
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1lzr43nnXAijIGYnCT8M8H


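We finally have a dataset that can be used to recommend songs. It has 2,179 columns: 2,147 TF-IDF genre weights, 9 scaled audio features, 2 scaled popularity features, 20 one-hot columns (subjectivity, polarity, key and mode) and the song id. Most of the growth therefore comes from the TF-IDF genre columns rather than from **OHE**. We use this dataset to build the recommender system.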
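The factors 0.2, 0.3 and 0.5 in `create_feature_set` act as block weights: the smaller the factor, the less a block can influence the similarity between two songs. The sketch below illustrates this with two made-up songs that share the same genre block but differ in a one-hot mode block. All values, and the helpers `cosine` and `weighted_concat`, are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def weighted_concat(blocks, weights):
    """Scale each feature block by its weight and join the blocks into one vector."""
    return [w * x for block, w in zip(blocks, weights) for x in block]

# Two hypothetical songs: identical genre block, opposite one-hot mode block
song_a = [[0.6, 0.8], [1.0, 0.0]]   # [genre block, mode block]
song_b = [[0.6, 0.8], [0.0, 1.0]]

for mode_weight in (1.0, 0.5, 0.1):
    a = weighted_concat(song_a, [1.0, mode_weight])
    b = weighted_concat(song_b, [1.0, mode_weight])
    print(mode_weight, round(cosine(a, b), 3))
```

As the weight of the mode block shrinks, the disagreement in mode matters less and the similarity moves towards 1, which is driven by the shared genres.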

### The recommender system

Let us start recommending songs. As a test example we use "Mom's playlist", a list of songs that make up one playlist. Let us look at the contents of that playlist.

In [25]:
# Test playlist:  Mom's playlist
playlistDF_test = playlistDF[playlistDF['name']=="Mom's playlist"]
playlistDF_test.to_csv("data/test_playlist.csv")
playlistDF_test.head(5)

Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,name,danceability,...,id,uri,track_href,analysis_url,duration_ms_y,time_signature,artist_pop,genres,track_pop,artists_song
413,59,The Killers,7oK9VyNzrYvRFo7nQEYkWN,spotify:artist:0C0XlULifJtAgn6ZNCW2eu,Mr. Brightside,spotify:album:4undIeGmofnAYKhnDclN1w,222586,Hot Fuss,Mom's playlist,0.356,...,7oK9VyNzrYvRFo7nQEYkWN,spotify:track:7oK9VyNzrYvRFo7nQEYkWN,https://api.spotify.com/v1/tracks/7oK9VyNzrYvR...,https://api.spotify.com/v1/audio-analysis/7oK9...,222587,4,80,alternative_rock dance_rock modern_rock perman...,78,The KillersMr. Brightside
1234,18,Rihanna,6qn9YLKt13AGvpq9jfO8py,spotify:artist:5pKCCKE2ajJHZ9KAiaK11H,We Found Love,spotify:album:2g1EakEaW7fPTZC6vBmBCn,215226,Talk That Talk,Mom's playlist,0.734,...,6qn9YLKt13AGvpq9jfO8py,spotify:track:6qn9YLKt13AGvpq9jfO8py,https://api.spotify.com/v1/tracks/6qn9YLKt13AG...,https://api.spotify.com/v1/audio-analysis/6qn9...,215227,4,90,barbadian_pop dance_pop pop pop_rap urban_cont...,77,RihannaWe Found Love
1363,32,American Authors,5j9iuo3tMmQIfnEEQOOjxh,spotify:artist:0MlOPi3zIDMVrfA9R04Fe3,Best Day Of My Life,spotify:album:2AAVQqcejMEgNpdg2raPYE,194240,"Oh, What A Life",Mom's playlist,0.67,...,5j9iuo3tMmQIfnEEQOOjxh,spotify:track:5j9iuo3tMmQIfnEEQOOjxh,https://api.spotify.com/v1/tracks/5j9iuo3tMmQI...,https://api.spotify.com/v1/audio-analysis/5j9i...,194240,4,70,indie_poptimism modern_alternative_rock modern...,0,American AuthorsBest Day Of My Life
1579,38,Clean Bandit,5HuqzFfq2ulY1iBAW5CxLe,spotify:artist:6MDME20pz9RveH9rEXvrOM,Rather Be (feat. Jess Glynne),spotify:album:2xVeccmEU0zklK4XSKiDCW,227833,I Cry When I Laugh,Mom's playlist,0.799,...,5HuqzFfq2ulY1iBAW5CxLe,spotify:track:5HuqzFfq2ulY1iBAW5CxLe,https://api.spotify.com/v1/tracks/5HuqzFfq2ulY...,https://api.spotify.com/v1/audio-analysis/5Huq...,227833,4,80,dance_pop edm pop pop_dance tropical_house uk_...,53,Clean BanditRather Be (feat. Jess Glynne)
1732,17,Sia,4VrWlk8IQxevMvERoX08iC,spotify:artist:5WUlDfRSoLAfcVSX1WnrxN,Chandelier,spotify:album:3xFSl9lIRaYXIYkIn3OIl9,216120,1000 Forms Of Fear,Mom's playlist,0.399,...,4VrWlk8IQxevMvERoX08iC,spotify:track:4VrWlk8IQxevMvERoX08iC,https://api.spotify.com/v1/tracks/4VrWlk8IQxev...,https://api.spotify.com/v1/audio-analysis/4VrW...,216120,5,89,australian_dance australian_pop pop,81,SiaChandelier


In [32]:
playlistDF_test[['artist_name','track_name']].head(20)

Unnamed: 0,artist_name,track_name
413,The Killers,Mr. Brightside
1234,Rihanna,We Found Love
1363,American Authors,Best Day Of My Life
1579,Clean Bandit,Rather Be (feat. Jess Glynne)
1732,Sia,Chandelier
3986,Hozier,Jackie And Wilson
3999,Aloe Blacc,I Need a Dollar
4002,Aloe Blacc,Wake Me Up - Acoustic
4007,John Legend,All of Me
4027,Pharrell Williams,"Happy - From ""Despicable Me 2"""


We want to represent this playlist as a single vector that can be compared with every other song in the dataset. First we separate the playlist's songs from the set of all songs, because we do not want to recommend songs that are already in the user's playlist.<br> The playlist's songs have already been processed by the steps described earlier and summarised in the function **create_feature_set**, so we only need to look up their rows in the complete feature set. The function **generate_playlist_feature** sums the feature vectors of all songs in the playlist into one vector, returned as a pandas Series. It also returns a DataFrame with the rest of the set, i.e. the features of the songs that are not in the test playlist.
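The summarising step amounts to an element-wise sum over the playlist's feature vectors. The sketch below shows it on three made-up vectors in plain Python; `sum_vectors` and the numbers are hypothetical. It also computes the mean, which differs from the sum only by a positive factor and therefore points in the same direction, so a direction-based similarity measure ranks songs the same way for both.

```python
def sum_vectors(vectors):
    """Element-wise sum of equally long feature vectors."""
    return [sum(column) for column in zip(*vectors)]

# Hypothetical feature vectors of three playlist songs
playlist_songs = [
    [0.5, 0.0, 0.2],
    [0.5, 0.0, 0.4],
    [0.0, 0.7, 0.3],
]

playlist_vector = sum_vectors(playlist_songs)
playlist_mean = [x / len(playlist_songs) for x in playlist_vector]

print(playlist_vector)
print(playlist_mean)
```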

In [27]:
def generate_playlist_feature(complete_feature_set, playlist_df):
    '''
    Input: 
    complete_feature_set (pandas dataframe): Dataframe which includes all of the features for the spotify songs
    playlist_df (pandas dataframe): playlist dataframe
        
    Output: 
    complete_feature_set_playlist_final (pandas series): single feature vector that summarizes the playlist (sum over its songs)
    complete_feature_set_nonplaylist (pandas dataframe): features of all songs that are not in the playlist
    '''
    
    # Find song features in the playlist
    complete_feature_set_playlist = complete_feature_set[complete_feature_set['id'].isin(playlist_df['id'].values)]
    # Find all non-playlist song features
    complete_feature_set_nonplaylist = complete_feature_set[~complete_feature_set['id'].isin(playlist_df['id'].values)]
    complete_feature_set_playlist_final = complete_feature_set_playlist.drop(columns = "id")
    return complete_feature_set_playlist_final.sum(axis = 0), complete_feature_set_nonplaylist

In [28]:
# Generate the features
complete_feature_set_playlist_vector, complete_feature_set_nonplaylist = generate_playlist_feature(complete_feature_set, playlistDF_test)
# Non-playlist features
complete_feature_set_nonplaylist.head()

Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|_hip_hop,genre|_roll,genre|a_cappella,genre|abstract_beats,genre|abstract_hip_hop,genre|accordion,genre|acid_jazz,genre|acid_rock,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0UaMYEvWZi0ZqiDOoHU3YI
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,6I9VzXrHxO9rA9A5euc8Ak
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0WqIKmW4BTrj3eJFmnCKMv
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,1AWQoqb9bSvzTjaLralEkT
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1lzr43nnXAijIGYnCT8M8H


### How to measure similarity between songs

We now have all the features of the playlist as a single vector, plus the remaining songs. We can measure the similarity between the playlist vector and each of the remaining songs separately. Here we use **cosine similarity**. Because it depends only on the angle between two vectors and not on their lengths, it tends to behave better than Euclidean distance as the number of dimensions grows, which suits us since we work with a dataset of 2,179 columns. The plot below shows how the distance between two vectors changes as the dimension grows.

![Screenshot%20%2881%29.png](attachment:Screenshot%20%2881%29.png)

The formula for this similarity measure is shown below.

![Screenshot%20%2880%29.png](attachment:Screenshot%20%2880%29.png)

An example of the cosine similarity between two vectors is illustrated in the figure below.

![Screenshot%20%2882%29.png](attachment:Screenshot%20%2882%29.png)
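Written out, the measure is the dot product of two vectors divided by the product of their lengths. The following sketch computes it in plain Python for made-up vectors; `cosine_similarity_1d` is an illustrative helper, while the notebook itself uses `cosine_similarity` from scikit-learn.

```python
import math

def cosine_similarity_1d(a, b):
    """cos(theta) = (a . b) / (|a| * |b|) for two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

playlist = [1.0, 2.0, 0.0]          # hypothetical playlist vector
same_direction = [2.0, 4.0, 0.0]    # same direction, twice as long
orthogonal = [0.0, 0.0, 3.0]        # shares no features with the playlist

print(cosine_similarity_1d(playlist, same_direction))
print(cosine_similarity_1d(playlist, orthogonal))
```

A vector pointing in the same direction scores close to 1 regardless of its length, and a vector with no features in common scores 0.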

We want to find songs that have a high cosine similarity to our playlist. Let us now create recommendations for our playlist with the function **generate_playlist_recos**, defined below.

In [29]:
def generate_playlist_recos(df, features, nonplaylist_features):
    '''
    Generate recommendations based on the songs in a specific playlist.
    ---
    Input: 
    df (pandas dataframe): spotify dataframe
    features (pandas series): summarized playlist feature
    nonplaylist_features (pandas dataframe): feature set of songs that are not in the selected playlist
        
    Output: 
    non_playlist_df_top_40: Top 40 recommendations for that playlist
    '''
    
    # Copy the selection so that adding the 'sim' column does not trigger SettingWithCopyWarning
    non_playlist_df = df[df['id'].isin(nonplaylist_features['id'].values)].copy()
    # Find cosine similarity between the playlist vector and every non-playlist song
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    non_playlist_df_top_40 = non_playlist_df.sort_values('sim',ascending = False).head(40) # return top 40 recommendations
    
    return non_playlist_df_top_40

In [30]:
# Generate the recommendations and show the top 20
recommend = generate_playlist_recos(songsDF, complete_feature_set_playlist_vector, complete_feature_set_nonplaylist)
recommend.head(20)


| | artist_name | id | track_name | danceability | energy | key | loudness | mode | speechiness | acousticness | ... | liveness | valence | tempo | artist_pop | genres | track_pop | genres_list | subjectivity | polarity | sim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28834 | American Authors | 64ybTt8CKxPdeXBNnu08Op | Believer | 0.583 | 0.968 | 1 | -2.909 | 1 | 0.0368 | 0.00141 | ... | 0.13 | 0.91 | 119.999 | 70 | indie_poptimism modern_alternative_rock modern... | 55 | [indie_poptimism, modern_alternative_rock, mod... | low | Neutral | 0.784019 |
| 51128 | American Authors | 1obisQNOcikRvTdStbW3pG | Go Big Or Go Home | 0.665 | 0.875 | 1 | -4.272 | 1 | 0.0426 | 0.00939 | ... | 0.0897 | 0.66 | 122.008 | 70 | indie_poptimism modern_alternative_rock modern... | 63 | [indie_poptimism, modern_alternative_rock, mod... | low | Neutral | 0.781734 |
| 43254 | The 1975 | 51cd3bzVmLAjlnsSZn4ecW | She's American | 0.647 | 0.857 | 1 | -3.94 | 1 | 0.0547 | 0.167 | ... | 0.0763 | 0.55 | 115.976 | 78 | modern_alternative_rock modern_rock pop rock | 55 | [modern_alternative_rock, modern_rock, pop, rock] | low | Neutral | 0.768755 |
| 28926 | Neon Trees | 0K1KOCeJBj3lpDYxEX9qP2 | Sleeping With A Friend | 0.582 | 0.882 | 2 | -4.256 | 1 | 0.0355 | 0.00189 | ... | 0.32 | 0.507 | 107.034 | 71 | modern_alternative_rock modern_rock pop pop_ro... | 59 | [modern_alternative_rock, modern_rock, pop, po... | low | Neutral | 0.763502 |
| 54403 | American Authors | 4gHD93RNqEhEh2NkYzl3x6 | Luck | 0.554 | 0.806 | 0 | -3.463 | 1 | 0.046 | 0.00177 | ... | 0.165 | 0.646 | 144.923 | 70 | indie_poptimism modern_alternative_rock modern... | 54 | [indie_poptimism, modern_alternative_rock, mod... | low | Neutral | 0.763177 |
| 55426 | WALK THE MOON | 71wT7aMCFPYfzutF66OLac | Aquaman | 0.63 | 0.772 | 1 | -6.986 | 1 | 0.0297 | 0.51 | ... | 0.0881 | 0.721 | 99.964 | 72 | dance_pop dance_rock indie_poptimism modern_al... | 46 | [dance_pop, dance_rock, indie_poptimism, moder... | low | Neutral | 0.756824 |
| 44455 | Neon Trees | 1fBl642IhJOE5U319Gy2Go | Animal | 0.482 | 0.833 | 5 | -5.611 | 1 | 0.0449 | 0.000346 | ... | 0.365 | 0.74 | 148.039 | 71 | modern_alternative_rock modern_rock pop pop_ro... | 74 | [modern_alternative_rock, modern_rock, pop, po... | low | Neutral | 0.754931 |
| 43408 | The 1975 | 1v07ywlVYd02pOCnXRBDNA | Menswear | 0.708 | 0.539 | 1 | -10.281 | 1 | 0.0681 | 0.541 | ... | 0.0856 | 0.159 | 97.015 | 78 | modern_alternative_rock modern_rock pop rock | 51 | [modern_alternative_rock, modern_rock, pop, rock] | low | Neutral | 0.754552 |
| 43278 | The 1975 | 3xrwXWG4O9uhtRyAd3MCou | Heart Out | 0.706 | 0.83 | 2 | -4.918 | 1 | 0.0274 | 0.00822 | ... | 0.0763 | 0.886 | 118.446 | 78 | modern_alternative_rock modern_rock pop rock | 54 | [modern_alternative_rock, modern_rock, pop, rock] | low | Neutral | 0.751452 |
| 14547 | The 1975 | 5hc71nKsUgtwQ3z52KEKQk | Somebody Else | 0.61 | 0.788 | 0 | -5.724 | 1 | 0.0585 | 0.195 | ... | 0.153 | 0.472 | 101.045 | 78 | modern_alternative_rock modern_rock pop rock | 75 | [modern_alternative_rock, modern_rock, pop, rock] | low | Neutral | 0.748933 |


Finally, we have 10 new songs that our system recommended based on "Mom's playlist". Unlike with typical machine learning algorithms, we have no defined accuracy metric here with which to quantify how well the algorithm performs, so the system can only be evaluated subjectively. In our own judgment, the system suggests new songs well. Recall that the recommendations were generated by looking for the songs with the smallest cosine distance to the playlist, i.e. the songs that are similar in genre, key, subjectivity, polarity, and so on. For comparison, let us look again at part of the contents of our "Mom's playlist".

In [31]:
recommend[["artist_name","track_name"]][:20]

| | artist_name | track_name |
|---|---|---|
| 28834 | American Authors | Believer |
| 51128 | American Authors | Go Big Or Go Home |
| 43254 | The 1975 | She's American |
| 28926 | Neon Trees | Sleeping With A Friend |
| 54403 | American Authors | Luck |
| 55426 | WALK THE MOON | Aquaman |
| 44455 | Neon Trees | Animal |
| 43408 | The 1975 | Menswear |
| 43278 | The 1975 | Heart Out |
| 14547 | The 1975 | Somebody Else |


In [21]:
playlistDF_test[["artist_name","track_name"]][:20]

| | artist_name | track_name |
|---|---|---|
| 413 | The Killers | Mr. Brightside |
| 1234 | Rihanna | We Found Love |
| 1363 | American Authors | Best Day Of My Life |
| 1579 | Clean Bandit | Rather Be (feat. Jess Glynne) |
| 1732 | Sia | Chandelier |
| 3986 | Hozier | Jackie And Wilson |
| 3999 | Aloe Blacc | I Need a Dollar |
| 4002 | Aloe Blacc | Wake Me Up - Acoustic |
| 4007 | John Legend | All of Me |
| 4027 | Pharrell Williams | Happy - From "Despicable Me 2" |


## Conclusion

In this seminar we presented only one way of recommending music, namely content-based filtering. Of course, it is not the only way to recommend music. There are recommendation techniques based solely on the popularity of the artist or the song rather than on genres, which is what we relied on in this seminar. There is also the collaborative filtering approach, which we did not cover here.