# Seminar 1

## Introduction

Recommender systems are a family of machine learning algorithms that suggest items to users based on their characteristics, preferences, or past behavior. A good implementation of such a system contributes greatly to the quality of the platform it runs on.<br>
Take the social network TikTok as an example. In just a few years TikTok has gained enormous popularity, for which its recommendation algorithm is widely credited. Besides funny videos, recommender systems are used to recommend films, music, books, and products of all kinds.<br>
In this seminar we describe how these systems work and which types exist, and at the end we implement one such system for recommending music.

## Music recommender systems

Spotify and Apple Music are just two of the services that have built music recommender systems into their platforms. Spotify, for example, generates a set of personalized playlists for every user each week. Playlists can be generated from the user's mood or from their listening history.
Spotify is well known for its recommender systems, which are often praised for their accuracy. But how do these systems actually work?
In this seminar we present one way of implementing a recommender system. We focus on the content-based filtering approach, which compares songs and suggests those that are similar. Song similarity can be measured in different ways, e.g. by artist, style, or genre.

### Building the recommender system

The final product of this seminar is a system that, given a playlist, recommends a list of other songs.

Implementing a system of this kind takes a few steps, shown in the figure below.

![Screenshot%20%2873%29.png](attachment:Screenshot%20%2873%29.png)

We start with data cleaning. We work with the dataset "playlistDF", which contains a large number of playlists and a set of attributes describing each song within a playlist. Let us take a closer look at the shape and contents of the dataset.

In [1]:
# Import library
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import re

In [2]:
# Import processed data
playlistDF = pd.read_csv("data/processed_dataPlaylist.csv")
print(playlistDF.columns)
playlistDF.drop(columns=["Unnamed: 0",'Unnamed: 0.1'], inplace = True)
playlistDF.head()

Index(['Unnamed: 0.1', 'Unnamed: 0', 'pos', 'artist_name', 'track_uri',
       'artist_uri', 'track_name', 'album_uri', 'duration_ms_x', 'album_name',
       'name', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url',
       'duration_ms_y', 'time_signature', 'artist_pop', 'genres', 'track_pop'],
      dtype='object')


Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,name,danceability,...,type,id,uri,track_href,analysis_url,duration_ms_y,time_signature,artist_pop,genres,track_pop
0,0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Throwbacks,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
1,73,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,w o r k o u t,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
2,14,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,party playlist,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
3,42,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Dance mix,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
4,1,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,spin,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69


The dataset we use is complex: each record is described by 30 variables. Let us look at how many songs we actually have available and how they are described.

In [3]:
## Get all songs that we have from the playlist dataset

## We have duplicates in songs across playlists. We need a separate dataFrame for song data
def remove_duplicates(data):
    data['artists_song'] = data.apply(lambda row: row['artist_name']+row['track_name'],axis = 1)
    return data.drop_duplicates('artists_song')

songsDF = remove_duplicates(playlistDF)
print("Are all songs unique: ",len(pd.unique(songsDF.artists_song))==len(songsDF))

Are all songs unique:  True
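
As a hedged aside: the concatenated key works for this dataset, but joining two strings without a separator can in principle make different (artist, track) pairs collide. A minimal sketch on made-up rows (the toy frame below is an assumption, not part of the dataset) that deduplicates on the two columns directly:

```python
import pandas as pd

# Toy rows (hypothetical): the same song appears in two playlists.
toy = pd.DataFrame({
    "artist_name": ["Britney Spears", "Britney Spears", "Shaggy"],
    "track_name": ["Toxic", "Toxic", "It Wasn't Me"],
    "name": ["Throwbacks", "party playlist", "Throwbacks"],
})

# Deduplicate on the (artist, track) pair without building a helper column.
# By default the first occurrence of each pair is kept.
unique_songs = toy.drop_duplicates(subset=["artist_name", "track_name"])
print(len(unique_songs))
```

Either approach leaves one row per song; the `subset` form simply avoids the extra `artists_song` column.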


In [4]:
songsDF.head()

Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,name,danceability,...,id,uri,track_href,analysis_url,duration_ms_y,time_signature,artist_pop,genres,track_pop,artists_song
0,0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Throwbacks,0.904,...,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69,Missy ElliottLose Control (feat. Ciara & Fat M...
6,1,Britney Spears,6I9VzXrHxO9rA9A5euc8Ak,spotify:artist:26dSoYclwsYLMAKD3tpOr4,Toxic,spotify:album:0z7pVBGOD7HCIB7S8eLkLI,198800,In The Zone,Throwbacks,0.774,...,6I9VzXrHxO9rA9A5euc8Ak,spotify:track:6I9VzXrHxO9rA9A5euc8Ak,https://api.spotify.com/v1/tracks/6I9VzXrHxO9r...,https://api.spotify.com/v1/audio-analysis/6I9V...,198800,4,84,dance_pop pop post-teen_pop,83,Britney SpearsToxic
19,2,Beyoncé,0WqIKmW4BTrj3eJFmnCKMv,spotify:artist:6vWDO969PvNqNYHIOW5v0m,Crazy In Love,spotify:album:25hVFAxTlDvXbx2X2QkUkE,235933,Dangerously In Love (Alben für die Ewigkeit),Throwbacks,0.664,...,0WqIKmW4BTrj3eJFmnCKMv,spotify:track:0WqIKmW4BTrj3eJFmnCKMv,https://api.spotify.com/v1/tracks/0WqIKmW4BTrj...,https://api.spotify.com/v1/audio-analysis/0WqI...,235933,4,86,dance_pop pop r&b,25,BeyoncéCrazy In Love
46,3,Justin Timberlake,1AWQoqb9bSvzTjaLralEkT,spotify:artist:31TPClRtHm23RisEBtV3X7,Rock Your Body,spotify:album:6QPkyl04rXwTGlGlcYaRoW,267266,Justified,Throwbacks,0.892,...,1AWQoqb9bSvzTjaLralEkT,spotify:track:1AWQoqb9bSvzTjaLralEkT,https://api.spotify.com/v1/tracks/1AWQoqb9bSvz...,https://api.spotify.com/v1/audio-analysis/1AWQ...,267267,4,82,dance_pop pop,79,Justin TimberlakeRock Your Body
55,4,Shaggy,1lzr43nnXAijIGYnCT8M8H,spotify:artist:5EvFsr3kj42KNv97ZEnqij,It Wasn't Me,spotify:album:6NmFmPX56pcLBOFMhIiKvF,227600,Hot Shot,Throwbacks,0.853,...,1lzr43nnXAijIGYnCT8M8H,spotify:track:1lzr43nnXAijIGYnCT8M8H,https://api.spotify.com/v1/tracks/1lzr43nnXAij...,https://api.spotify.com/v1/audio-analysis/1lzr...,227600,4,75,pop_rap reggae_fusion,2,ShaggyIt Wasn't Me


In [5]:
print("Number of songs in the dataset : ",len(songsDF))

Number of songs in the dataset :  34247


We have 34,247 songs available. Not all features from the set above are needed to build the recommender system. We focus on the core features that describe a song, and build recommendations from artists, genres, and the audio description of the song ('danceability', 'energy', 'key', 'loudness', ...). The core features fall into the following three groups:
1. Metadata: id, genres, artist_pop, track_pop
2. Audio:
    - Mood (Danceability, Valence, Energy, Tempo)
    - Properties (Loudness, Speechiness, Instrumentalness)
    - Context (Liveness, Acousticness)
    - Key, Mode
3. Text (track name)

Metadata is information attached to the song file, such as the artist name, release date, track name, and the like.

We extract these key features from our dataset and build the recommender system on them.

In [6]:
# Select useful columns
def select_useful_cols(df):
    return df[['artist_name','id','track_name','danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', "artist_pop", "genres", "track_pop"]]
songsDF = select_useful_cols(songsDF)
songsDF.head(5)

Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,artist_pop,genres,track_pop
0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,Lose Control (feat. Ciara & Fat Man Scoop),0.904,0.813,4,-7.105,0,0.121,0.0311,0.00697,0.0471,0.81,125.461,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
6,Britney Spears,6I9VzXrHxO9rA9A5euc8Ak,Toxic,0.774,0.838,5,-3.914,0,0.114,0.0249,0.025,0.242,0.924,143.04,84,dance_pop pop post-teen_pop,83
19,Beyoncé,0WqIKmW4BTrj3eJFmnCKMv,Crazy In Love,0.664,0.758,2,-6.583,0,0.21,0.00238,0.0,0.0598,0.701,99.259,86,dance_pop pop r&b,25
46,Justin Timberlake,1AWQoqb9bSvzTjaLralEkT,Rock Your Body,0.892,0.714,4,-6.055,0,0.141,0.201,0.000234,0.0521,0.817,100.972,82,dance_pop pop,79
55,Shaggy,1lzr43nnXAijIGYnCT8M8H,It Wasn't Me,0.853,0.606,0,-4.596,1,0.0713,0.0561,0.0,0.313,0.654,94.759,75,pop_rap reggae_fusion,2


In [7]:
songsDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34247 entries, 0 to 67498
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist_name       34247 non-null  object 
 1   id                34247 non-null  object 
 2   track_name        34247 non-null  object 
 3   danceability      34247 non-null  float64
 4   energy            34247 non-null  float64
 5   key               34247 non-null  int64  
 6   loudness          34247 non-null  float64
 7   mode              34247 non-null  int64  
 8   speechiness       34247 non-null  float64
 9   acousticness      34247 non-null  float64
 10  instrumentalness  34247 non-null  float64
 11  liveness          34247 non-null  float64
 12  valence           34247 non-null  float64
 13  tempo             34247 non-null  float64
 14  artist_pop        34247 non-null  int64  
 15  genres            34247 non-null  object 
 16  track_pop         34247 non-null  int64 

The output above gives a clearer view of the variables we now work with. Note that the dataset has no missing values, which is convenient.

Genre is one of the most important features when building a recommender system. When exploring new music, most people rely mainly on the genre of the artist or song. Our dataset does not assign a single genre to a song; a song can be classified under several genres.
For reasons that become clear later, we convert the "genres" column back into a list.

In [11]:
def genre_preprocess(df):
    df['genres_list'] = df['genres'].apply(lambda x: x.split(" "))
    return df
songsDF = genre_preprocess(songsDF)

songsDF[['track_name','genres_list']].head(5)


Unnamed: 0,track_name,genres_list
0,Lose Control (feat. Ciara & Fat Man Scoop),"[dance_pop, hip_hop, hip_pop, pop, pop_rap, r&..."
6,Toxic,"[dance_pop, pop, post-teen_pop]"
19,Crazy In Love,"[dance_pop, pop, r&b]"
46,Rock Your Body,"[dance_pop, pop]"
55,It Wasn't Me,"[pop_rap, reggae_fusion]"


The output above shows the genres assigned to the first five songs in our set. For example, the song "Rock Your Body" is classified under 2 genres.

The song title can also be a factor in recommending music, so it is worth extracting some information from it. Specifically, we add two columns to our dataset, "Subjectivity" and "Polarity".<br>
The "Subjectivity" variable expresses how much of the text is personal opinion as opposed to factual information. TextBlob returns it as a score between 0 and 1, which we convert into a categorical variable with the values [low, medium, high].<br>
The "Polarity" variable expresses how negative or positive the sentiment of the text is. TextBlob returns it as a score between -1 and 1, which we likewise convert into a categorical variable with the values [Negative, Neutral, Positive].<br>
Both variables are computed with the "TextBlob" library. We extend our dataset with these two variables.

In [12]:
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
    return TextBlob(text).sentiment.polarity

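def getAnalysis(score, task="polarity"):
    # Map a numeric TextBlob score to a categorical label
    if task == "subjectivity":
        # Subjectivity lies in [0, 1]; split it into three equal bands
        if score < 1/3:
            return "low"
        elif score > 2/3:
            return "high"
        else:
            return "medium"
    else:
        # Polarity lies in [-1, 1]; only the sign is kept
        if score < 0:
            return 'Negative'
        elif score == 0:
            return 'Neutral'
        else:
            return 'Positive'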

def sentiment_analysis(df, text_col): #analysis based on the track name
    df['subjectivity'] = df[text_col].apply(getSubjectivity).apply(lambda x: getAnalysis(x,"subjectivity"))
    df['polarity'] = df[text_col].apply(getPolarity).apply(getAnalysis)
    return df

In [14]:
# Show result
sentiment = sentiment_analysis(songsDF, "track_name")
sentiment[['track_name','subjectivity','polarity']].head()

Unnamed: 0,track_name,subjectivity,polarity
0,Lose Control (feat. Ciara & Fat Man Scoop),low,Neutral
6,Toxic,low,Neutral
19,Crazy In Love,high,Negative
46,Rock Your Body,low,Neutral
55,It Wasn't Me,low,Neutral


For example, the title "Crazy In Love" gets high subjectivity and negative polarity, i.e. TextBlob reads it as opinionated and slightly negative in tone, most likely because of the word "crazy".

To build a good recommender system, some columns need to be processed with **one-hot encoding**. This operation turns a single categorical variable into a group of bits in which exactly one is 1 and the rest are 0. A simple example of how the operation works is shown in the figure below.

![ohe.png](attachment:ohe.png)
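
The same idea in code, as a minimal sketch on a toy column (the four values are made up for illustration):

```python
import pandas as pd

# Toy categorical column (hypothetical values).
toy = pd.DataFrame({"subjectivity": ["low", "high", "low", "medium"]})

# One indicator column per category; exactly one of them is set in each row.
encoded = pd.get_dummies(toy["subjectivity"], prefix="subject", prefix_sep="|", dtype=int)
print(encoded)
```

Each row sums to 1, which is what "one-hot" refers to.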

We created the "subjectivity" column, which is in fact a categorical variable. We process that column with OHE and show what our data looks like afterwards.

In [17]:
def ohe_prep(df, column, new_name): 
    
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    return tf_df


# One-hot encoding for only subjectivity num
subject_ohe = ohe_prep(sentiment, 'subjectivity','subject')
subject_ohe.head(5)

Unnamed: 0,subject|high,subject|low,subject|medium
0,0,1,0
1,0,1,0
2,1,0,0
3,0,1,0
4,0,1,0


We see that the "subjectivity" column has been renamed to "subject" and then expanded according to the OHE scheme.

Earlier we stressed the importance of the genre variable, but also pointed out that a song or artist can be classified under several genres. From the earlier outputs we could see that some genre values are more general than others. For example, a song can be classified as "pop", "dance_pop", or some other form of pop music. Clearly not all genre values are equally important, i.e. they do not carry the same amount of information. We therefore need to assign weights to the genres. We do so with the **TF-IDF** (Term Frequency-Inverse Document Frequency) method. The formula for **TF-IDF** is shown in the figure below.

![Screenshot%20%2878%29.png](attachment:Screenshot%20%2878%29.png)

1. **Term Frequency (TF)**: the number of times a word (i.e. term) occurs in a document, divided by the total number of words in that document.

2. **Inverse Document Frequency (IDF)**: the logarithm of the total number of documents divided by the number of documents in which the word (i.e. term) occurs.
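
Written as formulas, in the common textbook form (the notation in the figure may differ), for a term $t$, a document $d$, a raw count $f_{t,d}$ of $t$ in $d$, and a corpus of $N$ documents:

$$
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\left|\{d : t \in d\}\right|}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
$$

Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed variant, $\mathrm{idf}(t) = \ln\frac{1+N}{1+\mathrm{df}(t)} + 1$, and L2-normalizes each row, so its numbers will not match the textbook formula exactly.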

**TF** expresses how important a word is within one document, while **IDF** expresses how informative the word is across all documents. In our case the documents are songs and the terms or words are individual genres. We want to know how dominant a genre is within one song and how common it is across the whole dataset. In this way each genre gets a weight that suits it. A clearer overview of the **TF-IDF** method is shown below.

![Screenshot%20%2879%29.png](attachment:Screenshot%20%2879%29.png)

This method is implemented with the TfidfVectorizer class from scikit-learn.
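
A minimal sketch on toy data (the three genre strings are made up) of how `TfidfVectorizer` weighs genres, and of one tokenization detail worth knowing about:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy "documents" (hypothetical): each string is the genre list of one song.
toy_genres = ["dance_pop pop", "pop rock", "pop reggae_fusion"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy_genres).toarray()
col = tfidf.vocabulary_  # maps each genre token to its column index

# "pop" occurs in every song, so within the first song it weighs less than "dance_pop".
print(weights[0, col["dance_pop"]], weights[0, col["pop"]])

# With the default token pattern, tokens shorter than two characters are dropped
# and "&" acts as a separator, so "r&b" leaves no token behind.
print(sorted(TfidfVectorizer().fit(["pop r&b"]).vocabulary_))
```

If genres such as "r&b" should survive as single tokens, one option is to pass a custom `token_pattern` or `tokenizer` to `TfidfVectorizer`; whether that matters here depends on how much such genres contribute.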

In [26]:
# TF-IDF implementation
tfidf = TfidfVectorizer()
tfidf_matrix =  tfidf.fit_transform(songsDF['genres_list'].apply(lambda x: " ".join(x)))
genre_df = pd.DataFrame(tfidf_matrix.toarray())
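# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
# NOTE: drop() returns a new DataFrame; the result is not assigned here, so 'genre|unknown' stays in genre_df
genre_df.drop(columns='genre|unknown')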
genre_df.reset_index(drop = True, inplace=True)
genre_df.head(5)



Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|_hip_hop,genre|_roll,genre|a_cappella,genre|abstract_beats,genre|abstract_hip_hop,genre|accordion,genre|acid_jazz,genre|acid_rock,...,genre|xtra_raw,genre|yacht_rock,genre|ye_ye,genre|yodeling,genre|york_indie,genre|zambian_hip_hop,genre|zhongguo_feng,genre|zolo,genre|zouk,genre|zouk_riddim
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The dataset contains a huge number of genres and subgenres (2147 of them!). The output above does not show which genres carry weight for the first five songs, but it does show that the genres displayed have zero weight for these songs.

The features we work with have different magnitudes, so they need to be normalized. Here we use MinMaxScaler, and for now we normalize only the "artist_pop" feature.
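
As a minimal sketch of what min-max scaling does, on made-up popularity scores (the values are assumptions for illustration), the manual formula and `MinMaxScaler` give the same result:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy popularity scores (hypothetical values), one column.
pop = np.array([[0.0], [25.0], [74.0], [100.0]])

# Min-max scaling by hand: (x - min) / (max - min), computed per column.
manual = (pop - pop.min(axis=0)) / (pop.max(axis=0) - pop.min(axis=0))

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(pop)
print(scaled.ravel())
```

After scaling, every column lies in the range [0, 1], so no single feature dominates the similarity computation merely because of its units.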

In [27]:
# Normalization
pop = songsDF[["artist_pop"]].reset_index(drop = True)
scaler = MinMaxScaler()
pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns)
pop_scaled.head()

Unnamed: 0,artist_pop
0,0.74
1,0.84
2,0.86
3,0.82
4,0.75


## Transforming the dataset

Let us summarize the steps presented so far. First, we extracted all genres and assigned them weights with the **TF-IDF** method. We added the "subjectivity" and "polarity" columns and introduced the **OHE** method. Finally, we noted the need to scale the features. We carried out these steps on helper objects, which now need to be combined into the actual feature set. The function below does that.

In [28]:
def create_feature_set(df, float_cols):
    
    # Tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['genres_list'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
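    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
    # NOTE: drop() returns a new DataFrame; the result is not assigned here, so 'genre|unknown' stays in genre_df
    genre_df.drop(columns='genre|unknown')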
    genre_df.reset_index(drop = True, inplace=True)
    
    # Sentiment analysis
    df = sentiment_analysis(df, "track_name")

    # One-hot Encoding
    subject_ohe = ohe_prep(df, 'subjectivity','subject') * 0.3
    polar_ohe = ohe_prep(df, 'polarity','polar') * 0.5
    key_ohe = ohe_prep(df, 'key','key') * 0.5
    mode_ohe = ohe_prep(df, 'mode','mode') * 0.5

    # Normalization
    # Scale popularity columns
    pop = df[["artist_pop","track_pop"]].reset_index(drop = True)
    scaler = MinMaxScaler()
    pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns) * 0.2 

    # Scale audio columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) * 0.2

    # Concatenate all features
    final = pd.concat([genre_df, floats_scaled, pop_scaled, subject_ohe, polar_ohe, key_ohe, mode_ohe], axis = 1)
    
    # Add song id
    final['id']=df['id'].values
    
    return final

In [29]:
# Save the data and generate the features
float_cols = songsDF.dtypes[songsDF.dtypes == 'float64'].index.values
songsDF.to_csv("data/allsongs_data.csv", index = False)

# Generate features
complete_feature_set = create_feature_set(songsDF, float_cols=float_cols)
complete_feature_set.to_csv("data/complete_feature.csv", index = False)
complete_feature_set.head()



Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|_hip_hop,genre|_roll,genre|a_cappella,genre|abstract_beats,genre|abstract_hip_hop,genre|accordion,genre|acid_jazz,genre|acid_rock,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0UaMYEvWZi0ZqiDOoHU3YI
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,6I9VzXrHxO9rA9A5euc8Ak
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0WqIKmW4BTrj3eJFmnCKMv
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,1AWQoqb9bSvzTjaLralEkT
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1lzr43nnXAijIGYnCT8M8H


We finally have a feature set we can use to recommend songs. The TF-IDF genre columns, together with the one-hot encoded columns, bring the number of columns to 2179. Let us now see how to recommend songs.

### The recommender system

In [None]:

# Test playlist:  Mom's playlist
playlistDF_test = playlistDF[playlistDF['name']=="Mom's playlist"]
playlistDF_test.head()
playlistDF_test.to_csv("data/test_playlist.csv")

In [None]:
def generate_playlist_feature(complete_feature_set, playlist_df):
    '''
    Summarize a user's playlist into a single vector
    ---
    Input: 
    complete_feature_set (pandas dataframe): Dataframe which includes all of the features for the spotify songs
    playlist_df (pandas dataframe): playlist dataframe
        
    Output: 
    complete_feature_set_playlist_final (pandas series): single vector feature that summarizes the playlist
    complete_feature_set_nonplaylist (pandas dataframe): features of all songs that are not in the playlist
    '''
    
    # Find song features in the playlist
    complete_feature_set_playlist = complete_feature_set[complete_feature_set['id'].isin(playlist_df['id'].values)]
    # Find all non-playlist song features
    complete_feature_set_nonplaylist = complete_feature_set[~complete_feature_set['id'].isin(playlist_df['id'].values)]
    complete_feature_set_playlist_final = complete_feature_set_playlist.drop(columns = "id")
    return complete_feature_set_playlist_final.sum(axis = 0), complete_feature_set_nonplaylist
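
A minimal sketch of the same summarization step on a toy feature set (the ids and the two feature columns are made up for illustration):

```python
import pandas as pd

# Toy feature set (hypothetical ids and feature columns).
features = pd.DataFrame({
    "genre|pop": [0.5, 0.0, 1.0],
    "energy": [0.2, 0.1, 0.3],
    "id": ["a", "b", "c"],
})
playlist_ids = ["a", "c"]

in_playlist = features["id"].isin(playlist_ids)
# Sum the feature rows of the playlist songs into one vector.
playlist_vector = features[in_playlist].drop(columns="id").sum(axis=0)
# Everything else stays available as a recommendation candidate.
candidates = features[~in_playlist]
print(playlist_vector)
```

Because cosine similarity ignores vector length, summing the rows and averaging them lead to the same ranking of candidates.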

In [None]:
# Generate the features
complete_feature_set_playlist_vector, complete_feature_set_nonplaylist = generate_playlist_feature(complete_feature_set, playlistDF_test)
# Non-playlist features
complete_feature_set_nonplaylist.head()

### Similarity

In [None]:
def generate_playlist_recos(df, features, nonplaylist_features):
    '''
    Generate recommendations based on the songs in a specific playlist.
    ---
    Input: 
    df (pandas dataframe): spotify dataframe
    features (pandas series): summarized playlist feature (single vector)
    nonplaylist_features (pandas dataframe): feature set of songs that are not in the selected playlist
        
    Output: 
    non_playlist_df_top_40: Top 40 recommendations for that playlist
    '''
    
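    # Copy the filtered frame so that adding the 'sim' column does not trigger SettingWithCopyWarning
    non_playlist_df = df[df['id'].isin(nonplaylist_features['id'].values)].copy()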
    # Find cosine similarity between the playlist and the complete song set
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    non_playlist_df_top_40 = non_playlist_df.sort_values('sim',ascending = False).head(40)
    
    return non_playlist_df_top_40
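
The ranking above relies on cosine similarity, $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$, which compares the direction of two vectors and ignores their length. A minimal sketch on toy vectors (all values are made up for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy playlist vector and two candidate songs (hypothetical values).
playlist_vector = np.array([[1.0, 1.0, 0.0]])
candidates = np.array([
    [2.0, 2.0, 0.0],   # same direction as the playlist, different length
    [0.0, 0.0, 1.0],   # shares no feature with the playlist
])

# One similarity score per candidate; sort descending to rank them.
sim = cosine_similarity(candidates, playlist_vector)[:, 0]
ranking = np.argsort(-sim)
print(sim, ranking)
```

The first candidate scores 1 despite being twice as long as the playlist vector, while the second, which is orthogonal to it, scores 0.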

In [None]:
# Generate top 10 recommendations
recommend = generate_playlist_recos(songsDF, complete_feature_set_playlist_vector, complete_feature_set_nonplaylist)
recommend.head(10)

In [None]:
playlistDF_test[["artist_name","track_name"]][:20]