# Welcome to the "ML4Recsys: Intro to content-based filtering" Notebook

In this notebook we will try to recommend a list of songs based on one song that the user has already listened to. The steps are:

1. Read the data
2. Make the vector representation
3. Calculate the similarity between songs based on the vector representation

## Read the data

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('SongsDataset.csv')

df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik
0,1301180000.0,1.0,Reality Club,Is It The Answer?,I make you break\nYou move I take\nLove is the...
1,1301180000.0,2.0,Simple Plan,Jet lag,"Whoa, oh, oh\nWhoa, oh, oh\nSo jet-lagged\n\nW..."
2,1301180000.0,3.0,The Script,Superheroes,All the life she has seen\nAll the meaner side...
3,1301180000.0,4.0,The Script,Breakeven,I'm still alive but I'm barely breathing\nJust...
4,1301180000.0,5.0,Green Day,21 Guns,"Do you know what's worth fighting for,\nWhen i..."


In [2]:
df.isnull().sum()

NIM          1
Submisike    1
Artis        2
Judul        2
Lirik        2
dtype: int64

In [3]:
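# Forward-fill missing values with the value from the previous row
# (fillna(method='ffill') is deprecated in recent pandas releases)
df.ffill(inplace=True)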

In [4]:
df.isnull().sum()

NIM          0
Submisike    0
Artis        0
Judul        0
Lirik        0
dtype: int64

## Preprocessing

In [5]:
%%time
import re

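# Lower-casing, replacing '.' with a space, dropping '_' and '/',
# then removing digits together with any characters attached after them
df['Lirik'] = [re.sub(r'\d+\S*', '',
                  row.lower().replace('.', ' ').replace('_', '').replace('/', ''))
                  for row in df['Lirik']]

# Removing single-character words. Note: the match includes the surrounding
# spaces, so the neighbouring words get glued together (e.g. "movetake" below)
df['Lirik'] = [re.sub(r'(?:^| )\w(?:$| )', '', row)
                  for row in df['Lirik']]

# Replacing line breaks with spaces
df["Lirik"] = df["Lirik"].apply(lambda x: str(x).replace("\n", " "))
# Removing any remaining digits
df['Lirik'] = [re.sub(r'\d+', '', row) for row in df['Lirik']]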

CPU times: user 79.3 ms, sys: 670 µs, total: 80 ms
Wall time: 81.2 ms
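
The single-character filter above deletes the spaces around the removed letter, which is why merged tokens such as "movetake" show up in the preview. A minimal sketch of a variant that keeps word boundaries intact follows; the helper name `drop_single_char_words` is hypothetical and not part of the notebook, and using it would change the vocabulary and every downstream number shown here.

```python
import re

def drop_single_char_words(text: str) -> str:
    """Remove whitespace-delimited one-character words without merging neighbours."""
    # (?<!\S) and (?!\S) require whitespace (or the string edge) on both sides,
    # so the "i" in "i'm" is left alone while a standalone "i" is removed.
    without_singles = re.sub(r'(?<!\S)\w(?!\S)', '', text)
    # Collapse the leftover runs of whitespace into single spaces.
    return re.sub(r'\s+', ' ', without_singles).strip()

print(drop_single_char_words("you move i take"))
print(drop_single_char_words("i'm still alive"))
```

Because the lookarounds consume no characters, the spaces on either side of the removed letter survive and are then collapsed into one.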


In [6]:
df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik
0,1301180000.0,1.0,Reality Club,Is It The Answer?,make you break you movetake love is the answer...
1,1301180000.0,2.0,Simple Plan,Jet lag,"whoa, oh, oh whoa, oh, oh so jet-lagged what ..."
2,1301180000.0,3.0,The Script,Superheroes,all the life she has seen all the meaner side ...
3,1301180000.0,4.0,The Script,Breakeven,i'm still alive but i'm barely breathing just ...
4,1301180000.0,5.0,Green Day,21 Guns,"do you know what's worth fighting for, when it..."


In [7]:
%%time
import nltk
nltk.download("stopwords")

# Tokenizing the lyrics and putting the tokens into a new column
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')  # runs of word characters; punctuation is dropped
df['tokens'] = df['Lirik'].apply(tokenizer.tokenize)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
CPU times: user 712 ms, sys: 225 ms, total: 938 ms
Wall time: 1.35 s


In [8]:
df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik,tokens
0,1301180000.0,1.0,Reality Club,Is It The Answer?,make you break you movetake love is the answer...,"[make, you, break, you, movetake, love, is, th..."
1,1301180000.0,2.0,Simple Plan,Jet lag,"whoa, oh, oh whoa, oh, oh so jet-lagged what ...","[whoa, oh, oh, whoa, oh, oh, so, jet, lagged, ..."
2,1301180000.0,3.0,The Script,Superheroes,all the life she has seen all the meaner side ...,"[all, the, life, she, has, seen, all, the, mea..."
3,1301180000.0,4.0,The Script,Breakeven,i'm still alive but i'm barely breathing just ...,"[i, m, still, alive, but, i, m, barely, breath..."
4,1301180000.0,5.0,Green Day,21 Guns,"do you know what's worth fighting for, when it...","[do, you, know, what, s, worth, fighting, for,..."


In [9]:
%%time
# Removing English stop words (punctuation was already dropped by the tokenizer)
from nltk.corpus import stopwords
#stopwords.words('english')

filtered_words = []
for row in df['tokens']:
    filtered_words.append([
        word.lower() for word in row
        if word.lower() not in nltk.corpus.stopwords.words('english')
    ])

df['tokens'] = filtered_words

CPU times: user 18.1 s, sys: 2.13 s, total: 20.3 s
Wall time: 20.3 s
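
Most of the 20-second wall time above likely comes from calling `stopwords.words('english')` once per token, which appears to rebuild the word list on every call. A minimal sketch of the same filter with the list loaded once into a set is shown below. To stay self-contained it uses a tiny stand-in set (`STOP_WORDS`, an assumption); in the notebook the set would be built once from `stopwords.words('english')`.

```python
# Stand-in for set(stopwords.words('english')); an assumption for illustration only.
STOP_WORDS = frozenset({"you", "is", "the", "i", "m", "s"})

def remove_stop_words(rows, stop_words=STOP_WORDS):
    """Lower-case every token and drop the ones found in stop_words."""
    # Set membership is O(1) on average, and the set is built only once.
    return [[word.lower() for word in row if word.lower() not in stop_words]
            for row in rows]

print(remove_stop_words([["make", "you", "break"], ["Love", "is", "the", "answer"]]))
```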


In [10]:
df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik,tokens
0,1301180000.0,1.0,Reality Club,Is It The Answer?,make you break you movetake love is the answer...,"[make, break, movetake, love, answer, say, ifw..."
1,1301180000.0,2.0,Simple Plan,Jet lag,"whoa, oh, oh whoa, oh, oh so jet-lagged what ...","[whoa, oh, oh, whoa, oh, oh, jet, lagged, time..."
2,1301180000.0,3.0,The Script,Superheroes,all the life she has seen all the meaner side ...,"[life, seen, meaner, side, took, away, prophet..."
3,1301180000.0,4.0,The Script,Breakeven,i'm still alive but i'm barely breathing just ...,"[still, alive, barely, breathing, prayin, togo..."
4,1301180000.0,5.0,Green Day,21 Guns,"do you know what's worth fighting for, when it...","[know, worth, fighting, worth, dying, take, br..."


In [11]:
%%time
# Setting the Lemmatization object
nltk.download("wordnet")
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()

# Looping through the words and appending the lemmatized version to a list
stemmed_words = []
for row in df['tokens']:
    stemmed_words.append([
        # Verbs
        lmtzr.lemmatize(  
            # Adjectives
            lmtzr.lemmatize(  
                # Nouns
                lmtzr.lemmatize(word.lower()), 'a'), 'v')
        for word in row
        if word.lower() not in nltk.corpus.stopwords.words('english')])

# Adding the list as a column in the data frame
df['tokens'] = stemmed_words

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
CPU times: user 12.3 s, sys: 1.24 s, total: 13.6 s
Wall time: 13.6 s


In [12]:
df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik,tokens
0,1301180000.0,1.0,Reality Club,Is It The Answer?,make you break you movetake love is the answer...,"[make, break, movetake, love, answer, say, ifw..."
1,1301180000.0,2.0,Simple Plan,Jet lag,"whoa, oh, oh whoa, oh, oh so jet-lagged what ...","[whoa, oh, oh, whoa, oh, oh, jet, lag, time, m..."
2,1301180000.0,3.0,The Script,Superheroes,all the life she has seen all the meaner side ...,"[life, see, mean, side, take, away, prophet, d..."
3,1301180000.0,4.0,The Script,Breakeven,i'm still alive but i'm barely breathing just ...,"[still, alive, barely, breathe, prayin, togod,..."
4,1301180000.0,5.0,Green Day,21 Guns,"do you know what's worth fighting for, when it...","[know, worth, fight, worth, die, take, breath,..."


In [13]:
# Appends all words to a list in order to find the unique words
allWords = []
for row in stemmed_words:
    for word in row:
        allWords.append(str(word))
            
uniqueWords = np.unique(allWords)

print('Number of unique words:', len(uniqueWords), '\n')
print('Previewing sample of unique words:\n', uniqueWords[1234:1244])

Number of unique words: 6417 

Previewing sample of unique words:
 ['couragehad' 'course' 'courtyard' 'cousin' 'cover' 'covergirls' 'crack'
 'crash' 'crasher' 'crater']


In [14]:
stemmed_sentences = []

# Joining each song's tokens back into one space-separated string
for row in df['tokens']:
    stemmed_string = ''
    for word in row:
        stemmed_string = stemmed_string + ' ' + word
    stemmed_sentences.append(stemmed_string)
    
df['tokens'] = stemmed_sentences

## TF/IDF

In [15]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
# Creating the vectorizer; smooth_idf=False uses idf = ln(n / df) + 1
tfidf = TfidfVectorizer(smooth_idf=False)

# Transforming our 'tokens' column into a TF-IDF matrix and then a data frame
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
tfidf_df = pd.DataFrame(tfidf.fit_transform(df['tokens']).toarray(), 
                        columns=tfidf.get_feature_names_out())

CPU times: user 90.8 ms, sys: 9.1 ms, total: 99.9 ms
Wall time: 111 ms
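
To make the weights in the matrix less opaque: with `smooth_idf=False`, scikit-learn documents the weight of term t in a document as tf(t) × (ln(n / df(t)) + 1), where n is the number of documents and df(t) the number of documents containing t, and each row is then scaled to unit L2 length by default. The sketch below redoes that arithmetic by hand on a toy corpus. The function `tfidf_rows` is hypothetical, and it splits on whitespace rather than using the vectorizer's own token pattern, so it is an illustration and not a drop-in replacement.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with unsmoothed, L2-normalised TF-IDF weights."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / doc_freq[tok]) + 1.0 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard for empty documents
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["love answer love", "love break"])
print(vocab)
print(rows)
```

A term that occurs in every document still gets a non-zero weight here, because of the `+ 1` in the idf term.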


In [16]:
print(tfidf_df.shape)
tfidf_df.head()

(541, 6403)


Unnamed: 0,aaliyah,aback,abandon,abide,able,aboutgirlfriend,abouthouse,abouthundred,aboutlife,absence,absolute,absolutely,abuse,aby,ac,acapulco,accent,accept,accessory,accord,account,accurate,accuse,ache,achestill,achilles,achoo,acquaintance,acre,across,acrossfallen,act,actavis,actfool,actin,action,activity,add,addict,addiction,...,youset,yousick,yousmirk,youstory,youth,youthink,youtime,youwhore,youwill,youwon,ypocrites,yuh,zappa,zaytoven,zenzenzense,zero,ziggy,zimmerman,zipper,zone,zoom,くだけて泣いて咲いて散ったこの思いは,このままじゃまだ終わらせる事は出来ないでしょ,ただ隠せないもの,でも,なにもないように映ってるだけ,何度くたばりそうでも朽ち果てようとも,君を,失うものなどなかった日々の惰性を捨てて,失わぬようにと,悲しみと切なさの艶麗,愁いを含んだ閃光,手を広げればこぼれ落ちそうで,握ったこの手は離さない,握りしめた,狂おしいほど刹那の艶麗,眼光は感覚的衝動,終わりはないさ,譲れないもの,飾ったように見せかけてる
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.085774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076488,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
# Keeping only terms whose summed TF-IDF weight over all songs exceeds 1.25
tfidf_df = tfidf_df[tfidf_df.columns[tfidf_df.sum() > 1.25]]

# Dropping columns whose names contain a digit
tfidf_df = tfidf_df.filter(regex=r'^((?!\d).)*$')

print(tfidf_df.shape)
tfidf_df.head()

(541, 452)


Unnamed: 0,act,afraid,ago,ah,alarm,alive,allneed,allwant,almost,alone,along,already,alright,always,andcan,anddon,andknow,andneed,andwill,angel,another,anymore,anyone,anything,apart,arm,around,ask,asleep,awake,away,ayy,babe,baby,back,bad,beat,beautiful,become,bed,...,uh,understand,use,voice,wait,wake,walk,walkin,wanna,want,warm,waste,watch,water,way,wear,well,white,whoa,whoam,whole,wind,wish,without,wonder,word,work,world,worry,worth,would,write,wrong,ya,yeah,year,yes,yesterday,yet,young
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.062431,0.0,0.0,0.0,0.0,0.066693,0.0,0.0,0.0,0.0,0.148629,0.0,0.0,0.0,0.076374,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.231384,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.096245,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.085919,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094652,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029645,0.202937,0.0,0.0,0.0,0.099947,...,0.0,0.0,0.0,0.0,0.019807,0.101674,0.0,0.0,0.031214,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.049164,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010055,0.0,0.0,0.0,0.019338,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028394,0.0,0.0,0.0,0.0,0.0,0.016035,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008894,0.0,0.016165,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.074402,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076354,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042272,0.0,0.0,0.0,0.0,0.0,0.057793,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097075,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058347,0.0,0.0,0.0,0.0,0.354506,0.0,0.0,0.0,0.0,0.040343,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.054784,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
# Storing the original data frame before the merge in case of changes
df_orig = df.copy()

# Renaming 'tokens' so it cannot clash with a TF-IDF term column of the same name
# (term columns are all lower case)
df.rename(columns={'tokens': 'Tokens'}, inplace=True)

# Merging the data frames by index
df = pd.merge(df, tfidf_df, how='inner', left_index=True, right_index=True)

df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik,Tokens,act,afraid,ago,ah,alarm,alive,allneed,allwant,almost,alone,along,already,alright,always,andcan,anddon,andknow,andneed,andwill,angel,another,anymore,anyone,anything,apart,arm,around,ask,asleep,awake,away,ayy,babe,baby,...,uh,understand,use,voice,wait,wake,walk,walkin,wanna,want,warm,waste,watch,water,way,wear,well,white,whoa,whoam,whole,wind,wish,without,wonder,word,work,world,worry,worth,would,write,wrong,ya,yeah,year,yes,yesterday,yet,young
0,1301180000.0,1.0,Reality Club,Is It The Answer?,make you break you movetake love is the answer...,make break movetake love answer say ifwent aw...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.062431,0.0,0.0,0.0,0.0,0.066693,0.0,0.0,0.0,0.0,0.148629,0.0,0.0,0.0,0.076374,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.096245,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.085919,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1301180000.0,2.0,Simple Plan,Jet lag,"whoa, oh, oh whoa, oh, oh so jet-lagged what ...",whoa oh oh whoa oh oh jet lag time miss anyth...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094652,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.019807,0.101674,0.0,0.0,0.031214,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.049164,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1301180000.0,3.0,The Script,Superheroes,all the life she has seen all the meaner side ...,life see mean side take away prophet dream fo...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010055,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028394,0.0,0.0,0.0,0.0,0.0,0.016035,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008894,0.0,0.016165,0.0,0.0,0.0
3,1301180000.0,4.0,The Script,Breakeven,i'm still alive but i'm barely breathing just ...,still alive barely breathe prayin togod thatd...,0.0,0.0,0.0,0.0,0.0,0.074402,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076354,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042272,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097075,0.0,0.0,0.0,0.0,0.0
4,1301180000.0,5.0,Green Day,21 Guns,"do you know what's worth fighting for, when it...",know worth fight worth die take breath away f...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058347,0.0,0.0,0.0,0.0,0.354506,0.0,0.0,0.0,0.0,0.040343,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.054784,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
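# Summary stats of TF-IDF
# Note: the last value is the spread of the per-column standard deviations,
# not the standard deviation of all matrix entries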
print('Max:', np.max(tfidf_df.max()), '\n',
      'Mean:', np.mean(tfidf_df.mean()), '\n',
      'Standard Deviation:', np.std(tfidf_df.std()))

Max: 0.996272777928468 
 Mean: 0.006922498684016186 
 Standard Deviation: 0.013301895629081475
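
The "Standard Deviation" figure is computed as the standard deviation of the per-column standard deviations, which is generally not the same as the standard deviation over all entries of the matrix. A small sketch on made-up numbers (the two toy columns are an assumption, unrelated to the song data) shows the difference:

```python
import statistics

# Two toy columns standing in for TF-IDF term columns.
columns = {"a": [0.0, 0.0, 3.0], "b": [1.0, 1.0, 1.0]}

# Sample standard deviation per column (pandas' default, ddof=1).
per_column_std = [statistics.stdev(values) for values in columns.values()]
# Population standard deviation of those values (NumPy's default, ddof=0).
std_of_stds = statistics.pstdev(per_column_std)

# Population standard deviation over every entry at once.
all_values = [v for values in columns.values() for v in values]
overall_std = statistics.pstdev(all_values)

print(std_of_stds, overall_std)
```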


In [20]:


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
cosine_similarities = cosine_similarity(tfidf_df)
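
Cosine similarity between two vectors is their dot product divided by the product of their lengths, so it depends on the direction of the vectors and not on their magnitude. A minimal pure-Python sketch of the per-pair computation follows; the function `cosine` is hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)

print(cosine([1.0, 1.0], [1.0, 0.0]))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```

Since TF-IDF weights are never negative, the scores for songs fall between 0 (no shared terms) and 1 (same direction).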

In [22]:
similarities = {}

In [23]:
for i in range(len(cosine_similarities)):
    # Sort row i in ascending order and walk it backwards: the slice [:-50:-1]
    # yields the indexes of the 49 highest scores, best first.
    similar_indices = cosine_similarities[i].argsort()[:-50:-1] 
    # Store (score, title, artist) for those songs, skipping the first entry,
    # which is normally the song itself; this leaves 48 neighbours per song.
    similarities[df['Judul'].iloc[i]] = [(cosine_similarities[i][x], df['Judul'][x], df['Artis'][x]) for x in similar_indices][1:]
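
Dropping the first sorted entry assumes the song itself always ranks first, which can fail when another song has an identical vector or the song's own vector is all zeros. A minimal sketch of a top-k selection that excludes the query by index instead is given below; `top_k_similar` is a hypothetical helper working on one plain row of similarity scores.

```python
def top_k_similar(sim_row, query_index, k):
    """Return up to k (score, index) pairs, best first, never including the query."""
    # Exclude the query explicitly instead of relying on its sort position.
    candidates = [j for j in range(len(sim_row)) if j != query_index]
    candidates.sort(key=lambda j: sim_row[j], reverse=True)
    return [(sim_row[j], j) for j in candidates[:k]]

print(top_k_similar([1.0, 0.2, 0.9, 0.5], query_index=0, k=2))
```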

In [24]:
print(similarities)

{'Is It The Answer?': [(0.3174368882396096, 'Best Part', 'H.E.R, Daniel Caesar'), (0.30262730998719384, 'Numb', 'linkin park '), (0.2673583648773926, 'Stuck On You', 'Lionel Richie'), (0.26557625331650464, 'These Days', 'Rudimental'), (0.2539690877503431, "If You Know That I'm Lonely", 'FUR'), (0.23286468819718598, 'Stay With Me', 'Sam Smith'), (0.22690006282728134, "It's Nothing", 'RADWIMPS'), (0.2177437543738378, 'Seven Years', 'Saosin'), (0.216366241769701, "If You're Too Shy (Let Me Know)", 'The 1975'), (0.2145581202696391, 'Say Something', 'A Great Big World'), (0.20947577168709303, 'Layla', 'Eric Clapton'), (0.206661222647189, "Beggin'", 'Maneskin'), (0.20263386345014597, "Why Don't You", 'Cleo Sol'), (0.20250673067353192, 'Love is gone', 'Slander ft. Dylan Matthew'), (0.2022850554743099, 'James', 'Laufey'), (0.20216502455290014, 'Bad Liar', 'Imagine Dragon'), (0.2018741662215805, 'Flamming Hot Cheetos', 'Clairo'), (0.19827173296401862, 'My Future', 'Billie Eilish'), (0.197496261

In [25]:
class ContentBasedRecommender:
    def __init__(self, matrix):
        self.matrix_similar = matrix

    def _print_message(self, song, recom_song):
        rec_items = len(recom_song)
        
        print(f'The {rec_items} recommended songs for {song} are:')
        for i in range(rec_items):
            print(f"Number {i+1}:")
            print(f"{recom_song[i][1]} by {recom_song[i][2]} with {round(recom_song[i][0], 3)} similarity score") 
            print("--------------------")
        
    def recommend(self, recommendation):
        # Get song to find recommendations for
        song = recommendation['song']
        # Get number of songs to recommend
        number_songs = recommendation['number_songs']
        # Take the top number_songs entries from the precomputed similarity lists
        recom_song = self.matrix_similar[song][:number_songs]
        # print each item
        self._print_message(song=song, recom_song=recom_song)

In [26]:
recommedations = ContentBasedRecommender(similarities)

In [27]:
recommendation = {
    "song": 'Is It The Answer?',
    "number_songs": 10 
}

In [28]:
recommedations.recommend(recommendation)

The 10 recommended songs for Is It The Answer? are:
Number 1:
Best Part by H.E.R, Daniel Caesar with 0.317 similarity score
--------------------
Number 2:
Numb by linkin park  with 0.303 similarity score
--------------------
Number 3:
Stuck On You by Lionel Richie with 0.267 similarity score
--------------------
Number 4:
These Days by Rudimental with 0.266 similarity score
--------------------
Number 5:
If You Know That I'm Lonely by FUR with 0.254 similarity score
--------------------
Number 6:
Stay With Me by Sam Smith with 0.233 similarity score
--------------------
Number 7:
It's Nothing by RADWIMPS with 0.227 similarity score
--------------------
Number 8:
Seven Years by Saosin with 0.218 similarity score
--------------------
Number 9:
If You're Too Shy (Let Me Know) by The 1975 with 0.216 similarity score
--------------------
Number 10:
Say Something by A Great Big World with 0.215 similarity score
--------------------


## TASK: Implement the recommender system using the Songs dataset (created by all class members)

1. Input: song title (_st = "Is It The Answer?"), number of recommended songs (_nt = 10)
2. Process: calculate cosine similarity over the TF-IDF columns
3. Output: the _nt songs closest to _st according to cosine similarity
4. A PowerPoint presentation explaining the TF/IDF with cosine similarity assignment

The song dataset can be filled in and viewed at: https://docs.google.com/spreadsheets/d/1vjszULKCcS4LPup3VJ9MofYPiYhcaoXTC4zdohLFwpQ/edit?usp=sharing


