###Rahadian Perwita Putra

##TASK: Please Implement the Recommender Systems Using The Songs dataset (created by all class members)

1. Input: Song title (_st = "Is It The Answer?"), number of recomended songs (_nt = 10 )
2. Process: calculate cosine similarity over TFIDF columns
3. Returning: _nt songs which close to _st based on cosine similarity
4. Powerpoint yang menjelaskan tugas TF/IDF dengan Cosine similarity

Dataset Lagu dapat diisi dan dilihat pada: https://docs.google.com/spreadsheets/d/1vjszULKCcS4LPup3VJ9MofYPiYhcaoXTC4zdohLFwpQ/edit?usp=sharing




In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('https://docs.google.com/spreadsheets/d/' +'1vjszULKCcS4LPup3VJ9MofYPiYhcaoXTC4zdohLFwpQ' +'/export?gid=0&format=csv',)

df = df[['NIM','Submisike','Artis','Judul','Lirik']]
df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik
0,1301180000.0,1.0,Reality Club,Is It The Answer?,I make you break\nYou move I take\nLove is the...
1,1301180000.0,2.0,Simple Plan,Jet lag,"Whoa, oh, oh\nWhoa, oh, oh\nSo jet-lagged\n\nW..."
2,1301180000.0,3.0,The Script,Superheroes,All the life she has seen\nAll the meaner side...
3,1301180000.0,4.0,The Script,Breakeven,I'm still alive but I'm barely breathing\nJust...
4,1301180000.0,5.0,Green Day,21 Guns,"Do you know what's worth fighting for,\nWhen i..."


###Pre-processing

In [2]:
import re

#changing all of data type into string
df.Lirik=df.Lirik.astype(str)

# Converting all words to lower case and removing punctuation
df['Lirik'] = [re.sub(r'\d+\S*', '',
                  row.lower().replace('.', ' ').replace('_', '').replace('/', ''))
                  for row in df['Lirik']]

df['Lirik'] = [re.sub(r'(?:^| )\w(?:$| )', '', row)
                  for row in df['Lirik']]

# Removing numbers
df['Lirik'] = [re.sub(r'\d+', '', row) for row in df['Lirik']]

In [3]:
df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik
0,1301180000.0,1.0,Reality Club,Is It The Answer?,make you break\nyou movetake\nlove is the answ...
1,1301180000.0,2.0,Simple Plan,Jet lag,"whoa, oh, oh\nwhoa, oh, oh\nso jet-lagged\n\nw..."
2,1301180000.0,3.0,The Script,Superheroes,all the life she has seen\nall the meaner side...
3,1301180000.0,4.0,The Script,Breakeven,i'm still alive but i'm barely breathing\njust...
4,1301180000.0,5.0,Green Day,21 Guns,"do you know what's worth fighting for,\nwhen i..."


In [4]:
%%time
import nltk
nltk.download("stopwords")

# Tokenizing comments and putting them into a new column
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')  # by blank space
df['tokens'] = df['Lirik'].apply(tokenizer.tokenize)

CPU times: user 525 ms, sys: 58.4 ms, total: 583 ms
Wall time: 596 ms


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik,tokens
0,1301180000.0,1.0,Reality Club,Is It The Answer?,make you break\nyou movetake\nlove is the answ...,"[make, you, break, you, movetake, love, is, th..."
1,1301180000.0,2.0,Simple Plan,Jet lag,"whoa, oh, oh\nwhoa, oh, oh\nso jet-lagged\n\nw...","[whoa, oh, oh, whoa, oh, oh, so, jet, lagged, ..."
2,1301180000.0,3.0,The Script,Superheroes,all the life she has seen\nall the meaner side...,"[all, the, life, she, has, seen, all, the, mea..."
3,1301180000.0,4.0,The Script,Breakeven,i'm still alive but i'm barely breathing\njust...,"[i, m, still, alive, but, i, m, barely, breath..."
4,1301180000.0,5.0,Green Day,21 Guns,"do you know what's worth fighting for,\nwhen i...","[do, you, know, what, s, worth, fighting, for,..."


In [6]:
%%time
# Removing Stopwords & Punctuation
from nltk.corpus import stopwords
#stopwords.words('english')

filtered_words = []
for row in df['tokens']:
    filtered_words.append([
        word.lower() for word in row
        if word.lower() not in nltk.corpus.stopwords.words('english')
    ])

df['tokens'] = filtered_words

CPU times: user 22.1 s, sys: 2.66 s, total: 24.7 s
Wall time: 31.1 s


In [7]:
df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik,tokens
0,1301180000.0,1.0,Reality Club,Is It The Answer?,make you break\nyou movetake\nlove is the answ...,"[make, break, movetake, love, answer, say, ifw..."
1,1301180000.0,2.0,Simple Plan,Jet lag,"whoa, oh, oh\nwhoa, oh, oh\nso jet-lagged\n\nw...","[whoa, oh, oh, whoa, oh, oh, jet, lagged, time..."
2,1301180000.0,3.0,The Script,Superheroes,all the life she has seen\nall the meaner side...,"[life, seen, meaner, side, took, away, prophet..."
3,1301180000.0,4.0,The Script,Breakeven,i'm still alive but i'm barely breathing\njust...,"[still, alive, barely, breathing, prayin, togo..."
4,1301180000.0,5.0,Green Day,21 Guns,"do you know what's worth fighting for,\nwhen i...","[know, worth, fighting, worth, dying, take, br..."


In [8]:
%%time
# Setting the Lemmatization object
nltk.download("omw-1.4")
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()

# Looping through the words and appending the lemmatized version to a list
stemmed_words = []
for row in df['tokens']:
    stemmed_words.append([
        # Verbs
        lmtzr.lemmatize(  
            # Adjectives
            lmtzr.lemmatize(  
                # Nouns
                lmtzr.lemmatize(word.lower()), 'a'), 'v')
        for word in row
        if word.lower() not in nltk.corpus.stopwords.words('english')])

# Adding the list as a column in the data frame
df['tokens'] = stemmed_words

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


CPU times: user 10.1 s, sys: 1.02 s, total: 11.1 s
Wall time: 11.1 s


In [9]:
df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik,tokens
0,1301180000.0,1.0,Reality Club,Is It The Answer?,make you break\nyou movetake\nlove is the answ...,"[make, break, movetake, love, answer, say, ifw..."
1,1301180000.0,2.0,Simple Plan,Jet lag,"whoa, oh, oh\nwhoa, oh, oh\nso jet-lagged\n\nw...","[whoa, oh, oh, whoa, oh, oh, jet, lag, time, m..."
2,1301180000.0,3.0,The Script,Superheroes,all the life she has seen\nall the meaner side...,"[life, see, mean, side, take, away, prophet, d..."
3,1301180000.0,4.0,The Script,Breakeven,i'm still alive but i'm barely breathing\njust...,"[still, alive, barely, breathe, prayin, togod,..."
4,1301180000.0,5.0,Green Day,21 Guns,"do you know what's worth fighting for,\nwhen i...","[know, worth, fight, worth, die, take, breath,..."


In [10]:
# Appends all words to a list in order to find the unique words
allWords = []
for row in stemmed_words:
    for word in row:
        allWords.append(str(word))
            
uniqueWords = np.unique(allWords)

print('Number of unique words:', len(uniqueWords), '\n')
print('Previewing sample of unique words:\n', uniqueWords[1234:1244])

Number of unique words: 6418 

Previewing sample of unique words:
 ['couragehad' 'course' 'courtyard' 'cousin' 'cover' 'covergirls' 'crack'
 'crash' 'crasher' 'crater']


In [11]:
stemmed_sentences = []

# Spacing out the words for each songs
for row in df['tokens']:
    stemmed_string = ''
    for word in row:
        stemmed_string = stemmed_string + ' ' + word
    stemmed_sentences.append(stemmed_string)
    
df['tokens'] = stemmed_sentences

###TF/IDF

In [12]:
%%time
import sklearn
# Creating the sklearn object
tfidf = sklearn.feature_extraction.text.TfidfVectorizer(smooth_idf=False)

# Transforming our 'tokens' column into a TF-IDF matrix and then a data frame
tfidf_df = pd.DataFrame(tfidf.fit_transform(df['tokens']).toarray(), 
                        columns=tfidf.get_feature_names())

CPU times: user 72 ms, sys: 13 ms, total: 85 ms
Wall time: 84.5 ms




In [13]:
print(tfidf_df.shape)
tfidf_df.head()

(541, 6404)


Unnamed: 0,aaliyah,aback,abandon,abide,able,aboutgirlfriend,abouthouse,abouthundred,aboutlife,absence,...,悲しみと切なさの艶麗,愁いを含んだ閃光,手を広げればこぼれ落ちそうで,握ったこの手は離さない,握りしめた,狂おしいほど刹那の艶麗,眼光は感覚的衝動,終わりはないさ,譲れないもの,飾ったように見せかけてる
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# Removing sparse columns
tfidf_df = tfidf_df[tfidf_df.columns[tfidf_df.sum() > 1.25]]

# Removing any remaining digits
tfidf_df = tfidf_df.filter(regex=r'^((?!\d).)*$')

print(tfidf_df.shape)
tfidf_df.head()

(541, 452)


Unnamed: 0,act,afraid,ago,ah,alarm,alive,allneed,allwant,almost,alone,...,would,write,wrong,ya,yeah,year,yes,yesterday,yet,young
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.085876,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094601,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.008916,0.0,0.016161,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.074344,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.097258,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
# Storing the original data frame before the merge in case of changes
df_orig = df.copy()

# Renaming columns that conflict with column names in tfidfCore
df.rename(columns={'nim': 'NIM', 
                   'submisike': 'Submisike', 
                   'artis': 'Artis', 
                   'judul': 'Judul', 
                   'lirik': 'Lirik', 
                   'tokens': 'Tokens'}, inplace=True)

# Merging the data frames by index
df = pd.merge(df, tfidf_df, how='inner', left_index=True, right_index=True)

df.head()

Unnamed: 0,NIM,Submisike,Artis,Judul,Lirik,Tokens,act,afraid,ago,ah,...,would,write,wrong,ya,yeah,year,yes,yesterday,yet,young
0,1301180000.0,1.0,Reality Club,Is It The Answer?,make you break\nyou movetake\nlove is the answ...,make break movetake love answer say ifwent aw...,0.0,0.0,0.0,0.0,...,0.085876,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1301180000.0,2.0,Simple Plan,Jet lag,"whoa, oh, oh\nwhoa, oh, oh\nso jet-lagged\n\nw...",whoa oh oh whoa oh oh jet lag time miss anyth...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1301180000.0,3.0,The Script,Superheroes,all the life she has seen\nall the meaner side...,life see mean side take away prophet dream fo...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.008916,0.0,0.016161,0.0,0.0,0.0
3,1301180000.0,4.0,The Script,Breakeven,i'm still alive but i'm barely breathing\njust...,still alive barely breathe prayin togod thatd...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.097258,0.0,0.0,0.0,0.0,0.0
4,1301180000.0,5.0,Green Day,21 Guns,"do you know what's worth fighting for,\nwhen i...",know worth fight worth die take breath away f...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
# Summary stats of TF-IDF
print('Max:', np.max(tfidf_df.max()), '\n',
      'Mean:', np.mean(tfidf_df.mean()), '\n',
      'Standard Deviation:', np.std(tfidf_df.std()))

Max: 1.0 
 Mean: 0.006904056425648002 
 Standard Deviation: 0.013339322349206845


###RECOMMENDER

In [17]:
#importing cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

In [18]:
#using tfidf_df as dataframe to find the cosine_similarity
cosim = cosine_similarity(tfidf_df, tfidf_df)

In [19]:
cosim.shape

(541, 541)

In [20]:
cosim[1]

array([0.05109497, 1.        , 0.11246399, 0.09410556, 0.06632893,
       0.11870485, 0.05379863, 0.0373751 , 0.11667586, 0.07682899,
       0.05599072, 0.12162935, 0.02957613, 0.02491246, 0.16059455,
       0.08736736, 0.10146401, 0.09429694, 0.12400276, 0.06014256,
       0.02148501, 0.02211884, 0.11585352, 0.06801826, 0.04214805,
       0.00676456, 0.06789553, 0.39222652, 0.03952226, 0.03547418,
       0.04690533, 0.10080832, 0.08611026, 0.02121159, 0.01306314,
       0.05378159, 0.06174378, 0.11658644, 0.13054953, 0.01384811,
       0.00450131, 0.13361773, 0.17203397, 0.04836828, 0.02506862,
       0.18779658, 0.04815444, 0.02128919, 0.04750636, 0.07202413,
       0.04638058, 0.05154537, 0.01878878, 0.02533435, 0.07969725,
       0.15902   , 0.11013533, 0.02138363, 0.01230829, 0.04165828,
       0.2012808 , 0.05935759, 0.08474772, 0.08359421, 0.06495131,
       0.03581642, 0.02587643, 0.06799713, 0.14626092, 0.04676861,
       0.14855756, 0.04588909, 0.01268687, 0.03673855, 0.11686

In [21]:
#using 'Judul' as index in mapping
mapping = pd.Series(df.index, index=df['Judul'])

In [22]:
#displaying number of songs
mapping[:10]

Judul
Is It The Answer?                  0
Jet lag                            1
Superheroes                        2
Breakeven                          3
21 Guns                            4
Wake me up when September Ends     5
You're Gonna Live Forever in Me    6
Day 1                              7
Like me better                     8
Maps                               9
dtype: int64

In [23]:
def recommended_song(song_title, cosim=cosim):
    # Get the index of the songs that matches the title
    song_index = mapping[song_title]

    # Get the pairwsie similarity scores of all songs 
    similarity_scores = list(enumerate(cosim[song_index]))

    # Sort the songs based on the similarity scores
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar songs
    similarity_scores = similarity_scores[1:11]

    # Get the movie from mapping
    song_mapping = [i[0] for i in similarity_scores]

    # Return the top 10 most similar songs based on Judul
    return df['Judul'].iloc[song_mapping]

In [24]:
#input the title of the song below
recommended_song('Is It The Answer?')

183                          Best Part
40                                Numb
268                       Stuck On You
206                         These Days
30         If You Know That I'm Lonely
449                       Stay With Me
365                       It's Nothing
163                        Seven Years
106    If You're Too Shy (Let Me Know)
446                      Say Something
Name: Judul, dtype: object