### Requirements

For text processing
- pip install gensim
- pip install pyldavis

For data cleaning - removing any language other than English
- pip install pycld2

To test for encoding in dataset
- pip install chardet

For PV-DM (NLTK tokenizer data)
- start Python from the command line and run the lines below:
- import nltk
- nltk.download('punkt')

In [1]:
# import required packages/libraries
import pandas as pd

### Data Review

In [2]:
# check encoding type of data
import chardet
file = 'final_lyrics.csv'
with open(file, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
result

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
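
chardet only inspected the first 100,000 bytes, so its guess can be cross-checked by strictly decoding the rest of the file. A minimal sketch using only the standard library; the helper names `chunks_decode_cleanly` and `file_decodes_cleanly` and the chunk size are assumptions, not part of the original notebook:

```python
import codecs


def chunks_decode_cleanly(chunks, encoding="utf-8"):
    """Return True if the byte chunks, taken in order, decode without errors."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        for chunk in chunks:
            # The incremental decoder buffers characters split across chunks.
            decoder.decode(chunk)
        # final=True raises if the data ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


def file_decodes_cleanly(path, encoding="utf-8", chunk_size=1 << 20):
    """Stream a file in binary chunks and check that it decodes strictly."""
    with open(path, "rb") as fh:
        return chunks_decode_cleanly(iter(lambda: fh.read(chunk_size), b""), encoding)
```

For this dataset the check would be `file_decodes_cleanly('final_lyrics.csv', 'utf-8')`; a `False` result would suggest the first 100,000 bytes were not representative of the whole file.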

In [3]:
# open/read data
lyrics_data = pd.read_csv('final_lyrics.csv', encoding='utf-8')
# drop first col
lyrics_df = lyrics_data.iloc[: , 1:]

In [4]:
print(lyrics_df.head())

  Genre    Artist                  Title  \
0   Pop  dua lipa              new rules   
1   Pop  dua lipa        don’t start now   
2   Pop  dua lipa                  idgaf   
3   Pop  dua lipa  blow your mind (mwah)   
4   Pop  dua lipa             be the one   

                                              Lyrics  
0  one one one one one   talkin' in my sleep at n...  
1  if you don't wanna see me   did a full 80 craz...  
2  you call me all friendly tellin' me how much y...  
3  i know it's hot i know we've got something tha...  
4  i see the moon i see the moon i see the moon o...  


In [5]:
print(len(lyrics_df))

243406


In [6]:
# lyric check
lyrics_df['Lyrics'][8]

"common love isn't for us we created something phenomenal don't you agree don't you agree you got me feeling\u2005diamond\u2005rich nothing on this\u2005planet compares to it don't you agree don't\u2005you agree  pre who needs to go to sleep when i got you next to me   all night i'll riot with you i know you got my back and you know i got you so come on come on come on come on come on come on let's get physical lights out follow the noise baby keep on dancing like you ain't got a choice so come on come on come on come on come on let's get physical   adrenaline keeps on rushing in love the simulation we're dreaming in don't you agree don't you agree i don't wanna live another life 'cause this one's pretty nice living it up  pre who needs to go to sleep when i got you next to me   all night i'll riot with you i know you got my back and you know i got you so come on come on come on come on come on come on let's get physical lights out follow the noise baby keep on dancing like you ain't g

### Data Cleaning

In [7]:
import pycld2 as cld2

In [8]:
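# Strip the bytes-literal wrappers (b'...' / b"..."), punctuation and section labels from each lyric.
# The original chain also called .replace('[\[],:*!?]', ''); str.replace matches literal substrings,
# not regex character classes, so that call never matched and is omitted here.
lyrics_df['Lyrics'] = [
    str(lyrics.encode('ascii', 'replace'))
    .replace('b"', '').replace('?', ' ').replace('"', '').replace('\\n', ' ')
    .replace("b'", '').replace('instrumental', '').replace('(', '').replace(')', '')
    .replace('.', '').replace(',', '').replace('\\', '').replace('verse', '')
    .replace('!', '').replace('chorus', '').replace('*', '').replace('\n', ' ')
    for lyrics in lyrics_df['Lyrics'].str.decode('unicode_escape')
]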
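
The cleaning step relies on chained str.replace calls, which match exact substrings only; a regex-looking pattern such as '[\[],:*!?]' passed to str.replace never behaves as a character class. A minimal sketch of how the same intent could be expressed with the `re` module; the function name `clean_lyrics` and the exact sets of characters and section labels are assumptions, and the result is not guaranteed to match the notebook's output character for character:

```python
import re

# Characters to delete (assumed from the notebook's replace chain).
_PUNCT = re.compile(r"[\[\](),.:*!?\"\\]")
# Section labels to drop, but only as whole words (an assumption).
_LABELS = re.compile(r"\b(?:instrumental|verse|chorus)\b", flags=re.IGNORECASE)
_SPACES = re.compile(r"\s+")


def clean_lyrics(text):
    """Remove punctuation and section labels, then collapse runs of whitespace."""
    text = _PUNCT.sub(" ", text)
    text = _LABELS.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


print(clean_lyrics("[Chorus] come on, come on!\nlet's get   physical (yeah)"))
# come on come on let's get physical yeah
```

Matching labels on word boundaries keeps words such as "universe" intact, whereas a plain replace of 'verse' would shorten them to "uni". Collapsing whitespace also removes the runs of spaces visible in the lyric checks.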

In [9]:
# lyric check
lyrics_df['Lyrics'][8]

"common love isn't for us we created something phenomenal don't you agree don't you agree you got me feeling   diamond   rich nothing on this   planet compares to it don't you agree don't   you agree  pre who needs to go to sleep when i got you next to me   all night i'll riot with you i know you got my back and you know i got you so come on come on come on come on come on come on let's get physical lights out follow the noise baby keep on dancing like you ain't got a choice so come on come on come on come on come on let's get physical   adrenaline keeps on rushing in love the simulation we're dreaming in don't you agree don't you agree i don't wanna live another life 'cause this one's pretty nice living it up  pre who needs to go to sleep when i got you next to me   all night i'll riot with you i know you got my back and you know i got you so come on come on come on come on come on come on let's get physical lights out follow the noise baby keep on dancing like you ain't got a choice 

In [10]:
# lyric check
lyrics_df['Lyrics'][42]

"hwasa                                                                 don't you agree don't you agree                                                    just   wasting time don't you   agree don't you agree bae  pre hwasa who needs to   go to sleep when i got you next to me   hwasa dua lipa                                                                       so   come on come   on come on come on come on come on let's get physical                                                                       so come on come on come on come on come on let's get physical   dua lipa adrenaline keeps on rushing in love the simulation we're dreaming in don't you agree don't you agree i don't wanna live another life 'cause this one's pretty nice living it up  pre dua lipa who needs to go to sleep when i got you next to me   dua lipa all night i'll riot with you i know you got my back and you know i got you so come on come on come on come on come on come on let's get physical lights out follow the n

In [11]:
# lyric check
lyrics_df['Lyrics'][23]

"dababy billboard baby dua lipa make 'em dance when it come on everybody lookin' for a dancefloor to run on   dua lipa if you wanna run away with me i know a galaxy and i can take you for a ride i had a premonition that we fell into a rhythm where the music don't stop for life glitter in the sky glitter in my eyes shining just the way i like if you're feeling like you need a little bit of company you met me at the perfect time  pre dua lipa you want me i want you baby my sugarboo i'm levitating the milky way we're renegading yeah yeah y  ah yeah yeah   dua lipa i got you moonlight you're my starlight i need you all night com   on dance with me i'm levitating you moonlight you're my starlight you're the moonlight i need you all night come on dance with me i'm levitating   dababy i'm one of the greatest ain't no debatin' on it let's go i'm still levitated i'm heavily medicated ironic i gave 'em love and they end up hatin' on me go she told me she love me and she been waitin' been fightin

Filter for only English Lyrics

In [12]:
# keep only songs for which cld2 reports a single language span, and that span is English
en_lyrics = []
for i in range(len(lyrics_df)):
    _, _, _, detected_language = cld2.detect(lyrics_df['Lyrics'][i], returnVectors=True)
    if len(detected_language) == 1 and detected_language[0][2] == 'ENGLISH':
        en_lyrics.append(i)

lyrics_df = lyrics_df.iloc[en_lyrics, :]

len(lyrics_df)
# data cut from 243406 to 211874

211874
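
The filter keeps a song only when the detector reports exactly one language span and that span is English. A minimal sketch of the same rule as a reusable function; the names `english_only_positions`, `detect_spans` and `fake_detect` are hypothetical, and the detector is passed in as an argument so the logic can be tried without pycld2 installed. The tuple layout (offset, length, language name, language code) is assumed from the notebook's use of index 2:

```python
def english_only_positions(texts, detect_spans, target="ENGLISH"):
    """Return positions of texts that consist of exactly one span in `target`.

    detect_spans(text) must return a sequence of
    (offset, length, language_name, language_code) tuples, one per span.
    """
    keep = []
    for pos, text in enumerate(texts):
        spans = detect_spans(text)
        if len(spans) == 1 and spans[0][2] == target:
            keep.append(pos)
    return keep


# Stand-in detector for illustration only; real detection needs pycld2.
def fake_detect(text):
    if "hola" in text and "hello" in text:
        return ((0, 5, "ENGLISH", "en"), (5, 4, "SPANISH", "es"))
    if "hola" in text:
        return ((0, len(text), "SPANISH", "es"),)
    return ((0, len(text), "ENGLISH", "en"),)


songs = ["hello again", "hola amigo", "hello hola"]
print(english_only_positions(songs, fake_detect))  # [0]
```

With pycld2, `detect_spans` could be `lambda text: cld2.detect(text, returnVectors=True)[3]`, mirroring the call used in the notebook, and the returned positions can be passed straight to `lyrics_df.iloc`.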

In [13]:
# reset df index
lyrics_df.reset_index(inplace=True)
lyrics_df

Unnamed: 0,index,Genre,Artist,Title,Lyrics
0,0,Pop,dua lipa,new rules,one one one one one talkin' in my sleep at n...
1,1,Pop,dua lipa,don’t start now,if you don't wanna see me did a full 80 craz...
2,2,Pop,dua lipa,idgaf,you call me all friendly tellin' me how much y...
3,3,Pop,dua lipa,blow your mind (mwah),i know it's hot i know we've got something tha...
4,4,Pop,dua lipa,be the one,i see the moon i see the moon i see the moon o...
...,...,...,...,...,...
211869,243401,Country,edens edge,who am i drinking tonight,I gotta say Boy after only just a couple of da...
211870,243402,Country,edens edge,liar,I helped you find her diamond ring You made me...
211871,243403,Country,edens edge,last supper,Look at the couple in the corner booth Looks a...
211872,243404,Country,edens edge,christ alone live in studio,When I fly off this mortal earth And I'm measu...


## Part 2.1: PV-DM

No further preprocessing is needed here because PV-DM relies on word order, so the words should be kept together.

LDA cannot be used here either, as it relies on bag-of-words vectorization (like TF-IDF) for topic modelling, which discards word order.
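
The bag-of-words limitation can be shown directly: two lines with the same words in a different order produce identical counts, so a count-based model cannot tell them apart, while ordered context windows differ. A minimal sketch using only the standard library; the example lines and the helper `context_windows` are invented for illustration, and the windows follow the spirit of PV-DM rather than gensim's exact windowing:

```python
from collections import Counter

line_a = "you know i got you".split()
line_b = "i know you got you".split()

# Bag-of-words: only the counts survive, so the two lines look identical.
print(Counter(line_a) == Counter(line_b))  # True
# As ordered token lists they are different.
print(line_a == line_b)  # False


def context_windows(tokens, size=2):
    """Pair each word with the `size` words that precede it, in order."""
    return [(tuple(tokens[i - size:i]), tokens[i]) for i in range(size, len(tokens))]


print(context_windows(line_a))
# [(('you', 'know'), 'i'), (('know', 'i'), 'got'), (('i', 'got'), 'you')]
```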

In [14]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [54]:
%%time

# create corpus
# use the cleaned, complete lyrics instead of pre-processed text
corpus = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(lyrics_df['Lyrics'])]
# build model
# 100 epochs is impractically long
# a higher vec_size can improve performance but takes longer and uses more memory
# max_epochs = 100
vec_size = 100
alpha = 0.025
# dm=1 selects PV-DM
model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.0025,
                min_count=1,
                dm=1,
                epochs=10)

model.build_vocab(corpus)

# train model
model.train(corpus,
            total_examples=model.corpus_count,
            epochs=model.epochs)

# # Train model with hyperparameter tuning --> takes a long time as well
# for epoch in range(max_epochs):
#     print('iteration {0}'.format(epoch))
#     model.train(corpus,
#                 total_examples=model.corpus_count,
#                 epochs=model.epochs)
#     # decrease the learning rate
#     model.alpha -= 0.0002
#     # fix the learning rate, no decay
#     model.min_alpha = model.alpha
    
#save model
model.save("pvdm_model")

CPU times: user 22min 37s, sys: 4min 41s, total: 27min 18s
Wall time: 21min 48s
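
The alpha and min_alpha arguments define a learning rate that, per the gensim documentation, drops linearly from alpha to min_alpha as training progresses. A rough sketch of that schedule at epoch granularity; gensim adjusts the rate more finely during training, so the function `linear_alpha` and its per-epoch values are an approximation for intuition only:

```python
def linear_alpha(epoch, epochs, alpha=0.025, min_alpha=0.0025):
    """Approximate learning rate at the start of a 0-based epoch."""
    progress = epoch / epochs
    return alpha - (alpha - min_alpha) * progress


# Defaults mirror the values used in the notebook (alpha=0.025, min_alpha=0.0025, 10 epochs).
for epoch in range(10):
    print(epoch, round(linear_alpha(epoch, 10), 5))
```

Because a single train call with epochs set already applies this decay, lowering model.alpha by hand in a loop should not be necessary.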


In [55]:
model= Doc2Vec.load("pvdm_model")

## Part 2.2: Similarity Calculation - Cosine Distance 

In [56]:
# open text file of lyric
input_file = 'input_lyrics1.txt'
with open(input_file, 'r') as f:
    input_lyrics = f.read()
# clean input lyrics with the same substitutions used for the training data
# (the literal '[\[],:*!?]' replace is dropped: str.replace does not interpret regex, so it never matched)
input_lyrics = (input_lyrics
                .replace('b"', '').replace('?', ' ').replace('"', '').replace('\\n', ' ')
                .replace("b'", '').replace('instrumental', '').replace('(', '').replace(')', '')
                .replace('.', '').replace(',', '').replace('\\', '').replace('verse', '')
                .replace('!', '').replace('chorus', '').replace('*', '').replace('\n', ' '))

In [57]:
input_lyrics

"You must think that I'm stupid You must think that I'm a fool You must think that I'm new to this But I have seen this all before I'm never gonna let you close to me Even though you mean the most to me 'Cause every time I open up it hurts So I'm never gonna get too close to you Even when I mean the most to you In case you go and leave me in the dirt But every time you hurt me the less that I cry And every time you leave me the quicker these tears dry And every time you walk out the less I love you Baby we don't stand a chance it's sad but it's true I'm way too good at goodbyes I'm way too good at goodbyes I'm way too good at goodbyes I'm way too good at goodbyes I know you're thinkin' I'm heartless I know you're thinkin' I'm cold I'm just protectin' my innocence I'm just protectin' my soul I'm never gonna let you close to me Even though you mean the most to me 'Cause every time I open up it hurts So I'm never gonna get too close to you Even when I mean the most to you In case you go a

In [58]:
#tokenization - get doc vector
input_data = word_tokenize(input_lyrics.lower())
lyric_vector = model.infer_vector(input_data)

In [59]:
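# find the tags (row positions) of the most similar songs
# gensim 4 renamed model.docvecs to model.dv; the old name still works but emits a DeprecationWarning
similar_doc = model.dv.most_similar([lyric_vector])
print(similar_doc)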


[('41817', 0.5265682339668274), ('42029', 0.5142006874084473), ('41959', 0.5125315189361572), ('54287', 0.4998117685317993), ('177618', 0.4707123637199402), ('58975', 0.4634052515029907), ('116079', 0.4576190114021301), ('4252', 0.45436087250709534), ('111321', 0.45301464200019836), ('115787', 0.45294860005378723)]
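
The scores returned by most_similar are cosine similarities (higher means closer), not distances. A minimal sketch of the computation in plain Python; the three-dimensional vectors and their tags are made up for illustration and stand in for the 100-dimensional document vectors:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n(query, tagged_vectors, n=2):
    """Rank (tag, vector) pairs by cosine similarity to the query, best first."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in tagged_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy document vectors keyed by tag (hypothetical values).
doc_vectors = {"0": [1.0, 0.0, 0.0], "1": [0.6, 0.8, 0.0], "2": [-1.0, 0.0, 0.0]}
print(top_n([1.0, 0.0, 0.0], doc_vectors))
```

Cosine distance is 1 minus this value, so the top hit's similarity of about 0.53 corresponds to a distance of about 0.47.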


In [60]:
# print the lyrics of the most similar song; each tag is the row position of that song in lyrics_df
print(lyrics_df['Lyrics'][int(similar_doc[0][0])])

I don't want my heart to be broken 'Cause it's the only one I've got So darling please be careful You know I care a lot Darling please don't break my heart I beg of you I don't want no tears a-falling You know I hate to cry But that's what's bound to happen If you only say goodbye Darling please don't say goodbye I beg of you Hold my hand and promise That you'll always love me true Make me know you'll love me The same way I love you little girl You got me at your mercy Now that I'm in love with you So please don't take advantage 'Cause you know my love is true Darling please please love me too I beg of you Hold my hand and promise That you'll always love me true Make me know you'll love me The same way I love you little girl You got me at your mercy Now that I'm in love with you So please don't take advantage 'Cause you know my love is true Darling please please love me too I beg of you


In [61]:
song_index = int(similar_doc[0][0])
song_index

41817

In [62]:
# OUTPUT recommended song
rec_song = {'Artist': lyrics_df['Artist'][song_index],
                'Title': lyrics_df['Title'][song_index],
                'Genre': lyrics_df['Genre'][song_index],
                'Lyrics': lyrics_df['Lyrics'][song_index]}

rec_song
# output_filename = 'output2.json'
# with open(output_filename, 'w') as fout:
#     json.dump(rec_song, fout)

{'Artist': 'elvis presley',
 'Title': 'i beg of you',
 'Genre': 'Rock',
 'Lyrics': "I don't want my heart to be broken 'Cause it's the only one I've got So darling please be careful You know I care a lot Darling please don't break my heart I beg of you I don't want no tears a-falling You know I hate to cry But that's what's bound to happen If you only say goodbye Darling please don't say goodbye I beg of you Hold my hand and promise That you'll always love me true Make me know you'll love me The same way I love you little girl You got me at your mercy Now that I'm in love with you So please don't take advantage 'Cause you know my love is true Darling please please love me too I beg of you Hold my hand and promise That you'll always love me true Make me know you'll love me The same way I love you little girl You got me at your mercy Now that I'm in love with you So please don't take advantage 'Cause you know my love is true Darling please please love me too I beg of you"}
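
The commented-out lines in the last cell write the recommendation to a file, but they would need json to be imported first. A minimal sketch of that step; the helper name `save_recommendation` and the shortened example record are assumptions, and a temporary directory is used so the example leaves no files behind:

```python
import json
import tempfile
from pathlib import Path


def save_recommendation(rec, path):
    """Write one recommendation dict to a UTF-8 JSON file and return its path."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fout:
        # ensure_ascii=False keeps characters such as curly apostrophes readable.
        json.dump(rec, fout, ensure_ascii=False, indent=2)
    return path


# Shortened stand-in for rec_song (lyrics truncated for the example).
example = {"Artist": "elvis presley", "Title": "i beg of you",
           "Genre": "Rock", "Lyrics": "I don't want my heart to be broken"}

with tempfile.TemporaryDirectory() as tmp:
    out_path = save_recommendation(example, Path(tmp) / "output2.json")
    restored = json.loads(out_path.read_text(encoding="utf-8"))

print(restored["Title"])  # i beg of you
```

In the notebook itself the call would be `save_recommendation(rec_song, 'output2.json')`.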