<center>
    Predicting Song Similarity using Deep Neural Networks
    <br>
    Part 01: Data Processor
</center>
<p style="text-align:right">
    Sneha Shet
    <br>
    Sivaraman Lakshmipathy
    <br>
    Sudheer Kumar Reddy Beeram
</p>

<b>Data Processor:</b>
<br>
This Jupyter Notebook contains the source code that is used to
<ul>
    <li>download the lyrics data corpus</li>
    <li>filter it to identify the data set to run experiments</li>
    <li>generate custom word embeddings to be used for the experiments</li>
</ul>


<center><b>Data download and initial operations</b></center>

<b>Part 1: Downloading the data corpus</b>

01. The initial step is to download the similarity ground truth, available at http://millionsongdataset.com/lastfm

    Extract the contents and ensure that 'lastfm_similars.db' is available at the root level.

02. The second step is to download the Musixmatch data in order to identify the mapping ids that relate the ground truth to the actual track ids. The same can be found here: http://millionsongdataset.com/musixmatch

     Extract the contents and ensure that the following files are available:<br>
     • Musixmatch\mxm_dataset_train.txt<br>
     • Musixmatch\mxm_dataset_test.txt

03. Executing the following query on <b>lastfm_similars.db</b> returns every track id that appears in the data corpus, either as a source or as a target. Save the result as a CSV file named <b>unique_track_ids.csv</b>.

    SELECT  s.tid,d.tid<br>
    FROM similars_dest d<br>
    LEFT JOIN similars_src s USING(tid)<br>
    UNION ALL<br>
    SELECT  s.tid,d.tid<br>
    FROM similars_src s<br>
    LEFT JOIN similars_dest d USING(tid)<br>
    WHERE d.tid IS NULL;

04. Extract the track ids and download the corresponding lyrics from the Musixmatch database. This operation requires a developer API key. A sample working key is provided, but it expires at the end of 2019.
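
The export described in step 03 can also be scripted. The following is a minimal sketch, assuming that `similars_src` and `similars_dest` each expose a `tid` column (as the query above implies). The function name `export_unique_track_ids` and the single-column CSV layout are illustrative choices and not part of the original workflow.

```python
import csv
import sqlite3

def export_unique_track_ids(db_path, csv_path):
    """Write every distinct tid found in either table to a one-column CSV."""
    con = sqlite3.connect(db_path)
    try:
        # UNION (without ALL) removes duplicates across both tables
        rows = con.execute(
            "SELECT tid FROM similars_src UNION SELECT tid FROM similars_dest"
        ).fetchall()
    finally:
        con.close()
    with open(csv_path, "w", newline="") as out:
        csv.writer(out).writerows(rows)
    return len(rows)
```

Because the first column is always populated in this layout, the CSV-reading cell further down takes `row[0]` for every line.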

In [5]:
#Generate dict for Trackid : Musixmatch id

import json

dictObj = {}

#Raw strings keep the backslash literal instead of treating "\m" as an (invalid) escape sequence
mxm_train_file = r"Musixmatch\mxm_dataset_train.txt"
mxm_test_file = r"Musixmatch\mxm_dataset_test.txt"

def processFile(fileName, dictObj):
    f = open(fileName, 'r')
    for entry in f:
        if not entry.startswith('#') and not entry.startswith('%'):
            splitEntry = entry.split(',')
            trackId = splitEntry[0]
            mxmId = splitEntry[1]
            
            #sanity check
            if not trackId.startswith('TR'):
                print("Error!")
                break
                
            if trackId in dictObj:
                print("Track id already present!")
                print(trackId, ":", mxmId)
            else:
                dictObj[trackId] = mxmId
    f.close()
    return dictObj
        
dictObj = processFile(mxm_train_file, dictObj)
dictObj = processFile(mxm_test_file, dictObj)

writeToFile = open("track_mxm_map.txt", 'w')
writeToFile.write(json.dumps(dictObj))
writeToFile.close()

In [6]:
#Get trackids extracted from CSV

import csv

trackIdList = []

with open('unique_track_ids.csv') as trackidFile:
    trackFile = csv.reader(trackidFile, delimiter=',')
    for row in trackFile:
        if row[0] != '':
            trackIdList.append(row[0])
        else:
            trackIdList.append(row[1])

In [7]:
#Get mxmids

trackIds = []
mxmIdList = []

f = open("mxm_download.txt", "w")
f.write("#mxmId, #trackid per line\n")
f_mxm_unavailable = open("mxm_unavailable.txt", "w")

for entry in trackIdList:
    if entry in dictObj:
        tId = entry
        mxmId = dictObj[entry]
        fileStr = mxmId + "," + tId + "\n"
        f.write(fileStr)
    else:
        f_mxm_unavailable.write(entry + "\n")
        
f.close()
f_mxm_unavailable.close()

In [8]:
#API call to musixmatch

import requests
import time
import os
import traceback
import json

def checkIfLyricsAvailable(trackId):
    fullJsonPath = "Lyrics_dump\\Full_json\\"
    lyricsPath = "Lyrics_dump\\Lyrics\\"
    if os.path.isfile(fullJsonPath + trackId + ".json") and os.path.isfile(lyricsPath + trackId + ".txt"):
        return True
    return False

def getLyrics(api_key, mxmId, trackId):
    
    #Return status key
    #200 - Lyrics available / downloaded successfully
    #402 - Change API key
    #404 - Lyrics not found or empty (recorded in lyrics_unavailable_list.txt)
    #400 - Error
    
    if checkIfLyricsAvailable(trackId):
        return 200
    
    copyright_text = "******* This Lyrics is NOT for Commercial use *******" #to remove text from lyrics
    
    url = "https://api.musixmatch.com/ws/1.1/track.lyrics.get"
    url_params = {
        "apikey": api_key,
        "track_id": mxmId
    }
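    #Throttle requests: pausing before the call ensures the delay applies on every code path
    time.sleep(0.200)
    result = requests.get(url, params=url_params)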
    res_json = result.json()
    status_code = res_json["message"]["header"]["status_code"]
    
    if status_code == 200: #successful request
        try:
            lyrics = res_json["message"]["body"]["lyrics"]["lyrics_body"]
            if lyrics == "":
                unavailFile = open("lyrics_unavailable_list.txt", "a")
                unavailFile.write(str(mxmId) + "," + trackId + "\n")
                unavailFile.close()
                return 404

            if copyright_text in lyrics:
                lyrics = lyrics[:lyrics.index(copyright_text)].strip()

            f1 = open("Lyrics_dump\\Full_json\\" + trackId + ".json", "w", encoding="utf-8")
            f1.write(json.dumps(res_json["message"]["body"]))
            f1.close()

            f2 = open("Lyrics_dump\\Lyrics\\" + trackId + ".txt", "w", encoding="utf-8")
            f2.write(lyrics)
            f2.close()
            return 200
        except Exception as e:
            traceback.print_exc()
            return 400
    elif status_code == 401 or status_code == 402: #API hit limit reached
        print("Time to swap api_key")
        return 402
    elif status_code == 404: #Lyrics not found
        unavailFile = open("lyrics_unavailable_list.txt", "a")
        unavailFile.write(str(mxmId) + "," + trackId + "\n")
        unavailFile.close()
        return 404
    else:
        print("Unexpected error for mxmId:", mxmId, ", trackId:", trackId)
        print(res_json)
        return 400
        

#File containing the apikeys
f = open("mxm_apikey.txt", "r")
apikeys = []
for entry in f:
    apikey = entry.strip()
    apikeys.append(apikey)
f.close()
    
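#Track ids already known to have no lyrics, so that they are not requested again.
#Assumption: they are read back from lyrics_unavailable_list.txt (mxmId,trackId per line).
tracks_unavailable = set()
if os.path.isfile("lyrics_unavailable_list.txt"):
    with open("lyrics_unavailable_list.txt", "r") as unavailFile:
        for line in unavailFile:
            parts = line.strip().split(",")
            if len(parts) == 2:
                tracks_unavailable.add(parts[1])

i = 0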

#File containing the mxmids
mxmIdFile = open("mxm_download.txt", "r")
#Skip first line
next(mxmIdFile)

for line in mxmIdFile:
    entries = line.strip().split(",")
    mxmId = int(entries[0])
    trackId = entries[1]
    if trackId in tracks_unavailable:
        continue
    processStatus = getLyrics(apikeys[i], mxmId, trackId)
    if processStatus != 200:
        if processStatus == 402:
            print("Updating API key.")
            i += 1
            if i == len(apikeys):
                print("API keys exhausted!")
                break
        elif processStatus != 404:
            print("Unexpected error encountered!")
            break
    
mxmIdFile.close()

<b>Part 2: Data filtering</b>

Filter out non-English lyrics and create a dump file containing the entire data corpus. This will be used later for training the word embeddings.
<br><br>
Note: A Python package named 'langdetect' is required to perform this operation.
<br>
The final generated file is provided as part of the source code.
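
langdetect is non-deterministic by default, so short or mixed-language lyrics may be classified differently between runs. The following is a minimal sketch of a reproducible wrapper. `detect_language` and its `detector` parameter are hypothetical names introduced here, and the seed value is arbitrary.

```python
def detect_language(text, detector=None):
    """Return a language code such as 'en', or 'undefined' if detection fails."""
    if detector is None:
        # Imported lazily so the helper can also be exercised with a stub detector
        from langdetect import DetectorFactory, detect
        DetectorFactory.seed = 0  # arbitrary fixed seed for repeatable results
        detector = detect
    try:
        return detector(text.replace("\n", " "))
    except Exception:
        # langdetect raises when the text has no usable features (e.g. empty input)
        return "undefined"
```

Passing a custom `detector` makes it possible to check the surrounding logic without installing langdetect.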

In [9]:
#Package dependency: langdetect
#https://pypi.org/project/langdetect
from langdetect import detect

In [10]:
import os
import sys
import glob

lyrics_files_path = "Lyrics_dump" + os.path.sep + "Lyrics" + os.path.sep

#Get the list of all files with lyrics
filelist = [f for f in glob.glob(lyrics_files_path + "*.txt")]

In [11]:
#Write list of lyric files in languages other than English to a separate file

other_lang_Lyrics_list_file = "lyrics_otherLang.txt"

In [12]:
otherLangFile = open(other_lang_Lyrics_list_file, "a")

for entry in filelist:
    try:
        with open(entry, "r", encoding="utf8") as f:
            curLyrics = f.read()
            curLyrics = curLyrics[:-3].replace("\n", " ")
            lyrics_language = detect(curLyrics)
            if lyrics_language != 'en':
                otherLangFile.write(entry + "," + lyrics_language + "\n")
    except Exception as e:
        otherLangFile.write(entry + "," + "undefined" + "\n")

otherLangFile.close()

<center><b>Creating Track ID pairs and their scores</b></center>

The following code blocks identify the similar tracks for each unique track id in the database and generate (source, target) pairs along with their respective similarity labels. Source tracks that have at least 5 similar and at least 5 non-similar targets are identified and recorded separately.
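
As an illustration of the splitting step, here is a minimal sketch. It assumes that the `target` column stores alternating track ids and scores as one comma-separated string, which is what the cells below rely on. The names `split_targets`, `excluded` and `threshold` are hypothetical.

```python
def split_targets(target_field, excluded, threshold=0.5):
    """Split a 'tid,score,tid,score,...' string into (similar, non_similar) lists."""
    tokens = target_field.split(",") if target_field else []
    similar, non_similar = [], []
    # Walk the flat list two items at a time: (track id, score)
    for tid, score in zip(tokens[0::2], tokens[1::2]):
        if tid in excluded:
            continue
        if float(score) > threshold:
            similar.append((tid, score))
        else:
            non_similar.append((tid, score))
    return similar, non_similar

similar, non_similar = split_targets("TRA,0.9,TRB,0.2,TRC,0.7", excluded={"TRC"})
print(similar)      # [('TRA', '0.9')]
print(non_similar)  # [('TRB', '0.2')]
```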

In [13]:
import sqlite3
# Establishing connection to the database
con = sqlite3.connect("lastfm_similars.db")
cursorObj = con.cursor()
cursorObj.execute("SELECT * FROM similars_src")
rows = cursorObj.fetchall()

In [14]:
def file_len(fname):
    #Count the lines in a file; returns 0 for an empty file
    with open(fname) as f:
        return sum(1 for _ in f)

In [15]:
#Collecting all the track ids that don't have lyrics
f = open("lyrics_unavailable_list.txt","r")
lyrics_unavailable=[]
for row in f:
    #strip() removes the line ending without cutting a character from a last line that has none
    lyrics_unavailable.append(row.split(",")[1].strip())
f.close()

In [16]:
# Adding track ids of songs that are not in English to the lyrics-unavailable list
f_non_English = open("lyrics_otherLang_tracks.txt","r")
for row in f_non_English:
    lyrics_unavailable.append(row.strip())
f_non_English.close()

In [19]:
# Adding track ids that are not in the mxm database to the lyrics-unavailable list
f_no_mxm = open("mxm_unavailable.txt", "r")
for row in f_no_mxm:
    lyrics_unavailable.append(row.strip())
f_no_mxm.close()

In [20]:
lyrics_unavailable = set(lyrics_unavailable)

In [21]:
# Splitting a target songs row into similar and non-similar songs lists
def find_mid_split_into_two(target_songs_row):
    Left_split = []   # similarity score > 0.5
    Right_split = []  # similarity score <= 0.5
    # The row alternates track id and score, so step through it two items at a time
    for k in range(0, len(target_songs_row) - 1, 2):
        if target_songs_row[k] not in lyrics_unavailable:
            if float(target_songs_row[k+1]) > 0.5:
                Left_split.append([target_songs_row[k], target_songs_row[k+1]])
            else:
                Right_split.append([target_songs_row[k], target_songs_row[k+1]])
    # Always return both lists, even for an empty or malformed row
    return Left_split, Right_split

In [22]:
# Generating 5 randomly sampled similar and 5 randomly sampled non-similar target track IDs, with their similarity scores, for each source song
import random
f_similar= open("Pairs_labels_similar.txt","w")
f_non_similar = open("Pairs_labels_non_similar.txt","w")

Target_row_lengths=[]
for row in rows:
    ID_L = row[0]
    if ID_L not in lyrics_unavailable:
        splitted_row = row[1].split(",")
        Left_split, Right_split = find_mid_split_into_two(splitted_row)
        Target_row_lengths.append([len(Left_split),len(Right_split)])
        if len(Left_split) >=5 and len(Right_split) >= 5:
            Left_random = random.sample(Left_split,5)
            Right_random = random.sample(Right_split,5)
            for i in range(5):
                f_similar.write(ID_L+","+Left_random[i][0]+","+Left_random[i][1]+"\n")
                f_non_similar.write(ID_L+","+Right_random[i][0]+","+Right_random[i][1]+"\n")
            
f_similar.close()
f_non_similar.close()

In [23]:
# Checking the statistics of the similar and non similar targets songs for each source song.
similars =[]
for i in range(len(Target_row_lengths)):
    similars.append(Target_row_lengths[i][0])
non_similars=[]
for i in range(len(Target_row_lengths)):
    non_similars.append(Target_row_lengths[i][1])

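import statistics

# Print each value explicitly; a notebook cell only displays its last bare expression
print("Source tracks considered:", len(Target_row_lengths))
print("Similar targets: mean =", statistics.mean(similars), ", stdev =", statistics.stdev(similars))
print("Non-similar targets: mean =", statistics.mean(non_similars), ", stdev =", statistics.stdev(non_similars))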

In [24]:
# Getting the most similar and non similar target track IDs and their similarity score with source songs.
f_similar= open("Pairs_labels_similar2.txt","w")
f_non_similar = open("Pairs_labels_non_similar2.txt","w")

for row in rows:
    ID_L = row[0]
    if ID_L not in lyrics_unavailable:
        splitted_row = row[1].split(",")
        k = 0
        ID_R_similar = 1
        ID_R_non_similar = 1
        for i in range(len(splitted_row)):
            
            if ID_R_similar == 1 and float(splitted_row[k+1])>0.5 and splitted_row[k] not in lyrics_unavailable:
                ID_R_similar = splitted_row[k]
                score_similar = splitted_row[k+1]
            
            if ID_R_non_similar == 1 and float(splitted_row[len(splitted_row)-k-1])<=0.5 and splitted_row[len(splitted_row)-k-2] not in lyrics_unavailable:
                ID_R_non_similar = splitted_row[len(splitted_row)-k-2]
                score_non_similar = splitted_row[len(splitted_row)-k-1]
            
            if ID_R_similar!=1 and ID_R_non_similar !=1:
                f_similar.write(ID_L+","+ID_R_similar+","+score_similar+"\n")
                f_non_similar.write(ID_L+","+ID_R_non_similar+","+score_non_similar+"\n")
                break
            k = k+2
            if k>len(splitted_row)-2:
                break
f_similar.close()
f_non_similar.close()

In [25]:
# Sorting the Track ID pairs based on similarity score
def sorted_file(filename):
    # Scores are compared as floats; comparing the raw strings would misorder values such as "9e-05" and "0.5"
    if filename == "Pairs_labels_similar2.txt":
        with open(filename, "r") as f:
            lines = f.readlines()
        with open("sorted_similar_pairs.txt", "w") as output:
            for line in sorted(lines, key=lambda line: float(line.split(",")[2]), reverse=True)[0:35000]:
                output.write(','.join(line.split(",")[0:2]) + ",1\n")
    elif filename == "Pairs_labels_non_similar2.txt":
        with open(filename, "r") as f:
            lines = f.readlines()
        with open("sorted_non_similar_pairs.txt", "w") as output:
            for line in sorted(lines, key=lambda line: float(line.split(",")[2]))[0:35000]:
                output.write(','.join(line.split(",")[0:2]) + ",0\n")

In [26]:
sorted_file("Pairs_labels_similar2.txt")
sorted_file("Pairs_labels_non_similar2.txt")

In [27]:
# Shuffling the files
import random
seed = 100

def shuffle_file(filename):
    fid = open(filename, "r")
    li = fid.readlines()
    fid.close()
    random.Random(seed).shuffle(li)
    fid = open("Shuffled_"+filename, "w")
    fid.writelines(li)
    fid.close()

In [28]:
shuffle_file("sorted_non_similar_pairs.txt")
shuffle_file("sorted_similar_pairs.txt")

In [29]:
con.close()

<center><b>Generating final dataset</b></center>

The final step in the data processor is to extract the lyrics for all the track ids generated in the previous section and write them to a CSV file as pairs along with their respective similarity scores.
<br>
The CSV file will have 5 columns:
<ul>
    <li>X_left_trackid: Track id of the source</li>
    <li>X_left: Lyric snippet corresponding to the source</li>
    <li>X_right_trackid: Track id of the target</li>
    <li>X_right: Lyric snippet corresponding to the target</li>
    <li>Y: Binary similarity label for the pair of lyrics (1 = similar, 0 = not similar)</li>
</ul>
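
To make the layout concrete, the sketch below writes and reads back a two-row file with these columns using only the standard library. The track ids and lyric snippets are made-up placeholders, `write_pairs` is a hypothetical helper, and a tab delimiter is assumed to match the `sep = "\t"` used when the notebook saves the dataset.

```python
import csv
import io

COLUMNS = ["X_left_trackid", "X_left", "X_right_trackid", "X_right", "Y"]

def write_pairs(rows, handle):
    """Write rows (one list per pair) as tab-separated values with a header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)

# Placeholder data, not taken from the real corpus
rows = [
    ["TRAAA", "first lyric snippet", "TRBBB", "second lyric snippet", 1],
    ["TRAAA", "first lyric snippet", "TRCCC", "third lyric snippet", 0],
]
buffer = io.StringIO()
write_pairs(rows, buffer)

# Read the data back by column name
buffer.seek(0)
for record in csv.DictReader(buffer, delimiter="\t"):
    print(record["X_left_trackid"], record["X_right_trackid"], record["Y"])
```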

In [30]:
import os
import sys
import csv
import pandas as pd
import numpy as np

In [31]:
#Input pair files and output locations for the final dataset
dataset_filepath = "Project_dataset"
similar_file = "Pairs_labels_similar.txt"
non_similar_file = "Pairs_labels_non_similar.txt"

lyrics_files_path = "Lyrics_dump" + os.path.sep + "Lyrics" + os.path.sep
final_dataset = "final_dataset.csv"
final_dataset_shuffled = "final_dataset_shuffled.csv"

In [32]:
similar_pairs = []
f = open(similar_file)
for entry in f:
    similar_pairs.append(entry.strip())
f.close()

non_similar_pairs = []
f = open(non_similar_file)
for entry in f:
    non_similar_pairs.append(entry.strip())
f.close()

In [33]:
import gc
gc.disable()
#Separate CSV files for obtained track pairs
dataframe_entries = []
processed_trackids = []
counter = 0
for list_entry in [similar_pairs, non_similar_pairs]:
    for entry in list_entry:
        entries = entry.split(",")
        lyrics = []
        if os.path.exists(lyrics_files_path + entries[0] + ".txt") and os.path.exists(lyrics_files_path + entries[1] + ".txt"):
            for i in range(2):
                trackid = entries[i]
                lyrics_file = lyrics_files_path + trackid + ".txt"
                with open(lyrics_file, "r", encoding="utf8") as f:
                    curLyrics = f.read()
                    curLyrics = curLyrics[:-3].replace("\n", " ")
                    lyrics.append(curLyrics)
            #Same threshold as the pair-generation step: only scores above 0.5 count as similar
            label = 0
            if float(entries[2]) > 0.5:
                label = 1
            if entries[0] not in processed_trackids:
                processed_trackids.append(entries[0])
            if entries[1] not in processed_trackids:
                processed_trackids.append(entries[1])
            dataframe_entries.append([entries[0], lyrics[0], entries[1], lyrics[1], label])
        else:
            print("Unexpected error occurred!") #All entries in dataset MUST have been verified against availability
            print("The following trackid pair is unavailable:", entries)
        counter += 1
        if counter % 1000 == 0:
            print(counter)
        
gc.enable()

In [34]:
df = pd.DataFrame(dataframe_entries, columns = ["X_left_trackid", "X_left", "X_right_trackid", "X_right", "Y"])
display(df.shape)
display(df.head())
os.makedirs(dataset_filepath, exist_ok=True) #make sure the output directory exists
df.to_csv(dataset_filepath + os.path.sep + final_dataset, sep = "\t", encoding = "utf-8", index = False)

In [35]:
df_shuffle = df.reindex(np.random.permutation(df.index))
display(df_shuffle.shape)
display(df_shuffle.head())
#Write shuffled dataframe to file
df_shuffle.to_csv(dataset_filepath + os.path.sep + final_dataset_shuffled, sep = "\t", encoding = "utf-8", index = False)

<center><b>Custom word embeddings</b></center>


A Word2Vec embedding model is trained on the lyrics corpus. The resulting embeddings are used when training the neural network.
<br><br>
As a first step, all English lyrics are concatenated into a single file named 'fullDump.txt', which is also used during training.
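
The tokenization applied to each corpus line before training can be sketched as follows. `tokenize_line` is a hypothetical helper name. Unlike the cell below, this sketch also drops tokens that become empty after filtering, which is an addition made here for illustration.

```python
import re

def tokenize_line(line):
    """Lowercase a lyric line and keep only alphabetic characters per token."""
    raw_tokens = re.findall(r"[\w']+", line)
    cleaned = ["".join(ch.lower() for ch in tok if ch.isalpha()) for tok in raw_tokens]
    # Tokens made only of digits or apostrophes collapse to '' and are dropped here
    return [tok for tok in cleaned if tok]

print(tokenize_line("Don't stop 4 me, NOW!"))  # ['dont', 'stop', 'me', 'now']
```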

In [36]:
lyrics_files_path = "Lyrics_dump" + os.path.sep + "Lyrics" + os.path.sep

#Get the list of all files with lyrics
filelist = [f for f in glob.glob(lyrics_files_path + "*.txt")]

In [37]:
otherLangFile = "lyrics_otherLang.txt"
otherLangFiles = []

f = open(otherLangFile, 'r', encoding='utf8')
for entry in f:
    otherLangFiles.append(entry.split(',')[0])
f.close()

In [38]:
embedding_train_corpus = "Lyrics_dump" + os.path.sep + "fullDump.txt"

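#Open in write mode so that re-running the cell does not duplicate the corpus
dumpFile = open(embedding_train_corpus, "w", encoding="utf-8")
otherLangFileSet = set(otherLangFiles) #set for fast membership tests

for entry in filelist:
    if entry not in otherLangFileSet:
        with open(entry, "r", encoding="utf8") as f:
            curLyrics = f.read()
            curLyrics = curLyrics[:-3]
            #The newline keeps the last line of one song separate from the first line of the next
            dumpFile.write(curLyrics + "\n")

dumpFile.close()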

In [39]:
import re
from gensim.models import Word2Vec, KeyedVectors

embedding_train_corpus = "Lyrics_dump" + os.path.sep + "fullDump.txt"

lines = []
with open(embedding_train_corpus, "r", encoding='utf8') as trainFileObj:
    for entry in trainFileObj:
        lineTokens = re.findall(r"[\w']+", entry) #tokenizing the words from the corpus
        #lowercase each token and keep only its alphabetic characters
        entry = [''.join(e.lower() for e in x if e.isalpha()) for x in lineTokens]
        if len(entry) != 0:
            lines.append(entry)

In [40]:
#Note: 'size' is the gensim 3.x parameter name; gensim 4.0 renamed it to 'vector_size'
lyrics_model = Word2Vec(size=100, min_count=100, workers=3, window =3, sg = 1)
lyrics_model.build_vocab(lines)
lyrics_model.train(lines, total_examples=lyrics_model.corpus_count, epochs=200)
lyrics_model.wv.save_word2vec_format('lyrics_model_100dim.txt', binary = False)

In [41]:
lyrics_model_50 = Word2Vec(size=50, min_count=100, workers=3, window =3, sg = 1)
lyrics_model_50.build_vocab(lines)
lyrics_model_50.train(lines, total_examples=lyrics_model_50.corpus_count, epochs=200)
lyrics_model_50.wv.save_word2vec_format('lyrics_model_50dim.txt', binary = False)

In [36]:
lyrics_model_10 = Word2Vec(size=10, min_count=100, workers=3, window =3, sg = 1)
lyrics_model_10.build_vocab(lines)
lyrics_model_10.train(lines, total_examples=lyrics_model_10.corpus_count, epochs=200)
lyrics_model_10.wv.save_word2vec_format('lyrics_model_10dim.txt', binary = False)