## Spotify Playlist Generator

Import/download necessary libraries

In [1]:
pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from bs4 import BeautifulSoup
import requests
import sys
import os

## Generate data frame to use

The data is in a folder named "lyrics/", which holds about five songs from each of the top 500 artists (the collection process is in 'lyric_collector.py'). Each file is named after the artist, followed by a number from 1 to 5 that corresponds to the artist's 1st to 5th most popular song. The first line of each file holds the song title, and the remaining lines hold the lyrics.

The goal is to extract this information into a data frame with columns corresponding to artist name, song name, and lyrics.

In [3]:
def extract_info(text, doc_name, details):
    # The format for Happy by Pharrell, for example, would be:
    # Happy Lyrics
    # {Song Lyrics}
    # Hence, we split on 'Lyrics'.
    try:
        song_name, lyrics = text.split('Lyrics', maxsplit=1)
    except ValueError:
        # split() returned a single piece, so 'Lyrics' never appeared in the text
        print(f"{doc_name} has no lyrics or had an error")
        return
    artist, _ = doc_name.rsplit('_', maxsplit = 1)
    details["artist"].append(artist)
    details["song_name"].append(song_name)
    details["lyrics"].append(lyrics)
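
A quick sketch of what the two splits in `extract_info` do. The file name `PharrellWilliams_1` and the two-line text are made up for illustration and are not taken from the `lyrics/` folder.

```python
# Hypothetical file name and contents, following the format described above.
doc_name = "PharrellWilliams_1"
text = "Happy Lyrics\nIt might seem crazy what I'm 'bout to say"

# Split once on the first 'Lyrics': title on the left, lyrics on the right.
song_name, lyrics = text.split("Lyrics", maxsplit=1)

# Split once from the right on '_' to drop the trailing song number.
artist, rank = doc_name.rsplit("_", maxsplit=1)

print(repr(song_name))  # 'Happy ' (note the trailing space)
print(repr(lyrics))
print(artist, rank)
```

The title keeps a trailing space and the lyrics keep a leading newline, so calling `.strip()` on both may be worthwhile. A title that itself contains the word "Lyrics" would be cut short by this split.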

In [4]:
def read_docs(input_dir):
    details = {
        "artist": [],
        "song_name": [],
        "lyrics": []
    }
    for doc in os.listdir(input_dir):
        # os.path.join avoids a doubled slash when input_dir already ends in '/'
        with open(os.path.join(input_dir, doc)) as f:
            doc_text = f.read()
        extract_info(doc_text, doc, details)
    return pd.DataFrame(details)


In [33]:
df = read_docs('lyrics/')

In [34]:
df["id"] = df.index
df.head()

| | artist | song_name | lyrics | id |
|---|---|---|---|---|
| 0 | MGMT | Electric Feel | \nAll along the Western front\nPeople line up ... | 0 |
| 1 | AFI | The Days of the Phoenix | I remember when I was told of story of\nCrushe... | 1 |
| 2 | TVontheRadio | DLZ | \nCongratulations on the mess you made of thin... | 2 |
| 3 | SonicYouth | Schizophrenia | \nI went away to see an old friend of mine\nHi... | 3 |
| 4 | NickCave&TheBadSeeds | O Children | Pass me that lovely little gun\nMy dear, my da... | 4 |


### Remove Stop words and Tokenize

In [35]:
stop = set()
with open('stopwords.txt', 'r') as f:
    for line in f:
        stop.add(line.rstrip())

df['lyrics_nostop'] = df['lyrics'].apply(lambda x: ' '.join([word.lower() for word in x.split() if word not in (stop)]))

In [36]:
df.head()

| | artist | song_name | lyrics | id | lyrics_nostop |
|---|---|---|---|---|---|
| 0 | MGMT | Electric Feel | \nAll along the Western front\nPeople line up ... | 0 | all along western front people line up receive... |
| 1 | AFI | The Days of the Phoenix | I remember when I was told of story of\nCrushe... | 1 | i remember i told story crushed velvet, candle... |
| 2 | TVontheRadio | DLZ | \nCongratulations on the mess you made of thin... | 2 | congratulations mess made things i'm trying re... |
| 3 | SonicYouth | Schizophrenia | \nI went away to see an old friend of mine\nHi... | 3 | i went away see old friend mine his sister cam... |
| 4 | NickCave&TheBadSeeds | O Children | Pass me that lovely little gun\nMy dear, my da... | 4 | pass lovely little gun my dear, darling one th... |
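
In the output above, the lowercased form of a capitalized word such as "I" survives the filter. One likely cause, assuming `stopwords.txt` holds lowercase entries, is that the membership test runs on the original casing before `.lower()` is applied. Below is a minimal sketch of a case-insensitive variant. The small stop list is a stand-in for the real file, and `remove_stopwords` is an illustrative name.

```python
# Stand-in stop list; the real one is read from stopwords.txt.
stop = {"i", "the", "of", "when", "was"}

def remove_stopwords(text, stop_words):
    """Lowercase each word first, then drop it if it is a stop word."""
    lowered = (word.lower() for word in text.split())
    return " ".join(word for word in lowered if word not in stop_words)

print(remove_stopwords("I remember when I was told of story of", stop))
# remember told story
```

Punctuation attached to a word (for example "velvet,") is still kept as part of the word here; only the casing issue is addressed.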


In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [45]:
analyze = TfidfVectorizer().build_analyzer()
df['tokens'] = df['lyrics_nostop'].apply(analyze)
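
With default settings, the `TfidfVectorizer` analyzer lowercases the text and keeps only runs of two or more word characters (token pattern `(?u)\b\w\w+\b`), which is why "i'm" contributes no token in the outputs further down. Here is a rough stand-alone imitation of that behavior using only `re`. It is a sketch, not the library's own code, and `analyze_like_default` is an illustrative name.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def analyze_like_default(text):
    """Lowercase, then keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

print(analyze_like_default("I'm back in Liverpool, I've seen it"))
# ['back', 'in', 'liverpool', 've', 'seen', 'it']
```

Single-character pieces such as "i" and "m" are dropped, while "ve" from "I've" is kept.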

Now would be an appropriate time to get rid of the redundant 'lyrics' column.

In [46]:
# errors='ignore' keeps this cell safe to re-run once 'lyrics' is already gone
df = df.drop(columns=['lyrics'], errors='ignore')
df.head()

## Find similarities between songs and queries

### General cosine similarity based on lyric similarity

In [65]:
query = "I am very happy, give me happy"

q_ser = pd.Series([query])
l_ser = df['lyrics_nostop']

# Series.append was removed in pandas 2.0; pd.concat does the same job
q_lyrics = pd.concat([q_ser, l_ser])

# TF-IDF rows are L2-normalized by default, so cosine similarity reduces to a dot product
tf_idf = TfidfVectorizer().fit_transform(q_lyrics)
# the query is the first item in the series, so compare it to the other items
cosine_sim = cosine_similarity(tf_idf[0:1], tf_idf[1:]).flatten()
# indices of the six highest scores, best first
top_songs_indices = cosine_sim.argsort()[:-7:-1]
print(top_songs_indices)
print(cosine_sim[top_songs_indices])

df.iloc[top_songs_indices]

[1286 2174  449 1174   94 2227]
[0.40734742 0.38548732 0.37506168 0.3115518  0.28000354 0.23671946]


| | artist | song_name | id | lyrics_nostop | tokens |
|---|---|---|---|---|---|
| 1286 | DemiLovato | Stone Cold | 1286 | stone cold, stone cold you see standing, i'm d... | [stone, cold, stone, cold, you, see, standing,... |
| 2174 | TheKooks | Junk of the Heart (Happy) | 2174 | junk heart junk mind so hard leave alone we ge... | [junk, heart, junk, mind, so, hard, leave, alo... |
| 449 | SherylCrow | If It Makes You Happy | 449 | i've long, long way put poncho, played mosquit... | [ve, long, long, way, put, poncho, played, mos... |
| 1174 | PharrellWilliams | Happy | 1174 | it might seem crazy i'm 'bout say sunshine she... | [it, might, seem, crazy, bout, say, sunshine, ... |
| 94 | TheWombats | Let’s Dance to Joy Division | 94 | i'm back liverpool everything seems same but i... | [back, liverpool, everything, seems, same, but... |
| 2227 | CatStevens | Tea for the Tillerman | 2227 | bring tea tillerman steak sun wine woman made ... | [bring, tea, tillerman, steak, sun, wine, woma... |
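
The slice `[:-7:-1]` applied to `argsort()` reads the ascending index order backwards and stops after six entries, so it yields the six best-scoring positions, best first. The same selection can be sketched in plain Python on made-up scores:

```python
# Made-up similarity scores; position i stands for song i.
scores = [0.10, 0.41, 0.00, 0.38, 0.05, 0.31, 0.24, 0.28, 0.02]

# Indices sorted by ascending score (what argsort returns for a NumPy array).
ascending = sorted(range(len(scores)), key=lambda i: scores[i])

# Walk backwards from the end and stop before position -7: six indices, best first.
top_six = ascending[:-7:-1]

print(top_six)
# [1, 3, 5, 7, 6, 0]
```

Because the scores are computed against `tf_idf[1:]`, with the query row left out, position 0 corresponds to the first row of `df`, which is why the indices can be passed straight to `df.iloc`.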


If you look at the lyrics of these results, key words like "happy" and "give" show up a lot. But apart from those two words the query consists only of stop words, and the stop words have been removed from the lyrics, so the match effectively rests on those two words. We need to look at some other methods.

My idea is to incorporate some data from Spotify, which provides metrics such as 'danceability', 'energy', and 'loudness'. If we create a vector of these metrics for each song, we can use a method similar to the one above and measure the 'closeness' of the vectors. The query in this case would have to be a song, so that we can extract the same metrics for it and measure closeness.
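
A minimal sketch of that idea, under several assumptions: the feature values below are invented rather than fetched from Spotify, and 'loudness' is rescaled from an assumed range of -60 to 0 dB so that it does not dominate the 0-to-1 features. The helper names `to_vector` and `cosine` are illustrative.

```python
import math

def to_vector(features):
    """Build a feature vector, rescaling loudness (assumed -60..0 dB) to 0..1."""
    loudness_scaled = (features["loudness"] + 60.0) / 60.0
    return [features["danceability"], features["energy"], loudness_scaled]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented values for illustration only.
query_song = {"danceability": 0.65, "energy": 0.82, "loudness": -4.7}
candidates = {
    "song_a": {"danceability": 0.70, "energy": 0.80, "loudness": -5.0},
    "song_b": {"danceability": 0.20, "energy": 0.15, "loudness": -22.0},
}

q = to_vector(query_song)
# Rank candidate songs by similarity to the query song, most similar first.
ranked = sorted(
    candidates,
    key=lambda name: cosine(q, to_vector(candidates[name])),
    reverse=True,
)
print(ranked)
```

Since every component is non-negative, cosine scores tend to cluster near 1; Euclidean distance on the scaled features would be a reasonable alternative measure of closeness.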
