# Get lyrics for songs
The purpose of this notebook is to create a dataset that shares a key with the scraped songs dataset (the Spotify ID), then download the lyrics for each song and store them in the CSV.
We can then use the lyrics to derive additional features for classifying songs.
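
As a rough illustration of why the shared Spotify ID matters, the sketch below joins a lyrics table back onto a song feature table. Both frames and all column names here (`spotify_id`, `tempo`, `returned_lyrics`) are made up for the example, not taken from the real files.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets; column names are assumptions.
songs = pd.DataFrame({
    "spotify_id": ["id_a", "id_b", "id_c"],
    "tempo": [120.0, 98.5, 140.2],
})
lyrics = pd.DataFrame({
    "spotify_id": ["id_a", "id_c"],
    "returned_lyrics": ["first song words", "third song words"],
})

# A left join keeps every scraped song, with NaN lyrics where none were found.
merged = songs.merge(lyrics, on="spotify_id", how="left")
print(merged)
```

A left join is used here so that songs without lyrics stay in the dataset and can be filtered out later if needed.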


In [1]:
# Perform the relevant imports
import pandas as pd
import numpy as np

# The Genius API wrapper, written by myself
from genius_api import GeniusApi

https://genius.com/Michael-jackson-history-lyrics
[Verse 1]
He got kicked in the back
He say that he needed that
He hot willed in the face
Keep daring to motivate
He say one day you will see
His place in world history
He dares to be recognized
The fires deep in his eyes
[Pre-Chorus]
How many victims must there be
Slaughtered in vain across the land
And how many struggles must there be
Before we choose to live the prophet's plan
Everybody sing
[Chorus]
Every day create your history
Every path you take you're leaving your legacy
Every soldier dies in his glory
Every legend tells of conquest and liberty
[Verse 2]
Don't let no one get you down
Keep moving on higher ground
Keep flying until
You are the king of the hill
No force of nature can break
Your will to self-motivate
She say this face that you see
Is destined for history
https://genius.com/Dj-spinn-feelin-you-lyrics
[Produced by DJ Spinn]
https://genius.com/Bicep-atlas-lyrics
None
https://genius.com/Foxy-brown-if-i-lyrics
[Intro]
Uhh

The code cell below imports the dataset for which we will fetch lyrics.


In [8]:
import_df = pd.read_csv("featured_tagged_v_3.csv")
# Read the song title, artist name and Spotify ID columns (selected by position)
song_list = import_df.iloc[:,1]
artist_list = import_df.iloc[:,0]
spotify_ids = import_df.iloc[:,2]

print(song_list)
print(artist_list)

0                    Deep Sea Creature
1                              If I...
2                           Murder One
3                     On My Feet Again
4                    Bedroom Acoustics
                     ...              
16923                    Wrecking Ball
16924                It Doesn't Matter
16925                    Why, Why, Why
16926    Everybody [Backstreet's Back]
16927    Probably Wouldn't Be This Way
Name: Title, Length: 16928, dtype: object
0                    Mastodon
1                  Foxy Brown
2        Bone Thugs-N-Harmony
3                      Utopia
4                        Muse
                 ...         
16923             Miley Cyrus
16924          Stephen Stills
16925        Billy Currington
16926         Backstreet Boys
16927             LeAnn Rimes
Name: Artist name, Length: 16928, dtype: object
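
The printed output shows that the columns are named `Title` and `Artist name`, so they could also be selected by name rather than by position. A minimal sketch on a stand-in frame follows; the `spotify_id` column name is an assumption, since the real name of the third column is not shown here.

```python
import pandas as pd

# Stand-in for the imported CSV; "spotify_id" is an assumed column name.
demo_df = pd.DataFrame({
    "Artist name": ["Mastodon", "Muse"],
    "Title": ["Deep Sea Creature", "Bedroom Acoustics"],
    "spotify_id": ["id_a", "id_b"],
})

# Name-based selection keeps working if the column order changes.
demo_songs = demo_df["Title"]
demo_artists = demo_df["Artist name"]
demo_ids = demo_df["spotify_id"]
```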


### Get the lyrics for each of the songs searched
We should probably add a gate that only performs the search if the song is known to have lyrics. Can we retrieve this from the Spotify API itself?

  
The exported CSV will have the following features:
 - song_id: the Spotify ID for the song
 - song_name
 - artist_name
 - Has lyrics: 0 or 1. A song is considered to have no lyrics if its Genius page contains no lyrics
 - lyrics: the actual lyrics of the song
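
One way to produce the columns listed above is sketched below. `StubGenius` is a hypothetical stand-in that only mimics the two method names used in this notebook, and it assumes both return `False` on failure. The `has_lyrics` column name and the example URLs are also assumptions. Unlike a loop that skips songs with an empty page, this version records every song.

```python
import pandas as pd


class StubGenius:
    """Hypothetical stand-in that mimics the two methods used in this notebook."""

    def __init__(self, pages):
        self.pages = pages  # maps (title, artist) -> (url, lyrics or False)

    def request_song_lyrics_url(self, song_name, artist):
        page = self.pages.get((song_name, artist))
        return page[0] if page else False

    def get_lyrics_from_url(self, url):
        for page_url, lyrics in self.pages.values():
            if page_url == url:
                return lyrics
        return False


def collect_lyrics(api, songs):
    """Return one row per song, including songs with no Genius page or no lyrics."""
    rows = []
    for artist, title, spotify_id in songs:
        url = api.request_song_lyrics_url(song_name=title, artist=artist)
        lyrics = api.get_lyrics_from_url(url) if url else False
        rows.append({
            "song_id": spotify_id,
            "song_name": title,
            "artist_name": artist,
            "has_lyrics": 1 if lyrics else 0,
            "lyrics": lyrics if lyrics else "",
        })
    return pd.DataFrame(rows)


stub = StubGenius({
    ("Song A", "Artist A"): ("https://example.com/a", "la la la"),
    ("Song B", "Artist B"): ("https://example.com/b", False),
})
result = collect_lyrics(stub, [
    ("Artist A", "Song A", "id_a"),
    ("Artist B", "Song B", "id_b"),
    ("Artist C", "Song C", "id_c"),
])
print(result[["song_id", "has_lyrics"]])
```

Building a list of dictionaries and converting it once at the end avoids growing a DataFrame row by row, which tends to be slow for tens of thousands of songs.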

In [9]:
genius_API = GeniusApi()

NUMBER_OF_SONGS = len(artist_list)

# keep track of all of the successful rows
next_dataset_row = 0

# declare a response dataframe
outputframe = pd.DataFrame(columns=("Artist name", "Title", "spotify_id", "on_genius", "genius_url", "returned_lyrics"), index=np.arange(0,NUMBER_OF_SONGS))

# iterate over every song in the dataset
for i in range(NUMBER_OF_SONGS):
    title = song_list[i]
    artist = artist_list[i]
    spotify_id = spotify_ids[i]

    print(title + artist)

    # look up the Genius URL for the song
    song_url = genius_API.request_song_lyrics_url(song_name=title, artist=artist)

    # False signals a failed search
    if song_url != False:
        # scrape the lyrics from the Genius page
        lyrics = genius_API.get_lyrics_from_url(song_url)
        if lyrics != False:
            # add the song and its lyrics as a new row in the output dataframe
            outputframe.loc[next_dataset_row] = [artist, title, spotify_id, 1, song_url, lyrics]

            # advance to the next free row
            next_dataset_row = next_dataset_row + 1
        # note: songs whose page yields no lyrics are not recorded at all
    else:
        # record songs that could not be found on Genius
        outputframe.loc[next_dataset_row] = [artist, title, spotify_id, 0, "", ""]
        next_dataset_row = next_dataset_row + 1


print(outputframe)

ney-honey-lyrics
If I Die YoungThe Band Perry
https://genius.com/The-band-perry-if-i-die-young-lyrics
Surfin' U.S.A.The Beach Boys
https://genius.com/The-beach-boys-surfin-usa-lyrics
Call It LovePoco
https://genius.com/Poco-call-it-love-lyrics
JamieEddie Holland
https://genius.com/Eddie-holland-jamie-lyrics
Tell Me (You're Coming Back)The Rolling Stones
https://genius.com/The-rolling-stones-tell-me-youre-coming-back-lyrics
Who's Making LoveBlues Brothers
https://genius.com/The-blues-brothers-whos-making-love-lyrics
Ebony EyesBob Welch
https://genius.com/Bob-welch-ebony-eyes-lyrics
All For YouSister Hazel
https://genius.com/Sister-hazel-all-for-you-lyrics
Let Her CryHootie & The Blowfish
https://genius.com/Hootie-and-the-blowfish-let-her-cry-lyrics
I've Got The Music In MeThe Kiki Dee Band
https://genius.com/The-kiki-dee-band-ive-got-the-music-in-me-lyrics
Step Into A World (Rapture's Delight)KRS-One
https://genius.com/Krs-one-step-into-a-world-raptures-delight-lyrics
Show StopperDanity

In [10]:
outputframe.to_csv("song_lyrics_attempt_2_large.csv")

Now that we have a dataset of song lyrics, we can produce a dataset that contains only lyric values. I should run this for a new Million Song Dataset file. We should clean all the song lyrics and then remove all unneeded words.
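
Removing unneeded words could look roughly like the sketch below. The stop-word list is a tiny hand-written set for illustration only, and the helper name `remove_stop_words` is hypothetical. A real run would probably use a fuller list.

```python
# A tiny hand-written stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def remove_stop_words(text):
    """Lowercase the text and drop any token found in STOP_WORDS."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


print(remove_stop_words("The sun is in the sky and it shines"))
```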

In [12]:
import re
# does the number of verses impact the song?
# remove all tags
# lyrics_frame = pd.read_csv("song_lyrics_attempt_1.csv")
lyric_frame = pd.read_csv("song_lyrics_attempt_2_large.csv")
# work on a copy so the original frame is left untouched
clean_frame = lyric_frame.copy()
NUM = len(clean_frame)

# matches anything wrapped in square or round brackets, e.g. [Verse 1]
match = re.compile(r'[\(\[].*?[\)\]]')

for i in range(0, NUM):
    try:
        lyric = lyric_frame.iloc[i, 6]
        # remove the paragraph headers
        clean_lyrics = match.sub("", lyric)
        # remove empty lines (joined with "\n" so the split below behaves the same on every OS)
        clean_lyrics = "\n".join([s for s in clean_lyrics.splitlines() if s])

        # end every line with a full stop unless it already ends in punctuation
        new_string = ''.join('{}.\n'.format(item) if (item and item[-1] not in '!?.,-') else '{}\n'.format(item) for item in clean_lyrics.split('\n'))

        clean_frame.iloc[i, 6] = new_string
    except Exception:
        # The lyrics are likely to be NaN or similar, so replace these instances with ""
        clean_frame.iloc[i, 6] = ""

print("CSV created successfully")
clean_frame.to_csv("song_lyrics_withoutheaders_attempt_3_line_breaks_commas.csv")


CSV created successfully
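
The cleaning step can also be written as a small function that works on a single string, which makes it easier to check on a sample. This is a sketch, not the exact cell above: the name `clean_lyrics_text` is hypothetical, and it additionally strips surrounding whitespace from each line.

```python
import re

TAG_PATTERN = re.compile(r"[\(\[].*?[\)\]]")


def clean_lyrics_text(lyric):
    """Strip bracketed tags and blank lines, then end each line with punctuation."""
    if not isinstance(lyric, str):
        return ""  # missing lyrics are read back from CSV as NaN
    without_tags = TAG_PATTERN.sub("", lyric)
    lines = [line.strip() for line in without_tags.splitlines() if line.strip()]
    return "".join(
        line + "\n" if line[-1] in "!?.,-" else line + ".\n" for line in lines
    )


sample = "[Verse 1]\nHello there\n\nHow are you?\n(Outro)\nBye"
print(clean_lyrics_text(sample))
```

With a function like this, the whole column could be cleaned with `Series.apply` instead of an index-based loop.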


Next, I want to add more features to our dataset by crafting features that give a rough indication of what each song is about.
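
As a starting point, a few simple text features can be computed directly from the raw lyrics, including the verse count raised as a question in the cleaning cell. The function name `lyric_features` and the choice of features are illustrative assumptions, not a settled design.

```python
import re


def lyric_features(lyrics):
    """Compute a few simple, illustrative text features from raw lyrics."""
    # Section tags such as [Verse 1] must be counted before they are stripped.
    section_tags = re.findall(r"\[(.*?)\]", lyrics)
    body = re.sub(r"\[.*?\]", "", lyrics)
    words = re.findall(r"[a-z']+", body.lower())
    unique = set(words)
    return {
        "verse_count": sum(tag.lower().startswith("verse") for tag in section_tags),
        "word_count": len(words),
        "unique_word_ratio": len(unique) / len(words) if words else 0.0,
    }


features = lyric_features("[Verse 1]\nla la love\n[Chorus]\nlove you\n[Verse 2]\nla la")
print(features)
```

Note that the verse count needs the lyrics with their section headers intact, so it would have to be computed from the uncleaned CSV.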