## Scraping Lyrics Using MetroLyrics and LastFM
###### Updated on 10/12/2019
----------------

#### To create conda environment from file:

1. `cd` into `pyGhostWriter` repo
2. `conda env create --file environment.yml`
3. (Optional) `source ~/.bashrc`

In [14]:
!pip install --upgrade pip; pip install python-decouple; pip install tswift;

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (19.3.1)


In [15]:
import string
import random

import numpy as np
import pandas as pd
import requests as r
from tswift import Artist, Song
from decouple import config

# LastFM
API_KEY = config('LASTFM_API_KEY', cast=str)

### MetroLyrics with `tswift` Usage Example

`tswift` is currently broken, and I suspect it's due to a change in the MetroLyrics API. I've submitted an issue on GitHub, but until that is resolved, another solution is needed to retrieve lyrics for a given song or artist. **TODO**

In [3]:
# Verify tswift is operational
the_cure = Artist('The Cure')
the_cure.songs[:10]

[Song(title='I Dont Know Whats Going On', artist='The Cure'),
 Song(title='1015 Saturday Night', artist='The Cure'),
 Song(title='13Th', artist='The Cure'),
 Song(title='2 Late', artist='The Cure'),
 Song(title='39', artist='The Cure'),
 Song(title='A Boy I Never Knew', artist='The Cure'),
 Song(title='A Chain Of Flowers', artist='The Cure'),
 Song(title='A Few Hours After This', artist='The Cure'),
 Song(title='A Foolish Arrangement', artist='The Cure'),
 Song(title='A Forest', artist='The Cure')]

In [4]:
# <|endoftext|> is the GPT-2 training data delimiter
song = random.choice(the_cure.songs)
print(song.lyrics+'\n<|endoftext|>')

Shout
Shout
Shout
New day
New lie

I see this again
The same
This time has gone again

This
In the sky
This
Here in my hand
Shout

Jumping on it from the trees
Doing the same
On the air
On the air

Shout
Shout

In heaven
The ground is waiting
See my eye

Once more
It's been the same song
Babbling

On this
On all of you
Shout
Shout
Shout
<|endoftext|>
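
Because `<|endoftext|>` marks the end of each document in the GPT-2 training data, the same delimiter can close every song when several lyrics are written to a single training file. The following is a minimal sketch under that assumption; `EOT` and `build_corpus` are hypothetical names that do not appear elsewhere in this notebook, and the sample lyrics are placeholders.

```python
EOT = '<|endoftext|>'  # GPT-2 training data delimiter


def build_corpus(lyrics_list):
    """Join non-empty lyrics into one string, closing each song with the delimiter."""
    cleaned = (text.strip() for text in lyrics_list if text and text.strip())
    return ''.join(f'{text}\n{EOT}\n' for text in cleaned)


# Placeholder lyrics; the empty string stands in for a song with no lyrics
corpus = build_corpus(['Shout\nShout', '', 'New day\nNew lie'])
print(corpus)
```

Songs with missing or blank lyrics are skipped, so the corpus never contains two delimiters in a row.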


In [5]:
# Create dataframe of all songs and lyrics for The Cure
lyrics_dict = {s.title: s.lyrics for s in the_cure.songs}
lyrics_zipped = list(lyrics_dict.items())

In [6]:
lyrics_df = pd.DataFrame(lyrics_zipped, columns=['title', 'lyrics'])

print('Observations/Songs: ', len(lyrics_df))
lyrics_df.head()

Observations/Songs:  301


| | title | lyrics |
|---|---|---|
| 0 | I Dont Know Whats Going On | I don't know what's going on\nI am so up close... |
| 1 | 1015 Saturday Night | 10.15 on a Saturday night\nAnd the tap\nDrips ... |
| 2 | 13Th | 'Everyone feels good in the room,' she swings\... |
| 3 | 2 Late | So I'll wait for you\nWhere I always wait\nBeh... |
| 4 | 39 | So the fire is almost out and there's nothing ... |


### LastFM Query Examples

The API expects UTF-8 encoded strings. Python 3 strings are Unicode and are encoded as UTF-8 by default (for example by `str.encode()`), so no extra conversion is needed.
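
Artist and tag names can contain spaces or non-ASCII characters, which have to be percent-encoded as UTF-8 in the query string. The sketch below shows one way to do this with the standard library; `BASE_URL` and `build_lastfm_url` are hypothetical names introduced only for illustration, and `'YOUR_KEY'` is a placeholder rather than a real API key.

```python
from urllib.parse import urlencode

BASE_URL = 'http://ws.audioscrobbler.com/2.0/'


def build_lastfm_url(method, api_key, **params):
    """Build a LastFM request URL with UTF-8 percent-encoded query parameters."""
    query = {'method': method, **params, 'api_key': api_key, 'format': 'json'}
    # urlencode() encodes values as UTF-8 by default and turns spaces into '+'
    return f'{BASE_URL}?{urlencode(query)}'


url = build_lastfm_url('artist.getsimilar', 'YOUR_KEY', artist='Amon Düül II')
print(url)
# http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&artist=Amon+D%C3%BC%C3%BCl+II&api_key=YOUR_KEY&format=json
```

Passing the same dictionary to `requests.get` through its `params` argument is an alternative that leaves the encoding to `requests`.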

In [7]:
# Query similar artists of The Cure
# To run LastFM queries input your API key
response = r.get(f'http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&artist=thecure&api_key={API_KEY}&format=json')
response

<Response [200]>

In [8]:
resp_json = response.json()['similarartists']['artist']

print([band['name'] for band in resp_json])

['The Glove', 'Siouxsie and the Banshees', 'New Order', 'Joy Division', 'Bauhaus', 'The Smiths', 'Echo & the Bunnymen', 'Depeche Mode', 'The Sisters of Mercy', 'The Chameleons', 'Peter Murphy', 'The Jesus and Mary Chain', 'Nick Cave & The Bad Seeds', 'The Church', 'The Psychedelic Furs', 'Cocteau Twins', 'Love And Rockets', 'The Sound', 'Talking Heads', 'Morrissey', 'Killing Joke', 'Japan', 'Christian Death', 'Public Image Ltd.', 'Sonic Youth', 'The The', 'Gang of Four', 'Xmal Deutschland', 'Simple Minds', 'Sad Lovers and Giants', 'The Mission', 'New Model Army', 'Modern English', 'Tones on Tail', 'The Creatures', 'Duran Duran', 'XTC', 'Wire', 'The Stranglers', 'Devo', 'Adam and the Ants', 'Pixies', 'Interpol', 'The Damned', 'Placebo', 'Magazine', 'Gene Loves Jezebel', 'Alien Sex Fiend', 'She Wants Revenge', 'Tears for Fears', "The B-52's", 'R.E.M.', 'Red Lorry Yellow Lorry', 'Gary Numan', 'Television', 'Fad Gadget', 'U2', 'The Cult', 'Soft Cell', 'Pet Shop Boys', 'Talk Talk', 'Oingo B
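
The lookup `response.json()['similarartists']['artist']` raises a `KeyError` if the response body does not have the expected shape. The sketch below is one defensive way to pull the names out; it assumes that a failed LastFM call returns a JSON body with `error` and `message` keys, which this notebook does not verify. `extract_names` and the sample payloads are hypothetical.

```python
def extract_names(payload, outer, inner):
    """Return the 'name' of every item found under payload[outer][inner]."""
    # Assumption: LastFM reports failures as {'error': <code>, 'message': <text>}
    if 'error' in payload:
        raise RuntimeError(f"LastFM error {payload['error']}: {payload.get('message', '')}")
    items = payload.get(outer, {}).get(inner, [])
    return [item['name'] for item in items if 'name' in item]


# Hand-written payloads shaped like the responses shown in this notebook
sample = {'similarartists': {'artist': [{'name': 'The Glove'}, {'name': 'New Order'}]}}
print(extract_names(sample, 'similarartists', 'artist'))
print(extract_names({}, 'toptags', 'tag'))
```

The same helper covers the `toptags`/`tag` and `topartists`/`artist` responses, since all three share the nested list-of-dicts layout.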

In [9]:
# Query top tags for The Cure
response = r.get(f'http://ws.audioscrobbler.com/2.0/?method=artist.getTopTags&artist=thecure&api_key={API_KEY}&format=json')
response

<Response [200]>

In [10]:
resp_json = response.json()['toptags']['tag']

print([tag['name'] for tag in resp_json])

['post-punk', 'new wave', 'alternative', '80s', 'rock', 'seen live', 'alternative rock', 'goth', 'british', 'indie', 'Gothic Rock', 'Gothic', 'The Cure', 'pop', 'Post punk', 'darkwave', 'punk', 'goth rock', 'indie rock', 'classic rock', 'dark', 'UK', '90s', 'britpop', 'cold wave', 'electronic', 'melancholic', 'dark wave', 'male vocalists', 'Cure', '70s', 'favorites', 'english', "80's", 'synth pop', 'Love', 'emo', 'england', 'robert smith', 'punk rock', 'synthpop', 'psychedelic', 'indie pop', 'melancholy']


In [11]:
# Query top artists in the post-punk top tag
response = r.get(f'http://ws.audioscrobbler.com/2.0/?method=tag.gettopartists&tag=post-punk&api_key={API_KEY}&format=json')
response

<Response [200]>

In [12]:
resp_json = response.json()['topartists']['artist']

print([band['name'] for band in resp_json])

['The Cure', 'Joy Division', 'Nick Cave & The Bad Seeds', 'Swans', 'Siouxsie and the Banshees', 'Echo & the Bunnymen', 'Bauhaus', 'She Wants Revenge', 'Wire', 'Parquet Courts', 'The Fall', 'Killing Joke', 'Gang of Four', 'Motorama', 'Iceage', 'Television', 'Idles', 'The Chameleons', 'Public Image Ltd.', 'Savages', 'These New Puritans', 'Les Savy Fav', 'The Soft Moon', 'Protomartyr', 'New Model Army', 'The The', 'Mission of Burma', 'Suicide', 'Felt', 'Lebanon Hanover', 'The Durutti Column', 'The Birthday Party', 'Crystal Stilts', 'Peter Murphy', 'The Gun Club', 'The Sound', 'The Feelies', 'Orange Juice', 'Pere Ubu', 'Magazine', 'Young Marble Giants', 'Sleaford Mods', 'Ought', 'Love And Rockets', 'Preoccupations', 'DRAB MAJESTY', 'Tuxedomoon', 'The Clean', "I Love You But I've Chosen Darkness", 'Television Personalities']


### Combining LastFM and MetroLyrics

In [13]:
# Choose a band, and instantiate the Artist class for it, if it exists.
user_band = input('Input artist (ex: The Cure): ')

try:
    band = Artist(user_band)
    if len(band.songs) == 0:
        raise ValueError('No songs for this artist.')
except Exception as e:
    print(f'Exception occurred: {e}')
else:
    print(f'Your artist has been chosen: {band}. Some of their songs include:')
    print('* '+'\n* '.join([song.title for song in band.songs[:10]]))

Input artist (ex: The Cure):  Frank Zappa


Your artist has been chosen: Artist('frank-zappa'). Some of their songs include:
* No No No
* Quotno No No Quot
* Ere Ian Whips Itjcb Spits Itmotorhead Rips It
* 14 Tone Unit
* 200 Motels Finale
* 200 Years Old
* 50 50
* 98 Objects
* A Cold Dark Matter
* A Diffrent Octave


In [14]:
# Before collecting any lyrics, build a list of all artists similar to the chosen artist.
LASTFM_URL = 'http://ws.audioscrobbler.com/2.0/'


def lastfm_get(method, **params):
    """Call a LastFM API method and return the decoded JSON response."""
    params.update(method=method, api_key=API_KEY, format='json')
    response = r.get(LASTFM_URL, params=params)
    # Raises requests.HTTPError (an IOError subclass) on a 4xx/5xx status
    response.raise_for_status()
    return response.json()


def query_lastfm(artist, query_type=2):
    """Query LastFM for artists related to `artist`.

    query_type=1 returns the similar artists only.
    query_type=2 returns (similar_artists, top_tags, tag_top_artists).
    """
    if query_type not in (1, 2):
        raise ValueError(f'query_type {query_type} unavailable.')

    # tswift stores the name as a slug (e.g. 'frank-zappa'); drop the hyphens for LastFM
    artist_name = str(artist.name).replace('-', '')

    # Get similar artists
    resp_json = lastfm_get('artist.getsimilar', artist=artist_name)['similarartists']['artist']
    similar_artists = [similar_artist['name'] for similar_artist in resp_json]
    if query_type == 1:
        return similar_artists

    # Get top tags
    resp_json = lastfm_get('artist.getTopTags', artist=artist_name)['toptags']['tag']
    top_tags = [tag['name'] for tag in resp_json]

    # Get the top artists for the three highest-ranked tags
    tag_top_artists = []
    for tag in top_tags[:3]:
        resp_json = lastfm_get('tag.gettopartists', tag=tag)['topartists']['artist']
        tag_top_artists.extend([top_artist['name'] for top_artist in resp_json])

    return similar_artists, top_tags, tag_top_artists
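
The similar-artist list and the tag-based list returned by `query_lastfm` overlap, and the tag-based list can include the chosen artist itself. A minimal sketch of merging them into one de-duplicated list of candidates follows; `merge_artists` is a hypothetical helper, and the short sample lists stand in for real query results.

```python
def merge_artists(*artist_lists, exclude=()):
    """Merge lists of artist names, keeping first-seen order and dropping duplicates."""
    skip = {name.casefold() for name in exclude}
    seen = set()
    merged = []
    for names in artist_lists:
        for name in names:
            key = name.casefold()  # case-insensitive comparison
            if key in skip or key in seen:
                continue
            seen.add(key)
            merged.append(name)
    return merged


# Sample data standing in for the lists returned by query_lastfm
similar = ['King Crimson', 'Gong', 'Yes']
tag_top = ['Pink Floyd', 'Frank Zappa', 'King Crimson', 'Yes']
candidates = merge_artists(similar, tag_top, exclude=['Frank Zappa'])
print(candidates)
# ['King Crimson', 'Gong', 'Yes', 'Pink Floyd']
```

Keeping first-seen order means the directly similar artists stay ahead of the broader tag-based matches.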


In [15]:
similar_artists, top_tags, tag_top_artists = query_lastfm(band)

In [16]:
print(similar_artists)

['The Mothers of Invention', 'Frank Zappa & Captain Beefheart', 'King Crimson', 'Dweezil Zappa', 'Captain Beefheart & His Magic Band', 'Gong', 'Gentle Giant', 'Soft Machine', 'Van der Graaf Generator', 'Brand X', 'Jethro Tull', 'Mahavishnu Orchestra', 'Magma', 'Henry Cow', 'Adrian Belew', 'Emerson, Lake & Palmer', 'Yes', 'Caravan', 'Genesis', 'Robert Wyatt', 'Camel', 'The Residents', 'Can', 'Hatfield and the North', 'Traffic', 'Focus', 'The Who', 'Peter Hammill', 'Robert Fripp', 'Faust', 'Rush', "Manfred Mann's Earth Band", 'Jeff Beck', 'Return to Forever', "Aphrodite's Child", 'Banco del Mutuo Soccorso', 'Ween', 'Amon Düül II', 'UK', 'Allan Holdsworth', 'Todd Rundgren', 'Premiata Forneria Marconi', 'Nektar', 'Bill Bruford', 'Renaissance', 'Steve Hackett', 'Kevin Ayers', 'Procol Harum', 'Peter Gabriel', 'The Nice', 'Steve Hillage', 'Steely Dan', 'The Mars Volta', 'Pink Floyd', 'Marillion', 'Area', 'Matching Mole', 'Mr. Bungle', 'The Alan Parsons Project', 'Cream', 'Hawkwind', 'Rahsaan 

In [17]:
print(top_tags)

['Progressive rock', 'experimental', 'rock', 'jazz', 'classic rock', 'Avant-Garde', 'Progressive', 'alternative', 'psychedelic', 'zappa', 'Fusion', 'Experimental Rock', 'jazz fusion', 'Psychedelic Rock', 'Jazz Rock', 'Frank Zappa', 'art rock', 'guitar', 'american', '70s', 'comedy', 'jazz-rock', 'Classical', 'guitar virtuoso', 'genius', 'singer-songwriter', '60s', 'hard rock', 'Comedy Rock', 'blues', '80s', 'composer', 'contemporary classical', 'avantgarde', 'alternative rock', 'funk', 'prog', 'Avant-Prog', 'satire', 'prog rock', 'USA', 'proto-punk', 'instrumental', 'pop', 'weird', 'free jazz', 'seen live', 'blues rock', 'humor', 'indie', 'humour', 'electronic', 'funny', 'Awesome', 'impossible for liberals to deal with', 'avant garde', 'conservative']


In [18]:
print(tag_top_artists)

['Pink Floyd', 'Porcupine Tree', 'Rush', 'Coheed and Cambria', 'The Mars Volta', 'Genesis', 'Frank Zappa', 'Jethro Tull', 'King Crimson', 'Yes', 'Peter Gabriel', 'Mike Oldfield', 'Riverside', 'dredg', 'Steven Wilson', 'Marillion', 'The Alan Parsons Project', 'Camel', 'The Dear Hunter', 'Blackfield', 'David Gilmour', 'Roger Waters', 'Emerson, Lake & Palmer', 'Antimatter', 'Karnivool', 'Oceansize', 'The Pineapple Thief', 'Gentle Giant', 'Asia', 'Fair to Midland', 'Rishloo', 'Procol Harum', 'Lunatic Soul', 'Ozric Tentacles', 'Gazpacho', "Spock's Beard", 'Van der Graaf Generator', 'Wishbone Ash', 'Pure Reason Revolution', 'Soft Machine', 'Closure in Moscow', 'Nothing More', 'Steve Hackett', 'Focus', 'Kingston Wall', 'Amplifier', 'Gong', 'Robert Wyatt', 'Caravan', 'Eloy', 'Animal Collective', 'Portugal. The Man', 'Buckethead', 'Chelsea Wolfe', 'Panda Bear', 'Dirty Projectors', 'Bibio', 'Xiu Xiu', 'Deerhoof', 'Captain Beefheart & His Magic Band', 'Health', 'Liars', 'Shlohmo', 'The Books', 't