## Scraping Lyrics Using MetroLyrics and LastFM
###### Updated on 11/12/2019
----------------

#### To create conda environment from file:

1. `cd` into `pyGhostWriter` repo
2. `conda env create --file environment.yml`
3. (Optional) `source ~/.bashrc`

In [1]:
!pip install --upgrade pip; pip install python-decouple; pip install tswift;

Collecting pip
  Downloading https://files.pythonhosted.org/packages/00/b6/9cfa56b4081ad13874b0c6f96af8ce16cfbc1cb06bedf8e9164ce5551ec1/pip-19.3.1-py2.py3-none-any.whl (1.4MB)
    100% |████████████████████████████████| 1.4MB 12.3MB/s eta 0:00:01
Installing collected packages: pip
  Found existing installation: pip 10.0.1
    Uninstalling pip-10.0.1:
      Successfully uninstalled pip-10.0.1
Successfully installed pip-19.3.1
Collecting python-decouple
  Downloading https://files.pythonhosted.org/packages/9b/99/ddfbb6362af4ee239a012716b1371aa6d316ff1b9db705bfb182fbc4780f/python-decouple-3.1.tar.gz
Building wheels for collected packages: python-decouple
  Building wheel for python-decouple (setup.py) ... done
  Created wheel for python-decouple: filename=python_decouple-3.1-cp36-none-any.whl size=7922 sha256=ae081e93b76c93b0e4aa459fe25f61df0ab8f9c246f60d63f29514f426a3063a
  Stored in directory: /home/ec2-user/.cache/pip/wheels/0f/ee/80/75b684060dc6ecc5a28c07b75e

In [2]:
import string
import random

import numpy as np
import pandas as pd
import requests as r
from tswift import Artist, Song
from decouple import config

# LastFM
API_KEY = config('LASTFM_API_KEY', cast=str)

### MetroLyrics with `tswift` Usage Example

`tswift` is currently broken, and I suspect it's due to a change in the MetroLyrics API. I've submitted an issue on GitHub, but until that is resolved, another solution needs to be implemented to retrieve lyrics for a given song or artist. **TODO**

In [3]:
# Verify tswift is operational
the_cure = Artist('The Cure')
the_cure.songs[:10]

[Song(title='I Dont Know Whats Going On', artist='The Cure'),
 Song(title='1015 Saturday Night', artist='The Cure'),
 Song(title='13Th', artist='The Cure'),
 Song(title='2 Late', artist='The Cure'),
 Song(title='39', artist='The Cure'),
 Song(title='A Boy I Never Knew', artist='The Cure'),
 Song(title='A Chain Of Flowers', artist='The Cure'),
 Song(title='A Few Hours After This', artist='The Cure'),
 Song(title='A Foolish Arrangement', artist='The Cure'),
 Song(title='A Forest', artist='The Cure')]

In [4]:
# <|endoftext|> is the GPT-2 training data delimiter
song = random.choice(the_cure.songs)
print(song.lyrics+'\n<|endoftext|>')

You wear your smile
Like it was going out of fashion
Dress to inflame
But douse any ideas of passion

You carry your love in a trinket
Hanging round your throat
Always inviting, always exciting
But I must not take off my coat

Well I'm tired of hanging around
I want somebody new
I'm not sure who I've got in mind
but I know

That it's not you
That it's not you

You ask me questions
That I never wanted to hear
I am the only one
Just until you finish this year

I would murder you if I had an alibi
Here in my hand
You just laugh
Because you don't understand

I'm tired of hanging around
I want somebody new
I'm not sure who I've got in mind
But I know that it's not you

That it's not you
That it's not you
That it's not you
That it's not you

That it's not you
No it's not you
That it's not you
It's not you
It's not you
<|endoftext|>
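
Because `<|endoftext|>` marks the boundary between documents in GPT-2 training data, the same pattern extends from one song to many. The following is a minimal sketch, assuming the lyrics are already available as plain strings. `build_corpus` is a hypothetical helper (not part of `tswift`), and skipping empty lyrics is an assumption about what a failed lookup might return.

```python
DELIMITER = '<|endoftext|>'

def build_corpus(lyrics_list, delimiter=DELIMITER):
    """Join song lyrics into one training string, with one delimiter after each song."""
    chunks = []
    for lyrics in lyrics_list:
        # Skip missing or whitespace-only lyrics (assumed possible for failed lookups).
        if not lyrics or not lyrics.strip():
            continue
        chunks.append(lyrics.strip() + '\n' + delimiter)
    return '\n'.join(chunks)

corpus = build_corpus(['First song line', '', 'Second song line'])
print(corpus)
```

Each song ends with its own delimiter line, matching the single-song output above.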


In [5]:
# Create dataframe of all songs and lyrics for The Cure
# Note: songs that share a title collapse into a single dictionary entry
lyrics_dict = {s.title: s.lyrics for s in the_cure.songs}
lyrics_zipped = list(lyrics_dict.items())

In [6]:
lyrics_df = pd.DataFrame(lyrics_zipped, columns=['title', 'lyrics'])

print('Observations/Songs: ', len(lyrics_df))
lyrics_df.head()

Observations/Songs:  301


|   | title | lyrics |
|---|-------|--------|
| 0 | I Dont Know Whats Going On | I don't know what's going on\nI am so up close... |
| 1 | 1015 Saturday Night | 10.15 on a Saturday night\nAnd the tap\nDrips ... |
| 2 | 13Th | 'Everyone feels good in the room,' she swings\... |
| 3 | 2 Late | So I'll wait for you\nWhere I always wait\nBeh... |
| 4 | 39 | So the fire is almost out and there's nothing ... |


### LastFM Query Examples

The API expects UTF-8 encoded strings. Python 3 strings are Unicode and encode to UTF-8 by default, so no extra conversion is needed before building a query.
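
To make the encoding concrete, here is a small sketch that builds a request URL with the standard library instead of an f-string. `build_lastfm_url` and `LASTFM_ROOT` are hypothetical names introduced for illustration, and `'YOUR_KEY'` is a placeholder; the endpoint and parameter names are the ones used in the queries in this notebook.

```python
from urllib.parse import quote, urlencode

LASTFM_ROOT = 'http://ws.audioscrobbler.com/2.0/'

def build_lastfm_url(method, api_key, **params):
    """Return a LastFM request URL with every value percent-encoded as UTF-8."""
    query = {'method': method, **params, 'api_key': api_key, 'format': 'json'}
    # urlencode encodes str values as UTF-8 by default; quote_via=quote writes spaces as %20.
    return LASTFM_ROOT + '?' + urlencode(query, quote_via=quote)

url = build_lastfm_url('artist.getsimilar', 'YOUR_KEY', artist='Amon Düül II')
print(url)
```

Non-ASCII characters and spaces in an artist name are escaped automatically, so names do not have to be flattened by hand (as in `artist=thecure`) before they go into the URL.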

In [7]:
# Query artists similar to The Cure
# Running LastFM queries requires your own API key (loaded above into API_KEY)
response = r.get(f'http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar&artist=thecure&api_key={API_KEY}&format=json')
response

<Response [200]>

In [8]:
resp_json = response.json()['similarartists']['artist']

print([band['name'] for band in resp_json])

['New Order', 'Bauhaus', 'Joy Division', 'Siouxsie and the Banshees', 'The Glove', 'Echo & the Bunnymen', 'The Smiths', 'Depeche Mode', 'The Sisters of Mercy', 'Peter Murphy', 'The Chameleons', 'The Jesus and Mary Chain', 'The Church', 'Cocteau Twins', 'Love And Rockets', 'Morrissey', 'Nick Cave & The Bad Seeds', 'Killing Joke', 'The Sound', 'The Psychedelic Furs', 'Talking Heads', 'The Mission', 'Public Image Ltd.', 'Christian Death', 'Simple Minds', 'Xmal Deutschland', 'New Model Army', 'Sonic Youth', 'Sad Lovers and Giants', 'Modern English', 'Japan', 'Gang of Four', 'Tones on Tail', 'The Creatures', 'The Stranglers', 'Devo', 'The The', 'Pixies', 'Gene Loves Jezebel', 'Adam and the Ants', 'Duran Duran', 'XTC', 'Wire', 'The Damned', "The B-52's", 'Orchestral Manoeuvres in the Dark', 'R.E.M.', 'U2', 'Oingo Boingo', 'Magazine', 'Placebo', 'Alien Sex Fiend', 'Fad Gadget', 'She Wants Revenge', 'The Pretenders', 'Tears for Fears', 'The Soft Moon', 'Red Lorry Yellow Lorry', 'Television', '
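
Indexing straight into `response.json()` raises a bare `KeyError` if the call failed. A more defensive way to pull the names out might look like the sketch below. It assumes a failed call returns a JSON body with `error` and `message` keys; `extract_names` is a hypothetical helper, and the payloads here are hand-written stand-ins rather than real API responses.

```python
def extract_names(payload, outer_key, inner_key):
    """Return the 'name' of each item in a LastFM list payload, or raise on an error payload."""
    # Assumption: a failed call returns a JSON body with 'error' and 'message' keys.
    if 'error' in payload:
        raise RuntimeError(f"LastFM error {payload['error']}: {payload.get('message', '')}")
    # A missing key yields an empty list instead of a KeyError.
    items = payload.get(outer_key, {}).get(inner_key, [])
    return [item['name'] for item in items]

ok_payload = {'similarartists': {'artist': [{'name': 'New Order'}, {'name': 'Bauhaus'}]}}
names = extract_names(ok_payload, 'similarartists', 'artist')
print(names)
```

The same helper would cover the `toptags`/`tag` and `topartists`/`artist` payloads used in the following cells.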

In [9]:
# Query top tags for The Cure
response = r.get(f'http://ws.audioscrobbler.com/2.0/?method=artist.getTopTags&artist=thecure&api_key={API_KEY}&format=json')
response

<Response [200]>

In [10]:
resp_json = response.json()['toptags']['tag']

print([tag['name'] for tag in resp_json])

['post-punk', 'new wave', 'alternative', '80s', 'rock', 'seen live', 'alternative rock', 'goth', 'british', 'indie', 'Gothic Rock', 'Gothic', 'The Cure', 'pop', 'Post punk', 'darkwave', 'punk', 'goth rock', 'indie rock', 'classic rock', 'dark', 'UK', '90s', 'britpop', 'cold wave', 'electronic', 'melancholic', 'dark wave', 'male vocalists', 'Cure', '70s', 'favorites', 'english', "80's", 'synth pop', 'Love', 'emo', 'england', 'robert smith', 'punk rock', 'synthpop', 'psychedelic', 'indie pop', 'melancholy']
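
The tag list contains several spellings of the same tag (`'post-punk'` and `'Post punk'`, `'80s'` and `"80's"`, `'synth pop'` and `'synthpop'`). If those variants should count once, one option is to compare tags on a normalized key. This is only a sketch: `dedupe_tags` is a hypothetical helper, and the normalization rule (lowercase, drop everything except letters and digits) is an assumption that may merge tags more aggressively than intended.

```python
import re

def dedupe_tags(tags):
    """Keep the first spelling of each tag, comparing case- and punctuation-insensitively."""
    seen = set()
    unique = []
    for tag in tags:
        # Hypothetical normalization: lowercase, then strip non-alphanumeric characters.
        key = re.sub(r'[\W_]+', '', tag.lower())
        if key and key not in seen:
            seen.add(key)
            unique.append(tag)
    return unique

sample_tags = ['post-punk', 'new wave', '80s', 'Post punk', 'darkwave',
               'dark wave', "80's", 'synth pop', 'synthpop']
deduped = dedupe_tags(sample_tags)
print(deduped)
```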


In [11]:
# Query top artists in the post-punk top tag
response = r.get(f'http://ws.audioscrobbler.com/2.0/?method=tag.gettopartists&tag=post-punk&api_key={API_KEY}&format=json')
response

<Response [200]>

In [12]:
resp_json = response.json()['topartists']['artist']

print([band['name'] for band in resp_json])

['The Cure', 'Joy Division', 'Nick Cave & The Bad Seeds', 'Swans', 'Siouxsie and the Banshees', 'Echo & the Bunnymen', 'Bauhaus', 'She Wants Revenge', 'Wire', 'Parquet Courts', 'The Fall', 'Killing Joke', 'Gang of Four', 'Motorama', 'Iceage', 'Television', 'Idles', 'The Chameleons', 'Public Image Ltd.', 'Savages', 'These New Puritans', 'Les Savy Fav', 'The Soft Moon', 'Protomartyr', 'New Model Army', 'The The', 'Mission of Burma', 'Suicide', 'Felt', 'Lebanon Hanover', 'The Durutti Column', 'The Birthday Party', 'Crystal Stilts', 'Peter Murphy', 'The Gun Club', 'The Sound', 'Orange Juice', 'The Feelies', 'Pere Ubu', 'Magazine', 'Young Marble Giants', 'Sleaford Mods', 'Ought', 'DRAB MAJESTY', 'Love And Rockets', 'Preoccupations', 'Tuxedomoon', 'The Clean', 'Буерак', 'Television Personalities']


### Combining LastFM and MetroLyrics

In [13]:
# Choose a band, and instantiate the Artist class for it, if it exists.
user_band = input('Input artist (ex: The Cure): ')

try:
    band = Artist(user_band)
    if len(band.songs) == 0:
        raise ValueError('No songs for this artist.')
except Exception as e:
    print(f'Exception occurred: {e}')
else:
    print(f'Your artist has been chosen: {band}. Some of their songs include:')
    print('* '+'\n* '.join([song.title for song in band.songs[:10]]))

Input artist (ex: The Cure):  Frank Zappa


Your artist has been chosen: Artist('frank-zappa'). Some of their songs include:
* No No No
* Quotno No No Quot
* Ere Ian Whips Itjcb Spits Itmotorhead Rips It
* 14 Tone Unit
* 200 Motels Finale
* 200 Years Old
* 50 50
* 98 Objects
* A Cold Dark Matter
* A Diffrent Octave


In [14]:
# Before collecting any lyrics, build a list of all artists similar to the chosen artist.
LASTFM_URL = 'http://ws.audioscrobbler.com/2.0/'

def lastfm_get(method, **params):
    # Send one LastFM request; passing params lets requests URL-encode artist and tag names.
    params.update(method=method, api_key=API_KEY, format='json')
    response = r.get(LASTFM_URL, params=params)
    if response.status_code != 200:
        raise IOError(f'LastFM returned HTTP {response.status_code} ({response.reason})')
    return response.json()

def query_lastfm(artist, query_type=2):
    # query_type=1: similar artists only.
    # query_type=2: similar artists, the artist's top tags, and the top artists of the first three tags.
    if query_type not in (1, 2):
        raise ValueError(f'query_type {query_type} unavailable.')

    artist_name = str(artist.name).replace('-', '')

    # Get similar artists
    resp_json = lastfm_get('artist.getsimilar', artist=artist_name)['similarartists']['artist']
    similar_artists = [similar_artist['name'] for similar_artist in resp_json]
    if query_type == 1:
        return similar_artists

    # Get top tags
    resp_json = lastfm_get('artist.getTopTags', artist=artist_name)['toptags']['tag']
    top_tags = [tag['name'] for tag in resp_json]

    # Get the top artists for the first three tags
    tag_top_artists = []
    for tag in top_tags[:3]:
        resp_json = lastfm_get('tag.gettopartists', tag=tag)['topartists']['artist']
        tag_top_artists.extend(top_artist['name'] for top_artist in resp_json)

    return similar_artists, top_tags, tag_top_artists


In [15]:
similar_artists, top_tags, tag_top_artists = query_lastfm(band)

In [16]:
print(similar_artists)

['The Mothers of Invention', 'Frank Zappa & Captain Beefheart', 'Captain Beefheart & His Magic Band', 'King Crimson', 'Dweezil Zappa', 'Gentle Giant', 'Gong', 'Van der Graaf Generator', 'Soft Machine', 'Brand X', 'Henry Cow', 'Mahavishnu Orchestra', 'Jethro Tull', 'Magma', 'Emerson, Lake & Palmer', 'Genesis', 'Yes', 'Adrian Belew', 'Hatfield and the North', 'Caravan', 'Robert Wyatt', 'Camel', 'Robert Fripp', 'The Residents', 'Traffic', 'Focus', 'Can', 'Peter Hammill', 'UK', 'Todd Rundgren', 'Jeff Beck', 'Faust', 'Rush', 'Allan Holdsworth', 'Ween', 'The Who', 'Premiata Forneria Marconi', 'Return to Forever', 'Bill Bruford', 'Peter Gabriel', 'Nektar', 'Amon Düül II', 'Banco del Mutuo Soccorso', "Aphrodite's Child", 'Renaissance', 'The Nice', 'Kevin Ayers', 'Steve Hackett', "Manfred Mann's Earth Band", 'Steely Dan', 'Area', 'Matching Mole', 'Procol Harum', 'Marillion', 'Ozric Tentacles', 'Steve Hillage', 'John Zorn', 'Cream', 'Mr. Bungle', 'Jimi Hendrix', 'Neil Young', 'The Mars Volta', '

In [17]:
print(top_tags)

['Progressive rock', 'experimental', 'rock', 'jazz', 'classic rock', 'Avant-Garde', 'Progressive', 'alternative', 'psychedelic', 'zappa', 'Fusion', 'Experimental Rock', 'jazz fusion', 'Psychedelic Rock', 'Jazz Rock', 'Frank Zappa', 'art rock', 'guitar', 'american', '70s', 'comedy', 'jazz-rock', 'Classical', 'guitar virtuoso', 'genius', 'singer-songwriter', '60s', 'hard rock', 'Comedy Rock', 'blues', '80s', 'composer', 'contemporary classical', 'avantgarde', 'alternative rock', 'funk', 'prog', 'Avant-Prog', 'satire', 'prog rock', 'USA', 'proto-punk', 'instrumental', 'pop', 'weird', 'free jazz', 'seen live', 'blues rock', 'humor', 'indie', 'humour', 'electronic', 'funny', 'Awesome', 'impossible for liberals to deal with', 'avant garde', 'conservative']


In [18]:
print(tag_top_artists)

['Pink Floyd', 'Porcupine Tree', 'Rush', 'Coheed and Cambria', 'The Mars Volta', 'Genesis', 'Frank Zappa', 'Jethro Tull', 'King Crimson', 'Yes', 'Peter Gabriel', 'Mike Oldfield', 'Riverside', 'dredg', 'Steven Wilson', 'Marillion', 'The Alan Parsons Project', 'Camel', 'The Dear Hunter', 'Blackfield', 'David Gilmour', 'Roger Waters', 'Emerson, Lake & Palmer', 'Antimatter', 'Karnivool', 'Oceansize', 'The Pineapple Thief', 'Gentle Giant', 'Asia', 'Fair to Midland', 'Rishloo', 'Procol Harum', 'Lunatic Soul', 'Ozric Tentacles', 'Gazpacho', "Spock's Beard", 'Van der Graaf Generator', 'Wishbone Ash', 'Pure Reason Revolution', 'Soft Machine', 'Closure in Moscow', 'Nothing More', 'Steve Hackett', 'Focus', 'Kingston Wall', 'Amplifier', 'Gong', 'Robert Wyatt', 'Caravan', 'Eloy', 'Animal Collective', 'Portugal. The Man', 'Buckethead', 'Chelsea Wolfe', 'Panda Bear', 'Dirty Projectors', 'Bibio', 'Xiu Xiu', 'Deerhoof', 'Captain Beefheart & His Magic Band', 'Health', 'Liars', 'Shlohmo', 'The Books', 't
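
The similar-artist and tag-top-artist lists overlap (for example, 'King Crimson' and 'Gentle Giant' appear in both), and the tag list also contains the chosen artist. Before collecting lyrics, the lists could be merged into a single candidate list. This is a minimal sketch under those assumptions: `merge_candidates` is a hypothetical helper, and the short input lists are excerpts of the outputs above.

```python
def merge_candidates(chosen, *artist_lists):
    """Merge artist lists in order, dropping duplicates and the chosen artist itself."""
    seen = {chosen.casefold()}  # case-insensitive comparison key
    merged = []
    for artists in artist_lists:
        for name in artists:
            key = name.casefold()
            if key not in seen:
                seen.add(key)
                merged.append(name)
    return merged

similar = ['The Mothers of Invention', 'King Crimson', 'Gentle Giant']
tag_top = ['Pink Floyd', 'Frank Zappa', 'King Crimson', 'Gentle Giant', 'Rush']
candidates = merge_candidates('Frank Zappa', similar, tag_top)
print(candidates)
```

Similar artists come first because they are passed first, so the order of the arguments sets the priority of the sources.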