*Scrape radio shows for tracklist metadata.*

Online radio plays a major role in how I discover new music. I really dig how uniquely curated shows are and find myself studying tracklists, chasing down labels and trying to discern new moods and scenes from within the selections. I'd now like to explore augmenting how I discover new music through automating some of the following processes:

- finding record labels that are popular sources of music within the shows I listen to 
- finding significant periods of history to mine for music
- finding artists who perform at the same events

I will begin by scraping useful metadata from a single online radio station. NTS Radio is an obvious choice for me as it's one of the most popular online stations, has many of my favourite regular shows, already includes some of the metadata I need (record labels, tracklists) and has a website that looks reasonably straightforward to scrape.

**Goal: *For a given NTS show, return a tracklist including artist, label and year recorded for each track***.

For this task we'll need requests to grab the relevant HTML (our input data) and then beautifulsoup to parse the HTML so we can easily extract the text we're looking for: 

In [1]:
import requests
from bs4 import BeautifulSoup

If we don't include headers, sites may not allow us to access their data as they will rightly think we're attempting to scrape. We can circumvent this by 'spoofing' a header to send with our request to make it seem like the request was made by a web browser.

Many services will want to see a header in the request because it lets them know more about the client requesting the resource. For instance check out the header below:

In [2]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

Included above is a typical User-Agent header for the Chrome browser. A user agent is any software, acting on behalf of a user, which "retrieves, renders and facilitates end-user interaction with Web content". The User-Agent request header is a characteristic string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent. You might be wondering why a Chrome header has Mozilla stuff in it. This is because Mozilla/5.0 is a general token that says the browser is Mozilla-compatible, and most browsers (Chrome included) send it nowadays. Each browser has its own characteristic user-agent string, as seen [here](https://www.useragentstring.com/pages/useragentstring.php). There are all sorts of weird and wonderful browsers. Now let's attempt to grab the HTML from the following URL as an example:

In [3]:
url = "https://www.nts.live/shows/jam-city/episodes/jam-city-17th-june-2022"
# Pass the dict by keyword: the second positional argument of requests.get is `params`.
req = requests.get(url, headers=headers)
print(req)

<Response [200]>


We got an HTTP "200 OK" success status code, indicating that the request succeeded. Now if we want to actually check out the HTML received in response, we can read the `content` attribute of our response object, where the HTML is stored as bytes (the `text` attribute holds the same body decoded to a string):

In [None]:
print(req.content)
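Before moving on, it can be worth checking that the spoofed string really travels as a header. In `requests.get`, the second positional parameter is `params`, so a dictionary passed positionally ends up in the query string rather than in the headers. The sketch below prepares two requests without sending them, so nothing touches the network; the shortened User-Agent value is a placeholder, not a real browser string.

```python
import requests

url = "https://www.nts.live/shows/jam-city/episodes/jam-city-17th-june-2022"
headers = {"User-Agent": "Mozilla/5.0 (example browser string)"}  # placeholder value

# Build the requests without sending them.
as_headers = requests.Request("GET", url, headers=headers).prepare()
as_params = requests.Request("GET", url, params=headers).prepare()

print(as_headers.headers["User-Agent"])   # the string is sent as a header
print(as_headers.url == url)              # the URL is left untouched
print("User-Agent" in as_params.headers)  # no User-Agent header was set here
print(as_params.url)                      # the dict was encoded into the query string
```

Passing the dictionary as `headers=headers` is therefore the safer habit, even when a positional call happens to return a 200.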

OK, it sent back a bunch of text. Now let's parse the content of our response object with the HTML parser that beautifulsoup includes:

In [None]:
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())

Now we want to extract the tracklist from the HTML. Thankfully, NTS hasn't implemented anything that makes scraping difficult, such as dynamically generated class names. Instead we can quite easily describe what we're looking to extract from the HTML.

- Each track's data (`track_title` and `track_artist`) are stored within span elements nested within list item elements which all sit within an unordered list `<ul>` parent element.
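The structure described above can be tried out on a small hand-written snippet before touching the real page. This is only a sketch: the HTML below is a simplified stand-in, and `sample_html`, `sample_soup` and `pairs` are names made up for the illustration. Only the tag and class names mirror the NTS markup.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a tracklist (not the real page).
sample_html = """
<ul>
  <li class="track">
    <span class="track__artist">Blue Magic</span>
    <span class="track__title">Born On Halloween</span>
  </li>
  <li class="track">
    <span class="track__artist">GQ</span>
    <span class="track__title">Lies (Theo Parrish Re-Edit)</span>
  </li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
# Walk the list items inside the <ul> and read the two spans from each one.
for item in sample_soup.find("ul").find_all("li"):
    artist = item.find("span", class_="track__artist").get_text(strip=True)
    title = item.find("span", class_="track__title").get_text(strip=True)
    pairs.append((artist, title))

print(pairs)
# [('Blue Magic', 'Born On Halloween'), ('GQ', 'Lies (Theo Parrish Re-Edit)')]
```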

There are a few approaches we could take to extract the data we're looking for. A straightforward approach would simply involve finding all of the list elements and then searching within those elements for the span elements and extracting their text. To get all list elements, we could simply run `soup.find_all`: 

In [6]:
elements = soup.find_all("li")

Taking a look at the first one, we can see it includes the track title, artist and even a url contained within an `href` attribute: 

In [7]:
elements[0]

<li class="track"><a class="nts-app nts-link" data-category="Navigation" data-origin="from: tracklist" data-target="GoTo-Artist" data-track="event" href="/artists/9902-blue-magic"><span class="track__artist">Blue Magic</span></a><span class="track__artist track__artist--mobile" style="display:none">Blue Magic</span> <img alt="" src="/img/go-to.svg" style="height:.75em;width:.75em;position:relative;top:-.1em;margin-left:1px"/><br/><span class="track__title">Born On Halloween</span></li>

In [8]:
print(elements[0].find('a', href=True)['href'])

/artists/9902-blue-magic


One thing to notice is that not all tracks include a url as NTS doesn't have an artist page for every artist played on air. We could take the tracklist and correlate it with other databases (maybe Discogs or Bandcamp) to get somewhere near complete metadata but that's out of scope for this exercise. To handle the lack of url for some of the items in our tracklist, we can write a `get_url` function that handles this case:

In [9]:
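def get_url(element, track):
    # Not every track links to an artist page, so the anchor may be missing.
    link = element.find('a', href=True)
    if link is None:
        print("no url for " + track)
        return None
    return link['href']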
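To see both branches in isolation, the same idea can be run against two hand-written list items, one with a link and one without. This is a sketch only: `find_artist_url` is a hypothetical variant that returns `None` silently, and "Unknown Artist" is a placeholder rather than a track from the show.

```python
from bs4 import BeautifulSoup

def find_artist_url(element):
    """Return the href of the first link in the element, or None if there is none."""
    link = element.find("a", href=True)
    return link["href"] if link is not None else None

# One list item with an artist link, one without (both hand-written).
linked = BeautifulSoup(
    '<li class="track"><a href="/artists/9902-blue-magic">'
    '<span class="track__artist">Blue Magic</span></a></li>',
    "html.parser",
).li
unlinked = BeautifulSoup(
    '<li class="track"><span class="track__artist">Unknown Artist</span></li>',
    "html.parser",
).li

print(find_artist_url(linked))    # /artists/9902-blue-magic
print(find_artist_url(unlinked))  # None
```

Checking for `None` explicitly avoids a bare `except`, which would also hide unrelated errors.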

Now we can loop through the list items within the unordered list, extracting track info and saving it in a list:

In [10]:
tracklist = []
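# Only look at list items marked as tracks, so unrelated <li> elements are skipped.
for element in soup.find_all("li", class_="track"):
    artist = element.find("span", {"class": "track__artist"}).text
    track = element.find("span", {"class": "track__title"}).text
    url = get_url(element, track)
    # Keep only tracks that link to an artist page.
    if url:
        tracklist.append({"Artist": artist,
                          "Track": track,
                          "Url": url})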

no url for Untitled
no url for Ninja H2r Flyby
no url for Only Fans (Edit)
no url for Polyphonic Love
no url for The Pearls (Edit)
no url for Freaky
no url for Unlock It (Himera Remix)
no url for The Rain
no url for Parties In Chelsea


There are quite a lot of tracks missing metadata, but never mind. For now, here is a working tracklist of the items we can grab more metadata from:

In [11]:
tracklist

[{'Artist': 'Blue Magic',
  'Track': 'Born On Halloween',
  'Url': '/artists/9902-blue-magic'},
 {'Artist': 'GQ',
  'Track': 'Lies (Theo Parrish Re-Edit)',
  'Url': '/artists/251-gq'},
 {'Artist': 'Otha', 'Track': "I'm On Top", 'Url': '/artists/79247-otha'},
 {'Artist': 'Actress',
  'Track': 'Shadow From Tartarus',
  'Url': '/artists/355-actress'},
 {'Artist': 'Salamandos',
  'Track': 'Expand',
  'Url': '/artists/118879-salamandos'},
 {'Artist': 'Chemotex',
  'Track': 'Early Death',
  'Url': '/artists/6521-chemotex'},
 {'Artist': 'Pev, ',
  'Track': 'End Point (Stenny & Andrea Remix)',
  'Url': '/artists/375-peverelist'},
 {'Artist': 'Kendrick Lamar',
  'Track': 'United In Grief',
  'Url': '/artists/4443-kendrick-lamar'},
 {'Artist': 'Cruel Santino',
  'Track': 'War In The Trenches',
  'Url': '/artists/111265-cruel-santino'},
 {'Artist': 'Machine Woman',
  'Track': 'Camile From Ohm Makes Me Feel Loved',
  'Url': '/artists/29327-machine-woman'},
 {'Artist': 'Reeko',
  'Track': 'Massive 
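One detail in this output is the trailing comma and space in `'Pev, '`. My guess, and it is only an assumption, is that the track credits more than one artist and only the first span is being read. A small clean-up helper can tidy such names; `clean_artist` is a hypothetical name for this sketch.

```python
def clean_artist(name):
    """Strip surrounding whitespace and a trailing comma from a scraped artist name."""
    return name.strip().rstrip(",").strip()

print(repr(clean_artist("Pev, ")))       # 'Pev'
print(repr(clean_artist("Blue Magic")))  # 'Blue Magic'
```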

Now we want to get the record label for each track in the tracklist. The record label metadata can be found on the artist page at each track's Url. However, on some of these pages the metadata will initially be hidden from us, as NTS limits the number of tracks displayed on an artist's page. In those cases we'll have to click the 'MORE TRACKS' button with Selenium to load the full HTML before we can attempt scraping the metadata.

For ease, let's first solve the simpler case where we don't need Selenium and the metadata can be scraped from the HTML loaded at the Url. "Born On Halloween" by the band Blue Magic is a track we can test with, as its metadata is present in the HTML without running any JavaScript. First we need to grab the HTML for the artist page, using the Url we extracted in the previous code, then parse the HTML and search for the element containing the track.

In [13]:
url = "https://www.nts.live" + tracklist[0]['Url']
url

'https://www.nts.live/artists/9902-blue-magic'
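String concatenation works here because every scraped path starts with a slash. As a slightly more forgiving alternative, the standard library's `urljoin` resolves a relative path against a base URL whether or not the leading slash is present. `BASE_URL` is a name introduced just for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nts.live"

# A path with a leading slash, as scraped from the tracklist.
print(urljoin(BASE_URL, "/artists/9902-blue-magic"))
# https://www.nts.live/artists/9902-blue-magic

# The same path without the leading slash still resolves correctly.
print(urljoin(BASE_URL, "artists/9902-blue-magic"))
# https://www.nts.live/artists/9902-blue-magic
```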

Get the HTML for Blue Magic's artist page: 

In [None]:
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())

In [19]:
track = soup.find("div", string=tracklist[0]['Track'])
track

<div class="artist-track__name text-bold text-uppercase">Born On Halloween</div>

Now we have the div, we want to get the metadata within the parent div: 

In [21]:
label = track.parent.find_all("span")
label

[<span class="artist-track__meta__item">Blue Magic</span>,
 <span class="artist-track__meta__item">ATCO Records</span>,
 <span class="artist-track__meta__item">•</span>,
 <span class="artist-track__meta__item">1975</span>,
 <span class="artist-track__link-icon icon icon-link-arrow"></span>]

The metadata format appears to be the same for all tracks, so we can pull the record label from the same index:

In [22]:
label[1].text

'ATCO Records'
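Since the goal also asks for the year, the same list of spans can yield both values. The sketch below runs on a hand-written snippet: the spans are copied from the output above, but the wrapping `<div class="artist-track">` is an assumption about the page, and `record_label` and `year` are names chosen for the illustration. It filters on the `artist-track__meta__item` class and drops the bullet separator instead of relying on every position in the list.

```python
from bs4 import BeautifulSoup

# Spans copied from the output above; the outer <div> is assumed.
sample_html = """
<div class="artist-track">
  <div class="artist-track__name text-bold text-uppercase">Born On Halloween</div>
  <span class="artist-track__meta__item">Blue Magic</span>
  <span class="artist-track__meta__item">ATCO Records</span>
  <span class="artist-track__meta__item">•</span>
  <span class="artist-track__meta__item">1975</span>
  <span class="artist-track__link-icon icon icon-link-arrow"></span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
name_div = sample_soup.find("div", string="Born On Halloween")

# Keep only the metadata spans, then drop empty strings and the bullet separator.
items = [
    span.get_text(strip=True)
    for span in name_div.parent.find_all("span", class_="artist-track__meta__item")
]
items = [text for text in items if text and text != "•"]

# Assumed order: artist, label, then year.
record_label = items[1] if len(items) > 1 else None
# Treat the last item as a year only if it is a four-digit number.
last = items[-1] if items else ""
year = int(last) if last.isdigit() and len(last) == 4 else None

print(record_label, year)  # ATCO Records 1975
```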

Let's write a function which takes a url and a track title as input and outputs the record label:

In [23]:
def get_label(url, track):
    url = "https://www.nts.live" + url
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    track = soup.find("div", string=track)
    if track:
        label = track.parent.find_all("span")[1].text
        return label
    else: 
        print("need Selenium")
        return None

Works for the examples where we can pull the metadata from the HTML:

In [25]:
label = get_label(tracklist[1]['Url'],tracklist[1]['Track'])
label

'Ugly Edits'

But not so for examples where we need JavaScript to load the metadata:

In [26]:
label = get_label(tracklist[3]['Url'],tracklist[3]['Track'])
label

need Selenium


Let's now solve for the case where we need Selenium:

In [27]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


options = Options()
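# The `headless` attribute is deprecated in recent Selenium releases; pass the flag instead.
options.add_argument("--headless")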
service = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

Test url: 

In [None]:
driver.get("https://www.nts.live/artists/355-actress")
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup

Now we want to click the 'MORE TRACKS' button:

In [29]:
elem = driver.find_elements(By.CLASS_NAME, "artist-section__view-handler")
elem[1].click()

Although we actually want to keep clicking any other 'MORE TRACKS' buttons until the full HTML has loaded:

In [30]:
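from selenium.common.exceptions import WebDriverException

# find_element raises when the button is gone, so the lookup must sit inside the try.
while True:
    try:
        element = driver.find_element(By.XPATH, "//*[contains(text(), 'MORE TRACKS')]")
        element.click()
    except WebDriverException:
        # No button left (or it can no longer be clicked): the page is fully expanded.
        break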
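A loop like this relies on the button eventually disappearing. One defensive option is to cap the number of clicks so that a page which keeps offering the button cannot hang the scraper. The sketch below exercises only that control flow against a stand-in object rather than a real browser. `FakePage`, `click_all` and `max_clicks` are hypothetical names, and `LookupError` stands in for the exception a real driver would raise.

```python
class FakePage:
    """Stand-in for a page that shows a 'MORE TRACKS' button a fixed number of times."""

    def __init__(self, hidden_batches):
        self.hidden_batches = hidden_batches

    def find_button(self):
        # Mimic a driver lookup: raise once the button is no longer on the page.
        if self.hidden_batches == 0:
            raise LookupError("no 'MORE TRACKS' button")
        return self

    def click(self):
        # Each click reveals one more batch of tracks.
        self.hidden_batches -= 1


def click_all(page, max_clicks=50):
    """Click until the button disappears or the cap is reached; return the click count."""
    for clicks in range(max_clicks):
        try:
            page.find_button().click()
        except LookupError:
            return clicks
    return max_clicks


print(click_all(FakePage(hidden_batches=3)))                 # 3
print(click_all(FakePage(hidden_batches=10), max_clicks=4))  # 4
```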

Full HTML after clicking: 

In [None]:
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup

Great! We can see the HTML includes the tracks that were previously hidden. Now we can adjust our original function to account for the examples where Selenium is required:

In [33]:
from selenium.common.exceptions import WebDriverException

def get_full_webpage(url, track):
    options = Options()
    options.add_argument("--headless")
    service = Service(executable_path=ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        # Click 'MORE TRACKS' until the button can no longer be found or clicked.
        while True:
            try:
                element = driver.find_element(By.XPATH, "//*[contains(text(), 'MORE TRACKS')]")
                element.click()
            except WebDriverException:
                break
        soup = BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
    track_info = soup.find("div", string=track)
    if track_info:
        return track_info.parent.find_all("span")[1].text
    return None

def get_label(url, track):
    url = "https://www.nts.live" + url
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    track_info = soup.find("div", string=track)
    if track_info:
        return track_info.parent.find_all("span")[1].text
    # Fall back to Selenium when the track is hidden behind 'MORE TRACKS'.
    return get_full_webpage(url, track)

Now we can test and see if it grabs the label for Actress' Shadow From Tartarus which was previously hidden:

In [34]:
label = get_label(tracklist[3]['Url'],tracklist[3]['Track'])
label

"Honest Jon's Records"

We can now loop through the tracklist and find the rest of the labels: 

In [36]:
for track in tracklist:
    label = get_label(track['Url'], track['Track'])
    # An f-string also copes with label being None, which concatenation would not.
    print(f"{track['Artist']} - {track['Track']} [{label}]")

Blue Magic - Born On Halloween [ATCO Records]
GQ - Lies (Theo Parrish Re-Edit) [Ugly Edits]
Otha - I'm On Top [Not On Label (Otha (2) Self-released)]
Actress - Shadow From Tartarus [Honest Jon's Records]
Salamandos - Expand [Crème Organization]
Chemotex - Early Death [The Trilogy Tapes]
Pev,  - End Point (Stenny & Andrea Remix) [Livity Sound]
Kendrick Lamar - United In Grief [pgLang, Top Dawg Entertainment, Aftermath/Interscope Records]
Cruel Santino - War In The Trenches [Monster Boy, Interscope Records]
Machine Woman - Camile From Ohm Makes Me Feel Loved [Technicolour]
Reeko - Massive Garage Meetings [Avian]
jamesjamesjames - My Purple iPod Nano [Shall Not Fade]
Astrud Gilberto - Touching You [Perception Records]
DJ Gigola,  - Papi [Live From Earth Klub]
Broosnica - Vaporizer [Hyperboloid]
Section 25 - Be Brave [Factory]
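With labels in hand, a first step towards finding popular labels is simply to count them. The sketch below uses a few label strings copied from the output above and assumes that a comma followed by a space separates co-releasing labels, which may not hold for every entry. Over a single show every count is 1; the counts only become interesting once several shows are combined.

```python
from collections import Counter

# A few label strings copied from the output above.
labels = [
    "ATCO Records",
    "Ugly Edits",
    "pgLang, Top Dawg Entertainment, Aftermath/Interscope Records",
    "Monster Boy, Interscope Records",
]

counts = Counter()
for entry in labels:
    # Assumption: ", " separates multiple labels on one release.
    counts.update(part.strip() for part in entry.split(", "))

for name, count in counts.most_common():
    print(f"{count}  {name}")
```

Note that "Interscope Records" and "Aftermath/Interscope Records" are counted separately here; merging such variants would need extra normalisation.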
