In [1]:
import spotipy
import sys
sp = spotipy.Spotify()
import spotipy.util as util
import pandas as pd

import pandas as pd
import urllib
from bs4 import BeautifulSoup
import lxml

import re

# Spotify Data Scraping, A Quick Guide

The goal of this tutorial  is to show you how you can use Python to analyse data relating to music that is streamed from spotify. This tutorial makes use of a couple of important packages, 'Beautiful Soup' and 'Spotipy'. Whilst there are many tutorials floating around the web that are aimed at showing us how you can use this packages to link with apps and dashboards, there isn't really anything for the purposes of data analysis. So here it is !

## Audio Features
There is an interesting element to the Spotify API and that is that tracks that are featured on the music sharing service actually have 'features' such as 'danceability','liveness' and 'loudness' attached to them. The meaning of these features is in the link provided, below we have some code that will allow us to extract these features.

https://beta.developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/



## Authentication
In order to gain access to the Spotify API we need to authorise the client. Visit this link: https://beta.developer.spotify.com/documentation/general/guides/authorization-guide/ to obtain a 'client credentials flow'. That is where the client id numbers come from.

In [2]:
#in order to gain accesss to spotify API we needed to authorise client 

from spotipy.oauth2 import SpotifyClientCredentials

client_credentials_manager = SpotifyClientCredentials(client_id='a04614d3530b4d44b45fa7da39c576c0', client_secret='082376f38e3645238d96a72a3f317757')
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)


## Spotify URIs

Spotify identifies it's songs, artists, albums and even playlists through objects known as Uniform Resource identifiers or URIs. Each of the items listed above has attached to it a unique URI. These URIs are an important object because they allow us to identify items within Spotify and access data. For example Kanye West's unique artist URI is: spotify:artist:5K4W6rqBFWDnAN6FQUkS6x, and as we will see below, using this URI we are able to create a list of all of Kanye West's albums

In [7]:
#creates a list of all of Kanye West's albums

#kanye_west = 'spotify:artist:5K4W6rqBFWDnAN6FQUkS6x';

#results = sp.artist_albums(kanye_west, album_type = 'album');
#albums = results['items'];
#while results['next']:
    #results = sp.next(results);
   # albums.extend(results['items']);

#for album in albums: 
       #print(album['name'])

The problem with URIs is that we need to individually enter them as a string as the argument of some function that Spotipy allows us to use. This is really impractical. Imagine for example, you wanted to find information regarding the songs of each of these albums, that is a lot of URIs, and recovering them individually requires you to right click on an artist/song/playlist within the app and 'copy' the URI. So, we want to use a method which is a little more efficient, and that's where the data scraping comes in.

## Data Scraping

To demonstrate, I am going to find all of the songs in DZ Deathray's 'Total Meltdown' (great album), and am going to extract data regarding those internal Spotify 'audio features' I talked about earlier. Note that the link that I am using is not from the desktop app, but from Spotify's online app (just google Spotify online) this is an important difference as our webscraper is designed for website URLs.

In [8]:
link = "https://open.spotify.com/album/3UFYpiUu30pfh8qg24JfVG" #this is the URI of the album

This is a webscraper, what it does is takes URL links and then 'parses' them. Parsing is a process whereby some object is taken and it's useful data, extracted. With HTML parsing, what we are doing is reading a webpage and extracting the data in a way that allows us to use information on the page meaningfully.

In [9]:
page = urllib.request.urlopen(link);
soup = BeautifulSoup(page, "html.parser");
print(soup.prettify());

<!DOCTYPE html>
<html class="no-css3-filters no-focus-outline" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Bloody Lovely by DZ Deathrays on Spotify
  </title>
  <meta name="google" value="notranslate"/>
  <meta content="width=device-width,initial-scale=1,maximum-scale=1,viewport-fit=cover" name="viewport"/>
  <meta content="Spotify" property="og:site_name"/>
  <meta content="174829003346" property="fb:app_id"/>
  <meta content="Bloody Lovely, an album by DZ Deathrays on Spotify" property="description"/>
  <meta content="Bloody Lovely" property="og:title"/>
  <meta content="Bloody Lovely, an album by DZ Deathrays on Spotify" property="og:description"/>
  <meta content="https://open.spotify.com/album/3UFYpiUu30pfh8qg24JfVG" property="og:url"/>
  <meta content="https://i.scdn.co/image/a824851f2052567acca69a35b88b9c54e0ee555d" property="og:image"/>
  <meta content="music.album" property="og:type"/>
  <meta content="https://open.spotify.com/artist/0qGPycvPHafmEPTOm4M7Tu" prope

Note that here, the information in the link is now manageable, it is split into 'tags' which demarcate certain groupings of text on the page, in this way we are able to access parts of the text by calling the tag that is attached to it. For example the find_all function and the "div" tag show us:

In [6]:
soup.find_all("div");

There are tags that are likely to contain your data, and the following link will help you figure out which tags those may be https://www.w3schools.com/tags/ref_byfunc.asp. You can however, also do this by brute force, having a look through the data and seeing what patterns occur in terms of tags and what it is that you are looking for.

For example, I noticed that the "span" tag was attached to the text that contained the names of the songs. I also saw that the 'class' of what I was looking for was the 'track-name', which is where the second part of the find_all function comes from. The for loop is essentially going through the above text and returning the raw track names. The text.strip function is useful for this sort of stuff because it 'strips' the text of uncessary preffixes and suffixes that usually are associated with HTMl text.

In [10]:
#obtaining the titles

title_DZ = []

for tag in soup.find_all('span',{'class':'track-name'}):
    title_DZ.append(tag.text.strip())

print(title_DZ)

['Shred For Summer', 'Total Meltdown', 'Feeling Good, Feeling Great', 'Like People', 'High', 'Guillotine', 'Bad Influence', 'Over It', 'Back & Forth', 'Afterglow', 'Witchcraft Pt. II']


Similarly, we can obtain the duration of these songs.

In [8]:
#obtaining the song lengths

time_DZ = []

for tag in soup.findAll('span',{'class':'total-duration'}):
    time_DZ.append(tag.text.strip())

## Obtaining the URIs

**Disclaimer**: There are probably more elegant solutions to this problem, and I am not claiming to have figured out the most efficient way of doing this, but sometimes data analysis is about getting your hands dirty - an ugly solution to a problem is better than no solution to a problem.

After obtaining all of the album names, I used the 'Spotipy' package (more information on documentation: http://spotipy.readthedocs.io/en/latest/) to extract a string that contained information about each song. From this, using the fact that all URIs have a common trait (begining with "spotify:track"), you are able to loop through this string (for each song in the album) in order to obtain the URI for each of these songs.

In [10]:
#obtaining URIS

URI_DZ = []
new_name = []
new_URI = []
artist = "DZ Deathrays"
album = "Total Meltdown"


for title in title_DZ: #iterates through the list of titles we have obtained
    search_query = title + ' ' + artist
    result = sp.search(search_query)
    #for each title we have a string that contains information about the song
    result_string = str(result).rsplit(' ') #split allows us to split the entire string into substrings (each word rather than one big one)
    for string in result_string:
        if 'spotify:track' in string: #because uris are attached to this substring, any string that contains this will be put into our list
            #if the string contains
            string = string.replace('}],', '') #getting rid of unwanted suffixes which don't allow us to use the strings in a function
            string = string.replace('},', '')
            string = string.replace('\'', '')
            URI_DZ.append(string)
            for i in range(len(URI_DZ)):
                new_name.append(sp.track(URI_DZ[i])['name'])
            for i in range(len(URI_DZ)):
                new_URI.append(sp.track(URI_DZ[i])['uri'])
        
DZ_Deathrays = pd.DataFrame(
    {'Track Name': new_name,
     'URI': new_URI
    })



In [11]:
pig = DZ_Deathrays[DZ_Deathrays['Track Name'].str.contains("Total Meltdown")
pig

SyntaxError: invalid syntax (<ipython-input-11-32979c4663ff>, line 2)

In [None]:
DZ_Deathrays = pd.DataFrame(
    {'Track Name': title_DZ,
     'Length': time_DZ,
     'URI': URI_DZ
    })

In [112]:
URI_DZ[1];

"'spotify:track:7vg2AbsnHaGfZsHowompNt'}],"

In [None]:
title = "Shred for Summer"
artist = "DZ Deathrays"
album = "Bloody Lovely"

q = album


q="name:DZ Deathrays&type=album"
    
sp.search(q)

In [65]:
URI_DZ1 = []
URI_DZ2 = []
artist = "DZ Deathrays"
album = "Total Meltdown"


search_query = artist + ' ' 
result = sp.search(search_query)
#for each title we have a string that contains information about the song
result_string = str(result).rsplit(' ') #split allows us to split the entire string into substrings (each word rather than one big one)
for string in result_string:
        if 'spotify:track' in string: #because uris are attached to this substring, any string that contains this will be put into our list
            #if the string contains
            URI_DZ1.append(string)
        if 'name' in string:
            URI_DZ2.append(string)

In [93]:
print(result)

{'tracks': {'href': 'https://api.spotify.com/v1/search?query=Witchcraft+Pt.+IIDZ+Deathrays&type=track&offset=0&limit=10', 'items': [], 'limit': 10, 'next': None, 'offset': 0, 'previous': None, 'total': 0}}


In [78]:
result  
    

{'tracks': {'href': 'https://api.spotify.com/v1/search?query=Witchcraft+Pt.+IIDZ+Deathrays&type=track&offset=0&limit=10',
  'items': [],
  'limit': 10,
  'next': None,
  'offset': 0,
  'previous': None,
  'total': 0}}

In [110]:
sp.track('spotify:track:7vg2AbsnHaGfZsHowompNt')['name']

'Shred For Summer'

In [101]:
URI_DZ[1]

"'spotify:track:7vg2AbsnHaGfZsHowompNt'}],"

In [87]:
artist = "DZ Deathrays" #iterates through the list of titles we have obtained
search_query = "Shred for Summer" + ' ' + artist
result_test = sp.search(search_query)

{'tracks': {'href': 'https://api.spotify.com/v1/search?query=Shred+for+Summer+DZ+Deathrays&type=track&offset=0&limit=10',
  'items': [{'album': {'album_type': 'single',
     'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/0qGPycvPHafmEPTOm4M7Tu'},
       'href': 'https://api.spotify.com/v1/artists/0qGPycvPHafmEPTOm4M7Tu',
       'id': '0qGPycvPHafmEPTOm4M7Tu',
       'name': 'DZ Deathrays',
       'type': 'artist',
       'uri': 'spotify:artist:0qGPycvPHafmEPTOm4M7Tu'}],
     'available_markets': ['AR',
      'AU',
      'BO',
      'BR',
      'CL',
      'CO',
      'CR',
      'DO',
      'EC',
      'GT',
      'HK',
      'HN',
      'ID',
      'IL',
      'JP',
      'MX',
      'MY',
      'NI',
      'NZ',
      'PA',
      'PE',
      'PH',
      'PY',
      'SG',
      'SV',
      'TH',
      'TR',
      'TW',
      'US',
      'UY',
      'VN',
      'ZA'],
     'external_urls': {'spotify': 'https://open.spotify.com/album/5nPBN2WPwVZbvMUJ7RnBpx'},

In [89]:
result_string_test = str(result_test).rsplit(' ');
result_string_test;

["{'tracks':",
 "{'href':",
 "'https://api.spotify.com/v1/search?query=Shred+for+Summer+DZ+Deathrays&type=track&offset=0&limit=10',",
 "'items':",
 "[{'album':",
 "{'album_type':",
 "'single',",
 "'artists':",
 "[{'external_urls':",
 "{'spotify':",
 "'https://open.spotify.com/artist/0qGPycvPHafmEPTOm4M7Tu'},",
 "'href':",
 "'https://api.spotify.com/v1/artists/0qGPycvPHafmEPTOm4M7Tu',",
 "'id':",
 "'0qGPycvPHafmEPTOm4M7Tu',",
 "'name':",
 "'DZ",
 "Deathrays',",
 "'type':",
 "'artist',",
 "'uri':",
 "'spotify:artist:0qGPycvPHafmEPTOm4M7Tu'}],",
 "'available_markets':",
 "['AR',",
 "'AU',",
 "'BO',",
 "'BR',",
 "'CL',",
 "'CO',",
 "'CR',",
 "'DO',",
 "'EC',",
 "'GT',",
 "'HK',",
 "'HN',",
 "'ID',",
 "'IL',",
 "'JP',",
 "'MX',",
 "'MY',",
 "'NI',",
 "'NZ',",
 "'PA',",
 "'PE',",
 "'PH',",
 "'PY',",
 "'SG',",
 "'SV',",
 "'TH',",
 "'TR',",
 "'TW',",
 "'US',",
 "'UY',",
 "'VN',",
 "'ZA'],",
 "'external_urls':",
 "{'spotify':",
 "'https://open.spotify.com/album/5nPBN2WPwVZbvMUJ7RnBpx'},",
 "'hr

In [None]:
for i in range(0,len(result_string)):
    if result_string[i].startswith('spotify:track'):
        print(string)

In [None]:
print(result_string.startswith('spotify:track'));

In [None]:
any(string.contains('href') for string in result_string);

In [135]:
URI = []

for string in result_string:
    if 'spotify:track' in string:
        URI.append(string);