# Modeling Spotify Data Using Bokeh

This tutorial introduces some basic methods for visualizing data gathered from an API. An application programming interface (API) allows developers to access an application's functionality through controlled calls. We will use the [Spotify API](https://beta.developer.spotify.com/documentation/web-api/) with the [spotipy](http://spotipy.readthedocs.io/en/latest/) Python wrapper to gain insight into currently trending music. Data visualization is a key aspect of data science because it lets us draw conclusions from the information we have gathered. [Bokeh](https://bokeh.pydata.org/en/latest/) is a Python module for creating captivating, interactive graphs. We will use bokeh to plot the data we gather from Spotify and draw some interesting conclusions from it.

### Tutorial Content
We will be covering the following content in this tutorial:
- [Installing the libraries](#Installing-the-libraries)
- [Accessing the API](#Accessing-the-API)
- [Loading the data](#Loading-the-data)
- [Plotting](#Plotting)
- [Putting it all Together](#Putting-it-all-Together)
- [Concluding Thoughts](#Concluding-Thoughts)

## Installing the libraries
Before getting started, we need to install the libraries that we will be using. We can install both spotipy and bokeh using pip.

    $ pip install spotipy

    $ pip install bokeh

We will also use pretty print (pprint) to make the JSON-like Python dictionaries returned by our API calls more readable. Since pprint is already included in Python, there is no need to install it. We'll start with accessing the API, and we can check that our modules have been properly installed by running the imports below:

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pprint
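
If an import fails, it can help to check which packages are actually installed. The following is a minimal sketch using only the standard library (Python 3.8 or later); the helper name `installed_version` is our own and not part of spotipy or bokeh.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report on the two libraries this tutorial depends on
for name in ("spotipy", "bokeh"):
    version = installed_version(name)
    print(name, version if version is not None else "is not installed")
```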

## Accessing the API
In order to pull data from the Spotify API, we will first need to acquire a client_id and a client_secret. We can acquire these by creating a free Spotify account and logging into the [Spotify Developers site](https://beta.developer.spotify.com/) with our account information. After accepting the terms and conditions, we can then access our developer dashboard and create a new app by selecting Create a new client id or My New App. Then we can get our client id and client secret from our app dashboard.

Spotipy handles the OAuth authentication for us and provides methods that make it easy to access the API. We pass our credentials to a client credentials manager and use it to create our client:

In [2]:
client_credentials_manager = SpotifyClientCredentials(client_id='YOUR_ID_HERE', client_secret='YOUR_SECRET_HERE')
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
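
Pasting a secret directly into a notebook makes it easy to share by accident. As a hedged alternative, the sketch below reads the credentials from environment variables. The helper `read_credentials` is our own illustration; the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the ones spotipy itself looks for, but setting them is left to you.

```python
import os


def read_credentials(env=None):
    """Return (client_id, client_secret) from a mapping of environment variables."""
    env = os.environ if env is None else env
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with a stand-in mapping; in practice, call read_credentials()
# with no arguments so that the real environment is used.
client_id, client_secret = read_credentials(
    {"SPOTIPY_CLIENT_ID": "YOUR_ID_HERE", "SPOTIPY_CLIENT_SECRET": "YOUR_SECRET_HERE"}
)
print(client_id)
```

The two returned values can then be passed to `SpotifyClientCredentials` in place of the hard-coded strings.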

We can test our authentication by using spotipy's search function to find all of the artists in Spotify's database whose names contain 'adele'.

In [3]:
query = 'adele'
results = sp.search(query, type='artist')
print(results)

{'artists': {'href': 'https://api.spotify.com/v1/search?query=adele&type=artist&offset=0&limit=10', 'items': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/4dpARuHxo51G3z768sgnrY'}, 'followers': {'href': None, 'total': 8588748}, 'genres': ['pop'], 'href': 'https://api.spotify.com/v1/artists/4dpARuHxo51G3z768sgnrY', 'id': '4dpARuHxo51G3z768sgnrY', 'images': [{'height': 1000, 'url': 'https://i.scdn.co/image/ccbe7b4fef679f821988c78dbd4734471834e3d9', 'width': 1000}, {'height': 640, 'url': 'https://i.scdn.co/image/f8737f6fda048b45efe91f81c2bda2b601ae689c', 'width': 640}, {'height': 200, 'url': 'https://i.scdn.co/image/df070ad127f62d682596e515ac69d5bef56e0897', 'width': 200}, {'height': 64, 'url': 'https://i.scdn.co/image/cbbdfb209cc38b2999b1882f42ee642555316313', 'width': 64}], 'name': 'Adele', 'popularity': 85, 'type': 'artist', 'uri': 'spotify:artist:4dpARuHxo51G3z768sgnrY'}, {'external_urls': {'spotify': 'https://open.spotify.com/artist/19RHMn8FFkEFmhPwyDW2ZC'}, 'follow

Whoa! That's a lot of hard-to-read data. Let's try making our output a little prettier...

In [4]:
pprint.pprint(results)

{'artists': {'href': 'https://api.spotify.com/v1/search?query=adele&type=artist&offset=0&limit=10',
             'items': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/4dpARuHxo51G3z768sgnrY'},
                        'followers': {'href': None, 'total': 8588748},
                        'genres': ['pop'],
                        'href': 'https://api.spotify.com/v1/artists/4dpARuHxo51G3z768sgnrY',
                        'id': '4dpARuHxo51G3z768sgnrY',
                        'images': [{'height': 1000,
                                    'url': 'https://i.scdn.co/image/ccbe7b4fef679f821988c78dbd4734471834e3d9',
                                    'width': 1000},
                                   {'height': 640,
                                    'url': 'https://i.scdn.co/image/f8737f6fda048b45efe91f81c2bda2b601ae689c',
                                    'width': 640},
                                   {'height': 200,
                                    'url': 'ht

That's better. As we can see, pprint reformats the dictionary in a much more readable way. We can now scroll to the bottom of the output and see that there are 169 artists in Spotify's database containing the name 'adele'. However, we are currently limited to the first 10 entries. The next 10 can be found at the URL provided, but that's okay! We found the one that we were looking for. 
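
If we did want every match rather than just the first page, we could keep requesting pages with an increasing offset. The sketch below shows the idea against a stand-in for the API so that it runs offline. The names `collect_all_items`, `fake_fetch_page`, and `FAKE_ARTISTS` are hypothetical, and the fake page only mimics the 'items' and 'next' fields of Spotify's paging object.

```python
def collect_all_items(fetch_page, limit=10):
    """Call fetch_page(offset, limit) until the paging object has no 'next' link."""
    items = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        items.extend(page["items"])
        # Stop when there is no further page (or the page came back empty)
        if page["next"] is None or not page["items"]:
            break
        offset += limit
    return items


# Stand-in for the API: 25 made-up artists served one page at a time
FAKE_ARTISTS = [{"name": f"artist {i}"} for i in range(25)]


def fake_fetch_page(offset, limit):
    chunk = FAKE_ARTISTS[offset:offset + limit]
    has_more = offset + limit < len(FAKE_ARTISTS)
    return {"items": chunk, "next": "more" if has_more else None}


all_artists = collect_all_items(fake_fetch_page)
print(len(all_artists))  # 25
```

With the real client, `fetch_page` could plausibly be `lambda offset, limit: sp.search(query, type='artist', limit=limit, offset=offset)['artists']`, though we have not run that here.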

In [5]:
artist = results['artists']['items'][0]
pprint.pprint(artist)

{'external_urls': {'spotify': 'https://open.spotify.com/artist/4dpARuHxo51G3z768sgnrY'},
 'followers': {'href': None, 'total': 8588748},
 'genres': ['pop'],
 'href': 'https://api.spotify.com/v1/artists/4dpARuHxo51G3z768sgnrY',
 'id': '4dpARuHxo51G3z768sgnrY',
 'images': [{'height': 1000,
             'url': 'https://i.scdn.co/image/ccbe7b4fef679f821988c78dbd4734471834e3d9',
             'width': 1000},
            {'height': 640,
             'url': 'https://i.scdn.co/image/f8737f6fda048b45efe91f81c2bda2b601ae689c',
             'width': 640},
            {'height': 200,
             'url': 'https://i.scdn.co/image/df070ad127f62d682596e515ac69d5bef56e0897',
             'width': 200},
            {'height': 64,
             'url': 'https://i.scdn.co/image/cbbdfb209cc38b2999b1882f42ee642555316313',
             'width': 64}],
 'name': 'Adele',
 'popularity': 85,
 'type': 'artist',
 'uri': 'spotify:artist:4dpARuHxo51G3z768sgnrY'}


The first artist in the 'items' list appears to be the popular British singer that we were referring to. We can double check by opening up the link in the 'external_urls' key to view the [artist's page](https://open.spotify.com/artist/4dpARuHxo51G3z768sgnrY).

Lo and behold, we have our girl.

## Loading the data
Let's gather some more data. Say we want to answer these questions: How has an artist's popularity changed over the years? How does this compare to other Billboard 100 artists?

Now that we know how to access the API, we can use spotipy methods as well as the corresponding object attributes to gather the information needed for our inquiry. For more detailed information, check out the documentation on the [spotipy methods](http://spotipy.readthedocs.io/en/latest/#api-reference) and [object models](https://beta.developer.spotify.com/documentation/web-api/reference/object-model/) in the attached links.

First, we need to make a list of artists that we're interested in. Then we need to find all of their ids to access other data about them.

In [24]:
# list of artist names, customizable
artist_names = ['adele', 'ed sheeran', 'taylor swift', 'the script', 'maroon 5']

# creating a list of artist ids
# since we have chosen a list of popular artists, we can assume that 
# they will have the greatest popularity out of all the search results
# or in other words we can assume they will be at the top of the results

artists = []
for name in artist_names:
    results = sp.search(name, type='artist')
    artist = results['artists']['items'][0]
    artists.append(artist)
    
#double checking code...
for artist in artists:
    print(artist['name'] + ": " + artist['id'])
    print(str(artist['popularity']))

Adele: 4dpARuHxo51G3z768sgnrY
85
Ed Sheeran: 6eUKZXaKkcviH0Ku9w2n3V
96
Taylor Swift: 06HL4z0CvFAxyc27GXpf02
90
The Script: 3AQRLZ9PuTAozP28Skbq8V
80
Maroon 5: 04gDigrS5kc9YWfZHwBETP
89


Just to double check that these are the artists we are looking for, we have printed out their popularities as well. Spotify scores popularity on a scale of 0 to 100, where 0 is the lowest score and 100 is the highest any object can get. Since these are trending artists, we expect them to be sufficiently popular.
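
The loop above assumes that the first search result is the artist we want. A slightly more defensive sketch is to prefer an exact name match and break ties by popularity. The helper `best_match` and the sample items below are made up for illustration; only the 'name' and 'popularity' keys mirror the real artist objects.

```python
def best_match(items, name):
    """Pick the most popular exact (case-insensitive) name match,
    falling back to the most popular result overall."""
    if not items:
        return None
    exact = [a for a in items if a["name"].lower() == name.lower()]
    candidates = exact or items
    return max(candidates, key=lambda a: a["popularity"])


# Hypothetical search results, not real Spotify data
sample_items = [
    {"name": "Adele Tribute Band", "popularity": 12},
    {"name": "Adele", "popularity": 85},
    {"name": "Adele", "popularity": 3},
]

print(best_match(sample_items, "adele"))
```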

Spotify has already generated popularity for artists, albums, tracks, etc., so we, the developers, are responsible for providing some measure of time to visualize how popularity has changed. The best way to do this is by comparing the popularity of an artist's album to its release date. Thus, we can see how an artist has evolved in the eyes of the populace with every album released.

So we need to get information about the albums that an artist has produced. Spotipy has some built-in methods for this, like artist_albums. We then need to filter the data so that we're getting the information we expect to get. For instance, we need to ensure that each object returned from the artist_albums method is actually an album produced by the artist, rather than a single or an album they contributed to.

Another problem we may encounter lies in querying the database. By default, the artist_albums method returns 20 results at a time. Since most artists appear in more than 20 albums, we need to repeat the query with an increasing offset until we retrieve all results. This is tedious to do manually, so let's write a function to do the work for us.

In [7]:
def get_artist_albums(artist):
    """Get all of the relevant albums for an artist. Only actual albums count as relevant.
    
    Input: artist (dictionary with an 'id' key, as returned by sp.search)
    
    Output: albums (list of dictionaries)
    """
    artist_id = artist['id']
    result = []
    # temporary list to store all album ids
    album_ids = []
    
    # page through the results until Spotify reports no next page
    offset = 0
    nxt = "start"
    while nxt is not None:
        albums = sp.artist_albums(artist_id, offset=offset)
        for album in albums['items']:
            # keep the artist's own albums; skip singles and appearances
            if album['album_group'] == 'album':
                album_ids.append(album['id'])
        nxt = albums['next']
        offset += albums['limit']
            
    # fetch the full album object, which includes fields such as popularity
    for album_id in album_ids:
        info = sp.album(album_id)
        result.append(info)
    
    return result

We can test our function by calling it on our first artist, Adele. Let's take a look at what her albums look like.

In [8]:
adele = get_artist_albums(artists[0])
print(len(adele))
pprint.pprint(adele)

11
[{'album_type': 'album',
  'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/4dpARuHxo51G3z768sgnrY'},
               'href': 'https://api.spotify.com/v1/artists/4dpARuHxo51G3z768sgnrY',
               'id': '4dpARuHxo51G3z768sgnrY',
               'name': 'Adele',
               'type': 'artist',
               'uri': 'spotify:artist:4dpARuHxo51G3z768sgnrY'}],
  'available_markets': ['AR',
                        'BO',
                        'BR',
                        'CL',
                        'CO',
                        'CR',
                        'DO',
                        'EC',
                        'GT',
                        'HN',
                        'MX',
                        'NI',
                        'PA',
                        'PE',
                        'PY',
                        'SV',
                        'US',
                        'UY'],
  'copyrights': [{'text': '(P) 2015 XL Recordings Ltd., under exclu

                                              'CY',
                                              'CZ',
                                              'DE',
                                              'DK',
                                              'EE',
                                              'ES',
                                              'FI',
                                              'FR',
                                              'GB',
                                              'GR',
                                              'HK',
                                              'HU',
                                              'ID',
                                              'IE',
                                              'IL',
                                              'IS',
                                              'IT',
                                              'JP',
                                              'LI',
            

                        'href': 'https://api.spotify.com/v1/tracks/5deAPB1sLTUOxWbbO18Wnl',
                        'id': '5deAPB1sLTUOxWbbO18Wnl',
                        'is_local': False,
                        'name': 'One And Only',
                        'preview_url': 'https://p.scdn.co/mp3-preview/648c93025c13fec6da389fedf7c762a286585277?cid=c84d10ac23bd4a5a97699ce4b07fa405',
                        'track_number': 9,
                        'type': 'track',
                        'uri': 'spotify:track:5deAPB1sLTUOxWbbO18Wnl'},
                       {'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/4dpARuHxo51G3z768sgnrY'},
                                     'href': 'https://api.spotify.com/v1/artists/4dpARuHxo51G3z768sgnrY',
                                     'id': '4dpARuHxo51G3z768sgnrY',
                                     'name': 'Adele',
                                     'type': 'artist',
                                     'uri': 's

                                              'IL',
                                              'IS',
                                              'IT',
                                              'JP',
                                              'LI',
                                              'LT',
                                              'LU',
                                              'LV',
                                              'MC',
                                              'MT',
                                              'MY',
                                              'NL',
                                              'NO',
                                              'NZ',
                                              'PH',
                                              'PL',
                                              'PT',
                                              'RO',
                                              'SE',
            

                        'external_urls': {'spotify': 'https://open.spotify.com/track/59tg0OPhiHlbsVZ9GFqUk5'},
                        'href': 'https://api.spotify.com/v1/tracks/59tg0OPhiHlbsVZ9GFqUk5',
                        'id': '59tg0OPhiHlbsVZ9GFqUk5',
                        'is_local': False,
                        'name': 'Chasing Pavements',
                        'preview_url': 'https://p.scdn.co/mp3-preview/0cb77f22200a5eaf53bfe3b3a51f352d44e05b3a?cid=c84d10ac23bd4a5a97699ce4b07fa405',
                        'track_number': 3,
                        'type': 'track',
                        'uri': 'spotify:track:59tg0OPhiHlbsVZ9GFqUk5'},
                       {'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/4dpARuHxo51G3z768sgnrY'},
                                     'href': 'https://api.spotify.com/v1/artists/4dpARuHxo51G3z768sgnrY',
                                     'id': '4dpARuHxo51G3z768sgnrY',
                                     'n

                                              'SE',
                                              'SG',
                                              'SK',
                                              'TH',
                                              'TR',
                                              'TW',
                                              'VN',
                                              'ZA'],
                        'disc_number': 2,
                        'duration_ms': 232200,
                        'explicit': False,
                        'external_urls': {'spotify': 'https://open.spotify.com/track/5gcj6l37gMBeSAjFnSKeWh'},
                        'href': 'https://api.spotify.com/v1/tracks/5gcj6l37gMBeSAjFnSKeWh',
                        'id': '5gcj6l37gMBeSAjFnSKeWh',
                        'is_local': False,
                        'name': 'Chasing Pavements (Live at Hotel Cafe)',
                        'preview_url': 'https://p.scdn.co/mp3-preview/4d413

                                              'DE',
                                              'DK',
                                              'EE',
                                              'ES',
                                              'FI',
                                              'FR',
                                              'GB',
                                              'GR',
                                              'HK',
                                              'HU',
                                              'ID',
                                              'IE',
                                              'IL',
                                              'IS',
                                              'IT',
                                              'JP',
                                              'LI',
                                              'LT',
                                              'LU',
            

According to our current specifications, Adele has 11 actual albums in Spotify. The sp.album method also gave us more information from Spotify's database on each album. We can now start drawing some conclusions regarding an artist's popularity.

In our next function, we'll format our information with graphs in mind. get_release_popularity will thus take the list of albums we have just found and output a tuple of sorted dates, associated popularities, and album names. Since the release_date is a string in the Spotify database, we need the builtin datetime module so that we can convert the string into a datetime. This will help with formatting the axis when we plot against time.

We also need to filter some dates out. An album's release_date does not always contain a full year-month-day. Some contain only a year, or only a year and a month. Fortunately, there is also a field called release_date_precision that tells us how precise the release_date is. Our function needs to check this field and only process albums with a precision of 'day'.
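
As an alternative to skipping less precise dates, they could be parsed with a format chosen by precision. This is only a sketch: `parse_release_date` and `DATE_FORMATS` are our own names, and we assume the precision values are 'year', 'month', and 'day'. Missing months or days default to 1, which may or may not be what you want on a graph.

```python
from datetime import datetime

# One strptime format per assumed release_date_precision value
DATE_FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}


def parse_release_date(album):
    """Return a datetime for the album's release_date, or None if the precision is unknown."""
    fmt = DATE_FORMATS.get(album["release_date_precision"])
    if fmt is None:
        return None
    return datetime.strptime(album["release_date"], fmt)


print(parse_release_date({"release_date": "2008-01-28", "release_date_precision": "day"}))
print(parse_release_date({"release_date": "2011", "release_date_precision": "year"}))
```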

In [9]:
from datetime import datetime

def get_release_popularity(albums):
    '''Take a list of albums and return a tuple of sorted dates, associated popularities, and album names.
    
    Input: albums (list of album dictionaries)
    
    Output: dates, popularities, names (datetime list, int list, str list)
    '''
    
    result = dict()
    for album in albums:
        if album['release_date_precision'] == "day":
            date = datetime.strptime(album['release_date'], "%Y-%m-%d")
            pop = album['popularity']
            name = album['name']
            if ((date in result and pop > result[date][0]) or 
                date not in result):
                result[date] = [pop,name]

    dates = []
    pops = []
    names = []
    keys = sorted(result.keys())
    for key in keys:
        dates.append(key)
        pops.append(result[key][0])
        names.append(result[key][1])
    return (dates, pops, names)

If multiple albums are released on the same day, we will take the one with the greatest popularity. Thus we can see how an artist's maximum popularity appears throughout the years.

Great! Now we can run some tests on Adele again.

In [23]:
a, b, c = get_release_popularity(adele)
print(a)
print(b)
print(c)

[datetime.datetime(2008, 1, 28, 0, 0), datetime.datetime(2011, 1, 19, 0, 0), datetime.datetime(2015, 11, 20, 0, 0), datetime.datetime(2016, 6, 24, 0, 0)]
[77, 81, 82, 80]
['19', '21', '25', '25']


A quick Google search confirms that Adele has released only 3 albums in total, yet our output has 4 entries. Hm. It appears as though we have a repeat of her album '25'. Since the release dates are different, Spotify considers them two separate entities. For now, we'll leave this be, but it's good to take note of it for future consideration.
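
If the repeat ever becomes a problem, one possible fix is to keep only the earliest release for each album name. The sketch below applies that idea to the output we just printed; `dedupe_by_name` is a hypothetical helper, and keeping the earliest entry (rather than the most popular) is our own choice.

```python
from datetime import datetime


def dedupe_by_name(dates, pops, names):
    """Keep only the first (earliest) entry for each album name.

    Assumes the three lists are aligned and already sorted by date.
    """
    seen = set()
    kept = []
    for date, pop, name in zip(dates, pops, names):
        if name not in seen:
            seen.add(name)
            kept.append((date, pop, name))
    return kept


# The values printed above for Adele
dates = [datetime(2008, 1, 28), datetime(2011, 1, 19),
         datetime(2015, 11, 20), datetime(2016, 6, 24)]
pops = [77, 81, 82, 80]
names = ['19', '21', '25', '25']

for date, pop, name in dedupe_by_name(dates, pops, names):
    print(name, date.date(), pop)
```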

## Plotting
Now that we can process the data, we can move onto visualizing it. [Bokeh](https://bokeh.pydata.org/en/latest/docs/user_guide/plotting.html) is a fantastic tool for making line graphs, scatter graphs, geospatial maps, bar graphs, and any other type of data visualization you can imagine! To be honest, it's not that well documented, but it certainly produces better graphs than matplotlib. Or at least more ~aesthetic~ graphs. No offense. 

Since we're trying to track the popularity of artists over time, we'll be working mainly with line graphs. Bokeh also allows us to add certain features to a graph, such as a HoverTool or a zoom feature. These tools can be found on the sidebar of a graph. Although minor, these touches of flair engage users by encouraging them to interact with the graphics.

As we saw from the Adele example, we're not working with too many points. Artists release fewer albums than we anticipated. Thus, we'll start with a simple line graph and see how we can build it up from there.

First, we need to import all of our required libraries. For some fancy formatting, we'll include math as well.

In [11]:
from bokeh.plotting import figure, output_file, show
from bokeh.models import DatetimeTickFormatter, ColumnDataSource, HoverTool
import math

## Putting it all Together
As mentioned above, bokeh lets users interact with the HTML graphs it generates. We'll start off by creating the graph before adding any type of interaction. Bokeh saves the generated graph to an HTML file; if we don't name one with output_file, it picks a default filename for us. Since we'll want to embed the graph later, we'll name the file explicitly.

Before we can do any graphing, we first need to create a figure of a specified width and height. We also want some tools to be included in the toolbar so we'll initialize them as well as different colors for various lines on the graph.

In [13]:
output_file("test_file.html") # save the graph under a specific file name

TOOLS = "hover,crosshair,pan,wheel_zoom,zoom_in,zoom_out,box_zoom,save"

# recent Bokeh releases use width/height; older ones called these plot_width/plot_height
p = figure(width = 700, height = 350, 
           title = "Popularity of Artists Over the Years", tools = TOOLS)

# one line color per artist
COLORS = ["red", "orange", "green", "blue", "purple"]

Now we can start gathering data and adding lines to our graph. Combining get_artist_albums and get_release_popularity, we can finally plot the dates and popularities that our functions return. A bokeh figure can take lists directly, but we also want to pass the album names to the graph, so we'll use the [ColumnDataSource](https://bokeh.pydata.org/en/latest/docs/reference/models/sources.html) provided by bokeh to wrap a dictionary of our inputs. ColumnDataSource maps column names (the keys of the dictionary) to columns of data, so referring to a column by name is equivalent to referring to its data. With a larger data set, such as a pandas DataFrame loaded from a CSV, we could pass the DataFrame to ColumnDataSource to map each column name to its values. Here, we'll only be using ColumnDataSource so we have an easier way of referencing album names.

ColumnDataSource comes with its own constraints: all lists in the data dictionary must have the same length. We cannot include a single entry for the artist's name or line color, since those would be much shorter than the lists holding album information. We'll be looping through all the artists anyway, so we can use the loop index to get the artist name from the original artist_names list and the line color from the color list we just defined.
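
If we did want the artist name available as a column (for example, to show it in a tooltip), one way around the equal-length constraint is to repeat the name once per row. The following sketch builds such a dictionary with plain Python; `build_columns` is a hypothetical helper, and the result would still need to be passed to ColumnDataSource.

```python
def build_columns(dates, pops, names, artist_name):
    """Build a column dictionary in which every list has the same length."""
    lengths = {len(dates), len(pops), len(names)}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same length")
    return {
        "date": dates,
        "popularity": pops,
        "name": names,
        # repeat the single artist name so this column matches the others
        "artist": [artist_name] * len(dates),
    }


# Dates kept as strings here purely to keep the example short
columns = build_columns(
    ["2008-01-28", "2011-01-19", "2015-11-20"],
    [77, 81, 82],
    ["19", "21", "25"],
    "adele",
)
print(columns["artist"])
```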

We're also going to add a legend of all the artist names by using the lists not included in the data dictionary.

In [14]:
for i, artist in enumerate(artists):
    albums = get_artist_albums(artist)
    d, pop, n = get_release_popularity(albums)

    # generates column data; every list must have the same length
    data = {'date':d,
            'popularity':pop,
            'name':n,
           }
    source = ColumnDataSource(data=data)

    # legend_label replaced the older legend keyword in Bokeh 1.4
    p.line(x = 'date', y = 'popularity', source = source, 
           legend_label=artist_names[i], color=COLORS[i])

This loop might take a while. We're doing a lot of iterative steps: looping through each artist to find all their albums, filtering them, and formatting all the information.

Afterwards, we can start formatting our graph. Since we went through the trouble of converting all those date strings to datetime, we want to flaunt our datetime abilities. We can do this by formatting our x-axis with another bokeh builtin called DatetimeTickFormatter. And just to add some more flair, we'll format the labels so they're angled rather than flat (this is where the math comes in). 

<img src="meme.jpg" style="height:273px">

Let's add some axis labels to our graph and place our legend so it doesn't overlap with too much data. By default, the legend appears in the top right corner. However this covers a lot of important data since we're interested in how the popularity of each artist has changed. Thus we will relocate the legend. We can also add click_policies to the legend such as "hide" or "mute". These policies allow us to click on the legend label and change the graph by hiding or muting that line. 

In [15]:
p.xaxis.formatter=DatetimeTickFormatter()
p.xaxis.major_label_orientation = math.pi/4
p.grid.grid_line_alpha=0.3

p.xaxis.axis_label = 'Year'
p.yaxis.axis_label = 'Popularity'

p.legend.location = "top_left"
p.legend.click_policy= "hide"

Finally, we'll add our last touch of dynamism. We'll add a HoverTool to our graph that follows our mouse and shows the album name, release date, and popularity associated with a point. Here's where ColumnDataSource comes in handy. We can just reference each column with an @ symbol and bokeh will understand that these values should be replaced with the name, date, and popularity at that point. We also need to format our datetime again, so we'll add a formatter to our HoverTool.

In [19]:
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
    ("Album Name", "@name"),
    ("Release Date", "@date{%F}"),
    ("Popularity", "@popularity")
]
# Bokeh 2.0 and later expect the '@' prefix in the key; older releases used 'date'
hover.formatters = {'@date':'datetime'}

show(p)
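
The `{%F}` in the tooltip is shorthand for the `%Y-%m-%d` date format. To preview what a tooltip would display without opening the graph, here is a plain-Python sketch. `tooltip_text` is our own helper, not a bokeh function, and it spells out the format explicitly because Python's own support for `%F` varies by platform.

```python
from datetime import datetime


def tooltip_text(name, date, popularity):
    """Build the three tooltip rows as (label, text) pairs, formatting the date like %F."""
    return [
        ("Album Name", name),
        ("Release Date", date.strftime("%Y-%m-%d")),
        ("Popularity", str(popularity)),
    ]


# Values taken from the Adele output earlier in the tutorial
for label, value in tooltip_text("25", datetime(2015, 11, 20), 82):
    print(f"{label}: {value}")
```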

The show call at the bottom of our code writes the graph to our HTML file and opens it in a new browser tab. Our completed graph should look something like this:

In [22]:
from IPython.display import IFrame
IFrame('test_file.html', 750, 400)

As we can see, our graph contains all of the features we had hoped to achieve: overlapping lines, an interactive legend, hover features, and a toolbar! Now we can look at the visualization to draw conclusions about the artists. Here are some conclusions and subsequent questions we can raise from our results:

- Artist popularity can fluctuate a lot from album to album.
- It appears that most artists experienced a dip in popularity around 2015.
- Adele has remained at constant popularity, around 80.
- Ed Sheeran appears to be the most popular overall, especially with his latest album, ÷ (Divide).
- Taylor Swift, contrary to popular belief, has experienced increasing popularity after her transition from country to pop. Although she may be an online joke, she's definitely doing well enough in her actual field to not give a f***.
- Maroon 5 has been releasing albums the longest, and has stayed relatively popular throughout.
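
Observations like these are easier to defend with a few numbers alongside the graph. As a rough sketch, the helper below (our own `summarize`, not part of any library) computes the mean, spread, and first-to-last change of a popularity list, using the values printed for Adele earlier.

```python
from statistics import mean


def summarize(pops):
    """Summarize a chronologically ordered list of popularity scores."""
    return {
        "mean": mean(pops),
        "spread": max(pops) - min(pops),   # highest minus lowest score
        "change": pops[-1] - pops[0],      # latest album minus first album
    }


adele_pops = [77, 81, 82, 80]
print(summarize(adele_pops))
```

A small spread around a mean of 80 is consistent with the claim that Adele's popularity has stayed roughly constant.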

## Concluding Thoughts
Bokeh allows us to create interesting visualizations of our data. Beyond the features covered here, bokeh allows for even more dynamism with custom JavaScript callbacks. For a more robust data set with more dependencies, we could have added a widget such as a slider, text input box, or selection tool that updates the data shown in the graph. However, since our data does not depend on any variables other than already existing information, callbacks are not relevant here and are not covered in this tutorial.

Ultimately, data visualization allows us to formulate some interesting conclusions to pertinent inquiries.

### References
This tutorial only provides partial insight into working with APIs and bokeh. More details about the Spotify API, spotipy, and bokeh can be found at the links below:
- [Spotify API](https://beta.developer.spotify.com/documentation/web-api/)
- [spotipy - Spotify Python wrapper](http://spotipy.readthedocs.io/en/latest/)
- [bokeh homepage](https://bokeh.pydata.org/en/latest/)
- [bokeh callbacks](https://bokeh.pydata.org/en/latest/docs/user_guide/interaction.html)
- [Importance of Data Visualization](https://www.sas.com/en_us/insights/big-data/data-visualization.html)