# Lab | API wrappers - Create your collection of songs & audio features
#### Instructions
To move forward with the project, you need to create a collection of songs with their audio features - as large as possible!

These are the songs that we will cluster. And, later, when the user inputs a song, we will find the cluster to which the song belongs and recommend a song from the same cluster. The more songs you have, the more accurate and diverse recommendations you'll be able to give. Although... you might want to make sure the collected songs are "curated" in a certain way. Try to find playlists of songs that are diverse, but also that meet certain standards.

The process of sending hundreds or thousands of requests can take some time - it's normal if you have to wait a few minutes (or, if you're ambitious, even hours) to get all the data you need.

One way to collect as many songs as possible is to start with all the songs of a big, diverse playlist, then go to every artist present in the playlist and grab every song of every album of that artist. The number of songs you collect per playlist can grow very quickly this way!
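The expansion idea above can be sketched roughly as follows. This is only an illustration under assumptions: `collect_artist_track_ids` and `_all_pages` are hypothetical helpers, and `client` is assumed to behave like a spotipy client (such as the `sp` object created in the next section) offering `artist_albums`, `album_tracks` and `next`.

```python
def _all_pages(client, page):
    """Yield every item of a paged response, following its 'next' links."""
    while page:
        yield from page["items"]
        page = client.next(page) if page.get("next") else None


def collect_artist_track_ids(client, artist_ids):
    """Return the set of track IDs found on the albums of the given artists.

    Hypothetical helper: `client` is assumed to behave like spotipy.Spotify.
    """
    track_ids = set()
    for artist_id in set(artist_ids):  # visit each artist only once
        albums = client.artist_albums(artist_id)
        for album in _all_pages(client, albums):
            tracks = client.album_tracks(album["id"])
            for track in _all_pages(client, tracks):
                track_ids.add(track["id"])
    return track_ids
```

Collecting the IDs in a set removes duplicates, since the same song can appear on several albums. For a real run it would be sensible to add a short pause between requests, as the rest of this notebook does.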

# Set Up and Authentication

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
from random import randint
from time import sleep
from itertools import islice
from pandas import json_normalize

In [2]:
secrets_file = open("secrets.txt","r")

In [3]:
string = secrets_file.read()
#string

In [4]:
string.split('\n')

['clientid:dc877e548f3d4f37bdc2507860899d3e',
 'clientsecret:0f136055e6f04ad4b97ec2842b77cb0c']

In [5]:
secrets_dict={}
for line in string.split('\n'):
    if len(line) > 0:
        #print(line.split(':'))
        key, value = line.split(':', 1)  # split only on the first colon
        secrets_dict[key] = value.strip()
        
#secrets_dict
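The steps in cells 2 to 5 can also be wrapped into a small function. The sketch below is one possible way to do it, not the only one: `parse_secrets` and `load_secrets` are hypothetical names, and the file is assumed to contain one `key:value` pair per line, as shown in the output above.

```python
def parse_secrets(text):
    """Parse 'key:value' lines into a dictionary, skipping blank lines."""
    secrets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split only on the first colon so values may contain colons themselves
        key, value = line.split(":", 1)
        secrets[key.strip()] = value.strip()
    return secrets


def load_secrets(path="secrets.txt"):
    """Read and parse the secrets file, closing it automatically."""
    with open(path, "r") as secrets_file:
        return parse_secrets(secrets_file.read())
```

Using `with` makes sure the file is closed once it has been read. A line without any colon would raise a `ValueError` here, which makes a malformed secrets file easy to spot.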

In [6]:
#Initialize SpotiPy with user credentials
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=secrets_dict['clientid'],
                                                           client_secret=secrets_dict['clientsecret']))

# Access the Playlist and Retrieve Track IDs

Fetch the playlist and get the IDs of all the tracks in it. This playlist is large, so we'll have to handle pagination.

In [7]:
# Getting the playlist
playlist = sp.user_playlist_tracks("spotify", "5S8SJdl1BDc0ugpkEvFsIL")
playlist

{'href': 'https://api.spotify.com/v1/playlists/5S8SJdl1BDc0ugpkEvFsIL/tracks?offset=0&limit=100&additional_types=track',
 'items': [{'added_at': '2017-11-20T02:52:18Z',
   'added_by': {'external_urls': {'spotify': 'https://open.spotify.com/user/twgeb7mzdcv4u8h191dxrvlpc'},
    'href': 'https://api.spotify.com/v1/users/twgeb7mzdcv4u8h191dxrvlpc',
    'id': 'twgeb7mzdcv4u8h191dxrvlpc',
    'type': 'user',
    'uri': 'spotify:user:twgeb7mzdcv4u8h191dxrvlpc'},
   'is_local': False,
   'primary_color': None,
   'track': {'album': {'album_type': 'single',
     'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/06HL4z0CvFAxyc27GXpf02'},
       'href': 'https://api.spotify.com/v1/artists/06HL4z0CvFAxyc27GXpf02',
       'id': '06HL4z0CvFAxyc27GXpf02',
       'name': 'Taylor Swift',
       'type': 'artist',
       'uri': 'spotify:artist:06HL4z0CvFAxyc27GXpf02'}],
     'available_markets': [],
     'external_urls': {'spotify': 'https://open.spotify.com/album/0HG8fMDhvN2tH5

In [8]:
playlist["total"]

10000

There are 10000 tracks in total.

In [9]:
len(playlist["items"])

100

There are 100 track items in the current retrieved list.
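With the two numbers above we can work out how many requests are needed to page through the whole playlist. This is plain arithmetic on the values shown in the outputs; the variable names are only illustrative.

```python
import math

total_tracks = 10000  # value of playlist["total"] shown above
page_size = 100       # value of len(playlist["items"]) shown above

# Number of requests needed to page through the whole playlist
pages_needed = math.ceil(total_tracks / page_size)

# Offsets at which each page starts: 0, 100, 200, ...
offsets = list(range(0, total_tracks, page_size))

print(pages_needed)  # 100
print(offsets[:3])   # [0, 100, 200]
print(offsets[-1])   # 9900
```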

In [10]:
# Look at items and total:
playlist.keys() 

dict_keys(['href', 'items', 'limit', 'next', 'offset', 'previous', 'total'])

These are the keys of the paging object: `items` holds the current page of tracks, `total` the overall number of tracks, and `next` the URL of the following page.

In [11]:
playlist["items"][0].keys()

dict_keys(['added_at', 'added_by', 'is_local', 'primary_color', 'track', 'video_thumbnail'])

These keys describe the playlist entry (for example, when and by whom it was added); the song itself is nested under `track`.

In [12]:
# Accessing the first item from a list of tracks in the playlist dictionary
# Then further access the details of the track

playlist["items"][0]["track"]

{'album': {'album_type': 'single',
  'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/06HL4z0CvFAxyc27GXpf02'},
    'href': 'https://api.spotify.com/v1/artists/06HL4z0CvFAxyc27GXpf02',
    'id': '06HL4z0CvFAxyc27GXpf02',
    'name': 'Taylor Swift',
    'type': 'artist',
    'uri': 'spotify:artist:06HL4z0CvFAxyc27GXpf02'}],
  'available_markets': [],
  'external_urls': {'spotify': 'https://open.spotify.com/album/0HG8fMDhvN2tH5uPHFsyZP'},
  'href': 'https://api.spotify.com/v1/albums/0HG8fMDhvN2tH5uPHFsyZP',
  'id': '0HG8fMDhvN2tH5uPHFsyZP',
  'images': [{'height': 640,
    'url': 'https://i.scdn.co/image/ab67616d0000b2734322e9bd7d57d061d0e19e1f',
    'width': 640},
   {'height': 300,
    'url': 'https://i.scdn.co/image/ab67616d00001e024322e9bd7d57d061d0e19e1f',
    'width': 300},
   {'height': 64,
    'url': 'https://i.scdn.co/image/ab67616d000048514322e9bd7d57d061d0e19e1f',
    'width': 64}],
  'name': '...Ready For It?',
  'release_date': '2017-09-03',
  'rele

The result is a dictionary whose key-value pairs describe the track in detail.

In [13]:
# Retrieving all the keys from the track's details dictionary
playlist["items"][0]["track"].keys()

dict_keys(['album', 'artists', 'available_markets', 'disc_number', 'duration_ms', 'episode', 'explicit', 'external_ids', 'external_urls', 'href', 'id', 'is_local', 'name', 'popularity', 'preview_url', 'track', 'track_number', 'type', 'uri'])

These keys represent the various properties of the track that can be accessed.
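To make the nesting concrete, here is a minimal sketch that pulls a few fields out of one playlist item. `summarize_item` is a hypothetical helper, the sample item is trimmed down to values shown in this notebook's outputs, and the sketch assumes that an item may come without track data.

```python
def summarize_item(item):
    """Return a small dict with the main fields of one playlist item.

    Returns None when the item carries no track data.
    """
    track = item.get("track")
    if not track:
        return None
    artists = track.get("artists") or []
    return {
        "uri": track.get("uri"),
        "title": track.get("name"),
        "artist_name": artists[0]["name"] if artists else None,
        "album_name": (track.get("album") or {}).get("name"),
    }


# Trimmed-down item built from values shown in this notebook's outputs
sample_item = {
    "added_at": "2017-11-20T02:52:18Z",
    "track": {
        "uri": "spotify:track:7zgqtptZvhf8GEmdsM2vp2",
        "name": "...Ready For It?",
        "artists": [{"id": "06HL4z0CvFAxyc27GXpf02", "name": "Taylor Swift"}],
        "album": {"id": "0HG8fMDhvN2tH5uPHFsyZP", "name": "...Ready For It?"},
    },
}

print(summarize_item(sample_item))
```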

# Retrieve Track Details and Audio Features

For each track, we'll get its details and audio features.

1. We will iterate over the playlist, putting all songs into a list.
2. We will pull out the desired features into a dataframe.
3. We will get the audio features for all songs in the dataframe.

We will store the songs in a dataframe with columns such as:

columns = ['id','title','artist_name','artist_id','album','length','explicit']

## 1. Iterating over the playlist, pulling all songs out.

In [14]:
def get_playlist_tracks(playlist_id):
    """
    Retrieves all tracks from a Spotify playlist.

    Args:
        playlist_id (str): The unique identifier of the playlist.

    Returns:
        list: A list of dictionaries, where each dictionary represents a track in the playlist.
    """
    results = sp.user_playlist_tracks("spotify", playlist_id)
    tracks = results['items']
    while results['next'] is not None:
        results = sp.next(results)
        tracks.extend(results['items'])
        sleep(randint(1, 3))  # Respectful nap
        
    return tracks

In [15]:
playlist = get_playlist_tracks('5S8SJdl1BDc0ugpkEvFsIL')
len(playlist)

10000

### Getting all the song features from the playlist

In [16]:
def get_name_artists_from_track(track):
    """
    Extracts the names of artists from a track dictionary.

    Args:
        track (dict): A dictionary representing a music track, including artist details.

    Returns:
        list: A list of artist names associated with the track.
    """
    
    return [artist["name"] for artist in track["artists"]]

In [17]:
# Same for artist id
def get_name_artists_id_from_track(track):
    return [artist["id"] for artist in track["artists"]]

In [18]:
def get_tracks(playlist):
    """
    Extracts detailed information from a list of track dictionaries in a playlist.

    Args:
        playlist (list): A list of dictionaries, where each dictionary represents a music track.

    Returns:
        list: A list of lists, where each inner list represents detailed information about a track.
    """
    tracklist = []
    for item in playlist:
        track = item['track']
        # Extract specific track details and append them to the tracklist
        # (URI, name, first artist name and ID, album ID and name, duration, explicitness, and popularity)
        tracklist.append([
            track['uri'],
            track['name'],
            get_name_artists_from_track(track)[0],
            get_name_artists_id_from_track(track)[0],
            track['album']['id'],
            track['album']['name'],
            track['duration_ms'],
            track['explicit'],
            track['popularity']
        ])

    return tracklist

In [19]:
songs = get_tracks(playlist)

In [20]:
len(songs)

10000

## 2. Pulling out the desired features into a dataframe.

In [21]:
df_songs = pd.DataFrame(data = songs, columns = ['uri','title','artist_name','artist_id','album_id','album_name','length','explicit','popularity'])

In [22]:
df_songs.head()

| | uri | title | artist_name | artist_id | album_id | album_name | length | explicit | popularity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | spotify:track:7zgqtptZvhf8GEmdsM2vp2 | ...Ready For It? | Taylor Swift | 06HL4z0CvFAxyc27GXpf02 | 0HG8fMDhvN2tH5uPHFsyZP | ...Ready For It? | 208198 | False | 0 |
| 1 | spotify:track:4Vxu50qVrQcycjRyJQaZLC | Life Changes | Thomas Rhett | 6x2LnllRG5uGarZMsD4iO8 | 4w5Jvreahp3yvLqc4vCr9I | Life Changes | 190226 | False | 62 |
| 2 | spotify:track:6b8Be6ljOzmkOmFslEb23P | 24K Magic | Bruno Mars | 0du5cEVh5yTK9QJze8zA0C | 4PgleR09JVnm3zY1fW3XBA | 24K Magic | 225983 | False | 82 |
| 3 | spotify:track:0afhq8XCExXpqazXczTSve | Galway Girl | Ed Sheeran | 6eUKZXaKkcviH0Ku9w2n3V | 3T4tUhGYeRNVUGevb0wThu | ÷ (Deluxe) | 170826 | False | 81 |
| 4 | spotify:track:1HNkqx9Ahdgi1Ixy2xkKkL | Photograph | Ed Sheeran | 6eUKZXaKkcviH0Ku9w2n3V | 1xn54DMo2qIqBuMqHtUsFd | x (Deluxe Edition) | 258986 | False | 87 |


In [23]:
df_songs.shape

(10000, 9)

## 3. Getting the audio features for all songs in the dataframe.

In [24]:
def get_features_delayed(uri):
    """
    Retrieves audio features of a Spotify track using its URI with a respectful delay.

    Args:
        uri (str): The URI of the Spotify track for which audio features are requested.

    Returns:
        dict: A dictionary containing audio features of the track.
    """
    sleep(randint(1, 2))  # A respectful nap (waits for 1-2 seconds)
    
    return sp.audio_features(uri)

In [25]:
# Running this takes too long

#df_songs['features'] = df_songs['uri'].apply(get_features_delayed).copy()
#df_songs
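With one request per song and a nap of 1 to 2 seconds each, 10,000 songs mean at least 10,000 seconds (roughly 2.8 hours) of waiting alone. One possible speed-up is to request features in batches. The sketch below assumes that spotipy's `audio_features` accepts a list of up to 100 track URIs per call; `chunked` and `get_features_in_batches` are hypothetical helpers, and `client` stands for an object like `sp`.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def get_features_in_batches(client, uris, batch_size=100):
    """Fetch audio features for many tracks using one request per batch.

    Hypothetical helper: `client` is assumed to behave like the `sp` object,
    and `client.audio_features` is assumed to accept a list of URIs.
    """
    features = []
    for batch in chunked(uris, batch_size):
        features.extend(client.audio_features(batch))
    return features
```

A call such as `get_features_in_batches(sp, df_songs['uri'])` would then need about 100 requests instead of 10,000. Note that the result is one flat list with an entry per track, not a one-element list per track as returned by the single-URI calls in this notebook.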

In [26]:
#print(get_features_delayed('spotify:track:7zgqtptZvhf8GEmdsM2vp2'))

### Feature extraction

In [27]:
# We have to make the function resilient to errors in case the connection breaks
def get_features_delayed(uri, retries=3):
    # First we try to get the features
    try:
        sleep(randint(1, 2))  # respectful nap
        features = sp.audio_features(uri)
    # If it doesn't work we try again, up to `retries` more times (the function is recursive)
    except Exception as e:
        print(f'Error occurred while getting audio features: {e}')
        if retries <= 0:
            raise  # give up instead of recursing forever
        features = get_features_delayed(uri, retries - 1)

    # Finally we return the features
    return features
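Retrying can also be factored out into a small reusable helper that waits longer after each failure (exponential backoff). This is a sketch under assumptions, not part of spotipy: `call_with_retries` and its parameters `max_attempts`, `base_delay` and `pause` are hypothetical names chosen for illustration.

```python
from time import sleep


def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, pause=sleep):
    """Call func(*args), retrying with exponential backoff on failure.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception as error:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 1 s, 2 s, 4 s, ...
            print(f"Attempt {attempt} failed ({error}); retrying in {delay} s")
            pause(delay)
```

It could be used as `call_with_retries(sp.audio_features, uri)`. Passing the waiting function in as `pause` keeps the helper easy to try out without actually sleeping.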

In [28]:
df_songs.head()

| | uri | title | artist_name | artist_id | album_id | album_name | length | explicit | popularity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | spotify:track:7zgqtptZvhf8GEmdsM2vp2 | ...Ready For It? | Taylor Swift | 06HL4z0CvFAxyc27GXpf02 | 0HG8fMDhvN2tH5uPHFsyZP | ...Ready For It? | 208198 | False | 0 |
| 1 | spotify:track:4Vxu50qVrQcycjRyJQaZLC | Life Changes | Thomas Rhett | 6x2LnllRG5uGarZMsD4iO8 | 4w5Jvreahp3yvLqc4vCr9I | Life Changes | 190226 | False | 62 |
| 2 | spotify:track:6b8Be6ljOzmkOmFslEb23P | 24K Magic | Bruno Mars | 0du5cEVh5yTK9QJze8zA0C | 4PgleR09JVnm3zY1fW3XBA | 24K Magic | 225983 | False | 82 |
| 3 | spotify:track:0afhq8XCExXpqazXczTSve | Galway Girl | Ed Sheeran | 6eUKZXaKkcviH0Ku9w2n3V | 3T4tUhGYeRNVUGevb0wThu | ÷ (Deluxe) | 170826 | False | 81 |
| 4 | spotify:track:1HNkqx9Ahdgi1Ixy2xkKkL | Photograph | Ed Sheeran | 6eUKZXaKkcviH0Ku9w2n3V | 1xn54DMo2qIqBuMqHtUsFd | x (Deluxe Edition) | 258986 | False | 87 |


In [29]:
def flatten_features(df):
    """
    Flatten audio features data in a DataFrame.

    This function takes a DataFrame containing audio features and flattens the data
    by extracting specific audio feature values for each row and creating new columns
    in the DataFrame with these values.

    Args:
        df (pd.DataFrame): The input DataFrame containing 'features' column.

    Returns:
        pd.DataFrame: A new DataFrame with flattened audio feature columns.
        
    """
    feature_names = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                     'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']
    featurelist = []
    for features in df['features']:
        try:
            # Each cell holds the list returned by sp.audio_features for one track
            featurelist.append([features[0][name] for name in feature_names])
        except (TypeError, KeyError, IndexError):
            # Missing or incomplete features are filled with zeros
            featurelist.append([0] * len(feature_names))
    # Reuse the index of df so that the rows line up when concatenating
    featureframe = pd.DataFrame(featurelist, columns=feature_names, index=df.index)
    df = pd.concat([df, featureframe], axis=1)
    df = df.drop('features', axis=1)

    return df
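One design choice in `flatten_features` is to fill missing features with zeros, which cannot be told apart from genuine values such as a `key` or `mode` of 0. A possible alternative, sketched here in plain Python, marks missing values with `None` instead. `FEATURE_NAMES` and `features_to_row` are hypothetical names, and the input is assumed to be the one-element list stored in the `features` column.

```python
FEATURE_NAMES = [
    "danceability", "energy", "key", "loudness", "mode", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
]


def features_to_row(features):
    """Turn the list returned for one track into a flat dict of feature values.

    Features that are missing become None instead of 0.
    """
    first = features[0] if features else None
    if not first:
        return {name: None for name in FEATURE_NAMES}
    return {name: first.get(name) for name in FEATURE_NAMES}


print(features_to_row([{"danceability": 0.5, "tempo": 120.0}])["tempo"])  # 120.0
print(features_to_row([None])["tempo"])                                   # None
```

A list of such rows can be passed to `pd.DataFrame`, where the `None` entries show up as missing values that are easy to filter out before clustering.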

In [30]:
# Testing the functions takes too long

# df_songs2 = df_songs.copy()
# df_test = flatten_features(df_songs)
# df_test.head(5)

## Getting all songs

In [31]:
def get_playlist_tracks_features_to_dataframe(playlist_id):
    """
    Retrieves tracks and their features from a Spotify playlist and returns them as a pandas DataFrame.

    Args:
    playlist_id (str): The Spotify ID of the playlist.

    Returns:
    pandas.DataFrame: A DataFrame containing track URI, title, artist name, artist ID, album ID, album name, length, explicit status, popularity, and audio features.
    """

    # Collect one DataFrame per batch and concatenate them at the end
    # (DataFrame.append was removed in pandas 2.0)
    frames = []

    # Retrieve the first batch of songs from the playlist
    results = sp.user_playlist_tracks("spotify", playlist_id)
    total = results['total']

    # Monitor progress
    errorcount = 0
    max_errors = 10  # Give up after this many failures to avoid looping forever
    fetched = 0

    # Process the playlist batch by batch
    while results:
        try:
            # Extract data and convert it into a DataFrame
            flat = get_tracks(results['items'])
            resultframe = pd.DataFrame(data=flat, columns=['uri', 'title', 'artist_name', 'artist_id', 'album_id', 'album_name', 'length', 'explicit', 'popularity'])

            # Retrieve features of the songs
            resultframe['features'] = resultframe['uri'].apply(get_features_delayed)

            # Flatten the features
            resultframe = flatten_features(resultframe)

            # Look up the next batch of songs, if available
            next_results = sp.next(results) if results['next'] else None

        except Exception as e:
            errorcount += 1
            print(f'Error while fetching. # {errorcount}. Error: {e}')
            if errorcount >= max_errors:
                print('Too many errors, returning the songs fetched so far.')
                break
            continue  # Retry the same batch

        # Store the batch only after every step succeeded, so retries create no duplicates
        frames.append(resultframe)
        results = next_results

        # Report progress
        fetched += len(resultframe)
        if results:
            print(f'Fetched {fetched} out of {total} ({100 * fetched / total:.1f}%)')
        else:
            print("Fetch complete.")

    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

In [32]:
# Takes too long to check on this

# Call the function with the playlist ID
#playlist_id = '5S8SJdl1BDc0ugpkEvFsIL'
#df_result = get_playlist_tracks_features_to_dataframe(playlist_id)

# Display the head of the resulting DataFrame
#df_result.head()