# Data loading and prep

**objectives**
* load the songs from the zip file
* perform transformations to prepare the data for two-tower training

**steps**
1. Create a BigQuery dataset
2. Load the Million Playlist Dataset into BigQuery
3. Create pipelines to download audio and artist features for training

## Notebook Setup

In [1]:
# naming convention for all cloud resources
VERSION        = "v1"                  # TODO
PREFIX         = f'ndr-{VERSION}'      # TODO

print(f"PREFIX = {PREFIX}")

PREFIX = ndr-v1


In [2]:
# staging GCS
GCP_PROJECTS             = !gcloud config get-value project
PROJECT_ID               = GCP_PROJECTS[0]

# GCS bucket and paths
BUCKET_NAME              = f'{PREFIX}-{PROJECT_ID}-bucket'
BUCKET_URI               = f'gs://{BUCKET_NAME}'

config = !gsutil cat {BUCKET_URI}/config/notebook_env.py
print(config.n)
exec(config.n)


PROJECT_ID               = "myproject32549"
PROJECT_NUM              = "683169793466"
LOCATION                 = "us-central1"

REGION                   = "us-central1"
BQ_LOCATION              = "US"
VPC_NETWORK_NAME         = "ucaip-haystack-vpc-network"

VERTEX_SA                = "683169793466-compute@developer.gserviceaccount.com"

PREFIX                   = "ndr-v1"
VERSION                  = "v1"

APP                      = "sp"
MODEL_TYPE               = "2tower"
FRAMEWORK                = "tfrs"
DATA_VERSION             = "v1"
TRACK_HISTORY            = "5"

BUCKET_NAME              = "ndr-v1-myproject32549-bucket"
BUCKET_URI               = "gs://ndr-v1-myproject32549-bucket"
SOURCE_BUCKET            = "spotify-million-playlist-dataset"

DATA_GCS_PREFIX          = "data"
DATA_PATH                = "gs://ndr-v1-myproject32549-bucket/data"
VOCAB_SUBDIR             = "vocabs"
VOCAB_FILENAME           = "vocab_dict.pkl"

CANDIDATE_PREFIX         = "candidates"
TRAIN_DIR_PREFIX  

### env variables

In [3]:
# Set your variables for your project, region, and dataset name
SOURCE_BUCKET = 'spotify-million-playlist-dataset'
PROJECT_ID = 'myproject32549'
REGION = 'us-central1'
BQ_DATASET = 'spotify_e2e_test2'

import time
from google.cloud import bigquery

bigquery_client = bigquery.Client(project=PROJECT_ID)

### create BigQuery dataset

In [4]:
# Create a bigquery dataset (one time operation)
# Construct a full Dataset object to send to the API.
dataset = bigquery.Dataset(f"{PROJECT_ID}.{BQ_DATASET}")  # plain "project.dataset" ID, no backticks
dataset.location = "US"  # must match the location='US' used by the load jobs below

# Send the dataset to the API for creation, with an explicit timeout.
# Raises google.api_core.exceptions.Conflict if the Dataset already
# exists within the project.

# dataset = bigquery_client.create_dataset(dataset, timeout=30)  # Make an API request.

# print("Created dataset {}.{}".format(bigquery_client.project, dataset.dataset_id))

## Download data from GCS

(alternatively, `curl` the archive from the source; see the README)

* This step can take up to 30 minutes
* The load iterates over 1,000 JSON files if you are using the full dataset
* This should give you a `playlists` BigQuery table with about one million rows (one row per playlist)
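
Before loading, it can help to know how many rows to expect. The following is a minimal sketch, assuming the archive was extracted into a local directory of JSON slice files that each hold a top-level `playlists` list. The helper name `count_playlists` and the demo files are hypothetical and only there for illustration.

```python
import json
import os
import tempfile

def count_playlists(data_dir: str) -> int:
    """Sum the number of playlists across all JSON slice files in data_dir."""
    total = 0
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".json"):
            continue  # ignore anything that is not a slice file
        with open(os.path.join(data_dir, name)) as f:
            total += len(json.load(f)["playlists"])
    return total

# Demo on two tiny made-up slice files (not real dataset content)
demo_dir = tempfile.mkdtemp()
for idx, n_playlists in enumerate([2, 1]):
    with open(os.path.join(demo_dir, f"slice_{idx}.json"), "w") as f:
        json.dump({"playlists": [{"pid": p, "tracks": []} for p in range(n_playlists)]}, f)

expected_rows = count_playlists(demo_dir)
print(expected_rows)
```

Comparing this number with a `SELECT COUNT(*)` on the `playlists` table after the load is a quick way to spot missing or duplicated files.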

In [None]:
!gsutil cp gs://{SOURCE_BUCKET}/spotify_million_playlist_dataset.zip .
# !unzip -n spotify_million_playlist_dataset.zip 

In [8]:
print(SOURCE_BUCKET)

spotify-million-playlist-dataset


## BigQuery data prep

### Import playlists
* load JSON files to BigQuery
* each playlist's `tracks` are stored as one large string per row, which is parsed later with BigQuery's JSON functions
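
How the nested `tracks` list is turned into a string matters for the later parsing step. The sketch below uses only the standard library and a hand-written sample record (not taken from the dataset) to show the difference between Python's `str()` and `json.dumps`. The helper `is_valid_json` is a name introduced here for illustration.

```python
import json

# Hypothetical sample shaped like one playlist's `tracks` value
tracks = [{"pos": 0, "artist_name": "Guns N' Roses", "track_name": "Patience"}]

as_repr = str(tracks)         # Python repr: single-quoted keys, not valid JSON
as_json = json.dumps(tracks)  # valid JSON text

def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json(as_repr))
print(is_valid_json(as_json))
print(json.loads(as_json) == tracks)
```

Only the `json.dumps` form is guaranteed to round-trip through a strict JSON parser, which is why it is the safer serialization for a column that will be parsed as JSON.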

In [12]:
import os
import json
import pandas as pd
from tqdm import tqdm
import multiprocessing

data_files = os.listdir('./data')

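def load_data(filename: str):
    """Read one slice file and return its playlists as a DataFrame."""
    with open(f'data/{filename}') as f:
        json_dict = json.load(f)
    df = pd.DataFrame(json_dict['playlists'])
    # serialize the nested track list to a JSON string so BigQuery's JSON functions can parse it
    df['tracks'] = df['tracks'].map(json.dumps)
    return df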
        
# make sure the playlists table does not already contain data: this cell appends
# loops over the json files in batches - converts to pandas, then uploads/appends
# each file is read exactly once, so no playlist is uploaded twice

### set this if you want to limit to a smaller number of files (1,000 playlists each) - this scales significantly later!
n_files_limit = None  # e.g. 10 to load only the first 10 files
files_to_load = data_files[:n_files_limit]
batch_size = 15  # number of files concatenated per upload

for start in tqdm(range(0, len(files_to_load), batch_size)):
    batch_files = files_to_load[start:start + batch_size]
    append_df = pd.concat([load_data(filename) for filename in batch_files])
    append_df.to_gbq(
        destination_table=f'{BQ_DATASET}.playlists', 
        project_id=PROJECT_ID, # TODO: param
        location='US', 
        progress_bar=False, 
        reauth=True, 
        if_exists='append'
    )


#### check table

<!-- ![](img/tracks-string.png) -->
<img
  src="img/tracks-string.png"
  alt="Alt text"
  title="BQ table: playlists"
  style="display: inline-block; margin: 0 auto; max-width: 1200px">

### Nested playlists

* run parameterized queries to shape the data
* This query parses the JSON strings into BigQuery structs so that subsequent queries can manipulate them
* Creates `playlists_nested`, where `tracks` is an array of structs instead of a string
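
To make the reshaping concrete, here is a small pure-Python sketch of what the query does to a single row. It is an illustration only, not the BigQuery implementation: `nest_tracks`, `to_scalar`, and the sample row are hypothetical, and the sketch only handles string and numeric values, which it stringifies in the way `JSON_EXTRACT_SCALAR` returns strings.

```python
import json
from typing import Any, Dict, List, Optional

TRACK_FIELDS = ["pos", "artist_name", "track_uri", "artist_uri",
                "track_name", "album_uri", "duration_ms", "album_name"]

def to_scalar(value: Any) -> Optional[str]:
    """Stringify a JSON scalar; missing values stay None (NULL in SQL)."""
    return None if value is None else str(value)

def nest_tracks(row: Dict[str, Any]) -> Dict[str, Any]:
    """Replace the raw `tracks` string with a list of per-track records."""
    parsed: List[Dict[str, Any]] = json.loads(row["tracks"])
    nested = [{field: to_scalar(track.get(field)) for field in TRACK_FIELDS}
              for track in parsed]
    other_columns = {k: v for k, v in row.items() if k != "tracks"}
    return {**other_columns, "tracks": nested}

# Made-up row with a single, partially filled track
row = {
    "name": "road trip",
    "pid": 7,
    "tracks": json.dumps([{"pos": 0, "track_uri": "spotify:track:aaa", "duration_ms": 1000}]),
}
nested_row = nest_tracks(row)
print(nested_row["tracks"][0])
```

Every other column passes through unchanged, which mirrors the `* except(tracks)` part of the query.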

In [6]:
%%time

json_extract_query = f"""create or replace table `{PROJECT_ID}.{BQ_DATASET}.playlists_nested` as (
with json_parsed as (SELECT * except(tracks), JSON_EXTRACT_ARRAY(tracks) as json_data FROM `{PROJECT_ID}.{BQ_DATASET}.playlists` )

select json_parsed.* except(json_data),
ARRAY(SELECT AS STRUCT
JSON_EXTRACT_SCALAR(json_data, "$.pos") as pos, 
JSON_EXTRACT_SCALAR(json_data, "$.artist_name") as artist_name,
JSON_EXTRACT_SCALAR(json_data, "$.track_uri") as track_uri,
JSON_EXTRACT_SCALAR(json_data, "$.artist_uri") as artist_uri,
JSON_EXTRACT_SCALAR(json_data, "$.track_name") as track_name,
JSON_EXTRACT_SCALAR(json_data, "$.album_uri") as album_uri,
JSON_EXTRACT_SCALAR(json_data, "$.duration_ms") as duration_ms,
JSON_EXTRACT_SCALAR(json_data, "$.album_name") as album_name
from json_parsed.json_data
) as tracks,
from json_parsed) """

bigquery_client.query(json_extract_query).result()

CPU times: user 34.1 ms, sys: 820 µs, total: 34.9 ms
Wall time: 37.4 s


<google.cloud.bigquery.table._EmptyRowIterator at 0x7fba5d93a0b0>

#### check table

<!-- ![](img/playlists-nested.png) -->
<img
  src="img/playlists-nested.png"
  alt="Alt text"
  title="BQ table: playlists_nested"
  style="display: inline-block; margin: 0 auto; max-width: 1200px">

### Unique tracks

* This table will then be used to call the Spotify API and enrich with additional data about each track and artist
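
The deduplication itself is simple: flatten every playlist's tracks and keep each distinct (track, album, artist) URI triple once. A minimal pure-Python sketch of that idea, using made-up playlists and a hypothetical helper name `unique_tracks`:

```python
from typing import Dict, Iterable, List, Tuple

def unique_tracks(playlists: Iterable[Dict]) -> List[Tuple[str, str, str]]:
    """Return the distinct (track_uri, album_uri, artist_uri) triples."""
    seen = set()
    for playlist in playlists:
        for track in playlist["tracks"]:  # plays the role of UNNEST(tracks)
            seen.add((track["track_uri"], track["album_uri"], track["artist_uri"]))
    return sorted(seen)  # sorted only to make the output deterministic

# Two made-up playlists that share one track
shared = {"track_uri": "spotify:track:aaa", "album_uri": "spotify:album:x", "artist_uri": "spotify:artist:1"}
other = {"track_uri": "spotify:track:bbb", "album_uri": "spotify:album:y", "artist_uri": "spotify:artist:1"}
playlists = [{"pid": 0, "tracks": [shared, other]}, {"pid": 1, "tracks": [shared]}]

triples = unique_tracks(playlists)
print(len(triples))
```

Because the artist URI is part of the triple, the same artist can appear on several rows; the enrichment step selects distinct artist URIs again before calling the API.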

In [13]:
%%time

unique_tracks_sql = f"""create or replace table `{PROJECT_ID}.{BQ_DATASET}.tracks_unique` as (
SELECT distinct 
    track.track_uri,
    track.album_uri,
    track.artist_uri, 
FROM `{PROJECT_ID}.{BQ_DATASET}.playlists_nested`, UNNEST(tracks) as track)
"""

bigquery_client.query(unique_tracks_sql).result()

CPU times: user 20.7 ms, sys: 394 µs, total: 21 ms
Wall time: 6.88 s


<google.cloud.bigquery.table._EmptyRowIterator at 0x7f09d8170f70>

**Data loading complete**

* We now have the table of unique track, album, and artist URIs that we will use to fetch additional audio and artist features

# Spotify API Feature Extraction Pipeline
___________

**references**
* Spotify Million Playlist Dataset Challenge [Homepage](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge)
* [Spotify Web API docs](https://developer.spotify.com/documentation/web-api/reference/#/)

**Community Examples**
* [Extracting song lists](https://github.com/tojhe/recsys-spotify/blob/master/processing/songlist_extraction.py)
* [construct audio features with Spotify API](https://github.com/tojhe/recsys-spotify/blob/master/processing/audio_features_construction.py)
* [Using Spotify API](https://towardsdatascience.com/extracting-song-data-from-the-spotify-api-using-python-b1e79388d50)

**Spotify Developer Dashboard**
<!-- ![](img/spotify-dev-console.png) -->
<!-- <img src="https://github.com/jswortz/spotify_mpd_two_tower/blob/main/img/spotify-dev-console.png"  width="600" height="300"> -->

<img
  src="img/spotify-dev-console.png"
  alt="Alt text"
  title="Spotify Developer Dashboard"
  style="display: inline-block; margin: 0 auto; max-width: 900px">

## Setup

**TODO - required**
* In your repo, create `spotipy_secret_creds.py`
* add the file to `.gitignore`
* define variables:
> * SPOTIPY_CLIENT_ID='YOUR_CLIENT_ID'
> * SPOTIPY_CLIENT_SECRET='YOUR_CLIENT_SECRET'
> * SPOTIFY_USERNAME='YOUR_USERNAME'

In [1]:
# local py file
import spotipy_secret_creds as creds

# read the credentials from the local file instead of hard-coding them in the notebook
SPOTIPY_CLIENT_ID=creds.SPOTIPY_CLIENT_ID
SPOTIPY_CLIENT_SECRET=creds.SPOTIPY_CLIENT_SECRET

**local json credentials - optional**
* if you are concerned about exposing API keys (credentials), see [GCP Secret Manager](https://cloud.google.com/secret-manager)
* Below is an example of reading the credentials from Secret Manager, assuming they were stored there as a JSON document (keys: `secret`, `id`)

```python
import json

from google.cloud import secretmanager

### Note: copy/paste this resource name from Secret Manager in the console
SECRET_VERSION = 'projects/934903580331/secrets/spotify-creds1/versions/1'

sm_client = secretmanager.SecretManagerServiceClient()

# SECRET_VERSION is the full resource name, so no separate secret path is needed
response = sm_client.access_secret_version(request={"name": SECRET_VERSION})   

payload = json.loads(response.payload.data.decode("UTF-8"))
```

In [None]:
# Get spotify credentials
# This file has id and secret stored as attributes

###########################################
### CAUTION THIS APPROACH WILL HAVE THE CREDENTIALS APPEAR IN THE CONSOLE - 
### USE SECRET MANAGER APPROACH IN EACH COMPONENT AS NEEDED (PROVIDED ABOVE)
###########################################

# # json 
# creds = open('spotify-creds.json')
# spotify_creds = json.load(creds)
# creds.close()

# SPOTIPY_CLIENT_ID=spotify_creds['id']
# SPOTIPY_CLIENT_SECRET=spotify_creds['secret']

### pip & package

In [40]:
!pip install -U spotipy google-cloud-storage google-cloud-aiplatform gcsfs --user -q
!pip install --user kfp google-cloud-pipeline-components --upgrade -q
!pip install --user -q google-cloud-secret-manager



In [38]:
! python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
! python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"
! python3 -c "import google.cloud.aiplatform; print('aiplatform SDK version: {}'.format(google.cloud.aiplatform.__version__))"

KFP SDK version: 2.7.0
google_cloud_pipeline_components version: 2.13.1
aiplatform SDK version: 1.48.0


In [46]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

import re
from tqdm import tqdm

import os

import pandas as pd
import numpy as np
import json

import time

import gcsfs

# GCP
from google.cloud import aiplatform
from google.cloud import aiplatform as vertex_ai
from google.cloud import storage

# Pipelines
from typing import Any, Callable, Dict, NamedTuple, Optional, List
from google_cloud_pipeline_components import aiplatform as gcc_aip
from google_cloud_pipeline_components.types import artifact_types

# Kubeflow SDK
# TODO: fix these
from kfp.v2 import dsl
import kfp
import kfp.v2.dsl
from kfp.v2.google import client as pipelines_client
from kfp.v2.dsl import (Artifact, Dataset, Input, InputPath, Model, Output,
                        OutputPath, component)

### env variables

In [5]:
PROJECT_ID = 'myproject32549' #update
LOCATION = 'us-central1' 

BUCKET_NAME = 'matching-engine-content-fred'

VERSION = 'v1'
PIPELINE_VERSION = f'pipe-v1' # pipeline code
PIPELINE_TAG = f'{PIPELINE_VERSION}-spotify-feature-enrich'

print("PIPELINE_TAG:", PIPELINE_TAG)

PIPELINE_TAG: pipe-v1-spotify-feature-enrich


**create bucket if needed**

In [34]:
!gsutil mb -l $LOCATION gs://$BUCKET_NAME

Creating gs://matching-engine-content-fred/...


### client & credentials
* Setup Vertex AI client for pipelines

In [35]:
# Setup clients
vertex_ai.init(
    project=PROJECT_ID, 
    location=LOCATION, 
    staging_bucket=f'gs://{BUCKET_NAME}'  # expects a gs:// URI
)

## Create pipeline components

In [36]:
REPO_SRC = 'src'
PIPELINES_SUB_DIR = 'feature_pipes'

# ! rm -rf {REPO_SRC}/{PIPELINES_SUB_DIR}
! mkdir -p {REPO_SRC}/{PIPELINES_SUB_DIR}


### component: track audio features

[Link to the audio features API and related features we will pull](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features)

In [62]:
%%writefile {REPO_SRC}/{PIPELINES_SUB_DIR}/call_spotify_api_audio.py

import kfp
from typing import Any, Callable, Dict, NamedTuple, Optional, List
from kfp.v2.dsl import (Artifact, Dataset, Input, InputPath, Model, Output,
                        OutputPath, component, Metrics)

@kfp.v2.dsl.component(
    base_image="python:3.10.14",
    packages_to_install=[
        'fsspec', 'google-cloud-bigquery',
        'google-cloud-storage',
        'gcsfs',
        'spotipy','requests','db-dtypes',
        'numpy','pandas','pyarrow','absl-py', 'pandas-gbq==0.19.0',
        'tqdm'
    ]
)
def call_spotify_api_audio(
    project: str,
    location: str,
    client_id: str,
    batch_size: int,
    batches_to_store: int,
    target_table: str,
    client_secret: str,
    unique_table: str,
    sleep_param: float,
) -> NamedTuple('Outputs', [
    ('done_message', str),
]):
    print(f'pip install complete')
    import os
    
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials
    
    import re
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    
    import pandas as pd
    import json
    import time
    import numpy as np
    
    from requests.exceptions import ReadTimeout, HTTPError, ConnectionError, RequestException
    from absl import logging
    
    from google.cloud import storage
    import gcsfs
    from google.cloud import bigquery
    
    import pandas_gbq
    from multiprocessing import Process
    from tqdm import tqdm
    from tqdm.contrib.logging import logging_redirect_tqdm
    
    from google.cloud.exceptions import NotFound

    import multiprocessing

    # print(f'package import complete')

    logging.set_verbosity(logging.INFO)
    logging.info(f'package import complete')

    
    bq_client = bigquery.Client(
      project=project, location=location
    )
    
    def spot_audio_features(uri, client_id, client_secret):

        # Authenticate
        client_credentials_manager = SpotifyClientCredentials(
            client_id=client_id, 
            client_secret=client_secret
        )
        sp = spotipy.Spotify(
            client_credentials_manager = client_credentials_manager, 
            requests_timeout=10, 
            retries=10
        )
        ############################################################################
        # Create Track Audio Features DF
        ############################################################################
        
        uri_stripped = [u.replace('spotify:track:', '') for u in uri] #fix the quotes 
        #getting track popularity
        tracks = sp.tracks(uri_stripped)
        #Audio features
        time.sleep(sleep_param)
    
        a_feats = sp.audio_features(uri)
        features = pd.json_normalize(a_feats)#.to_dict('list')
        
        features['track_pop'] = pd.json_normalize(tracks['tracks'])['popularity']
        
        features['track_uri'] = uri
        return features

    bq_client = bigquery.Client(
        project=project, 
        location='US'
    )
    
    #check if target table exists and if so return a list to not duplicate records
    try:
        bq_client.get_table(target_table)  # Make an API request.
        logging.info("Table {} already exists.".format(target_table))
        target_table_incomplete_query = f"select distinct track_uri from `{target_table}`"
        loaded_tracks_df = bq_client.query(target_table_incomplete_query).result().to_dataframe()
        loaded_tracks = loaded_tracks_df.track_uri.to_list()
        
    except NotFound:
        logging.info("Table {} is not found.".format(target_table))
        loaded_tracks = []  # nothing loaded yet, so every track still needs processing
    
    query = f"select distinct track_uri from `{unique_table}`" 

    #refactor
    schema = [{'name':'danceability', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'energy', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'key', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'loudness', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'mode', 'type': 'INTEGER', "mode": "NULLABLE"},
        {'name':'speechiness', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'acousticness', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'instrumentalness', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'liveness', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'valence', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'tempo', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'type', 'type': 'STRING', "mode": "NULLABLE"},
        {'name':'id', 'type': 'STRING', "mode": "NULLABLE"},
        {'name':'uri', 'type': 'STRING', "mode": "NULLABLE"},
        {'name':'track_href', 'type': 'STRING', "mode": "NULLABLE"},
        {'name':'analysis_url', 'type': 'STRING', "mode": "NULLABLE"},
        {'name':'duration_ms', 'type': 'INTEGER', "mode": "NULLABLE"},
        {'name':'time_signature', 'type': 'INTEGER', "mode": "NULLABLE"},
        {'name':'track_pop', 'type': 'INTEGER', "mode": "NULLABLE"},
        {'name':'track_uri', 'type': 'STRING', "mode": "REQUIRED"},
    ]
    
    tracks = bq_client.query(query).result().to_dataframe()
    track_list = tracks.track_uri.to_list()
    logging.info(f'finished downloading tracks')
    
    
    ### This section is used when there are tracks already loaded into BQ and you want to resume loading the data
    try:
        track_list = list(set(track_list) - set(loaded_tracks)) #sets the new track list to remove already loaded data in BQ
    except NameError:
        pass  # no previously loaded tracks to exclude
    

    from tqdm import tqdm
    def process_track_list(track_list):
        
        last_index = len(track_list) - 1  # index of the final URI in this chunk
        uri_batch = []       # URIs waiting to be sent to the Spotify API
        pending_frames = []  # API results waiting to be written to BQ
        
        for i, uri in enumerate(tqdm(track_list)):
            uri_batch.append(uri)
            is_last = (i == last_index)
            
            if len(uri_batch) == batch_size or is_last:  # call the API once per full batch, and for the final partial batch
                audio_featureDF = None
                for attempt in range(2):  # one retry after a read timeout
                    try:
                        audio_featureDF = spot_audio_features(uri_batch, client_id, client_secret)
                        break
                    except ReadTimeout:
                        logging.info("Spotify timed out... trying again...")
                    except HTTPError as err: #JW ADDED
                        logging.info(f"HTTP Error: {err}")
                        break
                    except spotipy.exceptions.SpotifyException as spotify_error: #jw_added
                        logging.info(f"Spotify error: {spotify_error}")
                        break
                time.sleep(sleep_param)
                # always reset: a failed batch is skipped here and picked up again
                # by the resume logic the next time the component runs
                uri_batch = []
                
                if audio_featureDF is not None:
                    pending_frames.append(audio_featureDF)
                
                # Accumulate batches on the machine before writing to BQ;
                # always flush whatever is pending after the final batch
                if pending_frames and (len(pending_frames) >= batches_to_store or is_last):
                    pd.concat(pending_frames).to_gbq(
                        destination_table=target_table, 
                        project_id=f'{project}', 
                        location='US', 
                        table_schema=schema,
                        progress_bar=False, 
                        reauth=False, 
                        if_exists='append'
                    )
                    pending_frames = []
                    logging.info(f'{i+1} of {len(track_list)} complete!')

        logging.info(f'audio features appended')
    
    #multiprocessing portion - we will loop based on the modulus of the track_uri list
    #chunk the list 
    
    # Yield successive n-sized
    # chunks from l.
    def divide_chunks(l, n):
        # looping till length l
        for i in range(0, len(l), n):
            yield l[i:i + n]
            
    n_cores = multiprocessing.cpu_count() 
    chunk_size = max(1, -(-len(track_list) // n_cores))  # ceiling division; at least 1 so range() never gets a zero step
    chunked_tracks = list(divide_chunks(track_list, chunk_size)) #produces a list of at most n_cores near-equal chunks
    
    logging.info(
        f"""
        total tracks downloaded: {len(track_list)}\n
        length of chunked_tracks: {len(chunked_tracks)}\n 
        and inner dims: {[len(x) for x in chunked_tracks]}
        """
    )

    procs = []
    
    def create_job(target, *args):
        p = multiprocessing.Process(target=target, args=args)
        p.start()
        return p

    # starting process with arguments
    for track_chunk in chunked_tracks:
        proc = create_job(process_track_list, track_chunk)
        time.sleep(np.pi)
        procs.append(proc)

    # complete the processes
    for proc in procs:
        proc.join()
        
    # process_track_list(track_list) #single thread
     
    logging.info(f'audio features appended')
    
    return (
          f'DONE',
      )

Overwriting src/feature_pipes/call_spotify_api_audio.py
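
The component splits the URI list into chunks and hands each chunk to its own process. The chunk size deserves a little care: a sketch of the chunking, using ceiling division (an assumption made here so that a list shorter than the number of workers never yields a chunk size of zero). `chunk_for_workers` is a hypothetical helper name; `divide_chunks` matches the generator used in the component.

```python
import math
from typing import List, Sequence

def divide_chunks(items: Sequence, n: int):
    """Yield successive n-sized chunks from items."""
    for i in range(0, len(items), n):
        yield items[i:i + n]

def chunk_for_workers(items: Sequence, n_workers: int) -> List[Sequence]:
    """Split items into at most n_workers chunks of near-equal size."""
    if not items:
        return []
    chunk_size = max(1, math.ceil(len(items) / n_workers))
    return list(divide_chunks(items, chunk_size))

ten_items = list(range(10))
chunks = chunk_for_workers(ten_items, 4)       # chunk size 3
short_chunks = chunk_for_workers([1, 2, 3], 8)  # fewer items than workers

print([len(c) for c in chunks])
print([len(c) for c in short_chunks])
```

With truncating division instead, ten items on four workers would give a chunk size of 2 and therefore five processes, and three items on eight workers would give a chunk size of 0, which `range()` rejects.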


### component: artist metadata 

[Link to artist API and related features we will pull](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-an-artist)

In [63]:
%%writefile {REPO_SRC}/{PIPELINES_SUB_DIR}/call_spotify_api_artist.py

import kfp
from typing import Any, Callable, Dict, NamedTuple, Optional, List
from kfp.v2.dsl import (Artifact, Dataset, Input, InputPath, Model, Output,
                        OutputPath, component, Metrics)

### Artist tracks api call
@kfp.v2.dsl.component(
    base_image="python:3.10.14",
    packages_to_install=[
        'fsspec', 'google-cloud-bigquery',
        'google-cloud-storage',
        'gcsfs', 'tqdm',
        'spotipy','requests','db-dtypes',
        'numpy','pandas','pyarrow','absl-py', 'pandas-gbq==0.19.0',
        'google-cloud-secret-manager'
    ]
)
def call_spotify_api_artist(
    project: str,
    location: str,
    unique_table: str,
    batch_size: int,
    batches_to_store: int,
    client_id: str,
    client_secret: str,
    sleep_param: float,
    target_table: str,
) -> NamedTuple('Outputs', [
    ('done_message', str),
]):
    print(f'pip install complete')
    
    import os
    
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials
    
    import re
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    
    import pandas as pd
    import json
    import time
    import numpy as np
    
    from google.cloud import storage
    import gcsfs
    from google.cloud import bigquery

    from requests.exceptions import ReadTimeout, HTTPError, ConnectionError, RequestException
    from absl import logging

    import pandas_gbq
    pd.options.mode.chained_assignment = None  # default='warn'
    from google.cloud.exceptions import NotFound
    
    from multiprocessing import Process
    import multiprocessing

    logging.set_verbosity(logging.INFO)
    logging.info(f'package import complete')

    storage_client = storage.Client(
        project=project
    )
    
    logging.info(f'spotipy auth complete')
    
    def spot_artist_features(uri, client_id, client_secret):

        # Authenticate
        client_credentials_manager = SpotifyClientCredentials(
            client_id=client_id, 
            client_secret=client_secret
        )
        sp = spotipy.Spotify(
            client_credentials_manager = client_credentials_manager, 
            requests_timeout=2, 
            retries=1 )

        ############################################################################
        # Create Artist Features DF
        ############################################################################ 

        # uri = [u.replace('spotify:artist:', '') for u in uri] #fix the quotes 

        artists = sp.artists(uri)
        features = pd.json_normalize(artists['artists'])
        smaller_features = features[['genres', 'popularity', 'name', 'followers.total']]
        smaller_features.columns = ['genres_list',  'artist_pop', 'name',  'followers']
        smaller_features['artist_uri'] = uri
        smaller_features['genres'] = smaller_features['genres_list'].map(lambda x: f"{x}")
        return smaller_features[['genres', 'artist_pop', 'artist_uri', 'followers']]
        

    bq_client = bigquery.Client(
      project=project, location='US'
    )

    #check if target table exists and if so return a list to not duplicate records
    try:
        bq_client.get_table(target_table)  # Make an API request.
        logging.info("Table {} already exists.".format(target_table))
        target_table_incomplete_query = f"select distinct artist_uri from `{target_table}`"
        loaded_tracks_df = bq_client.query(target_table_incomplete_query).result().to_dataframe()
        loaded_tracks = loaded_tracks_df.artist_uri.to_list()
        
    except NotFound:
        logging.info("Table {} is not found.".format(target_table))
        loaded_tracks = []  # nothing loaded yet, so every artist still needs processing
        
        
    query = f"select distinct artist_uri from `{unique_table}`"
    

    schema = [
        {'name': 'artist_pop', 'type': 'INTEGER'},
        {'name':'genres', 'type': 'STRING'},
        {'name':'followers', 'type': 'INTEGER'},
        {'name':'artist_uri', 'type': 'STRING'}
    ]
    
    
    tracks = bq_client.query(query).result().to_dataframe()
    track_list = tracks.artist_uri.to_list()
    logging.info(f'finished downloading artists')
    
    # skip artists already loaded into BQ so a rerun resumes instead of duplicating rows
    track_list = list(set(track_list) - set(loaded_tracks))
    
    from tqdm import tqdm
    def process_track_list(track_list):
        last_index = len(track_list) - 1  # index of the final URI in this chunk
        uri_batch = []       # URIs waiting to be sent to the Spotify API
        pending_frames = []  # API results waiting to be written to BQ
        for i, uri in enumerate(tqdm(track_list)):
            uri_batch.append(uri)
            is_last = (i == last_index)
            
            if len(uri_batch) == batch_size or is_last:  # call the API once per full batch, and for the final partial batch
                artist_featureDF = None
                for attempt in range(2):  # one retry after a read timeout
                    try:
                        artist_featureDF = spot_artist_features(uri_batch, client_id, client_secret)
                        break
                    except ReadTimeout:
                        logging.info("Spotify timed out... trying again...")
                    except HTTPError as err: #JW ADDED
                        logging.info(f"HTTP Error: {err}")
                        break
                    except spotipy.exceptions.SpotifyException as spotify_error: #jw_added
                        logging.info(f"Spotify error: {spotify_error}")
                        break
                time.sleep(sleep_param)
                # always reset: a failed batch is skipped here rather than re-sent with the next one
                uri_batch = []
                
                if artist_featureDF is not None:
                    pending_frames.append(artist_featureDF)
                
                # Accumulate batches on the machine before writing to BQ;
                # always flush whatever is pending after the final batch
                if pending_frames and (len(pending_frames) >= batches_to_store or is_last):
                    pd.concat(pending_frames).to_gbq(
                        destination_table=target_table, 
                        project_id=f'{project}', 
                        location='US', 
                        table_schema=schema,
                        progress_bar=False, 
                        reauth=False, 
                        if_exists='append'
                    )
                    pending_frames = []
                    logging.info(f'{i+1} of {len(track_list)} complete!')

        logging.info(f'artist features appended')
    
    #multiprocessing portion - we will loop based on the modulus of the track_uri list
    #chunk the list 
    
    # Yield successive n-sized
    # chunks from l.
    def divide_chunks(l, n):
        # looping till length l
        for i in range(0, len(l), n):
            yield l[i:i + n]
            
    n_cores = multiprocessing.cpu_count() 
    chunk_size = max(1, -(-len(track_list) // n_cores))  # ceiling division; at least 1 so range() never gets a zero step
    chunked_tracks = list(divide_chunks(track_list, chunk_size)) # produces a list of at most n_cores near-equal chunks
    
    logging.info(
        f"""total tracks downloaded: {len(track_list)}\n
        length of chunked_tracks: {len(chunked_tracks)}\n 
        and inner dims: {[len(x) for x in chunked_tracks]}
        """
    )
    
    procs = []
    def create_job(target, *args):
        p = multiprocessing.Process(target=target, args=args)
        p.start()
        return p

    # starting process with arguments
    for track_chunk in chunked_tracks:
        proc = create_job(process_track_list, track_chunk)
        time.sleep(np.pi)
        procs.append(proc)

    # complete the processes
    for proc in procs:
        proc.join()
        
    # process_track_list(track_list)
    logging.info(f'artist features appended')
    
    return (
          f'DONE',
      )

Overwriting src/feature_pipes/call_spotify_api_artist.py
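
Both components retry a batch when the Spotify request times out. The general pattern can be sketched independently of any client library. The helper below is a hypothetical illustration rather than part of spotipy or of the components: `call_with_retries`, its parameters, and the flaky demo function are all made up, and the built-in `TimeoutError` stands in for a request timeout.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def call_with_retries(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")

# Demo: a function that times out twice before succeeding
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

result = call_with_retries(flaky, retry_on=(TimeoutError,), max_attempts=3)
print(result, calls["n"])
```

Only the exception types listed in `retry_on` are retried; anything else propagates immediately, which keeps permanent errors such as bad credentials from being retried pointlessly.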


## Build pipeline

In [51]:
print(f'REPO_SRC: {REPO_SRC}')
print(f'PIPELINES_SUB_DIR: {PIPELINES_SUB_DIR}')

REPO_SRC: src
PIPELINES_SUB_DIR: feature_pipes


In [52]:
from src.feature_pipes import call_spotify_api_audio, call_spotify_api_artist

@kfp.v2.dsl.pipeline(
  name=f'spotify-feature-enrichment-{PIPELINE_TAG}'.replace('_', '-')
)
def pipeline(
    project: str,
    location: str,
    unique_table: str,
    target_table_audio: str,
    target_table_artist: str,
    target_table_popularity: str,
    batch_size: int,
    batches_to_store: int,
    sleep_param: float,
    spotify_id: str = SPOTIPY_CLIENT_ID, # = spotify_creds['id'],
    spotify_secret: str = SPOTIPY_CLIENT_SECRET, # = spotify_creds['secret'],
):
    
    call_spotify_api_artist_op = (
        call_spotify_api_artist.call_spotify_api_artist(
            project=project,
            location=location,
            client_id=spotify_id,
            client_secret=spotify_secret,
            batch_size=batch_size,
            sleep_param=sleep_param,
            unique_table=unique_table,
            target_table=target_table_artist,
            batches_to_store=batches_to_store,
        )
        .set_display_name("Get Artist Features From Spotify API")
    )

    call_spotify_api_audio_op = (
        call_spotify_api_audio.call_spotify_api_audio(
            project=project,
            location=location,
            client_id=spotify_id,
            client_secret=spotify_secret,
            batch_size=batch_size,
            sleep_param=sleep_param,
            unique_table=unique_table,
            target_table=target_table_audio,
            batches_to_store=batches_to_store,
        )
        .set_display_name("Get Track Audio Features From Spotify API")
        .after(call_spotify_api_artist_op)
    )


### Compile pipeline 
* compile the pipeline to a JSON spec
* store the spec in the GCS bucket for tracking

In [64]:
PIPELINE_JSON_SPEC_LOCAL = "custom_track_meta_pipeline_spec.json"

! rm -f $PIPELINE_JSON_SPEC_LOCAL

kfp.v2.compiler.Compiler().compile(
  pipeline_func=pipeline, 
  package_path=PIPELINE_JSON_SPEC_LOCAL,
)

In [65]:
PIPELINE_ROOT_PATH = f'gs://{BUCKET_NAME}/{VERSION}/pipeline_root'

PIPELINES_FILEPATH = f'{PIPELINE_ROOT_PATH}/{PIPELINE_JSON_SPEC_LOCAL}'

print("PIPELINES_FILEPATH:", PIPELINES_FILEPATH)

PIPELINES_FILEPATH: gs://matching-engine-content-fred/v1/pipeline_root/custom_track_meta_pipeline_spec.json


In [66]:
PIPELINE_JSON_SPEC_LOCAL

'custom_track_meta_pipeline_spec.json'

In [67]:
BUCKET_NAME

'matching-engine-content-fred'

In [68]:
PIPELINES_FILEPATH

'gs://matching-engine-content-fred/v1/pipeline_root/custom_track_meta_pipeline_spec.json'

In [69]:
!gsutil cp $PIPELINE_JSON_SPEC_LOCAL $PIPELINES_FILEPATH

Copying file://custom_track_meta_pipeline_spec.json [Content-Type=application/json]...
- [1 files][ 24.9 KiB/ 24.9 KiB]                                                
Operation completed over 1 objects/24.9 KiB.                                     


## Submit pipeline

* Set pipeline parameters dict
* create pipeline job
* submit

### set pipeline params

In [77]:
# BUCKET_NAME = 'matching-engine-content'

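ideal_batch_size = 20_000        # target number of rows per write to BQ
bts = ideal_batch_size // 50     # batches to buffer per write (rows per write / batch_size)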

PIPELINE_PARAMETERS = dict(
    project = PROJECT_ID,
    location = 'us-central1',
    unique_table = f'{PROJECT_ID}.{BQ_DATASET}.tracks_unique', 
    target_table_audio = f'{PROJECT_ID}.{BQ_DATASET}.audio_features',
    target_table_artist = f'{PROJECT_ID}.{BQ_DATASET}.artist_features',
    target_table_popularity = f'{PROJECT_ID}.{BQ_DATASET}.track_popularity',
    batch_size = 50,
    batches_to_store = int(bts),
    sleep_param = .5,
    # spotify_id = SPOTIPY_CLIENT_ID,
    # spotify_secret = SPOTIPY_CLIENT_SECRET
)

PIPELINE_PARAMETERS

{'project': 'myproject32549',
 'location': 'us-central1',
 'unique_table': 'myproject32549.spotify_e2e_test2.tracks_unique',
 'target_table_audio': 'myproject32549.spotify_e2e_test2.audio_features',
 'target_table_artist': 'myproject32549.spotify_e2e_test2.artist_features',
 'target_table_popularity': 'myproject32549.spotify_e2e_test2.track_popularity',
 'batch_size': 50,
 'batches_to_store': 400,
 'sleep_param': 0.5}
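
With these values each worker requests URIs from the API in batches of `batch_size` and buffers about `batches_to_store` of those batches before writing to BigQuery, so on the order of 400 × 50 = 20,000 rows per write. The sketch below shows that accumulate-then-flush pattern in isolation. `buffered_writes`, `fetch`, and `write` are hypothetical names, not part of the pipeline: `fetch` stands in for the Spotify call and `write` for the `to_gbq` append. This version also flushes whatever remains at the end, so a trailing partial buffer is not lost.

```python
from typing import Callable, Iterable, List


def buffered_writes(
    uris: Iterable[str],
    batch_size: int,
    batches_to_store: int,
    fetch: Callable[[List[str]], List[dict]],   # hypothetical stand-in for the API call
    write: Callable[[List[dict]], None],        # hypothetical stand-in for the BQ append
) -> int:
    """Fetch URIs in batches and write rows in larger buffered chunks.

    Returns the number of write calls made.
    """
    batch: List[str] = []
    buffer: List[dict] = []
    buffered_batches = 0
    writes = 0

    def flush() -> None:
        nonlocal buffer, buffered_batches, writes
        if buffer:
            write(buffer)
            writes += 1
        buffer = []
        buffered_batches = 0

    for uri in uris:
        batch.append(uri)
        if len(batch) == batch_size:
            buffer.extend(fetch(batch))
            batch = []
            buffered_batches += 1
            if buffered_batches == batches_to_store:
                flush()

    if batch:      # trailing partial batch
        buffer.extend(fetch(batch))
    flush()        # trailing partial buffer
    return writes


# Toy run: 23 URIs, batches of 5, two batches per write.
written: List[List[dict]] = []
n_writes = buffered_writes(
    [f"u{i}" for i in range(23)],
    batch_size=5,
    batches_to_store=2,
    fetch=lambda b: [{"uri": u} for u in b],
    write=written.append,
)
print(n_writes, [len(rows) for rows in written])  # 3 [10, 10, 3]
```
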

### create pipeline job

In [78]:
job = aiplatform.PipelineJob(
    display_name = f'spotify-feature-enrichment-{PIPELINE_TAG}'.replace('_', '-'),
    template_path = PIPELINE_JSON_SPEC_LOCAL,
    pipeline_root = f'gs://{BUCKET_NAME}/{VERSION}',
    parameter_values = PIPELINE_PARAMETERS,
    project = PROJECT_ID,
    location = LOCATION,
    enable_caching=False
)

In [9]:

import kfp
from typing import Any, Callable, Dict, NamedTuple, Optional, List
# from kfp.v2.dsl import (Artifact, Dataset, Input, InputPath, Model, Output,
#                         OutputPath, component, Metrics)


def call_spotify_api_artist(
    project: str,
    location: str,
    unique_table: str,
    batch_size: int,
    batches_to_store: int,
    client_id: str,
    client_secret: str,
    sleep_param: float,
    target_table: str,
) -> NamedTuple('Outputs', [
    ('done_message', str),
]):
    print(f'pip install complete')
    
    import os
    
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials
    
    import re
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    
    import pandas as pd
    import json
    import time
    import numpy as np
    
    from google.cloud import storage
    import gcsfs
    from google.cloud import bigquery

    from requests.exceptions import ReadTimeout, HTTPError, ConnectionError, RequestException
    from absl import logging

    import pandas_gbq
    pd.options.mode.chained_assignment = None  # default='warn'
    from google.cloud.exceptions import NotFound
    
    from multiprocessing import Process
    import multiprocessing
    import time

    logging.set_verbosity(logging.INFO)

    storage_client = storage.Client(
        project=project
    )
    
    logging.info(f'spotipy auth complete')
    
    def spot_artist_features(uri, client_id, client_secret):

        # Authenticate
        client_credentials_manager = SpotifyClientCredentials(
            client_id=client_id, 
            client_secret=client_secret
        )
        sp = spotipy.Spotify(
            client_credentials_manager = client_credentials_manager, 
            requests_timeout=6, 
            retries=0
        )
        
        # print("sp:", sp)
        ############################################################################
        # Create Artist Features DF
        ############################################################################ 

        # print("@@@ uri:", uri)
        artists = sp.artists(uri)
        # print("@artists:", artists)
        features = pd.json_normalize(artists['artists'])
        smaller_features = features[['genres', 'popularity', 'name', 'followers.total']]
        smaller_features.columns = ['genres_list',  'artist_pop', 'name',  'followers']
        smaller_features['artist_uri'] = uri
        smaller_features['genres'] = smaller_features['genres_list'].map(lambda x: f"{x}")
        # time.sleep(1)
        return smaller_features[['genres', 'artist_pop', 'artist_uri', 'followers']]
        

    bq_client = bigquery.Client(
      project=project
    )

    #check if target table exists and if so return a list to not duplicate records
    try:
        bq_client.get_table(target_table)  # Make an API request.
        logging.info("Table {} already exists.".format(target_table))
        target_table_incomplete_query = f"select distinct artist_uri from `{target_table}`"
        loaded_tracks_df = bq_client.query(target_table_incomplete_query).result().to_dataframe()
        loaded_tracks = loaded_tracks_df.artist_uri.to_list()
        
    except NotFound:
        logging.info("Table {} is not found.".format(target_table))
        
        
    query = f"select distinct artist_uri from `{unique_table}`"
    

    schema = [
        {'name': 'artist_pop', 'type': 'INTEGER'},
        {'name':'genres', 'type': 'STRING'},
        {'name':'followers', 'type': 'INTEGER'},
        {'name':'artist_uri', 'type': 'STRING'}
    ]
    
    
    tracks = bq_client.query(query).result().to_dataframe()
    # print(tracks)
    track_list = tracks.artist_uri.to_list()
    logging.info(f'finished downloading tracks')
    
    try:
        track_list = list(set(track_list) - set(loaded_tracks)) # drop URIs that are already loaded in BQ
    except NameError:
        # only raised if loaded_tracks was never defined (target table not found)
        pass
    
    from tqdm import tqdm
    def process_track_list(track_list):
        uri_list_length = len(track_list)-1 #starting count at zero
        inner_batch_count = 0 #avoiding calling the api on 0th iteration
        uri_batch = []
        for i, uri in enumerate(tqdm(track_list)):
            uri_batch.append(uri)
            # print("i:", i, "uri:", uri)
            if (len(uri_batch) == batch_size or uri_list_length == i) and i > 0: #grab a batch of 50 songs
                ### Try catch block for function
                try:
                    audio_featureDF = spot_artist_features(uri_batch, client_id, client_secret)
                    # print("audio_featureDF", audio_featureDF)
                    uri_batch = []
                
                except ReadTimeout:
                    logging.info("'Spotify timed out... trying again...'")
                    audio_featureDF = spot_artist_features(uri_batch, client_id, client_secret)
                    uri_batch = []
                    
                
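                except HTTPError as err: #JW ADDED
                    logging.info(f"HTTP Error: {err}")
                    # skip the failed batch; its URIs are not in BQ yet, so a later run picks them up
                    uri_batch = []
                    time.sleep(sleep_param)
                    continue
                
                except spotipy.exceptions.SpotifyException as spotify_error: #jw_added
                    logging.info(f"Spotify error: {spotify_error}")
                    uri_batch = []
                    time.sleep(sleep_param)
                    continue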
                    
                # time.sleep(sleep_param)
                
                # Accumulate batches on the machine before writing to BQ
                # if inner_batch_count <= batches_to_store or uri_list_length == i:
                if inner_batch_count == 0:
                    appended_data = audio_featureDF
                    # logging.info(f"creating new appended data at IBC: {inner_batch_count} \n i: {i}")
                    inner_batch_count += 1
                
                elif uri_list_length == i or inner_batch_count == batches_to_store: #send the batches to bq
                    appended_data = pd.concat([audio_featureDF, appended_data])
                    inner_batch_count = 0
                    appended_data.to_gbq(
                        destination_table=target_table, 
                        project_id=f'{project}', 
                        location='US', 
                        table_schema=schema,
                        progress_bar=False, 
                        reauth=False, 
                        if_exists='append'
                    )
                    logging.info(f'{i+1} of {uri_list_length} complete!')
                
                else:
                    appended_data = pd.concat([audio_featureDF, appended_data])
                    inner_batch_count += 1

        logging.info(f'audio features appended')
    
    #multiprocessing portion - we will loop based on the modulus of the track_uri list
    #chunk the list 
    
    # Yield successive n-sized
    # chunks from l.
    def divide_chunks(l, n):
        # looping till length l
        for i in range(0, len(l), n):
            yield l[i:i + n]
            
    n_cores = multiprocessing.cpu_count() 
    chunked_tracks = list(divide_chunks(track_list, int(len(track_list)/n_cores))) # produces a list of lists chunked evenly by groups of n_cores
    
    logging.info(
        f"""total tracks downloaded: {len(track_list)}\n
        length of chunked_tracks: {len(chunked_tracks)}\n 
        and inner dims: {[len(x) for x in chunked_tracks]}
        """
    )

    for track_chunk in chunked_tracks:
        process_track_list(track_chunk)

        
    # process_track_list(track_list)
    logging.info(f'artist features appended')
    
    return (
        f'DONE',
    )

# credentials come from the SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET variables
# defined earlier in the notebook; never hard-code them in a cell
call_spotify_api_artist(
    project="myproject32549",
    location="us-central1",
    unique_table="myproject32549.spotify_e2e_test2.tracks_unique",
    batch_size=50,
    batches_to_store=50,
    client_id=SPOTIPY_CLIENT_ID,
    client_secret=SPOTIPY_CLIENT_SECRET,
    sleep_param=0.5,
    target_table="myproject32549.spotify_e2e_test2.artist_features",
)

INFO:absl:spotipy auth complete


pip install complete


INFO:absl:Table myproject32549.spotify_e2e_test2.artist_features already exists.
INFO:absl:finished downloading tracks


ValueError: range() arg 3 must not be zero
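
The `ValueError` above is raised when `range()` is given a step of zero. In `divide_chunks` the step is `int(len(track_list)/n_cores)`, which is 0 whenever there are fewer URIs than cores. That includes the likely case here, where the target table already holds every artist and `track_list` is empty. Below is a minimal sketch of a guarded chunker, assuming it is acceptable to take the number of chunks rather than the chunk size as the second argument. This signature is an illustration and differs from the helper in the cell above.

```python
import math
from typing import List


def divide_chunks(items: list, n_chunks: int) -> List[list]:
    """Split items into at most n_chunks contiguous chunks of near-equal size."""
    if not items:
        return []  # nothing to process; avoids calling range() with a zero step
    size = max(1, math.ceil(len(items) / n_chunks))
    return [items[i:i + size] for i in range(0, len(items), size)]


print(divide_chunks([], 8))               # []
print(divide_chunks(list("abc"), 8))      # [['a'], ['b'], ['c']]
print(divide_chunks(list(range(10)), 4))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Rounding the chunk size up rather than down also keeps the number of chunks, and therefore of worker processes, at or below the core count.
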

In [5]:

import kfp
from typing import Any, Callable, Dict, NamedTuple, Optional, List

def call_spotify_api_audio(
    project: str,
    location: str,
    client_id: str,
    batch_size: int,
    batches_to_store: int,
    target_table: str,
    client_secret: str,
    unique_table: str,
    sleep_param: float,
) -> NamedTuple('Outputs', [
    ('done_message', str),
]):
    print(f'pip install complete')
    import os
    
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials
    
    import re
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    
    import pandas as pd
    import json
    import time
    import numpy as np
    
    from requests.exceptions import ReadTimeout, HTTPError, ConnectionError, RequestException
    from absl import logging
    
    from google.cloud import storage
    import gcsfs
    from google.cloud import bigquery
    
    import pandas_gbq
    from multiprocessing import Process
    from tqdm import tqdm
    from tqdm.contrib.logging import logging_redirect_tqdm
    
    from google.cloud.exceptions import NotFound

    import multiprocessing

    # print(f'package import complete')

    logging.set_verbosity(logging.INFO)

    
    bq_client = bigquery.Client(
      project=project, location=location
    )
    
    def spot_audio_features(uri, client_id, client_secret):

        # Authenticate
        client_credentials_manager = SpotifyClientCredentials(
            client_id=client_id, 
            client_secret=client_secret
        )
        sp = spotipy.Spotify(
            client_credentials_manager = client_credentials_manager, 
            requests_timeout=10, 
            retries=10
        )
        ############################################################################
        # Create Track Audio Features DF
        ############################################################################
        
        uri_stripped = [u.replace('spotify:track:', '') for u in uri] # strip the URI prefix to get bare track IDs
        # track metadata, used below for popularity
        tracks = sp.tracks(uri_stripped)
        # pause between the two API calls, then fetch the audio features
        time.sleep(sleep_param)
    
        a_feats = sp.audio_features(uri)
        features = pd.json_normalize(a_feats)#.to_dict('list')
        
        features['track_pop'] = pd.json_normalize(tracks['tracks'])['popularity']
        
        features['track_uri'] = uri
        return features

    bq_client = bigquery.Client(
        project=project, 
        location='US'
    )
    
    #check if target table exists and if so return a list to not duplicate records
    try:
        bq_client.get_table(target_table)  # Make an API request.
        logging.info("Table {} already exists.".format(target_table))
        target_table_incomplete_query = f"select distinct track_uri from `{target_table}`"
        loaded_tracks_df = bq_client.query(target_table_incomplete_query).result().to_dataframe()
        loaded_tracks = loaded_tracks_df.track_uri.to_list()
        
    except NotFound:
        logging.info("Table {} is not found.".format(target_table))
    
    query = f"select distinct track_uri from `{unique_table}`" 

    #refactor
    schema = [{'name':'danceability', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'energy', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'key', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'loudness', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'mode', 'type': 'INTEGER', "mode": "NULLABLE"},
        {'name':'speechiness', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'acousticness', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'instrumentalness', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'liveness', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'valence', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'tempo', 'type': 'FLOAT', "mode": "NULLABLE"},
        {'name':'type', 'type': 'STRING', "mode": "NULLABLE"},
        {'name':'id', 'type': 'STRING', "mode": "NULLABLE"},
        {'name':'uri', 'type': 'STRING', "mode": "NULLABLE"},
        {'name':'track_href', 'type': 'STRING', "mode": "NULLABLE"},
        {'name':'analysis_url', 'type': 'STRING', "mode": "NULLABLE"},
        {'name':'duration_ms', 'type': 'INTEGER', "mode": "NULLABLE"},
        {'name':'time_signature', 'type': 'INTEGER', "mode": "NULLABLE"},
        {'name':'track_pop', 'type': 'INTEGER', "mode": "NULLABLE"},
        {'name':'track_uri', 'type': 'STRING', "mode": "REQUIRED"},
    ]
    
    tracks = bq_client.query(query).result().to_dataframe()
    track_list = tracks.track_uri.to_list()
    logging.info(f'finished downloading tracks')
    
    
    ### This section is used when there are tracks already loaded into BQ and you want to resume loading the data
    try:
        track_list = list(set(track_list) - set(loaded_tracks)) # drop URIs that are already loaded in BQ
    except NameError:
        # only raised if loaded_tracks was never defined (target table not found)
        pass
    

    from tqdm import tqdm
    def process_track_list(track_list):
        
        uri_list_length = len(track_list)-1 #starting count at zero
        inner_batch_count = 0 #avoiding calling the api on 0th iteration
        uri_batch = []
        
        for i, uri in enumerate(tqdm(track_list)):
            uri_batch.append(uri)
            if (len(uri_batch) == batch_size or uri_list_length == i) and i > 0: #grab a batch of 50 songs
                # logging.info(f"appending final record for nth song at: {inner_batch_count} \n i: {i} \n uri_batch length: {len(uri_batch)}")
                ### Try catch block for function
                try:
                    audio_featureDF = spot_audio_features(uri_batch, client_id, client_secret)
                    time.sleep(sleep_param)
                    uri_batch = []
                except ReadTimeout:
                    logging.info("'Spotify timed out... trying again...'")
                    audio_featureDF = spot_audio_features(uri_batch, client_id, client_secret)
                    
                    uri_batch = []
                    time.sleep(sleep_param)
                
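                except HTTPError as err: #JW ADDED
                    logging.info(f"HTTP Error: {err}")
                    # skip the failed batch; its URIs are not in BQ yet, so a later run picks them up
                    uri_batch = []
                    time.sleep(sleep_param)
                    continue
                
                except spotipy.exceptions.SpotifyException as spotify_error: #jw_added
                    logging.info(f"Spotify error: {spotify_error}")
                    uri_batch = []
                    time.sleep(sleep_param)
                    continue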
                    
                # Accumulate batches on the machine before writing to BQ
                # if inner_batch_count <= batches_to_store or uri_list_length == i:
                if inner_batch_count == 0:
                    appended_data = audio_featureDF
                    # logging.info(f"creating new appended data at IBC: {inner_batch_count} \n i: {i}")
                    inner_batch_count += 1
                elif uri_list_length == i or inner_batch_count == batches_to_store: #send the batches to bq
                    appended_data = pd.concat([audio_featureDF, appended_data])
                    inner_batch_count = 0
                    appended_data.to_gbq(
                        destination_table=target_table, 
                        project_id=f'{project}', 
                        location='US', 
                        table_schema=schema,
                        progress_bar=False, 
                        reauth=False, 
                        if_exists='append'
                    )
                    logging.info(f'{i+1} of {uri_list_length} complete!')
                else:
                    appended_data = pd.concat([audio_featureDF, appended_data])
                    inner_batch_count += 1

        logging.info(f'audio features appended')
    
    #multiprocessing portion - we will loop based on the modulus of the track_uri list
    #chunk the list 
    
    # Yield successive n-sized
    # chunks from l.
    def divide_chunks(l, n):
        # looping till length l
        for i in range(0, len(l), n):
            yield l[i:i + n]
            
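    n_cores = multiprocessing.cpu_count() 
    # ceil-divide so there are at most n_cores chunks; max(1, ...) keeps the range() step in divide_chunks non-zero
    chunk_size = max(1, -(-len(track_list) // n_cores))
    chunked_tracks = list(divide_chunks(track_list, chunk_size)) # list of lists, at most one chunk per core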
    
    logging.info(
        f"""
        total tracks downloaded: {len(track_list)}\n
        length of chunked_tracks: {len(chunked_tracks)}\n 
        and inner dims: {[len(x) for x in chunked_tracks]}
        """
    )

    procs = []
    
    def create_job(target, *args):
        p = multiprocessing.Process(target=target, args=args)
        p.start()
        return p

    # starting process with arguments
    for track_chunk in chunked_tracks:
        proc = create_job(process_track_list, track_chunk)
        time.sleep(np.pi)  # pause ~3.14 s between process starts
        procs.append(proc)

    # complete the processes
    for proc in procs:
        proc.join()
        
    # process_track_list(track_list) #single thread
    logging.info(f'audio features appended')
    
    return f'DONE'


# credentials come from the SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET variables
# defined earlier in the notebook; never hard-code them in a cell
call_spotify_api_audio(
    project="myproject32549",
    location="us-central1",
    client_id=SPOTIPY_CLIENT_ID,
    batch_size=50,
    batches_to_store=50,
    target_table="myproject32549.spotify_e2e_test2.audio_features",
    client_secret=SPOTIPY_CLIENT_SECRET,
    unique_table="myproject32549.spotify_e2e_test2.tracks_unique",
    sleep_param=0.5,
)

pip install complete


INFO:absl:Table myproject32549.spotify_e2e_test2.audio_features already exists.
INFO:absl:finished downloading tracks
INFO:absl:
        total tracks downloaded: 2002207

        length of chunked_tracks: 1
 
        and inner dims: [2002207]
        
  0%|          | 0/2002207 [00:00<?, ?it/s]ERROR:spotipy.client:Max Retries reached
INFO:absl:Spotify error: http status: 429, code:-1 - /v1/audio-features/?ids=4wsNcZOMiU2ThJ0AhM0282,6h0MJAtzHMetVD8718zVgP,4RrbfMhc1AKNhylRC1mp1U,5KNBHbtDfOW48zjAnfY5iU,5F3i5W6FTuWtBFGmvckaO1,5qvThjvpC7q5paS9DCzDA1,6qJ6u80mylRXG7wcFTdr2w,3jLHXebqQ802XeP9fnnMWz,2cViIXIe8Pbd1sOJExMJlK,7abfUxNpRHuby0N2T4JCv1,1ovGM7ayJfoYYXkkF7Wzkb,1nEPEGmRi3o8ybsCepKaEp,3EW3BFkqqy5XDs9sUcBsJh,5ryqIuIKgjfgKXHSxy7gzn,7xcWmcZe7APzr8pNFP2nn3,1c5xEBeV4BbzDqmBmB5PSH,4gtER6hj0kb1amaXAHYlkk,1HWVbYxnJjMTBKEZNXRarz,001ayA5alFiUVW2fw5jpyW,2MLbGztItx5Lxytg40Bt8v,3prjpJbL2pAhaBfVrRAL2m,152601s4UKWKo784aQDGKm,5cRTQzUWa4jpiL1xCWOJTr,754DWmELcRbJ8nSxpidPt4,2QixklKF002TyGCQfzPWLZ,5s43J0J4GCXT

'DONE'
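
The run above ends on an HTTP 429 (rate limit) after spotipy exhausts its own retries. One way to handle this at the call site is to wait and retry, preferring the server's `Retry-After` hint when one is present. The helper below is a minimal sketch of that idea, not what the component does. `call_with_backoff` is a hypothetical name, and it assumes the raised exception exposes `http_status` and `headers` attributes, which spotipy's `SpotifyException` is expected to do (check this against the installed version).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on HTTP 429 with Retry-After or exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as err:
            attempt += 1
            # Assumption: rate-limit errors carry http_status == 429.
            rate_limited = getattr(err, "http_status", None) == 429
            if not rate_limited or attempt >= max_attempts:
                raise
            headers = getattr(err, "headers", None) or {}
            retry_after = headers.get("Retry-After")
            # Prefer the server's hint; otherwise double the delay each attempt.
            delay = float(retry_after) if retry_after else base_delay * 2 ** (attempt - 1)
            sleep(delay)
```

Inside `process_track_list` this could wrap the API call, for example `call_with_backoff(lambda: spot_audio_features(uri_batch, client_id, client_secret))`. The `sleep` parameter exists only so the delay can be replaced in tests.
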

In [5]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# credentials defined earlier in the notebook; never hard-code them in a cell
client_id = SPOTIPY_CLIENT_ID
client_secret = SPOTIPY_CLIENT_SECRET

# Authenticate
client_credentials_manager = SpotifyClientCredentials(
    client_id=client_id, 
    client_secret=client_secret,
    # requests_session=False,
)
sp = spotipy.Spotify(
    client_credentials_manager = client_credentials_manager, 
    requests_timeout=50, 
    retries=0 
    )

# 0001wHqxbF2YYRQxGdbyER,000UxvYLQuybj6iVRRCAw1

uri=['spotify:artist:0001wHqxbF2YYRQxGdbyER', 'spotify:artist:000UxvYLQuybj6iVRRCAw1']
try:
  artists = sp.artists(uri)
  print(artists)
except Exception as ex:
  # print(artists.get("Retry-After"))
  print(ex)

{'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/0001wHqxbF2YYRQxGdbyER'}, 'followers': {'href': None, 'total': 9898}, 'genres': ['progressive psytrance'], 'href': 'https://api.spotify.com/v1/artists/0001wHqxbF2YYRQxGdbyER', 'id': '0001wHqxbF2YYRQxGdbyER', 'images': [{'height': 640, 'url': 'https://i.scdn.co/image/ab67616d0000b273bdb6955d7b55866131d86f98', 'width': 640}, {'height': 300, 'url': 'https://i.scdn.co/image/ab67616d00001e02bdb6955d7b55866131d86f98', 'width': 300}, {'height': 64, 'url': 'https://i.scdn.co/image/ab67616d00004851bdb6955d7b55866131d86f98', 'width': 64}], 'name': 'Motion Drive', 'popularity': 11, 'type': 'artist', 'uri': 'spotify:artist:0001wHqxbF2YYRQxGdbyER'}, {'external_urls': {'spotify': 'https://open.spotify.com/artist/000UxvYLQuybj6iVRRCAw1'}, 'followers': {'href': None, 'total': 116}, 'genres': [], 'href': 'https://api.spotify.com/v1/artists/000UxvYLQuybj6iVRRCAw1', 'id': '000UxvYLQuybj6iVRRCAw1', 'images': [{'height': 640, 'url'

### submit pipeline job

In [4]:
job.submit()

NameError: name 'job' is not defined

## Vertex Pipeline UI (console) 

* Loading the audio features may take several runs with different application credentials, because the Spotify API can rate-limit a client (HTTP 429)
* Each component resumes where it left off: URIs already loaded into the BQ target table are skipped
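
The resume behaviour comes down to a set difference between the URIs in the source table and those already in the target table. Below is a minimal sketch with plain lists standing in for the two BigQuery query results. `remaining_uris` is a hypothetical name. Unlike `list(set(a) - set(b))`, it keeps the input order, which makes the work split the same from run to run.

```python
from typing import Iterable, List


def remaining_uris(all_uris: Iterable[str], loaded_uris: Iterable[str]) -> List[str]:
    """Return URIs not yet loaded, de-duplicated, in their original order."""
    loaded = set(loaded_uris)
    seen = set()
    todo = []
    for uri in all_uris:
        if uri not in loaded and uri not in seen:
            seen.add(uri)
            todo.append(uri)
    return todo


# Toy example: "b" is already in the target table and also appears twice in the source.
print(remaining_uris(["a", "b", "c", "b"], ["b"]))  # ['a', 'c']
```
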

<!-- ![](img/feature-enrich-pipeline.png) -->

<img
  src="img/feature-enrich-pipeline.png"
  alt="Feature enrichment pipeline in the Vertex Pipelines console"
  title="Feature Enrichment Pipeline"
  style="display: inline-block; margin: 0 auto; max-width: 900px">

## GitIgnore

In [4]:
%%writefile .gitignore
*.cpython-310.pyc
*-checkpoint.*
*-checkpoint.md
*-checkpoint.py
*.ipynb_checkpoints
# .gcloudignore
# .git
# .github
.ipynb_checkpoints/*
*__pycache__
# *cpython-37.pyc
# .gitignore
# .DS_Store
spotipy_secret_creds.py

Overwriting .gitignore


# Next Steps

Now that all the data is loaded into BigQuery, the [01-bq-data-prep.ipynb](01-bq-data-prep.ipynb) notebook finishes the feature prep.