---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

{{< include instructions.qmd >}} 


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Code 

Provide the source code used for this section of the project here.

If you're using a package for code organization, you can import it at this point. However, make sure that the **actual workflow steps**—including data processing, analysis, and other key tasks—are conducted and clearly demonstrated on this page. The goal is to show the technical flow of your project, highlighting how the code is executed to achieve your results.

Ensure that the code is well-commented to enhance readability and understanding for others who may review or use it. If relevant, link to additional documentation or external references that explain any complex components. This section should give readers a clear view of how the project is implemented from a technical perspective.

This page is a technical narrative, not just a notebook with a collection of code cells. Include inline prose to describe what is going on.

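The following code imports the required libraries, reads the Spotify API credentials from a JSON file, and creates an authenticated Spotify client.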

In [None]:
import spotipy
import json
from spotipy.oauth2 import SpotifyClientCredentials
import csv
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

#Read in the json file that has Spotify api credentials 
with open('info/info.json', 'r') as info_file:
    info = json.load(info_file)
client_id = info['client_id'] 
client_secret = info['client_secret']

#Set up Spotify client credentials manager using the client ID and client secret
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
#Create Spotify client object
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
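
The cell above reads the credentials file without any checks, so a missing key would only surface as a `KeyError`. Below is a minimal sketch of a loader that validates the file first. The names `load_credentials` and `REQUIRED_KEYS` are hypothetical, and the key names are assumed to match those read from `info/info.json` above.

```python
import json

# Hypothetical helper; the key names mirror the ones used in the cell above.
REQUIRED_KEYS = ("client_id", "client_secret")

def load_credentials(path):
    """Read a JSON credentials file and fail early if a key is missing or empty."""
    with open(path, "r", encoding="utf-8") as handle:
        info = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if not info.get(key)]
    if missing:
        raise ValueError(f"Missing credential keys in {path}: {', '.join(missing)}")
    return info["client_id"], info["client_secret"]
```

It could be used as `client_id, client_secret = load_credentials('info/info.json')` before building the credentials manager.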


In [79]:
tracks_data = []

#Rolling Stone Top 100 Spotify playlist URL
playlist_uri = 'https://open.spotify.com/playlist/67ZwkI5fZsA7mTHbF4Wk86?si=76d7c1b4681a4b5e'

#Get the tracks from the playlist
playlist_tracks = sp.playlist_tracks(playlist_uri)

#Iterate through the playlist
while playlist_tracks:
    for item in playlist_tracks['items']:
        track = item['track']
        #Get track name
        track_name = track['name']

        #Initialize list to store artist names
        artist_names = []
        #Get the names of the artists on the track
        for artist in track['artists']:
            artist_names.append(artist['name'])
        #Join multiple artists (if the song has more than one artist)
        track_artists = ' & '.join(artist_names)
        tracks_data.append([track_name, track_artists])
        
    #Exit loop
    playlist_tracks = None
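
The loop above stops after the first page of results. That is enough here because spotipy's `playlist_tracks` returns up to 100 items per request by default, but a longer playlist would be cut short. Below is a sketch of one way to follow every page. `collect_playlist_rows` is a hypothetical helper, and skipping `None` tracks assumes that removed or unavailable entries can appear that way.

```python
def collect_playlist_rows(client, first_page):
    """Walk every page of a playlist result and return [track name, artists] rows."""
    rows = []
    page = first_page
    while page:
        for item in page["items"]:
            track = item.get("track")
            if not track:
                # Assumption: removed or unavailable tracks may come back as None
                continue
            artists = " & ".join(artist["name"] for artist in track["artists"])
            rows.append([track["name"], artists])
        # Spotify.next() fetches the following page; stop when there is none
        page = client.next(page) if page.get("next") else None
    return rows
```

With the client from this page, the call would be `tracks_data = collect_playlist_rows(sp, sp.playlist_tracks(playlist_uri))`.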

In [80]:
#Create csv path 
csv_file_path = '../../data/raw-data/rolling_stone_top_100.csv'

#Creates the csv file and writes into it
with open(csv_file_path, mode='w', newline='') as file:
    #Creates csv writer
    writer = csv.writer(file)
    #Write headers
    writer.writerow(['Track Name', 'Artists'])
    #Add in tracks_data into the rows 
    writer.writerows(tracks_data)

In [81]:
#Initialize the track list and the artist dictionary
tracks_info = []
artists_info = {}

#Iterate over each track in tracks_data
for item in tracks_data:
    #Grabs the track name
    track_name = item[0]  
    #Grabs the artist name
    artists = item[1]  

    #Initialize
    track_info = None

    #Uses Spotify API to search track
    track = sp.search(q=f"track:{track_name}", type='track', limit=50)
 
    while track:
        #Iterate through all the search results
        for search_result in track['tracks']['items']:
            search_result_artists = []
            for artist in search_result['artists']:
                #Collect the names of the artists in this search result
                search_result_artists.append(artist['name'])
            #Check if the artists from tracks_data are a subset of search_result_artists
            if set(artists.split(' & ')).issubset(search_result_artists):
                #Keep the first search result whose artists match
                track_info = search_result
                break
        #Break the outer loop if a match has been found
        if track_info:
            break
        #If next page available and song not found, go to the next page of results 
        if track['tracks']['next']:
            track = sp.next(track['tracks'])
        else:
            track = None

    #If no match was found after checking every page, report it and skip to the next track
    if not track_info:
        print(f"No matching track found for: {track_name} by {artists}")
        continue

    #Create dictionary with desired info about the track
    track_details = {
        'track_id': track_info['id'],
        'name': track_info['name'],
        'popularity': track_info['popularity'],
        'album': track_info['album']['name'],
        'release_date': track_info['album']['release_date'],
        'duration_ms': track_info['duration_ms'],
        'artist': track_info['artists'][0]['name'],
        'explicit': track_info['explicit']
    }
    tracks_info.append(track_details)

    #Iterates through the artists associated with the tracks
    for artist in track_info['artists']:
        #Get artist ID
        artist_id = artist['id']
        #Checks for duplicates 
        if artist_id not in artists_info:  
            #Uses the artist ID to get info about the artists
            artist_info = sp.artist(artist_id)
            #Adds the information to the created dictionary
            artists_info[artist_id] = {
                'id': artist_info['id'],
                'name': artist_info['name'],
                'genres': artist_info['genres'],
                'followers': artist_info['followers']['total'],
                'popularity': artist_info['popularity'],
            }


No matching track found for: Forever by Hovvdy
No matching track found for: Fun! by May Rio & Elegant Ensemble
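
The artist check in the search loop is an exact, case-sensitive subset test, so a difference in capitalization or stray whitespace is enough to reject a result. Below is a sketch of a more forgiving comparison. `artists_match` is a hypothetical helper, and it is not known whether it would recover the two misses reported above, since those tracks may simply not appear in the search results.

```python
def artists_match(expected_artists, search_result):
    """Return True if every expected artist appears on the search result.

    expected_artists is the ' & '-joined string stored in tracks_data;
    search_result is one item from the track search response.
    """
    expected = {name.strip().casefold() for name in expected_artists.split(" & ")}
    found = {artist["name"].strip().casefold() for artist in search_result["artists"]}
    return expected.issubset(found)
```

Inside the loop, the condition would become `if artists_match(artists, search_result):`.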


In [82]:
#Create csv paths
tracks_csv_path = '../../data/raw-data/tracks_data.csv'
artists_csv_path = '../../data/raw-data/artists_data.csv'

#Creates the csv file and writes into it
with open(tracks_csv_path, mode='w', newline='') as tracks_file:
    #Creates csv writer
    writer = csv.writer(tracks_file)
    #Write headers
    writer.writerow(['Track ID', 'Track Name', 'Popularity', 'Album', 'Release Date', 'Duration (ms)', 'Artist', 'Explicit'])
    #Loops through the list of track dictionaries and writes each one to the csv
    for track in tracks_info:
        writer.writerow([
            track['track_id'],
            track['name'],
            track['popularity'],
            track['album'],
            track['release_date'],
            track['duration_ms'],
            track['artist'],
            track['explicit']
        ])

#Creates the csv file and writes into it
with open(artists_csv_path, mode='w', newline='') as artists_file:
    #Creates csv writer
    writer = csv.writer(artists_file)
    #Write headers
    writer.writerow(['Artist ID', 'Name', 'Genres', 'Followers', 'Popularity'])
    #Loops through the artist dictionary and writes each entry to the csv
    for artist_id, artist in artists_info.items():
        writer.writerow([
            artist['id'],
            artist['name'],
            ', '.join(artist['genres']), 
            artist['followers'],
            artist['popularity']
        ])
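
Both writers above list every dictionary key by hand, once for the header and once for each row. Below is a sketch of a single helper driven by a mapping from dictionary key to header label. `write_dict_rows` and `TRACK_COLUMNS` are hypothetical names, the header labels are copied from the tracks CSV above, and the sample row is made up for illustration rather than taken from the API.

```python
import csv
import io

def write_dict_rows(file_obj, rows, columns):
    """Write dict rows as CSV. `columns` maps each dict key to its header label."""
    writer = csv.writer(file_obj)
    writer.writerow(list(columns.values()))
    for row in rows:
        writer.writerow([row[key] for key in columns])

# Header labels copied from the tracks CSV above
TRACK_COLUMNS = {
    "track_id": "Track ID",
    "name": "Track Name",
    "popularity": "Popularity",
    "album": "Album",
    "release_date": "Release Date",
    "duration_ms": "Duration (ms)",
    "artist": "Artist",
    "explicit": "Explicit",
}

# Made-up row, for illustration only (not real API output)
sample_rows = [{
    "track_id": "example-id", "name": "Example Song", "popularity": 50,
    "album": "Example Album", "release_date": "2024-01-01",
    "duration_ms": 180000, "artist": "Example Artist", "explicit": False,
}]

# Write to an in-memory buffer so the sketch runs without touching the data folder
buffer = io.StringIO()
write_dict_rows(buffer, sample_rows, TRACK_COLUMNS)
print(buffer.getvalue())
```

The artists file would need one extra step before using the same helper, because `genres` is a list and has to be joined into a string first.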

In [None]:
#Load the Genius API access token
with open('/Users/samyu/.api-keys.json') as f:
    keys = json.load(f)
ACCESS_TOKEN = keys['genius_api']

#Base URL for the Genius API
BASE_URL = "https://api.genius.com"

#Search for the song using Genius API
#If an artist is given, the search will try to match the artist
def search_song(song_title, artist_name=None):
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    search_url = f"{BASE_URL}/search"
    params = {"q": song_title}
    response = requests.get(search_url, headers=headers, params=params)
    
    #Successful API response
    if response.status_code == 200:
        data = response.json()
        #Extract the search results
        hits = data.get("response", {}).get("hits", [])
        
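        #Match the artist name
        for hit in hits:
            result = hit.get("result", {})
            if artist_name:
                primary_artist = result.get("primary_artist", {}).get("name", "").lower()
                #artist_name may hold several artists joined by ' & ', so match on any one of them
                if any(name.strip().lower() in primary_artist for name in artist_name.split('&') if name.strip()):
                    #Return the matched result
                    return result
            else:
                #Return the first result if artist is not specified
                return result
        #No hit matched the artist
        return None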
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return None

#Function to extract lyrics from the Genius webpage
def get_lyrics(song_title, artist_name=None):
    #Search for the song using the search_song function
    song = search_song(song_title, artist_name)
    if not song:
        print("Song not found!")
        return None

    #Extract the song URL from the search result
    song_url = song.get("url")
    if song_url:
        print(f"Lyrics URL: {song_url}")
        #Scrape the lyrics from the url
        response = requests.get(song_url)
        soup = BeautifulSoup(response.text, "html.parser")
        #Lyrics are usually in a <div> with a specific class
        lyrics = soup.find("div", class_="Lyrics-sc-1bcc94c6-1 bzTABU")
        if lyrics:
            #Clean the extracted lyrics 
            ly = lyrics.get_text(" ")
            #Remove section labels and bracket/colon characters ('Pre-Chorus' must come before 'Chorus')
            for label in ('Intro', 'Refrain', 'Verse 1', 'Verse 2', 'Pre-Chorus', 'Chorus', ':', '[', ']'):
                ly = ly.replace(label, '')

            #Remove artist names from lyrics if provided
            if artist_name:
                #Handles multiple artists; strip the spaces left around each name by the split
                for a in artist_name.split('&'):
                    if a.strip():
                        ly = ly.replace(a.strip(), '')
            return ly
        else:
            return "Lyrics not found!"
    else:
        print("Lyrics URL not found!")
        return None

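#Function to fetch lyrics for every [track name, artists] row in tracks_data
def top_100_song_lyrics(tracks_data, delay=1):
    tracks_with_lyrics = []
    for track in tracks_data:
        song_title = track[0]
        artist_name = track[1]
        #Add the lyrics as a third column (each row has only two entries, so assigning to track[2] raises an IndexError)
        tracks_with_lyrics.append([song_title, artist_name, get_lyrics(song_title, artist_name)])
        #Delay between requests to avoid rate-limiting
        time.sleep(delay)
    return tracks_with_lyrics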


tracks_data = top_100_song_lyrics(tracks_data)
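
The cleaning step in `get_lyrics` removes a fixed list of labels, so any header outside that list (a third verse or a bridge, for example) would be left in the text. Below is a sketch of a regex-based alternative. `strip_section_headers` is a hypothetical helper, and it assumes section headers always appear inside square brackets, which the bracket removal above suggests but does not guarantee.

```python
import re

def strip_section_headers(lyrics):
    """Remove bracketed section headers such as [Verse 1] or [Chorus: Name]."""
    # Replace each [...] span with a space so neighbouring words stay separated
    without_headers = re.sub(r"\[[^\]]*\]", " ", lyrics)
    # Collapse the runs of whitespace left behind by the removed headers
    return re.sub(r"\s+", " ", without_headers).strip()
```

Because the whole bracketed span is dropped, artist names that appear inside a header are removed along with it, while names that occur in the lyrics themselves are left alone.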

In [None]:
df = pd.DataFrame(tracks_data, columns=["Track Name", "Artists", "Lyrics"])
file_path = "../../data/raw-data/song_lyrics.csv"
df.to_csv(file_path, index=False)

{{< include closing.qmd >}} 