```yaml
    Course:   DS 5001,
    Author:   Raymundo Mora,
    Email:    rm3xw@virginia.edu
    Date:     Spring 2023
```

# Data Collection 

In this notebook we will generate the relevant datasets for analyzing the songs in your *Spotify* library. To make sure the code runs and 

## 0.0 Import Relevant Libraries

In [1]:
# the two must-have libraries in any data science project 
import pandas as pd 
import numpy as np


import os
import time
import requests


# for handling our web requests and html 
from bs4 import BeautifulSoup

# for handling environment variables 
from dotenv import dotenv_values
from dotenv import load_dotenv

# handle our language detection 
from langdetect import detect

# for handling spotify endpoints 
import spotipy
from spotipy.oauth2 import SpotifyOAuth

# Helps Make our Notebook Pretty
from IPython.display import clear_output

## 1.0 Define Global Variables 

In [2]:
# load your environment variables 
load_dotenv()

# define the scope of your spotipy client 
scope = "user-library-read"

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope=scope))

# relative path to where data will be stored. 
data_dir = "datasets/"
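
Before creating the client, it can help to confirm that the `.env` file actually supplied the credentials. The sketch below is an illustration, not part of the original notebook. It assumes the default variable names that `SpotifyOAuth` reads (`SPOTIPY_CLIENT_ID`, `SPOTIPY_CLIENT_SECRET`, `SPOTIPY_REDIRECT_URI`) plus the two Genius names used later in this notebook; `missing_env_vars` is a hypothetical helper.

```python
import os

# Variable names assumed for this sketch: the three spotipy defaults
# and the two Genius names loaded later in this notebook.
REQUIRED_VARS = [
    "SPOTIPY_CLIENT_ID",
    "SPOTIPY_CLIENT_SECRET",
    "SPOTIPY_REDIRECT_URI",
    "GENIUS_CLIENT_ID",
    "GENIUS_CLIENT_SECRET",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report anything that load_dotenv() did not provide.
missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` surfaces a missing credential early instead of as an authentication error further down.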

## 1.1 Define helper functions 

In [3]:
def get_genre(track_id):
    """Gets the genre of a track. Important to notice
    that the genre is not a property of the track, but
    a property of the artist. Therefore, the genre is
    retrieved from the artist.

    Args:
        track_id (str): Spotify track id

    Returns:
        list: list  of genres associated with the artist
    """
    # Uses track endpoint to get artist id
    track_info = sp.track(track_id)
    
    # Uses artist endpoint to get genres associated with artist
    artist_info = sp.artist(track_info['artists'][0]['id'])
    
    # Returns list of genres associated with artist
    genre = artist_info['genres']
    
    return genre


def get_genius_access_token():
    """Get an access token from Genius API"""
    
    # Set API endpoint and headers
    endpoint = 'https://api.genius.com/oauth/token'
    headers = {'Content-Type': 'application/json'}
    
    # Set request parameters
    data = {
        'client_id': GENIUS_CLIENT_ID,
        'client_secret': GENIUS_CLIENT_SECRET,
        'grant_type': 'client_credentials'
    }
    
    # Send API request
    response = requests.post(endpoint, headers=headers, json=data)
    
    # Check if request was successful
    if response.status_code == 200:
        data = response.json()
        access_token = data['access_token']
        
        return access_token
    else:
        print("Failed to fetch access token. Please check your client ID and client secret.")
        return None
    
    
def get_lyrics(song_title, artist_name):
    """Get the lyrics for a song using Genius API.

    Args:
        song_title (str): title of the song.
        artist_name (str): artist name.

    Returns:
        str or None: lyrics of the song requested, or None if no match was found.
    """
    # Get your access token from https://genius.com/api-clients
    access_token = get_genius_access_token()
    if not access_token:
        return None
    
    # Set API endpoint and headers
    endpoint = 'https://api.genius.com/search'
    headers = {'Authorization': f'Bearer {access_token}'}
    
    # Set query parameters
    params = {
        'q': f'{song_title} {artist_name}',
    }
    
    # Send API request
    response = requests.get(endpoint, headers=headers, params=params)
    
    # Check if request was successful
    if response.status_code == 200:
        # Extract lyrics from response
        data = response.json()
        matches = data['response']['hits']
        
        # Check a match was found
        if matches:
            # Get the first match
            match = matches[0]
            song_lyrics_url = match['result']['url']
            
            # Get the lyrics from the song page
            lyrics_response = requests.get(song_lyrics_url)
            soup = BeautifulSoup(lyrics_response.content,"html.parser")
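            parts = []
            for tag in soup.select('div[class^="Lyrics__Container"], .song_body-lyrics p'):
                t = tag.get_text(strip=True, separator='\n')
                if t:
                    parts.append(t)
            
            # Join the containers with newlines so adjacent lines do not run together
            lyrics = '\n'.join(parts)
            
            return lyrics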
        else:
            print(f"No lyrics found for {song_title} by {artist_name}.")
            return None
    else:
        # Return None (not a placeholder string) so failed calls are not treated as lyrics
        print(f"Failed to get lyrics (HTTP status {response.status_code}).")
        return None
    
    
def lang_detect(lyrics):
    """Detect the language of the lyrics using langdetect.

    The try/except handles rows where no lyrics could be retrieved from the
    Genius API (for example, `None` or an empty string).

    Args:
        lyrics (str): The lyrics to detect the language of.

    Returns:
        str: The language code of the lyrics, or None if detection fails.
    """
    try:
        return detect(lyrics)
    except Exception:
        return None

## 2.0 Get our *liked songs* library from *Spotify*

In [4]:
offset = 0 # index of the first track to retrieve 
limit = 50 # maximum number of tracks we can fetch per request before receiving an error 

LIB = pd.DataFrame() # initiate the dataframe containing our library 

while True:
    # Get tracks from the user's `liked songs` 
    results = sp.current_user_saved_tracks(limit=limit, offset=offset)
    
    # Append the fetched tracks to the dataframe
    LIB = pd.concat([LIB,pd.json_normalize(results['items'])])
    
    # Check if you are at the end of the library
    if len(results['items']) < limit:
        # Break if you have reached the end of the library
        break
    else:
        # Increment the offset for the next request
        offset += limit



In [5]:
# reset the index of our library `LIB`
LIB = LIB.reset_index(drop=True) 
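
The offset-based loop above can also be written as a small reusable function that gathers all items first, so the dataframe is built once rather than concatenated on every page. This is a minimal sketch under stated assumptions: `fetch_all` and `fake_saved_tracks` are hypothetical names, and the fake fetcher stands in for the Spotify call so the example runs offline.

```python
def fetch_all(fetch_page, limit=50):
    """Collect every item from an offset-paginated endpoint.

    `fetch_page(limit=..., offset=...)` must return a dict with an 'items' list.
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)["items"]
        items.extend(page)
        # A short (or empty) page means we have reached the end.
        if len(page) < limit:
            break
        offset += limit
    return items


# Stand-in for the Spotify endpoint: a pretend library of 120 saved tracks.
def fake_saved_tracks(limit, offset):
    library = [{"track": {"id": str(n)}} for n in range(120)]
    return {"items": library[offset:offset + limit]}


tracks = fetch_all(fake_saved_tracks)
print(len(tracks))
```

With the real client, something like `pd.json_normalize(fetch_all(sp.current_user_saved_tracks))` should produce the same `LIB` dataframe, assuming the endpoint accepts `limit` and `offset` as keyword arguments as it does in the cell above.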

## 2.1 Adding Genre to our library

In [None]:
genres = [] # initiate the list of genres for each song
for i in LIB.index:
    # call our helper function 
    genre = get_genre(LIB['track.id'][i])
    
    # add to the `genres` list
    genres.append(genre)
    
    # print status 
    clear_output(wait=True)
    print(i,flush=True)
    print(genre,flush=True)

53
['candy pop', 'pixie', 'pop', 'pop emo', 'pop punk', 'rock']


In [None]:
LIB['genre'] = genres
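
Because `get_genre` makes two API calls per song, large libraries can take a while. One possible alternative, sketched below, is to look up each distinct artist only once and in batches. The names `chunked`, `genres_by_artist`, and `fake_fetch_artists` are hypothetical, the batch size of 50 is an assumption about the endpoint's limit, and the fake fetcher mimics a response shaped like `{'artists': [{'id': ..., 'genres': [...]}]}`.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` with at most `size` elements."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def genres_by_artist(artist_ids, fetch_artists, batch_size=50):
    """Map each artist id to its genre list using batched lookups.

    `fetch_artists(ids)` must return {'artists': [{'id': ..., 'genres': [...]}, ...]}.
    """
    unique_ids = list(dict.fromkeys(artist_ids))  # drop duplicates, keep order
    genres = {}
    for batch in chunked(unique_ids, batch_size):
        for artist in fetch_artists(batch)["artists"]:
            genres[artist["id"]] = artist["genres"]
    return genres


# Offline stand-in that records each batch it receives.
calls = []

def fake_fetch_artists(ids):
    calls.append(list(ids))
    return {"artists": [{"id": i, "genres": ["genre-" + i]} for i in ids]}


lookup = genres_by_artist(["a", "b", "a", "c"], fake_fetch_artists, batch_size=2)
print(lookup)
print(len(calls), "requests")
```

With spotipy, `sp.artists` could likely be passed as `fetch_artists`, and the artist ids could be read from `LIB['track.artists']`, assuming each entry carries an `id` field alongside the `name` used later in this notebook.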

## 2.2 Inspect and Save our `LIB` DataFrame 

In [None]:
LIB.head()

In [None]:
LIB.to_csv(data_dir+"LIB.csv",index=False)

In [None]:
LIB['genre'][0]

## 3.0 Generate CORPUS

## 3.1 Load Your Credentials for the Genius API

In [None]:
# Get client ID and client secret from environment variables
GENIUS_CLIENT_ID = os.getenv('GENIUS_CLIENT_ID')
GENIUS_CLIENT_SECRET = os.getenv('GENIUS_CLIENT_SECRET')

## 3.2 Download the Lyrics for Your Library

In [None]:
CORPUS = pd.DataFrame()

total_tracks = len(LIB)
for i in LIB.index:
    
    song_title = LIB['track.name'][i]
    artist_name = LIB['track.artists'][i][0]['name']
    lyrics = get_lyrics(song_title, artist_name)
    CORPUS = pd.concat([CORPUS,pd.DataFrame({'artist':[artist_name],
                                             'song':[song_title],
                                             'lyrics':[lyrics]})]
                      )
    
    # pause between requests to reduce the chance of being rate-limited 
    time.sleep(0.25)
    clear_output(wait=True)
    print(str(np.round(100*(i+1)/total_tracks,2))+"%"+f'       [{i+1}/{total_tracks}]   \n', flush=True)

In [None]:
CORPUS= CORPUS.reset_index(drop=True)
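
Calling `pd.concat` inside the loop copies the whole dataframe on every iteration. A common alternative is to collect plain dictionaries and build the dataframe once at the end. The sketch below illustrates that pattern; `collect_lyrics` is a hypothetical helper, and the stub fetcher replaces `get_lyrics` so the example runs without network access.

```python
import time


def collect_lyrics(tracks, fetch_lyrics, pause=0.25):
    """Return one row (dict) per (title, artist) pair, pausing between requests."""
    rows = []
    for title, artist in tracks:
        rows.append({
            "artist": artist,
            "song": title,
            "lyrics": fetch_lyrics(title, artist),
        })
        time.sleep(pause)  # same throttling idea as the loop above
    return rows


# Stub fetcher for illustration only; the notebook would pass get_lyrics here.
def stub_lyrics(title, artist):
    return f"{title} by {artist}"


rows = collect_lyrics([("Song A", "Artist A"), ("Song B", "Artist B")],
                      stub_lyrics, pause=0)
print(rows)
```

In the notebook, `pd.DataFrame(collect_lyrics(pairs, get_lyrics))` would then yield a dataframe with the same `artist`, `song`, and `lyrics` columns and a clean index, where `pairs` is an assumed list of `(song_title, artist_name)` tuples built from `LIB`.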

## 3.3 Add the Language as a Feature

In [None]:
CORPUS['language'] = CORPUS['lyrics'].apply(lang_detect)

In [None]:
# Add the full language name as provided in the `langdetect` documentation
# The map is available here: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
# The `langdetect` documentation can be found here: https://pypi.org/project/langdetect/

lang_map = pd.read_csv('langugage_map.csv')
lang_names = []
for i in CORPUS.index:
    try:
        lang_name = lang_map.loc[lang_map['639-1'] == CORPUS['language'][i]]['ISO language name'].values[0]
    except IndexError:  # no row in the map matches this language code
        lang_name = None
    lang_names.append(lang_name)

In [None]:
CORPUS['language_name'] = lang_names
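
The row-by-row lookup above can also be expressed as a dictionary lookup, which avoids filtering the map once per song. This is a minimal sketch that assumes the CSV has the two columns used above (`639-1` and `ISO language name`); `load_language_names` is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import csv
import io


def load_language_names(csv_file):
    """Build {language code: language name} from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return {row["639-1"]: row["ISO language name"] for row in reader}


# Small in-memory sample with the same header as the map file.
sample = io.StringIO("639-1,ISO language name\nen,English\nes,Spanish\n")
code_to_name = load_language_names(sample)

# Unknown codes and missing detections both fall back to None.
codes = ["en", "es", None, "xx"]
names = [code_to_name.get(code) for code in codes]
print(names)  # ['English', 'Spanish', None, None]
```

With pandas, `CORPUS['language'].map(code_to_name)` should give an equivalent column, with the difference that unmatched codes come back as `NaN` rather than `None`.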

## 3.4 Inspect and Save our `CORPUS` DataFrame 

In [None]:
CORPUS.head()

In [None]:
CORPUS.to_csv(data_dir+'F_CORPUS.csv',index=False)