# Making a Killer Summer Playlist
## (While learning about web scraping and APIs)

### The science of the summer playlist
- Exactly 100 songs (~6 hours)
- All songs less than 3 years old
- No more than 3 songs per artist
- Created between St. Patrick's Day and Memorial Day
- On repeat until Halloween

### Traditional method
- Step 1: Find new artists (Look up who is playing at [summer music festivals](https://www.musicfestivalwizard.com/festival-guide/us-festivals/))
- Step 2: Look up those artists on [Spotify](https://open.spotify.com/)
- Step 3: Pick ~500 of their best songs
- Step 4: Narrow it down (the hardest part)

### New and improved method
- Make the computer do the work for me

# Step 1: Web Scraping
- Data meant for human consumption
- Messy
- Breaks easily
- Complicated to set up
- Ethical questions
- Can scrape almost anything on the internet
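
One practical way to approach the "ethical questions" bullet is to check a site's `robots.txt` before scraping it. The sketch below uses Python's standard `urllib.robotparser`. The rules and the `example.com` URLs are made up for illustration, and the `allowed` helper is a hypothetical name; a real check would read the target site's own file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real check would download the site's own file
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def allowed(url, user_agent="*"):
    """Return True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(allowed("https://example.com/all-festivals/"))
print(allowed("https://example.com/private/data"))
```

Passing `robots.txt` says nothing about a site's terms of service, so this is only a first check.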

## Configuration

In [50]:
# Import dependencies
import pandas as pd
import re
import os
from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
# Config
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--no-sandbox")
# Return from driver.get() immediately instead of waiting for the full page load
chrome_options.page_load_strategy = "none"
# Selenium 4 takes the driver path through a Service object and the options through `options`
# Fall back to webdriver_manager when CHROMEDRIVER_PATH is not set
service = Service(os.environ.get("CHROMEDRIVER_PATH") or ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)



## Make the browser navigate to a page

In [51]:
# Get the html with Selenium
driver.get('https://www.musicfestivalwizard.com/all-festivals/')

## Parse the html and scrape the data I want

In [52]:
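# Wait for the html tag that contains the list of artists
# (the "none" page load strategy does not wait for the page to finish loading)
element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "artist")))
x = element.get_attribute("outerHTML")
# Create a BeautifulSoup object of the html
soup = BeautifulSoup(x, "html.parser")
# Get the artists' names in list form (the first option is skipped)
artists = [a.text for a in soup.find_all("option")][1:]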
# Convert to dataframe
scrape = pd.DataFrame()
scrape['artistName'] = artists
driver.quit()
scrape

| | artistName |
|---|---|
| 0 | 21 Savage |
| 1 | 3 Doors Down |
| 2 | 311 |
| 3 | A Day To Remember |
| 4 | A Tribe Called Quest |
| ... | ... |
| 1345 | Knocked Loose |
| 1346 | Lilith Czar |
| 1347 | Puscifer |
| 1348 | The Bronx |


# Step 2: API (Application Programming Interface)
- Data meant for machine consumption
- [Well-documented](https://developer.spotify.com/documentation/web-api/reference/#/operations/search)
- Strict protocols
- Reliable
- Generally ethical
- Often requires API keys or subscriptions
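
To make "strict protocols" concrete, here is a rough sketch of the kind of HTTP request a client library sends to Spotify's search endpoint. It only builds the request and never sends it. The `build_search_request` helper and the placeholder token are assumptions for illustration; a real call needs a valid access token.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_ENDPOINT = "https://api.spotify.com/v1/search"

def build_search_request(query, access_token, item_type="artist", limit=10):
    """Build (but do not send) a search request for the Spotify Web API."""
    # Query parameters are URL-encoded; spaces become '+'
    params = urlencode({"q": query, "type": item_type, "limit": limit})
    # The API expects the access token as a Bearer token
    headers = {"Authorization": f"Bearer {access_token}"}
    return Request(f"{SEARCH_ENDPOINT}?{params}", headers=headers)

# PLACEHOLDER_TOKEN is not a real credential
request = build_search_request("Topaz Jones", access_token="PLACEHOLDER_TOKEN")
print(request.full_url)
```

Every part of the request (endpoint, parameter names, header format) is fixed by the API documentation, which is what makes an API more predictable than a scraped page.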

## Configuration

In [55]:
# Import dependencies
import tekore as tk
import spotipy
import os
import numpy as np
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Environment variables
CLIENTID = os.environ.get('CLIENTID')
CLIENTSECRET = os.environ.get('CLIENTSECRET')
# Set global variables
MINDATE = datetime.today() - relativedelta(years=3)
### Tekore
# Get client token
app_token = tk.request_client_token(CLIENTID, CLIENTSECRET)
# Create spotify instance
spotify = tk.Spotify(app_token)

## Find an artist's SpotifyID by searching for their name

In [56]:
artists, = spotify.search("Topaz Jones", types=['artist'], limit=10)
artistID = artists.items[0].id
print(artistID)

76bAuLD5jMIT1YDJ84KB8l


## Get the IDs, Names, and Popularities for any album this artist released in the last three years

In [57]:
# The include_groups argument limits the results to the release types we want (here, albums and singles)
# Spotify release dates can be 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD', so pad the missing parts before parsing
def parse_releaseDate(release_date):
    return datetime.strptime((release_date + '-01-01')[:10], '%Y-%m-%d')

# Make a function to join the multiple pages of albums together
def get_artistAlbums(artistID):
    # Initialize variables
    length = 50
    albumList = []
    offset = 0
    # If the returned page is 50 items that means we need to call the API for another chunk
    while length == 50:
        # Call the API to get the artist's albums
        albums = spotify.artist_albums(artistID, include_groups=['album', 'single'], limit=50, offset=offset)
        # Append the albums released after MINDATE to our existing list
        albumList = albumList + [x.id for x in albums.items if parse_releaseDate(x.release_date) > MINDATE]
        # Reset the length. If the length goes below 50 we will not call the API again
        length = len(albums.items)
        # Increase the offset so we are not getting the same albums again
        offset += 50
    return albumList

# Now we only have albumIDs that are in the dates that we want and the type that we want
albumList = get_artistAlbums(artistID)
# Next let's get the names and popularities from these albums (one API call per album)
albums = [spotify.album(x) for x in albumList]
total = [(a.id, a.name, a.popularity) for a in albums]
pd.DataFrame(total, columns=['albumID', 'albumName', 'albumPopularity'])



| | albumID | albumName | albumPopularity |
|---|---|---|---|
| 0 | 1EieCilyiR9fOnjbV8sTEm | Don't Go Tellin' Your Momma | 49 |
| 1 | 0xZKaOjDHI0Kx3IfS2eb0M | Broke | 40 |
| 2 | 4792PsWMQDVUL5Ov33IcIs | Cardinal | 30 |
| 3 | 7gwgFynhmmyROorysiFjzX | Model Home | 21 |
| 4 | 2DiDH1MgKKSS6NhEPLrVuU | chuu | 16 |
| 5 | 5DkWIqx5OhfFRzMUuPJx6b | D. I. A. L. | 10 |
| 6 | 3y8sZLquAIwSOVz9g2mQnI | Herringbone | 14 |

# Next step: Turn these IDs into a playlist using the API

In [None]:
# ?????????????
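
Before anything is sent to the API, the candidate tracks still have to be narrowed down to the playlist rules from the top of the notebook (a fixed size, a cap per artist, nothing older than the cutoff). One possible sketch of that narrowing step is below. Ranking by popularity is an assumption of this example, not something the rules specify, and the `pick_playlist` name, the track dictionaries, and their field names are hypothetical.

```python
from collections import Counter
from datetime import date

def pick_playlist(tracks, min_date, size=100, per_artist=3):
    """Pick up to `size` tracks, most popular first, capped per artist."""
    counts = Counter()
    chosen = []
    for track in sorted(tracks, key=lambda t: t["popularity"], reverse=True):
        # Skip anything released on or before the cutoff date
        if track["released"] <= min_date:
            continue
        # Skip artists that already have their full share of songs
        if counts[track["artist"]] >= per_artist:
            continue
        chosen.append(track)
        counts[track["artist"]] += 1
        if len(chosen) == size:
            break
    return chosen

# Made-up candidate tracks for illustration
candidates = [
    {"name": "a1", "artist": "A", "popularity": 90, "released": date(2023, 5, 1)},
    {"name": "a2", "artist": "A", "popularity": 80, "released": date(2023, 5, 1)},
    {"name": "a3", "artist": "A", "popularity": 70, "released": date(2023, 5, 1)},
    {"name": "a4", "artist": "A", "popularity": 60, "released": date(2023, 5, 1)},
    {"name": "b1", "artist": "B", "popularity": 50, "released": date(2023, 2, 1)},
    {"name": "b_old", "artist": "B", "popularity": 95, "released": date(2015, 1, 1)},
]

playlist = pick_playlist(candidates, min_date=date(2021, 1, 1), size=4)
print([t["name"] for t in playlist])
```

The selected track IDs could then be passed to whichever playlist endpoints the API client exposes, which requires a user-authorized token rather than the client token used above.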

# What does this mean for PSPE?
- Salesforce API
- Census API
- Other databases

# Other Examples

### [US Census GUI](https://data.census.gov/table?q=dp02&tid=ACSDP5Y2020.DP02)

### [US Census API](https://api.census.gov/data/2020/acs/acs5/profile?get=group(DP02)&for=us:1)
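
The API link above is just a base path plus a query string, so it can be assembled from its parts. The sketch below rebuilds that same URL with the standard library. The `census_url` helper and its parameter names are hypothetical and only cover the `get` and `for` parameters seen in the link.

```python
from urllib.parse import urlencode

BASE = "https://api.census.gov/data"

def census_url(year, dataset, get, geography):
    """Assemble a Census API query URL from its parts."""
    # Leave (), :, comma and * unescaped so the URL reads like the link above
    query = urlencode({"get": get, "for": geography}, safe="():,*")
    return f"{BASE}/{year}/{dataset}?{query}"

url = census_url(2020, "acs/acs5/profile", "group(DP02)", "us:1")
print(url)
```

Swapping the `get` or `geography` arguments is then a one-line change instead of hand-editing the URL string.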


## Make a call to the Census API

In [58]:
# Import dependencies
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from urllib.request import urlopen
from io import StringIO

# Set url
url = 'https://api.census.gov/data/2020/acs/acs5/profile?get=group(DP02)&for=us:1'

def parseCensus(url):
    # The Census API returns JSON text rather than html, so read the response body directly
    with urlopen(url) as page:
        censusData = page.read().decode("utf-8")
    return censusData

censusData = parseCensus(url)
censusData

URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:997)>

## Clean it up

In [None]:
import json

def cleanCensus(censusData):
    # The API returns a JSON array of rows; the first row holds the column names
    rows = json.loads(censusData)

    # Create dataframe from the remaining rows (values stay as strings)
    censusData = pd.DataFrame(rows[1:], columns=rows[0])
    return censusData
cleanCensus(censusData).transpose()