# ADS 509 Module 1: APIs and Web Scraping

This notebook has two parts. In the first part, you will scrape lyrics from AZLyrics.com. In the second part, you'll run code that verifies the completeness of your data pull.

For this assignment you have chosen two musical artists who have at least 20 songs with lyrics on AZLyrics.com. We start by pulling some information about each artist and analyzing it.

## Import Libraries

In [1]:
import os
import datetime
import re
import shutil

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict, Counter
import random

In [2]:
artists = {
    'mac miller': "https://www.azlyrics.com/m/macmiller.html",
    'pop smoke': "https://www.azlyrics.com/p/popsmoke.html"
}

## A Note on Rate Limiting

The lyrics site, www.azlyrics.com, does not publish an explicit limit on the number of requests allowed in a given period, but in our testing, too many requests in too short a time caused the site to stop returning lyrics pages. For that reason, the code in this notebook sleeps for a random 5 to 15 seconds after every request.
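
As a rough illustration of that pause, the sketch below wraps a request in a helper that always sleeps after fetching. The names `polite_delay` and `fetch_politely` are hypothetical and not part of the assignment. The `fetch` and `sleep` parameters are assumptions added so the helper can be tried without contacting the site.

```python
import random
import time


def polite_delay(base=5.0, jitter=10.0):
    """Return a random pause in seconds, between base and base + jitter."""
    return base + jitter * random.random()


def fetch_politely(url, fetch, sleep=time.sleep):
    """Call fetch(url), then pause before returning the response.

    fetch is any callable that takes a URL, for example requests.get.
    """
    response = fetch(url)
    sleep(polite_delay())
    return response


# Dry run: stand-ins replace the network call and the real sleep
pauses = []
page = fetch_politely(
    "https://www.azlyrics.com/m/macmiller.html",
    fetch=lambda url: f"<html>fake page for {url}</html>",
    sleep=pauses.append,
)
print(page)
print(f"Would have slept {pauses[0]:.1f} seconds")
```

In a real run, `fetch` would be something like `requests.get` and `sleep` would keep its default of `time.sleep`.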

In [3]:
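# Map each artist to the lyrics-page URLs found on their artist page
lyrics_pages = defaultdict(list)

for artist, artist_page in artists.items():
    r = requests.get(artist_page, timeout=30)
    r.raise_for_status()  # stop early if the site refuses the request
    time.sleep(5 + 10 * random.random())  # pause 5 to 15 seconds between requests
    soup = BeautifulSoup(r.text, "html.parser")

    # Keep only the links that point to lyrics pages
    for link in soup.select("div.listalbum-item a[href]"):
        href = link["href"]
        if href.startswith("/lyrics/"):
            full_link = "https://www.azlyrics.com" + href
            lyrics_pages[artist].append(full_link)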

In [4]:
for artist, lp in lyrics_pages.items():
    # Each artist needs at least 20 distinct lyrics pages
    assert len(set(lp)) >= 20, f"Not enough lyrics pages for {artist}"
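
The check above counts distinct links, while the lists themselves may still contain repeats if a song appears more than once on an artist page. A minimal sketch of removing repeats while keeping the original order follows. The helper name `unique_in_order` and the example URLs are hypothetical.

```python
def unique_in_order(links):
    """Drop repeated links, keeping the first occurrence of each."""
    # dict keys preserve insertion order, so this de-duplicates without reordering
    return list(dict.fromkeys(links))


# Made-up links for illustration only
example_links = [
    "https://www.azlyrics.com/lyrics/artist/song-a.html",
    "https://www.azlyrics.com/lyrics/artist/song-b.html",
    "https://www.azlyrics.com/lyrics/artist/song-a.html",
]

deduped = unique_in_order(example_links)
print(len(example_links), "links,", len(deduped), "distinct")
```

Applying this to each list in `lyrics_pages` before scraping would keep a slice such as `[:20]` from requesting the same page twice.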

In [5]:
for artist, links in lyrics_pages.items():
    print(f"For {artist} we have {len(links)}.")
    print(f"The full pull for this artist will take {round(len(links)*10/3600,2)} hours.")

For mac miller we have 310.
The full pull for this artist will take 0.86 hours.
For pop smoke we have 106.
The full pull for this artist will take 0.29 hours.
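
The factor of 10 in the estimate is the average pause per request: `5 + 10 * random.random()` is uniform between 5 and 15 seconds, so its mean is 10 seconds. The sketch below spells out that arithmetic. The function name `estimated_hours` is hypothetical, and the estimate ignores the time the requests themselves take.

```python
def estimated_hours(n_pages, base=5.0, jitter=10.0):
    """Expected total sleep time in hours for n_pages requests."""
    # Mean of a uniform draw between base and base + jitter
    expected_pause = base + jitter / 2
    return n_pages * expected_pause / 3600


# Page counts taken from the output above
for artist, n_pages in {"mac miller": 310, "pop smoke": 106}.items():
    print(f"{artist}: about {round(estimated_hours(n_pages), 2)} hours")
```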


In [6]:
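def generate_filename_from_link(link):
    """Turn a lyrics-page URL into a flat .txt filename; return None for an empty link."""
    if not link:
        return None

    # Strip the scheme, the .html extension, and the /lyrics/ path segment
    name = link.replace("https","").replace("http","")
    name = name.replace(".html","")
    name = name.replace("/lyrics/","")
    # Replace the remaining dots and slashes so the name is a single path component
    name = name.replace("://","").replace(".","_").replace("/","_")
    name = name + ".txt"
    return name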
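
To make the naming scheme concrete, the sketch below repeats the function and applies it to one link. The example URL is illustrative and is an assumption about what a lyrics-page link looks like, based on the `/lyrics/` prefix checked earlier.

```python
def generate_filename_from_link(link):
    if not link:
        return None

    name = link.replace("https", "").replace("http", "")
    name = name.replace(".html", "")
    name = name.replace("/lyrics/", "")
    name = name.replace("://", "").replace(".", "_").replace("/", "_")
    name = name + ".txt"
    return name


# Illustrative URL, not taken from the scrape
example = "https://www.azlyrics.com/lyrics/macmiller/selfcare.html"
print(generate_filename_from_link(example))
# prints: www_azlyrics_commacmiller_selfcare.txt
```

Because `/lyrics/` is removed rather than replaced, the domain and the artist slug run together in the result. The names stay unique per song, which is all the later cells need.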

In [7]:
if os.path.isdir("lyrics"):
    shutil.rmtree("lyrics/")

os.mkdir("lyrics")

In [8]:
start = time.time()

total_pages = 0

for artist in lyrics_pages:
    # One subfolder per artist, for example lyrics/mac miller
    folder = f"lyrics/{artist}"
    os.makedirs(folder, exist_ok=True)

    # Pull only the first 20 lyrics pages for each artist
    for link in lyrics_pages[artist][:20]:
        try:
            r = requests.get(link, timeout=30)
            time.sleep(5 + 10 * random.random())  # pause 5 to 15 seconds between requests
            r.raise_for_status()  # treat blocked or missing pages as failures

            soup = BeautifulSoup(r.text, "html.parser")
            title = soup.find("title").get_text(strip=True)
            # The lyrics are assumed to be in the first div with neither a class nor an id
            divs = soup.find_all("div", class_=False, id=False)
            lyrics = divs[0].get_text(separator="\n").strip()

            filename = generate_filename_from_link(link)
            filepath = os.path.join(folder, filename)

            # Save the page title, a blank line, then the lyrics
            with open(filepath, "w", encoding="utf-8") as f:
                f.write(title + "\n\n" + lyrics)

            total_pages += 1

        except Exception as e:
            print(f"Failed to get {link} for {artist}: {e}")

In [9]:
print(f"Total run time was {round((time.time() - start)/3600, 2)} hours.")

Total run time was 0.12 hours.


In [10]:
def words(text):
    return re.findall(r'\w+', text.lower())
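
A small sketch of what this tokenizer does to a line of text may help when reading the counts below. The sample string is made up for illustration.

```python
import re


def words(text):
    return re.findall(r'\w+', text.lower())


# Made-up sample, not taken from the scraped lyrics
sample = "Don't stop, won't stop!\nDon't stop."
tokens = words(sample)
print(tokens)
print(f"{len(tokens)} words, {len(set(tokens))} unique")
```

Because `\w+` does not match apostrophes, a contraction such as "don't" is split into "don" and "t". This is one reason the word counts are reported as rough figures.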

In [11]:
artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders:
    artist_files = os.listdir("lyrics/" + artist)
    # Keep only data files, matching on the file extension rather than any substring
    artist_files = [f for f in artist_files if f.endswith((".txt", ".csv", ".tsv"))]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files:
        with open("lyrics/" + artist + "/" + f_name, encoding="utf-8") as infile:
            artist_words.extend(words(infile.read()))

    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")

For mac miller we have 20 files.
For mac miller we have roughly 12582 words, 1835 are unique.
For pop smoke we have 20 files.
For pop smoke we have roughly 9296 words, 1358 are unique.
