# ADS 509 Module 1: APIs and Web Scraping

## Assignment 1.1 Data Acquisition

This notebook has two parts. In the first part, you will scrape lyrics from AZLyrics.com. In the second part, you'll run code that verifies the completeness of your data pull. 

For this assignment you have chosen two musical artists who have at least 20 songs with lyrics on AZLyrics.com. We start by pulling their lyrics pages and then run a brief check on what was collected.


# Importing Libraries

In [1]:
import os
import datetime
import re

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict, Counter
import random
import shutil

---

# Lyrics Scrape

This section asks you to pull data by scraping www.AZLyrics.com. In the notebooks where you do that work you are asked to store the data in specific ways. 

In [2]:
# Dictionary to hold the artist name and lyrics link
artists = {'bigsean':"https://www.azlyrics.com/b/bigsean.html",
           'johnlegend':"https://www.azlyrics.com/j/johnlegend.html"} 


## A Note on Rate Limiting

The lyrics site, www.azlyrics.com, does not publish an explicit limit on the number of requests in a given period, but in our testing too many requests in too short a time caused the site to stop returning lyrics pages. (Entertainingly, the page that gets returned seems to contain only the title of [a Tom Jones song](https://www.azlyrics.com/lyrics/tomjones/itsnotunusual.html).) 

Whenever you call `requests.get` to retrieve a page, put a `time.sleep(5 + 10*random.random())` on the next line. This will help you not to get blocked. If you _do_ get blocked, which you can identify if the returned pages are not correct, just request a lyrics page through your browser. You'll be asked to perform a CAPTCHA and then your requests should start working again. 
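
One way to make the pause hard to forget is to wrap the request and the sleep in a single helper. The sketch below is only an illustration: `polite_delay` and `polite_get` are hypothetical names, and the fetch and sleep functions are passed in as arguments so the pattern can be tried without touching the network.

```python
import random
import time

def polite_delay(base=5.0, spread=10.0):
    """Return a random pause in seconds, drawn from [base, base + spread)."""
    return base + spread * random.random()

def polite_get(url, fetch, sleep=time.sleep):
    """Fetch `url` with the supplied `fetch` callable, then pause before returning.

    In the notebook, `fetch` would be `requests.get`.
    """
    response = fetch(url)
    sleep(polite_delay())
    return response

# Offline demonstration with stand-ins for the network call and the pause.
pauses = []
page = polite_get("https://www.azlyrics.com/b/bigsean.html",
                  fetch=lambda u: f"<fetched {u}>",
                  sleep=pauses.append)
print(page, round(pauses[0], 1))
```

Because the sleep lives inside the helper, every call site gets the delay without repeating the `time.sleep` line.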

## Part 1: Finding Links to Song Lyrics

The general artist page has a list of all songs for that artist with links to the individual song pages. 

**Q: Take a look at the `robots.txt` page on www.azlyrics.com. (You can read more about these pages [here](https://developers.google.com/search/docs/advanced/robots/intro).) Is the scraping we are about to do allowed or disallowed by this page? How do you know?**

A: We can view the robots.txt file for the website by visiting https://www.azlyrics.com/robots.txt. This file outlines the rules that permit or restrict specific crawlers from accessing certain paths or directories. For our intended web scraping, it is allowed, as the file only disallows crawling of two particular directories and their contents, "lyricsdb" and "song," while permitting access to all other URLs, except for one specific crawler (008).
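
The same check can be done programmatically with the standard library's `urllib.robotparser`. The sketch below is hedged: `SAMPLE_ROBOTS` is a stand-in written from the description in the answer above, not a copy of the live file, and the example URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: stand-in rules based on the answer's description,
# not the actual contents of https://www.azlyrics.com/robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /lyricsdb/
Disallow: /song/

User-agent: 008
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

lyrics_url = "https://www.azlyrics.com/lyrics/bigsean/blessings.html"
blocked_url = "https://www.azlyrics.com/lyricsdb/anything.html"

allowed_generic = parser.can_fetch("*", lyrics_url)        # ordinary crawler, lyrics page
allowed_blocked_path = parser.can_fetch("*", blocked_url)  # ordinary crawler, disallowed folder
allowed_008 = parser.can_fetch("008", lyrics_url)          # the one crawler that is shut out
print(allowed_generic, allowed_blocked_path, allowed_008)
```

To test against the real file instead, call `parser.set_url("https://www.azlyrics.com/robots.txt")` followed by `parser.read()` in place of `parser.parse(...)`.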


In [3]:
# Let's set up a dictionary of lists to hold our links
lyrics_pages = defaultdict(list)

for artist, artist_page in artists.items() :
    # request the page and sleep
    r = requests.get(artist_page)
    time.sleep(5 + 10*random.random())
    
    # Parse the raw HTML content of the page
    soup = BeautifulSoup(r.content, 'html.parser')
    
    # Extracts all the 'a' tags with href attribute links
    for link in soup.find_all('a', href=True):
        href = link.get('href')
        
        # Check if the link matches the corresponding artist
        if href.startswith(f'/lyrics/{artist}/'):
            lyrics_pages[artist].append(href)               # Store the link to the dictionary where key is the artist

Let's make sure we have enough lyrics pages to scrape. 

In [4]:
for artist, lp in lyrics_pages.items() :
    assert(len(set(lp)) > 20)            # Checks if there are more than 20 lyrics links
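
The assertion counts unique links via `set`, which suggests that the same `href` can appear more than once in `lyrics_pages`. If that happens, slicing the raw list later could pull the same song twice. A minimal sketch of removing repeats while keeping page order, using made-up links (`unique_in_order` is a hypothetical helper):

```python
def unique_in_order(links):
    """Drop repeated links, keeping the first occurrence of each."""
    return list(dict.fromkeys(links))

# Hypothetical hrefs, for illustration only.
sample_links = [
    "/lyrics/exampleartist/songone.html",
    "/lyrics/exampleartist/songtwo.html",
    "/lyrics/exampleartist/songone.html",
]
deduped = unique_in_order(sample_links)
print(deduped)
```

Unlike `set`, `dict.fromkeys` preserves insertion order, so the first 20 entries remain the first 20 distinct songs on the artist page.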

In [5]:
# Let's see how long it's going to take to pull these lyrics 
# if we're waiting `5 + 10*random.random()` seconds 
for artist, links in lyrics_pages.items(): 
    print(f"For {artist} we have {len(links)}.")
    print(f"The full pull for this artist will take {round(len(links)*10/3600,2)} hours.")

For bigsean we have 247.
The full pull for this artist will take 0.69 hours.
For johnlegend we have 204.
The full pull for this artist will take 0.57 hours.
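
The factor of 10 in the estimate is the expected pause per request: `random.random()` averages 0.5, so `5 + 10*random.random()` averages 10 seconds. The sketch below spells out that arithmetic. `estimate_hours` is a hypothetical helper, and the estimate ignores the time the requests themselves take.

```python
def estimate_hours(n_links, base=5.0, spread=10.0):
    """Expected total sleep time in hours for n_links requests."""
    mean_pause = base + spread * 0.5      # 5 + 10 * 0.5 = 10 seconds
    return n_links * mean_pause / 3600    # seconds -> hours

# Link counts taken from the output above.
for artist, n in {"bigsean": 247, "johnlegend": 204}.items():
    print(artist, round(estimate_hours(n), 2))
```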


## Part 2: Pulling Lyrics

Now that we have the links to our lyrics pages, let's go scrape them! Here are the steps for this part. 

1. Create an empty folder in our repo called "lyrics". 
1. Iterate over the artists in `lyrics_pages`. 
1. Create a subfolder in lyrics with the artist's name. For instance, if the artist was Cher you'd have `lyrics/cher/` in your repo.
1. Iterate over the pages. 
1. Request the page and extract the lyrics from the returned HTML file using BeautifulSoup.
1. Use the function below, `generate_filename_from_link`, to create a filename based on the lyrics page, then write the lyrics to a text file with that name. 


In [6]:
# Function to generate filename from url
def generate_filename_from_link(link) :
    
    if not link :
        return None
    
    # drop the http or https and the html
    name = link.replace("https","").replace("http","")
    name = name.replace(".html","")   # chain from `name` so the scheme removal above is kept

    name = name.replace("/lyrics/","")
    
    # Replace useless characters with UNDERSCORE
    name = name.replace("://","").replace(".","_").replace("/","_")
    
    # tack on .txt
    name = name + ".txt"
    
    return(name)
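
The function above works by chained string replacement. For comparison, here is a hedged sketch of the same idea built on the standard library's URL and path helpers. `filename_from_link` is a hypothetical name, not part of the assignment, and the example links are illustrative.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def filename_from_link(link):
    """Turn a lyrics link (relative or absolute) into '<artist>_<song>.txt'."""
    if not link:
        return None
    # Keep only the path and drop the final extension,
    # e.g. "/lyrics/bigsean/blessings.html" -> "/lyrics/bigsean/blessings"
    path = PurePosixPath(urlparse(link).path).with_suffix("")
    # Drop the root and any "lyrics" path component, keep the rest
    parts = [p for p in path.parts if p not in ("/", "lyrics")]
    return "_".join(parts).replace(".", "_") + ".txt"

relative = filename_from_link("/lyrics/bigsean/blessings.html")
absolute = filename_from_link("https://www.azlyrics.com/lyrics/bigsean/blessings.html")
print(relative, absolute)
```

Parsing the URL first means relative and absolute links map to the same filename, which the replacement-based approach does not guarantee.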


In [7]:
# Function to extract title and lyrics 
url_stub = "https://www.azlyrics.com" 

def extract_lyrics(artist, link):
    url = url_stub + link
    response = requests.get(url)
    time.sleep(5 + 10*random.random())
    
    # Parse the raw HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find('h1').text.strip('"').split('"')[0]   # Extract title
    lyrics = soup.find('div', class_=False, id=False).text  # Extract lyrics
    return title, lyrics
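
The title expression in `extract_lyrics` is compact, so a small worked example may help. It assumes, without checking the live site, that the page heading reads like `"<title>" lyrics`. `title_from_heading` is a hypothetical helper that repeats the same string operations.

```python
def title_from_heading(heading_text):
    """Strip quote characters from both ends, then keep the text before the next quote."""
    return heading_text.strip('"').split('"')[0]

# ASSUMPTION: the heading text has the form '"<title>" lyrics'.
example_title = title_from_heading('"Blessings" lyrics')
print(example_title)
```

`strip('"')` removes the leading quote, and `split('"')[0]` then discards everything from the closing quote onward.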

In [8]:
# Make the lyrics folder here. If you'd like to practice your programming, add functionality 
# that checks to see if the folder exists. If it does, then use shutil.rmtree to remove it and create a new one.

if os.path.isdir("lyrics") : 
    shutil.rmtree("lyrics/")

os.mkdir("lyrics")
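
The delete-then-recreate pattern can also be folded into one small function. This is a sketch under the assumption that losing the folder's previous contents is acceptable; `reset_dir` is a hypothetical name, and the demonstration runs in a temporary directory so nothing in the repo is touched.

```python
import os
import shutil
import tempfile

def reset_dir(path):
    """Delete `path` if it exists, then recreate it empty."""
    shutil.rmtree(path, ignore_errors=True)   # no error if the folder is missing
    os.makedirs(path)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "lyrics")
    reset_dir(target)                                    # first call creates the folder
    open(os.path.join(target, "old.txt"), "w").close()   # leave a stale file behind
    reset_dir(target)                                    # second call wipes it
    remaining = os.listdir(target)
print(remaining)
```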

In [9]:
start = time.time()

total_pages = 0 

for artist, links in lyrics_pages.items():
    artist_links = links[:20]
    
    parent_dir = "lyrics"
    # creating a subfolder for the artist
    artist_dir = os.path.join(parent_dir, artist)
    if os.path.isdir(artist_dir):
        shutil.rmtree(artist_dir)
    os.mkdir(artist_dir)
    
    print(f"Extracting Song Lyrics for {artist}...")
    
    for link in artist_links:
        title, lyrics = extract_lyrics(artist, link)
        filename = generate_filename_from_link(link)
        file_path = os.path.join(artist_dir, filename)
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(title)
            f.write('\n\n')
            f.write(lyrics)
            
print("Extraction Completed...")

Extracting Song Lyrics for bigsean...
Extracting Song Lyrics for johnlegend...
Extraction Completed...


In [10]:
print(f"Total run time was {round((time.time() - start)/3600,2)} hours.")

Total run time was 0.15 hours.


---

# Evaluation

This assignment asks you to pull data by scraping www.AZLyrics.com. After you have finished the above sections, run all the cells in this notebook. Print this to PDF and submit it, per the instructions.

In [11]:
# Simple word extractor from Peter Norvig: https://norvig.com/spell-correct.html
def words(text): 
    return re.findall(r'\w+', text.lower())
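
A quick example shows why the word counts that follow are only rough. The pattern `\w+` matches runs of letters, digits, and underscores, so an apostrophe splits a contraction into two tokens. The sample sentence is made up for illustration.

```python
import re

def words(text):
    return re.findall(r'\w+', text.lower())

tokens = words("Don't stop, don't STOP!")
print(tokens)
print(len(tokens), "tokens,", len(set(tokens)), "unique")
```

Here "don't" becomes `don` and `t`, so both the total and the unique counts differ slightly from what a reader would count by hand.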

## Checking Lyrics 

The output from your lyrics scrape should be stored in files at this path, relative to the notebook's directory:
`lyrics/[Artist Name]/[filename from URL]`. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [12]:
artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders : 
    artist_files = os.listdir("lyrics/" + artist)
    artist_files = [f for f in artist_files if 'txt' in f or 'csv' in f or 'tsv' in f]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files : 
        # read with the same encoding the scrape used when writing the files
        with open("lyrics/" + artist + "/" + f_name, encoding='utf-8') as infile : 
            artist_words.extend(words(infile.read()))

            
    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")


For bigsean we have 20 files.
For bigsean we have roughly 10837 words, 1731 are unique.
For johnlegend we have 20 files.
For johnlegend we have roughly 7868 words, 898 are unique.
