# Building a Simple Web Scraper
In this exercise, we will build a simple web scraper using:
* the `requests` library for interacting with websites over HTTP
* the `bs4` (aka BeautifulSoup4) library for interacting with HTML content

To collect our dataset, we will need to generate a list of URLs to scrape and then, for each URL:
1. Get the page
2. Extract the text components of the page
3. Write the text to disk

# Imports
First, let's import the necessary libraries and objects.

In [1]:
import requests
from bs4 import BeautifulSoup
from pathlib import Path

# Fetching a Page
To write our loop, we need to decide exactly which elements to extract from a target page. However, we haven't looked at a single page yet! Let's define a function to do exactly that: given a `url` parameter, fetch the page and return the body.

You may have to configure a `user-agent` string in your header. To verify that you have not been blocked, you can check the `.status_code` attribute of the `Response` object.

In [2]:
def fetch_page(url: str) -> str:
  # Send a browser-like user-agent in case the site rejects the default one
  headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
  }
  r = requests.get(url, headers=headers)
  # Print any non-200 status code so a blocked or missing page is easy to spot
  if r.status_code != 200:
    print(r.status_code)
  return r.text

In [3]:
# Test whether the function fetches the page correctly
test_url = "http://books.toscrape.com/catalogue/olio_984/index.html"
test_result = fetch_page(test_url)
print(test_result)



<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    Olio | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="
    Part fact, part fiction, Tyehimba Jess&#39;s much anticipated second book weaves sonnet, song, and narrative to examine the lives of mostly unrecorded African American performers directly before and after the Civil War up to World War I. Olio is an effort to understand how they met, resisted, complicated, co-opted, and sometimes defeated attempts to minstrelize them.So, while Part fact, part 

# Parsing Web Pages
Now that we can fetch arbitrary web pages and have a test result stored, let's write a function to extract only the text we want! Since our language model will be writing product descriptions, we want to extract the product description from each page.

To do this, we'll need to use the `.find()` or `.find_all()` method of `BeautifulSoup`.

Some tags have ids, so if we know the particular tag id, we can use the `id` keyword within `.find()`, like this: `soup.find('p', id='name')`. In some cases, we want the next tag of a given type once we find the relevant part of the text, so we can use `.find_next()`.
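
As a minimal sketch of how these methods behave, the snippet below runs `.find()` and `.find_next()` on a small hand-written HTML string. The markup and the helper name `first_paragraph_after` are made up for illustration and are not taken from the real site. Note that `.find()` returns `None` when nothing matches, so it is worth checking the result before chaining another call.

```python
from typing import Optional

from bs4 import BeautifulSoup

# Hand-written sample markup (illustrative only, not the real page)
SAMPLE_HTML = """
<div id="product_description"><h2>Product Description</h2></div>
<p>A short example description.</p>
<p>Some other paragraph.</p>
"""


def first_paragraph_after(html_doc: str, div_id: str) -> Optional[str]:
    """Return the text of the first <p> after the <div> with the given id."""
    soup = BeautifulSoup(html_doc, "html.parser")

    # .find() returns the first matching tag, or None if there is no match
    anchor = soup.find("div", id=div_id)
    if anchor is None:
        return None

    # .find_next() searches forward in document order from the anchor tag
    paragraph = anchor.find_next("p")
    return paragraph.text if paragraph is not None else None


print(first_paragraph_after(SAMPLE_HTML, "product_description"))
print(first_paragraph_after(SAMPLE_HTML, "no_such_id"))
```

The first call prints the text of the first paragraph after the `<div>`; the second prints `None` because no tag has that id.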

In [4]:
def parse_page(html_doc:str) -> str:
  soup = BeautifulSoup(html_doc, 'html.parser')

  # Find the <div> element with id="product_description"
  product_div = soup.find('div', id='product_description')

  # Find the first <p> element that follows product_div in the document
  selected_elements = product_div.find_next('p')

  description = selected_elements.text
  return description

In [5]:
test_text = parse_page(test_result)
print(test_text)

Part fact, part fiction, Tyehimba Jess's much anticipated second book weaves sonnet, song, and narrative to examine the lives of mostly unrecorded African American performers directly before and after the Civil War up to World War I. Olio is an effort to understand how they met, resisted, complicated, co-opted, and sometimes defeated attempts to minstrelize them.So, while Part fact, part fiction, Tyehimba Jess's much anticipated second book weaves sonnet, song, and narrative to examine the lives of mostly unrecorded African American performers directly before and after the Civil War up to World War I. Olio is an effort to understand how they met, resisted, complicated, co-opted, and sometimes defeated attempts to minstrelize them.So, while I lead this choir, I still find thatI'm being ledâ¦I'm a missionarymending my faith in the midst of this flockâ¦I toil in their fields of praise. When folks seethese freedmen stand and sing, they hear their Godspeak in tongues. These nine dark mout
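
The `â¦` sequences in the output above look like an encoding mismatch rather than part of the description. A likely cause, stated here as an assumption, is that the server sends UTF-8 bytes without declaring a charset, in which case `requests` falls back to ISO-8859-1 when building `r.text`. The sketch below reproduces the effect with plain strings (the sample text is illustrative) and shows that reversing the wrong decode recovers the original character.

```python
# Sample text containing an ellipsis character (illustrative)
original = "being led…I'm"

# The bytes a UTF-8 server would send
raw_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 produces the garbled form
garbled = raw_bytes.decode("iso-8859-1")

# Re-encoding with the wrong codec and decoding as UTF-8 undoes the damage
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

If this is indeed the cause, setting `r.encoding = 'utf-8'` (or `r.encoding = r.apparent_encoding`) before reading `r.text` in `fetch_page` should avoid the problem at the source.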

# Saving Files
Once we've scraped and parsed the page, we want to save the raw text to a file. We also want to be able to choose where it is saved: a train directory or a test directory.

Since we have the URL of the page, we can name each file after its unique identifying string. For example, book URLs have the form `"http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"`, so the file name we want is "a-light-in-the-attic_1000.txt": the second-to-last component of the URL, "a-light-in-the-attic_1000", followed by ".txt".
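
As a small sketch of that naming rule, the snippet below derives the file name in two ways: by splitting on `"/"`, and by parsing the URL with the standard library. The helper name `file_name_from_url` is hypothetical and only used here for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def file_name_from_url(url: str) -> str:
    """Build '<book-id>.txt' from the folder that contains index.html."""
    # urlparse(...).path is "/catalogue/<book-id>/index.html"
    path = PurePosixPath(urlparse(url).path)
    # .parent drops "index.html"; .name keeps only the last folder
    return f"{path.parent.name}.txt"


example_url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

# Simple approach: take the second-to-last "/"-separated component
print(example_url.split("/")[-2] + ".txt")  # a-light-in-the-attic_1000.txt

# Parsed approach: same result, but ignores the scheme and host explicitly
print(file_name_from_url(example_url))      # a-light-in-the-attic_1000.txt
```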

In [6]:
def save_text(text, url, train=True):
  # Save the data to "./data/train/" if it's in the training set,
  # otherwise save it to "./data/test/"
  if train:
    file_path = Path("./data/train/")
  else:
    file_path = Path("./data/test/")
  file_path.mkdir(parents=True, exist_ok=True)

  # Split the URL by "/"
  split_url = url.split("/")

  # Pull the name from the URL and add a .txt extension
  file_name = f"{split_url[-2]}.txt"
  print(file_name)

  # Write the file to disk (this must run for both train and test data)
  with open(file_path.joinpath(file_name), "w", encoding="utf-8") as f:
    f.write(text)

In [7]:
save_text(test_text, test_url, train=True)

# Generating URLs
We have the test URL for "olio_984", but since we want to collect all the books and pages, we'll need to generate URLs for each of them.
Some sites have predictable page numbers and locations. Unfortunately, on this site each book URL contains both the name and the index (e.g. a-light-in-the-attic_1000), so we cannot simply count upwards.

Luckily, we can scrape these from the home page (and from subsequent pages if we wish, since those pages are sequential!) using the same `requests` and `BeautifulSoup` methods we've seen previously.

In this case, we can re-use our `fetch_page` function and simply collect all the links on the page by using `BeautifulSoup`'s `.find_all()` method to get all of the `<a>` tags.

For each tag, we'll want to access the `href` attribute to get the actual link text. Note that the URLs on this page are relative! That means they use `"../../"` instead of the full URL text.

If we `.split()` a book link on `"/"`, the resulting list has exactly **4 elements**. Book links also start with **`"../../"`** (but NOT `"../../../"`). For example: `"../../set-me-free_988/index.html"`. These two conditions let us keep only the URLs for books.

Then, since our URLs are relative, we'll want to `.replace()` the relative reference with the appropriate prefix: `"http://books.toscrape.com/catalogue/"`. For example, the URL for "set-me-free_988" should be `"http://books.toscrape.com/catalogue/set-me-free_988/index.html"`.
It's also possible that we have duplicates, and so we'll want to remove those where possible to minimize how much we scrape.
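
The filtering, prefixing, and deduplication steps can be sketched without touching the network. In the snippet below, the `href` values are made-up samples, and `is_book_link` and `book_urls` are hypothetical helper names. It swaps in `urllib.parse.urljoin` as an alternative to `.replace()` for resolving relative links, and `dict.fromkeys` to drop duplicates while keeping the original order.

```python
from urllib.parse import urljoin

# Assumed address of the index page the links were collected from
INDEX_URL = "http://books.toscrape.com/catalogue/category/books_1/index.html"


def is_book_link(href: str) -> bool:
    """Apply the two conditions: 4 components, '../../' but not '../../../'."""
    return (
        len(href.split("/")) == 4
        and href.startswith("../../")
        and not href.startswith("../../../")
    )


def book_urls(hrefs):
    """Resolve book links against the index page and drop duplicates."""
    absolute = [urljoin(INDEX_URL, href) for href in hrefs if is_book_link(href)]
    # dict.fromkeys keeps the first occurrence of each URL, in order
    return list(dict.fromkeys(absolute))


# Made-up sample of href values (illustrative only)
sample_hrefs = [
    "../../set-me-free_988/index.html",
    "../../set-me-free_988/index.html",  # duplicate link
    "../../../index.html",               # 4 components, but too many "../"
    "../books/travel_2/index.html",      # 4 components, wrong prefix
    "page-2.html",                       # pagination link
]

print(book_urls(sample_hrefs))
```

Only the single book URL survives the filter. Unlike `list(set(...))`, this approach returns the URLs in a stable order, which can make repeated runs easier to compare.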

In [10]:
def generate_url_list():
    # Create a list to store our urls
    url_list = list()

    # Specify the index page and fetch it
    home = "https://books.toscrape.com/catalogue/category/books_1/index.html"
    home_page = fetch_page(home)

    # Create a soup object for the home page
    soup = BeautifulSoup(home_page, 'html.parser')

    # Find all the links on the page
    links = soup.find_all('a', href=True)

    for element in links:
        href = element['href']
        # Keep only book links: exactly 4 components, starting with "../../" but not "../../../"
        if len(href.split("/")) == 4 and href.startswith("../../") and not href.startswith("../../../"):
            # Replace the relative references with the base URL
            full_url = href.replace("../../", "http://books.toscrape.com/catalogue/")
            url_list.append(full_url)

    # Deduplicate links in the list (note: a set does not preserve order)
    url_list = list(set(url_list))
    return url_list

In [11]:
# Check if the urls are valid
url_list = generate_url_list()
url_list

['http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'http://books.toscrape.com/catalogue/soumission_998/index.html',
 'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'http://books.toscrape.com/catalogue/set-me-free_988/index.html',
 'http://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html',
 'http://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html',
 'http://books.toscrape.com/catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html',
 'http://books.toscrape.com/catalogue/tipping-the-velv

# Bringing It All Together
Once we have our list of (probably!) valid URLs, we'll want to bring it all together.

First, generate your URL list. Make sure each URL is valid, since the links on the index page were relative. Then, iterate over the list to fetch the product description for each book and save the text to a file.

Before writing the final loop, let's run a simple test to make sure the functions you wrote are correct.

In [12]:
# Test if the fetch_page and parse_page functions run correctly.
# Run the cell a few times to test if the description is extracted successfully on a random url from the url_list
import random
url = random.choice(url_list)

page_text = fetch_page(url)
product_description = parse_page(page_text)
print(url + "\n")
print(product_description)

http://books.toscrape.com/catalogue/olio_984/index.html

Part fact, part fiction, Tyehimba Jess's much anticipated second book weaves sonnet, song, and narrative to examine the lives of mostly unrecorded African American performers directly before and after the Civil War up to World War I. Olio is an effort to understand how they met, resisted, complicated, co-opted, and sometimes defeated attempts to minstrelize them.So, while Part fact, part fiction, Tyehimba Jess's much anticipated second book weaves sonnet, song, and narrative to examine the lives of mostly unrecorded African American performers directly before and after the Civil War up to World War I. Olio is an effort to understand how they met, resisted, complicated, co-opted, and sometimes defeated attempts to minstrelize them.So, while I lead this choir, I still find thatI'm being ledâ¦I'm a missionarymending my faith in the midst of this flockâ¦I toil in their fields of praise. When folks seethese freedmen stand and sing, 

In [14]:
# Bring it all together: extract the product description from each URL and save it to disk
for url in url_list:
    page_text = fetch_page(url)
    product_description = parse_page(page_text)
    save_text(product_description, url)
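
The loop above stops at the first page that fails. For example, if a page has no `product_description` element, `soup.find()` returns `None` and `parse_page` raises an `AttributeError`. A more forgiving version might look like the sketch below. The function name `scrape_all` and the `delay_seconds` parameter are assumptions made for illustration, and the fetch, parse, and save steps are passed in as arguments so the example can run here with stand-in functions instead of real network calls.

```python
import time
from typing import Callable, Iterable, List


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], str],
    save: Callable[[str, str], None],
    delay_seconds: float = 0.0,
) -> List[str]:
    """Run fetch -> parse -> save for each URL and return the URLs that failed."""
    failed = []
    for url in urls:
        try:
            save(parse(fetch(url)), url)
        except Exception as exc:  # broad on purpose: skip any page that breaks
            print(f"Skipping {url}: {exc!r}")
            failed.append(url)
        # Pause between requests to avoid hammering the server
        time.sleep(delay_seconds)
    return failed


# Stand-in functions so the sketch runs without network access
saved = {}


def fake_fetch(url: str) -> str:
    return f"<html>{url}</html>"


def fake_parse(html: str) -> str:
    if "bad" in html:
        raise AttributeError("no product description")
    return html.upper()


def fake_save(text: str, url: str) -> None:
    saved[url] = text


failed = scrape_all(["good-1", "bad-2", "good-3"], fake_fetch, fake_parse, fake_save)
print(failed)
print(sorted(saved))
```

In the notebook, the equivalent call would be `scrape_all(url_list, fetch_page, parse_page, save_text, delay_seconds=1.0)`, where the one-second delay is an arbitrary choice.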