# Building a Simple Web Scraper
In this exercise, we will build a simple web scraper using:
* the `requests` library for interacting with websites over HTTP
* the `bs4` (aka BeautifulSoup4) library for interacting with HTML content
* the `pathlib` library for nicely structuring our directory

While many datasets are freely and openly available, specialized information often is not, so we sometimes need to collect it ourselves.
For this exercise, we will use [ToScrape books](http://books.toscrape.com/), a site that explicitly permits scraping.

To collect our dataset, we will need to **generate a list of URLs to scrape** and then, for each URL:
1. Get the page
2. Extract the text components of the page
3. Write the text to disk

## Imports
First, let's import the necessary libraries and objects.

In [1]:
import requests
from bs4 import BeautifulSoup
from pathlib import Path

## Fetching a Page

In order to write our loop, we will need to define exactly what elements we want to extract from a target page.
However, we haven't even seen a single page yet!
Let's define a function to do exactly that -- given a `url` parameter, fetch the page and return the body.

If you're not familiar with the `requests` library, you can check the [quickstart](https://requests.readthedocs.io/en/latest/user/quickstart/) documentation page.

Although our sample site is designed to permit scraping, note that many websites will block requests from `requests`, so you may have to configure a `user-agent` string in your header.
To verify that you have not been blocked, you can check the `.status_code` attribute of the `Response` object.
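
As a rough illustration of that status check, the sketch below turns a status code into a short message. The helper name `describe_status` and the message wording are hypothetical and not part of `requests`; treating 403 and 429 as signs of blocking is an assumption that holds for many sites, not all of them. It only uses the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus


def describe_status(status_code: int) -> str:
    """Return a short, human-readable summary of an HTTP status code.

    Hypothetical helper for illustration; not part of the requests library.
    """
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Non-standard codes are not members of the HTTPStatus enum
        phrase = "Unknown"

    if status_code == 200:
        return f"{status_code} {phrase}: page fetched"
    if status_code in (403, 429):
        # Assumption: these codes often indicate blocking or rate limiting
        return f"{status_code} {phrase}: possibly blocked"
    return f"{status_code} {phrase}: unexpected response"


print(describe_status(200))
print(describe_status(403))
```

With a real `Response` object `r`, this would be called as `describe_status(r.status_code)`.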

In [2]:
def fetch_page(url: str):
    # Send a browser-like user-agent, since some sites block the default one
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
    }
    # Todo: fetch the page using a GET request, passing the headers defined above
    r = requests.get(url, headers=headers)

    # Todo: check the status code. Return the request body if code == 200,
    # else print the status code and still return the body
    if r.status_code != 200:
        print(f"Request to {url} returned status code {r.status_code}")
    return r.text


In [3]:
# Test if the function fetches the page correctly
test_url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
test_result = fetch_page(test_url)
print(test_result)



<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    A Light in the Attic | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="
    It&#39;s hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein&#39;s humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and lov

Nice! 
Now we see that there is a lot of information on this page, and much of it is not very useful for us -- especially for training a book description-writing language model!
So let's go ahead and write a function to extract the relevant parts of the page using `BeautifulSoup`.

## Parsing Web Pages
Since we can fetch arbitrary webpages and we have a test result already stored -- let's write a function to extract only the text we want!
Since our language model will be writing product descriptions, we want to extract the product description from each page. 

To do this, we'll need to extract the text inside the `<p>` tag following the `"product_description"` `<div>` tag.

To navigate the tree, we'll need to use the `.find()` or `.find_all()` method of `BeautifulSoup`.

Some tags have ids, so if we know the particular tag id, we can use the `id` keyword within `.find()`, like this: `soup.find('p', id='name')`
In some cases, we want the *next* tag of a given type once we find the relevant part of the text, so we can use `.find_next()`.

There are a few valid ways to do this, but if we inspect a few pages, we find that our product description always seems to be the `<p>` tag immediately after the `<div>` element with `id="product_description"`. We then need the `.text` attribute of that `<p>` tag.

If you need more details on how to use `BeautifulSoup`, you can find them in the library's [documentation](https://beautiful-soup-4.readthedocs.io/en/latest/).
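
To make the navigation concrete, here is a small sketch that runs `.find()`, `.find_next()`, and `.find_all()` on a tiny HTML string. The HTML and its text are made up for illustration; they only mimic the structure described above and are not copied from the real site.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the layout described above
sample_html = """
<article>
  <div id="product_description"><h2>Product Description</h2></div>
  <p>A short, made-up description.</p>
  <p>Another paragraph.</p>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# .find() returns the first matching tag, or None if nothing matches
heading_div = soup.find("div", id="product_description")
missing_div = soup.find("div", id="does_not_exist")

# .find_next() returns the first matching tag that appears after heading_div
first_p = heading_div.find_next("p")

# .find_all() returns every matching tag in the document
all_paragraphs = soup.find_all("p")

print(first_p.text)
print(len(all_paragraphs))
print(missing_div)
```

Because `.find()` returns `None` when nothing matches, calling `.find_next()` on a missing `<div>` would raise an `AttributeError`, which is worth keeping in mind for pages that lack a description.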

In [4]:
def parse_page(html_doc: str):
    # Todo: parse the html doc returned from fetch_page using BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

    # Todo: find the <div> tag with the product_description id
    product_div = soup.find("div", id="product_description")

    # Todo: find the first <p> element that follows product_div
    selected_element = product_div.find_next("p")

    # Todo: return the text content of the tag using .text
    description = selected_element.text

    return description

In [5]:
# Check the product description
test_text = parse_page(test_result)
print(test_text)

It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded

Great! Now we can see the product description in a string format.

## Saving Files

Once we've scraped and parsed the page, we want to save the raw text to a file.
We also want to control where each file goes: a `train` directory or a `test` directory.

Since we have the URL of each page, we can name each file after the unique identifying string in that URL.
For example, one of the URLs is `"http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"`, so the file name we want is `"a-light-in-the-attic_1000.txt"`: the second-to-last component of the URL, `"a-light-in-the-attic_1000"`, plus a `".txt"` extension.
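
The naming rule can be checked in isolation before writing anything to disk. In this minimal sketch, the helper name `file_name_from_url` is hypothetical; it only splits the URL and builds a path, without creating files or directories.

```python
from pathlib import Path


def file_name_from_url(url: str) -> str:
    """Build a .txt file name from the second-to-last URL component."""
    parts = url.split("/")
    return parts[-2] + ".txt"


url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

# Show every component produced by the split
print(url.split("/"))
print(file_name_from_url(url))

# Joining with a directory gives the full output path (nothing is written here)
target = Path("./data/train") / file_name_from_url(url)
print(target)
```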

In [6]:
def save_text(text, url, train=True):
    # Save the data to "./data/train/" if it's in the training set,
    # otherwise save it to "./data/test/"
    if train:
        file_path = Path("./data/train/")
    else:
        file_path = Path("./data/test/")

    # Create the directory (and any missing parents) if it doesn't exist yet
    file_path.mkdir(parents=True, exist_ok=True)

    # Todo: split the URL by "/"
    split_url = url.split('/')

    # Todo: pull the name from the URL, and add a .txt extension to the end of the file
    file_name = split_url[-2] + '.txt'

    # Write the file to disk, using UTF-8 so non-ASCII characters are saved consistently
    with open(file_path.joinpath(file_name), "w", encoding="utf-8") as f:
        f.write(text)

In [7]:
# Test the save_text function
# The file name should be a-light-in-the-attic_1000.txt
# You should also find an exercise1/data/train folder with a file "a-light-in-the-attic_1000.txt"
save_text(test_text, test_url, train=True)

## Generating URLs

We have the test URL for "a-light-in-the-attic_1000", but since we want to collect all the books and pages, we'll need to generate URLs for each of them.

Some sites have predictable page numbers and locations, but unfortunately, each book URL here requires both the name and index (*e.g.* a-light-in-the-attic_1000), which we can't guess.

Luckily, we can scrape these from the home page (and from subsequent pages if we wish, since those pages are sequential!) using the same `requests` and `BeautifulSoup` methods we've seen previously.

In this case, we can re-use our `fetch_page` function and simply collect all the links on the page by using `BeautifulSoup`'s `.find_all()` method to get all of the `<a>` tags.

For each tag, we'll want to access the `'href'` attribute to get the actual link target. Note that the URLs on this page are **relative**! That means they use `"../../"` instead of the full URL text.

If we `.split()` a book's URL on `"/"`, the resulting list has exactly **4 elements**. Book URLs also start with **`"../../"`** (but NOT `"../../../"`). For example: `"../../set-me-free_988/index.html"`. Together, these two conditions let us keep only the URLs for books.

Then, since our URLs are relative, we'll want to `.replace()` the relative reference with the appropriate prefix: `"http://books.toscrape.com/catalogue/"`. For example, the URL for "set-me-free_988" should be `"http://books.toscrape.com/catalogue/set-me-free_988/index.html"`.
It's also possible that we have duplicates, and so we'll want to remove those where possible to minimize how much we scrape.
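
The filtering, prefixing, and deduplication steps can be tried on a handful of sample links without fetching anything. In this sketch, the helper `is_book_link` and the sample `href` values are illustrative assumptions. It also uses the standard library's `urljoin` as an alternative to `.replace()`, and `dict.fromkeys()` as an order-preserving way to drop duplicates.

```python
from urllib.parse import urljoin

# Assumed to be the index page the relative links were found on
INDEX_URL = "http://books.toscrape.com/catalogue/category/books_1/index.html"


def is_book_link(href: str) -> bool:
    """Apply the two conditions: 4 elements, and "../../" but not "../../../"."""
    parts = href.split("/")
    return (
        len(parts) == 4
        and href.startswith("../../")
        and not href.startswith("../../../")
    )


# Illustrative hrefs, not scraped from the live page
sample_hrefs = [
    "../../set-me-free_988/index.html",  # book link
    "../../set-me-free_988/index.html",  # duplicate of the book link
    "../../../index.html",               # one level too high, so not a book
    "../books/travel_2/index.html",      # does not start with "../../"
]

# Resolve each relative book link against the page it came from
book_urls = [urljoin(INDEX_URL, href) for href in sample_hrefs if is_book_link(href)]

# Drop duplicates while keeping first-seen order
unique_urls = list(dict.fromkeys(book_urls))
print(unique_urls)
```

Resolving against the index page gives the same result as replacing `"../../"` with the catalogue prefix, but it does not depend on knowing how many `".."` segments a link contains.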

In [8]:
def generate_url_list():
    # Create a list to store our urls
    url_list = list()
    
    # Specify the index page and fetch it
    home = "https://books.toscrape.com/catalogue/category/books_1/index.html"
    home_page = fetch_page(home)
    
    # Todo: create a soup object for the home page using BeautifulSoup
    soup = BeautifulSoup(home_page, 'html.parser')
    
    # Todo: find all the links on the page: every <a> tag that has an 'href' attribute
    links = soup.find_all("a", href=True)

    for element in links:
        # Todo: in the if statement, find the condition where element['href'] has 4 elements,
        # starts with "../../", but not "../../../"
        parts = element['href'].split("/")
        if len(parts) == 4 and parts[2] != ".." and element['href'].startswith("../../"):
            # Extract the url with the relative (..) references
            relative_url = element['href']
            
            # Todo: replace the relative references "../../" 
            # with the base URL "http://books.toscrape.com/catalogue/"
            full_url = relative_url.replace("../../","http://books.toscrape.com/catalogue/")
            
            # Append the URL to the url_list
            url_list.append(full_url)
    # Deduplicate links in the list
    url_list = list(set(url_list))
    return url_list

In [9]:
# Check if the urls are valid
url_list = generate_url_list()
url_list

['http://books.toscrape.com/catalogue/soumission_998/index.html',
 'http://books.toscrape.com/catalogue/olio_984/index.html',
 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html',
 'http://books.toscrape.com/catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html',
 'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'http://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html',
 'http://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
 'http://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html',
 'http://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html',
 'http://books.toscrape.com/catalogue/set-me-free_988/index.html',
 'http://books.toscrape.com/catalogue/our-band-

## Bringing It All Together
Once we have our list of (probably!) valid URLs, we'll want to bring it all together.

First, generate your URL list. Because the links on the page were relative, make sure each generated URL is valid. Then, iterate over the list to fetch the product description for each book and save it to a text file.
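
One lightweight way to sanity-check a generated URL is to parse it and look at its parts. The sketch below is only a plausibility check under assumed rules (the host name, the `/catalogue/` prefix, and the `/index.html` suffix seen in the examples in this exercise); the helper name `looks_like_book_url` is hypothetical, and passing the check does not guarantee that the page exists.

```python
from urllib.parse import urlparse


def looks_like_book_url(url: str) -> bool:
    """Check that a URL is absolute and shaped like the book URLs in this exercise."""
    parsed = urlparse(url)
    return (
        parsed.scheme in ("http", "https")
        and parsed.netloc == "books.toscrape.com"
        and parsed.path.startswith("/catalogue/")
        and parsed.path.endswith("/index.html")
        # A leftover ".." segment means a relative reference was not resolved
        and ".." not in parsed.path.split("/")
    )


print(looks_like_book_url("http://books.toscrape.com/catalogue/set-me-free_988/index.html"))
print(looks_like_book_url("../../set-me-free_988/index.html"))
```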

Before writing the code, let's do some simple tests to make sure the functions you wrote are correct.

In [10]:
# Test if the fetch_page and parse_page functions run correctly.
# Run the cell a few times to test if the description is extracted successfully on a random url from the url_list
import random
url = random.choice(url_list)

page_text = fetch_page(url)
product_description = parse_page(page_text)
print(url + "\n")
print(product_description)

http://books.toscrape.com/catalogue/the-requiem-red_995/index.html

Patient Twenty-nine.A monster roams the halls of Soothing Hills Asylum. Three girls dead. 29 is endowed with the curse…or gift of perception. She hears messages in music, sees lyrics in paintings. And the corn. A lifetime asylum resident, the orchestral corn music is the only constant in her life.Mason, a new, kind orderly, sees 29 as a woman, not a lunatic. And as his bel Patient Twenty-nine.A monster roams the halls of Soothing Hills Asylum. Three girls dead. 29 is endowed with the curse…or gift of perception. She hears messages in music, sees lyrics in paintings. And the corn. A lifetime asylum resident, the orchestral corn music is the only constant in her life.Mason, a new, kind orderly, sees 29 as a woman, not a lunatic. And as his belief in her grows, so does her self- confidence. That perhaps she might escape, might see the outside world. But the monster has other plans. The missing girls share one common t

In [11]:
# Bring it all together to fetch product description texts from multiple urls and save them to disk
for url in url_list:
    page_text = fetch_page(url)
    product_description = parse_page(page_text)
    save_text(product_description, url)

Now if you go back to the `exercise1/data/train` directory, you will see the descriptions stored in many text files, each with the corresponding file name.
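
A single failed page would stop a plain loop, for example when a request fails or a page has no description. The sketch below is one possible variant that skips failures and optionally pauses between requests. The function name `scrape_all` and its parameters are hypothetical; the fetch, parse, and save steps are passed in as arguments so the loop can be tried without network access.

```python
import time
from typing import Callable, Iterable, List


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], str],
    save: Callable[[str, str], None],
    delay_seconds: float = 0.0,
) -> List[str]:
    """Run fetch -> parse -> save for each URL and return the URLs that failed."""
    failed = []
    for url in urls:
        try:
            save(parse(fetch(url)), url)
        except Exception as error:  # broad on purpose, so one bad page does not stop the run
            print(f"Skipping {url}: {error}")
            failed.append(url)
        if delay_seconds:
            # Optional pause between requests, to be gentle on the server
            time.sleep(delay_seconds)
    return failed


# Stand-in functions so the loop can be run offline
saved = {}


def fake_fetch(url: str) -> str:
    return "<html>" + url


def fake_parse(html: str) -> str:
    if "bad" in html:
        raise AttributeError("no description found")
    return html.upper()


def fake_save(text: str, url: str) -> None:
    saved[url] = text


failed_urls = scrape_all(["a", "bad", "c"], fake_fetch, fake_parse, fake_save)
print(failed_urls)
```

With the functions from this exercise, the call would look like `scrape_all(url_list, fetch_page, parse_page, save_text, delay_seconds=0.5)`, where the delay value is an arbitrary choice.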