# ISYS613 - Data Sourcing and Quality
# Assignment 4
## Web Scraping

## Question 1 - Book Data Scraper

You have been asked to collect price data about books from the Books to Scrape website.
Specifically, you are to scrape and capture the following book-related information: book category,
book title, star rating and price. Once captured, display your results one book per output line.
```
Top-Level URL: http://books.toscrape.com/
```

### Requirements
1. Examine the HTML returned from the Books to Scrape top-level URL.
 Your objective is to identify and extract each book category and its
 category URL from this page.
2. For each book category URL, follow the URL to the book category
 page. You may restrict your data scraping to the first page of books
 returned for the category URL.
3. For each of the books on a category page, capture the book title,
 star rating and price.
4. Convert the ordinal star rating data to a numeric scale. For
 example, the string 'star-rating One' would be converted to the integer number
 1, 'star-rating Two' would be converted to 2, and so on.
5. For each book, output the book category, title, numeric star rating and price.
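
Requirements 4 and 5 can be sketched without any network access. The helper names below (`rating_to_int`, `format_book_line`) are hypothetical and not part of the assignment. The use of the `csv` module is also an assumption: it is one way to keep a line unambiguous when a title itself contains a comma.

```python
import csv
import io

# Hypothetical helper names; the mapping follows requirement 4.
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def rating_to_int(class_attr):
    """Convert a class string such as 'star-rating Three' to the integer 3."""
    for word in class_attr.split():
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    raise ValueError(f'no star rating found in {class_attr!r}')


def format_book_line(category, title, rating, price):
    """Return one CSV-style output line; fields containing commas are quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='').writerow([category, title, rating, price])
    return buffer.getvalue()


print(format_book_line('Mystery', 'In a Dark, Dark Wood',
                       rating_to_int('star-rating One'), '£19.63'))
```

Quoting matters here because a plain comma-joined line cannot be split back into four fields when the title contains a comma.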

### Challenge (Optional)
The challenge objective is to display all books (formatted as above) from all categories, not
 just the first page of books from each category. To see how this works, go to the
 top-level URL and observe how to
 manually navigate from the first page of a category to the next, then the next, and so on, until you have
 followed the Next link to all pages for a category.

Notice that book data from a category URL is returned 20 books at a time. If there are more than
20 books in a category, the Next link appears at the bottom of the HTML page.
When the list of books in a category has been exhausted, i.e., when the last
category page has been reached, the Next link will no longer appear on the page.

Hint: That's how you will know your job is complete.

Copy your previous solution to a new code-cell. Modify your copied solution to follow
the Next page links located at the bottom of a category page. 
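
The Next-link loop described above can be sketched as follows. This is a minimal illustration and not the required solution. `FAKE_NEXT_LINKS`, `iter_page_urls` and the `example.com` URLs are hypothetical stand-ins so the sketch runs offline. A real scraper would fetch each page and read the `href` from its Next link.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for the site: maps a page URL to the href of its
# Next link, or None on the last page of the category.
FAKE_NEXT_LINKS = {
    'http://example.com/category/index.html': 'page-2.html',
    'http://example.com/category/page-2.html': 'page-3.html',
    'http://example.com/category/page-3.html': None,
}


def iter_page_urls(first_url, get_next_href):
    """Yield every page URL of a category, following Next links until none is left."""
    url = first_url
    while url is not None:
        yield url
        href = get_next_href(url)
        # urljoin resolves the relative href against the current page URL.
        url = urljoin(url, href) if href else None


pages = list(iter_page_urls('http://example.com/category/index.html',
                            FAKE_NEXT_LINKS.get))
print(pages)
```

Resolving the link with `urljoin` avoids slicing a fixed number of characters off the URL, which stops working once a page file name such as `page-10.html` is longer than expected.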

In [1]:
# TEST DATA
URL = 'http://books.toscrape.com/'
from requests.exceptions import HTTPError
import requests
from bs4 import BeautifulSoup
import re


# Function to get the content at a URL
def simple_get(url, *args, **kwargs):
    try:
        resp = requests.get(url, *args, **kwargs)
        resp.raise_for_status()
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
        raise http_err
    except Exception as err:
        print(f'Other error occurred: {err}')
        raise err
    return resp


# Function to prettify and save HTML to a file
def prettify_html(soup, o_file):
    with open(o_file, mode='w', encoding='utf-8') as ofh:
        prettyHTML = soup.prettify()
        print(prettyHTML, file=ofh, end=None)


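# Function to create a BeautifulSoup object from a URL
def make_soup(url):
    resp = simple_get(url, timeout=5)
    assert re.search('html', resp.headers['Content-Type'], re.IGNORECASE)
    # Pass the raw bytes so BeautifulSoup detects the page's own encoding.
    # resp.text can mis-decode the page when the server sends no charset,
    # which shows the pound sign as 'Â£'.
    soup = BeautifulSoup(resp.content, 'html.parser')
    return soup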


# Function to convert star rating string to a numeric scale
def convert_star_rating(star_rating):
    rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
    return rating_map[star_rating]


print(
    "Category                 Title                                           Rating                                    Price")


# Resolves relative links such as 'page-2.html' against the current page URL
from urllib.parse import urljoin


# Function to scrape book information from every page of a category
def scrape_category_page(url, category_name):
    # Loop rather than recurse so that each page is fetched exactly once
    while url is not None:
        try:
            soup = make_soup(url)
        except Exception as err:
            # Stop following links if a page cannot be fetched (e.g., 404 Not Found)
            print(f'Stopped at {url}: {err}')
            break

        for book in soup.find_all('h3'):
            title = book.a['title']
            rating_str = book.find_previous('p', class_='star-rating')['class'][1]
            rating = convert_star_rating(rating_str)

            # Check if the price element exists
            price_elem = book.find_next('p', class_='price_color')
            price = price_elem.text.strip() if price_elem else 'Price not available'

            # Output the book information
            print(f"{category_name}, {title}, {rating}, {price}")

        # The Next link is absent on the last page, which ends the loop
        next_page = soup.find('li', class_='next')
        url = urljoin(url, next_page.a['href']) if next_page is not None else None


# Top-level URL
URL = 'http://books.toscrape.com/'

# Get the content of the top-level URL
soup = make_soup(URL)

# Find all book categories and their URLs
categories = soup.find('div', class_='side_categories').ul.find_all('a')

# Iterate through each category.
# categories[0] is the "Books" link to the full catalogue, so it is skipped.
for category in categories[1:]:
    category_name = category.text.strip()
    category_url = urljoin(URL, category['href'])

    # Scrape every page of the category
    scrape_category_page(category_url, category_name)


Category                 Title                                           Rating                                    Price
Travel, It's Only the Himalayas, 2, £45.17
Travel, Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond, 4, £49.43
Travel, See America: A Celebration of Our National Parks & Treasured Sites, 3, £48.87
Travel, Vagabonding: An Uncommon Guide to the Art of Long-Term World Travel, 2, £36.94
Travel, Under the Tuscan Sun, 3, £37.33
Travel, A Summer In Europe, 2, £44.34
Travel, The Great Railway Bazaar, 1, £30.54
Travel, A Year in Provence (Provence #1), 4, £56.88
Travel, The Road to Little Dribbling: Adventures of an American in Britain (Notes From a Small Island #2), 1, £23.21
Travel, Neither Here nor There: Travels in Europe, 3, £38.95
Travel, 1,000 Places to See Before You Die, 5, £26.08
Mystery, Sharp Objects, 4, £47.82
Mystery, In a Dark, Dark Wood, 1, £19.63
Mystery, The Past Never Ends, 4, £56.50
Mystery, A Murder in Time, 1, £16.64
Myst

In [2]:
# TEST DATA
URL = 'http://books.toscrape.com/'
from requests.exceptions import HTTPError
import requests
from bs4 import BeautifulSoup
import re


# Function to get the content at a URL
def simple_get(url, *args, **kwargs):
    try:
        resp = requests.get(url, *args, **kwargs)
        resp.raise_for_status()
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
        raise http_err
    except Exception as err:
        print(f'Other error occurred: {err}')
        raise err
    return resp


# Function to prettify and save HTML to a file
def prettify_html(soup, o_file):
    with open(o_file, mode='w', encoding='utf-8') as ofh:
        prettyHTML = soup.prettify()
        print(prettyHTML, file=ofh, end=None)


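# Function to create a BeautifulSoup object from a URL
def make_soup(url):
    resp = simple_get(url, timeout=5)
    assert re.search('html', resp.headers['Content-Type'], re.IGNORECASE)
    # Pass the raw bytes so BeautifulSoup detects the page's own encoding.
    # resp.text can mis-decode the page when the server sends no charset,
    # which shows the pound sign as 'Â£'.
    soup = BeautifulSoup(resp.content, 'html.parser')
    return soup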


# Function to convert star rating string to a numeric scale
def convert_star_rating(star_rating):
    rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
    return rating_map[star_rating]


print(
    "Category                 Title                                           Rating                                    Price")


# Resolves relative links such as 'page-2.html' against the current page URL
from urllib.parse import urljoin


# Function to scrape book information from every page of a category
def scrape_category_page(url, category_name):
    # Loop rather than recurse so that each page is fetched exactly once
    while url is not None:
        try:
            soup = make_soup(url)
        except Exception as err:
            # Stop following links if a page cannot be fetched (e.g., 404 Not Found)
            print(f'Stopped at {url}: {err}')
            break

        for book in soup.find_all('h3'):
            title = book.a['title']
            rating_str = book.find_previous('p', class_='star-rating')['class'][1]
            rating = convert_star_rating(rating_str)

            # Check if the price element exists
            price_elem = book.find_next('p', class_='price_color')
            price = price_elem.text.strip() if price_elem else 'Price not available'

            # Output the book information
            print(f"{category_name}, {title}, {rating}, {price}")

        # The Next link is absent on the last page, which ends the loop
        next_page = soup.find('li', class_='next')
        url = urljoin(url, next_page.a['href']) if next_page is not None else None


# Top-level URL
URL = 'http://books.toscrape.com/'

# Get the content of the top-level URL
soup = make_soup(URL)

# Find all book categories and their URLs
categories = soup.find('div', class_='side_categories').ul.find_all('a')

# Iterate through each category.
# categories[0] is the "Books" link to the full catalogue. It is included here,
# so every book is listed under "Books" before the individual categories.
for category in categories:
    category_name = category.text.strip()
    category_url = urljoin(URL, category['href'])

    # Scrape every page of the category
    scrape_category_page(category_url, category_name)


Category                 Title                                           Rating                                    Price
Books, A Light in the Attic, 3, £51.77
Books, Tipping the Velvet, 1, £53.74
Books, Soumission, 1, £50.10
Books, Sharp Objects, 4, £47.82
Books, Sapiens: A Brief History of Humankind, 5, £54.23
Books, The Requiem Red, 1, £22.65
Books, The Dirty Little Secrets of Getting Your Dream Job, 4, £33.34
Books, The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull, 3, £17.93
Books, The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics, 4, £22.60
Books, The Black Maria, 1, £52.15
Books, Starving Hearts (Triangular Trade Trilogy, #1), 2, £13.99
Books, Shakespeare's Sonnets, 4, £20.66
Books, Set Me Free, 5, £17.46
Books, Scott Pilgrim's Precious Little Life (Scott Pilgrim #1), 5, £52.29
Books, Rip it Up and Start Again, 5, £35.02
Books, Our Band Could Be Your Life: Scenes from the American I