# Scraping Detail Pages

### Other Concepts Learnt (in addition to learning how to scrape detail pages)

1. How to pause program execution so that your script doesn't burden the server of the data source.
2. Writing useful print statements that show the page and data item currently being scraped. Such messages let you follow the status of your script during execution and, when an error stops the program, they identify the problematic page.
3. How to handle errors when a piece of data is not available on a web page.
4. Using tuples for storing data.
5. Using sessions to efficiently manage multiple requests to the same server.
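
The first point above, pausing between requests, could be sketched as follows. This is only an illustration under stated assumptions: the helper name `polite_pause` and its `jitter` and `sleep` parameters are introduced here and do not appear in the notebook, which simply calls `time.sleep(5)`.

```python
import random
import time


def polite_pause(base_seconds=5.0, jitter=1.0, sleep=time.sleep):
    """Sleep for base_seconds plus a random extra of up to `jitter` seconds.

    `sleep` is injectable so the helper can be tried out without really waiting.
    Returns the delay that was used.
    """
    delay = base_seconds + random.uniform(0, jitter)
    sleep(delay)
    return delay


# Demonstration with a stand-in sleep function that only records the delay.
recorded = []
used = polite_pause(base_seconds=5.0, jitter=1.0, sleep=recorded.append)
print(f"Would have paused for {used:.2f} seconds")
```

Adding a small random component is one possible design choice; a fixed delay, as used later in this notebook, is equally valid for a small sandbox site.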

In [1]:
# Import Libraries
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import numpy as np

In [2]:
# Create a Session object using the requests library in Python
'''
A Session object lets you persist certain parameters across multiple requests. For example, it keeps the connection open between requests 
to the same server, making it more efficient and faster because it doesn’t have to reestablish the TCP connection each time.
'''
session = requests.Session() # like keeping one browser open instead of launching a brand-new browser window for every click
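
Besides reusing the connection, a session also persists settings such as headers across requests. The sketch below illustrates this without touching the network. The name `demo_session` and the User-Agent string are placeholders chosen for this example, not values used by the notebook.

```python
import requests

# Used as a context manager, the session is closed automatically at the end.
with requests.Session() as demo_session:
    # Headers set on the session are sent with every request made through it.
    # The User-Agent value below is a made-up placeholder.
    demo_session.headers.update({"User-Agent": "detail-page-tutorial/0.1"})
    user_agent = demo_session.headers["User-Agent"]
    print(user_agent)
    # A call such as demo_session.get(url, timeout=10) would carry this header
    # and could reuse an already open connection to the same server.
```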

In [3]:
# Make a request to the book listing page
main_page = session.get("http://books.toscrape.com/") # go to website, download page HTML, and save as main_page

In [4]:
# Print the response code
print(main_page)

# response 200 means the webpage was downloaded correctly 
# 404 = page not found 
# 403 = forbidden 
# 500 = server error 

<Response [200]>
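
The status codes listed in the comments above have standard reason phrases, which Python's built-in `http` module can look up. The helper `describe_status` below is a hypothetical name introduced for this sketch. In the notebook, the numeric code itself is available as `main_page.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the standard reason phrase for an HTTP status code."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # Raised when the number is not a registered HTTP status code.
        return "Unknown status"


for code in (200, 403, 404, 500):
    print(code, describe_status(code))
```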


In [5]:
# Print the response content
print(main_page.content) # prints the raw bytes of the page's HTML as returned by the server



In [8]:
# Create a BeautifulSoup object for the book listing page 
main_soup = BeautifulSoup(main_page.content, 'html.parser') # parses the raw HTML into a BeautifulSoup object

In [9]:
# Check the type of "main_soup"
print(type(main_soup)) # type is bs4.BeautifulSoup

<class 'bs4.BeautifulSoup'>


In [10]:
# See the HTML code in a pretty way
print(main_soup.prettify()) # indents the HTML so that tags and strings each appear on their own line

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [11]:
# Extracting all the <h3> tags from the book listing page. 
# The <h3> tags contain the links to each book.
# The variable "h3tags", below, is a list of all <h3> tags
h3tags = main_soup.find_all("h3")
print(h3tags)

[<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>, <h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>, <h3><a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a></h3>, <h3><a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>, <h3><a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>, <h3><a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a></h3>, <h3><a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a></h3>, <h3><a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
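
The `href` values in the output above are relative links, so they have to be combined with a base URL before they can be requested. One possible way to do this, sketched below, is `urllib.parse.urljoin` from the standard library, which resolves a relative link against the URL of the page it was found on. The second base URL (`page-2.html`) and its shortened `href` are assumptions made up for illustration.

```python
from urllib.parse import urljoin

# The listing page URL used in this notebook.
home_url = "http://books.toscrape.com/"
# An href copied from the <h3> output above.
href_from_home = "catalogue/a-light-in-the-attic_1000/index.html"

# Hypothetical case: a listing page inside /catalogue/ whose hrefs omit that prefix.
inner_url = "http://books.toscrape.com/catalogue/page-2.html"
href_from_inner = "a-light-in-the-attic_1000/index.html"

link_a = urljoin(home_url, href_from_home)
link_b = urljoin(inner_url, href_from_inner)
print(link_a)
print(link_b)
```

Both calls resolve to the same absolute URL, so no check for the word `catalogue` is needed with this approach.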

In [12]:
# There should be 20 such <h3> tags because there are 20 books in the product listing page
print(len(h3tags))

20


In [13]:
# Create a list to store the rows of data
data = []
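
Each row appended to this list will be a tuple of four values. As an optional variation, a named tuple gives every position a label while still behaving like a plain tuple. The class name `BookRow` is an assumption introduced for this sketch, and the description text is a placeholder.

```python
from typing import NamedTuple


class BookRow(NamedTuple):
    heading: str
    saleprice: str
    rating: str
    description: str


rows = []
rows.append(BookRow("A Light in the Attic", "£51.77", "Three", "(description text)"))

first = rows[0]
print(first.heading, first.saleprice)  # access by field name
print(first[0] == first.heading)       # positional access still works
```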

In [14]:
# For each book link on the listing page, open the book's own page and extract the title, price, rating, and description

# Go to each <h3> tag (i.e., 20 of them), create a book page link, and extract data
for tag in h3tags: # h3tags = list of <h3> tags; each <h3> represents one book, so there are 20 books
    
    # Extract the variable part of the link
    variable_part = tag.find("a").attrs['href'] # finds the first <a> tag inside the <h3> and reads the value of its 'href' attribute
    # this gives the partial (relative) link
    
    # Create a string to store the full link of the book in the current iteration
    if 'catalogue' in variable_part: # some hrefs already include 'catalogue' and others do not; either way we build the complete URL
        product_link = 'http://books.toscrape.com/' + variable_part
    else:
        product_link = 'http://books.toscrape.com/catalogue/' + variable_part
    print(product_link)
    
    # Print a message to display the link of the book in the current iteration
    print("Now scraping :", product_link) # prints a status message with the full URL of the book
# ----------------------------------------------------------------------------------------------
    try: 
        # Make a request to the link of the book in the current iteration
        product_page = session.get(product_link)
        # session.get() requests the book's web page and downloads its HTML
        
        # Print the response code
        print(product_page) 
        # <Response [200]> means the page loaded successfully
# -----------------------------------------------------------------------------------------------
        # Creating a BeautifulSoup object for the link of the book in the current iteration
        product_soup = BeautifulSoup(product_page.content, "html.parser") # parses the page's HTML into a BeautifulSoup object
        print(type(product_soup)) # type is bs4.BeautifulSoup
# -----------------------------------------------------------------------------------------------
        # Extract cell values
        # Try and except are used for exception handling (in case the find method cannot find the data).
        # When find() cannot find the data it returns None, and calling .get_text() on None raises an AttributeError
        try:
            heading = product_soup.find('h1').get_text().strip() 
            # finds the <h1> tag, gets all of its text, and strips leading/trailing whitespace
        except AttributeError:
            heading = "NA" # if there is no heading, store "NA" instead of stopping the script
        print(heading) # prints the heading text
# ------------------------------------------------------------------------------------------------
        try:
            saleprice = product_soup.find("p", class_="price_color").get_text().strip()
            # finds the first <p> tag whose class attribute is "price_color", extracts its text, and strips whitespace
            # Example: <p class="price_color">£51.77</p>
        except AttributeError:
            saleprice = "NA" # price not found on the page
        print(saleprice) # prints the price text
# -------------------------------------------------------------------------------------------------
        try:
            rating = product_soup.find('p', class_="star-rating").attrs['class'][1]
            # finds the first <p> tag whose class includes "star-rating"
            # the class attribute can hold several values, so BeautifulSoup stores it as a Python list, e.g. ['star-rating', 'Three']
            # index [1] picks the second value in that list, which is the rating word (here 'Three')
        except (AttributeError, KeyError, IndexError):
            rating = "NA" # rating not found on the page
        print(rating)
# --------------------------------------------------------------------------------------------------
        try:
            description = product_soup.find("article", class_="product_page").find("p", recursive=False).get_text().strip()
            # finds the <article class="product_page"> element,
            # then the first <p> tag that is a direct child of it (recursive=False), and returns that tag's stripped text
        except AttributeError:
            description = "NA" # description not found on the page
        print(description)

        # Collect each cell value in a tuple
        # Each tuple represents a row of data
        data.append((heading, saleprice, rating, description)) # appends one tuple (one row of the table) to the list
# ---------------------------------------------------------------------------------------------------
    # Handles exceptions that might occur when making a request using the requests library.
    # --> This except block catches any exception that is a subclass of requests.exceptions.RequestException.
    # --> RequestException is the base exception class for all exceptions raised by the requests library, so this will catch a variety of
    # issues, such as connection errors and timeouts. (HTTP error statuses such as 404 raise an exception only if raise_for_status() is called.)
    # --> "as e" assigns the caught exception object to the variable e, allowing you to inspect it or use its message.
    except requests.exceptions.RequestException as e:
        print("Error fetching " + product_link + ": " + str(e))
# ----------------------------------------------------------------------------------------------------
    # Pause the program for 5 seconds before the next request (i.e., visiting the next book page)
    # This way we do not place undue burden on the website
    time.sleep(5) # time.sleep() pauses the program for 5 seconds before the next iteration

http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Now scraping : http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
<Response [200]>
<class 'bs4.BeautifulSoup'>
A Light in the Attic
£51.77
Three
It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silver
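
The loop above relies on `try`/`except` to cope with missing data. An alternative sketch, shown below on a made-up HTML fragment that has no price, is to test explicitly whether `find()` returned `None`. The helper name `text_or_na` and the sample HTML are assumptions introduced for this illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment: it has a title and a rating but no price.
sample_html = """
<article class="product_page">
  <h1>Example Book</h1>
  <p class="star-rating Three"></p>
</article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")


def text_or_na(tag):
    """Return the stripped text of a tag, or "NA" when find() returned None."""
    return tag.get_text().strip() if tag is not None else "NA"


sample_heading = text_or_na(sample_soup.find("h1"))
sample_price = text_or_na(sample_soup.find("p", class_="price_color"))

rating_tag = sample_soup.find("p", class_="star-rating")
# The class attribute is a list such as ['star-rating', 'Three'].
if rating_tag is not None and len(rating_tag["class"]) > 1:
    sample_rating = rating_tag["class"][1]
else:
    sample_rating = "NA"

print(sample_heading, sample_price, sample_rating)
```

Either style works; the explicit check makes it visible which step can come back empty, while `try`/`except` keeps each extraction on a single line.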

In [17]:
# Print the list named "data"
print(data)

[('A Light in the Attic', '£51.77', 'Three', "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down he
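
Every value in these tuples is a string, including the price and the rating word. If numeric values are needed later, a conversion along the following lines is one option. This is a hedged sketch: the names `RATING_WORDS`, `rating_to_int`, and `price_to_float` are hypothetical and not part of the notebook.

```python
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word):
    """Map a rating word to an integer; return None for "NA" or unknown values."""
    return RATING_WORDS.get(word)


def price_to_float(text):
    """Strip the leading pound sign and convert to float; None if not numeric."""
    try:
        return float(text.lstrip("£"))
    except ValueError:
        return None


# Values taken from the first row printed above.
row = ("A Light in the Attic", "£51.77", "Three")
print(rating_to_int(row[2]), price_to_float(row[1]))
```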

In [20]:
# Create a tabular representation of the data (i.e., dataframe)
df = pd.DataFrame(np.array(data))

# Print the tabular representation - no column names have been set yet, so pandas labels the columns 0 to 3
print(df)

                                                    0       1      2  \
0                                A Light in the Attic  £51.77  Three   
1                                  Tipping the Velvet  £53.74    One   
2                                          Soumission  £50.10    One   
3                                       Sharp Objects  £47.82   Four   
4               Sapiens: A Brief History of Humankind  £54.23   Five   
5                                     The Requiem Red  £22.65    One   
6   The Dirty Little Secrets of Getting Your Dream...  £33.34   Four   
7   The Coming Woman: A Novel Based on the Life of...  £17.93  Three   
8   The Boys in the Boat: Nine Americans and Their...  £22.60   Four   
9                                     The Black Maria  £52.15    One   
10     Starving Hearts (Triangular Trade Trilogy, #1)  £13.99    Two   
11                              Shakespeare's Sonnets  £20.66   Four   
12                                        Set Me Free  £17.46   

In [21]:
# Add the column names
df.columns = ['heading','saleprice','rating','description']

# Print the tabular representation - you should see column names
print(df)

                                              heading saleprice rating  \
0                                A Light in the Attic    £51.77  Three   
1                                  Tipping the Velvet    £53.74    One   
2                                          Soumission    £50.10    One   
3                                       Sharp Objects    £47.82   Four   
4               Sapiens: A Brief History of Humankind    £54.23   Five   
5                                     The Requiem Red    £22.65    One   
6   The Dirty Little Secrets of Getting Your Dream...    £33.34   Four   
7   The Coming Woman: A Novel Based on the Life of...    £17.93  Three   
8   The Boys in the Boat: Nine Americans and Their...    £22.60   Four   
9                                     The Black Maria    £52.15    One   
10     Starving Hearts (Triangular Trade Trilogy, #1)    £13.99    Two   
11                              Shakespeare's Sonnets    £20.66   Four   
12                                    

In [22]:
# Save the data as a CSV file
df.to_csv("book_details.csv")
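
By default `to_csv` also writes the DataFrame's row index as an extra first column. The sketch below shows one way to leave it out with `index=False`, and to supply the column names when the DataFrame is created. It writes to an in-memory buffer so that no file is touched. The two sample rows reuse titles, prices, and ratings from the output above, with placeholder descriptions, and the names `sample_rows` and `sample_df` are introduced only for this example.

```python
import io

import pandas as pd

# Two rows based on the output above, with placeholder descriptions.
sample_rows = [
    ("A Light in the Attic", "£51.77", "Three", "(description)"),
    ("Tipping the Velvet", "£53.74", "One", "(description)"),
]

# Column names can be supplied when the DataFrame is created.
sample_df = pd.DataFrame(
    sample_rows, columns=["heading", "saleprice", "rating", "description"]
)

# Write to an in-memory buffer instead of a file; index=False omits the row numbers.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])
```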