# Web Scraping
In this notebook, we investigate how to scrape the amazon.ca website for information about the top 100 new book releases.

#### Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import urllib
from bs4 import BeautifulSoup
import datetime
import ast
import csv
import requests
from IPython.display import HTML

%matplotlib inline

## Amazon Books Web Scraping

Here we will scrape the following website: https://www.amazon.ca/gp/new-releases/books
to find the top 100 books plus relevant information.

In [4]:
HTML('https://www.amazon.ca/gp/new-releases/books')



#### Open URL and read the contents

In [5]:
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

r = requests.get('https://www.amazon.ca/gp/new-releases/books', headers=headers)
content = r.content
print(content[:1000])

b'<!doctype html><html lang="en-ca" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n<!-- sp:end-feature:head-start -->\n<!-- sp:feature:csm:head-open-part1 -->\n\n<script type=\'text/javascript\'>var ue_t0=ue_t0||+new Date();</script>\n<!-- sp:end-feature:csm:head-open-part1 -->\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-na.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completion.amazon.com">\n<!-- sp:end-feature:cs-optimization -->\n<!-- sp:feature:csm:head-open-part2 -->\n<script type=\'text/javascript\'>\nwindow.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\n\nvar ue_csm = window,\n    ue_hob = +new Date();\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function()
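
Before parsing, it can help to confirm that the response is a real page rather than an error or a bot-check page. The following is a minimal sketch, not part of the original notebook: the helper name `response_problem` and the marker strings are assumptions chosen for illustration, and the actual text of a block page may differ.

```python
def response_problem(status_code, body, block_markers=(b"captcha", b"robot check")):
    """Return a short description of the problem, or None if the response looks usable.

    `block_markers` is a hypothetical list of byte strings that might appear
    on a bot-check page; adjust it after inspecting a real blocked response.
    """
    if status_code != 200:
        return f"unexpected HTTP status {status_code}"
    lowered = body.lower()
    for marker in block_markers:
        if marker in lowered:
            return f"page contains block marker {marker!r}"
    return None


# Example with made-up responses (no network access needed)
ok = response_problem(200, b"<html><body>books</body></html>")
failed = response_problem(503, b"")
blocked = response_problem(200, b"<html>Enter the CAPTCHA to continue</html>")
```

In the notebook, the same check could be called as `response_problem(r.status_code, r.content)` right after `requests.get`.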

#### Use BeautifulSoup to parse the content

In [6]:
soup = BeautifulSoup(content, 'lxml')
print(soup)

<!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo" lang="en-ca"><!-- sp:feature:head-start -->
<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>
<!-- sp:end-feature:head-start -->
<!-- sp:feature:csm:head-open-part1 -->
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>
<!-- sp:end-feature:csm:head-open-part1 -->
<!-- sp:feature:cs-optimization -->
<meta content="on" http-equiv="x-dns-prefetch-control"/>
<link href="https://images-na.ssl-images-amazon.com" rel="dns-prefetch"/>
<link href="https://m.media-amazon.com" rel="dns-prefetch"/>
<link href="https://completion.amazon.com" rel="dns-prefetch"/>
<!-- sp:end-feature:cs-optimization -->
<!-- sp:feature:csm:head-open-part2 -->
<script type="text/javascript">
window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=func

### Find the top 100 books
Let's parse this data to find the top 100 books and information about them, such as author, release date, price, and format.

#### Get the hyperlinks of the child webpages
The top 100 books are divided across two web pages, so we need the URL of each page.

The page hyperlinks sit in the following HTML structure:

    <html>
        .....
        .....
        <ul class="a-pagination">
            .....
            <!-- page URLs are in this section -->
            <a href="https://www.amazon.ca/gp/new-releases/books/ref=zg_bsnr_pg_2_books?ie=UTF8&pg=2">Page 2</a>
            .....
        </ul>
        .....
    </html>

In [7]:
links = []

# Collect the href of every <a> inside the pagination list
for ulTag in soup.find_all('ul', {'class': 'a-pagination'}):
    for aTag in ulTag.find_all('a'):
        links.append(aTag.get('href'))

In [8]:
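# Drop the last pagination link so that only the two page URLs remain
links = links[:-1]
# Relative links start with "/", so prepend the site root
if links[0].startswith("/"):
    links = ["https://www.amazon.ca"+s for s in links]
links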

['https://www.amazon.ca/gp/new-releases/books/ref=zg_bsnr_pg_1_books?ie=UTF8&pg=1',
 'https://www.amazon.ca/gp/new-releases/books/ref=zg_bsnr_pg_2_books?ie=UTF8&pg=2']
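
A possible alternative to the manual prefix check above is `urllib.parse.urljoin` from the standard library, which resolves relative links against a base URL and leaves absolute ones unchanged. This sketch is not part of the original notebook; `base_url` and `raw_links` are illustrative names, and the relative form of the first link is assumed from the output above.

```python
from urllib.parse import urljoin

base_url = "https://www.amazon.ca"

# One relative and one absolute link, to show that both cases are handled
raw_links = [
    "/gp/new-releases/books/ref=zg_bsnr_pg_1_books?ie=UTF8&pg=1",
    "https://www.amazon.ca/gp/new-releases/books/ref=zg_bsnr_pg_2_books?ie=UTF8&pg=2",
]

# urljoin prepends the base only when the link is relative
absolute_links = [urljoin(base_url, link) for link in raw_links]
```

This avoids deciding from the first link alone whether every link needs the prefix.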

#### Get product information from each hyperlink
We will now follow each of the above hyperlinks to get information about the top 100 books.

##### Investigate contents of one link

In [11]:
r = requests.get(links[0], headers=headers)
content = r.content
soup1 = BeautifulSoup(content, 'lxml')

In [12]:
books_data = soup1.find_all('div',{'class': 'a-column'})

books_data[0]

<div class="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc" id="gridItemRoot"><div class="a-cardui _cDEzb_grid-cell_1uMOS expandableGrid p13n-grid-content" data-a-card-type="basic" id="p13n-asin-index-0"><div class="a-section zg-bdg-ctr"><div class="a-section zg-bdg-body zg-bdg-clr-body aok-float-left"><span class="zg-bdg-text">#1</span></div><div class="a-section zg-bdg-tri zg-bdg-clr-tri aok-float-left"></div></div><div class="zg-grid-general-faceout"><div class="p13n-sc-uncoverable-faceout" id="B0CH2CZ1R4"><a class="a-link-normal" href="/Livy-Method-Fall-Posts-Guidelines/dp/B0CH2CZ1R4/ref=zg_bsnr_g_books_sccl_1/141-8723933-3586406?psc=1" role="link" tabindex="-1"><div class="a-section a-spacing-mini _cDEzb_noop_3Xbw5"><img alt="The Livy Method - Fall 2023: Posts and Guidelines" class="a-dynamic-image p13n-sc-dynamic-image p13n-product-image" data-a-dynamic-image='{"https://images-na.ssl-images-amazon.com/images/I/61CctzKXbXL._AC_UL300_SR300,200_.jpg":[300,200],"https://ima

##### Structure
The following HTML structure shows, for each item, which tags and classes hold the product information on the linked pages.

    <html>
        .....
        <div class = 'a-column' >
                .....
               <span class="zg-bdg-text"># </span>
               <div class="_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y">Book title goes here </div>
               .....
               <span class="_cDEzb_p13n-sc-price_3mJ9Z">$ Price </span>
                .....
               <div class="a-link-child">Author Name </div>
                .....
               <div class="a-icon-alt">Rating </div>
                .....
               <div class="zg-release-date">Release Date </div>
                .....
               <div class="a-color-secondary">Format </div>
        </div>
        .....
    </html>
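
Several of the fields in the structure above may be missing for a given book, so each lookup needs a fallback. One way to express that is a small helper that returns a default when the class is absent. This is a sketch under assumptions: `text_or_default` is a hypothetical helper name, the sample HTML is hand-written to mimic the structure above, and the built-in `html.parser` is used here so that the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup


def text_or_default(tag, class_name, default=None):
    """Return the stripped text of the first descendant with `class_name`,
    or `default` if no such element exists."""
    found = tag.find(class_=class_name)
    return found.get_text(strip=True) if found is not None else default


# Hand-written sample item; a real page contains many more attributes
sample_html = """
<div class="a-column">
  <span class="zg-bdg-text">#1</span>
  <div class="a-link-child">Gina Livy</div>
</div>
"""

item = BeautifulSoup(sample_html, "html.parser").find("div", class_="a-column")

rank = text_or_default(item, "zg-bdg-text").lstrip("#")
author = text_or_default(item, "a-link-child", default="Unknown")
release_date = text_or_default(item, "zg-release-date", default="N/A")
```

The helper replaces a `try`/`except AttributeError` around each `find(...).get_text()` call with a single explicit `None` check.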

##### Parse pages

In [14]:
topBookResults = [] # Initialize a results list to store results

# We will append tuples containing the rank, name, author, price,
# rating, release date, and book format. Then we will create a
# dataframe using this list of tuples.

# Loop through links
for link in links:
    # Open link, read and parse content
    r = requests.get(link, headers=headers)
    content = r.content
    soup2 = BeautifulSoup(content, 'lxml')

    # Get the book data of each webpage by finding all 'div'
    # elements with the class 'a-column'
    books_data = soup2.find_all('div',{'class': 'a-column'})

    # Loop through each 'a-column' item to extract the rank, name,
    # author, price, rating, release date, and book format.
    # Each item is stored as a tuple and appended to topBookResults.
    for item in books_data:
        # Get rank
        rank = item.find(class_='zg-bdg-text').get_text()
        rank = rank.strip(' \t\n\r').lstrip('#') # strip surrounding whitespace and the leading '#'

        # Get name
        try:
            name = item.find(class_='_cDEzb_p13n-sc-css-line-clamp-2_EWgCb').get_text()
        except AttributeError:
            name = item.find(class_='_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y').get_text()
        name = name.strip(' \t\n\r') # strip surrounding whitespace, including carriage returns

        # Get price
        price = item.find(class_="_cDEzb_p13n-sc-price_3mJ9Z")
        if price is not None:
            price = float(price.get_text()[1:])
        else:
            price = float('nan')

        # Get author (if it exists)
        author = ''
        try:
            author = item.find(class_="a-link-child").get_text()
            author = author.rstrip('\n')
        except AttributeError:
            try:
                author = item.find(class_="a-color-base").get_text()
                author = author.rstrip('\n')
            except AttributeError:
                author = 'Unknown'

        # Get book rating (if it exists)
        rating = ''
        try:
            rating = item.find(class_="a-icon-alt").get_text()
            # Clean rating
            rating = float(rating[:3])
        except AttributeError:
            rating = float('nan')

        # Get release date (if it exists)
        release_date = ''
        try:
            release_date = item.find(class_="zg-release-date").get_text()
            # Clean date
            release_date = release_date[14:].replace('te:','')
        except AttributeError:
            release_date = 'N/A'

        # Get book format
        formt = item.find(class_='a-color-secondary').get_text()

        topBookResults.append((
            rank, name, author, price, rating, release_date, formt
            ))

print(topBookResults)

[('1', 'The Livy Method - Fall 2023: Posts and Guidelines', 'Gina Livy', 24.99, 5.0, 'N/A', 'Paperback'), ('2', 'Holly', 'Stephen King', 29.99, 5.0, 'N/A', 'Hardcover'), ('3', 'Iron Flame', 'Rebecca Yarros', 23.99, nan, 'N/A', 'Hardcover'), ('4', 'Murder in the Family: A Novel', 'Cara Hunter', 18.15, 4.0, 'N/A', 'Paperback'), ('5', 'Things We Left Behind', 'Lucy Score', 24.39, 4.5, 'N/A', 'Paperback'), ('6', '$100M Leads: How to Get Strangers To Want To Buy Your Stuff', 'Alex Hormozi', 34.95, 5.0, 'N/A', 'Paperback'), ('7', 'Matt Sprouts and the Curse of the Ten Broken Toes (Volume 1)', 'Matthew Eicheldinger', 14.36, 4.7, 'N/A', 'Paperback'), ('8', 'On This Bright Day: A Year of Reflections for Lasting Food Freedom', 'Susan Peirce Thompson Ph.D.', 25.99, nan, 'N/A', 'Hardcover'), ('9', "The Old Farmer's Almanac 2024 Canadian Edition", "Old Farmer's Almanac", 9.99, nan, 'N/A', 'Mass Market Paperback'), ('10', 'None of This Is True: A Novel', 'Lisa Jewell', 20.0, 4.4, 'N/A', 'Paperback')
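
The price and rating clean-up in the loop relies on fixed string positions: `float(price.get_text()[1:])` raises a `ValueError` if the price contains a thousands separator, and `rating[:3]` assumes the rating text begins with a three-character number. A more tolerant variant could extract the number with a regular expression. The helper names `parse_price` and `parse_rating` and the sample strings below are assumptions for illustration, not taken from the scraped pages.

```python
import math
import re


def parse_price(text):
    """Extract the first number from a price string such as '$1,234.50'.
    Returns NaN when the text is missing or contains no number."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    if match is None:
        return float("nan")
    return float(match.group().replace(",", ""))


def parse_rating(text):
    """Extract the leading number from a rating string.
    Returns NaN when the text is missing or does not start with a number."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text or "")
    return float(match.group(1)) if match else float("nan")


# Hypothetical inputs covering the normal and the missing cases
price_simple = parse_price("$24.99")
price_large = parse_price("$1,234.50")
price_missing = parse_price(None)
rating_value = parse_rating("4.5 out of 5 stars")
rating_missing = parse_rating(None)
```

Both helpers return `float('nan')` for missing values, which matches the convention already used in the loop.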

#### Convert list of tuples to dataframe

In [15]:
topBookDF = pd.DataFrame(topBookResults,
             columns=['rank','name','author','price','rating','release_date','format'])
topBookDF.head()

| | rank | name | author | price | rating | release_date | format |
|---|---|---|---|---|---|---|---|
| 0 | 1 | The Livy Method - Fall 2023: Posts and Guidelines | Gina Livy | 24.99 | 5.0 | N/A | Paperback |
| 1 | 2 | Holly | Stephen King | 29.99 | 5.0 | N/A | Hardcover |
| 2 | 3 | Iron Flame | Rebecca Yarros | 23.99 | NaN | N/A | Hardcover |
| 3 | 4 | Murder in the Family: A Novel | Cara Hunter | 18.15 | 4.0 | N/A | Paperback |
| 4 | 5 | Things We Left Behind | Lucy Score | 24.39 | 4.5 | N/A | Paperback |
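
In the dataframe above, `rank` is stored as a string and `release_date` holds the placeholder `'N/A'`. If numeric sorting or date arithmetic is needed later, the columns could be converted as in the sketch below. It uses a three-row sample copied from the printed results rather than the full `topBookResults`, and `sample_df` is an illustrative name.

```python
import pandas as pd

# Three rows copied from the printed results above
sample_results = [
    ('1', 'The Livy Method - Fall 2023: Posts and Guidelines', 'Gina Livy', 24.99, 5.0, 'N/A', 'Paperback'),
    ('2', 'Holly', 'Stephen King', 29.99, 5.0, 'N/A', 'Hardcover'),
    ('3', 'Iron Flame', 'Rebecca Yarros', 23.99, float('nan'), 'N/A', 'Hardcover'),
]
columns = ['rank', 'name', 'author', 'price', 'rating', 'release_date', 'format']
sample_df = pd.DataFrame(sample_results, columns=columns)

# Ranks arrive as strings such as '1'; convert them to integers
sample_df['rank'] = sample_df['rank'].astype(int)

# Unparseable dates such as 'N/A' become NaT instead of raising an error
sample_df['release_date'] = pd.to_datetime(sample_df['release_date'], errors='coerce')
```

With `errors='coerce'`, missing release dates are represented as `NaT`, which pandas treats as a missing value in the same way as `NaN` in the `rating` column.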
