### A company that sells books online has low sales in its "Travel" and "Nonfiction" categories. To run a competitor and price analysis, it needs information about the books in those two categories on the competitor's website https://books.toscrape.com/, which permits scraping. The task is to collect the information shown on the detail page of every book in these categories.

Import libraries

In [36]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

import pandas as pd
import time
import re

### Task 1: Configuring and Launching the Browser

In [2]:
# 1. Define a ChromeOptions object named options using Selenium's webdriver module.
options = webdriver.ChromeOptions()

# 2. Add the argument that starts the browser window maximized.
options.add_argument("start-maximized")

In [3]:
#3. Create a Chrome browser named driver using the options you prepared in the previous steps.
driver = webdriver.Chrome(options=options)
sleep_time= 2

### Task 2: Examining and Scraping the Home Page

In [4]:
# 1. Open the Home Page with the driver and examine it.
driver.get("https://books.toscrape.com/")
time.sleep(sleep_time)

In [5]:
# 2. Write an XPath query that finds the elements holding the links of the "Travel" and "Nonfiction" category pages all at once.
category_element_xpath = "//a[contains(text(),'Travel') or contains(text(),'Nonfiction')]"

In [6]:
# 3. Find the elements you captured with the XPath query using the driver and scrape the category detail links.
category_elements= driver.find_elements(By.XPATH,category_element_xpath)

In [7]:
category_urls = [element.get_attribute("href") for element in category_elements]
category_urls

['https://books.toscrape.com/catalogue/category/books/travel_2/index.html',
 'https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html']
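
The category URLs above carry the category name in their path, which is handy for labelling rows later. A minimal sketch, assuming every category URL follows the `/books/<name>_<id>/` pattern seen in the two results; `category_slug` is a hypothetical helper, not part of the original notebook:

```python
import re
from urllib.parse import urlparse

def category_slug(category_url):
    """Return the category name embedded in a category URL, or None."""
    # The path is assumed to look like /catalogue/category/books/<name>_<id>/index.html
    path = urlparse(category_url).path
    match = re.search(r"/books/([a-z0-9-]+)_\d+/", path)
    return match.group(1) if match else None

urls = [
    "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
    "https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html",
]
print([category_slug(u) for u in urls])  # ['travel', 'nonfiction']
```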

### Task 3: Inspecting and Scraping the Category Page

In [8]:
# 1. Go to any category page and write the XPath query that captures the elements holding the detail links of the books on that page.
driver.get(category_urls[0])
time.sleep(sleep_time)

In [9]:
book_elements_xpath= "//div[@class='image_container']//a"

In [10]:
# 2. Using the driver, capture the elements with the XPath query and extract their detail links.
book_elements = driver.find_elements(By.XPATH, book_elements_xpath)
book_urls = [element.get_attribute("href") for element in book_elements]
print(book_urls)
print(len(book_urls))

['https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html', 'https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html', 'https://books.toscrape.com/catalogue/see-america-a-celebration-of-our-national-parks-treasured-sites_732/index.html', 'https://books.toscrape.com/catalogue/vagabonding-an-uncommon-guide-to-the-art-of-long-term-world-travel_552/index.html', 'https://books.toscrape.com/catalogue/under-the-tuscan-sun_504/index.html', 'https://books.toscrape.com/catalogue/a-summer-in-europe_458/index.html', 'https://books.toscrape.com/catalogue/the-great-railway-bazaar_446/index.html', 'https://books.toscrape.com/catalogue/a-year-in-provence-provence-1_421/index.html', 'https://books.toscrape.com/catalogue/the-road-to-little-dribbling-adventures-of-an-american-in-britain-notes-from-a-small-island-2_277/index.html', 'https://books.toscrape.com/catalogue/neither-here-nor-there-travels-in-europe_198/index.html', 'h

In [12]:
# 3. For pagination, manipulate the page link instead of clicking on the buttons.

max_pagination = 3
url = category_urls[1]
book_urls= []
for i in range(1,max_pagination):
    update_url = url if i==1 else url.replace("index", f"page-{i}")
    driver.get(update_url)
    book_elements = driver.find_elements(By.XPATH, book_elements_xpath)

    temp_urls = [element.get_attribute("href") for element in book_elements]
    book_urls.extend(temp_urls)

print(book_urls)
print(len(book_urls))

['https://books.toscrape.com/catalogue/worlds-elsewhere-journeys-around-shakespeares-globe_972/index.html', 'https://books.toscrape.com/catalogue/the-five-love-languages-how-to-express-heartfelt-commitment-to-your-mate_969/index.html', 'https://books.toscrape.com/catalogue/reasons-to-stay-alive_959/index.html', 'https://books.toscrape.com/catalogue/higherselfie-wake-up-your-life-free-your-soul-find-your-tribe_957/index.html', 'https://books.toscrape.com/catalogue/unseen-city-the-majesty-of-pigeons-the-discreet-charm-of-snails-other-wonders-of-the-urban-wilderness_952/index.html', 'https://books.toscrape.com/catalogue/throwing-rocks-at-the-google-bus-how-growth-became-the-enemy-of-prosperity_948/index.html', 'https://books.toscrape.com/catalogue/the-life-changing-magic-of-tidying-up-the-japanese-art-of-decluttering-and-organizing_936/index.html', 'https://books.toscrape.com/catalogue/the-gutsy-girl-escapades-for-your-life-of-epic-adventure_934/index.html', 'https://books.toscrape.com/ca
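
The loop above builds page links with `url.replace("index", ...)`, which rewrites every occurrence of "index" in the URL. A slightly stricter sketch, assuming category URLs always end in `index.html` and later pages are named `page-N.html`; `page_url` is a hypothetical helper:

```python
def page_url(category_url, page):
    """Build the URL of the given 1-based page of a category listing."""
    if page == 1:
        return category_url
    # Split off the final path segment so only the file name is changed.
    base, _, last = category_url.rpartition("/")
    if last != "index.html":
        raise ValueError(f"unexpected category URL: {category_url!r}")
    return f"{base}/page-{page}.html"

first = "https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html"
print(page_url(first, 1))
print(page_url(first, 2))
```

For the two categories scraped here this produces the same links as the `replace` call; the difference only matters if a category name itself contains "index".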

In [13]:
# 4. Based on the previous step, add a stop condition so the pagination loop ends as soon as a page returns no books.
max_pagination = 10
url = category_urls[1]
book_urls= []
for i in range(1,max_pagination):
    update_url = url if i==1 else url.replace("index", f"page-{i}")
    driver.get(update_url)
    book_elements = driver.find_elements(By.XPATH, book_elements_xpath)
    if not book_elements:
        break

    temp_urls = [element.get_attribute("href") for element in book_elements]
    book_urls.extend(temp_urls)

print(book_urls)
print(len(book_urls))

['https://books.toscrape.com/catalogue/worlds-elsewhere-journeys-around-shakespeares-globe_972/index.html', 'https://books.toscrape.com/catalogue/the-five-love-languages-how-to-express-heartfelt-commitment-to-your-mate_969/index.html', 'https://books.toscrape.com/catalogue/reasons-to-stay-alive_959/index.html', 'https://books.toscrape.com/catalogue/higherselfie-wake-up-your-life-free-your-soul-find-your-tribe_957/index.html', 'https://books.toscrape.com/catalogue/unseen-city-the-majesty-of-pigeons-the-discreet-charm-of-snails-other-wonders-of-the-urban-wilderness_952/index.html', 'https://books.toscrape.com/catalogue/throwing-rocks-at-the-google-bus-how-growth-became-the-enemy-of-prosperity_948/index.html', 'https://books.toscrape.com/catalogue/the-life-changing-magic-of-tidying-up-the-japanese-art-of-decluttering-and-organizing_936/index.html', 'https://books.toscrape.com/catalogue/the-gutsy-girl-escapades-for-your-life-of-epic-adventure_934/index.html', 'https://books.toscrape.com/ca
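
The stop condition can be separated from Selenium and tried on its own. The sketch below is an illustration only: `collect_paginated` and the `fake_pages` stand-in are hypothetical, and unlike `range(1, max_pagination)` in the cell above it visits up to `max_pages` pages inclusive.

```python
def collect_paginated(fetch_page, max_pages=10):
    """Call fetch_page(1), fetch_page(2), ... until a page is empty or max_pages is reached."""
    items = []
    for page in range(1, max_pages + 1):
        page_items = fetch_page(page)
        # An empty page means we have run past the last page of the category.
        if not page_items:
            break
        items.extend(page_items)
    return items

# Stand-in for a real page fetch: pages 1-3 have items, later pages are empty.
fake_pages = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
result = collect_paginated(lambda page: fake_pages.get(page, []))
print(result)  # ['a', 'b', 'c', 'd', 'e']
```

The upper bound still matters: if a site returned the last page again for out-of-range page numbers, the empty-page check alone would never fire.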

### Task 4: Scraping the Product Detail Page

In [14]:
# 1. Go to the detail page of any product and capture the div element whose class attribute is content.
driver.get(book_urls[0])
time.sleep(sleep_time)
content_div = driver.find_elements(By.XPATH,"//div[@class='content']")

In [15]:
# 2. Take the HTML of the div you captured and assign it to the variable named inner_html.
inner_html = content_div[0].get_attribute("innerHTML")

In [16]:
# 3. Create the soup object with inner_html.
soup = BeautifulSoup(inner_html,"html.parser")

In [26]:
# 4. Extract the following information from the soup object you created:
# - Book Name
name_elem = soup.find("h1")
book_name = name_elem.text
book_name

'Worlds Elsewhere: Journeys Around Shakespeare’s Globe'

In [27]:
# - Book Price
price_elem = soup.find("p",attrs={"class":"price_color"})
book_price =price_elem.text
book_price

'£40.30'
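
The price is scraped as text, so it has to be converted before any price comparison. A minimal sketch, assuming prices always contain a single decimal number; `parse_price` is a hypothetical helper:

```python
import re
from decimal import Decimal

def parse_price(price_text):
    """Extract the numeric part of a price string such as '£40.30'."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"no number found in {price_text!r}")
    # Decimal keeps the value exact, which avoids float rounding in sums.
    return Decimal(match.group())

print(parse_price("£40.30"))  # 40.30
```

Searching for the digits rather than stripping the first character also copes with a currency symbol that was decoded incorrectly.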

In [30]:
# - Number of Book Stars,
# Hint: (regex = re.compile('^star-rating '))
regex = re.compile('^star-rating ')
star_elem = soup.find("p",attrs={"class":regex})
print(star_elem)
book_star_count= star_elem["class"][-1]
book_star_count

<p class="star-rating Five">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<!-- <small><a href="/catalogue/worlds-elsewhere-journeys-around-shakespeares-globe_972/reviews/">
        
                
                    0 customer reviews
                
        </a></small>
         --> 


<!-- 
    <a id="write_review" href="/catalogue/worlds-elsewhere-journeys-around-shakespeares-globe_972/reviews/add/#addreview" class="btn btn-success btn-sm">
        Write a review
    </a>

 --></p>


'Five'
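
The rating comes back as a word taken from the class list. For numeric analysis it can be mapped to an integer. A small sketch, assuming the site only uses the words One to Five; `STAR_WORDS` and `star_count` are hypothetical names:

```python
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def star_count(class_list):
    """Return the numeric rating from a class list such as ['star-rating', 'Five']."""
    for name in class_list:
        if name in STAR_WORDS:
            return STAR_WORDS[name]
    # No recognised rating word was found.
    return None

print(star_count(["star-rating", "Five"]))  # 5
```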

In [31]:
# - Book Description,
desc_elem = soup.find("div", attrs={"id":"product_description"}).find_next_sibling()
book_desc = desc_elem.text
book_desc



'Anti-apartheid activist, Bollywood screenwriter, Nazi pin-up, hero of the Wild West: this is Shakespeare as you have never seen him before.From the sixteenth-century Baltic to the American Revolution, from colonial India to the skyscrapers of modern-day Shanghai, Shakespeare’s plays appear at the most fascinating of times and in the most unexpected of places. No other writ Anti-apartheid activist, Bollywood screenwriter, Nazi pin-up, hero of the Wild West: this is Shakespeare as you have never seen him before.From the sixteenth-century Baltic to the American Revolution, from colonial India to the skyscrapers of modern-day Shanghai, Shakespeare’s plays appear at the most fascinating of times and in the most unexpected of places. No other writer’s work has been performed, translated, adapted and altered in such a remarkable variety of cultures and languages. But what is it about William Shakespeare – a man from Warwickshire who never once set foot outside England – that has made him at 

In [32]:
# - Information in the table under the Product Information Heading.
product_information = {}
table_rows = soup.find("table").find_all("tr")
for row in table_rows:
    key = row.find("th").text
    value = row.find("td").text
    product_information[key] = value

product_information

{'UPC': '4c28def39d850cdf',
 'Product Type': 'Books',
 'Price (excl. tax)': '£40.30',
 'Price (incl. tax)': '£40.30',
 'Tax': '£0.00',
 'Availability': 'In stock (18 available)',
 'Number of reviews': '0'}
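
The `Availability` value mixes text and a number. A minimal sketch for pulling out the stock count, assuming the "(N available)" wording shown above and treating any other wording as zero stock, which is an assumption; `available_count` is a hypothetical helper:

```python
import re

def available_count(availability_text):
    """Return the number of copies from text like 'In stock (18 available)'."""
    match = re.search(r"\((\d+) available\)", availability_text)
    # Assumption: text without a count means nothing is available.
    return int(match.group(1)) if match else 0

print(available_count("In stock (18 available)"))  # 18
```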

### Task 5: Functionalization and Automating the Entire Process

In [33]:
# 1. Wrap each step in a function. For example: def get_product_detail(driver): | def get_category_detail_urls(driver)


def get_book_detail(driver,url):
    """gets book data from given book detail page url"""
    driver.get(url)
    time.sleep(sleep_time)
    content_div = driver.find_elements(By.XPATH,"//div[@class='content']")

    inner_html = content_div[0].get_attribute("innerHTML")

    soup = BeautifulSoup(inner_html,"html.parser")

    name_elem = soup.find("h1")
    book_name = name_elem.text

    price_elem = soup.find("p",attrs={"class":"price_color"})
    book_price =price_elem.text

    regex = re.compile('^star-rating ')
    star_elem = soup.find("p",attrs={"class":regex})
    book_star_count= star_elem["class"][-1]

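    desc_header = soup.find("div", attrs={"id":"product_description"})
    # A book without a description block would make find() return None, so fall back to an empty string.
    book_desc = desc_header.find_next_sibling().text if desc_header else ""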

    product_information = {}
    table_rows = soup.find("table").find_all("tr")
    for row in table_rows:
        key = row.find("th").text
        value = row.find("td").text
        product_information[key] = value

    return {
        'book_name': book_name,
        'book_price': book_price,
        'book_star_count': book_star_count,
        'book_desc': book_desc,
        **product_information,
    }



def get_book_urls(driver,url):
    """Gets book urls from every page of the given category url"""

    max_pagination = 10

    book_urls= []
    book_elements_xpath = "//div[@class='image_container']//a"
    for i in range(1,max_pagination):
        update_url = url if i==1 else url.replace("index", f"page-{i}")
        driver.get(update_url)
        book_elements = driver.find_elements(By.XPATH, book_elements_xpath)

        #controller of pagination
        if not book_elements:
            break

        temp_urls = [element.get_attribute("href") for element in book_elements]
        book_urls.extend(temp_urls)

    return book_urls
    

def get_category_detail_urls(driver,url):
    """Gets category urls from given homepage url"""
    driver.get(url)
    time.sleep(sleep_time)

    category_element_xpath = "//a[contains(text(),'Travel') or contains(text(),'Nonfiction')]"

    category_elements= driver.find_elements(By.XPATH,category_element_xpath)

    category_urls = [element.get_attribute("href") for element in category_elements]

    return category_urls

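def initialize_driver():
    """Initializes the driver with the start-maximized option"""
    options= webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    # Pass options by keyword, as in Task 1; the first positional parameter differs between Selenium 4 releases.
    driver = webdriver.Chrome(options=options)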
    return driver

In [37]:
# 2. Automate the process and edit the code to get the details of all books belonging to the Travel and Nonfiction categories.
def main():
    BASE_URL = "https://books.toscrape.com/"
    driver = initialize_driver()
    data = []
    try:
        category_urls = get_category_detail_urls(driver,BASE_URL)
        for cat_url in category_urls:
            book_urls = get_book_urls(driver,cat_url)
            for book_url in book_urls:
                book_data = get_book_detail(driver,book_url)
                book_data["cat_url"] = cat_url
                data.append(book_data)
    finally:
        # Close the browser even if scraping fails part-way through.
        driver.quit()


    pd.set_option("display.max_columns",None)
    pd.set_option("display.max_colwidth",40)
    pd.set_option("display.width",2000)
    df = pd.DataFrame(data)

    return df

df = main()


In [38]:
df

Unnamed: 0,book_name,book_price,book_star_count,book_desc,UPC,Product Type,Price (excl. tax),Price (incl. tax),Tax,Availability,Number of reviews,cat_url
0,It's Only the Himalayas,£45.17,Two,"“Wherever you go, whatever you do, j...",a22124811bfa8350,Books,£45.17,£45.17,£0.00,In stock (19 available),0,https://books.toscrape.com/catalogue...
1,Full Moon over Noah’s Ark: An Odysse...,£49.43,Four,Acclaimed travel writer Rick Antonso...,ce60436f52c5ee68,Books,£49.43,£49.43,£0.00,In stock (15 available),0,https://books.toscrape.com/catalogue...
2,See America: A Celebration of Our Na...,£48.87,Three,To coincide with the 2016 centennial...,f9705c362f070608,Books,£48.87,£48.87,£0.00,In stock (14 available),0,https://books.toscrape.com/catalogue...
3,Vagabonding: An Uncommon Guide to th...,£36.94,Two,With a new foreword by Tim Ferriss •...,1809259a5a5f1d8d,Books,£36.94,£36.94,£0.00,In stock (8 available),0,https://books.toscrape.com/catalogue...
4,Under the Tuscan Sun,£37.33,Three,A CLASSIC FROM THE BESTSELLING AUTHO...,a94350ee74deaa07,Books,£37.33,£37.33,£0.00,In stock (7 available),0,https://books.toscrape.com/catalogue...
...,...,...,...,...,...,...,...,...,...,...,...,...
116,H is for Hawk,£57.42,Five,When Helen Macdonald's father died s...,b6d3f4f4ee1f6069,Books,£57.42,£57.42,£0.00,In stock (2 available),0,https://books.toscrape.com/catalogue...
117,Travels with Charley: In Search of A...,£57.82,Five,An intimate journey across and in se...,0268f149d014b389,Books,£57.82,£57.82,£0.00,In stock (1 available),0,https://books.toscrape.com/catalogue...
118,The Tumor,£41.56,Five,John Grisham says THE TUMOR is the m...,6514add13c82b115,Books,£41.56,£41.56,£0.00,In stock (1 available),0,https://books.toscrape.com/catalogue...
119,The End of the Jesus Era (An Investi...,£14.40,One,God exists!? This may seem unprovabl...,09659d1639a1f978,Books,£14.40,£14.40,£0.00,In stock (1 available),0,https://books.toscrape.com/catalogue...


In [40]:
df.shape

(121, 12)

In [42]:
# Save the data to a CSV file
df.to_csv("books.csv",index=False)
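
As a first step toward the price analysis the task asks for, the scraped rows can be grouped by category and averaged. The sketch below uses only the standard library and three illustrative rows built from prices that appear in the outputs above; `average_price_by_category` is a hypothetical helper, and the same grouping could be done on `df` directly.

```python
from collections import defaultdict
from statistics import mean

TRAVEL = "https://books.toscrape.com/catalogue/category/books/travel_2/index.html"
NONFICTION = "https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html"

# Three illustrative rows, not the full data set.
rows = [
    {"book_price": "£45.17", "cat_url": TRAVEL},
    {"book_price": "£49.43", "cat_url": TRAVEL},
    {"book_price": "£40.30", "cat_url": NONFICTION},
]

def average_price_by_category(rows):
    """Group rows by cat_url and return the mean price of each group."""
    prices = defaultdict(list)
    for row in rows:
        # Drop the leading pound sign before converting to a number.
        prices[row["cat_url"]].append(float(row["book_price"].lstrip("£")))
    return {cat: round(mean(values), 2) for cat, values in prices.items()}

summary = average_price_by_category(rows)
for cat_url, avg in summary.items():
    print(cat_url, avg)
```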