# **Web Scraping**

Link of website for scraping <br>
🔗 http://books.toscrape.com/




In [3]:
# Importing requests and beautifulsoup libraries

from bs4 import BeautifulSoup as bs
import requests

In [4]:
# To scrape website, define the url

url = 'http://books.toscrape.com/'
response = requests.get(url)
response

# This code issues an HTTP GET request to the given URL. It retrieves the HTML data that the server sends back and stores that data in a Python object.

<Response [200]>

* <Response [200]> means the request succeeded.

If you print the .text attribute of page, then you'll notice that it looks just like the HTML that you inspected earlier with your browser's developer tools. You successfully fetched the static site content from the Internet! You now have access to the site's HTML from within your Python script.

In [5]:
soup = bs(response.text,'html')
print(soup)

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

* **soup = bs(...):** This part creates a new BeautifulSoup object, which is a
powerful tool for parsing and navigating HTML content.

* **response.text:** This likely refers to a string containing HTML content that you've obtained from a website. It's probably the text content of a response object retrieved using a library like requests.

* **'html':** This specifies that you're parsing HTML code. BeautifulSoup can handle other markup languages as well, but in this case, it's configured for HTML.

# Extract details of the selected book

---



In [6]:
# .find_all() on a Beautiful Soup object returns an iterable containing all the HTML code displayed on that page

book_tag = soup.find_all('article',class_='product_pod')
book_tag

[<article class="product_pod">
 <div class="image_container">
 <a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 <article class="product_pod">
 <div class="image_container">
 <a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="th

In [7]:
# To get all info of the 10th book

book = book_tag[10]
book

<article class="product_pod">
<div class="image_container">
<a href="catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html"><img alt="Starving Hearts (Triangular Trade Trilogy, #1)" class="thumbnail" src="media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg"/></a>
</div>
<p class="star-rating Two">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html" title="Starving Hearts (Triangular Trade Trilogy, #1)">Starving Hearts (Triangular Trade ...</a></h3>
<div class="product_price">
<p class="price_color">Â£13.99</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

In [8]:
# Retrieve the book title from the HTML code, focusing on the selected book(10th book)

title_tag = book.find('a', title = True)['title']
title_tag

'Starving Hearts (Triangular Trade Trilogy, #1)'

In [9]:
# Retrieve the book price from the HTML code, focusing on the selected book(10th book)

price_tag = book.find('p', class_ = 'price_color').text[0:]
price_tag

'Â£13.99'

In [10]:
price_tag = book.find('p', class_ = 'price_color').text[1:]
price_tag

'£13.99'

In [11]:
# Retrieve the book rating from the HTML code, focusing on the selected book(10th book)

rating_tag = book.find('p')['class']
rating_tag

['star-rating', 'Two']

In [12]:
rating_tag = book.find('p')['class'][1]
rating_tag

'Two'

In [13]:
# Retrieve the book link from the HTML code, focusing on the selected book(10th book)

link_tag = f"{ 'https://books.toscrape.com/'}" + book.find('a')['href']
link_tag

'https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html'

In [14]:
# create empty list

blist = []

In [15]:
# iterate through the books(till 10th book), extract all information and append it in the list

for books in range(0,11):
  book = book_tag[books]
  title_tag = book.find('a', title = True)['title']
  price_tag = book.find('p', class_ = 'price_color').text[1:]
  rating_tag = book.find('p')['class'][1]
  link_tag = f"{ 'https://books.toscrape.com/'}" + book.find('a')['href']

  blist.append([title_tag, price_tag, rating_tag, link_tag])

In [16]:
print(blist)

[['A Light in the Attic', '£51.77', 'Three', 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'], ['Tipping the Velvet', '£53.74', 'One', 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'], ['Soumission', '£50.10', 'One', 'https://books.toscrape.com/catalogue/soumission_998/index.html'], ['Sharp Objects', '£47.82', 'Four', 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html'], ['Sapiens: A Brief History of Humankind', '£54.23', 'Five', 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html'], ['The Requiem Red', '£22.65', 'One', 'https://books.toscrape.com/catalogue/the-requiem-red_995/index.html'], ['The Dirty Little Secrets of Getting Your Dream Job', '£33.34', 'Four', 'https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html'], ['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', '£17.93', 'Three', 'htt

In [17]:
# import python library

import pandas as pd

In [18]:
# create header of each columns

columns = ['Title', 'Price', 'Rating', 'Link']
df = pd.DataFrame(blist, columns=columns)
df

Unnamed: 0,Title,Price,Rating,Link
0,A Light in the Attic,£51.77,Three,https://books.toscrape.com/catalogue/a-light-i...
1,Tipping the Velvet,£53.74,One,https://books.toscrape.com/catalogue/tipping-t...
2,Soumission,£50.10,One,https://books.toscrape.com/catalogue/soumissio...
3,Sharp Objects,£47.82,Four,https://books.toscrape.com/catalogue/sharp-obj...
4,Sapiens: A Brief History of Humankind,£54.23,Five,https://books.toscrape.com/catalogue/sapiens-a...
5,The Requiem Red,£22.65,One,https://books.toscrape.com/catalogue/the-requi...
6,The Dirty Little Secrets of Getting Your Dream...,£33.34,Four,https://books.toscrape.com/catalogue/the-dirty...
7,The Coming Woman: A Novel Based on the Life of...,£17.93,Three,https://books.toscrape.com/catalogue/the-comin...
8,The Boys in the Boat: Nine Americans and Their...,£22.60,Four,https://books.toscrape.com/catalogue/the-boys-...
9,The Black Maria,£52.15,One,https://books.toscrape.com/catalogue/the-black...


# Extract details from all 50 pages

---



In [19]:
# create empty list
blist = []

# Loop through all the 50 pages
for page_number in range(1,51):
  url = f'http://books.toscrape.com/catalogue/page-{page_number}.html'
  response = requests.get(url)
  soup = bs(response.text,'html')
  book_tag = soup.find_all('article',class_='product_pod')


# append extracted information to the list
  for book in book_tag:
    title_tag = book.find('a', title = True)['title']
    price_tag = book.find('p', class_ = 'price_color').text[1:]
    rating_tag = book.find('p')['class'][1]
    link_tag = f"{ 'https://books.toscrape.com/'}" + book.find('a')['href']

    blist.append([title_tag, price_tag, rating_tag, link_tag])

In [21]:
# create header of each columns and convert list to dataframe

df_final = pd.DataFrame(blist, columns= ['Title', 'Price', 'Rating', 'Link'])
df_final

Unnamed: 0,Title,Price,Rating,Link
0,A Light in the Attic,£51.77,Three,https://books.toscrape.com/a-light-in-the-atti...
1,Tipping the Velvet,£53.74,One,https://books.toscrape.com/tipping-the-velvet_...
2,Soumission,£50.10,One,https://books.toscrape.com/soumission_998/inde...
3,Sharp Objects,£47.82,Four,https://books.toscrape.com/sharp-objects_997/i...
4,Sapiens: A Brief History of Humankind,£54.23,Five,https://books.toscrape.com/sapiens-a-brief-his...
...,...,...,...,...
995,Alice in Wonderland (Alice's Adventures in Won...,£55.53,One,https://books.toscrape.com/alice-in-wonderland...
996,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",£57.06,Four,https://books.toscrape.com/ajin-demi-human-vol...
997,A Spy's Devotion (The Regency Spies of London #1),£16.97,Five,https://books.toscrape.com/a-spys-devotion-the...
998,1st to Die (Women's Murder Club #1),£53.98,One,https://books.toscrape.com/1st-to-die-womens-m...


### Final dataset after scraping website

In [22]:
# save in csv format

df_final.to_csv('scraped.csv', index = False)
print("Data saved")

Data saved
