# Introduction
Web scraping is the process of automatically extracting data from websites. It involves using software (often called "scrapers" or "bots") to sift through websites and gather specific information.

### How does it work?
- Target identification: Select the websites and the specific data needed (e.g., product prices, contact information, news articles).
- Data extraction: The scraper fetches the website's HTML code and uses techniques such as:
    - Parsing: Analyzing the HTML structure to locate the data.
    - Selectors: Using tools like CSS selectors or XPath to pinpoint the exact elements containing the desired information.
- Data cleaning and transformation: The extracted data is often cleaned, formatted and transformed into a structured format (like CSV, JSON or a database) for further analysis.
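
The steps above can be sketched end to end on a small inline HTML snippet. This is only an illustration: the markup, the `product`, `name` and `price` class names and the output field names are made up, and a real scraper would fetch the HTML over HTTP instead of using a string.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched page.
HTML = """
<div class="product"><span class="name">Widget</span><span class="price"> $9.99 </span></div>
<div class="product"><span class="name">Gadget</span><span class="price"> $24.50 </span></div>
"""

# Parsing: build a tree from the raw HTML.
soup = BeautifulSoup(HTML, "html.parser")

records = []
# Selectors: a CSS selector picks out each product block.
for product in soup.select("div.product"):
    name = product.select_one("span.name").get_text(strip=True)
    raw_price = product.select_one("span.price").get_text(strip=True)
    # Cleaning: drop the currency symbol and convert the rest to a number.
    records.append({"name": name, "price": float(raw_price.lstrip("$"))})

# Transformation: serialise the records to a structured format (JSON here).
as_json = json.dumps(records, indent=2)
print(as_json)
```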

### Common uses of web scraping
- Price comparison: Tracking prices from competitors.
- Market research: Gathering data on consumer trends, competitor analysis and market intelligence.
- Lead generation: Extracting contact information for potential customers.
- Data enrichment: Enhancing existing datasets with additional information from the web.
- Academic research: Collecting data for research projects.
- News monitoring: Tracking news articles and social media mentions related to specific topics.

### Tools and libraries
- Programming languages: Python (with packages like Beautiful Soup, Scrapy and Selenium) and JavaScript (running on Node.js) are popular choices.
- Web scraping APIs: Services like Scrapy Cloud, Apify and ParseHub offer managed scraping infrastructure.

### Ethical considerations
- Website terms of service: Always respect the website's terms of service and robots.txt file.
- Data privacy: Be mindful of data privacy regulations (like GDPR) and avoid scraping personal information without consent.
- Rate limiting: Avoid overwhelming the target website with excessive requests.
- Website changes: Websites can change their structure, rendering the scraper useless. Be prepared to maintain and update the scripts.
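
Two of these points, honouring robots.txt and limiting the request rate, can be sketched with the standard library alone. The robots.txt content, the `example-scraper` user agent and the helper names `allowed` and `polite_urls` are assumptions for illustration. In practice the rules would be downloaded from the target site.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-scraper"):
    """Return True if the parsed robots.txt rules permit fetching url."""
    return parser.can_fetch(user_agent, url)


def polite_urls(urls, delay_seconds=1.0):
    """Yield only the allowed URLs, pausing between them to limit the request rate."""
    yielded_any = False
    for url in urls:
        if not allowed(url):
            continue
        if yielded_any:
            time.sleep(delay_seconds)
        yielded_any = True
        yield url
```

A scraper loop could then iterate over `polite_urls(candidate_urls)` and issue one request per yielded URL.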

Web scraping can be complex, and its legality and ethics vary with the specific use case and the target website. It is crucial to research and understand the legal and ethical implications before engaging in web scraping activities.

# Install Necessary Packages

```
pip install requests
pip install beautifulsoup4
```

# `requests` Package
The `requests` package in Python is a powerful tool for making HTTP requests. It simplifies the process of sending and receiving data over the web, making it easier to interact with APIs, download files and perform other web-related tasks.

### Key features
- Easy to use: Provides a user-friendly interface with methods like `get()`, `post()`, `put()` and `delete()` for the different HTTP request types.
- Handles HTTP: Automatically handles common HTTP tasks like setting headers, encoding data and handling cookies.
- Supports various request types: Supports a wide range of HTTP methods, including `GET`, `POST`, `PUT`, `DELETE`, `HEAD`, `OPTIONS` and more.
- Handles responses: Provides easy access to response data, including status codes, headers and content of the response.
- Extensible: Can be easily extended with custom adapters and hooks for advanced use cases.
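
A few of these features can be seen in a short sketch. The `fetch_html` helper, the User-Agent string and the `example.com` URL with its `q` parameter are assumptions for illustration. Nothing is sent over the network unless `fetch_html` is called.

```python
import requests

# A Session reuses connections and carries shared headers and cookies.
session = requests.Session()
# Hypothetical User-Agent string identifying the scraper.
session.headers.update({"User-Agent": "example-scraper/0.1"})


def fetch_html(url, timeout=10):
    """GET url and return the body as text, raising on 4xx/5xx responses."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn HTTP error statuses into exceptions
    return response.text


# A request can be prepared without being sent, which shows how params are encoded.
prepared = requests.Request("GET", "http://example.com/search", params={"q": "books"}).prepare()
print(prepared.method, prepared.url)
```

Passing a `timeout` is a deliberate choice here: without one, a request to an unresponsive server can wait indefinitely.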

# `beautifulsoup4` Package
Beautiful Soup is a Python package that makes it easy to parse HTML and XML documents. It creates a parse tree for the document, allowing the user to navigate, search and modify the data within.

### Key features
- Handles malformed markup: It can gracefully handle HTML that is not perfectly structured (often referred to as "tag soup"), which is common on the web.
- Easy to use: Provides a simple and intuitive API for navigating and searching the parsed document.
- Flexible: Supports various parsers (like Python's built-in parser, lxml and html5lib) for different needs.
- Navigable tree structure: Represents the HTML/XML document as a tree of interconnected nodes, making it easy to traverse and extract specific elements.

### Common use cases
- Web scraping: Extracting data from websites (e.g., product prices, news articles, contact information).
- Data cleaning: Cleaning up messy HTML before further processing.
- HTML/XML manipulation: Modifying or transforming HTML/XML documents.
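
The features and use cases above can be tried on a deliberately sloppy snippet. The markup and the `menu` id below are made up for illustration, and the sketch assumes the built-in `html.parser`. Other parsers may build a slightly different tree from the same malformed input.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: the <li> tags are never closed.
html = "<ul id='menu'><li><a href='/a'>First</a><li><a href='/b'>Second</a></ul>"
soup = BeautifulSoup(html, "html.parser")

# Searching: collect every link in document order.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

# Navigating: step from a tag up to one of its ancestors.
first_link = soup.find("a")
menu_id = first_link.find_parent("ul")["id"]

# Modifying: change an attribute in the tree.
first_link["href"] = "/changed"

print(links, menu_id, first_link)
```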

# Code
The first objective is to capture the URL of every category on the website. Once the URLs have been captured, the next objective is to capture the data of each book under each category.

### Extracting URLs of all the categories

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
baseurl = "http://books.toscrape.com/"
# send a GET request to the base URL
r = requests.get(baseurl)
# view the content
r.content



In [3]:
# parse the response into a navigable tree
# "soup" is only a conventional variable name, not a requirement
# naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(r.content, "html.parser")
soup

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

In [4]:
# specify the tag and class name of the element to find
# right-click on the web page and select "Inspect" to find the tag name
# the class name can be found in the same way
ul_list = soup.find("ul", class_="nav-list")
ul_list

<ul class="nav nav-list">
<li>
<a href="catalogue/category/books_1/index.html">
                            
                                Books
                            
                        </a>
<ul>
<li>
<a href="catalogue/category/books/travel_2/index.html">
                            
                                Travel
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/mystery_3/index.html">
                            
                                Mystery
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/historical-fiction_4/index.html">
                            
                                Historical Fiction
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/sequential-art_5/index.html">
                            
                                Sequential Art
                            
           

In [5]:
# capture only the <li> items inside the nested <ul> (one per category)
ul_li_list = ul_list.ul.find_all("li")
ul_li_list

[<li>
 <a href="catalogue/category/books/travel_2/index.html">
                             
                                 Travel
                             
                         </a>
 </li>,
 <li>
 <a href="catalogue/category/books/mystery_3/index.html">
                             
                                 Mystery
                             
                         </a>
 </li>,
 <li>
 <a href="catalogue/category/books/historical-fiction_4/index.html">
                             
                                 Historical Fiction
                             
                         </a>
 </li>,
 <li>
 <a href="catalogue/category/books/sequential-art_5/index.html">
                             
                                 Sequential Art
                             
                         </a>
 </li>,
 <li>
 <a href="catalogue/category/books/classics_6/index.html">
                             
                                 Classics
                 

In [6]:
# inside each <li>, only the href inside the <a> tag is needed
baseurl + ul_li_list[0].a["href"]

'http://books.toscrape.com/catalogue/category/books/travel_2/index.html'
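
String concatenation works here because the href is relative to the site root. As a possible alternative, the standard library's `urljoin` resolves a relative href against whichever page it was found on, including hrefs that contain `..`. The sketch below uses the category URL from above plus a relative href of the kind found on category pages.

```python
from urllib.parse import urljoin

baseurl = "http://books.toscrape.com/"

# Same result as baseurl + href when the href is relative to the site root.
category_url = urljoin(baseurl, "catalogue/category/books/travel_2/index.html")

# Relative hrefs containing ".." are resolved against the page they were found on.
page_url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
book_url = urljoin(page_url, "../../../sharp-objects_997/index.html")

print(category_url)
print(book_url)
```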

In [7]:
# function combining all the steps above
def extract_categories(baseurl):
    """Return a dict mapping each category name to its absolute URL."""
    categories = {}
    r = requests.get(baseurl)
    soup = BeautifulSoup(r.content, "html.parser")
    # the nested <ul> inside "nav-list" holds one <li> per category
    categories_list = soup.find("ul", class_="nav-list").ul.find_all("li")
    for li in categories_list:
        categories[li.text.strip()] = baseurl + li.a["href"]
    return categories

In [8]:
extract_categories(baseurl)

{'Travel': 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html',
 'Mystery': 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
 'Historical Fiction': 'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
 'Sequential Art': 'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html',
 'Classics': 'http://books.toscrape.com/catalogue/category/books/classics_6/index.html',
 'Philosophy': 'http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html',
 'Romance': 'http://books.toscrape.com/catalogue/category/books/romance_8/index.html',
 'Womens Fiction': 'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
 'Fiction': 'http://books.toscrape.com/catalogue/category/books/fiction_10/index.html',
 'Childrens': 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
 'Religion': 'http://books.toscrape.com/catalogue/category/books/rel

### Extracting the information of each book under each category

In [9]:
# consider "Mystery"
url = "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
response = requests.get(url)
response.status_code

200

In [10]:
soup = BeautifulSoup(response.content, "html.parser")
soup


<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    Mystery | 
     Books to Scrape - Sandbox

</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="
    
" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="../../../../static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="../../../../static/oscar/css/styles.css" rel="stylesheet

In [11]:
# define a dictionary containing the names of the elements for which the data needs to be captured
data = {
    "product_page_url": [],
    "title": [],
    "price_including_tax": [],
    "number_available": []
}
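
A dictionary of parallel lists maps directly onto DataFrame columns, but the lists must stay the same length. One alternative, sketched here with made-up records and a hypothetical `to_columns` helper, is to collect one dictionary per book and convert to columns at the end.

```python
FIELDS = ["product_page_url", "title", "price_including_tax", "number_available"]

# Hypothetical records, one dict per book, using the same field names.
records = [
    {
        "product_page_url": "http://example.com/book-1",
        "title": "Book One",
        "price_including_tax": "£10.00",
        "number_available": "In stock (3 available)",
    },
    {
        "product_page_url": "http://example.com/book-2",
        "title": "Book Two",
        "price_including_tax": "£12.50",
        "number_available": "In stock (1 available)",
    },
]


def to_columns(records, fields):
    """Convert a list of per-book dicts into a dict of column lists."""
    return {field: [record[field] for record in records] for field in fields}


columns = to_columns(records, FIELDS)
print(columns["title"])
```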

In [12]:
# each book on the category page is an <li> inside <ol class="row">
book_page_list = soup.find("ol", class_="row").find_all("li")
book_page_list

[<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="../../../sharp-objects_997/index.html"><img alt="Sharp Objects" class="thumbnail" src="../../../../media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg"/></a>
 </div>
 <p class="star-rating Four">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="../../../sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>
 <div class="product_price">
 <p class="price_color">£47.82</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>
 </li>,
 <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="../.

In [13]:
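for li in book_page_list:
    # hrefs look like "../../../sharp-objects_997/index.html"; drop the three leading ".." parts
    href_split = li.a["href"].split("/")[3:]
    book_page_link = "http://books.toscrape.com/catalogue/" + "/".join(href_split)
    print(book_page_link)
    req = requests.get(book_page_link)
    # use a separate name so the category page's soup is not overwritten
    book_soup = BeautifulSoup(req.content, features="html.parser")
    data["product_page_url"].append(book_page_link)
    # the product information table holds one <tr> per field
    rows = book_soup.find("table").find_all("tr")
    # the <title> tag includes the site suffix " | Books to Scrape - Sandbox"
    title = book_soup.find("title").text.strip()
    data["title"].append(title)
    # row 3 is the price including tax, row 5 is the availability
    data["price_including_tax"].append(rows[3].find("td").text.strip())
    data["number_available"].append(rows[5].find("td").text.strip())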

http://books.toscrape.com/catalogue/sharp-objects_997/index.html
http://books.toscrape.com/catalogue/in-a-dark-dark-wood_963/index.html
http://books.toscrape.com/catalogue/the-past-never-ends_942/index.html
http://books.toscrape.com/catalogue/a-murder-in-time_877/index.html
http://books.toscrape.com/catalogue/the-murder-of-roger-ackroyd-hercule-poirot-4_852/index.html
http://books.toscrape.com/catalogue/the-last-mile-amos-decker-2_754/index.html
http://books.toscrape.com/catalogue/that-darkness-gardiner-and-renner-1_743/index.html
http://books.toscrape.com/catalogue/tastes-like-fear-di-marnie-rome-3_742/index.html
http://books.toscrape.com/catalogue/a-time-of-torment-charlie-parker-14_657/index.html
http://books.toscrape.com/catalogue/a-study-in-scarlet-sherlock-holmes-1_656/index.html
http://books.toscrape.com/catalogue/poisonous-max-revere-novels-3_627/index.html
http://books.toscrape.com/catalogue/murder-at-the-42nd-street-library-raymond-ambler-1_624/index.html
http://books.toscrap
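
The loop in the cell above picks table rows by position (`rows[3]`, `rows[5]`), which breaks silently if the row order changes. A possible alternative is to key each row by its header text. The table fragment, its labels and the `table_to_dict` helper below are assumptions for illustration, not copied from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating a product information table.
HTML = """
<table>
  <tr><th>UPC</th><td>abc123</td></tr>
  <tr><th>Price (incl. tax)</th><td>£47.82</td></tr>
  <tr><th>Availability</th><td>In stock (20 available)</td></tr>
</table>
"""


def table_to_dict(table):
    """Map each row's <th> label to its <td> text."""
    return {
        row.find("th").get_text(strip=True): row.find("td").get_text(strip=True)
        for row in table.find_all("tr")
    }


info = table_to_dict(BeautifulSoup(HTML, "html.parser").find("table"))
price = info["Price (incl. tax)"]
availability = info["Availability"]
print(price, availability)
```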

In [14]:
data

{'product_page_url': ['http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
  'http://books.toscrape.com/catalogue/in-a-dark-dark-wood_963/index.html',
  'http://books.toscrape.com/catalogue/the-past-never-ends_942/index.html',
  'http://books.toscrape.com/catalogue/a-murder-in-time_877/index.html',
  'http://books.toscrape.com/catalogue/the-murder-of-roger-ackroyd-hercule-poirot-4_852/index.html',
  'http://books.toscrape.com/catalogue/the-last-mile-amos-decker-2_754/index.html',
  'http://books.toscrape.com/catalogue/that-darkness-gardiner-and-renner-1_743/index.html',
  'http://books.toscrape.com/catalogue/tastes-like-fear-di-marnie-rome-3_742/index.html',
  'http://books.toscrape.com/catalogue/a-time-of-torment-charlie-parker-14_657/index.html',
  'http://books.toscrape.com/catalogue/a-study-in-scarlet-sherlock-holmes-1_656/index.html',
  'http://books.toscrape.com/catalogue/poisonous-max-revere-novels-3_627/index.html',
  'http://books.toscrape.com/catalogue/murder-a

In [15]:
# storing all the scraped information into a DataFrame
import pandas as pd

data = pd.DataFrame(data)
data.sample(10)

| | product_page_url | title | price_including_tax | number_available |
|---|---|---|---|---|
| 12 | http://books.toscrape.com/catalogue/most-wante... | Most Wanted \| Books to Scrape - Sandbox | £35.28 | In stock (12 available) |
| 0 | http://books.toscrape.com/catalogue/sharp-obje... | Sharp Objects \| Books to Scrape - Sandbox | £47.82 | In stock (20 available) |
| 6 | http://books.toscrape.com/catalogue/that-darkn... | That Darkness (Gardiner and Renner #1) \| Books... | £13.92 | In stock (14 available) |
| 1 | http://books.toscrape.com/catalogue/in-a-dark-... | In a Dark, Dark Wood \| Books to Scrape - Sandbox | £19.63 | In stock (18 available) |
| 3 | http://books.toscrape.com/catalogue/a-murder-i... | A Murder in Time \| Books to Scrape - Sandbox | £16.64 | In stock (16 available) |
| 16 | http://books.toscrape.com/catalogue/playing-wi... | Playing with Fire \| Books to Scrape - Sandbox | £13.71 | In stock (11 available) |
| 9 | http://books.toscrape.com/catalogue/a-study-in... | A Study in Scarlet (Sherlock Holmes #1) \| Book... | £16.73 | In stock (14 available) |
| 10 | http://books.toscrape.com/catalogue/poisonous-... | Poisonous (Max Revere Novels #3) \| Books to Sc... | £26.80 | In stock (12 available) |
| 7 | http://books.toscrape.com/catalogue/tastes-lik... | Tastes Like Fear (DI Marnie Rome #3) \| Books t... | £10.69 | In stock (14 available) |
| 19 | http://books.toscrape.com/catalogue/delivering... | Delivering the Truth (Quaker Midwife Mystery #... | £20.89 | In stock (7 available) |
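
The scraped values are still raw strings: titles carry the site suffix, prices include the currency symbol, and availability is a sentence. A minimal cleaning sketch follows. The helper names are hypothetical, and the patterns assume the formats seen in the sample above.

```python
import re

TITLE_SUFFIX = " | Books to Scrape - Sandbox"


def clean_title(title):
    """Remove the site-wide suffix from a page title, if present."""
    if title.endswith(TITLE_SUFFIX):
        return title[: -len(TITLE_SUFFIX)]
    return title


def parse_price(text):
    """Extract the numeric amount from a string such as '£47.82'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return float(match.group())


def parse_available(text):
    """Extract the count from 'In stock (20 available)'; return 0 if absent."""
    match = re.search(r"\((\d+) available\)", text)
    return int(match.group(1)) if match else 0


print(clean_title("Sharp Objects | Books to Scrape - Sandbox"))
print(parse_price("£47.82"), parse_available("In stock (20 available)"))
```

With the DataFrame from the previous cell, each helper could be applied to its column, for example `data["price_including_tax"].map(parse_price)`.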
