### Webscraping

##### What is Beautiful Soup?
Beautiful Soup is a Python parsing library that extracts data from HTML and XML documents. It is a beginner-friendly option for parsing web content: you search the parse tree by tag name, text content, and attributes, which makes navigating and searching the HTML tree much easier. Put simply, it’s a tool that helps you pull structured data from web pages.

##### Main features
- Dealing with poorly formatted HTML

    In most situations, Beautiful Soup can parse even badly malformed HTML. In the most extreme cases, you might need to switch the underlying parser or adjust Beautiful Soup’s parameters.

- Encoding conversion

    Beautiful Soup automatically detects the document's encoding and converts the document to Unicode. If detection fails, you can specify the encoding yourself and still get the job done.

- Integration with parsing libraries

    Beautiful Soup sits on top of a parser, either Python's built-in html.parser or a third-party library such as lxml or html5lib, so you can pick the one whose speed and leniency suit your documents.

- Excellent error handling

    Beautiful Soup handles parsing mistakes by giving you thorough error messages and facilitating easier parsing error recovery. As a result, the parsing process becomes much more manageable.
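
To make the features above concrete, here is a minimal sketch that parses deliberately broken markup and then decodes a byte string with an explicit encoding. The HTML snippet, the class names, and the variable names are invented for this illustration; it uses the built-in html.parser so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup (made up for this example):
# no <html>/<body> wrapper, and the last <div> is never closed.
broken_html = """
<div class="book"><a href="a.html" title="A Light in the Attic">A Light in the ...</a>
<span class="price">£51.77</span></div>
<div class="book"><a href="b.html" title="Tipping the Velvet">Tipping the Velvet</a>
<span class="price">£53.74</span>
"""

soup = BeautifulSoup(broken_html, "html.parser")

# Search by tag name and attribute, then read an attribute and a text node.
books = [
    {
        "title": div.a["title"],
        "price": div.find("span", class_="price").get_text(strip=True),
    }
    for div in soup.find_all("div", class_="book")
]
print(books)

# Byte input: the encoding is normally detected automatically,
# but from_encoding lets you state it explicitly when detection fails.
latin1_bytes = "<p>Café £5</p>".encode("latin-1")
text = BeautifulSoup(latin1_bytes, "html.parser", from_encoding="latin-1").p.get_text()
print(text)
```

Reading the title from the attribute rather than from the link text sidesteps the truncated text inside the first link.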

##### Advantages of using Beautiful Soup
- Beginner friendly
- Open-source and free
- Simple to implement
- Flexible parsing options

##### Disadvantages of using Beautiful Soup
- Many dependencies
- Not very scalable
- Minimal proxy support

##### What is Scrapy?
Scrapy is an open-source application framework for crawling websites and extracting data. It’s a stand-alone tool, which means that you can take it as it is and put it to work. Web scraping is not its only use, though: Scrapy can also be used for data mining and automated testing.

##### Main features
- Asynchronous request handling

    Scrapy is able to handle and prioritize multiple requests, making large-scale scraping operations easier, faster, and more efficient.

- Middlewares and extensions

    Being a framework dedicated to web scraping, Scrapy offers a number of middleware and extensions to support various web scraping processes. As such, it skillfully handles such things as cookies, redirects, forms, and pagination.

- Spider framework

    There are many ways to scrape a website and that’s why Scrapy allows users to specify their preferred approach. By using Scrapy’s spider framework, users can define the exact way that they want a website (or a batch of them) to be crawled, scraped, and parsed.

- AutoThrottling

    You can configure Scrapy so it doesn’t exhaust the target server's resources. The AutoThrottle extension evaluates the load on the Scrapy server as well as the target website server and adjusts the crawling speed.
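
As a rough sketch, AutoThrottle is switched on through a handful of settings in the project's settings.py. The setting names below are Scrapy's own; the values are illustrative assumptions rather than tuned recommendations.

```python
# settings.py (fragment): the values are illustrative assumptions
AUTOTHROTTLE_ENABLED = True            # turn the extension on
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0          # upper bound for the delay when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server
AUTOTHROTTLE_DEBUG = False             # set to True to log throttling stats per response
DOWNLOAD_DELAY = 0.5                   # AutoThrottle never goes below this delay
```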

##### Advantages of using Scrapy
- Easy-to-follow documentation
- Doesn’t require other dependencies (unless working with JavaScript)
- Can be used for large-scale scraping
- Memory-efficient structure

##### Disadvantages of using Scrapy
- Cannot handle JavaScript
- Steep learning curve

Scrapy is a Python web scraping framework built around Twisted, an asynchronous networking engine. This means that it wasn't built on Python's standard asyncio library. Instead, it uses Twisted's event-driven networking infrastructure, allowing for more efficiency and scalability.

That being said, we don't have to interact with the underlying architecture. Scrapy abstracts it away with its own interface. From the development perspective, we'll mostly express the equivalent logic with callbacks and generators.

![scrapy](scrapy-spider.svg)

The above illustration explains the Scrapy architecture in simple terms. Scrapy comes with an engine called Crawler (light blue). It handles the low-level logic, such as the HTTP connection, scheduling and the entire execution flow.

The high-level logic (dark blue), on the other hand, is not provided: we have to supply it ourselves. This part is called the Spider, and it defines what to scrape and how to do it. In simple terms, we must provide the Crawler with a Spider object that generates the requests and parses the responses into the data we want to store.

Before we create our first Spider, let's define some common Scrapy terms:

- Callback

Scrapy is an asynchronous framework. Therefore, most actions are executed in the background, which allows for highly concurrent and efficient logic. In this context, a callback is a function attached to a background task that is called when the task finishes successfully.

- Errback

A similar function to the callback, but it is triggered when a task fails instead of when it succeeds.

- Generator

In Python, generators are functions that return results one at a time instead of all at once like a list.

- Settings

Scrapy's central configuration object is called settings, and its project-level values live in the project's settings.py file.
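
The generator idea from the list above can be sketched in plain Python, without Scrapy. The function names squares_list and squares_gen are made up for this illustration.

```python
from typing import Iterator


def squares_list(n: int) -> list[int]:
    """Build every result up front and return them all at once."""
    return [i * i for i in range(n)]


def squares_gen(n: int) -> Iterator[int]:
    """Yield results one at a time; nothing runs until the caller iterates."""
    for i in range(n):
        yield i * i


gen = squares_gen(3)
first = next(gen)   # computes only the first value
rest = list(gen)    # consumes the remaining values
print(first, rest)      # 0 [1, 4]
print(squares_list(3))  # [0, 1, 4]
```

A spider's parse method works the same way: each yield hands one item or request back to Scrapy, which is why spiders rarely build and return full lists.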

|Criteria|Scrapy|Beautiful Soup|
|--------|------|--------------|
|Purpose|Web scraping and crawling|Parsing|
|Speed|Fast|Average|
|Scraping projects|Small to large scale|Small to medium scale|
|Scalability|Highly scalable and can handle large-scale projects|Not as suitable for large-scale projects|
|Asynchronous|Yes|No|
|Crawling|Designed for web scraping and crawling|Focused on parsing and manipulating HTML|
|Extensions|High|Limited|
|Browser support|No|No|
|Headless execution|No|No|
|Browser interaction|No|No|

In [None]:
!pip install scrapy beautifulsoup4

In [1]:
import scrapy

### Step 1: Create a Scrapy Project

In [2]:
!scrapy startproject bookscraper
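# Note: `!cd` runs in a temporary subshell, so it does not change the notebook's
# working directory; use `%cd bookscraper` if the change should persist.
!cd bookscraper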

Error: scrapy.cfg already exists in C:\Users\sadiq\OneDrive\Desktop\GUVI\bookscraper


This creates a structured project with folders for spiders, pipelines, settings, etc.

### Step 2: Create a Spider

Inside the project package, navigate to bookscraper/bookscraper/spiders/ and create a file called book_spider.py.

In [3]:
import os
os.getcwd()

'C:\\Users\\sadiq\\OneDrive\\Desktop\\GUVI'

In [4]:
%%writefile "./bookscraper/bookscraper/spiders/book_spider.py"

import scrapy

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('.price_color::text').get()
            }

        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Overwriting ./bookscraper/bookscraper/spiders/book_spider.py


### Step 3: Run the Spider

In [6]:
import os
from scrapy.cmdline import execute

os.chdir('./bookscraper')  # Must contain scrapy.cfg
execute(['scrapy', 'crawl', 'books', '-o', 'books.json'])


2025-09-17 13:30:40 [scrapy.utils.log] INFO: Scrapy 2.13.3 started (bot: bookscraper)
2025-09-17 13:30:40 [scrapy.utils.log] INFO: Versions:
{'lxml': '6.0.1',
 'libxml2': '2.11.9',
 'cssselect': '1.3.0',
 'parsel': '1.10.0',
 'w3lib': '2.3.1',
 'Twisted': '25.5.0',
 'Python': '3.11.13 | packaged by Anaconda, Inc. | (main, Jun  5 2025, '
           '13:03:15) [MSC v.1929 64 bit (AMD64)]',
 'pyOpenSSL': '25.1.0 (OpenSSL 3.5.2 5 Aug 2025)',
 'cryptography': '45.0.7',
 'Platform': 'Windows-10-10.0.26100-SP0'}
2025-09-17 13:30:40 [scrapy.addons] INFO: Enabled addons:
[]
2025-09-17 13:30:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-09-17 13:30:40 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-09-17 13:30:40 [scrapy.extensions.telnet] INFO: Telnet Password: 88864597e51032cb
2025-09-17 13:30:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStat

RuntimeError: This event loop is already running

2025-09-17 13:30:41 [scrapy.core.engine] INFO: Spider opened
2025-09-17 13:30:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-09-17 13:30:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2025-09-17 13:30:42 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://books.toscrape.com/robots.txt> (referer: None)
2025-09-17 13:30:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/> (referer: None)
2025-09-17 13:30:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com/>
{'title': 'A Light in the Attic', 'price': '£51.77'}
2025-09-17 13:30:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com/>
{'title': 'Tipping the Velvet', 'price': '£53.74'}
2025-09-17 13:30:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com/>
{'title': 'Soumission', 'price': '£50.10'}
2025-09-17 13:30:43 [scrapy.core.scraper] DEBUG: Scraped 

This crawls all pages and saves book titles and prices to books.json.
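
The RuntimeError in the output above appears to come from calling scrapy.cmdline.execute inside a notebook kernel, where an event loop is already running. One possible workaround, sketched here under that assumption, is to run the crawl in a separate process. The helper names build_crawl_command and run_crawl are hypothetical; the spider name and output file are the ones used above.

```python
import subprocess
import sys


def build_crawl_command(spider_name: str, output_file: str) -> list[str]:
    """Return the argv for `scrapy crawl`, run through the current interpreter."""
    # -O overwrites the output file, while -o appends to it.
    return [sys.executable, "-m", "scrapy", "crawl", spider_name, "-O", output_file]


def run_crawl(spider_name: str, output_file: str, project_dir: str) -> int:
    """Run the crawl in a child process started inside project_dir.

    project_dir must be the folder that contains scrapy.cfg.
    """
    result = subprocess.run(build_crawl_command(spider_name, output_file), cwd=project_dir)
    return result.returncode


# Show the command without running it (the first element is the interpreter path).
print(" ".join(build_crawl_command("books", "books.json")[1:]))
# -m scrapy crawl books -O books.json
```

Calling run_crawl("books", "books.json", "./bookscraper") from a notebook cell would then start the crawl outside the kernel's event loop. Passing cwd also avoids os.chdir, so the notebook's working directory stays put between runs.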


### How to write crawling logic

##### 1. Start Points (start_urls or start_requests)

- Define where your crawl begins.
- Use static URLs or dynamically generate them from a database, sitemap, or API.


In [None]:
class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ['https://example.com']

    def parse(self, response):
        ...

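For large-scale or multi-domain crawls, override start_requests() to yield requests dynamically. (Scrapy 2.13 and later prefer the asynchronous start() method for the same purpose; start_requests() is deprecated there.)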

##### 2. Link Following Strategy
- Decide whether to follow links recursively.
- Use response.css('a::attr(href)') to extract links and response.follow() to follow them.


In [None]:
def parse(self, response):
    for href in response.css('a::attr(href)').getall():
        yield response.follow(href, self.parse)

You can limit depth using DEPTH_LIMIT in settings.py.

##### 3. Domain Scope (allowed_domains)
- Restrict crawling to specific domains to avoid going off-target, for example: allowed_domains = ['example.com']
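
To build intuition for what a domain restriction does, here is a plain-Python approximation of the check: a URL is in scope when its host is an allowed domain or one of its subdomains. This is an illustrative sketch, not Scrapy's actual implementation, and the function name is_allowed is made up.

```python
from urllib.parse import urlparse


def is_allowed(url: str, allowed_domains: list[str]) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


allowed = ["example.com"]
print(is_allowed("https://example.com/page", allowed))      # True
print(is_allowed("https://shop.example.com/cart", allowed)) # True
print(is_allowed("https://notexample.com/", allowed))       # False
```

Matching on the dot-prefixed suffix is what keeps look-alike hosts such as notexample.com out of scope.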

##### 4. Data Extraction Logic
- Use CSS or XPath selectors to extract structured data.
- Keep selectors clean and test them in scrapy shell.
    

In [None]:
yield {
    'title': response.css('h1::text').get(),
    'price': response.css('.price::text').get()
}

##### 5. Pagination Handling
- Detect and follow “next page” links.

In [None]:
next_page = response.css('li.next a::attr(href)').get()
if next_page:
    yield response.follow(next_page, self.parse)

##### 6. Broad vs Focused Crawl
- Focused Crawl: Deep crawl of a single domain.
- Broad Crawl: Shallow crawl across many domains, often time-limited.

For broad crawls:
- Use Breadth-First Order to reduce memory usage.
- Tune concurrency (CONCURRENT_REQUESTS) and DNS resolution settings.
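
The broad-crawl advice above maps onto a few settings. The setting names are Scrapy's; the numbers are illustrative assumptions that would need tuning for a real crawl.

```python
# settings.py (fragment): illustrative values for a broad crawl

# Crawl in breadth-first order (BFO) instead of the default depth-first order.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

CONCURRENT_REQUESTS = 100            # global cap on parallel requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # stay polite to each individual site
REACTOR_THREADPOOL_MAXSIZE = 20      # more threads for DNS lookups
DEPTH_LIMIT = 2                      # keep the crawl shallow
CLOSESPIDER_TIMEOUT = 3600           # stop the crawl after one hour (in seconds)
```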

##### Debugging and Refining Logic
- Use scrapy shell to test selectors and response structure.
- Use scrapy check to run the contract checks declared in your spider callbacks' docstrings.
- Log URLs and items to verify crawl paths.


In [1]:
import scrapy
from urllib.parse import urlparse

class DualDomainSpider(scrapy.Spider):
    name = "dual"
    allowed_domains = ["quotes.toscrape.com", "books.toscrape.com"]
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
        "http://books.toscrape.com/catalogue/page-1.html"
    ]

    custom_settings = {
        'FEEDS': {
            'output.json': {
                'format': 'json',
                'encoding': 'utf8',
                'indent': 4,
            }
        },
        'DOWNLOAD_DELAY': 1,
        'DEPTH_LIMIT': 2,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 1,
        'AUTOTHROTTLE_MAX_DELAY': 3,
    }

    def parse(self, response):
        domain = urlparse(response.url).netloc

        if "quotes.toscrape.com" in domain:
            for quote in response.css("div.quote"):
                yield {
                    "domain": domain,
                    "type": "quote",
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall()
                }

            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, self.parse)

        elif "books.toscrape.com" in domain:
            for book in response.css("article.product_pod"):
                yield {
                    "domain": domain,
                    "type": "book",
                    "title": book.css("h3 a::attr(title)").get(),
                    "price": book.css(".price_color::text").get()
                }

            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, self.parse)

In [2]:
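# Reduces log noise, but only takes effect when placed in the project's
# settings.py or in a spider's custom_settings, not as a notebook variable.
LOG_LEVEL = 'INFO'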

In [7]:
import os
from scrapy.cmdline import execute
os.chdir('./bookscraper')
execute(['scrapy', 'crawl', 'books', '-o', 'books.json'])

2025-09-17 14:10:15 [scrapy.utils.log] INFO: Scrapy 2.13.3 started (bot: bookscraper)
2025-09-17 14:10:16 [scrapy.utils.log] INFO: Versions:
{'lxml': '6.0.1',
 'libxml2': '2.11.9',
 'cssselect': '1.3.0',
 'parsel': '1.10.0',
 'w3lib': '2.3.1',
 'Twisted': '25.5.0',
 'Python': '3.11.13 | packaged by Anaconda, Inc. | (main, Jun  5 2025, '
           '13:03:15) [MSC v.1929 64 bit (AMD64)]',
 'pyOpenSSL': '25.1.0 (OpenSSL 3.5.2 5 Aug 2025)',
 'cryptography': '45.0.7',
 'Platform': 'Windows-10-10.0.26100-SP0'}
2025-09-17 14:10:16 [scrapy.addons] INFO: Enabled addons:
[]
2025-09-17 14:10:16 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-09-17 14:10:16 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-09-17 14:10:16 [scrapy.extensions.telnet] INFO: Telnet Password: be0dcf71dc82c87e
2025-09-17 14:10:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStat

RuntimeError: This event loop is already running

2025-09-17 14:10:16 [scrapy.core.engine] INFO: Spider opened
2025-09-17 14:10:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-09-17 14:10:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2025-09-17 14:10:17 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://books.toscrape.com/robots.txt> (referer: None)
2025-09-17 14:10:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/> (referer: None)
2025-09-17 14:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com/>
{'title': 'A Light in the Attic', 'price': '£51.77'}
2025-09-17 14:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com/>
{'title': 'Tipping the Velvet', 'price': '£53.74'}
2025-09-17 14:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com/>
{'title': 'Soumission', 'price': '£50.10'}
2025-09-17 14:10:18 [scrapy.core.scraper] DEBUG: Scraped 