# Building Your Own Dataset: Web Scraping with Scrapy in Jupyter Notebooks

Training machine learning models requires accurate datasets from the right sources. One way to procure, process, and validate that data is to collect it yourself through web scraping.

In this guide, we build a Scrapy spider to scrape data from a website, cover the fundamentals of Scrapy, and load the scraped JSON data into a pandas DataFrame, all inside a Jupyter notebook.

## Getting Started

### Tools used

Ensure that you have the following tools installed on your system:

1. [Python 3](https://realpython.com/installing-python/)
2. [Jupyter Notebook](https://jupyter.org/install)

Once both are installed, open a Jupyter notebook and verify the installation by running the following commands:



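In [80]:
# Show the Python and Jupyter versions
import platform
import jupyter_core

# A notebook cell only displays its last bare expression, so print both values
print(platform.python_version())
print(jupyter_core.__version__)

# Note: a shell assignment such as !REQUEST_FINGERPRINTER_IMPLEMENTATION="2.7"
# does not reach Scrapy. The deprecation message is handled later through the
# spider's custom_settings.

3.10.12
5.7.2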

### Installing dependencies

Install the following Python packages using `pip`:

> Recommended: Install the dependencies in a [virtual environment](https://docs.python.org/3/library/venv.html) so they do not interfere with system-level packages.

1. [Scrapy](https://docs.scrapy.org/en/latest): Web crawling framework used for web scraping.
2. [Crochet](https://pypi.org/project/crochet/): Helper package that runs the Twisted reactor in a background thread, so Scrapy can be used from a Jupyter Notebook environment.

In [81]:
# Install Scrapy and Crochet into the environment used by the notebook kernel
%pip install scrapy crochet

# Import Scrapy to confirm the installation
import scrapy



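Scrapy runs on Twisted, whose reactor cannot be restarted once it has stopped. Running a crawl more than once in the same notebook kernel would otherwise raise the `ReactorNotRestartable` error. To avoid this, we call Crochet's `setup()`, which runs the reactor in a background thread.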

In [82]:
# Start the Twisted reactor in a background thread so crawls can be re-run
from crochet import setup, wait_for
setup()

## Creating our Spider

In this guide, we use a spider named `ValidMindSpider` to extract blog post metadata and content from [https://validmind.com/blog/](https://validmind.com/blog/). We go step by step: gathering blog post links, extracting content using [XPath selectors](https://docs.scrapy.org/en/2.11/topics/selectors.html), and paginating through the remaining pages to repeat the same process.

### Finding all blog post links on the current page

Scrapy takes a `start_urls` list as the starting point of a crawl. The `parse()` method is the default callback that runs on the response for each URL in the list. In this method, we define which HTML elements to pick up. In this case, our XPath selector, `//div[contains(@class, "x-row-inner")][1]/a/@href`, selects the `href` of every `a` element that is a direct child of a `div` whose class contains `x-row-inner`; the `[1]` keeps only the first such `div` under each parent element. Learn more about [XPath Selectors](https://bugbug.io/blog/testing-frameworks/the-ultimate-xpath-cheat-sheet/).

Run the following code to extract the links:


In [83]:
from scrapy.spiders import Spider
from scrapy.crawler import CrawlerRunner

class ValidMindSpider(Spider):
    name = 'valid'
    start_urls = ['https://validmind.com/blog/']  # FIRST LEVEL

    # 1. FOLLOWING LINKS
    def parse(self, response):
        # Extract all the blog post links from the current page
        links = response.xpath('//div[contains(@class, "x-row-inner")][1]/a/@href').getall()

        # Print each blog post link found on the page
        for link in links:
            print(link)

# Block until the crawl finishes; Crochet raises TimeoutError after 10 seconds
@wait_for(10)
def run_spider():
    crawler = CrawlerRunner()
    d = crawler.crawl(ValidMindSpider)
    return d

run_spider()

https://validmind.com/validmind-launches-validmind-advantage-program-to-bring-trust-and-transparency-to-third-party-ai-model-vendors/
https://validmind.com/upcoming-webinar-driving-compliant-ai-innovation-in-financial-services/
https://validmind.com/unlock-the-power-of-genai-in-financial-services-join-us-for-an-exclusive-london-event/
https://validmind.com/ai-regulation-a-competitive-advantage-not-a-barrier/
https://validmind.com/complying-with-occ-2011-12-in-model-risk-management/
https://validmind.com/validmind-wins-risk-technology-award-named-model-validation-service-of-the-year/
https://validmind.com/mrm-teams-compliance-with-basel-regulations-2/
https://validmind.com/eu-ai-act-key-insights-and-compliance-for-mrm-teams/
https://validmind.com/5-essential-steps-for-banks-to-implement-llms-safely/
https://validmind.com/validmind-recognized-as-industry-leader-in-chartis-risktech-ai-50/
https://validmind.com/nist-model-risk-management-strategic-compliance/
https://validmind.com/how-bank

### Extracting content from blogs

After extracting blog links from the page, we need to dive deeper and extract the content of each blog post.

To do that, we create the `extract_blog` method. For each post, we want the heading, date, author, and content, each of which can be picked out of the HTML response with an XPath selector. In the code below, the `content` selector is commented out, so the output contains only the heading, author, and date.

In [84]:
from scrapy.spiders import Spider

class ValidMindSpider(Spider):
    name = 'valid'
    start_urls = ['https://validmind.com/blog/']  # FIRST LEVEL

    # 1. FOLLOWING
    def parse(self, response):
        # Extract all the blog post links from the current page
        links = response.xpath('//div[contains(@class, "x-row-inner")][1]/a/@href').getall()

        # Follow each blog post link and parse it with extract_blog
        for link in links:
            yield response.follow(link, self.extract_blog)

    # 2. DATA EXTRACTION
    def extract_blog(self, response):
        # Extract blog post data from the individual blog post
        data = {
            "heading": response.xpath('//h1/text()').get(),
            "author": response.xpath('//div[contains(@class, "pp-author-boxes-name")]/a/@title').get(),
            "date": response.xpath('//div[@class="x-text-content-text"]//span[@class="x-text-content-text-subheadline"]/text()').get(),
            # "content": response.xpath('//div[@class="x-text x-content e129323-e11 m2rsb-y m2rsb-z"]/p/text()').getall()
        }
        print(data)

from scrapy.crawler import CrawlerRunner

@wait_for(10)
def run_spider():
    crawler = CrawlerRunner()
    d = crawler.crawl(ValidMindSpider)
    return d

run_spider()

{'heading': 'ValidMind Launches ValidMind Advantage Program to Bring Trust and Transparency to Third-Party AI Model Vendors', 'author': 'Kevin Allen', 'date': 'September 26, 2024'}
{'heading': 'ValidMind Recognized as Industry Leader in Chartis RiskTech AI 50', 'author': 'Kevin Allen', 'date': 'August 5, 2024'}
{'heading': 'How Banks Can Innovate within the EU AI Act', 'author': 'Kevin Allen', 'date': 'July 23, 2024'}
{'heading': 'Webinar Replay: Driving Compliant AI Innovation in Financial Services', 'author': 'Kevin Allen', 'date': 'September 25, 2024'}
{'heading': 'Unlock the Power of GenAI in Financial Services: Join Us for an Exclusive London Event', 'author': 'Kevin Allen', 'date': 'September 25, 2024'}
{'heading': 'AI Regulation: A Competitive Advantage, Not a Barrier', 'author': 'Jonas Jacobi', 'date': 'September 24, 2024'}
{'heading': 'Ensuring Stability: A Comprehensive Guide to Complying with OCC 2011-12 in Model Risk Management', 'author': 'Emma Jacobi', 'date': 'September 

### Paginating through pages: getting all the data

[Pagination in Scrapy](https://scrapeops.io/python-scrapy-playbook/scrapy-pagination-guide/) is a critical skill. It lets a scraper traverse a website the way a user would and reach all the data it needs to scrape. To keep things straightforward, we use a static list of page URLs for this website.

> In practice, a website can have hundreds or thousands of pages, so [dynamically selecting pagination URLs](https://docs.scrapy.org/en/2.11/topics/dynamic-content.html) is imperative for the reliability of your spider.

The `pagination_links` list holds the URLs of the pages to paginate through. Once all the links on the current page have been queued for extraction, we pop the next page from the list into `next_page_url` and pass it back to `parse()`, which recursively scrapes all the blogs on that page.
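
The static list can also be generated rather than typed out. The sketch below rebuilds the same four URLs from a page range. `build_pagination_links` is a hypothetical helper, and the `_paged` query parameter and the last page number (5) are taken from the URLs used in this guide, so they would need adjusting for another site.

```python
from urllib.parse import urlencode

BLOG_URL = "https://validmind.com/blog/"


def build_pagination_links(base_url, first_page, last_page, param="_paged"):
    """Return one listing URL per page number in [first_page, last_page]."""
    return [
        f"{base_url}?{urlencode({param: page})}"
        for page in range(first_page, last_page + 1)
    ]


# Page 1 is the start URL itself, so pagination begins at page 2
pagination_links = build_pagination_links(BLOG_URL, 2, 5)

for url in pagination_links:
    print(url)
```

This still assumes the number of pages is known in advance. A more robust spider would read the next-page link from each response instead.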

The `custom_settings` attribute gives per-spider control over Scrapy settings. Here we use `FEEDS` to write the scraped items to a JSON file.
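
As a variation on the `FEEDS` setting described above, the sketch below configures a JSON Lines feed, which stores one JSON object per line and therefore stays readable line by line even if a crawl stops early. The output filename and the sample records are hypothetical; the in-memory `feed` only stands in for the file Scrapy would write.

```python
import io
import json

# Hypothetical alternative feed configuration: one JSON object per line
JSONL_FEED_SETTINGS = {
    "FEEDS": {
        "validmind-blogs.jsonl": {  # hypothetical output filename
            "format": "jsonlines",
            "encoding": "utf8",
            "overwrite": True,
        }
    }
}

# Simulated feed contents with made-up records
feed = io.StringIO(
    '{"heading": "Post A", "author": "Jane Doe", "date": "September 26, 2024"}\n'
    '{"heading": "Post B", "author": "John Roe", "date": "August 5, 2024"}\n'
)

# Each non-empty line is parsed independently of the others
records = [json.loads(line) for line in feed if line.strip()]

print(len(records), "records loaded")
```

A file in this format can be loaded with `pd.read_json(path, lines=True)`.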

In [88]:
from scrapy.spiders import Spider

class ValidMindSpider(Spider):
    name = 'valid'
    start_urls = ['https://validmind.com/blog/']  # FIRST LEVEL
    pagination_links = [
        'https://validmind.com/blog/?_paged=2',
        'https://validmind.com/blog/?_paged=3',
        'https://validmind.com/blog/?_paged=4',
        'https://validmind.com/blog/?_paged=5',
    ]

    custom_settings = {
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
        'FEEDS': {
            'validmind-blogs.json': {
                'format': 'json',
                'overwrite': True
            }
        }
    }

    # 1. FOLLOWING
    def parse(self, response):
        # Extract all the blog post links from the current page
        links = response.xpath('//div[contains(@class, "x-row-inner")][1]/a/@href').getall()

        # Follow each blog post link and parse it with extract_blog
        for link in links:
            yield response.follow(link, self.extract_blog)

        # 3. PAGINATION
        # Paginate through the next page if available
        next_page_url = self.pagination_links.pop(0) if self.pagination_links else None

        # Recursively follow the next page if available
        if next_page_url:
            yield response.follow(next_page_url, self.parse)

    # 2. DATA EXTRACTION
    def extract_blog(self, response):
        # Extract blog post data from the individual blog post
        data = {
            "heading": response.xpath('//h1/text()').get(),
            "author": response.xpath('//div[contains(@class, "pp-author-boxes-name")]/a/@title').get(),
            "date": response.xpath('//div[@class="x-text-content-text"]//span[@class="x-text-content-text-subheadline"]/text()').get(),
            # "content": response.xpath('//div[@class="x-text x-content e129323-e11 m2rsb-y m2rsb-z"]/p/text()').getall()
        }
        yield data

from scrapy.crawler import CrawlerRunner

@wait_for(10)
def run_spider():
    crawler = CrawlerRunner()
    d = crawler.crawl(ValidMindSpider)
    return d

run_spider()

## Results

Once `ValidMindSpider` finishes its run, Scrapy writes the scraped items to `validmind-blogs.json`. Load the file with pandas to analyze, clean, and process the scraped data before training a model on it or mining it for insights.


In [89]:
import pandas as pd
validjson = pd.read_json('validmind-blogs.json')
validjson

| | heading | author | date |
|---|---|---|---|
| 0 | ValidMind Launches ValidMind Advantage Program... | Kevin Allen | 2024-09-26 |
| 1 | How Banks Can Innovate within the EU AI Act | Kevin Allen | 2024-07-23 |
| 2 | 6 Essential Strategies MRM Teams Can Use to Ma... | Emma Jacobi | 2024-09-05 |
| 3 | The EU AI Act: Understanding Model Risk Manage... | Emma Jacobi | 2024-08-20 |
| 4 | 5 Essential Steps for Banks to Implement LLMs ... | Kristof Horompoly | 2024-08-15 |
| 5 | Understanding NIST: What All Model Risk Manage... | Emma Jacobi | 2024-08-05 |
| 6 | ValidMind Wins Risk Technology Award: Model Va... | Kevin Allen | 2024-09-05 |
| 7 | ValidMind Recognized as Industry Leader in Cha... | Kevin Allen | 2024-08-05 |
| 8 | Ensuring Stability: A Comprehensive Guide to C... | Emma Jacobi | 2024-09-07 |
| 9 | Webinar Replay: Driving Compliant AI Innovatio... | Kevin Allen | 2024-09-25 |
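
As one possible next step, the sketch below parses the scraped date strings explicitly, counts posts per author, and sorts by date. The three sample rows are copied from the spider output shown earlier and are inlined so the example runs without the JSON file; the date format `%B %d, %Y` is an assumption based on that output.

```python
import io

import pandas as pd

# Three records copied from the spider output shown earlier
sample = """[
  {"heading": "How Banks Can Innovate within the EU AI Act",
   "author": "Kevin Allen", "date": "July 23, 2024"},
  {"heading": "AI Regulation: A Competitive Advantage, Not a Barrier",
   "author": "Jonas Jacobi", "date": "September 24, 2024"},
  {"heading": "ValidMind Recognized as Industry Leader in Chartis RiskTech AI 50",
   "author": "Kevin Allen", "date": "August 5, 2024"}
]"""

# Keep the raw strings, then parse them with an explicit format
df = pd.read_json(io.StringIO(sample), convert_dates=False)
df["date"] = pd.to_datetime(df["date"], format="%B %d, %Y")

# Number of posts per author, most prolific first
posts_per_author = df["author"].value_counts()

# Newest posts first
df = df.sort_values("date", ascending=False).reset_index(drop=True)

print(posts_per_author)
print(df[["date", "heading"]])
```

Parsing with an explicit format makes malformed dates fail loudly instead of being silently left as strings.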


## Conclusion

In this guide, we covered the basics of building your own dataset using Scrapy in a Jupyter notebook. We scraped a website using XPath selectors and pagination, stored the scraped data as JSON, and loaded it into a pandas DataFrame inside the notebook.