# Building Your Own Dataset: Web Scraping with Scrapy in Jupyter Notebooks

Training machine learning models requires accurate datasets from the right sources. One way to procure, process, and validate the data for a dataset is to do it yourself through web scraping.

In this guide, we will build a Scrapy spider to scrape data from a website, learn the fundamentals of Scrapy, and load the scraped JSON data into a pandas DataFrame, all in a Jupyter notebook, to start creating our own dataset.


## Getting Started
### Tools used

Ensure that you have the following tools installed on your system:

1. [Python 3](https://realpython.com/installing-python/)
2. [Jupyter Notebook](https://jupyter.org/install)

Once complete, open a Jupyter notebook and verify that your installation was successful by running the following commands:



In [None]:
# Show Python and Jupyter versions
import platform
import jupyter_core

print(f"Python Version: {platform.python_version()}")
print(f"Jupyter Version: {jupyter_core.__version__}")


### Installing dependencies

Install the following Python packages using `pip3`, the package installer for Python:

> Installing the dependencies into a [virtual environment](https://docs.python.org/3/library/venv.html) is highly recommended so that they do not interfere with system-level packages. See this [guide to creating virtual environments](https://realpython.com/python-virtual-environments-a-primer/).

1. [Scrapy](https://docs.scrapy.org/en/latest): Web crawling framework used for web scraping.
2. [Crochet](https://pypi.org/project/crochet/): Helper package to unblock Twisted in Jupyter Notebook environments.

In [None]:
# Install scrapy, crochet
!pip3 install scrapy==2.11.2
!pip3 install crochet==2.1.1

# import Scrapy
import scrapy

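Scrapy runs on Twisted, whose reactor cannot be restarted once it has stopped, so running a crawl more than once in the same notebook kernel typically raises the `ReactorNotRestartable` error. To avoid this, we use Crochet to set up and manage the reactor for us.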

In [None]:
# Let Crochet start and manage the Twisted reactor so cells can be re-run
from crochet import setup, wait_for
setup()

## Creating our Spider

In this guide, we use the `ValidMindSpider` to extract blog post metadata and content from [https://validmind.com/blog/](https://validmind.com/blog/). We will go step by step: gathering blog post links, extracting content using [XPath selectors](https://docs.scrapy.org/en/2.11/topics/selectors.html), and paginating recursively through pages to repeat the same process.

### Finding all blog post links on the current page

Scrapy uses the `start_urls` list as the starting points of a crawl. The `parse()` method is the default callback that runs on the response for each URL in the list. In this method, we define which HTML elements to pick up. In this case, our XPath selector, `//div[contains(@class, "x-row-inner")][1]/a/@href`, picks up the `href` of every link that is a direct child of the first matching `div` container whose class contains `x-row-inner`. Learn more about [XPath Selectors](https://bugbug.io/blog/testing-frameworks/the-ultimate-xpath-cheat-sheet/).

Run the following code to extract the links:


In [None]:
from scrapy.spiders import Spider
from scrapy.crawler import CrawlerRunner

class ValidMindSpider(Spider):
    name = 'valid'
    start_urls = ['https://validmind.com/blog/']  # FIRST LEVEL

    custom_settings = {
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    }

    # 1. FOLLOWING LINKS
    def parse(self, response):
        # Extract all the blog post links from the current page
        links = response.xpath('//div[contains(@class, "x-row-inner")][1]/a/@href').getall()

        # Print each extracted link to check that the selector works
        for link in links:
            print(link)

To run the spider, we create a new instance of `CrawlerRunner`, a thin wrapper that encapsulates helpers to run multiple crawlers without interfering with existing reactors in any way. We will use the `run_spider()` function to run our `ValidMindSpider`.

In [None]:
from scrapy.crawler import CrawlerRunner

@wait_for(10)
def run_spider(spider):
    crawler = CrawlerRunner()
    return crawler.crawl(spider)

run_spider(ValidMindSpider)

### Extracting content from blogs

After extracting blog links from the page, we need to dive deeper and extract the content of each blog post.

To do that, we create the `extract_blog` method. For each post, we want to extract the heading, date, author, and content (the content selector is left commented out in the code below). Each of these elements in the HTML response can be picked up using XPath selectors.

In [None]:
from scrapy.spiders import Spider

class ValidMindSpider(Spider):
    name = 'valid'
    start_urls = ['https://validmind.com/blog/']  # FIRST LEVEL

    custom_settings = {
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    }

    # 1. FOLLOWING LINKS
    def parse(self, response):
        # Extract all the blog post links from the current page
        links = response.xpath('//div[contains(@class, "x-row-inner")][1]/a/@href').getall()

        # Extract data from each blog post on the blog page
        for link in links:
            yield response.follow(link, self.extract_blog)

    # 2. DATA EXTRACTION
    def extract_blog(self, response):
        # Extract blog post data from the individual blog post
        data = {
            "heading": response.xpath('//h1/text()').get(),
            "author": response.xpath('//div[contains(@class, "pp-author-boxes-name")]/a/@title').get(),
            "date": response.xpath('//div[@class="x-text-content-text"]//span[@class="x-text-content-text-subheadline"]/text()').get(),
            # "content": response.xpath('//div[@class="x-text x-content e129323-e11 m2rsb-y m2rsb-z"]/p/text()').getall()
        }
        print(data)

run_spider(ValidMindSpider)

### Paginate through pages: Get all the data

[Pagination in Scrapy](https://scrapeops.io/python-scrapy-playbook/scrapy-pagination-guide/) is a critical skill. It allows scrapers to traverse websites like a user would and reach the data to be scraped. To keep this guide straightforward, we paginate through a static list of URLs.

> In practice, websites can have hundreds or thousands of pages. Hence, making sure you [dynamically select pagination URLs](https://docs.scrapy.org/en/2.11/topics/dynamic-content.html) is imperative for the reliability of your spider.

The `pagination_links` list holds the URLs of the pages to paginate through. Once all links on the current page have been queued for extraction, we pop the next page from the list into `next_page_url`, which is then used to recursively scrape all the blogs on the next page.

The `custom_settings` attribute gives per-spider control over Scrapy settings, including data output. We use `FEEDS` to write the scraped items to a JSON file.
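`FEEDS` can describe more than one output at a time. The sketch below is a variation on the configuration used in this guide, with an indented JSON file and a JSON Lines file; both file names are hypothetical.

```python
# A sketch of alternative feed settings; the output file names are hypothetical
custom_settings = {
    "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
    "FEEDS": {
        # Indented JSON is easier to read when inspecting the file by hand
        "validmind-blogs-pretty.json": {
            "format": "json",
            "encoding": "utf8",
            "indent": 4,
            "overwrite": True,
        },
        # JSON Lines stores one item per line
        "validmind-blogs.jsonl": {
            "format": "jsonlines",
            "encoding": "utf8",
            "overwrite": True,
        },
    },
}

feed_formats = sorted(feed["format"] for feed in custom_settings["FEEDS"].values())
print(feed_formats)
```

A JSON Lines file can later be loaded with `pd.read_json(path, lines=True)`.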

In [None]:
from scrapy.spiders import Spider

class ValidMindSpider(Spider):
    name = 'valid'
    start_urls = ['https://validmind.com/blog/']  # FIRST LEVEL
    pagination_links = [
        'https://validmind.com/blog/?_paged=2',
        'https://validmind.com/blog/?_paged=3',
        'https://validmind.com/blog/?_paged=4',
        'https://validmind.com/blog/?_paged=5',
    ]

    custom_settings = {
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
        'FEEDS': {
            'validmind-blogs.json': {
                'format': 'json',
                'overwrite': True
            }
        }
    }

    # 1. FOLLOWING LINKS
    def parse(self, response):
        # Extract all the blog post links from the current page
        links = response.xpath('//div[contains(@class, "x-row-inner")][1]/a/@href').getall()

        # Extract data from each blog post on the blog page
        for link in links:
            yield response.follow(link, self.extract_blog)

        # 3. PAGINATION
        # Paginate through the next page if available
        next_page_url = self.pagination_links.pop(0) if self.pagination_links else None

        # Recursively follow the next page if available
        if next_page_url:
            yield response.follow(next_page_url, self.parse)

    # 2. DATA EXTRACTION
    def extract_blog(self, response):
        # Extract blog post data from the individual blog post
        data = {
            "heading": response.xpath('//h1/text()').get(),
            "author": response.xpath('//div[contains(@class, "pp-author-boxes-name")]/a/@title').get(),
            "date": response.xpath('//div[@class="x-text-content-text"]//span[@class="x-text-content-text-subheadline"]/text()').get(),
            # "content": response.xpath('//div[@class="x-text x-content e129323-e11 m2rsb-y m2rsb-z"]/p/text()').getall()
        }
        return data


run_spider(ValidMindSpider)

## Results

Once the `ValidMindSpider` finishes its run, a JSON file named `validmind-blogs.json` is created. Load the JSON data using pandas to further analyze, modify, and process the data you have scraped, whether to start training your model on the dataset or to get valuable insights from it.
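Scraped data usually needs some cleaning once it is loaded. The sketch below uses inline sample records instead of the real output file; the values and the `"%B %d, %Y"` date format are assumptions and may not match what the site actually returns.

```python
from io import StringIO

import pandas as pd

# Hypothetical records shaped like the spider's output
sample_json = """[
  {"heading": " First post ", "author": "Jane Doe", "date": "June 3, 2024"},
  {"heading": "Second post", "author": null, "date": "May 20, 2024"},
  {"heading": null, "author": "John Roe", "date": null}
]"""

# convert_dates=False keeps the raw strings so the parsing step is explicit
df = pd.read_json(StringIO(sample_json), convert_dates=False)

# Trim stray whitespace left over from the HTML
df["heading"] = df["heading"].str.strip()

# Parse dates; values that do not match the assumed format become NaT
df["date"] = pd.to_datetime(df["date"], format="%B %d, %Y", errors="coerce")

# Drop rows where the heading selector matched nothing
clean = df.dropna(subset=["heading"]).reset_index(drop=True)
print(clean)
```

The same steps could be applied to the DataFrame loaded from `validmind-blogs.json` after adjusting the date format to the scraped values.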


In [None]:
import pandas as pd
validjson = pd.read_json('validmind-blogs.json')
validjson

## Conclusion

In this guide, we went over the basics of building your own dataset using Scrapy in a Jupyter notebook. We scraped a website using XPath selectors and pagination, stored the scraped data as JSON, and loaded it into a pandas DataFrame in the notebook.