# Using Ray for Web Scraping

This example shows how to use Ray to scrape information from the web. Sophisticated Python libraries already exist for this task (like [https://scrapy.org/](https://scrapy.org/)). Here we keep things very simple: we adapt existing code from [https://www.scrapingbee.com/blog/crawling-python/](https://www.scrapingbee.com/blog/crawling-python/) and show how little it takes to parallelize that code with Ray.

First, install the required dependencies:

```
pip install requests bs4
```

With the dependencies installed, the example from [https://www.scrapingbee.com/blog/crawling-python/](https://www.scrapingbee.com/blog/crawling-python/) runs out of the box:

```python
import logging
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

logging.basicConfig(
    format='%(asctime)s %(levelname)s:%(message)s',
    level=logging.INFO)
```

```python
class Crawler:

    def __init__(self, urls=None):
        self.visited_urls = []
        # Copy the seeds; a mutable default argument would be shared across instances.
        self.urls_to_visit = list(urls) if urls else []

    def download_url(self, url):
        text = requests.get(url).text
        return text

    def get_linked_urls(self, url, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            path = link.get('href')
            # Resolve site-relative links against the current page URL.
            if path and path.startswith('/'):
                path = urljoin(url, path)
            yield path

    def add_url_to_visit(self, url):
        # Queue a URL only if it is neither visited nor already pending.
        if url not in self.visited_urls and url not in self.urls_to_visit:
            self.urls_to_visit.append(url)

    def crawl(self, url):
        html = self.download_url(url)
        for url in self.get_linked_urls(url, html):
            self.add_url_to_visit(url)

    def run(self):
        while self.urls_to_visit:
            url = self.urls_to_visit.pop(0)
            logging.info(f'URLs: {len(self.visited_urls) + len(self.urls_to_visit)}')
            try:
                self.crawl(url)
            except Exception:
                # logging.exception(f'Failed to crawl: {url}')
                pass
            finally:
                self.visited_urls.append(url)

if __name__ == '__main__':
    Crawler(urls=['https://en.wikipedia.org/']).run()
```

```
2022-06-24 01:22:09,896 INFO:URLs: 0
2022-06-24 01:22:10,168 INFO:URLs: 264
2022-06-24 01:22:10,169 INFO:URLs: 264
2022-06-24 01:22:10,170 INFO:URLs: 264
2022-06-24 01:22:10,171 INFO:URLs: 264
2022-06-24 01:22:11,137 INFO:URLs: 3310
2022-06-24 01:22:11,535 INFO:URLs: 3899
2022-06-24 01:22:12,667 INFO:URLs: 4520
2022-06-24 01:22:12,940 INFO:URLs: 4579
2022-06-24 01:22:13,424 INFO:URLs: 4651
2022-06-24 01:22:14,379 INFO:URLs: 7433
2022-06-24 01:22:14,571 INFO:URLs: 7868
2022-06-24 01:22:14,727 INFO:URLs: 7986
2022-06-24 01:22:14,820 INFO:URLs: 8045
2022-06-24 01:22:15,710 INFO:URLs: 10009
2022-06-24 01:22:15,809 INFO:URLs: 10089
2022-06-24 01:22:16,260 INFO:URLs: 11088
2022-06-24 01:22:16,413 INFO:URLs: 11147
2022-06-24 01:22:17,254 INFO:URLs: 13067
2022-06-24 01:22:18,097 INFO:URLs: 14548
2022-06-24 01:22:18,512 INFO:URLs: 15108
2022-06-24 01:22:18,921 INFO:URLs: 15661
2022-06-24 01:22:19,220 INFO:URLs: 15978
2022-06-24 01:22:19,464 INFO:URLs: 16383
2022-06-24 01:22:19,677 INFO:URLs: 16

KeyboardInterrupt: 
```
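Note that `get_linked_urls` above resolves only hrefs that start with `/` and yields everything else unchanged, including `None` for anchors without an `href`, fragment-only links, and non-HTTP schemes such as `mailto:`. As one possible refinement (not part of the original example), the sketch below normalizes raw hrefs with the standard library before they would be queued. The helper name `normalize_links` and its filtering policy are assumptions made for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Yield unique absolute http(s) URLs for the given raw href values.

    Hypothetical helper, not part of the original crawler.
    """
    seen = set()
    for href in hrefs:
        if not href:
            # Skip anchors without an href (None) and empty strings.
            continue
        # Resolve relative references against the page URL, then drop the fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            # Ignore mailto:, javascript:, and other non-HTTP schemes.
            continue
        if absolute not in seen:
            seen.add(absolute)
            yield absolute


links = list(normalize_links(
    "https://en.wikipedia.org/wiki/Main_Page",
    ["/wiki/Ray", "#top", None, "mailto:someone@example.org",
     "https://example.org/a#section", "/wiki/Ray"],
))
print(links)
```

In this run the fragment-only link `#top` collapses to the page's own URL, which a crawler would then skip because it has already been visited.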

To parallelize the crawl, we first initialize Ray.

```python
import ray
ray.init()
```

```
2022-06-24 01:23:28,882	INFO services.py:1476 -- View the Ray dashboard at http://127.0.0.1:8267

RayContext(dashboard_url='127.0.0.1:8267', python_version='3.7.4', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', address_info={'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-06-24_01-23-26_513063_20104/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-06-24_01-23-26_513063_20104/sockets/raylet', 'webui_url': '127.0.0.1:8267', 'session_dir': '/tmp/ray/session_2022-06-24_01-23-26_513063_20104', 'metrics_export_port': 51445, 'gcs_address': '127.0.0.1:52552', 'address': '127.0.0.1:52552', 'node_id': '2c1ab381c3fe5c41a8e4b50b8862e097696fd38410bd0b26a1902e8f'})
```

We need to keep track of which URLs have already been crawled, so that we do not visit them twice, and of which URLs still need to be visited. We do this by centralizing this state in an actor, `CrawlQueue`.

```python
import asyncio

import ray

@ray.remote
class CrawlQueue:
    # Initialize the crawl queue with a set of seed URLs to be crawled.
    async def __init__(self, seed_urls):
        logging.basicConfig(
            format='%(asctime)s %(levelname)s:%(message)s',
            level=logging.INFO)
        # A queue of pending crawl requests
        self.pending_crawl_requests = asyncio.Queue()
        # All crawl requests - pending, in progress, and completed
        self.all_crawl_requests = set()
        for url in seed_urls:
            await self.add_crawl_request(url)

    # Add additional URLs to be crawled - this is called each time a crawler
    # encounters a URL in the document it is processing.
    async def add_crawl_request(self, url):
        if url not in self.all_crawl_requests:
            await self.pending_crawl_requests.put(url)
            self.all_crawl_requests.add(url)

    # Get a URL to crawl - this is called from an idle crawler.
    # It returns the URL to be crawled.
    async def get_crawl_request(self):
        logging.info(f'URLs: {len(self.all_crawl_requests)}')
        return await self.pending_crawl_requests.get()
```
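To see the deduplication logic in isolation, the following sketch reproduces the two queue methods as a plain class and drives them with `asyncio`, without Ray. The class name `LocalCrawlQueue` and the example URLs are hypothetical stand-ins; the real `CrawlQueue` runs as a Ray actor, which this sketch does not attempt to model.

```python
import asyncio


class LocalCrawlQueue:
    """Plain-asyncio stand-in for the CrawlQueue actor (illustration only)."""

    def __init__(self):
        # URLs waiting to be handed to a crawler.
        self.pending_crawl_requests = asyncio.Queue()
        # Every URL ever accepted: pending, in progress, or completed.
        self.all_crawl_requests = set()

    async def add_crawl_request(self, url):
        if url not in self.all_crawl_requests:
            await self.pending_crawl_requests.put(url)
            self.all_crawl_requests.add(url)

    async def get_crawl_request(self):
        return await self.pending_crawl_requests.get()


async def demo():
    queue = LocalCrawlQueue()
    # The third URL repeats the first and should be ignored.
    for url in ["https://example.org/a", "https://example.org/b", "https://example.org/a"]:
        await queue.add_crawl_request(url)
    pending_after_adds = queue.pending_crawl_requests.qsize()
    first = await queue.get_crawl_request()
    # A handed-out URL stays in all_crawl_requests, so re-adding it is a no-op.
    await queue.add_crawl_request(first)
    return pending_after_adds, first, queue.pending_crawl_requests.qsize()


result = asyncio.run(demo())
print(result)
```

Because a URL is never removed from `all_crawl_requests`, a page that has already been handed to a crawler cannot re-enter the queue, which is what prevents double visits.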

<!-- #raw -->
```{eval-rst}
.. code-block:: python
    :emphasize-lines: 21, 30, 39, 40, 42, 45, 46

    class RayCrawler:

        def __init__(self, crawl_queue):
            self.crawl_queue = crawl_queue
            self.num_processed_bytes = 0

        def download_url(self, url):
            text = requests.get(url).text
            self.num_processed_bytes += len(text)
            return text

        def get_linked_urls(self, url, html):
            soup = BeautifulSoup(html, 'html.parser')
            for link in soup.find_all('a'):
                path = link.get('href')
                if path and path.startswith('/'):
                    path = urljoin(url, path)
                yield path

        def add_url_to_visit(self, url):
            self.crawl_queue.add_crawl_request.remote(url)

        def crawl(self, url):
            html = self.download_url(url)
            for url in self.get_linked_urls(url, html):
                self.add_url_to_visit(url)

        def run(self):
            while True:
                url = ray.get(self.crawl_queue.get_crawl_request.remote())
                logging.info(f'Crawling: {url}')
                logging.info(f'Bytes: {self.num_processed_bytes}')
                try:
                    self.crawl(url)
                except Exception:
                    # logging.exception(f'Failed to crawl: {url}')
                    pass
                
    @ray.remote
    def worker(crawl_queue):
        logging.basicConfig(level=logging.INFO)
        RayCrawler(crawl_queue).run()

    if __name__ == '__main__':
        crawl_queue = CrawlQueue.remote(['https://en.wikipedia.org/'])
        ray.get([worker.remote(crawl_queue) for i in range(5)])
```
<!-- #endraw -->

```python
class RayCrawler:

    def __init__(self, crawl_queue):
        self.crawl_queue = crawl_queue

    def download_url(self, url):
        text = requests.get(url).text
        return text

    def get_linked_urls(self, url, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            path = link.get('href')
            if path and path.startswith('/'):
                path = urljoin(url, path)
            yield path

    def add_url_to_visit(self, url):
        self.crawl_queue.add_crawl_request.remote(url)

    def crawl(self, url):
        html = self.download_url(url)
        for url in self.get_linked_urls(url, html):
            self.add_url_to_visit(url)

    def run(self):
        while True:
            url = ray.get(self.crawl_queue.get_crawl_request.remote())
            try:
                self.crawl(url)
            except Exception:
                # logging.exception(f'Failed to crawl: {url}')
                pass

@ray.remote
def worker(crawl_queue):
    logging.basicConfig(level=logging.INFO)
    RayCrawler(crawl_queue).run()

if __name__ == '__main__':
    crawl_queue = CrawlQueue.remote(['https://en.wikipedia.org/'])
    ray.get([worker.remote(crawl_queue) for _ in range(4)])
```

```
(CrawlQueue pid=20236) 2022-06-24 01:23:42,862 INFO:URLs: 1
(CrawlQueue pid=20236) 2022-06-24 01:23:42,863 INFO:URLs: 1
(CrawlQueue pid=20236) 2022-06-24 01:23:42,863 INFO:URLs: 1
(CrawlQueue pid=20236) 2022-06-24 01:23:42,863 INFO:URLs: 1
(CrawlQueue pid=20236) 2022-06-24 01:23:43,065 INFO:URLs: 40
(CrawlQueue pid=20236) 2022-06-24 01:23:43,065 INFO:URLs: 40
(CrawlQueue pid=20236) 2022-06-24 01:23:43,065 INFO:URLs: 40
(CrawlQueue pid=20236) 2022-06-24 01:23:43,161 INFO:URLs: 262
(CrawlQueue pid=20236) 2022-06-24 01:23:43,264 INFO:URLs: 327
(CrawlQueue pid=20236) 2022-06-24 01:23:43,637 INFO:URLs: 997
(CrawlQueue pid=20236) 2022-06-24 01:23:43,904 INFO:URLs: 1664
(CrawlQueue pid=20236) 2022-06-24 01:23:44,828 INFO:URLs: 3537
(CrawlQueue pid=20236) 2022-06-24 01:23:44,933 INFO:URLs: 3740
...
(CrawlQueue pid=20236) 2022-06-24 01:24:36,195 INFO:URLs: 84420
(CrawlQueue pid=20236) 2022-06-24 01:24:36,733 INFO:URLs: 85094
(CrawlQueue pid=20236) 2022-06-24 01:24:36,935 INFO:URLs: 85221
(CrawlQueue pid=20236) 2022-06-24 01:24:37,995 INFO:URLs: 86190
(CrawlQueue pid=20236) 2022-06-24 01:24:38,816 INFO:URLs: 86995
(CrawlQueue pid=20236) 2022-06-24 01:24:39,070 INFO:URLs: 87348
(CrawlQueue pid=20236) 2022-06-24 01:24:39,215 INFO:URLs: 87517
(CrawlQueue pid=20236) 2022-06-24 01:24:39,383 INFO:URLs: 87707
(CrawlQueue pid=20236) 2022-06-24 01:24:40,026 INFO:URLs: 88526
(CrawlQueue pid=20236) 2022-06-24 01:24:40,654 INFO:URLs: 89269
(CrawlQueue pid=20236) 2022-06-24 01:24:40,856 INFO:URLs: 89389
(CrawlQueue pid=20236) 2022-06-24 01:24:41,212 INFO:URLs: 89914
(CrawlQueue pid=20236) 2022-06-24 01:24:41,452 INFO:URLs: 90209

KeyboardInterrupt: 
```
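Both crawlers above run until they are interrupted. One way to make such a crawl terminate on its own is to cap the number of URLs accepted into the queue. The sketch below illustrates that idea with the same queue-plus-workers structure, but uses plain `asyncio` tasks and an in-memory link graph in place of Ray workers and real HTTP requests, so it runs offline. `LINK_GRAPH`, `fetch_links`, `bounded_crawl`, and `max_urls` are all hypothetical names introduced for illustration and are not part of the Ray example.

```python
import asyncio

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINK_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": [],
    "f": [],
}


async def fetch_links(url):
    """Stand-in for downloading and parsing a page."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return LINK_GRAPH.get(url, [])


async def bounded_crawl(seed_urls, num_workers=4, max_urls=4):
    pending = asyncio.Queue()
    seen = set()

    def add(url):
        # Accept a URL only if it is new and the cap has not been reached.
        if url not in seen and len(seen) < max_urls:
            seen.add(url)
            pending.put_nowait(url)

    async def worker():
        while True:
            url = await pending.get()
            try:
                for linked_url in await fetch_links(url):
                    add(linked_url)
            finally:
                # Mark the request as finished so pending.join() can return.
                pending.task_done()

    for url in seed_urls:
        add(url)
    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]
    await pending.join()  # returns once every accepted URL has been processed
    for task in workers:
        task.cancel()  # idle workers are blocked on pending.get()
    await asyncio.gather(*workers, return_exceptions=True)
    return seen


crawled = asyncio.run(bounded_crawl(["a"]))
print(sorted(crawled))  # prints ['a', 'b', 'c', 'd']
```

With `max_urls=4`, the pages `e` and `f` are discovered but never queued, so the crawl stops after four pages instead of running until interrupted.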

## Profiling