# Using Ray for Web Scraping

In this example we show how to use Ray to scrape information from the web. Sophisticated Python libraries exist for this task (such as [https://scrapy.org/](https://scrapy.org/)), but here we keep things simple: we adapt existing code from [https://www.scrapingbee.com/blog/crawling-python/](https://www.scrapingbee.com/blog/crawling-python/) and show how little it takes to parallelize it with Ray.

First, install the required dependencies:

```
pip install requests bs4
```

The example from [https://www.scrapingbee.com/blog/crawling-python/](https://www.scrapingbee.com/blog/crawling-python/) then runs out of the box:

In [2]:
import logging
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)

In [3]:
class Crawler:

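    def __init__(self, urls=None):
        self.visited_urls = []
        # Copy the seed list; a mutable default argument ([]) would be shared
        # between all Crawler instances.
        self.urls_to_visit = list(urls) if urls is not None else []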
        self.num_processed_bytes = 0

    def download_url(self, url):
        text = requests.get(url).text
        self.num_processed_bytes += len(text)
        return text

    def get_linked_urls(self, url, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            path = link.get('href')
            if path and path.startswith('/'):
                path = urljoin(url, path)
            yield path

    def add_url_to_visit(self, url):
        if url not in self.visited_urls and url not in self.urls_to_visit:
            self.urls_to_visit.append(url)

    def crawl(self, url):
        html = self.download_url(url)
        for url in self.get_linked_urls(url, html):
            self.add_url_to_visit(url)

    def run(self):
        while self.urls_to_visit:
            url = self.urls_to_visit.pop(0)
            logging.info(f'Crawling: {url}')
            logging.info(f'Bytes: {self.num_processed_bytes}')
            try:
                self.crawl(url)
            except Exception:
                pass
                # logging.exception(f'Failed to crawl: {url}')
            finally:
                self.visited_urls.append(url)

if __name__ == '__main__':
    Crawler(urls=['https://en.wikipedia.org/']).run()

2022-06-23 23:29:05,540 INFO:Crawling: https://en.wikipedia.org/
2022-06-23 23:29:05,541 INFO:Bytes: 0
2022-06-23 23:29:05,655 INFO:Crawling: None
2022-06-23 23:29:05,655 INFO:Bytes: 86097
2022-06-23 23:29:05,657 INFO:Crawling: #mw-head
2022-06-23 23:29:05,657 INFO:Bytes: 86097
2022-06-23 23:29:05,658 INFO:Crawling: #searchInput
2022-06-23 23:29:05,659 INFO:Bytes: 86097
2022-06-23 23:29:05,660 INFO:Crawling: https://en.wikipedia.org/wiki/Wikipedia
2022-06-23 23:29:05,661 INFO:Bytes: 86097
2022-06-23 23:29:06,282 INFO:Crawling: https://en.wikipedia.org/wiki/Free_content
2022-06-23 23:29:06,283 INFO:Bytes: 1109745
2022-06-23 23:29:06,482 INFO:Crawling: https://en.wikipedia.org/wiki/Encyclopedia
2022-06-23 23:29:06,483 INFO:Bytes: 1326037
2022-06-23 23:29:06,679 INFO:Crawling: https://en.wikipedia.org/wiki/Help:Introduction_to_Wikipedia
2022-06-23 23:29:06,680 INFO:Bytes: 1564243
2022-06-23 23:29:06,749 INFO:Crawling: https://en.wikipedia.org/wiki/Special:Statistics
...

2022-06-23 23:29:37,975 INFO:Bytes: 19446192
2022-06-23 23:29:39,837 INFO:Crawling: https://en.wikipedia.org/wiki/Boston_Celtics
2022-06-23 23:29:39,838 INFO:Bytes: 20022751
2022-06-23 23:29:41,949 INFO:Crawling: https://en.wikipedia.org/wiki/2022_NBA_Finals
2022-06-23 23:29:41,949 INFO:Bytes: 20674094
2022-06-23 23:29:42,995 INFO:Crawling: https://en.wikipedia.org/wiki/Portal:Current_events
2022-06-23 23:29:42,996 INFO:Bytes: 21045865
2022-06-23 23:29:44,403 INFO:Crawling: https://en.wikipedia.org/wiki/COVID-19_pandemic
2022-06-23 23:29:44,404 INFO:Bytes: 21321700
2022-06-23 23:29:48,737 INFO:Crawling: https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine
2022-06-23 23:29:48,738 INFO:Bytes: 22699515
2022-06-23 23:29:52,846 INFO:Crawling: https://en.wikipedia.org/wiki/Deaths_in_2022
2022-06-23 23:29:52,846 INFO:Bytes: 24254101
2022-06-23 23:29:55,254 INFO:Crawling: https://en.wikipedia.org/wiki/Hugh_McElhenny
2022-06-23 23:29:55,255 INFO:Bytes: 24742599
...

2022-06-23 23:30:56,035 INFO:Bytes: 38033212
2022-06-23 23:30:56,544 INFO:Crawling: https://en.wikipedia.org/wiki/List_of_Yotsuba%26!_chapters
2022-06-23 23:30:56,545 INFO:Bytes: 38109465
2022-06-23 23:30:57,432 INFO:Crawling: https://en.wikipedia.org/wiki/Municipalities_of_Durango
2022-06-23 23:30:57,433 INFO:Bytes: 38296888
2022-06-23 23:30:57,842 INFO:Crawling: https://en.wikipedia.org/wiki/Grammy_Award_for_Best_Solo_Rock_Vocal_Performance
2022-06-23 23:30:57,843 INFO:Bytes: 38405051
2022-06-23 23:30:58,691 INFO:Crawling: https://en.wikipedia.org/wiki/Wikipedia:Today%27s_featured_list/June_2022
2022-06-23 23:30:58,692 INFO:Bytes: 38538513
2022-06-23 23:30:59,159 INFO:Crawling: https://en.wikipedia.org/wiki/Wikipedia:Featured_lists
2022-06-23 23:30:59,160 INFO:Bytes: 38614477
2022-06-23 23:31:06,095 INFO:Crawling: https://en.wikipedia.org/wiki/File:Vincent_van_Gogh_-_Wheatfield_with_a_reaper_-_Google_Art_Project.jpg
2022-06-23 23:31:06,096 INFO:Bytes: 39406126
...

2022-06-23 23:32:22,856 INFO:Bytes: 48898702
2022-06-23 23:32:23,665 INFO:Crawling: https://no.wikipedia.org/wiki/
2022-06-23 23:32:23,665 INFO:Bytes: 49002409
2022-06-23 23:32:24,214 INFO:Crawling: https://ro.wikipedia.org/wiki/
2022-06-23 23:32:24,215 INFO:Bytes: 49065124
2022-06-23 23:32:25,281 INFO:Crawling: https://sr.wikipedia.org/wiki/
2022-06-23 23:32:25,282 INFO:Bytes: 49199667
2022-06-23 23:32:26,212 INFO:Crawling: https://sh.wikipedia.org/wiki/
2022-06-23 23:32:26,213 INFO:Bytes: 49331599
2022-06-23 23:32:27,223 INFO:Crawling: https://fi.wikipedia.org/wiki/
2022-06-23 23:32:27,224 INFO:Bytes: 49418670
2022-06-23 23:32:28,097 INFO:Crawling: https://tr.wikipedia.org/wiki/
2022-06-23 23:32:28,097 INFO:Bytes: 49493625
2022-06-23 23:32:29,708 INFO:Crawling: https://ast.wikipedia.org/wiki/
2022-06-23 23:32:29,709 INFO:Bytes: 49650445
2022-06-23 23:32:30,705 INFO:Crawling: https://bn.wikipedia.org/wiki/
2022-06-23 23:32:30,706 INFO:Bytes: 49741919
...

2022-06-23 23:33:40,252 INFO:Crawling: https://en.wiktionary.org/wiki/Wiktionary:Main_Page
2022-06-23 23:33:40,253 INFO:Bytes: 56046797
2022-06-23 23:33:41,302 INFO:Crawling: https://ka.wikipedia.org/wiki/
2022-06-23 23:33:41,303 INFO:Bytes: 56159104
2022-06-23 23:33:42,614 INFO:Crawling: https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
2022-06-23 23:33:42,614 INFO:Bytes: 56328151
2022-06-23 23:33:42,953 INFO:Crawling: https://creativecommons.org/licenses/by-sa/3.0/
2022-06-23 23:33:42,954 INFO:Bytes: 56411287
2022-06-23 23:33:43,289 INFO:Crawling: https://foundation.wikimedia.org/wiki/Terms_of_Use
2022-06-23 23:33:43,290 INFO:Bytes: 56450592
2022-06-23 23:33:43,687 INFO:Crawling: https://foundation.wikimedia.org/wiki/Privacy_policy
2022-06-23 23:33:43,688 INFO:Bytes: 56535866
2022-06-23 23:33:44,775 INFO:Crawling: https://www.wikimediafoundation.org/
2022-06-23 23:33:44,775 INFO:Bytes: 56691846
...

2022-06-23 23:35:02,493 INFO:Bytes: 64991633
2022-06-23 23:35:03,522 INFO:Crawling: #cite_note-MiliardWho-10
2022-06-23 23:35:03,523 INFO:Bytes: 65079309
2022-06-23 23:35:03,524 INFO:Crawling: #cite_note-J_Sidener-11
2022-06-23 23:35:03,525 INFO:Bytes: 65079309
2022-06-23 23:35:03,526 INFO:Crawling: https://en.wikipedia.org/wiki/Spontaneous_order
2022-06-23 23:35:03,527 INFO:Bytes: 65079309
2022-06-23 23:35:04,418 INFO:Crawling: https://en.wikipedia.org/wiki/Friedrich_Hayek
2022-06-23 23:35:04,419 INFO:Bytes: 65174316
2022-06-23 23:35:17,411 INFO:Crawling: https://en.wikipedia.org/wiki/Austrian_School
2022-06-23 23:35:17,412 INFO:Bytes: 65922378
2022-06-23 23:35:23,682 INFO:Crawling: https://en.wikipedia.org/wiki/Mises_Institute
2022-06-23 23:35:23,682 INFO:Bytes: 66309767
2022-06-23 23:35:25,845 INFO:Crawling: https://en.wikipedia.org/wiki/Mark_Thornton
2022-06-23 23:35:25,846 INFO:Bytes: 66465171
2022-06-23 23:35:27,453 INFO:Crawling: #cite_note-12
...

2022-06-23 23:36:42,194 INFO:Bytes: 73997357
2022-06-23 23:36:42,195 INFO:Crawling: #Internal_research_and_operational_development
2022-06-23 23:36:42,195 INFO:Bytes: 73997357
2022-06-23 23:36:42,196 INFO:Crawling: #Internal_news_publications
2022-06-23 23:36:42,197 INFO:Bytes: 73997357
2022-06-23 23:36:42,198 INFO:Crawling: #The_Wikipedia_Library
2022-06-23 23:36:42,198 INFO:Bytes: 73997357
2022-06-23 23:36:42,200 INFO:Crawling: #Access_to_content
2022-06-23 23:36:42,200 INFO:Bytes: 73997357
2022-06-23 23:36:42,201 INFO:Crawling: #Content_licensing
2022-06-23 23:36:42,202 INFO:Bytes: 73997357
2022-06-23 23:36:42,203 INFO:Crawling: #Methods_of_access
2022-06-23 23:36:42,203 INFO:Bytes: 73997357
2022-06-23 23:36:42,204 INFO:Crawling: #Mobile_access
2022-06-23 23:36:42,204 INFO:Bytes: 73997357
2022-06-23 23:36:42,205 INFO:Crawling: #Chinese_access
2022-06-23 23:36:42,206 INFO:Bytes: 73997357
2022-06-23 23:36:42,207 INFO:Crawling: #Cultural_impact
...

2022-06-23 23:37:37,100 INFO:Crawling: #cite_note-WP_growth_modelling_1-43
2022-06-23 23:37:37,100 INFO:Bytes: 78122415
2022-06-23 23:37:37,102 INFO:Crawling: https://en.wikipedia.org/wiki/Palo_Alto_Research_Center
2022-06-23 23:37:37,102 INFO:Bytes: 78122415
2022-06-23 23:37:38,665 INFO:Crawling: #cite_note-wikisym_slowing_growth_1-44
2022-06-23 23:37:38,666 INFO:Bytes: 78224453
2022-06-23 23:37:38,667 INFO:Crawling: https://en.wiktionary.org/wiki/low-hanging_fruit
2022-06-23 23:37:38,668 INFO:Bytes: 78224453
2022-06-23 23:37:39,080 INFO:Crawling: #cite_note-bostonreview_the_end_of_WP_1-45
2022-06-23 23:37:39,081 INFO:Bytes: 78258836
2022-06-23 23:37:39,083 INFO:Crawling: #cite_note-46
2022-06-23 23:37:39,083 INFO:Bytes: 78258836
2022-06-23 23:37:39,085 INFO:Crawling: #cite_note-stanford_WP_lack_of_future_growth_1-47
2022-06-23 23:37:39,086 INFO:Bytes: 78258836
2022-06-23 23:37:39,087 INFO:Crawling: https://en.wikipedia.org/wiki/Rey_Juan_Carlos_University
...

KeyboardInterrupt: 
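
The log above shows the crawler trying to fetch `None` and fragment-only links such as `#mw-head`. This happens because `get_linked_urls` yields every `href` it finds, and the resulting failed requests are swallowed by the `try` block. A minimal sketch of a stricter filter follows, using only `urllib.parse`; the helper name `normalize_link` is hypothetical and not part of the original example.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(base_url, href):
    """Return an absolute http(s) URL for href, or None if it should be skipped."""
    if not href:
        return None  # <a> tags without an href attribute
    # Resolve relative links against the page URL, then drop any #fragment.
    # A fragment-only link therefore collapses to the page URL itself.
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    if urlparse(absolute).scheme not in ("http", "https"):
        return None  # e.g. mailto: or javascript: links
    return absolute


base = "https://en.wikipedia.org/wiki/Wikipedia"
print(normalize_link(base, "/wiki/Free_content"))
print(normalize_link(base, "#mw-head"))
print(normalize_link(base, None))
```

Inside `get_linked_urls`, one could call such a helper and yield only the results that are not `None`; the existing visited-URL check would then take care of links that collapse to an already known page.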

To parallelize the crawling, let us first initialize Ray.

In [5]:
import ray
ray.init()

2022-06-23 23:39:29,307	INFO services.py:1476 -- View the Ray dashboard at http://127.0.0.1:8267


RayContext(dashboard_url='127.0.0.1:8267', python_version='3.7.4', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', address_info={'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-06-23_23-39-26_709042_14356/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-06-23_23-39-26_709042_14356/sockets/raylet', 'webui_url': '127.0.0.1:8267', 'session_dir': '/tmp/ray/session_2022-06-23_23-39-26_709042_14356', 'metrics_export_port': 61956, 'gcs_address': '127.0.0.1:59858', 'address': '127.0.0.1:59858', 'node_id': 'fddf9ebe7111f389b901634512c76bd58dd860ce122c2cc3de3f9da2'})

We need to keep track of which URLs have already been crawled, to avoid visiting them twice, and of which URLs still need to be visited. We do this by centralizing this data in an actor, `CrawlQueue`.

In [6]:
import ray
import asyncio
import collections

@ray.remote
class CrawlQueue:
    # Initialize the crawl queue with a set of seed urls to be crawled.
    async def __init__(self, seed_urls):
        # A queue of pending crawl requests
        self.pending_crawl_requests = asyncio.Queue()
        # All crawl requests - pending, in progress, and completed
        self.all_crawl_requests = set()
        for url in seed_urls:
            await self.add_crawl_request(url)

    # Add additional urls to be crawled - this is called each time a crawler encounters a URL
    # in the document it is processing.
    async def add_crawl_request(self, url):
        if url not in self.all_crawl_requests:
            await self.pending_crawl_requests.put(url)
            self.all_crawl_requests.add(url)

    # Get a URL to crawl - this is called from an idle crawler.
    # It waits until a URL is pending and then returns it. Nothing here signals
    # the end of the crawl, so callers block for as long as the queue is empty.
    async def get_crawl_request(self):
        return await self.pending_crawl_requests.get()
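
To see the deduplication behaviour in isolation, the same bookkeeping can be sketched with plain `asyncio` and no Ray. The class name `LocalCrawlQueue` and the `example.org` URLs are hypothetical stand-ins used only for this illustration.

```python
import asyncio


class LocalCrawlQueue:
    """Hypothetical in-process stand-in for the CrawlQueue actor (no Ray)."""

    def __init__(self):
        self.pending = asyncio.Queue()
        self.seen = set()

    async def add_crawl_request(self, url):
        # Enqueue each URL at most once, no matter how often it is reported.
        if url not in self.seen:
            self.seen.add(url)
            await self.pending.put(url)

    async def get_crawl_request(self):
        # Waits until a URL is available.
        return await self.pending.get()


async def demo():
    queue = LocalCrawlQueue()
    urls = ["https://example.org/a", "https://example.org/b", "https://example.org/a"]
    for url in urls:
        await queue.add_crawl_request(url)
    return queue.pending.qsize()


pending_count = asyncio.run(demo())
print(pending_count)  # 2: the duplicate request was dropped
```

Keeping both the queue and the set inside a single actor means every crawler sees the same state, which is the reason for centralizing it.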

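In [None]:
# Optional: use this in place of the plain ray.init() above to connect to an
# already running Ray cluster. If Ray is already initialized in this process,
# ignore_reinit_error=True makes this call a no-op instead of an error.
ray.init(address='auto', ignore_reinit_error=True)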

Next, we adapt the crawler so that it takes URLs from the shared `CrawlQueue` actor and reports newly found links back to it, and we start five crawlers in parallel as Ray tasks.


In [8]:
class RayCrawler:

    def __init__(self, crawl_queue):
        self.crawl_queue = crawl_queue
        self.num_processed_bytes = 0

    def download_url(self, url):
        text = requests.get(url).text
        self.num_processed_bytes += len(text)
        return text

    def get_linked_urls(self, url, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            path = link.get('href')
            if path and path.startswith('/'):
                path = urljoin(url, path)
            yield path

    def add_url_to_visit(self, url):
        self.crawl_queue.add_crawl_request.remote(url)

    def crawl(self, url):
        html = self.download_url(url)
        for url in self.get_linked_urls(url, html):
            self.add_url_to_visit(url)

    def run(self):
        while True:
            url = ray.get(self.crawl_queue.get_crawl_request.remote())
            logging.info(f'Crawling: {url}')
            logging.info(f'Bytes: {self.num_processed_bytes}')
            try:
                self.crawl(url)
            except Exception:
                # logging.exception(f'Failed to crawl: {url}')
                pass
                
@ray.remote
def worker(crawl_queue):
    logging.basicConfig(level=logging.INFO)
    RayCrawler(crawl_queue).run()

if __name__ == '__main__':
    crawl_queue = CrawlQueue.remote(['https://en.wikipedia.org/'])
    ray.get([worker.remote(crawl_queue) for i in range(5)])

[2m[36m(worker pid=15225)[0m ERROR:root:Crawling: https://en.wikipedia.org/wiki/Ensemble_Citoyens
[2m[36m(worker pid=15225)[0m ERROR:root:Bytes: 3966707
[2m[36m(worker pid=15222)[0m ERROR:root:Crawling: https://en.wikipedia.org/wiki/2022_French_legislative_election
[2m[36m(worker pid=15222)[0m ERROR:root:Bytes: 3206832
[2m[36m(worker pid=15226)[0m ERROR:root:Crawling: https://en.wikipedia.org/wiki/Emmanuel_Macron
[2m[36m(worker pid=15226)[0m ERROR:root:Bytes: 3788547
[2m[36m(worker pid=15221)[0m INFO:root:Crawling: https://en.wikipedia.org/
[2m[36m(worker pid=15221)[0m INFO:root:Bytes: 0
[2m[36m(worker pid=15224)[0m INFO:root:Crawling: #mw-head
[2m[36m(worker pid=15224)[0m INFO:root:Bytes: 0
[2m[36m(worker pid=15224)[0m INFO:root:Crawling: https://en.wikipedia.org/wiki/Special:Statistics
[2m[36m(worker pid=15224)[0m INFO:root:Bytes: 0
[2m[36m(worker pid=15218)[0m INFO:root:Crawling: #searchInput
[2m[36m(worker pid=15218)[0m INFO:root:Bytes: 0


[2m[36m(worker pid=15220)[0m ERROR:root:Crawling: https://en.wikipedia.org/wiki/Dom_Phillips
[2m[36m(worker pid=15220)[0m ERROR:root:Bytes: 5377286
[2m[36m(worker pid=15215)[0m INFO:root:Crawling: https://en.wikipedia.org/wiki/2014_Texas_Bowl
[2m[36m(worker pid=15215)[0m INFO:root:Bytes: 2236834
[2m[36m(worker pid=15227)[0m ERROR:root:Crawling: https://en.wikipedia.org/wiki/June_24
[2m[36m(worker pid=15227)[0m ERROR:root:Bytes: 4144921
[2m[36m(worker pid=15218)[0m INFO:root:Crawling: https://en.wikipedia.org/wiki/Arkansas_Razorbacks_football
[2m[36m(worker pid=15218)[0m INFO:root:Bytes: 2558195
[2m[36m(worker pid=15219)[0m INFO:root:Crawling: https://en.wikipedia.org/wiki/Texas_Longhorns_football
[2m[36m(worker pid=15219)[0m INFO:root:Bytes: 1760779
[2m[36m(worker pid=15225)[0m ERROR:root:Crawling: https://en.wikipedia.org/wiki/Caleb_Swanigan
[2m[36m(worker pid=15225)[0m ERROR:root:Bytes: 4922935
[2m[36m(worker pid=15222)[0m ERROR:root:Crawling: h

KeyboardInterrupt: 
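
The parallel run above had to be stopped by hand: `RayCrawler.run` loops forever, and `get_crawl_request` waits for as long as the queue is empty. One possible way to let workers finish on their own is an idle timeout that returns a sentinel value. The sketch below shows the idea with plain `asyncio`, outside Ray; `END_OF_CRAWLS`, `get_or_end`, and the timeout value are assumptions made for illustration, not part of the original example.

```python
import asyncio

# Hypothetical sentinel meaning "no more work"; a plain string so that it
# cannot be confused with a None href and is easy to pass between processes.
END_OF_CRAWLS = "<END_OF_CRAWLS>"


async def get_or_end(pending, idle_timeout_s):
    """Return the next URL, or END_OF_CRAWLS if none arrives within the timeout."""
    try:
        return await asyncio.wait_for(pending.get(), timeout=idle_timeout_s)
    except asyncio.TimeoutError:
        return END_OF_CRAWLS


async def demo():
    pending = asyncio.Queue()
    await pending.put("https://example.org/")
    first = await get_or_end(pending, idle_timeout_s=0.1)
    second = await get_or_end(pending, idle_timeout_s=0.1)  # queue is empty now
    return first, second


first, second = asyncio.run(demo())
print(first)
print(second)
```

A worker would leave its loop when it receives the sentinel. An idle timeout is only a heuristic, since the queue can be briefly empty while other workers are still downloading pages, so in practice the timeout would need to be generous.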