So let's give Scrapy a whirl, shall we?

https://docs.scrapy.org/en/latest/intro/tutorial.html

https://towardsdatascience.com/run-scrapy-code-from-jupyter-notebook-without-issues-69b7cb79530c

In [1]:
# scrape webpage
import scrapy
from scrapy.crawler import CrawlerProcess
# text cleaning
import re

In [2]:
class QuotesToCsv(scrapy.Spider):
    """Scrape the first line of each quote by Maynard James Keenan
    from Wikiquote and save it to a CSV file."""
    name = "MJKQuotesToCsv"

    start_urls = [
        'https://en.wikiquote.org/wiki/Maynard_James_Keenan',
    ]

    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.ExtractFirstLine': 1
        },
        'FEEDS': {
            'wikiQuotes.csv': {
                'format': 'csv',
                'overwrite': True
            }
        }
    }
    

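    def parse(self, response):
        """Yield the raw HTML of each top-level quote list item."""
        # The child combinator (>) keeps nested <li> source notes from
        # being yielded as separate items.
        for quote in response.css('div.mw-parser-output > ul > li'):
            yield {'quote': quote.get()}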

In [3]:
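class ExtractFirstLine:
    """Item pipeline that keeps only the first line of each quote, tags stripped."""

    # Non-greedy match of anything between '<' and '>'; compiled once.
    html_tags = re.compile(r'<.*?>')

    def process_item(self, item, spider):
        """Reduce the raw <li> HTML to its first line of plain text."""
        lines = item["quote"].splitlines()
        first_line = self._remove_html_tags(lines[0])

        return {'quote': first_line}

    def _remove_html_tags(self, text):
        """Remove HTML tags from a string."""
        # A single leading underscore marks the helper as private; the
        # double-underscore form is reserved for Python's special methods.
        return self.html_tags.sub('', text)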
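
The text processing in the pipeline above can be tried on its own, without running a crawl. The following is a minimal sketch: the function name `first_line_without_tags` and the `sample` markup are made up for illustration and are not taken from the Wikiquote page, and the empty-input guard is an addition the pipeline itself does not have.

```python
import re

# Same pattern as the pipeline: non-greedy match of anything between < and >.
HTML_TAG = re.compile(r"<.*?>")


def first_line_without_tags(raw_html):
    """Return the first line of raw_html with HTML tags removed."""
    lines = raw_html.splitlines()
    if not lines:
        # Added guard (not in the pipeline): empty input has no first line.
        return ""
    return HTML_TAG.sub("", lines[0])


# Hypothetical <li> markup, shaped roughly like what the spider might yield:
# the quote on the first line, a nested source note on the second.
sample = "<li>Tool is not <b>Slayer</b>.\n<ul><li>Interview, source unknown</li></ul></li>"
print(first_line_without_tags(sample))
```

Note that the regex only strips tags. HTML entities such as `&amp;` are left as they are, so an extra `html.unescape` step would be needed if they show up in the output.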

In [4]:
process = CrawlerProcess()

2023-07-17 16:49:39 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: scrapybot)
2023-07-17 16:49:39 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.8.10 (default, May 26 2023, 14:05:08) - [GCC 9.4.0], pyOpenSSL 23.2.0 (OpenSSL 3.1.1 30 May 2023), cryptography 41.0.2, Platform Linux-5.19.0-46-generic-x86_64-with-glibc2.29


In [5]:
# schedule the spider; no page is fetched until the reactor starts in process.start()
process.crawl(QuotesToCsv)

2023-07-17 16:49:42 [scrapy.crawler] INFO: Overridden settings:
{}


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-07-17 16:49:42 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2023-07-17 16:49:42 [scrapy.extensions.telnet] INFO: Telnet Password: 9d4af74edd1697e8
2023-07-17 16:49:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2023-07-17 16:49:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermidd

<Deferred at 0x7fce4c91d280>

In [6]:
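# blocks until the crawl finishes; the Twisted reactor cannot be restarted,
# so re-running this cell requires restarting the kernel
process.start()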

2023-07-17 16:49:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikiquote.org/wiki/Maynard_James_Keenan> (referer: None)
2023-07-17 16:49:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://en.wikiquote.org/wiki/Maynard_James_Keenan>
{'quote': "Tool is not Slayer. I went to art school. I spent three years in the military. There's more to me than throwing devil horns."}
2023-07-17 16:49:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://en.wikiquote.org/wiki/Maynard_James_Keenan>
{'quote': 'I think there’s a reason why wine figures into so many religions. There’s something transcendent about it. It’s sort of the way that music is more than the sum of its parts. You have all these elements that make up the terroir that wine can communicate.'}
2023-07-17 16:49:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://en.wikiquote.org/wiki/Maynard_James_Keenan>
{'quote': "You can grow grapes in almost any part of the world. You just have to develop your palate en