# Web Crawling Models

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

## Dealing with different website layouts

In [5]:
class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body
    
    def print(self):
        print(f'TITLE: {self.title}')
        print(f'URL: {self.url}')
        print(f'BODY:\n {self.body}')

def scrapeCNN(url):
    bs = BeautifulSoup(urlopen(url), 'html.parser')
    title = bs.find('h1').text
    body = bs.find('div', {'class': 'article__content'}).text
    print('body: ')
    print(body)
    return Content(url, title, body)

def scrapeBrookings(url):
    bs = BeautifulSoup(urlopen(url), 'html.parser')
    title = bs.find('h1').text
    body = bs.find('div', {'class': 'byo-block -narrow wysiwyg-block wysiwyg'}).text
    return Content(url, title, body)

In [6]:
url = 'https://www.brookings.edu/research/robotic-rulemaking/'
content = scrapeBrookings(url)
content.print()

TITLE: 
            Robotic rulemaking
          
URL: https://www.brookings.edu/research/robotic-rulemaking/
BODY:
 
As it has rocketed to some 100 million active users in record time, ChatGPT is provoking conversations about the role of artificial intelligence (AI) in drafting written materials such as student exams, news articles, legal pleadings, poems, and more. The chatbot, developed by OpenAI, relies on a large language model (LLM) to respond to user-submitted requests, or “prompts” as they are known. It is an example of generative AI, a technology that upends our understanding of who creates written materials and how they do it, challenging what it means to create, analyze, and express ideas.



In [4]:
url = 'https://www.cnn.com/2023/04/03/investing/dogecoin-elon-musk-twitter/index.html'
content = scrapeCNN(url)
content.print()

body: 



New York
CNN
         — 
    


            Twitter’s traditional bird icon was booted and replaced with an image of a Shiba Inu, an apparent nod to dogecoin, the joke cryptocurrency that CEO Elon Musk is being sued over. 
    

            Musk addressed the change Monday afternoon, tweeting, “as promised” above an image of a year-old conversation in which another user suggested that Musk “just buy Twitter” and “change the bird logo to a doge.” 
    











CNN/Adobe Stock





Elon Musk's Twitter promised a purge of blue check marks. Instead he singled out one account




            The doge logo appeared on the site two days after Musk asked a judge to throw out a $258 billion racketeering lawsuit accusing him of running a pyramid scheme to support the dogecoin, according to Reuters.


            Lawyers for Musk and Tesla called the lawsuit by dogecoin investors a “fanciful work of fiction” over Musk’s “innocuous and often silly tweets.”
    

            It wasn’t 

Below, we also add a `Website` class to handle different website structures more flexibly. For each site it stores the base URL, the selector that locates the title, and the selector that identifies the body text we want.

In [8]:
class Content:
    """
    Common base class for all articles/pages
    """
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        """
        Flexible printing function controls output
        """
        print(f'URL: {self.url}')
        print(f'TITLE: {self.title}')
        print(f'BODY:\n{self.body}')

class Website:
    """ 
    Contains information about website structure
    """
    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag

class Crawler:
    @staticmethod
    def getPage(url):
        try:
            html = urlopen(url)
        except Exception:
            return None
        return BeautifulSoup(html, 'html.parser')

    @staticmethod
    def safeGet(bs, selector):
        """
        Utility function used to get a content string from a Beautiful Soup
        object and a selector. Returns an empty string if no object
        is found for the given selector
        """
        selectedElems = bs.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    @staticmethod
    def getContent(website, path):
        """
        Extract content from a given page URL
        """
        url = website.url+path
        bs = Crawler.getPage(url)
        if bs is not None:
            title = Crawler.safeGet(bs, website.titleTag)
            body = Crawler.safeGet(bs, website.bodyTag)
            return Content(url, title, body)
        return Content(url, '', '')

In [9]:
siteData = [
    ['O\'Reilly Media', 'https://www.oreilly.com', 'h1', 'div.title-description'],
    ['Reuters', 'https://www.reuters.com', 'h1', 'div.ArticleBodyWrapper'],
    ['Brookings', 'https://www.brookings.edu', 'h1', 'div.post-body'],
    ['CNN', 'https://www.cnn.com', 'h1', 'div.article__content']
]
websites = []
for name, url, title, body in siteData:
    websites.append(Website(name, url, title, body))

Crawler.getContent(websites[0], '/library/view/web-scraping-with/9781491910283').print()
Crawler.getContent(
    websites[1], '/article/us-usa-epa-pruitt-idUSKBN19W2D0').print()
Crawler.getContent(
    websites[2],
    '/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/').print()
Crawler.getContent(
    websites[3], 
    '/2023/04/03/investing/dogecoin-elon-musk-twitter/index.html').print()

URL: https://www.oreilly.com/library/view/web-scraping-with/9781491910283
TITLE: Web Scraping with Python
BODY:


Book description
Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands—or even millions—of web pages at once.Ideal for programmers, security professionals, and web administrators familiar with Python, this book not only teaches basic web scraping mechanics, but also delves into more advanced topics, such as analyzing raw data or using scrapers for frontend website testing. Code samples are available to help you understand the concepts in practice.
Show and hide more

Publisher resources
View/Submit Errata




URL: https://www.reuters.com/article/us-usa-epa-pruitt-idUSKBN19W2D0
TITLE: 
BODY:

URL: https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/
TITLE: Idea t
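
The empty Reuters title and body above are what `safeGet` returns when a selector matches nothing; `getContent` also returns empty strings when the page cannot be fetched at all. The following is a minimal offline sketch of the first case. The HTML snippet is made up for illustration, and `safeGet` here is a standalone copy of the method rather than the class itself.

```python
from bs4 import BeautifulSoup


def safeGet(bs, selector):
    """Join the text of every element matching selector; '' if none match."""
    selectedElems = bs.select(selector)
    return '\n'.join(elem.get_text() for elem in selectedElems)


# Hypothetical page: it has an <h1> and a div.post-body,
# but nothing matching div.ArticleBodyWrapper.
html = ('<html><body><h1>Example title</h1>'
        '<div class="post-body">Body text</div></body></html>')
bs = BeautifulSoup(html, 'html.parser')

print(repr(safeGet(bs, 'h1')))                      # selector matches
print(repr(safeGet(bs, 'div.ArticleBodyWrapper')))  # selector does not match
```

An empty field in the output is therefore a hint to re-check the site's current markup against the selectors stored in `siteData`.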

## Crawling through sites with search

In [10]:
class Content:
    """Common base class for all articles/pages"""

    def __init__(self, topic, url, title, body):
        self.topic = topic
        self.title = title
        self.body = body
        self.url = url

    def print(self):
        """
        Flexible printing function controls output
        """
        print(f'New article found for topic: {self.topic}')
        print(f'URL: {self.url}')
        print(f'TITLE: {self.title}')
        print(f'BODY:\n{self.body}')

class Website:
    """Contains information about website structure"""

    def __init__(self, name, url, searchUrl, resultListing, resultUrl, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.searchUrl = searchUrl
        self.resultListing = resultListing
        self.resultUrl = resultUrl
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag

class Crawler:
    def __init__(self, website):
        self.site = website
        self.found = {}

    @staticmethod
    def getPage(url):
        try:
            html = urlopen(url)
        except Exception:
            return None
        return BeautifulSoup(html, 'html.parser')

    @staticmethod
    def safeGet(bs, selector):
        """
        Utility function used to get a content string from a Beautiful Soup
        object and a selector. Returns an empty string if no object
        is found for the given selector
        """
        selectedElems = bs.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def getContent(self, topic, url):
        """
        Extract content from a given page URL
        """
        bs = Crawler.getPage(url)
        if bs is not None:
            title = Crawler.safeGet(bs, self.site.titleTag)
            body = Crawler.safeGet(bs, self.site.bodyTag)
            return Content(topic, url, title, body)
        return Content(topic, url, '', '')

    def search(self, topic):
        """
        Searches a given website for a given topic and records all pages found
        """
        bs = Crawler.getPage(self.site.searchUrl + topic)
        searchResults = bs.select(self.site.resultListing)
        for result in searchResults:
            url = result.select(self.site.resultUrl)[0].attrs['href']
            # Check to see whether it's a relative or an absolute URL
            url = url if self.site.absoluteUrl else self.site.url + url
            if url not in self.found:
                self.found[url] = self.getContent(topic, url)
            self.found[url].print()


In [11]:
siteData = [
    ['Reuters', 'http://reuters.com', 'https://www.reuters.com/search/news?blob=', 'div.search-result-indiv',
        'h3.search-result-title a', False, 'h1', 'div.ArticleBodyWrapper'],
    ['Brookings', 'http://www.brookings.edu', 'https://www.brookings.edu/search/?s=',
        'div.article-info', 'h4.title a', True, 'h1', 'div.core-block']
]
sites = []
for name, url, search, rListing, rUrl, absUrl, tt, bt in siteData:
    sites.append(Website(name, url, search, rListing, rUrl, absUrl, tt, bt))

crawlers = [Crawler(site) for site in sites]
topics = ['python', 'data%20science']

for topic in topics:
    for crawler in crawlers:
        crawler.search(topic)


AttributeError: 'NoneType' object has no attribute 'select'
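
The `AttributeError` above occurs because `getPage` returns `None` when a request fails, and `search` then calls `.select` on that `None`. Below is a hedged sketch of one way to guard against this, not the notebook's own method. The helper names `buildSearchUrl` and `resultLinks` are hypothetical, and `bs` is assumed to be either `None` or a parsed Beautiful Soup page. The sketch also percent-encodes the topic with the standard library's `quote` instead of hand-writing `data%20science`.

```python
from urllib.parse import quote


def buildSearchUrl(searchUrl, topic):
    """Percent-encode the topic before appending it to the search URL."""
    return searchUrl + quote(topic)


def resultLinks(bs, listingSelector, linkSelector):
    """Return the href of each search result; [] if the page failed to load."""
    if bs is None:
        # getPage returned None: nothing to select from.
        return []
    links = []
    for result in bs.select(listingSelector):
        anchors = result.select(linkSelector)
        # Skip results with no matching link instead of raising IndexError.
        if anchors and 'href' in anchors[0].attrs:
            links.append(anchors[0].attrs['href'])
    return links


print(buildSearchUrl('https://www.brookings.edu/search/?s=', 'data science'))
print(resultLinks(None, 'div.article-info', 'h4.title a'))
```

With a guard like this, a failed request yields an empty result list, so the loop over topics and crawlers can continue with the next site.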

## Crawling sites through links

In [70]:
class Website:

    def __init__(self, name, url, targetPattern, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.targetPattern = targetPattern
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag


class Content:

    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        print(f'URL: {self.url}')
        print(f'TITLE: {self.title}')
        print(f'BODY:\n{self.body}')

In [83]:
import re


class Crawler:
    def __init__(self, site):
        self.site = site
        self.visited = {}

    @staticmethod
    def getPage(url):
        try:
            html = urlopen(url)
        except Exception as e:
            print(e)
            return None
        return BeautifulSoup(html, 'html.parser')

    @staticmethod
    def safeGet(bs, selector):
        selectedElems = bs.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def getContent(self, url):
        """
        Extract content from a given page URL
        """
        bs = Crawler.getPage(url)
        if bs is not None:
            title = Crawler.safeGet(bs, self.site.titleTag)
            body = Crawler.safeGet(bs, self.site.bodyTag)
            return Content(url, title, body)
        return Content(url, '', '')

    def crawl(self):
        """
        Get pages from website home page
        """
        bs = Crawler.getPage(self.site.url)
        if bs is None:
            # The home page could not be fetched, so there is nothing to crawl
            return
        targetPages = bs.find_all('a', href=re.compile(self.site.targetPattern))
        for targetPage in targetPages:
            url = targetPage.attrs['href']
            url = url if self.site.absoluteUrl else f'{self.site.url}{url}'
            if url not in self.visited:
                self.visited[url] = self.getContent(url)
                self.visited[url].print()


brookings = Website('Brookings', 'https://brookings.edu', r'/(research|blog)/', True, 'h1', 'div.post-body')
crawler = Crawler(brookings)
crawler.crawl()

URL: https://www.brookings.edu/blog/fixgov/2023/04/05/what-we-learned-from-the-chicago-mayoral-results/
TITLE: What we learned from the Chicago mayoral results
What we learned from the Chicago mayoral results
BODY:

As Chicagoans went to the polls on Tuesday, early signs pointed to a narrow victory for Paul Vallas, the former head of the city’s public school system and noted educational reformer, over Brandon Johnson, a former social studies teacher turned organizer for the Chicago Teachers Union. Vallas led in the pre-election polls by an average of 3 percentage points, a margin that widened to 6 points when undecided voters were asked whether they leaned toward a candidate. A higher share of Vallas’s supporters said that they were certain to cast their ballots, and more of Johnson’s said that they might change their minds about their choice. Vallas enjoyed a strong lead among voters 60 and older, who are the most likely to vote of all age cohorts, while Johnson was doing best among t

URL: https://www.brookings.edu/research/addressing-the-looming-sovereign-debt-crisis-in-the-developing-world-it-is-time-to-consider-a-brady-plan/
TITLE: Addressing the looming sovereign debt crisis in the developing world: It is time to consider a ‘Brady’ plan
Addressing the looming sovereign debt crisis in the developing world: It is time to consider a ‘Brady’ plan
BODY:








Brahima Sangafowa Coulibaly

					Vice President and Director - Global Economy and Development 

					Senior Fellow - Global Economy and Development 

 Twitter
BSangafowaCoul






W



Wafa Abedin

					Research and Administrative Assistant to the Vice President and Director - Global Economy and Development 




Among the challenges facing developing countries, none is arguably more crucial than the significantly deteriorated fiscal situation that threatens to erase several years of progress on development agendas. According to some estimates, almost 60 percent of the poorest countries are either in or at hig

URL: https://www.brookings.edu/research/the-second-half-of-the-sustainable-development-goal-era-ideas-for-doing-things-differently/
TITLE: The ‘Second Half’ of the Sustainable Development Goal era: Ideas for doing things differently
The ‘Second Half’ of the Sustainable Development Goal era: Ideas for doing things differently
BODY:








John W. McArthur

					Director - Center for Sustainable Development 

					Senior Fellow - Global Economy and Development 

 Twitter
@mcarthur





This September, the U.N. will convene a midpoint summit on the Sustainable Development Goals, halfway between their 2015 launch and 2030 deadline. For many leaders gathering in the General Assembly, the mood might be somber. Stark global tensions alongside inadequate SDG progress make for a tough outlook. But a successful summit will need to focus on pragmatics more than sentiment: What has gone well, where could a burst of effort tackle gaps, and—perhaps most importantly—what needs to be done differently

URL: https://www.brookings.edu/research/caring-about-care-an-sdg-5-priority/
TITLE: Caring about Care: An SDG-5 priority
Caring about Care: An SDG-5 priority
BODY:








Caren Grown

					Senior Fellow - Global Economy and Development, Center for Sustainable Development 




Goal 5 is an ambitious and expansive approach to reducing gaps between males and females and enabling women and girls to live their lives to the fullest. It proposes a multidimensional definition of, and comprehensive set of indicators for, tracking gender equality and women’s empowerment, complemented with targets and indicators across other goals. While advances have been made toward many aspects of Goal 5, the U.N. estimates that at the current rate, it will take nearly 300 years to meet all targets.  A high priority for accelerating progress is Target 5.4, which seeks to equalize the time that women and men spend on unpaid care and domestic work, including care for children, the elderly, the sick, and those w

URL: https://www.brookings.edu/blog/brown-center-chalkboard/2023/04/05/state-of-the-states-gubernatorial-policy-priorities-in-2023/
TITLE: State of the States: Gubernatorial policy priorities in 2023
State of the States: Gubernatorial policy priorities in 2023
BODY:








Katharine Meyer

					Fellow - Governance Studies, Brown Center on Education Policy 

 Twitter
@katharinemeyer








Rachel M. Perera

					Fellow - Governance Studies, Brown Center on Education Policy - The Brookings Institution 

 Twitter
@rachelmarisa





The federal government plays a limited role in education policy—states and local governments are primarily responsible for educating our nation’s youth. The first federal laws about education governance weren’t introduced until 1965 with the Elementary and Secondary Education Act (ESEA) and Higher Education Act (HEA). And still, states are given broad latitude to determine how to best implement these federal laws in their states. Today, the federal government

URL: https://www.brookings.edu/blog/order-from-chaos/2023/04/05/when-might-us-political-support-be-unwelcome-in-taiwan/
TITLE: When might US political support be unwelcome in Taiwan?
When might US political support be unwelcome in Taiwan?
BODY:

For a time, it looked as though House Speaker Kevin McCarthy would make a high-profile visit to Taiwan this spring. There was some suggestion that this might lead Beijing to react even more coercively than it did after the previous speaker, Nancy Pelosi, visited in August 2022. Perhaps for that reason, McCarthy will now have a meeting with Taiwanese President Tsai Ing-wen when she transits through Los Angeles, California. Depending on how McCarthy frames his support for Tsai, however, the People’s Republic of China (PRC) might still escalate its military operations around Taiwan to signal its opposition to the alleged “hollowing out” of the U.S. “One China” policy. Depending on the scale of these actions, some Taiwanese voters might again concl

URL: https://www.brookings.edu/research/sdg-implementation-for-fragile-countries-needs-more-risk-taking/
TITLE: SDG implementation for fragile countries needs more risk-taking
SDG implementation for fragile countries needs more risk-taking
BODY:








Naheed Sarabi

					Visiting Fellow - Global Economy and Development, Center for Sustainable Development 

 Twitter
Sarabinaheed





In 2023, concurring economic, social, and environmental crises are disproportionately affecting fragile states, creating a grim outlook for achieving the SDGs by 2030. The Global Peace Index Report for 2022 indicates deteriorating global peacefulness since 2014, with a growing gap between the most peaceful and least peaceful countries. SDG progress has been either stagnating or declining in more than half of the fragile states. Poverty and insecurity are on the rise in conflict-affected and fragile countries, where 20 percent of the global share of those in extreme poverty live; this is expected to rise t

URL: https://www.brookings.edu/research/scaling-private-sector-engagement-in-the-sdgs/
TITLE: Scaling private sector engagement in the SDGs
Scaling private sector engagement in the SDGs
BODY:








Jane Nelson

					Nonresident Senior Fellow - Global Economy and Development, Center for Sustainable Development 







George Ingram

					Senior Fellow - Global Economy and Development, Center for Sustainable Development 

 Twitter
@GMIngramIV





Private sector investment and innovation are essential to achieving the Sustainable Development Goals (SDGs). A vanguard of companies is making public commitments and taking action. Yet, business engagement and impact are far from becoming mainstream. A concerted effort is required to scale the quantity, quality, and accountability of private sector activities that could have a measurable impact on supporting the SDGs.  
In the 12th U.N. Global Compact-Accenture CEO Study, released in 2023, 98 percent of more than 2,600 chief executives acros

URL: https://www.brookings.edu/research/a-purpose-driven-fund-to-end-extreme-poverty-by-2030/
TITLE: A purpose-driven fund to end extreme poverty by 2030
A purpose-driven fund to end extreme poverty by 2030
BODY:








Homi Kharas

					Senior Fellow - Global Economy and Development, Center for Sustainable Development 







John W. McArthur

					Director - Center for Sustainable Development 

					Senior Fellow - Global Economy and Development 

 Twitter
@mcarthur





Ending extreme poverty by 2030 is first among equals within the Sustainable Development Goals. When SDG target 1.1 was formally adopted in 2015, the number of extremely poor people was thought to be around 730 million globally and was falling by roughly 65 million a year. Continuing that trend would have cut poverty rates to zero by 2030. But progress has slowed instead. Recent projections suggest 570 million people might still be poor in 2030, far short of elimination. At the SDG midpoint, rebooting efforts to endi
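
The `crawl` method relies on the `absoluteUrl` flag to decide whether a link needs the site's base URL prepended. As a sketch of an alternative, assuming the standard library's `urllib.parse.urljoin` fits this use, both kinds of `href` can be resolved against the base URL without a per-site flag. The helper name `resolveUrl` is hypothetical.

```python
from urllib.parse import urljoin


def resolveUrl(baseUrl, href):
    """Resolve a relative or absolute href against the site's base URL."""
    return urljoin(baseUrl, href)


base = 'https://www.brookings.edu'

# A relative link is joined onto the base URL.
print(resolveUrl(base, '/research/robotic-rulemaking/'))

# An absolute link is returned unchanged.
print(resolveUrl(base, 'https://www.brookings.edu/blog/techtank/'))
```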

## Crawling multiple page types

In [12]:
class Website:
    """Common base class for all page types; holds only the shared fields"""

    def __init__(self, name, url, titleTag):
        # bodyTag is set by Article, the subclass that needs it;
        # Product pages have no body selector
        self.name = name
        self.url = url
        self.titleTag = titleTag

class Product(Website):
    """Contains information for scraping a product page"""

    def __init__(self, name, url, titleTag, productNumberTag, priceTag):
        Website.__init__(self, name, url, titleTag)
        self.productNumberTag = productNumberTag
        self.priceTag = priceTag

class Article(Website):
    """Contains information for scraping an article page"""

    def __init__(self, name, url, titleTag, bodyTag, dateTag):
        Website.__init__(self, name, url, titleTag)
        self.bodyTag = bodyTag
        self.dateTag = dateTag
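
A minimal, self-contained sketch of how these page types might be instantiated. It assumes the base class stores only the fields every page type shares (name, URL, title selector). All site names, URLs, and selectors below are hypothetical placeholders.

```python
class Website:
    """Fields shared by every page type (simplified for this sketch)."""

    def __init__(self, name, url, titleTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag


class Product(Website):
    """Selectors for scraping a product page."""

    def __init__(self, name, url, titleTag, productNumberTag, priceTag):
        super().__init__(name, url, titleTag)
        self.productNumberTag = productNumberTag
        self.priceTag = priceTag


class Article(Website):
    """Selectors for scraping an article page."""

    def __init__(self, name, url, titleTag, bodyTag, dateTag):
        super().__init__(name, url, titleTag)
        self.bodyTag = bodyTag
        self.dateTag = dateTag


# Hypothetical sites and selectors, for illustration only.
product = Product('Example Shop', 'https://shop.example.com',
                  'h1', 'span.sku', 'span.price')
article = Article('Example News', 'https://news.example.com',
                  'h1', 'div.article-body', 'time.published')

for page in (product, article):
    print(type(page).__name__, page.name, page.titleTag)
```

A crawler can then check `isinstance(page, Product)` or `isinstance(page, Article)` to decide which extra selectors to apply.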