Web crawlers are so called because they crawl across the web. At their core is an element of recursion: they must retrieve the page contents for a URL, examine that page for another URL, and retrieve that page, *ad infinitum*.
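
A minimal sketch of that fetch, parse, and follow loop. To stay runnable offline, it assumes a hypothetical in-memory `FAKE_WEB` dictionary in place of real pages; `FAKE_WEB` and `crawl` are illustrative names, not part of any library.

```python
# Hypothetical stand-in for the web: each "URL" maps to the links found on that page.
FAKE_WEB = {
    '/a': ['/b', '/c'],
    '/b': ['/a', '/d'],
    '/c': [],
    '/d': ['/c'],
}

def crawl(url, visited=None):
    """Recursively follow links starting at url; return pages in visit order."""
    if visited is None:
        visited = []
    if url in visited:
        # Already seen: stop here so that link cycles do not recurse forever
        return visited
    visited.append(url)
    for link in FAKE_WEB.get(url, []):
        crawl(link, visited)
    return visited

print(crawl('/a'))
```

A real crawler would replace the dictionary lookup with an HTTP request and an HTML parse, as the cells below do.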

In [1]:
# Traversing a single domain

from urllib.request import urlopen
from bs4 import BeautifulSoup 

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#searchInput
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/Philadelphia,_Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
#cite_note-1
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
http://baconbros.com/
#cite_note-2
#cite_note-actor-3
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Balto_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(1982_film)
/wiki/Tremors_(1990_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/X-Men:_First_Class
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/w

If we examine the links that point to article pages, we see that they have three characteristics in common:

* They reside within the div whose **id** is set to bodyContent;
* The URLs do not contain colons;
* The URLs begin with /wiki/

We can use these rules to retrieve only the desired links with the regular expression ^(/wiki/)((?!:).)*$
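
As a quick offline check, the expression can be tried against a few of the hrefs printed above. The `samples` list and the name `article_link` are chosen here for illustration only. The regular expression covers only the colon and /wiki/ rules; the bodyContent rule still has to be applied when selecting the links.

```python
import re

# Must start with /wiki/ and contain no colon anywhere after it
article_link = re.compile('^(/wiki/)((?!:).)*$')

samples = [
    '/wiki/Kyra_Sedgwick',
    '/wiki/File:Kevin_Bacon_SDCC_2014.jpg',
    '#cite_note-1',
    'http://baconbros.com/',
]

matches = [href for href in samples if article_link.match(href)]
print(matches)
```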

In [2]:
from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import re

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find('div', {'id':'bodyContent'}).find_all(
    'a', href=re.compile('^(/wiki/)((?!:).)*$')):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia,_Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Balto_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(1982_film)
/wiki/Tremors_(1990_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/Streaming_television
/wiki/I_Love_Dick_(TV_series)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
/wiki/The_Guardian
/wi

In [3]:
# this script runs indefinitely and must be interrupted manually

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed with a float: recent Python versions reject a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())
def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)

/wiki/Losing_Chase
/wiki/Kyra_Sedgwick
/wiki/Sela_Ward
/wiki/Barbara_Stanwyck
/wiki/Judy_Davis
/wiki/Jacki_Weaver
/wiki/Last_Cab_to_Darwin_(film)
/wiki/Brendan_Cowell
/wiki/The_Wild_Duck
/wiki/The_Lady_from_the_Sea
/wiki/The_Wild_Duck
/wiki/When_We_Dead_Awaken
/wiki/St._John%27s_Eve_(play)
/wiki/Theseus
/wiki/Mickey_Rourke
/wiki/Robert_De_Niro
/wiki/Joy_Mangano
/wiki/The_Futon_Critic
/wiki/Nielsen_ratings
/wiki/CNBC
/wiki/List_of_United_States_over-the-air_television_networks
/wiki/Religious_broadcasting
/wiki/Digital_radio
/wiki/Digital_terrestrial_television
/wiki/TVR1
/wiki/Romanian_Revolution_of_1989
/wiki/Dictatorship
/wiki/Arturo_Araujo
/wiki/Diego_Vigil_Coca%C3%B1a
/wiki/Vicente_Tosta
/wiki/Francisco_Ferrera
/wiki/Vicente_Tosta
/wiki/Jos%C3%A9_Santiago_Bueso
/wiki/Policarpo_Bonilla
/wiki/Tegucigalpa


KeyboardInterrupt: 

# Recursively crawling an entire site

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')
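
Python's default recursion limit is 1000, so a recursive crawl like the one above can raise `RecursionError` on a site with long enough chains of links. One possible workaround is to keep an explicit stack of pages still to visit. The sketch below assumes a hypothetical in-memory `SITE` graph instead of real HTTP requests; `SITE` and `crawl_iteratively` are illustrative names.

```python
# Hypothetical link graph standing in for a site: page -> links found on that page.
SITE = {
    '': ['/wiki/A', '/wiki/B'],
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/A'],
    '/wiki/C': [],
}

def crawl_iteratively(start):
    """Collect every reachable link once, using a stack instead of recursion."""
    pages = set()
    to_visit = [start]
    while to_visit:
        page = to_visit.pop()
        for link in SITE.get(page, []):
            if link not in pages:
                # New page: record it and schedule it for a later visit
                pages.add(link)
                to_visit.append(link)
    return pages

print(sorted(crawl_iteratively('')))
```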

# Collecting data across an entire site

In [6]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id ='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    except AttributeError:
        print('This page is missing something! Continuing.')
    
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')

Main Page
<p><b><a href="/wiki/August_8" title="August 8">August 8</a></b>:
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia
Wikipedia
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Protection_policy#semi
Wikipedia:Protection policy
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Requests_for_page_protection
Wikipedia:Requests for page protection
<p>This page is for requesting that a page, file or template be <b>protected</b>. Please read up on the <a href="/wiki/Wikipedia:Protection_policy" title="Wikipedia:Protection policy">protection policy</a>. Full protection is used to stop edit warring between multiple users or to prevent vandalism to <a href="/wiki/Wikipedia:High-risk_templates" title="Wikipedia:High-risk templates">high-risk templates</a>; semi-protection and pending changes are usually used only to prevent IP and 

KeyboardInterrupt: 

# Collecting all external links from a site

In [7]:
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
# Seed with a float: recent Python versions reject a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())

#Retrieves a list of all Internal links found on a page
def getInternalLinks(bs, includeUrl):
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme, urlparse(includeUrl).netloc)
    internalLinks = []
    #Finds all links that begin with a "/"
    for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if(link.attrs['href'].startswith('/')):
                    internalLinks.append(includeUrl+link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks
            
#Retrieves a list of all external links found on a page
def getExternalLinks(bs, excludeUrl):
    externalLinks = []
    #Finds all links that start with "http" that do
    #not contain the current URL
    for link in bs.find_all('a', href=re.compile('^(http|www)((?!'+excludeUrl+').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bs = BeautifulSoup(html, 'html.parser')
    externalLinks = getExternalLinks(bs, urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        print('No external links, looking around the site for one')
        domain = '{}://{}'.format(urlparse(startingPage).scheme, urlparse(startingPage).netloc)
        internalLinks = getInternalLinks(bs, domain)
        return getRandomExternalLink(internalLinks[random.randint(0,
                                    len(internalLinks)-1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks)-1)]
    
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print('Random external link is: {}'.format(externalLink))
    followExternalOnly(externalLink)
            
followExternalOnly('http://oreilly.com')

Random external link is: https://www.amazon.com/OReilly-Media-Inc/dp/B087YYHL5C/ref=sr_1_2?dchild=1&keywords=oreilly&qid=1604964116&s=mobile-apps&sr=1-2


HTTPError: HTTP Error 503: Service Unavailable
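
The crawl above stops at the first server that refuses the request. One way to keep going, sketched here as an assumption rather than as part of the original script, is to catch the error and return `None` so the caller can pick another link. `fetch_or_none` and its `opener` parameter are hypothetical names; `opener` defaults to `urlopen` and exists only so the function can be exercised without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_or_none(url, opener=urlopen):
    """Return the response body, or None if the request fails."""
    try:
        return opener(url).read()
    except HTTPError as e:
        # The server answered, but with an error status such as 503
        print('Skipping {}: HTTP {}'.format(url, e.code))
    except URLError as e:
        # The server could not be reached at all
        print('Skipping {}: {}'.format(url, e.reason))
    return None
```

`HTTPError` is caught before `URLError` because it is a subclass of it; with the order reversed, the status code branch would never run.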

In [11]:
# Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()


def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    domain = '{}://{}'.format(urlparse(siteUrl).scheme,
                              urlparse(siteUrl).netloc)
    bs = BeautifulSoup(html, 'html.parser')
    internalLinks = getInternalLinks(bs, domain)
    externalLinks = getExternalLinks(bs, domain)

    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            #getAllExternalLinks(link)


allIntLinks.add('http://oreilly.com')
getAllExternalLinks('http://oreilly.com')

https://www.oreilly.com
https://learning.oreilly.com/accounts/login-check/
https://www.oreilly.com/online-learning/try-now.html
https://www.oreilly.com/online-learning/teams.html
https://www.oreilly.com/online-learning/business.html
https://www.oreilly.com/online-learning/government.html
https://www.oreilly.com/online-learning/academic.html
https://www.oreilly.com/online-learning/individuals.html
https://www.oreilly.com/online-learning/features.html
https://www.oreilly.com/online-learning/feature-certification.html
https://www.oreilly.com/online-learning/intro-interactive-learning.html
https://www.oreilly.com/online-learning/live-events.html
https://www.oreilly.com/online-learning/feature-answers.html
https://www.oreilly.com/radar/
https://www.oreilly.com/content-marketing-solutions.html
https://www.oreilly.com/online-learning/enterprise.html
https://learning.oreilly.com/p/register/
https://learning.oreilly.com/search/?query=author%3A%22Arianne%20Dee%22&extended_publisher_data=true&hig

# Suggested adaptations

In [None]:
# https://gist.github.com/AO8/f721b6736c8a4805e99e377e72d3edbf
# Adapted from example in Ch.3 of "Web Scraping With Python, Second Edition" by Ryan Mitchell

import re
import requests
from bs4 import BeautifulSoup

pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"your_URL{page_url}").text # fstrings require Python 3.6+
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)
        
get_links("")

In [18]:
# testing

import re
import requests
from bs4 import BeautifulSoup

pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"https://en.wikipedia.org/wiki/Kevin_Bacon{page_url}").text # fstrings require Python 3.6+
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)
        
get_links("")

/wiki/Wikipedia:Protection_policy#semi
/wiki/Special:SiteMatrix
/wiki/File:Wiktionary-logo-v2.svg
/wiki/File:Wikibooks-logo.svg
/wiki/File:Wikiquote-logo.svg
/wiki/File:Wikisource-logo.svg
/wiki/File:Wikiversity_logo_2017.svg
/wiki/File:Commons-logo.svg
/wiki/File:Wikivoyage-Logo-v3-icon.svg
/wiki/File:Wikinews-logo.svg
/wiki/File:Wikidata-logo.svg
/wiki/File:Wikispecies-logo.svg
/wiki/Wikipedia:User_access_levels#Autoconfirmed_users
/wiki/Wikipedia:Article_wizard
/wiki/Wikipedia:Requested_articles
/wiki/Special:WhatLinksHere/Kevin_Bacon/wiki/Wikipedia:Requested_articles
/wiki/Special:WhatLinksHere/Kevin_Bacon/wiki/Special:WhatLinksHere/Kevin_Bacon/wiki/Wikipedia:Requested_articles
/wiki/Special:WhatLinksHere/Kevin_Bacon/wiki/Special:WhatLinksHere/Kevin_Bacon/wiki/Special:WhatLinksHere/Kevin_Bacon/wiki/Wikipedia:Requested_articles
/wiki/Special:WhatLinksHere/Kevin_Bacon/wiki/Special:WhatLinksHere/Kevin_Bacon/wiki/Special:WhatLinksHere/Kevin_Bacon/wiki/Special:WhatLinksHere/Kevin_Bacon/

KeyboardInterrupt: 
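
The ever-growing paths in this output appear to come from appending each `href` to the full article URL instead of to the site root. A possible fix, sketched here without network access, is to resolve each `href` against the page it was found on with `urllib.parse.urljoin`; the variable names are illustrative.

```python
from urllib.parse import urljoin

page = 'https://en.wikipedia.org/wiki/Kevin_Bacon'
href = '/wiki/Wikipedia:Requested_articles'

# Plain concatenation nests the new path under the article URL
concatenated = page + href
# urljoin resolves the root-relative href against the site root instead
resolved = urljoin(page, href)

print(concatenated)
print(resolved)
```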

In [21]:
from bs4 import BeautifulSoup
from urllib.request import urlopen   # Python 3 replacement for urllib2
from urllib.parse import urljoin     # Python 3 replacement for the urlparse module
import random


class Crawler:
    """Random-walk crawler: visits a few pages and collects every link found."""

    def __init__(self):
        self.soup = None                               # Beautiful Soup object for the current page
        self.current_page = "http://www.python.org/"   # Current page's address
        self.links = set()                             # Every link found so far
        self.visited_links = set()                     # Pages already fetched
        self.counter = 0                               # Simple counter for debugging

    def open(self):
        # Open the URL
        print(self.counter, ":", self.current_page)
        html_code = urlopen(self.current_page).read()
        self.visited_links.add(self.current_page)

        # Collect every link on the page, skipping <a> tags without an href
        self.soup = BeautifulSoup(html_code, 'html.parser')
        page_links = []
        for tag in self.soup.find_all('a', href=True):
            # urljoin handles absolute, root-relative, and relative hrefs
            link = urljoin(self.current_page, tag['href'])
            if link.startswith('http'):
                page_links.append(link)
                print("Adding link " + link)

        # Update the set of known links
        self.links.update(page_links)

        # Choose a random URL from the non-visited set, if any remain
        unvisited = list(self.links - self.visited_links)
        if unvisited:
            self.current_page = random.choice(unvisited)
        self.counter += 1

    def run(self):
        # Crawl 3 pages (or stop early once every known URL has been fetched)
        while len(self.visited_links) < 3:
            self.open()
            if self.links <= self.visited_links:
                break

        for link in self.links:
            print(link)


if __name__ == '__main__':
    C = Crawler()
    C.run()

###

The program can be divided into the following parts:

* Fetching the page's HTML with urllib (urllib is worth learning if you are working with BeautifulSoup!)
* Parsing the page with BS
* Finding a link that contains the word 'next' (see the BS documentation for more details)
* Doing something with the page if you need to (here I am just printing the link's href)
* Repeating all the previous steps for the next page until there are no more next pages

In [25]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup as bs

base_url = 'https://docs.python.org/3/tutorial/'

def find_pages(url):
    """Loop over all pages in online Python tutorial."""
    # try open url
    try:
        page = urlopen(url).read()
    # quit if there's no Next link
    except HTTPError:
        print("The end!")
        return

    # parse the page
    soup = bs(page, 'html.parser')

    # find all occurrences of links whose text is 'next' (filtered with accesskey='')
    next_url = soup.findAll('a', text = "next", attrs = {'accesskey' : ''})[0].get('href')

    # do something meaningful with the scraped page here
    print(next_url)

    # recur with the newly obtained next page's url
    find_pages(base_url + next_url)

find_pages(base_url)

appetite.html
interpreter.html
introduction.html
controlflow.html
datastructures.html
modules.html
inputoutput.html
errors.html
classes.html
stdlib.html
stdlib2.html
venv.html
whatnow.html
interactive.html
floatingpoint.html
appendix.html
../using/index.html
cmdline.html
The end!


alecxe's answer is good and would essentially be the second half of this answer, but it duplicates pages. For example, the URLs https://docs.python.org/3/tutorial/inputoutput.html and https://docs.python.org/3/tutorial/inputoutput.html#old-string-formatting are actually the same page; the second is just an anchor within the page.
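
One way to avoid such duplicates, offered as a sketch and not as part of either answer, is to strip the fragment with `urllib.parse.urldefrag` before recording a URL. The `urls` list and `unique_pages` name are illustrative.

```python
from urllib.parse import urldefrag

urls = [
    'https://docs.python.org/3/tutorial/inputoutput.html',
    'https://docs.python.org/3/tutorial/inputoutput.html#old-string-formatting',
]

# urldefrag splits a URL into the part before '#' and the fragment after it
unique_pages = {urldefrag(u).url for u in urls}
print(unique_pages)
```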

If you want to do it the way you originally described (find the href of the "next" link and navigate to it), you can do something like this:

Use a regex to find the divs with "next" in them, then use their parents to get the actual href. Use urljoin() to join the base_url and the href to obtain the absolute URL of the next page.

In [26]:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 3 location of urljoin


BASE_URL = "https://docs.python.org/3/tutorial/"

def get_next_url(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    selected = soup.select('div.related h3')
    nav = selected[-1] if selected else None  # grab the last match for this CSS selector
    if nav:
        href = nav.parent.find('a', string=re.compile('next'))['href']
        new_url = urljoin(BASE_URL, href)
        return new_url
    else:
        return None

# next_url avoids shadowing the built-in next()
next_url = get_next_url(BASE_URL)
while next_url:
    old = next_url
    next_url = get_next_url(old)

Because the href values are all relative to the current URL, you cannot simply check whether the href attribute starts with https://docs.python.org/3/tutorial/. Note that these links have the classes reference and internal, so let's use that:

soup.find_all("a", class_=["reference", "internal"])
soup.select("a.reference.internal")  # CSS selector to check multiple classes

Here is a working code example that extracts the href values from the page:

In [27]:
from urllib.parse import urljoin  # Python 3 location of urljoin

import requests
from bs4 import BeautifulSoup


base_url = "https://docs.python.org/3/tutorial/"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "html.parser")

for link in soup.select("a.reference.internal"):
    url = link["href"]
    absolute_url = urljoin(base_url, url)

    print(url, absolute_url)

Note that we have to use urljoin() to obtain absolute URLs so that we can follow them.
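
A small offline comparison shows why: plain string concatenation leaves a relative segment such as `../` inside the URL, while `urljoin()` resolves it. The two sample hrefs are taken from the tutorial output earlier in this notebook; the `joined` dictionary is an illustrative name.

```python
from urllib.parse import urljoin

base_url = 'https://docs.python.org/3/tutorial/'

joined = {}
for href in ['appetite.html', '../using/index.html']:
    # urljoin resolves href relative to base_url, collapsing any '../' segments
    joined[href] = urljoin(base_url, href)
    print(base_url + href, '->', joined[href])
```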

https://stackoverflow.com/questions/34439418/python-beautifulsoup4-web-scraping-multiple-pages-on-one-web-site