# Introduction to Web Crawling

We have already seen how to get data from a single web page using web scraping; now let's see how to get data from several pages of the same site.

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [2]:
html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html,"lxml")
bsObj

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Kevin Bacon - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Kevin_Bacon","wgTitle":"Kevin Bacon","wgCurRevisionId":826285118,"wgRevisionId":826285118,"wgArticleId":16827,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia indefinitely semi-protected biographies of living people","Use mdy dates from October 2016","Articles with hCards","All articles with unsourced statements","Articles with unsourced statements from January 2016","Articles needing additional references from October 2017","All articles needing additional references","Articles 

Here we search for all the links (a tags) on the Kevin Bacon page, then loop over them to print each href.

In [3]:
links = bsObj.findAll("a")
links

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected to promote compliance with the policy on biographies of living people"><img alt="Page semi-protected" data-file-height="128" data-file-width="128" height="20" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/20px-Padlock-silver.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/30px-Padlock-silver.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/40px-Padlock-silver.svg.png 2x" width="20"/></a>,
 <a href="#mw-head">navigation</a>,
 <a href="#p-search">search</a>,
 <a class="mw-disambig" href="/wiki/Kevin_Bacon_(disambiguation)" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>,
 <a class="image" href="/wiki/File:Kevin_Bacon_SDCC_2014.jpg"><img alt="Kevin Bacon SDCC 2014.jpg" data-file-height="2649" data-file-width="1907" height="208" src="//upload.wikimedia.org/wi

In [6]:
for link in links:
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
http://baconbros.com/
#cite_note-1
#cite_note-actor-2
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/The_Guardian
/wiki/Academy_Award
#cite_note-3
/wiki/Hollywood_Walk_of_Fame
#cite_note-4
/wiki/Social_networks
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/SixDegrees.org
#cite_note-walk-5
#Early_life_and_education
#Acting_career
#Early_work
#1980s
#1990s
#2000s
#2010s
#Advertising_work
#Personal_life
#Six_Degrees_of_Kevin_Ba

We can see that we get every link on the page. However, some of them are not interesting to us, such as the privacy notice and other links that appear in the footer.  

Inspecting the page structure, we can see that the links leading to an article have three things in common:
<ol> 
<li> They are inside the div with id='bodyContent'</li>
<li> The URLs do not contain colons (:)</li> 
<li> The URLs begin with /wiki/</li>
</ol>

Applying these rules, we can rewrite the search:

In [7]:
import re

In [8]:
regexLinks = re.compile(r"^(/wiki/)((?!:).)*$")
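To see what this pattern accepts, here is a small sketch that applies it to a few hrefs taken from the output above. The helper name is_article_link is hypothetical, and the regex only covers the second and third rules; the first rule (being inside div#bodyContent) still has to be enforced when searching the parsed page.

```python
import re

# Same pattern as regexLinks: starts with /wiki/ and contains no colon.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

def is_article_link(href):
    """Return True when href looks like an article link (hypothetical helper)."""
    return ARTICLE_LINK.match(href) is not None

samples = [
    "/wiki/Kevin_Bacon",                     # article link
    "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",  # contains a colon
    "#mw-head",                              # in-page anchor
    "http://baconbros.com/",                 # external site
]
for href in samples:
    print(href, is_article_link(href))
```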

In [9]:
divsBodyContent = bsObj.find("div",{"id":"bodyContent"})

In [10]:
links_relevantes = divsBodyContent.findAll("a",href=regexLinks)                                          
links_relevantes

[<a class="mw-disambig" href="/wiki/Kevin_Bacon_(disambiguation)" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>,
 <a href="/wiki/San_Diego_Comic-Con" title="San Diego Comic-Con">San Diego Comic-Con</a>,
 <a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>,
 <a href="/wiki/Pennsylvania" title="Pennsylvania">Pennsylvania</a>,
 <a href="/wiki/Kyra_Sedgwick" title="Kyra Sedgwick">Kyra Sedgwick</a>,
 <a href="/wiki/Sosie_Bacon" title="Sosie Bacon">Sosie Bacon</a>,
 <a href="/wiki/Edmund_Bacon_(architect)" title="Edmund Bacon (architect)">Edmund Bacon</a>,
 <a href="/wiki/Michael_Bacon_(musician)" title="Michael Bacon (musician)">Michael Bacon</a>,
 <a href="/wiki/Footloose_(1984_film)" title="Footloose (1984 film)">Footloose</a>,
 <a href="/wiki/JFK_(film)" title="JFK (film)">JFK</a>,
 <a href="/wiki/A_Few_Good_Men" title="A Few Good Men">A Few Good Men</a>,
 <a href="/wiki/Apollo_13_(film)" title="Apollo 13 (film)">Apollo 13</a>,
 <a href="/wiki/Mysti

In [11]:
for link in links_relevantes:
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/The_Guardian
/wiki/Academy_Award
/wiki/Hollywood_Walk_of_Fame
/wiki/Social_networks
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/SixDegrees.org
/wiki/Philadelphia
/wiki/Edmund_Bacon_(architect)
/wiki/Pennsylvania_Governor%27s_School_for_the_Arts
/wiki/Bucknell_University
/wiki/Glory_Van_Scott
/wiki/Circle_in_the_Square
/wiki/Nancy_Mills
/wiki/Cosmopolitan_(magazine)
/wiki/Fraternities_and_sororities
/wiki/Animal_House
/wiki/Search_for_Tomorrow
/wiki/Guiding_Light
/wiki/F

We now have the list of all relevant links. Next we can write a function called 'getLinks' that takes an article URL of the form "/wiki/<Article Name>" and returns a list of all the relevant links (a tags) found in that article.

In [12]:
import datetime as dt
import random

In [13]:
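# random.seed() rejects datetime objects on Python 3.11+, so seed with the timestamp (a float)
random.seed(dt.datetime.now().timestamp())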

In [14]:
def getLinks(articleURL):
    html = urlopen("http://en.wikipedia.org"+articleURL)
    bsObj = BeautifulSoup(html,"lxml")
    links_relevantes = bsObj.find("div",{"id":"bodyContent"}).findAll("a",
                                                    href=re.compile("^(/wiki/)((?!:).)*$"))
    return links_relevantes
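getLinks builds the request URL by concatenating the base address and the article path. As a hedged alternative, the sketch below uses the standard library's urljoin and urldefrag to resolve an href against the base and drop any #fragment, so that two links to different sections of one article map to the same page. BASE_URL and to_absolute are names assumed for this illustration, not part of the notebook.

```python
from urllib.parse import urljoin, urldefrag

BASE_URL = "http://en.wikipedia.org"  # the same base that getLinks prepends

def to_absolute(href, base=BASE_URL):
    """Resolve href against base and strip the #fragment (hypothetical helper)."""
    absolute = urljoin(base, href)
    return urldefrag(absolute).url

for href in ["/wiki/Kevin_Bacon",
             "/wiki/Kevin_Bacon#Acting_career",
             "http://baconbros.com/"]:
    print(to_absolute(href))
```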

In [15]:
links = getLinks("/wiki/Kevin_Bacon")
links

[<a class="mw-disambig" href="/wiki/Kevin_Bacon_(disambiguation)" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>,
 <a href="/wiki/San_Diego_Comic-Con" title="San Diego Comic-Con">San Diego Comic-Con</a>,
 <a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>,
 <a href="/wiki/Pennsylvania" title="Pennsylvania">Pennsylvania</a>,
 <a href="/wiki/Kyra_Sedgwick" title="Kyra Sedgwick">Kyra Sedgwick</a>,
 <a href="/wiki/Sosie_Bacon" title="Sosie Bacon">Sosie Bacon</a>,
 <a href="/wiki/Edmund_Bacon_(architect)" title="Edmund Bacon (architect)">Edmund Bacon</a>,
 <a href="/wiki/Michael_Bacon_(musician)" title="Michael Bacon (musician)">Michael Bacon</a>,
 <a href="/wiki/Footloose_(1984_film)" title="Footloose (1984 film)">Footloose</a>,
 <a href="/wiki/JFK_(film)" title="JFK (film)">JFK</a>,
 <a href="/wiki/A_Few_Good_Men" title="A Few Good Men">A Few Good Men</a>,
 <a href="/wiki/Apollo_13_(film)" title="Apollo 13 (film)">Apollo 13</a>,
 <a href="/wiki/Mysti

In [16]:
i = 0
while i<30:
    newArticle = links[random.randint(0, len(links) -1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
    i+=1

/wiki/Screen_Actors_Guild_Awards
/wiki/Paul_Giamatti
/wiki/Atheism
/wiki/Miao_folk_religion
/wiki/Psychology_of_religion
/wiki/Gnosticism
/wiki/Meta-ethics
/wiki/Indicative
/wiki/Evidentiality
/wiki/Volition_(linguistics)
/wiki/Grammatical_aspect
/wiki/Mandarin_Chinese
/wiki/Yinchuan
/wiki/Zhenjiang
/wiki/Administrative_divisions_of_the_People%27s_Republic_of_China#Township_level
/wiki/Vice_President_of_the_People%27s_Republic_of_China
/wiki/Politics_of_Macau
/wiki/Foreign_relations_of_Macau
/wiki/Wayback_Machine
/wiki/Declaratory_judgment
/wiki/Legal_appeal
/wiki/Common_law
/wiki/Future_interests
/wiki/Property_rights_(economics)
/wiki/Restraint_on_alienation
/wiki/Tangible_property
/wiki/Real_property
/wiki/Lien
/wiki/Concurrent_estate
/wiki/Real_estate


This shows a random path of links through Wikipedia.
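The walk picks one link at random at every step. A minimal sketch of the same idea on a small in-memory graph follows; LINK_GRAPH, random_walk, and the page names are hypothetical stand-ins so that it runs without network access. It uses random.choice and stops early when a page has no relevant links, a case in which random.randint(0, len(links) - 1) would raise an error.

```python
import random

# Hypothetical link graph standing in for real Wikipedia pages.
LINK_GRAPH = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/C"],
    "/wiki/C": ["/wiki/A"],
    "/wiki/D": [],  # a page with no relevant links (dead end)
}

def random_walk(start, steps, get_links, rng):
    """Follow up to `steps` random links from `start`; return the pages visited."""
    path = []
    links = get_links(start)
    for _ in range(steps):
        if not links:  # dead end: nothing left to follow
            break
        current = rng.choice(links)
        path.append(current)
        links = get_links(current)
    return path

rng = random.Random(42)  # fixed seed so the walk is reproducible
path = random_walk("/wiki/A", 5, lambda page: LINK_GRAPH.get(page, []), rng)
print(path)
```

Passing the link source as a function keeps the walk independent of how links are fetched, so a network-backed function could be swapped in for the dictionary lookup.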

Now let's see how to collect every URL on a site, for which we will modify our 'getLinks' function.

In [17]:
pages = set()

In [30]:
def getLinks(articleURL):
    global pages
    html = urlopen("http://en.wikipedia.org"+articleURL)
    bsObj = BeautifulSoup(html,"lxml")
    links_relevantes = bsObj.find("div",{"id":"bodyContent"}).findAll("a",
                                            href=re.compile("^(/wiki/)((?!:).)*$"))

    for link in links_relevantes:
        # `and` short-circuits, so href is only looked up when the attribute exists
        if 'href' in link.attrs and link.attrs["href"] not in pages:
            # We found a new page
            newPage = link.attrs["href"]
            print(newPage)
            pages.add(newPage)
            getLinks(newPage)  # Recursive call: crawl the newly found page

In [31]:
getLinks("/wiki/Kevin_Bacon")

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Kevin_Bacon
/wiki/San_Diego_Comic-Con
/wiki/Geographic_coordinate_system
/wiki/Coordinate_system
/wiki/Coordinate_(disambiguation)
/wiki/Coordinate_space
/wiki/Mathematics
/wiki/Mathematics_(disambiguation)
/wiki/Mathematics_(Cherry_Ghost_song)
/wiki/Single_(music)
/wiki/Music
/wiki/Music_(disambiguation)
/wiki/Musical_notation
/wiki/Musical_isomorphism
/wiki/Isomorphism
/wiki/Isomorphism_(disambiguation)
/wiki/Graph_isomorphism
/wiki/Graph_theory
/wiki/Graph_of_a_function
/wiki/Graph_(discrete_mathematics)
/wiki/Graph_(disambiguation)
/wiki/Graph_(topology)
/wiki/Topology
/wiki/Topography
/wiki/Typography
/wiki/Typology_(disambiguation)
/wiki/Typology_(anthropology)
/wiki/Primate
/wiki/Primate_(disambiguation)
/wiki/Primate_(bishop)
/wiki/Primas_(film)
/wiki/Laura_Bari
/wiki/Argentina
/wiki/Argentina_(disambiguation)
/wiki/Argentina,_Santiago_del_Estero
/wiki/Provinces_of_Argentina
/wiki/Federated_state
/wiki/Federation
/wiki/Federation_(disamb

KeyboardInterrupt: 

In [32]:
len(pages)

101
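Because getLinks calls itself once for every newly found page, a long crawl can exceed Python's recursion limit (1000 frames by default in CPython). The sketch below is one possible iterative variant, not the notebook's method: it keeps pending pages in a queue, so pages are visited breadth-first and the order differs from the recursive version. The names crawl, get_links, max_pages, and SAMPLE_GRAPH are assumptions made for this illustration, and a small in-memory graph replaces the network requests.

```python
from collections import deque

def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl from `start`; return pages in discovery order."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue and len(order) < max_pages:
        page = queue.popleft()
        for href in get_links(page):
            if href not in seen:
                seen.add(href)
                order.append(href)
                queue.append(href)
                if len(order) >= max_pages:  # stop once the page budget is used up
                    break
    return order

# Hypothetical link graph used instead of real requests.
SAMPLE_GRAPH = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/D"],
    "/wiki/C": ["/wiki/D"],
    "/wiki/D": [],
}
print(crawl("/wiki/A", lambda page: SAMPLE_GRAPH.get(page, [])))
```

The max_pages budget gives the crawl a natural stopping point, so it does not need to be interrupted by hand.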

Now that we can visit all the pages, let's do something while we are on each one. We will extract the title, the first paragraph of content, and the link to edit the page (when it is available).

Inspecting a Wikipedia page, we see that:

<ol>
<li> Titles are always inside h1 tags</li>
<li> The first paragraph is always the first p inside div#mw-content-text </li>
<li> Edit links are found along the path li#ca-edit -> span -> a</li>
</ol>
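Before wiring these rules into the crawler, they can be tried on a small inline HTML snippet. This is only a sketch: SAMPLE_HTML and extract_summary are hypothetical, the markup is a simplified imitation of a Wikipedia page, and the built-in "html.parser" is used instead of lxml so that no extra parser is required. Each lookup is guarded so that a missing element yields None instead of raising an exception.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that follows the three rules listed above.
SAMPLE_HTML = """
<html><body>
<h1>Sample Article</h1>
<ul><li id="ca-edit"><span>
<a href="/w/index.php?title=Sample_Article&amp;action=edit">Edit</a>
</span></li></ul>
<div id="mw-content-text"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

def extract_summary(html):
    """Return title, first paragraph text, and edit href (None when absent)."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="mw-content-text")
    first_p = content.find("p") if content else None
    edit_item = soup.find(id="ca-edit")
    # find("a") searches all descendants, so it also covers li -> span -> a
    edit_a = edit_item.find("a") if edit_item else None
    return {
        "title": soup.h1.get_text() if soup.h1 else None,
        "first_paragraph": first_p.get_text() if first_p else None,
        "edit_href": edit_a.get("href") if edit_a else None,
    }

print(extract_summary(SAMPLE_HTML))
```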

So we can write:

In [17]:
pages = set()

In [35]:
def getLinks(articleURL):
    global pages
    html = urlopen("http://en.wikipedia.org"+articleURL)
    bsObj = BeautifulSoup(html,"lxml")
    links_relevantes = bsObj.find("div",{"id":"bodyContent"}).findAll("a",
                                            href=re.compile("^(/wiki/)((?!:).)*$"))
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs["href"])
    except (AttributeError, IndexError):
        # A missing tag raises AttributeError; a page with no <p> raises IndexError.
        # The message means: "The page is missing something. Don't worry!"
        print("A la página le falta algo. No te preocupes!")

    for link in links_relevantes:
        # `and` short-circuits, so href is only looked up when the attribute exists
        if 'href' in link.attrs and link.attrs["href"] not in pages:
            # We found a new page
            newPage = link.attrs["href"]
            print("-------------------------\n"+newPage)
            pages.add(newPage)
            getLinks(newPage)  # Recursive call: crawl the newly found page

In [36]:
getLinks("/wiki/Kevin_Bacon")

Kevin Bacon
<p><b>Kevin Norwood Bacon</b><sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> (born July 8, 1958)<sup class="reference" id="cite_ref-actor_2-0"><a href="#cite_note-actor-2">[2]</a></sup> is an American actor and musician. His films include musical-drama film <i><a href="/wiki/Footloose_(1984_film)" title="Footloose (1984 film)">Footloose</a></i> (1984), the controversial historical conspiracy legal thriller <i><a href="/wiki/JFK_(film)" title="JFK (film)">JFK</a></i> (1991), the legal drama <i><a href="/wiki/A_Few_Good_Men" title="A Few Good Men">A Few Good Men</a></i> (1992), the historical docudrama <i><a href="/wiki/Apollo_13_(film)" title="Apollo 13 (film)">Apollo 13</a></i> (1995), and the mystery drama <i><a href="/wiki/Mystic_River_(film)" title="Mystic River (film)">Mystic River</a></i> (2003). Bacon is also known for taking on darker roles such as that of a sadistic guard in <i><a href="/wiki/Sleepers" title="Sleepers">Sleepers</a></i> (1

England (disambiguation)
<p><b><a href="/wiki/England" title="England">England</a></b> is a country that is part of the United Kingdom.</p>
/w/index.php?title=England_(disambiguation)&action=edit
-------------------------
/wiki/England,_Arkansas
England, Arkansas
<p><b>England</b> is a U.S. city in southwestern <a href="/wiki/Lonoke_County,_Arkansas" title="Lonoke County, Arkansas">Lonoke County, Arkansas</a>, and the county's fourth most populous city. The population was 2,825 at the <a class="mw-redirect" href="/wiki/United_States_Census_2010" title="United States Census 2010">2010 census</a>. It is part of the <a href="/wiki/Little_Rock,_Arkansas" title="Little Rock, Arkansas">Little Rock</a>–<a href="/wiki/North_Little_Rock,_Arkansas" title="North Little Rock, Arkansas">North Little Rock</a>–<a href="/wiki/Conway,_Arkansas" title="Conway, Arkansas">Conway</a> <a class="mw-redirect" href="/wiki/Little_Rock-North_Little_Rock-Conway_metropolitan_area" title="Little Rock-North Little R

Territory (disambiguation)
<p>A <b><a href="/wiki/Territory" title="Territory">territory</a></b> is a subdivision of a country having a legal status different from other regions of that country.</p>
/w/index.php?title=Territory_(disambiguation)&action=edit
-------------------------
/wiki/Box_office_territory
Box office territory
<p>A <b>box office territory</b>,<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[nb 1]</a></sup> in context of the <a href="/wiki/Film_industry" title="Film industry">film industry</a>, ranges from a single country to a grouping of countries for reporting <a href="/wiki/Box_office" title="Box office">box office</a> gross ticket sales.<sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[1]</a></sup> This is distinct from <a href="/wiki/Territory" title="Territory">dependent territories</a>, though such territories under a country's administrative control may confuse box office revenue and reporting due to data variously including or excl

Burbank
<p><b>Burbank</b> may refer to:</p>
/w/index.php?title=Burbank&action=edit
-------------------------
/wiki/Burbank_(surname)
Burbank (surname)
<p><b>Burbank</b> is a surname. Notable people with the surname include:</p>
/w/index.php?title=Burbank_(surname)&action=edit
-------------------------
/wiki/Albert_Burbank
Albert Burbank
<p><b>Albert Burbank</b> (March 25, 1902 – August 15, 1976) was an <a href="/wiki/United_States" title="United States">American</a> <a href="/wiki/Dixieland" title="Dixieland">dixieland</a> <a href="/wiki/Clarinet" title="Clarinet">clarinet</a> player.</p>
/w/index.php?title=Albert_Burbank&action=edit
-------------------------
/wiki/United_States
United States
<p><span style="font-size: small;"><span id="coordinates"><a href="/wiki/Geographic_coordinate_system" title="Geographic coordinate system">Coordinates</a>: <span class="plainlinks nourlexpansion"><a class="external text" href="//tools.wmflabs.org/geohack/geohack.php?pagename=United_States&amp;par

Bossa nova (disambiguation)
<p><b><a href="/wiki/Bossa_nova" title="Bossa nova">Bossa nova</a></b> is a style of music.</p>
/w/index.php?title=Bossa_nova_(disambiguation)&action=edit
-------------------------
/wiki/Bossa_Nova_(dance)
Bossa Nova (dance)
<p><b>Bossa nova</b> was a <a class="mw-redirect" href="/wiki/Fad_dance" title="Fad dance">fad dance</a> that corresponded to the <a href="/wiki/Bossa_nova" title="Bossa nova">bossa nova</a> music. It was introduced in 1960 and faded out in the mid-sixties.</p>
/w/index.php?title=Bossa_Nova_(dance)&action=edit
-------------------------
/wiki/Fad_dance
Novelty and fad dances
<p><b>Fad dances</b> are <a href="/wiki/Dance" title="Dance">dances</a> which are characterized by a short burst of popularity, while <b>novelty dances</b> typically have a longer-lasting popularity based on their being characteristically <a class="mw-redirect" href="/wiki/Humor" title="Humor">humorous</a> or humor-invoking, as well as the sense of uniqueness which th

Cinema of the United States
<p>The <b>cinema of the United States</b>, often <a href="/wiki/Metonymy" title="Metonymy">metonymously</a> referred to as <b>Hollywood</b>, has had a profound effect on the <a href="/wiki/Film_industry" title="Film industry">film industry</a> in general since the early 20th century. The dominant style of American cinema is <a href="/wiki/Classical_Hollywood_cinema" title="Classical Hollywood cinema">classical Hollywood cinema</a>, which developed from 1917 to 1960 and characterizes most films made there to this day. While Frenchmen <a href="/wiki/Auguste_and_Louis_Lumi%C3%A8re" title="Auguste and Louis Lumière">Auguste and Louis Lumière</a> are generally credited with the birth of modern cinema,<sup class="reference" id="cite_ref-lumiere_pioneers_7-0"><a href="#cite_note-lumiere_pioneers-7">[7]</a></sup> American cinema quickly came to be the most dominant force in the industry as it emerged. Since the 1920s, the film industry of the United States has had h

South Africa
<p style="font-size:11px;text-align:center;margin-top:0px;margin-bottom:0px;line-height:1.15em;">in the <a href="/wiki/African_Union" title="African Union">African Union</a><span style="font-size:8px;"><span class="nowrap">  </span></span>(light blue)</p>
A la página le falta algo. No te preocupes!
-------------------------
/wiki/Southern_Africa
Southern Africa
<p><b>Southern Africa</b> is the <a href="/wiki/South" title="South">southernmost</a> <a href="/wiki/Region" title="Region">region</a> of the <a href="/wiki/Africa" title="Africa">African</a> <a href="/wiki/Continent" title="Continent">continent</a>, variably defined by <a href="/wiki/Geography" title="Geography">geography</a> or <a href="/wiki/Geopolitics" title="Geopolitics">geopolitics</a>, and including several countries. The term <i>southern Africa</i> or <i>Southern Africa</i>, generally includes <a href="/wiki/Angola" title="Angola">Angola</a>, <a href="/wiki/Botswana" title="Botswana">Botswana</a>, <a href="

KeyboardInterrupt: 