# Introduction to Web Crawling

We have already seen how to get data from a single web page using web scraping. Now let's see how to get data from several pages of the same site.

In [12]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [2]:
html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html,"lxml")
bsObj

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Kevin Bacon - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Kevin_Bacon","wgTitle":"Kevin Bacon","wgCurRevisionId":806255732,"wgRevisionId":806255732,"wgArticleId":16827,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia indefinitely semi-protected biographies of living people","Use mdy dates from October 2016","Articles with hCards","All articles with unsourced statements","Articles with unsourced statements from January 2016","Articles needing additional references from October 2017","All articles needing additional references","Articles 

Here we find every link (every a tag) on the Kevin Bacon page and then use a for loop to print the href of each one.

In [3]:
links = bsObj.findAll("a")
links

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected to promote compliance with the policy on biographies of living people"><img alt="Page semi-protected" data-file-height="128" data-file-width="128" height="20" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/20px-Padlock-silver.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/30px-Padlock-silver.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/40px-Padlock-silver.svg.png 2x" width="20"/></a>,
 <a href="#mw-head">navigation</a>,
 <a href="#p-search">search</a>,
 <a class="mw-disambig" href="/wiki/Kevin_Bacon_(disambiguation)" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>,
 <a class="image" href="/wiki/File:Kevin_Bacon_SDCC_2014.jpg"><img alt="Kevin Bacon SDCC 2014.jpg" data-file-height="2649" data-file-width="1907" height="306" src="//upload.wikimedia.org/wi

In [4]:
for link in links:
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
http://baconbros.com/
#cite_note-1
#cite_note-actor-2
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/The_Guardian
/wiki/Academy_Award
#cite_note-3
/wiki/Hollywood_Walk_of_Fame
#cite_note-4
/wiki/Social_networks
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/SixDegrees.org
#cite_note-walk-5
#Early_life_and_education
#Acting_career
#Early_work
#1980s
#1990s
#2000s
#2010s
#Advertising_work
#Personal_life
#Six_Degrees_of_Kevin_Ba

We can see that we get every link on the page. However, some of them are not interesting to us, such as the privacy policy and other links that appear in the footer.

Inspecting the structure of the page, we can see that the links that lead to an article have three things in common:
<ol> 
<li> They are inside the div with id='bodyContent'</li>
<li> The URLs do not contain a colon (:)</li> 
<li> The URLs start with /wiki/</li>
</ol>

Applying these rules, we can rewrite the search:

In [13]:
import re

In [6]:
regexLinks = re.compile(r"^(/wiki/)((?!:).)*$")
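
To check what this pattern keeps and what it discards, here is a small sketch that applies it to a few hrefs copied from the printed list above. The names `article_pattern` and `samples` are introduced only for this illustration.

```python
import re

# Same pattern as regexLinks: starts with /wiki/ and has no colon afterwards.
article_pattern = re.compile(r"^(/wiki/)((?!:).)*$")

# Sample hrefs copied from the list of links printed for the Kevin Bacon page.
samples = [
    "/wiki/Philadelphia",
    "/wiki/Wikipedia:Protection_policy#semi",
    "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",
    "#cite_note-1",
    "http://baconbros.com/",
]

matches = [href for href in samples if article_pattern.match(href)]
print(matches)  # ['/wiki/Philadelphia']
```

Only the plain article link survives: the namespace links contain a colon, and the other two do not start with /wiki/.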

In [7]:
divsBodyContent = bsObj.find("div",{"id":"bodyContent"})

In [8]:
links_relevantes = divsBodyContent.findAll("a",href=regexLinks)                                          
links_relevantes

[<a class="mw-disambig" href="/wiki/Kevin_Bacon_(disambiguation)" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>,
 <a href="/wiki/San_Diego_Comic-Con" title="San Diego Comic-Con">San Diego Comic-Con</a>,
 <a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>,
 <a href="/wiki/Pennsylvania" title="Pennsylvania">Pennsylvania</a>,
 <a href="/wiki/Kyra_Sedgwick" title="Kyra Sedgwick">Kyra Sedgwick</a>,
 <a href="/wiki/Sosie_Bacon" title="Sosie Bacon">Sosie Bacon</a>,
 <a href="/wiki/Edmund_Bacon_(architect)" title="Edmund Bacon (architect)">Edmund Bacon</a>,
 <a href="/wiki/Michael_Bacon_(musician)" title="Michael Bacon (musician)">Michael Bacon</a>,
 <a href="/wiki/Footloose_(1984_film)" title="Footloose (1984 film)">Footloose</a>,
 <a href="/wiki/JFK_(film)" title="JFK (film)">JFK</a>,
 <a href="/wiki/A_Few_Good_Men" title="A Few Good Men">A Few Good Men</a>,
 <a href="/wiki/Apollo_13_(film)" title="Apollo 13 (film)">Apollo 13</a>,
 <a href="/wiki/Mysti

In [9]:
for link in links_relevantes:
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/The_Guardian
/wiki/Academy_Award
/wiki/Hollywood_Walk_of_Fame
/wiki/Social_networks
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/SixDegrees.org
/wiki/Philadelphia
/wiki/Edmund_Bacon_(architect)
/wiki/Pennsylvania_Governor%27s_School_for_the_Arts
/wiki/Bucknell_University
/wiki/Glory_Van_Scott
/wiki/Kevin_Bacon_filmography
/wiki/Circle_in_the_Square
/wiki/Nancy_Mills
/wiki/Cosmopolitan_(magazine)
/wiki/Fraternities_and_sororities
/wiki/Animal_House
/wiki/Search_for_Tomorr
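
Note that the printed list contains repeats (for example, /wiki/Philadelphia appears twice). If only distinct articles are needed, one possible approach is to drop duplicates while keeping the original order. This is a minimal sketch on a hand-copied sample; `hrefs` and `unique_hrefs` are illustrative names, not part of the notebook.

```python
# A few hrefs in the order they were printed, including the repeats.
hrefs = [
    "/wiki/Philadelphia",
    "/wiki/Pennsylvania",
    "/wiki/Edmund_Bacon_(architect)",
    "/wiki/Philadelphia",
    "/wiki/Edmund_Bacon_(architect)",
]

# dict.fromkeys keeps the first occurrence of each key and preserves insertion order.
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)
# ['/wiki/Philadelphia', '/wiki/Pennsylvania', '/wiki/Edmund_Bacon_(architect)']
```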

We now have the list of all the relevant links. Next we can write a function called 'getLinks' that takes an article URL of the form "/wiki/<Article_Name>" and returns a list of all the relevant link tags found in that article.

In [14]:
import datetime as dt
import random

In [11]:
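# seed() accepts None, int, float, str or bytes; a datetime object raises TypeError on Python 3.11+
random.seed(dt.datetime.now().timestamp())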

In [12]:
def getLinks(articleURL):
    # Download the article and parse it
    html = urlopen("http://en.wikipedia.org" + articleURL)
    bsObj = BeautifulSoup(html, "lxml")
    # Keep only the article links inside the main content div
    links_relevantes = bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile(r"^(/wiki/)((?!:).)*$"))
    return links_relevantes
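
getLinks builds the full address by concatenating the host and the article path. A possible alternative, shown here only as a sketch, is `urljoin` from the standard library, which also leaves already absolute hrefs untouched. `BASE_URL` and `to_absolute` are hypothetical names used for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "http://en.wikipedia.org"  # same host that getLinks uses

def to_absolute(href):
    """Resolve an href found on the site against BASE_URL."""
    return urljoin(BASE_URL, href)

print(to_absolute("/wiki/Kevin_Bacon"))      # http://en.wikipedia.org/wiki/Kevin_Bacon
print(to_absolute("http://baconbros.com/"))  # http://baconbros.com/
```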

In [13]:
links = getLinks("/wiki/Kevin_Bacon")
links

[<a class="mw-disambig" href="/wiki/Kevin_Bacon_(disambiguation)" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>,
 <a href="/wiki/San_Diego_Comic-Con" title="San Diego Comic-Con">San Diego Comic-Con</a>,
 <a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>,
 <a href="/wiki/Pennsylvania" title="Pennsylvania">Pennsylvania</a>,
 <a href="/wiki/Kyra_Sedgwick" title="Kyra Sedgwick">Kyra Sedgwick</a>,
 <a href="/wiki/Sosie_Bacon" title="Sosie Bacon">Sosie Bacon</a>,
 <a href="/wiki/Edmund_Bacon_(architect)" title="Edmund Bacon (architect)">Edmund Bacon</a>,
 <a href="/wiki/Michael_Bacon_(musician)" title="Michael Bacon (musician)">Michael Bacon</a>,
 <a href="/wiki/Footloose_(1984_film)" title="Footloose (1984 film)">Footloose</a>,
 <a href="/wiki/JFK_(film)" title="JFK (film)">JFK</a>,
 <a href="/wiki/A_Few_Good_Men" title="A Few Good Men">A Few Good Men</a>,
 <a href="/wiki/Apollo_13_(film)" title="Apollo 13 (film)">Apollo 13</a>,
 <a href="/wiki/Mysti

In [14]:
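for _ in range(30):
    if not links:
        break  # dead end: the current article has no relevant links
    # Pick one of the relevant links at random and follow it
    newArticle = random.choice(links).attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)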

/wiki/David_Koepp
/wiki/Jonathan_Nolan
/wiki/Memento_Mori_(short_story)
/wiki/Memento_(film)
/wiki/DVD_region_code#2
/wiki/Macau
/wiki/Artux
/wiki/Tarim_Basin
/wiki/Oasis
/wiki/Paludification
/wiki/Evapotranspiration
/wiki/Irrigation
/wiki/Center_pivot_irrigation
/wiki/Kansas
/wiki/Smith_County,_Kansas
/wiki/Johnson_County,_Kansas
/wiki/Population_density
/wiki/Human_population_planning
/wiki/Logan%27s_Run
/wiki/Eugenics
/wiki/Reginald_Ruggles_Gates
/wiki/International_Association_for_the_Advancement_of_Ethnology_and_Eugenics
/wiki/Birth_defect
/wiki/Microphthalmia
/wiki/Lacrimal_apparatus
/wiki/Zygomatic_bone
/wiki/Reptile
/wiki/Kuehneosauridae
/wiki/Lepidosauromorpha
/wiki/Devonian


This shows a random path of links through Wikipedia.
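
The same walk can be tried offline. The sketch below assumes a `get_links` function that returns href strings (unlike getLinks, which returns tags) and runs on a small made-up link graph instead of Wikipedia. `random_walk` and `toy_site` are hypothetical names, and the walk stops early if it reaches a page without links.

```python
import random

def random_walk(start, get_links, steps=30, rng=None):
    """Follow a randomly chosen link up to `steps` times, stopping at a dead end."""
    rng = rng or random.Random()
    path = []
    links = get_links(start)
    for _ in range(steps):
        if not links:  # page without relevant links: nothing left to follow
            break
        current = rng.choice(links)
        path.append(current)
        links = get_links(current)
    return path

# Toy link graph standing in for Wikipedia (made-up data, for illustration only).
toy_site = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/C"],
    "/wiki/C": [],  # dead end
}

path = random_walk("/wiki/A", lambda page: toy_site.get(page, []),
                   steps=10, rng=random.Random(0))
print(path)
```

Passing a seeded `random.Random` makes the walk repeatable, which is convenient when debugging.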

Now let's see how to collect all the URLs of a site. To do that, we will modify our 'getLinks' function.

In [15]:
pages = set()

In [16]:
def getLinks(articleURL):
    global pages
    html = urlopen("http://en.wikipedia.org" + articleURL)
    bsObj = BeautifulSoup(html, "lxml")
    links_relevantes = bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile(r"^(/wiki/)((?!:).)*$"))

    for link in links_relevantes:
        if "href" in link.attrs and link.attrs["href"] not in pages:
            # We found a new page
            newPage = link.attrs["href"]
            print(newPage)
            pages.add(newPage)
            # Recursive call: crawl the new page before continuing with this one
            getLinks(newPage)

In [None]:
getLinks("/wiki/Kevin_Bacon")

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Kevin_Bacon
/wiki/San_Diego_Comic-Con
/wiki/Comic_Con_(disambiguation)
/wiki/San_Diego_Comic-Con_International
/wiki/Geographic_coordinate_system
/wiki/Coordinate_system
/wiki/Coordinate_(disambiguation)
/wiki/Coordinate_space
/wiki/Mathematics
/wiki/Mathematics_(disambiguation)
/wiki/Mathematics_(Cherry_Ghost_song)
/wiki/Single_(music)
/wiki/Music
/wiki/Music_(disambiguation)
/wiki/Musical_notation
/wiki/Musical_isomorphism
/wiki/Isomorphism
/wiki/Isomorphism_(disambiguation)
/wiki/Graph_isomorphism
/wiki/Graph_theory
/wiki/Graph_of_a_function
/wiki/Graph_(discrete_mathematics)
/wiki/Graph_(disambiguation)
/wiki/Graph_(topology)
/wiki/Topology
/wiki/Topography
/wiki/Typography
/wiki/Typology_(disambiguation)
/wiki/Typology_(anthropology)
/wiki/Primate
/wiki/Primate_(disambiguation)
/wiki/Primate_(bishop)
/wiki/Hierarchy_of_the_Catholic_Church
/wiki/Saint_Peter
/wiki/Saint_Peter_(disambiguation)
/wiki/List_of_saints_named_Peter
/wiki/St._Peter_(
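
Because every new page adds one more nested call, the recursive version will eventually hit Python's recursion limit (1000 by default) on a site as large as Wikipedia. One possible alternative, sketched below, replaces recursion with a queue and a page cap. Note that this visits pages breadth-first, not depth-first like the function above. It assumes a `get_links` function that returns href strings; `crawl`, `max_pages`, and `toy_site` are hypothetical names, and the toy graph is made-up data.

```python
from collections import deque
from urllib.parse import urldefrag

def crawl(start, get_links, max_pages=100):
    """Visit pages breadth-first and return them in the order discovered."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue and len(order) < max_pages:
        page = queue.popleft()
        for href in get_links(page):
            href = urldefrag(href).url  # drop "#fragment" so anchors are not new pages
            if href not in seen and len(order) < max_pages:
                seen.add(href)
                order.append(href)
                queue.append(href)
    return order

# Toy link graph standing in for the real site (illustration only).
toy_site = {
    "/wiki/A": ["/wiki/B", "/wiki/C#History"],
    "/wiki/B": ["/wiki/A", "/wiki/C"],
    "/wiki/C": ["/wiki/A"],
}

visited_pages = crawl("/wiki/A", lambda page: toy_site.get(page, []))
print(visited_pages)  # ['/wiki/A', '/wiki/B', '/wiki/C']
```

Stripping the fragment matters because hrefs such as /wiki/DVD_region_code#2 point to a section of a page, not to a different page.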

Now that we can reach every page, let's do something while we are on each one. We will extract the title, the first paragraph of content, and the link to edit the page (if it is available).

Inspecting a Wikipedia page, we see that:

<ol>
<li> Titles are always inside h1 tags</li>
<li> The first paragraph is always the first p inside div#mw-content-text </li>
<li> Edit links are found by following li#ca-edit -> span -> a</li>
</ol>

So we can write:

In [17]:
pages = set()

In [15]:
def getLinks(articleURL):
    global pages
    html = urlopen("http://en.wikipedia.org" + articleURL)
    bsObj = BeautifulSoup(html, "lxml")
    links_relevantes = bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile(r"^(/wiki/)((?!:).)*$"))
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs["href"])
    except (AttributeError, IndexError):
        # A missing tag gives None (AttributeError); no p at all gives IndexError
        print("This page is missing something. Don't worry!")

    for link in links_relevantes:
        if "href" in link.attrs and link.attrs["href"] not in pages:
            # We found a new page
            newPage = link.attrs["href"]
            print("-------------------------\n" + newPage)
            pages.add(newPage)
            # Recursive call: crawl the new page before continuing with this one
            getLinks(newPage)

In [18]:
getLinks("/wiki/Kevin_Bacon")

Kevin Bacon
<p><b>Kevin Norwood Bacon</b><sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> (born July 8, 1958)<sup class="reference" id="cite_ref-actor_2-0"><a href="#cite_note-actor-2">[2]</a></sup> is an American actor and musician. His notable films include musical-drama film <i><a href="/wiki/Footloose_(1984_film)" title="Footloose (1984 film)">Footloose</a></i> (1984), the controversial historical conspiracy legal thriller <i><a href="/wiki/JFK_(film)" title="JFK (film)">JFK</a></i> (1991), the legal drama <i><a href="/wiki/A_Few_Good_Men" title="A Few Good Men">A Few Good Men</a></i> (1992), the historical docudrama <i><a href="/wiki/Apollo_13_(film)" title="Apollo 13 (film)">Apollo 13</a></i> (1995), and the mystery drama <i><a href="/wiki/Mystic_River_(film)" title="Mystic River (film)">Mystic River</a></i> (2003). Bacon is also known for taking on darker roles such as that of a sadistic guard in <i><a href="/wiki/Sleepers" title="Sleepers">Sleepers</a

KeyboardInterrupt: 