# Web Scraping

In [10]:
from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


## Running BeautifulSoup

In [5]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


## find_all and get_text()
**find_all(tag, attributes)**: returns all the elements that match the filters.

**find()**: returns only the first element that matches the filters.

**get_text()**: returns only the text, without tags.
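
The difference between the three calls can be seen without depending on a live page. The following is a minimal sketch that assumes a small inline HTML snippet (hypothetical, not taken from any of the sites used in this notebook):

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used instead of a live page
html = """
<ul>
  <li class="fruit">Apple</li>
  <li class="fruit">Pear</li>
  <li class="veg">Carrot</li>
</ul>
"""

bs = BeautifulSoup(html, 'html.parser')

# find_all returns every element that matches the filters
fruits = bs.find_all('li', {'class': 'fruit'})
fruit_names = [li.get_text() for li in fruits]

# find returns only the first match, or None when nothing matches
first_item = bs.find('li')
missing = bs.find('table')

print(fruit_names)
print(first_item.get_text())
print(missing)
```

Because find() returns None when nothing matches, it is worth checking the result before calling get_text() on it.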

In [9]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://gamaenlinea.com/VIVERES/Aceites-y-aderezos/Mayonesas/MAYONESA-NATURAL-HEINZ-370-GR/p/10034430')
bs = BeautifulSoup(html.read(), 'html.parser')
namelist = bs.find_all('span', {'class':'nav-items-total'})
for name in namelist:
    print(name.get_text())


0 Artículos
0 Artículos


### Script for pages that block Python's default user agent

One possible workaround is to change the User-Agent header so that the Python request looks like it comes from a browser such as Mozilla.
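
Before contacting a server, the request object can be built and inspected offline to confirm which header would be sent. This is only a sketch: the URL below is a placeholder assumption, and no network call is made.

```python
import urllib.request

# Placeholder URL (an assumption), used only to build the request object
url = 'https://example.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

req = urllib.request.Request(url, headers=headers)

# urllib stores header names capitalized, so the key to look up is 'User-agent'
user_agent = req.get_header('User-agent')
print(user_agent)
```

Passing this req object to urllib.request.urlopen() is what actually sends the request.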

In [12]:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://vallearriba.elplazas.com/huevos-en-estuche-de-12und.html'
hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)' }

req = urllib.request.Request(url, headers=hdr)
response = urllib.request.urlopen(req)
bs = BeautifulSoup(response, 'html.parser')
print(bs.title)

<title>HUEVOS EN ESTUCHE DE 12 UNIDADES</title>


## Exception handling

In [3]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1x.html')
except HTTPError as e:
    print(e)



HTTP Error 404: Not Found


In [8]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://garmaenlinea.com/BEBIDAS/Cervezas/Nacionales/CERVEZA-POLAR-TIPO-PILSEN-BOTELLA-0%2C355-LT/p/40005236')
except HTTPError as e:
    print('An error occurred', e)
except URLError as e:
    print('Server not found ', e)
else:
    print('OK')

Server not found  <urlopen error [Errno 11001] getaddrinfo failed>


In [11]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked')


The server could not be found!


### Exceptions when accessing tags

In [11]:
try:
    badcontent = bs.etiquetaNoExiste.Otraetiqueta
except AttributeError as e:
    print('Tag not found', e)
else:
    if badcontent is None:
        print('Tag not found')
    else:
        print(badcontent)

Tag not found 'NoneType' object has no attribute 'Otraetiqueta'


In [32]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        return None
    
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)

<h1>An Interesting Title</h1>


In [8]:
print(bs.find_all('div'))

[<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>]


## Advanced HTML Parsing

- *find_all(tag, attributes)*: searches the document and returns all the elements that match the given filters
- *find*: returns only the first element that matches the filters
- *get_text()*: returns only the text of a BeautifulSoup object, without the tags

In [None]:
# return a list with several tags
#.find_all(['h1','h2','h3'])

# filter by tag and several attribute values
#.find_all('span', {'class': {'green', 'red'}})

# search by text
#.find_all(string='the text')

# filter by keyword arguments
#.find_all(id='cosa', class_='algo') # class_ is written with a trailing underscore
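
The filters listed above can be tried on a small document. The sketch below assumes a hypothetical inline HTML snippet; the tag contents and the id and class values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML for illustration
html = """
<h1 id="top">Title</h1>
<h2>Subtitle</h2>
<p>the text</p>
<span class="green">Anna</span>
<span class="red">Pierre</span>
<span class="blue">Boris</span>
"""
bs = BeautifulSoup(html, 'html.parser')

# A list of tag names matches any of them
headings = [tag.name for tag in bs.find_all(['h1', 'h2', 'h3'])]

# A list of attribute values matches any of the values
colored = [s.get_text() for s in bs.find_all('span', {'class': ['green', 'red']})]

# string= matches on the text content and returns the matching strings
texts = bs.find_all(string='the text')

# Keyword filters; class_ has an underscore because class is a Python keyword
by_id = bs.find_all(id='top')
by_class = bs.find_all(class_='blue')

print(headings, colored, texts)
print(by_id, by_class)
```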

In [38]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(), 'html.parser')

# search by tag and attribute
nameList = bs.find_all('span', {'class':'green'})

for name in nameList:
    print(name.get_text())


Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


## Finding tags based on their location

## children and descendants
**children** iterates over the direct children of an HTML tag, while **descendants** iterates over every descendant of the tag, at any depth. Both are attributes that return iterators, not methods that return lists.
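
The difference is easiest to see by counting nodes. This is a minimal sketch on a hypothetical one-line HTML snippet (written on a single line so that no whitespace text nodes appear):

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML; one line, so there are no newline text nodes
html = '<div id="box"><p>one <b>bold</b></p><p>two</p></div>'
bs = BeautifulSoup(html, 'html.parser')
box = bs.find('div', {'id': 'box'})

# Direct children only: the two <p> tags
children = list(box.children)

# Every nested node, tags and text strings alike
descendants = list(box.descendants)

print(len(children))
print(len(descendants))
```

Here the descendants include the two p tags, the b tag and the three text strings, which is why the second count is larger.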

In [18]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.pythonscraping.com/pages/page3.html')

bs = BeautifulSoup(html, 'html.parser')
for e, child in enumerate(bs.find('table', {'id':'giftList'}).children):
    print(e, child)

0 

1 <tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
2 

3 <tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
4 

5 <tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
6 

7 <tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gi

## next_sibling and next_siblings
They return the next sibling (**next_sibling**) or all the following siblings (**next_siblings**) of an element in the tree of an HTML or XML document.

In [24]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

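# next_siblings (plural) yields every node that follows the first row;
# next_sibling (singular) returns only the newline string right after it
for n, sibling in enumerate(bs.find('table', {'id':'giftList'}).tr.next_siblings):
    print(n, sibling)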



## previous_siblings
Returns the siblings that come before an element in an HTML or XML document.

In [37]:
# example
from bs4 import BeautifulSoup

html = """
<html>
  <head>
    <title>Prueba</title>
  </head>
  <body>
    <h1>Título</h1>
    <p id=0>Este es el parrafo cero</p>
    <p id=1>Este es el primer párrafo.</p>
    <p id=2>Este es el segundo párrafo.</p>
    <p id=3>Este es el tercer párrafo.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Get the <p> tag whose id is 1
primer_p = soup.find('p',{'id':1})

# Iterate over all the siblings that follow that tag
for hermano in primer_p.next_siblings:
    print(hermano)

# previous siblings
print('---------------------------')
for hermano in primer_p.previous_siblings:
    print(hermano)




<p id="2">Este es el segundo párrafo.</p>


<p id="3">Este es el tercer párrafo.</p>


---------------------------


<p id="0">Este es el parrafo cero</p>


<h1>Título</h1>




In [19]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = open('index.html','r')
# html = urlopen('index.html')
bs = BeautifulSoup(html, 'html.parser')
sitiosTuristicos = bs.find('li')
# print(sitiosTuristicos)
print(list(sitiosTuristicos.next_siblings))

result =[]
for h in sitiosTuristicos.next_siblings:
    if h != '\n':
        result.append(h)

for n, hijo in enumerate(result):
    print(n, hijo)


['\n', <p>Ubicacion: Centro de San Juan</p>, '\n', <li>el castrero</li>, '\n', <p>Ubicacion: Sureste</p>, '\n', <li>la puerta del llano</li>, '\n', <p>Ubicacion: Norte</p>, '\n']
0 <p>Ubicacion: Centro de San Juan</p>
1 <li>el castrero</li>
2 <p>Ubicacion: Sureste</p>
3 <li>la puerta del llano</li>
4 <p>Ubicacion: Norte</p>


## parent and parents
They return the elements above another element in an HTML or XML document: **parent** is the immediate parent and **parents** iterates over all the ancestors.

In [10]:
html = '''
<div id="pages">
  <ul>
    <li class="active"><a href="example.com">Example</a></li>
    <li><a href="example.com">Example</a></li>
    <li><a href="example1.com">Example 1</a></li>
    <li><a href="example2.com">Example 2</a></li>
  </ul>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
page = soup.find('li', {'class': 'active'})
padres = page.parents
for p in padres:
    print(p.name)
# x = soup.find_all('li')
# print(x)

[<li class="active"><a href="example.com">Example</a></li>, <li><a href="example.com">Example</a></li>, <li><a href="example1.com">Example 1</a></li>, <li><a href="example2.com">Example 2</a></li>]


In [53]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs.find('img', {'src':'../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())



$15.00



## Regular expressions and BeautifulSoup

In [21]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.read(), 'html.parser')
images = bs.find_all('img',{'src':re.compile(r'\.\.\/img\/gifts/img.*\.jpg')})

for image in images:
    print(image.attrs['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg


## Lambda and BeautifulSoup
Usage example:

In [38]:
from urllib.request import  urlopen
from bs4 import  BeautifulSoup
html = urlopen('https://gamaenlinea.com/VIVERES/Panes/Salados/PAN-DE-SANDWICH-BLANCO-HOLSUM-420-GR/p/10034381')

bs = BeautifulSoup(html.read(), 'html.parser')

# find all the tags that have a 'class' attribute
# elementos_class = bs.find_all(lambda tag: tag.has_attr('class'))
# for e in elementos_class:
#     print(e.attrs['class'])

# find all the tags that have exactly 5 attributes
print('***********************************************')
dos_attrs = bs.find_all(lambda tag: len(tag.attrs) == 5)
for da in dos_attrs:
    print(da)

***********************************************
<input class="form-control form-control" id="review.headline" name="headline" type="text" value=""/>
<input class="sr-only js-ratingSetInput form-control" id="review.rating" name="rating" type="text" value=""/>
<input class="form-control form-control" id="alias" name="alias" type="text" value=""/>
<div aria-live="polite" aria-relevant="text" class="skip" id="ariaStatusMsg" role="status"></div>


## Web crawlers
In general, crawlers are used to:
- generate a site map
- collect data

Examples:

In [12]:
# Example: get all the links in a Wikipedia article

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html.read(), 'html.parser')
Alinks = bs.find_all('a')
for link in Alinks:
    if 'href' in link.attrs:
        print(link.attrs['href'])

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Early_life_and_education
#Acting_career
#Early_work
#1980s
#1990s
#2000s
#2010s
#Other_ventures
#Six_Degrees_of_Kevin_Bacon
#Personal_life
#Accolades
#Awards_and_nominations
#Other_honors
#S

### Example: getting only the relevant links from the Wikipedia article
A filter and a regex are used. The links of interest have these characteristics:
- They are inside a div whose id is bodyContent
- They do not contain a colon (:)
- The URL starts with /wiki/
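
The last two conditions in the list can be checked with the regex alone, without downloading anything. The sketch below applies the pattern to a handful of sample hrefs chosen for illustration:

```python
import re

# Must start with /wiki/ and must not contain a colon anywhere after it
article_link = re.compile(r'^(/wiki/)((?!:).)*$')

# Sample hrefs chosen for illustration
hrefs = [
    '/wiki/Philadelphia',
    '/wiki/Help:Contents',
    '/w/index.php?title=Special:Search',
    '#Early_life_and_education',
    '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
]

matches = [h for h in hrefs if article_link.match(h)]
print(matches)
```

The `((?!:).)*` part consumes one character at a time, and only while that character is not a colon, so any href containing `:` fails to reach the end-of-string anchor.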

In [11]:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html.read(), 'html.parser')
Alinks = bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile(r'^(/wiki/)((?!:).)*$'))

for link in Alinks:
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Kevin_Bacon_filmography
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Leading_man
/wiki/Character_actor
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/National_Lampoon%27s_Animal_House
/wiki/Footloose_(1984_film)
/wiki/Diner_(1982_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Frost/Nixon_(film)
/wiki/Friday_the_13th_(1980_film)
/wiki/Tremors_(1990_film)
/wiki/The_River_Wild
/wiki/The_Woodsman_(2004_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Patriots_Day_(film)
/wiki/Losing_Chase
/wiki/Loverboy_(2005_film)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
/wiki/Michael_Strobl
/wiki/HBO
/wiki/Taking_Chance
/wiki/Fox_Broadcas

In [None]:

# script with a function that gets all the links of any Wikipedia article
# the user chooses.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# seed the random module so that a different sequence of links is chosen on each run
random.seed(datetime.datetime.now().strftime('%S.%f'))

def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html.read(), 'html.parser')
    return bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile(r'^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
while len(links) > 0:
    newArticle = random.choice(links).attrs['href']
    print(newArticle)
    links = getLinks(newArticle)

### Crawling while avoiding duplicate links

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen(f'http://en.wikipedia.org{pageUrl}')
    bs = BeautifulSoup(html.read(), 'html.parser')
    the_links = bs.find_all('a', href=re.compile(r'^(/wiki/)'))

    for link in the_links:
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # new page found
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')
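
Because getLinks calls itself once per new page, a long crawl may eventually hit Python's recursion limit (1000 frames by default in CPython). One possible alternative, sketched below, keeps the same set of seen pages but replaces the recursion with an explicit queue. The fake_site dictionary and the get_links and crawl helpers are hypothetical stand-ins for downloading and parsing real pages, and the queue visits pages breadth-first rather than depth-first.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages:
# each key is a page, each value the links found on that page
fake_site = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/A', '/wiki/C'],
    '/wiki/C': ['/wiki/A', '/wiki/D'],
    '/wiki/D': [],
}

def get_links(page_url):
    # Stand-in for downloading the page and extracting its links
    return fake_site.get(page_url, [])

def crawl(start_url):
    visited = {start_url}       # pages already seen
    queue = deque([start_url])  # pages waiting to be processed
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order

visit_order = crawl('/wiki/A')
print(visit_order)
```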


In [None]:
# here we collect data such as the page title, the first paragraph
# and the link to edit the page (if available)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen(f'http://en.wikipedia.org{pageUrl}')
    bs = BeautifulSoup(html,'html.parser')

    try:
        print(bs.h1.get_text())
        print(bs.find(id='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    except AttributeError:
        print('This page is missing something. Continuing')
    
    theLinks = bs.find_all('a', href=re.compile('^(/wiki/)'))
    for link in theLinks:
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('/wiki/Lisa_Nowak')
    
        


## Scraping script that searches for external links (at random)

In [None]:
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now().strftime('%S.%f'))

# function that returns a list of all the internal links
def getInternalLinks(bs:BeautifulSoup, includeUrl):
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme, urlparse(includeUrl).netloc)
    internalLinks = []
    # find the links that start with "/" or contain the site URL
    theLinks = bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')'))
    for link in theLinks:
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if (link.attrs['href'].startswith('/')):
                    internalLinks.append(includeUrl + link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks

# returns a list of all the external links found on the page
def getExternalLinks(bs:BeautifulSoup, excludeUrl):
    externalLinks = []
    # find all the links that start with 'http' or 'www'
    # and do not contain the current URL
    theLinks_e = bs.find_all('a', href=re.compile('^(http|www)((?!'+excludeUrl+').)*$'))
    for link in theLinks_e:
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def getRandomExternalLink(startingPage):
    html=urlopen(startingPage)
    bs = BeautifulSoup(html.read(), 'html.parser')
    externalLinks = getExternalLinks(bs, urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        print('No external links, looking around the site for one')
        domain = '{}://{}'.format(urlparse(startingPage).scheme, urlparse(startingPage).netloc)
        internalLinks = getInternalLinks(bs, domain)
        return getRandomExternalLink(random.choice(internalLinks))
    else:
        return random.choice(externalLinks)
    
    
def followExternalOnly(startingPage):
    externalLink = getRandomExternalLink(startingPage)
    print('Random external link is: {}'.format(externalLink))
    followExternalOnly(externalLink)
    
    
followExternalOnly('http://oreilly.com')



In [6]:
# collects a list of all the external links
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    domain = f'{urlparse(siteUrl).scheme}://{urlparse(siteUrl).netloc}'
    bs = BeautifulSoup(html.read(), 'html.parser')
    internalLinks = getInternalLinks(bs, domain)
    externalLinks = getExternalLinks(bs, domain)

    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)

    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            getAllExternalLinks(link)

allIntLinks.add('http://oreilly.com')
getAllExternalLinks('http://oreilly.com')

https://www.oreilly.com
https://www.oreilly.com/member/login/
https://www.oreilly.com/online-learning/try-now.html
https://www.oreilly.com/online-learning/teams.html
https://www.oreilly.com/online-learning/government.html
https://www.oreilly.com/online-learning/academic.html
https://www.oreilly.com/online-learning/individuals.html
https://www.oreilly.com/online-learning/features.html
https://www.oreilly.com/online-learning/courses.html
https://www.oreilly.com/online-learning/feature-certification.html
https://www.oreilly.com/online-learning/intro-interactive-learning.html
https://www.oreilly.com/online-learning/live-events.html
https://www.oreilly.com/online-learning/feature-answers.html
https://www.oreilly.com/online-learning/insights-dashboard.html
https://www.oreilly.com/radar/
https://www.oreilly.com/content-marketing-solutions.html
https://learning.oreilly.com/start-trial/
https://www.oreilly.com/online-learning/generative-ai.html
https://www.oreilly.com/online-learning/testimonia

KeyboardInterrupt: 

## Example: news article

In [2]:
import requests
from bs4 import BeautifulSoup

class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')

def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find("h2").text
    lines = bs.find_all("div", {"class": 'note-text'})
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)

def scrapeBrooking(url):
    bs = getPage(url)
    title = bs.find("h1").text
    body = bs.find("div", {"class": "byo-block -narrow wysiwyg-block wysiwyg"})
    return Content(url, title, body)

url = 'https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'
content = scrapeBrooking(url)
print(f'Title: {content.title}')
print(f'URL: {content.url}\n')
print(content.body)

print('*'*10)

url = 'https://www.eluniversal.com/internacional/164553/petro-cuestiona-diferencia-entre-ucrania-y-palestina-y-pide-a-onu-dos-conferencias-de-paz'
content = scrapeNYTimes(url)
print(f'title: {content.title}')
print(f'URL: {content.url}')
print(content.body)

Title: Delivering inclusive urban access: 3 uncomfortable truths
URL: https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/

<div class="byo-block -narrow wysiwyg-block wysiwyg">
<p>The past few decades have been filled with a deep optimism about the role of cities and suburbs across the world. These engines of economic growth host a majority of world population, are major drivers of economic innovation, and have created pathways to opportunities for untold amounts of people.</p>
</div>
**********
title: 
                                        Petro cuestiona diferencia entre Ucrania y Palestina y pide a ONU dos conferencias de paz                                    
URL: https://www.eluniversal.com/internacional/164553/petro-cuestiona-diferencia-entre-ucrania-y-palestina-y-pide-a-onu-dos-conferencias-de-paz

Caracas.- El presidente colombiano, Gustavo Petro, cuestionó hoy ante la Asamblea General de la ONU "la diferenci

### Example

In [4]:
class Content:
    'common class for all articles/pages'

    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        'prints the url, title and body of the content'

        print("url:{}".format(self.url))
        print("title: {}".format(self.title))
        print("body: \n{}".format(self.body))

class Website:
    'contains information about the structure of the page'

    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url  = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag

In [7]:
import requests
from bs4 import BeautifulSoup

class Crawler:
    
    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')
    
    def safeGet(self, pageObj:BeautifulSoup, selector):
        """
        Utility function used to get content string from a 
        Beautiful Soup object and selector. Return an empty
        string if no object is found for the given selector
        """
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join(elem.get_text() for elem in selectedElems)
        return ''
    
    def parse(self, site, url):
        bs = self.getPage(url)
        if bs is not None:
            print('entra 1')
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            print(title=='', body=='')
            print(title, body)
            if title != '' and body != '':
                print('entra 2')
                content = Content(url, title, body)
                content.print()

crawler = Crawler()
siteData = [
['OReilly Media', 'http://oreilly.com','h1', 'section#product-description'],
['Reuters', 'http://reuters.com', 'h1','div.StandardArticleBody_body_1gnLA'],
['Brookings', 'http://www.brookings.edu','h1', 'div.post-body'],
['New York Times', 'http://nytimes.com','h1', 'p.story-content']
]

websites = []

for n, row in enumerate(siteData):
    websites.append(Website(row[0], row[1], row[2], row[3]))
    print(websites[n].name, websites[n].url, websites[n].titleTag, websites[n].bodyTag )

crawler.parse(websites[0], 'http://shop.oreilly.com/product/0636920028154.do')
crawler.parse(websites[1], 'http://www.reuters.com/article/us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(websites[2], 'https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(websites[3], 'https://www.nytimes.com/2018/01/28/business/energy-environment/oil-boom.html')
     

            

OReilly Media http://oreilly.com h1 section#product-description
Reuters http://reuters.com h1 div.StandardArticleBody_body_1gnLA
Brookings http://www.brookings.edu h1 div.post-body
New York Times http://nytimes.com h1 p.story-content
entra 1
False True
Learning Python, 5th Edition 
entra 1
False True
EPA chief wants scientists to debate climate on TV 
entra 1
False True
Idea to Retire: Old methods of policy education 
entra 1
True True
 


## Tests only

To investigate:
- the **name** attribute
- the **prettify()** method
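
As a starting point for that investigation, the following sketch shows both on a hypothetical inline snippet: name is an attribute holding the tag name, while prettify() is a method that returns the markup as an indented string.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML for illustration
html = '<div><h1>Title</h1><p>Text</p></div>'
bs = BeautifulSoup(html, 'html.parser')

# name is an attribute, not a method: it holds the tag name as a string
tag_name = bs.find('h1').name

# prettify() returns the markup as a string, indented and spread over several lines
pretty = bs.prettify()

print(tag_name)
print(pretty)
```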

In [None]:
from urllib.request import  urlopen
from bs4 import  BeautifulSoup
html = urlopen('https://gamaenlinea.com/VIVERES/Panes/Salados/PAN-DE-SANDWICH-BLANCO-HOLSUM-420-GR/p/10034381')

bs = BeautifulSoup(html.read(), 'html.parser')
elementos_class = bs.find_all(lambda tag: tag.has_attr('class'))

for e in elementos_class:
    print(e.attrs['class'])
# precio = bs.find('div', {'class':'from-price-value'})
# print(precio)

In [15]:
import requests
from bs4 import BeautifulSoup

req = requests.get('https://gamaenlinea.com/ALIMENTOS-FRESCOS/L%C3%A1cteos/Mantequilla-Margarin/MANTEQULLA-CON-SAL-LACTUARIO-MARACAY-100-GR/p/30001570')
bs = BeautifulSoup(req.text, 'html.parser')
result = bs.select("div.js-mobile-logo")
print(result)

[<div class="js-mobile-logo">
</div>]


- note the **has_attr()** method of a BeautifulSoup tag
- note the filters:
  - return a list with several tags: .find_all(['h1','h2','h3'])
  - filter by tag and several attribute values: .find_all('span', {'class': {'green', 'red'}})
  - search by text: .find_all(string='the text')
  - filter by keyword arguments: .find_all(id='cosa', class_='algo') (class_ is written with a trailing underscore)
- note **urllib.parse** for parsing a URL
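
As a quick reference for the last note, this sketch shows how urllib.parse splits a URL into parts and how those parts can be used to tell internal links from external ones. The query string and fragment in the sample URL and the is_internal helper are assumptions added for illustration.

```python
from urllib.parse import urlparse, urljoin

# Sample URL; the query string and fragment are made up for illustration
parts = urlparse('https://www.oreilly.com/online-learning/teams.html?ref=nav#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Rebuild the site root and resolve a relative link against it
base = '{}://{}'.format(parts.scheme, parts.netloc)
absolute = urljoin(base, '/radar/')
print(absolute)

def is_internal(href, site_netloc):
    # Hypothetical helper: a link is internal when it has no host
    # of its own or when its host equals the site's host
    netloc = urlparse(href).netloc
    return netloc == '' or netloc == site_netloc

print(is_internal('/radar/', parts.netloc))
print(is_internal('https://learning.oreilly.com/start-trial/', parts.netloc))
```

Comparing parsed hosts avoids some false matches that can occur when a URL is embedded directly into a regex.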
