## Basic Web Scraping with Python BeautifulSoup

Install the external module BeautifulSoup4: `pip install beautifulsoup4`

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup # make sure the module is already installed

First, a basic rule of web scraping with Python: make sure you have the website owner's **permission**, or use a website that is intended for scraping practice. You can also create your own website (containing simple HTML and JavaScript) on your local computer.

Second, inspect the website's HTML in your browser, then decide what information you want to extract from it.

The following code shows basic web scraping with the Python BeautifulSoup module.

In [None]:
# This example uses a test page from the Python Web Scraping book
# You can change the URL and experiment with a different website

url = 'http://www.pythonscraping.com/pages/page1.html'
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)
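
The same calls can be tried without any network access by parsing an HTML string, which is close to the "create your own page locally" option mentioned above. This is a minimal sketch: `sample_html` is a made-up page written for illustration, not the content of the book's test site.

```python
from bs4 import BeautifulSoup

# Hypothetical practice page, standing in for a file you host locally
sample_html = """
<html>
  <head><title>Practice Page</title></head>
  <body>
    <h1>An Interesting Title</h1>
    <div>Some body text.</div>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

print(soup.h1)             # the first <h1> tag, markup included
print(soup.h1.get_text())  # only the text inside the tag
print(soup.title.string)   # text of the <title> tag
```

Once the selectors work on a local string, the same code can be pointed at the result of `urlopen(url)`.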

In [None]:
url = 'https://pythonscrapingthisurldoesnotexist.com' 
try:
    html = urlopen(url)
except HTTPError as e:
    print('The server returned an HTTP error') # The server responded, but with an error status (e.g. 404 or 500)
except URLError as e:
    print('The server could not be found!') # The server could not be reached (e.g. the domain doesn't exist)
else:
    print(html.read())
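
The order of the `except` clauses matters here because `HTTPError` is a subclass of `URLError`: if `URLError` were listed first, it would also catch HTTP errors. The sketch below shows this without touching the network by constructing the exceptions directly. The helper `describe` and the example URL are hypothetical and exist only for this illustration.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def describe(exc):
    """Raise exc and label it, using the same except-order as the cell above."""
    try:
        raise exc
    except HTTPError as e:   # more specific class first
        return f'HTTP error {e.code}'
    except URLError as e:    # parent class second
        return f'URL error: {e.reason}'


# Hand-built exceptions, so no request is actually sent
http_err = HTTPError('http://example.invalid/', 404, 'Not Found', Message(), io.BytesIO())
url_err = URLError('name resolution failed')

print(issubclass(HTTPError, URLError))  # True
print(describe(http_err))               # HTTP error 404
print(describe(url_err))                # URL error: name resolution failed
```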

BeautifulSoup supports several HTML parsers:

- Python’s `html.parser`: part of Python's standard library, so no separate installation is needed
- lxml’s HTML parser: third-party module, installed separately (`pip install lxml`)
- lxml’s XML parser: third-party module, available once `lxml` is installed
- html5lib parser: third-party module, installed separately (`pip install html5lib`)
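
Because only `html.parser` is guaranteed to be present, one possible approach is to try the third-party parsers first and fall back to the built-in one. This is a sketch, not a recommendation from the BeautifulSoup documentation: the helper name `make_soup` and the preference order are assumptions chosen for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=('lxml', 'html5lib', 'html.parser')):
    """Build a BeautifulSoup object with the first parser that is installed."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing
            continue
    raise RuntimeError('No usable parser found')


soup = make_soup('<p>Hello</p>')
print(soup.p.get_text())
```

Different parsers can build slightly different trees from malformed HTML, so for reproducible results it is usually safer to settle on one parser per project.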


![img_1](img/beautifulsoup.jpg)
Reference:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup

In [None]:
def getTitle(url):
    try:
        html = urlopen(url)
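    except (HTTPError, URLError):
        # HTTP error response, or the server could not be reached
        return None
    try:
        # Create a BeautifulSoup object from the HTML page
        # Parser options: html.parser, lxml, lxml-xml, html5lib
        bsObj = BeautifulSoup(html.read(), 'html5lib')

        # Read only the first h1 tag inside body
        title = bsObj.body.h1
    except AttributeError:
        # Raised when bsObj.body is None, i.e. the page has no body tag
        return None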
    return title


url = 'http://www.pythonscraping.com/pages/page1.html'
title = getTitle(url)
if title is None:
    print('Title could not be found')
else:
    print(title)
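
The two failure modes handled in `getTitle` are easy to confuse: accessing a missing tag returns `None`, while accessing a tag *on* `None` raises `AttributeError`. The sketch below reproduces both offline. The helper `first_h1` is hypothetical, and it uses `html.parser` because that parser does not add a missing `<body>` tag.

```python
from bs4 import BeautifulSoup


def first_h1(markup):
    """Return the first <h1> inside <body>, or None if either is missing."""
    soup = BeautifulSoup(markup, 'html.parser')
    try:
        return soup.body.h1   # soup.body is None when there is no <body>
    except AttributeError:
        return None


no_body = '<p>Just a fragment</p>'
no_h1 = '<html><body><p>No heading</p></body></html>'
with_h1 = '<html><body><h1>Found it</h1></body></html>'

print(first_h1(no_body))   # None, via the AttributeError branch
print(first_h1(no_h1))     # None, because .h1 on an existing body returns None
print(first_h1(with_h1))   # <h1>Found it</h1>
```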

In [None]:
url = 'http://www.pythonscraping.com/pages/warandpeace.html'
html = urlopen(url)
bs = BeautifulSoup(html, 'html.parser')

In [None]:
# Inspect the page at http://www.pythonscraping.com/pages/warandpeace.html to follow the code below
nameList = bs.find_all('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())
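
Filtering by class can be written in a few equivalent ways. The sketch below runs on a small made-up snippet (`sample_html` is not the real page content) and compares the attribute dictionary, the `class_` keyword, and a list of classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the green/red spans of the test page
sample_html = """
<div>
  <span class="red">A quoted line of dialogue.</span>
  <span class="green">First Character</span>
  <span class="green">Second Character</span>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Attribute dictionary, as in the cell above
by_dict = soup.find_all('span', {'class': 'green'})

# class_ keyword (the underscore avoids the reserved word "class")
by_keyword = soup.find_all('span', class_='green')

# A list matches any of the given classes
either = soup.find_all('span', class_=['green', 'red'])

names = [tag.get_text() for tag in by_dict]
print(names)
print(by_dict == by_keyword)
print(len(either))
```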

In [None]:
titles = bs.find_all(['h1', 'h2','h3','h4','h5','h6'])
print([title for title in titles])

In [None]:
allText = bs.find_all('span', {'class':{'green', 'red'}})

# Store the text of each matching tag in a list
text_list = [tag.get_text() for tag in allText]
print(text_list)

In [None]:
# Find tags that have both id="title" and class="text"
title = bs.find_all(id='title', class_='text')
title_text = [tag.get_text() for tag in title]
print(title_text)

#### Scraping Table Tag

In [None]:
url = 'http://www.pythonscraping.com/pages/page3.html'
html = urlopen(url)
bs = BeautifulSoup(html, 'html.parser')

for child in bs.find('table',{'id':'giftList'}).children:
    print(child)

In [None]:
for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling) 
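
When printing `.children` or `.next_siblings`, the output also contains the whitespace strings between tags. A possible way to keep only the rows is to filter on `Tag`, as sketched below. The table is a made-up stand-in for `giftList`, and `html.parser` is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table that imitates the structure of the giftList table
sample_html = """<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr><td>Sample gift A</td><td>$1.00</td></tr>
<tr><td>Sample gift B</td><td>$2.00</td></tr>
</table>"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table', {'id': 'giftList'})

# .children yields direct children only, including newline strings
rows = [child for child in table.children if isinstance(child, Tag)]

# .next_siblings starts after the first row, so the header is skipped
data_rows = [sib for sib in table.tr.next_siblings if isinstance(sib, Tag)]

print(len(rows), len(data_rows))
for row in data_rows:
    print([cell.get_text() for cell in row.find_all('td')])
```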

#### Scraping Image

In [None]:
print(bs.find('img',
              {'src':'../img/gifts/img1.jpg'})
      .parent.previous_sibling.get_text())

In [None]:
import re
# Raw string avoids invalid escape sequences; "/" needs no escaping in Python regexes
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images: 
    print(image['src'])
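
To see which `src` values the pattern accepts and rejects, it can be tested on a few hand-written tags. The paths below are invented for this sketch; only the pattern itself comes from the cell above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical <img> tags: two should match, two should not
sample_html = """
<img src="../img/gifts/logo.jpg">
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img2.jpg">
<img src="../img/other/img3.png">
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# A compiled regex as an attribute value is matched against each src
pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')
sources = [img['src'] for img in soup.find_all('img', {'src': pattern})]

print(sources)  # ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']
```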

In [None]:
bs.find_all(lambda tag: len(tag.attrs) == 2)

#### Scraping Particular Text 

In [None]:
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')

In [None]:
# `string` is the current name of the argument that older bs4 releases called `text`
bs.find_all('', string='Or maybe he\'s only resting?')
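
The two searches above return different kinds of objects: the lambda returns tags, while the string search returns the matching text nodes. The sketch below shows the difference on a made-up fragment, assuming a bs4 release that accepts the `string` argument (the newer name for `text`).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing the sentence being searched for
sample_html = "<div><p>He is asleep.</p><p>Or maybe he's only resting?</p></div>"
soup = BeautifulSoup(sample_html, 'html.parser')
target = "Or maybe he's only resting?"

# String search: returns text nodes; .parent gives the enclosing tag
strings = soup.find_all(string=target)
parents = [s.parent.name for s in strings]

# Lambda search: returns tags whose full text equals the target
tags = soup.find_all(lambda tag: tag.get_text() == target)

print(strings)
print(parents)
print([tag.name for tag in tags])
```

Note that the lambda does not match the outer `<div>`, because its `get_text()` is the concatenation of both paragraphs.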