# Chapter 01

In [56]:
# import required libraries
from urllib.request import urlopen

In [57]:
# get data from desired url
html = urlopen('http://pythonscraping.com/pages/page1.html')

In [58]:
# print the HTML content of the page
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
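The leading `b` in the output above shows that `read()` returns bytes rather than a string, which is why the newlines appear as literal `\n`. A minimal offline sketch of decoding such bytes, assuming the page is UTF-8 encoded; the `raw` value is a shortened stand-in for the real response, not the full page:

```python
# Bytes of the kind html.read() returns; shortened stand-in for the real page.
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'

# Assumption: the page is UTF-8 encoded.
text = raw.decode('utf-8')

print(type(raw).__name__, type(text).__name__)
print(text)  # newlines are now rendered instead of shown as \n
```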


In [59]:
# install required libraries (BeautifulSoup BS4)
!pip3 install beautifulsoup4



In [60]:
# import beautifulsoup4 library
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [61]:
# get html page
html_page = urlopen('http://pythonscraping.com/pages/page1.html')

In [62]:
# read the html page using the BeautifulSoup library
bs = BeautifulSoup(html_page.read(), 'html.parser')

In [63]:
# print the html page
bs

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

In [64]:
# print h1 tag from the html page
bs.h1

<h1>An Interesting Title</h1>

Note that the h1 tag we want to extract is nested two layers deep in the BeautifulSoup object structure (html -> body -> h1). However, when we fetch it from the object, we can access the h1 tag directly.

In fact, any of the following attribute accesses produces the same output.

In [65]:
bs.html.body.h1

<h1>An Interesting Title</h1>

In [66]:
bs.body.h1

<h1>An Interesting Title</h1>

In [67]:
bs.html.h1

<h1>An Interesting Title</h1>
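These shortcuts work because accessing a tag name as an attribute returns the first matching tag found anywhere beneath the object it is called on. A minimal offline sketch, using an inline HTML string (`sample_html`, a made-up stand-in for the downloaded page) so that no network access is needed:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page, so no network access is needed.
sample_html = (
    '<html><head><title>A Useful Page</title></head>'
    '<body><h1>An Interesting Title</h1><div>Some text</div></body></html>'
)
soup = BeautifulSoup(sample_html, 'html.parser')

# Each expression walks a different path but reaches the same h1 tag.
paths = [soup.h1, soup.body.h1, soup.html.h1, soup.html.body.h1]
print(all(tag == soup.h1 for tag in paths))

# get_text() returns only the text inside the tag.
print(soup.h1.get_text())
```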

When we create a BeautifulSoup object, we pass in two arguments. The first is the HTML content to parse (here, the bytes returned by html_page.read()), and the second specifies the parser that BeautifulSoup should use to build the object (in the majority of cases, it makes no difference which parser we use).

Another popular parser is 'lxml'. This one can be used with BeautifulSoup by changing the parser string provided.

In [68]:
!pip3 install lxml --user



In [72]:
html_page = urlopen('http://pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html_page.read(), 'lxml')

In [73]:
bs

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

'lxml' has some advantages over 'html.parser' in that it is generally better at parsing messy or malformed HTML code. It fixes problems like unclosed tags, improperly nested tags, and missing head or body tags. One disadvantage of 'lxml' is that it has to be installed separately and depends on third-party C libraries.
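Because 'lxml' may not be installed everywhere, one option is to prefer it and fall back to the built-in parser. The following is only a sketch: `make_soup` is a hypothetical helper name, and it relies on BeautifulSoup raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred='lxml'):
    """Parse markup with the preferred parser, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(markup, 'html.parser')

soup = make_soup('<h1>An Interesting Title</h1>')
print(soup.h1.get_text())
```

Note that the two parsers can build different trees from the same malformed input, so code that falls back this way should not depend on the exact repaired structure.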

When the line html_page = urlopen('http://www.pythonscraping.com/pages/page1.html') runs, two main things can go wrong:
- the page is not found on the server (or there was an error in retrieving it)
- the server is not found

In the first case, the server returns an HTTP error, such as 404 'Page Not Found' or 500 'Internal Server Error'. In all of these cases, the urlopen function raises the generic exception HTTPError. In the second case, no server responds at all, and urlopen raises URLError instead.

In [80]:
# check whether the URL can be reached
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print(html.read())

The server could not be found!
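The order of the except clauses in the cell above matters: HTTPError is a subclass of URLError, so the more specific HTTPError has to be caught first. A minimal offline sketch of both failure modes; `classify_failure` and the two fake opener functions are hypothetical names that stand in for urlopen so the example runs without network access:

```python
from urllib.error import HTTPError, URLError

def classify_failure(opener, url):
    """Call opener(url) and describe which kind of failure, if any, occurred."""
    try:
        opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return f'HTTP error {e.code}'
    except URLError as e:
        # No usable response at all, e.g. the host name did not resolve.
        return f'URL error: {e.reason}'
    return 'ok'

def fake_missing_page(url):
    # Stand-in for urlopen: simulates a 404 response.
    raise HTTPError(url, 404, 'Not Found', hdrs=None, fp=None)

def fake_missing_server(url):
    # Stand-in for urlopen: simulates a server that cannot be found.
    raise URLError('name resolution failed')

print(issubclass(HTTPError, URLError))
print(classify_failure(fake_missing_page, 'http://example.com/nope'))
print(classify_failure(fake_missing_server, 'http://nowhere.invalid'))
```

If the URLError clause came first, it would also catch every HTTPError, and the status code would never be reported separately.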


In [83]:
# function to scrape the title (the first h1 tag) from an HTML page
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # the page or the server could not be reached
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError:
        # bs.body is None when the page has no body tag
        return None
    return title

In [84]:
# fetch the page title and store it in a variable
title = getTitle('http://pythonscraping.com/pages/page1.html')

In [85]:
if title is None:
    print('Title could not be found')
else:
    print(title)

<h1>An Interesting Title</h1>
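The function has two distinct "not found" paths, which the following offline sketch separates. It assumes two made-up inline documents (`no_h1` and `no_body`) instead of real pages: a missing tag simply yields None, while accessing a tag on that None raises the AttributeError the function catches.

```python
from bs4 import BeautifulSoup

# A page with a body but no h1, and a fragment with no body tag at all.
no_h1 = BeautifulSoup('<html><body><p>No heading here</p></body></html>', 'html.parser')
no_body = BeautifulSoup('<p>Fragment without a body tag</p>', 'html.parser')

# Missing tag: attribute access returns None rather than raising.
print(no_h1.body.h1)  # None

# Missing parent: no_body.body is None, so asking it for .h1 raises.
try:
    no_body.body.h1
except AttributeError as e:
    print('AttributeError:', e)
```

This is why the caller still has to check the return value against None even when no exception was raised.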
