# Your First Web Scraper
## GET Request

In [1]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
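
The `b'...'` prefix in the output above shows that `read()` returns raw bytes rather than text. Here is a minimal sketch of decoding such a value. It assumes the page is UTF-8 encoded (an assumption, since the encoding is not stated here) and uses a shortened byte literal in place of a live request:

```python
# Shortened stand-in for the bytes returned by html.read() above.
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'

# Assumption: the page is UTF-8 encoded.
text = raw.decode('utf-8')

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)                 # line breaks are rendered instead of shown as \n
```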


## BeautifulSoup

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)  # returns only the first h1 tag; by convention a page has one, but some pages have more

<h1>An Interesting Title</h1>


BeautifulSoup parses the HTML response into a tree of nested tag objects. We can then navigate this tree to extract the data we want. This is the tree structure of the HTML response:

![tree structure](html_tree.png)
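
As a rough illustration of navigating that tree, the sketch below parses a shortened, inlined copy of the page (so no request is needed) and reaches the same `h1` tag by several paths. The variable name `page` is only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the page fetched above, inlined so no request is needed.
page = """<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Lorem ipsum dolor sit amet.</div>
</body>
</html>"""

bs = BeautifulSoup(page, 'html.parser')

# Each tag is reachable as an attribute of a tag that contains it.
print(bs.html.body.h1)    # full path from the root
print(bs.body.h1)         # intermediate levels can be skipped
print(bs.h1.get_text())   # text content only
print(bs.h1.parent.name)  # walk back up the tree
```

All of the tag lookups resolve to the first matching descendant, which is why `bs.h1` and `bs.html.body.h1` give the same result here.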

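Looking at the code above, we can see that the BeautifulSoup constructor takes the name of a parser as its second argument. We can use three options:
- html.parser - included with Python, so it needs no extra installation
- lxml - useful when dealing with broken HTML pages. To use it, we have to install its dependency with `pip install lxml`
- html5lib - a very lenient parser that handles broken HTML the way a browser would, but slower. To use it, we have to install its dependency with `pip install html5lib`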
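
Because some parsers are optional installs, one possible pattern is to try the preferred parser and fall back to the built-in one. This is a sketch only: `make_soup` is a hypothetical helper, not part of BeautifulSoup, and it relies on the `FeatureNotFound` exception that BeautifulSoup raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred='lxml'):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        return BeautifulSoup(markup, 'html.parser')


# Deliberately sloppy markup: the h1 tag is never closed.
soup = make_soup('<body><h1>An Interesting Title</body>')
print(soup.h1.get_text())
```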

### Collecting Reliably and Handling Exceptions
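Scraping a website can fail in many ways. Here are some common errors:
- 404 Page Not Found: the page does not exist on the server
- 500 Internal Server Error: the server failed while processing the request

In both cases, the urlopen function raises the generic exception HTTPError. If the server cannot be reached at all (for example, because the domain does not exist), urlopen raises URLError instead.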

In [1]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

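try:
    html = urlopen("https://pythonscrapingthisurldoesnotexist.com")
except HTTPError as e:
    # The server responded, but with an error status such as 404 or 500.
    # HTTPError is a subclass of URLError, so it must be caught first.
    print("The server returned an HTTP error")
except URLError as e:
    # No response at all: the server is down or the domain does not exist.
    print("The server could not be found!")
else:
    print(html.read())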

The server could not be found!
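
The cell above prints a fixed message for each case. As a sketch of how the exception objects themselves can be used, the hypothetical helper `fetch` below reports the status code and reason. Its `opener` parameter is an assumption added so the behaviour can be shown offline with a stand-in for `urlopen`; `always_404` and the `example.invalid` URL are likewise illustrative.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return (body, error_message); exactly one of the two is None."""
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered with an error status (404, 500, ...).
        return None, f"HTTP error {e.code}: {e.reason}"
    except URLError as e:
        # The server could not be reached at all.
        return None, f"Server could not be reached: {e.reason}"
    return response.read(), None


# Offline demonstration: a stand-in opener that always fails with a 404.
def always_404(url):
    raise HTTPError(url, 404, 'Not Found', None, None)


body, error = fetch('http://example.invalid/missing', opener=always_404)
print(error)
```

Calling `fetch(url)` without the second argument uses the real `urlopen`.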


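Another possible error is a missing tag. Accessing a tag that does not exist (like bs.h1 on a page with no h1) returns None, and accessing a tag on that None value (like bs.body.h1 on a page with no body) raises an AttributeError. The function below handles all these errors: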

In [2]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # The page returned an error status or the server was unreachable.
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        # Raises AttributeError if there is no body, because bsObj.body is None.
        title = bsObj.body.h1
    except AttributeError:
        return None
    # title is also None when the body exists but contains no h1.
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

<h1>An Interesting Title</h1>
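
To see where the AttributeError comes from, the short sketch below parses an inlined document that has neither a body nor an h1. The markup is made up for illustration, and `html.parser` is used so that no extra install is needed.

```python
from bs4 import BeautifulSoup

# A document with no body and no h1.
bs = BeautifulSoup('<html><head><title>No body here</title></head></html>', 'html.parser')

print(bs.h1)    # None: a missing tag is not an error by itself
print(bs.body)  # None

try:
    bs.body.h1  # equivalent to None.h1
except AttributeError as e:
    print('AttributeError:', e)
```

This is why getTitle can return None in two ways: through the except branch when a parent tag is missing, or directly when only the final tag is missing.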
