# Writing My First Web Scraper

In [1]:
from urllib.request import urlopen

urllib is a standard Python library with functions for requesting data across the internet, handling cookies, and changing metadata such as headers and your user agent (https://docs.python.org/3/library/urllib.html).

urlopen is used to open a remote object across a network and read it.
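
To make the point about headers and the user agent concrete, here is a minimal sketch that builds a request object without sending it. The user agent string `my-first-scraper/0.1` is a made-up value for illustration; `Request` and `get_header` are part of `urllib.request`.

```python
from urllib.request import Request

# Build a request with a custom user agent; nothing is sent over the network here.
# 'my-first-scraper/0.1' is a hypothetical value, not something the site requires.
req = Request(
    'http://pythonscraping.com/pages/page1.html',
    headers={'User-Agent': 'my-first-scraper/0.1'},
)

# urllib stores header names with only the first letter capitalized
print(req.get_header('User-agent'))
print(req.full_url)
```

A `Request` object like this can be passed to `urlopen` in place of the plain URL string.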

Now we're going to read the contents of the file page1.html.

In [2]:
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
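
The `b'...'` prefix in the output means `read()` returned bytes, not a string. A small sketch of turning bytes into text, using a shortened copy of the page above and assuming the page is encoded as UTF-8:

```python
# A shortened copy of the bytes returned by html.read() above
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'

# Assumption: the page is UTF-8 encoded
text = raw.decode('utf-8')

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)
```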


## Beautiful Soup

Beautiful Soup tries to make sense of the nonsensical: it helps format and organize messy web files by fixing bad HTML and presenting Python objects that represent the XML structure.

In [3]:
from bs4 import BeautifulSoup # type: ignore

In [4]:
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>
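
As a quick illustration of the "fixing HTML" part, here is a sketch that parses a fragment with unclosed tags. The fragment is invented for this example, and the exact repair can differ between parsers.

```python
from bs4 import BeautifulSoup

# Invented fragment: neither <p> nor <b> is ever closed
broken = '<p>unclosed <b>bold'

fixed = BeautifulSoup(broken, 'html.parser')

# Printing the parsed tree shows the missing closing tags filled in
print(fixed)
```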


Important: by convention there is only one h1 tag on a page, but we can't count on that. `.h1` returns only the first h1 it finds.
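
A small sketch of that behavior, using an invented page with two h1 tags. `find_all` is the Beautiful Soup method for getting every match instead of just the first.

```python
from bs4 import BeautifulSoup

# Invented page that breaks the "one h1 per page" convention
sample = '<html><body><h1>First</h1><h1>Second</h1></body></html>'
bs = BeautifulSoup(sample, 'html.parser')

first = bs.h1.get_text()
all_titles = [tag.get_text() for tag in bs.find_all('h1')]

print(first)       # First
print(all_titles)  # ['First', 'Second']
```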

You can think of Beautiful Soup as reading the page into this tree by default:
```
html
├── head
│   └── title → <title>A Useful Page</title>
└── body
    ├── h1 → <h1>An Interesting Title</h1>
    └── div → <div>Lorem Ipsum dolor...</div>
```

So any of these calls would work and return the same tag:  
`bs.html.body.h1`  
`bs.body.h1`  
`bs.html.h1`
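
To check this without a network call, here is a sketch that rebuilds a shortened version of the page as a string and compares the different paths:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for page1.html
sample = (
    '<html><head><title>A Useful Page</title></head>'
    '<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>'
)
bs = BeautifulSoup(sample, 'html.parser')

# Every path ends at the same h1 tag
paths = [bs.h1, bs.html.body.h1, bs.body.h1, bs.html.h1]
same = all(tag == paths[0] for tag in paths)

print(same)  # True
```

The shorter paths work because each attribute lookup searches all descendants of the tag, not only its direct children.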

## Handling Exceptions

A lot of unexpected things can happen while scraping, so we need to be prepared for malformed data, offline sites, and so on.

In [5]:
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("https://pythonscrapingthisurldoesnotexist.com")
except HTTPError as e:
    print("The Server returned an HTTP error!")
except URLError as e:
    print("The server could not be Found")
else:
    print(html.read())

The server could not be Found
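
The order of the `except` clauses matters because `HTTPError` is a subclass of `URLError`. The sketch below shows this with a hypothetical helper, `fetch_status`, whose `opener` parameter is an assumption added here so the two failure cases can be simulated without touching the network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# HTTPError is a subclass of URLError, so it must be caught first
print(issubclass(HTTPError, URLError))  # True

def fetch_status(url, opener=urlopen):
    # Hypothetical helper: 'opener' defaults to urlopen but can be swapped for a fake
    try:
        opener(url)
    except HTTPError as e:
        return f"HTTP error {e.code}"
    except URLError as e:
        return f"unreachable: {e.reason}"
    else:
        return "ok"

# Fake openers that simulate the two failure cases
def always_404(url):
    raise HTTPError(url, 404, 'Not Found', None, None)

def no_such_host(url):
    raise URLError('host not found')

print(fetch_status('http://example.com/missing', opener=always_404))
print(fetch_status('http://example.invalid', opener=no_such_host))
```

If the `URLError` clause came first, it would also catch every `HTTPError`, and the more specific message would never be printed.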


In [6]:
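def getTitle(url):
    # Return the page's first h1 tag, or None if the page or the tag is unavailable
    try:
        html = urlopen(url)
    except URLError:
        # URLError also covers HTTPError (its subclass), so both HTTP errors
        # and unreachable servers are handled here
        return None
    try:
        # The "lxml" parser requires the lxml package to be installed
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError:
        # bsObj.body is None when the page has no body tag
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)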

<h1>An Interesting Title</h1>
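
To see why the `AttributeError` handler is there, here is a sketch that parses strings instead of fetching a URL. `title_from_markup` is a hypothetical helper written for this example, and it uses `html.parser` so it runs without lxml installed.

```python
from bs4 import BeautifulSoup

def title_from_markup(markup):
    # Hypothetical helper: same tag lookup as getTitle, but on a string
    bs = BeautifulSoup(markup, 'html.parser')
    try:
        # If there is no body tag, bs.body is None and .h1 raises AttributeError
        return bs.body.h1
    except AttributeError:
        return None

# Body and h1 both present: the tag is returned
print(title_from_markup('<html><body><h1>An Interesting Title</h1></body></html>'))

# Body present but no h1: the lookup itself returns None, no exception
print(title_from_markup('<html><body><div>No heading</div></body></html>'))

# No body tag at all: AttributeError is caught and None is returned
print(title_from_markup('<p>No body tag</p>'))
```

Both failure cases end in `None`, which is why the caller only needs a single check before using the title.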
