## **PART 1**
#### Building Scrapers
#### **CHAPTER 1**
#### Your First Web Scraper


*   Basics of sending a GET request to a web server
*   Reading the HTML output from the page
*   Extracting data to isolate the content we are looking for



#### Connecting

In [5]:
# text-wrap for notebook output
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [6]:
from urllib.request import urlopen
html=urlopen('https://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
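The `b'...'` prefix in the output above shows that `html.read()` returns bytes rather than a string. A minimal sketch of decoding it follows, assuming the page is UTF-8 encoded (an assumption; the real encoding is whatever the server or the page declares). The `raw` value is a shortened stand-in for the response body so the snippet runs without a network connection.

```python
# Shortened stand-in for the bytes returned by html.read()
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'

# Assumption: the page is UTF-8 encoded
text = raw.decode('utf-8')

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)                 # newlines are now rendered instead of shown as \n
```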


#### An Introduction to BeautifulSoup

*  Import urlopen from the request module of the urllib library
*  Get the HTML content by calling html.read()
*  Transform the HTML content into a BeautifulSoup object

In [9]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html=urlopen("http://www.pythonscraping.com/pages/page1.html")
# name the parser explicitly; leaving it out makes bs4 guess one and emit a warning
bsObj=BeautifulSoup(html.read(), 'html.parser')
print(bsObj.h1)

# the expressions below return the same tag:
# bsObj.html.body.h1
# bsObj.body.h1
# bsObj.html.h1

<h1>An Interesting Title</h1>
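The equivalence of the access paths listed in the cell above can be checked without a network connection. This is a minimal sketch, assuming an inline copy of the page (`sample`, with the body text shortened) and the standard-library `html.parser`; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Inline copy of the page's structure (body text shortened)
sample = """<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Lorem ipsum dolor sit amet</div>
</body>
</html>"""

soup = BeautifulSoup(sample, 'html.parser')

# Four ways of reaching the same <h1> tag
paths = [soup.h1, soup.html.body.h1, soup.body.h1, soup.html.h1]
for tag in paths:
    print(tag)
```

Each expression walks a different route through the tree, but all of them stop at the first `h1` they find, which is why the shortest form is usually enough on a simple page.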


In [10]:
print(bsObj)

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>



#### Connecting Reliably
*  Prevent code execution from stopping unexpectedly
*  Make sure the page can be found
*  Make sure the server can be found

In [5]:
# handle page not found, print out HTTP error
from urllib.request import urlopen
from urllib.error import HTTPError
try:
    html=urlopen('http://www.pythonscraping.com/pages/NO_SUCH_PAGE.html')
except HTTPError as e:
    print(e)

HTTP Error 404: Not Found


In [None]:
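# handle server not found: urlopen raises URLError rather than returning None
from urllib.request import urlopen
from urllib.error import URLError
try:
    html=urlopen('http://www.pythonscraping.com/pages/page1.html')
except URLError as e:
    print('Server could not be found')
else:
    pass  # program continues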

In [None]:
# handle a non-existing tag
# bsObj.nonExistingTag returns None when the tag is absent;
# accessing .anotherTag on that None raises AttributeError
try:
    badContent=bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)
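The two failure cases handled above can be seen separately on a small inline document. This is a minimal sketch, assuming a one-line HTML string and the standard-library `html.parser`; `table` and `tr` are simply tags that do not occur in the sample.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><h1>An Interesting Title</h1></body></html>',
                     'html.parser')

# Case 1: a missing tag accessed directly yields None, with no exception
missing = soup.table
print(missing is None)

# Case 2: chaining off that None raises AttributeError
try:
    soup.table.tr
    chained_failed = False
except AttributeError:
    chained_failed = True
print(chained_failed)
```

This is why the cell above needs both the `except AttributeError` branch and the `None` check in the `else` branch.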

In [17]:
# Comprehensive query that takes exceptions into account

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):
    try:
        html=urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj=BeautifulSoup(html.read(), 'html.parser')
        title=bsObj.body.h1
    except AttributeError as e:
        return None
    return title
title=getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)


<h1>An Interesting Title</h1>
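`getTitle` catches `HTTPError` only. When the server itself cannot be reached, `urlopen` raises `URLError` (the parent class of `HTTPError`), which would still stop the program. A minimal sketch of a fetch helper covering both cases follows; `fetch_html` is a hypothetical name, not part of the original notes.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch_html(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        # This clause must come first because HTTPError subclasses URLError.
        print('HTTP error:', e.code)
    except URLError as e:
        # The server (or resource) could not be reached at all
        print('Could not reach the resource:', e.reason)
    return None
```

The returned bytes can then be passed to `BeautifulSoup(..., 'html.parser')` in the same way `getTitle` does.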


#### **CHAPTER 2**
#### Advanced HTML Parsing

#### Things to avoid
*  What if the target content is buried 20 tags deep in the HTML?  
*  Avoid writing very specific tag paths, which might break with the slightest change to the website  
*  What are the options?  
    * Look for a "print this page" link or a mobile version of the site
    * Look for information hidden in a JavaScript file
    * The information might be available in the URL itself
    * Look for alternate sources
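
For the case where the information is already in the URL, the standard library's `urllib.parse` can take the URL apart without downloading anything. This is a minimal sketch; the URL and its parameter names are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical product URL; the path and query parameters are invented
url = 'http://example.com/products/view?id=1234&category=books&page=2'

parts = urlparse(url)
params = parse_qs(parts.query)  # each value is a list of strings

print(parts.path)              # /products/view
print(params['id'][0])         # 1234
print(params['category'][0])   # books
```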
