# ODI Queensland workshop - Web Scraping 

## QUT DMRC - 2015

### Extract road sign name from a single item on a single page

This notebook fetches the regulatory road signs page on the Queensland Government Transport and Motoring website and then extracts one of the fields we are interested in from it.

This notebook is broken up into a series of cells. We can run each cell in turn by clicking in the cell and then pressing ```<shift><enter>``` or choosing "run" from the "Cell" menu at the top of the page. The cell runs and any outputs it creates are shown below it.

First we import the Python modules we are going to use to get the information from the website.

In [None]:
import bs4
import requests

The next steps build up the URL that has the information we want. The sections of the URL that we will want to change to get more pages of information are kept separate so we can change them more easily.

In [None]:
# this is the base_url
base_url = "http://www.qld.gov.au/transport/safety/signs/"

In [None]:
# select which page to scrape based on the type of road sign
sign_type = "regulatory"

In [None]:
# build the url
thepage = base_url + sign_type + '/'
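Because the base URL and the sign type are kept apart, the same pattern can build URLs for other sign pages. This is a minimal sketch, assuming other pages follow the same `base_url + sign_type + '/'` layout. The helper name `build_page_url` is made up for illustration, and every sign type other than "regulatory" is a hypothetical placeholder, not a confirmed path on the site.

```python
# base_url comes from the notebook; sign types other than "regulatory" are
# hypothetical placeholders used only to illustrate the pattern
base_url = "http://www.qld.gov.au/transport/safety/signs/"
sign_types = ["regulatory", "warning"]


def build_page_url(base, sign_type):
    """Join the base URL and a sign type, ending with a single slash."""
    return base.rstrip("/") + "/" + sign_type.strip("/") + "/"


pages = [build_page_url(base_url, s) for s in sign_types]
for page in pages:
    print(page)
```

Stripping the slashes before joining means the result is the same whether or not the pieces were typed with a leading or trailing slash.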

Now let's check what the variable ```thepage``` is set to. You can show the value of any variable in a notebook by putting it in the last line of a notebook cell and running the cell. Jupyter will try to display it in a clear way, often clearer than the default 'print' layout.

In [None]:
thepage

These steps get the page using [Requests](http://docs.python-requests.org/en/latest/) and then process it using [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/).

In [None]:
# call the url
stuff = requests.get(thepage)

For some websites it might be necessary to set the User-Agent header. For example, we could pretend to be a standard Mozilla browser by using:

    hdrs = {"User-Agent": "Mozilla/5.0"}
    stuff = requests.get(thepage, headers=hdrs)
    
But for this website we don't need to do that.
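To see which headers a request would carry without contacting the site, Requests can prepare a request without sending it. The sketch below reuses the page URL and the User-Agent string from above; the variable name `prepared` is only for illustration.

```python
import requests

thepage = "http://www.qld.gov.au/transport/safety/signs/regulatory/"
hdrs = {"User-Agent": "Mozilla/5.0"}

# build the request but do not send it, so no network access is needed
prepared = requests.Request("GET", thepage, headers=hdrs).prepare()
print(prepared.headers["User-Agent"])
```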

In [None]:
# check that the page url was valid
stuff.ok
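`ok` is `True` when the status code is below 400. If a failed request should stop the notebook rather than just display `False`, `raise_for_status()` raises an exception instead. A sketch of that alternative, where `fetch_text` is a hypothetical helper name and the 10-second timeout is an arbitrary choice:

```python
import requests


def fetch_text(url, timeout=10):
    """Return the page text, raising requests.HTTPError on a 4xx/5xx reply."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `fetch_text(thepage)` would then either return the HTML as a string or fail loudly at the point where the problem occurred.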

In [None]:
# transform to soup using lxml parser
soup = bs4.BeautifulSoup(stuff.text, "lxml")

In [None]:
# find the table with the signs - it is the first table on the page
signs_table = soup.find('table')

In [None]:
# extract all the rows from the table
lotsofitems = signs_table.find_all('tr')

In [None]:
# show the table header
lotsofitems[0]

In [None]:
# show the first sign
lotsofitems[1]

In [None]:
# extract the <strong> tag, which holds the sign heading, from the first item
sign_heading = lotsofitems[1].find("strong")

In [None]:
# check it out
sign_heading

In [None]:
# extract the name of the sign from the heading tag
sign_name = sign_heading.get_text()

In [None]:
# check it out
sign_name

If we don't want to extract any other information from the heading, we can combine the steps to get just the element we want.

In [None]:
# combine extraction into single step
sign_name = lotsofitems[1].find("strong").get_text()
sign_name
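The same extraction can be tried on a small inline HTML snippet, which makes it easy to see what each call returns without fetching the page. The markup below is an invented stand-in for the real table (its structure and the sign name are assumptions), and it uses Python's built-in `html.parser` so that lxml is not required. It also guards against rows that have no `<strong>` tag, in which case `find` returns `None`.

```python
import bs4

# invented sample markup; the real page's table may be laid out differently
html = """
<table>
  <tr><th>Sign</th><th>Meaning</th></tr>
  <tr><td><img src="stop.png" alt="Stop sign"/></td>
      <td><strong>Stop</strong> You must stop.</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

heading = rows[1].find("strong")
# find() returns None when nothing matches, so check before calling get_text()
sign_name = heading.get_text(strip=True) if heading is not None else None
print(sign_name)
```

The header row has no `<strong>` tag, so running the same lookup on `rows[0]` gives `None` rather than a name, which is why the check matters once more than one row is processed.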

Now we are ready to move on to the second notebook - [Extract all sign names on a single page](step2.ipynb).