# ODI Queensland workshop - Web Scraping 

## QUT DMRC - 2015

###  Extract all road sign names on a single page

This notebook extends the previous step to get all of the sign names from a single page.

In [None]:
import bs4
import requests

In [None]:
# this is the base_url
base_url = "http://www.qld.gov.au/transport/safety/signs/"

In [None]:
# select which page to scrape based on the type of road sign
sign_type = "regulatory"

In [None]:
# build the url
thepage = base_url + sign_type + '/'
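
String concatenation works here because `base_url` already ends with a slash. As a minimal sketch, the same URL could also be built with `urljoin` from the standard library, which copes with a missing or extra slash. The helper name `build_page_url` is hypothetical and not part of the workshop code.

```python
from urllib.parse import urljoin

def build_page_url(base_url, sign_type):
    """Join the base URL and a sign type into a page URL ending in '/'."""
    # make sure the base ends with '/', otherwise urljoin replaces the last path segment
    base = base_url if base_url.endswith('/') else base_url + '/'
    # strip slashes from the sign type so we never produce '//' in the path
    return urljoin(base, sign_type.strip('/') + '/')

thepage = build_page_url("http://www.qld.gov.au/transport/safety/signs/", "regulatory")
print(thepage)
```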

In [None]:
# call the url
stuff = requests.get(thepage)

In [None]:
# transform to soup using lxml parser
soup = bs4.BeautifulSoup(stuff.text, "lxml")

In [None]:
# find the table with the signs - it is the first table on the page
signs_table = soup.find('table')

# extract all the rows from the table
lotsofitems = signs_table.findAll('tr')

Now process ```lotsofitems``` in a new way to get all the items instead of just one.

In [None]:
# now let's do the same thing as in the previous step but for all items in the page

thelist = []

for an_item in lotsofitems: 
    
    if an_item.find('th'):        
        print('skipping the header row')
    else:
        # extract the sign name from the <strong> tag
        theitem = an_item.find("strong").get_text()
    
        # add the item to a list
        thelist += [theitem]


In [None]:
thelist

Some of these have extra spaces in them - let's tidy that up.

In [None]:
thelist = []

for an_item in lotsofitems: 
    
    if an_item.find('th'):        
        print('skipping the header row')
    else:
        # extract and clean up the sign name
        temptemp = an_item.find("strong").get_text()
        temptemp = temptemp.split()
        theitem = " ".join(temptemp)

        # add the item to a list (inside the else, so the header row adds nothing)
        thelist += [theitem]

thelist
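
Calling `split()` with no arguments splits on any run of whitespace (spaces, tabs, newlines) and ignores whitespace at either end, so joining the pieces with a single space tidies the text. A small sketch of the idea, where the sample string and the helper name `tidy` are made up for illustration and not taken from the page:

```python
def tidy(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())

# an invented, messy sign name with extra spaces and a newline
messy = "  Keep   left\n   unless overtaking "
print(repr(tidy(messy)))
```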

In [None]:
# alternative way of skipping the header row
thelist = []

for an_item in lotsofitems[1:]:    
    # extract and clean up the sign name
    temptemp = an_item.find("strong").get_text()
    temptemp = temptemp.split()
    theitem = " ".join(temptemp)
    # add the item to a list
    thelist += [theitem]

thelist

But if we check the page, the 'No pedestrians sign' is not the last sign on the page - there are actually 3 tables of signs on the page.

So we need to find and process the extra tables.

Once we find all the tables on the page, we need to be careful to process only the ones that have signs in them. To do this, we check that the table header is what we expect.
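
The header check can be sketched as a small helper that compares the header text with the expected column names. This is only an illustration: `has_expected_header` and `EXPECTED_HEADER` are hypothetical names, and the helper assumes the header cells have already been turned into strings, for example with `[th.get_text() for th in row.findAll('th')]`. Unlike an exact string comparison, it tidies whitespace first and returns `False` when there are too few header cells.

```python
EXPECTED_HEADER = ["Sign", "Meaning"]

def has_expected_header(header_texts, expected=EXPECTED_HEADER):
    """Return True if the first header cells match the expected column names."""
    # tidy each header so stray whitespace does not break the comparison
    cleaned = [" ".join(text.split()) for text in header_texts]
    # slicing is safe even when there are fewer cells than expected
    return cleaned[:len(expected)] == expected

print(has_expected_header(["Sign", "Meaning"]))
print(has_expected_header([" Sign\n", "Meaning "]))
print(has_expected_header(["Fee", "Amount"]))
print(has_expected_header(["Sign"]))
```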

In [None]:
# find all the tables on the page
tables = soup.findAll('table')

# create the list outside the loop so we just make one list
thelist = []

for table in tables:
    # find all the table rows
    lotsofitems = table.findAll('tr')
    
    # check if the first row contains 'th' elements (table header)
    if lotsofitems[0].find('th'): 

        # get all header elements
        temp = lotsofitems[0].findAll('th')
        
        # check that the table header has the text we expect for the table
        if temp[0].get_text() == 'Sign' and temp[1].get_text() == 'Meaning':
            
            print('Traffic sign table found')
            
            # process the table of traffic signs
            for an_item in lotsofitems[1:]:    
                # extract and clean up the sign name
                temptemp = an_item.find("strong").get_text()
                temptemp = temptemp.split()
                theitem = " ".join(temptemp)
                thelist += [theitem]            
        else:
            print('Different table - with header row:', temp)
    else:
        print('Different table - no header row:', lotsofitems[0])
    
thelist
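
As a rough sketch, the loop above could be wrapped in a function so it can be tried on a small offline HTML snippet before pointing it at the live page. The function name `extract_sign_names`, the sample HTML and the sign names in it are invented for illustration. The sketch uses Python's built-in `html.parser` so it does not need lxml, uses `find_all` (the current spelling of `findAll`), and skips empty tables and rows without a `<strong>` tag instead of raising an error.

```python
import bs4

def extract_sign_names(html):
    """Return tidied sign names from every table headed 'Sign' / 'Meaning'."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    names = []
    for table in soup.find_all('table'):
        rows = table.find_all('tr')
        if not rows:
            continue  # empty table, nothing to check
        # tidy the header text before comparing it with what we expect
        header = [" ".join(th.get_text().split()) for th in rows[0].find_all('th')]
        if header[:2] != ['Sign', 'Meaning']:
            continue  # not a traffic sign table
        for row in rows[1:]:
            strong = row.find('strong')
            if strong is None:
                continue  # row without a sign name
            names.append(" ".join(strong.get_text().split()))
    return names

# invented sample: one sign table (with a row missing its name) and one unrelated table
sample_html = """
<table>
  <tr><th>Sign</th><th>Meaning</th></tr>
  <tr><td><img alt="stop"></td><td><strong>Stop   sign</strong> Come to a complete stop.</td></tr>
  <tr><td><img alt="give way"></td><td>No name in this row.</td></tr>
</table>
<table>
  <tr><th>Fee</th><th>Amount</th></tr>
  <tr><td><strong>Not a sign</strong></td><td>10</td></tr>
</table>
"""

print(extract_sign_names(sample_html))
```

Returning a list from a function, rather than building `thelist` at the top level, makes it easier to reuse the same logic for other pages.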

Now we are ready to move on to the third notebook - [Extract all traffic sign data from a single page](step3.ipynb)