# ODI Queensland workshop - Web Scraping 

## QUT DMRC - 2015

### Structure the data extraction as a function

This notebook shows how to move the code for data extraction into a function. 

A function makes it easier to run the same block of code on different data and to reuse it elsewhere. Here the function takes a single parameter: one of the traffic sign pages, already parsed by BeautifulSoup.

In [None]:
import bs4
import requests

### Scrape the page

In [None]:
# base URL shared by all the traffic sign pages
base_url = "http://www.qld.gov.au/transport/safety/signs/"

In [None]:
# select which page to scrape based on the type of road sign
sign_type = "regulatory"

In [None]:
# build the url
thepage = base_url + sign_type + '/'
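
The URL-building step above is itself a small candidate for a function, since only the sign type changes from page to page. The sketch below is one possible way to write it; the helper name `build_page_url` is hypothetical, and `"warning"` is only an assumed second sign type used to show reuse, not a page confirmed by this notebook.

```python
BASE_URL = "http://www.qld.gov.au/transport/safety/signs/"


def build_page_url(sign_type, base_url=BASE_URL):
    """Return the page URL for one sign type, e.g. 'regulatory'."""
    # strip stray slashes so 'regulatory', '/regulatory' and 'regulatory/'
    # all produce the same URL
    return base_url.rstrip("/") + "/" + sign_type.strip("/") + "/"


# "warning" is an assumed example value; only "regulatory" appears in this notebook
for sign_type in ["regulatory", "warning"]:
    print(build_page_url(sign_type))
```

Calling the helper with `"regulatory"` gives the same string as `base_url + sign_type + '/'` in the cell above.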

In [None]:
# request the page and stop early if the server returns an error status
stuff = requests.get(thepage)
stuff.raise_for_status()

In [None]:
# transform to soup using lxml parser
soup = bs4.BeautifulSoup(stuff.text, "lxml")

### Function definitions

The code in this cell does exactly the same thing as in the previous step, but it is packaged into a function that can be called when necessary.

The function processes a BeautifulSoup object and returns the signs as a list of lists. Each inner list holds the sign name, the sign description, and a list of `[image name, image url]` pairs.
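
To make the nesting of the returned list easier to picture, here is a minimal sketch that uses made-up data in the same shape. The sign names, text, and image paths are illustrative only, and `summarise` is a hypothetical helper, not part of the workshop code.

```python
# Illustrative data only: names, descriptions and image paths are made up
example_signs = [
    ["Example sign",
     "First paragraph.\nSecond paragraph.\n",
     [["Example sign image", "/images/example.png"]]],
    ["Another sign", "Only one paragraph.\n", []],
]


def summarise(signs):
    """Return one (name, paragraph count, image count) tuple per sign."""
    summary = []
    for sign_name, description, images in signs:
        # each paragraph of the description ends with a newline
        paragraphs = description.splitlines()
        summary.append((sign_name, len(paragraphs), len(images)))
    return summary


print(summarise(example_signs))
# [('Example sign', 2, 1), ('Another sign', 1, 0)]
```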


In [None]:
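def get_itemlist(thesoup):
    """Return the traffic signs found in a parsed page as a list of lists.

    Each item is [sign_name, description, images], where images is a list
    of [image name, image url] pairs.
    """
    # find all the tables on the page
    tables = thesoup.find_all('table')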
    thelist = []

    for table in tables:
        # find all the table rows
        lotsofitems = table.find_all('tr')

        # check that the table has rows and that the first row contains a 'th' element (table header)
        if lotsofitems and lotsofitems[0].find('th'):

            # get all header elements
            temp = lotsofitems[0].find_all('th')

            # check that the table header has the text we expect for the signs table
            if len(temp) >= 2 and temp[0].get_text() == 'Sign' and temp[1].get_text() == 'Meaning':

                # print('Traffic sign table found')

                # process the table of traffic signs **** THIS IS THE UPDATED SECTION *****
                for an_item in lotsofitems[1:]: 
                    theitem = []

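                    # sign description & title
                    sign_text = an_item.find_all("p")
                    # reset both values so a row without bold text cannot reuse the previous sign's name
                    sign_name = ''
                    description = ''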
                    for para in sign_text:
                        if para.find("strong"):
                            # extract sign name (this assumes only the sign name is in bold)
                            temptemp = para.find("strong").get_text()
                            temptemp = temptemp.split()
                            sign_name = " ".join(temptemp)
                        else:
                            # extract sign description (may be multiple paragraphs)
                            temptemp = para.get_text()
                            temptemp = temptemp.split()
                            description += " ".join(temptemp) + '\n'

                    theitem += [sign_name]
                    theitem += [description]

                    # sign images (may be more than one image per sign name) - save image name & image url
                    images = []
                    for image in an_item.find_all("img"):
                        # get the image name & image url (empty string if an attribute is missing)
                        images += [[image.get('alt', ''), image.get('src', '')]]
                    theitem += [images]

                    thelist += [theitem]

#            else:
#                print('Different table - with header row:', temp)
#        else:
#            print('Different table - no header row:', lotsofitems[0])

                       
    return thelist
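
The function cleans up each piece of text by splitting it on whitespace and joining the parts with single spaces. A minimal illustration of that idiom on its own, with a hypothetical helper name and a made-up input string:

```python
def normalise_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no argument splits on any whitespace and drops
    # empty strings, so leading and trailing whitespace disappears too
    return " ".join(text.split())


raw = "  Give way\n      sign  "  # made-up text with messy spacing
print(normalise_whitespace(raw))
# Give way sign
```

This matters for scraped HTML, where line breaks and indentation in the page source often end up inside the extracted text.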

### Call the function

In [None]:
signs = get_itemlist(soup)

In [None]:
signs
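
Because each sign can carry several images, the nested list is not yet one row per record. The sketch below shows one possible way to flatten a list in this shape into flat tuples; `flatten_signs` is a hypothetical helper and the sample data is made up, so treat it as an illustration rather than the approach the next notebook takes.

```python
def flatten_signs(signs):
    """Return one (name, description, image name, image url) tuple per image."""
    rows = []
    for sign_name, description, images in signs:
        text = description.strip()
        if not images:
            # keep signs that have no image, with empty image fields
            rows.append((sign_name, text, None, None))
        for image_name, image_url in images:
            rows.append((sign_name, text, image_name, image_url))
    return rows


# Illustrative data only, in the same shape as the function's return value
example_signs = [
    ["Example sign", "Some text.\n",
     [["Image A", "/images/a.png"], ["Image B", "/images/b.png"]]],
    ["Another sign", "More text.\n", []],
]

for row in flatten_signs(example_signs):
    print(row)
```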

Now we are ready to move on to the fifth notebook: [Store the data in a dataframe and save to disk](step5.ipynb).