# ODI Queensland workshop - Web Scraping 

## QUT DMRC - 2015

### Support for multiple pages

This notebook scrapes the traffic sign pages under http://www.qld.gov.au/transport/safety/signs/ and saves the data in a pandas dataframe.
The site has one page per type of sign, so the script iterates over the sign types and scrapes each page in turn.

In [None]:
# initialise inline plotting in the notebook
%matplotlib inline

### Import packages

In [None]:
import bs4
import requests
import pandas as pd
from os.path import isfile

### Initialise global variables

In [None]:
# this is the base_url
base_url = "http://www.qld.gov.au/transport/safety/signs/"

In [None]:
# column labels
colnames = ["sign_name", "sign_type", "description", "images"]

# not using these for now - but needed to clean up display of images column
images_cols = ["image_name", "url"]

In [None]:
# The pages are organised by type of sign.
sign_types = ["regulatory",
              "hazard",
              "warning",
              "route",
              "instruction",
              "tourist",
              "service",
              "roadworks"]

# to limit the number of pages to scrape, shorten this list, e.g.
# (comment out the next line to scrape all the sign types listed above)
sign_types = ["regulatory", "hazard"]

Instead of a static list of ```sign_types```, we could scrape the list of pages from the menu on the side of the page. That would also let us keep each page's long name as well as its URL.
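
As a rough sketch of that idea, the snippet below turns a list of menu links into a dictionary of sign types and long names. It assumes the links have already been collected as `(href, text)` pairs, for example with BeautifulSoup. The function name `sign_types_from_links` and the example links are hypothetical, and the real menu markup may need a more specific selector.

```python
from urllib.parse import urljoin, urlparse

base_url = "http://www.qld.gov.au/transport/safety/signs/"

def sign_types_from_links(links, base_url):
    """Map each page directly below base_url to the text of its menu link."""
    base_path = urlparse(base_url).path
    found = {}
    for href, text in links:
        # resolve relative links against the base url
        path = urlparse(urljoin(base_url, href)).path
        # keep only links that sit below the base path
        if path.startswith(base_path) and path != base_path:
            slug = path[len(base_path):].strip("/")
            # ignore deeper levels such as "regulatory/parking"
            if slug and "/" not in slug:
                found[slug] = " ".join(text.split())
    return found

# hypothetical menu links, e.g. collected with
# [(a.get("href"), a.get_text()) for a in soup.find_all("a", href=True)]
example_links = [
    ("/transport/safety/signs/regulatory/", "Regulatory signs"),
    ("hazard/", "Hazard  signs"),
    ("/transport/safety/", "Safety"),
    ("http://www.qld.gov.au/transport/safety/signs/", "Signs"),
]

menu = sign_types_from_links(example_links, base_url)
print(menu)
```

The keys of `menu` could then replace the static ```sign_types``` list, while the values keep the long names for display.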


### Function definitions

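The ```get_itemlist``` function is unchanged from the previous page. Note that it reads the global variable ```sign_type```, which is set by the loop in the script below.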

In [None]:
# processes a beautiful_soup data structure and returns new signs in a dataframe
def get_itemlist(thesoup):
    
    # find all the tables on the page
    tables = thesoup.findAll('table')
    thelist = []

    for table in tables:
        # find all the table rows
        lotsofitems = table.findAll('tr')

        # check if the first row contains 'th' elements (table header)
        if lotsofitems[0].find('th'): 

            # get all header elements
            temp = lotsofitems[0].findAll('th')

            # check that the table header has the text we expect for the signs table
            if temp[0].get_text() == 'Sign' and temp[1].get_text() == 'Meaning':

                # print('Traffic sign table found')

                # process the table of traffic signs **** THIS IS THE UPDATED SECTION *****
                for an_item in lotsofitems[1:]: 
                    theitem = []

                    # sign description & title
                    sign_text = an_item.findAll("p")
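                    # start each row with an empty name, so a row without bold text
                    # does not reuse the name from the previous row
                    sign_name = ''
                    description = ''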
                    for para in sign_text:
                        if para.find("strong"):
                            # extract sign name (this assumes only the sign name is in bold)
                            temptemp = para.find("strong").get_text()
                            temptemp = temptemp.split()
                            sign_name = " ".join(temptemp)
                        else:
                            # extract sign description (may be multiple paragraphs)
                            temptemp = para.get_text()
                            temptemp = temptemp.split()
                            description += " ".join(temptemp) + '\n'

                    theitem += [sign_name]
                    theitem += [sign_type]
                    theitem += [description]

                    # sign images (may be more than one image per sign name) - save image name & image url
                    images = []
                    for image in an_item.findAll("img"):
                        # get the image name & image url (empty string if an attribute is missing)
                        images += [[image.get('alt', ''), image.get('src', '')]]
                    theitem += [images]

                    thelist += [theitem]

#            else:
#                print('Different table - with header row:', temp)
#        else:
#            print('Different table - no header row:', lotsofitems[0])

    return pd.DataFrame(thelist, columns=colnames)
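
The function tidies the scraped text by splitting it on whitespace and joining the pieces with single spaces. The short sketch below shows that step in isolation. The helper name `normalise_whitespace` and the sample string are made up for illustration.

```python
def normalise_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # split() with no arguments splits on any whitespace and drops empty strings
    return " ".join(text.split())

# hypothetical text as it might come out of a table cell
raw = "  Stop \n   sign\t"
clean = normalise_whitespace(raw)
print(repr(clean))
```

This is why sign names and descriptions end up on one line each, even when the HTML source wraps them across several lines.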

### The script

In [None]:
# load previously scraped signs, or start with an empty dataframe

# if there already is a file...
if isfile("signs.pkl"):
    # ...load signs from that file
    signs = pd.read_pickle("signs.pkl")
else:
    # otherwise, set up an empty dataframe
    signs = pd.DataFrame(columns=colnames)

# show the number of signs in the dataframe
print(len(signs))

In [None]:
# iterate over the list of types
for sign_type in sign_types:
            
    # build the url
    thepage = base_url + sign_type + '/'        

    # call the url
    stuff = requests.get(thepage)

    # transform to soup using lxml parser
    soup = bs4.BeautifulSoup(stuff.text, "lxml")

    # extract the new signs from this page
    new_signs = get_itemlist(soup)

    # add the new signs to the dataframe
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    signs = pd.concat([signs, new_signs], ignore_index=True)

    # print something to show how the process progresses
    print("URL:",thepage,flush=True)
        
                
    # *** Tidy up the data and save to disk after each page has been scraped ***
        
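    # remove duplicates in case the same page has been scraped more than once
    # (note: the "images" column holds lists, which cannot be hashed, so this
    # call needs a subset= of the text columns before it can be enabled)
    # signs = signs.drop_duplicates()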
        
    # save the signs to a csv file after each page
    signs.to_csv("signs.csv")
        
    # save the signs to a pkl file after each page
    signs.to_pickle("signs.pkl")


### Check the result

In [None]:
# how many signs are there in the dataframe?
len(signs)

In [None]:
# have a look at the first five items
signs[:5]

### Data processing

In [None]:
# count the images saved for each sign
signs["number_of_images"] = signs["images"].map(len)

### Plot the data

In [None]:
# histograms
pp = signs.hist(figsize = (12,7))

In [None]:
# horizontal bar chart of the number of images for the first 15 signs,
# using the sign_name column as the y-tick labels
pp = signs[0:15].plot(kind='barh', x='sign_name', y='number_of_images')

### Statistical analysis

In [None]:
signs.describe()

That concludes our workshop today. 

Please come and talk to us at the conference if you have any questions, or contact us by email: Patrik Wikström (patrik.wikstrom@qut.edu.au) and Brenda Moon (brenda.moon@qut.edu.au).

If you want to extend this example, you can find more information about the tools we have used on their websites, which are listed at the bottom of the [index page](index.ipynb).