# AoIR IR16 pre-conference workshop - Web Scraping 

## QUT DMRC - 2015

### Structure the data extraction as a function

This notebook shows how to move the code for data extraction into a function. 

A function makes it easier to run the same block of code on different data and to reuse it in other code. Here the function takes a single parameter: one of the review pages, already parsed by BeautifulSoup.
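
One benefit of passing in a parsed page is that the function can be tried out without touching the network. The sketch below illustrates this with a hypothetical helper, `get_titles`, and a hand-written HTML snippet; both are assumptions made for illustration and only mimic the `product_wrap` and `product_title` classes used later in this notebook.

```python
import bs4

# Hypothetical two-item snippet standing in for a downloaded page
SAMPLE_HTML = """
<div class="product_wrap"><div class="product_title"><a href="/music/x">  First
   Album </a></div></div>
<div class="product_wrap"><div class="product_title"><a href="/music/y">Second Album</a></div></div>
"""

def get_titles(thesoup):
    """Return the whitespace-normalized title of every product_wrap div."""
    titles = []
    for item in thesoup.find_all("div", class_="product_wrap"):
        title = item.find("div", class_="product_title")
        # split() + join() collapses newlines and repeated spaces
        titles.append(" ".join(title.get_text().split()))
    return titles

# The function only needs soup, so an inline string works as input
sample_soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
titles = get_titles(sample_soup)
print(titles)
```

The same function would work unchanged on soup built from a real response, which is the pattern the rest of this notebook follows.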

In [None]:
import bs4
import requests

### scrape the page

In [None]:
# this is the base_url
base_url = "http://www.metacritic.com/browse/albums/artist"

In [None]:
# select which page to scrape based on the first letter of the artist names
lett = "/a"

In [None]:
# build the url (only scrape the first page - page 0)
thepage = base_url+lett+"?page=0"
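
The URL is assembled from three parts: the base URL, the letter, and the page number. A minimal sketch of the same concatenation as a reusable helper follows; `build_page_url` is a hypothetical name, and it assumes other letters and page numbers follow the same pattern as `/a?page=0`.

```python
BASE_URL = "http://www.metacritic.com/browse/albums/artist"

def build_page_url(letter, page=0):
    """Hypothetical helper: build the browse URL for one letter and page."""
    return f"{BASE_URL}/{letter}?page={page}"

# Same URL as the one built by hand in the cell above
first_page = build_page_url("a")
print(first_page)
```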

In [None]:
# the bot pretends to be a standard Mozilla browser
hdrs = {"User-Agent": "Mozilla/5.0"}

In [None]:
# call the url
stuff = requests.get(thepage, headers=hdrs)

In [None]:
# parse the HTML into soup using Python's built-in html.parser
soup = bs4.BeautifulSoup(stuff.text, "html.parser")

### function definitions

The code in this cell does exactly the same thing as in the previous step, but it is packaged into a function that can be called when necessary.

The function takes a BeautifulSoup object and returns the album reviews as a list of lists, with one inner list per album.
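
To make the return shape concrete, here is a small sketch of a list of lists and of the "tbd" filter applied at the end of the function. The rows are made-up placeholders rather than real Metacritic data, and the date and score formats are assumptions.

```python
# Placeholder rows, in the order the function builds each item:
# [artistname, albumname, release_date, mc_score, user_score, url]
rows = [
    ["Artist A", "Album One", "Jan 1, 2015", "80", "7.5",
     "http://www.metacritic.com/music/placeholder-one"],
    ["Artist B", "Album Two", "Feb 2, 2015", "75", "tbd",
     "http://www.metacritic.com/music/placeholder-two"],
]

# Keep only rows where no field is exactly the string "tbd"
kept = [row for row in rows if "tbd" not in row]
print(len(kept))
```

Note that `"tbd" not in row` compares whole list elements, so an album whose title merely contains the letters "tbd" would not be dropped.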


In [None]:
def get_itemlist(thesoup):
    
    #try to find all div-tags of class "product_wrap"
    lotsofitems = thesoup.find_all("div",class_=["product_wrap"])
    
    thelist = []
    for an_item in lotsofitems: 
        theitem = []
        
        # artistname
        temptemp = an_item.find("li",class_="product_artist")
        theitem += [temptemp.find("span",class_=["data"]).get_text()]

        thetitle = an_item.find("div",class_="product_title")

        # albumname
        temptemp = thetitle.get_text()
        temptemp = temptemp.split()
        theitem += [" ".join(temptemp)]
        
        # release_date
        temptemp = an_item.find("li",class_="release_date")
        theitem += [temptemp.find("span",class_=["data"]).get_text()]
        
        # mc_score
        theitem += [an_item.find("div",class_="metascore_w").get_text()]

        # user_score
        temptemp = an_item.find("li",class_="product_avguserscore")
        theitem += [temptemp.find("span",class_=["data"]).get_text()]
        
        # url
        theitem += ["http://www.metacritic.com"+thetitle.a.attrs["href"]]

        # not all albums have both expert reviews and user reviews; a missing
        # score is shown as "tbd". We only want to add items that have both
        # user_score and mc_score
        if "tbd" not in theitem:
            thelist.append(theitem)
    return thelist

### call the function

In [None]:
reviews = get_itemlist(soup)

In [None]:
reviews
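
Each inner list is positional, so it can help to attach names to the fields when inspecting the result. The sketch below pairs the field names from the comments in `get_itemlist` with one placeholder row; `FIELDS`, `sample_reviews`, and the row values are assumptions for illustration, not output from the scrape.

```python
# Field names taken from the comments inside get_itemlist
FIELDS = ["artistname", "albumname", "release_date",
          "mc_score", "user_score", "url"]

# One placeholder row standing in for the real `reviews` list
sample_reviews = [
    ["Artist A", "Album One", "Jan 1, 2015", "80", "7.5",
     "http://www.metacritic.com/music/placeholder-one"],
]

# zip pairs each field name with the value at the same position
records = [dict(zip(FIELDS, row)) for row in sample_reviews]
print(records[0]["albumname"])
```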

Now we are ready to move on to the fifth notebook - [Store the data in a dataframe and save to disk](metacritic-AOIR-step5.ipynb)