# AoIR IR16 pre-conference workshop - Web Scraping 

## QUT DMRC - 2015

### Store the data in a dataframe and save to disk

This notebook extends the ``get_itemlist`` function so that it returns the scraped data as a pandas dataframe, which is then saved to disk as a CSV file.

### Import packages

In [None]:
import bs4
import requests
import pandas as pd

### Scrape the page

In [None]:
# this is the base_url
base_url = "http://www.metacritic.com/browse/albums/artist"

In [None]:
# select which page to scrape based on the first letter of the artist names
lett = "/a"

In [None]:
# build the url (only scrape the first page - page 0)
thepage = base_url+lett+"?page=0"
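
The URL above is assembled by hand for one letter and one page. As a minimal sketch, the same pattern can be wrapped in a small helper so that other letters or page numbers can be requested later. The name `build_page_url` is hypothetical, and whether a given letter or page actually exists on the site is an assumption that has to be checked when the page is fetched.

```python
BASE_URL = "http://www.metacritic.com/browse/albums/artist"


def build_page_url(letter, page=0, base_url=BASE_URL):
    """Return the listing URL for one artist letter and one page number.

    Hypothetical helper: it only mirrors base_url + lett + "?page=0"
    from the cells above and does not check that the page exists.
    """
    return f"{base_url}/{letter}?page={page}"


# build the first-page URL for a few letters
urls = [build_page_url(letter) for letter in "abc"]
print(urls[0])
```

With this helper, `thepage = build_page_url("a")` gives the same string as the cell above.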

In [None]:
# the bot pretends to be a standard Mozilla browser
hdrs = {"User-Agent": "Mozilla/5.0"}

In [None]:
# call the url
stuff = requests.get(thepage, headers=hdrs)
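
The call above assumes the request succeeds. A hedged sketch of a slightly more defensive version is shown here: it sets a timeout and uses `raise_for_status()` so that an error page is not parsed as if it were the album list. The function name `fetch_html` and the 10 second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Fetch a page and return its HTML text, or None if the request fails.

    Hypothetical helper; the timeout value is an arbitrary choice.
    """
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # raise an HTTPError for 4xx/5xx responses
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        # covers connection problems, timeouts, malformed URLs and HTTP errors
        print(f"request failed: {err}")
        return None
    return response.text
```

It could be used in place of the cell above as `html = fetch_html(thepage, headers=hdrs)`, parsing `html` only when it is not `None`.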

In [None]:
# parse the response HTML into a soup object using the built-in html.parser
soup = bs4.BeautifulSoup(stuff.text, "html.parser")

### Define column labels for the dataframe

In [None]:
# column labels
colnames = ["artistname", "albumname", "release_date", "mc_score", "user_score", "url"]

### Function definitions

In [None]:
# processes a BeautifulSoup object and returns the album reviews as a dataframe
def get_itemlist(thesoup):
    
    # find all div tags of class "product_wrap" (one per album)
    lotsofitems = thesoup.find_all("div",class_=["product_wrap"])
    
    thelist = []
    for an_item in lotsofitems: 
        theitem = []
        
        # artistname
        temptemp = an_item.find("li",class_="product_artist")
        theitem += [temptemp.find("span",class_=["data"]).get_text()]

        thetitle = an_item.find("div",class_="product_title")

        # albumname
        temptemp = thetitle.get_text()
        temptemp = temptemp.split()
        theitem += [" ".join(temptemp)]
        
        # release_date
        temptemp = an_item.find("li",class_="release_date")
        theitem += [temptemp.find("span",class_=["data"]).get_text()]
        
        # mc_score
        theitem += [an_item.find("div",class_="metascore_w").get_text()]

        # user_score
        temptemp = an_item.find("li",class_="product_avguserscore")
        theitem += [temptemp.find("span",class_=["data"]).get_text()]
        
        # url
        theitem += ["http://www.metacritic.com"+thetitle.a.attrs["href"]]

        # Not all albums have both expert reviews and user reviews. Albums
        # with a missing score show "tbd" instead. We only want to add items
        # that have both user_score and mc_score.
        if "tbd" not in theitem:
            thelist += [theitem]

    return pd.DataFrame(thelist,columns=colnames)
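
To see the two less obvious steps of the function in isolation (collapsing whitespace in the album title, and dropping items that contain "tbd"), here is a minimal sketch that runs the same logic on a hand-written HTML snippet. The snippet only mimics the class names used above; it is an assumption for illustration and not real Metacritic markup, and the artist and album names are made up.

```python
import bs4

# Hand-written HTML that mimics the class names used in get_itemlist.
# It is NOT real Metacritic markup; all values are made up.
SAMPLE_HTML = """
<div class="product_wrap">
  <div class="product_title"><a href="/music/album-one/artist-a">
      Album   One
  </a></div>
  <div class="metascore_w">81</div>
  <ul>
    <li class="product_avguserscore"><span class="data">7.9</span></li>
  </ul>
</div>
<div class="product_wrap">
  <div class="product_title"><a href="/music/album-two/artist-b">Album Two</a></div>
  <div class="metascore_w">75</div>
  <ul>
    <li class="product_avguserscore"><span class="data">tbd</span></li>
  </ul>
</div>
"""

sample_soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

kept = []
for item in sample_soup.find_all("div", class_="product_wrap"):
    # split() + join() collapses line breaks and repeated spaces in the title
    title = " ".join(item.find("div", class_="product_title").get_text().split())
    mc_score = item.find("div", class_="metascore_w").get_text()
    user_li = item.find("li", class_="product_avguserscore")
    user_score = user_li.find("span", class_="data").get_text()

    row = [title, mc_score, user_score]
    # keep the row only if no field is exactly "tbd"
    if "tbd" not in row:
        kept.append(row)

print(kept)
```

Only the first album is kept, because the second one has "tbd" as its user score. Note that `"tbd" not in row` compares whole list elements, so a title that merely contains the letters "tbd" would not be filtered out.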

### Call the function

Call the function, passing it the ``soup`` variable, and store the result in the variable ``album_reviews``.

In [None]:
album_reviews = get_itemlist(soup)

In [None]:
# have a look
album_reviews

### Save to disk

In [None]:
# save the data as a csv file
album_reviews.to_csv("reviews.csv")
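
By default, `to_csv` also writes the dataframe's row index as an unnamed first column. The sketch below illustrates the difference that `index=False` makes and reads the result back to check the column labels. It uses a small made-up dataframe in place of `album_reviews` and writes to a string rather than to disk, so it can be run without scraping anything.

```python
import io

import pandas as pd

# Made-up rows standing in for album_reviews; values are illustrative only.
demo = pd.DataFrame(
    [["Artist A", "Album One", "81"], ["Artist B", "Album Two", "75"]],
    columns=["artistname", "albumname", "mc_score"],
)

# With no path given, to_csv returns the CSV text instead of writing a file.
# Default behaviour: the row index is written as an unnamed first column.
with_index = demo.to_csv()

# index=False leaves the index out, so only the named columns are written.
without_index = demo.to_csv(index=False)

# Reading the text back gives the same column labels as the original.
restored = pd.read_csv(io.StringIO(without_index))

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

If the extra index column is not wanted in the saved file, `album_reviews.to_csv("reviews.csv", index=False)` is one option; whether to keep it depends on how the file will be read later.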

Now we are ready to move on to the sixth notebook - [Restructure the code for clarity](metacritic-AOIR-step6.ipynb).