# Introduction to Web Scraping using Python

## QUT DMRC - 2016

### Plotting, tiny stat analysis and improved I/O
 
This notebook scrapes http://www.metacritic.com/browse/albums/artist and stores the album data in a pandas dataframe.
Metacritic organises this listing by the first letter of the artist's name; the script below scrapes the first result page for a single letter.
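
The URL pattern used later in this notebook (base URL, then a letter, then a `page` query parameter) can be sketched as a small helper. The function name `build_url` is hypothetical, and the sketch assumes every letter from a to z follows the same pattern as the single letter scraped here.

```python
from string import ascii_lowercase

BASE_URL = "http://www.metacritic.com/browse/albums/artist"


def build_url(letter, page=0):
    """Return the listing URL for one artist initial and one result page."""
    return f"{BASE_URL}/{letter}?page={page}"


# one URL per letter, first result page only (assumed pattern, see above)
first_pages = [build_url(letter) for letter in ascii_lowercase]
print(first_pages[0])
```

Keeping the letter and the page number as separate arguments makes it easy to loop over either of them later.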

In [None]:
# initialise plotting in the notebook
%pylab inline

### Import packages

In [None]:
import bs4
import requests
import pandas as pd
from os.path import isfile

### Initialise global variables

In [None]:
# this is the base_url
base_url = "http://www.metacritic.com/browse/albums/artist"

In [None]:
# the bot pretends to be a standard Mozilla browser
hdrs = {"User-Agent": "Mozilla/5.0"}

In [None]:
# column labels
colnames = ["artistname", "albumname", "release_date", "mc_score", "user_score", "url"]

### Function definitions

In [None]:
# processes a beautiful_soup data structure and returns new album_reviews in a dataframe
def get_itemlist(thesoup):
    
    #try to find all div-tags of class "product_wrap"
    lotsofitems = thesoup.find_all("div",class_=["product_wrap"])
    
    thelist = []
    for an_item in lotsofitems: 
        theitem = []
        
        # artistname
        temptemp = an_item.find("li",class_="product_artist")
        theitem += [temptemp.find("span",class_=["data"]).get_text()]

        thetitle = an_item.find("div",class_="product_title")

        # albumname
        temptemp = thetitle.get_text()
        temptemp = temptemp.split()
        theitem += [" ".join(temptemp)]
        
        # release_date
        temptemp = an_item.find("li",class_="release_date")
        theitem += [temptemp.find("span",class_=["data"]).get_text()]
        
        # mc_score
        theitem += [an_item.find("div",class_="metascore_w").get_text()]

        # user_score
        temptemp = an_item.find("li",class_="product_avguserscore")
        theitem += [temptemp.find("span",class_=["data"]).get_text()]
        
        # url
        theitem += ["http://www.metacritic.com"+thetitle.a.attrs["href"]]

        # not all albums have both expert reviews and user reviews. Albums
        # with a missing score show "tbd" instead. We only want to keep items
        # that have both user_score and mc_score
        if "tbd" not in theitem:
            thelist.append(theitem)
    return pd.DataFrame(thelist,columns=colnames)
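
`get_itemlist` assumes that every tag it looks for exists; when `find` matches nothing it returns `None`, and the following `.get_text()` call fails. A minimal sketch of a more defensive lookup is shown below. The helper name `text_or_none` and the HTML snippet are made up for illustration; the snippet only mimics the class names used above and is not real Metacritic markup.

```python
import bs4


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching tag, or None if it is missing."""
    found = parent.find(tag, class_=css_class)
    return found.get_text(strip=True) if found is not None else None


# hand-written HTML that only imitates the class names used in get_itemlist
sample = """
<div class="product_wrap">
  <li class="product_artist"><span class="data">Some Artist</span></li>
  <li class="release_date"><span class="data">Jan 1, 2016</span></li>
</div>
"""
soup = bs4.BeautifulSoup(sample, "html.parser")
item = soup.find("div", class_="product_wrap")

artist = text_or_none(item, "li", "product_artist")
user_score = text_or_none(item, "li", "product_avguserscore")  # tag is absent here
print(artist, user_score)
```

Returning `None` for a missing field lets the caller decide whether to skip the item or keep it with a gap.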

### The script

In [None]:
# load previously saved reviews, or start with an empty dataframe

# if there already is a file...
if isfile("reviews.pkl"):
    # ...load album_reviews from that file
    album_reviews = pd.read_pickle("reviews.pkl")
else:
    # otherwise, set up an empty dataframe
    album_reviews = pd.DataFrame(columns=colnames)

# show the number of reviews in the dataframe
print(len(album_reviews))

In [None]:
# select which page to scrape based on the first letter of the artist names
lett = "/a"

In [None]:
# 1.build the url
thepage = base_url+lett+"?page=0"

In [None]:
# 2.call the url
stuff = requests.get(thepage, headers=hdrs)
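
The call above does not check whether the request succeeded. One possible way to wrap it, shown here only as a sketch, is a small function that sets a timeout and raises on HTTP error codes. The name `fetch_html` and the 10 second default are assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

With this helper, step 2 could be written as `html = fetch_html(thepage, headers=hdrs)`, and a blocked or missing page would stop the script with an error instead of being parsed as if it were a listing.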

In [None]:
# 3.transform to soup using html.parser parser
soup = bs4.BeautifulSoup(stuff.text, "html.parser")        

In [None]:
# 4.extract the new album_reviews from this page
new_reviews = get_itemlist(soup)

In [None]:
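# 5.add the new reviews to the dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
album_reviews = pd.concat([album_reviews, new_reviews], ignore_index=True)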

In [None]:
# 6.print something to show how the process develops
print("URL:",thepage,flush=True)

### Tidy up the data and save to disk

In [None]:
# make sure the review scores are numerical (float) types
album_reviews["mc_score"] = album_reviews["mc_score"].map(float)
album_reviews["user_score"] = album_reviews["user_score"].map(float)
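
`map(float)` raises an error as soon as one value cannot be parsed. If unparseable values might slip through, a more forgiving alternative is `pd.to_numeric` with `errors="coerce"`, which turns them into `NaN` so they can be dropped afterwards. The sketch below uses made-up toy data, not scraped values.

```python
import pandas as pd

# toy data: two rows contain values that are not numbers
scores = pd.DataFrame({
    "mc_score": ["85", "tbd", "70"],
    "user_score": ["8.1", "7.4", "n/a"],
})

for col in ["mc_score", "user_score"]:
    # unparseable strings become NaN instead of raising an error
    scores[col] = pd.to_numeric(scores[col], errors="coerce")

# keep only the rows where both scores are present
clean = scores.dropna(subset=["mc_score", "user_score"])
print(clean)
```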

In [None]:
# remove duplicates in case the same page has been scraped more than once
album_reviews = album_reviews.drop_duplicates()
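
To see why this step matters, here is a small sketch with invented data in which two scraping runs overlap by one album. The frames `first_run` and `second_run` are hypothetical stand-ins for two calls to `get_itemlist`.

```python
import pandas as pd

cols = ["artistname", "albumname"]
first_run = pd.DataFrame([["A", "One"], ["B", "Two"]], columns=cols)
second_run = pd.DataFrame([["B", "Two"], ["C", "Three"]], columns=cols)

# stack both runs, then drop rows that are identical in every column
combined = pd.concat([first_run, second_run], ignore_index=True)
deduped = combined.drop_duplicates().reset_index(drop=True)
print(len(combined), len(deduped))
```

Note that `drop_duplicates` compares all columns by default, so two rows for the same album with different scores would both be kept.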

In [None]:
# save the reviews to a csv file
album_reviews.to_csv("reviews.csv")

In [None]:
# save the reviews to a pkl file
album_reviews.to_pickle("reviews.pkl")
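
By default `to_csv` also writes the dataframe index as an extra, unnamed column. If that column is not wanted, `index=False` leaves it out. The sketch below writes a toy dataframe to a temporary directory and reads it back; the file name `reviews_demo.csv` is made up so that the real `reviews.csv` is not touched.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"artistname": ["A", "B"], "mc_score": [85.0, 70.0]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "reviews_demo.csv")
    # index=False keeps the row index out of the file
    df.to_csv(csv_path, index=False)
    restored = pd.read_csv(csv_path)

print(list(restored.columns))
```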

### Check the result

In [None]:
# how many reviews are there in the dataframe?
len(album_reviews)

In [None]:
# have a look at the first five items
album_reviews.head()

### Data processing
Create two new columns by transforming the user scores: the inverse and the natural logarithm.

In [None]:
# inverse of the user score
album_reviews["user_score_inv"] = album_reviews["user_score"].map(lambda x:1/x)

In [None]:
# natural logarithm of the user score (log is numpy's log, provided by %pylab inline)
album_reviews["user_score_log"] = album_reviews["user_score"].map(log)
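
Both transformations are undefined for a score of 0. Whether such a score ever occurs in the scraped data is not known here, so the following is only a precautionary sketch on toy values: it masks non-positive scores before transforming and uses vectorised numpy operations instead of `map`.

```python
import numpy as np
import pandas as pd

scores = pd.Series([0.0, 2.0, 8.0], name="user_score")

# replace non-positive scores with NaN so 1/x and log(x) stay well defined
positive = scores.where(scores > 0)

score_inv = 1.0 / positive
score_log = np.log(positive)
print(pd.DataFrame({"inv": score_inv, "log": score_log}))
```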

### Plot the data

In [None]:
# histograms
pp = album_reviews.hist(figsize = (12,7))

In [None]:
# scatter diagram
pp = album_reviews.plot(kind="scatter",x="user_score",y="mc_score")

### Statistical analysis

In [None]:
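# simple correlations between the numeric columns
# (numeric_only needs pandas 1.5 or later; text columns make corr() fail in pandas 2.x)
album_reviews.corr(numeric_only=True)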
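
As a sanity check of what the correlation table reports, the sketch below runs the same kind of call on invented data in which the two scores are perfectly linearly related, so the Pearson coefficient should come out as 1. Selecting the numeric columns first keeps the text column out of the calculation.

```python
import pandas as pd

toy = pd.DataFrame({
    "albumname": ["x", "y", "z", "w"],
    "mc_score": [60.0, 70.0, 80.0, 90.0],
    "user_score": [6.0, 7.0, 8.0, 9.0],
})

# keep only the numeric columns, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
pearson = numeric.corr()
print(pearson)
```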

Now we are ready to move on to the final notebook and add [Support for multiple pages](web-scraping-intro-final.ipynb).