![QUT DMRC 2016](https://www.dropbox.com/s/gpl9miikncu2235/QUTDMRC_logo_s1.png?raw=1 "QUT DMRC 2016" )

# Introduction to web scraping using Python
### Extract text data from a web page

This notebook gets a page from a website, extracts data from that page and stores that data as a CSV file. The target for this scraping exercise is a subsection of the [metacritic website](http://www.metacritic.com/browse/albums/artist). If you are using this notebook for other experiments we ask you to change the target to another website. It should be fairly straightforward to adapt the notebook to fit your particular web scraping project.



In [None]:
# Import the Python modules required to extract data from the website.
from bs4 import BeautifulSoup
from requests import get

# These functions are only required to be able to generate the URL to a random subpage of the metacritic site.
from random import choice
from string import ascii_lowercase

### Call the URL and transform the html code to "beautiful soup"

In [None]:
# This is the URL. Change it to the website you would like to scrape.
# In this example we call a random subpage of the site so that we don't make too many
# calls to the same page during a training workshop.
the_url = "http://www.metacritic.com/browse/albums/artist/" + choice(ascii_lowercase)

# This is a web page you might also try to scrape. Remove the comment marker from the
# line below and it will override the URL set above.
# the_url = "http://www.news.com.au/national"

print(the_url)
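
If you later want to cover every letter rather than one random page, the same building blocks can produce the full list of URLs. This is only a sketch: `base_url` and `all_urls` are names introduced here for illustration, and the code builds the strings without making any requests.

```python
from string import ascii_lowercase

# Hypothetical name for the base address used in the cell above
base_url = "http://www.metacritic.com/browse/albums/artist/"

# One URL per letter, from a to z
all_urls = [base_url + letter for letter in ascii_lowercase]

print(len(all_urls))
print(all_urls[0])
```

If you do loop over a list like this, consider pausing between requests so the site is not flooded with calls.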

In [None]:
# Send a browser-like User-Agent header so the request looks like it comes from a standard Mozilla browser
hdrs = {"User-Agent": "Mozilla/5.0"}

In [None]:
# make the call to the URL
stuff = get(the_url, headers=hdrs)
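
The call above assumes the request succeeds. The sketch below shows one way to check for failures before you start parsing. `fetch_html` and its `fetch` parameter are hypothetical names introduced for this example (the `fetch` parameter exists only so the function can be tried without a network connection), and the 10 second timeout is an arbitrary choice.

```python
from requests import get, RequestException


def fetch_html(url, headers=None, timeout=10, fetch=get):
    """Return the page HTML as text, or None if the request fails."""
    try:
        response = fetch(url, headers=headers, timeout=timeout)
        # Raises an HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except RequestException as err:
        print("Request failed:", err)
        return None
    return response.text
```

With this helper, `fetch_html(the_url, headers=hdrs)` would return the same text as `stuff.text` when the call works, and `None` when it does not.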

In [None]:
# transform to beautiful soup using html.parser parser
soup = BeautifulSoup(stuff.text, "html.parser")

### Search for specific tags in the beautiful soup

In [None]:
# Search for p tags
lotsofitems = soup.find_all("p")

In [None]:
# How many items did you find?
len(lotsofitems)

In [None]:
# Have a look at the first one in the list (The index of the first item in the list is zero "0")
lotsofitems[0]
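
To see what `find_all` returns without depending on a live page, you can try it on a small piece of HTML. The HTML string and the variable names below are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
sample_html = "<html><body><p>First paragraph</p><p>Second paragraph</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

paragraphs = sample_soup.find_all("p")  # list-like collection of every matching tag
first_only = sample_soup.find("p")      # only the first match, or None if there is none

print(len(paragraphs))
print(paragraphs[0])
print(first_only.get_text())
```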


### Let's search for div tags

In [None]:
# Search for div tags
lotsofitems = soup.find_all("div")

In [None]:
# How many did we find?
len(lotsofitems)


### Search for div tags with specific attributes (e.g. "id")

In [None]:
# Find all div-tags with a certain id.
#
# Note that this is just an example, you need to look at the source code of the page
# you are scraping and change the id value to something that makes sense for your particular case.
lotsofitems = soup.find_all("div", id="side")

In [None]:
# How many did you find?
len(lotsofitems)
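
An id is normally used only once on a page, so a search by id usually returns a single item. The sketch below uses made-up HTML to show both `find_all` and `find` with an id.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two div tags that have different ids
sample_html = '<div id="main">Main content</div><div id="side">Sidebar</div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

matches = sample_soup.find_all("div", id="side")  # list with every div whose id is "side"
sidebar = sample_soup.find(id="side")             # first tag of any kind with that id

print(len(matches))
print(sidebar.get_text())
```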


### Search for div tags with specific attributes (e.g. "class")

In [None]:
# Find all div-tags of a certain class.
# 
# 'class' is a reserved word in Python so you have to use 'class_' instead.
# 
# Note that this is just an example, you need to look at the source code of the page
# you are scraping and change the class value to something that makes sense for your particular case.
lotsofitems = soup.find_all("div",class_="product_wrap")

In [None]:
# How many did we find?
len(lotsofitems)
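
A tag can carry several classes at once, and a class search matches if any one of them fits. The sketch below uses made-up HTML to show this, together with the equivalent CSS selector form. The class name `advert` is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two album wrappers and one unrelated div
sample_html = """
<div class="product_wrap first">Album A</div>
<div class="product_wrap">Album B</div>
<div class="advert">Not an album</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

by_keyword = sample_soup.find_all("div", class_="product_wrap")
by_selector = sample_soup.select("div.product_wrap")  # same search as a CSS selector

print(len(by_keyword))
print(len(by_selector))
```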

In [None]:
# have a look at the first item in the list
# Try changing the index (between the []) to any number that is lower than the number of items in the list
lotsofitems[0]

In [None]:
# Extract all the text inside the first item, without the html tags.
# Try changing the index (between the []) to any number that is lower than the number of items in the list
temptext = lotsofitems[0].get_text()
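
The text you get back from `get_text()` often contains a lot of line breaks and spaces. The `separator` and `strip` arguments can tidy this up in one step. The HTML below is made up for the illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML with extra whitespace around the text
sample_html = """
<div class="product_wrap">
    <div class="product_title">  Some Album  </div>
    <div class="metascore_w">85</div>
</div>
"""
wrap = BeautifulSoup(sample_html, "html.parser").find("div", class_="product_wrap")

raw_text = wrap.get_text()                              # keeps all the whitespace
tidy_text = wrap.get_text(separator=" | ", strip=True)  # strips each piece, then joins them

print(repr(raw_text))
print(tidy_text)
```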

### Dig deeper into the html code structure
Depending on the structure of the web page you are scraping and the data you want to extract,
you might need to dig deeper into the structure.

In [None]:
# In this cell we extract the first div-tag of class "product_title" from the first item found in the previous step.
#
# Try changing the index (between the []) to any number that is lower than the number of items in the list
thedata = lotsofitems[0].find("div",class_="product_title")

In [None]:
# Check out the contents of the tag
thedata

In [None]:
# Extract the text from this tag
temptext = thedata.get_text()

In [None]:
# Clean up the string
clean_text = temptext.strip()
print(clean_text)
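
If the tag you are looking for is missing, `find` returns `None`, and calling `get_text()` on it raises an error. The sketch below wraps the find, extract and strip steps in a small helper that falls back to a default value. `text_of` is a hypothetical helper name and the HTML is made up.

```python
from bs4 import BeautifulSoup


def text_of(parent, tag_name, class_name, default=""):
    """Return the stripped text of the first matching tag, or default if there is none."""
    found = parent.find(tag_name, class_=class_name)
    if found is None:
        return default
    return found.get_text().strip()


# Made-up item that has a title but no score
sample_html = '<div class="product_wrap"><div class="product_title"> Some Album </div></div>'
item = BeautifulSoup(sample_html, "html.parser").find("div", class_="product_wrap")

print(text_of(item, "div", "product_title"))
print(text_of(item, "div", "metascore_w", default="n/a"))
```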

### Extract title, artist and score for all albums on the page and save the data to a file

In [None]:
# open a file to save the data
f = open("mydata.csv", "w", encoding="utf-8")

In [None]:
# iterate across 'lotsofitems' and extract the text from all items in the list
f.write('"title","artist","score"\n')
for an_item in lotsofitems:

    thedata = an_item.find("div",class_="product_title")
    temptext = thedata.get_text()
    title = temptext.strip()

    thedata = an_item.find("div",class_="metascore_w")
    temptext = thedata.get_text()
    score = temptext.strip()
    
    thedata = an_item.find("li",class_="product_artist")
    thedata = thedata.find("span",class_="data")
    temptext = thedata.get_text()
    artist = temptext.strip()
    
    # write the fields in the same order as the header line: title, artist, score
    f.write('"'+title+'","'+artist+'","'+score+'"\n')

In [None]:
# close the file when we're done
f.close()
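
Building each line by hand works for simple text, but a title that itself contains a quote character would break the file. As an alternative, Python's built-in `csv` module takes care of the quoting, and a `with` block closes the file for you. This is a sketch with made-up rows; `save_rows` and `example_albums.csv` are names chosen for the example.

```python
import csv


def save_rows(path, rows):
    """Write a header line, then one CSV line per (title, artist, score) tuple."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(["title", "artist", "score"])
        writer.writerows(rows)


# Made-up rows; the second title contains quote characters and a comma
example_rows = [
    ("Some Album", "Some Artist", "85"),
    ('The "Best" Of, Vol. 1', "Another Artist", "70"),
]
save_rows("example_albums.csv", example_rows)
```

In the loop above you could collect `(title, artist, score)` tuples in a list and pass that list to `save_rows` once the loop has finished.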