# Introduction to web scraping using Python
## QUT DMRC - 2016

### Extract text data from a web page

This notebook gets a page from a website and then extracts data from that page. Adapt the notebook to fit your particular web scraping project.

This is the webpage we're using for this exercise: [metacritic website](http://www.metacritic.com/browse/albums/artist/a)

In [1]:
# Import the Python modules required to extract data from the website.
from bs4 import BeautifulSoup
from requests import get

The next step sets the URL that has the information we want. The part of the URL that we will want to change to get more pages of information (here, the final `a`) is worth keeping separate so that it can be changed more easily.

In [2]:
# this is the url
the_url = "http://www.metacritic.com/browse/albums/artist/a"
#the_url = "http://www.news.com.au/national"
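One way to keep the changing part of the URL separate is to build the address from a fixed base plus a variable piece. The sketch below assumes that the other index pages follow the same pattern as the page for `a`, which has not been checked here. `base_url`, `letters`, and `build_url` are illustrative names, not part of the original notebook.

```python
# Fixed part of the address (taken from the URL used in this notebook).
base_url = "http://www.metacritic.com/browse/albums/artist/"

def build_url(letter):
    """Return the browse URL for one letter (assumes the same pattern as 'a')."""
    return base_url + letter

# Hypothetical list of pages to visit; adjust it to your own project.
letters = ["a", "b", "c"]
urls = [build_url(letter) for letter in letters]

for url in urls:
    print(url)
```

Each entry in `urls` could then be passed to `get` in turn.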

These steps get the page using [Requests](http://docs.python-requests.org/en/latest/) and then process it using [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/).

In [3]:
# the bot pretends to be a standard Mozilla browser
hdrs = {"User-Agent": "Mozilla/5.0"}

In [4]:
# call the url
stuff = get(the_url, headers=hdrs)
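Before parsing, it can help to check that the request actually succeeded. The following is a minimal sketch, not part of the original notebook: `fetch_page` is a hypothetical helper name and the 30-second timeout is an arbitrary choice. It relies on `raise_for_status()` from Requests, which raises an exception for 4xx and 5xx responses.

```python
from requests import get

def fetch_page(url, headers=None, timeout=30):
    """Fetch a page and return its HTML text, or raise if the request failed."""
    response = get(url, headers=headers, timeout=timeout)
    # Raises requests.exceptions.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text

# Example use (not run here, because it needs network access):
# html = fetch_page("http://www.metacritic.com/browse/albums/artist/a",
#                   headers={"User-Agent": "Mozilla/5.0"})
```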

In [5]:
# parse the HTML into a BeautifulSoup object using Python's built-in html.parser
soup = BeautifulSoup(stuff.text, "html.parser")

### Let's search for specific tags in the beautiful soup!

In [6]:
# Search for p tags
lotsofitems = soup.find_all("p")

In [7]:
# How many items did you find?
len(lotsofitems)

2

In [8]:
# have a look at the first one in the list (list indices start at 0)
lotsofitems[0]

<p id="footer_about_links">
<a class="first" href="http://www.cbsinteractive.com">About CBS Interactive</a>
                | <a href="http://www.cbsinteractive.com/careers/">Jobs</a>
                | <a href="http://www.cbsinteractive.com/advertise/">Advertise</a>
                | <a href="http://www.metacritic.com/faq">FAQ</a>
                | <a href="http://www.metacritic.com/about-metacritic">About Metacritic</a>
                | <a href="http://www.metacritic.com/contact-us">Contact</a>
</p>
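A tag returned by `find_all` can itself be searched. As a small sketch of that idea, the example below pulls the link text and address out of each `<a>` tag in a shortened copy of the footer paragraph shown above. The names `html`, `soup_example`, and `links` are only for illustration.

```python
from bs4 import BeautifulSoup

# A shortened copy of the footer paragraph shown above.
html = """
<p id="footer_about_links">
<a class="first" href="http://www.cbsinteractive.com">About CBS Interactive</a>
 | <a href="http://www.metacritic.com/faq">FAQ</a>
</p>
"""

soup_example = BeautifulSoup(html, "html.parser")
paragraph = soup_example.find("p")

# Collect (link text, address) pairs from every <a> tag inside the paragraph.
links = [(a.get_text().strip(), a.get("href")) for a in paragraph.find_all("a")]
for text, href in links:
    print(text, href)
```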


### Let's search for div tags

In [9]:
# Search for div tags
lotsofitems = soup.find_all("div")

In [10]:
# How many did we find?
len(lotsofitems)

594


### Search for div tags with specific attributes (e.g. "id")

In [11]:
# find all div-tags with a certain id
# note that this is just an example, you need to change the id value to something
# that makes sense for the web page you are scraping
lotsofitems = soup.find_all("div", id="side")

In [12]:
# How many did we find?
len(lotsofitems)

1
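BeautifulSoup also accepts CSS selectors through `select`, which some people find easier to read than keyword arguments. This is an alternative to the approach used in this notebook, sketched on a made-up HTML fragment. `div#side` selects by id and `div.product_wrap` selects by class.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only.
html = """
<div id="side">sidebar</div>
<div class="product_wrap">first</div>
<div class="product_wrap">second</div>
"""
soup_example = BeautifulSoup(html, "html.parser")

by_id = soup_example.select("div#side")             # div tags with id="side"
by_class = soup_example.select("div.product_wrap")  # div tags with class "product_wrap"

print(len(by_id), len(by_class))
```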


### Search for div tags with specific attributes (e.g. "class")

In [13]:
# Find all div-tags of a certain class.
# class is a reserved word in Python so you have to write class_ instead
# note that this is just an example, you need to change the class value to something
# that makes sense for the web page you are scraping
lotsofitems = soup.find_all("div",class_="product_wrap")

In [14]:
# How many did we find?
len(lotsofitems)

99

In [15]:
# have a look at the first item in the list
# Try changing the index (between the []) to any number that is lower than the number of items in the list
lotsofitems[0]

<div class="product_wrap">
<div class="basic_stat product_title">
<a href="/music/colonia/a-camp">
                            Colonia
                                                    </a>
</div>
<div class="basic_stat product_score brief_metascore">
<div class="metascore_w small release positive">64</div>
</div>
<div class="basic_stat condensed_stats">
<ul class="more_stats">
<li class="stat product_artist">
<span class="label">Artist:</span>
<span class="data">A Camp</span>
</li>
<li class="stat product_avguserscore">
<span class="label">User:</span>
<span class="data textscore textscore_outstanding">9.0</span>
</li>
<li class="stat release_date full_release_date">
<span class="label">Release Date:</span>
<span class="data">Apr 28, 2009</span>
</li>
</ul>
</div>
</div>

In [16]:
# extract all of the text inside the first item (the tags are dropped, the whitespace is kept)
# Try changing the index (between the []) to any number that is lower than the number of items in the list
temptext = lotsofitems[0].get_text()


In [17]:
# dig deeper into the structure and extract the first div-tag from the first item
# Try changing the index (between the []) to any number that is lower than the number of items in the list
thedata = lotsofitems[0].find("div",class_="product_title")

In [18]:
# check out the contents of the tag
thedata

<div class="basic_stat product_title">
<a href="/music/colonia/a-camp">
                            Colonia
                                                    </a>
</div>
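Note that the tag found above has two classes, `basic_stat` and `product_title`. When searching with `class_`, BeautifulSoup matches against any one of a tag's classes, so either name finds it. The sketch below shows this on a one-line copy of the tag; `soup_example`, `by_title`, and `by_stat` are illustrative names.

```python
from bs4 import BeautifulSoup

html = '<div class="basic_stat product_title"><a href="/music/colonia/a-camp">Colonia</a></div>'
soup_example = BeautifulSoup(html, "html.parser")

# Either class name on its own finds the same tag.
by_title = soup_example.find("div", class_="product_title")
by_stat = soup_example.find("div", class_="basic_stat")

print(by_title == by_stat)
# The class attribute is stored as a list of the individual class names.
print(by_title["class"])
```

In practice, pick the most specific class name (here `product_title`), since a general one such as `basic_stat` is shared by other tags in the same item.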

In [19]:
# extract the text from this tag
temptext = thedata.get_text()

In [20]:
# Clean up the string
clean_text = temptext.strip()
print(clean_text)

Colonia
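The extract-then-strip steps above can also be combined, because `get_text` accepts a `strip=True` argument. The following sketch compares the two approaches on a copy of the title tag; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="basic_stat product_title">
<a href="/music/colonia/a-camp">
        Colonia
    </a>
</div>
"""
soup_example = BeautifulSoup(html, "html.parser")
tag = soup_example.find("div", class_="product_title")

two_step = tag.get_text().strip()    # the approach used in this notebook
one_step = tag.get_text(strip=True)  # strips each piece of text while extracting

print(two_step == one_step)
```

When a tag contains several pieces of text, `get_text(" ", strip=True)` joins the stripped pieces with a space instead of running them together.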



In [21]:
# open a file to save the data
f = open("mydata.csv","w")

In [22]:
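# write the header row, then iterate across 'lotsofitems' and extract the text from all items in the list
# (the column order in the header must match the order used in the f.write() call inside the loop)
f.write('"artist","title","score"\n')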
for an_item in lotsofitems:

    thedata = an_item.find("div",class_="product_title")
    temptext = thedata.get_text()
    title = temptext.strip()

    thedata = an_item.find("div",class_="metascore_w")
    temptext = thedata.get_text()
    score = temptext.strip()
    
    thedata = an_item.find("li",class_="product_artist")
    thedata = thedata.find("span",class_="data")
    temptext = thedata.get_text()
    artist = temptext.strip()
    
    f.write('"'+artist+'","'+title+'","'+score+'"\n')
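The loop above assumes that every item contains a title, a score, and an artist. If one of them is missing, `find` returns `None` and the call to `.get_text()` fails with an `AttributeError`. One possible guard is sketched below on a made-up fragment in which the second item has no score. `text_or_blank` is a hypothetical helper, not part of the original notebook.

```python
from bs4 import BeautifulSoup

def text_or_blank(parent, name, class_name):
    """Return the stripped text of the first matching tag, or "" if none is found."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text().strip() if tag is not None else ""

# Made-up sample: the second item has no score.
html = """
<div class="product_wrap">
  <div class="product_title"><a>Colonia</a></div>
  <div class="metascore_w">64</div>
</div>
<div class="product_wrap">
  <div class="product_title"><a>Example Album</a></div>
</div>
"""
soup_example = BeautifulSoup(html, "html.parser")

rows = []
for item in soup_example.find_all("div", class_="product_wrap"):
    title = text_or_blank(item, "div", "product_title")
    score = text_or_blank(item, "div", "metascore_w")
    rows.append((title, score))

print(rows)
```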

In [23]:
# close the file when we're done
f.close()
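Building each CSV line by hand works for simple values, but a title that itself contains a quotation mark would produce a malformed row. Python's built-in `csv` module handles the quoting automatically. The sketch below is an alternative to the `f.write` calls above, not the notebook's own method. The second row is made up to show the escaping, and `io.StringIO` stands in for a real file.

```python
import csv
import io

# The first row comes from the page above; the second is made up and its
# title contains a comma and quotation marks.
rows = [
    ("A Camp", "Colonia", "64"),
    ("Example Artist", 'An "Example", Vol. 1', "70"),
]

# io.StringIO stands in for a real file so the example has no side effects;
# with a real file, use open("mydata.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(("artist", "title", "score"))
writer.writerows(rows)

print(buffer.getvalue())

# Reading the text back recovers the original values.
read_back = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Opening the real file in a `with` block would also close it automatically, so a separate `f.close()` step is not needed.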