![QUT DMRC 2016](https://www.dropbox.com/s/gpl9miikncu2235/QUTDMRC_logo_s1.png?raw=1 "QUT DMRC 2016" )

# Introduction to web scraping with Python

This notebook fetches a page from a website, extracts data from that page and stores the data as a CSV file. It should be fairly straightforward to adapt the notebook to your own web scraping project.



In [1]:
# Import the Python modules required to extract data from the website.
from bs4 import BeautifulSoup
from requests import get

from datetime import datetime

### Step #1: Get the soup from the website

In [2]:
# this is the URL - change this URL to the website you would like to scrape.
the_url = "http://www.spiegel.de/international/"

In [4]:
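# Make the call to the URL. The timeout (in seconds) stops the request from hanging indefinitely.
stuff = get(the_url, timeout=10)
# Raise an error if the server responded with an error status (e.g. 404 or 500).
stuff.raise_for_status()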

In [5]:
# Parse the HTML into a BeautifulSoup object using Python's built-in html.parser
soup = BeautifulSoup(stuff.text, "html.parser")

### Step #2: Find the tags in the beautiful soup


#### Example: Search for p tags

In [6]:
# Search for p tags
lotsofitems = soup.find_all("p")

In [7]:
# How many items did you find?
len(lotsofitems)

9

In [8]:
# Have a look at the first one in the list (list indices start at 0)
lotsofitems[0]

<p class="article-intro clearfix">
			Three months ago, a trio of Islamist terrorists stormed the Bataclan theater in Paris and slaughtered 90 people. Those who survived are still struggling to come to terms with what happened that night. <span class="author">By Julia Amalia Heyer and Petra Truckendanner</span> <a class="more-link" href="/international/europe/paris-terror-attack-victims-struggling-to-come-to-terms-a-1077426.html" title="Paris Survivors: Healing the Scars of Bataclan">more...</a> <span class="spInteractionMarks video-forum-bracket">[ <a href="/international/europe/paris-terror-attack-victims-struggling-to-come-to-terms-a-1077426.html#spLeserKommentare">Comment</a> ]</span></p>


#### Example: Search for div tags

In [9]:
# Search for div tags
lotsofitems = soup.find_all("div")

In [10]:
# How many did we find?
len(lotsofitems)

149
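
To experiment with `find_all` without repeatedly requesting the live site, the same calls can be run on a small inline document. The sketch below assumes a made-up HTML snippet (`sample_html`); its tags and counts are illustrative and are not taken from the scraped page.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (an assumption for illustration only).
sample_html = """
<div class="teaser"><p>First paragraph</p></div>
<div class="teaser"><p>Second paragraph</p></div>
<div class="footer"><p>Third paragraph</p></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like collection of every matching tag.
paragraphs = sample_soup.find_all("p")
divs = sample_soup.find_all("div")

print(len(paragraphs))           # 3
print(len(divs))                 # 3
print(paragraphs[0].get_text())  # First paragraph
```

Working on a fixed snippet like this makes it easier to see exactly which tags a search returns before pointing the same code at a real page.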


### Step #3: Find tags with specific attributes


#### Find ```div``` tags with specific ```class``` values

In [11]:
# Find all div tags of a certain class.
#
# 'class' is a reserved word in Python, so you have to use 'class_' instead.
# In some cases you need to use the attribute 'id' instead of 'class' to filter out
# the data you are looking for.
#
# Note that this is just an example: look at the source code of the page you are
# scraping and change the class value to something that makes sense for your case.
lotsofitems = soup.find_all("div",class_="teaser")

In [12]:
# How many did we find?
len(lotsofitems)

9

In [13]:
# have a look at the first item in the list
# Try changing the index (between the []) to any number that is lower than the number of items in the list
lotsofitems[0]

<div class="teaser teaser-first">
<h2 class="article-title ">
<a href="/international/europe/paris-terror-attack-victims-struggling-to-come-to-terms-a-1077426.html" title="Paris Survivors: Healing the Scars of Bataclan"><span class="headline-intro">Paris Survivors:</span> <span class="headline">Healing the Scars of Bataclan</span></a></h2><div class="article-image-box box-position breitwandaufmacher asset-align-center">
<div class="image-buttons-panel clearfix">
<a href="/international/europe/paris-terror-attack-victims-struggling-to-come-to-terms-a-1077426.html" title="Paris Survivors: Healing the Scars of Bataclan"><img alt="Paris Survivors: Healing the Scars of Bataclan" height="320" src="http://cdn4.spiegel.de/images/image-922947-breitwandaufmacher-uhhd-922947.jpg" title="Paris Survivors: Healing the Scars of Bataclan" width="860"/></a><span class="image-buttons">
</span>
</div>
</div><p class="article-intro clearfix">
			Three months ago, a trio of Islamist terrorists stormed the 

In [14]:
# Check out the text found in the first data item in the list
# Try changing the index (between the []) to any number that is lower than the number of items in the list
lotsofitems[0].get_text()

'\n\nParis Survivors: Healing the Scars of Bataclan\n\n\n\n\n\r\n\t\t\tThree months ago, a trio of Islamist terrorists stormed the Bataclan theater in Paris and slaughtered 90 people. Those who survived are still struggling to come to terms with what happened that night. By Julia Amalia Heyer and Petra Truckendanner more... [\xa0Comment\xa0]\n\n\n\n\n\nShare your thoughts:\r\n\t\t\tBe the first to comment on this text!\n\n\n\n\nPhoto Gallery: Coming to Terms with Terror\n\n'
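
Note that the first match above carries two classes (`teaser teaser-first`) and is still found by `class_="teaser"`, because Beautiful Soup matches against each individual class of a tag. The sketch below illustrates this, along with filtering by `id`, on a made-up snippet; the `id` value `lead` and the `sidebar` class are assumptions for illustration and do not come from the scraped page.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
sample_html = """
<div id="lead" class="teaser teaser-first">Lead story</div>
<div class="teaser">Second story</div>
<div class="sidebar">Not a teaser</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any tag that has "teaser" among its classes.
teasers = sample_soup.find_all("div", class_="teaser")

# id values are unique within a page, so find() (a single result) is enough.
lead = sample_soup.find("div", id="lead")

print(len(teasers))     # 2
print(lead["class"])    # ['teaser', 'teaser-first']
print(lead.get_text())  # Lead story
```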

### Step #4 - Extract data from the selected tags
Dig deeper into the HTML structure.<br>
Depending on the structure of the web page you are scraping and the data you want to extract,
you may need to search inside the tags you selected in the previous step.

In [15]:
# In this cell we extract the first a-tag (link) from the first item found in the previous step.
#
# Try changing the index (between the []) to any number that is lower than the number of items in the list
thedata = lotsofitems[0].find("a")

In [16]:
# Look at the value of the tag's href attribute (the link target)
thedata['href']

'/international/europe/paris-terror-attack-victims-struggling-to-come-to-terms-a-1077426.html'

In [17]:
# Extract the text from this tag
temptext = thedata.get_text()

In [18]:
# Clean up the string
clean_text = temptext.strip()
print(clean_text)

Paris Survivors: Healing the Scars of Bataclan
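
The `href` extracted above is relative to the site root, so it cannot be opened on its own. A minimal sketch of one way to turn it into a full URL, using `urljoin` from the Python standard library together with the page URL used earlier in this notebook (the variable names `base_url`, `relative_href` and `absolute_url` are introduced here for illustration):

```python
from urllib.parse import urljoin

# The page that was scraped and the relative link found on it.
base_url = "http://www.spiegel.de/international/"
relative_href = "/international/europe/paris-terror-attack-victims-struggling-to-come-to-terms-a-1077426.html"

# urljoin resolves the relative link against the page URL.
absolute_url = urljoin(base_url, relative_href)
print(absolute_url)
```

Storing the absolute URL rather than the relative one may make the saved data easier to use outside the notebook.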


### Step #5: Extract data from all relevant data entities on the page. Save to disk.

In [19]:
# open a file to save the data
f = open("mydata.csv","a")

In [20]:
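# Iterate over 'lotsofitems' and extract three attributes from each teaser on the page.
# The csv module handles the quoting, so double quotes inside the text do not break the file.
import csv

writer = csv.writer(f, quoting=csv.QUOTE_ALL, lineterminator="\n")
# Write the header only if the file is still empty (it is opened in append mode).
if f.tell() == 0:
    writer.writerow(["timestamp", "out1", "out2", "out3"])

for an_item in lotsofitems:
    headline = an_item.find("span", class_="headline")
    intro = an_item.find("p", class_="article-intro")
    link = an_item.find("a", href=True)

    # find() returns None when nothing matches, so skip teasers that lack any of the three.
    if headline is None or intro is None or link is None:
        continue

    out1 = headline.get_text().strip()
    out2 = intro.get_text().strip()
    out3 = link["href"].strip()

    writer.writerow([str(datetime.now())[:19], out1, out2, out3])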

In [21]:
# close the file when we're done
f.close()
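
Before downloading the file, it can be worth reading it back to confirm that the columns line up. The sketch below uses `csv.DictReader` from the standard library on a made-up, in-memory stand-in for `mydata.csv` (the row contents are invented for illustration); to check the real file, the `io.StringIO(...)` object could be replaced with `open("mydata.csv", newline="")`.

```python
import csv
import io

# Made-up stand-in for the contents of mydata.csv (same four columns).
sample_file = io.StringIO(
    '"timestamp","out1","out2","out3"\n'
    '"2016-01-01 12:00:00","Example headline","An intro with a ""quoted"" word","/example/path.html"\n'
)

# DictReader uses the header row as keys for each data row.
rows = list(csv.DictReader(sample_file))

print(len(rows))        # 1
print(rows[0]["out1"])  # Example headline
print(rows[0]["out2"])  # An intro with a "quoted" word
```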

##  That's all!
### Now open the [dashboard](../tree) and locate the CSV file that was saved in the cells above.
* Click on the "mydata.csv" file in the file list to open the file in a browser tab.
* Then choose "Download as..." in the "File" menu to download the CSV file to your computer.

### You may also want to explore [eight notebooks](web-scraping-intro-toc.ipynb) that gradually build a scraper with considerably more functionality than what has been introduced in this notebook.


