# QUT DRMC Workshop

# Introduction to web scraping with Python

This notebook gets a page from a website, extracts data from that page and stores that data as a CSV file. It should be fairly straightforward to adapt the notebook to fit your particular web scraping project.



In [None]:
# Import the Python modules required to extract data from the website.
from bs4 import BeautifulSoup
from requests import get

from datetime import datetime

### Step #1: Get the soup from the website

In [None]:
# this is the URL - change this URL to the website you would like to scrape.
the_url = "http://www.spiegel.de/international/"

In [None]:
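# make the call to the URL (give up after 10 seconds and stop on HTTP errors such as 404)
stuff = get(the_url, timeout=10)
stuff.raise_for_status()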

In [None]:
# transform to beautiful soup using html.parser parser
soup = BeautifulSoup(stuff.text, "html.parser")

### Step #2: Find the tags in the beautiful soup


#### Example: Search for p tags

In [None]:
# Search for p tags
lotsofitems = soup.find_all("p")

In [None]:
# How many items did you find?
len(lotsofitems)

In [None]:
# Have a look at the first item in the list (list indexes start at zero, so the first item is [0])
lotsofitems[0]


#### Example: Search for div tags

In [None]:
# Search for div tags
lotsofitems = soup.find_all("div")

In [None]:
# How many did we find?
len(lotsofitems)


### Step #3: Find tags with specific attributes


#### Find ```div``` tags with specific ```class``` values

In [None]:
# Find all div-tags of a certain class.
# 
# 'class' is a reserved word in Python so you have to use 'class_' instead.
# In some cases you need to use the attribute 'id' instead of 'class' to filter out
# the data you are looking for.
# 
# Note that this is just an example. Look at the source code of the page you are scraping
# and change the class value to something that makes sense for your particular case.
lotsofitems = soup.find_all("div",class_="teaser")
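
The comments above mention filtering on `id` instead of `class`. As a minimal sketch, assuming a made-up HTML snippet (the `id` value and class names below are invented for illustration and are not taken from any real site), the same `find`/`find_all` calls can filter on either attribute:

```python
from bs4 import BeautifulSoup

# A made-up snippet standing in for a real page (all names are hypothetical).
sample_html = """
<div id="main-story" class="teaser big">Top story</div>
<div class="teaser">Second story</div>
<div class="footer">Contact</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Filter on the class attribute (note the trailing underscore).
# A tag with several classes matches if any one of them is "teaser".
teasers = sample_soup.find_all("div", class_="teaser")

# Filter on the id attribute; an id should be unique, so find() is enough.
main_story = sample_soup.find("div", id="main-story")

# The attrs dictionary is an alternative way to write the same kind of filter.
footers = sample_soup.find_all("div", attrs={"class": "footer"})

print(len(teasers), main_story.get_text(), len(footers))
```

`find()` returns only the first match, while `find_all()` returns a list of every match.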

In [None]:
# How many did we find?
len(lotsofitems)

In [None]:
# have a look at the first item in the list
# Try changing the index (between the []) to any number that is lower than the number of items in the list
lotsofitems[0]

In [None]:
# Check out the text found in the first data item in the list
# Try changing the index (between the []) to any number that is lower than the number of items in the list
lotsofitems[0].get_text()

### Step #4 - Extract data from the selected tags
Dig deeper into the HTML code structure<br>
Depending on the structure of the web page you are scraping and the data you want to extract,
you might need to dig deeper into the structure.
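
To illustrate what digging deeper can look like, here is a small sketch on an invented HTML fragment (the class names `teaser` and `headline` mirror the example values used in this notebook; the fragment itself is made up). `find()` returns `None` when nothing matches, so it is worth checking the result before calling `get_text()` on it:

```python
from bs4 import BeautifulSoup

# An invented fragment: the second teaser has no headline span.
sample_html = """
<div class="teaser"><a href="/a"><span class="headline"> First headline </span></a></div>
<div class="teaser"><a href="/b">No headline here</a></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

headlines = []
for teaser in sample_soup.find_all("div", class_="teaser"):
    # Search only inside this teaser, not the whole page.
    span = teaser.find("span", class_="headline")
    if span is None:
        # find() found nothing in this teaser, so skip it.
        continue
    headlines.append(span.get_text().strip())

print(headlines)
```

Calling `find()` on a tag rather than on the whole soup limits the search to that tag's contents, which is what makes the step-by-step narrowing possible.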

In [None]:
# In this cell we extract the first span tag of class 'headline' from the first item found in the previous step.
#
# Try changing the index (between the []) to any number that is lower than the number of items in the list
thedata = lotsofitems[0].find("span",class_="headline")

In [None]:
# Extract the text from this tag
temptext = thedata.get_text()

In [None]:
# Clean up the string
clean_text = temptext.strip()
print(clean_text)

### Step #5: Extract data from all relevant data entities on the page. Save to disk.

In [None]:
# open a file to save the data
f = open("mydata.csv","a",encoding="utf8")

In [None]:
# if the file is empty, add a title row
if f.tell()==0:
    f.write('"timestamp","out1","out2","out3"\n')

In [None]:
# iterate across 'lotsofitems' and extract three attributes from each relevant data entity on the page.
for an_item in lotsofitems:

    headline = an_item.find("span",class_="headline")
    intro = an_item.find("p",class_="article-intro")
    link = an_item.find("a")

    # skip items that lack any of the three tags (find() returns None when nothing matches)
    if headline is None or intro is None or link is None:
        continue

    out1 = headline.get_text().strip()
    out2 = intro.get_text().strip()
    out3 = link.get("href","").strip()

    # double any embedded quotes so the quoted csv fields stay valid
    row = [str(datetime.now())[:19], out1, out2, out3]
    f.write(",".join('"'+field.replace('"','""')+'"' for field in row)+"\n")

In [None]:
# close the file when we're done
f.close()
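
Building each row by hand means taking care of the quoting yourself. One possible alternative, sketched here with invented rows and a hypothetical file name `mydata_sketch.csv`, is Python's built-in `csv` module, which quotes embedded quotes and commas for you:

```python
import csv
import os
import tempfile
from datetime import datetime

# Invented rows standing in for the scraped (out1, out2, out3) values.
rows = [
    ('He said "hello"', "An intro, with a comma", "/article-1"),
    ("Plain headline", "Plain intro", "/article-2"),
]

# Write to a temporary folder so the sketch does not touch mydata.csv.
csv_path = os.path.join(tempfile.mkdtemp(), "mydata_sketch.csv")

# newline="" lets the csv module control line endings itself.
with open(csv_path, "a", encoding="utf8", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    if f.tell() == 0:
        # The file is empty, so add the title row first.
        writer.writerow(["timestamp", "out1", "out2", "out3"])
    for out1, out2, out3 in rows:
        writer.writerow([str(datetime.now())[:19], out1, out2, out3])

# Read the file back to confirm the quoted fields survive a round trip.
with open(csv_path, encoding="utf8", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back[1][1])
```

The `with` block closes the file automatically, so no separate `close()` call is needed in this variant.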

##  That's all!
### Now open the [dashboard](../tree) and locate the csv file that was saved in the cells above.
* Click on the "mydata.csv" file in the file list to open the file in a browser tab.
* Then choose "Download as..." in the "File" menu to download the csv file to your computer.
 
### You may also want to explore [eight notebooks](web-scraping-intro-toc.ipynb) that gradually build a scraper with considerably more functionality than what has been introduced in this notebook.


