# Data Acquisition Lab

This lab is divided into short sections, one for each section of theory.

## Accessing Unprotected Web pages

In [1]:
# import the Python requests library so that you can use it in your program
import requests

In [2]:
# Go to the Australian Bureau of Meteorology website and work out which page corresponds to the
# Sydney weather forecast. Store that in a variable here
url = 'http://www.bom.gov.au/nsw/forecasts/sydney.shtml'

In [3]:
# Use the requests.get() method to fetch that page
r = requests.get(url)

In [5]:
# Did that succeed? What was the .status_code?
r.status_code

200

In [6]:
# What was the .text or .content of that page? Save it in a variable, because we will be using it
# a little later
print r.text

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-AU" lang="en">
<!--Page: pgn_sydney_forecast-->
  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
    <meta name="language" content="english" />
    <title>Sydney Forecast</title>
    <link rel="Schema.AGLS" href="http://www.naa.gov.au/recordkeeping/gov_online/agls/1.2" />
    <meta name="DC.Publisher" content="corporateName=Bureau of Meteorology" />
    <meta name="DC.Format" content="text/html" />
    <meta name="DC.Coverage.jurisdiction" content="Commonwealth of Australia" />
    <meta name="DC.Rights" scheme="URI" content="http://www.bom.gov.au/other/copyright.shtml" />
    <meta name="DC.Language" scheme="RFC1766" content="en" />
    <meta name="AGLS.Audience" scheme="BoM_audience list" content="All" />
    <link rel="image_src" href="http://www.bom.gov.au/images/ui/b

## Accessing forms

The pandas library already has a module for getting information from the Yahoo Finance pages, 
so you are unlikely to use the following code in any normal environment. But it's an example of
a simple web API

In [9]:
# There is a stock price lookup form on https://au.finance.yahoo.com (it says Enter Symbol)
# Inspect that element, and identify:
# - The <INPUT> tag with the name "s"
# - The <INPUT> tag with the name "ql" (which has a type of "hidden")
# - The <FORM> tag surrounding them with the action of "/q" and the method of GET
#
# Create a dictionary with appropriate keys to provide values for the input tags.
# Create a variable with the full URL to submit to

url = 'https://au.finance.yahoo.com/q?s=ibm&ql=1'
params = {'name': 'ibm', 'ql': 1}
answer = requests.get('https://au.finance.yahoo.com', data=params)

In [10]:
# Use requests.get to retrieve that page
answer.text

u'<!DOCTYPE html>\n<html lang="en-AU" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">\n<head>\n<script>var t_headstart=new Date().getTime();\n\n</script>\n\n<link rel="stylesheet" type="text/css" href="https://s1.yimg.com/zz/combo?d/lib/yui/3.4.1/build/cssreset/cssreset-min.css&d/lib/yui/3.4.1/build/cssfonts/cssfonts-min.css&os/mit/media/p/presentation/grids/master-min-a143923.css&os/mit/media/p/presentation/grids/desktop-min-8723999.css&os/mit/media/p/presentation/base/master-min-8723999.css&os/mit/media/p/presentation/base/desktop-min-a143923.css" />\n<script type="text/javascript" src="https://s1.yimg.com/zz/combo?yui:3.9.1/build/yui/yui-min.js&os/mit/media/m/base/imageloader-min-649ba6f.js&os/mit/media/m/base/imageloader-bootstrap-min-649ba6f.js&os/mit/media/m/base/viewport-loader-min-649ba6f.js&os/mit/media/p/common/rmp-min-56d3a2e.js&ss/rapid-3.32.js&aj/lh-0.9.js"></script>\n<script>YMedia = YUI({combine: true, comboBase: "https://s1.yimg.com/zz/combo?"

In [11]:
'ibm' in answer.text

False

## Secured pages

The username for files under http://www.ifost.org.au/ga/protected is "ga" and the password is "s3cr3t"

In this section we will fetch a file from a website that requires authentication.

In [33]:
# What happens if you use the requests library to fetch http://www.ifost.org.au/ga/protected/data.json 
# without supplying a password? What is the .status_code?

url = 'http://kemek.ifost.org.au/ga/protected/data.json'
r = requests.get(url)

In [34]:
r.text

u'<!DOCTYPE html>\n<html>\n<head>\n<title>401 Unauthorized</title>\n<style type="text/css"><!--\nbody { background-color: white; color: black; font-family: \'Comic Sans MS\', \'Chalkboard SE\', \'Comic Neue\', sans-serif; }\nhr { border: 0; border-bottom: 1px dashed; }\n\n--></style>\n</head>\n<body>\n<h1>401 Unauthorized</h1>\n<hr>\n<address>OpenBSD httpd</address>\n</body>\n</html>\n'

In [37]:
# Try again, but this time supplying a username and password
params = {'username': 'ga', 'password': 's3cr3t'}
r = requests.get(url, auth=('ga','s3cr3t'))
r.status_code

200

In [38]:
r.text

u'{\n "result": "success",\n "message": "you have accessed data from a protected page"\n}\n'

In [21]:
'error' in r.text

True

## Parsing HTML

In this section we will find the prediction for tomorrow's weather.

In [49]:
# import BeautifulSoup library (version 4)
import bs4

In [50]:
# Create a variable called "soup" with the result of parsing the Bureau of Meteorology prediction for
# Sydney that you captured at the start of this notebook.
import lxml
url = 'http://www.bom.gov.au/vic/forecasts/melbourne.shtml?ref=hdr'
melbourne = requests.get(url)
print melbourne.text


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-AU" lang="en">
<!--Page: pgv_forecasts_melbourne-->
  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
    <meta name="language" content="english" />
    <title>Melbourne Forecast</title>
    <link rel="Schema.AGLS" href="http://www.naa.gov.au/recordkeeping/gov_online/agls/1.2" />
    <meta name="DC.Identifier" scheme="URI" content="http://www.bom.gov.au/weather" />
    <meta name="DC.Creator" content="corporateName=Bureau of Meteorology" />
    <meta name="DC.Publisher" content="corporateName=Bureau of Meteorology" />
    <meta name="DC.Date.created" scheme="ISO8601" content="2000" />
    <meta name="DC.Type.documentType" scheme="BoM_document list" content="Web page; map" />
    <meta name="DC.Format" content="text/html" />
    <meta name="DC.Coverage.jurisdiction" content

In [None]:
soup = BeautifulSoup(html, 'lxml')

Hyperlinks = soup.find_all('a')
Hyperlinks = soup('a')
Sections = soup('div', class_='mainbox')
Bold_Sydneys = soup.find('b', string=lambda x: 'Sydney' in x)

In [59]:
soup = bs4.BeautifulSoup(melbourne.text, 'lxml')

def has_the_word_tuesday(x):
    if x is None:
        return False
    if 'Tuesday' in x:
        return True
    return False

# Find the first element in "soup" which has the word Tuesday in it
# You might find the function "has_the_word_tuesday" helpful

In [60]:
tuesday = soup.find_all(string = has_the_word_tuesday)
tuesday

[u'Tuesday 28 June',
 u'The next routine forecast will be issued at 5:05 am EST Tuesday.']

In [64]:
# The weather prediction is obviously going to be in a <DIV> that includes it
# Display the parent of the element you found in the previous cell. You might
# find the .prettify() method makes it easier to display
weather = tuesday[0]
weather.parent.parent

<div class="day main">\n<h2>Tuesday 28 June</h2>\n<div class="forecast">\n<dl>\n<dt>Summary</dt>\n<dd class="image">\n<img alt="" height="42" src="/images/symbols/large/light-showers.png" width="45"/>\n</dd>\n<dd>Min <em class="min">7</em></dd>\n<dd>Max <em class="max">15</em></dd>\n<dd class="summary">Mostly dry.</dd>\n<dd class="rain">Possible rainfall: <em class="rain">0 to 0.4 mm</em></dd>\n<dd class="rain">Chance of any rain: <em class="pop">30%\n\t\t\t\t\t<img alt="" height="10" src="/images/ui/weather/rain_30.gif" width="69"/></em></dd>\n</dl>\n<h3>Melbourne area</h3>\n<p>Becoming cloudy. Slight (30%) chance of showers in the afternoon and evening. Winds north to northwesterly 15 to 25 km/h tending north to northwesterly 15 to 20 km/h in the early afternoon.</p>\n</div>\n<p class="alert">No UV Alert, UV Index predicted to reach 2 [Low]</p>\n</div>

In [69]:
print(soup.prettify())

AttributeError: 'ResultSet' object has no attribute 'prettify'

In [65]:
# Can you find a <DD> element with a CSS class "summary"? (Use the parameter class_ in BeautifulSoup)
[child for child in weather.parent.parent.children]

[u'\n',
 <h2>Tuesday 28 June</h2>,
 u'\n',
 <div class="forecast">\n<dl>\n<dt>Summary</dt>\n<dd class="image">\n<img alt="" height="42" src="/images/symbols/large/light-showers.png" width="45"/>\n</dd>\n<dd>Min <em class="min">7</em></dd>\n<dd>Max <em class="max">15</em></dd>\n<dd class="summary">Mostly dry.</dd>\n<dd class="rain">Possible rainfall: <em class="rain">0 to 0.4 mm</em></dd>\n<dd class="rain">Chance of any rain: <em class="pop">30%\n\t\t\t\t\t<img alt="" height="10" src="/images/ui/weather/rain_30.gif" width="69"/></em></dd>\n</dl>\n<h3>Melbourne area</h3>\n<p>Becoming cloudy. Slight (30%) chance of showers in the afternoon and evening. Winds north to northwesterly 15 to 25 km/h tending north to northwesterly 15 to 20 km/h in the early afternoon.</p>\n</div>,
 u'\n',
 <p class="alert">No UV Alert, UV Index predicted to reach 2 [Low]</p>,
 u'\n']

In [None]:
# Display the "string" attribute of this summary element. Do you need to bring an umbrella?


## JSON APIs

Many websites display their information in JSON format. In this section we will interact
with the Pokemon database http://pokeapi.co/

In [None]:
# Look up their documentation. What is the base URL for querying a Pokemon? What URL
# would you use to look up the Pokemon called "Groudon"? Store it in a variable
import requests
r = requests.get("http://company.com/data.json")
data = r.json()
print data.

In [None]:
# Use the requests library to fetch the Groudon data

In [None]:
# Check the status code to make sure that it worked

In [None]:
# Is the content of the response in JSON format? Use the requests library function
# to decode it from JSON format into a Python dictionary

In [None]:
# What are the keys of this python dictionary?

In [None]:
# Is "weight" listed there? If so, then the value in it should be a number
# If you play Pokemon, does this number look reasonable?