# Data Acquisition Lab

This lab is divided into short sections, one for each section of theory.

## Accessing unprotected web pages

In [21]:
# import the Python requests library so that you can use it in your program
import requests

In [22]:
# Go to the Australian Bureau of Meteorology website and work out which page corresponds to the
# Sydney weather forecast. Store that in a variable here
url = 'http://www.bom.gov.au/nsw/forecasts/sydney.shtml'

In [23]:
# Use the requests.get() method to fetch that page
r = requests.get(url)

In [24]:
# Did that succeed? What was the .status_code?
r.status_code

200
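A status code of 200 means the request succeeded. As a minimal sketch (assuming Python 3), the standard library's `http.HTTPStatus` enum can turn a numeric code into its standard phrase. The helper names `describe_status` and `is_success` are hypothetical and only for illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a readable label such as '200 OK' for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    return '{} {}'.format(status.value, status.phrase)

def is_success(code):
    """2xx codes mean the server handled the request successfully."""
    return 200 <= code < 300

print(describe_status(200))
print(describe_status(401))
```

With a real response object, `r.raise_for_status()` raises an exception for 4xx and 5xx codes, which saves checking the number by hand.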

In [25]:
# What was the .text or .content of that page? Save it in a variable, because we will use it again
# in the "Parsing HTML" section. r.text is the decoded string; r.content is the raw bytes.
t = r.text
print(t)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-AU" lang="en">
<!--Page: pgn_sydney_forecast-->
  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
    <meta name="language" content="english" />
    <title>Sydney Forecast</title>
    <link rel="Schema.AGLS" href="http://www.naa.gov.au/recordkeeping/gov_online/agls/1.2" />
    <meta name="DC.Publisher" content="corporateName=Bureau of Meteorology" />
    <meta name="DC.Format" content="text/html" />
    <meta name="DC.Coverage.jurisdiction" content="Commonwealth of Australia" />
    <meta name="DC.Rights" scheme="URI" content="http://www.bom.gov.au/other/copyright.shtml" />
    <meta name="DC.Language" scheme="RFC1766" content="en" />
    <meta name="AGLS.Audience" scheme="BoM_audience list" content="All" />
    <link rel="image_src" href="http://www.bom.gov.au/images/ui/b
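The `<meta>` tag in the output above declares `charset=ISO-8859-1`. That matters because `r.content` holds raw bytes, while `r.text` is those bytes decoded into a string. A small sketch of the difference, assuming Python 3 and using a made-up sample string rather than the real page:

```python
# Hypothetical sample standing in for page bytes; not taken from the real site.
raw_bytes = 'Temp: 18\u00b0C'.encode('iso-8859-1')   # bytes, like r.content
decoded = raw_bytes.decode('iso-8859-1')             # str, like r.text

print(type(raw_bytes).__name__, len(raw_bytes))
print(type(decoded).__name__, ascii(decoded))

# Decoding with the wrong codec fails: the byte 0xB0 is not valid UTF-8 on its own.
try:
    raw_bytes.decode('utf-8')
    wrong_codec_ok = True
except UnicodeDecodeError:
    wrong_codec_ok = False
print(wrong_codec_ok)
```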

## Accessing forms

The pandas library already has a module for getting information from the Yahoo Finance pages,
so you are unlikely to need the following code in practice. It is still a useful example of
a simple web API.

In [None]:
# There is a stock price lookup form on https://au.finance.yahoo.com (it says Enter Symbol)
# Inspect that element, and identify:
# - The <INPUT> tag with the name "s"
# - The <INPUT> tag with the name "ql" (which has a type of "hidden")
# - The <FORM> tag surrounding them with the action of "/q" and the method of GET
#
# Create a dictionary with appropriate keys to provide values for the input tags.
# Create a variable with the full URL to submit to

In [None]:
# Use requests.get to retrieve that page
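A GET form is submitted by appending its field values to the action URL as a query string. Here is a minimal sketch, assuming Python 3. The symbol `BHP.AX` and the value given for the hidden `ql` field are placeholders, so check the real form for the actual value. The sketch uses the standard library to show the resulting URL without contacting the site.

```python
from urllib.parse import urlencode, urljoin

site = 'https://au.finance.yahoo.com'
form_action = '/q'                      # the action attribute of the <FORM> tag
form_url = urljoin(site, form_action)   # full URL to submit to

# One key per <INPUT> name. Both values are placeholders for illustration.
form_values = {'s': 'BHP.AX', 'ql': '1'}

full_url = form_url + '?' + urlencode(form_values)
print(full_url)
```

Calling `requests.get(form_url, params=form_values)` builds the same query string and sends the request.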

## Secured pages

In this section we will fetch a file from a website that requires authentication.

The username for files under http://www.ifost.org.au/ga/protected is "ga" and the password is "s3cr3t".

In [10]:
# What happens if you use the requests library to fetch http://www.ifost.org.au/ga/protected/data.json 
# without supplying a password? What is the .status_code?
r = requests.get('http://www.ifost.org.au/ga/protected/data.json')
r.status_code

401

In [15]:
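# Try again, but this time supplying a username and password.
# Sending them as request data (the commented-out attempt below) is not HTTP authentication:
# r = requests.get("http://www.ifost.org.au/ga/protected/data.json", data = {'user':'ga', 'passwd':'s3cr3t'})
# Pass them as a (username, password) tuple in the auth parameter instead.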
r = requests.get("http://kemek.ifost.org.au/ga/protected/data.json", auth = ('ga','s3cr3t'))
r.status_code

200
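Passing a `(username, password)` tuple as `auth` tells requests to use HTTP Basic authentication, which sends the credentials in an `Authorization` header. The sketch below (Python 3 assumed; `basic_auth_header` is a hypothetical helper name) rebuilds that header by hand to show what goes over the wire.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header value used by HTTP Basic authentication."""
    credentials = '{}:{}'.format(username, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')

header = basic_auth_header('ga', 's3cr3t')
print(header)

# Base64 is an encoding, not encryption: anyone can reverse it.
recovered = base64.b64decode(header.split(' ', 1)[1]).decode('utf-8')
print(recovered)
```

Because the credentials are only encoded, Basic authentication is best used over HTTPS.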

## Parsing HTML

In this section we will find the prediction for tomorrow's weather.

In [26]:
# import BeautifulSoup library (version 4)
import bs4

In [27]:
# Create a variable called "soup" with the result of parsing the Bureau of Meteorology prediction for
# Sydney that you captured at the start of this notebook. That page text was saved in "t";
# "r" has since been reused for other requests.
soup = bs4.BeautifulSoup(t, 'lxml')

In [44]:
# Find the first element in "soup" which has the word Tuesday in it
# You might find the function "has_the_word_tuesday" helpful
def has_the_word_tuesday(x):
    return 'Tuesday' in x

tuesday_things = soup.find_all(string=has_the_word_tuesday)
first_thing = tuesday_things[0]
first_thing

u'Tuesday 28 June'

In [48]:
# The weather prediction is going to be in a <DIV> that includes it.
# The string's immediate parent is the <H2> heading, so go up two levels to reach
# the enclosing <DIV>. You might find the .prettify() method makes it easier to display
beautiful = first_thing.parent.parent
beautiful
# print(beautiful.prettify())

<div class="day main">\n<h2>Tuesday 28 June</h2>\n<div class="forecast">\n<dl>\n<dt>Summary</dt>\n<dd class="image">\n<img alt="" height="42" src="/images/symbols/large/partly-cloudy.png" width="45"/>\n</dd>\n<dd>Min <em class="min">10</em></dd>\n<dd>Max <em class="max">18</em></dd>\n<dd class="summary">Cloud clearing.</dd>\n<dd class="rain">Possible rainfall: <em class="rain">0 mm</em></dd>\n<dd class="rain">Chance of any rain: <em class="pop">10%\n\t\t\t\t\t<img alt="" height="10" src="/images/ui/weather/rain_10.gif" width="69"/></em></dd>\n</dl>\n<h3>Sydney area</h3>\n<p>Sunny. Winds southwesterly 20 to 30 km/h becoming light in the late afternoon.</p>\n</div>\n<p class="alert">No UV Alert, UV Index predicted to reach 2 [Low]</p>\n</div>

In [53]:
# Can you find a <DD> element with a CSS class "summary"? (Use the parameter class_ in BeautifulSoup)
summary = beautiful.find('dd', class_ = 'summary')

In [54]:
# Display the "string" attribute of this summary element. Do you need to bring an umbrella?
summary.string

u'Cloud clearing.'
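The same `find` calls can collect the rest of the forecast into one dictionary. This is a sketch under a few assumptions: the HTML is a simplified copy of the `<div class="day main">` output shown above (images removed), the dictionary keys are made up for illustration, and the built-in `html.parser` is used so the example does not depend on lxml.

```python
import bs4

# Simplified copy of the forecast markup displayed earlier in this section.
html = '''<div class="day main">
<h2>Tuesday 28 June</h2>
<div class="forecast">
<dl>
<dd>Min <em class="min">10</em></dd>
<dd>Max <em class="max">18</em></dd>
<dd class="summary">Cloud clearing.</dd>
<dd class="rain">Chance of any rain: <em class="pop">10%</em></dd>
</dl>
</div>
</div>'''

# class_='day' matches any element that has "day" among its CSS classes.
day = bs4.BeautifulSoup(html, 'html.parser').find('div', class_='day')

forecast = {
    'date': day.h2.get_text(strip=True),
    'min': int(day.find('em', class_='min').get_text()),
    'max': int(day.find('em', class_='max').get_text()),
    'summary': day.find('dd', class_='summary').get_text(strip=True),
    'chance_of_rain': day.find('em', class_='pop').get_text(strip=True),
}
print(forecast)
```

On the live page, `find` returns `None` when an element is missing, so production code would check each result before calling `get_text()`.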

## JSON APIs

Many websites expose their information through APIs that return JSON. In this section we will
interact with the Pokemon database at http://pokeapi.co/.

In [None]:
# Look up their documentation. What is the base URL for querying a Pokemon? What URL
# would you use to look up the Pokemon called "Groudon"? Store it in a variable

In [None]:
# Use the requests library to fetch the Groudon data

In [None]:
# Check the status code to make sure that it worked

In [None]:
# Is the content of the response in JSON format? Use the requests library function
# to decode it from JSON format into a Python dictionary

In [None]:
# What are the keys of this Python dictionary?

In [None]:
# Is "weight" listed there? If so, then its value should be a number.
# If you play Pokemon, does this number look reasonable?
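The steps in this section follow the same pattern as most JSON APIs: build the URL, fetch it, check the status, decode the body, and inspect the keys. Below is a minimal sketch of the decoding and inspection steps, assuming Python 3. The payload is a hand-written stand-in with placeholder values, not a real API response, and the endpoint in the comment is an assumption to verify against the PokeAPI documentation.

```python
import json

# Assumed endpoint (verify in the documentation):
#   https://pokeapi.co/api/v2/pokemon/groudon
# With requests, r.json() decodes the response body much as json.loads does here.

# Hand-written stand-in for a response body; the values are placeholders.
body = '{"name": "groudon", "weight": 9500, "height": 35}'

pokemon = json.loads(body)   # a JSON object becomes a Python dict
print(sorted(pokemon.keys()))

if 'weight' in pokemon:
    weight = pokemon['weight']
    # JSON numbers decode to int or float
    print(isinstance(weight, (int, float)), weight)
```

Before judging whether a value is reasonable, check the API documentation for the units it reports.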