# Data Acquisition Lab

This lab is divided into short sections, one for each section of theory.

## Accessing unprotected web pages

In [1]:
# import the Python requests library so that you can use it in your program
import requests

In [2]:
# Go to the Australian Bureau of Meteorology website and work out which page corresponds to the
# Sydney weather forecast. Store that in a variable here
sydney = 'http://www.bom.gov.au/nsw/forecasts/sydney.shtml'

In [3]:
# Use the requests.get() method to fetch that page
r = requests.get(sydney)

In [4]:
# Did that succeed? What was the .status_code?
r.status_code

200
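
A status code of 200 means the request succeeded, while codes in the 4xx and 5xx ranges signal client and server errors. The following is a minimal sketch, assuming Python 3, that uses the standard library's `http.HTTPStatus` to turn a numeric code into a readable description. The helper name `describe_status` is hypothetical and is not part of requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase for the code
    if 200 <= code < 300:
        kind = 'success'
    elif 400 <= code < 500:
        kind = 'client error'
    elif 500 <= code < 600:
        kind = 'server error'
    else:
        kind = 'other'
    return '{} {} ({})'.format(code, phrase, kind)

print(describe_status(200))
print(describe_status(401))
```

With requests itself, calling `r.raise_for_status()` performs a similar check and raises `requests.exceptions.HTTPError` for 4xx and 5xx responses.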

In [7]:
# What was the .text or .content of that page? Save it in a variable, because we will be using it
# a little later
sydney_text = r.text

## Accessing forms

The pandas library already has a module for retrieving data from Yahoo Finance,
so you are unlikely to need the following code in practice. It is still a useful example of
a simple web API.

In [8]:
# There is a stock price lookup form on https://au.finance.yahoo.com (it says Enter Symbol)
# Inspect that element, and identify:
# - The <INPUT> tag with the name "s"
# - The <INPUT> tag with the name "ql" (which has a type of "hidden")
# - The <FORM> tag surrounding them with the action of "/q" and the method of GET
#
# Create a dictionary with appropriate keys to provide values for the input tags.
# Create a variable with the full URL to submit to

params = {'s': 'IBM',
         'ql': '1'}
yahoo_finance = 'https://au.finance.yahoo.com/q?'

In [9]:
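# Use requests.get to retrieve that page. The form uses the GET method, so the
# values go in the query string via params= (data= would put them in the body)
ibm_price = requests.get(yahoo_finance, params=params)
# Save the raw bytes so the page can be opened in a browser later
with open('yahoofinance.html', 'wb') as f:
    f.write(ibm_price.content)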
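
For a form whose method is GET, the input values travel in the URL's query string. The sketch below shows how such a URL is assembled, using the standard library's `urllib.parse.urlencode` (Python 3) rather than requests itself. The function name `build_get_url` is an assumption made for illustration.

```python
from urllib.parse import urlencode

def build_get_url(action_url, fields):
    """Append form fields to a URL as a query string, as a GET form submission does."""
    return action_url + '?' + urlencode(fields)

# Same field names and values as the form inputs described above
form_fields = {'s': 'IBM', 'ql': '1'}
url = build_get_url('https://au.finance.yahoo.com/q', form_fields)
print(url)
```

Passing the same dictionary to `requests.get()` through its `params` argument produces an equivalent URL. The `data` argument instead places the fields in the request body, which is what a POST form expects.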

## Secured pages

In this section we will fetch a file from a website that requires authentication.

The username for files under http://www.ifost.org.au/ga/protected is "ga" and the password is "s3cr3t".

In [10]:
# What happens if you use the requests library to fetch http://www.ifost.org.au/ga/protected/data.json 
# without supplying a password? What is the .status_code?

ifost_url = 'http://kemek.ifost.org.au/ga/protected/data.json'
ifost_page = requests.get(ifost_url)
ifost_page

<Response [401]>

In [11]:
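# Try again, but this time supplying a username and password.
# A (username, password) tuple is shorthand for HTTP Basic authentication;
# requests.auth.HTTPDigestAuth is only needed if the server uses Digest auth.

ifost_content = requests.get(ifost_url, auth=('ga','s3cr3t')).content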
ifost_content

'{\n "result": "success",\n "message": "you have accessed data from a protected page"\n}\n'
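
Passing a `(username, password)` tuple as `auth` makes requests use HTTP Basic authentication, which sends the credentials in an `Authorization` header, base64-encoded but not encrypted. The sketch below builds that header by hand with the standard library to show roughly what goes over the wire. `basic_auth_header` is a hypothetical helper name, not a requests function.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP Basic authentication."""
    credentials = '{}:{}'.format(username, password).encode('utf-8')
    token = base64.b64encode(credentials).decode('ascii')
    return {'Authorization': 'Basic ' + token}

headers = basic_auth_header('ga', 's3cr3t')
print(headers)

# Base64 is an encoding, not encryption: anyone who sees the header can reverse it
token = headers['Authorization'].split(' ', 1)[1]
recovered = base64.b64decode(token).decode('utf-8')
print(recovered)
```

Because the encoding is reversible, Basic authentication is only safe over HTTPS. For a server that requires Digest authentication, pass `requests.auth.HTTPDigestAuth('ga', 's3cr3t')` as `auth` instead.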

## Parsing HTML

In this section we will find the prediction for tomorrow's weather.

In [12]:
# import BeautifulSoup library (version 4)
import bs4

In [13]:
# Create a variable called "soup" with the result of parsing the Bureau of Meteorology prediction for
# Sydney that you captured at the start of this notebook.
soup = bs4.BeautifulSoup(sydney_text, 'lxml')

In [14]:
def has_the_word_tuesday(x):
    return 'Tuesday' in x

# Find the first element in "soup" which has the word Tuesday in it
# You might find the function "has_the_word_tuesday" helpful

first_tuesday = soup.find(string=has_the_word_tuesday)

element = first_tuesday.parent
element

<h2>Tuesday 5 July</h2>
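
The helper above hard-codes the word Tuesday, so it only finds tomorrow's forecast when the notebook is run on a Monday. One possible way around this, sketched under the assumption that the forecast headings always begin with the English weekday name, is to compute tomorrow's weekday with `datetime` and build the matching function from it. The names `weekday_after` and `has_the_word` are illustrative and are not part of BeautifulSoup.

```python
from datetime import date, timedelta

DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

def weekday_after(day):
    """Return the English weekday name of the day following `day`."""
    next_day = day + timedelta(days=1)
    return DAY_NAMES[next_day.weekday()]  # weekday(): Monday is 0

def has_the_word(word):
    """Build a predicate that checks whether a string contains `word`."""
    def predicate(text):
        return text is not None and word in text
    return predicate

tomorrow = weekday_after(date.today())
has_tomorrow = has_the_word(tomorrow)
print(tomorrow)
```

Calling `soup.find(string=has_the_word(tomorrow))` would then take the place of the hard-coded search.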

In [15]:
# The weather prediction should be in the <DIV> that encloses this heading.
# Display the parent of the element you found in the previous cell. You might
# find the .prettify() method makes it easier to display (call it, then print the result)
print(element.parent.prettify())

<div class="day">
 <h2>
  Tuesday 5 July
 </h2>
 <div class="forecast">
  <dl>
   <dt>
    Summary
   </dt>
   <dd class="image">
    <img alt="" height="42" src="/images/symbols/large/rain.png" width="45"/>
   </dd>
   <dd>
    Min
    <em class="min">
     10
    </em>
   </dd>
   <dd>
    Max
    <em class="max">
     16
    </em>
   </dd>
   <dd class="summary">
    Rain.
   </dd>
   <dd class="rain">
    Possible rainfall:
    <em class="rain">
     3 to 8 mm
    </em>
   </dd>
   <dd class="rain">
    Chance of any rain:
    <em class="pop">
     80%
     <img alt="" height="10" src="/images/ui/weather/rain_80.gif" width="69"/>
    </em>
   </dd>
  </dl>
  <h3>
   Sydney area
  </h3>
  <p>
   Cloudy. High (80%) chance of rain, most likely in the morning and afternoon. Winds north to northwesterly and light increasing to 15 to 25 km/h in the middle of the day then turning westerly 25 to 35 km/h in the afternoon.
  </p>
 </div>
</div>

In [17]:
# Can you find a <DD> element with a CSS class "summary"? (Use the parameter class_ in BeautifulSoup)

sections = element.parent.find('dd', class_='summary')
sections

<dd class="summary">Rain.</dd>

In [18]:
# Display the "string" attribute of this summary element. Do you need to bring an umbrella?
sections.string

u'Rain.'

## JSON APIs

Many websites expose their data in JSON format. In this section we will interact
with the Pokemon database at http://pokeapi.co/.

In [19]:
# Look up their documentation. What is the base URL for querying a Pokemon? What URL
# would you use to look up the Pokemon called "Groudon"? Store it in a variable
# 383 is Groudon's ID; the API accepts either the ID or the lowercase name
groudon_url = 'http://pokeapi.co/api/v2/pokemon/383'

In [20]:
# Use the requests library to fetch the Groudon data
groudon_data = requests.get(groudon_url)

In [21]:
# Check the status code to make sure that it worked
groudon_data

<Response [200]>

In [22]:
# Is the content of the response in JSON format? Use the requests library function
# to decode it from JSON format into a Python dictionary
groudon_dic = groudon_data.json()

In [23]:
# What are the keys of this python dictionary?
groudon_dic.keys()

[u'is_default',
 u'abilities',
 u'stats',
 u'name',
 u'weight',
 u'held_items',
 u'location_area_encounters',
 u'height',
 u'forms',
 u'base_experience',
 u'id',
 u'game_indices',
 u'species',
 u'moves',
 u'order',
 u'sprites',
 u'types']

In [24]:
# Is "weight" listed there? If so, then the value in it should be a number
# If you play Pokemon, does this number look reasonable?
print(groudon_dic['weight'])

9500
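
The `.json()` method is a convenience that decodes the response body much as the standard library's `json.loads` would. The sketch below parses a hand-trimmed stand-in for the Groudon response (only three of the keys listed above, with the `name` value assumed) and converts the weight to kilograms. PokeAPI documents `weight` in hectograms, so 9500 corresponds to 950 kg.

```python
import json

# Trimmed stand-in for the API response; the real payload has many more keys,
# and the "name" value here is an assumption
sample_body = '{"id": 383, "name": "groudon", "weight": 9500}'

pokemon = json.loads(sample_body)  # JSON object -> Python dictionary

HECTOGRAMS_PER_KG = 10
weight_kg = pokemon['weight'] / HECTOGRAMS_PER_KG  # true division in Python 3

print('{} weighs {} kg'.format(pokemon['name'], weight_kg))
```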
