# Web Scraping

## Objectives


1. Understand the process of getting data from the web.
2. Know the basics of HTML/CSS:
    * Know how to pull desired data from web pages.
3. Be able to use existing APIs to fetch pre-formatted data.

<div style="text-align: center"><h3>The Reality of Web Scraping</h3><img src="images/scraping_meme.png" style="width: 600px"></div>

## Why do we scrape the web?

* Realistically, data that you want to study won't always be available to you in the form of a curated data set.

<div style="text-align: center"><h3>Web Data Pipeline</h3><img src="images/web_data_pipeline.png" style="width: 600px"></div>


# What are the keys to a good scraper?

Some things to keep in mind when building your scraper:

- Save all the data you collect
- Don't abuse `try`/`except` blocks
- Don't get banned from the site you're interested in
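
The three points above can be sketched in a few lines. This is only an illustration under stated assumptions: `scrape_pages`, `fetch` and `delay_seconds` are hypothetical names, and `fetch` stands for whatever function you use to download a page. The sketch keeps the raw HTML of every page, catches only the one error it expects, and pauses between requests.

```python
import time
from urllib.error import URLError


def scrape_pages(urls, fetch, delay_seconds=1.0):
    """Fetch each URL politely and keep the raw HTML of every page.

    `fetch` is a hypothetical callable that takes a URL and returns HTML text.
    """
    raw_pages = {}  # url -> raw HTML, saved before any parsing
    failures = {}   # url -> error message, so nothing fails silently
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause between requests to avoid a ban
        try:
            raw_pages[url] = fetch(url)
        except URLError as err:  # catch only the error we expect
            failures[url] = str(err)
    return raw_pages, failures
```

Because the raw pages are saved, parsing mistakes can be fixed later without requesting the pages again.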

### Internet vs. World Wide Web

* The internet is commonly referred to as a network of networks. It is the infrastructure that allows networks all around the world to connect with one another. There are many different protocols to transfer information within this larger, meta-network.
* The World Wide Web, or Web, provides one of the ways that data can be transferred over the internet. It uses a **U**niform **R**esource **L**ocator, URL, to specify the location of a document within the internet.

<div style="text-align: center"><h3>Anatomy of a URL</h3><img src="images/url.png" style="width: 600px"></div>

* Documents on the web are generally written in **H**yper**T**ext **M**arkup **L**anguage, HTML, which can be natively viewed by browsers, the tool that we use to browse the web.
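
To make the anatomy of a URL concrete, here is a small sketch that splits a URL into its parts with Python's standard library. The URL is the RSS link that appears in the Craigslist page scraped later in these notes; the variable names are only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://austin.craigslist.org/search/hhh?format=rss'
parts = urlsplit(url)

print(parts.scheme)           # protocol, here 'https'
print(parts.netloc)           # host, here 'austin.craigslist.org'
print(parts.path)             # path to the document, here '/search/hhh'
print(parse_qs(parts.query))  # query parameters, here {'format': ['rss']}
```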

### Communication on the Web

Information is transmitted around the web through a number of protocols. The main one that you will see is the **H**yper**T**ext **T**ransfer **P**rotocol, HTTP. These transfers, called **requests**, are initiated in a number of ways, but always begin with the client, read: you at your browser.

<div style="text-align: center"><h3>Requests in Action</h3><img src="images/requests.png" style="width: 600px"></div>

There are four main types of request that your browser can issue: `GET`, `POST`, `PUT` and `DELETE`. For web scraping purposes, you will almost always be using `GET` requests. We will learn more about the others in a couple of weeks during data products day.

# Scraping from a Web Page with Python

Scraping a web site basically comes down to making a request from Python and parsing through the HTML that is returned from each page. For each of these tasks we have a Python library, `requests` and `bs4`, respectively.

### Requests Library

The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making HTTP requests within Python. The interface is mind-bogglingly simple. Call the function for the type of request you want, which will mostly be `requests.get`, with the URL and any optional parameters you'd like passed through the request. The `Response` object it returns makes the results of the request available via attributes and methods.
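
To see what "a URL plus optional parameters" turns into, the sketch below builds a `GET` request without sending it. It deliberately uses the standard library so that it runs offline; with `requests`, the equivalent call would be `requests.get(base_url, params=params)`, which sends the request and returns a `Response`. The `s=120` parameter is taken from the "next" link in the Craigslist page shown later, and the variable names are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import Request

base_url = 'https://austin.craigslist.org/search/hhh'
params = {'s': 120}  # offset of the next results page, as seen in the page's "next" link

# Build (but do not send) a GET request with the parameters encoded in the URL.
request = Request(base_url + '?' + urlencode(params), method='GET')

print(request.get_method())  # 'GET'
print(request.full_url)      # 'https://austin.craigslist.org/search/hhh?s=120'
```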

## HTML Concepts


**H**yper**T**ext **M**arkup **L**anguage

HTML is a *markup language* (think markdown) that forms the building blocks of all websites.  It controls what to say and where to say it, along with some semantic meaning (this is a section, this is a list, this part is emphasised).

It consists of tags enclosed in angle brackets (like `<html>`).

A minimal HTML document, unfortunately, contains a lot of cruft.  Here's one I got from [https://www.sitepoint.com/a-minimal-html-document/](https://www.sitepoint.com/a-minimal-html-document/).


```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
  <head>
  
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title>title</title>
    <link rel="stylesheet" type="text/css" href="style.css">
    <script type="text/javascript" src="script.js"></script>
  </head>
  <body>
		
  </body>
</html>
```



The `<link>` and `<script>` tags are not strictly necessary, but will appear in more or less every HTML document.

* The `<link>` tag points to a **stylesheet**, which controls how different parts of the document are rendered in the browser.  This makes things pretty.
* The `<script>` tag points to a **JavaScript** program.  This allows programmers to add *dynamic behaviour* to an HTML document.
* The `<body>` tag contains the guts of your document.
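
To see this structure from Python's point of view, here is a minimal sketch that walks a trimmed copy of the document above and records each opening tag with its attributes. It uses the standard library's `HTMLParser` rather than Beautiful Soup, purely to show what a parser sees; `TagCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening tag and its attributes in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


html_doc = """
<html lang="en">
  <head>
    <title>title</title>
    <link rel="stylesheet" type="text/css" href="style.css">
    <script type="text/javascript" src="script.js"></script>
  </head>
  <body></body>
</html>
"""

collector = TagCollector()
collector.feed(html_doc)
for tag, attrs in collector.tags:
    print(tag, attrs)
```

Each tag is a name plus a dictionary of attributes, which is the same view Beautiful Soup exposes later through expressions like `tag['href']`.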

### Important Tags

```html
<div>Defines a division or section of the document.</div>
<a href="http://www.w3schools.com">A hyperlink to W3Schools.com!</a>

<h1>This is a header!</h1>

<p>This is a paragraph!</p>

<h2>This is a Subheading!</h2>

<table>
  This is a table!
  <tr>
    <td>An entry in the first row.</td>
    
    <td>Another entry in the first row.</td>
  </tr>
  <tr>
    <td>An entry in the second row.</td>
    <td>Another entry in the second row.</td>
  </tr>
</table>

<ul>
  This is a list!
  <li>This is the first thing in the list!</li>
  <li>This is the second thing in the list!</li>
</ul>
```

# Scraping Craigslist

In [1]:
import requests
import re
from bs4 import BeautifulSoup

import json

import time

## 1) Requesting the first page

In [2]:
webpage = requests.get('https://austin.craigslist.org/search/hhh?')

In [3]:
type(webpage)

requests.models.Response

In [4]:
webpage.text

'\ufeff<!DOCTYPE html>\n<html class="no-js"><head>\n    <title>austin housing  - craigslist</title>\n\n    <meta name="description" content="austin housing  - craigslist">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge"/>\n    <link rel="canonical" href="https://austin.craigslist.org/search/hhh">\n    <link rel="alternate" type="application/rss+xml" href="https://austin.craigslist.org/search/hhh?format=rss" title="RSS feed for craigslist | austin housing  - craigslist">\n        <link rel="next" href="https://austin.craigslist.org/search/hhh?s=120">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n    <link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/cl.css?v=d99915fc6f3187577f30df4caeee65d6">\n    <link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/search.css?v=84cf86bc094026e12fa066bbbab154ac">\n    <link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles

In [5]:
soup = BeautifulSoup(webpage.text, 'html.parser')

In [6]:
soup

﻿<!DOCTYPE html>

<html class="no-js"><head>
<title>austin housing  - craigslist</title>
<meta content="austin housing  - craigslist" name="description"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible">
<link href="https://austin.craigslist.org/search/hhh" rel="canonical"/>
<link href="https://austin.craigslist.org/search/hhh?format=rss" rel="alternate" title="RSS feed for craigslist | austin housing  - craigslist" type="application/rss+xml"/>
<link href="https://austin.craigslist.org/search/hhh?s=120" rel="next"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<link href="//www.craigslist.org/styles/cl.css?v=d99915fc6f3187577f30df4caeee65d6" media="all" rel="stylesheet" type="text/css"/>
<link href="//www.craigslist.org/styles/search.css?v=84cf86bc094026e12fa066bbbab154ac" media="all" rel="stylesheet" type="text/css"/>
<link href="//www.craigslist.org/styles/jquery-ui-clcustom.css?v=3b05ddffb7c7f5b62066deff2dda9339" media="all" rel="stylesheet" type="text/c

From our work with the inspect feature of our web browser, we find that we are looking for `a` tags with the class `result-title hdrlnk`.  The `find_all` command below will return a list of all links that match our search.

In [7]:
soup.find_all('a',class_='result-title hdrlnk')

[<a class="result-title hdrlnk" data-id="6703886443" href="https://austin.craigslist.org/apa/d/500-off-hill-country-views-d/6703886443.html">$500 off Hill Country views w/d inc Citywide 512-835-RENT (835-7368)</a>,
 <a class="result-title hdrlnk" data-id="6690661515" href="https://austin.craigslist.org/apa/d/park-like-setting-close-to/6690661515.html">Park Like Setting ~ Close To Shopping, Restaurants, &amp; Entertainment!</a>,
 <a class="result-title hdrlnk" data-id="6703899489" href="https://austin.craigslist.org/apa/d/leaves-dropping-prices/6703899489.html">Leaves Dropping, &amp; Prices Dropping!</a>,
 <a class="result-title hdrlnk" data-id="6695707805" href="https://austin.craigslist.org/apa/d/need-house-or-apt-fast-credit/6695707805.html">Need a house or Apt fast credit fix pay us after</a>,
 <a class="result-title hdrlnk" data-id="6701847664" href="https://austin.craigslist.org/apa/d/pet-friendly-including-large/6701847664.html">Pet Friendly, Including Large Dogs - Pet Policy, Wi

We want to work on one tag at a time to get our code working; then we can put it in a loop to run on all tags.

In [8]:
tag= soup.find_all('a',class_='result-title hdrlnk')
tag

[<a class="result-title hdrlnk" data-id="6703886443" href="https://austin.craigslist.org/apa/d/500-off-hill-country-views-d/6703886443.html">$500 off Hill Country views w/d inc Citywide 512-835-RENT (835-7368)</a>,
 <a class="result-title hdrlnk" data-id="6690661515" href="https://austin.craigslist.org/apa/d/park-like-setting-close-to/6690661515.html">Park Like Setting ~ Close To Shopping, Restaurants, &amp; Entertainment!</a>,
 <a class="result-title hdrlnk" data-id="6703899489" href="https://austin.craigslist.org/apa/d/leaves-dropping-prices/6703899489.html">Leaves Dropping, &amp; Prices Dropping!</a>,
 <a class="result-title hdrlnk" data-id="6695707805" href="https://austin.craigslist.org/apa/d/need-house-or-apt-fast-credit/6695707805.html">Need a house or Apt fast credit fix pay us after</a>,
 <a class="result-title hdrlnk" data-id="6701847664" href="https://austin.craigslist.org/apa/d/pet-friendly-including-large/6701847664.html">Pet Friendly, Including Large Dogs - Pet Policy, Wi

In [9]:
# Just take the first tag
tag = tag[0]
tag

<a class="result-title hdrlnk" data-id="6703886443" href="https://austin.craigslist.org/apa/d/500-off-hill-country-views-d/6703886443.html">$500 off Hill Country views w/d inc Citywide 512-835-RENT (835-7368)</a>

Now let's look for the data we want.

In [10]:
tag['href']

'https://austin.craigslist.org/apa/d/500-off-hill-country-views-d/6703886443.html'

In [11]:
tag.text

'$500 off Hill Country views w/d inc Citywide 512-835-RENT (835-7368)'

In [12]:
tag['data-id']

'6703886443'
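
Once the code works for one tag, the same three lookups can be put in a loop over every tag. The sketch below assumes a trimmed stand-in for the results page, built from two of the links in the output above, so that it runs without a network request; the `listings` list and its dictionary keys are illustrative names.

```python
from bs4 import BeautifulSoup

# A trimmed stand-in for the search results page, using two links from the output above.
html_doc = """
<a class="result-title hdrlnk" data-id="6703886443" href="https://austin.craigslist.org/apa/d/500-off-hill-country-views-d/6703886443.html">$500 off Hill Country views w/d inc Citywide 512-835-RENT (835-7368)</a>
<a class="result-title hdrlnk" data-id="6703899489" href="https://austin.craigslist.org/apa/d/leaves-dropping-prices/6703899489.html">Leaves Dropping, &amp; Prices Dropping!</a>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

listings = []
for tag in soup.find_all('a', class_='result-title hdrlnk'):
    # Same lookups as above, applied to every matching link.
    listings.append({
        'id': tag['data-id'],
        'title': tag.text,
        'link': tag['href'],
    })

for listing in listings:
    print(listing['id'], listing['title'])
```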

## 2) Gather information from the sub page

Often a listing links to a page with more of the information we want.  We find the URL (the `href`) and then request that page with the `requests` library.

In [13]:
link = tag['href']

In [14]:
sub_page = requests.get(link)

In [15]:
sub_soup = BeautifulSoup(sub_page.text, 'html.parser')
sub_soup

<!DOCTYPE html>

<html class="no-js">
<head>
<title>$500 off Hill Country views w/d inc Citywide 512-835-RENT (835-7368) - apts/housing for rent - apartment rent</title>
<link href="https://austin.craigslist.org/apa/d/500-off-hill-country-views-d/6703886443.html" rel="canonical"/>
<meta content="The property offers apartments in Austin, TX with stunning one, two and three bedroom designs. All of the open apartments feature spacious kitchens with large islands and breakfast bars. Each of the..." name="description"/>
<meta content="noarchive,nofollow,unavailable_after: 05-Nov-18 12:55:44 CST" name="robots"/>
<meta content="preview" name="twitter:card"/>
<meta content="The property offers apartments in Austin, TX with stunning one, two and three bedroom designs. All of the open apartments feature spacious kitchens with large islands and breakfast bars. Each of the..." property="og:description"/>
<meta content="https://images.craigslist.org/00p0p_fMncOLz5tYy_600x450.jpg" property="og:image

#### Again, let's pull out the data we want

In [16]:
price = sub_soup.find_all('span',class_='price')
price

[<span class="price">$1026</span>]

#### We really only want the price, which is the text portion of this tag.  Because `find_all` returns a list even when there is only one match, we need to index the first element

In [17]:
price[0]

<span class="price">$1026</span>

In [18]:
price[0].text

'$1026'

#### We can do these steps in one line

In [19]:
price = sub_soup.find_all('span',class_='price')[0].text
price

'$1026'

#### Now let's gather other information

In [20]:
housing = sub_soup.find_all('span',class_='housing')[0].text
housing

'/ 1br - 673ft2 - '
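
The scraped values are still strings. If numbers are needed for analysis, a small cleaning step like the sketch below could follow. It assumes the formats seen in the outputs above (`'$1026'` and `'/ 1br - 673ft2 - '`); other listings may omit a field, so missing matches become `None`. The variable names are illustrative.

```python
import re

price = '$1026'
housing = '/ 1br - 673ft2 - '

# Strip the currency symbol and any thousands separators before converting.
price_dollars = int(price.replace('$', '').replace(',', ''))

# Pull the bedroom count and square footage out of the housing string, if present.
bedrooms_match = re.search(r'(\d+)br', housing)
sqft_match = re.search(r'(\d+)ft2', housing)
bedrooms = int(bedrooms_match.group(1)) if bedrooms_match else None
square_feet = int(sqft_match.group(1)) if sqft_match else None

print(price_dollars, bedrooms, square_feet)  # 1026 1 673
```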

In [21]:
description = sub_soup.find_all('section',id='postingbody')[0].text
description

"\n\nQR Code Link to This Post\n\n\nThe property offers apartments in Austin, TX with stunning one, two and three bedroom designs. All of the open apartments feature spacious kitchens with large islands and breakfast bars. Each of the stunning living rooms connects to a spacious patio or balcony where residents can relax after a stressful day with a cool beverage. Every home features a patio or balcony, so residents have beautiful views of the surrounding hill country setting. The homes are finished with vinyl wood floors or stained concrete for a modern feel. Every bedroom in the community features soft carpeting to keep you feeling comfortable and oversized closets to keep you organized. The master suites boast exclusive bathrooms with espresso or café cabinets, granite countertops and framed mirrors.\nPet resort area.\nGarages available.\nHouse sitting available.\nCovered parking\n\nunit # 80859\n\nWe do free apartment locating and home sales.\nVisit our website for a free online ap

#### The map data may be useful, so I will save it now in case I need it later.  I don't bother cleaning it yet; I can save it as is and clean it later if needed

In [22]:
map_data = sub_soup.find_all('div',class_='mapAndAttrs')
map_data

[<div class="mapAndAttrs">
 <div class="mapbox">
 <div class="viewposting" data-accuracy="10" data-latitude="30.417567" data-longitude="-97.699508" id="map"></div>
 <div class="mapaddress">2311 W Parmer Ln</div>
 <p class="mapaddress">
 <small>
         (<a href="https://maps.google.com/?q=loc%3A+%32%33%31%31+W+Parmer+Ln+Austin+TX+US" target="_blank">google map</a>)
         </small>
 </p>
 </div>
 <p class="attrgroup">
 <span class="shared-line-bubble"><b>1BR</b> / <b>1Ba</b></span>
 <span class="shared-line-bubble"><b>673</b>ft<sup>2</sup></span>
 <span class="housing_movein_now property_date shared-line-bubble" data-date="2018-11-09" data-today_msg="available now">available nov 9</span>
 </p>
 <p class="attrgroup">
 <span>cats are OK - purrr</span>
 <br/>
 <span>dogs are OK - wooof</span>
 <br/>
 <span>condo</span>
 <br/>
 <span>w/d in unit</span>
 <br/>
 </p>
 </div>]

In [23]:
map_data = sub_soup.find_all('div',class_='mapAndAttrs')[0].decode()
map_data

'<div class="mapAndAttrs">\n<div class="mapbox">\n<div class="viewposting" data-accuracy="10" data-latitude="30.417567" data-longitude="-97.699508" id="map"></div>\n<div class="mapaddress">2311 W Parmer Ln</div>\n<p class="mapaddress">\n<small>\n        (<a href="https://maps.google.com/?q=loc%3A+%32%33%31%31+W+Parmer+Ln+Austin+TX+US" target="_blank">google map</a>)\n        </small>\n</p>\n</div>\n<p class="attrgroup">\n<span class="shared-line-bubble"><b>1BR</b> / <b>1Ba</b></span>\n<span class="shared-line-bubble"><b>673</b>ft<sup>2</sup></span>\n<span class="housing_movein_now property_date shared-line-bubble" data-date="2018-11-09" data-today_msg="available now">available nov 9</span>\n</p>\n<p class="attrgroup">\n<span>cats are OK - purrr</span>\n<br/>\n<span>dogs are OK - wooof</span>\n<br/>\n<span>condo</span>\n<br/>\n<span>w/d in unit</span>\n<br/>\n</p>\n</div>'
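
Saving the raw string pays off later. As a sketch of the idea, the block below stores a record as JSON and only afterwards pulls the coordinates out of the saved HTML. It uses a shortened copy of the string above; the `record` layout and the regular expressions are assumptions made for illustration.

```python
import json
import re

# A shortened copy of the saved map_data string from the cell above.
map_data = ('<div class="mapAndAttrs">\n<div class="mapbox">\n'
            '<div class="viewposting" data-accuracy="10" data-latitude="30.417567" '
            'data-longitude="-97.699508" id="map"></div>\n'
            '<div class="mapaddress">2311 W Parmer Ln</div>\n</div>\n</div>')

record = {'id': '6703886443', 'price': '$1026', 'map_data': map_data}

# Serialize the record; the raw HTML survives the round trip unchanged.
saved = json.dumps(record)
restored = json.loads(saved)

# Later, if the coordinates turn out to be needed, pull them from the raw HTML.
latitude = float(re.search(r'data-latitude="([-\d.]+)"', restored['map_data']).group(1))
longitude = float(re.search(r'data-longitude="([-\d.]+)"', restored['map_data']).group(1))
print(latitude, longitude)  # 30.417567 -97.699508
```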

# Original scraping example

In [24]:
# Read the class example from a local HTML file instead of requesting it from the web

path = 'class_example.html'
with open(path) as f:
    html_str = f.read()

Create a Beautiful Soup object with the webpage data. You may need a different parser (the second argument), such as `lxml`, but `html.parser` usually works for HTML pages.

In [25]:
# Create a Beautiful Soup object with webpage data

soup = BeautifulSoup(html_str, 'html.parser')

In [26]:
soup

<!DOCTYPE HTML>

<html lang="en">
<head>
<script src="//assets.adobedtm.com/d2f967b83a0c92b19d9b572545fdbdc3d591f6f5/satelliteLib-389760a7bc4573d6b081d36f6782b59f3c8ffb54.js"></script>
<title>Graphics Cards and Video Cards - Newegg.com</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="always" name="referrer"/>
<meta content="Graphics Cards, Video Cards" name="keywords"/>
<meta content="Shop a wide selection of Video Graphics Cards from EVGA, Gigabyte, MSI &amp; more! Newegg offers the best prices, fast shipping and top-rated customer service!" name="description"/>
<meta content="https://images10.newegg.com/WebResource/Themes/2005/Nest/logo_424x210.png" property="og:image"/>
<meta content="Shop a wide selection of Video Graphics Cards from EVGA, Gigabyte, MSI &amp; more! Newegg offers the best prices, fast shipping and top-rated customer service!" property="og:description"/>
<link href="https://m.newegg.com/Store/Category?description=zBX0b0dGNZj

In [27]:
containers = soup.findAll('div',{'class':'item-container'})

In [28]:
type(containers)

bs4.element.ResultSet

In [29]:
container = containers[0]

In [30]:
print(container)

<div class="item-container ">
<!--product image-->
<a class="item-img" href="https://www.newegg.com/Product/Product.aspx?Item=N82E16814125871&amp;ignorebbr=1">
<img alt="GIGABYTE GeForce GTX 1070 DirectX 12 GV-N1070G1 GAMING-8GD R2 256-Bit GDDR5 PCI Express 3.0 x16 SLI Support ATX Video Card" class=" lazy-img" data-effect="fadeIn" data-src="//images10.newegg.com/ProductImageCompressAll300/14-125-871-S99.jpg" src="//images10.newegg.com/WebResource/Themes/2005/Nest/blank.gif" title="GIGABYTE GeForce GTX 1070 DirectX 12 GV-N1070G1 GAMING-8GD R2 256-Bit GDDR5 PCI Express 3.0 x16 SLI Support ATX Video Card">
</img></a>
<div class="item-info">
<!--brand info-->
<div class="item-branding">
<a class="item-brand" href="https://www.newegg.com/GIGABYTE/BrandStore/ID-1314">
<img alt="GIGABYTE" class=" lazy-img" data-effect="fadeIn" data-src="//images10.newegg.com/Brandimage_70x28//Brand1314.gif" src="//images10.newegg.com/WebResource/Themes/2005/Nest/blank.gif" title="GIGABYTE">
</img></a>
<!--rat

In [31]:
container.div

<div class="item-info">
<!--brand info-->
<div class="item-branding">
<a class="item-brand" href="https://www.newegg.com/GIGABYTE/BrandStore/ID-1314">
<img alt="GIGABYTE" class=" lazy-img" data-effect="fadeIn" data-src="//images10.newegg.com/Brandimage_70x28//Brand1314.gif" src="//images10.newegg.com/WebResource/Themes/2005/Nest/blank.gif" title="GIGABYTE">
</img></a>
<!--rating info-->
<a class="item-rating" href="https://www.newegg.com/Product/Product.aspx?Item=N82E16814125871&amp;SortField=0&amp;SummaryType=0&amp;PageSize=10&amp;SelectedRating=-1&amp;VideoOnlyMark=False&amp;ignorebbr=1&amp;IsFeedbackTab=true#scrollFullInfo" title="Rating + 4"><i class="rating rating-4"></i><span class="item-rating-num">(296)</span></a>
</div>
<!--description info-->
<a class="item-title" href="https://www.newegg.com/Product/Product.aspx?Item=N82E16814125871&amp;ignorebbr=1" title="View Details"><i class="icon-premier icon-premier-xsm"></i>GIGABYTE GeForce GTX 1070 DirectX 12 GV-N1070G1 GAMING-8GD R2

In [32]:
container.div.div

<div class="item-branding">
<a class="item-brand" href="https://www.newegg.com/GIGABYTE/BrandStore/ID-1314">
<img alt="GIGABYTE" class=" lazy-img" data-effect="fadeIn" data-src="//images10.newegg.com/Brandimage_70x28//Brand1314.gif" src="//images10.newegg.com/WebResource/Themes/2005/Nest/blank.gif" title="GIGABYTE">
</img></a>
<!--rating info-->
<a class="item-rating" href="https://www.newegg.com/Product/Product.aspx?Item=N82E16814125871&amp;SortField=0&amp;SummaryType=0&amp;PageSize=10&amp;SelectedRating=-1&amp;VideoOnlyMark=False&amp;ignorebbr=1&amp;IsFeedbackTab=true#scrollFullInfo" title="Rating + 4"><i class="rating rating-4"></i><span class="item-rating-num">(296)</span></a>
</div>

In [33]:
container.div.div.a

<a class="item-brand" href="https://www.newegg.com/GIGABYTE/BrandStore/ID-1314">
<img alt="GIGABYTE" class=" lazy-img" data-effect="fadeIn" data-src="//images10.newegg.com/Brandimage_70x28//Brand1314.gif" src="//images10.newegg.com/WebResource/Themes/2005/Nest/blank.gif" title="GIGABYTE">
</img></a>

In [34]:
container.div.div.a.img

<img alt="GIGABYTE" class=" lazy-img" data-effect="fadeIn" data-src="//images10.newegg.com/Brandimage_70x28//Brand1314.gif" src="//images10.newegg.com/WebResource/Themes/2005/Nest/blank.gif" title="GIGABYTE">
</img>

In [35]:
container.div.div.a.img['title']

'GIGABYTE'

In [36]:
container.findAll('a',{'class':'item-title'})

[<a class="item-title" href="https://www.newegg.com/Product/Product.aspx?Item=N82E16814125871&amp;ignorebbr=1" title="View Details"><i class="icon-premier icon-premier-xsm"></i>GIGABYTE GeForce GTX 1070 DirectX 12 GV-N1070G1 GAMING-8GD R2 Video Card</a>]

In [37]:
container.findAll('a',{'class':'item-title'})[0]

<a class="item-title" href="https://www.newegg.com/Product/Product.aspx?Item=N82E16814125871&amp;ignorebbr=1" title="View Details"><i class="icon-premier icon-premier-xsm"></i>GIGABYTE GeForce GTX 1070 DirectX 12 GV-N1070G1 GAMING-8GD R2 Video Card</a>

In [38]:
container.findAll('a',{'class':'item-title'})[0].text

'GIGABYTE GeForce GTX 1070 DirectX 12 GV-N1070G1 GAMING-8GD R2 Video Card'

In [39]:
for container in containers:
    brand = container.div.div.a.img['title']
    product = container.findAll('a',{'class':'item-title'})[0].text
    print('Brand: ' + brand)
    print('Product: ' + product)
    print()
    
    

Brand: GIGABYTE
Product: GIGABYTE GeForce GTX 1070 DirectX 12 GV-N1070G1 GAMING-8GD R2 Video Card

Brand: EVGA
Product: EVGA GeForce GTX 1060 6GB SSC GAMING ACX 3.0, 6GB GDDR5, LED, DX12 OSD Support (PXOC), 06G-P4-6267-KR

Brand: ASUS
Product: ASUS GeForce GTX 1080 TURBO-GTX1080-8G Video Card

Brand: ZOTAC
Product: ZOTAC GeForce GTX 1070 Mini, ZT-P10700G-10M, 8GB GDDR5

Brand: XFX
Product: XFX Radeon RX 580 DirectX 12 RX-580P427D6 GTS XXX Edition Video Card w/ Backplate

Brand: EVGA
Product: EVGA GeForce GTX 1060 3GB SSC GAMING ACX 3.0, 03G-P4-6167-KR, 3GB GDDR5, LED, DX12 OSD Support (PXOC)

Brand: ASUS
Product: ASUS ROG Strix Radeon RX 570 O4G Gaming OC Edition GDDR5 DP HDMI DVI VR Ready AMD Graphics Card (ROG-STRIX-RX570-O4G-GAMING)

Brand: GIGABYTE
Product: GIGABYTE GeForce GTX 1080 Ti Turbo 11GD, GV-N108TTURBO-11GD

Brand: ZOTAC
Product: ZOTAC GeForce GTX 1080 Ti AMP Extreme Core 11GB GDDR5X 352-bit Gaming Graphics Card VR Ready 16+2 Power Phase Freeze Fan Stop IceStorm Cooling Sp
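
Chained lookups such as `container.div.div.a.img['title']` raise an error as soon as one item is missing a piece. A more defensive version of the loop might look like the sketch below. The HTML is a trimmed, partly hypothetical stand-in for the product page (the second, unbranded item is invented to show the failure case), and `find_all` is the current spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# A trimmed stand-in for the product page; the second item is hypothetical
# and has no brand image.
html_doc = """
<div class="item-container">
  <div class="item-info">
    <div class="item-branding">
      <a class="item-brand" href="#"><img title="GIGABYTE"></a>
    </div>
    <a class="item-title" href="#">GIGABYTE GeForce GTX 1070 Video Card</a>
  </div>
</div>
<div class="item-container">
  <div class="item-info">
    <div class="item-branding"></div>
    <a class="item-title" href="#">Unbranded Video Card</a>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

products = []
for container in soup.find_all('div', class_='item-container'):
    brand_link = container.find('a', class_='item-brand')
    title_link = container.find('a', class_='item-title')
    # find() returns None when nothing matches, so check before indexing.
    has_brand = brand_link is not None and brand_link.img is not None
    brand = brand_link.img['title'] if has_brand else None
    product = title_link.text if title_link is not None else None
    products.append({'brand': brand, 'product': product})

print(products)
```

Checking for `None` explicitly keeps the loop running on incomplete items without resorting to a broad `try`/`except`.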

# Scraping example 2

Scraping ESPN college football stats

In [40]:
# Get webpage html

webpage = requests.get('http://www.espn.com/college-football/statistics/player/_/stat/passing/sort/passingYards/year/2015/qualified/false/count/1')

In [41]:
bs_obj = BeautifulSoup(webpage.text, 'html.parser')

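#### Because of how the rows of the table are set up, we want the rows whose class is either `evenrow` or `oddrow`.  Note that a dict cannot hold the same key twice, so the filter below keeps only `'class':'oddrow'` and returns just the odd rows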

In [42]:
bs_obj.findAll('tr',{'class':'evenrow','class':'oddrow'})

[<tr align="right" class="oddrow player-23-504866"><td align="left">1</td><td align="left"><a href="http://www.espn.com/college-football/player/_/id/504866/brandon-doughty">Brandon Doughty</a>, QB</td><td align="left"><span title="Western Kentucky">WKU</span></td><td>388</td><td>540</td><td>71.9</td><td class="sortcell">5055</td><td>9.4</td><td>78</td><td>48</td><td>9</td><td>15</td><td>176.5</td></tr>,
 <tr align="right" class="oddrow player-23-547401"><td align="left">3</td><td align="left"><a href="http://www.espn.com/college-football/player/_/id/547401/jared-goff">Jared Goff</a>, QB</td><td align="left"><span title="California">CAL</span></td><td>341</td><td>529</td><td>64.5</td><td class="sortcell">4719</td><td>8.9</td><td>80</td><td>43</td><td>13</td><td>26</td><td>161.3</td></tr>,
 <tr align="right" class="oddrow player-23-550629"><td align="left">5</td><td align="left"><a href="http://www.espn.com/college-football/player/_/id/550629/luke-falk">Luke Falk</a>, QB</td><td align="l

#### A way to match both classes at once is pattern matching with a regular expression

In [43]:
bs_obj.findAll('tr',{'class':re.compile('^(evenrow|oddrow)')})

[<tr align="right" class="oddrow player-23-504866"><td align="left">1</td><td align="left"><a href="http://www.espn.com/college-football/player/_/id/504866/brandon-doughty">Brandon Doughty</a>, QB</td><td align="left"><span title="Western Kentucky">WKU</span></td><td>388</td><td>540</td><td>71.9</td><td class="sortcell">5055</td><td>9.4</td><td>78</td><td>48</td><td>9</td><td>15</td><td>176.5</td></tr>,
 <tr align="right" class="evenrow player-23-513573"><td align="left">2</td><td align="left"><a href="http://www.espn.com/college-football/player/_/id/513573/matt-johnson">Matt Quinn Johnson</a>, QB</td><td align="left"><span title="Bowling Green">BGSU</span></td><td>383</td><td>569</td><td>67.3</td><td class="sortcell">4946</td><td>8.7</td><td>94</td><td>46</td><td>8</td><td>36</td><td>164.2</td></tr>,
 <tr align="right" class="oddrow player-23-547401"><td align="left">3</td><td align="left"><a href="http://www.espn.com/college-football/player/_/id/547401/jared-goff">Jared Goff</a>, QB<
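
The difference between the two calls can be reproduced offline. The sketch below uses a small stand-in table (the `colhead` row is a hypothetical extra row that should not match) and compares the duplicate-key dict, a list of classes, and the regular expression.

```python
import re
from bs4 import BeautifulSoup

# A small stand-in for the stats table; the 'colhead' row is hypothetical.
html_doc = """
<table>
  <tr class="oddrow player-23-504866"><td>1</td></tr>
  <tr class="evenrow player-23-513573"><td>2</td></tr>
  <tr class="oddrow player-23-547401"><td>3</td></tr>
  <tr class="colhead"><td>RK</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# A dict literal keeps only the last value for a repeated key.
duplicate_key_filter = {'class': 'evenrow', 'class': 'oddrow'}
print(duplicate_key_filter)  # {'class': 'oddrow'}

odd_only = soup.find_all('tr', duplicate_key_filter)
by_list = soup.find_all('tr', class_=['evenrow', 'oddrow'])
by_regex = soup.find_all('tr', class_=re.compile('^(evenrow|oddrow)'))

print(len(odd_only), len(by_list), len(by_regex))  # 2 3 3
```

Passing a list of class names is an alternative to the regular expression when the class names are known exactly.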

#### Looking at the tags above, we can see that the school name is stored in the `title` attribute of the `span` tag

In [44]:
for obj in bs_obj.findAll('tr',{'class':re.compile('^(evenrow|oddrow)')}):
    print(obj.find('span')['title'])

Western Kentucky
Bowling Green
California
Texas Tech
Washington State
Southern Mississippi
Georgia State
Tulsa
Clemson
Ole Miss
Louisiana Tech
Middle Tennessee
Arizona State
Central Michigan
Mississippi State
Memphis
Oklahoma State
Oklahoma
UCLA
TCU
Indiana
USC
Western Michigan
Arkansas
BYU
Boise State
Miami
West Virginia
Michigan State
Alabama
North Carolina
Nebraska
Michigan
Idaho
Temple
Buffalo
Toledo
Washington
UMass
Notre Dame


#### Now, looping through every row, we can follow the link to each player's individual page and pull the extra data we want

In [45]:
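for obj in bs_obj.findAll('tr',{'class':re.compile('^(evenrow|oddrow)')}):
    # Follow the link in each row to the player's individual page
    html = requests.get(obj.find('a')['href'])
    bs_obj2 = BeautifulSoup(html.text, 'html.parser')
    name = bs_obj2.h1.get_text()
    born = bs_obj2.find('ul',{'class':'player-metadata'}).findAll('li')[0].get_text()
    # Only print players whose first metadata item begins with 'B' (the 'Born' entry)
    if born[0] == 'B':
        print('{} Born: {}'.format(name,born[4:]))
    time.sleep(1)  # pause between requests so we don't hammer the site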

Brandon Doughty Born: Oct 6, 1991 in Davie, FL
Matt Johnson Born: Sep 9, 1992 in Harrisburg, PA
Jared Goff Born: Oct 14, 1994 in San Rafael, CA (Age: 23)
Patrick Mahomes Born: Sep 17, 1995 in Tyler, TX (Age: 23)
Luke Falk Born: Dec 28, 1994 in Logan, UT (Age: 23)
Nick Mullens Born: Mar 21, 1995 in Hoover, AL
Dane Evans Born: Nov 19, 1993
Deshaun Watson Born: Sep 14, 1995 in Gainesville, GA (Age: 23)
Chad Kelly Born: Mar 26, 1994 in Niagara Falls, NY (Age: 24)
Jeff Driskel Born: Apr 23, 1993 in Oviedo, FL (Age: 25)
Mike Bercovici Born: Feb 9, 1993 in Northridge, CA
Cooper Rush Born: Nov 21, 1993 in Schererville, IN (Age: 24)
Dak Prescott Born: Jul 29, 1993 in Sulphur, LA (Age: 25)
Paxton Lynch Born: Feb 12, 1994 in San Antonio, TX
Mason Rudolph Born: Jul 17, 1995 in Rock Hill, SC (Age: 23)
Baker Mayfield Born: Apr 14, 1995 in USA (Age: 23)
Josh Rosen Born: Feb 10, 1997 in Manhattan Beach, CA (Age: 21)
Trevone Boykin Born: Aug 22, 1993 in Mesquite, TX
Nate Sudfeld Born: Oct 7, 1993 in Sa

# Scraping from an Existing API

Let's take a look at the API for all the publicly available policing data in the [UK](https://data.police.uk/docs/). After taking a look at the documentation for the interface, let's experiment with what we get when we issue a request to this API. The process looks remarkably similar to the one we went through for scraping a web page, except this time the response we're looking for is available via the `json()` method.

In [47]:
r = requests.get('https://data.police.uk/api/leicestershire/NC04/events')


In [48]:
r.json()[:2]

[{'contact_details': {},
  'description': None,
  'end_date': '2018-09-22T14:00:00',
  'title': 'Drop-in at the Buddhist Centre',
  'address': 'The Buddhist Centre, 17 Guildhall Lane',
  'type': 'meeting',
  'start_date': '2018-09-22T12:00:00'},
 {'contact_details': {},
  'description': None,
  'end_date': '2018-09-26T12:30:00',
  'title': 'Beat Surgery St Maragret&#39;s Church',
  'address': 'St Margaret&#39;s Church, St Margaret&#39;s Way',
  'type': 'meeting',
  'start_date': '2018-09-26T10:00:00'}]
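
Even pre-formatted data usually needs a little cleaning. As a sketch, the block below takes one record copied from the response above, decodes the HTML entities left in its strings, and converts the timestamps into `datetime` objects. The variable names are illustrative, and the timestamp format is assumed from the two records shown.

```python
import html
from datetime import datetime

# One record copied from the API response above.
event = {
    'contact_details': {},
    'description': None,
    'end_date': '2018-09-26T12:30:00',
    'title': 'Beat Surgery St Maragret&#39;s Church',
    'address': 'St Margaret&#39;s Church, St Margaret&#39;s Way',
    'type': 'meeting',
    'start_date': '2018-09-26T10:00:00',
}

# Decode HTML entities such as &#39; that the API leaves in its strings.
address = html.unescape(event['address'])

# Turn the ISO 8601 timestamps into datetime objects to do arithmetic on them.
start = datetime.strptime(event['start_date'], '%Y-%m-%dT%H:%M:%S')
end = datetime.strptime(event['end_date'], '%Y-%m-%dT%H:%M:%S')

print(address)      # St Margaret's Church, St Margaret's Way
print(end - start)  # 2:30:00
```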