<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 10px;">

# Web Scraping Primer

---


### STUDENT PRE-WORK
*Before this lesson, you should already be able to:*
- Understand basic HTML concepts
- Work with Beautiful Soup

## What is hardest to understand about scraping?
_e.g., if you were asked to scrape Craigslist property listings and put them in a DataFrame, what would hold you up?_

## HTML Review

In the HTML DOM (Document Object Model), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * The text inside HTML elements is stored in text nodes.
 * Comments are comment nodes.
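
To make the node types concrete, here is a minimal sketch that walks a tiny document with Python's standard-library `xml.dom.minidom` and labels each node. The sample markup and the `walk` helper are made up for illustration, and `minidom` is an XML parser, so this only works because the sample is well-formed.

```python
from xml.dom import minidom, Node

# A tiny, well-formed document (hypothetical example, not from a real page)
doc = minidom.parseString(
    '<html><body><!-- a comment --><p id="intro">Hello</p></body></html>'
)

# Map the numeric node-type constants to readable names
NODE_NAMES = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}


def walk(node, found):
    """Recursively record (node type, node name) pairs."""
    found.append((NODE_NAMES.get(node.nodeType, "other"), node.nodeName))
    # Attributes are not child nodes, so visit them separately
    if node.nodeType == Node.ELEMENT_NODE:
        for i in range(node.attributes.length):
            attr = node.attributes.item(i)
            found.append((NODE_NAMES[attr.nodeType], attr.nodeName))
    for child in node.childNodes:
        walk(child, found)
    return found


nodes = walk(doc, [])
for kind, name in nodes:
    print(kind, name)
```

Each of the five node types from the list above shows up once the walk reaches the `<p>` element, its `id` attribute, and its text.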

## Elements
Elements begin and end with **opening and closing "tags"**. A tag is the element's name enclosed in angle brackets, and the closing tag adds a forward slash before the name.

```html
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

_note: the tags **title, p, and strong** are represented here._

## Element Parent / Child Relationships

<img src="http://www.htmlgoodies.com/img/2007/06/flowChart2.gif" width="250">

**Elements begin and end with the same tag name, like so:**  `<p></p>`

**Elements can have parents and children:**

```html
<body>
    <div>I am inside the parent element
        <div>I am inside a child element</div>
        <div>I am inside another child element</div>
        <div>I am inside yet another child element</div>
    </div>
</body>
```
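
As a rough sketch of how this nesting looks from code, the standard-library `xml.etree.ElementTree` module can parse the snippet above and expose each element's direct children. This assumes well-formed markup (ElementTree is an XML parser, not a forgiving HTML parser), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

html = """
<body>
    <div>I am inside the parent element
        <div>I am inside a child element</div>
        <div>I am inside another child element</div>
        <div>I am inside yet another child element</div>
    </div>
</body>
"""

root = ET.fromstring(html.strip())  # root is the <body> element
parent = root[0]                    # first child of <body>: the outer <div>
children = list(parent)             # direct children of the outer <div>
child_texts = [child.text for child in children]

print(parent.text.strip())
print(child_texts)
```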

## Element Attributes

Elements can also have attributes!  Attributes are defined inside **element tags** and can contain data that may be useful to scrape.

```html
<a href="http://lmgtfy.com/?q=html+element+attributes" title="A title" id="web-link" name="hal">A Simple Link</a>
```

The **element attributes** of this `<a>` tag element are:
- id
- href
- title
- name

This `<a>` tag example will render in your browser like this:
> <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">A Simple Link</a>
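
To see those attributes as data, here is a small sketch using the standard-library `html.parser` module. The `AttributeCollector` class is a hypothetical helper written for this example; it simply stores the attributes of every `<a>` tag it encounters.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the attributes of every <a> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            self.links.append(dict(attrs))


snippet = (
    '<a href="http://lmgtfy.com/?q=html+element+attributes" '
    'title="A title" id="web-link" name="hal">A Simple Link</a>'
)

collector = AttributeCollector()
collector.feed(snippet)
link_attrs = collector.links[0]

print(sorted(link_attrs))
print(link_attrs["href"])
```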


## Can you identify an attribute, an element, a text item, and a child element?

```HTML
<html>
   <title id="main-title">All this scraping is making me itch!</title>
   <body>
       <h1>Welcome to my Homepage</h1>
       <p id="welcome-paragraph" class="strong-paragraph">
           <span>Hello friends, let me tell you about this cool hair product..</span>
           <ul>
              <li>It's cool</li>
              <li>It's fresh</li>
              <li>It can tell the future</li>
              <li>Always be closing</li>
           </ul>
       </p>
   </body>
</html>
```


## Enter XPath

XPath uses path expressions to select nodes or node-sets in an HTML/XML document. These path expressions look very much like the expressions you see when you work with a traditional computer file system.
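
The file-system analogy can be sketched with the standard-library `xml.etree.ElementTree` module, which supports a limited subset of XPath. The sample markup is invented for illustration, and paths here are written relative to the root `<html>` element, since ElementTree does not accept a leading `/`.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<html><body><div><span>good</span></div></body></html>")

# Step-by-step path from the root element, like walking nested directories
by_path = root.find("body/div/span")

# './/' searches at any depth below the current element
anywhere = root.find(".//span")

print(by_path.text, anywhere.text)
```

Both expressions reach the same element; the first spells out every level, while the second skips straight to any matching descendant.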

## XPath Features

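The full XPath specification (version 2.0 and later) includes over 100 built-in functions to help us select and manipulate HTML (or XML) documents. Many libraries, including those built on libxml2, implement XPath 1.0, which offers a smaller core set. Across versions, XPath has functions for: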

- string values
- numeric values
- date and time comparison
- sequence manipulation
- Boolean values
- and more!

## Basic XPath Expressions

XPath comes with a wide array of features, but most everyday scraping problems come down to the basics of selecting data from HTML documents.  There are two ways you can **select elements** within HTML using **XPath**:

- Absolute reference
- Relative reference

# XPath:  Absolute References

_For our XPath demonstration, we will use Scrapy, which relies on lxml (a Python binding for libxml2) under the hood.  libxml2 provides the basic functionality for XPath expressions._

In [5]:
# pip install scrapy
# pip install --upgrade zope2
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

HTML = """
<html>
    <body>
        <span id="only-span">good</span>
    </body>
</html>
"""
# Select the span's text with an "absolute" reference, starting from the document root
Selector(text=HTML).xpath('/html/body/span/text()').get()


'good'

In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Path to a local chromedriver binary; adjust this for your own machine
chromedriver = '/Users/smoot/Desktop/chromedriver'
driver = webdriver.Chrome(service=Service(chromedriver))
driver.get("http://www.cfbstats.com/2015/team/732/rushing/index.html")

# Absolute XPath references work in Selenium too:
# titles = driver.find_elements(By.XPATH, "/html/body/center[1]/table/tbody")
# titles = driver.find_elements(By.XPATH, '/html/body/center[1]/table/tbody/tr[3]/td/table/tbody/tr/td[3]/a')

# for i in titles:
#     print(i.text)

## Relative Reference

Relative references in XPath begin with `//` and match a structure wherever it appears in the document, no matter what comes before it.  Since there is only a single "span" element, `//span/text()` matches **one text node**.

In [6]:
Selector(text=HTML).xpath('//span/text()').extract()

['good']

## Selecting Attributes

Attributes live **within a tag**, such as `id="only-span"` within our span element.  We can get an attribute's value by using the `@` symbol **after** the **element reference**.


In [7]:
Selector(text=HTML).xpath('//span/@id').extract()  # @id returns the value of the 'id' attribute

['only-span']
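
For comparison, here is a sketch of the same idea using only the standard library. `xml.etree.ElementTree` cannot return attribute values directly from a path expression, so it filters on `[@id]` and then reads the value with `.get()`. The second `<span>` is a hypothetical addition to show the filter at work.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<html>
    <body>
        <span id="only-span">good</span>
        <span>no id here</span>
    </body>
</html>
""".strip())

# [@id] keeps only the spans that carry an id attribute
spans_with_id = root.findall(".//span[@id]")

# .get() reads the attribute value, much like @id does in full XPath
ids = [span.get("id") for span in spans_with_id]
print(ids)
```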

## Where's Waldo - "XPath Edition"

In this example, we will find Waldo together.  Find Waldo as:

- Element
- Attribute
- Text element

In [8]:
HTML = """
<html>
    <body>
        
        <ul id="waldo">
            <li class="waldo">Name:  Waldo</li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
        </ul>
        
        <waldo>Waldo</waldo>
    </body>
</html>
"""

In [17]:
#solution code goes here


## 1 vs N Selections

When selecting elements via relative reference, it's possible that you will select multiple items.  It's still possible to select single items, if you're specific enough.

**Singular Reference**
- **Index** starts at **1**
- Selections by offset
- Selections by "first" or "last"
- Selections by **unique attribute value**
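
The singular selections listed above can be sketched with the standard-library `xml.etree.ElementTree` module, which supports these predicates in its XPath subset. The list markup and its `id` values are invented for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<ul>
    <li id="first-item">one</li>
    <li id="middle-item">two</li>
    <li id="last-item">three</li>
</ul>
""".strip())

first = root.find("li[1]")                  # positions start at 1, not 0
last = root.find("li[last()]")              # the last <li> under its parent
before_last = root.find("li[last()-1]")     # offset counted from the end
by_id = root.find("li[@id='middle-item']")  # unique attribute value

print(first.text, last.text, before_last.text, by_id.text)
```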


In [10]:
HTML = """
<html>
    <body>
    
        <!-- Search Results -->
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=751hUX_q0Do" title="Rappin with Gas">Rapping with gas</a>
           <span class="link-details">This is a great video about gas.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=97byWqi-zsI" title="Casio Rapmap">The Rapmaster</a>
           <span class="link-details">My first synth ever.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=TSwqnR327fk" title="Cinco Products">Cinco Midi Organizer</a>
           <span class="link-details">Midi files at the speed of light.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=8TCxE0bWQeQ" title="Baddest Gates">BBG Baddest Moments</a>
           <span class="link-details">It's tough to be a gangster.</span>
        </div>
        
        <!-- Page stats -->
        <div class="page-stats-container">
            <li class="item" id="pageviews">1,333,443</li>
            <li class="item" id="last-viewed">01-22-2016</li>
            <li class="item" id="views-per-hour">1,532</li>
            <li class="item" id="john-views-per-hour">5,233.42</li>
        </div>
        
    </body>
</html>
"""

#### Selecting the first element in a series of elements

In [11]:
# Parentheses group all matches first, so [1] picks the first span in the whole document
Selector(text=HTML).xpath('(//span)[1]').extract()

['<span class="link-details">This is a great video about gas.</span>']

#### Selecting the last element in a series of elements

In [12]:
# Without the parentheses, //span[last()] would return the last span inside each parent div
Selector(text=HTML).xpath('(//span)[last()]').extract()

['<span class="link-details">It\'s tough to be a gangster.</span>']
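
A position predicate such as `[last()]` is evaluated among the siblings under each parent, not across the whole document, which is why a bare `//span[last()]` can return one span per `<div>`. The sketch below illustrates this with the standard-library `xml.etree.ElementTree` module; the markup is a made-up, cut-down stand-in for the search results above.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<body>
    <div><span>a1</span><span>a2</span></div>
    <div><span>b1</span><span>b2</span></div>
</body>
""".strip())

# The predicate applies per parent: one "last" span for each <div>
per_parent_last = [span.text for span in root.findall(".//span[last()]")]

# To get the last span overall, collect every match and index the list
overall_last = root.findall(".//span")[-1].text

print(per_parent_last)
print(overall_last)
```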

#### Selecting all elements matching a selection

In [13]:
Selector(text=HTML).xpath('//span').extract()

['<span class="link-details">This is a great video about gas.</span>',
 '<span class="link-details">My first synth ever.</span>',
 '<span class="link-details">Midi files at the speed of light.</span>',
 '<span class="link-details">It\'s tough to be a gangster.</span>']

#### Selecting elements matching an _attribute_

This will be one of the most common ways you will select items.  HTML DOM elements are most easily told apart by their "class" and "id" attributes.  Web developers mainly use these attributes to refer to specific elements, or to a broad set of elements, when applying visual characteristics using CSS.

```
//element[@attribute="value"]
```

**Generally**

- "class" attributes within elements usually refer to multiple items
- "id" attributes are supposed to be unique, but in practice they are not always

_CSS stands for cascading style sheets.  These are used to abstract the definition of visual elements on a micro and macro scale for the web.  They are also our best friend as data miners.  They give us strong hints and cues as to how a web document is structured._

In [14]:
Selector(text=HTML).xpath('//div[@class="page-stats-container"]/li[@id="pageviews"][@class = "item"]').extract()

['<li class="item" id="pageviews">1,333,443</li>']
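
The difference between matching on "class" and matching on "id" can be sketched with the standard-library `xml.etree.ElementTree` module, using a shortened, illustrative copy of the page-stats markup. Note that `[@class='item']` compares the whole attribute string, so it would not match an element whose class attribute lists several class names.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<div class="page-stats-container">
    <li class="item" id="pageviews">1,333,443</li>
    <li class="item" id="last-viewed">01-22-2016</li>
</div>""")

# A shared class value matches every element that carries it
by_class = [li.text for li in root.findall(".//li[@class='item']")]

# An id value is expected to match a single element
by_id = [li.text for li in root.findall(".//li[@id='pageviews']")]

print(by_class)
print(by_id)
```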

## Let's Code:

 - How can we get a list of only the text items in the page statistics section of our page?
 - How can we get only the number of times per hour that Kiefer views my YouTube videos page (the `john-views-per-hour` item)?

In [15]:
# Get all text items for the page statistics section
Selector(text=HTML).xpath('//div[@class="page-stats-container"]/li/text()').extract()

['1,333,443', '01-22-2016', '1,532', '5,233.42']

In [16]:
# Get only the text for "Kiefer's" number of views per hour, selected by position:
# Selector(text=HTML).xpath('//div[@class="page-stats-container"]/li[4]/text()').extract()

# The same value, selected by its unique id attribute
Selector(text=HTML).xpath('//li[@id="john-views-per-hour"]/text()').extract()

['5,233.42']

## A Quick Note:  Requests

The `requests` module is a popular gateway to interacting with the web using Python.  With it, we can:

 - Fetch web documents as strings
 - Decode JSON
 - Do basic data munging with web documents
 - Download static files that are not text
  - Images
  - Videos
  - Binary data


Take some time and read up on Requests:

http://docs.python-requests.org/en/master/user/quickstart/