<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practicing Web Scraping with XPath

_Authors: Dave Yerrington (SF)_

---

### Learning Objectives
*After this lesson, you will be able to:*
- Practice scraping basics
- Review HTML and XPath basics
- Practice scraping a website for various data and put this into a DataFrame

### Lesson Guide
- [Review of HTML and web scraping](#review1)
- [Review of XPath](#review2)
- [Basic XPath expressions](#basic-xpath)
    - [Absolute references](#absolute)
    - [Relative references](#relative-references)
    - [Selecting attributes](#attributes)
- [Guided practice: Where's Waldo - "XPath Edition"](#practice1)
- [1 vs. N selections](#1vsn)
    - [Selecting the first element in a series of elements](#first-elem)
    - [Selecting the last element in a series of elements](#last-elem)
    - [Selecting all elements matching a selection](#all-elem-match)
    - [Selecting elements matching an attribute](#elem-match-attr)
- [Guided practice: selecting elements](#practice2)
- [A quick note: the requests module](#requests)
- [Guided practice: scrape Data Tau headlines](#practice3)
- [Independent practice](#independent)

<a id='review1'></a>
## Review of HTML and web scraping

---

Web scraping is a technique for extracting information from websites. It involves downloading unstructured data from the web and transforming it into structured data that can be stored and analyzed.

There are a variety of ways to "scrape" what we want from the web:
- 3rd Party Services (import.io)
- Write our own Python apps that pull HTML documents and parse them
  - Mechanize
  - Scrapy
  - Requests
  - libxml / XPath
  - Regular expressions
  - BeautifulSoup

> **Check:** What do you perceive to be the hardest aspect of scraping?

_For example: if you were asked to scrape Craigslist property listings and put them in a DataFrame, what would hold you up?_

### Review: HTML

In the HTML DOM (Document Object Model), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * Text inside HTML elements is a text node.
 * Comments are comment nodes.
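
To make the node types concrete, here is a minimal sketch using Python's standard-library `xml.dom.minidom` (not the Scrapy tooling used later in this lesson). The one-line markup is a hypothetical example, and `minidom` only accepts well-formed markup, which real web pages often are not.

```python
from xml.dom import minidom, Node

# A tiny, well-formed document (hypothetical example markup)
doc = minidom.parseString(
    '<p id="intro">Hello<!-- a comment --><strong>world</strong></p>'
)
p = doc.documentElement

# Map the numeric nodeType constants to readable names
NODE_NAMES = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}

doc_type = NODE_NAMES[doc.nodeType]
attr_type = NODE_NAMES[p.getAttributeNode("id").nodeType]
child_types = [NODE_NAMES[child.nodeType] for child in p.childNodes]

print(doc_type)
print(attr_type)
print(child_types)
```

The children of `<p>` come back in document order: the text, then the comment, then the nested element.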

### Review: elements
Elements begin and end with **open and close "tags"**, which are tag names enclosed in angle brackets.

```html
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

_Note: the tags **title, p,** and **strong** are represented above._

### Review: element parent / child relationships

<img src="http://www.htmlgoodies.com/img/2007/06/flowChart2.gif" width="250">

**Elements begin and end with the same tag name like so:**  `<p></p>`

**Elements can have parents and children:** An element can be both a parent and a child.  Which term applies depends on the element you are describing it relative to.

_Your parents are **parents** to you but **children** of your grandparents.  Same logic applies with html elements._

```html
<body id = 'parent'>
    <div id = 'child_1'>I am the child of 'parent'
        <div id = 'child_2'>I am the child of 'child_1'
            <div id = 'child_3'>I am the child of 'child_2'
                <div id = 'child_4'>I am the child of 'child_3'</div>
            </div>
        </div>
    </div>
</body>
```

**or**

```html
<body id = 'parent'>
    <div id = 'child_1'>I am the parent of 'child_2'
        <div id = 'child_2'>I am the parent of 'child_3'
            <div id = 'child_3'> I am the parent of 'child_4'
                <div id = 'child_4'>I am not a parent </div>
            </div>
        </div>
    </div>
</body>
```
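
The nesting above can also be inspected programmatically. The following is a rough sketch using the standard-library `xml.etree.ElementTree` (a stand-in chosen because it needs no installation, not the parser this lesson uses); `FAMILY_HTML` is a hypothetical, text-free copy of the nested divs.

```python
import xml.etree.ElementTree as ET

# Hypothetical copy of the nested structure, without the text
FAMILY_HTML = """
<body id="parent">
    <div id="child_1">
        <div id="child_2">
            <div id="child_3">
                <div id="child_4"></div>
            </div>
        </div>
    </div>
</body>
"""

root = ET.fromstring(FAMILY_HTML.strip())

# Visit every element and record which element directly contains it
parent_of = {
    child.get("id"): parent.get("id")
    for parent in root.iter()
    for child in parent
}

print(parent_of)
```

Every id except `child_4` shows up as a value, meaning it is a parent of something; every id except `parent` shows up as a key, meaning it is a child of something.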

### Review: element attributes

Elements can also have attributes!  Attributes are defined inside **element tags** and can contain data that may be useful to scrape.

```html
<a href="http://lmgtfy.com/?q=html+element+attributes" title="A title" id="web-link" name="hal">A Simple Link</a>
```

The **element attributes** of this `<a>` tag element are:
- `id`
- `href`
- `title`
- `name`

This `<a>` tag example will render in your browser like this:
> <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">A Simple Link</a>


**Check:** Can you identify an attribute, an element, a text item, and a child element?

```HTML
<html>
   <title id="main-title">All this scraping is making me itch!</title>
   <body>
       <h1>Welcome to my Homepage</h1>
       <p id="welcome-paragraph" class="strong-paragraph">
           <span>Hello friends, let me tell you about this cool hair product..</span>
           <ul>
              <li>It's cool</li>
              <li>It's fresh</li>
              <li>It can tell the future</li>
              <li>Always be closing</li>
           </ul>
       </p>
   </body>
```

**Bonus: What's missing?** 

In [15]:
# </html>

<a id='review2'></a>
## Review of XPath

---

XPath uses path expressions to select nodes or node-sets in an HTML/XML document. These path expressions look very much like the expressions you see when you work with a traditional computer file system.

### XPath features

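Recent versions of XPath (2.0 and later) include over 100 built-in functions to help us select and manipulate HTML (or XML) documents; XPath 1.0, the version libxml implements, has a smaller core set. XPath has functions for: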

- string values
- numeric values
- date and time comparison
- sequence manipulation
- Boolean values
- and more!

<a id='basic-xpath'></a>
## Basic XPath expressions

---

XPath comes with a wide array of features, but the task you'll use it for most often is selecting data from HTML documents.  There are two ways you can **select elements** within HTML using **XPath**:

- Absolute reference
- Relative reference

<a id='absolute'></a>
### Absolute references

> _For our XPath demonstration, we will use Scrapy, which is using [libxml](http://xmlsoft.org) under the hood.  Libxml provides the basic functionality for XPath expressions._

In [16]:
# pip install scrapy
# pip install --upgrade zope2
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

HTML = """
<html>
    <body>
        <span id="only-span">good</span>
    </body>
</html>
"""
# Select the span's text with an "absolute" reference, spelling out every step from the root
Selector(text=HTML).xpath('/html/body/span/text()').extract()


[u'good']

<a id='relative-references'></a>
### Relative references

Relative references in XPath match the "ends" of structures: a path starting with `//` matches at any depth in the document, no matter what comes before it.  Since there is only a single "span" element, `//span/text()` matches **one element**.

In [17]:
Selector(text=HTML).xpath('//span/text()').extract()

[u'good']
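
As a rough illustration of how the two styles differ once a document has more than one match, the sketch below uses Python's standard-library `xml.etree.ElementTree` instead of Scrapy. This is a swapped-in tool: it supports only a subset of XPath (no `text()` and no leading `/`; paths are relative to the element you call them on) and needs well-formed markup. The nested second span is a hypothetical addition, not part of the lesson's HTML.

```python
import xml.etree.ElementTree as ET

# The lesson's snippet plus one hypothetical nested span
DOC = """
<html>
    <body>
        <span id="only-span">good</span>
        <div><span id="nested-span">deeper</span></div>
    </body>
</html>
"""
root = ET.fromstring(DOC.strip())  # root is the <html> element

# Step-by-step path from the root: html -> body -> span (direct children only)
absolute_style = [s.text for s in root.findall("./body/span")]

# ".//span" matches span elements at any depth below the root
relative_style = [s.text for s in root.findall(".//span")]

print(absolute_style)
print(relative_style)
```

The step-by-step path only finds the span sitting directly under `body`, while the any-depth path also picks up the nested one.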

<a id='attributes'></a>
### Selecting attributes

Attributes live **within a tag**, such as `id="only-span"` within our span element.  We can get the attribute by using the `@` symbol **after** the **element reference**.


In [18]:
Selector(text=HTML).xpath('//span/@id').extract()

[u'only-span']

<a id='practice1'></a>
## Guided practice: Where's Waldo - "XPath Edition"

---

**In this example, we will find Waldo together.  Find Waldo as an:**

- Element
- Attribute
- Text element

The practice HTML string is provided below.

In [19]:
HTML = """
<html>
    <body>
        
        <ul id="waldo">
            <li class="waldo">
                <span> yo Im not here</span>
            </li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
            <li class="nerds">
                <div class="alpha">Bill gates</div>
                <div class="alpha">Zuckerberg</div>
                <div class="beta">Theil</div>
                <div class="animal">parker</div>
            </li>
        </ul>
        
        <ul id="tim">
            <li class="tdawg">
                <span>yo im here</span>
            </li>
        </ul>
        <li>stuff</li>
        <li>stuff2</li>
        
        <div id="cooldiv">
            <span class="dsi-rocks">
               YO!
            </span>
        </div>
        
        
        <waldo>Waldo</waldo>
    </body>
</html>
"""

In [20]:
Selector(text=HTML).xpath('//@class').extract()

[u'waldo',
 u'waldo',
 u'waldo',
 u'waldo',
 u'nerds',
 u'alpha',
 u'alpha',
 u'beta',
 u'animal',
 u'tdawg',
 u'dsi-rocks']

In [21]:
# Find elements by absolute reference
print(Selector(text=HTML).xpath('/html/body/waldo/text()').extract())
print(Selector(text=HTML).xpath('/html/body/ul/li/text()').extract())

# Find elements by relative reference
print(Selector(text=HTML).xpath('//li').extract())

# Find element attributes
print(Selector(text=HTML).xpath('//@class').extract())
print(Selector(text=HTML).xpath('//ul/@id').extract())


[u'Waldo']
[u'\n                ', u'\n            ', u'Height:  ???', u'Weight:  ???', u'Last Location:  ???', u'\n                ', u'\n                ', u'\n                ', u'\n                ', u'\n            ', u'\n                ', u'\n            ']
[u'<li class="waldo">\n                <span> yo Im not here</span>\n            </li>', u'<li class="waldo">Height:  ???</li>', u'<li class="waldo">Weight:  ???</li>', u'<li class="waldo">Last Location:  ???</li>', u'<li class="nerds">\n                <div class="alpha">Bill gates</div>\n                <div class="alpha">Zuckerberg</div>\n                <div class="beta">Theil</div>\n                <div class="animal">parker</div>\n            </li>', u'<li class="tdawg">\n                <span>yo im here</span>\n            </li>', u'<li>stuff</li>', u'<li>stuff2</li>']
[u'waldo', u'waldo', u'waldo', u'waldo', u'nerds', u'alpha', u'alpha', u'beta', u'animal', u'tdawg', u'dsi-rocks']
[u'waldo', u'tim']


<a id='1vsn'></a>
## 1 vs N selections

---

When selecting elements via relative reference, it's possible that you will select multiple items.  It's still possible to select single items, if you're specific enough.

**Singular Reference**
- **Index** starts at **1**
- Selections by offset
- Selections by "first" or "last"
- Selections by **unique attribute value**
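
The bullets above can be sketched with the positional and attribute predicates that Python's standard-library `xml.etree.ElementTree` understands. This is a stand-in for illustration only (the lesson itself uses Scrapy), and `STATS` is a shortened, hypothetical copy of the page-stats markup used in this section.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical copy of the page-stats list
STATS = """
<div class="page-stats-container">
    <li class="item" id="pageviews">1,333,443</li>
    <li class="item" id="last-viewed">01-22-2016</li>
    <li class="item" id="views-per-hour">1,532</li>
</div>
"""
stats = ET.fromstring(STATS.strip())

first = stats.find("./li[1]").text        # positions start at 1, not 0
second = stats.find("./li[2]").text       # selection by offset
last = stats.find("./li[last()]").text    # last() picks the final sibling
by_id = stats.find("./li[@id='views-per-hour']").text  # unique attribute value

print(first, second, last, by_id)
```

Selecting by a unique attribute value is usually the more robust choice, since a positional index breaks as soon as the page adds or reorders items.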


In [22]:
HTML = """
<html>
    <body>
    
        <!-- Search Results -->
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=751hUX_q0Do" title="Rappin with Gas">Rapping with gas</a>
           <span class="link-details">This is a great video about gas.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=97byWqi-zsI" title="Casio Rapmap">The Rapmaster</a>
           <span class="link-details">My first synth ever.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=TSwqnR327fk" title="Cinco Products">Cinco Midi Organizer</a>
           <span class="link-details">Midi files at the speed of light.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=8TCxE0bWQeQ" title="Baddest Gates">BBG Baddest Moments</a>
           <span class="link-details">It's tough to be a gangster.</span>
        </div>
        
        <!-- Page stats -->
        <div class="page-stats-container">
            <li class="item" id="pageviews">1,333,443</li>
            <li class="item" id="somethingelse">bla</li>
            <li class="item" id="last-viewed">01-22-2016</li>
            <li class="item" id="views-per-hour">1,532</li>
            <li class="item" id="kiefer-views-per-hour">5,233.42</li>
        </div>
        
    </body>
</html>
"""

kiefer_views = Selector(text=HTML).xpath('/html/body/div/li[@id="kiefer-views-per-hour"]/text()').extract()
kiefer_views

[u'5,233.42']

<a id='first-elem'></a>
### Selecting the first element in a series of elements

In [23]:
spans = Selector(text=HTML).xpath('//span').extract()
spans[0]

u'<span class="link-details">This is a great video about gas.</span>'

<a id='last-elem'></a>
### Selecting the last element in a series of elements

In [24]:
spans = Selector(text=HTML).xpath('//span').extract()
spans[-1]

u'<span class="link-details">It\'s tough to be a gangster.</span>'

<a id='all-elem-match'></a>
### Selecting all elements matching a selection

In [25]:
Selector(text=HTML).xpath('//span').extract()

[u'<span class="link-details">This is a great video about gas.</span>',
 u'<span class="link-details">My first synth ever.</span>',
 u'<span class="link-details">Midi files at the speed of light.</span>',
 u'<span class="link-details">It\'s tough to be a gangster.</span>']

<a id='elem-match-attr'></a>
### Selecting elements matching an _attribute_

This will be one of the most common ways you will select items.  HTML DOM elements are usually differentiated by their "class" and "id" attributes.  Web developers mainly use these attributes to refer to specific elements, or to broad sets of elements, when applying visual characteristics using CSS.

```HTML 
//element[@attribute="value"]
```

**Generally**

- "class" attributes within elements usually refer to multiple items.
- "id" attributes are supposed to be unique, but in practice they are not always.

_CSS stands for cascading style sheets.  These are used to abstract the definition of visual elements on a micro and macro scale for the web.  They are also our best friend as data miners.  They give us strong hints and cues as to how a web document is structured._
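
The class-versus-id rule of thumb can be checked with a small sketch. It uses the standard-library `xml.etree.ElementTree` as a stand-in for Scrapy (so attribute values are read with `.get()` rather than `@attr`), and `RESULTS` is a hypothetical, trimmed-down page with made-up `href` values.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down page with made-up links
RESULTS = """
<body>
    <div class="search-result"><a href="/video/1" title="First">One</a></div>
    <div class="search-result"><a href="/video/2" title="Second">Two</a></div>
    <div class="page-stats-container">
        <li class="item" id="pageviews">1,333,443</li>
        <li class="item" id="views-per-hour">1,532</li>
    </div>
</body>
"""
body = ET.fromstring(RESULTS.strip())

# A class value is typically shared, so this predicate matches several elements
result_links = body.findall(".//div[@class='search-result']/a")
hrefs = [a.get("href") for a in result_links]

# An id value is meant to be unique, so this predicate matches one element
pageviews = [li.text for li in body.findall(".//li[@id='pageviews']")]

print(hrefs)
print(pageviews)
```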

<a id='practice2'></a>
## Guided practice: selecting elements

---

1. **How can we get a series of only text items for the page statistics section of our page?**
2. **How can we get only the number of times Kiefer views my YouTube videos page per hour?**

In [26]:
# Get all text elements for the page statistics section
Selector(text=HTML).xpath('//li/text()')

[<Selector xpath='//li/text()' data=u'1,333,443'>,
 <Selector xpath='//li/text()' data=u'bla'>,
 <Selector xpath='//li/text()' data=u'01-22-2016'>,
 <Selector xpath='//li/text()' data=u'1,532'>,
 <Selector xpath='//li/text()' data=u'5,233.42'>]

In [27]:
# Get only the text for "Kiefer's" number of views per hour, by position
# (it is the 5th li in the container)
# Selector(text=HTML).xpath('//div[@class="page-stats-container"]/li[5]/text()').extract()

# Get the same text by matching the unique id attribute
Selector(text=HTML).xpath('//li[@id="kiefer-views-per-hour"]/text()').extract()

[u'5,233.42']

<a id='requests'></a>
## A quick note:  the `requests` module

---

The `requests` module is the gateway to interacting with the web using Python.  With it, we can:

 - Fetch web documents as strings
 - Decode JSON
 - Do basic data munging with web documents
 - Download static files that are not text
   - Images
   - Videos
   - Binary data


Take some time and read up on Requests:

http://docs.python-requests.org/en/master/user/quickstart/

<a id='practice3'></a>
## Guided practice: scrape Data Tau headlines

DataTau is a great site for data science news. Let's grab their headlines using Python **`requests`**, and practice selecting various elements.

Using <a href="https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en">XPath helper Chrome plugin</a> _(cmd-shift-x)_ and the Chrome "inspect" feature, let's explore the structure of the page.

_Here's a <a href="https://www.youtube.com/watch?v=i2Li1vnv09U">concise video</a> that demonstrates the basic inspect feature within Chrome._

In [28]:
# Please only run this frame once to avoid hitting the site too hard all at once
import requests

response = requests.get("http://www.datatau.com")
HTML = response.text
HTML[0:150]           # view the first 150 characters of the HTML index document for DataTau

u'<html><head><link rel="stylesheet" type="text/css" href="news.css">\n<link rel="shortcut icon" href="http://www.iconj.com/ico/d/x/dxo02ap56v.ico">\n<scr'

### Selecting Only The Headlines

We will use the XPath helper tool to inspect the markup that comprises the **title** to find any pattern.  Since there is more than one **title**, we expect to find a series of elements representing the **title** data that we are interested in.

In this example, a single headline sits within the **1st center**, the **3rd table row (`tr[3]`)**, the 2nd **td having a class of "title" (`td[@class="title"][2]`)**, and the text of the anchor tag within it **(`a/text()`)**.  Dropping the positional parts and using a relative reference selects every headline at once.


In [30]:
import pandas as pd

titles = Selector(text=HTML).xpath('//td[@class="title"]/a/text()').extract()
titles[0:10] # the first 10 titles

[u'New York City: Data Science\u2019s Best Bet for Growth and Opportunity',
 u"You can't come up with one woman?",
 u'Using NLP to find familiar meals at new restaurants',
 u'Interactive Visualizations In Jupyter Notebook',
 u'Bayesian learning for statistical classification to improve your model',
 u'Self Driving Cab Simulator',
 u'TED Talks Dataset',
 u'When Are Citi Bikes Faster Than Taxis in New York City?',
 u'End-to-end machine learning on the GPU with GOAI',
 u'The Ten Fallacies of Data Science Work']

### How can we get the urls from the titles?

In [20]:
urls = Selector(text=HTML).xpath('//td[@class="title"]/a/@href').extract()
urls[::-1]  # reversed, so the trailing "More" link is listed first
# Example of the markup being matched:
# <a href="http://tech.marksblogg.com/faster-queries-google-cloud-dataproc.html">33x Faster Queries on Google Cloud's Dataproc using Facebook's Presto</a>

[u'/x?fnid=de7nLfGLCd',
 u'https://techcrunch.com/2017/06/16/object-detection-api/',
 u'https://arxiv.org/abs/1705.10883',
 u'http://blog.ethanrosenthal.com/2017/06/20/matrix-factorization-in-pytorch/',
 u'https://elitedatascience.com/bias-variance-tradeoff',
 u'https://techcrunch.com/2017/06/14/element-ai-a-platform-for-companies-to-build-ai-solutions-raises-102m/',
 u'https://github.com/indrajithi/mgc-django',
 u'https://arogozhnikov.github.io/3d_nn/',
 u'https://devblogs.nvidia.com/parallelforall/goai-open-gpu-accelerated-data-analytics/',
 u'https://data36.com/optimize-facebook-campaigns/',
 u'http://willwolf.io/2017/06/15/random-effects-neural-networks/',
 u'http://willwolf.io/2017/06/19/neurally-embedded-emojis/',
 u'https://medium.com/xtrememl/why-how-to-use-windows-10-wsl-built-in-linux-for-machine-learning-6a225f4bbd3a',
 u'https://www.mapd.com/blog/2017/06/18/release-feature-focus-being-smart-not-dense-with-immerse-density-gradients/',
 u'https://www.dataandsons.com/',
 u'htt

### How can we get the site domain, shown in parentheses after the title (e.g., stitchfix.com)?

In [21]:
domains = Selector(text=HTML).xpath("//span[@class='comhead']/text()").extract()

In [22]:
domains[0:5]

[u' (insightdatascience.com) ',
 u' (elitedatascience.com) ',
 u' (insightdatascience.com) ',
 u' (statsbot.co) ',
 u' (plot.ly) ']
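
The extracted domains still carry their surrounding spaces and parentheses. One possible cleanup is sketched below; `clean_domain` is a hypothetical helper name, and the sample values are copied from the output above.

```python
# Sample values copied from the output above
raw_domains = [
    ' (insightdatascience.com) ',
    ' (elitedatascience.com) ',
    ' (plot.ly) ',
]


def clean_domain(raw):
    """Strip surrounding whitespace, then the enclosing parentheses."""
    return raw.strip().strip("()")


clean_domains = [clean_domain(d) for d in raw_domains]
print(clean_domains)
```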

### How about the points?

In [23]:
points = Selector(text=HTML).xpath('//td[@class="subtext"]/span/text()').extract()
points[0:5]

[u'3 points', u'13 points', u'7 points', u'9 points', u'7 points']
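
For any numeric analysis, the point strings need to become integers. The sketch below shows one way to do that with a regular expression; `parse_points` is a hypothetical helper name, and returning `None` for text without a leading number is an assumption about how you might want to treat missing values.

```python
import re

# Sample values copied from the output above
raw_points = ['3 points', '13 points', '7 points', '9 points', '7 points']


def parse_points(raw):
    """Return the leading integer from text such as '13 points', or None."""
    match = re.match(r"\s*(\d+)", raw)
    return int(match.group(1)) if match else None


numeric_points = [parse_points(p) for p in raw_points]
print(numeric_points)
```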

### How about the "More" link?

> *Hint:  You can use `element[text()='exact text']` to find an element whose text matches specific text exactly.*

In [24]:
next_link = Selector(text=HTML).xpath('//a[text()="More"]/@href').extract()
next_link

[u'/x?fnid=de7nLfGLCd']
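
The extracted `href` is relative to the site root, so it has to be combined with the domain before it can be fetched. A minimal sketch using the standard library's `urljoin` follows (the import path shown is the Python 3 one); `BASE_URL` and `sample_next_link` are illustrative names holding the values seen above.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.datatau.com"
sample_next_link = ['/x?fnid=de7nLfGLCd']  # value copied from the output above

# Guard against pages that have no "More" link at all
next_url = urljoin(BASE_URL, sample_next_link[0]) if sample_next_link else None

# Links that are already absolute pass through unchanged
story_url = urljoin(BASE_URL, "https://arxiv.org/abs/1705.10883")

print(next_url)
print(story_url)
```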

<a id='independent'></a>
## Independent practice

---

**For the next 30 minutes try to grab the following from Data Tau:**

- Story titles
- Story URL (href)
- Domain
- Points

**Stretch goals:**
- Author
- Comment count

**Put your results into a DataFrame.**

- Do basic analysis of domains and point distributions

**BONUS:**

Automatically find the next "More" link and mine the next page(s) until none exist.  Logically, you can process each page with this pseudocode:

1. Check whether the next link exists (an `a` tag with text == "More").
2. Fetch the URL, prepended with the domain (datatau.com/(extracted link here)).
3. Parse the page with `Selector(text=HTML).xpath('').extract()` to find the elements.
4. Add the results to the DataFrame.

> _Note:  You might want to set a limit of something like 2-3 total requests per attempt to avoid unnecessary transfer._
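
One way to honor that request limit is to loop instead of recursing and to count the fetches. The sketch below is an offline illustration under stated assumptions: `crawl`, `find_more_href`, and the injected `fetch` callable are hypothetical names, `FAKE_SITE` stands in for the real site, and the standard-library `xml.etree.ElementTree` is used only because the fake pages are well-formed. Against real pages you would fetch with `requests` and find the link with Scrapy's `Selector`, as in the cells above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def find_more_href(page_source):
    """Return the href of the anchor whose text is exactly 'More', or None."""
    root = ET.fromstring(page_source)
    for anchor in root.iter("a"):
        if anchor.text == "More":
            return anchor.get("href")
    return None


def crawl(start_url, fetch, max_requests=3):
    """Follow "More" links from start_url, making at most max_requests fetches.

    `fetch` is a hypothetical callable that takes a URL and returns page source.
    """
    pages, url = [], start_url
    while url is not None and len(pages) < max_requests:
        source = fetch(url)
        pages.append(source)
        href = find_more_href(source)
        url = urljoin(url, href) if href else None
    return pages


# Stand-in "site": three tiny well-formed pages chained by "More" links
FAKE_SITE = {
    "http://example.com/": '<html><body><a href="/p2">More</a></body></html>',
    "http://example.com/p2": '<html><body><a href="/p3">More</a></body></html>',
    "http://example.com/p3": '<html><body><a href="/about">About</a></body></html>',
}

all_pages = crawl("http://example.com/", FAKE_SITE.get)
limited_pages = crawl("http://example.com/", FAKE_SITE.get, max_requests=2)
print(len(all_pages), len(limited_pages))
```

Passing the fetch function in as an argument keeps the stopping logic testable without touching the network.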


In [25]:
import requests, numpy as np

def parse_url(url="http://www.datatau.com", data=None):
    """Scrape one DataTau page, then follow the "More" link recursively."""
    response = requests.get(url)
    sel = Selector(text=response.text)  # parse the page once and reuse it

    links    = sel.xpath("//td[@class='title']/a/@href").extract()
    titles   = sel.xpath("//td[@class='title']/a/text()").extract()
    points   = sel.xpath("//td[@class='subtext']/span/text()").extract()
    domains  = sel.xpath("//td[@class='title']/span/text()").extract()
    authors  = sel.xpath("//td[@class='subtext']/a[contains(@href, 'user')]/text()").extract()
    comments = sel.xpath("//td[@class='subtext']/a[contains(@href, 'item')]/text()").extract()

    expected_length = 30

    def pad(values):
        # Trim to expected_length, then fill with NaN so every column has the
        # same length even when values are missing at the end of the results
        values = values[:expected_length]
        return values + [np.nan] * (expected_length - len(values))

    scraped = dict(
        titles   = pad(titles),  # trimming also drops the trailing "More" link
        links    = pad(links),
        points   = pad(points),
        domains  = pad(domains),
        authors  = pad(authors),
        comments = pad(comments)
    )

    df = pd.DataFrame(scraped)

    # On the first call there is nothing to combine yet; on later calls,
    # stack this page on top of the pages scraped so far
    if data is not None:
        data = pd.concat([df, data])
    else:
        data = df

    # Find the "More" link
    more_anchor = sel.xpath("//a[text() = 'More']/@href").extract()

    if len(more_anchor) > 0:
        more_url = "http://www.datatau.com%s" % more_anchor[0]
        print("Fetching %s..." % more_url)
        return parse_url(more_url, data=data)
    else:
        return data.reset_index()


df = parse_url("http://www.datatau.com")
df

Fetching http://www.datatau.com/x?fnid=ISDTQ2p1bz...
Fetching http://www.datatau.com/x?fnid=KrI45lZv0A...
Fetching http://www.datatau.com/x?fnid=YTa77kRVQb...
Fetching http://www.datatau.com/x?fnid=GGH6o5b58c...
Fetching http://www.datatau.com/x?fnid=up00dadwoR...
Fetching http://www.datatau.com/x?fnid=3WLrlmSA40...


| | index | authors | comments | domains | links | points | titles |
|---|---|---|---|---|---|---|---|
| 0 | 0 | meghido | discuss | (roialty.com) | http://roialty.com/why-are-clusters-important-... | 2 points | Clustering Social Media Audience |
| 1 | 1 | kokorcsin | discuss | (data36.com) | https://data36.com/sql-for-data-analysis-tutor... | 2 points | SQL for Data Analysis – Tutorial for Beginners... |
| 2 | 2 | lmcinnes | discuss | (github.io) | http://lmcinnes.github.io/subreddit_mapping/ | 12 points | Mapping and Analysing SubReddits Using Python |
| 3 | 3 | freebiesmall | discuss | (freebiesmall.com) | https://www.freebiesmall.com/blog/best-helveti... | 2 points | Top 5 Fonts That Can Take The Place of Helveti... |
| 4 | 4 | freebiesmall | 1 comment | (freebiesmall.com) | https://www.freebiesmall.com/blog/call-to-acti... | 5 points | Call to Action Button Examples Every UI/UX Des... |
| 5 | 5 | axelr | discuss | (github.io) | http://arogozhnikov.github.io/2017/04/20/machi... | 14 points | Machine Learning in Science and Industry [slides] |
| 6 | 6 | deeplearningt | discuss | (deeplearningtrack.com) | https://www.deeplearningtrack.com/single-post/... | 20 points | Learn under the hood of Gradient Descent algor... |
| 7 | 7 | Alphax | discuss | (tjpalanca.com) | http://www.tjpalanca.com/2017/03/facebook-news... | 6 points | Using topic modeling to find trends in Faceboo... |
| 8 | 8 | marklit | discuss | (marksblogg.com) | http://tech.marksblogg.com/billion-nyc-taxi-ri... | 8 points | 1.1 Billion Taxi Rides with MapD 3.0 & 2 GPU-P... |
| 9 | 9 | saloni_S | 1 comment | (byteacademy.co) | http://byteacademy.co/blog/beginner-deep-learn... | 2 points | A Beginner's Guide to Deep Learning |
