<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px; width: 55px">

# Introduction to Web Scraping and Spiders with `scrapy`

_Authors: Dave Yerrington (SF), Sam Stack (DC)_

---

### Learning Objectives
- Understand the structure and content of HTML
- Learn about elements, attributes, and element hierarchy in HTML
- Learn about XPath and using multiple and singular selections
- Practice using Scrapy to get data from craigslist
- Practice using Beautiful Soup to parse data from craigslist
- Walk through the construction of a spider built using scrapy


### Lesson Guide
- [Introduction](#introduction)
- [HTML](#html)
    - [Elements](#elements)
    - [Attributes](#attributes)
    - [Element hierarchy](#element-hierarchy)
    - [More resources on HTML structure](#html-resources)
- [What is XPath?](#xpath)
    - [Multiple selections](#multiple-selections)
    - [Singular selections](#singlular-selections)
- [A simple `scrapy` example](#scrapy)
- [A practical example with Requests and Beautiful Soup](#practical)
    - [Step 1: fetch the content by URL](#step1)
    - [Step 2: Parse HTML document with Beautiful Soup](#step2)
    - [Practice: can you select the price of our junker?](#practice)
- [Scrapy and spiders](#scrapy)
    - [Create a Scrapy project](#scrapy-project)
    - [Define an "item"](#define-item)
    - [A spider that crawls](#spider-crawl)
    - [XPath and parsing with our spider](#xpath-spider)
    - [Save and examine our scraped data](#save-examine)
- [Addendum: leveraging XPath to get more results](#addendum)
    - [Following links](#follow-links)

<a id='introduction'></a>

![What is Html](http://designshack.designshack.netdna-cdn.com/wp-content/uploads/htmlbasics-0.jpg)

One of the largest sources of data in the world is all around us.  We consume the web in some form every day.  One of the most powerful Python toolsets we will learn allows us to extract and normalize data from unstructured sources like webpages.  

**If you can see it, it can be scraped, mined, and put into a dataframe.**

Before we begin the actual process of web scraping with Python, it is important to cover the basic constructs that describe HTML as unstructured data. 

Then we will cover a powerful selection technique called XPath, and look at a basic workflow using a framework called [Scrapy](http://www.scrapy.org).

<a id='html'></a>

## Hypertext markup language (HTML)

---

In the HTML DOM (Document Object Model), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * Text inside HTML elements are text nodes.
 * Comments are comment nodes.

<a id='elements'></a>
### Elements
Elements begin and end with opening and closing "tags", each written as a tag name enclosed in angle brackets. The name in the opening tag and the name in the closing tag must be the same.

```
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

Because a single page may contain several titles or paragraphs, you can give an element an `id` attribute so it can be referenced uniquely.  IDs are also very useful for labelling nested elements.
```
<title id='title_1'>I am the first title.</title>
<p id='para_1'>I am the first paragraph.</p>
<title id='title_2'>I am the second title.</title>
<p id='para_2'>I am the second paragraph.</p>
```


**Elements can have parents and children:**
An element can be both a parent and a child at the same time. Whether you call it a parent or a child depends on which other element you are relating it to.


```
<body id = 'parent'>
    <div id = 'child_1'>I am the child of 'parent'
        <div id = 'child_2'>I am the child of 'child_1'
            <div id = 'child_3'>I am the child of 'child_2'
                <div id = 'child_4'>I am the child of 'child_3'</div>
            </div>
        </div>
    </div>
</body>
```
**or**
```
<body id = 'parent'>
    <div id = 'child_1'>I am the parent of 'child_2'
        <div id = 'child_2'>I am the parent of 'child_3'
            <div id = 'child_3'> I am the parent of 'child_4'
                <div id = 'child_4'>I am not a parent </div>
            </div>
        </div>
    </div>
</body>
```

<a id='attributes'></a>
### Attributes

HTML elements can have attributes.  Attributes describe properties and characteristics of elements.  Some affect how the element behaves or looks when the browser renders it.

One of the most common elements is the "anchor" element.  Anchor elements usually have an "href" attribute, which tells the browser where to go when the link is clicked.  Browsers typically render anchors in a distinct color, often underlined, as a visual cue that sets them apart from the surrounding text.

**Markup that describes an element with attributes literally looks like this**

```
<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>
```

**However, this element, once rendered, looks like this**

[An Awesome Website](https://www.youtube.com/watch?v=dQw4w9WgXcQ)

<a id='element-hierarchy'></a>
### Element hierarchy

![Nodes](http://www.computerhope.com/jargon/d/dom1.jpg)

**Literally Represented:**

```
<html>
    
    <head>
        <title>Example</title>
    </head>
    
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
    
</html>
```
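
One way to see this hierarchy programmatically is to print each tag indented by its depth. The sketch below uses only Python's standard-library `html.parser`. The `OutlineParser` class name is made up for this illustration, and the depth counter assumes every element has a closing tag (void elements such as `<br>` would throw it off).

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


page = """
<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
</html>
"""

parser = OutlineParser()
parser.feed(page)

# Children are printed one indent level deeper than their parent.
for depth, tag in parser.outline:
    print("    " * depth + tag)
```

`head` and `body` print at the same indent because they are siblings, both children of `html`.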

<a id='html-resources'></a>
### You are now qualified HTML experts

![](assets/certified.jpg)

Your HTML learning can continue...

Read all about the different elements supported amongst modern browsers:
 * [HTML5 Cheatsheet](http://websitesetup.org/html5-cheat-sheet/)
 * [Mozilla HTML Element Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
 * [HTML5 Visual Cheatsheet](http://www.unitedleather.biz/PDF/HTML5-Visual-Cheat-Sheet1.pdf)
 
From here on out, we will be talking mostly about how to select these different elements with XPath:

- Single vs multiple elements
- Elements, conditionally matching attributes
- Element attributes
- Element text

<a id='xpath'></a>

## What is XPath?

---

<img src="assets/obama_wiki.png" width="700">

Understanding how to identify elements and attributes within HTML documents gives us the capability to write simple expressions that create structured data.  We can think of XPath as a query language for HTML.

To make this process easier, we will be using XPath Helper, which is a Chrome extension.  It's not necessary, but it is highly recommended for building XPath expressions.

[XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en)

XPath expressions can select elements, element attributes, and element text.  A selection can target either a single item or multiple items.  Generally, if you're not specific enough, you will end up selecting multiple elements.


<a id='multiple-selections'></a>
### Multiple selections

***Multiple selections*** are useful for capturing search results, or any repeating element.  For instance, the _titles_ on a page of apartment listing search results from Craigslist:


**URL**

[http://sfbay.craigslist.org/search/sfc/apa](http://sfbay.craigslist.org/search/sfc/apa)


**Example HTML Markup**
```
...
<span class="pl"> 
    <time datetime="2016-01-12 23:27" title="Tue 12 Jan 11:27:35 PM">Jan 12</time> 
    <a href="/sfc/apa/5400584579.html" data-id="5400584579" class="hdrlnk">Welcome home to a sweetly renovated four bedroom one and a half bath</a> 
</span>
...
```

**XPath - Multiple Titles** _copy this into the XPath Helper Query box_
```
//a[contains(@class, 'hdrlnk')]
```

**Returns (Ad Titles)**
```
***New Remodeled two bedroom Apartment***
WONDERFUL ONE BR APARTMENT HOME
Beautiful 1bed/1bath Apartment in Russian Hill NO SECURITY DEPOSIT
Knockout SF View|Green Oasis|Private Driveway|Furnished
3BR/3BA Spacious, Beautiful SOMA Loft: 5 month lease
Nob Hill Large Studio - Light, Quiet, Lovely Building
etc...
```

<a id='singlular-selections'></a>

### Singular selections

***Singular selections*** are necessary when you want to grab specific, unique text within elements.  Here's an example of a details page on Craigslist:

> **Note:** this example may have expired if you view it after Jan 12th, 2016. If so, please replace it with a current Craigslist listing!

**URL**

[https://sfbay.craigslist.org/sfc/apa/6161864063.html](https://sfbay.craigslist.org/sfc/apa/6161864063.html)

**HTML Markup**

```
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800" class="xh-highlight">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
    <p class="postinginfo"><a class="bestof-link" data-flag="9" href="https://post.craigslist.org/flag?flagCode=9&amp;postingID=5400585892" title="nominate for best-of-CL"><span class="bestof-icon">♥ </span><span class="bestof-text">best of</span></a> <sup>[<a href="http://www.craigslist.org/about/best-of-craigslist">?</a>]</sup>    </p>
</div>
```

**XPath - Single Item**

```
//p[@class='postinginfo'][2]/time
```
**Returns (Time of posting or age of Post)**
```
2016-01-12 11:23pm
```

<a id='scrapy'></a>

## A simple  example using `scrapy` and `XPath`.

---

Below is an example of how to get information out of some fake HTML using the XPath capabilities of the `scrapy` package. You will likely need to install the scrapy package using `conda install scrapy`.   
**Note:** `conda install` pulls in prebuilt versions of Scrapy's dependencies. `pip install scrapy` also installs the dependencies, but on some platforms it may need to compile a few of them, which can fail without the right build tools.

We will use the `Selector` class from the `Scrapy` library to help us construct our query.

A `Selector` takes the HTML target as an argument and can then run several flavors of query to extract information.  In our case we will call its `xpath()` method, since our queries are written in XPath. 

Just like with writing Python scripts, there are several ways you can access the exact same information in HTML.  Let's try a few out.

> <img src="https://secure.gravatar.com/avatar/26da7b36ff8bb5db4211400358dc7c4e.jpg?s=512&r=g&d=mm" style="float: left; width: 90px; padding: -15px 25px 25px 0">If you're on Docker with the `scipy-notebook` version of Jupyter, you will need to temporarily install the Scrapy library for this lesson today.  If not, no need to run the next cell.

In [3]:
!conda install scrapy --yes

Fetching package metadata ...........
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda:
#
scrapy                    1.4.0                    py36_0    conda-forge


In [2]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

# HTML structure string
HTML = """
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800" class="xh-highlight">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
    <p class="postinginfo"><a class="bestof-link" data-flag="9" href="https://post.craigslist.org/flag?flagCode=9&amp;postingID=5400585892" title="nominate for best-of-CL"><span class="bestof-icon">♥ </span><span class="bestof-text">best of</span></a> <sup>[<a href="http://www.craigslist.org/about/best-of-craigslist">?</a>]</sup>    </p>
</div>
"""

# Option 1: use the exact class name to get its associated text
best = Selector(text=HTML).xpath("//span[@class='bestof-text']/text()").extract()
best

['best of']

In [4]:
# Option 2: use the 'contains()' function to extract any text that includes 'best of'
best = Selector(text=HTML).xpath("//span[contains(text(), 'best of')]/text()").extract()
best

['best of']

In [5]:
# Option 3: first grab the entire <a> element where class='bestof-link'
best = Selector(text=HTML).xpath("/html/body/div/p/a[@class='bestof-link']")
# then query the grabbed chunk for the text of the element with class='bestof-text'
nested_best = best.xpath("./span[@class='bestof-text']/text()").extract()
nested_best

['best of']

_Option 3 will probably be the most common for you because there is a good chance that you will want to grab information from several child elements that exist within one parent element._

## Where's Waldo - "XPath Edition"

In this example, we will find Waldo together.  Find Waldo as:

- Element
- Attribute
- Text element

In [5]:
HTML = """
<html>
    <body>
        
        <ul id="waldo">
            <li class="waldo">
                <span> yo Im not here</span>
            </li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
            <li class="nerds">
                <div class="alpha">Bill gates</div>
                <div class="alpha">Zuckerberg</div>
                <div class="beta">Theil</div>
                <div class="animal">parker</div>
            </li>
        </ul>
        
        <ul id="tim">
            <li class="tdawg">
                <span>yo im here</span>
            </li>
        </ul>
        <li>stuff</li>
        <li>stuff2</li>
        
        <div id="cooldiv">
            <span class="dsi-rocks">
               YO!
            </span>
        </div>
        
        
        <waldo>Waldo</waldo>
    </body>
</html>
"""

**Tip:** We can use the asterisk special character '*' as a placeholder for 'all possible'.

```python
# all elements where class='alpha'
Selector(text=HTML).xpath('//*[@class="alpha"]').extract()

# returns:
# ['<div class="alpha">Bill gates</div>',
#  '<div class="alpha">Zuckerberg</div>']
```


In [6]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

#### Find element 'waldo'

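Notice this is an absolute reference, like a file path: a single leading slash "/" starts at the root of the document and each step names a direct child.  By contrast, a double slash "//" matches `[element name]` anywhere in the document, at any depth.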

In [7]:
# text contents of the element waldo
Selector(text=HTML).xpath('/html/body/waldo/text()').extract()

['Waldo']

**Find attribute(s) 'waldo'**

In [8]:
# All elements that have any attribute with the value waldo
Selector(text=HTML).xpath('//*[@*="waldo"]').extract()

['<ul id="waldo">\n            <li class="waldo">\n                <span> yo Im not here</span>\n            </li>\n            <li class="waldo">Height:  ???</li>\n            <li class="waldo">Weight:  ???</li>\n            <li class="waldo">Last Location:  ???</li>\n            <li class="nerds">\n                <div class="alpha">Bill gates</div>\n                <div class="alpha">Zuckerberg</div>\n                <div class="beta">Theil</div>\n                <div class="animal">parker</div>\n            </li>\n        </ul>',
 '<li class="waldo">\n                <span> yo Im not here</span>\n            </li>',
 '<li class="waldo">Height:  ???</li>',
 '<li class="waldo">Weight:  ???</li>',
 '<li class="waldo">Last Location:  ???</li>']

In [9]:
# Contents of all elements (*) with class attributes set to waldo
Selector(text=HTML).xpath('//*[@class="waldo"]').extract()

['<li class="waldo">\n                <span> yo Im not here</span>\n            </li>',
 '<li class="waldo">Height:  ???</li>',
 '<li class="waldo">Weight:  ???</li>',
 '<li class="waldo">Last Location:  ???</li>']

**Find text element Waldo**

In [10]:
# gets the whole element whose text is exactly 'Waldo'
Selector(text=HTML).xpath("//*[text()='Waldo']").extract()

['<waldo>Waldo</waldo>']

## Now let's practice with the class a little bit.

#### 1) Select only "beta" nerds, text element
There's a list item that has a class "nerds".  First try to select that list item, conditionally with the class "nerds".  Then, extend your query to also select the child elements with the "beta" class.

In [11]:
# A:

#### 2) Select only "alpha" nerds, text element
There's a list item that has a class "nerds".  First try to select that list item, conditionally with the class "nerds".  Then, extend your query to also select the child elements with the "alpha" class.

In [None]:
# A: 

#### 3) _Absolutely_ select only the span text when ANY element has the class "tdawg"
Remember, absolute references start from the beginning of your DOM reference.  What is the first element in your document?  This will be the first element that you select beginning with a single slash "/".

_Bonus points: which popular series (without googling) is the character "T-Dawg" from?_

In [None]:
# A:

#### 4) Select all the child elements of nerd, but then only their "class" attributes

Relative or absolute reference.  The XPath query should return: alpha, alpha, beta, animal.

In [None]:
# A:

#### 5) Any text element that _contains_ the text "yo"
We haven't talked about this much, but you can also do substring matches using the `contains(source, match)` function within `[]` brackets.  The source can be an element, an attribute, or a text element.  The match parameter is a string.

_Here's an example that shows "li" tags that match the @class "waldo"._

In [16]:
Selector(text=HTML).xpath("//li[contains(@class, 'waldo')]").extract()

['<li class="waldo">\n                <span> yo Im not here</span>\n            </li>',
 '<li class="waldo">Height:  ???</li>',
 '<li class="waldo">Weight:  ???</li>',
 '<li class="waldo">Last Location:  ???</li>']

_Try this same query with all (*), instead of "li", and the selector for text elements instead of @class_. 

>This one conditional selection method is very flexible and you will find it handy in the future.

In [None]:
# A:

# The Nested Content Problem
(Double click this tab to see the source example if you're in Jupyter lab.)

Imagine if you will, you have a piece of content like this:

```html
<div id="item_12345">
    Here is a great new website I found called <strong>thistothat.com</strong> which has practically been around forever.  Back in <em>1997</em>, I told all my friends about this and because of this knowledge transfer, I became <strong>the most popular man in <em>Alaska</em></strong>
<div>
```

In [12]:
html = """
<div id="item_12345">
    Here is a great new website I found called <strong>thistothat.com</strong> which has practically been around forever.  Back in <em>1997</em>, I told all my friends about this and because of this knowledge transfer, I became <strong>the most popular man in <em>Alaska</em></strong>
<div>
"""

In [13]:
Selector(text=html).xpath("//div[@id='item_12345']/text()").extract()

['\n    Here is a great new website I found called ',
 ' which has practically been around forever.  Back in ',
 ', I told all my friends about this and because of this knowledge transfer, I became ',
 '\n']

## What's returned relative to what we expected?

Seems like we have a structure that kind of looks like this:

```
<div> 
    [1. a text element] 
    <strong> [2. a text element] </strong> 
    [3. another text element] 
    <em> [4. text element] </em>
    [5. another text element -- starts with "I told all my friends..."]
    <strong> 
        [6. text element - "the most popular man in"]
        <em> [7. text element - "Alaska"] </em>
    </strong>
</div>
```

If we ask **XPath** for the **text()** elements at the _1st_ level of our **//div** object(s), it's going to do _exactly_ that.  The first-level text elements are items _1, 3, and 5_ (plus the trailing newline before the end of the div); the text inside the nested `<strong>` and `<em>` elements is skipped.  XPath can be a little awkward here, so in these cases we can use **BeautifulSoup** to collapse the nested elements into a single text element that reads the way the rendered content does.  

**BeautifulSoup** has much more functionality for selecting HTML by CSS selector, generalized filtering methods, and even cleaning bad HTML to be formatted correctly.  We are simply going to use it to capture these nested text elements in order.

## First let's see how to get this nested collection of text elements with XPath


In [14]:

div_text        =  Selector(text=html).xpath("//div[@id='item_12345']/text()").extract()
em_text         =  Selector(text=html).xpath("//div[@id='item_12345']/em/text()").extract() 
strong_text     =  Selector(text=html).xpath("//div[@id='item_12345']/strong/text()").extract()
strong_em_text  =  Selector(text=html).xpath("//div[@id='item_12345']/strong/em/text()").extract()

extracted_text = [div_text[0], strong_text[0], div_text[1], em_text[0], div_text[2], strong_text[1], strong_em_text[0]]
extracted_text

['\n    Here is a great new website I found called ',
 'thistothat.com',
 ' which has practically been around forever.  Back in ',
 '1997',
 ', I told all my friends about this and because of this knowledge transfer, I became ',
 'the most popular man in ',
 'Alaska']

Now that we've captured the text in the right format in the right order, in a list, we can `join` all elements as a single text string.

In [15]:
" ".join(extracted_text)

'\n    Here is a great new website I found called  thistothat.com  which has practically been around forever.  Back in  1997 , I told all my friends about this and because of this knowledge transfer, I became  the most popular man in  Alaska'

## That's a lot of work so let's try this with BeautifulSoup

To use BeautifulSoup, we import it and initialize a `soup` object with the `BeautifulSoup` class, passing the HTML content as a string in the first parameter.  The second parameter is the parser you wish to use.  The parser is an underlying library that scans the HTML document and provides the low-level hooks **BeautifulSoup** relies on to read the document's elements.  Parsers differ in speed and in how they handle malformed HTML.  "lxml" is based on libxml2 and works great for parsing HTML documents.

Here's how we collapse nested text elements in 1 line of BeautifulSoup.

In [16]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
soup.text

'\n    Here is a great new website I found called thistothat.com which has practically been around forever.  Back in 1997, I told all my friends about this and because of this knowledge transfer, I became the most popular man in Alaska\n\n'
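
The collapsed text usually still carries leading and trailing whitespace. A small sketch of two cleanup options with `get_text()`, on a made-up snippet; it uses the built-in `"html.parser"` backend only so that the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

snippet = "<div>  Back in <em>1997</em>, I became <strong>popular in <em>Alaska</em></strong>  </div>"
soup_demo = BeautifulSoup(snippet, "html.parser")

# Collapse all nested text, then trim the outer whitespace.
trimmed = soup_demo.get_text().strip()
print(trimmed)  # Back in 1997, I became popular in Alaska

# Strip each text node and join the pieces with a separator instead.
pieces = soup_demo.get_text("|", strip=True)
print(pieces)  # Back in|1997|, I became|popular in|Alaska
```

The separator form is handy when you want to see where one text node ends and the next begins.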

### Here are a few handy operations with BeautifulSoup


#### Reference elements as attributes of the soup object

In [17]:
soup.div

<div id="item_12345">
    Here is a great new website I found called <strong>thistothat.com</strong> which has practically been around forever.  Back in <em>1997</em>, I told all my friends about this and because of this knowledge transfer, I became <strong>the most popular man in <em>Alaska</em></strong>
<div>
</div></div>

#### Reference element attributes by key

In [18]:
soup.div['id']

'item_12345'

#### Reference nested elements

In [19]:
soup.div.strong

<strong>thistothat.com</strong>

**Reference items with powerful selectors**

Like **.next\_element** and **.previous\_element**

In [20]:
# Two hops: strong -> its own text node -> the text node that follows the strong element
soup.div.strong.next_element.next_element

' which has practically been around forever.  Back in '

#### Select relative parents (which element owns the one in question?)

In [22]:
for element in soup.div.strong.find_parent():
    print(element)


    Here is a great new website I found called 
<strong>thistothat.com</strong>
 which has practically been around forever.  Back in 
<em>1997</em>
, I told all my friends about this and because of this knowledge transfer, I became 
<strong>the most popular man in <em>Alaska</em></strong>


<div>
</div>


#### Search for elements by name

In [23]:
soup.findAll("strong")

[<strong>thistothat.com</strong>,
 <strong>the most popular man in <em>Alaska</em></strong>]
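
Searches can also be narrowed by attribute or written as CSS selectors. The sketch below runs on a made-up snippet and uses `find_all()`, the current spelling of the older `findAll()` alias, plus `select()`.

```python
from bs4 import BeautifulSoup

snippet = '<div id="item"><strong>first</strong> and <strong class="hot">second <em>nested</em></strong></div>'
soup_demo = BeautifulSoup(snippet, "html.parser")

# Every <strong> element, collapsed to its text.
all_strong = [tag.get_text() for tag in soup_demo.find_all("strong")]
print(all_strong)  # ['first', 'second nested']

# Keyword filters narrow the match; class_ has a trailing underscore
# because "class" is a reserved word in Python.
hot = soup_demo.find_all("strong", class_="hot")[0].get_text()
print(hot)  # second nested

# select() takes a CSS selector: <em> elements nested inside a <strong>.
nested = [tag.get_text() for tag in soup_demo.select("strong em")]
print(nested)  # ['nested']
```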

## Read more about [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) in the [docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
BeautifulSoup is a wonderful complement to XPath because it makes it easy to collapse text and even strip HTML out of chunks of text that you've selected with XPath, eliminating a lot of the work involved in iterating and slicing through nested elements.  In a lot of ways, though, it is still not as powerful as XPath when it comes to query flexibility and performance.

_XPath can sometimes be thought of as a relic from the past, but it's extremely well suited for any size job, providing enough basic features to be useful and learned within a day or two.  You can certainly learn all of its features, like date transformations and formatters, but skipping them is not going to make you any less productive._
<br><br>

><img src="https://snag.gy/3FGbPo.jpg" width="450">
>[Python HTML Parser Performance](http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/)
>This is an old article, but BeautifulSoup does tend to be a bit slower.  If you have a larger job, consider measuring the resource utilization of your scrape / crawl methods before you decide to go fully BeautifulSoup.

#### One last thought about alternative parsing methods
One up-and-coming library in Python is [pyquery](https://pypi.python.org/pypi/pyquery).  If you're a salty old JavaScript developer, you might feel at home being able to use jQuery-style selectors.  This is wildly powerful as it supports CSS pseudo-class selectors and other niceties not found in either BeautifulSoup or XPath.

## What to use and when
$$Friends = 2*\frac{XPath * BeautifulSoup}{XPath + BeautifulSoup}$$

Generally, learn XPath well: we recommend it as your default go-to since it's widely used in Scrapy, which will help you scrape multiple pages and run large-scale jobs.  XPath offers the most flexibility overall and is reasonable enough to learn.  Use BeautifulSoup for nested elements, for smaller projects, and generally when it's easier to work with objects, for example to pull out the little items buried within elements you extracted with XPath.  Also use BeautifulSoup when your HTML is malformed and nothing else seems to work with it.  Optionally learn CSS selector queries, because Scrapy selectors accept them too and they can be chained with XPath queries.

> **Read more about CSS selectors**
> * [W3Schools CSS Selectors](https://www.w3schools.com/cssref/css_selectors.asp)
> * [Tuts Plus: 30 Selectors You Must Memorize](https://code.tutsplus.com/tutorials/the-30-css-selectors-you-must-memorize--net-16048)

In [24]:
## Xpath with chained CSS selector
Selector(text=html).xpath("//div").css("strong em::text").extract()

['Alaska']

## Parsing Repeating Elements

The most common pattern you will likely encounter in your web scraping adventures will be search results.  Search result pages are best scraped by selecting the highest-level repeating element, then iterating through the matches and extracting their child elements.

Let's say we're talking about job postings on Indeed.com:

<img src="https://snag.gy/aOV15P.jpg" width="600">

Each item is a job posting.  Each job posting has a **Job Title** that links to a details page about the job, a **Company Name**, and a brief **Job Description**.  Sometimes, it doesn't have **reviews** but other times it does.

There are two obvious strategies, but really only one is going to work in this case.

1. **Selecting and extracting Title, Company, Ratings, and Job Description separately**
1. **Selecting the container element that contains each job posting and extracting the Title / Company / Ratings / Description within iteration.**

### Look at the reviews for a minute.  Can you think of tradeoffs or problems with either approach?
Think about how we might turn this into a DataFrame for a moment, given either case 1 or 2.

In [25]:
html = """
<div class="row  result">
    
    <a class="jobtitle">Data Scientist II</a>
    <span class="company">Big Fish Games</span>
    <span class="ratings">13 reviews</span>
    <span class="location">Oakland, CA 94612</span>
    <span class="summary">The Data Scientists are responsible for applying a wide... </span>
    
</div>

<div class="row  result">
    
    <a class="jobtitle">Machine Learning (Comprehend) Engineer</a>
    <span class="company">ipvive</span>

    <span class="location">Berkeley, CA 94704</span>
    <span class="summary">The ei-OS Machine Learning (Comprehend) Engineer will lead </span>
    
</div>

<div class="row  result">

    <a class="jobtitle">Senior Data Scientist</a>
    <span class="company">Big Fish Games</span>
    <span class="ratings">13 reviews</span>
    <span class="location">Oakland, CA 94612</span>
    <span class="summary">The Data Scientists are responsible for applying a wide... </span>
    
</div>

"""

### Case 1: selecting each element individually

Can you think of some ways to make this a dataframe?  Any problems?  Go ahead and inspect these items.

In [26]:
job_titles =  Selector(text=html).xpath("//a[@class='jobtitle']").extract()
companies  =  Selector(text=html).xpath("//span[@class='company']").extract()
ratings    =  Selector(text=html).xpath("//span[@class='ratings']").extract()
locations  =  Selector(text=html).xpath("//span[@class='location']").extract()
summary    =  Selector(text=html).xpath("//span[@class='summary']").extract()

### Case 2: Iterate through parent elements, extracting titles / ratings / etc.
Check this code out and try to think through how and why it works.  Feel free to print out anything and think through how this is better or worse than the first case.  Is this really better, or just more complicated?

In [27]:
postings = []

# the contains() XPath function will substring match any item you put in it
# we use this to find the class "row" within the class attribute that says "row result"
for selector in Selector(text=html).xpath("//div[contains(@class, 'row')]"):  # also notice we didn't extract()
    
    # each matched "row" div (one job posting) is referenced by `selector`
    postings.append({
        # the . in front of the XPath query makes the reference relative to the current row in iteration
        "title":    selector.xpath("./a[@class='jobtitle']/text()").extract_first(default="N/A"),
        "company":  selector.xpath("./span[@class='company']/text()").extract_first(default="N/A"),
        "rating":   selector.xpath("./span[@class='ratings']/text()").extract_first(default="N/A"),
        "location": selector.xpath("./span[@class='location']/text()").extract_first(default="N/A"),
        "summary":  selector.xpath("./span[@class='summary']/text()").extract_first(default="N/A")
    })
    
postings

[{'company': 'Big Fish Games',
  'location': 'Oakland, CA 94612',
  'rating': '13 reviews',
  'summary': 'The Data Scientists are responsible for applying a wide... ',
  'title': 'Data Scientist II'},
 {'company': 'ipvive',
  'location': 'Berkeley, CA 94704',
  'rating': 'N/A',
  'summary': 'The ei-OS Machine Learning (Comprehend) Engineer will lead ',
  'title': 'Machine Learning (Comprehend) Engineer'},
 {'company': 'Big Fish Games',
  'location': 'Oakland, CA 94612',
  'rating': '13 reviews',
  'summary': 'The Data Scientists are responsible for applying a wide... ',
  'title': 'Senior Data Scientist'}]

In [28]:
import pandas as pd
df = pd.DataFrame(postings)
df

| | company | location | rating | summary | title |
|---|---|---|---|---|---|
| 0 | Big Fish Games | Oakland, CA 94612 | 13 reviews | The Data Scientists are responsible for applyi... | Data Scientist II |
| 1 | ipvive | Berkeley, CA 94704 | N/A | The ei-OS Machine Learning (Comprehend) Engine... | Machine Learning (Comprehend) Engineer |
| 2 | Big Fish Games | Oakland, CA 94612 | 13 reviews | The Data Scientists are responsible for applyi... | Senior Data Scientist |
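
Scraped columns are strings, so a typical next step is turning text such as "13 reviews" into a number. A minimal sketch with pandas on a two-row stand-in frame; the `review_count` column name is an assumption made for this example.

```python
import pandas as pd

demo = pd.DataFrame([
    {"title": "Data Scientist II", "rating": "13 reviews"},
    {"title": "Machine Learning Engineer", "rating": "N/A"},
])

# Pull the first run of digits out of the rating text; rows without digits
# (such as the "N/A" placeholder) come back as missing values.
digits = demo["rating"].str.extract(r"(\d+)", expand=False)

# Convert the digit strings to numbers, keeping missing values as NaN.
demo["review_count"] = pd.to_numeric(digits, errors="coerce")

print(demo[["title", "review_count"]])
```

Keeping the missing rating as NaN rather than 0 preserves the difference between "no reviews listed" and "zero reviews".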


In [29]:
# If need be, demonstrate parent / child elements of selectors here

## Using Requests - Practice
The requests library is very powerful.  While it's not practical for large scale sites, you can do a lot in a pinch.  Let's try to scrape a basic site.


In [30]:
import requests 
response = requests.get("http://econpy.pythonanywhere.com/ex/001.html")
html = response.text

Can you extract the buyer-info elements into a DataFrame?  Try it out.

In [31]:
# A:

Can you develop a function or class that takes a url, parses the page elements, finds the next page, then parses the next page?  

Try to finish this code.  You can do it!  The only thing left to do is to update the code starting on line 29; if you follow the same pattern as we did in the 2nd case previously, you should be able to extract the name and the amount.

Parsing next page:  http://econpy.pythonanywhere.com/ex/002.html
Parsing next page:  http://econpy.pythonanywhere.com/ex/003.html
Parsing next page:  http://econpy.pythonanywhere.com/ex/004.html
Parsing next page:  http://econpy.pythonanywhere.com/ex/005.html


[<Selector xpath="//div[@title='buyer-info']" data='<div title="buyer-info">\n  <div title="b'>,
 <Selector xpath="//div[@title='buyer-info']" data='<div title="buyer-info">\n  <div title="b'>,
 <Selector xpath="//div[@title='buyer-info']" data='<div title="buyer-info">\n  <div title="b'>,
 <Selector xpath="//div[@title='buyer-info']" data='<div title="buyer-info">\n  <div title="b'>,
 <Selector xpath="//div[@title='buyer-info']" data='<div title="buyer-info">\n  <div title="b'>,
 <Selector xpath="//div[@title='buyer-info']" data='<div title="buyer-info">\n  <div title="b'>,
 <Selector xpath="//div[@title='buyer-info']" data='<div title="buyer-info">\n  <div title="b'>,
 <Selector xpath="//div[@title='buyer-info']" data='<div title="buyer-info">\n  <div title="b'>,
 <Selector xpath="//div[@title='buyer-info']" data='<div title="buyer-info">\n  <div title="b'>,
 <Selector xpath="//div[@title='buyer-info']" data='<div title="buyer-info">\n  <div title="b'>,
 <Selector xpath="//div[@title