<a id='practical'></a>

## Using Requests + Beautiful Soup to extract information from a webpage.

---

Beautiful Soup is a Python library for pulling data out of HTML and XML files.  It works with several parsers, such as `lxml` and Python's built-in `html.parser`, and it can be used interactively (in an IDE or a notebook), which makes it much easier to work with when you are first extracting information from HTML.

Please make sure that the required packages are installed: 

```bash
# beautiful soup (the package is published as beautifulsoup4 and imported as bs4):
> conda install beautifulsoup4
> conda install lxml

# or if conda doesn't work
> pip install beautifulsoup4
> pip install lxml
```

Let's find another posting for a sweet set of wheels on Craigslist:

![](assets/craigslist.jpg)

> *Note: you will need to update this to a current/working craigslist post.*

https://washingtondc.craigslist.org/doc/cto/6178337288.html

<a id='step1'></a>
### Step 1: fetch the content by URL



```python
# The requests library fetches the page; Beautiful Soup parses it
import requests
from bs4 import BeautifulSoup

# Target web page
url = "https://washingtondc.craigslist.org/doc/cto/6178337288.html"
# Send an HTTP GET request to the web page
response = requests.get(url)

# Status codes tell you how the target server responded to your request.
# Ex. 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found
print('Status Code:', response.status_code)

# response.text is the body of the response, decoded to a Python string
html = response.text

# The first 500 characters of the content
print("\nFirst part of HTML document fetched as string:\n")
print(html[:500])
```

```
Status Code: 200

First part of HTML document fetched as string:

<!DOCTYPE html>
<html class="no-js">
<head>
<title>1983 Mercedes 380SL - cars &amp; trucks - by owner - vehicle automotive sale</title>
    	<link rel="canonical" href="http://washingtondc.craigslist.org/doc/cto/6178337288.html">
	<meta name="description" content="A true Mercedes Benz classic that will only go up in value... This beautiful Grey 1983 well cared for Mercedes has a 3.8L V8 engine, 4 speed Automatic Transmission, PS, PB. Replaced original grey...">
	<meta name="robots" content="noar
```


[More information on request status codes](http://www.restapitutorial.com/httpstatuscodes.html)
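To make the codes in the comment above a little less cryptic, here is a minimal sketch that turns a numeric status code into its standard reason phrase using `http.HTTPStatus` from the Python standard library. The helper name `describe_status` is made up for this illustration; it is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a string like '404 Not Found' for a numeric HTTP status code."""
    try:
        return "{} {}".format(code, HTTPStatus(code).phrase)
    except ValueError:
        # Servers may send codes that are not in the standard list
        return "{} (non-standard code)".format(code)


for code in (200, 400, 403, 404):
    print(describe_status(code))
```

With a live response you would call `describe_status(response.status_code)`. `requests` also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.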

<a id='step2'></a>
### Step 2: Parse HTML document with Beautiful Soup

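This step lets us navigate and search the elements of the document by tag name, attributes, or CSS selectors. (Beautiful Soup itself does not evaluate XPath expressions; for XPath, use `lxml` directly or Scrapy's selectors, covered below.)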

```python
soup = BeautifulSoup(html, 'lxml')
```

Soup queries work much like accessing attributes on a Python object.  

> **Note:** There are many ways to get the elements in a "soup" object.

Here are a few ways to select HTML elements as "objects" within the "soup" document.

```python
# Singular element
soup.html.title
```

```
<title>1983 Mercedes 380SL - cars &amp; trucks - by owner - vehicle automotive sale</title>
```

```python
# Just the text between the opening and closing tags
print(soup.html.title.text)
```

```
1983 Mercedes 380SL - cars & trucks - by owner - vehicle automotive sale
```

```python
# Find every matching element; the result behaves like a list
# First parameter: the tag name. Second parameter: a dict of attributes to match.
# (findAll is the older camelCase spelling of find_all.)
element = soup.findAll("a", {"class": "header-logo"})
element[0].text
```

```
'CL'
```

```python
price_search = soup.findAll('span', {"class": "price"})
price_search[0].text
```

```
'$18500'
```
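Since the live posting will eventually expire, it can help to try the same queries on a small inline document. The sketch below uses made-up markup (it is not real Craigslist HTML) and the built-in `html.parser`, so it runs without `lxml` or a network connection. It shows the dictionary form used above next to two equivalent spellings: the `class_` keyword of `find_all()` and a CSS selector via `select_one()`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a listing page (not real Craigslist HTML)
SAMPLE_HTML = """
<html>
  <head><title>1983 Mercedes 380SL</title></head>
  <body>
    <a class="header-logo" href="/">CL</a>
    <span class="price">$18500</span>
    <span class="price">$950</span>
  </body>
</html>
"""

# 'html.parser' ships with Python, so no extra parser is needed here
sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute-style navigation to a single element
title_text = sample_soup.title.text

# Three ways to ask for <span class="price">
prices_by_dict = [tag.text for tag in sample_soup.find_all("span", {"class": "price"})]
prices_by_kwarg = [tag.text for tag in sample_soup.find_all("span", class_="price")]
first_price_css = sample_soup.select_one("span.price").text

print(title_text)
print(prices_by_dict)
print(first_price_css)
```

Which spelling to use is mostly a matter of taste; the CSS form tends to be shorter once selectors get nested.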

```python
# switching back to SF apartment listings
response = requests.get("http://sfbay.craigslist.org/search/sfc/apa")
```

```python
soup = BeautifulSoup(response.text, "lxml")
search_titles = soup.findAll("a", {"class": "hdrlnk"})
```

```python
# .attrs is a dictionary of the tag's HTML attributes
for link in search_titles[0:5]:
    print(link.attrs)
```

```
{'data-id': '6178462035', 'href': '/sfc/apa/6178462035.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6178462043', 'href': '/sfc/apa/6178462043.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6178461647', 'href': '/sfc/apa/6178461647.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6178461105', 'href': '/sfc/apa/6178461105.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6178460712', 'href': '/sfc/apa/6178460712.html', 'class': ['result-title', 'hdrlnk']}
```
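The `href` values in that output are relative to the site root, so they cannot be fetched as they stand. As a small sketch, `urljoin` from the standard library can resolve them against the search page URL. The two dictionaries below are copied from the output above; the variable names are only for illustration.

```python
from urllib.parse import urljoin

# The page the links were scraped from
search_url = "http://sfbay.craigslist.org/search/sfc/apa"

# Two of the attribute dictionaries printed above
link_attrs = [
    {'data-id': '6178462035', 'href': '/sfc/apa/6178462035.html', 'class': ['result-title', 'hdrlnk']},
    {'data-id': '6178462043', 'href': '/sfc/apa/6178462043.html', 'class': ['result-title', 'hdrlnk']},
]

# urljoin resolves each relative href against the page URL
full_urls = [urljoin(search_url, attrs['href']) for attrs in link_attrs]

for full_url in full_urls:
    print(full_url)
```

On the real tags, `link['href']` or `link.get('href')` reads the same value as `link.attrs['href']`.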


> **Check:** How do we know which parameters `findAll()` takes?

<a id='practice'></a>

### Practice: can you select the price of our junker?  

 - Use XPath Helper to get an idea of where the element is within the HTML document.
 - Try to select using the `soup.html.body.something.something` approach.
 - Try using `findAll()` to find one specific element.

<a id='scrapy'></a>
<a scrapy-spiders></a>
## What is [Scrapy](http://scrapy.org/)?

---

> *"Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them."*

Below we will walk through the creation of a **spider** using Scrapy. Spiders are automated processes that crawl through one or more webpages and collect information.

> **Note:** This code should be written in a script outside of jupyter notebook.

<a id='scrapy-project'></a>
### 1. Create a new Scrapy project

In your terminal, `cd` into the directory where you want to create your crawler's folder.  I recommend the desktop, for easy access to the files inside that we will need to edit.
> `scrapy startproject craigslist`

**This should create output that looks like this:**
<blockquote>
```
2016-01-13 00:12:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2016-01-13 00:12:45 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-01-13 00:12:45 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'craigslist' created in:
    /Users/davidyerrington/virtualenvs/data/scraping/craigslist

You can start your first spider with:
    cd craigslist
    scrapy genspider example example.com
```
</blockquote>

**That command generates a set of project files:**
<blockquote>
```
craigslist/
    scrapy.cfg
    craigslist/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
```
</blockquote>

Generally, these are our files.  We will go into more detail on these soon.

 * **`scrapy.cfg`:** the project configuration file
 * **`craigslist/`:** the project’s Python module; you’ll later import your code from here.
 * **`craigslist/items.py`:** the project’s items file.
 * **`craigslist/pipelines.py`:** the project’s pipelines file.
 * **`craigslist/settings.py`:** the project’s settings file.
 * **`craigslist/spiders/`:** a directory where you’ll later put your spiders.
 
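Before continuing, please add this line to your `craigslist/settings.py` file. It disables Scrapy's S3 download handler, which we don't need here and which can produce boto-related errors in older Scrapy versions: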
 
 <blockquote>
 ```
 DOWNLOAD_HANDLERS = {'s3': None,}
 ```
 </blockquote>



--- 
<a id='define-item'></a>
### 2. Define an "item"

When we define an item, we are telling our new application what it will be collecting.  In essence, an "item" is an entity with attributes (e.g. "title", "description", "price") that describe and relate to elements on the pages we will be scraping.  

In more precise terms, this is a model (for those who are familiar with ORM or relational database terms).  Don't worry if this is a foreign concept.  The main idea to understand is that a model has attributes that closely resemble / relate to elements on our target web page(s).  We define it in `craigslist/items.py`:

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class CraigslistItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    price = scrapy.Field()
```


---

<a id='spider-crawl'></a>
### 3. A spider that crawls

An item is a model that resembles data on a webpage.  A spider is something that crawls pages and uses our item model to get and hold items for us.

**Scrapy spiders are python classes.  Let's write our first file, called `craigslist_spider.py` and put it in our `/spiders` directory:**

```python
import scrapy

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sfc/apa"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
```

**Next, let's dive in and crawl from our `/craigslist/craigslist` directory:**

```
> scrapy crawl craigslist
```

**What just happened?**
 * Our application requested the URLs from the `start_urls` class attribute.
 * It ran `parse` over the response for each requested URL, which contains the HTML markup.
 * What else?
 
```python
    with open(filename, 'wb') as f:
        f.write(response.body)
```

It saved a file in our base project directory.  The file is named after the second-to-last segment of the URL.  In our case, it should create a file called "sfc".  This is taken directly from the Scrapy docs and its only point is to illustrate the workflow so far.  It is kind of nice to have a reference to our HTML file though.  
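To see where that name comes from, here is the same `split` expression run on our start URL outside of Scrapy. Only the variable name `segments` is added for illustration.

```python
url = "http://sfbay.craigslist.org/search/sfc/apa"

# Splitting on "/" gives one entry per piece of the URL
segments = url.split("/")
print(segments)

# [-2] picks the second-to-last piece, which the spider uses as the filename
filename = segments[-2]
print(filename)
```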

There might be some errors listed when we crawl, but they are fine for now.

--- 
<a id='xpath-spider'></a>
### 4. XPath + parsing with our spider

So far, we've defined what fields we'll get, some urls to fetch, and saved some content to a file.  Let's actually do something interesting.

**We should let our spider know about the item model we made earlier.  At the top of `craigslist/craigslist/spiders/craigslist_spider.py`, let's add a new import:**

```python
from craigslist.items import CraigslistItem
```

> **Check:** Why won't it work otherwise?

<br><br><br>
**Let's replace our parse method, to find some data from our Craigslist spider response, and map it to our item model, CraigslistItem:**


```python
# Selector needs its own import at the top of the file:
#     from scrapy.selector import Selector

def parse(self, response):
    items = []  # list for storing the scraped items
    hxs = Selector(response)  # Selector lets us query the HTML of the response (the target page)
    # Each search result is an <li class="result-row">; we take the <p> inside it
    for sel in hxs.xpath("//li[@class='result-row']/p"):
        item = CraigslistItem()
        item['title'] = sel.xpath("a/text()").extract()  # title text from the 'a' element
        item['link'] = sel.xpath("a/@href").extract()  # href/url from the 'a' element
        # price from the result-price class nested in a few span elements;
        # extract_first() returns None instead of raising when a listing has no price
        item['price'] = sel.xpath('span/span[@class="result-price"]/text()').extract_first()
        items.append(item)
    return items  # Scrapy collects the returned items (they show up in the crawl log)
```



---

<a id='save-examine'></a>
### Save and examine our scraped data

Scrapy can export our crawled data as CSV out of the box.  To save our data, we just need to pass an output option to our crawl call:

<blockquote>
```
> scrapy crawl craigslist -o items.csv
```
</blockquote>

It's always good to iteratively check our data when developing a spider to make sure it's close to what we want. 

> *Pro tip:  The longer your iterations are between checks, the harder it's going to be to understand what's not working and fix bugs.*

You should now have a file called '`items.csv`' in the directory you ran the `scrapy crawl` command from.
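One quick way to check the export is to read it back with the standard library's `csv` module. The sketch below assumes the header row uses the item field names (`title`, `link`, `price`); the helper `load_items` and the two sample rows are made up so the example runs without a crawl. Point `load_items` at your own `items.csv` to inspect real output.

```python
import csv
import os
import tempfile


def load_items(path):
    """Read a CSV export into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Hypothetical rows standing in for real crawl output
sample_csv = (
    "title,link,price\n"
    "Sunny studio,/sfc/apa/1.html,$2500\n"
    "Cozy 1BR,/sfc/apa/2.html,$3100\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "items.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as f:
        f.write(sample_csv)
    rows = load_items(sample_path)

# Prices arrive as text such as "$2500"; strip the sign before converting
prices = [int(row["price"].lstrip("$")) for row in rows]
print(len(rows), "rows; prices:", prices)
```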

<a id='addendum'></a>
## Addendum: leveraging XPath to get more results

---

Generally, a workflow that is useful in this context is to load the page in your Chrome browser, check out the page using the XPath Helper plugin, and from that derive your own XPath expressions based on the output.

`text()` selects only the text of a given element (between the tags), and `@attribute_name` is used to select attributes.

**Here is an example of `text()`**:
<blockquote>
```
<h1>Darwin - The Evolution Of An Exhibition</h1>
```
</blockquote>

The XPath selector for this:

<blockquote>
```
//h1/text()
```
</blockquote>

**Here is an example of selecting by attribute**:

The description is contained inside a `<div>` tag with `id="description"`:
<blockquote>
```
<h2>Description:</h2>

<div id="description">
Short documentary made for Plymouth City Museum and Art Gallery regarding the setup of an exhibit about Charles Darwin in conjunction with the 200th anniversary of his birth.
</div>
...
```
</blockquote>

The XPath selector for this:
<blockquote>
```
//div[@id='description']
```
</blockquote>
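If you want to try these two expressions without a browser plugin or a crawl, the sketch below uses `xml.etree.ElementTree` from the standard library. This is a stand-in, not what the spider uses: ElementTree supports only a subset of XPath, needs well-formed markup, and has no `text()` or `@attribute` steps, so the `.text` attribute and the `.get()` method play those roles. The description text is shortened for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed snippet based on the examples above (description shortened)
SNIPPET = """
<body>
  <h1>Darwin - The Evolution Of An Exhibition</h1>
  <h2>Description:</h2>
  <div id="description">Short documentary about a Charles Darwin exhibit.</div>
</body>
"""

root = ET.fromstring(SNIPPET.strip())

# Rough equivalent of //h1/text()
heading_text = root.find(".//h1").text

# Rough equivalent of //div[@id='description']
description = root.find(".//div[@id='description']")
description_text = description.text.strip()
description_id = description.get("id")

print(heading_text)
print(description_id, "->", description_text)
```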

---
<a id='follow-links'></a>
### Following links for more results

100 results is pretty cool, but what if we want more?  We need to follow the "next" links and find new pages to grab.  In the **`parse()`** method of our spider class, we only need to yield another type of object: a `scrapy.Request` for the next page.

```python
def parse(self, response):
    hxs = Selector(response)
    titles = hxs.xpath("//li[@class='result-row']/p")

    for sel in titles:
        item = CraigslistItem()
        item['title'] = sel.xpath("a/text()").extract()
        item['link'] = sel.xpath("a/@href").extract()
        item['price'] = sel.xpath('span/span[@class="result-price"]/text()').extract_first()
        # yield (rather than return) each item so the method can keep going afterwards
        yield item

    # Does the next page exist?  Let's get it!
    next_page = response.xpath("(//a[@class='button next']/@href)[1]")

    if next_page:
        url = response.urljoin(next_page[0].extract())
        # Scrapy schedules this request and calls parse() again on its response
        yield scrapy.Request(url, self.parse)
```
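The control flow can be easier to follow without the framework around it. The sketch below is a plain-Python imitation, not Scrapy code: the `FAKE_PAGES` dictionary, the `example.com` URLs, and the `crawl` function are all made up. It yields every result on a page, then moves on to the "next" link until a page has none. In a real spider, Scrapy does the looping for you by calling `parse()` on each request you yield.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for a paginated site: URL -> (titles, relative "next" href)
FAKE_PAGES = {
    "http://example.com/search": (["listing A", "listing B"], "/search?s=100"),
    "http://example.com/search?s=100": (["listing C"], None),
}


def crawl(start_url, pages):
    """Yield every title, following 'next' links until a page has none."""
    url = start_url
    while url is not None:
        titles, next_href = pages[url]
        for title in titles:
            yield title
        # Resolve the relative link against the current page, as response.urljoin does
        url = urljoin(url, next_href) if next_href else None


all_titles = list(crawl("http://example.com/search", FAKE_PAGES))
print(all_titles)
```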
