![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

# Web scraping in Python

## Scrapy

In this project, you will use Scrapy to scrape content from websites.  Because the Scrapy engine runs from the command line, you will write a spider class, save it to a file, and then run the engine against that file.

You can do this entirely within the notebook, as shown below.  You are welcome to develop the scripts in your favorite editor instead (including the Jupyter text editor), but copy the final content into the notebook once it works.

In [1]:
%%writefile test-spider.py
import scrapy
from bs4 import BeautifulSoup as BS

class PythonTitle(scrapy.Spider):
    name = 'Title of python.org'
    start_urls = ['https://www.python.org/']
    
    def parse(self, response):
        # Name the parser explicitly so BeautifulSoup does not have to guess one
        return {'title': BS(response.text, 'lxml').title.text}

Overwriting test-spider.py


In [2]:
%%bash
rm -f test.jl
scrapy runspider test-spider.py -o test.jl -L ERROR
cat test.jl

{"title": "Welcome to Python.org"}


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Part 1

**A fictional bookstore**

The URL http://books.toscrape.com/ contains a collection of pages that resemble an online bookstore.  Prices and ratings are randomly assigned by the site.  The book titles and authors appear to be real, although I have not verified all of them.

For this task, we wish to crawl the entire site and identify the UPC of every book in stock, along with the corresponding URL.  If you perform this scraping correctly, you will identify 1000 distinct items.  Write your results as CSV.

In [3]:
%%writefile upc-codes.py
import scrapy
from bs4 import BeautifulSoup

class UPCSpider(scrapy.Spider):
    name = 'UPC codes of all books'
    start_urls = ['http://books.toscrape.com/']
    
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')

        # First yield the UPC if this is a book page
        for tr in soup.find_all('tr'):
            # Skip rows that lack either a header cell or a data cell
            if tr.th and tr.td and tr.th.text.strip() == 'UPC':
                yield {'UPC': tr.td.text.strip(), 'URL': response.url}
                break

        # Look for links either way; ignore anchors that have no href
        for a in soup.find_all('a', href=True):
            href = a['href']
            if href.startswith('http') and 'toscrape.com' not in href:
                # Skip absolute links that leave the site
                continue
            yield response.follow(href, self.parse)

Overwriting upc-codes.py


In [4]:
%%bash
rm -f upc-codes.csv
scrapy runspider upc-codes.py -o upc-codes.csv -L WARNING

In [5]:
import pandas as pd
pd.read_csv('upc-codes.csv')

|     | UPC              | URL                                               |
|-----|------------------|---------------------------------------------------|
| 0   | a18a4f574854aced | http://books.toscrape.com/catalogue/libertaria... |
| 1   | feb7cc7701ecf901 | http://books.toscrape.com/catalogue/olio_984/i... |
| 2   | e30f54cea9b38190 | http://books.toscrape.com/catalogue/mesaerion-... |
| 3   | 3b1c02bac2a429e6 | http://books.toscrape.com/catalogue/scott-pilg... |
| 4   | ce6396b0f23f6ecc | http://books.toscrape.com/catalogue/set-me-fre... |
| ... | ...              | ...                                               |
| 995 | 5faefdc861684eb1 | http://books.toscrape.com/catalogue/the-sandma... |
| 996 | 3301af038a720587 | http://books.toscrape.com/catalogue/the-comple... |
| 997 | 3cdca3b4a93980f5 | http://books.toscrape.com/catalogue/rat-queens... |
| 998 | bcbcbcf0f6ed196f | http://books.toscrape.com/catalogue/paper-girl... |
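
Since the task asks for 1000 *distinct* items, it can help to check the output for duplicate UPCs rather than only counting rows. The following is a minimal sketch using only the standard library; the helper name `summarize_upcs` and the sample rows are made up for illustration, and it assumes the CSV has a `UPC` column as written by the spider above.

```python
import csv
import io

def summarize_upcs(csv_file):
    """Return (row_count, distinct_upc_count) for a CSV with a 'UPC' column."""
    rows = list(csv.DictReader(csv_file))
    return len(rows), len({row['UPC'] for row in rows})

# Tiny in-memory sample standing in for upc-codes.csv (values are made up)
sample = io.StringIO(
    "UPC,URL\n"
    "aaaa,http://example.com/a\n"
    "bbbb,http://example.com/b\n"
    "aaaa,http://example.com/a2\n"
)
total, distinct = summarize_upcs(sample)
print(total, distinct)
```

Against the real output, the same helper could be called on `open('upc-codes.csv', newline='')`; if the two numbers differ, some book was recorded more than once.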


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Part 2

For a number of years, I wrote programming articles that I republished at my website https://www.gnosis.cx/publish/.  Very often within these articles, I would include a witticism in the last section, titled "About the Author."  All of these articles are "Web 0.5" style; very simple like the early web, and without any `class`, `id` or other special attributes on tags.  Sometimes these blurbs are repeated between articles. You should only save one copy of each blurb in the output.

For example, one article contained:


> About The Author<br/>
> David Mertz programs generically and is dispatched multiply. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/. Check out David's book Text Processing in Python (http://gnosis.cx/TPiP/).

Discarding the portion of the blurb that starts at "David may be reached at ..." is probably a good idea.  A good solution will produce a file something like the one shown below.  The order of lines might vary since pages are fetched concurrently.

```python
>>> for line in open('gnosis-blurbs.csv').readlines()[:7]:
...     print(line.strip())
```
```
David Mertz programs generically and is dispatched multiply.
David Mertz has a slow brain, and most of his programs still run slowly.
David Mertz is feeling a bit testy.
David Mertz thinks that artificial languages are perfectly natural, but natural languages seem a bit artificial.
David Mertz is blessed with the virtues of laziness, and impatience, and hubris.
David Mertz had no idea he was writing prose this whole time.
While David Mertz also likes laziness and impatience, this installment is about hubris.
```
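
The trimming suggested above can be sketched as a small pure function, which is easy to try out before wiring it into a spider. This is only one possible approach; the name `clean_blurb` and its `marker` parameter are hypothetical, and the sample text is the blurb quoted earlier with line breaks added to mimic raw HTML text.

```python
def clean_blurb(raw, marker="David may be reached at"):
    """Collapse whitespace, then drop everything from the contact marker on."""
    text = " ".join(raw.split())
    return text.split(marker)[0].strip()

raw = """David Mertz programs generically and is
    dispatched multiply. David may be reached at mertz@gnosis.cx; his life
    pored over at http://gnosis.cx/publish/."""

# A set keeps only one copy of each blurb, however often it recurs
seen = set()
for candidate in (raw, raw):
    seen.add(clean_blurb(candidate))
print(seen)
```

Normalizing whitespace before deduplicating matters here: two copies of the same blurb that differ only in line wrapping would otherwise count as different strings.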

Be careful to limit your scraping to HTML pages under the `/publish/` path.  Otherwise you might scrape a lot that you do not want, and the process may take a long time.
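
One way to express that restriction is to parse each URL and test its host and path separately, instead of searching the whole URL for a substring. The sketch below uses `urllib.parse` from the standard library; the function name `is_publish_page` is hypothetical, the example article URL is illustrative rather than a known page, and treating paths ending in `/` as HTML index pages is an assumption.

```python
from urllib.parse import urlparse

def is_publish_page(url):
    """True for HTML-looking pages under /publish/ on gnosis.cx."""
    parts = urlparse(url)
    host = parts.hostname or ""
    on_site = host == "gnosis.cx" or host.endswith(".gnosis.cx")
    path = parts.path.lower()
    under_publish = path.startswith("/publish/")
    # Assumption: directory-style paths serve an HTML index page
    looks_html = path.endswith(("/", ".html", ".htm"))
    return on_site and under_publish and looks_html

print(is_publish_page("https://www.gnosis.cx/publish/"))
print(is_publish_page("https://www.gnosis.cx/publish/programming/example.html"))
print(is_publish_page("https://www.gnosis.cx/TPiP/"))
print(is_publish_page("https://example.com/?ref=gnosis.cx/publish/x.htm"))
```

The last call shows why parsing helps: a plain substring test would accept that URL because the text `gnosis.cx/publish` appears in its query string.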

In [37]:
%%writefile gnosis-blurbs.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup

class GnosisSpider(scrapy.Spider):
    name = 'Mertz blurbs'
    start_urls = ['https://www.gnosis.cx/publish/']
    link_extractor = LinkExtractor()
    blurbs = set()
    
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')

        # Look for a potential "About the Author" heading
        for h3 in soup.find_all('h3'):
            if h3.text.strip().lower() == 'about the author':
                p = h3.find_next('p')
                if p is None:
                    continue    # Heading without a following paragraph
                blurb = ' '.join(p.text.split())
                blurb = blurb.split("David may be reached at")[0].strip()
                if blurb and blurb not in self.blurbs:
                    self.blurbs.add(blurb)
                    yield {'blurb': blurb}
                
        # Look for links either way
        for link in self.link_extractor.extract_links(response):
            if 'gnosis.cx/publish' not in link.url or '.htm' not in link.url:
                continue    # Only follow links under path
            yield response.follow(link, self.parse)

Overwriting gnosis-blurbs.py


In [38]:
%%bash
rm -f gnosis-blurbs.csv
scrapy runspider gnosis-blurbs.py -o gnosis-blurbs.csv -L WARNING

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)