![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

# Web scraping in Python

## Scrapy

In this project, you will use Scrapy to scrape content from websites. Because the Scrapy engine runs from the command line, you will write a spider class, save it to a file, and then run the engine on that file.

You can do all of this within the notebook, in the manner shown below. You are welcome to develop the scripts in your favorite editor instead (including the Jupyter text editor), but copy the final content into the notebook once it works.

In [None]:
%%writefile test-spider.py
import scrapy
from bs4 import BeautifulSoup as BS

class PythonTitle(scrapy.Spider):
    name = 'Title of python.org'
    start_urls = ['https://www.python.org/']
    
    def parse(self, response):
        # Name the parser explicitly so BeautifulSoup does not warn about guessing one
        yield {'title': BS(response.text, 'html.parser').title.text}

In [None]:
%%bash
rm -f test.jl
scrapy runspider test-spider.py -o test.jl -L ERROR
cat test.jl
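
The `.jl` extension selects Scrapy's JSON Lines output, which writes one JSON object per line. As a minimal sketch of reading such output back into Python, the helper below parses JSON Lines text using only the standard library. The name `read_json_lines` and the sample content are illustrative assumptions, not part of Scrapy.

```python
import json

def read_json_lines(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hypothetical file content, shaped like the item yielded by the spider above
sample = '{"title": "Welcome to Python.org"}\n'
items = read_json_lines(sample)
print(items[0]["title"])
```

With a real run, something like `read_json_lines(open('test.jl').read())` should give a list with one dictionary per scraped item.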

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Part 1

**A fictional bookstore**

The URL http://books.toscrape.com/ contains a collection of pages that resemble an online bookstore. Prices and ratings are randomly assigned by the site. The book titles and authors appear to be actual books, although I have not verified all of them.

For this task, crawl the entire site and identify the UPC code of every book it stocks, along with the corresponding URL. If you perform this scraping correctly, you will identify N such distinct codes. Write your results as CSV.
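
One way to check a finished crawl is to count the distinct codes in the CSV it produced. The sketch below assumes the output has a header row with a column named `upc`; that column name, the helper `distinct_upcs`, and the sample rows are all made up for illustration.

```python
import csv
import io

def distinct_upcs(csv_file, column="upc"):
    """Return the set of distinct values in the given column of a CSV file object."""
    return {row[column] for row in csv.DictReader(csv_file) if row.get(column)}

# Made-up rows for illustration only; real UPCs and URLs come from the crawl
sample = io.StringIO(
    "upc,url\n"
    "aaa111,http://books.toscrape.com/catalogue/book-one_1/index.html\n"
    "bbb222,http://books.toscrape.com/catalogue/book-two_2/index.html\n"
    "aaa111,http://books.toscrape.com/catalogue/book-one_1/index.html\n"
)
print(len(distinct_upcs(sample)))  # 2
```

Passing an open file in place of the `StringIO` object should work the same way, and the resulting count can be compared against the expected number of codes.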

In [None]:
# Develop your crawler as, e.g., `upc-codes.py` ...

In [None]:
# !scrapy runspider upc-codes.py  ...

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Part 2

For a number of years, I wrote programming articles that I republished on my website https://www.gnosis.cx/publish/. Very often within these articles, I would include a witticism in the last section, titled "About the Author." All of these articles are "Web 0.5" style: very simple, like the early web, and without any `class`, `id`, or other special attributes on tags. Sometimes these blurbs are repeated between articles. Save only one copy of each blurb in the output.

For example, one article contained:


> About The Author<br/>
> David Mertz programs generically and is dispatched multiply. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/. Check out David's book Text Processing in Python (http://gnosis.cx/TPiP/).

Discarding the portion of the blurb that starts at "David may be reached at ..." is probably a good idea. A good solution will produce a file something like the one shown below. The order of lines might vary since pages are fetched concurrently.

```python
>>> for line in open('gnosis-blurbs.csv').readlines()[:7]:
...     print(line.strip())
```
```
David Mertz programs generically and is dispatched multiply.
David Mertz has a slow brain, and most of his programs still run slowly.
David Mertz is feeling a bit testy.
David Mertz thinks that artificial languages are perfectly natural, but natural languages seem a bit artificial.
David Mertz is blessed with the virtues of laziness, and impatience, and hubris.
David Mertz had no idea he was writing prose this whole time.
While David Mertz also likes laziness and impatience, this installment is about hubris.
```
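
The trimming and de-duplication described above can be sketched in plain Python, independent of how the blurb text is extracted from each page. The helper names `trim_blurb` and `unique_blurbs` are hypothetical, and the marker string is only the one from the example blurb; other articles may word the contact sentence differently.

```python
def trim_blurb(text, marker="David may be reached at"):
    """Keep only the part of a blurb before the contact boilerplate."""
    # Collapse line breaks and repeated spaces first, so the marker
    # still matches when the HTML source wraps it across lines
    normalized = " ".join(text.split())
    head, _, _ = normalized.partition(marker)
    return head.strip()

def unique_blurbs(blurbs):
    """Drop empty and repeated blurbs, keeping first-seen order."""
    seen = set()
    result = []
    for blurb in map(trim_blurb, blurbs):
        if blurb and blurb not in seen:
            seen.add(blurb)
            result.append(blurb)
    return result

raw = (
    "David Mertz programs generically and is dispatched multiply. "
    "David may be reached at mertz@gnosis.cx; his life pored over at "
    "http://gnosis.cx/publish/."
)
print(unique_blurbs([raw, raw]))
# ['David Mertz programs generically and is dispatched multiply.']
```

Inside a spider, the `seen` set would need to live somewhere that persists across pages, for example as an attribute on the spider instance.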

Be careful to limit your scraping to HTML pages under the `/publish/` path. Otherwise, you might scrape a lot of content that you do not want, and the process may take a long time.
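
A minimal sketch of such a scope check, using only the standard library, is shown here. The function name `in_scope` is hypothetical, and so are the rules it encodes: it assumes that directory URLs ending in `/` and files ending in `.html` or `.htm` are the HTML pages of interest, which may not match every page on the site.

```python
from urllib.parse import urlparse

def in_scope(url):
    """True for HTML-looking pages under /publish/ on gnosis.cx."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host not in ("gnosis.cx", "www.gnosis.cx"):
        return False
    if not parts.path.startswith("/publish/"):
        return False
    last = parts.path.rsplit("/", 1)[-1]
    # Assumption: directory indexes (no filename) and .html/.htm files count as HTML
    return last == "" or last.lower().endswith((".html", ".htm"))

print(in_scope("https://www.gnosis.cx/publish/"))  # True
print(in_scope("http://gnosis.cx/TPiP/"))          # False
```

A spider could call a check like this before following each link. Scrapy's `allowed_domains` attribute restricts a crawl by domain but not by path, so the path test still has to be done separately.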

In [None]:
# Your code here...

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)