# Introduction to Scrapy

#### A first simple Scrapy spider

In [1]:
import scrapy

class StackOverflowSpider(scrapy.Spider):
    # Unique spider name, plus the page(s) where the crawl starts.
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # Listing page: follow the link of every question summary.
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # Question page: extract the fields and yield them as one item.
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }


- Save the code above to a file, for example stackoverflow_spider.py,  
then run the spider with the `runspider` command:
```
scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json
```
- This produces a JSON file containing the highest-voted questions on Stack Overflow,  
with content like the following:  

```
[{
    "body": "... LONG HTML HERE ...",
    "link": "http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array",
    "tags": ["java", "c++", "performance", "optimization"],
    "title": "Why is processing a sorted array faster than an unsorted array?",
    "votes": "9924"
},
{
    "body": "... LONG HTML HERE ...",
    "link": "http://stackoverflow.com/questions/1260748/how-do-i-remove-a-git-submodule",
    "tags": ["git", "git-submodules"],
    "title": "How do I remove a Git submodule?",
    "votes": "1764"
},
...]
```
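
Once the feed exists, it can be post-processed with plain Python. The following is only a sketch under stated assumptions: it embeds a two-record sample shaped like the output above instead of reading the real file, it targets Python 3, and `top_questions` is a hypothetical helper name, not part of Scrapy. Note that `votes` is scraped as a string, so it is converted to `int` before sorting.

```python
import json

# Sample data shaped like the feed above (bodies shortened). With a real
# feed you would instead load 'top-stackoverflow-questions.json' from disk.
SAMPLE_FEED = """
[
  {"body": "...",
   "link": "http://stackoverflow.com/questions/1260748/how-do-i-remove-a-git-submodule",
   "tags": ["git", "git-submodules"],
   "title": "How do I remove a Git submodule?",
   "votes": "1764"},
  {"body": "...",
   "link": "http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array",
   "tags": ["java", "c++", "performance", "optimization"],
   "title": "Why is processing a sorted array faster than an unsorted array?",
   "votes": "9924"}
]
"""


def top_questions(items, limit=10):
    """Return (votes, title) pairs sorted by vote count, highest first."""
    ranked = sorted(items, key=lambda item: int(item["votes"]), reverse=True)
    return [(int(item["votes"]), item["title"]) for item in ranked[:limit]]


questions = json.loads(SAMPLE_FEED)
for votes, title in top_questions(questions):
    print(votes, title)
```

Sorting on the raw strings would compare them lexicographically, which is why the conversion to `int` happens inside the sort key.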

####  >>>  So what just happened?
- The basic structure of a spider
- Asynchronous processing

In [None]:
scrapy.Spider?

------
### Now let's start a Scrapy project
#### First, use Scrapy's command-line tool

$ scrapy
Scrapy 1.0.4 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

#### Create the project
- Our demo project: a diamond company (it extracts pearls of data from the Internet)

$ scrapy startproject diamond_bank

#### Manage the project
$ cd diamond_bank

#### Global commands:
```
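startproject    ==> Create a new project
settings        ==> Get settings values
runspider       ==> Run a self-contained spider, without creating a project
shell           ==> Interactive scraping console
fetch           ==> Fetch a URL using the Scrapy downloader
view            ==> Open a URL in the browser, as seen by Scrapy
version         ==> Print the Scrapy version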
```

#### Project-only commands:
```
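crawl       ==>     Start crawling with a spider
check       ==>     Check the defined contracts
list        ==>     List all spiders
edit        ==>     Edit a spider
parse       ==>     Parse a URL (using its spider) and print the results
genspider   ==>     Generate a new spider from a pre-defined template
bench       ==>     Run a quick benchmark test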
```

### Using the Scrapy shell
$ scrapy shell http://baidu.com

### Quickly creating a spider
$ scrapy genspider mydomain mydomain.com

In [None]:
$ scrapy crawl mydomain    (-o items.json)

In [None]:
#### OK, the skeleton is in place. The next step is to fill in the details and move into project mode

## In fact, the best first-hand source of information is the official documentation
http://doc.scrapy.org/en/latest/

In [None]:
import requests
from scrapy.http import TextResponse

r = requests.get('http://stackoverflow.com/')
response = TextResponse(r.url, body=r.text, encoding='utf-8')