### Section 96.1: Scraping using the Scrapy framework

In [None]:
import scrapy
class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow' # each spider has a unique name
    start_urls = ['http://stackoverflow.com/questions?sort=votes'] # the parsing starts from a specific set of urls
    def parse(self, response): # for each request this generator yields, its response is sent to parse_question
        for href in response.css('.question-summary h3 a::attr(href)'): # do some scraping stuff using css selectors to find question urls
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)
    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'votes': response.css('.question .vote-count-post::text').extract_first(),
            'body': response.css('.question .post-text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
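
The `response.urljoin(...)` call in `parse` turns each relative `href` into an absolute URL. As a rough offline illustration, the standard library's `urllib.parse.urljoin` resolves links in the same general way. The relative path below is a made-up placeholder rather than a real question URL, and Scrapy's own method may additionally take a `<base>` tag in the page into account.

```python
from urllib.parse import urljoin

# Page the spider starts from (taken from start_urls above).
base_url = 'http://stackoverflow.com/questions?sort=votes'

# Hypothetical relative link, standing in for an extracted href.
relative_href = '/questions/12345/example-question'

# The scheme and host come from the base URL; the path comes from the href.
full_url = urljoin(base_url, relative_href)
print(full_url)  # http://stackoverflow.com/questions/12345/example-question
```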

In [1]:
%%cmd
scrapy runspider practice/practice/spiders/stackoverflow_spider.py

Microsoft Windows [Version 10.0.16299.125]
(c) 2017 Microsoft Corporation. All rights reserved.

E:\MyFile\Jupyter\Python-Learn\Chapter 96 Web scraping with Python>scrapy runspider practice/practice/spiders/stackoverflow_spider.py

E:\MyFile\Jupyter\Python-Learn\Chapter 96 Web scraping with Python>

2018-03-07 20:14:17 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-03-07 20:14:17 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2n  7 Dec 2017), cryptography 2.0.3, Platform Windows-10-10.0.16299-SP0
2018-03-07 20:14:17 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-03-07 20:14:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-03-07 20:14:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddl

### Section 96.2: Scraping using Selenium WebDriver

In [5]:
from selenium import webdriver
from selenium.webdriver.common.by import By
# find_element(By.CSS_SELECTOR, ...) replaces the find_element_by_* helpers, which newer Selenium releases no longer provide
browser = webdriver.Chrome() # launch Chrome browser
browser.get('http://stackoverflow.com/questions?sort=votes') # load url
title = browser.find_element(By.CSS_SELECTOR, 'h1').text # page title (first h1 element)
questions = browser.find_elements(By.CSS_SELECTOR, '.question-summary') # question list
for question in questions: # iterate over questions
    question_title = question.find_element(By.CSS_SELECTOR, '.summary h3 a').text
    question_excerpt = question.find_element(By.CSS_SELECTOR, '.summary .excerpt').text
    question_vote = question.find_element(By.CSS_SELECTOR, '.stats .vote .votes .vote-count-post').text
    print("%s\n%s\n%s votes\n-----------\n" % (question_title, question_excerpt, question_vote))

Why is it faster to process a sorted array than an unsorted array?
Here is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster. #include <algorithm> #include <ctime> #...
20710 votes
-----------

How to undo the most recent commits in Git
I accidentally committed wrong files to Git, but I haven't pushed the commit to the server yet. How can I undo those commits from the local repository?
16740 votes
-----------

How do I delete a Git branch both locally and remotely?
I want to delete a branch both locally and on my remote project fork on GitHub. Failed Attempts to Delete Remote Branch $ git branch -d remotes/origin/bugfix error: branch 'remotes/origin/bugfix' ...
12742 votes
-----------

What is the difference between 'git pull' and 'git fetch'?
What are the differences between git pull and git fetch?
9463 votes
-----------

What is the correct JSON content type?
I've been messing around with JSON f

### Section 96.3: Basic example of using requests and lxml to scrape some data

In [6]:
import lxml.html
import requests
def main():
    r = requests.get("https://httpbin.org")
    html_source = r.text
    root_element = lxml.html.fromstring(html_source)
    # Note root_element.xpath() gives a *list* of results.
    # XPath specifies a path to the element we want.
    page_title = root_element.xpath('/html/head/title/text()')[0]
    print(page_title)
if __name__ == '__main__':
    main()

httpbin(1): HTTP Client Testing Service


### Section 96.4: Maintaining web-scraping session with requests

It is a good idea to maintain a [web-scraping session](http://docs.python-requests.org/en/master/user/advanced/#session-objects) to persist cookies and other parameters across requests. It can also improve performance, because requests.Session reuses the underlying TCP connection to a host:

In [7]:
import requests
with requests.Session() as session:
    # all requests through session now have User-Agent header set (update() keeps the other default headers)
    session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})
    # set cookies
    session.get('http://httpbin.org/cookies/set?key=value')
    # get cookies
    response = session.get('http://httpbin.org/cookies')
    print(response.text)

{
  "cookies": {
    "key": "value"
  }
}



### Section 96.5: Scraping using BeautifulSoup4

In [8]:
from bs4 import BeautifulSoup
import requests
# Use the requests module to obtain a page
res = requests.get('https://www.codechef.com/problems/easy')
# Create a BeautifulSoup object
page = BeautifulSoup(res.text, 'lxml') # the text field contains the source of the page
# Now use a CSS selector in order to get the table containing the list of problems
datatable_tags = page.select('table.dataTable') # The problems are in the <table> tag,
# with class "dataTable"
# We extract the first tag from the list, since that's what we desire
datatable = datatable_tags[0]
# Now since we want problem names, they are contained in <b> tags, which are
# directly nested under <a> tags
prob_tags = datatable.select('a > b')
prob_names = [tag.getText().strip() for tag in prob_tags]
print(prob_names)

['Taxi Making Sharp Turns', 'Nested Candy Boxes', 'Number Game', 'A Tale of Two Right Angled Triangles', 'Compression Algorithm', 'Find an element in hidden array', 'Generating A Permutation', 'Method Resolution Order', 'A Few Laughing Men', 'Chef and Triangles', 'SAD Queries', 'C - Club of Riders', 'Obtain Desired Standard Deviation', 'Forces in the crystal', 'Obtain Desired Expected Value', 'Quadratic Functions', 'Chef and Average on a Tree', 'A - Appearance Count', 'A Tale of Three Squares', 'Year 3017', 'Just a simple sum', 'Weird Competition', 'Hasan and boring classes', 'Minimax', 'Hull Sum', 'Optimize The Slow Code', 'Animesh practices some programming contests', 'Black Nodes in Subgraphs', 'Make array great again', 'Mathison and the teleportation game', 'Long Homework', 'Malvika conducts her own ACM-ICPC contest series', 'Colorful Grids', 'Minimize the string', 'Sereja and Two Strings 2', 'Chef and Yoda', 'Sereja and Two Lines', 'Chef and Inflation', 'Chef Shifu and Celebration

### Section 96.6: Simple web content download with urllib.request

In [9]:
from urllib.request import urlopen
response = urlopen('http://stackoverflow.com/questions?sort=votes')
data = response.read()
# The received bytes should usually be decoded according to the response's character set
encoding = response.info().get_content_charset() or 'utf-8' # fall back to UTF-8 if no charset is declared
html = data.decode(encoding)

In [10]:
html
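
The charset lookup used above can be tried without a network connection. The sketch below assumes that `response.info()` behaves like an `email.message.Message` (which `http.client.HTTPMessage` subclasses) and builds such headers by hand. `decode_body` is a hypothetical helper name, and falling back to UTF-8 is an assumption rather than a rule.

```python
from email.message import Message

def decode_body(data, headers, default='utf-8'):
    """Decode raw bytes using the charset from the headers, or a default."""
    # get_content_charset() returns None when no charset is declared.
    encoding = headers.get_content_charset() or default
    # errors='replace' avoids an exception if the declared charset is wrong.
    return data.decode(encoding, errors='replace')

# Hand-built headers standing in for response.info().
latin1_headers = Message()
latin1_headers['Content-Type'] = 'text/html; charset=ISO-8859-1'
print(decode_body('café'.encode('latin-1'), latin1_headers))

# No charset declared: the helper falls back to the default (UTF-8).
bare_headers = Message()
bare_headers['Content-Type'] = 'text/html'
print(decode_body('café'.encode('utf-8'), bare_headers))
```

Without the fallback, `data.decode(None)` would raise a `TypeError` for responses that omit the charset.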



### Section 96.7: Modify Scrapy user agent

In [None]:
#USER_AGENT = 'projectName (+http://www.yourdomain.com)'

In [None]:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'

### Section 96.8: Scraping with curl

In [11]:
from subprocess import Popen, PIPE
from lxml import etree
from io import StringIO

In [12]:
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
url = 'http://stackoverflow.com'
get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
result = get.stdout.read().decode('utf8')

In [13]:
tree = etree.parse(StringIO(result), etree.HTMLParser())
divs = tree.xpath('//div')

In [14]:
tree

<lxml.etree._ElementTree at 0x1ea82fa0548>

In [15]:
divs

[]

In [16]:
result

'<html><head><title>Object moved</title></head><body>\r\n<h2>Object moved to <a href="https://stackoverflow.com/">here</a>.</h2>\r\n</body></html>\r\n'
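
The `result` above is the server's redirect page ("Object moved"), which is why `divs` is empty: curl does not follow redirects unless asked to. A minimal sketch, assuming curl's `-L` flag (follow redirects) is what is wanted here. `build_curl_command` and `fetch` are hypothetical helper names, and `fetch` needs curl installed and network access, so it is only defined here, not called.

```python
import subprocess

def build_curl_command(url, user_agent, follow_redirects=True):
    """Return the curl argument list for a silent GET with a custom User-Agent."""
    cmd = ['curl', '-s', '-A', user_agent]
    if follow_redirects:
        cmd.append('-L')  # follow HTTP redirects such as http -> https
    cmd.append(url)
    return cmd

def fetch(url, user_agent):
    """Run curl and return the decoded body (requires curl and network access)."""
    completed = subprocess.run(build_curl_command(url, user_agent),
                               stdout=subprocess.PIPE, check=True)
    return completed.stdout.decode('utf8')

# Only show the command; nothing is downloaded here.
print(build_curl_command('http://stackoverflow.com', 'Mozilla/5.0'))
```

Passing the output of `fetch` to `etree.parse(StringIO(...), etree.HTMLParser())` as before should then parse the real page instead of the redirect notice.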