# Web scraping

Requires: urllib, requests, socket, urllib3, re, lxml, io, bs4, selenium, scrapy, sqlalchemy, sqlite3, pandas

## Intro: encodings

In [1]:
with open('unicode_file.txt', 'w', encoding='utf-16le') as f:
    f.write('韩国烧酒')

In [2]:
with open('unicode_file.txt', 'r') as f:
    print(f.read())

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

In [3]:
with open('unicode_file.txt', 'r', encoding='utf-16') as f:
    print(f.read())

UnicodeError: UTF-16 stream does not start with BOM

In [4]:
with open('unicode_file.txt', 'r', encoding='utf-16le') as f:
    print(f.read())

韩国烧酒
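
Why does `utf-16` fail here while `utf-16le` works? A rough stdlib-only sketch of the difference (the helper `has_utf16_bom` and the file name `bom_demo.txt` are made up for illustration): `utf-16le` fixes the byte order and writes no byte order mark (BOM), while plain `utf-16` writes a BOM and expects to find one when reading a file.

```python
import codecs
import os
import tempfile

UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def has_utf16_bom(data: bytes) -> bool:
    """Return True if the byte string starts with a UTF-16 byte order mark."""
    return data[:2] in UTF16_BOMS


text = '韩国烧酒'
no_bom = text.encode('utf-16le')   # explicit byte order, no BOM written
with_bom = text.encode('utf-16')   # native byte order, BOM prepended

print(has_utf16_bom(no_bom))
print(has_utf16_bom(with_bom))

# Round trip through a file: writing with 'utf-16' stores the BOM,
# so reading with 'utf-16' can detect the byte order.
path = os.path.join(tempfile.mkdtemp(), 'bom_demo.txt')  # hypothetical file name
with open(path, 'w', encoding='utf-16') as f:
    f.write(text)
with open(path, 'r', encoding='utf-16') as f:
    round_trip = f.read()
print(round_trip == text)
```

So either write and read with the same explicit byte order, or use plain `utf-16` on both sides.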


## Web scraping in a nutshell

* Fetching HTML
* Parsing HTML

## Fetching HTML

* urllib (urllib2 in Python 2)
* requests
* socket (low-level)



### urllib 

In [5]:
import urllib.request

In [6]:
response = urllib.request.urlopen('http://example.com/')
html = response.read()
print(html)

b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 50px;\n        background-color: #fff;\n        border-radius: 1em;\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        body {\n            background-color: #fff;\n        }\n        div {\n            width: auto;\n            margin: 0 auto;\n            border-radius: 0;\n            padding: 1em;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\

In [7]:
html = html.decode('utf-8')
print(html)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This doma

In [8]:
with open('example.com.txt', 'w', encoding='utf-8') as f:
    f.write(html)
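
The `'utf-8'` in `html.decode('utf-8')` is hard-coded. A hedged sketch of taking the charset from the `Content-Type` header instead, falling back to a default when the header has none. `decode_body` is a hypothetical helper, and the `Message` object below only stands in for `response.headers` so the sketch runs without a network connection.

```python
from email.message import Message


def decode_body(body: bytes, headers: Message, default: str = 'utf-8') -> str:
    """Decode a response body using the charset declared in the headers."""
    # get_content_charset() returns None when Content-Type carries no charset
    charset = headers.get_content_charset() or default
    return body.decode(charset, errors='replace')


# Stand-in for response.headers (urllib returns a Message subclass there).
headers = Message()
headers['Content-Type'] = 'text/html; charset=windows-1251'
body = 'погода в москве'.encode('windows-1251')
decoded = decode_body(body, headers)
print(decoded)

# No charset in the header: the default is used.
plain_headers = Message()
plain_headers['Content-Type'] = 'text/html'
print(decode_body(b'<h1>Example Domain</h1>', plain_headers))

# With urllib this would be called as:
#     decode_body(response.read(), response.headers)
```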

What else is there besides the HTML?

In [9]:
print(dir(response))

['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_check_close', '_close_conn', '_get_chunk_left', '_method', '_peek_chunked', '_read1_chunked', '_read_and_discard_trailer', '_read_next_chunk_size', '_read_status', '_readall_chunked', '_readinto_chunked', '_safe_read', '_safe_readinto', 'begin', 'chunk_left', 'chunked', 'close', 'closed', 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp', 'getcode', 'getheader', 'getheaders', 'geturl', 'headers', 'info', 'isatty', 'isclo

In [10]:
print(response.url)
print(response.msg)
print(response.code)

http://example.com/
OK
200


### requests
HTTP for Humans

In [1]:
import requests

In [None]:
requests.post('http://example.com')

In [2]:
response = requests.get('http://example.com')
response

<Response [200]>

In [8]:
print(dir(response))

['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']


In [9]:
print(response.url)
print(response.connection)
print(response.headers)
print(response.ok)
print(response.status_code)
print(response.encoding)
print(response.links)

http://example.com/
<requests.adapters.HTTPAdapter object at 0x107df90b8>
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Cache-Control': 'max-age=604800', 'Content-Type': 'text/html', 'Date': 'Thu, 26 Apr 2018 06:21:07 GMT', 'Etag': '"1541025663"', 'Expires': 'Thu, 03 May 2018 06:21:07 GMT', 'Last-Modified': 'Fri, 09 Aug 2013 23:54:35 GMT', 'Server': 'ECS (lga/1386)', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'Content-Length': '606'}
True
200
ISO-8859-1
{}
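
Note that `response.encoding` is `ISO-8859-1` even though the page declares `utf-8` in its `<meta>` tag: the `Content-Type` header above carries no charset, so requests falls back to a default for `text/*` content. A small stdlib-only sketch of what such a mismatch does to non-ASCII text (the sample string is an assumption; example.com itself is pure ASCII, so the problem is invisible there):

```python
raw = 'погода в москве'.encode('utf-8')   # bytes as a UTF-8 server would send them

wrong = raw.decode('iso-8859-1')   # what a latin-1 guess produces: mojibake
right = raw.decode('utf-8')

print(wrong == right)
print(right)

# The damage is reversible as long as no bytes were dropped:
repaired = wrong.encode('iso-8859-1').decode('utf-8')
print(repaired == right)

# With requests, one commonly used remedy is to override the guess
# before reading .text:
#     response.encoding = response.apparent_encoding
```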


In [15]:
print(response.text)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This doma

### socket (low-level)

In [16]:
import socket

In [12]:
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
host = socket.gethostbyname('www.example.com')
port = 80

sock.connect((host,port))
sock.sendall(b'GET / HTTP/1.1\r\nHost: www.example.com\r\n\r\n')

val = sock.recv(4096)
print(val.decode('utf-8'))


In [18]:
# Split off the HTTP headers
val = val.split(b'\r\n\r\n',1)[1]
print(val.decode('utf-8'))

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This doma
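
A single `sock.recv(4096)` returns at most 4096 bytes, so a larger page would come back truncated. A minimal sketch of reading until the peer closes the connection, demonstrated on a local `socket.socketpair()` instead of a real server (the helpers `recv_all` and `split_response` and the canned response are assumptions for illustration). Against a real HTTP/1.1 server you would also send a `Connection: close` header so that the server closes the socket after the response.

```python
import socket


def recv_all(sock, bufsize=4096):
    """Read from the socket until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:          # empty bytes: the peer closed the connection
            break
        chunks.append(chunk)
    return b''.join(chunks)


def split_response(raw):
    """Split a raw HTTP response into (headers, body) at the first blank line."""
    head, _, body = raw.partition(b'\r\n\r\n')
    return head, body


# Local stand-in for a server: one end sends a canned response and closes.
client, server = socket.socketpair()
server.sendall(b'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n'
               b'<h1>Example Domain</h1>')
server.close()

raw = recv_all(client, bufsize=16)   # tiny buffer to force several recv calls
client.close()

head, body = split_response(raw)
print(head.decode('ascii'))
print(body.decode('utf-8'))
```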

## HTTP Errors

In [19]:
import urllib3
dir(urllib3.exceptions)

['BodyNotHttplibCompatible',
 'ClosedPoolError',
 'ConnectTimeoutError',
 'ConnectionError',
 'DecodeError',
 'EmptyPoolError',
 'HTTPError',
 'HeaderParsingError',
 'HostChangedError',
 'IncompleteRead',
 'InvalidHeader',
 'LocationParseError',
 'LocationValueError',
 'MaxRetryError',
 'NewConnectionError',
 'PoolError',
 'ProtocolError',
 'ProxyError',
 'ProxySchemeUnknown',
 'ReadTimeoutError',
 'RequestError',
 'ResponseError',
 'ResponseNotChunked',
 'SSLError',
 'TimeoutError',
 'TimeoutStateError',
 'UnrewindableBodyError',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'absolute_import',
 'httplib_IncompleteRead']
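
The list above only names the exception classes. As a rough illustration of handling such errors, here is a sketch that swaps in the standard library's `urllib.error` rather than `urllib3` (the `fetch` helper and its return convention are assumptions). `HTTPError` is a subclass of `URLError`, so it has to be caught first.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (status, payload); status is None when no HTTP answer arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status code.
        return err.code, str(err.reason).encode()
    except urllib.error.URLError as err:
        # No usable answer at all: DNS failure, refused connection, missing file...
        return None, str(err.reason).encode()


# A file:// URL that does not exist triggers URLError without any network access.
status, payload = fetch('file:///definitely/not/here')
print(status)
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
```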

## Parsing HTML

* re
* lxml
* BeautifulSoup

In [15]:
html=response.text

In [16]:
print(html)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This doma

### re

In [17]:
import re

In [18]:
h1 = re.findall(r'<h1>[\w ]+</h1>', html)
print(h1)
h1 = re.findall(r'<h1>([\w ]+)</h1>', html)
print(h1)

['<h1>Example Domain</h1>']
['Example Domain']


In [21]:
paragraphs = re.findall(r'<p>(.*)</p>', html)
paragraphs

['<a href="http://www.iana.org/domains/example">More information...</a>']

There are 2 paragraphs on the page, so why was only one found? By default `.` does not match a newline, and the first paragraph spans two lines.


In [22]:
paragraphs = re.findall(r'<p>([\w\W]*)</p>', html) # there should be a nicer way
paragraphs

['This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission.</p>\n    <p><a href="http://www.iana.org/domains/example">More information...</a>']

Bad again: the greedy `*` runs to the last `</p>`, so both paragraphs are glued into one match.

In [23]:
paragraphs = re.findall(r'<p>.*?</p>', html, re.DOTALL) # non-greedy matching
paragraphs

['<p>This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission.</p>',
 '<p><a href="http://www.iana.org/domains/example">More information...</a></p>']

Pretty brutal, right?
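
The three attempts differ only in greediness and in whether `.` may cross a newline. A small self-contained sketch on a made-up two-paragraph snippet (the `sample` string is an assumption, not the real page):

```python
import re

sample = '<p>first\nparagraph</p>\n<p><a href="#">second</a></p>'

# `.` stops at newlines, so the multi-line first paragraph is skipped.
no_dotall = re.findall(r'<p>(.*?)</p>', sample)

# DOTALL lets `.` cross newlines, but a greedy `*` runs to the last </p>.
greedy = re.findall(r'<p>(.*)</p>', sample, re.DOTALL)

# DOTALL plus a non-greedy `*?` stops at the nearest </p>.
lazy = re.findall(r'<p>(.*?)</p>', sample, re.DOTALL)

print(len(no_dotall), len(greedy), len(lazy))
print(lazy)
```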

### lxml

In [24]:
from lxml import etree
from io import StringIO, BytesIO


In [27]:
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
tree

<lxml.etree._ElementTree at 0x7f24e6941fc8>

In [28]:
print(dir(tree))

['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_setroot', 'docinfo', 'find', 'findall', 'findtext', 'getelementpath', 'getiterator', 'getpath', 'getroot', 'iter', 'iterfind', 'parse', 'parser', 'relaxng', 'write', 'write_c14n', 'xinclude', 'xmlschema', 'xpath', 'xslt']


In [29]:
print(tree.getroot())
print(etree.tostring(tree.getroot(), pretty_print=True, method="html"))

<Element html at 0x7f24e66ac048>
b'<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8">\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 50px;\n        background-color: #fff;\n        border-radius: 1em;\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        body {\n            background-color: #fff;\n        }\n        div {\n            width: auto;\n            margin: 0 auto;\n            border-radius: 0;\n            padding: 1em;\n        }\n    }\n    </style>    \n</head>\n\n<bod

In [30]:
paragraphs = tree.xpath('//p')
for p in paragraphs:
    print(p.text)  

This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.
None
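
The `None` for the second paragraph appears because `.text` holds only the text before the first child element, and here all the text lives inside the nested `<a>`. A dependency-free sketch using the standard library's `xml.etree.ElementTree` as a stand-in (lxml elements offer the same `.text` attribute and `itertext()` method; the snippets are assumptions shaped like the page above):

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    '<p><a href="http://www.iana.org/domains/example">More information...</a></p>'
)
print(p.text)                   # None: no text before the <a> child
link_text = ''.join(p.itertext())
print(link_text)                # text of the element and all its descendants

mixed = ET.fromstring('<p>See <a href="#">this</a> page</p>')
print(repr(mixed.text))         # only the part before the first child
mixed_text = ''.join(mixed.itertext())
print(mixed_text)
```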


In [31]:
for p in paragraphs:
    print(etree.tostring(p, pretty_print=True, method='html'))  

b'<p>This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission.</p>\n    \n'
b'<p><a href="http://www.iana.org/domains/example">More information...</a></p>\n\n'


In [32]:
hrefs = tree.xpath('//a')
for href in hrefs:
    print(href.text)  
    print(href.attrib)  

More information...
{'href': 'http://www.iana.org/domains/example'}


In [33]:
specific_hrefs = tree.xpath('//a[@href="http://www.non-existing-domain.org/"]')
specific_hrefs

[]

### BeautifulSoup

In [25]:
print(html)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This doma

In [30]:
html = requests.get('https://www.google.com/search?source=hp&ei=OXPhWviWDKqJmwWThrmgBA&q=%D0%BF%D0%BE%D0%B3%D0%BE%D0%B4%D0%B0+%D0%B2+%D0%BC%D0%BE%D1%81%D0%BA%D0%B2%D0%B5&oq=%D0%BF%D0%BE%D0%B3%D0%BE%D0%B4%D0%B0+%D0%B2+%D0%BC%D0%BE%D1%81%D0%BA%D0%B2%D0%B5&gs_l=psy-ab.3..0i203k1l10.1990.5682.0.5795.19.14.1.0.0.0.550.1687.0j1j3j1j0j1.6.0....0...1c.1.64.psy-ab..13.6.1167.0..0j35i39k1.0.hbVCA6n3Um0').text

In [27]:
# try:
from bs4 import BeautifulSoup
# except:
#     !pip install bs4 

In [32]:
soup = BeautifulSoup(html, 'html5lib')  # name the parser explicitly; omitting it triggers a bs4 warning


In [36]:
html

'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="ru"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><link href="/images/branding/product/ico/googleg_lodp.ico" rel="shortcut icon"><title>&#1087;&#1086;&#1075;&#1086;&#1076;&#1072; &#1074; &#1084;&#1086;&#1089;&#1082;&#1074;&#1077; - &#1055;&#1086;&#1080;&#1089;&#1082; &#1074; Google</title><style>#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-align:top}#gbar{float:left}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important} </style><style>.star{float:

In [37]:
soup.find_all(attrs={'id': 'gbar'})

[<div id="gbar"><nobr><b class="gb1">Поиск</b> <a class="gb1" href="https://www.google.ru/search?hl=ru&amp;tbm=isch&amp;source=og&amp;tab=wi">Картинки</a> <a class="gb1" href="https://maps.google.ru/maps?hl=ru&amp;tab=wl">Карты</a> <a class="gb1" href="https://play.google.com/?hl=ru&amp;tab=w8">Play</a> <a class="gb1" href="https://www.youtube.com/results?gl=RU&amp;tab=w1">YouTube</a> <a class="gb1" href="https://news.google.ru/nwshp?hl=ru&amp;tab=wn">Новости</a> <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Почта</a> <a class="gb1" href="https://drive.google.com/?tab=wo">Диск</a> <a class="gb1" href="https://www.google.ru/intl/ru/options/" style="text-decoration:none"><u>Ещё</u> »</a></nobr></div>]

In [26]:
paragraphs = soup.find_all('p')
paragraphs

In [37]:
hrefs = soup.find_all('a')
hrefs

[<a href="http://www.iana.org/domains/example">More information...</a>]

In [38]:
hrefs = soup.find_all('a', href='http://www.iana.org/domains/example')
hrefs

[<a href="http://www.iana.org/domains/example">More information...</a>]

In [39]:
hrefs = soup.find_all('a', href='http://www.other-website.org/domains/example')
hrefs

[]

In [41]:
from selenium import webdriver


In [42]:
dr = webdriver.Chrome()

In [49]:
dr.get('https://ya.ru')

In [51]:
el = dr.find_element('id', 'text')  # find_element_by_id was removed in Selenium 4.3

In [52]:
el.send_keys('привет')

In [54]:
el.send_keys('\n')

In [44]:
!pwd

/Users/ryadom/seminars_spring2018/seminars/11


In [55]:
dr.save_screenshot('/Users/ryadom/seminars_spring2018/seminars/11/sc.png')

True

#### More on encodings: UnicodeDammit

It is especially useful in combination with a library such as chardet.

In [40]:
from bs4 import UnicodeDammit

Here is a truly pathological text with mixed encodings.


In [41]:
snowmen = (u'\N{SNOWMAN}' * 3)
print(snowmen)
quote = (u'\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}')
print(quote)
doc = snowmen.encode('utf8') + quote.encode('windows_1252')

☃☃☃
“I like snowmen!”


In [42]:
print(doc)
# ☃☃☃�I like snowmen!�

print(doc.decode('windows-1252'))
# â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

print(doc.decode('utf8'))
# ☃☃☃�I like snowmen!�

b'\xe2\x98\x83\xe2\x98\x83\xe2\x98\x83\x93I like snowmen!\x94'
â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 9: invalid start byte

Let's read it with UnicodeDammit:

In [43]:
new_doc = UnicodeDammit.detwingle(doc)
print(new_doc)
print(new_doc.decode('utf8'))
# ☃☃☃“I like snowmen!”

b'\xe2\x98\x83\xe2\x98\x83\xe2\x98\x83\xe2\x80\x9cI like snowmen!\xe2\x80\x9d'
☃☃☃“I like snowmen!”
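
For comparison, a rough stdlib-only alternative for the same snowmen document: a custom `codecs` error handler that decodes every byte that is invalid UTF-8 as windows-1252. This is a different technique from whatever `detwingle` does internally, and the handler name `cp1252_fallback` is made up. It only helps when the stray bytes are not valid UTF-8 by accident.

```python
import codecs


def cp1252_fallback(err):
    """Decode the bytes that broke UTF-8 decoding as windows-1252 instead."""
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.end]
        # Return the replacement text and the position to resume decoding at.
        return bad.decode('windows-1252', errors='replace'), err.end
    raise err


codecs.register_error('cp1252_fallback', cp1252_fallback)

snowmen = '\N{SNOWMAN}' * 3
quote = '\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}'
doc = snowmen.encode('utf8') + quote.encode('windows_1252')

fixed = doc.decode('utf8', errors='cp1252_fallback')
print(fixed)
```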


## Performance

In [44]:
%timeit re.findall(r'<p>.*?</p>', html, re.DOTALL)
%timeit tree.xpath('//p')
%timeit soup.find_all('p')

5.07 µs ± 60.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
13.8 µs ± 675 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
18.9 µs ± 812 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


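Moral: regex is fast. Keep in mind that `tree` and `soup` were parsed beforehand, so the lxml and BeautifulSoup timings do not include parsing.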

## For dessert: scrapy

Along the way we'll practice IPython magic!

We will scrape a site with an archive of cryptocurrency prices: https://coinmarketcap.com/.

The goal is to open the [historical snapshots](https://coinmarketcap.com/historical/) and export the full table for each week.


In [45]:
import scrapy


In [46]:
!rm -rf coinmarketcap

In [47]:
!scrapy startproject coinmarketcap

New Scrapy project 'coinmarketcap', using template directory '/media/storage/workspace/gsoc/scrapy/scrapy/templates/project', created in:
    /media/storage/workspace/students/python/scraping/coinmarketcap

You can start your first spider with:
    cd coinmarketcap
    scrapy genspider example example.com


What's inside?

In [48]:
!cd coinmarketcap; ls -R

.:
coinmarketcap  scrapy.cfg

./coinmarketcap:
__init__.py  items.py  middlewares.py  pipelines.py  settings.py  spiders

./coinmarketcap/spiders:
__init__.py


A helper function for writing code from the notebook into the scrapy project:

In [49]:
def dump_to(path):
    # `_i` is IPython's variable holding the source of the previously executed cell
    with open(path, 'w') as f:
        f.write(_i)

### Item: first define what we want to collect

In [50]:
# -*- coding:utf8 -*-

import scrapy


class CurrencyItem(scrapy.Item):
    date = scrapy.Field()
    name = scrapy.Field()
    symbol = scrapy.Field()
    market_cap = scrapy.Field()
    price = scrapy.Field()

In [51]:
dump_to('./coinmarketcap/coinmarketcap/items.py')

### Spider: the one that collects Items

In [52]:
# -*- coding:utf8 -*-

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from coinmarketcap.items import CurrencyItem

class CurrencyLoader(ItemLoader):
    pass

class WeeklySpider(CrawlSpider):
    name = 'weekly'
    allowed_domains = ['coinmarketcap.com']
    start_urls = ['https://coinmarketcap.com/historical/']
    only_2018_april_regex = '/201804[0-9]{2}'  # parsing the full history takes ~4 hrs

    rules = (
        Rule(LinkExtractor(allow=(only_2018_april_regex, )), callback='parse_weekly_report', follow=False),
    )

    def parse_weekly_report(self, response):
        # response.xpath replaces the deprecated HtmlXPathSelector wrapper
        items_html = response.xpath('//table[@id="currencies-all"]/tbody/tr')

        items = []

        for item_html in items_html:
            item = CurrencyItem()

            item['date'] = response.request.url.split('/')[-2]
            item['name'] = item_html.xpath('.//a[@class="currency-name-container"]/text()').extract()[0]
            item['symbol'] = item_html.xpath('.//td[contains(@class, "col-symbol")]/text()').extract()[0]
            item['market_cap'] = item_html.xpath('.//td[contains(@class, "market-cap")]/text()').extract()[0].strip()
            item['price'] = item_html.xpath('.//a[contains(@class, "price")]/text()').extract()[0]

            items.append(item)

        return items

ModuleNotFoundError: No module named 'coinmarketcap.items'

In [53]:
dump_to('./coinmarketcap/coinmarketcap/spiders/weekly.py')
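
Two small pieces of the spider can be checked without running a crawl: the link filter and the date taken from the URL. A hedged sketch, assuming the link extractor applies the `allow` pattern as a regex search over the URL. The `snapshot_date` helper is hypothetical, and of the sample URLs only the `20180401` one appears in this notebook; the other two are invented in the same shape.

```python
import re
from datetime import date, datetime

ONLY_2018_APRIL = re.compile(r'/201804[0-9]{2}')


def snapshot_date(url: str) -> date:
    """Parse the YYYYMMDD path segment of a snapshot URL with a trailing slash."""
    # 'https://host/historical/20180401/'.split('/') ends with ['20180401', '']
    return datetime.strptime(url.split('/')[-2], '%Y%m%d').date()


urls = [
    'https://coinmarketcap.com/historical/20180325/',  # invented sample
    'https://coinmarketcap.com/historical/20180401/',
    'https://coinmarketcap.com/historical/20180506/',  # invented sample
]

april = [u for u in urls if ONLY_2018_APRIL.search(u)]
print(april)
print(snapshot_date(april[0]))
```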

### Pipeline: for example, export to a database

Don't worry, there will be a separate seminar on this!

In [54]:
# -*- coding: utf-8 -*-

import os, logging
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine, Table, Column, Integer, String, Date, MetaData, ForeignKey
from sqlalchemy.engine.url import URL
from sqlalchemy.orm import sessionmaker
from sqlalchemy.pool import NullPool
from scrapy.exceptions import DropItem
from scrapy import signals
from coinmarketcap.items import CurrencyItem
import pandas as pd

logger = logging.getLogger(__name__)

DeclarativeBase = declarative_base()

class Currency(DeclarativeBase):
    __tablename__ = 'currency'
    __table_args__ = {'sqlite_autoincrement': True}

    id = Column('id', Integer, primary_key=True)
    date = Column('date', Date)
    name = Column('name', String)
    symbol = Column('symbol', String)
    market_cap = Column('market_cap', String)
    price = Column('price', String)

    def __init__(self, item):
        self.date = pd.to_datetime(item['date'], format='%Y%m%d')
        self.name = item['name']
        self.symbol = item['symbol']
        self.market_cap = item['market_cap']
        self.price = item['price']

    def __repr__(self):
        return "<Currency({0}, {1}, {2})>".format(self.id, self.symbol, self.market_cap)


class SqlitePipeline(object):
    def __init__(self, settings):
        self.database = settings.get('DATABASE')
        self.sessions = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls(crawler.settings)
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def create_engine(self):
        engine = create_engine(URL(**self.database), poolclass=NullPool)
        return engine

    def create_tables(self, engine):
        DeclarativeBase.metadata.create_all(engine, checkfirst=True)

    def create_session(self, engine):
        session = sessionmaker(bind=engine)()
        return session

    def spider_opened(self, spider):
        engine = self.create_engine()
        self.create_tables(engine)
        session = self.create_session(engine)
        self.sessions[spider] = session

    def spider_closed(self, spider):
        session = self.sessions.pop(spider)
        session.close()

    def process_item(self, item, spider):
        session = self.sessions[spider]
        currency = Currency(item)
        link_exists = session.query(Currency).filter_by(symbol=item['symbol'], date=item['date']).first() is not None

        if link_exists:
            logger.info('Item {} is in db'.format(currency))
            return item

        try:
            session.add(currency)
            session.commit()
            logger.info('Item {} stored in db'.format(currency))
        except:
            logger.info('Failed to add {} to db'.format(currency))
            session.rollback()
            raise

        return item

ModuleNotFoundError: No module named 'coinmarketcap.items'

In [55]:
dump_to('./coinmarketcap/coinmarketcap/pipelines.py')
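
The pipeline queries for an existing row before every insert. For intuition, here is a rough stdlib-only sketch of the same "store each (symbol, date) once" idea using plain `sqlite3`, a `UNIQUE` constraint and `INSERT OR IGNORE`. This is an alternative technique, not what the SQLAlchemy pipeline does; the `store` helper and the in-memory database are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS currency (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    date TEXT,
    name TEXT,
    symbol TEXT,
    market_cap TEXT,
    price TEXT,
    UNIQUE (symbol, date)
)
"""

INSERT = """
INSERT OR IGNORE INTO currency (date, name, symbol, market_cap, price)
VALUES (:date, :name, :symbol, :market_cap, :price)
"""


def store(connection, item):
    """Insert the item unless a row with the same symbol and date exists."""
    with connection:  # commits on success, rolls back on error
        connection.execute(INSERT, item)


connection = sqlite3.connect(':memory:')  # in-memory database for the demo
connection.execute(SCHEMA)

item = {
    'date': '2018-04-01',
    'name': 'Bitcoin',
    'symbol': 'BTC',
    'market_cap': '$116,889,698,943',
    'price': '$6895.74',
}
store(connection, item)
store(connection, item)  # duplicate: silently ignored by the UNIQUE constraint

row_count = connection.execute('SELECT COUNT(*) FROM currency').fetchone()[0]
print(row_count)
```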

### Settings: general scrapy settings

In [56]:
# -*- coding: utf-8 -*-

# Scrapy settings for coinmarketcap project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'coinmarketcap'

SPIDER_MODULES = ['coinmarketcap.spiders']
NEWSPIDER_MODULE = 'coinmarketcap.spiders'

DATABASE = {
    'drivername': 'sqlite',
    # 'host': 'localhost',
    # 'port': '5432',
    # 'username': 'YOUR_USERNAME',
    # 'password': 'YOUR_PASSWORD',
    'database': 'weekly.sqlite'
}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = '%s' % (BOT_NAME)

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'coinmarketcap.middlewares.CoinmarketcapSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'coinmarketcap.middlewares.CoinmarketcapDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'coinmarketcap.pipelines.SqlitePipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

LOG_FILE = 'crawling.log'

In [57]:
dump_to('./coinmarketcap/coinmarketcap/settings.py')

Let's look at the scrapy project structure once more

In [58]:
!cd coinmarketcap; ls -R

.:
coinmarketcap  scrapy.cfg

./coinmarketcap:
__init__.py  items.py  middlewares.py  pipelines.py  settings.py  spiders

./coinmarketcap/spiders:
__init__.py  weekly.py


### Done!

Let's run the spider!

We crawl only the April data; right now that is [1548](https://coinmarketcap.com/historical/20180401/) + [1531](https://coinmarketcap.com/historical/20180408/) + [1538](https://coinmarketcap.com/historical/20180415/) + [1554](https://coinmarketcap.com/historical/20180422/) = 6171 records.

ETA: 7-8 minutes on a good connection.

In [59]:
%%timeit -n 1 -r 1
!cd coinmarketcap ; scrapy crawl weekly

8min 30s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


Let's see what we got:

In [60]:
import sqlite3
import pandas as pd

connection = sqlite3.connect('./coinmarketcap/weekly.sqlite')

df = pd.read_sql_query("SELECT * FROM currency", connection)
print(df.shape)
df.head()

(6171, 6)


| | id | date | name | symbol | market_cap | price |
|---|---|---|---|---|---|---|
| 0 | 1 | 2018-04-01 | Bitcoin | BTC | $116,889,698,943 | $6895.74 |
| 1 | 2 | 2018-04-01 | Ethereum | ETH | $38,416,967,029 | $389.85 |
| 2 | 3 | 2018-04-01 | Ripple | XRP | $19,620,488,346 | $0.501873 |
| 3 | 4 | 2018-04-01 | Bitcoin Cash | BCH | $11,522,745,022 | $675.86 |
| 4 | 5 | 2018-04-01 | Litecoin | LTC | $6,440,675,643 | $115.25 |
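
Both `market_cap` and `price` are stored as strings such as `$116,889,698,943`, so they cannot be summed or sorted numerically as they are. A minimal sketch of converting them, assuming every value has the `$` prefix and comma thousands separators seen in the rows above (`parse_usd` is a hypothetical helper):

```python
def parse_usd(value: str) -> float:
    """Turn a string like '$116,889,698,943' into a number."""
    return float(value.replace('$', '').replace(',', ''))


market_caps = ['$116,889,698,943', '$38,416,967,029', '$19,620,488,346']
parsed = [parse_usd(v) for v in market_caps]
print(parsed)
print(parse_usd('$6895.74'))
```

With pandas the same function could be applied to a whole column, for example via `df['market_cap'].map(parse_usd)`.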


In [61]:
assert df.shape[0] == 6171