Merge remote-tracking branch 'upstream/master' into py3_single_argument_processors
elacuesta committed Sep 3, 2019
2 parents 4d23a75 + d4b8bf1 commit b3981d3
Showing 50 changed files with 1,748 additions and 587 deletions.
41 changes: 41 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,41 @@
---
name: Bug report
about: Report a problem to help us improve
---

<!--
Thanks for taking an interest in Scrapy!
If you have a question that starts with "How to...", please see the Scrapy Community page: https://scrapy.org/community/.
The GitHub issue tracker's purpose is to deal with bug reports and feature requests for the project itself.
Keep in mind that by filing an issue, you are expected to comply with Scrapy's Code of Conduct, including treating everyone with respect: https://github.com/scrapy/scrapy/blob/master/CODE_OF_CONDUCT.md
The following is a suggested template to structure your issue; you can find more guidelines at https://doc.scrapy.org/en/latest/contributing.html#reporting-bugs
-->

### Description

[Description of the issue]

### Steps to Reproduce

1. [First Step]
2. [Second Step]
3. [and so on...]

**Expected behavior:** [What you expect to happen]

**Actual behavior:** [What actually happens]

**Reproduces how often:** [What percentage of the time does it reproduce?]

### Versions

Please paste here the output of executing `scrapy version --verbose` in the command line.

### Additional context

Any additional information, configuration, data or output from commands that might be necessary to reproduce or understand the issue. Please try not to include screenshots of code or the command line; paste the contents as text instead. You can use [GitHub Flavored Markdown](https://help.github.com/en/articles/creating-and-highlighting-code-blocks) to make the text look better.
33 changes: 33 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,33 @@
---
name: Feature request
about: Suggest an idea for an enhancement or new feature
---

<!--
Thanks for taking an interest in Scrapy!
If you have a question that starts with "How to...", please see the Scrapy Community page: https://scrapy.org/community/.
The GitHub issue tracker's purpose is to deal with bug reports and feature requests for the project itself.
Keep in mind that by filing an issue, you are expected to comply with Scrapy's Code of Conduct, including treating everyone with respect: https://github.com/scrapy/scrapy/blob/master/CODE_OF_CONDUCT.md
The following is a suggested template to structure your feature request; you can find more guidelines at https://doc.scrapy.org/en/latest/contributing.html#writing-patches and https://doc.scrapy.org/en/latest/contributing.html#submitting-patches
-->

## Summary

One paragraph explanation of the feature.

## Motivation

Why are we doing this? What use cases does it support? What is the expected outcome?

## Describe alternatives you've considered

A clear and concise description of the alternative solutions you've considered. Be sure to explain why Scrapy's existing customizability isn't suitable for this feature.

## Additional context

Any additional information about the feature request here.
43 changes: 23 additions & 20 deletions .travis.yml
@@ -1,31 +1,34 @@
language: python
dist: xenial
branches:
only:
- master
- /^\d\.\d+$/
- /^\d\.\d+\.\d+(rc\d+|\.dev\d+)?$/
matrix:
include:
- python: 2.7
env: TOXENV=py27
- python: 2.7
env: TOXENV=jessie
- python: 2.7
env: TOXENV=pypy
- python: 2.7
env: TOXENV=pypy3
- python: 3.4
env: TOXENV=py34
- python: 3.5
env: TOXENV=py35
- python: 3.6
env: TOXENV=py36
- python: 3.7
env: TOXENV=py37
dist: xenial
sudo: true
- python: 3.6
env: TOXENV=docs
- env: TOXENV=py27
python: 2.7
- env: TOXENV=py27-pinned
python: 2.7
- env: TOXENV=py27-extra-deps
python: 2.7
- env: TOXENV=pypy
python: 2.7
- env: TOXENV=pypy3
python: 3.5
- env: TOXENV=py35
python: 3.5
- env: TOXENV=py35-pinned
python: 3.5
- env: TOXENV=py36
python: 3.6
- env: TOXENV=py37
python: 3.7
- env: TOXENV=py37-extra-deps
python: 3.7
- env: TOXENV=docs
python: 3.6
install:
- |
if [ "$TOXENV" = "pypy" ]; then
2 changes: 1 addition & 1 deletion README.rst
@@ -40,7 +40,7 @@ https://scrapy.org
Requirements
============

* Python 2.7 or Python 3.4+
* Python 2.7 or Python 3.5+
* Works on Linux, Windows, Mac OSX, BSD

Install
10 changes: 10 additions & 0 deletions docs/conf.py
@@ -252,6 +252,16 @@

# Private exception used by the command-line interface implementation.
r'^scrapy\.exceptions\.UsageError',

# Methods of BaseItemExporter subclasses are only documented in
# BaseItemExporter.
r'^scrapy\.exporters\.(?!BaseItemExporter\b)\w*?\.',

# Extension behavior is only modified through settings. Methods of
# extension classes, as well as helper functions, are implementation
# details that are not documented.
r'^scrapy\.extensions\.[a-z]\w*?\.[A-Z]\w*?\.', # methods
r'^scrapy\.extensions\.[a-z]\w*?\.[a-z]', # helper functions
]


2 changes: 1 addition & 1 deletion docs/faq.rst
@@ -69,7 +69,7 @@ Here's an example spider using BeautifulSoup API, with ``lxml`` as the HTML pars
What Python versions does Scrapy support?
-----------------------------------------

Scrapy is supported under Python 2.7 and Python 3.4+
Scrapy is supported under Python 2.7 and Python 3.5+
under CPython (default Python implementation) and PyPy (starting with PyPy 5.9).
Python 2.6 support was dropped starting at Scrapy 0.20.
Python 3 support was added in Scrapy 1.1.
2 changes: 1 addition & 1 deletion docs/intro/install.rst
@@ -7,7 +7,7 @@ Installation guide
Installing Scrapy
=================

Scrapy runs on Python 2.7 and Python 3.4 or above
Scrapy runs on Python 2.7 and Python 3.5 or above
under CPython (default Python implementation) and PyPy (starting with PyPy 5.9).

If you're using `Anaconda`_ or `Miniconda`_, you can install the package from
17 changes: 11 additions & 6 deletions docs/news.rst
@@ -6,6 +6,11 @@ Release notes
.. note:: Scrapy 1.x will be the last series supporting Python 2. Scrapy 2.0,
planned for Q4 2019 or Q1 2020, will support **Python 3 only**.

Scrapy 1.7.3 (2019-08-01)
-------------------------

Enforce lxml 4.3.5 or lower for Python 3.4 (:issue:`3912`, :issue:`3918`).

Scrapy 1.7.2 (2019-07-23)
-------------------------

@@ -75,8 +80,8 @@ New features
provides a cleaner way to pass keyword arguments to callback methods
(:issue:`1138`, :issue:`3563`)

* A new :class:`~scrapy.http.JSONRequest` class offers a more convenient way
to build JSON requests (:issue:`3504`, :issue:`3505`)
* A new :class:`JSONRequest <scrapy.http.JsonRequest>` class offers a more
convenient way to build JSON requests (:issue:`3504`, :issue:`3505`)

* A ``process_request`` callback passed to the :class:`~scrapy.spiders.Rule`
constructor now receives the :class:`~scrapy.http.Response` object that
@@ -1264,8 +1269,8 @@ This 1.1 release brings a lot of interesting features and bug fixes:
this behavior, update :setting:`ROBOTSTXT_OBEY` in ``settings.py`` file
after creating a new project.
- Exporters now work on unicode, instead of bytes by default (:issue:`1080`).
If you use ``PythonItemExporter``, you may want to update your code to
disable binary mode which is now deprecated.
If you use :class:`~scrapy.exporters.PythonItemExporter`, you may want to
update your code to disable binary mode which is now deprecated.
- Accept XML node names containing dots as valid (:issue:`1533`).
- When uploading files or images to S3 (with ``FilesPipeline`` or
``ImagesPipeline``), the default ACL policy is now "private" instead
@@ -1403,8 +1408,8 @@ Bugfixes
- Fixed bug on ``XMLItemExporter`` with non-string fields in
items (:issue:`1738`).
- Fixed startproject command in OS X (:issue:`1635`).
- Fixed PythonItemExporter and CSVExporter for non-string item
types (:issue:`1737`).
- Fixed :class:`~scrapy.exporters.PythonItemExporter` and CSVExporter for
non-string item types (:issue:`1737`).
- Various logging related fixes (:issue:`1294`, :issue:`1419`, :issue:`1263`,
:issue:`1624`, :issue:`1654`, :issue:`1722`, :issue:`1726` and :issue:`1303`).
- Fixed bug in ``utils.template.render_templatefile()`` (:issue:`1212`).
30 changes: 27 additions & 3 deletions docs/topics/developer-tools.rst
@@ -252,17 +252,41 @@ If the handy ``has_next`` element is ``true`` (try loading
`quotes.toscrape.com/api/quotes?page=10`_ in your browser or a
page-number greater than 10), we increment the ``page`` attribute
and ``yield`` a new request, inserting the incremented page-number
into our ``url``.
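
A rough sketch of such a spider (the class name, the ``page`` attribute and
the JSON field names are assumptions based on the description above, not the
exact code from earlier in this page) could look like this::

    import json

    import scrapy


    class QuotesScrollSpider(scrapy.Spider):
        name = "quotes-scroll"
        page = 1
        start_urls = ["http://quotes.toscrape.com/api/quotes?page=1"]

        def parse(self, response):
            data = json.loads(response.text)
            # Yield one item per quote in the JSON payload
            for quote in data.get("quotes", []):
                yield {"author": quote["author"]["name"], "text": quote["text"]}
            # If the API reports another page, increment the page number and
            # request it, as described above
            if data.get("has_next"):
                self.page += 1
                url = "http://quotes.toscrape.com/api/quotes?page={}".format(self.page)
                yield scrapy.Request(url=url, callback=self.parse)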

You can see that with a few inspections in the `Network`-tool we
.. _requests-from-curl:

On more complex websites it can be difficult to reproduce requests by hand,
since we may need to add ``headers`` or ``cookies`` to make them work.
In those cases you can export the requests in `cURL <https://curl.haxx.se/>`_
format by right-clicking on each of them in the network tool, and then use the
:meth:`~scrapy.http.Request.from_curl()` method to generate an equivalent
request::

from scrapy import Request

request = Request.from_curl(
"curl 'http://quotes.toscrape.com/api/quotes?page=1' -H 'User-Agent: Mozil"
"la/5.0 (X11; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0' -H 'Acce"
"pt: */*' -H 'Accept-Language: ca,en-US;q=0.7,en;q=0.3' --compressed -H 'X"
"-Requested-With: XMLHttpRequest' -H 'Proxy-Authorization: Basic QFRLLTAzM"
"zEwZTAxLTk5MWUtNDFiNC1iZWRmLTJjNGI4M2ZiNDBmNDpAVEstMDMzMTBlMDEtOTkxZS00MW"
"I0LWJlZGYtMmM0YjgzZmI0MGY0' -H 'Connection: keep-alive' -H 'Referer: http"
"://quotes.toscrape.com/scroll' -H 'Cache-Control: max-age=0'")

Alternatively, if you want to know the arguments needed to recreate that
request, you can use the :func:`scrapy.utils.curl.curl_to_request_kwargs`
function to get a dictionary with the equivalent arguments.
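
A minimal sketch of that approach (using a shortened version of the cURL
command from the example above; the exact keys in the returned dictionary
depend on the command)::

    from scrapy import Request
    from scrapy.utils.curl import curl_to_request_kwargs

    curl_command = (
        "curl 'http://quotes.toscrape.com/api/quotes?page=1' "
        "-H 'Accept: */*' -H 'Referer: http://quotes.toscrape.com/scroll'")

    # Inspect the arguments that an equivalent Request would receive
    kwargs = curl_to_request_kwargs(curl_command)
    print(kwargs)  # e.g. a dict with "url", "method" and "headers" entries

    # The same dictionary can be passed back to Request
    request = Request(**kwargs)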

As you can see, with a few inspections in the `Network`-tool we
were able to easily replicate the dynamic requests of the scrolling
functionality of the page. Crawling dynamic pages can be quite
daunting and pages can be very complex, but it (mostly) boils down
to identifying the correct request and replicating it in your spider.

.. _Developer Tools: https://en.wikipedia.org/wiki/Web_development_tools
.. _quotes.toscrape.com: http://quotes.toscrape.com
.. _quotes.toscrape.com/scroll: quotes.toscrape.com/scroll/
.. _quotes.toscrape.com/scroll: http://quotes.toscrape.com/scroll
.. _quotes.toscrape.com/api/quotes?page=10: http://quotes.toscrape.com/api/quotes?page=10
.. _has-class-extension: https://parsel.readthedocs.io/en/latest/usage.html#other-xpath-extensions
