RobotsTxtMiddleware doesn't support wildcards in Disallow rules #754

Closed
mattfullerton opened this issue Jun 21, 2014 · 16 comments

@mattfullerton

This is because, as I understand it, Python's robotparser module doesn't support them either. There is an alternative drop-in module, robotexclusionrulesparser:

http://nikitathespider.com/python/rerp/

This line would need to be changed:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/robotstxt.py#L7
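
To illustrate the limitation, here is a minimal sketch (Python 3 stdlib shown; Python 2's robotparser behaves the same way, and the domain and rule are made up): the stdlib parser treats the '*' in a Disallow path literally, so a URL that a wildcard-aware parser would block is reported as allowed.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private*/",
])
# A wildcard-aware parser would report False here; the stdlib parser
# matches the literal (quoted) prefix "/private%2A/" and allows the URL.
print(rp.can_fetch("mybot", "http://example.com/private123/page"))  # True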

@mattfullerton
Author

It's not that simple :) robotexclusionrulesparser doesn't work any better for me, but reppy does (https://pypi.python.org/pypi/reppy), and it looks like it's taking care of caching as well. I could submit a pull request with the alternative code, or instead add a new middleware to contrib... what would be best?
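
For reference, a hedged sketch of checking a wildcard rule with reppy. This uses reppy's current API (reppy.robots.Robots), which differs from the 2014-era API discussed in this thread; the domain and rule are made up.

from reppy.robots import Robots

# Parse a robots.txt body obtained elsewhere (no HTTP is done here).
robots = Robots.parse("http://example.com/robots.txt",
                      "User-agent: *\nDisallow: /private*/\n")
# reppy honours the wildcard, so this URL is reported as disallowed.
print(robots.allowed("http://example.com/private123/page", "mybot"))  # False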

@pablohoffman
Member

I don't think there's room for two built-in robots.txt implementations in Scrapy, so I'd favor replacing the built-in one with one based on reppy, if we agree it's better. Does reppy work well with an asynchronous API?

@mattfullerton
Author

I can't give a good answer to that. It's a fairly straightforward module: https://github.com/seomoz/reppy/tree/master/reppy

I'm using it right now with the normal concurrent-requests settings, with good results.

@kmike
Member

kmike commented Jul 7, 2014

I don't like some parts of reppy - for example, it has "install_requires: python-dateutil==1.5" in setup.py (which could downgrade a user's python-dateutil), it doesn't have a CI setup, and it doesn't provide an official list of supported Python versions. On the other hand, the reppy code is OK: the parser is separated from the downloader and the cache, and there is a good test suite.

@kmike
Member

kmike commented Sep 23, 2014

reppy now has a CI setup, and the python-dateutil==1.5 pin was removed from install_requires. It provides other features that could be useful; see #892. I'm fine with depending on it.

@graingerkid

Just wondering if there have been any more developments on this?

@mattfullerton
Author

I'll see if I can change the middleware and requirements accordingly.

@mattfullerton
Author

I want to test before I make a pull request. Both in my local tests and when running my repo on Travis, I get a connection error for http://site.local, which is used by the test:
https://github.com/scrapy/scrapy/blob/master/tests/test_downloadermiddleware_robotstxt.py

My guess is that Python's robotparser doesn't throw an exception if the site doesn't exist but reppy does, although that wouldn't explain how the test is able to pass. I see there is a test website in Scrapy, but it doesn't include those paths, and I see nothing in the test config or the docs about starting a webserver.

Any ideas?

@kmike
Member

kmike commented Nov 12, 2014

@mattfullerton if some parts of reppy make HTTP requests to check whether a site is available, we shouldn't use those parts. For Scrapy it is fine to use reppy.parser; reppy.cache is not OK.

The test doesn't make a real HTTP request; the response is mocked, see crawler.engine.download.side_effect = return_response. I agree that the code is not straightforward, so refactoring it to be easier to follow (if that's possible) would be a welcome addition.
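
In condensed form (a sketch, not the actual test code; names like ROBOTS_BODY are illustrative), the mocking works roughly like this:

from unittest import mock
from twisted.internet import defer
from scrapy.http import Request, TextResponse

ROBOTS_BODY = b"User-agent: *\nDisallow: /admin/\n"

def return_response(request, spider=None):
    # Serve a canned robots.txt body instead of hitting the network,
    # wrapped in an already-fired Deferred.
    return defer.succeed(TextResponse(request.url, body=ROBOTS_BODY))

crawler = mock.MagicMock()
crawler.engine.download.side_effect = return_response

# Any call the middleware makes to crawler.engine.download(...) now
# yields the canned response instead of a real download.
d = crawler.engine.download(Request("http://site.local/robots.txt"))
d.addCallback(lambda response: print(response.body))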

Making a PR with failing tests is fine if you want to discuss the changes; it won't be merged as-is, but it'd be easier for other people to check.

I don't know why the tests are failing for you - are they failing with unmodified Scrapy, or after changes you've made? Travis tests pass for Scrapy master: https://travis-ci.org/scrapy/scrapy/builds/40556422

@mattfullerton
Author

Thanks, now I understand. Indeed it was my code that was failing on Travis, although there is a set of failures I get locally and on Travis involving the ProxyServer that has nothing to do with my changes.

I've now made the changes so that Scrapy does the downloading and reppy only does the parsing and allowing/disallowing, and I've opened a pull request.

@redapple
Contributor

redapple commented Jan 7, 2016

I believe this is the one: #949
@mattfullerton, are you still working on it?

@mattfullerton
Author

No 😊

@Parth-Vader
Member

Hi, I'm trying to make a PR for this issue.
I have made the required changes, and after running the local test tests/test_downloadermiddleware_robotstxt.py, one test fails.

The error is:

twisted.internet.defer.FirstError: FirstError[#1, [Failure instance: Traceback: <class 'twisted.trial.unittest.FailTest'>: did not catch an error, instead got None
/home/parth/Codes/scrapy/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py:653:_runCallbacks
/home/parth/Codes/scrapy/scrapy/downloadermiddlewares/robotstxt.py:77:cb
/home/parth/Codes/scrapy/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py:459:callback
/home/parth/Codes/scrapy/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py:567:_startRunCallbacks
--- <exception caught here> ---
/home/parth/Codes/scrapy/.tox/py35/lib/python3.5/site-packages/twisted/internet/defer.py:653:_runCallbacks
/home/parth/Codes/scrapy/.tox/py35/lib/python3.5/site-packages/twisted/trial/_asynctest.py:67:_cb
]]

@redapple
Contributor

@Parth-Vader, which test fails?
You can usually see the failing test when running tox -- -v:

$ tox -- -v tests/test_downloadermiddleware_robotstxt.py
GLOB sdist-make: /home/paul/src/scrapy/setup.py
py27 inst-nodeps: /home/paul/src/scrapy/.tox/dist/Scrapy-1.3.2.zip
py27 installed: appdirs==1.4.3,asn1crypto==0.21.1,attrs==16.3.0,Automat==0.5.0,backports.shutil-get-terminal-size==1.0.0,blessings==1.6,botocore==1.5.24,bpython==0.16,brotlipy==0.6.0,cffi==1.9.1,click==6.7,constantly==15.1.0,coverage==4.3.4,cryptography==1.8.1,cssselect==1.0.1,curtsies==0.2.11,decorator==4.0.11,docutils==0.13.1,enum34==1.1.6,Flask==0.12,funcsigs==1.0.2,greenlet==0.4.12,idna==2.5,incremental==16.10.1,ipaddress==1.0.18,ipython==5.3.0,ipython-genutils==0.1.0,itsdangerous==0.24,Jinja2==2.9.5,jmespath==0.9.2,leveldb==0.194,lxml==3.7.3,MarkupSafe==1.0,mitmproxy==0.10.1,mock==2.0.0,netlib==0.10.1,olefile==0.44,packaging==16.8,parsel==1.1.0,pathlib2==2.2.1,pbr==2.0.0,pexpect==4.2.1,pickleshare==0.7.4,Pillow==4.0.0,pkg-resources==0.0.0,prompt-toolkit==1.0.13,ptyprocess==0.5.1,py==1.4.32,pyasn1==0.2.3,pyasn1-modules==0.0.8,pycparser==2.17,PyDispatcher==2.0.5,Pygments==2.2.0,pyOpenSSL==16.2.0,pyparsing==2.2.0,pytest==2.9.2,pytest-cov==2.2.1,pytest-twisted==1.5,python-dateutil==2.6.0,queuelib==1.4.2,requests==2.13.0,scandir==1.5,Scrapy==1.3.2,service-identity==16.0.0,simplegeneric==0.8.1,six==1.10.0,testfixtures==4.13.5,traitlets==4.3.2,Twisted==17.1.0,urwid==1.3.1,w3lib==1.17.0,wcwidth==0.1.7,Werkzeug==0.12,zope.interface==4.3.3
py27 runtests: PYTHONHASHSEED='1336577254'
py27 runtests: commands[0] | py.test --cov=scrapy --cov-report= -v tests/test_downloadermiddleware_robotstxt.py
============================================================================================================================= test session starts =============================================================================================================================
platform linux2 -- Python 2.7.12+, pytest-2.9.2, py-1.4.32, pluggy-0.3.1 -- /home/paul/src/scrapy/.tox/py27/bin/python2.7
cachedir: .cache
rootdir: /home/paul/src/scrapy, inifile: pytest.ini
plugins: twisted-1.5, cov-2.2.1
collected 9 items 

tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_ignore_robotstxt_request PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_empty_response PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_error PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_garbage PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_immediate_error PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_meta PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_ready_parser PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_settings PASSED

========================================================================================================================== 9 passed in 1.58 seconds ===========================================================================================================================
___________________________________________________________________________________________________________________________________ summary ___________________________________________________________________________________________________________________________________
  py27: commands succeeded
  congratulations :)

@Parth-Vader
Member

@redapple
The second test is failing:

tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_ignore_robotstxt_request PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt FAILED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_empty_response PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_error PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_garbage PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_immediate_error PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_meta PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_ready_parser PASSED
tests/test_downloadermiddleware_robotstxt.py::RobotsTxtMiddlewareTest::test_robotstxt_settings PASSED

I will look into it.

@Gallaecio
Member

Fixed by #3796. It is now possible to select a different robots.txt parser that does support wildcards in Disallow rules, or even to write your own parser or your own adapter for an existing parser.
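
A sketch of the interface that change introduced: subclass scrapy.robotstxt.RobotParser and select it with the ROBOTSTXT_PARSER setting. The class below is an allow-all placeholder rather than a real parser, and myproject.robotstxt is an illustrative module path.

from scrapy.robotstxt import RobotParser

class MyRobotParser(RobotParser):
    def __init__(self, robotstxt_body):
        self.body = robotstxt_body.decode("utf-8", errors="ignore")

    @classmethod
    def from_crawler(cls, crawler, robotstxt_body):
        return cls(robotstxt_body)

    def allowed(self, url, user_agent):
        # A real implementation would match url against the parsed
        # rules, including wildcard Disallow patterns; this placeholder
        # allows everything.
        return True

# In settings.py:
# ROBOTSTXT_OBEY = True
# ROBOTSTXT_PARSER = "myproject.robotstxt.MyRobotParser"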
