
Expose certificate for HTTPS responses #4054

Merged — 5 commits into scrapy:master, Feb 22, 2020

Conversation

@elacuesta (Member) commented Oct 1, 2019

Fixes #2726

The approach I took here is very similar to the one in #3940.

No tests at the moment (I'm not sure how to test this without requesting external pages, since the mockserver uses HTTP; any help is appreciated ❤️). Posting a scrapy shell session instead:

```
$ scrapy shell https://example.org
2019-10-01 14:23:58 [scrapy.utils.log] INFO: Scrapy 1.7.0 started (bot: scrapybot)
(...)
Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fdfc53b0208>
[s]   item       {}
[s]   request    <GET https://example.org>
[s]   response   <200 https://example.org>
[s]   settings   <scrapy.settings.Settings object at 0x7fdfc1147898>
[s]   spider     <DefaultSpider 'default' at 0x7fdfc073af98>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: response.certificate
Out[1]: <Certificate Subject=b'www.example.org' Issuer=b'DigiCert SHA2 Secure Server CA'>

In [2]: response.certificate.getIssuer()
Out[2]: <DN 'commonName': b'DigiCert SHA2 Secure Server CA', 'organizationName': b'DigiCert Inc', 'countryName': b'US'>

In [3]: print(response.certificate.getIssuer().inspect())
               Common Name: DigiCert SHA2 Secure Server CA
              Country Name: US
         Organization Name: DigiCert Inc
```
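To illustrate how a spider callback could consume the new attribute: `response.certificate` is a `twisted.internet.ssl.Certificate` for HTTPS responses and `None` otherwise, so callbacks should guard against `None`. This is a minimal sketch; the `describe_certificate` helper and the `_Fake*` stand-ins are made up for illustration and are not part of Scrapy:

```python
# Illustrative sketch of consuming the new Response.certificate attribute.
# response.certificate is a twisted.internet.ssl.Certificate for HTTPS
# responses and None otherwise, so guard before using it.
# The helper and the _Fake* classes below are hypothetical.

def describe_certificate(response):
    """Return a one-line description of a response's TLS certificate."""
    cert = getattr(response, "certificate", None)
    if cert is None:
        return "no certificate (non-HTTPS response)"
    # Certificate objects expose getSubject()/getIssuer(), as in the
    # shell session above.
    return "certificate issued by: %s" % (cert.getIssuer(),)

# Stand-ins so the sketch runs without a network round trip:
class _FakeIssuer:
    def __str__(self):
        return "DigiCert SHA2 Secure Server CA"

class _FakeCert:
    def getIssuer(self):
        return _FakeIssuer()

class _FakeResponse:
    certificate = None

http_response = _FakeResponse()
https_response = _FakeResponse()
https_response.certificate = _FakeCert()

print(describe_certificate(http_response))   # -> no certificate (non-HTTPS response)
print(describe_certificate(https_response))  # -> certificate issued by: DigiCert SHA2 Secure Server CA
```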

@codecov codecov bot commented Oct 1, 2019

Codecov Report

Merging #4054 into master will decrease coverage by 0.18%.
The diff coverage is 100%.

```
@@            Coverage Diff             @@
##           master    #4054      +/-   ##
==========================================
- Coverage   83.92%   83.73%   -0.19%
==========================================
  Files         166      165       -1
  Lines        9877     9900      +23
  Branches     1470     1470
==========================================
+ Hits         8289     8290       +1
- Misses       1334     1354      +20
- Partials      254      256       +2
```

| Impacted Files | Coverage Δ |
| --- | --- |
| scrapy/utils/reactor.py | 83.33% <100%> (+11.45%) ⬆️ |
| scrapy/signals.py | 100% <100%> (ø) ⬆️ |
| scrapy/utils/python.py | 80.1% <100%> (ø) ⬆️ |
| scrapy/dupefilters.py | 98.11% <100%> (ø) ⬆️ |
| scrapy/crawler.py | 89.26% <100%> (ø) ⬆️ |
| scrapy/utils/log.py | 88.76% <100%> (-0.13%) ⬇️ |
| scrapy/core/spidermw.py | 100% <100%> (ø) ⬆️ |
| scrapy/utils/defer.py | 95% <100%> (-2.5%) ⬇️ |
| scrapy/settings/default_settings.py | 98.71% <100%> (ø) ⬆️ |
| scrapy/core/downloader/__init__.py | 89.39% <100%> (+0.08%) ⬆️ |
| ... and 5 more | |

docs/topics/request-response.rst — review thread (outdated, resolved)
@kmike (Member) commented Oct 1, 2019

We can have an HTTPS mockserver as well if we want; e.g. https://github.com/scrapinghub/splash/blob/master/splash/tests/mockserver.py has one (if I'm not mistaken, it was originally based on Scrapy's and has evolved since then).

Hmm, actually, https://github.com/scrapy/scrapy/blob/master/tests/mockserver.py also has an HTTPS option.

@elacuesta (Member, Author) commented Oct 2, 2019

@kmike I'm having trouble with the mockserver in HTTPS mode. Sadly, I don't have much experience with self-signed certificates (which I assume the mockserver uses).

```diff
diff --git a/tests/test_crawl.py b/tests/test_crawl.py
index 3fc13eeb..94d2d8ec 100644
--- a/tests/test_crawl.py
+++ b/tests/test_crawl.py
@@ -3,6 +3,7 @@ import logging
 
 from testfixtures import LogCapture
 from twisted.internet import defer
+from twisted.internet.ssl import Certificate
 from twisted.trial.unittest import TestCase
 
 from scrapy.http import Request
@@ -277,3 +278,19 @@ with multiples lines
 
         self._assert_retried(log)
         self.assertIn("Got response 200", str(log))
+
+    @defer.inlineCallbacks
+    def test_response_ssl_certificate_none(self):
+        crawler = self.runner.create_crawler(SingleRequestSpider)
+        url = self.mockserver.url("/status?n=200", is_secure=False)
+        yield crawler.crawl(seed=url, mockserver=self.mockserver)
+        self.assertIsNone(crawler.spider.meta['responses'][0].certificate)
+
+    @defer.inlineCallbacks
+    def test_response_ssl_certificate(self):
+        crawler = self.runner.create_crawler(SingleRequestSpider)
+        url = self.mockserver.url("/status?n=200", is_secure=True)
+        yield crawler.crawl(seed=url, mockserver=self.mockserver)
+        cert = crawler.spider.meta['responses'][0].certificate
+        print('Response.certificate:', cert)
+        self.assertIsInstance(cert, Certificate)
```
```
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <tests.test_crawl.CrawlTestCase testMethod=test_response_ssl_certificate>, msg = "None is not an instance of <class 'twisted.internet._sslverify.Certificate'>"

    def fail(self, msg=None):
        """
        Absolutely fail the test.  Do not pass go, do not collect $200.

        @param msg: the message that will be displayed as the reason for the
        failure
        """
>       raise self.failureException(msg)
E       twisted.trial.unittest.FailTest: None is not an instance of <class 'twisted.internet._sslverify.Certificate'>

/.../scrapy/.tox/py35/lib/python3.5/site-packages/twisted/trial/_synctest.py:375: FailTest
------------------------------ Captured stdout call ------------------------------
Response.certificate: None
------------------------------ Captured stderr call ------------------------------
Coverage.py warning: --include is ignored because --source is set (include-ignored)
------------------------------- Captured log call --------------------------------
WARNING  scrapy.core.downloader.tls:tls.py:90 Ignoring error while verifying certificate from host "0.0.0.0" (exception: ValueError('Invalid DNS-ID.',))
================================ warnings summary ================================
tests/test_crawl.py::CrawlTestCase::test_crawl_multiple
  /.../scrapy/scrapy/core/downloader/webclient.py:4: DeprecationWarning: twisted.web.client.HTTPClientFactory was deprecated in Twisted 16.7.0: please use https://pypi.org/project/treq/ or twisted.web.client.Agent instead
    from twisted.web.client import HTTPClientFactory

tests/test_crawl.py::CrawlTestCase::test_response_ssl_certificate
  /.../scrapy/scrapy/core/downloader/contextfactory.py:53: DeprecationWarning: Passing method to twisted.internet.ssl.CertificateOptions was deprecated in Twisted 17.1.0. Please use a combination of insecurelyLowerMinimumTo, raiseMinimumTo, and lowerMaximumSecurityTo instead, as Twisted will correctly configure the method.
    acceptableCiphers=self.tls_ciphers)

tests/test_crawl.py::CrawlTestCase::test_response_ssl_certificate
  /.../scrapy/.tox/py35/lib/python3.5/site-packages/service_identity/pyopenssl.py:49: SubjectAltNameWarning: Certificate with CN 'localhost' has no `subjectAltName`, falling back to check for a `commonName` for now.  This feature is being removed by major browsers and deprecated by RFC 2818.  service_identity will remove the support for it in mid-2018.
    cert_patterns=extract_ids(connection.get_peer_certificate()),

-- Docs: https://docs.pytest.org/en/latest/warnings.html
```

@Gallaecio (Member) commented Oct 14, 2019

@elacuesta Could you push your WIP for certificate tests?

@elacuesta (Member, Author) commented Oct 23, 2019

> @elacuesta Could you push your WIP for certificate tests?

@Gallaecio Pushed 3f76c85, with the changes from #4054 (comment). Sorry I missed this comment.

@Gallaecio (Member) commented Oct 29, 2019

The issue is in the Scrapy code, not the mock server. When the response body is empty, the certificate is set to None in _cb_bodyready.

I tried reading it from txresponse._transport._producer.getPeerCertificate() at that point, but I get AttributeError("'NoneType' object has no attribute 'getPeerCertificate'",).

So:

  • Add copies of your tests that use the /echo endpoint instead of /status?n=200.
  • Fix the empty-body issue if you know how or feel like it. Otherwise, I'm OK with marking that test as an expected failure and, after merging these changes, creating a new ticket to eventually fix that scenario.
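The diagnosis above can be sketched as defensive code. This is a hedged illustration, not the actual Scrapy implementation: the `extract_certificate` helper is hypothetical, and the `_producer` attribute is taken from the `AttributeError` quoted above (for empty-body responses the producer can already be gone):

```python
# Hedged sketch (illustrative, NOT the actual Scrapy code): read the TLS
# peer certificate defensively. As described above, for empty-body
# responses the transport's producer may already be None, which is what
# raised AttributeError on getPeerCertificate().

def extract_certificate(transport):
    """Return a twisted Certificate for the connection, or None."""
    producer = getattr(transport, "_producer", None)
    if producer is None:
        # Producer already torn down (the empty-body case): nothing to read.
        return None
    peer_cert = producer.getPeerCertificate()
    if peer_cert is None:
        # Plain-HTTP connection, or no certificate presented by the peer.
        return None
    # Imported lazily so the None paths also work without pyOpenSSL.
    from twisted.internet.ssl import Certificate
    return Certificate(peer_cert)

# Producer missing, as in the empty-body scenario described above:
print(extract_certificate(object()))  # -> None
```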

@elacuesta (Member, Author) commented Oct 29, 2019

Nice, I updated the tests to use /echo.
Many thanks for the debugging 😊

@Gallaecio (Member) commented Oct 29, 2019

I think we should also keep tests with /status?n=200, and either mark the SSL one as an expected failure or fix the issue.

@elacuesta force-pushed the expose-certificate-for-https-responses branch from 2b51c06 to 41f122c (Nov 27, 2019)
scrapy/core/downloader/handlers/http11.py — review thread (outdated, resolved)
docs/topics/request-response.rst — review thread (outdated, resolved)
docs/topics/request-response.rst — review thread (outdated, resolved)
@elacuesta force-pushed from 41f122c to ce5b3ea (Dec 4, 2019)
@elacuesta force-pushed from ce5b3ea to 998cc3b (Dec 5, 2019)
@elacuesta force-pushed 2 times, most recently from 43972f2 to f5a01f6 (Dec 24, 2019)
@elacuesta force-pushed from f5a01f6 to ede242b (Jan 26, 2020)
docs/conf.py — review thread (outdated, resolved)
@elacuesta force-pushed from ede242b to 928eb54 (Jan 29, 2020)
@elacuesta force-pushed from 928eb54 to 556fa19 (Feb 3, 2020)
@wRAR approved these changes (Feb 22, 2020)
@wRAR merged commit 9d983c1 into scrapy:master (Feb 22, 2020)
2 checks passed
@elacuesta deleted the expose-certificate-for-https-responses branch (Feb 23, 2020)