Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Response.urljoin() helper #1086

Merged
merged 2 commits into from Mar 27, 2015
Merged

Add Response.urljoin() helper #1086

merged 2 commits into from Mar 27, 2015

Conversation

@curita
Copy link
Member

@curita curita commented Mar 19, 2015

Discussed in #548.

def urljoin(self, uri):
"""Join this Response's url with a possible relative url to form an
absolute interpretation of the latter."""
return urlparse.urljoin(self.url, uri)

This comment has been minimized.

@dangra

dangra Mar 19, 2015
Member

what do you think about making it base url aware too?

This comment has been minimized.

@curita

curita Mar 19, 2015
Author Member

Don't think I understood this, you meant having the base url as a possible parameter?

This comment has been minimized.

@dangra

dangra Mar 19, 2015
Member

No, I mean extracting the base url from the response body using scrapy.utils.response.get_base_url() instead of joining directly with self.url.

This comment has been minimized.

@curita

curita Mar 19, 2015
Author Member

Ohh, the base url in the html. Not sure about it, it won't work if the response is not html right?

This comment has been minimized.

@curita

curita Mar 19, 2015
Author Member

Maybe we could add it just in HtmlResponse.

This comment has been minimized.

@dangra

dangra Mar 19, 2015
Member

Yes.

2015-03-19 19:57 GMT-03:00 Julia Medina notifications@github.com:

In scrapy/http/response/init.py
#1086 (comment):

@@ -75,3 +76,8 @@ def replace(self, _args, *_kwargs):
kwargs.setdefault(x, getattr(self, x))
cls = kwargs.pop('cls', self.class)
return cls(_args, *_kwargs)
+

  • def urljoin(self, uri):
  •    """Join this Response's url with a possible relative url to form an
    
  •    absolute interpretation of the latter."""
    
  •    return urlparse.urljoin(self.url, uri)
    

Maybe we could add it just in HtmlResponse.


Reply to this email directly or view it on GitHub
https://github.com/scrapy/scrapy/pull/1086/files#r26806283.

@@ -6,6 +6,7 @@
"""

import copy
import urlparse

This comment has been minimized.

@dangra

dangra Mar 19, 2015
Member

from six.moves.urllib.parse import urljoin

.. method:: Request.urljoin(uri)

Contructs an absolute url by combining the Response's :attr:`url` with
a possible local url.

This comment has been minimized.

@kmike

kmike Mar 19, 2015
Member

i think "relative url" is better - "local url" usually means file:// URL

@@ -172,6 +172,18 @@ Request objects
is given in the ``meta`` argument). See also
:ref:`topics-request-response-ref-request-callback-arguments`.

.. method:: Request.urljoin(uri)

This comment has been minimized.

@kmike

kmike Mar 19, 2015
Member

it should be Response.urljoin

This comment has been minimized.

@kmike

kmike Mar 19, 2015
Member

Stdlib's urljoin calls argument "url", not "uri", and I think it is more correct.

@curita
Copy link
Member Author

@curita curita commented Mar 19, 2015

Ouch, I got really distracted with this PR, I'll fix those right away.

@curita curita force-pushed the curita:response-urljoin branch from fe30a62 to cda3922 Mar 19, 2015
@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Mar 20, 2015

I would like to see a simple test, just to verify it is there.

@nyov
Copy link
Contributor

@nyov nyov commented Mar 24, 2015

I've added a simple test, and an override to TextResponse for also checking on baseurl, as @dangra mentioned.

Here's the diff. wow, with merge button :)

I don't want to open another PR,
If this is okay, please pull it into the existing one here. And feel free to modify it, if I missed anything.

git remote add nyov https://github.com/nyov/scrapy
git remote update
git checkout response-urljoin
git merge --ff-only nyov/response-urljoin

@nyov
Copy link
Contributor

@nyov nyov commented Mar 24, 2015

Added a second test for the base url.

@kmike
Copy link
Member

@kmike kmike commented Mar 25, 2015

@nyov looks goo, thanks! A couple of questions:

  1. Why is this helper added to TextResponse and not to HtmlResponse?
  2. What do you think about caching get_base_url result?

@dangra
Copy link
Member

@dangra dangra commented Mar 25, 2015

What do you think about caching get_base_url result?

get_base_url() already caches the result in a weakkey dict using the response as key.

@nyov
Copy link
Contributor

@nyov nyov commented Mar 25, 2015

@kmike, I did this because it seemed to me that TextResponse is actually the class that handles html (because of possible MIME or content-type mis-detection IIRC), while HtmlResponse is just an empty subclass. (Also the selector property is added to the TextResponse class, not HtmlResponse).

edit: note: I had to change scrapy/utils/response.py to lazy-load responses in open_in_browser, because otherwise it seems there is a circular import loop with scrapy/http/response/text.py now and tests would fail.

@curita
Copy link
Member Author

@curita curita commented Mar 26, 2015

Thanks @nyov for the tests!
I agree that TextResponse seems the right place to add urljoin() using get_base_url(). Right now it's where all html specific functions are declared and get_base_url() works fine if response.body is not html (it fallbacks to response.url).

I'd change the name of test_urljoin in TextResponseTest to something else (test_urljoin_with_base_url?) instead of replacing the one in BaseResponseTest since both cases should work in TextResponse, but otherwise I think I can rebase those tests and the PR should be complete.

@nyov
Copy link
Contributor

@nyov nyov commented Mar 26, 2015

I didn't think it matters since TextResponse.urljoin() would always override Response.urljoin()?
But here you go (same place as before).

Side note: I noticed get_base_url() does not seem to handle relative links correctly.
According to MDN, relative URIs are allowed in <base href>.

@nyov
Copy link
Contributor

@nyov nyov commented Mar 27, 2015

Oh! I was wrong. get_base_url() does work on relative urls.
I tested two relative root-urls and assumed they would be concatenated, but that didn't happen.
Since my base-url didn't end in a slash, it got replaced. But thats correct according to the RFC.
Sorry.

and add evaluation of base-url for HtmlResponse.
@curita
Copy link
Member Author

@curita curita commented Mar 27, 2015

Just rebased @nyov tests.

dangra added a commit that referenced this pull request Mar 27, 2015
@dangra dangra merged commit 55a23d1 into scrapy:master Mar 27, 2015
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@curita curita deleted the curita:response-urljoin branch Apr 14, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants