
[MRG+1] Issue #2919: Fix FormRequest.formdata with GET method duplicates same key in query string #3579

Merged
merged 30 commits into from
Jun 15, 2021

Conversation

maramsumanth
Contributor

@maramsumanth maramsumanth commented Jan 13, 2019

Solved #2919.

Combining querystr and formdata into a dictionary, thereby eliminating the duplicates. @kmike , @lopuhin , please review.
I have also corrected the code by creating a list where we want to pass duplicate keys, as described in comment #3269 (comment)

I am currently preparing for GSoC 2019, so please let me know about any mistakes and suggestions. I would be glad to receive advice on getting selected for Scrapy.
Thanks in advance :) 👍
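The approach described above can be sketched roughly as follows (an illustration only — the helper name `merge_get_formdata` and its details are assumptions, not the PR's actual diff):

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def merge_get_formdata(url, formdata):
    # Split the URL and parse its existing query string into (key, value) pairs.
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query)
    # Drop pairs whose key is overridden by the form data,
    # then append the form data itself.
    merged = [(k, v) for k, v in pairs if k not in formdata]
    merged += list(formdata.items())
    query = urlencode(merged)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

print(merge_get_formdata('http://example.com/?id=9&smth=a', {'id': '8'}))
# http://example.com/?smth=a&id=8
```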

@codecov

codecov bot commented Jan 13, 2019

Codecov Report

Merging #3579 (66e2004) into master (9f81de2) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #3579   +/-   ##
=======================================
  Coverage   88.18%   88.18%           
=======================================
  Files         162      162           
  Lines       10490    10490           
  Branches     1516     1516           
=======================================
  Hits         9251     9251           
  Misses        965      965           
  Partials      274      274           
Impacted Files Coverage Δ
scrapy/http/request/form.py 95.48% <100.00%> (ø)

@maramsumanth
Contributor Author

@lopuhin , what could be the reason here for the codecov/patch failure?

@elacuesta
Member

elacuesta commented Jan 16, 2019

I couldn't take a deep look into the algorithm itself; these are just a few Python suggestions to improve readability:

  1. There are some duplicated occurrences of urlsplit(self.url); you could assign that to a variable instead

  2. It's better to iterate sequences directly instead of by index, i.e.

In [1]: keys = ['a', 'b', 'c']

In [2]: for k in keys:  # good
   ...:     print(k)
a
b
c

In [3]: for i in range(len(keys)):  # not so good
   ...:     print(keys[i])
a
b
c

  3. If you need to iterate at the same time over more than one iterator, you could use zip

In [4]: values = [1, 2, 3]

In [5]: for k, v in zip(keys, values):
   ...:     print('{}={}'.format(k, v))
a=1
b=2
c=3

Finally, if I'm not mistaken, the Codecov failure is because some of the new lines are not hit by any of the existing tests; that's why it says "will decrease coverage by 0.04%"

@maramsumanth
Contributor Author

@elacuesta , thanks for the suggestion :) 👍. I have modified the code to improve readability.


@amarynets amarynets left a comment


http://www.example.com/?a=0&a=2&b=1&c=3#fragment
data = {'a': 1}

Looks like it's a wrong request. I tested it with Postman and got a 400 status code.
As far as I know, all data from a GET form should be sent via the query string, and duplicate keys can exist because on the server they will be transformed into a list. More detailed info on SO.
What do you think @kmike @lopuhin
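The point about duplicate keys can be illustrated with the standard library: `parse_qs` folds repeated keys into lists, which is roughly what server-side frameworks do (a sketch, not tied to any particular server):

```python
from urllib.parse import parse_qs, parse_qsl

query = 'a=0&a=2&b=1&c=3'
# parse_qs folds duplicate keys into a list per key:
print(parse_qs(query))   # {'a': ['0', '2'], 'b': ['1'], 'c': ['3']}
# parse_qsl keeps the raw, ordered pairs with duplicates intact:
print(parse_qsl(query))  # [('a', '0'), ('a', '2'), ('b', '1'), ('c', '3')]
```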

@maramsumanth
Contributor Author

Done @amarynets. Thanks for your suggestions! :) 👍

@maramsumanth
Contributor Author

@lopuhin , can you please look into this?

Member

@lopuhin lopuhin left a comment


Thanks for working on this @maramsumanth and for adding tests. I think the same task could be accomplished by relying more on stdlib and w3lib functions for parsing and unparsing URLs though (unless they can't be applied here for some reason).

items = []
items += [(k, v) for k, v in parse_qsl(url_split.query) if k not in formdata_key_list]
query_str = _urlencode(items, self.encoding)
self._set_url(urljoin(self.url, '?' + (query_str + '&' if query_str else '') + querystr + ('#' + url_split.fragment if url_split.fragment else '')))
Member

Same here, I think it's better to use stdlib functions for URL construction from parts (see e.g. urlunsplit / urlunparse)

Contributor Author

@maramsumanth maramsumanth Mar 4, 2019

I am not able to do that, because I have both query_str and querystr to be appended. We can also have the case where query_str is empty. So I defined it explicitly in the urljoin call. Do you have any suggestions @lopuhin ?

Member

@lopuhin lopuhin Mar 4, 2019

I am not able to do that, because I have both query_str and querystr to be appended. We can also have the case where query_str is empty.

@maramsumanth I see, in that case you could first concatenate query strings, and then pass them to urlunsplit. Also it already has the logic to handle empty parts:

In [1]: from urllib.parse import urlunsplit

In [2]: urlunsplit(('a', 'b', 'c', 'd', 'e'))
Out[2]: 'a://b/c?d#e'

In [3]: urlunsplit(('a', 'b', 'c', 'd', ''))
Out[3]: 'a://b/c?d'


In [5]: urlunsplit(('a', 'b', '', '', 'foo'))
Out[5]: 'a://b#foo'
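Following that suggestion, one could concatenate the two query strings first and then hand the parts to `urlunsplit`. A minimal sketch (the variable names `query_str` and `querystr` follow the snippet above; `build_url` is a hypothetical helper, not Scrapy's code):

```python
from urllib.parse import urlsplit, urlunsplit

def build_url(url, query_str, querystr):
    parts = urlsplit(url)
    # Concatenate the two query strings, skipping whichever is empty.
    query = '&'.join(q for q in (query_str, querystr) if q)
    # urlunsplit itself handles empty query/fragment parts.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

print(build_url('http://example.com/page#frag', 'b=1', 'a=2'))
# http://example.com/page?b=1&a=2#frag
```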

Contributor Author

I thought of using urlunsplit, but both urljoin and urlunsplit yield the same thing.

@lopuhin
Member

lopuhin commented Mar 4, 2019

http://www.example.com/?a=0&a=2&b=1&c=3#fragment
data = {'a': 1}

Looks like it's a wrong request. I tested it with Postman and got a 400 status code.
As far as I know, all data from a GET form should be sent via the query string, and duplicate keys can exist because on the server they will be transformed into a list. More detailed info on SO.

@amarynets thanks for review! Could you please clarify what you mean here? I didn't quite understand what data = {'a': 1} means here. I agree that duplicate GET arguments are possible and we should preserve them, but not add new duplicates. Do you mean it's not handled correctly in this PR? If yes, could you please clarify which case is not handled correctly?

@maramsumanth I see that some duplicate cases are checked in the tests, but could you please add an explicit test for the case where there is a duplicate GET argument, verifying that it's preserved when we add another unrelated argument? (It seems this is not tested directly.)

@maramsumanth
Contributor Author

@lopuhin Thanks for your review. I have made the changes :) 👍

@maramsumanth
Contributor Author

maramsumanth commented Mar 16, 2019

@whalebot-helmsman , can you please look into this so that I can work on it ASAP? 👍 :)

@whalebot-helmsman
Contributor

I referenced an HTML file as an example. It has two forms. The first one uses a GET request, has a and b as keys, and targets the http://example.com/f?a=1&d=2 URL. After you click the Submit button, you get to http://example.com/f?a=3&b=2. The browser not only replaces the value for the a key, but also drops the value for the d key. Reformulating in terms of Python structures, we get

FormRequest('http://example.com/?id=9&smth=a', method='GET', formdata={'id': '8'}).url == 'http://example.com/?id=8'

@maramsumanth
Contributor Author

maramsumanth commented Mar 18, 2019

I have Ubuntu, on which I am using the Chromium browser.
But for FormRequest('http://example.com/?id=9&smth=a', method='GET', formdata={'id': '8'}).url
my output is 'http://example.com/?smth=a&id=8', which is != 'http://example.com/?id=8'.
This case is handled in the last part of the tests that I have written in this PR. Do the Chromium and Chrome browsers behave differently?

@maramsumanth
Contributor Author

@whalebot-helmsman , when I tested the HTML file you provided, I can see that for a GET form request both Chrome and Firefox drop url_query_str from the URL and add form_query_str. But running it through the command line as described in the comment above, I cannot see any problem.

@whalebot-helmsman
Contributor

I can see that in case of GET form request both Chrome and Firefox drop url_query_str from URL and add form_query_str.

Scrapy and related libraries should mirror browser behaviour.
At this moment in master and in your branch

>>> from scrapy.http.request.form import FormRequest;FormRequest('http://example.com/?id=9&smth=a', method='GET', formdata={'id': '8'}).url == 'http://example.com/?id=8'
False

This is an incorrect representation of browser behaviour. I understand that such behaviour was implemented before your PR, but since we have started to work on this part, we should also fix this.

@maramsumanth
Contributor Author

maramsumanth commented Mar 19, 2019

Now I understand the problem. Is there any Python code to get the browser that the user is using, @whalebot-helmsman ?

@whalebot-helmsman
Contributor

@whalebot-helmsman , when I tested the HTML file you provided, I can see that in case of GET form request both Chrome and Firefox drop url_query_str from URL and add form_query_str.

It is a browser behaviour. We agreed on that.

Scrapy's implementation is separate from the installed browser; Scrapy's behaviour doesn't depend on it. In the master branch we preserve both url_query_str and form_query_str; in the branch for this PR we replace values for common keys in url_query_str with values from form_query_str.
Scrapy's behaviour in both branches fails to mirror browser behaviour. That is a problem, and we should fix it.

I propose to fix it in this PR.

Current behaviour

>>> from scrapy.http.request.form import FormRequest;FormRequest('http://example.com/?id=9&smth=a', method='GET', formdata={'id': '8'}).url
'http://example.com/?smth=a&id=8'

Behaviour after fix

>>> from scrapy.http.request.form import FormRequest;FormRequest('http://example.com/?id=9&smth=a', method='GET', formdata={'id': '8'}).url
'http://example.com/?id=8'
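The proposed fix amounts to discarding the URL's own query string and serializing only the form data, mirroring how browsers submit GET forms. A minimal sketch of that behaviour (the helper name `browser_like_get` is hypothetical, not Scrapy's API):

```python
from urllib.parse import urlsplit, urlunsplit, urlencode

def browser_like_get(url, formdata):
    parts = urlsplit(url)
    # A browser drops the action URL's query string entirely and
    # replaces it with the serialized form fields.
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(formdata), parts.fragment))

print(browser_like_get('http://example.com/?id=9&smth=a', {'id': '8'}))
# http://example.com/?id=8
```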

@maramsumanth
Contributor Author

maramsumanth commented Mar 20, 2019

@whalebot-helmsman Thanks for guiding me.
Until now I was thinking that we had to somehow account for the browser name and proceed accordingly. Now I have updated the PR and changed the tests as you suggested :)

@maramsumanth
Contributor Author

@whalebot-helmsman , should I change anything, or is it safe to merge? 👍
Thanks :)

@whalebot-helmsman
Contributor

It is safe to merge now, but merging takes more time and is not required by the GSoC rules. We can keep it as is for now.

@Gallaecio Gallaecio changed the title Issue #2919: Fix FormRequest.formdata with GET method duplicates same key in query string [MRG+1] Issue #2919: Fix FormRequest.formdata with GET method duplicates same key in query string Mar 26, 2019
@maramsumanth
Contributor Author

Now it's fine :) 👍

@kmike kmike requested a review from wRAR March 26, 2020 13:00