Lxml formrequest #111

Merged
merged 15 commits into scrapy:master on Apr 13, 2012

@LucianU
Contributor
LucianU commented Apr 10, 2012

This branch contains the implementation of lxml-based FormRequest.

The changes replace the ClientForm-based FormRequest class with the lxml-based one, add a test that ensures ambiguous clickdata is not accepted by FormRequest, and remove the ClientForm and BeautifulSoup libraries along with ClientForm's patch and test.

I ran the tests and they all passed.
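
For readers following the thread, a minimal usage sketch of the public API this branch re-implements on top of lxml (the spider, URL, and field names are hypothetical; only the from_response keywords discussed in this thread are shown):

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class LoginSpider(BaseSpider):
    name = 'login_example'                      # hypothetical spider
    start_urls = ['http://example.com/login']   # hypothetical URL

    def parse(self, response):
        # from_response parses the response's <form> with lxml, merges
        # formdata into the form's own inputs, and picks a submit control
        # according to clickdata/dont_click.
        return FormRequest.from_response(
            response,
            formdata={'user': 'john', 'pass': 'secret'},
            clickdata={'name': 'login_button'},
            callback=self.after_login,
        )

    def after_login(self, response):
        self.log('logged in, landed on %s' % response.url)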

@pablohoffman

why not use scrapy.utils.python.unicode_to_str()?

There is a difference between this version of unicode_to_str and the one in scrapy.utils.python: this version also converts an already-encoded string to 'utf-8'.

In the end, the question is: do we expect only unicode, 'utf-8', or 'ascii'-encoded strings, or do we keep the current unicode_to_str, which leaves a string with its existing encoding, whatever that is?
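
A rough Python 2 sketch of the distinction (the to_utf8 helper and the assumption that the source encoding is known are illustrative, not code from the branch; the first function only roughly mirrors scrapy.utils.python.unicode_to_str as it was at the time):

def unicode_to_str(text, encoding='utf-8', errors='strict'):
    # scrapy.utils.python behaviour (roughly): unicode gets encoded, byte
    # strings are returned untouched, keeping whatever encoding they have.
    if isinstance(text, unicode):
        return text.encode(encoding, errors)
    return text

def to_utf8(text, source_encoding='utf-8'):
    # The behaviour described above: byte strings are re-encoded to UTF-8,
    # which only works if their current encoding is known.
    if isinstance(text, unicode):
        return text.encode('utf-8')
    return text.decode(source_encoding).encode('utf-8')

latin1_bytes = u'caf\xe9'.encode('latin-1')      # 'caf\xe9'
print repr(unicode_to_str(latin1_bytes))         # 'caf\xe9'  (left as latin-1)
print repr(to_utf8(latin1_bytes, 'latin-1'))     # 'caf\xc3\xa9'  (now UTF-8)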

@dangra dangra commented on an outdated diff Apr 11, 2012
scrapy/http/request/form.py
+ """
+ Returns all the inputs that will be sent with the request,
+ both those already present in the form and those given by
+ the user
+ """
+ clickables = []
+ inputs = []
+ for el in form.inputs:
+ name = el.name
+ if not name or name in formdata:
+ continue
+ tag = html._nons(el.tag)
+ if tag == 'textarea':
+ inputs.append((name, el.value))
+ elif tag == 'select':
+ if u' xmlns' in response.body_as_unicode()[:200]:
@dangra
dangra Apr 11, 2012 Member

Can we avoid checking for " xmlns" in every loop iteration? It is an invariant check; move it to the top of the function.

@dangra dangra commented on an outdated diff Apr 11, 2012
scrapy/http/request/form.py
+ if el.type in ('image', 'reset'):
+ continue
+ elif el.type == 'submit':
+ clickables.append(el)
+ else:
+ value = el.value
+ if value is not None:
+ inputs.append((name, el.value))
+
+ # If we are allowed to click on buttons and we have clickable
+ # elements, we move on to see if we have any clickdata
+ if not dont_click and clickables:
+ clickable = _get_clickable(clickdata, clickables, form)
+ inputs.append(clickable)
+
+ inputs.extend([(key, value) for key, value in formdata.iteritems()])
@dangra
dangra Apr 11, 2012 Member

An alternative with less boilerplate: inputs.extend(formdata.iteritems())
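
A tiny Python 2 illustration of the equivalence (the values are made up):

formdata = {'user': 'john', 'token': 'abc123'}   # hypothetical form data
inputs = [('page', '1')]

# list.extend() accepts any iterable of items, so the intermediate list
# comprehension buys nothing; both lines append the same (key, value) pairs.
inputs.extend(formdata.iteritems())
# equivalent to: inputs.extend([(k, v) for k, v in formdata.iteritems()])

print inputs   # [('page', '1'), ('user', 'john'), ('token', 'abc123')] (dict order may vary)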

@pablohoffman pablohoffman commented on an outdated diff Apr 11, 2012
scrapy/http/request/form.py
+ else:
+ return form
+
+def _get_inputs(form, formdata, dont_click, clickdata, response):
+ """
+ Returns all the inputs that will be sent with the request,
+ both those already present in the form and those given by
+ the user
+ """
+ clickables = []
+ inputs = []
+ for el in form.inputs:
+ name = el.name
+ if not name or name in formdata:
+ continue
+ tag = html._nons(el.tag)
@pablohoffman
pablohoffman Apr 11, 2012 Member

Is there a way to avoid calling a protected method of lxml.html? This may raise compatibility issues with future or past lxml versions.
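
One way around the protected call would be a small local helper that strips the XHTML namespace from a parsed tag name, which is roughly what _nons does (a sketch only; the branch later dropped the call altogether, see the commits below):

XHTML_NAMESPACE = 'http://www.w3.org/1999/xhtml'

def strip_xhtml_ns(tag):
    # Turn '{http://www.w3.org/1999/xhtml}input' into 'input'; non-string
    # tags (e.g. comments in lxml trees) are returned unchanged.
    if isinstance(tag, basestring) and tag.startswith('{%s}' % XHTML_NAMESPACE):
        return tag.split('}', 1)[1]
    return tag

print strip_xhtml_ns('{http://www.w3.org/1999/xhtml}textarea')   # textarea
print strip_xhtml_ns('select')                                   # select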

@dangra dangra and 1 other commented on an outdated diff Apr 12, 2012
scrapy/http/request/form.py
+
+def _get_inputs(form, formdata, dont_click, clickdata, response):
+ """
+ Returns all the inputs that will be sent with the request,
+ both those already present in the form and those given by
+ the user
+ """
+ clickables = []
+ inputs = []
+ xmlns_in_resp = u' xmlns' in response.body_as_unicode()[:200]
+
+ for el in form.inputs:
+ name = el.name
+ if not name or name in formdata:
+ continue
+ tag = _nons(el.tag)
@dangra
dangra Apr 12, 2012 Member

I tried removing the call to _nons() and all tests still passed.
What about removing it completely or figuring out a test case that justifies its use?

dangra added some commits Apr 13, 2012
@dangra dangra reuse form_values() method from lxml to avoid copying code. #111 e4d22cb
@dangra dangra lxml form request cleanup. #111
* remove unused _nons function copied from lxml.html
* compute clickables only if dont_click is False
* reduce branch nesting in the _get_clickables function
32b9f78
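
For reference, a standalone example of the lxml method the first commit above reuses (the HTML is made up; the exact rules, such as skipping the submit control, follow lxml's form_values() and may vary slightly across lxml versions):

import lxml.html

html = '''
<form action="/post">
  <input type="text" name="user" value="john"/>
  <input type="hidden" name="token" value="abc123"/>
  <input type="submit" name="go" value="Send"/>
</form>
'''

form = lxml.html.fromstring(html).forms[0]
# form_values() yields the (name, value) pairs a browser would submit,
# minus the clickable controls, which FormRequest adds separately based
# on dont_click/clickdata.
print form.form_values()   # e.g. [('user', 'john'), ('token', 'abc123')]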
dangra added some commits Apr 13, 2012
@dangra dangra more lxml form fixes and test cases. #111
* Do not treat the "coord" attribute specially, just pass "NN,NN" as the clickdata value
* Raise an explicit ValueError if no clickable is found
* Fix a bug when looking for clickables through XPath when there is more than one form
* Test from_response with multiple clickdata
ee1f784
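
To make the clickdata behaviour the commit above touches concrete, a hedged sketch (the HTML and values are made up, and the exception wording is not taken from the branch):

from scrapy.http import FormRequest, HtmlResponse

body = '''
<form action="/search" method="POST">
  <input type="text" name="q" value="scrapy"/>
  <input type="submit" name="search" value="Search"/>
  <input type="submit" name="lucky" value="I'm Feeling Lucky"/>
</form>
'''
response = HtmlResponse(url='http://example.com', body=body)

# clickdata narrows the clickable controls down to one; its (name, value)
# pair is added to the submitted data. With no clickdata the first clickable
# is used, and per the commit above a clickdata filter that matches nothing
# raises an explicit ValueError.
req = FormRequest.from_response(response, clickdata={'name': 'lucky'})
print req.body   # expected to contain q=scrapy and the chosen submit control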
@dangra dangra no need for MultipleElementsFound exception. #111 2904dc2
@dangra dangra Merge branch 'master' into lxml-formrequest 9e10abc
@dangra dangra reuse LxmlDocument in FormRequest. #111 a11ef7f
@dangra dangra simplify formdata type inference. #111 51e5aad
@dangra dangra do not add clickable if it is in formdata. #111 b88bdd0
@dangra dangra iteritems returns tuple elements, duh! #111 39395eb
@dangra dangra merged commit 150e573 into scrapy:master Apr 13, 2012
@dangra dangra added a commit to dangra/scrapy that referenced this pull request Apr 15, 2012
@dangra dangra SELECT matched as form input but has no type attribute. #111 5b0df46
@dangra dangra added a commit to dangra/scrapy that referenced this pull request Apr 19, 2012
@dangra dangra find form input elements using one XPath. #111 #121 63e4355
@dangra dangra added a commit to dangra/scrapy that referenced this pull request Apr 19, 2012
@dangra dangra test and fix each form input type with edge cases. #111 #121 84d5f5e
@dangra dangra added a commit to dangra/scrapy that referenced this pull request Apr 19, 2012
@dangra dangra More lxml FormRequest fixes. #111 #121
* test textarea elements
* handle odd cases for select elements the way Chrome and FF browsers do
* Remove test case already covered by per-tag test cases
4340a13
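
As a concrete illustration of the browser-like select handling the commit above targets (HTML made up; the expected pairs mirror what Chrome/Firefox submit, which is the behaviour being matched):

from scrapy.http import FormRequest, HtmlResponse
import urlparse

body = '''
<form action="/order" method="POST">
  <select name="size">
    <option value="S">small</option>
    <option value="M">medium</option>
  </select>
  <select name="color">
    <option value="red" selected="selected">red</option>
    <option value="blue">blue</option>
  </select>
</form>
'''
response = HtmlResponse(url='http://example.com', body=body)
req = FormRequest.from_response(response)

# Browsers submit the first option of a single select when none is marked
# selected, and the explicitly selected option otherwise, so the request
# body should carry size=S and color=red.
print urlparse.parse_qs(req.body)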
@dangra dangra added a commit to dangra/scrapy that referenced this pull request Apr 19, 2012
@dangra dangra cleanup all FormRequest test cases. #111 #121 3b2458d
@dangra dangra added a commit to dangra/scrapy that referenced this pull request Apr 19, 2012
@dangra dangra Add a test case to cover input elements that are not direct children of the form element. #111 #121
7248512
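
For clarity, the kind of markup that last test case targets looks like the sketch below (illustrative only): the inputs are descendants of the form rather than direct children, and still have to end up in the submitted data.

from scrapy.http import FormRequest, HtmlResponse

body = '''
<form action="/save" method="POST">
  <div class="field">
    <label>Name</label>
    <p><input type="text" name="name" value="john"/></p>
  </div>
  <div class="field">
    <input type="hidden" name="token" value="abc123"/>
  </div>
</form>
'''
response = HtmlResponse(url='http://example.com', body=body)

# Both inputs sit inside wrapper elements, not directly under <form>, but
# they are still part of the form and should be submitted.
req = FormRequest.from_response(response)
print req.body   # expected to include name=john and token=abc123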