Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Added JmesSelect #1016

Merged
merged 1 commit into from Mar 18, 2015
Merged

[MRG+1] Added JmesSelect #1016

merged 1 commit into from Mar 18, 2015

Conversation

@SudShekhar
Copy link
Contributor

@SudShekhar SudShekhar commented Jan 19, 2015

Created the first implementation of Json processor, any comments/edits are welcome.

I am currently using the jmespath (https://github.com/boto/jmespath) module to search for paths in the given list of values. I have added some sample test cases too.
The processor will return the list of values unchanged in case the jmespath module isn't installed.
I wasn't sure how to show a warning message in such a case and thus, have used the python print statement for now.

try:
import jmespath,ast,json
except:
print "JsonProcessor module requires the jmespath library to function. No processing will happen in its abscence"

This comment has been minimized.

@nramirezuy

nramirezuy Jan 19, 2015
Contributor

When I said required, I meant for the Scrapy library. So it should raise the exception on the import of this Processor.

except:
print "JsonProcessor module requires the jmespath library to function. No processing will happen in its abscence"
return values
compiledPath = jmespath.compile(self.json_path)

This comment has been minimized.

@nramirezuy

nramirezuy Jan 19, 2015
Contributor

This compilation should happen on __init__, it will improve performance.

@nramirezuy nramirezuy mentioned this pull request Jan 21, 2015
@SudShekhar
Copy link
Contributor Author

@SudShekhar SudShekhar commented Jan 22, 2015

Regarding the storage of ast module as self.astModule in the JsonProcessor class, I looked up the following document : https://wiki.python.org/moin/PythonSpeed/PerformanceTips (Import statement overhead topic). It seems that the current method should be acceptable. Kindly let me know if there are any other issues to be handled.

try:
import jmespath
import ast
self.astModule = ast

This comment has been minimized.

@nramirezuy

nramirezuy Jan 23, 2015
Contributor

Just save the function you are going to use. self.ast_literal_eval = ast.literal_eval.
Also fix naming, we use camel case just for classes.

raise ImportError("You need to have jmespath(https://github.com/boto/jmespath)"
+ " and AST modules before being able to use this processor")

def __call__(self, currentString):

This comment has been minimized.

@nramirezuy

nramirezuy Jan 23, 2015
Contributor

Rename currentString to value

+ " and AST modules before being able to use this processor")

def __call__(self, currentString):
if currentString != "": # Empty strings will return None

This comment has been minimized.

@nramirezuy

nramirezuy Jan 23, 2015
Contributor

Be prepared for None values. I think if value: is enough.

self.astModule.literal_eval(currentString)
)
return jsonValue
return None

This comment has been minimized.

@nramirezuy

nramirezuy Jan 23, 2015
Contributor

return None line can be removed

@@ -63,6 +63,31 @@ def __call__(self, values):
return values


class JsonProcessor(object):
"""

This comment has been minimized.

@nramirezuy

nramirezuy Jan 23, 2015
Contributor

Declare that it expect a string in __call__

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Jan 23, 2015

We are also missing documentation.

self.assertEqual(test, expected,
msg='test "{}" got {} expected {}'.format(l, test, expected)
)
except: # in case jmespath isn't installed

This comment has been minimized.

@nramirezuy

nramirezuy Jan 30, 2015
Contributor

If jmespath isn't installed just let it fail.
You can add dependencies for tests: https://github.com/scrapy/scrapy/blob/master/tests/requirements.txt

except: # in case jmespath isn't installed
pass

def test_errors(self):

This comment has been minimized.

@nramirezuy

nramirezuy Jan 30, 2015
Contributor

Not sure what this test.


def test_equals(self):
try:
import jmespath

This comment has been minimized.

@nramirezuy

nramirezuy Jan 30, 2015
Contributor

Why are you importing jmespath ?

@SudShekhar SudShekhar force-pushed the SudShekhar:jsonProcessor branch 4 times, most recently from 2bdf817 to d096792 Jan 30, 2015
self.json_path = json_path
try:
import jmespath
import ast

This comment has been minimized.

@nramirezuy

nramirezuy Feb 2, 2015
Contributor

ast is from the standard library so we can import it at the top. So storing literal_eval in an attribute won't be needed.

@@ -675,3 +675,18 @@ Here is a list of all built-in processors:
constructor keyword arguments are used as default context values. See
:class:`Compose` processor for more info.

.. class:: JsonProcessor(json_path)

This comment has been minimized.

@nramirezuy

nramirezuy Feb 2, 2015
Contributor

Let's rename it to JmesProcessor.

@SudShekhar SudShekhar force-pushed the SudShekhar:jsonProcessor branch 2 times, most recently from d0a1a2e to 5c64b3a Feb 2, 2015
@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Feb 2, 2015

@kmike Can you review this? I'm looking for py3 and documentation feedback.

@dangra
Copy link
Member

@dangra dangra commented Feb 2, 2015

+1 LGTM. It needs to be rebased into a single commit before merging.

Queries the value using the json path provided to the constructor and returns the output.
Requires jmespath (https://github.com/boto/jmespath) to run.

This processor takes only one string at a time and will return a python string/dictionary/None as answer

This comment has been minimized.

@kmike

kmike Feb 2, 2015
Member

  1. Can it return a list? A number? A boolean?
  2. What is a "python string"? Does this processor return bytes or unicode?
.. class:: JmesProcessor(json_path)

Queries the value using the json path provided to the constructor and returns the output.
Requires jmespath (https://github.com/boto/jmespath) to run.

This comment has been minimized.

"""
if value: # Empty strings will return None
return_value = self.compiled_path.search(
literal_eval(value)

This comment has been minimized.

@kmike

kmike Feb 2, 2015
Member

Why is literal_eval needed?

This comment has been minimized.

@kmike

kmike Feb 2, 2015
Member

I mean, why not use json?

def __call__(self, value):
"""Query value for the jmespath query and return answer
Input : string
Output : string / dict / None

This comment has been minimized.

@kmike

kmike Feb 2, 2015
Member

What about other Python types?
Could you please also use Sphinx-compatible docstrings, like

:param str value: a string with JSON data to extract from
Query the input string for the jmespath (given at instantiation), and return the answer
Requires : jmespath(https://github.com/boto/jmespath), ast
Note: JmesProcessor accepts only a single string at a time and returns string/dict/None based on the jmespath query.
"""

This comment has been minimized.

@kmike

kmike Feb 2, 2015
Member

This comment has been minimized.

@SudShekhar

SudShekhar Feb 2, 2015
Author Contributor

Thanks for your feedback. I will update the things you have mentioned.

This comment has been minimized.

@kmike

kmike Feb 2, 2015
Member

Thanks!
The Sphinx link I posted here is not relevant, but it is relevant for __call__ docstring.

self.compiled_path = jmespath.compile(self.json_path)
except:
raise ImportError("You need to have jmespath(https://github.com/boto/jmespath)"
+ " and AST modules before being able to use this processor")

This comment has been minimized.

@kmike

kmike Feb 2, 2015
Member

This is bad for several reasons:

  1. If "import jmespath" fails, it hides the original exception which can provide more information about why isn't jmespath available. See e.g. #902.
  2. If jmespath.compile fails the code raises an obscure ImportError, hiding the original exception again.
  3. except: shouldn't be used; it catches e.g. KeyboardInterrupt and SystemExit exceptions.

I suggest to remove try-except entirely. It will fail with "ImportError" pointing to jmespath, this is self-explanatory.

@@ -3,7 +3,7 @@

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, Identity, TakeFirst, \
Compose, MapCompose
Compose, MapCompose,JmesProcessor

This comment has been minimized.

@kmike

kmike Feb 2, 2015
Member

a nitpick - space is missing after a comma

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Feb 3, 2015

class JmesProcessor(object):
    def __init__(self, path, process=None):
        self.process = process or self._process

    def _process(self, data):
         if isinstance(data, basestring):
              return json.loads(data)
         return data

But I guess doing the json.loads outside is not that hard.

# Load JSON and process
Compose(json.loads, JmesProcessor('foo'))
# Process several JSON objects: '[{"foo":"bar"}, {"baz":"tar"}]'
Compose(json.loads, MapCompose(JmesProcessor('foo')))
# Just process
JmesProcessor('foo')
# Chained processors with load
Compose(json.loads, JmesProcessor('foo'), JmesProcessor('bar'))

I like more the second option.
@kmike @SudShekhar What do you think?

EDIT: We should also add examples to the doc.

@kmike
Copy link
Member

@kmike kmike commented Feb 3, 2015

I like not doing json.loads in the JmesProcessor, and the Compose / MapCompose examples are good.

@nramirezuy you should really create a few nice images/diagrams which show how Compose and MapCompose work :) What are the inputs, what are the outputs, how they can be used together. And maybe we should rename the processors to verbs - instead of JmesProcessor write SelectJmes - Compose(json.loads, SelectJmes('foo')) reads very well.

@kmike
Copy link
Member

@kmike kmike commented Feb 3, 2015

@nramirezuy a realted question: is it possible to use the processors without the item loaders?

get_foo = Compose(json.loads, SelectJmes('foo'))
# ...
value = get_foo(response.body_as_unicode())

is rather nice. If it works then I think it worths documenting. An maybe splitting this micro-framework from the item loaders, if it is not only about item loaders.

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Feb 4, 2015

Well every Processor works outside ItemLoaders; and that is completely valid.
I'm fine with the processor rename there are no processors named Processor 👅

@SudShekhar
Copy link
Contributor Author

@SudShekhar SudShekhar commented Feb 4, 2015

@nramirezuy : The second options looks much better and it gives users more control. So, should I remove the json.loads from inside the processor? Regarding the documentation, the examples you have given look pretty comprehensive to me 😄 .I will add them and some others to the list.
+1 to the renaming too.

EDIT: In the documentation, I didn't add the chaining example because I felt that it fit better in the Compose examples list (same can be said for the processor handling a list of json strings I guess). Do let me know your views on this.

@SudShekhar SudShekhar force-pushed the SudShekhar:jsonProcessor branch from bf02b31 to f975f27 Feb 4, 2015
>>> proc({'foo':{'bar':'baz'}})
{'bar': 'baz'}
>>> import json
>>> procSingleJsonStr = Compose(json.loads,SelectJmes("foo"))

This comment has been minimized.

@nramirezuy

nramirezuy Feb 13, 2015
Contributor

We just use we just use CamelCase for classes; blame pep8 👅
procSingleJsonStr -> proc_single_json_str

Space after commas please 😄

>>> procSingleJsonStr = Compose(json.loads,SelectJmes("foo"))
>>> procSingleJsonStr('{"foo":"bar"}')
u'bar'
>>> procJsonList = Compose(json.loads, MapCompose(SelectJmes('foo')))

This comment has been minimized.

@nramirezuy

nramirezuy Feb 13, 2015
Contributor

same here

:return: Element extracted according to jmespath query
"""
return_value = self.compiled_path.search(value)
return return_value

This comment has been minimized.

@nramirezuy

nramirezuy Feb 13, 2015
Contributor

make it one line

@@ -579,5 +579,44 @@ def test_replace_css_re(self):
self.assertEqual(l.get_output_value('url'), [u'scrapy.org'])


class SelectJmesTestCase(unittest.TestCase):

This comment has been minimized.

@nramirezuy

nramirezuy Feb 13, 2015
Contributor

Gotta test those python types. Json isn't relevant here.

'bar'
>>> proc({'foo':{'bar':'baz'}})
{'bar': 'baz'}
>>> import json

This comment has been minimized.

@nramirezuy

nramirezuy Feb 13, 2015
Contributor

Make it 2 blocks, little title in between Working with Json.

'simple': ('foo.bar', '{"foo": {"bar": "baz"}}', "baz"),
'invalid': ('foo.bar.baz', '{"foo": {"bar": "baz"}}', None),
'top_level': ('foo', '{"foo": {"bar": "baz"}}', {"bar": "baz"}),
'double_vs_single_quoteString': ('foo.bar', '{"foo":{"bar":"baz"}}', "baz"),

This comment has been minimized.

@nramirezuy

nramirezuy Feb 13, 2015
Contributor

That camel case 👅

msg='test "{}" got {} expected {}'.format(l, test, expected)
)

def test_dict(self):

This comment has been minimized.

@nramirezuy

nramirezuy Feb 13, 2015
Contributor

Use the tests configuration above.

This comment has been minimized.

@SudShekhar

SudShekhar Feb 13, 2015
Author Contributor

Can you please clarify what you mean by this? Json's used to load the input. Do you just want me to directly use the dict/list/simple strings to check?

This comment has been minimized.

@nramirezuy

nramirezuy Feb 13, 2015
Contributor

Yes, use python types and forget about json, since we don't load json anymore inside the class. Also use the test_list_equals method, so it is easy to add more comparisons if needed.

@SudShekhar SudShekhar force-pushed the SudShekhar:jsonProcessor branch from f975f27 to d430dd7 Feb 13, 2015
@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Feb 17, 2015

There is something with the markup but I don't know what it is.

/cc @kmike

@SudShekhar
Copy link
Contributor Author

@SudShekhar SudShekhar commented Feb 24, 2015

I created the documentation locally but was unable to figure out the error. Can you please point out the issue?

Thanks.

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Feb 24, 2015

I think you are missing :: in Working with Json: http://pasteboard.co/1mkMnIj3.png

Utilizes jmespath. Also, added tests and documentation for the same.
@SudShekhar SudShekhar force-pushed the SudShekhar:jsonProcessor branch from d430dd7 to 839ffba Feb 24, 2015
@SudShekhar
Copy link
Contributor Author

@SudShekhar SudShekhar commented Mar 8, 2015

Hi,
Is there anything else that needs to be changed in this commit? Thanks for all your feedback.
/cc @nramirezuy @kmike

@kmike
Copy link
Member

@kmike kmike commented Mar 13, 2015

I think this PR is fine, +1 to merge it.

@kmike kmike changed the title Added a json processor [MRG+1] Added a json processor Mar 13, 2015
@nramirezuy nramirezuy changed the title [MRG+1] Added a json processor [MRG+1] Added JmesSelect Mar 18, 2015
nramirezuy added a commit that referenced this pull request Mar 18, 2015
[MRG+1] Added JmesSelect
@nramirezuy nramirezuy merged commit ee82fe0 into scrapy:master Mar 18, 2015
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@SudShekhar
Copy link
Contributor Author

@SudShekhar SudShekhar commented Mar 18, 2015

Thanks 😄

@pablohoffman
Copy link
Member

@pablohoffman pablohoffman commented Mar 18, 2015

Sorry I couldn't chime in before merge but, if we're gonna add jmespath as requirement (to requirements.txt) we should import it at module-level, not in the class constructor, don't you think?

@kmike
Copy link
Member

@kmike kmike commented Mar 18, 2015

@pablohoffman why does it matter? jmespath is not added to install_requires, it is optional.

Imports are very fast after the first successful import (a lookup in a dict) - by moving import to module level we won't get any speed benefits, but the exception may become less clear is jmespath is absent.

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Mar 20, 2015

@pablohoffman @kmike jmespath was added to tests/requirements.txt, not Scrapy ones.

@SudShekhar SudShekhar deleted the SudShekhar:jsonProcessor branch Mar 24, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants