Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data_connector: Fetch all publications of one specific author #50

Closed
jnwang opened this issue Dec 21, 2019 · 10 comments · Fixed by sfu-db/APIConnectors#1
Closed

data_connector: Fetch all publications of one specific author #50

jnwang opened this issue Dec 21, 2019 · 10 comments · Fixed by sfu-db/APIConnectors#1
Assignees
Labels

Comments

@jnwang
Copy link

jnwang commented Dec 21, 2019

Suppose a user wants to fetch all publications of one specific author (e.g., Jian Pei). Dataprep.data_connector cannot meet her needs. For example, the first paper is not written by Jian Pei, but it was returned since the author list contains the keywords Jian and Pei.

Screen Shot 2019-12-21 at 1 53 52 PM

A user can get all publications of Jian Pei through this API: https://dblp.org/search/publ/api?q=author%3AJian_Pei%3A

Please consider to support this feature.

@pallavibharadwaj
Copy link
Contributor

Hi, I would like to work on this enhancement. I just had a couple of questions and it would be great if you could point me in the right direction.

'author' in the API call is part of the 'q' query string parameter. What exactly is the expected behavior? I believe that as an end-user, one would just want to specify dc.query("publication", author='Jian_Pei') instead of using the 'q' parameter and using 'author' as a nested parameter.

Any suggestions on this would be of help. Thank you!

@dovahcrow
Copy link
Member

dovahcrow commented Feb 24, 2020

Hi @pallavibharadwaj, we recently merged a PR (#70) called templated query provided the ground for a solution to this issue.

Basically you can help us add this to the dblp config and test if it works.

{
    "request": {
        "params: {
            "template": "author:{{first_name}}_{{last_name}}:",
            "required": false,
            "toKey": "q",
        }
    }
}

So that you can issue queries like dc.query("publication", first_name="Jian", last_name="Pei").

Note that the templated query PR is not very tested so there could be bugs somewhere and please don't hesitate to file an issue/PR to this repo!

@pallavibharadwaj
Copy link
Contributor

@dovahcrow - Sure, I'm taking up this issue and will try the format that you mentioned. I had another question. Could you please brief me or redirect me to instructions on how to make the build after DataConnectorConfig changes? I had to clone that separately into dataprep and I don't seem to find any instructions related to it.

Please assign the issue to me so that no one would spend time debugging it. Thank you!

@dovahcrow
Copy link
Member

@pallavibharadwaj You can create a Connector via Connector("./relative/path/to/dblp_config_folder") or Connector("/absolute/path/to/dblp_config_folder") to load the config locally.

@pallavibharadwaj
Copy link
Contributor

pallavibharadwaj commented Feb 26, 2020

@dovahcrow - I needed a small clarification on tokey and fromkey parameters.

  1. when both "tokey" and "template" are specified, is it always the case that the parameter dictionary would have it re-constructed to {ret['tokey']:ret['template']}? I don't see this implementation in the "types.py" file as part of the "support for template" PR. Is this a bug or an enhancement to be made as part of this PR?
  2. What is the expected behavior when there are multiple such "tokey" parameters and a different template is to be specified for each of the "tokey"?
  3. What is the role of "fromkey"?

Thank you!

@dovahcrow
Copy link
Member

@pallavibharadwaj

  1. Check the last else branch of the block starts from https://github.com/sfu-db/dataprep/blob/master/dataprep/data_connector/types.py#L122
        elif isinstance(def_, dict):
                template = def_.get("template")
                remove_if_empty = def_["removeIfEmpty"]
                to_key = def_.get("toKey") or to_key
                from_key = def_.get("fromKey") or from_key

                if template is None:
                    required = def_["required"]
                    value = params.get(from_key)
                    if value is None and required:
                        raise KeyError(from_key)
                else:
                    tmplt = jenv.from_string(template)
                    value = tmplt.render(**params)

If the template variable is defined (not None), it will always be used.

So, yes, it is always the case that the parameter dictionary would have it re-constructed to {ret['tokey']:ret['template']}.

  1. Check the code at https://github.com/sfu-db/dataprep/blob/master/dataprep/data_connector/types.py#L143
if to_key in ret:
        print(Warning(f"{key} conflicting with {to_key}"))

There will be a warning printed out if the toKey is duplicated. Currently, we don't have a hard constraint to forbid duplications. We might support that in the future for the robustness.

  1. fromKey is used as a rename of the parameter when the field is not a template. Check the code piece in 1. when the template is None. We actually get the value using fromKey from params, and put that under the toKey.

@pallavibharadwaj
Copy link
Contributor

@dovahcrow - Isn't the template field also a string here - "template": "author:{{first_name}}_{{last_name}}:"?
How is it going to enter the dict condition part? I see that it works perfectly fine if I add that implementation outside the elsif block that handles param values that are dictionary. Am I missing anything here?

@dovahcrow
Copy link
Member

@pallavibharadwaj Ah I see. There's a mistake in my comment.

The correct config for the request field in publication.json should be this:

{
    "request": {
        "url": "https://dblp.org/search/publ/api",
        "method": "GET",
        "params: {
+          "author": {
+              "template": "author:{{first_name}}_{{last_name}}:",
+              "required": false,
+              "toKey": "q",
+          },
-          "q": true,
+         "q": false,
            "h": false
        }
    }
}

@pallavibharadwaj
Copy link
Contributor

@dovahcrow - Okay, this looks good. However, params key expects only string values at the moment and does not support a dictionary. I see the following error originating from implicit_database.py where types of parameters are defined and json validator is invalidating this configuration:
jsonschema.exceptions.ValidationError: {'template': 'author:{{first_name}}_{{last_name}}:', 'required': False, 'toKey': 'q'} is not of type 'string'

I will create an issue for this and work on it first. Please let me know if that sounds okay or you have anything else in mind.

@dovahcrow
Copy link
Member

@pallavibharadwaj Somehow the JSON schema error is not very informative. Actually it's missing a field removeIfEmpty.

Have a try on this:

"author": {
                "template": "author:{{first_name}}_{{last_name}}:",
                "required": false,
                "removeIfEmpty": true,
                "toKey": "q"
            }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants