Vocabulary Tags removed from dataset when worker extracts text #16

Closed
jbothma opened this issue Oct 30, 2018 · 8 comments

jbothma commented Oct 30, 2018

I have a few custom ckan tag vocabularies for my datasets. It looks like when the worker extracts the text, the vocabulary tags are removed from the dataset.

I haven't looked into the worker code yet and I'm still on ckanext-extractor@v0.3.1

Basically the only thing I have using celery (yes, still on celery despite you upgrading it to work with redis on my request, sorry) is this.

When I create a dataset, I assign a couple of vocabulary tags to it.

When I add a PDF resource to it and programmatically request the package immediately afterwards, they're still set correctly.

A few seconds later they're not set any more.

If I stop the celery worker, the tags will stay in place until I start the worker again.

Any idea why this might be? I'll dive into the worker code ASAP but it's taken me a day or so to track this down to this plugin so it might not be tomorrow.

As always, I'm such a huge fan of this and appreciate it very much. Just posting here so long in case you know very quickly what it is. I'll update when I know more.

I think this has been hidden in the past because I used a script that would update (and fix) the package each time I add a resource, and I generally add XLS resources after adding PDF resources to the same datasets, and I have extractor configured to only extract PDF resources.

@torfsen torfsen self-assigned this Oct 31, 2018
Contributor

torfsen commented Oct 31, 2018

Thanks for your report, @jbothma!

Your description does indeed suggest a connection to ckanext-extractor. However, I currently don't have an idea how ckanext-extractor could influence your tags: the metadata extracted by ckanext-extractor is stored in separate database tables and ckanext-extractor isn't supposed to modify the dataset/resource data itself.

Obviously that doesn't mean that ckanext-extractor isn't the problem, but simply that this needs further investigation 😉 I'll look into it, but am currently busy with other things. If you can spare some time to investigate on your own then that would be a big help.

Author

jbothma commented Oct 31, 2018 via email

Author

jbothma commented Nov 13, 2018

It looks like the tag vocabulary fields (financial_year and sphere) are still populated in the index document (for example in data_dict) but are empty lists in validated_data_dict. That suggests it has something to do with the package data cached in Solr.
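
A minimal sketch of that comparison, assuming the index document has already been fetched from Solr as a Python dict. The helper name diff_index_doc and the sample values are illustrative only; the field names mirror the index document quoted further down.

```python
import json


def diff_index_doc(doc, fields):
    """Return the fields whose values differ between the two cached dicts.

    Both data_dict and validated_data_dict are stored in the index
    document as JSON-encoded strings, so they are decoded first.
    """
    raw = json.loads(doc["data_dict"])
    validated = json.loads(doc["validated_data_dict"])
    return {
        field: (raw.get(field), validated.get(field))
        for field in fields
        if raw.get(field) != validated.get(field)
    }


# Hypothetical, heavily shortened index document for illustration.
doc = {
    "data_dict": json.dumps(
        {"sphere": ["national"], "financial_year": ["2019-20"], "name": "whatever"}
    ),
    "validated_data_dict": json.dumps(
        {"sphere": [], "financial_year": [], "name": "whatever"}
    ),
}

mismatches = diff_index_doc(doc, ["sphere", "financial_year", "name"])
for field, (raw_value, validated_value) in sorted(mismatches.items()):
    print(field, raw_value, validated_value)
```

Only the two vocabulary fields are reported here, which matches what the real document shows.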

Someone's discussed disabling that for quicker iteration on their schema ckan/ckan#3226

Perhaps there's something wrong with my schema https://github.com/OpenUpSA/ckanext-satreasury/blob/master/ckanext/satreasury/plugin.py#L111

      {
        "data_dict":"{\"license_title\": \"License not specified\", \"maintainer\": \"\", \"relationships_as_object\": [], \"notes_short\": \"\", \"private\": false, \"maintainer_email\": \"\", \"num_tags\": 2, \"sphere\": [\"national\"], \"financial_year\": [\"2019-20\"], \"id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"metadata_created\": \"2018-11-13T11:39:31.160030\", \"functions\": [], \"dimensions\": [], \"metadata_modified\": \"2018-11-13T15:31:07.163809\", \"author\": \"\", \"author_email\": \"\", \"state\": \"active\", \"methodology\": \"\", \"version\": null, \"usage\": \"\", \"license_id\": \"notspecified\", \"type\": \"dataset\", \"use_for\": \"\", \"province\": [], \"num_resources\": 4, \"groups\": [], \"creator_user_id\": \"f5406233-1dc3-42e3-804e-2579a57b3cdd\", \"relationships_as_subject\": [], \"key_points\": \"\", \"organization\": {\"description\": \"\", \"created\": \"2018-05-21T17:41:56.378776\", \"title\": \"My organization\", \"name\": \"my-organization\", \"is_organization\": true, \"state\": \"active\", \"image_url\": \"\", \"revision_id\": \"21d6f9e8-518a-4e73-8718-7e2383fcba01\", \"type\": \"organization\", \"id\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"approval_status\": \"approved\"}, \"name\": \"whatever\", \"isopen\": false, \"url\": \"\", \"notes\": \"\", \"owner_org\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"resources\": [{\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"Annexure_A_-_Individual_Investor.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T13:47:17.832194\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T13:47:17.740234\", 
\"position\": 0, \"revision_id\": \"ee85ac6e-1ce3-44b1-a3b7-37eac4b82038\", \"url_type\": \"upload\", \"id\": \"246f57e2-2f62-4706-98e7-46cace33f0c8\", \"resource_type\": null, \"size\": 52558}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T14:35:45.738843\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T14:35:45.663218\", \"position\": 1, \"revision_id\": \"dda39d69-19a6-49bd-a211-b17ee4817f57\", \"url_type\": \"upload\", \"id\": \"a53b88f8-0052-4622-8aa4-417366bbddc3\", \"resource_type\": null, \"size\": 241245}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T15:11:28.943298\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T15:11:28.867650\", \"position\": 2, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"url_type\": \"upload\", \"id\": \"5a752ff7-b8eb-4d17-96d4-835473287d60\", \"resource_type\": null, \"size\": 241245}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", 
\"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T15:31:07.194859\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T15:31:07.133643\", \"position\": 3, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"url_type\": \"upload\", \"id\": \"4950f8ca-6def-47c0-a538-28667afdcc7c\", \"resource_type\": null, \"size\": 241245}], \"title\": \"whatever\", \"revision_id\": \"15f83f1b-c6dd-47eb-ada3-4435a428406e\"}",
        "site_id":"default",
        "financial_year":["2019-20"],
        "id":"e50c37e5-cec5-40d2-b55b-e6bd512c8d71",
        "metadata_created":"2018-11-13T11:39:31.160Z",
        "capacity":"public",
        "metadata_modified":"2018-11-13T15:31:07.163Z",
        "res_format":["PDF",
          "PDF",
          "PDF",
          "PDF"],
        "state":"active",
"license_id":"notspecified",
        "res_url":["http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf",
          "http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf",
          "http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf",
          "http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf"],
        "entity_type":"package",
        "title":"whatever",
        "dataset_type":"dataset",
        "validated_data_dict":"{\"owner_org\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"maintainer\": \"\", \"relationships_as_object\": [], \"notes_short\": \"\", \"private\": false, \"maintainer_email\": \"\", \"num_tags\": 2, \"sphere\": [], \"financial_year\": [], \"id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"metadata_created\": \"2018-11-13T11:39:31.160030\", \"functions\": [], \"dimensions\": [], \"metadata_modified\": \"2018-11-13T15:31:07.163809\", \"author\": \"\", \"author_email\": \"\", \"state\": \"active\", \"methodology\": \"\", \"version\": null, \"usage\": \"\", \"license_id\": \"notspecified\", \"type\": \"dataset\", \"use_for\": \"\", \"province\": [], \"num_resources\": 4, \"title\": \"whatever\", \"groups\": [], \"creator_user_id\": \"f5406233-1dc3-42e3-804e-2579a57b3cdd\", \"relationships_as_subject\": [], \"key_points\": \"\", \"name\": \"whatever\", \"isopen\": false, \"url\": \"\", \"notes\": \"\", \"license_title\": \"License not specified\", \"resources\": [{\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf\", \"created\": \"2018-11-13T13:47:17.832194\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T13:47:17.740234\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 0, \"revision_id\": \"ee85ac6e-1ce3-44b1-a3b7-37eac4b82038\", \"size\": 52558, \"datastore_active\": false, \"id\": \"246f57e2-2f62-4706-98e7-46cace33f0c8\", \"resource_type\": null, \"name\": \"Annexure_A_-_Individual_Investor.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": 
\"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T14:35:45.738843\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T14:35:45.663218\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 1, \"revision_id\": \"dda39d69-19a6-49bd-a211-b17ee4817f57\", \"size\": 241245, \"datastore_active\": false, \"id\": \"a53b88f8-0052-4622-8aa4-417366bbddc3\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T15:11:28.943298\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T15:11:28.867650\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 2, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"size\": 241245, \"datastore_active\": false, \"id\": \"5a752ff7-b8eb-4d17-96d4-835473287d60\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T15:31:07.194859\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T15:31:07.133643\", \"mimetype\": 
\"application/pdf\", \"url_type\": \"upload\", \"position\": 3, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"size\": 241245, \"datastore_active\": false, \"id\": \"4950f8ca-6def-47c0-a538-28667afdcc7c\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}], \"organization\": {\"description\": \"\", \"created\": \"2018-05-21T17:41:56.378776\", \"title\": \"My organization\", \"name\": \"my-organization\", \"is_organization\": true, \"state\": \"active\", \"image_url\": \"\", \"revision_id\": \"21d6f9e8-518a-4e73-8718-7e2383fcba01\", \"type\": \"organization\", \"id\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"approval_status\": \"approved\"}, \"revision_id\": \"15f83f1b-c6dd-47eb-ada3-4435a428406e\"}",
        "res_name":["Annexure_A_-_Individual_Investor.pdf",
          "vote-10-public-service-and-administration.pdf",
          "vote-10-public-service-and-administration.pdf",
          "vote-10-public-service-and-administration.pdf"],
        "name":"whatever",

@jbothma jbothma closed this as completed Nov 13, 2018
Author

jbothma commented Nov 20, 2018

The pkg_dict ckanext-extractor's worker gets from package_show on https://github.com/stadt-karlsruhe/ckanext-extractor/blob/master/ckanext/extractor/tasks.py#L62 already has vocabulary tags converted.

So when ckanext-extractor's worker calls index_for('package').update_dict(pkg_dict) on https://github.com/stadt-karlsruhe/ckanext-extractor/blob/master/ckanext/extractor/tasks.py#L110 there aren't any ('tag', ..., ...) keys in the data argument to the converters.convert_from_tags callable https://github.com/ckan/ckan/blob/master/ckan/logic/converters.py#L93

Since converters.convert_from_tags overwrites data[key] unconditionally, the worker's index call ends up triggering a second conversion of the tag vocabulary fields. With no tag entries left to read from, that second pass sets them to empty lists.
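
The effect can be reproduced in isolation. The following is a simplified stand-in, not CKAN's actual converters.convert_from_tags: it assumes a flattened dict with tuple keys such as ('tags', 0, 'name'), and the vocabulary id 'financial_years' is made up for the example.

```python
def convert_from_tags_sketch(vocab_id):
    """Build a converter that fills data[key] from the flattened tag entries.

    Simplified stand-in for illustration; like the behaviour described
    above, it overwrites data[key] no matter what was there before.
    """
    def converter(key, data):
        names = []
        vocab_keys = sorted(
            k for k in data
            if len(k) == 3 and k[0] == "tags" and k[2] == "vocabulary_id"
        )
        for k in vocab_keys:
            if data[k] == vocab_id:
                names.append(data[("tags", k[1], "name")])
        data[key] = names  # unconditional overwrite
    return converter


convert = convert_from_tags_sketch("financial_years")

# First pass: the tag entries are present, so the field is filled.
unconverted = {
    ("tags", 0, "name"): "2019-20",
    ("tags", 0, "vocabulary_id"): "financial_years",
}
convert(("financial_year",), unconverted)
first_result = unconverted[("financial_year",)]

# Second pass: the dict is already converted and carries no tag entries,
# so the overwrite replaces the good value with an empty list.
already_converted = {("financial_year",): ["2019-20"]}
convert(("financial_year",), already_converted)
second_result = already_converted[("financial_year",)]

print(first_result, second_result)
```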

I think the following are reasonable options, but I'm not sure what the best one is and would like input. I'll cross-post to the ckan-dev list:

  • the worker should be operating on a pre-converted pkg_dict so that converting it has the expected result
    • in this case, how? Is there a context flag to package_show that can give an unconverted dict?
  • index_for('package').update_dict(pkg_dict) should handle an already-converted dict safely
    • how? Its only optional argument is defer_commit
  • converters should be idempotent, in which case this is a ckan bug
    • unlikely - it sounds weird and there isn't really enough metadata to support this safely, I don't think
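
To make the last option concrete, here is a hedged sketch of what an idempotent variant might look like. It is not proposed CKAN code: the name idempotent_from_tags, the flattened tuple keys, and the vocabulary id 'spheres' are all assumptions for illustration.

```python
def idempotent_from_tags(vocab_id):
    """Build a converter that skips dicts which look already converted.

    Heuristic (an assumption, not CKAN behaviour): if the dict has no
    tag entries at all but already has a value under key, leave it alone.
    """
    def converter(key, data):
        has_tag_entries = any(k[0] == "tags" for k in data)
        if not has_tag_entries and key in data:
            return  # looks already converted; keep the existing value
        data[key] = [
            data[("tags", k[1], "name")]
            for k in sorted(data)
            if len(k) == 3 and k[0] == "tags" and k[2] == "vocabulary_id"
            and data[k] == vocab_id
        ]
    return converter


convert = idempotent_from_tags("spheres")

# An already-converted dict keeps its value on a second pass.
already_converted = {("sphere",): ["national"]}
convert(("sphere",), already_converted)

# An unconverted dict is still converted as usual.
unconverted = {
    ("tags", 0, "name"): "national",
    ("tags", 0, "vocabulary_id"): "spheres",
}
convert(("sphere",), unconverted)

print(already_converted[("sphere",)], unconverted[("sphere",)])
```

The weakness is the one noted in the list: a dict with no tag entries could be "already converted" or could belong to a dataset whose tags were all removed, and the converter has no metadata to tell the two apart.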

@jbothma jbothma reopened this Nov 20, 2018
jbothma added a commit to vulekamali/ckanext-satreasury that referenced this issue Nov 20, 2018
Contributor

torfsen commented Nov 20, 2018

Thanks for the detailed investigation, @jbothma!

Perhaps we can avoid this issue completely by using ckan.lib.search.rebuild instead of ckan.lib.search.index_for('package').update_dict. Could you please try the following:

In the file ckanext/extractor/tasks.py, replace the line index_for('package').update_dict(pkg_dict) with the following lines:

from ckan.lib import search
search.rebuild(package_id=res_dict['package_id'])

That would leave all the details of handling the package dict to CKAN core.

jbothma added a commit to jbothma/ckanext-extractor that referenced this issue Nov 20, 2018
This avoids conflict with tag vocabulary fields already converted
by package_show and then getting converted again by update_dict.

Fixes stadt-karlsruhe#16
Author

jbothma commented Nov 20, 2018

That seems to work perfectly, thanks!

I've made a pull request.

Contributor

torfsen commented Nov 21, 2018

The mentioned change has been committed in cbf1cae.

@jbothma: Do you need this backported to 0.3.1?

Author

jbothma commented Nov 22, 2018

No need, thanks - I took the opportunity to upgrade (and drop celery) while debugging.

Thanks for fitting this into your schedule - much appreciated.
