Vocabulary Tags removed from dataset when worker extracts text #16
Comments
Thanks for your report, @jbothma! Your description does indeed suggest a connection to ckanext-extractor. However, I currently don't have an idea how ckanext-extractor could influence your tags: the metadata extracted by ckanext-extractor is stored in separate database tables and ckanext-extractor isn't supposed to modify the dataset/resource data itself. Obviously that doesn't mean that ckanext-extractor isn't the problem, but simply that this needs further investigation 😉 I'll look into it, but am currently busy with other things. If you can spare some time to investigate on your own then that would be a big help. |
Oh dear. Thanks for the response.
From memory, could it be that it causes the document to be re-indexed in CKAN, and that the tags are still set correctly in the database but the API and UI are presenting data based on the response from Solr?
I'll confirm that in the code, but you might have a hunch.
JD
(Sent by email in reply to Florian Brucker's comment of Wed, 31 Oct 2018, 16:16, shown above.)
|
Looks like the tag vocabulary fields (
Someone's discussed disabling that for quicker iteration on their schema: ckan/ckan#3226
Perhaps there's something wrong with my schema: https://github.com/OpenUpSA/ckanext-satreasury/blob/master/ckanext/satreasury/plugin.py#L111
{
"data_dict":"{\"license_title\": \"License not specified\", \"maintainer\": \"\", \"relationships_as_object\": [], \"notes_short\": \"\", \"private\": false, \"maintainer_email\": \"\", \"num_tags\": 2, \"sphere\": [\"national\"], \"financial_year\": [\"2019-20\"], \"id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"metadata_created\": \"2018-11-13T11:39:31.160030\", \"functions\": [], \"dimensions\": [], \"metadata_modified\": \"2018-11-13T15:31:07.163809\", \"author\": \"\", \"author_email\": \"\", \"state\": \"active\", \"methodology\": \"\", \"version\": null, \"usage\": \"\", \"license_id\": \"notspecified\", \"type\": \"dataset\", \"use_for\": \"\", \"province\": [], \"num_resources\": 4, \"groups\": [], \"creator_user_id\": \"f5406233-1dc3-42e3-804e-2579a57b3cdd\", \"relationships_as_subject\": [], \"key_points\": \"\", \"organization\": {\"description\": \"\", \"created\": \"2018-05-21T17:41:56.378776\", \"title\": \"My organization\", \"name\": \"my-organization\", \"is_organization\": true, \"state\": \"active\", \"image_url\": \"\", \"revision_id\": \"21d6f9e8-518a-4e73-8718-7e2383fcba01\", \"type\": \"organization\", \"id\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"approval_status\": \"approved\"}, \"name\": \"whatever\", \"isopen\": false, \"url\": \"\", \"notes\": \"\", \"owner_org\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"resources\": [{\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"Annexure_A_-_Individual_Investor.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T13:47:17.832194\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T13:47:17.740234\", \"position\": 0, 
\"revision_id\": \"ee85ac6e-1ce3-44b1-a3b7-37eac4b82038\", \"url_type\": \"upload\", \"id\": \"246f57e2-2f62-4706-98e7-46cace33f0c8\", \"resource_type\": null, \"size\": 52558}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T14:35:45.738843\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T14:35:45.663218\", \"position\": 1, \"revision_id\": \"dda39d69-19a6-49bd-a211-b17ee4817f57\", \"url_type\": \"upload\", \"id\": \"a53b88f8-0052-4622-8aa4-417366bbddc3\", \"resource_type\": null, \"size\": 241245}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T15:11:28.943298\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T15:11:28.867650\", \"position\": 2, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"url_type\": \"upload\", \"id\": \"5a752ff7-b8eb-4d17-96d4-835473287d60\", \"resource_type\": null, \"size\": 241245}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", \"url\": 
\"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T15:31:07.194859\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T15:31:07.133643\", \"position\": 3, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"url_type\": \"upload\", \"id\": \"4950f8ca-6def-47c0-a538-28667afdcc7c\", \"resource_type\": null, \"size\": 241245}], \"title\": \"whatever\", \"revision_id\": \"15f83f1b-c6dd-47eb-ada3-4435a428406e\"}",
"site_id":"default",
"financial_year":["2019-20"],
"id":"e50c37e5-cec5-40d2-b55b-e6bd512c8d71",
"metadata_created":"2018-11-13T11:39:31.160Z",
"capacity":"public",
"metadata_modified":"2018-11-13T15:31:07.163Z",
"res_format":["PDF",
"PDF",
"PDF",
"PDF"],
"state":"active",
"license_id":"notspecified",
"res_url":["http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf",
"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf",
"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf",
"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf"],
"entity_type":"package",
"title":"whatever",
"dataset_type":"dataset",
"validated_data_dict":"{\"owner_org\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"maintainer\": \"\", \"relationships_as_object\": [], \"notes_short\": \"\", \"private\": false, \"maintainer_email\": \"\", \"num_tags\": 2, \"sphere\": [], \"financial_year\": [], \"id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"metadata_created\": \"2018-11-13T11:39:31.160030\", \"functions\": [], \"dimensions\": [], \"metadata_modified\": \"2018-11-13T15:31:07.163809\", \"author\": \"\", \"author_email\": \"\", \"state\": \"active\", \"methodology\": \"\", \"version\": null, \"usage\": \"\", \"license_id\": \"notspecified\", \"type\": \"dataset\", \"use_for\": \"\", \"province\": [], \"num_resources\": 4, \"title\": \"whatever\", \"groups\": [], \"creator_user_id\": \"f5406233-1dc3-42e3-804e-2579a57b3cdd\", \"relationships_as_subject\": [], \"key_points\": \"\", \"name\": \"whatever\", \"isopen\": false, \"url\": \"\", \"notes\": \"\", \"license_title\": \"License not specified\", \"resources\": [{\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf\", \"created\": \"2018-11-13T13:47:17.832194\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T13:47:17.740234\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 0, \"revision_id\": \"ee85ac6e-1ce3-44b1-a3b7-37eac4b82038\", \"size\": 52558, \"datastore_active\": false, \"id\": \"246f57e2-2f62-4706-98e7-46cace33f0c8\", \"resource_type\": null, \"name\": \"Annexure_A_-_Individual_Investor.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": 
\"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T14:35:45.738843\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T14:35:45.663218\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 1, \"revision_id\": \"dda39d69-19a6-49bd-a211-b17ee4817f57\", \"size\": 241245, \"datastore_active\": false, \"id\": \"a53b88f8-0052-4622-8aa4-417366bbddc3\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T15:11:28.943298\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T15:11:28.867650\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 2, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"size\": 241245, \"datastore_active\": false, \"id\": \"5a752ff7-b8eb-4d17-96d4-835473287d60\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T15:31:07.194859\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T15:31:07.133643\", \"mimetype\": 
\"application/pdf\", \"url_type\": \"upload\", \"position\": 3, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"size\": 241245, \"datastore_active\": false, \"id\": \"4950f8ca-6def-47c0-a538-28667afdcc7c\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}], \"organization\": {\"description\": \"\", \"created\": \"2018-05-21T17:41:56.378776\", \"title\": \"My organization\", \"name\": \"my-organization\", \"is_organization\": true, \"state\": \"active\", \"image_url\": \"\", \"revision_id\": \"21d6f9e8-518a-4e73-8718-7e2383fcba01\", \"type\": \"organization\", \"id\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"approval_status\": \"approved\"}, \"revision_id\": \"15f83f1b-c6dd-47eb-ada3-4435a428406e\"}",
"res_name":["Annexure_A_-_Individual_Investor.pdf",
"vote-10-public-service-and-administration.pdf",
"vote-10-public-service-and-administration.pdf",
"vote-10-public-service-and-administration.pdf"],
"name":"whatever", |
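In the Solr document above, `data_dict` still carries `sphere` and `financial_year`, while `validated_data_dict` has both as empty lists. A minimal sketch of how one might spot such differences automatically, assuming the document has already been fetched from Solr as a Python dict. The helper name `diff_index_dicts` and the trimmed sample document are illustrative only and are not part of CKAN or ckanext-extractor.

```python
import json


def diff_index_dicts(solr_doc):
    """Return the keys whose values differ between the two JSON blobs.

    `solr_doc` is assumed to be a dict holding the `data_dict` and
    `validated_data_dict` fields as JSON strings, as in the dump above.
    """
    raw = json.loads(solr_doc["data_dict"])
    validated = json.loads(solr_doc["validated_data_dict"])
    return {
        key: (raw.get(key), validated.get(key))
        for key in sorted(set(raw) | set(validated))
        if raw.get(key) != validated.get(key)
    }


# Trimmed, hand-made sample that keeps only a few fields from the dump.
doc = {
    "data_dict": json.dumps(
        {"title": "whatever", "sphere": ["national"], "financial_year": ["2019-20"]}
    ),
    "validated_data_dict": json.dumps(
        {"title": "whatever", "sphere": [], "financial_year": []}
    ),
}

diff = diff_index_dicts(doc)
for key, (before, after) in diff.items():
    print(key, before, "->", after)
# financial_year ['2019-20'] -> []
# sphere ['national'] -> []
```

Only the vocabulary-backed fields show up in the diff, which is what points at validation during indexing rather than at the database contents.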
The So when ckanext-extractor's worker calls Since I think the following are reasonable options, but I'm not sure what the best one is and would like input. I'll cross-post to the ckan-dev list:
|
Thanks for the detailed investigation, @jbothma! Perhaps we can avoid this issue completely by using In the file
That would leave all the details of handling the package dict to CKAN core. |
This avoids conflict with tag vocabulary fields already converted by package_show and then getting converted again by update_dict. Fixes stadt-karlsruhe#16
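The double conversion mentioned in that commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `show_convert` is a hypothetical stand-in for a show-schema converter that moves vocabulary tags into a custom field, and the `spheres` vocabulary id and the sample package are made up. It is not CKAN's actual code.

```python
def show_convert(pkg, field, vocab_id):
    """Toy show-schema step (hypothetical, not CKAN's implementation).

    Copies the names of tags belonging to `vocab_id` into `pkg[field]`
    and keeps only free (non-vocabulary) tags under `tags`.
    """
    tags = pkg.get("tags", [])
    out = dict(pkg)
    out[field] = [t["name"] for t in tags if t.get("vocabulary_id") == vocab_id]
    out["tags"] = [t for t in tags if t.get("vocabulary_id") is None]
    return out


# Package as stored: vocabulary tags and free tags live side by side.
stored = {
    "name": "whatever",
    "tags": [
        {"name": "national", "vocabulary_id": "spheres"},
        {"name": "budget", "vocabulary_id": None},
    ],
}

once = show_convert(stored, "sphere", "spheres")   # what a show call returns
twice = show_convert(once, "sphere", "spheres")    # converting that result again

print(once["sphere"])   # ['national']
print(twice["sphere"])  # []
```

The second pass finds no vocabulary tags left under `tags`, so it overwrites the already-converted field with an empty list. Passing only the unconverted package to whichever step performs the conversion avoids that.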
That seems to work perfectly, thanks! I've made a pull request. |
No need, thanks - I took the opportunity to upgrade (and drop celery) while debugging. Thanks for fitting this into your schedule - much appreciated. |
I have a few custom ckan tag vocabularies for my datasets. It looks like when the worker extracts the text, the vocabulary tags are removed from the dataset.
I haven't looked into the worker code yet and I'm still on ckanext-extractor@v0.3.1
Basically the only thing I have using celery (yes, still on celery despite you upgrading it to work with redis on my request, sorry) is this.
When I create a dataset, I assign a couple of vocabulary tags to it.
When I add a PDF resource to it and programmatically request the package immediately afterwards, they're still set correctly.
A few seconds later they're not set any more.
If I stop the celery worker, the tags will stay in place until I start the worker again.
Any idea why this might be? I'll dive into the worker code ASAP but it's taken me a day or so to track this down to this plugin so it might not be tomorrow.
As always, I'm such a huge fan of this and appreciate it very much. Just posting here so long in case you know very quickly what it is. I'll update when I know more.
I think this has been hidden in the past because I used a script that would update (and fix) the package each time I add a resource, and I generally add XLS resources after adding PDF resources to the same datasets, and I have extractor configured to only extract PDF resources.