
Conversation

@DeleMike
Collaborator

  • Implemented checks for non-existent languages and data types in the total command.
  • Added informative error messages guiding users to update or set their language metadata.
  • Enhanced feedback for improved usability of the CLI.

Contributor checklist


Description

This PR enhances the Scribe-Data CLI by validating user input for languages and data types. If a non-existent language is specified, the user will receive a clear message informing them that the language does not exist, along with instructions on how to update their language_metadata.json file using:

scribe-data update --metadata

Alternatively, users can manually set the metadata with:

scribe-data set-metadata -lang [your_language] -qid [your_qid]

These enhancements aim to improve user experience and ensure accurate data retrieval in the Scribe-Data CLI.

Related issue

This PR is closely related to #295.

@github-actions

github-actions bot commented Oct 12, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution

    • The contributor's name and icon in remote commits should be the same as what appears in the PR
    • If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo
  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@catreedle
Collaborator

Hi @DeleMike,
Great job on this! It's impressive how you've come up with this solution.

Just a couple of things to consider:

  • We might need to revisit total.py later, as there could be some changes related to this PR.
  • Do we think it's safe to allow users to update the language_metadata? Specifically, does this handle cases where a user might try to update an existing language (e.g., English) with an incorrect QID?

Let me know your thoughts when you get a chance. :)

@catreedle
Collaborator

I was browsing through the issues and found one that might be related to this. #293

@DeleMike
Collaborator Author

Hi @DeleMike, Great job on this! It's impressive how you've come up with this solution.

Just a couple of things to consider:

  • We might need to revisit total.py later, as there could be some changes related to this PR.
  • Do we think it's safe to allow users to update the language_metadata? Specifically, does this handle cases where a user might try to update an existing language (e.g., English) with an incorrect QID?

Let me know your thoughts when you get a chance. :)

Yes, we might need to revisit it after it has been merged. Thanks @catreedle!

Concerning the ability to allow users (developers) to update the language metadata: it's a good option IMHO. In Flutter and other systems, there's sometimes a need to update your dependencies or refetch them (flutter pub get) just to make sure they are cached properly and available for use. However, I agree with your concern: if the user updates, say, "English" with the wrong QID, how do we tackle that?

I will think of ways soon and drop them in this discussion.

@DeleMike
Collaborator Author

I was browsing through the issues and found one that might be related to this. #293

Hmm... I did not see that.
It is closely related, right, @andrewtavis? What can we do about this?

One thing: this issue originated when Scribe-Data was returning some kind of default values for languages it could not find in the language_metadata.json file. Hence, I came up with this initial fix, which led to the procedure of updating the metadata:

    # When we run `scribe-data t -l Latin` and Latin doesn't exist in the
    # metadata, we cannot provide a QID for it.
    language_qid = get_qid_by_input(language)
    data_type_qid = get_qid_by_input(data_type)

    if not language_qid:
        print(
            "The specified language does not exist. Please update your language_metadata.json file by using:\n"
            "`scribe-data update --metadata`\n"
            "Alternatively, you can manually set it with:\n"
            "`scribe-data set-metadata -lang [your_language] -qid [your_qid]`.\n\n"
            "This will ensure that you can fetch the correct data."
        )
        return

We basically inform the user that the specified language does not exist and that they should update the metadata file. We also terminate the process early with the return keyword so that we exit the procedure gracefully after telling the user what to do.

Output of running scribe-data t -l Latin if it does not exist:

The specified language does not exist. Please update your language_metadata.json file by using:
`scribe-data update --metadata`
Alternatively, you can manually set it with:
`scribe-data set-metadata -lang [your_language] -qid [your_qid]`.

This will ensure that you can fetch the correct data.

@andrewtavis andrewtavis self-requested a review October 12, 2024 14:51
@andrewtavis andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 12, 2024
@andrewtavis
Member

andrewtavis commented Oct 13, 2024

Ok @DeleMike, I've had a bit of time to look this over. The big thing is that this functionality is useful now, but it shouldn't be needed in the future, as we shouldn't be releasing Scribe-Data or merging in language querying files if the language isn't in the metadata file. @OmarAI2003 will be expanding the language metadata file in #293, so then this should be fine 🤔 I agree with @catreedle that we don't necessarily want the user to have control over these internal files (edit: I'm unsure on this now).

There is definitely some value here though 😊 Specifically maybe we can change this so that it does the following:

  • A new file /src/scribe_data/check_language_metadata.py that would include some of the functionality you have here to check the language_data_extraction directory to see if the languages in the directory are included in the metadata file
  • A GitHub workflow in .github/workflows that runs on PRs that would run check_language_metadata.py and throw an error if the language_data_extraction directory hasn't been updated

How does this sound?

Edit: there's even a more expansive suggestion for how to use this code here. Really interesting possibilities!
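A minimal sketch of the proposed check, with the function name and data shapes as illustrative assumptions rather than the merged implementation. It compares the directory names under language_data_extraction against the entries in language_metadata.json and reports any language that has no metadata entry:

```python
# Hypothetical sketch of the proposed check_language_metadata.py; names and
# data shapes are assumptions, not Scribe-Data's actual implementation.
def find_missing_languages(extraction_dirs, language_metadata):
    """
    Return directory names under language_data_extraction/ that have no
    corresponding entry in language_metadata.json.
    """
    known = {entry["language"].lower() for entry in language_metadata["languages"]}
    return sorted(d for d in extraction_dirs if d.lower() not in known)


# Illustrative data; a PR workflow would read the real directory and JSON file
# and exit non-zero when the returned list is non-empty.
sample_metadata = {
    "languages": [
        {"language": "English", "qid": "Q1860"},
        {"language": "French", "qid": "Q150"},
    ]
}
missing = find_missing_languages(["English", "French", "Latin"], sample_metadata)
print(missing)  # ['Latin']
```

A workflow step would then fail the PR check whenever this list is non-empty, forcing the metadata file and the extraction directory to stay in sync.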


with LANGUAGE_METADATA_FILE.open("r", encoding="utf-8") as file:
    language_metadata = json.load(file)
try:
Member


These try-excepts are fine, but can we not set language_metadata = {"languages": []} at the end of them (and the same for data types)? If these files are not being loaded, then we have serious problems. If there's not a test for it already (that both of them are read and are in a specific location), then we should write one!

CC @DeleMike and @catreedle for checking for this test :)

Member


Given the reaction here, @catreedle, do you want to check the tests to see if we have some for those two files being accessible, and if not you can open a PR with tests for it? :)

Collaborator


Sure, allow me some time as I’m still getting familiar with testing. I'll check and get back to you later. :)

Member


Sounds good!

Collaborator


From what I see, there aren’t any specific tests to verify whether both language_metadata.json and data_type_metadata.json are being read, but the tests depend on these files. Deleting one causes an error during test runs. I've written tests to check the accessibility of these files, but they won’t run if the files are missing. I’m unsure about adding these tests and could use some help with this. :)
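One way to make such a test self-contained, sketched here with hypothetical helper and test names: wrap the file read in a small loader that fails loudly, then exercise it against a temporary file rather than the repo's real metadata files.

```python
# A sketch of the accessibility test discussed above; load_metadata and the
# test names are illustrative, not Scribe-Data's actual helpers.
import json
import tempfile
from pathlib import Path


def load_metadata(path: Path) -> dict:
    """Load a metadata JSON file, failing loudly if it is absent."""
    if not path.exists():
        raise FileNotFoundError(f"Required metadata file is missing: {path}")
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def test_metadata_file_is_readable():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "language_metadata.json"
        path.write_text(json.dumps({"languages": []}), encoding="utf-8")
        assert load_metadata(path) == {"languages": []}


def test_missing_metadata_file_raises():
    with tempfile.TemporaryDirectory() as tmp:
        try:
            load_metadata(Path(tmp) / "language_metadata.json")
        except FileNotFoundError as err:
            assert "missing" in str(err)
        else:
            raise AssertionError("expected FileNotFoundError")


test_metadata_file_is_readable()
test_missing_metadata_file_raises()
```

Because the missing-file case is simulated with a temporary directory, the test suite can verify the failure path without actually deleting the real metadata files.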

Collaborator


I guess I could put it in a new test file for this if it's still needed?

Collaborator Author

@DeleMike DeleMike Oct 13, 2024


Yes, I guess you could, but wait for a final go-ahead :)

EDIT: We definitely need the test, because before submitting this PR I thought about it: "what if the files are not loaded?"

Member


We can have two minor tests just like you described, @catreedle :) Feel free to put them in an appropriate directory/file that's for the whole application. I'll move it if needed 😊

Collaborator


Added a PR here. Looking forward to the review. :)

}


def get_available_languages() -> list[tuple[str, str]]:
Member


Here we'd just return the languages for the check_language_metadata.py file :) There's a degree of hard coding here with the query_verbs.sparql that we should avoid.

Maybe there is some value in letting a Scribe-Data developer set the metadata file with a language and a QID of their choice though 🤔 They could also use the QID during testing, but then that has specific return functionality now that's getting all data types by default instead of what's in Scribe-Data already. This would allow us to do a simple command in the docs to update the metadata file, but we should make it clear that it's for development only. We don't want the end user to add a language-qid pair to the metadata, as they wouldn't have the ability to write the queries (if, say, they install with pip).

Collaborator Author


You are right. I thought about having it only return language names, but I felt we could do better or create a "standard", i.e. we would have to ensure that query_verbs.sparql exists in all directories. And I've checked: the French directory has query_verbs_1.sparql and query_verbs_2.sparql, so I guess it won't be helpful.

return available_languages


def extract_qid_from_sparql(file_path: Path) -> str:
Member


extract_qid_from_sparql, check_and_update_languages and update_language_metadata wouldn't be needed, as this would be covered by the check workflow on PRs that will ensure that the metadata file is appropriate :) I'm still on the fence when it comes to set_metadata though 🤔

Collaborator Author


Yeah, I agree with why we should move them. Great point 😄

I thought that set_metadata would be useful if, for some reason, we could not automatically update the metadata (that is, when it was still in use); then we could call scribe-data set-metadata -lang French -qid Q150, for example, and this would update our metadata file.

)
TOTAL_DESCRIPTION = "Check Wikidata for the total available data for the given languages and data types."
CONVERT_DESCRIPTION = "Convert data returned by Scribe-Data to different file types."
UPDATE_DESCRIPTION = "Update the metadata file with available languages and QIDs."
Member


Main would need the update functionality removed as it'll be covered by the new workflow on PRs :)


if not language_qid:
    print(
        "The specified language does not exist. Please update your language_metadata.json file by using:\n"
Member


We'll need to check these outputs once the updates are done :)

Collaborator Author


Yeah, sure! Thanks @andrewtavis !

print(f"Error updating language metadata: {e}")


def set_metadata(language_name: str, qid: str):
Member


If we do keep set_metadata, then we should include an iso argument :)
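set_metadata was ultimately not kept, but a version including the suggested iso argument might have looked roughly like the following sketch (all names and the metadata shape are hypothetical, for illustration only):

```python
# Hypothetical sketch of set_metadata with an iso argument, operating on an
# in-memory metadata dict; not Scribe-Data's actual implementation.
def set_metadata(language_metadata: dict, language_name: str, qid: str, iso: str) -> dict:
    """Add or update a language entry, matching existing names case-insensitively."""
    for entry in language_metadata["languages"]:
        if entry["language"].lower() == language_name.lower():
            entry["qid"] = qid
            entry["iso"] = iso
            return language_metadata

    language_metadata["languages"].append(
        {"language": language_name, "qid": qid, "iso": iso}
    )
    return language_metadata


metadata = {"languages": [{"language": "English", "qid": "Q1860", "iso": "en"}]}
set_metadata(metadata, "French", "Q150", "fr")
print(len(metadata["languages"]))  # 2
```

Matching case-insensitively on the language name means a repeated call updates the existing entry rather than appending a duplicate.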

Member


Also @DeleMike, it'd be great if, for the functions that are kept, you check some of the other docstrings and use similar formatting. As the function docstrings are written now, they won't be parsed correctly for the readthedocs documentation :)

Collaborator Author


Oh! I did not know about this. Will update :)

@andrewtavis
Member

Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on, with @DeleMike perhaps taking the lead: we could repurpose this code to add another check of the actual queries. This idea came from extract_qid_from_sparql. There's obviously a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:

  • Is the language QID the one that it's supposed to be given the directory
  • Is the data type the one it's supposed to be given the directory

We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing ?lexeme dct:language wd:Q12345 (we'd need to continue to write like this, but it's fine). From there for the data type check we can just check to see if any of the QIDs for the other data types are in the query - i.e. "Hey this is an adjectives query. Why is the QID for nouns in here???".

Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work 😊
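A sketch of the two checks described above, written against inline sample values; the function names are hypothetical, and a real workflow would load the QIDs from the language and data type metadata files. The regex assumes queries write the language triple as ?lexeme dct:language wd:Q12345, per the convention noted above:

```python
# Illustrative sketch of the proposed query checks; not the actual workflow code.
import re
from typing import Optional

# Assumes the `?lexeme dct:language wd:QID` writing convention noted above.
LANGUAGE_QID_PATTERN = re.compile(r"\?lexeme\s+dct:language\s+wd:(Q\d+)")


def extract_language_qid(query_text: str) -> Optional[str]:
    """Return the language QID a query declares, or None if absent."""
    match = LANGUAGE_QID_PATTERN.search(query_text)
    return match.group(1) if match else None


def foreign_data_type_qids(query_text: str, expected_type: str, type_qids: dict) -> list:
    """Return QIDs of *other* data types found in the query (likely paste errors)."""
    return sorted(
        qid for dtype, qid in type_qids.items()
        if dtype != expected_type and qid in query_text
    )


# Sample values; a workflow would read these from the metadata files.
type_qids = {"adjectives": "Q34698", "nouns": "Q1084"}
query = (
    "SELECT ?lexeme WHERE { ?lexeme dct:language wd:Q150 ; "
    "wikibase:lexicalCategory wd:Q1084 . }"
)
print(extract_language_qid(query))                             # Q150
print(foreign_data_type_qids(query, "adjectives", type_qids))  # ['Q1084']
```

The second call captures exactly the "this is an adjectives query, why is the QID for nouns in here?" case: a noun QID inside an adjectives query comes back flagged.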

@catreedle
Collaborator

Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on, with @DeleMike perhaps taking the lead: we could repurpose this code to add another check of the actual queries. This idea came from extract_qid_from_sparql. There's obviously a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:

  • Is the language QID the one that it's supposed to be given the directory
  • Is the data type the one it's supposed to be given the directory

We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing ?lexeme dct:language wd:Q12345 (we'd need to continue to write like this, but it's fine). From there for the data type check we can just check to see if any of the QIDs for the other data types are in the query - i.e. "Hey this is an adjectives query. Why is the QID for nouns in here???".

Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work 😊

This is very thoughtful. I'm willing to work/collaborate on it. 😊

@DeleMike
Collaborator Author

Hey @andrewtavis and @catreedle, thanks folks for the review 😄.
So helpful! I am going through everything as I type.

I just wanted to drop my initial thoughts. From all I have read, we can generate a set of new issues from the work in this PR. I will be dropping my comments!

@andrewtavis
Member

I think we should definitely get this down to a point where we can merge and use the code from it as example code within the issues we make 😊 Let me know if I should make the issues and tag you both!

@DeleMike
Collaborator Author

Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on, with @DeleMike perhaps taking the lead: we could repurpose this code to add another check of the actual queries. This idea came from extract_qid_from_sparql. There's obviously a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:

  • Is the language QID the one that it's supposed to be given the directory
  • Is the data type the one it's supposed to be given the directory

We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing ?lexeme dct:language wd:Q12345 (we'd need to continue to write like this, but it's fine). From there for the data type check we can just check to see if any of the QIDs for the other data types are in the query - i.e. "Hey this is an adjectives query. Why is the QID for nouns in here???".

Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work 😊

Lovely! I would love to improve this code.

From the quote, I understand that:

  1. We want to repurpose existing code, particularly extract_qid_from_sparql, to check the accuracy of queries.
  2. We want a language QID check and a data type check in our workflows for validation.

@DeleMike
Collaborator Author

I think we should definitely get this down to a point where we can merge and use the code from it as example codes within the issues we make 😊 Let me know if I should make the issues and tag you both!

Yes, please. We would love to work on the issues created! :)

@andrewtavis
Member

Ok, let me know how you want to proceed here, @DeleMike :) I'm happy to clean this PR up with the code we want for now, and then also make the issues at the same time using the current functions as snippets. Let me know if you'd like to make any changes before then; I'll wait for a confirmation here and then finalize this one 😊

@DeleMike
Collaborator Author

Yeah, true!
For now, I'd rather you help us point in the direction we should go. That is, if there's any extra addition I'd make at this moment, it would all stem from your suggestions hence I would be happy if you could help adjust the PR now, and then make issues from it. The main reason is that we don't create a future complication. How about this?

or is there something I could quickly do so that it does not affect our codebase? the docstrings format issue?

@andrewtavis
Member

I can get to the docstrings really quick, @DeleMike :) No stress at all. Just wanted to note it for the next time!

@andrewtavis
Member

I'll do a quick fix of this one soon then and then make issues for the workflows 🚀

@DeleMike
Collaborator Author

I can get to the docstrings really quick, @DeleMike :) No stress at all. Just wanted to note it for the next time!

Oh okay! Noted 😄

@DeleMike
Collaborator Author

I'll do a quick fix of this one soon then and then make issues for the workflows 🚀

I'll be waiting for them :)

Thanks @andrewtavis for the support!

and thank you @catreedle. Let's get ready!! 💪😄

@andrewtavis
Member

#339 and #340 have been made and I pinged you each in them :) Let me know if there are any questions!

@DeleMike
Collaborator Author

Alright, checking...

Member

@andrewtavis andrewtavis left a comment


I decided against doing set metadata as well, @DeleMike, as I think we need to focus on making sure that the CLI options are for the end users and not for the development team. Thanks so much for these suggestions though! I think that the workflows are going to improve the project so much, and all of this was the catalyst 😊

@andrewtavis andrewtavis merged commit d43aeb8 into scribe-org:main Oct 13, 2024
3 checks passed
@DeleMike
Collaborator Author

I decided against doing set metadata as well, @DeleMike, as I think we need to focus on making sure that the CLI options are for the end users and not for the development team. Thanks so much for these suggestions though! I think that the workflows are going to improve the project so much, and all of this was the catalyst 😊

That's okay :-)
And you are welcome!



Development

Successfully merging this pull request may close these issues.

Refine CLI User Experience by Validating Input Languages and Data Types
