
Conversation

@DeleMike
Collaborator

  • Implemented checks for non-existent languages and data types in the total command.
  • Added informative error messages guiding users to update or set their language metadata.
  • Enhanced feedback for improved usability of the CLI.

Contributor checklist


Description

This PR enhances the Scribe-Data CLI by validating user input for languages and data types. If a non-existent language is specified, the user will receive a clear message informing them that the language does not exist, along with instructions on how to update their language_metadata.json file using:

scribe-data update --metadata

Alternatively, users can manually set the metadata with:

scribe-data set-metadata -lang [your_language] -qid [your_qid]

These enhancements aim to improve user experience and ensure accurate data retrieval in the Scribe-Data CLI.

Related issue

This PR is closely related to #295.

@github-actions

github-actions bot commented Oct 12, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution

    • The contributor's name and icon in remote commits should be the same as what appears in the PR
    • If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo
  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@catreedle
Collaborator

Hi @DeleMike,
Great job on this! It's impressive how you've come up with this solution.

Just a couple of things to consider:

  • We might need to revisit total.py later, as there could be some changes related to this PR.
  • Do we think it's safe to allow users to update the language_metadata? Specifically, does this handle cases where a user might try to update an existing language (e.g., English) with an incorrect QID?

Let me know your thoughts when you get a chance. :)

@catreedle
Collaborator

I was browsing through the issues and found one that might be related to this. #293

@DeleMike
Collaborator Author

Hi @DeleMike, Great job on this! It's impressive how you've come up with this solution.

Just a couple of things to consider:

  • We might need to revisit total.py later, as there could be some changes related to this PR.
  • Do we think it's safe to allow users to update the language_metadata? Specifically, does this handle cases where a user might try to update an existing language (e.g., English) with an incorrect QID?

Let me know your thoughts when you get a chance. :)

Yes, we might need to revisit it after it has been merged. Thanks @catreedle!

Concerning the ability to allow users (developers) to update the language metadata: it's a good option IMHO. In Flutter and other systems, there's sometimes a need to update your dependencies or refetch them (flutter pub get) just to make sure they are cached properly and available for use. However, I agree with your concern: if the user updates, say, "English" with the wrong QID, how do we tackle that?

I will think of ways soon and drop them in this discussion.

@DeleMike
Collaborator Author

I was browsing through the issues and found one that might be related to this. #293

Hmm... I did not see that.
It is closely related, right, @andrewtavis? What can we do about this?

One thing: this issue originated when Scribe-Data was returning some kind of default values for languages it could not find in the language_metadata.json file. Hence, I came up with this initial fix, which led to the procedure of updating the metadata:

    # When we run `scribe-data t -l Latin` and Latin doesn't exist in the
    # metadata, we cannot provide a QID for it.
    language_qid = get_qid_by_input(language)
    data_type_qid = get_qid_by_input(data_type)

    if not language_qid:
        print(
            "The specified language does not exist. Please update your language_metadata.json file by using:\n"
            "`scribe-data update --metadata`\n"
            "Alternatively, you can manually set it with:\n"
            "`scribe-data set-metadata -lang [your_language] -qid [your_qid]`.\n\n"
            "This will ensure that you can fetch the correct data."
        )
        return

We basically inform the user that the specified language does not exist and that they should update the metadata file. We also terminate the process early with the return keyword so that we exit the procedure gracefully after telling the user what to do.

Output of running scribe-data t -l Latin if it does not exist:

The specified language does not exist. Please update your language_metadata.json file by using:
`scribe-data update --metadata`
Alternatively, you can manually set it with:
`scribe-data set-metadata -lang [your_language] -qid [your_qid]`.

This will ensure that you can fetch the correct data.

@andrewtavis andrewtavis self-requested a review October 12, 2024 14:51
@andrewtavis andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 12, 2024
@andrewtavis
Member

andrewtavis commented Oct 13, 2024

Ok @DeleMike, I've had a bit of time to look this over. The big thing is that this functionality is useful now, but it shouldn't be needed in the future, as we shouldn't be releasing Scribe-Data or merging in language querying files if the language isn't in the metadata file. @OmarAI2003 will be expanding the language metadata file in #293, so then this should be fine 🤔 I agree with @catreedle that we don't necessarily want the user to have control over these internal files (edit: I'm unsure on this now).

There is definitely some value here though 😊 Specifically maybe we can change this so that it does the following:

  • A new file /src/scribe_data/check_language_metadata.py that would include some of the functionality you have here to check the language_data_extraction directory to see if the languages in the directory are included in the metadata file
  • A GitHub workflow in .github/workflows that runs on PRs that would run check_language_metadata.py and throw an error if the language_data_extraction directory hasn't been updated

How does this sound?

Edit: there's even a more expansive suggestion for how to use this code here. Really interesting possibilities!
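A minimal sketch of the proposed check, with the function name and data shapes as illustrative assumptions rather than the merged implementation. It compares the directory names under language_data_extraction against the entries in language_metadata.json and reports any language that has no metadata entry:

```python
# Hypothetical sketch of the proposed check_language_metadata.py; names and
# data shapes are assumptions, not Scribe-Data's actual implementation.
def find_missing_languages(extraction_dirs, language_metadata):
    """
    Return directory names under language_data_extraction/ that have no
    corresponding entry in language_metadata.json.
    """
    known = {entry["language"].lower() for entry in language_metadata["languages"]}
    return sorted(d for d in extraction_dirs if d.lower() not in known)


# Illustrative data; a PR workflow would read the real directory and JSON file
# and exit non-zero when the returned list is non-empty.
sample_metadata = {
    "languages": [
        {"language": "English", "qid": "Q1860"},
        {"language": "French", "qid": "Q150"},
    ]
}
missing = find_missing_languages(["English", "French", "Latin"], sample_metadata)
print(missing)  # ['Latin']
```

A workflow step would then fail the PR check whenever this list is non-empty, forcing the metadata file and the extraction directory to stay in sync.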


with LANGUAGE_METADATA_FILE.open("r", encoding="utf-8") as file:
    language_metadata = json.load(file)
try:
Member


These try-excepts are fine, but can we not set language_metadata = {"languages": []} at the end of them (and the same for data types)? If these files are not being loaded, then we have serious problems. If there's not a test for it already (that both of them are read and are in a specific location), then we should write one!

CC @DeleMike and @catreedle for checking for this test :)

Member


Given the reaction here, @catreedle, do you want to check the tests to see if we have some for those two files being accessible, and if not you can open a PR with tests for it? :)

Collaborator


Sure, allow me some time as I’m still getting familiar with testing. I'll check and get back to you later. :)

Member


Sounds good!

Collaborator


From what I see, there aren’t any specific tests to verify whether both language_metadata.json and data_type_metadata.json are being read, but the tests depend on these files. Deleting one causes an error during test runs. I've written tests to check the accessibility of these files, but they won’t run if the files are missing. I’m unsure about adding these tests and could use some help with this. :)
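One way to make such a test self-contained, sketched here with hypothetical helper and test names: wrap the file read in a small loader that fails loudly, then exercise it against a temporary file rather than the repo's real metadata files.

```python
# A sketch of the accessibility test discussed above; load_metadata and the
# test names are illustrative, not Scribe-Data's actual helpers.
import json
import tempfile
from pathlib import Path


def load_metadata(path: Path) -> dict:
    """Load a metadata JSON file, failing loudly if it is absent."""
    if not path.exists():
        raise FileNotFoundError(f"Required metadata file is missing: {path}")
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def test_metadata_file_is_readable():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "language_metadata.json"
        path.write_text(json.dumps({"languages": []}), encoding="utf-8")
        assert load_metadata(path) == {"languages": []}


def test_missing_metadata_file_raises():
    with tempfile.TemporaryDirectory() as tmp:
        try:
            load_metadata(Path(tmp) / "language_metadata.json")
        except FileNotFoundError as err:
            assert "missing" in str(err)
        else:
            raise AssertionError("expected FileNotFoundError")


test_metadata_file_is_readable()
test_missing_metadata_file_raises()
```

Because the missing-file case is simulated with a temporary directory, the test suite can verify the failure path without actually deleting the real metadata files.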

Collaborator


I guess I could put it in a new test file for this if it's still needed?

Collaborator Author

@DeleMike DeleMike Oct 13, 2024


Yes, I guess you could, but wait for a final go-ahead :)

EDIT: We definitely need the test, because before submitting this PR I thought about it: "what if the files are not loaded?"

Member


We can have two minor tests just like you described, @catreedle :) Feel free to put them in an appropriate directory/file that's for the whole application. I'll move it if needed 😊

Collaborator


Added a PR here. Looking forward to the review. :)

}


def get_available_languages() -> list[tuple[str, str]]:
Member


Here we'd just return the languages for the check_language_metadata.py file :) There's a degree of hard coding here with the query_verbs.sparql that we should avoid.

Maybe there is some value in letting a Scribe-Data developer set the metadata file with a language and a QID of their choice though 🤔 They could also use the QID during testing, but then that has specific return functionality now that's getting all data types by default instead of what's in Scribe-Data already. This would allow us to do a simple command in the docs to update the metadata file, but we should make it clear that it's for development only. We don't want the end user to add a language-qid pair to the metadata, as they wouldn't have the ability to write the queries (if, say, they install with pip).

Collaborator Author


You are right. I thought about having it only return language names, but I felt we could do better or create a "standard", i.e. we would have to ensure that query_verbs.sparql exists in all directories. And I've checked: the French directory has query_verbs_1.sparql and query_verbs_2.sparql, so I guess it won't be helpful.

return available_languages


def extract_qid_from_sparql(file_path: Path) -> str:
Member


extract_qid_from_sparql, check_and_update_languages and update_language_metadata wouldn't be needed, as this would be covered by the check workflow on PRs that will ensure that the metadata file is appropriate :) I'm still on the fence when it comes to set_metadata though 🤔

Collaborator Author


Yeah, I agree with why we should move them. Great point 😄

I thought that set_metadata would be useful if, for some reason, we could not automatically update the metadata (that is, when it was still in use); then we could call scribe-data set-metadata -lang French -qid Q150, for example, and this would update our metadata file.

)
TOTAL_DESCRIPTION = "Check Wikidata for the total available data for the given languages and data types."
CONVERT_DESCRIPTION = "Convert data returned by Scribe-Data to different file types."
UPDATE_DESCRIPTION = "Update the metadata file with available languages and QIDs."
Member


Main would need the update functionality removed as it'll be covered by the new workflow on PRs :)


if not language_qid:
    print(
        "The specified language does not exist. Please update your language_metadata.json file by using:\n"
Member


We'll need to check these outputs once the updates are done :)

Collaborator Author


Yeah, sure! Thanks @andrewtavis !

print(f"Error updating language metadata: {e}")


def set_metadata(language_name: str, qid: str):
Member


If we do keep set_metadata, then we should include an iso argument :)
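set_metadata was ultimately not kept, but a version including the suggested iso argument might have looked roughly like the following sketch (all names and the metadata shape are hypothetical, for illustration only):

```python
# Hypothetical sketch of set_metadata with an iso argument, operating on an
# in-memory metadata dict; not Scribe-Data's actual implementation.
def set_metadata(language_metadata: dict, language_name: str, qid: str, iso: str) -> dict:
    """Add or update a language entry, matching existing names case-insensitively."""
    for entry in language_metadata["languages"]:
        if entry["language"].lower() == language_name.lower():
            entry["qid"] = qid
            entry["iso"] = iso
            return language_metadata

    language_metadata["languages"].append(
        {"language": language_name, "qid": qid, "iso": iso}
    )
    return language_metadata


metadata = {"languages": [{"language": "English", "qid": "Q1860", "iso": "en"}]}
set_metadata(metadata, "French", "Q150", "fr")
print(len(metadata["languages"]))  # 2
```

Matching case-insensitively on the language name means a repeated call updates the existing entry rather than appending a duplicate.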

Member


Also @DeleMike, it'd be great if, for the functions that are kept, you check some of the other docstrings and use similar formatting. As the function docstrings are written now, they won't be parsed correctly for the readthedocs documentation :)

Collaborator Author


Oh! I did not know about this. Will update :)

@andrewtavis
Member

Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on, with @DeleMike perhaps taking the lead: we could repurpose this code to add another check of the actual queries. This idea came from extract_qid_from_sparql. There's obviously a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:

  • Is the language QID the one that it's supposed to be given the directory
  • Is the data type the one it's supposed to be given the directory

We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing ?lexeme dct:language wd:Q12345 (we'd need to continue to write like this, but it's fine). From there for the data type check we can just check to see if any of the QIDs for the other data types are in the query - i.e. "Hey this is an adjectives query. Why is the QID for nouns in here???".

Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work 😊
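A sketch of the two checks described above, written against inline sample values; the function names are hypothetical, and a real workflow would load the QIDs from the language and data type metadata files. The regex assumes queries write the language triple as ?lexeme dct:language wd:Q12345, per the convention noted above:

```python
# Illustrative sketch of the proposed query checks; not the actual workflow code.
import re
from typing import Optional

# Assumes the `?lexeme dct:language wd:QID` writing convention noted above.
LANGUAGE_QID_PATTERN = re.compile(r"\?lexeme\s+dct:language\s+wd:(Q\d+)")


def extract_language_qid(query_text: str) -> Optional[str]:
    """Return the language QID a query declares, or None if absent."""
    match = LANGUAGE_QID_PATTERN.search(query_text)
    return match.group(1) if match else None


def foreign_data_type_qids(query_text: str, expected_type: str, type_qids: dict) -> list:
    """Return QIDs of *other* data types found in the query (likely paste errors)."""
    return sorted(
        qid for dtype, qid in type_qids.items()
        if dtype != expected_type and qid in query_text
    )


# Sample values; a workflow would read these from the metadata files.
type_qids = {"adjectives": "Q34698", "nouns": "Q1084"}
query = (
    "SELECT ?lexeme WHERE { ?lexeme dct:language wd:Q150 ; "
    "wikibase:lexicalCategory wd:Q1084 . }"
)
print(extract_language_qid(query))                             # Q150
print(foreign_data_type_qids(query, "adjectives", type_qids))  # ['Q1084']
```

The second call captures exactly the "this is an adjectives query, why is the QID for nouns in here?" case: a noun QID inside an adjectives query comes back flagged.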

@catreedle
Collaborator

Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on, with @DeleMike perhaps taking the lead: we could repurpose this code to add another check of the actual queries. This idea came from extract_qid_from_sparql. There's obviously a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:

  • Is the language QID the one that it's supposed to be given the directory
  • Is the data type the one it's supposed to be given the directory

We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing ?lexeme dct:language wd:Q12345 (we'd need to continue to write like this, but it's fine). From there for the data type check we can just check to see if any of the QIDs for the other data types are in the query - i.e. "Hey this is an adjectives query. Why is the QID for nouns in here???".

Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work 😊

This is very thoughtful. I'm willing to work/collaborate on it. 😊

@DeleMike
Collaborator Author

Hey @andrewtavis and @catreedle, thanks folks for the review 😄.
So helpful! I am going through everything as I type.

I just wanted to drop my initial thoughts. From all I have read, we can generate a set of new issues from the work in this PR. I will be dropping my comments!

@andrewtavis
Member

I think we should definitely get this down to a point where we can merge and use the code from it as example code within the issues we make 😊 Let me know if I should make the issues and tag you both!

@DeleMike
Collaborator Author

Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on, with @DeleMike perhaps taking the lead: we could repurpose this code to add another check of the actual queries. This idea came from extract_qid_from_sparql. There's obviously a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:

  • Is the language QID the one that it's supposed to be given the directory
  • Is the data type the one it's supposed to be given the directory

We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing ?lexeme dct:language wd:Q12345 (we'd need to continue to write like this, but it's fine). From there for the data type check we can just check to see if any of the QIDs for the other data types are in the query - i.e. "Hey this is an adjectives query. Why is the QID for nouns in here???".

Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work 😊

Lovely! I would love to improve this code.

From the quote, I understand that:

  1. We want to repurpose existing code, particularly extract_qid_from_sparql, to check the accuracy of queries.
  2. We want a language QID check and a data type check in our workflows for validation.

@DeleMike
Collaborator Author

I think we should definitely get this down to a point where we can merge and use the code from it as example codes within the issues we make 😊 Let me know if I should make the issues and tag you both!

Yes, please. We would love to work on the issues created! :)

@andrewtavis
Member

Ok, let me know how you want to proceed here, @DeleMike :) I'm happy to clean this PR up with the code we want for now, and then also make the issues at the same time using the current functions as snippets. Let me know if you'd like to make any changes before then; I'll wait for a confirmation here and then finalize this one 😊

@DeleMike
Collaborator Author

Yeah, true!
For now, I'd rather you help us point in the direction we should go. That is, if there's any extra addition I'd make at this moment, it would all stem from your suggestions hence I would be happy if you could help adjust the PR now, and then make issues from it. The main reason is that we don't create a future complication. How about this?

or is there something I could quickly do so that it does not affect our codebase? the docstrings format issue?

@andrewtavis
Member

I can get to the docstrings really quick, @DeleMike :) No stress at all. Just wanted to note it for the next time!

@andrewtavis
Member

I'll do a quick fix of this one soon then and then make issues for the workflows 🚀

@DeleMike
Collaborator Author

I can get to the docstrings really quick, @DeleMike :) No stress at all. Just wanted to note it for the next time!

Oh okay! Noted 😄

@DeleMike
Collaborator Author

I'll do a quick fix of this one soon then and then make issues for the workflows 🚀

I'll be waiting for them :)

Thanks @andrewtavis for the support!

and thank you @catreedle. Let's get ready!! 💪😄

@andrewtavis
Member

#339 and #340 have been made and I pinged you each in them :) Let me know if there are any questions!

@DeleMike
Collaborator Author

Alright, checking...

Member

@andrewtavis andrewtavis left a comment


I decided against doing set metadata as well, @DeleMike, as I think we need to focus on making sure that the CLI options are for the end users and not for the development team. Thanks so much for these suggestions though! I think that the workflows are going to improve the project so much, and all of this was the catalyst 😊

@andrewtavis andrewtavis merged commit d43aeb8 into scribe-org:main Oct 13, 2024
3 checks passed
@DeleMike
Collaborator Author

I decided against doing set metadata as well, @DeleMike, as I think we need to focus on making sure that the CLI options are for the end users and not for the development team. Thanks so much for these suggestions though! I think that the workflows are going to improve the project so much, and all of this was the catalyst 😊

That's okay :-)
And you are welcome!



Development

Successfully merging this pull request may close these issues.

Refine CLI User Experience by Validating Input Languages and Data Types
