Baseline autosuggest #194

Closed
2 tasks done
andrewtavis opened this issue Aug 14, 2022 · 35 comments
Assignees
Labels: -next release- (Included in the next release), feature (New feature or request), help wanted (Extra attention is needed)

Comments

@andrewtavis
Member

andrewtavis commented Aug 14, 2022

Terms

Description

In order to release the work that's been done in #188, Scribe also needs to add baseline autosuggest functionality. The reason is that users would expect to see word suggestions even when they're not currently typing a word. This is the other part of the baseline version of #3, which would see this feature implemented using natural language processing models over the current words in the text proxy (will be another issue later).

The working idea for this is that we should find/generate a list of most used words in each Scribe keyboard language, and then do random selections from this list. In this way the architecture of the feature will be set for future iterations.
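A minimal sketch of the random-selection idea, assuming a per-language list of common words already exists. The `COMMON_WORDS` table and the `baseline_suggestions` name are placeholders for illustration, not part of Scribe:

```python
import random

# Hypothetical stand-in for a per-language list of most-used words.
COMMON_WORDS = {
    "en": ["the", "and", "that", "have", "with", "this", "from", "they"],
}


def baseline_suggestions(language, count=3, rng=random):
    """Return up to `count` distinct words picked at random for `language`."""
    words = COMMON_WORDS.get(language, [])
    # Never ask for more words than the list holds.
    return rng.sample(words, min(count, len(words)))


print(baseline_suggestions("en"))
```

Keeping the word source behind one function means the random pick could later be swapped for a smarter lookup without touching the callers.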

Contribution

I'll be working on this with those who are interested prior to the v1.5.0 release :)

@andrewtavis
Member Author

andrewtavis commented Aug 18, 2022

[Edit]: I think the Wikipedia based implementation below might be better than this 🤔😊

@SaurabhJamadagni, for this issue, how does using wordfreq sound? It has a function top_n_list() that would give us the words we need, and we could maybe just subset so that we're selecting words that are three characters or longer? I was thinking we can take a look at the words of three or more characters given certain cutoffs for n, and from there we can decide how many words we want to include for this baseline (e.g. 30, 40, etc).
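A rough sketch of that subsetting step. `filter_candidates` is a hypothetical helper, and the short hand-written list only stands in for the ranked output that `top_n_list()` would return:

```python
def filter_candidates(ranked_words, min_length=3, limit=30):
    """Keep the first `limit` words that have at least `min_length` characters.

    `ranked_words` is assumed to be ordered from most to least frequent.
    """
    kept = [word for word in ranked_words if len(word) >= min_length]
    return kept[:limit]


# With wordfreq installed, the ranked input could come from:
#   from wordfreq import top_n_list
#   ranked = top_n_list("de", 200)
# A tiny hand-written list stands in for it here.
ranked = ["der", "die", "und", "in", "zu", "nicht", "ist"]
print(filter_candidates(ranked, limit=4))
```

Trying a few values of `limit` against the real list would show how many words survive the length filter at each cutoff.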

It has an MIT license, so we should mention it in the privacy policy and I'll maybe make a NOTICE.txt for us to add some things that we're using. Seems like the easiest thing to do, as I can just make a simple Python script and save it to Scribe-Data 😊

@andrewtavis
Member Author

andrewtavis commented Aug 18, 2022

Looking at German autosuggest for the first time in a bit, some patterns are emerging that we could also use to make this more performant, as it seems that some of the system autosuggest isn't really NLP based at all. The following two are ones I think we can implement, and maybe we can think of one or two others:

  • There are three words that are always presented for the first options when nothing is typed that we can copy
  • If the first word the user types is a pronoun, the three suggestions are normally the conjugations of have, be and can
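
The two rules above could be sketched like this. The word lists are made-up English placeholders (the real defaults would be copied per language), and `rule_based_suggestions` is a hypothetical name:

```python
# Placeholder English data, for illustration only.
DEFAULT_SUGGESTIONS = ["I", "The", "It"]
PRONOUN_SUGGESTIONS = {
    "i": ["have", "am", "can"],
    "you": ["have", "are", "can"],
    "she": ["has", "is", "can"],
}


def rule_based_suggestions(text_before_cursor):
    """Return three hard-coded suggestions, or None if no rule applies."""
    words = text_before_cursor.split()
    if not words:
        # Rule 1: nothing typed yet, so show the fixed defaults.
        return list(DEFAULT_SUGGESTIONS)
    if len(words) == 1:
        # Rule 2: the first word is a pronoun, so offer have/be/can forms.
        return PRONOUN_SUGGESTIONS.get(words[0].lower())
    return None


print(rule_based_suggestions(""))
print(rule_based_suggestions("She"))
```

Returning `None` when no rule matches leaves room for whatever general lookup comes next.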

We should definitely make sure to not over-engineer this with things that are just going to be removed with the NLP implementation. The focus should be on what happens when a user types something at the start of the proxy, not on later situations where context would be used.

@andrewtavis
Member Author

andrewtavis commented Aug 18, 2022

Something to consider as well is that Wikipedia is licensed CC-BY-SA, meaning with attribution we can use their data (similar to Wikidata, which is CC0 where we don't have to attribute, but do to show respect). We could potentially do a more final version of this feature from the get go: I could download Wikipedia dumps in a similar manner to what I did for wikirec, and from there the dumps could be used to generate the list of most common words. A random selection of articles could also be used to generate the four words that most often follow them on Wikipedia (four so that we could do a random selection from them for some variance). Words that aren't some of the most common would receive random selections.

I feel like this would basically be equivalent to system autosuggest, as there really doesn't seem to be much variance. We could also use the frequency of words that follow to determine if there should be any variance at all in the selections, as if it's basically only three words that follow it, then we wouldn't need to include a fourth. This list given JSON sizes could easily be 500 words with 3-5 options each without being too large, with these 500 or so words basically accounting for 75% of the words that are used for a given language.
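
A minimal sketch of deriving such a table from plain article text. The function name, the simple regex tokenizer, and the lowercasing are assumptions for illustration; parsing real Wikipedia dumps would need more care:

```python
import re
from collections import Counter, defaultdict


def build_autosuggestions(texts, top_words=500, followers=4):
    """Map each of the most common words to the words that most often follow it."""
    word_counts = Counter()
    next_counts = defaultdict(Counter)
    for text in texts:
        # Simplification: lowercase everything and ignore sentence boundaries.
        tokens = re.findall(r"\w+", text.lower())
        word_counts.update(tokens)
        for current, following in zip(tokens, tokens[1:]):
            next_counts[current][following] += 1
    return {
        word: [w for w, _ in next_counts[word].most_common(followers)]
        for word, _ in word_counts.most_common(top_words)
    }


sample = ["The cat sat. The cat ran. The dog sat."]
print(build_autosuggestions(sample)["the"])
```

The per-word counters also hold the follower frequencies, so they could be used to decide whether a fourth option is worth keeping for a given word.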

@andrewtavis
Member Author

andrewtavis commented Aug 18, 2022

Putting this all in a Python context would also allow for the general data analytics/data science community to give some feedback on the process. I think that adding this kind of implementation to Scribe-Data could help grow the community 😊

@SaurabhJamadagni
Collaborator

If the first word the user types is a pronoun, the three suggestions are normally the conjugations of have, be and can

First off, this was a really nice observation @andrewtavis. I never noticed it before.

To clarify, you are saying that using Wikipedia we can find the four most commonly used words following the word the user types. For example, if the user types "I love", using Wikipedia dumps we can come up with a list that hypothetically gives us ["cars", "movies", "food", "math"], assuming they are the most common words that follow the word 'love' in a random selection of articles. Is this a correct understanding?

I think the Wikipedia option seems better than just randomly displaying suggestions based on word frequency. But I am still not clear on what exactly will go on behind the scenes. Does the four word list have to be generated each time a user types a new word or do we just do a basic lookup in the already downloaded data?

@andrewtavis
Member Author

andrewtavis commented Aug 20, 2022

I’m not sure if there are any other hard coded rules like the first words or words after pronouns to think about, but I was happy to realize them 😊 Let’s stick to these for now unless something jumps out.

Your understanding is correct, @SaurabhJamadagni! So not what usually comes after “I love”, but instead deriving what usually comes after “love”. Maybe we can figure out phrases a bit later, but for the implementation now let’s focus on one word so that the iOS side will be similar to autocomplete, in that we’ll just get the last word and won’t need to check the last two or more.

We’ll do a basic lookup of already downloaded data for this. I think that JSONs with 500 keys and the four most common words as values will be within the range of size we can add as of now. The same as it works for other data, the process will happen in Scribe-Data that will then update files in the various apps :)

@SaurabhJamadagni
Collaborator

We’ll do a basic lookup of already downloaded data for this. I think that JSONs with 500 keys and the four most common words as values will be within the range of size we can add as of now. The same as it works for other data, the process will happen in Scribe-Data that will then update files in the various apps :)

That sounds like a nice plan @andrewtavis. What do we default to when the user types a word that isn't included in the 500 keys?

Also I was wondering if we could use the wordfreq library to amp up autocomplete? Instead of giving completions that are alphabetical, we could give the completions based on their frequency. And then when the user is done typing, the autosuggestions that come later would be from the Wikipedia list we create of common 500 words.
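
A rough sketch of what frequency-ordered completions could look like. The `frequencies` mapping and the `complete` function are hypothetical, and the values are invented for illustration (real ones could come from wordfreq's `word_frequency()`):

```python
def complete(prefix, frequencies, count=3):
    """Return up to `count` words starting with `prefix`, most frequent first."""
    prefix = prefix.lower()
    matches = [word for word in frequencies if word.lower().startswith(prefix)]
    # Higher frequency first; ties fall back to alphabetical order.
    matches.sort(key=lambda word: (-frequencies[word], word))
    return matches[:count]


# Invented frequencies, for illustration only.
frequencies = {"that": 0.01, "the": 0.05, "than": 0.002, "this": 0.008, "cat": 0.001}
print(complete("th", frequencies))
```

With purely alphabetical ordering, "than" would be offered ahead of "the" here, which is the behavior this ordering is meant to avoid.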

@andrewtavis
Member Author

andrewtavis commented Aug 21, 2022

Your idea of basing the suggestions on frequency is a good one, @SaurabhJamadagni, but we’ll basically be doing that in the approach ourselves :) After this we’ll get the 500 or so words, and with the top 3-5 words we’ll also have frequencies that will determine how often the words are suggested (we could do this, if we wanted). We certainly do not need to do a random selection from the words that we find for suggestions :)

We can just do some suggestions from common words if the word isn’t in the 500 we get. This again will only be ~20 percent of the time, and maybe even less given that mobile keyboards are more for simpler speech. We can increase the dictionary size within the language packs later potentially 😊
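
A minimal sketch of that lookup with a fallback, assuming the data ships as JSON with words as keys and their common followers as values. The sample JSON, the fallback list, and `suggest_after` are made up for illustration:

```python
import json

# Hypothetical excerpt of the per-language autosuggestions file.
AUTOSUGGESTIONS_JSON = '{"love": ["to", "you", "is", "and"]}'
# Hypothetical common words used when the typed word is not a key.
FALLBACK_WORDS = ["the", "and", "to"]


def suggest_after(last_word, table, fallback, count=3):
    """Look up followers of `last_word`, falling back to common words."""
    options = table.get(last_word) or table.get(last_word.lower())
    if not options:
        return fallback[:count]
    return options[:count]


table = json.loads(AUTOSUGGESTIONS_JSON)
print(suggest_after("love", table, FALLBACK_WORDS))
print(suggest_after("xylophone", table, FALLBACK_WORDS))
```

Because this is a plain dictionary lookup on the last word, nothing has to be generated while the user types.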

@SaurabhJamadagni
Collaborator

but we’ll basically be doing that in the apron h ourselves

I think there's a typo there. I didn't understand what you meant 😅

We can just do some suggestions from common words if the word isn’t in the 509 we get. This again will only be ~20 percent of the time, and maybe even less given that mobile keyboards are more for simpler sprach. We can increase the dictionary size within the language packs later potentially 😊

Alright, that sounds like a good idea! Let me know how we can get started with the implementation. Excited to dig into something new haha. I'll go through the Scribe-Data documentation in the meanwhile. 😊

@andrewtavis
Member Author

I think there's a typo there. I didn't understand what you meant 😅

My guess is approach, reading it again 😅 Sorry, have been typing on my phone :)

Alright, that sounds like a good idea! Let me know how we can get started with the implementation. Excited to dig into something new haha. I'll go through the Scribe-Data documentation in the meanwhile. 😊

I already made an issue for the data process with some tasks: Scribe-Data #15 :) Check that out a bit, and we can discuss more who does what later! Looking forward as well 😊😊

@andrewtavis
Member Author

Once this issue is finished, the keys of the dictionary should also be included into the autocomplete options :)

@andrewtavis removed the "good first issue" (Good for newcomers) label Aug 29, 2022

@andrewtavis
Member Author

Note that as autosuggest will be functioning at the same time as annotations, it would be best if the first best option is shifted to the second position and the second to the third in case of an annotation :)
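
A small sketch of that slot shifting, assuming three slots and that an annotation, when present, takes the first one. `arrange_slots` and the example annotation string are hypothetical:

```python
def arrange_slots(suggestions, annotation=None, slots=3):
    """Fill the suggestion bar, giving the first slot to an annotation if any."""
    if annotation is None:
        return suggestions[:slots]
    # The best suggestion shifts to slot two and the second best to slot three.
    return [annotation] + suggestions[: slots - 1]


print(arrange_slots(["to", "you", "is"]))
print(arrange_slots(["to", "you", "is"], annotation="(F) Liebe"))
```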

@andrewtavis
Member Author

Another observation, for numbers, the autosuggestions should be the words for "is", "or" and "minimum" in the given language :)
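
As a sketch, the number rule might look like the following. Only the English words from the comment above are included; the separator handling and the `number_suggestions` name are assumptions:

```python
# Words for "is", "or" and "minimum" per language; only English is filled in.
NUMBER_SUGGESTIONS = {"en": ["is", "or", "minimum"]}


def number_suggestions(last_word, language):
    """Return the fixed number suggestions if `last_word` is a number."""
    # Assumption: ignore "," and "." so that "1,000" and "3.5" count as numbers.
    digits = last_word.replace(",", "").replace(".", "")
    if digits.isdecimal():
        return NUMBER_SUGGESTIONS.get(language)
    return None


print(number_suggestions("42", "en"))
```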

@andrewtavis
Member Author

I'll do the small fixes I mentioned in the PR now, @SaurabhJamadagni :)

A small aside though: this page here shows something that I was expecting sooner rather than later, and it makes me happy 😊 We don't have too many contributors, but your commits now give you the credit that your code contributions deserve 🏆 Thanks for all the work you've been putting in!

@andrewtavis
Member Author

andrewtavis commented Sep 12, 2022

Just sent along some minor fixes that I didn't think needed review in 3ffbdce, @SaurabhJamadagni 😊

Let me know how you're feeling right now as far as the issues that are being worked on. Everything that's left involves coding that you haven't worked on until now, like .xib files in #197 and Wikidata-related issues in Scribe-Data. Happy to guide you if you have interest in any of it, but also totally fine if you'd like to take a break for a moment 😊

@andrewtavis
Member Author

andrewtavis commented Sep 24, 2022

Note that autosuggest as of now is deleting part of the previous word as if it's autocomplete, so we need to make sure that this isn't triggered :)

Edit: I fixed it just now 🙌
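
The distinction behind this bug can be sketched on plain strings. This is an illustration only, not the actual fix in the app, and both function names are hypothetical:

```python
def apply_autocomplete(text, completion):
    """Replace the partially typed last word with `completion`."""
    head, separator, _partial = text.rpartition(" ")
    return head + separator + completion + " "


def apply_autosuggest(text, suggestion):
    """Append `suggestion` after the finished previous word, deleting nothing."""
    separator = "" if text == "" or text.endswith(" ") else " "
    return text + separator + suggestion + " "


print(repr(apply_autocomplete("I lo", "love")))
print(repr(apply_autosuggest("I love ", "you")))
```

Running the autocomplete path for a suggestion would strip characters from the previous word, which matches the behavior described above.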

@andrewtavis
Member Author

@SaurabhJamadagni, this issue isn't closed yet, but the data is in there via 1a7ee7d 🚀 We'd talked about you switching over the autosuggest source to these files, which you'd be welcome to do if you'd like to 😊

@andrewtavis
Member Author

andrewtavis commented Oct 4, 2022

Almost done with v2.0.0!! Still lots of work, but we're very very close 😊

@SaurabhJamadagni
Collaborator

We'd talked about you switching over the autosuggest source to these files, which you'd be welcome to do if you'd like to 😊

Yeah @andrewtavis. I think I'll do this issue first then. The annotation formatting issue is getting a bit confusing when it comes to moving the function from the ViewController to Annotate.swift.

@andrewtavis
Member Author

That one is definitely a bit more confusing, @SaurabhJamadagni :) I'll answer your question there later in the day! And looking forward to the work here! 🚀

@andrewtavis
Member Author

@SaurabhJamadagni, what fcc9fdb did was the following:

  • Updates the changelog 🚀
  • Updates the privacy policy for the repo and the app
  • Allows for words that aren't pronouns (or German nouns) to be suggested as lower case even if they're capitalized in the autosuggestions dict
  • Adds the autosuggestion keys into the autocompletion options
    • Takes the lower case one if it's available and removes duplicates
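
A rough Python sketch of the last two points, for illustration only (the commit itself is in the iOS app). The function names and the `keep_capitalized` set are hypothetical:

```python
def display_form(word, keep_capitalized):
    """Lowercase a suggestion unless it must stay capitalized."""
    # `keep_capitalized` would hold pronouns such as "I" or German nouns.
    return word if word in keep_capitalized else word.lower()


def merge_keys_into_autocomplete(autocomplete_words, suggestion_keys):
    """Add autosuggestion keys to the autocomplete options without duplicates."""
    merged = set(autocomplete_words)
    keys = set(suggestion_keys)
    for key in keys:
        lower = key.lower()
        # Prefer the lowercase variant whenever it is available.
        if key != lower and (lower in keys or lower in merged):
            continue
        merged.add(key)
    return sorted(merged)


print(merge_keys_into_autocomplete(["apple"], ["The", "the", "Berlin", "apple"]))
```

Here "The" is dropped in favor of "the", while "Berlin" is kept because no lowercase form exists.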

I figure it's all relatively standard and at this point we need to just get this out and start testing it 😊 Will close this now as all the tasks that were being tracked are now done!!

Thanks for all your hard work on this! 😊

@SaurabhJamadagni
Collaborator

I figure it's all relatively standard and at this point we need to just get this out and start testing it 😊 Will close this now as all the tasks that were being tracked are now done!!

Thanks for all your hard work on this! 😊

Thanks @andrewtavis! So happy to finish this feature finally! Thanks for your work and help as well 😊 🚀

@andrewtavis
Member Author

It's by no means perfect, @SaurabhJamadagni, but it's definitely good enough to get it out, especially with all the improvements that autocomplete and the preposition-case annotation interface are bringing 😊

I'm working on all the design elements that we need for v2.0.0. Get to the task that you have in #197 when you can, and then I'll work on the rest for getting this out! 🚀

@SaurabhJamadagni
Collaborator

Yeah @andrewtavis, I'm actually done with the changes. I have a presentation in a bit so I was holding off the PR till evening. Will issue it asap. 😊

@andrewtavis
Member Author

andrewtavis commented Oct 7, 2022

Sounds good, @SaurabhJamadagni! I think a bit of extra coding will be necessary for the preposition case conjugation interface, but generally I think that early next week is a reasonable release date at this point. From there we can work on small issues again, finally 😅😅😊

@SaurabhJamadagni
Collaborator

I think that early next week is a reasonable release date at this point. From there we can work on small issues again, finally

Let's gooo @andrewtavis! 🚀

As much as I am looking forward to working on the smaller issues, can't deny that large feature additions are a different kind of joy haha 😊

2 participants