Baseline autosuggest #194
Comments
[Edit]: I think the Wikipedia-based implementation below might be better than this 🤔😊 @SaurabhJamadagni, for this issue, how does using wordfreq sound? It has a function It has an MIT license, so we should mention it in the privacy policy, and I'll maybe make a NOTICE.txt for us to add some things that we're using. Seems like the easiest thing to do, as I can just make a simple Python script and save it to Scribe-Data 😊
Looking at German autosuggest for the first time in a bit, some patterns are emerging that we could also use to make this more performant, as it seems that some of the system autosuggest isn't really NLP-based at all. The following two are ones I think we can implement, and maybe we can think of one or two others:
We should definitely make sure not to over-engineer this with things that are just going to be removed with the NLP implementation. The focus should be on what happens when a user types something at the start of the proxy, not on later situations where context would be used.
Something to consider as well is that Wikipedia is licensed I feel like this would basically be equivalent to system autosuggest, as there really doesn't seem to be much variance. We could also use the frequency of words that follow to determine if there should be any variance at all in the selections: if basically only three words follow a given word, then we wouldn't need to include a fourth. Given JSON sizes, this list could easily be 500 words with 3-5 options each without being too large, with these 500 or so words basically accounting for 75% of the words that are used for a given language.
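For illustration, here is a minimal Python sketch of how such a table could be derived from a text sample. It is not the Scribe-Data implementation: the function name `build_autosuggest_table`, the naive tokenizer, and the `max_keys`/`max_options` parameters are all assumptions made for this example, and a real run would use cleaned Wikipedia text rather than a short string.

```python
import json
import re
from collections import Counter, defaultdict


def build_autosuggest_table(text, max_keys=500, max_options=4):
    """Map the most frequent words to the words that most often follow them.

    Hypothetical helper for illustration only.
    """
    # Naive tokenization (an assumption): lowercase runs of letters, with an
    # optional apostrophe part. Punctuation is dropped, so pairs may span
    # sentence boundaries in this simplified version.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())

    word_counts = Counter(tokens)
    followers = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        followers[current][following] += 1

    table = {}
    for word, _ in word_counts.most_common(max_keys):
        # Keep at most `max_options` followers, fewer if fewer were observed.
        options = [w for w, _ in followers[word].most_common(max_options)]
        if options:
            table[word] = options
    return table


table = build_autosuggest_table("I love cars. I love food and I love cars.")
# Serialize the table so it could be shipped as a JSON file.
print(json.dumps(table, ensure_ascii=False, indent=2))
```

Because only observed followers are kept, a word that is followed by just three distinct words ends up with three options rather than a padded fourth.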
Putting this all in a Python context would also allow the general data analytics/data science community to give some feedback on the process. I think that adding this kind of implementation to Scribe-Data could help grow the community 😊
First off, this was a really nice observation @andrewtavis. I never noticed it before. To clarify, you are saying that using Wikipedia we can find the four most commonly used words following the word the user types. For example, if the user types "I love", using Wikipedia dumps we can come up with a list that hypothetically gives us ["cars", "movies", "food", "math"], assuming they are the most common words that follow the word 'love' in a random selection of articles. Is this a correct understanding? I think the Wikipedia option seems better than just randomly displaying suggestions based on word frequency. But I am still not clear on what exactly will go on behind the scenes. Does the four-word list have to be generated each time a user types a new word, or do we just do a basic lookup in the already downloaded data?
I’m not sure if there are any other hard-coded rules like the first words or words after pronouns to think about, but I was happy to realize them 😊 Let’s stick to these for now unless something jumps out. Your understanding is correct, @SaurabhJamadagni! So not what usually comes after “I love”, but instead deriving what usually comes after “love”. Maybe we can figure out phrases a bit later, but for the implementation now let’s focus on one word so that the iOS side will be similar to autocomplete in that we’ll just get the last word, without needing to check the last two or more. We’ll do a basic lookup of already downloaded data for this. I think that JSONs with 500 keys and the four most common words as values will be within the range of size we can add as of now. As with other data, the process will happen in Scribe-Data, which will then update files in the various apps :)
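The last-word lookup described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the iOS code: `suggest_next_words` is a hypothetical name, the whitespace split and punctuation stripping are simplifying assumptions, and the table contents reuse the hypothetical "love" example from the comment above.

```python
def suggest_next_words(text_before_cursor, table):
    """Return the stored suggestions for the last word typed, if any.

    Hypothetical helper: only the last word is inspected, never a phrase.
    """
    words = text_before_cursor.split()
    if not words:
        return []
    # Normalize the last word (assumption: lowercase keys, no punctuation).
    last_word = words[-1].lower().strip(".,!?;:\"'")
    return table.get(last_word, [])


# Hypothetical data in the shape discussed: one key, four common followers.
table = {"love": ["cars", "movies", "food", "math"]}
print(suggest_next_words("I love", table))
```

Since the table is precomputed, nothing is generated at typing time; each keystroke that ends a word costs one dictionary lookup.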
That sounds like a nice plan @andrewtavis. What do we default to when the user types a word that isn't included in the 500 keys? Also, I was wondering if we could use the wordfreq library to amp up autocomplete? Instead of giving completions that are alphabetical, we could give the completions based on their frequency. And then when the user is done typing, the autosuggestions that come later would be from the Wikipedia list we create of the 500 most common words.
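As a rough sketch of frequency-ordered autocomplete, the Python below ranks prefix matches by a frequency value instead of alphabetically. The function name `rank_completions` and the frequency numbers are made up for this example; in practice the values might come from a library such as wordfreq or from counts over the Wikipedia data.

```python
def rank_completions(prefix, frequencies, count=3):
    """Return up to `count` words starting with `prefix`, most frequent first.

    Hypothetical helper; `frequencies` maps lowercase words to any
    comparable frequency score.
    """
    prefix = prefix.lower()
    matches = [word for word in frequencies if word.startswith(prefix)]
    # Sort by descending frequency, breaking ties alphabetically.
    return sorted(matches, key=lambda w: (-frequencies[w], w))[:count]


# Made-up frequency values for illustration only.
frequencies = {"the": 0.05, "then": 0.002, "they": 0.004, "apple": 0.0001}
print(rank_completions("th", frequencies))
```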
Your idea of basing the suggestions on frequency is a good one, @SaurabhJamadagni, but we’ll basically be doing that in the approach ourselves :) After this we’ll get the 500 or so words, and with the top 3-5 words we’ll also have frequencies that will determine how often the words are suggested (we could do this, if we wanted). We certainly do not need to do a random selection from the words that we find for suggestions :) We can just do some suggestions from common words if the word isn’t in the 500 we get. This again will only be ~20 percent of the time, and maybe even less given that mobile keyboards are more for simpler speech. We can increase the dictionary size within the language packs later potentially 😊
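One way to picture the fallback is the Python sketch below: use the stored followers when the word is a key, and otherwise fill the slots from a list of common words. The name `suggest_with_fallback`, the three-slot default, and the sample data are assumptions for illustration, not the actual app logic.

```python
def suggest_with_fallback(last_word, table, common_words, count=3):
    """Return `count` suggestions, padding with common words when needed.

    Hypothetical helper: `table` maps a word to its most common followers,
    `common_words` is a list of frequent words for the language.
    """
    suggestions = list(table.get(last_word.lower(), []))[:count]
    for word in common_words:
        if len(suggestions) >= count:
            break
        # Avoid showing the same word in two slots.
        if word not in suggestions:
            suggestions.append(word)
    return suggestions


# Made-up sample data for illustration only.
table = {"love": ["cars", "movies", "food", "math"]}
common_words = ["the", "of", "and", "to"]
print(suggest_with_fallback("love", table, common_words))
print(suggest_with_fallback("xylophone", table, common_words))
```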
I think there's a typo there. I didn't understand what you meant 😅
Alright, that sounds like a good idea! Let me know how we can get started with the implementation. Excited to dig into something new haha. I'll go through the Scribe-Data documentation in the meantime. 😊
My guess is approach, reading it again 😅 Sorry, have been typing on my phone :)
I already made an issue for the data process with some tasks: Scribe-Data #15 :) Check that out a bit, and we can discuss more who does what later! Looking forward as well 😊😊
Once this issue is finished, the keys of the dictionary should also be included in the autocomplete options :)
Note that as autosuggest will be functioning at the same time as annotations, it would be best if the best option is shifted to the second position and the second to the third in case of an annotation :)
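A small Python sketch of this slot shifting, assuming (this is an inference, not stated in the thread) that the annotation takes the first of three slots. The name `arrange_slots` and the sample values are hypothetical.

```python
def arrange_slots(suggestions, annotation=None, slots=3):
    """Lay out the command bar slots.

    Hypothetical helper: when an annotation is present it is assumed to take
    the first slot, so the best suggestion moves to the second position and
    the second-best to the third.
    """
    if annotation is None:
        return suggestions[:slots]
    return [annotation] + suggestions[: slots - 1]


print(arrange_slots(["cars", "movies", "food"]))
print(arrange_slots(["cars", "movies", "food"], annotation="(F)"))
```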
Another observation: for numbers, the autosuggestions should be the words for "is", "or" and "minimum" in the given language :)
I'll do the small fixes I mentioned in the PR now, @SaurabhJamadagni :) A small aside though: this page here shows something that I was expecting sooner rather than later, and it makes me happy 😊 We don't have too many contributors, but your commits now give you the credit that your code contributions deserve 🏆 Thanks for all the work you've been putting in!
Just sent along some minor fixes that I didn't think needed review in 3ffbdce, @SaurabhJamadagni 😊
Let me know how you're feeling right now as far as the issues that are being worked on. Everything that's left is coding that you haven't worked on until now, like
Note that autosuggest as of now is deleting part of the previous word as if it's autocomplete, so we need to make sure that this isn't triggered :) Edit: I fixed it just now 🙌
@SaurabhJamadagni, this issue isn't closed yet, but the data is in there via 1a7ee7d 🚀 We'd talked about you switching over the autosuggest source to these files, which you'd be welcome to do if you'd like to 😊
Almost done with v2.0.0!! Still lots of work, but we're very very close 😊
Yeah @andrewtavis. I think I'll do this issue first then. The annotation formatting issue is getting a bit confusing when it comes to moving the function from the
That one is definitely a bit more confusing, @SaurabhJamadagni :) I'll answer your question there later in the day! And looking forward to the work here! 🚀
@SaurabhJamadagni, what fcc9fdb did was the following:
I figure it's all relatively standard and at this point we need to just get this out and start testing it 😊 Will close this now as all the tasks that were being tracked are now done!! Thanks for all your hard work on this! 😊
Thanks @andrewtavis! So happy to finish this feature finally! Thanks for your work and help as well 😊 🚀
It's by no means perfect, @SaurabhJamadagni, but it's definitely good enough to get it out, especially with all the improvements that autocomplete and the preposition-case annotation interface are bringing 😊 I'm working on all the design elements that we need for v2.0.0. Get to the task that you have in #197 when you can, and then I'll work on the rest for getting this out! 🚀
Yeah @andrewtavis, I'm actually done with the changes. I have a presentation in a bit so I was holding off the PR till evening. Will issue it asap. 😊
Sounds good, @SaurabhJamadagni! I think a bit of extra coding will be necessary for the preposition case conjugation interface, but generally I think that early next week is a reasonable release date at this point. From there we can work on small issues again, finally 😅😅😊
Let's gooo @andrewtavis! 🚀 As much as I am looking forward to working on the smaller issues, can't deny that large feature additions are a different kind of joy haha 😊
Description
In order to release the work that's been done in #188, Scribe also needs to add baseline autosuggest functionality. The reason is that the user would expect to also see word suggestions even if they're not currently typing a word. This is the other part of the baseline version of #3, which would see this feature implemented using natural language processing models over the current words in the text proxy (will be another issue later).
The working idea for this is that we should find/generate a list of most used words in each Scribe keyboard language, and then do random selections from this list. In this way the architecture of the feature will be set for future iterations.
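As a minimal sketch of this initial idea (random selections from a most-used-words list), the Python below draws a few distinct words at random. The function name `random_suggestions`, the three-suggestion default, and the sample word list are assumptions for illustration.

```python
import random


def random_suggestions(common_words, count=3, rng=random):
    """Pick `count` distinct words at random from a most-used-words list.

    Hypothetical helper; `rng` can be a seeded random.Random for
    reproducible behaviour in tests.
    """
    return rng.sample(common_words, k=min(count, len(common_words)))


# Made-up sample list; a real list would be generated per keyboard language.
common_words = ["the", "of", "and", "to", "in", "is"]
print(random_suggestions(common_words))
```

Keeping the selection behind a single function means the source of suggestions can later be swapped without changing the callers.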
Contribution
I'll be working on this with those who are interested prior to the v1.5.0 release :)