Baseline autosuggest #194

andrewtavis · 2022-08-14T09:19:01Z

Terms

I have searched open and closed feature requests
I agree to follow Scribe-iOS' Code of Conduct

Description

In order to release the work that's been done in #188, Scribe also needs to add baseline autosuggest functionality. The reason is that the user would expect to see word suggestions even when they're not currently typing a word. This is the other part of the baseline version of #3, which would see this feature implemented using natural language processing models over the current words in the text proxy (will be another issue later).

The working idea for this is that we should find/generate a list of most used words in each Scribe keyboard language, and then do random selections from this list. In this way the architecture of the feature will be set for future iterations.
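A minimal sketch of this baseline idea, assuming a hypothetical hard-coded `COMMON_WORDS` list stands in for the per-language list that would be found or generated (the function name is also illustrative):

```python
import random

# Hypothetical stand-in for a generated list of most-used words in one language.
COMMON_WORDS = ["the", "and", "that", "have", "for", "not", "with", "you"]


def baseline_suggestions(words, k=3, rng=random):
    """Return up to k distinct words picked at random from the common-word list."""
    return rng.sample(words, k=min(k, len(words)))


print(baseline_suggestions(COMMON_WORDS))
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible for testing.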

Contribution

I'll be working on this with those who are interested prior to the v1.5.0 release :)

andrewtavis · 2022-08-18T18:34:30Z

[Edit]: I think the Wikipedia based implementation below might be better than this 🤔😊

@SaurabhJamadagni, for this issue, how does using wordfreq sound? It has a function top_n_list() that would give us the words we need, and we could maybe just subset so that we're selecting words that are three characters or longer? I was thinking we can take a look at the words of three or more characters given certain cutoffs for n, and from there we can decide how many words we want to include for this baseline (e.g. 30, 40, etc).
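As a rough sketch of the subsetting step, assuming the ranked list is passed in directly (in practice it could come from wordfreq's `top_n_list`; the sample list and the `candidate_words` helper below are hypothetical):

```python
def candidate_words(ranked_words, n, min_length=3):
    """Take the top n ranked words and keep those with at least min_length characters."""
    return [w for w in ranked_words[:n] if len(w) >= min_length]


# Hypothetical frequency-ranked sample used only for illustration.
ranked = ["the", "to", "and", "of", "a", "in", "is", "that", "for", "it"]

# Compare how many usable words remain at different cutoffs for n.
for cutoff in (5, 10):
    print(cutoff, candidate_words(ranked, cutoff))
```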

It has an MIT license, so we should mention it in the privacy policy and I'll maybe make a NOTICE.txt for us to add some things that we're using. Seems like the easiest thing to do, as I can just make a simple Python script and save it to Scribe-Data 😊

andrewtavis · 2022-08-18T19:25:04Z

Looking at German autosuggest for the first time in a bit, some patterns are emerging that we could also use to make this more performant, as it seems that some of the system autosuggest isn't really NLP based at all. The following two are ones I think we can implement, and maybe we can think of one or two others:

- There are three words that are always presented for the first options when nothing is typed that we can copy
- If the first word the user types is a pronoun, the three suggestions are normally the conjugations of have, be and can

We should definitely make sure to not over-engineer this with things that are just going to be removed with the NLP implementation. The focus should be on what happens when a user types something at the start of the proxy, not on later situations where context would be used.
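The two rules above could be sketched roughly as follows. The English word lists are hypothetical placeholders (the observation was made on German, and the real values would be chosen per language):

```python
# Hypothetical placeholder data for English; not taken from any system keyboard.
DEFAULT_SUGGESTIONS = ["I", "The", "We"]
PRONOUN_SUGGESTIONS = {
    "i": ["have", "am", "can"],
    "you": ["have", "are", "can"],
    "she": ["has", "is", "can"],
}


def rule_based_suggestions(previous_word):
    """Return hard-coded suggestions, or None when neither rule applies."""
    if not previous_word:
        # Nothing typed yet: always show the same three words.
        return DEFAULT_SUGGESTIONS
    # Pronoun typed: show its conjugations of have, be and can.
    return PRONOUN_SUGGESTIONS.get(previous_word.lower())
```

Returning `None` for any other word leaves room for a data-driven lookup to take over.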

andrewtavis · 2022-08-18T20:21:54Z

Something to consider as well is that Wikipedia is licensed CC-BY-SA, meaning we can use their data with attribution (similar to Wikidata, which is CC0, so we don't have to attribute but do so to show respect). We could potentially do a more final version of this feature from the get-go. I could download Wikipedia dumps in a similar manner to what I did for wikirec, and the dumps could then be used to generate the list of most common words. A random selection of articles could also be used to generate the four words that most often follow each of them on Wikipedia (four so that we could do a random selection from them for some variance). Words that aren't among the most common would receive random selections.

I feel like this would basically be equivalent to system autosuggest, as there really doesn't seem to be much variance. We could also use the frequency of words that follow to determine if there should be any variance at all in the selections, as if it's basically only three words that follow it, then we wouldn't need to include a fourth. This list given JSON sizes could easily be 500 words with 3-5 options each without being too large, with these 500 or so words basically accounting for 75% of the words that are used for a given language.
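A small sketch of how the following-word lists might be derived, assuming plain whitespace tokenization over already-cleaned sentences. The function name, the `min_share` threshold and the sample sentences are all hypothetical, and the actual process for Wikipedia dumps would need proper text cleaning first:

```python
from collections import Counter, defaultdict


def next_word_table(sentences, keys, max_options=4, min_share=0.1):
    """Map each key word to the words that most often follow it.

    Followers whose share of occurrences is below min_share are dropped, so a
    key with only a few dominant followers gets a shorter list.
    """
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            if current in keys:
                followers[current][following] += 1

    table = {}
    for word, counts in followers.items():
        total = sum(counts.values())
        table[word] = [
            w for w, c in counts.most_common(max_options) if c / total >= min_share
        ]
    return table


# Hypothetical sample text standing in for a random selection of articles.
sample = ["I love music", "They love music", "We love food", "You love music and food"]
print(next_word_table(sample, keys={"love"}))
```

Raising `min_share` is one way to express the idea that a word with only a few real followers doesn't need a fourth option.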

andrewtavis · 2022-08-18T20:38:16Z

Putting this all in a Python context would also allow for the general data analytics/data science community to give some feedback on the process. I think that adding this kind of implementation to Scribe-Data could help grow the community 😊

SaurabhJamadagni · 2022-08-20T08:38:50Z

If the first word the user types is a pronoun, the three suggestions are normally the conjugations of have, be and can

First off, this was a really nice observation @andrewtavis. I never noticed it before.

To clarify, you are saying that using Wikipedia we can find the four most commonly used words following the word the user types. For example, if the user types "I love", using Wikipedia dumps we can come up with a list that hypothetically gives us ["cars", "movies", "food", "math"], assuming they are the most common words that follow the word 'love' in a random selection of articles. Is this a correct understanding?

I think the Wikipedia option seems better than just randomly displaying suggestions based on word frequency. But I am still not clear on what exactly will go on behind the scenes. Does the four word list have to be generated each time a user types a new word or do we just do a basic lookup in the already downloaded data?

andrewtavis · 2022-08-20T20:48:03Z

I’m not sure if there are any other hard coded rules like the first words or words after pronouns to think about, but I was happy to realize them 😊 Let’s stick to these for now unless something jumps out.

Your understanding is correct, @SaurabhJamadagni! So not what usually comes after “I love”, but instead deriving what usually comes after “love”. Maybe we can figure out phrases a bit later, but for the implementation now let’s focus on one word so that the iOS side will be similar to autocomplete in that we’ll just get the last word, and won't need to check the last two or more.

We’ll do a basic lookup of already downloaded data for this. I think that JSONs with 500 keys and the four most common words as values will be within the range of size we can add as of now. The same as it works for other data, the process will happen in Scribe-Data that will then update files in the various apps :)
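The lookup itself could be as simple as the sketch below. The JSON excerpt and the `suggestions_for` helper are hypothetical; only the shape (a key word mapped to its most common followers) comes from the description above:

```python
import json

# Hypothetical excerpt of a generated file: each key maps to its most common followers.
AUTOSUGGEST_JSON = """
{
  "love": ["music", "food", "you", "it"],
  "the": ["first", "same", "new", "most"]
}
"""

autosuggestions = json.loads(AUTOSUGGEST_JSON)


def suggestions_for(text_before_cursor, table, k=3):
    """Look up suggestions for the last word typed; return [] when it isn't a key."""
    words = text_before_cursor.split()
    if not words:
        return []
    return table.get(words[-1].lower(), [])[:k]


print(suggestions_for("I love", autosuggestions))
```

Only the last word is inspected, which matches the one-word focus agreed on for this iteration.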

SaurabhJamadagni · 2022-08-21T09:10:29Z

We’ll do a basic lookup of already downloaded data for this. I think that JSONs with 500 keys and the four most common words as values will be within the range of size we can add as of now. The same as it works for other data, the process will happen in Scribe-Data that will then update files in the various apps :)

That sounds like a nice plan @andrewtavis. What do we default to when the user types a word that isn't included in the 500 keys?

Also I was wondering if we could use the wordfreq library to amp up autocomplete? Instead of giving completions that are alphabetical, we could give the completions based on their frequency. And then when the user is done typing, the autosuggestions that come later would be from the Wikipedia list we create of common 500 words.
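To illustrate the difference between alphabetical and frequency-based completions, here is a small sketch. The frequency values are made up for the example (they could in principle come from a library such as wordfreq), and `completions` is a hypothetical helper:

```python
def completions(prefix, frequencies, k=3):
    """Return up to k words starting with prefix, most frequent first."""
    matches = [w for w in frequencies if w.startswith(prefix.lower())]
    # Sort by descending frequency, breaking ties alphabetically.
    return sorted(matches, key=lambda w: (-frequencies[w], w))[:k]


# Hypothetical relative frequencies used only for illustration.
FREQ = {"the": 0.05, "that": 0.01, "this": 0.008, "than": 0.002, "thank": 0.0005}

print(completions("th", FREQ))  # frequency order
print(sorted(FREQ)[:3])         # alphabetical order, for comparison
```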

andrewtavis · 2022-08-21T12:31:43Z

Your idea of basing the suggestions on frequency is a good one, @SaurabhJamadagni, but we’ll basically be doing that in the approach ourselves :) After this we’ll get the 500 or so words, and with the top 3-5 words we’ll also have frequencies that will determine how often the words are suggested (we could do this, if we wanted). We certainly do not need to do a random selection from the words that we find for suggestions :)

We can just do some suggestions from common words if the word isn’t in the 500 we get. This again will only be ~20 percent of the time, and maybe even less given that mobile keyboards are more for simpler language. We can increase the dictionary size within the language packs later potentially 😊
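One possible shape for this fallback, assuming the common words are already ordered by frequency and that the first few are simply reused (this choice and the `suggest_or_fallback` name are assumptions, not something decided in this thread):

```python
def suggest_or_fallback(word, table, common_words, k=3):
    """Return the table entry for word, or the first k common words if it is missing."""
    options = table.get(word.lower())
    if options:
        return options[:k]
    return common_words[:k]


# Hypothetical data for illustration.
table = {"love": ["music", "food", "you", "it"]}
common = ["the", "and", "that", "have"]

print(suggest_or_fallback("love", table, common))
print(suggest_or_fallback("xylophone", table, common))
```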

SaurabhJamadagni · 2022-08-21T18:08:53Z

but we’ll basically be doing that in the apron h ourselves

I think there's a typo there. I didn't understand what you meant 😅

We can just do some suggestions from common words if the word isn’t in the 500 we get. This again will only be ~20 percent of the time, and maybe even less given that mobile keyboards are more for simpler language. We can increase the dictionary size within the language packs later potentially 😊

Alright, that sounds like a good idea! Let me know how we can get started with the implementation. Excited to dig into something new haha. I'll go through the Scribe-Data documentation in the meanwhile. 😊

andrewtavis · 2022-08-21T21:35:13Z

I think there's a typo there. I didn't understand what you meant 😅

My guess is approach, reading it again 😅 Sorry, have been typing on my phone :)

Alright, that sounds like a good idea! Let me know how we can get started with the implementation. Excited to dig into something new haha. I'll go through the Scribe-Data documentation in the meanwhile. 😊

I already made an issue for the data process with some tasks: Scribe-Data #15 :) Check that out a bit, and we can discuss more who does what later! Looking forward as well 😊😊

andrewtavis · 2022-08-28T18:31:49Z

Once this issue is finished, the keys of the dictionary should also be included in the autocomplete options :)

andrewtavis · 2022-09-03T07:51:02Z

Note that as autosuggest will be functioning at the same time as annotations, it would be best if the first best option is shifted to the second position and the second to the third in case of an annotation :)
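A sketch of that shift, assuming three display slots and that the annotation takes the first one (both are assumptions for illustration, as is the `layout_slots` name):

```python
def layout_slots(suggestions, annotation=None, slots=3):
    """Arrange suggestions across the slots, moving them right if an annotation is shown."""
    if annotation is None:
        return suggestions[:slots]
    # The annotation takes the first slot, so the best suggestion moves to the
    # second position and the second best to the third.
    return [annotation] + suggestions[: slots - 1]


print(layout_slots(["music", "food", "you"]))
print(layout_slots(["music", "food", "you"], annotation="(F)"))
```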

andrewtavis · 2022-09-03T15:59:52Z

Another observation, for numbers, the autosuggestions should be the words for "is", "or" and "minimum" in the given language :)
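This rule could be sketched as below. Only the English entry is shown, the dictionary layout is hypothetical, and the digit check is deliberately simple (it ignores decimals and separators):

```python
# Per-language words for "is", "or" and "minimum"; only English is filled in here.
NUMBER_SUGGESTIONS = {"en": ["is", "or", "minimum"]}


def number_suggestions(previous_word, language):
    """Return the fixed suggestions when the previous word is a whole number, else None."""
    if previous_word.isdigit():
        return NUMBER_SUGGESTIONS.get(language)
    return None


print(number_suggestions("42", "en"))
```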

andrewtavis · 2022-09-12T17:58:43Z

I'll do the small fixes I mentioned in the PR now, @SaurabhJamadagni :)

A small aside though: this page here shows something that I was expecting sooner rather than later, and it makes me happy 😊 We don't have too many contributors, but your commits now give you the credit that your code contributions deserve 🏆 Thanks for all the work you've been putting in!

andrewtavis · 2022-09-12T18:44:04Z

Just sent along some minor fixes that I didn't think needed review in 3ffbdce, @SaurabhJamadagni 😊

This issue is now blocked by Generate data for autosuggest Scribe-Data#15 (I always do the unordered lists here so that the issue name and status are displayed, btw 🤓)

Let me know how you're feeling right now as far as the issues that are being worked on. Everything that's left is code that you haven't worked on until now, like .xib files in #197 and Wikidata related issues in Scribe-Data. Happy to guide you if you have interest in any of it, but also totally fine if you'd like to take a break for a moment 😊

andrewtavis · 2022-09-24T17:02:17Z

Note that autosuggest as of now is deleting part of the previous word as if it's autocomplete, so we need to make sure that this isn't triggered :)

Edit: I fixed it just now 🙌

andrewtavis · 2022-10-04T09:39:16Z

When Generate data for autosuggest Scribe-Data#15 is done and the data is in this repo, we can then switch over the dummy dictionary to one that's loaded in CommandVariables.swift

@SaurabhJamadagni, this issue isn't closed yet, but the data is in there via 1a7ee7d 🚀 We'd talked about you switching over the autosuggest source to these files, which you'd be welcome to do if you'd like to 😊

andrewtavis · 2022-10-04T09:39:35Z

Almost done with v2.0.0!! Still lots of work, but we're very very close 😊

SaurabhJamadagni · 2022-10-04T16:57:56Z

We'd talked about you switching over the autosuggest source to these files, which you'd be welcome to do if you'd like to 😊

Yeah @andrewtavis. I think I'll do this issue first then. The annotation formatting issue is getting a bit confusing when it comes to moving the function from the ViewController to Annotate.swift.

andrewtavis · 2022-10-05T08:41:26Z

That one is definitely a bit more confusing, @SaurabhJamadagni :) I'll answer your question there later in the day! And looking forward to the work here! 🚀

andrewtavis · 2022-10-06T21:33:52Z

@SaurabhJamadagni, what fcc9fdb did was the following:

- Updates the changelog 🚀
- Updates the privacy policy for the repo and the app
- Allows for words that aren't pronouns (or German nouns) to be suggested as lower case even if they're capitalized in the autosuggestions dict
- Adds the autosuggestion keys into the autocompletion options
  - Takes the lower case one if it's available and removes duplicates
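The last two points could look roughly like the sketch below. It is written in Python for illustration only and is not the code from the commit; the helper names and the `keep_capitalized` set (standing in for pronouns and German nouns) are assumptions:

```python
def display_form(suggestion, keep_capitalized):
    """Lowercase a suggestion unless it is in the keep-capitalized set."""
    return suggestion if suggestion in keep_capitalized else suggestion.lower()


def merge_into_completions(completions, suggestion_keys):
    """Add suggestion keys to the completion words.

    When both a capitalized and a lower-case form exist, only the lower-case
    one is kept; exact duplicates are dropped.
    """
    all_words = set(completions) | set(suggestion_keys)
    merged = {w for w in all_words if w == w.lower() or w.lower() not in all_words}
    return sorted(merged, key=str.lower)


print(display_form("Und", keep_capitalized={"Ich", "Haus"}))
print(merge_into_completions(["the", "house"], ["The", "Haus", "house"]))
```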

I figure it's all relatively standard and at this point we need to just get this out and start testing it 😊 Will close this now as all the tasks that were being tracked are now done!!

Thanks for all your hard work on this! 😊

SaurabhJamadagni · 2022-10-07T06:26:46Z

I figure it's all relatively standard and at this point we need to just get this out and start testing it 😊 Will close this now as all the tasks that were being tracked are now done!!

Thanks for all your hard work on this! 😊

Thanks @andrewtavis! So happy to finish this feature finally! Thanks for your work and help as well 😊 🚀

andrewtavis · 2022-10-07T06:44:31Z

It's by no means perfect, @SaurabhJamadagni, but it's definitely good enough to get it out, especially with all the improvements that autocomplete and the preposition-case annotation interface are bringing 😊

I'm working on all the design elements that we need for v2.0.0. Get to the task that you have in #197 when you can, and then I'll work on the rest for getting this out! 🚀

SaurabhJamadagni · 2022-10-07T09:46:50Z

Yeah @andrewtavis, I'm actually done with the changes. I have a presentation in a bit so I was holding off the PR till evening. Will issue it asap. 😊

andrewtavis · 2022-10-07T10:34:54Z

Sounds good, @SaurabhJamadagni! I think a bit of extra coding will be necessary for the preposition case conjugation interface, but generally I think that early next week is a reasonable release date at this point. From there we can work on small issues again, finally 😅😅😊

SaurabhJamadagni · 2022-10-08T09:04:17Z

I think that early next week is a reasonable release date at this point. From there we can work on small issues again, finally

Let's gooo @andrewtavis! 🚀

As much as I am looking forward to working on the smaller issues, can't deny that large feature additions are a different kind of joy haha 😊

andrewtavis added the feature, good first issue, help wanted, and -next release- labels Aug 14, 2022

This was referenced Aug 14, 2022

Coloration of nouns in autocomplete and suggest #164

Closed

[Deleted] Allow for inappropriate language filtration scribe-org/Scribe-Data#13

Closed

Alphabetized Autocomplete #188

Closed

Wikimania 2022 Prep and Documentation #187

Closed

andrewtavis mentioned this issue Aug 18, 2022

Add emojis to autosuggestions #51

Closed

2 tasks

This was referenced Aug 18, 2022

Emoji data for Scribe apps scribe-org/Scribe-Data#14

Closed

Generate data for autosuggest scribe-org/Scribe-Data#15

Closed

This was referenced Aug 28, 2022

Add Wikipedia Logo to App Store Images #198

Closed

Update App Store videos given autocomplete/suggest #199

Closed

andrewtavis removed the good first issue label Aug 29, 2022

This was referenced Sep 3, 2022

Add UILexicon access to autocomplete and autosuggest #201

Closed

[Deleted] Add autocomplete / predictive text #3

Closed

andrewtavis added a commit that referenced this issue Sep 4, 2022

#188 #194 autocomplete bug fix and autosuggest placeholders

c683872

andrewtavis added a commit that referenced this issue Sep 11, 2022

#194 add autosuggestions for numbers for all languages

c688d5c

SaurabhJamadagni mentioned this issue Sep 12, 2022

Adding Autosuggestions function #207

Merged

andrewtavis mentioned this issue Sep 12, 2022

Default autosuggestions for pronouns #208

Closed

2 tasks

andrewtavis added a commit that referenced this issue Sep 12, 2022

#194 autosuggest after numbers and ds period shortcut

3ffbdce

andrewtavis added the blocked label Sep 12, 2022

This was referenced Sep 25, 2022

Update Annotation Formatting #197

Closed

Add information icon to the command bar #166

Closed

andrewtavis assigned andrewtavis and SaurabhJamadagni Oct 2, 2022

andrewtavis added a commit that referenced this issue Oct 2, 2022

Baseline autosuggest files for #194

efccc13

andrewtavis added a commit that referenced this issue Oct 3, 2022

Baseline autosuggest files for #194

d45908c

andrewtavis added a commit that referenced this issue Oct 4, 2022

#194 add autosuggestion jsonss for all languages

1a7ee7d

andrewtavis removed the blocked label Oct 4, 2022

SaurabhJamadagni mentioned this issue Oct 6, 2022

Linking autosuggestions data with functions #224

Merged

andrewtavis added a commit that referenced this issue Oct 6, 2022

#194 add suggestions to completions and update privacy

fcc9fdb

andrewtavis closed this as completed Oct 6, 2022

andrewtavis mentioned this issue Oct 8, 2022

Delete autocomplete space if punctuation follows #169

Closed

2 tasks
