Allow for in-app download of data #89

Open
2 tasks done
andrewtavis opened this issue Jan 4, 2022 · 18 comments
Labels
blocked Another issue is blocking data Relates to data or Wikidata feature New feature or request help wanted Extra attention is needed question Further information is requested

Comments

@andrewtavis
Member

andrewtavis commented Jan 4, 2022

Terms

Description

This issue is for the discussion and implementation of data downloads within the Scribe app. As more keyboards are added to Scribe, the size of the app will slowly grow and become cumbersome. To counteract this, it would be best if keyboard data was downloaded by the user in app. Downloading a keyboard would allow a user to then add the given keyboard in the settings.

@andrewtavis andrewtavis added help wanted Extra attention is needed question Further information is requested data Relates to data or Wikidata labels Jan 4, 2022
@andrewtavis
Member Author

andrewtavis commented Jan 7, 2022

It should be noted that for translations this data would need to be on a per language pair basis, as a user would likely only need the data for their spoken or system language to the target keyboard.

@andrewtavis andrewtavis added the blocked Another issue is blocking label Jan 7, 2022
@andrewtavis andrewtavis added the feature New feature or request label Sep 5, 2022
@wkyoshida
Member

Hey @andrewtavis! I was wondering how this feature would eventually work in the back-end side. I know there's the Scribe-Data repo, which could be used as the download location for the data packs, but I was also wondering if the data pack hosting could be somewhere else even. For instance, could Scribe use Wikimedia's Toolforge perhaps to run a service that the apps could periodically check for available data pack updates? And then the apps could present the option to download those updates (or even have the option to do so automatically)? Some things that came to mind were:

  • The service could facilitate a way to provide individual packs, based on what the user needs, in a compressed format. The compressed format could facilitate downloads as pack sizes get larger.
  • Apart from hosting the packs, Scribe could also automate the periodic checks in Wikipedia and Wikidata for any data updates. From what I understood, I believe that is done manually today - would that be right?
    • If vetting the data updates before making them available is a concern, I think there could be a way to still do so. For instance, the automation could create PRs for the team to review the updated data.
  • I think one thing that could be discussed is if Scribe would like to treat releases for the apps separate from releases of the data packs. I got the feeling that perhaps that could have been a thought behind the reasoning for this issue, but I could be mistaken.
    • As a note, treating app vs data pack releases differently could allow for data pack updates to happen more or less frequently than the app updates, depending on the need.

These are just some thoughts I had. I don't think a service and automation should be definitively implemented or even if Toolforge should be used for it (Scribe would have to see if it could and if it makes sense to). I was more curious if there were already any thoughts on how to implement this. Of course, the apps just downloading the files that it needs directly from GitHub could very well be enough for what Scribe needs 😆 which in that case, feel free to ignore 😬
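
As a rough illustration of the periodic update check described above, here is a minimal Python sketch. The manifest shape (a mapping from pack name to version string) and the function name `packs_needing_update` are assumptions made for this example; they do not reflect an existing Scribe or Toolforge API.

```python
from __future__ import annotations


def packs_needing_update(installed: dict[str, str], available: dict[str, str]) -> list[str]:
    """Return installed packs whose remote version differs from the local one."""
    return sorted(
        name
        for name, version in available.items()
        if name in installed and installed[name] != version
    )


# Hypothetical manifests: pack name -> version string.
installed = {"German": "2023-01-01", "Spanish": "2023-01-01"}
available = {"German": "2023-02-01", "Spanish": "2023-01-01", "French": "2023-02-01"}

outdated = packs_needing_update(installed, available)
print(outdated)  # ['German']
```

Packs the user has not installed (French here) are ignored, which matches the idea of only offering the packs a given user needs.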

@andrewtavis
Member Author

A suggestion that I got from an Apple employee tonight was to use downloadable assets that could be hosted on the App Store, which I didn't even know existed. I think that Toolforge would likely be a better option though, as when Android gets up and running we'll need something that's platform agnostic :)

We could definitely provide an option for automatic updates in the settings, and maybe something that could happen at the start is that we could prompt the user to download the new data in the background whenever they open the keyboard after a new update (we'd just update both at the same time as is done now).

  • Apart from hosting the packs, Scribe could also automate the periodic checks in Wikipedia and Wikidata for any data updates. From what I understood, I believe that is done manually today - would that be right?

This is exactly what we're looking to do, and yet it's currently manual (with the help of Python, but I do type a single command to run update_data.py 😊).

  • If vetting the data updates before making them available is a concern, I think there could be a way to still do so. For instance, the automation could create PRs for the team to review the updated data.

For now let's not worry about vetting data. There is a script where I remove profanities from autosuggestions via a Wikidata query, with that being integrated into an eventual system going forward. Big thing for this is let's focus on getting what we can by expanding Wikidata if need be, and we can also code in specific things if need be (I removed the word "nazi" from all autosuggestions, for instance, as WWII is talked about on so many Wikipedia articles that it was randomly popping up).

  • As a note, treating app vs data pack releases differently could allow for data pack updates to happen more or less frequently than the app updates, depending on the need.

I'm 100% for decoupling the releases as it will doubtless lead to more flexibility, especially considering we're hoping to one day be releasing iOS, Android and Desktop 😉 I think that downloading files directly from GitHub will likely be the first thing this ends up doing, but I'd love to be proved wrong and jump directly to Toolforge 😊

Thanks for further explorations and all the thought you're giving this! :) :)

@wkyoshida
Member

when Android gets up and running we'll need something that's platform agnostic :)

Completely agree! 👍

I removed the word "nazi" from all autosuggestions, for instance, as WWII is talked about on so many Wikipedia articles that it was randomly popping up

Huh.. how interesting! 🤔😆 unexpected outcomes reflected in the data. They will surely happen, of course, but it's just interesting to see them.

I think that downloading files directly from GitHub will likely be the first thing this ends up doing, but I'd love to be proved wrong and jump directly to Toolforge 😊

You know, I think Scribe could be fine starting off with GitHub actually. The client-side mechanism on the apps to go check for available updates would need to be implemented, and that would be regardless if it's GitHub, Toolforge, or whatever else serving the back-end. The details of what it checks for to determine a new update would be different, but the core functionality could stay. Scribe could start with GitHub and then later switch if it'd like. My idea for Toolforge, I think, came firstly more so cause it is a Wikimedia project. With that, I believe there could be some benefits:

  • Scribe continues its affiliation with Wikimedia. It's "on-brand" - though that might not be enough reason to use Toolforge.
  • Also, however, due to this affiliation, perhaps getting assistance from the greater Wikimedia community, or even staff, could come easier.

On the other hand though, GitHub could just very well be a viable path. We already mentioned that files could be downloaded directly from the repo. As far as the automation for new data from Wikipedia/Wikidata, perhaps GitHub Actions could be used as the tool for that? It seems there's a way to schedule Actions via a cron schedule. The Action could be even to just run update_data.py 😆 Finally, the data packs in compressed format could also be linked as separate files to releases as assets.

My curiosity - was there another earlier idea that you had in mind in how to do the automation for new data checks?

In summary - Toolforge could be an appealing option to get the chance to use it, to work closer within more Wikimedia things, or to not rely on GitHub too much. However, there could very well be a way to accomplish this with GitHub.
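
The point that the client-side check can stay the same regardless of the back end can be sketched as follows. This is a minimal illustration only: `DataPackSource`, `InMemorySource` and `has_update` are hypothetical names, and the in-memory class stands in for a real GitHub- or Toolforge-backed source so that no network calls are needed.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DataPackSource(ABC):
    """Hypothetical interface: where the app looks up data pack versions."""

    @abstractmethod
    def latest_version(self, pack: str) -> str:
        ...


class InMemorySource(DataPackSource):
    """Stand-in for a GitHub- or Toolforge-backed source (no network calls)."""

    def __init__(self, versions: dict[str, str]) -> None:
        self._versions = versions

    def latest_version(self, pack: str) -> str:
        return self._versions[pack]


def has_update(source: DataPackSource, pack: str, installed_version: str) -> bool:
    """Core check that stays the same whichever back end serves the data."""
    return source.latest_version(pack) != installed_version


github_like = InMemorySource({"German": "v2"})
toolforge_like = InMemorySource({"German": "v2"})
print(has_update(github_like, "German", "v1"))     # True
print(has_update(toolforge_like, "German", "v2"))  # False
```

Switching from GitHub to Toolforge later would then mean adding another `DataPackSource` implementation rather than rewriting the check itself.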

@andrewtavis
Member Author

We can definitely do a comparison between the offerings of Toolforge and GitHub for this 😊 I don't inherently have a preference for either, but do like to keep things kind of centralized so that we're not bouncing between platforms. GitHub Actions will definitely be a major step for us going forward as far as testing :)

With that being said, keeping the Wikidata related processes in a Wikimedia flow would also be ok. And I very much expect that we can get support from the community for Toolforge. We do have a tag on Phabricator, which is their task system. I'd say let's make a decision between these two :)

The Action could be even to just run update_data.py
My curiosity - was there another earlier idea that you had in mind in how to do the automation for new data checks?

Honestly I hadn't thought too in depth about how we'd do it. There's been enough going on with the project and everything else that I just made this issue to start a discussion :) Generally the idea is that we need to centralize the JSONs, update them regularly in this centralized location, provide a mechanism to get into the app, and let the user know when there's new data to download/do this automatically if possible. Also wasn't sure what the end data structure for the app was going to be, but through discussions for #96 it looks like we'll be figuring out a SQLite solution.

Thanks as always for your insights! Really is nice to talk this stuff over with you 🚀 Also, let me know if you'd be interested in Wikimedia events :D Happy to keep you up to date on ones where we'll be participating 😊

@wkyoshida
Member

Agreed! We can definitely compare both offerings and evaluate. However, whether it ends up being GitHub or Toolforge for the long-term, I am thinking atm perhaps that the GitHub route could be the starting point, mostly since I believe that the level of effort would likely be smaller (the files are already being hosted via the git repository anyways). Scribe could then, at first, focus more on completing the client-side mechanism. With Toolforge, I feel there will likely be more - exposing a way to check for updates (perhaps via an API), figuring out data storage, etc. Yet, Toolforge can always later replace GitHub. However!! 😆 this is just what I'm feeling now and can definitely change after further comparing both offerings. Also - the introduction of SQLite is interesting regarding all this (see [1]).

We do have a tag on Phabricator, which is their task system.

I didn't realize Scribe did. That is awesome!

Also wasn't sure what the end data structure for the app was going to be, but through discussions for #96 it looks like we'll be figuring out a SQLite solution.

I see. That makes sense to me! Working against SQLite directly, as opposed to Core Data, could also potentially prove useful when later doing the same in Android. Findings and implementation could carry over more easily.

[1] Going back to the topic of centralized location, I wonder if it could make sense to also leverage a DB on the back-end side. Reason is, in thinking about update downloads, I wondered if there could be a way to only download the diff of what is new. That could help with download sizes as data grows. A DB could help with identifying that if there is perhaps a last_updated field for data points. Bringing this up, as it could be a point in favor of using Toolforge. Have to think more on how that would play out though.
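
To make the diff idea in [1] concrete, here is a small sketch using Python's built-in `sqlite3` module. The `nouns` table, its columns, and the sample rows are invented for illustration; only the `last_updated` field comes from the discussion. Dates are stored as ISO 8601 strings so that plain string comparison orders them correctly.

```python
import sqlite3

# Hypothetical back-end table with a last_updated column per data point.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nouns (word TEXT PRIMARY KEY, plural TEXT, last_updated TEXT)")
conn.executemany(
    "INSERT INTO nouns VALUES (?, ?, ?)",
    [
        ("Buch", "Bücher", "2023-01-10"),
        ("Haus", "Häuser", "2023-03-02"),
        ("Kind", "Kinder", "2023-03-15"),
    ],
)


def rows_changed_since(connection, last_sync):
    """Return only the rows updated after the client's last sync date."""
    cursor = connection.execute(
        "SELECT word, plural, last_updated FROM nouns "
        "WHERE last_updated > ? ORDER BY word",
        (last_sync,),
    )
    return cursor.fetchall()


diff = rows_changed_since(conn, "2023-02-01")
print(diff)  # [('Haus', 'Häuser', '2023-03-02'), ('Kind', 'Kinder', '2023-03-15')]
```

One limitation of this approach is that a timestamp filter only finds added or changed rows; deleted rows would need separate handling.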

Thanks as always for your insights! Really is nice to talk this stuff over with you 🚀 Also, let me know if you'd be interested in Wikimedia events :D Happy to keep you up to date on ones where we'll be participating 😊

Of course! 😄 I'm more than happy to help out with Scribe. I'll also be glad and willing to stay involved with this specific feature/side of Scribe. I think it'll be some interesting work 😉🧑‍💻

Also, you know what - yes! I think I would be interested in the events, actually. It'd be fun 🤘

@andrewtavis
Member Author

... whether it ends up being GitHub or Toolforge for the long-term, I am thinking atm perhaps that the GitHub route could be the starting point, mostly since I believe that the level of effort would likely be smaller (the files are already being hosted via the git repository anyways).

Makes sense to me as well :) For the initial offering on this we focus on GitHub, and then we go from there 🚀

[1] Going back to the topic of centralized location, I wonder if it could make sense to also leverage a DB on the back-end side. Reason is, in thinking about update downloads, I wondered if there could be a way to only download the diff of what is new. That could help with download sizes as data grows. A DB could help with identifying that if there is perhaps a last_updated field for data points.

Only downloading the new differences would be great! Having to wait for what'd doubtless be a long download is definitely something we'd try to avoid, and this sounds like a great solution 😊 I agree that this would also point us a bit towards Toolforge ⚒️

I guess from here we work on #16 and the data solution that we'll do as a part of #96. Then this'd be unblocked and we can do an initial GitHub solution followed by a more advanced one :) Maybe something we can try at first is to have the app preloaded, and then we can have downloading an update outside of an app update be an initial thing. Once that's decoupled we can then move on to the rest 🚀

Of course! 😄 I'm more than happy to help out with Scribe. I'll also be glad and willing to stay involved with this specific feature/side of Scribe. I think it'll be some interesting work 😉🧑‍💻
Also, you know what - yes! I think I would be interested in the events, actually. It'd be fun 🤘

Glad Scribe's something that you can get so much from 😊 I'll write on here for the next Wikimedia events coming up. Likely the next one would be Data Reuse Days - assuming that they do another one in 2023 :)

@andrewtavis
Member Author

andrewtavis commented Apr 19, 2023

Hey @wkyoshida and @SaurabhJamadagni 👋😊

Really happy to have gotten v2.2.0 out today! A major step to add in emojis — thank you both for your efforts! Obviously little bits to fix here and there, but this was a major step on the roadmap. Plus the way Scribe can repeat emojis via #283 one after another adds a cool extra feature that system keyboards lack 🙌 There really is some progress being made here! I went to the iOS meetup again tonight and there really is a difference in how people are viewing Scribe with some of these major features in the interface. The obvious next step of overhauling the app interface in #16 seems like the final bit to put "still MVP" to rest as the single page app still gets a bit of a look.

Reason I'm writing you both in #89 here is something I was a bit worried about while reworking the data to SQLite now definitely seems like it's cause for concern: we're at 137.9 MB. Aside from #16, the next major things we have on the roadmap are:

I wanted to check with you both about your opinions of the app size. How big is too big? To me we're at or close to that size now, and it's ok to be there, but adding more keyboards and translation data is going to really expand the size. I think it's safe to say that we cannot do the top two options above without doing the data download process, but then I wanted to check :)

Does it make sense to shift this to next in the priority after #16? @wkyoshida, I could talk with some coworkers and we could eventually do a call with them to ask about the ideas of doing the data download process that we've discussed. Two engineers and I were doing some good brainstorming for Scribe over coffee today 😊

Thanks again to you both! 🚀

@andrewtavis
Member Author

andrewtavis commented Apr 19, 2023

I just went through and did a quick organization of the projects board, btw. Obviously not set in stone, and let me know if something looks off. As we're not doing sprints or anything, I think for now it makes sense to archive the finished issues upon release and do another ordering :)

@SaurabhJamadagni
Collaborator

Hey @andrewtavis, I went through the popular keyboard options on the App Store (e.g. Gboard, SwiftKey) and they are somewhere around 80-90 MB. That is barebones, without the additional theme downloads. Grammarly, a keyboard to help with grammar, is around 230 MB. So I would say we are not far off with the size. It definitely is a top three priority. I was thinking, what if we work on the cross-translation before this issue? That way the languages that we are currently offering could become complete packages. We then modularise so that specific languages can be downloaded (i.e. this issue). We could move on to adding further languages and creating their packages for downloading after that. But again, the changes will be pushed together anyway, right?

@wkyoshida
Member

I think it's safe to say that we cannot do the top two options above without doing the data download process, but then I wanted to check :)

Yeah, adding keyboards or more translations isn't entirely blocked exactly, but I do agree that it would be a good idea to add the data download feature to avoid bloating the app size.

Does it make sense to shift this to next in the priority after #16? @wkyoshida, I could talk with some coworkers and we could eventually do a call with them to ask about the ideas of doing the data download process that we've discussed.

I think having this data download feature after the menu makes sense 👍 It would free us up to add more keyboards with less concern over the app size.

Awesome to hear that we could get some feedback from your coworkers, @andrewtavis! They could for sure help. Just a thought though, if we'd like to, the Scribe-Server idea could be something that we hold off on for a bit. We could implement the data download feature to download the .sqlite straight out of GitHub in the interim, if we need to. It could be an easier option just so we can get any data download working, as setting up Scribe-Server will surely be more involved. I guess a downside though, is that maybe we'd want to wait on the cross-translation then? Mostly so that Scribe-Data repo doesn't get bloated (not a huge concern though, but just a point for consideration).

I was thinking what if we work on the cross-translation before this issue?

Like mentioned above, I think there could be some hesitancy with adding all of the data for cross-translation into the Scribe-Data repo, but I do think though that we could definitely already do the work for the data extract/transform logic at least 😄
Something we could keep in mind, pending the existence of a potential Scribe-Server, is how to store the cross-translation data well in a DB structure.

But again, the changes will be pushed together anyway right?

Yeap, we could push them together. I think another option for us though, could be to split releases even. I think there might be some flexibility here for us. Adding the menu + the data download (simply of the data that we have today) could be one release. I think this as a base already frees us up to add more keyboards we'd like with less app size concern. Later adding the ability to select which languages to do any cross-translation for though could be another release. I think this selection ability might be important for cross-translation, as I suspect users likely won't want miscellaneous translations for other languages to get downloaded too.


Just some thoughts ✌️

@andrewtavis
Member Author

We could implement the data download feature to download the .sqlite straight out of GitHub in the interim, if we need to.

This makes sense to me, @wkyoshida. We’d just need to think on how to update the data on the iOS side of things :)
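
One possible way to handle the update step, sketched in Python rather than Swift to keep it platform neutral: check that the downloaded bytes look like a SQLite file, write them to a temporary file, and swap that file in atomically so the keyboard never reads a half-written database. The function name `install_database` and the file name `LanguageData.sqlite` are hypothetical, and the actual download from GitHub is left out so the example runs offline.

```python
import os
import sqlite3
import tempfile
from pathlib import Path

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite database file


def install_database(payload: bytes, target: Path) -> None:
    """Validate downloaded bytes and atomically swap them in as the new database."""
    if not payload.startswith(SQLITE_MAGIC):
        raise ValueError("payload is not a SQLite database")
    # Write next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(payload)
    os.replace(tmp_name, target)


# Demo: build a small database to stand in for the downloaded file.
workdir = Path(tempfile.mkdtemp())
source = workdir / "downloaded.sqlite"
conn = sqlite3.connect(source)
conn.execute("CREATE TABLE nouns (word TEXT)")
conn.execute("INSERT INTO nouns VALUES ('Buch')")
conn.commit()
conn.close()

target = workdir / "LanguageData.sqlite"  # hypothetical file name
install_database(source.read_bytes(), target)

check = sqlite3.connect(target)
rows = check.execute("SELECT word FROM nouns").fetchall()
check.close()
print(rows)  # [('Buch',)]
```

If validation fails, the existing database is left untouched, so a bad download would not break a keyboard that is already installed.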

Something we could keep in mind, pending the existence of a potential Scribe-Server, is how to store the cross-translation data well in a DB structure.

My assumption was we’d make source language based tables where the word is the key and all the translations are the other column elements? How does this sound?

I think this selection ability might be important for cross-translation, as I suspect users likely won't want miscellaneous translations for other languages to get downloaded too.

This is a good point and we’ll need to talk about the options that the user is being asked for data downloads. Do we ask what their source language is and only get that plus whatever their target keyboard language’s translations are? I guess that’d be fine, but we would need to check on that during the download phase with the default option then being their phone’s language or English if we don’t have it.

Thanks for the thoughts! 🙏✌️

@wkyoshida
Member

My assumption was we’d make source language based tables where the word is the key and all the translations are the other column elements? How does this sound?

I did think of this option too tbf, but I'm a little unsure. There are some downsides, such as:

  • not sure how to leverage a potential last_updated column, as a row would have translations for different/new languages getting added at different times. Would the addition of one new language mean having to grab the entire row again?
  • the potential for many empty fields, either from a translation for one language not having been added yet or from the translation simply not existing for the source word.

Some other ideas I think could be to:

  • have a table for every source-target language pairing. However, an obvious downside is the number of language pairing combinations, i.e. number of tables.
  • this SO answer
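
The trade-offs between the first two layouts can be made concrete with a small `sqlite3` sketch. All table names, column names, and rows below are invented for illustration, and the SO answer's approach is not shown. The wide table demonstrates the empty-field problem, while the pair table shows where a per-row `last_updated` would live and how quickly the table count grows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Option 1: one wide table per source language, one column per target language.
conn.execute("CREATE TABLE en_wide (word TEXT PRIMARY KEY, de TEXT, es TEXT, pt TEXT)")
conn.executemany(
    "INSERT INTO en_wide VALUES (?, ?, ?, ?)",
    [("book", "Buch", "libro", None), ("house", "Haus", None, None)],
)

# Option 2: one table per source-target pair, each row with its own timestamp.
conn.execute("CREATE TABLE en_de (word TEXT PRIMARY KEY, translation TEXT, last_updated TEXT)")
conn.executemany(
    "INSERT INTO en_de VALUES (?, ?, ?)",
    [("book", "Buch", "2023-04-01"), ("house", "Haus", "2023-04-20")],
)

# Count the empty translation fields in the wide table.
empty_fields = conn.execute(
    "SELECT SUM((de IS NULL) + (es IS NULL) + (pt IS NULL)) FROM en_wide"
).fetchone()[0]
print(empty_fields)  # 3


def pair_table_count(languages: int) -> int:
    """Number of directed source-target tables needed for n languages."""
    return languages * (languages - 1)


print(pair_table_count(5))  # 20
```

With five languages the pair layout already needs 20 tables, but each table can be downloaded and diffed independently.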

Do we ask what their source language is and only get that plus whatever their target keyboard language’s translations are?

This is interesting actually, because I would advocate even for the option to select multiple source languages. Personally, for instance:

  • with Spanish as the target language, I'd use Portuguese as the source
  • with German as the target language, I'd use English as the source

There might be several different reasons for this, but this could have to do with which language someone learned another one in, which someone's known language has more similarities with their target, how some source languages have better translation data for the target, etc. Giving the option for multiple source languages could be good for those reasons.

Adding to thoughts on the DB structure, I guess that multiple source languages would likely also make downloading the translation data pack not as easy as simply "only download the table with source language X" since multiple sources would be in play.

@andrewtavis
Member Author

I think that source language-target language pairings could work as I don’t think that there would be more than two or three Scribe keyboards used by a given user usually, and on the DB side the maintenance would be cumbersome, but should be doable. I’ll take a look at the SO question more thoroughly though :) Sorry I’m super exhausted after the week/activist meetup (which went really well, btw 😊), so I couldn’t focus as much on it as I’d like to :)

I guess I hadn’t fully considered the option of a user using different source languages. Very interesting :) :) I think that the case of selecting a source language would also work though :) The default would just be set to their phone’s language for the source language, but they could select a new one from a dropdown before downloading. Assuming your phone’s in Portuguese, you’d just need to select English as the source language within the German keyboard download interface 😊
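
The default-plus-override behavior described here could look roughly like the following. The function name and the set of available languages are assumptions for illustration; the fallback to English when the phone's language has no data follows the earlier comment in this thread.

```python
def default_source_language(system_language, available, user_choice=None):
    """Pick the source language for a translation data download.

    Priority: explicit user choice, then the phone's language, then English.
    """
    if user_choice in available:
        return user_choice
    if system_language in available:
        return system_language
    return "English"


# Hypothetical set of source languages that have translation data.
available = {"English", "German", "Portuguese"}

print(default_source_language("Portuguese", available))                         # Portuguese
print(default_source_language("Portuguese", available, user_choice="English"))  # English
print(default_source_language("Japanese", available))                           # English
```

Calling this once per target keyboard would also cover the multiple-source case, for example Portuguese for the Spanish keyboard and English for the German one.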

@SaurabhJamadagni
Collaborator

Sorry @andrewtavis, @wkyoshida can't provide any input regarding this issue right now. I have kind of fallen behind in the discussion. Will catch up during our meeting! Loving the progress and discussion though! 😄

@andrewtavis
Member Author

We can discuss this on Tuesday a bit as well :) :)

@andrewtavis
Member Author

Based on my tasks for this week, the following is the new data download screen. The circle to the right of the Check for new data button is a spinner, so clicking anything within the field will trigger a progress spinner for the download. We can look into this in depth during the next weekly, but the full designs are on Figma :)

@andrewtavis
Member Author

andrewtavis commented Jun 27, 2023

I really like the new look of the designs! Thanks to both of you for such great suggestions today 😊

Projects
Status: Todo
Development

No branches or pull requests

3 participants