Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a vector of uids to navigate the units #8

Closed
iafan opened this issue Oct 3, 2016 · 20 comments
Closed

Use a vector of uids to navigate the units #8

iafan opened this issue Oct 3, 2016 · 20 comments
Assignees

Comments

@iafan
Copy link
Contributor

iafan commented Oct 3, 2016

Bugs like translate/pootle#4983 show that the current unit retrieval approach can't reliably work by design. We want to experiment with reverting the offset-related changes and at the same time use this as an opportunity to rethink the API between the client and the server and make this a foundation to the future translation UI improvements, e.g. continuous (non-paginated) scrolling. We also want to be considerate of the memory used by the browser, because this was the primary concern of the Pootle team with the old code.

Moving relevant bits from this comment here:

Let's keep the snapshot of uids on the client and not redo any search. It would work as follows:

  1. Whenever the user goes into translation UI, server returns them a vector of uids capped at e.g. 1000 units (this is more than enough for one pass, be it review or translation). This adds about 10KB to the ungzipped page source, which should be ok.
  2. Client renders the UI, and as user navigates through the units, it sends requests to the server specifying exact uids for the units it needs information about. The returned data is needed for view rows only, so it needs to contain only source and target text. If the unit is no longer there, it is returned as a blank entry in the results, which allows us to clean up these uids in the list.
  3. When/if the user reaches the 1000th unit in a single pass, and if there were more than this number of units in the filter initially, the UI can render a button at the bottom of the table, by clicking on which the user can redo the search (reapply the filter) and start with unit 1. The effect of this button would be the same as going back to the browser table and then clicking the pill link again.
@iafan
Copy link
Contributor Author

iafan commented Nov 30, 2016

With all the merged PRs into the feature branch, the outstanding issues are:

  • Dealing with edit rows of large vertical height
  • Displaying separators between the store boundaries

@iafan
Copy link
Contributor Author

iafan commented Dec 6, 2016

Missing separators between stores in edit mode is the last missing feature that prevents new code from being merged into master. Here is how I suggest us to have it implemented:

  1. When doing the search and getting uids, get not only the vector of uids, but pairs: [unit_id, store_id]. This should be cheap in terms of SQL (I don't expect any slowdowns here)
  2. Transform the list of pairs into: (A) list of uids, (B) list of 'first uids in each store' and send down to the client
  3. When requesting view row data, send not only uids of the rows we to get the data for, but also list of uids to get extended data for (the data needed to render separators, like language name, project name, and store path)

Example:

  1. Imagine we get the following data from the db after searching for units:
unit_id  |  store_id
--------------------
1001     |  1
1002     |  1
1025     |  1
2007     |  2
2250     |  2
3140     |  3
4096     |  4
  1. The preprocessed data as returned to the client will look like this:
{
    "uids": [1001, 1002, 1025, 2007, 2250, 3140, 4096],
    "headers": [1001, 2007, 3140, 4096]
}

Internally in the client code, "headers" needs to be converted into an object with the corresponding keys for fast lookups.

  1. When we fetch view rows, we pass unit ids and the corresponding matching uids to render headers:

GET => /xhr/units/?uids=1001,1002,1025,2007&headers=1001,2007

The returned result will look something like:

{
    "1001": {
        "source": "...",
        "source_lang": "...",
        "target": "...",
        "target_lang": "...",
        "project_id": "foobar",
        "project": "Foo Bar",
        "path": "/path/to/file1.html"
    },
    "1002": {
        "source": "...",
        "source_lang": "...",
        "target": "...",
        "target_lang": "..."
    },
    "1025": {
        "source": "...",
        "source_lang": "...",
        "target": "...",
        "target_lang": "..."
    },
    "2007": {
        "source": "...",
        "source_lang": "...",
        "target": "...",
        "target_lang": "...",
        "project_id": "foobar",
        "project": "Foo Bar",
        "path": "/path/to/file2.html"
    }
}

Notice that units "1001" and "2007" have more properties. Once this data gets into a React component, it will not only render the view row by itself, but also a separator row above it.

iafan pushed a commit that referenced this issue Dec 6, 2016
This commit adds rendering of header rows between stores. This is
WIP, because it doesn't display any text in the header, just the
blank separators; XHR and backend need to be adjusted to pass extra
header data for rendering. This is a partial implementation of
#8 (comment)
iafan pushed a commit that referenced this issue Dec 8, 2016
This commit adds rendering of header rows between stores.
This is an implementation of
#8 (comment)
@iafan
Copy link
Contributor Author

iafan commented Dec 8, 2016

The above logic was fully implemented in 9ba3cc1

@julen please review.

@julen
Copy link
Contributor

julen commented Dec 9, 2016

There is a rendering issue which is probably caused by some of the latest commits. Load the editor, then search for a string which will provide no results and the editor's shadow will remain:
screen shot 2016-12-09 at 09 48 01

By the way, the index should say 0/0 instead of 1/0.

@julen
Copy link
Contributor

julen commented Dec 9, 2016

Performing a search (or filtering units) is missing the header information:
screen shot 2016-12-09 at 09 53 04

I also understand having these everywhere is simple and doesn't get in the way, however when being in the context of a single store, it feels like there is redundant information which we are unnecessarily displaying:
screen shot 2016-12-09 at 10 07 10

On the other hand, in the cross-language view, my feeling is the headers are too prominent and they take over the content:
screen shot 2016-12-09 at 10 09 20
By the way, in this view accessing the context rows is broken.

Likewise, opening the editor seems to exhibit some jumpiness of content (apologies for the bad quality of the gif):
editor_jumpiness

@julen
Copy link
Contributor

julen commented Dec 9, 2016

I have checked the implementation and while it works for our current use case, I have the feeling the fact that we currently want to display this information in the header is leaking through the implementation: we are mixing what information we get (unit's store-related metadata) with how and where we display it (header).

It won't be late until we want to use the file-related information we are fetching in other places; for instance in the header of the editing unit or anywhere else in the UI. How can units know in which language/project/file they are? As far as I see there is no direct way to know that without extra code, as this information is only available to the first unit in the store for the result set.

One way to avoid the leakage would be to return the list of uids grouped in lists, e.g. the example above would become:

{
    "uids": [[1001, 1002, 1025], [2007, 2250], [3140], [4096]]
}

Likewise, this endpoint could potentially already provide store-related metadata which will be shared across the units in each group: source_lang, target_lang, language, project, file, ... e.g.

[
  {"source_lang": "ab", "target_lang": "cd", <...metadata...>, "uids": [1002, 1002, 1025]},
  {"source_lang": "ab", "target_lang": "cd", <...metadata...>, "uids": [2007, 2250]},
  {"source_lang": "ab", "target_lang": "cd", <...metadata...>, "uids": [3140]},
  {"source_lang": "ab", "target_lang": "cd", <...metadata...>, "uids": [4096]}
]

So at the expense of adding a few bytes more in the initial response, the /xhr/units/ endpoint would remain straightforward, because there would be no leakage of header or similar concepts which are attached to the UI and in addition to that, the payloads would also become lighter, as there would be no need to return source_lang and target_lang pairs (nor language, project nor file keys for the first units in the store of a result set) — these are already available in the client.

@julen julen assigned julen and iafan Dec 9, 2016
iafan pushed a commit that referenced this issue Dec 12, 2016
This commit adds rendering of header rows between stores.
This is an implementation of
#8 (comment)
@iafan
Copy link
Contributor Author

iafan commented Dec 12, 2016

I force-pushed the fixes to the concerns/issues outlined above and in the line comments, except for the last part about using nested lists.

Speaking of the API:

this endpoint could potentially already provide store-related metadata which will be shared across the units in each group

We definitely don't want to do this, because this would prefetch header data for the entire slice (up to 1000 uids) and in the worst case scenario return the header data for up to 1000 stores. We only want to fetch header data that is about to be visible to the user, and usually we do this just one row at a time, as the user advances through the list of units, allowing us to break payload in smaller chunks.

Grouping units as a list of lists will definitely be more compact when sending the vector of uids from the server to the client, so I'll experiment with this.

iafan pushed a commit that referenced this issue Dec 12, 2016
This commit adds rendering of header rows between stores.
This is an implementation of
#8 (comment)
iafan pushed a commit that referenced this issue Dec 12, 2016
This commit adds rendering of header rows between stores.
This is an implementation of
#8 (comment)
iafan pushed a commit that referenced this issue Dec 12, 2016
This commit adds rendering of header rows between stores.
This is an implementation of
#8 (comment)
@iafan
Copy link
Contributor Author

iafan commented Dec 12, 2016

Grouping units as a list of lists will definitely be more compact when sending the vector of uids from the server to the client, so I'll experiment with this.

Latest pushes introduce this format.

@julen
Copy link
Contributor

julen commented Dec 12, 2016

We definitely don't want to do this, because this would prefetch header data for the entire slice (up to 1000 uids) and in the worst case scenario return the header data for up to 1000 stores.

A bit aside of the point I was making, I think we want to seriously reconsider the limit of 1000 units — it's pretty high, probably way above the threshold of the amount of units a translator will work continuously without taking a break. Halving it wouldn't be bad.

@julen
Copy link
Contributor

julen commented Dec 12, 2016

this endpoint could potentially already provide store-related metadata which will be shared across the units in each group

We definitely don't want to do this, because this would prefetch header data for the entire slice (up to 1000 uids) and in the worst case scenario return the header data for up to 1000 stores. We only want to fetch header data that is about to be visible to the user, and usually we do this just one row at a time, as the user advances through the list of units, allowing us to break payload in smaller chunks.

We should probably make a distinction between when/how-much data is fetched, and how it is stored.

With the current implementation, in all cases regardless of units being part of the same language pair/project/store or not, the network request is fetching sourceLang and targetLang pairs. Then, for first units seen in store, targetLanguageName, projectName, file values are fetched too. The former data could benefit from being only part of the latter type of responses.

All of this is then being accumulated inUnit objects, even if the likelihood for data redundancy is high (albeit this being limited to ~45 units due to cleanup). In the majority of use cases (translators usually translate a single language; most likely one project at a time), these values could benefit from being normalized into Stores to require less client-side memory.

@iafan
Copy link
Contributor Author

iafan commented Dec 13, 2016

reconsider the limit of 1000 units — it's pretty high, probably way above the threshold of the amount of units a translator will work continuously without taking a break. Halving it wouldn't be bad.

The idea was exactly to make sure people won't reach the end of the list if the list is long, giving an impression that we're not limiting them in any way.

@iafan
Copy link
Contributor Author

iafan commented Dec 13, 2016

Then, for first units seen in store, targetLanguageName, projectName, file values are fetched too. The former data could benefit from being only part of the latter type of responses.

I agree that source / target language belong to the store, not to the unit, and don't have to be transferred with each view row. You also have correctly noted that we only store this information for the units that are visible on the screen (+ a few extra units at the edges) and clean unused rows immediately as we move up or down the list, so the overall impact of optimizing this (both in terms of transferred data and in-client memory use) is arguably small. So I'd say this is not worth the trouble for now unless you have a better picture what this optimization will bring to the table.

@julen
Copy link
Contributor

julen commented Dec 13, 2016

The idea was exactly to make sure people won't reach the end of the list if the list is long, giving an impression that we're not limiting them in any way.

Let's estimate that a translator translates 2000 source words per day; if we analyze the data of our most proficient translators, this doesn't happen that often. In order to get 500 units, that would be about 4 words per unit, which is likely to be lower than our average number of words per unit (I haven't done the math, but skimming through the data I would say so).

Even if this was a constant daily throughput, it is highly unlikely translators will go through all these units in one go, as strings are scattered over different projects and allows them to group and logically split the work.

Personally, on my 10 years+ experience as a localizer, I have never gone through 500 units in one go. I am pretty sure full-time professional localizers are more efficient that I am, however stepping through 500 consecutive units is as well enough reason to encourage people to take a break.

So with 500 units, and unless I missed some important detail when estimating roughly, I will need a very convincing argument to think they are not enough number of units because people will be hitting the end of the list.

@julen
Copy link
Contributor

julen commented Dec 14, 2016

Then, for first units seen in store, targetLanguageName, projectName, file values are fetched too. The former data could benefit from being only part of the latter type of responses.

So I'd say this is not worth the trouble for now unless you have a better picture what this optimization will bring to the table.

I wonder what is special about the source_lang and target_lang fields. Why are language, project and file worth the trouble whereas these two other fields are not? Still, all data logically belongs to the same bucket, so I'm not sure is the motivation behind special-casing first units in the store and complicating the endpoint's logic.

In this regards, I have made a few tests to measure the impact of changing the data the /xhr/units/ endpoint returns, using the most lightweight shape as possible (source-target pairs) and a shape that includes for every single unit the current fields + language, project and file.

When it comes to the DB performance, there is no negligible difference in my tests and all queries run below 5ms time. This is because even if we select a few more/less fields, the tables that need to be joined for retrieving such fields are joined anyway, because of permission checking etc.

Regarding network overhead, I measured two types of requests (the initial one, with a sample of 35 units; consequent ones, with a sample of 5 units) with payload sizes for the lightest and the heaviest shapes which includes all fields.

  • Initial request (35 units)
Payload shape Payload size (gzip'ed) Payload size (original)
Light 1.66K 10.375K
Heavy 1.75K 13.99K
  • Rest requests (5 units)
Payload shape Payload size (gzip'ed) Payload size (original)
Light 0.217K 0.76K
Heavy 0.29K 1.56K

So the network impact is less of a concern with gzip, which compresses duplicates pretty well. This means we can opt for using the heavy variant, which ensures a straightforward endpoint while not compromising performance. And this is not attached to any UI concepts, and there is no need to use extra query string parameters.

In terms of storing the data, having store metadata in a separate entity is not only about having a smaller memory footprint (which is good regardless of any internal garbage cleanups), but also about being able to query the data differently, enabling future use cases, and not have it tied to specific UI views (in this aspect, UnitList.headers is a code smell).

This is not very different from a traditional backend, where data is split in logical entities and managed by relations: the list of uids we get at the beginning would allow us to build the relation between units and stores, and unit retrieval payloads would allow us to normalize the data into both entities.

@iafan
Copy link
Contributor Author

iafan commented Dec 14, 2016

being able to query the data differently, enabling future use cases, and not have it tied to specific UI views

These are non-goal, actually. these XHRs are not a part of some generic public API — they are there to support particular client-side UI with the data the client needs. If we have new requirements from the client side in the future, we will update our API as needed.

I wonder what is special about the source_lang and target_lang fields. Why are language, project and file worth the trouble whereas these two other fields are not? Still, all data logically belongs to the same bucket, so I'm not sure is the motivation behind special-casing first units in the store and complicating the endpoint's logic.

There's no trouble now. What our API does is it serves exactly the right data needed to render regular rows (which includes source/target string and source/target language) and, for header rows, serves extra data required to render those. No more, no less. Yes, the data is somewhat denormalized, which makes it easy to work with without complicating the client-side code further.

First you're saying that the data needs to be normalized, but then suggest another — completely opposite — route of denormalizing all the data and send headers with every view unit knowing that they won't be used, anyway. And all of this to avoid adding a notion of two kinds of rows (ones with and without headers) into the internal API contract which sole purpose is to support the particular edit view. Both approaches are the extremes that have their downsides. Again, what we currently have is we serve exactly the data needed to render rows, and we don't complicate the client and backend code much.

@julen
Copy link
Contributor

julen commented Dec 14, 2016

After voice conversation we decided to settle on the following:

  • Reduce the list of unit IDs that are returned to a maximum of 500 items.
  • Keep the API endpoint as it is now. Should the needs change in the future, we will re-evaluate the implementation details.

@iafan
Copy link
Contributor Author

iafan commented Dec 14, 2016

Reduce the list of unit IDs that are returned to a maximum of 500 items.

Done in 68f88a3

@julen
Copy link
Contributor

julen commented Dec 21, 2016

I pulled in the latest master branch changes to the feature branch, updated test data, fixed three bugs just discovered (bogus variable reference, going to units from context rows, accessing to a store-specific unit by id).

One minor thing I realize is when the number of results is displayed, this is not centered vertically:
screen shot 2016-12-21 at 15 50 45

Last but not least, there is a bug when being within a search and switching the sort order: the search goes away and sorting is applied on top of the unfiltered result set. This is also present in master although it would be great to have it fixed now.

@iafan
Copy link
Contributor Author

iafan commented Dec 22, 2016

I addressed both lead-in row label positioning and an ability to sort search results.

julen pushed a commit to julen/zing that referenced this issue Jan 16, 2017
This commit adds rendering of header rows between stores.
This is an implementation of
serge-community#8 (comment)
julen pushed a commit that referenced this issue Jan 16, 2017
This commit adds rendering of header rows between stores.
This is an implementation of
#8 (comment)
julen pushed a commit that referenced this issue Jan 16, 2017
This commit adds rendering of header rows between stores.
This is an implementation of
#8 (comment)
@julen
Copy link
Contributor

julen commented Jan 17, 2017

This is now fixed and part of v0.2.0.

@julen julen closed this as completed Jan 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants