Proposal: separate website repository and assets #12048

Open
kbdharun opened this issue Jan 9, 2024 · 26 comments
Labels
architecture Organization of the pages per language, platform, etc. archive Archive of changes made in tldr-pages, etc. decision A (possibly breaking) decision regarding tldr-pages content, structure, infrastructure, etc.

Comments

@kbdharun
Member

kbdharun commented Jan 9, 2024

I have been thinking about opening this issue for a while. I first hinted at it in this thread in the chat room and mentioned it occasionally in PRs. I had time to test the changes during the winter holidays last year, so I am opening a formal issue now to discuss the proposal and the detailed steps for performing the transition seamlessly.

Proposal

Summary

This proposal aims to split the assets and the website contents into separate repositories to make contributing, maintenance, etc. easier.

Problem

  • Currently, our website repository takes a long time to clone using git and also takes a whopping 6 GB+ storage space (the main culprit is the assets directory, whose Git history grows over time). This makes it nearly impossible for most contributors to contribute to or improve our website.
  • Since the website repository is so large it makes archival or running a mirror of it in services like Gitea impossible. (I run a private self-hosted instance of Gitea where I have mirrors for all our repositories except this one as a form of backup.)
  • GitHub Pages has a soft limit of 1 GB for repositories, which we crossed a long time ago. Although enforcement is unlikely, this makes the site susceptible to future limitations/enforcement from GitHub.

Solution

My proposed solution is to move the website and assets to separate repositories (tldr-pages.github.io, assets) and archive the current website repository under a different name like old-site.

Advantages

  • This change allows easy contributions to the website repository without worrying about wasting bandwidth or space.
  • This change allows us to use modern website frameworks and redesign the website independently in future. (This would be indeed nice to do, but that's a discussion for another day xD).
  • This change would allow us to archive/mirror the website repository in external services.
  • This allows us to easily swap assets repos in the future if we receive a notice from GitHub about site size (i.e. we can download the contents of the assets repo as a ZIP and unpublish it, then rename and archive it as e.g. assets-1, and create a new one under the same name).

Considerations

  • This change would cause a couple of minutes of downtime while we perform it, and I would suggest not merging any PRs during the transition.

Modifications (@ main repo)

  • deploy.sh would need to be modified to deploy to the new assets repository instead of a directory inside the website repository, and the repo slug needs to be updated too. (See the patch section below for more information.)

Clarifications

  • Since @owenvoke's SSH key is used for the deploy process and he is an Org Owner; no additional permissions need to be granted to tldr-bot for it to commit and push changes.
  • The custom domain only needs to be set for the main tldr-pages.github.io repo; other repositories automatically use it in the format https://tldr.sh/repo-name/contents. In this case, it would be https://tldr.sh/assets/ (we already use this approach to show the web version of the manpages for tlrc at https://tldr-pages.github.io/tlrc [https://tldr.sh/tlrc/]).

Steps

  1. Create a draft PR with the changes to deploy.sh (to commit/push to the new assets repo).
  2. Download the ZIP archive of https://github.com/tldr-pages/tldr-pages.github.io and separate the assets directory from the rest of the website files (excluding .git).
  3. Remove the custom domain, then unpublish the GitHub page from https://github.com/tldr-pages/tldr-pages.github.io/settings/pages.
  4. Rename the repository to old-website.
  5. Create a new repository for the website under the name tldr-pages.github.io and for assets under the name assets, then commit the files taken from the ZIP archive to the respective repositories.
  6. Enable GitHub pages for the newly created website repository, then enable custom domain https://tldr.sh.
  7. Enable GitHub pages for the assets repository.
  8. Once it is done (and the page is published), test the changes by downloading a ZIP archive or by viewing the rendered index file at https://tldr.sh/assets.
  9. Now merge the PR in the main repository and voilà, the changes are done.

Patch for deploy.sh

Location: https://github.com/tldr-pages/tldr/blob/main/scripts/deploy.sh.

diff --git a/scripts/deploy.sh b/scripts/deploy.sh
index 88c6a8393..2497f893e 100755
--- a/scripts/deploy.sh
+++ b/scripts/deploy.sh
@@ -32,14 +32,14 @@ function initialize {
 }
 
 function upload_assets {
-  git clone --quiet --depth 1 "git@github.com:tldr-pages/tldr-pages.github.io.git" "$SITE_HOME"
+  git clone --quiet --depth 1 "git@github.com:tldr-pages/assets.git" "$SITE_HOME"
 
-  mv -f "$TLDR_ARCHIVE" "$SITE_HOME/assets/"
-  find "$TLDRHOME/language_archives" -maxdepth 1 -name '*.zip' -exec mv -f {} "$SITE_HOME/assets/" \;
-  cp -f "$TLDRHOME/index.json" "$SITE_HOME/assets/"
-  find "$TLDRHOME/scripts/pdf" -maxdepth 1 -name '*.pdf' -exec mv -f {} "$SITE_HOME/assets/" \;
+  mv -f "$TLDR_ARCHIVE" "$SITE_HOME/"
+  find "$TLDRHOME/language_archives" -maxdepth 1 -name '*.zip' -exec mv -f {} "$SITE_HOME/" \;
+  cp -f "$TLDRHOME/index.json" "$SITE_HOME/"
+  find "$TLDRHOME/scripts/pdf" -maxdepth 1 -name '*.pdf' -exec mv -f {} "$SITE_HOME/" \;
 
-  cd "$SITE_HOME/assets"
+  cd "$SITE_HOME/"
   sha256sum -- index.json *.zip > tldr.sha256sums
 
   git add -A
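
The patched function still ends by writing tldr.sha256sums next to the archives. As a rough sketch of how that file can be generated and then checked, here is a throwaway example; the temporary directory and the placeholder files are assumptions for illustration, not part of deploy.sh, and it assumes GNU coreutils' sha256sum:

```shell
# Sketch: generate a checksum file the way deploy.sh does, then verify it.
set -eu

workdir="$(mktemp -d)"   # stand-in for the assets directory (assumption)
cd "$workdir"

# Placeholder files; the real archives come from the build scripts.
printf '{"commands": []}\n' > index.json
printf 'dummy archive\n' > tldr.zip

# Same command as in deploy.sh: hash the index and every archive.
sha256sum -- index.json *.zip > tldr.sha256sums

# What a downloader could run to confirm the files match the published hashes.
if sha256sum --check --quiet tldr.sha256sums; then
  verify_status="ok"
else
  verify_status="mismatch"
fi
echo "$verify_status"
```

Because the checksum file sits in the same directory as the archives, moving everything from "$SITE_HOME/assets/" to "$SITE_HOME/" should not change how it is produced or verified.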

References and Testing

I tested these changes two weeks ago in my fork with this assets repository (the live version [available until this issue's closure] can be found here).

Conclusion

We have been optimizing the build and deploy processes for the past few weeks, so that new archives/PDFs are built and committed only when they are modified. This proposal is the last part of completing the optimization work.

If given the green light, I can perform the changes soon.

I would like to ping some of our active maintainers and people with access to infrastructure for your opinion about this. (Will inform the same in the chatroom)

cc @sebastiaanspeck , @sbrl, @SethFalco, @agnivade, @acuteenvy, @owenvoke, @blueskyson, @waldyrious

@kbdharun kbdharun added architecture Organization of the pages per language, platform, etc. decision A (possibly breaking) decision regarding tldr-pages content, structure, infrastructure, etc. archive Archive of changes made in tldr-pages, etc. labels Jan 9, 2024
@kbdharun kbdharun self-assigned this Jan 9, 2024
@acuteenvy
Member

I think we shouldn't push these assets to a Git repository.

The tldr archives, PDFs, etc. do not need version control. Every client downloads the latest version anyway. On top of that, such a repository grows in size very quickly (we push a couple megabytes of binary data on every commit that changes a page), and because of that we are going to arrive at the exact same problem later on. This is a band-aid solution, and if we're going to change the way we distribute pages, we might as well do it right. In my opinion, we need something that can be easily overwritten - without the problem of garbage that piles up every commit.

takes a whopping 6 GB+ storage space

Actually, it's about 15 GB.
https://api.github.com/repos/tldr-pages/tldr-pages.github.io
size: 14856325 KB = 14.85633 GB
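
For anyone who wants to redo the conversion, a minimal sketch (the size value is the one quoted above; the commented-out live query assumes curl and jq are installed):

```shell
# Sketch: convert the API's `size` field (reported in KB) to GB.
size_kb=14856325
size_gb="$(LC_ALL=C awk -v kb="$size_kb" 'BEGIN { printf "%.2f", kb / 1000000 }')"
echo "$size_gb GB"   # prints "14.86 GB"

# Live value (needs network access; assumes curl and jq):
#   curl -fsSL https://api.github.com/repos/tldr-pages/tldr-pages.github.io | jq .size
```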


I've been thinking about this for a while now, and I actually wanted to open a similar issue. We could upload the assets to the latest release of tldr-pages/tldr. GitHub releases are not version controlled, and can be easily updated from a script. Of course, this is a breaking change, and would require all clients to change https://raw.githubusercontent.com/tldr-pages/tldr-pages.github.io/main/assets to https://github.com/tldr-pages/tldr/releases/latest/download. We could do a transition period of supporting both methods like we did with the change of master to main.

I've already edited the deploy script to upload assets to both places - if everyone agrees, I can make a PR.


archive the current website repository under a different name like old-site.

That should definitely be done - this repo currently takes 16 minutes to clone.

@kbdharun
Member Author

The tldr archives, PDFs, etc. do not need version control. Every client downloads the latest version anyway. On top of that, such a repository grows in size very quickly (we push a couple megabytes of binary data on every commit that changes a page), and because of that we are going to arrive at the exact same problem later on. This is a band-aid solution, and if we're going to change the way we distribute pages, we might as well do it right. In my opinion, we need something that can be easily overwritten - without the problem of garbage that piles up every commit.

Agreed, this is indeed an efficient approach (in the long run). But I am not sure how some of our clients will fetch it, wildcards? GitHub Rest API? (If Rest API then there would be issues when fetching multiple archives cross-platform)

Actually, it's about 15 GB.

Wow, that's larger than I initially thought (I haven't cloned the repo in a while).

I've been thinking about this for a while now, and I actually wanted to open a similar issue. We could upload the assets to the latest release of tldr-pages/tldr. GitHub releases are not version controlled, and can be easily updated from a script. Of course, this is a breaking change, and would require all clients to change https://raw.githubusercontent.com/tldr-pages/tldr-pages.github.io/main/assets to https://github.com/tldr-pages/tldr/releases/latest/download. We could do a transition period of supporting both methods like we did with the change of master to main.

This looks good on paper, I am interested in hearing what others would think. If we went with it, we could set up link redirects to the correct location (either as DNS records or via a separate repo like https://github.com/tldr-pages/chatroom).

That should definitely be done - this repo currently takes 16 minutes to clone.

Exactly, will do the currently proposed changes once others agree too.

@acuteenvy
Member

But I am not sure how some of our clients will fetch it, wildcards? GitHub Rest API?

No need for any API calls. Clients will fetch it the same way they do it now, just from a different location.
For example, the English pages archive:
current: https://raw.githubusercontent.com/tldr-pages/tldr-pages.github.io/main/assets/tldr-pages.en.zip
new: https://github.com/tldr-pages/tldr/releases/latest/download/tldr-pages.en.zip
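
A small sketch of what the change could look like on the client side. The helper name archive_url is made up for illustration; only the two base URLs come from this thread:

```shell
# Sketch: build the per-language archive URL for the old and new locations.
OLD_BASE="https://raw.githubusercontent.com/tldr-pages/tldr-pages.github.io/main/assets"
NEW_BASE="https://github.com/tldr-pages/tldr/releases/latest/download"

# archive_url BASE LANG -> prints the URL of that language's archive.
# (Hypothetical helper, not taken from any existing client.)
archive_url() {
  printf '%s/tldr-pages.%s.zip\n' "$1" "$2"
}

old_url="$(archive_url "$OLD_BASE" en)"
new_url="$(archive_url "$NEW_BASE" en)"
echo "$old_url"
echo "$new_url"

# A client would then download it as before, following redirects, e.g.:
#   curl -fsSL -o tldr-pages.en.zip "$new_url"
```

Only the base URL differs; the file names stay the same, so in most clients this should be a one-line change.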

If we went with it, we could set up link redirects to the correct location

Some clients download directly from https://raw.githubusercontent.com, and you can't tell GitHub to redirect this to something else. Additionally, it would be great if all clients used the same URL, because currently some use the redirect and some do not.

will do the currently proposed changes once others agree too.

If we decide to use releases, we can only switch repositories once all clients have been updated and we're ready to stop supporting the old method.

@sebastiaanspeck
Member

If we decide to use releases, we can only switch repositories once all clients have been updated and we're ready to stop supporting the old method.

My only concern with releases is, the frequency. Right now all clients can access the "latest" version. If we decide to release, a client can only access the latest release. And when is a release "big" enough, to prevent a release per change? After changing X numbers of pages?

We need to think out this release strategy. We could keep the release process as is and keep release numbers, but add a timestamp for the commit? E.g. v2.1+20230111-101010 as release version. I would definitely not recommend to create tags per commit.

@acuteenvy
Member

acuteenvy commented Jan 11, 2024

I would definitely not recommend to create tags per commit.

I meant uploading assets to the same release (overwriting them) every commit, because we do not need old versions. Releases will still be created only on client specification updates.
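
To illustrate the idea, here is a sketch using the GitHub CLI; it is not necessarily how the deploy script does it. The tag v2.1 and the file names are placeholders, and the function only prints the command unless DRY_RUN=0 is set:

```shell
# Sketch: re-upload assets to one existing release, replacing same-named files.
# RELEASE_TAG and the file names below are placeholders for illustration.
RELEASE_TAG="${RELEASE_TAG:-v2.1}"

upload_assets() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run: only print the command that would be executed.
    echo "gh release upload $RELEASE_TAG --repo tldr-pages/tldr --clobber $*"
  else
    # --clobber makes gh overwrite existing assets with the same name.
    # Requires an authenticated gh with write access to the repository.
    gh release upload "$RELEASE_TAG" --repo tldr-pages/tldr --clobber "$@"
  fi
}

planned="$(upload_assets tldr.zip index.json tldr.sha256sums)"
echo "$planned"
```

Since the assets are replaced in place, nothing accumulates between uploads, and the tag itself only changes when a real release is made.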

@kbdharun
Member Author

will do the currently proposed changes once others agree too.

If we decide to use releases, we can only switch repositories once all clients have been updated and we're ready to stop supporting the old method.

Yep, I guess I wasn't clear in my previous comment 😅. I meant that if others agree, I will first perform the changes suggested here (#12048 (comment)) to allow independent work on the website, and then discuss your long-term viable method (maybe in a separate issue for better visibility/trackability).

@acuteenvy
Member

acuteenvy commented Jan 11, 2024

These changes will break clients that use https://raw.githubusercontent.com/tldr-pages/tldr-pages.github.io instead of https://tldr.sh/assets (the URL will then be https://raw.githubusercontent.com/tldr-pages/assets).

@kbdharun
Member Author

These changes will break clients that use https://raw.githubusercontent.com/tldr-pages/tldr-pages.github.io instead of https://tldr.sh/assets (the URL will then be https://raw.githubusercontent.com/tldr-pages/assets).

Oh, I haven't considered this. Yeah, this would indeed break clients linking directly to the repository instead of the website. Will fix it first (by opening PRs in clients soon).

@acuteenvy
Member

Will fix it first (by opening PRs in clients soon).

And if we are going to make PRs that change the URL, then they might as well be with the final solution. If you do this now, that will force every client to create a patch release, and then another one with the URL to GitHub releases.

then discuss your long-term viable method (maybe in a separate issue for better visibility/trackability)

#12062

@waldyrious
Member

I agree with @acuteenvy's suggestion to use release artifacts instead of hosting them using a bespoke mechanism (git-tracked, even 😱)

My only concern with releases is, the frequency. Right now all clients can access the "latest" version. If we decide to release, a client can only access the latest release. And when is a release "big" enough, to prevent a release per change? After changing X numbers of pages?

We need to think out this release strategy. We could keep the release process as is and keep release numbers, but add a timestamp for the commit? E.g. v2.1+20230111-101010 as release version. I would definitely not recommend to create tags per commit.

If we feel it's too onerous to update the release artifacts on every commit, perhaps we could adopt a snapshot strategy, where a new archive would be generated on a time basis (say, weekly, or daily) rather than on a commit basis. I think that either option ought to be frequent enough for the vast majority of users.

@waldyrious
Member

On a separate note: since we'd be recreating the website repository, I'd strongly suggest that we take the opportunity to filter the git history to remove all the asset update commits but preserve all the other changes; that way we wouldn't have a split in the history of the website code between the old repository and the new one.
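
As a rough sketch of what such a filter could look like, here it is on a throwaway repository. The file names and commit messages are invented for the demo, and it uses the built-in git filter-branch; the Git documentation points to git-filter-repo as a faster alternative, which would be worth evaluating for a repository of this size:

```shell
# Sketch: drop the assets/ directory from every commit of a demo repository.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's startup warning

repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Example"
git config user.email "example@example.com"

# Commit 1: website code. Commit 2: asset-only update. Commit 3: website code.
echo '<h1>site</h1>' > index.html
git add -A && git commit --quiet -m "site: add index"
mkdir assets && echo 'binary blob' > assets/tldr.zip
git add -A && git commit --quiet -m "assets: update archive"
echo '<h1>site v2</h1>' > index.html
git add -A && git commit --quiet -m "site: update index"

# Remove assets/ from every commit; --prune-empty drops commits left empty.
git filter-branch --force --prune-empty \
  --index-filter 'git rm -r --cached --quiet --ignore-unmatch assets' \
  -- --all > /dev/null 2>&1

commit_count="$(git rev-list --count HEAD)"
asset_paths="$(git log --format= --name-only HEAD -- assets)"
echo "$commit_count commits left"
```

In the demo, the asset-only commit disappears and the two website commits remain, which is the shape of history we would want in the new repository.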

@acuteenvy
Member

If we feel it's too onerous to update the release artifacts on every commit, perhaps we could adopt a snapshot strategy, where a new archive would be generated on a time basis (say, weekly, or daily) rather than on a commit basis. I think that either option ought to be frequent enough for the vast majority of users.

Building the assets and overwriting them in the release doesn't take a lot of time, and does not produce garbage (unlike the git method). I don't see why we wouldn't want to do that.

sbrl added a commit that referenced this issue Jan 24, 2024
Autolinks are part of the CommonMark spec (ref <https://spec.commonmark.org/0.30/#autolinks>) and well supported.

Redirect indicators are removed as a part of #12048.
@sbrl
Member

sbrl commented Jan 24, 2024

This change seems like a good plan to me. I thank you once again @kbdharun for taking the initiative here wrt infrastructure!

I suggest we implement it at the earliest available opportunity.

@acuteenvy: I agree that something like using GitHub releases would be a better plan given it really doesn't need version control, but that would likely take longer to implement. I suggest we implement this as @kbdharun suggests first to buy ourselves some time and fix the immediate problem (since they've gone to all the trouble of testing it etc :P), and then look at that in a separate issue.

A related plan, once we have a separate assets repo, could be to adjust the script to always amend the last commit & force-push, with only a normal commit every ~month or so. This would be more compatible with existing clients than GitHub releases.
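
A minimal sketch of the amend & force-push idea, run against a local bare repository standing in for the assets repo (all paths and file contents are made up for the demo):

```shell
# Sketch: replace the previous asset commit instead of stacking a new one.
set -eu

root="$(mktemp -d)"
git init --quiet --bare "$root/remote.git"            # stand-in for the assets repo
git clone --quiet "$root/remote.git" "$root/work" 2>/dev/null
cd "$root/work"
git config user.name "Example"
git config user.email "example@example.com"

# First deploy: a normal commit.
echo v1 > tldr.zip
git add -A && git commit --quiet -m "assets: snapshot"
git push --quiet origin HEAD

# Later deploy: rewrite that commit with the new files and force-push it.
echo v2 > tldr.zip
git add -A
git commit --quiet --amend --no-edit
git push --quiet --force origin HEAD

remote_commits="$(git --git-dir="$root/remote.git" rev-list --count --all)"
remote_tip="$(git --git-dir="$root/remote.git" for-each-ref --format='%(objectname)' refs/heads)"
remote_content="$(git --git-dir="$root/remote.git" show "$remote_tip:tldr.zip")"
echo "$remote_commits commit(s) on the remote, content: $remote_content"
```

The branch stays at one commit, but the replaced objects only go away once the host garbage-collects them, so how much space this saves on GitHub's side is an open question.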

These changes will break clients that use https://raw.githubusercontent.com/tldr-pages/tldr-pages.github.io instead of https://tldr.sh/assets (the URL will then be https://raw.githubusercontent.com/tldr-pages/assets).

It will, but the client spec says clients MUST download from e.g. https://tldr.sh/assets/tldr.zip - so any clients doing that are in violation of the client spec.

We do need to update the client spec ref this tho, given it indicates where each URL redirects to. I've opened PR #12133 to resolve this.

@acuteenvy
Member

using GitHub releases would be a better plan given it really doesn't need version control, but that would likely take longer to implement.

I've already implemented and tested it (#12062). Updating all clients would definitely take a while though, but that is not much of a problem if we continue to support the old method for some time (which is what I did in the PR).

@kbdharun
Member Author

kbdharun commented Jan 26, 2024

This change seems like a good plan to me. I thank you once again @kbdharun for taking the initiative here wrt infrastructure!

You're welcome. @acuteenvy's solution would work best to fix the issue permanently. I'm dropping the DM I sent you here, as I don't have time to type it again 😅.


My initial proposal was ideally a short-term solution (I didn't consider the use of GitHub releases or clients using redirects for GitHub Pages links). Now that we have discussed it (and I have personally used this approach in other projects), I think going with releases would be better in the long run. I can separate the website once we have migrated fully, maybe next year; there is no need to do it immediately.

After your PR (#12133) and acuteenvy's PR (#12062) are merged, I think we can make a minor release informing client authors about the future change in 1 year (in the meantime we can try updating existing clients to use release links, i.e. https://github.com/tldr-pages/tldr/releases/latest/download/tldr-pages..zip).


We can leave this issue open for now, and close it when we fully drop this method of uploading assets using GitHub Pages. Also, I would love to hear some client authors' feedback on what they think about this (publishing to GitHub releases). cc @dbrgn @niklasmohrin @rwv

@rwv
Contributor

rwv commented Jan 26, 2024

We can leave this issue open for now, and close it when we fully drop this method of uploading assets using GitHub Pages. Also, I would love to hear some client authors' feedback on what they think about this (publishing to GitHub releases). cc @dbrgn @niklasmohrin @rwv

tldr.inbrowser.app uses the tldr Git repository's archive download directly:

https://github.com/tldr-pages/tldr/archive/refs/heads/main.zip

Therefore it shouldn’t matter.

@niklasmohrin

Tealdeer uses tldr.sh to download the latest archive. As long as the semantics of that endpoint don't change, there shouldn't be anything to do for us.

Personally, I would like the delay of the github repo and the zip downloaded from tldr.sh to be kept to a minimum to avoid possible user confusion. I think something like 30min should be the maximum. If feasible, an instant update on every commit would be nice (although care must be taken to always keep the newest version when two PRs get merged right after one another).

@rwv
Contributor

rwv commented Jan 27, 2024

Maybe we can redirect the asset to https://github.com/tldr-pages/tldr/archive/refs/heads/main.zip and let GitHub handle archiving and caching for us. This would also keep asset.zip always up to date. But the per-locale assets would be a problem.

@rwv
Contributor

rwv commented Jan 27, 2024

Client spec:

Caching SHOULD be done according to the user's language configuration (if any), to not waste unneeded space for unused languages. Additionally, clients MAY automatically update the cache regularly.

My personal opinion: do we really need language specific caching? The whole repo is only 7.2MB. I see little benefit and the complexity is increased since the client needs to deal with locale and the not-found circumstances.

@kbdharun
Member Author

kbdharun commented Jan 27, 2024

My personal opinion: do we really need language specific caching? The whole repo is only 7.2MB. I see little benefit and the complexity is increased since the client needs to deal with locale and the not-found circumstances.

While space/connectivity isn't a concern for most of us, it matters in clients like the Node client (when a page isn't found or you make a typo, the asset is fetched again, adding to the overhead). The main reason for this approach is that certain clients/integrations (extensions) only target a few select languages, where having the others isn't necessary (to avoid increasing the application's size). Some embedded-system/cellular users (from remote regions, etc.) also raised the same issue in the past, which is why we introduced this method.

Let me take myself as an example: my university's wireless network speeds are capped at 2 MBPS, and the speeds you get with GitHub are even lower. I can use cellular 5G, but if I use the university network, fetching the entire archive takes anywhere between 30 seconds and a minute, whereas with the current method it is way faster.

@acuteenvy
Member

My personal opinion: do we really need language specific caching? The whole repo is only 7.2MB. I see little benefit and the complexity is increased since the client needs to deal with locale and the not-found circumstances.

If you think that's not needed, don't implement it. There are many clients for many different use cases, and we will still continue to provide the full archive.

@kbdharun
Member Author

kbdharun commented Jan 28, 2024

Merged #12062 and it has successfully added the assets to the latest release.

Personally, I would like the delay of the github repo and the zip downloaded from tldr.sh to be kept to a minimum to avoid possible user confusion. I think something like 30min should be the maximum. If feasible, an instant update on every commit would be nice (although care must be taken to always keep the newest version when two PRs get merged right after one another).

Regarding this, we could add a branch protection rule for the main branch enforcing a merge queue (i.e. when multiple PRs are merged, the next one starts only when the previous job is completed). I think I have proposed this in the chatroom or a thread before, not sure where 😅. Check out https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue for more information.

@kbdharun
Member Author

kbdharun commented Feb 1, 2024

Regarding this, we could add a branch protection rule for the main branch enforcing a merge queue (i.e. when multiple PRs are merged, the next one starts only when the previous job is completed). I think I have proposed this in the chatroom or a thread before, not sure where 😅. Check out https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue for more information.

Tested the implementation of merge queues for the past hour; the advantage it introduces is that we can enforce PRs to be in sync with the main branch and we can also limit the number of PRs that can be merged at a time.

I tested it in my fork (setting the default branch, enabling the merge queue in branch protection, using squash as the merge strategy, allowing a maximum of 1 PR to run builds in the queue, and allowing only one PR to be merged per merge commit); I also removed the "push" parameter from the action.

But it comes with a lot of disadvantages:

  1. It requires a single merge strategy, meaning that if we set "Squash & merge", we can't use "Rebase & merge" (except for admins, who can bypass the entire merge queue) in PRs where authors have crafted individual commits.
  2. The addition is incompatible with the current mono CI structure (and if the github.refs identifier is used, the deploy step doesn't work at all).
  3. Even with a separate workflow, because of how merge queues work there are two workflow runs (a pre-merge test and a post-merge test/execution), meaning deployment takes place twice.

After reading about this online and in the docs, and playing around with it, I conclude that in its current state it isn't feasible for us. Since this is a fairly new feature, I hope it will improve, and we can check back on it in the future. Until then, maintainers should make sure not to retrigger old failed workflow runs if a newer one has succeeded (and the assets are deployed), as we specify in the maintainers guide.

I have attached the workflow files I used for future reference:

merge-queue-workflows-test.zip

@sbrl
Member

sbrl commented Feb 6, 2024

The new assets on the release look cool, but I don't see an archive for English there?

I also don't see an archive for all pages attached to the release as we currently have in the website git repo?

@kbdharun
Member Author

kbdharun commented Feb 7, 2024

The new assets on the release look cool, but I don't see an archive for English there?

I also don't see an archive for all pages attached to the release as we currently have in the website git repo?

See #12062 (comment) for more information.

TL;DR: Release assets aren't arranged alphabetically, and recently updated ones are shown last (all the assets in the website git repo are present here too).

kbdharun added a commit that referenced this issue Feb 19, 2024
* CLIENT-SPECIFICATION: remove redirect indicators, use autolinks

Autolinks are part of the CommonMark spec (ref <https://spec.commonmark.org/0.30/#autolinks>) and are well supported.

Redirect indicators are removed as a part of #12048.

* CLIENT-SPECIFICATION: update changelog

---------

Co-authored-by: Lena <126529524+acuteenvy@users.noreply.github.com>
Co-authored-by: K.B.Dharun Krishna <kbdharunkrishna@gmail.com>
@sbrl
Member

sbrl commented Feb 23, 2024

Cool, ty for the clarification @kbdharun!

7 participants