Allow gitlab URL link shortening from non-gitlab/github.com domains #2068

mattpitkin · 2024-12-04T12:31:44Z

This is a PR for #2065. It allows URL shortening for GitLab/GitHub links that use the gitlab_url/github_url given in html_context, e.g.,:

html_context = {
    "gitlab_url": "https://gitlab.mydomain.com/",
    ...
}

rather than just for GitLab/Github URLs that are in the git[lab/hub].com domain.

Update: I've expanded this PR to also add in the ability to shorten bitbucket URLs.

…ains

gabalafou · 2025-06-17T18:22:23Z

Thanks @mattpitkin for this work! I felt that this work needed more tests, so I added tests, and that led to some slight refactoring.

@drammock, do you think you could give my additions a review? Please bear in mind that a lot of the test cases are marked with "TODO: this is wrong", but the same is true not just for the bitbucket URLs but for some of the GitHub and GitLab URLs, as well. I think those problems should maybe be tackled in a separate PR. What do you think?

src/pydata_sphinx_theme/short_link.py

tests/test_short_url.py

drammock · 2025-06-17T22:32:47Z

src/pydata_sphinx_theme/short_link.py

+        elif self.platform == "bitbucket":
+            # split the url content
+            parts = path.split("/")
+
+            if len(parts) > 0:
+                text = parts[0]  # organisation
+            if len(parts) > 1 and not (
+                parts[-2] == "workspace" and parts[-1] == "overview"
+            ):
+                if len(parts) > 1:
+                    text += f"/{parts[1]}"  # repository
+                if len(parts) > 2:
+                    if parts[2] in ["issues", "pull-requests"]:
+                        itemnumber = parts[-1]
+                        text += f"#{itemnumber}"  # element number


this is hard to follow, and feels brittle to me. Why aren't we using regex for this? I worked up this gist as a POC and it seems to capture the behavior that we want:

https://gist.github.com/drammock/78d2d3c9837aafd1259866c7b936b9e4

presumably similar things will work for other forges (though IIRC gitlab is a bit more complex). @gabalafou WDYT?

I agree. The code in this PR follows patterns already established for GitHub and GitLab, but I think those patterns are bad.

The way I think this should work is that we should be rather picky about which URL patterns we recognize/support and only shorten those. All the other URLs should be not be shortened. But right now, the code takes the opposite approach, as soon as it sees github.com or gitlab.com, it tries to shorten the URL.

For example, let's say we were just starting to support GitHub URLs. But in this scenario, let's start only with supporting pull request URLs.

Then we would convert the following link, like so:

https://github.com/pydata/pydata-sphinx-theme/101 => pydata/pydata-sphinx-theme#101

But we would not convert any of the following links (if we were only supporting pull request links):

https://github.com/pydata/pydata-sphinx-theme/issues https://github.com/pydata/pydata-sphinx-theme/issues/ https://github.com/pydata/ https://github.com/pydata/pydata-sphinx-theme/commit/3caf346cacd2dad2a192a83c6cc9f8852e5a722e

None of those links would get shortened until we specifically add each type of URL that we want to support.

If we want to go down the route of a rewrite of the link shortening logic, I would be happy to do it. In that case, I would just close this PR and open a new one.

I would abandon the feature in this PR to turn off link shortening because I suspect that part of the motivation for adding a config value to turn it off is because our current link shortener is bad. Perhaps with a better shortener, there would be little to no demand for a config setting to turn it off. What do you think, @mattpitkin?

The code in this PR follows patterns already established for GitHub and GitLab, but I think those patterns are bad. [...] If we want to go down the route of a rewrite of the link shortening logic, I would be happy to do it. In that case, I would just close this PR and open a new one.

Since I think I've basically solved it for bitbucket in the linked Gist, I feel like it might be the right call to just rewrite the others too. +1 to close and open a new PR to refactor what we have and also add bitbucket. See also #2215 which adds link shortening for codeberg/forgejo and gitea.

I would abandon the feature in this PR to turn off link shortening because I suspect that part of the motivation for adding a config value to turn it off is because our current link shortener is bad. Perhaps with a better shortener, there would be little to no demand for a config setting to turn it off. What do you think, @mattpitkin?

Do you mean, why did I put the gitlab_url etc in the ``html_contextdictionary of theconf.py` file, rather than being at the root level? In part, I think I just found that other information was conveyed in that dictionary, so it seemed a safe place to add new information that wouldn't interfere with anything else. If that information can go at root level, then I don't have an issue with that. You'd still have to specify the required URL bases to turn on/off URL shortening for them though - remember this is specifically for URLs that are not in the standard gitlab.com, github.com, bitbucket.org base domains, and they don't even have to have, e.g., gitlab, in the domain name (see, for example, https://git.ligo.org).

@mattpitkin, sorry, my comment actually didn't make any sense.

I was writing it in a hurry yesterday and I got two different PRs mixed up in my head: this one versus #2109. But the other PR is very much related to this PR because it provides an option to turn off link shortening.

Co-authored-by: Daniel McCloy <dan@mccloy.info>

gabalafou

I realized it would probably be better not to close this PR and open a new one but rather to do the refactor within this PR.

That said, I'm really wondering if this logic should be spun out into a separate Sphinx extension.

gabalafou · 2025-06-24T17:32:37Z

tests/test_build/bitbucket_links.html

@@ -0,0 +1,16 @@
+<div class="bitbucket-container docutils container">
+ <p>
+  <a class="reference external" href="https://bitbucket.org">


Should we still add the bitbucket class, which adds the BitBucket icon before the link, even if we do not shorten the link?

My sense is that only shortened links should get the icon.

it is could be slightly shortened, ~~as it omits~~ by omitting the https:// part. If we did so and I had to pick I would include the logo but I don't really care.

Yeah, that could be nice follow-on work

tests/test_build/github_links.html

gabalafou · 2025-06-24T17:34:45Z

tests/test_build/github_links.html

+ </p>
+ <p>
+  <a class="reference external" href="https://www.github.com/pydata/pydata-sphinx-theme/pull/1012">
+   https://www.github.com/pydata/pydata-sphinx-theme/pull/1012


Not sure if we should support www-prefixed URLs or not.

what's the argument for not supporting it? Is it hard to make it work?

In my head, the argument against is just to nudge the documentation writers to use canonical, rather than non-canonical URLs. By that reasoning, I should remove support for the non-canonical GitLab URLs too.

I think we should add support for non-canonical forms; in the long run I think it will save us time. If we don't, it seems almost certain that user complaints will arise. Can be a follow-up PR

tests/test_build/gitlab_links.html

tests/test_short_url.py

gabalafou · 2025-06-24T17:40:40Z

src/pydata_sphinx_theme/short_link.py

                # only do something if the platform is identified
-                self.platform = self.supported_platform.get(uri.netloc)


I'm not 100% sure about this part of the class refactor.

I got rid of the platform member. I also renamed the parse_url method and moved it out of the class.

If I am reading the code correctly, it seems odd to me that each time this class encounters a node, it changes self.platform to whatever is matched in that moment.

I felt like it would be cleaner and easier to test if I just passed platform as an argument to the shortener function.

gabalafou · 2025-06-24T17:41:28Z

src/pydata_sphinx_theme/short_link.py

+                platform = self.supported_platform.get(parsed_uri.netloc)
+                if platform is not None:
+                    short = shorten_url(platform, uri)
+                    if short != uri:


This is the code change that prevents adding the platform class (and thereby the icon) unless the link is actually shortened.

gabalafou · 2025-06-24T17:44:04Z

.pre-commit-config.yaml

@@ -39,6 +39,7 @@ repos:
    hooks:
      - id: djlint-jinja
        types_or: ["html"]
+        exclude: ^tests/test_build/.*\.html$


Linter was complaining about http link in a test fixture so I excluded it here.

drammock · 2025-06-24T19:05:57Z

src/pydata_sphinx_theme/short_link.py

+    # Branch URL
+    elif match := re.match(r"/([^/]+)/([^/]+)/tree/([^/]+)", path):
+        owner, repo, branch = match.groups()
+        return f"{owner}/{repo}:{unquote(branch)}"


why the colon here? Aren't (tips of) branches usually represented with @? For example pip install git+https://git.example.com/MyProject.git@master

I used AI to help write this and I thought that it would get the short patterns correct.

Looking into this some more, it looks like this also goes above and beyond what GitHub supports. I'm not sure we should do that.

Maybe for GitHub, I should change this code to only support URLs to PRs, issues and commits.

drammock · 2025-06-24T19:10:18Z

src/pydata_sphinx_theme/short_link.py

+    if (m := re.match(r"^/(.+)/([^/]+)/-/merge_requests/(\d+)$", path)) or (
+        m := re.match(r"^/(.+)/([^/]+)/merge_requests/(\d+)$", path)


I think the second pattern can be omitted if you insert a (?:/-)? (optional non-capturing group containing literal /-) in place of the /- in the first pattern.

How would the full regex look? Like this?

r"^/(.+)/([^/]+)(?:/-)?/merge_requests/(\d+)$"

I don't think that works because what happens is if you try to match that regex on a URL path like:

/gitlab-org/gitlab/-/merge_requests/123

You get:

>>> m = re.match(r"^/(.+)/([^/]+)(?:/-)?/merge_requests/(\d+)$", "/gitlab-org/gitlab/-/merge_requests/123") >>> m.groups() ('gitlab-org/gitlab', '-', '123')

It also doesn't work if you try to make the patterns in the capturing groups lazy:

>>> m = re.match(r"^/(.+?)/([^/]+?)/merge_requests/(\d+)$", "/gitlab-org/gitlab/-/merge_requests/123") >>> m.groups() ('gitlab-org/gitlab', '-', '123')

I fiddled with this several times until I just gave up and decided it's probably easier to write two separate regexes (one for canonical, one for non-canonical GitLab URLs) than to try to combine them into one.

drammock · 2025-06-24T19:13:13Z

src/pydata_sphinx_theme/short_link.py

+    # Issue URL
+    elif match := re.match(r"/([^/]+)/([^/]+)/issues/(\d+)", path):
+        workspace, repo, issue_id = match.groups()
+        return f"{workspace}/{repo}!{issue_id}"


is the ! standard for bitbucket issues? above we use # for both PRs and issues in the GitHub/GitLab shorteners.

Honestly, I have no clue about Bitbucket. Never used it.

drammock · 2025-06-24T19:14:45Z

tests/test_build/github_links.html

+ </p>
+ <p>
+  <a class="reference external" href="https://www.github.com/pydata/pydata-sphinx-theme/pull/1012">
+   https://www.github.com/pydata/pydata-sphinx-theme/pull/1012


what's the argument for not supporting it? Is it hard to make it work?

gabalafou

Thanks for your review @drammock!

gabalafou · 2025-06-26T16:22:22Z

src/pydata_sphinx_theme/short_link.py

+    # Branch URL
+    elif match := re.match(r"/([^/]+)/([^/]+)/tree/([^/]+)", path):
+        owner, repo, branch = match.groups()
+        return f"{owner}/{repo}:{unquote(branch)}"


I used AI to help write this and I thought that it would get the short patterns correct.

Looking into this some more, it looks like this also goes above and beyond what GitHub supports. I'm not sure we should do that.

Maybe for GitHub, I should change this code to only support URLs to PRs, issues and commits.

gabalafou · 2025-06-26T16:33:17Z

src/pydata_sphinx_theme/short_link.py

+    if (m := re.match(r"^/(.+)/([^/]+)/-/merge_requests/(\d+)$", path)) or (
+        m := re.match(r"^/(.+)/([^/]+)/merge_requests/(\d+)$", path)


How would the full regex look? Like this?

r"^/(.+)/([^/]+)(?:/-)?/merge_requests/(\d+)$"

I don't think that works because what happens is if you try to match that regex on a URL path like:

/gitlab-org/gitlab/-/merge_requests/123

You get:

>>> m = re.match(r"^/(.+)/([^/]+)(?:/-)?/merge_requests/(\d+)$", "/gitlab-org/gitlab/-/merge_requests/123") >>> m.groups() ('gitlab-org/gitlab', '-', '123')

It also doesn't work if you try to make the patterns in the capturing groups lazy:

>>> m = re.match(r"^/(.+?)/([^/]+?)/merge_requests/(\d+)$", "/gitlab-org/gitlab/-/merge_requests/123") >>> m.groups() ('gitlab-org/gitlab', '-', '123')

I fiddled with this several times until I just gave up and decided it's probably easier to write two separate regexes (one for canonical, one for non-canonical GitLab URLs) than to try to combine them into one.

gabalafou · 2025-06-26T16:59:57Z

src/pydata_sphinx_theme/short_link.py

+    # Issue URL
+    elif match := re.match(r"/([^/]+)/([^/]+)/issues/(\d+)", path):
+        workspace, repo, issue_id = match.groups()
+        return f"{workspace}/{repo}!{issue_id}"


Honestly, I have no clue about Bitbucket. Never used it.

gabalafou · 2025-06-26T17:00:27Z

tests/test_build/bitbucket_links.html

@@ -0,0 +1,16 @@
+<div class="bitbucket-container docutils container">
+ <p>
+  <a class="reference external" href="https://bitbucket.org">


Yeah, that could be nice follow-on work

gabalafou · 2025-06-26T17:01:17Z

tests/test_build/github_links.html

+ </p>
+ <p>
+  <a class="reference external" href="https://www.github.com/pydata/pydata-sphinx-theme/pull/1012">
+   https://www.github.com/pydata/pydata-sphinx-theme/pull/1012


In my head, the argument against is just to nudge the documentation writers to use canonical, rather than non-canonical URLs. By that reasoning, I should remove support for the non-canonical GitLab URLs too.

drammock · 2025-06-26T18:01:04Z

This may be moot given the move to not support non-canonical URLs, but:

How would the full regex look?

In [1]: import re
In [2]: path1 = "/gitlab-org/gitlab/-/merge_requests/123"
In [3]: path2 = "/gitlab-org/gitlab/merge_requests/123"
In [4]: pattern = re.compile(r"^/([^/]+)/([^/]+)(?:/-)?/merge_requests/(\d+)$")
In [5]: pattern.match(path1).groups()
Out[5]: ('gitlab-org', 'gitlab', '123')
In [6]: pattern.match(path2).groups()
Out[6]: ('gitlab-org', 'gitlab', '123')

I think the reason the prior pattern wasn't working was that the initial (.+) was too greedy, so it was grabbing both workspace and repo. Above it's now ([^/]+), matching exactly one element of the path (workspace).

drammock · 2025-06-26T18:38:00Z

Honestly, I have no clue about Bitbucket. Never used it.

I did a little digging. this page has sections on linking to issues and PRs. PRs are projectkey/repo-slug#pr_id.

Issues are more complicated; there is a distinction between repo-level issues and project-level "JIRA issues". Repo-level issues seem to count serially, separately from PRs. Except when they don't: this page shows an odd jump from issue 170 to issue 441, with apparently no (publicly viewable?) issues in between. It's a 5-month gap, so not impossible that there were ~250 private issues in that time frame, but seems really unlikely. Also I don't even know if private issues are a thing.

"JIRA Issues" rely on something called the "JIRA Application Issue Key": this page says "By default, JIRA issue keys are of the format <project key>-<issue id>". That seems to imply that those issues are tracked at the project/org level, not at the repo level? There's also a note here saying that the project key only supports uppercase characters.

TL;DR: the PR shortening matches GitHub, and I guess for now it's OK to keep the ! when shortening repo issues (seems necessary since repo-issues and PRs are assigned numbers independently). We should not touch "JIRA issues" --- I'm not even clear on what the URL would look like, much less how to shorten it.

gabalafou · 2025-06-27T08:09:25Z

Totally agreed on not touching JIRA issues.

And exactly! We can't shorten both Bitbucket PRs and issues to org/repo#101 because like you said a PR and an issue can have the same number, unlike GitHub, for example:

Despite what Bitbucket's markdown syntax guide says, it looks like in the Bitbucket UI you cannot link to another pull request with #101. You have to use PR #101 or pull request #101, which you can see on an example Bitbucket PR I created and commented.

Maybe for consistency we should adopt GitLab's notation, using ! for pull requests and # for issues, what do you think?

gabalafou · 2025-06-27T08:21:43Z

I think the reason the prior pattern wasn't working was that the initial (.+) was too greedy, so it was grabbing both workspace and repo. Above it's now ([^/]+), matching exactly one element of the path (workspace).

The problem is that GitLab namespaces can be more than two levels deep. For example, this is the URL to a merge request that I created: https://gitlab.com/gabalafou-group/ala/fou/test/-/merge_requests/1. (You can't visit it, unfortunately, because the repo is private but trust me that this is the URL.)

That's why I had the initial wild pattern, (.+), instead of the more restrained non-forward-slash pattern, ([^/]+).

drammock · 2025-06-27T18:56:45Z

The problem is that GitLab namespaces can be more than two levels deep.

ah. didn't know that. All makes sense now.

drammock · 2025-06-27T19:23:02Z

The problem is that GitLab namespaces can be more than two levels deep.

ah. didn't know that. All makes sense now.

for the record I did sorta find a working regex that handles that case too:

#         non-capturing group allowed
#             to repeat itself
#             ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓                
r"^/([^/]+)/((?:(?!-/)[^/]+/)+)(?:-/)?merge_requests/(\d+)$"
#                ↑↑↑↑           ↑↑↑↑↑↑
#              prevent          allow -/
#           capture of -/      if present

but it has the downside that the workspace will always contain a trailing /, so you get

import re
pattern = re.compile(r"^/([^/]+)/((?:(?!-/)[^/]+/)+)(?:-/)?merge_requests/(\d+)$")
path0 = "/gabalafou-group/ala/fou/test/-/merge_requests/1"
path1 = "/gitlab-org/gitlab/-/merge_requests/123"
path2 = "/gitlab-org/gitlab/merge_requests/123"
pattern.match(path0).groups()
# ('gabalafou-group', 'ala/fou/test/', '1')
pattern.match(path1).groups()
# ('gitlab-org', 'gitlab/', '123')
pattern.match(path2).groups()
# ('gitlab-org', 'gitlab/', '123')

__init__.py: allow gitlab URL link shortening from non-gitlab.com dom…

e786e0d

…ains

mattpitkin mentioned this pull request Dec 4, 2024

Allow URL shortening for non-gitlab.com GitLab repositories #2065

Open

__init__.py: also allow for non-github.com github domains

0e43585

mattpitkin changed the title ~~Allow gitlab URL link shortening from non-gitlab.com domains~~ Allow gitlab URL link shortening from non-gitlab/github.com domains Dec 4, 2024

mattpitkin added 5 commits December 4, 2024 14:55

Add ability to shorten bitbucket links

9f9df68

__init__.py: add custum bitbucket url

617b977

__init__.py: fix typo html_content -> html_context

277730f

short_link.py: treatmemt for bitbucket workspace/overview URLs

86dd418

Update documentation

96a1812

trallard added the kind: enhancement New feature or request label Dec 10, 2024

drammock mentioned this pull request Mar 26, 2025

DISCUSS - 0.17.0 Release #2176

Open

gabalafou requested review from gabalafou June 17, 2025 11:46

gabalafou added 3 commits June 17, 2025 06:46

Merge branch 'main' into gitlab_url

2e03182

add tests, TODO comments

82339e3

class method to configure ShortenLinkTransform, more tests

8158d88

gabalafou requested a review from drammock June 17, 2025 18:53

drammock reviewed Jun 17, 2025

View reviewed changes

mattpitkin and others added 3 commits June 18, 2025 10:50

Update src/pydata_sphinx_theme/short_link.py

0924968

Co-authored-by: Daniel McCloy <dan@mccloy.info>

rewrite link shortener

13535e3

update docstring for test function

394bf57

gabalafou reviewed Jun 24, 2025

View reviewed changes

gabalafou mentioned this pull request Jun 24, 2025

WIP: Support additional forges #2215

Open

drammock reviewed Jun 24, 2025

View reviewed changes

gabalafou reviewed Jun 26, 2025

View reviewed changes

gabalafou added 4 commits June 26, 2025 19:28

reduce link shortener to 3 types

5a7cda5

update unit tests

6d58054

update fixtures

a5d7086

update docstring

b55a80e

		# only do something if the platform is identified
		self.platform = self.supported_platform.get(uri.netloc)

		if (m := re.match(r"^/(.+)/([^/]+)/-/merge_requests/(\d+)$", path)) or (
		m := re.match(r"^/(.+)/([^/]+)/merge_requests/(\d+)$", path)

Allow gitlab URL link shortening from non-gitlab/github.com domains #2068

Are you sure you want to change the base?

Allow gitlab URL link shortening from non-gitlab/github.com domains #2068

Conversation

mattpitkin commented Dec 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gabalafou commented Jun 17, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattpitkin Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gabalafou Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gabalafou left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

drammock Jun 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gabalafou left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattpitkin commented Dec 4, 2024 •

edited

Loading

mattpitkin Jun 19, 2025 •

edited

Loading

gabalafou Jun 19, 2025 •

edited

Loading

drammock Jun 24, 2025 •

edited

Loading

drammock commented Jun 26, 2025 •

edited

Loading

gabalafou commented Jun 27, 2025 •

edited

Loading