Checking Links in published documentation #1759

Closed
adamlazik1 opened this issue Nov 8, 2022 · 7 comments

@adamlazik1
Contributor

I have noticed that despite having linkchecker we are getting bug reports about invalid links in the existing documentation. For example, BZ#2139221.
I have created a simple script that checks HTML files for links, validates them using curl, and reports any link for which the request returned a status code other than 200 (meaning an error happened somewhere).
You can find the script here: https://github.com/adamlazik1/link-check
Instructions on installation and how to use it are in the README file.
I will appreciate any feedback, bug reports, or suggestions for improvement you may have.
Feel free to create PRs with improvements if you feel like it.
As a demonstration of the difference between link-check and linkchecker, I am attaching this picture:
[image: link-check and linkchecker output compared]
This concerns the Satellite 6.11 (Foreman 3.1) build of the Configuring Load Balancer guide. You can see that linkchecker does not detect any invalid link, but there is in fact one that points to a non-existent page.

Naturally, link-check cannot check links that point to guides that have not been published yet (for example, the Satellite build of the Foreman 3.3 guides).
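
The approach described in this comment (extract links from an HTML file, request each one with curl, report anything that does not return 200) can be sketched roughly as follows. This is not the code from the link-check repository: the function names extract_links, http_status, and report_broken_links are hypothetical, and the curl options shown are only one reasonable choice.

```shell
# Hypothetical sketch, not the actual link-check script.

# Print every distinct http(s) URL found in href attributes of an HTML file.
extract_links() {
  grep -oE 'href="https?://[^"]+' "$1" | sed -e 's/^href="//' | sort -u
}

# Print the final HTTP status code for one URL, following redirects.
# curl prints 000 when no HTTP response was received at all.
http_status() {
  curl --silent --location --max-time 20 --output /dev/null \
    --write-out '%{http_code}' "$1" </dev/null
}

# Print "STATUS URL" for every link in the file whose status is not 200.
report_broken_links() {
  extract_links "$1" | while IFS= read -r url; do
    status=$(http_status "$url")
    if [ "$status" != "200" ]; then
      printf '%s %s\n' "$status" "$url"
    fi
  done
}
```

Calling report_broken_links with the path of a built HTML guide would then list the suspicious links. Like the original script, this cannot validate links to pages that are not published yet.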

@adamlazik1
Contributor Author

adamlazik1 commented Nov 8, 2022

Never mind all that!
After six hours of work yesterday, I learned that linkchecker does work if you specify the build in the make command, like this: make linkchecker BUILD=satellite. However, it does not seem to detect all the links. On the 3.1 branch, make linkchecker BUILD=satellite in doc-Upgrading_and_Updating checks 29 URLs, while a grep search for distinct URLs returns 38, so my work may not be completely in vain 😅. Attaching output.
[image: linkchecker output]

@ekohl
Member

ekohl commented Nov 8, 2022

This is configured via the ignore option in linkchecker.ini:

ignore=example.com
  cdn.redhat.com
  file:///modules
  file:///js/versions.js
  file:///js/nav.js
  tools.ietf.org
  host/unattended/provision
  projects.theforeman.org
  access.redhat.com/solutions
  forge.puppetlabs.com
  keycloak.com
  rhsso.com
  sources/
  atixservice.zendesk.com
# This isn't published downstream yet
  access.redhat.com/documentation/en-us/red_hat_satellite/6.10/html-single/managing_configurations_using_puppet_integration/index
  access.redhat.com/documentation/en-us/red_hat_satellite/6.11

As you can see, it ignores 6.11. In feb42e7 I turned it on for 3.1.

#1750 is an effort to properly resolve it for master. You're (very) welcome to take over the effort.
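
To see why the 6.11 links were skipped, it may help to test a URL against the ignore list by hand. The sketch below assumes that each ignore entry is treated as a regular expression searched for anywhere in the URL, which is my understanding of linkchecker's ignore option. The helper name is_ignored is hypothetical, and only a subset of the patterns quoted above is included.

```shell
# Subset of the ignore patterns quoted above, one regular expression per line.
ignore_patterns='example.com
cdn.redhat.com
access.redhat.com/solutions
access.redhat.com/documentation/en-us/red_hat_satellite/6.11'

# Succeed if the URL matches any of the patterns (unanchored search).
# grep treats a newline-separated pattern list as alternatives.
is_ignored() {
  printf '%s\n' "$1" | grep -qE -e "$ignore_patterns"
}

for url in \
  'https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html/release_notes/index' \
  'https://access.redhat.com/articles/3664871'
do
  if is_ignored "$url"; then
    echo "ignored: $url"
  else
    echo "checked: $url"
  fi
done
```

With these patterns, the 6.11 release notes URL is reported as ignored and the articles URL as checked, which matches the behaviour described above.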

@adamlazik1
Contributor Author

Still, not all the links are checked for the 3.1 branch, as can be seen from the output above. Nevertheless, I can take over #1750 if you wish, but the work will probably be postponed because I will be less active in the coming days.

@ekohl
Member

ekohl commented Nov 8, 2022

I'd recommend running linkchecker with --verbose and comparing the URLs.

@adamlazik1
Contributor Author

adamlazik1 commented Nov 8, 2022

I tested this on 3.1 for the Upgrading guide. Here are the commands that I used and the output:

$ make BUILD=satellite
$ linkchecker --verbose ../build/Upgrading_and_Updating/index-satellite.html | grep 'Real URL' | tr -s ' ' | cut -d' ' -f3 | sort > linkchecker-checked-urls.txt
$ grep -oE 'href="http[^"]+' ../build/Upgrading_and_Updating/index-satellite.html | sed -e 's/href="//' | sort | uniq > grepped-urls.txt
$ diff linkchecker-checked-urls.txt grepped-urls.txt

1,3d0
< file:///home/alazik/repos/foreman-documentation/guides/build/Upgrading_and_Updating/index-satellite.html
< file:///js/nav.js
< file:///js/versions.js
11c8,9
< https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html/release_notes/index
---
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html/release_notes/index#
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html/release_notes/index#ref_known-issues_assembly_introducing-red-hat-satellite
12a11,14
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/administering_red_hat_satellite/index#Performing_an_Incremental_Backup_admin
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/administering_red_hat_satellite/index#Renaming_Server_admin
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/administering_red_hat_satellite/index#Restoring_from_a_Full_Backup_admin
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/administering_red_hat_satellite/index#Restoring_from_Incremental_Backups_admin
13a16,17
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/installing_capsule_server/index#
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/installing_capsule_server/index#deploying-a-custom-ssl-certificate-to-capsule-server_capsule
14a19,20
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/installing_satellite_server_in_a_connected_network_environment/index#
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/installing_satellite_server_in_a_connected_network_environment/index#configuring-external-services
15a22,25
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/installing_satellite_server_in_a_disconnected_network_environment/index#
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/installing_satellite_server_in_a_disconnected_network_environment/index#configuring-the-base-operating-system-with-offline-repositories-in-rhel-7_satellite
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/installing_satellite_server_in_a_disconnected_network_environment/index#configuring-the-base-operating-system-with-offline-repositories-in-rhel-8_satellite
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/installing_satellite_server_in_a_disconnected_network_environment/index#downloading-the-binary-dvd-images_satellite
16a27
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/installing_satellite_server_in_a_disconnected_network_environment/index#resolving-package-dependency-errors_satellite
17a29
> https://access.redhat.com/documentation/en-us/red_hat_satellite/6.11/html-single/managing_hosts/index#Configuring_and_Setting_Up_Remote_Jobs_managing-hosts

According to this, there are quite a few URLs that didn't get checked by the linkchecker.
Attaching the two output files.

grepped-urls.txt
linkchecker-checked-urls.txt

@ekohl
Member

ekohl commented Dec 1, 2022

If I strip off the #ref (which the server doesn't care about) it looks like there is no difference:

$ diff -Nut <(linkchecker --verbose ../build/Upgrading_and_Updating/index-satellite.html | grep 'Real URL' | tr -s ' ' | cut -d' ' -f3 | sed 's/#.*//' | sort) <(grep -oE 'href="http[^"]+' ../build/Upgrading_and_Updating/index-satellite.html | sed -e 's/href="// ; s/#.*//' | sort -u)
INFO linkcheck.cmdline 2022-12-01 14:02:49,057 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
--- /dev/fd/63	2022-12-01 14:02:48.886036197 +0100
+++ /dev/fd/62	2022-12-01 14:02:48.887036210 +0100
@@ -1,6 +1,3 @@
-file:///home/ekohl/dev/foreman-documentation/3.1/guides/build/Upgrading_and_Updating/index-satellite.html
-file:///js/nav.js
-file:///js/versions.js
 https://access.redhat.com/articles/3664871
 https://access.redhat.com/articles/4977891
 https://access.redhat.com/articles/6393361
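
The same comparison can be run against the two attached list files instead of re-running linkchecker. This is a minimal sketch: the helper names strip_fragments and unchecked_urls are hypothetical, and it relies on the fragment (the part from # onwards) not being sent to the server, so URLs that differ only in the fragment count as the same request.

```shell
# Drop the fragment from each URL on stdin, then sort and deduplicate.
strip_fragments() {
  sed -e 's/#.*//' | LC_ALL=C sort -u
}

# Print URLs that appear in the second list but not in the first,
# after fragments have been removed from both.
unchecked_urls() {
  checked=$(mktemp)
  found=$(mktemp)
  strip_fragments < "$1" > "$checked"
  strip_fragments < "$2" > "$found"
  # comm -13 keeps only lines unique to the second file.
  LC_ALL=C comm -13 "$checked" "$found"
  rm -f "$checked" "$found"
}
```

Running unchecked_urls linkchecker-checked-urls.txt grepped-urls.txt prints nothing if every grepped URL was also checked. Entries that only linkchecker reports, such as the file:/// ones, are left out on purpose.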

Can we close this?

@adamlazik1
Contributor Author

Sure. I realized later that link-check does not solve that problem either.
