Fix problem with false positives (external URLs) #13

Open
sypets opened this issue Jan 19, 2021 · 3 comments
Labels
false positives false positives for external URLs high priority

Comments

sypets commented Jan 19, 2021

todos

  • in brofix: provide information in the documentation; downloading intermediate certificates is not brofix's job
  • make it possible to exclude specific error types (e.g. curl(60)); the usual problem with this is that legitimate errors of that type are no longer displayed either
  • use some heuristic to better detect probable false positives (difficult, because there are several causes and several error types can be affected)

summary

So far, the following reasons for false positives could be verified:

  1. certificate chain issue: this is actually a misconfiguration of the web server being checked, but it is minor and the page loads without a warning in the browser, so the user perceives (!) the link as not broken. This should be distinguished from other TLS security issues, such as an expired certificate.
    • the error is usually curl(60) (can be verified by running curl -I "url" on the server)

curl: (60) Peer's Certificate issuer is not recognized.
More details here: http://curl.haxx.se/docs/sslcerts.html

    • SSL Labs shows "chain issues: incomplete" and "extra download"
    • to fix on the server side: serve the complete certificate chain (including intermediate certificates)
    • to fix on the client side (the server where brofix is running): download the intermediate certificates
  2. cloudflare
    • error 503 is given

problem description

some URLs are reported as errors even though they work (in browser)

Examples:

The 999 HTTP error is a LinkedIn error. It happens when LinkedIn blocks the User-Agent that tries to access a link. I’m afraid this is an issue on LinkedIn's side: it treats your site as a fake User-Agent.
Since the link is not broken, please feel free to mark it as a “Not Broken” link.


  • twitter URLs

other

  • ...

Apart from this, all access-restricted URLs (401, 403) will fail. This is not really an error but expected behavior. Such URLs could either be added as exclude link target entries, or we could make external link type errors configurable (e.g. have an exclude list for them as well, where you could exclude for example 401, 403, and maybe also "too many redirects").
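The configurable exclude list suggested above could look roughly like this. It is a minimal Python sketch for illustration only, not brofix code: `ExcludeConfig`, `should_report` and the error-type keys are hypothetical names, and the default values are simply the examples mentioned in this comment.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExcludeConfig:
    """Hypothetical configuration: error types that should not be reported."""
    http_status: frozenset = field(default_factory=lambda: frozenset({401, 403}))
    other_errors: frozenset = field(
        default_factory=lambda: frozenset({"too many redirects"})
    )


def should_report(error_type: str, error_code, config: ExcludeConfig) -> bool:
    """Return True if the failed check should appear in the broken link list."""
    if error_type == "http_status":
        return error_code not in config.http_status
    return error_code not in config.other_errors


config = ExcludeConfig()
print(should_report("http_status", 403, config))  # False: access restricted, excluded
print(should_report("http_status", 404, config))  # True: still reported
```

The drawback stays the same as noted in the todos: every error of an excluded type is hidden, including the legitimate ones.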


see also: https://notes.typo3.org/linkvalidator_problem_external_urls

Related:

sypets commented May 19, 2021

Analysis of some URLs which are causing problems.

Currently brofix sends the following HTTP headers (see TSconfig):

User-Agent: configurable
Accept: */*
Accept-Language: *
Accept-Encoding: *

It looks like the Accept-Language / Accept-Encoding may be causing problems in some cases.

It is possible to simulate this with curl:

curl -IL -H "Accept-Language: *" -H "Accept-Encoding: *" URL


curl sends these headers (by default):

curl -ILv URL

HEAD /pages/de/news411455 HTTP/2
Host: idw-online.de
user-agent: curl/7.68.0
accept: */*

Be sure to add -L to follow redirects.
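To narrow down which header is responsible, one could try each combination of the two suspect headers in turn. The following Python sketch only builds the curl command lines and does not run them; `curl_variants` is a hypothetical helper and `https://example.org/` is a placeholder URL.

```python
import shlex
from itertools import combinations

# Headers brofix sends in addition to User-Agent and Accept (see the list above).
SUSPECT_HEADERS = {"Accept-Language": "*", "Accept-Encoding": "*"}


def curl_variants(url: str) -> list[str]:
    """Build one `curl -IL` command line per subset of the suspect headers."""
    names = list(SUSPECT_HEADERS)
    commands = []
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            args = ["curl", "-IL"]
            for name in subset:
                args += ["-H", f"{name}: {SUSPECT_HEADERS[name]}"]
            args.append(url)
            # shlex.join quotes arguments so the line can be pasted into a shell.
            commands.append(shlex.join(args))
    return commands


for command in curl_variants("https://example.org/"):
    print(command)
```

Running the four printed commands against a problematic URL and comparing the status lines should show whether Accept-Language, Accept-Encoding, or only the combination triggers the failure.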

sypets commented Apr 3, 2022

Reason: Incomplete certificate chain

  • curl: (60) SSL certificate problem: unable to get local issuer certificate
  • can be verified with the Qualys SSL Labs server test (we see "Chain issues | Incomplete" and "extra download")
  • the server should supply all intermediate certificates (the root certificate is not required), but it does not
  • browsers fetch the missing certs or get them from their cache, so they do not show an error
  • a solution can be to fetch the intermediate certs and make them available on the server

Example:

curl -I "https://www.ylook.de/search.php?&linklist_idx=11116563"
curl: (60) SSL certificate problem: unable to get local issuer certificate

Solutions

  1. implicit exclude (consider URLs as not broken for now) (this is not a good solution, as these servers often have other issues as well)
  2. Add intermediate certs on server.

This could be done with an extra tool but should not be implemented in brofix.

  3. extend the client somehow (brofix: ExternalLinkType)
  4. extend guzzle somehow, see

That is a good start, but instead of extending the client, I suggest creating an event subscriber that can work for both synchronous and asynchronous requests.

Use custom CA bundle
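As an illustration of the custom CA bundle idea, here is a minimal sketch in Python rather than PHP, so it is not what brofix or Guzzle would actually run. `make_tls_context` is a hypothetical helper, and `extra_ca_file` is assumed to be a PEM file containing the intermediate certificates that the remote server fails to send.

```python
import ssl
from typing import Optional


def make_tls_context(extra_ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Create a verifying TLS context, optionally trusting extra certificates.

    extra_ca_file is a hypothetical path to a PEM file holding the missing
    intermediate certificates.
    """
    # Starts from the system trust store with verification switched on.
    context = ssl.create_default_context()
    if extra_ca_file is not None:
        # Adds to the trusted set; the system CAs stay in place.
        context.load_verify_locations(cafile=extra_ca_file)
    return context


context = make_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

The resulting context can be passed to `urllib.request.urlopen(url, context=context)`. Verification stays enabled, so expired or self-signed certificates are still reported as errors.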

Side note

The same error code (curl(60)) may also result from more severe TLS / certificate issues.

Same error code but a different error message (in command-line curl)!

certificate has expired
self-signed certificate

  1. Certificate name mismatch. Unfortunately, it is not possible to determine which is the case from the error message alone.

Example:

curl -I "https://klimakongressoldenburg.de"
curl: (60) SSL certificate problem: self-signed certificate

The certificate has a "Certificate name mismatch", see Qualys SSL Labs.

  2. Certificate expired, e.g.
curl -I https://openjournal.uni-oldenburg.de/
curl: (60) SSL certificate problem: certificate has expired
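A rough heuristic for telling these cases apart from the curl message could look like the following Python sketch. The message fragments are the ones quoted in this thread; the category names and `classify_curl_error` are hypothetical, and, as noted above, the message does not always identify the real cause (the name mismatch example is reported as self-signed).

```python
import re

# Message fragments quoted in this thread, mapped to a rough category.
CURL_60_CATEGORIES = [
    ("unable to get local issuer certificate", "possibly incomplete chain"),
    ("issuer is not recognized", "possibly incomplete chain"),
    ("certificate has expired", "expired certificate"),
    ("self-signed certificate", "self-signed or name mismatch"),
]


def classify_curl_error(line: str) -> str:
    """Map a `curl: (NN) ...` error line to a category name."""
    match = re.match(r"curl: \((\d+)\)\s*(.*)", line.strip())
    if match is None:
        return "not a curl error"
    code, message = int(match.group(1)), match.group(2).lower()
    if code != 60:
        return "other curl error"
    for fragment, category in CURL_60_CATEGORIES:
        if fragment in message:
            return category
    return "unknown certificate problem"


print(classify_curl_error(
    "curl: (60) SSL certificate problem: unable to get local issuer certificate"
))
```

Only the "possibly incomplete chain" category is a candidate for a probable false positive; the other categories point to certificate problems that a browser would warn about as well.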

see resources:

guzzle

other


sypets commented Apr 3, 2022

Reason: probably Cloudflare DDoS protection ("I'm Under Attack" mode)

  • error 503 is given
  • if the body of the web page is fetched, we see something like cf-im-under-attack
<tr>
      <td align="center" valign="middle">
          <div class="cf-browser-verification cf-im-under-attack">
  <noscript>
    <h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
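The two observations above (status 503 plus a marker in the body) could be combined into a simple check. This is a minimal Python sketch, not brofix code: `looks_like_cloudflare_challenge` is a hypothetical helper, and the marker strings are taken from the HTML fragment above, so they may change whenever Cloudflare changes its challenge page.

```python
# CSS class names seen in the challenge page quoted above.
CHALLENGE_MARKERS = ("cf-im-under-attack", "cf-browser-verification")


def looks_like_cloudflare_challenge(status_code: int, body: str) -> bool:
    """Guess whether a response is a Cloudflare challenge page, not a real outage."""
    if status_code != 503:
        return False
    return any(marker in body for marker in CHALLENGE_MARKERS)


sample = '<div class="cf-browser-verification cf-im-under-attack">'
print(looks_like_cloudflare_challenge(503, sample))                          # True
print(looks_like_cloudflare_challenge(503, "<h1>Service Unavailable</h1>"))  # False
```

Note that this needs the response body, so a HEAD request alone is not enough to make the distinction.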

sypets added the "false positives" label (false positives for external URLs) Sep 27, 2023
sypets added a commit that referenced this issue Mar 3, 2024
This is a breaking change! It is advised to look at the
Changelog in the documentation for more information.

The checkLinks functions in all LinktypeInterface classes now
return a LinkTargetResponse object. It stores a status for the
link check, which makes it easier to handle link target
statuses other than broken.

This effectively makes the following possible:

- show all links in the broken link list, not just the broken links
- better handling of link targets which can't be checked. This includes,
  for example, URLs with 401 or 403 HTTP status codes, where it is not
  possible to check the URLs. Previously, these URLs were considered
  broken, while in fact we do not know whether they are broken or not
  and we have no way to check them. This also includes URLs protected
  by Cloudflare. They are now stored not as broken but as "can't be
  checked"
- it is possible to filter in the link list by this new status

Resolves: #296
Resolves: #289
Related: #13
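
The idea described in this commit message can be sketched as follows. This is a hypothetical Python stand-in, not the actual PHP classes: the `LinkTargetResponse` fields, the `LinkStatus` values and `from_http_status` are assumptions made for illustration, and only the three-way distinction (ok, broken, can't be checked) comes from the message itself.

```python
from dataclasses import dataclass
from enum import Enum


class LinkStatus(Enum):
    OK = "ok"
    BROKEN = "broken"
    CANNOT_CHECK = "cannot check"


@dataclass(frozen=True)
class LinkTargetResponse:
    """Hypothetical stand-in for the object the commit message describes."""
    url: str
    status: LinkStatus
    reason: str = ""


def from_http_status(url: str, http_status: int) -> LinkTargetResponse:
    """Translate an HTTP status code into a three-way link status."""
    if http_status in (401, 403):
        # Access restricted: we cannot tell whether the target is broken.
        return LinkTargetResponse(url, LinkStatus.CANNOT_CHECK, f"HTTP {http_status}")
    if http_status >= 400:
        return LinkTargetResponse(url, LinkStatus.BROKEN, f"HTTP {http_status}")
    return LinkTargetResponse(url, LinkStatus.OK)


results = [
    from_http_status("https://example.org/a", 200),
    from_http_status("https://example.org/b", 403),
    from_http_status("https://example.org/c", 404),
]
# Filtering by the new status, as the last bullet of the message describes.
unchecked = [r.url for r in results if r.status is LinkStatus.CANNOT_CHECK]
print(unchecked)  # ['https://example.org/b']
```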
sypets added a commit that referenced this issue Mar 7, 2024
* !!![FEATURE] Better handling of link target results

This is a breaking change! It is advised to look at the
Changelog in the documentation for more information.

The checkLinks functions in all LinktypeInterface classes now
results a LinkTargetResponse object. In this, a status is
stored for the link checking, which makes it easier to handling
other link target status apart from broken.

This effectively makes the following possible:

- show all links in the broken link list, not just the broken links
- better handling of link targets, which can't be checked. This includes
  for example URLs with 401 or 403 HTTP status codes, where it is not
  possible to check the URLs. Previously, these URLs were considered
  broken while in fact we do not know if they are broken or not and
  we have no was to check them. This also includes URLs protected
  by cloudflare. They are now stored not as broken but as "can't be
  checked"
- it is possible to filter in the link list by this new status

Resolves: #296
Resolves: #289
Related: #13