Skip to content

Commit

Permalink
Improve URL checks to reduce false-negatives
Browse files Browse the repository at this point in the history
This commit improves the URL health checking mechanism to reduce false
negatives.

- Treat all 2XX status codes as successful, addressing issues with codes
  like `204`.
- Improve URL matching to exclude URLs within Markdown inline code block
  and support URLs containing parentheses.
- Add `forceHttpGetForUrlPatterns` to customize HTTP method per URL to
  allow verifying URLs behind CDN/WAFs that do not respond to HTTP HEAD.
- Send the Host header for improved handling of webpages behind proxies.
- Improve formatting and context for output messages.
- Fix the defaulting options for redirects and cookie handling.
- Update the user agent pool to modern browsers and platforms.
- Add support for randomizing TLS fingerprint to mimic various clients
  better, improving the effectiveness of checks. However, this is not
  fully supported by Node.js's HTTP client; see nodejs/undici#1983 for
  more details.
- Use `AbortSignal` instead of `AbortController` as more modern and
  simpler way to handle timeouts.
  • Loading branch information
undergroundwires committed Mar 16, 2024
1 parent e721885 commit 5abf8ff
Show file tree
Hide file tree
Showing 18 changed files with 363 additions and 222 deletions.
5 changes: 4 additions & 1 deletion src/infrastructure/Threading/AsyncSleep.ts
@@ -1,7 +1,10 @@
export type SchedulerCallbackType = (...args: unknown[]) => void;
export type SchedulerType = (callback: SchedulerCallbackType, ms: number) => void;

export function sleep(time: number, scheduler: SchedulerType = setTimeout) {
export function sleep(
time: number,
scheduler: SchedulerType = setTimeout,
): Promise<void> {
return new Promise((resolve) => {
scheduler(() => resolve(undefined), time);
});
Expand Down
@@ -1,4 +1,4 @@
import { splitTextIntoLines, indentText } from '../utils/text';
import { indentText, splitTextIntoLines } from '@tests/shared/Text';
import { log, die } from '../utils/log';
import { readAppLogFile } from './app-logs';
import { STDERR_IGNORE_PATTERNS } from './error-ignore-patterns';
Expand Down
@@ -1,7 +1,7 @@
import { filterEmpty } from '@tests/shared/Text';
import { runCommand } from '../../utils/run-command';
import { log, LogLevel } from '../../utils/log';
import { SupportedPlatform, CURRENT_PLATFORM } from '../../utils/platform';
import { filterEmpty } from '../../utils/text';

export async function captureWindowTitles(processId: number) {
if (!processId) { throw new Error('Missing process ID.'); }
Expand Down
@@ -1,3 +1,4 @@
import { indentText } from '@tests/shared/Text';
import { logCurrentArgs, CommandLineFlag, hasCommandLineFlag } from './cli-args';
import { log, die } from './utils/log';
import { ensureNpmProjectDir, npmInstall, npmBuild } from './utils/npm';
Expand All @@ -15,7 +16,6 @@ import {
APP_EXECUTION_DURATION_IN_SECONDS,
SCREENSHOT_PATH,
} from './config';
import { indentText } from './utils/text';
import type { ExtractionResult } from './app/extractors/common/extraction-result';

export async function main(): Promise<void> {
Expand Down
@@ -1,5 +1,6 @@
import { exec, type ExecOptions, type ExecException } from 'node:child_process';
import { indentText } from './text';
import { exec } from 'child_process';
import { indentText } from '@tests/shared/Text';
import type { ExecOptions, ExecException } from 'child_process';

const TIMEOUT_IN_SECONDS = 180;
const MAX_OUTPUT_BUFFER_SIZE = 1024 * 1024; // 1 MB
Expand Down
69 changes: 35 additions & 34 deletions tests/checks/external-urls/StatusChecker/BatchStatusChecker.ts
@@ -1,64 +1,65 @@
import { sleep } from '@/infrastructure/Threading/AsyncSleep';
import { getUrlStatus, type IRequestOptions } from './Requestor';
import { groupUrlsByDomain } from './UrlPerDomainGrouper';
import type { IUrlStatus } from './IUrlStatus';
import { getUrlStatus, type RequestOptions } from './Requestor';
import { groupUrlsByDomain } from './UrlDomainProcessing';
import type { FollowOptions } from './FetchFollow';
import type { UrlStatus } from './UrlStatus';

export async function getUrlStatusesInParallel(
urls: string[],
options?: IBatchRequestOptions,
): Promise<IUrlStatus[]> {
// urls = [ 'https://privacy.sexy' ]; // Here to comment out when testing
options?: BatchRequestOptions,
): Promise<UrlStatus[]> {
// urls = ['https://privacy.sexy']; // Comment out this line to use a hardcoded URL for testing.
const uniqueUrls = Array.from(new Set(urls));
const defaultedOptions = { ...DefaultOptions, ...options };
console.log('Options: ', defaultedOptions);
const results = await request(uniqueUrls, defaultedOptions);
const defaultedDomainOptions: Required<DomainOptions> = {
...DefaultDomainOptions,
...options?.domainOptions,
};
console.log('Batch request options applied:', defaultedDomainOptions);
const results = await request(uniqueUrls, defaultedDomainOptions, options);
return results;
}

export interface IBatchRequestOptions {
domainOptions?: IDomainOptions;
requestOptions?: IRequestOptions;
export interface BatchRequestOptions {
readonly domainOptions?: Partial<DomainOptions>;
readonly requestOptions?: Partial<RequestOptions>;
readonly followOptions?: Partial<FollowOptions>;
}

interface IDomainOptions {
sameDomainParallelize?: boolean;
sameDomainDelayInMs?: number;
interface DomainOptions {
readonly sameDomainParallelize?: boolean;
readonly sameDomainDelayInMs?: number;
}

const DefaultOptions: Required<IBatchRequestOptions> = {
domainOptions: {
sameDomainParallelize: false,
sameDomainDelayInMs: 3 /* sec */ * 1000,
},
requestOptions: {
retryExponentialBaseInMs: 5 /* sec */ * 1000,
requestTimeoutInMs: 60 /* sec */ * 1000,
additionalHeaders: {},
},
const DefaultDomainOptions: Required<DomainOptions> = {
sameDomainParallelize: false,
sameDomainDelayInMs: 3 /* sec */ * 1000,
};

function request(
urls: string[],
options: Required<IBatchRequestOptions>,
): Promise<IUrlStatus[]> {
if (!options.domainOptions.sameDomainParallelize) {
domainOptions: Required<DomainOptions>,
options?: BatchRequestOptions,
): Promise<UrlStatus[]> {
if (!domainOptions.sameDomainParallelize) {
return runOnEachDomainWithDelay(
urls,
(url) => getUrlStatus(url, options.requestOptions),
options.domainOptions.sameDomainDelayInMs,
(url) => getUrlStatus(url, options?.requestOptions, options?.followOptions),
domainOptions.sameDomainDelayInMs,
);
}
return Promise.all(urls.map((url) => getUrlStatus(url, options.requestOptions)));
return Promise.all(
urls.map((url) => getUrlStatus(url, options?.requestOptions, options?.followOptions)),
);
}

async function runOnEachDomainWithDelay(
urls: string[],
action: (url: string) => Promise<IUrlStatus>,
action: (url: string) => Promise<UrlStatus>,
delayInMs: number | undefined,
): Promise<IUrlStatus[]> {
): Promise<UrlStatus[]> {
const grouped = groupUrlsByDomain(urls);
const tasks = grouped.map(async (group) => {
const results = new Array<IUrlStatus>();
const results = new Array<UrlStatus>();
/* eslint-disable no-await-in-loop */
for (const url of group) {
const status = await action(url);
Expand Down
@@ -1,27 +1,33 @@
import { sleep } from '@/infrastructure/Threading/AsyncSleep';
import type { IUrlStatus } from './IUrlStatus';
import { indentText } from '@tests/shared/Text';
import { type UrlStatus, formatUrlStatus } from './UrlStatus';

const DefaultBaseRetryIntervalInMs = 5 /* sec */ * 1000;

export async function retryWithExponentialBackOff(
action: () => Promise<IUrlStatus>,
action: () => Promise<UrlStatus>,
baseRetryIntervalInMs: number = DefaultBaseRetryIntervalInMs,
currentRetry = 1,
): Promise<IUrlStatus> {
): Promise<UrlStatus> {
const maxTries = 3;
const status = await action();
if (shouldRetry(status)) {
if (currentRetry <= maxTries) {
const exponentialBackOffInMs = getRetryTimeoutInMs(currentRetry, baseRetryIntervalInMs);
console.log(`Retrying (${currentRetry}) in ${exponentialBackOffInMs / 1000} seconds`, status);
console.log([
`Attempt ${currentRetry}: Retrying in ${exponentialBackOffInMs / 1000} seconds.`,
'Details:',
indentText(formatUrlStatus(status)),
].join('\n'));
await sleep(exponentialBackOffInMs);
return retryWithExponentialBackOff(action, baseRetryIntervalInMs, currentRetry + 1);
}
console.warn('💀 All retry attempts failed. Final failure to retrieve URL:', indentText(formatUrlStatus(status)));
}
return status;
}

function shouldRetry(status: IUrlStatus) {
function shouldRetry(status: UrlStatus): boolean {
if (status.error) {
return true;
}
Expand All @@ -32,14 +38,14 @@ function shouldRetry(status: IUrlStatus) {
|| status.code === 429; // Too Many Requests
}

function isTransientError(statusCode: number) {
function isTransientError(statusCode: number): boolean {
return statusCode >= 500 && statusCode <= 599;
}

function getRetryTimeoutInMs(
currentRetry: number,
baseRetryIntervalInMs: number = DefaultBaseRetryIntervalInMs,
) {
): number {
const retryRandomFactor = 0.5; // Retry intervals are between 50% and 150%
// of the exponentially increasing base amount
const minRandom = 1 - retryRandomFactor;
Expand Down
39 changes: 23 additions & 16 deletions tests/checks/external-urls/StatusChecker/FetchFollow.ts
@@ -1,19 +1,22 @@
import { indentText } from '@tests/shared/Text';
import { fetchWithTimeout } from './FetchWithTimeout';
import { getDomainFromUrl } from './UrlDomainProcessing';

export function fetchFollow(
url: string,
timeoutInMs: number,
fetchOptions: RequestInit,
followOptions: IFollowOptions | undefined,
fetchOptions?: Partial<RequestInit>,
followOptions?: Partial<FollowOptions>,
): Promise<Response> {
const defaultedFollowOptions = {
const defaultedFollowOptions: Required<FollowOptions> = {
...DefaultFollowOptions,
...followOptions,
};
if (followRedirects(defaultedFollowOptions)) {
console.log(indentText(`Follow options: ${JSON.stringify(defaultedFollowOptions)}`));
if (!followRedirects(defaultedFollowOptions)) {
return fetchWithTimeout(url, timeoutInMs, fetchOptions);
}
fetchOptions = { ...fetchOptions, redirect: 'manual' /* handled manually */ };
fetchOptions = { ...fetchOptions, redirect: 'manual' /* handled manually */, mode: 'cors' };
const cookies = new CookieStorage(defaultedFollowOptions.enableCookies);
return followRecursivelyWithCookies(
url,
Expand All @@ -24,13 +27,13 @@ export function fetchFollow(
);
}

export interface IFollowOptions {
followRedirects?: boolean;
maximumRedirectFollowDepth?: number;
enableCookies?: boolean;
export interface FollowOptions {
readonly followRedirects?: boolean;
readonly maximumRedirectFollowDepth?: number;
readonly enableCookies?: boolean;
}

export const DefaultFollowOptions: Required<IFollowOptions> = {
const DefaultFollowOptions: Required<FollowOptions> = {
followRedirects: true,
maximumRedirectFollowDepth: 20,
enableCookies: true,
Expand Down Expand Up @@ -64,6 +67,10 @@ async function followRecursivelyWithCookies(
if (cookieHeader) {
cookies.addHeader(cookieHeader);
}
options.headers = {
...options.headers,
Host: getDomainFromUrl(nextUrl),
};
return followRecursivelyWithCookies(nextUrl, timeoutInMs, options, newFollowDepth, cookies);
}

Expand All @@ -77,7 +84,7 @@ class CookieStorage {
constructor(private readonly enabled: boolean) {
}

public hasAny() {
public hasAny(): boolean {
return this.enabled && this.cookies.length > 0;
}

Expand All @@ -88,17 +95,17 @@ class CookieStorage {
this.cookies.push(header);
}

public getHeader() {
public getHeader(): string {
return this.cookies.join(' ; ');
}
}

function followRedirects(options: IFollowOptions) {
if (!options.followRedirects) {
function followRedirects(options: FollowOptions): boolean {
if (options.followRedirects !== true) {
return false;
}
if (options.maximumRedirectFollowDepth === 0) {
return false;
if (options.maximumRedirectFollowDepth === undefined || options.maximumRedirectFollowDepth <= 0) {
throw new Error('Invalid followRedirects configuration: maximumRedirectFollowDepth must be a positive integer');
}
return true;
}
Expand Down
12 changes: 6 additions & 6 deletions tests/checks/external-urls/StatusChecker/FetchWithTimeout.ts
Expand Up @@ -2,13 +2,13 @@ export async function fetchWithTimeout(
url: string,
timeoutInMs: number,
init?: RequestInit,
): Promise<Response> {
const controller = new AbortController();
): ReturnType<typeof fetch> {
const options: RequestInit = {
...(init ?? {}),
signal: controller.signal,
signal: AbortSignal.timeout(timeoutInMs),
};
const promise = fetch(url, options);
const timeout = setTimeout(() => controller.abort(), timeoutInMs);
return promise.finally(() => clearTimeout(timeout));
return fetch(
url,
options,
);
}
5 changes: 0 additions & 5 deletions tests/checks/external-urls/StatusChecker/IUrlStatus.ts

This file was deleted.

28 changes: 13 additions & 15 deletions tests/checks/external-urls/StatusChecker/README.md
Expand Up @@ -13,7 +13,10 @@ A CLI and SDK for checking the availability of external URLs.
- 😇 **Rate Limiting**: Queues requests by domain to be polite.
- 🔁 **Retries**: Implements retry pattern with exponential back-off.
-**Timeouts**: Configurable timeout for each request.
- 🎭️ **User-Agent Rotation**: Change user agents for each request.
- 🎭️ **Impersonation**: Impersonate different browsers for each request.
- **🌐 User-Agent Rotation**: Change user agents.
- **🔑 TLS Handshakes**: Perform TLS and HTTP handshakes that are identical to that of a real browser.
- 🫙 **Cookie jar**: Preserve cookies during redirects to mimic real browser.

## CLI

Expand Down Expand Up @@ -54,6 +57,7 @@ const statuses = await getUrlStatusesInParallel([ 'https://privacy.sexy', /* ...
- **`sameDomainDelayInMs`** (*number*), default: `3000` (3 seconds)
- Sets the delay between requests to the same domain.
- `requestOptions` (*object*): See [request options](#request-options).
- `followOptions` (*object*): See [follow options](#follow-options).

### `getUrlStatus`

Expand All @@ -72,7 +76,6 @@ console.log(`Status code: ${status.code}`);
- The longer the base time, the greater the intervals between retries.
- **`additionalHeaders`** (*object*), default: `false`
- Additional HTTP headers to send along with the default headers. Overrides default headers if specified.
- **`followOptions`** (*object*): See [follow options](#follow-options).
- **`requestTimeoutInMs`** (*number*), default: `60000` (60 seconds)
- Time limit to abort the request if no response is received within the specified time frame.

Expand All @@ -83,19 +86,7 @@ Follows `3XX` redirects while preserving cookies.
Same fetch API except third parameter that specifies [follow options](#follow-options), `redirect: 'follow' | 'manual' | 'error'` is discarded in favor of the third parameter.

```js
const status = await fetchFollow('https://privacy.sexy', {
// First argument is same options as fetch API, except `redirect` options
// that's discarded in favor of next argument follow options
headers: {
'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
},
}, {
// Second argument sets the redirect behavior
followRedirects: true,
maximumRedirectFollowDepth: 20,
enableCookies: true,
}
);
const status = await fetchFollow('https://privacy.sexy', 1000 /* timeout in milliseconds */);
console.log(`Status code: ${status.code}`);
```

Expand All @@ -109,3 +100,10 @@ console.log(`Status code: ${status.code}`);
- **`enableCookies`** (*boolean*), default: `true`
- Enables cookie storage to facilitate seamless navigation through login or other authentication challenges.
- 💡 Helps to over-come sign-in challenges with callbacks.
- **`forceHttpGetForUrlPatterns`** (*array*), default: `[]`
- Specifies URL patterns that should always use an HTTP GET request instead of the default HTTP HEAD.
- This is useful for websites that do not respond to HEAD requests, such as those behind certain CDN or web application firewalls.
- Provide patterns as regular expressions (`RegExp`), allowing them to match any part of a URL.
- Examples:
- To match any URL starting with "https://example.com/api": `/^https:\/\/example\.com\/api/`
- To match any domain ending with "cloudflare.com": `/^https:\/\/.*\.cloudflare\.com\//`

0 comments on commit 5abf8ff

Please sign in to comment.