TestAgent and Test (for the same user-agent) give different results in case of a temporary error when fetching the robots.txt file #40

Open
masonlouchart opened this issue Nov 21, 2023 · 0 comments


diff --git a/robotstxt_test.go b/robotstxt_test.go
index 6ccb730..6cbda57 100644
--- a/robotstxt_test.go
+++ b/robotstxt_test.go
@@ -291,3 +291,38 @@ func newHttpResponse(code int, body string) *http.Response {
 		ContentLength: int64(len(body)),
 	}
 }
+
+func TestDisallowAll(t *testing.T) {
+	r, err := FromStatusAndBytes(500, nil) // We got a 500 response => Disallow all
+	require.NoError(t, err)
+
+	a := r.TestAgent("/", "*")
+	assert.False(t, a) // Resource access NOT allowed (EXPECTED)
+
+	b := r.FindGroup("*").Test("/")
+	assert.True(t, b) // Resource access allowed (UNEXPECTED)
+
+	assert.Equal(t, a, b) // Fails: TestAgent and Group.Test disagree for the same agent
+
+	/*
+		This happens because `disallowAll` is checked by `TestAgent` but not by `Test`.
+
+		Because `TestAgent` calls `FindGroup` internally but does not expose the
+		value of `CrawlDelay`, users of this library might prefer to use
+		(`FindGroup` + `Test`) to have access to the `CrawlDelay` value when the
+		path is allowed.
+
+		FindGroup -> Test (ok) -> check CrawlDelay
+
+		Unfortunately, the `Test` method ignores the `disallowAll` member that is set
+		for responses with a status in the range [500; 599]. This behavior is unexpected
+		and can lead to unintentional politeness policy violations.
+
+		The alternative is to resign ourselves to calling `TestAgent` first and then
+		`FindGroup` to get the `CrawlDelay` value.
+
+		TestAgent (ok) -> FindGroup -> check CrawlDelay
+
+		This way, `FindGroup` is called twice.
+		Is there a way to avoid that without risking a politeness policy violation?
+	*/
+}

Run:

go test ./... -run TestDisallowAll