
Change robots.txt to exclude only media proxy URLs #10038

Merged
Gargron merged 3 commits into master from revert-10037-fix-robots-txt on Feb 14, 2019

Conversation
nightpool (Collaborator) commented Feb 13, 2019

Reverts #10037

This change as written prevents archive.org from effectively archiving or displaying archived Mastodon sites.

If the concern is Googlebot, then we should add Googlebot-specific rules instead.
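Bot-specific rules of the kind suggested above would look something like the following. This is a hypothetical sketch of the alternative being discussed, not the change that was merged; `Googlebot` is Google's documented crawler token, and the `/media_proxy/` path is taken from the later commits in this thread.

```
# Hypothetical: restrict only Google's crawler,
# leaving archive.org and everyone else unaffected.
User-agent: Googlebot
Disallow: /media_proxy/

# All other crawlers: no restrictions.
User-agent: *
Disallow:
```

The trade-off Gargron raises below is that rules scoped to one crawler do nothing about Bing, Yandex, and others.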

Gargron (Member) commented Feb 13, 2019

I am not sure many people would be thrilled about being archived by archive.org in the first place. Making this specific to Googlebot is not a good idea, because there are also Bing, Yandex, and others. I guess it would be easier to add a whitelist entry for archive.org instead.

nightpool (Collaborator, Author) commented Feb 13, 2019

@Gargron what is the actual concern that motivated this change? Duplicate content? Additional server traffic?

Gargron (Member) commented Feb 14, 2019

Okay, @nightpool provided compelling arguments about why artists might expect their art to show up in Google Image Search, and why the followers/following pages should be excluded via a noindex meta tag instead. The media_proxy URL is, however, a valid exclusion.
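A noindex exclusion of the kind mentioned above is done with a meta tag in the page's head rather than in robots.txt. A minimal sketch (the exact templates and helpers in Mastodon's codebase may differ) might be:

```html
<!-- Sketch: ask search engines not to index this page
     (e.g. followers/following), while leaving it fetchable,
     unlike a robots.txt Disallow, which blocks fetching entirely. -->
<meta name="robots" content="noindex">
```

The practical difference is that a Disallow'd URL cannot be crawled at all (so archive.org cannot save it), whereas a noindex'd page can still be fetched and archived but is kept out of search results.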

nightpool added some commits Feb 14, 2019

Let's block media_proxy
/media_proxy/ is a dynamic route used for requesting uncached media, so it's
probably bad to let crawlers use it
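Based on the commit messages in this thread, the merged robots.txt plausibly reduces to a single wildcard rule (a sketch; the actual file in the repository may contain additional entries):

```
User-agent: *
Disallow: /media_proxy/
```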
nightpool (Collaborator, Author) commented Feb 14, 2019

(updated)

@Gargron Gargron changed the title Revert "Change robots.txt to exclude some URLs" Change robots.txt to exclude only media proxy URLs Feb 14, 2019

@Gargron Gargron merged commit a5992e5 into master Feb 14, 2019

11 checks passed:

ci/circleci: build
ci/circleci: check-i18n
ci/circleci: install
ci/circleci: install-ruby2.4
ci/circleci: install-ruby2.5
ci/circleci: install-ruby2.6
ci/circleci: test-ruby2.4
ci/circleci: test-ruby2.5
ci/circleci: test-ruby2.6
ci/circleci: test-webui
codeclimate

@Gargron Gargron deleted the revert-10037-fix-robots-txt branch Feb 14, 2019

Gargron added a commit that referenced this pull request Feb 17, 2019

Change robots.txt to exclude only media proxy URLs (#10038)
* Revert "Change robots.txt to exclude some URLs (#10037)"

This reverts commit 80161f4.

* Let's block media_proxy

/media_proxy/ is a dynamic route used for requesting uncached media, so it's
probably bad to let crawlers use it

* misleading comment