
Allow user agents to be customized in robots.txt #2109

Open
dhow opened this issue Apr 22, 2024 · 4 comments

dhow commented Apr 22, 2024

Summary

The ability to read a text file containing robots.txt customizations, so that the customization can be backed up or persisted outside the Docker container.

Use case

I've been editing the module/Core/src/Action/RobotsAction.php file inside the container because I (and possibly many other people with similar needs) would like to allow Facebook's bot[1], so that Shlink links show an article preview when I paste them. But this broke when I switched to stable-roadrunner (great image btw!) because, obviously, I forgot to re-apply my robots.txt customization.

Since this feature would be pretty straightforward (I already know which file outputs the robots.txt content), I was thinking of adding it myself, but I'm not sure whether externalizing part of robots.txt so users can persist it outside the container is a good idea, so I'd like to validate the idea with you before adding this feature.

Thanks for the great work folks btw!

[1] Allowing Facebook's user-agent in robots.txt

User-agent: facebookexternalhit
Disallow: 
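
Just to illustrate the idea above (not a patch; EXTRA_ROBOTS_FILE and the base rules below are made up for this example and are not Shlink's actual code), a minimal sketch of the file-based approach could look like this:

<?php

declare(strict_types=1);

// Hypothetical sketch only: append the contents of a user-provided file to
// the generated robots.txt. EXTRA_ROBOTS_FILE and the base rules are
// assumptions for illustration, not Shlink's actual implementation.
function buildRobotsTxt(): string
{
    // Placeholder base rules.
    $content = "User-agent: *\nDisallow: /\n";

    // Append the customization from a file that lives outside the container,
    // e.g. mounted as a volume, so it survives image upgrades.
    $extraFile = getenv('EXTRA_ROBOTS_FILE') ?: '';
    if ($extraFile !== '' && is_readable($extraFile)) {
        $content .= "\n" . trim((string) file_get_contents($extraFile)) . "\n";
    }

    return $content;
}

echo buildRobotsTxt();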
dhow added the feature label Apr 22, 2024
acelaya commented Apr 22, 2024

A related topic was recently discussed in #2067, and while I would prefer not to expect people to customize robots.txt by providing a file, I agree a certain level of customization should be possible.

I mentioned some of the problems and history of the current implementation in #2067 (reply in thread), and I already put together and merged a feature to allow all short URLs to be crawlable by default, if desired (#2107). That would result in the same thing you mentioned above, but for any crawler, not just Facebook's specifically.

On top of that, the only missing piece would be to let you provide a list of user agents you want to allow, falling back to * if the option is not provided. Something along the lines of ROBOTS_ALLOW_USER_AGENTS=facebookexternalhit,Googlebot.
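
To make that concrete, here is a rough sketch of how such an option could be turned into robots.txt rules. The option name, comma-separated format and * fallback come from the paragraph above; everything else is an assumption for illustration, not the actual implementation:

<?php

declare(strict_types=1);

// Hypothetical sketch of mapping a ROBOTS_ALLOW_USER_AGENTS option to
// robots.txt rules; not Shlink's actual code.
function robotsForAllowedAgents(string $allowedAgents): string
{
    $agents = array_filter(array_map('trim', explode(',', $allowedAgents)));
    if ($agents === []) {
        $agents = ['*']; // Fall back to all user agents when the option is not provided.
    }

    $lines = array_map(fn (string $agent) => 'User-agent: ' . $agent, $agents);
    $lines[] = 'Disallow:'; // An empty Disallow allows everything for the listed agents.

    return implode("\n", $lines) . "\n";
}

// With ROBOTS_ALLOW_USER_AGENTS=facebookexternalhit,Googlebot this prints:
//   User-agent: facebookexternalhit
//   User-agent: Googlebot
//   Disallow:
echo robotsForAllowedAgents(getenv('ROBOTS_ALLOW_USER_AGENTS') ?: '');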

That said, you can already make your short URLs crawlable, with the limitation that it needs to be done one by one, hence the PR above.

dhow commented Apr 23, 2024

Thanks! I'll take a look at #2107 next time!

acelaya added this to the 4.2.0 milestone May 13, 2024
acelaya changed the title from "Persistent robots.txt customization" to "Allow user agents to be customized in robots.txt" May 13, 2024
acelaya commented May 13, 2024

I'm going to repurpose this issue to specifically allow user agents to be customized in robots.txt. That, plus the already existing capabilities around robots.txt, should cover most use cases in a more predictable and reproducible way.

Later on, if there's still some missing capability, I'm open to discussing more improvements and features.

dhow commented May 13, 2024

That's cool, @acelaya!! Thank you!!
