Request to omit all statements on csarven.ca #2

Closed
csarven opened this Issue Jun 14, 2018 · 6 comments

csarven commented Jun 14, 2018

Hi. I'm the current owner of csarven.ca. I would appreciate it if the dataset, map, and anything else could omit all statements pertaining to csarven.ca. If there is any information currently on csarven.ca that the crawler is discovering, please let me know and I can remove it. The same goes for any other place that I might have access to. Thanks.

Owner

snarfed commented Nov 3, 2018

oh wow, hey @csarven. apologies, i totally missed this. absolutely, will do.

snarfed closed this in ce794b9 Nov 3, 2018

snarfed added a commit that referenced this issue Nov 3, 2018

Owner

snarfed commented Nov 3, 2018

done! your site is now gone from the dataset. details.

fwiw, here's what i see in http://csarven.ca/robots.txt right now. i do get that you may want to allow some crawlers but not others, like indie map.

User-agent: *
Disallow: /archives
Disallow: /scripts
Disallow: /url
Disallow: /labs
Disallow: /presentations
Disallow: /webstream
Disallow: /search
Disallow: /statistical-linked-dataspaces-and-analysis-overview
Allow: /labs/indexability
Allow: /archives/articles
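
As a rough illustration of how a crawler could consult these rules before fetching a page, here is a minimal Python sketch using the standard library's urllib.robotparser. The rules are pasted in as a string rather than fetched over the network, and the helper names (build_parser, is_allowed) and the "indie-map" token are assumptions for illustration only. This is not how the Indie Map crawl script works.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, pasted in as a string (not fetched live).
ROBOTS_TXT = """\
User-agent: *
Disallow: /archives
Disallow: /scripts
Disallow: /url
Disallow: /labs
Disallow: /presentations
Disallow: /webstream
Disallow: /search
Disallow: /statistical-linked-dataspaces-and-analysis-overview
Allow: /labs/indexability
Allow: /archives/articles
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # "indie-map" is an assumed crawler name; every rule above is under "*".
    for path in ("/", "/search", "/archives/articles"):
        url = "http://csarven.ca" + path
        print(path, is_allowed(parser, "indie-map", url))
```

Because every rule sits under User-agent: *, the answer is the same for any crawler name. How overlapping Allow and Disallow lines are resolved differs between parsers, so the result for /archives/articles may not match what another crawler would conclude.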

csarven commented Nov 5, 2018

Thanks!

Pardon me if I'm looking at the wrong code, but perhaps it is also worthwhile to update the user-agent value: https://github.com/snarfed/indie-map/blob/master/crawl/wget.sh#L9 ?

Owner

snarfed commented Nov 5, 2018

interesting idea. you mean, set it to Indie Map? i could! technically the user agent still is wget though, right? Indie Map is just the use case? maybe I'm splitting hairs.

csarven commented Nov 5, 2018

I would classify indie-map (the software of this repository) as the user-agent, as opposed to a particular library that's doing the fetching. Just as Firefox uses its own library to negotiate resources. But yes, generally it can be arbitrary.

So, you can do something like User-Agent: indie-map or, if you're feeling adventurous, use User-Agent: https://github.com/snarfed/indie-map/ or some other HTTP URI that can provide a structured description of the application, e.g. view-source https://dokie.li/
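
To make that concrete, here is a minimal Python sketch of why a distinct user-agent token helps: a site owner can then write a robots.txt group that targets only that crawler. The robots.txt text below is hypothetical (it is not what csarven.ca serves), the helper names are made up for illustration, and the crawl script itself uses wget rather than Python.

```python
import urllib.request
from urllib.robotparser import RobotFileParser

# Assumed token, taken from the suggestion above; not the crawler's actual value.
USER_AGENT = "indie-map"

# Hypothetical robots.txt: block only indie-map, leave other crawlers
# subject to the general rules.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: indie-map
Disallow: /

User-agent: *
Disallow: /search
"""


def make_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Build (but do not send) a request that identifies itself as user_agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check url against robots_txt on behalf of user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/"
    request = make_request(page)
    print(request.get_header("User-agent"))
    # The same page, checked under two different crawler names.
    print(may_crawl(HYPOTHETICAL_ROBOTS_TXT, "indie-map", page))
    print(may_crawl(HYPOTHETICAL_ROBOTS_TXT, "wget", page))
```

With a generic token like wget, the site owner has no way to single out this crawl without also blocking every other wget user. In wget itself, the equivalent setting is the --user-agent option.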

Owner

snarfed commented Nov 5, 2018

also fwiw, i've only actually done this whole crawl once, and i have no plans to do it again right now, regularly or otherwise.

still, good idea. thanks for the nudge.
