User's preference to exclude user page from being indexed by search engines #1599

Closed
symac opened this Issue Apr 12, 2017 · 4 comments

symac commented Apr 12, 2017

Hello,
I have noticed that most instances keep the default robots.txt file, which allows any page to be indexed by search engines. As privacy is a concern for many users in the fediverse, I believe it would be nice to add a preference in the user settings that excludes your own user page from search engines by adding it to robots.txt.


  • I searched or browsed the repo’s other issues to ensure this is not a duplicate.
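
A minimal sketch of how such a preference could feed robots.txt, in Python for illustration only (Mastodon itself is not written in Python). The function name `build_robots_txt`, the `profile_prefix` parameter, and the `/@username` profile path are assumptions made for this example, not taken from the project:

```python
from typing import Iterable


def build_robots_txt(opted_out: Iterable[str], profile_prefix: str = "/@") -> str:
    """Build robots.txt text that disallows the profile pages of opted-out users.

    `profile_prefix` is an assumed URL layout; adjust it to the instance's routes.
    """
    lines = ["User-agent: *"]
    for username in sorted(set(opted_out)):
        # Rules are prefix matches, so this also covers pages below the
        # profile and any username that starts with the same characters.
        lines.append(f"Disallow: {profile_prefix}{username}")
    if len(lines) == 1:
        # Nobody opted out: an empty Disallow permits everything.
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(["bob", "alice"]), end="")
```

Because the file would change whenever a user toggles the preference, an instance would probably have to generate it on request rather than serve a static file.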

Reventl0v commented Apr 22, 2017

In addition to a preference in the user settings, I'd love to see Mastodon actively ask the question at account creation. For example, at the first login, a window could ask the user to choose between:

  • Denying bots permission to index the toots and the bio;
  • Authorizing bots to index the toots and the bio.

Why? Because the bio and toots may contain "personal data", and I believe that in some countries, indexing this kind of data legally requires a deliberate choice made by the user[1]. This choice can't be a "default choice"[2].

[1]: (Translated from French)

A distinction must nevertheless be drawn between personal data displayed on these sites (for example, data taken from social networks), which is subject to a specific authorization regime, and other data. If the data is indexed, then in accordance with the LCEN law of 1978, it is always necessary to obtain the authorization of the person concerned for this use (for example via an opt-in tool when signing up on a site)

https://www.legavox.fr/blog/maitre-matthieu-pacaud/extraction-indexation-donnees-crawlers-internet-22421.htm

[2]: (Also translated from French)

For Pages Jaunes, this collection was legal, since it was based on public information freely accessible on the Internet. Wrong, says the CNIL, which stresses in its decision that "although the persons concerned signed up to social networks of their own free will, it does not follow from this voluntary step that all of these persons also accepted, systematically and in full awareness, that their community information would be retrieved by third parties to be aggregated with their directory data and published on the network".

http://www.lemonde.fr/technologies/article/2011/09/23/la-cnil-adresse-un-avertissement-severe-aux-pages-jaunes_1576684_651865.html

Reventl0v commented Apr 23, 2017

Also, the bio and the names are currently replicated across the Mastodon network on the "following"/"follower" pages of people following or being followed by the members. If one instance blocks indexing bots from crawling its website, using robots.txt for example, that does not prevent the name and the bio of a member from being indexed through other Mastodon instances. How should this be managed? By creating "meta" robots.txt rules for each following and follower page? By disabling indexing of the follower and following pages by default, with an opt-in?

ghost commented May 1, 2017

I agree with the OP that a lot of people chose Mastodon because of privacy concerns. The default setting should be to disallow crawling of all but the /about and /about/more pages.

In addition, according to https://en.wikipedia.org/wiki/Robots_exclusion_standard#Meta_tags_and_headers, crawlers might still be able to see user, toot, tag, and other pages if they are linked to from elsewhere.
I propose that, as an extra protection, the <meta name="robots" content="noindex" /> tag be added by default to all pages except /about and /about/more.
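
The rule proposed above could be sketched as follows, in Python for illustration only. The helper `robots_meta_tag` and the constant `INDEXABLE_PATHS` are hypothetical names for this example; only the two paths and the tag itself come from the proposal:

```python
# Pages that stay indexable under the proposal; everything else gets noindex.
INDEXABLE_PATHS = {"/about", "/about/more"}

NOINDEX_TAG = '<meta name="robots" content="noindex" />'


def robots_meta_tag(path: str) -> str:
    """Return the tag to emit in <head> for `path`, or "" if the page may be indexed."""
    # Treat "/about/" and "/about" as the same page; keep "/" as "/".
    normalized = path.rstrip("/") or "/"
    if normalized in INDEXABLE_PATHS:
        return ""
    return NOINDEX_TAG


if __name__ == "__main__":
    for p in ("/about", "/about/more/", "/@alice", "/"):
        print(p, "->", robots_meta_tag(p) or "(indexable)")
```

A crawler can only read this tag if it is allowed to fetch the page, so a page that is also disallowed in robots.txt would never show the tag to a compliant crawler.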

symac commented May 4, 2017

I am not sure that preventing everything from being indexed by default is the solution. Even if privacy is a concern for many users, I am pretty sure that we also have many (more?) users who want a network that is visible and can welcome more and more members. That's why my initial message suggested a method that lets users who are concerned about their Mastodon activity being too visible opt in to being hidden.
The other risk with defaulting to noindex is that some could understand it as: "great, I can safely post on Mastodon without my content being searchable", whereas we know that is untrue: robots.txt and all these directives are just suggestions, and nothing prevents someone with bad intentions from ignoring them.
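
To illustrate why these directives are only suggestions, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `example.social` host, and the `/users/alice` path are made up for the example. The check runs entirely on the crawler's side, so a crawler that never calls it can fetch the page anyway:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an instance where one user opted out.
ROBOTS_TXT = """\
User-agent: *
Disallow: /users/alice
"""


def polite_crawler_may_fetch(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return what a well-behaved crawler would conclude about `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("https://example.social/users/alice", "https://example.social/about"):
        print(url, polite_crawler_may_fetch(ROBOTS_TXT, url))
```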
