Perhaps #112

Open
hrbrmstr opened this Issue May 7, 2018 · 5 comments

hrbrmstr commented May 7, 2018

At least — somewhere prominent — inform potential users that this is a violation of Twitter's ToS and also robots.txt (which has been held as a valid technical control in U.S. Federal circuit court).

I'd also be careful since development of this and encouragement of its use could be construed as a CFAA violation.

But putting folks into harm's way without telling them is not cool.

ashgreat commented May 17, 2018

If a significant share of this tool's users are from the US, then it makes sense to make the legal implications explicit.

Author

hrbrmstr commented May 17, 2018

That's pretty myopic and incorrect. The US is far from the only enforcer of these statutes.

I'm also saddened at your lack of acknowledgment of the ethical issues associated with this package, but I'm unfortunately getting all too used to the data science community putting personal, perceived utility over ethics these days.

ashgreat commented May 17, 2018

You don't explain how what I wrote is myopic and incorrect. Here is a Wikipedia page (https://en.wikipedia.org/wiki/Web_scraping#United_States) which shows that web scraping law in the US is still evolving. Furthermore, this code scrapes Twitter without logging in, which, as the current court ruling stands, is not breaking any laws (https://regmedia.co.uk/2017/08/14/hiqlinkedintro.pdf). Yet, as US laws are evolving, I think that it's better to warn users about this issue. If you can show me unequivocal legal frameworks in other countries that prohibit scraping websites without logging in, I will support giving a warning to people from those countries too.

From the second part of your comment it seems that you believe that your ethical code should be the universally binding set of ethics for data scientists and computer scientists (btw, IEEE strongly opposed LinkedIn's position in the above lawsuit). It's possible that people have differing sets of ethics and not everyone likes what others believe in. I have been following you on Twitter for quite some time and I greatly appreciate your work in the R community. But imposing your views on others aggressively and berating them in public is not what I was expecting from you.

Owner

taspinar commented May 20, 2018

Hi Bob,
Thank you for raising these concerns. This has been discussed before in issue #60, but because the legality of scraping websites in general, and twitter.com in particular, is not that clear-cut and strongly depends on the country the users are living in, I have not done anything about it. Furthermore, as far as I know, most of the users are scraping Twitter for academic and/or research purposes, without any commercial incentives.

As was also pointed out in issue #60, although this package violates Twitter's TOS, scraping a public website does not require people to accept the TOS. Furthermore, robots.txt allows for scraping of /search?q=%23, which is what twitterscraper does. The only clear violation I have seen is that the time delay of 1 sec between successive requests is not respected.
When I have time, I'll try to implement an option, which people can turn on with a command-line argument, to add a time delay of 1 sec between successive requests. Users for whom scraping time is not important could turn that option on.
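
For illustration, the 1-second spacing between successive requests described above could be sketched roughly as follows. This is only a minimal sketch, not twitterscraper's code: the `RequestThrottle` class, its `wait()` method, and the injectable `clock` and `sleep` parameters are hypothetical names chosen for this example.

```python
import time


class RequestThrottle:
    """Keep successive calls to wait() at least min_interval seconds apart.

    Hypothetical helper for illustration; not part of twitterscraper.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested without real waiting
        self._sleep = sleep
        self._last = None        # time of the previous request; None before the first one

    def wait(self):
        """Sleep only as long as needed and return the number of seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

A scraper loop would call `throttle.wait()` immediately before each HTTP request, so the first request goes out at once and later ones are spaced by at least `min_interval` seconds.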

In the meantime, feel free to send in a PR with an added section to the README file where this issue is explained.

maelle commented Aug 13, 2018

> Users for whom scraping time is not important could turn that option on.

Just wondering why respecting the delay would not be the default? 🤔
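
One way a default-on delay might look at the command line is sketched below. The flag names `--delay` and `--no-delay` and the `build_parser` function are assumptions made for this example, not options that twitterscraper is known to provide.

```python
import argparse


def build_parser():
    """Build a hypothetical CLI parser in which the request delay is on by default."""
    parser = argparse.ArgumentParser(description="Hypothetical scraper CLI (illustration only)")
    parser.add_argument(
        "--delay",
        type=float,
        default=1.0,  # respect the 1-second spacing unless the user says otherwise
        help="seconds to wait between successive requests (default: 1.0)",
    )
    parser.add_argument(
        "--no-delay",
        dest="delay",
        action="store_const",
        const=0.0,  # explicit opt-out: writes 0.0 into the same 'delay' attribute
        help="disable the delay between successive requests",
    )
    return parser
```

With this layout, running with no arguments keeps the delay, and a user has to pass `--no-delay` (or `--delay 0`) deliberately to scrape faster.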
