
functional design #1

Open
o7n opened this issue Jun 5, 2020 · 5 comments

Comments

@o7n
Collaborator

o7n commented Jun 5, 2020

I would like to start off by proposing (and agreeing on) a functional design for the new tool. I have already written something up and will post it in this issue over the next day or so.

I think the combination of a package and a command line tool is something we absolutely want to keep.

  • Propose to remove databases, Elasticsearch, and translations. Those can be handled by different tools.
  • Do we really need all the async stuff, especially if we won't be doing database access? I think the code would be much easier to maintain if we use Requests (see the sketch after this list).
  • Python3 only
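
To make the Requests point concrete, here is a minimal sketch of what the synchronous core could look like. The endpoint URL, parameter names, and the fetch_page helper are placeholders made up for illustration, not the real Twitter internals:

```python
from typing import Optional

import requests

def fetch_page(query: str, cursor: Optional[str] = None) -> dict:
    """Fetch one page of search results with a plain synchronous request.

    The URL and parameter names below are placeholders; the real ones
    would come from the scraping logic we port over from Twint.
    """
    params = {"q": query}
    if cursor:
        params["cursor"] = cursor
    resp = requests.get("https://example.invalid/search", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()
```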

Below is my braindump. Please respond in the comments with what you think!

Output

Output will be written to stdout (default) or file.

Command line arguments:

-o <filename> Output to a file instead of stdout
--csv Write output in CSV format
--json Write output in JSON Lines format (one independent JSON object per line)
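
As a rough illustration of the output side, a sketch of a writer that handles both formats and the stdout/file switch (the tweet field names are made up for the example; the real schema is still open):

```python
import csv
import json
import sys

def write_output(tweets, fmt="json", filename=None):
    """Write tweets to stdout (default) or a file, as CSV or JSON Lines."""
    out = open(filename, "w", newline="", encoding="utf-8") if filename else sys.stdout
    try:
        if fmt == "csv":
            # Field names are illustrative only.
            writer = csv.DictWriter(out, fieldnames=["id", "user", "text"])
            writer.writeheader()
            for tweet in tweets:
                writer.writerow(tweet)
        else:
            # JSON Lines: one independent JSON object per line.
            for tweet in tweets:
                out.write(json.dumps(tweet) + "\n")
    finally:
        if filename:
            out.close()
```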

Error messages and debugging

Errors and informational messages will be output to stderr (default) or file.

Command line arguments:

-v enable verbose output (loglevel info)
-vv enable debug logging (loglevel debug)
-q disable error output completely (loglevel none)
-l <filename> logfile instead of stderr

--count Display number of Tweets scraped at the end of session.
--stats Show number of replies, retweets, and likes.
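
A sketch of how these flags could map onto Python's standard logging module, assuming -v/-vv/-q/-l behave as proposed above:

```python
import logging
import sys

def configure_logging(verbosity=0, quiet=False, logfile=None):
    """Map the proposed CLI flags onto logging levels.

    -v -> INFO, -vv -> DEBUG, -q -> nothing at all,
    -l <filename> -> log to a file instead of stderr.
    """
    if quiet:
        level = logging.CRITICAL + 1  # effectively "loglevel none"
    elif verbosity >= 2:
        level = logging.DEBUG
    elif verbosity == 1:
        level = logging.INFO
    else:
        level = logging.WARNING  # default: errors and warnings only
    handler = logging.FileHandler(logfile) if logfile else logging.StreamHandler(sys.stderr)
    logging.basicConfig(level=level, handlers=[handler],
                        format="%(asctime)s %(levelname)s %(message)s")
```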

Cloaking and rate limiting options

The tool might need to be able to circumvent most measures taken by Twitter.

  • configurable user agent (not for now)
  • proxy support
  • rate limit handling

Command line arguments:

-ua <user agent>
-uafile <filename> (with ua strings, one per line, tool will rotate through them)

-proxy <proxyurl>
-proxyfile <filename> (with proxy URLs, one per line, tool will rotate through them)

TBD: rate limit handling, for instance the backoff exponent and min/max/random wait times.
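
To show one possible shape for this, a sketch combining user agent/proxy rotation with exponential backoff and random jitter. The parameter names and the retry policy are just assumptions, since the actual rate limit behaviour is still TBD:

```python
import itertools
import random
import time

import requests

def rotation(user_agents, proxies):
    """Yield (user agent, proxy) pairs, cycling through both lists."""
    ua_cycle = itertools.cycle(user_agents)
    proxy_cycle = itertools.cycle(proxies)
    while True:
        yield next(ua_cycle), next(proxy_cycle)

def get_with_backoff(url, pairs, max_retries=5, base_wait=2.0):
    """Retry on HTTP 429 with exponentially growing, jittered waits."""
    for attempt in range(max_retries):
        ua, proxy = next(pairs)
        resp = requests.get(url, headers={"User-Agent": ua},
                            proxies={"https": proxy}, timeout=30)
        if resp.status_code != 429:
            return resp
        time.sleep(base_wait ** attempt + random.uniform(0, 1))
    raise RuntimeError("still rate limited after %d retries" % max_retries)
```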

Searching and filtering

I consolidated all command line args that have to do with searching and filtering. I think we need to keep the search params (i.e. those that send a different request to Twitter) and remove the filters (i.e. those that remove things that are in the output). Filtering can be done by an external program.

Can someone with more internal knowledge split these args into those two groups?

--to USERNAME Search Tweets to a user.
--all USERNAME Search all Tweets associated with a user.
--favorites Scrape Tweets a user has liked.

-nr, --native-retweets
Filter the results for retweets only.
--min-likes MIN_LIKES
Filter the tweets by minimum number of likes.
--min-retweets MIN_RETWEETS
Filter the tweets by minimum number of retweets.
--min-replies MIN_REPLIES
Filter the tweets by minimum number of replies.
--links LINKS Include or exclude tweets containing one or more links.
If not specified, you will get tweets both with and
without links.
--source SOURCE Filter the tweets for specific source client.
--members-list MEMBERS_LIST
Filter the tweets sent by users in a given list.
-fr, --filter-retweets
Exclude retweets from the results.
--videos Display only Tweets with videos.
--images Display only Tweets with images.
--media Display Tweets with only images or videos.
--retweets Include user's Retweets (Warning: limited).

--email Filter Tweets that might have email addresses
--phone Filter Tweets that might have phone numbers
--verified Display Tweets only from verified users (Use with -s).
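
As a strawman for that split, one possible (unverified) grouping, plus a sketch of how the local filters could run after scraping. Which args belong in which bucket is exactly the open question above, and the tweet fields are illustrative:

```python
# Guess at the split; corrections from someone who knows the internals welcome.
SEARCH_ARGS = {"to", "all", "favorites", "min_likes", "min_retweets",
               "min_replies", "source", "members_list", "verified"}
FILTER_ARGS = {"native_retweets", "filter_retweets", "links",
               "videos", "images", "media", "email", "phone"}

def apply_filters(tweets, args):
    """Drop tweets that fail any enabled local filter.

    args is e.g. an argparse.Namespace; only two example filters shown.
    """
    for tweet in tweets:
        if args.videos and not tweet.get("has_video"):
            continue
        if args.filter_retweets and tweet.get("is_retweet"):
            continue
        yield tweet
```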

@elie-h
Collaborator

elie-h commented Jun 5, 2020

My thoughts below:

1) The package & CLI options are a good way of exposing Twint's APIs
Agreed - the combination is enough to allow the majority of use cases
2) Propose to remove databases, ES, translations
Agreed - we should keep this package lean and purpose specific
3) Do we really need all the async stuff?
Synchronous requests should suffice
4) Python3 only
Definitely :)

Output should be a file or stdout
The module should expose a generator interface which can iterate between requests - this will allow "streaming" of results
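
Something like this, assuming the hypothetical fetch_page helper from the sketch earlier in this issue is in scope:

```python
from typing import Dict, Iterator

def search(query: str) -> Iterator[Dict]:
    """Yield tweets one at a time, fetching the next page lazily.

    fetch_page(query, cursor) is the placeholder synchronous helper
    sketched earlier in this thread. Callers can stop iterating at
    any point and no further requests will be made.
    """
    cursor = None
    while True:
        page = fetch_page(query, cursor)
        for tweet in page.get("tweets", []):
            yield tweet
        cursor = page.get("next_cursor")
        if not cursor:
            break
```

A caller could then just do for tweet in search("some query") and consume results as they stream in.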

The CLI options all make sense

I will set up the project scaffold this weekend to get us started.

@o7n
Collaborator Author

o7n commented Jun 5, 2020

The module should expose a generator interface which can iterate between requests - this will allow "streaming" of results

Yeah, I second that; that would be really neat.

@pielco11
Member

pielco11 commented Jun 7, 2020

Totally agree 🎉

@elie-h
Collaborator

elie-h commented Jun 8, 2020

Scaffolding done in a separate branch - let me know what you guys think of the tooling choices

How much is portable from the current Twint package? I would assume the scraper can be moved across

@o7n
Collaborator Author

o7n commented Jun 9, 2020

I think you forgot to push the branch.
We can port all the logic concerning which URLs to use and which HTML elements to look at. But since we're going to use (synchronous) Requests, there is no point in porting the entire scraper, I think.

During tests I did this weekend, I did not find any reason to do things like rotate user agents. So we should keep the code very simple, only adding bells and whistles when we really need them.
