
functional design #1

Open
o7n opened this issue Jun 5, 2020 · 5 comments

Comments

@o7n
Collaborator

o7n commented Jun 5, 2020

I would like to start off by proposing (and agreeing on) a functional design for the new tool. I have already written something up and will post it in this issue over the next day or so.

I think the combination of a package and a command line tool is something we absolutely want to keep.

  • Propose to remove databases, Elasticsearch, and translations. Those can be handled by different tools.
  • Do we really need all the async stuff, especially if we won't be doing database access? I think the code would be much easier to maintain if we use Requests (see the sketch after this list).
  • Python3 only
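
To make the Requests point concrete, here is a minimal sketch of what the synchronous core could look like. The endpoint URL, parameter names, and the fetch_page helper are placeholders made up for illustration, not the real Twitter internals:

```python
from typing import Optional

import requests

def fetch_page(query: str, cursor: Optional[str] = None) -> dict:
    """Fetch one page of search results with a plain synchronous request.

    The URL and parameter names below are placeholders; the real ones
    would come from the scraping logic we port over from Twint.
    """
    params = {"q": query}
    if cursor:
        params["cursor"] = cursor
    resp = requests.get("https://example.invalid/search", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()
```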

Below is my braindump. Please respond in the comments with what you think!

Output

Output will be written to stdout (default) or file.

Command line arguments:

-o <filename> Output to a file instead of stdout
--csv Write output in CSV format
--json Write output in JSON Lines format (one independent JSON object per line)
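
As a rough illustration of the output side, a sketch of a writer that handles both formats and the stdout/file switch (the tweet field names are made up for the example; the real schema is still open):

```python
import csv
import json
import sys

def write_output(tweets, fmt="json", filename=None):
    """Write tweets to stdout (default) or a file, as CSV or JSON Lines."""
    out = open(filename, "w", newline="", encoding="utf-8") if filename else sys.stdout
    try:
        if fmt == "csv":
            # Field names are illustrative only.
            writer = csv.DictWriter(out, fieldnames=["id", "user", "text"])
            writer.writeheader()
            for tweet in tweets:
                writer.writerow(tweet)
        else:
            # JSON Lines: one independent JSON object per line.
            for tweet in tweets:
                out.write(json.dumps(tweet) + "\n")
    finally:
        if filename:
            out.close()
```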

Error messages and debugging

Errors and informational messages will be output to stderr (default) or file.

Command line arguments:

-v enable verbose output (loglevel info)
-vv enable debug logging (loglevel debug)
-q disable error output completely (loglevel none)
-l <filename> logfile instead of stderr

--count Display number of Tweets scraped at the end of session.
--stats Show number of replies, retweets, and likes.
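
A sketch of how these flags could map onto Python's standard logging module, assuming -v/-vv/-q/-l behave as proposed above:

```python
import logging
import sys

def configure_logging(verbosity=0, quiet=False, logfile=None):
    """Map the proposed CLI flags onto logging levels.

    -v -> INFO, -vv -> DEBUG, -q -> nothing at all,
    -l <filename> -> log to a file instead of stderr.
    """
    if quiet:
        level = logging.CRITICAL + 1  # effectively "loglevel none"
    elif verbosity >= 2:
        level = logging.DEBUG
    elif verbosity == 1:
        level = logging.INFO
    else:
        level = logging.WARNING  # default: errors and warnings only
    handler = logging.FileHandler(logfile) if logfile else logging.StreamHandler(sys.stderr)
    logging.basicConfig(level=level, handlers=[handler],
                        format="%(asctime)s %(levelname)s %(message)s")
```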

Cloaking and rate limiting options

The tool might need to be able to circumvent most measures taken by Twitter.

  • configurable user agent (not for now)
  • proxy support
  • rate limit handling

Command line arguments:

-ua <user agent>
-uafile <filename> (with ua strings, one per line, tool will rotate through them)

-proxy <proxyurl>
-proxyfile <filename> (with proxy URLs, one per line, tool will rotate through them)

TBD: rate limit handling, for instance the backoff exponent and min/max/random wait times.
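
To show one possible shape for this, a sketch combining user agent/proxy rotation with exponential backoff and random jitter. The parameter names and the retry policy are just assumptions, since the actual rate limit behaviour is still TBD:

```python
import itertools
import random
import time

import requests

def rotation(user_agents, proxies):
    """Yield (user agent, proxy) pairs, cycling through both lists."""
    ua_cycle = itertools.cycle(user_agents)
    proxy_cycle = itertools.cycle(proxies)
    while True:
        yield next(ua_cycle), next(proxy_cycle)

def get_with_backoff(url, pairs, max_retries=5, base_wait=2.0):
    """Retry on HTTP 429 with exponentially growing, jittered waits."""
    for attempt in range(max_retries):
        ua, proxy = next(pairs)
        resp = requests.get(url, headers={"User-Agent": ua},
                            proxies={"https": proxy}, timeout=30)
        if resp.status_code != 429:
            return resp
        time.sleep(base_wait ** attempt + random.uniform(0, 1))
    raise RuntimeError("still rate limited after %d retries" % max_retries)
```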

Searching and filtering

I consolidated all command line args that have to do with searching and filtering. I think we need to keep the search params (i.e. those that send a different request to Twitter) and remove the filters (i.e. those that remove things that are in the output). Filtering can be done by an external program.

Can someone with more internal knowledge split these args into those two groups?

--to USERNAME Search Tweets to a user.
--all USERNAME Search all Tweets associated with a user.
--favorites Scrape Tweets a user has liked.

-nr, --native-retweets
Filter the results for retweets only.
--min-likes MIN_LIKES
Filter the tweets by minimum number of likes.
--min-retweets MIN_RETWEETS
Filter the tweets by minimum number of retweets.
--min-replies MIN_REPLIES
Filter the tweets by minimum number of replies.
--links LINKS Include or exclude tweets containing one or more links.
If not specified, you will get tweets both with and
without links.
--source SOURCE Filter the tweets for specific source client.
--members-list MEMBERS_LIST
Filter the tweets sent by users in a given list.
-fr, --filter-retweets
Exclude retweets from the results.
--videos Display only Tweets with videos.
--images Display only Tweets with images.
--media Display Tweets with only images or videos.
--retweets Include user's Retweets (Warning: limited).

--email Filter Tweets that might have email addresses
--phone Filter Tweets that might have phone numbers
--verified Display Tweets only from verified users (Use with -s).
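
As a strawman for that split, one possible (unverified) grouping, plus a sketch of how the local filters could run after scraping. Which args belong in which bucket is exactly the open question above, and the tweet fields are illustrative:

```python
# Guess at the split; corrections from someone who knows the internals welcome.
SEARCH_ARGS = {"to", "all", "favorites", "min_likes", "min_retweets",
               "min_replies", "source", "members_list", "verified"}
FILTER_ARGS = {"native_retweets", "filter_retweets", "links",
               "videos", "images", "media", "email", "phone"}

def apply_filters(tweets, args):
    """Drop tweets that fail any enabled local filter.

    args is e.g. an argparse.Namespace; only two example filters shown.
    """
    for tweet in tweets:
        if args.videos and not tweet.get("has_video"):
            continue
        if args.filter_retweets and tweet.get("is_retweet"):
            continue
        yield tweet
```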

@elie-h
Collaborator

elie-h commented Jun 5, 2020

My thoughts below:

1) The package & CLI options are a good way of exposing Twint's APIs
Agreed - the combination is enough to allow the majority of use cases
2) Propose to remove databases, ES, translations
Agreed - we should keep this package lean and purpose specific
3) Do we really need all the async stuff?
Synchronous requests should suffice
4) Python3 only
Definitely :)

Output should be a file or stdout
The module should expose a generator interface which can iterate between requests - this will allow "streaming" of results
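
Something like this, assuming the hypothetical fetch_page helper from the sketch earlier in this issue is in scope:

```python
from typing import Dict, Iterator

def search(query: str) -> Iterator[Dict]:
    """Yield tweets one at a time, fetching the next page lazily.

    fetch_page(query, cursor) is the placeholder synchronous helper
    sketched earlier in this thread. Callers can stop iterating at
    any point and no further requests will be made.
    """
    cursor = None
    while True:
        page = fetch_page(query, cursor)
        for tweet in page.get("tweets", []):
            yield tweet
        cursor = page.get("next_cursor")
        if not cursor:
            break
```

A caller could then just do for tweet in search("some query") and consume results as they stream in.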

The CLI options all make sense

I will set up the project scaffold this weekend to get us started.

@o7n
Collaborator Author

o7n commented Jun 5, 2020

The module should expose a generator interface which can iterate between requests - this will allow "streaming" of results

Yeah, I second that; that would be really neat.

@pielco11
Member

pielco11 commented Jun 7, 2020

Totally agree 🎉

@elie-h
Collaborator

elie-h commented Jun 8, 2020

Scaffolding done in a separate branch - let me know what you guys think of the tooling choices

How much is portable from the current Twint package? I would assume the scraper can be moved across

@o7n
Collaborator Author

o7n commented Jun 9, 2020

I think you forgot to push the branch.
We can port all the logic concerning which URLs to use and which HTML elements to look at. But since we're going to use (synchronous) Requests, there is no point in porting the entire scraper, I think.

During tests I did this weekend, I did not find any reason to do things like rotate user agents. So we should keep the code very simple, only adding bells and whistles when we really need them.
