
Use seconds in date queries, add json dump #258

Open · wants to merge 1 commit into master
Conversation

TrueCarry

Hello guys. Sorry, I'm new to Python, so some of this may be done badly. I've tried to follow the original code as closely as I can.

Twitter supports date queries using since_time and until_time, so I've updated the package a little to use them. Now you can run a query like twitterscraper "love OR day" --limit 1000 -bd "2020-02-01 21:40:50" -ed "2020-02-01 21:40:55".

I've also added a -dj or --dump-json flag to dump the output directly as JSON, so I can use this package as a CLI utility from another program.

@taspinar
Owner

Hi @TrueCarry
Thank you for your contribution to twitterscraper.
I think most people scrape tweets over longer date ranges and hence have no need to specify the date range to the second. But it could be useful for someone who is interested in a very specific time range.

However, I would suggest that you do not change the already existing structure but make an addition (so it remains backward compatible). People should still be able to specify dates in the "%Y-%m-%d" format, but should also be able to specify datetimes in the "%Y-%m-%d %H:%M:%S" format.

So the valid_date function needs to be able to return values for both formats.
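A minimal sketch of such a dual-format valid_date (assuming it should return a datetime and raise on invalid input; the actual twitterscraper implementation may differ in its error handling):

```python
from datetime import datetime

def valid_date(s):
    """Accept both "%Y-%m-%d" and "%Y-%m-%d %H:%M:%S" strings,
    keeping the old date-only format backward compatible."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError("Not a valid date: {!r}".format(s))
```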

It is good that you still check in query.py whether the end datetime is later than the begin datetime with if(no_secs < 0):, but the rest of query.py should use no_days instead of no_secs:

  • You should definitely not use the number of seconds as the poolsize (number of parallel processes). Use the number of days as the poolsize, so the poolsize is set to 1 if you are scraping from "2019-04-17 18:00:00" to "2019-04-17 19:12:14".
  • Do NOT make a separate query for each second: dateranges = [begindate + dt.timedelta(seconds=elem) for elem in linspace(0, no_secs, poolsize+1)]
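A sketch of what the day-based split could look like (illustrative names, not the actual query.py; assumes begindate/enddate are datetime objects):

```python
import datetime as dt

def make_dateranges(begindate, enddate):
    """Split [begindate, enddate] into day-sized chunks, one per pool
    worker, rather than one chunk per second."""
    no_days = (enddate - begindate).days
    poolsize = max(no_days, 1)   # at least one worker for sub-day ranges
    edges = [begindate + dt.timedelta(days=elem)
             for elem in range(poolsize + 1)]
    edges[-1] = enddate          # the last edge is exactly the end datetime
    return poolsize, edges
```

With this, scraping from "2019-04-17 18:00:00" to "2019-04-17 19:12:14" yields a poolsize of 1 and a single daterange covering the whole interval.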

If the entered begindate/enddate have the "%Y-%m-%d" format, use the old method for building up queries, '{} since:{} until:{}'.format(query, since, until), and if they have the "%Y-%m-%d %H:%M:%S" format, use your method.
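That branching could be sketched as follows (assuming since_time:/until_time: take Unix timestamps interpreted as UTC, and that both dates are entered in the same format; not the actual query.py code):

```python
import calendar
from datetime import datetime

def build_query(query, since, until):
    """Use the old since:/until: operators for plain dates and the
    second-resolution since_time:/until_time: operators for datetimes."""
    try:
        datetime.strptime(since, "%Y-%m-%d")
        datetime.strptime(until, "%Y-%m-%d")
        # Both are plain dates: keep the old, backward-compatible query
        return '{} since:{} until:{}'.format(query, since, until)
    except ValueError:
        # Datetimes: convert to Unix timestamps (UTC assumed)
        def to_ts(s):
            return calendar.timegm(
                datetime.strptime(s, "%Y-%m-%d %H:%M:%S").timetuple())
        return '{} since_time:{} until_time:{}'.format(
            query, to_ts(since), to_ts(until))
```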
