
Only getting a small amount of data before midnight #126

Closed
shenyifan17 opened this issue Jun 26, 2018 · 14 comments

@shenyifan17

I am trying to scrape tweets about Bitcoin from November to April. However, the data I obtained only contains tweets posted just before midnight, which looks like this:

[screenshot: screen shot 2018-06-26 at 16 12 24]

which misses the majority of the tweets...

I wonder if anyone else has run into the same issue.

@lapp0
Collaborator

lapp0 commented Jun 27, 2018

What is the command you're running?

@shenyifan17
Author

twitterscraper Bitcoin -p 100 --csv -bd 2017-10-13 -ed 2017-12-31 --lang en -o Bitcoin.csv

@lapp0
Collaborator

lapp0 commented Jul 3, 2018

What version are you running? Others have experienced issues with missing results prior to the fix applied to 0.7.0.

If you're running 0.7.0, could you run it twice and share your results for both runs via github (or other host) upload?

@shenyifan17
Author

My version is 0.7.1

@lapp0
Collaborator

lapp0 commented Jul 3, 2018

Okay, please run it twice and share the results; I will compare them with mine and attempt to debug.

@shenyifan17
Author

I am running "twitterscraper Bitcoin -p 100 --csv -bd 2017-10-13 -ed 2017-10-20 --lang en -o Bitcoin.csv" now. Sometimes it hits a JSON parsing error, and then it stops scraping that day, returning only a few minutes' worth of tweets before midnight.
If it does not hit the JSON parsing error, the package seems to do the right thing.
But the JSON parsing error occurs randomly, and I cannot predict when it will happen...

@lapp0
Collaborator

lapp0 commented Jul 6, 2018

I can confirm that this is an issue on my end as well. I am getting non-deterministic failures. Looking into it.

@shenyifan17
Author

I guess this is due to Twitter blocking scraping requests...?

@lapp0
Collaborator

lapp0 commented Jul 16, 2018

I don't think Twitter is blocking scraping requests; I think we're using an API in a way it's not intended to be used, and they're not putting any effort into supporting our use case.

Anyway, I have found a workaround for two issues.

Issue 1

Summary of issue:

Sometimes too few or no results are returned for a query

Observations:

The headers {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'} and {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'} consistently give too few results for twitterscraper Bitcoin --csv -bd 2011-10-13 -ed 2011-10-20 --lang en -o res.csv, while all other headers give the correct number of results, 3075 (or at least the same number of results). The number of results given for these bad headers is non-deterministic and varies widely.

Fix:

Remove the bad headers.
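
For reference, a minimal sketch of that fix, assuming the scraper rotates through a plain list of user-agent strings (HEADERS_LIST below is a stand-in name; the actual variable in twitterscraper may be named or located differently):

# Sketch only: drop the two user agents observed above to return too few results.
# HEADERS_LIST is a stand-in for whatever list of user agents the scraper rotates through.
HEADERS_LIST = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
]

BAD_USER_AGENTS = {
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
}

# Keep only the user agents that are not on the bad list.
HEADERS_LIST = [ua for ua in HEADERS_LIST if ua not in BAD_USER_AGENTS]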

Issue 2:

Summary of issue:

For some queries involving date spans greater than a day (e.g. -bd 2017-10-13 -ed 2017-10-20), Twitter's API returns zero results or far fewer results than actually exist. An example of a problematic query is Bitcoin.

Observations:

For the command twitterscraper Bitcoin -p 2 --csv -bd 2017-10-13 -ed 2017-10-15 --lang en -o res.csv, one time I got 120264 results, another time I got 120259, and on the final try I got a JSONDecodeError and had 51186 results. For the 51186-result run, it stopped a little over an hour before the end of 2017-10-13 (22:54 was the last timestamp) and then transitioned to the next day.

Note: the second run having 5 fewer results is likely due to deleted tweets and I'm not going to worry about that.

Fix:

Currently debugging.

@lapp0
Collaborator

lapp0 commented Jul 17, 2018

@taspinar could you shed some light on this? Why did you set RELOAD_URL to the following?

RELOAD_URL = 'https://twitter.com/i/search/timeline?f=tweets&vertical=' \
             'default&include_available_features=1&include_entities=1&' \
             'reset_error_state=false&src=typd&max_position={pos}&q={q}&l={lang}'


Is this from some API or client you used?

When I run twitterscraper Bitcoin --csv -bd 2016-10-20 -ed 2016-10-21 --lang en -o foo.csv I get

DEBUG: querying https://twitter.com/search?f=tweets&vertical=default&q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-21&l=en
DEBUG: num new_tweets: 5, pos: TWEET-789254185657270272-789254699132317696, ids: ['789254699132317696', '789254679502913536', '789254633353048064', '789254472010653696', '789254185657270272']
DEBUG: querying https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-789254185657270272-789254699132317696&q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-21&l=en
DEBUG: num new_tweets: 0, pos: None, ids: []

So the issue is that Twitter doesn't like this reload URL, which is generated from the tweets returned so far.
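
For anyone who wants to reproduce this outside of twitterscraper, here is a rough sketch that rebuilds the failing reload request from the debug log above. It assumes the endpoint returns JSON with items_html and has_more_items fields, which is what the scraper expects to parse:

import urllib.parse

import requests

# Sketch only: re-issue the reload request from the debug log above to check
# whether Twitter rejects the max_position cursor or some other parameter.
RELOAD_URL = ('https://twitter.com/i/search/timeline?f=tweets&vertical='
              'default&include_available_features=1&include_entities=1&'
              'reset_error_state=false&src=typd&max_position={pos}&q={q}&l={lang}')

url = RELOAD_URL.format(
    pos='TWEET-789254185657270272-789254699132317696',
    q=urllib.parse.quote('Bitcoin since:2016-10-20 until:2016-10-21'),
    lang='en',
)
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
response = requests.get(url, headers=headers)
data = response.json()  # this is where the JSONDecodeError shows up on bad responses
print(data.get('has_more_items'), len(data.get('items_html', '')))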

However, when I run it in the browser and capture the requests, it shows me these:

https://twitter.com/search?q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-21&src=typd

https://twitter.com/i/cards/tfw/v1/789233887335460864?cardname=summary&autoplay_disabled=true&forward=true&earned=true&edge=true&lang=en&card_height=130&scribe_context=%7B%22client%22%3A%22web%22%2C%22page%22%3A%22search%22%2C%22section%22%3A%22news%22%2C%22component%22%3A%22tweet%22%7D&bearer_token=redacted

https://twitter.com/i/cards/tfw/v1/789233886077161473?cardname=summary&autoplay_disabled=true&forward=true&earned=true&edge=true&lang=en&card_height=130&scribe_context=%7B%22client%22%3A%22web%22%2C%22page%22%3A%22search%22%2C%22section%22%3A%22news%22%2C%22component%22%3A%22tweet%22%7D&bearer_token=REDACTED

https://twitter.com/i/cards/tfw/v1/789253836053577728?cardname=summary_large_image&autoplay_disabled=true&forward=true&earned=true&edge=true&lang=en&card_height=344&scribe_context=%7B%22client%22%3A%22web%22%2C%22page%22%3A%22search%22%2C%22section%22%3A%22news%22%2C%22component%22%3A%22tweet%22%7D&bearer_token=REDACTED

These requests gave me the desired responses, but for some reason your URL isn't giving them. This is either because:

  1. max_position=TWEET-789254185657270272-789254699132317696 is wrong

  2. some other parameter is wrong

  3. this way of reloading and fetching tweets using this API is unreliable.

Solutions to this are

  1. Something based on your wisdom that doesn't involve too many changes

  2. Me completely rewriting this to use the API my browser is using when searching https://twitter.com/search?q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-21&src=typd

@lapp0
Collaborator

lapp0 commented Jul 18, 2018

Observation: these user agents work for me

  • Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36
  • M
  • `` (the empty string)

However, none of the headers currently in twitterscraper work for me when running

import requests

# The user agent to test, e.g. the Gecko/20110201 agent discussed below:
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201'

headers = {'user-agent': user_agent}

params = (
    ('f', 'tweets'),
    ('q', 'Bitcoin since:2016-10-20 until:2016-10-21'),
    ('l', 'en'),
)

response = requests.get('https://twitter.com/search', headers=headers, params=params)
print(response.text)

Very strange... how can M make Twitter happy, but Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201 not?
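
A rough probe sketch for checking which user agents currently get results back; it treats the presence of data-item-id markup in the returned HTML as a crude sign that tweets came back:

import requests

# Sketch only: try a few user agents against the search page and report whether
# each response contains tweet markup ('data-item-id' is a crude marker for results).
CANDIDATE_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    'M',
    '',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
]

params = {
    'f': 'tweets',
    'q': 'Bitcoin since:2016-10-20 until:2016-10-21',
    'l': 'en',
}

for agent in CANDIDATE_AGENTS:
    response = requests.get('https://twitter.com/search',
                            headers={'user-agent': agent}, params=params)
    print(repr(agent), response.status_code, 'data-item-id' in response.text)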

@lapp0
Collaborator

lapp0 commented Jul 23, 2018

@shenyifan17 please try this pull request; it should fix it: #126

@taspinar
Owner

@lapp0 This is the standard request issued by Twitter to fetch a new batch of tweets. See the image below.
[image: url]

I don't think there is anything wrong with the RELOAD_URL.

@taspinar
Owner

How many tweets do you see via your internet browser if you go to https://twitter.com/search?q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-21&src=typd ?
I only get a handful. I don't know if nobody was tweeting about Bitcoin on that particular day, or if Twitter does not return all results, but twitterscraper is scraping all tweets visible on Twitter.com.

PS: If I search for https://twitter.com/search?q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-22&src=typd I get about 2506 results, and when I search for https://twitter.com/search?q=Bitcoin%20since%3A2016-10-21%20until%3A2016-10-22&src=typd I get about 2500 results. So I think you picked a day (2016-10-20) on which there were only five tweets about Bitcoin.
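
For comparison, a minimal sketch for checking those counts from Python as well, assuming the query_tweets helper exported by twitterscraper 0.7.x accepts begindate/enddate keyword arguments:

import datetime as dt

from twitterscraper import query_tweets

# Sketch only: compare tweet counts for the two date ranges discussed above.
# Assumes the 0.7.x query_tweets signature with begindate/enddate keyword arguments.
for begin, end in [(dt.date(2016, 10, 20), dt.date(2016, 10, 21)),
                   (dt.date(2016, 10, 21), dt.date(2016, 10, 22))]:
    tweets = query_tweets('Bitcoin', begindate=begin, enddate=end, lang='en')
    print(begin, '->', end, ':', len(tweets), 'tweets')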
