
Only getting a small amount of data before midnight #126

Closed
shenyifan17 opened this issue Jun 26, 2018 · 14 comments

@shenyifan17

I am trying to scrape tweets about Bitcoin from November to April. However, the data I obtained only contains tweets posted just before midnight, which looks like this:

[screenshot: screen shot 2018-06-26 at 16 12 24]

which misses the majority of the tweets...

I wonder if anyone else has run into the same issue.

@lapp0
Collaborator

lapp0 commented Jun 27, 2018

What is the command you're running?

@shenyifan17
Author

twitterscraper Bitcoin -p 100 --csv -bd 2017-10-13 -ed 2017-12-31 --lang en -o Bitcoin.csv

@lapp0
Collaborator

lapp0 commented Jul 3, 2018

What version are you running? Others have experienced issues with missing results prior to the fix applied to 0.7.0.

If you're running 0.7.0, could you run it twice and share your results for both runs via github (or other host) upload?

@shenyifan17
Author

My version is 0.7.1

@lapp0
Collaborator

lapp0 commented Jul 3, 2018

Okay, please run it twice and share the results; I will compare them with mine and attempt to debug.

@shenyifan17
Author

I am running "twitterscraper Bitcoin -p 100 --csv -bd 2017-10-13 -ed 2017-10-20 --lang en -o Bitcoin.csv" now. Sometimes it hits a JSON parsing error, and then it stops scraping that day, returning only a few minutes' worth of tweets before midnight.
If it does not hit the JSON parsing error, the package seems to do the right thing.
But the JSON parsing error occurs randomly, and I cannot predict when it will happen...

@lapp0
Collaborator

lapp0 commented Jul 6, 2018

I can confirm that this is an issue on my end as well. I am getting non-deterministic failures. Looking into it.

@shenyifan17
Author

I guess this is due to Twitter blocking scraping requests...?

@lapp0
Collaborator

lapp0 commented Jul 16, 2018

I don't think Twitter is blocking scraping requests; I think we're using an API in a way it's not intended to be used, and they're not putting any effort into supporting our use case.

Anyway, I have found a workaround for two issues.

Issue 1

Summary of issue:

Sometimes too few or no results are returned for a query

Observations:

The headers {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'} and {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'} consistently give too few results for twitterscraper Bitcoin --csv -bd 2011-10-13 -ed 2011-10-20 --lang en -o res.csv, while all other headers give the correct number of results, 3075 (or at least the same number of results). The number of results given for these bad headers is non-deterministic and varies widely.

Fix:

Remove the bad headers.
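
For reference, a minimal sketch of that fix, assuming the scraper rotates through a plain list of user-agent strings (HEADERS_LIST below is a stand-in name; the actual variable in twitterscraper may be named or located differently):

# Sketch only: drop the two user agents observed above to return too few results.
# HEADERS_LIST is a stand-in for whatever list of user agents the scraper rotates through.
HEADERS_LIST = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
]

BAD_USER_AGENTS = {
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
}

# Keep only the user agents that are not on the bad list.
HEADERS_LIST = [ua for ua in HEADERS_LIST if ua not in BAD_USER_AGENTS]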

Issue 2:

Summary of issue:

For some queries involving date spans greater than a day (e.g. -bd 2017-10-13 -ed 2017-10-20), Twitter's API returns zero results or far fewer results than actually exist. An example of a problematic query is Bitcoin.

Observations:

For the command twitterscraper Bitcoin -p 2 --csv -bd 2017-10-13 -ed 2017-10-15 --lang en -o res.csv, one time I got 120264 results, another time I got 120259, and on the final try I got a JSONDecodeError and had 51186 results. For the 51186-result run, it stopped a little over an hour before the end of 2017-10-13 (22:54 was the last timestamp) and then transitioned to the next day.

Note: the second run having 5 fewer results is likely due to deleted tweets and I'm not going to worry about that.

Fix:

Currently debugging.

@lapp0
Collaborator

lapp0 commented Jul 17, 2018

@taspinar could you shed some light on this? Why did you set RELOAD_URL to the following?

RELOAD_URL = 'https://twitter.com/i/search/timeline?f=tweets&vertical=' \
             'default&include_available_features=1&include_entities=1&' \
             'reset_error_state=false&src=typd&max_position={pos}&q={q}&l={lang}'


Is this from some API or client you used?

When I run twitterscraper Bitcoin --csv -bd 2016-10-20 -ed 2016-10-21 --lang en -o foo.csv I get

DEBUG: querying https://twitter.com/search?f=tweets&vertical=default&q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-21&l=en
DEBUG: num new_tweets: 5, pos: TWEET-789254185657270272-789254699132317696, ids: ['789254699132317696', '789254679502913536', '789254633353048064', '789254472010653696', '789254185657270272']
DEBUG: querying https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-789254185657270272-789254699132317696&q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-21&l=en
DEBUG: num new_tweets: 0, pos: None, ids: []

So the issue is that Twitter doesn't like this reload URL, which is generated from the tweets returned so far.
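
For anyone who wants to reproduce this outside of twitterscraper, here is a rough sketch that rebuilds the failing reload request from the debug log above. It assumes the endpoint returns JSON with items_html and has_more_items fields, which is what the scraper expects to parse:

import urllib.parse

import requests

# Sketch only: re-issue the reload request from the debug log above to check
# whether Twitter rejects the max_position cursor or some other parameter.
RELOAD_URL = ('https://twitter.com/i/search/timeline?f=tweets&vertical='
              'default&include_available_features=1&include_entities=1&'
              'reset_error_state=false&src=typd&max_position={pos}&q={q}&l={lang}')

url = RELOAD_URL.format(
    pos='TWEET-789254185657270272-789254699132317696',
    q=urllib.parse.quote('Bitcoin since:2016-10-20 until:2016-10-21'),
    lang='en',
)
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
response = requests.get(url, headers=headers)
data = response.json()  # this is where the JSONDecodeError shows up on bad responses
print(data.get('has_more_items'), len(data.get('items_html', '')))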

However, when I run it in the browser and capture the requests, it shows me these:

https://twitter.com/search?q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-21&src=typd

https://twitter.com/i/cards/tfw/v1/789233887335460864?cardname=summary&autoplay_disabled=true&forward=true&earned=true&edge=true&lang=en&card_height=130&scribe_context=%7B%22client%22%3A%22web%22%2C%22page%22%3A%22search%22%2C%22section%22%3A%22news%22%2C%22component%22%3A%22tweet%22%7D&bearer_token=redacted

https://twitter.com/i/cards/tfw/v1/789233886077161473?cardname=summary&autoplay_disabled=true&forward=true&earned=true&edge=true&lang=en&card_height=130&scribe_context=%7B%22client%22%3A%22web%22%2C%22page%22%3A%22search%22%2C%22section%22%3A%22news%22%2C%22component%22%3A%22tweet%22%7D&bearer_token=REDACTED

https://twitter.com/i/cards/tfw/v1/789253836053577728?cardname=summary_large_image&autoplay_disabled=true&forward=true&earned=true&edge=true&lang=en&card_height=344&scribe_context=%7B%22client%22%3A%22web%22%2C%22page%22%3A%22search%22%2C%22section%22%3A%22news%22%2C%22component%22%3A%22tweet%22%7D&bearer_token=REDACTED

These requests gave me the desired responses, but for some reason your URL isn't giving them. This is either because:

  1. max_position=TWEET-789254185657270272-789254699132317696 is wrong

  2. some other parameter is wrong

  3. this way of reloading and fetching tweets using this API is unreliable.

Solutions to this are

  1. Something based on your wisdom that doesn't involve too many changes

  2. Me completely rewriting this to use the API my browser is using when searching https://twitter.com/search?q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-21&src=typd

@lapp0
Collaborator

lapp0 commented Jul 18, 2018

Observation: these user agents work for me

  • Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36
  • M
  • `` (the empty string)

However, none of the headers currently in twitterscraper work for me when running

import requests

# The user agent to test, e.g. the Gecko/20110201 agent discussed below:
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201'

headers = {'user-agent': user_agent}

params = (
    ('f', 'tweets'),
    ('q', 'Bitcoin since:2016-10-20 until:2016-10-21'),
    ('l', 'en'),
)

response = requests.get('https://twitter.com/search', headers=headers, params=params)
print(response.text)

Very strange... how can M make Twitter happy, but Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201 not?
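
A rough probe sketch for checking which user agents currently get results back; it treats the presence of data-item-id markup in the returned HTML as a crude sign that tweets came back:

import requests

# Sketch only: try a few user agents against the search page and report whether
# each response contains tweet markup ('data-item-id' is a crude marker for results).
CANDIDATE_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    'M',
    '',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
]

params = {
    'f': 'tweets',
    'q': 'Bitcoin since:2016-10-20 until:2016-10-21',
    'l': 'en',
}

for agent in CANDIDATE_AGENTS:
    response = requests.get('https://twitter.com/search',
                            headers={'user-agent': agent}, params=params)
    print(repr(agent), response.status_code, 'data-item-id' in response.text)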

@lapp0
Collaborator

lapp0 commented Jul 23, 2018

@shenyifan17 please try this pull request; it should fix it: #126

@taspinar
Owner

@lapp0 This is the standard request issued by Twitter to fetch a new batch of tweets. See the image below.
[image: url]

I don't think there is anything wrong with the RELOAD_URL.

@taspinar
Owner

How many tweets do you see via your internet browser if you go to https://twitter.com/search?q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-21&src=typd ?
I only get a handful. I don't know if nobody was tweeting about Bitcoin on that particular day, or if Twitter does not return all results, but twitterscraper is scraping all tweets visible on Twitter.com.

PS: If I search for https://twitter.com/search?q=Bitcoin%20since%3A2016-10-20%20until%3A2016-10-22&src=typd I get about 2506 results, and when I search for https://twitter.com/search?q=Bitcoin%20since%3A2016-10-21%20until%3A2016-10-22&src=typd I get about 2500 results. So I think you picked a day (2016-10-20) on which there were only five tweets about Bitcoin.
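
For comparison, a minimal sketch for checking those counts from Python as well, assuming the query_tweets helper exported by twitterscraper 0.7.x accepts begindate/enddate keyword arguments:

import datetime as dt

from twitterscraper import query_tweets

# Sketch only: compare tweet counts for the two date ranges discussed above.
# Assumes the 0.7.x query_tweets signature with begindate/enddate keyword arguments.
for begin, end in [(dt.date(2016, 10, 20), dt.date(2016, 10, 21)),
                   (dt.date(2016, 10, 21), dt.date(2016, 10, 22))]:
    tweets = query_tweets('Bitcoin', begindate=begin, enddate=end, lang='en')
    print(begin, '->', end, ':', len(tweets), 'tweets')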
