Parallel scraping doesn't seem to work #17

Closed

sils opened this issue Dec 7, 2016 · 22 comments

Comments

@sils
Collaborator

sils commented Dec 7, 2016

I made a few logging modifications, and if you check out https://github.com/sils/twitterscraper/tree/sils/parallel and scrape for test (or something like that), you'll sometimes get only ~60 tweets for some parts of months, which seems rather impossible (and doesn't check out if you put the advanced query into the search UI).

@taspinar if you have any idea that'd help a lot :/

@sils
Collaborator Author

sils commented Dec 11, 2016

Also, it seems that even without scraping in parallel (using the built-in map instead of the pool's imap), those big queries don't really work too well.
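For context, the difference between the two modes looks roughly like this. This is a hypothetical sketch, not the actual branch code: fetch_month stands in for twitterscraper's real per-query scrape call, and a thread pool (multiprocessing.dummy) is used since the work is I/O-bound HTTP.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same API as multiprocessing.Pool

def fetch_month(month):
    # Stand-in for a real per-month scrape such as
    # query_tweets_once('test since:... until:...')
    return [f"tweet-{month}-{i}" for i in range(3)]

months = ["2016-%02d" % m for m in range(1, 13)]

# Serial version: plain map, one month at a time.
serial = [t for chunk in map(fetch_month, months) for t in chunk]

# Parallel version: pool.imap yields each month's results lazily
# as workers finish, instead of waiting for the whole batch.
with Pool(4) as pool:
    parallel = [t for chunk in pool.imap(fetch_month, months) for t in chunk]
```

Both should produce the same tweets; the bug report is that the parallel path returns far fewer than expected.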

@adtac
Contributor

adtac commented Dec 11, 2016

Hi, I'd like to take this up. Can I get assigned?

PS: @sils pointed me over here :)

@taspinar
Owner

Hi @adtac , it would be great if you could have a look at it :)

@adtac
Contributor

adtac commented Dec 20, 2016

@sils could you post the json somewhere? I tried running `TwitterScraper "hillary OR trump AND election" --limit 100` and all the output tweets were properly in December 2016. Should I try with `-a`?

@sils
Collaborator Author

sils commented Dec 20, 2016 via email

@adtac
Contributor

adtac commented Dec 20, 2016

Rough order? I'm currently running it for 100,000 tweets.

@adtac
Contributor

adtac commented Dec 20, 2016

@sils just came back - there doesn't seem to be any tweet with a weird month; `timestamp.month` is always between 1 and 12 (inclusive). Any pointers as to how to reproduce this issue?

@sils
Collaborator Author

sils commented Dec 20, 2016

@adtac check out my branch and scrape for test without a limit. You will see that for some queries there's a very low number of results; on that branch, each query is logged with its result count. Paste the exact query into the search box on https://twitter.com/ and you'll see that the web UI returns far more results for that query than were scraped.

@adtac
Contributor

adtac commented Dec 20, 2016

@sils Yay, I was able to reproduce it 👍 Some months returned 0 results (when results actually exist).

I made this small change, and after several tests I never got the problem again. I haven't concluded whether this is the actual issue. Still looking into it.

diff --git a/twitterscraper/query.py b/twitterscraper/query.py
index 12a2880..bb5dca7 100644
--- a/twitterscraper/query.py
+++ b/twitterscraper/query.py
@@ -84,7 +84,7 @@ def query_tweets_once(query, limit=None, num_tweets=0):
     :return:      A list of twitterscraper.Tweet objects. You will get at least
                   ``limit`` number of items.
     """
-    query = query.replace(' ', '%20').replace("#", "%23")
+    query = query.replace(' ', '%20').replace("#", "%23").replace(":", "%3A")
     pos = None
     tweets = []
     try:

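The fix above works because date-range operators like since:2016-01-01 contain a colon, which the old replace chain left unencoded. A more general approach (a sketch, not what the patch does) would be to let the standard library handle all reserved characters at once via urllib.parse.quote:

```python
from urllib.parse import quote

def encode_query(query):
    # quote() percent-encodes spaces, '#', and other reserved characters;
    # safe='' forces ':' to be encoded as well (quote leaves it alone by default)
    return quote(query, safe='')

encoded = encode_query('test since:2016-01-01 #tag')
```

This avoids having to extend the replace chain every time a new special character shows up in a query.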
@adtac
Contributor

adtac commented Dec 20, 2016

I have, however, sometimes been getting a few weird `ERROR: URLError EOF occurred in violation of protocol (_ssl.c:600)` errors while requesting :/

@sils
Collaborator Author

sils commented Dec 20, 2016 via email

@adtac
Contributor

adtac commented Dec 20, 2016

The retry catches it, and things resume fine after that.
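The retry logic being referred to can be pictured like this. This is a hypothetical sketch (fetch_with_retry is not a function in the codebase): transient SSL EOF failures surface as URLError, so re-issuing the request a few times usually recovers.

```python
import time
from urllib.error import URLError

def fetch_with_retry(fetch, retries=3, delay=1.0):
    # Retry a request callable up to `retries` times, sleeping between
    # attempts; re-raise if the last attempt still fails.
    for attempt in range(retries):
        try:
            return fetch()
        except URLError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

On a flaky network this turns an occasional hard failure into a short pause, which matches the behavior described above.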

@adtac
Contributor

adtac commented Dec 20, 2016

Probably just an issue with my network - it's a little flaky at the moment.

@sils
Collaborator Author

sils commented Dec 20, 2016 via email

@adtac
Contributor

adtac commented Dec 20, 2016

Yeah, it's working quite well now - none of the first 20 parallel workers has finished yet (each at > 2500 tweets atm).

@sils
Collaborator Author

sils commented Dec 20, 2016

nice, make a PR?

@adtac
Contributor

adtac commented Dec 20, 2016

Starting from your branch? I guess it doesn't really matter - you can rebase later.

@sils
Collaborator Author

sils commented Dec 20, 2016

my commit is a WIP; pull in all the changes that are good. I think logging like that is better and more transparent

@adtac
Contributor

adtac commented Dec 20, 2016

I disagree with removing this - it's hard to know what's going on in the background if there are no updates. For example, test has over 3000 tweets (and counting) and I have no idea if it is even running (3000 tweets =~ 150+ HTTP requests =~ 3 or 4 minutes at the very least). In fact I added in a simple

        logging.info(repr(query) + " - " + str(len(tweets)))

just to keep myself updated :D
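A slightly more structured take on that one-liner (a hypothetical helper, not something in the codebase) would throttle the output so long scrapes stay visible without flooding the log:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

def log_progress(query, count, every=200):
    # Emit a progress line only at multiples of `every` tweets;
    # returns True when a line was actually logged.
    if count and count % every == 0:
        logging.info("%r - %d tweets so far", query, count)
        return True
    return False
```

Called from inside the scrape loop, this gives a heartbeat every few HTTP requests rather than on every batch.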

Also I don't understand why we don't need Line 87.

@adtac
Contributor

adtac commented Dec 20, 2016

But I agree with the other two changes 👍

@sils
Collaborator Author

sils commented Dec 20, 2016

I was just trying to make it less verbose so I could see the relevant stuff.

@adtac
Contributor

adtac commented Dec 20, 2016

Ah, alright :3

I'll make a PR in a minute 👍

meticulousfan added a commit to meticulousfan/scraping-site that referenced this issue Aug 19, 2022