Parallel scraping doesn't seem to work #17

Closed

sils opened this issue Dec 7, 2016 · 22 comments

Comments

@sils
Collaborator

sils commented Dec 7, 2016

I made a few logging modifications, and if you check out https://github.com/sils/twitterscraper/tree/sils/parallel and scrape for test (or something like that), you'll sometimes get only ~60 tweets for some parts of months, which seems rather impossible (and doesn't check out if you put the advanced query into the search UI).

@taspinar if you have any idea that'd help a lot :/

@sils
Collaborator Author

sils commented Dec 11, 2016

Also, it seems that even without scraping in parallel (using the built-in map instead of the pool's imap), those big queries don't really work too well.
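For context, the difference between the two modes looks roughly like this. This is a hypothetical sketch, not the actual branch code: fetch_month stands in for twitterscraper's real per-query scrape call, and a thread pool (multiprocessing.dummy) is used since the work is I/O-bound HTTP.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same API as multiprocessing.Pool

def fetch_month(month):
    # Stand-in for a real per-month scrape such as
    # query_tweets_once('test since:... until:...')
    return [f"tweet-{month}-{i}" for i in range(3)]

months = ["2016-%02d" % m for m in range(1, 13)]

# Serial version: plain map, one month at a time.
serial = [t for chunk in map(fetch_month, months) for t in chunk]

# Parallel version: pool.imap yields each month's results lazily
# as workers finish, instead of waiting for the whole batch.
with Pool(4) as pool:
    parallel = [t for chunk in pool.imap(fetch_month, months) for t in chunk]
```

Both should produce the same tweets; the bug report is that the parallel path returns far fewer than expected.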

@adtac
Contributor

adtac commented Dec 11, 2016

Hi, I'd like to take this up. Can I get assigned?

PS: @sils pointed me over here :)

@taspinar
Owner

Hi @adtac , it would be great if you could have a look at it :)

@adtac
Contributor

adtac commented Dec 20, 2016

@sils could you post the json somewhere? I tried running `TwitterScraper "hillary OR trump AND election" --limit 100` and all the output tweets were properly in December 2016. Should I try with `-a`?

@sils
Collaborator Author

sils commented Dec 20, 2016 via email

@adtac
Contributor

adtac commented Dec 20, 2016

Rough order? I'm currently running it for 100,000 tweets.

@adtac
Contributor

adtac commented Dec 20, 2016

@sils just came back - there doesn't seem to be any tweet with a weird month; `timestamp.month` is always between 1 and 12 (inclusive). Any pointers as to how to reproduce this issue?

@sils
Collaborator Author

sils commented Dec 20, 2016

@adtac check out my branch and scrape for test without a limit. You will see that for some queries there's a very low number of results; on that branch, each query is logged with its result count. Paste the exact query into the search box on https://twitter.com/ and you'll see that the web UI returns far more results for that query than were scraped.

@adtac
Contributor

adtac commented Dec 20, 2016

@sils Yay, I was able to reproduce it 👍 Some months returned 0 results (when results actually exist).

I made this small change, and after several tests I never got the problem again. I haven't concluded whether this is the actual issue. Still looking into it.

diff --git a/twitterscraper/query.py b/twitterscraper/query.py
index 12a2880..bb5dca7 100644
--- a/twitterscraper/query.py
+++ b/twitterscraper/query.py
@@ -84,7 +84,7 @@ def query_tweets_once(query, limit=None, num_tweets=0):
     :return:      A list of twitterscraper.Tweet objects. You will get at least
                   ``limit`` number of items.
     """
-    query = query.replace(' ', '%20').replace("#", "%23")
+    query = query.replace(' ', '%20').replace("#", "%23").replace(":", "%3A")
     pos = None
     tweets = []
     try:

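The fix above works because date-range operators like since:2016-01-01 contain a colon, which the old replace chain left unencoded. A more general approach (a sketch, not what the patch does) would be to let the standard library handle all reserved characters at once via urllib.parse.quote:

```python
from urllib.parse import quote

def encode_query(query):
    # quote() percent-encodes spaces, '#', and other reserved characters;
    # safe='' forces ':' to be encoded as well (quote leaves it alone by default)
    return quote(query, safe='')

encoded = encode_query('test since:2016-01-01 #tag')
```

This avoids having to extend the replace chain every time a new special character shows up in a query.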
@adtac
Contributor

adtac commented Dec 20, 2016

I have, however, sometimes been getting a few weird `ERROR: URLError EOF occurred in violation of protocol (_ssl.c:600)` errors while requesting :/

@sils
Collaborator Author

sils commented Dec 20, 2016 via email

@adtac
Contributor

adtac commented Dec 20, 2016

The retry catches it, and things resume fine after that.
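The retry logic being referred to can be pictured like this. This is a hypothetical sketch (fetch_with_retry is not a function in the codebase): transient SSL EOF failures surface as URLError, so re-issuing the request a few times usually recovers.

```python
import time
from urllib.error import URLError

def fetch_with_retry(fetch, retries=3, delay=1.0):
    # Retry a request callable up to `retries` times, sleeping between
    # attempts; re-raise if the last attempt still fails.
    for attempt in range(retries):
        try:
            return fetch()
        except URLError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

On a flaky network this turns an occasional hard failure into a short pause, which matches the behavior described above.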

@adtac
Contributor

adtac commented Dec 20, 2016

Probably just an issue with my network - it's a little flaky at the moment.

@sils
Collaborator Author

sils commented Dec 20, 2016 via email

@adtac
Contributor

adtac commented Dec 20, 2016

Yeah, it's working quite well now - none of the first 20 parallel workers has finished yet (each at > 2500 tweets atm).

@sils
Collaborator Author

sils commented Dec 20, 2016

nice, make a PR?

@adtac
Contributor

adtac commented Dec 20, 2016

Starting from your branch? I guess it doesn't really matter - you can rebase later.

@sils
Collaborator Author

sils commented Dec 20, 2016

my commit is a WIP; pull in all the changes that are good. I think logging like that is better and more transparent

@adtac
Contributor

adtac commented Dec 20, 2016

I disagree with removing this - it's hard to know what's going on in the background if there are no updates. For example, test has over 3000 tweets (and counting) and I have no idea if it is even running (3000 tweets =~ 150+ HTTP requests =~ 3 or 4 minutes at the very least). In fact I added in a simple

        logging.info(repr(query) + " - " + str(len(tweets)))

just to keep myself updated :D
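A slightly more structured take on that one-liner (a hypothetical helper, not something in the codebase) would throttle the output so long scrapes stay visible without flooding the log:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

def log_progress(query, count, every=200):
    # Emit a progress line only at multiples of `every` tweets;
    # returns True when a line was actually logged.
    if count and count % every == 0:
        logging.info("%r - %d tweets so far", query, count)
        return True
    return False
```

Called from inside the scrape loop, this gives a heartbeat every few HTTP requests rather than on every batch.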

Also I don't understand why we don't need Line 87.

@adtac
Contributor

adtac commented Dec 20, 2016

But I agree with the other two changes 👍

@sils
Collaborator Author

sils commented Dec 20, 2016

I was just trying to make it less verbose so I could see the relevant stuff.

@adtac
Contributor

adtac commented Dec 20, 2016

Ah, alright :3

I'll make a PR in a minute 👍

meticulousfan added a commit to meticulousfan/scraping-site that referenced this issue Aug 19, 2022