Parallel scraping doesn't seem to work #17
Comments
Also, it seems that even without scraping in parallel (using map instead of the pool's imap), those big queries don't really work too well.
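One way to check whether parallelism itself is the culprit is to run the same queries once through the built-in `map` and once through a pool's `imap`, then compare the results. The sketch below is only an illustration: `fake_scrape` and the query strings are hypothetical stand-ins, and a thread pool is used for simplicity; the thread does not show which pool type the project uses.

```python
from multiprocessing.pool import ThreadPool


def fake_scrape(query):
    # Hypothetical stand-in for a real scraping call; it returns the query
    # and a deterministic "tweet count" so both runs can be compared.
    return query, len(query)


# Hypothetical per-month queries, not taken from the thread.
queries = [
    "test since:2017-01-01 until:2017-02-01",
    "test since:2017-02-01 until:2017-03-01",
    "test since:2017-03-01 until:2017-04-01",
]

# Serial run with the built-in map.
serial = list(map(fake_scrape, queries))

# Parallel run; imap yields results in the same order as the input.
with ThreadPool(4) as pool:
    parallel = list(pool.imap(fake_scrape, queries))

# Report any query whose result differs between the two runs.
mismatches = [s[0] for s, p in zip(serial, parallel) if s != p]
print("mismatching queries:", mismatches)
```

If the serial run shows the same gaps as the parallel one, the problem is more likely in the query itself than in the pool.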
Hi, I'd like to take this up. Can I get assigned? PS: @sils pointed me over here :)
Hi @adtac, it would be great if you could have a look at it :)
@sils could you post the json somewhere? I tried running
The problem seems to appear only with large amounts of tweets.
Rough order? I'm currently running it for 100,000 tweets.
@sils just came back - there doesn't seem to be any tweet with a weird month -
@adtac check out my branch and scrape for
@sils Yay, I was able to reproduce it 👍 Some months returned 0 results (when results actually exist). I made this small change and after several tests I never got the problem again. I haven't confirmed yet whether this is the actual cause. Still looking into it.
diff --git a/twitterscraper/query.py b/twitterscraper/query.py
index 12a2880..bb5dca7 100644
--- a/twitterscraper/query.py
+++ b/twitterscraper/query.py
@@ -84,7 +84,7 @@ def query_tweets_once(query, limit=None, num_tweets=0):
     :return: A list of twitterscraper.Tweet objects. You will get at least
              ``limit`` number of items.
     """
-    query = query.replace(' ', '%20').replace("#", "%23")
+    query = query.replace(' ', '%20').replace("#", "%23").replace(":", "%3A")
     pos = None
     tweets = []
     try:
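For comparison, the manual `.replace()` chain in the diff above can be set next to the standard library's `urllib.parse.quote`, which percent-encodes every reserved character in one call. This is only a sketch: `encode_manual` and `encode_query` are hypothetical helper names, and the example query is an assumption, not one taken from this thread.

```python
from urllib.parse import quote


def encode_manual(query):
    # Mirrors the patched line from the diff: only ' ', '#' and ':' are encoded.
    return query.replace(' ', '%20').replace("#", "%23").replace(":", "%3A")


def encode_query(query):
    # Hypothetical alternative: safe='' makes quote() percent-encode all
    # reserved characters, not just the three handled manually.
    return quote(query, safe='')


# Assumed example query with a hashtag and date operators.
example = "#python since:2017-01-01 until:2017-02-01"
print(encode_manual(example))
print(encode_query(example))

# A character the manual chain does not handle.
print(encode_manual("a&b"), encode_query("a&b"))
```

For the example query both helpers should give the same string; they diverge only for characters the manual chain leaves untouched, such as `&`.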
I have, however, been getting a few weird
Is that disturbing the scraping, or is the retry mechanism catching it?
The retry catches it and then things resume fine after that.
Probably just an issue with my network - it's a little flaky at the moment.
Sounds fine then.
Yeah, it's working quite well now - none of the first 20 parallel workers have finished (each at > 2500 tweets atm).
Nice, make a PR?
Starting from your branch? I guess it doesn't really matter - you can rebase later.
My commit is a WIP; pull in all the changes that are good. I think the logging like that is better and more transparent.
I disagree with removing this - it's hard to know what's going on in the background if there're no updates. For example
just to keep myself updated :D Also I don't understand why we don't need Line 87.
But I agree with the other two changes 👍
I was just trying to make it less verbose so I could see the relevant stuff
Ah, alright :3 I'll make a PR in a minute 👍
I made a few logging modifications. If you check out https://github.com/sils/twitterscraper/tree/sils/parallel and scrape for
test
or something like that, you'll sometimes get only 60ish tweets for some parts of months, which seems rather impossible (and doesn't check out if you put the advanced query into the search UI). @taspinar if you have any idea, that'd help a lot :/