Bulk query wrapper #134
Some initial thoughts on design... Requirements:
Short term workflow (this issue):
Future Performance Considerations (issue(s) to be opened):
I appreciate any feedback/suggestions/testing. EDIT: Moved bulk ASN lookups to short term; Cymru was timing out on me for large individual queries.
I'll be happy to do some testing (and take a shot at development, if needed). Why the separation into two steps (ASN and then lookup_*)?
Thanks. ASN lookups are very fast, and the results map IPs to their RIR. We can then use this info to sort the IPs before actually performing the whois/RDAP lookups. Basically, we don't want a bunch of lookups to LACNIC in a row, since they rate limit aggressively. The sort should spread out those LACNIC IPs (or any other RIRs that decide to rate limit). This method becomes more effective as the variation in the bulk IP list increases.
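The sorting idea described above can be sketched roughly like this: given a mapping of IP to RIR from the bulk ASN pass, round-robin across the RIR groups so no single registry is queried many times back to back. This is a hypothetical helper for illustration, not the actual ipwhois code.

```python
from collections import defaultdict
from itertools import zip_longest

def rate_limit_sort(ip_rir_map):
    """Interleave IPs across RIRs so aggressive rate limiters
    (e.g. LACNIC) are not hit many times in a row."""
    groups = defaultdict(list)
    for ip, rir in ip_rir_map.items():
        groups[rir].append(ip)
    ordered = []
    # Round-robin: take one IP from each RIR's list per pass.
    for batch in zip_longest(*groups.values()):
        ordered.extend(ip for ip in batch if ip is not None)
    return ordered

ips = {'8.8.8.8': 'arin', '200.1.2.3': 'lacnic',
       '1.1.1.1': 'apnic', '200.4.5.6': 'lacnic'}
print(rate_limit_sort(ips))
```

The more RIRs are represented in the input list, the more the rate-limited lookups get spread apart, which matches the observation that the method improves with list variety.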
This is a much-needed feature, thanks to aggressive rate limiting beyond even the published limits. I throttled my queries to 6 per minute and am still getting blocked within a couple of minutes. Even at that rate, my script would need to run for over 2 days to get the info I need.
I understand your frustration. I have a couple of questions...
I have been running a bunch of different tests, and am finally seeing some better results. I will update this post with more detailed sample size information. I edited my design post above (moved bulk ASN lookups to this issue). EDIT:
All lookups completed in a total of 26m 8s. I was looking up 2 of every non-LACNIC IP for every 1 LACNIC IP. I had an internal timer set to not trigger the rate limit on LACNIC (9 per minute), and only hit the limit 3 times. The problem with the 2-1 method is that all the other lookups completed long before the LACNIC ones were done, and then it was just a matter of waiting for the remaining LACNIC IPs at 9 per minute. I know I can improve on this:
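The internal timer mentioned above (staying under 9 LACNIC queries per minute) can be approximated with a small pacing helper. The class and method names here are illustrative assumptions, not the ipwhois API.

```python
import time

class RatePacer:
    """Sleep between calls so at most `per_minute` queries are
    issued per minute (illustrative helper, not part of ipwhois)."""

    def __init__(self, per_minute=9):
        self.interval = 60.0 / per_minute
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            delay = self.interval - (now - self._last)
            if delay > 0:
                time.sleep(delay)
        self._last = time.monotonic()

pacer = RatePacer(per_minute=9)
# Call pacer.wait() before each LACNIC lookup to stay under ~9/minute.
```

With a pacer per RIR, the non-LACNIC queues can drain at full speed while LACNIC is throttled independently.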
Let me make these changes, run some more tests, clean up the code, and I will update experimental.py on dev.
Yes, all the blocks are from LACNIC. The list I'm trying to get information for is about 18k IPs. I haven't tried a depth argument, so likely whatever the default is. Honestly not sure what depth means or how to set it. I'll need to browse the package source and find out.
For each entity found by the main/root query, if depth >= 1, it will run a query for each of those entities and put the result in the objects dict. If depth >= 2 and a root entity query result also has entities, it will query each of those as well. This continues recursively until depth is reached. Basically, an IP has entities, which may also have entities, which may also have entities, and so on. Each unique entity results in an HTTP query.
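The recursion described above can be sketched as follows. Here query_fn stands in for the per-entity HTTP request and returns a (result, child_handles) pair; both the function shape and the names are assumptions of this sketch, not the actual ipwhois internals.

```python
def lookup_entities(query_fn, root_handles, depth):
    """Recursively resolve RDAP entities down to `depth` levels.
    query_fn(handle) -> (result_dict, child_handles) is a stand-in
    for the per-entity HTTP query (sketch only)."""
    objects = {}

    def walk(handles, remaining):
        if remaining <= 0:
            return
        for handle in handles:
            if handle in objects:
                continue  # each unique entity is queried only once
            result, children = query_fn(handle)
            objects[handle] = result
            walk(children, remaining - 1)

    walk(root_handles, depth)
    return objects
```

So a larger depth means more HTTP queries per IP, which is why depth matters when you are already fighting rate limits.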
@cdubz @nstr10 I could use some help testing on the dev branch against experimental.bulk_lookup_rdap() and experimental.get_bulk_asn_whois(). I have been testing 500-1000 random IPs at a time (with at least 100 LACNIC IPs), so I'm sure you can understand why I need to outsource some of this testing. Any feedback, suggestions, and query results/time taken will be helpful. This is still far from perfect, but I am seeing some decent results. I definitely still need to add some exception handling, lookup limits, and rate/timeout tweaks. LACNIC is sporadically giving me weird 400 error failures for IPs. These are not rate-limit timeouts, and they eventually pull the correct information when I retry one or more times. I am thinking they are blocking/failing without actually giving the rate-limit error. Very frustrating... Here is something I have been using to generate random IPv4 addresses:

import random
import socket
import struct

from ipwhois.utils import ipv4_is_defined

i = 0
ip_list = []
while i < 1000:
    tmp = socket.inet_ntoa(struct.pack('>I', random.randint(1, 0xffffffff)))
    if not ipv4_is_defined(tmp)[0]:
        ip_list.append(tmp)
        i += 1

Thanks in advance.
Debug log and stats attached for two test runs with 1,000 random IPs. Happy to do more of these, or let me know if you are looking for any other kind of data.
Why not 6,000 IPs? Well, almost... lacnic
Thanks, this is a good format for now (although I would like to see the rate limit stats). The larger the sample size, the better. Also feel free to adjust some of the timeout/rate limit arguments to see if things get better/worse. I will work on returning some better stats. The LACNIC failures (400) are happening randomly even if I hit a single IP in the browser. I am not sure what to do about this, other than create a kickstarter to improve their infrastructure haha.
Oops! I didn't notice the rate limit debug messages in there. Here is a re-run of the stats from those three tests with rate limit info.
Interesting... I was attempting a 3,000 IP test with Test 4 (1,000 IPs, Test 5 (3,000 IPs,
I was going to do a really large list while away for the day, but I don't want to get pegged while gone so I'll save that for another time. |
Here are some more smaller runs. I'll try a larger list tonight as well - Test 6 (500 IPs, Test 7 (500 IPs, Test 8 (500 IPs, Test 9 (1,000 IPs, Test 10 (2,000 IPs, Not seeing much difference with the LACNIC failure rates.
Thanks, this is really helpful. I will go through what you have attached. I didn't expect much with improving the LACNIC failures, other than increasing the retry count. The CPU utilization is definitely a concern. I know Python likes to utilize as much as possible. Could you provide some system stats (CPU, mem, OS)? Was it frozen, or just pegged at 100% for a bit?
One core was pegged at 100% and it was like that for at least 20 minutes or so before I got to it. Didn't appear to be frozen. It seemed like an infinite loop problem, but I wasn't able to figure that out for sure. It was running on a pretty powerful machine - Core i5-4670K, 16GB memory, Linux Mint 18.1. Crazily enough, it appears I have just reproduced it with another large set. About eight hours ago I kicked off a set of 50,000 IPs (haha) and it only made it through about 8,000 (because of the LACNIC delays). But it was still running until right this moment, when it has pegged a core at 100% again! This is with a different computer (laptop) - Core i3 M370, 8GB memory, Ubuntu 16.04.2. Interestingly, the pegged core changes from time to time. Seems to point to the same line, experimental.py@L263, but again I'm not sure how to uncover much more than that. Test 11 (50,000 IPs,
Based on your output, it looks to be pegging on dict copy, which can be expected with very large results. I may have to limit the bulk size, or at least provide a default argument of maybe 1-5k. Python is greedy in that way, and will utilize as much resources as possible. Reducing the ip_list size won't stop it from hogging resources, but will take less time. I am going to run more tests, add additional logging, and more stats as previously mentioned this weekend. The only other testing I still need is tweaking retry_count plus a default/low socket_timeout. retry_count defaults to 3, so we could possibly see better results with LACNIC on those 400 errors (replicated in browser) if we retry a few more times. This will also increase the total time, but might yield fewer overall failures. Your extensive testing is greatly appreciated.
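Retrying those sporadic LACNIC 400s a few more times could look like the sketch below. This is a hedged illustration with made-up names; ipwhois's real retry loop may differ, and production code should catch the specific HTTP error type rather than bare Exception.

```python
import time

def query_with_retries(do_query, retry_count=3, base_delay=1.0):
    """Call do_query(), retrying up to retry_count extra times with a
    doubling delay between attempts. Intended for transient failures
    like the sporadic LACNIC 400s (illustrative only)."""
    last_exc = None
    for attempt in range(retry_count + 1):
        try:
            return do_query()
        except Exception as exc:  # real code: catch the specific HTTP error
            last_exc = exc
            time.sleep(base_delay * (2 ** attempt))
    raise last_exc
```

The trade-off is exactly as noted above: more retries means a longer total run, but likely fewer overall failures.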
Interesting. So you think the hanging issue has to do with Python's own
From just a couple of tests, it looks like going to eight seconds with Test 16 (1,000 IPs, Test 17 (1,000 IPs, Test 18 (1,000 IPs, However, I managed to get another hung process. This time with only a 1,000 IP list. I started saving the IP lists to a file before each run so I could try running the same list again to see if maybe it is reproducible. I will report back on that when I try this list again: Test 19 (1,000 IPs, I don't have any gdb output for this one as I have been running these tests on a Windows host today and don't have the appropriate tools to inspect. The behavior during the (10 minute long) hang was slightly different - the CPU was averaging 25% overall the entire time but no single core was pegged. Resource Monitor also noted the CPU was running at 120%+ "maximum frequency". I suspect the differences here don't tell us much, so noting just in case. This is an Intel Core i5-6440HQ, 16GB memory, Windows 10 Pro.
Unfortunately could not reproduce the issue with the same IP list. I figured it wouldn't happen, but it sure would've been nice to have a reproducible case... However, more support for an eight second Test 20 (1,000 IPs,
Thanks. The dict.copy is needed there since you can't remove elements from the original while looping over it. I'm going to run a big IP list and see if I can reproduce as well. As far as debug logging goes, I will add that in there. Also note that the function already returns a tuple of some of the stats/info you are parsing for:
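A minimal illustration of that pattern, with made-up data: iterate over a shallow copy while deleting finished entries from the original dict.

```python
# Mutating a dict while iterating it raises RuntimeError, so loop
# over a shallow copy and delete completed entries from the original.
pending = {'1.1.1.1': 'apnic', '200.1.2.3': 'lacnic', '8.8.8.8': 'arin'}

for ip, rir in dict(pending).items():
    if rir != 'lacnic':      # pretend the non-LACNIC lookups finished first
        del pending[ip]      # safe: we're iterating the copy, not pending

print(pending)
```

The copy itself is cheap per pass, but on very large result sets it is the kind of repeated work that can show up as a pegged core.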
Ah yeah. I probably should have been recording that with all these tests, haha. |
I had a 30k IP list stop running lookups after ~6 hours. Added in a debug message to show a counter of IPs left, then ran a 10k list. There is definitely an infinite loop (not caused by dict.copy). I am still not sure what is causing it, and why it only happens for larger IP lists. Running some more tests... |
Found the problem, and it was on the Cymru end. Their bulk ASN whois lookup was returning duplicates. My code didn't account for that since I had already made the input IP list unique. Basically the internal counter for tracking RIRs was higher than the total number of IPs to look up. Once it would get towards the end of the LACNIC list, it would continually loop thinking there were more LACNIC IPs to look up.
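The fix amounts to deduplicating the bulk ASN rows before counting per-RIR totals. A hypothetical version of that guard (the row format here is an assumption for illustration):

```python
def dedupe_asn_rows(rows):
    """Drop duplicate rows (same IP) from a bulk ASN whois response so
    per-RIR counters match the number of unique input IPs. Hypothetical
    helper illustrating the fix described above."""
    seen = set()
    unique = []
    for row in rows:
        if row['ip'] in seen:
            continue  # duplicate row returned by the whois server
        seen.add(row['ip'])
        unique.append(row)
    return unique

rows = [{'ip': '200.1.2.3', 'rir': 'lacnic'},
        {'ip': '200.1.2.3', 'rir': 'lacnic'},  # duplicate from Cymru
        {'ip': '8.8.8.8', 'rir': 'arin'}]
print(dedupe_asn_rows(rows))
```

With the counters derived from the deduplicated rows, the loop terminates once the real LACNIC queue is empty instead of spinning forever.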
Some more good results with a lowered Test 21 (10,000 IPs, I did get a couple socket errors with APNIC in this case, with a small group of the following messages along the way (blocking?):
Interesting, they could be blocking or it could just be congestion on their end. It looks like those IPs eventually got sorted out on retries. I am a bit leery of classifying that error (ConnectionResetError) as rate limiting at the bulk_lookup_rdap level, as it will wait indefinitely for timeout and retry. Let me think that over. I pushed another update to dev, with rate-limit handling for the rest of the RIRs and new return/stats:
One more test on the latest dev. Results not looking like quite as big an improvement as the original tests did with an eight second Test 23 (20,000 IPs,
Merged in #186. Leaving open pending any last-minute tweaks for 1.0.0; still need to update generate_examples.py/EXPERIMENTAL.rst.
RTD build is failing for the latest commit to dev (just added RST links). They give no indication or error messages. Will try again later... EDIT: This is failing for previously successful builds (latest/master). Seems to be an issue on their end. EDIT 2: Others are having this issue: readthedocs/readthedocs.org#3006
1.0.0 pushed to PyPI.
Create a new wrapper in ipwhois.py for bulk lookups. Add rate limit optimized ordering of IP lookups.