Bulk query wrapper #134

Closed
secynic opened this issue Aug 13, 2016 · 30 comments

secynic commented Aug 13, 2016

Create a new wrapper in ipwhois.py for bulk lookups. Add rate-limit-optimized ordering of IP lookups.

secynic added this to the 1.0.0 stable milestone Aug 13, 2016
secynic self-assigned this Aug 13, 2016

secynic commented Apr 5, 2017

Some initial thoughts on design...

Requirements:

  • Maintain Python 2.6+, 3.3+ compatibility
  • Results must be the same as if run via IPWhois
  • Support WHOIS, RDAP, NIR
  • Utilize existing code where possible
  • Offline tests
  • Stress tests

Short term workflow (this issue):

  • Add a bulk ASN lookup function to query Cymru ASN Whois (https://www.team-cymru.org/IP-ASN-mapping.html#whois); see the sketch after this list
  • Lookup ASN info for all IPs first via above function (~500 IPs in 1.5 minutes)
  • Sort IPs based on ASN results for optimal overall query time (LACNIC is the only rate-limiting factor so far), and return sorted dict (IP -> ASN info mapping)
  • User takes sorted dict and runs IPWhois.lookup_*() as usual, bypassing ASN lookup
  • Fail on 1 timeout, rinse and repeat for failures
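
For the first bullet, a minimal sketch of what the bulk Cymru query could look like (netcat-style bulk interface per the link above; this is illustrative, not the final function):

import socket

def bulk_asn_whois_sketch(ip_list, timeout=120):
    # One bulk query to Team Cymru's whois server instead of one query per IP.
    data = 'begin\nverbose\n' + '\n'.join(ip_list) + '\nend\n'
    conn = socket.create_connection(('whois.cymru.com', 43), timeout)
    conn.sendall(data.encode())
    response = b''
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        response += chunk
    conn.close()
    # Each result line: AS | IP | BGP Prefix | CC | Registry | Allocated | AS Name
    # The Registry column maps each IP to its RIR, which drives the sort step.
    return response.decode(errors='ignore')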

Future Performance Considerations (issue(s) to be opened):

  • Multiprocessing
  • Optionally reduce queries by limiting lookups per subnet.
  • Optional KV store for tracking average response time per RIR/NIC, further optimization based on these averages

I appreciate any feedback/suggestions/testing.

EDIT Moved bulk ASN lookups to short term - Cymru was timing out on me for large individual queries

cdubz commented Apr 5, 2017

I'll be happy to do some testing (and take a shot at development, if needed).

Why the separation into two steps (ASN and then lookup_*)?

secynic commented Apr 5, 2017

Thanks.

ASN lookups are very fast, and the results map IPs to their RIR. We can then use this info to sort the IPs before actually performing the whois/RDAP lookups. Basically, we don't want a bunch of lookups to LACNIC in a row, since they rate limit aggressively. The sort should spread out those LACNIC IPs (or any other RIRs that decide to rate limit).

This method becomes more effective as the variation in the bulk IP list increases.
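
For illustration, the sort could round-robin across per-RIR buckets so no single (rate-limited) RIR is queried many times in a row; a minimal sketch, assuming an IP -> RIR mapping from the ASN step:

from collections import defaultdict

def interleave_by_rir(rir_by_ip):
    # rir_by_ip: dict of IP -> RIR name (e.g. 'lacnic', 'arin', ...) from the ASN results
    buckets = defaultdict(list)
    for ip, rir in rir_by_ip.items():
        buckets[rir].append(ip)
    ordered = []
    # Take one IP from each RIR in turn until all buckets are drained.
    while any(buckets.values()):
        for rir in list(buckets):
            if buckets[rir]:
                ordered.append(buckets[rir].pop(0))
    return ordered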

nstr10 commented May 9, 2017

This is a much-needed feature, thanks to aggressive rate limiting beyond even the published limits. I throttled my queries to 6 per minute and am still getting blocked within a couple of minutes. Even at that rate, my script would need to run for over 2 days to get the info I need.
Happy to help with testing any proposed solution.
EDIT: I've now throttled my script to 2 lookups per minute, only to get blocked again after 15 minutes. This is infuriating.

secynic commented May 19, 2017

I understand your frustration. I have a couple of questions...

  • Were all those blocks LACNIC?
  • What was your depth argument set to? (depth greater than 0 will cause more queries per IP)

I have been running a bunch of different tests, and am finally seeing some better results. I will update this post with more detailed sample size information.

I edited my design post above (moved bulk ASN lookups to this issue).

EDIT
Here are my stats right now for the random sample:

  • ARIN - 64 IPs
  • AFRINIC - 5 IPs
  • RIPE - 105 IPs
  • LACNIC - 137 IPs
  • APNIC - 291 IPs

All lookups completed in a total of 26m 8s. I was looking up 2 of every non-LACNIC IP for every 1 LACNIC IP. I had an internal timer set to not trigger the rate limit on LACNIC (9 per minute), and only hit the limit 3 times. The problem with the 2-1 method is that all the other lookups completed long before the LACNIC ones were done, and then it was just a matter of waiting for the remaining LACNIC IPs @ 9 per minute.

I know I can improve on this:

  • Make sample less random - add more IPs for ARIN, AFRINIC
  • Remove the 2 for every 1 rule (see next point)
  • Look up 9 LACNIC IPs in a row, then the others until 60 total seconds have passed, rinse, repeat (sketched below)
  • Adjust timeout - I was getting socket timeout with the default 5s for a bunch of LACNIC lookups
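
A rough sketch of that pacing idea (lookup_one is a placeholder for the per-IP whois/RDAP call; the 9-per-minute LACNIC budget is the figure mentioned above):

import time

def paced_lookups(lacnic_ips, other_ips, lookup_one, lacnic_per_minute=9):
    results = {}
    while lacnic_ips or other_ips:
        window_start = time.time()
        # Spend this minute's LACNIC budget first...
        batch, lacnic_ips = lacnic_ips[:lacnic_per_minute], lacnic_ips[lacnic_per_minute:]
        for ip in batch:
            results[ip] = lookup_one(ip)
        # ...then fill the remainder of the 60-second window with the other RIRs.
        while other_ips and time.time() - window_start < 60:
            ip = other_ips.pop(0)
            results[ip] = lookup_one(ip)
        # If LACNIC IPs remain, wait out the rest of the minute to stay under the limit.
        if lacnic_ips:
            time.sleep(max(0, 60 - (time.time() - window_start)))
    return results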

Let me make these changes, run some more tests, clean up the code, and I will update experimental.py on dev.

nstr10 commented Jun 6, 2017

Yes, all the blocks are from LACNIC. The list I'm trying to get information for is about 18k IPs. I haven't tried a depth argument, so likely whatever the default is. Honestly not sure what depth means or how to set it. I'll need to browse the package source and find out.

secynic commented Jun 14, 2017

If depth >= 1, it will run a query for each entity found by the main/root query and put the result in the objects dict. If depth >= 2 and a root entity's query result also has entities, it will query each of those as well. This continues recursively until depth is reached.

Basically, an IP has entities, which may also have entities, which may also have entities, and so on... Each unique entity results in a HTTP query.
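
For example (the IP here is just for illustration):

from ipwhois import IPWhois

# depth=0: only the root query; no per-entity HTTP queries are made.
# depth=1: also query each entity referenced by the root result.
# depth=2: additionally query the entities referenced by those entities, and so on.
result = IPWhois('74.125.225.229').lookup_rdap(depth=1)
print(result['objects'].keys())  # one entry per entity that was queried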

secynic commented Jun 30, 2017

@cdubz @nstr10 I could use some help testing on the dev branch against experimental.bulk_lookup_rdap() and experimental.get_bulk_asn_whois(). I have been testing 500-1000 random IPs at a time (with at least 100 LACNIC IPs), so I'm sure you can understand why I need to outsource some of this testing. Any feedback, suggestions, and query results/time taken will be helpful.

This is still far from perfect, but I am seeing some decent results. I definitely still need to add some exception handling, lookup limits, and rate/timeout tweaks.

LACNIC is giving me weird 400 error message failures for IPs sporadically. These are not rate-limit timeouts, and they eventually pull the correct information when I retry one or more times. I am thinking they are blocking/failing without actually giving the rate-limit error. Very frustrating...

Here is something I have been using to generate random IPv4 addresses:

import random
import socket
import struct

from ipwhois.utils import ipv4_is_defined

i = 0
ip_list = []
while i < 1000:
    # Pack a random 32-bit integer into dotted-quad notation.
    tmp = socket.inet_ntoa(struct.pack('>I', random.randint(1, 0xffffffff)))
    # Keep only addresses that are not reserved/defined (private, loopback, etc.).
    if not ipv4_is_defined(tmp)[0]:
        ip_list.append(tmp)
        i += 1
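
The list can then be fed to the functions under test, e.g. (keyword names here follow the dev branch docs at the time; the values are only examples):

from ipwhois.experimental import bulk_lookup_rdap

# Returns the RDAP results along with failure/rate-limit stats.
ret = bulk_lookup_rdap(addresses=ip_list, retry_count=3, depth=0,
                       socket_timeout=10, rate_limit_timeout=60)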

Thanks in advance.

secynic added a commit that referenced this issue Jun 30, 2017
secynic added a commit that referenced this issue Jun 30, 2017
secynic added a commit that referenced this issue Jun 30, 2017

cdubz commented Jun 30, 2017

Debug log and stats attached for two test runs with 1,000 random IPs. Happy to do more of these, or let me know if you are looking for any other kind of data.

cdubz commented Jun 30, 2017

Why not 6,000 IPs? Well, almost... lacnic :goberserk:

secynic commented Jun 30, 2017

Thanks, this is a good format for now (although I would like to see the rate limit stats). The larger the sample size, the better. Also feel free to adjust some of the timeout/rate limit arguments to see if things get better/worse.

I will work on returning some better stats.

The LACNIC failures (400) are happening randomly even if I hit a single IP in the browser. I am not sure what to do about this, other than create a kickstarter to improve their infrastructure haha.

cdubz commented Jul 1, 2017

Oops! I didn't notice the rate limit debug messages in there. Here is a re-run of the stats from those three tests with rate limit info.

cdubz commented Jul 1, 2017

Interesting... I was attempting a 3,000 IP test with socket_timeout set to 30. I left it to run for a while and came back to a pegged CPU. I haven't really done Python debugging before so I kludged my way through some things and can provide at least a bit of info from a gdb session.

Test 4 (1,000 IPs, socket_timeout=20)

Test 5 (3,000 IPs, socket_timeout=30) 💣

I was going to do a really large list while away for the day, but I don't want to get pegged while gone so I'll save that for another time.

cdubz commented Jul 2, 2017

Here are some more smaller runs. I'll try a larger list tonight as well -

Test 6 (500 IPs, rate_limit_timeout=30)

Test 7 (500 IPs, rate_limit_timeout=10)

Test 8 (500 IPs, rate_limit_timeout=10, socket_timeout=20)

Test 9 (1,000 IPs, rate_limit_timeout=10, socket_timeout=20)

Test 10 (2,000 IPs, rate_limit_timeout=10, socket_timeout=20)

Not seeing much difference with the LACNIC failure rates.

secynic commented Jul 2, 2017

Thanks, this is really helpful. I will go through what you have attached. I didn't expect much with improving the LACNIC failures, other than increasing the retry count.

The CPU utilization is definitely a concern. I know Python likes to utilize as much as possible. Could you provide some system stats (CPU, mem, OS)? Was it frozen, or just pegged at 100% for a bit?

cdubz commented Jul 2, 2017

One core was pegged at 100% and it was like that for at least 20 minutes or so before I got to it. Didn't appear to be frozen. It seemed like an infinite loop problem but I wasn't able to figure that out for sure. It was running on a pretty powerful machine - Core i5-4670K, 16GB memory, Linux Mint 18.1.

Crazily enough, it appears I have just reproduced it with another large set. About eight hours ago I kicked off a set of 50,000 IPs (haha) and it only made it through about 8,000 (because of the LACNIC delays). But it was still running until right this moment, when it has pegged a core at 100% again! This is with a different computer (laptop) - Core i3 M370, 8GB memory, Ubuntu 16.04.2. Interestingly, the pegged core changes from time to time.

Seems to point to the same line, experimental.py@L263 but again I'm not sure how to uncover much more than that.

Test 11 (50,000 IPs, rate_limit_timeout=10, socket_timeout=20) 💣

secynic commented Jul 7, 2017

Based on your output, it looks to be pegging on dict copy, which can be expected with very large results. I may have to limit the bulk size, or at least provide a default argument of maybe 1-5k. Python is greedy in that way, and will utilize as much of the available resources as possible. Reducing the ip_list size won't stop it from hogging resources, but it will take less time.

I am going to run more tests, add additional logging, and more stats as previously mentioned this weekend.

The only other testing I still need is tweaking with retry_count + a default/low socket_timeout. retry_count defaults to 3, so we could possibly see better results with LACNIC on those 400 errors if we retry a few more times (replicated in browser). This will also increase the total time, but might yield fewer overall failures.

Your extensive testing is greatly appreciated.

cdubz commented Jul 7, 2017

Interesting. So you think the hanging issue has to do with Python's own dict.copy method having issues with larger dicts? I was assuming my minimal understanding of gdb was leading to some sort of false-positive. Is there any way to work around using dict.copy there? I haven't taken a close enough look at what is actually going on. Setting an arbitrary limit on the IP list size just doesn't seem ideal, though I suppose it would be fair to expect users to handle larger lists in chunks of some sort.

cdubz commented Jul 7, 2017

From just a couple of tests, it looks like going to eight seconds with socket_timeout may produce some good results (test 18). When I tried five seconds, LACNIC timeouts went waaay up and APNIC even had a few (test 17). My stats don't really tell us much about the effectiveness of retry_count, I don't think. I am only counting failures per attempt and not recording when an IP fails entirely (by reaching the retry count). A debug message would be useful for that so I don't have to fiddle too much with my very simple log parser (:

Test 16 (1,000 IPs, retry_count=5)

Test 17 (1,000 IPs, retry_count=5, socket_timeout=5)

Test 18 (1,000 IPs, retry_count=5, socket_timeout=8) 👍

However, I managed to get another hung process. This time with only a 1,000 IP list. I started saving the IP lists to a file before each run so I could try running the same list again to see if maybe it is reproducible. I will report back on that when I try this list again:

Test 19 (1,000 IPs, retry_count=5, socket_timeout=8) 💣

I don't have any gdb output for this one as I have been running these tests on a Windows host today and don't have the appropriate tools to inspect. The behavior during the (10 minute long) hang was slightly different - the CPU was averaging 25% overall the entire time but no single core was pegged. Resource Monitor also noted the CPU was running at 120%+ "maximum frequency". I suspect the differences here don't tell us much, so noting just in case. This is an Intel Core i5-6440HQ, 16GB memory, Windows 10 Pro.

cdubz commented Jul 7, 2017

Unfortunately could not reproduce the issue with the same IP list. I figured it wouldn't happen, but it sure would've been nice to have a reproducible case...

However, there is more support for an eight-second socket_timeout; the LACNIC average attempt time was much lower.

Test 20 (1,000 IPs, socket_timeout=8) 👍

secynic commented Jul 7, 2017

Thanks. The dict.copy is needed there since you can't remove elements from the original dict while looping over it.
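
i.e., roughly this pattern (names and values here are only illustrative):

# Iterating over a snapshot allows removing entries from the live dict mid-loop.
pending = {'200.7.84.1': 'lacnic', '74.125.225.229': 'arin'}
for ip, rir in pending.copy().items():
    if rir == 'arin':  # stand-in for "this lookup is finished"
        del pending[ip]
print(pending)  # {'200.7.84.1': 'lacnic'}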

As far as debug logging goes, I will add that in there. Also note that the function is returning a tuple of some of the stats/info you are parsing for already:
(results, failed_lookups, rated_lookups, ip_lookup_total, ip_unique_total, rir_stats)
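
e.g., reusing the ip_list from the generator posted earlier:

from ipwhois.experimental import bulk_lookup_rdap

(results, failed_lookups, rated_lookups,
 ip_lookup_total, ip_unique_total, rir_stats) = bulk_lookup_rdap(addresses=ip_list)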

cdubz commented Jul 7, 2017

Ah yeah. I probably should have been recording that with all these tests, haha.

secynic commented Jul 8, 2017

I had a 30k IP list stop running lookups after ~6 hours. Added in a debug message to show a counter of IPs left, then ran a 10k list. There is definitely an infinite loop (not caused by dict.copy). I am still not sure what is causing it, and why it only happens for larger IP lists. Running some more tests...

secynic commented Jul 9, 2017

Found the problem: it was on the Cymru end. Their bulk ASN whois lookup was returning duplicates. My code didn't account for that, since I had already made the input IP list unique.

Basically, the internal counter for tracking RIRs was higher than the total number of IPs to look up. Once it got towards the end of the LACNIC list, it would continually loop, thinking there were more LACNIC IPs to look up.
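
The fix amounts to deduplicating the Cymru bulk response by IP before counting per-RIR totals; roughly (a sketch over already-parsed result entries, not the actual patch):

def dedupe_asn_results(parsed_entries):
    # parsed_entries: list of dicts like {'ip': ..., 'asn': ..., 'registry': ...}
    seen = set()
    unique = []
    for entry in parsed_entries:
        if entry['ip'] not in seen:  # keep only the first result per IP
            seen.add(entry['ip'])
            unique.append(entry)
    return unique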

cdubz commented Jul 10, 2017

Some more good results with a lowered socket_timeout. Managed to get a list of 10,000 IPs in 2h41m.

Test 21 (10,000 IPs, retry_count=5, socket_timeout=7)

I did get a couple socket errors with APNIC in this case, with a small group of the following messages along the way (blocking?): HTTP query socket error: [WinError 10054] An existing connection was forcibly closed by the remote host

secynic commented Jul 10, 2017

Interesting, they could be blocking or just congestion on their end. It looks like those IPs eventually got sorted out on retries. I am a bit leery of classifying that error (ConnectionResetError) as rate-limiting at the bulk_lookup_rdap level, as it will infinitely wait for timeout and retry. Let me think that over.

I pushed another update to dev, with rate-limit handling for the rest of the RIRs and new return/stats:
https://ipwhois.readthedocs.io/en/dev/ipwhois.html#ipwhois.experimental.bulk_lookup_rdap

secynic added a commit that referenced this issue Jul 11, 2017
secynic added a commit that referenced this issue Jul 11, 2017

cdubz commented Jul 11, 2017

One more test on the latest dev. The results don't look like quite as big an improvement as the original tests with an eight-second socket_timeout showed, but still good stuff.

Test 23 (20,000 IPs, retry_count=5, socket_timeout=8)

secynic commented Jul 15, 2017

Merged in #186. Leaving this open pending any last-minute tweaks for 1.0.0; I still need to update generate_examples.py/EXPERIMENTAL.rst.

secynic added a commit that referenced this issue Jul 20, 2017

secynic commented Jul 20, 2017

The RTD build is failing for the latest commit to dev (just added RST links). There is no indication or error message. Will try again later...

EDIT This is also failing for previously successful builds (latest/master). Seems to be an issue on their end.

EDIT 2 Others are having this issue: readthedocs/readthedocs.org#3006

secynic added a commit that referenced this issue Jul 30, 2017

secynic commented Jul 31, 2017

1.0.0 pushed to pypi

secynic closed this as completed Jul 31, 2017