Bulk query wrapper #134

Closed
secynic opened this issue Aug 13, 2016 · 30 comments

secynic commented Aug 13, 2016

Create a new wrapper in ipwhois.py for bulk lookups. Add rate-limit-optimized ordering of IP lookups.

secynic added this to the 1.0.0 stable milestone Aug 13, 2016
secynic self-assigned this Aug 13, 2016

secynic commented Apr 5, 2017

Some initial thoughts on design...

Requirements:

  • Maintain Python 2.6+, 3.3+ compatibility
  • Results must be the same as if run via IPWhois
  • Support WHOIS, RDAP, NIR
  • Utilize existing code where possible
  • Offline tests
  • Stress tests

Short term workflow (this issue):

  • Add a bulk ASN lookup function to query Cymru ASN Whois (https://www.team-cymru.org/IP-ASN-mapping.html#whois); see the sketch after this list
  • Lookup ASN info for all IPs first via above function (~500 IPs in 1.5 minutes)
  • Sort IPs based on ASN results for optimal overall query time (LACNIC is the only rate-limiting factor so far), and return sorted dict (IP -> ASN info mapping)
  • User takes sorted dict and runs IPWhois.lookup_*() as usual, bypassing ASN lookup
  • Fail on 1 timeout, rinse and repeat for failures
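
For the first bullet, a minimal sketch of what the bulk Cymru query could look like (netcat-style bulk interface per the link above; this is illustrative, not the final function):

import socket

def bulk_asn_whois_sketch(ip_list, timeout=120):
    # One bulk query to Team Cymru's whois server instead of one query per IP.
    data = 'begin\nverbose\n' + '\n'.join(ip_list) + '\nend\n'
    conn = socket.create_connection(('whois.cymru.com', 43), timeout)
    conn.sendall(data.encode())
    response = b''
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        response += chunk
    conn.close()
    # Each result line: AS | IP | BGP Prefix | CC | Registry | Allocated | AS Name
    # The Registry column maps each IP to its RIR, which drives the sort step.
    return response.decode(errors='ignore')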

Future Performance Considerations (issue(s) to be opened):

  • Multiprocessing
  • Optionally reduce queries by limiting lookups per subnet.
  • Optional KV store for tracking average response time per RIR/NIC, further optimization based on these averages

I appreciate any feedback/suggestions/testing.

EDIT Moved bulk ASN lookups to short term - Cymru was timing out on me for large individual queries

cdubz commented Apr 5, 2017

I'll be happy to do some testing (and take a shot at development, if needed).

Why the separation into two steps (ASN and then lookup_*)?

secynic commented Apr 5, 2017

Thanks.

ASN lookups are very fast, and the results map IPs to their RIR. We can then use this info to sort the IPs before actually performing the whois/RDAP lookups. Basically, we don't want a bunch of lookups to LACNIC in a row, since they rate limit aggressively. The sort should spread out those LACNIC IPs (or any other RIRs that decide to rate limit).

This method becomes more effective as the variation in the bulk IP list increases.
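
For illustration, the sort could round-robin across per-RIR buckets so no single (rate-limited) RIR is queried many times in a row; a minimal sketch, assuming an IP -> RIR mapping from the ASN step:

from collections import defaultdict

def interleave_by_rir(rir_by_ip):
    # rir_by_ip: dict of IP -> RIR name (e.g. 'lacnic', 'arin', ...) from the ASN results
    buckets = defaultdict(list)
    for ip, rir in rir_by_ip.items():
        buckets[rir].append(ip)
    ordered = []
    # Take one IP from each RIR in turn until all buckets are drained.
    while any(buckets.values()):
        for rir in list(buckets):
            if buckets[rir]:
                ordered.append(buckets[rir].pop(0))
    return ordered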

nstr10 commented May 9, 2017

This is a much-needed feature, thanks to aggressive rate limiting beyond even the published limits. I throttled my queries to 6 per minute and am still getting blocked within a couple of minutes. Even at that rate, my script would need to run for over 2 days to get the info I need.
Happy to help with testing any proposed solution.
EDIT: I've now throttled my script to 2 lookups per minute, only to get blocked again after 15 minutes. This is infuriating.

secynic commented May 19, 2017

I understand your frustration. I have a couple of questions...

  • Were all those blocks LACNIC?
  • What was your depth argument set to? (depth greater than 0 will cause more queries per IP)

I have been running a bunch of different tests, and am finally seeing some better results. I will update this post with more detailed sample size information.

I edited my design post above (moved bulk ASN lookups to this issue).

EDIT
Here are my stats right now for the random sample:

  • ARIN - 64 IPs
  • AFRINIC - 5 IPs
  • RIPE - 105 IPs
  • LACNIC - 137 IPs
  • APNIC - 291 IPs

All lookups completed in a total of 26m 8s. I was looking up 2 of every non-LACNIC IP for every 1 LACNIC IP. I had an internal timer set to not trigger the rate limit on LACNIC (9 per minute), and only hit the limit 3 times. The problem with the 2-1 method is that all the other lookups completed long before the LACNIC ones were done, and then it was just a matter of waiting for the remaining LACNIC IPs @ 9 per minute.

I know I can improve on this:

  • Make sample less random - add more IPs for ARIN, AFRINIC
  • Remove the 2 for every 1 rule (see next point)
  • Look up 9 LACNIC IPs in a row, then the others until 60 total seconds have passed, rinse, repeat (sketched below)
  • Adjust timeout - I was getting socket timeout with the default 5s for a bunch of LACNIC lookups
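
A rough sketch of that pacing idea (lookup_one is a placeholder for the per-IP whois/RDAP call; the 9-per-minute LACNIC budget is the figure mentioned above):

import time

def paced_lookups(lacnic_ips, other_ips, lookup_one, lacnic_per_minute=9):
    results = {}
    while lacnic_ips or other_ips:
        window_start = time.time()
        # Spend this minute's LACNIC budget first...
        batch, lacnic_ips = lacnic_ips[:lacnic_per_minute], lacnic_ips[lacnic_per_minute:]
        for ip in batch:
            results[ip] = lookup_one(ip)
        # ...then fill the remainder of the 60-second window with the other RIRs.
        while other_ips and time.time() - window_start < 60:
            ip = other_ips.pop(0)
            results[ip] = lookup_one(ip)
        # If LACNIC IPs remain, wait out the rest of the minute to stay under the limit.
        if lacnic_ips:
            time.sleep(max(0, 60 - (time.time() - window_start)))
    return results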

Let me make these changes, run some more tests, clean up the code, and I will update experimental.py on dev.

nstr10 commented Jun 6, 2017

Yes, all the blocks are from LACNIC. The list I'm trying to get information for is about 18k IPs. I haven't tried a depth argument, so likely whatever the default is. Honestly not sure what depth means or how to set it. I'll need to browse the package source and find out.

secynic commented Jun 14, 2017

If depth >= 1, it will run a query for each entity found by the main/root query and put the result in the objects dict. If depth >= 2 and a root entity's query result also has entities, it will query each of those as well. This continues recursively until depth is reached.

Basically, an IP has entities, which may also have entities, which may also have entities, and so on... Each unique entity results in a HTTP query.
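
For example (the IP here is just for illustration):

from ipwhois import IPWhois

# depth=0: only the root query; no per-entity HTTP queries are made.
# depth=1: also query each entity referenced by the root result.
# depth=2: additionally query the entities referenced by those entities, and so on.
result = IPWhois('74.125.225.229').lookup_rdap(depth=1)
print(result['objects'].keys())  # one entry per entity that was queried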

secynic commented Jun 30, 2017

@cdubz @nstr10 I could use some help testing on the dev branch against experimental.bulk_lookup_rdap() and experimental.get_bulk_asn_whois(). I have been testing 500-1000 random IPs at a time (with at least 100 LACNIC IPs), so I'm sure you can understand why I need to outsource some of this testing. Any feedback, suggestions, and query results/time taken will be helpful.

This is still far from perfect, but I am seeing some decent results. I definitely still need to add some exception handling, lookup limits, and rate/timeout tweaks.

LACNIC is giving me weird 400 error message failures for IPs sporadically. These are not rate-limit timeouts, and they eventually pull the correct information when I retry one or more times. I am thinking they are blocking/failing without actually giving the rate-limit error. Very frustrating...

Here is something I have been using to generate random IPv4 addresses:

import random
import socket
import struct

from ipwhois.utils import ipv4_is_defined

i = 0
ip_list = []
while i < 1000:
    # Pack a random 32-bit integer into dotted-quad notation.
    tmp = socket.inet_ntoa(struct.pack('>I', random.randint(1, 0xffffffff)))
    # Keep only addresses that are not reserved/defined (private, loopback, etc.).
    if not ipv4_is_defined(tmp)[0]:
        ip_list.append(tmp)
        i += 1
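
The list can then be fed to the functions under test, e.g. (keyword names here follow the dev branch docs at the time; the values are only examples):

from ipwhois.experimental import bulk_lookup_rdap

# Returns the RDAP results along with failure/rate-limit stats.
ret = bulk_lookup_rdap(addresses=ip_list, retry_count=3, depth=0,
                       socket_timeout=10, rate_limit_timeout=60)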

Thanks in advance.

secynic added a commit that referenced this issue Jun 30, 2017
secynic added a commit that referenced this issue Jun 30, 2017
secynic added a commit that referenced this issue Jun 30, 2017

cdubz commented Jun 30, 2017

Debug log and stats attached for two test runs with 1,000 random IPs. Happy to do more of these, or let me know if you are looking for any other kind of data.

cdubz commented Jun 30, 2017

Why not 6,000 IPs? Well, almost... lacnic :goberserk:

secynic commented Jun 30, 2017

Thanks, this is a good format for now (although I would like to see the rate limit stats). The larger the sample size, the better. Also feel free to adjust some of the timeout/rate limit arguments to see if things get better/worse.

I will work on returning some better stats.

The LACNIC failures (400) are happening randomly even if I hit a single IP in the browser. I am not sure what to do about this, other than create a kickstarter to improve their infrastructure haha.

cdubz commented Jul 1, 2017

Oops! I didn't notice the rate limit debug messages in there. Here is a re-run of the stats from those three tests with rate limit info.

cdubz commented Jul 1, 2017

Interesting... I was attempting a 3,000 IP test with socket_timeout set to 30. I left it to run for a while and came back to a pegged CPU. I haven't really done Python debugging before so I kludged my way through some things and can provide at least a bit of info from a gdb session.

Test 4 (1,000 IPs, socket_timeout=20)

Test 5 (3,000 IPs, socket_timeout=30) 💣

I was going to do a really large list while away for the day, but I don't want to get pegged while gone so I'll save that for another time.

cdubz commented Jul 2, 2017

Here are some more smaller runs. I'll try a larger list tonight as well -

Test 6 (500 IPs, rate_limit_timeout=30)

Test 7 (500 IPs, rate_limit_timeout=10)

Test 8 (500 IPs, rate_limit_timeout=10, socket_timeout=20)

Test 9 (1,000 IPs, rate_limit_timeout=10, socket_timeout=20)

Test 10 (2,000 IPs, rate_limit_timeout=10, socket_timeout=20)

Not seeing much difference with the LACNIC failure rates.

secynic commented Jul 2, 2017

Thanks, this is really helpful. I will go through what you have attached. I didn't expect much with improving the LACNIC failures, other than increasing the retry count.

The CPU utilization is definitely a concern. I know Python likes to utilize as much as possible. Could you provide some system stats (CPU, mem, OS)? Was it frozen, or just pegged at 100% for a bit?

cdubz commented Jul 2, 2017

One core was pegged at 100% and it was like that for at least 20 minutes or so before I got to it. Didn't appear to be frozen. It seemed like an infinite loop problem but I wasn't able to figure that out for sure. It was running on a pretty powerful machine - Core i5-4670K, 16GB memory, Linux Mint 18.1.

Crazily enough, it appears I have just reproduced it with another large set. About eight hours ago I kicked off a set of 50,000 IPs (haha) and it only made it through about 8,000 (because of the LACNIC delays). But it was still running until right this moment, when it has pegged a core at 100% again! This is with a different computer (laptop) - Core i3 M370, 8GB memory, Ubuntu 16.04.2. Interestingly, the pegged core changes from time to time.

Seems to point to the same line, experimental.py@L263 but again I'm not sure how to uncover much more than that.

Test 11 (50,000 IPs, rate_limit_timeout=10, socket_timeout=20) 💣

secynic commented Jul 7, 2017

Based on your output, it looks to be pegging on dict copy, which can be expected with very large results. I may have to limit the bulk size, or at least provide a default argument of maybe 1-5k. Python is greedy in that way, and will utilize as much of the available resources as possible. Reducing the ip_list size won't stop it from hogging resources, but it will take less time.

I am going to run more tests, add additional logging, and more stats as previously mentioned this weekend.

The only other testing I still need is tweaking with retry_count + a default/low socket_timeout. retry_count defaults to 3, so we could possibly see better results with LACNIC on those 400 errors if we retry a few more times (replicated in browser). This will also increase the total time, but might yield fewer overall failures.

Your extensive testing is greatly appreciated.

cdubz commented Jul 7, 2017

Interesting. So you think the hanging issue has to do with Python's own dict.copy method having issues with larger dicts? I was assuming my minimal understanding of gdb was leading to some sort of false-positive. Is there any way to work around using dict.copy there? I haven't taken a close enough look at what is actually going on. Setting an arbitrary limit on the IP list size just doesn't seem ideal, though I suppose it would be fair to expect users to handle larger lists in chunks of some sort.

cdubz commented Jul 7, 2017

From just a couple of tests, it looks like going to eight seconds with socket_timeout may produce some good results (test 18). When I tried five seconds, LACNIC timeouts went waaay up and APNIC even had a few (test 17). My stats don't really tell us much about the effectiveness of retry_count, I don't think. I am only counting failures per attempt and not recording when an IP fails entirely (by reaching the retry count). A debug message would be useful for that so I don't have to fiddle too much with my very simple log parser (:

Test 16 (1,000 IPs, retry_count=5)

Test 17 (1,000 IPs, retry_count=5, socket_timeout=5)

Test 18 (1,000 IPs, retry_count=5, socket_timeout=8) 👍

However, I managed to get another hung process. This time with only a 1,000 IP list. I started saving the IP lists to a file before each run so I could try running the same list again to see if maybe it is reproducible. I will report back on that when I try this list again:

Test 19 (1,000 IPs, retry_count=5, socket_timeout=8) 💣

I don't have any gdb output for this one as I have been running these tests on a Windows host today and don't have the appropriate tools to inspect. The behavior during the (10 minute long) hang was slightly different - the CPU was averaging 25% overall the entire time but no single core was pegged. Resource Monitor also noted the CPU was running at 120%+ "maximum frequency". I suspect the differences here don't tell us much, so noting just in case. This is an Intel Core i5-6440HQ, 16GB memory, Windows 10 Pro.

cdubz commented Jul 7, 2017

Unfortunately could not reproduce the issue with the same IP list. I figured it wouldn't happen, but it sure would've been nice to have a reproducible case...

However, there is more support for an eight-second socket_timeout; the LACNIC average attempt time was much lower.

Test 20 (1,000 IPs, socket_timeout=8) 👍

secynic commented Jul 7, 2017

Thanks. The dict.copy is needed there since you can't remove elements from the original dict while looping over it.
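
i.e., roughly this pattern (names and values here are only illustrative):

# Iterating over a snapshot allows removing entries from the live dict mid-loop.
pending = {'200.7.84.1': 'lacnic', '74.125.225.229': 'arin'}
for ip, rir in pending.copy().items():
    if rir == 'arin':  # stand-in for "this lookup is finished"
        del pending[ip]
print(pending)  # {'200.7.84.1': 'lacnic'}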

As far as debug logging goes, I will add that in there. Also note that the function is returning a tuple of some of the stats/info you are parsing for already:
(results, failed_lookups, rated_lookups, ip_lookup_total, ip_unique_total, rir_stats)
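
e.g., reusing the ip_list from the generator posted earlier:

from ipwhois.experimental import bulk_lookup_rdap

(results, failed_lookups, rated_lookups,
 ip_lookup_total, ip_unique_total, rir_stats) = bulk_lookup_rdap(addresses=ip_list)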

cdubz commented Jul 7, 2017

Ah yeah. I probably should have been recording that with all these tests, haha.

secynic commented Jul 8, 2017

I had a 30k IP list stop running lookups after ~6 hours. Added in a debug message to show a counter of IPs left, then ran a 10k list. There is definitely an infinite loop (not caused by dict.copy). I am still not sure what is causing it, and why it only happens for larger IP lists. Running some more tests...

secynic commented Jul 9, 2017

Found the problem: it was on the Cymru end. Their bulk ASN whois lookup was returning duplicates. My code didn't account for that, since I had already made the input IP list unique.

Basically, the internal counter for tracking RIRs was higher than the total number of IPs to look up. Once it got towards the end of the LACNIC list, it would continually loop, thinking there were more LACNIC IPs to look up.
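
The fix amounts to deduplicating the Cymru bulk response by IP before counting per-RIR totals; roughly (a sketch over already-parsed result entries, not the actual patch):

def dedupe_asn_results(parsed_entries):
    # parsed_entries: list of dicts like {'ip': ..., 'asn': ..., 'registry': ...}
    seen = set()
    unique = []
    for entry in parsed_entries:
        if entry['ip'] not in seen:  # keep only the first result per IP
            seen.add(entry['ip'])
            unique.append(entry)
    return unique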

cdubz commented Jul 10, 2017

Some more good results with a lowered socket_timeout. Managed to get a list of 10,000 IPs in 2h41m.

Test 21 (10,000 IPs, retry_count=5, socket_timeout=7)

I did get a couple socket errors with APNIC in this case, with a small group of the following messages along the way (blocking?): HTTP query socket error: [WinError 10054] An existing connection was forcibly closed by the remote host

secynic commented Jul 10, 2017

Interesting, they could be blocking or just congestion on their end. It looks like those IPs eventually got sorted out on retries. I am a bit leery of classifying that error (ConnectionResetError) as rate-limiting at the bulk_lookup_rdap level, as it will infinitely wait for timeout and retry. Let me think that over.

I pushed another update to dev, with rate-limit handling for the rest of the RIRs and new return/stats:
https://ipwhois.readthedocs.io/en/dev/ipwhois.html#ipwhois.experimental.bulk_lookup_rdap

secynic added a commit that referenced this issue Jul 11, 2017
secynic added a commit that referenced this issue Jul 11, 2017

cdubz commented Jul 11, 2017

One more test on the latest dev. The results don't look like quite as big an improvement as the original tests with an eight-second socket_timeout showed, but still good stuff.

Test 23 (20,000 IPs, retry_count=5, socket_timeout=8)

secynic commented Jul 15, 2017

Merged in #186. Leaving this open pending any last-minute tweaks for 1.0.0; I still need to update generate_examples.py/EXPERIMENTAL.rst.

secynic added a commit that referenced this issue Jul 20, 2017

secynic commented Jul 20, 2017

The RTD build is failing for the latest commit to dev (just added RST links). There is no indication or error message. Will try again later...

EDIT This is also failing for previously successful builds (latest/master). Seems to be an issue on their end.

EDIT 2 Others are having this issue: readthedocs/readthedocs.org#3006

secynic added a commit that referenced this issue Jul 30, 2017

secynic commented Jul 31, 2017

1.0.0 pushed to pypi

secynic closed this as completed Jul 31, 2017