feature request: capture IP address onto response object #1071
Comments
So, here's my question: why? What useful information does this provide?
This particular use-case is tracking the IP address for error reporting / troubleshooting / retries. A surprising number of failures/404s we've encountered have come from one of these two scenarios:

• legacy DNS records during switchover

In those situations, we're not guaranteed to have DNS resolve to the same upstream IP on a second call. The temporary fix has been relying on non-API internal implementation details, but those appear to be fragile.
Ok, so I think I need to better understand what's going on. For example, if you hit an error that raised an exception, you wouldn't have access to the response object, so having an IP on that object isn't particularly useful. Are you wanting to take some automated action based on this information, or simply to log it out?
I don't actually require this feature, but I have a potential use-case. Is there any way to see the IP on the response after a redirect?
@SethMichaelLarson I mean, yes, there is: you can look at the socket object and find it. But again, I don't know what problem we're really solving here. This boils down to a "tell me your real question" situation. I think that people have settled on "I need access to the IP address that a response came from" as the solution to a problem they have, but it's not clear to me that it's the right solution, any more than exposing the size of the TCP receive buffer on the socket would be a good solution to a problem with read timeouts.
Largely, yes. Exceptions are a concern... however, I've been laser-focused on not being able to reliably get the actual IP of a "valid response", and I've forgotten about them. So I'm talking about valid response objects (with the failures being in our application logic), but exceptions absolutely apply as well. By "valid responses", I mean that one of the above scenarios will often generate an HTTP 200 OK. With urllib3, I really just want the basic ability to log the IP of the remote server that was actually communicated with for a particular response. Here's a pseudocode example [not my own use-case, but this should illustrate things better]:
In the case above, there are 2 most-likely reasons why a URL may be missing the expected marker:
In order to properly audit this error, we need to log the actual IP address that responded to the request. Making a secondary call can return a different IP address, and relying on internal, undocumented API implementation details to find the open socket is very fragile and doesn't really work well.
So, let me propose a problem with this approach: it's not resilient to the presence of layer 7 middleboxes. Specifically, if you put a reverse proxy between urllib3 and the service you're communicating with, you immediately lose track of what is going on. Similarly, the presence of a forward proxy will also totally outfox this solution. Is it not a better idea to have servers put this information into the HTTP headers?
Oh, I don't need this functionality, just presenting a potential case. :) I think the best way to do this is probably, like @Lukasa said, via headers.
I don't control the remote servers. Even if I did, a misconfiguration of the servers (or DNS) would lead me right back to this problem. While it would be great if servers put the origin information into the HTTP headers, that is also distinctly different from the IP address that is providing the response. The existence of proxy servers could indeed create a problem if one were relying on the "upstream ip" to identify the "origin" -- but they also [perhaps more importantly] identify the source of the problem by pointing to that node. Because of how sockets work, urllib3 is operating as a black box regarding the upstream connection. Using the example from above -- if I run a test case 100 times, the TCP buffer size will be the same on every iteration. Most issues with the host machine and settings can be recreated across requests. The IP address for a given request, however, is subject to change across requests and is not guaranteed outside the scope of the connection. There is simply no way to reliably tell where the response came from (not the "origin", but the server that answered).
I tend towards -1 on this, although I could probably be convinced of the value of a DEBUG log entry during DNS lookup. It sounds like you're doing web scraping or something similar; if that's the case, then you might be better off making your system more resilient to issues like this. For example:

```python
from collections import deque
from logging import getLogger

from urllib3 import PoolManager

LOGGER = getLogger(__name__)

urls_to_check = [
    'url',
    'otherurl',
    'thirdurl',
]


class Scraper(object):
    def __init__(self, urls, token, max_tries=1):
        # Track how many times each URL has been attempted.
        self.urls = deque([(x, 0) for x in urls])
        self.pm = PoolManager()
        self.max_tries = max_tries
        self.token = token

    def scrape_next_url(self):
        url, tries = self.urls.popleft()
        result = self.pm.request('GET', url)
        tries += 1
        # result.data is bytes, so compare against the encoded token.
        if self.token.encode() not in result.data:
            if tries < self.max_tries:
                # Possibly an ephemeral failure; re-queue for another attempt.
                self.urls.append((url, tries))
                return None
            else:
                raise ValueError(
                    'Token missing at URL {} after {} attempts.'.format(url, tries))
        return result

    def scraped_content(self):
        while self.urls:
            try:
                parsed = self.scrape_next_url()
            except ValueError as e:
                LOGGER.warning(e)
                continue
            if parsed is not None:
                yield parsed


for scraped_page in Scraper(urls_to_check, 'my token', max_tries=3).scraped_content():
    # BUSINESS LOGIC GOES HERE
    pass
```

It's often hard to tell, but it seems as though your problem isn't looking up the IP address that was connected to when you had an error; it's telling when an error is ephemeral and taking appropriate action. Something like the above gives you a structure in which URLs with errors will be added back into the queue to be retried (and hopefully succeed); errors that we have reasonable confidence aren't ephemeral will eventually be raised up. Obviously, write your own code; the above has not been tested in any way whatsoever. One example of an alternate structure would be to save the failed URLs to a file with their retry count to be picked up as part of the next batch.
Well, not really. Each host machine may do any number of things differently, and that will affect low-level transport. What isn't clear to me is whether there is a better solution to this kind of problem. You say you don't control the origin servers: how are you detecting DNS failover if you don't own the machines?
A given host machine fairly reliably won't change its low-level transport protocols without user intervention or a restart.
My problem is that I need to know what upstream server urllib3 actually connected to. The only reliable way to do that is for urllib3 to note it; this cannot be determined after the fact. Using the undocumented internal API, on a given Python 2.7 machine, the active connection might be on (and only on) any one of the following:
We're not detecting the DNS failover, but would like to. Right now we conclude it after analysis of the error logs, depending on the system that detected the error and the error report that was generated. Some of our systems deal exclusively with clients/partners/vendors; others just look at random public internet sites.

• the current & historical IP is checked. These issues tend to happen most when someone is switching whitelabel or hosting providers -- so there is a relatively small pool of IP addresses that most of these issues happen with.

All this really just gives us a clue, though. If we were able to log the IP address along with our successes & failures, it would be much easier to pinpoint where an issue is (e.g. if a domain is serving 100% errors off IP-A and 100% successes off IP-B, that is a huge red flag). The current workaround has 2 limitations:
I appreciate @haikuginger's suggestion; however, that approach just says "hey, there may have been a problem" and tries its best to solve it. That is PERFECT for many needs, but not ours: it doesn't give us any of the data needed to actually diagnose and solve the problem. Our problem is logging the bit of information that can actually help us understand why an error occurred, so we can take appropriate measures (both automated and in-person). If we're getting 3 responses for a URL in 5 seconds, that's a potential issue with connectivity, and we need to know the relevant IPs to diagnose it.
@jvanasco, one other option would be for you to resolve the domain name to an IP address yourself and set the
M'kay, so I guess I am open to putting the IP address on a response object. It definitely feels odd, and we'll have to extract it early in the lifecycle, but we can do that. It feels like half a solution, but it does also feel like it's the only thing that will meaningfully resolve your issue.
@haikuginger I'm not sure that's really a good option (if it's an option at all). Here's why: urllib3 presently gets the DNS info and tries each address in succession. For @jvanasco to do that is a lot more work and a lot more tedious than urllib3 doing it, especially considering the level at which they're doing it and the fact that, if they first want to find an IP that they can connect to, they're creating sockets only to close them and have urllib3 open a new socket. That's really kind of awful.
Yeah, having to do the lookup+IP means coding around urllib3 (and avoiding the entire Python ecosystem around it) -- because of how redirects are handled. Also, I might be able to rephrase this request less oddly (or offensively). What if there were a so it would look something more like this:
instead of
I think whatever we do, we will want to put this in a "private" member attribute, to discourage people from relying on it too heavily. But I'm OK with putting an "our_name" and "peer_name" pair of attributes on the response object.
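Those two values map directly onto what the standard library already exposes on a connected socket: `getsockname()` for our side and `getpeername()` for the remote side. A tiny self-contained demonstration (the attribute names themselves are still just a proposal):

```python
import socket

# A throwaway local connection, just to show where the values come from.
server = socket.socket()
server.bind(("127.0.0.1", 0))   # port 0 -> OS picks a free port
server.listen(1)

client = socket.socket()
client.connect(server.getsockname())
conn, _ = server.accept()

our_name = client.getsockname()    # local (host, port) -- the proposed "our_name"
peer_name = client.getpeername()   # remote (host, port) -- the proposed "peer_name"

client.close()
conn.close()
server.close()
```

The point is that both values are only observable while the socket is alive, which is why urllib3 would have to capture them early in the response lifecycle.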
So @glyph has made a request over on httpie to be able to introspect a certificate that a server provided on a response. As I pointed out there, that requires us (urllib3) to provide that on the response object (or somewhere). It seems people want some level of ability to debug parts of the request/response cycle. I think some kind of
I'm open to doing a debug information object if we think that will be helpful. We need to be cautious to see how this interacts with v2. |
Calling this information "debug" information is a little misleading. I might want to inspect attributes of the certificate to decide how I want to process the response, or (as the original requestor put it) I might want to gather IP addresses or analytics or compliance (via geoip) reasons. |
What possible decision can you be making based on the certificate that late in the connection process? |
bumping this back up as I'd like to stop using janky workarounds and try to sketch out the first draft of a PR
In terms of "why" the SSL data is important late in the game: I can imagine glyph's concern is largely about compliance and recordkeeping (otherwise he'd want a hook for inspection). Often in finance / medicine / government work one needs to create a paper trail of where things were sent.
Bubbling this up again, because I'd love to start working on a solution if there is one. I'm at the point where I'd like to have the certificate info that @glyph mentioned as well. I'm using In terms of "why", I need to get the certificate type (DV/OV/EV), the CA, and the CN/SANs from the certificate. I'm not sure of the best way to handle the SSL stuff, as the handling is also installation/platform dependent. On Python 2 there are at least 2 ways: if pyOpenSSL is available, we'll only have the subjectAltName and subject; if it's not available, we have the full certificate info. Does anyone know:
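On the stdlib side (no pyOpenSSL), the `ssl` module can hand back the verified peer certificate as a dict, which carries the subject, issuer, and subjectAltName fields mentioned above. A rough sketch, with `peer_cert` as my own helper name and `host` as a placeholder:

```python
import socket
import ssl


def peer_cert(host, port=443):
    # Returns the peer certificate as a dict. With stdlib ssl, the dict
    # form is only populated when the certificate was actually verified.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port)) as tcp:
        with ctx.wrap_socket(tcp, server_hostname=host) as tls:
            return tls.getpeercert()


# usage sketch:
# cert = peer_cert("example.com")
# cert["subject"], cert["issuer"], cert.get("subjectAltName")
```

Note that this opens its own connection, so it has the same "second lookup may hit a different server" problem as the DNS workarounds; it only illustrates what data the stdlib exposes.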
Yeah, I like the idea of a separate object that contains much of this information. I'm still nervous, however, about how this will interact with/affect v2 and the async work that @njsmith and others are doing.
It sounds like some coordination would be good, but there's nothing fundamentally difficult about pulling out the IP and certificate on our branch. The main thing is that we'd need to add some method to the abstract backend interface to expose the IP, and then implement it on the different backends. The urllib3 core already needs access to the raw cert information, in order to implement cert pinning. Cross-link: #1323 CC: @pquentin |
Given:
Perhaps it would make sense to simply define and scope a "DebugObject" API now. That would allow some of us to generate PRs that implement the API requirements now, and then worry about the future versions of the library later (as there is current disagreement on the "how").
Hi, I read the thread, and I'm not sure my question is legitimate, but I'll try to explain the use case (maybe it is the same as @jvanasco's). For testing purposes, I need to know the IP address of who is answering my call, and maybe the IP address of who is making the call (the machine that runs the Python script). I understand that
Is this right? Can I follow this workaround? https://stackoverflow.com/questions/22492484/how-do-i-get-the-ip-address-from-a-http-request-using-the-requests-library Thanks.
Hi, I've been using the third solution for many months with no clear issues. Good luck!
The "idea" is generally approved, but there's no consensus on how it should be implemented. I'm currently hoping the maintainers will define/approve the API for a "DebugObject" to hold this type of information, so that I and others can generate the PRs to implement it. If you're using urllib through that's what I use in a Python package that I maintain:
Oh! So using it in the above example would be:
I'd like to +1 on the exception. For us it is important to know the IP address if a request fails, because we need it to open a support ticket with the CDN. They don't know which server is affected otherwise (due to DNS load balancing).
I would also like to +1 this request. Knowing the IP address is essential for many use-cases, for example rate-limiting requests per IP address. I don't think any of the proposed workarounds really work consistently.
This solution works fine for me.
After quite some time experimenting with different methods, I've found a workaround that works consistently:

```python
import socket

import requests
from urllib3.util import parse_url

url = "https://example.com/"  # any URL

rsp = requests.get(url, stream=True)
# Note: this re-resolves the hostname of the final URL, so it yields the
# address DNS returns *now*, not necessarily the one that answered.
ip = socket.gethostbyname(parse_url(rsp.url).hostname)
print(ip)
```

Hope this helps someone.
@misotrnka that works if and only if there's a single IP address in the DNS response, not if there are multiple.
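The multiple-record case is easy to see with the standard library: `getaddrinfo` can return several addresses for one name, any of which the original connection might have used (`"localhost"` here is just a convenient example host):

```python
import socket

# All addresses the resolver offers for one name. gethostbyname() returns
# just one of these, with no guarantee it matches a previous connection.
addrs = {
    info[4][0]
    for info in socket.getaddrinfo("localhost", 80, proto=socket.IPPROTO_TCP)
}
print(addrs)  # often more than one entry, e.g. an IPv4 and an IPv6 address
```

With round-robin DNS or a CDN, the set can also change between calls, which is why resolving after the fact can't reconstruct where a response came from.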
Elaborating off what @sigmavirus24 said - it's not a workaround. You are creating a second request and obtaining DNS info from that. You may consistently get an ip address off that method, but there is no guarantee the ip address was associated with the first request. When dealing with domains that are fronted by CDNs or Load Balancers, there is a decreased chance the information will match up. |
True, but I'm not sure what other options there are if one wants to continue using
To be clear, @misotrnka, I'm not saying you're bad or that you shouldn't have posted that. I'm clarifying for others that your solution addresses only a narrow sliver of this larger problem.
The technique I shared above, used in my library metadata_parser, should work in 99% of cases (using a hook to inspect the connection before reading any data). We have used this library to process well over a billion pages under Python 2 and have not had issues with it. It's been used by a few dozen other companies under Python 2 and Python 3, and no one has voiced issues with it. If this approach did not work for you, I would like to know about it, so I can write appropriate tests and adjust my library to cover them. The only thing holding me back from issuing a PR to urllib3 on this is that I want to combine it with the SSL certificate tracking, and that is nowhere near done.
FWIW, here's my use case: I run a fleet of servers answering various web requests. That fleet has some number of hosts which are failing requests 10% of the time for $SOMEREASON. Requests/urllib3 doesn't help me solve this problem, because they can't show me the IP they connected to which gave me the error. Problems like this crop up ALL THE TIME in my line of work. I'd love to knock up some quick requests + sessions + loop magic and just poke the servers until I get some errors, and then inspect those errors to figure out which servers to poke at next.
One possible approach is writing a custom Set the pool manager's |
I also need this ability in my line of work. As @andreabisello mentioned, I'm using this (Python 3 only; it can be adjusted to work with Python 2):
pyOpenSSL support is deprecated and will be removed in a future 2.x release (#2691).
Is there any update on this? The v2.0 roadmap mentioned "IP Address resolved by DNS" under "Tracing" but I cannot find anything about tracing in the Changelog. |
this is an extension of a request from the requests library (https://github.com/kennethreitz/requests/issues/2158)

I recently ran into an issue with the various "workarounds", and have been unable to consistently access an open socket across platforms, environments, or even servers queried (the latter might be from unpredictable timeouts).

it would honestly be great if the remote IP address were cached onto the response object by this library.