
feature request: capture IP address onto response object #1071

Open · jvanasco opened this issue Dec 14, 2016 · 44 comments

@jvanasco commented Dec 14, 2016

This is an extension of a request made against the requests library (https://github.com/kennethreitz/requests/issues/2158).

I recently ran into an issue with the various "workarounds", and have been unable to consistently access an open socket across platforms, environments, or even servers queried (the latter might be due to unpredictable timeouts).

It would honestly be great if this library cached the remote IP address onto the response object.

@Lukasa (Contributor) commented Dec 14, 2016

So, here's my question: why? What useful information does this provide?

@jvanasco (Author) commented

This particular use case is tracking the IP address for error reporting / troubleshooting / retries.

A surprising number of failures/404s we've encountered have come from one of these two scenarios:

• legacy DNS records during switchover
• one node in a round-robin DNS setup is not configured / no longer configured

In those situations, we're not guaranteed to have the DNS resolve to the same upstream IP on a second call. The temporary fix has been relying on non-API internal implementation details, but those appear to be fragile.

@Lukasa (Contributor) commented Dec 14, 2016

Ok, so I think I need to better understand what's going on. For example, if you hit an error that raised an exception you wouldn't have access to the response object, so having an IP on that object isn't particularly useful.

Are you wanting to take some automated action based on this information, or simply to log it out?

@sethmlarson (Member) commented

I don't actually require this feature, but I have a potential use case. Is there any way to see the IP on the response after a redirect?

@Lukasa (Contributor) commented Dec 14, 2016

@SethMichaelLarson I mean, yes, there is: you can look at the socket object and find it. But again, I don't know what problem we're really solving here.

This boils down to a "tell me your real question" situation. I think that people have settled on "I need access to the IP address that a response came from" as the solution to a problem they have, but it's not clear to me that it's the right solution, any more than exposing the size of the TCP receive buffer on the socket would be a good solution to a problem with read timeouts.

@jvanasco (Author) commented

Largely, yes. Exceptions are a concern... however, I've been laser-focused on not being able to reliably get the actual IP of a "valid response", and I'd forgotten about them. So I'm talking about valid response objects (with the failures being in our application logic), but exceptions absolutely apply as well. By "valid responses", I mean that one of the above scenarios will often generate an HTTP 200 OK.

With urllib3, I really just want the basic ability to log the IP of the remote server that was actually communicated with for a particular response.

Here's a pseudocode example [not my own use-case, but this should illustrate things better]:

import urllib3

http = urllib3.PoolManager()

# get the response
url = 'http://example.com'
r = http.request('GET', url)

# look for a marker in the response body (r.data is bytes)
marker = b'expected-marker'
if marker not in r.data:
    raise ValueError("missing marker")

In the case above, there are two most likely reasons why a URL may be missing the expected marker:

  1. It's not there.
  2. There was a DNS issue with switchover or round-robin.

In order to properly audit this error, we need to log the actual IP address that responded to the request. Making a secondary call can return a different IP address, and relying on the internal, undocumented API implementation to find the open socket is very fragile and doesn't really work well.

@Lukasa (Contributor) commented Dec 14, 2016

So, let me propose a problem with this approach: it's not resilient to the presence of layer 7 middleboxes. Specifically, if you put a reverse proxy between urllib3 and the service you're communicating with, you immediately lose track of what is going on. Similarly, the presence of a forward proxy will also totally outfox this solution.

Is it not a better idea to have servers put this information into the HTTP headers?

@sethmlarson (Member) commented

Oh, I don't need this functionality, just presenting a potential case. :) I think the best way to do this is probably, as @Lukasa said, via headers.

@jvanasco (Author) commented

I don't control the remote servers. Even if I did, a misconfiguration of the servers (or DNS) would lead me right back to this problem. While it would be great if servers put the origin information into the HTTP headers, that is also distinctly different from the IP address that is providing the response.

The existence of proxy servers could indeed create a problem if one were relying on the "upstream IP" to identify the "origin" -- but they also [perhaps more importantly] identify the source of the problem by pointing to that node.

Because of how sockets work, urllib3 is operating as a black box regarding the upstream connection. Using the example from above -- if I run a test case 100 times, the TCP buffer size will be the same on every iteration. Most issues with the host machine and settings can be recreated across requests. The IP address for a given request, however, is subject to change across requests and is not guaranteed outside the scope of the connection. There is simply no way to reliably tell where the response came from (not as the "origin" but as the server).

@haikuginger (Contributor) commented Dec 14, 2016

I tend towards -1 on this, although I could probably be convinced of the value of a DEBUG log entry during DNS lookup. It sounds like you're doing web scraping or something similar; if that's the case, then you might be better off making your system more resilient to issues like this. For example:

from collections import deque
from logging import getLogger

from urllib3 import PoolManager

LOGGER = getLogger(__name__)

urls_to_check = [
    'url',
    'otherurl',
    'thirdurl',
]

class Scraper(object):

    def __init__(self, urls, token, max_tries=1):
        self.urls = deque([(x, 0) for x in urls])
        self.pm = PoolManager()
        self.max_tries = max_tries
        # response.data is bytes, so the token must be bytes too
        self.token = token.encode() if isinstance(token, str) else token

    def scrape_next_url(self):
        url, tries = self.urls.popleft()
        result = self.pm.request('GET', url)
        tries += 1
        if self.token not in result.data:
            if tries < self.max_tries:
                # requeue the URL for another attempt
                self.urls.append((url, tries))
                return None
            else:
                raise ValueError('Token missing at URL {} after {} attempts.'.format(url, tries))
        return result

    def scraped_content(self):
        while len(self.urls):
            parsed = None
            try:
                parsed = self.scrape_next_url()
            except ValueError as e:
                LOGGER.warning(e)
            if parsed is not None:
                yield parsed

for scraped_page in Scraper(urls_to_check, 'my token', max_tries=3).scraped_content():
    # BUSINESS LOGIC GOES HERE
    pass

It's often hard to tell, but it seems as though your problem isn't looking up the IP address that was connected to when you had an error; it's telling when an error is ephemeral and taking appropriate action. Something like the above gives you a structure in which URLs with errors will be added back into the queue to be retried (and hopefully succeed); errors that we have reasonable confidence aren't ephemeral will eventually be raised up.

Obviously, write your own code; the above has not been tested in any way whatsoever. One example of an alternate structure would be to save the failed URLs to a file with their retry count to be picked up as part of the next batch.

@Lukasa (Contributor) commented Dec 14, 2016

> Using the example from above -- if I run a test-case 100 times, the tcp buffer size will be the same on every iteration. Most issues with the host-machine and settings can be recreated across requests.

Well, not really. Each host machine may do any number of things differently, and that will affect low-level transport. What isn't clear to me is whether there is a better solution to this kind of problem.

You say you don't control the origin servers: how are you detecting DNS failover if you don't own the machines?

@jvanasco (Author) commented

> Each host machine may do any number of things differently

A given host machine can fairly reliably be expected not to change its low-level transport behavior without user intervention or a restart.

> What isn't clear to me is whether there is a better solution to this kind of problem.

My problem is that I need to know which upstream server urllib3 actually connected to. The only reliable way to do that is for urllib3 to note it. This cannot be determined after the fact.

Using the undocumented internal API on a given Python 2.7 machine, the active connection might be on (and only on) any one of the following:

r._connection.sock.socket
r._fp.fp._sock.socket
r._fp.fp._sock
r._fp.fp.raw._sock

> You say you don't control the origin servers: how are you detecting DNS failover if you don't own the machines?

We're not detecting the DNS failover, but would like to. We're concluding it after analysis of the error logs, depending upon the system that detected the error and the error report that was generated. Some of our systems deal exclusively with clients/partners/vendors; others just look at random public internet sites.

• the current & historical IP is checked. These issues tend to happen most when someone is switching whitelabel or hosting providers -- so there is a relatively small pool of IP addresses that most of these issues happen with.
• we started automatically monitoring the DNS of domains with many errors (hourly, for 72 hours).
• there are manual reviews/monitors too

All this really just gives us a clue, though. If we were able to log the IP address along with our successes & failures, it would be much easier to pinpoint where an issue is (e.g. if a domain is serving 100% errors off IP-A and 100% successes off IP-B, that is a huge red flag).

The current workaround has two limitations:

  1. depending on the url/response, [under Python 2.7] an active socket can be on (at least) any one of 4 internal attributes (a defensive helper covering these paths is sketched below):

     sock = r._connection.sock
     sock = r._connection.sock.socket
     sock = r._fp.fp._sock
     sock = r._fp.fp._sock.socket

  2. because the socket is ephemeral, it is inaccessible across redirects (in this package or something invoking it, like requests)
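
For illustration, here's a minimal, untested sketch of such a defensive helper (get_peername_fragile is a name I'm making up here; every path it probes is an undocumented internal that varies by Python/urllib3 version):

def get_peername_fragile(r):
    # probe each known private path to the underlying socket;
    # return (ip, port) from getpeername(), or None if nothing matched
    paths = (
        lambda r: r._connection.sock,
        lambda r: r._connection.sock.socket,
        lambda r: r._fp.fp._sock,
        lambda r: r._fp.fp._sock.socket,
        lambda r: r._fp.fp.raw._sock,
    )
    for path in paths:
        try:
            sock = path(r)
        except AttributeError:
            continue
        if sock is not None and hasattr(sock, 'getpeername'):
            return sock.getpeername()
    return None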

I appreciate @haikuginger's suggestion; however, that approach just says "hey, there may have been a problem" and tries its best to solve it. That is PERFECT for many needs, but not ours. It doesn't give us any of the data needed to actually diagnose and solve the problem. Our problem is logging the bit of information that can actually help us understand why an error occurred, so we can take appropriate measures (both automated and in-person). If we're getting 3 responses for a url in 5 seconds, that's a potential connectivity issue and we need to know the relevant IPs to diagnose it.

@haikuginger (Contributor) commented

@jvanasco, one other option would be for you to resolve the domain name to an IP address yourself and set the Host header to the original domain name.

@Lukasa (Contributor) commented Dec 16, 2016

M'kay, so I guess I am open to putting the IP address on a response object. It definitely feels odd, and we'll have to extract it early in the lifecycle, but we can do that. It feels like half a solution, but it does also feel like it's the only thing that will meaningfully resolve your issue.

@sigmavirus24 (Contributor) commented

> one other option would be for you to resolve the domain name to an IP address yourself and set the Host header to the original domain name.

@haikuginger I'm not sure that's really a good option (if it's an option at all). Here's why: urllib3 presently will get the DNS info and try each address in succession. For @jvanasco to do that is a lot more work, and a lot more tedious, than urllib3 doing it, especially considering the level at which they're working and the fact that, if they first want to find an IP they can connect to, they're creating sockets only to close them and have urllib3 open a new socket. That's really kind of awful.

@jvanasco (Author) commented

Yeah, having to do the lookup + IP means coding around urllib3 (and avoiding the entire Python ecosystem around it), because of how redirects are handled.

Also, I might be able to rephrase this request less oddly (or offensively). What if there were a debug object on the response/error objects that had a socket_peername attribute? That is, effectively, the IP address (+ port), but abstracted away from the "ip address" being an actual attribute of the response object.

So it would look something more like this:

 r.debug.socket_peername 

instead of

r.ip_address

@Lukasa (Contributor) commented Dec 19, 2016

I think whatever we do, we will want to put it in a "private" member attribute, to discourage people from relying on it too heavily. But I'm OK with putting an "our_name" and "peer_name" pair of attributes on the response object.
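
Concretely, that hypothetical pair would map onto the two ends of the socket (neither attribute exists today; getsockname()/getpeername() are the standard socket calls they would wrap):

r._our_name   # (local_ip, local_port), as from sock.getsockname()
r._peer_name  # (remote_ip, remote_port), as from sock.getpeername()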

@sigmavirus24 (Contributor) commented

So @glyph has made a request over on httpie to be able to introspect a certificate that a server provided on a response. As I pointed out there, that requires us (urllib3) to provide that on the response object (or somewhere).

It seems people want some level of ability to debug parts of the request/response cycle. I think some kind of DebugInformation object might actually be worthwhile. We could store the resolved IP address and parsed certificate information there with ease. I don't know what else we might care about storing but I don't think all of this belongs on a response object or stuffed into unreliable private attributes on a response object. Maybe if the debug object is private that would be enough.
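
As a rough, hypothetical sketch of the shape (none of these names exist in urllib3; they only illustrate what such a container might hold):

class DebugInformation(object):
    # hypothetical container; names are illustrative only
    def __init__(self, peer_address=None, peer_certificate=None):
        self.peer_address = peer_address          # (ip, port) from getpeername()
        self.peer_certificate = peer_certificate  # parsed certificate dict, if TLS was used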

@Lukasa (Contributor) commented Dec 20, 2016

I'm open to doing a debug information object if we think that will be helpful. We need to be cautious to see how this interacts with v2.

@glyph commented Dec 20, 2016

Calling this information "debug" information is a little misleading. I might want to inspect attributes of the certificate to decide how I want to process the response, or (as the original requestor put it) I might want to gather IP addresses for analytics or compliance (via GeoIP) reasons.

@Lukasa (Contributor) commented Dec 20, 2016

What possible decision can you be making based on the certificate that late in the connection process?

@jvanasco (Author) commented

Bumping this back up, as I'd like to stop using janky workarounds and try to sketch out the first draft of a PR.

  1. What if the object were ConnectionInformation (instead of DebugInformation) and the attribute were connection_info? That would cover @glyph's concern, while still abstracting this stuff away from the core attributes.

  2. The data we're talking about preserving is:

  • remote socket data

  • ssl certificate data

    r.connection_info = ConnectionInfo()
    r.connection_info.socket_peername = (ip, port)
    r.connection_info.ssl_certificate = ?

In terms of "why" the SSL data is important late in the game, I imagine glyph's concern is largely about compliance and recordkeeping (otherwise he'd want a hook for inspection). Often in finance / medicine / government work, one needs to create a paper trail of where things were sent.

@jvanasco (Author) commented

Bubbling this up again, because I'd love to start working on a solution if there is one. I'm at the point where I'd like to have the certificate info that @glyph mentioned as well. I'm using urllib3 through requests and have been inserting hooks at index 0 to handle the peername and peercert. (If anyone needs the code for their usage, I'd be happy to put together a gist.)

In terms of "why": I need to get the certificate type (DV/OV/EV), the CA, and the CN/SANs from the certificate.

I'm not sure of the best way to handle the SSL stuff, as the handling is also installation/platform dependent.

On Python 2 there are at least two ways:

If pyOpenSSL is available, we'll only have the subjectAltName and subject.

If it's not available, we have the full certificate info.

Does anyone know:

  1. Is it possible to get all the info via pyOpenSSL?
  2. Are there additional contexts that may wrap the SSL data?
import logging

import requests
from ssl import SSLSocket

try:
    import urllib3.contrib.pyopenssl as pyopenssl
except ImportError:
    pyopenssl = None

log = logging.getLogger(__name__)

def get_response_cert(resp):
    """
    used to get the certificate from the request

    IMPORTANT: this must happen BEFORE any content is consumed.

    SSLSocket
    {'subjectAltName': (('DNS', 'findmeon.com'),), 'notBefore': u'Feb 14 05:59:14 2018 GMT', 'caIssuers': (u'http://cert.int-x3.letsencrypt.org/',), 'OCSP': (u'http://ocsp.int-x3.letsencrypt.org',), 'serialNumber': u'0302EFD66D1701C39A7C6B5540A174E5E011', 'notAfter': 'May 15 05:59:14 2018 GMT', 'version': 3L, 'subject': ((('commonName', u'findmeon.com'),),), 'issuer': ((('countryName', u'US'),), (('organizationName', u"Let's Encrypt"),), (('commonName', u"Let's Encrypt Authority X3"),))}

    pyopenssl.WrappedSocket
    {'subjectAltName': [('DNS', 'findmeon.com')], 'subject': ((('commonName', u'findmeon.com'),),)}
    """
    if not isinstance(resp, requests.models.Response):
        # raise AllowableError("Not a HTTPResponse")
        log.debug("Not a HTTPResponse | %s", resp)
        return None

    # return the cached value, if present
    if hasattr(resp, '_mp_cert'):
        return resp._mp_cert

    def _get_socket():
        # private attribute; only valid while the connection is open
        sock = resp.raw._connection.sock
        if pyopenssl:
            if isinstance(sock, pyopenssl.WrappedSocket):
                return sock
        else:
            if isinstance(sock, SSLSocket):
                return sock
        return None

    sock = _get_socket()
    if sock:
        # only cache if we have a sock;
        # we may want/need to call again
        resp._mp_cert = sock.getpeercert()
        return resp._mp_cert
    return None
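
As a partial answer to question 1 above, an untested sketch: urllib3's WrappedSocket keeps the underlying OpenSSL connection on its (internal) .connection attribute, and pyOpenSSL's X509 object exposes the full certificate. Whether this covers every needed field is an assumption:

from OpenSSL import crypto

def full_cert_via_pyopenssl(wrapped_sock):
    # wrapped_sock.connection is the OpenSSL.SSL.Connection (an internal detail)
    x509 = wrapped_sock.connection.get_peer_certificate()
    if x509 is None:
        return None
    return {
        'subject': x509.get_subject().get_components(),
        'issuer': x509.get_issuer().get_components(),
        'serialNumber': x509.get_serial_number(),
        'notBefore': x509.get_notBefore(),
        'notAfter': x509.get_notAfter(),
        'pem': crypto.dump_certificate(crypto.FILETYPE_PEM, x509),
    }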

@sigmavirus24 (Contributor) commented

Yeah, I like the idea of a separate object that contains much of this information. I'm still nervous, however, about how this will interact with/affect v2 and the async work that @njsmith and others are doing.

@njsmith (Contributor) commented Feb 24, 2018

It sounds like some coordination would be good, but there's nothing fundamentally difficult about pulling out the IP and certificate on our branch. The main thing is that we'd need to add some method to the abstract backend interface to expose the IP, and then implement it on the different backends. The urllib3 core already needs access to the raw cert information, in order to implement cert pinning.

Cross-link: #1323

CC: @pquentin

@jvanasco (Author) commented

Given:

  1. people are generally supportive of a debug object carrying the remote IP address and certificate, and
  2. the issue moving forward is the future library changes,

perhaps it would make sense to simply define and scope a "DebugObject" API now.

That would allow some of us to generate PRs that implement the API requirements now, and worry about future versions of the library later (as there is current disagreement on the 'how').

@andreabisello commented
Hi,

I read the thread, and I'm not sure my question is legitimate, but I'll try to explain the use case (it may be the same as @jvanasco's).

For testing purposes, I need to know the IP address of whoever is answering my call, and maybe the IP address of whoever is making the call (the machine that runs the Python script).

I understand that:

  • these features are not available now
  • the requested features are approved to be implemented
  • @jvanasco is asking for a pull request from someone who implements them

Is this right?

Can I follow this workaround? https://stackoverflow.com/questions/22492484/how-do-i-get-the-ip-address-from-a-http-request-using-the-requests-library

Thanks.

@scontini76 commented
Hi,

I've been using the third solution for many months with no clear issues. Good luck!

rsp = requests.get(..., stream=True)
rsp.raw._connection.sock.getpeername()

@jvanasco (Author) commented

> • the requested features are approved to be implemented
> • @jvanasco is asking for a pull request from someone who implements them

The "idea" is generally approved, but there's no consensus on how it should be implemented. I'm currently hoping the maintainers will define/approve the API for a "DebugObject" to hold this type of information, so that I and others can generate the PRs to implement it.

If you're using urllib3 through requests, I suggest using a session hook to grab the data. Unless you're using multiple plugins/tools that define session hooks, it will run at the right time on every request.

That's what I use in a Python package that I maintain:

  1. Inspect the response. I found 4 different ways the peername can be obtained; there may be more. https://github.com/jvanasco/metadata_parser/blob/master/metadata_parser/__init__.py#L266-L303

  2. Define a requests hook to trigger the inspection:
     https://github.com/jvanasco/metadata_parser/blob/master/metadata_parser/__init__.py#L317

     def response_peername__hook(resp, *args, **kwargs):
         get_response_peername(resp)
         # do not return anything

  3. Insert the hook into the session:
     https://github.com/jvanasco/metadata_parser/blob/master/metadata_parser/__init__.py#L1409-L1410

     requests_session = requests.Session()
     requests_session.hooks['response'].insert(0, response_peername__hook)  # must be first

@jvanasco (Author) commented

Oh!

The above get_response_peername function caches the peername value onto response._mp_peername ("mp" stands for 'metadata_parser', the library's name) on first run. Subsequent runs return that cached value.

So using it in the above example would be:

response = requests_session.get('https://example.com')
peername = get_response_peername(response)  # the value was already cached by the hook

response = requests_session.get('https://example.com')
content = response.text
peername = get_response_peername(response)  # the value was already cached; this would fail otherwise, because the connection is closed

@fwjavox commented Jun 24, 2019

> Largely, yes. Exceptions are a concern... however I've been laser focused on not being able to reliably get the actual IP of a "valid response", and I've forgotten about them.

I'd like to +1 on the exception. For us it is important to know the IP address if a request fails, because we need it to open a support ticket with the CDN. They don't know which server is affected otherwise (due to DNS load balancing).
Is there any workaround to store the IP address in the exception object (maybe RequestError or HTTPError) to be able to read it back later on?
Thanks in advance!

@misotrnka commented
I would also like to +1 this request. Knowing the IP address is essential for many use cases, for example rate-limiting requests per IP address. I don't think any of the proposed workarounds really work consistently.

@yaojenkuo commented
This solution works fine for me.
python: 3.7.5
requests: 2.22.0

rsp = requests.get(..., stream=True)
rsp.raw._connection.sock.socket.getpeername()

@misotrnka commented
After quite some time experimenting with different methods, I've found a workaround that works consistently:

import socket

import requests
from urllib3.util import parse_url

url = 'http://example.com'
rsp = requests.get(url, stream=True)
ip = socket.gethostbyname(parse_url(rsp.url).hostname)
print(ip)  # outputs an IP address for the host, from a fresh DNS lookup

Hope this helps someone.

@sigmavirus24 (Contributor) commented

@misotrnka that works if and only if there's a single IP address in the DNS response, and not if there are multiple.
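
For illustration, the stdlib makes the distinction visible: socket.gethostbyname_ex() returns every address a name resolves to, while gethostbyname() picks just one, which need not be the address the earlier request actually used:

import socket

# returns (canonical_name, alias_list, ip_address_list)
hostname, aliases, addresses = socket.gethostbyname_ex('example.com')
print(addresses)  # a round-robin name may list several addresses here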

@jvanasco (Author) commented Jun 9, 2020

> After quite some time experimenting with different methods, I've found a workaround that works consistently:

Elaborating on what @sigmavirus24 said: it's not a workaround. You are performing a second DNS lookup and obtaining the address from that. You may consistently get an IP address with that method, but there is no guarantee that the IP address was associated with the first request. When dealing with domains fronted by CDNs or load balancers, there is a decreased chance the information will match up.

@misotrnka commented Jun 9, 2020

True, but I'm not sure what other options there are if one wants to continue using requests. You are right, though, that this wouldn't work with more complex DNS setups.

@sigmavirus24 (Contributor) commented

> True, but I'm not sure what other options there are if one wants to continue using requests. You are right, though, that this wouldn't work with more complex DNS setups.

To be clear, @misotrnka, I'm not saying you're a bad person or that you shouldn't have posted that. I'm clarifying for others that your solution addresses only a narrow sliver of this larger problem.

@jvanasco (Author) commented Jun 9, 2020

The technique I shared above, used in my library metadata_parser, should work in 99% of cases (using a hook to inspect the connection before reading any data). We have used this library to process well over a billion pages under Python 2 and have not had issues with it. It's been used by a few dozen other companies under Python 2 and Python 3, and no one has voiced issues with it.

If this approach did not work for you, I would like to know about it, so I can write appropriate tests and adjust my library to cover them. The only thing holding me back from issuing a PR to urllib3 on this is that I want to combine it with the SSL certificate tracking, and that is nowhere near done.

@zapman449 commented
FWIW, here's my use case: I run a fleet of servers answering various web requests. That fleet has some number of hosts which are failing requests 10% of the time for $SOMEREASON.
However, I don't know which of those hosts is failing, so I must narrow them down to figure out what $SOMEREASON is.

Requests/urllib3 doesn't help me solve this problem, because they can't show me the IP they connected to that gave me the error.

Problems like this crop up ALL THE TIME in my line of work. I'd love to knock up some quick requests + sessions + loop magic and just poke the servers until I get some errors, and then inspect those errors to figure out which servers to poke at next.

@pcatach commented Jul 8, 2020

One possible approach is writing a custom HTTPAdapter with a custom PoolManager.

Set the pool manager's pool_classes_by_scheme dictionary to a subclass of ConnectionPool (you have to do that for both HTTP and HTTPS) and set the pool's ConnectionCls to a custom connection class. Overload the new_conn method to use a patched create_connection function. Then you can cache the IP address in that function.
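
A rough, untested sketch of that idea against urllib3 1.x internals (there the hook point is the private _new_conn method, which returns the raw socket; all class and attribute names below are version-dependent internals, not a supported API):

from urllib3 import HTTPConnectionPool, HTTPSConnectionPool, PoolManager
from urllib3.connection import HTTPConnection, HTTPSConnection

class PeerRecordingMixin(object):
    peer = None

    def _new_conn(self):
        # record the connected peer's (ip, port) as soon as the socket exists
        sock = super()._new_conn()
        self.peer = sock.getpeername()
        return sock

class PeerHTTPConnection(PeerRecordingMixin, HTTPConnection):
    pass

class PeerHTTPSConnection(PeerRecordingMixin, HTTPSConnection):
    pass

class PeerHTTPConnectionPool(HTTPConnectionPool):
    ConnectionCls = PeerHTTPConnection

class PeerHTTPSConnectionPool(HTTPSConnectionPool):
    ConnectionCls = PeerHTTPSConnection

pm = PoolManager()
pm.pool_classes_by_scheme = {
    'http': PeerHTTPConnectionPool,
    'https': PeerHTTPSConnectionPool,
}

# with preload_content=False the response keeps a reference to its
# connection, so the recorded peer can be read before the body is consumed
r = pm.request('GET', 'http://example.com', preload_content=False)
print(r._connection.peer)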

@arossert commented
I also need this ability in my line of work. As @andreabisello mentioned, I'm using this (Python 3 only; it can be adjusted to work with Python 2):

import http.client

import requests

def getresponse(self, *args, **kwargs):
    # call the original method, then record the peer address on the response
    response = self._orig_getresponse(*args, **kwargs)
    try:
        response.peer = self.sock.getpeername()
    except Exception:
        response.peer = None
    return response

# monkeypatch the stdlib connection class so every response carries .peer
http.client.HTTPConnection._orig_getresponse = http.client.HTTPConnection.getresponse
http.client.HTTPConnection.getresponse = getresponse

res = requests.get("https://www.google.com")
res.raw._original_response.peer

@IvanLauLinTiong (Contributor) commented

pyOpenSSL support is deprecated and will be removed in a future 2.x release (#2691).

@flo-at commented Nov 16, 2023

Is there any update on this? The v2.0 roadmap mentioned "IP Address resolved by DNS" under "Tracing" but I cannot find anything about tracing in the Changelog.
