Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Session cookies do not inherit during redirection #55

Closed
WhZzi opened this issue Apr 28, 2023 · 18 comments
Closed

Session cookies do not inherit during redirection #55

WhZzi opened this issue Apr 28, 2023 · 18 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@WhZzi
Copy link

WhZzi commented Apr 28, 2023

problem: Session cookies do not inherit during redirection
version: 0.5.4
descibe: When a session with cookie requests a redirected website A, 302 jumps to B, and the request to access B does not carry any cookies

@WhZzi
Copy link
Author

WhZzi commented Apr 28, 2023

from curl_cffi import requests
s2 = requests.Session(impersonate="chrome110")
s2.cookies.update({"a":"123"})
r = s2.get('http://www.baidu.com/',verify=False,proxies={"http":"127.0.0.1:8888","https":"127.0.0.1:8888"})

use fiddler to capture the requests
12
12-1

@FrenziedIM
Copy link

It's because the cookie header is in lowercase, I think the server does not pick it up. I am also facing the same issue and would love to know if it could be possible to fix.

@yifeikong yifeikong self-assigned this May 23, 2023
@yifeikong yifeikong added the bug Something isn't working label May 23, 2023
@yifeikong
Copy link
Owner

Cookies are lost during transition from python cookiejar to curl cookiejar. To fix this issue, we will have to change the implementation to handle cookies entirely by either python or curl. Before it's fixed, consider hitting the HTTPS url directly, instead of hitting the http url and getting redirected.

@yifeikong yifeikong added this to the v0.6 milestone May 23, 2023
@WhZzi
Copy link
Author

WhZzi commented May 24, 2023

The above demonstration is just an example. Typically, during the process of redirecting from one HTTPS page to another HTTPS page, it seems that cookies are also lost. I would be glad if you fix this problem.

@coletdjnz
Copy link
Contributor

I was looking into this for our project (yt-dlp/yt-dlp#7595). Ideally we want to keep the Python CookieJar and internal curl cookie store in sync.

One way I was looking into doing this was the following:

It may not be entirely efficient though, especially when lots of cookies are present in the CookieJar (in our project that is not uncommon). But not sure if we can get any better, so might just be a tradeoff we have to make.

@yifeikong
Copy link
Owner

@coletdjnz Yes, that's exactly the way I want it to be implemented, too. In my experiment, I found that libcurl is much more efficient than pure python http clients like requests or httpx, so I guess it wouldn't hurt too much to sync cookies like this.

BTW, I noticed that you listed the libcurl error codes in your PR, would you mind if I incorporate that enum in this project?

@coletdjnz
Copy link
Contributor

BTW, I noticed that you listed the libcurl error codes in your PR, would you mind if I incorporate that enum in this project?

Yeah absolutely, go for it. I was thinking it might be better here too.

yifeikong added a commit that referenced this issue Aug 6, 2023
* Preliminary support for cookie fix

* Extract cookies using CurlInfo.COOKIELIST, fix #23 #55 #83

* Remove unused domain related stuff

* Add CurlECode

* Remove unused types.py

* Add test for single request cookies
@yifeikong
Copy link
Owner

It seems that python's cookiejar only works with Cookiejar.make_cookies, there are very complicate processes behind it. Using set_cookie to add cookies does not work as expected.

In the docs:

It is not expected that users of http.cookiejar construct their own Cookie instances. Instead, if necessary, call make_cookies() on a CookieJar instance.

However, make_cookies requires a request/response pair, which can not be constructed when there are redirections, unless we handle redirects entirely in python, which seems to be less efficient.

The other way is to let curl handle cookies. For python, use something like a CookieProxy to read and set cookies from curl. Reopening this issue until it's implemented.

@yifeikong yifeikong reopened this Aug 7, 2023
@Grub4K
Copy link

Grub4K commented Aug 7, 2023

Using set_cookie to add cookies does not work as expected.

Here is what I gathered from looking at the CPython implementation:

CookieJar.make_cookies parses Netscape and Rfc2965 and calls _cookies_from_attrs_set for each: Rfc2965, Netscape.
That function calls _normalized_cookie_tuples and then creates the Cookie from these through _cookie_from_cookie_tuple.

_normalized_cookie_tuples does not do anything interesting, while _cookie_from_cookie_tuple sets various defaults in Cookie construction from the Request.
Here we get some differences in the handling due to the current curl_cffi implemetation:

  • According to https://curl.se/docs/http-cookies.html, expires can be 0.
    For CookieJar however, that would have to be translated to expires=None, discard=True. (Ref)
  • While not necessarily relevant, before setting path, if defined, it will get escaped through escape_path. (Ref)

So if these quirks are accounted for it should be fine if we create our own Cookie. What other roadblock were you thinking of?

@yifeikong
Copy link
Owner

Thanks for the investigation! Another quirk I noticed is the handling of initial dots mentioned in #97, which may be related to cookiejar implementation here.

The other way I was thinking is to handle cookies all by curl, this is a curl wrapper anyway. It looks like this:

class CookieProxy:
    def __init__(self, curl):
        self.curl = curl
        self.jar = FakeJar()  # cookie iterable for compatibility

    def get(self, name, domain=None, ...):
        cookies = self.curl.getinfo(CurlInfo.COOKIELIST)
        for cookie in cookies:
            cookie = self._parse_curl_cookie(cookie)
            if cookie.name == name:
                return cookie.value

    def set(self, name, value, domain=None, ...):
        cookie_line = self._dump_curl_cookie(name, value, ...)
        self.curl.setopt(CurlOpt.COOKIELIST, cookie_line)

    def __setitem__(self, name, value):
        self.set(name, value)
    ...

The benefits could be:

  • less code, no need to set and extract cookies in Session and add to CookieJar.
  • better performance, cookies are only extracted when needed, not for every request.

@coletdjnz
Copy link
Contributor

Thanks for the investigation! Another quirk I noticed is the handling of initial dots mentioned in #97, which may be related to cookiejar implementation here.

The other way I was thinking is to handle cookies all by curl, this is a curl wrapper anyway. It looks like this:

class CookieProxy:
    def __init__(self, curl):
        self.curl = curl
        self.jar = FakeJar()  # cookie iterable for compatibility

    def get(self, name, domain=None, ...):
        cookies = self.curl.getinfo(CurlInfo.COOKIELIST)
        for cookie in cookies:
            cookie = self._parse_curl_cookie(cookie)
            if cookie.name == name:
                return cookie.value

    def set(self, name, value, domain=None, ...):
        cookie_line = self._dump_curl_cookie(name, value, ...)
        self.curl.setopt(CurlOpt.COOKIELIST, cookie_line)

    def __setitem__(self, name, value):
        self.set(name, value)
    ...

The benefits could be:

* less code, no need to set and extract cookies in Session and add to CookieJar.

* better performance, cookies are only extracted when needed, not for every request.

Hmm true, something like that could work for curl_cffi to keep things simple. I do recall seeing there may be multiple curl instances for a session though (due to threading?) - could that be an issue for this?

Now that CurlInfo.COOKIELIST is working, projects that need more complex/specific cookie handling can always implement their own method on their end (such as full cookie sync in the instance of our project).

@yifeikong
Copy link
Owner

yifeikong commented Aug 14, 2023

Hmm true, something like that could work for curl_cffi to keep things simple. I do recall seeing there may be multiple curl instances for a session though (due to threading?) - could that be an issue for this?

Curl has dedicated functions for this already - curl_share_*, which should have been exposed anyway. I think I will add this as a second option for handling cookies along side of cookiejar.

Now that CurlInfo.COOKIELIST is working, projects that need more complex/specific cookie handling can always implement their own method on their end (such as full cookie sync in the instance of our project).

I'm still pondering on how to sync cookies between curl and cookiejar. Because of the different formats of cookies and indirect interface for adding/updating cookies in the cookiejar, there are more to consider.

Going back to the original issue here, we were using cookie: header to tell curl what cookies to use in outgoing requests. Since curl's cookie engine is not activated, when redirect happens, the second request is sent without cookies. When the final response arrives, cookies are striped from set-cookie headers and updated in the cookiejar.

It's very easy to sync cookies in this way, since the primary interfaces(extract_cookies/add_cookie_headers) exposed by cookiejar are based on cookie/set-cookie headers.

To solve the missing cookies problem above, the related changes on master branch are:

  • expose curl.getinfo(CurlInfo.COOKIELIST) which is not exposed before.
  • enable the cookie engine of curl by curl.setopt(CurlOpt.COOKIEFILE, "") or whatsoever.
  • use CURLOPT_COOKIELIST to send cookies with curl instead of headers.
  • use CURLINFO_COOKIELIST to extract cookies from curl.

However, as mentioned above, curl's cookie format does not work well with cookiejar's internal format. We have to consider expires, initial_dot, path and so one. To work around this issue, we can revert to cookiejar.extract_headers while keeping other changes, which should work since it's just a replay of how cookie changes on python side, i.e.

  • use CURLINFO_COOKIELIST to extract cookies from curl.
  • extract cookies from set-cookie headers of all history redirects in visit order.

What do you think?

@coletdjnz
Copy link
Contributor

coletdjnz commented Aug 19, 2023

In theory you should be able to full sync cookies - at the end of the day they should all be following the same standard. It might just require getting into the nitty gritty of how to translate the two. When I get some time I wouldn't mind looking into this more.

Another alternative could be to take a requests-like approach - handle the redirects in curl_cffi. That way you could use extract_cookies/add_cookie_headers for redirects too. But redirect handling can get a bit tricky too - lots of little things you have to follow (could inherit from requests? though, requests handling is not perfect either).

Going back to basics: is there no way to get information about the intermediate requests? (any curl callback options helpful?) If we can get the request urls and response headers as it goes through redirects then that would make things easier.

@yifeikong
Copy link
Owner

https://curl.se/libcurl/c/CURLOPT_HEADERFUNCTION.html implies you can get headers for all the requests?

Yes, you can read the response header lines for each request from the preset header BytesIO buffer. However, to use the cookiejar.extract_cookies function correctly, you will have to construct each request object, since it's needed here and here. It's even more complicated and cumbersome than CURLINFO_COOKIELIST with cookiejar.make_cookies. 😂

@yifeikong
Copy link
Owner

I uploaded a beta version to PyPI, cookies seem to be correct, I also changed the wrong option used for POST request, as I found here. Besides, the error code is exposed here, so you don't have to use a regex to extract it. You can try to see if it fits your needs when you have some free time.

pip install curl-cffi==0.5.9b2

@WhZzi
Copy link
Author

WhZzi commented Aug 24, 2023

I test same codes with version 0.5.9b2, it has perfectly solved this problem, thank for you and all guys maintenance 😊😊

@WhZzi WhZzi closed this as completed Aug 24, 2023
@coletdjnz
Copy link
Contributor

I uploaded a beta version to PyPI, cookies seem to be correct, I also changed the wrong option used for POST request, as I found here. Besides, the error code is exposed here, so you don't have to use a regex to extract it. You can try to see if it fits your needs when you have some free time.

pip install curl-cffi==0.5.9b2

Awesome, thanks, this will help heaps 😃

I also changed the wrong option used for POST request, as I found here

sweet - seems to be working better. Though I'm still having some redirect issues (e.g. think content-length header is being sent on GET after POST->GET redirect?). I'll have to investigate - might just be our test suite being too pedantic.

Besides, the error code is exposed here, so you don't have to use a regex to extract it.

Ah, missed that. Though, it isn't copied over to the RequestsError so I have to access with RequestError.args[0].code.

@yifeikong
Copy link
Owner

These two issues should be fixed in 0.5.9b3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

No branches or pull requests

5 participants