Remove 'secure' attribute from cookies in ProxyGenerator to ensure compatibility with httpx Cookies.set() #581
Luen wants to merge 5 commits into scholarly-python-package:develop
Conversation
Pull request overview
Fixes cookie handoff from Selenium → httpx after CAPTCHA solving by removing unsupported cookie fields before calling httpx.Cookies.set(), preventing the solved CAPTCHA session from being discarded due to an exception.
Changes:
- Strip the `secure` attribute from Selenium cookie dicts before passing them to `httpx.Cookies.set()`.
- Update test setup to skip proxy-dependent tests when no working free proxy is available.
- Adjust proxy initialization logic in `ProxyGenerator._new_session()` to derive an `httpx.Client(proxy=...)` value from the proxy dict.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `test_module.py` | Adds exception handling to skip tests when free proxies cannot be established; adds imports for proxy-related exceptions. |
| `scholarly/_proxy_generator.py` | Filters Selenium cookies to avoid passing unsupported kwargs to httpx; updates proxy initialization behavior. |
```diff
 if self._proxy_works:
-    init_kwargs["proxies"] = proxies #.get("http", None)
     self._proxies = proxies
+    # httpx uses proxy= (single URL), not proxies= (dict)
+    proxy_url = proxies.get("https://") or proxies.get("http://")
+    if proxy_url:
+        init_kwargs["proxy"] = proxy_url
```
In _new_session(), proxy handling still relies on converting a proxies dict into a single proxy URL. Be careful about callers that pass only {'http://': ...} (e.g., ScraperAPI()), because the later conversion logic indexes the https:// entry and can raise KeyError. Suggested fix: normalize/validate the dict here (ensure both schemes exist) or choose https:// with a fallback to http:// when deriving the proxy URL.
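The reviewer's suggestion can be sketched as a small helper. This is a hypothetical function (not part of the PR) showing the https-first, http-fallback derivation under the assumption that proxy dicts are keyed by URL scheme as in the hunk above:

```python
def derive_proxy_url(proxies):
    """Derive the single proxy URL that httpx.Client(proxy=...) expects.

    Prefers the "https://" entry and falls back to "http://", so a dict
    with only one scheme (e.g. a ScraperAPI-style {"http://": ...}) does
    not raise KeyError the way direct indexing would.
    """
    if not proxies:
        return None
    return proxies.get("https://") or proxies.get("http://")


# An http-only dict still yields a usable proxy URL.
scraperapi_style = {"http://": "http://user:pass@proxy.example:8080"}
proxy_url = derive_proxy_url(scraperapi_style)
```

Using `.get()` with an `or` fallback (rather than `proxies["https://"]`) is the whole fix: missing keys become `None` instead of exceptions.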
```diff
+    # httpx uses proxy= (single URL), not proxies= (dict)
+    proxy_url = proxies.get("https://") or proxies.get("http://")
+    if proxy_url:
+        init_kwargs["proxy"] = proxy_url
```
The PR title/description focuses on stripping the secure cookie attribute, but this hunk also changes how proxy configuration is passed into httpx.Client (switching behavior around proxies vs proxy). Please either update the PR description to cover this additional behavior change or split it into a separate PR so it can be reviewed/validated independently.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0a563e1861
```diff
+    if "expiry" in cookie:
+        cookie_kwargs["expires"] = cookie["expiry"]
```
Remove unsupported expires mapping from captcha cookie copy
In _handle_captcha2, the new expiry -> expires mapping is passed into self._session.cookies.set(**cookie_kwargs), but httpx.Cookies.set only supports name, value, domain, and path. When Selenium returns cookies that include expiry (common for persistent cookies), this still raises TypeError (now for expires) and prevents transferring the solved CAPTCHA session back to httpx.
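One way to avoid this whole class of error is an allow-list instead of repeated pops. This is a sketch of that alternative, not the PR's code; `to_httpx_kwargs` is a hypothetical helper, and the key set reflects Codex's statement that `httpx.Cookies.set` only supports `name`, `value`, `domain`, and `path`:

```python
# Keyword arguments that httpx.Cookies.set() accepts, per the review above.
HTTPX_COOKIE_KEYS = {"name", "value", "domain", "path"}


def to_httpx_kwargs(selenium_cookie):
    """Filter a Selenium cookie dict down to httpx-safe kwargs.

    Drops secure, httpOnly, expiry, sameSite, and any future
    browser-specific keys in one pass, so no per-key pop is needed.
    """
    return {k: v for k, v in selenium_cookie.items() if k in HTTPX_COOKIE_KEYS}


cookie = {
    "name": "GSP", "value": "abc", "domain": ".scholar.google.com",
    "path": "/", "secure": True, "httpOnly": True,
    "expiry": 1893456000, "sameSite": "Lax",
}
kwargs = to_httpx_kwargs(cookie)
# kwargs is now safe to splat into session.cookies.set(**kwargs)
```

The allow-list also silently discards the `expiry` value rather than mapping it to an unsupported `expires` kwarg, which is exactly the failure this comment describes.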
@claude review
@copilot open a new pull request to apply changes based on the comments in this thread |
Fix cookie transfer after CAPTCHA: strip secure before httpx Cookies.set()
When Google Scholar returns a CAPTCHA, the flow uses a browser to solve it, then copies cookies from the browser (Selenium) back into the httpx session so later requests use the same session. That copy step was raising:
TypeError: Cookies.set() got an unexpected keyword argument 'secure'
After that, the code retried with a new session, so the solved CAPTCHA session was effectively lost.
In `_proxy_generator.py`, `_handle_captcha2()` does the following:
1. Wait for the user to solve the CAPTCHA in the browser.
2. Call `self._get_webdriver().get_cookies()` (Selenium), which returns dicts that include browser-specific keys such as `httpOnly`, `expiry`, `sameSite`, and `secure`.
3. Strip only `httpOnly`, `expiry`, and `sameSite`, then call `self._session.cookies.set(**cookie)` (httpx).
httpx’s Cookies.set() does not accept a secure keyword. Passing the Selenium cookie dict as **cookie therefore included secure and triggered the TypeError.
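The failure mode is plain Python keyword-argument behavior, and can be reproduced without httpx. In this sketch, `cookies_set` is a stand-in with the same supported parameters the review attributes to `httpx.Cookies.set()` (`name`, `value`, `domain`, `path`):

```python
def cookies_set(name, value, domain="", path="/"):
    """Stand-in with httpx.Cookies.set()'s supported keyword parameters."""
    return (name, value, domain, path)


# A Selenium-style cookie dict that still contains the "secure" key.
cookie = {"name": "GSP", "value": "abc", "domain": ".google.com",
          "path": "/", "secure": True}

try:
    cookies_set(**cookie)
    raised = False
except TypeError:
    # Splatting the dict passes secure=True, which no parameter accepts.
    raised = True
```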
The fix: before calling `self._session.cookies.set(**cookie)`, remove the `secure` key from each cookie dict, in the same way as the other non-httpx fields. Concretely, in `_handle_captcha2()`, add `cookie.pop("secure", None)` after the existing pop calls for `httpOnly`, `expiry`, and `sameSite`.
With this, only supported arguments are passed to Cookies.set(), the cookie transfer completes, and the session continues to work after CAPTCHA solving without retrying due to this error.
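The transfer loop described above can be sketched as follows. To keep the sketch runnable without httpx or Selenium, `set_cookie` stands in for `httpx.Cookies.set()` and the cookie list stands in for `get_cookies()` output; the four pops match the fix described in this PR:

```python
def set_cookie(name, value, domain="", path="/"):
    """Stand-in for httpx.Cookies.set(): only these four kwargs exist."""
    return (name, value, domain, path)


# Stand-in for self._get_webdriver().get_cookies() after CAPTCHA solving.
selenium_cookies = [
    {"name": "NID", "value": "xyz", "domain": ".google.com", "path": "/",
     "httpOnly": True, "expiry": 1893456000, "sameSite": "Lax",
     "secure": True},
]

transferred = []
for cookie in selenium_cookies:
    # Existing pops for browser-specific keys httpx does not accept.
    cookie.pop("httpOnly", None)
    cookie.pop("expiry", None)
    cookie.pop("sameSite", None)
    # The new pop this PR adds: without it, **cookie raises TypeError.
    cookie.pop("secure", None)
    transferred.append(set_cookie(**cookie))
```

After the loop, every dict contains only supported arguments, so the splat succeeds and the solved-CAPTCHA cookies land in the httpx session.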