@clippered
Contributor

…reason, so return a new dict instance with replaced `user-agent`
@clippered
Contributor Author

@elacuesta I'm not sure whether GitHub Actions is running the tests.
Anyway, for some reason playwright_request.headers["user-agent"] was not being replaced on that line.
I tried repacking the dict and replacing its user-agent with Scrapy's headers["user-agent"].
I hope this is OK.

@codecov

codecov bot commented May 11, 2022

Codecov Report

Merging #92 (9a6c6b3) into master (3376a13) will decrease coverage by 0.35%.
The diff coverage is 100.00%.

❗ Current head 9a6c6b3 differs from pull request most recent head 929b5d1. Consider uploading reports for the commit 929b5d1 to get more accurate results.

@@             Coverage Diff             @@
##            master      #92      +/-   ##
===========================================
- Coverage   100.00%   99.64%   -0.36%     
===========================================
  Files            4        4              
  Lines          285      285              
===========================================
- Hits           285      284       -1     
- Misses           0        1       +1     

| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| scrapy_playwright/headers.py | 93.33% <100.00%> (-6.67%) | ⬇️ |

@elacuesta
Member

Thanks for the contribution! The change seems harmless, but I don't really understand its motivation. Are you encountering problems with the current version? If so, please provide a minimal and reproducible example.

@clippered
Contributor Author

Thanks. One example is scraping https://www.binance.com/en/markets with headless Chrome while overriding Scrapy's user-agent header with a custom value (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36).
Then print out the headers:

if headers.get("user-agent"):
    playwright_request.headers["user-agent"] = headers["user-agent"]
    print('TEST', playwright_request.headers["user-agent"], headers["user-agent"])
return playwright_request.headers

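In the first printed line the Scrapy header still shows Chrome, but the Playwright header shows HeadlessChrome even after the assignment, and in the following lines both values show HeadlessChrome: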

TEST Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/101.0.4929.0 Safari/537.36 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36
TEST Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/101.0.4929.0 Safari/537.36 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/101.0.4929.0 Safari/537.36
TEST Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/101.0.4929.0 Safari/537.36 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/101.0.4929.0 Safari/537.36
...

Looks weird to me too, but I hope this example is enough.
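
For readers following along, here is a minimal sketch of the "return a new dict" approach described in this PR. It assumes, without confirmation from this thread, that the in-place assignment is lost because each access to the headers attribute yields a fresh dict. `FakePlaywrightRequest` and `override_user_agent` are hypothetical names used only for illustration; they are not part of scrapy-playwright or Playwright.

```python
from typing import Dict


class FakePlaywrightRequest:
    """Hypothetical stand-in for a Playwright request (not the real class)."""

    def __init__(self, headers: Dict[str, str]) -> None:
        self._headers = dict(headers)

    @property
    def headers(self) -> Dict[str, str]:
        # Assumption: every access returns a fresh copy, so mutating the
        # returned dict never reaches the stored headers.
        return dict(self._headers)


def override_user_agent(playwright_request, scrapy_headers: Dict[str, str]) -> Dict[str, str]:
    """Return a new dict in which Scrapy's user-agent, if set, takes priority."""
    user_agent = scrapy_headers.get("user-agent")
    if user_agent:
        # Build a new dict instead of assigning into playwright_request.headers.
        return {**playwright_request.headers, "user-agent": user_agent}
    return playwright_request.headers


request = FakePlaywrightRequest({"user-agent": "HeadlessChrome/101", "accept": "*/*"})
scrapy_headers = {"user-agent": "Chrome/100"}

# In-place assignment only changes a temporary copy under this assumption.
request.headers["user-agent"] = scrapy_headers["user-agent"]
in_place_result = request.headers["user-agent"]

# Repacking into a new dict keeps the override.
merged = override_user_agent(request, scrapy_headers)
```

Under that assumption, `in_place_result` still holds the headless value while `merged` carries the Scrapy user agent, which matches the symptom in the printed output above.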

@elacuesta
Member

elacuesta commented May 11, 2022

Oh that makes a lot of sense, thanks for the explanation. I will be pushing a patch to use the new headers API in Playwright 1.15 soon, that will supersede this change. I don't see a reason not to merge this PR before that though, so thanks again!
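
As a rough illustration of what using the newer headers API could look like, the sketch below relies on the `all_headers()` coroutine that Playwright added to its request object in version 1.15. It is not the patch mentioned above: `FakeRequest` and `merged_headers` are hypothetical names, and the fake class only mimics the coroutine's name and return type so the example runs without a browser.

```python
import asyncio
from typing import Dict


class FakeRequest:
    """Hypothetical stand-in that mimics the all_headers() coroutine."""

    def __init__(self, headers: Dict[str, str]) -> None:
        self._headers = dict(headers)

    async def all_headers(self) -> Dict[str, str]:
        # Header names are lower-cased, as in Playwright's documented behavior.
        return {name.lower(): value for name, value in self._headers.items()}


async def merged_headers(request, scrapy_headers: Dict[str, str]) -> Dict[str, str]:
    """Fetch all request headers, then let Scrapy's user-agent take priority."""
    headers = await request.all_headers()
    user_agent = scrapy_headers.get("user-agent")
    if user_agent:
        headers["user-agent"] = user_agent
    return headers


result = asyncio.run(
    merged_headers(
        FakeRequest({"User-Agent": "HeadlessChrome/101", "Accept": "*/*"}),
        {"user-agent": "Chrome/100"},
    )
)
```

Because the function works on the dict it awaited rather than on an attribute of the request, the override is applied to the same object that gets returned.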

@elacuesta elacuesta merged commit a632118 into scrapy-plugins:master May 11, 2022