Idea: mechanism for dumping out list of resources requested #88

Closed
simonw opened this issue Aug 15, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@simonw
Owner

simonw commented Aug 15, 2022

I needed that here: simonw/datasette-lite#40 (comment)

I came up with this prototype patch:

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 018581c..d3e2fc6 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -663,6 +663,7 @@ def take_shot(
 
     if not use_existing_page:
         page = context_or_page.new_page()
+        page.on("request", lambda request: print(">>", request.method, request.url))
     else:
         page = context_or_page

Using this mechanism: https://playwright.dev/python/docs/network#handle-requests
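
Outside of shot-scraper, the same hook can be tried in a standalone script. This is a minimal sketch, assuming Playwright's sync API and an installed Chromium browser; format_request and log_requests are hypothetical helper names for illustration, not part of shot-scraper.

```python
def format_request(request):
    """Return one log line for a Playwright Request-like object."""
    return f">> {request.method} {request.url}"


def log_requests(url):
    # Imported here so format_request can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Register the listener before navigating so the first request is captured too.
        page.on("request", lambda request: print(format_request(request)))
        page.goto(url)
        browser.close()


if __name__ == "__main__":
    log_requests("https://simonwillison.net")
```

The listener is attached before page.goto() because requests that fire before it is registered would not be reported.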

@simonw simonw added the enhancement New feature or request label Aug 15, 2022
@simonw
Owner Author

simonw commented Aug 15, 2022

One solution would be to provide an entirely new command.

Another solution would be a --request-log X option, where X is the file the requests should be logged to, or - (a single hyphen) for standard output.
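
The file-or-hyphen convention could be handled along these lines. This is only a sketch of the idea using the standard library; open_request_log and write_request are hypothetical names, and the --request-log option itself is just a proposal at this point in the thread.

```python
import contextlib
import sys


@contextlib.contextmanager
def open_request_log(target):
    """Yield a writable text stream: sys.stdout for "-", otherwise the named file."""
    if target == "-":
        # Standard output is shared with the rest of the program, so never close it here.
        yield sys.stdout
    else:
        with open(target, "w", encoding="utf-8") as fp:
            yield fp


def write_request(fp, method, url):
    # One request per line, method first, then the URL.
    fp.write(f"{method} {url}\n")


if __name__ == "__main__":
    with open_request_log("-") as log:
        write_request(log, "GET", "https://simonwillison.net/")
```

Wrapping both cases in one context manager means the calling code does not need to know whether it is writing to a file or to the terminal.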

@simonw
Owner Author

simonw commented Aug 15, 2022

Demo of that prototype:

(shot-scraper) shot-scraper % shot-scraper https://simonwillison.net
>> GET https://simonwillison.net/
>> GET https://plausible.io/js/plausible.js
>> GET https://www.googletagmanager.com/gtag/js?id=UA-1090368-1
>> GET https://simonwillison.net/static/css/all.6c8f16b5122b.css
>> GET https://platform.twitter.com/widgets.js
>> GET https://static.simonwillison.net/static/2022/psf-resolutions.jpg
>> GET https://static.simonwillison.net/static/2022/psf-resolutions-nigeria.jpg
>> GET https://static.simonwillison.net/static/2022/github-gpt3-math.png
>> GET https://static.simonwillison.net/static/2022/s3-ocr-sample-handwriting.jpg
>> GET https://simonwillison.net/static/css/img/lquote.d4233b109d57.png
>> GET https://simonwillison.net/static/css/img/rquote.0b9c95bcc65e.png
>> GET https://platform.twitter.com/widgets/widget_iframe.9d00f3a022654eb8edfbc3190e981f9d.html?origin=https%3A%2F%2Fsimonwillison.net
>> GET https://www.google-analytics.com/analytics.js
>> GET https://syndication.twitter.com/settings?session_id=41bb83e8447a5519b8a3865c37e883760ee44eb1
>> POST https://www.google-analytics.com/j/collect?v=1&_v=j96&a=1633639690&t=pageview&_s=1&dl=https%3A%2F%2Fsimonwillison.net%2F&ul=en-us&de=UTF-8&dt=Simon%20Willison%E2%80%99s%20Weblog&sd=30-bit&sr=1280x720&vp=1280x720&je=0&_u=YEBAAUABAAAAAC~&jid=131808553&gjid=1628149262&cid=417458799.1660605950&tid=UA-1090368-1&_gid=2083172831.1660605950&_r=1&gtm=2ou880&z=41234275
>> GET https://platform.twitter.com/js/tweet.5b94507822be1b77b58bef86fc7cd9f7.js
>> GET https://platform.twitter.com/embed/Tweet.html?dnt=false&embedId=twitter-widget-0&features=eyJ0ZndfdGltZWxpbmVfbGlzdCI6eyJidWNrZXQiOlsibGlua3RyLmVlIiwidHIuZWUiXSwidmVyc2lvbiI6bnVsbH0sInRmd190d2VldF9lZGl0X2JhY2tlbmQiOnsiYnVja2V0Ijoib24iLCJ2ZXJzaW9uIjpudWxsfSwidGZ3X3JlZnNyY19zZXNzaW9uIjp7ImJ1Y2tldCI6Im9uIiwidmVyc2lvbiI6bnVsbH0sInRmd190d2VldF9yZXN1bHRfbWlncmF0aW9uXzEzOTc5Ijp7ImJ1Y2tldCI6InR3ZWV0X3Jlc3VsdCIsInZlcnNpb24iOm51bGx9LCJ0Zndfc2Vuc2l0aXZlX21lZGlhX2ludGVyc3RpdGlhbF8xMzk2MyI6eyJidWNrZXQiOiJpbnRlcnN0aXRpYWwiLCJ2ZXJzaW9uIjpudWxsfSwidGZ3X2V4cGVyaW1lbnRzX2Nvb2tpZV9leHBpcmF0aW9uIjp7ImJ1Y2tldCI6MTIwOTYwMCwidmVyc2lvbiI6bnVsbH0sInRmd19kdXBsaWNhdGVfc2NyaWJlc190b19zZXR0aW5ncyI6eyJidWNrZXQiOiJvZmYiLCJ2ZXJzaW9uIjpudWxsfSwidGZ3X3R3ZWV0X2VkaXRfZnJvbnRlbmQiOnsiYnVja2V0Ijoib2ZmIiwidmVyc2lvbiI6bnVsbH19&frame=false&hideCard=false&hideThread=false&id=1555626060384911360&lang=en-gb&origin=https%3A%2F%2Fsimonwillison.net%2F&sessionId=41bb83e8447a5519b8a3865c37e883760ee44eb1&theme=light&widgetsVersion=31f0cdc1eaa0f%3A1660602114609&width=550px
>> GET https://platform.twitter.com/embed/embed.runtime.1d9669116f7b6c2f2def.js
>> GET https://platform.twitter.com/embed/embed.modules.22436ce161b8a1362ef3.js
>> GET https://platform.twitter.com/embed/embed.Tweet.ebf51334f3136d3769be.js
>> GET https://platform.twitter.com/embed/embed.vendors~ondemand.horizon-web.i18n.ar-js~ondemand.horizon-web.i18n.ar-x-fm-js~ondemand.horizon-web.i1~98d47477.022b10081a82154299a6.js
>> GET https://platform.twitter.com/embed/embed.ondemand.i18n.en-js.26aa117248996d58e1bc.js
>> GET https://platform.twitter.com/embed/embed.vendors~ondemand.horizon-web.i18n.en-js.1c97cb46d8f406ddd7b9.js
>> GET https://platform.twitter.com/embed/embed.vendors~ondemand.Tweet.e54d69b39047ba47eee9.js
>> GET https://platform.twitter.com/embed/embed.ondemand.Tweet.d95dfccc9bd426e11ff8.js
>> GET https://platform.twitter.com/embed/embed.ondemand.Dropdown.5c1c610935c86ba65697.js
>> GET https://cdn.syndication.twimg.com/tweet-result?features=tfw_timeline_list%3Alinktr.ee%2Ctr.ee%3Btfw_tweet_edit_backend%3Aon%3Btfw_refsrc_session%3Aon%3Btfw_tweet_result_migration_13979%3Atweet_result%3Btfw_sensitive_media_interstitial_13963%3Ainterstitial%3Btfw_experiments_cookie_expiration%3A1209600%3Btfw_duplicate_scribes_to_settings%3Aoff%3Btfw_tweet_edit_frontend%3Aoff&id=1555626060384911360&lang=en
>> GET https://syndication.twitter.com/i/jot?l=%7B%22_category_%22%3A%22tfw_client_event%22%2C%22triggered_on%22%3A1660605950753%2C%22event_namespace%22%3A%7B%22client%22%3A%22tfw%22%2C%22page%22%3A%22tweet%22%2C%22action%22%3A%22results%22%2C%22section%22%3A%22main%22%7D%2C%22context%22%3A%22horizon%22%2C%22client_version%22%3A%2231f0cdc1eaa0f%3A1660602114609%22%2C%22dnt%22%3Afalse%2C%22widget_id%22%3A%22twitter-widget-0%22%2C%22widget_origin%22%3A%22https%3A%2F%2Fsimonwillison.net%2F%22%2C%22widget_frame%22%3A%22false%22%2C%22widget_partner%22%3A%22%22%2C%22widget_site_screen_name%22%3A%22%22%2C%22widget_site_user_id%22%3A%22%22%2C%22widget_creator_screen_name%22%3A%22%22%2C%22widget_creator_user_id%22%3A%22%22%2C%22widget_iframe_version%22%3A%226961c5ce6d4a2%3A1660264378254%22%2C%22item_ids%22%3A%5B%221555626060384911360%22%5D%2C%22item_details%22%3A%7B%221555626060384911360%22%3A%7B%22item_type%22%3A0%7D%7D%7D
>> GET https://pbs.twimg.com/profile_images/378800000261649705/be9cc55e64014e6d7663c50d7cb9fc75_normal.jpeg
>> GET https://pbs.twimg.com/media/FZaxPRRUcAA2Lsn?format=jpg&name=360x360
>> GET https://pbs.twimg.com/media/FZaxld4VsAAbaho?format=jpg&name=360x360
>> GET https://syndication.twitter.com/i/jot?l=%7B%22_category_%22%3A%22tfw_client_event%22%2C%22triggered_on%22%3A1660605953159%2C%22event_namespace%22%3A%7B%22client%22%3A%22tfw%22%2C%22page%22%3A%22tweet%22%2C%22action%22%3A%22FCP%22%2C%22component%22%3A%22performance%22%2C%22section%22%3A%22main%22%7D%2C%22context%22%3A%22horizon%22%2C%22client_version%22%3A%2231f0cdc1eaa0f%3A1660602114609%22%2C%22dnt%22%3Afalse%2C%22widget_id%22%3A%22twitter-widget-0%22%2C%22widget_origin%22%3A%22https%3A%2F%2Fsimonwillison.net%2F%22%2C%22widget_frame%22%3A%22false%22%2C%22widget_partner%22%3A%22%22%2C%22widget_site_screen_name%22%3A%22%22%2C%22widget_site_user_id%22%3A%22%22%2C%22widget_creator_screen_name%22%3A%22%22%2C%22widget_creator_user_id%22%3A%22%22%2C%22widget_iframe_version%22%3A%226961c5ce6d4a2%3A1660264378254%22%2C%22item_ids%22%3A%5B%221555626060384911360%22%5D%2C%22item_details%22%3A%7B%221555626060384911360%22%3A%7B%22item_type%22%3A0%7D%7D%2C%22duration_ms%22%3A3103.5999999046326%7D
Screenshot of 'https://simonwillison.net' written to 'simonwillison-net.3.png'

It would be useful if it also indicated the size of each response, and maybe timing information too.

@simonw
Owner Author

simonw commented Sep 12, 2022

New prototype, outputting newline-delimited JSON with size and timings:

 % shot-scraper https://www.google.com/ 
{"url": "https://www.google.com/", "method": "GET", "size": 147341, "timing": {"startTime": 1663014391108.365, "domainLookupStart": 0.509, "domainLookupEnd": 88.891, "connectStart": 88.891, "secureConnectionStart": 101.022, "connectEnd": 144.795, "requestStart": 145.066, "responseStart": 240.511, "responseEnd": 267.347}}
{"url": "https://www.google.com/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png", "method": "GET", "size": 5969, "timing": {"startTime": 1663014391382.6047, "domainLookupStart": -1, "domainLookupEnd": -1, "connectStart": -1, "secureConnectionStart": -1, "connectEnd": -1, "requestStart": 0.164, "responseStart": 50.047, "responseEnd": 50.803}}

That's with this patch (now listening for the "response" event, not the "request" event):

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index fb8b1a3..799be82 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -698,6 +698,10 @@ def take_shot(
 
     if not use_existing_page:
         page = context_or_page.new_page()
+        def on_response(response):
+            print(json.dumps({"url": response.url, "method": response.request.method, "size": len(response.body()), "timing": response.request.timing}))
+
+        page.on("response", on_response)
     else:
         page = context_or_page

@simonw simonw closed this as completed in ca2ecf9 Sep 12, 2022
@simonw
Owner Author

simonw commented Sep 12, 2022
