Nodriver: CDP get_response_body command not working #1832

Open
jwwq opened this issue Apr 18, 2024 · 4 comments

jwwq commented Apr 18, 2024

Good afternoon, and thank you for your great work! Based on your "network_monitor.py" example, I am trying to retrieve the contents of a response. I use the LoadingFinished handler to make sure the resource has been retrieved completely. Unfortunately, the process hangs forever when I try to send the command to CDP (see the full code below).

cdp_cmd = cdp.network.get_response_body(event.request_id)
res = await global_browser.main_tab.send(cdp_cmd)

Please help!

(One more question: is there any way to get the tab inside a handler without global variables? It's a minor issue, though.)

from nodriver import start, cdp, loop

global_tab = None

async def main():
    browser = await start()
    tab = browser.main_tab
    global global_tab
    global_tab = tab
    tab.add_handler(cdp.network.RequestWillBeSent, send_handler)
    tab.add_handler(cdp.network.ResponseReceived, receive_handler)
    tab.add_handler(cdp.network.LoadingFinished, finished_handler)

    tab = await browser.get("https://www.google.com/?hl=en")


async def receive_handler(event: cdp.network.ResponseReceived):
    # print(event.response)
    return

async def send_handler(event: cdp.network.RequestWillBeSent):
    return

async def finished_handler(event: cdp.network.LoadingFinished):
    global global_tab
    print("finished:", event.request_id, ":", event.encoded_data_length)    
    if event.encoded_data_length > 0:
        cdp_cmd = cdp.network.get_response_body(event.request_id)
        print("SENDING...")
        res = await global_tab.send(cdp_cmd)
        # THE PROCESS HANGS HERE FOREVER.
        print("RESULT:", res)       

if __name__ == "__main__":
    loop().run_until_complete(main())
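
On the side question about avoiding the global: one option is to build the handler inside a function that receives the tab, so the handler closes over it. The sketch below is only an illustration. `FakeTab`, `make_finished_handler` and `seen` are hypothetical names, and `FakeTab` exists only so the example runs without a browser; it assumes, as the code above does, that `add_handler` takes an event type and a callable that receives the event. It does not address the hang itself.

```python
import asyncio


class FakeTab:
    """Hypothetical stand-in for a nodriver tab, so this runs without a browser."""

    def __init__(self):
        self.handlers = {}

    def add_handler(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)

    async def emit(self, event_type, event):
        # Deliver an event to every handler registered for its type.
        for handler in self.handlers.get(event_type, []):
            await handler(event)


def make_finished_handler(tab, seen):
    # `tab` and `seen` are captured by the closure, so no `global` is needed.
    async def finished_handler(event):
        seen.append((tab, event))

    return finished_handler


async def demo():
    tab = FakeTab()
    seen = []
    tab.add_handler("LoadingFinished", make_finished_handler(tab, seen))
    await tab.emit("LoadingFinished", {"request_id": "1"})
    return tab, seen


if __name__ == "__main__":
    tab, seen = asyncio.run(demo())
    print(seen[0][0] is tab)
```

With nodriver the registration would presumably look like `tab.add_handler(cdp.network.LoadingFinished, make_finished_handler(tab, seen))`, but that is untested here.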
@jwwq jwwq changed the title Nodriver: Nodriver: get_response_body command freezes Apr 18, 2024
@jwwq jwwq changed the title Nodriver: get_response_body command freezes Nodriver: CDP get_response_body command not working Apr 18, 2024

falmar commented Apr 24, 2024

Hi there @jwwq

I'm also trying to solve this. I'm under the impression that calling tab.send inside the event callback causes a deadlock. Here is a snippet of how I got something working; my use case is to extract the data from all Ajax requests.

import time
import nodriver as uc
from nodriver import cdp

xhr_requests = []
last_xhr_request = None

def listenXHR(page):
    async def handler(evt):
        # get ajax requests
        if evt.type_ is cdp.network.ResourceType.XHR or evt.type_ is cdp.network.ResourceType.FETCH:
            xhr_requests.append([evt.response.url, evt.request_id])
            global last_xhr_request
            last_xhr_request = time.time()

    page.add_handler(cdp.network.ResponseReceived, handler)


async def receiveXHR(page, requests):
    responses = []
    retries = 0
    max_retries = 5

    # wait at least 2 second after the last xhr request to get some more
    while True:
        if last_xhr_request is None or retries > max_retries:
            break

        if time.time() - last_xhr_request <= 2:
            retries = retries + 1
            time.sleep(2)

            continue
        else:
            break

    await page # this is very important

    # loop through gathered requests and get its response body
    for request in requests:
        try:
            res = await page.send(cdp.network.get_response_body(request[1]))
            if res is None:
                continue

            responses.append({
                'url': request[0],
                'body': res[0],
                'is_base64': res[1]
            })
        except Exception as e:
            print("error get body", e)

    return responses


async def crawl():
    browser = await uc.start(headless=False)

    # use main tab
    tab = browser.main_tab

    listenXHR(tab)

    # change url to something that makes ajax requests
    tab = await browser.get("https://example.com")
    time.sleep(2)
    xhr_responses = await receiveXHR(tab, xhr_requests)

    print(xhr_responses)


if __name__ == '__main__':
    uc.loop().run_until_complete(crawl())

Excuse my Python; I have been using the language for less than 10 hours, lol.

NOTE: If I call cdp.network.get_response_body on every request, I get None for all of them, so I had to pick specifically which URLs to add to the xhr_requests variable for it to work.

I hope this helps somehow. I'm looking forward to a better solution, or to examples or an explanation of how to actually do this.
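
The deadlock impression above can be illustrated with a toy model. This is not nodriver's actual implementation; `ToyConnection` and everything in it are hypothetical. It assumes a single listener loop that both resolves command replies and awaits event handlers inline. Under that assumption, a handler that awaits `send` waits for a reply that only the (now stuck) listener could deliver, while a handler that schedules the work as a separate task returns immediately and lets the listener carry on.

```python
import asyncio


class ToyConnection:
    """Toy stand-in for a CDP connection; NOT nodriver's real implementation."""

    def __init__(self):
        self.inbox = asyncio.Queue()  # messages "from the browser"
        self.pending = {}             # command id -> Future awaiting a reply
        self.tasks = []               # keeps references to deferred tasks
        self.handler = None
        self.next_id = 0

    async def send(self, command):
        # Register a future, then pretend the browser answers immediately.
        self.next_id += 1
        fut = asyncio.get_running_loop().create_future()
        self.pending[self.next_id] = fut
        await self.inbox.put(("reply", self.next_id, f"body of {command}"))
        return await fut  # only the listener can resolve this

    async def listener(self):
        # Single reader loop: dispatches events and resolves replies.
        while True:
            kind, key, payload = await self.inbox.get()
            if kind == "event":
                await self.handler(payload)  # handler runs inline
            else:
                self.pending.pop(key).set_result(payload)


def blocking_handler(conn, results):
    async def handler(request_id):
        # Waits for a reply that the listener cannot process while it is here.
        results.append(await conn.send(request_id))

    return handler


def deferring_handler(conn, results):
    async def handler(request_id):
        async def fetch():
            results.append(await conn.send(request_id))

        # Return at once; fetch() runs after the listener is free again.
        conn.tasks.append(asyncio.create_task(fetch()))

    return handler


async def run(handler_factory):
    conn = ToyConnection()
    results = []
    conn.handler = handler_factory(conn, results)
    task = asyncio.create_task(conn.listener())
    await conn.inbox.put(("event", None, "req-1"))
    await asyncio.sleep(0.2)  # give the toy connection time to make progress
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return results


if __name__ == "__main__":
    print("blocking handler: ", asyncio.run(run(blocking_handler)))
    print("deferring handler:", asyncio.run(run(deferring_handler)))
```

Collecting request IDs in the handler and fetching the bodies afterwards, as in the snippet above, is another way of moving the `send` call out of the handler.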


utam-1 commented May 3, 2024

Hi @falmar, your code helped me a lot with my use case. I have a few suggestions for the code you provided:

  1. Checking encoded_data_length on the event: I ran into the issue of receiving None for the response body, so including this check inside the listenXHR function adds an extra safeguard before requesting the body.
  2. Using asyncio.sleep() instead of time.sleep(): This is a minor change, but time.sleep() blocks the event loop, so awaiting asyncio.sleep() is the better choice in async code.
  3. Using asyncio.Lock: Again a minor change. Wrapping the former global variables in a class and guarding them with an asyncio.Lock gives some extra protection against data corruption and race conditions.
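
Point 2 can be seen in a small experiment that needs only the standard library. In the sketch below (all names are made up for the illustration), a background task counts ticks while the main coroutine pauses. With `time.sleep()` the ticker never gets a chance to run during the pause; with `asyncio.sleep()` it keeps ticking.

```python
import asyncio
import time


async def blocking_pause(seconds):
    time.sleep(seconds)  # blocks the whole event loop


async def cooperative_pause(seconds):
    await asyncio.sleep(seconds)  # yields to other tasks while waiting


async def count_ticks(pause):
    ticks = 0

    async def ticker():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    await asyncio.sleep(0)  # let the ticker start
    before = ticks
    await pause(0.2)
    during = ticks - before  # ticks that happened while we were paused
    task.cancel()
    return during


if __name__ == "__main__":
    print("ticks during time.sleep:   ", asyncio.run(count_ticks(blocking_pause)))
    print("ticks during asyncio.sleep:", asyncio.run(count_ticks(cooperative_pause)))
```

In a browser-automation script the "ticker" would be whatever task reads messages from the browser, so a blocking sleep would also stall event delivery for its duration.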

Here's a slightly modified version of the code you provided:

import asyncio
import nodriver as uc
from nodriver import cdp

class RequestMonitor:
    def __init__(self):
        self.requests = []
        self.last_request = None
        self.lock = asyncio.Lock()

    async def listen(self, page):
        async def handler(evt):
            async with self.lock:
                if evt.response.encoded_data_length > 0 and evt.type_ is cdp.network.ResourceType.XHR:
                    # print(f"EVENT PERCEIVED BY BROWSER IS:- {evt.type_}") # If unsure about event or to check behaviour of browser
                    self.requests.append([evt.response.url, evt.request_id])
                    self.last_request = time.time()

        page.add_handler(cdp.network.ResponseReceived, handler)

    async def receive(self, page):
        responses = []
        retries = 0
        max_retries = 5

        # Wait at least 2 seconds after the last XHR request to get some more
        while True:
            if self.last_request is None or retries > max_retries:
                break

            if time.time() - self.last_request <= 2:
                retries += 1
                await asyncio.sleep(2)
                continue
            else:
                break

        await page  # Waiting for page operation to complete.

        # Loop through gathered requests and get its response body
        async with self.lock:
            for request in self.requests:
                try:
                    res = await page.send(cdp.network.get_response_body(request[1]))
                    if res is None:
                        continue
                    responses.append({
                        'url': request[0],
                        'body': res[0],  # Assuming res[0] is the response body
                        'is_base64': res[1]  # Assuming res[1] indicates if response is base64 encoded
                    })
                except Exception as e:
                    print("Error getting body", e)

        return responses

async def crawl():
    browser = await uc.start(headless=False)
    monitor = RequestMonitor()
    tab = browser.main_tab

    await monitor.listenXHR(tab)
    
    # Change URL based on use case.
    tab = await browser.get("https://www.example.com")
    
    await asyncio.sleep(2)

    xhr_responses = await monitor.receiveXHR(tab)

    # Print URL and response body
    for response in xhr_responses:
        print(f"URL: {response['url']}")
        print("Response Body:")
        print(response['body'] if not response['is_base64'] else "Base64 encoded data")

if __name__ == '__main__':
    uc.loop().run_until_complete(crawl())

Apologies if I have made any mistakes and for my English too.


RzNmKX commented May 26, 2024

You're aware this code does not run as posted, correct?


utam-1 commented May 26, 2024

Apologies, I might have made mistakes while modifying it. Could you tell me what issue you're encountering? I ran it on my system and it worked fine for me.
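
Reading the RequestMonitor snippet as posted, two things would stop it from running: `time.time()` is used but `time` is never imported, and `crawl()` calls `monitor.listenXHR(...)` and `monitor.receiveXHR(...)` while the class defines `listen` and `receive`. Below is a trimmed sketch with both fixed. To keep it runnable without a browser, the nodriver-specific pieces are passed in as arguments and a hypothetical `FakePage` stands in for the tab; the lock and the wait loop are left out for brevity. With nodriver the arguments would presumably be `cdp.network.ResponseReceived` and `cdp.network.get_response_body`, which is untested here.

```python
import asyncio
import time  # used by the handler; the posted snippet never imports it
from types import SimpleNamespace


class RequestMonitor:
    def __init__(self):
        self.requests = []
        self.last_request = None

    def listen(self, page, event_type, wanted):
        # Register a handler that records [url, request_id] for wanted events.
        async def handler(evt):
            if wanted(evt):
                self.requests.append([evt.response.url, evt.request_id])
                self.last_request = time.time()

        page.add_handler(event_type, handler)

    async def receive(self, page, make_command):
        # Fetch bodies outside the handler, after the events were recorded.
        responses = []
        for url, request_id in list(self.requests):
            res = await page.send(make_command(request_id))
            if res is None:
                continue
            responses.append({"url": url, "body": res[0], "is_base64": res[1]})
        return responses


class FakePage:
    """Hypothetical stand-in for a tab; returns canned bodies."""

    def __init__(self):
        self.handler = None

    def add_handler(self, event_type, handler):
        self.handler = handler

    async def send(self, command):
        return (f"body for {command}", False)


def fake_event(type_, request_id, url):
    return SimpleNamespace(
        type_=type_, request_id=request_id, response=SimpleNamespace(url=url)
    )


async def demo():
    page = FakePage()
    monitor = RequestMonitor()
    # Call the methods by the names the class actually defines.
    monitor.listen(page, "ResponseReceived", lambda evt: evt.type_ == "XHR")
    await page.handler(fake_event("XHR", "7", "https://example.com/api"))
    await page.handler(fake_event("Image", "8", "https://example.com/logo.png"))
    return await monitor.receive(page, lambda request_id: request_id)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```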
