# <center>Web Scraping (2): Dynamic Web Pages with *Playwright*</center>

References:
- https://playwright.dev/python/docs/intro
- https://oxylabs.io/blog/playwright-web-scraping

<h2> A big picture</h2>

<img src = "https://python.langchain.com/v0.1/assets/images/web_scraping-001e2279b9e46c696012ac26d7f218a2.png" width="80%">

source: https://python.langchain.com/v0.1/docs/use_cases/web_scraping/

## 1. Why Playwright
- So far, we have learned how to scrape **static** HTML pages using **Requests + BeautifulSoup**
- However, if a page relies on **JavaScript or AJAX** to build its content, this combination does not work:
  - Elements in the page are loaded **asynchronously**
     * `requests.get(url)` returns only the initial HTML, before any script has run
     * you may need to wait for a while until the content is fully loaded
  - You need to **interact with the page** to get some content loaded, e.g.
     * scroll down to load more
     * click a button like "more"
     * fill a form
- Example: https://www.quora.com/topic/Machine-Learning

In [1]:
# Exercise 1.1. Scrape the Quora page using requests + BeautifulSoup

# import requests package
import requests                   

# import BeautifulSoup from package bs4 (i.e. beautifulsoup4)
from bs4 import BeautifulSoup   

page = requests.get("https://www.quora.com/topic/Machine-Learning")    # send a get request to the web page

if page.status_code==200:      

    soup = BeautifulSoup(page.content, 'html.parser')
    
    # get all questions
    questions=soup.select("span.q-box.qu-userSelect--text")
    
    for i, q in enumerate(questions):
        print(i, q.get_text())
        print("\n")
    
# Note: nothing is returned. Do you know why?
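
One way to see why nothing comes back: the HTML that `requests.get()` receives may contain only an empty container plus a script, and the question text exists only after a browser runs that script. The snippet below is a minimal offline sketch of that situation using only the standard library. `INITIAL_HTML`, `SpanCollector`, and the `question` class name are made-up stand-ins for illustration, not Quora's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML, similar in spirit to what a JavaScript-driven page may send
INITIAL_HTML = """
<html><body>
  <div id="root"></div>
  <script>
    // A browser would run this script and insert the question text.
    document.getElementById("root").innerHTML =
      '<span class="question">What is machine learning?</span>';
  </script>
</body></html>
"""

class SpanCollector(HTMLParser):
    """Collect the text of every <span class="question"> element found in the markup."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "question") in attrs:
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.texts.append(data)

collector = SpanCollector()
collector.feed(INITIAL_HTML)

# The parser treats the script body as plain text, so no question span is found
print(collector.texts)
```

The span only appears inside the script's source code, which an HTML parser never executes. A real browser, which Playwright drives, runs the script first and then exposes the resulting elements.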

## 2. Playwright
- Playwright is a **Web automation framework** that can automate web browser interactions. It can navigate to URLs, enter text, click buttons, extract text, etc.
- It supports **asynchronous** functions, so it can work with multiple pages at the same time without being blocked while waiting for operations to complete in any of them.
- It drives Chromium (the engine behind Google Chrome and Microsoft Edge), Firefox, and WebKit (the engine behind Safari) through a single API (Selenium needs a separate web driver for each browser).
- Playwright is really useful when you have to perform actions on a website such as:
  - clicking on buttons
  - filling forms
  - scrolling
  - taking a screenshot
  - executing JavaScript code
- Installation: `pip install playwright`, then `playwright install` to download the browser binaries
- For more details, check:
    - https://playwright.dev/python/docs/intro
    - https://oxylabs.io/blog/playwright-web-scraping

## 3. Use of Playwright for Scraping

### 3.1. **Navigating** 
- Navigate to a link: similar to Requests + BeautifulSoup, but with different syntax
- Find elements by id, name, XPath, or CSS selectors
    - For more details, check https://playwright.dev/python/docs/api/class-locator
    * Note **query_selector()** vs. **query_selector<font color='red'>_all</font>()**
        - *query_selector*: returns the first element that matches a CSS selector (or `None` if nothing matches)
        - *query_selector_all*: returns a list of all elements that match a CSS selector
  
|    | requests/BeautifulSoup | Playwright |
| -- |:------------------      |:-----------|
| Navigate to a link |   `requests.get(url)`           | `page.goto(url)`    |
| find elements  | `soup.select()` | `page.query_selector("p")`<br>`page.query_selector_all("p")`|
| get attributes of <br>element (say `p`) | `p.attrs` <br>    `p["class"]` | `p.get_attribute("class")` |
| get text | `p.get_text()` | `p.inner_text()`  or `p.text_content()`|
 

- **Headless browser** : 
    - A web browser without a graphical user interface (GUI), e.g., `playwright.firefox.launch(headless=True)`
    - During web scraping there is often no need to display a browser window. A headless browser can navigate websites and collect public data more quickly
- **`Sync` vs. `Async`**: 
    - Although Playwright provides both `Sync` and `Async` APIs, since Jupyter notebooks use an asyncio event loop, we need to use Playwright's async API as well.
    - Async IO is a single-threaded, single-process design, but uses cooperative multitasking. It is designed for IO-bound tasks.
    - For details about `Sync` and `Async` APIs, please read https://realpython.com/async-io-python/
- **Asynchronous (async/await syntax)**:
    - The **async** keyword defines a coroutine function, which can be suspended and resumed at specific points during its execution.
    - The **await** keyword pauses the current coroutine until the awaited operation completes; meanwhile, the event loop can run other tasks
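
To make the async/await idea concrete, here is a small sketch that uses only `asyncio` and no browser. `fake_page_load` is a hypothetical stand-in for an IO-bound step such as loading a page; the names and the 0.2-second delays are illustrative assumptions.

```python
import asyncio
import time

async def fake_page_load(name: str, seconds: float) -> str:
    """Stand-in for an IO-bound task such as loading a page (hypothetical helper)."""
    # await hands control back to the event loop while this coroutine waits
    await asyncio.sleep(seconds)
    return f"{name} loaded"

async def main() -> list:
    start = time.perf_counter()
    # Both coroutines wait at the same time, so the total is close to 0.2s, not 0.4s
    results = await asyncio.gather(
        fake_page_load("page A", 0.2),
        fake_page_load("page B", 0.2),
    )
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")
    return results

# In Jupyter:   await main()
# In a script:  asyncio.run(main())
```

Everything runs on a single thread: the two "loads" overlap only because each coroutine gives up control at its `await`.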

In [43]:
import asyncio
from playwright.async_api import async_playwright

# Use Playwright async API
async with async_playwright() as playwright:
    
    # Launch Firefox in non-headless mode (visible UI)
    browser = await playwright.firefox.launch(headless=False)
    
    # For a Chromium browser, use the following instead
    #browser = await playwright.chromium.launch(headless=False)
    
    # Open a new browser page
    page = await browser.new_page()
    
    # Navigate to the given URL
    await page.goto("https://www.quora.com/topic/Machine-Learning")
    
    # Wait for 5 seconds so that the page can load
    await asyncio.sleep(5)
    
    await browser.close()
    

### 3.2 Locating and Scraping Elements


For details of locator, see https://playwright.dev/docs/other-locators#introduction

In [45]:
# Use Playwright async API
async with async_playwright() as playwright:
    
    # Launch Firefox in non-headless mode (visible UI)
    browser = await playwright.firefox.launch(headless=False)
    
    # For a Chromium browser, use the following instead
    #browser = await playwright.chromium.launch(headless=False)
    
    # Open a new browser page
    page = await browser.new_page()
    
    # Navigate to the given URL
    await page.goto("https://www.quora.com/topic/Machine-Learning")
    
    # wait until page has been loaded
    await asyncio.sleep(5)
    
    # query all elements by CSS selector
    all_questions = await page.query_selector_all('span.q-box.qu-userSelect--text')
    
    # Collect the elements
    qs =[]
    for i, q in enumerate(all_questions):

        # query the span element under each question
        q = await q.query_selector('span')

        # get the text
        text = await q.inner_text()
        
        print(f"{i}:  {text}\n")
        
        qs.append(text)

    await browser.close()


0:  What is your opinion of DeepSeek vs other US LLMs in terms of intellectual capital, Nvidia, and the efficacy of export bans in the West vs China battleground?

1:  Why is everyone so obsessed with AI? What is it actually going to provide in the future?

2:  You can train any 8 year old kid to drive well. Why is it taking years and the full might of supercomputer AI to get a computer to do it?

3:  How do publishing platforms detect AI-generated content in books?

4:  How do Americans feel knowing about deep seek and chat GPT?

5:  Will AI replace programmers?

6:  What are the best sites for learning Data Science?

7:  What is the Candy AI app?

8:  Will AI replace programmers?

9:  How should software engineers prepare for the possibility of getting replaced by AI by 2025?
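
Because `query_selector` returns `None` when nothing matches, the inner `q.query_selector('span')` call above can fail on elements that have no nested span. The helper below is one possible way to guard against that; `collect_texts` is a hypothetical name, and `page` is assumed to be a Playwright async page (or any object with the same methods).

```python
import asyncio

async def collect_texts(page, outer_selector: str, inner_selector: str = "span") -> list:
    """Return the stripped inner text of each matched element, skipping empty matches.

    `page` is assumed to be a Playwright async Page (or anything with the same methods).
    """
    texts = []
    for element in await page.query_selector_all(outer_selector):
        # query_selector returns None when nothing matches, so check before using it
        inner = await element.query_selector(inner_selector)
        if inner is None:
            continue
        text = (await inner.inner_text()).strip()
        if text:
            texts.append(text)
    return texts
```

Inside the `async with` block it could be called as `qs = await collect_texts(page, 'span.q-box.qu-userSelect--text')`.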



### 3.3. Simulating User Actions in a Web Browser

  - click a button
    * e.g. submit_button.click()
  - fill a form
    * e.g. text_box.fill("enter some text")
  - scroll page down or up
    * e.g. page.keyboard.press('End')
  ...
  - For details see https://playwright.dev/python/docs/input
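
The three kinds of action listed above can be combined in a few lines. The following is a sketch only: `submit_search` and both default selectors are hypothetical and would need to be replaced with selectors from the page you are actually scraping.

```python
import asyncio

async def submit_search(page, query: str,
                        box_selector: str = "input[name='q']",
                        button_selector: str = "button[type='submit']") -> None:
    """Type `query` into a text box and click a button (both selectors are hypothetical)."""
    await page.fill(box_selector, query)   # fill a form field
    await page.click(button_selector)      # click a button
    await page.keyboard.press("End")       # press the End key to jump to the page bottom
```

With a real Playwright page, this would be called as `await submit_search(page, "machine learning")` after `page.goto(...)`.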

**Click a link**


Click the "more" link to expand the full text

In [46]:
async with async_playwright() as playwright:
    
    # Launch Firefox in non-headless mode (visible UI)
    browser = await playwright.firefox.launch(headless=False)
    
    # For a Chromium browser, use the following instead
    #browser = await playwright.chromium.launch(headless=False)
    
    # Open a new browser page
    page = await browser.new_page()
    
    # Navigate to the given URL
    await page.goto("https://www.quora.com/topic/Machine-Learning")
    
    # wait until page has been loaded
    await asyncio.sleep(5)
    
    # locate the first "more" link
    more = await page.query_selector("div.q-text.qu-cursor--pointer.qt_read_more")
    
    # click the link
    await more.click()
    
    # wait for 5 seconds so that the content can be loaded
    await asyncio.sleep(5)
    
    await browser.close()
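
A fixed `asyncio.sleep(5)` either wastes time or is too short on a slow connection, and `query_selector` returns `None` if the link has not appeared yet. As a possible alternative, the sketch below waits for the element with `page.wait_for_selector` before clicking. `click_when_ready` is a hypothetical helper name, and the 5000 ms default is an arbitrary choice.

```python
import asyncio

async def click_when_ready(page, selector: str, timeout_ms: int = 5000) -> bool:
    """Wait for `selector` to appear, then click it. Returns False if it never appears."""
    try:
        # wait_for_selector resolves once the element is present and visible,
        # and raises a timeout error after timeout_ms milliseconds
        element = await page.wait_for_selector(selector, timeout=timeout_ms)
    except Exception:  # Playwright raises its own TimeoutError; kept broad for this sketch
        return False
    if element is None:
        return False
    await element.click()
    return True
```

For the cell above, the call would be `await click_when_ready(page, "div.q-text.qu-cursor--pointer.qt_read_more")`.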


**Scroll down**


Scroll down the page 5 times to collect more questions

In [47]:
async with async_playwright() as playwright:
    
    # Launch Firefox in non-headless mode (visible UI)
    browser = await playwright.firefox.launch(headless=False)

    # For a Chromium browser, use the following instead
    #browser = await playwright.chromium.launch(headless=False)

    # Open a new browser page
    page = await browser.new_page()

    # Navigate to the given URL
    await page.goto("https://www.quora.com/topic/Machine-Learning")

    # wait until page has been loaded
    await asyncio.sleep(5)

    # scroll down 5 times
    for i in range(5):

        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")

        # wait for the content to be loaded
        await asyncio.sleep(2)   
    
    # query all elements by CSS selector
    all_questions = await page.query_selector_all('span.q-box.qu-userSelect--text')
        
    qs =[]
    for i, q in enumerate(all_questions):

        # query the span element under each question
        q = await q.query_selector('span')

        # get the text
        text = await q.inner_text()
        
        print(f"{i}:  {text}\n")
        
        qs.append(text)
        
    await browser.close()

0:  What is your opinion of DeepSeek vs other US LLMs in terms of intellectual capital, Nvidia, and the efficacy of export bans in the West vs China battleground?

1:  Why is everyone so obsessed with AI? What is it actually going to provide in the future?

2:  You can train any 8 year old kid to drive well. Why is it taking years and the full might of supercomputer AI to get a computer to do it?

3:  How do publishing platforms detect AI-generated content in books?

4:  How do Americans feel knowing about deep seek and chat GPT?

5:  Will AI replace programmers?

6:  What are the best sites for learning Data Science?

7:  What is the Candy AI app?

8:  Will AI replace programmers?

9:  How should software engineers prepare for the possibility of getting replaced by AI by 2025?

10:  What is the best AI image generator?

11:  How do I change my hairstyle naturally with AI?

12:  Are AIs prone to being fearful and therefore wanting to own firearms like many fearful American humans?

13:

## 4. Exercise


Scrape **all questions** on the page

In [49]:
# define an async function

async def scrape():
    
    async with async_playwright() as playwright:
        
        # Launch a visible browser. Set headless=True to make the browser invisible
        browser = await playwright.firefox.launch(headless=False)
        
        # open a new page and navigate to the URL
        page = await browser.new_page()
        await page.goto('https://www.quora.com/topic/Machine-Learning')
        
        # pause for 50 milliseconds (wait_for_timeout takes milliseconds, not seconds)
        await page.wait_for_timeout(50)
        
        # obtain initial html content
        html = await page.content()
        
        # keep scrolling until "break"
        while True:
            
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
            
            # Wait 5 seconds (or longer) for new content to load
            await asyncio.sleep(5)
            
            # Retrieve page content to check for updates
            updated_html = await page.content()
            
            print(f"length of html: {len(updated_html)}")
            
            # if no new content is loaded, stop scrolling
            if updated_html == html:
                break
            else:
                html = updated_html
        
        # query all elements by CSS selector
        all_questions = await page.query_selector_all('span.q-box.qu-userSelect--text')
        
        qs =[]
        for i, q in enumerate(all_questions):
            
            # query the span element under each question
            q = await q.query_selector('span')
            
            # get the text
            text = await q.inner_text()
            
            print(f"{i}:  {text}\n")
            
            qs.append(text)
        
        
        await browser.close()
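
The `while True` loop in `scrape()` never ends on a feed that keeps loading, and comparing full HTML strings may report a change even when no new items were added. One possible variation is sketched below: it compares the page height instead and caps the number of scrolls. `scroll_to_bottom`, `pause_s`, and `max_scrolls` are hypothetical names and defaults, not part of Playwright.

```python
import asyncio

async def scroll_to_bottom(page, pause_s: float = 2.0, max_scrolls: int = 30) -> int:
    """Scroll until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    last_height = await page.evaluate("document.body.scrollHeight")
    for n in range(1, max_scrolls + 1):
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        await asyncio.sleep(pause_s)   # give new content time to load
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == last_height:  # nothing new was appended
            return n
        last_height = new_height
    return max_scrolls
```

Inside `scrape()`, the scrolling loop could then be replaced by `await scroll_to_bottom(page, pause_s=5)`.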

In [50]:
# Execute the scrape function in Jupyter
await scrape()

# In a standalone script (outside Jupyter), run the asynchronous coroutine with
#asyncio.run(scrape())

length of html: 372655
length of html: 547833
length of html: 722124
length of html: 888158
length of html: 937923
length of html: 937923
0:  What is your opinion of DeepSeek vs other US LLMs in terms of intellectual capital, Nvidia, and the efficacy of export bans in the West vs China battleground?

1:  Why is everyone so obsessed with AI? What is it actually going to provide in the future?

2:  You can train any 8 year old kid to drive well. Why is it taking years and the full might of supercomputer AI to get a computer to do it?

3:  How do publishing platforms detect AI-generated content in books?

4:  How do Americans feel knowing about deep seek and chat GPT?

5:  Will AI replace programmers?

6:  What are the best sites for learning Data Science?

7:  What is the Candy AI app?

8:  Will AI replace programmers?

9:  How should software engineers prepare for the possibility of getting replaced by AI by 2025?

10:  What is the best AI image generator?

11:  How do I change my hairs