### **Handling Dynamic Content**
Many modern websites use JavaScript to dynamically load content after the initial page load (e.g., using frameworks like React or Angular). This means the content you need to scrape might not be present in the raw HTML source when the page first loads.
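
To make the point above concrete, here is a minimal sketch that uses only the standard library. The `RAW_HTML` string is a made-up page (not taken from any real site) whose list is filled in by a script; `ListItemCounter` and `count_list_items` are hypothetical helper names. Parsing the raw markup finds no list items, because the script has never run.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML, as an HTTP client would receive it before any
# JavaScript runs. The list is empty; the script would fill it in a browser.
RAW_HTML = """
<html>
  <body>
    <ul id="books"></ul>
    <script>
      // In a real browser this adds list items after the page loads.
      const list = document.getElementById("books");
      ["Book A", "Book B"].forEach(function (title) {
        const item = document.createElement("li");
        item.textContent = title;
        list.appendChild(item);
      });
    </script>
  </body>
</html>
"""


class ListItemCounter(HTMLParser):
    """Count <li> start tags that appear in the markup itself."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.count += 1


def count_list_items(html: str) -> int:
    """Return the number of <li> elements present in the given HTML text."""
    parser = ListItemCounter()
    parser.feed(html)
    return parser.count


# The parser sees the script only as text, so no list items are found.
print(count_list_items(RAW_HTML))  # prints 0
```

A browser that executes the script would show two items, which is the gap a tool driving a real browser is meant to close.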

### Selenium
Selenium stands out from other scraping tools because it interacts with a real browser. This makes it particularly useful for websites that rely heavily on JavaScript, AJAX, or dynamic content loading. Selenium can handle:
- Pages that load content dynamically with JavaScript.
- Interaction with forms, buttons, and scroll events.
- Real browser interactions.

In [2]:
from selenium import webdriver
bookstoscrape_url = "https://books.toscrape.com/"

In [None]:
# TODO: Spend 5 minutes searching the web for 'selenium python' and summarize what you found in a few short sentences.

# Install the package
# pip install selenium

1. Selenium is a powerful web automation framework that can control browser actions programmatically
2. Key features found:
    - Supports multiple browsers (Chrome, Firefox, Safari, Edge)
    - Can automate form filling and button clicking
    - Handles dynamic JavaScript content
    - Supports waiting for elements to load
    - Can take screenshots and handle alerts
3. Common use cases:
    - Web scraping of JavaScript-heavy sites
    - Automated testing of web applications
    - Browser automation for repetitive tasks
    - Form automation and data entry
    - Cross-browser testing
4. Popular with Python because:
    - Simple syntax and good documentation
    - Large community support
    - Many available tutorials and examples
    - Integration with testing frameworks
    - Active development and updates
5. Recent updates include:
    - Selenium 4.x with improved stability
    - Better handling of shadow DOM
    - Enhanced WebDriver protocols
    - Improved error messages
    - Better performance and reliability

### Basic Usage of Selenium

In [3]:
"""
Objective: Creating a browser instance
"""
firefox_driver = webdriver.Firefox()

In [4]:
""" 
Objective: Opening bookstoscrape_url
"""
firefox_driver.get(bookstoscrape_url)

In [6]:
""" 
Objective: Save the page content to the html variable, then use BeautifulSoup to extract the data
"""
html = firefox_driver.page_source


In [7]:
""" 
Objective: Close the browser instance
"""
firefox_driver.quit()
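
The four steps above (create, open, read, quit) can be wrapped in one function so the browser is closed even when loading fails. This is a minimal sketch, not part of the original exercise: `fetch_page_source` and its `make_driver` parameter are hypothetical names, and the function only relies on the `get`, `page_source` and `quit` members already used above.

```python
from typing import Any, Callable


def fetch_page_source(make_driver: Callable[[], Any], url: str) -> str:
    """Open `url` in a fresh browser and return the rendered HTML.

    `make_driver` is any zero-argument callable that returns a WebDriver,
    for example `webdriver.Firefox`.
    """
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Runs even if get() raises, so no browser process is left behind.
        driver.quit()


# Example usage (assumes selenium and a Firefox driver are installed):
# from selenium import webdriver
# html = fetch_page_source(webdriver.Firefox, "https://books.toscrape.com/")
```

Passing the driver class instead of a driver instance keeps the browser's whole lifetime inside the function.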

### Browser Instance Variance

In [None]:
""" 
Objective: Compare the difference between Firefox webdriver and Chrome webdriver
"""
# TODO: Create a Chrome browser instance
# TODO: Create a Firefox browser instance
# TODO: Analyze the difference between firefox and chrome browser instance

chrome_driver = webdriver.Chrome()
chrome_driver.get(bookstoscrape_url)

firefox_driver = webdriver.Firefox()
firefox_driver.get(bookstoscrape_url)

# Analyze differences
print("Chrome Browser Info:")
print(f"Title: {chrome_driver.title}")
print(f"Window size: {chrome_driver.get_window_size()}")
print(f"User Agent: {chrome_driver.execute_script('return navigator.userAgent')}")

print("\nFirefox Browser Info:")
print(f"Title: {firefox_driver.title}") 
print(f"Window size: {firefox_driver.get_window_size()}")
print(f"User Agent: {firefox_driver.execute_script('return navigator.userAgent')}")

# Chrome Browser Info:
# Title: All products | Books to Scrape - Sandbox
# Window size: {'width': 1050, 'height': 708}
# User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36

# Firefox Browser Info:
# Title: All products | Books to Scrape - Sandbox
# Window size: {'width': 1382, 'height': 744}
# User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:134.0) Gecko/20100101 Firefox/134.0

# Key differences:
# 1. Different default window sizes
# 2. Different user agents
# 3. Different rendering engines (Blink in Chrome, Gecko in Firefox; Chrome's user agent still reports AppleWebKit/537.36 for compatibility)



Firefox Browser Info:
Title: All products | Books to Scrape - Sandbox
Window size: {'width': 1382, 'height': 744}
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:134.0) Gecko/20100101 Firefox/134.0


In [None]:
""" 
Objective: Change the window size in Chrome webdriver

Set the window size to a particular size x and y --> driver.set_window_size(x, y)
Set the window size to max --> driver.maximize_window()
Get the current window size --> driver.get_window_size()
"""
# TODO: Change the window size of the firefox browser instance to 200 and 400
# TODO: Print the current window size of the chrome browser instance
# TODO: Change the window size of the chrome browser instance to full screen

# Change Firefox window size to 200x400
firefox_driver.set_window_size(200, 400)

# Print Chrome window size
print("Chrome window size:", chrome_driver.get_window_size())

# Maximize the Chrome window (maximize_window() is also available on the Firefox driver)
chrome_driver.maximize_window()

Firefox window size: {'width': 1382, 'height': 744}


In [None]:
# TODO: Close all the browser instances using the quit method
chrome_driver.quit()
firefox_driver.quit()

In [18]:
""" 
Objective: Applying browser variance using Options
"""
# TODO: Execute this cell before continuing
# TODO: Choose one of these 2 methods to add options to the webdriver instance

## Method 1:
from selenium.webdriver.chrome.options import Options
options = Options()

## Method 2:
# options = webdriver.ChromeOptions()

# I chose Method 1 as it provides better code readability and is more explicit 
# about what we're importing. Both methods work equally well for adding browser options.

In [20]:
""" 
Objective: Change the window to full size with Options
Options is a way to create a browser instance with customization.

Example:
## Create the options object using one of the two methods from the previous cell
# options = Options()

## Add customization as argument
options.add_argument("--start-maximized")

## You can add multiple arguments one after another
options.add_argument("argument 1")
options.add_argument("argument 2")
## Experimental options take a name and a value
options.add_experimental_option("name", value)

## Add options object to webdriver instance
driver = webdriver.Chrome(options=options)
"""
# TODO: Create options object
# TODO: Add argument to set the window size to max
# TODO: Create a browser instance with options

# Create options object
from selenium.webdriver.chrome.options import Options
options = Options()

# Add argument to set window size to max
options.add_argument("--start-maximized")

# Experimental options take a name and a value; this one hides the
# "controlled by automated test software" infobar
options.add_experimental_option("excludeSwitches", ["enable-automation"])


# Create browser instance with options
driver = webdriver.Chrome(options=options)
driver.get(bookstoscrape_url)

In [24]:
""" 
Objective: Run a headless browser
"""
# TODO: Create options object
# TODO: Add argument "--headless" to the options
# TODO: Create a browser instance with options
# TODO: Open bookstoscrape_url using get method
# TODO: print the page title using driver.title
# TODO: Close the browser instance using quit method

# Create options object
from selenium.webdriver.chrome.options import Options
options = Options()

# Add headless argument
options.add_argument("--headless")

# Create browser instance with options
driver = webdriver.Chrome(options=options)

# Open URL
driver.get(bookstoscrape_url)

# Print page title
print("Page title:", driver.title)

# Close browser
driver.quit()



Page title: All products | Books to Scrape - Sandbox


In [25]:
""" 
Objective: Open a page without loading images
"""
# TODO: Create options object
# TODO: Add argument "--blink-settings=imagesEnabled=false" to the options
# TODO: Create a browser instance with options
# TODO: Open bookstoscrape_url using get method

# Create options object
from selenium.webdriver.chrome.options import Options
options = Options()

# Add argument to disable images
options.add_argument("--blink-settings=imagesEnabled=false")

# Create browser instance with options and open URL
driver = webdriver.Chrome(options=options)
driver.get(bookstoscrape_url)

In [26]:
""" 
Objective: Explore any other options available
options.add_argument("--disable-gpu")  # Disable GPU rendering
options.add_argument("--no-sandbox")  # Disable sandbox for Docker
options.add_argument("--disable-dev-shm-usage")  # Prevent shared memory issues
options.add_argument("--disable-extensions")  # Disable extensions
options.add_argument("--disable-infobars")  # Remove info bars
"""
# TODO: Explore the official documentation from https://www.selenium.dev/documentation/webdriver/browsers/chrome/
# Create options object with additional Chrome arguments
from selenium.webdriver.chrome.options import Options
options = Options()

# Performance and environment options
options.add_argument("--disable-gpu")  # Disable GPU hardware acceleration
options.add_argument("--disable-dev-shm-usage")  # Use /tmp instead of /dev/shm (helps in containers with little shared memory)
options.add_argument("--no-sandbox")  # Disable the sandbox security feature (use only with trusted pages)

# User experience options
options.add_argument("--disable-notifications")  # Block notification prompts
options.add_argument("--disable-popup-blocking")  # Allow popups
options.add_argument("--disable-infobars")  # Remove Chrome info bars (may have no effect in recent Chrome versions)

# Privacy options
options.add_argument("--incognito")  # Run in incognito mode
options.add_argument("--disable-extensions")  # Disable browser extensions

# Create and run browser with all options
driver = webdriver.Chrome(options=options)
driver.get(bookstoscrape_url)

# Clean up
driver.quit()
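
The cells above repeat the same "create options, add arguments" steps. One way to keep that in one place is a small helper that builds the argument list and applies it to an options object. This is a sketch under assumptions: `chrome_arguments` and `apply_arguments` are hypothetical names, and `--window-size=WIDTH,HEIGHT` is a Chrome switch that does not appear in the cells above.

```python
def chrome_arguments(headless=False, load_images=True, window_size=None):
    """Return a list of Chrome command-line switches for common scraping setups."""
    args = []
    if headless:
        args.append("--headless")
    if not load_images:
        args.append("--blink-settings=imagesEnabled=false")
    if window_size is not None:
        width, height = window_size
        # Assumption: fixed window size via Chrome's --window-size switch.
        args.append(f"--window-size={width},{height}")
    return args


def apply_arguments(options, args):
    """Add every switch in `args` to an options object and return it."""
    for arg in args:
        options.add_argument(arg)
    return options


print(chrome_arguments(headless=True, load_images=False))
# prints ['--headless', '--blink-settings=imagesEnabled=false']

# Example usage (assumes selenium and Chrome are installed):
# from selenium import webdriver
# from selenium.webdriver.chrome.options import Options
# options = apply_arguments(Options(), chrome_arguments(headless=True))
# driver = webdriver.Chrome(options=options)
```

Building the list separately makes it easy to print or log exactly which switches a run used.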

In [27]:
""" 
Objective: Understand what information from a webdriver instance we can get
"""
# TODO: Create a new browser instance with any options you like
# TODO: Open any website from your own preference
# TODO: Print the current page title by using driver.title
# TODO: Print the current url by using driver.current_url
# TODO: Get the HTML content by using driver.page_source
# TODO: Compare the HTML content from selenium with HTML content from requests then provide your insight

# Import required libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import requests

# Create browser instance with some options
options = Options()
options.add_argument("--start-maximized")
options.add_argument("--disable-notifications")
driver = webdriver.Chrome(options=options)

# Open a dynamic website (GitHub as an example)
github_url = "https://github.com"
driver.get(github_url)

# Get information from Selenium
print("=== Selenium WebDriver Info ===")
print(f"Page Title: {driver.title}")
print(f"Current URL: {driver.current_url}")
selenium_html = driver.page_source

# Get HTML content using requests (the timeout prevents the call from hanging indefinitely)
response = requests.get(github_url, timeout=10)
requests_html = response.text

# Compare HTML content lengths
print("\n=== HTML Content Comparison ===")
print(f"Selenium HTML length: {len(selenium_html)}")
print(f"Requests HTML length: {len(requests_html)}")
print("\nInsight: Selenium HTML content is typically larger because it includes")
print("dynamically generated content from JavaScript execution")

# Clean up
driver.quit()

=== Selenium WebDriver Info ===
Page Title: GitHub · Build and ship software on a single, collaborative platform · GitHub
Current URL: https://github.com/

=== HTML Content Comparison ===
Selenium HTML length: 303903
Requests HTML length: 280820

Insight: Selenium HTML content is typically larger because it includes
dynamically generated content from JavaScript execution
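
String length is a blunt measure of how two HTML documents differ. A slightly more informative comparison is to count tags in each version and report where the counts diverge. The sketch below uses only the standard library; `TagCounter`, `count_tags` and `tag_differences` are hypothetical helper names, and the two sample strings are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each start tag appears."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def count_tags(html: str) -> Counter:
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    return parser.tags


def tag_differences(static_html: str, rendered_html: str) -> dict:
    """Return {tag: rendered_count - static_count} for tags whose counts differ."""
    static_tags = count_tags(static_html)
    rendered_tags = count_tags(rendered_html)
    differences = {}
    for tag in sorted(set(static_tags) | set(rendered_tags)):
        delta = rendered_tags[tag] - static_tags[tag]
        if delta != 0:
            differences[tag] = delta
    return differences


# Made-up sample: the rendered version has two extra list items.
print(tag_differences("<ul></ul>", "<ul><li>a</li><li>b</li></ul>"))
# prints {'li': 2}
```

With the variables from the cell above, the call would be `tag_differences(requests_html, selenium_html)`; positive numbers mark tags that only appear after JavaScript has run.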


### **Reflection**
Which is faster for retrieving HTML content: sending an HTTP request directly using the requests library or creating a browser instance with Selenium?

The requests library is significantly faster than Selenium for retrieving HTML content. Here's why:

1. Requests Library:
    - Makes direct HTTP calls to the server
    - Only downloads the initial HTML content
    - Minimal overhead and resource usage
    - No browser initialization required
    - Typically takes milliseconds to complete
2. Selenium:
    - Needs to start up a complete browser instance
    - Loads all resources (CSS, JavaScript, images)
    - Executes JavaScript code
    - Waits for dynamic content to load
    - Can take several seconds to initialize and load

However, it's important to note that while requests is faster, it can't:

- Execute JavaScript
- Render dynamic content
- Handle user interactions
- Access content that requires browser capabilities

Choose based on your needs:

- Use requests for static content or simple API calls
- Use Selenium when you need JavaScript execution, dynamic content, or browser interaction
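
The speed difference described above can be measured rather than assumed. Below is a minimal timing sketch; `time_call` is a hypothetical helper name, and the commented usage lines assume `requests`, `selenium` and a Chrome driver are installed. Actual numbers depend on the machine and network, so none are shown here.

```python
import time


def time_call(func, *args, **kwargs):
    """Call `func` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example usage (not executed here):
# import requests
# from selenium import webdriver
#
# def fetch_with_selenium(url):
#     driver = webdriver.Chrome()
#     try:
#         driver.get(url)
#         return driver.page_source
#     finally:
#         driver.quit()
#
# _, requests_seconds = time_call(requests.get, "https://books.toscrape.com/", timeout=10)
# _, selenium_seconds = time_call(fetch_with_selenium, "https://books.toscrape.com/")
# print(f"requests: {requests_seconds:.2f}s, selenium: {selenium_seconds:.2f}s")
```

Timing the Selenium path from driver creation to `quit()` keeps the comparison fair, since browser startup is part of its cost.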

### **Exploration**
Explore what can be done manually in a real browser and what can be achieved using Selenium.

Here's a comparison of manual browser actions and their Selenium automation equivalents:

1. Navigation
- Manual: Type URLs, click links, use back/forward buttons
- Selenium:

In [None]:
driver.get(url)              # Navigate to URL
driver.back()                # Go back
driver.forward()             # Go forward
driver.refresh()             # Refresh page

2. Form Interactions
- Manual: Fill forms, check boxes, select dropdowns
- Selenium:

In [None]:
from selenium.webdriver.support.ui import Select

element.send_keys("text")    # Type text
element.click()              # Click elements
element.submit()             # Submit forms
Select(element).select_by_visible_text("option")  # Select dropdown

3. Mouse/Keyboard Actions
- Manual: Hover, drag-drop, right-click
- Selenium:

In [None]:
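from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(driver)
actions.move_to_element(element)     # Hover
actions.drag_and_drop(source, target)  # Drag and drop
actions.context_click()              # Right click at the current pointer position
actions.perform()                    # Execute the queued actions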

4. Window Management
- Manual: Switch tabs, resize window, scroll
- Selenium:

In [None]:
driver.switch_to.window(handle)      # Switch tabs
driver.set_window_size(width, height)  # Resize
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")  # Scroll

5. Alert Handling
- Manual: Interact with popups and alerts
- Selenium:

In [None]:
alert = driver.switch_to.alert
alert.accept()              # Click OK
alert.dismiss()            # Click Cancel
alert.send_keys("text")    # Type in prompt

6. Screenshots
- Manual: Use screenshot tools
- Selenium:

In [None]:
driver.save_screenshot("screenshot.png")  # Visible viewport (not the full page)
element.screenshot("element.png")         # Specific element

7. Cookies/Storage
- Manual: View developer tools, manage cookies
- Selenium:

In [None]:
driver.add_cookie({"name": "cookie_name", "value": "value"})
driver.get_cookies()       # Get all cookies
driver.delete_all_cookies()  # Clear cookies

8. Network Monitoring
- Manual: Use developer tools network tab
- Selenium with CDP (Chrome DevTools Protocol):

In [None]:
driver.execute_cdp_cmd('Network.enable', {})  # Enable network monitoring
driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': request_id})  # Get response
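
The single scroll call from item 4 is often repeated on pages that load more content as you scroll. The sketch below scrolls until the page height stops growing; `scroll_to_bottom`, `pause` and `max_rounds` are hypothetical names, and the fixed pause is a simplification (a real page may need a longer wait or an explicit wait condition).

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing.

    Returns the number of scrolls performed. Stops after `max_rounds`
    scrolls even if the page keeps growing.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # Give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # Nothing new was loaded
        last_height = new_height
    return max_rounds


# Example usage (assumes an open driver on a page with lazy-loaded content):
# scrolls = scroll_to_bottom(driver, pause=2.0)
# html = driver.page_source
```

The `max_rounds` cap guards against pages that never stop growing.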

Essentially, almost anything you can do manually in a browser can be automated with Selenium, making it powerful for:

- Web testing
- Data scraping
- Process automation
- Form filling
- Browser behavior simulation
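
As one illustration of combining the two tools discussed in this section, the cookies from item 7 can be converted into the plain dictionary form that `requests` accepts, so later pages can be fetched without a browser. This is a sketch under assumptions: `cookies_as_dict` is a hypothetical helper, and whether a site accepts the reused cookies depends on the site.

```python
def cookies_as_dict(driver):
    """Convert Selenium's list of cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}


# Example usage (assumes an open driver and the requests library):
# import requests
# response = requests.get(url, cookies=cookies_as_dict(driver), timeout=10)
```

Only the name and value of each cookie are kept; attributes such as domain, path and expiry are dropped in this simplified form.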