# Scraping basics for Playwright

This notebook combines a handful of small scraping techniques with an introduction to Playwright. Along with the class notes, the [scraping section](https://jonathansoma.com/everything/scraping/) of my Everything I Know site might be helpful.

## Imports

Import what you need to use Playwright, and start up a new browser to use for scraping. 

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [1]:
!pip install playwright


In [2]:
!playwright install

In [3]:
from playwright.async_api import async_playwright

In [4]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()
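
Since `page` stays open for the whole notebook, the navigate, wait, and read-HTML steps can be wrapped in one small function instead of launching a new browser in every section. This is only a sketch: `fetch_html` and its `wait_selector` parameter are hypothetical names, not part of Playwright, and it assumes you pass in the `page` created above.

```python
async def fetch_html(page, url, wait_selector=None):
    """Navigate an already-open Playwright page and return its rendered HTML."""
    await page.goto(url)
    if wait_selector is not None:
        # Block until JavaScript has added a matching element to the page
        await page.wait_for_selector(wait_selector)
    return await page.content()
```

In a notebook cell you could then write something like `html = await fetch_html(page, "http://jonathansoma.com/columbia/interactive-scrape/by-class.html", ".title")`.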

## Scraping by class

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-class.html using their **class name**, printing out the title, subhead, and byline.

In [5]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/by-class.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' method='GET'>>

In [6]:
html = await page.content()
html

'<!DOCTYPE html><html><head><script>\n    const html = `\n<h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Probably using Playwright</h3>\n<p class="byline">By Jonathan Soma</p>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector(\'body\').innerHTML = html\n}, 250)</script>\n</head><body>\n\n</body></html>'

In [7]:
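from bs4 import BeautifulSoup

# The page builds its body with JavaScript after a 250 ms delay, so the HTML
# captured above still has an empty <body>. Without a wait, find() returns
# None and .text raises AttributeError. Wait for the title, then re-read.
await page.wait_for_selector('.title')
html = await page.content()

# Parse the rendered HTML. Notice: there's no requests anywhere!
soup_doc = BeautifulSoup(html, 'html.parser')

In [8]:
title = soup_doc.find(class_='title').text.strip()
subhead = soup_doc.find(class_='subhead').text.strip()
byline = soup_doc.find(class_='byline').text.strip()

# Named summary rather than list so the built-in list() isn't shadowed
summary = f"Title: {title}, Subhead: {subhead}, Byline: {byline}"
print(summary)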
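
When `find` matches nothing it returns `None`, and calling `.text` on `None` raises `AttributeError: 'NoneType' object has no attribute 'text'`. The sketch below shows one way to make that case explicit. `text_or_none` is a hypothetical helper name, and both HTML snippets are made-up stand-ins for a page before and after its JavaScript has run.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, class_name):
    """Return the stripped text of the first tag with this class, or None."""
    tag = soup.find(class_=class_name)
    return tag.text.strip() if tag is not None else None


# Stand-in for a page whose JavaScript has not filled in the body yet
empty_page = BeautifulSoup("<html><body></body></html>", "html.parser")
# Stand-in for the same page after the content has appeared
loaded_page = BeautifulSoup('<h1 class="title"> How to Scrape Things </h1>', "html.parser")

print(text_or_none(empty_page, "title"))   # None
print(text_or_none(loaded_page, "title"))  # How to Scrape Things
```

If you get `None` back for something you can see in the browser, the HTML was probably captured before the page finished rendering.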

## Scraping using a single tag

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-list.html, creating a dictionary out of the title, subhead, and byline.

In [None]:
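# Load the page for this section and wait for the JavaScript-rendered content
# (assumes this page fills in its h1 the same way the previous one does)
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/by-list.html")
await page.wait_for_selector('h1')
soup_doc = BeautifulSoup(await page.content(), 'html.parser')

title = soup_doc.find('h1').text.strip()
subhead = soup_doc.find('h3').text.strip()
byline = soup_doc.find('p').text.strip()

# Named article rather than dict so the built-in dict() isn't shadowed
article = {
    'Title': title,
    'Subhead': subhead,
    'Byline': byline
}

article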

## Waiting

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html just like you did above, but use **wait_for** to wait for the text "Everything has shown up" to appear.

In [None]:
from playwright.async_api import async_playwright

playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html")

# Wait for the specific text or element to appear
await page.wait_for_selector("text='Everything has shown up'")

# Get the page content
html = await page.content()

from bs4 import BeautifulSoup

soup_doc = BeautifulSoup(html, 'html.parser')

# Grab the text of the first four paragraphs, in page order
paragraphs = soup_doc.find_all('p')
lines = [p.text.strip() for p in paragraphs[:4]]

# Use a list here: curly braces without keys build a set,
# which is unordered and silently drops duplicates
lines
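
Besides `page.wait_for_selector`, Playwright locators have their own `wait_for` method, which pairs naturally with `page.get_by_text`. A minimal sketch, assuming an already-open `page`; `wait_for_text` and `timeout_ms` are hypothetical names chosen for illustration.

```python
async def wait_for_text(page, text, timeout_ms=10_000):
    """Pause until an element containing `text` is visible on the page."""
    # get_by_text only builds a locator; wait_for is what actually blocks
    await page.get_by_text(text).wait_for(state="visible", timeout=timeout_ms)
```

In a notebook cell this would be called as `await wait_for_text(page, "Everything has shown up")` before reading `page.content()`.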

## Forms

Display the content of the `h1` tag on http://jonathansoma.com/columbia/interactive-scrape/inputs.html. You'll need to follow the instructions to complete the form first.

In [None]:
from bs4 import BeautifulSoup

from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

await page.goto("https://jonathansoma.com/columbia/interactive-scrape/inputs.html")

In [None]:
# Select the dropdown value "Open"
await page.select_option('select', 'Open')

# Fill out the input box with "cat"
await page.fill('input[id="best-animal"]', 'cat')

# Click the button to submit the form
await page.click('input[id="submit"]')

# Wait for the page to load after form submission
await page.wait_for_selector('h1')

# Get the page content after form submission
html = await page.content()

# Parse the HTML content with BeautifulSoup
soup_doc = BeautifulSoup(html, 'html.parser')

# Extract and display the content of the h1 tag
h1_content = soup_doc.find('h1').text.strip()
h1_content

## Scraping a single table row

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, creating a dictionary out of the title, subhead, and byline.

## Saving into a dictionary

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [None]:
from bs4 import BeautifulSoup

from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

await page.goto("https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html")

In [None]:
# Get the page content
html = await page.content()

# Parse the HTML content with BeautifulSoup
soup_doc = BeautifulSoup(html, 'html.parser')

# Extract the row and its cells
row = soup_doc.find('tr')
cells = row.find_all('td')

# Save the extracted content into a dictionary
book = {
    'title': cells[0].text.strip(),
    'subhead': cells[1].text.strip(),
    'byline': cells[2].text.strip()
}

# Print the dictionary
book
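
If the cells always come in a known order, `zip` can pair them with key names instead of indexing each cell by hand. This sketch runs on a made-up one-row table rather than the real page, and the key names are assumed to match the column order.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page's single table row
sample_html = """
<table>
  <tr><td> A Title </td><td>A Subhead</td><td>By Someone</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
cells = soup.find("tr").find_all("td")

# Pair each key with the cell in the same position
keys = ["title", "subhead", "byline"]
book = dict(zip(keys, (cell.text.strip() for cell in cells)))

print(book)
```

Note that `zip` stops at the shorter input, so a row with a missing cell yields a dictionary with fewer keys rather than an `IndexError`.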

## Scraping multiple table rows

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html, creating a list of dictionaries. Convert to a pandas dataframe with `pd.json_normalize`. Save it as `output.csv`.

In [None]:
from bs4 import BeautifulSoup

from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

await page.goto("https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html")

In [None]:
import pandas as pd

# Get page content and parse with BeautifulSoup
content = await page.content()
soup = BeautifulSoup(content, 'html.parser')

# Find the table and extract rows
table = soup.find('table')
rows = table.find_all('tr')

# Create one dictionary per row
# (assumes the same three columns as the single-row page: title, subhead, byline)
data = []

for row in rows:  # Loop through all rows
    cells = row.find_all('td')
    if cells:  # Skip header rows, which have <th> instead of <td>
        data.append({
            'title': cells[0].text.strip(),
            'subhead': cells[1].text.strip(),
            'byline': cells[2].text.strip()
        })

# Convert the list of dictionaries to a pandas dataframe
df = pd.json_normalize(data)

# Save the dataframe to a CSV file
df.to_csv('output.csv', index=False)
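
`pd.json_normalize` expects a list of dictionaries, one per row. The sketch below builds that list from a made-up sample table and takes the dictionary keys from the `<th>` header cells. Whether the real page has a header row is an assumption, so check its HTML first.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up sample table, standing in for the real page
sample_html = """
<table>
  <tr><th>Title</th><th>Subhead</th><th>Byline</th></tr>
  <tr><td>First Book</td><td>A beginning</td><td>By A. Writer</td></tr>
  <tr><td>Second Book</td><td>A sequel</td><td>By B. Writer</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Use the header cells as dictionary keys
keys = [th.text.strip().lower() for th in soup.find_all("th")]

rows = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        continue  # the header row has only <th> cells
    rows.append(dict(zip(keys, (td.text.strip() for td in cells))))

df = pd.json_normalize(rows)

# With no path given, to_csv returns the CSV as a string
csv_text = df.to_csv(index=False)
print(csv_text)
```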

## Scraping an actual table

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html using pandas' HTML reading function. Save it as `output.csv`.

In [None]:
from bs4 import BeautifulSoup

from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

await page.goto("http://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html")

In [None]:
from io import StringIO
import pandas as pd

# Get the rendered page content
content = await page.content()

# Wrap the HTML string in StringIO (recent pandas versions deprecate passing
# a raw HTML string) and let pandas extract every <table> on the page
tables = pd.read_html(StringIO(content))
df = tables[0]  # there's only one table on the page

# Save the dataframe to a CSV file
df.to_csv('output2.csv', index=False)

## `html.parser` vs `html5lib`

Here is some good HTML:

```python
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""
```

Here is some bad HTML:

```python
html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""
```

When you're using BeautifulSoup, you can use different parsers, including `html.parser`, `html5lib` and `lxml`. Try both the good HTML and bad HTML with each parser and use `print(soup_doc.prettify())` to view the difference.

What is different about each one?

> You'll need to `pip install` both html5lib and lxml. Since you aren't importing them yourself (BeautifulSoup loads them for you), you'll need to do **Kernel > Restart** and run from the top after installing for them to work.

In [None]:
!pip install html5lib lxml

In [None]:
from bs4 import BeautifulSoup

# Good HTML
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""

# Bad HTML
html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""

In [None]:
# Using html.parser: 
soup_good_html_parser = BeautifulSoup(html_good, 'html.parser')
soup_bad_html_parser = BeautifulSoup(html_bad, 'html.parser')

print("Good HTML with html.parser:")
print(soup_good_html_parser.prettify())

print("\nBad HTML with html.parser:")
print(soup_bad_html_parser.prettify())


# Using html5lib: 
soup_good_html5lib = BeautifulSoup(html_good, 'html5lib')
soup_bad_html5lib = BeautifulSoup(html_bad, 'html5lib')

print("\nGood HTML with html5lib:")
print(soup_good_html5lib.prettify())

print("\nBad HTML with html5lib:")
print(soup_bad_html5lib.prettify())


# Using lxml
soup_good_lxml = BeautifulSoup(html_good, 'lxml')
soup_bad_lxml = BeautifulSoup(html_bad, 'lxml')

print("\nGood HTML with lxml:")
print(soup_good_lxml.prettify())

print("\nBad HTML with lxml:")
print(soup_bad_lxml.prettify())
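
Reading three `prettify()` dumps side by side can be hard on the eyes, so here is a rough sketch that boils each parser's result down to a few yes/no facts. `describe` is a hypothetical helper, the three checks are just ones I picked for illustration, and parsers that aren't installed are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""


def describe(html, parser):
    """Summarize how one parser repaired (or didn't repair) the HTML."""
    soup = BeautifulSoup(html, parser)
    h1 = soup.find("h1")
    return {
        "parser": parser,
        # Did the parser wrap the fragment in an <html> element?
        "adds_html_wrapper": soup.find("html") is not None,
        # Did the unclosed <h1> end up containing the <h2>?
        "h2_inside_h1": h1 is not None and h1.find("h2") is not None,
        # Did any unclosed <p> end up containing another <p>?
        "p_inside_p": any(p.find("p") is not None for p in soup.find_all("p")),
    }


results = []
for parser in ["html.parser", "html5lib", "lxml"]:
    try:
        results.append(describe(html_bad, parser))
    except FeatureNotFound:
        print(f"{parser} is not installed, skipping")

for result in results:
    print(result)
```

One difference to look for: `html.parser` leaves a fragment as a fragment, while html5lib and lxml build a full document around it.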