# Scraping basics for Playwright

This notebook combines several small scraping techniques with a walkthrough of how to use Playwright. Along with the class notes, the [scraping section](https://jonathansoma.com/everything/scraping/) on my Everything I Know site might be helpful.

## Imports

Import what you need to use Playwright, and start up a new browser to use for scraping. 

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [149]:
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

In [150]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/by-class.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' method='GET'>>

In [151]:
html = await page.content()
html

'<!DOCTYPE html><html><head><script>\n    const html = `\n<h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Probably using Playwright</h3>\n<p class="byline">By Jonathan Soma</p>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector(\'body\').innerHTML = html\n}, 250)</script>\n</head><body>\n<h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Probably using Playwright</h3>\n<p class="byline">By Jonathan Soma</p>\n</body></html>'

## Scraping by class

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-class.html using each element's **class name**, printing out the title, subhead, and byline.

In [152]:
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

In [153]:
from bs4 import BeautifulSoup

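# Naming the parser explicitly avoids BeautifulSoup's "no parser was explicitly specified" warning
soup_doc = BeautifulSoup(html, 'html.parser')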

In [154]:
# Keep the list and the loop variable under different names so one doesn't overwrite the other
titles = soup_doc.find_all('h1', class_='title')
for title in titles:
    print(title.text.strip())

How to Scrape Things


In [155]:
subhead = soup_doc.find_all('h3', class_='subhead')
for sub in subhead:
    print(sub.text.strip())

Probably using Playwright


In [156]:
by_line = soup_doc.find_all('p', class_='byline')
for by in by_line:
    print(by.text.strip())

By Jonathan Soma
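
As an alternative to parsing `page.content()` with BeautifulSoup, the same three elements can be read straight from the live page with Playwright locators. This is only a minimal sketch: the helper name `scrape_by_class` is made up for illustration, and it assumes `page` is a Playwright async page that has already been pointed at by-class.html with `page.goto(...)`.

```python
async def scrape_by_class(page):
    """Read the title, subhead, and byline from an already-loaded page.

    `page` is assumed to be a Playwright async Page; the helper name is
    hypothetical and only used for illustration.
    """
    # CSS class selectors start with a dot: ".title" matches class="title"
    title = await page.locator(".title").text_content()
    subhead = await page.locator(".subhead").text_content()
    byline = await page.locator(".byline").text_content()

    return {
        "title": title.strip(),
        "subhead": subhead.strip(),
        "byline": byline.strip(),
    }
```

In a notebook cell this would be called as `await scrape_by_class(page)`. Because `text_content()` waits for its element to exist, this approach should also cope with the short delay before the page fills itself in.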


## Scraping using a single tag

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-list.html, creating a dictionary out of the title, subhead, and byline.

In [157]:
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

In [158]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/by-list.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-list.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-list.html' method='GET'>>

In [159]:
html = await page.content()
html

"<!DOCTYPE html><html><head><script>\n    const html = `<p>How to Scrape Things</p>\n<p>Probably using Playwright</p>\n<p>By Jonathan Soma</p>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector('body').innerHTML = html\n}, 250)</script>\n</head><body><p>How to Scrape Things</p>\n<p>Probably using Playwright</p>\n<p>By Jonathan Soma</p>\n</body></html>"

In [165]:
from bs4 import BeautifulSoup

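# Naming the parser explicitly avoids BeautifulSoup's "no parser was explicitly specified" warning
soup_doc = BeautifulSoup(html, 'html.parser')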

In [167]:
# Find every <p> once, then loop over the results
paragraphs = soup_doc.find_all('p')
for paragraph in paragraphs:
    print(paragraph.get_text())

How to Scrape Things
Probably using Playwright
By Jonathan Soma
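
To get from printed paragraphs to the dictionary the exercise asks for, one option is to pair each `<p>` with a key by position. The sketch below assumes the page's three paragraphs always arrive in title, subhead, byline order. The `html` string is a trimmed copy of the page body shown above, and the names `keys` and `article` are arbitrary choices.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the rendered <body> from by-list.html
html = """
<p>How to Scrape Things</p>
<p>Probably using Playwright</p>
<p>By Jonathan Soma</p>
"""

soup_doc = BeautifulSoup(html, "html.parser")

# The paragraphs have no classes or ids, so position is all we have to go on
keys = ["title", "subhead", "byline"]
texts = [p.get_text().strip() for p in soup_doc.find_all("p")]

# zip pairs the first key with the first paragraph, and so on
article = dict(zip(keys, texts))
print(article)
```

In the notebook itself, `html` would come from `await page.content()` rather than a hard-coded string.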


## Waiting

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html just like you did above, but use **wait_for** to wait for the text "Everything has shown up" to show up.

In [172]:
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

In [173]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html' method='GET'>>

In [174]:
html = await page.content()
html

'<!DOCTYPE html><html><head><script>\n    const html = `<p>How to Scrape Things</p>\n<p>Probably using Playwright</p>\n<p>By Jonathan Soma</p>\n<p>Everything has shown up</p> \n`\n\nlet pieces = html.split("\\n")\n\nfunction addPiece() {\n    document.querySelector(\'body\').innerHTML = document.querySelector(\'body\').innerHTML + pieces.shift()\n    if(pieces.length > 0) {\n        setTimeout(addPiece, 250)\n    } else {\n        setTimeout(() => {\n            document.querySelector(\'body\').innerHTML = ""\n            pieces = html.split("\\n")\n            setTimeout(addPiece, 1000)\n        }, 2000)\n    }\n}\n\nsetTimeout(addPiece, 250)\n</script>\n</head><body><p>How to Scrape Things</p><p>Probably using Playwright</p><p>By Jonathan Soma</p><p>Everything has shown up</p> </body></html>'

In [175]:
from bs4 import BeautifulSoup

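# Naming the parser explicitly avoids BeautifulSoup's "no parser was explicitly specified" warning
soup_doc = BeautifulSoup(html, 'html.parser')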

In [176]:
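# Wait for the last paragraph first, then grab the HTML so the soup reflects the finished page
await page.get_by_text("Everything has shown up").wait_for()
html = await page.content()
soup_doc = BeautifulSoup(html, 'html.parser')

for paragraph in soup_doc.find_all('p'):
    print(paragraph.get_text())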

How to Scrape Things
Probably using Playwright
By Jonathan Soma
Everything has shown up


## Forms

Display the content of the `h1` tag on http://jonathansoma.com/columbia/interactive-scrape/inputs.html. You'll need to follow the instructions to complete the form first.

In [181]:
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

In [178]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/inputs.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/inputs.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/inputs.html' method='GET'>>

In [179]:
html = await page.content()
html

'<!DOCTYPE html><html><head><script>\n    const html = `<h1>You did it</h1>`\n</script>\n</head><body>\n    <div id="things"></div>\n    <p>The secret is\n    <select>\n        <option selected="">Closed</option>\n        <option>Open</option>\n    </select>\n</p>\n<p>\n    <input type="text" placeholder="write cat in here" id="best-animal">\n</p>\n<p>\n    <input type="button" id="submit" value="Click me">\n</p>\n    <script>\n        document.querySelector("#submit").addEventListener(\'click\', function() {\n            if(document.querySelector(\'#best-animal\').value == \'cat\') {\n                if(document.querySelector("select").value == \'Open\') {\n                    document.querySelector(\'body\').innerHTML = html\n                } else {\n                    alert(\'fix the dropdown!!!\')\n                }\n            } else {\n                alert(\'write cat in there!!!\')\n            }\n        })\n    </script>\n\n</body></html>'

In [188]:
# select_option has to act on the <select> itself, so locate the dropdown rather than the text beside it
await page.locator("select").select_option('Open')
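
The page's own script (visible in the HTML above) only swaps in the `<h1>` when the dropdown is set to `Open` and the text box contains `cat`. Here is a minimal sketch of the full sequence, assuming `page` is a Playwright async page that is already on inputs.html. The helper name `complete_form` is hypothetical, and the selectors come from the `id` attributes in that HTML.

```python
async def complete_form(page):
    """Fill in the form on inputs.html and return the revealed <h1> text.

    `page` is assumed to be a Playwright async Page already on inputs.html;
    the helper name is hypothetical.
    """
    # select_option needs the <select> element itself, not nearby text
    await page.locator("select").select_option("Open")

    # The text box has id="best-animal" and the page checks for the value "cat"
    await page.locator("#best-animal").fill("cat")

    # Clicking the button replaces the page body with the <h1>
    await page.locator("#submit").click()

    # text_content() waits for the <h1> to exist before reading it
    return await page.locator("h1").text_content()
```

In a notebook cell, `print(await complete_form(page))` would display the heading. Running it against a page that has not been sent to inputs.html yet would simply wait until the locator times out.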


## Scraping a single table row

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, creating a dictionary out of the title, subhead, and byline.

In [104]:
page = await browser.new_page()

In [105]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' method='GET'>>

In [106]:
html = await page.content()
html

"<!DOCTYPE html><html><head><script>\n    const html = `<table>\n  <tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n</table>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector('body').innerHTML = html\n}, 250)</script>\n</head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n</tbody></table>\n</body></html>"

## Saving into a dictionary

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!
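
One pandas-free way to build `book` is to read the three `<td>` cells directly with a Playwright locator and assign them by position. This is a sketch under a few assumptions: `page` is a Playwright async page already on single-table-row.html, the cells always appear in title, subhead, byline order, and the helper name `scrape_book` is invented for this example.

```python
async def scrape_book(page):
    """Return the single table row as a dictionary.

    `page` is assumed to be a Playwright async Page already on
    single-table-row.html; the helper name is hypothetical.
    """
    cells = page.locator("td")

    # all_text_contents() does not wait, so make sure the table has appeared first
    await cells.first.wait_for()

    texts = [text.strip() for text in await cells.all_text_contents()]

    # The row has no classes, so the cells are matched up by position
    book = {
        "title": texts[0],
        "subhead": texts[1],
        "byline": texts[2],
    }
    return book
```

In a notebook cell this would be used as `book = await scrape_book(page)`.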

## Scraping multiple table rows

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html, creating a list of dictionaries. Convert to a pandas dataframe with `pd.json_normalize`. Save it as `output.csv`.
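
The last two steps (list of dictionaries to dataframe to CSV) might look like the sketch below. The `books` list is a stand-in for whatever the scrape produces: its first entry reuses the row from the single-row page, and the second entry is made up purely so there is more than one row.

```python
import pandas as pd

# Placeholder data: in the notebook this list would be built by looping over
# the scraped table rows. The second entry is invented for illustration.
books = [
    {
        "title": "How to Scrape Things",
        "subhead": "Probably using Playwright",
        "byline": "By Jonathan Soma",
    },
    {
        "title": "Example title",
        "subhead": "Example subhead",
        "byline": "By Example Author",
    },
]

# Each dictionary becomes one row; the keys become column names
df = pd.json_normalize(books)

# index=False leaves pandas' row numbers out of the file
df.to_csv("output.csv", index=False)
print(df.shape)
```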

## Scraping an actual table

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html using pandas' HTML reading function. Save it as `output.csv`.
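
A rough sketch of the pandas route follows. Since this notebook never shows the markup of the-actual-table.html, the `html` string below is a placeholder borrowed from the single-row page, and in practice it would be replaced with `await page.content()`. The sketch also assumes a parser library such as lxml is installed, which `pd.read_html` needs.

```python
from io import StringIO

import pandas as pd

# Placeholder markup: swap in `await page.content()` inside the notebook
html = """
<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
</table>
"""

# read_html returns a list with one dataframe per <table> it finds.
# Wrapping the string in StringIO hands pandas a file-like object,
# which recent versions prefer over a raw HTML string.
tables = pd.read_html(StringIO(html))
df = tables[0]

df.to_csv("output.csv", index=False)
print(df)
```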

## `html.parser` vs `html5lib`

Here is some good HTML:

```python
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""
```

Here is some bad HTML:

```python
html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""
```

When you're using BeautifulSoup, you can use different parsers, including `html.parser`, `html5lib` and `lxml`. Try both the good HTML and bad HTML with each parser and use `print(soup_doc.prettify())` to view the difference.

What is different about each one?

> You'll need to `pip install` both html5lib and lxml. Since you aren't importing them yourself (BeautifulSoup loads them behind the scenes), you'll need to do **Kernel > Restart** and run from the top after installing for them to work.
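
One way to run the comparison is to loop over the parser names and print each result. The sketch below does this for the bad HTML only and skips any parser that isn't installed. The `results` dictionary is just a convenience added here so the outputs can be compared afterwards.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""

results = {}
for parser in ["html.parser", "html5lib", "lxml"]:
    try:
        soup_doc = BeautifulSoup(html_bad, parser)
    except FeatureNotFound:
        # BeautifulSoup raises this when the requested parser isn't installed
        print(f"{parser}: not installed, skipping")
        continue

    results[parser] = soup_doc.prettify()
    print(f"--- {parser} ---")
    print(results[parser])
```

Swapping `html_bad` for `html_good` and running it again shows whether the parsers still disagree when the tags are properly closed.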