# Scraping basics for Playwright

This notebook combines a handful of small scraping techniques with an introduction to Playwright. Along with the class notes, the [scraping section](https://jonathansoma.com/everything/scraping/) of my Everything I Know site might be helpful.

## Imports

Import what you need to use Playwright, and start up a new browser to use for scraping. 

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [200]:
!pip install playwright


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [201]:
!playwright install

In [202]:
from playwright.async_api import async_playwright

In [203]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

In [204]:
page = await browser.new_page()
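
Restarting the kernel is one way to get rid of stray browser windows; closing them explicitly is another. The sketch below assumes the `browser` and `playwright` objects created above. The helper name `shut_down` is made up for illustration.

```python
async def shut_down(browser, playwright):
    """Close the browser, then stop Playwright (hypothetical helper name)."""
    # Closing the browser also closes every page that was opened from it
    await browser.close()
    # Stopping Playwright ends the background driver started by .start()
    await playwright.stop()
```

In a notebook cell this would be run as `await shut_down(browser, playwright)` once you are finished scraping.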

## Scraping by class

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-class.html by selecting each element by its **class name**, printing out the title, subhead, and byline.

In [205]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/by-class.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' method='GET'>>

In [206]:
html = await page.content()
html

'<!DOCTYPE html><html><head><script>\n    const html = `\n<h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Probably using Playwright</h3>\n<p class="byline">By Jonathan Soma</p>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector(\'body\').innerHTML = html\n}, 250)</script>\n</head><body>\n\n</body></html>'
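
The HTML above still has an empty `<body>` because the page fills itself in with JavaScript roughly 250 ms after it loads. One way to handle that, sketched here on the assumption that `page` is the Playwright page opened earlier, is to wait for an element to appear and then read the text by class name. The helper name `scrape_by_class` is hypothetical; the class names come from the HTML shown above.

```python
async def scrape_by_class(page):
    """Read title, subhead, and byline by CSS class (hypothetical helper)."""
    # Wait until the page's JavaScript has inserted the title element
    await page.locator(".title").wait_for()
    # A leading dot in a CSS selector means "match by class name"
    return {
        "title": await page.locator(".title").inner_text(),
        "subhead": await page.locator(".subhead").inner_text(),
        "byline": await page.locator(".byline").inner_text(),
    }
```

In a notebook cell, `result = await scrape_by_class(page)` followed by `print(result)` would show the three values.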

## Scraping using a single tag

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-list.html, creating a dictionary out of the title, subhead, and byline.

In [207]:
await page.goto("http://jonathansoma.com/columbia/interactive-scrape/by-list.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-list.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-list.html' method='GET'>>

In [208]:
html = await page.content()
html

"<!DOCTYPE html><html><head><script>\n    const html = `<p>How to Scrape Things</p>\n<p>Probably using Playwright</p>\n<p>By Jonathan Soma</p>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector('body').innerHTML = html\n}, 250)</script>\n</head><body>\n\n</body></html>"

In [209]:
from bs4 import BeautifulSoup

# If this runs before the page's JavaScript fires, the <body> will still be empty
raw_html = await page.content()

In [210]:
soup_doc = BeautifulSoup(raw_html, "html.parser")
print(soup_doc.prettify())

<!DOCTYPE html>
<html>
 <head>
  <script>
   const html = `<p>How to Scrape Things</p>
<p>Probably using Playwright</p>
<p>By Jonathan Soma</p>
`

setTimeout(() => {
    console.log(html)
    document.querySelector('body').innerHTML = html
}, 250)
  </script>
 </head>
 <body>
 </body>
</html>



In [211]:
# This page has no <h1> or <h3>: all three items are <p> tags that JavaScript
# adds after a short delay. Wait for them, re-read the HTML, then pick by position.
await page.locator("p").first.wait_for()
soup_doc = BeautifulSoup(await page.content(), "html.parser")
paragraphs = soup_doc.find_all('p')
title = paragraphs[0].text
title

In [None]:
subhead = paragraphs[1].text
subhead


In [None]:
byline = paragraphs[2].text
byline

In [None]:
my_dictionary = {
    'Title:': title,
    'Subhead:': subhead,
    'Byline:': byline
}

print(my_dictionary)

## Waiting

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html just like you did above, but use **wait_for** to wait for the text "Everything has shown up" to appear.

In [None]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html")

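In [None]:
# Wait for the final text to appear BEFORE grabbing the HTML
await page.get_by_text('Everything has shown up').wait_for()
html = await page.content()
# Parse the HTML we just fetched, not the stale raw_html from an earlier page
soup_doc = BeautifulSoup(html, "html.parser")
html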
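
The order matters when waiting: wait for the text first, then read the HTML. A minimal sketch of that pattern follows, assuming `page` is an open Playwright page. The helper name `html_after_text` and the 10-second default are illustrative choices, not part of the assignment.

```python
async def html_after_text(page, text, timeout_ms=10_000):
    """Return the page's HTML once `text` is visible (hypothetical helper)."""
    # get_by_text finds elements containing the text; wait_for blocks until
    # one is visible, or raises a timeout error after timeout_ms milliseconds
    await page.get_by_text(text).wait_for(state="visible", timeout=timeout_ms)
    # Only now read the HTML, so it includes everything that has shown up
    return await page.content()
```

In a notebook cell: `html = await html_after_text(page, "Everything has shown up")`. If more than one element contains the text, Playwright may complain about an ambiguous match; adding `.first` after `get_by_text(...)` is one way around that.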

## Forms

Display the content of the `h1` tag on http://jonathansoma.com/columbia/interactive-scrape/inputs.html. You'll need to follow the instructions to complete the form first.

In [None]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/inputs.html")
html = await page.content()
html

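In [None]:
# Re-parse this page's HTML; soup_doc still held an earlier page.
# The form has not been filled in yet, so the h1 may not be there.
soup_doc = BeautifulSoup(await page.content(), "html.parser")
h1 = soup_doc.find('h1')
title_1 = h1.text if h1 else "No h1 yet: complete the form first"
print(title_1)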
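
Completing a form in Playwright usually means filling inputs and clicking a button before reading the result. The sketch below shows the general shape only. The helper name `fill_form_and_read_h1` is hypothetical, and the selectors you pass in must come from the page's actual HTML and instructions, which are not shown in this notebook.

```python
async def fill_form_and_read_h1(page, field_values, submit_selector):
    """Fill inputs, click submit, return the h1 text (hypothetical helper)."""
    # field_values maps a CSS selector to the text to type into that input
    for selector, value in field_values.items():
        await page.locator(selector).fill(value)
    # Click whatever submits the form (a button, in most cases)
    await page.locator(submit_selector).click()
    # Read the heading once the form has been submitted
    return await page.locator("h1").inner_text()
```

A call might look like `await fill_form_and_read_h1(page, {"input[name='example']": "some text"}, "button")`, where both selectors are placeholders to replace after inspecting the page.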

## Scraping a single table row

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, creating a dictionary out of the title, subhead, and byline.

In [226]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html")
html = await page.content()
html

"<!DOCTYPE html><html><head><script>\n    const html = `<table>\n  <tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n</table>\n`\n\nsetTimeout(() => {\n    console.log(html)\n    document.querySelector('body').innerHTML = html\n}, 250)</script>\n</head><body>\n\n</body></html>"

In [227]:
soup_doc2 = BeautifulSoup(html, 'html.parser')

# soup_doc.find_all
# print(soup_doc.prettify())
soup_doc2

<!DOCTYPE html>
<html><head><script>
    const html = `<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
</table>
`

setTimeout(() => {
    console.log(html)
    document.querySelector('body').innerHTML = html
}, 250)</script>
</head><body>
</body></html>

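In [229]:
# The table is inserted by JavaScript after a short delay, so wait for a cell
# to render and re-read the HTML; otherwise find_all('tr') returns [].
await page.locator("td").first.wait_for()
soup_doc2 = BeautifulSoup(await page.content(), 'html.parser')
all_rows = soup_doc2.find_all('tr')
all_rows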

## Saving into a dictionary

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [None]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html")
# The row is added by JavaScript, so wait for a cell before reading the HTML
await page.locator("td").first.wait_for()
html = await page.content()
html

In [None]:
# Parse the HTML we just fetched, not the stale raw_html from an earlier page
soup_doc = BeautifulSoup(html, "html.parser")
print(soup_doc.prettify())

In [None]:
# This page has no h1/h2/h3/p tags: all three values are <td> cells in one row
cells = soup_doc.find('tr').find_all('td')
book_title = cells[0].text
print(book_title)

In [None]:
book_subhead = cells[1].text
print(book_subhead)

In [None]:
book = {
    'Title:': cells[0].text,
    'Subhead:': cells[1].text,
    'Byline:': cells[2].text
}
print(book)
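
Another way to build the dictionary is to pair a list of key names with the row's `<td>` cells using `zip`. The sketch below runs on a pasted copy of the rendered table so it works without a browser. The helper name `row_to_dict` and the key names are illustrative choices.

```python
from bs4 import BeautifulSoup

# A copy of the rendered table, so this example runs without a browser
sample_html = """
<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Probably using Playwright</td>
    <td>By Jonathan Soma</td>
  </tr>
</table>
"""

def row_to_dict(row, keys=("title", "subhead", "byline")):
    """Pair each <td> in a table row with a key name, in order (hypothetical helper)."""
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    return dict(zip(keys, cells))

soup = BeautifulSoup(sample_html, "html.parser")
book = row_to_dict(soup.find("tr"))
print(book)
```

With the live page, you would pass in a row from the HTML returned by `await page.content()` after the table has rendered.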


## Scraping multiple table rows

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html, creating a list of dictionaries. Convert to a pandas dataframe with `pd.json_normalize`. Save it as `output.csv`.

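In [None]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html")
# The rows are added by JavaScript, so wait for a cell before reading the HTML
await page.locator("td").first.wait_for()
html = await page.content()
html

In [None]:
# Parse this page's HTML and look for table rows (not the <html> tag)
soup_doc = BeautifulSoup(html, "html.parser")
soup_doc.find_all('tr')

In [None]:
stories = soup_doc.find_all('tr')
for story in stories:
    print("----")
    print(story.text)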
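
To get from table rows to the list of dictionaries and DataFrame the exercise asks for, one possible approach is sketched below. It uses a pasted copy of the rendered table so it runs without a browser; the key names `title`, `subhead`, and `byline` are illustrative choices.

```python
import pandas as pd
from bs4 import BeautifulSoup

# A copy of the rendered table, so this example runs without a browser
sample_html = """
<table>
  <tr><td>How to Scrape Things</td><td>Probably using Playwright</td><td>By Jonathan Soma</td></tr>
  <tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
  <tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Build one dictionary per table row
stories = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    stories.append({
        "title": cells[0].get_text(strip=True),
        "subhead": cells[1].get_text(strip=True),
        "byline": cells[2].get_text(strip=True),
    })

# Turn the list of dictionaries into a DataFrame, one row per story
df = pd.json_normalize(stories)
print(df)
```

Saving is then one more line: `df.to_csv("output.csv", index=False)`, where `index=False` leaves the row numbers out of the file.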

## Scraping an actual table

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html using pandas' HTML reading function. Save it as `output.csv`.

In [None]:
#await page.goto("https://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html")
#html = await page.content()
#html

In [247]:
page = await browser.new_page()
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html' method='GET'>>

In [248]:
html = await page.content()
html

"<!DOCTYPE html><html><head><script>\n    const html = `<table>\n  <tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let's All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</table>\n`\n\nsetTimeout(() => {\n    document.querySelector('body').innerHTML = html\n}, 250)</script>\n</head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Probably using Playwright</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let's All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</tbody></table>\n</body></html>"

In [249]:
# Parse the HTML fetched in the previous cell; raw_html still held the by-list page
soup_doc = BeautifulSoup(html, "html.parser")
print(soup_doc.prettify())



In [252]:
!pip install --quiet html5lib lxml
import io


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


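In [253]:
import io
import pandas as pd

# read_html expects a file-like object, so wrap the HTML string in StringIO
tables = pd.read_html(io.StringIO(html))
df = tables[0]
df.to_csv("output.csv", index=False)
df.head()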

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | How to Scrape Things | Probably using Playwright | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


In [257]:
import pandas as pd


In [None]:
soup_doc = BeautifulSoup(html, 'html.parser')
soup_doc

## `html.parser` vs `html5lib`

Here is some good HTML:

```python
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""
```

Here is some bad HTML:

```python
html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""
```

When you're using BeautifulSoup, you can use different parsers, including `html.parser`, `html5lib` and `lxml`. Try both the good HTML and bad HTML with each parser and use `print(soup_doc.prettify())` to view the difference.

What is different about each one?

> You'll need to `pip install` both html5lib and lxml. Since you aren't importing them yourself (BeautifulSoup loads them for you), you'll need to do **Kernel > Restart** and run from the top after installing for them to work.
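
One way to run the comparison is to loop over the parser names and print each result. This is only a sketch: it skips any parser that is not installed rather than failing, and it uses the bad HTML from above (swap in `html_good` to compare).

```python
from bs4 import BeautifulSoup, FeatureNotFound

html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""

results = {}
for parser in ["html.parser", "html5lib", "lxml"]:
    try:
        soup_doc = BeautifulSoup(html_bad, parser)
    except FeatureNotFound:
        # BeautifulSoup raises this when the parser's package is not installed
        print(f"{parser}: not installed, skipping")
        continue
    # Keep each parser's output so the versions can be compared side by side
    results[parser] = soup_doc.prettify()
    print(f"----- {parser} -----")
    print(results[parser])
```

Look at which tags each parser adds and where it decides the unclosed tags should end.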

In [254]:
# html.parser ships with Python, so only html5lib and lxml need installing
!pip install lxml


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [255]:
!pip install html5lib


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
