# Scraping basics for Playwright

This notebook combines a handful of small scraping techniques with an introduction to Playwright. Along with the class notes, the [scraping section](https://jonathansoma.com/everything/scraping/) on my Everything I Know site might be helpful.

## Imports

Import what you need to use Playwright, and start up a new browser to use for scraping. 

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [4]:
!pip install playwright
!playwright install
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()




## Scraping by class

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-class.html using their **class name**, printing out the title, subhead, and byline.

In [5]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/by-class.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-class.html' method='GET'>>

In [6]:
html = await page.content()
soup = BeautifulSoup(html, 'html.parser')


In [7]:
title = soup.find(class_ = 'title')
title.text

'How to Scrape Things'

In [8]:
subhead = soup.find(class_ = 'subhead')
subhead.text

'Probably using Playwright'

In [9]:
byline = soup.find(class_ = 'byline')
byline.text

'By Jonathan Soma'
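
The three cells above repeat the same `find(class_=...)` pattern, so here is a minimal sketch of wrapping it in a helper. The `SAMPLE_HTML` string and the `text_by_class` name are made up for illustration (the real page's markup may differ), and the helper returns `None` instead of crashing when a class is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<h1 class="title">How to Scrape Things</h1>
<h2 class="subhead">Probably using Playwright</h2>
<p class="byline">By Jonathan Soma</p>
"""

def text_by_class(soup, class_name):
    """Return the stripped text of the first element with this class, or None."""
    element = soup.find(class_=class_name)
    return element.get_text(strip=True) if element is not None else None

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Look up each class once and collect the results in a dictionary
article = {name: text_by_class(soup, name) for name in ["title", "subhead", "byline"]}
print(article)

# A class that isn't on the page gives None rather than an AttributeError
missing = text_by_class(soup, "dateline")
print(missing)
```

Calling `.text` directly on the result of `soup.find(...)` fails with an `AttributeError` when nothing matches, which is why the helper checks for `None` first.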

## Scraping using a single tag

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-list.html, creating a dictionary out of the title, subhead, and byline.

In [10]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/by-list.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-list.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-list.html' method='GET'>>

In [11]:
html = await page.content()
soup2 = BeautifulSoup(html, 'html.parser')

In [12]:
all_info = soup2.find_all('p')
# find_all returns the paragraphs in page order, so index them directly (no loop needed)
dictionary = {}
dictionary['title'] = all_info[0].text
dictionary['subhead'] = all_info[1].text
dictionary['byline'] = all_info[2].text
dictionary



{'title': 'How to Scrape Things',
 'subhead': 'Probably using Playwright',
 'byline': 'By Jonathan Soma'}
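
When the tags have no classes and you rely on their order, pairing a list of keys with the tags using `zip` is one way to avoid hard-coded indexes. This is only a sketch: `SAMPLE_HTML` and `paragraphs_to_dict` are hypothetical names, and the markup is a guess at what the page looks like.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: three unlabeled paragraphs in a fixed order.
SAMPLE_HTML = """
<p>How to Scrape Things</p>
<p>Probably using Playwright</p>
<p>By Jonathan Soma</p>
"""

def paragraphs_to_dict(html, keys):
    """Pair each key with the paragraph in the same position."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p")
    # zip stops at the shorter input, so a missing paragraph drops its key
    # instead of raising an IndexError
    return {key: p.get_text(strip=True) for key, p in zip(keys, paragraphs)}

story = paragraphs_to_dict(SAMPLE_HTML, ["title", "subhead", "byline"])
print(story)
```

The trade-off is that a short page fails quietly: you get a dictionary with fewer keys, so check for the keys you need before using them.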

## Waiting

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html just like you did above, but use **wait_for** to wait for the text "Everything has shown up" to appear.

In [16]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html")


<Response url='https://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/by-tag-wait.html' method='GET'>>

In [17]:
# Wait for the final text first, then grab the HTML. Calling page.content()
# before the wait most likely captured a half-loaded page, which is why
# all_info[3] raised "IndexError: list index out of range".
await page.get_by_text("Everything has shown up").wait_for()
html = await page.content()
soup3 = BeautifulSoup(html, 'html.parser')
all_info = soup3.find_all('p')
dictionary = {}
dictionary['title'] = all_info[0].text
dictionary['subhead'] = all_info[1].text
dictionary['byline'] = all_info[2].text
dictionary['addendum'] = all_info[3].text
dictionary
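
The order matters here: the HTML has to be read after the wait finishes, not before. As a rough sketch, the helper below bundles "wait, then read, then parse" into one function so the steps can't get out of order. The name `soup_after_text` and the `timeout_ms` default are my own choices, not something from Playwright; `page` is assumed to be a Playwright `Page` like the one created in the imports cell.

```python
from bs4 import BeautifulSoup

async def soup_after_text(page, text, timeout_ms=10_000):
    """Wait until `text` shows up on `page`, then parse the rendered HTML."""
    # wait_for() blocks until the matching element is visible (or times out)
    await page.get_by_text(text).wait_for(timeout=timeout_ms)
    # Only read the HTML once the wait has finished
    html = await page.content()
    return BeautifulSoup(html, "html.parser")
```

In the notebook you would call it as `soup3 = await soup_after_text(page, "Everything has shown up")` right after `page.goto(...)`.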

## Forms

Display the content of the `h1` tag on http://jonathansoma.com/columbia/interactive-scrape/inputs.html. You'll need to follow the instructions to complete the form first.

In [18]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/inputs.html")
await page.locator('#best-animal').fill('cat')
await page.locator('select').select_option('Open')
await page.locator('#submit').click()
html = await page.content()
soup4 = BeautifulSoup(html, 'html.parser')
h1 = soup4.find('h1')
print(h1.text)


You did it


## Scraping a single table row

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, creating a dictionary out of the title, subhead, and byline.

In [19]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html")

<Response url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' request=<Request url='https://jonathansoma.com/columbia/interactive-scrape/single-table-row.html' method='GET'>>

In [20]:
html = await page.content()
soup5 = BeautifulSoup(html, 'html.parser')
all_info = soup5.find_all('td')
# The row has three cells in a fixed order, so index them directly (no loop needed)
dictionary = {}
dictionary['title'] = all_info[0].text
dictionary['subhead'] = all_info[1].text
dictionary['byline'] = all_info[2].text
dictionary

{'title': 'How to Scrape Things',
 'subhead': 'Probably using Playwright',
 'byline': 'By Jonathan Soma'}

## Saving into a dictionary

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [21]:
# Reuses all_info (the <td> cells) from the previous cell
book = {}
book['title'] = all_info[0].text
book['subhead'] = all_info[1].text
book['byline'] = all_info[2].text
book

{'title': 'How to Scrape Things',
 'subhead': 'Probably using Playwright',
 'byline': 'By Jonathan Soma'}

## Scraping multiple table rows

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html, creating a list of dictionaries. Convert to a pandas dataframe with `pd.json_normalize`. Save it as `output.csv`.

In [22]:
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/multiple-table-rows.html")
await page.wait_for_selector('tr')
html = await page.content()
soup6 = BeautifulSoup(html, 'html.parser')
all_info = soup6.find_all('tr')
info=all_info[0]
soma = {}
soma['title'] = info.find_all('td')[0].text
soma['subhead'] = info.find_all('td')[1].text
soma['byline'] = info.find_all('td')[2].text
soma

{'title': 'How to Scrape Things',
 'subhead': 'Probably using Playwright',
 'byline': 'By Jonathan Soma'}

In [23]:
info = all_info[1]
joma = {}
joma['title'] = info.find_all('td')[0].text
joma['subhead'] = info.find_all('td')[1].text
joma['byline'] = info.find_all('td')[2].text
joma

{'title': 'How to Scrape Many Things',
 'subhead': 'But, Is It Even Possible?',
 'byline': 'By Sonathan Joma'}

In [24]:
info = all_info[2]
nathanos = {}
nathanos['title'] = info.find_all('td')[0].text
nathanos['subhead'] = info.find_all('td')[1].text
nathanos['byline'] = info.find_all('td')[2].text
nathanos

{'title': 'The End of Scraping',
 'subhead': "Let's All Use CSV Files",
 'byline': 'By Amos Nathanos'}

In [25]:
scraped_list = [soma, joma, nathanos]
scraped_list


[{'title': 'How to Scrape Things',
  'subhead': 'Probably using Playwright',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subhead': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subhead': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]
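
Building one dictionary per cell works for three rows, but a loop scales to any number of them. Here is a minimal sketch of the same idea, run against a hypothetical `SAMPLE_HTML` string that mimics the table (the real page's markup may differ).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's table.
SAMPLE_HTML = """
<table>
  <tr><td>How to Scrape Things</td><td>Probably using Playwright</td><td>By Jonathan Soma</td></tr>
  <tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
  <tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) < 3:
        continue  # skip header rows or anything malformed
    rows.append({
        "title": cells[0].get_text(strip=True),
        "subhead": cells[1].get_text(strip=True),
        "byline": cells[2].get_text(strip=True),
    })

print(rows)
```

Each pass through the loop looks only inside its own `tr`, which is what keeps the cells of different rows from getting mixed together.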

In [33]:
import pandas as pd
df = pd.json_normalize(scraped_list)
df.head()

|   | title | subhead | byline |
|---|---|---|---|
| 0 | How to Scrape Things | Probably using Playwright | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


In [34]:
# Reuse the dataframe from above; index=False keeps the row numbers out of the file
df.to_csv('output.csv', index=False)
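
By default `to_csv` writes the dataframe's index as an extra, unnamed first column, and reading that file back gives you a column called `Unnamed: 0`. The sketch below shows the difference using a small made-up `scraped` list and in-memory strings instead of files.

```python
from io import StringIO
import pandas as pd

# Small made-up sample in the same shape as the scraped list
scraped = [
    {"title": "How to Scrape Things", "byline": "By Jonathan Soma"},
    {"title": "The End of Scraping", "byline": "By Amos Nathanos"},
]
df = pd.json_normalize(scraped)

# With no path given, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)

header_with = with_index.splitlines()[0]
header_without = without_index.splitlines()[0]
print(repr(header_with))
print(repr(header_without))

# Reading the indexed version back shows where "Unnamed: 0" comes from
reloaded = pd.read_csv(StringIO(with_index))
first_column = reloaded.columns[0]
print(first_column)
```

If you plan to open the CSV in a spreadsheet or load it again later, `index=False` is usually what you want.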


## Scraping an actual table

Scrape the content at http://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html using pandas' HTML reading function. Save it as `output.csv`.

In [35]:
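!pip install --quiet html5lib lxml
from io import StringIO
import pandas as pd
from bs4 import BeautifulSoup

# Go to the table page first; otherwise page.content() returns the previous page's HTML
await page.goto("https://jonathansoma.com/columbia/interactive-scrape/the-actual-table.html")
html = await page.content()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
# read_html wants a file-like object, so wrap the table's HTML in StringIO
html_io = StringIO(str(table))
# read_html returns a list of DataFrames, one per <table> it finds
dfs = pd.read_html(html_io)
df = dfs[0]
df.to_csv("output2.csv", index=False)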
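
`pd.read_html` can also take the whole page at once, since it looks for every `<table>` and returns a list of dataframes. Here is a small offline sketch with a hypothetical `SAMPLE_HTML` table; the column names are invented, and it assumes `lxml` (or `bs4` plus `html5lib`) is installed, since pandas needs one of them to parse HTML.

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for the page; the real table's headers may differ.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>title</th><th>subhead</th><th>byline</th></tr>
  </thead>
  <tbody>
    <tr><td>How to Scrape Things</td><td>Probably using Playwright</td><td>By Jonathan Soma</td></tr>
    <tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table> in the document
tables = pd.read_html(StringIO(SAMPLE_HTML))
df = tables[0]
print(df)
```

With a live page, the same call would be `pd.read_html(StringIO(html))`, where `html` comes from `await page.content()`.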



## `html.parser` vs `html5lib`

Here is some good HTML:

```python
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""
```

Here is some bad HTML:

```python
html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""
```

When you're using BeautifulSoup, you can use different parsers, including `html.parser`, `html5lib` and `lxml`. Try both the good HTML and bad HTML with each parser and use `print(soup_doc.prettify())` to view the difference.

What is different about each one?

> You'll need to `pip install` both html5lib and lxml. Since you aren't importing them yourself (BeautifulSoup loads them for you), you'll need to do **Kernel > Restart** and run from the top after installing for them to work.
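
Besides reading the `prettify()` output by eye, you can ask the parsed document directly whether one tag ended up inside another. This is a minimal sketch using only `html.parser` and shortened versions of the two strings; `subhead_is_nested` is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Shortened versions of the good and bad HTML above
html_good = "<h1>This is a title</h1>\n<h2>This is a subhead</h2>\n"
html_bad = "<h1>This is a title\n<h2>This is a subhead\n"

def subhead_is_nested(html, parser="html.parser"):
    """Return True if the parser placed the <h2> inside the <h1>."""
    soup = BeautifulSoup(html, parser)
    return soup.find("h1").find("h2") is not None

print(subhead_is_nested(html_good))
print(subhead_is_nested(html_bad))
```

Passing `"html5lib"` or `"lxml"` as the second argument lets you run the same check against the other parsers once they are installed.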

In [29]:
from bs4 import BeautifulSoup

# Good HTML
html_good = """
<h1>This is a title</h1>
<h2>This is a subhead</h2>
<p>This is a paragraph</p>
<p>This is another paragraph</p>
"""
# Bad HTML
html_bad = """
<h1>This is a title
<h2>This is a subhead
<p>This is a paragraph
<p>This is another paragraph
"""
parsers = ['html.parser', 'html5lib', 'lxml']
for parser in parsers:
    print("Using parser:", parser)
    soup_good = BeautifulSoup(html_good, parser)
    print("Good HTML:")
    print(soup_good.prettify())
    soup_bad = BeautifulSoup(html_bad, parser)
    print("Bad HTML:")
    print(soup_bad.prettify())

Using parser: html.parser
Good HTML:
<h1>
 This is a title
</h1>
<h2>
 This is a subhead
</h2>
<p>
 This is a paragraph
</p>
<p>
 This is another paragraph
</p>

Bad HTML:
<h1>
 This is a title
 <h2>
  This is a subhead
  <p>
   This is a paragraph
   <p>
    This is another paragraph
   </p>
  </p>
 </h2>
</h1>

Using parser: html5lib
Good HTML:
<html>
 <head>
 </head>
 <body>
  <h1>
   This is a title
  </h1>
  <h2>
   This is a subhead
  </h2>
  <p>
   This is a paragraph
  </p>
  <p>
   This is another paragraph
  </p>
 </body>
</html>

Bad HTML:
<html>
 <head>
 </head>
 <body>
  <h1>
   This is a title
  </h1>
  <h2>
   This is a subhead
   <p>
    This is a paragraph
   </p>
   <p>
    This is another paragraph
   </p>
  </h2>
 </body>
</html>

Using parser: lxml
Good HTML:
<html>
 <body>
  <h1>
   This is a title
  </h1>
  <h2>
   This is a subhead
  </h2>
  <p>
   This is a paragraph
  </p>
  <p>
   This is another paragraph
  </p>
 </body>
</html>

Bad HTML:
<html>
 <body>
  <