# Scraping basics for Playwright

If you feel comfortable with scraping in general, feel free to skip this notebook and go straight to the next one. The same goes if you get bored partway through.

> The [scraping section](https://jonathansoma.com/everything/scraping/) on my Everything I Know site might be helpful.
>
> I know I love them, but **you don't have to use CSS selectors!**

## Part 0: Imports

Import what you need to use Playwright, and start up a new browser to use for scraping. 

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
import pandas as pd

In [4]:
from playwright.async_api import async_playwright

In [5]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

# Create a new browser window
page = await browser.new_page()
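
If you'd rather not kill the kernel to get rid of stray browser windows, you can also close things explicitly when you're done. This is only a minimal sketch: `shut_down` is a hypothetical helper name of mine, and it assumes `browser` and `playwright` are the objects created in the cell above.

```python
async def shut_down(browser, playwright):
    """Close the browser window(s), then stop the Playwright driver."""
    await browser.close()
    await playwright.stop()

# In a notebook cell you would run:
# await shut_down(browser, playwright)
```

After this you need to re-run the launch cell before scraping again.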

## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [6]:
await page.goto("http://jonathansoma.com/lede/static/by-class.html")

<Response url='https://jonathansoma.com/lede/static/by-class.html' request=<Request url='https://jonathansoma.com/lede/static/by-class.html' method='GET'>>

In [7]:
html = await page.content()
soup_doc = BeautifulSoup(html, "html.parser")  # name the parser so bs4 doesn't have to guess
soup_doc

<html><head></head><body><h1 class="title">How to Scrape Things</h1>
<h3 class="subhead">Some Supplemental Materials</h3>
<p class="byline">By Jonathan Soma</p></body></html>

In [8]:
# I am going to use CSS selectors exclusively here to see how I feel about them

title = soup_doc.select_one('.title')
subhead = soup_doc.select_one('.subhead')
byline = soup_doc.select_one('.byline')

print(title.text)
print(subhead.text)
print(byline.text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma
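
Since CSS selectors are optional, here is a rough sketch of the same lookup using `find` with `class_` instead of `select_one`. The HTML is pasted in as a string (copied from the page output above) so the cell runs without a browser; the variable name `soup` is my own.

```python
from bs4 import BeautifulSoup

# Same markup as the by-class page, pasted in as a string
html = """
<h1 class="title">How to Scrape Things</h1>
<h3 class="subhead">Some Supplemental Materials</h3>
<p class="byline">By Jonathan Soma</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find(class_=...) does the same job as select_one('.title')
title = soup.find(class_="title").text
# You can also pin down the tag name along with the class
subhead = soup.find("h3", class_="subhead").text
byline = soup.find("p", class_="byline").text

print(title)
print(subhead)
print(byline)
```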


In [9]:
# cute!

## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [10]:
await page.goto("http://jonathansoma.com/lede/static/by-tag.html")

<Response url='https://jonathansoma.com/lede/static/by-tag.html' request=<Request url='https://jonathansoma.com/lede/static/by-tag.html' method='GET'>>

In [11]:
html = await page.content()
soup_doc = BeautifulSoup(html, "html.parser")
soup_doc

<html><head></head><body><h1>How to Scrape Things</h1>
<h3>Some Supplemental Materials</h3>
<p>By Jonathan Soma</p></body></html>

In [12]:
title = soup_doc.select_one('h1')
subhead = soup_doc.select_one('h3')
byline = soup_doc.select_one('p')

print(title.text)
print(subhead.text)
print(byline.text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma


## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, creating a dictionary out of the title, subhead, and byline in sentences, e.g. "the title is `______`"

> **This will be important for the next few:** you can use `.get_by_text` but it seems kind of silly since maybe the text would change. I think getting them all, then using list indexes like `[0]`, etc, would be better! If I sold you on CSS selectors, you can also look up `nth-of-type` and use it with `.select_one`.

In [13]:
await page.goto("http://jonathansoma.com/lede/static/by-list.html")

<Response url='https://jonathansoma.com/lede/static/by-list.html' request=<Request url='https://jonathansoma.com/lede/static/by-list.html' method='GET'>>

In [14]:
html = await page.content()
soup_doc = BeautifulSoup(html, "html.parser")
soup_doc

<html><head></head><body><p>How to Scrape Things</p>
<p>Some Supplemental Materials</p>
<p>By Jonathan Soma</p></body></html>

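#### Note to self:
The `:nth-of-type(n)` selector matches every element that is the nth sibling of its own type (tag name) under its parent. `:nth-child(n)` counts differently: it matches an element only if it is the nth child of its parent, whatever the tag names of its siblings are.

`n` can be a number, a keyword (`odd` or `even`), or a formula (like `an + b`).

Syntax:

```
:nth-of-type(number) {
  css declarations;
}
```

More relevant here: select the third `td` element in a row. On these pages every child of a `tr` is a `td` (and every child of `body` is a `p`), so `nth-child` and `nth-of-type` pick the same element:
```
td:nth-child(3)
```
Note: counting starts at 1, so the first element is `nth-child(1)`, not `nth-child(0)`.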
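
To see where `nth-child` and `nth-of-type` part ways, here is a small sketch on made-up HTML. The `div`/`h1` markup is invented for illustration and is not one of the assignment pages.

```python
from bs4 import BeautifulSoup

# Invented markup: an h1 sits in front of the two paragraphs
html = "<div><h1>Headline</h1><p>First</p><p>Second</p></div>"
soup = BeautifulSoup(html, "html.parser")

# nth-child counts every sibling: the div's first child is the h1,
# so no <p> is a first child and nothing matches
by_child = soup.select_one("p:nth-child(1)")

# nth-of-type counts only siblings with the same tag name,
# so this is the first <p> inside the div
by_type = soup.select_one("p:nth-of-type(1)")

print(by_child)
print(by_type.text)
```

With `nth-child` you would have to ask for `p:nth-child(2)` to reach the first paragraph, because the `h1` takes up slot 1.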

In [16]:
dict1 = {}

dict1['title'] = f"The title is {soup_doc.select_one('p:nth-child(1)').text}"
dict1['subhead'] = f"The subhead is {soup_doc.select_one('p:nth-child(2)').text}"
dict1['byline'] = f"The byline is {soup_doc.select_one('p:nth-child(3)').text}"

dict1

{'title': 'The title is How to Scrape Things',
 'subhead': 'The subhead is Some Supplemental Materials',
 'byline': 'The byline is By Jonathan Soma'}
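
The list-index approach from the note at the top of this part could look roughly like this. It is a sketch on a pasted copy of the page's HTML, and the names `paragraphs`, `labels`, and `sentences` are my own.

```python
from bs4 import BeautifulSoup

# Same markup as the by-list page, pasted in as a string
html = """
<p>How to Scrape Things</p>
<p>Some Supplemental Materials</p>
<p>By Jonathan Soma</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Grab every <p>, then pick them out by position (Python lists start at 0)
paragraphs = soup.find_all("p")
first_sentence = f"The title is {paragraphs[0].text}"

# Or pair each paragraph with a label and build the whole dictionary at once
labels = ["title", "subhead", "byline"]
sentences = {
    label: f"The {label} is {p.text}"
    for label, p in zip(labels, paragraphs)
}
sentences
```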

## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline in sentences, e.g. "the title is `______`."

In [17]:
await page.goto("http://jonathansoma.com/lede/static/single-table-row.html")

<Response url='https://jonathansoma.com/lede/static/single-table-row.html' request=<Request url='https://jonathansoma.com/lede/static/single-table-row.html' method='GET'>>

In [18]:
html = await page.content()
soup_doc = BeautifulSoup(html, "html.parser")
soup_doc

<html><head></head><body><table>
<tbody><tr>
<td>How to Scrape Things</td>
<td>Some Supplemental Materials</td>
<td>By Jonathan Soma</td>
</tr>
</tbody></table></body></html>

In [19]:
title = f"The title is {soup_doc.select_one('td:nth-child(1)').text}"
subhead = f"The subhead is {soup_doc.select_one('td:nth-child(2)').text}"
byline = f"The byline is {soup_doc.select_one('td:nth-child(3)').text}"

print(title)
print(subhead)
print(byline)

The title is How to Scrape Things
The subhead is Some Supplemental Materials
The byline is By Jonathan Soma


## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [20]:
book = {}

book['title'] = soup_doc.select_one('td:nth-child(1)').text
book['subhead'] = soup_doc.select_one('td:nth-child(2)').text
book['byline'] = soup_doc.select_one('td:nth-child(3)').text

book

{'title': 'How to Scrape Things',
 'subhead': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [21]:
await page.goto("http://jonathansoma.com/lede/static/multiple-table-rows.html")

<Response url='https://jonathansoma.com/lede/static/multiple-table-rows.html' request=<Request url='https://jonathansoma.com/lede/static/multiple-table-rows.html' method='GET'>>

In [22]:
html = await page.content()
soup_doc = BeautifulSoup(html, "html.parser")
soup_doc

<html><head></head><body><table>
<tbody><tr>
<td>How to Scrape Things</td>
<td>Some Supplemental Materials</td>
<td>By Jonathan Soma</td>
</tr>
<tr>
<td>How to Scrape Many Things</td>
<td>But, Is It Even Possible?</td>
<td>By Sonathan Joma</td>
</tr>
<tr>
<td>The End of Scraping</td>
<td>Let's All Use CSV Files</td>
<td>By Amos Nathanos</td>
</tr>
</tbody></table></body></html>

In [25]:
all_rows = soup_doc.select('tr')

# Pull the three cells out of each row
rows = []
for row in all_rows:
    title = row.select_one('td:nth-child(1)').text
    subhead = row.select_one('td:nth-child(2)').text
    byline = row.select_one('td:nth-child(3)').text
    rows.append([title, subhead, byline])

keys = ['Title', 'Subhead', 'Byline']

# Pair each row's values with the keys to get one dictionary per row
to_dict = []
for each in rows:
    to_dict.append(dict(zip(keys, each)))
to_dict


[{'Title': 'How to Scrape Things',
  'Subhead': 'Some Supplemental Materials',
  'Byline': 'By Jonathan Soma'},
 {'Title': 'How to Scrape Many Things',
  'Subhead': 'But, Is It Even Possible?',
  'Byline': 'By Sonathan Joma'},
 {'Title': 'The End of Scraping',
  'Subhead': "Let's All Use CSV Files",
  'Byline': 'By Amos Nathanos'}]
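
For comparison, the two loops above can be folded into a single pass. This is just a sketch on a shortened, pasted copy of the table (two rows instead of three), and `books` is a name I made up.

```python
from bs4 import BeautifulSoup

# Shortened copy of the multiple-table-rows markup
html = """<table>
<tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
<tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

keys = ["Title", "Subhead", "Byline"]

# For each row, zip the keys with the text of that row's cells
books = [
    dict(zip(keys, (cell.text for cell in row.select("td"))))
    for row in soup.select("tr")
]
books
```

Selecting all the `td` cells in a row at once avoids writing out `nth-child(1)`, `nth-child(2)`, and `nth-child(3)` by hand, at the cost of relying on the cells always coming in the same order.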

## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either, even though that's exactly what we did in class.

In [26]:
await page.goto("http://jonathansoma.com/lede/static/the-actual-table.html")

<Response url='https://jonathansoma.com/lede/static/the-actual-table.html' request=<Request url='https://jonathansoma.com/lede/static/the-actual-table.html' method='GET'>>

In [27]:
html = await page.content()
soup_doc = BeautifulSoup(html, "html.parser")
soup_doc

<html><head></head><body><table id="booklist">
<tbody><tr>
<td>How to Scrape Things</td>
<td>Some Supplemental Materials</td>
<td>By Jonathan Soma</td>
</tr>
<tr>
<td>How to Scrape Many Things</td>
<td>But, Is It Even Possible?</td>
<td>By Sonathan Joma</td>
</tr>
<tr>
<td>The End of Scraping</td>
<td>Let's All Use CSV Files</td>
<td>By Amos Nathanos</td>
</tr>
</tbody></table></body></html>

In [29]:
# I already built a list of dictionaries in the previous part as practice,
# and the same code works on this page without changes

all_rows = soup_doc.select('tr')

# Pull the three cells out of each row
rows = []
for row in all_rows:
    title = row.select_one('td:nth-child(1)').text
    subhead = row.select_one('td:nth-child(2)').text
    byline = row.select_one('td:nth-child(3)').text
    rows.append([title, subhead, byline])

keys = ['Title', 'Subhead', 'Byline']

# Pair each row's values with the keys to get one dictionary per row
to_dict = []
for each in rows:
    to_dict.append(dict(zip(keys, each)))
to_dict


[{'Title': 'How to Scrape Things',
  'Subhead': 'Some Supplemental Materials',
  'Byline': 'By Jonathan Soma'},
 {'Title': 'How to Scrape Many Things',
  'Subhead': 'But, Is It Even Possible?',
  'Byline': 'By Sonathan Joma'},
 {'Title': 'The End of Scraping',
  'Subhead': "Let's All Use CSV Files",
  'Byline': 'By Amos Nathanos'}]

## Part 8: Scraping a table into a pandas DataFrame

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.

In [31]:
df = pd.DataFrame(to_dict)
df

|   | Title | Subhead | Byline |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |
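
The "just pandas" route could look something like the sketch below. It assumes a parser that `pd.read_html` can use is installed (`lxml`, or `bs4` together with `html5lib`), and it works on a pasted copy of the table's HTML. In the notebook you would pass `StringIO(html)` with the `html` you got from `await page.content()`.

```python
from io import StringIO

import pandas as pd

# Pasted copy of the-actual-table markup
html = """<table id="booklist">
<tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
<tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
<tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
</table>"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
df_from_html = tables[0]

# The table has no header row, so the columns need names
df_from_html.columns = ["Title", "Subhead", "Byline"]
df_from_html
```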


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`

In [32]:
df.to_csv("output.csv", index=False)
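
If you want to skip pandas entirely, the list of dictionaries from Part 7 can be written out with the standard library's `csv` module. A minimal sketch, using a hand-typed copy of that list (`books` is my own name) and writing into an in-memory buffer so it doesn't overwrite `output.csv`:

```python
import csv
import io

# Hand-typed copy of the list of dictionaries from Part 7
books = [
    {"Title": "How to Scrape Things",
     "Subhead": "Some Supplemental Materials",
     "Byline": "By Jonathan Soma"},
    {"Title": "How to Scrape Many Things",
     "Subhead": "But, Is It Even Possible?",
     "Byline": "By Sonathan Joma"},
    {"Title": "The End of Scraping",
     "Subhead": "Let's All Use CSV Files",
     "Byline": "By Amos Nathanos"},
]

# DictWriter maps each dictionary's keys onto the named columns
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Title", "Subhead", "Byline"])
writer.writeheader()
writer.writerows(books)

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, swap the buffer for `open("output.csv", "w", newline="")`. The writer quotes the subhead that contains a comma, so it still reads back as one column.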