# Scraping basics for Playwright or Selenium

If you're comfortable with scraping in general, feel free to skip this notebook and go straight to the next one. The same goes if you get bored partway through.

**Possibly useful links:**

* Scraping section of my [everything page](https://jonathansoma.com/everything/)
* Some [old Selenium snippets](http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-snippets/) (if you decide to use Selenium)
* [Loops in Playwright](https://jonathansoma.com/everything/scraping/loops-in-playwright/), which covers the problem we kept running into in class when using `.locator` so much.

## Part 0: Imports

Import what you need to use Playwright or Selenium, and start up a new browser to use for scraping. 
> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [2]:
# Install playwright
!pip install playwright

# Install a browser (Chromium, open-source Chrome)
!playwright install chromium


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [3]:
from playwright.async_api import async_playwright

In [4]:
playwright = await async_playwright().start()

In [5]:
browser = await playwright.chromium.launch(headless=False)

In [6]:
page = await browser.new_page()

## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline. You're welcome to use BeautifulSoup as long as the information comes from Playwright/Selenium.

In [7]:
await page.goto("http://jonathansoma.com/lede/static/by-class.html")

<Response url='https://jonathansoma.com/lede/static/by-class.html' request=<Request url='https://jonathansoma.com/lede/static/by-class.html' method='GET'>>

In [8]:
await page.content()

'<html><head></head><body><h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Some Supplemental Materials</h3>\n<p class="byline">By Jonathan Soma</p></body></html>'

In [14]:
from bs4 import BeautifulSoup

html = await page.content()

# Name the parser explicitly so BeautifulSoup doesn't have to guess one
doc = BeautifulSoup(html, "html.parser")
doc

<html><head></head><body><h1 class="title">How to Scrape Things</h1>
<h3 class="subhead">Some Supplemental Materials</h3>
<p class="byline">By Jonathan Soma</p></body></html>

## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline. You're welcome to use BeautifulSoup as long as the information comes from Playwright/Selenium.

In [21]:
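# Select each element by its class name, using the `doc` built in Part 1.
# No loop is needed: select_one searches the whole document.
title = doc.select_one(".title").text
print(title)

subhead = doc.select_one(".subhead").text
print(subhead)

byline = doc.select_one(".byline").text
print(byline)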

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma
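
The cell above selects by class, which matches the by-class page from Part 1. For a page without classes, `select_one` also accepts plain tag names. The sketch below assumes the markup uses `h1`, `h3`, and `p` tags, as the Part 1 HTML does. The actual by-tag page is not shown in this notebook, so check `await page.content()` before relying on these selectors.

```python
from bs4 import BeautifulSoup

# Assumption: stand-in HTML based on the Part 1 output, with the class
# attributes removed to mimic a page that can only be scraped by tag.
html = (
    "<html><head></head><body><h1>How to Scrape Things</h1>\n"
    "<h3>Some Supplemental Materials</h3>\n"
    "<p>By Jonathan Soma</p></body></html>"
)

tag_doc = BeautifulSoup(html, "html.parser")

# select_one takes any CSS selector; a bare tag name matches the first such tag
title = tag_doc.select_one("h1").text
subhead = tag_doc.select_one("h3").text
byline = tag_doc.select_one("p").text

print(title)
print(subhead)
print(byline)
```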


## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline. You're welcome to use BeautifulSoup as long as the information comes from Playwright/Selenium.

> **This will be important for the next few:** if you scrape multiple items, you have a list. In Selenium you can use `[0]`, `[1]`, `[-1]`, etc. just like you would for a normal list (and in Playwright, too, as long as you're using `query_selector_all`). If you're using locators you'll need to use `.nth(0)`, `.nth(1)`, `.nth(2)`.

In [98]:

await page.goto("https://jonathansoma.com/lede/static/by-list.html")
await page.content()

'<html><head></head><body><p>How to Scrape Things</p>\n<p>Some Supplemental Materials</p>\n<p>By Jonathan Soma</p></body></html>'

In [108]:
doc_2 = await page.locator("p").nth(1).text_content()
doc_2

'Some Supplemental Materials'
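
The cell above pulls only the second paragraph with `.nth(1)`. If you hand the HTML to BeautifulSoup instead, `select` returns something that behaves like a regular list, so plain indexing works. A minimal sketch, using the HTML string from the output above in place of a live `await page.content()` call:

```python
from bs4 import BeautifulSoup

# HTML copied from the page.content() output above
html = (
    "<html><head></head><body><p>How to Scrape Things</p>\n"
    "<p>Some Supplemental Materials</p>\n"
    "<p>By Jonathan Soma</p></body></html>"
)

list_doc = BeautifulSoup(html, "html.parser")

# select returns a list-like ResultSet, so [0], [1], and [-1] all work
paragraphs = list_doc.select("p")
title = paragraphs[0].text
subhead = paragraphs[1].text
byline = paragraphs[-1].text

print(title)
print(subhead)
print(byline)
```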

## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline.

In [48]:
url = "https://jonathansoma.com/lede/static/single-table-row.html"

In [51]:
import pandas as pd
await page.goto(url)

<Response url='https://jonathansoma.com/lede/static/single-table-row.html' request=<Request url='https://jonathansoma.com/lede/static/single-table-row.html' method='GET'>>

In [54]:
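from io import StringIO

html = await page.content()
# Wrap the HTML in StringIO: recent pandas versions warn about raw HTML strings
tables = pd.read_html(StringIO(html))
df = tables[0]
df.head()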

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |


## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [117]:
await page.goto("http://jonathansoma.com/lede/static/single-table-row.html")
await page.content()

'<html><head></head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Some Supplemental Materials</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n</tbody></table></body></html>'

In [119]:
html = await page.content()
doc_3 = BeautifulSoup(html, "html.parser")

# Keep only the table rows
doc_3 = doc_3.select("tr")
doc_3

[<tr>
 <td>How to Scrape Things</td>
 <td>Some Supplemental Materials</td>
 <td>By Jonathan Soma</td>
 </tr>]

In [143]:
# Start with an empty list; each row becomes one dictionary
all_data = []

for row in doc_3:
    book = {}

    # Each row holds three <td> cells: title, subhead, byline (in that order)
    cells = row.find_all("td")
    book['title'] = cells[0].text
    book['subhead'] = cells[1].text
    book['byline'] = cells[2].text
    all_data.append(book)

all_data

[{'title': 'How to Scrape Things',
  'subhead': 'Some Supplemental Materials',
  'byline': 'By Jonathan Soma'}]
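
Since this page has only one row, the loop and the `all_data` list are optional. Here is a shorter sketch that builds `book` directly, with the HTML from the output above standing in for `await page.content()`. The `keys` and `cells` names are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# HTML copied from the page.content() output above
html = (
    "<html><head></head><body><table>\n  <tbody><tr>\n"
    "    <td>How to Scrape Things</td>\n"
    "    <td>Some Supplemental Materials</td>\n"
    "    <td>By Jonathan Soma</td>\n"
    "  </tr>\n</tbody></table></body></html>"
)

row_doc = BeautifulSoup(html, "html.parser")

# Pair each key with the text of the cell in the same position
keys = ["title", "subhead", "byline"]
cells = row_doc.select("td")
book = dict(zip(keys, (cell.text for cell in cells)))

print(book)
```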

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [82]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()
await page.goto("http://jonathansoma.com/lede/static/multiple-table-rows.html")
await page.content()

"<html><head></head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Some Supplemental Materials</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let's All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</tbody></table></body></html>"

In [144]:
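# page.content() returns whichever page is loaded right now, so re-run the
# page.goto() cell above first if you have navigated elsewhere since
html = await page.content()
doc_6 = BeautifulSoup(html, "html.parser")

doc_6 = doc_6.select("tr")
doc_6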

[<tr>
 <td>How to Scrape Things</td>
 <td>Some Supplemental Materials</td>
 <td>By Jonathan Soma</td>
 </tr>]

In [159]:
all_data = []
for row in doc_6:
    data = row.find_all("td")
    all_data.append(data)
all_data

# Putting .text right after find_all("td") raises an error: find_all returns
# a list of tags, and .text only exists on the individual tags.

[[<td>How to Scrape Things</td>,
  <td>Some Supplemental Materials</td>,
  <td>By Jonathan Soma</td>]]
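
The error mentioned in the comment above comes from calling `.text` on the whole result of `find_all`, which is a list of tags rather than a single tag. Asking each cell for its `.text` one at a time avoids it. This sketch uses the three-row HTML shown earlier in this part (with whitespace trimmed) instead of a live page:

```python
from bs4 import BeautifulSoup

# Three-row table, condensed from the page.content() output earlier in Part 6
html = """<table><tbody>
<tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
<tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
<tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
</tbody></table>"""

rows_doc = BeautifulSoup(html, "html.parser")

all_data = []
for row in rows_doc.select("tr"):
    # find_all returns a list of <td> tags; take .text from each tag
    cells = [cell.text for cell in row.find_all("td")]
    all_data.append(cells)

    # Print the title, subhead, and byline for this row
    print(cells[0])
    print(cells[1])
    print(cells[2])
```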

## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either!

In [160]:
await page.goto("http://jonathansoma.com/lede/static/the-actual-table.html")
await page.content()

'<html><head></head><body><table id="booklist">\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Some Supplemental Materials</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let\'s All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</tbody></table></body></html>'

In [162]:
html = await page.content()
doc_7 = BeautifulSoup(html, "html.parser")

doc_7 = doc_7.select("tr")
doc_7

[<tr>
 <td>How to Scrape Things</td>
 <td>Some Supplemental Materials</td>
 <td>By Jonathan Soma</td>
 </tr>,
 <tr>
 <td>How to Scrape Many Things</td>
 <td>But, Is It Even Possible?</td>
 <td>By Sonathan Joma</td>
 </tr>,
 <tr>
 <td>The End of Scraping</td>
 <td>Let's All Use CSV Files</td>
 <td>By Amos Nathanos</td>
 </tr>]

In [164]:
all_data = []

for row in doc_7:
    actual_table = {}

    # Each row holds three <td> cells: title, subhead, byline (in that order)
    cells = row.find_all("td")
    actual_table['title'] = cells[0].text
    actual_table['subhead'] = cells[1].text
    actual_table['byline'] = cells[2].text
    all_data.append(actual_table)

all_data

[{'title': 'How to Scrape Things',
  'subhead': 'Some Supplemental Materials',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subhead': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subhead': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]

## Part 8: Scraping multiple table rows into a list of dictionaries

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.

In [165]:
df = pd.DataFrame(all_data)
df.head()

|   | title | subhead | byline |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |
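
The cell above is the route that reuses the Part 7 result. The pandas-only route could look like the sketch below. It assumes `pd.read_html` has an HTML parser available (`lxml`, or `bs4` together with `html5lib`), and it wraps the HTML in `StringIO` because recent pandas versions warn when given a raw HTML string. The column names are assigned by hand because the table has no header row. The variable names `tables` and `books_df` are illustrative.

```python
from io import StringIO

import pandas as pd

# Table condensed from the page.content() output in Part 7
html = """<table id="booklist"><tbody>
<tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
<tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
<tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
</tbody></table>"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
books_df = tables[0]

# Without a header row the columns come back as 0, 1, 2, so name them here
books_df.columns = ["title", "subhead", "byline"]

print(books_df)
```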


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`

In [166]:
df.to_csv("output.csv", index=False)
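
If you would rather skip pandas, the list of dictionaries from Part 7 can also be written with the standard library's `csv.DictWriter`. A sketch, with `all_data` retyped here from the Part 7 output so the block runs on its own:

```python
import csv

# Retyped from the Part 7 output so this block is self-contained
all_data = [
    {"title": "How to Scrape Things",
     "subhead": "Some Supplemental Materials",
     "byline": "By Jonathan Soma"},
    {"title": "How to Scrape Many Things",
     "subhead": "But, Is It Even Possible?",
     "byline": "By Sonathan Joma"},
    {"title": "The End of Scraping",
     "subhead": "Let's All Use CSV Files",
     "byline": "By Amos Nathanos"},
]

# newline="" stops the csv module from writing blank lines on Windows
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "subhead", "byline"])
    writer.writeheader()        # first line: title,subhead,byline
    writer.writerows(all_data)  # one line per dictionary
```

`DictWriter` quotes values that contain commas, so a subhead like "But, Is It Even Possible?" stays in one column when the file is read back.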