# Scraping basics for Playwright or Selenium

If you already feel comfortable with scraping, you're free to skip this notebook and go straight to the next one. Same goes if you get bored partway through.

**Possibly useful links:**

* Scraping section of my [everything page](https://jonathansoma.com/everything/)
* Some [old Selenium snippets](http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-snippets/) (if you decide to use Selenium)
* [Loops in Playwright](https://jonathansoma.com/everything/scraping/loops-in-playwright/), which covers what was giving us trouble in class when we leaned on `.locator` so much.

## Part 0: Imports

Import what you need to use Playwright or Selenium, and start up a new browser to use for scraping. 
> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [2]:
import asyncio
from playwright.async_api import async_playwright

In [4]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline. You're welcome to use BeautifulSoup as long as the information comes from Playwright/Selenium.

In [13]:
await page.goto('http://jonathansoma.com/lede/static/by-class.html')

<Response url='https://jonathansoma.com/lede/static/by-class.html' request=<Request url='https://jonathansoma.com/lede/static/by-class.html' method='GET'>>

In [14]:
await page.content()

'<html><head></head><body><h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Some Supplemental Materials</h3>\n<p class="byline">By Jonathan Soma</p></body></html>'

In [8]:
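from bs4 import BeautifulSoup
html = await page.content()
doc = BeautifulSoup(html, 'html.parser')
print(doc.find(class_='title').text)
print(doc.find(class_='subhead').text)
print(doc.find(class_='byline').text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma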
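
If you like CSS selectors better than `find`, here is a minimal sketch of the same by-class lookup using `select_one`. The HTML string is copied from the `page.content()` output above so the example runs without a browser; in the notebook you would pass in the result of `await page.content()` instead.

```python
from bs4 import BeautifulSoup

# HTML copied from the page.content() output above
html = (
    '<html><head></head><body><h1 class="title">How to Scrape Things</h1>\n'
    '<h3 class="subhead">Some Supplemental Materials</h3>\n'
    '<p class="byline">By Jonathan Soma</p></body></html>'
)
doc = BeautifulSoup(html, 'html.parser')

# select_one takes a CSS selector; ".title" matches class="title"
title = doc.select_one('.title').text
subhead = doc.select_one('.subhead').text
byline = doc.select_one('.byline').text

print(title)
print(subhead)
print(byline)
```

`select_one` returns `None` when nothing matches, so a typo in the class name shows up as an `AttributeError` on `.text`.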

## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline. You're welcome to use BeautifulSoup as long as the information comes from Playwright/Selenium.

In [23]:
await page.goto('http://jonathansoma.com/lede/static/by-tag.html')

<Response url='https://jonathansoma.com/lede/static/by-tag.html' request=<Request url='https://jonathansoma.com/lede/static/by-tag.html' method='GET'>>

In [24]:
await page.content()

'<html><head></head><body><h1>How to Scrape Things</h1>\n<h3>Some Supplemental Materials</h3>\n<p>By Jonathan Soma</p></body></html>'

In [25]:

html = await page.content()
doc = BeautifulSoup(html, 'html.parser')

# No classes here, so grab each element by its tag name
print(doc.find('h1').text)
print(doc.find('h3').text)
print(doc.find('p').text)


How to Scrape Things
Some Supplemental Materials
By Jonathan Soma


## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline. You're welcome to use BeautifulSoup as long as the information comes from Playwright/Selenium.

> **This will be important for the next few:** if you scrape multiple items, you have a list. In Selenium you can use `[0]`, `[1]`, `[-1]`, etc. just like you would for a normal list (and in Playwright, too, as long as you're using `query_selector_all`). If you're using locators you'll need to use `.nth(0)`, `.nth(1)`, `.nth(2)`.
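
Here is a rough sketch of the list indexing described in that note, using BeautifulSoup so it runs without a browser. The HTML string is a made-up stand-in (three bare `<p>` tags), not the real page source.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: three <p> tags with no classes
html = (
    '<p>How to Scrape Things</p>'
    '<p>Some Supplemental Materials</p>'
    '<p>By Jonathan Soma</p>'
)
doc = BeautifulSoup(html, 'html.parser')

# find_all gives back something that behaves like a normal Python list
paragraphs = doc.find_all('p')

first = paragraphs[0].text    # first item
second = paragraphs[1].text   # second item
last = paragraphs[-1].text    # negative indexes count from the end

print(first, second, last, sep='\n')
```

With Playwright locators the matching pattern would be along the lines of `await page.locator('p').nth(0).text_content()`, since a locator is not a list and can't be indexed with square brackets.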

In [26]:
await page.goto('http://jonathansoma.com/lede/static/by-list.html')

<Response url='https://jonathansoma.com/lede/static/by-list.html' request=<Request url='https://jonathansoma.com/lede/static/by-list.html' method='GET'>>

In [27]:

html = await page.content()
doc = BeautifulSoup(html, 'html.parser')

for element in doc.find_all('p'):
    print(element.text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma


## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline.

In [28]:
await page.goto('http://jonathansoma.com/lede/static/single-table-row.html')

<Response url='https://jonathansoma.com/lede/static/single-table-row.html' request=<Request url='https://jonathansoma.com/lede/static/single-table-row.html' method='GET'>>

In [30]:
html = await page.content()
doc = BeautifulSoup(html, 'html.parser')

for element in doc.find_all('td'):
    print(element.text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma


## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [55]:
await page.goto('http://jonathansoma.com/lede/static/single-table-row.html')

html = await page.content()
doc = BeautifulSoup(html, 'html.parser')

cells = doc.find_all('td')
print(len(cells))
print(cells[0])

# Store the text of each cell, not the <td> element itself
book = {}
book['title'] = cells[0].text
book['subhead'] = cells[1].text
book['byline'] = cells[2].text
print(book)

3
<td>How to Scrape Things</td>
{'title': 'How to Scrape Things', 'subhead': 'Some Supplemental Materials', 'byline': 'By Jonathan Soma'}
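
A slightly more compact way to build the same kind of dictionary is to pair a list of keys with the cell text using `zip`. This is only a sketch: the HTML string is a hypothetical one-row table standing in for the real page, and `keys` is a name introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table holding the same three cells
html = (
    '<table><tr>'
    '<td>How to Scrape Things</td>'
    '<td>Some Supplemental Materials</td>'
    '<td>By Jonathan Soma</td>'
    '</tr></table>'
)
doc = BeautifulSoup(html, 'html.parser')

keys = ['title', 'subhead', 'byline']
cells = doc.find_all('td')

# Pair each key with the text of the cell in the same position
book = dict(zip(keys, (cell.get_text(strip=True) for cell in cells)))
print(book)
```

`get_text(strip=True)` trims surrounding whitespace, which helps when the page's HTML is indented.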


## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [88]:
await page.goto('http://jonathansoma.com/lede/static/multiple-table-rows.html')

html = await page.content()
doc = BeautifulSoup(html, 'html.parser')

rows = doc.find_all('tr')
print(len(rows))

# Each row holds three cells: title, subhead, byline
for row in rows:
    cells = row.find_all('td')
    print('title:', cells[0].text)
    print('subhead:', cells[1].text)
    print('byline:', cells[2].text)


3
title: How to Scrape Things
subhead: Some Supplemental Materials
byline: By Jonathan Soma
title: How to Scrape Many Things
subhead: But, Is It Even Possible?
byline: By Sonathan Joma
title: The End of Scraping
subhead: Let's All Use CSV Files
byline: By Amos Nathanos
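
The loop above works because every row on this page has three `<td>` cells. On other tables a header row built from `<th>` cells would make `cells[0]` raise an `IndexError`. Here is a hedged sketch of one way to guard against that; the HTML and the `printed` list are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical table: a <th> header row followed by two data rows
html = """
<table>
  <tr><th>Title</th><th>Subhead</th><th>Byline</th></tr>
  <tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
  <tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
</table>
"""
doc = BeautifulSoup(html, 'html.parser')

printed = []  # titles we actually printed, kept so the result is easy to check
for row in doc.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) != 3:
        # Header rows use <th>, so they contain no <td> cells: skip them
        continue
    title, subhead, byline = (cell.text for cell in cells)
    print('title:', title)
    print('subhead:', subhead)
    print('byline:', byline)
    printed.append(title)
```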


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either!

In [77]:
await page.goto('http://jonathansoma.com/lede/static/the-actual-table.html')

html = await page.content()
doc = BeautifulSoup(html, 'html.parser')

# Build a list of dictionaries, one per table row
books = []
rows = doc.find_all('tr')
for row in rows:
    data = {}
    cells = row.find_all('td')
    data['title'] = cells[0].text
    data['subhead'] = cells[1].text
    data['byline'] = cells[2].text
    print(data)

    books.append(data)

books

{'title': 'How to Scrape Things', 'subhead': 'Some Supplemental Materials', 'byline': 'By Jonathan Soma'}
{'title': 'How to Scrape Many Things', 'subhead': 'But, Is It Even Possible?', 'byline': 'By Sonathan Joma'}
{'title': 'The End of Scraping', 'subhead': "Let's All Use CSV Files", 'byline': 'By Amos Nathanos'}


[{'title': 'How to Scrape Things',
  'subhead': 'Some Supplemental Materials',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subhead': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subhead': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]

In [78]:
import pandas as pd


## Part 8: Scraping multiple table rows into a list of dictionaries

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.

In [80]:
df = pd.DataFrame(books)

In [81]:
df

|   | title | subhead | byline |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |
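
For the pandas-only route, `pd.read_html` is the usual tool. The sketch below is an assumption about how you might use it here, not the required answer: the HTML string is a hypothetical stand-in for what `await page.content()` returns, and `read_html` needs a parser such as `lxml` installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the HTML returned by `await page.content()`
html = """
<table>
  <thead>
    <tr><th>title</th><th>subhead</th><th>byline</th></tr>
  </thead>
  <tbody>
    <tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
    <tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
    <tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
df_from_html = tables[0]
print(df_from_html)
```

If the table has no header row, the columns come back numbered `0`, `1`, `2`, and you can rename them with something like `df_from_html.columns = ['title', 'subhead', 'byline']`.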


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`

In [86]:
df.to_csv('output.csv', index=False)
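
If you'd rather skip pandas for the saving step, the standard library's `csv.DictWriter` can write a list of dictionaries directly. This is a sketch with a small stand-in `books` list; it writes to an in-memory buffer so it leaves no file behind, and `buffer` and `lines` are names made up for the example.

```python
import csv
import io

# Stand-in data shaped like a scraped list of dictionaries
books = [
    {'title': 'How to Scrape Things', 'subhead': 'Some Supplemental Materials', 'byline': 'By Jonathan Soma'},
    {'title': 'How to Scrape Many Things', 'subhead': 'But, Is It Even Possible?', 'byline': 'By Sonathan Joma'},
    {'title': 'The End of Scraping', 'subhead': "Let's All Use CSV Files", 'byline': 'By Amos Nathanos'},
]

# In-memory file; swap in open('output.csv', 'w', newline='') to write to disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'subhead', 'byline'])
writer.writeheader()       # first line: column names
writer.writerows(books)    # one line per dictionary

lines = buffer.getvalue().splitlines()
print(lines[0])
print(lines[2])
```

Notice that the subhead containing a comma gets wrapped in quotes automatically, which is the main reason to use the `csv` module rather than joining strings by hand.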