# Scraping basics for Selenium

If you feel comfortable with scraping, you're free to skip this notebook.

## Part 0: Imports

Import what you need to use Selenium, and start up a new Chrome to use for scraping. You might want to copy from the [Selenium snippets](http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-snippets/) page.

**You only need to run `driver = webdriver.Chrome(...)` once.** Every time you run it, a new Chrome instance opens. Run it again only if you close the window (or want another Chrome, for some reason).

In [100]:
# All of the necessary imports
import pandas as pd

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select

from webdriver_manager.chrome import ChromeDriverManager

In [101]:
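# Launch a new Chrome, installing the appropriate ChromeDriver if necessary.
# Selenium 4 expects the driver path to be wrapped in a Service object.
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))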



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [/Users/susanmerriam/.wdm/drivers/chromedriver/mac64/96.0.4664.45/chromedriver] found in cache


## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [102]:
driver.get("http://jonathansoma.com/lede/static/by-class.html")

In [103]:
titles = driver.find_elements(By.CLASS_NAME, "title")
for title in titles:
    print(title.text)

subheads = driver.find_elements(By.CLASS_NAME, "subhead")
for subhead in subheads:
    print(subhead.text)

bylines = driver.find_elements(By.CLASS_NAME, "byline")
for byline in bylines:
    print(byline.text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma


## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [104]:
driver.get("http://jonathansoma.com/lede/static/by-tag.html")

In [105]:
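# On this page the title is the h1, the subhead is the h3, and the byline is the p
headings = driver.find_elements(By.TAG_NAME, "h1")
for heading in headings:
    print(heading.text)

subheadings = driver.find_elements(By.TAG_NAME, "h3")
for subheading in subheadings:
    print(subheading.text)

paragraphs = driver.find_elements(By.TAG_NAME, "p")
for paragraph in paragraphs:
    print(paragraph.text)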

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma


## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline.

> **This will be important for the next few:** if you scrape multiples, you have a list. Even though it's Selenium, you can use things like `[0]`, `[1]`, `[-1]`, etc. just like you would for a normal list.

In [106]:
driver.get("http://jonathansoma.com/lede/static/by-list.html")

In [107]:
p2 = driver.find_elements(By.TAG_NAME, "p")
print(p2[0].text)
print(p2[1].text)
print(p2[2].text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma
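
One thing to watch with `p2[0]`, `p2[1]`, `p2[2]`: if the page has fewer matches than you expect, you get an `IndexError`. Here is a minimal sketch of a guard. `text_at` is a hypothetical helper (not part of Selenium), and the `SimpleNamespace` objects are stand-ins for real elements so the example runs without a browser.

```python
from types import SimpleNamespace


def text_at(elements, index, default=""):
    """Return elements[index].text, or default if that index does not exist.

    text_at is a hypothetical helper, not part of Selenium.
    """
    try:
        return elements[index].text
    except IndexError:
        return default


# Stand-ins for the list that driver.find_elements(By.TAG_NAME, "p") returns.
# Real Selenium elements expose the same .text attribute.
paragraphs = [
    SimpleNamespace(text="How to Scrape Things"),
    SimpleNamespace(text="Some Supplemental Materials"),
    SimpleNamespace(text="By Jonathan Soma"),
]

first = text_at(paragraphs, 0)
last = text_at(paragraphs, -1)
missing = text_at(paragraphs, 5, default="(not found)")

print(first)
print(last)
print(missing)
```

With a real page you would pass the list from `driver.find_elements(...)` instead of the stand-ins.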


In [108]:
# gets everything / all the p's
# p2 = driver.find_elements(By.TAG_NAME, "p")
# for title in p2:
#     print(title.text)

In [109]:
# By.CLASS_NAME or By.ID or By.CSS_SELECTOR
# resource: https://jonathansoma.com/lede/foundations-2018/classes/selenium/scraping-supplement/

## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline.

In [110]:
driver.get("http://jonathansoma.com/lede/static/single-table-row.html")

In [111]:
# Find all of the tds on the page
cells = driver.find_elements(By.TAG_NAME, "td")
# Print out the first one, the second one, the third one
print(cells[0].text)
print(cells[1].text)
print(cells[2].text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma


In [112]:
# # Find all of the tds on the page
# cells = driver.find_elements(By.TAG_NAME, "td")
# # Print out the first one, the second one, the third one
# print("The title is", cells[0].text)
# print("Subhead is", cells[1].text)
# print("Byline is", cells[2].text)

## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [113]:
driver.get("http://jonathansoma.com/lede/static/single-table-row.html")

In [114]:
# Find all of the tds on the page
cells2 = driver.find_elements(By.TAG_NAME, "td")

# Start with an empty dictionary
book = {}

# Add the keys one by one
book['title'] = cells2[0].text
book['subhead'] = cells2[1].text
book['byline'] = cells2[2].text

# Print it out
print(book)

{'title': 'How to Scrape Things', 'subhead': 'Some Supplemental Materials', 'byline': 'By Jonathan Soma'}
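
Adding the keys one by one works fine for three cells. If you had more columns, one possible shortcut is pairing a list of key names with the cells using `zip`. This is only a sketch: the name `book_from_zip` is made up for illustration, and the `SimpleNamespace` objects stand in for the `td` elements so it runs without a browser.

```python
from types import SimpleNamespace

# Stand-ins for driver.find_elements(By.TAG_NAME, "td").
# Real Selenium elements expose the same .text attribute.
cells = [
    SimpleNamespace(text="How to Scrape Things"),
    SimpleNamespace(text="Some Supplemental Materials"),
    SimpleNamespace(text="By Jonathan Soma"),
]

# The key names, in the same order as the cells in the row
keys = ["title", "subhead", "byline"]

# zip pairs each key with the text of the cell in the same position
book_from_zip = dict(zip(keys, (cell.text for cell in cells)))

print(book_from_zip)
```

Note that `zip` stops at the shorter of the two sequences, so a row with a missing cell silently produces a dictionary with fewer keys instead of raising an error.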


## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [115]:
driver.get("http://jonathansoma.com/lede/static/multiple-table-rows.html")

In [116]:
rows = driver.find_elements(By.TAG_NAME, "tr")

for row in rows:
  # Find all of the tds inside of THAT ONE ROW
  cells3 = row.find_elements(By.TAG_NAME, "td")
  # Print out the first one, the second one, the third one
  print(cells3[0].text)
  # gives second td (column 2) in each row
  print(cells3[1].text)
  # gives third td (column 3) in each row
  print(cells3[2].text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma
How to Scrape Many Things
But, Is It Even Possible?
By Sonathan Joma
The End of Scraping
Let's All Use CSV Files
By Amos Nathanos


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either!

In [117]:
driver.get("http://jonathansoma.com/lede/static/the-actual-table.html")

In [118]:
# rows2 = driver.find_elements(By.TAG_NAME, "tr")

# for row in rows2:
#   # Find all of the tds inside of THAT ONE ROW
#   cells4 = row.find_elements(By.TAG_NAME, "td")
#   # Print out the first one, the second one, the third one
#   print(cells4[0].text)
#   # gives second td (column 2) in each row
#   print(cells4[1].text)
#   # gives third td (column 3) in each row
#   print(cells4[2].text)

In [125]:
# If there are multiple tables on the page, grab the one you want by id
table = driver.find_element(By.ID, "booklist")
rows3 = table.find_elements(By.TAG_NAME, "tr")

for row in rows3:
  # Find all of the tds inside of THAT ONE ROW
  cells5 = row.find_elements(By.TAG_NAME, "td")
  # Print out the first one, the second one, the third one
  print(cells5[0].text)
  print(cells5[1].text)
  print(cells5[2].text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma
How to Scrape Many Things
But, Is It Even Possible?
By Sonathan Joma
The End of Scraping
Let's All Use CSV Files
By Amos Nathanos




In [124]:
# If there are multiple tables on the page, grab the one you want by id
table = driver.find_element(By.ID, "booklist")
rows3 = table.find_elements(By.TAG_NAME, "tr")

for row in rows3:
    # Find all of the tds inside of THAT ONE ROW
    cells5 = row.find_elements(By.TAG_NAME, "td")

    # Start with an empty dictionary
    book2 = {}

    # Save the first, second, and third cells under named keys
    book2['title'] = cells5[0].text
    book2['subhead'] = cells5[1].text
    book2['byline'] = cells5[2].text

    # Print it out
    print(book2)

{'title': 'How to Scrape Things', 'subhead': 'Some Supplemental Materials', 'byline': 'By Jonathan Soma'}
{'title': 'How to Scrape Many Things', 'subhead': 'But, Is It Even Possible?', 'byline': 'By Sonathan Joma'}
{'title': 'The End of Scraping', 'subhead': "Let's All Use CSV Files", 'byline': 'By Amos Nathanos'}




In [121]:
# If there are multiple tables on the page, grab the one you want by id
table = driver.find_element(By.ID, "booklist")
rows4 = table.find_elements(By.TAG_NAME, "tr")

books = []
for row in rows4:
    # Find all of the tds inside of THAT ONE ROW
    cells6 = row.find_elements(By.TAG_NAME, "td")

    # Start with an empty dictionary
    book3 = {}

    # Save the first, second, and third cells under named keys
    book3['title'] = cells6[0].text
    book3['subhead'] = cells6[1].text
    book3['byline'] = cells6[2].text

    # Add this row's dictionary to the list
    books.append(book3)

# Print it out
print(books)

[{'title': 'How to Scrape Things', 'subhead': 'Some Supplemental Materials', 'byline': 'By Jonathan Soma'}, {'title': 'How to Scrape Many Things', 'subhead': 'But, Is It Even Possible?', 'byline': 'By Sonathan Joma'}, {'title': 'The End of Scraping', 'subhead': "Let's All Use CSV Files", 'byline': 'By Amos Nathanos'}]
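
The loop above assumes every `tr` has at least three `td` cells. On other pages a header row is often built from `th` cells, so `find_elements(By.TAG_NAME, "td")` returns an empty list for that row and `cells[0]` fails. Below is a rough sketch of one way to skip such rows. `rows_to_books` and `find_cells` are hypothetical names, and the header row here is invented for illustration (it is not taken from the page above). Stand-in objects replace real elements so it runs without a browser.

```python
from types import SimpleNamespace

KEYS = ["title", "subhead", "byline"]


def rows_to_books(rows, find_cells):
    """Build one dictionary per row, skipping rows with fewer cells than KEYS.

    rows_to_books is a hypothetical helper. find_cells is any function that
    takes a row and returns its list of cells.
    """
    books = []
    for row in rows:
        cells = find_cells(row)
        if len(cells) < len(KEYS):
            # Probably a header or spacer row, so skip it
            continue
        books.append({key: cell.text for key, cell in zip(KEYS, cells)})
    return books


def make_row(*texts):
    """Build a stand-in row whose .cells list mimics its <td> elements."""
    return SimpleNamespace(cells=[SimpleNamespace(text=t) for t in texts])


# Stand-in rows: an invented header row with no <td> cells, then two data rows
fake_rows = [
    make_row(),
    make_row("How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"),
    make_row("The End of Scraping", "Let's All Use CSV Files", "By Amos Nathanos"),
]

scraped = rows_to_books(fake_rows, find_cells=lambda row: row.cells)
print(scraped)
```

With Selenium you would pass the real rows and something like `lambda row: row.find_elements(By.TAG_NAME, "td")` as `find_cells`.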




## Part 8: Scraping multiple table rows into a list of dictionaries

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.

In [122]:
# Create a DataFrame from the list of dictionaries.
# Keys become columns, and each dictionary becomes one row.
# A numeric index is added during the conversion.

df = pd.DataFrame(books)
df.head()

|   | title | subhead | byline |
|---|-------|---------|--------|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |
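
The "just pandas" route is probably `pd.read_html`, which parses every `<table>` in a chunk of HTML and returns a list of DataFrames. Here is a small sketch under a few assumptions: the HTML string is a made-up sample (including its header row, which the real page may not have), `df_from_html` is a name chosen for illustration, and an HTML parser such as `lxml` needs to be installed for `read_html` to work.

```python
from io import StringIO

import pandas as pd

# Made-up sample HTML standing in for the page. The header row is an
# assumption for illustration, not copied from the real page.
sample_html = """
<table id="booklist">
  <thead>
    <tr><th>title</th><th>subhead</th><th>byline</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>How to Scrape Things</td>
      <td>Some Supplemental Materials</td>
      <td>By Jonathan Soma</td>
    </tr>
    <tr>
      <td>The End of Scraping</td>
      <td>Let's All Use CSV Files</td>
      <td>By Amos Nathanos</td>
    </tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per table it finds
tables = pd.read_html(StringIO(sample_html))
df_from_html = tables[0]

print(df_from_html)
```

With Selenium already open on the page, you could pass `StringIO(driver.page_source)` instead of the sample string. If the table has no header row, the columns come back numbered and you would rename them yourself.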


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`

In [123]:
# Use index=False to prevent the 'extra' number column
df.to_csv("output.csv", index=False)
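
If you ever want to skip pandas for this step, the list of dictionaries from Part 7 can also be written with Python's built-in `csv` module. This is a sketch rather than the assignment's intended answer: the `books` list below is a hand-typed copy of two scraped rows, and it writes to an in-memory `StringIO` so you can see the result without creating a file.

```python
import csv
from io import StringIO

# Hand-typed copy of two rows from the scraped list of dictionaries
books = [
    {"title": "How to Scrape Things",
     "subhead": "Some Supplemental Materials",
     "byline": "By Jonathan Soma"},
    {"title": "How to Scrape Many Things",
     "subhead": "But, Is It Even Possible?",
     "byline": "By Sonathan Joma"},
]

# Write into memory here. For a real file, use
# open("output.csv", "w", newline="") in place of StringIO().
buffer = StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "subhead", "byline"])
writer.writeheader()      # first line: the column names
writer.writerows(books)   # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

`DictWriter` takes care of quoting, so a subhead containing a comma is wrapped in double quotes and still lands in a single column.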