# Scraping basics for Selenium

If you feel comfortable with scraping, you're free to skip this notebook.

## Part 0: Imports

Import what you need to use Selenium, and start up a new Chrome to use for scraping. You might want to copy from the [Selenium snippets](http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-snippets/) page.

**You only need to run `driver = webdriver.Chrome(...)` once.** Every time you run it, a new Chrome instance opens, so run it again only if you close the window (or want another Chrome, for some reason).

In [1]:
import pandas as pd

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
from webdriver_manager.firefox import GeckoDriverManager



In [2]:
from selenium.webdriver.firefox.service import Service

# Selenium 4 takes the driver path through a Service object; executable_path is deprecated
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))


## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [3]:
driver.get("http://jonathansoma.com/lede/static/by-class.html")

In [4]:
print(f"Title: {driver.find_element(By.CLASS_NAME, 'title').text}")
print(f"Subhead: {driver.find_element(By.CLASS_NAME, 'subhead').text}")
print(f"Byline: {driver.find_element(By.CLASS_NAME, 'byline').text}")

Title: How to Scrape Things
Subhead: Some Supplemental Materials
Byline: By Jonathan Soma


## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [5]:
driver.get("http://jonathansoma.com/lede/static/by-tag.html")

In [6]:
print(f"Title: {driver.find_element(By.TAG_NAME, 'h1').text}")
print(f"Subhead: {driver.find_element(By.TAG_NAME, 'h3').text}")
print(f"Byline: {driver.find_element(By.TAG_NAME, 'p').text}")

Title: How to Scrape Things
Subhead: Some Supplemental Materials
Byline: By Jonathan Soma


## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline.

> **This will be important for the next few:** if you scrape multiples, you have a list. Even though it's Selenium, you can use things like `[0]`, `[1]`, `[-1]`, etc., just like you would for a normal list.

In [7]:
driver.get("http://jonathansoma.com/lede/static/by-list.html")

In [8]:
print(f"Title: {driver.find_elements(By.TAG_NAME, 'p')[0].text}")
print(f"Subhead: {driver.find_elements(By.TAG_NAME, 'p')[1].text}")
print(f"Byline: {driver.find_elements(By.TAG_NAME, 'p')[-1].text}")

Title: How to Scrape Things
Subhead: Some Supplemental Materials
Byline: By Jonathan Soma
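
Because `find_elements` returns an ordinary Python list, the paragraphs can be fetched once and then indexed as many times as needed. The following is a minimal sketch of that idea, not part of the original solution; the helper name `title_subhead_byline` is made up for illustration, and it assumes each element only needs a `.text` attribute.

```python
def title_subhead_byline(paragraphs):
    """Pick the first, second, and last items out of a list of elements.

    `paragraphs` is expected to be the list returned by
    driver.find_elements(By.TAG_NAME, 'p'); only `.text` is read.
    """
    return {
        "Title": paragraphs[0].text,
        "Subhead": paragraphs[1].text,
        "Byline": paragraphs[-1].text,  # negative indexes count from the end
    }
```

With a live browser this could be called as `title_subhead_byline(driver.find_elements(By.TAG_NAME, 'p'))`, which asks the page for the paragraphs once instead of three times.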


## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline.

In [9]:
driver.get("http://jonathansoma.com/lede/static/single-table-row.html")

In [10]:
print(f"Title: {driver.find_elements(By.TAG_NAME, 'td')[0].text}")
print(f"Subhead: {driver.find_elements(By.TAG_NAME, 'td')[1].text}")
print(f"Byline: {driver.find_elements(By.TAG_NAME, 'td')[2].text}")


Title: How to Scrape Things
Subhead: Some Supplemental Materials
Byline: By Jonathan Soma


## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [11]:
book={
    'title': driver.find_elements(By.TAG_NAME, "td")[0].text,
    'subtitle': driver.find_elements(By.TAG_NAME, "td")[1].text,
    'byline': driver.find_elements(By.TAG_NAME, "td")[2].text
}


In [12]:
book

{'title': 'How to Scrape Things',
 'subtitle': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}
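
The same dictionary can be built from a single `find_elements` call by pairing key names with cells in order. This is a sketch under assumptions: the helper name `cells_to_book` and its `keys` parameter are hypothetical, and the cells are assumed to appear in title, subtitle, byline order as they do on this page.

```python
def cells_to_book(cells, keys=("title", "subtitle", "byline")):
    """Pair each key with the text of the cell in the same position.

    `cells` is expected to be the list returned by
    driver.find_elements(By.TAG_NAME, 'td'); only `.text` is read.
    """
    # zip stops at the shorter input, so extra cells are ignored
    return dict(zip(keys, (cell.text for cell in cells)))
```

Against the live page, `book = cells_to_book(driver.find_elements(By.TAG_NAME, 'td'))` would be one way to use it.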

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [13]:
driver.get("http://jonathansoma.com/lede/static/multiple-table-rows.html")

In [14]:
print(f"Title: {driver.find_element(By.XPATH, '/html/body/table/tbody/tr[1]/td[1]').text}")
print(f"Subhead: {driver.find_element(By.XPATH, '/html/body/table/tbody/tr[1]/td[2]').text}")
print(f"Byline: {driver.find_element(By.XPATH, '/html/body/table/tbody/tr[1]/td[3]').text}")
print("-----")
print(f"Title: {driver.find_element(By.XPATH, '/html/body/table/tbody/tr[2]/td[1]').text}")
print(f"Subhead: {driver.find_element(By.XPATH, '/html/body/table/tbody/tr[2]/td[2]').text}")
print(f"Byline: {driver.find_element(By.XPATH, '/html/body/table/tbody/tr[2]/td[3]').text}")
print("-----")
print(f"Title: {driver.find_element(By.XPATH, '/html/body/table/tbody/tr[3]/td[1]').text}")
print(f"Subhead: {driver.find_element(By.XPATH, '/html/body/table/tbody/tr[3]/td[2]').text}")
print(f"Byline: {driver.find_element(By.XPATH, '/html/body/table/tbody/tr[3]/td[3]').text}")
print("-----")

Title: How to Scrape Things
Subhead: Some Supplemental Materials
Byline: By Jonathan Soma
-----
Title: How to Scrape Many Things
Subhead: But, Is It Even Possible?
Byline: By Sonathan Joma
-----
Title: The End of Scraping
Subhead: Let's All Use CSV Files
Byline: By Amos Nathanos
-----
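
Hard-coding one XPath per cell works for three rows but does not scale. A loop over the rows is one alternative, sketched below under assumptions: every data row is a `tr` containing at least three `td` cells, and the function name `print_rows` is made up for illustration. The locator string `"tag name"` is the value of Selenium's `By.TAG_NAME`, written out so the sketch needs no imports.

```python
TAG_NAME = "tag name"  # same string as selenium's By.TAG_NAME


def print_rows(driver):
    """Print title, subhead, and byline for every table row on the current page."""
    for row in driver.find_elements(TAG_NAME, "tr"):
        # Searching from the row element only returns cells inside that row
        cells = row.find_elements(TAG_NAME, "td")
        if len(cells) < 3:
            continue  # skip header rows or anything without three cells
        print(f"Title: {cells[0].text}")
        print(f"Subhead: {cells[1].text}")
        print(f"Byline: {cells[2].text}")
        print("-----")
```

Calling `print_rows(driver)` after `driver.get(...)` should handle any number of rows without editing the code.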


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either!

In [16]:
driver.get("http://jonathansoma.com/lede/static/multiple-table-rows.html")

In [19]:
cell_xpath = '/html/body/table/tbody/tr[{row}]/td[{col}]'

def column(col):
    # Text of column `col` in rows 1 through 3
    return [driver.find_element(By.XPATH, cell_xpath.format(row=r, col=col)).text for r in (1, 2, 3)]

table = {'title': column(1), 'subtitle': column(2), 'byline': column(3)}

table

{'title': ['How to Scrape Things',
  'How to Scrape Many Things',
  'The End of Scraping'],
 'subtitle': ['Some Supplemental Materials',
  'But, Is It Even Possible?',
  "Let's All Use CSV Files"],
 'byline': ['By Jonathan Soma', 'By Sonathan Joma', 'By Amos Nathanos']}
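
The result above is a dictionary of lists, while the task asks for a list of dictionaries. One way to get from one shape to the other is sketched here; the helper name `table_to_records` is hypothetical, and it assumes every column list has the same length.

```python
def table_to_records(table):
    """Turn {'col': [v1, v2], ...} into [{'col': v1, ...}, {'col': v2, ...}]."""
    columns = list(table)          # column names, in insertion order
    rows = zip(*table.values())    # regroup the values row by row
    return [dict(zip(columns, row)) for row in rows]
```

A list like the one `table_to_records(table)` returns can also be passed to `pd.DataFrame(...)`, which is the second route mentioned for Part 8.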

## Part 8: Scraping multiple table rows into a list of dictionaries

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.

In [22]:
driver.get("http://jonathansoma.com/lede/static/the-actual-table.html")

In [25]:
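from io import StringIO

# Recent pandas versions deprecate passing a raw HTML string, so wrap it in StringIO
df = pd.read_html(StringIO(driver.page_source))[0]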

In [26]:
df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`

In [27]:
driver.get("http://jonathansoma.com/lede/static/the-actual-table.html")

In [28]:
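from io import StringIO

# Recent pandas versions deprecate passing a raw HTML string, so wrap it in StringIO
df = pd.read_html(StringIO(driver.page_source))[0]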

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


In [None]:
df.to_csv("output.csv", index=False)
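
If the scraped data is a list of dictionaries rather than a DataFrame, the standard library's `csv` module can write the file without pandas. This is an alternative sketch, not the approach used above; the function name `save_books` and the default column names are assumptions chosen to match the earlier parts.

```python
import csv


def save_books(books, path, fieldnames=("title", "subtitle", "byline")):
    """Write a list of dictionaries to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(books)
```

For example, `save_books(books, "output.csv")` would write one line per dictionary, quoting values that contain commas.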