# Scraping basics for Selenium

If you feel comfortable with scraping, you're free to skip this notebook.

## Part 0: Imports

Import what you need to use Selenium, and start up a new Chrome to use for scraping. You might want to copy from the [Selenium snippets](http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-snippets/) page.

**You only need to run `driver = webdriver.Chrome(...)` once.** Every time you run it, you'll open a new Chrome instance, so run it again only if you close the window (or want another Chrome, for some reason).

In [1]:
from bs4 import BeautifulSoup
import requests

import pandas as pd

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select

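from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 expects the chromedriver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))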



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [/Users/tanazmeghjani/.wdm/drivers/chromedriver/mac64/96.0.4664.45/chromedriver] found in cache


## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [2]:
driver.get("http://jonathansoma.com/lede/static/by-class.html")

In [3]:
driver.find_element(By.CLASS_NAME, "title").text

'How to Scrape Things'

In [4]:
driver.find_element(By.CLASS_NAME, "subhead").text

'Some Supplemental Materials'

In [5]:
driver.find_element(By.CLASS_NAME, "byline").text

'By Jonathan Soma'

## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [6]:
driver.get("http://jonathansoma.com/lede/static/by-tag.html")

In [7]:
driver.find_element(By.TAG_NAME, "h1").text

'How to Scrape Things'

In [8]:
driver.find_element(By.TAG_NAME, "h3").text

'Some Supplemental Materials'

In [9]:
driver.find_element(By.TAG_NAME, "p").text

'By Jonathan Soma'

## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline.

> **This will be important for the next few:** if you scrape multiples, you get a list. Even though it's Selenium, you can use things like `[0]`, `[1]`, `[-1]`, etc., just like you would for a normal list.
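As a quick illustration of the indexing described above, here is a minimal sketch. It uses a plain list of strings as a stand-in for the list that `find_elements` returns; the name `paragraphs` and its values are assumed sample data, not scraped output.

```python
# Stand-in for the list returned by driver.find_elements(...).
# Real results hold WebElement objects; plain strings keep this runnable without a browser.
paragraphs = ["first paragraph", "second paragraph", "third paragraph"]

first = paragraphs[0]    # first match
second = paragraphs[1]   # second match
last = paragraphs[-1]    # last match, counting from the end

print(first, second, last, sep=" | ")
```

With real Selenium results, each item is an element, so you would add `.text` after the index to read its contents.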

In [10]:
driver.get("http://jonathansoma.com/lede/static/by-list.html")

In [11]:
table = driver.find_elements(By.TAG_NAME, "p")
table

[<selenium.webdriver.remote.webelement.WebElement (session="ec7ece9eb25df698583fbaed03e4d43c", element="c2a39799-f031-4dea-9784-68f2e146bfcb")>,
 <selenium.webdriver.remote.webelement.WebElement (session="ec7ece9eb25df698583fbaed03e4d43c", element="abafa3b1-c9e5-4949-82cd-03f5cc719fc8")>,
 <selenium.webdriver.remote.webelement.WebElement (session="ec7ece9eb25df698583fbaed03e4d43c", element="a46aafbd-b8e2-4376-ac6e-c6f61d3301aa")>]

In [12]:
title = table[0].text
title

'How to Scrape Things'

In [13]:
subhead = table[1].text
subhead

'Some Supplemental Materials'

In [14]:
byline = table[2].text
byline

'By Jonathan Soma'

## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline.

In [15]:
driver.get("http://jonathansoma.com/lede/static/single-table-row.html")

In [16]:
table = driver.find_elements(By.XPATH, '/html/body/table')
for row in table:
    print(row.text)

How to Scrape Things Some Supplemental Materials By Jonathan Soma




## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [17]:
driver.get("http://jonathansoma.com/lede/static/single-table-row.html")

In [18]:
rows = driver.find_elements(By.TAG_NAME, "td")
book = {}
book['title'] = rows[0].text
book['subtitle'] = rows[1].text
book['byline'] = rows[2].text
book

{'title': 'How to Scrape Things',
 'subtitle': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}
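The same dictionary can also be built in one step by pairing key names with cell texts. This is a minimal sketch, assuming the cell texts have already been pulled out of the `td` elements; `cell_texts` and `keys` are illustrative names, and the strings are copied from the output above.

```python
# Assumed stand-in for [cell.text for cell in driver.find_elements(By.TAG_NAME, "td")]
cell_texts = ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"]

keys = ["title", "subtitle", "byline"]

# zip pairs each key with the cell text in the same position
book = dict(zip(keys, cell_texts))
print(book)
```

This keeps the key names in one place, which helps when a row has more than a handful of columns.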

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [19]:
driver.get("http://jonathansoma.com/lede/static/multiple-table-rows.html")

In [20]:
table = driver.find_elements(By.TAG_NAME, 'tr')
for row in table:
    cells = row.find_elements(By.TAG_NAME, 'td')
    print('title:', cells[0].text)
    print('subhead:', cells[1].text)
    print('byline:', cells[2].text)
    print("--------")

title: How to Scrape Things
subhead: Some Supplemental Materials
byline: By Jonathan Soma
--------
title: How to Scrape Many Things
subhead: But, Is It Even Possible?
byline: By Sonathan Joma
--------
title: The End of Scraping
subhead: Let's All Use CSV Files
byline: By Amos Nathanos
--------


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either!

In [21]:
driver.get("http://jonathansoma.com/lede/static/the-actual-table.html")

In [22]:
list_of_dict = []

table = driver.find_elements(By.TAG_NAME, "tr")

for row in table:
    # One dictionary per row; naming it `dict` would shadow the built-in type
    book = {}
    cells = row.find_elements(By.TAG_NAME, 'td')
    book['title:'] = cells[0].text
    book['subhead:'] = cells[1].text
    book['byline:'] = cells[2].text
    list_of_dict.append(book)
list_of_dict


[{'title:': 'How to Scrape Things',
  'subhead:': 'Some Supplemental Materials',
  'byline:': 'By Jonathan Soma'},
 {'title:': 'How to Scrape Many Things',
  'subhead:': 'But, Is It Even Possible?',
  'byline:': 'By Sonathan Joma'},
 {'title:': 'The End of Scraping',
  'subhead:': "Let's All Use CSV Files",
  'byline:': 'By Amos Nathanos'}]
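The loop above assumes every `tr` holds at least three `td` cells. If a table had a header row built from `th` cells, searching that row for `td` would give an empty list, and `cells[0]` would raise an `IndexError`. Below is a minimal sketch of a guard for that case. It uses nested lists as assumed stand-ins for the cell texts, and `rows`, `keys`, and `books` are illustrative names rather than anything from the page.

```python
# Assumed stand-in: one inner list of cell texts per table row.
# The empty first list imitates a header row that contains no <td> cells.
rows = [
    [],
    ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"],
    ["How to Scrape Many Things", "But, Is It Even Possible?", "By Sonathan Joma"],
    ["The End of Scraping", "Let's All Use CSV Files", "By Amos Nathanos"],
]
keys = ["title", "subhead", "byline"]

books = []
for cells in rows:
    if len(cells) < len(keys):
        continue  # skip rows that do not have enough cells
    books.append(dict(zip(keys, cells)))

print(len(books), "rows kept")
```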

## Part 8: Scraping an actual table into a pandas DataFrame

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.

In [23]:
driver.get("http://jonathansoma.com/lede/static/the-actual-table.html")

In [24]:
list_of_dict = []

table = driver.find_elements(By.TAG_NAME, "tr")

for row in table:
    # One dictionary per row; naming it `dict` would shadow the built-in type
    book = {}
    cells = row.find_elements(By.TAG_NAME, 'td')
    book['title:'] = cells[0].text
    book['subhead:'] = cells[1].text
    book['byline:'] = cells[2].text
    list_of_dict.append(book)
list_of_dict


[{'title:': 'How to Scrape Things',
  'subhead:': 'Some Supplemental Materials',
  'byline:': 'By Jonathan Soma'},
 {'title:': 'How to Scrape Many Things',
  'subhead:': 'But, Is It Even Possible?',
  'byline:': 'By Sonathan Joma'},
 {'title:': 'The End of Scraping',
  'subhead:': "Let's All Use CSV Files",
  'byline:': 'By Amos Nathanos'}]

In [25]:
df = pd.DataFrame(list_of_dict)
df

| | title: | subhead: | byline: |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`

In [26]:
# index=False keeps the DataFrame's row numbers out of the file
df.to_csv('output.csv', index=False)

In [27]:
df

| | title: | subhead: | byline: |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |
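If you would rather not go through pandas, the standard library's `csv` module can also write a list of dictionaries. This is a minimal sketch under a few assumptions: `write_books` is a hypothetical helper, the `books` list is typed in by hand from the output above (with plain key names), and an in-memory buffer stands in for the file so the result can be inspected.

```python
import csv
import io

# Hand-typed sample rows; in the notebook these would come from the scraping loop
books = [
    {"title": "How to Scrape Things", "subhead": "Some Supplemental Materials", "byline": "By Jonathan Soma"},
    {"title": "How to Scrape Many Things", "subhead": "But, Is It Even Possible?", "byline": "By Sonathan Joma"},
]

def write_books(handle, rows):
    """Write rows (a list of dicts) to an open text handle as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=["title", "subhead", "byline"])
    writer.writeheader()
    writer.writerows(rows)

# An in-memory buffer stands in for a real file here
buffer = io.StringIO()
write_books(buffer, books)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, the same helper could be called as `with open("output.csv", "w", newline="") as f: write_books(f, books)`. Fields that contain commas are quoted automatically.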
