# Scraping basics for Selenium

If you feel comfortable with scraping, you're free to skip this notebook.

## Part 0: Imports

Import what you need to use Selenium, and start up a new Chrome window to use for scraping (the cells below use Firefox instead, which works the same way). You might want to copy from the [Selenium snippets](http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-snippets/) page.

**You only need to run `driver = webdriver.Chrome(...)` (or `webdriver.Firefox(...)`) once.** Every time you run it, you'll open a new browser instance. You'll only need to run it again if you close the window (or want another browser, for some reason).

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.firefox import GeckoDriverManager



In [2]:
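from selenium.webdriver.firefox.service import Service

# Selenium 4 takes the driver path through a Service object; executable_path is deprecated
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))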






## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [11]:
driver.get('http://jonathansoma.com/lede/static/by-class.html')

In [14]:
driver.find_element(By.CLASS_NAME, 'title').text

'How to Scrape Things'

In [13]:
driver.find_element(By.CLASS_NAME, 'subhead').text

'Some Supplemental Materials'

In [12]:
driver.find_element(By.CLASS_NAME, 'byline').text

'By Jonathan Soma'

## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [15]:
driver.get('http://jonathansoma.com/lede/static/by-tag.html')

In [18]:
driver.find_element(By.TAG_NAME, 'h1').text

'How to Scrape Things'

In [17]:
driver.find_element(By.TAG_NAME, 'h3').text

'Some Supplemental Materials'

In [16]:
driver.find_element(By.TAG_NAME, 'p').text

'By Jonathan Soma'

## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline.

> **This will be important for the next few:** if you scrape multiple elements, you get a list. Even though it comes from Selenium, you can use `[0]`, `[1]`, `[-1]`, etc., just like you would with a normal list.
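To make the note above concrete without opening a browser, here is a minimal sketch. `FakeElement` is a hypothetical stand-in for a Selenium element (only `.text` is mimicked), and the texts are the ones this page is expected to return. The point is just that the result of `find_elements` can be indexed and looped over like any other Python list.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a Selenium element; only .text is mimicked."""
    text: str


# Pretend this list came from driver.find_elements(By.TAG_NAME, 'p')
p_tags = [
    FakeElement("How to Scrape Things"),
    FakeElement("Some Supplemental Materials"),
    FakeElement("By Jonathan Soma"),
]

first = p_tags[0].text               # first match
last = p_tags[-1].text               # last match, counted from the end
all_text = [p.text for p in p_tags]  # every match, as plain strings

print(first)
print(last)
print(len(p_tags))
```

With a real driver you would swap the hand-built list for the `find_elements` call and keep the rest unchanged.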

In [66]:
driver.get('http://jonathansoma.com/lede/static/by-list.html')

In [67]:
p_tags = driver.find_elements(By.TAG_NAME, 'p')

In [68]:
title = p_tags[0].text
subhead = p_tags[1].text
byline = p_tags[2].text

list01 = [title, subhead, byline]
list01

['How to Scrape Things', 'Some Supplemental Materials', 'By Jonathan Soma']

## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline.

In [69]:
driver.get('http://jonathansoma.com/lede/static/single-table-row.html')

In [70]:
tds = driver.find_elements(By.TAG_NAME, 'td')

In [73]:
title = tds[0].text
subhead = tds[1].text
byline = tds[2].text

list_rows = [title, subhead, byline]
list_rows

['How to Scrape Things', 'Some Supplemental Materials', 'By Jonathan Soma']

## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [78]:
driver.get('http://jonathansoma.com/lede/static/single-table-row.html')

In [79]:
for_dict = driver.find_elements(By.TAG_NAME, 'td')

In [80]:
book = {}

book['title'] = for_dict[0].text
book['subhead'] = for_dict[1].text
book['byline'] = for_dict[2].text

book

{'title': 'How to Scrape Things',
 'subhead': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}
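If the cells always come in the same order, another way to build the dictionary is to pair a list of keys with the cell texts using `zip`. This is only a sketch: `cell_texts` is a hand-written stand-in for the texts pulled from the `td` elements, so it runs without a browser.

```python
# Stand-in for [td.text for td in driver.find_elements(By.TAG_NAME, 'td')]
cell_texts = [
    "How to Scrape Things",
    "Some Supplemental Materials",
    "By Jonathan Soma",
]

# Keys listed in the same order as the cells appear in the row
keys = ["title", "subhead", "byline"]

# zip pairs each key with the cell text in the same position
book = dict(zip(keys, cell_texts))
print(book)
```

Note that `zip` stops at the shorter of the two lists, so a row with a missing cell would silently produce a dictionary with fewer keys.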

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [81]:
driver.get('http://jonathansoma.com/lede/static/multiple-table-rows.html')

In [83]:
table_rows = driver.find_elements(By.TAG_NAME, "tr")

In [95]:
scrape_multi_rows = []

for row in table_rows:
    # Look up the cells once per row
    cells = row.find_elements(By.TAG_NAME, "td")

    # Create a fresh dictionary for each row so the entries stay independent
    each_row = {}
    each_row['Title'] = cells[0].text
    each_row['Subhead'] = cells[1].text
    each_row['Byline'] = cells[2].text  # the byline is the third cell

    scrape_multi_rows.append(each_row)

scrape_multi_rows
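A common pitfall when building a list of dictionaries in a loop is creating the dictionary once, outside the loop. Every `append` then stores the same object, so the whole list ends up showing the last row. The sketch below uses made-up data (no Selenium needed) to show the difference.

```python
rows = [("A", "a"), ("B", "b"), ("C", "c")]  # made-up data for illustration

# Pitfall: one dictionary shared by every iteration
shared = {}
aliased = []
for title, subhead in rows:
    shared["Title"] = title
    shared["Subhead"] = subhead
    aliased.append(shared)  # appends the same object each time

# Fix: a new dictionary per iteration
independent = []
for title, subhead in rows:
    row = {"Title": title, "Subhead": subhead}
    independent.append(row)

print([r["Title"] for r in aliased])      # ['C', 'C', 'C']
print([r["Title"] for r in independent])  # ['A', 'B', 'C']
```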

## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either!

In [96]:
driver.get('http://jonathansoma.com/lede/static/the-actual-table.html')

In [116]:
table_body = driver.find_elements(By.TAG_NAME, 'tr')

scraped_table = []
count = 0

for row in table_body:
    count = count + 1
    title = row.find_elements(By.TAG_NAME, 'td')[0].text
    subhead = row.find_elements(By.TAG_NAME, 'td')[1].text
    byline = row.find_elements(By.TAG_NAME, 'td')[2].text
    
    table_row = {}
    
    table_row['Line'] = count
    table_row['Title'] = title
    table_row['Subhead'] = subhead
    table_row['Byline'] = byline
    
    scraped_table.append(table_row)
    
scraped_table

[{'Line': 1,
  'Title': 'How to Scrape Things',
  'Subhead': 'Some Supplemental Materials',
  'Byline': 'By Jonathan Soma'},
 {'Line': 2,
  'Title': 'How to Scrape Many Things',
  'Subhead': 'But, Is It Even Possible?',
  'Byline': 'By Sonathan Joma'},
 {'Line': 3,
  'Title': 'The End of Scraping',
  'Subhead': "Let's All Use CSV Files",
  'Byline': 'By Amos Nathanos'}]
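The loop above can also be wrapped in a small helper, which makes it easier to reuse and to guard against rows that have no `td` cells (for example a header row built from `th` cells). This is a sketch under assumptions: `rows_to_dicts`, `FakeRow`, and `FakeCell` are hypothetical names, and the fake classes only mimic the one `find_elements` lookup used here so the example runs without a browser.

```python
from dataclasses import dataclass, field

TAG_NAME = "tag name"  # the string value of Selenium's By.TAG_NAME


@dataclass
class FakeCell:
    """Hypothetical stand-in for a <td> element."""
    text: str


@dataclass
class FakeRow:
    """Hypothetical stand-in for a <tr> element."""
    cells: list = field(default_factory=list)

    def find_elements(self, by, value):
        # Only supports looking up <td> cells, which is all this sketch needs
        return self.cells if (by, value) == (TAG_NAME, "td") else []


def rows_to_dicts(rows):
    """Turn row elements into a list of dictionaries, skipping short rows."""
    records = []
    for row in rows:
        cells = row.find_elements(TAG_NAME, "td")  # look the cells up once per row
        if len(cells) < 3:
            continue  # e.g. a header row that has no <td> cells
        records.append({
            "Line": len(records) + 1,  # numbers only the rows that are kept
            "Title": cells[0].text,
            "Subhead": cells[1].text,
            "Byline": cells[2].text,
        })
    return records


sample_rows = [
    FakeRow(),  # header-like row with no <td> cells
    FakeRow([FakeCell("How to Scrape Things"),
             FakeCell("Some Supplemental Materials"),
             FakeCell("By Jonathan Soma")]),
    FakeRow([FakeCell("How to Scrape Many Things"),
             FakeCell("But, Is It Even Possible?"),
             FakeCell("By Sonathan Joma")]),
]

records = rows_to_dicts(sample_rows)
print(records)
```

With a real driver, you would pass `driver.find_elements(By.TAG_NAME, 'tr')` in place of `sample_rows` and use `By.TAG_NAME` instead of the `TAG_NAME` constant.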

## Part 8: Scraping an actual table into a pandas DataFrame

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.

In [117]:
import pandas as pd

In [119]:
pd.DataFrame(scraped_table)

|   | Line | Title | Subhead | Byline |
|---|------|-------|---------|--------|
| 0 | 1 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | 2 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | 3 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`.

In [120]:
driver.get('http://jonathansoma.com/lede/static/the-actual-table.html')

In [121]:
tr_part_9 = driver.find_elements(By.TAG_NAME, 'tr')

In [123]:
table_9 = []

# enumerate(..., start=1) numbers the rows as we loop over them
for line_number, row in enumerate(tr_part_9, start=1):
    # Look up the cells once per row
    cells = row.find_elements(By.TAG_NAME, 'td')

    table_row = {}
    table_row['Line'] = line_number
    table_row['Title'] = cells[0].text
    table_row['Subhead'] = cells[1].text
    table_row['Byline'] = cells[2].text

    table_9.append(table_row)

df_table_9 = pd.DataFrame(table_9)

In [125]:
df_table_9.to_csv('output.csv', index=False)
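If you'd rather not bring in pandas just to write a file, the standard library's `csv.DictWriter` can save a list of dictionaries directly. This is a sketch, not the notebook's approach: `write_csv` is a hypothetical helper, `records` is a hand-copied version of the Part 7 result, and it writes to an in-memory buffer so nothing touches the disk.

```python
import csv
import io

# Hand-copied from the Part 7 result, so this runs without a browser
records = [
    {"Line": 1, "Title": "How to Scrape Things",
     "Subhead": "Some Supplemental Materials", "Byline": "By Jonathan Soma"},
    {"Line": 2, "Title": "How to Scrape Many Things",
     "Subhead": "But, Is It Even Possible?", "Byline": "By Sonathan Joma"},
    {"Line": 3, "Title": "The End of Scraping",
     "Subhead": "Let's All Use CSV Files", "Byline": "By Amos Nathanos"},
]


def write_csv(rows, out):
    """Write a list of dictionaries to an open text file (or buffer) as CSV."""
    writer = csv.DictWriter(out, fieldnames=["Line", "Title", "Subhead", "Byline"])
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open('output.csv', 'w', newline='', encoding='utf-8')` and pass the file object instead of `buffer`. Fields that contain commas, like the second subhead, are quoted automatically.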