## Intro To Webscraping

We're going to learn how to scrape data from a website. It's quite common to want to pull a bunch of data from a website in order to analyze it. For example:

 - Scrape a bunch of comments/social media posts to analyze sentiment
 - Scrape a bunch of data saved as .csv files somewhere
 - Pull public records that don't have a native download option on the webpage
 - etc, etc
 
Webscraping is SUPER powerful, because it enables you to create datasets out of almost anything you can find online. We'll walk through a toy example here, and then you can feel free to identify the website of your choice and scrape away!

We'll need some tools before we get started:

I should point out, there are multiple frameworks to use for webscraping: beautifulsoup, selenium + chromedriver, requests, and urllib are all fairly common and suit different applications. Each scraping task calls for slightly different capabilities, so part of the job is choosing the right tool. We'll focus on the first two here.

- First, make sure you have Chrome. You should already, as it's the best browser on the planet :)

- Second, download and run the appropriate <a href="https://chromedriver.chromium.org/downloads">ChromeDriver</a> (the release that matches your version of Chrome)

- Third, make sure you have both selenium and beautifulsoup4 installed
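
A quick way to confirm the third step is to ask Python whether it can locate both packages. This is only a minimal sketch: the helper name `missing_packages` is made up for illustration. One thing worth knowing is that beautifulsoup4 is imported under the name `bs4`.

```python
from importlib.util import find_spec

def missing_packages(import_names):
    """Return the import names that Python cannot find in this environment."""
    return [name for name in import_names if find_spec(name) is None]

# beautifulsoup4 is the install name, bs4 is the import name
missing = missing_packages(["selenium", "bs4"])
if missing:
    print("Still need to install:", ", ".join(missing))
else:
    print("selenium and beautifulsoup4 are both available")
```

If anything is reported as missing, install it with pip before moving on.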


## -----------DISCLAIMER!!!-------------WARNING!!!--------------

Many websites prohibit webscraping. That's not to say people don't do it all the time anyway, but we'll need to play by the rules here. If you search online, there are many awesome uses/examples/tutorials on webscraping. One common exercise is to scrape Indeed or LinkedIn for job postings in a given city to figure out the most in-demand skills. The problem is, many of the job boards prohibit scraping. So, just for liability's sake, we're going to do a toy example here. <a href="http://toscrape.com">Toscrape.com</a> is a free website that was set up specifically for scraping, so we can play around with it without worrying about a specific site's Terms of Service changing on us.

In [6]:
!pip3 install selenium

In [8]:
#import selenium and beautifulsoup

import selenium
from selenium import webdriver

import bs4
from bs4 import BeautifulSoup

In [13]:
#point selenium to your download of the chromedriver and instantiate a driver

#driver = webdriver.Chrome('/Users/mschuchardt/bin/chromedriver')

In [34]:
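from selenium.webdriver.chrome.service import Service #Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service("/Users/vandana/Downloads/chromedriver"))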



In [35]:
#test it out and see if it's working

driver.get('http://books.toscrape.com/')

## Ok - let's get to scraping. Let's build a dataset that has the following for each book:

- Title
- Price
- Description
- Rating
- Genre
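
To make the target concrete, here is one possible shape for a single row of that dataset. It's a sketch under assumptions: the class name `Book` is hypothetical, and every field is kept as the raw string scraped from the page (fields we haven't scraped yet default to `None`).

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Book:
    """One row of the dataset; only the title is required up front."""
    title: str
    price: Optional[str] = None
    description: Optional[str] = None
    rating: Optional[str] = None
    genre: Optional[str] = None

# Start with just a title and fill in the other fields as they get scraped
book = Book(title="A Light in the Attic")
print(asdict(book))
```

A list of these converts easily into rows for a csv file or a DataFrame later on.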

We'll use the same general process for each of the above. Web scraping can involve a bit of trial and error. Essentially you're just trying to figure out:
- Where does the element you're looking for live in the html?
- What is its tag or XPath?
- What is the page structure? I.e., once you've found the tag you want, how can you loop through the whole page to find ALL of the things you're interested in (like all the titles)?
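
Those three questions can be tried out on a tiny, made-up page before touching a real one. The sketch below uses the standard library's ElementTree, which understands only a small subset of XPath, in place of a browser. The html is a simplified stand-in, not the real site's markup.

```python
import xml.etree.ElementTree as ET

# A made-up, simplified stand-in for a listing page
SAMPLE_HTML = """
<section>
  <ol>
    <li><article><h3><a href="one.html">First Book</a></h3></article></li>
    <li><article><h3><a href="two.html">Second Book</a></h3></article></li>
    <li><article><h3><a href="three.html">Third Book</a></h3></article></li>
  </ol>
</section>
"""

root = ET.fromstring(SAMPLE_HTML.strip())

# Where does the element live? Follow the nesting down from the root.
first_link = root.find("./ol/li[1]/article/h3/a")
print(first_link.text)

# What is the page structure? Only the li index changes from book to book,
# so looping over that index visits every title.
titles = []
for i in range(1, 4):
    link = root.find("./ol/li[{}]/article/h3/a".format(i))
    titles.append(link.text)
print(titles)
```

The same idea carries over to the real page: find the path to one element, spot the part of the path that changes, and loop over it.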

## I'll give an example below, then you can replicate it for each additional element you need.

 - right click one of the titles on the page and click 'inspect'
 - look at the section of html that pops up and right click that section of html and hover over 'copy'
 - click 'copy XPath'

In [36]:
#we'll pass the XPath we copied to find_element (the Selenium 4 replacement for find_elements_by_xpath) to see what we get:

from selenium.webdriver.common.by import By

title_element = driver.find_element(
    By.XPATH, '//*[@id="default"]/div/div/div/div/section/div[2]/ol/li[1]/article/h3/a')



In [37]:
#using '.text' on this element will show us the text displayed on the page

title = title_element.text
title

'A Light in the ...'

In [38]:
#that seems truncated - let's look back at the html and see if there's an attribute that holds the whole title
#get_attribute will help here

title = title_element.get_attribute('title')
title

#there we go!

'A Light in the Attic'
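
The two results differ because of where each string lives in the html: `.text` returns what sits between the opening and closing tags, while `get_attribute('title')` reads an attribute on the tag itself. Here is a small sketch of that difference, using the standard library's ElementTree as a stand-in for Selenium. The `href` value is a placeholder.

```python
import xml.etree.ElementTree as ET

# Illustrative link element; the href value is a placeholder
link = ET.fromstring(
    '<a href="book.html" title="A Light in the Attic">A Light in the ...</a>'
)

print(link.text)          # the shortened text shown on the page
print(link.get("title"))  # the full value stored in the title attribute
```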

## Ok - now we've found where each title is stored within the html of the page!

Now we need to repeat this process for every book.

In [39]:
# let's write a loop that pulls every title for the page and saves it in a list

from selenium.webdriver.common.by import By

title_list = []

for i in range(1,21): #this range represents the number of books per page

    #notice the .format(), which drops each book's position into the XPath
    xpath = '//*[@id="default"]/div/div/div/div/section/div[2]/ol/li[{}]/article/h3/a'.format(i)
    title_element = driver.find_element(By.XPATH, xpath)
    title_list.append(title_element.get_attribute('title'))
    
print('Magic! I\'ve scraped {} titles!'.format(len(title_list)))



Magic! I've scraped 20 titles!


In [40]:
#the loop above only scraped the first page. let's nest another loop to scrape every page

from selenium.webdriver.common.by import By

title_list = []

for i in range(1,51): #this range represents the number of pages in the catalogue

    driver.get('http://books.toscrape.com/catalogue/page-{}.html'.format(i))

    for x in range(1,21): #this range represents the number of books per page

        xpath = '//*[@id="default"]/div/div/div/div/section/div[2]/ol/li[{}]/article/h3/a'.format(x)
        title_element = driver.find_element(By.XPATH, xpath)
        title_list.append(title_element.get_attribute('title'))
        
#quit your session so you don't have any ghost browsing sessions running :)

driver.quit()

print('Magic! I\'ve scraped {} titles!'.format(len(title_list)))



Magic! I've scraped 1000 titles!
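
One caveat with the cell above: if any page fails to load partway through, `driver.quit()` never runs and the browser session is left open. A possible way around that is to wrap the loop in `try`/`finally`. The sketch below is an assumption-laden variation, not the notebook's method: `scrape_titles` and `make_driver` are hypothetical names, and it swaps the per-index loop for a single `find_elements` call that returns every match on the page.

```python
def scrape_titles(make_driver, page_urls, xpath):
    """Visit each url and collect the title attribute of every matching element."""
    driver = make_driver()
    titles = []
    try:
        for url in page_urls:
            driver.get(url)
            # "xpath" is the string value behind selenium's By.XPATH
            for element in driver.find_elements("xpath", xpath):
                titles.append(element.get_attribute("title"))
    finally:
        # runs whether or not the loop raised, so the browser always closes
        driver.quit()
    return titles

# Example call (assumes selenium is installed and can find a chromedriver;
# the XPath is the one used above with the li index dropped):
# from selenium import webdriver
# urls = ['http://books.toscrape.com/catalogue/page-{}.html'.format(i) for i in range(1, 51)]
# titles = scrape_titles(webdriver.Chrome, urls, '//section/div[2]/ol/li/article/h3/a')
```

Passing in `make_driver` rather than a ready-made driver keeps the function responsible for both opening and closing the browser.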


The approach above launches a full Chrome window for the whole scrape. It's cool, because you can follow along and literally see the scraping happening, but drawing every page on screen is computationally expensive, which makes it a poor fit for anything beyond a demo. In practice you'll likely want to run the browser in what's called a 'headless' manner. Go online and read up on how to run selenium without launching a GUI browser.

In [41]:
#set the options for the driver such that Chrome doesn't actually launch a GUI

from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless') #run Chrome without opening a window
#driver = webdriver.Chrome(service=Service('/Users/mschuchardt/bin/chromedriver'), options=options)
driver = webdriver.Chrome(service=Service("/Users/vandana/Downloads/chromedriver"), options=options)

title_list = []

for i in range(1,51):

    driver.get('http://books.toscrape.com/catalogue/page-{}.html'.format(i))

    for x in range(1,21): #this range represents the number of books per page

        xpath = '//*[@id="default"]/div/div/div/div/section/div[2]/ol/li[{}]/article/h3/a'.format(x)
        title_element = driver.find_element(By.XPATH, xpath)
        title_list.append(title_element.get_attribute('title'))
        
#quit your session so you don't have any ghost browsing sessions running :)

driver.quit()

print('Magic! I\'ve scraped {} titles!'.format(len(title_list)))



Magic! I've scraped 1000 titles!


## There we go! We've scraped all 1000 titles. Let's step back for a moment now.

Above we did a quick and dirty scrape of every title on the website. We kept it super high level, and never navigated to any individual book's page. In order to pull more data for each book, let's dive in deeper.

In this next section we'll first try to figure out if there's a pattern we can follow for each page's link. Then we'll loop through every page and scrape as much as we can from it.
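
One way to look for that pattern is to collect each book's link from a listing page and turn it into a full address. The sketch below assumes the links in the listing are written relative to the listing page. The href values are placeholders for illustration, not copied from the site.

```python
from urllib.parse import urljoin

# The listing page the links were found on
listing_url = "http://books.toscrape.com/catalogue/page-1.html"

# Hypothetical relative hrefs, as they might appear in the listing's html
relative_links = [
    "a-light-in-the-attic_1000/index.html",
    "some-other-book_999/index.html",
]

# urljoin resolves each href against the page it was found on
book_urls = [urljoin(listing_url, href) for href in relative_links]
for url in book_urls:
    print(url)
```

Once the full addresses are in a list, the same `driver.get()` loop we used for the catalogue pages can visit each book in turn.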

The selenium/ChromeDriver toolset is really cool, and helps visually illustrate what a webscraper is doing. In the next lab, we'll use beautifulsoup to get into the weeds.