# Scraping the Unscrapable

Some sites are hard to scrape.

Sometimes you get blocked.
Sometimes the site relies heavily on JavaScript.

We'll look at a few workarounds for the former, then introduce Selenium, a tool that automates interactions with a real browser and can help with the latter.

In [1]:
# Selenium drives a real browser and performs the clicks and keystrokes for you

## How much is too much?

Sites have `robots.txt` pages that give guidelines about what they allow web crawlers to access.

In [2]:
import requests

url = 'http://www.github.com/robots.txt'
response = requests.get(url)
print(response.text)

# If you would like to crawl GitHub contact us at support@github.com.
# We also provide an extensive API: https://developer.github.com/

User-agent: CCBot
Allow: /*/*/tree/master
Allow: /*/*/blob/master
Disallow: /ekansa/Open-Context-Data
Disallow: /ekansa/opencontext-*
Disallow: /*/*/pulse
Disallow: /*/*/tree/*
Disallow: /*/*/blob/*
Disallow: /*/*/wiki/*/*
Disallow: /gist/*/*/*
Disallow: /oembed
Disallow: /*/forks
Disallow: /*/stars
Disallow: /*/download
Disallow: /*/revisions
Disallow: /*/*/issues/new
Disallow: /*/*/issues/search
Disallow: /*/*/commits/*/*
Disallow: /*/*/commits/*?author
Disallow: /*/*/commits/*?path
Disallow: /*/*/branches
Disallow: /*/*/tags
Disallow: /*/*/contributors
Disallow: /*/*/comments
Disallow: /*/*/stargazers
Disallow: /*/*/search
Disallow: /*/tarball/
Disallow: /*/zipball/
Disallow: /*/*/archive/
Disallow: /raw/*
Disallow: /*/followers
Disallow: /*/following
Disallow: /stars/*
Disallow: /*/blame/
Disallow: /*/watchers
Disallow: /*/network
Disallow: /*/gra

`Disallow: /` means disallow everything. GitHub's file ends with that rule for all user agents that aren't covered earlier (the output above is truncated). Box Office Mojo is more accepting:

In [3]:
url = 'http://www.boxofficemojo.com/robots.txt'
response = requests.get(url)
print(response.text)

# robots.txt for http://www.boxofficemojo.com

User-agent: *
Disallow: /movies/default.movies.htm
Disallow: /showtimes/buy.php
Disallow: /forums/
Disallow: /derbygame/
Disallow: /grades/
Disallow: /moviehangman/
Disallow: /users/
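
The standard library can check rules like these for you. Below is a minimal sketch using `urllib.robotparser`, fed with a few of the Box Office Mojo rules printed above so that it runs offline; the `my-scraper` user-agent name is a made-up placeholder. This parser does simple prefix matching on paths, so it is not a good fit for the `*` wildcards that GitHub's file uses inside paths.

```python
from urllib.robotparser import RobotFileParser

# A few of the rules from the Box Office Mojo output above
rules = """\
User-agent: *
Disallow: /showtimes/buy.php
Disallow: /forums/
Disallow: /users/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # parse text we already have; no network request

base = "http://www.boxofficemojo.com"
# "my-scraper" is a hypothetical user-agent name
allowed_movie = parser.can_fetch("my-scraper", base + "/movies/?id=matrix.htm")
allowed_forum = parser.can_fetch("my-scraper", base + "/forums/")

print(allowed_movie)  # True: no rule covers this path
print(allowed_forum)  # False: /forums/ is disallowed
```

To check a live site instead, `set_url(...)` followed by `read()` downloads and parses the file in one step.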




It's very common for sites to block you if you send too many requests in a certain time period. Sometimes all it takes to avoid this is well-placed pauses in your scraping.

Two general approaches:
* pause after every request
* pause after every n requests

In [4]:
# Pause after every request
import time

page_list = ['page1', 'page2', 'page3']  # stand-ins for real pages, e.g. whiskey table 1, whiskey table 2

for page in page_list:
    ### scrape a website
    ### ...
    print(page)

    time.sleep(2)  # wait 2 seconds before moving on to the next page
    

page1
page2
page3


In [5]:
# Pause after every 200 requests
import time

page_list = ['page1', 'page2', 'page3', 'page4', 'page5', 'page6']

for i, page in enumerate(page_list):
    ### scrape a website
    ### ...
    print(page)

    # Parentheses matter: i+1 % 200 parses as i + (1 % 200), which is never 0
    if (i + 1) % 200 == 0:
        time.sleep(300)  # after every 200 requests, pause for 5 minutes

page1
page2
page3
page4
page5
page6


Or better yet, add a random delay, which looks more human-like:

In [6]:
import random

for page in page_list:
    ### scrape a website
    ### ...
    print(page)
    
    time.sleep(0.5 + 2 * random.random())  # random pause between 0.5 and 2.5 seconds

page1
page2
page3
page4
page5
page6
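
The two ideas above (a random delay after each request and a longer rest after every n requests) can be combined. Here is a minimal sketch; `pause_seconds` and its default values are hypothetical choices for illustration, not recommended settings for any particular site.

```python
import random

def pause_seconds(request_index, every=200, long_pause=300.0, base=0.5, jitter=2.0):
    """Return how long to sleep after the request at 0-based request_index."""
    # After every `every`-th request, take one long rest
    if (request_index + 1) % every == 0:
        return long_pause
    # Otherwise take a short random pause in [base, base + jitter)
    return base + jitter * random.random()

# Small values here so the pattern is visible with only a few pages
for i, page in enumerate(['page1', 'page2', 'page3']):
    delay = pause_seconds(i, every=3, long_pause=5.0)
    print(page, round(delay, 2))
    # time.sleep(delay) would go here in a real scraper
```

Keeping the delay calculation in its own function makes it easy to check the schedule without actually waiting.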


## How do I make requests look like a real browser?

In [7]:
import sys
import requests
from bs4 import BeautifulSoup

url = 'http://www.reddit.com'

# Identify as a browser so the server sends back the page a browser would get
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers=user_agent)

# Headers are part of every HTTP request.

We can also generate a random user agent string:

In [8]:
from fake_useragent import UserAgent

ua = UserAgent()
user_agent = {'User-agent': ua.random}
# ua.random returns a different user agent string each time, so successive
# requests look as if they come from different browsers.

print(user_agent)

response = requests.get(url, headers=user_agent)
print(response.text)

{'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20130331 Firefox/21.0'}
<!DOCTYPE html><html lang="en-US"><head><script>
          var __SUPPORTS_TIMING_API = typeof performance === 'object' && !!performance.mark && !! performance.measure && !!performance.getEntriesByType;
          function __perfMark(name) { __SUPPORTS_TIMING_API && performance.mark(name); };
          var __firstLoaded = false;
          function __markFirstPostVisible() {
            if (__firstLoaded) { return; }
            __firstLoaded = true;
            __perfMark("first_post_title_image_loaded");
          }
        </script><script>
          __perfMark('head_tag_start');
        </script><title>reddit: the front page of the internet</title><meta charSet="utf-8"/><meta name="viewport" content="width=device-width, initial-scale=1"/><meta name="referrer" content="origin-when-cross-origin"/><style>
  /* http://meyerweb.com/eric/tools/css/reset/
    v2.0 | 20110126
    License: none (publ
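
If installing `fake_useragent` is not an option, a small hand-maintained pool gives a similar effect. This is only a sketch: the first string is the one printed above, the second is an illustrative Chrome-style string, and `random_headers` is a hypothetical helper name.

```python
import random

# Illustrative pool; in practice, copy current strings from real browsers
USER_AGENT_POOL = [
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20130331 Firefox/21.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
]

def random_headers(pool=USER_AGENT_POOL):
    """Build a headers dict with a user agent picked at random from the pool."""
    return {"User-Agent": random.choice(pool)}

headers = random_headers()
print(headers)
# The dict can then be passed along as before:
# response = requests.get(url, headers=headers)
```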

## Now to Selenium!

## What happens if I try to parse my gmail with `requests` and `BeautifulSoup`?

In [9]:
import requests
from bs4 import BeautifulSoup

gmail_url = "https://mail.google.com"
soup = BeautifulSoup(requests.get(gmail_url).text, "lxml")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=300, initial-scale=1" name="viewport"/>
  <meta content="Gmail is email that's intuitive, efficient and useful. 15 GB of storage, less spam and mobile access." name="description"/>
  <meta content="LrdTUW9psUAMbh4Ia074-BPEVmcpBxF6Gwf0MSgQXZs" name="google-site-verification"/>
  <title>
   Gmail
  </title>
  <style>
   @font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 300;
  src: local('Open Sans Light'), local('OpenSans-Light'), url(//fonts.gstatic.com/s/opensans/v15/mem5YaGs126MiZpBA-UN_r8OUuhs.ttf) format('truetype');
}
@font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 400;
  src: local('Open Sans'), local('OpenSans'), url(//fonts.gstatic.com/s/opensans/v15/mem8YaGs126MiZpBA-UFVZ0e.ttf) format('truetype');
}
  </style>
  <style>
   h1, h2 {
  -webkit-animation-duration: 0.1s;
  -webkit-animation-name: fontfix;
  -webkit-animation-iteration-

Well, this is a tiny page: we get redirected, so soupifying it is useless. Luckily, in this case we can see where we are sent. In many cases you won't be so lucky: the page contents are rendered by JavaScript in the browser, so just fetching the source won't help you.

Anyway, let's follow the redirection for now.

In [10]:
new_url = "https://mail.google.com/mail"

# requests.get fetches the page at the given URL
soup = BeautifulSoup(requests.get(new_url).text, "lxml")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=300, initial-scale=1" name="viewport"/>
  <meta content="Gmail is email that's intuitive, efficient and useful. 15 GB of storage, less spam and mobile access." name="description"/>
  <meta content="LrdTUW9psUAMbh4Ia074-BPEVmcpBxF6Gwf0MSgQXZs" name="google-site-verification"/>
  <title>
   Gmail
  </title>
  <style>
   @font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 300;
  src: local('Open Sans Light'), local('OpenSans-Light'), url(//fonts.gstatic.com/s/opensans/v15/mem5YaGs126MiZpBA-UN_r8OUuhs.ttf) format('truetype');
}
@font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 400;
  src: local('Open Sans'), local('OpenSans'), url(//fonts.gstatic.com/s/opensans/v15/mem8YaGs126MiZpBA-UFVZ0e.ttf) format('truetype');
}
  </style>
  <style>
   h1, h2 {
  -webkit-animation-duration: 0.1s;
  -webkit-animation-name: fontfix;
  -webkit-animation-iteration-

In [11]:
# BeautifulSoup only parses HTML; it has no way of "clicking" to move between pages.

In [None]:
print(soup.find(id='Email'))

We have hit the login page. We can't get to the emails without logging in ... i.e. we need to actually interact with the browser using Selenium!

In [40]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

import chromedriver_binary


driver = webdriver.Chrome('/Users/xianj/Desktop/Coding Journey/metis/Week 2/Day 2/chromedriver')
driver.get("https://mail.google.com")  # driver.get(url) loads the page in the browser window Selenium opened

# Alternatives to Chrome: Firefox, PhantomJS

### Interlude: how to include usernames and passwords

We are going to have to enter a username and password in order to log in. However, we **don't** want to have our password uploaded to GitHub for people to scrape! One solution is to use _environment variables_.

In your directory, create a file called `.env` that has the following format:
```bash
EMAIL="your_username@gmail.com"
PASSWORD="your_password"
```
DON'T ADD THIS FILE TO GITHUB!
It is prudent to add the line `.env` to your `.gitignore`.

We add two commands to the top of the cell:
```
# allows us to use the %dotenv "magic" command
%load_ext dotenv
# reads .env and makes EMAIL and PASSWORD environment variables
%dotenv
```
We can now use `os.environ.get` to access the environment variables without having them appear in the notebook.

In [None]:
!pip install python-dotenv

In [41]:
# See notes about environment variables
# %load_ext is a magic command that loads an extension; here it makes %dotenv available
%load_ext dotenv
# %dotenv finds the hidden .env file and loads its variables into the environment
%dotenv
import os
EMAIL = os.environ.get('EMAIL')
PASSWORD = os.environ.get('PASSWORD')

# Show that this is working. Don't do this for PASSWORD!
print(EMAIL)

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
xianjinseow92@gmail.com
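
`os.environ.get` quietly returns `None` when a variable is missing, so a typo in `.env` may only surface later during the login step. A small guard can fail earlier with a clearer message. This is a sketch: `require_env` is a hypothetical helper, and `DEMO_EMAIL` is a made-up variable set only so the example runs anywhere.

```python
import os

def require_env(name):
    """Return the environment variable `name`, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value

# DEMO_EMAIL is a made-up variable, set here only for demonstration
os.environ["DEMO_EMAIL"] = "someone@example.com"
demo_email = require_env("DEMO_EMAIL")
print(demo_email)
```

In the notebook, the same helper could replace the two `os.environ.get` calls for `EMAIL` and `PASSWORD`.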


### Fill out username and password, hit enter to log in

Now let's use this to log in.

In [42]:
username_form = driver.find_element_by_id("identifierId")
username_form.send_keys(EMAIL) 

In [43]:
username_form.send_keys(Keys.RETURN)

In [45]:
password_form = driver.find_element_by_name("password")  # another approach: locate by the name attribute instead of the id
password_form.send_keys(PASSWORD)  # enter password

In [46]:
password_form.send_keys(Keys.RETURN)

### Click compose button to start a new email draft

In [47]:
compose_button = driver.find_element_by_xpath('//div[text()="Compose"]')

# find_element_by_xpath is the most general way of locating an element,
# a bit like Ctrl-F for the page structure.
# '//div[text()="Compose"]' means: find a <div> anywhere in the document
# whose text is exactly "Compose".

compose_button.click()

### Write a nice, friendly (optional) message to your (least?) favorite person

In [48]:
to_field = driver.find_element_by_name("to")
to_field.send_keys("xianjinseow92@gmail.com") # enter recipient email

In [49]:
subject = driver.find_element_by_name("subjectbox")
subject.send_keys("This is an alert!")

In [38]:
message_body = driver.find_element_by_xpath("//div[@aria-label='Message Body']")
message_body.send_keys("Hello,")
message_body.send_keys([Keys.RETURN, Keys.RETURN])
message_body.send_keys("I am a computer and I just became self aware!")

ElementNotInteractableException: Message: element not interactable
  (Session info: chrome=77.0.3865.90)


### Press the send button

In [39]:
send_button = driver.find_element_by_xpath("//div[contains(@aria-label, 'Send')]")
send_button.click()

ElementNotInteractableException: Message: element not interactable
  (Session info: chrome=77.0.3865.90)


# Scraping Box Office Mojo with Selenium

In [50]:
matrix_url = "http://www.boxofficemojo.com/movies/?id=matrix.htm"
driver.get(matrix_url)


In [51]:
# contains(text(), "Domestic") matches a <font> element whose text includes "Domestic";
# the trailing /b then selects the <b> child inside it.
gross_selector = '//font[contains(text(), "Domestic")]/b'
print(driver.find_element_by_xpath(gross_selector).text)

$171,479,930


In [52]:
# Scraping genres: select the <b> child of every link whose href contains "/genres/chart/"
genre_selector = '//a[contains(@href, "/genres/chart/")]/b'
for genre_anchor in driver.find_elements_by_xpath(genre_selector):
    print(genre_anchor.text)

Action - Wire-Fu
Man vs. Machine
Post-Apocalypse
Virtual Reality


In [53]:
inf_adjust_2000_selector = '//select[@name="ticketyr"]/option[@value="2000"]'  # '@' refers to an attribute: match elements whose attribute has this value
driver.find_element_by_xpath(inf_adjust_2000_selector).click()
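
To see what the `@` does without a browser, here is a small offline sketch using the standard library's `xml.etree.ElementTree`. The snippet is made up to imitate the form (the real page's markup is an assumption), and ElementTree supports only a subset of XPath, so it covers attribute predicates but not `contains()`.

```python
import xml.etree.ElementTree as ET

# Made-up markup that imitates the inflation-adjustment form
snippet = """
<form>
  <select name="ticketyr">
    <option value="1999">1999</option>
    <option value="2000">2000</option>
  </select>
  <input name="Go" type="submit" />
</form>
""".strip()

root = ET.fromstring(snippet)

# [@name='ticketyr'] tests the name ATTRIBUTE of <select>
option = root.find(".//select[@name='ticketyr']/option[@value='2000']")
print(option.text)  # 2000

# Without the @, the predicate looks for a CHILD element called <name>
with_at = root.findall(".//input[@name='Go']")
without_at = root.findall(".//input[name='Go']")
print(len(with_at), len(without_at))  # 1 0
```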

In [55]:
# Using XPath
go_button_selector = '//input[@name="Go"]'  # Remember the @! Without it, XPath looks for a child element called "name".
driver.find_element_by_xpath(go_button_selector).click()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//input[name="Go"]"}
  (Session info: chrome=77.0.3865.90)


In [56]:
go_button = driver.find_element_by_name("Go")
go_button.click()

Now the page has changed; it's showing inflation adjusted numbers. We can grab the new, adjusted number.

In [57]:
gross_selector = '//font[contains(text(), "Domestic ")]/b'
print(driver.find_element_by_xpath(gross_selector).text)

$181,944,300


# Scraping IMDB with Selenium

In [58]:
url = "http://www.imdb.com"
driver.get(url)

In [62]:
query = driver.find_element_by_id("navbar-query")
query.send_keys("Julianne Moore")

In [63]:
query.send_keys(Keys.RETURN)

In [61]:
name_selector = '//a[contains(text(), "Julianne Moore")]'
driver.find_element_by_xpath(name_selector).click()
current_url = driver.current_url

In [64]:
driver.current_url

'https://www.imdb.com/find?ref_=nv_sr_fn&q=Julianne+Moore&s=all'

# Mixing Selenium and BeautifulSoup

In [65]:
# Use Selenium to navigate (clicking buttons, filling in forms),
# then pass the resulting page source to BeautifulSoup to extract the information you want.

from bs4 import BeautifulSoup

# We could call requests.get(current_url) and send page.text to bs4,
# but Selenium already stores the source on the driver object
# as driver.page_source:
#
# import requests
# page = requests.get(current_url)
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [66]:
soup.prettify()

'<html class="scriptsOn" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">\n <head>\n  <script async="" crossorigin="anonymous" src="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/ClientSideMetricsAUIJavascript@jserrorsForester.10f2559e93ec589d92509318a7e2acbac74c343a._V2_.js">\n  </script>\n  <script async="" crossorigin="anonymous" src="https://images-na.ssl-images-amazon.com/images/G/01/imdbads/custom/test/index/js/show_ads.js">\n  </script>\n  <script type="text/javascript">\n   var ue_t0=ue_t0||+new Date();\n  </script>\n  <script type="text/javascript">\n   window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\n\nvar ue_csm = window,\n    ue_hob = +new Date();\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(va

In [67]:
len(soup.find_all('a'))

178

In [68]:
driver.close()

**Conclusion**: If a page is static, we can just use Beautiful Soup. If there is some dynamic component or interaction, we can then bring Selenium into the mix. Selenium can be used on its own or in conjunction with Beautiful Soup.

*References:* 

Documentation on finding elements:
- https://selenium-python.readthedocs.io/locating-elements.html

Xpath tutorial:
- https://www.w3schools.com/xml/xpath_syntax.asp