# Web Data Scraping

[Spring 2023 ITSS Mini-Course](https://www.colorado.edu/cartss/programs/interdisciplinary-training-social-sciences-itss/mini-course-web-data-scraping) — ARSC 5040  
[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)


In [1]:
# Lets us talk to servers on the web
import requests

# Parsing HTML magic
from bs4 import BeautifulSoup

# For data manipulation
import pandas as pd

# Will be helpful for converting between timestamps
from datetime import datetime

# We want to sleep from time-to-time to avoid overwhelming another server
import time

# We'll need to parse some strings, so we'll write some regular expressions
import re

from urllib.parse import quote, unquote
import json

In [5]:
from selenium import webdriver
import time

driver = webdriver.Remote(
    command_executor='http://172.18.0.4:5555/wd/hub',
    options=webdriver.ChromeOptions()
)



In [24]:
driver.get('https://www.google.com')
print(driver.title)

Google


The block of code below will only work once you've installed Selenium.

In [25]:
print(driver.page_source)

WebDriverException: Message: GET /session/d9f11f0f60c7b26987f7b054c1f27284/source
Build info: version: '3.14.0', revision: 'aacccce0', time: '2018-08-02T20:13:22.693Z'
System info: host: '8d0d73b8c984', ip: '172.18.0.4', os.name: 'Linux', os.arch: 'amd64', os.version: '6.10.14-linuxkit', java.version: '1.8.0_181'
Driver info: driver.version: unknown
Stacktrace:
    at org.openqa.selenium.remote.http.AbstractHttpCommandCodec.decode (AbstractHttpCommandCodec.java:261)
    at org.openqa.selenium.remote.http.AbstractHttpCommandCodec.decode (AbstractHttpCommandCodec.java:117)
    at org.openqa.selenium.remote.server.ProtocolConverter.handle (ProtocolConverter.java:74)
    at org.openqa.selenium.remote.server.RemoteSession.execute (RemoteSession.java:127)
    at org.openqa.selenium.remote.server.WebDriverServlet.lambda$handle$3 (WebDriverServlet.java:250)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511)
    at java.util.concurrent.FutureTask.run (FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)
    at java.lang.Thread.run (Thread.java:748)

## Installing Selenium

This is a non-trivial process: you will need to (1) install the Python bindings for Selenium, (2) download a web driver to interface with a web browser, and (3) configure Selenium to recognize your web driver. Follow the installation instructions in the documentation [here](https://selenium-python.readthedocs.io/installation.html) (you won't need the Selenium server).

1. Install the Python bindings for Selenium. Go to your Anaconda terminal window, type in this command, and agree to whatever the package manager wants to install or update.

`conda install selenium`

2. Download the driver(s) for the web browser you want to use from the [links on the Selenium documentation](https://selenium-python.readthedocs.io/installation.html). If you use a Chrome browser, download the Chrome driver. Note that the Safari driver will not work on PCs and the Edge driver will not work on Macs. 

3. You will need to unzip the file and move the executable to the same directory where you are running this notebook. Make a note of the path to this directory.

### Using Selenium to control a web browser
The `driver` object we create is a connection from this Python environment out to the browser window.

If you're on a Mac, the latest versions of OS X *really* do not like letting you run applications you've just downloaded. You'll need to dive into your system settings to fix it: https://support.apple.com/en-us/HT202491

In [11]:

from selenium.webdriver.common.by import By

This single line of code will open a new browser window and will request the "xkcd" homepage.

Your computer's security protocols may vigorously protest because you are launching a program that is controlled by another process/program. You will need to dismiss these warnings in order to proceed. Whether and how to do that will vary considerably across PCs and Macs, the kinds of permissions your account has on this operating system, and other security measures employed by your computer.

In [16]:
driver.get('https://xkcd.com')


<selenium.webdriver.remote.webdriver.WebDriver (session="d9f11f0f60c7b26987f7b054c1f27284")>


Now that we can use Python to control a web browser, there are a host of powerful functions you can use to simulate keystrokes, locate elements on a page, manage waits, *etc*.: https://selenium-python.readthedocs.io/

In Classes 01 and 02, we used `BeautifulSoup` to turn HTML and XML into a data structure that we could search and access using Python-like syntax. With Selenium we use a standard called "XPath" to navigate through an HTML document: [this W3Schools tutorial](https://www.w3schools.com/xml/xpath_syntax.asp) covers the XPath syntax. The syntax is different, but the intuition is similar: we can find a parent node by its attribute (class, id, *etc*.) and then navigate down the tree to its children.

The XPath below has the following elements in sequence:
* `//*`: search the whole document for any element that matches the rest of the expression
* `[@id="middleContainer"]`: keep only the element that has a "middleContainer" id
* `/ul[2]`: select the second `<ul>` element underneath the `<div id="middleContainer">`
* `/li[3]`: select the third `<li>` element
* `/a`: select the `<a>` element

The combined XPath string `//*[@id="middleContainer"]/ul[2]/li[3]/a` is like a "file directory" that (hopefully!) points to the hyperlink button that takes us to a random xkcd comic. With the directions to this button, we can have the web browser "click" the "Random" button beneath the comic.
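To see how the pieces of such a path narrow down the tree without opening a browser, here is a minimal sketch using Python's built-in `xml.etree.ElementTree`, which supports a limited subset of XPath. The `PAGE` string is a made-up, heavily simplified stand-in for the xkcd page (the real markup is larger and may differ), and the leading `.` is an ElementTree requirement rather than part of the Selenium path.

```python
import xml.etree.ElementTree as ET

# A made-up, simplified stand-in for the xkcd page structure (an assumption)
PAGE = """
<html><body>
  <div id="middleContainer">
    <ul class="comicNav">
      <li><a href="/1/">First</a></li>
      <li><a href="/2/">Prev</a></li>
      <li><a href="/random-top/">Random</a></li>
    </ul>
    <div id="comic"><img src="comic.png" title="hidden message"/></div>
    <ul class="comicNav">
      <li><a href="/1/">First</a></li>
      <li><a href="/2/">Prev</a></li>
      <li><a href="/random-bottom/">Random</a></li>
    </ul>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# ElementTree paths are relative to an element, hence the leading "."
random_link = root.find('.//*[@id="middleContainer"]/ul[2]/li[3]/a')

print(random_link.get('href'))  # the link inside the second <ul>
print(random_link.text)
```

Changing `ul[2]` to `ul[1]` in this toy page would select the link in the first navigation list instead, which is why the index matters.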

In [15]:
# Let's find the 'random' button
element = driver.find_element(By.XPATH,'//*[@id="middleContainer"]/ul[2]/li[3]/a')

# Once we've found it, now click it
element.click()


<selenium.webdriver.remote.webelement.WebElement (session="d9f11f0f60c7b26987f7b054c1f27284", element="0.8119973885642442-1")>

We can also get the attributes of different parts of the web page. xkcd is famous for its "hidden messages" inside the image alt-text.

In [13]:
alttext_element = driver.find_element(By.XPATH,'//*[@id="comic"]/img')
alttext_element.get_attribute("title")

'They had BETTER make this a sample return mission.'

We could write a simple loop to click on the random button five times and print the alt-text from each of those pages.

In [19]:
for c in range(5):
    time.sleep(2)
    random_element = driver.find_element(By.XPATH,'//*[@id="middleContainer"]/ul[2]/li[3]/a')
    random_element.click()
    
    alttext_element = driver.find_element(By.XPATH,'//*[@id="comic"]/img')
    print('\n',alttext_element.get_attribute("title"))


 This projection distorts both area and direction, but preserves Melbourne.

 Kind of rude of them to simultaneously issue an EVACUATION - IMMEDIATE alert, a SHELTER IN PLACE alert, and a 911 TELEPHONE OUTAGE alert.

 1. Nf3 ... ↘↘↘  2. Nc3 ... ↘↘↘  0-1

 The place I'd least like to live is the farm in the background of those diagrams showing how tornadoes form.

 Our investigation into whining-based remedies became the first study to be halted by the IRB on the grounds that the treatment group was 'too annoying.'


When you're done playing with your programmable web browser, make sure to close it.

In [None]:
driver.quit()

Note that with the connection to the web browser closed, any of the functions like `find_element()`, `click()`, *etc*. will not work.

### Exercises

Start your driver again and get the xkcd homepage.

1. Change the XPath to click on the "Prev" button above the comic.
2. Change the XPath to search for the "comicNav" class instead of the "middleContainer" id.
3. Change the XPath to click on the "About" button in the upper-left.

## Warning!

The code we will write and execute below will violate the Terms of Service for [Twitter](https://twitter.com/en/tos) ("You may not...  access or search or attempt to access or search the Services by any means (automated or otherwise) other than through our currently available, published interfaces that are provided by Twitter") and [YouTube](https://www.youtube.com/static?template=terms) ("you are not allowed to... access the Service using any automated means (such as robots, botnets or scrapers)...") for retrieving information from the platform. In effect, we will transmit code in excess of our authorized access and potentially cause damage, in order to obtain information from a protected computer. 

We will do this in order to obtain public statements made by government officials acting in their official capacity because this data is otherwise unavailable for retrieval from YouTube. There is an interesting body of emerging legal precedent treating elected officials' use of Twitter as a public forum: [*Knight First Amendment Institute v. Trump*](https://en.wikipedia.org/wiki/Knight_First_Amendment_Institute_v._Trump) established that [the President may not block other Twitter users](https://www.courtlistener.com/docket/6087955/72/knight-first-amendment-institute-at-columbia-university-v-trump/):

> * "We hold that portions of the @realDonaldTrump account -- the “interactive space” where Twitter users may directly engage with the content of the President’s tweets -- are properly analyzed under the “public forum” doctrines set forth by the Supreme Court, that such space is a designated public forum..."
> * "we nonetheless conclude that the extent to which the President and Scavino can, and do, exercise control over aspects of the @realDonaldTrump account are sufficient to establish the government-control element as to the content of the tweets sent by the @realDonaldTrump account, the timeline compiling those tweets, and the interactive space associated with each of those tweets."
> * "Because a Twitter user lacks control over the comment thread beyond the control exercised over first-order replies through blocking, the comment threads -- as distinguished from the content of tweets sent by @realDonaldTrump, the @realDonaldTrump timeline, and the interactive space associated with each tweet -- do not meet the threshold criterion for being a forum."
> * "the account’s timeline, which “displays all tweets generated by the [account]”... all of which is government speech."

On this basis, I believe the White House's videos posted to Twitter or YouTube are government speech and our automated retrieval of this content and associated meta-data in violation of YouTube's Terms of Service is justifiable for understanding this speech as a public forum.

I would advise you against using these tools and approaches without a similarly clear public interest rationale and jurisprudence linking behavior to public forum doctrines.

## Screen-scraping Twitter with Selenium

I am adapting a [tutorial by Shawn Wang](https://dev.to/swyx/scraping-my-twitter-social-graph-with-python-and-selenium--hn8) on scraping a Twitter graph with Python and Selenium.

In [None]:
# Path to the Chrome driver for my PC -- yours is likely very different
# driver = webdriver.Chrome(executable_path='E:/Dropbox/Courses/2019 Spring - ITSS Web Data Scraping/chromedriver.exe')

# Path to the web driver for my Mac -- yours is likely very different
# mac_path must already hold the path to your own driver executable
# Note: newer Selenium releases may no longer accept executable_path
driver = webdriver.Firefox(executable_path=mac_path)

driver.get('https://www.twitter.com')

Manually log in to your Twitter account through the driver page.

Then go to the "followings" (or followees, also called "friends" in the Twitter API) of an account. 

In [None]:
driver.get('https://twitter.com/JoeBiden/following')

At the time of this Notebook's writing, the "JoeBiden" account followed 47 other accounts. Depending on the resolution of your display, size of the window, *etc*. there may only be 10–20 accounts visible. We can scroll to see the rest of these accounts programmatically.

Run this cell a few times to keep scrolling to the bottom.

In [None]:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
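Instead of re-running the cell by hand, one option is to keep scrolling until the page height stops growing. The sketch below is an assumption about how that could be done, not the approach used in this class: `scroll_to_bottom` is a hypothetical helper that only relies on the driver's `execute_script` method, and `FakeDriver` is a made-up stand-in so the logic can be tried without a browser.

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script('return document.body.scrollHeight')
    for rounds in range(1, max_rounds + 1):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            return rounds  # nothing new loaded, so we reached the bottom
        last_height = new_height
    return max_rounds  # safety valve for pages that never stop growing

class FakeDriver:
    """Made-up stand-in for a Selenium driver: the page grows three times, then stops."""
    def __init__(self):
        self.height = 1000
        self.loads_left = 3

    def execute_script(self, script):
        if script.startswith('return'):
            return self.height
        if self.loads_left > 0:  # a scroll triggers loading more content
            self.loads_left -= 1
            self.height += 1000
        return None

scroll_count = scroll_to_bottom(FakeDriver(), pause=0)
print(scroll_count)
```

With a real `driver`, you would call `scroll_to_bottom(driver)` and keep a non-zero `pause` so the page has time to respond between scrolls.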

Pass the HTML of the web page in the browser back to Python and turn it into soup.

In [None]:
raw = driver.page_source.encode('utf-8')

soup = BeautifulSoup(raw)

Unfortunately, since I last ran this class in 2019 Twitter changed how they design and populate their website. If we inspect the elements for the following, we can see how they obfuscate the elements to make them hard to scrape. 

*Back in the day*, they had nice div tags with "data-item-type"s called "user". No longer!

In [None]:
soup.body.find_all('div', attrs={'data-item-type':'user'})

## Screen-scraping YouTube with Selenium

Let's get data from YouTube instead, which appears to be better-behaved from a web scraper's perspective.

In [None]:
driver.get('https://www.youtube.com/c/whitehouse/videos')

Like Twitter, YouTube loads additional videos on scroll, so let's keep scrolling toward the bottom of the page to load as many videos as possible.

In [None]:
# https://stackoverflow.com/a/51345544
from selenium.webdriver.common.keys import Keys

In [None]:
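# Grab the page's root <html> element so we have something to send keystrokes to
html = driver.find_element(By.TAG_NAME, 'html')

for i in range(20):
    time.sleep(.1)
    html.send_keys(Keys.END)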

Now that we've loaded as many videos as the YouTube interface allows by scrolling, pull the contents of the web page. We can revert back to our strategies from Week 2 to identify, navigate, and pull out the relevant fields we want.

In [None]:
raw = driver.page_source.encode('utf-8')

soup = BeautifulSoup(raw)

By inspecting the source, the video cells appear to live within elements called `<ytd-grid-video-renderer>`.

In [None]:
video_divs = soup.find_all('ytd-grid-video-renderer')

Inspect one of them.

In [None]:
video_divs[-1]

Further drill-down and inspection of this element from the browser reveals that some promising data lives within an element defined as `<a id="video-title">`.

In [None]:
video_divs[-1].find_all('a',{'id':'video-title'})

Within the `aria-label` string there's the title, account, a relative date, and a detailed number of views. Within the `href` attribute is the video ID. These all seem like promising bits of data to try to grab.

In [None]:
video_divs[-1].find_all('a',{'id':'video-title'})[0]['aria-label']

We can use regular expressions to try to match the numeric fields like the number of views.

In [None]:
_s = video_divs[-1].find_all('a',{'id':'video-title'})[0]['aria-label']
re.findall(r'([\d,]+) views',_s)
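The same pattern can be tried on a plain string to see what it does and does not match. This is a small sketch: `extract_views` is a hypothetical helper and both labels are made up, since the exact wording of YouTube's `aria-label` text is an assumption here.

```python
import re

def extract_views(label):
    """Return the view count from an aria-label string as text, or None if absent."""
    matches = re.findall(r'([\d,]+) views', label)
    return matches[0] if matches else None

# Made-up labels for illustration only
example_label = 'A Made-Up Title by The White House 2 years ago 3 minutes, 5 seconds 1,234,567 views'
no_views_label = 'A Made-Up Title by The White House 1 hour ago No views'

print(extract_views(example_label))
print(extract_views(no_views_label))
```

Returning `None` when nothing matches avoids the `IndexError` that indexing an empty `re.findall()` result with `[0]` would raise.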

Extract other relevant fields.

In [None]:
# The video link
video_divs[-1].find_all('a',{'id':'video-title'})[0]['href']

In [None]:
# The video title
video_divs[-1].find_all('a',{'id':'video-title'})[0]['title']

There's also some helpful data about the length of the video hiding in a `<ytd-thumbnail-overlay-time-status-renderer>`.

In [None]:
video_divs[-1].find_all('ytd-thumbnail-overlay-time-status-renderer')

We can pull out that video length element with:

In [None]:
video_divs[-1].find_all('ytd-thumbnail-overlay-time-status-renderer')[0].text.strip()

There are also labels about whether the videos are closed captioned within a tag called `<span class="style-scope ytd-badge-supported-renderer">`.

In [None]:
video_divs[-1].find_all('span',{'class':'style-scope ytd-badge-supported-renderer'})

The first video (at the bottom) has a CC tag, the second one does not.

In [None]:
video_divs[-2].find_all('span',{'class':'style-scope ytd-badge-supported-renderer'})

Let's put these pieces together into a loop that grabs all the data from this list of videos.

In [None]:
videos_l = []

for d in video_divs:
    # Get the number of views
    _s = d.find_all('a',{'id':'video-title'})[0]['aria-label']
    _views = re.findall(r'([\d,]+) views',_s)[0]
    
    # Get the link
    _link = d.find_all('a',{'id':'video-title'})[0]['href']
    
    # Get the title
    _title = d.find_all('a',{'id':'video-title'})[0]['title']
    
    # Get the length
    _length = d.find_all('ytd-thumbnail-overlay-time-status-renderer')[0].text.strip()
    
    # Get the captioning
    if len(d.find_all('span',{'class':'style-scope ytd-badge-supported-renderer'})) > 1:
        _cc = True
    else:
        _cc = False
        
    # Package it all up into a dictionary
    _d = {'Views':_views,
          'Link':_link,
          'Title':_title,
          'Length':_length,
          'Captioned':_cc
         }
    
    # Add our dictionary to the container
    videos_l.append(_d)

Make the `videos_l` container into a DataFrame.

In [None]:
pd.DataFrame(videos_l)

Depending on your priorities, you could stop here and save this to a CSV since there is already rich data. 

Some limitations to think of include:
* How would YouTube serve up an account with hundreds or thousands of videos? Is there a limit to the videos you can get from scrolling?
* The "Views" column is stored as strings, not as numeric values: you'll want to convert them somehow. Those commas could also complicate things when storing as a *comma separated* file, so you'll want to strip them out too. The fixes for both of these are related.
* We don't have any of the valuable data about the actual date the video was posted, the number of up/down votes, or even the transcript from the captioning.
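For the second limitation, one possible fix (a sketch with made-up rows, not the class's actual data) is to strip the commas and convert to `int` before writing. The hypothetical `to_int` helper does the conversion, and Python's built-in `csv` module takes care of quoting any text field that still contains a comma, such as a title.

```python
import csv
import io

def to_int(text):
    """Convert a string such as '1,234,567' to the integer 1234567."""
    return int(text.replace(',', ''))

# Made-up rows shaped like the dictionaries in videos_l
rows = [{'Title': 'Made-up video, part 1', 'Views': '1,234,567'},
        {'Title': 'Made-up video, part 2', 'Views': '89'}]

# Write to an in-memory buffer instead of a file so nothing touches the disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Title', 'Views'])
writer.writeheader()
for row in rows:
    writer.writerow({'Title': row['Title'], 'Views': to_int(row['Views'])})

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back shows the comma in the title survived the round trip
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(read_back[0]['Title'], read_back[0]['Views'])
```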

### Retrieving data from each video's page

There's valuable data on each video's page about the specific date, the up and down votes, and even the transcript of the video that we can also retrieve.

Let's start with the inauguration video.

In [20]:
driver.get('https://www.youtube.com/watch?v=q5iCPKDp4V4')

Get the raw HTML and soup-ify it.

In [23]:
yt_inauguration_raw = driver.page_source.encode('utf-8')

yt_inauguration_soup = BeautifulSoup(yt_inauguration_raw)

WebDriverException: Message: GET /session/d9f11f0f60c7b26987f7b054c1f27284/source
Build info: version: '3.14.0', revision: 'aacccce0', time: '2018-08-02T20:13:22.693Z'
System info: host: '8d0d73b8c984', ip: '172.18.0.4', os.name: 'Linux', os.arch: 'amd64', os.version: '6.10.14-linuxkit', java.version: '1.8.0_181'
Driver info: driver.version: unknown
Stacktrace:
    at org.openqa.selenium.remote.http.AbstractHttpCommandCodec.decode (AbstractHttpCommandCodec.java:261)
    at org.openqa.selenium.remote.http.AbstractHttpCommandCodec.decode (AbstractHttpCommandCodec.java:117)
    at org.openqa.selenium.remote.server.ProtocolConverter.handle (ProtocolConverter.java:74)
    at org.openqa.selenium.remote.server.RemoteSession.execute (RemoteSession.java:127)
    at org.openqa.selenium.remote.server.WebDriverServlet.lambda$handle$3 (WebDriverServlet.java:250)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511)
    at java.util.concurrent.FutureTask.run (FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)
    at java.lang.Thread.run (Thread.java:748)

The date the video was uploaded appears within a `<div id="date">` tag.

In [None]:
yt_inauguration_soup.find_all('div',{'id':'date'})

Digging in, we can pull out the date.

In [None]:
yt_inauguration_soup.find_all('div',{'id':'date'})[0].text[1:]

The up and downvotes appear as coarse aggregations ("16K","95K") but the true counts at the time are hidden in some "aria-label"s. 

First, find the `<button id="button">` within the `<div id="top-level-buttons">`.

In [None]:
yt_inauguration_soup.find_all('div',{'id':'top-level-buttons'})

There are apparently (and hopefully only!) two of these types of divs. The one we care about is the second.

In [None]:
tlb2 = yt_inauguration_soup.find_all('div',{'id':'top-level-buttons'})[1]

tlb2.find_all('button',{'id':'button'})

The up and down votes are the first two buttons and include an `aria-label` with a count of the number of up and down votes.

In [None]:
upvotes = tlb2.find_all('button',{'id':'button'})[0]
downvotes = tlb2.find_all('button',{'id':'button'})[1]

upvotes

Access the `aria-label`.

In [None]:
upvotes['aria-label']

Use a regex to extract the number.

In [None]:
re.findall(r'([\d,]+) other people',upvotes['aria-label'])

In [None]:
re.findall(r'([\d,]+) other people',downvotes['aria-label'])

For closed captioned videos, there's also a transcript of the video with timestamps and text within the `<ytd-transcript-renderer>` parent or `<div class="cue-group">` tags.

In [None]:
yt_inauguration_soup.find_all('div',{'class':'cue-group'})[-1]

From the `<div class="cue-group">` elements, we can extract the time code and the text.

In [None]:
last_cg = yt_inauguration_soup.find_all('div',{'class':'cue-group'})[-1]
last_cg.find_all('div',{'class':'cue-group-start-offset'})[0].text.strip()

In [None]:
last_cg.find_all('div',{'class':'cue'})[0].text.strip()

Now do the same for the whole transcript.

In [None]:
cg_l = []

for cg in yt_inauguration_soup.find_all('div',{'class':'cue-group'}):
    _time_code = cg.find_all('div',{'class':'cue-group-start-offset'})[0].text.strip()
    _text = cg.find_all('div',{'class':'cue'})[0].text.strip()
    
    _d = {'Time':_time_code,
          'Text':_text.replace('\n',' ')}
    
    cg_l.append(_d)
    
pd.DataFrame(cg_l).set_index('Time')['Text']
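The time codes come back as strings, which makes them awkward to sort or subtract. Here is a minimal sketch for converting them to seconds, assuming they look like `M:SS` or `H:MM:SS`. That format is an assumption, and `timecode_to_seconds` is a hypothetical helper.

```python
def timecode_to_seconds(timecode):
    """Convert a time code such as 'M:SS' or 'H:MM:SS' to a number of seconds."""
    seconds = 0
    for part in timecode.strip().split(':'):
        # Each colon-separated part is worth 60 times the part after it
        seconds = seconds * 60 + int(part)
    return seconds

print(timecode_to_seconds('00:15'))
print(timecode_to_seconds('21:03'))
print(timecode_to_seconds('1:02:03'))
```

The same helper would work on the video "Length" strings collected earlier, if they follow the same format.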

### Scraping multiple videos
That was all to extract the data from a single video. Let's now scrape the content from each of the White House YouTube videos. With something like 160 videos times 2 seconds per video, this scrape should take just over 5 minutes. So let's let this run and take a break. Something will likely break, so let's check back in afterwards! 

In [None]:
for v in videos_l:
    _link = v['Link']
    
    # Have Selenium get the page, get the source, and convert to soup
    driver.get('https://www.youtube.com'+_link)
    
    # Give the page a second to load
    time.sleep(1)
    
    # Retrieve the content and soup-ify
    _raw = driver.page_source.encode('utf-8')
    _soup = BeautifulSoup(_raw)
    
    # Get the date
    try:
        _date = _soup.find_all('div',{'id':'date'})[0].text[1:]
    except IndexError:
        # If we get an index error above, the page didn't finish loading
        # Wait another 2 seconds, then re-read the page so the soup is fresh
        time.sleep(2)
        _soup = BeautifulSoup(driver.page_source.encode('utf-8'))
        _date = _soup.find_all('div',{'id':'date'})[0].text[1:]
    
    # Get the up and downvotes
    try:
        _tlb2 = _soup.find_all('div',{'id':'top-level-buttons'})[-1]
        _upvotes_soup = _tlb2.find_all('button',{'id':'button'})[0]
        _downvotes_soup = _tlb2.find_all('button',{'id':'button'})[1]
        _upvotes = re.findall(r'([\d,]+) other people',_upvotes_soup['aria-label'])[0]
        _downvotes = re.findall(r'([\d,]+) other people',_downvotes_soup['aria-label'])[0]
    except (IndexError, KeyError):
        # If we get an error above, the page didn't finish loading
        # Wait another 2 seconds, then re-read the page so the soup is fresh
        time.sleep(2)
        _soup = BeautifulSoup(driver.page_source.encode('utf-8'))
        _tlb2 = _soup.find_all('div',{'id':'top-level-buttons'})[-1]
        _upvotes_soup = _tlb2.find_all('button',{'id':'button'})[0]
        _downvotes_soup = _tlb2.find_all('button',{'id':'button'})[1]
        _upvotes = re.findall(r'([\d,]+) other people',_upvotes_soup['aria-label'])[0]
        _downvotes = re.findall(r'([\d,]+) other people',_downvotes_soup['aria-label'])[0]
    
    # Update the dictionary
    v['Date'] = _date
    v['Upvotes'] = _upvotes
    v['Downvotes'] = _downvotes
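A fixed `time.sleep()` is either too short for a slow page or wasted time on a fast one. A common alternative is to poll: try, wait briefly, and try again until a deadline. The sketch below is an assumption about how that could look, not what this notebook does. `wait_for` is a hypothetical helper, and `fake_find_date` (with its made-up date) stands in for "re-read `driver.page_source` and look for the date". Selenium also ships its own `WebDriverWait` for waiting on elements in a live browser.

```python
import time

def wait_for(fetch, timeout=5.0, interval=0.5):
    """Call fetch() until it returns something other than None or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('gave up waiting for the page to load')
        time.sleep(interval)

# Stand-in for a page that finishes loading late: fails twice, then succeeds
attempts = {'count': 0}

def fake_find_date():
    attempts['count'] += 1
    return 'Jan 1, 2000' if attempts['count'] >= 3 else None

found_date = wait_for(fake_find_date, timeout=2.0, interval=0.01)
print(found_date, 'after', attempts['count'], 'attempts')
```

In the scraping loop, `fetch` would be a small function that rebuilds the soup from the current page source and returns the date text, or `None` if the element is not there yet.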

Convert the revised `videos_l` to a DataFrame.

In [None]:
whitehouse_yt_df = pd.DataFrame(videos_l)

# Convert the Views, Upvotes, and Downvotes columns to ints
whitehouse_yt_df['Views'] = whitehouse_yt_df['Views'].fillna('0').str.replace(',','').astype(int)
whitehouse_yt_df['Upvotes'] = whitehouse_yt_df['Upvotes'].fillna('0').str.replace(',','').astype(int)
whitehouse_yt_df['Downvotes'] = whitehouse_yt_df['Downvotes'].fillna('0').str.replace(',','').astype(int)

# Convert the Date to a datetime
whitehouse_yt_df['Date'] = pd.to_datetime(whitehouse_yt_df['Date'])

whitehouse_yt_df.head()

### Spoofing headers

When we use `requests` or Selenium to get data from other web servers, each of the get requests carries some meta-data about ourselves, called [headers](https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html). These headers tell the server what kind of web browser we are, what kinds of data we can receive, *etc*. so that the server can reply with properly-formatted information. 

But it is also possible for the server to understand a request and refuse to fulfill it, known as a [HTTP 403 error](https://en.wikipedia.org/wiki/HTTP_403). A server's refusal to fulfill a client's request can often be traced back to the identity a client presents through its headers or a client lacking authorization to access the data (*i.e.*, you need to authenticate with the website first). In the case of `requests`, its `get` request includes default header information that identifies it as a Python script rather than a human-driven web browser.

Let's make a request for an article from the NYTimes.

In [3]:
honest_response = requests.get('https://www.nytimes.com/2019/02/03/us/politics/trump-interview-mueller.html')

We can see the headers we sent with this request.

In [4]:
honest_response.request.headers

{'User-Agent': 'python-requests/2.32.3', 'Accept-Encoding': 'gzip, deflate, br, zstd', 'Accept': '*/*', 'Connection': 'keep-alive'}

Specifically, the 'User-Agent' string identifies this request as originating from the "python-requests/2.32.3" program, rather than a typical web browser. Some web servers will be configured to inspect the headers of incoming requests and refuse requests unless they appear to come from actual web browsers.

We can often circumvent these filters by sending alternative headers that claim to be from a web browser as a part of our `requests.get()`.

In [5]:
# Make a dictionary with spoofed headers for the User-Agent
spoofed_headers = {'User-Agent':"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"}

# Make the request with the spoofed headers
nytimes_url = 'https://www.nytimes.com/2019/02/03/us/politics/trump-interview-mueller.html'
spoofed_response = requests.get(nytimes_url,headers=spoofed_headers)

Sure enough, the get request we sent to the NYTimes web server now includes the spoofed "User-Agent" string we wrote that claims our request is from a web browser. The server should now return the data we requested, even though we are not who we claimed to be.

In [6]:
spoofed_response.request.headers

{'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36', 'Accept-Encoding': 'gzip, deflate, br, zstd', 'Accept': '*/*', 'Connection': 'keep-alive'}

I had trouble finding a website that refused "python-requests" connections automatically (*e.g.*, Amazon, NYTimes, etc.), but you will likely find some along the way. 
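To make the server's side of this concrete, here is a toy model of the kind of check a server *might* run on incoming headers. `toy_server_status` is a hypothetical function and the "starts with Mozilla/" rule is an assumption for illustration; real servers use their own, usually more elaborate, rules. The two header dictionaries reuse the User-Agent strings shown above.

```python
def toy_server_status(headers):
    """Toy filter: return 403 unless the User-Agent looks like a web browser."""
    agent = headers.get('User-Agent', '')
    return 200 if agent.startswith('Mozilla/') else 403

# The default User-Agent that requests sent in the output above
honest_headers = {'User-Agent': 'python-requests/2.32.3'}

# The spoofed User-Agent from the cell above
browser_like_headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'}

print(toy_server_status(honest_headers))        # refused
print(toy_server_status(browser_like_headers))  # served
```

Because the check only looks at a string the client chooses, it is easy to circumvent, which is why the ethical question matters more than the technical one.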

Spoofing headers to conceal the identity of your client to a web server is another example of how technological capabilities can overtake ethical responsibilities. The owners of a web server may have good reasons for refusing to serve content to non-web browsers (copyright, privacy, business model, *etc*.). Misrepresenting your identity to extract this data should only be done if the risks to others are small, the benefits are in the public interest, there are no other alternatives for obtaining the data, *etc*. 

There can be *very* real consequences for spoofing headers. Because it is such a common and relatively trivial method for circumventing server security settings, making repeated spoofed requests could result in your IP address or an IP address range (worst case, the entire university) being blocked from making requests to the server.

### Parallelizing requests

A third web scraping practice that warrants ethical scrutiny is parallelization. In the example of getting historical `@WhiteHouse` tweets, we launched a single browser window and "scrolled" until we reached the end; a process that took on the order of a minute.

However, we *could* launch multiple scripts that each create a browser window and collect different segments of the data in parallel, then combine the results at the end. In an API context, we *could* create multiple applications and design our requests so that each works simultaneously to get all the data. 

Each request imposes some cost on the server to receive, process, and return the requested data: making these requests in parallel increases the convenience and efficiency for the data scraper, but also dramatically increases the strain on the server to fulfill other clients' requests. In fact, highly-parallelized and synchronized requests can look like [denial-of-service attacks](https://en.wikipedia.org/wiki/Denial-of-service_attack) and may get your requests far more scrutiny and blowback than patiently waiting for your data to arrive in series. The ethical justifications for employing highly-parallelized scraping approaches are thin: documenting a rapidly-unfolding event before the data disappears, for example.
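For contrast, here is what "patiently waiting for your data to arrive in series" can look like in code. This is a sketch under stated assumptions: `fetch_in_series` is a hypothetical helper, and `fake_fetch` replaces a real `requests.get` call so the example makes no network requests.

```python
import time

def fetch_in_series(urls, fetch, delay=1.0):
    """Fetch each URL one at a time, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause between requests, not before the first
        results.append(fetch(url))
    return results

def fake_fetch(url):
    """Stand-in for a real request: just report the length of the URL."""
    return len(url)

start = time.monotonic()
sizes = fetch_in_series(['a', 'bb', 'ccc'], fake_fetch, delay=0.05)
elapsed = time.monotonic() - start

print(sizes)
print(round(elapsed, 2), 'seconds for', len(sizes), 'requests')
```

The total time grows with the number of requests times the delay, which is the price of keeping the load on the server low and predictable.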