# Collecting Text Data from Websites

Very often there is data on the internet that we would love to use as digital humanists. But, perhaps because it is humanities data, the people publishing it might not have made it available in a format that you can easily use. In a perfect world, everyone would provide clearly described dumps of their data in machine-readable formats. In reality, people often just put things on a web page and call it a day. Web scraping refers to using a computer program to pull down the content of a web page (or, often, many web pages). Scraping is very powerful: once you get the hang of it, your potential objects of study expand enormously, because you are no longer limited to the data that others package for you. You can start building your own corpora from real-world information.

One of the powerful realizations that comes with doing this kind of work is that text data is all around us. Anything could be a corpus with enough time and attention. The internet, in particular, offers a wealth of opportunities for acquiring text data if we know how to get at it. This data comes in two broad forms:

* Data that exists openly on the internet but that has not been prepared for easy use.
* Data that exists openly on the internet and that has been provided in a usable form.

Let's start with the former use case. Most often, data on the internet is not presented in a form that is easily accessible. This might be because the author of a particular webpage was not expecting the site to be read and interpreted by anything other than humans, or it might be that the particular form in which the data is presented is not the best form for your purposes. In either case, the ability to extract textual information from a website can be quite powerful.

Texts drawn from http://www.gutenberg.org/files/12242/12242-h/12242-h.htm

In the following examples we will be using the Beautiful Soup package. First, some initial setup:

In [4]:
# import necessary packages for webscraping.

from bs4 import BeautifulSoup
from urllib import request

# store the url we want to work with in the variable 'url'

url = 'https://github.com/walshbr/humanists-nlp-cookbook/blob/master/scraping-corpus/dickinson/xi.txt'

Check out the website at that URL. You will find a text by Emily Dickinson surrounded by the web interface for GitHub. If we wanted to use the text of that poem in a particular program, we could just copy it manually to a text editor. But what if we had ten different poems on different pages? A hundred? At scale this quickly becomes something that we might want a computer to do for us. That's where scraping comes in. Let's use Python to pull in the HTML of that page. 

In [8]:
html = request.urlopen(url).read()
print(html[0:1000])
soup = BeautifulSoup(html, 'html5lib')

b'\n\n\n\n\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://assets-cdn.github.com">\n  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/frameworks-98cac35b43fab8341490a2623fdaa7b696bbaea87bccf8f485fd5cdb4996cd9b52bdb24709fb3bab0a0dcff4a29187d65028ee693d609ce5c0c3283c77a247a9.css" media="all" rel="stylesheet" />\n  <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/github-2909a5c7c333104e2fc3814a4565e315d2348d0d4a1f16c16ff32

If you've ever worked with HTML before, this should look familiar. If you haven't, you might check out the Mozilla [introduction to HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML) before reading further. The output looks a bit wonky because Python is printing it as one long string of bytes, with line breaks shown as `\n` rather than as the helpful indentation and whitespace you would see in an editor. For the purposes of this lesson, you mostly just need to know that any given web page has two distinct forms: the version you see, and the version that the browser analyzes to present the version you see. So you might see this:

This is a paragraph that you see in the browser.

But the browser, behind the scenes, sees this:

`<p>This is a paragraph that you see in the browser.</p>`

The `<p>` and `</p>` tags get stripped away by your browser but, in the process, they tell it to present the enclosed text as a paragraph. This system of structural markup is HTML. It is supplemented by a companion system called CSS, which gives the page the design and aesthetics that make it something other than plain text and images on a page.

HTML and CSS represent the structure and design of any given webpage. Beautiful Soup is a Python library that offers us an easy way to interact with the HTML, pull particular pieces out, and extract what we need. The line `soup = BeautifulSoup(html, 'html5lib')` in the code above says, "take the HTML that you've pulled down and get ready to do Beautiful Soup things to it." Think of it this way: you have a certain number of things that you can do in your car:
    
* Drive
* Fill it with gas
* Change the tires
    
But you can only really do those things once you actually get in your car. You couldn't change your tires if you were riding a horse. Horses don't have wheels. In programming speak, we're saying "turn that HTML into a Beautiful Soup **object**." Saying something is an object is a way of saying "I expect this data to have certain characteristics and be able to do certain things." In this case, BeautifulSoup gives us a series of ways to manipulate the HTML using HTML and CSS structural elements. We can do things like:

* Get all the links
* Get all the text on a page

The sky is the limit, and you can use HTML and CSS to drill down into the page.
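
As a minimal sketch of those operations, the snippet below parses a small, made-up HTML string rather than a live page. The markup, the link targets, and the variable names are all invented for illustration, and it uses Python's built-in `html.parser` so that nothing beyond Beautiful Soup itself needs to be installed.

```python
from bs4 import BeautifulSoup

# a tiny, invented page used only for illustration
sample_html = """
<html>
  <body>
    <p class="intro">Two poems by <a href="/poets/dickinson">Emily Dickinson</a>.</p>
    <p><a href="/poems/xi.txt">The Outlet</a></p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# get all the links: find every <a> tag and read its href attribute
hrefs = [a['href'] for a in sample_soup.find_all('a')]
print(hrefs)

# get all the text on the page, with the tags thrown away
page_text = sample_soup.get_text()
print(page_text)

# use a CSS selector to drill down to one part of the page
intro_text = sample_soup.select('p.intro')[0].get_text()
print(intro_text)
```

The same three calls (`find_all()`, `get_text()`, and `select()`) are the ones used on the real GitHub pages in the rest of this lesson.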

In [11]:
# get the html and the soup

html = request.urlopen(url).read()
soup = BeautifulSoup(html, 'html5lib')

# get the HTML using the particular class structure
poem_html = soup.select(".blob-wrapper.data.type-text table")[0]
print(poem_html.get_text())



      
        
        XI.
      
      
        
        

      
      
        
        THE OUTLET.
      
      
        
        

      
      
        
        My river runs to thee:
      
      
        
        Blue sea, wilt welcome me?
      
      
        
        

      
      
        
        My river waits reply.
      
      
        
        Oh sea, look graciously!
      
      
        
        

      
      
        
        I'll fetch thee brooks
      
      
        
        From spotted nooks, —
      
      
        
        

      
      
        
        Say, sea,
      
      
        
        Take me!
      



In the above example, `poem_html` holds the HTML we care about, and Beautiful Soup gives us access to the `get_text()` method. This powerful method allows us to throw away all the HTML tags and focus only on the text itself.

That's the text of the poem, but we'll need to do a bit more to make this workable data. All the whitespace that makes the text appear neatly on the page looks a little bizarre when pulled into Python. And, apart from aesthetics, it will cause problems with processing. We can strip that whitespace out, because those line breaks are actually represented as characters in the data itself, as '\n'.

In [12]:
poem_text = poem_html.get_text()
clean_poem = poem_text.replace('\n', ' ')
print(clean_poem)

                         XI.                                                                 THE OUTLET.                                                                 My river runs to thee:                                Blue sea, wilt welcome me?                                                                 My river waits reply.                                Oh sea, look graciously!                                                                 I'll fetch thee brooks                                From spotted nooks, —                                                                 Say, sea,                                Take me!        


That gets us closer. The .replace() method takes two arguments: the first is the thing we're looking to replace and the second is the thing to replace it with. The approach works here, but note how we _still_ have a lot of whitespace! That's because every \n has become a space, and the indentation spaces that were already there remain, so we've simply traded one problem for another. We will discuss these sorts of issues further in the section on data cleaning.
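
One possible way to deal with the leftover whitespace, sketched here on a shortened stand-in for the scraped text (the `messy` string is invented for illustration), is to split the string on any run of whitespace and join the pieces back together with single spaces.

```python
# a shortened stand-in for the text returned by get_text()
messy = "\n\n      XI.\n      \n\n        THE OUTLET.\n\n        My river runs to thee:\n      "

# split() with no arguments breaks on any run of spaces, tabs, or newlines
# and discards the empty pieces, so joining with ' ' leaves single spaces only
tidy = ' '.join(messy.split())
print(tidy)
```

Note that this flattens the poem onto a single line. If the line breaks matter for your analysis, you would want to clean each line separately instead.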

To find out how to get to the data you want, the first step is to look at the HTML behind the website you're interested in. Modern web browsers have tools to help with this. In Chrome, for example, at the time of writing you can right click on a part of a page and select ['inspect element'](https://developers.google.com/web/tools/chrome-devtools/inspect-styles/) to expose the underlying HTML. This is the same HTML that Beautiful Soup helps you work through. From there, knowledge of [HTML](https://www.w3schools.com/html/) and [CSS](https://www.w3schools.com/css/) can help you select particular pieces of the page.

## Scraping a Series of Pages

Scraping often follows a pretty standard process:

* get a scraper working for one page
* gather a list of all the related pages
* apply the scraper to all the pages at once.

In general terms, this means that your first big task is simply to get a list of all the URLs you are interested in scraping. In this example, we have a series of texts available on GitHub [here](https://github.com/walshbr/humanists-nlp-cookbook/tree/master/scraping-corpus/dickinson). We'll scrape this site and manipulate the results to get the list we need for the final scraper.

In [39]:
url = 'https://github.com/walshbr/humanists-nlp-cookbook/tree/master/scraping-corpus/dickinson'
html = request.urlopen(url).read()
soup = BeautifulSoup(html, 'html5lib')

# get the HTML using the particular class structure
links = soup.find_all('a', class_='js-navigation-open link-gray-dark')

# now we have the links, but we need to manipulate them to pull out the information we want. We mostly want the 'href' attribute
print(links[0])

links = [link['href'] for link in links]

# notice that the links are relative: they are missing the scheme and domain ('https://github.com'), which we need for them to resolve.
print(links[0])
# let's go through and add the domain.
links = ['https://github.com' + link for link in links]
print(links[0])

# we now have a list of links, and we could loop over them to scrape each poem's text. 

poems = []
for link in links:
    html = request.urlopen(link).read()
    soup = BeautifulSoup(html, 'html5lib')
    lines = soup.find_all('tr')
    this_poem = [line.text for line in lines]
    poems.append(this_poem)

# lots of extra line breaks, as this is scraped from the web, but it's there!
poems[0]
    

<a class="js-navigation-open link-gray-dark" href="/walshbr/humanists-nlp-cookbook/blob/master/scraping-corpus/dickinson/xi.txt" id="144560cfc44322009bd525f9587e622b-17349cc6366e2bb355a8c89bd47a9f4d6ee03add" title="xi.txt">xi.txt</a>
/walshbr/humanists-nlp-cookbook/blob/master/scraping-corpus/dickinson/xi.txt
https://github.com/walshbr/humanists-nlp-cookbook/blob/master/scraping-corpus/dickinson/xi.txt


['\n\nXI.\n',
 '\n\n\n\n',
 '\n\nTHE OUTLET.\n',
 '\n\n\n\n',
 '\n\nMy river runs to thee:\n',
 '\n\nBlue sea, wilt welcome me?\n',
 '\n\n\n\n',
 '\n\nMy river waits reply.\n',
 '\n\nOh sea, look graciously!\n',
 '\n\n\n\n',
 "\n\nI'll fetch thee brooks\n",
 '\n\nFrom spotted nooks, —\n',
 '\n\n\n\n',
 '\n\nSay, sea,\n',
 '\n\nTake me!\n']

This brings up a related point: not all websites are equally good candidates for scraping. In the above example, we used the main GitHub page for the [file](https://github.com/walshbr/humanists-nlp-cookbook/blob/master/scraping-corpus/dickinson/xi.txt), because that was easily accessible to us. The poem is there, but it is embedded in a viewing interface designed by GitHub, and the somewhat complicated scraping we did with the 'tr' tags was meant to get rid of that interface and pull out just the poem. GitHub also offers a [raw link](https://raw.githubusercontent.com/walshbr/humanists-nlp-cookbook/master/scraping-corpus/dickinson/xi.txt) for each of these files that is separate from the viewing interface. We could have scraped that directly for cleaner results. To do so, we would have had to manipulate our URLs to change this

https://github.com/walshbr/humanists-nlp-cookbook/blob/master/scraping-corpus/dickinson/xi.txt

to this

https://raw.githubusercontent.com/walshbr/humanists-nlp-cookbook/master/scraping-corpus/dickinson/xi.txt

Definitely doable! And we would have used string methods to transform our initial URL into the raw version.
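
A minimal sketch of that transformation, assuming GitHub keeps the two URL patterns shown above (the helper name `to_raw_url` is our own and not part of any library):

```python
def to_raw_url(blob_url):
    # swap the domain, then drop the '/blob/' path segment (first occurrence only)
    raw_url = blob_url.replace('https://github.com/', 'https://raw.githubusercontent.com/', 1)
    return raw_url.replace('/blob/', '/', 1)

blob_url = 'https://github.com/walshbr/humanists-nlp-cookbook/blob/master/scraping-corpus/dickinson/xi.txt'
print(to_raw_url(blob_url))
```

Because the raw page contains only the poem, reading it with `request.urlopen(...).read().decode('utf8')` should give you the text directly, with no HTML to strip away.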

So this was a slightly complicated example to use, but it still works. When deciding whether or not scraping is an option, at its core, you are looking for a site that has a repeated structure across all its pages. You can ask a short set of questions of a page to determine if scraping is an option:

* are all the links to the pages you're interested in present on a single page?
* are the links to those pages, instead, spread across a series of pages?
* are the URLs formed in a consistent manner that you could extrapolate? For example, site.com/posts/1; site.com/posts/2

In general, sites in these forms are the easiest to pull data from, because you can easily imagine constructing the URLs, collecting them, and harvesting them. If you need, say, to input a search query into a form on the page to get the results you want to scrape, the process gets much more complicated.
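
When URLs follow a numbered pattern like the one above, the list can be built rather than scraped. This sketch reuses the placeholder address from the example; `site.com/posts/` is not a real target, and the `https://` scheme and the range of ten posts are assumptions made only for illustration.

```python
# hypothetical base address, taken from the placeholder pattern above
base_url = 'https://site.com/posts/'

# build the URLs for posts 1 through 10
post_urls = [base_url + str(number) for number in range(1, 11)]

print(post_urls[0])
print(post_urls[-1])
print(len(post_urls))
```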

## Ethics

Scraping sources through a script like this can raise a lot of questions. Do the people who run the site allow you to do so? Some websites explicitly detail whether or not you can in their terms of service. Project Gutenberg, for example, explicitly tells you that you *cannot* scrape their website. Doing so anyway potentially opens you to legal repercussions. Even if a site does not explicitly forbid scraping, it can still feel ethically suspect. A recent example of this is when a researcher scraped publicly available OKCupid user data. While it is true that these users made their personal information publicly available, they probably did not intend that their lives be exposed to this level of scrutiny. When getting ready to scrape data, it's usually a good idea to ask a series of questions:

* Was this data meant to be public?
* Am I harming anyone by pulling down this data?
* Is this data associated with anyone's identity in a way that they might object to?
* Is it worth it?
* Can I get the data in some other way?
* Is my scraping going to harm the website in some way?

Related to this last point - even if all these questions seem to be fine, you still need to be careful. Scraping a website can very often look like a [DDoS attack](https://en.wikipedia.org/wiki/Denial-of-service_attack). If you, say, try to scrape 10,000 links from Project Gutenberg, those 10,000 hits on Project Gutenberg's site could cause issues for their system. To get around this, it's often good practice to purposely slow down your scraper so that it more closely mimics the behavior of a human user. Rather than scraping multiple links per second, the following snippet tells the scraper to rest a random interval of up to 6 seconds between downloads:

In [13]:
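import time
import random

def download(url, sleep=True, max_sleep=6):
    # pause for a random interval of up to max_sleep seconds before each request
    if sleep:
        time.sleep(random.random() * max_sleep)
    html = request.urlopen(url).read().decode('utf8', errors='replace')
    return BeautifulSoup(html, 'html5lib')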

Every time you call the "download()" function, then, it will sleep for a random amount of time before fetching the page.
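
To show how that pause fits into a loop over many links, here is a sketch that does not touch the network. The `scrape_all` helper, its `fetch` argument, and the stand-in `fake_fetch` function are all invented for illustration. In practice `fetch` could be something like the `download()` function above called with `sleep=False`, since the loop handles the waiting.

```python
import random
import time

def scrape_all(urls, fetch, max_sleep=6):
    """Fetch each URL in turn, resting a random interval before each request."""
    results = []
    for url in urls:
        # wait between 0 and max_sleep seconds so requests are spread out
        time.sleep(random.random() * max_sleep)
        results.append(fetch(url))
    return results

# a stand-in for a real download function, so the sketch runs offline
def fake_fetch(url):
    return 'contents of ' + url

# max_sleep=0 here only so the demonstration finishes instantly
pages = scrape_all(['page-1', 'page-2'], fake_fetch, max_sleep=0)
print(pages)
```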

If you're really concerned, it is usually a good idea to contact the people whose site you want to work with and ask whether they mind you scraping their work. Sometimes they can make their data available in a more usable way. If you work at an institution with an IRB panel and your work involves human subjects, the panel can probably help you determine whether the data involved is sensitive.