# Fetch and Parse Webpage Data

### *Requests and BeautifulSoup*

---

## Author

[David J. Thomas](mailto:dave.a.base@gmail.com), [thePort.us](http://thePort.us)<br />
Instructor of Ancient History and Digital Humanities<br />
Department of History<br />
[University of South Florida](https://github.com/usf-portal)

---

## Historical Source:
Nearly 500 charters from Anglo-Saxon England, c. 600-900, and the more than 2,500 people who appear in them. Charters were elaborate documents recording grants of land, property, etc. Each one was signed by a long list of very important witnesses, which helped guarantee its legitimacy. These charters, in aggregate, contain a wealth of information about reciprocity, relationships, and social display among medieval elites. Between the texts of the charters and the names of the people on them, we can glean all kinds of useful information.

---

## Data Source:
Two databases. (1) The Anglo-Saxon Charters Database (ASC) is the focus of our study. It contains the full text of hundreds of charters along with metadata. For this study, we will limit our purview to only these charters and only the individuals who appear in them. However, we need to round out our metadata on the charters and individuals. For that we will use (2) the Prosopography of Anglo-Saxon England Database (PASE). Between these two we can get a significant amount of information on texts, people, and relationships.

---

## Packages Used

* Requests
    * [Main Documentation](https://python-requests.org)
    * [PyPi Package](https://pypi.org/project/requests/)
    * [GitHub Repo](https://github.com/psf/requests/)
* BeautifulSoup
    * [Main Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
    * [PyPi Package](https://pypi.org/project/beautifulsoup4/)

*To understand more about Requests, I highly recommend the following*... [1](https://www.w3schools.com/python/module_requests.asp) | [2](https://scotch.io/tutorials/getting-started-with-python-requests-get-requests) | [3](https://www.pythonforbeginners.com/requests/using-requests-in-python)

*To understand more about BeautifulSoup, I highly recommend the following*... [1](https://beautiful-soup-4.readthedocs.io/en/latest/) | [2](https://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python) | [3](https://www.datacamp.com/community/tutorials/scraping-reddit-python-scrapy)

To FULLY understand everything, you probably need intermediate Python skills. BUT, even if you only have beginner's skills, you will be able to learn some useful tricks, work out larger strategies, and see the potential of Python for historical insight! If you want to learn how something is done, pay attention to the code comments, look up the official documentation on the package websites listed above, or work through the tutorials I link inside the notebook. I hope this is useful to everyone from beginners to experts.

---

## 1 - Basic Example

The code below shows the most basic usage in a script. First, a URL is passed to the `requests.get()` function, which returns an HTTP response object. That object contains many things, including the actual HTML that powers the webpage at the specified address, which is stored in the `.text` property of the response object. This example simply fires off the request, gets the response object, stores the HTML, and then prints it to the screen.

---

In [None]:
# make sure to import the requests module or it won't work
import requests

example_url = 'http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html'

# send a request to the specified url and get back a response object
example_response = requests.get(example_url)
# pull the HTML out of the .text property of the response object and store it
example_html = example_response.text

# print result to screen to test
print(example_html)

## 2 - Automating with a Function

In the last step we wrote a script that will get a single webpage. But you will want to get many pages. So, it is time to turn what we just did in the last step into a function, making it repeatable. Our function will get a URL and it will return the HTML from the page.

In [None]:
# always make sure to import needed packages
import requests


# time to define the function, remember to leave extra space above and below for good form.
def get_page_html(webpage_url):
    """Gets a url, sends a web request, then returns the HTML from the response object."""
    webpage_response = requests.get(webpage_url)
    webpage_html = webpage_response.text
    return webpage_html


# now let's test our function, let's set our example url again, call our function, then print the result
example_url = 'http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html'
example_html = get_page_html(example_url)
print(example_html)
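
The function above assumes every request succeeds. As a rough sketch of a more defensive variant, the version below adds a timeout and checks the status code before returning the HTML. The name `get_page_html_safely` and the 10-second default are my own illustrative choices, not part of the original function, and returning `None` on failure is just one possible design.

```python
import requests


def get_page_html_safely(webpage_url, timeout=10):
    """Returns the page HTML as a string, or None if the request fails.

    The function name and the 10-second default are illustrative choices.
    """
    try:
        webpage_response = requests.get(webpage_url, timeout=timeout)
        # turn 4xx and 5xx status codes into exceptions
        webpage_response.raise_for_status()
    except requests.RequestException as error:
        # report which page failed and why, then signal failure with None
        print('Problem fetching page at...', webpage_url, '|', error)
        return None
    return webpage_response.text


# a malformed address fails before anything is sent over the network
print(get_page_html_safely('not-a-real-url'))
```

Because the function returns `None` instead of crashing, a loop over many URLs can skip the bad ones and keep going.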

## 3 - Parsing HTML with BeautifulSoup

Okay, great. So we've figured out how to quickly get the raw HTML for any webpage in a handful of lines of code. But how do you extract data from it? The raw HTML is a giant chunk of string data, and you just want to get a little bit of it. Usually that is just some text between particular HTML tags (the things wrapped in angle brackets, like `<p>`), or some other small bit of data. Fortunately, `beautifulsoup4` helps you burrow down through this mass of text and search it inside of Python. You can search for just certain tags, or for all tags matching certain criteria. You can then extract all kinds of information from them. So, first, let's do the most basic step, as we did in step 1 above.

In [None]:
# YOU MUST HAVE RUN THE STEPS ABOVE FOR THIS TO WORK

from bs4 import BeautifulSoup

example_url = 'http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html'
example_html = get_page_html(example_url)

example_parsed = BeautifulSoup(example_html, 'html.parser')
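
If you want to experiment without hitting the website, here is a minimal sketch that parses a small HTML string instead of a downloaded page. The HTML below is made up for illustration and is not taken from the ASC site.

```python
from bs4 import BeautifulSoup

# a small, made-up HTML document standing in for a downloaded page
sample_html = """
<html>
  <head><title>Sample Charter Index</title></head>
  <body>
    <h2 id="kent">Kent</h2>
    <p>Charters issued by the kings of Kent.</p>
  </body>
</html>
"""

sample_parsed = BeautifulSoup(sample_html, 'html.parser')

# .title is a shortcut for the first <title> tag in the document
page_title = sample_parsed.title.text
# .find() returns the first tag that matches
first_paragraph = sample_parsed.find('p').text

print(page_title)
print(first_paragraph)
```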

In [None]:
import requests
from bs4 import BeautifulSoup


def soupify_page(webpage_url):
    """Gets a url, sends a web request, then returns the page as a parsed BeautifulSoup object."""
    webpage_response = requests.get(webpage_url)
    webpage_html = webpage_response.text
    webpage_parsed = BeautifulSoup(webpage_html, 'html.parser')
    return webpage_parsed

example_url = 'http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html'
example_parsed = soupify_page(example_url)

print(example_parsed.prettify())

In [None]:
# YOU MUST HAVE RUN THE STEPS ABOVE FOR THIS TO WORK

example_url = 'http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html'
example_parsed = soupify_page(example_url)

example_element = example_parsed.find('h2', id='d1605925e194')

print("Entire Beautiful Soup element:", example_element)
print("Visible text inside element:", example_element.text)
print("Attribute value of element:", example_element['id'])
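
One thing to watch for: `.find()` returns `None` when nothing matches, so calling `.text` on the result will raise an error if the page changes or the id is wrong. Below is a small sketch of one way to guard against that, using made-up HTML. The helper name `get_heading_text` is hypothetical.

```python
from bs4 import BeautifulSoup

# made-up HTML for illustration
sample_html = '<h2 id="kent">Kent</h2><h2 id="mercia">Mercia</h2>'
sample_parsed = BeautifulSoup(sample_html, 'html.parser')


def get_heading_text(parsed_page, heading_id):
    """Returns the text of the <h2> with the given id, or None if absent."""
    heading = parsed_page.find('h2', id=heading_id)
    if heading is None:
        return None
    return heading.text


found_text = get_heading_text(sample_parsed, 'mercia')
missing_text = get_heading_text(sample_parsed, 'wessex')
print(found_text)
print(missing_text)
```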

In [None]:
example_url = 'http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html'
example_parsed = soupify_page(example_url)

list_of_kingdoms = example_parsed.find_all('ul', class_='asc-expand')

for kingdom in list_of_kingdoms[0:1]:
    print(kingdom.prettify())

In [None]:

for kingdom in list_of_kingdoms:
    kingdom_charters = kingdom.find_all('a')
    for kingdom_charter in kingdom_charters[0:1]:
        print(kingdom_charter.prettify())
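
The nested search above can be hard to picture without seeing the markup. Here is a small offline sketch of the same pattern, run against made-up HTML that only imitates the structure I am assuming for the index page (several `<ul class="asc-expand">` lists containing links).

```python
from bs4 import BeautifulSoup

# made-up markup that imitates nested lists of links
sample_html = """
<ul class="asc-expand">
  <li><a href="../charters/a.html">Charter A</a></li>
  <li><a href="../charters/b.html">Charter B</a></li>
</ul>
<ul class="other-list">
  <li><a href="../about.html">About</a></li>
</ul>
<ul class="asc-expand">
  <li><a href="../charters/c.html">Charter C</a></li>
</ul>
"""
sample_parsed = BeautifulSoup(sample_html, 'html.parser')

link_texts = []
# outer search: only the <ul> tags with the class we care about
for sample_list in sample_parsed.find_all('ul', class_='asc-expand'):
    # inner search: every <a> tag inside that one list
    for sample_link in sample_list.find_all('a'):
        link_texts.append(sample_link.text)

print(link_texts)
```

Notice that the "About" link is skipped, because its parent list does not have the `asc-expand` class.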
    

In [None]:
example_url = 'http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html'
example_parsed = soupify_page(example_url)

list_of_kingdoms = example_parsed.find_all('ul', class_='asc-expand')

charters = []
for kingdom in list_of_kingdoms:
    kingdom_charters = kingdom.find_all('a')
    for kingdom_charter in kingdom_charters[0:5]:
        print(kingdom_charter['href'])

In [None]:
charter_urls = []

for kingdom in list_of_kingdoms:
    kingdom_charters = kingdom.find_all('a')
    for kingdom_charter in kingdom_charters[0:5]:
        print(kingdom_charter['href'][2:])

In [None]:
charter_urls = []

for kingdom in list_of_kingdoms:
    kingdom_charters = kingdom.find_all('a')
    for kingdom_charter in kingdom_charters:
        charter_url = 'http://www.aschart.kcl.ac.uk' + kingdom_charter['href'][2:]
        charter_urls.append(charter_url)
        
for charter_url in charter_urls[0:10]:
    print(charter_url)
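
Slicing off the first two characters works as long as every link starts with `..`. A possible alternative, sketched below, is `urljoin()` from Python's standard library, which resolves a relative link against the page it was found on. The href values here are hypothetical and invented for illustration.

```python
from urllib.parse import urljoin

index_url = 'http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html'
# hypothetical href values, invented for illustration
sample_hrefs = ['../charters/sample-1.html', 'sample-2.html', '/about.html']

# urljoin resolves each href relative to the page it appeared on
absolute_urls = [urljoin(index_url, href) for href in sample_hrefs]

for absolute_url in absolute_urls:
    print(absolute_url)
```

This handles links that climb up a folder, links in the same folder, and links that start from the site root, without any manual string slicing.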

## 4 - Going Object Oriented with Classes

In [None]:
import requests
from bs4 import BeautifulSoup


class BasePage:
    url = None

    def __init__(self, url):
        super().__init__()
        self.url = url
        
    def __str__(self):
        return self.url
    
    def __repr__(self):
        return self.url
        
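    @property
    def soup(self):
        """Fetches the page at self.url and returns it as a parsed BeautifulSoup object."""
        try:
            webpage_response = requests.get(self.url)
        except requests.RequestException:
            # report which page failed, then re-raise so the error is not hidden
            print('Problem fetching page at...', self.url)
            raise
        webpage_html = webpage_response.text
        webpage_parsed = BeautifulSoup(webpage_html, 'html.parser')
        return webpage_parsed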


test_object = BasePage('http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html')
# This should just print the URL to the screen, the behavior we specified in .__str__() and .__repr__()
print('The url for this test is', test_object)
print('And its data is...')
print(test_object.soup.prettify())
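
Keep in mind that a plain `@property` runs its code every time it is accessed, so each use of `.soup` sends a new web request. One possible way around this, sketched below, is `functools.cached_property` (Python 3.8 or later), which stores the result after the first access. The `CachedPage` class is a hypothetical stand-in for `BasePage`. It takes raw HTML instead of a URL so the sketch runs offline, and `parse_count` exists only to show how often the parsing happens.

```python
from functools import cached_property

from bs4 import BeautifulSoup


class CachedPage:
    """Illustrative stand-in for BasePage that parses its HTML only once."""

    def __init__(self, html):
        self.html = html
        # counter used only to demonstrate how often soup is built
        self.parse_count = 0

    @cached_property
    def soup(self):
        # runs on first access only; the result is then stored on the instance
        self.parse_count += 1
        return BeautifulSoup(self.html, 'html.parser')


sample_page = CachedPage('<h2 id="kent">Kent</h2>')
first_access = sample_page.soup
second_access = sample_page.soup
print(sample_page.parse_count)
print(first_access is second_access)
```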

In [None]:
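class BrowsePage(BasePage):
    
    def get_data(self):
        """Returns a list of the full URLs of every charter linked from this page."""
        # start with a fresh list on every call, instead of relying on one left over from an earlier cell
        charter_urls = []
        page_html = self.soup
        list_of_kingdoms = page_html.find_all('ul', class_='asc-expand')
        for kingdom in list_of_kingdoms:
            kingdom_charters = kingdom.find_all('a')
            for kingdom_charter in kingdom_charters:
                charter_url = 'http://www.aschart.kcl.ac.uk' + kingdom_charter['href'][2:]
                charter_urls.append(charter_url)
        return charter_urls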


test_object = BrowsePage('http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html')
# This should just print the URL to the screen, the behavior we specified in .__str__() and .__repr__()
print('Scraping data from', test_object)
test_urls = test_object.get_data()
for test_url in test_urls[0:5]:
    print(test_url)

In [None]:
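class CharterPage(BasePage):
    
    @property
    def soup(self):
        """Returns only the main content <div> of the page, or None if it is missing."""
        # the parent's soup is a property, so access it without parentheses
        page_html = super().soup
        return page_html.find('div', id='mainContent')


test_object = BrowsePage('http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html')
# This should just print the URL to the screen, the behavior we specified in .__str__() and .__repr__()
print('Scraping data from', test_object)
test_urls = test_object.get_data()
for test_url in test_urls[0:5]:
    print(test_url)

# now fetch the first charter page and print only its main content
test_charter = CharterPage(test_urls[0])
print(test_charter.soup)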