# Scraping and Storing

[David J. Thomas](mailto:dave.a.base@gmail.com), [thePortus.com](http://thePortus.com)<br />
Instructor of Ancient History and Digital Humanities,<br />
Department of History,<br />
[University of South Florida](https://github.com/usf-portal)

---

## This workbook will...

* Create a local db to store the data
* Scrape/save charter info from ASC and PASE
* Scrape/save witness info from PASE

---

## 1) Import Module Dependencies

The cell below loads every Python package this workbook needs. You **must** run it before any other cell.

In [1]:
import os
import sqlalchemy as sql
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from IPython.display import clear_output
import dhelp

## 2) Create Database/Schema

Before we try to scrape data from the website, we need to have a place to store it. Rather than export the data as a spreadsheet, storing it as a local database will allow us to perform far more powerful kinds of analysis in later steps. In addition, using the database will allow us to easily export our information in a variety of formats.

We are going to use a simple type of database, SQLite. To simplify interaction with the database, we are going to use the Python package [SQLAlchemy](https://www.sqlalchemy.org/). This will allow us to get related bits of data in a 'Pythonic' way. For example, to get all the people appearing on a charter named `charter`, you would write `charter.witnesses`, which gives you a list of the related `Person` objects.

The code below first defines the database and then three 'models': `Charter`, `Person`, and `CharterWitness`, which stores the relational information (which people appeared on which charters). You can envision the data for each of these models as being stored in something like a spreadsheet. The `Person` model, for instance, is stored in a table named `people`, which has 3 columns (`id`, `description`, and `link`).

After defining these models inside Python using [SQLAlchemy](https://www.sqlalchemy.org/), the call to `Base.metadata.create_all(engine)` creates the corresponding tables in the database, which should now have three empty tables, named `charters`, `people`, and `charter_witnesses`. If you want to manually examine the database, you can use a free program like [SQLite Browser](http://sqlitebrowser.org/).

In [2]:
engine = sql.create_engine('sqlite:///charters.db', echo=False)
Base = declarative_base()
    

class Charter(Base):
    __tablename__ = 'charters'

    id = sql.Column(sql.String, primary_key=True)
    description = sql.Column(sql.String)
    sawyer = sql.Column(sql.Integer)
    birch = sql.Column(sql.Integer)
    kemble = sql.Column(sql.Integer)
    british_academy = sql.Column(sql.String)
    source_used = sql.Column(sql.String)
    archive = sql.Column(sql.String)
    language = sql.Column(sql.String)
    date = sql.Column(sql.Integer)
    scholarly_date = sql.Column(sql.String)
    scholarly_date_low = sql.Column(sql.Integer)
    scholarly_date_high = sql.Column(sql.Integer)
    scholarly_date_avg = sql.Column(sql.Float)
    text = sql.Column(sql.Text)
    notes = sql.Column(sql.Text)
    asc_source = sql.Column(sql.String)
    pase_source = sql.Column(sql.String)
    pase_witnesses = sql.Column(sql.String)
    
    witnesses = sql.orm.relationship('Person', secondary='charter_witnesses', back_populates='charters')
    
class Person(Base):
    __tablename__ = 'people'
    
    id = sql.Column(sql.String, primary_key=True)
    description = sql.Column(sql.String)
    link = sql.Column(sql.String)
    
    charters = sql.orm.relationship('Charter', secondary='charter_witnesses', back_populates='witnesses')
    
    @property
    def earliest_appearance(self):
        """Returns the date of the earliest charter featuring this person."""
        # ignore charters without a usable date, since None cannot be compared
        dates = [
            charter.scholarly_date_avg for charter in self.charters
            if charter.scholarly_date_avg is not None
        ]
        return min(dates) if dates else None
    
    
class CharterWitness(Base):
    __tablename__ = 'charter_witnesses'
    charter_id = sql.Column(sql.String, sql.ForeignKey('charters.id'), primary_key=True) 
    person_id = sql.Column(sql.String, sql.ForeignKey('people.id'), primary_key=True)
    role = sql.Column(sql.String)
    link = sql.Column(sql.String)


Base.metadata.create_all(engine)

print('Database Configured Successfully')

Database Configured Successfully
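
To make the role of the `charter_witnesses` table more concrete, the sketch below rebuilds a cut-down version of the three tables with Python's built-in `sqlite3` module and runs the kind of join that a relationship lookup stands for. It is only an illustration: it uses an in-memory database rather than `charters.db`, the sample rows are invented, and the names `demo_con` and `witnesses_of` are hypothetical helpers that do not appear elsewhere in this workbook.

```python
import sqlite3

# in-memory database, so nothing here touches charters.db
demo_con = sqlite3.connect(':memory:')
demo_con.executescript("""
    CREATE TABLE charters (id TEXT PRIMARY KEY, description TEXT);
    CREATE TABLE people (id TEXT PRIMARY KEY, description TEXT, link TEXT);
    CREATE TABLE charter_witnesses (
        charter_id TEXT REFERENCES charters(id),
        person_id TEXT REFERENCES people(id),
        role TEXT,
        PRIMARY KEY (charter_id, person_id)
    );
""")

# hypothetical sample rows, invented purely for illustration
demo_con.execute("INSERT INTO charters VALUES ('S0001', 'Example grant')")
demo_con.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [('Example 1', 'King', None), ('Example 2', 'Bishop', None)],
)
demo_con.executemany(
    "INSERT INTO charter_witnesses VALUES (?, ?, ?)",
    [('S0001', 'Example 1', 'Grantor'), ('S0001', 'Example 2', 'Witness')],
)


def witnesses_of(charter_id):
    """Return (person_id, role) pairs for one charter, ordered by person id."""
    rows = demo_con.execute(
        """
        SELECT people.id, charter_witnesses.role
        FROM people
        JOIN charter_witnesses ON charter_witnesses.person_id = people.id
        WHERE charter_witnesses.charter_id = ?
        ORDER BY people.id
        """,
        (charter_id,),
    )
    return rows.fetchall()


print(witnesses_of('S0001'))
# → [('Example 1', 'Grantor'), ('Example 2', 'Witness')]
```

The composite primary key on `(charter_id, person_id)` mirrors the `CharterWitness` model, so the same person cannot be attached to the same charter twice.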


## 3) Scrape ASC and PASE for Charter Info

Our first step will be to get the urls for every charter in the ASC database. Then, using `ASCCharterPage`, each page of the ASC database will be requested and parsed into a [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) object using the [dhelp](https://github.com/thePortus/dhelp) package.

As each ASC page is scraped, the link to the charter on the PASE database will be used to instantiate a corresponding `PASECharterPage` object. This object will then be used to grab further information about the charter not located on the ASC page. From each ASC page we will also grab the link to the list of people appearing on the charter, which we will use in the following step.

### 3a) Define Object to Scrape Info from an ASC Charter Page

In [3]:
class ASCCharterPage:
    """Extracts data from a single charter from the ASC database."""
    
    def __init__(self, url, options=None):
        self._url = url
        # avoid a mutable default argument; fall back to a short delay between requests
        if options is None:
            options = {'delay': 0.1}
        # fetch page and parse into beautifulsoup object
        with dhelp.WebPage(self._url, options=options) as page_soup:
            # get only portion of page with charter-specific content
            self.data = page_soup.body.div.table.find('td', id='content').div
        # eager load navbar to speed up link retrieval
        self._navbar = self.data.find('ul', class_='charter-nav')
            
    @property
    def id(self):
        """Gives the charter ID, with spaces removed."""
        return self.data.div.div.h1.get_text().replace(' ', '')
        
    @property
    def pase_source(self):
        """URL to source in PASE database (raises an exception if the link is missing)."""
        raw_url = self._navbar.find('li', class_='charter-pase-source').find('a')['href']
        # fixes link on asc page, which points to an obsolete address
        return raw_url.replace('ASC', 'Sources').replace('source.jsp', 'DisplaySource.jsp')
        
    @property
    def pase_witnesses(self):
        """URL to list of people appearing on charter in the PASE database."""
        try:
            return self._navbar.find('li', class_='charter-pase-witnesses').find('a')['href']
        # return None if no witnesses are found (i.e. no link exists)
        except (AttributeError, TypeError, KeyError):
            return None
    
    @property
    def description(self):
        """Modern description of charter."""
        return self.data.p.get_text()
    
    @property
    def text(self):
        """Full text of the charter, in original language. Editorial clause markings removed."""
        # grab text and convert from latin-1 to utf-8 encoding
        raw_text = self.data.find_all('div')[3].get_text()
        clean_text = bytearray(raw_text, 'latin-1').decode('utf-8')
        # remove text of embedded editorial marks
        remove_phrases = [
            'DATING CLAUSE', 'INVOCATION', 'PROMULGATION PLACE', 'CURSE',
            'DISPOSITIVE WORD', 'BOUNDS', 'PROEM',
        ]
        for remove_phrase in remove_phrases:
            clean_text = clean_text.replace(remove_phrase, '')
        # removes extra whitespace by splitting into list of words and rejoining
        return dhelp.LatinText(clean_text).rm_spaces().stringify()
    
print('ASC Charter Scraper Defined')

ASC Charter Scraper Defined
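
Two of the string transformations inside `ASCCharterPage` are easier to follow in isolation: the rewrite of the obsolete PASE address and the clean-up of the charter text. The sketch below reproduces both with plain Python so they can be tried without fetching a page. The function names are hypothetical, the example address is only an assumed shape for an obsolete link, and the whitespace step uses `' '.join(text.split())` as a stand-in for the `dhelp.LatinText` call, whose exact behaviour may differ.

```python
def fix_pase_source_url(raw_url):
    """Rewrite an obsolete ASC-style PASE address to the current Sources address."""
    return raw_url.replace('ASC', 'Sources').replace('source.jsp', 'DisplaySource.jsp')


# editorial clause markings embedded in the charter text
EDITORIAL_MARKS = [
    'DATING CLAUSE', 'INVOCATION', 'PROMULGATION PLACE', 'CURSE',
    'DISPOSITIVE WORD', 'BOUNDS', 'PROEM',
]


def repair_mojibake(raw_text):
    """Re-decode text that was really UTF-8 but had been read as Latin-1."""
    return raw_text.encode('latin-1').decode('utf-8')


def clean_charter_text(raw_text):
    """Repair the encoding, drop editorial marks, and collapse whitespace."""
    clean_text = repair_mojibake(raw_text)
    for mark in EDITORIAL_MARKS:
        clean_text = clean_text.replace(mark, '')
    # split on any run of whitespace, then rejoin with single spaces
    return ' '.join(clean_text.split())


# assumed shape of an obsolete link; the real href on an ASC page may differ
old_url = 'http://www.pase.ac.uk/jsp/ASC/source.jsp?sourceKey=1972'
print(fix_pase_source_url(old_url))
# → http://www.pase.ac.uk/jsp/Sources/DisplaySource.jsp?sourceKey=1972

# simulate text whose UTF-8 bytes were wrongly decoded as Latin-1
garbled = 'INVOCATION  In nomine   Domini. Ælfflæd'.encode('utf-8').decode('latin-1')
print(clean_charter_text(garbled))
# → In nomine Domini. Ælfflæd
```

The encoding repair only works on text that was damaged in this particular way. Text that is already correct and contains characters outside Latin-1, or bytes that are not valid UTF-8, will raise an error instead.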


### 3b) Define Object to Scrape Info from a PASE Charter Page

In [4]:
class PASECharterPage:
    """Extracts further information about charters from their PASE page."""
    
    def __init__(self, url):
        # fetch page and convert to beautifulsoup object
        with dhelp.WebPage(url, options={'delay': 0.1}) as page_soup:
            self.data = page_soup
        sections = self.data.find_all('div', class_='t01')
        self._charter_info = sections[0].table.tr.td.table.find_all('tr')
        self._source_info = sections[1].table.tr.td.table.find_all('tr')
    
    @property
    def sawyer(self):
        """Returns the sawyer number if extant, otherwise None."""
        try:
            return self._charter_info[0].td.get_text()
        except:
            return None
    
    @property
    def birch(self):
        """Returns the birch number if extant, otherwise None."""
        try:
            return self._charter_info[1].td.get_text()
        except:
            return None
    
    @property
    def kemble(self):
        """Returns the kemble number if extant, otherwise None."""
        try:
            return self._charter_info[2].td.get_text()
        except:
            return None
    
    @property
    def british_academy(self):
        """Returns the British Academy reference, otherwise None."""
        try:
            return self._charter_info[3].td.get_text()
        except:
            return None
    
    @property
    def source_used(self):
        """Gives modern source used."""
        try:
            return self._charter_info[4].td.get_text()
        except:
            return None
    
    @property
    def archive(self):
        """Gives name of modern archive housing the charter."""
        try:
            return self._charter_info[5].td.get_text()
        except:
            return None
    
    @property
    def language(self):
        """Gives language(s) used in charter."""
        try:
            return self._source_info[0].td.get_text()
        except:
            return None
    
    @property
    def date(self):
        """Gives long-form version of date."""
        try:
            return self._source_info[1].td.get_text()
        except:
            return None
    
    @property
    def scholarly_date(self):
        """Gives short-form version of date."""
        try:
            return self._source_info[2].td.get_text()
            
        except:
            return None
        
    @property
    def scholarly_date_low(self):
        """Will return low date if date range exists, otherwise return scholarly_date"""
        try:
            return int(self.scholarly_date.split('x')[0].replace(' ', ''))
        except:
            return int(self.scholarly_date)
        
    @property
    def scholarly_date_high(self):
        """Will return high date if date range exists, otherwise return scholarly_date"""
        try:
            return int(self.scholarly_date.split('x')[1].replace(' ', ''))
        except:
            return int(self.scholarly_date)
        
    @property
    def scholarly_date_avg(self):
        """Returns mean date if date range exists, otherwise return scholarly_date"""
        low_date = self.scholarly_date_low
        high_date = self.scholarly_date_high
        if low_date == high_date:
            return low_date
        else:
            return round((low_date + high_date) / 2, 2)
    
    @property
    def notes(self):
        """Gives miscellaneous notes on charter."""
        try:
            return self.data.find('div', class_='rec').p.get_text()
        except:
            return None


print('PASE Charter Scraper Defined')

PASE Charter Scraper Defined
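
The three `scholarly_date_*` properties all rest on one idea: a date range is written as two years separated by an `x`, and a single year stands for itself. The sketch below gathers that logic into one hypothetical function, `parse_scholarly_date`, which returns `None` for unreadable input instead of raising. The `'899 x 924'` format is an assumption inferred from the `split('x')` call above, not something checked against the PASE site.

```python
def parse_scholarly_date(raw_date):
    """Return (low, high, average) for a date such as '899 x 924' or '956'.

    Returns None when the string cannot be read as a year or a year range.
    """
    if raw_date is None:
        return None
    parts = [part.strip() for part in raw_date.split('x')]
    try:
        years = [int(part) for part in parts]
    except ValueError:
        return None
    if len(years) == 1:
        low = high = years[0]
    elif len(years) == 2:
        low, high = years
    else:
        return None
    return low, high, round((low + high) / 2, 2)


for example in ['899 x 924', '956', 'uncertain']:
    print(example, '->', parse_scholarly_date(example))
# 899 x 924 -> (899, 924, 911.5)
# 956 -> (956, 956, 956.0)
# uncertain -> None
```

Returning `None` for a date that cannot be parsed would let a charter be stored with empty date columns rather than being skipped altogether.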


### 3c) Scrape Charter Info from ASC and PASE

In [5]:
# first, we need to open a session with the local database
Session = sessionmaker(bind=engine)
session = Session()

# before scraping information, we need to get the urls for every ASC charter
asc_links = []
charter_soup = None
# fetch page and parse into beautifulsoup object
with dhelp.WebPage('http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html') as page_soup:
    # get only portion of page with charter-specific content
    charter_soup = page_soup.body.div.table.find('tr', class_='r02').find('td', id='content').div
# looping through each section and group of rulers
for ruler_section in charter_soup.find_all('ul', class_='asc-expand'):
    for ruler_group in ruler_section.find_all('li'):
        # get relative links from <a> tags and append the full link to asc_links by prefixing the site root
        for charter_link_wrapper in ruler_group.find_all('li'):
            asc_links.append('http://www.aschart.kcl.ac.uk' + charter_link_wrapper.a['href'])

charter_counter = 0
# loop through each charter found online
for charter_url in asc_links:
    clear_output()
    print('Gathering charters ({})....'.format(charter_counter + 1))
    try:
        asc_charter_page = ASCCharterPage(charter_url)
        pase_charter_page = PASECharterPage(asc_charter_page.pase_source)
        # make new charter and add to session
        session.add(Charter(
            id=asc_charter_page.id,
            description=asc_charter_page.description,
            sawyer=pase_charter_page.sawyer,
            birch=pase_charter_page.birch,
            kemble=pase_charter_page.kemble,
            british_academy=pase_charter_page.british_academy,
            source_used=pase_charter_page.source_used,
            archive=pase_charter_page.archive,
            language=pase_charter_page.language,
            date=pase_charter_page.date,
            scholarly_date=pase_charter_page.scholarly_date,
            scholarly_date_low=pase_charter_page.scholarly_date_low,
            scholarly_date_high=pase_charter_page.scholarly_date_high,
            scholarly_date_avg=pase_charter_page.scholarly_date_avg,
            text=asc_charter_page.text,
            notes=pase_charter_page.notes,
            asc_source=asc_charter_page._url,
            pase_source=asc_charter_page.pase_source,
            pase_witnesses=asc_charter_page.pase_witnesses
        ))
    # catch ordinary errors only, so that interrupting the kernel still stops the loop
    except Exception:
        print('Error loading page at', charter_url, '(skipped)')
    charter_counter += 1
# commit all changes to the local db
session.commit()
session.close()

print('Charters successfully scraped')

Gathering charters (467)....
Fetching http://www.aschart.kcl.ac.uk/charters/s1482.html
Successfully scraped http://www.aschart.kcl.ac.uk/charters/s1482.html
Fetching http://www.pase.ac.uk/jsp/Sources/DisplaySource.jsp?sourceKey=1972
Successfully scraped http://www.pase.ac.uk/jsp/Sources/DisplaySource.jsp?sourceKey=1972
Error loading page at http://www.aschart.kcl.ac.uk/charters/s1482.html (skipped)
Charters successfully scraped
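
The link-gathering step at the top of the cell above walks nested lists and turns each relative `href` into a full address. As a rough, self-contained illustration of that idea, the sketch below does the same on a small HTML fragment using only the standard library: `html.parser` in place of BeautifulSoup and `urllib.parse.urljoin` in place of string concatenation. The `LinkCollector` class and the sample markup are hypothetical; the real index page is larger and may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # urljoin copes with both relative and absolute hrefs
                self.links.append(urljoin(self.base_url, href))


# hypothetical fragment imitating the nested lists of the index page
sample_html = """
<ul class="asc-expand">
  <li>Ruler group
    <ul>
      <li><a href="/charters/s0001.html">S 1</a></li>
      <li><a href="/charters/s0002.html">S 2</a></li>
    </ul>
  </li>
</ul>
"""

collector = LinkCollector('http://www.aschart.kcl.ac.uk/idc/idx_sawyerNo.html')
collector.feed(sample_html)
print(collector.links)
# → ['http://www.aschart.kcl.ac.uk/charters/s0001.html',
#    'http://www.aschart.kcl.ac.uk/charters/s0002.html']
```

Because the parser records each `<a>` tag once, however deeply its list item is nested, this approach also avoids collecting the same link twice.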


## 4) Scrape PASE for People in Charters

### 4a) Define Object to Scrape Info from a PASE Charter Witnesses Page

In [6]:
class PASEWitnesses:
    """Gets basic information about people on the charter from a PASE witnesses page"""
    
    def __init__(self, url):
        self._url = url
        with dhelp.WebPage(self._url, options={'delay': 0.1}) as page_soup:
            self._soup = page_soup
            # get only portion of page with relevant data
            try:
                self.data = page_soup.find('div', class_='rec').find('ul').find_all('li')
            # sometimes notes precede data, in which case get second div with class rec
            except:
                self.data = page_soup.find_all('div', class_='rec')[1].find('ul').find_all('li')

    @property
    def witnesses(self):
        """List of dicts (role, name, link, description), one per person found on the page."""
        witness_list = []
        for witness_entry in self.data:
            witness_link_element = witness_entry.find('a')
            try:
                witness_role = witness_entry.find('strong').get_text()
            except:
                witness_role = 'Witness'
            witness_list.append({
                    'role': witness_role,
                    'name': witness_link_element.get_text(),
                    'link': witness_link_element['href'].replace('../', 'http://www.pase.ac.uk/jsp/'),
                    'description': witness_entry.find('em').get_text()
                })
            # look for nested witnesses, sometimes buried in recursive em tags
            nested_witnesses_element = None
            try:
                nested_witnesses_element = self._soup.find('div', class_='rec').find('ul').find('em')
            except:
                nested_witnesses_element = self._soup.find_all('div', class_='rec')[1].find('ul').find('em')
            while nested_witnesses_element is not None:
                nested_witnesses = nested_witnesses_element.find_all('li')
                for nested_witness_entry in nested_witnesses:
                    # take the link from the nested entry itself, not from the outer entry
                    nested_witness_link_element = nested_witness_entry.find('a')
                    try:
                        nested_witness_role = nested_witness_entry.find('strong').get_text()
                    except:
                        nested_witness_role = 'Witness'
                    witness_list.append({
                        'role': nested_witness_role,
                        'name': nested_witness_link_element.get_text(),
                        'link': nested_witness_link_element['href'].replace('../', 'http://www.pase.ac.uk/jsp/'),
                        'description': nested_witness_entry.find('em').get_text()
                    })
                nested_witnesses_element = nested_witnesses_element.find('em')
        return witness_list
    

print('PASE Witnesses Scraper Defined')

PASE Witnesses Scraper Defined


### 4b) Scrape Witness Info

In [7]:
# first, we need to open a session with the local database
Session = sessionmaker(bind=engine)
session = Session(autoflush=False)

charter_counter = 0
# now we want to query our local db for every charter, which is returned as a Charter object
for charter in session.query(Charter):
    clear_output()
    print('Gathering charter witnesses from charter ({})....'.format(charter_counter + 1))
    # then get the link to the corresponding PASE page with witness information, skip if there is no link
    witnesses_link = charter.pase_witnesses
    if witnesses_link is not None:
        # use the scraper to get witness data as list of dicts, only proceed if results were found
        witness_list = PASEWitnesses(witnesses_link).witnesses
        # loop through the list of witnesses
        for witness in witness_list:
            # query to see if person already exists in db, if no results found, then add them
            person_found = session.query(Person).filter(Person.id == witness['name']).first() is not None
            if not person_found:
                try:
                    session.add(Person(
                        id=witness['name'],
                        description=witness['description'],
                        link=witness['link']
                    ))
                    print('Added person {}'.format(witness['name']))
                    session.commit()
                except:
                    session.rollback()
            # add charter/person relationship information to `charter_witnesses` table created above
            try:
                session.add(CharterWitness(
                    charter_id=charter.id,
                    person_id=witness['name'],
                    role=witness['role'],
                    link=str(witnesses_link)
                ))
                session.commit()
            except:
                session.rollback()
            print('Added person/charter relationship {} -> {}'.format(witness['name'], charter.id))
    charter_counter += 1
    
# close the session (changes were already committed as each row was added)
session.close()

print('Witnesses successfully scraped')

Gathering charter witnesses from charter (347)....
Fetching http://www.pase.ac.uk/jsp/ASC/factoid.jsp?factoidKey=51596
Successfully scraped http://www.pase.ac.uk/jsp/ASC/factoid.jsp?factoidKey=51596
Added person/charter relationship  Cenwulf 3 -> S1442
Added person  Wullaf 5
Added person/charter relationship  Wullaf 5 -> S1442
Added person  Cynethryth 4
Added person/charter relationship  Cynethryth 4 -> S1442
Added person  Anonymous 689
Added person/charter relationship  Anonymous 689 -> S1442
Added person  Ælfflæd 8
Added person/charter relationship  Ælfflæd 8 -> S1442
Added person/charter relationship  Æthelred 1 -> S1442
Added person  Worcester 1
Added person/charter relationship  Worcester 1 -> S1442
Added person  Winchcombe 1
Added person/charter relationship  Winchcombe 1 -> S1442
Added person/charter relationship  Wærfrith 6 -> S1442
Added person/charter relationship  Beorhthun 6 -> S1442
Added person/charter relationship  Beorhtmund 4 -> S1442
Added person/charter relationship 
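
The "check whether the person exists, then add them" step in the cell above can be tried on its own. The sketch below shows one possible version written with the built-in `sqlite3` module instead of SQLAlchemy, using an in-memory database and invented rows. The names `people_con` and `add_person_if_new` are hypothetical and are not part of this workbook's code.

```python
import sqlite3

# in-memory database with a cut-down people table
people_con = sqlite3.connect(':memory:')
people_con.execute('CREATE TABLE people (id TEXT PRIMARY KEY, description TEXT)')


def add_person_if_new(con, person_id, description):
    """Insert a person unless the id already exists; return True if a row was added."""
    existing = con.execute(
        'SELECT 1 FROM people WHERE id = ?', (person_id,)
    ).fetchone()
    if existing is not None:
        return False
    con.execute(
        'INSERT INTO people (id, description) VALUES (?, ?)',
        (person_id, description),
    )
    con.commit()
    return True


# hypothetical sample data, invented for illustration
print(add_person_if_new(people_con, 'Example 1', 'King'))   # → True
print(add_person_if_new(people_con, 'Example 1', 'King'))   # → False
print(people_con.execute('SELECT COUNT(*) FROM people').fetchone()[0])   # → 1
```

Returning a boolean makes it possible to print an "Added person" message only when a row was really inserted.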

In [9]:
import sqlite3
import csv

con = sqlite3.connect('charters.db')


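def dump_table(tablename, filename):
    # newline='' stops the csv module from writing blank rows on Windows;
    # utf-8 keeps names such as Ælfflæd intact
    with open(filename, 'w', newline='', encoding='utf-8') as outfile:
        outcsv = csv.writer(outfile)
        # tablename is hard-coded below; never build SQL from untrusted input this way
        cursor = con.execute('select * from ' + tablename)
        # dump column titles (optional)
        outcsv.writerow(x[0] for x in cursor.description)
        # dump rows
        outcsv.writerows(cursor.fetchall())
    return True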

if (
    dump_table('charters', 'export_charters.csv') and
    dump_table('people', 'export_people.csv') and
    dump_table('charter_witnesses', 'export_charter_witnesses.csv')
):
    print('Table data successfully exported')
    

Table data successfully exported
