# Convert a year's worth of Historic Hansard into a dataframe for analysis

This notebook analyses Commonwealth Hansard XML files [from this GitHub repository](https://github.com/wragge/hansard-xml). Give it a `year` (between 1901 and 1980), and a `house` (either 'hofreps' or 'senate'), and it will download all the proceedings of that year and house, extract some basic data about debates and speeches, and provide the results as a dataframe for exploration.

In [None]:
import requests
import requests_cache
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm
import arrow
import pandas as pd
import altair as alt

s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

Note that the GitHub API only allows 60 unauthenticated requests per hour, so it's a good idea to cache responses. Requests to download the files themselves don't count towards the API limit. If you need more API requests, you'll need to authenticate.
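
If you do run into the limit, one way to authenticate is to send a personal access token with each API request. The following is a minimal sketch rather than part of the original notebook: the `github_headers` helper and the `GITHUB_TOKEN` environment variable name are assumptions made for illustration.

```python
import os

def github_headers(token=None):
    '''
    Build request headers for the GitHub API.
    (Hypothetical helper; GITHUB_TOKEN is an assumed environment variable name.)
    '''
    headers = {'Accept': 'application/vnd.github+json'}
    # Fall back to the environment so the token never has to appear in the notebook
    token = token or os.environ.get('GITHUB_TOKEN')
    if token:
        headers['Authorization'] = f'Bearer {token}'
    return headers

# Example with a placeholder token (not a real credential)
print(github_headers('my-token')['Authorization'])
```

To use it with the cached session `s`, you could call `s.headers.update(github_headers())` before requesting the file listing.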

In [None]:
API_URL = 'https://api.github.com/repos/wragge/hansard-xml/contents'

<div class="alert alert-info"><img src="../images/hhicon.png" width="50px" style="vertical-align: bottom; margin-right: 10px;">Just set the values below to a year and a house, then run all the cells!</div>

In [None]:
year = '1901' # 1901 to 1980
house = 'hofreps' # hofreps or senate
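
Because the repository only covers 1901 to 1980 and two houses, it can help to check these values before making any requests. This is a minimal sketch; `check_settings` and `VALID_HOUSES` are hypothetical names, not part of the original notebook.

```python
VALID_HOUSES = ('hofreps', 'senate')

def check_settings(year, house):
    '''
    Raise a ValueError if the year or house falls outside what the repository covers.
    (Hypothetical helper added for illustration.)
    '''
    if not str(year).isdigit() or not 1901 <= int(year) <= 1980:
        raise ValueError(f'year must be between 1901 and 1980, got {year!r}')
    if house not in VALID_HOUSES:
        raise ValueError(f'house must be one of {VALID_HOUSES}, got {house!r}')
    # Return the year as a string, since it is used to build the API URL
    return str(year), house

print(check_settings('1901', 'hofreps'))
```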

In [None]:
def count_words(para):
    '''
    Count the number of words in an element.
    '''
    words = 0
    for string in para.stripped_strings:
        words += len(string.split())
    return words

def get_paras(section):
    '''
    Find all the para type containers in an element and count the total number of words.
    '''
    words = 0
    for para in section.find_all(['para', 'quote', 'list'], recursive=False):
        words += count_words(para)
    return words

def get_words_in_speech(start, speech):
    '''
    Get the top-level containers in a speech and find the total number of words across them all.
    '''
    words = 0
    words += get_paras(start)
    words += get_paras(speech)
    for cont in speech.find_all('continue', recursive=False):
        cont_start = cont.find('talk.start', recursive=False)
        words += get_paras(cont_start)
        words += get_paras(cont)
    return words
                            
def get_interjections(speech):
    '''
    Get details of any interjections within a speech.
    '''
    speeches = []
    for index, intj in enumerate(speech.find_all('interjection', recursive=False)):
        start = intj.find('talk.start', recursive=False)
        speaker = start.find('talker')
        name = speaker.find('name', role='metadata').string
        # Use speaker_id to avoid shadowing the built-in id()
        speaker_id = speaker.find('name.id').string
        words = get_words_in_speech(start, intj)
        speeches.append({'interjection_idx': index, 'speaker': name, 'id': speaker_id, 'type': intj.name, 'words': words})
    return speeches

def get_speeches(debate):
    '''
    Get details of any speeches in a debate (or subdebate)
    '''
    speeches = []
    for index, speech in enumerate(debate.find_all(['speech', 'question', 'answer'], recursive=False)):
        start = speech.find('talk.start', recursive=False)
        speaker = start.find('talker')
        name = speaker.find('name', role='metadata').string
        # Use speaker_id to avoid shadowing the built-in id()
        speaker_id = speaker.find('name.id').string
        words = get_words_in_speech(start, speech)
        speeches.append({'speech_idx': index, 'speaker': name, 'id': speaker_id, 'type': speech.name, 'words': words})
        # Interjections are within a speech
        interjections = get_interjections(speech)
        # Tag interjections with the speech index
        for intj in interjections:
            intj['speech_idx'] = index
            speeches.append(intj)
    return speeches

def get_subdebates(debate):
    '''
    Get details of any subdebates within a debate.
    '''
    speeches = []
    for index, sub in enumerate(debate.find_all('subdebate.1', recursive=False)):
        subdebate_info = {'subdebate_title': sub.subdebateinfo.title.string, 'subdebate_idx': index}
        new_speeches = get_speeches(sub)
        # Add the subdebate info to the speech
        for sp in new_speeches:
            sp.update(subdebate_info)
        speeches += new_speeches
    return speeches

def get_debates(soup):
    '''
    Get details of all the debates in a day's proceedings.
    '''
    speeches = []
    date = soup.find('session.header').date.string
    for index, debate in enumerate(soup.find_all('debate')):
        debate_info = {
            'date': date,
            'debate_title': debate.debateinfo.title.string,
            'debate_type': debate.debateinfo.type.string,
            'debate_idx': index
        }
        new_speeches = get_subdebates(debate)
        new_speeches += get_speeches(debate)
        # Add the debate info to the speech
        for sp in new_speeches:
            sp.update(debate_info)
        speeches += new_speeches
    return speeches

def summarise_year(year, house):
    '''
    Get each day's proceedings for the supplied year/house and extract information about debates and speeches.
    '''
    speeches = []
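    response = s.get(f'{API_URL}/{house}/{year}')
    # Fail early with a clear error if the API request is rejected (e.g. rate limited)
    response.raise_for_status()
    data = response.json()
    # Each file in the directory listing holds one day's proceedings
    files = [f for f in data if f['type'] == 'file']
    for f in tqdm(files):
        response = s.get(f['download_url'])
        soup = BeautifulSoup(response.text)
        speeches += get_debates(soup)
    df = pd.DataFrame(speeches)
    return df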

In [None]:
df = summarise_year(year=year, house=house)

In [None]:
df.head()

## Who made the most speeches?

In [None]:
df.loc[df['type'] == 'speech', 'speaker'].value_counts()[:20]

## Who made the most interjections?

In [None]:
df.loc[df['type'] == 'interjection', 'speaker'].value_counts()[:20]

## Who spoke the most words?

In [None]:
df.groupby(by='speaker')['words'].sum().to_frame().reset_index().sort_values('words', ascending=False)[:20]

## Which debates generated the most words?

Note that debate titles were recorded inconsistently, and OCR errors add further variation, so this grouping won't always combine titles that belong together. For more accurate results, you'd need to normalise the debate titles first.

In [None]:
df.groupby(by=['debate_title'])['words'].sum().to_frame().reset_index().sort_values('words', ascending=False)[:20]
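
As a starting point for that normalisation, the sketch below collapses whitespace, strips stray leading and trailing punctuation, and upper-cases each title. The `normalise_title` function is a hypothetical helper, and these rules are assumptions about the kinds of variation present; they won't repair OCR errors within words.

```python
import re

def normalise_title(title):
    '''
    Apply some simple, assumed clean-up rules to a debate title.
    (Hypothetical helper added for illustration.)
    '''
    # Leave missing values (None, NaN) untouched
    if not isinstance(title, str):
        return title
    # Collapse runs of whitespace, including line breaks, into single spaces
    title = re.sub(r'\s+', ' ', title).strip()
    # Remove punctuation and spaces left dangling at either end
    title = title.strip('.,;:- ')
    return title.upper()

for raw in ['  Supply  Bill. ', 'SUPPLY BILL', 'Supply\nBill -']:
    print(repr(normalise_title(raw)))
```

You could then add a cleaned column with `df['debate_title_norm'] = df['debate_title'].map(normalise_title)` and group on that instead of `debate_title`.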

## How many words were spoken each day of proceedings?

I've only included words in speeches with identified speakers (including interjections), so some procedural content might not be included in the totals.

In [None]:
words_per_day = df.groupby(by=['date'])['words'].sum().to_frame().reset_index()
alt.Chart(words_per_day).mark_bar(size=2).encode(
    x='date:T',
    y='words:Q',
    tooltip=['date:T', 'words:Q']
).properties(width=700)

## Most popular topics of questions

In [None]:
questions = (df['debate_type'] == 'Questions') | (df['debate_title'] == 'QUESTION') | (df['type'] == 'question')
df.loc[questions, 'subdebate_title'].value_counts()[:20]

----

Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io/).