# Shay, Florrie & Sam's Data Science Project: MASTER NOTE BOOK
AKA The Data Science Project team of dreams 

# Introduction to the Project and Dataset


This project will analyse data from the Old Bailey Online dataset, which is the largest collection of historical trial records from London's central criminal court, containing approximately 197,745 trials from 1674 to 1913. The Old Bailey Proceedings document detailed accounts of criminal trials, including information about defendants, victims, offences, verdicts, and punishments. We obtained the dataset from the Old Bailey API at https://www.oldbaileyonline.org/.

Let's get started.

In [None]:
# Just importing some standard modules that we'll probably need.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from scipy import stats

plt.style.use('fivethirtyeight')
pd.set_option('mode.copy_on_write', True)

# make sure you install the requirements.txt file to run this notebook!

# Importing and Cleaning

We parsed XML trial records and extracted key variables (defendants, victims, offences, verdicts, punishments).

# Data Acquisition and Cleaning

The Old Bailey's data is distributed as a large set of XML files. XML is useful here because it stores historical records in a structured, hierarchical format that preserves metadata and the relationships between people and events, which matters when you want to look up particular statistics for the Old Bailey. That said, XML files are somewhat finicky unless you know how to extract what you need from them.

Now, calling the below "data cleaning" may seem slightly disingenuous given we are selectively extracting particular variables, as opposed to correcting errors in the data itself. But, I digress. 

The first thing we need to do is get our imports and installs sorted.


In [52]:
# %pip runs pip within the kernel that is running this notebook, which !pip doesn't guarantee
%pip install pandas
%pip install word2number

import xml.etree.ElementTree as ET # We need ElementTree to parse XML files, which is the file type the Old Bailey works with.
# It will allow us to walk the XML files and directly pull their elements, as opposed to us using NLP (or something equally troublesome)

from pathlib import Path # pathlib will be important because we are analysing a LOT of XML files, and we'll need to be doing checks throughout to see if all is well

import pandas as pd
import csv
import re
from collections import Counter
from word2number import w2n





Now, let's first get all the files in one place and check we have all of them, given the sheer quantity the Old Bailey database contains.

In [None]:
input_dir = Path('.') # Given how many files we are working with, we can't actually keep the files in the GitHub repo.
# As such, we are using pathlib so the paths keep working across different computers.

da_xml_files = list(input_dir.glob('*.xml')) 
# This will search the folder for all files ending with '.xml', thus finding all the files we need. 
# Listing them by hand would be a bit problematic given how many there are. 
# The result is a list called 'da_xml_files', which contains the path of each of those files. 

print(f"Found {len(da_xml_files)} files") # This will double check how many files we have. 
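
A raw file count only tells us so much. One cheap extra check is to bucket the files by decade using their names. This is only a sketch: it assumes every session file is named with its date first (YYYYMMDD, as in `18411129.xml`), and `count_files_by_decade` is a helper name made up for this example.

```python
import re
from collections import Counter
from pathlib import Path

def count_files_by_decade(paths):
    """Count files per decade, assuming names start with YYYYMMDD (an assumption)."""
    per_decade = Counter()
    unparsed = []
    for path in paths:
        match = re.match(r'(\d{4})\d{4}', Path(path).stem)
        if match:
            decade = int(match.group(1)) // 10 * 10
            per_decade[decade] += 1
        else:
            unparsed.append(Path(path).name)  # Name doesn't fit the pattern, so inspect it by hand
    return per_decade, unparsed

# Made-up list of filenames, purely for illustration
example = [Path('16791210.xml'), Path('16931012.xml'), Path('16930426.xml'), Path('notes.xml')]
per_decade, unparsed = count_files_by_decade(example)
for decade in sorted(per_decade):
    print(f"{decade}s: {per_decade[decade]} files")
print(f"Could not parse: {unparsed}")
```

In the notebook itself this would be called as `count_files_by_decade(da_xml_files)`. A decade with suspiciously few files would be a hint that a download went missing.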

## The Function of Time

Right, so, the most important piece of information we can extract from the XML files is the date of each session. Without this, we won't be able to do much analysis at all, as it will all be a function of time. So, we are going to create a function that extracts the session date and year from the tags within the files. 

In [None]:
def if_i_could_put_time_in_a_dictionary(filepath):

    tree = ET.parse(filepath) # This will create an ElementTree object, which is how python essentially "sees" XML files
    root = tree.getroot() # This grabs the top-most tag for the XML files, and will help us find all the nested tags within it
    
    # Now we create empty strings for storage
    session_date = ""
    session_year = ""
    
    for div0 in root.iter('div0'): # The Old Bailey uses <div0> to mark top-level sections, like headers, cases, etc. So it's how we will split our data points by session as opposed to absorbing all of one XML file into a big mess

        if div0.get('type') == 'sessionsPaper': # This checks if the <div0> element has the attribute which is describing a court session with date and time. Otherwise, we're not interested. 

            for interp in div0.iter('interp'): # In Old Bailey, <interp> holds those elements like date, time, etc
                if interp.get('type') == 'date':
                    session_date = interp.get('value', '')
                elif interp.get('type') == 'year':
                    session_year = interp.get('value', '') # Note that Old Bailey stores year and date separately. So, we need to store them separately
            break
    
    if not session_year and len(session_date) >= 4: # Now, the data isn't perfect. So, if we are unable to extract the year directly, we fall back on the date string.
        session_year = session_date[:4] # The first four characters of an Old Bailey date are the year
    
    return {'date': session_date, 'year': session_year}

# Now we run the function on our XML files, and print a couple to check it's all worked! 
for xml_file in da_xml_files[:5]:
    filepath = xml_file # glob() already returns paths that include input_dir, so we don't join it on again
    metadata = if_i_could_put_time_in_a_dictionary(filepath)
    print(f"{xml_file}: {metadata}")
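
Since everything later is a function of time, it helps to turn the date string into a real date object. The sketch below assumes the `date` value is always in `YYYYMMDD` form (which the year-slicing fallback above also relies on). `parse_session_date` is a made-up helper name. It uses the standard library rather than pandas because pandas' default nanosecond-resolution timestamps only reach back to September 1677, and the dataset starts in 1674.

```python
from datetime import datetime

def parse_session_date(date_string):
    """Turn 'YYYYMMDD' into a datetime.date, or None if the string doesn't fit."""
    try:
        return datetime.strptime(date_string, '%Y%m%d').date()
    except ValueError:
        return None  # Empty or malformed date strings end up here

print(parse_session_date('18411129'))
print(parse_session_date('16740429'))
print(parse_session_date(''))
```

Returning `None` for bad values keeps the imperfect records visible, so they can be counted or filtered later rather than crashing the run.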

## Getting Defendant Names (Like that one scene from Shawshank Redemption)

Now, we need to extract the data on a per-defendant basis, and get a list of names/IDs. We want each row to describe a defendant, as opposed to a trial or a session, because some trials have more than one defendant, and rows on a per-trial or per-day basis don't really work if you want to get into demographic data.

Some information we want is their name, their ID, and their gender. If we want to test any hypothesis relating to women, we will need this data.

In [None]:
def dammit_dufresne_you_are_putting_me_behind(trial_elem):
    da_defendants = [] # Empty list to store defendant data
    
    # First, we'll create a loop that iterates over every XML name element within the trial element. 
    # Note that <persName> can be ANY name (defendant, witness, judge, etc), so we will need to do something about that
    for person in trial_elem.iter('persName'): 
        if person.get('type') == 'defendantName': # This will ensure we are only extracting defendants specifically
            def_id = person.get('id', '')
            
            given = ""
            surname = ""
            gender = ""
            
            # So, as we've established, the OldBailey XML files have tags for each defendant. 
            # It's actually quite amazing the work they've done because it is REALLY easy to extract if you know how
            # So, we create a for loop that will iterate over every value attribute (what stores defendant data) and assign the correct name and gender. 
            for interp in person.iter('interp'):
                inst = interp.get('inst')
                if inst == def_id:
                    interp_type = interp.get('type')
                    if interp_type == 'given':
                        given = interp.get('value', '')
                    elif interp_type == 'surname':
                        surname = interp.get('value', '')
                    elif interp_type == 'gender':
                        gender = interp.get('value', '')
            
            # Now, again, the Old Bailey data is HUGE and might not be perfect. 
            # So, just in case, we extract all the text content inside the <persName> element (which will be their name)
            # And we just put it in their name. 
            # It's not going to be perfect, but realistically, the name doesn't matter as much so we don't mind. 
            if not given and not surname:
                name_text = ''.join(person.itertext()).strip()
                given = name_text
            
            # Now we just create our dictionary and BOOM. All done. 
            da_defendants.append({
                'id': def_id,
                'given': given,
                'surname': surname,
                'gender': gender
            })
    
    return da_defendants

Now, let's run a quick test to make sure it all works.

In [None]:
# Test on a particular file (when we first did this, we picked five files at random. The below file was one of those five, hence why it's specific).
tree = ET.parse(input_dir / '18411129.xml')
root = tree.getroot()

count = 0 # We don't need to check all the trials within the file, so we'll set up a counter. 
for div1 in root.iter('div1'):
    if div1.get('type') == 'trialAccount':
        trial_id = div1.get('id')
        defs = dammit_dufresne_you_are_putting_me_behind(div1)
        print(f"{trial_id}: {defs}")
        count += 1
        if count >= 10: # End at 10 trials. Again, we don't need to waste time checking all of them. 
            break

Look there above. You can see some trials have more than one defendant. Thank god we decided to do this on a per-defendant basis, otherwise we would have been in trouble. Good luck Henry and Harriett with your trial (or rather, your t18411129)!
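
To make the per-defendant idea concrete, here is a minimal sketch of how one trial with two defendants becomes two rows. The XML fragment is made up and only imitates the tags used above (it is not a real record), and `defendant_rows` is a hypothetical helper, not one of the notebook's functions.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that imitates the tags used above; not a real Old Bailey record
SAMPLE = """
<div1 type="trialAccount" id="t00000000-1">
  <persName id="def1" type="defendantName">
    <interp inst="def1" type="gender" value="male"/>
    <interp inst="def1" type="given" value="HENRY"/>
    <interp inst="def1" type="surname" value="EXAMPLE"/>
    HENRY EXAMPLE
  </persName>
  <persName id="def2" type="defendantName">
    <interp inst="def2" type="gender" value="female"/>
    <interp inst="def2" type="given" value="HARRIETT"/>
    <interp inst="def2" type="surname" value="EXAMPLE"/>
    HARRIETT EXAMPLE
  </persName>
  <persName id="wit1" type="witnessName">A WITNESS</persName>
</div1>
"""

def defendant_rows(trial_elem):
    """Return one dict per defendant, each carrying the trial id it belongs to."""
    rows = []
    for person in trial_elem.iter('persName'):
        if person.get('type') != 'defendantName':
            continue  # Skip witnesses, victims, and so on
        # Collect this defendant's <interp> values into a dict keyed by type
        details = {
            interp.get('type'): interp.get('value', '')
            for interp in person.iter('interp')
            if interp.get('inst') == person.get('id')
        }
        rows.append({
            'trial_id': trial_elem.get('id', ''),
            'defendant_id': person.get('id', ''),
            'given': details.get('given', ''),
            'surname': details.get('surname', ''),
            'gender': details.get('gender', ''),
        })
    return rows

rows = defendant_rows(ET.fromstring(SAMPLE.strip()))
for row in rows:
    print(row)
```

Each row repeats the trial ID, so trial-level facts can be attached to every defendant later without losing track of who stood trial together.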

## What is the charge? Eating a meal? A succulent Chinese meal?

Well, defendant data is all well and good. But another key factor is the actual crime they committed, otherwise, this isn't really good court data. Thankfully, we can extract that pretty easily as well! Though, there is some further explanation we need to do before proceeding. 

See, the Old Bailey breaks down offences through a two-level categorisation: the category and the subcategory. The category is the broad type of crime tried, so the likes of theft, assault, murder, fraud, etc. The subcategory is a more specific classification within that broad category; for example, theft could specifically be burglary, pickpocketing, or shoplifting. One thing to note is that crime definitions change over time, even over decades, which is why it is actually hard to compare crime statistics across time (and why murder is usually a very good baseline, because its definition rarely changes. Either you've been murdered or you haven't). 

It should be noted that some defendants don't have subcategories on their crimes, and that's fine. It's not strictly vital, and it's something that can be filtered later if necessary.
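
As a rough illustration of filtering later, the sketch below tallies category/subcategory pairs and gives a blank subcategory an explicit label instead of dropping it. The records and their values are made up for the example, and `tally_offences` is a hypothetical helper.

```python
from collections import Counter

# Made-up offence records with 'category' and 'subcategory' keys, for illustration only
offences = [
    {'id': 'o1', 'category': 'theft', 'subcategory': 'burglary'},
    {'id': 'o2', 'category': 'theft', 'subcategory': ''},
    {'id': 'o3', 'category': 'theft', 'subcategory': 'shoplifting'},
    {'id': 'o4', 'category': 'theft', 'subcategory': 'burglary'},
]

def tally_offences(offences, missing_label='(none)'):
    """Count (category, subcategory) pairs, labelling blank subcategories."""
    counts = Counter()
    for offence in offences:
        subcategory = offence['subcategory'] or missing_label
        counts[(offence['category'], subcategory)] += 1
    return counts

counts = tally_offences(offences)
for (category, subcategory), n in counts.most_common():
    print(f"{category} / {subcategory}: {n}")
```

Keeping the blanks under their own label means we can see how many records lack a subcategory before deciding whether to filter them.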


In [None]:
# Right, so most of this stuff is the same again, so I won't go explaining everything. 
def i_am_offended(trial_elem):
    get_off_fences = [] # Offence dictionary storage
    
    for rs in trial_elem.iter('rs'): # Note we are using <rs> here and not <persName> because Old Bailey stores offence data inside <rs> elements
        # Specifically with type="offenceDescription"
        if rs.get('type') == 'offenceDescription':
            off_id = rs.get('id', '')
            category = ""
            subcategory = ""
            
            for interp in rs.iter('interp'):
                interp_type = interp.get('type')
                if interp_type == 'offenceCategory':
                    category = interp.get('value', '')
                elif interp_type == 'offenceSubcategory':
                    subcategory = interp.get('value', '')
            
            get_off_fences.append({
                'id': off_id,
                'category': category,
                'subcategory': subcategory
            })
    
    return get_off_fences

And there you have it. Now, we do the test once again to just check everything is working fine.

In [None]:
tree = ET.parse(input_dir / '18411129.xml')
root = tree.getroot()

count = 0
for div1 in root.iter('div1'):
    if div1.get('type') == 'trialAccount':
        trial_id = div1.get('id')
        offs = i_am_offended(div1)  
        print(f"{trial_id}: {offs}")
        
        count += 1
        if count >= 10:  
            break

As a quick aside: Theft Receiving means that the defendant knowingly accepted or bought goods that had been stolen, which is still a crime. Anyway, not important (truthfully speaking, I didn't know what it was and had to look it up. And I'm explaining it just in case).

We move on.

## 1000 counts of murder? To Arkham Asylum for the hundredth time with you! 

Right, and of course, we need the actual verdict and punishment, otherwise we have no value to the offences assigned to each defendant. After all, if they were all found innocent, then the Old Bailey dataset would be quite odd indeed. 

Once again, note there are subcategories. So, you can have guilty, not guilty as categories, and "guilty of theft" or "guilty of assault" as subcategories. Regarding punishment, categories work in the form of imprisonment -> 6 month imprisonment. So, the subcategories are actually important here and we will need to properly extract them. 

In [None]:
# Again, same old stuff. 
def we_the_jury_find_this_code_to_be_awful(trial_elem):
    john_verdicts = []
    
    for rs in trial_elem.iter('rs'):
        if rs.get('type') == 'verdictDescription':
            ver_id = rs.get('id', '')
            category = ""
            subcategory = ""
            
            for interp in rs.iter('interp'):
                interp_type = interp.get('type')
                if interp_type == 'verdictCategory':
                    category = interp.get('value', '')
                elif interp_type == 'verdictSubcategory':
                    subcategory = interp.get('value', '')
            
            john_verdicts.append({
                'id': ver_id,
                'category': category,
                'subcategory': subcategory
            })
    
    return john_verdicts

def who_even_is_hammurabi_brah(trial_elem):
    arya_starks_list = [] # Storage for defendant punishments
    
    for rs in trial_elem.iter('rs'):
        if rs.get('type') == 'punishmentDescription':
            pun_id = rs.get('id', '')
            category = ""
            subcategory = ""
            
            for interp in rs.iter('interp'):
                interp_type = interp.get('type')
                if interp_type == 'punishmentCategory':
                    category = interp.get('value', '')
                elif interp_type == 'punishmentSubcategory':
                    subcategory = interp.get('value', '')
            
            arya_starks_list.append({
                'id': pun_id,
                'category': category,
                'subcategory': subcategory
            })
    
    return arya_starks_list
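
The offence, verdict, and punishment extractors all follow the same shape, differing only in the tag names. As a sketch of an alternative (not what the rest of this notebook uses), a single function could take the kind as a parameter. `extract_described` is a made-up name, and the XML fragment and its values are invented to imitate the tags above.

```python
import xml.etree.ElementTree as ET

def extract_described(trial_elem, kind):
    """Collect <rs type="<kind>Description"> elements with category and subcategory.

    `kind` is expected to be 'offence', 'verdict' or 'punishment'.
    """
    records = []
    for rs in trial_elem.iter('rs'):
        if rs.get('type') != f'{kind}Description':
            continue
        record = {'id': rs.get('id', ''), 'category': '', 'subcategory': ''}
        for interp in rs.iter('interp'):
            if interp.get('type') == f'{kind}Category':
                record['category'] = interp.get('value', '')
            elif interp.get('type') == f'{kind}Subcategory':
                record['subcategory'] = interp.get('value', '')
        records.append(record)
    return records

# Made-up fragment imitating the tags above; not a real record
SAMPLE = """
<div1 type="trialAccount" id="t00000000-1">
  <rs id="off1" type="offenceDescription">
    <interp inst="off1" type="offenceCategory" value="theft"/>
    <interp inst="off1" type="offenceSubcategory" value="burglary"/>
  </rs>
  <rs id="ver1" type="verdictDescription">
    <interp inst="ver1" type="verdictCategory" value="guilty"/>
  </rs>
</div1>
"""

trial = ET.fromstring(SAMPLE.strip())
print(extract_described(trial, 'offence'))
print(extract_described(trial, 'verdict'))
print(extract_described(trial, 'punishment'))
```

The trade-off is one place to fix bugs versus three functions with more memorable names.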

And once again, a quick test.

In [None]:
tree = ET.parse(input_dir / '18411129.xml')
root = tree.getroot()

count = 0
for div1 in root.iter('div1'):
    if div1.get('type') == 'trialAccount':
        trial_id = div1.get('id')
        verd = we_the_jury_find_this_code_to_be_awful(div1)
        pun = who_even_is_hammurabi_brah(div1)
        
        print(f"{trial_id}:")
        print(f"  Verdicts: {verd}")
        print(f"  Punishments: {pun}")
        
        count += 1
        if count >= 10: 
            break

## Victim? I hardly know him.

Right, the next thing we are going to do is extract details of the victims of each trial, similarly to how we did the defendants. This is mostly because it could yield interesting data regarding what crimes were committed against whom. Take, for example, a hypothesis regarding women: it's important to know the crimes being committed AGAINST women as well as by them if we want to get the full picture.

In [None]:
def get_victims_with_consent(trial_elem):
    whomst_wronged = []
    
    for person in trial_elem.iter('persName'):
        if person.get('type') == 'victimName':
            vic_id = person.get('id', '')
            
            given = ""
            surname = ""
            gender = ""
            
            for interp in person.iter('interp'):
                inst = interp.get('inst')
                if inst == vic_id:
                    interp_type = interp.get('type')
                    if interp_type == 'given':
                        given = interp.get('value', '')
                    elif interp_type == 'surname':
                        surname = interp.get('value', '')
                    elif interp_type == 'gender':
                        gender = interp.get('value', '')
            
            if not given and not surname: # As we did before with defendant names. This isn't always going to be perfect. But we don't necessarily need it to be. 
                name_text = ''.join(person.itertext()).strip()
                given = name_text
            
            whomst_wronged.append({
                'id': vic_id,
                'given': given,
                'surname': surname,
                'gender': gender
            })
    
    return whomst_wronged

You know what comes next.

In [None]:
tree = ET.parse(input_dir / '18411129.xml')
root = tree.getroot()

count = 0
for div1 in root.iter('div1'):
    if div1.get('type') == 'trialAccount':
        trial_id = div1.get('id')
        vics = get_victims_with_consent(div1)
        
        print(f"{trial_id}: Victims: {vics}")
        
        count += 1
        if count >= 10:  
            break

## Making Sense of the Madness

Now, where do we go from here? Well, we have huge lists of defendants and their crimes, victims, genders, and the like. But the issue is that currently they are all scattered about in isolation, each in their own list. What we need is code that will join this information together: we need to tell which defendant committed which offence and received which punishment, and which victims were affected. This is the process that will actually let us turn the scattered, raw XML into a dataset in which each row corresponds to one instance of a defendant, potentially even one row for each of their crimes. 

In [None]:
def join_together_wholesome(trial_elem):
    # First off, lets create a dictionary to store all the information. A super dictionary if you will. 
    wholesome_joins = {
        'criminalCharge': [],       # This will link the defendant to the offence and its verdict
        'defendantPunishment': [],  # This will link the defendant to the punishment
        'offenceVictim': [],        # This will link the defendant's offence to their victim
        'offencePlace': [],         # This will link the defendant's offence to the location in which it occurred
        'offenceCrimeDate': []      # This will link the defendant's offence to the date it took place
    }
    
    for join in trial_elem.iter('join'): # This will loop over all <join> elements. These elements are in the Old Bailey XML
        # Basically, the Old Bailey team created <join> elements that link two entities together
        result = join.get('result', '')
        if result in wholesome_joins: # This will make sure to only keep joins that match the keys in the joins dictionary
            # Basically, we don't want to pair random cases together
            targets = join.get('targets', '').split()
            wholesome_joins[result].append({
                'id': join.get('id', ''),
                'targets': targets
            })
    
    return wholesome_joins


# Let's test this to actually make sure that it works. 
# We'll use the Dell case, which has a multi-defendant instance with multiple charges.
# If it works on this one, it'll work on all of them. 

tree = ET.parse(input_dir / '16791210.xml') # This is the file the Dell case is in. 
root = tree.getroot()

for div1 in root.iter('div1'):
    if div1.get('id') == 't16791210-10': # And this is the ID of the Dell case. 
        joins = join_together_wholesome(div1)
        print("Criminal charges (defendant -> offence -> verdict):")
        for j in joins['criminalCharge']:
            print(f"  {j['targets']}")
        print("\nDefendant punishments:")
        for j in joins['defendantPunishment']:
            print(f"  {j['targets']}")
        print("\nOffence victims:")
        for j in joins['offenceVictim']:
            print(f"  {j['targets']}")

# Let's see if it works.

And there you have it. It worked. You can see above that each list shows a link of defendant ID, offence ID, and verdict ID. This means the function successfully captured which defendant actually committed which offence and received what verdict. The following list has the punishment as well.

Now, one interesting thing to note is the offence victim list is empty. That is odd. It's likely the case that this particular case had no listed victims. That said, we should still look into it. 

Let's search for trials that specifically HAVE an offence-victim relationship (so a victim who is identified), and then see if the code works on that. If not, then we'll know something is wrong. 

In [None]:
# We'll search the same file to see which cases have victim IDs linked to offences. 
tree = ET.parse(input_dir / '16791210.xml')
root = tree.getroot()

print("1679 file - offenceVictim joins:")
for div1 in root.iter('div1'):
    if div1.get('type') == 'trialAccount':
        joins = join_together_wholesome(div1)
        if joins['offenceVictim']:
            print(f"  {div1.get('id')}: {joins['offenceVictim']}")

And there you have it. All sorted. It's just that the Dell case didn't have any victims listed. 

As such, we can move on from here. 
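
For a sense of how these joins get used, here is a minimal sketch that turns `criminalCharge` joins into rows by looking each target ID up among the defendants, offences, and verdicts already extracted. It looks IDs up by kind rather than relying on the order of the targets. `resolve_charges` and all the records below are made up for illustration.

```python
def resolve_charges(charge_joins, defendants, offences, verdicts):
    """Turn criminalCharge joins into rows by looking up each target id by kind."""
    by_id = {}
    for kind, records in (('defendant', defendants), ('offence', offences), ('verdict', verdicts)):
        for record in records:
            by_id[record['id']] = (kind, record)

    rows = []
    for join in charge_joins:
        row = {'defendant': None, 'offence': None, 'verdict': None}
        for target in join['targets']:
            if target in by_id:  # Ignore ids we never extracted
                kind, record = by_id[target]
                row[kind] = record
        rows.append(row)
    return rows

# Made-up records: two defendants charged with the same offence, different verdicts
defendants = [{'id': 'def1', 'given': 'A', 'surname': 'EXAMPLE', 'gender': 'male'},
              {'id': 'def2', 'given': 'B', 'surname': 'EXAMPLE', 'gender': 'female'}]
offences = [{'id': 'off1', 'category': 'theft', 'subcategory': ''}]
verdicts = [{'id': 'ver1', 'category': 'guilty', 'subcategory': ''},
            {'id': 'ver2', 'category': 'notGuilty', 'subcategory': ''}]
charge_joins = [{'id': 'j1', 'targets': ['def1', 'off1', 'ver1']},
                {'id': 'j2', 'targets': ['def2', 'off1', 'ver2']}]

rows = resolve_charges(charge_joins, defendants, offences, verdicts)
for row in rows:
    print(row['defendant']['id'], row['offence']['category'], row['verdict']['category'])
```

A `None` left in a row flags a join that points at something we never extracted, which is worth counting before trusting the final dataset.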

## Crime Information Innit

I can't think of a funny, cultural reference to go with this section. Sorry.

Anyway, let's now make some functions to extract more information about each crime, as well as some more important information about the defendants, such as their age and occupation. 

In [None]:
# Location of Crime
def WHERE_is_she(trial_elem):
    where_it_happenin = []
    
    for place in trial_elem.iter('placeName'):
        place_id = place.get('id', '')  
        place_name = ''.join(place.itertext()).strip()  
        place_type = ""
        
        for interp in place.iter('interp'):
            if interp.get('type') == 'type':
                place_type = interp.get('value', '')
        
        if place_type == 'crimeLocation':
            where_it_happenin.append({
                'id': place_id,
                'name': place_name
            })
    
    return where_it_happenin

# Date of Crime
def omg_crime_has_got_a_date(trial_elem):
    when_it_happenin = []
    
    for rs in trial_elem.iter('rs'):
        if rs.get('type') == 'crimeDate':
            when_it_happenin.append({
                'id': rs.get('id', ''),
                'date': ''.join(rs.itertext()).strip()
            })
    
    return when_it_happenin


# Onto Defendant Occupation
# I am going to do a brief explanation for this one as, functionally, it is different to the previous ones
# Our previous functions worked off a "give me everything related to X within a certain trial" basis
# The code below works more like "GIVEN this defendant, what is their occupation?"
# The key is the <join> element. As before, we search for related pieces of information that are connected via <join>

def get_a_job(trial_elem, defendant_id):
    for join in trial_elem.iter('join'):
        if join.get('result') == 'persNameOccupation': # This makes sure all other joins except occupation are ignored
            targets = join.get('targets', '').split() # Get the IDs the <join> links together
            if defendant_id in targets: # Checks if the defendant is relevant to the relationship. If not, skip.
                for target in targets: # Loop over each linked ID (defendant + occupation)
                    for rs in trial_elem.iter('rs'): # Look through all <rs> elements in the trial
                        # The reason we are doing this is because the <join> tag only tells us which IDs are related
                        # It doesn't tell us where the actual text is that says what their job is
                        if rs.get('id') == target and rs.get('type') == 'occupation': # So we find the <rs> with a matching ID that is tagged as an occupation
                            return ''.join(rs.itertext()).strip()
    return ""

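# Defendant Age
# Like occupation, this looks up one specific defendant. But age is stored as an <interp> inside their <persName>, so no <join> is needed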
def how_old_are_you(trial_elem, defendant_id):
    """Find age for a specific defendant."""
    for person in trial_elem.iter('persName'):
        if person.get('id') == defendant_id:
            for interp in person.iter('interp'):
                if interp.get('type') == 'age':
                    return interp.get('value', '')
    return ""
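
One thing to be aware of is that `get_a_job` re-scans every `<rs>` in the trial for each target ID. If that ever becomes slow, a sketch of an alternative is to index the occupation text by ID once and then resolve each join with a dictionary lookup. `occupations_by_person` is a hypothetical helper, and the fragment below is made up to imitate the tags described above.

```python
import xml.etree.ElementTree as ET

# Made-up fragment imitating the tags described above; not a real record
SAMPLE = """
<div1 type="trialAccount" id="t00000000-1">
  <persName id="def1" type="defendantName">JOHN EXAMPLE</persName>
  <rs id="occ1" type="occupation">weaver</rs>
  <join result="persNameOccupation" targets="def1 occ1"/>
  <persName id="def2" type="defendantName">JANE EXAMPLE</persName>
</div1>
"""

def occupations_by_person(trial_elem):
    """Map each person id to the occupation text joined to it."""
    # Index occupation text by element id so each join becomes a dictionary lookup
    occupation_text = {
        rs.get('id'): ''.join(rs.itertext()).strip()
        for rs in trial_elem.iter('rs')
        if rs.get('type') == 'occupation'
    }
    result = {}
    for join in trial_elem.iter('join'):
        if join.get('result') != 'persNameOccupation':
            continue
        targets = join.get('targets', '').split()
        jobs = [occupation_text[t] for t in targets if t in occupation_text]
        if not jobs:
            continue
        for target in targets:
            if target not in occupation_text:
                result.setdefault(target, jobs[0])  # Keep the first occupation found
    return result

jobs = occupations_by_person(ET.fromstring(SAMPLE.strip()))
print(jobs)
print(jobs.get('def2', ''))  # No join for this defendant, so we fall back to ""
```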

Let's do some quick tests.

In [None]:
tree = ET.parse(input_dir / '16931012.xml')
root = tree.getroot()

count = 0
for div1 in root.iter('div1'):
    if div1.get('type') == 'trialAccount':
        trial_id = div1.get('id')
        locs = WHERE_is_she(div1)
        dates = omg_crime_has_got_a_date(div1)
        
        if locs or dates:  # Makes sure we only print trials with at least one location or date
            print(f"{trial_id}:")
            if locs:
                print(f"  Locations: {locs}")
            if dates:
                print(f"  Dates: {dates}")
            count += 1
            if count >= 10:  
                break

Now hold on, the location doesn't appear to be extracted by our code. Let's run a test across all the XML files to see whether the issue is simply that this particular file doesn't have locations within it. The code below will find the first XML file that has a viable location. 

In [None]:
found = False

for xml_file in da_xml_files:
    filepath = xml_file # glob() already returns paths that include input_dir, so we don't join it on again
    tree = ET.parse(filepath)
    root = tree.getroot()
    
    for div1 in root.iter('div1'):
        if div1.get('type') == 'trialAccount':
            locations = WHERE_is_she(div1)
            if locations:
                print(f"Found location in {xml_file}, trial {div1.get('id')}: {locations}")
                found = True
                break  # Stop after first location in this file
    if found:
        break  # Stop after finding first location in all files

if not found:
    print("No crime locations found in any file.")

Stoke Newington of all places. God forbid the land of the gentrified today experiences such crimes. Anyway, let's test that file specifically to see if it all works.

In [None]:
tree = ET.parse(input_dir / '17151012.xml')
root = tree.getroot()

count = 0
for div1 in root.iter('div1'):
    if div1.get('type') == 'trialAccount':
        trial_id = div1.get('id')
        locs = WHERE_is_she(div1)
        dates = omg_crime_has_got_a_date(div1)
        
        if locs and dates:  # We'll briefly change this to 'and' so we can actually pull up good ones. 
            print(f"{trial_id}:")
            if locs:
                print(f"  Locations: {locs}")
            if dates:
                print(f"  Dates: {dates}")
            count += 1
            if count >= 50:  
                break

## David Lammy would have this section not exist

Right, now we are going to extract our last direct piece of information (there is one more thing that is less direct, but we'll go into that later): judge and jury information. This is important because if we want to test any hypothesis about historical bias in the judicial system, we will need to see patterns with specific jurors and judges. 

So, we are going to extract those on a per-offence basis, and it will be quite interesting to see later if any judges have any particular predilection towards a certain punishment.

In [None]:
def its_too_late_i_have_already_depicted_you_as_the_buffoonish_juror_three_and_myself_as_the_unbothered_and_collected_juror_twelve(root): # I am not sorry for referencing 12 Angry Men. Good film. You should watch it
    # I am maybe slightly sorry about the length of that function name lol
    chewbacca_defence = []
    for person in root.iter('persName'):
        if person.get('type') == 'jurorName':
            juror_id = person.get('id', '')
            given = ""
            surname = ""
            for interp in person.iter('interp'):
                if interp.get('type') == 'given':
                    given = interp.get('value', '')
                elif interp.get('type') == 'surname':
                    surname = interp.get('value', '')
            chewbacca_defence.append({
                'id': juror_id,
                'name': f"{given} {surname}".strip()
            })
    return chewbacca_defence

def someones_judgy(root):
    fudges = []
    for person in root.iter('persName'):
        if person.get('type') == 'judiciaryName':
            judge_id = person.get('id', '')
            given = ""
            surname = ""
            for interp in person.iter('interp'):
                if interp.get('type') == 'given':
                    given = interp.get('value', '')
                elif interp.get('type') == 'surname':
                    surname = interp.get('value', '')
            fudges.append({
                'id': judge_id,
                'name': f"{given} {surname}".strip()
            })
    return fudges


Test. Test. Test.

In [None]:
tree = ET.parse(input_dir / '17840707.xml')
root = tree.getroot()

jurors = its_too_late_i_have_already_depicted_you_as_the_buffoonish_juror_three_and_myself_as_the_unbothered_and_collected_juror_twelve(root)
judges = someones_judgy(root)

print(f"Jurors ({len(jurors)}):")
for j in jurors[:5]:
    print(f"  {j}")

print(f"\nJudges ({len(judges)}):")
for j in judges[:5]:
    print(f"  {j}")

## You see, four Farthings make a Halfpenny, and two Halfpennies make one Penny, and four Pence make a Groat, and three Groat make a Shilling, and five Shillings make a Pou- oh, you get the point...
---
## TLDR, this section is about money

Sean_Bean_Money.mp3. Anyway, money is being handled differently because it is not explicitly tagged in the XML files like defendants or offences are. So, instead of reading structured `<rs>` or `<persName>` elements, we must write code that searches the raw text for patterns like "26 s" or "6 d". Basically, we need to perform a bit of regex, which does open us up to errors, but we don't have to be absolutely perfect with the code either. 

Now, we can't just pull from all the text as then we'd have the issue of including things like fines, wages, etc. What we want is the stolen value and the fined value only. Luckily, Old Bailey contains stolen value in the offenceDescription element, so we can search specifically there to filter out all the other money nonsense.

Time to parse that free text of the Old Bailey and see if we can get this to work!
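
Before running anything on the real files, here is a tiny sketch of what this kind of pattern matching picks up, and one way it can go wrong. The sentence is made up in the style of an offence description (it is not a real record), and the patterns are illustrative rather than the final ones.

```python
import re

# Made-up sentence in the style of an offence description; not a real record
sample = ("stealing 1 silver spoon, value 5 s. ; 3 pounds weight of sugar, "
          "value 2 s. 6 d. ; and 2 l. in money")

# A number, optional whitespace, then l, s or d followed by a full stop
abbreviations = re.findall(r'(\d+)\s*([lsd])\.', sample)
print(abbreviations)

# Matching the word "pounds" also catches weights, which are not money
word_pounds = re.findall(r'(\d+)\s*pounds?', sample, re.IGNORECASE)
print(word_pounds)
```

The "1 silver spoon" is correctly skipped because the `s` isn't followed by a full stop, but the "3 pounds weight of sugar" shows why word-based patterns need treating with some suspicion.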

In [None]:
# The first thing we need to do is find what the Old Bailey actually uses for currency. We don't want any curveballs like marks or something. Or Groats. 

what_currencies = Counter()

for filepath in da_xml_files:
    tree = ET.parse(filepath)
    root = tree.getroot()
    
    text = ''.join(root.itertext())
    
    strike_matches = re.findall(r'\d+\s*([a-zA-Z]{1,10})\.', text)
    for m in strike_matches:
        what_currencies[m.lower()] += 1

print("Most common patterns (number followed by letters then period):")
for pattern, count in what_currencies.most_common(50):
    print(f"  {pattern}: {count}")

Aha! As we can see, our most common ones are shillings, pounds, pence, and guineas. Now, given the "offenceDescription" element only contains monetary reference to stealing, we can thus make code to count this up. Note however, that someone may steal money AND an item valued at a certain amount. So, we need to add that together into a total monetary value that was stolen. 

In [None]:
def wheres_the_money_lebowski(trial_elem):
    # First, we need to combine all the text inside the offenceDescription tags so we can search it
    text = ""
    for rs in trial_elem.iter('rs'):
        if rs.get('type') == 'offenceDescription':
            text += ' ' + ''.join(rs.itertext())
    
    pounds = 0 
    shillings = 0
    pence = 0
    guineas = 0 
    # We are using integers now so we can add multiple values
    
    # Pound search 
    pound_matches = re.findall(r'(\d+)\s*l\.', text) # Note that l. (from Latin's libra) is the old abbreviation for Pounds, not the £ we know and love.
    for match in pound_matches:
        pounds += int(match)
    
    pound_word_matches = re.findall(r'(\d+)\s*pounds?', text, re.IGNORECASE) # Catch pounds in words. 
    for match in pound_word_matches:
        pounds += int(match)
    
    # Shilling search 
    shilling_matches = re.findall(r'(\d+)\s*s\.', text) # Shilling is s. (from Latin's solidus)
    for match in shilling_matches:
        shillings += int(match)
    
    shilling_word_matches = re.findall(r'(\d+)\s*shillings?', text, re.IGNORECASE) 
    for match in shilling_word_matches:
        shillings += int(match)
    
    # Pence search 
    pence_matches = re.findall(r'(\d+)\s*d\.', text) # Pence is d. (from Latin's denarius)
    for match in pence_matches:
        pence += int(match)
    
    pence_word_matches = re.findall(r'(\d+)\s*pence', text, re.IGNORECASE) # Make sure we also catch things like "6 pence"
    for match in pence_word_matches:
        pence += int(match)
    
    # Guinea search
    guinea_matches = re.findall(r'(\d+)\s*guineas?', text, re.IGNORECASE)
    for match in guinea_matches:
        guineas += int(match)
    
    return pounds, shillings, pence, guineas
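
The function above returns four separate counts. If a single figure is needed later, one possible way to collapse them is sketched below. The helper name `total_in_pence` is hypothetical, and the guinea is assumed to be worth 21 shillings (its fixed value from 1717 onwards), so earlier trials may need a different rate.

```python
# Pre-decimal conversion rates: 12 pence to the shilling, 20 shillings to the pound.
PENCE_PER_SHILLING = 12
SHILLINGS_PER_POUND = 20
# Assumption: the guinea is taken at 21 shillings, its fixed value from 1717 onwards.
SHILLINGS_PER_GUINEA = 21

def total_in_pence(pounds, shillings, pence, guineas=0):
    """Collapse separate pound/shilling/pence/guinea counts into one figure in pence."""
    total_shillings = (pounds * SHILLINGS_PER_POUND
                       + guineas * SHILLINGS_PER_GUINEA
                       + shillings)
    return total_shillings * PENCE_PER_SHILLING + pence

# Hypothetical example: 2 pounds, 5 shillings, 6 pence and 1 guinea.
print(total_in_pence(2, 5, 6, guineas=1))  # 798
```

Working in pence keeps everything as integers, which avoids rounding problems when values are summed or compared.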

Let's do a quick test to see if this shambolic attempt at code actually works.

In [None]:
tree = ET.parse(input_dir / '16931012.xml')
root = tree.getroot()

count = 0
for div1 in root.iter('div1'):
    if div1.get('type') == 'trialAccount':
        trial_id = div1.get('id')
        pounds, shillings, pence, guineas = wheres_the_money_lebowski(div1)
        if pounds or shillings or pence or guineas:
            print(f"{trial_id}: {pounds} pounds, {shillings} shillings, {pence} pence, {guineas} guineas")
            count += 1
            if count >= 10:
                break

Alright! It worked. Now we can do the same for fines, which luckily only appear in the "punishmentDescription" tag. In fact, every fine is tagged with the 'fine' punishment subcategory, so it is really easy for us to get those numbers. On the flip side, the records misspell the word "value" multiple times, so we should avoid a regex that relies on that word.

We are also going to include "marks" for fines. The mark was a unit of account rather than a coin, worth two-thirds of a pound (13 shillings and 4 pence). I saw it in one or two fines, so we may as well be certain.
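
To show how the subcategory filter works in isolation, here is a small sketch run against a made-up XML fragment. The fragment is hypothetical and heavily simplified (real files carry far more markup), and `fine_texts` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified trial fragment for illustration only.
sample_xml = """
<div1 type="trialAccount" id="example-1">
  <rs type="punishmentDescription">
    <interp type="punishmentSubcategory" value="fine"/>
    fined 5 marks
  </rs>
  <rs type="punishmentDescription">
    <interp type="punishmentCategory" value="imprison"/>
    imprisoned for 6 months
  </rs>
</div1>
"""

def fine_texts(trial_elem):
    """Return the text of every punishmentDescription tagged with the 'fine' subcategory."""
    texts = []
    for rs in trial_elem.iter("rs"):
        if rs.get("type") != "punishmentDescription":
            continue
        is_fine = any(
            interp.get("type") == "punishmentSubcategory" and interp.get("value") == "fine"
            for interp in rs.iter("interp")
        )
        if is_fine:
            # Join the text pieces and collapse the whitespace left by indentation.
            texts.append(" ".join("".join(rs.itertext()).split()))
    return texts

trial = ET.fromstring(sample_xml)
print(fine_texts(trial))  # ['fined 5 marks']
```

Only the first `<rs>` element survives the filter, so the imprisonment text never reaches the money regexes.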

In [None]:
def damn_you_fine(trial_elem):
    # Search only within punishmentDescription tags that are specifically fines. 
    # Very similar to how we did before with the <rs> element. 

    text = ""
    for rs in trial_elem.iter('rs'):
        if rs.get('type') == 'punishmentDescription':
            # Check if this is a fine
            for interp in rs.iter('interp'):
                if interp.get('type') == 'punishmentSubcategory' and interp.get('value') == 'fine':
                    text += ' ' + ''.join(rs.itertext())
                    break
    
    fine_pounds = 0
    fine_shillings = 0
    fine_pence = 0
    fine_guineas = 0
    fine_marks = 0
    
    # Pounds 
    for match in re.findall(r'(\d+)\s*l\.', text):
        fine_pounds += int(match)
    for match in re.findall(r'(\d+)\s*pounds?', text, re.IGNORECASE):
        fine_pounds += int(match)
    
    # Shillings
    for match in re.findall(r'(\d+)\s*s\.', text):
        fine_shillings += int(match)
    for match in re.findall(r'(\d+)\s*shillings?', text, re.IGNORECASE):
        fine_shillings += int(match)
    
    # Pence 
    for match in re.findall(r'(\d+)\s*d\.', text):
        fine_pence += int(match)
    for match in re.findall(r'(\d+)\s*pence', text, re.IGNORECASE):
        fine_pence += int(match)
    
    # Guineas
    for match in re.findall(r'(\d+)\s*guineas?', text, re.IGNORECASE):
        fine_guineas += int(match)
    
    # Marks
    for match in re.findall(r'(\d+)\s*marke?s?', text, re.IGNORECASE):
        fine_marks += int(match)
    
    return fine_pounds, fine_shillings, fine_pence, fine_guineas, fine_marks

And a quick test code!

In [None]:
count = 0
for filepath in da_xml_files:
    if count >= 15:
        break
    tree = ET.parse(filepath)
    root = tree.getroot()
    
    for div1 in root.iter('div1'):
        if count >= 15:
            break
        if div1.get('type') == 'trialAccount':
            trial_id = div1.get('id')  # Without this, the print below reuses a stale trial_id from the earlier test cell
            pounds, shillings, pence, guineas, marks = damn_you_fine(div1)
            if pounds or shillings or pence or guineas or marks:
                print(f"{trial_id}: {pounds} pounds, {shillings} shillings, {pence} pence, {guineas} guineas, {marks} marks")
                count += 1

Awesome, awesome! We got that all to work perfectly! Now, let's move on. 

## Code of Hammurabi or something, idk I didn't watch the movie

Right, now onto our last issue. Imprisonment time isn't something that can be directly extracted either. So, we will need to do that similarly to the money amounts in order to see for how LONG someone was imprisoned. 

In [None]:
def five_HUNDRED_life_sentences(trial_elem):
    text = ""
    for rs in trial_elem.iter('rs'): # This is very similar to before. We are searching the <rs> elements for the imprison punishment category
        if rs.get('type') == 'punishmentDescription':
            for interp in rs.iter('interp'):
                if interp.get('type') == 'punishmentCategory' and interp.get('value') == 'imprison':
                    text += ' ' + ''.join(rs.itertext()) # This will add the full text of our <rs> element to our text string, meaning we can analyse it.
                    # I might have already explained what that line does in a previous function but uh
                    # My memory is going ngl
                    break
    
    # Some common misspellings I found in the XML files. We are likely to miss some. 
    # More can be added if we discover more. 
    minor_spelling_mistake = {
        'tweleve': 'twelve',
        'bight': 'eight',
        'pour': 'four'
    }
    
    # I ran some code as to find the words which provide the most noise. 
    # This will allow us to exclude them. 
    # As with the spelling mistakes, we can add more should it become an issue. But given we have so MANY trials and so few errors
    # We shouldn't need to worry
    noise = ['last', 'calendar', 'calender', 'each', 'first', 'there', 'sir', 'fine', 'the', 'a', 'an']
    

    # First we find a number (digits or a word) followed by a time unit, ignoring letter case
    match = re.search(r'([a-zA-Z\-]+|\d+)\s*(month|year|week|day)s?', text, re.IGNORECASE)
    if match:
        num_str = match.group(1).lower()
        unit = match.group(2).lower() # If we find one, we extract the number and the unit and convert them both to lower case
        
        # This will fix the three common spelling mistakes we found. 
        if num_str in minor_spelling_mistake:
            num_str = minor_spelling_mistake[num_str] 
        
        # Skip the common noise words, we don't need them
        if num_str in noise:
            return ""
        
        # Now we use the word2number import to convert them and bam. We done. 
        try:
            if num_str.isdigit():
                num = int(num_str)
            else:
                num = w2n.word_to_num(num_str)
            return f"{num} {unit}s"
        except Exception:  # word2number could not interpret the word as a number
            return ""
    
    return ""
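
For analysis it may help to turn strings like "6 months" into one comparable number. Below is a minimal sketch of that step. The helper name `duration_in_days` is hypothetical, and the conversion factors (30 days to a month, 365 to a year) are rough assumptions rather than anything defined by the dataset.

```python
# Assumed approximate conversion factors; months and years are rounded.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def duration_in_days(duration):
    """Convert a string such as '6 months' to an approximate number of days.

    Returns None for an empty or unrecognised string.
    """
    parts = duration.split()
    if len(parts) != 2 or not parts[0].isdigit():
        return None
    unit = parts[1].lower().rstrip("s")  # 'months' -> 'month'
    if unit not in DAYS_PER_UNIT:
        return None
    return int(parts[0]) * DAYS_PER_UNIT[unit]

for example in ["6 months", "2 years", "14 days", ""]:
    print(repr(example), "->", duration_in_days(example))
```

Returning None for the empty string keeps trials without an extracted duration distinguishable from genuine zero-length sentences.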


It's the FINAL test. Hell yeah, I am sick of writing these.

In [None]:
count = 0
for filepath in da_xml_files:
    if count >= 15:
        break
    tree = ET.parse(filepath)
    root = tree.getroot()
    
    for div1 in root.iter('div1'):
        if count >= 15:
            break
        if div1.get('type') == 'trialAccount':
            duration = five_HUNDRED_life_sentences(div1)
            if duration:
                print(f"{filepath.name} - {div1.get('id')}: {duration}")
                count += 1

It WORKS. Let's gooooooooooooo. 

Now, let's bring it all together into one big place.

## Bringing It All Together

We are at the end of our XML journey, sorry kids. But it was quite fun, if finicky. Now, let's create one last function that takes everything we've done and uses all our tested functions. After that, we shall convert it to a CSV and actually perform analysis on it. Further data cleaning will have to be performed on an ad hoc basis (for example, fixing case elements or adding something like the population of London for each year). This also includes things like converting to date-time, if needed. The main purpose of this portion was just extracting the data correctly from the XML files.

Anyway, let us commence forth.

In [53]:
df = pd.read_csv('old_bailey_actual_final.csv')
df.head(5)


| Unnamed: 0 | session_date | session_year | trial_id | defendant_name | defendant_gender | defendant_age | defendant_occupation | victim_gender | offence_category | offence_subcategory | ... | offence_value_shillings | offence_value_pence | offence_value_guineas | fine_value_pounds | fine_value_shillings | fine_value_pence | fine_value_guineas | fine_value_marks | juror_ids | judge_ids |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16740429 | 1674 | t16740429-1 | Prisoner | male | | | male | violentTheft | highwayRobbery | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | |
| 1 | 16740429 | 1674 | t16740429-2 | another | male | | | male | theft | grandLarceny | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | |
| 2 | 16740429 | 1674 | t16740429-3 | others | male | | | male | theft | burglary | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | |
| 3 | 16740429 | 1674 | t16740429-3 | one | male | | | male | theft | burglary | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | |
| 4 | 16740429 | 1674 | t16740429-3 | more | male | | | male | theft | burglary | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | |


## What do we have here?
(In bold are the most useful columns for our research)

Here we have the Old Bailey proceedings, where each row represents one defendant in a trial, so a trial with several defendants spans several rows.
- In the session_date column we have the date of that trial in YYYYMMDD format, which needs to be converted to datetime format if it is to be used.
- **In session_year we have the year.**
- In trial_id, each trial has a unique ID number (shared by co-defendants, as with t16740429-3 above).
- In defendant_name we have some real names and some placeholder labels like "Prisoner". This is the person who is being tried.
- **In defendant_gender we have their gender, which is either male, female or indeterminate.**
- **In defendant_occupation we have a mix of job roles.**
- **In victim_gender we have the gender of the victim of the crime, if the crime was committed against someone.**
- offence_category contains 9 unique values that are overarching categories of crime.
- **offence_subcategory is more specific, containing over 50 types of crime, each of which falls into one of the larger categories.**
- **The punishment column contains 6 different punishment types (death, transport, miscpunish, corporal, nopunish, imprison), plus NaN where none is recorded.**
- **punishment_detail contains more specific punishment types like publicWhipping, hardLabour, insanity, branding, pardon etc.**
- In crime location we have some areas of London listed, but only 15 distinct values.
- The offence value columns (pounds, shillings and pence) hold the stolen value; the highest values are 4000 pounds, 69712 shillings and 500 pence.
- fine_value_pounds contains the amount, in pounds, that the defendant was fined for the crime.
- juror_ids and judge_ids contain the unique identifiers of the jurors and judges for that case.
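
On the session_date conversion mentioned in the list: a minimal sketch of parsing the YYYYMMDD values with the standard library is below. The helper name `parse_session_date` is hypothetical. Plain `datetime.date` objects are used here on the assumption that pandas' default nanosecond timestamps, which only reach back to 1677, may not cover the earliest sessions from 1674, depending on the pandas version.

```python
from datetime import date, datetime

def parse_session_date(value):
    """Turn a YYYYMMDD value such as 16740429 into a datetime.date."""
    return datetime.strptime(str(value), "%Y%m%d").date()

first_session = parse_session_date(16740429)
print(first_session)        # 1674-04-29
print(first_session.year)   # 1674
```

Applied to the DataFrame, this could look like `df['session_date'].map(parse_session_date)`, which yields a column of Python date objects rather than a pandas datetime column.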


# Planning: Data Science Life Cycle
Any good data science project starts with a good plan. To do this, we need to start by understanding the data science lifecycle. The lifecycle below has been adapted slightly to fit an LIS project rather than a data science business project. The 'How we will do this' column summarises how we completed each step of the lifecycle in this project.

| Phases | Description | How we will do this |
|--------|-------------|---------------------|
| Identifying problems and understanding the project | Discovering the answers for basic questions including requirements, priorities and budget of the project. | As a group we evaluated our collective interests and aligned these with the assessment brief to decide on the project topic that would fit best. |
| Data Collection | Collecting data from relevant sources either in structured or unstructured form. | Using our collective interests in crime, gender inequality and historic datasets, we found a good dataset - The Old Bailey files. We then took our time to fully understand the data by exploring the XML files and the potential of the dataset. |
| Data processing | Processing and fine-tuning the raw data, critical for the goodness of the overall project. | AKA cleaning the data. We did this as shown above by merging XML files and extracting tags to create a pandas dataframe, and create a new CSV file which is used from here onwards.|
| Data modelling | Preparing the appropriate model to achieve desired performance. | Actually doing the analysis, starting below. In order to do this, we needed to plan what we wanted to find out by generating some hypotheses. See hypotheses below.|
| Visualisation | Creating clear, professional data visualisations to support analysis. | Using seaborn and plotly visualisations, including a Shiny app.py |
| Model deployment | Executing the analysed model in desired format and channel.| Shiny App. |
| Limitations & Future Work | Acknowledging constraints and proposing extensions. | Critical analysis of our research, findings and methods, written up as a markdown cell. |

Source: https://www.onlinemanipal.com/blogs/data-science-lifecycle-explained

# Analysing

## Hypotheses: 

1. Female defendants were less likely to receive death or corporal punishment compared to male defendants

2. Defendants' gender and occupation can be used to predict punishment



### Let's begin:
Female defendants were less likely to receive death or corporal punishment compared to male defendants

In [None]:
# what kinds of punishments are there in this dataset and which ones are we going to categorise as harsh
df['punishment_detail'].unique() 

array([nan, 'respitedForPregnancy', 'drawnAndQuartered', 'burning',
       'branding', 'fine', 'publicWhipping', 'sureties', 'pillory',
       'whipping', 'pardon', 'executed', 'houseOfCorrection', 'newgate',
       'sentenceRespited', 'respited', 'privateWhipping',
       'militaryNavalDuty', 'forfeiture', 'brandingOnCheek',
       'hangingInChains', 'hardLabour', 'deathAndDissection', 'insanity',
       'otherInstitution', 'no_subcategory', 'penalServitude',
       'preventiveDetention'], dtype=object)

In [54]:
# Define harsh punishments
harsh = ['executed', 'drawnAndQuartered', 'burning', 'branding', 'hangingInChains', 
         'deathAndDissection', 'hardLabour', 'publicWhipping', 'privateWhipping', 'whipping', 'pillory']

medium = ['fine', 'sureties', 'forfeiture', 'houseOfCorrection', 'newgate', 'penalServitude', 
          'preventiveDetention', 'militaryNavalDuty', 'sentenceRespited', 'respited']

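lenient = ['pardon', 'respitedForPregnancy', 'insanity', 'no_subcategory']

# Note: 'brandingOnCheek' and 'otherInstitution' appear in the data but are not in any of the three lists above.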

In [55]:
# Split by gender
male = df[df['defendant_gender'] == 'male']
female= df[df['defendant_gender'] == 'female']

In [58]:
# Calculate proportion receiving harsh punishment
male_harsh = (male['punishment_detail'].isin(harsh)).sum() / len(male)
female_harsh = (female['punishment_detail'].isin(harsh)).sum() / len(female)

In [59]:
print("Male defendants harsh punishment:", male_harsh)
print("Female defendants harsh punishment:", female_harsh)
print("Difference:", male_harsh - female_harsh)

Male defendants harsh punishment: 0.13481242360314696
Female defendants harsh punishment: 0.08877213332878442
Difference: 0.046040290274362544


In [None]:
from scipy.stats import chi2_contingency

contingency_table = [
    [(male['punishment_detail'].isin(harsh)).sum(), len(male) - (male['punishment_detail'].isin(harsh)).sum()],
    [(female['punishment_detail'].isin(harsh)).sum(), len(female) - (female['punishment_detail'].isin(harsh)).sum()]
]

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

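print("Chi-square:", chi2)
print("P-value:", p_value)
print("Significant?", "Yes" if p_value < 0.05 else "No")  # assuming the conventional 0.05 threshold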

Chi-square: 890.4867690650723
P-value: 1.1478760025674457e-195
Significant? Yes
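
To make the test less of a black box, here is a sketch of the same statistic computed by hand for a 2x2 table. The function name `chi_square_2x2` and the toy counts are hypothetical (they are NOT the Old Bailey figures). The `yates` option mirrors the continuity correction that SciPy applies by default to tables with one degree of freedom.

```python
def chi_square_2x2(table, yates=True):
    """Chi-square statistic for a 2x2 table of observed counts.

    yates=True applies the continuity correction, mirroring SciPy's default
    behaviour for tables with one degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if gender and punishment were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            diff = abs(observed - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    return statistic

# Hypothetical counts, NOT the Old Bailey figures: [harsh, not harsh] per group.
toy_table = [[10, 20],
             [20, 10]]
print(chi_square_2x2(toy_table, yates=False))
print(chi_square_2x2(toy_table))
```

Each cell contributes the squared gap between observed and expected counts, scaled by the expected count, so large samples like ours can produce very large statistics even for modest differences in proportion.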


## Hypothesis: Gender Differences in Harsh Sentencing

**Hypothesis:** Female defendants were less likely to receive harsh punishments compared to male defendants.

**Method:**

We compared the proportion of male and female defendants who received harsh punishments, out of all defendants of each gender (including those with no recorded punishment). Harsh punishments were defined as: executed, drawnAndQuartered, burning, branding, hangingInChains, deathAndDissection, hardLabour, publicWhipping, privateWhipping, whipping, and pillory.

**Results:**

- Male defendants: 13.48% received harsh punishment
- Female defendants: 8.88% received harsh punishment
- Difference: 4.60 percentage points

A chi-square test confirmed this difference is highly statistically significant (χ² = 890.49, p < 0.001).

**Conclusion:**

Female defendants were significantly less likely to receive harsh physical punishments compared to male defendants. This represents a 4.60 percentage point difference, with females receiving harsh punishments at roughly two-thirds the rate of males. This suggests that courts applied more lenient sentencing practices to women overall. 
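
The "roughly two-thirds" figure can be checked with a little arithmetic on the two proportions printed earlier. The sketch below also computes an odds ratio, which is an extra illustration on our part rather than part of the original analysis.

```python
# Proportions printed by the notebook above.
male_harsh = 0.13481242360314696
female_harsh = 0.08877213332878442

# Absolute gap, in percentage points.
difference_pp = (male_harsh - female_harsh) * 100

# Relative rate: how often women received harsh punishment compared with men.
rate_ratio = female_harsh / male_harsh

# Odds ratio, the quantity a logistic regression would report.
odds_ratio = (female_harsh / (1 - female_harsh)) / (male_harsh / (1 - male_harsh))

print(f"Difference: {difference_pp:.2f} percentage points")
print(f"Rate ratio (female / male): {rate_ratio:.2f}")
print(f"Odds ratio (female / male): {odds_ratio:.2f}")
```

The rate ratio comes out at about 0.66, which is where the two-thirds comparison comes from.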

However, we now need to test whether female defendants received lighter punishments than men for the **same types of crime.**

### Punishments for the same types of crime
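
One way to run this comparison for every offence category at once, rather than one crime at a time, is a groupby. The sketch below uses a tiny made-up DataFrame in place of the real one, and `harsh_rate_by_category` is a hypothetical helper name. The `harsh` list is copied from the definition earlier in the notebook.

```python
import pandas as pd

# Punishments treated as harsh, copied from the list defined earlier in the notebook.
harsh = ['executed', 'drawnAndQuartered', 'burning', 'branding', 'hangingInChains',
         'deathAndDissection', 'hardLabour', 'publicWhipping', 'privateWhipping',
         'whipping', 'pillory']

# Hypothetical toy rows standing in for the real DataFrame.
toy = pd.DataFrame({
    "defendant_gender": ["male", "male", "female", "female", "male", "female"],
    "offence_category": ["theft", "theft", "theft", "theft", "kill", "kill"],
    "punishment_detail": ["whipping", "fine", "fine", None, "executed", "executed"],
})

def harsh_rate_by_category(frame, harsh_list):
    """Share of rows with a harsh punishment, per offence category and gender."""
    flagged = frame.assign(is_harsh=frame["punishment_detail"].isin(harsh_list))
    return (flagged.groupby(["offence_category", "defendant_gender"])["is_harsh"]
                   .mean()
                   .unstack("defendant_gender"))

rates = harsh_rate_by_category(toy, harsh)
print(rates)
```

The result has one row per offence category and one column per gender, so male and female rates for the same crime sit side by side.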

In [None]:
# For a given crime X, are harsh punishments given to men more often than to women?
# Test for theft
male_theft = df[(df['defendant_gender'] == 'male') & (df['offence_category'] == 'theft')]
female_theft = df[(df['defendant_gender'] == 'female') & (df['offence_category'] == 'theft')]

male_theft_harsh = (male_theft['punishment_detail'].isin(harsh)).sum() / len(male_theft)
female_theft_harsh = (female_theft['punishment_detail'].isin(harsh)).sum() / len(female_theft)

print("Theft - Male harsh:", male_theft_harsh)
print("Theft - Female harsh:", female_theft_harsh)

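# Test for murder (murder is an offence_subcategory, not one of the 9 offence_category values, so we filter on the subcategory)
male_murder = df[(df['defendant_gender'] == 'male') & (df['offence_subcategory'] == 'murder')]
female_murder = df[(df['defendant_gender'] == 'female') & (df['offence_subcategory'] == 'murder')]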

male_murder_harsh = (male_murder['punishment_detail'].isin(harsh)).sum() / len(male_murder)
female_murder_harsh = (female_murder['punishment_detail'].isin(harsh)).sum() / len(female_murder)

print("Murder - Male harsh:", male_murder_harsh)
print("Murder - Female harsh:", female_murder_harsh)



 analysis

In [None]:
# hypothesis 2. 1870 hypothesis

analysis

In [None]:
# logistic regression to answer hypothesis 3.

analysis

# Modelling
Using visualisation tools to map our findings.

# Evaluating

What patterns can we see here?

# Visualising

Now we present our findings in Shiny!