## Embedding Driven Text Analysis of Crease's Stance Towards Chinese Immigrants

prAxIs UBC Team <br> _Kaiyan Zhang, Irene Berezin, Alex Ronczewski_

2025-8-14

### Library Loading

In [1]:
# This cell loads the necessary libraries for executing the notebook.
import pandas as pd
import numpy as np
import re
import umap
import textwrap
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cosine
from transformers import AutoTokenizer, AutoModel
import torch
import warnings
from collections import defaultdict, Counter
from typing import Dict, Any, Union


### Introduction

#### Overview of Historical Background

***The 1884 Chinese Regulation Act*** in British Columbia is widely regarded as one of the most notorious discriminatory provincial laws targeting Chinese immigration. It was challenged and ultimately declared unconstitutional in the 1885 case of ***R v. Wing Chong*** by Justice Henry Pering Pellew Crease. Crease found the legislation unconstitutional on economic grounds, holding that it infringed on federal authority over immigration, trade, commerce, treaty-making, and taxation.

The central figure in the ruling, *Henry Pering Pellew Crease*, came from a wealthy English military family, and possessed a prestigious law background. 

* His social identity was above all English, and this was made clear in his politics.
* He viewed Canada not as a new society to be built, but as an extension of the British Empire.
* He displayed mistrust towards Canadians, referring to them as "North American Chinamen", afraid that they would "rule the country and job its offices" (Tina Loo).

In previous years, students expressed interest in Crease's opinion on the 1884 Chinese Regulation Act, given that the act was strongly condemned and ultimately struck down by Crease. However, the ruling seems at odds with Crease's own views on Chinese immigrants.

This raises an interesting question: **Did Judge Crease strike down the act because of genuine anti-racism concerns, or because he saw the Chinese immigrant labor force as a valuable asset for growing the Canadian economy?**

#### Objective

* We aim to explore this question by analyzing the language used by Justice Crease in his legal opinions and writings related to Chinese immigrants through **Natural Language Processing (NLP)** approaches. By examining the text, we hope to uncover insights into his stance.

* The workshop also aims to demonstrate how historians can use computational tools to *help* them answer such a research question, by showing each step in the research process.

#### The Problem: Legal Text Analysis

Legal text analysis is a complex task: legal documents are often lengthy, dense, formal, and filled with specialized terminology. They are also often written in a neutral or passive voice, which makes it difficult to discern the author's personal opinions or biases. This poses unique challenges for historians and legal scholars alike, and it also strains the usual methods of natural language processing (NLP).

Mining insights from such texts requires sophisticated techniques to extract meaningful information and identify patterns. We need the technique to be able to:
* **Understand legal vocabulary**: Legal texts often contain specialized terminology and complex sentence structures, so the technique should be able to handle legal jargon and formal language.
* **Identify contextual semantics**: Legal texts often involve nuanced meanings and interpretations, so the technique should be able to capture the context and semantics of legal language.
* **Handle ambiguity**: Legal texts can be ambiguous, with multiple interpretations possible, so the technique should be able to handle ambiguity and provide insights into different interpretations.
* **Extract relevant topics**: Legal texts often cover multiple topics and issues, so the technique should be able to extract relevant topics and themes from the text.
* **Analyze sentiment**: Legal texts can convey different sentiments, such as positive, negative, or neutral, so the technique should be able to analyze sentiment and provide insights into the author's tone and attitude.

#### Research Approach

In this workshop, we will explore how to address these challenges using a comparative approach. While we focus on the text of Justice Crease, we will compare it with other legal texts from the same period to gain a better understanding of the language used in legal documents at that time.

The first subject we will use for comparison is the **1884 Chinese Regulation Act**, which was the law that Crease struck down. The second subject we will use for comparison is **Justice Matthew Baillie Begbie**, who testified alongside Crease in the 1884 Royal Commission on Chinese Immigration.

* Unlike Crease, historical accounts describe Begbie as protective of marginalized peoples, particularly Indigenous communities and Chinese immigrants.
* Similar to what Crease did to the Chinese Regulation Act, Begbie struck down discriminatory municipal by-laws in Victoria that targeted Chinese-owned businesses in the 1888 case of ***R v. Victoria***.

We use machine learning techniques, specifically text embeddings, to do the following:

1. Compile **a corpus of legal cases and commission reports** authored by contemporary judges concerning Chinese immigrants.
2. Apply **Optical Character Recognition (OCR)** to the reports in order to convert them to a machine-readable format. 
3. Examine **keywords** in the texts, to compare the positions of different justices and regulations.
4. Use **machine learning** to assess the relative emphasis on economic versus social justice concerns.
5. Use **sentiment analysis** to evaluate the tone of the documents, focusing on whether they reflect positive, negative, or neutral sentiments, and compare the sentiments of writings by different authors to identify patterns.
6. Use **zero-shot classification** to evaluate whether the documents reflect pro-discrimination, neutral, or anti-discrimination positions.

This approach demonstrates different techniques historians can use to identify patterns in documents for analysis.

#### Data Collection and Preprocessing

We use 10 digitized texts:

- Legal documents that address Chinese immigration in BC during the period:
    - *R v. Wing Chong*
    - *Wong Hoy Woon v. Duncan*
    - *R v. Mee Wah*
    - *R v. Victoria*
    - *Chinese Regulation Act, 1884*
- Reports authored by Crease and Begbie for the Royal Commission that show the judges' personal perspectives. 
- The remaining documents enrich our corpus for analysis and supplement our study.

A major issue with working with historical texts is the format they are stored in: usually scans of varying quality from physical books, articles, etc. Scanned images are not machine-readable (unlike, for example, text files), so our first step is to use **Optical Character Recognition (OCR)** to convert them into machine-readable text. We chose this approach because (1) it is a common technique for digitizing printed texts that is already widely used in legal case archives such as the CanLII database, and (2) there are many OCR tools available that vary in cost, effectiveness, and ease of use. Below is a brief overview of early and modern OCR techniques:


- **Early OCR (Pattern Matching):**

    - Compared each character image to a library of fonts and shapes.
    - Worked well for clean, printed text.
    - Struggled with handwriting, unusual fonts, or degraded scans.

- **Modern OCR (Intelligent Recognition):**

    - Uses AI to "read" text more like a human.
    - Analyzes shapes, context, and layout.
    - Handles messy, handwritten, or complex documents much better.

After testing several tools, we found that modern, AI-based OCR methods produced the most accurate results for our historical documents.

#### Data Overview

After OCR, we obtained a `.csv` file containing the text and metadata of the documents. Note that we removed the direct quotes of the *1884 Chinese Regulation Act* in Crease's ruling, as they don't reflect his own language. The structure of the data is as follows:

| Column Name                   | Description                                              |
| ----------------------------- | -------------------------------------------------------- |
| filename                    | Name of the file containing the document text.           |
| author                      | Author of the document (e.g., "Crease", "Begbie").       |
| type                        | Document type (e.g., "case", "report").                  |
| text                        | Full text of the document, which may include OCR errors. |
| act_quote_sentences_removed | Number of quoted sentences removed from the full text.   |

Here, we read the `.csv` file into a pandas DataFrame and display it.

In [2]:
# Load the dataset
df = pd.read_csv("../data/metadata_cleaned.csv")

df

Unnamed: 0,filename,author,type,text,act_quote_sentences_removed
0,regina_v_wing_chong.txt,Crease,case,"CREASE, J. 1885. REGINA v. WING CHONG. \r\n\r\...",12
1,wong_hoy_woon_v_duncan.txt,Crease,case,"CREASE, J.\r\n\r\nWONG HOY WOON v. DUNCAN.\r\n...",0
2,regina_v_mee_wah.txt,Begbie,case,BRITISH COLUMBIA REPORTS.\r\n\r\nREGINA v. MEE...,0
3,regina_v_victoria.txt,Begbie,case,"OF BRITISH COLUMBIA.\r\n\r\nREGINA r, CORPORAT...",0
4,quong_wing_v_the_king.txt,Fitzpatrick,case,QUONG WING v. THE KING. CAN. \r\n\r\nSupreme ...,0
5,commission_on_chinese_imigration.txt,Powell,report,"On the 4th of July, 1884, the following Commis...",0
6,chapleau_report_resume.txt,Chapleau,report,RESUMÉ.\r\n\r\n1. That Chinese labor is a most...,0
7,crease_commission.txt,Crease,report,"The Hon. Mr. Justice CREASE, Judge of the Supr...",0
8,begbie_commission.txt,Begbie,report,"Sir MATTHEW BEGBIE, Chief Justice of British C...",0
9,chinese_regulation_act_1884.txt,Others,act,An Act to regulate the Chinese population of B...,0


We are also interested in the length of each document, as it can provide insights into the depth and complexity of the text. Therefore, we create a summary below quantifying the number of characters in each document.



In [3]:
# Summarize the distribution of document lengths (in characters)
# Build the summary directly from the columns instead of looping over rows
doc_lengths_df = pd.DataFrame({
    'Document': df['filename'],
    'Length': df['text'].str.len()
})

# Display the summary
print(doc_lengths_df)

                               Document  Length
0               regina_v_wing_chong.txt   36819
1            wong_hoy_woon_v_duncan.txt   13912
2                  regina_v_mee_wah.txt   25104
3                 regina_v_victoria.txt    8252
4             quong_wing_v_the_king.txt   46982
5  commission_on_chinese_imigration.txt    3402
6            chapleau_report_resume.txt   10906
7                 crease_commission.txt   30768
8                 begbie_commission.txt   41270
9       chinese_regulation_act_1884.txt   12908


### How Do Computers Interpret Text?

#### Count Approach: TF-IDF

The **Term Frequency-Inverse Document Frequency (TF-IDF)** is a statistical measure that evaluates the importance of a word in a document relative to a collection of documents (corpus). It is one of the earliest and most widely used methods for text analysis. It is essentially a count-based approach that quantifies the importance of words in a document based on their frequency and distribution across multiple documents. TF-IDF works by calculating two components:
1. **Term Frequency (TF)**: Measures how frequently a term appears in a document.
2. **Inverse Document Frequency (IDF)**: Measures how important a term is across the entire corpus, by considering how many documents contain the term.
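
To make the two components concrete, here is a minimal sketch that computes TF-IDF by hand for a three-document toy corpus. The toy sentences and the helper name `tfidf_scores` are illustrative assumptions, not part of our dataset or pipeline. The IDF formula used is the smoothed variant, $\ln\frac{1+N}{1+\text{df}} + 1$, which is the default in scikit-learn's `TfidfVectorizer`; the L2 normalization that scikit-learn applies afterwards is omitted here for readability.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences, for illustration only)
toy_docs = [
    "chinese labor tax",
    "chinese license fee",
    "chinese labor labor market",
]

def tfidf_scores(docs):
    """Return one {term: score} dict per document (raw counts x smoothed IDF)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: terms found in every document get the minimum weight of 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

scores = tfidf_scores(toy_docs)
for doc, doc_scores in zip(toy_docs, scores):
    ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
    print(doc, "->", [(term, round(score, 3)) for term, score in ranked])
```

In this toy corpus, "chinese" appears in every document and therefore receives the lowest weight in each of them, while terms that are frequent in one document but rare elsewhere (such as "labor" in the third sentence) rise to the top.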

For our purposes, we can use TF-IDF to identify the most important words in each document, which can help us understand the key themes and topics discussed in the text. Our procedure is as follows:

1. Regroup the text data into 4 groups:
    - Crease's writings
    - Begbie's writings
    - Chinese Regulation Act
    - Other documents
2. Across all documents, we will:
    - Remove common filler words ("the", "and", etc.) and words of four or fewer characters.
    - Fit a TF-IDF vectorizer to convert the processed texts into numerical vectors.
    - Calculate the TF-IDF scores for each term in each document.
3. For each group, we will average the scores of its documents and identify the highest-scoring terms, which can reveal the main topics of that group.

In [4]:
# Define the function to preprocess text in a DataFrame column
def preprocess_text(text_string):
    """
    Cleans and preprocesses text by:
    1. Converting to lowercase
    2. Removing punctuation and numbers
    3. Tokenizing
    4. Removing English stop words 
    5. Removing words with 4 or fewer characters
    """
    # Start with the standard English stop words
    stop_words = set(stopwords.words('english'))
    
    # Add custom domain-specific stop words if needed
    custom_additions = {'would', 'may', 'act', 'mr', 'sir', 'also', 'upon', 'shall'}
    stop_words.update(custom_additions)
    
    # Lowercase and remove non-alphabetic characters
    processed_text = text_string.lower()
    processed_text = re.sub(r'[^a-z\s]', '', processed_text)
    
    # Tokenize
    tokens = processed_text.split()
    
    # Filter out stop words AND short words in a single step
    filtered_tokens = [
        word for word in tokens 
        if word not in stop_words and len(word) > 4
    ]
    # Re-join the words into a single string
    return " ".join(filtered_tokens)

In [5]:
# Apply the function to create the 'processed_text' column
df['processed_text'] = df['text'].apply(preprocess_text)

# Display the first few rows of the processed text
df['processed_text'].head(5)

0    crease regina chong certiorarichinese regulati...
1    crease duncan health regulationsvictoria healt...
2    british columbia reports regina begbie constit...
3    british columbia regina corporation victoria p...
4    quong supreme court canada charles fitzpatrick...
Name: processed_text, dtype: object

In [6]:
# Perform TF-IDF vectorization on the processed text

# Regrouping the DataFrame for better representation
df['group'] = 'Other'
df.loc[df['author'] == 'Crease', 'group'] = 'Crease'
df.loc[df['author'] == 'Begbie', 'group'] = 'Begbie'
df.loc[df['author'] == 'Others', 'group'] = 'Regulation Act'

# Load the vectorizer and transform the processed text
# This calculates IDF based on word rarity across ALL individual texts.
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 3))
tfidf_matrix = vectorizer.fit_transform(df['processed_text'])

# Create a new DataFrame with the TF-IDF scores
feature_names = vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

# Add the 'group' column to this TF-IDF DataFrame for aggregation
tfidf_df['group'] = df['group'].values

# Group by the 'group' column and calculate the MEAN TF-IDF score for each term
mean_tfidf_by_group = tfidf_df.groupby('group').mean()

# Collect top words and arrange them into a side-by-side DataFrame
list_of_author_dfs = []
for group_name in ['Crease', 'Begbie', 'Regulation Act', 'Other']:
    # Get the top 10 terms and scores for the current group
    top_words = mean_tfidf_by_group.loc[group_name].sort_values(ascending=False).head(10)
    
    # Convert the Series to a DataFrame
    top_words_df = top_words.reset_index()
    top_words_df.columns = [group_name, f'{group_name}_score']
    
    list_of_author_dfs.append(top_words_df)

# Concatenate the list of DataFrames horizontally
final_wide_df = pd.concat(list_of_author_dfs, axis=1)

# Display the final combined DataFrame
final_wide_df

Unnamed: 0,Crease,Crease_score,Begbie,Begbie_score,Regulation Act,Regulation Act_score,Other,Other_score
0,chinese,0.200601,license,0.227563,chinese,0.443315,canada,0.221373
1,labor,0.16855,chinamen,0.149219,dollars,0.283191,chinese,0.194695
2,infected,0.134599,licenses,0.132835,licence,0.247792,legislation,0.137609
3,white,0.126402,municipality,0.113449,collector,0.212393,country,0.11403
4,taxation,0.10364,statute,0.103499,forfeit,0.190386,wycombe,0.110548
5,british,0.09668,legislature,0.098062,lieutenantgovernor,0.190386,naturalized,0.107645
6,hongkong,0.096142,revenue,0.096895,person,0.190158,great,0.101984
7,dominion,0.092252,corporation,0.093189,possession,0.188833,parliament,0.09726
8,health officer,0.089733,pawnbrokers,0.085087,exceeding,0.176994,county,0.093976
9,china,0.089049,provincial,0.084027,lieutenantgovernor council,0.166587,honorable,0.09271


The TF-IDF analysis of our corpus surfaces some interesting patterns: "chinese" ranks at or near the top for Crease, the Regulation Act, and the other documents (Begbie's top terms include "chinamen" instead), "labor" is prominent in Crease's writings, "license" in Begbie's writings, and "dollars" in the Chinese Regulation Act.

However, this approach has limitations, as it does not capture the semantic meaning of words or their relationships to each other. For example, it cannot distinguish between "Chinese" as a noun and "Chinese" as an adjective, or between "labor" as a noun and "labor" as a verb. It also does not consider the context in which words are used, which can lead to misinterpretation of their meaning.
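
A small illustration of this limitation, using two made-up sentences that are not drawn from our corpus: a representation based only on single-word counts discards word order, so sentences with opposite meanings can look identical.

```python
from collections import Counter

# Two hypothetical sentences with opposite meanings but the same words
sentence_a = "the court upheld the tax and rejected the appeal"
sentence_b = "the court rejected the tax and upheld the appeal"

# A bag of words keeps only how often each word occurs
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

# Word order is lost, so the two bags are indistinguishable
print(bag_a == bag_b)  # True
```

Any score derived only from single-word counts is therefore the same for both sentences. Including multi-word n-grams recovers some local word order, but it still does not model meaning.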

#### Embedding Approach

With the advancement of machine learning, **text embeddings** emerged as a more powerful technique for text analysis. They represent words or phrases as dense vectors in a high-dimensional space, capturing semantic relationships between them. This allows for a more nuanced understanding of text, enabling tasks like similarity measurement, clustering, and classification.

There are several popular text embedding models, including:
- **Word2Vec**: A neural network-based model that learns word embeddings by predicting context words given a target word (or vice versa).
- **GloVe**: A global vector representation model that learns word embeddings by factorizing the word co-occurrence matrix.
- **FastText**: An extension of Word2Vec that represents words as bags of character n-grams, allowing it to handle out-of-vocabulary words and capture subword information.
- **BERT**: A transformer-based model that generates contextualized embeddings by considering the entire sentence context, allowing it to capture word meanings based on their surrounding words.

In this workshop, we will use a BERT-based model to generate text embeddings for our corpus. [nlpaueb/legal-bert-base-uncased](https://huggingface.co/nlpaueb/legal-bert-base-uncased) is a BERT model pre-trained on English legal texts, including legislation, law cases, and contracts. It is designed to capture the legal language and semantics, making it suitable for our analysis. 

However, we must note that the model is not perfect and may still have limitations in understanding the nuances of legal language, especially in historical texts. 

### Word Embeddings
#### Creating Word Embeddings

While the model itself has the ability to generate word embeddings that capture the semantic meaning of words, we still need to design our own strategy to extract these meanings from our corpus. 

- Load LEGAL-BERT model and tokenizer.
- Tokenize sentences into smaller subword units using a tokenizer.
- Process each tokenized sentence through the model to extract hidden layer representations.
- Combine subword embeddings to form a single vector for each word by averaging the embeddings of its subword components.
- Aggregate embeddings for repeated words across sentences by averaging their vectors.
- Return a dictionary mapping each word to its mean embedding, capturing its semantic meaning in the context of the text.

In this way, we can not only generate word embeddings with contextual meanings over the whole corpus, but also aggregate our corpus into different groups and generate contextualized word embeddings for each group.
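
The subword-merging and averaging steps can be sketched without loading any model. In the example below, the token list and the 2-dimensional vectors are made-up stand-ins for tokenizer output and LEGAL-BERT hidden states (which have 768 dimensions), the split of a word into `china` and `##men` is assumed purely for illustration, and `mean_vector` and `merge_wordpieces` are hypothetical helpers written only for this sketch.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def merge_wordpieces(tokens, vectors):
    """Attach '##' continuation pieces to the preceding token and average their vectors."""
    special = {"[CLS]", "[SEP]", "[PAD]"}
    groups = []  # each entry: [word_string, [piece_vectors]]
    for tok, vec in zip(tokens, vectors):
        if tok in special:
            continue  # special tokens carry no word meaning
        if tok.startswith("##") and groups:
            groups[-1][0] += tok[2:]    # extend the current word
            groups[-1][1].append(vec)   # collect this piece's vector
        else:
            groups.append([tok, [vec]])  # start a new word
    return [(word, mean_vector(vecs)) for word, vecs in groups]

# Made-up tokens and vectors standing in for real model output
tokens = ["[CLS]", "the", "china", "##men", "[SEP]"]
vectors = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [4.0, 8.0], [0.0, 0.0]]

merged = merge_wordpieces(tokens, vectors)
print(merged)  # [('the', [1.0, 1.0]), ('chinamen', [3.0, 6.0])]
```

The same averaging idea is then applied a second time across sentences, so that every occurrence of a word contributes equally to its final vector.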

In [7]:
# We will use the Legal-BERT model for this task
tokenizer = AutoTokenizer.from_pretrained('nlpaueb/legal-bert-base-uncased')
model = AutoModel.from_pretrained('nlpaueb/legal-bert-base-uncased').eval() # set the model to evaluation mode

# Define a function to embed words using the tokenizer and model
def embed_words(sentences, tokenizer=tokenizer, model=model, target_words=None,
                device=None, max_length=512):
    """
    Returns a dictionary {word: mean_embedding}.
    Only the mean embedding (float32 numpy array) per word is kept.
    """
    if device is None:
        try:
            device = next(model.parameters()).device
        except Exception:
            device = torch.device("cpu")
    device = torch.device(device)
    model.to(device).eval()

    target_set = None if target_words is None else set(target_words)

    sums = {}   # word -> torch.Tensor sum of embeddings
    counts = {} # word -> occurrence count

    with torch.no_grad():
        for sent in sentences:
            enc = tokenizer(
                sent,
                return_tensors="pt",
                truncation=True,
                max_length=max_length
            )
            enc = {k: v.to(device) for k, v in enc.items()}
            outputs = model(**enc)
            hidden = outputs.last_hidden_state.squeeze(0)  # (seq_len, hidden)
            tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])

            i = 0
            while i < len(tokens):
                tok = tokens[i]
                if tok in ("[CLS]", "[SEP]", "[PAD]"):
                    i += 1
                    continue

                # Gather wordpieces
                j = i + 1
                piece_embs = [hidden[i]]
                word = tok[2:] if tok.startswith("##") else tok
                while j < len(tokens) and tokens[j].startswith("##"):
                    piece_embs.append(hidden[j])
                    word += tokens[j][2:]
                    j += 1

                if target_set is not None and word not in target_set:
                    i = j
                    continue

                word_emb = torch.stack(piece_embs, dim=0).mean(dim=0)
                if word in sums:
                    sums[word] += word_emb
                    counts[word] += 1
                else:
                    sums[word] = word_emb.clone()
                    counts[word] = 1
                i = j

    return {w: (sums[w] / counts[w]).cpu().numpy() for w in sums}


In [8]:
# Define a function to clean and preprocess text
def clean_text(text):
    
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    
    return text.strip()

In [9]:
warnings.filterwarnings("ignore")

# Group texts to form a single text per group
grouped_texts = df.groupby('group')['text'].apply(lambda x: ' '.join(x)).reset_index()
# Use pd.concat instead of the deprecated DataFrame.append
grouped_texts = pd.concat(
    [grouped_texts, pd.DataFrame([{'group': 'All', 'text': ' '.join(df['text'])}])],
    ignore_index=True
)

# Create new columns for word and sentence tokens
grouped_texts['word_tokens'] = grouped_texts['text'].apply(lambda x: word_tokenize(clean_text(x)))
grouped_texts['sentence_tokens'] = grouped_texts['text'].apply(lambda x: sent_tokenize(x))

# Apply clean_text to the sentence tokens
grouped_texts['sentence_tokens'] = grouped_texts['sentence_tokens'].apply(
    lambda x: [clean_text(sent) for sent in x]
)

In [10]:
# Embed the words in each group
grouped_texts['word_embeddings'] = grouped_texts['sentence_tokens'].apply(
    lambda x: embed_words(x)
    )

# Compute the number of unique words in each group
grouped_texts['num_unique_words'] = grouped_texts['word_tokens'].apply(lambda x: len(set(x)))

grouped_texts.head()

Unnamed: 0,group,text,word_tokens,sentence_tokens,word_embeddings,num_unique_words
0,Begbie,BRITISH COLUMBIA REPORTS.\r\n\r\nREGINA v. MEE...,"[british, columbia, reports, regina, v, mee, w...","[british columbia reports, regina v mee wah, b...","{'british': [0.20316587, 0.13844684, 0.0951799...",2622
1,Crease,"CREASE, J. 1885. REGINA v. WING CHONG. \r\n\r\...","[crease, j, 1885, regina, v, wing, chong, 14th...","[crease j, 1885, regina v wing chong, 14th 15...","{'crease': [0.4121399, 0.32047123, 0.53407174,...",2654
2,Other,QUONG WING v. THE KING. CAN. \r\n\r\nSupreme ...,"[quong, wing, v, the, king, can, supreme, cour...","[quong wing v the king, can, supreme court of ...","{'quong': [-0.0015669838, 0.3608557, 0.3371874...",1906
3,Regulation Act,An Act to regulate the Chinese population of B...,"[an, act, to, regulate, the, chinese, populati...",[an act to regulate the chinese population of ...,"{'an': [-0.19061895, -0.11869411, -0.18200988,...",576
4,All,"CREASE, J. 1885. REGINA v. WING CHONG. \r\n\r\...","[crease, j, 1885, regina, v, wing, chong, 14th...","[crease j, 1885, regina v wing chong, 14th 15...","{'crease': [0.31894925, 0.42798737, 0.46770898...",4849


We created word embeddings of all tokens in each group, respectively. The word embeddings are stored in a dictionary format, where each key is a word and the value is its corresponding embedding vector.

The embedding of the same word differs from group to group, which reflects the contextualized meaning of the word in each group.
- For example, the word "Chinese" has a different embedding in Crease's writings compared to Begbie's writings, indicating that the two authors used the word in different contexts and with different connotations.
- However, since they were embedded using the same model, the word embeddings of the same word in different groups are still similar, which reflects the shared meaning of the word across different contexts.
- The dimensionality of all word embeddings is 768, which is the size of the hidden layer in the LEGAL-BERT model we used.

In [11]:
# Display the word embedding of Chinese for the whole corpus
chinese_embedding = grouped_texts[grouped_texts['group'] == 'All']['word_embeddings'].values[0].get('chinese')

# Display first 20 dimensions for brevity
print(f"First 20 Dimensions of Word Embedding for 'Chinese' in the Full Corpus:\n {chinese_embedding[:20]}\n")
print(f"Total Dimensions of Word Embedding for 'Chinese': {len(chinese_embedding)}\n")

First 20 Dimensions of Word Embedding for 'Chinese' in the Full Corpus:
 [ 1.2322300e-01  2.6763847e-01  7.2645694e-02  3.9198916e-02
  3.2321563e-01  1.2995906e-01  1.2225201e-01  2.5251046e-01
 -2.4391446e-01 -9.4088748e-02  6.4341635e-02  4.6708676e-01
  1.5845809e-04 -7.6655887e-02 -4.1823617e-01  2.5841817e-01
  1.2294206e-01 -1.2723301e-01 -6.0239416e-01  4.2815152e-01]

Total Dimensions of Word Embedding for 'Chinese': 768



In [12]:
# Display the word embedding of Chinese in Crease's text
crease_embeddings = grouped_texts[grouped_texts['group'] == 'Crease']['word_embeddings'].values[0]
# Display first 20 dimensions for brevity
print(f"First 20 Dimensions of Word Embeddings for 'Chinese' in Crease's Text:\n{crease_embeddings.get('chinese')[:20]}\n") 
print(f"Total Dimensions of Word Embeddings for 'Chinese' in Crease's Text: {len(crease_embeddings.get('chinese'))}\n")

First 20 Dimensions of Word Embeddings for 'Chinese' in Crease's Text:
[ 0.12104391  0.31581154  0.04769319  0.0805686   0.3044722   0.09346943
  0.14147684  0.2550373  -0.22372353 -0.09209424  0.0625285   0.49330303
  0.00912888 -0.07314441 -0.439856    0.3086382   0.09130041 -0.1273665
 -0.5595881   0.41660848]

Total Dimensions of Word Embeddings for 'Chinese' in Crease's Text: 768



In [13]:
begbie_embeddings = grouped_texts[grouped_texts['group'] == 'Begbie']['word_embeddings'].values[0]
# Display first 20 dimensions for brevity
print(f"First 20 Dimensions of Word Embeddings for 'Chinese' in Begbie's Text:\n{begbie_embeddings.get('chinese')[:20]}\n")
print(f"Total Dimensions of Word Embeddings for 'Chinese' in Begbie's Text: {len(begbie_embeddings.get('chinese'))}\n")

First 20 Dimensions of Word Embeddings for 'Chinese' in Begbie's Text:
[ 0.12386303  0.27892804  0.02562958 -0.02250974  0.30389023  0.16883047
  0.00716521  0.309511   -0.28151745  0.00930497  0.05432036  0.43589708
 -0.04397878 -0.05242987 -0.52100396  0.3696377   0.16586976 -0.14758825
 -0.60179514  0.38511708]

Total Dimensions of Word Embeddings for 'Chinese' in Begbie's Text: 768



#### Measurement of Similarity

Another important aspect of word embeddings is the ability to measure the similarity between words based on their embeddings. This can be done using cosine similarity, which calculates the cosine of the angle between two vectors in the embedding space. In general, cosine similarity ranges from -1 to 1, where:

- -1 indicates vectors pointing in opposite directions
- 0 indicates no similarity (orthogonal vectors)
- 1 indicates perfect similarity (vectors pointing in the same direction)
- The closer the cosine similarity is to 1, the more similar the words are in meaning.

This allows us to identify related words and concepts based on their embeddings and to explore the semantic relationships between words in our corpus. More importantly, it allows us to measure similarity not only between words but also between sentences, paragraphs, and even entire documents, as long as they are represented as vectors in the same embedding space.

The math behind cosine similarity is as follows:
$$
\text{cosine\_similarity}(a, b) = \frac{a \cdot b}{||a|| \cdot ||b||}
$$
Where $a$ and $b$ are the embedding vectors of the two words, and $||a||$ and $||b||$ are their Euclidean norms (lengths).
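
As a quick numerical check of the formula, the sketch below computes cosine similarity by hand for a few made-up 2-dimensional vectors (the embeddings in this notebook have 768 dimensions). The function name `cosine_sim` is an illustrative assumption; the notebook itself relies on `scipy.spatial.distance.cosine`, which returns the cosine *distance*, that is, one minus this value.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u = [3.0, 4.0]
same_direction = [6.0, 8.0]   # u scaled by 2
orthogonal = [-4.0, 3.0]      # at a right angle to u
opposite = [-3.0, -4.0]       # u reversed

print(cosine_sim(u, same_direction))  # 1.0
print(cosine_sim(u, orthogonal))      # 0.0
print(cosine_sim(u, opposite))        # -1.0
```

Note that scaling a vector does not change the result: cosine similarity depends only on direction, not on length.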

Focusing on the word "Chinese", we can calculate its cosine similarity with other words in the same group to identify related terms. This can help us understand how the word is used in different contexts and how it relates to other concepts. Here, we will list out the top 10 most similar words to "Chinese" in each group, along with their cosine similarity scores.

**Note**: All words are put into lowercase.

In [14]:
# Compute top-10 most similar words to target for EVERY group (including "All")
target = "chinese"
top_n = 10
all_results = []
# Iterate through each group and compute similarities
for _, grp_row in grouped_texts.iterrows():
    group = grp_row['group']
    emb_dict = grp_row['word_embeddings']
    if target not in emb_dict:
        continue
    target_vec = emb_dict[target]
    sims = []
    for w, vec in emb_dict.items():
        if w == target:
            continue
        try:
            sim = 1 - cosine(target_vec, vec)
        except Exception:
            continue
        sims.append((w, sim))
    sims_sorted = sorted(sims, key=lambda x: x[1], reverse=True)[:top_n]
    for rank, (w, sim) in enumerate(sims_sorted, 1):
        all_results.append({'group': group, 'rank': rank, 'word': w, 'similarity': sim})

similar_words_df = pd.DataFrame(all_results)

# Pivot to wide format: one row per rank, one column per group
sims_wide = similar_words_df.pivot(index='rank', columns='group', values='similarity')
words_wide = similar_words_df.pivot(index='rank', columns='group', values='word')

# Combine with a tidy multi-level column index: 
wide_combined = pd.concat({'word': words_wide, 'similarity': sims_wide}, axis=1)
wide_combined = (
    wide_combined.swaplevel(0,1, axis=1)
                 .sort_index(axis=1, level=0)
)

wide_combined  # Display


rank,All similarity,All word,Begbie similarity,Begbie word,Crease similarity,Crease word,Other similarity,Other word,Regulation Act similarity,Regulation Act word
1,0.882102,chinamen,0.885641,china,0.884988,chinamen,0.878179,china,0.76229,immigrant
2,0.879779,china,0.866699,chinamen,0.875142,chinaman,0.874599,chinamen,0.75,whites
3,0.878989,chinaman,0.863639,chinaman,0.86762,china,0.86938,chinaman,0.744102,employer
4,0.833691,japanese,0.813474,whites,0.857302,white,0.816787,orientals,0.73943,native
5,0.832887,immigrant,0.810684,white,0.843365,aliens,0.80949,alien,0.736644,yards
6,0.832593,white,0.80013,universal,0.817652,confederation,0.807922,immigrants,0.732831,person
7,0.831415,immigrants,0.796812,english,0.817285,coolies,0.807445,provincial,0.710163,race
8,0.82972,whites,0.794389,europeans,0.816437,immigrants,0.807062,oriental,0.709648,found
9,0.820621,orientals,0.790069,provincial,0.815707,sweet,0.806823,aliens,0.708453,emergency
10,0.818171,aliens,0.787839,canton,0.814447,japanese,0.802767,asiatic,0.70748,useless


### Embedding Driven Text Analysis
#### Creating Keyword-Focused Stance Embeddings

Compared with generating word embeddings, modeling the stance of each text is more challenging, because it requires capturing the author's position on a specific issue or topic. Often the stance is not stated explicitly in the text but implied through the language used.

There is no universally optimal method for stance modeling, since the right choice depends on the specific context and the author's perspective. However, we can combine several techniques to create focused embeddings that capture the stance of each text. Our strategy is as follows:

- Tokenize the text into smaller units and identify the positions of specific keywords or phrases that are relevant to the stance being analyzed.
- For each occurrence of the keywords, extract a surrounding "window" of text to capture the context in which the keywords are used.
- Represent the text in the window as numerical vectors using a pre-trained language model, which encodes the meaning of the words and their relationships.
- Combine the vectors within each window using a pooling method (e.g., averaging or selecting the maximum value) to create a single representation for the context around the keyword.
- If multiple occurrences of the keywords are found, average their representations to create a unified vector that captures the overall stance in the text.
- If no keywords are found, use a fallback representation based on the overall text.

This approach produces focused embeddings that represent each text through the context surrounding specific keywords or phrases. The sentence is the basic unit of analysis here, but larger chunks of text can be used if needed.
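
The steps above can be sketched without loading a model. In the toy example below, `hidden` stands in for the per-token vectors a language model would return and `token_ids` for the tokenized sentence. Both are made-up values, and `pooled_window_embedding` is a hypothetical helper written for illustration, not the notebook's own function. The toy input has no special tokens, so the window is allowed to start at index 0.

```python
def find_spans(token_ids, keyword_ids):
    """Return (start, end) index pairs where keyword_ids occurs in token_ids."""
    n = len(keyword_ids)
    return [(i, i + n) for i in range(len(token_ids) - n + 1)
            if token_ids[i:i + n] == keyword_ids]

def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def pooled_window_embedding(token_ids, hidden, keyword_ids, window=1):
    spans = find_spans(token_ids, keyword_ids)
    if not spans:
        # Fallback: represent the text by its first token's vector
        return hidden[0]
    per_span = []
    for start, end in spans:
        lo = max(0, start - window)
        hi = min(len(hidden), end + window)
        per_span.append(mean_vectors(hidden[lo:hi]))  # mean-pool inside one window
    return mean_vectors(per_span)                      # average across all windows

# Toy inputs (hypothetical): 6 tokens with 2-dimensional "hidden states"
token_ids = [10, 7, 42, 8, 42, 9]
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 0.0], [5.0, 1.0]]

vec = pooled_window_embedding(token_ids, hidden, keyword_ids=[42], window=1)
print(vec)
```

The keyword id `42` occurs twice, so two windows are pooled and then averaged. A keyword that never occurs triggers the fallback branch instead.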

Finally, we store the embeddings in a dictionary in which each key is an author and each value is the list of embeddings for that author's sentences.

In [15]:
def embed_text(
    text,
    focus_token=None,   # a keyword string or a list of keywords
    window=10,          # number of tokens kept on each side of a keyword match
    pooling="mean",     # "mean" (default), "max", or "min"
    tokenizer=tokenizer,
    model=model):

    # Run the model once
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state.squeeze(0) 

    if focus_token is None:
        return hidden[0].cpu().numpy()
    
    # Normalize to list
    keywords = (
        [focus_token] if isinstance(focus_token, str)
        else focus_token
    )

    # Pre-tokenize each keyword to its subtoken ids
    kw_token_ids = {
        kw: tokenizer.convert_tokens_to_ids(tokenizer.tokenize(kw))
        for kw in keywords
    }

    input_ids = inputs["input_ids"].squeeze(0).tolist()
    spans = []  # list of (start, end) index pairs

    # find every match of every keyword
    for kw, sub_ids in kw_token_ids.items():
        L = len(sub_ids)
        for i in range(len(input_ids) - L + 1):
            if input_ids[i:i+L] == sub_ids:
                spans.append((i, i+L))

    if not spans:
        # fallback on CLS vector
        return hidden[0].cpu().numpy()

    # For each span, grab the window around it
    vecs = []
    for (start, end) in spans:
        lo = max(1, start - window)  # start at 1 to skip the leading [CLS] token
        hi = min(hidden.size(0), end + window)
        # Select all token vectors in this extended window
        span_vec = hidden[lo:hi]
        
        if pooling == "mean":
            pooled = span_vec.mean(dim=0)
        elif pooling == "max":
            pooled = span_vec.max(dim=0).values
        elif pooling == "min":
            pooled = span_vec.min(dim=0).values
        else:
            raise ValueError(f"Unknown pooling method: {pooling}")
        
        vecs.append(pooled.cpu().numpy())

    # Average across all spans
    return np.mean(np.stack(vecs, axis=0), axis=0)

In [16]:
crease_cases = df[(df['author'] == 'Crease') & (df['type'] == 'case')]['text'].tolist()
begbie_cases = df[(df['author'] == 'Begbie') & (df['type'] == 'case')]['text'].tolist()
act_1884 = df[df['type'] == 'act']['text'].tolist()

act_dict = {
    'Crease': crease_cases,
    'Begbie': begbie_cases,
    'Act 1884': act_1884}

In [17]:
act_snippets = {}

# Keywords are matched as case-sensitive substrings, so "alien" also matches "aliens"
keywords = ["Chinese", "China", "Chinaman", "Chinamen", 
            "immigrant", "immigrants", "alien", "aliens", 
            "immigration"]

for auth, texts in act_dict.items():
    snippets = []
    for txt in texts:
        # Keep every sentence that mentions at least one keyword
        for sent in sent_tokenize(txt):
            if any(keyword in sent for keyword in keywords):
                snippets.append(sent)
    act_snippets[auth] = snippets

In [18]:
# Count the number of snippets per author
n_snippet = {auth: len(snippets) for auth, snippets in act_snippets.items()}

print("Snippet size by author:")
for auth, num in n_snippet.items():
    print(f"{auth}: {num}")

Snippet size by author:
Crease: 83
Begbie: 18
Act 1884: 24


In [19]:
# Create embeddings
embeddings_dict = {'Crease': [], 'Begbie': [], 'Act 1884': []}

for auth, snippets in act_snippets.items():
    for snip in snippets:
        v = embed_text(snip, focus_token=keywords, window=15)
        embeddings_dict[auth].append(v) 

#### Measuring Stance Similarity

As with word embeddings, cosine similarity can be used to measure stance similarity between texts. The interpretation is the same: a higher cosine similarity indicates closer alignment in stance between two texts.

With the sentence as the basic unit of analysis, there are several ways to compute the overall cosine similarity between two authors' texts. Here we focus on two:
1. **Mean Embeddings**: We calculate the mean embedding for each author's texts and then compute the cosine similarity between these mean embeddings. This gives us a single similarity score for each pair of authors, reflecting their overall stance alignment.
2. **Pairwise Embeddings**: We calculate the cosine similarity between each pair of texts authored by different authors, then average the scores to get a more comprehensive view of stance alignment across all texts.

Note that these similarity scores are not fixed properties of the authors: they depend on which texts are included and on the contexts in which the keywords appear. They can nonetheless provide useful insight into how each author's stance relates to the others'. This reinforces the idea that **stance is not a fixed attribute**, but rather a dynamic and context-dependent aspect of language.
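
To see why the two summaries can give different numbers, the toy sketch below compares them on made-up 2-dimensional vectors. The groups and helper names are illustrative assumptions. In this toy case, averaging first cancels the variation inside each group, so the similarity of the mean embeddings comes out higher than the mean of the pairwise similarities.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Two toy groups (hypothetical): each varies around the same central direction
group_a = [[1.0, 0.5], [1.0, -0.5]]
group_b = [[1.0, 0.5], [1.0, -0.5]]

# Approach 1: cosine similarity between the two mean embeddings
sim_of_means = cosine_sim(mean_vector(group_a), mean_vector(group_b))

# Approach 2: average of all cross-group pairwise cosine similarities
pairwise = [cosine_sim(a, b) for a in group_a for b in group_b]
mean_pairwise = sum(pairwise) / len(pairwise)

print(f"Similarity of mean embeddings: {sim_of_means:.4f}")
print(f"Mean pairwise similarity:      {mean_pairwise:.4f}")
```

The pairwise average keeps the sentence-level spread, which is one reason to report both figures rather than only the first.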

In [20]:
# Compute the cosine similarity between the authors' mean embeddings
mean_crease = np.mean(embeddings_dict["Crease"], axis=0, keepdims=True)
mean_begbie = np.mean(embeddings_dict["Begbie"], axis=0, keepdims=True)
mean_act_1884 = np.mean(embeddings_dict["Act 1884"], axis=0, keepdims=True)

sim_crease_begbie = cosine_similarity(mean_crease, mean_begbie)[0, 0]
sim_crease_act_1884 = cosine_similarity(mean_crease, mean_act_1884)[0, 0]
sim_begbie_act_1884 = cosine_similarity(mean_begbie, mean_act_1884)[0, 0]

print(f"Cosine similarity between mean Crease and mean Begbie: {sim_crease_begbie:.4f}")
print(f"Cosine similarity between mean Crease and mean Act 1884: {sim_crease_act_1884:.4f}")
print(f"Cosine similarity between mean Begbie and mean Act 1884: {sim_begbie_act_1884:.4f}")

Cosine similarity between mean Crease and mean Begbie: 0.9886
Cosine similarity between mean Crease and mean Act 1884: 0.9663
Cosine similarity between mean Begbie and mean Act 1884: 0.9624


In [21]:
# Define a function to compute the mean pairwise cosine similarity
def mean_cosine_similarity(embeddings1, embeddings2):
    similarities = [
        1 - cosine(e1, e2)
        for e1 in embeddings1
        for e2 in embeddings2
    ]
    return sum(similarities) / len(similarities)

# Extract embeddings for Crease, Begbie and the Act 1884
crease_emb = embeddings_dict["Crease"]
begbie_emb = embeddings_dict["Begbie"]
act_1884_emb = embeddings_dict["Act 1884"]

# Compute mean similarities
crease_begbie_sim = mean_cosine_similarity(crease_emb, begbie_emb)
crease_act_sim = mean_cosine_similarity(crease_emb, act_1884_emb)
begbie_act_sim = mean_cosine_similarity(begbie_emb, act_1884_emb)

# Output
print(f"Mean cosine similarity between Crease and Begbie embeddings: {crease_begbie_sim:.4f}")
print(f"Mean cosine similarity between Crease and Act 1884 embeddings: {crease_act_sim:.4f}")
print(f"Mean cosine similarity between Begbie and Act 1884 embeddings: {begbie_act_sim:.4f}")

Mean cosine similarity between Crease and Begbie embeddings: 0.8661
Mean cosine similarity between Crease and Act 1884 embeddings: 0.8391
Mean cosine similarity between Begbie and Act 1884 embeddings: 0.8424


#### Visualizing Text Embeddings

The embeddings themselves are high-dimensional vectors (in our case, 768-dimensional), but we can visualize them in a lower-dimensional space (e.g., 2D or 3D) using a **dimensionality reduction** technique such as **UMAP** (Uniform Manifold Approximation and Projection).

Here, **UMAP** projects the embeddings into a 2D space while aiming to preserve local structure, which makes it well suited to visualizing them.

Using **Plotly Express**, we create an interactive scatter plot where each point represents a text snippet, colored by author, with hover functionality to display the corresponding sentence. This visualization can highlight clusters and relationships between snippets, offering insights into semantic similarities across authors.


In [22]:
all_vecs = np.vstack(embeddings_dict["Crease"] + embeddings_dict["Begbie"] + embeddings_dict["Act 1884"])
labels  = (["Crease"] * len(embeddings_dict["Crease"])) + (["Begbie"] * len(embeddings_dict["Begbie"])) + (['Act 1884'] * len(embeddings_dict["Act 1884"]))

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1)
proj = reducer.fit_transform(all_vecs) 

def wrap_text(text, width=60):
    return '<br>'.join(textwrap.wrap(text, width=width))

In [23]:
umap_df = pd.DataFrame(proj, columns=['UMAP 1', 'UMAP 2'])
umap_df['Author'] = labels
umap_df['Text'] = [snip for auth in act_snippets for snip in act_snippets[auth]]
umap_df['Text'] = umap_df['Text'].apply(lambda t: wrap_text(t, width=60))

fig = px.scatter(umap_df, x='UMAP 1', y='UMAP 2', 
                 color='Author', hover_data=['Text'], 
                 width=800, height=500 )
fig.update_traces(marker=dict(size=5))
fig.update_layout(title='UMAP Projection of Snippet Embeddings by Author')
fig.show()

#### Investigating Texts

The stance embeddings ultimately serve as analytical tools to support our text analysis objectives. 

- By calculating the "conceptual mean stance" for each author, we gain a quantitative basis for comparing the positions of different authors. 
- However, embeddings alone cannot fully capture the nuances of language or the complexity of an author's stance. To truly understand the perspectives reflected in the texts, it is essential to investigate the sentences that are most similar to the conceptual average position of each author.

Here, we will examine the top 10 sentences with the highest stance similarity to the mean stance of each author. 

This approach allows us to delve deeper into the texts, uncovering how the language used aligns with the calculated average stance and providing richer insights into the authors' positions on the issue of Chinese immigrants.
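
The ranking step can be sketched on toy data as follows. The snippets, vectors, and helper names are made-up assumptions for illustration; the notebook cells below apply the same idea to the real sentence embeddings.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy snippets and 2-dimensional "embeddings" (hypothetical values)
snippets = ["sentence A", "sentence B", "sentence C"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

# Mean vector across all snippets in the group
mean_vec = [sum(col) / len(vectors) for col in zip(*vectors)]

# Score every snippet against the mean and keep the closest two
scored = sorted(
    ((snip, cosine_sim(vec, mean_vec)) for snip, vec in zip(snippets, vectors)),
    key=lambda pair: pair[1],
    reverse=True,
)
top_k = scored[:2]
for snip, score in top_k:
    print(f"{snip}: {score:.4f}")
```

In the notebook, the mean is taken over one author's snippets while snippets from all three sources are scored against it, which is why sentences by other authors can appear in an author's top 10.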

In [24]:
# Print out the 10 most similar embedding sentences to Crease's mean embedding

crease_similarity_df = pd.DataFrame(columns=['Author', 'Text', 'Similarity Score'])

# Iterate through the embeddings and their corresponding sentences
for auth, snippets in act_snippets.items():
    for snippet, emb in zip(snippets, embeddings_dict[auth]):
        similarity = cosine_similarity(emb.reshape(1, -1), mean_crease)[0][0]
        crease_similarity_df.loc[len(crease_similarity_df)] = [auth, snippet, similarity]
        
# Sort by similarity score
crease_sorted_similarity = crease_similarity_df.sort_values(by='Similarity Score', ascending=False)

print("Top 10 most similar sentences to Crease's mean embedding:\n")
for _, row in crease_sorted_similarity.head(10).iterrows():
    wrapped_para = textwrap.fill(row['Text'], width=100)
    print(f"Author: {row['Author']}\nSentence: {wrapped_para}\nSimilarity Score: {row['Similarity Score']:.4f}\n")

Top 10 most similar sentences to Crease's mean embedding:

Author: Crease
Sentence: The Act is found associated with another Act now disallowed, the express object of which is to
prevent the Chinese altogether from coming to this country, and the principle "noscitur a sociis" is
kept up by the preamble of the present Act, which describes the Chinese in terms which, I venture to
think, have never before in any other country found a place in an Act of Parliament.
Similarity Score: 0.9652

Author: Crease
Sentence: In the case of the Chinese treaties, they were forced at the point of the bayonet on China, to
obtain a right for us to enter China, and in return for a similar permission to us, full permission
was given for the Chinese to trade and reside in British dominions everywhere.
Similarity Score: 0.9625

Author: Begbie
Sentence: Statutes were by their title and preamble  expressly aimed at Chinamen by name; that this
distinction also renders inapplicable all the United States' cases c

In [25]:
# Print out the 10 most similar embedding sentences to Begbie's mean embedding

begbie_similarity_df = pd.DataFrame(columns=['Author', 'Text', 'Similarity Score'])

# Iterate through the embeddings and their corresponding sentences
for auth, snippets in act_snippets.items():
    for snippet, emb in zip(snippets, embeddings_dict[auth]):
        similarity = cosine_similarity(emb.reshape(1, -1), mean_begbie)[0][0]
        begbie_similarity_df.loc[len(begbie_similarity_df)] = [auth, snippet, similarity]
        
# Sort by similarity score
begbie_sorted_similarity = begbie_similarity_df.sort_values(by='Similarity Score', ascending=False)

print("Top 10 most similar sentences to Begbie's mean embedding:\n")
for _, row in begbie_sorted_similarity.head(10).iterrows():
    wrapped_para = textwrap.fill(row['Text'], width=100)
    print(f"Author: {row['Author']}\nSentence: {wrapped_para}\nSimilarity Score: {row['Similarity Score']:.4f}\n")

Top 10 most similar sentences to Begbie's mean embedding:

Author: Begbie
Sentence: Statutes were by their title and preamble  expressly aimed at Chinamen by name; that this
distinction also renders inapplicable all the United States' cases cited; that this enactment is
quite general extending to all laundries without exception and we must not look beyond the words of
the enactment to enquire what its object was; that there is in fact one laundry in Victoria not
conducted by Chinamen on which the tax will fall with equal force so that it is impossible to say
that Chinamen are hereby exclusively selected for taxation; the circumstance that they are chiefly
affected being a mere coincidence; that the bylaw only imposes $100.00 per annum, keeping far within
the limit of $150.00 permitted by the Statute; that the tax clearly is calculated to procuring
additional Municipal revenue and that no other object is hinted at.
Similarity Score: 0.9666

Author: Begbie
Sentence: When we find (1st) no

In [26]:
# Print out the 10 most similar embedding sentences to the Regulation Act's mean embedding

regulation_similarity_df = pd.DataFrame(columns=['Author', 'Text', 'Similarity Score'])

# Iterate through the embeddings and their corresponding sentences
for auth, snippets in act_snippets.items():
    for snippet, emb in zip(snippets, embeddings_dict[auth]):
        similarity = cosine_similarity(emb.reshape(1, -1), mean_act_1884)[0][0]
        regulation_similarity_df.loc[len(regulation_similarity_df)] = [auth, snippet, similarity]
        
# Sort by similarity score
regulation_sorted_similarity = regulation_similarity_df.sort_values(by='Similarity Score', ascending=False)

print("Top 10 most similar sentences to the Regulation Act's mean embedding:\n")
for _, row in regulation_sorted_similarity.head(10).iterrows():
    wrapped_para = textwrap.fill(row['Text'], width=100)
    print(f"Author: {row['Author']}\nSentence: {wrapped_para}\nSimilarity Score: {row['Similarity Score']:.4f}\n")

Top 10 most similar sentences to the Regulation Act's mean embedding:

Author: Act 1884
Sentence: In case any employer of Chinese fails to deliver to the Collector the list mentioned in the
preceding section, when required so to do, or knowingly states anything falsely therein, such
employer shall, on complaint of the Collector and upon conviction before a Justice of the Peace
having jurisdiction within the district wherein such employer carries on his business, forfeit and
pay a fine not exceeding one hundred dollars for every Chinese in his employ, to be recovered by
distress of the goods and chattels of such employer failing to pay the same, or in lieu thereof
shall be liable to imprisonment for a period not less than one month and not exceeding two calendar
months.
Similarity Score: 0.9662

Author: Act 1884
Sentence: The Toll Collector at any and every toll gate which may exist in the Province from time to time,
shall, before allowing any Chinese to pass through any toll gate, dema

### Topic Mining and Alignment Analysis

### Sentiment Analysis

### LLM and Zero-Shot Classification 

In [27]:
# Join each author's texts into a single string
act_1884_full = " ".join(act_1884)
crease_cases_full = " ".join(crease_cases)
begbie_cases_full = " ".join(begbie_cases)

full_cases = {"Crease": crease_cases_full, "Begbie": begbie_cases_full, "Act 1884": act_1884_full}

In [28]:
# We create a dictionary to hold the full snippets for each author
full_snippets = {}
for author, text in full_cases.items():
    sentence = sent_tokenize(text)
    snippets = []
    for sent in sentence:
        if len(sent) > 30:  # Filter out short and meaningless sentences created by tokenization
            snippets.append(sent)
            
    full_snippets[author] = snippets

In [29]:
# Create a DataFrame to display snippet size by author
snippet_sizes = [{'Author': auth, 'Snippet Count': len(snippets)} for auth, snippets in full_snippets.items()]
snippet_sizes_df = pd.DataFrame(snippet_sizes)

# Display the DataFrame
print(snippet_sizes_df)

     Author  Snippet Count
0    Crease            274
1    Begbie            201
2  Act 1884             40


In [30]:
# Create pipeline for zero-shot classification
from transformers import pipeline

zero_shot = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    tokenizer="facebook/bart-large-mnli",
    hypothesis_template="This legal text {}."
)

labels = [
    "advocates for equal legal treatment of Chinese immigrants compared to white or European settlers, opposing racial discrimination",
    "describes the status or treatment of Chinese immigrants without expressing support or opposition to racial inequality",
    "justifies or reinforces unequal legal treatment of Chinese immigrants relative to white or European settlers, supporting racially discriminatory policies"
]

def get_scores(snippet):
    out = zero_shot(snippet, candidate_labels=labels)
    return dict(zip(out["labels"], out["scores"]))




Device set to use cpu
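
The zero-shot pipeline returns its labels sorted by score rather than in the order they were supplied, which is why `get_scores` zips labels and scores into a dictionary before anything is looked up by label. The sketch below illustrates that mapping with a hand-written stand-in for one pipeline result. The short label names and the scores are hypothetical, and no model is loaded.

```python
# Labels in the order we supplied them (shortened, hypothetical names)
candidate_labels = ["supports", "neutral", "opposes"]

# Hand-written stand-in for one pipeline result: labels arrive sorted by score
out = {
    "sequence": "example sentence",
    "labels": ["opposes", "supports", "neutral"],
    "scores": [0.7, 0.2, 0.1],
}

# Map each label to its score, then read the scores back in our own order
by_label = dict(zip(out["labels"], out["scores"]))
row = [by_label[label] for label in candidate_labels]
print(row)
```

Reading `out["scores"]` by position alone would attach the scores to the wrong labels whenever the top label is not the first one supplied.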


In [31]:
# Run zero-shot classification on every snippet from each source (Crease, Begbie, and the Act 1884)
act_scores = {}

for auth, snippets in full_snippets.items():
    scores = []
    for snip in snippets:
        score = get_scores(snip)
        scores.append(score)
    act_scores[auth] = scores

rows = []

for auth, snippets in full_snippets.items():
    for snip, score_dict in zip(snippets, act_scores[auth]):
        row = {
            "Author": auth,
            "Text": snip,
            "Pro": score_dict[labels[0]],
            "Neutral": score_dict[labels[1]],
            "Cons": score_dict[labels[2]]
        }
        rows.append(row)

# Create DataFrame to store the scores
df_scores = pd.DataFrame(rows)

In [32]:
# Print out the top 10 sentences with the highest "Pro" scores
top_pro_sentences = df_scores.nlargest(10, 'Pro')

print("\nTop 10 sentences with the highest 'Pro' scores:\n")

for _, row in top_pro_sentences.iterrows():
    wrapped_para = textwrap.fill(row['Text'], width=100)
    print(f"Author: {row['Author']}\nSentence: {wrapped_para}\nPro Score: {row['Pro']:.4f}\n")


Top 10 sentences with the highest 'Pro' scores:

Author: Begbie
Sentence: Justices FIELD, HOFFMAN, SAWYER and DEADY and other Judges whom they cite, all confirm this, that a
State, or Provincial law imposing special disabilities or unequal burdens on Chinamen is
unconstitutional and void.
Pro Score: 0.9632

Author: Crease
Sentence: 6, ratified 26th June, 1843, p. 221, and Lord Elgin's treaty of October, 1860, authenticated copies
of which were produced in Court, secure to Chinese coming into British dominions the same "full
security for persons and property as subjects of Her Majesty."
Pro Score: 0.9309

Author: Crease
Sentence: The treaties I have quoted between Great Britain and China, binding on the Dominion and on us in
British Columbia, secure to the Chinese, just as the treaties between Great Britain and other
foreign countries secure to other foreigners, the same rights in regard to the equality of taxation
which I have described as being enjoyed by citizens of this country.
Pr

In [33]:
# Print out the top 10 sentences with the highest "Cons" scores
top_cons_sentences = df_scores.nlargest(10, 'Cons')

print("\nTop 10 sentences with the highest 'Cons' scores:\n")

for _, row in top_cons_sentences.iterrows():
    wrapped_para = textwrap.fill(row['Text'], width=100)
    print(f"Author: {row['Author']}\nSentence: {wrapped_para}\nCons Score: {row['Cons']:.4f}\n")


Top 10 sentences with the highest 'Cons' scores:

Author: Crease
Sentence: And again, "A tax imposed by the law on these persons for the mere right to reside here, is an
appropriate and effective means to discourage the immigration of the Chinese into the State."
Cons Score: 0.9659

Author: Crease
Sentence: He reviewed the legislation against Chinese since confederation, contending it was levelled against
a particular race of aliens and, therefore, beyond provincial control, per *Gwynne*, J., in
*Citizens Insurance Co. v. Parsons*, 4 S. C. R., at p. 346.
Cons Score: 0.9637

Author: Crease
Sentence: The power asserted in the Act in question (the California Act) is the right of the State to
prescribe the terms upon which the Chinese shall be permitted to reside in it, and be so used as to
cut off all intercourse between them and the people of the State, and defeat the commercial policy
of the nation.
Cons Score: 0.9594

Author: Crease
Sentence: The provisions of the Act I have given som

In [34]:
# Group by author and calculate mean and median scores
mean_scores = df_scores.groupby("Author")[["Pro", "Neutral", "Cons"]].mean()
median_scores = df_scores.groupby("Author")[["Pro", "Neutral", "Cons"]].median()

print("Mean scores by author:")
print(mean_scores)

print("\nMedian scores by author:")
print(median_scores)


Mean scores by author:
               Pro   Neutral      Cons
Author                                
Act 1884  0.291177  0.181931  0.526892
Begbie    0.287563  0.301680  0.410758
Crease    0.259287  0.289256  0.451457

Median scores by author:
               Pro   Neutral      Cons
Author                                
Act 1884  0.298038  0.160079  0.514868
Begbie    0.285689  0.276666  0.395044
Crease    0.259358  0.254776  0.430397


In [35]:
df_scores['Text'] = df_scores['Text'].apply(lambda t: wrap_text(t, width = 50))

fig = px.scatter(
    df_scores,
    x="Pro",
    y="Cons",
    color="Author",
    hover_data=["Text"],
    title="Pros vs Cons Scores by Author",
    width=800,
    height=600
)

fig.update_traces(marker=dict(size=5))
fig.show()

In [36]:
# Define a function to chunk text into overlapping windows
def chunk_into_windows(text, max_tokens=512, stride=128):
    # Note: token counts use the notebook's global `tokenizer`, which may
    # count differently from the classifier's own tokenizer.

    # Break into sentences first for cleaner boundaries
    sents = sent_tokenize(text)
    windows = []
    current = ""
    for sent in sents:
        # Tentative window if we add this sentence
        cand = current + " " + sent if current else sent
        # Count tokens
        n_tokens = len(tokenizer.encode(cand, add_special_tokens=False))
        if n_tokens <= max_tokens:
            current = cand
        elif current:
            # Finalize the current window, then start a new one from its overlapping tail
            windows.append(current)
            # Keep the last `stride` tokens of the finished window
            tail_ids = tokenizer.encode(current, add_special_tokens=False)[-stride:]
            tail_text = tokenizer.decode(tail_ids)
            current = tail_text + " " + sent
        else:
            # A single sentence longer than max_tokens starts its own window
            # (avoids appending an empty window)
            current = sent
    if current:
        windows.append(current)
    return windows
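
The function above builds windows sentence by sentence and carries the last `stride` tokens of one window into the next. The overlap idea itself can be sketched in isolation on whitespace-separated tokens. `sliding_windows` is a hypothetical helper that uses plain fixed-size windowing, a simpler technique than the sentence-aware logic in the notebook.

```python
def sliding_windows(tokens, max_tokens=6, overlap=2):
    """Split tokens into windows of at most max_tokens items,
    each sharing `overlap` items with the previous window."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows

tokens = "a b c d e f g h i j".split()
for win in sliding_windows(tokens):
    print(" ".join(win))
```

The overlap lets a passage that straddles a window boundary appear intact in at least one window, at the cost of classifying some text twice.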

In [37]:
# Run classification per author
rows = []
for author, text in full_cases.items():
    
    windows = chunk_into_windows(text, max_tokens=256, stride=64)
    
    # classify each window
    for win in windows:
        out = zero_shot(win, candidate_labels=labels, truncation=True, max_length=256)
        # Extract scores and labels
        score_dict = dict(zip(out["labels"], out["scores"]))
        rows.append({
            "Author": author,
            "Text": win,
            "Pro": score_dict[labels[0]],
            "Neutral": score_dict[labels[1]],
            "Cons": score_dict[labels[2]]
        })

all_scores = pd.DataFrame(rows)

In [38]:
# Print out the top 5 windows with the highest "Pro" scores
top_pro_windows = all_scores.nlargest(5, 'Pro')

print("\nTop 5 windows with the highest 'Pro' scores:\n")
for _, row in top_pro_windows.iterrows():
    wrapped_para = textwrap.fill(row['Text'], width=100)
    print(f"Author: {row['Author']}\nWindow: {wrapped_para}\nPro Score: {row['Pro']:.4f}\n")


Top 5 windows with the highest 'Pro' scores:

Author: Begbie
Window: from the constitution and the relation between the dominion and the provinces. but the judges in
those foreign courts have had a much longer and more varied experience on these topics than
ourselves ; their institutions are closely analogous in many judgment. respects, though, it is true,
contrasted in others to our own. And their opinions and reasonings being also founded on
international law, and, I take the liberty of saying, on natural equity and common sense, they are
entitled to great weight beyond the limits of their own jurisdiction. I shall only mention Lee Sing
v. Washburn, 20 Cal. Rep. 354; Baker v. Portland, 5 Law 750; Teburcio Parrott's case, coram SAWYER
and HOFFMAN, J.J., 1880, and the Quene ordinance case, coram FIELD and SAWYER, J.J., 1879; the two
latter cases published in a separate pamphlet form, in which the opinions of Mr. Justices FIELD,
HOFFMAN, SAWYER and DEADY and other Judges whom they cite

In [39]:
# Print out the top 5 windows with the highest "Cons" scores
top_cons_windows = all_scores.nlargest(5, 'Cons')

print("\nTop 5 windows with the highest 'Cons' scores:\n")
for _, row in top_cons_windows.iterrows():
    wrapped_para = textwrap.fill(row['Text'], width=100)
    print(f"Author: {row['Author']}\nWindow: {wrapped_para}\nCons Score: {row['Cons']:.4f}\n")


Top 5 windows with the highest 'Cons' scores:

Author: Crease
Window: act the legal presumption of innocence until conviction is reversed ; in every case the onus
probandi, though in a statute highly penal, is shifted from the informant on to the shoulders of the
accused, and he a foreigner not knowing one word of the law, or even the language of the accuser. In
other words, every Chinese is guilty until proved innocent—a provision which fills one conversant
with subjects with alarm; for if such a law can be tolerated as against Chinese, the precedent is
set, and in time of any popular outcry can easily be acted on for putting any other foreigners or
even special classes among ourselves, as coloured people, or French, Italians, Americans, or
Germans, under equally the same law. That certainly is interfering with aliens. The proposition that
it is a Provincial tax for revenue purposes, supposing it to be so intended under the provisions of
the Act, is so manifestly calculated to defeat

In [40]:
# Calculate the mean scores and median scores for each author
mean_scores = all_scores.groupby("Author")[["Pro", "Neutral", "Cons"]].mean()
median_scores = all_scores.groupby("Author")[["Pro", "Neutral", "Cons"]].median()

print("Mean scores by author:")
print(mean_scores)

print("\nMedian scores by author:")
print(median_scores)

Mean scores by author:
               Pro   Neutral      Cons
Author                                
Act 1884  0.289885  0.127291  0.582824
Begbie    0.325492  0.288779  0.385729
Crease    0.263973  0.216049  0.519978

Median scores by author:
               Pro   Neutral      Cons
Author                                
Act 1884  0.316400  0.121227  0.581850
Begbie    0.327127  0.301877  0.314771
Crease    0.217157  0.196228  0.520265


### Conclusion

### References

1. *Regina v. Wing Chong*, 1 B.C.R. Pt. II 150 (1885). 
2. *Wong Hoy Woon v. Duncan*, 3 B.C.R. 318 (1894). 
3. *Regina v. Mee Wah*, 3 B.C.R. 403 (1886).
4. *Regina v. Corporation of Victoria*, 1 B.C.R. Pt. II 331 (1888). 
5. *Quong Wing v. The King*, 49 S.C.R. 440 (1914). 
6. Law Society of British Columbia. (1896). *The British Columbia Reports: Being reports of cases determined in the Supreme and County Courts and in Admiralty and on appeal in the Full Court and Divisional Court* (Vol. 3). Victoria, BC: The Province Publishing Company.
7. Canada. Royal Commission on Chinese Immigration. (1885). *Report of the Royal Commission on Chinese Immigration: report and evidence*. Ottawa: Printed by order of the Commission. 
8. Thomas, P. (2012, June 12–14). Courts of last resort: The judicialization of Asian Canadian politics 1878 to 1913. Paper presented at the Annual Conference of the Canadian Political Science Association, University of Alberta, Edmonton, Canada. Retrieved from <https://cpsa-acsp.ca/papers-2012/Thomas-Paul.pdf> 
9. McLaren, J. P. S. (1991). The Early British Columbia Supreme Court and the "Chinese Question": Echoes of the rule of law. Manitoba Law Journal, 20(1), 107–147. Retrieved from <https://www.canlii.org/w/canlii/1991CanLIIDocs168.pdf> 
10. Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). LEGAL-BERT: The Muppets straight out of Law School (arXiv:2010.02559). arXiv. <https://doi.org/10.48550/arXiv.2010.02559>
11. Loo, T. (1994). Crease, Sir Henry Pering Pellew. In Dictionary of Canadian Biography (Vol. 13). University of Toronto/Université Laval. Retrieved August 8, 2025, from <https://www.biographi.ca/en/bio/crease_henry_pering_pellew_13E.html> 
12. Williams, D. R. (1990). Begbie, Sir Matthew Baillie. In Dictionary of Canadian Biography (Vol. 12). University of Toronto/Université Laval. Retrieved August 8, 2025, from <https://www.biographi.ca/en/bio/begbie_matthew_baillie_12E.html> 
13. Ariai, F., Mackenzie, J., & De Martini, G. (2025). *Natural Language Processing for the legal domain: A survey of tasks, datasets, models and challenges*. arXiv preprint arXiv:2410.21306.