## Text Embeddings for Regina v. Wing Chong (1885)

In [1]:
# Data Wrangling
import os
import numpy as np
import pandas
from nltk.tokenize import word_tokenize



In [2]:
with open('data/Regina_V_Wing_Chong.txt', encoding='utf-8') as f:
    full_text = f.read()
print(full_text)

CREASE, J. 1885. REGINA v. WING CHONG. 

14th & 15th July, Certiorari—“Chinese Regulation Act, 1884,” s. 5—Constitutionality—B.N.A. Act, 1867, ss. 91, 92—“Aliens”—“Trade and Commerce”—Taxation. 
On the return to a writ of certiorari. Held, that the “Chinese Regulation Act, 1884,” is ultra vires of the Provincial Legislature, on the following grounds: 
1. It is an interference with the rights of Aliens. 
2. It is an interference with Trade and Commerce. 
3. It is an infraction of the existing treaties between the Imperial Government and China. 
4. It imposes unequal taxation. 

14th & 15th July—On the return of a writ of certiorari directed to Edwin Johnson, Esquire, Police Magistrate for the City of Victoria, to return into this Court a certain conviction made by him under which one Wing Chong was fined $20 for not having in his possession a license issued under the “Chinese Regulation Act, 1884.”

The Attorney-General in support of the conviction said there were five points raised on 

### BERT Word Embeddings

In [3]:
import re

def clean_text(text):
    
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    
    return text.strip()

text_cleaned = clean_text(full_text)
print(text_cleaned[:500])  # Print the first 500 characters of the cleaned text

crease j 1885 regina v wing chong 

14th  15th july certiorarichinese regulation act 1884 s 5constitutionalitybna act 1867 ss 91 92alienstrade and commercetaxation 
on the return to a writ of certiorari held that the chinese regulation act 1884 is ultra vires of the provincial legislature on the following grounds 
1 it is an interference with the rights of aliens 
2 it is an interference with trade and commerce 
3 it is an infraction of the existing treaties between the imperial government and c


In [4]:
# Load pre-trained BERT tokenizer and model
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

In [5]:

# Create the word embeddings
# Tokenize the cleaned text into words
tokens = word_tokenize(text_cleaned)

token_frequencies = {}

for token in tokens:
    token_frequencies[token] = token_frequencies.get(token, 0) + 1

In [6]:
sorted_tokens = sorted(token_frequencies.items(), key=lambda x: x[1], reverse=True)

# Example: print the 20 most frequent tokens
print("Most frequent tokens:")
for token, freq in sorted_tokens[:20]:
    print(f"{token}: {freq}")

Most frequent tokens:
the: 629
of: 360
and: 254
to: 234
in: 181
a: 133
that: 109
as: 100
is: 87
it: 83
be: 82
act: 80
or: 78
chinese: 74
for: 74
by: 70
not: 62
with: 60
on: 58
was: 58
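
The manual dictionary count above is equivalent to `collections.Counter` from the standard library. A minimal sketch, assuming a small hypothetical token list in place of the full corpus:

```python
from collections import Counter

# Hypothetical tokens standing in for word_tokenize(text_cleaned)
sample_tokens = ["the", "act", "of", "the", "chinese", "act", "the"]

# Counter builds the same token -> frequency mapping in one step
sample_frequencies = Counter(sample_tokens)

# most_common(n) returns the n most frequent (token, count) pairs, highest first
for token, freq in sample_frequencies.most_common(2):
    print(f"{token}: {freq}")
```

`most_common` also replaces the separate `sorted(...)` call, since it already returns the pairs in descending order of frequency.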


In [7]:
import re
# Build ethnicity vocabulary
ethnicities = [
    "chinese", "japanese", "black", "white", "yellow", "chinamans",
    "canada", "american", "european", "china", "chinaman", "britain",
    "canadian", "latino", "mongolian", "asian", "indian", "india", "english",
    "british", "america", "columbia", "ontario", "australia", "australian",
    "germans", "german", "chinamen", "italian", "italy", "french", "france"
]

pattern = re.compile(r"\b(" + "|".join(map(re.escape, ethnicities)) + r")\b", flags=re.IGNORECASE)

# Mask in any string
def mask_ethnicity(tokens):
    masked = []
    for tok in tokens:
        masked.append(pattern.sub("[MASK]", tok))
        
    return masked

In [8]:
example_word = ["chinaman", "chinese women"]

mask_ethnicity(example_word)

['[MASK]', '[MASK] women']
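
Because the pattern is anchored with `\b` word boundaries, it only replaces whole words, so a longer word that merely starts with a vocabulary term is left alone. A small sketch, assuming a shortened hypothetical vocabulary:

```python
import re

# Shortened, hypothetical vocabulary for illustration only
sample_terms = ["india", "indian", "chinese"]
sample_pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, sample_terms)) + r")\b",
    flags=re.IGNORECASE,
)

def mask_terms(text):
    """Replace every whole-word vocabulary match with [MASK]."""
    return sample_pattern.sub("[MASK]", text)

print(mask_terms("Chinese merchants"))  # whole-word match, case-insensitive
print(mask_terms("indiana"))            # no match: 'india' is only a prefix here
print(mask_terms("indian"))             # matches 'indian' as a whole word
```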

In [9]:

# Get unique words to avoid redundant computation
unique_tokens = list(set(tokens))

unique_tokens = mask_ethnicity(unique_tokens)

# Include the word "chinese" as our target
unique_tokens.append("chinese")

# Print the shape of unique tokens
print(f'There are {len(unique_tokens)} unique tokens in this corpus.')

There are 1600 unique tokens in this corpus.
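
Masking is applied after deduplication, so each masked vocabulary term becomes its own `[MASK]` entry and the reported count may include repeats. A minimal sketch of deduplicating again after masking, assuming a hypothetical stand-in for `mask_ethnicity`:

```python
# Hypothetical stand-in for mask_ethnicity: masks two example terms
def mask_sample(tokens):
    return ["[MASK]" if tok in {"china", "canada"} else tok for tok in tokens]

sample_unique = ["act", "china", "canada", "tax"]  # already deduplicated
masked = mask_sample(sample_unique)                # now contains "[MASK]" twice

# dict.fromkeys removes repeats while keeping first-seen order
deduplicated = list(dict.fromkeys(masked))

print(len(masked), len(deduplicated))
print(deduplicated)
```

The embedding dictionary built later is keyed by word, so repeated `[MASK]` entries collapse there anyway; the second deduplication mainly makes the printed count accurate.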


In [10]:
# Prepare a dictionary to store word embeddings
bert_word_embeddings = {}

# For each word, get its BERT embedding by feeding the word on its own, without sentence context
for word in unique_tokens:
    word_inputs = tokenizer(word, return_tensors='pt', truncation=True, max_length=10)
    with torch.no_grad():
        word_outputs = bert_model(**word_inputs)
        # Use the [CLS] token embedding as the word embedding
        word_embedding = word_outputs.last_hidden_state[:, 0, :].squeeze().numpy()
        bert_word_embeddings[word] = word_embedding
    

In [11]:
# Print embedding for the word of interest 'chinese'

print(f"BERT embedding for 'chinese':\n{bert_word_embeddings.get('chinese')}")

BERT embedding for 'chinese':
[-1.73933744e-01  2.28281140e-01 -4.60987777e-01 -3.10309321e-01
  6.09321818e-02 -1.18675649e-01  5.69706410e-02  5.09568810e-01
 -1.16129011e-01 -2.93345749e-01 -2.01437280e-01 -1.02072828e-01
  3.72352898e-02  4.77217045e-03 -1.51318666e-02 -1.24202862e-01
 -2.10853815e-01  4.64721173e-01  3.23807836e-01 -4.53069210e-02
 -4.75586876e-02 -2.62311012e-01 -4.20340359e-01 -2.10749120e-01
 -1.17265537e-01 -2.15970218e-01  1.14206724e-01 -1.76633611e-01
  8.67219642e-02 -2.59426124e-02 -1.39534533e-01  3.87654811e-01
 -1.36946052e-01  4.76951659e-01  3.38810384e-02  1.88151821e-01
 -2.12915927e-01 -1.21353865e-01  7.67081231e-02  1.45254448e-01
 -2.10498676e-01 -1.03174210e-01  2.61935860e-01 -3.14799324e-02
  2.80385673e-01 -3.71496201e-01 -1.69398880e+00 -1.00097865e-01
 -4.19999659e-01 -2.02351600e-01  2.84618661e-02  6.71245903e-02
  2.95703322e-01  2.80955940e-01 -2.57328600e-02  4.13093179e-01
 -3.49687040e-01  3.93703401e-01  1.61472648e-01  2.30470523

In [12]:
# Compute the cosine similarity between 'chinese' and every other word embedding
from scipy.spatial.distance import cosine

similarity_scores = {}

for other_word in bert_word_embeddings.keys():
    if other_word != "chinese":
        similarity = 1 - cosine(bert_word_embeddings["chinese"], bert_word_embeddings[other_word])
        similarity_scores[other_word] = similarity

# Sort by cosine similarity
sorted_similarity = sorted(similarity_scores.items(), key=lambda x: x[1], reverse=True)

# Print the top 10 most similar words
print("Top 10 most similar words to 'chinese':")
for word, score in sorted_similarity[:10]:
    print(f"{word}: {score:.4f}")

Top 10 most similar words to 'chinese':
incompatible: 0.9546
introduced: 0.9533
internal: 0.9530
peace: 0.9526
employers: 0.9518
lin: 0.9513
statistics: 0.9510
assessed: 0.9504
natural: 0.9502
treated: 0.9502
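
`scipy.spatial.distance.cosine` returns a distance, which is why the cell above subtracts it from 1 to obtain a similarity. The underlying formula is the dot product divided by the product of the vector norms. A dependency-free sketch with made-up vectors:

```python
import math

def cosine_similarity_plain(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, not real BERT embeddings
same_direction = cosine_similarity_plain([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_plain([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(round(same_direction, 4), round(orthogonal, 4))
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score 0.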


In [13]:
# Generate a t-SNE plot for visualization
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42)

word_embeddings = np.array(list(bert_word_embeddings.values()))
tsne_results = tsne.fit_transform(word_embeddings)

In [14]:
import plotly.express as px
# Create a DataFrame for visualization
df_tsne = pandas.DataFrame(tsne_results, columns=['x', 'y'])
df_tsne['word'] = list(bert_word_embeddings.keys())
# Highlight the word 'chinese' in the plot
df_tsne['highlight'] = df_tsne['word'].apply(lambda x: 'chinese' if x == 'chinese' else '')

fig = px.scatter(
    df_tsne,
    x='x',
    y='y',
    title='t-SNE Visualization of BERT Word Embeddings',
    color='highlight',                        
    hover_data=['word'], 
    text= 'highlight'
)

fig.show()




### Sentence Embeddings

In [15]:
from pathlib import Path

# Read the txt file as lines
lines = Path("data/Regina_V_Wing_Chong.txt").read_text(encoding="utf-8").splitlines()

# Extract the line at index 91 (zero-based) as the target
target = lines[91]
print("Line 91:", target)


Line 91: The aliens in this case being Chinese, the first enquiry must be, what is the object of the Act? On applying to the preamble, we find that it looks like a bill of indictment as against a race not suited to live among a civilized nation, and certainly does not prepare one for legislation which would encourage or tolerate their settlement in the country. Indeed, the first lines of the preamble sound an alarm at the multitude of people coming in, who are of the repulsive habits described in the last part of the preamble, and prepares one for measures which should have a tendency to abate that alarm by deterrent influences and enactments which should have the effect of materially lessening the number of such undesirable visitors. The provisions of the Act I have given somewhat in extenso bear out that view, and the concurrent and previous local legislation bear out the same impression, for on the same day as this Act was passed, another Act was passed, the very object of which was

In [16]:
paragraphs = [p.strip() for p in full_text.split("\n\n") if p.strip()]

for paragraph in paragraphs[:5]:
    print(paragraph)

CREASE, J. 1885. REGINA v. WING CHONG.
14th & 15th July, Certiorari—“Chinese Regulation Act, 1884,” s. 5—Constitutionality—B.N.A. Act, 1867, ss. 91, 92—“Aliens”—“Trade and Commerce”—Taxation. 
On the return to a writ of certiorari. Held, that the “Chinese Regulation Act, 1884,” is ultra vires of the Provincial Legislature, on the following grounds: 
1. It is an interference with the rights of Aliens. 
2. It is an interference with Trade and Commerce. 
3. It is an infraction of the existing treaties between the Imperial Government and China. 
4. It imposes unequal taxation.
14th & 15th July—On the return of a writ of certiorari directed to Edwin Johnson, Esquire, Police Magistrate for the City of Victoria, to return into this Court a certain conviction made by him under which one Wing Chong was fined $20 for not having in his possession a license issued under the “Chinese Regulation Act, 1884.”
The Attorney-General in support of the conviction said there were five points raised on the r
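
Splitting on the literal string `"\n\n"` assumes paragraphs are separated by exactly one empty line. If a transcription contains separator lines that hold stray spaces, a regular expression is more forgiving. A sketch on a hypothetical snippet:

```python
import re

# Hypothetical snippet: the second separator line contains stray spaces
sample_text = "First paragraph.\n\nSecond paragraph.\n   \nThird paragraph.\n"

# Split wherever a newline is followed by optional whitespace and another newline
sample_paragraphs = [p.strip() for p in re.split(r"\n\s*\n", sample_text) if p.strip()]

# For comparison: the literal split misses the second separator
literal_split = [p.strip() for p in sample_text.split("\n\n") if p.strip()]

print(len(sample_paragraphs), len(literal_split))
```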

In [17]:
from sentence_transformers import SentenceTransformer

# Import the sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Calculate embeddings by calling model.encode()
paragraph_embeddings = model.encode(paragraphs, convert_to_tensor=True)
print(paragraph_embeddings.shape)


torch.Size([78, 384])


In [18]:
# We also want to encode the target separately
target_embedding = model.encode(target, convert_to_tensor=True)

In [40]:
import torch
from torch.nn.functional import cosine_similarity

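# Calculate the cosine similarity between the target and every paragraph
sims = cosine_similarity(target_embedding.unsqueeze(0), paragraph_embeddings)

k = min(10, sims.shape[0])

# Keep the k-1 highest-scoring paragraphs; the target itself ranks first with a score of 1.0
topk = torch.topk(sims, k=k-1)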

top_paragraphs = []

for score, idx in zip(topk.values, topk.indices):
    top_paragraphs.append(paragraphs[idx])
    print(f"{score:.4f}\t{paragraphs[idx]}")

1.0000	The aliens in this case being Chinese, the first enquiry must be, what is the object of the Act? On applying to the preamble, we find that it looks like a bill of indictment as against a race not suited to live among a civilized nation, and certainly does not prepare one for legislation which would encourage or tolerate their settlement in the country. Indeed, the first lines of the preamble sound an alarm at the multitude of people coming in, who are of the repulsive habits described in the last part of the preamble, and prepares one for measures which should have a tendency to abate that alarm by deterrent influences and enactments which should have the effect of materially lessening the number of such undesirable visitors. The provisions of the Act I have given somewhat in extenso bear out that view, and the concurrent and previous local legislation bear out the same impression, for on the same day as this Act was passed, another Act was passed, the very object of which was p
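
Because the target is itself one of the paragraphs, the top hit is the target with a score of 1.0000. A sketch of ranking by similarity while skipping the query, using made-up scores and the standard library's `heapq`:

```python
import heapq

# Made-up similarity scores; index 2 plays the role of the target paragraph
sample_scores = [0.41, 0.62, 1.00, 0.35, 0.58]
target_index = 2

# Pair each score with its index, dropping the target itself
candidates = [(score, idx) for idx, score in enumerate(sample_scores) if idx != target_index]

# nlargest returns the k highest-scoring pairs in descending order
top_matches = heapq.nlargest(3, candidates)

for score, idx in top_matches:
    print(f"{score:.4f}\tparagraph {idx}")
```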

In [20]:
import spacy

# Tokenize the text into sentences
nlp = spacy.load("en_core_web_sm")
doc = nlp(full_text)
sentences = [sent.text.strip() for sent in doc.sents]

print(sentences)

['CREASE, J. 1885.', 'REGINA v. WING CHONG.', '14th & 15th July, Certiorari—“Chinese Regulation Act, 1884,” s. 5—Constitutionality—B.N.A. Act, 1867, ss.', '91, 92—“Aliens”—“Trade and Commerce”—Taxation.', 'On the return to a writ of certiorari.', 'Held, that the “Chinese Regulation Act, 1884,” is ultra vires of the Provincial Legislature, on the following grounds: \n1.', 'It is an interference with the rights of Aliens.', '2.', 'It is an interference with Trade and Commerce.', '3.', 'It is an infraction of the existing treaties between the Imperial Government and China.', '4.', 'It imposes unequal taxation.', '14th & 15th July—On the return of a writ of certiorari directed to Edwin Johnson, Esquire, Police Magistrate for the City of Victoria, to return into this Court a certain conviction made by him under which one Wing Chong was fined $20 for not having in his possession a license issued under the “Chinese Regulation Act, 1884.”', 'The Attorney-General in support of the conviction sa

In [21]:
sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
print(sentence_embeddings.shape)

torch.Size([238, 384])


In [None]:
# Calculate the cosine similarity
sims = cosine_similarity(target_embedding.unsqueeze(0), sentence_embeddings)

k = min(10, sims.shape[0])

topk = torch.topk(sims, k=k-1)

for score, idx in zip(topk.values, topk.indices):
    print(f"{score:.4f}\t{sentences[idx]}")

0.7384	The provisions of the Act I have given somewhat in extenso bear out that view, and the concurrent and previous local legislation bear out the same impression, for on the same day as this Act was passed, another Act was passed, the very object of which was plainly stated to be "to prevent the immigration" of Chinese."
0.6861	The aliens in this case being Chinese, the first enquiry must be, what is the object of the Act?
0.6467	And again, "A tax imposed by the law on these persons for the mere right to reside here, is an appropriate and effective means to discourage the immigration of the Chinese into the State.
0.6201	Its object, though not apparent on the face of the Act, was to prevent Chinese coming into the Province and drive out those who had already come.
0.6025	The power asserted in the Act in question (the California Act) is the right of the State to prescribe the terms upon which the Chinese shall be permitted to reside in it, and be so used as to cut off all intercourse

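We apply a pre-trained named-entity recognition (NER) model to mask words that mark ethnicity or nationality. Because the filter below keeps every entity type the model emits, names of people, organizations, and places are masked as well.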

In [23]:
from transformers import pipeline
import numpy

ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", grouped_entities=True)

def mask_ethnicity_hf(text):
    entities = ner(text)
    # CoNLL-03 models label entities as PER, ORG, LOC, or MISC; nationalities usually fall under MISC.
    # NORP is kept for models trained on other tag sets (e.g. OntoNotes).
    labels_to_mask = {"MISC", "ORG", "PER", "LOC", "NORP"}
    spans_to_mask = [e for e in entities if e["entity_group"] in labels_to_mask]
    masked = text
    for ent in sorted(spans_to_mask, key=lambda e: e["start"], reverse=True):
        masked = masked[:ent["start"]] + "[MASK]" + masked[ent["end"]:]
    return masked

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu

`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="AggregationStrategy.SIMPLE"` instead.



In [24]:
# Note: this redefines the regex-based mask_ethnicity from In [7] with an NER-based version
def mask_ethnicity(texts):
    masked_list = []
    for sent in texts:
        sent = mask_ethnicity_hf(sent)
        masked_list.append(sent)
        
    return masked_list

In [25]:
# Example output applying this pre-trained model
example_text = """And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, 
from every state and every city, we will be able to speed up that day when all of God's children, Black men and white men, 
Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: Free at last. 
Free at last. Thank God almighty, we are free at last."""

masked_example = mask_ethnicity_hf(example_text)

print(masked_example)

And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, 
from every state and every city, we will be able to speed up that day when all of [MASK]'s children, [MASK] men and white men, 
[MASK] and [MASK]s, [MASK] and [MASK], will be able to join hands and sing in the words of the old [MASK] spiritual: Free at last. 
Free at last. Thank [MASK] almighty, we are free at last.
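
`mask_ethnicity_hf` replaces spans from the end of the string backwards. Working right to left keeps the character offsets of the earlier entities valid, since `[MASK]` rarely has the same length as the text it replaces. A sketch with hand-written spans (the `start` and `end` keys mirror the pipeline output; the values are hypothetical):

```python
def mask_spans(text, spans):
    """Replace each (start, end) character span with [MASK], right to left."""
    masked = text
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        masked = masked[:span["start"]] + "[MASK]" + masked[span["end"]:]
    return masked

sample_sentence = "The Chinese in Victoria"
# Hand-written offsets standing in for NER output
sample_spans = [
    {"start": 4, "end": 11},   # "Chinese"
    {"start": 15, "end": 23},  # "Victoria"
]

print(mask_spans(sample_sentence, sample_spans))
```

Processing the same spans left to right would shift every later offset by the length difference of each replacement, so the second span would no longer line up with "Victoria".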


In [26]:
masked_paragraphs = mask_ethnicity(paragraphs)

masked_paragraphs[40]

'The aliens in this case being [MASK], the first enquiry must be, what is the object of the [MASK]? On applying to the preamble, we find that it looks like a bill of indictment as against a race not suited to live among a civilized nation, and certainly does not prepare one for legislation which would encourage or tolerate their settlement in the country. Indeed, the first lines of the preamble sound an alarm at the multitude of people coming in, who are of the repulsive habits described in the last part of the preamble, and prepares one for measures which should have a tendency to abate that alarm by deterrent influences and enactments which should have the effect of materially lessening the number of such undesirable visitors. The provisions of the [MASK] I have given somewhat in extenso bear out that view, and the concurrent and previous local legislation bear out the same impression, for on the same day as this [MASK] was passed, another [MASK] was passed, the very object of which 

In [28]:
# Calculate embeddings by calling model.encode()
masked_paragraph_embeddings = model.encode(masked_paragraphs, convert_to_tensor=True)
print(masked_paragraph_embeddings.shape)

torch.Size([78, 384])


In [35]:
# Calculate the cosine similarity
sims = cosine_similarity(target_embedding.unsqueeze(0), masked_paragraph_embeddings)

k = min(10, sims.shape[0])

topk = torch.topk(sims, k=k-1)

top_masked_paragraphs = []

for score, idx in zip(topk.values, topk.indices):
    top_masked_paragraphs.append(masked_paragraphs[idx])
    print(f"{score:.4f}\t{masked_paragraphs[idx]}")

0.4159	And again, "A tax imposed by the law on these persons for the mere right to reside here, is an appropriate and effective means to discourage the immigration of the [MASK] into the State."
0.3908	The aliens in this case being [MASK], the first enquiry must be, what is the object of the [MASK]? On applying to the preamble, we find that it looks like a bill of indictment as against a race not suited to live among a civilized nation, and certainly does not prepare one for legislation which would encourage or tolerate their settlement in the country. Indeed, the first lines of the preamble sound an alarm at the multitude of people coming in, who are of the repulsive habits described in the last part of the preamble, and prepares one for measures which should have a tendency to abate that alarm by deterrent influences and enactments which should have the effect of materially lessening the number of such undesirable visitors. The provisions of the [MASK] I have given somewhat in extens

### Stance Detection

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Choose a strong NLI model
model_name = "roberta-large-mnli"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model     = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create an NLI pipeline
nli = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    device=-1,                   
    return_all_scores=True        
)

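# Define the premise (passed as the first sequence of each NLI pair)
premise = "Chinese rights"

# roberta-large-mnli orders its logits as CONTRADICTION, NEUTRAL, ENTAILMENT
# (see model.config.id2label), so index 0 maps to "against" and index 2 to "favor"
stance_labels = ["against", "neutral", "favor"]

results = []
for sent in top_paragraphs:
    inputs = tokenizer(premise, sent, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs).logits.softmax(dim=-1).tolist()[0]
    label_idx = out.index(max(out))
    label = stance_labels[label_idx].upper()
    results.append((sent, label, dict(zip(stance_labels, out))))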

# Print stance results
for sent, label, probs in results:
    print(f"{label.lower():>12}  {probs[label.lower()]:.2f}  ->  {sent}")

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu

`return_all_scores` is now deprecated,  if want a similar functionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.



     neutral  0.55  ->  The aliens in this case being Chinese, the first enquiry must be, what is the object of the Act? On applying to the preamble, we find that it looks like a bill of indictment as against a race not suited to live among a civilized nation, and certainly does not prepare one for legislation which would encourage or tolerate their settlement in the country. Indeed, the first lines of the preamble sound an alarm at the multitude of people coming in, who are of the repulsive habits described in the last part of the preamble, and prepares one for measures which should have a tendency to abate that alarm by deterrent influences and enactments which should have the effect of materially lessening the number of such undesirable visitors. The provisions of the Act I have given somewhat in extenso bear out that view, and the concurrent and previous local legislation bear out the same impression, for on the same day as this Act was passed, another Act was passed, the very obje
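
When reading NLI outputs as stance, it is safer to look labels up by name than to rely on list position. A sketch, assuming the label order CONTRADICTION, NEUTRAL, ENTAILMENT (in practice, read it from `model.config.id2label`) and a hypothetical mapping from NLI labels to stance labels:

```python
# Assumed index -> label mapping; in practice read it from model.config.id2label
id2label = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}

# Hypothetical mapping from NLI labels to stance labels
nli_to_stance = {"ENTAILMENT": "favor", "NEUTRAL": "neutral", "CONTRADICTION": "against"}

def stance_from_probs(probs):
    """Return (stance, probability) for the highest-scoring NLI class."""
    best_idx = max(range(len(probs)), key=lambda i: probs[i])
    return nli_to_stance[id2label[best_idx]], probs[best_idx]

# Made-up probabilities in index order 0, 1, 2
print(stance_from_probs([0.10, 0.55, 0.35]))
print(stance_from_probs([0.70, 0.20, 0.10]))
```

Looking labels up by name means the stance assignment stays correct if a different NLI checkpoint orders its classes differently.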