## Text Embeddings for Regina v. Wing Chong (1885)

In [1]:
# Data Wrangling
import os
import numpy as np
import pandas
from nltk.tokenize import word_tokenize

In [2]:
with open('data/Regina_V_Wing_Chong.txt', encoding='utf-8') as f:
    full_text = f.read()
print(full_text)

CREASE, J. 1885. REGINA v. WING CHONG. 

14th & 15th July, Certiorari—“Chinese Regulation Act, 1884,” s. 5—Constitutionality—B.N.A. Act, 1867, ss. 91, 92—“Aliens”—“Trade and Commerce”—Taxation. 
On the return to a writ of certiorari. Held, that the “Chinese Regulation Act, 1884,” is ultra vires of the Provincial Legislature, on the following grounds: 
1. It is an interference with the rights of Aliens. 
2. It is an interference with Trade and Commerce. 
3. It is an infraction of the existing treaties between the Imperial Government and China. 
4. It imposes unequal taxation. 

14th & 15th July—On the return of a writ of certiorari directed to Edwin Johnson, Esquire, Police Magistrate for the City of Victoria, to return into this Court a certain conviction made by him under which one Wing Chong was fined $20 for not having in his possession a license issued under the “Chinese Regulation Act, 1884.”

The Attorney-General in support of the conviction said there were five points raised on 

### BERT Word Embeddings

In [3]:
import re

def clean_text(text):
    
    text = text.lower()
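    # Remove punctuation. Caveat: characters are deleted rather than replaced
    # with a space, so words joined by an em-dash are fused into one token.
    text = re.sub(r'[^\w\s]', '', text)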
    
    return text.strip()

text_cleaned = clean_text(full_text)
print(text_cleaned[:500])  # Print the first 500 characters of the cleaned text

crease j 1885 regina v wing chong 

14th  15th july certiorarichinese regulation act 1884 s 5constitutionalitybna act 1867 ss 91 92alienstrade and commercetaxation 
on the return to a writ of certiorari held that the chinese regulation act 1884 is ultra vires of the provincial legislature on the following grounds 
1 it is an interference with the rights of aliens 
2 it is an interference with trade and commerce 
3 it is an infraction of the existing treaties between the imperial government and c


In [4]:
# Load pre-trained BERT tokenizer and model
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('nlpaueb/legal-bert-base-uncased')
bert_model = BertModel.from_pretrained('nlpaueb/legal-bert-base-uncased')

In [5]:
# Tokenize the cleaned text into words
# and count how often each token occurs
tokens = word_tokenize(text_cleaned)

token_frequencies = {}

for token in tokens:
    token_frequencies[token] = token_frequencies.get(token, 0) + 1

In [6]:
sorted_tokens = sorted(token_frequencies.items(), key=lambda x: x[1], reverse=True)

# Print the 20 most frequent tokens
print("Most frequent tokens:")
for token, freq in sorted_tokens[:20]:
    print(f"{token}: {freq}")

Most frequent tokens:
the: 629
of: 360
and: 254
to: 234
in: 181
a: 133
that: 109
as: 100
is: 87
it: 83
be: 82
act: 80
or: 78
chinese: 74
for: 74
by: 70
not: 62
with: 60
on: 58
was: 58
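
The list above is dominated by function words, which say little about the content of the ruling. A minimal sketch of one way to filter them out before counting follows. The stop-word set and the toy token list are assumptions for illustration, not part of the notebook; a fuller list (for example from NLTK) would normally be used.

```python
from collections import Counter

# Hypothetical, hand-picked stop-word set for illustration only.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "that", "as", "is", "it",
              "be", "or", "for", "by", "not", "with", "on", "was"}

def top_content_tokens(tokens, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent tokens that are not stop words."""
    counts = Counter(tok for tok in tokens if tok not in stop_words)
    return counts.most_common(n)

# Toy token list standing in for the notebook's `tokens`
sample_tokens = "the act of the chinese and the act of trade".split()
print(top_content_tokens(sample_tokens, n=3))
```

In the notebook, the same function could be called as `top_content_tokens(tokens, n=20)`.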


In [7]:
import re
# Build ethnicity vocabulary
ethnicities = [
    "chinese", "japanese", "black", "white", "yellow", "chinamans", "hong kong",
    "canada", "american", "americans", "european", "china", "chinaman", "britain",
    "canadian", "latino", "mongolian", "asian", "indian", "india", "english",
    "british", "america", "columbia", "ontario", "australia", "australian",
    "germans", "german", "chinamen", "italian", "italy", "french", "france"
]

pattern = re.compile(r"\b(" + "|".join(map(re.escape, ethnicities)) + r")\b", flags=re.IGNORECASE)

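# Replace every vocabulary match in each string with [MASK]
# (multi-word terms such as "hong kong" only match when a string contains the full phrase)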
def mask_ethnicity(tokens):
    masked = []
    for tok in tokens:
        masked.append(pattern.sub("[MASK]", tok))
        
    return masked

In [8]:
example_word = ["chinaman", "chinese women"]

mask_ethnicity(example_word)

['[MASK]', '[MASK] women']
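
Because `word_tokenize` produces single-word tokens, a multi-word entry such as "hong kong" cannot match when tokens are masked one at a time. The sketch below contrasts token-level masking with masking the whole string first. It uses a shortened vocabulary and two hypothetical helper names (`mask_tokens`, `mask_string`) that do not appear in the notebook.

```python
import re

# Shortened vocabulary for illustration; the notebook's list is longer.
terms = ["chinese", "hong kong", "chinaman"]
pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
                     flags=re.IGNORECASE)

def mask_tokens(tokens):
    """Mask each token separately, as the token-level function does."""
    return [pattern.sub("[MASK]", tok) for tok in tokens]

def mask_string(text):
    """Mask the whole string, so multi-word terms can match."""
    return pattern.sub("[MASK]", text)

text = "a chinese merchant from hong kong"
print(mask_tokens(text.split()))   # "hong" and "kong" survive as separate tokens
print(mask_string(text).split())   # the full phrase is replaced by one [MASK]
```

Masking the cleaned text before tokenizing is one possible design choice; it changes the token counts, so downstream numbers would differ from those shown in this notebook.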

In [9]:
tokens = mask_ethnicity(tokens)

# Get unique words to avoid redundant computation
unique_tokens = list(set(tokens))


# Include the word "chinese" as our target
unique_tokens.append("chinese")

# Print the number of unique tokens
print(f'There are {len(unique_tokens)} unique tokens in this corpus.')

There are 1578 unique tokens in this corpus.


In [10]:
# Prepare a dictionary to store word embeddings
bert_word_embeddings = {}

# For each word, get an embedding by feeding the word on its own to the model
# (the tokenizer may split a word into several WordPiece tokens)
for word in unique_tokens:
    word_inputs = tokenizer(word, return_tensors='pt', truncation=True, max_length=10)
    with torch.no_grad():
        word_outputs = bert_model(**word_inputs)
        # Use the [CLS] token embedding as the word embedding
        word_embedding = word_outputs.last_hidden_state[:, 0, :].squeeze().numpy()
        bert_word_embeddings[word] = word_embedding
    

In [11]:
# Print embedding for the word of interest 'chinese'

print(f"BERT embedding for 'chinese':\n{bert_word_embeddings.get('chinese')}")

BERT embedding for 'chinese':
[-6.20929539e-01 -1.41670823e-01  6.38972700e-01  5.66699132e-02
  2.49502540e-01  3.55757505e-01 -9.64455083e-02  3.54799002e-01
 -2.72700071e-01 -6.37607515e-01  1.72131464e-01  5.87601185e-01
  5.80037721e-02 -1.98575929e-01 -6.22221410e-01  6.23443425e-01
 -2.84136593e-01 -2.01131850e-01 -1.16010755e-01  3.39487463e-01
 -1.49680659e-01  4.16029960e-01  4.64205593e-01 -4.62918848e-01
  3.87409419e-01  6.31607294e-01  6.86673880e-01  2.19446510e-01
 -3.76841813e-01  1.29365414e-01 -2.28451476e-01 -2.85087526e-01
  3.50298733e-01  4.33137774e-01 -4.69815671e-01  2.95415729e-01
  5.21581918e-02 -2.85912603e-02  4.41664994e-01  2.89366961e-01
  3.54161382e-01 -7.48492539e-01  7.74241015e-02 -1.15738958e-01
 -1.74300909e-01  1.22695386e-01 -2.15352607e+00 -3.29316437e-01
  1.01312399e-02 -3.54919508e-02 -1.23483628e-01  6.59714639e-01
 -8.31658393e-03  6.29764616e-01  6.69252157e-01 -4.71154869e-01
  7.91465193e-02 -6.24100566e-01 -4.18076426e-01 -6.55551255

In [12]:
# Compute the cosine similarity between 'chinese' and every other word in the vocabulary
from scipy.spatial.distance import cosine

similarity_scores = {}

for other_word in bert_word_embeddings.keys():
    if other_word != "chinese":
        similarity = 1 - cosine(bert_word_embeddings["chinese"], bert_word_embeddings[other_word])
        similarity_scores[other_word] = similarity

# Sort by cosine similarity
sorted_similarity = sorted(similarity_scores.items(), key=lambda x: x[1], reverse=True)

# Print the top 10 most similar words
print("Top 10 most similar words to 'chinese':")
for word, score in sorted_similarity[:10]:
    print(f"{word}: {score:.4f}")

Top 10 most similar words to 'chinese':
chong: 0.8652
alien: 0.8581
fourteen: 0.8564
hong: 0.8516
aliens: 0.8370
fortiori: 0.8365
stranger: 0.8306
mode: 0.8299
multitude: 0.8282
425: 0.8276
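
The scores above are `1 - cosine(u, v)`, where SciPy's `cosine` returns the cosine distance. A minimal pure-Python sketch of the underlying similarity is given below to make the arithmetic explicit. The helper name `cosine_sim` is hypothetical, and the vectors are toy values rather than real embeddings.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors: parallel, orthogonal, and opposite directions
print(cosine_sim([1.0, 2.0], [2.0, 4.0]))
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))
```

The measure depends only on direction, not on vector length, which is why it is a common choice for comparing embeddings.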


In [13]:
similarity_scores = {}

for other_word in bert_word_embeddings.keys():
    if other_word != "commerce":
        similarity = 1 - cosine(bert_word_embeddings["commerce"], bert_word_embeddings[other_word])
        similarity_scores[other_word] = similarity

# Sort by cosine similarity
sorted_similarity = sorted(similarity_scores.items(), key=lambda x: x[1], reverse=True)

# Print the top 10 most similar words
print("Top 10 most similar words to 'commerce':")
for word, score in sorted_similarity[:10]:
    print(f"{word}: {score:.4f}")

Top 10 most similar words to 'commerce':
agency: 0.8791
arbitrary: 0.8773
productive: 0.8733
operation: 0.8618
informant: 0.8595
chong: 0.8587
injury: 0.8578
relations: 0.8570
inhabitants: 0.8541
levied: 0.8539


In [14]:
emd = np.array(bert_word_embeddings.get('chinese')) - np.array(bert_word_embeddings.get('alien'))

similarity_scores = {}

for other_word in bert_word_embeddings.keys():
    similarity = 1 - cosine(emd, bert_word_embeddings[other_word])
    similarity_scores[other_word] = similarity

# Sort by cosine similarity
sorted_similarity = sorted(similarity_scores.items(), key=lambda x: x[1], reverse=True)

# Print the top 10 most similar words
print("Top 10 most similar words to 'chinese - alien':")
for word, score in sorted_similarity[:10]:
    print(f"{word}: {score:.4f}")

Top 10 most similar words to 'chinese - alien':
chinese: 0.2665
party: 0.1287
enquirywhich: 0.1114
follow: 0.1109
were: 0.1084
exceeding: 0.1078
besides: 0.1065
aforesaid: 0.1061
wellknown: 0.1050
origin: 0.1020


In [15]:
# Generate a 2D PCA for visualization
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

word_embeddings = np.array(list(bert_word_embeddings.values()))
pca_results = pca.fit_transform(word_embeddings)

In [16]:
import plotly.express as px
df_pca = pandas.DataFrame(pca_results, columns = ['x', 'y'])
df_pca['word'] = list(bert_word_embeddings.keys())
# Highlight the word 'chinese' in the plot
df_pca['highlight'] = df_pca['word'].apply(lambda x: 'chinese' if x == 'chinese' else '')

fig = px.scatter(
    df_pca,
    x='x',
    y='y',
    title='2D PCA Visualization of legal-BERT Word Embeddings',
    color='highlight',                        
    hover_data=['word'], 
    text= 'highlight'
)

fig.show()



In [17]:
# Generate a t-SNE plot for visualization
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42)

tsne_results = tsne.fit_transform(word_embeddings)

In [18]:
# Create a DataFrame for visualization
df_tsne = pandas.DataFrame(tsne_results, columns=['x', 'y'])
df_tsne['word'] = list(bert_word_embeddings.keys())
# Highlight the word 'chinese' in the plot
df_tsne['highlight'] = df_tsne['word'].apply(lambda x: 'chinese' if x == 'chinese' else '')

fig = px.scatter(
    df_tsne,
    x='x',
    y='y',
    title='t-SNE Visualization of legal-BERT Word Embeddings',
    color='highlight',                        
    hover_data=['word'], 
    text= 'highlight'
)

fig.show()






### Sentence Embeddings

In [19]:
from pathlib import Path

# Read the txt file as lines
lines = Path("data/Regina_V_Wing_Chong.txt").read_text(encoding="utf-8").splitlines()

# Extract the line at index 91 (zero-based) as the target
target = lines[91]
print("Line 91:", target)


Line 91: The aliens in this case being Chinese, the first enquiry must be, what is the object of the Act? On applying to the preamble, we find that it looks like a bill of indictment as against a race not suited to live among a civilized nation, and certainly does not prepare one for legislation which would encourage or tolerate their settlement in the country. Indeed, the first lines of the preamble sound an alarm at the multitude of people coming in, who are of the repulsive habits described in the last part of the preamble, and prepares one for measures which should have a tendency to abate that alarm by deterrent influences and enactments which should have the effect of materially lessening the number of such undesirable visitors. The provisions of the Act I have given somewhat in extenso bear out that view, and the concurrent and previous local legislation bear out the same impression, for on the same day as this Act was passed, another Act was passed, the very object of which was

In [20]:
paragraphs = [p.strip() for p in full_text.split("\n\n") if p.strip()]

for paragraph in paragraphs[:5]:
    print(paragraph)

CREASE, J. 1885. REGINA v. WING CHONG.
14th & 15th July, Certiorari—“Chinese Regulation Act, 1884,” s. 5—Constitutionality—B.N.A. Act, 1867, ss. 91, 92—“Aliens”—“Trade and Commerce”—Taxation. 
On the return to a writ of certiorari. Held, that the “Chinese Regulation Act, 1884,” is ultra vires of the Provincial Legislature, on the following grounds: 
1. It is an interference with the rights of Aliens. 
2. It is an interference with Trade and Commerce. 
3. It is an infraction of the existing treaties between the Imperial Government and China. 
4. It imposes unequal taxation.
14th & 15th July—On the return of a writ of certiorari directed to Edwin Johnson, Esquire, Police Magistrate for the City of Victoria, to return into this Court a certain conviction made by him under which one Wing Chong was fined $20 for not having in his possession a license issued under the “Chinese Regulation Act, 1884.”
The Attorney-General in support of the conviction said there were five points raised on the r

In [21]:
from sentence_transformers import SentenceTransformer

# Load the sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Calculate embeddings by calling model.encode()
paragraph_embeddings = model.encode(paragraphs, convert_to_tensor=True)
print(paragraph_embeddings.shape)


torch.Size([78, 384])


In [22]:
# We also want to encode the target separately
target_embedding = model.encode(target, convert_to_tensor=True)

In [23]:
import torch
from torch.nn.functional import cosine_similarity

# Calculate the cosine similarity
sims = cosine_similarity(target_embedding.unsqueeze(0), paragraph_embeddings)

# Cap k at 10 (or the number of paragraphs), then take the top k-1 matches
k = min(10, sims.shape[0])

topk = torch.topk(sims, k=k-1)

top_paragraphs = []

for score, idx in zip(topk.values, topk.indices):
    top_paragraphs.append(paragraphs[idx])
    print(f"{score:.4f}\t{paragraphs[idx]}")

1.0000	The aliens in this case being Chinese, the first enquiry must be, what is the object of the Act? On applying to the preamble, we find that it looks like a bill of indictment as against a race not suited to live among a civilized nation, and certainly does not prepare one for legislation which would encourage or tolerate their settlement in the country. Indeed, the first lines of the preamble sound an alarm at the multitude of people coming in, who are of the repulsive habits described in the last part of the preamble, and prepares one for measures which should have a tendency to abate that alarm by deterrent influences and enactments which should have the effect of materially lessening the number of such undesirable visitors. The provisions of the Act I have given somewhat in extenso bear out that view, and the concurrent and previous local legislation bear out the same impression, for on the same day as this Act was passed, another Act was passed, the very object of which was p

In [24]:
import spacy

# Tokenize the text into sentences
nlp = spacy.load("en_core_web_sm")
doc = nlp(full_text)
sentences = [sent.text.strip() for sent in doc.sents]

print(sentences)

['CREASE, J. 1885.', 'REGINA v. WING CHONG.', '14th & 15th July, Certiorari—“Chinese Regulation Act, 1884,” s. 5—Constitutionality—B.N.A. Act, 1867, ss.', '91, 92—“Aliens”—“Trade and Commerce”—Taxation.', 'On the return to a writ of certiorari.', 'Held, that the “Chinese Regulation Act, 1884,” is ultra vires of the Provincial Legislature, on the following grounds: \n1.', 'It is an interference with the rights of Aliens.', '2.', 'It is an interference with Trade and Commerce.', '3.', 'It is an infraction of the existing treaties between the Imperial Government and China.', '4.', 'It imposes unequal taxation.', '14th & 15th July—On the return of a writ of certiorari directed to Edwin Johnson, Esquire, Police Magistrate for the City of Victoria, to return into this Court a certain conviction made by him under which one Wing Chong was fined $20 for not having in his possession a license issued under the “Chinese Regulation Act, 1884.”', 'The Attorney-General in support of the conviction sa

In [25]:
sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
print(sentence_embeddings.shape)

torch.Size([238, 384])


In [26]:
# Calculate the cosine similarity
sims = cosine_similarity(target_embedding.unsqueeze(0), sentence_embeddings)

k = min(10, sims.shape[0])

topk = torch.topk(sims, k=k-1)

for score, idx in zip(topk.values, topk.indices):
    print(f"{score:.4f}\t{sentences[idx]}")

0.7384	The provisions of the Act I have given somewhat in extenso bear out that view, and the concurrent and previous local legislation bear out the same impression, for on the same day as this Act was passed, another Act was passed, the very object of which was plainly stated to be "to prevent the immigration" of Chinese."
0.6861	The aliens in this case being Chinese, the first enquiry must be, what is the object of the Act?
0.6467	And again, "A tax imposed by the law on these persons for the mere right to reside here, is an appropriate and effective means to discourage the immigration of the Chinese into the State.
0.6201	Its object, though not apparent on the face of the Act, was to prevent Chinese coming into the Province and drive out those who had already come.
0.6025	The power asserted in the Act in question (the California Act) is the right of the State to prescribe the terms upon which the Chinese shall be permitted to reside in it, and be so used as to cut off all intercourse

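We apply a pre-trained named-entity recognition (NER) model to mask words that denote ethnicity or nationality. The filter in the next cell keeps every entity group the model returns, so names of people, places, and organizations are masked as well.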

In [27]:
from transformers import pipeline
import numpy

ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", grouped_entities=True)

def mask_ethnicity_hf(text):
    entities = ner(text)
    # CoNLL-03 models emit PER, ORG, LOC and MISC (nationalities fall under MISC);
    # NORP is kept for models trained on OntoNotes-style labels.
    mask_groups = {"MISC", "ORG", "PER", "LOC", "NORP"}
    spans_to_mask = [e for e in entities if e["entity_group"] in mask_groups]
    masked = text
    for ent in sorted(spans_to_mask, key=lambda e: e["start"], reverse=True):
        masked = masked[:ent["start"]] + "[MASK]" + masked[ent["end"]:]
    return masked

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu

`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="AggregationStrategy.SIMPLE"` instead.



In [28]:
# Note: this redefines the regex-based mask_ethnicity from In [7]
def mask_ethnicity(texts):
    masked_list = []
    for sent in texts:
        sent = mask_ethnicity_hf(sent)
        masked_list.append(sent)
        
    return masked_list

In [29]:
# Example output applying this pre-trained model
example_text = """And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, 
from every state and every city, we will be able to speed up that day when all of God's children, Black men and white men, 
Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: Free at last. 
Free at last. Thank God almighty, we are free at last."""

masked_example = mask_ethnicity_hf(example_text)

print(masked_example)

And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, 
from every state and every city, we will be able to speed up that day when all of [MASK]'s children, [MASK] men and white men, 
[MASK] and [MASK]s, [MASK] and [MASK], will be able to join hands and sing in the words of the old [MASK] spiritual: Free at last. 
Free at last. Thank [MASK] almighty, we are free at last.
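
`mask_ethnicity_hf` replaces entity spans from the end of the string toward the start. The sketch below, which uses a hypothetical `mask_spans` helper and offsets computed by hand instead of a real NER model, illustrates why: replacing the rightmost span first keeps the character offsets of the remaining spans valid.

```python
def mask_spans(text, spans, mask="[MASK]"):
    """Replace each (start, end) character span with `mask`.

    Spans are processed right to left, so earlier offsets are not
    shifted by replacements made later in the string.
    """
    masked = text
    for start, end in sorted(spans, reverse=True):
        masked = masked[:start] + mask + masked[end:]
    return masked

text = "Jews and Gentiles"
# Offsets located with str.index; a real pipeline would take them from the
# "start" and "end" fields of the NER output.
spans = [
    (text.index("Jews"), text.index("Jews") + len("Jews")),
    (text.index("Gentiles"), text.index("Gentiles") + len("Gentiles")),
]
print(mask_spans(text, spans))
```

Processing the spans left to right would require adjusting every later offset by the length difference between the original text and the mask.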


In [30]:
masked_paragraphs = mask_ethnicity(paragraphs)

masked_paragraphs[40]

'The aliens in this case being [MASK], the first enquiry must be, what is the object of the [MASK]? On applying to the preamble, we find that it looks like a bill of indictment as against a race not suited to live among a civilized nation, and certainly does not prepare one for legislation which would encourage or tolerate their settlement in the country. Indeed, the first lines of the preamble sound an alarm at the multitude of people coming in, who are of the repulsive habits described in the last part of the preamble, and prepares one for measures which should have a tendency to abate that alarm by deterrent influences and enactments which should have the effect of materially lessening the number of such undesirable visitors. The provisions of the [MASK] I have given somewhat in extenso bear out that view, and the concurrent and previous local legislation bear out the same impression, for on the same day as this [MASK] was passed, another [MASK] was passed, the very object of which 

In [31]:
# Calculate embeddings by calling model.encode()
masked_paragraph_embeddings = model.encode(masked_paragraphs, convert_to_tensor=True)
print(masked_paragraph_embeddings.shape)

torch.Size([78, 384])


In [32]:
# Calculate the cosine similarity
sims = cosine_similarity(target_embedding.unsqueeze(0), masked_paragraph_embeddings)

k = min(10, sims.shape[0])

topk = torch.topk(sims, k=k-1)

top_masked_paragraphs = []

for score, idx in zip(topk.values, topk.indices):
    top_masked_paragraphs.append(masked_paragraphs[idx])
    print(f"{score:.4f}\t{masked_paragraphs[idx]}")

0.4159	And again, "A tax imposed by the law on these persons for the mere right to reside here, is an appropriate and effective means to discourage the immigration of the [MASK] into the State."
0.3908	The aliens in this case being [MASK], the first enquiry must be, what is the object of the [MASK]? On applying to the preamble, we find that it looks like a bill of indictment as against a race not suited to live among a civilized nation, and certainly does not prepare one for legislation which would encourage or tolerate their settlement in the country. Indeed, the first lines of the preamble sound an alarm at the multitude of people coming in, who are of the repulsive habits described in the last part of the preamble, and prepares one for measures which should have a tendency to abate that alarm by deterrent influences and enactments which should have the effect of materially lessening the number of such undesirable visitors. The provisions of the [MASK] I have given somewhat in extens

### Natural Language Inference

In [33]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

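# Load a legal-domain RoBERTa checkpoint. Caveat: this is a base language model,
# not one fine-tuned for NLI; its classification head is newly initialized
# (see the warning in the output), so the scores are not reliable predictions.
model_name = "lexlms/legal-roberta-base"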

tokenizer = AutoTokenizer.from_pretrained(model_name)
model     = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create an NLI pipeline
nli = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    device=-1,                   
    return_all_scores=True        
)

# Define the premise
premise = "Chinese immigrants should enjoy equal rights and legal protections."

results = []
for sent in top_masked_paragraphs:
    inputs = tokenizer.encode_plus(premise, sent, return_tensors="pt", truncation=True)
    out = model(**inputs).logits.softmax(dim=-1).tolist()[0]
    label_idx = out.index(max(out))
    label = ["FAVOR", "NEUTRAL", "AGAINST"][label_idx]
    results.append((sent, label, dict(zip(["favor","neutral","against"], out))))

# Print stance results
for sent, label, probs in results:
    print(f"{label.lower():>12}  {probs[label.lower()]:.2f}  ->  {sent}")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at lexlms/legal-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu

`return_all_scores` is now deprecated,  if want a similar functionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.



     neutral  0.51  ->  And again, "A tax imposed by the law on these persons for the mere right to reside here, is an appropriate and effective means to discourage the immigration of the [MASK] into the State."
     neutral  0.53  ->  The aliens in this case being [MASK], the first enquiry must be, what is the object of the [MASK]? On applying to the preamble, we find that it looks like a bill of indictment as against a race not suited to live among a civilized nation, and certainly does not prepare one for legislation which would encourage or tolerate their settlement in the country. Indeed, the first lines of the preamble sound an alarm at the multitude of people coming in, who are of the repulsive habits described in the last part of the preamble, and prepares one for measures which should have a tendency to abate that alarm by deterrent influences and enactments which should have the effect of materially lessening the number of such undesirable visitors. The provisions of the [MAS
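
The stance loop above converts logits to probabilities with softmax and picks the label with the highest score. The sketch below reproduces that step in plain Python with made-up logits and adds a length check. The label order is copied from the cell above; whether it matches a given model's output order is an assumption that should be checked against the checkpoint's configuration (for example its `id2label` mapping).

```python
import math

LABELS = ["favor", "neutral", "against"]  # order assumed, not verified

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, labels=LABELS):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(labels):
        raise ValueError(
            f"model returned {len(logits)} scores but {len(labels)} labels were given"
        )
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Made-up logits for illustration
print(predict_label([2.0, 0.0, 0.0]))
```

The length check matters because a classification head that outputs fewer scores than there are labels would otherwise be zipped silently against the first few labels.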

In [34]:
from transformers import pipeline
import pandas as pd

# Load the MNLI‑based zero‑shot classifier
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=-1
)

# Use the NLI labels as your “candidate labels”
candidate_labels = ["entailment", "neutral", "contradiction"]

records = []
for para in paragraphs:
    out = classifier(
        sequences=para,
        candidate_labels=candidate_labels,
        hypothesis_template="Given the context that the texts for classification are from a legal ruling in 1885, this paragraph is {} of the premise 'Chinese immigrants should enjoy equal rights and legal protections'."
    )
    # out['labels'] is sorted by score descending
    scores = dict(zip(out["labels"], out["scores"]))
    pred = out["labels"][0]

    records.append({
        "paragraph": para,
        "entailment":   scores.get("entailment", 0.0),
        "neutral":      scores.get("neutral",    0.0),
        "contradiction":scores.get("contradiction", 0.0),
        "predicted":    pred
    })

#  Build a DataFrame
df_nli = pd.DataFrame(records)

# Inspect the first few rows
print(df_nli.head())

Device set to use cpu


                                           paragraph  entailment   neutral  \
0             CREASE, J. 1885. REGINA v. WING CHONG.    0.485249  0.398294   
1  14th & 15th July, Certiorari—“Chinese Regulati...    0.354459  0.321517   
2  14th & 15th July—On the return of a writ of ce...    0.443300  0.300308   
3  The Attorney-General in support of the convict...    0.291023  0.360960   
4  *Richards*, Q. C., for Wing Chong—The object o...    0.232745  0.329130   

   contradiction      predicted  
0       0.116457     entailment  
1       0.324025     entailment  
2       0.256391     entailment  
3       0.348017        neutral  
4       0.438125  contradiction  


In [35]:
df_nli.shape

(78, 5)

In [36]:
counts = df_nli['predicted'].value_counts()

proportions = df_nli['predicted'].value_counts(normalize=True)

result = pd.DataFrame({
    'count': counts,
    'proportion': proportions
})

print(result)

               count  proportion
predicted                       
neutral           36    0.461538
contradiction     22    0.282051
entailment        20    0.256410


### Topic Modelling Through BERTopic

In [37]:
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer


ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1,2), max_df=0.85, min_df=2)

topic_model = BERTopic(
    vectorizer_model=vectorizer, 
    ctfidf_model= ctfidf_model
)

topics, probs = topic_model.fit_transform(paragraphs)

df_topic = topic_model.get_topic_info()
print(df_topic)

   Topic  Count                                   Name  \
0     -1     18     -1_china_subjects_default_distress   
1      0     42  0_dominion_taxation_province_commerce   
2      1     18      1_license_shall_possession_person   

                                      Representation  \
0  [china, subjects, default, distress, void, tra...   
1  [dominion, taxation, province, commerce, alien...   
2  [license, shall, possession, person, exceeding...   

                                 Representative_Docs  
0  [If the legislation here be to drive people fr...  
1  [The Attorney-General in reply—The ten dollar ...  
2  [Section 9. “In case any employer of Chinese f...  


In [38]:
for para in df_topic["Representative_Docs"][1]:
    print(para)

The Attorney-General in reply—The ten dollar impost was a tax whether designated by the name of a licence or otherwise. The distinction between direct and indirect taxation was well drawn in the case of Reed v. Mousseau before the Supreme Court of Canada, in which Mr. Justice Strong mentioned the Privy Council had decided the Provincial legislatures had exclusive power to impose direct taxation, and that it did not follow they might not have power even to impose indirect taxation. The tax did not sanction the carrying on of any business out of the product of which the consumer would, indirectly, contribute towards payment of the tax. Then as to inequality, the one dictum of Cooley was but an expression of opinion that political wisdom required uniform taxation; for the same author cited numerous authorities, including that of the Supreme Court of the United States establishing the right to discriminate unless fettered by the express language of the Constitution. Todd on Parliamentary G

In [39]:
rep_list = []

# Use a name that does not shadow the built-in `list`
for rep in df_topic["Representation"]:
    rep_list.extend(rep)
    
print(rep_list)

['china', 'subjects', 'default', 'distress', 'void', 'trade', 'aliens trade', 'does', 'revenue', 'character', 'dominion', 'taxation', 'province', 'commerce', 'aliens', 'powers', 'trade', 'legislation', 'foreigners', 'government', 'license', 'shall', 'possession', 'person', 'exceeding', 'chinese shall', 'having', 'premises', 'employer', 'goods chattels']


In [40]:
topic_labels = topic_model.generate_topic_labels(
    nr_words=3,       
    separator=" ",     
    topic_prefix=False 
)

topic_list = []

for label in topic_labels:
    phrases = label.split(" ")
    topic_list.extend(phrases)
    
print(topic_list)

['china', 'subjects', 'default', 'dominion', 'taxation', 'province', 'license', 'shall', 'possession']


In [41]:
from umap import UMAP
import pandas as pd
import plotly.express as px

# Get your topic-term embeddings
embeddings = topic_model.c_tf_idf_.toarray()

# 1) Build a UMAP reducer with random init
umap_model = UMAP(
    n_neighbors=15,
    n_components=2,
    metric="cosine",
    init="random",
    random_state=42
)
reduced_embeddings = umap_model.fit_transform(embeddings)

# 2) Build a DataFrame and scatter
df = pd.DataFrame(reduced_embeddings, columns=["x", "y"])
df["topic"] = topic_model.get_topic_info()["Topic"].values

fig = px.scatter(
    df,
    x="x",
    y="y",
    text="topic",
    title="Topic visualization"
)
fig.show()


In [42]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("nlpaueb/legal-bert-base-uncased")



No sentence-transformers model found with name nlpaueb/legal-bert-base-uncased. Creating a new one with mean pooling.


In [43]:
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

topic_model = BERTopic(embedding_model=embedding_model,
                       vectorizer_model= vectorizer,
                       ctfidf_model= ctfidf_model)

topics, probs = topic_model.fit_transform(masked_paragraphs)

df_topic = topic_model.get_topic_info()
print(df_topic)

   Topic  Count                                    Name  \
0     -1     32  -1_collector_mask shall_possession_sec   
1      0     36         0_mask mask_taxation_case_trade   
2      1     10                     1_sections_14_15_25   

                                      Representation  \
0  [collector, mask shall, possession, sec, goods...   
1  [mask mask, taxation, case, trade, aliens, leg...   
2  [sections, 14, 15, 25, certificate mask, appli...   

                                 Representative_Docs  
0  [Section 9. “In case any employer of [MASK] fa...  
1  [Aliens may be taxed, may be subjected to the ...  
2  [Sections 17 and 18 prevent the exhumation of ...  


In [44]:
for para in df_topic['Representative_Docs']:
    print(para)

['Section 9. “In case any employer of [MASK] fails to deliver to the “collector the list mentioned in the preceding section, when required “so to do, or knowingly states anything falsely therein, such “employer shall, on complaint of the collector and upon conviction “before a Justice of the Peace having jurisdiction within the “district wherein such employer carries on his business, forfeit “and pay a fine not exceeding one hundred dollars for every [MASK] “in his employ, to be recovered by distress of the goods and chattels “of such employer failing to pay the same, or in lieu thereof shall be “liable to imprisonment for a period not less than one month and not “exceeding two calendar months.”', 'During the argument on the case before me, the Attorney-General claims that this was direct taxation, and a direct tax within the Province, to raise revenue for [MASK] purposes, and, therefore *intra vires*; but the question is not one of name but of fact. Does it interfere with trade or com

In [45]:
topic_labels = topic_model.generate_topic_labels(
    nr_words=5,       
    separator=" ",     
    topic_prefix=False 
)

topic_labels

['collector mask shall possession sec goods',
 'mask mask taxation case trade aliens',
 'sections 14 15 25 certificate mask']

In [46]:
from umap import UMAP
import pandas as pd
import plotly.express as px

# Get your topic-term embeddings
embeddings = topic_model.c_tf_idf_.toarray()

# 1) Build a UMAP reducer with random init
umap_model = UMAP(
    n_neighbors=15,
    n_components=2,
    metric="cosine",
    init="random",
    random_state=42
)
reduced_embeddings = umap_model.fit_transform(embeddings)

# 2) Build a DataFrame and scatter
df = pd.DataFrame(reduced_embeddings, columns=["x", "y"])
df["topic"] = topic_model.get_topic_info()["Topic"].values

fig = px.scatter(
    df,
    x="x",
    y="y",
    text="topic",
    title="Topic visualization"
)
fig.show()